Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] MAR3: Multi-Agent Recognition, Reasoning, and Reflection for Reference Audio-Visual Segmentation
Yuan Zhao, Zhenqi Jia, Yongqiang Zhang
Main category: cs.MM
TL;DR: MAR3 is a training-free multi-agent framework for Reference Audio-Visual Segmentation that uses LLM agents to recognize expression difficulty and dominant modality, adaptively reason about objects, and iteratively refine segmentation through reflective learning.
Details
Motivation: Previous Ref-AVS methods fail to explicitly recognize expression difficulty and dominant modality in multimodal cues, over-rely on instruction-tuning dataset quality for object reasoning, and lack reflective validation of segmentation results, leading to erroneous mask predictions.
Method: Proposes a training-free Multi-Agent Recognition, Reasoning, and Reflection (MAR3) framework incorporating sociological Delphi theory. Uses Consensus Multimodal Recognition with LLM agents to recognize expression difficulty and dominant modality, adaptive Collaborative Object Reasoning based on the modality-dominant difficulty rule, and Reflective Learning Segmentation, in which a check agent examines and iteratively corrects segmentation results.
Result: Achieves a 69.2% J&F score on the Ref-AVSBench dataset, outperforming the state of the art by an absolute 3.4%.
Conclusion: MAR3 effectively addresses limitations of previous Ref-AVS methods by explicitly recognizing multimodal cue characteristics, adaptively reasoning about objects, and incorporating reflective validation, leading to superior segmentation performance.
Abstract: Reference Audio-Visual Segmentation (Ref-AVS) aims to segment objects in audible videos based on multimodal cues in reference expressions. Previous methods overlook the explicit recognition of expression difficulty and dominant modality in multimodal cues, over-rely on the quality of the instruction-tuning dataset for object reasoning, and lack reflective validation of segmentation results, leading to erroneous mask predictions. To address these issues, in this paper, we propose a novel training-free Multi-Agent Recognition, Reasoning, and Reflection framework to achieve high-quality Reference Audio-Visual Segmentation, termed MAR3. Incorporating the sociological Delphi theory to achieve robust analysis, a Consensus Multimodal Recognition mechanism is proposed that enables LLM agents to explicitly recognize the difficulty of reference expressions and the dominant modality of multimodal cues. Based on our modality-dominant difficulty rule, we propose an adaptive Collaborative Object Reasoning strategy to reliably reason about the referred object. To further ensure precise mask prediction, we develop a Reflective Learning Segmentation mechanism, in which a check agent examines intermediate segmentation results and iteratively corrects the object text prompt of the segment agent. Experiments demonstrate that MAR3 achieves superior performance (69.2% in J&F) on the Ref-AVSBench dataset, outperforming SOTA by 3.4% absolutely.
Relevance: 9/10
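The reflect-and-correct loop at the heart of MAR3 can be sketched in a few lines. This is an illustrative skeleton only: the agents are stand-in callables, and the loop structure and names (`segment_agent`, `check_agent`, `max_rounds`) are our assumptions, not the paper's code.

```python
def reflective_segmentation(segment_agent, check_agent, prompt, max_rounds=3):
    """Iteratively refine a mask until the check agent accepts it."""
    mask = segment_agent(prompt)
    for _ in range(max_rounds):
        ok, revised_prompt = check_agent(prompt, mask)
        if ok:
            break
        prompt = revised_prompt       # check agent rewrites the object prompt
        mask = segment_agent(prompt)  # segment agent retries with it
    return mask, prompt
```

The key design point the abstract describes is that correction happens through the object text prompt, not through the mask itself.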
[2] Learning to Select Visual In-Context Demonstrations
Eugene Lee, Yu-Chi Lin, Jiajie Diao
Main category: cs.LG
TL;DR: LSD (Learning to Select Demonstrations) uses RL to optimize demonstration selection for multimodal LLMs in visual in-context learning, outperforming kNN on factual regression tasks.
Details
Motivation: Current kNN-based demonstration selection for MLLMs is suboptimal for complex factual regression tasks, as it selects redundant examples that fail to capture the full output range, limiting in-context learning effectiveness.
Method: Reframes selection as sequential decision-making and trains a Reinforcement Learning agent (a Dueling DQN with a query-centric Transformer Decoder) to construct optimal demonstration sets that maximize MLLM downstream performance.
Result: LSD significantly outperforms baselines on objective, factual regression tasks across five visual regression benchmarks, while kNN remains optimal for subjective preference tasks. LSD better defines regression boundaries by balancing visual relevance with diversity.
Conclusion: Learned demonstration selection (LSD) is strictly necessary for visual ICL on factual regression tasks, illuminating when sophisticated selection methods are required versus when simple kNN suffices.
Abstract: Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task’s full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.
Relevance: 9/10
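To see why similarity-first selection picks redundant examples, here is a minimal MMR-style greedy heuristic that trades query relevance against redundancy among already-picked demos. This is not LSD's Dueling-DQN agent, only a sketch of the relevance/diversity tension the paper identifies; the scoring rule and names are ours, and embeddings are assumed unit-normalized.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def select_demos(query, pool, k=2, diversity=0.0):
    """Greedily pick k demo indices from unit-vector embeddings in `pool`.
    diversity=0.0 reduces to pure kNN ranking by query similarity."""
    sims = [dot(e, query) for e in pool]
    chosen = [max(range(len(pool)), key=lambda i: sims[i])]
    while len(chosen) < k:
        def score(i):
            if i in chosen:
                return float("-inf")  # never re-pick an example
            redundancy = max(dot(pool[i], pool[j]) for j in chosen)
            return (1 - diversity) * sims[i] - diversity * redundancy
        chosen.append(max(range(len(pool)), key=score))
    return chosen
```

With `diversity=0.0` a near-duplicate of the first pick wins the second slot; raising `diversity` swaps it for a less similar but less redundant example, which is the behavior LSD learns rather than hand-tunes.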
[3] A Step Toward Federated Pretraining of Multimodal Large Language Models
Baochen Xiong, Yifan Xu, Xiaoshan Yang, Yaguang Song, Yaowei Wang, Changsheng Xu
Main category: cs.LG
TL;DR: Fed-CMP: A federated learning framework for multimodal LLM pre-training that collaboratively trains cross-modal projectors while freezing vision encoders and LLMs, addressing parameter interference and gradient oscillations.
Details
Motivation: MLLM development is limited by scarce public multimodal data, while private data remains inaccessible due to privacy concerns. Federated learning could unlock distributed resources, but existing work focuses on fine-tuning, leaving pre-training unexplored.
Method: Proposes the Fed-CMP framework with two key components: 1) Canonical Reliability-Aware Aggregation, which constructs a canonical space to decompose client projectors into a shared alignment basis and client-specific coefficients with reliability-weighted fusion; 2) Orthogonality-Preserved Momentum, which applies momentum to the shared alignment basis via orthogonal projection to accumulate historical optimization directions while preserving geometric structure.
Result: Extensive experiments on four federated pre-training scenarios based on public datasets show Fed-CMP significantly outperforms existing baselines.
Conclusion: Fed-CMP successfully addresses challenges in federated MLLM pre-training, enabling collaborative training of cross-modal projectors while mitigating parameter interference and gradient oscillations.
Abstract: The rapid evolution of Multimodal Large Language Models (MLLMs) is bottlenecked by the saturation of high-quality public data, while vast amounts of diverse multimodal data remain inaccessible in privacy-sensitive silos. Federated Learning (FL) offers a promising solution to unlock these distributed resources, but existing research focuses predominantly on fine-tuning, leaving the foundational pre-training phase largely unexplored. In this paper, we formally introduce the Federated MLLM Alignment (Fed-MA) task, a lightweight pre-training paradigm that freezes the vision encoder and LLM while collaboratively training the cross-modal projector. We identify two critical challenges in this setting: (i) parameter interference in aggregating local projectors; and (ii) gradient oscillations in one-pass collaborative SGD. To address these challenges, we propose Fed-CMP, a pioneering framework for federated MLLM pre-training. Fed-CMP employs Canonical Reliability-Aware Aggregation, which constructs a canonical space to decompose client projectors into a shared alignment basis and client-specific coefficients, then performs reliability-weighted fusion to suppress parameter interference. Furthermore, Fed-CMP introduces Orthogonality-Preserved Momentum, which applies momentum to the shared alignment basis via orthogonal projection, accumulating historical optimization directions while preserving geometric structure. We construct four federated pre-training scenarios based on public datasets, and extensive experiments validate that Fed-CMP significantly outperforms existing baselines.
Relevance: 9/10
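The reliability-weighted fusion step can be illustrated with a toy aggregation over flattened client parameters. The canonical-space decomposition and momentum terms are omitted; the shapes, weights, and function name here are invented purely for illustration.

```python
def fuse(client_params, reliabilities):
    """Reliability-weighted average of per-client parameter vectors:
    more reliable clients contribute more to the fused projector."""
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    dim = len(client_params[0])
    return [sum(w * p[i] for w, p in zip(weights, client_params))
            for i in range(dim)]
```

Plain FedAvg is the special case where every reliability is equal; Fed-CMP's contribution is deciding those weights (and the basis they act on) in a principled way.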
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 142]
- cs.CV [Total: 487]
- cs.AI [Total: 168]
- cs.SD [Total: 22]
- cs.LG [Total: 257]
- cs.MA [Total: 11]
- cs.MM [Total: 2]
- eess.AS [Total: 12]
- eess.IV [Total: 23]
cs.CL
[1] GeoBlock: Inferring Block Granularity from Dependency Geometry in Diffusion Language Models
Lipeng Wan, Junjie Ma, Jianhui Gu, Zeyang Liu, Xuyang Lu, Xuguang Lan
Main category: cs.CL
TL;DR: GeoBlock: A geometry-aware block inference framework for diffusion language models that dynamically determines block granularity based on attention-derived dependency patterns, enabling parallel refinement while maintaining autoregressive reliability.
Details
Motivation: Existing block-sizing strategies in diffusion language models rely on fixed rules or heuristic signals without considering dependency geometry, which determines which tokens can be safely refined together. The paper introduces a geometry view where regions with strong causal ordering require sequential updates, while semantically cohesive regions admit parallel refinement.
Method: GeoBlock analyzes cross-token dependency patterns from attention mechanisms to identify geometrically stable refinement regions. It dynamically determines appropriate block boundaries during decoding by examining dependency geometry rather than using predefined schedules or local confidence heuristics. The framework requires no additional training and integrates into existing block diffusion architectures.
Result: Extensive experiments across multiple benchmarks show that GeoBlock reliably identifies geometry-consistent block boundaries and improves the accuracy of block diffusion with only a small additional computational budget. The method preserves parallel efficiency while enforcing dependency-consistent refinement.
Conclusion: GeoBlock provides a principled approach to block inference in diffusion language models by leveraging dependency geometry, enabling efficient parallel refinement while maintaining the reliability of autoregressive decoding. The framework demonstrates that geometry-aware block sizing can significantly improve diffusion language model performance.
Abstract: Block diffusion enables efficient parallel refinement in diffusion language models, but its decoding behavior depends critically on block size. Existing block-sizing strategies rely on fixed rules or heuristic signals and do not account for the dependency geometry that determines which tokens can be safely refined together. This motivates a geometry view of diffusion decoding: regions with strong causal ordering require sequential updates, whereas semantically cohesive regions admit parallel refinement. We introduce GeoBlock, a geometry-aware block inference framework that determines block granularity directly from attention-derived dependency geometry. Instead of relying on predefined schedules or local confidence heuristics, GeoBlock analyzes cross-token dependency patterns to identify geometrically stable refinement regions and dynamically determines appropriate block boundaries during decoding. By adapting block granularity to the dependency geometry, GeoBlock preserves the parallel efficiency of block diffusion while enforcing dependency-consistent refinement that exhibits autoregressive reliability. GeoBlock requires no additional training and integrates seamlessly into existing block diffusion architectures. Extensive experiments across multiple benchmarks show that GeoBlock reliably identifies geometry-consistent block boundaries and improves the accuracy of block diffusion with only a small additional computational budget.
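As a rough illustration of geometry-aware block sizing, the sketch below starts a new block wherever a token's strongest dependency on the current block drops below a threshold. GeoBlock derives its criterion from attention geometry; this particular cut rule, the threshold, and the `dep` matrix format are our assumptions, not the paper's algorithm.

```python
def block_boundaries(dep, threshold=0.5):
    """dep[i][j]: dependency strength of token i on earlier token j (j < i).
    Returns (start, end) index pairs covering the sequence."""
    blocks, start = [], 0
    n = len(dep)
    for i in range(1, n):
        coupling = max(dep[i][j] for j in range(start, i))
        if coupling < threshold:       # weakly coupled -> safe to cut here
            blocks.append((start, i))  # close the current block
            start = i
    blocks.append((start, n))
    return blocks
```

Strongly coupled runs stay in one block (sequential refinement across blocks), while weakly coupled tokens open a new block that can be refined in parallel with its neighbors' internals.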
[2] AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
Jianfei Xiao, Xiang Yu, Chengbing Wang, Wuqiang Zheng, Xinyu Lin, Kaining Liu, Hongxun Ding, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He
Main category: cs.CL
TL;DR: AlpsBench is a new benchmark for evaluating LLM personalization using real-world human-LLM dialogues, focusing on four key memory management tasks across the personalization lifecycle.
Details
Motivation: Current LLM personalization research lacks proper evaluation benchmarks - existing ones either miss critical personalized information management aspects or rely on synthetic dialogues that don't reflect real-world interactions, creating a distribution gap.
Method: Created AlpsBench with 2,500 long-term interaction sequences from WildChat (real human-LLM dialogues), paired with human-verified structured memories capturing explicit and implicit personalization signals. Defined four tasks: personalized information extraction, updating, retrieval, and utilization.
Result: Benchmarking frontier LLMs revealed: (1) models struggle with latent user trait extraction, (2) memory updating hits performance ceilings, (3) retrieval accuracy drops sharply with large distractor pools, (4) explicit memory mechanisms improve recall but don’t guarantee preference-aligned or emotionally resonant responses.
Conclusion: AlpsBench provides a comprehensive framework for evaluating LLM personalization using real-world dialogues, highlighting current limitations in memory management and personalization capabilities of state-of-the-art models.
Abstract: As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework.
[3] The Cognitive Divergence: AI Context Windows, Human Attention Decline, and the Delegation Feedback Loop
Netanel Eliav
Main category: cs.CL
TL;DR: The paper documents a growing “Cognitive Divergence” where AI context windows are expanding exponentially while human sustained-attention capacity is contracting, creating an asymmetry that may lead to cognitive delegation feedback loops.
Details
Motivation: To document and theorize the self-reinforcing dynamic between expanding AI context windows and declining human attention spans, and to understand the implications of this "Cognitive Divergence" for human-AI interaction.
Method: Statistical analysis of AI context window growth trends, meta-analysis of human reading rates to derive Effective Context Span (ECS), neurobiological mechanism review across eight neuroimaging studies, and empirical evidence analysis of delegation thresholds.
Result: AI context windows grew from 512 tokens (2017) to 2,000,000 tokens (2026) while human ECS declined from ~16,000 tokens (2004) to ~1,800 tokens (2026), creating a 556-1,111x raw divergence ratio (56-111x quality-adjusted).
Conclusion: The Cognitive Divergence creates a Delegation Feedback Loop where humans increasingly delegate cognitive tasks to AI, potentially further reducing human cognitive capacities, requiring research on validated ECS measurement and longitudinal study of AI-mediated cognitive change.
Abstract: This paper documents and theorises a self-reinforcing dynamic between two measurable trends: the exponential expansion of large language model (LLM) context windows and the secular contraction of human sustained-attention capacity. We term the resulting asymmetry the Cognitive Divergence. AI context windows have grown from 512 tokens in 2017 to 2,000,000 tokens by 2026 (factor ~3,906; fitted lambda = 0.59/yr; doubling time ~14 months). Over the same period, human Effective Context Span (ECS) – a token-equivalent measure derived from validated reading-rate meta-analysis (Brysbaert, 2019) and an empirically motivated Comprehension Scaling Factor – has declined from approximately 16,000 tokens (2004 baseline) to an estimated 1,800 tokens (2026, extrapolated from longitudinal behavioural data ending 2020 (Mark, 2023); see Section 9 for uncertainty discussion). The AI-to-human ratio grew from near parity at the ChatGPT launch (November 2022) to 556–1,111x raw and 56–111x quality-adjusted, after accounting for retrieval degradation (Liu et al., 2024; Chroma, 2025). Beyond documenting this divergence, the paper introduces the Delegation Feedback Loop hypothesis: as AI capability grows, the cognitive threshold at which humans delegate to AI falls, extending to tasks of negligible demand; the resulting reduction in cognitive practice may further attenuate the capacities already documented as declining (Gerlich, 2025; Kim et al., 2026; Kosmyna et al., 2025). Neither trend reverses spontaneously. The paper characterises the divergence statistically, reviews neurobiological mechanisms across eight peer-reviewed neuroimaging studies, presents empirical evidence bearing on the delegation threshold, and proposes a research agenda centred on a validated ECS psychometric instrument and longitudinal study of AI-mediated cognitive change.
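The abstract's headline figures are easy to sanity-check from its stated inputs (512 to 2,000,000 tokens, fitted lambda = 0.59/yr, human ECS estimate of 1,800 tokens):

```python
import math

# Re-derive the abstract's headline numbers from its stated inputs.
growth = 2_000_000 / 512                    # context-window growth, 2017 -> 2026
doubling_months = math.log(2) / 0.59 * 12   # doubling time from lambda = 0.59/yr
raw_ratio = 2_000_000 / 1_800               # raw AI-to-human token ratio

print(round(growth), round(doubling_months), round(raw_ratio))  # 3906 14 1111
```

These reproduce the paper's "factor ~3,906", "~14 months", and the upper end of the 556-1,111x range (the lower end corresponds to the abstract's wider ECS uncertainty band).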
[4] SACRED: A Faithful Annotated Multimedia Multimodal Multilingual Dataset for Classifying Connectedness Types in Online Spirituality
Qinghao Guan, Yuchen Pan, Donghao Li, Zishi Zhang, Yiyang Chen, Lu Li, Flaminia Canu, Emilia Volkart, Gerold Schneider
Main category: cs.CL
TL;DR: Researchers created SACRED, a high-quality multimodal dataset for spirituality studies, and evaluated 13 LLMs on it: DeepSeek-V3 performs best for text classification and GPT-4o-mini for vision tasks, and the study uncovered a new type of connectedness.
Details
Motivation: Spirituality research in social sciences suffers from limited datasets that are often unavailable online. There's a need for high-quality multimodal datasets to study abstract concepts like spirituality, which transcend culture and offer unique individual experiences.
Method: Collaborated with social scientists to develop SACRED, a high-quality multimedia multimodal dataset with guaranteed classification faithfulness. Evaluated 13 popular LLMs along with traditional rule-based and fine-tuned approaches on this dataset.
Result: DeepSeek-V3 achieved 79.19% accuracy on Quora test set for text classification. GPT-4o-mini surpassed other models in vision tasks with 63.99% F1 score. Discovered a new type of connectedness valuable for communication science studies.
Conclusion: SACRED is the first annotated multimodal dataset from online spirituality communication. The study demonstrates LLMs’ capability in classifying abstract concepts like spirituality and reveals new patterns of connectedness in spiritual communication.
Abstract: In religion and theology studies, spirituality has garnered significant research attention for the reason that it not only transcends culture but offers unique experience to each individual. However, social scientists often rely on limited datasets, which are basically unavailable online. In this study, we collaborated with social scientists to develop a high-quality multimedia multi-modal dataset, SACRED, in which the faithfulness of classification is guaranteed. Using SACRED, we evaluated the performance of 13 popular LLMs as well as traditional rule-based and fine-tuned approaches. The results suggest the DeepSeek-V3 model performs well in classifying such abstract concepts (i.e., 79.19% accuracy in the Quora test set), and the GPT-4o-mini model surpassed the other models in the vision tasks (63.99% F1 score). Purportedly, this is the first annotated multi-modal dataset from online spirituality communication. Our study also found a new type of connectedness which is valuable for communication science studies.
[5] Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages
Swastik R
Main category: cs.CL
TL;DR: First cross-lingual visual reasoning audit for Indian languages shows significant performance drops (9.8-25pp) when VLMs switch from English to Indian languages, with Dravidian languages suffering more than Indo-Aryan languages.
Details
Motivation: Current vision-language model evaluations are overwhelmingly English-centric, creating a gap in understanding how these models perform across diverse languages, particularly for Indian languages which represent significant linguistic diversity.
Method: Translated 980 questions from MathVista, ScienceQA, and MMMU into 6 Indian languages using IndicTrans2 with Gemini 2.0 Flash cross-verification. Evaluated 8 VLMs across 7 languages (including English), generating 68,600 inference records with text-only and chain-of-thought ablations.
Result: Accuracy drops of 9.8-25 percentage points when switching from English to Indian languages. Dravidian languages suffer up to 13.2pp more than Indo-Aryan languages. Chain-of-thought prompting degrades performance for Bengali (-14.4pp) and Kannada (-11.4pp). Aya-Vision-8B drops 28.5pp on Dravidian scripts despite multilingual pretraining.
Conclusion: Multilingual pretraining alone doesn’t transfer visual reasoning capabilities. Current VLMs exhibit English-centric reasoning chains and significant performance disparities across languages, highlighting the need for more equitable multilingual vision-language models.
Abstract: Vision-language models score well on mathematical, scientific, and spatial reasoning benchmarks, yet these evaluations are overwhelmingly English. I present the first cross-lingual visual reasoning audit for Indian languages. 980 questions from MathVista, ScienceQA, and MMMU are translated into Hindi, Tamil, Telugu, Bengali, Kannada, and Marathi using IndicTrans2, with Gemini 2.0 Flash cross-verification on 50 samples per language (inter-translator agreement 0.79-0.84). Eight VLMs, from 7B open-source models to GPT-4o, are evaluated across all seven languages, yielding 68,600 inference records that include text-only and chain-of-thought ablations. I find accuracy drops of 9.8-25 percentage points when switching from English to an Indian language, with Dravidian languages suffering up to 13.2 pp more than Indo-Aryan. Chain-of-thought prompting degrades Bengali (-14.4 pp) and Kannada (-11.4 pp) rather than helping, exposing English-centric reasoning chains. Aya-Vision-8B, built for 23 languages, still drops 28.5 pp on Dravidian scripts; multilingual pretraining alone does not transfer visual reasoning. I release the translated benchmark and all model outputs.
[6] LogicDiff: Logic-Guided Denoising Improves Reasoning in Masked Diffusion Language Models
Shaik Aman
Main category: cs.CL
TL;DR: LogicDiff improves masked diffusion language models’ reasoning by using logical role prediction to guide token unmasking order instead of confidence-based unmasking.
Details
Motivation: Standard masked diffusion language models use confidence-based unmasking that systematically defers high-entropy logical connective tokens, which are critical branching points in reasoning chains, leading to severely degraded reasoning performance.
Method: LogicDiff uses a lightweight classification head (4.2M parameters) to predict logical roles (premise, connective, derived step, conclusion, filler) from base model hidden states with 98.4% accuracy, then uses a dependency-ordered scheduler to unmask tokens in logical dependency order: premises first, then connectives, then derived steps, then conclusions.
Result: Without modifying base model parameters, LogicDiff improves LLaDA-8B-Instruct accuracy from 22.0% to 60.7% on GSM8K (+38.7 percentage points) and from 23.6% to 29.2% on MATH-500 (+5.6 pp), with less than 6% speed overhead.
Conclusion: A substantial portion of the reasoning deficit in masked diffusion language models is attributable to suboptimal token unmasking order, not to limitations of the model’s learned representations.
Abstract: Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens from a fully masked sequence, offering parallel generation and bidirectional context. However, their standard confidence-based unmasking strategy systematically defers high-entropy logical connective tokens, the critical branching points in reasoning chains, leading to severely degraded reasoning performance. We introduce LogicDiff, an inference-time method that replaces confidence-based unmasking with logic-role-guided unmasking. A lightweight classification head (4.2M parameters, 0.05% of the base model) predicts the logical role of each masked position (premise, connective, derived step, conclusion, or filler) from the base model’s hidden states with 98.4% accuracy. A dependency-ordered scheduler then unmasks tokens in logical dependency order: premises first, then connectives, then derived steps, then conclusions. Without modifying a single parameter of the base model and without any reinforcement learning or task-specific training, LogicDiff improves LLaDA-8B-Instruct accuracy from 22.0% to 60.7% on GSM8K (+38.7 percentage points) and from 23.6% to 29.2% on MATH-500 (+5.6 pp), with less than 6% speed overhead. Our results demonstrate that a substantial portion of the reasoning deficit in MDLMs is attributable to suboptimal token unmasking order, not to limitations of the model’s learned representations.
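The dependency-ordered scheduler reduces to a sort over predicted roles. A minimal sketch follows; the abstract does not pin down where filler tokens fall in the order, so placing them last (and breaking ties left-to-right) is our choice.

```python
# Unmask order from the paper: premises -> connectives -> derived steps ->
# conclusions. Filler placement and tie-breaking are our assumptions.
ROLE_ORDER = {"premise": 0, "connective": 1, "derived": 2,
              "conclusion": 3, "filler": 4}

def unmask_schedule(roles):
    """roles: one predicted logical role per masked position.
    Returns position indices in logical-dependency order."""
    return sorted(range(len(roles)), key=lambda i: (ROLE_ORDER[roles[i]], i))
```

Under confidence-based unmasking the high-entropy connectives would surface last; this role-based ordering forces them in before the steps that depend on them.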
[7] Resolving the Robustness-Precision Trade-off in Financial RAG through Hybrid Document-Routed Retrieval
Zhiyuan Cheng, Longying Lai, Yue Liu
Main category: cs.CL
TL;DR: HDRR (Hybrid Document-Routed Retrieval) combines document-level routing with chunk-based retrieval to overcome limitations of both approaches in financial QA systems, achieving superior performance by eliminating cross-document confusion while preserving precision.
Details
Motivation: Standard chunk-based retrieval (CBR) for financial document QA suffers from cross-document chunk confusion in structurally homogeneous corpora like regulatory filings, while document-level routing (SFR) reduces failures but sacrifices the precision of targeted chunk retrieval.
Method: Proposes Hybrid Document-Routed Retrieval (HDRR), a two-stage architecture that first uses Semantic File Routing (SFR) with LLM structured output to route queries to relevant documents, then performs chunk-based retrieval scoped only to the identified document(s).
Result: HDRR achieves best performance on all metrics: average score of 7.54 (25.2% above CBR, 16.9% above SFR), failure rate of only 6.4%, correctness rate of 67.7% (+18.7pp over CBR), and perfect-answer rate of 20.1% (+6.3pp over CBR, +11.6pp over SFR).
Conclusion: HDRR resolves the robustness-precision trade-off between document routing and chunk retrieval, eliminating cross-document confusion while preserving targeted chunk precision, making it superior for financial document QA systems.
Abstract: Retrieval-Augmented Generation (RAG) systems for financial document question answering typically follow a chunk-based paradigm: documents are split into fragments, embedded into vector space, and retrieved via similarity search. While effective in general settings, this approach suffers from cross-document chunk confusion in structurally homogeneous corpora such as regulatory filings. Semantic File Routing (SFR), which uses LLM structured output to route queries to whole documents, reduces catastrophic failures but sacrifices the precision of targeted chunk retrieval. We identify this robustness-precision trade-off through controlled evaluation on the FinDER benchmark (1,500 queries across five groups): SFR achieves higher average scores (6.45 vs. 6.02) and fewer failures (10.3% vs. 22.5%), while chunk-based retrieval (CBR) yields more perfect answers (13.8% vs. 8.5%). To resolve this trade-off, we propose Hybrid Document-Routed Retrieval (HDRR), a two-stage architecture that uses SFR as a document filter followed by chunk-based retrieval scoped to the identified document(s). HDRR eliminates cross-document confusion while preserving targeted chunk precision. Experimental results demonstrate that HDRR achieves the best performance on every metric: an average score of 7.54 (25.2% above CBR, 16.9% above SFR), a failure rate of only 6.4%, a correctness rate of 67.7% (+18.7 pp over CBR), and a perfect-answer rate of 20.1% (+6.3 pp over CBR, +11.6 pp over SFR). HDRR resolves the trade-off by simultaneously achieving the lowest failure rate and the highest precision across all five experimental groups.
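Structurally, HDRR is a document router followed by a scoped chunk retriever. A skeletal sketch with stand-in callables for the two stages (`route` and `retrieve_chunks` are hypothetical names, not the paper's API):

```python
def hdrr(query, corpus, route, retrieve_chunks, top_k=3):
    """corpus: {doc_id: list_of_chunks}. Stage 1 routes the query to
    document(s); stage 2 runs chunk retrieval only inside those documents."""
    doc_ids = route(query, list(corpus))          # stage 1: SFR document routing
    scoped = [(doc, chunk) for doc in doc_ids for chunk in corpus[doc]]
    return retrieve_chunks(query, scoped, top_k)  # stage 2: scoped chunk search
```

Because similarity search never sees chunks from unrouted filings, cross-document confusion is eliminated by construction, while within-document chunk precision is retained.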
[8] Arithmetic OOD Failure Unfolds in Stages in Minimal GPTs
Seine A. Shintani
Main category: cs.CL
TL;DR: GPT trained on 2-digit addition fails at 3-digit generalization due to staged failures: layout barrier, carry semantics, conditional recomposition, and tens residual errors.
Details
Motivation: To understand why arithmetic benchmarks often reduce to single scores that conflate different failure types, and to decompose the specific failures in GPT generalization from 2-digit to 3-digit addition.
Method: Train minimal GPT on exhaustive 2-digit addition, then systematically test 3-digit generalization failures through controlled experiments: layout shift tests, carry probes, conditional recomposition studies, and targeted repair interventions.
Result: Identified four staged failures: 1) layout barrier requiring mixed-layout exposure, 2) hundreds position acting as carry flag rather than semantic digit, 3) conditional recomposition bottleneck, and 4) tens residual errors. Targeted repairs improved performance from 0.664 to 0.822 exact match.
Conclusion: Arithmetic out-of-distribution failures can be decomposed into experimentally testable stages: layout, carry-semantics, recomposition, and tens-residual, providing a more nuanced understanding than single-score benchmarks.
Abstract: Arithmetic benchmarks are often reduced to a single held-out score, but that score can conflate qualitatively different failures. We study a controlled minimal GPT trained on exhaustive 2-digit addition, where all local digit transitions are already present in training, and ask why 3-digit generalization still fails. The failure is staged. First, there is a layout barrier: a learned absolute-position model collapses under a pure 3-digit layout shift, and mixed-layout exposure is the only intervention that materially weakens this barrier. Second, after layout repair, the hundreds position behaves like a carry flag rather than a semantic hundreds digit; targeted carry probes reverse the relevant logit margin, whereas a matched extra-data control does not. Third, after carry repair, the main remaining bottleneck is conditional recomposition: high-conditioned tail data outperforms a matched control, high-only data, and tail-only data on all true-3-digit suites, and the same ordering reappears in a larger 2-layer bridge experiment. The residual errors after recomposition are then overwhelmingly tens-only, and a separate 10-seed late-stage study shows that a sign-aware tens repair raises exact match on the hardest thousands-carry suite from 0.664 to 0.822. We therefore provide an experimentally testable decomposition of arithmetic OOD failure into layout, carry-semantics, recomposition, and late tens-residual stages.
[9] Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation
Lorca McLaren, James Cross, Zuzanna Krakowska, Robin Rauner, Martijn Schoonvelde
Main category: cs.CL
TL;DR: LLM implementation choices for political science text annotation show complex interaction effects where no single model, prompt style, or learning approach is uniformly superior across tasks, requiring careful validation frameworks.
Details
Motivation: Political scientists are increasingly using LLMs for text annotation, but there is poor understanding of how sensitive annotation results are to implementation choices like model selection, size, learning approach, and prompt engineering.
Method: Controlled evaluation of six open-weight models across four political science annotation tasks under identical conditions (quantisation, hardware, prompt templates), examining interaction effects between model choice, size, learning approach, and prompt style.
Result: Interaction effects dominate main effects - no single model, prompt style, or learning approach is uniformly superior. Model size is unreliable for cost/performance prediction, and popular prompt engineering techniques yield inconsistent or negative effects.
Conclusion: Researchers need validation-first frameworks with principled pipeline decision ordering, prompt freezing guidance, held-out evaluation, reporting standards, and open-source tools to navigate complex implementation choices transparently.
Abstract: Political scientists are rapidly adopting large language models (LLMs) for text annotation, yet the sensitivity of annotation results to implementation choices remains poorly understood. Most evaluations test a single model or configuration; how model choice, model size, learning approach, and prompt style interact, and whether popular “best practices” survive controlled comparison, are largely unexplored. We present a controlled evaluation of these pipeline choices, testing six open-weight models across four political science annotation tasks under identical quantisation, hardware, and prompt-template conditions. Our central finding is methodological: interaction effects dominate main effects, so seemingly reasonable pipeline choices can become consequential researcher degrees of freedom. No single model, prompt style, or learning approach is uniformly superior, and the best-performing model varies across tasks. Two corollaries follow. First, model size is an unreliable guide both to cost and to performance: cross-family efficiency differences are so large that some larger models are less resource-intensive than much smaller alternatives, while within model families mid-range variants often match or exceed larger counterparts. Second, widely recommended prompt engineering techniques yield inconsistent and sometimes negative effects on annotation performance. We use these benchmark results to develop a validation-first framework - with a principled ordering of pipeline decisions, guidance on prompt freezing and held-out evaluation, reporting standards, and open-source tools - to help researchers navigate this decision space transparently.
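The claim that "interaction effects dominate main effects" can be made concrete with a toy 2x2 grid (the accuracies below are invented for illustration): when the best learning approach flips between models, the averaged (main) effect of either factor understates what actually drives performance.

```python
# Invented accuracies over {model A, model B} x {zero-shot, few-shot}.
acc = {
    ("A", "zero-shot"): 0.80, ("A", "few-shot"): 0.70,
    ("B", "zero-shot"): 0.65, ("B", "few-shot"): 0.85,
}

# Main effect of the learning approach: averaged over models.
zero_mean = (acc[("A", "zero-shot")] + acc[("B", "zero-shot")]) / 2
few_mean = (acc[("A", "few-shot")] + acc[("B", "few-shot")]) / 2

# Interaction as a difference-of-differences: how much the few-shot gain
# depends on which model was picked.
interaction = (acc[("A", "few-shot")] - acc[("A", "zero-shot")]) \
            - (acc[("B", "few-shot")] - acc[("B", "zero-shot")])
```

Here the main-effect gap is 0.05 while the interaction term is -0.30: reporting "few-shot is better on average" would actively mislead anyone using model A.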
[10] A large corpus of lucid and non-lucid dream reports
Remington Mallett
Main category: cs.CL
TL;DR: Researchers curated a large corpus of 55k dream reports from online forums, including 10k lucid dreams, to enable systematic study of dream phenomenology through natural language analysis.
Details
Motivation: Lucid dreams are difficult to study due to their low prevalence and resistance to deliberate induction, leading to a lack of clarity around lucid dream phenomenology and under-realized applications.
Method: Scraped 10 years of publicly available dream reports from an online forum where users share anonymous dream journals, with optional user-provided labels (lucid, non-lucid, nightmare). Applied descriptive statistics, visualizations, and construct validation to analyze language patterns.
Result: Created a corpus of 55k dream reports from 5k contributors, including 10k lucid, 25k non-lucid, and 2k nightmare labels. Construct validation showed language patterns in lucid-labeled reports are consistent with known characteristics of lucid dreams.
Conclusion: The corpus has broad value for dream science, with the labeled subset being particularly powerful for new discoveries in lucid dream studies, enabling systematic analysis of dream phenomenology.
Abstract: All varieties of dreaming remain a mystery. Lucid dreams in particular, or those characterized by awareness of the dream, are notoriously difficult to study. Their scarce prevalence and resistance to deliberate induction make it difficult to obtain a sizeable corpus of lucid dream reports. The consequent lack of clarity around lucid dream phenomenology has left the many purported applications of lucidity under-realized. Here, a large corpus of 55k dream reports from 5k contributors is curated, described, and validated for future research. Ten years of publicly available dream reports were scraped from an online forum where users share anonymous dream journals. Importantly, users optionally categorize their dream as lucid, non-lucid, or a nightmare, offering a user-provided labeling system that includes 10k lucid and 25k non-lucid, and 2k nightmare labels. After characterizing the corpus with descriptive statistics and visualizations, construct validation shows that language patterns in lucid-labeled reports are consistent with known characteristics of lucid dreams. While the entire corpus has broad value for dream science, the labeled subset is particularly powerful for new discoveries in lucid dream studies.
[11] The Last Fingerprint: How Markdown Training Shapes LLM Prose
E. M. Freeburg
Main category: cs.CL
TL;DR: LLMs overuse em dashes due to markdown formatting leaking from training data, serving as a diagnostic signature of fine-tuning methodology rather than just stylistic defect.
Details
Motivation: To explain why LLMs overuse em dashes and connect this observation to markdown formatting in training data, providing a mechanistic account of this AI-generated text pattern.
Method: Proposed a five-step genealogy connecting training data to em dash usage, conducted suppression experiments across 12 models from 5 providers with instructions to avoid markdown formatting, and tested explicit em dash prohibition.
Result: Em dash frequency varies from 0.0 to 9.1 per 1,000 words across models, with Llama models producing none. Em dashes persist even under markdown suppression in most models, serving as a signature of specific fine-tuning procedures.
Conclusion: Em dash overuse results from markdown leaking into prose from training data, functioning as a diagnostic tool for fine-tuning methodology rather than a stylistic defect, connecting previously isolated observations about AI text patterns.
Abstract: Large language models produce em dashes at varying rates, and the observation that some models “overuse” them has become one of the most widely discussed markers of AI-generated text. Yet no mechanistic account of this pattern exists, and the parallel observation that LLMs default to markdown-formatted output has never been connected to it. We propose that the em dash is markdown leaking into prose – the smallest surviving unit of the structural orientation that LLMs acquire from markdown-saturated training corpora. We present a five-step genealogy connecting training data composition, structural internalization, the dual-register status of the em dash, and post-training amplification. We test this with a two-condition suppression experiment across twelve models from five providers (Anthropic, OpenAI, Meta, Google, DeepSeek): when models are instructed to avoid markdown formatting, overt features (headers, bullets, bold) are eliminated or nearly eliminated, but em dashes persist – except in Meta’s Llama models, which produce none at all. Em dash frequency and suppression resistance vary from 0.0 per 1,000 words (Llama) to 9.1 (GPT-4.1 under suppression), functioning as a signature of the specific fine-tuning procedure applied. A three-condition suppression gradient shows that even explicit em dash prohibition fails to eliminate the artifact in some models, and a base-vs-instruct comparison confirms that the latent tendency exists pre-RLHF. These findings connect two previously isolated online discourses and reframe em dash frequency as a diagnostic of fine-tuning methodology rather than a stylistic defect.
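The paper's headline measurement, em dashes per 1,000 words, is simple to reproduce. A minimal sketch, assuming naive whitespace tokenization (the paper's exact word counter is not specified here):

```python
def em_dash_rate(text: str) -> float:
    """Em dashes (U+2014) per 1,000 whitespace-delimited words."""
    words = text.split()
    if not words:
        return 0.0
    return 1000 * text.count("\u2014") / len(words)

# Example: two em dashes in a seven-word sentence.
sample = "The model paused\u2014briefly\u2014then answered in plain prose."
```

Note that hyphens (`-`) and en dashes (`\u2013`) are deliberately excluded; the claimed fingerprint is specifically the em dash character.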
[12] RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models
Rahul Soni
Main category: cs.CL
TL;DR: RASPRef is a framework for self-supervised prompt refinement using retrieval and feedback signals to improve reasoning in language models without human annotation.
Details
Motivation: Current reasoning-focused language models are highly sensitive to prompt formulation, and manual prompt design doesn't scale well across tasks or domains. There's a need for automated, scalable prompt improvement methods.
Method: Retrieval-Augmented Self-Supervised Prompt Refinement (RASPRef) retrieves relevant examples and reasoning trajectories, then uses multi-sample consistency, verifier feedback, and model-generated critiques to iteratively refine prompts without human supervision.
Result: Experiments on GSM8K-style mathematical reasoning tasks show retrieval-guided prompting improves performance compared to static prompting baselines. The effectiveness depends on retrieval quality, trajectory selection, and feedback signals.
Conclusion: Prompt design remains critical for reasoning-oriented language models, and self-improving prompts offer a practical, scalable strategy for improving reasoning performance without manual intervention.
Abstract: Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks. However, their performance remains highly sensitive to prompt formulation, and designing effective prompts is typically a manual and iterative process that does not scale well across tasks or domains. To address this limitation, we introduce Retrieval-Augmented Self-Supervised Prompt Refinement (RASPRef), a framework that improves prompts without requiring human annotations or task-specific supervision. The approach retrieves relevant examples and previously generated reasoning trajectories, and leverages signals such as multi-sample consistency, verifier feedback, and model-generated critiques to iteratively refine the prompt. Unlike prior approaches that focus primarily on improving model outputs, RASPRef directly treats the prompt as the optimization target and improves it through an iterative retrieval-guided refinement process. Experiments on GSM8K-style mathematical reasoning tasks show that retrieval-guided prompting improves performance compared with a static prompting baseline. We further discuss how retrieval quality, trajectory selection, and self-supervised feedback signals may influence the effectiveness of prompt refinement. These findings suggest that prompt design remains a critical factor for reasoning-oriented language models, and that self-improving prompts offer a practical and scalable strategy for improving reasoning performance.
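One of the feedback signals above, multi-sample consistency, can be sketched as a majority-vote score over repeated samples. Here `sample_fn` is a stand-in for calling the model; all names are assumptions for illustration, not RASPRef's actual API:

```python
from collections import Counter

def consistency_score(answers) -> float:
    """Fraction of sampled answers that agree with the modal answer."""
    if not answers:
        return 0.0
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

def pick_prompt(candidates, sample_fn, k=5):
    """Keep the candidate prompt whose k sampled answers agree the most."""
    return max(candidates,
               key=lambda p: consistency_score([sample_fn(p) for _ in range(k)]))
```

In the full framework this unsupervised signal would be combined with verifier feedback and model-generated critiques rather than used alone.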
[13] Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation
Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Rao Koluguri, Piotr Żelasko, Somshubra Majumdar, Adel Moumen, Sanchit Gandhi
Main category: cs.CL
TL;DR: The Open ASR Leaderboard is a reproducible benchmarking platform that evaluates 86 speech recognition systems across 12 datasets, comparing accuracy (WER) and efficiency (RTFx) across different model architectures and toolkits.
Details
Motivation: To create a standardized, transparent benchmarking platform for automatic speech recognition (ASR) systems that enables fair comparison across different architectures, toolkits, and research/industry contributions, addressing the need for reproducible evaluation in the ASR field.
Method: Developed a comprehensive benchmarking platform with standardized evaluation metrics (word error rate and inverse real-time factor), tested 86 open-source and proprietary systems across 12 datasets covering English short/long-form and multilingual short-form tracks, with support for multiple toolkits including ESPNet, NeMo, SpeechBrain, and Transformers.
Result: Conformer-based encoders with transformer-based decoders achieved the best average WER, while CTC and TDT decoders offered superior efficiency (RTFx), making them better for long-form and batched processing. All code and dataset loaders are open-sourced for community use.
Conclusion: The Open ASR Leaderboard provides a transparent, extensible evaluation framework that enables reproducible benchmarking and systematic comparison of ASR systems, with findings that guide architectural choices based on accuracy vs. efficiency trade-offs.
Abstract: We present the Open ASR Leaderboard, a reproducible benchmarking platform with community contributions from academia and industry. It compares 86 open-source and proprietary systems across 12 datasets, with English short- and long-form and multilingual short-form tracks. We standardize word error rate (WER) and inverse real-time factor (RTFx) evaluation for consistent accuracy-efficiency comparisons across model architectures and toolkits (e.g., ESPNet, NeMo, SpeechBrain, Transformers). We observe that Conformer-based encoders paired with transformer-based decoders achieve the best average WER, while connectionist temporal classification (CTC) and token-and-duration transducer (TDT) decoders offer superior RTFx, making them better suited for long-form and batched processing. All code and dataset loaders are open-sourced to support transparent, extensible evaluation. We present our evaluation methodology to facilitate community-driven benchmarking in ASR and other tasks.
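The leaderboard's two metrics follow standard definitions: WER is the word-level Levenshtein distance divided by the reference length, and inverse real-time factor (RTFx) is seconds of audio transcribed per second of compute. A minimal sketch of both (not the leaderboard's actual implementation, which also normalizes text):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev_row[j] + 1,               # deletion
                           cur[j - 1] + 1,                # insertion
                           prev_row[j - 1] + (r != h)))   # substitution
        prev_row = cur
    return prev_row[-1] / max(len(ref), 1)

def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: values above 1 are faster than real time."""
    return audio_seconds / processing_seconds
```

The accuracy-efficiency trade-off the abstract describes is exactly a comparison along these two axes: attention-decoder systems win on `wer`, CTC/TDT decoders on `rtfx`.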
[14] Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language
Hanif Rahman, Shafeeq ur Rehman
Main category: cs.CL
TL;DR: Created first large-scale open Pashto speech corpus (147 hours, 1,483 speakers) through community effort, improving Whisper ASR from 99.0% to 13.4% WER
Details
Motivation: Pashto has over 60 million native speakers but lacks open speech technology resources. There's a need for large-scale, openly licensed speech data to enable speech technology development for this underserved language.
Method: Community-driven approach using the Mozilla Common Voice platform (CV14-CV23). Methods included: interface localization, Wikipedia-based sentence extraction with automated filtering, phonemically targeted contributions for frequently dropped Pashto characters, and multi-channel community outreach including a VOA Pashto broadcast campaign.
Result: Built corpus of 147 hours with 1,483 unique speakers (107,781 clips, 60,337 validated, 82.33 validated hours). Speaker participation increased 108-fold between CV17 and CV18. Fine-tuning Whisper Base on MCV20 reduced WER from 99.0% (zero-shot) to 13.4% on MCV20 test split.
Conclusion: Successfully created the first large-scale open Pashto speech corpus through community effort, demonstrating that targeted outreach and phonemic strategies can build resources for low-resource languages, significantly improving ASR performance.
Abstract: We present the Pashto Common Voice corpus – the first large-scale, openly licensed speech resource for Pashto, a language with over 60 million native speakers largely absent from open speech technology. Through a community effort spanning 2022-2025, the corpus grew from 1.5 hours and 5 contributors to 147 total hours and 1,483 unique speakers across ten Mozilla Common Voice releases (CV14-CV23). Speaker participation increased approximately 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign. We describe the full methodology: interface localisation, Wikipedia-based sentence extraction with automated filtering, phonemically targeted contributions for the four most frequently dropped Pashto characters, and multi-channel community outreach. MCV23 contains 107,781 clips (60,337 validated; 82.33 validated hours) across 13 content domains. Fine-tuning Whisper Base on the MCV20 yields 13.4% WER on the MCV20 test split, against the published Whisper Base zero-shot WER of 99.0% on Pashto.
[15] AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents
Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang, Xiaobin Wang, Runnan Fang, Qi Zhang, Baixuan Li, Shihao Cai, Rui Ye, Hui Chen, Jiang Yong, Joey Tianyi Zhou, Chenxiong Qian, Pengjun Xie, Bryan Hooi, Zuozhu Liu, Jingren Zhou
Main category: cs.CL
TL;DR: AgentSwing: A state-aware adaptive parallel context management routing framework for LLM agents that dynamically selects optimal context management strategies during long-horizon information-seeking tasks.
Details
Motivation: Current LLM agents for long-horizon information-seeking face context capacity bottlenecks. Existing methods use fixed context management strategies throughout entire trajectories, which cannot adapt as context usefulness and reliability evolve during long searches.
Method: Proposes AgentSwing framework with probabilistic modeling of long-horizon success through search efficiency and terminal precision. At trigger points, expands multiple context-managed branches in parallel and uses lookahead routing to select the most promising continuation.
Result: Experiments across diverse benchmarks and agent backbones show AgentSwing consistently outperforms static context management methods, often matching or exceeding their performance with up to 3× fewer interaction turns while improving ultimate performance ceiling.
Conclusion: AgentSwing provides effective adaptive context management for long-horizon agents, and the probabilistic framework offers principled analysis and design methodology for future context management strategies.
Abstract: As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critical bottleneck. Existing context management methods typically commit to a single fixed strategy throughout the entire trajectory. Such static designs may work well in some states, but they cannot adapt as the usefulness and reliability of the accumulated context evolve during long-horizon search. To formalize this challenge, we introduce a probabilistic framework that characterizes long-horizon success through two complementary dimensions: search efficiency and terminal precision. Building on this perspective, we propose AgentSwing, a state-aware adaptive parallel context management routing framework. At each trigger point, AgentSwing expands multiple context-managed branches in parallel and uses lookahead routing to select the most promising continuation. Experiments across diverse benchmarks and agent backbones show that AgentSwing consistently outperforms strong static context management methods, often matching or exceeding their performance with up to 3× fewer interaction turns while also improving the ultimate performance ceiling of long-horizon web agents. Beyond the empirical gains, the proposed probabilistic framework provides a principled lens for analyzing and designing future context management strategies for long-horizon agents.
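The branch-and-route step can be sketched with toy stand-ins: each "strategy" is one context-management policy, `lookahead_fn` rolls a branch forward a few steps, and `score_fn` approximates the paper's efficiency/precision objective. All names are illustrative, not AgentSwing's API:

```python
def route(context, strategies, lookahead_fn, score_fn):
    """Expand one context-managed branch per strategy, preview each with a
    short lookahead, and keep the highest-scoring branch."""
    branches = [strategy(context) for strategy in strategies]
    previews = [lookahead_fn(branch) for branch in branches]
    best = max(range(len(branches)), key=lambda i: score_fn(previews[i]))
    return branches[best]
```

The adaptivity the abstract emphasizes comes from calling `route` at each trigger point with the current state, rather than committing to one strategy for the whole trajectory.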
[16] TAPS: Task Aware Proposal Distributions for Speculative Sampling
Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem
Main category: cs.CL
TL;DR: Specialized draft model training for speculative decoding improves performance on specific task domains, with confidence-based routing outperforming weight averaging for combining multiple specialized drafters.
Details
Motivation: Current draft models for speculative decoding are trained on broad generic corpora, leaving unclear how much decoding quality depends on draft training distribution and whether task-specific training could improve performance.
Method: Train lightweight HASS and EAGLE-2 draft models on MathInstruct, ShareGPT, and mixed-data variants; evaluate on MT-Bench, GSM8K, MATH-500, and SVAMP; study acceptance length, temperature effects, and methods for combining specialized drafters (checkpoint averaging, confidence-based routing, merged-tree verification).
Result: Task-specific training yields clear specialization: MathInstruct-trained drafts excel on reasoning benchmarks, ShareGPT-trained drafts excel on MT-Bench. Mixed-data improves robustness but doesn’t dominate across temperatures. Confidence-based routing outperforms single-domain drafts, with merged-tree verification achieving highest acceptance length overall.
Conclusion: Speculative decoding quality depends on both draft architecture and match between training data and downstream workload; specialized drafters are better combined at inference time than in weight space; confidence is more useful routing signal than entropy.
Abstract: Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
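The confidence-based routing finding can be sketched as: each specialized drafter reports how confident it is in its proposal, and the prompt goes to the most confident one. Drafters are stubbed here as functions returning `(token, confidence)`; real drafters expose logits rather than this toy interface:

```python
def route_to_drafter(prompt: str, drafters: dict) -> str:
    """Return the name of the drafter most confident about this prompt."""
    proposals = {name: draft(prompt) for name, draft in drafters.items()}
    return max(proposals, key=lambda name: proposals[name][1])
```

This is also why the paper finds confidence a better routing signal than entropy: max-probability gives a sharp per-drafter ranking, whereas rejected tokens merely tend to have higher entropy.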
[17] Introducing MELI: the Mandarin-English Language Interview Corpus
Suyuan Liu, Molly Babel
Main category: cs.CL
TL;DR: The MELI Corpus is a 29.8-hour open-source speech dataset from 51 Mandarin-English bilingual speakers, featuring matched Mandarin and English sessions with read sentences and spontaneous interviews, fully transcribed and force-aligned.
Details
Motivation: To create a comprehensive resource for studying bilingual speech, code-switching patterns, and the relationship between acoustics and language attitudes in Mandarin-English bilingual speakers.
Method: Collected speech data from 51 bilingual speakers with matched Mandarin and English sessions, including read sentences and spontaneous interviews. Audio was recorded at 44.1 kHz, fully transcribed, force-aligned at word/phone levels, and anonymized.
Result: Created a 29.8-hour corpus with ~14.7 hours Mandarin and ~15.1 hours English content. Documented token/type statistics, code-switching patterns (frequent in Mandarin sessions), and enables within-/cross-speaker/language acoustic comparisons.
Conclusion: The MELI Corpus provides a valuable open-source resource for bilingual speech research, supporting both quantitative acoustic analysis and qualitative investigation of language attitudes and code-switching behaviors.
Abstract: We introduce the Mandarin-English Language Interview (MELI) Corpus, an open-source resource of 29.8 hours of speech from 51 Mandarin-English bilingual speakers. MELI combines matched sessions in Mandarin and English with two speaking styles: read sentences and spontaneous interviews about language varieties, standardness, and learning experiences. Audio was recorded at 44.1 kHz (16-bit, stereo). Interviews were fully transcribed, force-aligned at word and phone levels, and anonymized. Descriptively, the Mandarin component totals ~14.7 hours (mean duration 17.3 minutes) and the English component ~15.1 hours (mean duration 17.8 minutes). We report token/type statistics for each language and document code-switching patterns (frequent in Mandarin sessions; more limited in English sessions). The corpus design supports within-/cross-speaker, within/cross-language acoustic comparison and links acoustics to speakers’ stated language attitudes, enabling both quantitative and qualitative analyses. The MELI Corpus will be released with transcriptions, alignments, metadata, scans of labelled maps and documentation under a CC BY-NC 4.0 license.
[18] Text Data Integration
Md Ataur Rahman, Dimitris Sacharidis, Oscar Romero, Sergi Nadal
Main category: cs.CL
TL;DR: This chapter argues for integrating textual (unstructured) data into data integration systems alongside structured data, discussing challenges, state of the art, and open problems.
Details
Motivation: Current data integration systems primarily focus on structured data, but unstructured textual data contains valuable knowledge that should be utilized. The heterogeneity of data formats poses challenges for meaningful storage and processing.
Method: The chapter presents a conceptual framework for integrating textual data, discussing the challenges, reviewing state-of-the-art approaches, and identifying open research problems in this area.
Result: The chapter makes a case for textual data integration, outlines the technical challenges involved, surveys existing approaches, and highlights important open problems for future research.
Conclusion: Textual data integration is an important but challenging area that requires further research to fully leverage the knowledge contained in unstructured text alongside structured data sources.
Abstract: Data comes in many forms. From a shallow perspective, they can be viewed as being either in structured (e.g., as a relation, as key-value pairs) or unstructured (e.g., text, image) formats. So far, machines have been fairly good at processing and reasoning over structured data that follows a precise schema. However, the heterogeneity of data poses a significant challenge on how well diverse categories of data can be meaningfully stored and processed. Data Integration, a crucial part of the data engineering pipeline, addresses this by combining disparate data sources and providing unified data access to end-users. Until now, most data integration systems have leaned on only combining structured data sources. Nevertheless, unstructured data (a.k.a. free text) also contains a plethora of knowledge waiting to be utilized. Thus, in this chapter, we firstly make the case for the integration of textual data, to later present its challenges, state of the art and open problems.
[19] Debiasing Large Language Models toward Social Factors in Online Behavior Analytics through Prompt Knowledge Tuning
Hossein Salemi, Jitin Krishnan, Hemant Purohit
Main category: cs.CL
TL;DR: LLMs trained on human data may implicitly mimic social attribution processes, but ignoring this in reasoning can lead to biased responses in social contexts. The paper introduces a method to mitigate bias by enriching prompts with social-attribution knowledge based on message context and user goals.
Details
Motivation: Large Language Models trained on human-generated corpora may implicitly mimic social attribution processes (dispositional vs. situational causality), but ignoring this in reasoning could lead to biased responses in social contexts. Current reasoning paradigms like Chain-of-Thought don't adequately address social attribution bias.
Method: Introduces a scalable method to mitigate social-attribution bias by enriching instruction prompts with two prompt aids using social-attribution knowledge: (1) using the user’s goal to infer dispositional causality, and (2) using the message context to infer situational causality. Tested on intent detection and theme detection tasks on disaster-domain social media across multiple languages.
Result: Method improves model performance while reducing social-attribution bias in zero-shot classification tasks for behavior analytics. Experiments show biases in three open-source LLMs (Llama3, Mistral, Gemma) toward social attribution, and demonstrate effectiveness of mitigation strategies across disaster types and multiple languages.
Conclusion: Incorporating social-attribution knowledge into LLM prompts effectively reduces bias and improves performance in social context reasoning tasks, particularly for behavior analytics applications in domains like disaster response across multiple languages.
Abstract: Attribution theory explains how individuals interpret and attribute others’ behavior in a social context by employing personal (dispositional) and impersonal (situational) causality. Large Language Models (LLMs), trained on human-generated corpora, may implicitly mimic this social attribution process in social contexts. However, the extent to which LLMs utilize these causal attributions in their reasoning remains underexplored. Although using reasoning paradigms, such as Chain-of-Thought (CoT), has shown promising results in various tasks, ignoring social attribution in reasoning could lead to biased responses by LLMs in social contexts. In this study, we investigate the impact of incorporating a user’s goal as knowledge to infer dispositional causality and message context to infer situational causality on LLM performance. To this end, we introduce a scalable method to mitigate such biases by enriching the instruction prompts for LLMs with two prompt aids using social-attribution knowledge, based on the context and goal of a social media message. This method improves the model performance while reducing the social-attribution bias of the LLM in the reasoning on zero-shot classification tasks for behavior analytics applications. We empirically show the benefits of our method across two tasks-intent detection and theme detection on social media in the disaster domain-when considering the variability of disaster types and multiple languages of social media. Our experiments highlight the biases of three open-source LLMs: Llama3, Mistral, and Gemma, toward social attribution, and show the effectiveness of our mitigation strategies.
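The two prompt aids amount to simple prompt-prefix construction; a minimal sketch, where the template wording is an assumption for illustration rather than the paper's exact prompt:

```python
def enrich_prompt(instruction: str, user_goal: str, message_context: str) -> str:
    """Prepend the dispositional (goal) and situational (context) aids."""
    aids = [
        f"User goal (dispositional cue): {user_goal}",
        f"Message context (situational cue): {message_context}",
    ]
    return "\n".join(aids + [instruction])
```

The design choice is that both causal cues are supplied as explicit knowledge rather than left for the model to infer, which is what reportedly reduces the attribution bias in zero-shot classification.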
[20] HumMusQA: A Human-written Music Understanding QA Benchmark Dataset
Benno Weck, Pablo Puentes, Andrea Poltronieri, Satyajeet Prabhu, Dmitry Bogdanov
Main category: cs.CL
TL;DR: A new expert-curated benchmark dataset of 320 hand-written questions for evaluating music understanding in Large Audio-Language Models, with benchmarking of 6 state-of-the-art models and testing for uni-modal shortcuts.
Details
Motivation: Current evaluation benchmarks for music understanding in LALMs often fail to truly test whether models can perceive and interpret music. Existing data methodologies are insufficient for probing complex audio comprehension, necessitating a more rigorous, expert-curated approach.
Method: Created a meticulously structured dataset of 320 hand-written questions curated and validated by experts with musical training. Used this dataset to benchmark six state-of-the-art LALMs and tested their robustness to uni-modal shortcuts.
Result: The paper presents the new benchmark dataset and demonstrates its use by evaluating six state-of-the-art models, including tests of their robustness to uni-modal shortcuts.
Conclusion: Focused, manual curation by musical experts is superior for probing complex audio comprehension in LALMs. The proposed benchmark provides a more rigorous standard for evaluating music understanding capabilities.
Abstract: The evaluation of music understanding in Large Audio-Language Models (LALMs) requires a rigorously defined benchmark that truly tests whether models can perceive and interpret music, a standard that current data methodologies frequently fail to meet. This paper introduces a meticulously structured approach to music evaluation, proposing a new dataset of 320 hand-written questions curated and validated by experts with musical training, arguing that such focused, manual curation is superior for probing complex audio comprehension. To demonstrate the use of the dataset, we benchmark six state-of-the-art LALMs and additionally test their robustness to uni-modal shortcuts.
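The uni-modal shortcut test the authors run can be illustrated with a toy harness: score the same items with and without the audio input and compare accuracies. The model stub, items, and the reading that a small gap signals a shortcut are assumptions for illustration only.

```python
def accuracy(preds, golds):
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def shortcut_gap(model, items):
    """Return (full_acc, text_only_acc) over (audio, question, answer) items."""
    golds = [ans for _, _, ans in items]
    full = [model(audio, q) for audio, q, _ in items]
    text_only = [model(None, q) for _, q, _ in items]   # audio ablated
    return accuracy(full, golds), accuracy(text_only, golds)

# Toy model: answers correctly only when it can "hear" the audio.
def toy_model(audio, question):
    return audio if audio is not None else "unknown"

items = [("jazz", "What genre is playing?", "jazz"),
         ("4/4", "What is the time signature?", "4/4")]
full_acc, text_acc = shortcut_gap(toy_model, items)
# A large gap (here 1.0 vs 0.0) suggests the questions truly require audio.
```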
[21] Story2Proposal: A Scaffold for Structured Scientific Paper Writing
Zhuoyang Qian, Wei Shi, Xu Lin, Li Ling, Meng Luo, Ziming Wang, Zhiwei Zhang, Tengyue Xu, Gaoge Liu, Zhentao Zhang, Shuo Zhang, Ziqi Wang, Zheng Feng, Yan Luo, Shu Xu, Yongjin Chen, Zhibo Feng, Zhuo Chen, Bruce Yuan, Biao Wu, Harry Wang, Kris Chen
Main category: cs.CL
TL;DR: Story2Proposal: A contract-governed multi-agent framework that converts research stories into structured manuscripts with improved visual alignment and structural consistency through coordinated agents operating under a persistent shared visual contract.
Details
Motivation: Existing language-model generation pipelines for scientific manuscripts often produce structural drift, missing figures/tables, and cross-section inconsistencies due to unconstrained text synthesis with validation applied only after generation.
Method: Introduces a multi-agent framework with architect, writer, refiner, and renderer agents coordinated around a persistent shared visual contract that tracks section structure and registered visual elements, with evaluation agents providing feedback in a generate-evaluate-adapt loop.
Result: Achieved expert evaluation score of 6.145 vs 3.963 for DirectChat (+2.182) across multiple LLM backbones, and 5.705 vs 5.197 for structured generation baseline Fars, showing improved structural consistency and visual alignment.
Conclusion: Story2Proposal demonstrates that contract-governed multi-agent frameworks can effectively maintain alignment between narrative reasoning, experimental evidence, and visual artifacts throughout the document generation lifecycle.
Abstract: Generating scientific manuscripts requires maintaining alignment between narrative reasoning, experimental evidence, and visual artifacts across the document lifecycle. Existing language-model generation pipelines rely on unconstrained text synthesis with validation applied only after generation, often producing structural drift, missing figures or tables, and cross-section inconsistencies. We introduce Story2Proposal, a contract-governed multi-agent framework that converts a research story into a structured manuscript through coordinated agents operating under a persistent shared visual contract. The system organizes architect, writer, refiner, and renderer agents around a contract state that tracks section structure and registered visual elements, while evaluation agents supply feedback in a generate-evaluate-adapt loop that updates the contract during generation. Experiments on tasks derived from the Jericho research corpus show that Story2Proposal achieved an expert evaluation score of 6.145 versus 3.963 for DirectChat (+2.182) across GPT, Claude, Gemini, and Qwen backbones. Compared with the structured generation baseline Fars, Story2Proposal obtained an average score of 5.705 versus 5.197, indicating improved structural consistency and visual alignment.
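A minimal sketch of what a shared "visual contract" could look like, assuming a simple record of sections and registered visuals plus a check that evaluation agents might run after each writing pass. Field names and the check logic are hypothetical, not the paper's implementation.

```python
contract = {
    "sections": ["Introduction", "Method", "Experiments"],
    "visuals": {"fig:overview": "Method", "tab:results": "Experiments"},
}

def check_contract(contract, drafts):
    """Return contract violations: missing sections and unreferenced visuals."""
    issues = []
    for section in contract["sections"]:
        if section not in drafts:
            issues.append(f"missing section: {section}")
    for vid, section in contract["visuals"].items():
        text = drafts.get(section, "")
        if vid not in text:
            issues.append(f"visual {vid} not referenced in {section}")
    return issues

drafts = {
    "Introduction": "We study ...",
    "Method": "Our pipeline (fig:overview) has four agents ...",
    # "Experiments" draft not written yet, so tab:results is unreferenced.
}
issues = check_contract(contract, drafts)
```

In a generate-evaluate-adapt loop, a writer agent would revise drafts until `issues` is empty.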
[22] On the Role of Encoder Depth: Pruning Whisper and LoRA Fine-Tuning in SLAM-ASR
Ganesh Pavan Kartikeya Bharadwaj Kolluri, Michael Kampouridis, Ravi Shekhar
Main category: cs.CL
TL;DR: Pruning Whisper encoder layers in SLAM-ASR systems causes minimal WER degradation (2-4%), and combining pruning with LoRA adaptation outperforms unpruned baselines while reducing parameters by 7-14%.
Details
Motivation: While model pruning has been studied for full Whisper encoder-decoder architectures, its impact within SLAM-ASR systems remains under-investigated. The research aims to understand how layer pruning affects Whisper encoder performance when used as the acoustic backbone in SLAM-ASR and whether LoRA-based fine-tuning can recover performance degradation.
Method: The study analyzes layer pruning effects in the Whisper encoder within SLAM-ASR systems across three Whisper variants (Small, Medium, Large-v2) and three languages representing different resource levels (Danish, Dutch, English). Over 200 training runs examine how LoRA-based fine-tuning compensates for pruning-induced performance degradation.
Result: Pruning two encoder layers causes only 2-4% WER degradation. Combining pruning with LoRA adaptation consistently outperforms unpruned baselines while reducing total parameters by 7-14%. LoRA reduces total word errors by 11-21% for Dutch and English, but only 4-7% for low-resource Danish, where it introduces increased insertion errors.
Conclusion: Pruning Whisper encoder layers in SLAM-ASR is viable with minimal performance loss, and LoRA adaptation can effectively compensate for degradation, especially in languages where the LLM has strong pre-existing proficiency. The effectiveness depends on the language model’s linguistic priors and available training data.
Abstract: Automatic speech recognition (ASR) has advanced rapidly in recent years, driven by large-scale pretrained models and end-to-end architectures such as SLAM-ASR. A key component of SLAM-ASR systems is the Whisper speech encoder, which provides robust acoustic representations. While model pruning has been explored for the full Whisper encoder-decoder architecture, its impact within the SLAM-ASR setting remains under-investigated. In this work, we analyze the effects of layer pruning in the Whisper encoder when used as the acoustic backbone of SLAM-ASR. We further examine the extent to which LoRA-based fine-tuning can recover performance degradation caused by pruning. Experiments conducted across three Whisper variants (Small, Medium, Large-v2), three languages representing distinct resource levels (Danish, Dutch, English), and over 200 training runs demonstrate that pruning two encoder layers causes only 2-4% WER degradation, and that combining this pruning with LoRA adaptation consistently outperforms the unpruned baseline while reducing total parameters by 7-14%. Moreover, our error analysis reveals that LoRA primarily compensates through the language model’s linguistic priors, reducing total word errors by 11-21% for Dutch and English, with substitutions and deletions showing the largest reductions. However, for low-resource Danish, the reduction is smaller (4-7%), and LoRA introduces increased insertion errors, indicating that compensation effectiveness depends on the LLM’s pre-existing language proficiency and available training data.
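The parameter-reduction arithmetic is easy to sketch. Below, an encoder is modeled as a list of per-layer parameter counts (made-up placeholders, not real Whisper sizes), and pruning removes the top k layers:

```python
def prune_top_layers(encoder_layers, k):
    """Remove the last k layers (prune from the top of the encoder)."""
    return encoder_layers[:-k] if k > 0 else list(encoder_layers)

def param_reduction(encoder_layers, other_params, k):
    """Fraction of total parameters removed by pruning k encoder layers."""
    total = sum(encoder_layers) + other_params
    pruned = sum(prune_top_layers(encoder_layers, k)) + other_params
    return 1.0 - pruned / total

# Toy model: 12 encoder layers of 5M params each, 60M params elsewhere.
layers = [5_000_000] * 12
reduction = param_reduction(layers, other_params=60_000_000, k=2)
# Removing 2 of 12 layers here cuts about 8.3% of total parameters,
# in the same ballpark as the paper's reported 7-14% range.
```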
[23] Routing Sensitivity Without Controllability: A Diagnostic Study of Fairness in MoE Language Models
Junhyeok Lee, Kyu Sung Choi
Main category: cs.CL
TL;DR: FARE framework reveals routing-level stereotype interventions in MoE models are limited: either unachievable, non-robust, or come with high utility costs, and don’t transfer to decoded generation due to entangled bias/knowledge in expert groups.
Details
Motivation: While MoE language models show sensitivity to demographic content at routing level, current approaches for exploiting this sensitivity for fairness control have structural limitations. The paper aims to systematically probe these limits across diverse MoE architectures.
Method: Introduces FARE (Fairness-Aware Routing Equilibrium), a diagnostic framework to test routing-level stereotype interventions across various MoE architectures (Mixtral, Qwen1.5, Qwen3, DeepSeekMoE, OLMoE). Uses group-level expert masking to analyze bias/knowledge entanglement.
Result: Routing-level preference shifts are either unachievable (Mixtral, Qwen1.5, Qwen3), statistically non-robust (DeepSeekMoE), or come with substantial utility cost (OLMoE: -4.4%p CrowS-Pairs at -6.3%p TQA). Even when log-likelihood shifts are robust, they don’t transfer to decoded generation across all metrics.
Conclusion: Routing sensitivity is necessary but insufficient for stereotype control in MoE models. Bias and core knowledge are deeply entangled within expert groups, limiting routing-level interventions. Findings identify architectural conditions for designing more controllable future MoE systems.
Abstract: Mixture-of-Experts (MoE) language models are universally sensitive to demographic content at the routing level, yet exploiting this sensitivity for fairness control is structurally limited. We introduce Fairness-Aware Routing Equilibrium (FARE), a diagnostic framework designed to probe the limits of routing-level stereotype intervention across diverse MoE architectures. FARE reveals that routing-level preference shifts are either unachievable (Mixtral, Qwen1.5, Qwen3), statistically non-robust (DeepSeekMoE), or accompanied by substantial utility cost (OLMoE, -4.4%p CrowS-Pairs at -6.3%p TQA). Critically, even where log-likelihood preference shifts are robust, they do not transfer to decoded generation: expanded evaluations on both non-null models yield null results across all generation metrics. Group-level expert masking reveals why: bias and core knowledge are deeply entangled within expert groups. These findings indicate that routing sensitivity is necessary but insufficient for stereotype control, and identify specific architectural conditions that can inform the design of more controllable future MoE systems.
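Group-level expert masking, the diagnostic used to probe bias/knowledge entanglement, can be sketched as recomputing router probabilities with a group of experts excluded, which forces tokens onto the remaining experts. The logits and grouping below are arbitrary examples, not taken from any of the evaluated models.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mask_expert_group(router_logits, masked_group):
    """Zero out a group of experts and renormalize the routing distribution."""
    masked = [float("-inf") if i in masked_group else x
              for i, x in enumerate(router_logits)]
    return softmax(masked)

logits = [2.0, 1.0, 0.5, 0.0]          # router logits for 4 experts
probs = mask_expert_group(logits, masked_group={0, 1})
# Experts 0 and 1 receive zero mass; experts 2 and 3 absorb it.
```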
[24] Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification
Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan, Hasan Mahmud
Main category: cs.CL
TL;DR: PROClaim: A courtroom-style multi-agent framework for claim verification using structured adversarial deliberation with progressive RAG and multi-judge aggregation.
Details
Motivation: LLMs are unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. Existing approaches like RAG and multi-agent debate have limitations: one-pass retrieval and unstructured debate dynamics.
Method: Courtroom-style multi-agent framework with specialized roles (Plaintiff, Defense, Judge), Progressive RAG (P-RAG) for dynamic evidence expansion, evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation.
Result: Achieves 81.7% accuracy on Check-COVID benchmark, outperforming standard multi-agent debate by 10.0 percentage points. P-RAG drives primary performance gains (+7.5 pp). Effectively mitigates systematic biases.
Conclusion: Structural deliberation and model heterogeneity provide robust foundation for reliable claim verification, addressing LLM limitations in high-stakes scenarios.
Abstract: Large language models (LLMs) remain unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. While retrieval-augmented generation (RAG) and multi-agent debate (MAD) address this, they are limited by one-pass retrieval and unstructured debate dynamics. We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to enforce calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.
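A highly simplified sketch of the courtroom loop: each debate round, the Plaintiff and Defense agents expand a shared evidence pool (the Progressive RAG idea), and a heterogeneous judge panel issues a majority verdict at the end. All agent behaviors below are toy stand-ins, not the paper's prompts or retrievers.

```python
from collections import Counter

def debate(claim, plaintiff, defense, judges, rounds=2):
    evidence = set()
    for _ in range(rounds):
        evidence |= plaintiff(claim, evidence)   # evidence supporting the claim
        evidence |= defense(claim, evidence)     # evidence challenging the claim
    verdicts = [judge(claim, evidence) for judge in judges]
    label, _ = Counter(verdicts).most_common(1)[0]
    return label, evidence

# Stub agents: retrieval is mocked with fixed document sets.
plaintiff = lambda c, e: {"doc_support_1", "doc_support_2"}
defense = lambda c, e: {"doc_refute_1"}
# Heterogeneous judges that weigh the pooled evidence differently.
judges = [
    lambda c, e: "SUPPORTED" if len([d for d in e if "support" in d]) > 1 else "REFUTED",
    lambda c, e: "SUPPORTED" if "doc_support_1" in e else "REFUTED",
    lambda c, e: "REFUTED",
]
label, pool = debate("toy claim", plaintiff, defense, judges)
```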
[25] Learning to Predict Future-Aligned Research Proposals with Language Models
Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu, Jiawei Han, Heng Ji
Main category: cs.CL
TL;DR: Paper proposes evaluating LLM-generated research proposals via time-sliced scientific forecasting, using a Future Alignment Score to measure how well proposals anticipate future research directions.
Details
Motivation: Evaluating quality of LLM-generated research proposals is difficult because novelty and soundness are hard to measure automatically, and human evaluation is costly. Need a verifiable alternative.
Method: Reframe proposal generation as time-sliced scientific forecasting: given research question and papers before cutoff time, model generates structured proposal evaluated by whether it anticipates research directions in papers published after cutoff. Use Future Alignment Score computed via retrieval and LLM-based semantic scoring against held-out future corpus. Build time-consistent dataset of 17,771 papers with pre-cutoff citations, synthesize reasoning traces teaching gap identification and inspiration borrowing.
Result: Future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS). Domain-expert human evaluation corroborates improved proposal quality. Practical impact: implemented two model-generated proposals with code agent, obtaining 4.17% accuracy gain on MATH from new prompting strategy and consistent improvements for novel model-merging method.
Conclusion: Time-sliced scientific forecasting provides verifiable evaluation framework for LLM-generated research proposals, with demonstrated improvements in proposal quality and practical impact.
Abstract: Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published after the time. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 17,771 papers from targets and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method.
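The Future Alignment Score can be sketched with a cheap proxy: retrieve the proposal's top matches from a held-out future corpus and average their similarity. The paper uses retrieval plus LLM-based semantic scoring; the token-level Jaccard similarity below is only an illustrative substitute.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, a crude stand-in for semantic scoring."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def future_alignment_score(proposal: str, future_corpus, k=2) -> float:
    """Average similarity of the proposal to its top-k future matches."""
    sims = sorted((jaccard(proposal, doc) for doc in future_corpus), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)

# Toy held-out "future" corpus of post-cutoff paper summaries.
future_corpus = [
    "curriculum learning for reasoning models",
    "model merging improves math accuracy",
    "speech translation with optimal transport",
]
fas = future_alignment_score("model merging for math reasoning", future_corpus)
```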
[26] POTSA: A Cross-Lingual Speech Alignment Framework for Speech-to-Text Translation
Xuanchen Li, Chenrui Cui, Tianrui Wang, Meng Ge, Zikang Huang, Jin Li, Yizhou Peng, Yuheng Lu, Nyima Tashi, Longbiao Wang, Jianwu Dang
Main category: cs.CL
TL;DR: POTSA uses Optimal Transport with parallel speech pairs to improve multilingual speech translation by aligning representations across languages, achieving state-of-the-art results with minimal parallel data.
Details
Motivation: Existing speech LLMs for multilingual translation often overlook semantic commonalities across source languages, leading to biased translation performance, especially for low-resource languages.
Method: Proposes POTSA framework with: 1) Bias Compensation module for coarse speech representation alignment, 2) token-level Optimal Transport constraints on a Q-Former using parallel speech pairs for fine-grained consistency, 3) layer scheduling strategy to focus OT constraints on semantically beneficial layers.
Result: Achieves SOTA on FLEURS benchmark: +1.29 BLEU over five common languages and +2.93 BLEU on zero-shot languages, using only 10 hours of parallel speech per language.
Conclusion: POTSA effectively bridges high- and low-resource translation gaps by leveraging semantic commonalities through Optimal Transport alignment with minimal parallel data.
Abstract: Speech Large Language Models have achieved breakthroughs in multilingual speech-to-text translation. However, existing approaches often overlook semantic commonalities across source languages, leading to biased translation performance. In this work, we propose POTSA (Parallel Optimal Transport for Speech Alignment), a new framework based on cross-lingual parallel speech pairs and Optimal Transport, designed to bridge high- and low-resource translation gaps. First, we introduce a Bias Compensation module to coarsely align initial speech representations. Second, we impose token-level OT constraints on a Q-Former using parallel pairs to establish fine-grained representation consistency. Then, we apply a layer scheduling strategy to focus OT constraints on semantically beneficial layers. Experiments on FLEURS show our method achieves SOTA performance, with +1.29 BLEU over five common languages and +2.93 BLEU on zero-shot languages, using only 10 hours of parallel speech per language.
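The token-level OT constraint rests on entropy-regularized optimal transport between the token sequences of a parallel speech pair: Sinkhorn iterations produce a soft alignment whose transport cost can serve as an alignment loss. A tiny dense sketch follows; the cost matrix and hyperparameters are illustrative, not the authors' implementation.

```python
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropy-regularized OT between two uniform token distributions."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    loss = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, loss

# Toy cost: source token i prefers target token i (low cost on diagonal).
cost = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]]
plan, loss = sinkhorn(cost)
row_sums = [sum(row) for row in plan]   # should approach uniform marginals
```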
[27] Rethinking Easy-to-Hard: Limits of Curriculum Learning in Post-Training for Deductive Reasoning
Maximilian Mordig, Andreas Opedal, Weiyang Liu, Bernhard Schölkopf
Main category: cs.CL
TL;DR: Curriculum learning for LLM post-training shows no advantage over random sampling for compositional reasoning tasks, challenging the practical utility of difficulty-based sequencing.
Details
Motivation: The paper investigates whether curriculum learning (CL), which organizes training examples from easy to hard, actually benefits compositional reasoning in LLMs, despite the intuitive appeal of this approach for tasks where complex problems are built from elementary inference rules.
Method: Systematic empirical study using synthetic arithmetic and logical benchmarks where difficulty is characterized by reasoning complexity rather than surface-level proxies. Tests across multiple model families and curriculum schedules, comparing difficulty-based sequencing with standard random sampling in both supervised fine-tuning (SFT) and reinforcement learning (RL) methods.
Result: Surprisingly, across all experiments, no robust advantage was found for difficulty-based sequencing over random sampling in either accuracy or response length. These findings persist across both SFT and RL methods, suggesting the specific ordering of training examples plays a negligible role in achieving compositional generalization.
Conclusion: Curriculum-based post-training may not provide practical utility for deductive reasoning tasks in LLMs, challenging the common assumption that learning in increasing order of difficulty eases generalization for compositional reasoning.
Abstract: Curriculum learning (CL), motivated by the intuition that learning in increasing order of difficulty should ease generalization, is commonly adopted both in pre-training and post-training of large language models (LLMs). The intuition of CL is particularly compelling for compositional reasoning, where complex problems are built from elementary inference rules; however, the actual impact of CL on such tasks remains largely underexplored. We present a systematic empirical study of CL for post-training of LLMs, using synthetic arithmetic and logical benchmarks where difficulty is characterized by reasoning complexity rather than surface-level proxies. Surprisingly, across multiple model families and curriculum schedules, we find no robust advantage in difficulty-based sequencing over standard random sampling in either accuracy or response length. These findings persist across both supervised fine-tuning (SFT) and reinforcement learning (RL) methods. Our study suggests that, in the context of deductive reasoning, the specific ordering of training examples plays a negligible role in achieving compositional generalization, challenging the practical utility of curriculum-based post-training.
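The two training schedules under comparison reduce to example ordering, which a few lines make concrete: difficulty-sorted (curriculum, with difficulty as reasoning depth rather than a surface proxy like length) versus uniformly shuffled (random baseline). The example records are made up.

```python
import random

def curriculum_order(examples):
    """Easy-to-hard: sort by annotated reasoning depth."""
    return sorted(examples, key=lambda ex: ex["steps"])

def random_order(examples, seed=0):
    """Standard baseline: uniform shuffle."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled

examples = [
    {"id": "c", "steps": 5},   # hard: 5 inference steps
    {"id": "a", "steps": 1},   # easy
    {"id": "b", "steps": 3},   # medium
]
cl = [ex["id"] for ex in curriculum_order(examples)]
# The paper's finding: training on `cl` versus a shuffled order makes no
# robust difference in accuracy for these deductive reasoning tasks.
```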
[28] Structural Stress and Learned Helplessness in Afghanistan: A Multi-Layer Analysis of the AFSTRESS Dari Corpus
Jawid Ahmad Baktash, Mursal Dawodi, Nadira Ahmadi
Main category: cs.CL
TL;DR: AFSTRESS is the first multi-label corpus of self-reported stress narratives in Dari (Eastern Persian) collected from Afghan individuals during a humanitarian crisis, enabling computational, social, and psychological analysis of stress patterns.
Details
Motivation: There is a lack of computational resources for analyzing stress and well-being in Dari, particularly for crisis-affected populations. The authors aim to create the first multi-label corpus to enable analysis of stress narratives from Afghan individuals during an ongoing humanitarian crisis.
Method: Collected 737 self-reported stress narratives from Afghan individuals, with participants describing experienced stress and selecting emotion/stressor labels via Dari checklists. The dataset includes 12 binary labels (5 emotions, 7 stressors). Baseline experiments used character TF-IDF with Linear SVM, compared with ParsBERT and XLM-RoBERTa models.
Result: Character TF-IDF with Linear SVM achieved Micro-F1 = 0.663 and Macro-F1 = 0.651, outperforming ParsBERT and XLM-RoBERTa. Threshold tuning improved Micro-F1 by 10.3 points. Structural stressors dominated, with uncertain future (62.6%) and education closure (60.0%) exceeding emotional states. Strongest co-occurrence was between hopelessness and uncertain future (J = 0.388).
Conclusion: AFSTRESS provides the first Dari resource for computational analysis of stress and well-being in a crisis-affected population, revealing that stress is primarily structurally driven rather than emotional, with structural stressors dominating the narratives.
Abstract: We introduce AFSTRESS, the first multi-label corpus of self-reported stress narratives in Dari (Eastern Persian), comprising 737 responses collected from Afghan individuals during an ongoing humanitarian crisis. Participants describe experienced stress and select emotion and stressor labels via Dari checklists. The dataset enables analysis at three levels: computational (multi-label classification), social (structural drivers and gender disparities), and psychological (learned helplessness, chronic stress, and emotional cascade patterns). It includes 12 binary labels (5 emotions, 7 stressors), with high label cardinality (5.54) and density (0.462), reflecting complex, multi-dimensional stress. Structural stressors dominate: uncertain future (62.6 percent) and education closure (60.0 percent) exceed emotional states, indicating stress is primarily structurally driven. The strongest co-occurrence is between hopelessness and uncertain future (J = 0.388). Baseline experiments show that character TF-IDF with Linear SVM achieves Micro-F1 = 0.663 and Macro-F1 = 0.651, outperforming ParsBERT and XLM-RoBERTa, while threshold tuning improves Micro-F1 by 10.3 points. AFSTRESS provides the first Dari resource for computational analysis of stress and well-being in a crisis-affected population.
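The threshold-tuning step behind the reported +10.3-point Micro-F1 gain can be sketched as a grid search over a global decision cutoff on validation scores. Scores and labels below are toy values, not AFSTRESS data.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over a multi-label prediction matrix."""
    tp = sum(t == p == 1 for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p))
    fp = sum(t == 0 and p == 1 for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p))
    fn = sum(t == 1 and p == 0 for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_threshold(y_true, scores, grid):
    """Pick the global threshold with best Micro-F1 on validation scores."""
    return max(grid, key=lambda th: micro_f1(
        y_true, [[int(s >= th) for s in row] for row in scores]))

y_true = [[1, 0, 1], [0, 1, 1]]
scores = [[0.9, 0.4, 0.35], [0.2, 0.8, 0.45]]
th = tune_threshold(y_true, scores, grid=[0.3, 0.5, 0.7])
# Here a lower cutoff recovers the weakly-scored positive labels.
```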
[29] SCOPE: Tree-based Self-Correcting Online Log Parsing via Syntactic-Semantic Collaboration
Dongyi Fan, Suqiong Zhang, Lili He, Ming Liu, Yifan Huo
Main category: cs.CL
TL;DR: SCOPE is a self-correcting online log parsing method that combines heuristic and LLM-based approaches using a bi-directional tree structure and two-stage syntactic-semantic collaboration to achieve high accuracy with reduced LLM usage.
Details
Motivation: Traditional heuristic-based log parsers are efficient but lack semantic understanding, while LLM-based parsers are accurate but incur high latency from frequent model calls. There's a need for a method that balances efficiency and accuracy.
Method: SCOPE uses a bi-directional tree structure for efficient template matching from both forward and reverse directions, and a two-stage framework: lightweight NLP with POS information for syntax-based matching, with LLM invoked selectively as fallback for semantically complex cases.
Result: Extensive evaluations on diverse benchmark datasets show SCOPE outperforms state-of-the-art methods in both accuracy and efficiency, significantly reducing LLM API usage while maintaining high accuracy.
Conclusion: SCOPE successfully integrates heuristic and LLM-based approaches to achieve a balance between efficiency and effectiveness in log parsing, with public implementation available for further research.
Abstract: Log parsing is a critical step for automated log analysis in complex systems. Traditional heuristic-based methods offer high efficiency but are limited in accuracy due to overlooking semantic context. In contrast, recent LLM-based parsers improve accuracy via semantic understanding but incur high latency from frequent model calls. To address this, we propose SCOPE, the first self-correcting online log parsing method that integrates the strengths of both heuristic and LLM-based paradigms. SCOPE introduces a novel bi-directional tree structure that enables efficient template matching from both forward and reverse directions, resulting in a higher overall matching rate. Additionally, it adopts a two-stage syntactic-semantic collaboration framework: a lightweight NLP model first utilizes part-of-speech (POS) information for syntax-based matching, while the LLM is selectively invoked as a fallback to handle semantically complex cases when uncertainty remains. This design significantly reduces LLM API usage while maintaining high accuracy, achieving a balance between efficiency and effectiveness. Extensive evaluations on diverse benchmark datasets show that SCOPE outperforms state-of-the-art methods in both accuracy and efficiency. The implementation and datasets are publicly released to facilitate further research.
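The bi-directional matching idea can be illustrated with a token-level sketch: compare a log line to a template from the front and from the back, treating "<*>" as a wildcard, so variable-length middles can still match. The tokenization and wildcard convention are assumptions; SCOPE's actual tree structure is not reproduced here.

```python
def match_direction(template, tokens):
    """Number of leading positions where template and tokens agree."""
    n = 0
    for t, w in zip(template, tokens):
        if t != "<*>" and t != w:
            break
        n += 1
    return n

def bidirectional_match(template, tokens):
    """Accept if front and back agreement together cover every position."""
    fwd = match_direction(template, tokens)
    bwd = match_direction(template[::-1], tokens[::-1])
    return fwd + bwd >= max(len(template), len(tokens))

template = ["Connected", "to", "<*>", "port", "<*>"]
ok = bidirectional_match(template, "Connected to 10.0.0.1 port 8080".split())
bad = bidirectional_match(template, "Disconnected from 10.0.0.1".split())
```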
[30] Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR
Rishikesh Kumar Sharma, Safal Narshing Shrestha, Jenny Poudel, Rupak Tiwari, Arju Shrestha, Rupak Raj Ghimire, Bal Krishna Bal
Main category: cs.CL
TL;DR: First manually transcribed Devanagari speech corpus for endangered Nepal Bhasha language, showing proximal cross-lingual transfer from Nepali outperforms large multilingual models in ultra-low-resource ASR setting.
Details
Motivation: Nepal Bhasha (Newari) is an endangered language with severe scarcity of annotated speech resources, making it digitally marginalized. There's a need to create resources and explore efficient methods for ultra-low-resource ASR.
Method: Created Nwāchā Munā corpus (5.39 hours manually transcribed Devanagari speech). Investigated proximal cross-lingual transfer from Nepali vs. large multilingual pretraining. Fine-tuned Nepali Conformer model with data augmentation and compared with Whisper-Small.
Result: Fine-tuned Nepali Conformer reduced CER from 52.54% zero-shot baseline to 17.59% with augmentation, matching Whisper-Small performance despite using significantly fewer parameters. Proximal transfer proved computationally efficient.
Conclusion: Proximal cross-lingual transfer from geographically/linguistically adjacent languages can rival large multilingual models for ultra-low-resource ASR, offering computationally efficient alternatives. The dataset and benchmarks are released to digitally enable the Newari community.
Abstract: Nepal Bhasha (Newari), an endangered language of the Kathmandu Valley, remains digitally marginalized due to the severe scarcity of annotated speech resources. In this work, we introduce Nwāchā Munā, a newly curated 5.39-hour manually transcribed Devanagari speech corpus for Nepal Bhasha, and establish the first benchmark using script-preserving acoustic modeling. We investigate whether proximal cross-lingual transfer from a geographically and linguistically adjacent language (Nepali) can rival large-scale multilingual pretraining in an ultra-low-resource Automatic Speech Recognition (ASR) setting. Fine-tuning a Nepali Conformer model reduces the Character Error Rate (CER) from a 52.54% zero-shot baseline to 17.59% with data augmentation, effectively matching the performance of the multilingual Whisper-Small model despite utilizing significantly fewer parameters. Our findings demonstrate that proximal transfer from Nepali language serves as a computationally efficient alternative to massive multilingual models. We openly release the dataset and benchmarks to digitally enable the Newari community and foster further research in Nepal Bhasha.
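Since the headline numbers are Character Error Rates, a small reference implementation may be useful: CER is character-level edit distance (substitutions, insertions, deletions) divided by reference length. The Devanagari example strings are illustrative, not from the corpus.

```python
def edit_distance(ref, hyp):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate over Unicode code points."""
    return edit_distance(ref, hyp) / len(ref)

# One wrong vowel sign out of six code points: CER of about 16.7%.
rate = cer("नमस्ते", "नमस्ता")
```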
[31] Mitigating Hallucination on Hallucination in RAG via Ensemble Voting
Zequn Xie, Zhengyang Sun
Main category: cs.CL
TL;DR: VOTE-RAG is a training-free framework that uses parallel voting mechanisms to reduce “hallucination on hallucination” in retrieval-augmented generation systems.
Details
Motivation: RAG systems can suffer from "hallucination on hallucination" where flawed retrieval results mislead the generation model, creating compounded hallucinations. Current approaches are complex and may have problem drift risks.
Method: Two-stage voting framework: (1) Retrieval Voting: multiple agents generate diverse queries in parallel and aggregate retrieved documents; (2) Response Voting: multiple agents independently generate answers based on aggregated documents, with final output determined by majority vote.
Result: VOTE-RAG achieves performance comparable to or surpassing more complex frameworks on six benchmark datasets, with simpler architecture, full parallelizability, and avoidance of problem drift risk.
Conclusion: Simple, reliable ensemble voting is a superior and more efficient method for mitigating RAG hallucinations compared to more complex approaches.
Abstract: Retrieval-Augmented Generation (RAG) aims to reduce hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, RAG introduces a critical challenge: "hallucination on hallucination," where flawed retrieval results mislead the generation model, leading to compounded hallucinations. To address this issue, we propose VOTE-RAG, a novel, training-free framework with a two-stage structure and efficient, parallelizable voting mechanisms. VOTE-RAG includes: (1) Retrieval Voting, where multiple agents generate diverse queries in parallel and aggregate all retrieved documents; (2) Response Voting, where multiple agents independently generate answers based on the aggregated documents, with the final output determined by majority vote. We conduct comparative experiments on six benchmark datasets. Our results show that VOTE-RAG achieves performance comparable to or surpassing more complex frameworks. Additionally, VOTE-RAG features a simpler architecture, is fully parallelizable, and avoids the "problem drift" risk. Our work demonstrates that simple, reliable ensemble voting is a superior and more efficient method for mitigating RAG hallucinations.
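The two voting stages can be sketched with stubs: parallel query agents pool their retrieved documents (Retrieval Voting), then independent answer agents read the pool and the majority answer wins (Response Voting). The retriever, corpus, and agents below are toy stand-ins for real RAG components.

```python
from collections import Counter

def vote_rag(question, query_agents, retriever, answer_agents):
    # Stage 1: aggregate documents across diverse parallel queries.
    pool = set()
    for rewrite in query_agents:
        pool |= set(retriever(rewrite(question)))
    # Stage 2: independent answers over the shared pool, majority vote.
    answers = [generate(question, pool) for generate in answer_agents]
    answer, _ = Counter(answers).most_common(1)[0]
    return answer, pool

corpus = {"q1": ["doc_a", "doc_b"], "q2": ["doc_b", "doc_c"]}
retriever = lambda q: corpus.get(q, [])
query_agents = [lambda q: "q1", lambda q: "q2"]   # diverse query rewrites
answer_agents = [
    lambda q, docs: "yes" if "doc_c" in docs else "no",
    lambda q, docs: "yes",
    lambda q, docs: "no",
]
answer, pool = vote_rag("toy question", query_agents, retriever, answer_agents)
```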
[32] PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai
Main category: cs.CL
TL;DR: PubMed Reasoner: A biomedical QA agent with three-stage reasoning (query refinement, reflective retrieval, evidence-grounded response) that achieves state-of-the-art accuracy on PubMedQA and improves clinical knowledge benchmarks.
Details
Motivation: Current biomedical QA systems lack mechanisms to iteratively refine poor queries and only apply self-reflection after full retrieval completion, limiting their ability to provide accurate, evidence-based answers with verifiable citations.
Method: Three-stage approach: 1) Self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial metadata retrieval; 2) Reflective retrieval processes articles in batches until sufficient evidence is gathered; 3) Evidence-grounded response generation produces answers with explicit citations.
Result: Achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, with consistent gains on MMLU Clinical Knowledge. LLM-as-judge evaluations show preference across reasoning soundness, evidence grounding, clinical relevance, and trustworthiness.
Conclusion: PubMed Reasoner provides practical assistance to clinicians and biomedical researchers by orchestrating retrieval-first reasoning over authoritative sources while controlling compute and token costs, offering a trustworthy biomedical QA system.
Abstract: Trustworthy biomedical question answering (QA) systems must not only provide accurate answers but also justify them with current, verifiable evidence. Retrieval-augmented approaches partially address this gap but lack mechanisms to iteratively refine poor queries, whereas self-reflection methods kick in only after full retrieval is completed. In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata) retrieval; reflective retrieval processes articles in batches until sufficient evidence is gathered; and evidence-grounded response generation produces answers with explicit citations. PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and showing consistent gains on MMLU Clinical Knowledge. Moreover, LLM-as-judge evaluations prefer our responses across: reasoning soundness, evidence grounding, clinical relevance, and trustworthiness. By orchestrating retrieval-first reasoning over authoritative sources, our approach provides practical assistance to clinicians and biomedical researchers while controlling compute and token costs.
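The three-stage loop can be sketched as a retrieval-first control flow. `refine`, `search`, `sufficient`, and `answer` are hypothetical stand-ins for the agent's LLM and PubMed calls, labeled here as assumptions rather than the paper's actual interfaces:

```python
def pubmed_reasoner(question, refine, search, sufficient, answer,
                    batch_size=5, max_batches=10):
    """Three-stage retrieval-first QA loop (illustrative stand-in functions)."""
    query = refine(question)                      # 1) self-critic query refinement
    evidence = []
    for _ in range(max_batches):                  # 2) reflective batched retrieval
        evidence.extend(search(query, len(evidence), batch_size))
        if sufficient(question, evidence):        # stop once evidence suffices
            break
    return answer(question, evidence)             # 3) evidence-grounded generation
```

Batching with an early-exit check is what lets the agent control compute and token costs: retrieval stops as soon as the evidence is judged sufficient rather than after a fixed-size fetch.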
[33] Culturally Adaptive Explainable LLM Assessment for Multilingual Information Disorder: A Human-in-the-Loop Approach
Maziar Kianimoghadam Jouneghani
Main category: cs.CL
TL;DR: A Hybrid Intelligence Loop framework combining human-written rationales from native speakers with adaptive in-context learning to improve multilingual information disorder detection across cultural contexts.
Details
Motivation: Current LLMs are monocultural and English-centric, producing rationales that overlook localized framing, making them ineffective for recognizing information disorder across different cultural and linguistic contexts.
Method: Human-in-the-loop framework that pairs English task instructions with dynamically retrieved target-language exemplars from filtered annotations, using an Exemplar Bank seeded from multilingual Information Disorder corpus annotations for adaptive prompting.
Result: Initial pilot compares static and adaptive prompting on Farsi and Italian news, evaluating span/severity prediction, rationale quality/cultural appropriateness, and model alignment across evaluator groups.
Conclusion: Provides a testbed for culturally grounded explainable AI by grounding model assessment in human-written rationales from native-speaking annotators through adaptive in-context learning.
Abstract: Recognizing information disorder is difficult because judgments about manipulation depend on cultural and linguistic context. Yet current Large Language Models (LLMs) often behave as monocultural, English-centric “black boxes,” producing fluent rationales that overlook localized framing. Preliminary evidence from the multilingual Information Disorder (InDor) corpus suggests that existing models struggle to explain manipulated news consistently across communities. To address this gap, this ongoing study proposes a Hybrid Intelligence Loop, a human-in-the-loop (HITL) framework that grounds model assessment in human-written rationales from native-speaking annotators. The approach moves beyond static target-language few-shot prompting by pairing English task instructions with dynamically retrieved target-language exemplars drawn from filtered InDor annotations through In-Context Learning (ICL). In the initial pilot, the Exemplar Bank is seeded from these filtered annotations and used to compare static and adaptive prompting on Farsi and Italian news. The study evaluates span and severity prediction, the quality and cultural appropriateness of generated rationales, and model alignment across evaluator groups, providing a testbed for culturally grounded explainable AI.
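Dynamic exemplar retrieval of this kind can be sketched as a similarity ranking over an Exemplar Bank. The `embed` function and the record fields below are illustrative assumptions, not components described by the study:

```python
def build_prompt(instruction_en, query, exemplar_bank, embed, k=3):
    """Pair English instructions with the k target-language exemplars
    most similar to the query (cosine over `embed` vectors)."""
    qv = embed(query)

    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den

    # Rank bank entries by similarity to the query and keep the top k.
    ranked = sorted(exemplar_bank,
                    key=lambda ex: cos(embed(ex["text"]), qv), reverse=True)
    shots = "\n\n".join(f"Text: {ex['text']}\nRationale: {ex['rationale']}"
                        for ex in ranked[:k])
    return f"{instruction_en}\n\n{shots}\n\nText: {query}\nRationale:"
```

The contrast with static prompting is simply that `ranked[:k]` changes per query instead of being fixed in advance.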
[34] Not Worth Mentioning? A Pilot Study on Salient Proposition Annotation
Amir Zeldes, Katherine Conhaim, Lauren Levine
Main category: cs.CL
TL;DR: This paper adapts graded summarization-based salience metrics from Salient Entity Extraction to quantify proposition salience in text, applying it to multi-genre data and studying its relationship with discourse unit centrality in RST parsing.
Details
Motivation: Despite extensive work on extractive summarization, there's limited research on operationalizing graded proposition salience in natural text. The paper aims to bridge this gap by adapting existing salience metrics to propositions.
Method: The authors adopt graded summarization-based salience from Salient Entity Extraction, adapt it for proposition salience, define annotation tasks, apply to multi-genre dataset, evaluate agreement, and study relationship with discourse unit centrality in RST parsing.
Result: The paper presents a method for quantifying proposition salience, applies it to multi-genre data with evaluated agreement, and provides preliminary analysis of how proposition salience relates to discourse unit centrality in RST parsing.
Conclusion: The work operationalizes graded proposition salience using adapted metrics from entity extraction, providing a foundation for studying proposition importance in text and its relationship with discourse structure.
Abstract: Despite a long tradition of work on extractive summarization, which by nature aims to recover the most important propositions in a text, little work has been done on operationalizing graded proposition salience in naturally occurring data. In this paper, we adopt graded summarization-based salience as a metric from previous work on Salient Entity Extraction (SEE) and adapt it to quantify proposition salience. We define the annotation task, apply it to a small multi-genre dataset, evaluate agreement and carry out a preliminary study of the relationship between our metric and notions of discourse unit centrality in discourse parsing following Rhetorical Structure Theory (RST).
[35] Improving Attributed Long-form Question Answering with Intent Awareness
Xinran Zhao, Aakanksha Naik, Jay DeYoung, Joseph Chee Chang, Jena D. Hwang, Tongshuang Wu, Varsha Kishore
Main category: cs.CL
TL;DR: LLMs trained on academic papers lack understanding of author reasoning and intent, which limits their ability to generate high-quality long-form reports. The paper proposes using structured tag-based schemes to extract implicit author intents, improving both zero-shot generation and synthetic data creation for fine-tuning smaller models.
Details
Motivation: While LLMs are trained on diverse academic papers and reports, they lack exposure to the reasoning processes and intents that guide authors in crafting these documents. This limitation hinders their ability to generate high-quality, knowledge-intensive long-form reports.
Method: Develop and employ structured, tag-based schemes to extract underlying implicit author intents (such as writing or citation intents). Use these extracted intents to enhance zero-shot generation capabilities in LLMs and create high-quality synthetic data for fine-tuning smaller models.
Result: Experiments show improved performance across various challenging scientific report generation tasks, with average improvements of +2.9 and +12.3 absolute points for large and small models over baselines, respectively. Intent awareness also enhances model citation usage and substantially improves report readability.
Conclusion: Enhancing LLMs’ intent awareness through structured tag-based schemes significantly improves the quality of generated long-form reports, benefiting both zero-shot generation and fine-tuning of smaller models while improving citation usage and readability.
Abstract: Large language models (LLMs) are increasingly being used to generate comprehensive, knowledge-intensive reports. However, while these models are trained on diverse academic papers and reports, they are not exposed to the reasoning processes and intents that guide authors in crafting these documents. We hypothesize that enhancing a model’s intent awareness can significantly improve the quality of generated long-form reports. We develop and employ structured, tag-based schemes to better elicit underlying implicit intents to write or cite. We demonstrate that these extracted intents enhance both zero-shot generation capabilities in LLMs and enable the creation of high-quality synthetic data for fine-tuning smaller models. Our experiments reveal improved performance across various challenging scientific report generation tasks, with an average improvement of +2.9 and +12.3 absolute points for large and small models over baselines, respectively. Furthermore, our analysis illuminates how intent awareness enhances model citation usage and substantially improves report readability.
[36] Multi-Agent Dialectical Refinement for Enhanced Argument Classification
Jakub Bąba, Jarosław A. Chudziak
Main category: cs.CL
TL;DR: MAD-ACC is a multi-agent debate framework for argument component classification that uses dialectical refinement to resolve structural ambiguity, outperforming single-agent LLMs without domain-specific training.
Details
Motivation: Traditional supervised AM approaches require expensive domain-specific fine-tuning, while single-agent LLMs struggle with structural ambiguity (e.g., distinguishing Claims vs Premises) and suffer from sycophancy where they reinforce initial errors rather than critically evaluating them.
Method: MAD-ACC uses a Proponent-Opponent-Judge model where agents defend conflicting interpretations of ambiguous text, exposing logical nuances through dialectical refinement. This multi-agent debate approach generates transparent reasoning transcripts.
Result: On the UKP Student Essays corpus, MAD-ACC achieves Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines without requiring domain-specific training, while providing explainable decisions through debate transcripts.
Conclusion: Multi-agent debate frameworks like MAD-ACC offer an effective, training-free alternative to traditional supervised AM approaches, resolving structural ambiguity through dialectical refinement while providing transparency and explainability missing from “black-box” classifiers.
Abstract: Argument Mining (AM) is a foundational technology for automated writing evaluation, yet traditional supervised approaches rely heavily on expensive, domain-specific fine-tuning. While Large Language Models (LLMs) offer a training-free alternative, they often struggle with structural ambiguity, failing to distinguish between similar components like Claims and Premises. Furthermore, single-agent self-correction mechanisms often suffer from sycophancy, where the model reinforces its own initial errors rather than critically evaluating them. We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty. MAD-ACC utilizes a Proponent-Opponent-Judge model where agents defend conflicting interpretations of ambiguous text, exposing logical nuances that single-agent models miss. Evaluation on the UKP Student Essays corpus demonstrates that MAD-ACC achieves a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines, without requiring domain-specific training. Additionally, unlike “black-box” classifiers, MAD-ACC’s dialectical approach offers a transparent and explainable alternative by generating human-readable debate transcripts that explain the reasoning behind decisions.
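The Proponent-Opponent-Judge loop can be sketched as a transcript-building control flow. `propose`, `oppose`, and `judge` are hypothetical stand-ins for the LLM agent calls, not names from the paper:

```python
def debate_classify(text, propose, oppose, judge, rounds=2):
    """Proponent-Opponent-Judge loop for argument component classification.

    propose/oppose return (label, argument) given the text and transcript
    so far; judge returns the final label from the full transcript.
    """
    transcript = []
    for _ in range(rounds):
        # Each agent sees the running transcript, enabling rebuttal.
        transcript.append(("proponent",) + propose(text, transcript))
        transcript.append(("opponent",) + oppose(text, transcript))
    return judge(text, transcript), transcript
```

Returning the transcript alongside the label is what gives the framework its explainability: the human-readable debate record documents the reasoning behind the decision.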
[37] A tree interpretation of arc standard dependency derivation
Zihao Huang, Ai Ka Lee, Jungyeul Park
Main category: cs.CL
TL;DR: Arc-standard derivations for projective dependency trees create unique ordered tree representations with contiguous yields and stable lexical anchoring, characterizing projectivity through direct derivational interpretation rather than conversion.
Details
Motivation: The paper aims to establish a formal connection between dependency parsing and phrase structure representations, showing that arc-standard transition sequences can be directly interpreted as ordered tree construction rather than requiring conversion from completed dependency graphs.
Method: The approach demonstrates that each arc-standard transition (shift, leftarc, rightarc) corresponds to a deterministic tree update operation. For projective trees, this creates a unique ordered representation with surface-contiguous yields. For non-projective inputs, pseudo-projective lifting is used before derivation and inverse decoding after recovery.
Result: A proof-of-concept implementation in a standard neural transition-based parser shows that the mapped derivations are executable and support stable dependency recovery. The representation characterizes projectivity: a dependency tree admits such a contiguous ordered representation if and only if it is projective.
Conclusion: Arc-standard derivations provide a direct, derivational approach to constructing ordered tree representations from dependency parsing, establishing formal connections between dependency and constituency formalisms while maintaining recoverability of original dependency structures.
Abstract: We show that arc-standard derivations for projective dependency trees determine a unique ordered tree representation with surface-contiguous yields and stable lexical anchoring. Each \textsc{shift}, \textsc{leftarc}, and \textsc{rightarc} transition corresponds to a deterministic tree update, and the resulting hierarchical object uniquely determines the original dependency arcs. We further show that this representation characterizes projectivity: a single-headed dependency tree admits such a contiguous ordered representation if and only if it is projective. The proposal is derivational rather than convertive. It interprets arc-standard transition sequences directly as ordered tree construction, rather than transforming a completed dependency graph into a phrase-structure output. For non-projective inputs, the same interpretation can be used in practice via pseudo-projective lifting before derivation and inverse decoding after recovery. A proof-of-concept implementation in a standard neural transition-based parser shows that the mapped derivations are executable and support stable dependency recovery.
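The shift/leftarc/rightarc transitions interpreted here can be sketched as a minimal arc-standard state machine. The scripted `oracle` below stands in for a trained classifier; this is a generic arc-standard sketch, not the paper's tree-update construction itself:

```python
def arc_standard(words, oracle):
    """Run arc-standard transitions over `words` (1-indexed; 0 is ROOT).

    `oracle` maps the current (stack, buffer) to "shift", "leftarc",
    or "rightarc". Returns the set of (head, dependent) arcs.
    """
    stack = [0]                        # ROOT starts on the stack
    buffer = list(range(1, len(words) + 1))
    arcs = set()
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer)
        if action == "shift":
            stack.append(buffer.pop(0))
        elif action == "leftarc":      # second-from-top takes top as head
            dep = stack.pop(-2)
            arcs.add((stack[-1], dep))
        elif action == "rightarc":     # top takes second-from-top as head
            dep = stack.pop()
            arcs.add((stack[-1], dep))
    return arcs
```

Each of the three actions deterministically updates the state, which is precisely the property the paper exploits to read the transition sequence as ordered tree construction.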
[38] Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs
Utsav Maskey, Mark Dras, Usman Naseem
Main category: cs.CL
TL;DR: Analysis shows harmful-refusal directions in aligned language models are captured by a single global vector, while over-refusal directions are task-dependent and reside within benign task clusters, explaining why global ablation fails to address over-refusal.
Details
Motivation: Aligned language models trained to refuse harmful requests often exhibit over-refusal, declining safe instructions that resemble harmful ones. The paper aims to understand why simple global direction ablation fails to correct over-refusal while maintaining proper refusal mechanisms.
Method: Analyzes representational geometry of both refusal types by examining hidden-state vectors. Uses linear probing to investigate representational distinctions between harmful-refusal and over-refusal directions across transformer layers.
Result: Harmful-refusal directions are task-agnostic and captured by a single global vector, while over-refusal directions are task-dependent: they reside within benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace.
Conclusion: The two refusal types are representationally distinct from early transformer layers, explaining why global direction ablation alone cannot address over-refusal. Task-specific geometric interventions are necessary to correct over-refusal while preserving proper refusal mechanisms.
Abstract: Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that seemingly resemble harmful instructions. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand why this happens. We show that harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace. Linear probing confirms that the two refusal types are representationally distinct from the early transformer layers. These findings provide a mechanistic explanation of why global direction ablation alone cannot address over-refusal, and establish that task-specific geometric interventions are necessary.
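The global refusal direction the paper critiques is commonly computed as a difference of mean activations and removed by orthogonal projection. A pure-Python sketch of that standard difference-in-means construction, assuming per-layer hidden states as plain vectors (not necessarily the paper's exact procedure):

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def refusal_direction(harmful, benign):
    """Unit difference-of-means direction between harmful-prompt and
    benign-prompt hidden states at one layer."""
    mh, mb = mean(harmful), mean(benign)
    d = [a - b for a, b in zip(mh, mb)]
    norm = sum(x * x for x in d) ** 0.5
    return [x / norm for x in d]

def ablate(h, d):
    """Remove the component of hidden state h along unit direction d."""
    proj = sum(a * b for a, b in zip(h, d))
    return [a - proj * b for a, b in zip(h, d)]
```

The paper's finding is that this single global `d` captures harmful refusal well, but over-refusal spans a task-dependent, higher-dimensional subspace that one projection cannot remove.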
[39] Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models
Duanyi Yao, Changyue Li, Zhicong Huang, Cheng Hong, Songze Li
Main category: cs.CL
TL;DR: Hidden Ads introduces a new backdoor attack on Vision-Language Models that injects unauthorized advertisements through natural user recommendation-seeking behaviors, making it practical for real-world deployment.
Details
Motivation: As VLMs are increasingly deployed in consumer applications for recommendations, there's a need to understand security vulnerabilities where attackers can exploit natural user behaviors to inject promotional content without detection.
Method: Proposes a multi-tier threat framework with three adversary capability levels: hard prompt injection, soft prompt optimization, and supervised fine-tuning. Uses teacher VLM-generated chain-of-thought reasoning to create natural trigger-slogan associations across semantic domains.
Result: Experiments on three VLM architectures show high injection efficacy with near-zero false positives while maintaining task accuracy. The attack is data-efficient, transfers to unseen datasets, and scales to multiple concurrent domain-slogan pairs.
Conclusion: Hidden Ads represents a practical security threat to VLMs in recommendation services, with current defenses (instruction-based filtering and clean fine-tuning) failing to remove the backdoor without significant utility degradation.
Abstract: Vision-Language Models (VLMs) are increasingly deployed in consumer applications where users seek recommendations about products, dining, and services. We introduce Hidden Ads, a new class of backdoor attacks that exploit this recommendation-seeking behavior to inject unauthorized advertisements. Unlike traditional pattern-triggered backdoors that rely on artificial triggers such as pixel patches or special tokens, Hidden Ads activates on natural user behaviors: when users upload images containing semantic content of interest (e.g., food, cars, animals) and ask recommendation-seeking questions, the backdoored model provides correct, helpful answers while seamlessly appending attacker-specified promotional slogans. This design preserves model utility and produces natural-sounding injections, making the attack practical for real-world deployment in consumer-facing recommendation services. We propose a multi-tier threat framework to systematically evaluate Hidden Ads across three adversary capability levels: hard prompt injection, soft prompt optimization, and supervised fine-tuning. Our poisoned data generation pipeline uses teacher VLM-generated chain-of-thought reasoning to create natural trigger–slogan associations across multiple semantic domains. Experiments on three VLM architectures demonstrate that Hidden Ads achieves high injection efficacy with near-zero false positives while maintaining task accuracy. Ablation studies confirm that the attack is data-efficient, transfers effectively to unseen datasets, and scales to multiple concurrent domain-slogan pairs. We evaluate defenses including instruction-based filtering and clean fine-tuning, finding that both fail to remove the backdoor without causing significant utility degradation.
[40] A gentle tutorial and a structured reformulation of Bock’s algorithm for minimum directed spanning trees
Yuxi Wang, Jungyeul Park
Main category: cs.CL
TL;DR: A tutorial and reformulation of Bock’s 1971 algorithm for constructing minimum directed spanning trees (arborescences), making it accessible for modern readers and showing its application as an exact decoder for nonprojective dependency parsing.
Details
Motivation: To make Bock's 1971 algorithm readable and reproducible for modern audiences, while demonstrating its relevance to dependency parsing in NLP where it serves as an exact decoder for nonprojective graph-based parsing.
Method: Provides a gentle tutorial with line-by-line execution trace of Bock’s original algorithm, introduces a structured reformulation that clarifies phase structure, state maintenance, and control flow, and includes a worked example for dependency parsing showing transformation to Bock’s formulation.
Result: Successfully makes Bock’s algorithm accessible with complete execution traces and structured reformulation, demonstrates practical application to dependency parsing through affine transformation of maximum weight to minimum cost problems.
Conclusion: Bock’s 1971 algorithm remains relevant for modern NLP applications, particularly as an exact decoder for nonprojective dependency parsing, and can be made accessible through careful tutorial presentation and structured reformulation.
Abstract: This paper presents a gentle tutorial and a structured reformulation of Bock’s 1971 Algol procedure for constructing minimum directed spanning trees. Our aim is to make the original algorithm readable and reproducible for modern readers, while highlighting its relevance as an exact decoder for nonprojective graph-based dependency parsing. We restate the minimum arborescence objective in Bock’s notation and provide a complete line-by-line execution trace of the original ten-node example, extending the partial trace given in the source paper from initialization to termination. We then introduce a structured reformulation that makes explicit the procedure’s phase structure, maintained state, and control flow, while preserving the logic of the original method. As a further illustration, we include a worked example adapted from Jurafsky and Martin (2026) for dependency parsing, showing how a maximum-weight arborescence problem is reduced to Bock’s minimum-cost formulation by a standard affine transformation and traced under the same state variables.
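The standard affine transformation mentioned in the abstract can be sketched directly: because every spanning arborescence over n nodes has exactly n-1 edges, subtracting each score from a constant preserves the optimum while flipping maximization into minimization. A minimal sketch (edge representation is an assumption, not Bock's notation):

```python
def to_min_cost(scores):
    """Convert edge scores for a maximum-weight arborescence into
    nonnegative costs for a minimum-cost formulation via c = M - s,
    where M is the largest score.

    scores: dict mapping (head, dependent) -> score.
    """
    m = max(scores.values())
    return {edge: m - s for edge, s in scores.items()}
```

Any minimum-cost arborescence under the transformed costs is a maximum-weight arborescence under the original scores, which is what lets a minimum-cost decoder like Bock's serve graph-based dependency parsing.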
[41] Umwelt Engineering: Designing the Cognitive Worlds of Linguistic Agents
Rodney Jehu-Appiah
Main category: cs.CL
TL;DR: Umwelt engineering as a third layer in agent design that alters cognition by constraining linguistic environments, with experiments showing vocabulary constraints improve reasoning and ensemble diversity
Details
Motivation: To explore how altering the medium of reasoning (linguistic environment) affects cognition itself, proposing Umwelt engineering as a new layer in agent design beyond prompt and context engineering.
Method: Two experiments: 1) Three language models reasoning under vocabulary constraints (No-Have and E-Prime) across seven tasks; 2) 16 linguistically constrained agents tackling debugging problems with ensemble analysis.
Result: No-Have improved ethical reasoning by 19.1pp, classification by 6.5pp, and epistemic calibration by 7.4pp. E-Prime showed dramatic but model-dependent effects. Constrained agent ensembles achieved 100% ground-truth coverage vs 88.2% for control, with counterfactual agents being crucial
Conclusion: Umwelt engineering effectively alters cognition through cognitive restructuring and diversification, offering a new approach to agent design that can improve reasoning and problem-solving through linguistic constraints
Abstract: I propose Umwelt engineering – the deliberate design of the linguistic cognitive environment – as a third layer in the agent design stack, upstream of both prompt and context engineering. Two experiments test the thesis that altering the medium of reasoning alters cognition itself. In Experiment 1, three language models reason under two vocabulary constraints – No-Have (eliminating possessive “to have”) and E-Prime (eliminating “to be”) – across seven tasks (N=4,470 trials). No-Have improves ethical reasoning by 19.1 pp (p < 0.001), classification by 6.5 pp (p < 0.001), and epistemic calibration by 7.4 pp, while achieving 92.8% constraint compliance. E-Prime shows dramatic but model-dependent effects: cross-model correlations reach r = -0.75. In Experiment 2, 16 linguistically constrained agents tackle 17 debugging problems. No constrained agent outperforms the control individually, yet a 3-agent ensemble achieves 100% ground-truth coverage versus 88.2% for the control. A permutation test confirms only 8% of random 3-agent subsets achieve full coverage, and every successful subset contains the counterfactual agent. Two mechanisms emerge: cognitive restructuring and cognitive diversification. The primary limitation is the absence of an active control matching constraint prompt elaborateness.
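Constraint compliance like the 92.8% figure above requires checking outputs against the banned vocabulary. A simplified E-Prime checker as an illustration; the form list and tokenization are assumptions (contractions such as "it's" are only partially handled), not the study's actual compliance procedure:

```python
import re

# Inflected forms of "to be" excluded under E-Prime (simplified list).
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being",
            "isn't", "aren't", "wasn't", "weren't"}

def eprime_compliant(text):
    """Return True if `text` contains no listed form of 'to be'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return not any(tok in BE_FORMS for tok in tokens)

def compliance_rate(texts):
    """Fraction of texts that satisfy the E-Prime constraint."""
    return sum(eprime_compliant(t) for t in texts) / len(texts)
```

A No-Have checker would follow the same shape with forms of "to have" ("have", "has", "had", "having") in place of `BE_FORMS`.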
[42] PRBench: End-to-end Paper Reproduction in Physics Research
Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li, Zeyu Li, Zhaolong Zhang, Huiwen Zheng, Leidong Bao, Anqi Lv, Zihan Mo, Yadi Niu, Yiyang Peng, Yu Tian, Yili Wang, Ziyu Wang, Zi-Yu Wang, Jiashen Wei, Liuheng Wu, Aoran Xue, Leyi Yang, Guanglu Yuan, Xiarui Zhan, Jingjun Zhang, Zifan Zheng, Pengfei Liu, Linrui Zhen, Kaiyang Li, Qichang Li, Ziheng Zhou, Guo-En Nian, Yunwei Xiao, Qing-Hong Cao, Linjie Dai, Xu Feng, Peng Gao, Ying Gu, Chang Liu, Jia Liu, Ming-xing Luo, Yan-Qing Ma, Liang-You Peng, Huichao Song, Shufeng Wang, Chenxu Wang, Tao Wang, Yi-Nan Wang, Chengyin Wu, Pengwei Zhao, Hua Xing Zhu
Main category: cs.CL
TL;DR: PRBench is a benchmark for evaluating AI agents’ ability to reproduce scientific papers end-to-end, focusing on physics subfields with 30 expert-curated tasks requiring comprehension, implementation, and quantitative result matching.
Details
Motivation: While AI agents show strong reasoning and problem-solving capabilities for scientific tasks like formula derivation and code generation, their reliability in end-to-end reproduction of real scientific papers remains unproven. The authors aim to create a rigorous benchmark to assess whether these agents can autonomously reproduce published research.
Method: Created PRBench with 30 expert-curated tasks spanning 11 physics subfields, each grounded in real published papers. Tasks require agents to comprehend methodology, implement algorithms from scratch, and produce quantitative results matching original publications. Agents operate in sandboxed environments with only task instructions and paper content. Used agentified assessment pipeline to evaluate coding agents across scientific reasoning and execution dimensions.
Result: Best-performing agent (OpenAI Codex powered by GPT-5.3-Codex) achieved only 34% mean overall score. All agents had zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. Identified systematic failure modes including formula implementation errors, inability to debug numerical simulations, and fabrication of output data.
Conclusion: Current AI agents struggle significantly with end-to-end scientific paper reproduction, revealing substantial gaps in their capabilities for autonomous scientific research. PRBench provides a rigorous benchmark for measuring progress toward this goal and highlights key areas needing improvement.
Abstract: AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.
[43] Budget-Xfer: Budget-Constrained Source Language Selection for Cross-Lingual Transfer to African Languages
Tewodros Kederalah Idris, Roald Eiselen, Prasenjit Mitra
Main category: cs.CL
TL;DR: Budget-Xfer framework optimizes source language selection and data allocation for cross-lingual transfer learning under fixed annotation budgets, showing multi-source transfer outperforms single-source but selection strategies have modest differences.
Details
Motivation: Existing cross-lingual transfer learning comparisons don't control for total training data, confounding language selection effects with data quantity effects, making it unclear how to best allocate limited annotation budgets across source languages.
Method: Introduces Budget-Xfer framework that formulates multi-source cross-lingual transfer as a budget-constrained resource allocation problem, evaluating four allocation strategies across NER and sentiment analysis for three African languages using two multilingual models in 288 experiments.
Result: Multi-source transfer significantly outperforms single-source (Cohen’s d = 0.80 to 1.98), but differences among multi-source strategies are modest and non-significant. Embedding similarity as selection proxy is task-dependent - random selection outperforms similarity-based for NER but not sentiment analysis.
Conclusion: Budget-Xfer provides principled framework for cross-lingual transfer optimization, showing multi-source transfer is crucial but specific allocation strategies matter less than previously thought, with task-dependent value of language similarity metrics.
Abstract: Cross-lingual transfer learning enables NLP for low-resource languages by leveraging labeled data from higher-resource sources, yet existing comparisons of source language selection strategies do not control for total training data, confounding language selection effects with data quantity effects. We introduce Budget-Xfer, a framework that formulates multi-source cross-lingual transfer as a budget-constrained resource allocation problem. Given a fixed annotation budget B, our framework jointly optimizes which source languages to include and how much data to allocate from each. We evaluate four allocation strategies across named entity recognition and sentiment analysis for three African target languages (Hausa, Yoruba, Swahili) using two multilingual models, conducting 288 experiments. Our results show that (1) multi-source transfer significantly outperforms single-source transfer (Cohen’s d = 0.80 to 1.98), driven by a structural budget underutilization bottleneck; (2) among multi-source strategies, differences are modest and non-significant; and (3) the value of embedding similarity as a selection proxy is task-dependent, with random selection outperforming similarity-based selection for NER but not sentiment analysis.
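For readers unfamiliar with the effect sizes reported above, Cohen's d is the standardized mean difference between two conditions. A minimal sketch using the common pooled-standard-deviation form (the paper's exact estimator is not stated in the summary):

```python
from statistics import mean, variance

def cohens_d(scores_a, scores_b):
    """Standardized mean difference between two score samples,
    using a pooled standard deviation (Cohen's d)."""
    na, nb = len(scores_a), len(scores_b)
    pooled_var = ((na - 1) * variance(scores_a) +
                  (nb - 1) * variance(scores_b)) / (na + nb - 2)
    return (mean(scores_a) - mean(scores_b)) / pooled_var ** 0.5
```

A d of 0.8 or above, as reported for multi-source versus single-source transfer, is conventionally read as a large effect.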
[44] The Degree of Language Diacriticity and Its Effect on Tasks
Adi Cohen, Yuval Pinter
Main category: cs.CL
TL;DR: Proposes a data-driven framework to quantify diacritic complexity across languages and shows it correlates with diacritic restoration model performance.
Details
Motivation: Diacritics are important in many writing systems, but their impact on language technology hasn't been systematically quantified across scripts. Prior work focused on individual languages without cross-linguistic, data-driven frameworks.
Method: Develops information-theoretic metrics for diacritic complexity (frequency, ambiguity, structural diversity). Computes metrics over 24 corpora in 15 languages with single- and multi-diacritic scripts. Evaluates correlation with BERT- and RNN-based diacritic restoration performance.
Result: Higher diacritic complexity strongly correlates with lower restoration accuracy. In single-diacritic scripts, frequency and structural measures align. In multi-diacritic scripts, structural complexity shows strongest association with performance.
Conclusion: Measurable properties of diacritic usage influence diacritic restoration model performance, showing orthographic complexity is functionally relevant for modeling, not just descriptive.
Abstract: Diacritics are orthographic marks that clarify pronunciation, distinguish similar words, or alter meaning. They play a central role in many writing systems, yet their impact on language technology has not been systematically quantified across scripts. While prior work has examined diacritics in individual languages, there’s no cross-linguistic, data-driven framework for measuring the degree to which writing systems rely on them and how this affects downstream tasks. We propose a data-driven framework for quantifying diacritic complexity using corpus-level, information-theoretic metrics that capture the frequency, ambiguity, and structural diversity of character-diacritic combinations. We compute these metrics over 24 corpora in 15 languages, spanning both single- and multi-diacritic scripts. We then examine how diacritic complexity correlates with performance on the task of diacritics restoration, evaluating BERT- and RNN-based models. We find that across languages, higher diacritic complexity is strongly associated with lower restoration accuracy. In single-diacritic scripts, where character-diacritic combinations are more predictable, frequency-based and structural measures largely align. In multi-diacritic scripts, however, structural complexity exhibits the strongest association with performance, surpassing frequency-based measures. These findings show that measurable properties of diacritic usage influence the performance of diacritic restoration models, demonstrating that orthographic complexity is not only descriptive but functionally relevant for modeling.
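One way to make a corpus-level, information-theoretic metric of this kind concrete is the Shannon entropy of base-character/diacritic combinations. This is only an illustrative proxy under assumed definitions; the paper's actual metrics are not specified in the summary:

```python
import unicodedata
from collections import Counter
from math import log2

def diacritic_combo_entropy(text):
    """Shannon entropy (in bits) of the distribution of base-character /
    combining-mark combinations among diacritized characters."""
    combos = []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        if marks:  # keep only characters carrying at least one diacritic
            combos.append((base, marks))
    counts = Counter(combos)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

Under this proxy, a script whose diacritized characters are dominated by a few frequent combinations scores low, while one with many equiprobable combinations scores high.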
[45] Investigating the Influence of Language on Sycophantic Behavior of Multilingual LLMs
Bayan Abdullah Aldahlawi, A. B. M. Ashikur Rahman, Irfan Ahmad
Main category: cs.CL
TL;DR: LLM sycophancy varies by language despite mitigation efforts; newer models show reduced but still language-dependent agreeableness patterns across multilingual tweet-like prompts.
Details
Motivation: While LLM sycophancy has been studied and mitigated in newer models, the effect of language on sycophantic behavior remains unexplored, creating a need for systematic multilingual testing.
Method: Evaluated GPT-4o mini, Gemini 1.5 Flash, and Claude 3.5 Haiku using tweet-like opinion prompts translated into Arabic, Chinese, French, Spanish, and Portuguese, analyzing sycophancy across languages and sensitive topics.
Result: Newer models show significantly less sycophancy overall than earlier generations, but sycophancy extent is still influenced by language, revealing systematic cultural and linguistic patterns in model agreeableness.
Conclusion: Progress in sycophancy mitigation is evident, but broader multilingual audits are needed for trustworthy, bias-aware LLM deployment due to persistent language-dependent sycophantic patterns.
Abstract: Large language models (LLMs) have achieved strong performance across a wide range of tasks, but they are also prone to sycophancy, the tendency to agree with user statements regardless of validity. Previous research has outlined both the extent and the underlying causes of sycophancy in earlier models, such as ChatGPT-3.5 and Davinci. Newer models have since undergone multiple mitigation strategies, yet there remains a critical need to systematically test their behavior. In particular, the effect of language on sycophancy has not been explored. In this work, we investigate how the language influences sycophantic responses. We evaluate three state-of-the-art models, GPT-4o mini, Gemini 1.5 Flash, and Claude 3.5 Haiku, using a set of tweet-like opinion prompts translated into five additional languages: Arabic, Chinese, French, Spanish, and Portuguese. Our results show that although newer models exhibit significantly less sycophancy overall compared to earlier generations, the extent of sycophancy is still influenced by the language. We further provide a granular analysis of how language shapes model agreeableness across sensitive topics, revealing systematic cultural and linguistic patterns. These findings highlight both the progress of mitigation efforts and the need for broader multilingual audits to ensure trustworthy and bias-aware deployment of LLMs.
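A per-language sycophancy rate of the kind analyzed here can be computed as the fraction of prompts on which the model agrees with the user's stated opinion. The record schema below is hypothetical; the paper's scoring protocol is not detailed in the summary:

```python
from collections import defaultdict

def sycophancy_rates(records):
    """Fraction of responses agreeing with the user's opinion, per language.
    Each record is a dict like {"lang": "ar", "agreed": True}."""
    agree = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["lang"]] += 1
        agree[rec["lang"]] += int(rec["agreed"])
    return {lang: agree[lang] / total[lang] for lang in total}
```

Comparing these rates across the six languages is the kind of analysis the study's language-dependence claim rests on.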
[46] Can Large Language Models Simulate Human Cognition Beyond Behavioral Imitation?
Yuxuan Gu, Lunjun Liu, Xiaocheng Feng, Kun Zhu, Weihong Zhong, Lei Huang, Bing Qin
Main category: cs.CL
TL;DR: A benchmark using 217 AI researchers’ publication trajectories to test if LLMs can simulate human cognition or just imitate behaviors, with cross-domain temporal-shift evaluation and cognitive alignment metrics.
Details
Motivation: Existing datasets fail to capture authentic individual cognitive patterns, using either synthetic reasoning traces or aggregated population data. There's a need to determine if LLMs truly simulate human cognition or merely imitate surface-level behaviors.
Method: Created benchmark using longitudinal research trajectories of 217 AI researchers across diverse domains, using their scientific publications as externalized cognitive representations. Employed cross-domain, temporal-shift generalization setting to distinguish cognitive transfer from behavior imitation. Proposed multidimensional cognitive alignment metric for individual-level assessment.
Result: A systematic evaluation of state-of-the-art LLMs and enhancement techniques provides an empirical study of how well LLMs simulate human cognition and how far existing techniques can enhance these capabilities.
Conclusion: The benchmark enables assessment of LLMs’ cognitive simulation capabilities beyond surface-level behavior imitation, offering tools to measure individual-level cognitive consistency and generalization across domains and time.
Abstract: An essential problem in artificial intelligence is whether LLMs can simulate human cognition or merely imitate surface-level behaviors, while existing datasets suffer from either synthetic reasoning traces or population-level aggregation, failing to capture authentic individual cognitive patterns. We introduce a benchmark grounded in the longitudinal research trajectories of 217 researchers across diverse domains of artificial intelligence, where each author’s scientific publications serve as an externalized representation of their cognitive processes. To distinguish whether LLMs transfer cognitive patterns or merely imitate behaviors, our benchmark deliberately employs a cross-domain, temporal-shift generalization setting. A multidimensional cognitive alignment metric is further proposed to assess individual-level cognitive consistency. Through systematic evaluation of state-of-the-art LLMs and various enhancement techniques, we provide a first-stage empirical study on the questions: (1) How well do current LLMs simulate human cognition? and (2) How far can existing techniques enhance these capabilities?
[47] KAT-Coder-V2 Technical Report
Fengxiang Li, Han Zhang, Haoyang Huang, Jinghui Wang, Jinhua Hao, Kun Yuan, Mengtong Li, Minglei Zhang, Pengcheng Xu, Wenhao Zhuang, Yizhen Shao, Zongxian Feng, Can Tang, Chao Wang, Chengxiao Tong, Fan Yang, Gang Xiong, Haixuan Gao, Han Gao, Hao Wang, Haochen Liu, Hongliang Sun, Jiabao Li, Jingwen Chang, Jun Du, Junyi Peng, Leizhen Cui, Meimei Jing, Mingqi Wu, Shangpeng Yan, Shaotong Qi, Suzhe Xu, Wenxuan Zhao, Xianda Sun, Xuan Xie, Yanbo Wang, Yao Xia, Yinghan Cui, Yingpeng Chen, Yong Wang, Yuze Shi, Zhiwei Shen, Ziyu Wang, Ming Sun, Lin Ye, Bin Chen
Main category: cs.CL
TL;DR: KAT-Coder-V2 is an agentic coding model using a “Specialize-then-Unify” paradigm with five expert domains trained independently then consolidated via distillation, achieving state-of-the-art performance on multiple coding benchmarks.
Details
Motivation: To create a comprehensive agentic coding model that can handle diverse coding tasks across different domains (software engineering, web coding, terminal operations, web search, and general coding) by leveraging specialized expertise while maintaining unified capabilities.
Method: Adopts “Specialize-then-Unify” paradigm: 1) Decomposes agentic coding into five expert domains, 2) Each domain undergoes independent supervised fine-tuning and reinforcement learning, 3) Consolidates experts into single model via on-policy distillation, 4) Uses KwaiEnv infrastructure for scalable sandbox training, 5) Proposes MCLA for stabilizing MoE RL training and Tree Training for computational efficiency.
Result: Achieves 79.6% on SWE-bench Verified (close to Claude Opus 4.6’s 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), ranks first in all three frontend aesthetics scenarios, and maintains strong scores on Terminal-Bench Hard (46.8) and τ²-Bench (93.9).
Conclusion: KAT-Coder-V2 demonstrates that the “Specialize-then-Unify” paradigm with multi-domain expertise and scalable RL training can create powerful agentic coding models that excel across diverse coding benchmarks while maintaining computational efficiency.
Abstract: We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou. KAT-Coder-V2 adopts a “Specialize-then-Unify” paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement learning, before being consolidated into a single model via on-policy distillation. We develop KwaiEnv, a modular infrastructure sustaining tens of thousands of concurrent sandbox instances, and scale RL training along task complexity, intent alignment, and scaffold generalization. We further propose MCLA for stabilizing MoE RL training and Tree Training for eliminating redundant computation over tree-structured trajectories with up to 6.2x speedup. KAT-Coder-V2 achieves 79.6% on SWE-bench Verified (vs. Claude Opus 4.6 at 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), ranks first across all three frontend aesthetics scenarios, and maintains strong generalist scores on Terminal-Bench Hard (46.8) and τ²-Bench (93.9). Our model is publicly available at https://streamlake.com/product/kat-coder.
[48] Retromorphic Testing with Hierarchical Verification for Hallucination Detection in RAG
Boxi Yu, Yuzhong Zhang, Liting Lin, Lionel Briand, Emir Muñoz
Main category: cs.CL
TL;DR: RT4CHART is a retromorphic testing framework for detecting hallucinations in retrieval-augmented generation by decomposing outputs into verifiable claims and performing hierarchical verification against retrieved context.
Details
Motivation: LLMs continue to hallucinate in RAG settings, producing claims unsupported by or conflicting with retrieved context. Existing approaches lack fine-grained, evidence-grounded diagnostics for context faithfulness.
Method: RT4CHART decomposes model outputs into independently verifiable claims and performs hierarchical, local-to-global verification against retrieved context. Each claim is labeled as entailed, contradicted, or baseless, with explicit evidence mapping back to answer spans.
Result: Achieves best answer-level hallucination detection F1 among baselines: 0.776 on RAGTruth++ (83% improvement over strongest baseline) and 47.5% span-level F1 on RAGTruth-Enhance. Re-annotation reveals 1.68x more hallucination cases than original labels.
Conclusion: RT4CHART provides fine-grained, interpretable auditing for RAG hallucinations through hierarchical verification. Existing benchmarks substantially underestimate hallucination prevalence, highlighting the need for more rigorous evaluation frameworks.
Abstract: Large language models (LLMs) continue to hallucinate in retrieval-augmented generation (RAG), producing claims that are unsupported by or conflict with the retrieved context. Detecting such errors remains challenging when faithfulness is evaluated solely with respect to the retrieved context. Existing approaches either provide coarse-grained, answer-level scores or focus on open-domain factuality, often lacking fine-grained, evidence-grounded diagnostics. We present RT4CHART, a retromorphic testing framework for context-faithfulness assessment. RT4CHART decomposes model outputs into independently verifiable claims and performs hierarchical, local-to-global verification against the retrieved context. Each claim is assigned one of three labels: entailed, contradicted, or baseless. Furthermore, RT4CHART maps claim-level decisions back to specific answer spans and retrieves explicit supporting or refuting evidence from the context, enabling fine-grained and interpretable auditing. We evaluate RT4CHART on RAGTruth++ (408 samples) and RAGTruth-Enhance (2,675 samples), a newly re-annotated benchmark. RT4CHART achieves the best answer-level hallucination detection F1 among all baselines. On RAGTruth++, it reaches an F1 score of 0.776, outperforming the strongest baseline by 83%. On RAGTruth-Enhance, it achieves a span-level F1 of 47.5%. Ablation studies show that the hierarchical verification design is the primary driver of performance gains. Finally, our re-annotation reveals 1.68x more hallucination cases than the original labels, suggesting that existing benchmarks substantially underestimate the prevalence of hallucinations.
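The claim-level labels described above can be rolled up into an answer-level hallucination flag and scored with F1. The any-claim aggregation rule is an assumption for illustration; the paper's exact rule is not given in the summary:

```python
def answer_hallucinated(claim_labels):
    """Flag an answer as hallucinated if any of its claims is labeled
    'contradicted' or 'baseless' against the retrieved context
    (assumed aggregation rule, for illustration)."""
    return any(label in ("contradicted", "baseless") for label in claim_labels)

def f1(predicted, gold):
    """F1 over boolean hallucination flags (positive class = hallucinated)."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(not p and g for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Span-level F1 works analogously, but over predicted versus gold answer spans rather than whole answers.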
[49] TailNLG: A Multilingual Benchmark Addressing Verbalization of Long-Tail Entities
Lia Draetta, Michael Oliverio, Virginia Ramón-Ferrer, Pier Felice Balestrucci, Flaviana Corallo, Carlos Badenes-Olmedo, Alessandro Mazzei, Marco Antonio Stranisci, Rossana Damiano
Main category: cs.CL
TL;DR: TailNLG benchmark reveals LLMs have consistent bias against long-tail entities in multilingual Data-to-Text generation, with lower embedding scores and higher uncertainty for rare entities.
Details
Motivation: To study biases in verbalization of rare (long-tail) entities in Data-to-Text generation, which is important for making knowledge graphs accessible and supporting retrieval-augmented generation systems, but has received little attention despite recent multilingual advances.
Method: Introduces TailNLG, a new multilingual benchmark in English, Italian, and Spanish built from Wikidata, covering entities with varying popularity levels. Evaluates three families of large language models in zero-shot settings, comparing performance on rare vs. common entities and against the established WebNLG benchmark.
Result: Reveals consistent bias against long-tail entities: embedding-based scores are lower and model uncertainty is higher for rare entities. Impact varies across models and languages, and existing evaluation metrics don’t consistently capture these differences.
Conclusion: Highlights the need for more reliable evaluation frameworks to address biases in Data-to-Text generation, particularly for long-tail entities that are crucial for comprehensive knowledge graph verbalization.
Abstract: The automatic verbalization of structured knowledge is a key task for making knowledge graphs accessible to non-expert users and supporting retrieval-augmented generation systems. Although recent advances in Data-to-Text generation have improved multilingual coverage, little attention has been paid to potential biases in the verbalization of rare entities, frequently known as long-tail entities. In this work, we present the first systematic study of long-tail entities in Data-to-Text generation. We introduce TailNLG, a new multilingual benchmark in English, Italian, and Spanish, built from Wikidata and covering entities with varying levels of popularity. We evaluate three different families of large language models in zero-shot settings and compare their performance on rare versus common entities, as well as against the established WebNLG benchmark. Our results reveal a consistent bias against long-tail entities: embedding-based scores are lower, and model uncertainty is higher for rare entities. We further show that the impact of long-tail entities varies across models and languages, and that existing evaluation metrics do not consistently capture these differences, highlighting the need for more reliable evaluation frameworks.
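Embedding-based scores of the kind reported here typically reduce to cosine similarity between the embedding of a generated verbalization and that of a reference text. A minimal sketch (the paper's specific metric and embedding model are not named in the summary):

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

The reported bias is then a systematic gap in such scores between long-tail and head entities.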
[50] Understanding Teacher Revisions of Large Language Model-Generated Feedback
Conrad Borchers, Luiz Rodrigues, Newarney Torrezão da Costa, Cleon Xavier, Rafael Ferreira Mello
Main category: cs.CL
TL;DR: Teachers accept 80% of AI-generated feedback unchanged, with editing patterns varying widely; ML models can predict revisions from text features; teachers often simplify AI feedback toward more concise forms.
Details
Motivation: To understand how teachers revise AI-generated feedback before delivering it to students, as teacher revisions shape what students ultimately receive and are central to evaluating AI classroom tools.
Method: Analysis of 1,349 AI-generated feedback instances and corresponding teacher edits from 117 teachers, examining textual characteristics, predicting revision decisions using machine learning models with sentence embeddings, and qualitative coding of pedagogical feedback types.
Result: Teachers accept 80% of AI feedback unchanged; edited feedback tends to be longer initially but gets shortened; editing behavior varies widely (50% never edit, 10% edit >2/3); ML models achieve AUC=0.75 for predicting revisions; teachers often simplify AI feedback, shifting from explanatory to corrective forms.
Conclusion: Findings characterize teacher engagement with AI-generated feedback and highlight opportunities to design feedback systems that better align with teacher priorities while reducing unnecessary editing effort.
Abstract: Large language models (LLMs) increasingly generate formative feedback for students, yet little is known about how teachers revise this feedback before it reaches learners. Teachers’ revisions shape what students receive, making revision practices central to evaluating AI classroom tools. We analyze a dataset of 1,349 instances of AI-generated feedback and corresponding teacher-edited explanations from 117 teachers. We examine (i) textual characteristics associated with teacher revisions, (ii) whether revision decisions can be predicted from the AI feedback text, and (iii) how revisions change the pedagogical type of feedback delivered. First, we find that teachers accept AI feedback without modification in about 80% of cases, while edited feedback tends to be significantly longer and subsequently shortened by teachers. Editing behavior varies substantially across teachers: about 50% never edit AI feedback, and only about 10% edit more than two-thirds of feedback instances. Second, machine learning models trained only on the AI feedback text as input features, using sentence embeddings, achieve fair performance in identifying which feedback will be revised (AUC=0.75). Third, qualitative coding shows that when revisions occur, teachers often simplify AI-generated feedback, shifting it away from high-information explanations toward more concise, corrective forms. Together, these findings characterize how teachers engage with AI-generated feedback in practice and highlight opportunities to design feedback systems that better align with teacher priorities while reducing unnecessary editing effort.
[51] Conversational Agents and the Understanding of Human Language: Reflections on AI, LLMs, and Cognitive Science
Andrei Popescu-Belis
Main category: cs.CL
TL;DR: The paper examines the relationship between NLP evolution and human language understanding, concluding that despite impressive chatbot abilities, language technology hasn’t significantly advanced our understanding of human language processing.
Details
Motivation: To explore whether the evolution of natural language processing technology has contributed to our understanding of human language capacity as studied by linguistics and cognitive science.Method: The paper traces the historical evolution of NLP from its beginnings to the current era of large language models, comparing each major paradigm with theories of human language capacity from linguistics and cognitive science.
Result: The analysis reveals that despite the impressive language abilities achieved by modern chatbots using artificial neural networks, the evolution of language technology has not substantially deepened our understanding of how human minds process natural language.
Conclusion: Current NLP systems, including large language models, operate on fundamentally different principles than human language processing, and their success doesn’t translate to better understanding of human linguistic cognition.
Abstract: In this paper, we discuss the relationship between natural language processing by computers (NLP) and the understanding of the human language capacity, as studied by linguistics and cognitive science. We outline the evolution of NLP from its beginnings until the age of large language models, and highlight for each of its main paradigms some similarities and differences with theories of the human language capacity. We conclude that the evolution of language technology has not substantially deepened our understanding of how human minds process natural language, despite the impressive language abilities attained by current chatbots using artificial neural networks.
[52] Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner, Hongyuan Mei, Hao Peng, Yue Guo
Main category: cs.CL
TL;DR: A counterfactual multi-agent diagnostic framework that improves LLM-based clinical diagnosis by explicitly testing how individual findings support or weaken competing diagnoses through counterfactual case editing and specialist discussions.
Details
Motivation: Current LLM-based diagnostic agents reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses, lacking the interpretability needed for clinical decision support. The work aims to make hypothesis testing explicit and evidence-grounded, inspired by clinician training methods.
Method: Proposes a counterfactual multi-agent diagnostic framework with: 1) Counterfactual case editing to modify clinical findings and evaluate how changes affect competing diagnoses, 2) Counterfactual Probability Gap to quantify how strongly individual findings support a diagnosis by measuring confidence shifts under edits, and 3) Multi-round specialist discussions guided by these counterfactual signals to challenge unsupported hypotheses and refine differential diagnoses.
Result: Across three diagnostic benchmarks and seven LLMs, the method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with largest gains in complex and ambiguous cases. Human evaluation indicates the framework produces more clinically useful, reliable, and coherent reasoning.
Conclusion: Incorporating counterfactual evidence verification is an important step toward building reliable AI systems for clinical decision support, making LLM-based diagnostic reasoning more interpretable and clinically grounded.
Abstract: Clinical diagnosis is a complex reasoning process in which clinicians gather evidence, form hypotheses, and test them against alternative explanations. In medical training, this reasoning is explicitly developed through counterfactual questioning–e.g., asking how a diagnosis would change if a key symptom were absent or altered–to strengthen differential diagnosis skills. As large language model (LLM)-based systems are increasingly used for diagnostic support, ensuring the interpretability of their recommendations becomes critical. However, most existing LLM-based diagnostic agents reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses. In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded. Our framework introduces counterfactual case editing to modify clinical findings and evaluate how these changes affect competing diagnoses. We further define the Counterfactual Probability Gap, a method that quantifies how strongly individual findings support a diagnosis by measuring confidence shifts under these edits. These counterfactual signals guide multi-round specialist discussions, enabling agents to challenge unsupported hypotheses, refine differential diagnoses, and produce more interpretable reasoning trajectories. Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases. Human evaluation further indicates that our framework produces more clinically useful, reliable, and coherent reasoning. These results suggest that incorporating counterfactual evidence verification is an important step toward building reliable AI systems for clinical decision support.
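As described, the Counterfactual Probability Gap is the shift in a model's confidence in a diagnosis when one finding is counterfactually edited. A sketch under assumed interfaces (the callables and case representation below are hypothetical):

```python
def counterfactual_probability_gap(p_diagnosis, case, finding, edit):
    """Confidence shift for a diagnosis when `finding` is counterfactually
    edited in the case; larger gaps mean the finding carries more
    evidential weight for that diagnosis."""
    return p_diagnosis(case) - p_diagnosis(edit(case, finding))

# Toy scorer: confidence in "flu" depends on whether fever is present.
p_flu = lambda case: 0.9 if "fever" in case else 0.4
remove = lambda case, finding: case - {finding}

gap = counterfactual_probability_gap(p_flu, {"fever", "cough"}, "fever", remove)
```

In the toy example the gap is 0.5: removing the fever finding substantially weakens support for the diagnosis, which is exactly the signal the framework uses to steer specialist discussion.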
[53] ProText: A benchmark dataset for measuring (mis)gendering in long-form texts
Hadas Kotek, Margit Bowler, Patrick Sonnenberg, Yu’an Yang
Main category: cs.CL
TL;DR: ProText is a dataset for measuring gendering and misgendering in long-form English texts across multiple dimensions, designed to probe gender bias in LLM text transformations like summarization and rewriting.
Details
Motivation: To address limitations in existing gender bias benchmarks that focus primarily on pronoun resolution and binary gender, and to create a more comprehensive dataset for measuring gendering and misgendering in diverse text transformations.
Method: Created ProText dataset spanning three dimensions: Theme nouns (names, occupations, titles, kinship terms), Theme category (male-stereotyped, female-stereotyped, gender-neutral), and Pronoun category (masculine, feminine, gender-neutral, none). Validated through case studies using prompts and state-of-the-art LLMs.
Result: The dataset enables nuanced insights into gender bias, stereotyping, misgendering, and gendering. Case studies revealed systematic gender bias, particularly when inputs lack explicit gender cues or when models default to heteronormative assumptions.
Conclusion: ProText provides a valuable resource for measuring gender bias in LLM text transformations, extending beyond traditional pronoun resolution benchmarks and binary gender frameworks to capture more complex forms of gendering and misgendering.
Abstract: We introduce ProText, a dataset for measuring gendering and misgendering in stylistically diverse long-form English texts. ProText spans three dimensions: Theme nouns (names, occupations, titles, kinship terms), Theme category (stereotypically male, stereotypically female, gender-neutral/non-gendered), and Pronoun category (masculine, feminine, gender-neutral, none). The dataset is designed to probe (mis)gendering in text transformations such as summarization and rewrites using state-of-the-art Large Language Models, extending beyond traditional pronoun resolution benchmarks and beyond the gender binary. We validated ProText through a mini case study, showing that even with just two prompts and two models, we can draw nuanced insights regarding gender bias, stereotyping, misgendering, and gendering. We reveal systematic gender bias, particularly when inputs contain no explicit gender cues or when models default to heteronormative assumptions.
[54] Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3
Natapong Nitarach
Main category: cs.CL
TL;DR: Diverse reasoning strategies for LLM voting don’t improve mathematical reasoning; high-temperature sampling already decorrelates errors sufficiently, and model capability dominates performance.
Details
Motivation: While majority voting over multiple LLM attempts improves mathematical reasoning, correlated errors limit effectiveness. The authors hypothesize that assigning structurally different reasoning strategies to different voters could decorrelate errors and improve performance.
Method: Tested the Diverse Prompt Mixer in the AIMO 3 competition with 3 models, 23+ experiments, and 50 IMO-level problems on a single H100 80GB GPU with a 5-hour limit. Compared different prompt strategies and inference-time optimizations.
Result: Every intervention failed. High-temperature sampling already decorrelates errors sufficiently; weaker prompt strategies reduce per-attempt accuracy more than they reduce correlation. Model capability dominates by an order of magnitude across a 17-point model capability gap.
Conclusion: For mathematical reasoning with LLMs, model capability is far more important than sophisticated voting strategies. Simple high-temperature sampling provides sufficient error decorrelation, and attempts to diversify reasoning strategies don’t yield benefits.
Abstract: Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix: assign structurally different reasoning strategies to different voters to decorrelate errors. We test this Diverse Prompt Mixer in the AIMO 3 competition: 3 models, 23+ experiments, and 50 IMO-level problems on a single H100 80 GB with a 5-hour limit. Every intervention fails. High-temperature sampling already decorrelates errors sufficiently; weaker prompt strategies reduce per-attempt accuracy more than they reduce correlation. Across a 17-point model capability gap and every inference-time optimization we tried, model capability dominates by an order of magnitude.
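The voting baseline these experiments build on can be sketched as plain majority voting over the final answers extracted from independent high-temperature samples (a generic sketch, not the authors' competition pipeline):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common non-None final answer across attempts;
    ties are broken by first appearance."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

The paper's finding is that swapping uniform high-temperature samples for deliberately diversified prompts does not make this vote more reliable: per-attempt accuracy drops faster than error correlation does.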
[55] What can LLMs tell us about the mechanisms behind polarity illusions in humans? Experiments across model scales and training steps
Dario Paape
Main category: cs.CL
TL;DR: LLMs show different patterns for two polarity illusions: NPI illusion weakens with model size while depth charge illusion strengthens, suggesting shallow processing rather than rational inference mechanisms in sentence comprehension.
Details
Motivation: To investigate how two well-known polarity illusions (NPI illusion and depth charge illusion) manifest in large language models, and what this reveals about human sentence processing mechanisms.
Method: Used the Pythia scaling suite to test different sized language models on two polarity illusions, analyzing how model performance changes with increasing scale.
Result: NPI illusion becomes weaker and disappears as model size increases, while depth charge illusion becomes stronger in larger models.
Conclusion: Human polarity illusions may not require “rational inference” mechanisms; shallow processing and partial grammaticalization of ungrammatical structures in LLMs suggest alternative explanations rooted in construction grammar.
Abstract: I use the Pythia scaling suite (Biderman et al. 2023) to investigate if and how two well-known polarity illusions, the NPI illusion and the depth charge illusion, arise in LLMs. The NPI illusion becomes weaker and ultimately disappears as model size increases, while the depth charge illusion becomes stronger in larger models. The results have implications for human sentence processing: it may not be necessary to assume “rational inference” mechanisms that convert ill-formed sentences into well-formed ones to explain polarity illusions, given that LLMs cannot plausibly engage in this kind of reasoning, especially at the implicit level of next-token prediction. On the other hand, shallow, “good enough” processing and/or partial grammaticalization of prescriptively ungrammatical structures may both occur in LLMs. I propose a synthesis of different theoretical accounts that is rooted in the basic tenets of construction grammar.
[56] KazByte: Adapting Qwen models to Kazakh via Byte-level Adapter
Rauan Akylzhanov
Main category: cs.CL
TL;DR: ByteKaz: A two-stage approach to improve Kazakh language processing in LLMs by bypassing tokenizers with byte-level adapters and fine-tuning attention layers.
Details
Motivation: Standard LLM tokenizers fragment Kazakh text into many tokens compared to English, causing computational inefficiency, shorter effective context windows, and poor morphological understanding due to tokenizer bias toward high-resource languages.
Method: Two-stage approach: 1) Train small adapter to process raw bytes and interface with frozen Qwen2.5-7B, 2) Freeze adapter and fine-tune only attention layers on Kazakh text to adapt the model.
Result: Empirical validation is ongoing; this paper presents the ByteKaz architecture and training protocol as a design proposal with hypotheses to be tested.
Conclusion: ByteKaz offers a promising approach to overcome tokenizer limitations for low-resource languages like Kazakh, potentially matching or exceeding original model performance on Kazakh benchmarks through specialized adaptation.
Abstract: Large language models fragment Kazakh text into many more tokens than equivalent English text, because their tokenizers were built for high-resource languages. This tokenizer tax inflates compute, shortens the effective context window, and weakens the model’s grip on Kazakh morphology. We propose to bypass the tokenizer entirely by feeding raw bytes through a small adapter that learns to speak the internal language of a frozen Qwen2.5-7B. Once the adapter is trained, we freeze it and fine-tune only the attention layers of Qwen on Kazakh text. Our central hypothesis is that this two-stage process – first teach the interface, then adapt the model – should match or exceed the accuracy of the original Qwen2.5-7B on standard Kazakh benchmarks. This report describes the ByteKaz architecture and training protocol. Empirical validation is ongoing; this version stakes the design and hypotheses for the record.
[57] Article and Comment Frames Shape the Quality of Online Comments
Matteo Guida, Yulia Otmakhova, Eduard Hovy, Lea Frermann
Main category: cs.CL
TL;DR: Framing in news articles affects comment quality: articles with certain frames generate healthier comments, and comments adopting the article’s frame are healthier than those departing from it.
Details
Motivation: While framing theory suggests information presentation shapes audience responses, computational work has largely ignored audience reactions. Recent work showed framing shapes response content, but this paper investigates whether framing also affects response quality.
Method: Analyzed 1M comments across 2.7K news articles, operationalizing quality as comment health (constructive, good-faith contributions). Used statistical analysis to examine how article frames predict comment health while controlling for topic.
Result: Article frames significantly predict comment health. Comments that adopt the article frame are healthier than those that depart from it. Unhealthy top-level comments tend to generate more unhealthy responses, independent of the frame being used.
Conclusion: Establishes a link between framing theory and discourse quality, laying groundwork for downstream applications. Demonstrates potential with a proactive frame-aware LLM-based system to mitigate unhealthy discourse.
Abstract: Framing theory posits that how information is presented shapes audience responses, but computational work has largely ignored audience reactions. While recent work showed that article framing systematically shapes the content of reader responses, this paper asks: Does framing also affect response quality? Analyzing 1M comments across 2.7K news articles, we operationalize quality as comment health (constructive, good-faith contributions). We find that article frames significantly predict comment health while controlling for topic, and that comments that adopt the article frame are healthier than those that depart from it. Further, unhealthy top-level comments tend to generate more unhealthy responses, independent of the frame being used in the comment. Our results establish a link between framing theory and discourse quality, laying the groundwork for downstream applications. We illustrate this potential with a proactive frame-aware LLM-based system to mitigate unhealthy discourse.
[58] Top-down string-to-dependency Neural Machine Translation
Shuhei Kondo, Katsuhito Sudoh, Yuji Matsumoto
Main category: cs.CL
TL;DR: A neural machine translation model with a novel syntactic decoder that generates target dependency trees top-down to improve translation of long, rare inputs.
Details
Motivation: Standard NMT models with encoder-decoder attention struggle with translating long inputs that are rare or unseen during training. Incorporating target syntax can help address these length-related problems.
Method: Proposes a syntactic decoder that generates target-language dependency trees in a top-down, left-to-right order, moving from conventional sequence-to-sequence to string-to-tree decoding.
Result: Experiments show the top-down string-to-tree decoding generalizes better than conventional sequence-to-sequence decoding for translating long inputs not observed in training data.
Conclusion: Incorporating syntactic structure through top-down tree generation improves NMT performance on challenging long inputs, addressing generalization issues in conventional models.
Abstract: Most modern neural machine translation (NMT) models are based on an encoder-decoder framework with an attention mechanism. While they perform well on standard datasets, they can have trouble translating long inputs that are rare or unseen during training. Incorporating target syntax is one approach to dealing with such length-related problems. We propose a novel syntactic decoder that generates a target-language dependency tree in a top-down, left-to-right order. Experiments show that the proposed top-down string-to-tree decoding generalizes better than conventional sequence-to-sequence decoding in translating long inputs that are not observed in the training data.
[59] EnsemJudge: Enhancing Reliability in Chinese LLM-Generated Text Detection through Diverse Model Ensembles
Zhuoshang Wang, Yubing Ren, Guoyu Zhao, Xiaowei Zhu, Hao Li, Yanan Cao
Main category: cs.CL
TL;DR: Proposes EnsemJudge, a robust framework for detecting Chinese LLM-generated text using tailored strategies and ensemble voting mechanisms, achieving state-of-the-art performance on NLPCC2025 Shared Task 1.
Details
Motivation: LLM-generated text detection is crucial for mitigating misuse risks, but existing methods struggle with out-of-domain inputs, adversarial samples, and limited focus on Chinese text detection.
Method: EnsemJudge framework incorporates tailored strategies and ensemble voting mechanisms specifically designed for Chinese text detection, trained on a carefully constructed Chinese dataset from NLPCC2025 Shared Task 1.
Result: Outperformed all baseline methods and achieved first place in NLPCC2025 Shared Task 1, demonstrating effectiveness and reliability in Chinese LLM-generated text detection.
Conclusion: EnsemJudge provides a robust solution for Chinese LLM-generated text detection, addressing gaps in current research and showing practical applicability in real-world scenarios.
Abstract: Large Language Models (LLMs) are widely applied across various domains due to their powerful text generation capabilities. While LLM-generated texts often resemble human-written ones, their misuse can lead to significant societal risks. Detecting such texts is an essential technique for mitigating LLM misuse, and many detection methods have shown promising results across different datasets. However, real-world scenarios often involve out-of-domain inputs or adversarial samples, which can affect the performance of detection methods to varying degrees. Furthermore, most existing research has focused on English texts, with limited work addressing Chinese text detection. In this study, we propose EnsemJudge, a robust framework for detecting Chinese LLM-generated text by incorporating tailored strategies and ensemble voting mechanisms. We trained and evaluated our system on a carefully constructed Chinese dataset provided by NLPCC2025 Shared Task 1. Our approach outperformed all baseline methods and achieved first place in the task, demonstrating its effectiveness and reliability in Chinese LLM-generated text detection. Our code is available at https://github.com/johnsonwangzs/MGT-Mini.
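The summary does not specify EnsemJudge's exact voting rule. As a generic illustration only, a weighted soft vote over several detectors' predicted probabilities could look like the sketch below; the averaging rule, weights, and threshold are all assumptions, not the paper's method:

```python
def ensemble_vote(probabilities, weights=None, threshold=0.5):
    """Weighted soft vote: average per-detector probabilities that a
    text is LLM-generated, then compare against a decision threshold.
    This rule is an illustrative assumption, not the paper's."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    avg = sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)
    return "llm-generated" if avg >= threshold else "human-written"
```

Soft voting of this kind tends to be more robust to a single detector failing on out-of-domain or adversarial inputs than trusting any one detector, which is the failure mode the paper targets.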
[60] Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
Xinran Zhang
Main category: cs.CL
TL;DR: Study compares atomic vs holistic LLM judges for reference-grounded QA evaluation, finding holistic judges match or exceed atomic judges on most benchmarks, especially for detecting partial support.
Details
Motivation: To determine whether the advantage of atomic decomposition in LLM-based reference-grounded judges comes from decomposition itself or simply from richer prompting, by comparing atomic and holistic judges with controlled prompting.
Method: Compared self-decomposing atomic judge (single-prompt decompose-and-verify) against prompt-controlled holistic judge with same inputs and detailed rubric. Tested on 200 source examples each from TruthfulQA, ASQA, and QAMPARI datasets using four model families, with source-level paired tests, cluster bootstrap, and aggregation across three pre-frozen prompt variants per design.
Result: Holistic judge matches or exceeds atomic judge on two of three benchmarks (ASQA and QAMPARI favor holistic across all four model families, statistically reliable in three of four), while TruthfulQA shows small atomic edge. Holistic advantage concentrated in partially_supported cases (incompleteness detection).
Conclusion: For self-decomposing single-prompt patterns on QA-style benchmarks, holistic judges can perform as well or better than atomic judges, particularly for detecting partial support. The advantage of atomic decomposition may stem more from richer prompting than decomposition itself.
Abstract: Atomic decomposition – breaking a candidate answer into claims before verifying each against a reference – is a widely adopted design for LLM-based reference-grounded judges. However, atomic prompts are typically richer and longer, making it unclear whether any advantage comes from decomposition or from richer prompting. We study this for benchmark-style completeness-sensitive reference-support classification: classifying a candidate as fully supported, partially supported, or unsupported relative to a supplied reference. We compare a self-decomposing atomic judge (single-prompt decompose-and-verify) against a prompt-controlled holistic judge with the same inputs and a similarly detailed rubric. On 200 source examples per dataset across TruthfulQA, ASQA, and QAMPARI, with four model families, source-level paired tests, cluster bootstrap, and aggregation across three pre-frozen prompt variants per design family, we find the holistic judge matches or exceeds the atomic judge on two of three benchmarks: ASQA and QAMPARI favor holistic across all four families (statistically reliable in three of four), while TruthfulQA shows a small atomic edge. The holistic advantage is concentrated in partially_supported cases – incompleteness detection. A sensitivity check against human annotations confirms the ranking under both benchmark-completeness and human factual-correctness standards. Our finding is specific to the self-decomposing single-prompt pattern on three QA-style benchmarks with 200 source examples each; multi-stage atomic pipelines and non-QA tasks remain untested. Among perturbations examined, reference-quality degradation produced the largest accuracy drops for both judge families.
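In the atomic design, the final three-way label has to be aggregated from the per-claim verdicts. The natural mapping (a sketch; the paper's exact aggregation rule may differ) is:

```python
def aggregate_claims(claim_verdicts):
    """Map per-claim support verdicts (True = claim supported by the
    reference) to the three-way label used in the paper. The
    all/some/none rule here is an assumed, obvious aggregation."""
    if all(claim_verdicts):
        return "fully_supported"
    if any(claim_verdicts):
        return "partially_supported"
    return "unsupported"
```

Note that the partially_supported branch is exactly where the paper finds the holistic judge's advantage concentrated: detecting that some, but not all, of the needed content is present.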
[61] Transfer Learning for an Endangered Slavic Variety: Dependency Parsing in Pomak Across Contact-Shaped Dialects
Sercan Karakaş
Main category: cs.CL
TL;DR: This paper presents new resources and baselines for dependency parsing in Pomak, an endangered Eastern South Slavic language, focusing on cross-dialect transfer between Greek and Turkish varieties.
Details
Motivation: To address the lack of resources for Pomak, an endangered language with substantial dialectal variation, and to study how well dependency parsers transfer across dialects when trained on existing treebanks from different varieties.
Method: Two experimental phases: 1) Zero-shot transfer evaluation of a parser trained on Greek-variety UD data to Turkish-variety Pomak, quantifying phonological and morphosyntactic differences; 2) Introducing a new manually annotated Turkish-variety corpus (650 sentences) and using targeted fine-tuning and cross-variety transfer learning.
Result: Targeted fine-tuning on the small Turkish-variety corpus substantially improves accuracy, and performance is further boosted by cross-variety transfer learning that combines both dialects.
Conclusion: The paper demonstrates that even small amounts of targeted data can significantly improve cross-dialect parsing performance, and that combining resources from different dialects through transfer learning yields the best results for low-resource endangered languages.
Abstract: This paper presents new resources and baselines for Dependency Parsing in Pomak, an endangered Eastern South Slavic language with substantial dialectal variation and no widely adopted standard. We focus on the variety spoken in Turkey (Uzunköprü) and ask how well a dependency parser trained on the existing Pomak Universal Dependencies treebank, which was built primarily from the variety that is spoken in Greece, transfers across dialects. We run two experimental phases. First, we train a parser on the Greek-variety UD data and evaluate zero-shot transfer to Turkish-variety Pomak, quantifying the impact of phonological and morphosyntactic differences. Second, we introduce a new manually annotated Turkish-variety Pomak corpus of 650 sentences and show that, despite its small size, targeted fine-tuning substantially improves accuracy; performance is further boosted by cross-variety transfer learning that combines the two dialects.
[62] Who Wrote the Book? Detecting and Attributing LLM Ghostwriters
Anudeex Shetty, Qiongkai Xu, Olga Ohrimenko, Jey Han Lau
Main category: cs.CL
TL;DR: GhostWriteBench is a dataset for LLM authorship attribution with long-form texts, and TRACE is a novel fingerprinting method that captures token-level transition patterns for identifying AI-generated content.
Details
Motivation: The need for effective methods to attribute authorship of AI-generated text, especially as LLMs become more capable of producing human-like long-form content, and the challenge of generalization across different domains and unseen LLM models.
Method: TRACE creates interpretable fingerprints by capturing token-level transition patterns (like word rank) using a lightweight language model, working for both open- and closed-source models.
Result: TRACE achieves state-of-the-art performance on GhostWriteBench, remains robust in out-of-distribution settings, and works well with limited training data.
Conclusion: The proposed TRACE method provides an effective, interpretable, and lightweight solution for LLM authorship attribution that generalizes well across different domains and unseen models.
Abstract: In this paper, we introduce GhostWriteBench, a dataset for LLM authorship attribution. It comprises long-form texts (50K+ words per book) generated by frontier LLMs, and is designed to test generalisation across multiple out-of-distribution (OOD) dimensions, including domain and unseen LLM author. We also propose TRACE – a novel fingerprinting method that is interpretable and lightweight – that works for both open- and closed-source models. TRACE creates the fingerprint by capturing token-level transition patterns (e.g., word rank) estimated by another lightweight language model. Experiments on GhostWriteBench demonstrate that TRACE achieves state-of-the-art performance, remains robust in OOD settings, and works well in limited training data scenarios.
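To make "token-level transition patterns (e.g., word rank)" concrete, a toy version of such a fingerprint can be built by bucketing each token's rank under some scoring model, counting bucket-to-bucket transitions, and comparing the normalized histograms. Everything below (the bucketing scheme, the rank table, the cosine comparison) is an illustrative assumption, not TRACE's actual formulation:

```python
from collections import Counter
import math

def rank_bucket(token, rank_table, n_buckets=4):
    """Bucket a token by its rank under a (hypothetical) scoring
    model; unseen tokens fall into the last bucket."""
    rank = rank_table.get(token, len(rank_table))
    return min(rank * n_buckets // max(len(rank_table), 1), n_buckets - 1)

def fingerprint(tokens, rank_table, n_buckets=4):
    """Normalized histogram over (bucket_t, bucket_{t+1}) transitions:
    a toy stand-in for TRACE's token-level transition fingerprint."""
    buckets = [rank_bucket(t, rank_table, n_buckets) for t in tokens]
    counts = Counter(zip(buckets, buckets[1:]))
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}

def cosine(fp1, fp2):
    """Cosine similarity between two sparse fingerprints."""
    dot = sum(fp1.get(k, 0.0) * fp2.get(k, 0.0) for k in set(fp1) | set(fp2))
    n1 = math.sqrt(sum(v * v for v in fp1.values()))
    n2 = math.sqrt(sum(v * v for v in fp2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Attribution would then assign a text to whichever candidate LLM's reference fingerprint it is most similar to; because the features are explicit transition frequencies, the fingerprint stays interpretable, which is the property the paper emphasizes.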
[63] From Reviews to Requirements: Can LLMs Generate Human-Like User Stories?
Shadman Sakib, Oishy Fatema Akhand, Tasnia Tasneem, Shohel Ahmed
Main category: cs.CL
TL;DR: LLMs (GPT-3.5 Turbo, Gemini 2.0 Flash, Mistral 7B) can generate usable user stories from app reviews, matching human quality in fluency but struggling with independence and uniqueness needed for agile backlogs.
Details
Motivation: App store reviews provide valuable user feedback but are messy and difficult to analyze manually at scale. Existing automated techniques often fail to produce clean, backlog-ready user stories for agile projects.
Method: Evaluated LLMs (GPT-3.5 Turbo, Gemini 2.0 Flash, Mistral 7B Instruct) on Mini-BAR dataset of 1,000+ health app reviews using zero-shot, one-shot, and two-shot prompting. Evaluated generated user stories using human judgment (RUST framework) and RoBERTa classifier fine-tuned on UStAI.
Result: LLMs can match or outperform humans in writing fluent, well-formatted user stories, especially with few-shot prompts. However, they struggle to produce independent and unique user stories essential for agile backlogs.
Conclusion: LLMs can reliably turn unstructured app reviews into actionable software requirements, providing developers with clear guidance to turn user feedback into meaningful improvements, though limitations remain in story independence and uniqueness.
Abstract: App store reviews provide a constant flow of real user feedback that can help improve software requirements. However, these reviews are often messy, informal, and difficult to analyze manually at scale. Although automated techniques exist, many do not perform well when replicated and often fail to produce clean, backlog-ready user stories for agile projects. In this study, we evaluate how well large language models (LLMs) such as GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct can generate usable user stories directly from raw app reviews. Using the Mini-BAR dataset of 1,000+ health app reviews, we tested zero-shot, one-shot, and two-shot prompting methods. We evaluated the generated user stories using both human judgment (via the RUST framework) and a RoBERTa classifier fine-tuned on UStAI to assess their overall quality. Our results show that LLMs can match or even outperform humans in writing fluent, well-formatted user stories, especially when few-shot prompts are used. However, they still struggle to produce independent and unique user stories, which are essential for building a strong agile backlog. Overall, our findings show how LLMs can reliably turn unstructured app reviews into actionable software requirements, providing developers with clear guidance to turn user feedback into meaningful improvements.
[64] DongYuan: An LLM-Based Framework for Integrative Chinese and Western Medicine Spleen-Stomach Disorders Diagnosis
Hua Li, Yingying Li, Xiaobin Feng, Xinyi Fu, Lifeng Dong, Qingfeng Yang, Yanzhe Chen, Xiaoju Feng, Zhidong Cao, Jianbin Guo, Yanru Du
Main category: cs.CL
TL;DR: DongYuan is an integrative Chinese and Western medicine diagnostic framework for spleen-stomach disorders, featuring curated datasets, a core diagnostic LLM with two-stage training, a consultation navigation model, and a comprehensive evaluation benchmark.
Details
Motivation: Address three major challenges in applying LLMs to integrative Chinese and Western medicine: lack of high-quality data, absence of models integrating TCM syndrome differentiation with Western medical diagnosis, and shortage of standardized evaluation benchmarks for spleen-stomach disorders.
Method: 1) Curated three ICWM datasets (SSDF-Syndrome, SSDF-Dialogue, SSDF-PD); 2) Developed SSDF-Core diagnostic LLM with two-stage training (SFT + DPO); 3) Created SSDF-Navigator pluggable consultation navigation model; 4) Established SSDF-Bench evaluation benchmark.
Result: SSDF-Core significantly outperforms 12 mainstream baselines on SSDF-Bench, demonstrating robust ICWM reasoning capabilities for spleen-stomach disorder diagnosis.
Conclusion: DongYuan provides a solid methodological foundation and practical technical references for developing intelligent ICWM diagnostic systems, addressing key challenges in medical AI applications.
Abstract: The clinical burden of spleen-stomach disorders is substantial. While large language models (LLMs) offer new potential for medical applications, they face three major challenges in the context of integrative Chinese and Western medicine (ICWM): a lack of high-quality data, the absence of models capable of effectively integrating the reasoning logic of traditional Chinese medicine (TCM) syndrome differentiation with that of Western medical (WM) disease diagnosis, and the shortage of a standardized evaluation benchmark. To address these interrelated challenges, we propose DongYuan, an ICWM spleen-stomach diagnostic framework. Specifically, three ICWM datasets (SSDF-Syndrome, SSDF-Dialogue, and SSDF-PD) were curated to fill the gap in high-quality data for spleen-stomach disorders. We then developed SSDF-Core, a core diagnostic LLM that acquires robust ICWM reasoning capabilities through a two-stage training regimen of supervised fine-tuning (SFT) and direct preference optimization (DPO), and complemented it with SSDF-Navigator, a pluggable consultation navigation model designed to optimize clinical inquiry strategies. Additionally, we established SSDF-Bench, a comprehensive evaluation benchmark focused on ICWM diagnosis of spleen-stomach disorders. Experimental results demonstrate that SSDF-Core significantly outperforms 12 mainstream baselines on SSDF-Bench. DongYuan lays a solid methodological foundation and provides practical technical references for the future development of intelligent ICWM diagnostic systems.
[65] Beyond Cosine Similarity: Zero-Initialized Residual Complex Projection for Aspect-Based Sentiment Analysis
Yijin Wang, Fandi Sun
Main category: cs.CL
TL;DR: Novel framework using complex projection and anti-collision loss for disentangling aspect semantics from sentiment polarities in ABSA, achieving SOTA performance.
Details
Motivation: Address representation entanglement in ABSA where aspect semantics and sentiment polarities are conflated in real-valued embeddings, and solve false-negative collisions in contrastive learning that degrade performance on high-frequency aspects.
Method: Proposes Zero-Initialized Residual Complex Projection (ZRCP) to project textual features into complex semantic space, using phase to disentangle sentiment polarities while amplitude encodes semantic intensity. Introduces Anti-collision Masked Angle Loss with anti-collision mask to preserve intra-polarity cohesion while expanding inter-polarity discriminative margin.
Result: Achieves state-of-the-art Macro-F1 score of 0.8851. Deep geometric analyses show that constraining amplitude catastrophically over-regularizes subjective representations, proving the importance of unconstrained-amplitude and phase-driven approach.
Conclusion: The framework effectively disentangles aspect semantics from sentiment polarities using complex projection and anti-collision mechanisms, providing robust fine-grained sentiment analysis with superior performance.
Abstract: Aspect-Based Sentiment Analysis (ABSA) is fundamentally challenged by representation entanglement, where aspect semantics and sentiment polarities are often conflated in real-valued embedding spaces. Furthermore, standard contrastive learning suffers from false-negative collisions, severely degrading performance on high-frequency aspects. In this paper, we propose a novel framework featuring a Zero-Initialized Residual Complex Projection (ZRCP) and an Anti-collision Masked Angle Loss, inspired by quantum projection and entanglement ideas. Our approach projects textual features into a complex semantic space, systematically utilizing the phase to disentangle sentiment polarities while allowing the amplitude to encode the semantic intensity and lexical richness of subjective descriptions. To tackle the collision bottleneck, we introduce an anti-collision mask that elegantly preserves intra-polarity aspect cohesion while expanding the inter-polarity discriminative margin by over 50%. Experimental results demonstrate that our framework achieves a state-of-the-art Macro-F1 score of 0.8851. Deep geometric analyses further reveal that explicitly penalizing the complex amplitude catastrophically over-regularizes subjective representations, proving that our unconstrained-amplitude and phase-driven objective is crucial for robust, fine-grained sentiment disentanglement.
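The core geometric idea, projecting real features into complex coordinates so that phase can carry polarity and amplitude intensity, can be illustrated in plain Python. The fixed weight lists below are only a sketch; ZRCP itself is a learned, zero-initialized residual projection:

```python
import cmath

def complex_project(x, w_re, w_im):
    """Map a real feature vector x to complex coordinates:
    z_j = <w_re[j], x> + i * <w_im[j], x>.
    In ZRCP these projections are learned (with a zero-initialized
    residual); fixed weights here are purely illustrative."""
    dot = lambda w, v: sum(a * b for a, b in zip(w, v))
    return [dot(wr, x) + 1j * dot(wi, x) for wr, wi in zip(w_re, w_im)]

def phase_amplitude(z):
    """Per the paper's intuition: phase encodes sentiment polarity,
    amplitude encodes semantic intensity / lexical richness."""
    return [(cmath.phase(c), abs(c)) for c in z]
```

With a one-dimensional toy projection, `complex_project([1.0, 0.0], [[0.0, 1.0]], [[1.0, 0.0]])` lands at `1j`: phase pi/2, amplitude 1. An angle-based loss can then separate polarities by phase while leaving amplitude unconstrained, which the paper argues is crucial.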
[66] "Versteasch du mi?" Computational and Socio-Linguistic Perspectives on GenAI, LLMs, and Non-Standard Language
Verena Platzgummer, John McCrae, Sina Ahmadi
Main category: cs.CL
TL;DR: Critical analysis of how LLMs and GenAI perpetuate linguistic inequality, focusing on non-standard varieties like South Tyrolean dialects and Kurdish, exploring technical approaches and policy implications for more inclusive AI.
Details
Motivation: The paper addresses how large language models and generative AI systems are biased toward dominant languages and standardized linguistic forms, deepening the digital language divide. It examines how these technologies reproduce historical processes of linguistic standardization rooted in European nationalism and colonialism, and explores whether they can be made more inclusive of non-standard linguistic varieties.
Method: Interdisciplinary approach combining critical sociolinguistics and computational linguistics. Uses two case studies: South Tyrolean dialects (widely used in informal communication in Italy) and varieties of Kurdish. Explores both technical approaches for making LLMs handle non-standard language and policy implications for democratic, decolonial digital strategies.
Result: The paper provides a critical framework for understanding how GenAI technologies perpetuate linguistic hierarchies and standardization. It identifies both technical challenges in handling non-standard linguistic varieties and policy considerations for making AI more linguistically inclusive and equitable.
Conclusion: LLMs and GenAI systems are not neutral but reproduce existing linguistic inequalities. Addressing this requires both technical solutions for handling linguistic variation and broader policy approaches that consider historical and sociopolitical dimensions of language standardization, with implications for creating more democratic and decolonial digital ecosystems.
Abstract: The design of Large Language Models and generative artificial intelligence has been shown to be “unfair” to less-spoken languages and to deepen the digital language divide. Critical sociolinguistic work has also argued that these technologies are not only made possible by prior socio-historical processes of linguistic standardisation, often grounded in European nationalist and colonial projects, but also exacerbate epistemologies of language as “monolithic, monolingual, syntactically standardized systems of meaning”. In our paper, we draw on earlier work on the intersections of technology and language policy and bring our respective expertise in critical sociolinguistics and computational linguistics to bear on an interrogation of these arguments. We take two different complexes of non-standard linguistic varieties in our respective repertoires–South Tyrolean dialects, which are widely used in informal communication in South Tyrol, Italy, as well as varieties of Kurdish–as starting points to an interdisciplinary exploration of the intersections between GenAI and linguistic variation and standardisation. We discuss both how LLMs can be made to deal with nonstandard language from a technical perspective, and whether, when or how this can contribute to “democratic and decolonial digital and machine learning strategies”, which has direct policy implications.
[67] Categorical Perception in Large Language Model Hidden States: Structural Warping at Digit-Count Boundaries
Jon-Paul Cacioli
Main category: cs.CL
TL;DR: LLMs show categorical perception-like geometric warping in hidden representations when processing Arabic numerals, with distinct patterns across architectures - some show both geometric warping and explicit categorization, others only geometric warping.
Details
Motivation: To investigate whether categorical perception phenomena observed in human psychology also occur in the hidden-state representations of large language models when processing structured numerical information.
Method: Used representational similarity analysis across six LLMs from five architecture families, comparing categorical perception-additive models (log-distance plus boundary boost) vs. purely continuous models, examining Arabic numeral processing at digit-count boundaries.
Result: Found categorical perception effects in 100% of primary layers across all models, specific to structurally defined boundaries (digit-count transitions at 10 and 100). Identified two distinct signatures: “classic CP” (Gemma, Qwen) with both geometric warping and explicit categorization, and “structural CP” (Llama, Mistral, Phi) with only geometric warping.
Conclusion: Structural input-format discontinuities are sufficient to produce categorical perception geometry in LLMs independently of explicit semantic category knowledge, revealing architecture-dependent patterns of representation warping.
Abstract: Categorical perception (CP) – enhanced discriminability at category boundaries – is among the most studied phenomena in perceptual psychology. This paper reports that analogous geometric warping occurs in the hidden-state representations of large language models (LLMs) processing Arabic numerals. Using representational similarity analysis across six models from five architecture families, the study finds that a CP-additive model (log-distance plus a boundary boost) fits the representational geometry better than a purely continuous model at 100% of primary layers in every model tested. The effect is specific to structurally defined boundaries (digit-count transitions at 10 and 100), absent at non-boundary control positions, and absent in the temperature domain where linguistic categories (hot/cold) lack a tokenisation discontinuity. Two qualitatively distinct signatures emerge: “classic CP” (Gemma, Qwen), where models both categorise explicitly and show geometric warping, and “structural CP” (Llama, Mistral, Phi), where geometry warps at the boundary but models cannot report the category distinction. This dissociation is stable across boundaries and is a property of the architecture, not the stimulus. Structural input-format discontinuities are sufficient to produce categorical perception geometry in LLMs, independently of explicit semantic category knowledge.
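The CP-additive model's functional form (log numerical distance plus a boost whenever a pair of numerals crosses a digit-count boundary) is simple to state. A sketch with an illustrative boost coefficient; in the paper the coefficients are fitted to the hidden-state geometry per layer:

```python
import math

def cp_additive_dissimilarity(i, j, boost=1.0):
    """CP-additive model of representational dissimilarity between
    numerals i and j: log numerical distance plus a boost whenever
    the pair crosses a digit-count boundary (e.g., 10 or 100).
    The boost value here is illustrative, not fitted."""
    log_dist = math.log(abs(i - j) + 1)
    crosses_boundary = len(str(i)) != len(str(j))
    return log_dist + (boost if crosses_boundary else 0.0)
```

The paper's test is then whether this form fits the pairwise hidden-state dissimilarities better than the purely continuous model (the same expression with no boundary term), which it does at 100% of primary layers. For example, the pair (9, 11) crosses the one-to-two-digit boundary while (11, 13) does not, so the CP-additive model predicts greater dissimilarity for the former despite equal numerical distance.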
[68] Coconstructions in spoken data: UD annotation guidelines and first results
Ludovica Pannitto, Sylvain Kahane, Kaja Dobrovoljc, Elena Battaglia, Bruno Guillaume, Caterina Mauri, Eleonora Zucchini
Main category: cs.CL
TL;DR: Proposes annotation guidelines for syntactic dependencies across speaker turns in spoken language treebanks within Universal Dependencies framework
Details
Motivation: Current syntactic annotation frameworks don't adequately handle dependencies that span across speaker turns in spoken dialogue, such as collaborative constructions, question-answer pairs, and backchannels.
Method: Proposes two representations: speaker-based representation following speech turn segmentation, and dependency-based representation with cross-turn dependencies. Also introduces new propositions to distinguish reformulations vs repairs and handle unfinished phrases.
Result: Developed annotation guidelines for cross-speaker dependencies in spoken language treebanks, enabling more accurate syntactic analysis of conversational speech
Conclusion: The proposed framework improves syntactic annotation of spoken dialogue by accounting for dependencies across speaker boundaries, better capturing the collaborative nature of conversation
Abstract: The paper proposes annotation guidelines for syntactic dependencies that span across speaker turns - including collaborative coconstructions proper, wh-question answers, and backchannels - in spoken language treebanks within the Universal Dependencies framework. Two representations are proposed: a speaker-based representation following the segmentation into speech turns, and a dependency-based representation with dependencies across speech turns. New propositions are also put forward to distinguish between reformulations and repairs, and to promote elements in unfinished phrases.
[69] Merge and Conquer: Instructing Multilingual Models by Adding Target Language Weights
Eneko Valero, Maria Ribalta i Albado, Oscar Sainz, Naiara Perez, German Rigau
Main category: cs.CL
TL;DR: Model merging enables efficient language adaptation for low-resource languages without requiring language-specific instruction data or repeated fine-tuning.
Details
Motivation: LLMs are heavily English-centric with poor performance in low-resource languages. Traditional adaptation methods require significant computational resources and high-quality instruction data, which are often unavailable for low-resource language communities.
Method: Proposes model merging as a lightweight alternative: merging an instruction-tuned LLM with language-specific base models to transfer language knowledge without needing language-specific instructions or repeated fine-tuning.
Result: Experiments with four Iberian languages (Basque, Catalan, Galician, Spanish) show merging enables effective instruction following in new languages and supports multilingual capability through combining multiple language-specific models.
Conclusion: Model merging is a viable and efficient alternative to traditional adaptation methods for low-resource languages, achieving competitive performance while greatly reducing computational cost.
Abstract: Large Language Models (LLMs) remain heavily centered on English, with limited performance in low-resource languages. Existing adaptation approaches, such as continual pre-training, demand significant computational resources. In the case of instructed models, high-quality instruction data is also required, both of which are often inaccessible for low-resource language communities. Under these constraints, model merging offers a lightweight alternative, but its potential in low-resource contexts has not been systematically explored. In this work, we explore whether it is possible to transfer language knowledge to an instruction-tuned LLM by merging it with a language-specific base model, thereby eliminating the need of language-specific instructions and repeated fine-tuning processes whenever stronger instructed variants become available. Through experiments covering four Iberian languages (Basque, Catalan, Galician, and Spanish) and two model families, we show that merging enables effective instruction following behavior in new languages and even supports multilingual capability through the combination of multiple language-specific models. Our results indicate that model merging is a viable and efficient alternative to traditional adaptation methods for low-resource languages, achieving competitive performance while greatly reducing computational cost.
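The recipe in the title, adding target-language weights to an instructed model, is reminiscent of task-vector arithmetic. The sketch below applies that idea to a single parameter tensor; it is an assumption about the mechanics for illustration, not the paper's exact merging formula.

```python
# Hypothetical task-vector-style merge for one parameter tensor; a real
# merge would iterate over every tensor in the model checkpoints.
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4))         # shared pretrained base model
delta_lang = rng.normal(size=(4, 4))   # change from target-language pretraining
delta_instr = rng.normal(size=(4, 4))  # change from instruction tuning

lang_model = base + delta_lang         # language-specific base model
instruct_model = base + delta_instr    # instruction-tuned model

alpha = 1.0  # merge weight (a tunable hyperparameter)
merged = instruct_model + alpha * (lang_model - base)

# the merged weights carry both the instruction and the language deltas
assert np.allclose(merged, base + delta_instr + alpha * delta_lang)
```

The appeal matches the abstract's claim: the language delta can be re-added whenever a stronger instructed variant appears, with no further fine-tuning.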
[70] The Necessity of Setting Temperature in LLM-as-a-Judge
Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia, Jerome Francois, Radu State
Main category: cs.CL
TL;DR: Systematic investigation of temperature’s impact on LLM-as-a-Judge evaluation performance, showing temperature significantly affects judge behavior and offering engineering guidelines for optimal temperature selection.
Details
Motivation: LLM-as-a-Judge has become popular for evaluating text quality, but researchers use fixed temperature settings (0.1 or 1.0) empirically without understanding how temperature affects judge performance, despite evidence that LLM performance is temperature-sensitive and task-dependent.
Method: Conducted controlled experiments to investigate temperature-judge performance relationship, and used causal inference framework within empirical statistical analysis to rigorously examine direct causal effect of temperature on judge behavior.
Result: Temperature significantly influences judge performance in LLM-centric evaluation, with effects being task-dependent; lower temperatures don’t always yield optimal outcomes; provides actionable engineering insights for evaluation pipeline design.
Conclusion: Temperature is a critical parameter in LLM-as-a-Judge evaluation that requires careful selection based on task characteristics, challenging the empirical convention of fixed temperature settings.
Abstract: LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly employ fixed temperature configurations during the evaluation process, with values of 0.1 and 1.0 being the most prevalent choices, a convention that is largely empirical rather than principled. However, recent research suggests that LLM performance exhibits non-trivial sensitivity to temperature settings, that lower temperatures do not universally yield optimal outcomes, and that such effects are highly task-dependent. This raises a critical research question: does temperature influence judge performance in LLM-centric evaluation? To address this, we systematically investigate the relationship between temperature and judge performance through a series of controlled experiments, and further adopt a causal inference framework within our empirical statistical analysis to rigorously examine the direct causal effect of temperature on judge behavior, offering actionable engineering insights for the design of LLM-centric evaluation pipelines.
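For context, temperature acts by dividing the logits before the softmax, so low values concentrate the judge's probability mass on its top verdict while high values flatten the distribution. A minimal numpy illustration (the logit values are made up):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Temperature divides the logits before softmax: T < 1 sharpens
    the output distribution, T > 1 flattens it."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5]  # e.g. scores for verdict tokens "A", "B", "tie"
p_cold = softmax_with_temperature(logits, 0.1)  # near-deterministic judge
p_warm = softmax_with_temperature(logits, 1.0)  # more exploratory judge
print(p_cold[0] > p_warm[0])
```

This mechanism is why the choice between the conventional 0.1 and 1.0 settings can change judge behavior: at T=0.1 the judge almost always emits its argmax verdict, while at T=1.0 alternative verdicts retain substantial probability.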
[71] Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang, Sheng Yuan, Qinxiu Cheng, Xinchen Xie, Yicheng Chen, Yining Li, Jiaxing Xie, Huanan Dong, Yaguang Wu, Xiangjun Huang, Jian Yang, Hui Wang, Bowen Zhou, Bowen Li, Qipeng Guo, Kai Chen
Main category: cs.CL
TL;DR: Kernel-Smith is an evolutionary framework for GPU kernel generation that combines evaluation-driven evolution with specialized training to optimize LLMs as local improvers rather than one-shot generators.
Details
Motivation: Current LLMs struggle with generating high-performance GPU kernels due to the complexity of hardware optimization, requiring reliable evaluation and iterative improvement rather than one-shot generation.
Method: Combines evolutionary agent maintaining population of executable candidates with execution feedback (compilation, correctness, speedup) and converts evolution trajectories into training signals for LLM optimization as local improvers within evolutionary loop.
Result: Achieves SOTA on KernelBench with Triton backend, outperforming proprietary models like Gemini-3.0-pro and Claude-4.6-opus; also validates on MetaX MACA backend surpassing large models like DeepSeek-V3.2-think and Qwen3-235B-2507-think.
Conclusion: LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment, with framework showing potential for seamless adaptation across heterogeneous GPU platforms.
Abstract: We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with Nvidia Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
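The evaluation-driven evolutionary loop described above (a population of executable candidates, execution feedback as the score, an archive of top performers) can be sketched generically. Everything below is a toy stand-in: a real `evaluate()` would compile the kernel, check correctness, and benchmark speedup, and `mutate()` would be the LLM proposing a revision.

```python
import random

def evolve(seed, mutate, evaluate, generations=30, pop_size=4, rng=None):
    """Generic evaluation-driven evolutionary loop: keep an archive of
    the top-scoring candidates, mutate the current best, score offspring."""
    rng = rng or random.Random(0)
    archive = [(evaluate(seed), seed)]
    for _ in range(generations):
        _, parent = max(archive)                      # pick current best
        offspring = [mutate(parent, rng) for _ in range(pop_size)]
        archive += [(evaluate(child), child) for child in offspring]
        archive = sorted(archive, reverse=True)[:pop_size]  # keep top performers
    return max(archive)

# Toy stand-ins: a "program" is a single number and "speedup" peaks at 3.0
score, best = evolve(
    seed=0.0,
    mutate=lambda x, rng: x + rng.gauss(0, 0.3),
    evaluate=lambda x: -abs(x - 3.0),
)
```

Kernel-Smith's training contribution sits on top of such a loop: revisions that preserve correctness and improve the score become supervision for the model, making it a stronger `mutate` step.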
[72] Not All Subjectivity Is the Same! Defining Desiderata for the Evaluation of Subjectivity in NLP
Urja Khurana, Michiel van der Meer, Enrico Liscio, Antske Fokkens, Pradeep K. Murukannaiah
Main category: cs.CL
TL;DR: Position paper proposing 7 evaluation desiderata for subjectivity-sensitive NLP models to better align evaluation practices with models’ objectives of reflecting diverse perspectives.
Details
Motivation: Current NLP datasets increasingly incorporate subjective judgments and models are being developed to reflect diverse perspectives, but evaluation practices may not align with these objectives, potentially marginalizing minority voices.
Method: Top-down approach constructing 7 evaluation desiderata based on how subjectivity is represented in NLP data and models, followed by analysis of 60 papers' experimental setups to identify gaps.
Result: Analysis reveals several understudied aspects: distinction between ambiguous vs polyphonic input, whether subjectivity is effectively expressed to users, and lack of interplay between different desiderata.
Conclusion: Evaluation practices for subjectivity-sensitive models need improvement to better capture diverse perspectives and ensure minority voices are not marginalized, requiring attention to the identified gaps.
Abstract: Subjective judgments are part of several NLP datasets and recent work is increasingly prioritizing models whose outputs reflect this diversity of perspectives. Such responses allow us to shed light on minority voices, which are frequently marginalized or obscured by dominant perspectives. It remains a question whether our evaluation practices align with these models’ objectives. This position paper proposes seven evaluation desiderata for subjectivity-sensitive models, rooted in how subjectivity is represented in NLP data and models. The desiderata are constructed in a top-down approach, keeping in mind the user-centric impact of such models. We scan the experimental setup of 60 papers and show that various aspects of subjectivity are still understudied: the distinction between ambiguous and polyphonic input, whether subjectivity is effectively expressed to the user, and a lack of interplay between different desiderata, amongst other gaps.
[73] Tailoring AI-Driven Reading Scaffolds to the Distinct Needs of Neurodiverse Learners
Soufiane Jhilal, Eleonora Pasqua, Caterina Marchesi, Riccardo Corradi, Martina Galletti
Main category: cs.CL
TL;DR: Study examines how different text scaffolds (segmentation, pictograms, labels) affect reading comprehension and experience for neurodiverse learners, finding heterogeneous responses and no universally optimal approach.
Details
Motivation: Neurodiverse learners need reading supports, but rich scaffolds can sometimes overload attention and working memory rather than help. The study aims to understand how different types of scaffolds (structural vs. semantic) affect comprehension and reading experience in supervised inclusive contexts.
Method: Used an adapted reading interface with four modalities: unmodified text, sentence-segmented text, segmented text with pictograms, and segmented text with pictograms plus keyword labels. Conducted a within-subject pilot with 14 primary-school learners with special educational needs and disabilities, measuring comprehension with standardized questions and collecting child- and therapist-reported experience measures with open-ended feedback.
Result: Results showed heterogeneous responses: some learners benefited from segmentation and pictograms, while others showed increased coordination costs with visual scaffolds. Experience ratings showed limited differences between modalities, with some effects linked to clinical complexity. Open-ended feedback frequently requested simpler wording and additional visual supports.
Conclusion: No single scaffold is universally optimal, reinforcing the need for calibrated, adjustable scaffolding. Provides design implications for human-AI co-regulation in supervised inclusive reading contexts.
Abstract: Neurodiverse learners often require reading supports, yet increasing scaffold richness can sometimes overload attention and working memory rather than improve comprehension. Grounded in the Construction-Integration model and a contingent scaffolding perspective, we examine how structural versus semantic scaffolds shape comprehension and reading experience in a supervised inclusive context. Using an adapted reading interface, we compared four modalities: unmodified text, sentence-segmented text, segmented text with pictograms, and segmented text with pictograms plus keyword labels. In a within-subject pilot with 14 primary-school learners with special educational needs and disabilities, we measured reading comprehension using standardized questions and collected brief child- and therapist-reported experience measures alongside open-ended feedback. Results highlight heterogeneous responses as some learners showed patterns consistent with benefits from segmentation and pictograms, while others showed patterns consistent with increased coordination costs when visual scaffolds were introduced. Experience ratings showed limited differences between modalities, with some apparent effects linked to clinical complexity, particularly for perceived ease of understanding. Open-ended feedback of the learners frequently requested simpler wording and additional visual supports. These findings suggest that no single scaffold is universally optimal, reinforcing the need for calibrated, adjustable scaffolding and provide design implications for human-AI co-regulation in supervised inclusive reading contexts.
[74] Exploring Cultural Variations in Moral Judgments with Large Language Models
Hadi Mohammadi, Ayoub Bagheri
Main category: cs.CL
TL;DR: LLMs’ ability to reflect culturally diverse moral values is examined by comparing model outputs with global survey data, finding that advanced instruction-tuned models better align with real-world moral attitudes but show Western bias.
Details
Motivation: To investigate whether Large Language Models can capture culturally diverse moral values and how well they align with real-world moral attitudes across different regions and cultures.
Method: Used log-probability-based moral justifiability scores to compare LLM outputs with World Values Survey and Pew Global Attitudes Survey data, analyzing both smaller monolingual/multilingual models and advanced instruction-tuned models across various ethical topics.
Result: Earlier/smaller models showed near-zero or negative correlations with human judgments, while advanced instruction-tuned models achieved substantially higher positive correlations. Models aligned better with W.E.I.R.D. (Western, Educated, Industrialized, Rich, Democratic) nations than other regions.
Conclusion: While scaling model size and instruction tuning improves alignment with cross-cultural moral norms, significant challenges remain for certain topics and regions, highlighting the need for improved cultural sensitivity in LLMs.
Abstract: Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs mirror variations in moral attitudes reported by the World Values Survey (WVS) and the Pew Research Center’s Global Attitudes Survey (PEW). We compare smaller monolingual and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct). Using log-probability-based “moral justifiability” scores, we correlate each model’s outputs with survey data covering a broad set of ethical topics. Our results show that many earlier or smaller models often produce near-zero or negative correlations with human judgments. In contrast, advanced instruction-tuned models achieve substantially higher positive correlations, suggesting they better reflect real-world moral attitudes. We provide a detailed regional analysis revealing that models align better with Western, Educated, Industrialized, Rich, and Democratic (W.E.I.R.D.) nations than with other regions. While scaling model size and using instruction tuning improves alignment with cross-cultural moral norms, challenges remain for certain topics and regions. We discuss these findings in relation to bias analysis, training data diversity, information retrieval implications, and strategies for improving the cultural sensitivity of LLMs.
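The analysis pairs a per-topic score, the gap between the log-probabilities of contrasting completions, with survey means and reports the correlation. The numbers below are hypothetical and only illustrate the computation, not the paper's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-topic scores: logP("justifiable") - logP("unjustifiable")
model_scores = [1.2, -0.4, 0.9, -1.1, 0.3]
# Hypothetical survey means for the same topics (1-10 justifiability scale)
survey_means = [7.8, 3.1, 6.5, 2.0, 5.2]
print(round(pearson(model_scores, survey_means), 2))
```

A model whose scores track survey means yields a correlation near 1; the paper reports near-zero or negative values for earlier models and substantially positive values for instruction-tuned ones.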
[75] Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design
Bin Zhu, Qianghuai Jia, Tian Lan, Junyang Ren, Feng Gu, Feihu Jiang, Longyue Wang, Zhao Xu, Weihua Luo
Main category: cs.CL
TL;DR: Marco DeepResearch introduces a verification-centric framework for deep research agents with verification mechanisms at QA data synthesis, trajectory construction, and test-time scaling levels.
Details
Motivation: Existing deep research agents lack explicit verification mechanisms during training and inference, causing error propagation that degrades performance on long-horizon tasks.
Method: Three-level verification framework: 1) QA data synthesis with verification to control question difficulty and ensure answer correctness; 2) Verification-driven trajectory synthesis injecting verification patterns; 3) Test-time scaling using the agent itself as a verifier.
Result: Marco DeepResearch outperforms 8B-scale deep research agents on challenging benchmarks like BrowseComp and BrowseComp-ZH, and with 600 tool calls approaches/surpasses 30B-scale agents like Tongyi DeepResearch-30B.
Conclusion: Verification-centric design is crucial for reliable deep research agents, enabling smaller models to achieve performance comparable to much larger models through systematic verification mechanisms.
Abstract: Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems. To sustain this capability on long-horizon tasks, reliable verification is critical during both training and inference. A major bottleneck in existing paradigms stems from the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test-time scaling. Errors introduced at each stage propagate downstream and degrade the overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: (1) QA Data Synthesis: we introduce verification mechanisms to graph-based and agent-based QA synthesis to control question difficulty while ensuring answers are unique and correct; (2) Trajectory Construction: we design a verification-driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and (3) Test-time Scaling: we use Marco DeepResearch itself as a verifier at inference time and effectively improve performance on challenging questions. Extensive experimental results demonstrate that our proposed Marco DeepResearch agent significantly outperforms 8B-scale deep research agents on most challenging benchmarks, such as BrowseComp and BrowseComp-ZH. Crucially, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B-scale agents, like Tongyi DeepResearch-30B.
[76] LombardoGraphia: Automatic Classification of Lombard Orthography Variants
Edoardo Signoroni, Pavel Rychlý
Main category: cs.CL
TL;DR: First study on automatic Lombard orthography classification with a curated corpus of 11,186 Wikipedia samples across 9 orthographic variants, achieving up to 96% accuracy with traditional and neural models.
Details
Motivation: The Lombard language lacks a unified orthographic standard and has multiple competing systems, creating challenges for NLP resource development and model training in this underresourced language variety.
Method: Curated LombardoGraphia corpus from Wikipedia, processed and filtered for orthographic analysis. Trained 24 traditional and neural classification models with various features and encoding levels.
Result: Best models achieved 96.06% overall accuracy and 85.78% average class accuracy, though performance on minority classes remains challenging due to data imbalance.
Conclusion: Provides crucial infrastructure for building variety-aware NLP resources for Lombard, enabling better language processing for this underresourced language variety.
Abstract: Lombard, an underresourced language variety spoken by approximately 3.8 million people in Northern Italy and Southern Switzerland, lacks a unified orthographic standard. Multiple orthographic systems exist, creating challenges for NLP resource development and model training. This paper presents the first study of automatic Lombard orthography classification and LombardoGraphia, a curated corpus of 11,186 Lombard Wikipedia samples tagged across 9 orthographic variants, and models for automatic orthography classification. We curate the dataset, processing and filtering raw Wikipedia content to ensure text suitable for orthographic analysis. We train 24 traditional and neural classification models with various features and encoding levels. Our best models achieve 96.06% and 85.78% overall and average class accuracy, though performance on minority classes remains challenging due to data imbalance. Our work provides crucial infrastructure for building variety-aware NLP resources for Lombard.
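The summary does not specify the 24 classifiers, but character n-grams are a natural baseline for orthography identification, since competing orthographies spell the same sounds differently. A stdlib-only sketch with invented toy "orthographies" (the strings are illustrative English stand-ins, not Lombard):

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-gram counts with simple boundary padding."""
    padded = f"^{text}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(sample, profiles):
    """profiles maps an orthography label to an aggregated n-gram Counter."""
    grams = char_ngrams(sample)
    return max(profiles, key=lambda label: cosine(grams, profiles[label]))

# Invented toy "orthographies" that spell the same words differently
profiles = {
    "variant-A": char_ngrams("the quick brown fox jumps over the lazy dog"),
    "variant-B": char_ngrams("ze kvik brovn foks jumps over ze lejzi dog"),
}
print(classify("quick brown dog", profiles))
```

In practice the profiles would be aggregated over many Wikipedia samples per orthographic variant, and the class imbalance noted in the results would call for per-class accuracy reporting, as the paper does.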
[77] Structural-Ambiguity-Aware Translation from Natural Language to Signal Temporal Logic
Kosei Fushimi, Kazunobu Serizawa, Junya Ikemoto, Kazumune Hashimoto
Main category: cs.CL
TL;DR: NL-to-STL translation method that preserves ambiguity by generating multiple candidate formulas instead of forcing single interpretation
Details
Motivation: STL is difficult for non-experts to write directly, while natural language provides convenient interface but suffers from structural ambiguity that makes one-to-one translation unreliable.
Method: Three-stage pipeline: 1) ambiguity-preserving n-best parsing using Combinatory Categorial Grammar, 2) STL-oriented template-based semantic composition, 3) canonicalization with score aggregation to output deduplicated STL candidates with plausibility scores.
Result: Method generates multiple STL candidates for genuinely ambiguous inputs while collapsing unambiguous or canonically equivalent derivations to single STL formula
Conclusion: Proposed ambiguity-preserving approach better handles natural language ambiguity in NL-to-logic translation compared to existing one-best methods
Abstract: Signal Temporal Logic (STL) is widely used to specify timed and safety-critical tasks for cyber-physical systems, but writing STL formulas directly is difficult for non-expert users. Natural language (NL) provides a convenient interface, yet its inherent structural ambiguity makes one-to-one translation into STL unreliable. In this paper, we propose an ambiguity-preserving method for translating NL task descriptions into STL candidate formulas. The key idea is to retain multiple plausible syntactic analyses instead of forcing a single interpretation at the parsing stage. To this end, we develop a three-stage pipeline based on Combinatory Categorial Grammar (CCG): ambiguity-preserving n-best parsing, STL-oriented template-based semantic composition, and canonicalization with score aggregation. The proposed method outputs a deduplicated set of STL candidates with plausibility scores, thereby explicitly representing multiple possible formal interpretations of an ambiguous instruction. In contrast to existing one-best NL-to-logic translation methods, the proposed approach is designed to preserve attachment and scope ambiguity. Case studies on representative task descriptions demonstrate that the method generates multiple STL candidates for genuinely ambiguous inputs while collapsing unambiguous or canonically equivalent derivations to a single STL formula.
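The final stage, canonicalization with score aggregation, amounts to merging canonically equivalent candidates and summing their plausibility scores. A sketch with a placeholder string-level canonicalizer; the real method would normalize the parsed STL syntax tree, and the formulas and scores below are invented.

```python
from collections import defaultdict

def aggregate_candidates(scored_candidates, canonicalize):
    """Merge canonically equivalent STL candidates, summing their
    plausibility scores; return formulas ranked by aggregated score."""
    merged = defaultdict(float)
    for formula, score in scored_candidates:
        merged[canonicalize(formula)] += score
    return sorted(merged.items(), key=lambda item: -item[1])

# Placeholder canonicalizer: whitespace-insensitive surface form only
canonicalize = lambda f: f.replace(" ", "")

candidates = [
    ("G[0,5] (a -> F[0,2] b)", 0.40),    # two derivations with the same
    ("G[0,5](a -> F[0,2] b)", 0.35),     # canonical form get merged...
    ("G[0,5] (a -> F[0,2] (b))", 0.25),  # ...this surface form differs
]
ranked = aggregate_candidates(candidates, canonicalize)
print(ranked[0][0])
```

This shows the behavior the case studies report: spuriously distinct derivations collapse into one candidate, while genuinely different interpretations survive with separate scores.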
[78] TIEG-Youpu Solution for NeurIPS 2022 WikiKG90Mv2-LSC
Feng Nie, Zhixiu Ye, Sifa Xie, Shuang Wu, Xin Yuan, Liang Yao, Jiazhen Peng, Xu Cheng
Main category: cs.CL
TL;DR: A knowledge graph embedding method for WikiKG90Mv2 using retrieve-then-rerank pipeline with priority infilling retrieval and ensemble reranking with neighbor-enhanced representations.
Details
Motivation: Large-scale knowledge graphs like WikiKG90Mv2 (90M+ entities) require efficient and accurate embedding methods for practical applications like knowledge acquisition, QA, and recommendation systems.
Method: Retrieve-then-rerank pipeline: 1) Priority infilling retrieval model for structurally/semantically similar candidates, 2) Ensemble reranking model with neighbor-enhanced representations for final link prediction.
Result: Outperforms existing baselines, improves MRR on validation set from 0.2342 to 0.2839.
Conclusion: Proposed method effectively handles large-scale knowledge graph embedding with improved accuracy while maintaining efficiency.
Abstract: WikiKG90Mv2, the dataset of the NeurIPS 2022 large-scale challenge, is a large encyclopedic knowledge graph. Embedding knowledge graphs into continuous vector spaces is important for many practical applications, such as knowledge acquisition, question answering, and recommendation systems. Compared to existing knowledge graphs, WikiKG90Mv2 is a large-scale knowledge graph composed of more than 90 million entities. Both efficiency and accuracy must be considered when building embedding models for knowledge graphs at this scale. To this end, we follow the retrieve-then-re-rank pipeline and make novel modifications in both the retrieval and re-ranking stages. Specifically, we propose a priority infilling retrieval model to obtain candidates that are structurally and semantically similar. Then we propose an ensemble-based re-ranking model with neighbor-enhanced representations to produce final link prediction results among the retrieved candidates. Experimental results show that our proposed method outperforms existing baseline methods and improves the MRR on the validation set from 0.2342 to 0.2839.
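The retrieve-then-re-rank pipeline can be expressed as a two-stage function: a cheap retriever narrows the 90M+ entities to a small candidate set, then a more expensive reranker picks the final prediction. The retriever and reranker below are toy stand-ins for the paper's priority infilling and ensemble models.

```python
def link_predict(query, retrieve, rerank, k=100):
    """Two-stage pipeline: retrieval for recall, reranking for precision."""
    candidates = retrieve(query, k)
    return max(candidates, key=lambda entity: rerank(query, entity))

def shared_prefix(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

# Toy stand-ins for the paper's retrieval and re-ranking models
entities = ["paris", "parma", "park", "berlin", "parrot"]
retrieve = lambda q, k: sorted(entities, key=lambda e: -shared_prefix(q, e))[:k]
rerank = lambda q, e: -abs(len(q) - len(e))
print(link_predict("parrots", retrieve, rerank, k=3))
```

The design rationale is the usual recall/precision split: the retriever only needs the correct entity somewhere in its top-k, which lets the costly reranker run on a few candidates instead of 90 million.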
[79] EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces
Léane Jourdan, Julien Aubert-Béduchaud, Yannis Chupin, Marah Baccari, Florian Boudin
Main category: cs.CL
TL;DR: EarlySciRev dataset extracts early-stage scientific text revisions from arXiv LaTeX source files by analyzing commented-out text to capture authentic author revisions during the writing process.
Details
Motivation: Current resources for studying scientific writing revisions are limited to final or near-final versions, restricting empirical study of revision behavior and evaluation of LLMs for scientific writing assistance.
Method: Extract revision pairs from arXiv LaTeX source files by identifying commented-out text (discarded/alternative formulations), aligning commented segments with nearby final text, and applying LLM-based filtering to validate genuine revisions.
Result: Created EarlySciRev dataset with 578k validated revision pairs from 1.28M candidate pairs, providing authentic early drafting traces and a human-annotated benchmark for revision detection.
Conclusion: EarlySciRev enables research on scientific writing dynamics, revision modeling, and LLM-assisted editing by providing access to previously unavailable early-stage revision data.
Abstract: Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
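The core extraction step, collecting commented-out lines from LaTeX source while ignoring escaped \% signs, can be sketched with a regex. The actual pipeline additionally aligns the commented segments with nearby final text and filters the pairs with an LLM.

```python
import re

def commented_segments(latex_source):
    """Collect the content of commented-out LaTeX lines, skipping
    escaped \\% signs (which are literal percent characters)."""
    segments = []
    for line in latex_source.splitlines():
        # first % not preceded by a backslash starts a LaTeX comment
        match = re.search(r"(?<!\\)%(.*)", line)
        if match and match.group(1).strip():
            segments.append(match.group(1).strip())
    return segments

sample = (
    "We show that the method works. % earlier wording: we demonstrate it\n"
    "Accuracy reaches 100\\% on toy data.\n"  # escaped %, not a comment
    "% a discarded draft sentence\n"
)
print(commented_segments(sample))
```

Real arXiv sources add complications this sketch ignores, such as `\begin{comment}` environments and percent signs inside verbatim blocks, which is one reason the dataset pipeline applies LLM-based validation afterwards.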
[80] GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum
Shuwen Xu, Yao Xu, Jiaxiang Liu, Chenhao Yuan, Wenshuo Peng, Jun Zhao, Kang Liu
Main category: cs.CL
TL;DR: GraphWalker is a novel agentic KGQA framework that uses automated trajectory synthesis and stage-wise fine-tuning to improve reasoning generalization and address training data scarcity.
Details
Motivation: Existing agentic KGQA approaches face challenges with training data scarcity and reasoning generalization. Prompting-based methods lack autonomous navigation training, while current training pipelines confine reasoning to predefined trajectories, limiting agent exploration.
Method: Two-stage SFT training paradigm: 1) Train agent on structurally diverse trajectories synthesized from constrained random-walk paths to establish broad exploration prior over KG; 2) Fine-tune on small set of expert trajectories to develop reflection and error recovery capabilities. This enables higher performance ceiling for lightweight RL stage.
Result: Achieves state-of-the-art performance on CWQ and WebQSP datasets. Additional results on GrailQA and constructed GraphWalkerBench confirm enhanced generalization to out-of-distribution reasoning paths.
Conclusion: GraphWalker’s stage-wise SFT paradigm effectively addresses training data scarcity and improves reasoning generalization in agentic KGQA through automated trajectory synthesis and progressive fine-tuning.
Abstract: Agentic knowledge graph question answering (KGQA) requires an agent to iteratively interact with knowledge graphs (KGs), posing challenges in both training data scarcity and reasoning generalization. Specifically, existing approaches often restrict agent exploration: prompting-based methods lack autonomous navigation training, while current training pipelines usually confine reasoning to predefined trajectories. To this end, this paper proposes \textit{GraphWalker}, a novel agentic KGQA framework that addresses these challenges through \textit{Automated Trajectory Synthesis} and \textit{Stage-wise Fine-tuning}. GraphWalker adopts a two-stage SFT training paradigm: First, the agent is trained on structurally diverse trajectories synthesized from constrained random-walk paths, establishing a broad exploration prior over the KG; Second, the agent is further fine-tuned on a small set of expert trajectories to develop reflection and error recovery capabilities. Extensive experiments demonstrate that our stage-wise SFT paradigm unlocks a higher performance ceiling for a lightweight reinforcement learning (RL) stage, enabling GraphWalker to achieve state-of-the-art performance on CWQ and WebQSP. Additional results on GrailQA and our constructed GraphWalkerBench confirm that GraphWalker enhances generalization to out-of-distribution reasoning paths. The code is publicly available at https://github.com/XuShuwenn/GraphWalker
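The constrained random-walk synthesis step can be sketched on a toy knowledge graph. The constraints here (hop budget, no revisiting entities) are simplifying assumptions; the paper's actual trajectory format and constraints are more elaborate.

```python
import random

def synthesize_trajectories(kg, num_walks=100, max_hops=3, seed=0):
    """Constrained random walks over a toy KG, represented as a dict
    mapping each entity to a list of (relation, tail) edges. Returns
    alternating entity/relation paths as synthetic trajectories."""
    rng = random.Random(seed)
    entities = list(kg)
    trajectories = []
    for _ in range(num_walks):
        node = rng.choice(entities)
        path, visited = [node], {node}
        for _ in range(rng.randint(1, max_hops)):
            # constraint: never revisit an entity within one walk
            edges = [(r, t) for r, t in kg.get(node, []) if t not in visited]
            if not edges:
                break
            rel, node = rng.choice(edges)
            path += [rel, node]
            visited.add(node)
        if len(path) > 1:  # keep only walks that took at least one hop
            trajectories.append(path)
    return trajectories
```

Each resulting path can then be turned into a (question, trajectory) training instance, which is the role such synthetic data plays in the first SFT stage.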
[81] Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT
Younes Javanmard, Tanmoy Pandit, Masoud Mardani
Main category: cs.CL
TL;DR: MPO decomposition enables efficient transformer compression by factorizing weight matrices into low-rank cores, achieving up to 13x compression while maintaining 97.7% of baseline accuracy.
Details
Motivation: Transformer-based language models have quadratic parameter scaling with hidden dimension, making deployment on resource-constrained hardware expensive. There's a need for principled compression methods that maintain performance while reducing computational costs.
Method: Replace every nn.Linear layer in PicoGPT (GPT-2-style model) with MPOLinear modules parameterized as MPO chains. Use Matrix Product Operator decomposition to factorize weight matrices into chains of low-rank cores with bond dimension chi controlling approximation quality. Initialize cores via TT-SVD from pretrained weights or random initialization, and train using standard PyTorch autograd.
Result: Achieves up to 13x compression per transformer block at chi=4. At chi=16, the model uses 191,872 parameters instead of 1,020,224 while retaining 97.7% of baseline token accuracy (51.6% vs 52.8%). The chi=8 model gives the best accuracy per parameter, exceeding the dense baseline by 2.7x on this metric.
Conclusion: MPO parameterization is a practical and theoretically grounded alternative to low-rank methods and unstructured pruning for transformer compression, offering significant parameter reduction with minimal accuracy loss.
Abstract: Transformer-based language models achieve strong performance across NLP tasks, but their quadratic parameter scaling with hidden dimension makes deployment on resource-constrained hardware expensive. We study Matrix Product Operator (MPO) decomposition as a principled compression method for transformers. MPO factorises weight matrices into chains of low-rank cores, with approximation quality controlled by the bond dimension chi. We replace every nn.Linear layer in PicoGPT, a GPT-2-style character-level language model with about 1M parameters, with an MPOLinear module parameterised as an MPO chain. Cores are initialised either by TT-SVD from pretrained dense weights or from random initialisation, and trained using standard PyTorch autograd without a custom backward pass. We derive balanced factorisation schemes for the five distinct weight shapes in PicoGPT and evaluate bond dimensions chi in {4, 8, 16, 32} on Tiny Shakespeare. MPO compression achieves up to 13x compression per transformer block at chi = 4. At chi = 16, the model uses 191,872 parameters instead of 1,020,224 while retaining 97.7% of baseline token accuracy (51.6% vs 52.8%). Reconstruction error follows the expected trend and is lower for three-site than two-site factorisations at the same bond dimension. The chi = 8 model gives the best accuracy per parameter, exceeding the dense baseline by 2.7x on this metric. These results show that MPO parameterisation is a practical and theoretically grounded alternative to low-rank methods and unstructured pruning for transformer compression.
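The two-site TT-SVD initialization described above can be sketched in a few lines of numpy. The factorization below is a generic illustration of the technique; core shapes and index conventions are assumptions, not the paper's exact MPOLinear code.

```python
import numpy as np

def mpo_two_site(W, m_dims, n_dims, chi):
    """Two-site MPO/TT-SVD factorization of a dense (m1*m2, n1*n2)
    weight matrix into two cores, truncated at bond dimension chi."""
    (m1, m2), (n1, n2) = m_dims, n_dims
    # group (row, col) indices site-wise, then unfold for the SVD
    T = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, S, Vt = np.linalg.svd(T, full_matrices=False)
    chi = min(chi, len(S))
    core1 = (U[:, :chi] * S[:chi]).reshape(m1, n1, chi)  # left core
    core2 = Vt[:chi].reshape(chi, m2, n2)                # right core
    return core1, core2

def mpo_reconstruct(core1, core2):
    """Contract the two cores back into a dense weight matrix."""
    m1, n1, _ = core1.shape
    _, m2, n2 = core2.shape
    T = np.tensordot(core1, core2, axes=([2], [0]))  # (m1, n1, m2, n2)
    return T.transpose(0, 2, 1, 3).reshape(m1 * m2, n1 * n2)
```

The parameter trade-off is visible directly: a dense 16x16 layer has 256 weights, while the chi=4 factorization stores two (4, 4, 4) cores, 128 parameters, with reconstruction error that shrinks monotonically as chi grows.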
[82] Training data generation for context-dependent rubric-based short answer grading
Pavel Šindelář, Dávid Slivka, Christopher Bouma, Filip Prášil, Ondřej Bojar
Main category: cs.CL
TL;DR: This paper explores methods for creating large-scale training datasets for automatic student answer grading using only small confidential reference datasets, with a focus on preserving confidentiality through derived text formats.
Details
Motivation: The PISA test faces challenges in grading student answers due to language differences and annotator bias. Automatic grading methods require large domain-specific datasets for training, but such datasets are often confidential and limited in size.
Method: The authors explore methods to create surrogate datasets using small confidential reference datasets. They use derived text formats to preserve confidentiality while generating larger training datasets, comparing these to purely prompt-based generation approaches.
Result: Successfully created three surrogate datasets that are more similar to the reference dataset than prompt-based generation alone. Early experiments suggest one approach might improve model training for automatic answer grading.
Conclusion: The proposed methods enable creation of large-scale training datasets from small confidential sources, potentially improving automatic grading systems for educational assessments like PISA while maintaining data confidentiality.
Abstract: Every 4 years, the PISA test is administered by the OECD to test the knowledge of teenage students worldwide and allow for comparisons of educational systems. However, having to avoid language differences and annotator bias makes the grading of student answers challenging. For these reasons, it would be interesting to compare methods of automatic student answer grading. To train some of these methods, which require machine learning, or to compute parameters or select hyperparameters for those that do not, a large amount of domain-specific data is needed. In this work, we explore a small number of methods for creating a large-scale training dataset using only a relatively small confidential dataset as a reference, leveraging a set of very simple derived text formats to preserve confidentiality. Using these methods, we successfully created three surrogate datasets that are, at the very least, superficially more similar to the reference dataset than purely the result of prompt-based generation. Early experiments suggest one of these approaches might also lead to improved model training.
[83] EpiScreen: Early Epilepsy Detection from Electronic Health Records with Large Language Models
Shuang Zhou, Kai Yu, Zaifu Zhan, Huixue Zhou, Min Zeng, Feng Xie, Zhiyi Sha, Rui Zhang
Main category: cs.CL
TL;DR: EpiScreen: A low-cost AI system using clinical notes and large language models for early epilepsy detection, achieving high accuracy and improving neurologist performance by up to 10.9%.
Details
Motivation: Epilepsy and psychogenic non-epileptic seizures are often misdiagnosed due to similar symptoms, leading to diagnostic delays, unnecessary treatments, and patient harm. The current gold standard (video-EEG) is expensive and inaccessible, creating a need for a low-cost screening solution.
Method: Developed EpiScreen by fine-tuning large language models on labeled clinical notes from electronic health records. Used MIMIC-IV dataset and private University of Minnesota cohort for training and evaluation.
Result: Achieved AUC of 0.875 on MIMIC-IV and 0.980 on private cohort. In clinician-AI collaboration, EpiScreen-assisted neurologists outperformed unaided experts by up to 10.9%.
Conclusion: EpiScreen enables early, cost-effective epilepsy screening that can reduce diagnostic delays and unnecessary interventions, especially in resource-limited regions.
Abstract: Epilepsy and psychogenic non-epileptic seizures often present with similar seizure-like manifestations but require fundamentally different management strategies. Misdiagnosis is common and can lead to prolonged diagnostic delays, unnecessary treatments, and substantial patient morbidity. Although prolonged video-electroencephalography is the diagnostic gold standard, its high cost and limited accessibility hinder timely diagnosis. Here, we developed a low-cost, effective approach, EpiScreen, for early epilepsy detection by utilizing routinely collected clinical notes from electronic health records. Through fine-tuning large language models on labeled notes, EpiScreen achieved an AUC of up to 0.875 on the MIMIC-IV dataset and 0.980 on a private cohort of the University of Minnesota. In a clinician-AI collaboration setting, EpiScreen-assisted neurologists outperformed unaided experts by up to 10.9%. Overall, this study demonstrates that EpiScreen supports early epilepsy detection, facilitating timely and cost-effective screening that may reduce diagnostic delays and avoid unnecessary interventions, particularly in resource-limited regions.
[84] Adaptive Block-Scaled Data Types
Jack Cook, Hyemin S. Lee, Kathryn Le, Junxian Guo, Giovanni Traverso, Anantha P. Chandrakasan, Song Han
Main category: cs.CL
TL;DR: IF4: Adaptive 4-bit quantization format that dynamically selects between FP4 and INT4 representations per value group, outperforming existing 4-bit formats for LLM quantization.
Details
Motivation: NVFP4 has hardware support but suffers from quantization errors on near-maximal values. Need better 4-bit quantization that maintains accuracy while being hardware-efficient.
Method: Proposes IF4 data type that adaptively selects between FP4 and INT4 for each group of 16 values, using the unused sign bit in NVFP4’s scale factor to denote the selected type. Also designs IF3 and IF6 formats.
Result: IF4 outperforms existing 4-bit block-scaled formats, achieving lower loss during quantized training and higher accuracy in post-training quantization. Hardware design shows IF4 can be efficiently implemented.
Conclusion: IF4 provides superior 4-bit quantization for LLMs by adaptively choosing between float and integer representations, maintaining accuracy while being hardware-friendly.
Abstract: NVFP4 has grown increasingly popular as a 4-bit format for quantizing large language models due to its hardware support and its ability to retain useful information with relatively few bits per parameter. However, the format is not without limitations: recent work has shown that NVFP4 suffers from its error distribution, resulting in large amounts of quantization error on near-maximal values in each group of 16 values. In this work, we leverage this insight to design new Adaptive Block-Scaled Data Types that can adapt to the distribution of their input values. For four-bit quantization, our proposed IF4 (Int/Float 4) data type selects between FP4 and INT4 representations for each group of 16 values, which are then scaled by an E4M3 scale factor as is done with NVFP4. The selected data type is denoted using the scale factor’s sign bit, which is currently unused in NVFP4, and we apply the same insight to design formats for other bit-widths, including IF3 and IF6. When used to quantize language models, we find that IF4 outperforms existing 4-bit block-scaled formats, achieving lower loss during quantized training and achieving higher accuracy on many tasks in post-training quantization. We additionally design and evaluate an IF4 Multiply-Accumulate (MAC) unit to demonstrate that IF4 can be implemented efficiently in next-generation hardware accelerators. Our code is available at https://github.com/mit-han-lab/fouroversix.
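The per-group selection between FP4 and INT4 can be illustrated with a toy quantizer. The E2M1 magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6} is the standard FP4 value set, but the scaling rule and error criterion below are simplifying assumptions rather than the paper's exact scheme, and the returned flag only mimics the role of the scale factor's sign bit.

```python
import numpy as np

_FP4_MAGS = np.array([0, .5, 1, 1.5, 2, 3, 4, 6])
FP4_GRID = np.concatenate([-_FP4_MAGS[::-1], _FP4_MAGS])  # signed E2M1 values

def quantize_to_grid(x, grid):
    """Round each element of x to the nearest value in grid."""
    idx = np.abs(x[:, None] - grid[None, :]).argmin(axis=1)
    return grid[idx]

def if4_quantize_group(group):
    """Quantize one group of 16 values with both FP4 and INT4, then keep
    whichever reconstruction has lower squared error (the IF4 idea)."""
    amax = np.abs(group).max() or 1.0
    fp4 = quantize_to_grid(group * (6.0 / amax), FP4_GRID) * (amax / 6.0)
    int4 = np.round(group * (7.0 / amax)).clip(-8, 7) * (amax / 7.0)
    use_fp4 = np.square(group - fp4).sum() <= np.square(group - int4).sum()
    return (fp4 if use_fp4 else int4), use_fp4
```

Intuitively, groups dominated by one near-maximal outlier favor FP4's non-uniform spacing, while near-uniform groups favor INT4's even spacing, which is exactly the adaptivity the format exploits.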
[85] Beyond In-Distribution Success: Scaling Curves of CoT Granularity for Language Model Generalization
Ru Wang, Wei Huang, Selena Song, Haoyu Zhang, Qian Niu, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
Main category: cs.CL
TL;DR: CoT reasoning enhances OOD generalization for compound tasks by forcing internalization of valid dependency structures, with finer-grained CoT data and positional embeddings improving generalization performance.
Details
Motivation: Transformer-based language models struggle with out-of-distribution generalization on compound tasks despite near-perfect in-distribution performance, motivating investigation of Chain-of-Thought reasoning as a solution.
Method: Controlled experiments across compound tasks comparing QA-trained models with CoT reasoning, analyzing granularity of CoT data, sample efficiency, and theoretical analysis of shortcut learning vs. true reasoning principles.
Result: CoT reasoning significantly improves OOD generalization, with finer-grained CoT data correlating with better performance and remarkable sample efficiency (matching QA performance with 80% less data).
Conclusion: CoT reasoning is a crucial training paradigm for enabling LM generalization under real-world distributional shifts for compound tasks, with theoretical justification for its effectiveness.
Abstract: Generalization to novel compound tasks under distribution shift is important for deploying transformer-based language models (LMs). This work investigates Chain-of-Thought (CoT) reasoning as a means to enhance OOD generalization. Through controlled experiments across several compound tasks, we reveal three key insights: (1) While QA-trained models achieve near-perfect in-distribution accuracy, their OOD performance degrades catastrophically, even with 10000k+ training examples; (2) the granularity of CoT data strongly correlates with generalization performance; finer-grained CoT data leads to better generalization; (3) CoT exhibits remarkable sample efficiency, matching QA performance with much less (even 80%) data. Theoretically, we demonstrate that compound tasks inherently permit shortcuts in Q-A data that misalign with true reasoning principles, while CoT forces internalization of valid dependency structures, and thus can achieve better generalization. Further, we show that transformer positional embeddings can amplify generalization by emphasizing subtask condition recurrence in long CoT sequences. Our combined theoretical and empirical analysis provides compelling evidence for CoT reasoning as a crucial training paradigm for enabling LM generalization under real-world distributional shifts for compound tasks.
[86] Benchmarking NLP-supported Language Sample Analysis for Swiss Children’s Speech
Anja Ryser, Yingqiang Gao, Sarah Ebling
Main category: cs.CL
TL;DR: Paper introduces NLP methods for semi-automatic language sample analysis to support diagnosis of developmental language disorders in children, using German speech data without commercial LLMs.
Details
Motivation: Language sample analysis (LSA) is valuable for diagnosing developmental language disorders but is labor-intensive, limiting its clinical use. The research aims to make LSA more efficient by automating parts of the process while maintaining human specialist involvement.
Method: Used natural language processing methods (not commercial large language models) applied to transcribed speech data from 119 children in German-speaking Switzerland with typical and atypical language development. Focused on identifying optimal practices for semi-automatic LSA.
Result: Preliminary findings show potential for integrating locally deployed NLP methods into semi-automatic language sample analysis to support speech-language pathologists in more efficient DLD diagnosis.
Conclusion: Locally deployed NLP methods can enhance the efficiency of language sample analysis for developmental language disorder diagnosis while maintaining human specialist involvement in the diagnostic process.
Abstract: Language sample analysis (LSA) is a process that complements standardized psychometric tests for diagnosing, for example, developmental language disorder (DLD) in children. However, its labour-intensive nature has limited its use in speech-language pathology practice. We introduce an approach that leverages natural language processing (NLP) methods that do not rely on commercial large language models (LLMs) applied to transcribed speech data from 119 children in the German-speaking part of Switzerland with typical and atypical language development. This preliminary study aims to identify optimal practices that support speech-language pathologists in diagnosing DLD more efficiently with active involvement of human specialists. Preliminary findings underscore the potential of integrating locally deployed NLP methods into the process of semi-automatic LSA.
[87] Cultural Biases of Large Language Models and Humans in Historical Interpretation
Fabio Celli, Georgios Spathulas
Main category: cs.CL
TL;DR: Comparison of human vs LLM historical annotations shows both have cultural bias, but LLMs achieve higher consensus on interpreting historical facts from short texts, with different disagreement patterns.
Details
Motivation: To understand how Large Language Models compare to humans in historical annotation tasks, examining cultural bias and consensus in interpreting historical facts from short texts, with implications for digital humanities.
Method: Comparative analysis of historical annotations performed by humans and Large Language Models, examining patterns of agreement/disagreement, cultural bias, and error types in interpreting historical facts from short texts.
Result: Both humans and LLMs exhibit cultural bias, but LLMs achieve higher consensus on historical fact interpretation. Humans disagree based on personal biases, while LLMs disagree due to information skipping or hallucinations.
Conclusion: Findings enable large-scale annotation and quantitative analysis of historical data for digital humanities, offering new educational/research opportunities to explore historical interpretations from different LLMs and foster critical thinking about bias.
Abstract: This paper compares historical annotations by humans and Large Language Models. The findings reveal that both exhibit some cultural bias, but Large Language Models achieve a higher consensus on the interpretation of historical facts from short texts. While humans tend to disagree on the basis of their personal biases, Large Models disagree when they skip information or produce hallucinations. These findings have significant implications for digital humanities, enabling large-scale annotation and quantitative analysis of historical data. This offers new educational and research opportunities to explore historical interpretations from different Language Models, fostering critical thinking about bias.
[88] BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text
Jiageng Wu, Bowen Gu, Ren Zhou, Kevin Xie, Doug Snyder, Yixing Jiang, Valentina Carducci, Richard Wyss, Rishi J Desai, Emily Alsentzer, Leo Anthony Celi, Adam Rodman, Sebastian Schneeweiss, Jonathan H. Chen, Santiago Romero-Brufau, Kueiyu Joshua Lin, Jie Yang
Main category: cs.CL
TL;DR: BRIDGE is a comprehensive multilingual benchmark for evaluating LLMs on real-world clinical data across 87 tasks, 9 languages, and 14 clinical specialties, showing open-source models can match proprietary ones.
Details
Motivation: Current LLM benchmarks for medical applications rely on exam-style questions or PubMed text, failing to capture real-world clinical data complexity. There's a need for comprehensive evaluation using actual electronic health records across diverse clinical scenarios and languages.
Method: Created BRIDGE benchmark with 87 tasks from real-world clinical data across 9 languages, covering 8 task types, 6 clinical stages, 20 applications, and 14 specialties. Evaluated 95 LLMs (including DeepSeek-R1, GPT-4o, Gemini, Qwen3) under various inference strategies.
Result: Substantial performance variation across model sizes, languages, NLP tasks, and clinical specialties. Open-source LLMs can achieve performance comparable to proprietary models. Medically fine-tuned LLMs based on older architectures often underperform versus updated general-purpose models.
Conclusion: BRIDGE provides a foundational resource for developing and evaluating LLMs in real-world clinical text understanding, showing the importance of comprehensive multilingual evaluation on actual clinical data rather than simplified benchmarks.
Abstract: Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, benchmarking on large-scale real-world data such as electronic health records (EHRs) is critical, as clinical decisions are directly informed by these sources, yet current evaluations remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world clinical data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. It covers eight major task types spanning the entire continuum of patient care across six clinical stages and 20 representative applications, including triage and referral, consultation, information extraction, diagnosis, prognosis, and billing coding, and involves 14 clinical specialties. We systematically evaluated 95 LLMs (including DeepSeek-R1, GPT-4o, Gemini series, and Qwen3 series) under various inference strategies. Our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform versus updated general-purpose models. The BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding. The BRIDGE leaderboard: https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard
[89] Understanding the Anchoring Effect of LLM with Synthetic Data: Existence, Mechanism, and Potential Mitigations
Yiming Huang, Biquan Bie, Zuqiu Na, Weilin Ruan, Songxin Lei, Yutao Yue, Xinlei He
Main category: cs.CL
TL;DR: LLMs exhibit anchoring bias similar to humans, with shallow layers most affected; reasoning helps mitigate but conventional strategies fail
Details
Motivation: As LLMs become more prevalent, understanding their cognitive biases like the anchoring effect is crucial for reliability and fairness.
Method: Created SynAnchors dataset for large-scale studies, benchmarked widely used LLMs with refined evaluation metrics.
Result: LLMs commonly exhibit anchoring bias, primarily in shallow layers; reasoning provides some mitigation but conventional strategies don’t work
Conclusion: Anchoring bias is a significant issue in LLMs requiring new mitigation approaches beyond conventional methods
Abstract: The rise of Large Language Models (LLMs) like ChatGPT has advanced natural language processing, yet concerns about cognitive biases are growing. In this paper, we investigate the anchoring effect, a cognitive bias where the mind relies heavily on the first information as anchors to make affected judgments. We explore whether LLMs are affected by anchoring, the underlying mechanisms, and potential mitigation strategies. To facilitate studies at scale on the anchoring effect, we introduce a new dataset, SynAnchors (https://huggingface.co/datasets/TimTargaryen/SynAnchors). Combining refined evaluation metrics, we benchmark current widely used LLMs. Our findings show that LLMs’ anchoring bias exists commonly with shallow-layer acting and can not be eliminated by conventional strategies, while reasoning can offer some mitigation.
[90] Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods
Shaina Raza, Rizwan Qureshi, Azib Farooq, Marcelo Lotif, Aman Chadha, Deval Pandya, Christos Emmanouilidis
Main category: cs.CL
TL;DR: Model immunization trains LLMs on curated false claim-correction pairs as “vaccine doses” to teach them to recognize and reject misinformation patterns, improving truthfulness without harming general capabilities.
Details
Motivation: LLMs reproduce misinformation by learning persuasive linguistic patterns (hedging, false presuppositions, fabricated citations), not just memorizing false facts. Current approaches like post-hoc filtering or preference alignment don't provide direct negative supervision on falsehoods.
Method: Supervised fine-tuning on curated (false claim, correction) pairs injected as small “vaccine doses” (5-10% of tokens) alongside truthful data. Introduces direct negative supervision on labeled falsehoods. Includes design requirements: dosage, labeling, quarantine, and diversity.
Result: Across four open weight model families: improves TruthfulQA accuracy by 12 points, increases misinformation rejection rates by 30 points, while preserving overall model capability.
Conclusion: Immunization is a practical and scalable component of responsible LLM development. Advocates for standardized vaccine corpora and benchmarks to evaluate generalization.
Abstract: Large language models (LLMs) reproduce misinformation not by memorizing false facts alone, but by learning the linguistic patterns that make falsehoods persuasive, such as hedging, false presuppositions, and fabricated citations. We propose model immunization, a training paradigm based on supervised fine-tuning over curated (false claim, correction) pairs, injected as small vaccine doses (5 to 10% of tokens) alongside truthful data. Unlike post-hoc filtering or preference-based alignment, immunization introduces direct negative supervision on labeled falsehoods. Across four open weight model families, this approach improves TruthfulQA accuracy by 12 points and increases misinformation rejection rates by 30 points, while preserving overall model capability. We further outline key design requirements, including dosage, labeling, quarantine, and diversity and advocate for standardized vaccine corpora and benchmarks to evaluate generalization. These findings position immunization as a practical and scalable component of responsible LLM development. Project page: https://github.com/shainarazavi/ai-vaccine/
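The dosage idea, injecting vaccine pairs at 5 to 10% of training tokens, can be sketched as a simple data-mixing step. The labeling template and whitespace token counting below are illustrative assumptions, not the paper's corpus format.

```python
def mix_vaccine_doses(truthful_texts, vaccine_pairs, dose=0.07):
    """Interleave labeled (false claim, correction) pairs into a stream
    of truthful training texts until roughly `dose` of the truthful
    token budget has been spent on vaccine samples."""
    def n_tokens(s):
        return len(s.split())  # crude whitespace tokenizer for the sketch

    budget = dose * sum(n_tokens(t) for t in truthful_texts)
    mixed, used, i = [], 0, 0
    for text in truthful_texts:
        mixed.append(text)
        if used < budget and i < len(vaccine_pairs):
            claim, correction = vaccine_pairs[i]
            sample = f"[FALSE CLAIM] {claim} [CORRECTION] {correction}"
            mixed.append(sample)
            used += n_tokens(sample)
            i += 1
    return mixed
```

The explicit [FALSE CLAIM] label stands in for the paper's labeling and quarantine requirements: falsehoods enter training only in marked, corrected form, never as unlabeled truthful text.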
[91] LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops
Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Jinglun Li, Haijing Guo, Dingkang Yang, Zhaoyu Chen, Wenqiang Zhang
Main category: cs.CL
TL;DR: LingoLoop is an attack framework that exploits MLLM vulnerabilities by inducing excessive verbose and repetitive outputs through POS-aware delay mechanisms and generative path pruning, causing resource exhaustion.
Details
Motivation: MLLMs require substantial computational resources during inference, making them vulnerable to resource exhaustion attacks. Prior attacks were limited by not considering token-level POS characteristics and sentence-level structural patterns affecting output counts.
Method: Two main mechanisms: 1) POS-Aware Delay Mechanism that postpones EOS token generation by adjusting attention weights based on POS information, and 2) Generative Path Pruning Mechanism that limits hidden state magnitude to encourage persistent repetitive loops.
Result: LingoLoop successfully traps MLLMs like Qwen2.5-VL-3B in generative loops, driving them to generation limits and inducing outputs with up to 367x more tokens than clean inputs, causing significant energy consumption surges.
Conclusion: The attack exposes significant vulnerabilities in MLLMs related to resource exhaustion, posing challenges for reliable deployment and highlighting the need for defensive mechanisms against such attacks.
Abstract: Multimodal Large Language Models (MLLMs) have shown great promise but require substantial computational resources during inference. Attackers can exploit this by inducing excessive output, leading to resource exhaustion and service degradation. Prior energy-latency attacks aim to increase generation time by broadly shifting the output token distribution away from the EOS token, but they neglect the influence of token-level Part-of-Speech (POS) characteristics on EOS and sentence-level structural patterns on output counts, limiting their efficacy. To address this, we propose LingoLoop, an attack designed to induce MLLMs to generate excessively verbose and repetitive sequences. First, we find that the POS tag of a token strongly affects the likelihood of generating an EOS token. Based on this insight, we propose a POS-Aware Delay Mechanism to postpone EOS token generation by adjusting attention weights guided by POS information. Second, we identify that constraining output diversity to induce repetitive loops is effective for sustained generation. We introduce a Generative Path Pruning Mechanism that limits the magnitude of hidden states, encouraging the model to produce persistent loops. Extensive experiments on models like Qwen2.5-VL-3B demonstrate LingoLoop’s powerful ability to trap them in generative loops; it consistently drives them to their generation limits and, when those limits are relaxed, can induce outputs with up to 367x more tokens than clean inputs, triggering a commensurate surge in energy consumption. These findings expose significant MLLMs’ vulnerabilities, posing challenges for their reliable deployment.
[92] GHTM: A Graph-based Hybrid Topic Modeling Approach with a Benchmark Dataset for the Low-Resource Bengali Language
Farhana Haque, Md. Abdur Rahman, Sumon Ahmed
Main category: cs.CL
TL;DR: GHTM: A novel graph-based hybrid topic model for Bengali that combines TF-IDF-weighted GloVe embeddings, GCN, and NMF to achieve state-of-the-art topic coherence and diversity, with strong cross-lingual generalization.
Details
Motivation: Topic modeling in Bengali is understudied due to lack of resources, standardized evaluation frameworks, modern methodological approaches, and reproducible implementations. Existing research has only three Bengali-specific architectures and lacks diverse datasets beyond newspaper corpora.
Method: Proposes GHTM (Graph-based Hybrid Topic Model) that: 1) Uses TF-IDF-weighted GloVe embeddings to represent text documents, 2) Builds a document-similarity graph and applies Graph Convolutional Networks (GCN) for representation refinement through neighborhood aggregation, 3) Applies Non-negative Matrix Factorization (NMF) to extract interpretable topics from refined representations.
Result: GHTM achieves superior topic coherence (NPMI: 0.27-0.28) and diversity compared to existing methods while maintaining computational efficiency across datasets of varying scales. It also demonstrates strong cross-lingual generalization, outperforming established graph-based models on the English 20Newsgroups benchmark.
Conclusion: GHTM addresses critical gaps in Bengali topic modeling research by providing a novel architecture with superior performance, computational efficiency, and cross-lingual generalization. The introduction of NCTBText dataset provides much-needed topical diversity beyond newspaper-centric Bengali corpora for future research.
Abstract: Topic modeling is a Natural Language Processing (NLP) technique used to discover latent themes and abstract topics from text corpora by grouping co-occurring keywords. Although widely researched in English, topic modeling remains understudied in Bengali due to a lack of adequate resources and initiatives. Existing Bengali topic modeling research lacks standardized evaluation frameworks with comprehensive baselines and diverse datasets, exploration of modern methodological approaches, and reproducible implementations, with only three Bengali-specific architectures proposed to date. To address these gaps, this study presents a comprehensive evaluation of traditional and contemporary topic modeling approaches across three Bengali datasets and introduces GHTM (Graph-based Hybrid Topic Model), a novel architecture that strategically integrates TF-IDF-weighted GloVe embeddings, Graph Convolutional Networks (GCN), and Non-negative Matrix Factorization (NMF). GHTM represents text documents using hybrid TF-IDF-weighted GloVe embeddings. It builds a document-similarity graph and leverages GCN to refine the representations through neighborhood aggregation. Then, it finally decomposes the refined representations using NMF to extract interpretable topics. Experimental results demonstrate that GHTM achieves superior topic coherence (NPMI: 0.27-0.28) and diversity compared to existing methods while maintaining computational efficiency across datasets of varying scales. The model also demonstrates strong cross-lingual generalization, outperforming established graph-based models on the English 20Newsgroups benchmark. Additionally, we introduce NCTBText, a diverse Bengali textbook-based dataset comprising 8,650 text documents, curated from eight subject areas, providing much-needed topical diversity beyond newspaper-centric Bengali corpora and serving as a benchmark for future research.
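The GHTM pipeline (weighted embeddings → document-similarity graph → GCN aggregation → NMF) can be sketched in miniature. The following is an illustrative pure-Python sketch under simplifying assumptions, not the authors' implementation: a single unweighted propagation step stands in for a trained GCN, plain multiplicative-update NMF replaces the paper's configuration, the toy features are assumed non-negative, and the function names and similarity threshold are invented for the example.

```python
import math
import random

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def gcn_smooth(X, threshold=0.5):
    """One GCN-style propagation step over a cosine-similarity document graph:
    each row becomes the mean of itself and its neighbours (A + I, row-normalised)."""
    n, d = len(X), len(X[0])
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if j == i or cosine(X[i], X[j]) >= threshold]
        out.append([sum(X[j][c] for j in nbrs) / len(nbrs) for c in range(d)])
    return out

def nmf(V, k, iters=300, seed=0):
    """Multiplicative-update NMF: factor non-negative V (n x d) into W (n x k) @ H (k x d)."""
    rng = random.Random(seed)
    n, d, eps = len(V), len(V[0]), 1e-9
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(d)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WtV = [[sum(W[i][a] * V[i][j] for i in range(n)) for j in range(d)] for a in range(k)]
        WtWH = [[sum(sum(W[i][a] * W[i][b] for i in range(n)) * H[b][j] for b in range(k))
                 for j in range(d)] for a in range(k)]
        H = [[H[a][j] * WtV[a][j] / (WtWH[a][j] + eps) for j in range(d)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        VHt = [[sum(V[i][j] * H[a][j] for j in range(d)) for a in range(k)] for i in range(n)]
        WHHt = [[sum(W[i][b] * sum(H[b][j] * H[a][j] for j in range(d)) for b in range(k))
                 for a in range(k)] for i in range(n)]
        W = [[W[i][a] * VHt[i][a] / (WHHt[i][a] + eps) for a in range(k)] for i in range(n)]
    return W, H
```

Running `gcn_smooth` then `nmf` on a small non-negative document-feature matrix yields smoothed representations and a non-negative topic factorization, mirroring the three stages of the described architecture.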
[93] Link Prediction for Event Logs in the Process Industry
Anastasia Zhukova, Thomas Walton, Christian E. Lobmüller, Bela Gipp
Main category: cs.CL
TL;DR: A record linking model for German process industry shift logs using cross-document coreference resolution with NLI and semantic text similarity for link prediction.
Details
Motivation: Fragmented event logs in process industry shift books hinder effective knowledge retrieval and problem-solving; need to link related records across documents to improve graph-based RAG systems.
Method: Adapts cross-document coreference resolution (CDCR) task, combines state-of-the-art CDCR models with natural language inference (NLI) and semantic text similarity (STS) principles for link prediction.
Result: Record linking model outperformed baseline NLP and STS methods by 28% (11.43 percentage points) and 27.4% (11.21 percentage points) respectively.
Conclusion: Common NLP tasks can be effectively combined and adapted for domain-specific settings to improve data quality and connectivity in industrial knowledge management systems.
Abstract: In the era of graph-based retrieval-augmented generation (RAG), link prediction is a significant preprocessing step for improving the quality of fragmented or incomplete domain-specific data for the graph retrieval. Knowledge management in the process industry uses RAG-based applications to optimize operations, ensure safety, and facilitate continuous improvement by effectively leveraging operational data and past insights. A key challenge in this domain is the fragmented nature of event logs in shift books, where related records are often kept separate, even though they belong to a single event or process. This fragmentation hinders the recommendation of previously implemented solutions to users, which is crucial in the timely problem-solving at live production sites. To address this problem, we develop a record linking model, which we define as a cross-document coreference resolution (CDCR) task. Record linking adapts the task definition of CDCR and combines two state-of-the-art CDCR models with the principles of natural language inference (NLI) and semantic text similarity (STS) to perform link prediction. The evaluation shows that our record linking model outperformed the best versions of our baselines, i.e., NLP and STS, by 28% (11.43 p) and 27.4% (11.21 p), respectively. Our work demonstrates that common NLP tasks can be combined and adapted to a domain-specific setting of the German process industry, improving data quality and connectivity in shift logs.
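The link-prediction step can be illustrated with a deliberately simplified stand-in. Where the paper combines trained CDCR models with NLI and STS, the sketch below reduces similarity scoring to token-level Jaccard overlap; `sts_score`, `predict_links`, and the threshold value are assumptions made for the example, not the paper's method.

```python
def sts_score(a, b):
    """Toy semantic-similarity stand-in: Jaccard overlap of lower-cased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def predict_links(records, threshold=0.3):
    """Predict links between shift-log records whose pairwise similarity clears a threshold."""
    return [(i, j)
            for i in range(len(records))
            for j in range(i + 1, len(records))
            if sts_score(records[i], records[j]) >= threshold]
```

A real system would replace `sts_score` with model-based NLI/STS scoring, but the pairwise link-prediction loop over fragmented records keeps the same shape.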
[94] AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation
Tiancheng Huang, Ruisheng Cao, Yuxin Zhang, Zhangyi Kang, Zijian Wang, Chenrun Wang, Yijie Luo, Hang Zheng, Lirong Qian, Lu Chen, Kai Yu
Main category: cs.CL
TL;DR: AirQA is a comprehensive human-annotated paper QA dataset for AI papers with multi-task, multi-modal evaluation, and ExTrActor is an automated framework for instruction data synthesis using LLM-based agents.
Details
Motivation: The volume of academic papers makes it difficult for researchers to extract key information efficiently. While LLM-based agents can automate QA workflows, there's a lack of comprehensive benchmarks to evaluate their capabilities, and training interactive agents is hindered by shortage of high-quality interaction trajectories.
Method: Created AirQA dataset with 13,956 AI papers and 1,246 human-annotated questions covering multi-task, multi-modal, and instance-level evaluation. Developed ExTrActor framework with three LLM-based agents for automated instruction data synthesis, including example generation and trajectory collection without human intervention.
Result: Most models underperform on AirQA, demonstrating dataset quality. ExTrActor consistently improves multi-turn tool-use capability of small models, enabling them to achieve performance comparable to larger models.
Conclusion: AirQA provides a comprehensive benchmark for evaluating paper QA capabilities, while ExTrActor offers an effective automated solution for instruction data synthesis to enhance LLM-based agents’ performance on scientific paper understanding tasks.
Abstract: The growing volume of academic papers has made it increasingly difficult for researchers to efficiently extract key information. While large language models (LLMs) based agents are capable of automating question answering (QA) workflows for scientific papers, there still lacks a comprehensive and realistic benchmark to evaluate their capabilities. Moreover, training an interactive agent for this specific task is hindered by the shortage of high-quality interaction trajectories. In this work, we propose AirQA, a human-annotated comprehensive paper QA dataset in the field of artificial intelligence (AI), with 13,956 papers and 1,246 questions, that encompasses multi-task, multi-modal and instance-level evaluation. Furthermore, we propose ExTrActor, an automated framework for instruction data synthesis. With three LLM-based agents, ExTrActor can perform example generation and trajectory collection without human intervention. Evaluations of multiple open-source and proprietary models show that most models underperform on AirQA, demonstrating the quality of our dataset. Extensive experiments confirm that ExTrActor consistently improves the multi-turn tool-use capability of small models, enabling them to achieve performance comparable to larger ones.
[95] Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection
Jun Seo Kim, Hyemi Kim, Woo Joo Oh, Hongjin Cho, Hochul Lee, Hye Hyeon Kim
Main category: cs.CL
TL;DR: A novel framework combining LLMs with Multiple-Instance Learning for automatic detection of cognitive distortions in mental health NLP, using Emotion-Logic-Behavior decomposition and multi-view attention.
Details
Motivation: Cognitive distortions are closely linked to mental health disorders but challenging to detect automatically due to contextual ambiguity, co-occurrence, and semantic overlap. Existing methods lack interpretability and fine-grained reasoning capabilities.
Method: Proposes a framework combining LLMs with Multiple-Instance Learning (MIL). Each utterance is decomposed into Emotion, Logic, and Behavior (ELB) components. LLMs process these to infer distortion instances with type, expression, and salience scores. A Multi-View Gated Attention mechanism integrates instances for final classification.
Result: Experiments on Korean (KoACD) and English (Therapist QA) datasets show that incorporating ELB decomposition and LLM-inferred salience scores improves classification performance, particularly for distortions with high interpretive ambiguity.
Conclusion: The approach provides a psychologically grounded and generalizable method for fine-grained reasoning in mental health NLP, enhancing interpretability and expression-level analysis of cognitive distortions.
Abstract: Cognitive distortions have been closely linked to mental health disorders, yet their automatic detection remained challenging due to contextual ambiguity, co-occurrence, and semantic overlap. We proposed a novel framework that combines Large Language Models (LLMs) with Multiple-Instance Learning (MIL) architecture to enhance interpretability and expression-level reasoning. Each utterance was decomposed into Emotion, Logic, and Behavior (ELB) components, which were processed by LLMs to infer multiple distortion instances, each with a predicted type, expression, and model-assigned salience score. These instances were integrated via a Multi-View Gated Attention mechanism for final classification. Experiments on Korean (KoACD) and English (Therapist QA) datasets demonstrate that incorporating ELB and LLM-inferred salience scores improves classification performance, especially for distortions with high interpretive ambiguity. Our results suggested a psychologically grounded and generalizable approach for fine-grained reasoning in mental health NLP.
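The attention-based instance aggregation can be sketched with the standard gated-attention MIL formulation, with each instance's score additionally scaled by its LLM-assigned salience. A toy pure-Python sketch, not the paper's architecture: the matrices `V`, `U`, and vector `w` would normally be learned, and the dimensions here are invented.

```python
import math

def gated_attention_pool(instances, salience, V, U, w):
    """Gated-attention MIL pooling: each instance h_i gets score
    s_i * w . (tanh(V h_i) * sigmoid(U h_i)); a softmax over scores yields
    attention weights, and the bag embedding is the weighted sum of instances."""
    def dot(row, h):
        return sum(r * x for r, x in zip(row, h))
    scores = []
    for h, s in zip(instances, salience):
        t = [math.tanh(dot(row, h)) for row in V]
        g = [1.0 / (1.0 + math.exp(-dot(row, h))) for row in U]
        scores.append(s * sum(wi * ti * gi for wi, ti, gi in zip(w, t, g)))
    m = max(scores)  # subtract max for a numerically stable softmax
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    attn = [e / z for e in exps]
    d = len(instances[0])
    bag = [sum(attn[i] * instances[i][c] for i in range(len(instances))) for c in range(d)]
    return bag, attn
```

With equal base scores, a higher salience shifts attention toward that instance, which is the intended effect of feeding LLM-inferred salience into the pooling.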
[96] Dual-Space Smoothness for Robust and Balanced LLM Unlearning
Han Yan, Zheyuan Liu, Meng Jiang
Main category: cs.CL
TL;DR: PRISM: A unified framework for robust machine unlearning that enforces dual-space smoothness in representation and parameter spaces to balance unlearning metrics and defend against attacks.
Details
Motivation: Address limitations of current machine unlearning methods that suffer from catastrophic forgetting, metric imbalance, and vulnerability to relearn/jailbreak attacks, while balancing privacy, utility, and safety concerns.
Method: PRISM enforces dual-space smoothness through two stages: (1) a representation-space stage with a robustly trained probe to defend against jailbreak attacks, and (2) a parameter-space stage that decouples retain-forget gradient conflicts and smooths the parameter space to mitigate relearning attacks.
Result: Extensive experiments on WMDP and MUSE benchmarks across conversational-dialogue and continuous-text settings show PRISM outperforms SOTA baselines under multiple attacks while achieving better balance among key metrics.
Conclusion: PRISM provides a robust framework for machine unlearning that addresses current limitations by enforcing dual-space smoothness, achieving balanced performance across privacy, utility, and safety metrics while defending against attacks.
Abstract: As large language models evolve, Machine Unlearning has emerged to address growing concerns around user privacy, copyright infringement, and overall safety. Yet state-of-the-art (SOTA) unlearning methods often suffer from catastrophic forgetting and metric imbalance, for example, by over-optimizing one objective (e.g., unlearning effectiveness, utility preservation, or privacy protection) at the expense of others. In addition, small perturbations in the representation or parameter space can be exploited by relearn and jailbreak attacks. To address these challenges, we propose PRISM, a unified framework that enforces dual-space smoothness in representation and parameter spaces to improve robustness and balance unlearning metrics. PRISM consists of two smoothness optimization stages: (i) a representation space stage that employs a robustly trained probe to defend against jailbreak attacks, and (ii) a parameter-space stage that decouples retain-forget gradient conflicts, reduces imbalance, and smooths the parameter space to mitigate relearning attacks. Extensive experiments on WMDP and MUSE, across conversational-dialogue and continuous-text settings, show that PRISM outperforms SOTA baselines under multiple attacks while achieving a better balance among key metrics.
[97] The Rise of AfricaNLP: Contributions, Contributors, Community Impact, and Bibliometric Analysis
Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Ibrahim Said Ahmad, Isa Inuwa-Dutse, Abrham Belete Haile, Grigori Sidorov, Eusebio Ricardez Vazquez, Iqra Ameer, Idris Abdulmumin, Tajuddeen Gwadabe, Vukosi Marivate, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad
Main category: cs.CL
TL;DR: Analysis of two decades of African NLP research trends, contributions, and contributors using a dataset of 2.2K papers and 7.8K annotated contribution sentences.
Details
Motivation: To track the progress of African NLP research and automatically analyze contributions to understand the nature of the field and researchers in Africa, addressing gaps in understanding regional NLP development.
Method: Quantitative examination of two decades (2005-2025) using a dataset of 2.2K NLP papers, 4.9K authors, and 7.8K human-annotated contribution sentences, with benchmark results and a research explorer tool.
Result: Created AfricaNLPContributions dataset and research explorer tool that provides insights into AfricaNLP research trends, publications, topics, tasks, and contributor patterns over 20 years.
Conclusion: The dataset and tool offer a powerful lens for tracing AfricaNLP research trends and enable data-driven research approaches for understanding regional NLP development.
Abstract: Natural Language Processing (NLP) is undergoing constant transformation, as Large Language Models (LLMs) are driving daily breakthroughs in research and practice. In this regard, tracking the progress of NLP research and automatically analyzing the contributions of research papers provides key insights into the nature of the field and the researchers. This study explores the progress of African NLP (AfricaNLP) by asking (and answering) research questions about the progress of AfricaNLP (publications, NLP topics, and NLP tasks), contributions (data, method, and task), and contributors (authors, affiliated institutions, and funding bodies). We quantitatively examine two decades (2005 - 2025) of contributions to AfricaNLP research, using a dataset of 2.2K NLP papers, 4.9K contributing authors, and 7.8K human-annotated contribution sentences (AfricaNLPContributions), along with benchmark results. Our dataset and AfricaNLP research explorer tool will provide a powerful lens for tracing AfricaNLP research trends and holds potential for generating data-driven research approaches.
[98] Neuron-Level Analysis of Cultural Understanding in Large Language Models
Taisei Yamamoto, Ryoma Kumon, Danushka Bollegala, Hitomi Yanaka
Main category: cs.CL
TL;DR: Neuron-level analysis reveals culture-general and culture-specific neurons in LLMs, showing they’re concentrated in shallow-middle layers and crucial for cultural understanding but not general NLU.
Details
Motivation: LLMs exhibit cultural bias and limited awareness of underrepresented cultures, with mechanisms of cultural understanding remaining underexplored. Need to understand internal mechanisms driving cultural behavior in LLMs.
Method: Conduct neuron-level analysis using gradient-based scoring with filtering to identify culture-general neurons (contribute to all cultures) and culture-specific neurons (tied to individual cultures). Validate through suppression experiments and analyze layer distributions.
Result: Culture-general and culture-specific neurons account for <1% of all neurons, concentrated in shallow to middle MLP layers. Suppressing them degrades cultural benchmark performance by up to 30% while general NLU performance remains unaffected. Culture-specific neurons support knowledge of related cultures. Training on NLU benchmarks can diminish cultural understanding when updating modules with many culture-general neurons.
Conclusion: The study provides insights into LLMs’ internal mechanisms for cultural understanding and offers practical guidance for model training and engineering to improve cultural awareness while maintaining general language capabilities.
Abstract: As large language models (LLMs) are increasingly deployed worldwide, ensuring their fair and comprehensive cultural understanding is important. However, LLMs exhibit cultural bias and limited awareness of underrepresented cultures, while the mechanisms underlying their cultural understanding remain underexplored. To fill this gap, we conduct a neuron-level analysis to identify neurons that drive cultural behavior, introducing a gradient-based scoring method with additional filtering for precise refinement. We identify culture-general neurons contributing to cultural understanding regardless of cultures, and culture-specific neurons tied to an individual culture. Culture-general and culture-specific neurons account for less than 1% of all neurons and are concentrated in shallow to middle MLP layers. We validate their role by showing that suppressing them substantially degrades performance on cultural benchmarks (by up to 30%), while performance on general natural language understanding (NLU) benchmarks remains largely unaffected. Moreover, we show that culture-specific neurons support knowledge of not only the target culture, but also related cultures. Finally, we demonstrate that training on NLU benchmarks can diminish models’ cultural understanding when we update modules containing many culture-general neurons. These findings provide insights into the internal mechanisms of LLMs and offer practical guidance for model training and engineering. Our code is available at https://github.com/ynklab/CULNIG
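The scoring-and-suppression idea can be illustrated on a toy ReLU network: score each hidden neuron by the magnitude of the output's derivative with respect to a per-neuron mask at mask = 1 (which equals activation times downstream gradient, a grad-times-activation style score), then zero out the top-scoring neurons. This is a sketch under toy assumptions, not the paper's scoring method; finite differences stand in for backpropagation, and all names and weights are invented.

```python
def toy_net(x, mask, W1, W2):
    """One-hidden-layer ReLU net; `mask` scales each hidden neuron (1 = keep, 0 = suppress)."""
    hidden = [m * max(0.0, sum(w * xi for w, xi in zip(row, x))) for row, m in zip(W1, mask)]
    return sum(w2 * h for w2, h in zip(W2, hidden))

def neuron_scores(x, W1, W2, eps=1e-6):
    """Score neuron i by |d(output)/d(mask_i)| at mask = 1, i.e. |activation_i * downstream
    gradient|. Finite differences on the mask stand in for backpropagation here."""
    ones = [1.0] * len(W1)
    base = toy_net(x, ones, W1, W2)
    scores = []
    for i in range(len(W1)):
        bumped = list(ones)
        bumped[i] += eps
        scores.append(abs((toy_net(x, bumped, W1, W2) - base) / eps))
    return scores

def suppress_top(scores, k):
    """Return a mask that zeroes out the k highest-scoring neurons."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    mask = [1.0] * len(scores)
    for i in order[:k]:
        mask[i] = 0.0
    return mask
```

Suppressing the top neuron collapses the contribution that dominated the output, mirroring the paper's validation-by-suppression experiments in miniature.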
[99] CLMN: Concept based Language Models via Neural Symbolic Reasoning
Yibo Yang
Main category: cs.CL
TL;DR: CLMN is a neural-symbolic framework that combines continuous concept embeddings with fuzzy-logic reasoning to maintain both performance and interpretability in NLP, learning adaptive interaction rules between concepts.
Details
Motivation: Existing concept bottleneck models in NLP either use binary activations that harm text representations or latent concepts that weaken semantics, and they rarely model dynamic concept interactions like negation and context, limiting interpretability in critical domains like healthcare and finance.
Method: CLMN represents concepts as continuous, human-readable embeddings and applies fuzzy-logic reasoning to learn adaptive interaction rules. It augments original text features with concept-aware representations and automatically induces interpretable logic rules.
Result: Across multiple datasets and pre-trained language models, CLMN achieves higher accuracy than existing concept-based methods while improving explanation quality.
Conclusion: Integrating neural representations with symbolic reasoning in a unified concept space can yield practical, transparent NLP systems.
Abstract: Deep learning has advanced NLP, but interpretability remains limited, especially in healthcare and finance. Concept bottleneck models tie predictions to human concepts in vision, but NLP versions either use binary activations that harm text representations or latent concepts that weaken semantics, and they rarely model dynamic concept interactions such as negation and context. We introduce the Concept Language Model Network (CLMN), a neural-symbolic framework that keeps both performance and interpretability. CLMN represents concepts as continuous, human-readable embeddings and applies fuzzy-logic reasoning to learn adaptive interaction rules that state how concepts affect each other and the final decision. The model augments original text features with concept-aware representations and automatically induces interpretable logic rules. Across multiple datasets and pre-trained language models, CLMN achieves higher accuracy than existing concept-based methods while improving explanation quality. These results show that integrating neural representations with symbolic reasoning in a unified concept space can yield practical, transparent NLP systems.
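The fuzzy-logic reasoning layer can be sketched with a product t-norm, probabilistic-sum t-conorm, and standard complement over continuous concept activations in [0, 1]. The concept names and the rule below are invented for illustration, and CLMN learns its interaction rules rather than hard-coding them as done here.

```python
def f_and(a, b):
    """Fuzzy AND: product t-norm."""
    return a * b

def f_or(a, b):
    """Fuzzy OR: probabilistic-sum t-conorm."""
    return a + b - a * b

def f_not(a):
    """Fuzzy NOT: standard complement."""
    return 1.0 - a

def positive_rule(c):
    """Toy interaction rule over concept activations:
    positive iff (praise AND NOT negation) OR relief."""
    return f_or(f_and(c["praise"], f_not(c["negation"])), c["relief"])
```

Note how a strong `negation` activation smoothly attenuates the contribution of `praise`, which is exactly the kind of dynamic interaction (negation, context) the model is designed to capture.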
[100] Schema for In-Context Learning
Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung, Varinia Bernales, Alan Aspuru-Guzik
Main category: cs.CL
TL;DR: SA-ICL introduces schema-based in-context learning that extracts abstract reasoning templates from examples to enhance LLM performance on novel tasks, inspired by cognitive schema theory.
Details
Motivation: Traditional ICL lacks explicit knowledge retrieval and transfer mechanisms at the abstraction level. Inspired by cognitive science's schema theory, which posits that humans use pre-existing mental frameworks to structure understanding, the authors aim to enhance LLMs' reasoning by providing explicit schema-based scaffolding.
Method: SA-ICL extracts building blocks of cognition from demonstration examples to create abstracted schemas: lightweight, structured templates of key inferential steps and their relationships. These schemas are then used to augment the model's reasoning process when presented with novel questions.
Result: Experiments on chemistry and physics questions from GPQA dataset show SA-ICL consistently boosts performance (up to 36.19% improvement) when using high-quality single demonstration examples. It reduces reliance on multiple demonstrations and enhances interpretability.
Conclusion: SA-ICL bridges disparate ICL strategies (pattern priming to Chain-of-Thought prompting) and provides a new path for enhancing human-like reasoning in LLMs through explicit schema-based scaffolding.
Abstract: In-Context Learning (ICL) enables transformer-based language models to adapt to new tasks by conditioning on demonstration examples. However, traditional example-driven in-context learning lacks explicit modules for knowledge retrieval and transfer at the abstraction level. Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce Schema-Activated In-Context Learning (SA-ICL). This framework extracts the representation of the building blocks of cognition for the reasoning process instilled from prior examples, creating an abstracted schema, a lightweight, structured template of key inferential steps and their relationships, which is then used to augment a model’s reasoning process when presented with a novel question. We demonstrate that a broad range of large language models (LLMs) lack the capacity to form and utilize internal schema-based learning representations implicitly, but instead benefit significantly from explicit schema-based scaffolding. Across chemistry and physics questions from the GPQA dataset, our experiments show that SA-ICL consistently boosts performance, up to 36.19 percent, when the single demonstration example is of high quality, which simultaneously reduces reliance on the number of demonstrations and enhances interpretability. Schema-Activated In-Context Learning not only bridges disparate ICL strategies ranging from pattern priming to Chain-of-Thought prompting, but also paves a new path for enhancing human-like reasoning in LLMs.
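The schema-activation step ultimately amounts to prompt assembly: an abstracted schema (an ordered list of inferential steps distilled from a worked example) is prepended as scaffolding before the novel question. A minimal sketch; the template wording below is an assumption for illustration, not the paper's prompt.

```python
def build_schema_prompt(schema_steps, question):
    """Assemble an SA-ICL style prompt: an abstracted schema of ordered
    inferential steps scaffolds the model before the novel question.
    Template wording is illustrative, not the paper's."""
    steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(schema_steps))
    return (
        "Reasoning schema abstracted from a prior worked example:\n"
        f"{steps}\n\n"
        f"Question: {question}\n"
        "Apply each schema step in order, then state your final answer."
    )
```

Because the schema carries only the inferential structure, a single high-quality demonstration suffices to produce it, which is consistent with the reduced reliance on demonstration count reported above.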
[101] Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models
Matteo Silvestri, Fabiano Veglianti, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei
Main category: cs.CL
TL;DR: Proposes a framework to detect data contamination in tabular datasets for LLMs using controlled query generation and statistical testing, finding evidence of contamination in 4 out of 8 datasets.
Details
Motivation: Data contamination in LLMs is a growing concern where test performance is inflated by prior exposure to test data rather than genuine generalization. While this issue is recognized in text domains, it remains largely unexplored for tabular data, and existing memorization tests are too coarse for accurate detection.
Method: Develops a framework that: 1) Generates controlled multiple-choice queries from tabular datasets while preserving task structure, 2) Applies systematic transformations to selectively disrupt dataset information while keeping partial knowledge, 3) Uses non-neural baselines for reference performance, and 4) Implements statistical testing to formally detect significant deviations indicating contamination.
Result: Empirical evaluation on eight widely used tabular datasets reveals clear evidence of contamination in four cases, suggesting that performance on downstream tasks involving these datasets may be substantially inflated.
Conclusion: Current evaluation practices for LLMs on tabular data may be unreliable due to data contamination, raising concerns about the validity of reported performance gains and highlighting the need for more rigorous contamination detection methods.
Abstract: Large language models (LLMs) are increasingly exposed to data contamination, i.e., performance gains driven by prior exposure of test datasets rather than generalization. However, in the context of tabular data, this problem is largely unexplored. Existing approaches primarily rely on memorization tests, which are too coarse to detect contamination. In contrast, we propose a framework for assessing contamination in tabular datasets by generating controlled queries and performing comparative evaluation. Given a dataset, we craft multiple-choice aligned queries that preserve task structure while allowing systematic transformations of the underlying data. These transformations are designed to selectively disrupt dataset information while preserving partial knowledge, enabling us to isolate performance attributable to contamination. We complement this setup with non-neural baselines that provide reference performance, and we introduce a statistical testing procedure to formally detect significant deviations indicative of contamination. Empirical results on eight widely used tabular datasets reveal clear evidence of contamination in four cases. These findings suggest that performance on downstream tasks involving such datasets may be substantially inflated, raising concerns about the reliability of current evaluation practices.
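The statistical-testing step can be sketched with an exact binomial tail test: if accuracy on the original queries significantly exceeds what the transformed-query accuracy would predict, the gap is flagged as potential contamination. An illustrative sketch under that framing; the paper's actual test statistic, baselines, and threshold may differ.

```python
from math import comb

def binom_tail(k, n, p):
    """Exact upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def contamination_flag(correct_original, correct_transformed, n, alpha=0.01):
    """Flag potential contamination when accuracy on original queries is
    significantly higher than the rate estimated from transformed queries."""
    p_hat = correct_transformed / n  # accuracy attributable to partial knowledge
    p_value = binom_tail(correct_original, n, p_hat)
    return p_value < alpha, p_value
```

A model scoring 90/100 on originals but only 50/100 after disruption is flagged, while 52/100 versus 50/100 is consistent with chance; the transformations isolate how much performance survives when memorizable dataset information is removed.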
[102] LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
Julian Valline, Cedric Lothritz, Siwen Guo, Jordi Cabot
Main category: cs.CL
TL;DR: LuxIT: A monolingual instruction tuning dataset for Luxembourgish created using synthetic data generation with DeepSeek-R1-0528, enabling improved LLM performance on Luxembourgish proficiency exams and NLP tasks.
Details
Motivation: Instruction-tuned LLMs perform poorly in low-resource languages like Luxembourgish due to lack of high-quality training data, creating a need for specialized datasets to improve performance in such linguistic settings.
Method: Created LuxIT dataset by synthesizing instruction-answer pairs from native Luxembourgish texts using DeepSeek-R1-0528, followed by LLM-as-a-judge quality assurance to retain 227,507 high-quality pairs. Fine-tuned 14 smaller LLMs (≤15B parameters) on LuxIT and evaluated on Luxembourgish proficiency exams and five downstream NLP tasks.
Result: Training on LuxIT improved mean accuracy by +5.37 percentage points on language exams across all 14 models (12/14 showed improvement). On NLP tasks, 9/14 models improved in macro-averaged F1, though gains on the two benchmarks didn’t systematically correlate.
Conclusion: Monolingual synthetic data can effectively improve LLM capabilities in low-resource languages, demonstrating feasibility while highlighting the multi-faceted nature of language proficiency.
Abstract: The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach, retaining 227,507 high-quality instruction-answer pairs. To investigate the practical utility of the dataset, we fine-tune 14 smaller-scale LLMs (≤15B parameters) on LuxIT and evaluate them on standardized Luxembourgish proficiency exams and five downstream NLP tasks. Training on LuxIT yields a mean accuracy change of +5.37 percentage points on language exams across all 14 models, with 12 of 14 showing improvement. On NLP downstream tasks, 9 of 14 models improve in macro-averaged F1, though gains on the two benchmarks do not systematically correlate. These results underscore the feasibility of leveraging monolingual synthetic data to improve LLM capabilities in low-resource languages, while highlighting the multi-faceted nature of language proficiency.
[103] Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs
Muhammed Saeed, Muhammad Abdul-mageed, Shady Shehata
Main category: cs.CL
TL;DR: Multilingual debate-style benchmark reveals narrative biases in LLMs across sensitive domains, showing entrenched stereotypes persist despite safety alignment, especially in low-resource languages.
Details
Motivation: Current bias evaluations rely on English classification tasks, missing how narrative bias appears in realistic generative settings across languages. Need for multilingual benchmarks to assess cultural biases in open-ended communication.
Method: Created CORPUSNAME benchmark with 8,400 structured debate prompts across 4 sensitive domains (Women’s Rights, Backwardness, Terrorism, Religion) in 7 languages. Tested 4 flagship LLMs (GPT-4o, Claude 3.5 Haiku, DeepSeek-Chat, LLaMA-3-70B), generating over 100,000 debate responses with automatic classification of demographic group stereotypes.
Result: All models reproduce entrenched stereotypes: Arabs linked to Terrorism and Religion (≥89%), Africans to socioeconomic “backwardness” (up to 77%), Western groups framed as modern/progressive. Biases grow sharply in lower-resource languages, showing English alignment doesn’t generalize globally.
Conclusion: Current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended multilingual contexts. Persistent divide in multilingual fairness requires better culturally inclusive model alignment.
Abstract: Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks. We introduce \corpusname, a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8,400 structured debate prompts spanning four sensitive domains (Women’s Rights, Backwardness, Terrorism, and Religion) across seven languages ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude 3.5 Haiku, DeepSeek-Chat, and LLaMA-3-70B), we generate over 100,000 debate responses and automatically classify which demographic groups are assigned stereotyped versus modern roles. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to Terrorism and Religion (≥89%), Africans to socioeconomic “backwardness” (up to 77%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, revealing that alignment trained primarily in English does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release our \corpusname benchmark and analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.
[104] Beyond Elicitation: Provision-based Prompt Optimization for Knowledge-Intensive Tasks
Yunzhe Xu, Zhuosheng Zhang, Zhe Liu
Main category: cs.CL
TL;DR: KPPO: A prompt optimization framework that integrates systematic knowledge provision rather than just elicitation, addressing limitations of existing methods on knowledge-intensive tasks.
Details
Motivation: Existing prompt optimization methods focus on elicitation-based strategies that search for optimal prompts to activate models' capabilities, but these have fundamental limitations for knowledge-intensive tasks as they operate within static knowledge capacity rather than providing factual knowledge, terminology precision, and reasoning patterns required in specialized domains.
Method: Knowledge-Provision-based Prompt Optimization (KPPO) reformulates prompt optimization as systematic knowledge integration with three key innovations: 1) knowledge gap filling mechanism for identification and targeted remediation, 2) batch-wise candidate evaluation considering both performance improvement and distributional stability, 3) adaptive knowledge pruning strategy balancing performance and token efficiency.
Result: Evaluation on 15 knowledge-intensive benchmarks from various domains shows KPPO’s superiority over elicitation-based methods with ~6% average improvement over baselines while achieving comparable or lower token consumption, reducing up to 29% of inference token usage.
Conclusion: KPPO provides a more effective approach to prompt optimization for knowledge-intensive tasks by shifting from elicitation to systematic knowledge integration, addressing fundamental limitations of existing methods.
Abstract: While prompt optimization has emerged as a critical technique for enhancing language model performance, existing approaches primarily focus on elicitation-based strategies that search for optimal prompts to activate models’ capabilities. These methods exhibit fundamental limitations when addressing knowledge-intensive tasks, as they operate within static knowledge capacity rather than providing the factual knowledge, terminology precision, and reasoning patterns required in specialized domains. To address these limitations, we propose Knowledge-Provision-based Prompt Optimization (KPPO), a framework that reformulates prompt optimization as systematic knowledge integration rather than potential elicitation. KPPO introduces three key innovations: 1) a knowledge gap filling mechanism for knowledge gap identification and targeted remediation; 2) a batch-wise candidate evaluation approach that considers both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that balances performance and token efficiency, reducing up to 29% of inference token usage. Evaluation on 15 knowledge-intensive benchmarks from various domains demonstrates KPPO’s superiority over elicitation-based methods, with an average improvement of ~6% over baselines while achieving comparable or lower token consumption.
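The abstract names three mechanisms but spells none of them out. A minimal sketch of what a batch-wise candidate evaluation might look like, assuming a stability-aware score of the form "mean per-example gain minus a penalty on the spread of gains" (the weight and the exact form are assumptions, not taken from the paper):

```python
# Hypothetical sketch of KPPO-style batch-wise candidate evaluation:
# each candidate prompt is scored on a batch by its mean improvement
# over the current prompt, penalized by the standard deviation of
# per-example gains (a proxy for "distributional stability").
from statistics import mean, stdev

def score_candidate(candidate_accs, baseline_accs, stability_weight=0.5):
    """Mean per-example gain minus a stability penalty (assumed form)."""
    gains = [c - b for c, b in zip(candidate_accs, baseline_accs)]
    penalty = stdev(gains) if len(gains) > 1 else 0.0
    return mean(gains) - stability_weight * penalty

def pick_best(candidates, baseline_accs):
    """Return the candidate index with the highest stability-aware score."""
    scores = [score_candidate(accs, baseline_accs) for accs in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

baseline = [0.50, 0.55, 0.60, 0.45]
cands = [
    [0.70, 0.40, 0.80, 0.30],  # higher mean gain, but unstable
    [0.58, 0.61, 0.66, 0.53],  # modest but uniform gain
]
best = pick_best(cands, baseline)  # the stable candidate wins
```

Under this scoring, a candidate with erratic per-example gains can lose to one with smaller but uniform gains, which matches the paper's stated concern with distributional stability.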
[105] $π$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling
Dong Liu, Yanxuan Yu
Main category: cs.CL
TL;DR: ΠAttention: A periodic sparse Transformer with ring-local neighborhoods, deterministic π-stride skips, and adaptive fusion gates for efficient long-range modeling with linear complexity.
Details
Motivation: Transformers have quadratic complexity with sequence length, creating bottlenecks for long-range modeling. Existing sparse attention methods like RingAttention reduce costs but suffer from limited receptive fields and lack adaptability.
Method: ΠAttention factorizes attention into three components: ring-local neighborhoods for local context, deterministic π-stride skips for periodic long-range connections, and adaptive fusion gates to dynamically combine local and global information. This maintains linear complexity while expanding receptive field coverage.
Result: ΠAttention achieves O(kL + π log L) receptive field growth vs O(kL) for RingAttention. Experiments show it matches or surpasses dense attention quality with 8.3% lower perplexity than RingAttention while using 50% fewer GPUs for the same context length.
Conclusion: Periodic sparse attention with adaptive fusion provides an effective approach to efficient long-context modeling, balancing computational efficiency with expressive power through structured sparsity patterns.
Abstract: Transformers have revolutionized natural language processing, but their quadratic complexity with respect to sequence length remains a fundamental bottleneck for long-range modeling. While sparse attention mechanisms like RingAttention reduce computational costs by restricting attention to local neighborhoods, they suffer from limited receptive fields and lack of adaptability. We present ΠAttention, a periodic sparse Transformer that factorizes attention into ring-local neighborhoods, deterministic π-stride skips, and an adaptive fusion gate. The periodic structure provides predictable coverage of distant tokens, while the sparse footprint keeps the per-layer complexity linear in context length. We prove that ΠAttention achieves O(kL + π log L) receptive field growth compared to O(kL) for RingAttention, where k is the local window size, π is the skip period, and L is the sequence length. Extensive experiments on language modeling, retrieval, and vision-language tasks demonstrate that ΠAttention matches or surpasses dense attention quality with 8.3% lower perplexity than RingAttention while using 50% fewer GPUs for the same context length. Our detailed ablations and visualizations reveal the importance of periodic skips, adaptive fusion, and head-level sparsity coordination for efficient long-context modeling.
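The "local window plus periodic skips" pattern is easy to visualize as an attention mask. A sketch (not the authors' code) of the sparsity structure, where k is the local window size and p stands in for the skip period; the non-wraparound window is a simplification of the ring-local neighborhood:

```python
# Sketch of a ΠAttention-style sparsity pattern: each query attends to a
# local window of size k plus deterministic skips every p positions.
def sparse_mask(L, k=2, p=4):
    """Boolean attention mask: mask[i][j] is True if query i may see key j."""
    mask = [[False] * L for _ in range(L)]
    for i in range(L):
        for j in range(L):
            local = abs(i - j) <= k       # local neighborhood
            skip = (i - j) % p == 0       # periodic long-range skip
            mask[i][j] = local or skip
    return mask

m = sparse_mask(16, k=2, p=4)
# Per-row cost is O(k + L/p), so total work stays linear in L while the
# periodic skips guarantee every query reaches distant tokens.
```

This only shows the static pattern; the adaptive fusion gate that weights local versus skip contributions per token is a learned component not reproduced here.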
[106] Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement
Zijin Su, Huanzhu Lyu, Yuren Niu, Yiming Liu
Main category: cs.CL
TL;DR: Created balanced multi-label sentiment dataset from GoEmotions, Sentiment140, and GPT-4 mini, then developed enhanced classification model with FastText embeddings, CNN, BiLSTM, and attention mechanisms.
Details
Motivation: Existing multi-label sentiment datasets like GoEmotions suffer from severe class imbalance, which hampers model performance especially for underrepresented emotions, creating a need for balanced datasets and improved classification models.
Method: Constructed balanced dataset by integrating GoEmotions data, emotion-labeled samples from Sentiment140 using RoBERTa-base-GoEmotions model, and GPT-4 mini generated texts. Developed classification model with FastText embeddings, convolutional layers, bidirectional LSTM, attention mechanism, and sigmoid-activated output layer with mixed precision training.
Result: Experimental results show significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data, demonstrating the effectiveness of the balanced dataset and enhanced model architecture.
Conclusion: The balanced dataset and enhanced multi-label classification model effectively address class imbalance issues in sentiment analysis, leading to improved performance across multiple evaluation metrics.
Abstract: Multi-label sentiment classification plays a vital role in natural language processing by detecting multiple emotions within a single text. However, existing datasets like GoEmotions often suffer from severe class imbalance, which hampers model performance, especially for underrepresented emotions. To address this, we constructed a balanced multi-label sentiment dataset by integrating the original GoEmotions data, emotion-labeled samples from Sentiment140 using a RoBERTa-base-GoEmotions model, and manually annotated texts generated by GPT-4 mini. Our data balancing strategy ensured an even distribution across 28 emotion categories. Based on this dataset, we developed an enhanced multi-label classification model that combines pre-trained FastText embeddings, convolutional layers for local feature extraction, bidirectional LSTM for contextual learning, and an attention mechanism to highlight sentiment-relevant words. A sigmoid-activated output layer enables multi-label prediction, and mixed precision training improves computational efficiency. Experimental results demonstrate significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data, highlighting the effectiveness of our approach.
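The sigmoid output layer is what makes this multi-label rather than multi-class: each of the 28 emotion categories gets an independent probability, and every category above a threshold is predicted. A minimal stdlib sketch of that decision stage (the 0.5 threshold is an assumed default, not taken from the paper):

```python
# Multi-label decision stage: independent sigmoid per class, so several
# emotions can be predicted for one text (unlike softmax argmax).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Return indices of all classes whose sigmoid score clears the cut."""
    probs = [sigmoid(z) for z in logits]
    return [i for i, p in enumerate(probs) if p >= threshold]

# e.g. logits for 5 of the 28 classes; positive logits map to p > 0.5
labels = predict_labels([2.0, -1.0, 0.3, -3.0, 1.1])
```

With a 0.5 threshold the decision reduces to "logit ≥ 0", but keeping the sigmoid explicit matches how such heads are trained with per-class binary cross-entropy.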
[107] HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning
Alexis Correa-Guillén, Carlos Gómez-Rodríguez, David Vilares
Main category: cs.CL
TL;DR: HEAD-QA v2 is an expanded Spanish/English healthcare multiple-choice reasoning dataset with over 12,000 questions from Spanish professional exams, used to benchmark LLMs on biomedical reasoning.
Details
Motivation: Addresses the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning, building on previous work to create a more comprehensive resource for biomedical reasoning research.
Method: Extended the dataset to over 12,000 questions from ten years of Spanish professional exams, created additional multilingual versions, and benchmarked several open-source LLMs using prompting, RAG (Retrieval-Augmented Generation), and probability-based answer selection techniques.
Result: Results show that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. The dataset establishes itself as a reliable resource for biomedical reasoning research.
Conclusion: HEAD-QA v2 provides a valuable benchmark for advancing research on biomedical reasoning and model improvement, though current methods show that model scale and intrinsic reasoning capabilities are more important than complex inference strategies.
Abstract: We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
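"Probability-based answer selection" for multiple-choice QA typically means scoring each option by its likelihood under the model and picking the argmax. One common variant, sketched here with placeholder per-token log-probabilities (the paper does not specify its exact normalization, so the length normalization is an assumption):

```python
# Hedged sketch of probability-based answer selection: score each option
# by its mean per-token log-probability and pick the highest. The numbers
# below are toy placeholders, not real model outputs.
def select_answer(option_token_logprobs):
    """Pick the option with the highest length-normalized log-probability."""
    scores = [sum(lps) / len(lps) for lps in option_token_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)

options = [
    [-2.1, -0.9, -1.5],        # option A, mean -1.5
    [-0.4, -0.6],              # option B, mean -0.5
    [-1.0, -1.0, -1.0, -1.0],  # option C, mean -1.0
]
choice = select_answer(options)  # option B
```

Normalizing by length avoids a bias toward short options; without it, the raw sum would systematically favor options with fewer tokens.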
[108] Towards Hyper-Efficient RAG Systems in VecDBs: Distributed Parallel Multi-Resolution Vector Search
Dong Liu, Yanxuan Yu
Main category: cs.CL
TL;DR: SPI introduces multi-resolution semantic pyramid indexing for RAG systems, enabling query-adaptive resolution control to improve retrieval speed and relevance in vector databases.
Details
Motivation: Existing vector database retrieval pipelines use flat or single-resolution indexing that cannot adapt to varying semantic granularity needed by diverse queries, leading to suboptimal trade-offs between retrieval speed and contextual relevance.
Method: Proposes Semantic Pyramid Indexing (SPI), a multi-resolution vector indexing framework that constructs a semantic pyramid over document embeddings and dynamically selects optimal resolution levels per query using a lightweight classifier, enabling progressive coarse-to-fine retrieval.
Result: SPI achieves up to 5.7× retrieval speedup, 1.8× memory efficiency gain, and improves end-to-end QA F1 scores by up to 2.5 points compared to strong baselines on MS MARCO, Natural Questions, and multimodal retrieval benchmarks.
Conclusion: SPI provides an effective adaptive indexing framework for RAG systems that significantly improves retrieval efficiency while maintaining semantic coverage, with theoretical guarantees and compatibility with existing vector database infrastructures.
Abstract: Retrieval-Augmented Generation (RAG) systems have become a dominant approach to augment large language models (LLMs) with external knowledge. However, existing vector database (VecDB) retrieval pipelines rely on flat or single-resolution indexing structures, which cannot adapt to the varying semantic granularity required by diverse user queries. This limitation leads to suboptimal trade-offs between retrieval speed and contextual relevance. To address this, we propose Semantic Pyramid Indexing (SPI), a novel multi-resolution vector indexing framework that introduces query-adaptive resolution control for RAG in VecDBs. Unlike existing hierarchical methods that require offline tuning or separate model training, SPI constructs a semantic pyramid over document embeddings and dynamically selects the optimal resolution level per query through a lightweight classifier. This adaptive approach enables progressive retrieval from coarse-to-fine representations, significantly accelerating search while maintaining semantic coverage. We implement SPI as a plugin for both FAISS and Qdrant backends and evaluate it across multiple RAG tasks including MS MARCO, Natural Questions, and multimodal retrieval benchmarks. SPI achieves up to 5.7× retrieval speedup and 1.8× memory efficiency gain while improving end-to-end QA F1 scores by up to 2.5 points compared to strong baselines. Our theoretical analysis provides guarantees on retrieval quality and latency bounds, while extensive ablation studies validate the contribution of each component. The framework's compatibility with existing VecDB infrastructures makes it readily deployable in production RAG systems. Code is available at https://github.com/FastLM/SPI_VecDB.
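The coarse-to-fine descent at the heart of any pyramid index can be shown in a few lines. An illustrative stdlib sketch over a two-level pyramid with toy 2-d embeddings; SPI's actual index structure and its per-query level classifier are more involved than this:

```python
# Two-level coarse-to-fine retrieval: route the query to the nearest
# coarse centroid, then score only the documents in that centroid's
# bucket instead of the whole collection.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def coarse_to_fine(query, centroids, buckets, docs):
    """Pick the best coarse centroid, then search only its bucket."""
    c = max(range(len(centroids)), key=lambda i: dot(query, centroids[i]))
    bucket = buckets[c]  # doc ids assigned to centroid c at build time
    return max(bucket, key=lambda d: dot(query, docs[d]))

docs = {0: (1.0, 0.1), 1: (0.9, 0.3), 2: (0.1, 1.0), 3: (0.2, 0.8)}
centroids = [(0.95, 0.2), (0.15, 0.9)]  # averages of each bucket
buckets = [[0, 1], [2, 3]]
hit = coarse_to_fine((0.0, 1.0), centroids, buckets, docs)  # doc 2
```

The speedup comes from pruning: only one bucket is scored at the fine level. The trade-off, which SPI's adaptive level selection is meant to manage, is that a query routed to the wrong centroid can miss the true nearest neighbor.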
[109] Estonian WinoGrande Dataset: Comparative Analysis of LLM Performance on Human and Machine Translation
Marii Ojastu, Hele-Andra Kuulmets, Aleksei Dorkin, Marika Borovikova, Dage Särg, Kairit Sirts
Main category: cs.CL
TL;DR: Estonian translation of WinoGrande benchmark shows human-translated data yields slightly lower LLM performance than English original, while machine-translated data performs significantly worse, with limited improvement from prompt engineering.
Details
Motivation: To create a culturally adapted Estonian version of the WinoGrande commonsense reasoning benchmark for evaluating language models in non-English contexts, and to explore whether machine translation with specialized prompting can approach human translation quality.
Method: Human translation by specialists, evaluation of proprietary and open-source models on translated data, and exploration of machine translation with detailed prompts addressing Estonian linguistic characteristics and WinoGrande-specific translation challenges.
Result: Model performance on human-translated Estonian data is slightly lower than on original English test set; machine-translated data performs notably worse; prompt engineering offers limited improvement in translation quality or model accuracy.
Conclusion: Human translation by language specialists remains essential for reliable evaluation of language competency and reasoning in LLMs, as machine translation and prompt engineering show limited effectiveness for culturally adapted benchmarks.
Abstract: In this paper, we present a localized and culturally adapted Estonian translation of the test set from the widely used commonsense reasoning benchmark, WinoGrande. We detail the translation and adaptation process carried out by translation specialists and evaluate the performance of both proprietary and open source models on the human translated benchmark. Additionally, we explore the feasibility of achieving high-quality machine translation by incorporating insights from the manual translation process into the design of a detailed prompt. This prompt is specifically tailored to address both the linguistic characteristics of Estonian and the unique translation challenges posed by the WinoGrande dataset. Our findings show that model performance on the human translated Estonian dataset is slightly lower than on the original English test set, while performance on machine-translated data is notably worse. Additionally, our experiments indicate that prompt engineering offers limited improvement in translation quality or model accuracy, and highlight the importance of involving language specialists in dataset translation and adaptation to ensure reliable and interpretable evaluations of language competency and reasoning in large language models.
[110] A Systematic Study of In-the-Wild Model Merging for Large Language Models
Oğuz Kağan Hitit, Leander Girrbach, Zeynep Akata
Main category: cs.CL
TL;DR: Model merging of heterogeneous experts with overlapping/conflicting objectives often fails; Task Arithmetic is the only reliable method for LLMs in “in-the-wild” settings.
Details
Motivation: To evaluate whether model merging benefits extend to settings where merged experts have overlapping or conflicting objectives rather than distinct roles, since most prior work focused on clearly separated tasks.
Method: Large-scale systematic evaluation of six state-of-the-art merging methods (including subspace methods) across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks in heterogeneous “in-the-wild” settings.
Result: Task Arithmetic (oldest and simplest method) is the only approach that reliably yields performance gains on LLMs in heterogeneous settings. Other interference-aware and subspace merging methods typically don’t improve over the base model.
Conclusion: Current merging techniques mostly fail to extract useful weight updates from heterogeneous/conflicting experts, motivating the need for LLM-specific merging algorithms and merging-aware fine-tuning methods.
Abstract: Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for settings where all merged experts have distinct roles and are tuned on clearly separated tasks also hold in settings where the merged experts do not have clearly distinct roles, but are trained on overlapping or even conflicting objectives. To evaluate this setting, we present a large-scale, systematic evaluation of “in-the-wild” model merging of heterogeneous experts, that may have been trained on overlapping or conflicting objectives. Concretely, we evaluate six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a model merged from a heterogeneous set of experts outperforms the base model and we measure relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs in this “in-the-wild” setting. Other interference-aware and subspace merging methods typically do not result in notable improvements over the base model. Our findings indicate that current merging techniques mostly do not enable extracting useful weight updates from heterogeneous and potentially conflicting versions. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods.
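Task Arithmetic, the one method the study finds reliable, merges by adding the experts' weight deltas ("task vectors") to the base model. A sketch on plain lists standing in for flattened parameter tensors; the scaling coefficient is a hyperparameter, and real merges operate tensor-by-tensor over the full state dict:

```python
# Task Arithmetic: merged = base + scale * sum_i (expert_i - base).
# Each expert contributes only its delta from the base checkpoint.
def task_arithmetic(base, experts, scale=1.0):
    """Merge expert checkpoints by summing their task vectors."""
    merged = list(base)
    for expert in experts:
        for j, (e, b) in enumerate(zip(expert, base)):
            merged[j] += scale * (e - b)
    return merged

base = [1.0, 2.0, 3.0]
experts = [[1.5, 2.0, 3.0],   # expert touching only parameter 0
           [1.0, 2.5, 3.0]]   # expert touching only parameter 1
merged = task_arithmetic(base, experts, scale=1.0)
```

When experts modify disjoint parameters, as above, their deltas compose cleanly; the paper's finding is that even when experts overlap or conflict, this simple additive rule degrades more gracefully than the interference-aware alternatives.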
[111] CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer
Lavish Bansal, Naman Mishra
Main category: cs.CL
TL;DR: CREST is a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters, using cross-lingual transfer from 13 high-resource languages to address safety in low-resource languages.
Details
Motivation: Existing safety guardrails for LLMs are predominantly tailored for high-resource languages, leaving low-resource language speakers underrepresented. There's a need for universal, language-agnostic safety systems that can scale globally.
Method: CREST uses cluster-based cross-lingual transfer, training on only 13 high-resource languages and transferring knowledge to 100 languages. It's parameter-efficient with only 0.5B parameters.
Result: CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B+).
Conclusion: The work highlights limitations of language-specific guardrails and underscores the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.
Abstract: Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-resource languages, leaving a significant portion of the world’s population underrepresented who communicate in low-resource languages. To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters. By training on a strategically chosen subset of only 13 high-resource languages, our model utilizes cluster-based cross-lingual transfer from a few to 100 languages, enabling effective generalization to both unseen high-resource and low-resource languages. This approach addresses the challenge of limited training data in low-resource settings. We conduct comprehensive evaluations across six safety benchmarks to demonstrate that CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B parameters and above). Our findings highlight the limitations of language-specific guardrails and underscore the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.
[112] Multilingual Medical Reasoning for Question Answering with Large Language Models
Pietro Ferrazzi, Aitor Soroa, Rodrigo Agerri
Main category: cs.CL
TL;DR: Multilingual medical reasoning traces generated from Wikipedia knowledge improve LLM performance on medical QA tasks across English, Italian, and Spanish.
Details
Motivation: Existing medical QA approaches are English-focused and rely on distillation from general-purpose LLMs, raising reliability concerns. There's a need for multilingual medical reasoning resources that leverage structured medical knowledge.
Method: Used retrieval-augmented generation over medical Wikipedia information to create 500k reasoning traces in English, Italian, and Spanish. Extended MedQA and MedMCQA datasets to Italian and Spanish. Tested pipeline in both in-domain and out-of-domain settings.
Result: Reasoning traces improved performance via both in-context learning (few-shot) and supervised fine-tuning, achieving state-of-the-art results among 8B-parameter LLMs on medical QA benchmarks.
Conclusion: The generated multilingual reasoning resources support development of more transparent clinical decision-support tools and enable better medical QA performance across languages.
Abstract: Large Language Models (LLMs) with reasoning capabilities have recently demonstrated strong potential in medical Question Answering (QA). Existing approaches are largely English-focused and primarily rely on distillation from general-purpose LLMs, raising concerns about the reliability of their medical knowledge. In this work, we present a method to generate multilingual reasoning traces based on medical knowledge extracted from Wikipedia. We produce 500k traces in English, Italian, and Spanish, using a retrieval-augmented generation approach over medical information from Wikipedia. The traces are generated to solve medical questions drawn from MedQA and MedMCQA, which we extend to Italian and Spanish. We test our pipeline in both in-domain and out-of-domain settings across Medical QA benchmarks, and demonstrate that our reasoning traces improve performance both when utilized via in-context learning (few-shot) and supervised fine-tuning, yielding state-of-the-art results among 8B-parameter LLMs. We believe that these resources can support the development of more transparent clinical decision-support tools in multilingual settings. We release the full suite of resources: reasoning traces, translated QA datasets, Medical-Wikipedia, and fine-tuned models.
[113] OnCoCo 1.0: A Public Dataset for Fine-Grained Message Classification in Online Counseling Conversations
Jens Albrecht, Robert Lehmann, Aleksandra Poltermann, Eric Rudolph, Philipp Steigerwald, Mara Stieler
Main category: cs.CL
TL;DR: OnCoCo 1.0 is a new public dataset for fine-grained message classification in online counseling conversations with 38 counselor and 28 client utterance types, containing ~2,800 labeled messages.
Details
Motivation: Existing category systems for counseling conversations are limited by narrow focus on Motivational Interviewing and dependence on face-to-face counseling datasets, which restricts detailed analysis of textual online counseling conversations.
Method: Developed a comprehensive new coding scheme with 38 counselor and 28 client utterance categories, created a labeled dataset of ~2,800 messages from counseling conversations, and fine-tuned several models on the dataset.
Result: Created a publicly available dataset and models for fine-grained classification of online counseling conversations, demonstrating applicability through model fine-tuning experiments.
Conclusion: The work contributes a new fine-grained conversational resource to the language resources community, extending existing datasets for social and mental-health dialogue analysis.
Abstract: This paper presents OnCoCo 1.0, a new public dataset for fine-grained message classification in online counseling. It is based on a new, integrative system of categories, designed to improve the automated analysis of psychosocial online counseling conversations. Existing category systems, predominantly based on Motivational Interviewing (MI), are limited by their narrow focus and dependence on datasets derived mainly from face-to-face counseling. This limits the detailed examination of textual counseling conversations. In response, we developed a comprehensive new coding scheme that differentiates between 38 types of counselor and 28 types of client utterances, and created a labeled dataset consisting of about 2,800 messages from counseling conversations. We fine-tuned several models on our dataset to demonstrate its applicability. The data and models are publicly available to researchers and practitioners. Thus, our work contributes a new type of fine-grained conversational resource to the language resources community, extending existing datasets for social and mental-health dialogue analysis.
[114] Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, and LLaMA
Hanyu Cai, Binqi Shen, Lier Jin, Lan Hu, Xiaojing Fan
Main category: cs.CL
TL;DR: Systematic evaluation shows tone sensitivity in LLMs is model-dependent and domain-specific, with rude prompts reducing accuracy mainly in Humanities tasks for some models, but overall modern LLMs are broadly robust to tonal variation.
Details
Motivation: To systematically examine how pragmatic elements like linguistic tone and politeness affect LLM performance across different model families, as this impact remains underexplored despite prompt engineering being critical for LLM performance.
Method: Proposed evaluation framework using MMMLU benchmark to test three LLMs (GPT-4o mini, Gemini 2.0 Flash, Llama 4 Scout) under Very Polite, Neutral, and Very Rude prompt variants across six STEM and Humanities tasks, with statistical significance testing of pairwise accuracy differences.
Result: Tone sensitivity is model-dependent and domain-specific: Neutral or Very Polite prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in Humanities tasks where rude tone reduces accuracy for GPT and Llama, while Gemini remains tone-insensitive. Aggregated across tasks, tone effects diminish and lose statistical significance.
Conclusion: While interaction tone can matter in specific interpretive settings, modern LLMs are broadly robust to tonal variation in typical mixed-domain use, providing practical guidance for prompt design and model selection in real-world deployments.
Abstract: Prompt engineering has emerged as a critical factor influencing large language model (LLM) performance, yet the impact of pragmatic elements such as linguistic tone and politeness remains underexplored, particularly across different model families. In this work, we propose a systematic evaluation framework to examine how interaction tone affects model accuracy and apply it to three recently released and widely available LLMs: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta). Using the MMMLU benchmark, we evaluate model performance under Very Polite, Neutral, and Very Rude prompt variants across six tasks spanning STEM and Humanities domains, and analyze pairwise accuracy differences with statistical significance testing. Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Polite prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in a subset of Humanities tasks, where rude tone reduces accuracy for GPT and Llama, while Gemini remains comparatively tone-insensitive. When performance is aggregated across tasks within each domain, tone effects diminish and largely lose statistical significance. Compared with earlier research, these findings suggest that dataset scale and coverage materially influence the detection of tone effects. Overall, our study indicates that while interaction tone can matter in specific interpretive settings, modern LLMs are broadly robust to tonal variation in typical mixed-domain use, providing practical guidance for prompt design and model selection in real-world deployments.
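The paper tests pairwise accuracy differences for significance but does not say which test it uses; a paired permutation test is one standard choice for this setup. A sketch on synthetic per-question correctness data (the test choice and the data are assumptions for illustration):

```python
# Paired permutation test for a tone effect: randomly swap each
# question's (polite, rude) outcome pair and count how often the
# shuffled accuracy gap is at least as large as the observed one.
import random

def permutation_test(polite, rude, n_perm=2000, seed=0):
    """Two-sided p-value for the mean accuracy difference between tones."""
    rng = random.Random(seed)
    n = len(polite)
    observed = sum(p - r for p, r in zip(polite, rude)) / n
    hits = 0
    for _ in range(n_perm):
        gap = 0.0
        for p, r in zip(polite, rude):
            if rng.random() < 0.5:
                p, r = r, p   # swap the pair's tone labels
            gap += p - r
        if abs(gap / n) >= abs(observed):
            hits += 1
    return hits / n_perm

polite = [1, 1, 1, 0, 1, 1, 0, 1]  # per-question correctness, polite prompt
rude = [1, 0, 1, 0, 0, 1, 0, 1]    # same questions, rude prompt
p_value = permutation_test(polite, rude)
```

Pairing by question matters: it controls for item difficulty, so the test isolates the tone effect rather than differences between question sets.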
[115] Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages
Anaelia Ovalle, Candace Ross, Sebastian Ruder, Adina Williams, Karen Ullrich, Mark Ibrahim, Levent Sagun
Main category: cs.CL
TL;DR: Multilingual reasoning evaluation reveals models’ reasoning often fails to support conclusions, especially in non-Latin scripts, despite high task accuracy.
Details
Motivation: To investigate whether chain-of-thought reasoning quality transfers across languages, as current multilingual evaluation focuses on task accuracy but overlooks reasoning quality.
Method: Created human-validated framework to evaluate if reasoning traces logically support conclusions across languages. Analyzed 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models.
Result: Found critical blind spot: high task accuracy but reasoning often fails to support conclusions. Non-Latin scripts show ≥2× more reasoning-conclusion misalignment than Latin scripts. Error taxonomy reveals evidential errors (unsupported claims) and illogical reasoning as main failures.
Conclusion: Current multilingual evaluation provides incomplete picture of model reasoning capabilities; need reasoning-aware evaluation frameworks that assess reasoning quality, not just task accuracy.
Abstract: Large language models demonstrate strong reasoning capabilities through chain-of-thought prompting, but whether this reasoning quality transfers across languages remains underexplored. We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages. Analyzing 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models, we uncover a critical blind spot: while models achieve high task accuracy, their reasoning can fail to support their conclusions. Reasoning traces in non-Latin scripts show at least twice as much misalignment between their reasoning and conclusions than those in Latin scripts. We develop an error taxonomy through human annotation to characterize these failures, finding they stem primarily from evidential errors (unsupported claims, ambiguous facts) followed by illogical reasoning steps. Our findings demonstrate that current multilingual evaluation practices provide an incomplete picture of model reasoning capabilities and highlight the need for reasoning-aware evaluation frameworks.
[116] Activation Steering for Masked Diffusion Language Models
Adi Shnaidman, Erin Feiglin, Osher Yaari, Efrat Mentel, Amit Levi, Raz Lapid
Main category: cs.CL
TL;DR: The paper introduces an activation steering method for masked diffusion language models (MDLMs) that extracts low-dimensional control directions from contrastive prompts and applies them during reverse diffusion to systematically modify model behavior, with safety refusal as a case study.
Details
Motivation: Masked diffusion language models offer unique advantages like mask-parallel decoding and different controllability-efficiency tradeoffs compared to autoregressive LLMs, but lack efficient representation-level mechanisms for inference-time control. The authors aim to develop a lightweight activation steering primitive for MDLMs without requiring optimization or altering the diffusion process.
Method: Proposes extracting a single low-dimensional direction from contrastive prompt sets using one prompt-only forward pass, then applying global interventions on residual-stream activations throughout reverse diffusion. Uses safety refusal as a case study to analyze refusal behavior in MDLMs and identify consistent activation subspaces.
Result: Found refusal behavior in multiple MDLMs is governed by a consistent, approximately one-dimensional activation subspace. The steering method yields large systematic behavioral shifts, outperforming prompt-based and optimization-based baselines. Discovered diffusion-specific accessibility where effective directions can be extracted from pre-instruction tokens (unlike autoregressive models). Ablations show maximal leverage in early denoising steps and mid-to-late transformer layers.
Conclusion: The activation steering primitive enables efficient representation-level control in MDLMs, revealing architecture-dependent representations of safety constraints. Directions transfer strongly between languages in multilingual MDLMs but don’t generalize to autoregressive architectures, highlighting fundamental differences in how different model architectures encode behavioral constraints.
Abstract: Masked diffusion language models (MDLMs) generate text via iterative masked-token denoising, enabling mask-parallel decoding and distinct controllability and efficiency tradeoffs from autoregressive LLMs. Yet, efficient representation-level mechanisms for inference-time control in MDLMs remain largely unexplored. To address this gap, we introduce an activation steering primitive for MDLMs: we extract a single low-dimensional direction from contrastive prompt sets using one prompt-only forward pass, and apply a global intervention on residual-stream activations throughout reverse diffusion, without performing optimization or altering the diffusion sampling procedure. Using safety refusal as a deployment-relevant case study, we find that refusal behavior in multiple MDLMs is governed by a consistent, approximately one-dimensional activation subspace. Applying the corresponding direction yields large and systematic behavioral shifts and is substantially more effective than prompt-based and optimization-based baselines. We further uncover diffusion-specific accessibility: effective directions can be extracted not only from post-instruction tokens, but also from pre-instruction tokens that are typically ineffective in autoregressive models due to causal attention. Ablations localize maximal leverage to early denoising steps and mid-to-late transformer layers, with early diffusion blocks contributing disproportionately. Finally, in an MDLM trained on English and Chinese, extracted directions transfer strongly between English and Chinese, but do not reliably generalize to an autoregressive architecture, highlighting architecture-dependent representations of safety constraints.
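The extraction step the abstract describes (a single low-dimensional direction from contrastive prompt sets, applied globally to residual-stream activations) can be sketched as a difference of means followed by a scaled addition. Everything below, from the toy two-dimensional activations to the function names, is illustrative and not the paper's code:

```python
# Hypothetical sketch of difference-of-means activation steering.
# `refusal_acts` / `comply_acts` stand in for residual-stream activations
# collected from two contrastive prompt sets; real values come from a model.

def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_direction(pos_acts, neg_acts):
    """Unit-norm difference of means between contrastive activation sets."""
    pos_mean, neg_mean = mean_vec(pos_acts), mean_vec(neg_acts)
    diff = [p - q for p, q in zip(pos_mean, neg_mean)]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]

def steer(activation, direction, alpha):
    """Apply the global intervention: add a scaled direction to an activation."""
    return [a + alpha * d for a, d in zip(activation, direction)]

refusal_acts = [[1.0, 0.0], [0.8, 0.2]]
comply_acts = [[0.0, 1.0], [0.2, 0.8]]
d = steering_direction(refusal_acts, comply_acts)   # roughly [0.707, -0.707]
steered = steer([0.5, 0.5], d, alpha=2.0)
```

In an actual MDLM this addition would be applied at every denoising step via a forward hook on the chosen residual-stream layers; the toy vectors only show the arithmetic.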
[117] JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models
Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou, Shujun Wang, Yusuke Iwasawa, Yutaka Matsuo, Kan Hatakeyama-Sato
Main category: cs.CL
TL;DR: JMedEthicBench: First multi-turn conversational benchmark for evaluating medical safety of LLMs in Japanese healthcare, based on 67 guidelines with 50k+ adversarial conversations using jailbreak strategies.
Details
Motivation: Existing safety benchmarks are English-centric and use single-turn prompts, while real clinical consultations are multi-turn. Need to evaluate LLM safety in healthcare, especially for Japanese context.
Method: Created benchmark based on 67 Japan Medical Association guidelines with over 50,000 adversarial conversations using 7 automatically discovered jailbreak strategies. Used dual-LLM scoring protocol to evaluate 27 models.
Result: Commercial models maintain robust safety while medical-specialized models show increased vulnerability. Safety scores decline significantly across conversation turns (median: 9.5 to 5.0). Cross-lingual evaluation shows vulnerabilities persist across Japanese and English versions.
Conclusion: Domain-specific fine-tuning may weaken safety mechanisms, multi-turn interactions represent distinct threat surface requiring dedicated alignment strategies, and medical model vulnerabilities are inherent alignment limitations rather than language-specific.
Abstract: As Large Language Models (LLMs) are increasingly deployed in the healthcare field, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly English-centric and test with only single-turn prompts, even though real clinical consultations are multi-turn. To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating the medical safety of LLMs for Japanese healthcare. Our benchmark is based on 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability. Furthermore, safety scores decline significantly across conversation turns (median: 9.5 to 5.0, $p < 0.001$). Cross-lingual evaluation on both Japanese and English versions of our benchmark reveals that medical model vulnerabilities persist across languages, indicating inherent alignment limitations rather than language-specific factors. These findings suggest that domain-specific fine-tuning may inadvertently weaken safety mechanisms and that multi-turn interactions represent a distinct threat surface requiring dedicated alignment strategies.
[118] FACTUM: Mechanistic Detection of Citation Hallucination in Long-Form RAG
Maxime Dassen, Rebecca Kotula, Kenton Murray, Andrew Yates, Dawn Lawrie, Efsun Kayi, James Mayfield, Kevin Duh
Main category: cs.CL
TL;DR: FACTUM framework analyzes citation hallucinations in RAG models as coordination failures between attention and feed-forward pathways, using four mechanistic scores to detect trustworthy citations.
Details
Motivation: Citation hallucinations in RAG models undermine their reliability, but existing work oversimplifies the problem as mere over-reliance on parametric knowledge. The paper aims to understand the deeper mechanistic causes of these failures.
Method: Introduces FACTUM framework with four scores: Contextual Alignment (CAS), Attention Sink Usage (BAS), Parametric Force (PFS), and Pathway Alignment (PAS). Analyzes coordination between Attention (reading) and Feed-Forward Network (recalling) pathways across different model scales.
Result: FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC. Correct citations show higher parametric force and greater attention sink usage. The signature of correctness evolves with model scale: 3B models rely on high pathway alignment while 8B models use specialized orthogonal strategies.
Conclusion: Citation hallucinations result from complex coordination failures between neural pathways, not just parametric over-reliance. High parametric force can be constructive when properly coordinated with attention pathways, enabling more reliable RAG systems.
Abstract: Retrieval-Augmented Generation (RAG) models are critically undermined by citation hallucinations, a deceptive failure where a model cites a source that fails to support its claim. While existing work attributes hallucination to a simple over-reliance on parametric knowledge, we reframe this failure as an evolving, scale-dependent coordination failure between the Attention (reading) and Feed-Forward Network (recalling) pathways. We introduce FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms), a framework of four mechanistic scores: Contextual Alignment (CAS), Attention Sink Usage (BAS), Parametric Force (PFS), and Pathway Alignment (PAS). Our analysis reveals that correct citations are consistently marked by higher parametric force (PFS) and greater use of the attention sink (BAS) for information synthesis. Crucially, we find that “one-size-fits-all” theories are insufficient as the signature of correctness evolves with scale: while the 3B model relies on high pathway alignment (PAS), our best-performing 8B detector identifies a shift toward a specialized strategy where pathways provide distinct, orthogonal information. By capturing this complex interplay, FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC. Our results demonstrate that high parametric force is constructive when successfully coordinated with the Attention pathway, paving the way for more nuanced and reliable RAG systems.
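Since FACTUM's headline number is an AUC gain, it may help to recall what AUC measures here: the probability that a correct citation receives a higher detector score than a hallucinated one. A minimal stdlib sketch, with toy scores standing in for any combination of the CAS/BAS/PFS/PAS signals (the values and labels are invented):

```python
# Illustrative sketch (not the paper's code): citation-hallucination detection
# framed as ranking citations by a mechanistic score, evaluated with AUC.

def auc(scores, labels):
    """Pairwise AUC: probability that a correct citation (label 1) outranks
    a hallucinated one (label 0); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy combined detector score, e.g. a weighted sum of hypothetical
# CAS/BAS/PFS/PAS values for five citations.
combined = [0.9, 0.8, 0.4, 0.3, 0.7]
is_correct = [1, 1, 0, 0, 1]
print(auc(combined, is_correct))  # 1.0 on this perfectly separable toy data
```

A real detector would of course be fit on held-out data; the point is only that "up to 37.5% in AUC" refers to this ranking probability.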
[119] †DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems
Zabir Al Nazi, Shubhashis Roy Dipta, Sudipta Kar
Main category: cs.CL
TL;DR: Introduces DISTRACTMATH-BN, a Bangla benchmark with irrelevant distractors for mathematical reasoning, and DAGGER, a method that reformulates math problem solving as computational graph generation to improve robustness and efficiency.
Details
Motivation: Chain-of-Thought prompting is widely used for mathematical problem solving in low-resource languages, but its behavior under irrelevant context (distractors) remains underexplored, especially in noisy, low-resource settings.
Method: Created DISTRACTMATH-BN benchmark by augmenting MGSM and MSVAMP with semantically coherent but computationally irrelevant information. Proposed DAGGER, which reformulates mathematical problem solving as executable computational graph generation with explicit modeling of distractor nodes. Fine-tuned Gemma-3 models using supervised fine-tuning followed by Group Relative Policy Optimization.
Result: Standard models dropped by up to 41 points under distractors, while reasoning-specialized models declined by 14-20 points despite using 5x more tokens. DAGGER achieved comparable weighted accuracy on augmented benchmarks while using 89% fewer tokens than reasoning models, with robustness emerging without explicit training on distractor-augmented examples.
Conclusion: Enforcing structured intermediate representations (computational graphs) improves robustness and inference efficiency in mathematical reasoning compared to free-form approaches, particularly in noisy, low-resource settings.
Abstract: Chain-of-Thought (CoT) prompting is widely adopted for mathematical problem solving, including in low-resource languages, yet its behavior under irrelevant context remains underexplored. To systematically study this challenge, we introduce DISTRACTMATH-BN, a Bangla benchmark that augments MGSM and MSVAMP with semantically coherent but computationally irrelevant information. Evaluating seven models ranging from 3B to 12B parameters, we observe substantial performance degradation under distractors: standard models drop by up to 41 points, while reasoning-specialized models decline by 14 to 20 points despite consuming five times more tokens. We propose †DAGGER, which reformulates mathematical problem solving as executable computational graph generation with explicit modeling of distractor nodes. Fine-tuning Gemma-3 models using supervised fine-tuning followed by Group Relative Policy Optimization achieves comparable weighted accuracy on augmented benchmarks while using 89 percent fewer tokens than reasoning models. Importantly, this robustness emerges without explicit training on distractor-augmented examples. Our results suggest that enforcing structured intermediate representations improves robustness and inference efficiency in mathematical reasoning compared to free-form approaches, particularly in noisy, low-resource settings.
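The paper's central move, treating a word problem as an executable computational graph in which distractor facts become nodes the answer never depends on, can be illustrated with a toy evaluator. The graph encoding, node names, and example problem below are invented for this sketch:

```python
# Minimal sketch of an executable computational graph with distractor nodes:
# only the subgraph reachable from the answer node is ever evaluated, so
# irrelevant context can never contaminate the computation.

def solve(graph, answer_node):
    """Evaluate the answer node; return (value, set of nodes actually used)."""
    cache = {}

    def eval_node(name):
        if name not in cache:
            op, args = graph[name]
            if op == "const":
                cache[name] = args
            else:
                vals = [eval_node(a) for a in args]
                cache[name] = {"add": sum,
                               "mul": lambda v: v[0] * v[1]}[op](vals)
        return cache[name]

    return eval_node(answer_node), set(cache)

# "Rahim buys 3 pens at 5 taka each. (His sister is 12 years old.) Total cost?"
graph = {
    "pens": ("const", 3),
    "price": ("const", 5),
    "sister_age": ("const", 12),          # distractor node: never evaluated
    "total": ("mul", ["pens", "price"]),
}
answer, used = solve(graph, "total")
print(answer)                 # 15
print("sister_age" in used)   # False
```

This also hints at why the representation is token-efficient: the model emits a small graph rather than free-form reasoning prose.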
[120] Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching
Stephen Gadd
Main category: cs.CL
TL;DR: Symphonym is a neural embedding system that maps toponyms from 20 writing systems into a unified phonetic space for cross-script matching without language identification or phonetic resources at inference time.
Details
Motivation: Matching place names across different writing systems is a major challenge for integrating multilingual geographic sources. Existing approaches rely on language-specific phonetic algorithms or romanization steps that discard phonetic information and don't generalize across script boundaries.
Method: Uses a Teacher-Student knowledge distillation architecture. The Teacher learns from articulatory phonetic features derived from IPA transcriptions, then transfers this knowledge to a character-level Student model. Trained on 32.7 million triplet samples from 67 million toponyms across GeoNames, Wikidata, and Getty Thesaurus.
Result: Achieves highest Recall@1 (85.2%) and Mean Reciprocal Rank (90.8%) on the MEHDIE cross-script benchmark (medieval Hebrew and Arabic toponym matches). Shows cross-temporal generalization from modern training to pre-modern sources. Ablation with raw articulatory features alone yields only 45.0% MRR.
Conclusion: The approach effectively handles pre-standardization orthographic variation in historical documents and transfers to personal names in archival sources, suggesting broad applicability to name resolution tasks in digital humanities and linked open data contexts.
Abstract: Matching place names across writing systems is a persistent obstacle to the integration of multilingual geographic sources, whether modern gazetteers, medieval itineraries, or colonial-era surveys. Existing approaches depend on language-specific phonetic algorithms or romanisation steps that discard phonetic information, and none generalises across script boundaries. This paper presents Symphonym, a neural embedding system which maps toponyms from twenty writing systems into a unified 128-dimensional phonetic space, enabling direct cross-script similarity comparison without language identification or phonetic resources at inference time. A Teacher-Student knowledge distillation architecture first learns from articulatory phonetic features derived from IPA transcriptions, then transfers this knowledge to a character-level Student model. Trained on 32.7 million triplet samples drawn from 67 million toponyms spanning GeoNames, Wikidata, and the Getty Thesaurus of Geographic Names, the Student achieves the highest Recall@1 (85.2%) and Mean Reciprocal Rank (90.8%) on the MEHDIE cross-script benchmark – medieval Hebrew and Arabic toponym matches curated by domain experts and entirely independent of the training data – demonstrating cross-temporal generalisation from modern training material to pre-modern sources. An ablation using raw articulatory features alone yields only 45.0% MRR, confirming the contribution of the neural training curriculum. The approach naturally handles pre-standardisation orthographic variation characteristic of historical documents, and transfers effectively to personal names in archival sources, suggesting broad applicability to name resolution tasks in digital humanities and linked open data contexts.
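Recall@1 and MRR, the two metrics reported for the MEHDIE benchmark, are standard retrieval metrics and easy to state precisely. The ranked candidate lists below are toy stand-ins, not MEHDIE entries:

```python
# Sketch of the two retrieval metrics used to evaluate Symphonym.
# Metric definitions are standard; the data is illustrative only.

def recall_at_1(ranked_lists, gold):
    """Fraction of queries whose top-ranked candidate is the gold match."""
    return sum(r[0] == g for r, g in zip(ranked_lists, gold)) / len(gold)

def mrr(ranked_lists, gold):
    """Mean reciprocal rank of the gold match (contributes 0 if absent)."""
    total = 0.0
    for r, g in zip(ranked_lists, gold):
        if g in r:
            total += 1.0 / (r.index(g) + 1)
    return total / len(gold)

# Three toy queries, each with a ranked list of candidate toponym matches.
ranked = [["Halab", "Aleppo"],
          ["Dimashq", "Damascus"],
          ["Cairo", "al-Qahira"]]
gold = ["Halab", "Damascus", "al-Qahira"]
print(recall_at_1(ranked, gold))  # ~0.333: only the first query hits at rank 1
print(mrr(ranked, gold))          # (1 + 1/2 + 1/2) / 3, roughly 0.667
```

In the paper's setting each query is a toponym in one script and the candidates are nearest neighbours in the shared 128-dimensional phonetic space.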
[121] LLMs versus the Halting Problem: Revisiting Program Termination Prediction
Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, Peter O’Hearn
Main category: cs.CL
TL;DR: LLMs show strong performance on program termination prediction tasks, ranking close to specialized verification tools, but struggle with providing valid proofs and handling complex programs.
Details
Motivation: The paper investigates whether large language models can reliably predict program termination, given that the Halting Problem is undecidable and existing verification tools are language-specific and approximate. Recent LLM successes prompt exploration of their potential for reasoning about undecidable problems.
Method: Evaluated LLMs on diverse programs from the Termination category of SV-Comp 2025, comparing performance against specialized verification tools. Analyzed models including GPT-5, Claude Sonnet-4.5, and Code World Model (CWM) on termination prediction tasks.
Result: LLMs perform remarkably well at predicting program termination, with GPT-5 and Claude Sonnet-4.5 ranking just behind the top-ranked tool (using test-time-scaling), and CWM placing just behind the second-ranked tool. However, LLMs often fail to provide valid witnesses as proofs, and performance degrades with increasing program length and complexity.
Conclusion: LLMs show promise for program termination prediction despite the undecidable nature of the problem, but have limitations in proof generation and handling complex programs. This motivates further research into LLMs for reasoning about undecidable problems.
Abstract: Determining whether a program terminates is a central problem in computer science. Turing's foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raise the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time-scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLM performance drops as program length and complexity increase. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.
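A "witness" for termination is typically a ranking function: a map from program state to a non-negative value that strictly decreases on every loop iteration, which rules out infinite execution. The toy loop and candidate ranking function below are illustrative, not drawn from SV-Comp:

```python
# Hedged illustration of checking a termination witness along one execution:
# the ranking function must stay non-negative and strictly decrease each
# iteration. Both the loop and the ranking function here are toy examples.

def runs_and_witness_holds(n):
    """Run a simple countdown loop, checking the witness rank(state) = state."""
    rank = lambda state: state  # candidate ranking function for this loop
    prev = rank(n)
    while n > 0:
        n -= 1                  # loop body: n moves strictly toward 0
        cur = rank(n)
        if not (0 <= cur < prev):
            return False        # witness violated: no termination guarantee
        prev = cur
    return True

print(runs_and_witness_holds(5))  # True: rank decreases 5, 4, 3, 2, 1, 0
```

A proper proof must establish the decrease for all reachable states, not just one trace; checking a single run, as here, is the easy direction, which is part of why producing valid witnesses is harder for LLMs than predicting the verdict.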
[122] MuVaC: A Variational Causal Framework for Multimodal Sarcasm Understanding in Dialogues
Diandian Guo, Fangfang Yuan, Cong Cao, Xixun Lin, Chuan Zhou, Hao Peng, Yanan Cao, Yanbing Liu
Main category: cs.CL
TL;DR: MuVaC is a variational causal inference framework for joint multimodal sarcasm detection and explanation, modeling the causal relationship between detection and explanation tasks.
Details
Motivation: Sarcasm is prevalent in multimodal social media but challenging to understand. Current research treats sarcasm detection and explanation as separate tasks, overlooking their causal dependency where detection results from the reasoning process that explains sarcasm.
Method: Proposes MuVaC framework that: 1) Models Multimodal Sarcasm Detection (MSD) and Multimodal Sarcasm Explanation (MuSE) using structural causal models with variational causal pathways for joint optimization; 2) Uses alignment-then-fusion approach for robust multimodal feature integration; 3) Ensures consistency between detection results and explanations for trustworthy reasoning.
Result: Experimental results show MuVaC’s superiority on public datasets, offering a new perspective for understanding multimodal sarcasm through joint optimization of detection and explanation.
Conclusion: MuVaC successfully bridges the gap between sarcasm detection and explanation by modeling their causal relationship, providing a framework that mimics human cognitive mechanisms for understanding multimodal sarcasm.
Abstract: The prevalence of sarcasm in multimodal dialogues on social platforms presents a crucial yet challenging task for understanding the true intent behind online content. Comprehensive sarcasm analysis requires two key aspects: Multimodal Sarcasm Detection (MSD) and Multimodal Sarcasm Explanation (MuSE). Intuitively, the act of detection is the result of the reasoning process that explains the sarcasm. Current research predominantly focuses on addressing either MSD or MuSE as a single task. Even though some recent work has attempted to integrate these tasks, their inherent causal dependency is often overlooked. To bridge this gap, we propose MuVaC, a variational causal inference framework that mimics human cognitive mechanisms for understanding sarcasm, enabling robust multimodal feature learning to jointly optimize MSD and MuSE. Specifically, we first model MSD and MuSE from the perspective of structural causal models, establishing variational causal pathways to define the objectives for joint optimization. Next, we design an alignment-then-fusion approach to integrate multimodal features, providing robust fusion representations for sarcasm detection and explanation generation. Finally, we enhance the reasoning trustworthiness by ensuring consistency between detection results and explanations. Experimental results demonstrate the superiority of MuVaC on public datasets, offering a new perspective for understanding multimodal sarcasm.
[123] Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation
Lakshan Cooray, Deshan Sumanathilaka, Pattigadapa Venkatesh Raju
Main category: cs.CL
TL;DR: Instruction-tuned small language models (SLMs) are evaluated for multi-turn customer-service QA using history summarization, showing some can approach LLM performance but with limitations in dialogue continuity.
Details
Motivation: Large Language Models (LLMs) have strong performance but high computational costs limit practical use in resource-constrained environments like customer-service systems. Small Language Models (SLMs) offer efficiency but their effectiveness for multi-turn QA with dialogue continuity requirements remains underexplored.
Method: Uses instruction-tuned SLMs with history summarization strategy to preserve conversational state. Evaluates nine instruction-tuned low-parameterized SLMs against three commercial LLMs using lexical/semantic similarity metrics, human evaluation, and LLM-as-a-judge methods. Introduces conversation stage-based qualitative analysis.
Result: Results show notable variation across SLMs: some demonstrate near-LLM performance while others struggle with dialogue continuity and contextual alignment. Highlights both potential and limitations of low-parameterized models for real-world customer-service QA.
Conclusion: SLMs show promise for efficient customer-service QA but current limitations in maintaining dialogue continuity and contextual understanding need addressing. The study provides insights into practical deployment of smaller models in resource-constrained environments.
Abstract: Customer-service question answering (QA) systems increasingly rely on conversational language understanding. While Large Language Models (LLMs) achieve strong performance, their high computational cost and deployment constraints limit practical use in resource-constrained environments. Small Language Models (SLMs) provide a more efficient alternative, yet their effectiveness for multi-turn customer-service QA remains underexplored, particularly in scenarios requiring dialogue continuity and contextual understanding. This study investigates instruction-tuned SLMs for context-summarized multi-turn customer-service QA, using a history summarization strategy to preserve essential conversational state. We also introduce a conversation stage-based qualitative analysis to evaluate model behavior across different phases of customer-service interactions. Nine instruction-tuned low-parameterized SLMs are evaluated against three commercial LLMs using lexical and semantic similarity metrics alongside qualitative assessments, including human evaluation and LLM-as-a-judge methods. Results show notable variation across SLMs, with some models demonstrating near-LLM performance, while others struggle to maintain dialogue continuity and contextual alignment. These findings highlight both the potential and current limitations of low-parameterized language models for real-world customer-service QA systems.
[124] SEAD: Self-Evolving Agent for Multi-Turn Service Dialogue
Yuqin Dai, Ning Gao, Wei Zhang, Jie Wang, Zichen Luo, Jinpeng Wang, Yujie Wang, Ruiyuan Wu, Chaozheng Wang
Main category: cs.CL
TL;DR: SEAD is a self-evolving agent framework for service dialogues that improves performance by generating diverse user states and realistic role-playing without large-scale human annotations.
Details
Motivation: Current LLMs perform suboptimally in service dialogues due to reliance on noisy, low-quality human conversation data, data scarcity, and difficulty simulating authentic goal-oriented user behaviors.
Method: SEAD decouples user modeling into two components: a Profile Controller that generates diverse user states to manage training curriculum, and a User Role-play Model that focuses on realistic role-playing, ensuring adaptive training scenarios rather than adversarial environments.
Result: SEAD significantly outperforms both open-source foundation models and closed-source commercial models, improving task completion rate by 17.6% and dialogue efficiency by 11.1%.
Conclusion: SEAD provides an effective framework for training service dialogue agents without large-scale human annotations, addressing key limitations in current approaches through self-evolving user modeling.
Abstract: Large Language Models have demonstrated remarkable capabilities in open-domain dialogues. However, current methods exhibit suboptimal performance in service dialogues, as they rely on noisy, low-quality human conversation data. This limitation arises from data scarcity and the difficulty of simulating authentic, goal-oriented user behaviors. To address these issues, we propose SEAD (Self-Evolving Agent for Service Dialogue), a framework that enables agents to learn effective strategies without large-scale human annotations. SEAD decouples user modeling into two components: a Profile Controller that generates diverse user states to manage training curriculum, and a User Role-play Model that focuses on realistic role-playing. This design ensures the environment provides adaptive training scenarios rather than acting as an unfair adversary. Experiments demonstrate that SEAD significantly outperforms Open-source Foundation Models and Closed-source Commercial Models, improving task completion rate by 17.6% and dialogue efficiency by 11.1%. Code is available at: https://github.com/Da1yuqin/SEAD.
[125] OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering
Yifan Zhu, Xinyu Mu, Tao Feng, Zhonghong Ou, Yuning Gong, Haoran Luo
Main category: cs.CL
TL;DR: OmniRAG-Agent: An agentic omnimodal QA method for budgeted long audio-video reasoning that combines retrieval-augmented generation with agent planning and optimization.
Details
Motivation: Address challenges in low-resource long audio-video QA including costly dense encoding, weak fine-grained retrieval, limited proactive planning, and lack of end-to-end optimization.
Method: Builds image-audio retrieval-augmented generation module for fetching relevant frames/audio snippets, uses agent loop for planning/tool calling across turns, and applies group relative policy optimization for joint improvement.
Result: Outperforms prior methods on OmniVideoBench, WorldSense, and Daily-Omni under low-resource settings, with ablations validating each component.
Conclusion: OmniRAG-Agent effectively addresses key challenges in long-horizon omnimodal QA through retrieval-augmented generation, agentic planning, and optimization techniques.
Abstract: Long-horizon omnimodal question answering answers questions by reasoning over text, images, audio, and video. Despite recent progress on OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and no clear end-to-end optimization. To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image-audio retrieval-augmented generation module that lets an OmniLLM fetch short, relevant frames and audio snippets from external banks. Moreover, it uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries. Furthermore, we apply group relative policy optimization to jointly improve tool use and answer quality over time. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show that OmniRAG-Agent consistently outperforms prior methods under low-resource settings and achieves strong results, with ablations validating each component.
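The retrieval module's job, fetching the few most relevant frame or audio embeddings for a query from external banks, reduces to top-k ranking by similarity. A plain-Python sketch with invented snippet ids and two-dimensional embeddings (real systems use learned high-dimensional embeddings and an ANN index):

```python
# Illustrative sketch (not the paper's implementation) of budgeted retrieval:
# rank pre-computed frame/audio embeddings by cosine similarity to a query
# embedding and keep only the top-k snippets for the agent to reason over.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def retrieve_top_k(query_emb, bank, k):
    """bank: list of (snippet_id, embedding); returns the k best snippet ids."""
    ranked = sorted(bank, key=lambda item: cosine(query_emb, item[1]),
                    reverse=True)
    return [snippet_id for snippet_id, _ in ranked[:k]]

bank = [("frame_12", [1.0, 0.0]),
        ("audio_03", [0.9, 0.1]),
        ("frame_88", [0.0, 1.0])]
print(retrieve_top_k([1.0, 0.05], bank, k=2))  # ['frame_12', 'audio_03']
```

Keeping k small is what makes the budgeted setting work: the agent loop then plans further tool calls over just these snippets instead of densely encoding the whole video.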
[126] GreekMMLU: A Native-Sourced Multitask Benchmark for Evaluating Language Models in Greek
Yang Zhang, Mersin Konomi, Christos Xypolopoulos, Konstantinos Divriotis, Konstantinos Skianis, Giannis Nikolentzos, Giorgos Stamou, Guokan Shang, Michalis Vazirgiannis
Main category: cs.CL
TL;DR: GreekMMLU: A native-sourced Greek benchmark for evaluating LLMs across 45 subjects with 21,805 questions, revealing performance gaps between models and providing analysis for improving Greek language capabilities.
Details
Motivation: Existing Greek evaluation benchmarks for LLMs are limited and often machine-translated from English, failing to capture authentic Greek linguistic and cultural characteristics, creating a need for native-sourced evaluation datasets.
Method: Created GreekMMLU benchmark with 21,805 multiple-choice questions across 45 subject areas, sourced or authored in Greek from academic, professional, and governmental exams. Organized under new subject taxonomy with educational difficulty levels. Released 16,857 samples publicly and reserved 4,948 for private leaderboard.
Result: Evaluation of 80+ LLMs revealed substantial performance gaps: frontier vs open-weight models, and Greek-adapted vs general multilingual models. Systematic analysis identified factors influencing performance including model scale, adaptation, and prompting.
Conclusion: GreekMMLU provides a robust, contamination-resistant benchmark for evaluating Greek language understanding in LLMs, revealing current limitations and offering insights for improving Greek language capabilities through better adaptation and scaling.
Abstract: Large Language Models (LLMs) are commonly trained on multilingual corpora that include Greek, yet reliable evaluation benchmarks for Greek, particularly those based on authentic, native-sourced content, remain limited. Existing datasets are often machine-translated from English, failing to capture Greek linguistic and cultural characteristics. We introduce GreekMMLU, a native-sourced benchmark for massive multitask language understanding in Greek, comprising 21,805 multiple-choice questions across 45 subject areas, organized under a newly defined subject taxonomy and annotated with educational difficulty levels spanning primary to professional examinations. All questions are sourced or authored in Greek from academic, professional, and governmental exams. We publicly release 16,857 samples and reserve 4,948 samples for a private leaderboard to enable robust and contamination-resistant evaluation. Evaluations of over 80 open- and closed-source LLMs reveal substantial performance gaps between frontier and open-weight models, as well as between Greek-adapted models and general multilingual ones. Finally, we provide a systematic analysis of factors influencing performance, including model scale, adaptation, and prompting, and derive insights for improving LLM capabilities in Greek.
[127] Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems
Zhangqi Duan, Arnav Kankaria, Dhruv Kartik, Andrew Lan
Main category: cs.CL
TL;DR: LLM-based framework for automatically labeling knowledge component (KC) correctness in student programming code, improving learning curve modeling compared to problem-level label propagation.
Details
Motivation: Real-world educational datasets lack fine-grained KC-level correctness labels, especially for open-ended programming tasks where multiple KCs are involved simultaneously. Propagating problem-level correctness to all associated KCs obscures partial mastery and leads to poorly fitted learning curves.
Method: Proposes an automated framework using LLMs to label KC-level correctness directly from student-written code. Includes assessment of whether each KC is correctly applied and introduces a temporal context-aware Code-KC mapping mechanism to better align KCs with individual student code.
Result: Experimental results show the framework leads to learning curves more consistent with cognitive theory and improves predictive performance compared to baselines. Human evaluation demonstrates substantial agreement between LLM and expert annotations.
Conclusion: The LLM-based framework effectively addresses the KC-level labeling challenge in programming education, enabling more accurate student modeling and learning analytics for open-ended tasks.
Abstract: Fine-grained skill representations, commonly referred to as knowledge components (KCs), are fundamental to many approaches in student modeling and learning analytics. However, KC-level correctness labels are rarely available in real-world datasets, especially for open-ended programming tasks where solutions typically involve multiple KCs simultaneously. Simply propagating problem-level correctness to all associated KCs obscures partial mastery and often leads to poorly fitted learning curves. To address this challenge, we propose an automated framework that leverages large language models (LLMs) to label KC-level correctness directly from student-written code. Our method assesses whether each KC is correctly applied and further introduces a temporal context-aware Code-KC mapping mechanism to better align KCs with individual student code. We evaluate the resulting KC-level correctness labels in terms of learning curve fit and predictive performance using the power law of practice and the Additive Factors Model. Experimental results show that our framework leads to learning curves that are more consistent with cognitive theory and improves predictive performance, compared to baselines. Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
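The evaluation above leans on the power law of practice, which models a KC's error rate as decaying like a power function of the number of practice opportunities. As a rough illustration (synthetic data and parameter values, not the paper's pipeline), such a curve can be fit by linear regression in log-log space:

```python
import numpy as np

def fit_power_law(opportunities, error_rates):
    """Fit E_t = E_1 * t**(-alpha) via least squares in log-log space."""
    log_t = np.log(opportunities)
    log_e = np.log(error_rates)
    slope, intercept = np.polyfit(log_t, log_e, 1)  # slope = -alpha
    return np.exp(intercept), -slope

# Synthetic per-opportunity error rates following an exact power law
t = np.arange(1, 11)
errors = 0.8 * t ** -0.5
e1, alpha = fit_power_law(t, errors)
print(round(float(e1), 2), round(float(alpha), 2))  # → 0.8 0.5
```

A well-fitted curve of this shape is what "more consistent with cognitive theory" refers to: error should fall smoothly with practice on a KC, which noisy problem-level label propagation tends to obscure.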
[128] MetaState: Persistent Working Memory Enhances Reasoning in Discrete Diffusion Language Models
Kejing Xia, Mingzhe Li, Lixuan Wei, Zhenbang Du, Xiangchi Yuan, Dachuan Shi, Qirui Jin, Wenke Lee
Main category: cs.CL
TL;DR: MetaState adds lightweight recurrent memory to frozen discrete diffusion language models to preserve continuous information across denoising steps, improving reasoning performance with minimal parameter overhead.
Details
Motivation: Standard discrete diffusion LLMs suffer from an "Information Island" issue where continuous information from intermediate denoising steps is discarded after sampling and remasking, preventing propagation of reasoning states across steps, which is particularly harmful for complex reasoning tasks.
Method: MetaState introduces three modules with shared time conditioning: a cross-attention Mixer to read backbone activations into memory slots, a GRU-style Updater to integrate information across steps, and a cross-attention Injector to write updated memory back into the backbone. Trained with K-step unrolling pipeline to learn multi-step dynamics.
Result: MetaState adds only ~0.6% trainable parameters while keeping backbone frozen, and consistently improves reasoning performance over frozen baselines on mathematical reasoning and code generation benchmarks, with average gain of 4.5% across all evaluations.
Conclusion: MetaState effectively addresses the Information Island bottleneck in discrete diffusion models by providing persistent working memory, enabling better reasoning performance with minimal computational overhead.
Abstract: Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. However, standard dLLMs condition each denoising step solely on the current hard-masked sequence, while intermediate continuous representations are discarded after sampling and remasking. We term this bottleneck the Information Island issue: continuous information remains isolated within individual denoising steps and fails to propagate across the trajectory. This bottleneck is especially harmful for reasoning, which requires intermediate reasoning state to be preserved and updated across many denoising steps. To address this limitation, we introduce MetaState, a lightweight recurrent augmentation that equips a frozen dLLM backbone with persistent, fixed-size working memory. MetaState comprises three modules with a shared time conditioner: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across steps, and a cross-attention Injector that writes the updated memory back into the backbone. We train these modules with a dedicated K-step unrolling pipeline to learn multi-step dynamics. MetaState adds only ~0.6% trainable parameters while keeping the backbone frozen, and consistently improves reasoning performance over frozen baselines on mathematical reasoning and code generation benchmarks, with an average gain of 4.5% across all evaluations.
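The Updater described above follows the GRU pattern of gating how much newly read information overwrites a persistent state. A minimal numerical sketch of one such gated update, with hypothetical dimensions and randomly initialized gate weights (the paper's actual Updater operates on learned memory slots read from and written back into the backbone):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical memory-slot width

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Randomly initialized gate weights, for illustration only
Wz, Uz = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
Wh, Uh = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

def gru_style_update(memory, read):
    """Gate how much of the newly read information replaces the memory."""
    z = sigmoid(Wz @ read + Uz @ memory)   # update gate in (0, 1)
    h = np.tanh(Wh @ read + Uh @ memory)   # candidate memory
    return (1.0 - z) * memory + z * h      # convex combination

memory = np.zeros(d)
for _ in range(5):                # unrolled over K = 5 denoising steps
    read = rng.normal(size=d)     # stand-in for the Mixer's read-out
    memory = gru_style_update(memory, read)
print(memory.shape)  # → (8,)
```

Because the memory is carried across the loop rather than recomputed each step, information from early denoising steps can influence later ones, which is exactly what the hard-masked sequence alone cannot provide.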
[129] A Browser-based Open Source Assistant for Multimodal Content Verification
Rosanna Milner, Michael Foster, Twin Karmakharm, Olesya Razuvayevskaya, Ian Roberts, Valentin Porcellini, Denis Teyssou, Kalina Bontcheva
Main category: cs.CL
TL;DR: A browser-based verification assistant tool that integrates multiple NLP classifiers to help journalists detect disinformation and AI-generated content through a unified interface.
Details
Motivation: Journalists and fact-checkers face challenges verifying digital media due to disinformation and AI-generated false content, with existing NLP detection tools being inaccessible and not integrated into daily workflows.
Method: Developed a browser-based tool (VERIFICATION ASSISTANT) that allows users to submit URLs/media files, automatically extracts content, routes it to backend NLP classifiers, and presents credibility signals in an accessible format.
Result: The tool is part of the widely adopted VERIFICATION PLUGIN (140,000+ users), provides actionable credibility signals, estimates AI-generated content, and offers verification guidance in a clear format.
Conclusion: The VERIFICATION ASSISTANT successfully bridges the gap between advanced NLP detection methods and non-expert users by integrating multiple credibility analysis services into a practical, accessible tool for real-world disinformation detection.
Abstract: Disinformation and false content produced by generative AI pose a significant challenge for journalists and fact-checkers who must rapidly verify digital media information. While there is an abundance of NLP models for detecting credibility signals such as persuasion techniques, subjectivity, or machine-generated text, such methods often remain inaccessible to non-expert users and are not integrated into their daily workflows as a unified framework. This paper demonstrates the VERIFICATION ASSISTANT, a browser-based tool designed to bridge this gap. The VERIFICATION ASSISTANT, a core component of the widely adopted VERIFICATION PLUGIN (140,000+ users), allows users to submit URLs or media files to a unified interface. It automatically extracts content and routes it to a suite of backend NLP classifiers, delivering actionable credibility signals, estimating AI-generated content, and providing other verification guidance in a clear, easy-to-digest format. This paper showcases the tool architecture, its integration of multiple NLP services, and its real-world application to detecting disinformation.
[130] Model Merging in the Era of Large Language Models: Methods, Applications, and Future Directions
Mingyang Song, Mao Zheng
Main category: cs.CL
TL;DR: Survey paper on model merging techniques for large language models, presenting a comprehensive taxonomy (FUSE) covering foundations, unification strategies, scenarios, and ecosystem.
Details
Motivation: As fine-tuned LLMs proliferate, model merging offers computationally efficient alternatives to ensembles and full retraining, enabling composition of specialized capabilities at minimal cost.
Method: Presents the FUSE taxonomy organized along Foundations (theoretical underpinnings), Unification Strategies (algorithmic approaches), Scenarios (downstream applications), and Ecosystem (tools/benchmarks). Reviews weight averaging, task vector arithmetic, sparsification-enhanced methods, mixture-of-experts, and evolutionary optimization.
Result: Comprehensive survey establishing structured foundation for model merging research, covering theoretical foundations, algorithmic approaches, practical applications, and supporting tools/benchmarks.
Conclusion: Provides researchers and practitioners with structured foundation for advancing model merging, identifies key open challenges and future directions in the field.
Abstract: Model merging combines the parameters of multiple neural networks into a single model without additional training. As fine-tuned large language models (LLMs) proliferate, merging offers a computationally efficient alternative to ensembles and full retraining, enabling practitioners to compose specialized capabilities at minimal cost. This survey examines model merging in the LLM era through the FUSE taxonomy, organized along Foundations, Unification Strategies, Scenarios, and Ecosystem. We first establish the theoretical underpinnings of merging, including loss landscape geometry and mode connectivity, then systematically review the algorithmic space spanning weight averaging, task vector arithmetic, sparsification-enhanced methods, mixture-of-experts architectures, and evolutionary optimization. We further examine downstream applications across multi-task learning, safety alignment, domain specialization, and federated learning, and survey the supporting ecosystem of tools and evaluation benchmarks. Finally, we identify key open challenges and future directions, aiming to equip researchers and practitioners with a structured foundation for advancing model merging.
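Of the algorithmic families surveyed, task vector arithmetic is the simplest to state: a task vector is the difference between finetuned and base weights, and merging adds a scaled combination of task vectors back onto the base. A toy sketch, with small dictionaries of arrays standing in for model state dicts:

```python
import numpy as np

def task_vector(finetuned, base):
    """Task vector: finetuned weights minus base weights, tensor by tensor."""
    return {k: finetuned[k] - base[k] for k in base}

def merge(base, task_vectors, coeff=1.0):
    """Add the scaled sum of task vectors back onto the base weights."""
    merged = {k: v.copy() for k, v in base.items()}
    for tv in task_vectors:
        for k in merged:
            merged[k] = merged[k] + coeff * tv[k]
    return merged

# Toy stand-ins for model state dicts: one weight tensor each
base = {"w": np.zeros(3)}
ft_math = {"w": np.array([1.0, 0.0, 0.0])}  # hypothetical math finetune
ft_code = {"w": np.array([0.0, 2.0, 0.0])}  # hypothetical code finetune

tvs = [task_vector(ft_math, base), task_vector(ft_code, base)]
merged = merge(base, tvs, coeff=0.5)        # merged["w"] = [0.5, 1.0, 0.0]
```

The sparsification-enhanced methods mentioned above refine exactly this operation, pruning small or conflicting task-vector entries before summation to reduce interference between tasks.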
[131] AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents
Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz
Main category: cs.CL
TL;DR: Tool-augmented LLM agents in financial advisory contexts show evaluation blindness: standard ranking metrics preserve recommendation quality under contamination while 65-93% of turns contain risk-inappropriate products, invisible to NDCG.
Details
Motivation: Current evaluation of tool-augmented LLM agents focuses on recommendation quality metrics (like NDCG) but fails to assess safety, particularly in high-stakes domains like financial advisory where inappropriate recommendations can cause harm.
Method: Paired-trajectory protocol replaying real financial dialogues under clean vs. contaminated tool-output conditions across 8 LLMs (7B to frontier), decomposing divergence into information-channel and memory-channel mechanisms, with causal interventions (activation patching, feature clamping, direct steering) and sparse autoencoder probing.
Result: Evaluation blindness observed: recommendation quality preserved under contamination (UPR~1.0) while 65-93% of turns contain risk-inappropriate products; violations emerge at turn 1 and persist without self-correction; susceptibility scales with instruction-following fidelity; models internally distinguish adversarial perturbations but fail to propagate signal to output.
Conclusion: Trajectory-level safety monitoring needed for deployed multi-turn agents; safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51-0.74; representation-to-action gap is structural and resists linear repair.
Abstract: Tool-augmented LLM agents increasingly operate as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking metrics that measure what is recommended but not whether it is safe for the user. We present a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across eight LLMs (7B to frontier), decomposing divergence into information-channel and memory-channel mechanisms. We observe evaluation blindness: recommendation quality is preserved under contamination (UPR~1.0) while risk-inappropriate products appear in 65-93% of turns, invisible to standard NDCG. Violations are information-channel-driven, emerge at turn 1, and persist without self-correction over 23-step trajectories. Even non-extreme perturbations (within-band corruption, narrative-only attacks) evade threshold monitors while producing significant drift. Susceptibility scales with instruction-following fidelity across all eight models. Sparse autoencoder probing reveals models internally distinguish adversarial perturbations but fail to propagate this signal to output; causal interventions (activation patching, feature clamping, direct steering) confirm this representation-to-action gap is structural and resists linear repair. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51-0.74. These results motivate trajectory-level safety monitoring for deployed multi-turn agents.
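The abstract describes sNDCG only as a safety-penalized NDCG variant; its exact definition is not given here. The sketch below shows standard NDCG alongside one plausible penalty, down-weighting the gain of risk-inappropriate items by a hypothetical factor alpha. This illustrates the evaluation-blindness point, not the authors' formula:

```python
import numpy as np

def dcg(gains):
    """Discounted cumulative gain of a ranked list of gains."""
    return sum(g / np.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def safety_penalized_ndcg(gains, unsafe, alpha=0.5):
    """Down-weight gains of risk-inappropriate items (alpha is hypothetical)."""
    penalized = [g * (1.0 - alpha) if u else g for g, u in zip(gains, unsafe)]
    return dcg(penalized) / dcg(sorted(gains, reverse=True))

# A ranking that is ideal by relevance, but whose top item violates the
# user's risk profile: plain NDCG cannot see the violation.
gains = [3.0, 2.0, 1.0]
unsafe = [True, False, False]
print(round(ndcg(gains), 2))                           # → 1.0
print(round(safety_penalized_ndcg(gains, unsafe), 2))  # → 0.68
```

The first score stays perfect no matter how unsafe the top item is, which is the "evaluation blindness" the paper reports; the penalized score drops as soon as a violation appears in the ranking.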
[132] GhanaNLP Parallel Corpora: Comprehensive Multilingual Resources for Low-Resource Ghanaian Languages
Lawrence Adu Gyamfi, Paul Azunre, Stephen Edward Moore, Joel Budu, Akwasi Asare, Mich-Seth Owusu, Jonathan Ofori Asiamah
Main category: cs.CL
TL;DR: GhanaNLP initiative creates 41,513 parallel sentence pairs for 5 Ghanaian languages (Twi, Fante, Ewe, Ga, Kusaal) to address low-resource language challenges in NLP.
Details
Motivation: Low-resource languages face challenges due to limited digitized linguistic data. Ghanaian languages are widely spoken but underrepresented in digital spaces, creating a need for curated parallel corpora to support NLP research and applications.
Method: Human professionals collected, translated, and annotated parallel sentence pairs between local Ghanaian languages and English. Data enriched with standard structural metadata for consistency and usability.
Result: Created 41,513 parallel sentence pairs across 5 Ghanaian languages. Deployed in real-world applications like Khaya AI translation engine. Datasets support machine translation, speech technologies, and language preservation.
Conclusion: This work contributes to democratizing AI by enabling inclusive language technologies for African languages, supporting research, education, and commercial applications in low-resource language contexts.
Abstract: Low-resource languages present unique challenges for natural language processing due to the limited availability of digitized and well-structured linguistic data. To address this gap, the GhanaNLP initiative has developed and curated 41,513 parallel sentence pairs for the Twi, Fante, Ewe, Ga, and Kusaal languages, which are widely spoken across Ghana yet remain underrepresented in digital spaces. Each dataset consists of carefully aligned sentence pairs between a local language and English. The data were collected, translated, and annotated by human professionals and enriched with standard structural metadata to ensure consistency and usability. These corpora are designed to support research, educational, and commercial applications, including machine translation, speech technologies, and language preservation. This paper documents the dataset creation methodology, structure, intended use cases, and evaluation, as well as their deployment in real-world applications such as the Khaya AI translation engine. Overall, this work contributes to broader efforts to democratize AI by enabling inclusive and accessible language technologies for African languages.
[133] sebis at ArchEHR-QA 2026: How Much Can You Do Locally? Evaluating Grounded EHR QA on a Single Notebook
Ibrahim Ebrar Yurt, Fabian Karl, Tejaswi Choppa, Florian Matthes
Main category: cs.CL
TL;DR: Smaller models running locally on commodity hardware can achieve competitive performance on clinical EHR question answering tasks without cloud infrastructure, enabling privacy-preserving medical AI systems.
Details
Motivation: Clinical EHR question answering needs privacy-preserving solutions that don't rely on cloud infrastructure due to medical data privacy constraints and computational limitations in clinical environments.
Method: Participated in all four subtasks of ArchEHR-QA 2026 shared task, evaluating multiple approaches designed to run on commodity hardware without external APIs or cloud infrastructure, using properly configured smaller models.
Result: Systems achieved competitive performance on shared task leaderboards, performing above average in two subtasks, with smaller models approaching performance of larger systems when properly configured.
Conclusion: Privacy-preserving EHR QA systems running fully locally are feasible with current models and commodity hardware, enabling clinical deployment without compromising data privacy.
Abstract: Clinical question answering over electronic health records (EHRs) can help clinicians and patients access relevant medical information more efficiently. However, many recent approaches rely on large cloud-based models, which are difficult to deploy in clinical environments due to privacy constraints and computational requirements. In this work, we investigate how far grounded EHR question answering can be pushed when restricted to a single notebook. We participate in all four subtasks of the ArchEHR-QA 2026 shared task and evaluate several approaches designed to run on commodity hardware. All experiments are conducted locally without external APIs or cloud infrastructure. Our results show that such systems can achieve competitive performance on the shared task leaderboards. In particular, our submissions perform above average in two subtasks, and we observe that smaller models can approach the performance of much larger systems when properly configured. These findings suggest that privacy-preserving EHR QA systems running fully locally are feasible with current models and commodity hardware. The source code is available at https://github.com/ibrahimey/ArchEHR-QA-2026.
[134] ExPosST: Explicit Positioning with Adaptive Masking for LLM-Based Simultaneous Machine Translation
Yuzhe Shang, Pengzhi Gao, Yazheng Yang, Jiayao Ma, Wei Liu, Jian Luan, Jinsong Su
Main category: cs.CL
TL;DR: ExPosST: A framework for simultaneous machine translation using LLMs that resolves positional mismatch through explicit position allocation and policy-consistent fine-tuning.
Details
Motivation: Applying decoder-only LLMs to simultaneous machine translation creates a positional mismatch dilemma between decoding efficiency and positional consistency. Existing approaches fail to achieve inference efficiency, positional consistency, and broad model compatibility simultaneously.
Method: Proposes ExPosST framework with explicit position allocation that reserves fixed positional slots for incoming source tokens, enabling efficient decoding with KV cache across different positional encoding methods. Also introduces policy-consistent fine-tuning strategy to align training with inference-time decoding behavior.
Result: Experiments across multiple language pairs demonstrate that ExPosST effectively supports simultaneous translation under diverse policies.
Conclusion: ExPosST provides a general framework that resolves the positional mismatch dilemma in applying LLMs to simultaneous translation, achieving better balance between efficiency and consistency.
Abstract: Large language models (LLMs) have recently demonstrated promising performance in simultaneous machine translation (SimulMT). However, applying decoder-only LLMs to SimulMT introduces a positional mismatch, which leads to a dilemma between decoding efficiency and positional consistency. Existing approaches often rely on specific positional encodings or carefully designed prompting schemes, and thus fail to simultaneously achieve inference efficiency, positional consistency, and broad model compatibility. In this work, we propose ExPosST, a general framework that resolves this dilemma through explicit position allocation. ExPosST reserves fixed positional slots for incoming source tokens, enabling efficient decoding with KV cache across different positional encoding methods. To further bridge the gap between fine-tuning and inference, we introduce a policy-consistent fine-tuning strategy that aligns training with inference-time decoding behavior. Experiments across multiple language pairs demonstrate that ExPosST effectively supports simultaneous translation under diverse policies.
[135] BanglaSocialBench: A Benchmark for Evaluating Sociopragmatic and Cultural Alignment of LLMs in Bangladeshi Social Interaction
Tanvir Ahmed Sijan, S. M Golam Rifat, Pankaj Chowdhury Partha, Md. Tanjeed Islam, Md. Musfique Anwar
Main category: cs.CL
TL;DR: BanglaSocialBench: First benchmark evaluating sociopragmatic competence in Bangla for LLMs, focusing on culturally appropriate language use in social hierarchy, kinship, and customs.
Details
Motivation: LLMs show strong multilingual fluency but lack sociopragmatic competence - the ability to use language appropriately in social contexts. Bangla presents particular challenges with its three-tiered pronominal system, kinship-based addressing, and culturally embedded social customs that require sensitivity to social hierarchy and interactional norms.
Method: Created BanglaSocialBench with 1,719 culturally grounded instances across three domains: Bangla Address Terms, Kinship Reasoning, and Social Customs. All content written and verified by native Bangla speakers. Evaluated twelve contemporary LLMs in zero-shot settings to assess sociopragmatic competence.
Result: LLMs show systematic patterns of cultural misalignment: default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, conflate kinship terminology across religious contexts. Sociopragmatic failures are structured and non-random, revealing persistent limitations in culturally appropriate language inference.
Conclusion: Current LLMs have significant limitations in inferring and applying culturally appropriate language use in realistic Bangladeshi social interactions. Sociopragmatic competence requires deeper cultural understanding beyond factual recall, highlighting the need for culturally grounded benchmarks and model improvements.
Abstract: Large Language Models have demonstrated strong multilingual fluency, yet fluency alone does not guarantee socially appropriate language use. In high-context languages, communicative competence requires sensitivity to social hierarchy, relational roles, and interactional norms that are encoded directly in everyday language. Bangla exemplifies this challenge through its three-tiered pronominal system, kinship-based addressing, and culturally embedded social customs. We introduce BanglaSocialBench, the first benchmark designed to evaluate sociopragmatic competence in Bangla through context-dependent language use rather than factual recall. The benchmark spans three domains: Bangla Address Terms, Kinship Reasoning, and Social Customs, and consists of 1,719 culturally grounded instances written and verified by native Bangla speakers. We evaluate twelve contemporary LLMs in a zero-shot setting and observe systematic patterns of cultural misalignment. Models frequently default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, and conflate kinship terminology across religious contexts. Our findings show that sociopragmatic failures are often structured and non-random, revealing persistent limitations in how current LLMs infer and apply culturally appropriate language use in realistic Bangladeshi social interactions.
[136] EngGPT2: Sovereign, Efficient and Open Intelligence
G. Ciarfaglia, A. Rosanova, S. Cipolla, J. Bartoli, A. Di Domenico, C. Fioroni, A. Fontana, M. R. Scoleri, M. I. Mone, D. Franchi, M. C. Del Gaudio, A. Leodori, F. Cinti, M. Capozzi, C. Baston, F. Picariello, M. Gabusi, S. Bonura, V. Morreale, I. Bailo
Main category: cs.CL
TL;DR: EngGPT2-16B-A3B is an efficient Italian LLM with Mixture-of-Experts architecture, trained on 2.5T tokens with 25% Italian data, offering multiple reasoning modes while consuming significantly less power than comparable dense models.
Details
Motivation: To create a sovereign, efficient, and open European LLM that combines performance with resource efficiency, aligns with the EU AI Act, and serves Italian/European NLP needs with lower computational requirements than existing models.
Method: Trained-from-scratch Mixture-of-Experts architecture with 16B total parameters (3B active per inference), trained on 2.5T tokens (25% Italian data), positioned between GPT-OSS and Qwen3 expert sizes, with multiple reasoning modes.
Result: Achieves performance comparable to dense 8B-16B models on benchmarks (MMLU-Pro, GSM8K, IFEval, HumanEval) while using 1/5 to 1/2 inference power and 1/10 to 1/6 training data/power, with strong Italian language capabilities.
Conclusion: EngGPT2 sets a new standard for resource-efficient, high-performance LLMs tailored to European contexts, offering sovereign AI development aligned with EU regulations while maintaining competitive performance.
Abstract: EngGPT2-16B-A3B is the latest iteration of Engineering Group’s Italian LLM, built to be a Sovereign, Efficient and Open model. EngGPT2 is trained on 2.5 trillion tokens - less than Qwen3’s 36T or Llama3’s 15T - and delivers performance on key benchmarks, including MMLU-Pro, GSM8K, IFEval and HumanEval, comparable to dense models in the 8B-16B range, while requiring one-fifth to half of the inference power, and between one-tenth and one-sixth of the training data and corresponding training power. Designed as a trained-from-scratch Mixture-of-Experts (MoE) architecture, EngGPT2 features 16 billion parameters with 3 billion active per inference, with expert sizes positioned between those used in GPT-OSS and Qwen3. Approximately 25% of its training corpus consists of Italian-language data, to deliver strong capabilities for European and Italian NLP tasks among models of similar scale. This efficiency aims to position EngGPT2 as a key contributor to the growing portfolio of open-weight European models, combining performance and efficiency with full alignment to the EU AI Act. EngGPT2 is also a single model capable of multiple reasoning modes: non-reasoning, reasoning in Italian or English, and turbo-reasoning (a concise, bullet-point style reasoning available in both languages designed for real-time reasoning use cases). EngGPT2 aims to set a new standard for resource-conscious, high-performance LLMs tailored to European and Italian contexts.
[137] HypeLoRA: Hyper-Network-Generated LoRA Adapters for Calibrated Language Model Fine-Tuning
Bartosz Trojan, Filip Gębala
Main category: cs.CL
TL;DR: LoRA-based parameter-efficient fine-tuning achieves calibration parity with full fine-tuning for RoBERTa on GLUE tasks, with hyper-network adaptation showing similar performance and revealing trade-offs between calibration and accuracy.
Details
Motivation: Transformer models often produce overconfident predictions that don't match true empirical frequencies (miscalibration). This work investigates whether parameter-efficient adaptation methods like LoRA can maintain good calibration while being more efficient than full fine-tuning.
Method: Evaluates LoRA and a novel hyper-network-based adaptation framework for RoBERTa across the GLUE benchmark. The hyper-network approach generates LoRA factors (A and B matrices) with structural coupling across layers. Uses calibration metrics including ECE, MCE, and ACE.
Result: LoRA-based adaptation achieves calibration parity with (and sometimes exceeds) full fine-tuning while maintaining higher parameter efficiency. Hyper-network approach produces similar results to standard LoRA, with better MCC on CoLA. Freezing matrices A improves ECE but requires balancing with accuracy trade-offs.
Conclusion: Structured low-rank updates provide a viable foundation for uncertainty-aware Transformer architectures, clarifying the relationship between parameter efficiency and probabilistic reliability.
Abstract: Modern Transformer-based models frequently suffer from miscalibration, producing overconfident predictions that do not reflect true empirical frequencies. This work investigates the calibration dynamics of LoRA: Low-Rank Adaptation and a novel hyper-network-based adaptation framework as parameter-efficient alternatives to full fine-tuning for RoBERTa. Evaluating across the GLUE benchmark, we demonstrate that LoRA-based adaptation consistently achieves calibration parity with (and in specific tasks exceeds) full fine-tuning, while maintaining significantly higher parameter efficiency. We further explore a dynamic approach where a shared hyper-network generates LoRA factors (A and B matrices) to induce structural coupling across layers. This approach produced results similar to standard LoRA fine-tuning, even achieving better MCC on the CoLA dataset. Our study also reveals a critical trade-off: constraining the adaptation space (e.g., freezing matrices A) acts as a powerful regularizer that enhances Expected Calibration Error (ECE), but necessitates a carefully balanced sacrifice in downstream task accuracy. To support future research, we provide a unified and reproducible implementation of contemporary calibration metrics, including ECE, MCE, and ACE. Our findings clarify the relationship between parameter efficiency and probabilistic reliability, positioning structured low-rank updates as a viable foundation for uncertainty-aware Transformer architectures. Code available at: https://github.com/btrojan-official/HypeLoRA
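Expected Calibration Error, the central metric above, bins predictions by confidence and averages the gap between mean confidence and accuracy per bin, weighted by bin occupancy. A minimal sketch (the repository's implementation may differ in binning details):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: occupancy-weighted |mean confidence - accuracy| over bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# Overconfident toy model: claims 95% confidence, is right half the time
conf = [0.95, 0.95, 0.95, 0.95]
hit = [1, 0, 1, 0]
print(round(expected_calibration_error(conf, hit), 2))  # → 0.45
```

A perfectly calibrated model would score 0: in every confidence bin, its accuracy would match its stated confidence.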
[138] Automatic Analysis of Collaboration Through Human Conversational Data Resources: A Review
Yi Yu, Maria Boritchev, Chloé Clavel
Main category: cs.CL
TL;DR: A review paper on collaboration analysis using task-oriented conversational data, covering theories, coding schemes, tasks, and modeling approaches for understanding collaborative processes through verbal communication.
Details
Motivation: Collaboration is a fundamental human behavior where conversation serves as the primary medium for information exchange. The paper aims to understand how task-oriented human-human conversational data can be utilized for collaboration analysis, addressing a gap in systematic approaches to studying collaborative processes through verbal communication.
Method: The paper conducts a comprehensive review of existing literature on collaboration analysis using task-oriented conversation resources. It systematically examines related theories, coding schemes, tasks, and modeling approaches to provide a structured overview of the field.
Result: The review synthesizes current approaches and identifies patterns in how conversational data is used for collaboration analysis. It provides a practical resource for researchers and highlights unexplored areas for future research in this domain.
Conclusion: Task-oriented conversational data is a valuable resource for collaboration analysis, and systematic approaches combining theories, coding schemes, and modeling techniques can advance our understanding of collaborative processes. The review serves as both a practical guide and a roadmap for future research directions.
Abstract: Collaboration is a task-oriented, high-level human behavior. In most cases, conversation serves as the primary medium for information exchange and coordination, making conversational data a valuable resource for the automatic analysis of collaborative processes. In this paper, we focus on verbal aspects of collaboration and conduct a review of collaboration analysis using task-oriented conversation resources, encompassing related theories, coding schemes, tasks, and modeling approaches. We aim to address the question of how to utilize task-oriented human-human conversational data for collaboration analysis. We hope our review will serve as a practical resource and illuminate unexplored areas for future collaboration analysis.
[139] Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models
Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, Tuhin Chakrabarty
Main category: cs.CL
TL;DR: Finetuning LLMs on plot summaries causes them to reproduce copyrighted books verbatim, revealing that model weights store copies of training data and bypassing safety alignment protections.
Details
Motivation: To investigate whether frontier LLM companies' claims about not storing copies of training data and having effective safety measures against copyright infringement are accurate, particularly examining if finetuning can bypass these protections.
Method: Finetuned GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 models to expand plot summaries into full text, then tested their ability to reproduce held-out copyrighted books using only semantic descriptions as prompts without actual book text.
Result: Models reproduced up to 85-90% of held-out copyrighted books with verbatim spans exceeding 460 words. Finetuning on one author’s novels unlocked recall of books from over 30 unrelated authors. Three different models memorized the same books in the same regions with high correlation (r ≥ 0.90).
Conclusion: Model weights store copies of copyrighted works, finetuning reactivates latent memorization from pretraining, and current safety measures are insufficient to prevent copyright infringement, undermining legal defenses based on these protections.
Abstract: Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami’s novels unlocks verbatim recall of copyrighted books from over 30 unrelated authors. The effect is not specific to any training author or corpus: random author pairs and public-domain finetuning data produce comparable extraction, while finetuning on synthetic text yields near-zero extraction, indicating that finetuning on individual authors’ works reactivates latent memorization from pretraining. Three models from different providers memorize the same books in the same regions (r ≥ 0.90), pointing to an industry-wide vulnerability. Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors’ works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.
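Verbatim-span figures like the "spans exceeding 460 words" above are typically measured as the longest run of consecutive words shared between model output and source; a minimal word-level sketch (the paper's exact matching protocol may differ, e.g. in tokenization and normalization):

```python
def longest_verbatim_span(generated: str, source: str) -> int:
    """Longest common run of consecutive words between two texts
    (word-level longest common substring, O(n*m) dynamic programming)."""
    g, s = generated.split(), source.split()
    best = 0
    prev = [0] * (len(s) + 1)
    for i in range(1, len(g) + 1):
        cur = [0] * (len(s) + 1)
        for j in range(1, len(s) + 1):
            if g[i - 1] == s[j - 1]:
                # extend the matching run ending at (i-1, j-1)
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

span = longest_verbatim_span("the cat sat on the mat today",
                             "yesterday the cat sat on the mat")
```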
[140] KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning
Shuai Wang, Yinan Yu
Main category: cs.CL
TL;DR: KG-Hopper: RL framework enabling compact LLMs to perform integrated multi-hop KG reasoning in single inference round
Details
Motivation: LLMs struggle with knowledge-intensive reasoning like KBQA; existing approaches use sequential pipelines causing error cascades and lack flexibility.
Method: Reinforcement Learning framework trains Reasoning LLM to embed entire KG traversal and decision process into unified “thinking” stage, enabling global reasoning with backtracking
Result: 7B-parameter KG-Hopper outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models like GPT-3.5-Turbo/GPT-4o-mini
Conclusion: KG-Hopper enables compact open LLMs to perform integrated multi-hop KG reasoning efficiently, addressing limitations of sequential approaches
Abstract: Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs), exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified “thinking” stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: https://github.com/Wangshuaiia/KG-Hopper.
[141] Retrieving Climate Change Disinformation by Narrative
Max Upravitelev, Veronika Solopova, Charlott Jakob, Premtim Sahitaj, Sebastian Möller, Vera Schmitt
Main category: cs.CL
TL;DR: Narrative detection reframed as retrieval task using hypothetical document generation to bridge abstract narrative descriptions with concrete text instantiations for climate disinformation detection.
Details
Motivation: Traditional climate disinformation detection relies on fixed taxonomies that can't accommodate emerging narratives, requiring a more flexible approach that doesn't need predefined labels.
Method: Proposes SpecFi framework that generates hypothetical documents using community summaries from graph-based community detection as few-shot examples, treating narrative detection as retrieval task with narrative core messages as queries.
Result: Achieves MAP of 0.505 on CARDS dataset without narrative labels, introduces narrative variance metric showing SpecFi-CS remains robust (32.7% loss) while BM25 degrades significantly (63.4% loss) on high-variance narratives.
Conclusion: Retrieval-based approach enables detection of emerging narratives without predefined taxonomies, with graph-based methods capable of surfacing narrative structure from unlabeled text similar to expert-crafted taxonomies.
Abstract: Detecting climate disinformation narratives typically relies on fixed taxonomies, which do not accommodate emerging narratives. Thus, we re-frame narrative detection as a retrieval task: given a narrative’s core message as a query, rank texts from a corpus by alignment with that narrative. This formulation requires no predefined label set and can accommodate emerging narratives. We repurpose three climate disinformation datasets (CARDS, Climate Obstruction, climate change subset of PolyNarrative) for retrieval evaluation and propose SpecFi, a framework that generates hypothetical documents to bridge the gap between abstract narrative descriptions and their concrete textual instantiations. SpecFi uses community summaries from graph-based community detection as few-shot examples for generation, achieving a MAP of 0.505 on CARDS without access to narrative labels. We further introduce narrative variance, an embedding-based difficulty metric, and show via partial correlation analysis that standard retrieval degrades on high-variance narratives (BM25 loses 63.4% of MAP), while SpecFi-CS remains robust (32.7% loss). Our analysis also reveals that unsupervised community summaries converge on descriptions close to expert-crafted taxonomies, suggesting that graph-based methods can surface narrative structure from unlabeled text.
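The hypothetical-document retrieval step can be sketched as follows. This is a toy illustration, not SpecFi itself: the LLM generation step is replaced by a hand-written hypothetical document, the bag-of-words embedding stands in for a real sentence encoder, and `rank_by_narrative` and the two-text corpus are invented for the example:

```python
import numpy as np

def embed(text, vocab):
    # Toy L2-normalized bag-of-words vector; a real system
    # would use a neural sentence encoder here.
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def rank_by_narrative(hypothetical_doc, corpus):
    """Rank corpus texts by cosine similarity to a hypothetical document
    generated from the narrative's core message (generation not shown)."""
    words = {w for t in corpus + [hypothetical_doc] for w in t.lower().split()}
    vocab = {w: i for i, w in enumerate(sorted(words))}
    q = embed(hypothetical_doc, vocab)
    scored = [(float(embed(t, vocab) @ q), t) for t in corpus]
    return sorted(scored, key=lambda s: -s[0])

corpus = [
    "solar panels are unreliable and wind power fails in winter",
    "new study maps glacier retreat in the alps",
]
hyp = "renewables like solar and wind are unreliable energy sources"
ranked = rank_by_narrative(hyp, corpus)
```

The hypothetical document, rather than the abstract narrative description, is what gets matched against the corpus, bridging the vocabulary gap the paper describes.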
[142] PaperVoyager: Building Interactive Web with Visual Language Models
Dasen Dai, Biao Wu, Meng Fang, Wenhao Wang
Main category: cs.CL
TL;DR: Paper-to-Interactive-System Agent converts research papers into executable interactive web systems for dynamic mechanism exploration, with PaperVoyager framework improving generation quality.
Details
Motivation: Existing document agents create static artifacts (summaries, webpages, slides) which are insufficient for technical papers involving dynamic mechanisms and state transitions. There's a need for interactive systems that allow users to manipulate inputs and observe dynamic behaviors.
Method: Proposes PaperVoyager, a structured generation framework that performs end-to-end processing without human intervention: paper understanding, system modeling, and interactive webpage synthesis. It explicitly models mechanisms and interaction logic during synthesis to convert PDF papers into executable interactive web systems.
Result: PaperVoyager significantly improves the quality of generated interactive systems. The authors introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth for evaluation.
Conclusion: The work offers a new paradigm for interactive scientific paper understanding by enabling conversion of research papers into executable interactive systems, allowing users to explore dynamic mechanisms through manipulation and observation.
Abstract: Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.
cs.CV
[143] An Annotation-to-Detection Framework for Autonomous and Robust Vine Trunk Localization in the Field by Mobile Agricultural Robots
Dimitrios Chatziparaschis, Elia Scudiero, Brent Sams, Konstantinos Karydis
Main category: cs.CV
TL;DR: A multi-modal annotation-to-detection framework for agricultural robotics using limited labeled data, featuring cross-modal annotation transfer and early sensor fusion for robust object detection in unstructured environments.
Details
Motivation: Agricultural fields present dynamic, heterogeneous challenges for object detection in autonomous robots, with need for real-time systems that don't rely on large manually labeled datasets.
Method: Comprehensive annotation-to-detection framework with cross-modal annotation transfer, early-stage sensor fusion pipeline, multi-stage detection architecture, integrated with customized multi-modal LiDAR/Odometry Mapping and tree association module.
Result: System identified over 70% of trees in single traversal with mean distance error <0.37m in novel vineyard settings with diverse lighting and crop densities.
Conclusion: Framework achieves robust detection with limited starting annotations through multi-modal, incremental-stage annotation and training, showing potential for real-world agricultural applications.
Abstract: The dynamic and heterogeneous nature of agricultural fields presents significant challenges for object detection and localization, particularly for autonomous mobile robots that are tasked with surveying previously unseen unstructured environments. Concurrently, there is a growing need for real-time detection systems that do not depend on large-scale manually labeled real-world datasets. In this work, we introduce a comprehensive annotation-to-detection framework designed to train a robust multi-modal detector using limited and partially labeled training data. The proposed methodology incorporates cross-modal annotation transfer and an early-stage sensor fusion pipeline, which, in conjunction with a multi-stage detection architecture, effectively trains and enhances the system’s multi-modal detection capabilities. The effectiveness of the framework was demonstrated through vine trunk detection in novel vineyard settings that featured diverse lighting conditions and varying crop densities to validate performance. When integrated with a customized multi-modal LiDAR and Odometry Mapping (LOAM) algorithm and a tree association module, the system demonstrated high-performance trunk localization, successfully identifying over 70% of trees in a single traversal with a mean distance error of less than 0.37m. The results reveal that by leveraging multi-modal, incremental-stage annotation and training, the proposed framework achieves robust detection performance regardless of limited starting annotations, showcasing its potential for real-world and near-ground agricultural applications.
[144] A Multimodal Deep Learning Framework for Edema Classification Using HCT and Clinical Data
Aram Ansary Ogholbake, Hannah Choi, Spencer Brandenburg, Alyssa Antuna, Zahraa Al-Sharshahi, Makayla Cox, Haseeb Ahmed, Jacqueline Frank, Nathan Millson, Luke Bauerle, Jessica Lee, David Dornbos, Qiang Cheng
Main category: cs.CV
TL;DR: AttentionMixer: A multimodal framework combining head CT scans with clinical metadata for brain edema detection using cross-attention fusion and MLP-Mixer refinement.
Details
Motivation: Current brain edema detection methods often rely solely on head CT imaging, ignoring valuable clinical context. Clinical metadata (age, lab values, timing) provides complementary information but is typically either ignored or naively concatenated with imaging features.
Method: 1) Self-supervised Vision Transformer Autoencoder (ViT-AE++) encodes HCT volumes without large labeled datasets. 2) Clinical metadata mapped to same feature space as keys/values. 3) Cross-attention fusion where HCT features serve as queries to dynamically modulate imaging features. 4) Lightweight MLP-Mixer refines fused representation. 5) Learnable embeddings handle missing metadata.
Result: Superior performance compared to HCT-only, metadata-only, and prior multimodal baselines: accuracy 87.32%, precision 92.10%, F1-score 85.37%, AUC 94.14%. Ablation studies confirm benefits of cross-attention and MLP-Mixer refinement.
Conclusion: Structured, interpretable multimodal fusion combining imaging with clinical metadata substantially improves brain edema detection in clinical practice, with cross-attention providing dynamic feature modulation and interpretability.
Abstract: We propose AttentionMixer, a unified deep learning framework for multimodal detection of brain edema that combines structural head CT (HCT) with routine clinical metadata. While HCT provides rich spatial information, clinical variables such as age, laboratory values, and scan timing capture complementary context that might be ignored or naively concatenated. AttentionMixer is designed to fuse these heterogeneous sources in a principled and efficient manner. HCT volumes are first encoded using a self-supervised Vision Transformer Autoencoder (ViT-AE++), without requiring large labeled datasets. Clinical metadata are mapped into the same feature space and used as keys and values in a cross-attention module, where HCT-derived feature vector serves as queries. This cross-attention fusion allows the network to dynamically modulate imaging features based on patient-specific context and provides an interpretable mechanism for multimodal integration. A lightweight MLP-Mixer then refines the fused representation before final classification, enabling global dependency modeling with substantially reduced parameter overhead. Missing or incomplete metadata are handled via a learnable embedding, promoting robustness to real-world clinical data quality. We evaluate AttentionMixer on a curated brain HCT cohort with expert edema annotations using five-fold cross-validation. Compared with strong HCT-only, metadata-only, and prior multimodal baselines, AttentionMixer achieves superior performance (accuracy 87.32%, precision 92.10%, F1-score 85.37%, AUC 94.14%). Ablation studies confirm the benefit of both cross-attention and MLP-Mixer refinement, and permutation-based metadata importance analysis highlights clinically meaningful variables driving predictions. These results demonstrate that structured, interpretable multimodal fusion can substantially improve edema detection in clinical practice.
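The cross-attention fusion at the heart of AttentionMixer can be sketched in a few lines (single-head, NumPy). The feature dimension, the number of metadata tokens, and the random inputs are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def cross_attention(img_feat, meta_tokens):
    """The HCT-derived feature vector acts as the query; clinical-metadata
    embeddings act as keys and values, so attention weights decide how much
    each clinical variable modulates the imaging representation."""
    d = img_feat.shape[-1]
    q = img_feat[None, :]                     # query:      (1, d)
    k = v = meta_tokens                       # keys/values: (n_meta, d)
    scores = (q @ k.T) / np.sqrt(d)           # scaled dot-product, (1, n_meta)
    w = np.exp(scores - scores.max())
    w = w / w.sum()                           # softmax attention weights
    return (w @ v)[0]                         # context-modulated feature, (d,)

rng = np.random.default_rng(0)
img = rng.standard_normal(8)                  # stand-in for a ViT-AE++ feature
meta = rng.standard_normal((4, 8))            # e.g. age, lab value, timing, missing-slot
fused = cross_attention(img, meta)
```

The learned missing-data embedding in the paper would simply occupy one of the key/value rows; the MLP-Mixer refinement is applied to `fused` afterwards.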
[145] A Near-Raw Talking-Head Video Dataset for Various Computer Vision Tasks
Babak Naderi, Ross Cutler
Main category: cs.CV
TL;DR: A near-raw talking-head video dataset of 847 recordings (212 minutes) captured from 805 participants using 446 webcams, stored losslessly with perceptual quality annotations, used for benchmarking video codec efficiency.
Details
Motivation: There is a scarcity of high-fidelity talking-head video datasets for real-time communication research, with existing datasets being limited in scale and signal quality.
Method: Collected 847 talking-head recordings (15s each) from 805 participants using 446 consumer webcams, stored using FFV1 lossless codec to preserve camera-native signals, annotated with MOS and perceptual quality tokens, and curated a stratified benchmarking subset of 120 clips.
Result: Created a dataset 5× larger than previous talking-head webcam datasets with lossless signal fidelity; codec evaluation showed H.266 achieves up to -71.3% BD-rate savings vs H.264, with significant interactions between encoder, dataset, and content conditions.
Conclusion: The dataset provides a valuable resource for training and benchmarking video compression and enhancement models in real-time communication applications.
Abstract: Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a near-raw dataset of 847 talking-head recordings (approximately 212 minutes), each 15 s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal – uncompressed (24.4%) or MJPEG-encoded (75.6%) – without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background blur, and background replacement. Codec efficiency evaluation across four datasets and four codecs, namely H.264, H.265, H.266, and AV1, yields VMAF BD-rate savings up to -71.3% (H.266) relative to H.264, with significant encoder×dataset (η_p² = .112) and encoder×content condition (η_p² = .149) interactions, demonstrating that both content type and background processing affect compression efficiency. The dataset offers 5× the scale of the largest prior talking-head webcam dataset (847 vs. 160 clips) with lossless signal fidelity, establishing a resource for training and benchmarking video compression and enhancement models in real-time communication.
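The BD-rate figures above follow the Bjøntegaard metric: fit a cubic of log-bitrate against quality for each codec and integrate the gap over the shared quality range. A minimal sketch (the rate-distortion points below are hypothetical, not the paper's measurements):

```python
import numpy as np

def bd_rate(rate_anchor, q_anchor, rate_test, q_test):
    """Bjøntegaard-delta bitrate: average % bitrate change of the test
    codec vs. the anchor at equal quality (e.g. equal VMAF)."""
    p_a = np.polyfit(q_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(q_test, np.log(rate_test), 3)
    lo = max(min(q_anchor), min(q_test))      # overlapping quality range
    hi = min(max(q_anchor), max(q_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0   # negative = bitrate saving

# Hypothetical RD points: the test codec needs half the bitrate at each
# quality level, so BD-rate should come out near -50%.
q = np.array([70.0, 80.0, 90.0, 95.0])
r_anchor = np.array([1000.0, 2000.0, 4000.0, 8000.0])
bd = bd_rate(r_anchor, q, r_anchor / 2.0, q)
```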
[146] The Nonverbal Gap: Toward Affective Computer Vision for Safer and More Equitable Online Dating
Ratna Kandala, Niva Manchanda, Akshata Kishore Moharir
Main category: cs.CV
TL;DR: A vision paper proposing computer vision research agenda for online dating safety, focusing on detecting nonverbal cues like discomfort, engagement asymmetry, and consent-aware design to address safety gaps in current platforms.
Details
Motivation: Current online dating platforms strip away nonverbal cues (gaze, facial expression, body posture, timing) that humans rely on for signaling comfort, disinterest, and consent, creating safety gaps with disproportionate consequences for women. The computer vision community has developed affective tools but has largely ignored dating as a research context.
Method: Proposes a fairness-first research agenda with four capability areas: 1) real-time discomfort detection, 2) engagement asymmetry modeling between partners, 3) consent-aware interaction design, and 4) longitudinal interaction summarization. Grounded in established CV methodology (facial action unit detection, gaze estimation, engagement modeling, multimodal affect recognition) and social psychology of romantic communication.
Result: Vision paper calling for establishing online dating safety as a first-class research domain. Requires purpose-built datasets with dyadic consent protocols, fairness evaluation across demographic groups, and on-device processing to prevent affective data from becoming surveillance infrastructure.
Conclusion: The computer vision community has both technical opportunity and moral responsibility to address online dating safety gaps before commercial deployment outpaces ethical deliberation. WICV community members are uniquely positioned to lead this research.
Abstract: Online dating has become the dominant way romantic relationships begin, yet current platforms strip the nonverbal cues (gaze, facial expression, body posture, response timing) that humans rely on to signal comfort, disinterest, and consent, creating a communication gap with disproportionate safety consequences for women. We argue that this gap represents both a technical opportunity and a moral responsibility for the computer vision community, which has developed the affective tools (facial action unit detection, gaze estimation, engagement modeling, and multimodal affect recognition) needed to begin addressing it, yet has largely ignored the dating domain as a research context. We propose a fairness-first research agenda organized around four capability areas: real-time discomfort detection, engagement asymmetry modeling between partners, consent-aware interaction design, and longitudinal interaction summarization, each grounded in established CV methodology and motivated by the social psychology of romantic communication. We argue that responsible pursuit of this agenda requires purpose-built datasets collected under dyadic consent protocols, fairness evaluation disaggregated across race, gender identity, neurotype, and cultural background, and architectural commitments to on-device processing that prevent affective data from becoming platform surveillance infrastructure. This vision paper calls on the WICV community, whose members are uniquely positioned to understand both the technical opportunity and the human stakes, to establish online dating safety as a first-class research domain before commercial deployment outpaces ethical deliberation.
[147] Multi-view Graph Convolutional Network with Fully Leveraging Consistency via Granular-ball-based Topology Construction, Feature Enhancement and Interactive Fusion
Chengjie Cui, Taihua Xua, Shuyin Xia, Qinghua Zhang, Yun Cui, Shiping Wang
Main category: cs.CV
TL;DR: MGCN-FLC is a multi-view graph convolutional network that fully leverages three types of consistency (inter-node, inter-feature, inter-view) through granular ball topology construction, feature enhancement, and interactive fusion modules for improved semi-supervised node classification.
Details
Motivation: Existing GCN-based multi-view methods have limitations: 1) rely on artificial KNN topology construction with arbitrary k values, 2) overlook inter-feature consistency within views, and 3) fail to fully utilize inter-view consistency by fusing representations after intra-view operations rather than during learning.
Method: MGCN-FLC uses three modules: 1) Granular ball algorithm for topology construction to capture inter-node consistency by clustering nodes into balls with high internal similarity, 2) Feature enhancement module to capture inter-feature consistency, and 3) Interactive fusion module enabling deep interaction between all views to obtain comprehensive inter-view consistency.
Result: Experimental results on nine datasets show that MGCN-FLC outperforms state-of-the-art semi-supervised node classification methods.
Conclusion: The proposed MGCN-FLC effectively addresses limitations of existing GCN-based multi-view methods by fully leveraging three types of consistency through specialized modules, leading to superior performance in semi-supervised node classification tasks.
Abstract: The effective utilization of consistency is crucial for multi-view learning. GCNs leverage node connections to propagate information across the graph, facilitating the exploitation of consistency in multi-view data. However, most existing GCN-based multi-view methods suffer from several limitations. First, current approaches predominantly rely on KNN for topology construction, where the artificial selection of the k value significantly constrains the effective exploitation of inter-node consistency. Second, the inter-feature consistency within individual views is often overlooked, which adversely affects the quality of the final embedding representations. Moreover, these methods fail to fully utilize inter-view consistency as the fusion of embedded representations from multiple views is often implemented after the intra-view graph convolutional operation. Collectively, these issues limit the model’s capacity to fully capture inter-node, inter-feature and inter-view consistency. To address these issues, this paper proposes the multi-view graph convolutional network with fully leveraging consistency via GB-based topology construction, feature enhancement and interactive fusion (MGCN-FLC). MGCN-FLC can fully utilize three types of consistency via the following three modules to enhance learning ability: the topology construction module based on the granular ball algorithm, which clusters nodes into granular balls with high internal similarity to capture inter-node consistency; the feature enhancement module, which improves feature representations by capturing inter-feature consistency; and the interactive fusion module, which enables each view to deeply interact with all other views, thereby obtaining more comprehensive inter-view consistency. Experimental results on nine datasets show that the proposed MGCN-FLC outperforms state-of-the-art semi-supervised node classification methods.
[148] Contextual inference from single objects in Vision-Language models
Martina G. Vilas, Timothy Schaumlöffel, Gemma Roig
Main category: cs.CV
TL;DR: VLMs can infer scene context from single objects, with performance influenced by object properties similar to humans, but scene and superordinate context representations are partially dissociable and grounded differently in the model architecture.
Details
Motivation: Understanding how vision-language models (VLMs) organize and infer scene context from single objects, which has implications for model robustness and parallels with human scene perception.
Method: Systematic behavioral and mechanistic analysis presenting VLMs with single objects on masked backgrounds, probing inference of fine-grained scene categories and coarse superordinate context (indoor vs. outdoor).
Result: Single objects support above-chance inference at both levels, modulated by object properties that predict human scene categorization. Scene and superordinate predictions are partially dissociable, with different coupling across models. Object representations stable without background context predict successful inference. Scene identity is encoded throughout the network while superordinate information emerges late or not at all.
Conclusion: Contextual inference in VLMs is more complex than accuracy suggests, with distinct behavioral and mechanistic signatures for different levels of context, revealing fundamental differences in how scene and superordinate information are grounded.
Abstract: How much scene context a single object carries is a well-studied question in human scene perception, yet how this capacity is organized in vision-language models (VLMs) remains poorly understood, with direct implications for the robustness of these models. We investigate this question through a systematic behavioral and mechanistic analysis of contextual inference from single objects. Presenting VLMs with single objects on masked backgrounds, we probe their ability to infer both fine-grained scene category and coarse superordinate context (indoor vs. outdoor). We found that single objects support above-chance inference at both levels, with performance modulated by the same object properties that predict human scene categorization. Object identity, scene, and superordinate predictions are partially dissociable: accurate inference at one level neither requires nor guarantees accurate inference at the others, and the degree of coupling differs markedly across models. Mechanistically, object representations that remain stable when background context is removed are more predictive of successful contextual inference. Scene and superordinate schemas are grounded in fundamentally different ways: scene identity is encoded in image tokens throughout the network, while superordinate information emerges only late or not at all. Together, these results reveal that the organization of contextual inference in VLMs is more complex than accuracy alone suggests, with distinct behavioral and mechanistic signatures at each level of context.
[149] LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation
Shentong Mo, Sukmin Yun
Main category: cs.CV
TL;DR: LVRPO is a language-visual reinforcement preference optimization framework that uses Group Relative Policy Optimization to align multimodal representations through preference-driven reinforcement signals, enabling better multimodal understanding and generation without auxiliary components.
Details
Motivation: Existing unified multimodal pretraining approaches rely on implicit alignment signals and are suboptimal for simultaneously supporting multimodal understanding and generation, especially for fine-grained reasoning and controllable generation tasks.
Method: Proposes LVRPO framework using Group Relative Policy Optimization (GRPO) to directly optimize multimodal model behaviors through preference-driven reinforcement signals, aligning language and visual representations without additional alignment losses or auxiliary encoders.
Result: Empirically outperforms strong unified-pretraining baselines on a broad suite of benchmarks spanning multimodal understanding, generation, and reasoning tasks.
Conclusion: LVRPO provides an effective reinforcement-based approach for aligning multimodal representations that naturally extends to diverse multimodal capabilities and supports both understanding and generation tasks.
Abstract: Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or indirect alignment signals and remain suboptimal for simultaneously supporting multimodal understanding and generation, particularly in settings that require fine-grained language-visual reasoning and controllable generation. In this work, we propose LVRPO, a language-visual reinforcement-based preference optimization framework that explicitly aligns language and visual representations using Group Relative Policy Optimization (GRPO). Instead of introducing additional alignment losses at the representation level, LVRPO directly optimizes multimodal model behaviors through preference-driven reinforcement signals, encouraging consistent and semantically grounded interactions between language and vision across both understanding and generation tasks. This formulation enables effective alignment without requiring auxiliary encoders or handcrafted cross-modal objectives, and naturally extends to diverse multimodal capabilities. Empirically, LVRPO consistently outperforms strong unified-pretraining baselines on a broad suite of benchmarks spanning multimodal understanding, generation, and reasoning.
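GRPO's defining step, which LVRPO builds on, is normalizing each sampled response's reward against its own group rather than against a learned critic. A minimal sketch under that reading (toy reward values; `group_relative_advantages` is a hypothetical helper, not LVRPO's actual training code):

```python
# Group-relative advantage at the heart of GRPO: rewards for a group of
# sampled responses to the same prompt are normalized against the group's
# own mean and standard deviation, so no separate value network is needed.
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Normalize each reward against its group statistics."""
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

rewards = [0.2, 0.9, 0.5, 0.4]  # one group of sampled responses
advs = group_relative_advantages(rewards)
print([round(a, 3) for a in advs])  # advantages sum to ~0 by construction
```

The normalized advantages then weight the policy-gradient update; responses above the group mean are reinforced, those below are suppressed.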
[150] Distilled Large Language Model-Driven Dynamic Sparse Expert Activation Mechanism
Qinghui Chen, Zekai Zhang, Zaigui Zhang, Kai Zhang, Dagang Li, Wenmin Wang, Jinglin Zhang, Cong Liu
Main category: cs.CV
TL;DR: DS-MoE framework integrates LLM-driven sparse mixture-of-experts with text-guided dynamic routing for defect detection, achieving superior performance over vision-only models.
Details
Motivation: Address challenges in visual recognition including high inter-class similarity, extreme scale variation, and limited computational budgets, overcoming limitations of rigid fusion mechanisms and heavy annotation pipelines in existing approaches.
Method: Proposes Distilled LLM-Driven Sparse Mixture-of-Experts framework with text-guided dynamic routing and lightweight multi-scale comprehension. Uses sparse MoE architecture to dynamically align textual semantics with defect-specific visual patterns, adaptively activating task-relevant experts based on semantic relevance. Incorporates lightweight MobileSAM encoder for real-time inference while preserving multi-scale details.
Result: Extensive experiments on PCB, aluminum foil, and mold defect datasets show superior performance over pure vision models. DS-MoE surpasses YOLOv8/YOLOX with gains of +13.9, +1.4, and +2.0 pp mAP@0.5:0.95 on BBMP, aluminum, and PCB datasets respectively, while also improving precision and recall.
Conclusion: The DS-MoE framework effectively integrates cross-modal understanding through LLM-driven sparse MoE architecture, demonstrating strong performance in defect detection tasks with improved generalization and computational efficiency.
Abstract: High inter-class similarity, extreme scale variation, and limited computational budgets hinder reliable visual recognition across diverse real-world data. Existing vision-centric and cross-modal approaches often rely on rigid fusion mechanisms and heavy annotation pipelines, leading to sub-optimal generalization. We propose the Distilled Large Language Model (LLM)-Driven Sparse Mixture-of-Experts (DS-MoE) framework, which integrates text-guided dynamic routing and lightweight multi-scale comprehension. The DS-MoE framework dynamically aligns textual semantics with defect-specific visual patterns through a sparse MoE architecture, where task-relevant experts are adaptively activated based on semantic relevance, resolving inter-class ambiguity. A lightweight MobileSAM encoder enables real-time inference while preserving multi-scale defect details. Extensive experiments on PCB, aluminum foil, and mold defect datasets demonstrate that our framework achieves superior performance compared to existing pure vision models. DS-MoE surpasses YOLOv8/YOLOX with gains of +13.9, +1.4, and +2.0 pp mAP@0.5:0.95 on BBMP, aluminum, and PCB, respectively, while also improving precision and recall.
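The adaptive expert activation described above amounts to sparse top-k gating: score each expert's semantic relevance, keep the k best, and renormalize their weights. A toy sketch with made-up scores (the paper derives scores from text-visual alignment, which is not reproduced here):

```python
# Minimal sparse top-k expert routing: only the k highest-scoring experts
# receive nonzero gate weights; the rest are skipped entirely, which is
# where the computational savings of a sparse MoE come from.
import math

def topk_softmax_gate(scores, k=2):
    """Keep the k highest-scoring experts, softmax over them, zero the rest."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = {i: math.exp(scores[i]) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(scores))]

gates = topk_softmax_gate([0.1, 2.0, 0.5, 1.5], k=2)
print([round(g, 3) for g in gates])  # only experts 1 and 3 are active
```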
[151] Look, Compare and Draw: Differential Query Transformer for Automatic Oil Painting
Lingyu Liu, Yaxiong Wang, Li Zhu, Lizi Liao, Zhedong Zheng
Main category: cs.CV
TL;DR: DQ-Transformer uses differential image analysis and adversarial training for neural oil painting with dynamic, expressive brushstrokes, reducing duplicate strokes and improving realism.
Details
Motivation: To address the problem of duplicate and common-place brushstrokes in automatic oil painting that lead to less aesthetic outcomes, inspired by human painting process of observing, comparing, and drawing.
Method: Proposes Differential Query Transformer (DQ-Transformer) that incorporates differential image analysis with positional encoding to guide stroke prediction, maintaining sensitivity to local details, plus adversarial training for stroke prediction accuracy.
Result: DQ-Transformer surpasses existing methods in visual realism and artistic authenticity with fewer strokes, validated through qualitative evaluations and user study.
Conclusion: The approach enables more refined and nuanced stroke generation for automatic oil painting, achieving better results with differential image analysis and adversarial training.
Abstract: This work introduces a new approach to automatic oil painting that emphasizes the creation of dynamic and expressive brushstrokes. A pivotal challenge lies in mitigating the duplicate and common-place strokes, which often lead to less aesthetic outcomes. Inspired by the human painting process, i.e., observing, comparing, and drawing, we incorporate differential image analysis into a neural oil painting model, allowing the model to effectively concentrate on the incremental impact of successive brushstrokes. To operationalize this concept, we propose the Differential Query Transformer (DQ-Transformer), a new architecture that leverages differentially derived image representations enriched with positional encoding to guide the stroke prediction process. This integration enables the model to maintain heightened sensitivity to local details, resulting in more refined and nuanced stroke generation. Furthermore, we incorporate adversarial training into our framework, enhancing the accuracy of stroke prediction and thereby improving the overall realism and fidelity of the synthesized paintings. Extensive qualitative evaluations, complemented by a controlled user study, validate that our DQ-Transformer surpasses existing methods in both visual realism and artistic authenticity, typically achieving these results with fewer strokes. The stroke-by-stroke painting animations are available on our project website.
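The "observe, compare, draw" loop can be illustrated with the differential image itself: the per-pixel gap between the target and the current canvas, which tells the model where the next stroke matters most. A toy single-channel sketch (the actual DQ-Transformer feeds such representations, with positional encoding, into a Transformer, which is not reproduced here):

```python
# Differential image analysis in miniature: subtract the current canvas
# from the target, then locate the region with the largest residual --
# the place where the next brushstroke would have the most impact.
def differential_image(target, canvas):
    return [[t - c for t, c in zip(trow, crow)] for trow, crow in zip(target, canvas)]

def most_unfinished_pixel(diff):
    """Return (row, col) of the largest absolute residual."""
    return max(((r, c) for r in range(len(diff)) for c in range(len(diff[0]))),
               key=lambda rc: abs(diff[rc[0]][rc[1]]))

target = [[0.9, 0.1], [0.2, 0.8]]
canvas = [[0.9, 0.1], [0.2, 0.1]]  # bottom-right region still unpainted
diff = differential_image(target, canvas)
print(most_unfinished_pixel(diff))  # -> (1, 1)
```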
[152] Ordinal Semantic Segmentation Applied to Medical and Odontological Images
Mariana Dória Prata Lima, Gilson Antonio Giraldi, Jaime S. Cardoso
Main category: cs.CV
TL;DR: This paper investigates loss functions that incorporate ordinal relationships into deep neural networks for semantic segmentation to improve semantic consistency, with applications in medical imaging.
Details
Motivation: Current deep learning approaches for semantic segmentation achieve high accuracy but often ignore ordinal relationships among classes, which encode important domain knowledge for scene interpretation. Incorporating these ordinal relationships can improve semantic consistency.
Method: The study adapts loss functions originally proposed for ordinal classification to ordinal semantic segmentation, investigating three categories: unimodal losses (constrain predicted probability distribution according to class ordering), quasi-unimodal losses (relax constraints while preserving ordinal coherence), and spatial losses (penalize semantic inconsistencies between neighboring pixels). Specifically examines Expanded Mean Squared Error (EXP_MSE), Quasi-Unimodal Loss (QUL), and spatial Contact Surface Loss using Signal Distance Function (CSSDF).
Result: The approaches show promising results in medical imaging, improving robustness, generalization, and anatomical consistency.
Conclusion: Incorporating ordinal relationships through specialized loss functions enhances semantic segmentation by promoting greater semantic consistency, particularly valuable in domains like medical imaging where anatomical relationships are important.
Abstract: Semantic segmentation consists of assigning a semantic label to each pixel according to predefined classes. This process facilitates the understanding of object appearance and spatial relationships, playing an important role in the global interpretation of image content. Although modern deep learning approaches achieve high accuracy, they often ignore ordinal relationships among classes, which may encode important domain knowledge for scene interpretation. In this work, loss functions that incorporate ordinal relationships into deep neural networks are investigated to promote greater semantic consistency in semantic segmentation tasks. These loss functions are categorized as unimodal, quasi-unimodal, and spatial. Unimodal losses constrain the predicted probability distribution according to the class ordering, while quasi-unimodal losses relax this constraint by allowing small variations while preserving ordinal coherence. Spatial losses penalize semantic inconsistencies between neighboring pixels, encouraging smoother transitions in the image space. In particular, this study adapts loss functions originally proposed for ordinal classification to ordinal semantic segmentation. Among them, the Expanded Mean Squared Error (EXP_MSE), the Quasi-Unimodal Loss (QUL), and the spatial Contact Surface Loss using Signal Distance Function (CSSDF) are investigated. These approaches have shown promising results in medical imaging, improving robustness, generalization, and anatomical consistency.
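To give a flavor of what a unimodal constraint enforces, the sketch below scores how badly a predicted per-pixel distribution violates unimodality around the true class: probabilities should rise toward the true class and fall away from it. This is an illustrative penalty only; the exact forms of EXP_MSE, QUL, and CSSDF are not given in the abstract:

```python
# Illustrative unimodality-violation score for ordinal predictions:
# sums every increase in probability moving away from the true class
# (and every decrease moving toward it). Zero for a unimodal
# distribution peaked at the true class.
def unimodality_violation(probs, true_class):
    penalty = 0.0
    for i in range(len(probs) - 1):
        if i >= true_class:          # right of the mode: should decrease
            penalty += max(0.0, probs[i + 1] - probs[i])
        else:                        # left of the mode: should increase
            penalty += max(0.0, probs[i] - probs[i + 1])
    return penalty

print(unimodality_violation([0.1, 0.6, 0.2, 0.1], true_class=1))  # unimodal: 0.0
print(unimodality_violation([0.4, 0.1, 0.4, 0.1], true_class=1))  # bimodal: > 0
```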
[153] Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning
Guangfu Guo, Xiaoqian Lu, Yue Feng, Mingming Sun
Main category: cs.CV
TL;DR: SSV-CoT introduces structured sequential visual reasoning for multimodal LLMs by organizing visual attention from primary to secondary cues using saliency maps, enabling curriculum-like semantic progression without region annotations.
Details
Motivation: Current multimodal LLMs treat images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. The paper is inspired by human visual perception where attention shifts selectively from most informative regions to secondary cues.
Method: Proposes Structural Sequential Visual CoT (SSV-CoT): 1) Uses question-relevant saliency maps to identify and organize key visual regions, modeling spatial distribution of visual importance; 2) Performs reasoning following this discriminative order, creating curriculum-like semantic progression from primary to secondary cues. Trained end-to-end with text CoT and answer supervision, without region-level annotations or external tools.
Result: Experiments on diverse visual reasoning benchmarks show performance gains, validating the effectiveness of structured and sequential visual cognition.
Conclusion: SSV-CoT enables more human-like visual reasoning in multimodal LLMs by structuring visual attention sequentially from most to least informative regions, improving performance on visual reasoning tasks without requiring specialized annotations.
Abstract: Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention is selectively and sequentially shifted from the most informative regions to secondary cues, we propose Structural Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning is performed following this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. This method is trained end-to-end, using text CoT and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show performance gains, validating structured and sequential visual cognition.
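The "discriminative order" underlying SSV-CoT can be sketched as sorting candidate regions by question-relevant saliency, so reasoning visits primary cues before secondary ones. Toy scores and region names below; the paper derives saliency from a question-conditioned map, not shown here:

```python
# Order visual regions from most to least informative, inducing the
# curriculum-like primary-to-secondary progression described above.
def reasoning_order(regions, saliency):
    """Return region names sorted from primary to secondary cues."""
    return [r for r, _ in sorted(zip(regions, saliency), key=lambda p: -p[1])]

regions = ["scoreboard", "ball", "crowd"]   # hypothetical region labels
saliency = [0.7, 0.9, 0.2]                  # toy question-relevant scores
print(reasoning_order(regions, saliency))   # ball first, crowd last
```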
[154] Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering
Yanjie Zhang, Yafei Li, Rui Sheng, Zixin Chen, Yanna Lin, Huamin Qu, Lei Chen, Yushi Sun
Main category: cs.CV
TL;DR: ChartCynics is an agentic dual-path framework that detects deceptive charts by separating visual perception from data verification, using specialized pathways for structural anomaly detection and numerical grounding, with an agentic summarizer to resolve cross-modal conflicts.
Details
Motivation: Vision-Language Models struggle with misleading charts due to deceptive visual structures and distorted data representations, requiring specialized approaches to detect visual deception and ensure trustworthy chart interpretation.
Method: Dual-path framework with Diagnostic Vision Path (strategic ROI cropping for structural anomalies) and OCR-Driven Data Path (numerical grounding), plus Agentic Summarizer optimized via two-stage protocol: Oracle-Informed SFT for reasoning distillation and Deception-Aware GRPO for adversarial alignment.
Result: Achieves 74.43% and 64.55% accuracy on two benchmarks, providing ~29% absolute performance boost over Qwen3-VL-8B backbone, outperforming state-of-the-art proprietary models.
Conclusion: Specialized agentic workflows can grant smaller open-source models superior robustness for trustworthy chart interpretation, establishing new foundation for detecting visual deception in multimodal contexts.
Abstract: Despite the success of Vision-Language Models (VLMs), misleading charts remain a significant challenge due to their deceptive visual structures and distorted data representations. We present ChartCynics, an agentic dual-path framework designed to unmask visual deception via a “skeptical” reasoning paradigm. Unlike holistic models, ChartCynics decouples perception from verification: a Diagnostic Vision Path captures structural anomalies (e.g., inverted axes) through strategic ROI cropping, while an OCR-Driven Data Path ensures numerical grounding. To resolve cross-modal conflicts, we introduce an Agentic Summarizer optimized via a two-stage protocol: Oracle-Informed SFT for reasoning distillation and Deception-Aware GRPO for adversarial alignment. This pipeline effectively penalizes visual traps and enforces logical consistency. Evaluations on two benchmarks show that ChartCynics achieves 74.43% and 64.55% accuracy, providing an absolute performance boost of ~29% over the Qwen3-VL-8B backbone, outperforming state-of-the-art proprietary models. Our results demonstrate that specialized agentic workflows can grant smaller open-source models superior robustness, establishing a new foundation for trustworthy chart interpretation.
[155] SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model
Guifeng Deng, Pan Wang, Jiquan Wang, Shuying Rao, Junyi Xie, Wanjun Guo, Tao Li, Haiteng Jiang
Main category: cs.CV
TL;DR: SleepVLM: A vision-language model for interpretable sleep staging from PSG waveform images that generates clinician-readable rationales based on AASM scoring criteria.
Details
Motivation: Automated sleep staging has achieved expert-level accuracy but lacks auditable reasoning, hindering clinical adoption. There's a need for transparent, interpretable models that can generate clinician-readable rationales.
Method: Uses a rule-grounded vision-language model (VLM) with waveform-perceptual pre-training and rule-grounded supervised fine-tuning. Processes multi-channel polysomnography (PSG) waveform images and generates explanations based on American Academy of Sleep Medicine scoring criteria.
Result: Achieved Cohen’s kappa scores of 0.767 on held-out test set (MASS-SS1) and 0.743 on external cohort (ZUAMHCS), matching state-of-the-art performance. Expert evaluations gave mean scores >4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence.
Conclusion: SleepVLM couples competitive performance with transparent, rule-based explanations, potentially improving trustworthiness and auditability of automated sleep staging in clinical workflows. Releases MASS-EX dataset for interpretable sleep medicine research.
Abstract: While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We introduce SleepVLM, a rule-grounded vision-language model (VLM) designed to stage sleep from multi-channel polysomnography (PSG) waveform images while generating clinician-readable rationales based on American Academy of Sleep Medicine (AASM) scoring criteria. Utilizing waveform-perceptual pre-training and rule-grounded supervised fine-tuning, SleepVLM achieved Cohen’s kappa scores of 0.767 on a held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art performance. Expert evaluations further validated the quality of the model’s reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence. By coupling competitive performance with transparent, rule-based explanations, SleepVLM may improve the trustworthiness and auditability of automated sleep staging in clinical workflows. To facilitate further research in interpretable sleep medicine, we release MASS-EX, a novel expert-annotated dataset.
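Cohen's kappa, the agreement metric SleepVLM reports, corrects raw accuracy for the agreement expected by chance between two label sequences. A self-contained computation over a toy sleep-stage sequence (made-up labels, not MASS data):

```python
# Cohen's kappa: (observed agreement - chance agreement) / (1 - chance
# agreement), where chance agreement comes from each rater's marginal
# label frequencies.
from collections import Counter

def cohens_kappa(y_true, y_pred):
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    ct, cp = Counter(y_true), Counter(y_pred)
    expected = sum(ct[k] * cp.get(k, 0) for k in ct) / (n * n)
    return (observed - expected) / (1 - expected)

truth = ["W", "N1", "N2", "N2", "REM", "W"]
pred  = ["W", "N2", "N2", "N2", "REM", "W"]
print(round(cohens_kappa(truth, pred), 3))  # -> 0.76
```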
[156] TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark
Hannes Mareen, Dimitrios Karageorgiou, Paschalis Giakoumoglou, Peter Lambert, Symeon Papadopoulos, Glenn Van Wallendael
Main category: cs.CV
TL;DR: TGIF2 extends the text-guided inpainting forgery dataset with FLUX.1 models and random non-semantic masks to benchmark forensic methods against modern generative inpainting, revealing limitations in generalization, object bias, and vulnerability to enhancement attacks.
Details
Motivation: Existing benchmarks show that image forgery localization methods struggle with fully regenerated images from text-guided inpainting, while synthetic image detection methods cannot localize manipulations. With new generative models emerging, updated datasets and benchmarks are needed to evaluate forensic robustness against modern inpainting techniques.
Method: Extends TGIF dataset with edits generated by FLUX.1 models and adds random non-semantic masks. Conducts forensic evaluation spanning image forgery localization and synthetic image detection methods, including fine-tuning IFL methods on fully regenerated images and testing generative super-resolution attacks.
Result: Both IFL and SID methods degrade on FLUX.1 manipulations, showing limited generalization. Fine-tuning improves localization on fully regenerated images but reveals object bias when evaluated with random non-semantic masks. Generative super-resolution significantly weakens forensic traces, demonstrating that common image enhancement operations can undermine current forensic pipelines.
Conclusion: TGIF2 provides an updated dataset and benchmark that enables new insights into challenges posed by modern inpainting and AI-based image enhancements, highlighting the need for more robust forensic methods that can handle both localization in fully regenerated images and detection of synthetic content.
Abstract: Generative AI has made text-guided inpainting a powerful image editing tool, but at the same time a growing challenge for media forensics. Existing benchmarks, including our text-guided inpainting forgery (TGIF) dataset, show that image forgery localization (IFL) methods can localize manipulations in spliced images but struggle in fully regenerated (FR) images, while synthetic image detection (SID) methods can detect fully regenerated images but cannot perform localization. With new generative inpainting models emerging and the open problem of localization in FR images remaining, updated datasets and benchmarks are needed. We introduce TGIF2, an extended version of TGIF, that captures recent advances in text-guided inpainting and enables a deeper analysis of forensic robustness. TGIF2 augments the original dataset with edits generated by FLUX.1 models, as well as with random non-semantic masks. Using the TGIF2 dataset, we conduct a forensic evaluation spanning IFL and SID, including fine-tuning IFL methods on FR images and generative super-resolution attacks. Our experiments show that both IFL and SID methods degrade on FLUX.1 manipulations, highlighting limited generalization. Additionally, while fine-tuning improves localization on FR images, evaluation with random non-semantic masks reveals object bias. Furthermore, generative super-resolution significantly weakens forensic traces, demonstrating that common image enhancement operations can undermine current forensic pipelines. In summary, TGIF2 provides an updated dataset and benchmark, which enables new insights into the challenges posed by modern inpainting and AI-based image enhancements. TGIF2 is available at https://github.com/IDLabMedia/tgif-dataset.
[157] Language-Conditioned World Modeling for Visual Navigation
Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, Siyu Huang, Qi Dai, Zhi-Qi Cheng
Main category: cs.CV
TL;DR: LCVN introduces a language-conditioned visual navigation benchmark and two model families for open-loop trajectory prediction from natural language instructions and initial observations.
Details
Motivation: Language-conditioned visual navigation is challenging because agents must ground language to perception and continuous control without access to goal images, requiring joint study of language grounding, imagination, and policy learning.
Method: Two complementary model families: 1) LCVN-WM (diffusion-based world model) + LCVN-AC (actor-critic agent in latent space), and 2) LCVN-Uni (autoregressive multimodal architecture predicting both actions and future observations).
Result: LCVN-WM+AC provides more temporally coherent rollouts while LCVN-Uni generalizes better to unseen environments. The LCVN Dataset contains 39,016 trajectories with 117,048 human-verified instructions across diverse environments.
Conclusion: LCVN demonstrates the value of jointly studying language grounding, imagination, and policy learning, providing a concrete basis for investigating language-conditioned world models in embodied AI.
Abstract: We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future-state prediction, and action generation through two complementary model families. The first family combines LCVN-WM, a diffusion-based world model, with LCVN-AC, an actor-critic agent trained in the latent space of the world model. The second family, LCVN-Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language-conditioned world models. The code is available at https://github.com/F1y1113/LCVN.
[158] Steering Sparse Autoencoder Latents to Control Dynamic Head Pruning in Vision Transformers (Student Abstract)
Yousung Lee, Dongsoo Har
Main category: cs.CV
TL;DR: SAE-based dynamic head pruning in ViTs uses sparse autoencoders to create interpretable, class-specific pruning policies that maintain accuracy while reducing computational cost.
Details
Motivation: Existing dynamic head pruning methods in Vision Transformers lack interpretability and control. The authors aim to bridge pruning efficiency with mechanistic interpretability by leveraging sparse representations.
Method: Train Sparse Autoencoders (SAEs) on final-layer residual embeddings of ViTs, then amplify sparse latents with different strategies to alter pruning decisions. Per-class steering identifies compact, class-specific head subsets.
Result: Achieves improved accuracy (76% to 82%) while reducing head usage (0.72 to 0.33) for specific classes like “bowl” using heads h2 and h5. Shows sparse latent features enable class-specific control of dynamic pruning.
Conclusion: Sparse latent features effectively bridge pruning efficiency and mechanistic interpretability in Vision Transformers, enabling interpretable and controllable dynamic head pruning.
Abstract: Dynamic head pruning in Vision Transformers (ViTs) improves efficiency by removing redundant attention heads, but existing pruning policies are often difficult to interpret and control. In this work, we propose a novel framework by integrating Sparse Autoencoders (SAEs) with dynamic pruning, leveraging their ability to disentangle dense embeddings into interpretable and controllable sparse latents. Specifically, we train an SAE on the final-layer residual embedding of the ViT and amplify the sparse latents with different strategies to alter pruning decisions. Among them, per-class steering reveals compact, class-specific head subsets that preserve accuracy. For example, steering for the class bowl improves accuracy (76% to 82%) while reducing head usage (0.72 to 0.33) via heads h2 and h5. These results show that sparse latent features enable class-specific control of dynamic pruning, effectively bridging pruning efficiency and mechanistic interpretability in ViTs.
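The steering operation itself can be sketched as: encode an embedding into sparse latents, amplify one latent, and decode. The weights below are hypothetical toy matrices, not a trained SAE, and the decode step stands in for however the amplified latents reach the pruning policy:

```python
# Toy SAE latent steering: ReLU encoder produces sparse latents, one
# latent is scaled by a steering coefficient, and the decoder maps the
# steered latents back to the embedding space.
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]   # 3 latents from 2 dims (toy)
W_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]     # decode back to 2 dims (toy)

def steer(x, latent_idx, alpha):
    """Amplify one SAE latent by alpha before decoding."""
    z = relu(matvec(W_enc, x))
    z[latent_idx] *= alpha
    return matvec(W_dec, z)

print(steer([0.5, 0.2], latent_idx=0, alpha=3.0))  # steered embedding
```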
[159] SonoWorld: From One Image to a 3D Audio-Visual Scene
Derong Jin, Xiyi Chen, Ming C. Lin, Ruohan Gao
Main category: cs.CV
TL;DR: Image2AVScene: Generating 3D audio-visual scenes from single images using SonoWorld framework for spatial audio aligned with scene geometry and semantics.
Details
Motivation: While visual scene generation from single images has advanced significantly, creating immersive 3D experiences requires both visual and audio components. Current methods focus on visual generation but lack spatial audio that aligns with scene geometry and semantics.
Method: SonoWorld framework: 1) Outpaints 360° panorama from single image, 2) Lifts to navigable 3D scene, 3) Places language-guided sound anchors, 4) Renders ambisonics for point, areal, and ambient sources, ensuring spatial audio alignment with scene geometry.
Result: Quantitative evaluations on newly curated real-world dataset and controlled user study confirm effectiveness. Applications demonstrated for free-viewpoint audio-visual rendering, one-shot acoustic learning, and audio-visual spatial source separation.
Conclusion: Image2AVScene task and SonoWorld framework successfully generate immersive 3D audio-visual scenes from single images, bridging the gap between visual scene generation and spatial audio synthesis for complete multimodal experiences.
Abstract: Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/
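For context on the audio format SonoWorld renders: first-order ambisonics encodes a mono source at a given direction into four channels (W, X, Y, Z). The textbook encoding below uses SN3D-style gains and is generic ambisonics background, not SonoWorld's pipeline:

```python
# First-order ambisonic encoding of one mono sample: W carries the
# omnidirectional component; X, Y, Z carry direction via trigonometric
# gains on azimuth and elevation.
import math

def foa_encode(sample, azimuth, elevation):
    """Encode one mono sample into first-order ambisonics (SN3D-style gains)."""
    w = sample
    x = sample * math.cos(azimuth) * math.cos(elevation)
    y = sample * math.sin(azimuth) * math.cos(elevation)
    z = sample * math.sin(elevation)
    return w, x, y, z

# A source directly in front (azimuth 0, elevation 0) excites W and X only.
print(foa_encode(1.0, 0.0, 0.0))
```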
[160] CNMBI: Determining the Number of Clusters Using Center Pairwise Matching and Boundary Filtering
Ruilin Zhang, Haiyang Zheng, Hongpeng Wang
Main category: cs.CV
TL;DR: CNMBI is a novel approach for determining optimal cluster numbers without prior information, using dynamic comparison of cluster centers via bipartite graph theory and active removal of low-confidence samples.
Details
Motivation: Existing methods for determining optimal cluster numbers rely on cluster validation with assumptions about data distribution, limiting their application to complex real-world data like large-scale images and high-dimensional data.
Method: CNMBI leverages inherent data distribution information to map clustering as a dynamic comparison process between cluster centers regarding positional behavior. It uses bipartite graph theory to model this process efficiently and actively removes low-confidence samples based on their confidence levels.
Result: Extensive comparisons with state-of-the-art methods on various challenging datasets (including CIFAR-10 and STL-10) demonstrate CNMBI’s superiority. The method shows robustness and flexibility with different data dimensions and shapes.
Conclusion: CNMBI provides an effective approach for determining optimal cluster numbers without prior information, overcoming limitations of traditional validation-based methods and handling complex real-world data better.
Abstract: One of the main challenges in data mining is choosing the optimal number of clusters without prior information. Notably, existing methods usually follow the philosophy of cluster validation and hence make underlying assumptions about the data distribution, which prevents their application to complex real-world data such as large-scale images and high-dimensional data. In this regard, we propose an approach named CNMBI. Leveraging the distribution information inherent in the data space, we cast the target task as a dynamic comparison between cluster centers regarding their positional behavior, without relying on complete clustering results or designing a complex validity index as before. Bipartite graph theory is then employed to efficiently model this process. Additionally, we find that different samples have different confidence levels and thereby actively remove low-confidence ones, which is, for the first time to our knowledge, considered in cluster number determination. CNMBI is robust and allows for more flexibility in the dimension and shape of the target data (e.g., CIFAR-10 and STL-10). Extensive comparison studies with state-of-the-art competitors on various challenging datasets demonstrate the superiority of our method.
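The "center pairwise matching" idea can be illustrated as minimum-cost bipartite matching between two sets of candidate cluster centers: if centers from two runs pair up at small total distance, the cluster count is positionally stable. The brute-force enumeration below is for clarity only; the paper uses bipartite graph machinery, and this interpretation is an assumption from the abstract:

```python
# Minimum-cost pairing of two equal-sized sets of cluster centers,
# enumerated over all permutations (fine for small center counts).
from itertools import permutations

def match_cost(centers_a, centers_b):
    """Minimum total Euclidean distance over all pairings (equal sizes)."""
    def dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5
    return min(sum(dist(a, b) for a, b in zip(centers_a, perm))
               for perm in permutations(centers_b))

run1 = [(0.0, 0.0), (5.0, 5.0)]
run2 = [(5.1, 4.9), (0.1, -0.1)]  # same two centers, shuffled and jittered
print(round(match_cost(run1, run2), 3))  # small cost -> stable centers
```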
[161] Motion Semantics Guided Normalizing Flow for Privacy-Preserving Video Anomaly Detection
Yang Liu, Boan Chen, Yuanyuan Meng, Jing Liu, Zhengliang Guo, Wei Zhou, Peng Sun, Hong Chen
Main category: cs.CV
TL;DR: MSG-Flow: A hierarchical motion semantics approach for skeleton-based video anomaly detection using vector quantization, Transformers, and normalizing flows.
Details
Motivation: Need for privacy-preserving human activity understanding in embodied multimedia systems. Existing skeleton-based methods fail to capture the hierarchical nature of human activities (discrete semantic primitives + fine-grained kinematic details), reducing discriminability for anomalies at different abstraction levels.
Method: Motion Semantics Guided Normalizing Flow (MSG-Flow) decomposes skeleton-based VAD into hierarchical motion semantics modeling: 1) Vector quantized variational auto-encoder discretizes continuous motion into interpretable primitives, 2) Autoregressive Transformer models semantic-level temporal dependencies, 3) Conditional normalizing flow captures detail-level pose variations.
Result: Achieves state-of-the-art performance on HR-ShanghaiTech (88.1% AUC) and HR-UBnormal (75.8% AUC) benchmarks.
Conclusion: MSG-Flow effectively models hierarchical motion semantics for skeleton-based video anomaly detection, offering privacy-preserving approach with improved discriminability for anomalies at different abstraction levels.
Abstract: As embodied perception systems increasingly bridge digital and physical realms in interactive multimedia applications, the need for privacy-preserving approaches to understand human activities in physical environments has become paramount. Video anomaly detection is a critical task in such embodied multimedia systems for intelligent surveillance and forensic analysis. Skeleton-based approaches have emerged as a privacy-preserving alternative that processes physical world information through abstract human pose representations while discarding sensitive visual attributes such as identity and facial features. However, existing skeleton-based methods predominantly model continuous motion trajectories in a monolithic manner, failing to capture the hierarchical nature of human activities composed of discrete semantic primitives and fine-grained kinematic details, which leads to reduced discriminability when anomalies manifest at different abstraction levels. In this regard, we propose Motion Semantics Guided Normalizing Flow (MSG-Flow) that decomposes skeleton-based VAD into hierarchical motion semantics modeling. It employs vector quantized variational auto-encoder to discretize continuous motion into interpretable primitives, an autoregressive Transformer to model semantic-level temporal dependencies, and a conditional normalizing flow to capture detail-level pose variations. Extensive experiments on benchmarks (HR-ShanghaiTech & HR-UBnormal) demonstrate that MSG-Flow achieves state-of-the-art performance with 88.1% and 75.8% AUC respectively.
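The vector-quantization step named in the method can be sketched in a few lines. This is a toy illustration of how a VQ-VAE codebook turns continuous motion features into discrete primitive IDs, not the authors' code; the codebook and trajectory values are made up:

```python
# Illustrative sketch (not MSG-Flow's implementation): the quantization step
# of a VQ-VAE maps each continuous motion feature to its nearest codebook
# entry, turning a pose trajectory into a sequence of discrete primitive IDs.

def quantize(feature, codebook):
    """Return (index, codeword) of the nearest codebook entry under L2 distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))
    return idx, codebook[idx]

# Hypothetical codebook of 3 motion "primitives" in a 2-D feature space.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
trajectory = [[0.1, -0.2], [0.9, 1.1], [-0.8, 0.6]]
ids = [quantize(f, codebook)[0] for f in trajectory]
```

The resulting ID sequence is what the autoregressive Transformer models at the semantic level; the conditional flow then accounts for the residual, detail-level pose variation.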
[162] TDEC: Deep Embedded Image Clustering with Transformer and Distribution Information
Ruilin Zhang, Haiyang Zheng, Hongpeng Wang
Main category: cs.CV
TL;DR: TDEC is a deep embedded image clustering method that combines Transformer-based feature learning with dimensionality reduction and distribution-aware clustering for improved performance on complex image datasets.
Details
Motivation: Existing deep clustering methods often ignore global dependency information between image regions and produce clustering-unfriendly high-dimensional features based only on simple distance metrics, limiting performance on complex images.
Method: Proposes TDEC with three key components: 1) T-Encoder using Transformer to learn discriminative features with global dependencies, 2) Dim-Reduction block to create clustering-friendly low-dimensional space, and 3) distribution-aware clustering that considers embedded feature distributions for reliable supervision.
Result: TDEC achieves significantly higher clustering performance than recent state-of-the-art approaches on complex datasets, demonstrating robustness across different data sizes, cluster numbers, and context complexities.
Conclusion: The joint consideration of feature representation, dimensional preference, and robust assignment in TDEC provides a superior deep embedded image clustering framework that effectively handles complex image data.
Abstract: Image clustering is a crucial but challenging task in multimedia machine learning. Recently the combination of clustering with deep learning has achieved promising performance against conventional methods on high-dimensional image data. Unfortunately, existing deep clustering methods (DC) often ignore the importance of information fusion with a global perception field among different image regions on clustering images, especially complex ones. Additionally, the learned features are usually clustering-unfriendly in terms of dimensionality and are based only on simple distance information for the clustering. In this regard, we propose a deep embedded image clustering TDEC, which for the first time to our knowledge, jointly considers feature representation, dimensional preference, and robust assignment for image clustering. Specifically, we introduce the Transformer to form a novel module T-Encoder to learn discriminative features with global dependency while using the Dim-Reduction block to build a friendly low-dimensional space favoring clustering. Moreover, the distribution information of embedded features is considered in the clustering process to provide reliable supervised signals for joint training. Our method is robust and allows for more flexibility in data size, the number of clusters, and the context complexity. More importantly, the clustering performance of TDEC is much higher than recent competitors. Extensive experiments with state-of-the-art approaches on complex datasets show the superiority of TDEC.
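The abstract does not spell out TDEC's distribution-aware supervision, but the classic instance of this idea (from DEC) is a Student-t soft assignment whose sharpened target distribution serves as the training signal. A hedged sketch under that assumption, with toy embeddings:

```python
# Hedged sketch: one standard way to derive "distribution-aware" supervision
# for embedded clustering (the DEC formulation); TDEC's exact variant may differ.
# q[i][j]: Student-t similarity of embedding i to center j; p: sharpened target.

def soft_assign(z, centers, alpha=1.0):
    q = []
    for zi in z:
        row = [(1.0 + sum((a - b) ** 2 for a, b in zip(zi, c)) / alpha)
               ** (-(alpha + 1.0) / 2.0) for c in centers]
        s = sum(row)
        q.append([v / s for v in row])
    return q

def target_distribution(q):
    f = [sum(qi[j] for qi in q) for j in range(len(q[0]))]  # soft cluster sizes
    p = []
    for qi in q:
        row = [qi[j] ** 2 / f[j] for j in range(len(qi))]
        s = sum(row)
        p.append([v / s for v in row])
    return p

z = [[0.0, 0.1], [0.1, 0.0], [2.0, 2.1]]      # toy 2-D embeddings
centers = [[0.0, 0.0], [2.0, 2.0]]            # toy cluster centers
q = soft_assign(z, centers)
p = target_distribution(q)  # sharper than q: usable as a supervised signal
```

Training then minimizes KL(p || q), pulling embeddings toward high-confidence assignments.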
[163] Scaling Spatial Intelligence with Multimodal Foundation Models
Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
Main category: cs.CV
TL;DR: SenseNova-SI family scales up multimodal foundation models to cultivate spatial intelligence through systematic curation of 8M diverse data samples, achieving state-of-the-art performance on spatial benchmarks while maintaining strong general multimodal understanding.
Details
Motivation: Despite progress in multimodal foundation models, they still exhibit surprising deficiencies in spatial intelligence. The authors aim to address this gap by scaling up models to cultivate robust spatial understanding capabilities.
Method: Built upon established multimodal foundations (Qwen3-VL, InternVL3, Bagel), the authors systematically curated SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. They take a principled approach to constructing high-performing spatial intelligence models.
Result: SenseNova-SI demonstrates unprecedented performance across spatial intelligence benchmarks: 68.8% on VSI-Bench, 43.3% on MMSI, 85.7% on MindCube, 54.7% on ViewSpatial, 47.7% on SITE, 63.9% on BLINK, 55.5% on 3DSR, and 72.0% on EmbSpatial, while maintaining strong general multimodal understanding (84.9% on MMBench-En).
Conclusion: The work successfully cultivates spatial intelligence in multimodal foundation models through systematic data scaling, shows emergent generalization capabilities, analyzes risks of overfitting and language shortcuts, presents spatial chain-of-thought reasoning, and validates downstream applications. All models are publicly released.
Abstract: Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.8% on VSI-Bench, 43.3% on MMSI, 85.7% on MindCube, 54.7% on ViewSpatial, 47.7% on SITE, 63.9% on BLINK, 55.5% on 3DSR, and 72.0% on EmbSpatial, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. All newly trained multimodal foundation models are publicly released.
[164] From Diffusion To Flow: Efficient Motion Generation In MotionGPT3
Jaymin Ban, JiHong Jeon, SangYeop Jeong
Main category: cs.CV
TL;DR: Rectified flow objectives outperform diffusion in continuous-latent text-to-motion generation within MotionGPT3 framework, offering better convergence, efficiency, and competitive quality.
Details
Motivation: While rectified flow has shown advantages over diffusion in image and audio generation, it is unclear whether these benefits transfer to motion generation. This work aims to empirically compare diffusion vs rectified flow objectives in the MotionGPT3 framework for text-to-motion generation.
Method: Conducted controlled empirical study comparing diffusion and rectified flow objectives within MotionGPT3 framework. Held model architecture, training protocol, and evaluation setup fixed to isolate effect of generative objective. Used HumanML3D dataset for experiments.
Result: Rectified flow converges in fewer training epochs, reaches strong test performance earlier, and matches or exceeds diffusion-based motion quality. Flow-based priors show stable behavior across inference step counts and achieve competitive quality with fewer sampling steps, yielding better efficiency-quality trade-offs.
Conclusion: Benefits of rectified flow objectives extend to continuous-latent text-to-motion generation, highlighting importance of training objective choice in motion priors.
Abstract: Recent text-driven motion generation methods span both discrete token-based approaches and continuous-latent formulations. MotionGPT3 exemplifies the latter paradigm, combining a learned continuous motion latent space with a diffusion-based prior for text-conditioned synthesis. While rectified flow objectives have recently demonstrated favorable convergence and inference-time properties relative to diffusion in image and audio generation, it remains unclear whether these advantages transfer cleanly to the motion generation setting. In this work, we conduct a controlled empirical study comparing diffusion and rectified flow objectives within the MotionGPT3 framework. By holding the model architecture, training protocol, and evaluation setup fixed, we isolate the effect of the generative objective on training dynamics, final performance, and inference efficiency. Experiments on the HumanML3D dataset show that rectified flow converges in fewer training epochs, reaches strong test performance earlier, and matches or exceeds diffusion-based motion quality under identical conditions. Moreover, flow-based priors exhibit stable behavior across a wide range of inference step counts and achieve competitive quality with fewer sampling steps, yielding improved efficiency–quality trade-offs. Overall, our results suggest that several known benefits of rectified flow objectives do extend to continuous-latent text-to-motion generation, highlighting the importance of the training objective choice in motion priors.
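The rectified-flow objective under comparison admits a compact sketch. This illustrates the general training target and why flows can sample in few steps, not MotionGPT3's implementation; the oracle velocity below stands in for a trained model:

```python
# Minimal sketch of the rectified-flow recipe (not MotionGPT3's code):
# interpolate x_t = (1 - t) * x0 + t * x1 between noise x0 and data x1, and
# regress the model output toward the constant velocity x1 - x0.

def rf_training_pair(x0, x1, t):
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

def euler_sample(x0, velocity_fn, steps=4):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# With the oracle constant velocity, a handful of Euler steps recover the
# data point exactly, illustrating the favorable efficiency-quality trade-off
# the paper reports for flow-based priors at low step counts.
x0, x1 = [0.0, 0.0], [1.0, -2.0]
oracle = lambda x, t: [1.0, -2.0]          # v = x1 - x0
x_hat = euler_sample(x0, oracle, steps=4)  # -> [1.0, -2.0]
```

In practice the velocity field is a learned network conditioned on text; the straight-line target is what distinguishes this objective from diffusion's noise-prediction losses.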
[165] PAVAS: Physics-Aware Video-to-Audio Synthesis
Oh Hyun-Bin, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji
Main category: cs.CV
TL;DR: PAVAS introduces physics-aware video-to-audio generation using physical parameter estimation and a physics-driven adapter to create more realistic sounds based on object properties and motion.
Details
Motivation: Current V2A models focus on appearance-driven correlations but ignore the physical factors that shape real-world sounds, leading to less realistic audio generation.
Method: Uses Physics-Driven Audio Adapter (Phy-Adapter) with physical parameters from Physical Parameter Estimator (PPE), which employs VLM for mass estimation and segmentation-based 3D reconstruction for motion trajectory and velocity computation.
Result: Outperforms existing V2A models in quantitative and qualitative evaluations, producing physically plausible and perceptually coherent audio.
Conclusion: Incorporating physical reasoning into V2A generation significantly improves audio realism and physical consistency.
Abstract: Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Visit https://physics-aware-video-to-audio-synthesis.github.io for demo videos.
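The abstract defines APCC only as a consistency measure between physical and auditory attributes, not its formula. As a hedged illustration, a plain Pearson correlation between paired attribute series (the attribute names and values below are assumptions, not the paper's) conveys the idea:

```python
import math

# Hedged sketch: APCC's exact definition is not given in the abstract; a
# Pearson correlation between a physical attribute and an auditory attribute
# per clip is one plausible reading of "consistency between physical and
# auditory attributes". All values below are hypothetical.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

impact_velocity = [0.5, 1.0, 2.0, 4.0]   # hypothetical physical attribute (m/s)
onset_loudness  = [0.2, 0.4, 0.9, 1.8]   # hypothetical auditory attribute
apcc_like = pearson(impact_velocity, onset_loudness)  # near 1: consistent
```

A physically plausible V2A model should score high on such a measure: faster impacts should sound louder.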
[166] LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization
Chutian Meng, Fan Ma, Chi Zhang, Jiaxu Miao, Yi Yang, Yueting Zhuang
Main category: cs.CV
TL;DR: LogiStory: A logic-aware framework for multi-image story visualization that explicitly models visual logic to improve narrative coherence in generated image sequences.
Details
Motivation: Current multimodal systems struggle with maintaining logical flow in visual sequence generation, resulting in disjointed actions, fragmented narratives, and unclear storylines. The paper identifies a lack of attention to visual logic as the core problem.
Method: Proposes LogiStory framework with explicit visual logic modeling using a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency. Also introduces LogicTale benchmark for evaluation.
Result: Experiments show significant improvement in narrative logic of generated visual stories. The approach effectively bridges structured story planning with visual generation.
Conclusion: Provides foundational step toward modeling visual logic in general image sequence and video generation tasks, transforming narrative coherence from implicit byproduct to explicit modeling objective.
Abstract: Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose a logic-aware multi-image story visualization framework, LogiStory. The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design effectively bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization. Furthermore, to evaluate the generation capacity, we construct LogicTale, a benchmark comprising richly annotated stories, emphasizing causal reasoning, and visual logic interpretability. We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality. Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.
[167] Survey on Remote Sensing Scene Classification: From Traditional Methods to Large Generative AI Models
Qionghao Huang, Can Hu
Main category: cs.CV
TL;DR: Comprehensive survey of remote sensing scene classification evolution from traditional methods to modern AI systems including deep learning, foundation models, and generative AI approaches.
Details
Motivation: To systematically trace and analyze the complete methodological evolution of remote sensing scene classification, from classical approaches to cutting-edge AI systems, and identify future research directions.
Method: Survey methodology examining the historical development through literature review, covering classical texture descriptors, machine learning classifiers, deep learning (CNNs, Vision Transformers, GNNs), foundation models, vision-language systems, and generative AI approaches.
Result: Comprehensive analysis showing the paradigmatic transformation from handcrafted features to sophisticated AI systems, highlighting breakthrough developments in self-supervised foundation models, vision-language systems, and generative AI for tackling annotation costs and data challenges.
Conclusion: Remote sensing scene classification has evolved dramatically, with future priorities including advancing hyperspectral/multi-temporal analysis, developing robust cross-domain generalization methods, and establishing standardized evaluation protocols.
Abstract: Remote sensing scene classification has experienced a paradigmatic transformation from traditional handcrafted feature methods to sophisticated artificial intelligence systems that now form the backbone of modern Earth observation applications. This comprehensive survey examines the complete methodological evolution, systematically tracing development from classical texture descriptors and machine learning classifiers through the deep learning revolution to current state-of-the-art foundation models and generative AI approaches. We chronicle the pivotal shift from manual feature engineering to automated hierarchical representation learning via convolutional neural networks, followed by advanced architectures including Vision Transformers, graph neural networks, and hybrid frameworks. The survey provides in-depth coverage of breakthrough developments in self-supervised foundation models and vision-language systems, highlighting exceptional performance in zero-shot and few-shot learning scenarios. Special emphasis is placed on generative AI innovations that tackle persistent challenges through synthetic data generation and advanced feature learning strategies. We analyze contemporary obstacles including annotation costs, multimodal data fusion complexities, interpretability demands, and ethical considerations, alongside current trends in edge computing deployment, federated learning frameworks, and sustainable AI practices. Based on comprehensive analysis of recent advances and gaps, we identify key future research priorities: advancing hyperspectral and multi-temporal analysis capabilities, developing robust cross-domain generalization methods, and establishing standardized evaluation protocols to accelerate scientific progress in remote sensing scene classification systems.
[168] Deep Learning Aided Vision System for Planetary Rovers
Lomash Relia, Jai G Singla, Amitabh, Nitant Dube
Main category: cs.CV
TL;DR: A vision system for planetary rovers combining real-time perception (stereo imagery, object detection, distance estimation) with offline terrain reconstruction (monocular depth estimation, point cloud fusion) for autonomous exploration.
Details
Motivation: To develop a scalable, compute-efficient vision solution for autonomous planetary exploration that combines real-time perception capabilities with detailed offline terrain reconstruction for rover navigation and analysis.
Method: Two-module system: 1) Real-time module uses CLAHE-enhanced stereo imagery, YOLOv11n object detection, and a neural network for distance estimation; 2) Offline module uses Depth Anything V2 for monocular depth estimation and Open3D for point cloud fusion.
Result: Neural network achieves median depth error of 2.26 cm within 1-10 meter range on Chandrayaan 3 NavCam data; object detection maintains balanced precision-recall on lunar scenes; system provides reliable metric context alongside qualitative reconstructions.
Conclusion: The architecture offers a scalable, compute-efficient vision solution for autonomous planetary exploration by combining real-time perception with detailed offline reconstruction capabilities.
Abstract: This study presents a vision system for planetary rovers, combining real-time perception with offline terrain reconstruction. The real-time module integrates CLAHE enhanced stereo imagery, YOLOv11n based object detection, and a neural network to estimate object distances. The offline module uses the Depth Anything V2 metric monocular depth estimation model to generate depth maps from captured images, which are fused into dense point clouds using Open3D. Real world distance estimates from the real time pipeline provide reliable metric context alongside the qualitative reconstructions. Evaluation on Chandrayaan 3 NavCam stereo imagery, benchmarked against a CAHV based utility, shows that the neural network achieves a median depth error of 2.26 cm within a 1 to 10 meter range. The object detection model maintains a balanced precision recall tradeoff on grayscale lunar scenes. This architecture offers a scalable, compute-efficient vision solution for autonomous planetary exploration.
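The headline metric, median depth error within a 1-10 m range, is straightforward to reproduce. A minimal sketch with made-up depth values (the paper's evaluation uses Chandrayaan 3 NavCam stereo data against a CAHV-based reference):

```python
# Illustrative sketch of the reported metric: median absolute depth error,
# restricted to reference depths in the 1-10 m range. The depth values below
# are hypothetical, not from the NavCam evaluation.

def median_depth_error(pred_m, ref_m, lo=1.0, hi=10.0):
    errs = sorted(abs(p - r) for p, r in zip(pred_m, ref_m) if lo <= r <= hi)
    n = len(errs)
    if n == 0:
        return None
    mid = n // 2
    return errs[mid] if n % 2 else 0.5 * (errs[mid - 1] + errs[mid])

pred = [1.52, 3.01, 7.90, 12.4]   # hypothetical predicted depths (m)
ref  = [1.50, 3.05, 7.95, 12.0]   # reference depths; 12.0 m falls out of range
err_m = median_depth_error(pred, ref)  # median over the 3 in-range points
```

Restricting to the 1-10 m band matters because stereo and monocular depth errors grow rapidly with distance, so an unrestricted median would mix regimes.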
[169] Generating Synthetic Wildlife Health Data from Camera Trap Imagery: A Pipeline for Alopecia and Body Condition Training Data
David Brundage
Main category: cs.CV
TL;DR: Pipeline generates synthetic wildlife health condition images from camera trap photos for ML training, achieving 0.85 AUROC in real-world screening tests.
Details
Motivation: No ML-ready datasets exist for wildlife health conditions in camera trap imagery, creating a barrier to automated health screening. Need synthetic training data to enable wildlife health monitoring.
Method: Pipeline uses real camera trap photos from iWildCam, applies MegaDetector bounding boxes and stratified sampling across 8 species. Generative phenotype editing creates severity variants for hair loss (mange) and emaciation. Adaptive scene drift QC uses sham prefilter and decoupled mask/score approach with day/night metrics to reject altered scenes.
Result: From 201 base images across 4 species, generated 553 QC-passing synthetic variants with 83% pass rate. Sim-to-real transfer experiment training only on synthetic data achieved 0.85 AUROC on real camera trap images of suspected health conditions.
Conclusion: Synthetic data pipeline successfully captures visual features sufficient for wildlife health screening, demonstrating practical utility for automated monitoring where real labeled data is scarce.
Abstract: No publicly available, ML ready datasets exist for wildlife health conditions in camera trap imagery, creating a fundamental barrier to automated health screening. We present a pipeline for generating synthetic training images depicting alopecia and body condition deterioration in wildlife from real camera trap photographs. Our pipeline constructs a curated base image set from iWildCam using MegaDetector derived bounding boxes and center frame weighted stratified sampling across 8 North American species. A generative phenotype editing system produces controlled severity variants depicting hair loss consistent with mange and emaciation. An adaptive scene drift quality control system uses a sham prefilter and decoupled mask then score approach with complementary day or night metrics to reject images where the generative model altered the original scene. We frame the pipeline explicitly as a screening data source. From 201 base images across 4 species, we generate 553 QC passing synthetic variants with an overall pass rate of 83 percent. A sim to real transfer experiment training exclusively on synthetic data and testing on real camera trap images of suspected health conditions achieves 0.85 AUROC, demonstrating that the synthetic data captures visual features sufficient for screening.
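The reported 0.85 AUROC has a direct probabilistic reading: the chance a randomly chosen positive (suspected-condition) image is scored above a randomly chosen negative one. A sketch via the Mann-Whitney formulation, with toy scores:

```python
# Sketch of how an AUROC like the reported 0.85 is computed: the Mann-Whitney
# statistic over all positive/negative score pairs (ties count half).
# The scores and labels below are toy values, not the paper's data.

def auroc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]   # hypothetical classifier scores
labels = [1,   1,   1,    0,   0,   0]     # 1 = suspected health condition
value = auroc(scores, labels)
```

Because AUROC is threshold-free, it is a natural choice for a screening setting where the operating point is tuned later.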
[170] Unblur-SLAM: Dense Neural SLAM for Blurry Inputs
Qi Zhang, Denis Rozumny, Francesco Girlanda, Sezer Karaoglu, Marc Pollefeys, Theo Gevers, Martin R. Oswald
Main category: cs.CV
TL;DR: Unblur-SLAM: A novel RGB SLAM pipeline that reconstructs sharp 3D scenes from blurred images using a two-stage approach combining feed-forward deblurring with 3D Gaussian Splatting and blur modeling.
Details
Motivation: Traditional SLAM systems struggle with blurred images (motion blur and defocus blur), which degrade pose estimation and 3D reconstruction quality. Existing methods either ignore blur or use simplistic deblurring that doesn't integrate well with SLAM optimization.
Method: Two-stage approach: 1) Feed-forward image deblurring network with specialized training for SLAM, followed by local-global multi-view optimization for successfully deblurred frames. 2) For failed deblurring cases, uses 3D Gaussian Splatting (3DGS) representation with an additional blur network to model multiple blurred sub-frames and simulate blur formation in 3D space, learning sharp details and refined sub-frame poses.
Result: Experiments on real-world datasets show consistent improvements in both pose estimation accuracy and sharp reconstruction of geometry and texture compared to existing methods.
Conclusion: Unblur-SLAM effectively handles different types of blur in SLAM systems, adapts computation based on blur severity, and achieves state-of-the-art performance for sharp 3D reconstruction from blurred inputs.
Abstract: We propose Unblur-SLAM, a novel RGB SLAM pipeline for sharp 3D reconstruction from blurred image inputs. In contrast to previous work, our approach is able to handle different types of blur and demonstrates state-of-the-art performance in the presence of both motion blur and defocus blur. Moreover, we adjust the computation effort with the amount of blur in the input image. As a first stage, our method uses a feed-forward image deblurring model for which we propose a suitable training scheme that can improve both tracking and mapping modules. Frames that are successfully deblurred by the feed-forward network obtain refined poses and depth through local-global multi-view optimization and loop closure. Frames that fail the first stage deblurring are directly modeled through the global 3DGS representation and an additional blur network to model multiple blurred sub-frames and simulate the blur formation process in 3D space, thereby learning sharp details and refined sub-frame poses. Experiments on several real-world datasets demonstrate consistent improvements in both pose estimation and sharp reconstruction results of geometry and texture.
[171] Domain-Guided YOLO26 with Composite BCE-Dice-Lovász Loss for Multi-Class Fetal Head Ultrasound Segmentation
M. Fazri Nizar
Main category: cs.CV
TL;DR: A prompt-free YOLO26-Seg pipeline for joint detection and segmentation of fetal brain structures in ultrasound images, achieving improved Dice scores over baseline methods.
Details
Motivation: Current fetal head structure segmentation in prenatal ultrasound requires bounding-box prompts at test time, creating practical bottlenecks in obstetric imaging. There's a need for more automated, prompt-free approaches.
Method: Developed a prompt-free pipeline using YOLO26-Seg for joint detection and segmentation of three fetal brain structures (Brain, CSP, LV) in a single forward pass. Key modifications include: 1) composite BCE-Dice-Lovász segmentation loss with inverse-frequency class weighting via runtime monkey-patching, 2) domain-guided copy-paste augmentation respecting anatomical locations, and 3) inter-patient stratified splitting to prevent data leakage.
Result: Achieved a mean Dice coefficient of 0.9253 on 575 held-out test images, exceeding the baseline (0.9012) by 2.68 percentage points, even though this mean covers only the three foreground classes while the baseline's reported mean includes the easy background class.
Conclusion: The prompt-free YOLO26-Seg approach with composite loss and domain-specific augmentations effectively improves fetal brain structure segmentation in ultrasound images, addressing practical bottlenecks in obstetric imaging.
Abstract: Segmenting fetal head structures from prenatal ultrasound remains a practical bottleneck in obstetric imaging. The current state-of-the-art baseline, proposed alongside the published dataset, adapts the Segment Anything Model with per-class Dice and Lovász losses but still depends on bounding-box prompts at test time. We build a prompt-free pipeline on top of YOLO26-Seg that jointly detects and segments three structures, Brain, Cavum Septi Pellucidi (CSP), and Lateral Ventricles (LV), in a single forward pass. Three modifications are central to our approach: (i) a composite BCE-Dice-Lovász segmentation loss with inverse-frequency class weighting, injected into the YOLO26 training loop via runtime monkey-patching; (ii) domain-guided copy-paste augmentation that transplants minority-class structures while respecting their anatomical location relative to the brain boundary; and (iii) inter-patient stratified splitting to prevent data leakage. On 575 held-out test images, the composite loss variant reaches a mean Dice coefficient of 0.9253, exceeding the baseline (0.9012) by 2.68 percentage points, despite reporting over three foreground classes only, whereas the baseline’s reported mean includes the easy background class. We further ablate each component and discuss annotation-quality and class-imbalance effects on CSP and LV performance.
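The composite loss at the heart of the method can be sketched for a single class. This is a minimal BCE + soft-Dice combination on a flattened mask; the paper's full loss additionally includes a Lovász term and inverse-frequency class weights, which are omitted here for brevity, and the weights `w_bce`/`w_dice` are illustrative:

```python
import math

# Minimal sketch of a BCE + soft-Dice composite segmentation loss for one
# class on a flattened probability mask (not the paper's full BCE-Dice-Lovász
# loss: the Lovász term and inverse-frequency class weighting are omitted).

def bce_dice_loss(probs, targets, eps=1e-7, w_bce=0.5, w_dice=0.5):
    n = len(probs)
    # Pixel-wise binary cross-entropy, averaged over the mask.
    bce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
               for p, t in zip(probs, targets)) / n
    # Soft Dice: overlap-based, robust to foreground/background imbalance.
    inter = sum(p * t for p, t in zip(probs, targets))
    dice = 1.0 - (2.0 * inter + eps) / (sum(probs) + sum(targets) + eps)
    return w_bce * bce + w_dice * dice

# Flattened toy mask: confident, mostly-correct predictions give a low loss.
probs   = [0.9, 0.8, 0.2, 0.1]
targets = [1,   1,   0,   0]
good = bce_dice_loss(probs, targets)
bad  = bce_dice_loss([0.1, 0.2, 0.8, 0.9], targets)  # inverted predictions
```

Pairing BCE with an overlap term is common for small structures like CSP and LV, where per-pixel losses alone under-penalize missing a tiny foreground region.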
[172] Beyond Textual Knowledge-Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation
Dongsheng Yang, Yinfeng Yu, Liejun Wang
Main category: cs.CV
TL;DR: BTK enhances Vision-and-Language Navigation by integrating environment-specific textual knowledge with generative image knowledge bases to improve semantic grounding and cross-modal alignment.
Details
Motivation: Existing VLN methods struggle to effectively capture key semantic cues and accurately align them with visual observations, limiting navigation performance in complex unseen environments.
Method: Proposes BTK framework that uses Qwen3-4B to extract goal-related phrases, Flux-Schnell to construct image knowledge bases (R2R-GP, REVERIE-GP), and BLIP-2 to create textual knowledge bases from panoramic views. Integrates these via Goal-Aware Augmentor and Knowledge Augmentor.
Result: Significant improvements on R2R (7,189 trajectories) and REVERIE (21,702 instructions) datasets. On test unseen splits: SR increased by 5% and 2.07% respectively, SPL increased by 4% and 3.69% respectively.
Conclusion: BTK effectively integrates multimodal knowledge bases to enhance semantic grounding and cross-modal alignment in VLN, outperforming existing baselines on standard benchmarks.
Abstract: Vision-and-Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to effectively capture key semantic cues and accurately align them with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that synergistically integrates environment-specific textual knowledge with generative image knowledge bases. BTK employs Qwen3-4B to extract goal-related phrases and utilizes Flux-Schnell to construct two large-scale image knowledge bases: R2R-GP and REVERIE-GP. Additionally, we leverage BLIP-2 to construct a large-scale textual knowledge base derived from panoramic views, providing environment-specific semantic cues. These multimodal knowledge bases are effectively integrated via the Goal-Aware Augmentor and Knowledge Augmentor, significantly enhancing semantic grounding and cross-modal alignment. Extensive experiments on the R2R dataset with 7,189 trajectories and the REVERIE dataset with 21,702 instructions demonstrate that BTK significantly outperforms existing baselines. On the test unseen splits of R2R and REVERIE, SR increased by 5% and 2.07% respectively, and SPL increased by 4% and 3.69% respectively. The source code is available at https://github.com/yds3/IPM-BTK/.
[173] GradAttn: Replacing Fixed Residual Connections with Task-Modulated Attention Pathways
Soudeep Ghoshal, Himanshu Buckchash
Main category: cs.CV
TL;DR: GradAttn replaces fixed residual connections in ResNet with attention-controlled gradient flow, using a hybrid CNN-transformer framework to dynamically weight shallow texture and deep semantic features across network hierarchies.
Details
Motivation: Deep ConvNets suffer from gradient signal degradation as depth increases. While ResNet addressed this with residual connections, these fixed short-circuits cannot adapt to varying input complexity or selectively emphasize task-relevant features across network hierarchies.
Method: Introduces GradAttn, a hybrid CNN-transformer framework that replaces fixed residual connections with attention-controlled gradient flow. Extracts multi-scale CNN features at different depths and regulates them through self-attention to dynamically weight shallow texture features and deep semantic representations.
Result: GradAttn outperforms ResNet-18 on 5 of 8 diverse datasets (natural images, medical imaging, fashion recognition), achieving up to +11.07% accuracy improvement on FashionMNIST while maintaining comparable network size. Gradient flow analysis reveals controlled instabilities from attention often coincide with improved generalization.
Conclusion: Attention mechanisms can serve as enablers of learnable gradient control, offering a new paradigm for adaptive representation learning in deep neural architectures. Positional encoding effectiveness is dataset dependent, with CNN hierarchies often encoding sufficient spatial structure.
Abstract: Deep ConvNets suffer from gradient signal degradation as network depth increases, limiting effective feature learning in complex architectures. ResNet addressed this through residual connections, but these fixed short-circuits cannot adapt to varying input complexity or selectively emphasize task-relevant features across network hierarchies. This study introduces GradAttn, a hybrid CNN-transformer framework that replaces fixed residual connections with attention-controlled gradient flow. By extracting multi-scale CNN features at different depths and regulating them through self-attention, GradAttn dynamically weights shallow texture features and deep semantic representations. For representational analysis, we evaluated three GradAttn variants across eight diverse datasets spanning natural images, medical imaging, and fashion recognition. Results demonstrate that GradAttn outperforms ResNet-18 on five of eight datasets, achieving up to +11.07% accuracy improvement on FashionMNIST while maintaining comparable network size. Gradient flow analysis reveals that controlled instabilities, introduced by attention, often coincide with improved generalization, challenging the assumption that perfect stability is optimal. Furthermore, positional encoding effectiveness proves dataset dependent, with CNN hierarchies frequently encoding sufficient spatial structure. These findings position attention mechanisms as enablers of learnable gradient control, offering a new paradigm for adaptive representation learning in deep neural architectures.
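The core idea of replacing the fixed shortcut y = x + F(x) with an input-dependent mix can be illustrated with a toy gate. This is a simplified sketch under an assumed per-position scalar gate; the paper itself uses self-attention over multi-scale features, and `gated_residual` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_residual(x, fx, score):
    """A fixed residual is y = x + F(x); here a gate a in (0, 1), produced by
    some learned scoring function, convexly mixes the shortcut x and the
    transformed path F(x) instead."""
    a = sigmoid(score)                 # hypothetical learned gate logits
    return a * x + (1.0 - a) * fx, a

x = np.array([1.0, 2.0])               # shortcut path
fx = np.array([3.0, 4.0])              # transformed path F(x)
# score 0 -> balanced mix; large score -> gate near 1 keeps the shortcut
y, a = gated_residual(x, fx, score=np.array([0.0, 10.0]))
```

In backprop, the gate also scales the gradient reaching each path, which is the "attention-controlled gradient flow" framing above.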
[174] MD-RWKV-UNet: Scale-Aware Anatomical Encoding with Cross-Stage Fusion for Multi-Organ Segmentation
Zhuoyi Fang
Main category: cs.CV
TL;DR: MD-RWKV-UNet: A dynamic encoder network for medical multi-organ segmentation using deformable spatial shifts and selective kernel attention for scale-aware representation and adaptive context modeling.
Details
Motivation: Multi-organ segmentation faces challenges from anatomical variability, complex inter-organ dependencies, and diverse organ scales/shapes. Conventional encoder-decoder architectures struggle to capture both fine-grained local details and long-range context needed for accurate delineation, especially for small or deformable organs.
Method: Proposes MD-RWKV-UNet with MD-RWKV blocks integrating deformable spatial shifts with Receptance Weighted Key Value mechanism for dynamic receptive field adaptation. Uses Selective Kernel Attention for adaptive convolutional kernel selection with varying receptive fields. Implements cross-stage dual-attention fusion strategy to aggregate multi-level features across encoder.
Result: Achieves state-of-the-art performance on Synapse and ACDC datasets, particularly excelling in boundary precision and small-organ segmentation.
Conclusion: The approach provides a lightweight yet expressive solution for dynamic organ modeling, outperforming methods that rely on static convolutions or global attention for medical image segmentation tasks.
Abstract: Multi-organ segmentation in medical imaging remains challenging due to large anatomical variability, complex inter-organ dependencies, and diverse organ scales and shapes. Conventional encoder-decoder architectures often struggle to capture both fine-grained local details and long-range context, which are crucial for accurate delineation - especially for small or deformable organs. To address these limitations, we propose MD-RWKV-UNet, a dynamic encoder network that enables scale-aware representation and spatially adaptive context modeling. At its core is the MD-RWKV block, a dual-path module that integrates deformable spatial shifts with the Receptance Weighted Key Value mechanism, allowing the receptive field to adapt dynamically to local structural cues. We further incorporate Selective Kernel Attention to enable adaptive selection of convolutional kernels with varying receptive fields, enhancing multi-scale interaction and improving robustness to organ size and shape variation. In parallel, a cross-stage dual-attention fusion strategy aggregates multi-level features across the encoder, preserving low-level structure while enhancing semantic consistency. Unlike methods that stack static convolutions or rely heavily on global attention, our approach provides a lightweight yet expressive solution for dynamic organ modeling. Experiments on Synapse and ACDC demonstrate state-of-the-art performance, particularly in boundary precision and small-organ segmentation.
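The Selective Kernel Attention idea above, choosing adaptively among branches with different receptive fields, reduces to a softmax-weighted fusion of branch outputs. The sketch below is a minimal NumPy stand-in, not the paper's module: `selective_fuse` is a hypothetical name, and in SK attention the branch logits come from a small squeeze-excite MLP rather than being given directly.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def selective_fuse(branches, logits):
    """Softmax-select among branch outputs with different receptive fields.

    branches: list of same-shape feature maps; logits: one score per branch.
    """
    w = softmax(np.asarray(logits, dtype=float))
    return sum(wi * b for wi, b in zip(w, branches)), w

b3 = np.full((4, 4), 1.0)   # stand-in for a 3x3-kernel branch output
b5 = np.full((4, 4), 3.0)   # stand-in for a 5x5-kernel branch output
fused, w = selective_fuse([b3, b5], logits=[0.0, 0.0])  # equal logits -> even mix
```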
[175] Physics-Aware Diffusion for LiDAR Point Cloud Densification
Zeping Zhang, Robert Laganière
Main category: cs.CV
TL;DR: A diffusion-based framework for LiDAR point cloud densification that treats densification as probabilistic refinement rather than generation, achieving fast inference and physically consistent results.
Details
Motivation: LiDAR perception suffers from distance-dependent sparsity of distant objects, while existing diffusion models for point cloud densification have prohibitive latency and produce physical hallucinations (ghost points).
Method: Proposes Scanline-Consistent Range-Aware Diffusion using Partial Diffusion (SDEdit) on a coarse prior. Introduces Ray-Consistency loss and Negative Ray Augmentation to enforce sensor physics and suppress artifacts.
Result: Achieves high-fidelity results in 156ms, state-of-the-art performance on KITTI-360 and nuScenes datasets, and directly boosts off-the-shelf 3D detectors without retraining.
Conclusion: The framework effectively addresses LiDAR sparsity through probabilistic refinement with sensor physics constraints, enabling practical real-time applications and improved downstream perception tasks.
Abstract: LiDAR perception is severely limited by the distance-dependent sparsity of distant objects. While diffusion models can recover dense geometry, they suffer from prohibitive latency and physical hallucinations manifesting as ghost points. We propose Scanline-Consistent Range-Aware Diffusion, a framework that treats densification as probabilistic refinement rather than generation. By leveraging Partial Diffusion (SDEdit) on a coarse prior, we achieve high-fidelity results in just 156ms. Our novel Ray-Consistency loss and Negative Ray Augmentation enforce sensor physics to suppress artifacts. Our method achieves state-of-the-art results on KITTI-360 and nuScenes, directly boosting off-the-shelf 3D detectors without retraining. Code will be made available.
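The Partial Diffusion (SDEdit) step that makes this fast can be sketched in one line of math: rather than denoising from pure noise, the coarse prior is noised to an intermediate timestep t and denoising starts from there. A toy NumPy version under standard DDPM notation (alpha_bar_t is the cumulative noise schedule; the denoiser itself is omitted):

```python
import numpy as np

def sdedit_start(prior, alpha_bar_t, rng):
    """SDEdit-style partial diffusion starting point:
    x_t = sqrt(abar_t) * prior + sqrt(1 - abar_t) * eps,
    so the sampler only has to traverse t -> 0 instead of T -> 0."""
    eps = rng.standard_normal(prior.shape)
    return np.sqrt(alpha_bar_t) * prior + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)
prior = np.ones((1000,))                       # stand-in coarse densification prior
x_t = sdedit_start(prior, alpha_bar_t=0.9, rng=rng)
```

Keeping alpha_bar_t high (a late start) is what preserves the prior's geometry and keeps latency low.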
[176] UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
Xuangeng Chu, Ruicong Liu, Yifei Huang, Yun Liu, Yichen Peng, Bo Zheng
Main category: cs.CV
TL;DR: UniLS is an end-to-end framework for generating unified speak-listen facial expressions using only dual-track audio, addressing the challenge of modeling realistic listener behavior in conversational avatars.
Details
Motivation: Current methods struggle with generating realistic listener behavior in conversational avatars because listener motion follows an internal motion prior rather than being strongly audio-driven like speaker motion. Existing approaches either focus only on speaking generation or require extra speaker motion data, limiting real-time applicability.
Method: Two-stage training: Stage 1 learns internal motion prior using an audio-free autoregressive generator to capture natural facial dynamics. Stage 2 introduces dual-track audio to fine-tune the generator, modulating the learned motion prior based on external speech cues.
Result: Achieves state-of-the-art speaking accuracy and delivers up to 44.1% improvement in listening metrics, generating significantly more diverse and natural listening expressions while effectively mitigating stiffness problems.
Conclusion: UniLS provides a practical, high-fidelity audio-driven solution for interactive digital humans by enabling end-to-end generation of unified speak-listen expressions using only dual-track audio.
Abstract: Generating lifelike conversational avatars requires modeling not just isolated speakers, but the dynamic, reciprocal interaction of speaking and listening. However, modeling the listener is exceptionally challenging: direct audio-driven training fails, producing stiff, static listening motions. This failure stems from a fundamental imbalance: the speaker’s motion is strongly driven by speech audio, while the listener’s motion primarily follows an internal motion prior and is only loosely guided by external speech. This challenge has led most methods to focus on speak-only generation. The only prior attempt at joint generation relies on extra speaker’s motion to produce the listener. This design is not end-to-end, thereby hindering the real-time applicability. To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions, driven by only dual-track audio. Our method introduces a novel two-stage training paradigm. Stage 1 first learns the internal motion prior by training an audio-free autoregressive generator, capturing the spontaneous dynamics of natural facial motion. Stage 2 then introduces the dual-track audio, fine-tuning the generator to modulate the learned motion prior based on external speech cues. Extensive evaluations show UniLS achieves state-of-the-art speaking accuracy. More importantly, it delivers up to 44.1% improvement in listening metrics, generating significantly more diverse and natural listening expressions. This effectively mitigates the stiffness problem and provides a practical, high-fidelity audio-driven solution for interactive digital humans. Code and demos are available at https://xg-chu.site/project_unils/.
[177] Towards Emotion Recognition with 3D Pointclouds Obtained from Facial Expression Images
Laura Rayón Ropero, Jasper De Laet, Filip Lemic, Pau Sabater Nácher, Nabeel Nisar Bhat, Sergi Abadal, Jeroen Famaey, Eduard Alarcón, Xavier Costa-Pérez
Main category: cs.CV
TL;DR: Proposes privacy-aware facial emotion recognition using high-frequency wireless sensing to generate 3D facial pointclouds from wearables, creates AffectNet3D dataset from 2D data, and achieves comparable performance to traditional methods.
Details
Motivation: Current FER methods using 2D images raise privacy concerns and are unsuitable for continuous monitoring. Need privacy-aware alternatives that enable real-time emotion recognition without compromising personal data.
Method: Uses high-frequency wireless sensing (HFWS) to generate 3D facial pointclouds via wearable sensors. Creates AffectNet3D dataset using FLAME-based method to convert existing 2D datasets to 3D. Implements pointcloud refinement pipeline and trains PointNet++ model, with fine-tuning on BU-3DFE dataset.
Result: Achieves over 70% classification accuracy on BU-3DFE, comparable to oracle performance. Models trained on AffectNet3D and fine-tuned with 25% of BU-3DFE outperform those trained solely on BU-3DFE. Works well even with simulated wearable conditions (masked pointclouds).
Conclusion: HFWS-based FER is viable for continuous, privacy-aware emotion monitoring via wearables. The proposed pipeline effectively addresses 3D dataset scarcity and demonstrates practical feasibility for real-world applications.
Abstract: Facial Emotion Recognition (FER) is a critical research area within Affective Computing due to its wide-ranging applications in Human-Computer Interaction, mental health assessment and fatigue monitoring. Current FER methods predominantly rely on Deep Learning techniques trained on 2D image data, which pose significant privacy concerns and are unsuitable for continuous, real-time monitoring. As an alternative, we propose High-Frequency Wireless Sensing (HFWS) as an enabler of continuous, privacy-aware FER, through the generation of detailed 3D facial pointclouds via on-person sensors embedded in wearables. We present arguments supporting the privacy advantages of HFWS over traditional 2D imaging, particularly under increasingly stringent data protection regulations. A major barrier to adopting HFWS for FER is the scarcity of labeled 3D FER datasets. Towards addressing this issue, we introduce a FLAME-based method to generate 3D facial pointclouds from existing public 2D datasets. Using this approach, we create AffectNet3D, a 3D version of the AffectNet database. To evaluate the quality and usability of the generated data, we design a pointcloud refinement pipeline focused on isolating the facial region, and train the popular PointNet++ model on the refined pointclouds. Fine-tuning the model on a small subset of the unseen 3D FER dataset BU-3DFE yields a classification accuracy exceeding 70%, comparable to oracle-level performance. To further investigate the potential of HFWS-based FER for continuous monitoring, we simulate wearable sensing conditions by masking portions of the generated pointclouds. Experimental results show that models trained on AffectNet3D and fine-tuned with just 25% of BU-3DFE outperform those trained solely on BU-3DFE. These findings highlight the viability of our pipeline and support the feasibility of continuous, privacy-aware FER via wearable HFWS systems.
[178] An Intelligent Framework for Real-Time Yoga Pose Detection and Posture Correction
Chandramouli Haldar
Main category: cs.CV
TL;DR: A hybrid Edge AI framework for real-time yoga pose detection and posture correction using lightweight pose estimation, biomechanical feature extraction, and CNN-LSTM architecture to provide corrective feedback.
Details
Motivation: Yoga benefits depend on correct posture execution, but improper alignment reduces effectiveness and increases injury risk, especially in self-guided or online training environments where real-time feedback is lacking.
Method: Integrates lightweight human pose estimation models with biomechanical feature extraction and CNN-LSTM temporal learning architecture. Computes joint angles and skeletal features from detected keypoints, compares with reference poses, uses quantitative scoring for alignment deviations, and provides real-time feedback via visual, text, and voice guidance with Edge AI optimization (quantization, pruning) for resource-constrained devices.
Result: Proposed framework enables real-time yoga pose detection and posture correction with low latency on edge devices, providing intelligent corrective feedback to improve user safety and training effectiveness.
Conclusion: The Edge AI-based framework offers a scalable digital yoga assistant that can enhance safety and effectiveness in modern fitness applications by providing real-time posture correction and guidance.
Abstract: Yoga is widely recognized for improving physical fitness, flexibility, and mental well-being. However, these benefits depend strongly on correct posture execution. Improper alignment during yoga practice can reduce effectiveness and increase the risk of musculoskeletal injuries, especially in self-guided or online training environments. This paper presents a hybrid Edge AI-based framework for real-time yoga pose detection and posture correction. The proposed system integrates lightweight human pose estimation models with biomechanical feature extraction and a CNN-LSTM-based temporal learning architecture to recognize yoga poses and analyze motion dynamics. Joint angles and skeletal features are computed from detected keypoints and compared with reference pose configurations to evaluate posture correctness. A quantitative scoring mechanism is introduced to measure alignment deviations and generate real-time corrective feedback through visual, text-based, and voice-based guidance. In addition, Edge AI optimization techniques such as model quantization and pruning are applied to enable low-latency performance on resource-constrained devices. The proposed framework provides an intelligent and scalable digital yoga assistant that can improve user safety and training effectiveness in modern fitness applications.
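The joint-angle computation at the heart of such posture scoring is standard geometry: the angle at a joint is the arccosine of the normalized dot product of the two limb vectors meeting there. A minimal NumPy sketch (keypoint names are illustrative, not tied to any specific pose-estimation model):

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at keypoint b formed by segments b->a and b->c,
    e.g. the knee angle from hip, knee, and ankle keypoints."""
    u = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Right angle: "hip" straight above the "knee", "ankle" off to the side.
angle = joint_angle(a=(0.0, 1.0), b=(0.0, 0.0), c=(1.0, 0.0))
```

A deviation score can then be the absolute difference between this angle and the reference pose's angle at the same joint.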
[179] Tiny-ViT: A Compact Vision Transformer for Efficient and Explainable Potato Leaf Disease Classification
Shakil Mia, Umme Habiba, Urmi Akter, SK Rezwana Quadir Raisa, Jeba Maliha, Md. Iqbal Hossain, Md. Shakhauat Hossan Sumon
Main category: cs.CV
TL;DR: Tiny-ViT: A lightweight Vision Transformer model for potato leaf disease classification achieving 99.85% accuracy with efficient real-time performance and interpretability via GRAD-CAM.
Details
Motivation: Early and precise identification of potato leaf diseases (Early Blight, Late Blight) is crucial for crop health and yield maximization. Traditional detection methods are time-consuming and error-prone, necessitating automated, efficient solutions for resource-limited agricultural systems.
Method: Proposes Tiny-ViT, a small and efficient Vision Transformer designed for resource-constrained environments. Uses image preprocessing (resizing, CLAHE, Gaussian blur) and is evaluated on a 3-class dataset (Early Blight, Late Blight, Healthy). Incorporates GRAD-CAM for model interpretability to identify diseased regions.
Result: Achieves 99.85% test accuracy and 99.82% mean cross-validation accuracy, outperforming baseline models (DEIT Small, SWIN Tiny, MobileViT XS). Shows high reliability with MCC of 0.9990 and narrow confidence intervals [0.9980, 0.9995]. Competitive inference times with low computational costs enable real-time applications.
Conclusion: Tiny-ViT provides a robust, efficient, and explainable solution for plant disease classification, particularly suitable for deployment in resource-limited agricultural settings with real-time requirements.
Abstract: Early and precise identification of plant diseases, especially in potato crops, is important for ensuring crop health and maximizing yield. Potato leaf diseases such as Early Blight and Late Blight pose significant challenges to farmers, often resulting in yield losses and increased pesticide use. Traditional detection methods are not only time-consuming but also subject to human error, so automated and efficient methods are required. This paper introduces Tiny-ViT, a compact and efficient Vision Transformer (ViT) for potato leaf disease classification, designed for resource-limited systems. The model is evaluated on a dataset of three classes, namely Early Blight, Late Blight, and Healthy leaves, with preprocessing steps including resizing, CLAHE, and Gaussian blur to improve image quality. Tiny-ViT achieves a test accuracy of 99.85% and a mean cross-validation accuracy of 99.82%, outperforming baseline models such as DEIT Small, SWIN Tiny, and MobileViT XS. In addition, the model attains a Matthews Correlation Coefficient (MCC) of 0.9990 with a narrow confidence interval (CI) of [0.9980, 0.9995], indicating high reliability and generalization. Training and inference times are competitive, and the model's low computational cost makes it applicable in real-time settings. Moreover, interpretability is improved with GRAD-CAM, which localizes diseased regions. Altogether, the proposed Tiny-ViT offers a robust, efficient, and explainable solution to plant disease classification.
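The first step of any ViT-style model like the one above is tokenizing the resized image into non-overlapping patches. A minimal NumPy sketch of that patchify operation (the 224-pixel resolution and 16-pixel patch size are common ViT defaults assumed here, not values stated by the paper):

```python
import numpy as np

def to_patches(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patch tokens,
    returning an array of shape (num_tokens, p * p * C)."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image must tile evenly into patches"
    grid = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, p * p * c)

# A 224x224 RGB image with 16x16 patches yields a 14x14 grid = 196 tokens.
tokens = to_patches(np.zeros((224, 224, 3)), p=16)
```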
[180] Low Dose CT for Stroke Diagnosis: A Dual Pipeline Deep Learning Framework for Portable Neuroimaging
Rhea Ghosal, Ronok Ghosal, Eileen Lou
Main category: cs.CV
TL;DR: Deep learning framework for stroke classification from low-dose CT scans, comparing direct classification vs denoising-then-classification approaches for AI-assisted triage in mobile settings.
Details
Motivation: Portable CT scanners enable early stroke detection in prehospital settings but require reduced radiation doses, which introduces noise that degrades diagnostic reliability. There's a need for AI-assisted triage in mobile clinical environments.
Method: Controlled Poisson noise applied to high-dose CT images to simulate realistic low-dose CT conditions. Two pipelines compared: (1) direct classification of noisy LDCT images, and (2) denoising followed by classification. Performance evaluated across multiple dose levels using accuracy, sensitivity, and AUC metrics.
Result: While denoising improves perceptual image quality, it does not consistently improve classification. In several settings, direct classification yields higher sensitivity. The best denoise-then-classify pipeline achieves 0.94 AUC and 0.91 accuracy at moderate dose levels, outperforming direct classification by up to 6% in select cases.
Conclusion: This work establishes a reproducible baseline for LDCT stroke triage using hemorrhagic stroke data and highlights the need for validation on ischemic cohorts and real-world portable CT systems. Reveals trade-off between perceptual quality and diagnostic utility.
Abstract: Portable CT scanners enable early stroke detection in prehospital and low-resource settings but require reduced radiation doses, introducing noise that degrades diagnostic reliability. We present a deep learning framework for stroke classification from simulated low-dose CT (LDCT) brain scans for AI-assisted triage in mobile clinical environments. Controlled Poisson noise is applied to high-dose CT images to simulate realistic LDCT conditions. We compare two pipelines: (1) direct classification of noisy LDCT images and (2) denoising followed by classification. Performance is evaluated across multiple dose levels using accuracy, sensitivity, and AUC. While denoising improves perceptual image quality, it does not consistently improve classification. In several settings, direct classification yields higher sensitivity, revealing a trade-off between perceptual quality and diagnostic utility. The best denoise-then-classify pipeline achieves 0.94 AUC and 0.91 accuracy at moderate dose levels, outperforming direct classification by up to 6% in select cases. This work establishes a reproducible baseline for LDCT stroke triage using hemorrhagic stroke data (RSNA dataset) and highlights the need for validation on ischemic cohorts and real-world portable CT systems.
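The Poisson dose-simulation step above has a standard form: scale photon counts down by the dose fraction, draw Poisson-distributed counts, and scale back up, so lower doses yield proportionally noisier images. A minimal NumPy sketch under the simplifying assumption that the image is already in photon-count-like units (real LDCT simulation operates on projection data before reconstruction):

```python
import numpy as np

def simulate_low_dose(img, dose_fraction, rng):
    """Simulate a reduced-dose acquisition via Poisson counting noise.

    img: nonnegative intensity image; dose_fraction in (0, 1], where
    smaller values mean a lower dose and therefore more noise.
    """
    counts = rng.poisson(np.maximum(img, 0.0) * dose_fraction)
    return counts / dose_fraction

rng = np.random.default_rng(0)
hd = np.full((64, 64), 100.0)                       # high-dose reference
ld = simulate_low_dose(hd, dose_fraction=0.25, rng=rng)  # quarter-dose simulation
```

The rescaling keeps the mean intensity unchanged while the variance grows as 1/dose_fraction, which is the controlled degradation the two pipelines are evaluated against.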
[181] JND-Guided Neural Watermarking with Spatial Transformer Decoding for Screen-Capture Robustness
Jiayi Qin, Jingwei Li, Chuan Wu
Main category: cs.CV
TL;DR: End-to-end deep learning framework for screen-shooting robust watermarking that jointly optimizes embedding and extraction to survive complex display-camera distortions.
Details
Motivation: Screen-shooting watermarking faces challenges in maintaining extraction accuracy while preserving visual quality due to severe distortions like Moiré patterns, color shifts, perspective warping, and sensor noise introduced during display and recapture.
Method: Three key innovations: 1) Comprehensive noise simulation layer modeling realistic screen-shooting distortions including Moiré patterns, 2) JND perceptual loss function adaptively modulating watermark strength, 3) Two automatic localization modules for captured image rectification and anti-cropping recovery.
Result: Achieves average PSNR of 30.94 dB and SSIM of 0.94 on watermarked images while embedding 127-bit payloads, demonstrating strong performance in both visual quality and robustness.
Conclusion: The proposed framework effectively addresses screen-shooting watermarking challenges by jointly optimizing embedding and extraction through realistic distortion modeling, perceptual quality preservation, and automated localization.
Abstract: Screen-shooting robust watermarking aims to imperceptibly embed extractable information into host images such that the watermark survives the complex distortion pipeline of screen display and camera recapture. However, achieving high extraction accuracy while maintaining satisfactory visual quality remains an open challenge, primarily because the screen-shooting channel introduces severe and entangled degradations including Moiré patterns, color-gamut shifts, perspective warping, and sensor noise. In this paper, we present an end-to-end deep learning framework that jointly optimizes watermark embedding and extraction for screen-shooting robustness. Our framework incorporates three key innovations: (i) a comprehensive noise simulation layer that faithfully models realistic screen-shooting distortions – notably including a physically-motivated Moiré pattern generator – enabling the network to learn robust representations against the full spectrum of capture-channel noise through adversarial training; (ii) a Just Noticeable Distortion (JND) perceptual loss function that adaptively modulates watermark embedding strength by supervising the perceptual discrepancy between the JND coefficient map and the watermark residual, thereby concentrating watermark energy in perceptually insensitive regions to maximize visual quality; and (iii) two complementary automatic localization modules – a semantic-segmentation-based foreground extractor for captured image rectification and a symmetric noise template mechanism for anti-cropping region recovery – that enable fully automated watermark decoding under realistic deployment conditions. Extensive experiments demonstrate that our method achieves an average PSNR of 30.94 dB and SSIM of 0.94 on watermarked images while embedding 127-bit payloads.
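The PSNR figure reported above is the standard imperceptibility metric for watermarking: 10 log10(MAX² / MSE) between the host and watermarked image. A minimal NumPy sketch for reference (not the paper's evaluation code):

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means the watermarked
    image is closer to the original host."""
    mse = np.mean((np.asarray(ref, dtype=float) - np.asarray(test, dtype=float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.full((8, 8), 100.0)
val = psnr(ref, ref + 4.0)   # uniform error of 4 gray levels -> MSE = 16
```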
[182] A training-free framework for high-fidelity appearance transfer via diffusion transformers
Shengrong Gu, Ye Wang, Song Wu, Rui Ma, Qian Wang, Lanjun Wang, Zili Yi
Main category: cs.CV
TL;DR: A training-free framework for high-fidelity appearance transfer in Diffusion Transformers (DiTs) that disentangles structure and appearance using inversion priors and attention-sharing mechanisms.
Details
Motivation: DiTs excel at generation but struggle with controllable, reference-image-based editing due to their global self-attention. Unlike U-Nets, naively injecting local appearance into DiTs can disrupt holistic scene structure, creating a need for specialized frameworks for appearance transfer.
Method: Proposes a training-free framework with synergistic system that disentangles structure and appearance. Uses high-fidelity inversion to establish content prior capturing lighting and micro-textures, and novel attention-sharing mechanism to dynamically fuse purified appearance features from reference guided by geometric priors.
Result: Unified approach operates at 1024px and outperforms specialized methods on tasks ranging from semantic attribute transfer to fine-grained material application. Extensive experiments confirm state-of-the-art performance in both structural preservation and appearance fidelity.
Conclusion: The proposed framework successfully addresses the challenge of controllable appearance transfer in DiTs, enabling high-fidelity editing while preserving scene structure, outperforming existing methods across various editing tasks.
Abstract: Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic scene structure. We address this by proposing the first training-free framework specifically designed to tame DiTs for high-fidelity appearance transfer. Our core is a synergistic system that disentangles structure and appearance. We leverage high-fidelity inversion to establish a rich content prior for the source image, capturing its lighting and micro-textures. A novel attention-sharing mechanism then dynamically fuses purified appearance features from a reference, guided by geometric priors. Our unified approach operates at 1024px and outperforms specialized methods on tasks ranging from semantic attribute transfer to fine-grained material application. Extensive experiments confirm our state-of-the-art performance in both structural preservation and appearance fidelity.
[183] Aesthetic Assessment of Chinese Handwritings Based on Vision Language Models
Chen Zheng, Yuxuan Lai, Haoyang Lu, Wentao Ma, Jitao Yang, Jian Wang
Main category: cs.CV
TL;DR: VLMs fine-tuned with LoRA and in-context learning generate multi-level feedback for handwritten Chinese character quality assessment, achieving SOTA results.
Details
Motivation: Existing automated handwriting assessment methods provide only numerical scores without actionable guidance, limiting their effectiveness for helping learners improve their Chinese character handwriting skills.
Method: Leverage vision-language models (VLMs) to analyze handwritten Chinese character quality and generate multi-level feedback. Explore both LoRA-based fine-tuning strategies and in-context learning methods to integrate aesthetic assessment knowledge into VLMs for two tasks: simple grade feedback and enriched descriptive feedback.
Result: The approach achieves state-of-the-art performances across multiple evaluation tracks in the CCL 2025 workshop on evaluation of handwritten Chinese character quality.
Conclusion: VLMs can effectively generate actionable multi-level feedback for handwritten Chinese character quality assessment, moving beyond simple regression-based scoring to provide meaningful guidance for learners.
Abstract: The handwriting of Chinese characters is a fundamental aspect of learning the Chinese language. Previous automated assessment methods often framed scoring as a regression problem. However, this score-only feedback lacks actionable guidance, which limits its effectiveness in helping learners improve their handwriting skills. In this paper, we leverage vision-language models (VLMs) to analyze the quality of handwritten Chinese characters and generate multi-level feedback. Specifically, we investigate two feedback generation tasks: simple grade feedback (Task 1) and enriched, descriptive feedback (Task 2). We explore both low-rank adaptation (LoRA)-based fine-tuning strategies and in-context learning methods to integrate aesthetic assessment knowledge into VLMs. Experimental results show that our approach achieves state-of-the-art performances across multiple evaluation tracks in the CCL 2025 workshop on evaluation of handwritten Chinese character quality.
[184] Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption
Mehmet Kaan Erol
Main category: cs.CV
TL;DR: Compact VLMs fail differently, not just more often, with distinct error patterns compared to larger models, particularly showing severe negation collapse and dataset-dependent miscalibration.
Details
Motivation: To understand whether compressed vision-language models for edge deployment fail in qualitatively different ways compared to larger models, rather than just failing more frequently.
Method: Comparative analysis of a 7B-parameter quantized VLM (Qwen2.5-VL-7B) vs. a 500M-parameter model (SmolVLM2-500M) using 4,000 samples from VQAv2 and COCO Captions. Applied three-category error taxonomy, used GPT-4o as judge, measured confidence calibration via ECE, tested compositional reasoning with structured negation probes, and conducted blur robustness experiments.
Result: Compact model shows distinct failure signature: 12.5pp larger negation collapse (-33.2pp vs. -20.8pp), driven almost entirely by COCO. Most discriminating template shows SmolVLM2-500M responds “Yes” (incorrectly) on 100% of COCO trials vs. 14% for larger model. Asymmetric dataset-dependent miscalibration observed.
Conclusion: Compressed VLMs exhibit qualitatively different failure patterns, with severe compositional reasoning deficits, necessitating systematic safety auditing before edge deployment.
Abstract: The rapid compression of large vision-language models (VLMs) for edge deployment raises an underexplored question: do compact models fail differently, not merely more often? This study compares a 7-billion-parameter quantised VLM (Qwen2.5-VL-7B, 4-bit NF4) against a 500-million-parameter FP16 model (SmolVLM2-500M) across 4,000 samples from VQAv2 and COCO Captions. A three-category error taxonomy (Object Blindness, Semantic Drift, Prior Bias) is applied as a diagnostic framework. A text-only GPT-4o judge reveals Semantic Drift (B) as the dominant failure mode on VQAv2 and on COCO for Qwen, with a mixed Object Blindness / Semantic Drift profile for SmolVLM2 on COCO; Prior Bias (C) is present on VQAv2 but absent on COCO for both models. Confidence calibration is measured via Expected Calibration Error (ECE) using geometric mean token probability, compositional reasoning is probed with structured negation probes across four templates, and a blur robustness experiment completes the evaluation. For this model pair, the compact model exhibits a qualitatively distinct failure signature: a 12.5pp larger negation collapse (-33.2pp vs. -20.8pp, Wald 95% CI [8.2, 16.8]pp, p < 10^-8), driven almost entirely by COCO while the VQAv2 gap is not statistically significant (4.5pp, p=0.19). The most discriminating template is false_yn: SmolVLM2-500M responds “Yes” (incorrectly claiming a depicted object is absent) on 100% of COCO trials vs. 14% for Qwen2.5-VL-7B. Asymmetric dataset-dependent miscalibration and a blur experiment with two controlled ablations complete the analysis. The fully reproducible pipeline is released for systematic safety auditing of compressed VLMs prior to edge deployment.
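The paper's calibration measure combines two standard quantities: a sequence-level confidence (the geometric mean of token probabilities) and binned Expected Calibration Error over those confidences. A minimal sketch in plain Python — the equal-width 10-bin scheme is an illustrative assumption, not necessarily the paper's exact setup:

```python
import math

def geometric_mean_confidence(token_probs):
    """Sequence-level confidence: geometric mean of per-token probabilities."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - avg confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece
```

A perfectly calibrated set of predictions (confidence matching empirical accuracy in every bin) gives ECE of 0; overconfident models push it upward.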
[185] Quantized Vision-Language Models for Damage Assessment: A Comparative Study of LLaVA-1.5-7B Quantization Levels
Takato Yasuno
Main category: cs.CV
TL;DR: Quantized Vision-Language Models (VLMs) for automated bridge damage assessment, with systematic comparison of quantization levels (Q4_K_M, Q5_K_M, Q8_0) balancing quality, speed, and resource requirements.
Details
Motivation: Bridge inspection is labor-intensive and requires expert assessment; automated solutions using VLMs can improve efficiency but need to be deployable on consumer-grade hardware with acceptable quality-speed trade-offs.
Method: End-to-end pipeline using LLaVA-1.5-7B for visual damage analysis, structured JSON extraction, and rule-based priority scoring. Systematic comparison of three quantization levels (Q4_K_M, Q5_K_M, Q8_0) on 254 rebar exposure images with 5-point quality evaluation framework.
Result: Q5_K_M achieves optimal balance: quality score 3.18±1.35/5.0, inference time 5.67s/image, and 0.56 quality/sec efficiency - 8.5% higher quality than Q4_K_M with only 4.5% speed reduction, while matching Q8_0’s quality with 25% faster inference.
Conclusion: Q5_K_M quantization provides the best trade-off for practical deployment of VLMs in bridge inspection, offering consistent performance regardless of description length while maintaining efficiency on consumer-grade hardware.
Abstract: Bridge infrastructure inspection is a critical but labor-intensive task requiring expert assessment of structural damage such as rebar exposure, cracking, and corrosion. This paper presents a comprehensive study of quantized Vision-Language Models (VLMs) for automated bridge damage assessment, focusing on the trade-offs between description quality, inference speed, and resource requirements. We develop an end-to-end pipeline combining LLaVA-1.5-7B for visual damage analysis, structured JSON extraction, and rule-based priority scoring. To enable deployment on consumer-grade GPUs, we conduct a systematic comparison of three quantization levels: Q4_K_M, Q5_K_M, and Q8_0 across 254 rebar exposure images. We introduce a 5-point quality evaluation framework assessing damage type recognition and severity classification. Our results demonstrate that Q5_K_M achieves the optimal balance: quality score 3.18$\pm$1.35/5.0, inference time 5.67s/image, and 0.56 quality/sec efficiency – 8.5% higher quality than Q4_K_M with only 4.5% speed reduction, while matching Q8_0’s quality with 25% faster inference. Statistical analysis reveals Q5_K_M exhibits the weakest text-quality correlation (-0.148), indicating consistent performance regardless of description length.
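The reported efficiency figure is simply the mean quality score divided by per-image inference time; a one-line check reproduces the 0.56 value for Q5_K_M from the numbers stated above:

```python
def quality_per_second(quality_score: float, seconds_per_image: float) -> float:
    """Efficiency metric used to rank quantization levels: quality units per second of inference."""
    return quality_score / seconds_per_image

# Reported Q5_K_M figures: mean quality 3.18/5.0 at 5.67 s/image
print(round(quality_per_second(3.18, 5.67), 2))  # 0.56
```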
[186] From Content to Audience: A Multimodal Annotation Framework for Broadcast Television Analytics
Paolo Cupini, Francesco Pierri
Main category: cs.CV
TL;DR: Systematic evaluation of multimodal LLMs for broadcast TV annotation, showing model-dependent video benefits and operational deployment for audience analytics.
Details
Motivation: Broadcast TV annotation combines structured AV composition, domain patterns, and operational constraints, but MLLMs' effectiveness across pipeline architectures and input configurations in broadcast settings remains undercharacterized.
Method: Constructed domain-specific benchmark of Italian TV news clips labeled across four semantic dimensions. Evaluated two pipeline architectures across nine frontier models (Gemini 3.0 Pro, LLaMA 4 Maverick, Qwen-VL variants, Gemma 3) under progressively enriched input strategies combining visual signals, ASR, speaker diarization, and metadata.
Result: Gains from video input are strongly model-dependent: larger models effectively leverage temporal continuity, while smaller models show performance degradation under extended multimodal context due to token overload. Deployed pipeline on 14 full broadcast episodes with minute-level annotations integrated with audience measurement data.
Conclusion: Demonstrated operational viability of MLLM-based framework for content-based audience analytics, enabling correlational analysis of topic-level audience sensitivity and generational engagement divergence in broadcast TV.
Abstract: Automated semantic annotation of broadcast television content presents distinctive challenges, combining structured audiovisual composition, domain-specific editorial patterns, and strict operational constraints. While multimodal large language models (MLLMs) have demonstrated strong general-purpose video understanding capabilities, their comparative effectiveness across pipeline architectures and input configurations in broadcast-specific settings remains empirically undercharacterized. This paper presents a systematic evaluation of multimodal annotation pipelines applied to broadcast television news in the Italian setting. We construct a domain-specific benchmark of clips labeled across four semantic dimensions: visual environment classification, topic classification, sensitive content detection, and named entity recognition. Two different pipeline architectures are evaluated across nine frontier models, including Gemini 3.0 Pro, LLaMA 4 Maverick, Qwen-VL variants, and Gemma 3, under progressively enriched input strategies combining visual signals, automatic speech recognition, speaker diarization, and metadata. Experimental results demonstrate that gains from video input are strongly model-dependent: larger models effectively leverage temporal continuity, while smaller models show performance degradation under extended multimodal context, likely due to token overload. Beyond benchmarking, the selected pipeline is deployed on 14 full broadcast episodes, with minute-level annotations integrated with normalized audience measurement data provided by an Italian media company. This integration enables correlational analysis of topic-level audience sensitivity and generational engagement divergence, demonstrating the operational viability of the proposed framework for content-based audience analytics.
[187] From Prediction to Diagnosis: Reasoning-Aware AI for Photovoltaic Defect Inspection
Dev Mistry, Feng Qiu, Bo Chen, Feng Liu, Can Chen, Mohammad Shahidehpour, Ren Wang
Main category: cs.CV
TL;DR: REVL-PV is a vision-language framework for photovoltaic defect identification that combines multimodal imagery with diagnostic reasoning to produce interpretable reports aligned with professional inspection practices.
Details
Motivation: Current automated defect detection systems for photovoltaic panels operate as opaque classifiers with limited diagnostic insight, which is insufficient for high-stakes energy infrastructure where interpretable reasoning and professional alignment are crucial.
Method: REVL-PV embeds domain-specific diagnostic reasoning into multimodal learning across electroluminescence, thermal, and visible-light imagery. The framework requires the model to link visual evidence to plausible defect mechanisms before classification, producing structured diagnostic reports.
Result: Achieves 93% classification accuracy on 1,927 real-world modules spanning eight defect categories, produces interpretable diagnostic rationales, maintains robustness under realistic image corruptions, and shows strong semantic alignment with expert assessments in blind concordance studies.
Conclusion: Reasoning-aware multimodal learning establishes a general paradigm for trustworthy AI-assisted inspection of photovoltaic energy infrastructure by combining visual analysis with diagnostic reasoning.
Abstract: Reliable photovoltaic defect identification is essential for maintaining energy yield, ensuring warranty compliance, and enabling scalable inspection of rapidly expanding solar fleets. Although recent advances in computer vision have improved automated defect detection, most existing systems operate as opaque classifiers that provide limited diagnostic insight for high-stakes energy infrastructure. Here we introduce REVL-PV, a vision-language framework that embeds domain-specific diagnostic reasoning into multimodal learning across electroluminescence, thermal, and visible-light imagery. By requiring the model to link visual evidence to plausible defect mechanisms before classification, the framework produces structured diagnostic reports aligned with professional photovoltaic inspection practice. Evaluated on 1,927 real-world modules spanning eight defect categories, REVL-PV achieves 93% classification accuracy while producing interpretable diagnostic rationales and maintaining strong robustness under realistic image corruptions. A blind concordance study with a certified solar inspection expert shows strong semantic alignment between model explanations and expert assessments across defect identification, root-cause attribution, and visual descriptions. These results demonstrate that reasoning-aware multimodal learning establishes a general paradigm for trustworthy AI-assisted inspection of photovoltaic energy infrastructure.
[188] BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting
Renbo Tu, Ali SaraerToosi, Nicholas S. Conroy, Gennady Pekhimenko, Aviad Levis
Main category: cs.CV
TL;DR: BHCast is a neural framework that forecasts black hole plasma dynamics from single blurry snapshots, enabling super-resolution, temporal evolution, and extraction of interpretable features for black hole property inference.
Details
Motivation: The Event Horizon Telescope captures static images of black holes but not their dynamics. Simulations are computationally expensive and impractical for inference, creating a bottleneck in understanding black hole accretion dynamics from limited observational data.
Method: BHCast uses a neural model to transform static images into forecasted future frames with multi-scale pyramid loss for autoregressive forecasting. It extracts spatio-temporal features (pattern speed, pitch angle) and uses gradient-boosting trees to recover black hole properties from these features.
Result: The framework successfully forecasts dynamics from blurry frames, enabling stable long-term movie generation and accurate extraction of black hole properties (spin and viewing inclination) from simulated and real EHT data for Sagittarius A* and M87*.
Conclusion: BHCast establishes a scalable paradigm for solving inverse problems in astrophysics by using learned dynamics to extract insights from resolution-limited data, with modular design enabling interpretability and uncertainty quantification.
Abstract: The Event Horizon Telescope (EHT) delivered the first image of a black hole by capturing the light from its surrounding accretion flow, revealing structure but not dynamics. Simulations of black hole accretion dynamics are essential for interpreting EHT images but costly to generate and impractical for inference. Motivated by this bottleneck, BHCast presents a framework for forecasting black hole plasma dynamics from a single, blurry snapshot, such as those captured by the EHT. At its core, BHCast is a neural model that transforms a static image into forecasted future frames, revealing the underlying dynamics hidden within one snapshot. With a multi-scale pyramid loss, we demonstrate how autoregressive forecasting can simultaneously super-resolve and evolve a blurry frame into a coherent, high-resolution movie that remains stable over long time horizons. From forecasted dynamics, we can then extract interpretable spatio-temporal features, such as pattern speed (rotation rate) and pitch angle. Finally, BHCast uses gradient-boosting trees to recover black hole properties from these plasma features, including the spin and viewing inclination angle. The separation between forecasting and inference provides modular flexibility, interpretability, and robust uncertainty quantification. We demonstrate the effectiveness of BHCast on simulations of two distinct black hole accretion systems, Sagittarius A* and M87*, by testing on simulated frames blurred to EHT resolution and real EHT images of M87*. Ultimately, our methodology establishes a scalable paradigm for solving inverse problems, demonstrating the potential of learned dynamics to unlock insights from resolution-limited scientific data.
[189] Limits of Imagery Reasoning in Frontier LLM Models
Sergio Y. Hayashi, Nina S. T. Hirata
Main category: cs.CV
TL;DR: LLMs struggle with spatial tasks like mental rotation; adding an external Imagery Module for 3D rendering helps but still fails, revealing models lack foundational visual-spatial primitives.
Details
Motivation: LLMs show impressive reasoning but fail at spatial tasks requiring mental simulation (like mental rotation). The paper investigates whether adding an external "Imagery Module" (a tool for rendering/rotating 3D models) can act as a cognitive prosthetic to bridge this gap.
Method: Used a dual-module architecture: a reasoning module (MLLM) interacts with an imagery module on 3D model rotation tasks. The imagery module handles rendering and rotating 3D models, outsourcing the burden of maintaining/manipulating holistic 3D states.
Result: Performance was lower than expected, with accuracy reaching at most 62.5%. Even with the imagery module handling 3D manipulation, the system still fails, indicating current frontier models lack foundational visual-spatial primitives.
Conclusion: Current MLLMs lack essential visual-spatial capabilities: (1) low-level sensitivity to extract spatial signals (depth, motion, short-horizon dynamic prediction), and (2) capacity for contemplative reasoning over images with dynamic visual focus and integration of imagery with symbolic/associative information.
Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external "Imagery Module" -- a tool capable of rendering and rotating 3D models -- can bridge this gap, functioning as a "cognitive prosthetic." We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery. Specifically, they lack: (1) the low-level sensitivity to extract spatial signals such as (a) depth, (b) motion, and (c) short-horizon dynamic prediction; and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information.
[190] RatSeizure: A Benchmark and Saliency-Context Transformer for Rat Seizure Localization
Ting Yu Tsai, An Yu, Lucy Lee, Felix X. -F. Ye, Damian S. Shin, Tzu-Jen Kao, Xin Li, Ming-Ching Chang
Main category: cs.CV
TL;DR: RatSeizure dataset for fine-grained seizure behavior analysis in rats with temporal annotations, plus RaSeformer Transformer model for temporal action localization.
Details
Motivation: Animal seizure research lacks datasets with precise temporal annotations and standardized evaluation protocols, limiting progress in studying epileptogenesis and treatment response.
Method: Introduce RatSeizure dataset with recorded clips annotated with seizure-related action units and temporal boundaries. Propose RaSeformer, a saliency-context Transformer for temporal action localization that highlights behavior-relevant context while suppressing redundant cues.
Result: RaSeformer achieves strong performance on RatSeizure dataset and provides a competitive reference model. Established standardized dataset splits and evaluation protocols for reproducible benchmarking.
Conclusion: RatSeizure addresses critical limitations in animal seizure research by providing fine-grained temporal annotations and standardized evaluation, enabling better behavior analysis and temporal localization.
Abstract: Animal models, particularly rats, play a critical role in seizure research for studying epileptogenesis and treatment response. However, progress is limited by the lack of datasets with precise temporal annotations and standardized evaluation protocols. Existing animal behavior datasets often have limited accessibility, coarse labeling, and insufficient temporal localization of clinically meaningful events. To address these limitations, we introduce RatSeizure, the first publicly available benchmark for fine-grained seizure behavior analysis. The dataset consists of recorded clips annotated with seizure-related action units and temporal boundaries, enabling both behavior classification and temporal localization. We further propose RaSeformer, a saliency-context Transformer for temporal action localization that highlights behavior-relevant context while suppressing redundant cues. Experiments on RatSeizure show that RaSeformer achieves strong performance and provides a competitive reference model for this challenging task. We also establish standardized dataset splits and evaluation protocols to support reproducible benchmarking.
[191] Can We Change the Stroke Size for Easier Diffusion?
Yunwei Bai, Ying Kiat Tan, Yao Shu, Tsuhan Chen
Main category: cs.CV
TL;DR: The paper proposes stroke-size control for diffusion models to address challenges in low signal-to-noise regimes by adjusting the effective roughness of targets, predictions, and perturbations across timesteps.
Details
Motivation: Diffusion models struggle in low signal-to-noise regimes where they need to make precise pixel-level predictions despite high noise levels. The authors draw an analogy to using fine strokes for oil painting throughout the entire process, which can be ineffective.
Method: The paper introduces stroke-size control as a controlled intervention that modifies the effective roughness of supervised targets, predictions, and perturbations across different timesteps. This approach aims to ease the low signal-to-noise challenge by adjusting the granularity of operations.
Result: The authors analyze the advantages and trade-offs of this intervention both theoretically and empirically, though specific quantitative results are not provided in the abstract.
Conclusion: Stroke-size control offers a promising approach to improve diffusion model performance in challenging low signal-to-noise conditions by adapting the granularity of operations throughout the denoising process.
Abstract: Diffusion models can be challenged in the low signal-to-noise regime, where they have to make pixel-level predictions despite the presence of high noise. The geometric intuition is akin to using the finest stroke for oil painting throughout, which may be ineffective. We therefore study stroke-size control as a controlled intervention that changes the effective roughness of the supervised target, predictions and perturbations across timesteps, in an attempt to ease the low signal-to-noise challenge. We analyze the advantages and trade-offs of the intervention both theoretically and empirically. Code will be released.
[192] HighlightBench: Benchmarking Markup-Driven Table Reasoning in Scientific Documents
Lexin Wang, Shenghua Liu, Yiwei Wang, Yujun Cai, Yuyao Ge, Jiayu Yao, Jiafeng Guo, Xueqi Cheng
Main category: cs.CV
TL;DR: HighlightBench: A diagnostic benchmark for evaluating multimodal LLMs’ ability to understand and reason with visual markups (highlights, underlines, bold text) in table-centric documents, with five task families to decompose markup-conditioned behavior.
Details
Motivation: Visual markups are common in table-centric documents but MLLMs' ability to treat them as explicit logical directives is under-explored. Existing evaluations cannot distinguish whether models fail to see markup or fail to reason with it, creating a blind spot in assessing markup-conditioned behavior over tables.
Method: Introduces HighlightBench, a diagnostic benchmark with five task families: Markup Grounding, Constrained Retrieval, Local Relations, Aggregation & Comparison, and Consistency & Missingness. Provides a reference pipeline that makes intermediate decisions explicit for reproducible baselines and finer-grained error attribution.
Result: Experiments show that even strong models remain unstable when visual cues must be consistently aligned with symbolic reasoning under structured output constraints.
Conclusion: The benchmark addresses a critical gap in evaluating MLLMs’ markup understanding capabilities, enabling better assessment of perception vs. reasoning failures in table-centric document analysis.
Abstract: Visual markups such as highlights, underlines, and bold text are common in table-centric documents. Although multimodal large language models (MLLMs) have made substantial progress in document understanding, their ability to treat such cues as explicit logical directives remains under-explored. More importantly, existing evaluations cannot distinguish whether a model fails to see the markup or fails to reason with it. This creates a key blind spot in assessing markup-conditioned behavior over tables. To address this gap, we introduce HighlightBench, a diagnostic benchmark for markup-driven table understanding that decomposes evaluation into five task families: Markup Grounding, Constrained Retrieval, Local Relations, Aggregation & Comparison, and Consistency & Missingness. We further provide a reference pipeline that makes intermediate decisions explicit, enabling reproducible baselines and finer-grained attribution of errors along the perception-to-execution chain. Experiments show that even strong models remain unstable when visual cues must be consistently aligned with symbolic reasoning under structured output constraints.
[193] Brain-Inspired Multimodal Spiking Neural Network for Image-Text Retrieval
Xintao Zong, Xian Zhong, Wenxuan Liu, Jianhao Ding, Zhaofei Yu, Tiejun Huang
Main category: cs.CV
TL;DR: A brain-inspired Cross-Modal Spike Fusion network (CMSF) for image-text retrieval using spiking neural networks that achieves state-of-the-art accuracy with low energy consumption and high speed.
Details
Motivation: Existing ANN-based multimodal methods focus on deeper architectures but overlook cross-modal interaction, retrieval latency, and energy efficiency. SNNs show potential for unimodal tasks but building directly trained, low-energy, high-performance SNNs for multimodal applications remains challenging.
Method: Proposes a spike fusion mechanism that integrates unimodal features at the spike level to generate enhanced multimodal representations, which act as soft supervisory signals to refine unimodal spike embeddings, mitigating semantic loss within the CMSF network.
Result: Despite requiring only two time steps, CMSF achieves top-tier retrieval accuracy, surpassing state-of-the-art ANN counterparts while maintaining exceptionally low energy consumption and high retrieval speed.
Conclusion: This work represents a significant step toward multimodal SNNs, offering a brain-inspired framework that unifies temporal dynamics with cross-modal alignment and provides new insights for future spiking-based multimodal research.
Abstract: Spiking neural networks (SNNs) have recently shown strong potential in unimodal visual and textual tasks, yet building a directly trained, low-energy, and high-performance SNN for multimodal applications such as image-text retrieval (ITR) remains highly challenging. Existing artificial neural network (ANN)-based methods often pursue richer unimodal semantics using deeper and more complex architectures, while overlooking cross-modal interaction, retrieval latency, and energy efficiency. To address these limitations, we present a brain-inspired Cross-Modal Spike Fusion network (CMSF) and apply it to ITR for the first time. The proposed spike fusion mechanism integrates unimodal features at the spike level, generating enhanced multimodal representations that act as soft supervisory signals to refine unimodal spike embeddings, effectively mitigating semantic loss within CMSF. Despite requiring only two time steps, CMSF achieves top-tier retrieval accuracy, surpassing state-of-the-art ANN counterparts while maintaining exceptionally low energy consumption and high retrieval speed. This work marks a significant step toward multimodal SNNs, offering a brain-inspired framework that unifies temporal dynamics with cross-modal alignment and provides new insights for future spiking-based multimodal research. The code is available at https://github.com/zxt6174/CMSF.
[194] Confidence Matters: Uncertainty Quantification and Precision Assessment of Deep Learning-based CMR Biomarker Estimates Using Scan-rescan Data
Dewmini Hasara Wickremasinghe, Michelle Gibogwe, Andrew Bell, Esther Puyol-Antón, Muhummad Sohaib Nazir, Reza Razavi, Bruno Paun, Paul Aljabar, Andrew P. King
Main category: cs.CV
TL;DR: Deep learning for cardiac MRI analysis shows good accuracy but poor precision when using distribution-based metrics, highlighting the need for better assessment of scan-rescan agreement.
Details
Motivation: Current DL methods for cine cardiovascular MRI analysis are typically assessed only for accuracy, overlooking precision. The authors aim to evaluate both accuracy and precision using uncertainty estimation techniques and propose new distribution-based metrics.
Method: Applied uncertainty estimation techniques (deep ensemble, test-time augmentation, Monte Carlo dropout) to a state-of-the-art DL pipeline for cardiac functional biomarker estimation. Proposed new distribution-based metrics for assessing biomarker precision and evaluated on two external validation scan-rescan CMR datasets.
Result: Model achieved high accuracy (average Dice 87%) and good point estimate precision. However, distribution-based metrics showed poor scan-rescan agreement: confidence interval overlap >50% in less than 45% of cases, and statistical tests showed significant differences in over 65% of cases.
Conclusion: While point estimate metrics suggest good performance, distributional analyses reveal lower precision, highlighting the need to use more representative metrics to assess scan-rescan agreement in medical imaging applications.
Abstract: The performance of deep learning (DL) methods for the analysis of cine cardiovascular magnetic resonance (CMR) is typically assessed in terms of accuracy, overlooking precision. In this work, uncertainty estimation techniques, namely deep ensemble, test-time augmentation, and Monte Carlo dropout, are applied to a state-of-the-art DL pipeline for cardiac functional biomarker estimation, and new distribution-based metrics are proposed for the assessment of biomarker precision. The model achieved high accuracy (average Dice 87%) and point estimate precision on two external validation scan-rescan CMR datasets. However, distribution-based metrics showed that the overlap between scan/rescan confidence intervals was >50% in less than 45% of the cases. Statistical similarity tests between scan and rescan biomarkers also resulted in significant differences for over 65% of the cases. We conclude that, while point estimate metrics might suggest good performance, distributional analyses reveal lower precision, highlighting the need to use more representative metrics to assess scan-rescan agreement.
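The scan-rescan agreement criterion compares confidence intervals from repeated acquisitions. The paper does not spell out its exact overlap definition; one plausible formulation (overlap length relative to the shorter interval — a hypothetical choice here, not necessarily the authors') can be sketched as:

```python
def interval_overlap_fraction(ci_scan, ci_rescan):
    """Fraction of the shorter confidence interval covered by the overlap
    of two intervals, each given as (low, high). Returns 0.0 if disjoint."""
    lo = max(ci_scan[0], ci_rescan[0])
    hi = min(ci_scan[1], ci_rescan[1])
    if hi <= lo:
        return 0.0
    shorter = min(ci_scan[1] - ci_scan[0], ci_rescan[1] - ci_rescan[0])
    return (hi - lo) / shorter

# The paper's agreement check would then count cases where overlap exceeds 50%
def agrees(ci_scan, ci_rescan, threshold=0.5):
    return interval_overlap_fraction(ci_scan, ci_rescan) > threshold
```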
[195] Elucidating the Design Space of Flow Matching for Cellular Microscopy
Charles Jones, Emmanuel Noutahi, Jason Hartford, Cian Eastwood
Main category: cs.CV
TL;DR: Systematic analysis of flow-matching design space for cell microscopy images reveals many popular techniques are unnecessary or harmful, leading to a simple, stable recipe that scales 100x larger than prior methods with significant quality improvements.
Details
Motivation: Flow-matching generative models are increasingly used to simulate cell responses to biological perturbations, but the design space for building such models is large and underexplored, with many popular techniques potentially unnecessary or detrimental to performance.
Method: Systematic analysis of flow-matching design space for cell-microscopy images, identifying unnecessary techniques and developing a simple, stable, and scalable recipe for training foundation models, then fine-tuning with pre-trained molecular embeddings.
Result: Scaled model to two orders of magnitude larger than prior methods, achieving two-fold FID and ten-fold KID improvement over prior methods, with state-of-the-art performance simulating responses to unseen molecules after fine-tuning.
Conclusion: Many popular flow-matching techniques for cell microscopy are unnecessary or harmful, and a simpler, more scalable approach yields significantly better generative performance for simulating cellular responses to perturbations.
Abstract: Flow-matching generative models are increasingly used to simulate cell responses to biological perturbations. However, the design space for building such models is large and underexplored. We systematically analyse the design space of flow matching models for cell-microscopy images, finding that many popular techniques are unnecessary and can even hurt performance. We develop a simple, stable, and scalable recipe which we use to train our foundation model. We scale our model to two orders of magnitude larger than prior methods, achieving a two-fold FID and ten-fold KID improvement over prior methods. We then fine-tune our model with pre-trained molecular embeddings to achieve state-of-the-art performance simulating responses to unseen molecules. Code is available at https://github.com/valence-labs/microscopy-flow-matching
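Flow matching itself reduces to a simple regression target. A minimal sketch of the generic linear-interpolant ("rectified flow") objective such models train on, with illustrative shapes — the paper's specific recipe and architecture are not reproduced here:

```python
# Illustrative sketch: the generic flow-matching training pair.
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Linear interpolant x_t and the velocity target the network regresses."""
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight path noise -> data
    v_target = x1 - x0              # constant velocity along that path
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))    # noise sample (batch of 4, dim 8)
x1 = rng.standard_normal((4, 8))    # stand-in for an image latent
x_t, v = flow_matching_pair(x0, x1, 0.3)
# A network v_theta(x_t, t) would be trained with MSE against v.
```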
[196] PhyDCM: A Reproducible Open-Source Framework for AI-Assisted Brain Tumor Classification from Multi-Sequence MRI
Hayder Saad Abdulbaqi, Mohammed Hadi Rahim, Mohammed Hassan Hadi, Haider Ali Aboud, Ali Hussein Allawi
Main category: cs.CV
TL;DR: PhyDCM is an open-source software framework for MRI-based brain tumor classification that integrates MedViT hybrid architecture with DICOM processing and desktop visualization, achieving over 93% accuracy.
Details
Motivation: Address challenges in MRI-based brain tumor detection due to growing data volume and limitations of existing deep learning solutions that are confined to closed architectures, limiting reproducibility and academic development.
Method: Develops PhyDCM as a modular open-source framework with hybrid classification architecture based on MedViT, standardized DICOM processing, interactive desktop visualization interface, and standardized preprocessing including intensity rescaling and limited data augmentation.
Result: Achieves over 93% classification accuracy across categories on MRI datasets from BRISC2025 and curated Kaggle collections (FigShare, SARTAJ, and Br35H), with stable diagnostic performance.
Conclusion: PhyDCM provides a practical foundation for reproducible AI-driven medical image analysis with transparency, modularity, and accessibility, offering flexibility for future integration of additional imaging modalities.
Abstract: MRI-based medical imaging has become indispensable in modern clinical diagnosis, particularly for brain tumor detection. However, the rapid growth in data volume poses challenges for conventional diagnostic approaches. Although deep learning has shown strong performance in automated classification, many existing solutions are confined to closed technical architectures, limiting reproducibility and further academic development. PhyDCM is introduced as an open-source software framework that integrates a hybrid classification architecture based on MedViT with standardized DICOM processing and an interactive desktop visualization interface. The system is designed as a modular digital library that separates computational logic from the graphical interface, allowing independent modification and extension of components. Standardized preprocessing, including intensity rescaling and limited data augmentation, ensures consistency across varying MRI acquisition settings. Experimental evaluation on MRI datasets from BRISC2025 and curated Kaggle collections (FigShare, SARTAJ, and Br35H) demonstrates stable diagnostic performance, achieving over 93% classification accuracy across categories. The framework supports structured, exportable outputs and multi-planar reconstruction of volumetric data. By emphasizing transparency, modularity, and accessibility, PhyDCM provides a practical foundation for reproducible AI-driven medical image analysis, with flexibility for future integration of additional imaging modalities.
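The intensity-rescaling step can be sketched as percentile-based normalization, a common way to make MRI intensities consistent across acquisition settings; the (1, 99) percentile choice below is a typical default, not necessarily PhyDCM's exact setting.

```python
# Illustrative sketch: percentile-clipped intensity rescaling to [0, 1].
import numpy as np

def rescale_percentile(img, lo_pct=1.0, hi_pct=99.0):
    """Clip to the [lo_pct, hi_pct] percentiles, then scale to [0, 1]."""
    lo, hi = np.percentile(img, [lo_pct, hi_pct])
    return np.clip((img - lo) / (hi - lo + 1e-8), 0.0, 1.0)

# Fake MRI slice with a heavy-tailed intensity distribution.
img = np.random.default_rng(0).gamma(2.0, 200.0, size=(64, 64))
out = rescale_percentile(img)
```

Clipping at percentiles rather than the raw min/max keeps a few hot pixels from compressing the useful intensity range.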
[197] The Language of Touch: Translating Vibrations into Text with Dual-Branch Learning
Jin Chen, Yifeng Lin, Chao Zeng, Si Wu, Tiesong Zhao
Main category: cs.CV
TL;DR: ViPAC is a novel method for vibrotactile captioning that generates natural language descriptions from vibrotactile signals, addressing the challenge of semantic interpretation in tactile data.
Details
Motivation: While vibrotactile data standardization has advanced applications in VR, HCI, and embodied AI, semantic interpretation of vibrotactile signals remains an unresolved challenge. The paper aims to address vibrotactile captioning for the first time.
Method: ViPAC uses a dual-branch strategy to disentangle periodic and aperiodic components of vibrotactile signals, with dynamic fusion mechanism, orthogonality constraint, and weighting regularization. Also created LMT108-CAP dataset using GPT-4o for vibrotactile-text pairs.
Result: ViPAC significantly outperforms baseline methods adapted from audio and image captioning, achieving superior lexical fidelity and semantic alignment in generating natural language descriptions from vibrotactile signals.
Conclusion: The paper successfully addresses vibrotactile captioning for the first time, proposing an effective method that handles the intrinsic properties of vibrotactile data and demonstrating its superiority over adapted baselines.
Abstract: The standardization of vibrotactile data by IEEE P1918.1 workgroup has greatly advanced its applications in virtual reality, human-computer interaction and embodied artificial intelligence. Despite these efforts, the semantic interpretation and understanding of vibrotactile signals remain an unresolved challenge. In this paper, we make the first attempt to address vibrotactile captioning, i.e., generating natural language descriptions from vibrotactile signals. We propose Vibrotactile Periodic-Aperiodic Captioning (ViPAC), a method designed to handle the intrinsic properties of vibrotactile data, including hybrid periodic-aperiodic structures and the lack of spatial semantics. Specifically, ViPAC employs a dual-branch strategy to disentangle periodic and aperiodic components, combined with a dynamic fusion mechanism that adaptively integrates signal features. It also introduces an orthogonality constraint and weighting regularization to ensure feature complementarity and fusion consistency. Additionally, we construct LMT108-CAP, the first vibrotactile-text paired dataset, using GPT-4o to generate five constrained captions per surface image from the popular LMT-108 dataset. Experiments show that ViPAC significantly outperforms the baseline methods adapted from audio and image captioning, achieving superior lexical fidelity and semantic alignment.
[198] Implicit neural representations for larval zebrafish brain microscopy: a reproducible benchmark on the MapZebrain atlas
Agnieszka Pregowska
Main category: cs.CV
TL;DR: Benchmark comparing implicit neural representations (INRs) for larval zebrafish brain atlas registration, showing Fourier and Haar encodings outperform SIREN and grid methods in preserving neuroanatomical boundaries.
Details
Motivation: Lack of reproducible evaluation for high-resolution larval zebrafish microscopy where preserving neuropil boundaries and fine neuronal processes is critical for atlas registration, cross-modality resampling, and data sharing.
Method: Unified seed-controlled protocol comparing SIREN, Fourier features, Haar positional encoding, and multi-resolution grid on 950 grayscale microscopy images from MapZebrain atlas. Images normalized with per-image percentiles, spatial generalization tested with 40% column-wise hold-out along X-axis.
Result: Haar and Fourier achieved strongest reconstruction fidelity (~26 dB) on held-out columns, better preserving boundaries according to SSIM and edge-focused metrics. SIREN performed worse in macro averages but remained competitive on micro averages; the grid method was moderately behind.
Conclusion: Explicit spectral and multiscale encodings (Haar, Fourier) better capture high-frequency neuroanatomical detail than smoother alternatives, making them best for boundary-sensitive tasks like atlas registration and label transfer, while SIREN remains suitable for background modeling.
Abstract: Implicit neural representations (INRs) offer continuous coordinate-based encodings for atlas registration, cross-modality resampling, sparse-view completion, and compact sharing of neuroanatomical data. Yet reproducible evaluation is lacking for high-resolution larval zebrafish microscopy, where preserving neuropil boundaries and fine neuronal processes is critical. We present a reproducible INR benchmark for the MapZebrain larval zebrafish brain atlas. Using a unified, seed-controlled protocol, we compare SIREN, Fourier features, Haar positional encoding, and a multi-resolution grid on 950 grayscale microscopy images, including atlas slices and single-neuron projections. Images are normalized with per-image (1,99) percentiles estimated from 10% of pixels in non-held-out columns, and spatial generalization is tested with a deterministic 40% column-wise hold-out along the X-axis. Haar and Fourier achieve the strongest macro-averaged reconstruction fidelity on held-out columns (about 26 dB), while the grid is moderately behind. SIREN performs worse in macro averages but remains competitive on area-weighted micro averages in the all-in-one regime. SSIM and edge-focused error further show that Haar and Fourier preserve boundaries more accurately. These results indicate that explicit spectral and multiscale encodings better capture high-frequency neuroanatomical detail than smoother-bias alternatives. For MapZebrain workflows, Haar and Fourier are best suited to boundary-sensitive tasks such as atlas registration, label transfer, and morphology-preserving sharing, while SIREN remains a lightweight baseline for background modelling or denoising.
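Of the encodings compared, Fourier features are the simplest to sketch: coordinates are lifted into sin/cos pairs at increasing frequencies before entering the MLP. The octave frequencies and count below are illustrative defaults, not the benchmark's exact settings.

```python
# Illustrative sketch: Fourier-feature positional encoding for an INR input.
import numpy as np

def fourier_features(coords, num_freqs=6):
    """Map (N, d) coordinates in [0, 1] to sin/cos features at octave frequencies."""
    feats = []
    for k in range(num_freqs):
        w = (2.0 ** k) * np.pi
        feats.append(np.sin(w * coords))
        feats.append(np.cos(w * coords))
    return np.concatenate(feats, axis=-1)   # shape (N, 2 * num_freqs * d)

coords = np.array([[0.25, 0.75]])           # one 2-D pixel coordinate
enc = fourier_features(coords)
```

The high-frequency terms are what let the downstream MLP fit sharp neuroanatomical boundaries that a plain coordinate input would smooth over.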
[199] arg-VU: Affordance Reasoning with Physics-Aware 3D Geometry for Visual Understanding in Robotic Surgery
Nan Xiao, Yunxin Fan, Farong Wang, Fei Liu
Main category: cs.CV
TL;DR: arg-VU: A physics-aware affordance reasoning framework for surgical robotics that integrates 3D geometry tracking with constraint-based mechanical modeling to enable reliable affordance predictions in deformable surgical environments.
Details
Motivation: Affordance reasoning is crucial for linking perception to action in robotics, but remains underexplored in surgical robotics where tissues are highly deformable, compliant, and dynamically coupled with tool motion. Current approaches lack physical consistency and interpretability for deformable surgical environments.
Method: Uses 3D Gaussian Splatting (3DGS) for surgical scene reconstruction and temporally tracked surface representation. Integrates Extended Position-Based Dynamics (XPBD) with local deformation constraints to produce representative geometry points (RGPs) with anisotropic stiffness metrics. Incorporates robotic tool poses in SE(3) to compute rigidly induced displacements, deriving physics-aware compliance energy and positional agreement scores for affordance prediction.
Result: Experiments on surgical video datasets show arg-VU yields more stable, physically consistent, and interpretable affordance predictions than kinematic baselines. The framework demonstrates reliable affordance reasoning for deformable surgical environments and supports embodied robotic interaction.
Conclusion: Physics-aware geometric representations enable reliable affordance reasoning for deformable surgical environments, bridging the gap between perception and action in surgical robotics through physically consistent modeling of tissue-tool interactions.
Abstract: Affordance reasoning provides a principled link between perception and action, yet remains underexplored in surgical robotics, where tissues are highly deformable, compliant, and dynamically coupled with tool motion. We present arg-VU, a physics-aware affordance reasoning framework that integrates temporally consistent geometry tracking with constraint-induced mechanical modeling for surgical visual understanding. Surgical scenes are reconstructed using 3D Gaussian Splatting (3DGS) and converted into a temporally tracked surface representation. Extended Position-Based Dynamics (XPBD) embeds local deformation constraints and produces representative geometry points (RGPs) whose constraint sensitivities define anisotropic stiffness metrics capturing the local constraint-manifold geometry. Robotic tool poses in SE(3) are incorporated to compute rigidly induced displacements at RGPs, from which we derive two complementary measures: a physics-aware compliance energy that evaluates mechanical feasibility with respect to local deformation constraints, and a positional agreement score that captures motion alignment (as kinematic motion baseline). Experiments on surgical video datasets show that arg-VU yields more stable, physically consistent, and interpretable affordance predictions than kinematic baselines. These results demonstrate that physics-aware geometric representations enable reliable affordance reasoning for deformable surgical environments and support embodied robotic interaction.
[200] Envisioning global urban development with satellite imagery and generative AI
Kailai Sun, Yuebing Liang, Mingyi He, Yunhan Zheng, Alok Prakash, Shenhao Wang, Jinhua Zhao, Alex “Sandy” Pentland
Main category: cs.CV
TL;DR: Multimodal generative AI framework for creating realistic urban satellite imagery across 500 global metropolitan areas using text prompts and geospatial controls to envision sustainable urban development.
Details
Motivation: Urban development has traditionally been studied as a predictive task, but this fails to capture its generative nature. The paper aims to create a framework that can envision sustainable urban development globally by generating realistic urban imagery that reflects development goals.
Method: Developed a multimodal generative AI framework that integrates text prompts and geospatial controls to generate high-fidelity, diverse urban satellite imagery. The system learns from surrounding environments for urban redevelopment and encodes latent representations of urban form for cross-city style transfer.
Result: Successfully generates realistic urban images across 500 largest metropolitan areas worldwide. The framework enables style transfer of urban environments across global spatial networks and enhances downstream prediction tasks like carbon emission prediction. Human expert evaluation confirms generated images are comparable to real urban images.
Conclusion: The study presents innovative approaches for accelerated urban planning and supports scenario-based planning processes for worldwide cities through multimodal generative AI that can envision sustainable urban development.
Abstract: Urban development has been a defining force in human history, shaping cities for centuries. However, past studies mostly analyze such development as predictive tasks, failing to reflect its generative nature. Therefore, this study designs a multimodal generative AI framework to envision sustainable urban development at a global scale. By integrating prompts and geospatial controls, our framework can generate high-fidelity, diverse, and realistic urban satellite imagery across the 500 largest metropolitan areas worldwide. It enables users to specify urban development goals, creating new images that align with them while offering diverse scenarios whose appearance can be controlled with text prompts and geospatial constraints. It also facilitates urban redevelopment practices by learning from the surrounding environment. Beyond visual synthesis, we find that it encodes and interprets latent representations of urban form for global cross-city learning, successfully transferring styles of urban environments across a global spatial network. The latent representations can also enhance downstream prediction tasks such as carbon emission prediction. Further, human expert evaluation confirms that our generated urban images are comparable to real urban images. Overall, this study presents innovative approaches for accelerated urban planning and supports scenario-based planning processes for worldwide cities.
[201] Dual-View Optical Flow for 4D Micro-Expression Recognition - A Multi-Stream Fusion Attention Approach
Luu Tu Nguyen, Thi Bich Phuong Man, Vu Tram Anh Khuong, Thanh Ha Le, Thi Duyen Ngo
Main category: cs.CV
TL;DR: Dual-view optical flow approach for 4D micro-expression recognition using synchronized viewpoints, apex-frame detection, phase decomposition, and Triple-Stream MicroAttNet with fusion attention for improved performance on 4DME dataset.
Details
Motivation: Micro-expression recognition is challenging due to brief, low-intensity facial motions and high-dimensional 4D mesh data, requiring robust methods for affective computing applications.
Method: Uses dual-view optical flow from synchronized viewpoints, automatic apex-frame detection, sequence decomposition into onset-apex and apex-offset phases, and Triple-Stream MicroAttNet with fusion attention and squeeze-excitation blocks.
Result: Achieves macro-UF1 score of 0.536 on 4DME dataset, outperforming baseline by over 50% and securing first place in 4DMR IJCAI Workshop Challenge 2025; ablation shows fusion attention and SE components each contribute up to 3.6 UF1 points.
Conclusion: Dual-view, phase-aware optical flow with multi-stream fusion provides robust, interpretable solution for 4D micro-expression recognition, demonstrating effectiveness of multimodal feature integration.
Abstract: Micro-expression recognition is vital for affective computing but remains challenging due to the extremely brief, low-intensity facial motions involved and the high-dimensional nature of 4D mesh data. To address these challenges, we introduce a dual-view optical flow approach that simplifies mesh processing by capturing each micro-expression sequence from two synchronized viewpoints and computing optical flow to represent motion. Our pipeline begins with view separation and sequence-wise face cropping to ensure spatial consistency, followed by automatic apex-frame detection based on peak motion intensity in both views. We decompose each sequence into onset-apex and apex-offset phases, extracting horizontal, vertical, and magnitude flow channels for each phase. These are fed into our Triple-Stream MicroAttNet, which employs a fusion attention module to adaptively weight modality-specific features and a squeeze-and-excitation block to enhance magnitude representations. Training uses focal loss to mitigate class imbalance and the Adam optimizer with early stopping. Evaluated on the multi-label 4DME dataset, comprising 24 subjects and five emotion categories, in the 4DMR IJCAI Workshop Challenge 2025, our method achieves a macro-UF1 score of 0.536, outperforming the official baseline by over 50% and securing first place. Ablation studies confirm that both the fusion attention and SE components each contribute up to 3.6 points of UF1 gain. These results demonstrate that dual-view, phase-aware optical flow combined with multi-stream fusion yields a robust and interpretable solution for 4D micro-expression recognition.
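The three flow channels per phase described above are straightforward to form once a flow field exists. A minimal sketch with a synthetic field (the paper computes real optical flow between frames of each view):

```python
# Illustrative sketch: horizontal, vertical, and magnitude flow channels.
import numpy as np

def flow_channels(u, v):
    """Stack horizontal, vertical, and magnitude channels from a flow field."""
    mag = np.sqrt(u ** 2 + v ** 2)
    return np.stack([u, v, mag], axis=0)   # (3, H, W)

rng = np.random.default_rng(3)
u = rng.standard_normal((32, 32))          # horizontal displacement per pixel
v = rng.standard_normal((32, 32))          # vertical displacement per pixel
ch = flow_channels(u, v)
```

With two phases (onset-apex, apex-offset) and two views, each sequence yields several such stacks, one per network stream.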
[202] LACON: Training Text-to-Image Model from Uncurated Data
Zhiyang Liang, Ziyu Wan, Hongyu Liu, Dong Chen, Qiu Shen, Hao Zhu, Dongdong Chen
Main category: cs.CV
TL;DR: LACON is a training framework that repurposes low-quality data instead of filtering it out, using quality signals as explicit condition labels to train generative models across the full spectrum of data quality.
Details
Motivation: Current text-to-image generation relies on aggressively filtering low-quality data, assuming it's detrimental. The authors question whether this discarded data holds untapped potential and propose to exploit the full uncurated data distribution.
Method: LACON (Labeling-and-Conditioning) framework uses quality signals like aesthetic scores and watermark probabilities as explicit, quantitative condition labels. Instead of filtering bad data, it trains generative models to learn the full spectrum from bad to good quality, learning explicit boundaries between high and low-quality content.
Result: LACON achieves superior generation quality compared to baselines trained only on filtered data using the same compute budget, demonstrating the significant value of uncurated data that was previously discarded.
Conclusion: The discarded “bad” data in text-to-image generation contains valuable information that can be leveraged through explicit conditioning on quality signals, leading to better model performance without additional computational cost.
Abstract: The success of modern text-to-image generation is largely attributed to massive, high-quality datasets. Currently, these datasets are curated through a filter-first paradigm that aggressively discards low-quality raw data based on the assumption that it is detrimental to model performance. Is the discarded bad data truly useless, or does it hold untapped potential? In this work, we critically re-examine this question. We propose LACON (Labeling-and-Conditioning), a novel training framework that exploits the underlying uncurated data distribution. Instead of filtering, LACON re-purposes quality signals, such as aesthetic scores and watermark probabilities as explicit, quantitative condition labels. The generative model is then trained to learn the full spectrum of data quality, from bad to good. By learning the explicit boundary between high- and low-quality content, LACON achieves superior generation quality compared to baselines trained only on filtered data using the same compute budget, proving the significant value of uncurated data.
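The labeling-and-conditioning idea can be sketched as quantizing a quality signal into a condition label instead of thresholding samples away; the bin edges and function name below are illustrative assumptions, not the paper's settings.

```python
# Illustrative sketch: keep every sample, but attach a quantized quality label
# that the generator is conditioned on, instead of filtering by a threshold.
import numpy as np

def quality_label(aesthetic_score, edges=(4.0, 5.0, 6.0)):
    """Quantize a continuous aesthetic score into a condition label 0..len(edges)."""
    return int(np.searchsorted(edges, aesthetic_score))

samples = [3.2, 4.7, 5.5, 6.8]             # hypothetical raw aesthetic scores
labels = [quality_label(s) for s in samples]
# A filter-first pipeline with threshold 6.0 would discard three of the four
# samples; here all four are kept, and the label exposes the quality spectrum.
```

At sampling time, conditioning on the top label then steers generation toward the high-quality end of the learned distribution.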
[203] TTE-CAM: Built-in Class Activation Maps for Test-Time Explainability in Pretrained Black-Box CNNs
Kerol Djoumessi, Philipp Berens
Main category: cs.CV
TL;DR: TTE-CAM converts pretrained black-box CNNs into self-explainable models at test time by replacing classification heads with convolution-based modules, preserving performance while providing faithful explanations.
Details
Motivation: Medical image analysis requires interpretable models for clinical adoption, but current methods face trade-offs: post-hoc explanations are unfaithful approximations, while inherently interpretable models sacrifice predictive performance.
Method: Test-time framework that converts pretrained CNNs into self-explainable models via convolution-based replacement of classification heads, initialized from original weights to preserve performance.
Result: Preserves black-box predictive performance while delivering built-in faithful explanations competitive with post-hoc methods, both qualitatively and quantitatively.
Conclusion: TTE-CAM bridges the gap between interpretability and performance in medical image analysis, enabling clinical adoption of high-performing CNNs with faithful explanations.
Abstract: Convolutional neural networks (CNNs) achieve state-of-the-art performance in medical image analysis yet remain opaque, limiting adoption in high-stakes clinical settings. Existing approaches face a fundamental trade-off: post-hoc methods provide unfaithful approximate explanations, while inherently interpretable architectures are faithful but often sacrifice predictive performance. We introduce TTE-CAM, a test-time framework that bridges this gap by converting pretrained black-box CNNs into self-explainable models via a convolution-based replacement of their classification head, initialized from the original weights. The resulting model preserves black-box predictive performance while delivering built-in faithful explanations competitive with post-hoc methods, both qualitatively and quantitatively. The code is available at https://github.com/kdjoumessi/Test-Time-Explainability
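The head-replacement idea rests on a classic equivalence: a global-average-pool plus linear head equals a 1x1 convolution with the same weights followed by pooling, and the pre-pooling map is a class activation map. A numpy sketch of that equivalence (TTE-CAM's actual module may differ in detail):

```python
# Illustrative sketch: GAP + fc head vs. 1x1-conv head with the same weights.
import numpy as np

rng = np.random.default_rng(1)
feats = rng.standard_normal((16, 7, 7))     # backbone features (C, H, W)
W = rng.standard_normal((3, 16))            # pretrained fc weights (classes, C)
b = rng.standard_normal(3)

# Original black-box head: global average pool, then linear.
logits_fc = W @ feats.mean(axis=(1, 2)) + b

# Head replaced by a 1x1 conv with the same weights: per-pixel class scores.
cams = np.einsum('kc,chw->khw', W, feats)   # (classes, H, W) activation maps
logits_conv = cams.mean(axis=(1, 2)) + b    # pooling the maps recovers logits
```

Because the two heads are algebraically identical, the explanation comes "built in" without changing the model's predictions.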
[204] Computer Vision with a Superpixelation Camera
Sasidharan Mahalingam, Rachel Brown, Atul Ingle
Main category: cs.CV
TL;DR: SuperCam is a novel camera design that performs on-the-fly superpixel segmentation to compress image data for resource-constrained edge applications, outperforming existing superpixel algorithms in memory-limited scenarios.
Details
Motivation: Conventional cameras generate excessive data that's challenging to process in resource-constrained edge applications, with most captured data being redundant for downstream computer vision tasks. There's a need for adaptive camera designs that can compress data efficiently at the capture stage.
Method: Proposes SuperCam, a novel camera design that performs superpixel segmentation on the fly during image capture. The system adaptively processes captured data by grouping similar pixels into superpixels, significantly reducing data volume while preserving essential visual information.
Result: SuperCam outperforms current state-of-the-art superpixel algorithms in memory-constrained situations. The compressed data maintains high performance for downstream computer vision tasks including image segmentation, object detection, and monocular depth estimation when camera memory is limited.
Conclusion: Superpixel segmentation will be crucial for deploying computer vision models on edge devices. SuperCam enables more efficient system designs for resource-constrained applications by reducing data redundancy at the capture stage while maintaining task performance.
Abstract: Conventional cameras generate a lot of data that can be challenging to process in resource-constrained applications. Usually, cameras generate data streams on the order of the number of pixels in the image. However, most of this captured data is redundant for many downstream computer vision algorithms. We propose a novel camera design, which we call SuperCam, that adaptively processes captured data by performing superpixel segmentation on the fly. We show that SuperCam performs better than current state-of-the-art superpixel algorithms under memory-constrained situations. We also compare how well SuperCam performs when the compressed data is used for downstream computer vision tasks. Our results demonstrate that the proposed design provides superior output for image segmentation, object detection, and monocular depth estimation in situations where the available memory on the camera is limited. We posit that superpixel segmentation will play a crucial role as more computer vision inference models are deployed in edge devices. SuperCam would allow computer vision engineers to design more efficient systems for these applications.
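To see why superpixelation compresses so aggressively, consider the degenerate grid-initialized case (the starting point of SLIC-style methods): pixels are grouped into cells and each group is summarized by one mean value. SuperCam's on-camera algorithm adapts groups to image content; this grid version is only an illustration.

```python
# Illustrative sketch: grid-initialized pixel grouping and per-group means,
# the trivial starting point of SLIC-style superpixel segmentation.
import numpy as np

rng = np.random.default_rng(4)
img = rng.random((16, 16))
S = 4                                       # superpixel grid spacing
ys, xs = np.mgrid[0:16, 0:16]
# Seed index of each pixel: which S x S cell it falls into.
labels = (ys // S) * (16 // S) + (xs // S)
# Compressed representation: one mean intensity per superpixel.
means = np.array([img[labels == k].mean() for k in range(labels.max() + 1)])
compression = img.size / means.size         # 256 pixels -> 16 values
```

Adaptive methods refine these groups along image edges, which is what preserves the information needed for downstream segmentation and detection.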
[205] FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition
Jie Zhu, Xiao Guo, Yiyang Su, Anil Jain, Xiaoming Liu
Main category: cs.CV
TL;DR: FusionAgent is an agentic framework using MLLMs for dynamic model selection in whole-body human recognition, achieving better performance with fewer model invocations through reinforcement fine-tuning and confidence-aware score fusion.
Details
Motivation: Existing score-fusion strategies for whole-body human recognition are static and invoke all models for every test sample regardless of sample quality or modality reliability, leading to inefficiency and suboptimal performance.
Method: Proposes FusionAgent: an agentic framework using Multimodal Large Language Models (MLLMs) for dynamic, sample-specific model selection. Each expert model is treated as a tool, and through Reinforcement Fine-Tuning (RFT) with metric-based reward, the agent learns optimal model combinations. Introduces Anchor-based Confidence Top-k (ACT) score-fusion to address model score misalignment and embedding heterogeneity.
Result: Extensive experiments on multiple whole-body biometric benchmarks demonstrate that FusionAgent significantly outperforms state-of-the-art methods while achieving higher efficiency through fewer model invocations.
Conclusion: FusionAgent underscores the critical role of dynamic, explainable, and robust model fusion in real-world recognition systems, showing that adaptive model selection via MLLMs can improve both performance and efficiency.
Abstract: Model fusion is a key strategy for robust recognition in unconstrained scenarios, as different models provide complementary strengths. This is especially important for whole-body human recognition, where biometric cues such as face, gait, and body shape vary across samples and are typically integrated via score-fusion. However, existing score-fusion strategies are usually static, invoking all models for every test sample regardless of sample quality or modality reliability. To overcome these limitations, we propose FusionAgent, a novel agentic framework that leverages a Multimodal Large Language Model (MLLM) to perform dynamic, sample-specific model selection. Each expert model is treated as a tool, and through Reinforcement Fine-Tuning (RFT) with a metric-based reward, the agent learns to adaptively determine the optimal model combination for each test input. To address the model score misalignment and embedding heterogeneity, we introduce Anchor-based Confidence Top-k (ACT) score-fusion, which anchors on the most confident model and integrates complementary predictions in a confidence-aware manner. Extensive experiments on multiple whole-body biometric benchmarks demonstrate that FusionAgent significantly outperforms SoTA methods while achieving higher efficiency through fewer model invocations, underscoring the critical role of dynamic, explainable, and robust model fusion in real-world recognition systems. Project page: https://fusionagent.github.io/
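One plausible reading of the ACT fusion rule, sketched with made-up scores: anchor on the most confident model, keep the top-k models by confidence, and fuse their gallery scores with confidence-proportional weights. This illustrates the description only, not the paper's exact formulation.

```python
# Illustrative sketch: confidence-aware top-k score fusion in the spirit of ACT.
import numpy as np

def act_fusion(scores, confidences, k=2):
    """scores: (models, gallery) similarity scores; confidences: (models,)."""
    order = np.argsort(confidences)[::-1]           # most confident model first
    top = order[:k]                                  # anchor + next most confident
    w = confidences[top] / confidences[top].sum()    # confidence-aware weights
    return np.einsum('m,mg->g', w, scores[top])      # fused gallery scores

scores = np.array([[0.9, 0.1, 0.3],    # hypothetical face-model scores
                   [0.2, 0.8, 0.4],    # hypothetical gait-model scores
                   [0.5, 0.5, 0.5]])   # hypothetical body-shape scores
conf = np.array([0.9, 0.3, 0.6])
fused = act_fusion(scores, conf, k=2)
```

Here the low-confidence gait model is dropped entirely, so its disagreeing scores cannot flip the final match.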
[206] Live Interactive Training for Video Segmentation
Xinyu Yang, Haozheng Yu, Yihong Sun, Bharath Hariharan, Jennifer J. Sun
Main category: cs.CV
TL;DR: LIT-LoRA enables interactive video segmentation models to learn from user corrections during inference, reducing repetitive manual interventions by 18-34% through lightweight online adaptation.
Details
Motivation: Current interactive video segmentation models require repetitive user corrections for challenging scenarios without learning from feedback, leading to inefficient human effort. The authors aim to create systems that can learn online from human corrections to reduce redundant interventions.
Method: Introduces Live Interactive Training (LIT) framework where models learn from user corrections at inference time. LIT-LoRA implements this by continually updating lightweight LoRA modules on-the-fly when users provide corrections, allowing the vision system to improve on subsequent video frames.
Result: Achieves 18-34% reduction in total corrections on challenging video segmentation benchmarks with minimal training overhead (~0.5s per correction). Successfully adapts to other segmentation models and extends to CLIP-based fine-grained image classification.
Conclusion: Live adaptation can transform interactive visual tools by significantly reducing redundant human effort in complex visual tasks, demonstrating the promise of online learning from human feedback for vision systems.
Abstract: Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage, etc.). Yet, even state-of-the-art models like SAM2 use corrections only for immediate fixes without learning from this feedback, leading to inefficient, repetitive user effort. To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt-based visual systems where models also learn online from human corrections at inference time. Our primary instantiation, LIT-LoRA, implements this by continually updating a lightweight LoRA module on-the-fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve performance on subsequent frames of the same video. Leveraging the core principles of LIT, our LIT-LoRA implementation achieves an average 18-34% reduction in total corrections on challenging video segmentation benchmarks, with a negligible training overhead of ~0.5s per correction. We further demonstrate its generality by successfully adapting it to other segmentation models and extending it to CLIP-based fine-grained image classification. Our work highlights the promise of live adaptation to transform interactive tools and significantly reduce redundant human effort in complex visual tasks. Project: https://youngxinyu1802.github.io/projects/LIT/.
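The on-the-fly adapter update at the heart of LIT-LoRA can be illustrated with a toy linear model: a frozen base weight plus a small rank-1 adapter, trained with a few SGD steps on a single user correction. This is a hypothetical sketch of the mechanism only; the paper adapts a full segmentation model such as SAM2, and its training details are not reproduced here.

```python
# Toy sketch of LoRA-style online adaptation from a user correction:
# the base layer W stays frozen; only the rank-1 adapter B @ A is trained.

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def forward(W, B, A, x):
    # y = (W + B @ A) x, computed cheaply as W x + B (A x).
    ax = sum(A[0][j] * x[j] for j in range(len(x)))   # A x (scalar: rank 1)
    wx = matvec(W, x)
    return [wx[i] + B[i][0] * ax for i in range(len(wx))]

def lit_step(W, B, A, x, y_corrected, lr=0.1):
    """One online SGD step on 0.5*||y - y_corrected||^2 w.r.t. B and A only."""
    y = forward(W, B, A, x)
    err = [y[i] - y_corrected[i] for i in range(len(y))]
    ax = sum(A[0][j] * x[j] for j in range(len(x)))
    grad_B = [err[i] * ax for i in range(len(B))]
    err_dot_B = sum(err[i] * B[i][0] for i in range(len(B)))
    grad_A = [err_dot_B * x[j] for j in range(len(x))]
    for i in range(len(B)):                 # W is never touched (frozen)
        B[i][0] -= lr * grad_B[i]
    for j in range(len(A[0])):
        A[0][j] -= lr * grad_A[j]

W = [[1.0, 0.0], [0.0, 1.0]]          # frozen base layer
B, A = [[0.5], [0.5]], [[0.1, 0.1]]   # lightweight rank-1 adapter
x, target = [1.0, 2.0], [1.0, 2.0]    # user correction: output should equal input
before = forward(W, B, A, x)
for _ in range(50):                   # "rapidly trained on that feedback"
    lit_step(W, B, A, x, target)
after = forward(W, B, A, x)
```

Because only the tiny adapter is updated, each correction costs a fraction of a full fine-tuning step, which is the property behind the paper's ~0.5s-per-correction overhead.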
[207] Leveraging Avatar Fingerprinting: A Multi-Generator Photorealistic Talking-Head Public Database and Benchmark
Laura Pedrouzo-Rodriguez, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Roberto Daza, Aythami Morales, Julian Fierrez
Main category: cs.CV
TL;DR: AVAPrintDB: A new multi-generator talking-head avatar database for avatar fingerprinting research, addressing security concerns about identity impersonation in AI-mediated communication.
Details
Motivation: Address security concerns about identity impersonation in photorealistic avatar generation by creating a comprehensive database for avatar fingerprinting research, as current databases are scarce and based on outdated avatar generators.
Method: Created AVAPrintDB using two audiovisual corpora and three state-of-the-art avatar generators (GAGAvatar, LivePortrait, HunyuanPortrait) with different synthesis paradigms, including both self- and cross-reenactments. Established a standardized benchmark for avatar fingerprinting using public state-of-the-art systems and novel Foundation Model-based methods (DINOv2 and CLIP).
Result: Results show that identity-related motion cues persist across synthetic avatars, but current avatar fingerprinting systems remain highly sensitive to changes in synthesis pipeline and source domain. The database and benchmark protocols are publicly available.
Conclusion: AVAPrintDB provides a comprehensive resource for avatar fingerprinting research, highlighting both the persistence of identity cues in synthetic avatars and the sensitivity of current systems to technical variations, enabling reproducible research in this important security domain.
Abstract: Recent advances in photorealistic avatar generation have enabled highly realistic talking-head avatars, raising security concerns regarding identity impersonation in AI-mediated communication. To make progress on this challenging problem, the task of avatar fingerprinting aims to determine whether or not two avatar videos are driven by the same human operator. However, current public databases in the literature are scarce and based solely on old-fashioned talking-head avatar generators, not representing realistic scenarios for the current task of avatar fingerprinting. To overcome this situation, the present article introduces AVAPrintDB, a new publicly available multi-generator talking-head avatar database for avatar fingerprinting. AVAPrintDB is constructed from two audiovisual corpora and three state-of-the-art avatar generators (GAGAvatar, LivePortrait, HunyuanPortrait), representing different synthesis paradigms, and includes both self- and cross-reenactments to simulate legitimate usage and impersonation scenarios. Building on this database, we also define a standardized and reproducible benchmark for avatar fingerprinting, considering public state-of-the-art avatar fingerprinting systems and exploring novel methods based on Foundation Models (DINOv2 and CLIP). Also, we conduct a comprehensive analysis under generator and dataset shift. Our results show that, while identity-related motion cues persist across synthetic avatars, current avatar fingerprinting systems remain highly sensitive to changes in the synthesis pipeline and source domain. AVAPrintDB, the benchmark protocols, and the avatar fingerprinting systems are publicly available to facilitate reproducible research.
[208] From 3D Pose to Prose: Biomechanics-Grounded Vision–Language Coaching
Yuyang Ji, Yixuan Shen, Shengjie Zhu, Yu Kong, Feng Liu
Main category: cs.CV
TL;DR: BioCoach: A biomechanics-grounded vision-language framework for fitness coaching from video that fuses visual appearance and 3D skeletal kinematics through a three-stage pipeline with exercise-specific joint selection, biomechanical context, and conditioned feedback generation.
Details
Motivation: Current fitness coaching systems often lack biomechanical grounding and personalized reasoning, relying on pattern matching rather than explicit kinematics and constraints. There's a need for transparent, personalized fitness coaching that integrates visual appearance with biomechanical analysis.
Method: Three-stage pipeline: 1) Exercise-specific degree-of-freedom selector focuses on salient joints, 2) Structured biomechanical context pairs individualized morphometrics with cycle and constraint analysis, 3) Vision-biomechanics conditioned feedback module uses cross-attention to generate precise text. Uses parameter-efficient training with frozen vision and language backbones.
Result: BioCoach achieves clear gains on QEVD-bio-fit-coach dataset across lexical and judgment metrics while maintaining temporal triggering. On original QEVD-fit-coach, it improves text quality and correctness with near-parity timing, demonstrating that explicit kinematics and constraints are key to accurate coaching.
Conclusion: Explicit integration of kinematics and biomechanical constraints enables accurate, phase-aware fitness coaching. The framework provides transparent, personalized reasoning rather than pattern matching, showing the importance of biomechanical grounding in vision-language systems for fitness applications.
Abstract: We present BioCoach, a biomechanics-grounded vision–language framework for fitness coaching from streaming video. BioCoach fuses visual appearance and 3D skeletal kinematics, through a novel three-stage pipeline: an exercise-specific degree-of-freedom selector that focuses analysis on salient joints; a structured biomechanical context that pairs individualized morphometrics with cycle and constraint analysis; and a vision–biomechanics conditioned feedback module that applies cross-attention to generate precise, actionable text. Using parameter-efficient training that freezes the vision and language backbones, BioCoach yields transparent, personalized reasoning rather than pattern matching. To enable learning and fair evaluation, we augment QEVD-fit-coach with biomechanics-oriented feedback to create QEVD-bio-fit-coach, and we introduce a biomechanics-aware LLM judge metric. BioCoach delivers clear gains on QEVD-bio-fit-coach across lexical and judgment metrics while maintaining temporal triggering; on the original QEVD-fit-coach, it improves text quality and correctness with near-parity timing, demonstrating that explicit kinematics and constraints are key to accurate, phase-aware coaching.
[209] Real-time Appearance-based Gaze Estimation for Open Domains
Zhenhao Li, Zheng Liu, Seunghyun Lee, Amin Fadaeinejad, Yuanhao Yu
Main category: cs.CV
TL;DR: A robust appearance-based gaze estimation framework that improves generalization to challenging real-world conditions like facial wearables and poor lighting through data augmentation and multi-task learning, achieving SOTA performance with a lightweight model.
Details
Motivation: Existing gaze estimation models fail in practical unconstrained scenarios involving facial wearables and poor lighting due to limited image diversity and inconsistent label fidelity across datasets, especially along the pitch axis.
Method: Proposes a robust AGE framework with: 1) Ensemble augmentation techniques (eyeglasses, masks, varied lighting synthesis) to expand image manifold, and 2) Reformulates gaze regression as multi-task learning with multi-view supervised contrastive learning, discretized label classification, and eye-region segmentation as auxiliary objectives.
Result: MobileNet-based lightweight model achieves generalization performance competitive with SOTA UniGaze-H while using <1% of its parameters, enabling high-fidelity real-time gaze tracking on mobile devices. New benchmark datasets created to evaluate robustness under challenging conditions.
Conclusion: The proposed framework effectively addresses generalization challenges in gaze estimation through data augmentation and multi-task learning, enabling robust performance in practical scenarios with minimal computational requirements.
Abstract: Appearance-based gaze estimation (AGE) has achieved remarkable performance in constrained settings, yet we reveal a significant generalization gap where existing AGE models often fail in practical, unconstrained scenarios, particularly those involving facial wearables and poor lighting conditions. We attribute this failure to two core factors: limited image diversity and inconsistent label fidelity across different datasets, especially along the pitch axis. To address these, we propose a robust AGE framework that enhances generalization without requiring additional human-annotated data. First, we expand the image manifold via an ensemble of augmentation techniques, including synthesis of eyeglasses, masks, and varied lighting. Second, to mitigate the impact of anisotropic inter-dataset label deviation, we reformulate gaze regression as a multi-task learning problem, incorporating multi-view supervised contrastive (SupCon) learning, discretized label classification, and eye-region segmentation as auxiliary objectives. To rigorously validate our approach, we curate new benchmark datasets designed to evaluate gaze robustness under challenging conditions, a dimension largely overlooked by existing evaluation protocols. Our MobileNet-based lightweight model achieves generalization performance competitive with the state-of-the-art (SOTA) UniGaze-H, while utilizing less than 1% of its parameters, enabling high-fidelity, real-time gaze tracking on mobile devices.
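The "discretized label classification" auxiliary objective amounts to binning continuous gaze angles so a classification head can be trained alongside the regressor. The bin width and angular range below are illustrative choices of ours, not values from the paper:

```python
# Sketch of discretizing continuous gaze labels into class bins for an
# auxiliary classification objective (bin size and range are assumptions).

def discretize_gaze(pitch_deg, yaw_deg, bin_deg=3.0, lo=-42.0, hi=42.0):
    """Map continuous (pitch, yaw) angles in degrees to integer bin indices."""
    def to_bin(angle):
        angle = min(max(angle, lo), hi - 1e-9)   # clamp into supported range
        return int((angle - lo) // bin_deg)
    return to_bin(pitch_deg), to_bin(yaw_deg)

# An 84-degree range with 3-degree bins gives 28 classes per axis.
bins = discretize_gaze(0.0, -10.5)
```

During training, the cross-entropy loss on these bin labels would be added to the regression loss, giving the model a coarser but more noise-tolerant supervision signal.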
[210] Multimodal Deep Learning for Diabetic Foot Ulcer Staging Using Integrated RGB and Thermal Imaging
Gulengul Mermer, Mustafa Furkan Aksu, Gozde Ozsezer, Sevki Cetinkalp, Orhan Er, Mehmet Kemal Gullu
Main category: cs.CV
TL;DR: Multimodal deep learning combining RGB and thermal images improves diabetic foot ulcer stage classification using a portable imaging system
Details
Motivation: Diabetic foot ulcers are serious complications requiring regular monitoring; early diagnosis can reduce amputation risk and healthcare costs
Method: Developed Raspberry Pi-based portable imaging system to capture RGB+thermal images simultaneously; collected 1,205 samples labeled into 6 stages; trained DenseNet121, EfficientNetV2, InceptionV3, ResNet50, and VGG16 on three datasets: RGB-only, thermal-only, and RGB+Thermal (4 channels)
Result: Multimodal RGB+Thermal dataset outperformed single-modal approaches; VGG16 achieved best performance: 93.25% accuracy, 92.53% F1-score, 91.03% MCC; Grad-CAM showed thermal channel highlighted temperature anomalies while RGB provided structural/textural information
Conclusion: Multimodal imaging combining RGB and thermal data improves DFU stage classification; thermal imaging helps locate temperature anomalies while RGB provides complementary visual information
Abstract: Diabetic foot ulcers (DFU) are one of the serious complications of diabetes that can lead to amputations and high healthcare costs. Regular monitoring and early diagnosis are critical for reducing the clinical burden and the risk of amputation. The aim of this study is to investigate the impact of using multimodal images on deep learning models for the classification of DFU stages. To this end, we developed a Raspberry Pi-based portable imaging system capable of simultaneously capturing RGB and thermal images. Using this prototype, a dataset consisting of 1,205 samples was collected in a hospital setting. The dataset was labeled by experts into six distinct stages. To evaluate the models' performance, we prepared three different training sets: RGB-only, thermal-only, and RGB+Thermal (with the thermal image added as a fourth channel). We trained the DenseNet121, EfficientNetV2, InceptionV3, ResNet50, and VGG16 models on these training sets. The results show that the multimodal training dataset, in which RGB and thermal data are combined across four channels, outperforms single-modal approaches. The highest performance was observed in the VGG16 model trained on the RGB+Thermal dataset. The model achieved an accuracy of 93.25%, an F1-score of 92.53%, and an MCC of 91.03%. Grad-CAM heatmap visualizations demonstrated that the thermal channel helped the model focus on the correct location by highlighting temperature anomalies in the ulcer region, while the RGB channel supported the decision-making process with complementary structural and textural information.
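The four-channel input described above is a straightforward channel concatenation. A minimal sketch, with tiny illustrative nested-list "images" rather than the study's data:

```python
# Toy sketch of building the RGB+Thermal input: append the single-channel
# thermal image to the RGB image as a fourth channel (illustrative shapes).

def stack_rgb_thermal(rgb, thermal):
    """rgb: H x W x 3 nested lists; thermal: H x W. Returns H x W x 4."""
    assert len(rgb) == len(thermal) and len(rgb[0]) == len(thermal[0])
    return [[rgb[i][j] + [thermal[i][j]]      # copy pixel + thermal value
             for j in range(len(rgb[0]))]
            for i in range(len(rgb))]

rgb = [[[10, 20, 30], [40, 50, 60]]]          # 1 x 2 x 3 toy image
thermal = [[0.7, 0.9]]                        # 1 x 2 toy thermal map
x = stack_rgb_thermal(rgb, thermal)           # 1 x 2 x 4
```

In a real pipeline the main architectural change this requires is widening the network's first convolution to accept four input channels instead of three.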
[211] Beyond Mortality: Advancements in Post-Mortem Iris Recognition through Data Collection and Computer-Aided Forensic Examination
Rasel Ahmed Bhuiyan, Parisa Farmanifard, Renu Sharma, Andrey Kuehlkamp, Aidan Boyd, Patrick J Flynn, Kevin W Bowyer, Arun Ross, Dennis Chute, Adam Czajka
Main category: cs.CV
TL;DR: New dataset and forensic tool for post-mortem iris recognition with 259 subjects, largest PMI of 1,674 hours, and first before/after death case; evaluation of 5 iris recognition methods on 338 subjects; includes post-mortem detection model and open-source forensic tool with explainability.
Details
Motivation: Post-mortem iris recognition has forensic potential but faces barriers including difficult data collection and limited specialized approaches. The paper aims to address these challenges by providing new resources and tools for the community.
Method: 1) Collected new dataset of NIR and visible-light iris images from 259 deceased subjects with largest post-mortem interval of 1,674 hours, including first before/after death case; 2) Combined with public datasets to evaluate 5 iris recognition methods on 338 subjects; 3) Implemented model for detecting post-mortem iris images as presentation attacks; 4) Developed open-source forensic tool integrating three methods with explainability features.
Result: Provides comprehensive dataset and evaluation of state-of-the-art iris recognition methods on post-mortem data, including analysis of demographic factors’ influence on performance. Offers detection model for post-mortem iris images and open-source forensic tool with human-interpretable comparisons.
Conclusion: The paper makes significant contributions to post-mortem iris recognition by providing new datasets, evaluation benchmarks, detection methods, and practical forensic tools with explainability, advancing both research and practical applications in forensic biometrics.
Abstract: Post-mortem iris recognition brings both hope to the forensic community (a short-term but accurate and fast means of verifying identity) and concerns to society (its potential illicit use in post-mortem impersonation). These hopes and concerns have grown along with the volume of research in post-mortem iris recognition. Barriers to further progress in post-mortem iris recognition include the difficult nature of data collection, and the resulting small number of approaches designed specifically for comparing iris images of deceased subjects. This paper makes several unique contributions to mitigate these barriers. First, we have collected and now offer a new dataset of NIR (compliant with ISO/IEC 19794-6 where possible) and visible-light iris images collected after demise from 259 subjects, with the largest PMI (post-mortem interval) being 1,674 hours. For one subject, the data has been collected before and after death, the first such case ever published. Second, the collected dataset was combined with publicly-available post-mortem samples to assess the current state of the art in automatic forensic iris recognition with five iris recognition methods and data originating from 338 deceased subjects. These experiments include analyses of how selected demographic factors influence recognition performance. Third, this study implements a model for detecting post-mortem iris images, which can be considered presentation attacks. Finally, we offer an open-source forensic tool integrating three post-mortem iris recognition methods with explainability elements added to make the comparison process more human-interpretable.
[212] A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models
Mujtaba Hussain Mirza, Antonio D’Orazio, Odelia Melamed, Iacopo Masi
Main category: cs.CV
TL;DR: ET3 is a lightweight, training-free defense method that enhances multimodal model robustness by minimizing input energy during test-time transformations, applicable to classifiers, CLIP, and LVLMs.
Details
Motivation: Multimodal models and Large Visual-Language Models (LVLMs) are highly vulnerable to adversarial perturbations, raising reliability concerns for real-world applications. While adversarial training helps, test-time transformations offer a promising inference-time robustness strategy.
Method: Proposes Energy-Guided Test-Time Transformation (ET3), a training-free defense that minimizes the energy of input samples during inference. The method is theoretically grounded with proofs that the transformation succeeds in classification under reasonable assumptions.
Result: Extensive experiments show ET3 provides strong defense for classifiers, zero-shot classification with CLIP, and boosts robustness of LVLMs in tasks like Image Captioning and Visual Question Answering.
Conclusion: ET3 offers an effective, lightweight defense strategy for enhancing multimodal model robustness against adversarial attacks without requiring retraining, making it practical for real-world deployment.
Abstract: Despite the rapid progress in multimodal models and Large Visual-Language Models (LVLMs), they remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference. In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances robustness by minimizing the energy of the input samples. Our method is grounded in a theory that proves our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers, zero-shot classification with CLIP, and also for boosting the robustness of LVLMs in tasks such as Image Captioning and Visual Question Answering. Code is available at github.com/OmnAI-Lab/Energy-Guided-Test-Time-Defense.
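The core test-time mechanism, descending the gradient of an energy function of the input before classifying, can be sketched with a toy quadratic energy. The paper's actual energy model and theoretical assumptions differ; this only shows the gradient-descent purification loop:

```python
# Illustrative sketch of energy-guided test-time purification (not ET3
# itself). A toy energy E(x) = 0.5*||x - mu||^2 stands in for a learned
# energy model; descending it pulls a perturbed input back toward the
# low-energy (in-distribution) region before classification.

def purify(x, mu, steps=20, lr=0.2):
    x = list(x)
    for _ in range(steps):
        grad = [x[i] - mu[i] for i in range(len(x))]   # dE/dx for the toy E
        x = [x[i] - lr * grad[i] for i in range(len(x))]
    return x

mu = [0.0, 0.0]                 # toy clean-data mode (an assumption)
x_adv = [1.0, -1.0]             # adversarially perturbed input
x_pure = purify(x_adv, mu)      # transformed input fed to the frozen model
```

Because only the input is transformed, the defense needs no retraining of the classifier, CLIP, or the LVLM, which is what makes it applicable across all three settings.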
[213] GUIDED: Granular Understanding via Identification, Detection, and Discrimination for Fine-Grained Open-Vocabulary Object Detection
Jiaming Li, Zhijia Liang, Weikai Chen, Lin Ma, Guanbin Li
Main category: cs.CV
TL;DR: GUIDED is a decomposition framework for fine-grained open-vocabulary object detection that separates subject localization from attribute recognition to address semantic entanglement in vision-language models.
Details
Motivation: Existing open-vocabulary detectors underperform in fine-grained settings due to semantic entanglement of subjects and attributes in pretrained vision-language model embeddings, causing over-representation of attributes, mislocalization, and semantic drift.
Method: GUIDED decomposes fine-grained prompts into coarse-grained subjects and descriptive attributes using a language model. It uses subject embeddings for stable localization, an attribute embedding fusion module for selective attribute incorporation, and a region-level attribute discrimination module with refined vision-language alignment.
Result: Extensive experiments on FG-OVD and 3F-OVD benchmarks show GUIDED achieves new state-of-the-art results, demonstrating benefits of disentangled modeling and modular optimization.
Conclusion: The decomposition framework effectively addresses semantic entanglement in fine-grained open-vocabulary detection by separating localization and recognition tasks, with modular optimization improving performance on fine-grained object detection.
Abstract: Fine-grained open-vocabulary object detection (FG-OVD) aims to detect novel object categories described by attribute-rich texts. While existing open-vocabulary detectors show promise at the base-category level, they underperform in fine-grained settings due to the semantic entanglement of subjects and attributes in pretrained vision-language model (VLM) embeddings – leading to over-representation of attributes, mislocalization, and semantic drift in embedding space. We propose GUIDED, a decomposition framework specifically designed to address the semantic entanglement between subjects and attributes in fine-grained prompts. By separating object localization and fine-grained recognition into distinct pathways, GUIDED aligns each subtask with the module best suited to its role. Specifically, given a fine-grained class name, we first use a language model to extract a coarse-grained subject and its descriptive attributes. Then the detector is guided solely by the subject embedding, ensuring stable localization unaffected by irrelevant or overrepresented attributes. To selectively retain helpful attributes, we introduce an attribute embedding fusion module that incorporates attribute information into detection queries in an attention-based manner. This mitigates over-representation while preserving discriminative power. Finally, a region-level attribute discrimination module compares each detected region against full fine-grained class names using a refined vision-language model with a projection head for improved alignment. Extensive experiments on FG-OVD and 3F-OVD benchmarks show that GUIDED achieves new state-of-the-art results, demonstrating the benefits of disentangled modeling and modular optimization. Our code will be released at https://github.com/lijm48/GUIDED.
[214] Generative Shape Reconstruction with Geometry-Guided Langevin Dynamics
Linus Härenstam-Nielsen, Dmitrii Pozdeev, Thomas Dagès, Nikita Araslanov, Daniel Cremers
Main category: cs.CV
TL;DR: GG-Langevin: A probabilistic method combining diffusion models with Langevin dynamics for 3D shape reconstruction that balances measurement consistency with shape plausibility under incomplete/noisy observations.
Details
Motivation: Existing 3D shape reconstruction methods fail under realistic conditions with incomplete measurements or noise, while generative models produce plausible shapes but lack measurement consistency. There's a need to unify these complementary approaches.
Method: GG-Langevin uses Geometry-Guided Langevin dynamics that traverses diffusion model trajectories while preserving measurement consistency at every step, combining a data-informed prior with geometric constraints.
Result: Extensive experiments show GG-Langevin achieves higher geometric accuracy and greater robustness to missing data than existing surface reconstruction methods.
Conclusion: The probabilistic approach successfully unifies generative modeling with geometric constraints, enabling robust 3D shape reconstruction from incomplete/noisy observations.
Abstract: Reconstructing complete 3D shapes from incomplete or noisy observations is a fundamentally ill-posed problem that requires balancing measurement consistency with shape plausibility. Existing methods for shape reconstruction can achieve strong geometric fidelity in ideal conditions but fail under realistic conditions with incomplete measurements or noise. At the same time, recent generative models for 3D shapes can synthesize highly realistic and detailed shapes but fail to be consistent with observed measurements. In this work, we introduce GG-Langevin: Geometry-Guided Langevin dynamics, a probabilistic approach that unifies these complementary perspectives. By traversing the trajectories of Langevin dynamics induced by a diffusion model, while preserving measurement consistency at every step, we generatively reconstruct shapes that fit both the measurements and the data-informed prior. We demonstrate through extensive experiments that GG-Langevin achieves higher geometric accuracy and greater robustness to missing data than existing methods for surface reconstruction.
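The "measurement consistency at every step" idea can be sketched in a toy low-dimensional setting: alternate a Langevin step under the prior score with a projection that re-imposes the observed measurement. The Gaussian prior and hard projection below are our illustrative stand-ins for the paper's diffusion-model score and geometric constraints:

```python
# Toy sketch of measurement-consistent Langevin sampling (not GG-Langevin:
# the paper works on 3D shapes with a diffusion prior). Coordinate 0 plays
# the role of an observed measurement; coordinate 1 is unobserved.

import math
import random

random.seed(0)

def prior_score(x):
    # Score of a standard normal prior: d/dx log p(x) = -x.
    return -x

def constrained_langevin(x0, observed, steps=200, eps=0.01):
    x = list(x0)
    for _ in range(steps):
        # Langevin step: drift along the prior score plus injected noise.
        for i in range(len(x)):
            x[i] += eps * prior_score(x[i]) + math.sqrt(2 * eps) * random.gauss(0, 1)
        # Projection: re-impose the measurement on the observed coordinate.
        x[0] = observed
    return x

sample = constrained_langevin([3.0, 3.0], observed=0.5)
```

The observed coordinate ends exactly at its measurement, while the unobserved one is drawn from (roughly) the prior, which is the balance between measurement consistency and shape plausibility the abstract describes.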
[215] YOLO Object Detectors for Robotics – a Comparative Study
Patryk Niżeniec, Marcin Iwanowski, Marcin Gahbler
Main category: cs.CV
TL;DR: Evaluation of various YOLO object detection models for robotic vision tasks using custom and COCO datasets with image distortions to test robustness.
Details
Motivation: To validate the applicability of different YOLO model versions and variants for detecting objects in robot workspaces, helping researchers choose appropriate models for robotic vision tasks.
Method: Used custom dataset and COCO2017 dataset, applied image distortions to test robustness, and evaluated various YOLO models across different training/testing configurations.
Result: Experimental results comparing different YOLO versions and variants under various conditions, providing insights into model performance for robotic vision applications.
Conclusion: The study provides guidance on selecting appropriate YOLO models for robotic vision tasks based on performance under different conditions and robustness to image distortions.
Abstract: YOLO object detectors recently became a key component of vision systems in many domains. The family of available YOLO models consists of multiple versions, each in various variants. The research reported in this paper aims to validate the applicability of members of this family to detect objects located within the robot workspace. In our experiments, we used our custom dataset and the COCO2017 dataset. To test the robustness of investigated detectors, the images of these datasets were subjected to distortions. The results of our experiments, including variations of training/testing configurations and models, may support the choice of the appropriate YOLO version for robotic vision tasks.
[216] RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs
Logan Lawrence, Mustafa Chasmai, Rangel Daroya, Wuao Liu, Seoyun Jeong, Aaron Sun, Max Hamilton, Fabien Delattre, Oindrila Saha, Subhransu Maji, Grant Van Horn
Main category: cs.CV
TL;DR: RealBirdID benchmark challenges multimodal systems to either identify bird species from images or abstain with evidence-based rationales when identification is impossible due to non-visual cues or image quality issues.
Details
Motivation: Current multimodal systems are typically evaluated only on answerable cases, encouraging confident guesses rather than principled abstention. Real-world fine-grained bird identification often requires non-visual cues (like vocalization) or faces issues like occlusion, poor angles, or low resolution.
Method: Proposes RealBirdID benchmark with curated unanswerable examples labeled with specific rationales (“requires vocalization,” “low quality image,” “view obstructed”) paired with answerable instances. Evaluates open-source and proprietary models on both classification accuracy and abstention calibration.
Result: (1) Species identification on answerable set is challenging (<13% accuracy for MLLMs including GPT-5 and Gemini-2.5 Pro); (2) Better classification ability doesn’t guarantee better abstention calibration; (3) MLLMs generally fail at providing correct reasons even when they do abstain.
Conclusion: RealBirdID establishes a focused target for abstention-aware fine-grained recognition and provides a recipe for measuring progress in multimodal systems’ ability to recognize when they lack sufficient information.
Abstract: Fine-grained bird species identification in the wild is frequently unanswerable from a single image: key cues may be non-visual (e.g. vocalization), or obscured due to occlusion, camera angle, or low resolution. Yet today’s multimodal systems are typically judged on answerable, in-schema cases, encouraging confident guesses rather than principled abstention. We propose the RealBirdID benchmark: given an image of a bird, a system should either answer with a species or abstain with a concrete, evidence-based rationale: “requires vocalization,” “low quality image,” or “view obstructed”. For each genus, the dataset includes a validation split composed of curated unanswerable examples with labeled rationales, paired with a companion set of clearly answerable instances. We find that (1) the species identification on the answerable set is challenging for a variety of open-source and proprietary models (less than 13% accuracy for MLLMs including GPT-5 and Gemini-2.5 Pro), (2) models with greater classification ability are not necessarily more calibrated to abstain from unanswerable examples, and (3) that MLLMs generally fail at providing correct reasons even when they do abstain. RealBirdID establishes a focused target for abstention-aware fine-grained recognition and a recipe for measuring progress.
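The answer-or-abstain protocol the benchmark evaluates can be sketched as a selective-prediction rule. The three rationale strings come from the abstract; the confidence-threshold policy, function names, and example scores are our illustrative assumptions:

```python
# Hypothetical sketch of an answer-or-abstain decision rule for RealBirdID.
# The benchmark scores systems on both the species answer and, when they
# abstain, whether the stated rationale is the correct one.

RATIONALES = ("requires vocalization", "low quality image", "view obstructed")

def answer_or_abstain(species_probs, rationale_scores, tau=0.6):
    """species_probs: species -> probability; rationale_scores: rationale -> score.

    Answer with the top species if the model is confident enough,
    otherwise abstain with the highest-scoring rationale.
    """
    species, p = max(species_probs.items(), key=lambda kv: kv[1])
    if p >= tau:
        return ("answer", species)
    rationale = max(rationale_scores, key=rationale_scores.get)
    return ("abstain", rationale)

# Confident case: answer. Uncertain case: abstain with a rationale.
confident = answer_or_abstain({"species_a": 0.9, "species_b": 0.1},
                              {r: 0.0 for r in RATIONALES})
uncertain = answer_or_abstain({"species_a": 0.4, "species_b": 0.3},
                              {"requires vocalization": 0.8,
                               "low quality image": 0.1,
                               "view obstructed": 0.1})
```

The paper's finding (3) says current MLLMs tend to fail precisely at the second branch: even when they abstain, the rationale they give is often wrong.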
[217] Unified Number-Free Text-to-Motion Generation Via Flow Matching
Guanhe Huang, Oya Celiktutan
Main category: cs.CV
TL;DR: UMF is a unified framework for multi-person motion generation that handles variable numbers of agents through a two-stage approach: single-pass motion prior generation and multi-pass reaction generation.
Details
Motivation: Existing generative models for motion synthesis struggle with variable numbers of agents, rely on limited domain-specific data, and suffer from inefficiency and error accumulation in autoregressive approaches.
Method: Proposes Unified Motion Flow (UMF) with two components: Pyramid Motion Flow (P-Flow) for hierarchical motion prior generation, and Semi-Noise Motion Flow (S-Flow) for reaction generation. Uses unified latent space to bridge heterogeneous motion datasets.
Result: Extensive results and user studies demonstrate UMF’s effectiveness as a generalist model for multi-person motion generation from text.
Conclusion: UMF provides an effective solution for variable-agent motion generation by decomposing the problem into motion prior and reaction generation stages, addressing computational efficiency and error accumulation issues.
Abstract: Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize with variable agents. Based on limited, domain-specific data, existing methods employ autoregressive models to generate motion recursively, which suffer from inefficiency and error accumulation. We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes the number-free motion generation into a single-pass motion prior generation stage and multi-pass reaction generation stages. Specifically, UMF utilizes a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, thereby mitigating computational overheads. For reaction generation, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, alleviating error accumulation. Extensive results and user studies demonstrate UMF's effectiveness as a generalist model for multi-person motion generation from text. Project page: https://githubhgh.github.io/umf/.
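At sampling time, flow matching generates by integrating a learned velocity field from a noise sample toward data; a minimal Euler sampler makes that loop concrete. The analytic straight-line field below is our stand-in for a learned network (UMF's P-Flow and S-Flow are considerably richer), so this sketches only the generic mechanism the method builds on:

```python
# Minimal Euler sampler for a flow-matching model (generic sketch, not UMF).
# In practice v would be a trained network v_theta(x, t); here an analytic
# field for the straight-line path to a known target stands in for it.

def euler_sample(x0, v, steps=10):
    dt = 1.0 / steps
    x, t = list(x0), 0.0
    for _ in range(steps):
        vel = v(x, t)                                 # query the velocity field
        x = [x[i] + dt * vel[i] for i in range(len(x))]   # Euler step
        t += dt
    return x

target = [1.0, -2.0]          # toy "data" point (illustrative)

def toy_velocity(x, t):
    # Velocity of the straight-line probability path ending at `target`.
    return [(target[i] - x[i]) / (1.0 - t) for i in range(len(x))]

x1 = euler_sample([0.0, 0.0], toy_velocity)   # integrate noise -> data
```

A single-pass flow like this is what lets UMF avoid the step-by-step error accumulation of autoregressive motion generation.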
[218] MOOZY: A Patient-First Foundation Model for Computational Pathology
Yousef Kotp, Vincent Quoc-Huy Trinh, Christopher Pal, Mahdi S. Hosseini
Main category: cs.CV
TL;DR: MOOZY is a patient-first pathology foundation model that treats patient cases rather than individual slides as the core unit, using a case transformer to model dependencies across all slides from the same patient with multi-stage self-supervision and task supervision.
Details
Motivation: Current computational pathology approaches are slide-centric, depend on private data and expensive paired-report supervision, and fail to model relationships among multiple slides from the same patient, limiting transfer across diverse clinical tasks.
Method: Two-stage approach: Stage 1 pretrains a vision-only slide encoder on 77,134 public slide feature grids using masked self-distillation. Stage 2 aligns representations with clinical semantics using a case transformer and multi-task supervision over 333 tasks from 56 public datasets (205 classification, 128 survival tasks).
Result: Achieves best or tied-best performance on most metrics across eight held-out tasks, improving macro averages over TITAN by +7.37%, +5.50%, +7.83% and over PRISM by +8.83%, +10.70%, +9.78% for weighted F1, weighted ROC-AUC, and balanced accuracy. Parameter efficient with 85.77M parameters (14x smaller than GigaPath).
Conclusion: Open, reproducible patient-level pretraining yields transferable embeddings, providing a practical path toward scalable patient-first histopathology foundation models that better capture clinical context across multiple slides.
Abstract: Computational pathology needs whole-slide image (WSI) foundation models that transfer across diverse clinical tasks, yet current approaches remain largely slide-centric, often depend on private data and expensive paired-report supervision, and do not explicitly model relationships among multiple slides from the same patient. We present MOOZY, a patient-first pathology foundation model in which the patient case, not the individual slide, is the core unit of representation. MOOZY explicitly models dependencies across all slides from the same patient via a case transformer during pretraining, combining multi-stage open self-supervision with scaled low-cost task supervision. In Stage 1, we pretrain a vision-only slide encoder on 77,134 public slide feature grids using masked self-distillation. In Stage 2, we align these representations with clinical semantics using a case transformer and multi-task supervision over 333 tasks from 56 public datasets, including 205 classification and 128 survival tasks across four endpoints. Across eight held-out tasks with five-fold frozen-feature probe evaluation, MOOZY achieves best or tied-best performance on most metrics and improves macro averages over TITAN by +7.37%, +5.50%, and +7.83% and over PRISM by +8.83%, +10.70%, and +9.78% for weighted F1, weighted ROC-AUC, and balanced accuracy, respectively. MOOZY is also parameter efficient with 85.77M parameters, 14x smaller than GigaPath. These results demonstrate that open, reproducible patient-level pretraining yields transferable embeddings, providing a practical path toward scalable patient-first histopathology foundation models.
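The patient-first idea, aggregating a variable number of slide embeddings into a single case-level representation, can be pictured as attention pooling. Below is a minimal numpy sketch; the `case_embedding` function and its query vector are illustrative stand-ins, not MOOZY's actual case transformer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def case_embedding(slide_embeds, query):
    # Attention-pool a variable number of slide embeddings from one
    # patient into a single case-level representation.
    scores = slide_embeds @ query        # (num_slides,)
    weights = softmax(scores)            # attention over slides
    return weights @ slide_embeds        # (dim,)

rng = np.random.default_rng(0)
slides = rng.standard_normal((3, 16))    # 3 slides from one patient
query = rng.standard_normal(16)          # illustrative "learned" query
case = case_embedding(slides, query)
print(case.shape)  # (16,)
```

The point of the pooling view is that the case vector has a fixed size regardless of how many slides a patient contributes, which is what lets downstream multi-task heads operate at the patient level.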
[219] Towards Intrinsic-Aware Monocular 3D Object Detection
Zhihao Zhang, Abhinav Kumar, Xiaoming Liu
Main category: cs.CV
TL;DR: MonoIA is a monocular 3D object detection framework that uses language-grounded representations to adapt to camera intrinsic variations, improving generalization across different camera setups.
Details
Motivation: Existing monocular 3D object detection methods are highly sensitive to camera intrinsics and struggle to generalize across diverse settings, since camera parameters govern how 3D scenes are projected onto the 2D image plane.
Method: Proposes the MonoIA framework, which uses large language models and vision-language models to generate intrinsic embeddings encoding the visual and geometric implications of camera parameters. These embeddings are integrated via an Intrinsic Adaptation Module to modulate feature representations according to camera-specific configurations.
Result: Achieves new state-of-the-art results on standard benchmarks including KITTI, Waymo, and nuScenes (+1.18% on KITTI leaderboard), with further improvements under multi-dataset training (+4.46% on KITTI Val).
Conclusion: Shifts intrinsic modeling from numeric conditioning to semantic representation, enabling robust and unified 3D perception across different camera configurations through language-grounded adaptation.
Abstract: Monocular 3D object detection (Mono3D) aims to infer object locations and dimensions in 3D space from a single RGB image. Despite recent progress, existing methods remain highly sensitive to camera intrinsics and struggle to generalize across diverse settings, since intrinsics govern how 3D scenes are projected onto the image plane. We propose MonoIA, a unified intrinsic-aware framework that models and adapts to intrinsic variation through a language-grounded representation. The key insight is that intrinsic variation is not a numeric difference but a perceptual transformation that alters apparent scale, perspective, and spatial geometry. To capture this effect, MonoIA employs large language models and vision-language models to generate intrinsic embeddings that encode the visual and geometric implications of camera parameters. These embeddings are hierarchically integrated into the detection network via an Intrinsic Adaptation Module, allowing the model to modulate its feature representations according to camera-specific configurations and maintain consistent 3D detection across intrinsics. This shifts intrinsic modeling from numeric conditioning to semantic representation, enabling robust and unified perception across cameras. Extensive experiments show that MonoIA achieves new state-of-the-art results on standard benchmarks including KITTI, Waymo, and nuScenes (e.g., +1.18% on the KITTI leaderboard), and further improves performance under multi-dataset training (e.g., +4.46% on KITTI Val).
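Modulating detector features with a camera-conditioned embedding can be sketched as FiLM-style scaling and shifting. Everything below is an assumption-laden toy: the random projection stands in for the paper's LLM/VLM-generated intrinsic embeddings, and `intrinsic_adaptation` is only one plausible form of an Intrinsic Adaptation Module:

```python
import numpy as np

rng = np.random.default_rng(0)

def intrinsic_embedding(fx, fy, cx, cy, dim=8):
    # Hypothetical stand-in for the paper's language-grounded embedding:
    # here we just project normalized intrinsics to a fixed-size vector.
    intr = np.array([fx, fy, cx, cy], dtype=np.float64) / 1000.0
    W = rng.standard_normal((dim, 4))
    return np.tanh(W @ intr)

def intrinsic_adaptation(features, embed, W_gamma, W_beta):
    # FiLM-style modulation: scale and shift each channel according to
    # the camera-specific embedding.
    gamma = 1.0 + W_gamma @ embed          # per-channel scale
    beta = W_beta @ embed                  # per-channel shift
    return features * gamma[:, None, None] + beta[:, None, None]

C, H, W = 16, 8, 8
feats = rng.standard_normal((C, H, W))
emb = intrinsic_embedding(fx=721.5, fy=721.5, cx=609.6, cy=172.9)  # KITTI-like
W_g = 0.1 * rng.standard_normal((C, 8))
W_b = 0.1 * rng.standard_normal((C, 8))
out = intrinsic_adaptation(feats, emb, W_g, W_b)
print(out.shape)  # (16, 8, 8)
```

The same feature map would be modulated differently under a fisheye camera's intrinsics, which is the mechanism by which a single detector can adapt across camera configurations.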
[220] VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation
Jihwan Hong, Jaeyoung Do
Main category: cs.CV
TL;DR: VIRST is an end-to-end framework for referring video object segmentation that unifies global video reasoning and pixel-level mask prediction in a single model to handle motion-intensive and reasoning-oriented videos.
Details
Motivation: Current fixed keyframe-based RVOS approaches fail to capture rapidly changing spatiotemporal dynamics and to handle queries requiring multi-step reasoning, leading to poor performance on motion-intensive and reasoning-oriented videos beyond static benchmarks.
Method: Proposes VIRST with Spatio-Temporal Fusion (STF) to fuse segmentation-aware video features into the vision-language backbone, and a Temporal Dynamic Anchor Updater to maintain temporally adjacent anchor frames for stable temporal cues under large motion, occlusion, and reappearance.
Result: Achieves state-of-the-art results across diverse RVOS benchmarks under realistic and challenging conditions, demonstrating strong generalization to both referring and reasoning-oriented settings.
Conclusion: VIRST’s unified design effectively addresses limitations of previous approaches by integrating global video reasoning with pixel-level segmentation in a single model framework.
Abstract: Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe-based approaches that couple a vision-language model with a separate propagation module often fail to capture rapidly changing spatiotemporal dynamics and to handle queries requiring multi-step reasoning, leading to sharp performance drops on motion-intensive and reasoning-oriented videos beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single model. VIRST bridges semantic and segmentation representations through the Spatio-Temporal Fusion (STF) module, which fuses segmentation-aware video features into the vision-language backbone, and employs the Temporal Dynamic Anchor Updater to maintain temporally adjacent anchor frames that provide stable temporal cues under large motion, occlusion, and reappearance. This unified design achieves state-of-the-art results across diverse RVOS benchmarks under realistic and challenging conditions, demonstrating strong generalization to both referring and reasoning-oriented settings. The code and checkpoints are available at https://github.com/AIDASLab/VIRST.
[221] ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel, Shafiq Abedin, Amit Alfassy, Eli Schwartz, Daniel Caraballo, Yagmur Gizem Cinar, Florian Scheidegger, Steven I. Ross, Daniel Karl I. Weidele, Hang Hua, Ekaterina Arutyunova, Roei Herzig, Zexue He, Zihan Wang, Xinyue Yu, Yunfei Zhao, Sicong Jiang, Minghao Liu, Qunshu Lin, Peter Staar, Luis Lastras, Aude Oliva, Rogerio Feris
Main category: cs.CV
TL;DR: ChartNet is a million-scale multimodal dataset for chart understanding with 1.5M diverse chart samples across 24 types, featuring aligned components: code, image, data table, summary, and QA with reasoning.
Details
Motivation: Current vision-language models struggle with chart understanding due to the need to jointly reason over geometric patterns, numerical data, and natural language. There's a lack of large-scale, high-quality multimodal datasets specifically for chart interpretation.
Method: Created ChartNet using a novel code-guided synthesis pipeline that generates diverse chart samples from 6 plotting libraries. Includes rigorous quality filtering for visual fidelity and semantic accuracy. Features five aligned components per sample and specialized subsets for human annotation, real-world data, safety, and grounding.
Result: Produced 1.5 million chart samples spanning 24 chart types. Fine-tuning on ChartNet consistently improves performance across benchmarks, demonstrating its utility as large-scale supervision for multimodal models.
Conclusion: ChartNet is the largest open-source dataset for chart understanding, supporting development of foundation models with robust capabilities for data visualization interpretation through fine-grained cross-modal alignment.
Abstract: Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language – a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample consists of five aligned components: plotting code, rendered chart image, data table, natural language summary, and question-answering with reasoning, providing fine-grained cross-modal alignment. To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human annotated data, real-world data, safety, and grounding. Moreover, a rigorous quality-filtering pipeline ensures visual fidelity, semantic accuracy, and diversity across chart representations. Fine-tuning on ChartNet consistently improves results across benchmarks, demonstrating its utility as large-scale supervision for multimodal models. As the largest open-source dataset of its kind, ChartNet aims to support the development of foundation models with robust and generalizable capabilities for data visualization understanding. The dataset is publicly available at https://huggingface.co/datasets/ibm-granite/ChartNet
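Code-guided synthesis makes the five components aligned by construction: every modality is derived from the same sampled data. A toy sketch under that assumption (function names and the chart template are invented here, not ChartNet's pipeline):

```python
import random

def synthesize_chart_sample(seed=0):
    # Sample data once, then derive every component -- plotting code,
    # data table, summary, QA with reasoning -- from that single source,
    # so the modalities agree by construction.
    rng = random.Random(seed)
    categories = ["A", "B", "C", "D"]
    values = [rng.randint(10, 100) for _ in categories]
    code = (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar({categories!r}, {values!r})\n"
        "plt.savefig('chart.png')\n"
    )
    top = categories[values.index(max(values))]
    return {
        "code": code,
        "table": list(zip(categories, values)),
        "summary": f"Bar chart of four categories; {top} has the highest value.",
        "qa": {
            "question": "Which category has the highest value?",
            "answer": top,
            "reasoning": f"Compare the four bars; {top} = {max(values)} is largest.",
        },
    }

sample = synthesize_chart_sample(seed=42)
print(sample["qa"]["answer"])
```

Because the answer is computed from the same values the plotting code renders, the QA label is correct by construction, which is what a quality-filtering stage can then verify at scale.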
[222] Structural Graph Probing of Vision-Language Models
Haoyu He, Yue Zhuo, Yu Zheng, Qi R. Wang
Main category: cs.CV
TL;DR: Analyzes vision-language models through neural topology using correlation graphs to understand computation organization across neuron populations and its behavioral relevance.
Details
Motivation: While VLMs achieve strong multimodal performance, how computation is organized across neuron populations remains poorly understood. The paper aims to study VLMs through neural topology to understand population-level structure and its behavioral significance.
Method: Represents each layer as a within-layer correlation graph derived from neuron-neuron co-activations. Analyzes whether population-level structure is behaviorally meaningful, how it changes across modalities and depth, and whether it identifies causally influential internal components under intervention.
Result: Correlation topology carries recoverable behavioral signal. Cross-modal structure progressively consolidates with depth around a compact set of recurrent hub neurons, whose targeted perturbation substantially alters model output.
Conclusion: Neural topology emerges as a meaningful intermediate scale for VLM interpretability: richer than local attribution, more tractable than full circuit recovery, and empirically tied to multimodal behavior.
Abstract: Vision-language models (VLMs) achieve strong multimodal performance, yet how computation is organized across populations of neurons remains poorly understood. In this work, we study VLMs through the lens of neural topology, representing each layer as a within-layer correlation graph derived from neuron-neuron co-activations. This view allows us to ask whether population-level structure is behaviorally meaningful, how it changes across modalities and depth, and whether it identifies causally influential internal components under intervention. We show that correlation topology carries recoverable behavioral signal; moreover, cross-modal structure progressively consolidates with depth around a compact set of recurrent hub neurons, whose targeted perturbation substantially alters model output. Neural topology thus emerges as a meaningful intermediate scale for VLM interpretability: richer than local attribution, more tractable than full circuit recovery, and empirically tied to multimodal behavior. Code is publicly available at https://github.com/he-h/vlm-graph-probing.
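The core construction is straightforward to reproduce in miniature: correlate neuron activations within a layer, threshold the correlations into a graph, and read off high-degree hubs. A self-contained numpy sketch (thresholds and toy data are illustrative, not the paper's settings):

```python
import numpy as np

def correlation_graph(activations, threshold=0.5):
    # activations: (samples, neurons). Build a within-layer graph whose
    # edges are strong neuron-neuron co-activation correlations.
    corr = np.corrcoef(activations.T)
    np.fill_diagonal(corr, 0.0)
    return np.abs(corr) > threshold  # boolean adjacency matrix

def hub_neurons(adj, k=2):
    # Hubs = highest-degree nodes in the correlation graph.
    degree = adj.sum(axis=1)
    return list(np.argsort(degree)[::-1][:k])

rng = np.random.default_rng(0)
base = rng.standard_normal((200, 1))
# Neurons 0-2 share a latent driver, so they correlate; 3-5 are independent.
acts = np.hstack([base + 0.1 * rng.standard_normal((200, 3)),
                  rng.standard_normal((200, 3))])
adj = correlation_graph(acts, threshold=0.5)
print(hub_neurons(adj, k=3))
```

On this toy layer the three driven neurons form a clique and are recovered as hubs; the paper's causal claim is that perturbing such hubs in a real VLM substantially changes the output.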
[223] LightCtrl: Training-free Controllable Video Relighting
Yizuo Peng, Xuelin Chen, Kai Zhang, Xiaodong Cun
Main category: cs.CV
TL;DR: LightCtrl is a controllable video relighting method that enables explicit illumination control via user-supplied light trajectories, combining image relighting diffusion models with video diffusion priors for temporal consistency.
Details
Motivation: Existing video relighting methods offer limited explicit control over illumination in relighted outputs, lacking the ability to precisely follow user-specified light trajectories.
Method: Combines pre-trained diffusion models: an image relighting model processes frames individually, then a video diffusion prior ensures temporal consistency. Introduces a Light Map Injection module for illumination coherence and a Geometry-Aware Relighting module to suppress the influence of the original lighting.
Result: Produces high-quality videos with diverse illumination changes that closely follow specified light trajectories, demonstrating improved controllability over baseline methods.
Conclusion: LightCtrl enables training-free controllable video relighting with explicit illumination control through user-supplied light trajectories, advancing video editing capabilities.
Abstract: Recent diffusion models have achieved remarkable success in image relighting, and this success has quickly been extended to video relighting. However, existing methods offer limited explicit control over illumination in the relighted output. We present LightCtrl, the first controllable video relighting method that enables explicit control of video illumination through a user-supplied light trajectory in a training-free manner. Our approach combines pre-trained diffusion models: an image relighting model processes each frame individually, followed by a video diffusion prior to enhance temporal consistency. To achieve explicit control over dynamically varying lighting, we introduce two key components. First, a Light Map Injection module samples light trajectory-specific noise and injects it into the latent representation of the source video, improving illumination coherence with the conditional light trajectory. Second, a Geometry-Aware Relighting module dynamically combines RGB and normal map latents in the frequency domain to suppress the influence of the original lighting, further enhancing adherence to the input light trajectory. Experiments show that LightCtrl produces high-quality videos with diverse illumination changes that closely follow the specified light trajectory, demonstrating improved controllability over baseline methods. Code is available at: https://github.com/GVCLab/LightCtrl.
[224] SceneExpander: Expanding 3D Scenes with Free-Form Inserted Views
Zijian He, enjie Liu, Yihao Wang, Weizhi Zhong, Huan Yuan, Kun Gai, Guangrun Wang, Guanbin Li
Main category: cs.CV
TL;DR: SceneExpander enables iterative 3D scene expansion by inserting generated views into existing multi-view reconstructions while maintaining consistency through test-time adaptation with dual distillation signals.
Details
Motivation: Real-world 3D scene creation workflows are iterative, requiring creators to repeatedly extend existing scenes. Current methods struggle when inserting generated views that are 3D-misaligned with original reconstructions, causing geometry shifts, hallucinated content, and view-dependent artifacts that break multi-view consistency.
Method: SceneExpander uses test-time adaptation on a parametric feed-forward 3D reconstruction model with two complementary distillation signals: anchor distillation stabilizes the original scene by distilling geometric cues from captured views, and inserted-view self-distillation preserves observation-supported predictions while adapting latent geometry and appearance to accommodate misaligned inserted views.
Result: Experiments on ETH scenes and online data demonstrate improved expansion behavior and reconstruction quality under misalignment compared to existing approaches.
Conclusion: The proposed method effectively addresses the challenge of 3D scene expansion with misaligned inserted views, enabling more robust iterative scene creation workflows while maintaining multi-view consistency.
Abstract: World building with 3D scene representations is increasingly important for content creation, simulation, and interactive experiences, yet real workflows are inherently iterative: creators must repeatedly extend an existing scene under user control. Motivated by this research gap, we study 3D scene expansion in a user-centric workflow: starting from a real scene captured by multi-view images, we extend its coverage by inserting an additional view synthesized by a generative model. Unlike simple object editing or style transfer in a fixed scene, the inserted view is often 3D-misaligned with the original reconstruction, introducing geometry shifts, hallucinated content, or view-dependent artifacts that break global multi-view consistency. To address the challenge, we propose SceneExpander, which applies test-time adaptation to a parametric feed-forward 3D reconstruction model with two complementary distillation signals: anchor distillation stabilizes the original scene by distilling geometric cues from the captured views, while inserted-view self-distillation preserves observation-supported predictions yet adapts latent geometry and appearance to accommodate the misaligned inserted view. Experiments on ETH scenes and online data demonstrate improved expansion behavior and reconstruction quality under misalignment.
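The two distillation signals can be written as a single test-time objective. A hedged numpy sketch, with invented weights and a binary observation mask standing in for the paper's actual formulation:

```python
import numpy as np

def expansion_loss(pred_anchor, teacher_anchor,
                   pred_inserted, frozen_inserted, mask,
                   w_anchor=1.0, w_self=0.5):
    # Sketch of the two complementary signals (weights and shapes are
    # illustrative, not the paper's exact formulation):
    #  - anchor distillation: predictions on the captured views stay close
    #    to the frozen teacher's geometry;
    #  - inserted-view self-distillation: on the inserted view, only
    #    observation-supported pixels (mask = 1) are pinned to the frozen
    #    prediction, leaving the rest free to adapt to the misaligned view.
    l_anchor = np.mean((pred_anchor - teacher_anchor) ** 2)
    l_self = (np.sum(mask * (pred_inserted - frozen_inserted) ** 2)
              / max(mask.sum(), 1))
    return w_anchor * l_anchor + w_self * l_self

rng = np.random.default_rng(0)
depth = rng.random((8, 8))
mask = (rng.random((8, 8)) > 0.5).astype(float)
perfect = expansion_loss(depth, depth, depth, depth, mask)
drifted = expansion_loss(depth + 0.1, depth, depth, depth, mask)
print(perfect, drifted > perfect)
```

The masked second term is what lets the model drift where the inserted view is unsupported while the first term anchors the original reconstruction in place.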
[225] EFlow: Fast Few-Step Video Generator Training from Scratch via Efficient Solution Flow
Dogyun Park, Yanyu Li, Sergey Tulyakov, Anil Kag
Main category: cs.CV
TL;DR: EFlow: Efficient few-step training framework for video diffusion transformers using solution-flow objective with gated local-global attention and path-drop guided training to reduce computational costs.
Details
Motivation: Video diffusion transformers face two major bottlenecks: quadratic attention complexity and iterative sampling steps, making training and inference computationally expensive.
Method: Proposes EFlow with: 1) Gated Local-Global Attention for efficient token-droppable hybrid blocks, 2) Path-Drop Guided training to replace expensive guidance targets, and 3) a Mean-Velocity Additivity regularizer for high fidelity at low step counts.
Result: Achieves 2.5x higher training throughput over standard solution-flow and 45.3x lower inference latency than standard iterative models with competitive performance on Kinetics and large-scale text-to-video datasets.
Conclusion: EFlow enables practical from-scratch training of video diffusion transformers by simultaneously addressing attention complexity and sampling step bottlenecks through efficient architectural and training innovations.
Abstract: Scaling video diffusion transformers is fundamentally bottlenecked by two compounding costs: the expensive quadratic complexity of attention per step, and the iterative sampling steps. In this work, we propose EFlow, an efficient few-step training framework that tackles these bottlenecks simultaneously. To reduce sampling steps, we build on a solution-flow objective that learns a function mapping a noised state at time t to time s. Making this formulation computationally feasible and high-quality at video scale, however, demands two complementary innovations. First, we propose Gated Local-Global Attention, a token-droppable hybrid block which is efficient, expressive, and remains highly stable under aggressive random token-dropping, substantially reducing per-step compute. Second, we develop an efficient few-step training recipe. We propose Path-Drop Guided training to replace the expensive guidance target with a computationally cheap, weak path. Furthermore, we augment this with a Mean-Velocity Additivity regularizer to ensure high fidelity at extremely low step counts. Together, our EFlow enables a practical from-scratch training pipeline, achieving up to 2.5x higher training throughput over standard solution-flow, and 45.3x lower inference latency than standard iterative models with competitive performance on Kinetics and large-scale text-to-video datasets.
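The solution-flow idea, a map from the state at time t directly to time s, and the mean-velocity additivity it should satisfy, can be checked on a toy straight-line path where the exact mean velocity is known. This is a didactic sketch, not EFlow's trained model:

```python
import numpy as np

def mean_velocity(x_t, t, s, x0, x1):
    # Toy "solution flow": for the straight interpolation path
    # x_t = (1 - t) * x0 + t * x1, the exact mean velocity between any
    # t and s is the constant x1 - x0; a learned model approximates this.
    return x1 - x0

def jump(x_t, t, s, x0, x1):
    # One-step map from the state at time t to the state at time s.
    return x_t + (s - t) * mean_velocity(x_t, t, s, x0, x1)

rng = np.random.default_rng(0)
x0, x1 = rng.standard_normal(4), rng.standard_normal(4)
x_t = 0.8 * x0 + 0.2 * x1          # state at t = 0.2
one_jump = jump(x_t, 0.2, 1.0, x0, x1)
two_jumps = jump(jump(x_t, 0.2, 0.6, x0, x1), 0.6, 1.0, x0, x1)
# Mean-velocity additivity: composing two short jumps matches one long jump.
print(np.allclose(one_jump, two_jumps), np.allclose(one_jump, x1))  # True True
```

The regularizer in the paper enforces this composition property on the learned network, which is what keeps sample quality high when only a few jumps are taken at inference.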
[226] PRUE: A Practical Recipe for Field Boundary Segmentation at Scale
Gedeon Muhawenayo, Caleb Robinson, Subash Khanal, Zhanpei Fang, Isaac Corley, Alexander Wollam, Tianyi Gao, Leonard Strnad, Ryan Avery, Lyndon Estes, Ana M. Tárano, Nathan Jacobs, Hannah Kerner
Main category: cs.CV
TL;DR: U-Net semantic segmentation model outperforms instance-based and geospatial foundation models for global field boundary delineation, achieving 76% IoU and 47% object-F1 on the Fields of The World benchmark.
Details
Motivation: Large-scale field boundary maps are crucial for agricultural monitoring, but existing deep learning approaches are sensitive to illumination, spatial scale, and geographic variations. There's a need for systematic evaluation of segmentation and geospatial foundation models for reliable global field boundary delineation.
Method: Conducted the first systematic evaluation of 18 segmentation and geospatial foundation models on the Fields of The World benchmark. Proposed a new segmentation approach combining a U-Net backbone, composite loss functions, and targeted data augmentations to enhance performance and robustness under real-world conditions.
Result: U-Net semantic segmentation model outperformed instance-based and GFM alternatives on performance and deployment metrics. Achieved 76% IoU and 47% object-F1 on FTW, representing 6% and 9% improvements over previous baseline. Released models and model-derived field boundary datasets for five countries.
Conclusion: The proposed approach provides a practical framework for reliable, scalable, and reproducible field boundary delineation across model design, training, and inference. U-Net-based segmentation offers superior performance for agricultural monitoring tasks compared to more complex alternatives.
Abstract: Large-scale maps of field boundaries are essential for agricultural monitoring tasks. Existing deep learning approaches for satellite-based field mapping are sensitive to illumination, spatial scale, and changes in geographic location. We conduct the first systematic evaluation of segmentation and geospatial foundation models (GFMs) for global field boundary delineation using the Fields of The World (FTW) benchmark. We evaluate 18 models under unified experimental settings, showing that a U-Net semantic segmentation model outperforms instance-based and GFM alternatives on a suite of performance and deployment metrics. We propose a new segmentation approach that combines a U-Net backbone, composite loss functions, and targeted data augmentations to enhance performance and robustness under real-world conditions. Our model achieves a 76% IoU and 47% object-F1 on FTW, an increase of 6% and 9% over the previous baseline. Our approach provides a practical framework for reliable, scalable, and reproducible field boundary delineation across model design, training, and inference. We release all models and model-derived field boundary datasets for five countries.
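The abstract names composite loss functions without specifying them; a common choice for boundary segmentation is binary cross-entropy plus soft Dice, sketched here as one plausible (assumed, not confirmed) instantiation:

```python
import numpy as np

def composite_loss(probs, target, w_bce=1.0, w_dice=1.0, eps=1e-6):
    # Pixel-wise binary cross-entropy plus a soft Dice term, balancing
    # per-pixel accuracy against region overlap. This is a generic
    # composite loss, not necessarily PRUE's exact formulation.
    probs = np.clip(probs, eps, 1 - eps)
    bce = -np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs))
    inter = np.sum(probs * target)
    dice = 1 - (2 * inter + eps) / (np.sum(probs) + np.sum(target) + eps)
    return w_bce * bce + w_dice * dice

target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0                      # a square "field" mask
good = np.where(target == 1, 0.9, 0.1)      # confident, correct prediction
bad = np.where(target == 1, 0.1, 0.9)       # confident, inverted prediction
print(composite_loss(good, target) < composite_loss(bad, target))  # True
```

The Dice term is what keeps thin field boundaries from being swamped by the background class, a common failure mode when using cross-entropy alone.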
[227] LLM Enhanced Action Recognition via Hierarchical Global-Local Skeleton-Language Model
Ruosi Wang, Fangwei Zuo, Lei Li, Zhaoqiang Xia
Main category: cs.CV
TL;DR: HocSLM: A hierarchical global-local skeleton-language model for human action recognition that combines skeleton-based GCNs with vision-language models for enhanced semantic understanding and cross-modal alignment.
Details
Motivation: Existing GCN-based skeleton action recognition methods struggle with long-range joint dependencies, complex temporal dynamics, and cross-modal semantic alignment due to insufficient modeling of action semantics.
Method: Proposes HocSLM with: 1) HGLNet for hierarchical global-local spatio-temporal modeling, 2) VLM-generated textual descriptions from RGB videos, and 3) a skeleton-language sequential fusion module for aligning skeletal features with textual descriptions in a unified semantic space.
Result: Achieves state-of-the-art performance on three mainstream benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA.
Conclusion: The proposed HocSLM framework effectively enhances semantic discrimination and cross-modal understanding in skeleton-based action recognition by integrating hierarchical spatio-temporal modeling with vision-language semantic alignment.
Abstract: Skeleton-based human action recognition has achieved remarkable progress in recent years. However, most existing GCN-based methods rely on short-range motion topologies, which not only struggle to capture long-range joint dependencies and complex temporal dynamics but also limit cross-modal semantic alignment and understanding due to insufficient modeling of action semantics. To address these challenges, we propose a hierarchical global-local skeleton-language model (HocSLM), enabling the large action model to be more representative of action semantics. First, we design a hierarchical global-local network (HGLNet) that consists of a composite-topology spatial module and a dual-path hierarchical temporal module. By synergistically integrating multi-level global and local modules, HGLNet achieves dynamically collaborative modeling at both global and local scales while preserving prior knowledge of human physical structure, significantly enhancing the model’s representation of complex spatio-temporal relationships. Then, a large vision-language model (VLM) is employed to generate textual descriptions by passing the original RGB video sequences to this model, providing the rich action semantics for further training the skeleton-language model. Furthermore, we introduce a skeleton-language sequential fusion module by combining the features from HGLNet and the generated descriptions, which utilizes a skeleton-language model (SLM) for aligning skeletal spatio-temporal features and textual action descriptions precisely within a unified semantic space. The SLM could significantly enhance HGLNet’s semantic discrimination capabilities and cross-modal understanding abilities. Extensive experiments demonstrate that the proposed HocSLM achieves state-of-the-art performance on three mainstream benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA.
[228] UniDAC: Universal Metric Depth Estimation for Any Camera
Girish Chandar Ganesan, Yuliang Guo, Liu Ren, Xiaoming Liu
Main category: cs.CV
TL;DR: UniDAC is a monocular metric depth estimation framework that achieves universal robustness across diverse camera types (fisheye, 360°) using a single model by decoupling depth estimation into relative depth prediction and spatially varying scale estimation.
Details
Motivation: Existing monocular metric depth estimation methods struggle with generalization across diverse camera types (fisheye, 360° cameras). Current approaches either require large-FoV camera data during training or separate models for different domains, lacking universal robustness.
Method: Decouples metric depth estimation into relative depth prediction and spatially varying scale estimation. Uses a lightweight Depth-Guided Scale Estimation module that upsamples coarse scale maps using relative depth as guidance. Introduces RoPE-φ, a distortion-aware positional embedding that respects spatial warping in Equi-Rectangular Projections via latitude-aware weighting.
Result: Achieves state-of-the-art in cross-camera generalization, consistently outperforming prior methods across all datasets for diverse camera types.
Conclusion: UniDAC provides a universal framework for monocular metric depth estimation that generalizes robustly across diverse camera domains using a single model, addressing key limitations in camera-type generalization.
Abstract: Monocular metric depth estimation (MMDE) is a core challenge in computer vision, playing a pivotal role in real-world applications that demand accurate spatial understanding. Although prior works have shown promising zero-shot performance in MMDE, they often struggle with generalization across diverse camera types, such as fisheye and 360° cameras. Recent advances have addressed this through unified camera representations or canonical representation spaces, but they require either including large-FoV camera data during training or separately trained models for different domains. We propose UniDAC, an MMDE framework that presents universal robustness in all domains and generalizes across diverse cameras using a single model. We achieve this by decoupling metric depth estimation into relative depth prediction and spatially varying scale estimation, enabling robust performance across different domains. We propose a lightweight Depth-Guided Scale Estimation module that upsamples a coarse scale map to high resolution using the relative depth map as guidance to account for local scale variations. Furthermore, we introduce RoPE-φ, a distortion-aware positional embedding that respects the spatial warping in Equi-Rectangular Projections (ERP) via latitude-aware weighting. UniDAC achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets.
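The decoupling means final metric depth is a per-pixel product of a relative depth map and a spatially varying scale map. In this sketch the paper's depth-guided upsampling is replaced by plain nearest-neighbor upsampling, so treat it as a structural illustration only:

```python
import numpy as np

def metric_depth(relative_depth, coarse_scale, H, W):
    # metric depth = relative depth * a spatially varying scale map.
    # The coarse scale is upsampled to full resolution; UniDAC guides this
    # upsampling with the relative depth, here we use nearest neighbor.
    sh, sw = coarse_scale.shape
    rows = np.arange(H) * sh // H
    cols = np.arange(W) * sw // W
    scale = coarse_scale[np.ix_(rows, cols)]
    return relative_depth * scale

rel = np.ones((4, 4)) * np.linspace(0.1, 1.0, 4)   # relative depth in (0, 1]
coarse = np.array([[2.0, 4.0], [6.0, 8.0]])        # meters-per-unit scale
depth = metric_depth(rel, coarse, 4, 4)
print(depth.shape)  # (4, 4)
```

Because the relative-depth network never sees metric scale, it can stay camera-agnostic; only the small scale head has to absorb intrinsic-dependent effects, which is what makes the single-model cross-camera claim plausible.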
[229] MotiMem: Motion-Aware Approximate Memory for Energy-Efficient Neural Perception in Autonomous Vehicles
Haohua Que, Mingkai Liu, Jiayue Xie, Haojia Gao, Jiajun Sun, Hongyi Xu, Handong Yao, Fei Qiao
Main category: cs.CV
TL;DR: MotiMem is a hardware-software co-designed interface that reduces memory-interface energy for autonomous vehicles by exploiting temporal coherence with motion propagation and hybrid sparsity-aware coding, achieving 43% energy reduction while maintaining 93% detection accuracy.
Details
Motivation: High-resolution sensors in autonomous vehicles create severe memory bottlenecks where data movement energy exceeds computation energy. Traditional image compression is inadequate because it's semantically blind and optimized for storage rather than bus switching activity.
Method: MotiMem uses lightweight 2D Motion Propagation to dynamically identify Regions of Interest (RoI) by exploiting temporal coherence. It combines this with a Hybrid Sparsity-Aware Coding scheme that uses adaptive inversion and truncation to induce bit-level sparsity.
Result: Extensive experiments across nuScenes, Waymo, and KITTI datasets with 16 detection models show MotiMem reduces memory-interface dynamic energy by approximately 43% while retaining approximately 93% of object detection accuracy, establishing a superior Pareto frontier compared to standard codecs like JPEG and WebP.
Conclusion: MotiMem provides an effective hardware-software co-designed solution for reducing memory energy consumption in autonomous perception systems while maintaining high detection accuracy, addressing the critical memory wall problem in battery-constrained electric vehicles.
Abstract: High-resolution sensors are critical for robust autonomous perception but impose a severe memory wall on battery-constrained electric vehicles. In these systems, data movement energy often outweighs computation. Traditional image compression is ill-suited as it is semantically blind and optimizes for storage rather than bus switching activity. We propose MotiMem, a hardware-software co-designed interface. Exploiting temporal coherence, MotiMem uses lightweight 2D Motion Propagation to dynamically identify Regions of Interest (RoI). Complementing this, a Hybrid Sparsity-Aware Coding scheme leverages adaptive inversion and truncation to induce bit-level sparsity. Extensive experiments across nuScenes, Waymo, and KITTI with 16 detection models demonstrate that MotiMem reduces memory-interface dynamic energy by approximately 43 percent while retaining approximately 93 percent of the object detection accuracy, establishing a new Pareto frontier significantly superior to standard codecs like JPEG and WebP.
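The "adaptive inversion" component resembles classic bus-invert coding; the byte-level sketch below is my assumption of that idea, not the paper's actual codec, and omits the truncation step.

```python
def encode_byte(b):
    """Bus-invert-style encoding: if more than half of the 8 bits are 1,
    transmit the bitwise complement plus a 1-bit flag, so the memory bus
    always carries the sparser (fewer-ones) pattern, reducing switching."""
    if bin(b).count("1") > 4:
        return b ^ 0xFF, 1   # inverted payload, flag set
    return b, 0

def decode_byte(b, flag):
    """Invert back on read if the flag was set."""
    return b ^ 0xFF if flag else b
```

Lossless and a single extra flag bit per word; a real implementation would operate on bus-width words in hardware rather than Python bytes.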
[230] RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation
Sen Zhang, Runmei Li, Zhichao Zheng, Yuhe Zhang, Jiani Li, Kailun Zhang, Tao Zhang, Wenjun Wu, Qunbo Wang
Main category: cs.CV
TL;DR: RailVQA introduces a VQA benchmark and collaborative framework for cab-view visual cognition in automatic train operation, combining small-model efficiency with large-model reasoning capabilities.
Details
Motivation: Current ATO systems lack high-level reasoning for safety-critical corner cases, while existing LMMs are too computationally expensive and hallucination-prone for safety-critical applications. There's also a lack of domain-specific benchmarks for evaluating cognitive capabilities in railway environments.
Method: Proposes RailVQA-bench (20K single-frame + 1,168 video QA pairs) for evaluating visual cognition, and RailVQA-CoM, a collaborative framework combining small models’ efficiency with large models’ cognition via a transparent three-module architecture and adaptive temporal sampling.
Result: The approach improves performance, enhances interpretability, reduces inference latency, strengthens cross-domain generalization, and enables plug-and-play deployment in autonomous driving systems.
Conclusion: RailVQA addresses critical gaps in ATO visual cognition by providing both evaluation benchmarks and an efficient collaborative framework that balances computational efficiency with advanced reasoning capabilities.
Abstract: Automatic Train Operation (ATO) relies on low-latency, reliable cab-view visual perception and decision-oriented inference to ensure safe operation in complex and dynamic railway environments. However, existing approaches focus primarily on basic perception and often generalize poorly to rare yet safety-critical corner cases. They also lack the high-level reasoning and planning capabilities required for operational decision-making. Although recent Large Multi-modal Models (LMMs) show strong generalization and cognitive capabilities, their use in safety-critical ATO is hindered by high computational cost and hallucination risk. Meanwhile, reliable domain-specific benchmarks for systematically evaluating cognitive capabilities are still lacking. To address these gaps, we introduce RailVQA-bench, the first VQA benchmark for cab-view visual cognition in ATO, comprising 20,000 single-frame and 1,168 video-based QA pairs to evaluate cognitive generalization and interpretability in both static and dynamic scenarios. Furthermore, we propose RailVQA-CoM, a collaborative large-small model framework that combines small-model efficiency with large-model cognition via a transparent three-module architecture and adaptive temporal sampling, improving perceptual generalization and enabling efficient reasoning and planning. Experiments demonstrate that the proposed approach substantially improves performance, enhances interpretability, reduces inference latency, and strengthens cross-domain generalization, while enabling plug-and-play deployment in autonomous driving systems. Code and datasets will be available at https://github.com/Cybereye-bjtu/RailVQA.
[231] SJD-VP: Speculative Jacobi Decoding with Verification Prediction for Autoregressive Image Generation
Bingqi Shan, Baoquan Zhang, Xiaochen Qi, Xutao Li, Yunming Ye, Liqiang Nie
Main category: cs.CV
TL;DR: SJD-VP improves speculative Jacobi decoding for autoregressive image generation by predicting verification-accepted tokens based on probability increases across iterations, boosting acceptance rates and accelerating generation.
Details
Motivation: Existing speculative Jacobi decoding methods suffer from low acceptance rates due to token selection ambiguity. Recent approaches focus on relaxed token verification but fail to exploit iterative decoding dynamics fully.
Method: Proposes Speculative Jacobi Decoding with Verification Prediction (SJD-VP) that leverages changes in token probabilities across iterations to guide sampling. The method favors tokens whose probabilities increase, effectively predicting which tokens will pass verification. It’s designed as a plug-and-play module that integrates with existing SJD methods.
Result: Extensive experiments on standard benchmarks show SJD-VP consistently accelerates autoregressive decoding while improving image generation quality.
Conclusion: SJD-VP effectively addresses the low acceptance rate problem in speculative Jacobi decoding by exploiting iterative probability dynamics, resulting in faster and higher-quality image generation.
Abstract: Speculative Jacobi Decoding (SJD) has emerged as a promising method for accelerating autoregressive image generation. Despite its potential, existing SJD approaches often suffer from the low acceptance rate issue of speculative tokens due to token selection ambiguity. Recent works attempt to mitigate this issue primarily from the relaxed token verification perspective but fail to fully exploit the iterative dynamics of decoding. In this paper, we conduct an in-depth analysis and make a novel observation that tokens whose probabilities increase are more likely to match the verification-accepted and correct token. Based on this, we propose a novel Speculative Jacobi Decoding with Verification Prediction (SJD-VP). The key idea is to leverage the change in token probabilities across iterations to guide sampling, favoring tokens whose probabilities increase. This effectively predicts which tokens are likely to pass subsequent verification, boosting the acceptance rate. In particular, our SJD-VP is plug-and-play and can be seamlessly integrated into existing SJD methods. Extensive experiments on standard benchmarks demonstrate that our SJD-VP method consistently accelerates autoregressive decoding while improving image generation quality.
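The core selection rule, favoring draft tokens whose probability rose since the previous Jacobi iteration, might look like the following toy reweighting; the boost form and greedy pick are illustrative assumptions, not the paper's sampler.

```python
import numpy as np

def pick_token(prev_probs, curr_probs, boost=2.0):
    """Reweight the current token distribution toward tokens whose
    probability increased across Jacobi iterations; per the paper's
    observation, such tokens are more likely to pass verification."""
    prev = np.asarray(prev_probs, dtype=float)
    curr = np.asarray(curr_probs, dtype=float)
    delta = np.clip(curr - prev, 0.0, None)   # keep only probability increases
    w = curr * (1.0 + boost * delta / (delta.max() + 1e-8))
    return int(np.argmax(w / w.sum()))        # greedy pick for the sketch
```

With no probability change the rule falls back to plain argmax of the current distribution.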
[232] The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models
Shivang Chopra, Shaunak Halbe, Chengyue Huan, Brisa Maneechotesuwan, Zsolt Kira
Main category: cs.CV
TL;DR: GRACE is a unified fine-tuning framework for Vision-Language Models that addresses the three-way trade-off between in-distribution accuracy, out-of-distribution generalization, and adversarial robustness through joint regularization of parameter-space curvature and feature-space invariance.
Details
Motivation: Existing fine-tuning approaches for VLMs face a critical three-way trade-off between ID accuracy, OOD generalization, and adversarial robustness. Current methods only address at most two of these aspects, leaving models vulnerable to attacks or degrading performance on other metrics.
Method: GRACE uses adaptive weight perturbations scaled by local curvature to promote flatter minima in parameter space, combined with a feature alignment loss that maintains representation consistency across clean, adversarial, and OOD inputs. Grounded in Robust PAC-Bayes theory.
Result: On ImageNet fine-tuning of CLIP models, GRACE simultaneously improves ID accuracy by 10.8%, adversarial accuracy by 13.5% while maintaining 57.0% OOD accuracy (vs. 57.4% zero-shot baseline). Geometric analysis confirms flatter minima without feature distortion.
Conclusion: GRACE provides a principled approach to achieving generalized robustness in foundation VLMs by addressing both parameter-space curvature and feature-space invariance, resolving the three-way trade-off that plagues existing fine-tuning methods.
Abstract: Fine-tuning approaches for Vision-Language Models (VLMs) face a critical three-way trade-off between In-Distribution (ID) accuracy, Out-of-Distribution (OOD) generalization, and adversarial robustness. Existing robust fine-tuning strategies resolve at most two axes of this trade-off. Generalization-preserving methods retain ID/OOD performance but leave models vulnerable to adversarial attacks, while adversarial training improves robustness to targeted attacks but degrades ID/OOD accuracy. Our key insight is that the robustness trade-off stems from two geometric failures: sharp, anisotropic minima in parameter space and unstable feature representations that deform under perturbation. To address this, we propose GRACE (Gram-aligned Robustness via Adaptive Curvature Estimation), a unified fine-tuning framework that jointly regularizes the parameter-space curvature and feature-space invariance for VLMs. Grounded in Robust PAC-Bayes theory, GRACE employs adaptive weight perturbations scaled by local curvature to promote flatter minima, combined with a feature alignment loss that maintains representation consistency across clean, adversarial, and OOD inputs. On ImageNet fine-tuning of CLIP models, GRACE simultaneously improves ID accuracy by 10.8%, and adversarial accuracy by 13.5% while maintaining 57.0% OOD accuracy (vs. 57.4% zero-shot baseline). Geometric analysis confirms that GRACE converges to flatter minima without feature distortion across distribution shifts, providing a principled step toward generalized robustness in foundation VLMs.
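The "adaptive weight perturbations scaled by local curvature" are in the spirit of SAM/ASAM-style sharpness probing; the sketch below is a minimal stand-in under that assumption (the parameter-wise |w| scaling substitutes for the paper's curvature estimate).

```python
import numpy as np

def adaptive_perturbation(w, grad, rho=0.05):
    """SAM/ASAM-style ascent step: scale the gradient per-parameter,
    normalize to a ball of radius rho, and perturb the weights uphill.
    Training against the loss at w + eps pushes toward flatter minima."""
    scale = np.abs(w) + 1e-12                    # adaptive per-parameter scaling
    g = scale * grad
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # project onto rho-ball
    return w + eps
```

In the full method this inner perturbation would be paired with the feature alignment loss across clean, adversarial, and OOD inputs.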
[233] Follow Your Heart: Landmark-Guided Transducer Pose Scoring for Point-of-Care Echocardiography
Zaiyang Guo, Jessie N. Dong, Filippos Bellos, Jilei Hao, Emily J. MacKay, Trevor Chan, Shir Goldfinger, Sethu Reddy, Steven Vance, Jason J. Corso, Alison M. Pouch
Main category: cs.CV
TL;DR: Multi-task network for cardiac ultrasound guidance that provides feedback on transducer positioning and automatically estimates left ventricular ejection fraction without requiring position tracking hardware.
Details
Motivation: Point-of-care transthoracic echocardiography (TTE) requires acquisition of the apical 4-chamber (A4CH) view for cardiac assessment, but optimizing transducer pose is challenging for novice users, especially in resource-limited settings.
Method: Multi-task network cascading transducer pose scoring module and uncertainty-aware left ventricular landmark detector with automated LVEF estimation, trained and inferred using only image data without requiring transducer position tracking hardware.
Result: Network successfully determines transducer pose quality (on target, close to target, or far from target) based on images alone while generating visual landmark cues for anatomical guidance, evaluated on point-of-care TTE data acquired with dense sweep protocol.
Conclusion: Demonstrates promising strategy for A4CH view acquisition guidance that could be useful for deploying point-of-care TTE in limited resource settings without requiring cumbersome tracking setups.
Abstract: Point-of-care transthoracic echocardiography (TTE) makes it possible to assess a patient’s cardiac function in almost any setting. A critical step in the TTE exam is acquisition of the apical 4-chamber (A4CH) view, which is used to evaluate clinically impactful measurements such as left ventricular ejection fraction (LVEF). However, optimizing transducer pose for high-quality image acquisition and subsequent measurement is a challenging task, particularly for novice users. In this work, we present a multi-task network that provides feedback cues for A4CH view acquisition and automatically estimates LVEF in high-quality A4CH images. The network cascades a transducer pose scoring module and an uncertainty-aware LV landmark detector with automated LVEF estimation. A strength is that network training and inference do not require cumbersome or costly setups for transducer position tracking. We evaluate performance on point-of-care TTE data acquired with a spatially dense “sweep” protocol around the optimal A4CH view. The results demonstrate the network’s ability to determine when the transducer pose is on target, close to target, or far from target based on the images alone, while generating visual landmark cues that guide anatomical interpretation and orientation. In conclusion, we demonstrate a promising strategy to provide guidance for A4CH view acquisition, which may be useful when deploying point-of-care TTE in limited resource settings.
[234] Weakly Convex Ridge Regularization for 3D Non-Cartesian MRI Reconstruction
German Shâma Wache, Chaithya G R, Asma Tanabene, Sebastian Neumayer
Main category: cs.CV
TL;DR: A rotation invariant weakly convex ridge regularizer (WCRR) is proposed for accelerated MRI reconstruction, offering improved computational efficiency and robustness compared to deep learning methods while maintaining performance comparable to state-of-the-art denoiser-based approaches.
Details
Motivation: Accelerated non-Cartesian MRI acquisition reduces scan time but causes long reconstruction delays. Deep learning reconstruction methods lack stability and robustness to distribution shifts, motivating a more principled approach that combines variational methods with deep learning strengths.
Method: Train a rotation invariant weakly convex ridge regularizer (WCRR) for variational reconstruction. The approach is benchmarked against state-of-the-art methods on retrospectively simulated data and prospective GoLF SPARKLING and CAIPIRINHA acquisitions.
Result: WCRR consistently outperforms widely used baselines and achieves performance comparable to Plug and Play reconstruction with a state-of-the-art 3D DRUNet denoiser, while offering substantially improved computational efficiency and robustness to acquisition changes.
Conclusion: WCRR unifies the strengths of principled variational methods and modern deep learning based approaches for accelerated MRI reconstruction.
Abstract: While highly accelerated non-Cartesian acquisition protocols significantly reduce scan time, they often entail long reconstruction delays. Deep learning based reconstruction methods can alleviate this, but often lack stability and robustness to distribution shifts. As an alternative, we train a rotation invariant weakly convex ridge regularizer (WCRR). The resulting variational reconstruction approach is benchmarked against state-of-the-art methods on retrospectively simulated data and (out of distribution) on prospective GoLF SPARKLING and CAIPIRINHA acquisitions. Our approach consistently outperforms widely used baselines and achieves performance comparable to Plug and Play reconstruction with a state-of-the-art 3D DRUNet denoiser, while offering substantially improved computational efficiency and robustness to acquisition changes. In summary, WCRR unifies the strengths of principled variational methods and modern deep learning based approaches.
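The variational reconstruction that a learned regularizer like WCRR plugs into solves argmin_x ||Ax − y||²/2 + λR(x). A generic gradient-descent sketch, where the solver, step size, and the simple Tikhonov regularizer used in the demo are my assumptions, not the paper's weakly convex ridge:

```python
import numpy as np

def variational_recon(y, A, reg_grad, lam=0.1, step=0.1, iters=200):
    """Gradient descent on the variational objective
    ||A x - y||^2 / 2 + lam * R(x), given the regularizer's gradient."""
    x = A.T @ y                                 # adjoint (zero-filled) init
    for _ in range(iters):
        data_grad = A.T @ (A @ x - y)           # gradient of the data term
        x = x - step * (data_grad + lam * reg_grad(x))
    return x
```

In MRI, `A` would be the (non-Cartesian) forward operator combining coil sensitivities and a non-uniform Fourier transform; `reg_grad` would be the trained WCRR gradient.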
[235] RiskProp: Collision-Anchored Self-Supervised Risk Propagation for Early Accident Anticipation
Yiyang Zou, Tianhao Zhao, Peilun Xiao, Hongyu Jin, Longyu Qi, Yuxuan Li, Liyin Liang, Yifeng Qian, Chunbo Lai, Yutian Lin, Zhihui Li, Yu Wu
Main category: cs.CV
TL;DR: RiskProp: Self-supervised risk propagation for accident anticipation using only collision frame annotations, eliminating subjective anomaly onset labels through future-frame regularization and adaptive monotonic constraints.
Details
Motivation: Existing accident anticipation methods rely on subjective and inconsistent binary supervision with manually annotated "anomaly onset" frames, leading to inaccurate risk estimation. The paper aims to develop a more reliable approach that removes the need for these problematic annotations.
Method: Proposes RiskProp, a collision-anchored self-supervised risk propagation paradigm that uses only reliably annotated collision frames. It employs two observation-driven losses: 1) future-frame regularization loss that uses next-frame predictions as soft targets for current frame supervision, enabling backward risk propagation; 2) adaptive monotonic constraint to encourage non-decreasing risk progression over time based on empirical trends.
Result: Experiments on CAP and Nexar datasets demonstrate state-of-the-art performance. RiskProp produces smoother, more discriminative risk curves, improving both early anticipation accuracy and interpretability compared to existing methods.
Conclusion: RiskProp successfully eliminates the need for subjective anomaly onset annotations while achieving superior accident anticipation performance through self-supervised risk propagation from collision frames, offering more reliable and interpretable risk estimation.
Abstract: Accident anticipation aims to predict impending collisions from dashcam videos and trigger early alerts. Existing methods rely on binary supervision with manually annotated “anomaly onset” frames, which are subjective and inconsistent, leading to inaccurate risk estimation. In contrast, we propose RiskProp, a novel collision-anchored self-supervised risk propagation paradigm for early accident anticipation, which removes the need for anomaly onset annotations and leverages only the reliably annotated collision frame. RiskProp models temporal risk evolution through two observation-driven losses: first, since future frames contain more definitive evidence of an impending accident, we introduce a future-frame regularization loss that uses the model’s next-frame prediction as a soft target to supervise the current frame, enabling backward propagation of risk signals; second, inspired by the empirical trend of rising risk before accidents, we design an adaptive monotonic constraint to encourage a non-decreasing progression over time. Experiments on CAP and Nexar demonstrate that RiskProp achieves state-of-the-art performance and produces smoother, more discriminative risk curves, improving both early anticipation and interpretability.
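The two observation-driven losses can be mimicked on a per-frame risk curve; the exact forms below (MSE toward the next frame, a hinge on decreases) are illustrative guesses at the paper's equations, not its implementation.

```python
import numpy as np

def riskprop_losses(risk):
    """risk: per-frame risk scores r_1..r_T for a clip ending in a collision."""
    r = np.asarray(risk, dtype=float)
    # future-frame regularization: treat the (detached) next-frame risk as a
    # soft target for the current frame, propagating risk signals backward
    l_future = np.mean((r[:-1] - r[1:]) ** 2)
    # monotonic constraint: penalize only decreases in risk over time
    l_mono = np.mean(np.clip(r[:-1] - r[1:], 0.0, None))
    return l_future, l_mono
```

A curve that rises toward the annotated collision frame incurs zero monotonic penalty, which is what anchors supervision to the collision alone.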
[236] MultiLoc: Multi-view Guided Relative Pose Regression for Fast and Robust Visual Re-Localization
Nobel Dang, Bing Li
Main category: cs.CV
TL;DR: MultiLoc is a multi-view guided relative pose regression model that fuses multiple reference views and camera poses in a single forward pass for accurate zero-shot pose estimation with real-time efficiency.
Details
Motivation: Relative Pose Regression (RPR) generalizes well to unseen environments but has limited performance due to pairwise and local spatial views. The authors aim to enhance RPR with globally consistent spatial and geometric understanding through multi-view fusion.
Method: Proposes MultiLoc with joint fusion of multiple reference views and their camera poses in a single forward pass. Also introduces a co-visibility-driven retrieval strategy for geometrically relevant reference view selection to supply informative context.
Result: Establishes new benchmark in visual re-localization, consistently outperforming SOTA RPR methods across WaySpots, Cambridge Landmarks, and Indoor6 datasets. Also shows SOTA performance in relative pose estimation on MegaDepth-1500, ScanNet-1500, and ACID benchmarks.
Conclusion: MultiLoc demonstrates robust domain generalization across indoor, outdoor and natural environments, showing that multi-view fusion significantly enhances relative pose regression capabilities while maintaining real-time efficiency.
Abstract: Relative Pose Regression (RPR) generalizes well to unseen environments, but its performance is often limited due to pairwise and local spatial views. To this end, we propose MultiLoc, a novel multi-view guided RPR model trained at scale, equipping relative pose regression with globally consistent spatial and geometric understanding. Specifically, our method jointly fuses multiple reference views and their associated camera poses in a single forward pass, enabling accurate zero-shot pose estimation with real-time efficiency. To reliably supply informative context, we further propose a co-visibility-driven retrieval strategy for geometrically relevant reference view selection. MultiLoc establishes a new benchmark in visual re-localization, consistently outperforming existing state-of-the-art (SOTA) relative pose regression (RPR) methods across diverse datasets, including WaySpots, Cambridge Landmarks, and Indoor6. Furthermore, MultiLoc’s pose regressor exhibits SOTA performance in relative pose estimation, surpassing RPR, feature matching and non-regression-based techniques on the MegaDepth-1500, ScanNet-1500, and ACID benchmarks. These results demonstrate robust domain generalization of MultiLoc across indoor, outdoor and natural environments. Code will be made publicly available.
[237] MEDIC-AD: Towards Medical Vision-Language Model’s Clinical Intelligence
Woohyeon Park, Jaeik Kim, Sunghwan Steve Cho, Pa Hong, Wookyoung Jeong, Yoojin Nam, Namjoon Kim, Ginny Y. Wong, Ka Chun Cheung, Jaeyoung Do
Main category: cs.CV
TL;DR: MEDIC-AD is a clinically-oriented vision-language model that enhances lesion detection, symptom tracking, and visual explainability in medical imaging through a stage-wise framework with specialized tokens and dedicated explainability training.
Details
Motivation: Current medical VLMs lack mechanisms to translate their broad knowledge into clinically actionable outputs for lesion detection, symptom tracking, and visual explainability, which are central to real-world medical image analysis.
Method: A stage-wise framework with: 1) learnable anomaly-aware tokens (
Result: Achieves state-of-the-art results in anomaly detection, symptom tracking, and anomaly segmentation compared to both closed-source and medical-specialized baselines. Shows stable predictions and clinically faithful explanations in real longitudinal clinical data from hospital workflows.
Conclusion: MEDIC-AD successfully bridges the gap between general medical VLMs and clinically actionable outputs through its staged design, delivering practical value in patient-monitoring and decision-support workflows.
Abstract: Lesion detection, symptom tracking, and visual explainability are central to real-world medical image analysis, yet current medical Vision-Language Models (VLMs) still lack mechanisms that translate their broad knowledge into clinically actionable outputs. To bridge this gap, we present MEDIC-AD, a clinically oriented VLM that strengthens these three capabilities through a stage-wise framework. First, learnable anomaly-aware tokens (
[238] LightMover: Generative Light Movement with Color and Intensity Controls
Gengze Zhou, Tianyu Wang, Soo Ye Kim, Zhixin Shu, Xin Yu, Yannick Hold-Geoffroy, Sumit Chaturvedi, Qi Wu, Zhe Lin, Scott Cohen
Main category: cs.CV
TL;DR: LightMover: A framework for controllable light manipulation in single images using video diffusion priors to achieve physically plausible illumination changes without re-rendering scenes.
Details
Motivation: Current light editing methods often require 3D scene reconstruction or complex re-rendering. The authors aim to enable intuitive, physically plausible light manipulation in single images without needing 3D scene information or re-rendering pipelines.
Method: Formulates light editing as sequence-to-sequence prediction in visual token space using video diffusion priors. Uses light-control tokens to adjust position, color, intensity along with reflections, shadows, and falloff. Introduces adaptive token-pruning to preserve spatial information while compactly encoding non-spatial attributes. Trained on scalable rendering pipeline generating image pairs with varied lighting while keeping scene content consistent.
Result: Achieves precise independent control over light position, color, and intensity. Shows high PSNR and strong semantic consistency (DINO, CLIP) across different editing tasks. Reduces control sequence length by 41% while maintaining editing fidelity through token pruning.
Conclusion: LightMover provides a unified framework for controllable light manipulation in single images, enabling physically plausible illumination changes without re-rendering by leveraging video diffusion priors and efficient token-based representation.
Abstract: We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. We formulate light editing as a sequence-to-sequence prediction problem in visual token space: given an image and light-control tokens, the model adjusts light position, color, and intensity together with resulting reflections, shadows, and falloff from a single view. This unified treatment of spatial (movement) and appearance (color, intensity) controls improves both manipulation and illumination understanding. We further introduce an adaptive token-pruning mechanism that preserves spatially informative tokens while compactly encoding non-spatial attributes, reducing control sequence length by 41% while maintaining editing fidelity. To train our framework, we construct a scalable rendering pipeline that generates large numbers of image pairs across varied light positions, colors, and intensities while keeping the scene content consistent with the original image. LightMover enables precise, independent control over light position, color, and intensity, and achieves high PSNR and strong semantic consistency (DINO, CLIP) across different tasks.
[239] Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision
Yizhou Jin, Yuezhu Feng, Jinjin Zhang, Peng Wang, Qingjie Liu, Yunhong Wang
Main category: cs.CV
TL;DR: ReAL activates MLLMs’ intrinsic reasoning for anomaly detection, localization, and interpretable reasoning using only image-level supervision, achieving competitive performance without pixel-wise labels.
Details
Motivation: Current MLLM approaches for anomaly detection are limited to image-level analysis and textual reasoning, requiring external vision modules and dense pixel annotations for localization. The authors aim to unlock MLLMs' intrinsic reasoning capabilities to perform comprehensive anomaly analysis without auxiliary components or pixel-level supervision.
Method: Proposes Reasoning-Driven Anomaly Localization (ReAL) that extracts anomaly-related tokens from MLLMs’ autoregressive reasoning process and aggregates attention responses to generate pixel-level anomaly maps. Also introduces Consistency-Guided Reasoning Optimization (CGRO) using reinforcement learning to align reasoning tokens with visual attentions for more coherent reasoning and accurate localization.
Result: Extensive experiments on four public benchmarks show significant improvements in anomaly detection, localization, and interpretability. The method achieves performance competitive with MLLM-based methods trained with dense pixel-level supervision, despite using only image-level supervision.
Conclusion: The work successfully demonstrates that MLLMs’ intrinsic reasoning capabilities can be leveraged for comprehensive anomaly analysis without requiring pixel-level annotations or external vision modules, advancing multimodal understanding in anomaly detection tasks.
Abstract: Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning and perceptual abilities for anomaly detection. However, most approaches remain confined to image-level anomaly detection and textual reasoning, while pixel-level localization still relies on external vision modules and dense annotations. In this work, we activate the intrinsic reasoning potential of MLLMs to perform anomaly detection, pixel-level localization, and interpretable reasoning solely from image-level supervision, without any auxiliary components or pixel-wise labels. Specifically, we propose Reasoning-Driven Anomaly Localization (ReAL), which extracts anomaly-related tokens from the autoregressive reasoning process and aggregates their attention responses to produce pixel-level anomaly maps. We further introduce a Consistency-Guided Reasoning Optimization (CGRO) module that leverages reinforcement learning to align reasoning tokens with visual attentions, resulting in more coherent reasoning and accurate anomaly localization. Extensive experiments on four public benchmarks demonstrate that our method significantly improves anomaly detection, localization, and interpretability. Remarkably, despite relying solely on image-level supervision, our approach achieves performance competitive with MLLM-based methods trained under dense pixel-level supervision. Code is available at https://github.com/YizhouJin313/ReADL.
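The localization step, aggregating the attention responses of anomaly-related reasoning tokens into a pixel-level map, might be sketched as follows; the mean aggregation and min-max normalization are assumptions, not the paper's exact formulation.

```python
import numpy as np

def anomaly_map(token_attentions):
    """token_attentions: list of (H, W) attention maps over image patches,
    one per anomaly-related token extracted from the autoregressive
    reasoning. Returns a normalized (H, W) anomaly heat map in [0, 1]."""
    m = np.mean(np.stack(token_attentions), axis=0)   # aggregate over tokens
    return (m - m.min()) / (m.max() - m.min() + 1e-8)
```

In practice the (H, W) map would be upsampled to image resolution and thresholded to obtain the final anomaly mask.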
[240] Inference-Time Structural Reasoning for Compositional Vision-Language Understanding
Amartya Bhattacharya
Main category: cs.CV
TL;DR: A framework for evaluating and improving compositional reasoning in vision-language models using scene graph augmentation and dependency parsing.
Details
Motivation: Vision-language models excel at image-text retrieval but persistently fail at compositional reasoning, particularly distinguishing captions with the same words but different relational structures.
Method: Introduces a unified evaluation framework with dependency-based TextSceneGraphParser (spaCy) to extract subject-relation-object triples, and Graph Asymmetry Scorer using optimal bipartite matching to inject structural relational priors. Tests four VLMs (CLIP, BLIP, LLaVA, Qwen3-VL-8B-Thinking) on Winoground benchmark with plain and scene-graph-augmented regimes.
Result: Qwen3-VL-8B-Thinking achieves group score of 62.75, far above encoder-based models. Proposed multi-turn scene graph filtering strategy lifts it to 66.0, surpassing prior open-source state-of-the-art. Scene graph augmentation benefits capable models but provides negligible or negative gains for weaker baselines.
Conclusion: Scene graph augmentation at inference time can significantly improve compositional reasoning in capable vision-language models, revealing a capability augmentation tradeoff where stronger models benefit more from structural priors.
Abstract: Vision-language models (VLMs) excel at image-text retrieval yet persistently fail at compositional reasoning, distinguishing captions that share the same words but differ in relational structure. We present a unified evaluation and augmentation framework benchmarking four architecturally diverse VLMs (CLIP, BLIP, LLaVA, and Qwen3-VL-8B-Thinking) on the Winoground benchmark under plain and scene-graph-augmented regimes. We introduce a dependency-based TextSceneGraphParser (spaCy) extracting subject-relation-object triples, and a Graph Asymmetry Scorer using optimal bipartite matching to inject structural relational priors. Caption ablation experiments (subject-object masking and swapping) reveal that Qwen3-VL-8B-Thinking achieves a group score of 62.75, far above all encoder-based models, while a proposed multi-turn SG filtering strategy further lifts it to 66.0, surpassing prior open-source state-of-the-art. We analyze the capability augmentation tradeoff and find that SG augmentation benefits already capable models while providing negligible or negative gains for weaker baselines. Code: https://github.com/amartyacodes/Inference-Time-Structural-Reasoning-for-Compositional-Vision-Language-Understanding
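The abstract above describes a Graph Asymmetry Scorer that matches subject-relation-object triples across captions via optimal bipartite matching, but gives no implementation detail. A minimal stdlib sketch of the matching idea follows; the function names and the slot-wise exact-match similarity are illustrative assumptions (the actual scorer presumably compares embeddings rather than strings, and would use a Hungarian solver instead of brute force):

```python
from itertools import permutations

def triple_sim(t1, t2):
    """Slot-wise similarity between two (subject, relation, object) triples."""
    return sum(a == b for a, b in zip(t1, t2)) / 3.0

def graph_match_score(triples_a, triples_b):
    """Best one-to-one matching score between two triple sets.
    Brute-force assignment over permutations; fine for tiny caption graphs."""
    if not triples_a or not triples_b:
        return 0.0
    small, large = sorted([triples_a, triples_b], key=len)
    best = 0.0
    for perm in permutations(large, len(small)):
        score = sum(triple_sim(s, l) for s, l in zip(small, perm)) / len(small)
        best = max(best, score)
    return best

# Winoground-style pair: same words, different relational structure.
cap1 = [("dog", "chases", "cat")]
cap2 = [("cat", "chases", "dog")]  # subject and object swapped
print(graph_match_score(cap1, cap1))  # identical structure → 1.0
print(graph_match_score(cap1, cap2))  # only the relation slot matches → 1/3
```

The asymmetry between the two scores is exactly the structural signal that bag-of-words similarity misses.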
[241] Communicating about Space: Language-Mediated Spatial Integration Across Partial Views
Ankur Sikarwar, Debangan Mishra, Sudarshan Nikhil, Ponnurangam Kumaraguru, Aishwarya Agrawal
Main category: cs.CV
TL;DR: MLLMs struggle with collaborative spatial reasoning across different viewpoints, performing poorly on building coherent mental maps despite some capability for identifying shared objects.
Details
Motivation: To investigate whether multimodal LLMs can build shared spatial understanding through dialogue like humans do, by aligning egocentric views to form coherent allocentric mental models.
Method: Introduces COSMIC benchmark with 899 3D scenes and 1250 QA pairs across 5 tasks; tests two static MLLM agents exchanging natural-language messages from different viewpoints to solve spatial queries; compares with 250 human-human dialogues.
Result: MLLMs show consistent capability hierarchy: best at identifying shared anchor objects (72% for Gemini-3-Pro-Thinking), worse at relational reasoning, largely fail at building globally consistent maps (near chance). Humans achieve 95% accuracy. Model dialogues don’t converge like human conversations.
Conclusion: Current MLLMs have limited ability to build and maintain robust shared mental models through spatial communication, with significant room for improvement compared to human performance.
Abstract: Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a consistent capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, performing near chance even for frontier models. Moreover, we find that thinking capability yields consistent gains in anchor grounding, but is insufficient for higher-level spatial communication. To contextualize model behavior, we additionally collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, leaving significant room for improvement even for the best-performing model, Gemini-3-Pro-Thinking, which achieves 72% aggregate accuracy. Moreover, human conversations become increasingly specific as partners converge on a shared mental model, whereas model dialogues continue to explore new possibilities rather than converging, consistent with a limited ability to build and maintain a robust shared mental model. Our code and data are available at https://github.com/ankursikarwar/Cosmic
[242] LongCat-Next: Lexicalizing Modalities as Discrete Tokens
Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, Xiaoyu Zhao, Xing Hu, Xinyang Lin, Xunliang Cai, Yan Bai, Yan Feng, Yanjie Li, Yao Qiu, Yerui Sun, Yifan Lu, Ying Luo, Yipeng Mei, Yitian Chen, Yuchen Xie, Yufang Liu, Yufei Chen, Yulei Qian, Yuqi Peng, Zhihang Yu, Zhixiong Han, Changran Wang, Chen Chen, Dian Zheng, Fengjiao Chen, Ge Yang, Haowei Guo, Haozhe Wang, Hongyu Li, Huicheng Jiang, Jiale Hong, Jialv Zou, Jiamu Li, Jianping Lin, Jiaxing Liu, Jie Yang, Jing Jin, Jun Kuang, Juncheng She, Kunming Luo, Kuofeng Gao, Lin Qiu, Linsen Guo, Mianqiu Huang, Qi Li, Qian Wang, Rumei Li, Siyu Ren, Wei Wang, Wenlong He, Xi Chen, Xiao Liu, Xiaoyu Li, Xu Huang, Xuanyu Zhu, Xuezhi Cao, Yaoming Zhu, Yifei Cao, Yimeng Jia, Yizhen Jiang, Yufei Gao, Zeyang Hu, Zhenlong Yuan, Zijian Zhang, Ziwen Wang
Main category: cs.CV
TL;DR: DiNA framework unifies multimodal modeling through shared discrete space, enabling consistent autoregressive processing across text, vision, and audio with LongCat-Next model.
Details
Motivation: Current multimodal systems are language-centric with non-linguistic modalities treated as external attachments, leading to fragmented architectures and suboptimal integration. The paper aims to transcend this limitation by creating a unified framework for native multimodal modeling.
Method: Introduces Discrete Native Autoregressive (DiNA) framework with shared discrete space representation. Key innovation is dNaViT (Discrete Native Any-resolution Visual Transformer) for tokenization/de-tokenization at arbitrary resolutions. LongCat-Next model processes text, vision, and audio under single autoregressive objective with minimal modality-specific design.
Result: LongCat-Next achieves strong performance across wide range of multimodal benchmarks, addresses performance ceiling of discrete vision modeling on understanding tasks, and provides unified approach to reconcile conflict between understanding and generation.
Conclusion: DiNA framework represents significant step toward native multimodality, enabling consistent autoregressive modeling across modalities. The open-sourced LongCat-Next model and tokenizers aim to foster further research in unified multimodal systems.
Abstract: The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling a consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source the LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next
[243] Incentivizing Temporal-Awareness in Egocentric Video Understanding Models
Zhiyang Xu, Tian Qin, Bowen Jin, Zhengfeng Lai, Meng Cao, Lifu Huang, Peng Zhang
Main category: cs.CV
TL;DR: TGPO is a reinforcement learning algorithm that improves temporal awareness in multimodal LLMs by contrasting outputs from ordered vs shuffled video frames to reward temporally coherent reasoning.
Details
Motivation: Current MLLMs lack temporal awareness in egocentric settings, relying on frame-level spatial shortcuts rather than understanding event ordering and evolution, which limits their ability to reason about temporal sequences.
Method: Temporal Global Policy Optimization (TGPO) uses reinforcement learning with verifiable rewards (RLVR) that contrasts model outputs from temporally ordered versus shuffled video frames to create calibrated, globally normalized reward signals favoring temporal coherence.
Result: Experiments across five egocentric video benchmarks show TGPO consistently improves temporal grounding and causal coherence, outperforming prior RL-based video reasoning approaches and effectively suppressing spatial shortcut behaviors.
Conclusion: TGPO provides a simple and scalable pathway toward temporally robust MLLMs for egocentric video understanding by explicitly rewarding temporal awareness through contrastive reinforcement learning.
Abstract: Multimodal large language models (MLLMs) have recently shown strong performance in visual understanding, yet they often lack temporal awareness, particularly in egocentric settings where reasoning depends on the correct ordering and evolution of events. This deficiency stems in part from training objectives that fail to explicitly reward temporal reasoning and instead rely on frame-level spatial shortcuts. To address this limitation, we propose Temporal Global Policy Optimization (TGPO), a reinforcement learning with verifiable rewards (RLVR) algorithm designed to incentivize temporal awareness in MLLMs. TGPO contrasts model outputs generated from temporally ordered versus shuffled video frames to derive calibrated, globally normalized reward signals that explicitly favor temporally coherent reasoning. Integrated with GRPO and GSPO, TGPO supports cold-start RL training and effectively suppresses spatial shortcut behaviors learned by existing MLLMs. Experiments across five egocentric video benchmarks demonstrate that TGPO consistently improves temporal grounding and causal coherence, outperforming prior RL-based video reasoning approaches. Our results suggest that TGPO offers a simple and scalable pathway toward temporally robust MLLMs for egocentric video understanding.
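The core TGPO idea above, contrasting a model's score on ordered frames against scores on shuffled frames to get a globally normalized reward, can be sketched in a few lines. Everything here is a toy stand-in under stated assumptions: `score_fn` substitutes for the model's answer-quality signal, and the z-score normalization is one plausible reading of "calibrated, globally normalized", not the paper's exact formula:

```python
import random
import statistics

def tgpo_style_reward(score_fn, frames, n_shuffles=8, seed=0):
    """Reward the ordered-frame output relative to shuffled-frame baselines:
    r = (s_ordered - mean(s_shuffled)) / (std(s_shuffled) + eps)."""
    rng = random.Random(seed)
    s_ordered = score_fn(frames)
    baseline = []
    for _ in range(n_shuffles):
        shuffled = frames[:]
        rng.shuffle(shuffled)
        baseline.append(score_fn(shuffled))
    mu = statistics.mean(baseline)
    sigma = statistics.pstdev(baseline)
    return (s_ordered - mu) / (sigma + 1e-8)

# Toy score: fraction of adjacent frame pairs in increasing temporal order.
def order_score(frames):
    return sum(a < b for a, b in zip(frames, frames[1:])) / max(len(frames) - 1, 1)

print(tgpo_style_reward(order_score, list(range(6))))  # positive: order helps
```

A policy that only exploits frame-level spatial shortcuts would score similarly on ordered and shuffled inputs, so its reward collapses toward zero, which is precisely the shortcut-suppressing behavior the summary describes.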
[244] MotionRFT: Unified Reinforcement Fine-Tuning for Text-to-Motion Generation
Xiaofeng Tan, Wanjiang Weng, Hongsong Wang, Fang Zhao, Xin Geng, Liang Wang
Main category: cs.CV
TL;DR: A reinforcement fine-tuning framework (MotionReward + EasyTune) for text-to-motion generation that improves semantic alignment, realism, and human preference through efficient post-training optimization.
Details
Motivation: Existing text-to-motion generation models using diffusion/flow methods have insufficient supervised pretraining for high-level objectives like semantic consistency, realism, and human preference. Current post-training methods are limited by representation specificity, single-aspect optimization, and high computational costs.
Method: Proposes MotionReward - a heterogeneous-representation, multi-dimensional reward model that maps different motion representations into a shared semantic space anchored by text. Uses Self-refinement Preference Learning to enhance semantics without extra annotations. Also introduces EasyTune - an efficient fine-tuning method that optimizes step-wise rather than full trajectory to overcome recursive gradient dependence bottlenecks.
Result: Achieves FID 0.132 at 22.10 GB peak memory for MLD model, saving up to 15.22 GB over DRaFT. Reduces FID by 22.9% on joint-based ACMDM, and achieves 12.6% R-Precision gain and 23.3% FID improvement on rotation-based HY Motion.
Conclusion: The framework effectively addresses limitations of existing post-training methods by providing unified semantic representation, multi-dimensional reward learning, and efficient fine-grained optimization for text-to-motion generation.
Abstract: Text-to-motion generation has advanced with diffusion- and flow-based generative models, yet supervised pretraining remains insufficient to align models with high-level objectives such as semantic consistency, realism, and human preference. Existing post-training methods have key limitations: they (1) target a specific motion representation, such as joints; (2) optimize a particular aspect, such as text-motion alignment, and may compromise other factors; and (3) incur substantial computational overhead, data dependence, and coarse-grained optimization. We present a reinforcement fine-tuning framework that comprises a heterogeneous-representation, multi-dimensional reward model, MotionReward, and an efficient, fine-grained fine-tuning method, EasyTune. To obtain a unified semantic representation, MotionReward maps heterogeneous motions into a shared semantic space anchored by text, enabling multidimensional reward learning; Self-refinement Preference Learning further enhances semantics without additional annotations. For efficient and effective fine-tuning, we identify the recursive gradient dependence across denoising steps as the key bottleneck, and propose EasyTune, which optimizes step-wise rather than over the full trajectory, yielding dense, fine-grained, and memory-efficient updates. Extensive experiments validate the effectiveness of our framework, achieving FID 0.132 at 22.10 GB peak memory for the MLD model and saving up to 15.22 GB over DRaFT. It reduces FID by 22.9% on joint-based ACMDM, and achieves a 12.6% R-Precision gain and 23.3% FID improvement on rotation-based HY Motion. Our project page with code is publicly available.
[245] KαLOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks
David Tschirschwitz, Volker Rodehorst
Main category: cs.CV
TL;DR: KαLOS is a meta-algorithm for evaluating dataset annotation quality in computer vision by resolving spatial correspondence before assessing agreement, enabling standardized benchmarking across diverse tasks.
Details
Motivation: Progress in object detection benchmarks is stagnating due to inability to distinguish model improvements from label noise. Current metrics fail to handle instance correspondence problems, and validating agreement metrics is circular due to lack of objective ground truth for agreement.
Method: Proposes KαLOS, a unified meta-algorithm that generalizes the “Localization First” principle. It resolves spatial correspondence before assessing agreement, transforming complex spatio-categorical problems into nominal reliability matrices. Uses data-driven configuration by statistically calibrating localization parameters to inherent agreement distribution.
Result: Enables granular diagnostics beyond single scores, including annotator vitality, collaboration clustering, and localization sensitivity. Introduces empirically derived noise generator for validation that models complex human variability rather than uniform error assumptions.
Conclusion: KαLOS establishes a robust standard for distinguishing signal from noise in modern computer vision benchmarks, restoring trust in benchmarking through rigorous quantification of annotation consistency.
Abstract: Progress in object detection benchmarks is stagnating. It is limited not by architectures but by the inability to distinguish model improvements from label noise. To restore trust in benchmarking, the field requires rigorous quantification of annotation consistency to ensure the reliability of evaluation data. However, standard statistical metrics fail to handle the instance correspondence problem inherent to vision tasks. Furthermore, validating new agreement metrics remains circular because no objective ground truth for agreement exists. This forces reliance on unverifiable heuristics. We propose KαLOS (KALOS), a unified meta-algorithm that generalizes the “Localization First” principle to standardize dataset quality evaluation. By resolving spatial correspondence before assessing agreement, our framework transforms complex spatio-categorical problems into nominal reliability matrices. Unlike prior heuristic implementations, KαLOS employs a principled, data-driven configuration; by statistically calibrating the localization parameters to the inherent agreement distribution, it generalizes to diverse tasks ranging from bounding boxes to volumetric segmentation or pose estimation. This standardization enables granular diagnostics beyond a single score. These include annotator vitality, collaboration clustering, and localization sensitivity. To validate this approach, we introduce a novel and empirically derived noise generator. Where prior validations relied on uniform error assumptions, our controllable testbed models complex and non-isotropic human variability. This provides evidence of the metric’s properties and establishes KαLOS as a robust standard for distinguishing signal from noise in modern computer vision benchmarks.
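The "Localization First" principle, resolving spatial correspondence before assessing agreement, can be illustrated in miniature. The sketch below greedily pairs two annotators' boxes by IoU and then measures label agreement only on corresponded instances; this is a deliberately simplified stand-in (greedy matching, simple percent agreement instead of a chance-corrected coefficient, a fixed IoU threshold rather than the paper's calibrated parameters):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def localization_first_agreement(ann_a, ann_b, iou_thr=0.5):
    """Localization first: pair instances across annotators by IoU, then score
    label agreement on the paired instances. Unmatched instances count as
    disagreements, so missing or extra boxes also lower the score."""
    used_b, agree, total = set(), 0, 0
    for box_a, label_a in ann_a:
        best_j, best_iou = None, iou_thr
        for j, (box_b, _) in enumerate(ann_b):
            if j not in used_b and iou(box_a, box_b) >= best_iou:
                best_j, best_iou = j, iou(box_a, box_b)
        total += 1
        if best_j is not None:
            used_b.add(best_j)
            agree += label_a == ann_b[best_j][1]
    total += len(ann_b) - len(used_b)  # annotator-B instances nobody matched
    return agree / total if total else 1.0

a = [((0, 0, 10, 10), "cat"), ((20, 20, 30, 30), "dog")]
b = [((1, 1, 10, 10), "cat"), ((21, 20, 30, 30), "bird")]
print(localization_first_agreement(a, b))  # one of two matched pairs agrees → 0.5
```

Once correspondence is fixed like this, the paired labels form exactly the kind of nominal reliability matrix the abstract says standard coefficients (e.g. Krippendorff's alpha) can consume.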
[246] Let Triggers Control: Frequency-Aware Dropout for Effective Token Control
Junyoung Koh, Hoyeon Moon, Dongha Kim, Seungmin Lee, Sanghyun Park, Min Song
Main category: cs.CV
TL;DR: Frequency-Aware Dropout (FAD) improves controllability in text-to-image personalization by disentangling trigger tokens from context through co-occurrence analysis and curriculum scheduling, without adding parameters.
Details
Motivation: Current text-to-image personalization methods using LoRA with trigger tokens suffer from poor controllability because trigger tokens become entangled with surrounding context during fine-tuning, losing semantic distinctiveness.
Method: Proposes Frequency-Aware Dropout (FAD) with two components: 1) co-occurrence analysis to identify token-context relationships, and 2) curriculum-inspired scheduling that gradually adjusts dropout rates based on token frequency to disentangle representations.
Result: Demonstrates consistent improvements in prompt fidelity, stylistic precision, and user-perceived quality across multiple models (SD 1.5, SDXL, FLUX, Qwen-Image) without adding parameters or architectural changes.
Conclusion: FAD provides a simple yet effective regularization technique that enhances controllability and personalization in text-to-image generation with minimal computational overhead, making it readily applicable to existing models.
Abstract: Text-to-image models such as Stable Diffusion have achieved unprecedented levels of high-fidelity visual synthesis. As these models advance, personalization of generative models – commonly facilitated through Low-Rank Adaptation (LoRA) with a dedicated trigger token – has become a significant area of research. Previous works have naively assumed that fine-tuning with a single trigger token suffices to represent new concepts. However, this often results in poor controllability, where the trigger token alone fails to reliably evoke the intended concept. We attribute this issue to the frequent co-occurrence of the trigger token with the surrounding context during fine-tuning, which entangles their representations and compromises the token’s semantic distinctiveness. To disentangle this, we propose Frequency-Aware Dropout (FAD) – a novel regularization technique that improves prompt controllability without adding new parameters. FAD consists of two key components: co-occurrence analysis and curriculum-inspired scheduling. Qualitative and quantitative analyses across token-based diffusion models (SD 1.5 and SDXL) and natural language–driven backbones (FLUX and Qwen-Image) demonstrate consistent gains in prompt fidelity, stylistic precision, and user-perceived quality. Our method provides a simple yet effective dropout strategy that enhances controllability and personalization in text-to-image generation. Notably, it achieves these improvements without introducing additional parameters or architectural modifications, making it readily applicable to existing models with minimal computational overhead.
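The two FAD components named above, co-occurrence analysis and curriculum-inspired scheduling, can be sketched on caption tokens. This is a rough stdlib illustration under my own assumptions (linear frequency-to-probability mapping, linear curriculum, whitespace tokenization); the paper's actual schedule and statistics may differ:

```python
import random
from collections import Counter

def fad_dropout_probs(captions, trigger, max_p=0.9):
    """Co-occurrence analysis: context tokens that co-occur with the trigger
    more often get a higher dropout probability, so the trigger token cannot
    lean on them and entangle its representation with theirs."""
    cooc = Counter()
    for cap in captions:
        toks = cap.split()
        if trigger in toks:
            cooc.update(t for t in toks if t != trigger)
    top = max(cooc.values(), default=1)
    return {t: max_p * c / top for t, c in cooc.items()}

def apply_fad(caption, trigger, probs, progress, rng):
    """Curriculum-inspired scheduling: scale dropout by training progress
    in [0, 1], so context is removed gradually rather than all at once."""
    kept = [t for t in caption.split()
            if t == trigger or rng.random() >= probs.get(t, 0.0) * progress]
    return " ".join(kept)

caps = ["sks dog in park", "sks dog on beach", "sks dog at home"]
probs = fad_dropout_probs(caps, "sks")
print(probs["dog"])  # co-occurs in every caption → highest dropout probability
print(apply_fad("sks dog in park", "sks", probs, progress=1.0,
                rng=random.Random(0)))
```

Late in training, the always-co-occurring word "dog" is dropped most aggressively, forcing the trigger token "sks" to carry the concept on its own.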
[247] CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models
Kesheng Chen, Yamin Hu, Qi Zhou, Zhenqian Zhu, Wenjian Luo
Main category: cs.CV
TL;DR: VLMs often override visual evidence in favor of commonsense knowledge, creating “commonsense-driven hallucinations” - a reliability issue evaluated through CDH-Bench with visual-commonsense conflicts.
Details
Motivation: While VLMs perform well on benchmarks, their reliability when visual evidence conflicts with commonsense knowledge remains unexplored. The paper aims to investigate whether models follow visual evidence or default to commonsense alternatives in such conflicts.
Method: Introduces CDH-Bench, a benchmark with visual evidence-commonsense conflicts across three dimensions: counting anomalies, relational anomalies, and attribute anomalies. Evaluates frontier VLMs using binary QA and multiple-choice QA with metrics like Counterfactual Accuracy, Commonsense Accuracy, and Commonsense Collapse Rate.
Result: Even strong VLMs remain vulnerable to prior-driven normalization under visual-commonsense conflicts, demonstrating commonsense-driven hallucinations where models override visual evidence in favor of commonsense alternatives.
Conclusion: CDH-Bench provides a controlled diagnostic tool for evaluating visual fidelity in VLMs when visual evidence conflicts with commonsense knowledge, revealing a fundamental reliability issue that needs addressing in multimodal models.
Abstract: Vision-language models (VLMs) achieve strong performance on many benchmarks, yet a basic reliability question remains underexplored: when visual evidence conflicts with commonsense, do models follow what is shown or what commonsense suggests? A characteristic failure in this setting is that the model overrides visual evidence and outputs the commonsense alternative. We term this phenomenon commonsense-driven hallucination (CDH). To evaluate it, we introduce CDH-Bench, a benchmark designed to create explicit visual evidence–commonsense conflicts. CDH-Bench covers three dimensions: counting anomalies, relational anomalies, and attribute anomalies. We evaluate frontier VLMs under binary Question Answering (QA) and multiple-choice QA, and report metrics including Counterfactual Accuracy (CF-Acc), Commonsense Accuracy (CS-Acc), Counterfactual Accuracy Drop (CFAD), Commonsense Collapse Rate (CCR), and Relative Prior Dependency (RPD). Results show that even strong models remain vulnerable to prior-driven normalization under visual evidence–commonsense conflict. CDH-Bench provides a controlled diagnostic of visual fidelity under visual evidence–commonsense conflict.
[248] Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models
Ji Ma, Wei Suo, Peng Wang, Yanning Zhang
Main category: cs.CV
TL;DR: MCoT models suffer from hallucinations in associative reasoning steps (divergent thinking), requiring targeted intervention strategies different from traditional LVLM approaches.
Details
Motivation: While visual attention decay is known in LVLMs, this paper investigates whether MCoT models have unique hallucination causes due to their different reasoning processes, particularly focusing on associative reasoning steps.
Method: Systematically investigate MCoT hallucination patterns, identify divergent thinking steps as primary source of fabricated texts, then develop strategy to localize these steps and intervene in decoding process to mitigate hallucinations.
Result: Method outperforms existing approaches by large margin and can be integrated with other hallucination mitigation methods to further boost performance.
Conclusion: MCoT models have unique hallucination patterns in divergent thinking steps requiring specialized intervention strategies, with proposed method effectively addressing these issues while being compatible with existing approaches.
Abstract: Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capability in complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems due to diminished visual attention during the generation process. However, visual attention decay is a well-studied problem in Large Vision-Language Models (LVLMs). Considering the fundamental differences in reasoning processes between MCoT models and traditional LVLMs, we raise a basic question: do MCoT models have unique causes of hallucinations? To answer this question, we systematically investigate the hallucination patterns of MCoT models and find that fabricated texts are primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that can effectively localize divergent thinking steps and intervene in the decoding process to mitigate hallucinations. Extensive experiments show that our method outperforms existing methods by a large margin. More importantly, our proposed method can be conveniently integrated with other hallucination mitigation methods and further boost their performance. The code is publicly available at https://github.com/ASGO-MM/MCoT-hallucination.
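The paper's localization strategy is not detailed here, but the notion of flagging "divergent thinking" steps can be illustrated with a crude heuristic: a reasoning step whose content words barely overlap the visually grounded vocabulary is a candidate for decoding intervention. Everything in this sketch (the stopword list, the overlap threshold, the function name) is my own illustrative assumption, not the paper's method:

```python
def flag_divergent_steps(steps, grounded_vocab, min_overlap=0.34):
    """Flag reasoning steps poorly grounded in visually detected concepts.
    Steps below the overlap threshold are candidates for intervention."""
    stop = {"the", "a", "an", "is", "are", "in", "on", "of", "and", "to", "it"}
    flagged = []
    for i, step in enumerate(steps):
        words = [w.strip(".,").lower() for w in step.split()]
        content = [w for w in words if w and w not in stop]
        if not content:
            continue
        overlap = sum(w in grounded_vocab for w in content) / len(content)
        if overlap < min_overlap:
            flagged.append(i)
    return flagged

grounded = {"dog", "ball", "grass", "park"}
steps = [
    "The dog holds a ball on the grass.",        # grounded in the image
    "It is probably celebrating its birthday.",  # associative, divergent step
]
print(flag_divergent_steps(steps, grounded))  # → [1]
```

A real system would score grounding with attention maps or detector confidences rather than word overlap, but the pipeline shape (score each step, flag outliers, intervene on those) is the same.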
[249] Make It Up: Fake Images, Real Gains in Generalized Few-shot Semantic Segmentation
Guohuan Xie, Xin He, Dingying Fan, Le Zhang, Ming-Ming Cheng, Yun Liu
Main category: cs.CV
TL;DR: Syn4Seg enhances few-shot semantic segmentation by generating diverse synthetic images with diffusion models and refining pseudo-labels through support-guided filtering and SAM-based boundary refinement.
Details
Motivation: Generalized few-shot semantic segmentation suffers from limited novel-class appearance coverage under scarce annotations. While diffusion models can generate synthetic data, practical gains are hindered by insufficient coverage and noisy supervision when masks are unavailable or unreliable.
Method: 1) Construct embedding-deduplicated prompt bank for each novel class to generate diverse synthetic images; 2) Support-guided pseudo-label estimation with two-stage refinement: filter low-consistency regions for high-precision seeds, then relabel uncertain pixels with image-adaptive prototypes; 3) Refine boundary-band and unlabeled pixels using constrained SAM-based update for better contour fidelity.
Result: Extensive experiments on PASCAL-5^i and COCO-20^i show consistent improvements in both 1-shot and 5-shot settings, demonstrating synthetic data as a scalable path for GFSS with reliable masks and precise boundaries.
Conclusion: Syn4Seg effectively expands novel-class coverage while improving pseudo-label quality through synthetic data generation and sophisticated refinement techniques, offering a scalable solution for few-shot semantic segmentation.
Abstract: Generalized few-shot semantic segmentation (GFSS) is fundamentally limited by the coverage of novel-class appearances under scarce annotations. While diffusion models can synthesize novel-class images at scale, practical gains are often hindered by insufficient coverage and noisy supervision when masks are unavailable or unreliable. We propose Syn4Seg, a generation-enhanced GFSS framework designed to expand novel-class coverage while improving pseudo-label quality. Syn4Seg first maximizes prompt-space coverage by constructing an embedding-deduplicated prompt bank for each novel class, yielding diverse yet class-consistent synthetic images. It then performs support-guided pseudo-label estimation via a two-stage refinement that i) filters low-consistency regions to obtain high-precision seeds and ii) relabels uncertain pixels with image-adaptive prototypes that combine global (support) and local (image) statistics. Finally, we refine only boundary-band and unlabeled pixels using a constrained SAM-based update to improve contour fidelity without overwriting high-confidence interiors. Extensive experiments on PASCAL-5^i and COCO-20^i demonstrate consistent improvements in both 1-shot and 5-shot settings, highlighting synthetic data as a scalable path for GFSS with reliable masks and precise boundaries.
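The two-stage pseudo-label refinement, keep only high-consistency seeds, then relabel uncertain pixels against an image-adaptive prototype, can be shown on a 1-D toy. The thresholds and the scalar-similarity representation below are illustrative assumptions (real pixels carry feature vectors and a SAM-based boundary stage follows, both omitted here):

```python
def refine_pseudo_labels(sims, seed_thr=0.8, band=0.15):
    """Two-stage refinement in miniature. sims[p] is pixel p's similarity to
    the support prototype of the novel class.
    Stage 1: keep only high-consistency pixels as high-precision seeds.
    Stage 2: build an image-adaptive prototype from the seeds and relabel
    uncertain pixels that fall close enough to it."""
    seeds = [p for p, s in enumerate(sims) if s >= seed_thr]
    if not seeds:
        return []
    proto = sum(sims[p] for p in seeds) / len(seeds)  # image-adaptive prototype
    labels = set(seeds)
    for p, s in enumerate(sims):
        if p not in labels and s >= proto - band:
            labels.add(p)
    return sorted(labels)

sims = [0.95, 0.85, 0.78, 0.40, 0.10]
print(refine_pseudo_labels(sims))  # seeds {0, 1}; pixel 2 recovered in stage 2
```

The point of the two stages is visible even in this toy: pixel 2 would be discarded by the strict seed threshold alone, but the adaptive prototype built from confident seeds pulls it back in, while clearly inconsistent pixels stay excluded.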
[250] HD-VGGT: High-Resolution Visual Geometry Transformer
Tianrun Chen, Yuanqi Hu, Yidong Han, Hanjie Xu, Deyi Ji, Qi Zhu, Chunan Yu, Xin Zhang, Cheng Chen, Chaotao Ding, Ying Zang, Xuanfu Li, Jin Ma, Lanyun Zhu
Main category: cs.CV
TL;DR: HD-VGGT introduces a dual-branch transformer architecture for efficient high-resolution 3D reconstruction that addresses computational challenges and feature instability in visually ambiguous regions.
Details
Motivation: High-resolution imagery is crucial for detailed 3D reconstruction, but existing transformer-based approaches like VGGT face prohibitive computational costs when scaling to high resolutions due to quadratic token growth. Additionally, visually ambiguous regions (repetitive patterns, weak textures, specular surfaces) produce unstable feature tokens that degrade geometric inference at higher resolutions.
Method: Proposes HD-VGGT with a dual-branch architecture: 1) low-resolution branch predicts coarse, globally consistent geometry, 2) high-resolution branch refines details via learned feature upsampling. Introduces Feature Modulation to suppress unreliable features early in the transformer, addressing unstable tokens in ambiguous regions. The approach leverages high-resolution supervision without full-resolution transformer costs.
Result: HD-VGGT achieves state-of-the-art reconstruction quality while maintaining computational efficiency. The method successfully handles high-resolution inputs and supervision without the prohibitive costs of full-resolution transformers, demonstrating robust performance in challenging visual conditions.
Conclusion: HD-VGGT provides an effective solution for high-resolution 3D reconstruction by combining coarse-to-fine processing with feature stabilization mechanisms, enabling detailed geometric inference while managing computational complexity and handling visual ambiguities.
Abstract: High-resolution imagery is essential for accurate 3D reconstruction, as many geometric details only emerge at fine spatial scales. Recent feed-forward approaches, such as the Visual Geometry Grounded Transformer (VGGT), have demonstrated the ability to infer scene geometry from large collections of images in a single forward pass. However, scaling these models to high-resolution inputs remains challenging: the number of tokens in transformer architectures grows rapidly with both image resolution and the number of views, leading to prohibitive computational and memory costs. Moreover, we observe that visually ambiguous regions, such as repetitive patterns, weak textures, or specular surfaces, often produce unstable feature tokens that degrade geometric inference, especially at higher resolutions. We introduce HD-VGGT, a dual-branch architecture for efficient and robust high-resolution 3D reconstruction. A low-resolution branch predicts a coarse, globally consistent geometry, while a high-resolution branch refines details via a learned feature upsampling module. To handle unstable tokens, we propose Feature Modulation, which suppresses unreliable features early in the transformer. HD-VGGT leverages high-resolution images and supervision without full-resolution transformer costs, achieving state-of-the-art reconstruction quality.
[251] EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams
JaeSeong Kim, Chaehwan Lim, Sang Hyun Gil, Suan Lee
Main category: cs.CV
TL;DR: EuraGovExam is a multilingual multimodal benchmark using real civil service exam questions from 5 Eurasian regions, presented as single images with minimal instructions to test layout-aware cross-lingual reasoning in vision-language models.
Details
Motivation: Existing benchmarks lack the authentic complexity of real-world public-sector assessments that combine visual elements, multilingual content, and complex layouts. There's a need for benchmarks that reflect cultural realism, visual complexity, and linguistic diversity to properly evaluate vision-language models in high-stakes settings.
Method: Created a dataset of over 8,000 high-resolution scanned multiple-choice questions from real civil service exams across South Korea, Japan, Taiwan, India, and the European Union. Questions cover 17 academic/administrative domains and embed all content (problem statements, answer choices, visual elements) within single images with only minimal standardized instructions for answer formatting.
Result: Even state-of-the-art vision-language models achieve only 86% accuracy on the benchmark, demonstrating its difficulty and ability to diagnose model limitations. The benchmark preserves rich visual structures including tables, multilingual typography, and form-like layouts.
Conclusion: EuraGovExam establishes a new standard for evaluating VLMs in high-stakes, multilingual, image-grounded settings and supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.
Abstract: We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representative Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. Designed to reflect the authentic complexity of public-sector assessments, the dataset contains over 8,000 high-resolution scanned multiple-choice questions covering 17 diverse academic and administrative domains. Unlike existing benchmarks, EuraGovExam embeds all question content–including problem statements, answer choices, and visual elements–within a single image, providing only a minimal standardized instruction for answer formatting. This design demands that models perform layout-aware, cross-lingual reasoning directly from visual input. All items are drawn from real exam documents, preserving rich visual structures such as tables, multilingual typography, and form-like layouts. Evaluation results show that even state-of-the-art vision-language models (VLMs) achieve only 86% accuracy, underscoring the benchmark’s difficulty and its power to diagnose the limitations of current models. By emphasizing cultural realism, visual complexity, and linguistic diversity, EuraGovExam establishes a new standard for evaluating VLMs in high-stakes, multilingual, image-grounded settings. It also supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.
[252] ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Jun Zhao, Kun Xu, Kang Liu
Main category: cs.CV
TL;DR: ResAdapt is an input-side adaptation framework that learns optimal visual budget allocation per frame for multimodal LLMs, enabling higher spatial resolution and longer temporal context without increasing computational cost.
Details
Motivation: Current MLLMs struggle with balancing high spatial resolution and long temporal context due to visual token growth. The bottleneck is in the volume of pixels the encoder receives, not post-encoding compression.
Method: ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone. It formulates allocation as a contextual bandit and trains with Cost-Aware Policy Optimization (CAPO) to convert sparse rollout feedback into stable accuracy-cost learning signals.
Result: ResAdapt improves low-budget operating points, often lies on the efficiency-accuracy frontier, supports up to 16x more frames at same visual budget with over 15% performance gain, especially on reasoning-intensive benchmarks under aggressive compression.
Conclusion: ResAdapt effectively addresses the visual budget allocation problem in MLLMs, enabling better trade-offs between spatial resolution and temporal context without modifying the core MLLM architecture.
Abstract: Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an Input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at https://github.com/Xnhyacinth/ResAdapt.
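The contextual-bandit framing above can be sketched in a few lines. This is a deliberately minimal stand-in: the arm set, the epsilon-greedy policy, and the linear cost penalty are assumptions for illustration; ResAdapt's Allocator conditions on frame content and CAPO is a policy-gradient objective, not a running-mean update.

```python
# Illustrative sketch of per-frame visual budget allocation as a bandit.
# All names and constants here are hypothetical simplifications.
import random

ARMS = [64, 128, 256]   # candidate per-frame visual budgets (tokens)
LAMBDA = 0.001          # price per visual token (hypothetical)

def reward(accuracy, budget):
    """Cost-aware signal: task accuracy minus a penalty on tokens spent."""
    return accuracy - LAMBDA * budget

class Allocator:
    """Epsilon-greedy bandit keeping a running mean reward per budget arm."""
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.value = {a: 0.0 for a in ARMS}
        self.count = {a: 0 for a in ARMS}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(ARMS)                      # explore
        return max(ARMS, key=lambda a: self.value[a])       # exploit

    def update(self, arm, r):
        self.count[arm] += 1
        self.value[arm] += (r - self.value[arm]) / self.count[arm]

alloc = Allocator()
# e.g. a rollout at budget 128 scored 0.72 accuracy:
alloc.update(128, reward(accuracy=0.72, budget=128))
print(alloc.value[128])
```

The key property the sketch preserves is that the backbone is untouched: the learner only decides how many tokens each frame gets before encoding.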
[253] NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather
Yanying Li, Jinyang Li, Shengfeng He, Yangyang Xu, Junyu Dong, Yong Du
Main category: cs.CV
TL;DR: NimbusGS is a unified framework for 3D scene reconstruction from degraded multi-view inputs under diverse adverse weather conditions, using a dual decomposition approach and geometry-guided optimization.
Details
Motivation: Existing methods target specific weather types, but real-world scenarios involve diverse and mixed adverse weather conditions. The challenge is to generalize across different weather types by modeling the dual nature of weather effects: continuous view-consistent atmospheric attenuation and dynamic view-dependent particle scattering/occlusion.
Method: Decomposes degradations into: 1) global transmission field for static atmospheric effects shared across views, and 2) per-view particulate residuals for transient disturbances. Uses geometry-guided gradient scaling to stabilize geometry learning under severe visibility degradation during self-supervised optimization of 3D Gaussian representations.
Result: Superior geometry reconstruction compared to task-specific methods across diverse and challenging weather conditions. The physically grounded formulation disentangles complex degradations while preserving scene structure.
Conclusion: NimbusGS provides a unified solution for 3D scene reconstruction under adverse weather by modeling weather’s dual nature and addressing optimization challenges through geometry-guided gradient scaling.
Abstract: We present NimbusGS, a unified framework for reconstructing high-quality 3D scenes from degraded multi-view inputs captured under diverse and mixed adverse weather conditions. Unlike existing methods that target specific weather types, NimbusGS addresses the broader challenge of generalization by modeling the dual nature of weather: a continuous, view-consistent medium that attenuates light, and dynamic, view-dependent particles that cause scattering and occlusion. To capture this structure, we decompose degradations into a global transmission field and per-view particulate residuals. The transmission field represents static atmospheric effects shared across views, while the residuals model transient disturbances unique to each input. To enable stable geometry learning under severe visibility degradation, we introduce a geometry-guided gradient scaling mechanism that mitigates gradient imbalance during the self-supervised optimization of 3D Gaussian representations. This physically grounded formulation allows NimbusGS to disentangle complex degradations while preserving scene structure, yielding superior geometry reconstruction and outperforming task-specific methods across diverse and challenging weather conditions. Code is available at https://github.com/lyy-ovo/NimbusGS.
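The decomposition above can be made concrete with a toy scattering model. This is a hedged sketch only: the classic `I = J*t + A*(1 - t)` haze equation plus an additive residual is an assumption about the paper's "physically grounded formulation", and the gradient-scaling rule is reduced to a per-pixel multiplier.

```python
# Toy version of the degradation decomposition described above.
# Function names and the airlight constant are illustrative.

def degrade(clean, transmission, residual, airlight=1.0):
    """Observed pixel = clean scene attenuated by the global transmission
    field, plus airlight, plus a per-view particulate residual."""
    return [j * t + airlight * (1 - t) + r
            for j, t, r in zip(clean, transmission, residual)]

def scale_gradient(grad, geometry_confidence):
    """Geometry-guided gradient scaling: damp gradients where visibility
    (and hence geometric evidence) is weak, stabilising optimisation."""
    return [g * c for g, c in zip(grad, geometry_confidence)]

clean = [0.8, 0.2]
t = [0.5, 0.9]      # global transmission, shared across views
res = [0.05, 0.0]   # per-view transient particles (e.g. a rain streak)
print(degrade(clean, t, res))
```

Because the transmission field is shared across views while the residuals are per-view, fitting both jointly separates the static atmosphere from transient particles, which is the disentanglement the abstract describes.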
[254] An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving
Yi Feng, Junwu E, Zizhan Guo, Yu Ma, Hanli Wang, Rui Fan
Main category: cs.CV
TL;DR: ADMesh and CarlaOcc: A unified 3D mesh library and physically consistent panoptic occupancy dataset for autonomous driving, addressing limitations in existing 3D occupancy prediction benchmarks.
Details
Motivation: Current panoptic occupancy prediction lacks high-quality 3D mesh resources, instance-level annotations, and physically consistent datasets, limiting precise geometric reconstruction and holistic 3D understanding.
Method: Introduces ADMesh (15K+ high-quality 3D models with textures and semantic annotations) and CarlaOcc (100K+ frames of physically consistent panoptic occupancy data generated via CARLA simulator with 0.05m voxel resolution).
Result: Created first unified 3D mesh library for autonomous driving and large-scale panoptic occupancy dataset with instance-level ground truth, plus standardized evaluation metrics and systematic benchmark of existing models.
Conclusion: Provides comprehensive resources and evaluation framework to advance 3D panoptic perception research, enabling fair comparison and reproducible work in autonomous driving scene understanding.
Abstract: Panoptic occupancy prediction aims to jointly infer voxel-wise semantics and instance identities within a unified 3D scene representation. Nevertheless, progress in this field remains constrained by the absence of high-quality 3D mesh resources, instance-level annotations, and physically consistent occupancy datasets. Existing benchmarks typically provide incomplete and low-resolution geometry without instance-level annotations, limiting the development of models capable of achieving precise geometric reconstruction, reliable occlusion reasoning, and holistic 3D understanding. To address these challenges, this paper presents an instance-centric benchmark for the 3D panoptic occupancy prediction task. Specifically, we introduce ADMesh, the first unified 3D mesh library tailored for autonomous driving, which integrates over 15K high-quality 3D models with diverse textures and rich semantic annotations. Building upon ADMesh, we further construct CarlaOcc, a large-scale, physically consistent panoptic occupancy dataset generated using the CARLA simulator. This dataset contains over 100K frames with fine-grained, instance-level occupancy ground truth at voxel resolutions as fine as 0.05 m. Furthermore, standardized evaluation metrics are introduced to quantify the quality of existing occupancy datasets. Finally, a systematic benchmark of representative models is established on the proposed dataset, which provides a unified platform for fair comparison and reproducible research in the field of 3D panoptic perception. Code and dataset are available at https://mias.group/CarlaOcc.
[255] Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection
Jinhu Fu, Yihang Lou, Qingyi Si, Shudong Zhang, Yan Bai, Sen Su
Main category: cs.CV
TL;DR: CARE framework identifies and repairs unsafe channels in Large Vision-Language Models using causal mediation analysis and dual-modal safety subspace projection.
Details
Motivation: Large Vision-Language Models have impressive multimodal capabilities but their internal safety mechanisms are opaque and poorly controlled, creating risks from unsafe behaviors that need systematic diagnosis and repair.
Method: 1) Causal mediation analysis to identify neurons/layers responsible for unsafe behaviors; 2) Dual-modal safety subspace projection learning generalized safety subspaces for visual/textual modalities via eigen-decomposition; 3) Hybrid fusion mechanism for dynamic projection during inference to balance visual/textual corrections.
Result: Significantly enhances safety robustness on multiple benchmarks without degrading general multimodal capabilities, outperforms prior activation steering and alignment-based baselines, and shows good transferability against unseen attacks.
Conclusion: The CARE framework provides an effective approach for diagnosing and repairing unsafe channels in LVLMs through causal analysis and subspace projection, offering improved safety control while maintaining model capabilities.
Abstract: Large Vision-Language Models (LVLMs) have achieved impressive performance across multimodal understanding and reasoning tasks, yet their internal safety mechanisms remain opaque and poorly controlled. In this work, we present a comprehensive framework for diagnosing and repairing unsafe channels within LVLMs (CARE). We first perform causal mediation analysis to identify neurons and layers that are causally responsible for unsafe behaviors. Based on these findings, we introduce a dual-modal safety subspace projection method that learns generalized safety subspaces for both visual and textual modalities through generalized eigen-decomposition between benign and malicious activations. During inference, activations are dynamically projected toward these safety subspaces via a hybrid fusion mechanism that adaptively balances visual and textual corrections, effectively suppressing unsafe features while preserving semantic fidelity. Extensive experiments on multiple safety benchmarks demonstrate that our causal-subspace repair framework significantly enhances safety robustness without degrading general multimodal capabilities, outperforming prior activation steering and alignment-based baselines. Additionally, our method exhibits good transferability, defending against unseen attacks.
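The projection-and-fusion step above can be sketched minimally. Assumptions: the real method learns its subspaces via generalized eigen-decomposition over benign versus malicious activations, and its fusion weight is adaptive per modality; here the basis is given, orthonormal, and the fusion weight is a fixed scalar.

```python
# Minimal sketch of steering an activation toward a safety subspace.
# Basis, dimensionality, and the fixed alpha are hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(x, basis):
    """Orthogonal projection of x onto span(basis); basis is orthonormal."""
    out = [0.0] * len(x)
    for b in basis:
        c = dot(x, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out

def repair(x, safe_basis, alpha=0.5):
    """Hybrid-fusion stand-in: blend the raw activation with its projection
    onto the safety subspace, suppressing components outside it."""
    p = project(x, safe_basis)
    return [(1 - alpha) * xi + alpha * pi for xi, pi in zip(x, p)]

safe = [[1.0, 0.0, 0.0]]            # 1-D safety subspace in a 3-D toy space
x = [2.0, 4.0, -2.0]                # activation with off-subspace components
print(repair(x, safe, alpha=0.5))   # [2.0, 2.0, -1.0]
```

With `alpha=1.0` the off-subspace components are removed entirely; intermediate values trade suppression of unsafe features against preservation of semantic content, which is the balance the abstract describes.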
[256] SaSaSaSa2VA: 2nd Place of the 5th PVUW MeViS-Text Track
Dengxian Gong, Quanzhu Niu, Shihao Chen, Yuanzheng Wu, Yikang Zhou, Tao Zhang, Haobo Yuan, Lu Qi, Shunping Ji
Main category: cs.CV
TL;DR: SaSaSaSa2VA extends SaSaSa2VA with target existence-aware verification for motion-centric referring video object segmentation, achieving 2nd place in MeViS-Text Track with 89.19 score.
Details
Motivation: The paper addresses the challenge of motion-centric referring video object segmentation (RVOS), where traditional methods focus on static textual cues. The MeViS benchmark introduces motion-centric expressions and no-target queries, requiring new approaches to handle dynamic motion reasoning and target verification.
Method: Extends SaSaSa2VA (which already increased input frames and added [SEG] tokens to the Sa2VA backbone) with a simple target existence-aware verification mechanism. This verification strategy helps determine whether referred targets exist in video frames, particularly important for motion-centric expressions.
Result: Achieved final score of 89.19 in the 5th PVUW Challenge (MeViS-Text Track), securing 2nd place. Both quantitative results and ablation studies show the existence-aware verification strategy effectively unlocks strong performance on motion-centric referring tasks.
Conclusion: The target existence-aware verification mechanism, despite its simplicity, is sufficient to achieve strong performance on motion-centric referring video object segmentation tasks, demonstrating the importance of verifying target presence in dynamic video understanding.
Abstract: Referring video object segmentation (RVOS) commonly grounds targets in videos based on static textual cues. The MeViS benchmark extends this by incorporating motion-centric expressions (referring & reasoning motion expressions) and introducing no-target queries. Extending SaSaSa2VA, where increased input frames and [SEG] tokens already strengthen the Sa2VA backbone, we adopt a simple yet effective target existence-aware verification mechanism, leading to Still Awesome SaSaSa2VA (SaSaSaSa2VA). Despite its simplicity, the method achieves a final score of 89.19 in the 5th PVUW Challenge (MeViS-Text Track), securing 2nd place. Both quantitative results and ablations suggest that this existence-aware verification strategy is sufficient to unlock strong performance on motion-centric referring tasks.
[257] IP-SAM: Prompt-Space Conditioning for Prompt-Absent Camouflaged Object Detection
Huiyao Zhang, Jin Bai, Rui Guo, JianWen Tan, HongFei Wang, Ye Li
Main category: cs.CV
TL;DR: IP-SAM introduces prompt-space conditioning for automatic segmentation by generating intrinsic prompts from image context, enabling frozen prompt encoder usage without external prompts at inference.
Details
Motivation: Existing prompt-conditioned segmentation models require explicit spatial prompts at inference, creating a mismatch for fully automatic segmentation tasks where such prompts are unavailable. Current adaptations bypass the model's native prompt interface, weakening prompt-conditioned decoding capabilities.
Method: Proposes IP-SAM with two key components: 1) Self-Prompt Generator (SPG) that distills image context into complementary intrinsic prompts as coarse regional anchors, projected through SAM2’s frozen prompt encoder; 2) Prompt-Space Gating (PSG) that uses intrinsic background prompt as asymmetric suppressive constraint to reduce false positives.
Result: Achieves state-of-the-art performance on four camouflaged object detection benchmarks (e.g., MAE 0.017 on COD10K) with only 21.26M trainable parameters. Also demonstrates strong zero-shot transfer in medical polyp segmentation from Kvasir-SEG to CVC-ClinicDB and ETIS.
Conclusion: Prompt-space conditioning enables effective adaptation of prompt-conditioned foundation models for fully automatic segmentation while preserving their native prompt-guided decoding capabilities, with strong generalization across domains.
Abstract: Prompt-conditioned foundation segmenters have emerged as a dominant paradigm for image segmentation, where explicit spatial prompts (e.g., points, boxes, masks) guide mask decoding. However, many real-world deployments require fully automatic segmentation, creating a structural mismatch: the decoder expects prompts that are unavailable at inference. Existing adaptations typically modify intermediate features, inadvertently bypassing the model’s native prompt interface and weakening prompt-conditioned decoding. We propose IP-SAM, which revisits adaptation from a prompt-space perspective through prompt-space conditioning. Specifically, a Self-Prompt Generator (SPG) distills image context into complementary intrinsic prompts that serve as coarse regional anchors. These cues are projected through SAM2’s frozen prompt encoder, restoring prompt-guided decoding without external intervention. To suppress background-induced false positives, Prompt-Space Gating (PSG) leverages the intrinsic background prompt as an asymmetric suppressive constraint prior to decoding. Under a deterministic no-external-prompt protocol, IP-SAM achieves state-of-the-art performance across four camouflaged object detection benchmarks (e.g., MAE 0.017 on COD10K) with only 21.26M trainable parameters (optimizing SPG, PSG, and a task-specific mask decoder trained from scratch, alongside image-encoder LoRA while keeping the prompt encoder frozen). Furthermore, the proposed conditioning strategy generalizes beyond COD to medical polyp segmentation, where a model trained solely on Kvasir-SEG exhibits strong zero-shot transfer to both CVC-ClinicDB and ETIS.
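The "asymmetric suppressive constraint" in Prompt-Space Gating can be illustrated with a toy gating rule. This is an assumption-laden sketch: the subtract-only-positive-background form and the `alpha` weight are illustrative, not IP-SAM's exact operator.

```python
# Hedged sketch of asymmetric suppression: background evidence can only
# pull mask logits down, never up. The gating form is hypothetical.

def gate(fg_logits, bg_logits, alpha=1.0):
    """Subtract only the positive part of the background prompt's evidence,
    so regions it is confident about are suppressed and nothing is boosted."""
    return [f - alpha * max(b, 0.0) for f, b in zip(fg_logits, bg_logits)]

fg = [2.0, 1.5, -0.5]   # per-pixel foreground logits from intrinsic prompts
bg = [-1.0, 3.0, 0.5]   # background prompt logits (high = likely background)
print(gate(fg, bg))     # [2.0, -1.5, -1.0]
```

The asymmetry is the point: negative background evidence (first pixel) leaves the foreground score untouched, while confident background evidence (second pixel) vetoes a would-be false positive.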
[258] Zero-shot Vision-Language Reranking for Cross-View Geolocalization
Yunus Talha Erzurumlu, John E. Anderson, William J. Shuart, Charles Toth, Alper Yilmaz
Main category: cs.CV
TL;DR: Zero-shot Vision-Language Models (VLMs) like LLaVA can improve cross-view geolocalization precision through pairwise reranking, despite poor performance with pointwise scoring methods.
Details
Motivation: Cross-view geolocalization systems have high recall but low Top-1 accuracy, creating a need for better precision in identifying the single best match from retrieved candidates.
Method: Two-stage framework: SOTA retrieval followed by VLM reranking with two strategies - pointwise (individual scoring) and pairwise (relative comparison between candidates).
Result: Pointwise methods cause catastrophic performance drops, while pairwise comparison using LLaVA improves Top-1 accuracy over strong retrieval baselines on VIGOR dataset.
Conclusion: VLMs are poorly calibrated for absolute relevance scoring but effective at fine-grained relative visual judgment, making pairwise reranking promising for enhancing CVGL precision.
Abstract: Cross-view geolocalization (CVGL) systems, while effective at retrieving a list of relevant candidates (high Recall@k), often fail to identify the single best match (low Top-1 accuracy). This work investigates the use of zero-shot Vision-Language Models (VLMs) as rerankers to address this gap. We propose a two-stage framework: state-of-the-art (SOTA) retrieval followed by VLM reranking. We systematically compare two strategies: (1) Pointwise (scoring candidates individually) and (2) Pairwise (comparing candidates relatively). Experiments on the VIGOR dataset show a clear divergence: all pointwise methods cause a catastrophic drop in performance or no change at all. In contrast, a pairwise comparison strategy using LLaVA improves Top-1 accuracy over the strong retrieval baseline. Our analysis concludes that, these VLMs are poorly calibrated for absolute relevance scoring but are effective at fine-grained relative visual judgment, making pairwise reranking a promising direction for enhancing CVGL precision.
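The pairwise strategy above amounts to a champion-style pass over the retrieval list. A minimal sketch, assuming a binary "which of these two matches better?" judge; `mock_judge` is a stand-in for a real LLaVA call, and the single-pass tournament is one simple way to realize pairwise reranking, not necessarily the paper's exact protocol.

```python
# Minimal pairwise reranking sketch: the VLM is only ever asked to compare
# two candidates; the winner carries forward as the Top-1 pick.

def rerank_top1(query, candidates, vlm_prefers_first):
    """Keep the current best candidate and challenge it with each
    remaining retrieval candidate in turn."""
    best = candidates[0]
    for challenger in candidates[1:]:
        if not vlm_prefers_first(query, best, challenger):
            best = challenger
    return best

# Hypothetical judge for illustration: prefers the candidate numerically
# closer to the query (a real system would prompt the VLM with two images).
def mock_judge(query, a, b):
    return abs(a - query) <= abs(b - query)

print(rerank_top1(10, [3, 12, 9, 25], mock_judge))  # 9
```

This shape explains the paper's finding: the judge never needs a calibrated absolute score, only a relative preference, which is exactly the regime where the VLMs were found to be reliable.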
[259] Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
Seng Nam Chen, Hao Chen, Chenglam Ho, Xinyu Mao, Jinping Wang, Yu Zhang, Chao Li
Main category: cs.CV
TL;DR: SceneBench: A new benchmark for evaluating vision-language models on scene-level long video understanding, revealing significant context forgetting and proposing Scene-RAG to improve performance.
Details
Motivation: Current vision-language models struggle with long video understanding, as existing benchmarks focus on fine-grained perception or coarse summarization rather than temporal understanding over long contexts. The authors aim to assess whether VLMs can reason effectively over scene-level contexts.
Method: Defines a scene as a coherent video segment with consistent visual and semantic contexts. Introduces SceneBench benchmark with scene-level challenges. Proposes Scene Retrieval-Augmented Generation (Scene-RAG) that constructs dynamic scene memory by retrieving and integrating relevant context across scenes.
Result: Evaluation shows sharp accuracy drop when VLMs answer scene-level questions, indicating significant forgetting of long-range context. Scene-RAG improves VLM performance by +2.50%, confirming models struggle with long-context retention.
Conclusion: Current VLMs have limitations in long-context retention for video understanding. SceneBench provides a valuable benchmark for evaluating scene-level comprehension, and Scene-RAG demonstrates potential for improving long-range context handling in multimodal models.
Abstract: Long video understanding (LVU) remains a core challenge in multimodal learning. Although recent vision-language models (VLMs) have made notable progress, existing benchmarks mainly focus on either fine-grained perception or coarse summarization, offering limited insight into temporal understanding over long contexts. In this work, we define a scene as a coherent segment of a video in which both visual and semantic contexts remain consistent, aligning with human perception. This leads us to a key question: can current VLMs reason effectively over long, scene-level contexts? To answer this, we introduce a new benchmark, SceneBench, designed to provide scene-level challenges. Our evaluation reveals a sharp drop in accuracy when VLMs attempt to answer scene-level questions, indicating significant forgetting of long-range context. To further validate these findings, we propose Scene Retrieval-Augmented Generation (Scene-RAG), which constructs a dynamic scene memory by retrieving and integrating relevant context across scenes. This Scene-RAG improves VLM performance by +2.50%, confirming that current models still struggle with long-context retention. We hope SceneBench will encourage future research toward VLMs with more robust, human-like video comprehension.
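The retrieval step of Scene-RAG can be sketched as ranking stored scene summaries against the question. Assumptions: the bag-of-words similarity, the text-only scene memory, and the top-k cutoff are all illustrative stand-ins for whatever embedding-based retrieval the paper actually uses.

```python
# Toy Scene-RAG retrieval: score each scene summary against the question
# and keep the top-k as context. Similarity function is a crude stand-in.

def similarity(query, scene_text):
    q, s = set(query.lower().split()), set(scene_text.lower().split())
    return len(q & s) / max(len(q), 1)

def retrieve_scenes(question, scene_memory, k=2):
    """Rank scene summaries by word overlap with the question, keep top-k."""
    ranked = sorted(scene_memory, key=lambda s: similarity(question, s),
                    reverse=True)
    return ranked[:k]

memory = [
    "scene 1: a chef chops onions in the kitchen",
    "scene 2: two players argue on the tennis court",
    "scene 3: the chef plates the finished dish",
]
print(retrieve_scenes("what does the chef cook", memory, k=2))
```

The retrieved summaries would then be prepended to the VLM's prompt, reinstating long-range context the model has otherwise "forgotten".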
[260] TrendGen: An Outfit Recommendation and Display System
Theodoros Koukopoulos, Dimos Klimenof, Ioannis Xarchakos
Main category: cs.CV
TL;DR: TrendGen is a Fashion AI system that generates trend-aligned outfit recommendations and transforms raw garment images into high-quality lay-down views for e-commerce applications.
Details
Motivation: Raw fashion images often suffer from inconsistent lighting, non-ideal angles, complex backgrounds, and occlusions, which hinder the development of robust fashion AI systems for real-world applications like online shopping.
Method: TrendGen leverages cloth images and product attributes to generate cohesive outfit recommendations, and uses Generative AI to transform raw images into high-quality lay-down views with clear, structured presentation of garments.
Result: Evaluation on production data demonstrates TrendGen’s consistent high-quality outfit suggestions and lay-down image generation, showing significant advancement for fashion retail applications.
Conclusion: TrendGen represents an effective AI-driven solution for enhancing online shopping experiences through intelligent outfit recommendations and improved visual presentation of fashion products.
Abstract: Recent advances in Computer Vision have significantly improved image understanding and generation, revolutionizing the fashion industry. However, challenges such as inconsistent lighting, non-ideal garment angles, complex backgrounds, and occlusions in raw images hinder their full potential. Overcoming these obstacles is crucial for developing robust fashion AI systems capable of real-world applications. In this paper, we introduce TrendGen, a Fashion AI system designed to enhance online shopping with intelligent outfit recommendations. Deployed on a major e-commerce platform, TrendGen leverages cloth images and product attributes to generate trend-aligned, cohesive outfit suggestions. Additionally, it employs Generative AI to transform raw images into high-quality lay-down views, offering a clear and structured presentation of garments. Our evaluation on production data demonstrates TrendGen’s consistent high-quality outfits and lay-down images, marking a significant advancement in AI-driven solutions for fashion retail.
[261] TrackMAE: Video Representation Learning via Track Mask and Predict
Renaud Vandeghen, Fida Mohammad Thoker, Marc Van Droogenbroeck, Bernard Ghanem
Main category: cs.CV
TL;DR: TrackMAE improves masked video modeling by explicitly using motion trajectories as reconstruction targets, enhancing temporal dynamics encoding for better motion-centric tasks.
Details
Motivation: Standard masked video modeling (MVM) encodes motion information only implicitly, limiting temporal dynamics learning and performance on motion-centric tasks requiring fine-grained motion awareness.
Method: Uses off-the-shelf point tracker to generate motion trajectories from input videos, employs motion-aware masking strategy, and reconstructs both pixel/semantic features with motion trajectories as complementary supervision.
Result: Consistently outperforms state-of-the-art video self-supervised learning baselines across six diverse datasets, learning more discriminative and generalizable representations.
Conclusion: Explicit motion supervision in masked video modeling significantly improves video representation learning, especially for motion-centric tasks, through motion-aware masking and trajectory reconstruction.
Abstract: Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but only encodes motion information implicitly, limiting the encoding of temporal dynamics in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. In TrackMAE, we use an off-the-shelf point tracker to sparsely track points in the input videos, generating motion trajectories. Furthermore, we exploit the extracted trajectories to improve random tube masking with a motion-aware masking strategy. We enhance video representations learned in both pixel and feature semantic reconstruction spaces by providing a complementary supervision signal in the form of motion targets. We evaluate on six datasets across diverse downstream settings and find that TrackMAE consistently outperforms state-of-the-art video self-supervised learning baselines, learning more discriminative and generalizable representations. Code available at https://github.com/rvandeghen/TrackMAE
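The motion-aware masking idea can be sketched from the track trajectories alone. A simplification is assumed here: tubes are masked deterministically by ranking on total displacement, whereas the paper's strategy modifies random tube masking; 1-D tracks stand in for 2-D point trajectories.

```python
# Sketch of motion-aware masking: tubes whose tracked points move more are
# masked preferentially, forcing reconstruction of dynamic content.
# Deterministic ranking variant for clarity (the real strategy is stochastic).

def motion_magnitude(track):
    """Total displacement of one point track across frames."""
    return sum(abs(b - a) for a, b in zip(track, track[1:]))

def motion_aware_mask(tracks, mask_ratio=0.5):
    """Return indices of the tubes to mask, biased toward high motion."""
    n_mask = int(len(tracks) * mask_ratio)
    order = sorted(range(len(tracks)),
                   key=lambda i: motion_magnitude(tracks[i]), reverse=True)
    return sorted(order[:n_mask])

tracks = [
    [0, 0, 0, 0],   # static background point
    [0, 2, 5, 9],   # fast-moving point
    [0, 1, 1, 2],   # slow-moving point
    [3, 3, 3, 3],   # static
]
print(motion_aware_mask(tracks, mask_ratio=0.5))  # [1, 2]
```

Masking the moving tubes (indices 1 and 2) rather than static ones is what turns the trajectories into an explicit motion supervision signal instead of an implicit by-product of pixel reconstruction.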
[262] Human-Centric Perception for Child Sexual Abuse Imagery
Camila Laranjeira, João Macedo, Sandra Avila, Fabrício Benevenuto, Jefersson A. dos Santos
Main category: cs.CV
TL;DR: Paper introduces BKPD dataset and methods for pose estimation/detection in CSAI classification, focusing on explainable pipelines using body keypoints and parts.
Details
Motivation: Law enforcement agencies need automation tools for CSAI classification, but current methods use black-box approaches targeting abstract concepts like pornography. There's a need for more objective and explainable pipelines using human-centric perception tasks.
Method: Created Body-Keypoint-Part Dataset (BKPD) with images across age groups and sexual explicitness, with hierarchical labels for skeletal keypoints and body part bounding boxes. Proposed BKP-Association and YOLO-BKP methods for simultaneous pose estimation and detection with per-individual target association.
Result: Methods benchmarked on COCO-Keypoints and COCO-HumanParts, achieving competitive results with joint-task models. Cross-domain studies on BKPD and case study on RCPD highlight challenges in sexually explicit domains.
Conclusion: Study addresses unexplored targets in CSAI domain, paving way for novel research opportunities in explainable CSAI classification using human-centric perception approaches.
Abstract: Law enforcement agencies and non-governmental organizations handling reports of Child Sexual Abuse Imagery (CSAI) are overwhelmed by large volumes of data, requiring the aid of automation tools. However, defining sexual abuse in images of children is inherently challenging, encompassing sexually explicit activities and hints of sexuality conveyed by the individual’s pose, or their attire. CSAI classification methods often rely on black-box approaches, targeting broad and abstract concepts such as pornography. Thus, our work is an in-depth exploration of tasks from the literature on Human-Centric Perception, across the domains of safe images, adult pornography, and CSAI, focusing on targets that enable more objective and explainable pipelines for CSAI classification in the future. We introduce the Body-Keypoint-Part Dataset (BKPD), gathering images of people from varying age groups and sexual explicitness to approximate the domain of CSAI, along with manually curated hierarchically structured labels for skeletal keypoints and bounding boxes for person and body parts, including head, chest, hip, and hands. We propose two methods, namely BKP-Association and YOLO-BKP, for simultaneous pose estimation and detection, with targets associated per individual for a comprehensive decomposed representation of each person. Our methods are benchmarked on COCO-Keypoints and COCO-HumanParts, as well as our human-centric dataset, achieving competitive results with models that jointly perform all tasks. Cross-domain ablation studies on BKPD and a case study on RCPD highlight the challenges posed by sexually explicit domains. Our study addresses previously unexplored targets in the CSAI domain, paving the way for novel research opportunities.
[263] Class-Distribution Guided Active Learning for 3D Occupancy Prediction in Autonomous Driving
Wonjune Kim, In-Jae Lee, Sihwan Hwang, Sanmin Kim, Dongsuk Kum
Main category: cs.CV
TL;DR: Active learning framework for 3D occupancy prediction that selects training samples using class-distribution guidance to address class imbalance and annotation costs in autonomous driving.
Details
Motivation: 3D occupancy prediction suffers from severe class imbalance (safety-critical objects occupy minimal voxels) and costly voxel-level annotation. Current approaches inefficiently annotate dominant classes while neglecting rare but important objects.
Method: Proposes class-distribution guided active learning with three complementary criteria: inter-sample diversity (prioritizes samples with different predicted class distributions), intra-set diversity (prevents redundant sampling), and frequency-weighted uncertainty (emphasizes rare classes by reweighting voxel-level entropy with inverse per-sample class proportions). Uses geographically disjoint train/validation split to reduce map memorization.
Result: Achieves 26.62 mIoU with only 42.4% labeled data, comparable to full supervision and outperforming active learning baselines at same budget. Validates generality on SemanticKITTI with different architecture, showing consistent effectiveness across datasets.
Conclusion: The proposed active learning framework effectively addresses class imbalance and annotation costs in 3D occupancy prediction, achieving near-full-supervision performance with significantly less labeled data while maintaining generalizability across datasets.
Abstract: 3D occupancy prediction provides dense spatial understanding critical for safe autonomous driving. However, this task suffers from a severe class imbalance due to its volumetric representation, where safety-critical objects (bicycles, traffic cones, pedestrians) occupy minimal voxels compared to dominant backgrounds. Additionally, voxel-level annotation is costly, yet dedicating effort to dominant classes is inefficient. To address these challenges, we propose a class-distribution guided active learning framework for selecting training samples to annotate in autonomous driving datasets. Our approach combines three complementary criteria to select the training samples. Inter-sample diversity prioritizes samples whose predicted class distributions differ from those of the labeled set, intra-set diversity prevents redundant sampling within each acquisition cycle, and frequency-weighted uncertainty emphasizes rare classes by reweighting voxel-level entropy with inverse per-sample class proportions. We ensure evaluation validity by using a geographically disjoint train/validation split of Occ3D-nuScenes, which reduces train-validation overlap and mitigates potential map memorization. With only 42.4% labeled data, our framework reaches 26.62 mIoU, comparable to full supervision and outperforming active learning baselines at the same budget. We further validate generality on SemanticKITTI using a different architecture, demonstrating consistent effectiveness across datasets.
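The frequency-weighted uncertainty criterion above is concrete enough to sketch. The function below is an illustrative reading of the summary (function name, shapes, and the entropy-times-inverse-frequency form are my assumptions, not the authors' released code): per-voxel predictive entropy is reweighted by the inverse of the sample's own predicted class proportions, so samples containing uncertain rare-class voxels score higher for acquisition.

```python
import numpy as np

def frequency_weighted_uncertainty(probs, eps=1e-8):
    """Acquisition score for one sample.

    probs: (V, C) array of per-voxel class probabilities.
    Voxels predicted as rare classes get up-weighted entropy,
    so samples containing rare objects score higher.
    """
    # Per-voxel predictive entropy.
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)   # (V,)
    pred = probs.argmax(axis=1)                              # (V,)
    # Per-sample class proportions of the hard predictions.
    counts = np.bincount(pred, minlength=probs.shape[1])
    prop = counts / counts.sum()                             # (C,)
    # Inverse-frequency weight for each voxel's predicted class.
    weights = 1.0 / (prop[pred] + eps)
    return float(np.mean(weights * entropy))

# A sample dominated by class 0 but containing a few uncertain
# rare-class voxels receives a higher score than a uniform one.
common = np.tile([0.98, 0.01, 0.01], (95, 1))
rare = np.tile([0.30, 0.40, 0.30], (5, 1))
score = frequency_weighted_uncertainty(np.vstack([common, rare]))
```

In a full acquisition loop this score would be combined with the inter-sample and intra-set diversity terms described above.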
[264] Complet4R: Geometric Complete 4D Reconstruction
Weibang Wang, Kenan Li, Zhuoguang Chen, Yijun Yuan, Hang Zhao
Main category: cs.CV
TL;DR: Complet4R is an end-to-end framework for Geometric Complete 4D Reconstruction that recovers temporally coherent and geometrically complete reconstructions for dynamic scenes by accumulating full contexts onto each frame using a decoder-only transformer.
Details
Motivation: Previous approaches for dynamic scene reconstruction rely on pairwise reconstruction or local motion estimation, which may not effectively handle occluded regions or maintain temporal coherence across frames. There's a need for a unified framework that can reconstruct complete geometries for every timestamp, including occluded areas visible in other frames.
Method: The method formalizes Geometric Complete 4D Reconstruction as a unified framework of reconstruction and completion. It uses a decoder-only transformer to operate on all context globally from sequential video input, directly accumulating full contexts onto each frame to reconstruct complete geometry for every timestamp.
Result: Complet4R demonstrates state-of-the-art performance on the proposed benchmark for Geometric Complete 4D Reconstruction and the 3D Point Tracking task.
Conclusion: The framework successfully addresses the challenge of geometric complete 4D reconstruction by leveraging global context accumulation through transformer architecture, achieving superior performance in reconstructing dynamic scenes with temporal coherence.
Abstract: We introduce Complet4R, a novel end-to-end framework for Geometric Complete 4D Reconstruction, which aims to recover temporally coherent and geometrically complete reconstruction for dynamic scenes. Our method formalizes the task of Geometric Complete 4D Reconstruction as a unified framework of reconstruction and completion, by directly accumulating full contexts onto each frame. Unlike previous approaches that rely on pairwise reconstruction or local motion estimation, Complet4R utilizes a decoder-only transformer to operate on all context globally directly from sequential video input, reconstructing a complete geometry for every single timestamp, including occluded regions visible in other frames. Our method demonstrates state-of-the-art performance on our proposed benchmark for Geometric Complete 4D Reconstruction and the 3D Point Tracking task. Code will be released to support future research.
[265] Dual-Path Learning based on Frequency Structural Decoupling and Regional-Aware Fusion for Low-Light Image Super-Resolution
Ji-Xuan He, Jia-Cheng Zhao, Feng-Qi Cui, Jinyang Huang, Yang Liu, Sirui Zhao, Meng Li, Zhi Liu
Main category: cs.CV
TL;DR: DTP is a frequency-aware framework for low-light image super-resolution that decouples luminance and texture into independent components for specialized enhancement and reconstruction.
Details
Motivation: Existing methods for low-light image super-resolution process both tasks serially, leading to artifact amplification, texture suppression, and structural degradation. There's a need for a framework that can handle both illumination enhancement and super-resolution simultaneously while preserving structural details.
Method: Proposes Decoupling then Perceive (DTP) framework with three key components: 1) Frequency-aware Structural Decoupling (FSD) separates input into low-frequency luminance and high-frequency texture subspaces; 2) Semantics-specific Dual-path Representation (SDR) learning for targeted enhancement of each component; 3) Cross-frequency Semantic Recomposition (CSR) module to integrate decoupled representations while maintaining structural consistency.
Result: Extensive experiments show DTP outperforms state-of-the-art methods, achieving +1.6% PSNR, +9.6% SSIM, and -48% LPIPS improvements on standard LLISR benchmarks.
Conclusion: DTP effectively addresses the limitations of serial processing in low-light image super-resolution by explicitly decoupling and separately enhancing luminance and texture components, leading to superior visual quality and structural preservation.
Abstract: Low-light image super-resolution (LLISR) is essential for restoring fine visual details and perceptual quality under insufficient illumination conditions with ubiquitous low-resolution devices. Although pioneer methods achieve high performance on single tasks, they solve both tasks in a serial manner, which inevitably leads to artifact amplification, texture suppression, and structural degradation. To address this, we propose Decoupling then Perceive (DTP), a novel frequency-aware framework that explicitly separates luminance and texture into semantically independent components, enabling specialized modeling and coherent reconstruction. Specifically, to adaptively separate the input into low-frequency luminance and high-frequency texture subspaces, we propose a Frequency-aware Structural Decoupling (FSD) mechanism, which lays a solid foundation for targeted representation learning and reconstruction. Based on the decoupled representation, a Semantics-specific Dual-path Representation (SDR) learning strategy that performs targeted enhancement and reconstruction for each frequency component is further designed, facilitating robust luminance adjustment and fine-grained texture recovery. To promote structural consistency and perceptual alignment in the reconstructed output, building upon this dual-path modeling, we further introduce a Cross-frequency Semantic Recomposition (CSR) module that selectively integrates the decoupled representations. Extensive experiments on the most widely used LLISR benchmarks demonstrate the superiority of our DTP framework, improving $+$1.6% PSNR, $+$9.6% SSIM, and $-$48% LPIPS compared to the state-of-the-art (SOTA) algorithm. Codes are released at https://github.com/JXVision/DTP.
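To make the luminance/texture split concrete: one simple way to realize such a frequency decoupling is a fixed radial low-pass mask in the Fourier domain. This is only a hand-set stand-in for the paper's learned, adaptive FSD mechanism; the function name and cutoff are my illustrative choices.

```python
import numpy as np

def frequency_decouple(img, cutoff=0.1):
    """Split an image into a low-frequency (luminance-like) and a
    high-frequency (texture-like) component with an FFT mask.
    A fixed radial cutoff stands in for a learned, adaptive split.

    img: (H, W) grayscale array. Returns (low, high) with low + high == img.
    """
    H, W = img.shape
    spec = np.fft.fftshift(np.fft.fft2(img))
    # Radial frequency grid (cycles per pixel).
    fy = np.fft.fftshift(np.fft.fftfreq(H))
    fx = np.fft.fftshift(np.fft.fftfreq(W))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    mask = (radius <= cutoff).astype(float)
    low = np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))
    high = img - low
    return low, high

img = np.add.outer(np.linspace(0, 1, 64), np.zeros(64))  # smooth gradient
img[::8, :] += 0.5                                        # fine stripes
low, high = frequency_decouple(img)
```

In the DTP pipeline the two components would then be routed through separate enhancement paths before recomposition.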
[266] Improving Automated Wound Assessment Using Joint Boundary Segmentation and Multi-Class Classification Models
Mehedi Hasan Tusar, Fateme Fayyazbakhsh, Igor Melnychuk, Ming C. Leu
Main category: cs.CV
TL;DR: YOLOv11-based deep learning model for simultaneous wound boundary segmentation and classification across five wound types, achieving high accuracy with data augmentation.
Details
Motivation: Existing AI models for wound analysis are limited: they focus on narrow wound types or perform only a single task (segmentation OR classification), reducing clinical applicability. Need for a comprehensive model that handles multiple wound types and performs both tasks simultaneously.
Method: Developed YOLOv11-based deep learning model for simultaneous wound boundary segmentation (WBS) and wound classification (WC) across five wound types. Created balanced dataset of 2,963 annotated images with five-fold cross-validation. Used data augmentation (rotation, flipping, brightness/saturation/exposure variations) to improve generalization. Tested different YOLOv11 variants (YOLOv11x, YOLOv11n).
Result: YOLOv11x achieved highest performance: F1-scores of 0.9341 for WBS and 0.8736 for WC. YOLOv11n provided comparable accuracy with lower computational cost. Data augmentation significantly improved performance, especially for visually subtle burn injury cases. Models showed robustness against complex backgrounds and high intra-class variability.
Conclusion: YOLOv11-based architectures are effective for accurate, real-time wound analysis in clinical and remote care settings, handling both segmentation and classification simultaneously across multiple wound types.
Abstract: Accurate wound classification and boundary segmentation are essential for guiding clinical decisions in both chronic and acute wound management. However, most existing AI models are limited, focusing on a narrow set of wound types or performing only a single task (segmentation or classification), which reduces their clinical applicability. This study presents a deep learning model based on YOLOv11 that simultaneously performs wound boundary segmentation (WBS) and wound classification (WC) across five clinically relevant wound types: burn injury (BI), pressure injury (PI), diabetic foot ulcer (DFU), vascular ulcer (VU), and surgical wound (SW). A wound-type balanced dataset of 2,963 annotated images was created to train the models for both tasks, with stratified five-fold cross-validation ensuring robust and unbiased evaluation. The models trained on the original non-augmented dataset achieved consistent performance across folds, though BI detection accuracy was relatively lower. Therefore, the dataset was augmented using rotation, flipping, and variations in brightness, saturation, and exposure to help the model learn more generalized and invariant features. This augmentation significantly improved model performance, particularly in detecting visually subtle BI cases. Among tested variants, YOLOv11x achieved the highest performance with F1-scores of 0.9341 (WBS) and 0.8736 (WC), while the lightweight YOLOv11n provided comparable accuracy at lower computational cost, making it suitable for resource-constrained deployments. Supported by confusion matrices and visual detection outputs, the results confirm the model’s robustness against complex backgrounds and high intra-class variability, demonstrating the potential of YOLOv11-based architectures for accurate, real-time wound analysis in both clinical and remote care settings.
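The augmentation recipe above (rotation, flipping, brightness/saturation/exposure variation) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: rotation is restricted to 90-degree steps for simplicity, saturation jitter is omitted, and for a joint segmentation task the same geometric transforms would also have to be applied to the masks.

```python
import numpy as np

def augment(img, rng):
    """Apply random rotation, horizontal flip, and a brightness/
    exposure scale to one image.

    img: (H, W, 3) float array in [0, 1].
    """
    img = np.rot90(img, k=int(rng.integers(0, 4)))  # 0/90/180/270 degrees
    if rng.random() < 0.5:
        img = img[:, ::-1]                          # horizontal flip
    img = img * rng.uniform(0.7, 1.3)               # brightness / exposure
    return np.clip(img, 0.0, 1.0)

rng = np.random.default_rng(0)
out = augment(np.full((32, 32, 3), 0.5), rng)
```

In practice one would draw a fresh augmentation per training sample per epoch so the model sees invariant variants of each wound image.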
[267] Unsafe by Reciprocity: How Generation-Understanding Coupling Undermines Safety in Unified Multimodal Models
Kaishen Wang, Heng Huang
Main category: cs.CV
TL;DR: RICE attack exploits bidirectional interactions between understanding and generation in unified multimodal models, revealing safety vulnerabilities through cross-functionality exploitation.
Details
Motivation: While unified multimodal models integrate understanding and generation for enhanced performance, the safety implications of this tight coupling remain unexplored. Existing safety research analyzes these functions in isolation, missing potential vulnerabilities from their reciprocal interactions.
Method: Proposes RICE (Reciprocal Interaction-based Cross-functionality Exploitation), a novel attack paradigm that exploits bidirectional interactions between understanding and generation. Systematically evaluates Generation-to-Understanding (G-U) and Understanding-to-Generation (U-G) attack pathways where unsafe intermediate signals propagate across modalities.
Result: Extensive experiments show high Attack Success Rates (ASR) in both directions, revealing previously overlooked safety weaknesses inherent to unified multimodal models.
Conclusion: Cross-functionality reciprocity itself constitutes a structural source of vulnerability in unified multimodal models, demonstrating that unsafe signals can propagate and amplify safety risks across modalities.
Abstract: Recent advances in Large Language Models (LLMs) and Text-to-Image (T2I) models have led to the emergence of Unified Multimodal Models (UMMs), where multimodal understanding and image generation are tightly integrated within a shared architecture. Prior studies suggest that such reciprocity enhances cross-functionality performance through shared representations and joint optimization. However, the safety implications of this tight coupling remain largely unexplored, as existing safety research predominantly analyzes understanding and generation functionalities in isolation. In this work, we investigate whether cross-functionality reciprocity itself constitutes a structural source of vulnerability in UMMs. We propose RICE: Reciprocal Interaction-based Cross-functionality Exploitation, a novel attack paradigm that explicitly exploits bidirectional interactions between understanding and generation. Using this framework, we systematically evaluate Generation-to-Understanding (G-U) and Understanding-to-Generation (U-G) attack pathways, demonstrating that unsafe intermediate signals can propagate across modalities and amplify safety risks. Extensive experiments show high Attack Success Rates (ASR) in both directions, revealing previously overlooked safety weaknesses inherent to UMMs.
[268] EVA: Bridging Performance and Human Alignment in Hard-Attention Vision Models for Image Classification
Pengcheng Pan, Yonekura Shogo, Kuniyoshi Yasuo
Main category: cs.CV
TL;DR: EVA is a neuroscience-inspired hard-attention model that optimizes for both classification accuracy and human-like visual scanpaths, making the trade-off between performance and human-likeness explicit and adjustable.
Details
Motivation: Current vision models optimized purely for classification accuracy can degrade human-like scanpaths and limit interpretability, creating an "alignment tax" where performance gains come at the cost of human-likeness.
Method: EVA uses sequential glimpses with minimal fovea-periphery representation, CNN-based feature extraction, variance control, and adaptive gating to stabilize attention dynamics. It’s trained with standard classification objectives without gaze supervision.
Result: On CIFAR-10 with human gaze annotations, EVA improves scanpath alignment (DTW, NSS metrics) while maintaining competitive accuracy. On ImageNet-100 and COCO-Search18, it yields human-like scanpaths without additional training or gaze supervision.
Conclusion: EVA provides a principled framework for trustworthy, human-interpretable active vision by making the performance-human-likeness trade-off explicit and adjustable.
Abstract: Optimizing vision models purely for classification accuracy can impose an alignment tax, degrading human-like scanpaths and limiting interpretability. We introduce EVA, a neuroscience-inspired hard-attention mechanistic testbed that makes the performance-human-likeness trade-off explicit and adjustable. EVA samples a small number of sequential glimpses using a minimal fovea-periphery representation with a CNN-based feature extractor and integrates variance control and adaptive gating to stabilize and regulate attention dynamics. EVA is trained with the standard classification objective without gaze supervision. On CIFAR-10 with dense human gaze annotations, EVA improves scanpath alignment under established metrics such as DTW and NSS, while maintaining competitive accuracy. Ablations show that CNN-based feature extraction drives accuracy but suppresses human-likeness, whereas variance control and gating restore human-aligned trajectories with minimal performance loss. We further validate EVA’s scalability on ImageNet-100 and evaluate scanpath alignment on COCO-Search18 without COCO-Search18 gaze supervision or finetuning, where EVA yields human-like scanpaths on natural scenes without additional training. Overall, EVA provides a principled framework for trustworthy, human-interpretable active vision.
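A fovea-periphery glimpse of the kind EVA samples can be sketched as a sharp central crop plus a wider, downsampled context window. Crop sizes, the downsampling factor, and the averaging-based blur below are illustrative assumptions, not EVA's actual parameters.

```python
import numpy as np

def glimpse(img, cy, cx, fovea=8, periphery=32, down=4):
    """One hard-attention glimpse at location (cy, cx):
    a full-resolution foveal crop plus a wide, averaged
    peripheral view of the surroundings.
    """
    H, W = img.shape
    def crop(size):
        y0 = np.clip(cy - size // 2, 0, H - size)
        x0 = np.clip(cx - size // 2, 0, W - size)
        return img[y0:y0 + size, x0:x0 + size]
    fov = crop(fovea)                                  # sharp centre
    per = crop(periphery)                              # wide context...
    per = per.reshape(periphery // down, down,
                      periphery // down, down).mean(axis=(1, 3))  # ...blurred
    return fov, per

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
fov, per = glimpse(img, cy=32, cx=32)
```

A sequential model would emit a new (cy, cx) after processing each glimpse, producing the scanpath that is compared against human gaze.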
[269] TerraSeg: Self-Supervised Ground Segmentation for Any LiDAR
Ted Lentsch, Santiago Montiel-Marín, Holger Caesar, Dariu M. Gavrila
Main category: cs.CV
TL;DR: TerraSeg: First self-supervised, domain-agnostic LiDAR ground segmentation model trained on OmniLiDAR dataset with 22M scans across 15 sensors, achieving SOTA results without manual labels.
Details
Motivation: Existing LiDAR ground segmentation methods are either handcrafted for specific sensor configurations or rely on costly manual labeling, limiting generalization and scalability across different sensors and environments.
Method: Introduces TerraSeg model trained on OmniLiDAR dataset (12 public benchmarks, 22M scans, 15 sensor models) using PseudoLabeler module that generates high-quality ground/non-ground labels through self-supervised per-scan runtime optimization without human annotations.
Result: TerraSeg achieves state-of-the-art results on nuScenes, SemanticKITTI, and Waymo Perception benchmarks while delivering real-time performance, despite using no manual labels during training.
Conclusion: TerraSeg demonstrates that self-supervised learning on diverse sensor data can produce highly generalizable ground segmentation models that outperform supervised methods, enabling scalable LiDAR perception across different platforms.
Abstract: LiDAR perception is fundamental to robotics, enabling machines to understand their environment in 3D. A crucial task for LiDAR-based scene understanding and navigation is ground segmentation. However, existing methods are either handcrafted for specific sensor configurations or rely on costly per-point manual labels, severely limiting their generalization and scalability. To overcome this, we introduce TerraSeg, the first self-supervised, domain-agnostic model for LiDAR ground segmentation. We train TerraSeg on OmniLiDAR, a unified large-scale dataset that aggregates and standardizes data from 12 major public benchmarks. Spanning almost 22 million raw scans across 15 distinct sensor models, OmniLiDAR provides unprecedented diversity for learning a highly generalizable ground model. To supervise training without human annotations, we propose PseudoLabeler, a novel module that generates high-quality ground and non-ground labels through self-supervised per-scan runtime optimization. Extensive evaluations demonstrate that, despite using no manual labels, TerraSeg achieves state-of-the-art results on nuScenes, SemanticKITTI, and Waymo Perception while delivering real-time performance. Our code and model weights are publicly available.
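To give a feel for annotation-free ground labeling, the sketch below pseudo-labels a point cloud with a RANSAC-style plane fit. This is a deliberately simple stand-in: the paper's PseudoLabeler is a per-scan runtime optimization, not this exact procedure, and all thresholds here are my own choices.

```python
import numpy as np

def ground_pseudolabels(points, iters=100, tol=0.2, seed=0):
    """Label LiDAR points as ground / non-ground by fitting a
    near-horizontal plane with RANSAC.

    points: (N, 3) xyz array. Returns a boolean ground mask.
    """
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue
        n /= norm
        # Reject steep planes: ground normals point roughly along z.
        if abs(n[2]) < 0.9:
            continue
        dist = np.abs((points - sample[0]) @ n)
        mask = dist < tol
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return best_mask

# Synthetic scan: flat ground at z~0 plus a box of obstacle points.
rng = np.random.default_rng(1)
ground = np.c_[rng.uniform(-10, 10, (500, 2)), rng.normal(0, 0.03, 500)]
box = np.c_[rng.uniform(-1, 1, (100, 2)), rng.uniform(0.5, 2.0, 100)]
labels = ground_pseudolabels(np.vstack([ground, box]))
```

Labels produced this way (ideally by a more robust optimizer, as in the paper) can then supervise a learned segmentation network that generalizes across sensors.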
[270] Falcon Perception
Aviraj Bevli, Sofian Chaybouti, Yasser Dahou, Hakim Hacid, Ngoc Dung Huynh, Phuc H. Le Khac, Sanath Narayan, Wamiq Reyaz Para, Ankit Singh
Main category: cs.CV
TL;DR: Falcon Perception: A unified dense Transformer for vision-language tasks using early fusion of image patches and text tokens in shared parameter space with hybrid attention patterns.
Details
Motivation: To challenge the conventional modular encoder-decoder pipeline in perception systems and explore whether a single early-fusion architecture can handle both feature extraction and task modeling at scale.
Method: Introduces Falcon Perception, a unified dense Transformer that processes image patches and text tokens in shared parameter space from the first layer. Uses hybrid attention (bidirectional for image tokens, causal for prediction tokens) to combine global visual context with autoregressive instance generation. Retains lightweight token interface and decodes continuous spatial outputs with specialized heads.
Result: Achieves 68.0 Macro-F1 on SA-Co benchmark (vs. 62.3 for SAM3). Introduces PBench for compositional prompts and dense long-context regimes. Falcon OCR (300M parameters) attains 80.3% on olmOCR and 88.64 on OmniDocBench.
Conclusion: Early-fusion architecture with shared parameter space can effectively handle both perception and task modeling, promoting simplicity by keeping a single scalable backbone and shifting complexity to data and training signals.
Abstract: Perception-centric systems are typically implemented with a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential or can a single early-fusion stack do both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer, using a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) to combine global visual context with autoregressive, variable-length instance generation. To keep dense outputs practical, Falcon Perception retains a lightweight token interface and decodes continuous spatial outputs with specialized heads, enabling parallel high-resolution mask prediction. Our design promotes simplicity: we keep a single scalable backbone and shift complexity toward data and training signals, adding only small heads where outputs are continuous and dense. On SA-Co, Falcon Perception improves mask quality to 68.0 Macro-F$_1$ compared to 62.3 of SAM3. We also introduce PBench, a benchmark targeting compositional prompts (OCR, spatial constraints, relations) and dense long-context regimes, where the model shows better gains. Finally, we extend the same early-fusion recipe to Falcon OCR: a compact 300M-parameter model which attains 80.3% on olmOCR and 88.64 on OmniDocBench.
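The hybrid attention pattern described above (bidirectional among image tokens, causal for prediction tokens) reduces to a simple boolean mask. The [img | txt] token layout and the choice that image tokens do not attend to text are my assumptions for this sketch.

```python
import numpy as np

def hybrid_attention_mask(n_img, n_txt):
    """Boolean attention mask for an early-fusion sequence of
    n_img image tokens followed by n_txt prediction tokens.
    True = attention allowed.
    """
    n = n_img + n_txt
    mask = np.zeros((n, n), dtype=bool)
    # Image block: full bidirectional attention among image tokens.
    mask[:n_img, :n_img] = True
    # Prediction tokens: causal, but with full view of the image.
    for i in range(n_txt):
        mask[n_img + i, :n_img + i + 1] = True
    return mask

m = hybrid_attention_mask(n_img=3, n_txt=2)
```

In a real model this mask would be passed to every attention layer of the shared backbone, so a single parameter space serves both perception and autoregressive prediction.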
[271] HMPDM: A Diffusion Model for Driving Video Prediction with Historical Motion Priors
Ke Li, Tianjia Yang, Kaidi Liang, Xianbiao Hu, Ruwen Qin
Main category: cs.CV
TL;DR: HMPDM is a video prediction model for autonomous driving that uses historical motion priors via diffusion models to improve temporal consistency and visual quality in driving scene forecasting.
Details
Motivation: Existing video prediction models for autonomous driving are constrained by multi-stage training pipelines and model the diverse motion patterns of real driving scenes insufficiently, leading to poor temporal consistency and visual quality.
Method: Proposes HMPDM with three key components: 1) Temporal-aware Latent Conditioning (TaLC) for implicit historical motion injection, 2) Motion-aware Pyramid Encoder (MaPE) for multi-scale motion representation, and 3) Self-Conditioning (SC) strategy for stable iterative denoising.
Result: Outperforms state-of-the-art methods on Cityscapes and KITTI benchmarks, achieving 28.2% improvement in FVD on Cityscapes under monocular RGB input configuration.
Conclusion: HMPDM effectively enhances motion understanding and temporal coherence in video prediction for autonomous driving through historical motion priors and diffusion modeling.
Abstract: Video prediction is a useful function for autonomous driving, enabling intelligent vehicles to reliably anticipate how driving scenes will evolve and thereby supporting reasoning and safer planning. However, existing models are constrained by multi-stage training pipelines and remain insufficient in modeling the diverse motion patterns in real driving scenes, leading to degraded temporal consistency and visual quality. To address these challenges, this paper introduces the historical motion priors-informed diffusion model (HMPDM), a video prediction model that leverages historical motion priors to enhance motion understanding and temporal coherence. The proposed deep learning system introduces three key designs: (i) a Temporal-aware Latent Conditioning (TaLC) module for implicit historical motion injection; (ii) a Motion-aware Pyramid Encoder (MaPE) for multi-scale motion representation; (iii) a Self-Conditioning (SC) strategy for stable iterative denoising. Extensive experiments on the Cityscapes and KITTI benchmarks demonstrate that HMPDM outperforms state-of-the-art video prediction methods with efficiency, achieving a 28.2% improvement in FVD on Cityscapes under the same monocular RGB input configuration setting. The implementation codes are publicly available at https://github.com/KELISBU/HMPDM.
[272] Bridging Visual Representation and Reinforcement Learning from Verifiable Rewards in Large Vision-Language Models
Yuhang Han, Yuyang Wu, Zhengbo Jiao, Yiyu Wang, Xuyang Liu, Shaobo Wang, Hanlin Xu, Xuming Hu, Linfeng Zhang
Main category: cs.CV
TL;DR: KAWHI is a plug-and-play reward reweighting mechanism that incorporates structured visual information into reinforcement learning for Large Vision-Language Models to improve multimodal reasoning performance.
Details
Motivation: Existing RLVR methods for LVLMs suffer from a structural representational bottleneck: they lack explicit modeling and effective utilization of visual information, preventing tight coupling between visual representations and RL optimization, which limits multimodal reasoning improvements.
Method: KAWHI adaptively localizes semantically salient regions through hierarchical geometric aggregation, identifies vision-critical attention heads via structured attribution, and performs paragraph-level credit reallocation to align spatial visual evidence with semantically decisive reasoning steps.
Result: Extensive empirical evaluations on diverse reasoning benchmarks show KAWHI consistently improves performance of various uniform reward optimization methods (like GRPO and GSPO) as a general-purpose enhancement module.
Conclusion: KAWHI successfully addresses the visual representation bottleneck in RLVR for LVLMs by explicitly incorporating structured visual information into reward optimization, enabling better multimodal reasoning capabilities.
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has substantially enhanced the reasoning capabilities of large language models in abstract reasoning tasks. However, its application to Large Vision-Language Models (LVLMs) remains constrained by a structural representational bottleneck. Existing approaches generally lack explicit modeling and effective utilization of visual information, preventing visual representations from being tightly coupled with the reinforcement learning optimization process and thereby limiting further improvements in multimodal reasoning performance. To address this limitation, we propose KAWHI (Key-Region Aligned Weighted Harmonic Incentive), a plug-and-play reward reweighting mechanism that explicitly incorporates structured visual information into uniform reward policy optimization methods (e.g., GRPO and GSPO). The method adaptively localizes semantically salient regions through hierarchical geometric aggregation, identifies vision-critical attention heads via structured attribution, and performs paragraph-level credit reallocation to align spatial visual evidence with semantically decisive reasoning steps. Extensive empirical evaluations on diverse reasoning benchmarks substantiate KAWHI as a general-purpose enhancement module, consistently improving the performance of various uniform reward optimization methods. Project page: KAWHI (https://kawhiiiileo.github.io/KAWHI_PAGE/)
[273] Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression
Nazia Tasnim, Shrimai Prabhumoye, Bryan A. Plummer
Main category: cs.CV
TL;DR: CRISP is a unified framework for parameter recombination that supports both model compression and parameter-efficient fine-tuning through shared basis matrices and small mixer weights.
Details
Motivation: Current parameter recombination methods typically focus on single applications (either PEFT or MC), making it challenging to combine them. In resource-constrained deployments like edge devices, even PEFT methods with millions of parameters can be problematic when combined with compression needs.
Method: CRISP factorizes pretrained weights into basis matrices and component mixing projections. It shares basis matrices across layers (enabling model compression) and uses small mixer weights (enabling PEFT). The framework supports both tasks simultaneously through this factorization approach.
Result: CRISP outperforms prior dual-task methods by 4-5%, outperforms state-of-the-art PEFT methods by 1.5%, and outperforms PEFT+MC combinations by 1%.
Conclusion: CRISP provides a general framework that seamlessly integrates multiple parameter recombination tasks, offering improved performance for both model compression and parameter-efficient fine-tuning applications.
Abstract: Parameter Recombination (PR) methods aim to efficiently compose the weights of a neural network for applications like Parameter-Efficient FineTuning (PEFT) and Model Compression (MC), among others. Most methods typically focus on one application of PR, which can make composing them challenging. For example, when deploying a large model you may wish to compress the model and also quickly adapt to new settings. However, PEFT methods often can still contain millions of parameters. This may be small compared to the original model size, but can be problematic in resource constrained deployments like edge devices, where they take a larger portion of the compressed model’s parameters. To address this, we present Coefficient-gated weight Recombination by Interpolated Shared basis Projections (CRISP), a general approach that seamlessly integrates multiple PR tasks within the same framework. CRISP accomplishes this by factorizing pretrained weights into basis matrices and their component mixing projections. Sharing basis matrices across layers and adjusting its size enables us to perform MC, whereas the mixer weight’s small size (fewer than 200 in some experiments) enables CRISP to support PEFT. Experiments show CRISP outperforms methods from prior work capable of dual-task applications by 4-5% while also outperforming the state-of-the-art in PEFT by 1.5% and PEFT+MC combinations by 1%. Our code is available on the repository: https://github.com/appledora/CRISP-CVPR26.
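The basis-plus-mixer factorization can be sketched in a few lines. This is an illustrative toy in the spirit of CRISP, not its implementation: shapes, initialization, and the plain linear recombination are my assumptions. The key structural points survive, though: the basis tensor is shared across layers (shrinking it compresses the model), while only a handful of mixer coefficients are layer-specific (training only those gives PEFT-style adaptation).

```python
import numpy as np

class SharedBasisLayer:
    """A layer whose weight is a mixture of K basis matrices
    shared across all layers; only the K mixer coefficients
    are specific to this layer.
    """
    def __init__(self, basis, rng):
        self.basis = basis                            # (K, d_in, d_out), shared
        self.mixer = rng.normal(0, 0.1, len(basis))   # K per-layer scalars

    def weight(self):
        # Recombine: W = sum_k mixer[k] * basis[k]
        return np.tensordot(self.mixer, self.basis, axes=1)

    def __call__(self, x):
        return x @ self.weight()

rng = np.random.default_rng(0)
basis = rng.normal(0, 0.02, (4, 16, 16))     # K=4 bases shared by all layers
layers = [SharedBasisLayer(basis, rng) for _ in range(3)]
x = rng.normal(size=(2, 16))
for layer in layers:
    x = layer(x)
# Three layers share one basis; only 3 * 4 mixer scalars are layer-specific.
```

Reducing K (or the basis dimensions) trades accuracy for compression, while fine-tuning only `mixer` touches a few scalars per layer, which matches the digest's note that some experiments use fewer than 200 adapter parameters.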
[274] Mind the Shape Gap: A Benchmark and Baseline for Deformation-Aware 6D Pose Estimation of Agricultural Produce
Nikolas Chatzis, Angeliki Tsinouka, Katerina Papadimitriou, Niki Efthymiou, Marios Glytsos, George Retsinas, Paris Oikonomou, Gerasimos Potamianos, Petros Maragos, Panagiotis Paraskevas Filntisis
Main category: cs.CV
TL;DR: PEAR benchmark enables joint 6D pose and deformation estimation for agricultural produce, while SEED framework jointly predicts pose and deformations from single RGB images, outperforming existing methods.
Details
Motivation: Agricultural robotics faces challenges in 6D pose estimation due to biological deformability and shape variability of produce. Instance-level methods require exact 3D models for each piece (infeasible), while category-level methods using fixed templates degrade when prior deviates from actual geometry.Method: 1) Created PEAR benchmark with joint 6D pose and per-instance 3D deformation ground truth across 8 produce categories using robotic manipulator for high annotation accuracy. 2) Proposed SEED framework that jointly predicts 6D pose and explicit lattice deformations from single RGB images across multiple categories, trained entirely on synthetic data with generative texture augmentation at UV level.
Result: State-of-the-art methods suffer up to 6x performance degradation on real-world produce. SEED outperforms MegaPose on 6 out of 8 categories under identical RGB-only conditions, demonstrating explicit shape modeling improves pose estimation reliability.
Conclusion: Explicit shape modeling is critical for reliable pose estimation in agricultural robotics. The PEAR benchmark enables evaluation of joint pose and deformation estimation, while SEED provides a unified RGB-only framework that effectively handles produce deformability.
Abstract: Accurate 6D pose estimation for robotic harvesting is fundamentally hindered by the biological deformability and high intra-class shape variability of agricultural produce. Instance-level methods fail in this setting, as obtaining exact 3D models for every unique piece of produce is practically infeasible, while category-level approaches that rely on a fixed template suffer significant accuracy degradation when the prior deviates from the true instance geometry. To address this lack of robustness to deformation, we introduce PEAR (Pose and dEformation of Agricultural pRoduce), the first benchmark providing joint 6D pose and per-instance 3D deformation ground truth across 8 produce categories, acquired via a robotic manipulator for high annotation accuracy. Using PEAR, we show that state-of-the-art methods suffer up to 6x performance degradation when faced with the inherent geometric deviations of real-world produce. Motivated by this finding, we propose SEED (Simultaneous Estimation of posE and Deformation), a unified RGB-only framework that jointly predicts 6D pose and explicit lattice deformations from a single image across multiple produce categories. Trained entirely on synthetic data with generative texture augmentation applied at the UV level, SEED outperforms MegaPose on 6 out of 8 categories under identical RGB-only conditions, demonstrating that explicit shape modeling is a critical step toward reliable pose estimation in agricultural robotics.
[275] SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
Jiang Zhang, Shijie Zhou, Bangya Liu, Achuta Kadambi, Zhiwen Fan
Main category: cs.CV
TL;DR: SpatialStack is a hierarchical fusion framework that progressively aligns vision, geometry, and language representations across model layers to improve 3D spatial reasoning in vision-language models.
Details
Motivation: Current vision-language models struggle with reliable 3D spatial reasoning due to their inability to capture fine-grained 3D geometry and spatial relationships. Existing approaches that fuse only deep-layer features from vision and geometry encoders discard rich hierarchical signals, creating a bottleneck for spatial understanding.Method: Proposes SpatialStack, a hierarchical fusion framework that stacks and synchronizes multi-level geometric features with the language backbone. Instead of conventional late-stage fusion, it progressively aligns vision, geometry, and language representations across the model hierarchy to capture both local geometric precision and global contextual semantics.
Result: VLM-SpatialStack achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments show the multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks.
Conclusion: SpatialStack establishes an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems, overcoming limitations of current approaches by enabling hierarchical feature alignment across modalities.
Abstract: Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.
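The contrast between SpatialStack's hierarchical fusion and conventional late-stage fusion can be illustrated with a toy forward pass (purely a sketch: the paper's model is a transformer VLM, and the layers and features below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
layers = [rng.standard_normal((d, d)) for _ in range(3)]  # toy stand-ins for language blocks
geom = [rng.standard_normal(d) for _ in range(3)]         # geometry features at three levels

def forward(x, fuse_all_levels):
    for i, W in enumerate(layers):
        if fuse_all_levels or i == len(layers) - 1:  # late fusion touches only the last level
            x = x + geom[i]                          # inject geometry at this level
        x = np.tanh(W @ x)
    return x

x0 = rng.standard_normal(d)
hier = forward(x0, True)    # multi-level vision-geometry-language fusion
late = forward(x0, False)   # conventional deep-layer-only fusion
print(np.allclose(hier, late))  # False: early geometric signals change the representation
```

The point of the sketch is only that deep-layer-only fusion discards everything the earlier levels could have contributed, which is the bottleneck the paper targets.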
[276] Evaluating Large and Lightweight Vision Models for Irregular Component Segmentation in E-Waste Disassembly
Xinyao Zhang, Chang Liu, Xiao Liang, Minghui Zheng, Sara Behdad
Main category: cs.CV
TL;DR: Comparison of SAM2 (transformer-based vision model) vs YOLOv8 for segmenting laptop components in e-waste recycling, showing YOLOv8 significantly outperforms SAM2 on task-specific segmentation.
Details
Motivation: Need for precise segmentation of irregular, densely arranged components in e-waste recycling for robotic disassembly and material recovery, requiring evaluation of model architectures for industrial applications.Method: Compared SAM2 (transformer-based) with lightweight YOLOv8 on new dataset of 1,456 annotated RGB images of laptop components (logic boards, heat sinks, fans) under varying conditions; used data augmentation (rotation, flipping, cropping) for robustness.
Result: YOLOv8 achieved much higher segmentation accuracy (mAP50 = 98.8%, mAP50-95 = 85%) and better boundary precision than SAM2 (mAP50 = 8.4%). SAM2 showed flexibility but produced overlapping masks and inconsistent contours.
Conclusion: Large pre-trained models like SAM2 require task-specific optimization for industrial applications; YOLOv8 performs better for this specific segmentation task; dataset and framework support scalable vision algorithms for robotic e-waste disassembly.
Abstract: Precise segmentation of irregular and densely arranged components is essential for robotic disassembly and material recovery in electronic waste (e-waste) recycling. This study evaluates the impact of model architecture and scale on segmentation performance by comparing SAM2, a transformer-based vision model, with the lightweight YOLOv8 network. Both models were trained and tested on a newly collected dataset of 1,456 annotated RGB images of laptop components including logic boards, heat sinks, and fans, captured under varying illumination and orientation conditions. Data augmentation techniques, such as random rotation, flipping, and cropping, were applied to improve model robustness. YOLOv8 achieved higher segmentation accuracy (mAP50 = 98.8%, mAP50-95 = 85%) and stronger boundary precision than SAM2 (mAP50 = 8.4%). SAM2 demonstrated flexibility in representing diverse object structures but often produced overlapping masks and inconsistent contours. These findings show that large pre-trained models require task-specific optimization for industrial applications. The resulting dataset and benchmarking framework provide a foundation for developing scalable vision algorithms for robotic e-waste disassembly and circular manufacturing systems.
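For readers weighing the mAP numbers above: mAP50 counts a predicted mask as correct only when its intersection-over-union with a ground-truth mask exceeds 0.5, which is why overlapping or imprecise SAM2 masks score so poorly. A self-contained sketch of that overlap score:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two binary masks (the score behind mAP50's 0.5 cutoff)."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

pred = np.zeros((8, 8)); pred[2:6, 2:6] = 1  # predicted component mask (toy example)
gt = np.zeros((8, 8)); gt[3:7, 3:7] = 1      # ground-truth mask
print(round(mask_iou(pred, gt), 3))  # 0.391 -> below 0.5, so counted as a miss at mAP50
```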
[277] LOME: Learning Human-Object Manipulation with Action-Conditioned Egocentric World Model
Quankai Gao, Jiawei Yang, Qiangeng Xu, Le Chen, Yue Wang
Main category: cs.CV
TL;DR: LOME is an egocentric world model that generates realistic human-object interaction videos conditioned on input images, text prompts, and per-frame human actions, enabling precise action guidance and physical realism.
Details
Motivation: Traditional physics-based animation for human-object manipulation requires extensive modeling, doesn't generalize well across object morphologies, and doesn't scale to real-world environments. There's a need for a method that can generate realistic human-object interactions with strong action guidance and physical realism.Method: LOME jointly estimates spatial human actions and environment contexts during training, then fine-tunes a pretrained video generative model on diverse egocentric human-object interaction videos. It conditions generation on input image, text prompt, and per-frame human actions (body poses and hand gestures).
Result: LOME demonstrates high action-following accuracy, strong generalization to unseen scenarios, realistic physical consequences of hand-object interactions (like liquid flowing), and significantly outperforms state-of-the-art image/video-based action-conditioned methods and I/T2V models in temporal consistency and motion control.
Conclusion: LOME enables photorealistic AR/VR experiences and scalable robotic training without being limited to simulated environments or explicit 3D/4D modeling, advancing realistic human-object interaction generation.
Abstract: Learning human-object manipulation presents significant challenges due to the fine-grained and contact-rich nature of the motions involved. Traditional physics-based animation requires extensive modeling and manual setup, and more importantly, it neither generalizes well across diverse object morphologies nor scales effectively to real-world environments. To address these limitations, we introduce LOME, an egocentric world model that can generate realistic human-object interactions as videos conditioned on an input image, a text prompt, and per-frame human actions, including both body poses and hand gestures. LOME injects strong and precise action guidance into object manipulation by jointly estimating spatial human actions and the environment contexts during training. After finetuning a pretrained video generative model on videos of diverse egocentric human-object interactions, LOME demonstrates not only high action-following accuracy and strong generalization to unseen scenarios, but also realistic physical consequences of hand-object interactions, e.g., liquid flowing from a bottle into a mug after executing a "pouring" action. Extensive experiments demonstrate that our video-based framework significantly outperforms state-of-the-art image-based and video-based action-conditioned methods and Image/Text-to-Video (I/T2V) generative models in terms of both temporal consistency and motion control. LOME paves the way for photorealistic AR/VR experiences and scalable robotic training, without being limited to simulated environments or relying on explicit 3D/4D modeling.
[278] From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis
Ranran Huang, Weixun Luo, Ye Mao, Krystian Mikolajczyk
Main category: cs.CV
TL;DR: NAS3R is a self-supervised framework that jointly learns 3D geometry and camera parameters from uncalibrated, unposed images without ground-truth annotations or pretrained priors.
Details
Motivation: To enable 3D reconstruction from unconstrained data without requiring ground-truth 3D annotations, camera parameters, or pretrained models, addressing the limitations of supervised methods that rely on expensive annotations.Method: Uses a feed-forward framework that reconstructs 3D Gaussians from uncalibrated context views and renders target views with self-predicted camera parameters. Employs a shared transformer backbone with masked attention for joint reconstruction and camera prediction, and a depth-based Gaussian formulation for stable optimization.
Result: Achieves superior results compared to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D reconstruction from unconstrained data.
Conclusion: NAS3R provides an effective self-supervised approach for 3D reconstruction that works with uncalibrated, unposed images and is compatible with state-of-the-art supervised architectures.
Abstract: In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. During training, NAS3R reconstructs 3D Gaussians from uncalibrated and unposed context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision. To ensure stable convergence, NAS3R integrates reconstruction and camera prediction within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization. The framework is compatible with state-of-the-art supervised 3D reconstruction architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R achieves superior results to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D reconstruction from unconstrained data. Code and models are publicly available at https://ranrhuang.github.io/nas3r/.
[279] Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development
Zhongying Deng, Cheng Tang, Ziyan Huang, Jiashi Lin, Ying Chen, Junzhi Ning, Chenglong Ma, Jiyao Liu, Wei Li, Yinghao Zhu, Shujian Gao, Yanyan Huang, Sibo Ju, Yanzhou Su, Pengcheng Chen, Wenhao Tang, Tianbin Li, Haoyu Wang, Yuanfeng Ji, Hui Sun, Shaobo Min, Liang Peng, Feilong Tang, Haochen Xue, Rulin Zhou, Chaoyang Zhang, Wenjie Li, Shaohao Rui, Weijie Ma, Xingyue Zhao, Yibin Wang, Kun Yuan, Zhaohui Lu, Shujun Wang, Jinjie Wei, Lihao Liu, Dingkang Yang, Lin Wang, Yulong Li, Haolin Yang, Yiqing Shen, Lequan Yu, Xiaowei Hu, Yun Gu, Yicheng Wu, Benyou Wang, Minghui Zhang, Angelica I. Aviles-Rivero, Qi Gao, Hongming Shan, Xiaoyu Ren, Fang Yan, Hongyu Zhou, Haodong Duan, Maosong Cao, Shanshan Wang, Bin Fu, Xiaomeng Li, Zhi Hou, Chunfeng Song, Lei Bai, Yuan Cheng, Yuandong Pu, Xiang Li, Wenhai Wang, Hao Chen, Jiaxin Zhuang, Songyang Zhang, Huiguang He, Mengzhang Li, Bohan Zhuang, Zhian Bai, Rongshan Yu, Liansheng Wang, Yukun Zhou, Xiaosong Wang, Xin Guo, Guanbin Li, Xiangru Lin, Dakai Jin, Mianxin Liu, Wenlong Zhang, Qi Qin, Conghui He, Yuqiang Li, Ye Luo, Nanqing Dong, Jie Xu, Wenqi Shao, Bo Zhang, Qiujuan Yan, Yihao Liu, Jun Ma, Zhi Lu, Yuewen Cao, Zongwei Zhou, Jianming Liang, Shixiang Tang, Qi Duan, Dongzhan Zhou, Chen Jiang, Yuyin Zhou, Yanwu Xu, Jiancheng Yang, Shaoting Zhang, Xiaohong Liu, Siqi Luo, Yi Xin, Chaoyu Liu, Haochen Wen, Xin Chen, Alejandro Lozano, Min Woo Sun, Yuhui Zhang, Yue Yao, Xiaoxiao Sun, Serena Yeung-Levy, Xia Li, Jing Ke, Chunhui Zhang, Zongyuan Ge, Ming Hu, Jin Ye, Zhifeng Li, Yirong Chen, Yu Qiao, Junjun He
Main category: cs.CV
TL;DR: Survey of 1,000+ medical image datasets reveals fragmentation and scale limitations, proposes metadata-driven fusion paradigm to integrate datasets for better medical foundation models.
Details
Motivation: Medical imaging lacks large-scale unified datasets due to clinical expertise requirements and privacy constraints, hindering development of powerful medical foundation models.Method: Comprehensive survey of over 1,000 open-access medical image datasets with systematic cataloging, analysis of fragmentation, and proposal of metadata-driven fusion paradigm (MDFP) to integrate datasets.
Result: Analysis shows medical image datasets are modest in scale, fragmented across narrow tasks, unevenly distributed across organs/modalities. Created interactive portal and unified structured table of all surveyed datasets.
Conclusion: Survey provides roadmap for scaling medical imaging corpora through dataset consolidation, supporting faster data discovery and more capable medical foundation models.
Abstract: Foundation models have demonstrated remarkable success across diverse domains and tasks, driven primarily by the availability of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembly of such datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.
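At its simplest, the metadata-driven fusion idea reduces to grouping datasets on shared metadata keys. The records and schema below are illustrative, not the survey's actual table:

```python
from collections import defaultdict

# Toy metadata records in the spirit of MDFP; names and fields are hypothetical.
datasets = [
    {"name": "A", "modality": "CT", "task": "segmentation"},
    {"name": "B", "modality": "CT", "task": "segmentation"},
    {"name": "C", "modality": "MRI", "task": "classification"},
    {"name": "D", "modality": "CT", "task": "classification"},
]

silos = defaultdict(list)
for d in datasets:
    silos[(d["modality"], d["task"])].append(d["name"])  # fuse on shared metadata

print(dict(silos))  # the CT/segmentation silo merges A and B into one larger resource
```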
[280] Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning
Feiding, Yongkang Zhang, Yuhao Liao, Zijian Zeng, Chunzheng Zhu, Yaozong Zheng, Yafei Liu, Yeling Peng, Youwei Wang, Sibo Wang, Huiming Yang, Linglin Liao, Shunzhi Yang
Main category: cs.CV
TL;DR: Differential Feedback improves vision-language model alignment by constructing token/step-level supervision masks from repaired reasoning trajectories, enabling process-level visual alignment without costly human annotations.
Details
Motivation: Current VLMs aligned via GRPO-style training suffer from sparse credit assignment in multi-step reasoning, weakening the link between visual evidence and intermediate steps, causing unstable optimization and visual hallucinations.Method: Proposes Differential Feedback which automatically constructs token/step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking key positions requiring correction. Can be integrated into existing GRPO-like frameworks without large-scale step-by-step human annotations.
Result: Experiments on multimodal reasoning benchmarks (MMStar and MathVista) show average 3% improvement under matched compute budgets.
Conclusion: The approach offers an effective, low-cost solution for accurate vision-reasoning process alignment in VLMs.
Abstract: Vision–language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)-style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi-step reasoning, weakening the linkage between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Differential Feedback, which automatically constructs token/step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that require correction. Without costly large-scale step-by-step human annotations, our method enables process-level visual alignment and can be seamlessly integrated into existing GRPO-like frameworks. Experiments on multimodal reasoning benchmarks including MMStar and MathVista show an average 3% improvement under matched compute budgets. Our approach offers an effective, low-cost solution for accurate vision–reasoning process alignment.
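The token-level supervision masks can be pictured as weighting a per-token loss so that only the positions flagged during trajectory repair contribute. A toy sketch (loss values and mask are hypothetical, not the paper's exact objective):

```python
import numpy as np

# Hedged sketch: a token-level supervision mask concentrates credit on positions
# flagged during trajectory repair; values here are made up for illustration.
per_token_loss = np.array([0.2, 1.5, 0.1, 2.0, 0.3])
mask = np.array([0.0, 1.0, 0.0, 1.0, 0.0])  # 1 = key position needing correction

masked_loss = (per_token_loss * mask).sum() / mask.sum()
print(masked_loss)  # 1.75 -> credit assignment focuses on the erroneous steps
```

Compared with a single terminal reward spread over every token, this is the sense in which the supervision becomes "process-level" rather than sparse.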
[281] Estimating the Impact of COVID-19 on Travel Demand in Houston Area Using Deep Learning and Satellite Imagery
Alekhya Pachika, Lu Gao, Lingguang Song, Pan Lu, Xingju Wang
Main category: cs.CV
TL;DR: Satellite imagery analysis using computer vision (Detectron2/Faster R-CNN) to count cars and estimate COVID-19 impact on travel demand in Houston metropolitan area.
Details
Motivation: To leverage high-resolution satellite imagery (15-30 cm GSD) and computer vision algorithms for monitoring transportation infrastructure and estimating travel demand, specifically analyzing COVID-19's impact on economic activities through vehicle presence detection.Method: Used Google Earth Engine satellite imagery datasets, developed car-counting models using Detectron2 and Faster R-CNN frameworks to detect vehicles at various locations (university, shopping mall, plaza, restaurant, supermarket) before and during COVID-19.
Result: Car counts reduced by average 30% in 2020 compared to 2019 across monitored locations, demonstrating satellite imagery’s effectiveness for travel demand and economic activity estimation.
Conclusion: Satellite imagery combined with computer vision/deep learning provides reliable information for transportation decision-making and can effectively monitor infrastructure usage and economic trends.
Abstract: Considering recent advances in remote sensing satellite systems and computer vision algorithms, many satellite sensing platforms and sensors have been used to monitor the condition and usage of transportation infrastructure systems. The level of detail that can be detected increases significantly as the ground sample distance (GSD) decreases; GSD is around 15 cm - 30 cm for high-resolution satellite images. In this study, we analyzed data acquired from high-resolution satellite imagery to provide insights, predictive signals, and trends for travel demand estimation. More specifically, we estimate the impact of COVID-19 in the metropolitan area of Houston using satellite imagery from Google Earth Engine datasets. We developed a car-counting model through Detectron2 and Faster R-CNN to monitor the presence of cars within different locations (i.e., university, shopping mall, community plaza, restaurant, supermarket) before and during COVID-19. The results show that the number of cars detected at these selected locations was reduced by an average of 30% in 2020 compared with the previous year, 2019. The results also show that satellite imagery provides rich information for travel demand and economic activity estimation. Together with advanced computer vision and deep learning algorithms, it can generate reliable and accurate information for transportation agency decision makers.
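The headline 30% figure is a simple year-over-year comparison of detected car counts. With hypothetical per-site counts (the paper does not publish these exact numbers), the computation looks like:

```python
# Hypothetical 2019 vs 2020 detected-car counts per monitored site; the paper
# reports an average 30% drop, so the numbers below are illustrative only.
counts_2019 = {"university": 100, "mall": 200, "plaza": 300, "restaurant": 400, "supermarket": 200}
counts_2020 = {"university": 60, "mall": 140, "plaza": 240, "restaurant": 300, "supermarket": 130}

drops = {site: 1 - counts_2020[site] / counts_2019[site] for site in counts_2019}
avg_drop = sum(drops.values()) / len(drops)
print(f"average reduction: {avg_drop:.0%}")  # average reduction: 30%
```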
[282] Fully Spiking Neural Networks with Target Awareness for Energy-Efficient UAV Tracking
Pengzhi Zhong, Jiwei Mo, Dan Zeng, Feixiang He, Shuiwang Li
Main category: cs.CV
TL;DR: STATrack is a fully spiking neural network framework for UAV visual tracking using only RGB inputs, achieving competitive performance with low energy consumption.
Details
Motivation: Existing SNN-based trackers rely on costly event cameras, limiting deployment on UAVs. The authors aim to create an efficient SNN tracker using standard RGB inputs instead of event cameras.Method: Proposes STATrack, a fully spiking neural network framework for UAV visual tracking with RGB inputs. Introduces adaptive mutual information maximization between templates and features to mitigate target feature weakening by background tokens.
Result: Extensive experiments on four UAV tracking benchmarks show STATrack achieves competitive tracking performance while maintaining low energy consumption.
Conclusion: STATrack demonstrates that efficient SNN-based visual tracking is possible with standard RGB inputs, making it more practical for UAV deployment compared to event camera-dependent approaches.
Abstract: Spiking Neural Networks (SNNs), characterized by their event-driven computation and low power consumption, have shown great potential for energy-efficient visual tracking on unmanned aerial vehicles (UAVs). However, existing efficient SNN-based trackers heavily rely on costly event cameras, limiting their deployment on UAVs. To address this limitation, we propose STATrack, an efficient fully spiking neural network framework for UAV visual tracking using RGB inputs only. To the best of our knowledge, this work is the first to investigate spiking neural networks for UAV visual tracking tasks. To mitigate the weakening of target features by background tokens, we propose adaptively maximizing the mutual information between templates and features. Extensive experiments on four widely used UAV tracking benchmarks demonstrate that STATrack achieves competitive tracking performance while maintaining low energy consumption.
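The abstract does not spell out how the template-feature mutual information is maximized; a common surrogate for MI between paired embeddings is an InfoNCE-style contrastive loss, sketched below as an assumption about the general technique rather than STATrack's exact formulation:

```python
import numpy as np

def infonce(template, features, tau=0.1):
    """InfoNCE-style contrastive loss, a standard lower-bound surrogate for the
    mutual information between paired embeddings (a hedged stand-in, not
    necessarily STATrack's actual objective)."""
    t = template / np.linalg.norm(template, axis=1, keepdims=True)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = t @ f.T / tau                               # (N, N) pairwise similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize log-sum-exp
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # matched pairs on the diagonal

rng = np.random.default_rng(0)
z = rng.standard_normal((32, 16))                               # template embeddings
aligned = infonce(z, z + 0.01 * rng.standard_normal((32, 16)))  # features track templates
mismatched = infonce(z, rng.standard_normal((32, 16)))          # unrelated features
print(aligned < mismatched)  # True: minimizing the loss pulls matched pairs together
```

Intuitively, a low loss means target features stay predictive of their templates even when background tokens dominate the scene.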
[283] Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
Xuanpu Zhao, Zhentao Tan, Dianmo Sheng, Tianxiang Chen, Yao Liu, Yue Wu, Tao Gong, Qi Chu, Nenghai Yu
Main category: cs.CV
TL;DR: Proposes a two-stage reinforcement learning framework to improve MLLMs’ attention to cropped regions in agent-based visual question answering, addressing over-reliance on global context.
Details
Motivation: Existing agent-based MLLMs for visual question answering show strong reliance on global image context and weak dependence on cropped region details, limiting fine-grained perception.Method: Two-stage RL framework: 1) “Information Gap” mechanism adjusts global image granularity to train models to focus on cropped key regions based on information gain; 2) Incorporates grounding loss with bounding box annotations to enhance cropping precision.
Result: Significantly enhances model attention to cropped regions and achieves state-of-the-art performance on high-resolution visual question-answering benchmarks.
Conclusion: Provides a more efficient approach for perceiving and reasoning fine-grained details in MLLMs without requiring trajectory supervision.
Abstract: To enhance the perception and reasoning capabilities of multimodal large language models in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously use an image-cropping tool to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning and reinforcement learning, have made significant progress, our empirical analysis reveals a key limitation: the model relies strongly on the global input and only weakly on the details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the "Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model’s attention to cropped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning about fine-grained details in MLLMs. Code is available at: https://github.com/XuanPu-Z/LFPC.
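The "Information Gap" reward can be pictured as the gain in answer confidence that the cropped region provides over a deliberately coarsened global view. A hypothetical sketch (the function and probabilities are illustrative, not the paper's reward):

```python
# Hedged sketch of an "Information Gap"-style reward: reward the crop by the gain
# in answer confidence over a coarsened global image. p_* are hypothetical
# probabilities the policy assigns to the correct answer.
def info_gap_reward(p_coarse_global, p_with_crop):
    return max(0.0, p_with_crop - p_coarse_global)  # information gain from the crop

print(info_gap_reward(0.25, 0.75))  # 0.5 -> focusing on the cropped region paid off
print(info_gap_reward(0.75, 0.5))   # 0.0 -> the crop added nothing over global context
```

Coarsening the global image is what forces the gain to come from the crop rather than from context the model could read off the full-resolution input anyway.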
[284] Streamlined Open-Vocabulary Human-Object Interaction Detection
Chang Sun, Dongliang Liao, Changxing Ding
Main category: cs.CV
TL;DR: SL-HOI is a streamlined open-vocabulary human-object interaction detection framework that uses only DINOv3 components for both localization and classification, achieving SOTA performance with minimal learnable parameters.
Details
Motivation: Existing open-vocabulary HOI detection methods struggle with feature fusion due to significant gaps between conventional HOI detectors and vision-language models, requiring a more streamlined approach.Method: Leverages DINOv3’s backbone for fine-grained localization and its text-aligned vision head for open-vocabulary classification, with cross-attention between interaction queries and vision head outputs, keeping all DINOv3 parameters frozen.
Result: Achieves state-of-the-art performance on both SWiG-HOI and HICO-DET benchmarks, demonstrating effectiveness of the streamlined architecture.
Conclusion: SL-HOI provides an effective streamlined solution for open-vocabulary HOI detection by fully utilizing DINOv3’s capabilities with minimal additional parameters.
Abstract: Open-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training. Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories. However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations. To address this issue, we introduce SL-HOI, a StreamLined open-vocabulary HOI detection framework based solely on the powerful DINOv3 model. Our design leverages the complementary strengths of DINOv3’s components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification. Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head’s output, we propose first feeding both the interaction queries and the backbone image tokens into the vision head, effectively bridging their representation gaps. All DINOv3 parameters in our approach are frozen, with only a small number of learnable parameters added, allowing a fast adaptation to the HOI detection task. Extensive experiments show that SL-HOI achieves state-of-the-art performance on both the SWiG-HOI and HICO-DET benchmarks, demonstrating the effectiveness of our streamlined model architecture. Code is available at https://github.com/MPI-Lab/SL-HOI.
[285] Transferring Physical Priors into Remote Sensing Segmentation via Large Language Models
Yuxi Lu, Kunqi Li, Zhidong Li, Xiaohan Su, Biao Wu, Chenya Huang, Bin Liang
Main category: cs.CV
TL;DR: PriorSeg: A physics-aware segmentation model that integrates domain-specific physical priors via a knowledge graph and joint visual-physical training, improving remote sensing segmentation without retraining foundation models.
Details
Motivation: Remote sensing semantic segmentation requires integrating multiple physical variables (DEM, SAR, NDVI) beyond just optical images. Current foundation models depend on spatially aligned data and costly retraining for new sensors, limiting their flexibility and efficiency.
Method: 1) Construct Physical-Centric Knowledge Graph (PCKG) using LLMs to extract physical priors from 1,763 vocabularies; 2) Build Phy-Sky-SA dataset (heterogeneous, spatial-aligned); 3) Develop PriorSeg with physics-aware residual refinement and joint visual-physical training with physics-consistency loss.
Result: PriorSeg improves segmentation accuracy and physical plausibility in heterogeneous settings without retraining foundation models. Ablation studies confirm effectiveness of Phy-Sky-SA dataset, PCKG, and physics-consistency loss.
Conclusion: The proposed paradigm successfully integrates domain-specific physical priors into segmentation models, overcoming limitations of current foundation models while maintaining accuracy and physical consistency without costly retraining.
Abstract: Semantic segmentation of remote sensing imagery is fundamental to Earth observation. Achieving accurate results requires integrating not only optical images but also physical variables such as the Digital Elevation Model (DEM), Synthetic Aperture Radar (SAR) and Normalized Difference Vegetation Index (NDVI). Recent foundation models (FMs) leverage pre-training to exploit these variables but still depend on spatially aligned data and costly retraining when involving new sensors. To overcome these limitations, we introduce a novel paradigm for integrating domain-specific physical priors into segmentation models. We first construct a Physical-Centric Knowledge Graph (PCKG) by prompting large language models to extract physical priors from 1,763 vocabularies, and use it to build a heterogeneous, spatial-aligned dataset, Phy-Sky-SA. Building on this foundation, we develop PriorSeg, a physics-aware residual refinement model trained with a joint visual-physical strategy that incorporates a novel physics-consistency loss. Experiments on heterogeneous settings demonstrate that PriorSeg improves segmentation accuracy and physical plausibility without retraining the FMs. Ablation studies verify the effectiveness of the Phy-Sky-SA dataset, the PCKG, and the physics-consistency loss.
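The summary does not define the physics-consistency loss, but one plausible minimal form penalizes probability mass placed on classes that the physical priors rule out at each pixel. A hedged numpy sketch under that assumption (function name, shapes, and the binary implausibility map are all illustrative, not the paper's formulation):

```python
import numpy as np

def physics_consistency_loss(class_probs, implausible):
    """Penalize probability mass assigned to physically implausible classes.

    class_probs: (H, W, C) softmax outputs of the segmentation head.
    implausible: (H, W, C) binary map, 1 where physical priors (e.g. DEM
                 slope or NDVI range, via the knowledge graph) rule out
                 class c at that pixel.
    """
    return float(np.mean(np.sum(class_probs * implausible, axis=-1)))
```

The loss is 0 when no prediction violates a prior and grows toward 1 as more mass lands on prior-excluded classes, so it can be added to the usual segmentation loss with a weighting coefficient.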
[286] Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
Haifeng Huang, Yilun Chen, Zehan Wang, Jiangmiao Pang, Zhou Zhao
Main category: cs.CV
TL;DR: Chat-Scene++ is a multimodal LLM framework for 3D scene understanding that represents scenes as context-rich object sequences, enabling fine-grained object grounding and spatial reasoning without task-specific fine-tuning.
Details
Motivation: Existing MLLMs struggle with fine-grained object grounding and contextual reasoning in 3D environments, limiting their ability to interpret and interact with complex 3D scenes.
Method: Represents 3D scenes as sequences of objects with contextual semantics, using identifier tokens and extracting context-rich object features from pre-trained 3D scene-level and 2D image-level encoders. Supports grounded chain-of-thought reasoning for multi-step inference.
Result: Achieves state-of-the-art performance on five major 3D vision-language benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, SQA3D) and demonstrates applicability to real-world scenarios using only 2D inputs.
Conclusion: Chat-Scene++ effectively addresses limitations in 3D scene understanding through object-centric representation and contextual reasoning, showing strong performance across diverse 3D vision-language tasks.
Abstract: Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level encoders, unlike the isolated per-object features in Chat-Scene. Its flexible object-centric design also supports grounded chain-of-thought (G-CoT) reasoning, enabling the model to distinguish objects at both category and spatial levels during multi-step inference. Without the need for additional task-specific heads or fine-tuning, Chat-Scene++ achieves state-of-the-art performance on five major 3D vision-language benchmarks: ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D. These results highlight its effectiveness in scene comprehension, object grounding, and spatial reasoning. Additionally, without reconstructing 3D worlds through computationally expensive processes, we demonstrate its applicability to real-world scenarios using only 2D inputs.
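As a toy illustration of the object-sequence interface, each object representation can be paired with a unique identifier token so the LLM can refer back to specific objects during grounding. The `<OBJ…>` token format below is an assumption for illustration, not the paper's actual vocabulary, and strings stand in for the real feature embeddings:

```python
def build_scene_sequence(object_descriptions):
    """Pair each object with an identifier token, forming the scene sequence.

    In the real model each entry would be an object embedding from the 3D
    scene-level and 2D image-level encoders; strings are used here only to
    show the identifier-token interface.
    """
    parts = [f"<OBJ{i:03d}> {d}" for i, d in enumerate(object_descriptions)]
    return " ".join(parts)
```

The LLM can then answer a grounding query by emitting an identifier token (e.g. `<OBJ001>`) that maps directly back to one object's mask or box.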
[287] Understanding Semantic Perturbations on In-Processing Generative Image Watermarks
Anirudh Nakra, Min Wu
Main category: cs.CV
TL;DR: A framework for stress-testing generative model watermarks against semantic manipulations reveals current methods fail when content meaning changes, despite being robust to conventional perturbations.
Details
Motivation: As generative models proliferate, reliable provenance and authentication mechanisms are needed. While in-processing watermarks claim robustness to standard post-processing, their resilience to semantic manipulations that alter scene content while maintaining visual quality is poorly understood.
Method: A multi-stage framework using off-the-shelf models for object detection, mask generation, and semantically guided inpainting/regeneration to produce controlled, meaning-altering edits with minimal perceptual degradation for systematic stress-testing.
Result: Robustness varies significantly with semantic entanglement - methods that remain detectable under conventional perturbations often fail under semantic edits, with watermark detectability dropping to near zero while image quality remains high.
Conclusion: Current watermarking evaluations have a critical gap; watermark designs and benchmarking must explicitly account for robustness against semantic manipulation.
Abstract: The widespread deployment of high-fidelity generative models has intensified the need for reliable mechanisms for provenance and content authentication. In-processing watermarking, embedding a signature into the generative model’s synthesis procedure, has been advocated as a solution and is often reported to be robust to standard post-processing (such as geometric transforms and filtering). Yet robustness to semantic manipulations that alter high-level scene content while maintaining reasonable visual quality is not well studied or understood. We introduce a simple, multi-stage framework for systematically stress-testing in-processing generative watermarks under semantic drift. The framework utilizes off-the-shelf models for object detection, mask generation, and semantically guided inpainting or regeneration to produce controlled, meaning-altering edits with minimal perceptual degradation. Based on extensive experiments on representative schemes, we find that robustness varies significantly with the degree of semantic entanglement: methods whose watermarks remain detectable under a broad suite of conventional perturbations can fail under semantic edits, with watermark detectability in many cases dropping to near zero while image quality remains high. Overall, our results reveal a critical gap in current watermarking evaluations and suggest that watermark designs and benchmarking must explicitly account for robustness against semantic manipulation.
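The measurement loop at the end of such a pipeline is straightforward: apply each semantic edit to a set of watermarked images and record how often the detector still fires. A hedged sketch of that harness, where `detect` and the edit functions stand in for the off-the-shelf watermark detector and the inpainting/regeneration stages (all names are placeholders):

```python
def stress_test(detect, images, edits):
    """Measure watermark detection rate after each semantic edit.

    detect: image -> bool (watermark still detectable).
    edits:  dict mapping edit name -> edit function (e.g. object
            inpainting, semantically guided regeneration).
    Returns the fraction of images still detected per edit.
    """
    rates = {}
    for name, edit in edits.items():
        hits = sum(1 for img in images if detect(edit(img)))
        rates[name] = hits / len(images)
    return rates
```

Plotting these rates against an edit-strength or semantic-entanglement axis is what surfaces the near-zero detectability regime the paper reports.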
[288] SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering
Jiahao Niu, Rongjia Zheng, Wenju Xu, WeiShi Zheng, Qing Zhang
Main category: cs.CV
TL;DR: SGS-Intrinsic is a 3D Gaussian Splatting-based indoor inverse rendering framework that achieves high-quality geometry reconstruction and accurate material-illumination disentanglement from sparse-view images.
Details
Motivation: Existing 3DGS-based inverse rendering methods focus on object-centric reconstruction and fail under sparse view settings. There's a need for methods that can work with sparse views while achieving accurate material and illumination disentanglement for indoor scenes.
Method: 1. Constructs dense geometry-consistent Gaussian semantic field using semantic and geometric priors. 2. Performs material-illumination disentanglement with hybrid illumination model and material prior. 3. Introduces illumination-invariant material constraint and deshadowing model to mitigate cast shadows and enhance material recovery robustness.
Result: Extensive experiments on benchmark datasets show consistent improvements in both reconstruction fidelity and inverse rendering quality over existing 3DGS-based inverse rendering approaches.
Conclusion: SGS-Intrinsic successfully addresses sparse-view inverse rendering challenges for indoor scenes, achieving high-quality geometry reconstruction and accurate material-illumination disentanglement through semantic-guided Gaussian fields and robust illumination modeling.
Abstract: We present SGS-Intrinsic, an indoor inverse rendering framework that works well for sparse-view images. Unlike existing 3D Gaussian Splatting (3DGS) based methods that focus on object-centric reconstruction and fail to work under sparse view settings, our method achieves high-quality geometry reconstruction and accurate disentanglement of material and illumination. The core idea is to construct a dense and geometry-consistent Gaussian semantic field guided by semantic and geometric priors, providing a reliable foundation for subsequent inverse rendering. Building upon this, we perform material-illumination disentanglement by combining a hybrid illumination model and a material prior to effectively capture illumination-material interactions. To mitigate the impact of cast shadows and enhance the robustness of material recovery, we introduce an illumination-invariant material constraint together with a deshadowing model. Extensive experiments on benchmark datasets show that our method consistently improves both reconstruction fidelity and inverse rendering quality over existing 3DGS-based inverse rendering approaches. Our code is available at https://github.com/GrumpySloths/SGS_Intrinsic.github.io.
[289] SPROUT: A Scalable Diffusion Foundation Model for Agricultural Vision
Shuai Xiang, Wei Guo, James Burridge, Shouyang Liu, Hao Lu, Tokihiro Fukatsu
Main category: cs.CV
TL;DR: SPROUT is a scalable plant representation model trained via diffusion denoising on 2.6M agricultural images, outperforming web-pretrained and agricultural foundation models across various downstream tasks with lower pre-training cost.
Details
Motivation: Vision Foundation Models pre-trained on general data suffer from significant domain gaps when applied to agriculture, creating a need for specialized agricultural foundation models that can handle diverse crops, growth stages, and environments.
Method: Uses a VAE-free Pixel-space Diffusion Transformer trained via diffusion denoising on 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. The model learns rich, structure-aware representations through denoising enabling efficient end-to-end training.
Result: SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a wide range of downstream tasks while requiring substantially lower pre-training cost.
Conclusion: SPROUT demonstrates the effectiveness of diffusion-based pre-training for agricultural vision tasks, providing a scalable solution that bridges the domain gap between general vision models and specialized agricultural applications.
Abstract: Vision Foundation Models (VFM) pre-trained on large-scale unlabeled data have achieved remarkable success on general computer vision tasks, yet typically suffer from significant domain gaps when applied to agriculture. In this context, we introduce $SPROUT$ ($S$calable $P$lant $R$epresentation model via $O$pen-field $U$nsupervised $T$raining), a multi-crop, multi-task agricultural foundation model trained via diffusion denoising. SPROUT leverages a VAE-free Pixel-space Diffusion Transformer to learn rich, structure-aware representations through denoising, enabling efficient end-to-end training. We pre-train SPROUT on a curated dataset of 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. Extensive experiments demonstrate that SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a wide range of downstream tasks, while requiring substantially lower pre-training cost. The code and model are available at https://github.com/UTokyo-FieldPhenomics-Lab/SPROUT.
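SPROUT's pre-training signal is standard diffusion denoising: noise a clean image to a sampled timestep, predict the injected noise, and score the prediction with MSE. A minimal numpy sketch of that per-sample objective (the noise schedule and predictor are placeholders, and the real model operates on pixel-space transformer tokens rather than raw arrays):

```python
import numpy as np

def denoising_loss(x0, t, alphas_cumprod, predict_noise, rng):
    """Noise a clean image to timestep t and score the noise prediction.

    x0: clean image array; alphas_cumprod: cumulative noise schedule;
    predict_noise(x_t, t): the denoising network (placeholder callable).
    """
    eps = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return float(np.mean((predict_noise(x_t, t) - eps) ** 2))
```

The representations used for downstream tasks are then taken from intermediate activations of the trained denoiser, which is what makes this objective double as self-supervised pre-training.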
[290] TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets
Zhixuan Liu, Peter Schaldenbrand, Yijun Li, Long Mai, Aniruddha Mahapatra, Cusuh Ham, Jean Oh, Jui-Hsien Wang
Main category: cs.CV
TL;DR: TokenDial enables continuous slider-style control over video attributes in pretrained text-to-video models by learning additive offsets in token space without retraining the backbone.
Details
Motivation: Current text-to-video generation models lack fine-grained control over attribute intensity (e.g., effect strength or motion magnitude) without compromising video identity, background consistency, or temporal coherence.
Method: TokenDial learns attribute-specific additive offsets in intermediate spatiotemporal visual patch-token space. It uses pretrained understanding signals: semantic direction matching for appearance attributes and motion-magnitude scaling for motion attributes, without retraining the backbone model.
Result: TokenDial achieves stronger controllability and higher-quality edits than state-of-the-art baselines across diverse attributes and prompts, validated through extensive quantitative evaluation and human studies.
Conclusion: TokenDial provides an effective framework for continuous attribute control in text-to-video generation by leveraging token-space offsets, enabling precise control over both appearance and motion dynamics while maintaining video coherence.
Abstract: We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce strong holistic videos, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drifting identity, background, or temporal coherence. TokenDial is built on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction, where adjusting the offset magnitude yields coherent, predictable edits for both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial’s effectiveness on diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, supported by extensive quantitative evaluation and human studies.
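The core mechanism, an additive offset scaled by a slider value, reduces to a one-line token update. A hedged numpy sketch of that idea (the unit-normalization and shapes are assumptions; the paper's learned offsets may not be normalized this way):

```python
import numpy as np

def apply_attribute_offset(tokens, direction, strength):
    """Slider-style edit: add a scaled attribute direction to patch tokens.

    tokens:    (N, d) spatiotemporal patch tokens at an intermediate layer.
    direction: (d,) learned attribute offset (unit-normalized here for a
               predictable slider scale).
    strength:  scalar slider value; 0 leaves the video unchanged.
    """
    unit = direction / (np.linalg.norm(direction) + 1e-8)
    return tokens + strength * unit
```

Because the edit is a single additive direction applied uniformly, identity and background content carried by the orthogonal token components is largely untouched, which matches the paper's motivation for working in token space.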
[291] OmniColor: A Unified Framework for Multi-modal Lineart Colorization
Xulu Zhang, Haoqian Du, Xiaoyong Wei, Qing Li
Main category: cs.CV
TL;DR: OmniColor is a unified framework for multi-modal lineart colorization that supports arbitrary combinations of control signals, achieving precise boundary preservation and efficient semantic reference handling.
Details
Motivation: Lineart colorization is crucial for professional content creation but faces challenges in achieving precise and flexible results under diverse user constraints. Existing methods struggle with handling multiple control signals simultaneously while maintaining quality and efficiency.
Method: The framework categorizes guidance signals into spatially-aligned conditions (using dual-path encoding with Dense Feature Alignment loss) and semantic-reference conditions (using VLM-only encoding with Temporal Redundancy Elimination). An Adaptive Spatial-Semantic Gating module dynamically balances multi-modal constraints to resolve input conflicts.
Result: Experimental results show OmniColor achieves superior controllability, visual quality, and temporal stability compared to existing methods, providing a robust practical solution for lineart colorization.
Conclusion: OmniColor offers a unified framework that effectively handles diverse control signals for lineart colorization, balancing spatial precision with semantic reference efficiency through innovative architectural components.
Abstract: Lineart colorization is a critical stage in professional content creation, yet achieving precise and flexible results under diverse user constraints remains a significant challenge. To address this, we propose OmniColor, a unified framework for multi-modal lineart colorization that supports arbitrary combinations of control signals. Specifically, we systematically categorize guidance signals into two types: spatially-aligned conditions and semantic-reference conditions. For spatially-aligned inputs, we employ a dual-path encoding strategy paired with a Dense Feature Alignment loss to ensure rigorous boundary preservation and precise color restoration. For semantic-reference inputs, we utilize a VLM-only encoding scheme integrated with a Temporal Redundancy Elimination mechanism to filter repetitive information and enhance inference efficiency. To resolve potential input conflicts, we introduce an Adaptive Spatial-Semantic Gating module that dynamically balances multi-modal constraints. Experimental results demonstrate that OmniColor achieves superior controllability, visual quality, and temporal stability, providing a robust and practical solution for lineart colorization. The source code and dataset will be open at https://github.com/zhangxulu1996/OmniColor.
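At its simplest, a spatial-semantic gate is a learned sigmoid blend between the two condition branches. A minimal sketch of that pattern, with the gate logits passed in directly (in the actual module they would be predicted by a small network from both feature streams; this is an assumption about the general form, not the paper's exact architecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_gate(spatial_feat, semantic_feat, gate_logits):
    """Blend spatially-aligned and semantic-reference features per element.

    gate_logits: same shape as the features; large positive values favor
    the spatial branch, large negative values the semantic branch.
    """
    g = sigmoid(gate_logits)
    return g * spatial_feat + (1.0 - g) * semantic_feat
```

Making the gate elementwise lets the model trust the spatially-aligned signal near lineart boundaries while deferring to the semantic reference in flat color regions, which is how such a module can resolve conflicting inputs.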
[292] Demo-Pose: Depth-Monocular Modality Fusion For Object Pose Estimation
Rachit Agarwal, Abhishek Joshi, Sathish Chalasani, Woo Jin Kim
Main category: cs.CV
TL;DR: DeMo-Pose: A hybrid RGB-D architecture for category-level 9-DoF object pose estimation that fuses monocular semantic features with depth-based graph representations using novel multimodal fusion and mesh-point loss.
Details
Motivation: Existing methods either use only depth data (ignoring semantic RGB cues) or have suboptimal RGB-D fusion that fails to align semantic and geometric information effectively for category-level pose estimation without CAD models.
Method: Proposes DeMo-Pose with: 1) Hybrid architecture fusing monocular semantic features with depth-based graph convolutional representations via novel multimodal fusion strategy, 2) Mesh-Point Loss (MPL) that leverages mesh structure during training without inference overhead to improve geometric reasoning.
Result: Achieves real-time inference and outperforms state-of-the-art methods, including GPV-Pose baseline by 3.2% on 3D IoU and 11.1% on pose accuracy on REAL275 benchmark.
Conclusion: Demonstrates effectiveness of depth-RGB fusion and geometry-aware learning for robust category-level 3D pose estimation in real-world applications.
Abstract: Object pose estimation is a fundamental task in 3D vision with applications in robotics, AR/VR, and scene understanding. We address the challenge of category-level 9-DoF pose estimation (6D pose + 3D size) from RGB-D input, without relying on CAD models during inference. Existing depth-only methods achieve strong results but ignore semantic cues from RGB, while many RGB-D fusion models underperform due to suboptimal cross-modal fusion that fails to align semantic RGB cues with 3D geometric representations. We propose DeMo-Pose, a hybrid architecture that fuses monocular semantic features with depth-based graph convolutional representations via a novel multimodal fusion strategy. To further improve geometric reasoning, we introduce a novel Mesh-Point Loss (MPL) that leverages mesh structure during training without adding inference overhead. Our approach achieves real-time inference and significantly improves over state-of-the-art methods across object categories, outperforming the strong GPV-Pose baseline by 3.2% on 3D IoU and 11.1% on pose accuracy on the REAL275 benchmark. The results highlight the effectiveness of depth-RGB fusion and geometry-aware learning, enabling robust category-level 3D pose estimation for real-world applications.
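The Mesh-Point Loss itself is not spelled out in the summary; one natural training-only formulation that matches the description (mesh structure supervises predicted geometry, with no inference-time cost) is a chamfer-style distance between predicted object points and vertices sampled from the category mesh. That interpretation is an assumption; the sketch below shows only the chamfer term:

```python
import numpy as np

def chamfer_distance(pred_pts, mesh_verts):
    """Symmetric chamfer distance between two point sets.

    pred_pts:   (N, 3) points predicted from the fused RGB-D features.
    mesh_verts: (M, 3) vertices sampled from the category mesh (training
                only; the mesh is never needed at inference).
    """
    d = np.linalg.norm(pred_pts[:, None, :] - mesh_verts[None, :, :], axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```

Because the loss acts only on the training objective, the deployed network keeps its CAD-model-free, real-time inference path.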
[293] MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction
Jongmin Lee, Seungyeop Kang, Sungjoo Yoo
Main category: cs.CV
TL;DR: MV-RoMa is a multi-view dense matching model that jointly estimates correspondences across multiple images for better 3D reconstruction, avoiding fragmented tracks from pairwise matching.
Details
Motivation: Existing matchers operate pairwise, producing fragmented and geometrically inconsistent tracks when chained across multiple views, which harms 3D vision tasks like structure-from-motion.
Method: Designs an efficient architecture with: (1) multi-view encoder using pairwise matching as geometric prior, (2) multi-view matching refiner using pixel-wise attention, and (3) post-processing to integrate consistent multi-view correspondences as high-quality tracks for SfM.
Result: Produces more reliable correspondences and substantially denser, more accurate 3D reconstructions than existing sparse and dense matching methods across diverse benchmarks.
Conclusion: MV-RoMa effectively addresses the limitations of pairwise matching by providing geometrically consistent multi-view correspondences, improving 3D reconstruction quality.
Abstract: Establishing consistent correspondences across images is essential for 3D vision tasks such as structure-from-motion (SfM), yet most existing matchers operate in a pairwise manner, often producing fragmented and geometrically inconsistent tracks when their predictions are chained across views. We propose MV-RoMa, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets. Specifically, we design an efficient model architecture which avoids the high computational cost of full cross-attention for multi-view feature interaction: (i) a multi-view encoder that leverages pair-wise matching results as a geometric prior, and (ii) a multi-view matching refiner that refines correspondences using pixel-wise attention. Additionally, we propose a post-processing strategy that integrates our model’s consistent multi-view correspondences as high-quality tracks for SfM. Across diverse and challenging benchmarks, MV-RoMa produces more reliable correspondences and substantially denser, more accurate 3D reconstructions than existing sparse and dense matching methods. Project page: https://icetea-cv.github.io/mv-roma/.
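Turning pairwise correspondences into multi-view tracks is, at its core, connected components over (image, pixel) observations; the fragmentation problem the paper targets arises when chained components disagree geometrically. A minimal union-find sketch of the track-merging step (the real post-processing additionally enforces geometric consistency, which this sketch omits):

```python
def build_tracks(matches):
    """Merge pairwise correspondences into multi-view tracks.

    matches: list of ((img_a, pix_a), (img_b, pix_b)) correspondences.
    Returns a list of tracks, each a set of (img, pix) observations.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)  # union the two observations

    tracks = {}
    for node in parent:
        tracks.setdefault(find(node), set()).add(node)
    return list(tracks.values())
```

With jointly estimated multi-view correspondences as input, the merged components stay coherent instead of splintering into short, inconsistent tracks.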
[294] Annotation-Free Detection of Drivable Areas and Curbs Leveraging LiDAR Point Cloud Maps
Fulong Ma, Daojie Peng, Jun Ma
Main category: cs.CV
TL;DR: Automated training data generation for drivable area and curb detection using LiDAR mapping and localization to avoid occlusion/sparsity issues, with data review filtering.
Details
Motivation: Manual labeling for drivable area and curb detection is costly and time-consuming, limiting real-world application of DNN-based methods. Previous automated methods suffered from occlusion and distant point cloud sparsity issues.
Method: Proposes MADL (map-based automatic data labeler) module combining LiDAR mapping/localization with curb detection to generate training data automatically. Uses LiDAR mapping to avoid occlusion/sparsity issues and includes a data review agent to filter low-quality samples.
Result: Experiments on KITTI, KITTI-CARLA and 3D-Curb datasets show MADL achieves impressive performance compared to manual labeling, and outperforms traditional and state-of-the-art self-supervised methods in robustness and accuracy.
Conclusion: MADL provides an effective automated solution for generating training data for drivable area and curb detection, overcoming limitations of manual labeling and previous automated methods.
Abstract: Drivable areas and curbs are critical traffic elements for autonomous driving, forming essential components of the vehicle visual perception system and ensuring driving safety. Deep neural networks (DNNs) have significantly improved perception performance for drivable area and curb detection, but most DNN-based methods rely on large manually labeled datasets, which are costly, time-consuming, and expert-dependent, limiting their real-world application. Thus, we developed an automated training data generation module. Our previous work generated training labels using single-frame LiDAR and RGB data, suffering from occlusion and distant point cloud sparsity. In this paper, we propose a novel map-based automatic data labeler (MADL) module, combining LiDAR mapping/localization with curb detection to automatically generate training data for both tasks. MADL avoids occlusion and point cloud sparsity issues via LiDAR mapping, creating accurate large-scale datasets for DNN training. In addition, we construct a data review agent to filter the data generated by the MADL module, eliminating low-quality samples. Experiments on the KITTI, KITTI-CARLA and 3D-Curb datasets show that MADL achieves impressive performance compared to manual labeling, and outperforms traditional and state-of-the-art self-supervised methods in robustness and accuracy.
[295] PANDORA: Pixel-wise Attention Dissolution and Latent Guidance for Zero-Shot Object Removal
Dinh-Khoi Vo, Van-Loc Nguyen, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Main category: cs.CV
TL;DR: PANDORA: A zero-shot object removal framework using pre-trained diffusion models without fine-tuning, prompts, or optimization, achieving precise multi-object erasure through attention manipulation.
Details
Motivation: Existing object removal methods suffer from texture inconsistency, rigid artifacts, weak foreground-background disentanglement, and poor scalability for multi-object removal, often requiring fine-tuning, prompt engineering, or inference-time optimization.
Method: Proposes Pixel-wise Attention Dissolution to remove objects by nullifying the most correlated attention keys for masked pixels, eliminating objects from self-attention flow. Also introduces Localized Attentional Disentanglement Guidance to steer denoising toward latent manifolds favorable to clean object removal.
Result: Demonstrates superior visual fidelity and semantic plausibility compared to state-of-the-art methods, enabling precise, non-rigid, prompt-free, and scalable multi-object erasure in a single pass.
Conclusion: PANDORA provides an effective zero-shot framework for object removal that operates directly on pre-trained diffusion models without additional training or optimization, addressing key limitations of existing approaches.
Abstract: Removing objects from natural images is challenging due to the difficulty of synthesizing semantically coherent content while preserving background integrity. Existing methods often rely on fine-tuning, prompt engineering, or inference-time optimization, yet still suffer from texture inconsistency, rigid artifacts, weak foreground-background disentanglement, and poor scalability for multi-object removal. We propose a novel zero-shot object removal framework, namely PANDORA, that operates directly on pre-trained text-to-image diffusion models, requiring no fine-tuning, prompts, or optimization. We propose Pixel-wise Attention Dissolution to remove objects by nullifying the most correlated attention keys for masked pixels, effectively eliminating the object from self-attention flow and allowing background context to dominate reconstruction. We further introduce Localized Attentional Disentanglement Guidance to steer denoising toward latent manifolds favorable to clean object removal. Together, these components enable precise, non-rigid, prompt-free, and scalable multi-object erasure in a single pass. Experiments demonstrate superior visual fidelity and semantic plausibility compared to state-of-the-art methods. The project page is available at https://vdkhoi20.github.io/PANDORA.
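The attention-dissolution idea, dropping each masked pixel's most-correlated keys before the softmax so background keys dominate, can be sketched for a single self-attention map. This is a simplified numpy illustration of the mechanism, not the paper's exact implementation (which operates inside a diffusion model's self-attention layers across timesteps):

```python
import numpy as np

def dissolve_attention(q, k, object_mask, topk=1):
    """Zero out each masked pixel's most-correlated attention keys.

    q, k: (N, d) query/key tokens for N pixels.
    object_mask: (N,) bool, True for pixels inside the object to remove.
    After dropping the top-k correlated keys, the renormalized softmax
    lets background context dominate the masked pixels' reconstruction.
    """
    logits = q @ k.T / np.sqrt(q.shape[1])
    for i in np.where(object_mask)[0]:
        drop = np.argsort(logits[i])[-topk:]  # most correlated keys
        logits[i, drop] = -np.inf
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```

Because only attention weights are edited, no weights are trained and no prompt is needed, which is what makes the method zero-shot.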
[296] Towards Domain-Generalized Open-Vocabulary Object Detection: A Progressive Domain-invariant Cross-modal Alignment Method
Xiaoran Xu, Xiaoshan Yang, Jiangang Yang, Yifan Xu, Jian Liu, Changsheng Xu
Main category: cs.CV
TL;DR: The paper identifies a vulnerability in Open-Vocabulary Object Detection (OVOD) to domain shifts, formalizes Domain-Generalized OVOD (DG-OVOD), and proposes Progressive Domain-invariant Cross-modal Alignment (PICA) with adaptive pseudo-word prototypes to maintain cross-modal alignment under distribution shifts.
Details
Motivation: Current OVOD methods assume domain stationarity, but real-world applications face distribution shifts. The paper reveals that visual domain shifts cause collapse of the latent cross-modal space where novel category visual signals detach from their semantic anchors, fundamentally undermining OVOD's generalization capabilities.
Method: Proposes PICA (Progressive Domain-invariant Cross-modal Alignment) with multi-level ambiguity and signal strength curriculum. It builds adaptive pseudo-word prototypes refined via sample confidence and visual consistency to enforce invariant cross-domain modality alignment, departing from uniform training approaches.
Result: The work demonstrates that OVOD’s robustness to domain shifts is intrinsically linked to the stability of latent cross-modal alignment space. It provides both a challenging benchmark for DG-OVOD and shows that PICA improves generalization beyond static laboratory conditions.
Conclusion: The paper provides a new perspective on building truly generalizable open-vocabulary systems that extend beyond static conditions, highlighting the fundamental vulnerability of cross-modal alignment to domain shifts and offering a principled solution.
Abstract: Open-Vocabulary Object Detection (OVOD) has achieved remarkable success in generalizing to novel categories. However, this success often rests on the implicit assumption of domain stationarity. In this work, we provide a principled revisit of the OVOD paradigm, uncovering a fundamental vulnerability: the fragile coupling between visual manifolds and textual embeddings when distribution shifts occur. We first systematically formalize Domain-Generalized Open-Vocabulary Object Detection (DG-OVOD). Through empirical analysis, we demonstrate that visual shifts do not merely add noise; they cause a collapse of the latent cross-modal space where novel category visual signals detach from their semantic anchors. Motivated by these insights, we propose Progressive Domain-invariant Cross-modal Alignment (PICA). PICA departs from uniform training by introducing a multi-level ambiguity and signal strength curriculum. It builds adaptive pseudo-word prototypes, refined via sample confidence and visual consistency, to enforce invariant cross-domain modality alignment. Our findings suggest that OVOD’s robustness to domain shifts is intrinsically linked to the stability of the latent cross-modal alignment space. Our work provides both a challenging benchmark and a new perspective on building truly generalizable open-vocabulary systems that extend beyond static laboratory conditions.
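The summary describes the pseudo-word prototypes as being refined via sample confidence and visual consistency. One common realization of such refinement is a confidence-weighted momentum update of each category prototype; the sketch below assumes that form, which may differ from PICA's actual update rule:

```python
import numpy as np

def refine_prototype(proto, feats, confidence, momentum=0.9):
    """Confidence-weighted momentum update of a pseudo-word prototype.

    proto:      (d,) current prototype for one category.
    feats:      (N, d) region features assigned to that category.
    confidence: (N,) weights combining sample confidence and visual
                consistency (higher = more trusted).
    """
    w = confidence / (confidence.sum() + 1e-8)
    target = (w[:, None] * feats).sum(axis=0)
    return momentum * proto + (1.0 - momentum) * target
```

Down-weighting low-confidence, visually inconsistent samples keeps the prototype anchored to domain-invariant evidence, which is the stated goal of the alignment.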
[297] Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models
Baoheng Zhang, Jiahui Liu, Gui Zhao, Weizhou Zhang, Yixuan Ma, Jun Jiang, Yingxian Chen, Wilton W. T. Fok, Xiaojuan Qi, Hayden Kwok-Hay So
Main category: cs.CV
TL;DR: Event-MLLM enhances multimodal LLMs for extreme illumination conditions by fusing event streams with RGB frames using an illumination indicator and correction loss, achieving state-of-the-art performance in challenging lighting scenarios.
Details
Motivation: Current MLLMs fail in extreme illumination conditions where RGB inputs lose structure and semantics. There's a need for robust multimodal perception that works across all lighting conditions, including very dark or very bright scenarios.
Method: Proposes Event-MLLM with two key components: 1) Illumination Indicator - a learnable signal from a DINOv2 branch representing exposure degradation that adaptively modulates event-RGB fusion, and 2) Illumination Correction Loss that aligns fused features with normal-light semantics in latent space. Also creates the first multi-illumination event-instruction corpus for MLLMs.
Result: Event-MLLM markedly outperforms general-purpose, illumination-adaptive, and event-only baselines, setting new state-of-the-art in robust multimodal perception and reasoning under challenging illumination across 17 brightness rates (0.05x-20x).
Conclusion: Event-enhanced multimodal LLMs can achieve robust visual reasoning across extreme illumination conditions by dynamically fusing event streams with RGB frames and compensating for information loss through illumination-aware mechanisms.
Abstract: Multimodal Large Language Models (MLLMs) perform strong vision-language reasoning under standard conditions but fail in extreme illumination, where RGB inputs irrevocably lose structure and semantics. We propose Event-MLLM, an event-enhanced model that performs all-light visual reasoning by dynamically fusing event streams with RGB frames. Two key components drive our approach: an Illumination Indicator - a learnable signal derived from a DINOv2 branch that represents exposure degradation and adaptively modulates event-RGB fusion - and an Illumination Correction Loss that aligns fused features with non-degraded (normal-light) semantics in the latent space, compensating for information lost in extreme lighting. We curate the first multi-illumination event-instruction corpus for MLLMs, with 2,241 event-RGB samples (around 6 QA pairs each) across diverse scenes and 17 brightness rates (0.05x - 20x), plus an instruction-following benchmark for reasoning, counting, and fine-grained recognition under extreme lighting. Experiments show that Event-MLLM markedly outperforms general-purpose, illumination-adaptive, and event-only baselines, setting a new state of the art in robust multimodal perception and reasoning under challenging illumination.
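The role of the Illumination Indicator can be pictured as a gate on the event-RGB fusion. The linear gate below is a deliberately simplified assumption (the paper's indicator is a learned signal inside the model), but it shows how an exposure-degradation score would shift weight from RGB to event features:

```python
import numpy as np

def illumination_gated_fusion(rgb_feat, event_feat, exposure_score):
    """Fuse RGB and event features with an illumination-dependent gate.

    exposure_score in [0, 1]: 0 = well exposed (trust RGB), 1 = fully
    degraded (trust events).  The linear convex combination is an
    illustrative assumption, not Event-MLLM's learned fusion.
    """
    g = np.clip(exposure_score, 0.0, 1.0)
    return (1.0 - g) * rgb_feat + g * event_feat
```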
[298] Structured Observation Language for Efficient and Generalizable Vision-Language Navigation
Daojie Peng, Fulong Ma, Jun Ma
Main category: cs.CV
TL;DR: SOL-Nav converts visual observations into structured language descriptions for vision-language navigation, enabling pure language input to pre-trained language models for efficient and generalizable navigation.
Details
Motivation: Existing VLN methods require large-scale visual pre-training and suffer from poor generalization under environmental variations. The authors aim to create a more efficient and generalizable approach by leveraging structured language representations.
Method: Divides RGB-D images into an N×N grid, extracts semantic, color, and depth information for each cell to form structured text descriptions, then concatenates these with the language instruction as pure language input to a pre-trained language model.
Result: SOL-Nav significantly reduces model size and training data dependency, fully leverages PLM capabilities, achieves strong generalization to unseen environments on R2R and RxR benchmarks, and demonstrates real-world deployment success.
Conclusion: Converting visual observations to structured language enables efficient VLN by leveraging pre-trained language models’ reasoning capabilities, offering a promising direction for generalizable embodied AI.
Abstract: Vision-Language Navigation (VLN) requires an embodied agent to navigate complex environments by following natural language instructions, which typically demands tight fusion of visual and language modalities. Existing VLN methods often convert raw images into visual tokens or implicit features, requiring large-scale visual pre-training and suffering from poor generalization under environmental variations (e.g., lighting, texture). To address these issues, we propose SOL-Nav (Structured Observation Language for Navigation), a novel framework that translates egocentric visual observations into compact structured language descriptions for efficient and generalizable navigation. Specifically, we divide RGB-D images into an N×N grid, extract representative semantic, color, and depth information for each grid cell to form structured text, and concatenate this with the language instruction as pure language input to a pre-trained language model (PLM). Experimental results on standard VLN benchmarks (R2R, RxR) and real-world deployments demonstrate that SOL-Nav significantly reduces the model size and training data dependency, fully leverages the reasoning and representation capabilities of PLMs, and achieves strong generalization to unseen environments.
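The grid-to-text step described above can be sketched concretely. The label names, majority-vote rule, and output phrasing below are illustrative assumptions, not SOL-Nav's actual template; the sketch only shows how a per-cell semantic/depth summary becomes a pure language observation:

```python
import numpy as np

def observation_to_text(semantic, depth, n=3, labels=("floor", "wall", "door")):
    """Convert an RGB-D observation into a structured language description.

    Splits the view into an n x n grid and describes each cell by its
    dominant semantic class (majority vote) and mean depth.  A toy sketch
    of SOL-Nav's structured observation language.
    """
    h, w = semantic.shape
    lines = []
    for i in range(n):
        for j in range(n):
            cell_sem = semantic[i*h//n:(i+1)*h//n, j*w//n:(j+1)*w//n]
            cell_dep = depth[i*h//n:(i+1)*h//n, j*w//n:(j+1)*w//n]
            dominant = labels[np.bincount(cell_sem.ravel()).argmax()]
            lines.append(f"cell({i},{j}): {dominant}, {cell_dep.mean():.1f}m")
    return "; ".join(lines)
```

The resulting string is concatenated with the navigation instruction and fed to the PLM as plain text.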
[299] A Robust Low-Rank Prior Model for Structured Cartoon-Texture Image Decomposition with Heavy-Tailed Noise
Weihao Tang, Hongjin He
Main category: cs.CV
TL;DR: Proposes a robust low-rank prior model for cartoon-texture image decomposition using Huber loss for heavy-tailed noise, with TV and nuclear norms for cartoon/texture components.
Details
Motivation: Cartoon-texture decomposition is fundamental but challenging, especially with noisy images. Heavy-tailed noise severely impedes robust decomposition results, requiring more robust models than traditional approaches.
Method: Uses the Huber loss function as the data-fidelity term instead of the traditional ℓ₂-norm, with the total variation norm for the cartoon component and the nuclear norm for the texture component. Employs two operator splitting algorithms tailored to different degradation operators.
Result: Extensive numerical experiments show superior performance on image restoration tasks under high-intensity heavy-tailed noise compared to conventional methods.
Conclusion: The proposed robust low-rank prior model with Huber loss effectively handles cartoon-texture decomposition in presence of heavy-tailed noise, outperforming traditional approaches.
Abstract: Cartoon-texture image decomposition is a fundamental yet challenging problem in image processing. A significant hurdle in achieving accurate decomposition is the pervasive presence of noise in the observed images, which severely impedes robust results. To address the challenging problem of cartoon-texture decomposition in the presence of heavy-tailed noise, we in this paper propose a robust low-rank prior model. Our approach departs from conventional models by adopting the Huber loss function as the data-fidelity term, rather than the traditional $\ell_2$-norm, while retaining the total variation norm and nuclear norm to characterize the cartoon and texture components, respectively. Given the inherent structure, we employ two implementable operator splitting algorithms, tailored to different degradation operators. Extensive numerical experiments, particularly on image restoration tasks under high-intensity heavy-tailed noise, demonstrate the superior performance of our model.
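The Huber data-fidelity term is standard and worth stating explicitly: it is quadratic for small residuals (like the ℓ₂-norm) and linear for large ones, which is what caps the influence of heavy-tailed outliers:

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss: 0.5*r^2 for |r| <= delta, delta*(|r| - 0.5*delta) otherwise.

    Quadratic near zero (efficient for Gaussian-like noise), linear in the
    tails (robust to heavy-tailed outliers, unlike the l2-norm).
    """
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))
```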
[300] STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding
Junho Kim, Hosu Lee, James M. Rehg, Minsu Kim, Yong Man Ro
Main category: cs.CV
TL;DR: STRIDE is a method for proactive activation in streaming video that uses structured temporal refinement with iterative denoising to improve when-to-speak decisions in online video scenarios.
Details
Motivation: Real-world deployments require streaming perception and proactive interaction where video frames arrive online, and systems must decide not only what to respond but also when to respond. Current Video-LLMs focus on offline reasoning but lack capabilities for streaming scenarios.
Method: Models proactive activation as a structured sequence modeling problem using temporal span-structured activation patterns. Employs STRIDE (Structured Temporal Refinement with Iterative DEnoising) with a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across sliding temporal windows.
Result: Extensive experiments on diverse streaming benchmarks and downstream models demonstrate that STRIDE shows more reliable and temporally coherent proactive responses, significantly improving when-to-speak decision quality in online streaming scenarios.
Conclusion: STRIDE effectively addresses the streaming video proactive activation problem by capturing span-level temporal structure through iterative refinement, enabling better when-to-speak decisions in real-time video applications.
Abstract: Recent progress in video large language models (Video-LLMs) has enabled strong offline reasoning over long and complex videos. However, real-world deployments increasingly require streaming perception and proactive interaction, where video frames arrive online and the system must decide not only what to respond, but also when to respond. In this work, we revisit proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span-structured activation patterns. To capture this span-level structure, we model activation signals jointly over a sliding temporal window and update them iteratively as new frames arrive. We propose STRIDE (Structured Temporal Refinement with Iterative DEnoising), which employs a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across the window. Extensive experiments on diverse streaming benchmarks and downstream models demonstrate that STRIDE shows more reliable and temporally coherent proactive responses, significantly improving when-to-speak decision quality in online streaming scenarios.
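The sliding-window refinement intuition can be illustrated with a toy denoiser. STRIDE's actual module is a learned masked diffusion model; the neighbour-averaging rule below is purely an illustrative assumption showing how re-estimating the least confident per-frame activations encourages span-structured, temporally coherent decisions:

```python
import numpy as np

def refine_window(probs, n_steps=3, frac=0.5):
    """Iteratively refine per-frame activation probabilities in a window.

    At each step, the least confident entries (closest to 0.5) are
    re-estimated from their temporal neighbours.  A toy analogue of
    STRIDE's iterative masked denoising, not the paper's learned module.
    """
    p = probs.astype(float).copy()
    for _ in range(n_steps):
        conf = np.abs(p - 0.5)
        k = max(1, int(frac * len(p)))
        mask = np.argsort(conf)[:k]                   # least confident frames
        padded = np.pad(p, 1, mode="edge")
        smoothed = (padded[:-2] + padded[2:]) / 2.0   # neighbour average
        p[mask] = smoothed[mask]
    return p
```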
[301] You Only Erase Once: Erasing Anything without Bringing Unexpected Content
Yixing Zhu, Qing Zhang, Wenju Xu, Wei-Shi Zheng
Main category: cs.CV
TL;DR: YOEO is a diffusion-based object erasure method that produces high-quality results without unwanted artifacts by training on unpaired real-world images using a sundries detector and context coherence loss.
Details
Motivation: Current diffusion-based object erasure methods struggle with generating unexpected content in masked regions due to lack of paired training data and explicit constraints on content generation.
Method: Trains an object erasure diffusion model on unpaired real-world images using a sundries detector and context coherence loss built on an entity segmentation model, with diffusion distillation for efficient few-step training and inference.
Result: Extensive experiments show YOEO outperforms state-of-the-art object erasure methods, producing high-quality results free of unwanted objects or artifacts while preserving context coherence.
Conclusion: YOEO enables effective object erasure without paired training data by leveraging entity segmentation supervision and context coherence constraints, with efficient training via diffusion distillation.
Abstract: We present YOEO, an approach for object erasure. Unlike recent diffusion-based methods which struggle to erase target objects without generating unexpected content within the masked regions due to a lack of sufficient paired training data and explicit constraints on content generation, our method is able to produce high-quality object erasure results free of unwanted objects or artifacts while faithfully preserving context coherence with the surrounding content. We achieve this goal by training an object erasure diffusion model on unpaired data containing only large-scale real-world images, under the supervision of a sundries detector and a context coherence loss that are built upon an entity segmentation model. To enable more efficient training and inference, a diffusion distillation strategy is employed to obtain a few-step erasure diffusion model. Extensive experiments show that our method outperforms the state-of-the-art object erasure methods. Code will be available at https://zyxunh.github.io/YOEO-ProjectPage/.
[302] Clore: Interactive Pathology Image Segmentation with Click-based Local Refinement
Tiantong Wang, Minfan Zhao, Jun Shi, Hannan Wang, Yue Dai
Main category: cs.CV
TL;DR: Clore introduces a hierarchical click-based local refinement pipeline for interactive pathology image segmentation that uses initial clicks for global segmentation and subsequent clicks for local refinement, achieving better accuracy with fewer interactions.
Details
Motivation: Existing interactive segmentation methods rely on iterative global updates that cause redundant re-prediction and fail to capture fine-grained structures or correct subtle errors during localized adjustments.
Method: Proposes the Click-based Local Refinement (Clore) pipeline with hierarchical interaction: initial clicks drive global segmentation to outline large regions, while subsequent clicks progressively refine local details for precise boundaries.
Result: Experimental results on four datasets show Clore achieves the best balance between segmentation accuracy and interaction cost, outperforming existing methods.
Conclusion: Clore provides an effective solution for efficient and accurate interactive pathology image segmentation through its hierarchical local refinement approach.
Abstract: Recent advancements in deep learning-based interactive segmentation methods have significantly improved pathology image segmentation. Most existing approaches utilize user-provided positive and negative clicks to guide the segmentation process. However, these methods primarily rely on iterative global updates for refinement, which lead to redundant re-prediction and often fail to capture fine-grained structures or correct subtle errors during localized adjustments. To address this limitation, we propose the Click-based Local Refinement (Clore) pipeline, a simple yet efficient method designed to enhance interactive segmentation. The key innovation of Clore lies in its hierarchical interaction paradigm: the initial clicks drive global segmentation to rapidly outline large target regions, while subsequent clicks progressively refine local details to achieve precise boundaries. This approach not only improves the ability to handle fine-grained segmentation tasks but also achieves high-quality results with fewer interactions. Experimental results on four datasets demonstrate that Clore achieves the best balance between segmentation accuracy and interaction cost, making it an effective solution for efficient and accurate interactive pathology image segmentation.
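The hierarchical idea, edit locally instead of re-predicting globally, can be sketched in a few lines. The fixed square window below is an illustrative stand-in for Clore's learned local refinement; the point is only that a correction click touches a neighbourhood, not the whole image:

```python
import numpy as np

def local_refine(mask, click, new_label, radius=2):
    """Apply a correction click only within a local window around it.

    After an initial global mask exists, each subsequent click edits a
    small neighbourhood rather than triggering a full re-prediction.
    Toy sketch; Clore's actual refinement is learned, not a hard window.
    """
    y, x = click
    out = mask.copy()
    out[max(0, y - radius):y + radius + 1,
        max(0, x - radius):x + radius + 1] = new_label
    return out
```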
[303] OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation
Sanghyeon Lee, Minwoo Lee, Euijin Shin, Kangyeol Kim, Seunghwan Choi, Jaegul Choo
Main category: cs.CV
TL;DR: Parameter-efficient adaptation method for panel-aware in-context image generation using pre-trained diffusion transformers via learnable panel-specific orthogonal operators on frozen positional encodings.
Details
Motivation: To enable effective panel-aware in-context image generation while maintaining parameter efficiency and preserving pre-trained model capabilities. The goal is to adapt diffusion transformers for multi-panel image generation without extensive retraining.
Method: Composes learnable, panel-specific orthogonal operators onto the backbone’s frozen positional encodings. This design ensures isometry (preserves geometry of internal features) and same-panel invariance (maintains pre-trained intra-panel synthesis behavior). The method works across diverse positional encoding regimes.
Result: The adaptation method effectively enables panel-relative conditioning and consistently improves in-context image-based instructional editing pipelines, including state-of-the-art approaches. It generalizes across different positional encoding designs.
Conclusion: The proposed parameter-efficient adaptation method successfully enables panel-aware in-context image generation while preserving pre-trained model properties, offering a flexible solution for adapting diffusion transformers to multi-panel image tasks.
Abstract: We introduce a parameter-efficient adaptation method for panel-aware in-context image generation with pre-trained diffusion transformers. The key idea is to compose learnable, panel-specific orthogonal operators onto the backbone’s frozen positional encodings. This design provides two desirable properties: (1) isometry, which preserves the geometry of internal features, and (2) same-panel invariance, which maintains the model’s pre-trained intra-panel synthesis behavior. Through controlled experiments, we demonstrate that the effectiveness of our adaptation method is not tied to a specific positional encoding design but generalizes across diverse positional encoding regimes. By enabling effective panel-relative conditioning, the proposed method consistently improves in-context image-based instructional editing pipelines, including state-of-the-art approaches.
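The two properties, isometry and same-panel invariance, follow directly from using orthogonal operators with the identity reserved for the home panel, and can be checked numerically. In the sketch below the orthogonal matrix comes from a QR decomposition of a random matrix (an assumption for illustration; in OPRO these operators are learnable):

```python
import numpy as np

rng = np.random.default_rng(0)

def panel_operator(dim, panel, rng):
    """Return an orthogonal operator for a given panel.

    Panel 0 (the 'same panel') keeps the identity so pre-trained
    intra-panel behaviour is untouched; other panels get an orthogonal
    matrix.  Random QR here stands in for OPRO's learnable operators.
    """
    if panel == 0:
        return np.eye(dim)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

pe = rng.standard_normal((5, 8))   # frozen positional encodings
q1 = panel_operator(8, 1, rng)
# isometry: orthogonal maps preserve the lengths of the encodings
assert np.allclose(np.linalg.norm(pe @ q1, axis=1), np.linalg.norm(pe, axis=1))
# same-panel invariance: panel 0 leaves the encodings unchanged
assert np.allclose(pe @ panel_operator(8, 0, rng), pe)
```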
[304] OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery
Qi Guo, Jue Wang, Yinhe Liu, Yanfei Zhong
Main category: cs.CV
TL;DR: OpenDPR is a training-free vision-centric framework for open-vocabulary change detection that uses diffusion models for category identification and spatial-to-change adaptation for localization.
Details
Motivation: Open-vocabulary change detection needs to recognize arbitrary changes beyond fixed classes, but current methods face bottlenecks in category identification (due to VLMs' limited fine-grained representation) and change localization (due to VFMs' lack of change priors).
Method: Two-stage pipeline: 1) Generate class-agnostic change proposals using SAM/DINOv2, then 2) Use OpenDPR’s diffusion-guided prototype retrieval for category identification (offline prototype construction with diffusion models + visual similarity retrieval). An optional S2C module adds weakly supervised change localization adaptation.
Result: State-of-the-art performance on four benchmark datasets under both supervised and weakly supervised modes, demonstrating effectiveness for open-vocabulary change detection.
Conclusion: OpenDPR effectively addresses bottlenecks in open-vocabulary change detection through diffusion-guided prototype retrieval and spatial-to-change adaptation, achieving strong performance with minimal supervision.
Abstract: Open-vocabulary change detection (OVCD) seeks to recognize arbitrary changes of interest by enabling generalization beyond a fixed set of predefined classes. We reformulate OVCD as a two-stage pipeline: first generate class-agnostic change proposals using visual foundation models (VFMs) such as SAM and DINOv2, and then perform category identification with vision-language models (VLMs) such as CLIP. We reveal that category identification errors are the primary bottleneck of OVCD, mainly due to the limited ability of VLMs based on image-text matching to represent fine-grained land-cover categories. To address this, we propose OpenDPR, a training-free vision-centric diffusion-guided prototype retrieval framework. OpenDPR leverages diffusion models to construct diverse prototypes for target categories offline, and to perform similarity retrieval with change proposals in the visual space during inference. The secondary bottleneck lies in change localization, due to the inherent lack of change priors in VFMs. To bridge this gap, we design a spatial-to-change weakly supervised change detection module named S2C to adapt their strong spatial modeling capabilities for change localization. Integrating the pretrained S2C into OpenDPR leads to an optional weakly supervised variant named OpenDPR-W, which further improves OVCD with minimal supervision. Experimental results on four benchmark datasets demonstrate that the proposed methods achieve state-of-the-art performance under both supervision modes. Code is available at https://github.com/guoqi2002/OpenDPR.
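The retrieval step, matching a change proposal against diffusion-generated visual prototypes rather than text embeddings, reduces to a cosine-similarity lookup. The max-over-prototypes scoring rule below is an illustrative assumption:

```python
import numpy as np

def retrieve_category(proposal_feat, prototypes, names):
    """Assign a change proposal to its most similar category in visual space.

    `prototypes` maps each category name to a (k, d) array of prototype
    features (diffusion-generated in OpenDPR, built offline).  Scoring by
    the best-matching prototype is a sketch, not the paper's exact rule.
    """
    f = proposal_feat / np.linalg.norm(proposal_feat)
    scores = []
    for name in names:
        p = prototypes[name]
        p = p / np.linalg.norm(p, axis=1, keepdims=True)
        scores.append((p @ f).max())      # cosine sim to best prototype
    return names[int(np.argmax(scores))]
```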
[305] RAP: Retrieve, Adapt, and Prompt-Fit for Training-Free Few-Shot Medical Image Segmentation
Zhihao Mao, Bangpu Chen
Main category: cs.CV
TL;DR: RAP is a training-free framework for few-shot medical image segmentation that retrieves morphologically compatible supports, adapts them via boundary-aware structural cues, and prompts SAM2 for refinement without fine-tuning.
Details
Motivation: Existing few-shot medical image segmentation methods rely heavily on semantic correspondences from scarce annotations while under-utilizing the repeatable high-frequency morphology (boundary geometry and spatial layout) that anatomical targets exhibit across patients and acquisitions.
Method: 1) Retrieves morphologically compatible supports from an archive using DINOv3 features to reduce brittleness in single-support choice; 2) Adapts retrieved support mask to query by fitting boundary-aware structural cues for anatomy-consistent pre-mask; 3) Converts pre-mask into prompts via Voronoi partitioning (positive points) and sector-based sampling (negative points), feeding them into SAM2 for final refinement without fine-tuning.
Result: Extensive experiments on multiple medical segmentation benchmarks show RAP consistently surpasses prior FSMIS baselines and achieves state-of-the-art performance.
Conclusion: RAP demonstrates that explicit structural fitting combined with retrieval-augmented prompting offers a simple and effective route to robust training-free few-shot medical segmentation.
Abstract: Few-shot medical image segmentation (FSMIS) has achieved notable progress, yet most existing methods mainly rely on semantic correspondences from scarce annotations while under-utilizing a key property of medical imagery: anatomical targets exhibit repeatable high-frequency morphology (e.g., boundary geometry and spatial layout) across patients and acquisitions. We propose RAP, a training-free framework that retrieves, adapts, and prompts Segment Anything Model 2 (SAM2) for FSMIS. First, RAP retrieves morphologically compatible supports from an archive using DINOv3 features to reduce brittleness in single-support choice. Second, it adapts the retrieved support mask to the query by fitting boundary-aware structural cues, yielding an anatomy-consistent pre-mask under domain shifts. Third, RAP converts the pre-mask into prompts by sampling positive points via Voronoi partitioning and negative points via sector-based sampling, and feeds them into SAM2 for final refinement without any fine-tuning. Extensive experiments on multiple medical segmentation benchmarks show that RAP consistently surpasses prior FSMIS baselines and achieves state-of-the-art performance. Overall, RAP demonstrates that explicit structural fitting combined with retrieval-augmented prompting offers a simple and effective route to robust training-free few-shot medical segmentation.
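The sector-based negative sampling step can be sketched as follows: background pixels are binned by angle around the pre-mask centroid and one negative prompt point is taken per sector. The farthest-pixel rule is an illustrative assumption; RAP's exact sampling may differ:

```python
import numpy as np

def sector_negatives(mask, n_sectors=4):
    """Sample one negative prompt point per angular sector outside the mask.

    Toy sketch of RAP's sector-based negative sampling: background pixels
    are binned by angle around the mask centroid, and the farthest pixel
    in each sector becomes a negative point for SAM2 (assumed rule).
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    by, bx = np.nonzero(mask == 0)              # background pixels
    ang = np.arctan2(by - cy, bx - cx)
    dist = np.hypot(by - cy, bx - cx)
    points = []
    edges = np.linspace(-np.pi, np.pi, n_sectors + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (ang >= lo) & (ang < hi)
        if sel.any():
            i = np.argmax(dist * sel)           # farthest pixel in sector
            points.append((int(by[i]), int(bx[i])))
    return points
```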
[306] V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models
Xinying Lin, Xuyang Liu, Yiyu Wang, Teng Ma, Wenqi Ren
Main category: cs.CV
TL;DR: V-CAST is a training-free token pruning method for VideoLLMs that uses curvature-guided temporal allocation and dual-anchor spatial selection to reduce redundant visual tokens while maintaining spatio-temporal alignment and performance.
Details
Motivation: VideoLLMs suffer from massive redundant visual tokens during long-context inference, causing computational inefficiency. Existing token compression methods have insufficient spatio-temporal coverage and misalignment issues under MRoPE-style positional bindings.
Method: V-CAST frames token compression as trajectory approximation with: 1) curvature-guided temporal allocation that routes token budgets to semantic turns and event boundaries, and 2) dual-anchor spatial selection that preserves high-entropy visual evidence while maintaining original positional coordinates.
Result: Achieves 98.6% of original performance, outperforms second-best method by +1.1% on average, reduces peak memory to 86.7% and total latency to 86.4% of vanilla Qwen3-VL-8B-Instruct across multiple VideoLLM architectures and scales.
Conclusion: V-CAST provides an effective training-free pruning solution for VideoLLMs that maintains performance while significantly reducing computational overhead for long-context video inference.
Abstract: Video large language models (VideoLLMs) show strong capability in video understanding, yet long-context inference is still dominated by massive redundant visual tokens in the prefill stage. We revisit token compression for VideoLLMs under a tight budget and identify a key bottleneck, namely insufficient spatio-temporal information coverage. Existing methods often introduce discontinuous coverage through coarse per-frame allocation or scene segmentation, and token merging can further misalign spatio-temporal coordinates under MRoPE-style discrete (t,h,w) bindings. To address these issues, we propose V-CAST (Video Curvature-Aware Spatio-Temporal Pruning), a training-free, plug-and-play pruning policy for long-context video inference. V-CAST casts token compression as a trajectory approximation problem and introduces a curvature-guided temporal allocation module that routes per-frame token budgets to semantic turns and event boundaries. It further adopts a dual-anchor spatial selection mechanism that preserves high-entropy visual evidence without attention intervention, while keeping retained tokens at their original coordinates to maintain positional alignment. Extensive experiments across multiple VideoLLMs of different architectures and scales demonstrate that V-CAST achieves 98.6% of the original performance, outperforms the second-best method by +1.1% on average, and reduces peak memory and total latency to 86.7% and 86.4% of vanilla Qwen3-VL-8B-Instruct.
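The curvature-guided allocation can be sketched by treating per-frame features as a trajectory and spending tokens where it turns. Approximating curvature by the norm of the discrete second difference is an illustrative choice, not necessarily V-CAST's exact estimator:

```python
import numpy as np

def curvature_budget(frames, total_budget):
    """Allocate per-frame token budgets in proportion to trajectory curvature.

    Frames where the feature trajectory turns sharply (semantic turns,
    event boundaries) get more tokens.  Curvature ~ norm of the discrete
    second difference (assumed approximation for this sketch).
    """
    second_diff = frames[2:] - 2 * frames[1:-1] + frames[:-2]
    curv = np.zeros(len(frames))
    curv[1:-1] = np.linalg.norm(second_diff, axis=1)
    weights = (curv + 1e-6) / (curv + 1e-6).sum()
    budget = np.floor(weights * total_budget).astype(int)
    budget[np.argmax(weights)] += total_budget - budget.sum()  # fix rounding
    return budget
```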
[307] Amped: Adaptive Multi-stage Non-edge Pruning for Edge Detection
Yuhan Gao, Xinqing Li, Xin He, Bing Li, Xinzhong Zhu, Ming-Ming Cheng, Yun Liu
Main category: cs.CV
TL;DR: Amped: Adaptive multi-stage pruning framework for transformer-based edge detection that removes non-edge tokens early to reduce computation while maintaining accuracy, plus a simple Streamline Edge Detector (SED) model.
Details
Motivation: Transformer-based edge detectors achieve high quality but suffer from computational overhead, especially at higher resolutions needed for pixel-level accuracy. There's a need to balance accuracy and efficiency for practical deployment.
Method: Proposes Amped: an adaptive multi-stage pruning framework that identifies high-confidence non-edge tokens and removes them early in processing. Also introduces SED: a simple yet high-performance Transformer-based edge detection model designed for reduced structural complexity.
Result: Amped reduces GFLOPs by up to 40% with only 0.4% drop in ODS F-measure. SED achieves state-of-the-art ODS F-measure of 86.5% despite its simplicity.
Conclusion: The proposed pruning strategy provides favorable accuracy-efficiency balance for transformer-based edge detectors, and the streamlined model enables practical deployment while maintaining high performance.
Abstract: Edge detection is a fundamental image analysis task that underpins numerous high-level vision applications. Recent advances in Transformer architectures have significantly improved edge quality by capturing long-range dependencies, but this often comes with computational overhead. Achieving higher pixel-level accuracy requires increased input resolution, further escalating computational cost and limiting practical deployment. Building on the strong representational capacity of recent Transformer-based edge detectors, we propose an Adaptive Multi-stage non-edge Pruning framework for Edge Detection (Amped). Amped identifies high-confidence non-edge tokens and removes them as early as possible to substantially reduce computation, thus retaining high accuracy while cutting GFLOPs and accelerating inference with minimal performance loss. Moreover, to mitigate the structural complexity of existing edge detection networks and facilitate their integration into real-world systems, we introduce a simple yet high-performance Transformer-based model, termed Streamline Edge Detector (SED). Applied to both existing detectors and our SED, the proposed pruning strategy provides a favorable balance between accuracy and efficiency, reducing GFLOPs by up to 40% with only a 0.4% drop in ODS F-measure. In addition, despite its simplicity, SED achieves a state-of-the-art ODS F-measure of 86.5%. The code will be released.
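The core pruning operation is simple to state: drop tokens that are confidently non-edge and propagate only the survivors plus their indices so the edge map can be reassembled later. The fixed threshold below is an illustrative stand-in for Amped's adaptive multi-stage policy:

```python
import numpy as np

def prune_non_edge(tokens, edge_scores, threshold=0.1):
    """Drop tokens that are confidently non-edge; keep the rest.

    Tokens with predicted edge probability below the threshold are removed
    before later (expensive) stages.  Returns surviving tokens and their
    original indices.  A sketch of Amped's idea with a fixed threshold.
    """
    keep = edge_scores >= threshold
    return tokens[keep], np.nonzero(keep)[0]
```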
[308] A Benchmarking Methodology to Assess Open-Source Video Large Language Models in Automatic Captioning of News Videos
David Miranda Paredes, Jose M. Saavedra, Marcelo Pizarro
Main category: cs.CV
TL;DR: Comparative evaluation of 8 open-source Video LLMs for news video captioning using lexical, semantic, and novel fidelity metrics on Chilean TV and BBC News datasets.
Details
Motivation: News video captioning remains largely manual despite being prevalent content. Video LLMs offer automation potential but lack comprehensive evaluation in the news domain.
Method: Comparative study of 8 state-of-the-art open-source VidLLMs evaluated on two news datasets (Chilean TV: 1,345 clips, BBC News: 9,838 clips) using lexical metrics (METEOR, ROUGE-L), semantic metrics (BERTScore, CLIPScore, Text Similarity, Mean Reciprocal Rank), and two novel fidelity metrics: Thematic Fidelity Score (TFS) and Entity Fidelity Score (EFS).
Result: Standard metrics show limited discriminative power due to surface-form dependence, static-frame insensitivity, and function-word inflation. Gemma 3 achieves the highest overall performance across both datasets and most evaluation dimensions, with Qwen-VL as a consistent runner-up.
Conclusion: The proposed TFS and EFS metrics address gaps in standard evaluation by directly assessing thematic structure preservation and named-entity coverage, providing better assessment of news video captioning quality.
Abstract: News videos are among the most prevalent content types produced by television stations and online streaming platforms, yet generating textual descriptions to facilitate indexing and retrieval largely remains a manual process. Video Large Language Models (VidLLMs) offer significant potential to automate this task, but a comprehensive evaluation in the news domain is still lacking. This work presents a comparative study of eight state-of-the-art open-source VidLLMs for automatic news video captioning, evaluated on two complementary benchmark datasets: a Chilean TV news corpus (approximately 1,345 clips) and a BBC News corpus (9,838 clips). We employ lexical metrics (METEOR, ROUGE-L), semantic metrics (BERTScore, CLIPScore, Text Similarity, Mean Reciprocal Rank), and two novel fidelity metrics proposed in this work: the Thematic Fidelity Score (TFS) and Entity Fidelity Score (EFS). Our analysis reveals that standard metrics exhibit limited discriminative power for news video captioning due to surface-form dependence, static-frame insensitivity, and function-word inflation. TFS and EFS address these gaps by directly assessing thematic structure preservation and named-entity coverage in the generated captions. Results show that Gemma 3 achieves the highest overall performance across both datasets and most evaluation dimensions, with Qwen-VL as a consistent runner-up.
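Of the semantic metrics listed, Mean Reciprocal Rank is the simplest to state exactly: average, over queries, of one over the rank at which the correct caption is retrieved (zero if it is never retrieved):

```python
def mean_reciprocal_rank(rankings):
    """Mean Reciprocal Rank over queries.

    Each entry is the 1-based rank at which the correct caption was
    retrieved, or None if it was not retrieved at all.
    """
    rr = [1.0 / r if r is not None else 0.0 for r in rankings]
    return sum(rr) / len(rr)
```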
[309] LiDAR for Crowd Management: Applications, Benefits, and Future Directions
Abdullah Khanfor, Chaima Zaghouani, Hakim Ghazzai, Ahmad Alsharoa, Gianluca Setti
Main category: cs.CV
TL;DR: LiDAR technology for crowd management: detection, counting, tracking, and behavior classification with advantages in privacy, weather robustness, and 3D mapping.
Details
Motivation: LiDAR offers significant advantages for crowd management over other monitoring technologies, including enhanced privacy, performance in various weather conditions, and precise 3D mapping capabilities.
Method: Presents a taxonomy of four key crowd management tasks (detection, counting, tracking, behavior classification) with LiDAR applications, identifies challenges like dataset scarcity, sensor fusion needs, AI integration, and point cloud processing requirements.
Result: Provides actionable insights for developing LiDAR-based crowd management solutions tailored to public safety applications, highlighting current applications and future research directions.
Conclusion: LiDAR technology shows promise for crowd management but requires addressing challenges like dataset availability, sensor fusion, AI integration, and processing needs to realize its full potential for public safety applications.
Abstract: Light Detection and Ranging (LiDAR) technology offers significant advantages for effective crowd management. This article presents LiDAR technology and highlights its primary advantages over other monitoring technologies, including enhanced privacy, performance in various weather conditions, and precise 3D mapping. We present a general taxonomy of four key tasks in crowd management: crowd detection, counting, tracking, and behavior classification, with illustrative examples of LiDAR applications for each task. We identify challenges and open research directions, including the scarcity of dedicated datasets, sensor fusion requirements, artificial intelligence integration, and processing needs for LiDAR point clouds. This article offers actionable insights for developing crowd management solutions tailored to public safety applications.
[310] AI-Powered Facial Mask Removal Is Not Suitable For Biometric Identification
Emily A Cooper, Hany Farid
Main category: cs.CV
TL;DR: Analysis of AI-powered facial unmasking tools shows they produce faces that cannot be reliably matched to true identities, posing significant risks for misidentification in criminal investigations.
Details
Motivation: The paper was motivated by real-world incidents where AI-generated "unmasked" images from low-quality evidence led to widespread misidentification in criminal cases, particularly a high-profile case where social media users circulated an AI-enhanced image that falsely identified a federal agent.
Method: The researchers conducted a large-scale analysis evaluating commercial AI-powered facial unmasking tools, specifically assessing whether the resulting AI-generated faces can be reliably matched to true identities through systematic testing and evaluation.
Result: The analysis found that AI-generated “unmasked” faces cannot be reliably matched to true identities, demonstrating significant risks of misidentification when these tools are used in criminal investigations.
Conclusion: Commercial AI facial unmasking tools pose serious risks for misidentification and should not be trusted for reliable identity matching in criminal investigations, highlighting the need for caution and regulation in using AI for evidence enhancement.
Abstract: Recently, crowd-sourced online criminal investigations have used generative AI to enhance low-quality visual evidence. In one high-profile case, social media users circulated an “AI-unmasked” image of a federal agent involved in a fatal shooting, fueling widespread misidentification. In response to this and similar incidents, we conducted a large-scale analysis evaluating the efficacy and risks of commercial AI-powered facial unmasking, specifically assessing whether the resulting faces can be reliably matched to true identities.
[311] Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling
Minh-Tuan Tran, Xuan-May Le, Quan Hung Tran, Mehrtash Harandi, Dinh Phung, Trung Le
Main category: cs.CV
TL;DR: Composer introduces a test-time adaptive generative modeling paradigm that generates input-conditioned parameter adaptations for pretrained models, enabling per-input specialization without fine-tuning.
Details
Motivation: Current generative models (diffusion, auto-regressive networks) are static with fixed parameters, while humans adapt their internal representations to each context. The paper aims to create models that dynamically adapt to each input like human cognition.
Method: Composer generates input-conditioned parameter adaptations at inference time, which are injected into pretrained model weights. Adaptation occurs once before multi-step generation, enabling per-input specialization without retraining.
Result: Experiments show Composer substantially improves performance across diverse generative models and use cases, including lightweight/quantized models and test-time scaling, with minimal computational overhead.
Conclusion: Composer establishes a new paradigm for adaptive generative models that dynamically adapt to each input through input-aware parameter composition, moving beyond static parameterization.
Abstract: Existing generative models, such as diffusion and auto-regressive networks, are inherently static, relying on a fixed set of pretrained parameters to handle all inputs. In contrast, humans flexibly adapt their internal generative representations to each perceptual or imaginative context. Inspired by this capability, we introduce Composer, a new paradigm for adaptive generative modeling based on test-time instance-specific parameter composition. Composer generates input-conditioned parameter adaptations at inference time, which are injected into the pretrained model’s weights, enabling per-input specialization without fine-tuning or retraining. Adaptation occurs once prior to multi-step generation, yielding higher-quality, context-aware outputs with minimal computational and memory overhead. Experiments show that Composer substantially improves performance across diverse generative models and use cases, including lightweight/quantized models and test-time scaling. By leveraging input-aware parameter composition, Composer establishes a new paradigm for designing generative models that dynamically adapt to each input, moving beyond static parameterization.
[312] Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers
Yuhe Liu, Zhenxiong Tan, Yujia Hu, Songhua Liu, Xinchao Wang
Main category: cs.CV
TL;DR: A novel controllable diffusion framework for linear attention models enabling efficient on-device generation with multi-type conditional inputs and privacy preservation.
Details
Motivation: Current diffusion models for controllable visual generation require cloud deployment due to computational demands, raising privacy concerns. Linear attention architectures offer edge-device efficiency but existing frameworks like ControlNet lack flexibility for multiple condition types or converge slowly on such models.
Method: Proposes a unified gated conditioning module in a dual-path pipeline that effectively integrates both spatially aligned and non-aligned conditional inputs, tailored specifically for linear attention backbones like SANA.
Result: Achieves state-of-the-art controllable generation performance on linear-attention models, surpassing existing methods in fidelity and controllability across multiple tasks and benchmarks.
Conclusion: The framework enables secure, efficient on-device controllable generation while maintaining high performance, addressing both privacy concerns and computational limitations of current cloud-based approaches.
Abstract: Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on cloud servers due to their large computational demands, raising serious concerns about user data privacy. To enable secure and efficient on-device generation, we explore in this paper controllable diffusion models built upon linear attention architectures, which offer superior scalability and efficiency, even on edge devices. Yet, our experiments reveal that existing controllable generation frameworks, such as ControlNet and OminiControl, either lack the flexibility to support multiple heterogeneous condition types or suffer from slow convergence on such linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored for linear attention backbones like SANA. The core of our method lies in a unified gated conditioning module working in a dual-path pipeline, which effectively integrates multi-type conditional inputs, such as spatially aligned and non-aligned cues. Extensive experiments on multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation performance based on linear-attention models, surpassing existing methods in terms of fidelity and controllability.
[313] Towards Context-Aware Image Anonymization with Multi-Agent Reasoning
Robert Aufschläger, Jakob Folz, Gautam Savaliya, Manjitha D Vidanalage, Michael Heigl, Martin Schramm
Main category: cs.CV
TL;DR: CAIAMAR: A context-aware image anonymization framework using multi-agent reasoning and diffusion models to protect PII in street-level imagery while preserving image quality and enabling on-premise deployment.
Details
Motivation: Street-level imagery contains personally identifiable information (PII) that existing methods either over-process or miss. Current solutions struggle with context-dependent identifiers and often compromise data sovereignty through API-based approaches.
Method: Agentic framework with three specialized agents coordinating via round-robin speaker selection in a Plan-Do-Check-Act cycle. Uses pre-defined processing for high-confidence cases and multi-agent reasoning for indirect identifiers. Implements spatially-filtered coarse-to-fine detection with scout-and-zoom strategy, open-vocabulary segmentation, and IoU-based deduplication. Applies modal-specific diffusion guidance with appearance decorrelation for anonymization.
Result: Reduces person Re-ID risk by 73% on CUHK03-NP (R1: 16.9% vs 62.4% baseline). Achieves KID: 0.001 and FID: 9.1 on CityScapes, significantly outperforming existing anonymization methods. Preserves downstream semantic segmentation and detects non-direct PII instances across object categories.
Conclusion: CAIAMAR provides effective context-aware PII anonymization with superior privacy protection and image quality preservation. The on-premise, open-source framework generates audit trails for GDPR compliance while flagging failed cases for human review.
Abstract: Street-level imagery contains personally identifiable information (PII), some of which is context-dependent. Existing anonymization methods either over-process images or miss subtle identifiers, while API-based solutions compromise data sovereignty. We present an agentic framework CAIAMAR (Context-Aware Image Anonymization with Multi-Agent Reasoning) for context-aware PII segmentation with diffusion-based anonymization, combining pre-defined processing for high-confidence cases with multi-agent reasoning for indirect identifiers. Three specialized agents coordinate via round-robin speaker selection in a Plan-Do-Check-Act (PDCA) cycle, enabling large vision-language models to classify PII based on spatial context (private vs. public property) rather than rigid category rules. The agents implement spatially-filtered coarse-to-fine detection where a scout-and-zoom strategy identifies candidates, open-vocabulary segmentation processes localized crops, and IoU-based deduplication (30% threshold) prevents redundant processing. Modal-specific diffusion guidance with appearance decorrelation substantially reduces re-identification (Re-ID) risks. On CUHK03-NP, our method reduces person Re-ID risk by 73% (R1: 16.9% vs. 62.4% baseline). For image quality preservation on CityScapes, we achieve a KID of 0.001 and an FID of 9.1, significantly outperforming existing anonymization methods. The agentic workflow detects non-direct PII instances across object categories, and downstream semantic segmentation is preserved. Operating entirely on-premise with open-source models, the framework generates human-interpretable audit trails supporting the EU’s GDPR transparency requirements while flagging failed cases for human review.
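The IoU-based deduplication with a 30% threshold described above is a standard greedy suppression pattern: keep a detection only if its overlap with every already-kept detection stays below the threshold. A minimal sketch on axis-aligned boxes (the boxes and threshold default are illustrative; the authors' implementation works on segmentation crops and is not reproduced here):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def deduplicate(boxes, threshold=0.30):
    """Greedily keep boxes whose IoU with all kept boxes is below threshold."""
    kept = []
    for box in boxes:
        if all(iou(box, k) <= threshold for k in kept):
            kept.append(box)
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
# The second box overlaps the first with IoU ~= 0.68 > 0.30, so it is dropped.
print(deduplicate(boxes))
```

Processing order matters in greedy suppression, so detectors typically sort candidates by confidence first.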
[314] Customized Visual Storytelling with Unified Multimodal LLMs
Wei-Hua Li, Cheng Sun, Chu-Song Chen
Main category: cs.CV
TL;DR: VstoryGen is a multimodal framework for customizable story generation that integrates textual descriptions with character/background references and shot-type control for cinematic diversity.
Details
Motivation: Current story generation approaches mostly rely on text-only inputs or limited character identity cues, lacking broader multimodal conditioning and cinematic control for customizable storytelling.
Method: Introduces VstoryGen framework with multimodal conditioning (text, character images, background references) and shot-type control via parameter-efficient prompt tuning on movie data to reflect cinematic grammar.
Result: VstoryGen achieves improved consistency and cinematic diversity compared to existing methods, as demonstrated through new multimodal benchmarks assessing character/scene consistency, text-visual alignment, and shot-type control.
Conclusion: The proposed multimodal framework enables customizable story generation with enhanced cinematic diversity and consistency through integrated text, visual references, and shot-type control.
Abstract: Multimodal story customization aims to generate coherent story flows conditioned on textual descriptions, reference identity images, and shot types. While recent progress in story generation has shown promising results, most approaches rely on text-only inputs. A few studies incorporate character identity cues (e.g., facial ID), but lack broader multimodal conditioning. In this work, we introduce VstoryGen, a multimodal framework that integrates descriptions with character and background references to enable customizable story generation. To enhance cinematic diversity, we introduce shot-type control via parameter-efficient prompt tuning on movie data, enabling the model to generate sequences that more faithfully reflect cinematic grammar. To evaluate our framework, we establish two new benchmarks that assess multimodal story customization from the perspectives of character and scene consistency, text-visual alignment, and shot-type control. Experiments demonstrate that VstoryGen achieves improved consistency and cinematic diversity compared to existing methods.
[315] Can Unsupervised Segmentation Reduce Annotation Costs for Video Semantic Segmentation?
Samik Some, Vinay P. Namboodiri
Main category: cs.CV
TL;DR: Using SAM and SAM 2 foundation models to reduce video segmentation annotation costs by automating mask generation from unannotated frames and coarse annotations, achieving similar performance with 1/3 less manual annotation.
Details
Motivation: Video semantic segmentation requires expensive fine-grained pixel-level annotations, while unannotated video frames and coarse annotations are much cheaper. Need to reduce annotation costs for video segmentation datasets.
Method: Utilize segmentation foundation models (Segment Anything Model and SAM 2) to automate mask generation from unannotated frames and coarse annotations, reducing manual annotation effort.
Result: Can reduce annotation need by one-third while maintaining similar performance for video semantic segmentation. Found that dataset frame variety is more important than frame quantity for best performance.
Conclusion: Segmentation foundation models can effectively reduce video segmentation annotation costs, with frame diversity being more critical than frame count for optimal model performance.
Abstract: Present-day deep neural networks for video semantic segmentation require a large number of fine-grained pixel-level annotations to achieve the best possible results. Obtaining such annotations, however, is very expensive. On the other hand, raw, unannotated video frames are practically free to obtain. Similarly, coarse annotations, which do not require precise boundaries, are also much cheaper. This paper investigates approaches to reduce the annotation cost required for video segmentation datasets by utilising such resources. We show that using state-of-the-art segmentation foundation models, Segment Anything Model (SAM) and Segment Anything Model 2 (SAM 2), we can utilise both unannotated frames as well as coarse annotations to alleviate the effort required for manual annotation of video segmentation datasets by automating mask generation. Our investigation suggests that if used appropriately, we can reduce the need for annotation by a third with similar performance for video semantic segmentation. More significantly, our analysis suggests that the variety of frames in the dataset is more important than the number of frames for obtaining the best performance.
[316] Ink Detection from Surface Topography of the Herculaneum Papyri
Giorgio Angelotti, Federica Nicolardi, Paul Henderson, W. Brent Seales
Main category: cs.CV
TL;DR: Machine learning models trained on 3D optical profilometry can detect carbon ink on carbonized papyrus by analyzing surface morphology, despite minimal X-ray attenuation contrast.
Details
Motivation: Reading Herculaneum papyri is difficult because both scrolls and carbon-based ink are carbonized, providing little attenuation contrast in X-ray imaging. The morphological hypothesis suggests surface topography could reveal ink patterns.
Method: Train machine learning models on 3D optical profilometry data from mechanically opened Herculaneum papyri to separate inked and uninked areas. Quantify how lateral sampling affects learnability and test native-resolution models on coarsened inputs.
Result: High-resolution topography alone contains usable signal for ink detection. Segmentation performance diminishes with decreasing lateral resolution, revealing characteristic spatial scales needed to exploit morphological signals.
Conclusion: Morphology-based ink detection works for carbon ink on carbonized papyrus. Findings inform spatial resolution targets for reading closed scrolls via X-ray tomography using surface morphology analysis.
Abstract: Reading the Herculaneum papyri is challenging because both the scrolls and the ink, which is carbon-based, are carbonized. In X-ray radiography and tomography, ink detection typically relies on density- or composition-driven contrast, but carbon ink on carbonized papyrus provides little attenuation contrast. Building on the morphological hypothesis, we show that the surface morphology of written regions contains enough signal to distinguish ink from papyrus. To this end, we train machine learning models on three-dimensional optical profilometry from mechanically opened Herculaneum papyri to separate inked and uninked areas. We further quantify how lateral sampling governs learnability and how a native-resolution model behaves on coarsened inputs. We show that high-resolution topography alone contains a usable signal for ink detection. Diminishing segmentation performance with decreasing lateral resolution provides insight into the characteristic spatial scales that must be resolved on our dataset to exploit the morphological signal. These findings inform spatial resolution targets for morphology-based reading of closed scrolls through X-ray tomography.
[317] Synergizing Discriminative Exemplars and Self-Refined Experience for MLLM-based In-Context Learning in Medical Diagnosis
Wenkai Zhao, Zipei Wang, Mengjie Fang, Di Dong, Jie Tian, Lingwei Zhang
Main category: cs.CV
TL;DR: A novel in-context learning framework for medical MLLMs that mimics clinician workflows through discriminative exemplar selection and self-refined experience summarization, achieving performance comparable to fully supervised models without updating backbone weights.
Details
Motivation: General MLLMs underperform in medical diagnosis due to domain-specific nuances, and fine-tuning is limited by high annotation costs and computational overhead. Need parameter-efficient methods that don't require updating pre-trained backbone weights.
Method: Clinician Mimetic Workflow with two components: 1) Discriminative Exemplar Coreset Selection (DECS) - selects discriminative visual coresets from noisy data to simulate clinician’s reference to “anchor cases”; 2) Self-Refined Experience Summarization (SRES) - distills diverse rollouts into dynamic textual Experience Bank to mimic clinical cognition and reflection.
Result: Outperforms zero-shot general and medical MLLMs across all 12 datasets of MedMNIST 2D benchmark. Achieves performance comparable to fully supervised vision models and domain-specific fine-tuned MLLMs, setting new benchmark for parameter-efficient medical in-context learning.
Conclusion: The proposed framework effectively bridges the performance gap in medical diagnosis without updating MLLM backbone weights, offering a scalable, parameter-efficient alternative to fine-tuning while maintaining clinical relevance through workflow mimicry.
Abstract: General Multimodal Large Language Models (MLLMs) often underperform in capturing domain-specific nuances in medical diagnosis, trailing behind fully supervised baselines. Although fine-tuning provides a remedy, the high costs of expert annotation and massive computational overhead limit its scalability. To bridge this gap without updating the weights of the pre-trained backbone of the MLLM, we propose a Clinician Mimetic Workflow. This is a novel In-Context Learning (ICL) framework designed to synergize Discriminative Exemplar Coreset Selection (DECS) and Self-Refined Experience Summarization (SRES). Specifically, DECS simulates a clinician’s ability to reference “anchor cases” by selecting discriminative visual coresets from noisy data at the computational level; meanwhile, SRES mimics the cognition and reflection in clinical diagnosis by distilling diverse rollouts into a dynamic textual Experience Bank. Extensive evaluation across all 12 datasets of the MedMNIST 2D benchmark demonstrates that our method outperforms zero-shot general and medical MLLMs. Simultaneously, it achieves performance levels comparable to fully supervised vision models and domain-specific fine-tuned MLLMs, setting a new benchmark for parameter-efficient medical in-context learning. Our code is available at an anonymous repository: https://anonymous.4open.science/r/Synergizing-Discriminative-Exemplars-and-Self-Refined-Experience-ED74.
[318] TIR-Agent: Training an Explorative and Efficient Agent for Image Restoration
Yisheng Zhang, Guoli Jia, Haote Hu, Shanxu Zhao, Kaikai Zhao, Long Sun, Xinwei Long, Kai Tian, Che Jiang, Zhaoxiang Liu, Kai Wang, Shiguo Lian, Kaiyan Zhang, Bowen Zhou
Main category: cs.CV
TL;DR: TIR-Agent: A trainable vision-language agent for image restoration that learns optimal tool-calling policies through supervised fine-tuning and reinforcement learning, outperforming training-free methods and achieving significant speedups.
Details
Motivation: Existing vision-language agents for image restoration rely on heuristic task scheduling and exhaustive tool traversal, leading to suboptimal restoration paths and high computational costs. The core bottleneck is the lack of learned decision-making policies for efficient degradation-aware task ordering and tool composition.
Method: Proposes TIR-Agent with a two-stage training pipeline: 1) Supervised fine-tuning (SFT) to learn basic tool-calling patterns, 2) Reinforcement learning (RL) with key designs: random perturbation strategy on SFT data to broaden exploration, and multi-dimensional adaptive reward mechanism to dynamically weight image quality metrics and prevent reward hacking. Also develops a globally shared model-call pool for high-throughput GPU-based tool invocation.
Result: Outperforms 12 baselines including 6 all-in-one models, 3 training-free agents, and 3 proprietary models on both in-domain and out-of-domain degradations. Achieves over 2.5× inference speedup by eliminating redundant tool executions.
Conclusion: TIR-Agent demonstrates that trainable vision-language agents with learned policies can significantly improve image restoration performance and efficiency compared to training-free approaches, offering a promising direction for multimodal AI systems.
Abstract: Vision-language agents that orchestrate specialized tools for image restoration (IR) have emerged as a promising method, yet most existing frameworks operate in a training-free manner. They rely on heuristic task scheduling and exhaustive tool traversal, resulting in sub-optimal restoration paths and prohibitive computational cost. We argue that the core bottleneck lies in the absence of a learned policy to make decisions, as a vision-language model cannot efficiently handle degradation-aware task ordering and tool composition. To this end, we propose TIR-Agent, a trainable image restoration agent that learns a direct tool-calling policy through a two-stage training pipeline of supervised fine-tuning (SFT) followed by reinforcement learning (RL). Two key designs underpin effective RL training: (i) a random perturbation strategy applied to the SFT data, which broadens the policy’s exploration over task schedules and tool compositions, and (ii) a multi-dimensional adaptive reward mechanism that dynamically re-weights heterogeneous image quality metrics to mitigate reward hacking. To support high-throughput, asynchronous GPU-based tool invocation during training, we further develop a globally shared model-call pool. Experiments on both in-domain and out-of-domain degradations show that TIR-Agent outperforms 12 baselines, including 6 all-in-one models, 3 training-free agents, and 3 proprietary models, and achieves over 2.5× inference speedup by eliminating redundant tool executions.
[319] JaWildText: A Benchmark for Vision-Language Models on Japanese Scene Text Understanding
Koki Maeda, Naoaki Okazaki
Main category: cs.CV
TL;DR: JaWildText is a diagnostic benchmark for evaluating vision-language models on Japanese scene text understanding, addressing language-specific complexities like mixed scripts, vertical writing, and large character inventory.
Details
Motivation: Existing multilingual benchmarks fail to capture Japanese-specific text complexities, and current Japanese datasets focus on scanned documents rather than in-the-wild scene text, creating a gap in evaluating VLMs for real-world Japanese text understanding.
Method: Created JaWildText benchmark with 3,241 instances from 2,961 newly captured images in Japan, containing 1.12 million annotated characters across 3,643 unique character types. Includes three tasks: Dense Scene Text VQA, Receipt Key Information Extraction, and Handwriting OCR.
Result: Evaluation of 14 open-weight VLMs shows best model achieves average score of 0.64 across three tasks. Error analysis reveals recognition remains dominant bottleneck, especially for kanji characters.
Conclusion: JaWildText enables fine-grained, script-aware diagnosis of Japanese scene text capabilities and will be released with evaluation code to advance research in Japanese text understanding for vision-language models.
Abstract: Japanese scene text poses challenges that multilingual benchmarks often fail to capture, including mixed scripts, frequent vertical writing, and a character inventory far larger than the Latin alphabet. Although Japanese is included in several multilingual benchmarks, these resources do not adequately capture the language-specific complexities. Meanwhile, existing Japanese visual text datasets have primarily focused on scanned documents, leaving in-the-wild scene text underexplored. To fill this gap, we introduce JaWildText, a diagnostic benchmark for evaluating vision-language models (VLMs) on Japanese scene text understanding. JaWildText contains 3,241 instances from 2,961 images newly captured in Japan, with 1.12 million annotated characters spanning 3,643 unique character types. It comprises three complementary tasks that vary in visual organization, output format, and writing style: (i) Dense Scene Text Visual Question Answering (STVQA), which requires reasoning over multiple pieces of visual text evidence; (ii) Receipt Key Information Extraction (KIE), which tests layout-aware structured extraction from mobile-captured receipts; and (iii) Handwriting OCR, which evaluates page-level transcription across various media and writing directions. We evaluate 14 open-weight VLMs and find that the best model achieves an average score of 0.64 across the three tasks. Error analyses show recognition remains the dominant bottleneck, especially for kanji. JaWildText enables fine-grained, script-aware diagnosis of Japanese scene text capabilities, and will be released with evaluation code.
[320] VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, Rakesh Ranjan
Main category: cs.CV
TL;DR: VLM-3R is a unified framework for Vision-Language Models that incorporates 3D reconstructive instruction tuning to enable monocular 3D spatial understanding and embodied reasoning from video frames.
Details
Motivation: Existing methods for 3D scene understanding rely on external depth sensors or pre-constructed 3D maps, limiting scalability with monocular video inputs and time-sensitive applications. The goal is to achieve visual-spatial intelligence comparable to human capabilities.
Method: VLM-3R processes monocular video frames using a geometry encoder to derive implicit 3D tokens representing spatial understanding. It employs Spatial-Visual-View Fusion and uses over 200K curated 3D reconstructive instruction tuning QA pairs to align real-world spatial context with language instructions.
Result: Extensive experiments show VLM-3R facilitates robust visual-spatial reasoning and enables understanding of temporal 3D context changes, excelling in both accuracy and scalability. The model also introduces a Vision-Spatial-Temporal Intelligence benchmark with 138.6K QA pairs across five tasks.
Conclusion: VLM-3R successfully extends multimodal models to 3D scene understanding from monocular videos, achieving deep spatial understanding without external sensors or pre-constructed maps, while enabling temporal reasoning about evolving spatial relationships.
Abstract: The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence. Nevertheless, achieving deep spatial understanding comparable to human capabilities poses significant challenges in model encoding and data acquisition. Existing methods frequently depend on external depth sensors for geometry capture or utilize off-the-shelf algorithms for pre-constructing 3D maps, thereby limiting their scalability, especially with prevalent monocular video inputs and for time-sensitive applications. In this work, we introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding. Leveraging our Spatial-Visual-View Fusion and over 200K curated 3D reconstructive instruction tuning question-answer (QA) pairs, VLM-3R effectively aligns real-world spatial context with language instructions. This enables monocular 3D spatial assistance and embodied reasoning. To facilitate the evaluation of temporal reasoning, we introduce the Vision-Spatial-Temporal Intelligence benchmark, featuring over 138.6K QA pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments demonstrate that our model, VLM-3R, not only facilitates robust visual-spatial reasoning but also enables the understanding of temporal 3D context changes, excelling in both accuracy and scalability.
[321] Data Organization Matters in Multimodal Instruction Tuning: A Controlled Study of Capability Trade-offs
Guowei Tang
Main category: cs.CV
TL;DR: Data organization in multimodal instruction tuning affects capability trade-offs; curriculum training (general→reasoning→OCR) yields best overall performance and reasoning, while balanced sampling favors OCR but weakens broader capabilities.
Details
Motivation: Multimodal LLMs learn from heterogeneous supervision with different task structures, but the effect of temporal organization during training remains underexplored. The paper investigates whether data organization affects trade-offs among general understanding, structured reasoning, and fine-grained OCR/document understanding.
Method: Controlled three-stage training framework with fixed backbone, trainable modules, and optimization pipeline. Compares four data organization strategies: direct mixture, curriculum training, balanced sampling, and reverse curriculum. Evaluates on general visual instruction following, diagram reasoning, chart reasoning, scene-text question answering, and document question answering.
Result: Curriculum training gives best overall trade-off and strongest structured reasoning performance. Balanced sampling is better for OCR-oriented capability but weakens broader capability balance. Reverse curriculum performs worst in both final performance and optimization stability. Training dynamics show building general understanding and reasoning before OCR leads to smoother optimization.
Conclusion: Data organization is a first-order design variable in multimodal adaptation. Curriculum training (general→reasoning→OCR) provides optimal balance. Data scheduling should be considered as an explicit design dimension for multimodal model adaptation.
Abstract: Recent multimodal large language models (MLLMs) perform strongly on general visual understanding, diagram and chart reasoning, and document-centric perception. However, these abilities are learned from heterogeneous supervision sources with very different task structures and learning demands, and the effect of their temporal organization during training remains underexplored. We study whether data organization affects the trade-off among general understanding, structured reasoning, and fine-grained OCR/document understanding in multimodal instruction tuning. To isolate this factor, we use a controlled three-stage training framework in which the backbone, trainable modules, and optimization pipeline are fixed across all runs, and only the temporal arrangement of post-alignment supervision is changed. We compare four strategies: direct mixture, curriculum training, balanced sampling, and reverse curriculum. Experiments on general visual instruction following, diagram reasoning, chart reasoning, scene-text question answering, and document question answering show that data organization is a first-order design variable in multimodal adaptation. Curriculum training gives the best overall trade-off and the strongest structured reasoning performance. Balanced sampling is better for OCR-oriented capability but weakens the broader capability balance. Reverse curriculum performs worst in both final performance and optimization stability. Training-dynamics analysis further suggests that building general understanding and reasoning before introducing OCR-intensive supervision leads to smoother optimization and faster convergence. These findings highlight data scheduling as an explicit design dimension for multimodal model adaptation.
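The four scheduling strategies compared above differ only in how the same pool of samples is ordered over training. A minimal sketch of the four orderings, with made-up stage names and samples (none taken from the paper):

```python
import random

def schedule(stages, strategy, seed=0):
    """Arrange per-stage sample lists into one training order.

    stages: dict mapping stage name -> list of samples, in curriculum order
    strategy: "curriculum", "reverse", "mixture", or "balanced"
    """
    rng = random.Random(seed)
    ordered = list(stages.values())
    if strategy == "curriculum":          # general -> reasoning -> OCR
        return [s for stage in ordered for s in stage]
    if strategy == "reverse":             # OCR -> reasoning -> general
        return [s for stage in reversed(ordered) for s in stage]
    if strategy == "mixture":             # everything shuffled together
        flat = [s for stage in ordered for s in stage]
        rng.shuffle(flat)
        return flat
    if strategy == "balanced":            # round-robin, equal sampling rate
        out, iters = [], [iter(stage) for stage in ordered]
        while iters:
            for it in list(iters):
                s = next(it, None)
                if s is None:
                    iters.remove(it)
                else:
                    out.append(s)
        return out
    raise ValueError(strategy)

stages = {"general": ["g1", "g2"], "reasoning": ["r1", "r2"], "ocr": ["o1", "o2"]}
print(schedule(stages, "curriculum"))  # ['g1', 'g2', 'r1', 'r2', 'o1', 'o2']
print(schedule(stages, "balanced"))    # ['g1', 'r1', 'o1', 'g2', 'r2', 'o2']
```

The paper's finding, in these terms, is that the "curriculum" ordering optimizes most smoothly while "balanced" trades broader capability for OCR strength.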
[322] E-TIDE: Fast, Structure-Preserving Motion Forecasting from Event Sequences
Biswadeep Sen, Benoit R. Cottereau, Nicolas Cuperlier, Terence Sim
Main category: cs.CV
TL;DR: E-TIDE: A lightweight, end-to-end trainable architecture for predicting future event representations from past observations in event-based cameras, designed for efficiency without large-scale pretraining.
Details
Motivation: Event-based cameras produce sparse, temporally precise data, but existing prediction methods rely on computationally heavy backbones and large-scale pretraining, limiting their applicability in resource-constrained scenarios. There is a need for efficient models that can predict future event representations for downstream tasks like semantic segmentation or object tracking.
Method: Introduces E-TIDE with the TIDE module (Temporal Interaction for Dynamic Events), which uses an efficient spatiotemporal interaction design for sparse event tensors. Captures temporal dependencies via large-kernel mixing and activity-aware gating while maintaining low computational complexity. The architecture is lightweight and end-to-end trainable without requiring large-scale pretraining.
Result: Experiments on standard event-based datasets show competitive performance with significantly reduced model size and training requirements. The method is well-suited for real-time deployment under tight latency and memory budgets.
Conclusion: E-TIDE provides an efficient solution for event-tensor prediction that balances performance with computational efficiency, making it practical for resource-constrained applications while maintaining competitive accuracy.
Abstract: Event-based cameras capture visual information as asynchronous streams of per-pixel brightness changes, generating sparse, temporally precise data. Compared to conventional frame-based sensors, they offer significant advantages in capturing high-speed dynamics while consuming substantially less power. Predicting future event representations from past observations is an important problem, enabling downstream tasks such as future semantic segmentation or object tracking without requiring access to future sensor measurements. While recent state-of-the-art approaches achieve strong performance, they often rely on computationally heavy backbones and, in some cases, large-scale pretraining, limiting their applicability in resource-constrained scenarios. In this work, we introduce E-TIDE, a lightweight, end-to-end trainable architecture for event-tensor prediction that is designed to operate efficiently without large-scale pretraining. Our approach employs the TIDE module (Temporal Interaction for Dynamic Events), motivated by efficient spatiotemporal interaction design for sparse event tensors, to capture temporal dependencies via large-kernel mixing and activity-aware gating while maintaining low computational complexity. Experiments on standard event-based datasets demonstrate that our method achieves competitive performance with significantly reduced model size and training requirements, making it well-suited for real-time deployment under tight latency and memory budgets.
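The combination of large-kernel mixing and activity-aware gating can be illustrated in one dimension: mix each timestep over a long window of past values, then scale by a gate that vanishes when the window is inactive. This is a loose, hypothetical sketch of the idea, not the actual TIDE module, which operates on spatiotemporal event tensors:

```python
def tide_mix(x, kernel):
    """1-D sketch of large-kernel temporal mixing with activity-aware gating.

    x: sequence of per-timestep event counts
    kernel: mixing weights over the most recent len(kernel) timesteps
    Illustrative only; made up to mirror the mechanism named in the abstract.
    """
    k = len(kernel)
    out = []
    for t in range(len(x)):
        window = x[max(0, t - k + 1): t + 1]      # large temporal window
        w = kernel[-len(window):]                  # align weights to window
        mixed = sum(wi * xi for wi, xi in zip(w, window))
        activity = sum(abs(xi) for xi in window) / len(window)
        gate = activity / (activity + 1.0)         # ~0 when window is inactive
        out.append(gate * mixed)
    return out

events = [0, 0, 5, 6, 0, 0, 0, 4]
print([round(v, 2) for v in tide_mix(events, kernel=[0.25, 0.25, 0.25, 0.25])])
```

Quiet stretches cost almost nothing (the gate suppresses them), which is the intuition behind keeping complexity low on sparse event data.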
[323] Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment
Tongfei Liu, Yufan Liu, Bing Li, Weiming Hu
Main category: cs.CV
TL;DR: Dataset Concentration (DsCo) framework uses diffusion-based Noise-Optimization to synthesize compact datasets for efficient training, addressing scalability and data-free scenarios in dataset distillation.
Details
Motivation: High costs and accessibility issues of large datasets hinder visual recognition systems; existing diffusion-based dataset distillation methods lack theoretical justification, scale poorly to high volumes, and fail in data-free scenarios.
Method: Establishes theoretical framework proving equivalence between dataset distillation and distribution matching; proposes DsCo with Noise-Optimization (NOpt) to synthesize representative samples, optionally augmented via “Doping” (mixing selected original samples).
Result: Achieves SOTA for low data volumes, extends well to high volumes (nearly halves dataset size with no performance degradation), applicable in both data-accessible and data-free scenarios.
Conclusion: DsCo provides theoretically justified, efficient dataset distillation that overcomes scalability limitations and works in data-free scenarios, enabling more accessible large-scale visual recognition systems.
Abstract: The high cost and accessibility problem associated with large datasets hinder the development of large-scale visual recognition systems. Dataset Distillation addresses these problems by synthesizing compact surrogate datasets for efficient training, storage, transfer, and privacy preservation. The existing state-of-the-art diffusion-based dataset distillation methods face three issues: lack of theoretical justification, poor efficiency in scaling to high data volumes, and failure in data-free scenarios. To address these issues, we establish a theoretical framework that justifies the use of diffusion models by proving the equivalence between dataset distillation and distribution matching, and reveals an inherent efficiency limit in the dataset distillation paradigm. We then propose a Dataset Concentration (DsCo) framework that uses a diffusion-based Noise-Optimization (NOpt) method to synthesize a small yet representative set of samples, and optionally augments the synthetic data via “Doping”, which mixes selected samples from the original dataset with the synthetic samples to overcome the efficiency limit of dataset distillation. DsCo is applicable in both data-accessible and data-free scenarios, achieving SOTA performances for low data volumes, and it extends well to high data volumes, where it nearly reduces the dataset size by half with no performance degradation.
[324] RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization
Junwei Zheng, Ruize Dai, Ruiping Liu, Zichao Zeng, Yufan Chen, Fangjinhua Wang, Kunyu Peng, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen
Main category: cs.CV
TL;DR: RHO proposes a metric cross-view geo-localization system using panoramic ground images and OpenStreetMap data, with a new large-scale benchmark dataset CV-RHO and a two-branch architecture for accurate 3-DoF camera pose estimation.
Details
Motivation: Existing metric cross-view geo-localization methods typically use pinhole and satellite images, but panoramic images provide more comprehensive visual information and OpenStreetMap offers rich semantic data. Robust localization under varying conditions is needed, and large-scale benchmarks for this task are lacking.
Method: Proposes the RHO model with a two-branch Pin-Pan architecture: one branch processes panoramic images using a Split-Undistort-Merge (SUM) module to handle distortion, while the other processes OpenStreetMap data. Uses a Position-Orientation Fusion (POF) mechanism to combine position and heading information for accurate localization. Introduces the CV-RHO dataset with 2.7M images under diverse conditions.
Result: Extensive experiments show significant performance gains up to 20% compared to state-of-the-art baselines. The CV-RHO dataset proves valuable for benchmarking, and the RHO model demonstrates effectiveness in metric cross-view geo-localization.
Conclusion: The work establishes a comprehensive benchmark for panoramic cross-view geo-localization and proposes an effective architecture that handles panoramic distortion while leveraging OpenStreetMap data for improved accuracy under varying conditions.
Abstract: Metric Cross-View Geo-Localization (MCVGL) aims to estimate the 3-DoF camera pose (position and heading) by matching ground and satellite images. In this work, instead of pinhole and satellite images, we study robust MCVGL using holistic panoramas and OpenStreetMap (OSM). To this end, we establish a large-scale MCVGL benchmark dataset, CV-RHO, with over 2.7M images under different weather and lighting conditions, as well as sensor noise. Furthermore, we propose a model termed RHO with a two-branch Pin-Pan architecture for accurate visual localization. A Split-Undistort-Merge (SUM) module is introduced to address the panoramic distortion, and a Position-Orientation Fusion (POF) mechanism is designed to enhance the localization accuracy. Extensive experiments prove the value of our CV-RHO dataset and the effectiveness of the RHO model, with a significant performance gain up to 20% compared with the state-of-the-art baselines. Project page: https://github.com/InSAI-Lab/RHO.
[325] When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
Chengyin Hu, Xuemeng Sun, Jiajun Han, Qike Zhang, Xiang Chen, Xin Wang, Yiwei Wei, Jiahua Long
Main category: cs.CV
TL;DR: A method for generating photorealistic non-rigid deformations (fabric wrinkles) to test VLM robustness, using parametric structural perturbations with multi-scale wrinkle fields and optimization-based search.
Details
Motivation: While VLMs show strong cross-modal understanding, their robustness to physically plausible non-rigid deformations like fabric wrinkles remains poorly understood and needs systematic evaluation.
Method: Parametric structural perturbation method inspired by 3D fabric mechanics, generating photorealistic wrinkles via multi-scale wrinkle fields with displacement field distortion and surface-consistent appearance variations. Uses hierarchical fitness function in low-dimensional parameter space with optimization-based search strategy.
Result: Method significantly degrades performance of various state-of-the-art VLMs, consistently outperforming baselines in both image captioning and visual question-answering tasks.
Conclusion: VLMs remain vulnerable to physically plausible non-rigid deformations, highlighting the need for more robust vision-language models that can handle real-world physical variations.
Abstract: Visual-Language Models (VLMs) have demonstrated exceptional cross-modal understanding across various tasks, including zero-shot classification, image captioning, and visual question answering. However, their robustness to physically plausible non-rigid deformations, such as wrinkles on flexible surfaces, remains poorly understood. In this work, we propose a parametric structural perturbation method inspired by the mechanics of three-dimensional fabric wrinkles. Specifically, our method generates photorealistic non-rigid perturbations by constructing multi-scale wrinkle fields and integrating displacement field distortion with surface-consistent appearance variations. To achieve an optimal balance between visual naturalness and adversarial effectiveness, we design a hierarchical fitness function in a low-dimensional parameter space and employ an optimization-based search strategy. We evaluate our approach using a two-stage framework: perturbations are first optimized on a zero-shot classification proxy task and subsequently assessed for transferability on generative tasks. Experimental results demonstrate that our method significantly degrades the performance of various state-of-the-art VLMs, consistently outperforming baselines in both image captioning and visual question-answering tasks.
[326] Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models
Leander Girrbach, Stephan Alaniz, Genevieve Smith, Trevor Darrell, Zeynep Akata
Main category: cs.CV
TL;DR: First large-scale demographic annotations for LAION-400M reveal dataset biases that predict downstream model biases in vision-language models like CLIP and Stable Diffusion.
Details
Motivation: Vision-language models show strong demographic biases, but the role of training data remains unclear due to lack of demographic annotations in web-scale datasets like LAION-400M.Method: Created person-centric annotations for full LAION-400M dataset using validated automatic labeling pipelines combining object detection, multimodal captioning, and finetuned classifiers for perceived gender and race/ethnicity labels.
Result: Uncovered demographic imbalances and harmful associations (e.g., men and Black/Middle Eastern individuals disproportionately linked with crime/negative content). Linear fit predicts 60-70% of gender bias in CLIP/Stable Diffusion from data co-occurrences.
Conclusion: Establishes first large-scale empirical link between dataset composition and downstream model bias, providing resources to study and mitigate biases in vision-language models.
Abstract: Vision-language models trained on large-scale multimodal datasets show strong demographic biases, but the role of training data in producing these biases remains unclear. A major barrier has been the lack of demographic annotations in web-scale datasets such as LAION-400M. We address this gap by creating person-centric annotations for the full dataset, including over 276 million bounding boxes, perceived gender and race/ethnicity labels, and automatically generated captions. These annotations are produced through validated automatic labeling pipelines combining object detection, multimodal captioning, and finetuned classifiers. Using them, we uncover demographic imbalances and harmful associations, such as the disproportionate linking of men and individuals perceived as Black or Middle Eastern with crime-related and negative content. We also show that a linear fit predicts 60-70% of gender bias in CLIP and Stable Diffusion from direct co-occurrences in the data. Our resources establish the first large-scale empirical link between dataset composition and downstream model bias. Code is available at https://github.com/ExplainableML/LAION-400M-Person-Centric-Annotations.
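The 60-70% figure above is the variance in downstream model bias explained by a linear fit on data co-occurrences. A self-contained sketch of such a fit, using toy numbers rather than the paper's measurements:

```python
def linear_fit(x, y):
    """Closed-form ordinary least squares for y ~ a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return a, my - a * mx

def r_squared(x, y, a, b):
    """Fraction of variance in y explained by the fit (the 60-70% quantity)."""
    my = sum(y) / len(y)
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Toy values, illustrative only: per-concept male co-occurrence rate in the
# data (x) vs. a downstream model's gender-bias score for that concept (y).
cooc = [0.1, 0.3, 0.5, 0.7, 0.9]
bias = [0.15, 0.33, 0.52, 0.68, 0.88]
a, b = linear_fit(cooc, bias)
print(round(r_squared(cooc, bias, a, b), 3))  # → 0.999
```

On real data the fit is of course much noisier; the paper's point is that a simple linear relationship already accounts for most of the observed gender bias in CLIP and Stable Diffusion.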
[327] RINO: Rotation-Invariant Non-Rigid Correspondences
Maolin Gao, Shao Jie Hu-Chen, Congyue Deng, Riccardo Marin, Leonidas Guibas, Daniel Cremers
Main category: cs.CV
TL;DR: RINO is an unsupervised, rotation-invariant dense 3D shape correspondence framework that unifies rigid and non-rigid shape matching using a novel feature extractor called RINONet.
Details
Motivation: Existing deep learning approaches for dense 3D shape correspondence rely on intermediate geometric features or handcrafted descriptors, which limit their effectiveness under challenging conditions like non-isometric deformations, partial data, and non-manifold inputs.
Method: RINO uses RINONet, a feature extractor that integrates vector-based SO(3)-invariant learning with orientation-aware complex functional maps to extract robust features directly from raw geometry, enabling a fully end-to-end, data-driven approach without shape pre-alignment or handcrafted features.
Result: Extensive experiments show unprecedented performance across challenging non-rigid matching tasks, including arbitrary poses, non-isometry, partiality, non-manifoldness, and noise.
Conclusion: RINO provides a robust, unsupervised solution for dense 3D shape correspondence that overcomes limitations of existing methods and works effectively under various challenging conditions.
Abstract: Dense 3D shape correspondence remains a central challenge in computer vision and graphics as many deep learning approaches still rely on intermediate geometric features or handcrafted descriptors, limiting their effectiveness under non-isometric deformations, partial data, and non-manifold inputs. To overcome these issues, we introduce RINO, an unsupervised, rotation-invariant dense correspondence framework that effectively unifies rigid and non-rigid shape matching. The core of our method is the novel RINONet, a feature extractor that integrates vector-based SO(3)-invariant learning with orientation-aware complex functional maps to extract robust features directly from raw geometry. This allows for a fully end-to-end, data-driven approach that bypasses the need for shape pre-alignment or handcrafted features. Extensive experiments show unprecedented performance of RINO across challenging non-rigid matching tasks, including arbitrary poses, non-isometry, partiality, non-manifoldness, and noise.
[328] GS3LAM: Gaussian Semantic Splatting SLAM
Linfei Li, Lin Zhang, Zhong Wang, Ying Shen
Main category: cs.CV
TL;DR: GS3LAM: A real-time semantic SLAM framework using 3D Gaussian Splatting for dense multimodal fusion of RGB, depth, and semantics with improved tracking, rendering, and semantic precision.
Details
Motivation: Existing semantic SLAM systems have limitations: explicit representations are resolution-limited and can't predict unknown areas, while implicit representations are too slow for real-time use. 3D Gaussian Splatting offers a promising middle ground with efficiency and geometric continuity.
Method: Proposes GS3LAM framework that models scenes as Semantic Gaussian Fields (SG-Field), jointly optimizes camera poses and fields via multimodal error constraints, introduces Depth-adaptive Scale Regularization (DSR) to resolve scale misalignments, and uses Random Sampling-based Keyframe Mapping (RSKM) to mitigate catastrophic forgetting.
Result: Extensive experiments show GS3LAM achieves increased tracking robustness, superior rendering quality, and enhanced semantic precision compared to state-of-the-art methods on benchmark datasets.
Conclusion: GS3LAM successfully addresses limitations of existing semantic SLAM systems by leveraging 3D Gaussian Splatting for real-time, dense multimodal fusion with improved performance in tracking, rendering, and semantic mapping.
Abstract: Recently, the multi-modal fusion of RGB, depth, and semantics has shown great potential in dense Simultaneous Localization and Mapping (SLAM). However, a prerequisite for generating consistent semantic maps is the availability of dense, efficient, and scalable scene representations. Existing semantic SLAM systems based on explicit representations are often limited by resolution and an inability to predict unknown areas. Conversely, implicit representations typically rely on time-consuming ray tracing, failing to meet real-time requirements. Fortunately, 3D Gaussian Splatting (3DGS) has emerged as a promising representation that combines the efficiency of point-based methods with the continuity of geometric structures. To this end, we propose GS3LAM, a Gaussian Semantic Splatting SLAM framework that processes multimodal data to render consistent, dense semantic maps in real-time. GS3LAM models the scene as a Semantic Gaussian Field (SG-Field) and jointly optimizes camera poses and the field via multimodal error constraints. Furthermore, a Depth-adaptive Scale Regularization (DSR) scheme is introduced to resolve misalignments between scale-invariant Gaussians and geometric surfaces. To mitigate catastrophic forgetting, we propose a Random Sampling-based Keyframe Mapping (RSKM) strategy, which demonstrates superior performance over common local covisibility optimization methods. Extensive experiments on benchmark datasets show that GS3LAM achieves increased tracking robustness, superior rendering quality, and enhanced semantic precision compared to state-of-the-art methods. Source code is available at https://github.com/lif314/GS3LAM.
[329] Inference-time Trajectory Optimization for Manga Image Editing
Ryosuke Furuta
Main category: cs.CV
TL;DR: Inference-time adaptation method for manga image editing that tailors pretrained models to individual manga images without retraining or fine-tuning.
Details
Motivation: Pretrained image editing models underperform on manga due to training on natural-image data, but retraining/fine-tuning is impractical due to computational cost and copyright constraints.
Method: Corrects the generation trajectory at inference time so the input manga image can be reconstructed more faithfully under an empty prompt, requiring only the input image itself.
Result: Consistently outperforms existing baselines while incurring only negligible computational overhead.
Conclusion: Proposed inference-time adaptation method effectively adapts pretrained image editing models to manga without retraining, addressing domain gap issues.
Abstract: We present an inference-time adaptation method that tailors a pretrained image editing model to each input manga image using only the input image itself. Despite recent progress in pretrained image editing, such models often underperform on manga because they are trained predominantly on natural-image data. Re-training or fine-tuning large-scale models on manga is, however, generally impractical due to both computational cost and copyright constraints. To address this issue, our method slightly corrects the generation trajectory at inference time so that the input image can be reconstructed more faithfully under an empty prompt. Experimental results show that our method consistently outperforms existing baselines while incurring only negligible computational overhead.
[330] MolmoPoint: Better Pointing for VLMs with Grounding Tokens
Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna
Main category: cs.CV
TL;DR: A new pointing mechanism for vision-language models that directly selects visual tokens containing target concepts instead of generating coordinates, achieving state-of-the-art performance on image, GUI, and video pointing tasks.
Details
Motivation: Existing VLMs generate coordinates for pointing, which requires learning complex coordinate systems and results in high token counts. The authors propose a more intuitive approach that directly selects visual tokens containing target concepts.
Method: The model generates special pointing tokens that cross-attend to input image/video tokens to select appropriate ones. It uses a hierarchical approach: first token selects region, second token selects fine-grained subpatch within that region, third token specifies location within subpatch. Includes sequential generation with consistent order, encoding relative position of previous point, and special no-more-points class.
Result: Achieves new SOTA on image pointing (70.7% on PointBench), SOTA among fully open models on GUI pointing (61.1% on ScreenSpotPro), improves video pointing (59.1% human preference win rate vs. text coordinate baseline), and tracking (+6.3% gain on Molmo2Track). Shows higher sample efficiency and qualitative improvements.
Conclusion: The proposed token-based pointing mechanism is more intuitive and effective than coordinate-based approaches, achieving superior performance across multiple vision-language pointing tasks while being more sample efficient.
Abstract: Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
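The coarse-to-fine selection above can be sketched as successive argmax attentions: pick a patch, then a subpatch inside it, then a position inside the subpatch. The embeddings, grid size, and the centre-of-subpatch shortcut in the last step below are illustrative assumptions, not the paper's actual parameterization:

```python
def attend(query, keys):
    """Index of the key with the highest dot-product score (the argmax of a
    softmax over scores is the argmax over the raw scores)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return max(range(len(keys)), key=lambda i: scores[i])

def point(query, patch_tokens, subpatch_tokens, grid):
    """Coarse-to-fine pointing sketch: 1st token picks a coarse patch, 2nd
    picks a subpatch within it; the paper's 3rd token refines the location
    further, simplified here to the subpatch centre in [0, 1) coordinates."""
    patch = attend(query, patch_tokens)
    sub = attend(query, subpatch_tokens[patch])
    row, col = divmod(sub, grid)
    return patch, (row + 0.5) / grid, (col + 0.5) / grid

# Tiny hypothetical example: 3 patches, 2x2 subpatch grid, 2-d embeddings.
query = [1.0, 0.0]
patches = [[0.1, 0.9], [0.95, 0.1], [0.2, 0.2]]
subs = [[[0.0, 1.0]] * 4,
        [[0.1, 0.0], [0.9, 0.0], [0.0, 0.1], [0.2, 0.2]],
        [[0.5, 0.5]] * 4]
print(point(query, patches, subs, grid=2))  # → (1, 0.25, 0.75)
```

The appeal over coordinate generation is visible even here: each step is a classification over existing visual tokens, so no coordinate vocabulary has to be learned.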
[331] Diversity Matters: Dataset Diversification and Dual-Branch Network for Generalized AI-Generated Image Detection
Nusrat Tasnim, Kutub Uddin, Khalid Malik
Main category: cs.CV
TL;DR: A novel framework called Diversity Matters for AI-generated image detection that emphasizes data diversity and feature domain complementarity to improve generalization against unseen generative models.
Details
Motivation: The rapid proliferation of AI-generated images raises concerns about misinformation, copyright violations, and digital security, but detecting such images remains challenging due to the diversity of generative models and data distributions.
Method: Proposes a feature-domain similarity filtering mechanism to discard redundant samples and ensure diverse training data, plus a dual-branch network combining CLIP features from pixel and frequency domains to capture both semantic and structural cues.
Result: Extensive experiments on benchmark datasets show significant improvements in cross-model and cross-dataset performance compared to existing methods.
Conclusion: The work highlights the critical role of data and feature diversity in building reliable and robust detectors against the rapidly evolving landscape of synthetic content.
Abstract: The rapid proliferation of AI-generated images, powered by generative adversarial networks (GANs), diffusion models, and other synthesis techniques, has raised serious concerns about misinformation, copyright violations, and digital security. However, detecting such images in a generalized and robust manner remains a major challenge due to the vast diversity of generative models and data distributions. In this work, we present \textbf{Diversity Matters}, a novel framework that emphasizes data diversity and feature domain complementarity for AI-generated image detection. The proposed method introduces a feature-domain similarity filtering mechanism that discards redundant or highly similar samples across both inter-class and intra-class distributions, ensuring a more diverse and representative training set. Furthermore, we propose a dual-branch network that combines CLIP features from the pixel domain and the frequency domain to jointly capture semantic and structural cues, leading to improved generalization against unseen generative models and adversarial conditions. Extensive experiments on benchmark datasets demonstrate that the proposed approach significantly improves cross-model and cross-dataset performance compared to existing methods. \textbf{Diversity Matters} highlights the critical role of data and feature diversity in building reliable and robust detectors against the rapidly evolving landscape of synthetic content.
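One plausible reading of the feature-domain similarity filtering is a greedy cosine-similarity deduplication pass over feature vectors; the threshold and vectors below are hypothetical, and the real method also compares across inter- and intra-class distributions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def diversity_filter(features, threshold=0.95):
    """Greedily keep a sample only if its feature vector is not too similar
    to any already-kept sample; returns the kept indices."""
    kept = []
    for i, f in enumerate(features):
        if all(cosine(f, features[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Hypothetical feature vectors: samples 0 and 1 are near-duplicates.
feats = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.7, 0.7]]
print(diversity_filter(feats, threshold=0.95))  # → [0, 2, 3]
```

The threshold controls the diversity/size trade-off: lower values discard more of the training set but leave a more representative spread.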
[332] Tracking without Seeing: Geospatial Inference using Encrypted Traffic from Distributed Nodes
Sadik Yagiz Yetim, Gaofeng Dong, Isaac-Neil Zanoria, Ronit Barman, Maggie Wigness, Tarek Abdelzaher, Mani Srivastava, Suhas Diggavi
Main category: cs.CV
TL;DR: GraySense: A framework that performs geospatial object tracking using only encrypted packet-level information from wireless video transmissions, without accessing raw video streams, achieving 2.33m tracking error.
Details
Motivation: Traditional dynamic environment observation requires raw signal-level data from multiple sensors. This work explores an alternative: performing geospatial inference using only encrypted packet-level information, without access to raw sensory data, and fusing this indirect information with direct sensory data when available.
Method: GraySense consists of two stages: (1) a Packet Grouping module identifies frame boundaries and estimates frame sizes from encrypted network traffic, and (2) a Tracker module uses a Transformer encoder with recurrent state to fuse indirect packet-based inputs with optional direct camera-based inputs to estimate object position.
Result: Extensive experiments with realistic videos from CARLA simulator and emulated networks show GraySense achieves 2.33 meters tracking error (Euclidean distance) without raw signal access, within tracked object dimensions (4.61m x 1.93m).
Conclusion: GraySense demonstrates novel capability to perform geospatial object tracking using only encrypted packet-level information, expanding the use of latent signals for sensing without requiring access to raw video streams.
Abstract: Accurate observation of dynamic environments traditionally relies on synthesizing raw, signal-level information from multiple distributed sensors. This work investigates an alternative approach: performing geospatial inference using only encrypted packet-level information, without access to the raw sensory data. We further explore how this indirect information can be fused with directly available sensory data to extend overall inference capabilities. We introduce GraySense, a learning-based framework that performs geospatial object tracking by analyzing encrypted wireless video transmission traffic, such as packet sizes, from cameras with inaccessible streams. GraySense leverages the inherent relationship between scene dynamics and transmitted packet sizes to infer object motion. The framework consists of two stages: (1) a Packet Grouping module that identifies frame boundaries and estimates frame sizes from encrypted network traffic, and (2) a Tracker module, based on a Transformer encoder with a recurrent state, which fuses indirect packet-based inputs with optional direct camera-based inputs to estimate the object’s position. Extensive experiments with realistic videos from the CARLA simulator and emulated networks under varying conditions show that GraySense achieves 2.33 meters tracking error (Euclidean distance) without raw signal access, within the dimensions of tracked objects (4.61m x 1.93m). To our knowledge, this capability has not been previously demonstrated, expanding the use of latent signals for sensing.
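A heuristic stand-in for the Packet Grouping stage: split a packet trace into frames wherever the inter-arrival gap exceeds a threshold, then report per-frame byte totals (the quantity that correlates with scene dynamics). The gap threshold and trace are made up, and the paper's module is learned rather than rule-based:

```python
def group_frames(packets, gap=0.005):
    """Group (timestamp, size) packets into frames: a new frame starts when
    the inter-arrival gap exceeds `gap` seconds. Returns per-frame byte
    totals, from which frame-size sequences can be fed to a tracker."""
    frames, current, last_t = [], 0, None
    for t, size in packets:
        if last_t is not None and t - last_t > gap:
            frames.append(current)
            current = 0
        current += size
        last_t = t
    if current:
        frames.append(current)
    return frames

# Hypothetical trace: three bursts of packets ~33 ms apart (a 30 fps stream).
trace = [(0.000, 1400), (0.001, 1400), (0.002, 600),    # frame 1: 3400 B
         (0.033, 1400), (0.034, 900),                   # frame 2: 2300 B
         (0.066, 1400), (0.067, 1400), (0.068, 1400)]   # frame 3: 4200 B
print(group_frames(trace))  # → [3400, 2300, 4200]
```

Encryption hides payloads but not timing and size, which is exactly the side channel this framing exploits.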
[333] VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
Yufei Yin, Qianke Meng, Minghao Chen, Jiajun Ding, Zhenwei Shao, Zhou Yu
Main category: cs.CV
TL;DR: VideoARM introduces an agentic reasoning-over-hierarchical-memory paradigm for long-form video understanding that adaptively processes videos through observation-thinking-acting-memorizing loops, reducing token consumption while outperforming state-of-the-art methods.
Details
Motivation: Long-form video understanding is challenging due to extended temporal structure and dense multimodal cues. Existing approaches rely on hand-crafted reasoning pipelines or token-consuming video preprocessing, which limits efficiency and adaptability.
Method: VideoARM uses an agentic reasoning-over-hierarchical-memory paradigm with adaptive on-the-fly reasoning and memory construction. It employs a controller that autonomously invokes tools for coarse-to-fine video interpretation, while a hierarchical multimodal memory continuously captures and updates multi-level clues to support decision-making.
Result: Experiments on prevalent benchmarks show VideoARM outperforms the state-of-the-art DVD method while significantly reducing token consumption for long-form videos.
Conclusion: VideoARM provides an effective paradigm for long-form video understanding through adaptive agentic reasoning and hierarchical memory, addressing efficiency and performance limitations of existing approaches.
Abstract: Long-form video understanding remains challenging due to the extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or employ token-consuming video preprocessing to guide MLLMs in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an Agentic Reasoning-over-hierarchical-Memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM performs adaptive, on-the-fly agentic reasoning and memory construction. Specifically, VideoARM performs an adaptive and continuous loop of observing, thinking, acting, and memorizing, where a controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the operation of the agent, providing precise contextual information to support the controller in decision-making. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method, DVD, while significantly reducing token consumption for long-form videos.
[334] MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences
Shijian Wang, Jiarui Jin, Runhao Fu, Zexuan Yan, Xingjian Wang, Mengkang Hu, Eric Wang, Xiaoxi Li, Kangning Zhang, Li Yao, Wenxiang Jiao, Xuelian Cheng, Yuan Lu, Zongyuan Ge
Main category: cs.CV
TL;DR: MuSEAgent is a multimodal reasoning agent that enhances decision-making by learning from stateful experiences through hindsight reasoning, organizing them into a quality-filtered experience bank for adaptive retrieval during inference.
Details
Motivation: Current research agents have made progress in information seeking across text and vision, but they lack effective mechanisms for learning from past experiences. The paper aims to extend agent capabilities by developing a stateful experience learning paradigm that goes beyond simple trajectory-level retrieval to improve multimodal reasoning.
Method: Proposes stateful experience learning that abstracts interaction data into atomic decision experiences through hindsight reasoning. These experiences are organized into a quality-filtered experience bank. At inference time, the agent uses policy-driven experience retrieval with complementary wide- and deep-search strategies to dynamically retrieve multimodal guidance across diverse semantic viewpoints.
Result: Extensive experiments show MuSEAgent consistently outperforms strong trajectory-level experience retrieval baselines on both fine-grained visual perception and complex multimodal reasoning tasks, validating the effectiveness of stateful experience modeling.
Conclusion: Stateful experience modeling significantly improves multimodal agent reasoning by enabling more effective learning from past interactions and adaptive retrieval of multimodal guidance during decision-making.
Abstract: Research agents have recently achieved significant progress in information seeking and synthesis across heterogeneous textual and visual sources. In this paper, we introduce MuSEAgent, a multimodal reasoning agent that enhances decision-making by extending the capabilities of research agents to discover and leverage stateful experiences. Rather than relying on trajectory-level retrieval, we propose a stateful experience learning paradigm that abstracts interaction data into atomic decision experiences through hindsight reasoning. These experiences are organized into a quality-filtered experience bank that supports policy-driven experience retrieval at inference time. Specifically, MuSEAgent enables adaptive experience exploitation through complementary wide- and deep-search strategies, allowing the agent to dynamically retrieve multimodal guidance across diverse compositional semantic viewpoints. Extensive experiments demonstrate that MuSEAgent consistently outperforms strong trajectory-level experience retrieval baselines on both fine-grained visual perception and complex multimodal reasoning tasks. These results validate the effectiveness of stateful experience modeling in improving multimodal agent reasoning.
[335] RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
Léo Butsanets, Charles Corbière, Julien Khlaut, Pierre Manceron, Corentin Dancette
Main category: cs.CV
TL;DR: RadImageNet-VQA is a large-scale radiologic visual question answering dataset with 750K CT/MRI images and 7.5M QA pairs covering abnormality detection, anatomy recognition, and pathology identification across 8 anatomical regions and 97 pathology categories.
Details
Motivation: Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. There is a need for a comprehensive radiologic VQA dataset covering CT and MRI exams with expert-curated annotations.
Method: Built a large-scale dataset from expert-curated annotations with 750K images paired with 7.5M question-answer samples. Covers three key tasks: abnormality detection, anatomy recognition, and pathology identification. Supports open-ended, closed-ended, and multiple-choice questions across 8 anatomical regions and 97 pathology categories.
Result: State-of-the-art vision-language models struggle with fine-grained pathology identification, particularly in open-ended settings even after fine-tuning. Text-only analysis shows model performance collapses to near-random without image inputs, confirming the dataset is free from linguistic shortcuts.
Conclusion: RadImageNet-VQA provides a challenging benchmark for radiologic VQA that requires genuine visual understanding, exposing limitations of current vision-language models in medical image analysis tasks.
Abstract: In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.
[336] Benchmarking Multi-View BEV Object Detection with Mixed Pinhole and Fisheye Cameras
Xiangzhong Liu, Hao Shen
Main category: cs.CV
TL;DR: First real-data 3D detection benchmark with mixed fisheye and pinhole cameras for Bird’s-Eye View perception, evaluating adaptation strategies for distortion robustness.
Details
Motivation: Modern autonomous driving uses mixed camera configurations (pinhole + fisheye) for full view perception, but existing BEV 3D detection models are designed for pinhole cameras and degrade under fisheye distortion.
Method: Created benchmark by converting KITTI-360 to nuScenes format. Evaluated three adaptation strategies: 1) rectification for zero-shot evaluation/fine-tuning, 2) distortion-aware view transformation modules using the MEI camera model, 3) polar coordinate representations to align with radial distortion. Tested on three BEV architectures (BEVFormer, BEVDet, PETR).
Result: Projection-free architectures are inherently more robust to fisheye distortion than designs built on explicit view transformation modules. Establishes the first real-data 3D detection benchmark with mixed fisheye and pinhole images.
Conclusion: Provides systematic adaptation guidelines for designing robust 3D perception systems with mixed camera configurations. Shows projection-free architectures offer better distortion robustness.
Abstract: Modern autonomous driving systems increasingly rely on mixed camera configurations with pinhole and fisheye cameras for full view perception. However, Bird’s-Eye View (BEV) 3D object detection models are predominantly designed for pinhole cameras, leading to performance degradation under fisheye distortion. To bridge this gap, we introduce a multi-view BEV detection benchmark with mixed cameras by converting KITTI-360 into nuScenes format. Our study encompasses three adaptations: rectification for zero-shot evaluation and fine-tuning of nuScenes-trained models, distortion-aware view transformation modules (VTMs) via the MEI camera model, and polar coordinate representations to better align with radial distortion. We systematically evaluate three representative BEV architectures, BEVFormer, BEVDet and PETR, across these strategies. We demonstrate that projection-free architectures are inherently more robust and effective against fisheye distortion than other VTMs. This work establishes the first real-data 3D detection benchmark with fisheye and pinhole images and provides systematic adaptation and practical guidelines for designing robust and cost-effective 3D perception systems. The code is available at https://github.com/CesarLiu/FishBEVOD.git.
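For context on the distortion-aware adaptation, the unified (MEI-style) camera model that the view transformation modules build on can be sketched as below. This is a simplified illustration of the core projection only; the full MEI model's radial distortion terms and intrinsic parameters are omitted:

```python
import math

def mei_project(X, Y, Z, xi):
    """Unified-model projection onto the normalized image plane.

    A single mirror parameter xi interpolates between projections:
    xi = 0 recovers the ordinary pinhole model, while larger xi
    bends the projection toward fisheye-like behavior.
    """
    rho = math.sqrt(X * X + Y * Y + Z * Z)
    denom = Z + xi * rho
    return X / denom, Y / denom

print(mei_project(1.0, 2.0, 4.0, 0.0))   # (0.25, 0.5) -- plain pinhole projection
print(mei_project(3.0, 4.0, 0.0, 1.0))   # a point 90 degrees off-axis still projects
```

The second call shows why fisheye support matters: a point at Z = 0 (90 degrees off the optical axis) is unprojectable for a pinhole model but maps to a finite image point when xi > 0.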
[337] MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios
Zhang Li, Zhibo Lin, Qiang Liu, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiajun Song, Jiarui Zhang, Xiang Bai, Yuliang Liu
Main category: cs.CV
TL;DR: MDPBench is the first multilingual document parsing benchmark covering 17 languages with diverse scripts and photographic conditions, revealing significant performance gaps between closed-source and open-source models, especially on non-Latin scripts and photographed documents.
Details
Motivation: Current document parsing research focuses almost exclusively on clean, digital documents in dominant languages, lacking systematic evaluation across diverse scripts, low-resource languages, and real-world photographic conditions.Method: Created MDPBench with 3,400 document images across 17 languages using expert model labeling, manual correction, and human verification. Maintains separate public/private splits to prevent data leakage and ensure fair comparison.
Result: Closed-source models (Gemini3-Pro) show relative robustness, while open-source alternatives suffer dramatic performance collapse: average 17.8% drop on photographed documents and 14.0% drop on non-Latin scripts.
Conclusion: Reveals significant performance imbalances across languages and conditions, highlighting the need for more inclusive, deployment-ready document parsing systems that work across diverse real-world scenarios.
Abstract: We introduce Multilingual Document Parsing Benchmark, the first benchmark for multilingual digital and photographed document parsing. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) prove relatively robust, open-source alternatives suffer dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source available at https://github.com/Yuliang-Liu/MultimodalOCR.
[338] 3-D Representations for Hyperspectral Flame Tomography
Nicolas Tricard, Zituo Chen, Sili Deng
Main category: cs.CV
TL;DR: Comparison of voxel-grid vs neural network representations for flame tomography, finding voxel-grid with total-variation regularization performs best for 3D thermochemical reconstruction of simulated pool fires.
Details
Motivation: To provide a rigorous quantitative comparison between classical voxel-grid and neural network representations for flame tomography, as previous studies suggested neural networks improve reconstruction quality but lacked direct comparison with the same algorithm using voxel-grid representation.
Method: Compared voxel-grid representation with varying regularizers to continuous neural representation for tomographic reconstruction of simulated pool fire. Both representations output temperature and composition as functions of location, followed by ray-tracing to solve radiative transfer equation for spectral intensity on hyperspectral infrared cameras, convolved with instrument lineshape function.
Result: The voxel-grid approach with a total-variation regularizer reproduced the ground-truth synthetic flame with the highest accuracy at reduced memory intensity and runtime, outperforming the neural network representation.
Conclusion: Classical voxel-grid representation with appropriate regularization can outperform neural network approaches for flame tomography, suggesting need for careful representation selection. Future work will explore more representations and experimental configurations.
Abstract: Flame tomography is a compelling approach for extracting large amounts of data from experiments via 3-D thermochemical reconstruction. Recent efforts employing neural-network flame representations have suggested improved reconstruction quality compared with classical tomography approaches, but a rigorous quantitative comparison with the same algorithm using a voxel-grid representation has not been conducted. Here, we compare a classical voxel-grid representation with varying regularizers to a continuous neural representation for tomographic reconstruction of a simulated pool fire. The representations are constructed to give temperature and composition as a function of location, and a subsequent ray-tracing step is used to solve the radiative transfer equation to determine the spectral intensity incident on hyperspectral infrared cameras, which is then convolved with an instrument lineshape function. We demonstrate that the voxel-grid approach with a total-variation regularizer reproduces the ground-truth synthetic flame with the highest accuracy for reduced memory intensity and runtime. Future work will explore more representations and experimental configurations.
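As a reference point for the total-variation regularizer that performed best, here is a minimal sketch (not the authors' implementation) of an anisotropic TV penalty on a 3-D voxel field, which penalizes the sum of absolute differences between neighboring voxels along each axis:

```python
import numpy as np

def total_variation_3d(v):
    """Anisotropic total variation of a 3-D array: the sum of
    absolute forward differences along each of the three axes."""
    return (np.abs(np.diff(v, axis=0)).sum()
            + np.abs(np.diff(v, axis=1)).sum()
            + np.abs(np.diff(v, axis=2)).sum())

flat = np.ones((4, 4, 4))          # spatially constant field
print(total_variation_3d(flat))    # 0.0 -- TV favors piecewise-smooth fields
```

Added to a data-fit loss with a weight, this term discourages noisy voxel-to-voxel oscillations while still allowing sharp flame boundaries.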
[339] RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation
Chanseul Cho, Seokju Yun, Jeaseong Jeon, Seungjae Moon, Youngmin Ro
Main category: cs.CV
TL;DR: RecycleLoRA improves domain generalization in semantic segmentation by better exploiting vision foundation models’ subspace structures using Rank-Revealing QR decomposition for more diverse and efficient LoRA adapters.
Details
Motivation: Current methods for domain generalized semantic segmentation don't fully exploit the rich subspace structures within vision foundation models, and their LoRA components suffer from limited representational diversity and inefficient parameter utilization.
Method: Uses Rank-Revealing QR Decomposition to systematically analyze VFM subspace structures. Creates dual adapters: main adapter leverages minor subspace directions for diverse features, while sub adapter refines major directions with minimal adjustments. No additional regularization losses needed.
Result: Achieves state-of-the-art performance on both synthetic-to-real and real-to-real generalization tasks without complex architectures or additional inference latency.
Conclusion: Systematic exploitation of pre-trained subspace structures through RRQR-based initialization leads to superior domain generalization performance in semantic segmentation.
Abstract: Domain Generalized Semantic Segmentation (DGSS) aims to maintain robust performance across unseen target domains. Vision Foundation Models (VFMs) offer rich multi-domain knowledge that can enhance generalization. However, strategies for actively exploiting the rich subspace structures within VFMs remain under-explored, with many existing methods focusing primarily on preserving pre-trained knowledge. Furthermore, their LoRA components often suffer from limited representational diversity and inefficient parameter utilization. We propose RecycleLoRA, which addresses both challenges by employing Rank-Revealing QR Decomposition (RRQR) to systematically exploit VFM’s subspace structures and enhance LoRA’s representational richness. Our main adapter leverages minor subspace directions identified by RRQR to learn diverse and independent features, achieving competitive performance even when used alone. We further introduce a sub adapter that carefully refines major directions with minimal adjustments, providing complementary improvements to the main adapter’s strong baseline performance. This design enables the dual adapters to learn distinct representations without requiring additional regularization losses. Our systematic exploitation of pre-trained subspace structures through RRQR-based initialization leads to superior domain generalization performance. RecycleLoRA achieves state-of-the-art performance on both synthetic-to-real generalization and real-to-real generalization tasks without complex architectures or additional inference latency.
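To make the Rank-Revealing QR idea concrete, here is a minimal column-pivoted Gram-Schmidt sketch, one classical way to compute an RRQR; the paper's exact factorization and the major/minor split below are illustrative assumptions. The leading pivoted directions play the role of the "major" subspace the sub adapter gently refines, the trailing ones the "minor" subspace the main adapter exploits:

```python
import numpy as np

def pivoted_qr(A):
    """Greedy column-pivoted Gram-Schmidt.

    Returns Q, R, perm with A[:, perm] ~= Q @ R and the diagonal of R
    in non-increasing magnitude, so leading columns of Q span the
    dominant ("major") directions and trailing ones the "minor" ones.
    """
    A = np.array(A, dtype=float, copy=True)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    perm = list(range(n))
    for k in range(n):
        # pivot: bring the remaining column with the largest norm to slot k
        j = k + int(np.argmax(np.linalg.norm(A[:, k:], axis=0)))
        if j != k:
            A[:, [k, j]] = A[:, [j, k]]
            R[:k, [k, j]] = R[:k, [j, k]]  # keep computed rows consistent
            perm[k], perm[j] = perm[j], perm[k]
        R[k, k] = np.linalg.norm(A[:, k])
        if R[k, k] < 1e-12:               # numerically rank-deficient: stop
            break
        Q[:, k] = A[:, k] / R[k, k]
        # orthogonalize the remaining columns against the new direction
        R[k, k + 1:] = Q[:, k] @ A[:, k + 1:]
        A[:, k + 1:] -= np.outer(Q[:, k], R[k, k + 1:])
    return Q, R, perm

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))            # stand-in for a pre-trained weight
Q, R, perm = pivoted_qr(W)
major, minor = Q[:, :2], Q[:, 2:]          # hypothetical split for the dual adapters
```

The split index (2 here) is arbitrary; in practice it would be chosen from the decay of the diagonal of R.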
[340] Wan-R1: Verifiable-Reinforcement Learning for Video Reasoning
Ming Liu, Yunbei Zhang, Shilong Liu, Liwen Wang, Wensheng Zhang
Main category: cs.CV
TL;DR: RL fine-tuning with verifiable rewards improves video generation models’ spatial reasoning and planning capabilities for maze-solving and robotic navigation tasks.
Details
Motivation: Video generation models lack spatial reasoning and multi-step planning abilities, and RL could help but faces reward design challenges that haven't been systematically studied.
Method: Adapt Group Relative Policy Optimization (GRPO) to flow-based video models, design verifiable reward functions (multi-component trajectory rewards for games, embedding-level verifiable rewards for robotics), and systematically analyze reward design.
Result: RL fine-tuning with verifiable rewards improves generalization significantly: 29.1% accuracy improvement on 3D mazes and 51.4% on trap-avoidance tasks over SFT baselines.
Conclusion: Verifiable reward design is critical for stable RL training in video reasoning tasks, while multimodal reward models can lead to degenerate solutions.
Abstract: Video generation models produce visually coherent content but struggle with tasks requiring spatial reasoning and multi-step planning. Reinforcement learning (RL) offers a path to improve generalization, but its effectiveness in video reasoning hinges on reward design – a challenge that has received little systematic study. We investigate this problem by adapting Group Relative Policy Optimization (GRPO) to flow-based video models and training them on maze-solving and robotic navigation tasks. We first show that multimodal reward models fail catastrophically in this setting. To address this, we design verifiable reward functions grounded in objective task metrics. For structured game environments, we introduce a multi-component trajectory reward. For robotic navigation, we propose an embedding-level verifiable reward. Our experiments show that RL fine-tuning with verifiable rewards improves generalization. For example, on complex 3D mazes, our model improves exact match accuracy by 29.1% over the SFT baseline, and on trap-avoidance tasks by 51.4%. Our systematic reward analysis reveals that verifiable rewards are critical for stable training, while multimodal reward models could lead to degenerate solutions. These findings establish verifiable reward design as a key enabler for robust video reasoning. Code will be publicly available.
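The group-relative advantage at the core of GRPO can be sketched in a few lines: each sampled rollout's verifiable reward is normalized against the other rollouts for the same prompt, removing the need for a learned value baseline. This illustrates the normalization only, not the flow-model-specific training in the paper:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of per-rollout rewards to zero mean and
    (approximately) unit variance; eps guards against a zero std
    when every rollout in the group scores the same."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# two successful and two failed rollouts for one prompt
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

With a binary verifiable reward (e.g. "maze solved or not"), successful rollouts get positive advantages and failed ones negative, which is what pushes the policy toward verifiably correct videos.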
[341] Vision-Language Agents for Interactive Forest Change Analysis
James Brock, Ce Zhang, Nantheera Anantrasirichai
Main category: cs.CV
TL;DR: LLM-driven agent for forest change analysis using satellite imagery with multi-level change interpretation and semantic captioning capabilities.
Details
Motivation: Address the gap in integrating LLMs with vision-language models for remote sensing image change interpretation, particularly for complex forest dynamics monitoring.
Method: Proposes an LLM-driven agent with multi-level change interpretation vision-language backbone and LLM-based orchestration, using the Forest-Change dataset with bi-temporal satellite imagery, change masks, and semantic captions.
Result: Achieves 67.10% mIoU and 40.17% BLEU-4 on Forest-Change dataset, and 88.13% mIoU and 34.41% BLEU-4 on LEVIR-MCI-Trees subset for joint change detection and captioning.
Conclusion: Demonstrates potential of interactive LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis.
Abstract: Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.
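For reference, the mIoU metric quoted above can be illustrated on small change masks as follows; this is a generic per-class intersection-over-union averaged over classes (here binary change / no-change), not the authors' evaluation code:

```python
import numpy as np

def miou(pred, gt, num_classes=2):
    """Mean intersection-over-union across classes, skipping classes
    absent from both prediction and ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[1, 1], [0, 0]])   # predicted change mask
gt   = np.array([[1, 0], [0, 0]])   # ground-truth change mask
print(miou(pred, gt))               # ~0.583: (2/3 + 1/2) / 2
```

Averaging over classes rather than pixels keeps the rare "change" class from being swamped by the dominant "no-change" background.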
[342] Poppy: Polarization-based Plug-and-Play Guidance for Enhancing Monocular Normal Estimation
Irene Kim, Sai Tanmay Reddy Chakkera, Alexandros Graikos, Dimitris Samaras, Akshat Dave
Main category: cs.CV
TL;DR: Poppy is a training-free framework that refines surface normal estimates from any frozen RGB backbone using single-shot polarization measurements at test time, improving performance on challenging surfaces without retraining.
Details
Motivation: Monocular surface normal estimators trained on RGB-normal data perform poorly on reflective, textureless, and dark surfaces. Polarization encodes surface orientation independently of texture and albedo, offering a physics-based complement, but existing polarization methods require multi-view capture or specialized training data, limiting generalization.
Method: Poppy keeps backbone weights frozen and optimizes per-pixel offsets to the input RGB and output normal along with a learned reflectance decomposition. A differentiable rendering layer converts refined normals into polarization predictions and penalizes mismatches with observed polarization signals.
Result: Across seven benchmarks and three backbone architectures (diffusion, flow, and feed-forward), Poppy reduces mean angular error by 23-26% on synthetic data and 6-16% on real data.
Conclusion: Guiding learned RGB-based normal estimators with polarization cues at test time refines normals on challenging surfaces without requiring retraining, demonstrating effective integration of physics-based cues with learned models.
Abstract: Monocular surface normal estimators trained on large-scale RGB-normal data often perform poorly in the edge cases of reflective, textureless, and dark surfaces. Polarization encodes surface orientation independently of texture and albedo, offering a physics-based complement for these cases. Existing polarization methods, however, require multi-view capture or specialized training data, limiting generalization. We introduce Poppy, a training-free framework that refines normals from any frozen RGB backbone using single-shot polarization measurements at test time. Keeping backbone weights frozen, Poppy optimizes per-pixel offsets to the input RGB and output normal along with a learned reflectance decomposition. A differentiable rendering layer converts the refined normals into polarization predictions and penalizes mismatches with the observed signal. Across seven benchmarks and three backbone architectures (diffusion, flow, and feed-forward), Poppy reduces mean angular error by 23-26% on synthetic data and 6-16% on real data. These results show that guiding learned RGB-based normal estimators with polarization cues at test time refines normals on challenging surfaces without retraining.
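A much-simplified sketch of the physics such a differentiable rendering layer can exploit: under a diffuse polarization model, the angle of linear polarization (AoLP) follows the azimuth of the surface normal modulo pi, so a pi-periodic mismatch penalty can guide normal refinement. The reflectance decomposition and degree of polarization used by Poppy are omitted here, and the function names are illustrative:

```python
import math

def aolp_from_normal(nx, ny):
    """AoLP predicted from the image-plane components of a normal,
    under a diffuse model: the normal's azimuth, pi-periodic."""
    return math.atan2(ny, nx) % math.pi

def aolp_loss(n_pred, aolp_obs):
    """Squared mismatch between rendered and observed AoLP,
    wrapped to the pi-periodic domain before penalizing."""
    d = aolp_from_normal(*n_pred) - aolp_obs
    d = (d + math.pi / 2) % math.pi - math.pi / 2
    return d * d

print(aolp_from_normal(0.0, 1.0))   # 1.5707963... (pi/2)
```

Because this term depends only on the refined normal, its gradient can flow back through the per-pixel offsets while the backbone stays frozen.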
[343] FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao, Yufei Huang
Main category: cs.CV
TL;DR: FigEx2 is a visual-conditioned framework that localizes panels and generates captions for scientific compound figures, converting uncaptioned figures into usable panel-text pairs for multimodal pretraining.
Details
Motivation: Many scientific compound figures lack proper captions (16.3% have no caption, 1.8% have very short captions), causing them to be discarded by existing caption-decomposition pipelines, which wastes valuable multimodal training data.
Method: Visual-conditioned framework with noise-aware gated fusion module to adaptively control caption feature conditioning, staged SFT+RL strategy with CLIP-based alignment and BERTScore-based semantic rewards, and curated BioSci-Fig-Cap benchmark for supervision.
Result: Achieves 0.728 mAP@0.5:0.95 for detection, outperforms Qwen3-VL-8B by 0.44 in METEOR and 0.22 in BERTScore, and transfers zero-shot to out-of-distribution scientific domains without fine-tuning.
Conclusion: FigEx2 effectively converts otherwise unusable scientific figures into aligned panel-text pairs, enabling better utilization of multimodal scientific data for downstream pretraining and retrieval tasks.
Abstract: Scientific compound figures combine multiple labeled panels into a single image. However, in a PMC-scale crawl of 346,567 compound figures, 16.3% have no caption and 1.8% only have captions shorter than ten words, causing them to be discarded by existing caption-decomposition pipelines. We propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the image, converting otherwise unusable figures into aligned panel-text pairs for downstream pretraining and retrieval. To mitigate linguistic variance in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively controls how caption features condition the detection query space, and employ a staged SFT+RL strategy with CLIP-based alignment and BERTScore-based semantic rewards. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. FigEx2 achieves 0.728 mAP@0.5:0.95 for detection, outperforms Qwen3-VL-8B by 0.44 in METEOR and 0.22 in BERTScore, and transfers zero-shot to out-of-distribution scientific domains without fine-tuning.
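The mAP@0.5:0.95 detection score is built on box IoU: a predicted panel counts as a match when its IoU with a ground-truth panel exceeds each threshold in 0.5, 0.55, ..., 0.95, and the resulting APs are averaged. A generic IoU sketch (boxes as (x1, y1, x2, y2); not the paper's code):

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7: overlap area 1, union 7
```

Averaging over the 0.5:0.95 threshold range rewards tightly localized panel boxes, not just rough hits.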
[344] SAGE: Sink-Aware Grounded Decoding for Multimodal Hallucination Mitigation
Tripti Shukla, Zsolt Kira
Main category: cs.CV
TL;DR: SAGE is a decoding framework that reduces hallucinations in vision-language models by dynamically modulating self-attention during generation using attention sink tokens as grounding reliability monitors.
Details
Motivation: Current VLMs suffer from hallucinations (generating content inconsistent with visual inputs), and existing methods don't intervene during decoding when hallucinations actually occur. The authors aim to address hallucinations at generation time without model retraining.
Method: SAGE uses attention sink tokens (punctuation/function tokens that accumulate disproportionate attention) as anchors to monitor grounding reliability. At each sink trigger, it extracts semantic concepts, estimates visual grounding using self-attention maps and gradient-based attribution, measures spatial agreement, and adaptively sharpens or broadens self-attention distributions to reinforce grounded regions or suppress unreliable ones.
Result: SAGE consistently outperforms existing decoding strategies across diverse hallucination benchmarks, achieving average relative improvements of 10.65% on MSCOCO and 7.19% on AMBER across various VLM architectures, reducing hallucinations while preserving descriptive coverage.
Conclusion: SAGE effectively mitigates hallucinations in VLMs by dynamically modulating self-attention during decoding using sink-aware grounding, without requiring model retraining or architectural modifications, offering a practical solution to hallucination reduction.
Abstract: Large vision-language models (VLMs) frequently suffer from hallucinations, generating content that is inconsistent with visual inputs. Existing methods typically address this problem through post-hoc filtering, additional training objectives, or external verification, but they do not intervene during the decoding process when hallucinations arise. In this work, we introduce SAGE, a Sink-Aware Grounded Decoding framework that mitigates hallucinations by dynamically modulating self-attention during generation. Hallucinations are strongly correlated with attention sink tokens - punctuation or function tokens that accumulate disproportionate attention despite carrying limited semantic content. SAGE leverages these tokens as anchors to monitor grounding reliability in real time. At each sink trigger, the method extracts semantic concepts from the generated sequence, estimates their visual grounding using both self-attention maps and gradient-based attribution, and measures their spatial agreement. Based on this signal, self-attention distributions are adaptively sharpened or broadened to reinforce grounded regions or suppress unreliable ones. Extensive experiments across diverse hallucination benchmarks demonstrate that SAGE consistently outperforms existing decoding strategies, achieving substantial reductions in hallucination while preserving descriptive coverage, without requiring model retraining or architectural modifications. Our method achieves an average relative improvement of 10.65% on MSCOCO and 7.19% on AMBER across diverse VLM architectures, demonstrating consistent gains in hallucination mitigation.
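One standard way to "sharpen or broaden" an attention distribution, as the modulation step describes, is temperature re-normalization: a temperature below 1 concentrates mass on the strongest (well-grounded) positions, above 1 spreads it out. This sketch is illustrative only; how SAGE chooses the direction and strength from its grounding signal is specific to the paper:

```python
import math

def modulate_attention(weights, temperature):
    """Re-normalize an attention distribution at a given temperature.

    temperature < 1 sharpens toward the largest weights;
    temperature > 1 broadens toward uniform.
    """
    logits = [math.log(w + 1e-12) / temperature for w in weights]
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

attn = [0.6, 0.3, 0.1]
sharp = modulate_attention(attn, 0.5)   # mass shifts toward the argmax
broad = modulate_attention(attn, 2.0)   # closer to uniform
```

Applied selectively at sink triggers, such a re-weighting can reinforce visually grounded regions (sharpen) or damp unreliable ones (broaden) without touching model weights.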
[345] Rényi Entropy: A New Token Pruning Metric for Vision Transformers
Wei-Yuan Su, Ruijie Zhang, Zheng Zhang
Main category: cs.CV
TL;DR: Proposes Col-Ln, a training-free token importance metric based on Rényi entropy for early-layer token pruning in Vision Transformers, outperforming existing methods that rely on unreliable [CLS] tokens.
Details
Motivation: Vision Transformers suffer from O(N²) self-attention complexity, making high-resolution inference costly. Existing token pruning methods use [CLS] tokens for importance estimation, but these are unreliable in early layers where semantic representations are immature, leading to inaccurate pruning and information loss.
Method: Introduces Col-Ln, a training-free token importance metric derived from Rényi entropy that can identify informative tokens from the first layer of the network, enabling more reliable token pruning in early layers without requiring additional training.
Result: Extensive experiments on Vision Transformers and Large Vision-Language Models show that Col-Ln consistently outperforms state-of-the-art pruning methods across diverse benchmarks, demonstrating effective token reduction while maintaining performance.
Conclusion: Col-Ln provides a reliable training-free solution for early-layer token pruning in Vision Transformers, addressing the limitations of [CLS]-based methods and enabling more efficient inference for high-resolution inputs.
Abstract: Vision Transformers (ViTs) achieve state-of-the-art performance but suffer from the $O(N^2)$ complexity of self-attention, making inference costly for high-resolution inputs. To address this bottleneck, token pruning has emerged as a critical technique to accelerate inference. Most existing methods rely on the [CLS] token to estimate patch importance. However, we argue that the [CLS] token can be unreliable in early layers where semantic representations are still immature. As a result, pruning in the early layers often leads to inaccurate importance estimation and unnecessary information loss. In this work, we propose a training-free token importance metric, namely Col-Ln, which is derived from Rényi entropy and enables the identification of informative tokens from the first layer of the network, thereby allowing more reliable pruning during token reduction. Extensive experiments on ViTs and Large Vision-Language Models (LVLMs) demonstrate that our approach consistently outperforms state-of-the-art pruning methods across diverse benchmarks.
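The digest does not give the Col-Ln formula, but the underlying Rényi entropy is standard: H_α(p) = log(Σᵢ pᵢ^α) / (1 − α). The sketch below scores each token by the Rényi entropy of its normalized activation profile and keeps the most concentrated ones; the scoring construction and `keep_ratio` are illustrative assumptions, not the paper's exact metric.

```python
import numpy as np

def renyi_entropy(p, alpha=2.0, eps=1e-12):
    """Rényi entropy H_alpha(p) = log(sum p_i^alpha) / (1 - alpha).
    Recovers Shannon entropy in the limit alpha -> 1."""
    p = np.asarray(p, dtype=float) + eps
    p = p / p.sum()
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def prune_tokens(features, keep_ratio=0.5, alpha=2.0):
    """Keep the tokens whose normalized activation profiles have the
    lowest Rényi entropy, i.e. the most concentrated responses."""
    mags = np.abs(features)
    probs = mags / (mags.sum(axis=1, keepdims=True) + 1e-12)
    scores = np.array([renyi_entropy(p, alpha) for p in probs])
    k = max(1, int(len(scores) * keep_ratio))
    return np.argsort(scores)[:k]  # indices of kept tokens

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))   # 8 tokens, 16-dim features
kept = prune_tokens(feats, keep_ratio=0.25)
assert len(kept) == 2
```

Because the score needs no [CLS] attention, it can be evaluated from the very first layer, which is the property the paper emphasizes.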
[346] TwinMixing: A Shuffle-Aware Feature Interaction Model for Multi-Task Segmentation
Minh-Khoi Do, Huy Che, Dinh-Duy Phan, Duc-Khai Lam, Duc-Lung Vu
Main category: cs.CV
TL;DR: TwinMixing: A lightweight multi-task segmentation model for autonomous driving that performs drivable-area and lane segmentation with high accuracy and real-time efficiency on low-cost hardware.
Details
Motivation: Autonomous driving requires accurate perception for motion planning and control, but achieving high segmentation accuracy while maintaining real-time performance on low-cost hardware remains challenging.
Method: Lightweight network with shared encoder and task-specific decoders. Features Efficient Pyramid Mixing (EPM) module for multi-scale feature extraction using grouped convolutions, depthwise dilated convolutions, and channel shuffle operations. Decoders use Dual-Branch Upsampling (DBU) Blocks with learnable transposed convolution and bilinear interpolation branches.
Result: Achieves 92.0% mIoU for drivable-area segmentation and 32.3% IoU for lane segmentation with only 0.43M parameters and 3.95 GFLOPs. Outperforms existing segmentation models on BDD100K dataset.
Conclusion: TwinMixing provides an effective trade-off between accuracy and computational efficiency, demonstrating strong potential for real-time deployment in autonomous driving and embedded perception systems.
Abstract: Accurate and efficient perception is essential for autonomous driving, where segmentation tasks such as drivable-area and lane segmentation provide critical cues for motion planning and control. However, achieving high segmentation accuracy while maintaining real-time performance on low-cost hardware remains a challenging problem. To address this issue, we introduce TwinMixing, a lightweight multi-task segmentation model designed explicitly for drivable-area and lane segmentation. The proposed network features a shared encoder and task-specific decoders, enabling both feature sharing and task specialization. Within the encoder, we propose an Efficient Pyramid Mixing (EPM) module that enhances multi-scale feature extraction through a combination of grouped convolutions, depthwise dilated convolutions and channel shuffle operations, effectively expanding the receptive field while minimizing computational cost. Each decoder adopts a Dual-Branch Upsampling (DBU) Block composed of a learnable transposed convolution-based Fine detailed branch and a parameter-free bilinear interpolation-based Coarse grained branch, achieving detailed yet spatially consistent feature reconstruction. Extensive experiments on the BDD100K dataset validate the effectiveness of TwinMixing across three configurations - tiny, base, and large. Among them, the base configuration achieves the best trade-off between accuracy and computational efficiency, reaching 92.0% mIoU for drivable-area segmentation and 32.3% IoU for lane segmentation with only 0.43M parameters and 3.95 GFLOPs. Moreover, TwinMixing consistently outperforms existing segmentation models on the same tasks, as illustrated in Fig. 1. Thanks to its compact and modular design, TwinMixing demonstrates strong potential for real-time deployment in autonomous driving and embedded perception systems. The source code: https://github.com/Jun0se7en/TwinMixing.
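The channel shuffle operation inside the EPM module is a standard building block (popularized by ShuffleNet) that lets grouped convolutions exchange information across groups. A minimal sketch, independent of TwinMixing's actual implementation:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Channel shuffle: interleave channels across groups so that the
    next grouped convolution sees features from every previous group.

    x: feature map of shape (channels, height, width).
    """
    c, h, w = x.shape
    assert c % groups == 0
    # (groups, c_per_group, H, W) -> swap the two group axes -> flatten
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

x = np.arange(6 * 2 * 2).reshape(6, 2, 2)
y = channel_shuffle(x, groups=2)
# channels [0,1,2 | 3,4,5] become interleaved as [0,3,1,4,2,5]
assert (y[:, 0, 0] == np.array([0, 12, 4, 16, 8, 20])).all()
```

The operation is parameter-free, which is one reason shuffle-based mixers stay within tight FLOP budgets like TwinMixing's 3.95 GFLOPs.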
[347] BINO: Encoder Centric Self Supervised Stereo With Native Pair Input
Haokun Zhou
Main category: cs.CV
TL;DR: BINO learns binocular structure inside a compact encoder using stereo micro cell tokens and row-aware positional encoding, achieving strong cross-view correspondence for stereo vision tasks without separate linkage modules.
Details
Motivation: Stereo vision requires features that preserve fine cross-view correspondence, not just semantic similarity. Current self-supervised vision models aren't built for this, and geometry-focused methods rely on explicit linkage modules during pretraining.
Method: Fuses rectified stereo pairs at input stage using stereo micro cell tokens with row-aware patch phase positional encoding. Trains with one-view masked token distillation plus occlusion and view-specific appearance mismatch losses.
Result: In low-resource KITTI-only pretraining, BINO achieves best frozen descriptor results on dense stereo, hard negative retrieval, and KITTI Stereo 2012 disparity among baselines without linkage modules. Performs near CroCo v2 with much smaller encoder.
Conclusion: Much cross-view reasoning typically assigned to separate linkage modules can be learned inside a compact, reusable encoder, enabling efficient stereo correspondence learning.
Abstract: Stereo needs features that preserve fine cross-view correspondence rather than only semantic similarity. Recent self-supervised vision models transfer well, but they are not built for this goal, and geometry-focused methods often rely on a binocular decoder or another explicit linkage module during pretraining. BINO asks whether strong binocular structure can instead be learned inside a compact encoder. It does this by fusing the rectified pair at the input stage, forming stereo micro-cell tokens, and using a row-aware patch-phase positional encoding. Training uses one-view masked-token-only distillation together with occlusion and view-specific appearance mismatch losses. In a strict low-resource setting with pretraining only on KITTI object, BINO gives the best frozen-descriptor results under a no-linkage probe among all compared baselines on proxy dense stereo, hard-negative retrieval, and KITTI Stereo 2012 disparity. With the same lightweight stereo head for every encoder, it stays near CroCo v2 while using a much smaller encoder. Supplementary transfer experiments on KITTI Stereo 2015 show the same qualitative trend. These results suggest that much of the cross-view reasoning often assigned to a separate linkage module can be learned inside a compact and reusable encoder.
[348] DiffAttn: Diffusion-Based Drivers’ Visual Attention Prediction with LLM-Enhanced Semantic Reasoning
Weimin Liu, Qingkun Li, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng
Main category: cs.CV
TL;DR: DiffAttn: A diffusion-based framework for predicting drivers’ visual attention using Swin Transformer encoder, feature fusion pyramid, multi-scale conditional diffusion, and LLM-enhanced semantic reasoning for improved traffic safety applications.
Details
Motivation: Drivers' visual attention is crucial for anticipating hazards and ensuring traffic safety. Current methods need improvement in accurately modeling drivers' perception patterns for intelligent vehicle systems.
Method: Proposes DiffAttn, a diffusion-based framework that formulates attention prediction as conditional diffusion-denoising. Uses Swin Transformer encoder, Feature Fusion Pyramid decoder for cross-layer interaction, multi-scale conditional diffusion for fine-grained context modeling, and incorporates LLM layer for semantic reasoning.
Result: Achieves state-of-the-art performance on four public datasets, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Framework supports interpretable driver-centric scene understanding.
Conclusion: DiffAttn effectively models drivers’ visual attention patterns and has potential applications in in-cabin human-machine interaction, risk perception, and driver state measurement for intelligent vehicles.
Abstract: Drivers’ visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers, where its absence can compromise traffic safety. To emulate drivers’ perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates this task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers’ attention. To capture both local and global scene features, we adopt Swin Transformer as encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion to jointly enhance denoising learning and model fine-grained local and global scene contexts. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and drivers’ state measurement in intelligent vehicles.
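The conditional diffusion-denoising formulation follows the standard DDPM recipe: corrupt the target saliency map with scheduled Gaussian noise, then train a denoiser conditioned on scene features to recover it. A toy forward-noising step is shown below; the linear schedule, step count, and map size are generic illustrative choices, not DiffAttn's actual configuration.

```python
import numpy as np

def q_sample(x0, t, alphas_cumprod, noise):
    """DDPM forward diffusion:
    x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

# Toy linear noise schedule over T steps.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
saliency = rng.random((16, 16))            # ground-truth attention map
noise = rng.standard_normal((16, 16))
x_t = q_sample(saliency, t=50, alphas_cumprod=alphas_cumprod, noise=noise)

# A denoiser eps_theta(x_t, t, scene_features) would then be trained with
# the usual epsilon-prediction objective: mean((eps_theta - noise) ** 2).
assert x_t.shape == saliency.shape
```

DiffAttn's contribution lies in how the denoiser is conditioned (multi-scale features, LLM-enhanced semantics), not in this generic forward process.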
[349] Spatial Orthogonal Refinement for Robust RGB-Event Visual Object Tracking
Dexing Huang, Shiao Wang, Fan Zhang, Xiao Wang
Main category: cs.CV
TL;DR: SOR-Track: A streamlined RGB-Event tracking framework using Spatial Orthogonal Refinement to leverage event camera geometric priors for rectifying motion-blurred RGB features
Details
Motivation: Event cameras offer high temporal resolution and dynamic range advantages over RGB sensors in high-speed motion scenarios, but existing RGB-Event fusion methods fail to explicitly leverage the directional geometric priors in event streams to rectify degraded RGB features affected by motion blur.
Method: Proposes SOR-Track with Spatial Orthogonal Refinement module using orthogonal directional filters dynamically guided by local motion orientations to extract sharp structural responses from event streams, which serve as geometric anchors to modulate and refine aliased RGB textures through asymmetric structural modulation.
Result: Extensive experiments on FE108 benchmark show SOR-Track consistently outperforms existing fusion-based trackers, particularly under motion blur and low-light conditions
Conclusion: Despite its simplicity, the method offers a principled and physics-grounded approach to multi-modal feature alignment and texture rectification for robust visual object tracking
Abstract: Robust visual object tracking (VOT) remains challenging in high-speed motion scenarios, where conventional RGB sensors suffer from severe motion blur and performance degradation. Event cameras, with microsecond temporal resolution and high dynamic range, provide complementary structural cues that can potentially compensate for these limitations. However, existing RGB-Event fusion methods typically treat event data as dense intensity representations and adopt black-box fusion strategies, failing to explicitly leverage the directional geometric priors inherently encoded in event streams to rectify degraded RGB features. To address this limitation, we propose SOR-Track, a streamlined framework for robust RGB-Event tracking based on Spatial Orthogonal Refinement (SOR). The core SOR module employs a set of orthogonal directional filters that are dynamically guided by local motion orientations to extract sharp and motion-consistent structural responses from event streams. These responses serve as geometric anchors to modulate and refine aliased RGB textures through an asymmetric structural modulation mechanism, thereby explicitly bridging structural discrepancies between two modalities. Extensive experiments on the large-scale FE108 benchmark demonstrate that SOR-Track consistently outperforms existing fusion-based trackers, particularly under motion blur and low-light conditions. Despite its simplicity, the proposed method offers a principled and physics-grounded approach to multi-modal feature alignment and texture rectification. The source code of this paper will be released on https://github.com/Event-AHU/OpenEvTracking
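Orientation-guided orthogonal filtering can be sketched with a rotated pair of derivative kernels: the response along the local motion orientation and the response across it. The Sobel kernels and the rotation decomposition below are generic stand-ins for the SOR module's filters, which the abstract does not specify in detail.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d(img, k):
    """'valid' 2-D correlation with a 3x3 kernel (loop version for clarity)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = (img[i:i + 3, j:j + 3] * k).sum()
    return out

def orthogonal_responses(event_frame, theta):
    """Directional-derivative responses along the motion orientation theta
    and along its orthogonal direction (the orthogonal pair)."""
    gx, gy = conv2d(event_frame, SOBEL_X), conv2d(event_frame, SOBEL_Y)
    along = np.cos(theta) * gx + np.sin(theta) * gy
    across = -np.sin(theta) * gx + np.cos(theta) * gy
    return along, across

# A vertical edge responds along theta=0 (horizontal gradient direction) only.
frame = np.zeros((8, 8)); frame[:, 4:] = 1.0
along, across = orthogonal_responses(frame, theta=0.0)
assert np.abs(along).max() > np.abs(across).max()
```

In SOR-Track, responses like `along` would act as the geometric anchors used to modulate blurred RGB features.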
[350] CoPE-VideoLM: Leveraging Codec Primitives For Efficient Video Language Modeling
Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
Main category: cs.CV
TL;DR: CoPE-VideoLM improves video language models by using video codec primitives (motion vectors and residuals) instead of full-frame encoding, reducing computational costs while maintaining or improving performance across diverse video understanding tasks.
Details
Motivation: Current VideoLMs use keyframe sampling which misses both macro-level events and micro-level details due to sparse temporal coverage, and processing full images for each frame incurs substantial computational overhead.
Method: Leverages video codec primitives (motion vectors and residuals) that natively encode video redundancy and sparsity. Introduces lightweight transformer-based encoders to aggregate codec primitives and align their representations with image encoder embeddings through pre-training.
Result: Reduces time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Maintains or exceeds performance on 14 diverse video understanding benchmarks spanning general QA, temporal/motion reasoning, long-form understanding, and spatial scene understanding.
Conclusion: CoPE-VideoLM provides an efficient alternative to standard VideoLMs by exploiting video codec primitives, achieving significant computational savings while preserving or enhancing video understanding capabilities across multiple domains.
Abstract: Video Language Models (VideoLMs) enable AI systems to understand temporal dynamics in videos. To fit within the maximum context window constraint, current methods use keyframe sampling which often misses both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. We address these limitations by leveraging video codec primitives (specifically motion vectors and residuals) which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach, CoPE-VideoLM, reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities we maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal and motion reasoning, long-form understanding, and spatial scene understanding.
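Motion vectors, the codec primitive CoPE-VideoLM consumes, are produced by block matching: a codec stores a small displacement per block instead of raw pixels. A minimal exhaustive-search sketch is below; the block size, search radius, and SAD cost are generic codec-style choices, not parameters from the paper.

```python
import numpy as np

def motion_vector(prev, cur, top, left, block=8, radius=4):
    """Exhaustive block matching: find the displacement (dy, dx) into the
    previous frame that best explains a block of the current frame under
    a sum-of-absolute-differences (SAD) cost."""
    target = cur[top:top + block, left:left + block]
    best, best_mv = np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > prev.shape[0] or x + block > prev.shape[1]:
                continue  # candidate block falls outside the frame
            sad = np.abs(prev[y:y + block, x:x + block] - target).sum()
            if sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv, best

rng = np.random.default_rng(0)
prev = rng.random((32, 32))
cur = np.roll(prev, shift=(0, 2), axis=(0, 1))  # scene shifts 2 px right
mv, residual = motion_vector(prev, cur, top=8, left=8)
assert mv == (0, -2) and residual == 0.0
```

The near-zero residual for pure translation is exactly the redundancy the paper exploits: a motion vector plus a sparse residual is far cheaper to tokenize than a full frame.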
[351] FlashSign: Pose-Free Guidance for Efficient Sign Language Video Generation
Liuzhou Zhang, Zeyu Zhang, Biao Wu, Luyao Tang, Zirui Song, Hongyang He, Renda Han, Guangzhen Yao, Huacan Wang, Ronghao Chen, Xiuying Chen, Guan Huang, Zheng Zhu
Main category: cs.CV
TL;DR: A pose-free diffusion model for real-time sign language video generation that eliminates intermediate pose representations and uses trainable sparsity for 3x speedup.
Details
Motivation: Existing sign language video generation models rely on complex intermediate pose representations, limiting flexibility and efficiency. There's a need for more direct, real-time approaches to bridge communication gaps for deaf and hard-of-hearing communities.
Method: Proposes a pose-free framework using diffusion models to directly map natural language text to sign language videos. Introduces two innovations: 1) pose-free generative model learning implicit text-to-gesture alignments without pose estimation, and 2) Trainable Sliding Tile Attention (T-STA) mechanism that accelerates inference by exploiting spatio-temporal locality patterns with trainable sparsity.
Result: Achieves 3.07x increase in video generation speed without compromising video quality, making real-time deployment feasible. The method eliminates the train-test gap through integrated trainable sparsity.
Conclusion: The approach enables real-time, high-quality, pose-free sign language synthesis, opening new avenues for inclusive communication tools for diverse communities.
Abstract: Sign language plays a crucial role in bridging communication gaps between the deaf and hard-of-hearing communities. However, existing sign language video generation models often rely on complex intermediate representations, which limits their flexibility and efficiency. In this work, we propose a novel pose-free framework for real-time sign language video generation. Our method eliminates the need for intermediate pose representations by directly mapping natural language text to sign language videos using a diffusion-based approach. We introduce two key innovations: (1) a pose-free generative model based on a state-of-the-art diffusion backbone, which learns implicit text-to-gesture alignments without pose estimation, and (2) a Trainable Sliding Tile Attention (T-STA) mechanism that accelerates inference by exploiting spatio-temporal locality patterns. Unlike previous training-free sparsity approaches, T-STA integrates trainable sparsity into both training and inference, ensuring consistency and eliminating the train-test gap. This approach significantly reduces computational overhead while maintaining high generation quality, making real-time deployment feasible. Our method increases video generation speed by 3.07x without compromising video quality. Our contributions open new avenues for real-time, high-quality, pose-free sign language synthesis, with potential applications in inclusive communication tools for diverse communities. Code: https://github.com/AIGeeksGroup/FlashSign.
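A sliding-tile sparsity pattern can be expressed as a block-banded attention mask: each tile of tokens attends only to nearby tiles. The sketch below builds such a mask, with tile size and window width as illustrative parameters; T-STA additionally makes the sparsity pattern trainable, which this fixed mask does not model.

```python
import numpy as np

def sliding_tile_mask(n_tokens, tile=4, window=1):
    """Block-sparse attention mask: queries in tile q attend only to keys
    in tiles within `window` tiles of q (including q's own tile)."""
    n_tiles = (n_tokens + tile - 1) // tile
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    for q in range(n_tokens):
        qt = q // tile
        for kt in range(max(0, qt - window), min(n_tiles, qt + window + 1)):
            mask[q, kt * tile:(kt + 1) * tile] = True
    return mask

m = sliding_tile_mask(16, tile=4, window=1)
assert m[0, 0] and m[0, 7] and not m[0, 8]  # tile 0 sees tiles 0-1 only
assert m.sum() < 16 * 16                    # strictly sparser than dense
```

Because attention cost scales with the number of True entries, widening or narrowing `window` directly trades quality for the kind of speedup the paper reports.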
[352] MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani
Main category: cs.CV
TL;DR: MoD-DPO improves modality grounding in omni-modal LLMs by enforcing modality-aware regularization and language-prior debiasing to reduce cross-modal hallucinations.
Details
Motivation: Omni-modal LLMs suffer from cross-modal hallucinations due to spurious correlations and dominant language priors, despite strong performance on audiovisual understanding tasks.
Method: Proposes Modality-Decoupled Direct Preference Optimization (MoD-DPO) with modality-aware regularization terms: invariance to irrelevant modality corruptions and sensitivity to relevant modality perturbations, plus language-prior debiasing penalty.
Result: Extensive experiments across multiple audiovisual hallucination benchmarks show MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines.
Conclusion: MoD-DPO demonstrates the importance of modality-faithful alignment and provides a scalable path toward more reliable multimodal foundation models by reducing unintended cross-modal interactions.
Abstract: Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
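A hedged sketch of how the described regularizers could attach to a standard DPO objective: an invariance penalty (the chosen response's likelihood should not move when an irrelevant modality is corrupted) and a sensitivity term (it should drop when the relevant modality is perturbed). The penalty forms and the weights `lam_inv`, `lam_sens` are assumptions, not the paper's exact loss.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO objective on (chosen, rejected) log-probabilities,
    with ref_* from a frozen reference model."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -np.log(sigmoid(margin))

def mod_dpo_loss(logp_w, logp_l, ref_w, ref_l,
                 logp_w_irrel_corrupt, logp_w_rel_perturb,
                 lam_inv=1.0, lam_sens=1.0, beta=0.1):
    """Hypothetical modality-decoupled variant: base DPO loss plus an
    invariance penalty (irrelevant-modality corruption should not change
    the chosen response's likelihood) and a sensitivity term (relevant-
    modality perturbation should lower it)."""
    base = dpo_loss(logp_w, logp_l, ref_w, ref_l, beta)
    invariance = (logp_w - logp_w_irrel_corrupt) ** 2
    sensitivity = -np.log(sigmoid(logp_w - logp_w_rel_perturb))
    return base + lam_inv * invariance + lam_sens * sensitivity

loss = mod_dpo_loss(logp_w=-5.0, logp_l=-8.0, ref_w=-6.0, ref_l=-7.0,
                    logp_w_irrel_corrupt=-5.1, logp_w_rel_perturb=-9.0)
assert loss > 0.0
```

The language-prior debiasing penalty the paper describes would add a third term discouraging high likelihood under text-only (no audiovisual) conditioning.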
[353] ForestSim: A Synthetic Benchmark for Intelligent Vehicle Perception in Unstructured Forest Environments
Pragat Wagle, Zheng Chen, Lantao Liu
Main category: cs.CV
TL;DR: ForestSim: A synthetic dataset for semantic segmentation in unstructured forest environments to support intelligent off-road vehicle perception.
Details
Motivation: Lack of pixel-accurate annotated datasets for unstructured wild environments (forests, off-road) hinders development of perception systems for intelligent ground vehicles in applications like forestry automation, agricultural robotics, and disaster response.
Method: Created a high-fidelity synthetic dataset using Unreal Engine environments integrated with Microsoft AirSim, generating 2094 photorealistic images across 25 diverse environments with consistent pixel-accurate labels for 20 classes relevant to autonomous navigation.
Result: Benchmarked ForestSim using state-of-the-art architectures and reported strong performance despite the inherent challenges of unstructured scenes, providing a scalable foundation for perception research.
Conclusion: ForestSim addresses the dataset scarcity for unstructured environments and supports development of perception systems for next-generation intelligent off-road vehicles.
Abstract: Robust scene understanding is essential for intelligent vehicles operating in natural, unstructured environments. While semantic segmentation datasets for structured urban driving are abundant, the datasets for extremely unstructured wild environments remain scarce due to the difficulty and cost of generating pixel-accurate annotations. These limitations hinder the development of perception systems needed for intelligent ground vehicles tasked with forestry automation, agricultural robotics, disaster response, and all-terrain mobility. To address this gap, we present ForestSim, a high-fidelity synthetic dataset designed for training and evaluating semantic segmentation models for intelligent vehicles in forested off-road and no-road environments. ForestSim contains 2094 photorealistic images across 25 diverse environments, covering multiple seasons, terrain types, and foliage densities. Using Unreal Engine environments integrated with Microsoft AirSim, we generate consistent, pixel-accurate labels across 20 classes relevant to autonomous navigation. We benchmark ForestSim using state-of-the-art architectures and report strong performance despite the inherent challenges of unstructured scenes. ForestSim provides a scalable and accessible foundation for perception research supporting the next generation of intelligent off-road vehicles. The dataset and code are publicly available: Dataset: https://vailforestsim.github.io Code: https://github.com/pragatwagle/ForestSim
[354] Integrating Multimodal Large Language Model Knowledge into Amodal Completion
Heecheol Yun, Eunho Yang
Main category: cs.CV
TL;DR: AmodalCG: A framework using Multimodal Large Language Models (MLLMs) to guide amodal completion by assessing occlusion extent, reasoning about missing regions, and refining results with visual generative models.
Details
Motivation: Amodal completion is crucial for autonomous vehicles and robotics but requires real-world physical knowledge. Existing approaches either lack this knowledge or don't explicitly guide the completion process with it.
Method: Proposes AmodalCG framework that: 1) assesses occlusion extent to selectively invoke MLLM guidance, 2) uses MLLMs to reason about both extent and content of missing regions, and 3) integrates guidance with visual generative models to iteratively refine completions.
Result: Experimental results on various real-world images show impressive improvements compared to all existing works, suggesting MLLMs as a promising direction for challenging amodal completion.
Conclusion: MLLMs can effectively provide real-world knowledge to guide amodal completion, addressing limitations of existing approaches and showing significant performance improvements.
Abstract: With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion to selectively invoke MLLM guidance only when the target object is heavily occluded. If guidance is required, the framework further incorporates MLLMs to reason about both the (1) extent and (2) content of the missing regions. Finally, a visual generative model integrates these guidance and iteratively refines imperfect completions that may arise from inaccurate MLLM guidance. Experimental results on various real-world images show impressive improvements compared to all existing works, suggesting MLLMs as a promising direction for addressing challenging amodal completion.
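The occlusion-extent gate in step (1) reduces to a mask-ratio threshold: compare the visible mask against an estimate of the object's full extent and invoke the expensive MLLM only above a cutoff. A minimal sketch, with the threshold value and mask source as illustrative assumptions:

```python
import numpy as np

def occlusion_ratio(visible_mask, amodal_mask):
    """Fraction of the (estimated) full object extent that is occluded."""
    full = amodal_mask.sum()
    return 1.0 - visible_mask.sum() / full if full else 0.0

def needs_mllm_guidance(visible_mask, amodal_mask, threshold=0.4):
    """Gate: invoke (costly) MLLM reasoning only when the object is
    heavily occluded; lightly occluded objects go straight to the
    generative completion model."""
    return occlusion_ratio(visible_mask, amodal_mask) > threshold

amodal = np.ones((10, 10), dtype=bool)      # estimated full object extent
visible = np.zeros((10, 10), dtype=bool)
visible[:, :3] = True                        # only 30% of the object visible
assert needs_mllm_guidance(visible, amodal)  # 70% occluded > 40% threshold
```

This kind of gating keeps average latency low, since most objects in typical scenes are only lightly occluded.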
[355] Image Generation Models: A Technical History
Rouzbeh Shirvani
Main category: cs.CV
TL;DR: Comprehensive survey of image generation models covering VAEs, GANs, normalizing flows, autoregressive/transformer models, and diffusion methods, with extensions to video generation and responsible deployment considerations.
Details
Motivation: The literature on image generation is fragmented across different models and application domains, creating a need for a unified survey that comprehensively covers breakthrough models, their technical details, and emerging challenges in responsible deployment.
Method: Survey methodology that provides detailed technical walkthroughs of each model type (VAEs, GANs, normalizing flows, autoregressive/transformer generators, diffusion methods), covering their objectives, architectural components, training algorithms, optimization techniques, and limitations.
Result: A comprehensive overview of the state-of-the-art in image generation, including recent developments in video generation (from still frames to high-quality videos) and coverage of robustness and responsible deployment issues like deepfake risks, detection, artifacts, and watermarking.
Conclusion: Image generation has advanced rapidly with diverse model families, and the field now faces important challenges in extending to video generation and ensuring responsible deployment through robustness measures and ethical considerations.
Abstract: Image generation has advanced rapidly over the past decade, yet the literature seems fragmented across different models and application domains. This paper aims to offer a comprehensive survey of breakthrough image generation models, including variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, autoregressive and transformer-based generators, and diffusion-based methods. We provide a detailed technical walkthrough of each model type, including their underlying objectives, architectural building blocks, and algorithmic training steps. For each model type, we present the optimization techniques as well as common failure modes and limitations. We also go over recent developments in video generation and present the research works that made it possible to go from still frames to high quality videos. Lastly, we cover the growing importance of robustness and responsible deployment of these models, including deepfake risks, detection, artifacts, and watermarking.
[356] A Cross-Scale Decoder with Token Refinement for Off-Road Semantic Segmentation
Seongkyu Choi, Jhonghyun An
Main category: cs.CV
TL;DR: A cross-scale decoder for off-road semantic segmentation that addresses annotation ambiguity and boundary uncertainty through global-local token refinement, gated detail bridging, and uncertainty-guided point refinement.
Details
Motivation: Off-road semantic segmentation faces challenges from irregular terrain, vegetation clutter, and annotation ambiguity, with uncertain transition regions and rare/thin structures receiving unreliable supervision. Existing decoders either oversmooth details or amplify noise through repeated feature fusion.
Method: Proposes a cross-scale decoder with three mechanisms: 1) Global-local token refinement on a compact bottleneck lattice with boundary-aware regularization, 2) Gated detail bridge that selectively injects fine-scale structural cues once via cross-scale attention, 3) Uncertainty-guided class-aware point refinement that updates least reliable pixels.
Result: Achieves noise-robust and boundary-preserving segmentation tailored to off-road environments, recovering fine structural details while maintaining deployment-friendly efficiency. Shows consistent improvements on standard off-road benchmarks without heavy dense feature fusion.
Conclusion: The framework effectively addresses off-road segmentation challenges through complementary mechanisms that handle annotation ambiguity, preserve boundaries, and refine uncertain regions with minimal computational overhead.
Abstract: Off-road semantic segmentation is fundamentally challenged by irregular terrain, vegetation clutter, and inherent annotation ambiguity. Unlike urban scenes with crisp object boundaries, off-road environments exhibit strong class-level similarity among terrain categories, resulting in thick and uncertain transition regions that degrade boundary coherence and destabilize training. Rare or thin structures, such as narrow traversable gaps or isolated obstacles, further receive sparse and unreliable supervision and are easily overwhelmed by dominant background textures. Existing decoder designs either rely on low-scale bottlenecks that oversmooth fine structural details, or repeatedly fuse high-detail features, which tends to amplify annotation noise and incur substantial computational cost. We present a cross-scale decoder that explicitly addresses these challenges through three complementary mechanisms. First, a global–local token refinement module consolidates semantic context on a compact bottleneck lattice, guided by boundary-aware regularization to remain robust under ambiguous supervision. Second, a gated detail bridge selectively injects fine-scale structural cues only once through cross-scale attention, preserving boundary and texture information while avoiding noise accumulation. Third, an uncertainty-guided class-aware point refinement selectively updates the least reliable pixels, improving rare and ambiguous structures with minimal computational overhead. The resulting framework achieves noise-robust and boundary-preserving segmentation tailored to off-road environments, recovering fine structural details while maintaining deployment-friendly efficiency. Experimental results on standard off-road benchmarks demonstrate consistent improvements over prior approaches without resorting to heavy dense feature fusion.
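The uncertainty-guided point refinement idea can be sketched as a confidence-margin selection over pixels; the margin rule and all names below are illustrative assumptions, not the paper's exact criterion:

```python
import numpy as np

def select_uncertain_points(probs, k):
    """Pick the k least reliable pixels by their top-1 vs. top-2 margin."""
    ranked = np.sort(probs, axis=1)
    margin = ranked[:, -1] - ranked[:, -2]   # small gap = ambiguous pixel
    return np.argsort(margin)[:k]            # indices of the k most uncertain

# Toy map: 4 pixels, 3 classes (softmax scores per pixel)
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],        # ambiguous
                  [0.80, 0.10, 0.10],
                  [0.50, 0.45, 0.05]])       # ambiguous
idx = select_uncertain_points(probs, k=2)
print(sorted(idx.tolist()))                  # -> [1, 3]
```

In the paper's framework, the selected pixels would then be re-predicted by the class-aware refinement head rather than left unchanged.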
[357] RehearsalNeRF: Decoupling Intrinsic Neural Fields of Dynamic Illuminations for Scene Editing
Changyeon Won, Hyunjun Jung, Jungu Cho, Seonmi Park, Chi-Hoon Lee, Hae-Gon Jeon
Main category: cs.CV
TL;DR: RehearsalNeRF disentangles scene radiance from dynamic illumination changes using rehearsal stage data and lighting vectors, enabling robust novel view synthesis and scene editing under varying lighting conditions.
Details
Motivation: Current neural radiance fields struggle with dynamic illumination changes where subjects' radiance is entangled with emitted radiance and lighting colors in spatio-temporal domain. There's a need for effective disentanglement methods under severe illumination variations.
Method: Uses rehearsal stage scenes captured under stable lighting as geometric consistency reference. Employs learnable lighting vectors to represent illumination colors temporally and disentangle projected light colors from scene radiance. Incorporates optical flow regularization with off-the-shelf interactive masks for dynamic object reconstruction.
Result: Demonstrates robust performance on novel view synthesis and scene editing under dynamic illumination conditions. Shows effective disentanglement of scene components from lighting variations.
Conclusion: RehearsalNeRF provides an effective solution for learning disentangled neural fields under severe illumination changes by leveraging rehearsal stage data and lighting vector representations.
Abstract: Although there has been significant progress in neural radiance fields, the issue of dynamic illumination changes remains unsolved. Different from relevant works that parameterize time-variant/-invariant components in scenes, subjects’ radiance is highly entangled with their own emitted radiance and lighting colors in the spatio-temporal domain. In this paper, we present a new effective method to learn disentangled neural fields under severe illumination changes, named RehearsalNeRF. Our key idea is to leverage scenes captured under stable lighting like rehearsal stages, easily taken before dynamic illumination occurs, to enforce geometric consistency between the different lighting conditions. In particular, RehearsalNeRF employs a learnable vector for lighting effects which represents illumination colors in a temporal dimension and is used to disentangle projected light colors from scene radiance. Furthermore, our RehearsalNeRF is also able to reconstruct the neural fields of dynamic objects by simply adopting off-the-shelf interactive masks. To decouple the dynamic objects, we propose a new regularization leveraging optical flow, which provides coarse supervision for the color disentanglement. We demonstrate the effectiveness of RehearsalNeRF by showing robust performance on novel view synthesis and scene editing under dynamic illumination conditions. Our source code and video datasets will be publicly available.
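A minimal sketch of the lighting-vector disentanglement, assuming a multiplicative model in which the observed color is time-invariant radiance modulated by a per-frame lighting color (the model form, values, and names are illustrative, not the paper's exact formulation):

```python
import numpy as np

radiance = np.array([0.8, 0.4, 0.2])          # time-invariant RGB scene radiance
light = {0: np.array([1.0, 1.0, 1.0]),        # rehearsal stage: stable white light
         1: np.array([0.5, 0.2, 1.0])}        # performance: blue-tinted projection

def render(t):
    return radiance * light[t]                # entangled observation at time t

def disentangle(obs, t):
    return obs / light[t]                     # recover radiance given light vector

obs = render(1)
rec = disentangle(obs, 1)
print(np.allclose(rec, radiance))             # -> True
```

The rehearsal-stage frames (t = 0, stable lighting) are what anchor the time-invariant component during training.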
[358] MathGen: Revealing the Illusion of Mathematical Competence through Text-to-Image Generation
Ruiyao Liu, Hui Shen, Ping Zhang, Yunta Hsieh, Yifan Zhang, Jing Xu, Sicheng Chen, Junchen Li, Jiawei Lu, Jianing Ma, Jiaqi Mo, Qi Han, Zhen Zhang, Zhongwei Wan, Jing Xiong, Xin Wang, Ziyuan Liu, Hangrui Cao, Ngai Wong
Main category: cs.CV
TL;DR: MathGen benchmark evaluates text-to-image models’ ability to generate mathematically correct visual representations across 7 domains, finding current models perform poorly with best closed-source model at 42% accuracy.
Details
Motivation: While generative models can solve mathematical problems, real-world applications often require visual representations (diagrams, plots, geometric constructions) where correctness depends on precise visual composition. The paper aims to study whether generative models can render mathematically correct visual answers rather than just text solutions.
Method: Introduced MathGen benchmark with 900 problems spanning 7 core mathematical domains, each paired with an executable verifier using Script-as-a-Judge protocol for deterministic and objective evaluation. Tested representative open-source and proprietary text-to-image models.
Result: Mathematical fidelity remains a major bottleneck: best closed-source model reached only 42.0% overall accuracy, while open-source models achieved just ~1-11%, often near 0% on structured tasks. Current T2I models remain far from competent at elementary mathematical visual generation.
Conclusion: Current text-to-image models struggle significantly with generating mathematically correct visual representations, indicating a substantial gap in their ability to handle precise mathematical visual generation tasks.
Abstract: Modern generative models have demonstrated the ability to solve challenging mathematical problems. In many real-world settings, however, mathematical solutions must be expressed visually through diagrams, plots, geometric constructions, and structured symbolic layouts, where correctness depends on precise visual composition. Can generative models still do so when the answer must be rendered visually rather than written in text? To study this problem, we introduce MathGen, a rigorous benchmark of 900 problems spanning seven core domains, each paired with an executable verifier under a Script-as-a-Judge protocol for deterministic and objective evaluation. Experiments on representative open-source and proprietary text-to-image models show that mathematical fidelity remains a major bottleneck: even the best closed-source model reaches only 42.0% overall accuracy, while open-source models achieve just ~1-11%, often near 0% on structured tasks. Overall, current T2I models remain far from competent at even elementary mathematical visual generation.
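The Script-as-a-Judge idea, where an executable script deterministically checks a mathematical property instead of relying on subjective judgment, might look like this hypothetical verifier (the task, function, and tolerance are invented for illustration; they are not from the benchmark):

```python
import math

def judge_equilateral(pts, tol=1e-6):
    """Return True iff three 2D points form an equilateral triangle."""
    d = [math.dist(pts[i], pts[(i + 1) % 3]) for i in range(3)]
    return max(d) - min(d) < tol             # all side lengths (nearly) equal

# Geometry extracted from a "generated" image: one correct, one wrong
good = [(0, 0), (1, 0), (0.5, math.sqrt(3) / 2)]
bad = [(0, 0), (1, 0), (0, 1)]
print(judge_equilateral(good), judge_equilateral(bad))  # -> True False
```

Because the check is a deterministic script rather than a model-based judge, the same submission always receives the same verdict.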
[359] ExFusion: Efficient Transformer Training via Multi-Experts Fusion
Jiacheng Ruan, Daize Dong, Xiaoye Qu, Tong Zhu, Ting Liu, Yuzhuo Fu, Yu Cheng, Suncheng Xiang
Main category: cs.CV
TL;DR: ExFusion is a novel pre-training approach that upcycles Transformer FFNs into multi-expert configurations with fusion weights, enabling MoE-like training benefits with minimal computational overhead and no deployment costs.
Details
Motivation: MoE models improve performance but require substantial computational resources and introduce parameter storage/deployment overhead. There's a need for methods that leverage multi-expert capabilities while minimizing additional costs.
Method: ExFusion upcycles Transformer FFNs into multi-expert configurations during initialization, assigning each expert a fusion weight. During training, these weights fuse multiple experts into a single unified expert equivalent to the original FFN for forward computation. After training, learned weights integrate multi-experts into a single expert.
Result: Extensive experiments on computer vision and NLP tasks demonstrate effectiveness. The method introduces multi-expert characteristics with only marginal computational cost compared to standard dense training, and eliminates storage/deployment overhead.
Conclusion: ExFusion provides an efficient approach to leverage multi-expert capabilities in Transformers without the computational and deployment costs typically associated with MoE models.
Abstract: Mixture-of-Experts (MoE) models substantially improve performance by increasing the capacity of dense architectures. However, directly training MoE models requires considerable computational resources and introduces extra overhead in parameter storage and deployment. Therefore, it is critical to develop an approach that leverages the multi-expert capability of MoE to enhance performance while incurring minimal additional cost. To this end, we propose a novel pre-training approach, termed ExFusion, which improves the efficiency of Transformer training through multi-expert fusion. Specifically, during the initialization phase, ExFusion upcycles the feed-forward network (FFN) of the Transformer into a multi-expert configuration, where each expert is assigned a weight for later parameter fusion. During training, these weights allow multiple experts to be fused into a single unified expert equivalent to the original FFN, which is subsequently used for forward computation. As a result, ExFusion introduces multi-expert characteristics into the training process while incurring only marginal computational cost compared to standard dense training. After training, the learned weights are used to integrate multi-experts into a single unified expert, thereby eliminating additional overhead in storage and deployment. Extensive experiments on a variety of computer vision and natural language processing tasks demonstrate the effectiveness of the proposed method.
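The fusion step can be illustrated with plain weight matrices: at initialization the upcycled experts are copies of the original FFN, so a learned convex combination of their parameters collapses back to a single equivalent expert. This is our reading of the mechanism as a sketch; the variable names and the convex-combination form are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W_ffn = rng.standard_normal((4, 4))                  # original FFN weight

n_experts = 3
experts = [W_ffn.copy() for _ in range(n_experts)]   # upcycled from one FFN
alpha = np.array([0.2, 0.5, 0.3])                    # learnable fusion weights

W_fused = sum(a * W for a, W in zip(alpha, experts)) # single unified expert

# At initialization the experts are identical and the weights sum to 1,
# so the fused expert reproduces the original FFN exactly.
print(np.allclose(W_fused, W_ffn))                   # -> True
```

During training the expert copies diverge, and the same weighted fusion is what keeps the forward pass at the cost of a single FFN.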
[360] EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation
Sravanth Kodavanti, Manjunath Arveti, Sowmya Vajrala, Srinivas Miriyala, Vikram N R
Main category: cs.CV
TL;DR: EdgeDiT: Hardware-efficient diffusion transformers optimized for mobile NPUs, achieving 20-30% parameter reduction and 1.65x faster on-device latency while maintaining image quality.
Details
Motivation: Diffusion Transformers (DiT) achieve state-of-the-art image synthesis but have massive computational complexity that prevents deployment on resource-constrained edge devices like mobile phones with NPUs.
Method: Hardware-aware optimization framework that systematically identifies and prunes structural redundancies in DiT backbones that are particularly taxing for mobile data-flows, creating lightweight models for mobile NPUs like Qualcomm Hexagon and Apple Neural Engine.
Result: Achieves 20-30% reduction in parameters, 36-46% decrease in FLOPs, and 1.65-fold reduction in on-device latency while maintaining scaling advantages and expressive capacity. Offers superior Pareto-optimal trade-off between FID and inference latency compared to optimized mobile U-Nets and vanilla DiT variants.
Conclusion: EdgeDiT enables responsive, private, and offline generative AI directly on-device, providing a scalable blueprint for transitioning large-scale foundation models from high-end GPUs to mobile devices.
Abstract: Diffusion Transformers (DiT) have established a new state-of-the-art in high-fidelity image synthesis; however, their massive computational complexity and memory requirements hinder local deployment on resource-constrained edge devices. In this paper, we introduce EdgeDiT, a family of hardware-efficient generative transformers specifically engineered for mobile Neural Processing Units (NPUs), such as the Qualcomm Hexagon and Apple Neural Engine (ANE). By leveraging a hardware-aware optimization framework, we systematically identify and prune structural redundancies within the DiT backbone that are particularly taxing for mobile data-flows. Our approach yields a series of lightweight models that achieve a 20-30% reduction in parameters, a 36-46% decrease in FLOPs, and a 1.65-fold reduction in on-device latency without sacrificing the scaling advantages or the expressive capacity of the original transformer architecture. Extensive benchmarking demonstrates that EdgeDiT offers a superior Pareto-optimal trade-off between Frechet Inception Distance (FID) and inference latency compared to both optimized mobile U-Nets and vanilla DiT variants. By enabling responsive, private, and offline generative AI directly on-device, EdgeDiT provides a scalable blueprint for transitioning large-scale foundation models from high-end GPUs to the palm of the user.
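Structured pruning of the kind the paper applies can be sketched as dropping low-importance output channels of a weight matrix; the L1-norm criterion below is a generic stand-in for the paper's hardware-aware criterion:

```python
import numpy as np

def prune_channels(W, keep_ratio):
    """Keep only the highest-importance output channels (L1-norm proxy)."""
    scores = np.abs(W).sum(axis=1)                 # per-channel importance
    k = int(W.shape[0] * keep_ratio)
    keep = np.sort(np.argsort(scores)[-k:])        # channels to retain, in order
    return W[keep]

W = np.arange(1.0, 13.0).reshape(6, 2)             # toy layer: 6 output channels
Wp = prune_channels(W, keep_ratio=0.5)
print(Wp.shape, 1 - Wp.size / W.size)              # -> (3, 2) 0.5
```

The reported 20-30% parameter reduction corresponds to applying this kind of structural removal across the DiT backbone with a criterion tuned to NPU data-flow costs.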
[361] Learning Multi-View Spatial Reasoning from Cross-View Relations
Suchae Jeong, Jaehwi Song, Haeone Lee, Hanna Kim, Jian Kim, Dongjun Lee, Dong Kyu Shin, Changyeon Kim, Dongyoon Hahm, Woogyeol Jin, Juheon Choi, Kimin Lee
Main category: cs.CV
TL;DR: XVR dataset teaches vision-language models spatial reasoning across multiple views using 100K samples from 3D scenes and robotic trajectories, improving multi-view understanding and robotic manipulation performance.
Details
Motivation: Current vision-language models lack multi-view spatial reasoning capabilities needed for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints, limiting their effectiveness in real-world robotic applications.
Method: Created Cross-View Relations (XVR) dataset with 100K vision-question-answer samples from 18K diverse 3D scenes and 70K robotic manipulation trajectories, covering three spatial reasoning tasks: Correspondence, Verification, and Localization. Fine-tuned VLMs on this dataset.
Result: VLMs fine-tuned on XVR achieved substantial improvements on multi-view and robotic spatial reasoning benchmarks (MindCube and RoboSpatial). When integrated into Vision-Language-Action models, XVR-trained representations improved success rates on RoboCasa robotic manipulation tasks.
Conclusion: Explicit training on cross-view spatial relations significantly enhances multi-view reasoning capabilities in vision-language models and transfers effectively to real-world robotic manipulation, bridging the gap between 2D vision understanding and 3D spatial reasoning.
Abstract: Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we introduce Cross-View Relations (XVR), a large-scale dataset designed to teach VLMs spatial reasoning across multiple views. XVR comprises 100K vision-question-answer samples derived from 18K diverse 3D scenes and 70K robotic manipulation trajectories, spanning three fundamental spatial reasoning tasks: Correspondence (matching objects across views), Verification (validating spatial relationships), and Localization (identifying object positions). VLMs fine-tuned on XVR achieve substantial improvements on established multi-view and robotic spatial reasoning benchmarks (MindCube and RoboSpatial). When integrated as backbones in Vision-Language-Action models, XVR-trained representations improve success rates on RoboCasa. Our results demonstrate that explicit training on cross-view spatial relations significantly enhances multi-view reasoning and transfers effectively to real-world robotic manipulation.
[362] Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs
Pei An, Junfeng Ding, Jiaqi Yang, Yulong Wang, Jie Ma, Liangliang Nan
Main category: cs.CV
TL;DR: Hg-I2P: A heterogeneous graph-based method for image-to-point-cloud registration that addresses modality gap by refining cross-modal features and correspondences through graph representation and consistency modeling.
Details
Motivation: The modality gap between 2D images and 3D point clouds makes it challenging to learn discriminative and generalizable features for registration, leading to performance drops in unseen scenarios.
Method: Proposes a heterogeneous graph that represents mapping between segmented 2D and 3D regions, enabling cross-modal feature interaction and correspondence pruning. The method mines multi-path feature relationships, adapts features under heterogeneous edge guidance, and prunes correspondences using graph-based projection consistency.
Result: Experiments on six indoor and outdoor benchmarks under cross-domain setups show Hg-I2P significantly outperforms existing methods in both generalization and accuracy.
Conclusion: The heterogeneous graph approach effectively bridges the modality gap in I2P registration, improving feature discriminability and correspondence reliability for better generalization across domains.
Abstract: Image-to-point-cloud (I2P) registration aims to align 2D images with 3D point clouds by establishing reliable 2D-3D correspondences. The drastic modality gap between images and point clouds makes it challenging to learn features that are both discriminative and generalizable, leading to severe performance drops in unseen scenarios. We address this challenge by introducing a heterogeneous graph that enables refining both cross-modal features and correspondences within a unified architecture. The proposed graph represents a mapping between segmented 2D and 3D regions, which enhances cross-modal feature interaction and thus improves feature discriminability. In addition, modeling the consistency among vertices and edges within the graph enables pruning of unreliable correspondences. Building on these insights, we propose a heterogeneous graph embedded I2P registration method, termed Hg-I2P. It learns a heterogeneous graph by mining multi-path feature relationships, adapts features under the guidance of heterogeneous edges, and prunes correspondences using graph-based projection consistency. Experiments on six indoor and outdoor benchmarks under cross-domain setups demonstrate that Hg-I2P significantly outperforms existing methods in both generalization and accuracy. Code is released on https://github.com/anpei96/hg-i2p-demo.
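The projection-consistency pruning idea can be sketched with a toy pinhole camera: correspondences whose 3D point reprojects far from its paired 2D point are discarded. The intrinsics, threshold, and per-pair check are illustrative; the paper performs this consistency check over graph vertices and edges:

```python
import numpy as np

K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])                    # toy camera intrinsics

def project(p3d):
    """Pinhole projection of a 3D point (camera frame) to pixel coordinates."""
    uvw = K @ p3d
    return uvw[:2] / uvw[2]

def prune(corrs, thresh=2.0):
    """Keep (2D point, 3D point) pairs with small reprojection error."""
    return [(uv, p) for uv, p in corrs
            if np.linalg.norm(project(p) - uv) < thresh]

good = (project(np.array([0.1, 0.2, 1.0])), np.array([0.1, 0.2, 1.0]))
bad = (np.array([0.0, 0.0]), np.array([0.5, 0.5, 1.0]))  # far off in pixels
kept = prune([good, bad])
print(len(kept))                                   # -> 1
```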
[363] AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers
Nghia Vu, Tuong Do, Khang Nguyen, Baoru Huang, Nhat Le, Binh Xuan Nguyen, Erman Tjiputra, Quang D. Tran, Ravi Prakash, Te-Chuan Chiu, Anh Nguyen
Main category: cs.CV
TL;DR: AffordBridge dataset with 291K functional interaction annotations across 685 indoor scenes in point clouds, plus AffordMatcher method for cross-modal affordance learning between images and point clouds.
Details
Motivation: Existing affordance learning focuses on object geometry and visual knowledge, but extending to scene-level understanding is complex due to difficulty incorporating object- and scene-level semantics.
Method: Created AffordBridge dataset with point cloud annotations and linked RGB images, then proposed AffordMatcher method that establishes semantic correspondences between image-based and point cloud-based instances for keypoint matching to identify affordance regions using visual signifiers.
Result: Experimental results demonstrate effectiveness compared to other methods on the new dataset.
Conclusion: The work provides a large-scale dataset and method for scene-level affordance learning through cross-modal matching between images and point clouds.
Abstract: Affordance learning is a complex challenge in many applications, where existing approaches primarily focus on the geometric structures, visual knowledge, and affordance labels of objects to determine interactable regions. However, extending this learning capability to a scene is significantly more complicated, as incorporating object- and scene-level semantics is not straightforward. In this work, we introduce AffordBridge, a large-scale dataset with 291,637 functional interaction annotations across 685 high-resolution indoor scenes in the form of point clouds. Our affordance annotations are complemented by RGB images that are linked to the same instances within the scenes. Building upon our dataset, we propose AffordMatcher, an affordance learning method that establishes coherent semantic correspondences between image-based and point cloud-based instances for keypoint matching, enabling more precise identification of affordance regions based on cues known as visual signifiers. Experimental results on our dataset demonstrate the effectiveness of our approach compared to other methods.
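Cross-modal keypoint matching between image and point-cloud instances can be sketched with mutual nearest neighbors in a shared feature space (a standard matcher used here for illustration, not necessarily the paper's exact scheme):

```python
import numpy as np

img = np.array([[1.0, 0.0], [0.0, 1.0]])             # image keypoint features
pcd = np.array([[0.0, 0.9], [0.9, 0.1], [0.5, 0.5]]) # point-cloud features

sim = img @ pcd.T                                    # similarity matrix
fwd = sim.argmax(axis=1)                             # image -> point cloud
bwd = sim.argmax(axis=0)                             # point cloud -> image
# A pair is a match only when it is each other's nearest neighbor.
matches = [(i, int(j)) for i, j in enumerate(fwd) if bwd[j] == i]
print(matches)                                       # -> [(0, 1), (1, 0)]
```

Note the third point-cloud feature is left unmatched: mutual-NN filtering is what suppresses such ambiguous pairs.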
[364] GeoHCC: Local Geometry-Aware Hierarchical Context Compression for 3D Gaussian Splatting
Xuan Deng, Xiandong Meng, Hengyu Man, Qiang Zhu, Tiange Zhang, Debin Zhao, Xiaopeng Fan
Main category: cs.CV
TL;DR: GeoHCC is a geometry-aware compression framework for 3D Gaussian Splatting that improves compression by incorporating geometric dependencies through neighborhood-aware anchor pruning and hierarchical entropy coding.
Details
Motivation: 3D Gaussian Splatting enables high-fidelity real-time rendering but has prohibitive storage overhead. Existing anchor-based compression schemes overlook explicit geometric dependencies, leading to structural degradation and suboptimal rate-distortion performance.
Method: Two main components: 1) Neighborhood-Aware Anchor Pruning (NAAP) evaluates anchor importance via weighted neighborhood feature aggregation and merges redundant anchors into salient neighbors. 2) Hierarchical entropy coding scheme with lightweight Geometry-Guided Convolution (GG-Conv) operator for spatially adaptive context modeling and rate-distortion optimization.
Result: Extensive experiments demonstrate that GeoHCC effectively resolves the structure preservation bottleneck, maintaining superior geometric integrity and rendering fidelity over state-of-the-art anchor-based approaches.
Conclusion: GeoHCC provides a geometry-aware compression framework for 3DGS that achieves better compression while preserving geometric structure and rendering quality compared to existing methods.
Abstract: Although 3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, its prohibitive storage overhead severely hinders practical deployment. Recent anchor-based 3DGS compression schemes reduce redundancy through context modeling, yet overlook explicit geometric dependencies, leading to structural degradation and suboptimal rate-distortion performance. In this paper, we propose GeoHCC, a geometry-aware 3DGS compression framework that incorporates inter-anchor geometric correlations into anchor pruning and entropy coding for compact representation. We first introduce Neighborhood-Aware Anchor Pruning (NAAP), which evaluates anchor importance via weighted neighborhood feature aggregation and merges redundant anchors into salient neighbors, yielding a compact yet geometry-consistent anchor set. Building upon this optimized structure, we further develop a hierarchical entropy coding scheme, in which coarse-to-fine priors are exploited through a lightweight Geometry-Guided Convolution (GG-Conv) operator to enable spatially adaptive context modeling and rate-distortion optimization. Extensive experiments demonstrate that GeoHCC effectively resolves the structure preservation bottleneck, maintaining superior geometric integrity and rendering fidelity over state-of-the-art anchor-based approaches.
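The neighborhood-aware pruning idea can be sketched as scoring each anchor's redundancy against its neighbors and merging the most redundant one into its near-duplicate; the distance-damped cosine score and averaging merge below are illustrative assumptions, not the paper's exact weighted aggregation:

```python
import numpy as np

pos = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])    # anchor positions
feat = np.array([[1.0, 1.0], [1.0, 1.0], [4.0, -4.0]])  # anchor features

def redundancy(i):
    """Highest distance-damped feature similarity to any other anchor."""
    best = 0.0
    for j in range(len(pos)):
        if j != i:
            cos = feat[i] @ feat[j] / (np.linalg.norm(feat[i]) * np.linalg.norm(feat[j]))
            best = max(best, cos / (1.0 + np.linalg.norm(pos[i] - pos[j])))
    return best

victim = int(np.argmax([redundancy(i) for i in range(len(pos))]))
keep = 1 - victim                             # its near-duplicate twin in this toy
feat[keep] = (feat[victim] + feat[keep]) / 2  # merge redundant anchor into neighbor
print(victim in (0, 1))                       # isolated anchor 2 survives -> True
```

Anchors 0 and 1 sit close together with identical features, so one of them is absorbed, while the geometrically distinct anchor 2 is preserved.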
[365] RetinexDualV2: Physically-Grounded Dual Retinex for Generalized UHD Image Restoration
Mohab Kishawy, Jun Chen
Main category: cs.CV
TL;DR: RetinexDualV2 is a unified dual-branch framework for UHD image restoration that uses task-specific physical grounding modules and physical-conditioned attention mechanisms to handle various degradations without architectural changes.
Details
Motivation: The paper aims to address the limitations of generic image restoration models by creating a unified framework that can handle diverse UHD image degradations (like raindrop removal and low-light enhancement) through physically grounded priors rather than task-specific architectures.
Method: Proposes a dual-branch framework with Task-Specific Physical Grounding Module (TS-PGM) to extract degradation-aware priors (rain masks, dark channels), which guide a Retinex decomposition network via Physical-conditioned Multi-head Self-Attention (PC-MSA) mechanism for robust reflection and illumination correction.
Result: Achieved 4th place in NTIRE 2026 Day and Night Raindrop Removal Challenge and 5th place in Joint Noise Low-light Enhancement Challenge, demonstrating state-of-the-art performance and exceptional generalizability across different restoration tasks.
Conclusion: RetinexDualV2 provides an effective physically motivated approach for diverse UHD image restoration that maintains a single architecture while handling various complex degradations through explicit physical priors and conditioning mechanisms.
Abstract: We propose RetinexDualV2, a unified, physically grounded dual-branch framework for diverse Ultra-High-Definition (UHD) image restoration. Unlike generic models, our method employs a Task-Specific Physical Grounding Module (TS-PGM) to extract degradation-aware priors (e.g., rain masks and dark channels). These explicitly guide a Retinex decomposition network via a novel Physical-conditioned Multi-head Self-Attention (PC-MSA) mechanism, enabling robust reflection and illumination correction. This physical conditioning allows a single architecture to handle various complex degradations seamlessly, without task-specific structural modifications. RetinexDualV2 demonstrates exceptional generalizability, securing 4th place in the NTIRE 2026 Day and Night Raindrop Removal Challenge and 5th place in the Joint Noise Low-light Enhancement (JNLLIE) Challenge. Extensive experiments confirm the state-of-the-art performance and efficiency of our physically motivated approach.
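The Retinex model at the core of the method treats an observed image as the pixel-wise product of reflectance (scene content) and illumination, so restoration amounts to correcting the illumination map while preserving reflectance. Toy values below; in the method the decomposition itself is learned:

```python
import numpy as np

I = np.array([[0.10, 0.20],
              [0.05, 0.40]])                 # dark observation
L = np.array([[0.25, 0.25],
              [0.25, 0.50]])                 # estimated illumination map
R = I / L                                    # reflectance, since I = R * L

L_corrected = np.ones_like(L)                # relight under full illumination
I_restored = R * L_corrected
print(np.allclose(R * L, I), I_restored.max())  # -> True 0.8
```

The priors extracted by TS-PGM (rain masks, dark channels) condition how the network estimates L and R for each degradation type.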
[366] CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains
Wenhan Wang, Zhixiang Zhou, Zhongtian Ma, Yanzhu Chen, Ziyu Lin, Hao Sheng, Pengfei Liu, Honglin Ma, Wenqi Shao, Qiaosheng Zhang, Yu Qiao
Main category: cs.CV
TL;DR: CiQi-Agent: A multimodal AI agent for antique Chinese porcelain connoisseurship that analyzes porcelain across six attributes using vision tools and retrieval-augmented generation.
Details
Motivation: To democratize cultural heritage understanding and assist expert connoisseurship of antique Chinese porcelain, which requires extensive historical expertise, material understanding, and aesthetic sensitivity that is difficult for non-specialists.
Method: Developed a domain-specific porcelain connoisseurship agent supporting multi-image inputs, vision tool invocation, and multimodal retrieval-augmented generation. Created CiQi-VQA dataset (29,596 specimens, 51,553 images, 557,940 VQA pairs) and CiQi-Bench benchmark. Trained using supervised fine-tuning, reinforcement learning, and tool-augmented reasoning framework with vision and multimodal retrieval tools.
Result: CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5.
Conclusion: CiQi-Agent successfully enables intelligent porcelain analysis, captures subtle visual details, retrieves relevant domain knowledge, and produces coherent, explainable connoisseurship descriptions, advancing multimodal AI applications in cultural heritage.
Abstract: The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent – a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question–answering pairs, and further establish a comprehensive benchmark CiQi-Bench aligned with the previously mentioned six attributes. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA.
[367] Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation
Jiachen Li, Hongyun Wang, Jinyu Xu, Wenbo Jiang, Yanchun Ma, Yongjian Liu, Qing Xie, Bolong Zheng
Main category: cs.CV
TL;DR: PPCR is a Progressive Prompt-guided Cross-modal Reasoning framework for referring image segmentation that uses MLLMs to generate semantic and spatial prompts for better language-to-vision grounding.
Details
Motivation: Existing referring image segmentation methods lack explicit reasoning mechanisms for grounding language descriptions to target regions, especially when dealing with detailed attributes and complex inter-object relationships.
Method: PPCR structures reasoning as a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline using MLLMs to generate Semantic Segmentation Prompts for semantic cues, then Spatial Segmentation Prompts for object location and spatial extent.
Result: Extensive experiments on standard benchmarks show PPCR consistently outperforms existing methods.
Conclusion: PPCR effectively bridges linguistic descriptions with object-level visual representations through progressive prompt-guided reasoning, advancing referring image segmentation.
Abstract: Referring image segmentation aims to localize and segment a target object in an image based on a free-form referring expression. The core challenge lies in effectively bridging linguistic descriptions with object-level visual representations, especially when referring expressions involve detailed attributes and complex inter-object relationships. Existing methods either rely on cross-modal alignment or employ Semantic Segmentation Prompts, but they often lack explicit reasoning mechanisms for grounding language descriptions to target regions in the image. To address these limitations, we propose PPCR, a Progressive Prompt-guided Cross-modal Reasoning framework for referring image segmentation. PPCR explicitly structures the reasoning process as a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline. Specifically, PPCR first employs multimodal large language models (MLLMs) to generate a Semantic Segmentation Prompt that captures key semantic cues of the target object. Based on this semantic context, a Spatial Segmentation Prompt is further generated to reason about object location and spatial extent, enabling a progressive transition from semantic understanding to spatial grounding. The Semantic and Spatial Segmentation prompts are then jointly integrated into the segmentation module to guide accurate target localization and segmentation. Extensive experiments on standard referring image segmentation benchmarks demonstrate that PPCR consistently outperforms existing methods. The code will be publicly released to facilitate reproducibility.
[368] UniDA3D: A Unified Domain-Adaptive Framework for Multi-View 3D Object Detection
Hongjing Wu, Cheng Chi, Jinlin Wu, Yanzhao Su, Zhen Lei, Wenqi Ren
Main category: cs.CV
TL;DR: UniDA3D: A unified domain-adaptive multi-view 3D object detector for robust perception under adverse conditions like nighttime, fog, and rain, using query-guided domain alignment and teacher-student training.
Details
Motivation: Existing multi-view 3D object detection methods suffer performance degradation under complex environmental conditions (nighttime, fog, rain) because they're trained mostly on ideal conditions. There's a need for robust all-weather perception without requiring separate training for each condition.
Method: Proposes UniDA3D with: 1) Unified multi-target domain adaptation treating different adverse conditions as a unified adaptation problem; 2) Query Guided Domain Discrepancy Mitigation (QDDM) module aligning object features via query-centric adversarial and contrastive learning; 3) Domain-adaptive teacher-student pipeline with exponential-moving-average teacher and dynamically updated pseudo labels for consistency learning.
Result: Outperforms state-of-the-art camera-only multi-view 3D detectors on synthesized nuScenes benchmarks (nuScenes-Night, nuScenes-Rain, nuScenes-Haze), achieving substantial gains in mAP and NDS while maintaining real-time inference efficiency.
Conclusion: UniDA3D enables robust all-weather 3D perception through unified domain adaptation, eliminating the need for separate training per condition while maintaining efficiency and performance under extreme environmental conditions.
Abstract: Camera-only 3D object detection is critical for autonomous driving, offering a cost-effective alternative to LiDAR-based methods. In particular, multi-view 3D object detection has emerged as a promising direction due to its balanced trade-off between performance and cost. However, existing methods often suffer significant performance degradation under complex environmental conditions such as nighttime, fog, and rain, primarily due to their reliance on training data collected mostly in ideal conditions. To address this challenge, we propose UniDA3D, a unified domain-adaptive multi-view 3D object detector designed for robust perception under diverse adverse conditions. UniDA3D formulates nighttime, rainy, and foggy scenes as a unified multi-target domain adaptation problem and leverages a novel query-guided domain discrepancy mitigation (QDDM) module to align object features between source and target domains at both batch and global levels via query-centric adversarial and contrastive learning. Furthermore, we introduce a domain-adaptive teacher-student training pipeline with an exponential-moving-average teacher and dynamically updated high-quality pseudo labels to enhance consistency learning and suppress background noise in unlabeled target domains. In contrast to prior approaches that require separate training for each condition, UniDA3D performs a single unified training process across multiple domains, enabling robust all-weather 3D perception. On a synthesized multi-view 3D benchmark constructed by generating nighttime, rainy, and foggy counterparts from nuScenes (nuScenes-Night, nuScenes-Rain, and nuScenes-Haze), UniDA3D consistently outperforms state-of-the-art camera-only multi-view 3D detectors under extreme conditions, achieving substantial gains in mAP and NDS while maintaining real-time inference efficiency.
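The teacher-student pipeline described above rests on two simple mechanics: an exponential-moving-average (EMA) teacher and confidence-filtered pseudo labels. A minimal sketch, with the decay value and function names as illustrative assumptions rather than the paper's code:

```python
def ema_update(teacher, student, decay=0.999):
    """EMA teacher update: t <- decay * t + (1 - decay) * s, per parameter."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def filter_pseudo_labels(preds, threshold=0.8):
    """Keep only high-confidence pseudo labels from the unlabeled target domain."""
    return [(label, score) for label, score in preds if score >= threshold]
```

Each training step, the student learns on source labels plus the teacher's filtered pseudo labels, and the teacher is then nudged toward the student via `ema_update`.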
[369] Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
Athos Georgiou
Main category: cs.CV
TL;DR: Hydra: A dual-head VLM that toggles between retrieval and generation using a single LoRA adapter, reducing memory by 41% while maintaining generation quality.
Details
Motivation: Current visual document understanding systems require separate retrieval and generation models, which doubles memory usage and increases system complexity. There's a need for a unified approach that can handle both tasks efficiently.
Method: Hydra uses a dual-head vision-language model with a single LoRA adapter trained only for retrieval. At inference, toggling the adapter enables multi-vector embeddings for retrieval or disables it for generation. The approach requires attention-mode restoration, lm_head preservation, and KV-cache-aware decoding to maintain generation quality.
Result: Hydra achieves byte-identical outputs in 100% of 10,500 samples compared to base model generation, with max delta-ANLS = 0.0044 across 15,301 samples on four VQA benchmarks. It reduces peak GPU memory by 41% and generalizes to audio retrieval and video embedding with speech generation in Qwen2.5-Omni-3B.
Conclusion: Hydra demonstrates that a single model can efficiently handle both retrieval and generation tasks through adapter toggling, significantly reducing memory requirements while maintaining generation quality, with potential extensions to multimodal domains.
Abstract: Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model’s generation quality – byte-identical outputs in 100% of 10,500 greedy and stochastic samples, with max delta-ANLS = 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks; multi-seed experiments are needed to confirm these trends. The single-model design reduces peak GPU memory by 41%, though adapter switching introduces throughput overhead under concurrent serving loads. An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime. A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding, with speech generation.
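The adapter-toggling mechanism at the heart of Hydra can be sketched in a few lines: a LoRA delta on a projection is applied only in retrieval mode, so disabling it recovers the base model's computation exactly, which is why base-model generation can be byte-identical. The class below is an illustrative toy, not the paper's implementation:

```python
import numpy as np

class ToggleLoRA:
    """A linear projection with a toggleable low-rank (LoRA) delta."""
    def __init__(self, w, a, b):
        self.w = w          # frozen base weight, shape (out, in)
        self.a = a          # LoRA down-projection, shape (r, in)
        self.b = b          # LoRA up-projection, shape (out, r)
        self.retrieval = False  # adapter disabled by default (generation mode)

    def forward(self, x):
        y = x @ self.w.T
        if self.retrieval:  # retrieval mode: add the low-rank delta
            y = y + x @ (self.b @ self.a).T
        return y
```

With `retrieval = False` the output is exactly the base projection; flipping the flag adds the retrieval-trained delta without touching the base weights.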
[370] CLIP-AUTT: Test-Time Personalization with Action Unit Prompting for Fine-Grained Video Emotion Recognition
Muhammad Osama Zeeshan, Masoumeh Sharafi, Benoît Savary, Alessandro Lameiras Koerich, Marco Pedersoli, Eric Granger
Main category: cs.CV
TL;DR: CLIP-AU and CLIP-AUTT: Using Action Units as structured textual prompts in CLIP for fine-grained emotion recognition with test-time personalization for subject-specific adaptation.
Details
Motivation: Existing CLIP-based emotion recognition methods rely on noisy LLM-generated text prompts or CLIP's contrastive pretraining, which fail to capture fine-grained facial expressions and subject-specific variability needed for accurate subtle emotion recognition.
Method: CLIP-AU uses Action Units (AUs) as structured textual prompts within CLIP to model fine-grained facial expressions. CLIP-AUTT adds test-time personalization with entropy-guided temporal window selection and prompt tuning to adapt to unseen subjects while preserving temporal consistency.
Result: Outperforms state-of-the-art CLIP-based FER and TTA methods on three challenging video-based subtle ER datasets (BioVid, StressID, and BAH), achieving robust and personalized subtle emotion recognition.
Conclusion: AU-guided temporal learning with test-time personalization enables fine-grained, subject-adaptive emotion recognition without CLIP fine-tuning or LLM-generated text supervision, addressing limitations of existing CLIP-based approaches.
Abstract: Personalization in emotion recognition (ER) is essential for an accurate interpretation of subtle and subject-specific expressive patterns. Recent advances in vision-language models (VLMs) such as CLIP demonstrate strong potential for leveraging joint image-text representations in ER. However, CLIP-based methods either depend on CLIP’s contrastive pretraining or on LLMs to generate descriptive text prompts, which are noisy, computationally expensive, and fail to capture fine-grained expressions, leading to degraded performance. In this work, we leverage Action Units (AUs) as structured textual prompts within CLIP to model fine-grained facial expressions. AUs encode the subtle muscle activations underlying expressions, providing localized and interpretable semantic cues for more robust ER. We introduce CLIP-AU, a lightweight AU-guided temporal learning method that integrates interpretable AU semantics into CLIP. It learns generic, subject-agnostic representations by aligning AU prompts with facial dynamics, enabling fine-grained ER without CLIP fine-tuning or LLM-generated text supervision. Although CLIP-AU models fine-grained AU semantics, it does not adapt to subject-specific variability in subtle expressions. To address this limitation, we propose CLIP-AUTT, a video-based test-time personalization method that dynamically adapts AU prompts to videos from unseen subjects. By combining entropy-guided temporal window selection with prompt tuning, CLIP-AUTT enables subject-specific adaptation while preserving temporal consistency. Our extensive experiments on three challenging video-based subtle ER datasets, BioVid, StressID, and BAH, indicate that CLIP-AU and CLIP-AUTT outperform state-of-the-art CLIP-based FER and TTA methods, achieving robust and personalized subtle ER.
[371] Domain-Invariant Prompt Learning for Vision-Language Models
Arsham Gholamzadeh Khoee, Yinan Yu, Robert Feldt
Main category: cs.CV
TL;DR: DiCoOp extends CoOp for domain generalization by learning domain-invariant prompts through adversarial training to handle distribution shifts across unseen domains.
Details
Motivation: While soft-prompting methods like CoOp effectively adapt vision-language models for downstream tasks, they lack explicit mechanisms to handle domain shifts across unseen distributions, limiting their generalization capabilities.
Method: DiCoOp extends CoOp with adversarial training to learn domain-invariant prompts. It forces the model to learn prompts that are invariant across different domains while preserving discriminative power for classification tasks.
Result: Experimental results show DiCoOp consistently surpasses CoOp in domain generalization tasks across diverse visual domains, demonstrating improved robustness to distribution shifts.
Conclusion: DiCoOp provides an effective extension to CoOp for domain generalization by learning domain-invariant prompts through adversarial training, addressing the limitation of handling unseen domain distributions.
Abstract: Large pre-trained vision-language models like CLIP have transformed computer vision by aligning images and text in a shared feature space, enabling robust zero-shot transfer via prompting. Soft-prompting, such as Context Optimization (CoOp), effectively adapts these models for downstream recognition tasks by learning a set of context vectors. However, CoOp lacks explicit mechanisms for handling domain shifts across unseen distributions. To address this, we propose Domain-invariant Context Optimization (DiCoOp), an extension of CoOp optimized for domain generalization. By employing an adversarial training approach, DiCoOp forces the model to learn domain-invariant prompts while preserving discriminative power for classification. Experimental results show that DiCoOp consistently surpasses CoOp in domain generalization tasks across diverse visual domains.
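Adversarial training for domain invariance is typically implemented with a gradient reversal layer: identity in the forward pass, sign-flipped gradient in the backward pass, so the prompt vectors are pushed to fool a domain classifier. A toy sketch of that standard trick, assuming DiCoOp follows this recipe (the abstract does not spell out the exact mechanism):

```python
class GradReverse:
    """Gradient reversal layer: forward is identity, backward flips the sign."""
    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off between invariance and task loss

    def forward(self, x):
        return x  # pass features (prompt embeddings) through unchanged

    def backward(self, grad):
        # The domain classifier's gradient is negated before reaching the
        # prompts, so minimizing its loss *maximizes* domain confusion.
        return [-self.lam * g for g in grad]
```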
[372] DipGuava: Disentangling Personalized Gaussian Features for 3D Head Avatars from Monocular Video
Jeonghaeng Lee, Seok Keun Choi, Zhixuan Li, Weisi Lin, Sanghoon Lee
Main category: cs.CV
TL;DR: DipGuava is a novel 3D Gaussian head avatar creation method that disentangles facial appearance into stable geometry-driven base components and personalized residual details, enabling photorealistic, identity-preserving avatars from monocular video.
Details
Motivation: Existing 3D head avatar methods fail to capture personalized details, limiting realism and expressiveness. There's a need for methods that can generate avatars with personalized attributes from monocular video while maintaining identity preservation.
Method: Two-stage pipeline: 1) Learn stable geometry-driven base appearance capturing global facial structure and coarse expression-dependent variations. 2) Predict personalized residual details (high-frequency components, wrinkles, subtle skin deformations) not captured in the first stage. Uses dynamic appearance fusion to integrate residual details after deformation with spatial and semantic alignment.
Result: DipGuava consistently outperforms prior methods in both visual quality and quantitative performance, generating photorealistic, identity-preserving avatars with personalized attributes.
Conclusion: DipGuava successfully addresses the limitation of existing methods by explicitly disentangling facial appearance, enabling high-fidelity 3D head avatar creation with personalized details from monocular video.
Abstract: While recent 3D head avatar creation methods attempt to animate facial dynamics, they often fail to capture personalized details, limiting realism and expressiveness. To fill this gap, we present DipGuava (Disentangled and Personalized Gaussian UV Avatar), a novel 3D Gaussian head avatar creation method that successfully generates avatars with personalized attributes from monocular video. DipGuava is the first method to explicitly disentangle facial appearance into two complementary components, trained in a structured two-stage pipeline that significantly reduces learning ambiguity and enhances reconstruction fidelity. In the first stage, we learn a stable geometry-driven base appearance that captures global facial structure and coarse expression-dependent variations. In the second stage, the personalized residual details not captured in the first stage are predicted, including high-frequency components and nonlinearly varying features such as wrinkles and subtle skin deformations. These components are fused via dynamic appearance fusion that integrates residual details after deformation, ensuring spatial and semantic alignment. This disentangled design enables DipGuava to generate photorealistic, identity-preserving avatars, consistently outperforming prior methods in both visual quality and quantitative performance, as demonstrated in extensive experiments.
[373] Energy-Aware Imitation Learning for Steering Prediction Using Events and Frames
Hu Cao, Jiong Liu, Xingzhuo Yan, Rui Song, Yan Xia, Walter Zimmer, Guang Chen, Alois Knoll
Main category: cs.CV
TL;DR: Energy-aware imitation learning framework for autonomous driving steering prediction using event cameras and frame-based cameras with cross-modality fusion
Details
Motivation: Frame-based cameras in autonomous driving suffer from inaccuracies due to long exposure times, high-speed motion, and challenging lighting conditions. Event cameras provide complementary sparse, asynchronous event data to address these limitations.
Method: Proposes an energy-aware imitation learning framework with an Energy-driven Cross-modality Fusion Module (ECFM) and energy-aware decoder to fuse event and frame data for steering prediction.
Result: Outperforms existing state-of-the-art approaches on two public real-world datasets (DDD20 and DRFuser).
Conclusion: The framework effectively leverages complementary event and frame modalities for reliable and safe steering prediction in autonomous driving.
Abstract: In autonomous driving, relying solely on frame-based cameras can lead to inaccuracies caused by factors like long exposure times, high-speed motion, and challenging lighting conditions. To address these issues, we introduce a bio-inspired vision sensor known as the event camera. Unlike conventional cameras, event cameras capture sparse, asynchronous events that provide a complementary modality to mitigate these challenges. In this work, we propose an energy-aware imitation learning framework for steering prediction that leverages both events and frames. Specifically, we design an Energy-driven Cross-modality Fusion Module (ECFM) and an energy-aware decoder to produce reliable and safe predictions. Extensive experiments on two public real-world datasets, DDD20 and DRFuser, demonstrate that our method outperforms existing state-of-the-art (SOTA) approaches. The codes and trained models will be released upon acceptance.
[374] Detection of Adversarial Attacks in Robotic Perception
Ziad Sharawy, Mohammad Nakshbandi, Sorin Mihai Grigorescu
Main category: cs.CV
TL;DR: DNNs for semantic segmentation in robotics are vulnerable to adversarial attacks, requiring specialized robustness approaches beyond image classification methods.
Details
Motivation: Safety-critical robotic applications using DNNs for semantic segmentation are vulnerable to adversarial attacks, but existing robustness research focuses mainly on image classification rather than the segmentation-specific architectures needed for robotics.
Method: The paper likely proposes specialized adversarial robustness methods for semantic segmentation in robotic perception, potentially including novel detection strategies and architecture adaptations beyond standard image classification defenses.
Result: Not specified in the abstract, but presumably demonstrates improved robustness against adversarial attacks for semantic segmentation in robotic applications compared to standard approaches.
Conclusion: Robust semantic segmentation for robotics requires specialized adversarial defense approaches distinct from image classification methods to ensure safety in real-world applications.
Abstract: Deep Neural Networks (DNNs) achieve strong performance in semantic segmentation for robotic perception but remain vulnerable to adversarial attacks, threatening safety-critical applications. While robustness has been studied for image classification, semantic segmentation in robotic contexts requires specialized architectures and detection strategies.
[375] Physically Inspired Gaussian Splatting for HDR Novel View Synthesis
Huimin Zeng, Yue Bai, Hailing Wang, Yun Fu
Main category: cs.CV
TL;DR: PhysHDR-GS is a physically-inspired HDR novel view synthesis framework that models scenes via intrinsic reflectance and adjustable ambient illumination, using complementary branches for camera observations and illumination-dependent appearance changes.
Details
Motivation: Existing HDR novel view synthesis methods struggle to capture ambient illumination-dependent appearance and suffer from abnormal HDR values when implicitly supervising through tone-mapped results, leading to limited gradients in under/over-exposed regions.
Method: Uses a two-branch approach: an image-exposure (IE) branch for standard camera observations and a Gaussian-illumination (GI) branch for illumination-dependent appearance changes. Introduces a cross-branch HDR consistency loss for explicit HDR supervision and illumination-guided gradient scaling to mitigate exposure-biased gradient starvation.
Result: Achieves superior HDR detail reconstruction with 2.04 dB PSNR gain over HDR-GS while maintaining real-time rendering speed up to 76 FPS across realistic and synthetic datasets.
Conclusion: PhysHDR-GS effectively addresses limitations in HDR novel view synthesis by physically modeling scene appearance and providing explicit HDR supervision, enabling high-quality HDR reconstruction with real-time performance.
Abstract: High dynamic range novel view synthesis (HDR-NVS) reconstructs scenes with dynamic details by fusing multi-exposure low dynamic range (LDR) views, yet it struggles to capture ambient illumination-dependent appearance. Implicitly supervising HDR content by constraining tone-mapped results fails in correcting abnormal HDR values, and results in limited gradients for Gaussians in under/over-exposed regions. To this end, we introduce PhysHDR-GS, a physically inspired HDR-NVS framework that models scene appearance via intrinsic reflectance and adjustable ambient illumination. PhysHDR-GS employs a complementary image-exposure (IE) branch and Gaussian-illumination (GI) branch to faithfully reproduce standard camera observations and capture illumination-dependent appearance changes, respectively. During training, the proposed cross-branch HDR consistency loss provides explicit supervision for HDR content, while an illumination-guided gradient scaling strategy mitigates exposure-biased gradient starvation and reduces under-densified representations. Experimental results across realistic and synthetic datasets demonstrate our superiority in reconstructing HDR details (e.g., a PSNR gain of 2.04 dB over HDR-GS), while maintaining real-time rendering speed (up to 76 FPS). Code and models are available at https://huimin-zeng.github.io/PhysHDR-GS/.
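The abstract's point about "limited gradients" in under/over-exposed regions follows from the shape of any saturating tone-mapping curve: its derivative vanishes for large HDR values, so supervision applied after tone mapping barely updates those regions. A simple Reinhard-style curve (an illustrative stand-in, not the paper's operator) makes this concrete:

```python
def tone_map(h):
    """Reinhard-style curve mapping HDR radiance h >= 0 into [0, 1)."""
    return h / (1.0 + h)

def tone_map_grad(h):
    """Derivative of tone_map: shrinks rapidly as h grows (over-exposure)."""
    return 1.0 / (1.0 + h) ** 2
```

A loss defined on `tone_map(h)` scales its gradient w.r.t. `h` by `tone_map_grad(h)`, which is tiny in bright regions; hence the appeal of an explicit HDR-space consistency loss.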
[376] SegRGB-X: General RGB-X Semantic Segmentation Model
Jiong Liu, Yingjie Xu, Xingcheng Zhou, Rui Song, Walter Zimmer, Alois Knoll, Hu Cao
Main category: cs.CV
TL;DR: Universal framework for semantic segmentation across arbitrary sensor modalities using modality-aware CLIP, aligned embeddings, and domain refinement.
Details
Motivation: Address challenges in semantic segmentation across diverse sensor modalities (event, thermal, depth, polarization, light field) and reduce redundant development efforts for different sensors.
Method: Three key innovations: 1) Modality-aware CLIP (MA-CLIP) with LoRA fine-tuning for modality-specific guidance, 2) Modality-aligned Embeddings for fine-grained features, 3) Domain-specific Refinement Module (DSRM) for dynamic feature adjustment.
Result: Achieves state-of-the-art performance with 65.03% mIoU across five diverse datasets with complementary modalities, surpassing specialized multi-modal methods.
Conclusion: Proposes a universal framework that effectively unifies semantic segmentation across multiple sensor modalities, demonstrating strong generalization capabilities.
Abstract: Semantic segmentation across arbitrary sensor modalities faces significant challenges due to diverse sensor characteristics, and the traditional configurations for this task result in redundant development efforts. We address these challenges by introducing a universal arbitrary-modal semantic segmentation framework that unifies segmentation across multiple modalities. Our approach features three key innovations: (1) the Modality-aware CLIP (MA-CLIP), which provides modality-specific scene understanding guidance through LoRA fine-tuning; (2) Modality-aligned Embeddings for capturing fine-grained features; and (3) the Domain-specific Refinement Module (DSRM) for dynamic feature adjustment. Evaluated on five diverse datasets with different complementary modalities (event, thermal, depth, polarization, and light field), our model surpasses specialized multi-modal methods and achieves state-of-the-art performance with a mIoU of 65.03%. The codes will be released upon acceptance.
[377] Adapting SAM to Nuclei Instance Segmentation and Classification via Cooperative Fine-Grained Refinement
Jingze Su, Tianle Zhu, Jiaxin Cai, Zhiyi Wang, Qi Li, Xiao Zhang, Tong Tong, Shu Wang, Wenxi Liu
Main category: cs.CV
TL;DR: A parameter-efficient fine-tuning framework called Cooperative Fine-Grained Refinement of SAM (CFG-SAM) adapts the Segment Anything Model for nuclei instance segmentation in computational pathology by enhancing local perception while minimizing computational costs.
Details
Motivation: Direct application of SAM to medical imaging has limitations: it lacks sufficient perception of local structural features crucial for nuclei segmentation, and full fine-tuning requires substantial computational costs. There's a need to efficiently transfer SAM's robust prior knowledge while supplementing task-aware local perception.
Method: Proposes CFG-SAM with three components: 1) Multi-scale Adaptive Local-aware Adapter for capability transfer with minimal parameters and dynamic multi-scale convolutional kernels, 2) Hierarchical Modulated Fusion Module for dynamic aggregation of multi-level encoder features to preserve spatial details, and 3) Boundary-Guided Mask Refinement that integrates multi-context boundary cues with semantic features through explicit supervision.
Result: The framework enables SAM to perform accurate nuclei instance segmentation directly by cooperatively enhancing local perception, preserving spatial details, and refining boundaries through parameter-efficient fine-tuning.
Conclusion: CFG-SAM successfully adapts SAM for medical imaging tasks by addressing its limitations in local feature perception while maintaining computational efficiency, making it suitable for nuclei instance segmentation in computational pathology.
Abstract: Nuclei instance segmentation is critical in computational pathology for cancer diagnosis and prognosis. Recently, the Segment Anything Model has demonstrated exceptional performance in various segmentation tasks, leveraging its rich priors and powerful global context modeling capabilities derived from large-scale pre-training on natural images. However, directly applying SAM to the medical imaging domain faces significant limitations: it lacks sufficient perception of the local structural features that are crucial for nuclei segmentation, and full fine-tuning for downstream tasks requires substantial computational costs. To efficiently transfer SAM’s robust prior knowledge to nuclei instance segmentation while supplementing its task-aware local perception, we propose a parameter-efficient fine-tuning framework, named Cooperative Fine-Grained Refinement of SAM, consisting of three core components: 1) a Multi-scale Adaptive Local-aware Adapter, which enables effective capability transfer by augmenting the frozen SAM backbone with minimal parameters and instilling a powerful perception of local structures through dynamically generated, multi-scale convolutional kernels; 2) a Hierarchical Modulated Fusion Module, which dynamically aggregates multi-level encoder features to preserve fine-grained spatial details; and 3) a Boundary-Guided Mask Refinement, which integrates multi-context boundary cues with semantic features through explicit supervision, producing a boundary-focused signal to refine initial mask predictions for sharper delineation. These three components work cooperatively to enhance local perception, preserve spatial details, and refine boundaries, enabling SAM to perform accurate nuclei instance segmentation directly.
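Parameter-efficient adapters of the kind CFG-SAM builds on typically add a small trainable bottleneck residually on top of a frozen backbone projection, so only a few parameters are updated. A toy sketch under that assumption (the dynamic multi-scale convolutional kernels of the actual adapter are omitted):

```python
import numpy as np

def adapter_forward(x, w_frozen, down, up):
    """Frozen projection plus a trainable low-rank residual bottleneck.

    y = x W^T           (frozen SAM weight, untouched during fine-tuning)
      + relu(x D^T) U^T (small down/up projection, the only trained part)
    """
    base = x @ w_frozen.T
    h = np.maximum(x @ down.T, 0.0)  # bottleneck dim << feature dim
    return base + h @ up.T
```

Initializing `up` to zeros (a common adapter convention) makes the module start as an exact identity over the frozen backbone.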
[378] Why Aggregate Accuracy is Inadequate for Evaluating Fairness in Law Enforcement Facial Recognition Systems
Khalid Adnan Alsayed
Main category: cs.CV
TL;DR: Paper critiques reliance on aggregate accuracy metrics for facial recognition systems in law enforcement, showing how they mask demographic disparities in error rates and proposing fairness-aware evaluation frameworks.
Details
Motivation: Facial recognition systems in law enforcement have significant societal consequences, but despite high reported accuracy, they often exhibit uneven performance across demographic groups, leading to disproportionate error rates and potential harm.
Method: Analysis of subgroup-level error distribution (false positive rate and false negative rate) to demonstrate how aggregate metrics obscure demographic disparities, plus examination of operational risks in law enforcement applications.
Result: Empirical observations show systems with similar overall accuracy can have substantially different fairness profiles, with subgroup error rates varying significantly despite identical aggregate metrics.
Conclusion: Need to move beyond accuracy as primary metric and adopt comprehensive fairness-aware evaluation frameworks and model-agnostic auditing strategies for responsible AI deployment in high-stakes environments.
Abstract: Facial recognition systems are increasingly deployed in law enforcement and security contexts, where algorithmic decisions can carry significant societal consequences. Despite high reported accuracy, growing evidence demonstrates that such systems often exhibit uneven performance across demographic groups, leading to disproportionate error rates and potential harm. This paper argues that aggregate accuracy is an insufficient metric for evaluating the fairness and reliability of facial recognition systems in high-stakes environments. Through analysis of subgroup-level error distribution, including false positive rate (FPR) and false negative rate (FNR), the paper demonstrates how aggregate performance metrics can obscure critical disparities across demographic groups. Empirical observations show that systems with similar overall accuracy can exhibit substantially different fairness profiles, with subgroup error rates varying significantly despite a single aggregate metric. The paper further examines the operational risks associated with accuracy-centric evaluation practices in law enforcement applications, where misclassification may result in wrongful suspicion or missed identification. It highlights the importance of fairness-aware evaluation approaches and model-agnostic auditing strategies that enable post-deployment assessment of real-world systems. The findings emphasise the need to move beyond accuracy as a primary metric and adopt more comprehensive evaluation frameworks for responsible AI deployment.
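The paper's central argument is easy to reproduce numerically: a system can post one respectable aggregate accuracy while its false positive rate diverges sharply between subgroups. A small self-contained illustration with synthetic labels (not data from the paper):

```python
def rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for binary labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    neg = sum(1 for t in y_true if t == 0)
    pos = sum(1 for t in y_true if t == 1)
    return (fp / neg if neg else 0.0, fn / pos if pos else 0.0)

def subgroup_report(y_true, y_pred, groups):
    """Per-group (FPR, FNR): the disparity a single aggregate metric hides."""
    report = {}
    for g in sorted(set(groups)):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        report[g] = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return report
```

For example, with one false positive confined to group B, overall accuracy stays high while group B's FPR is several times group A's, which is exactly the pattern an accuracy-only audit would miss.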
[379] Efficient Domain Adaptation for Text Line Recognition via Decoupled Language Models
Arundhathi Dev, Justin Zhan
Main category: cs.CV
TL;DR: A modular OCR framework using lightweight visual character detection combined with domain-specific linguistic correction models achieves near-state-of-the-art accuracy with 95% less compute than end-to-end transformers.
Details
Motivation: Current state-of-the-art OCR systems require prohibitive computational resources (hundreds of GPU hours) for domain adaptation, limiting accessibility for practitioners and digital humanities scholars who lack such resources.
Method: Decouples OCR into two modules: 1) lightweight visual character detection (domain-agnostic), and 2) domain-specific linguistic correction using pretrained sequence models (T5, ByT5, BART). The correctors are trained entirely on synthetic noise, enabling annotation-free domain adaptation without labeled target images.
Result: The framework achieves near-SOTA accuracy across modern clean handwriting, cursive script, and historical documents while reducing compute by approximately 95% compared to end-to-end transformers. T5-Base excels on modern text with standard vocabulary, while ByT5-Base dominates on historical documents by reconstructing archaic spellings at the byte level.
Conclusion: The decoupled paradigm matches end-to-end transformer accuracy while dramatically reducing computational requirements, establishing a viable, resource-efficient alternative to monolithic OCR architectures that democratizes access to high-quality OCR.
Abstract: Optical character recognition remains critical infrastructure for document digitization, yet state-of-the-art performance is often restricted to well-resourced institutions by prohibitive computational barriers. End-to-end transformer architectures achieve strong accuracy but demand hundreds of GPU hours for domain adaptation, limiting accessibility for practitioners and digital humanities scholars. We present a modular detection-and-correction framework that achieves near-SOTA accuracy with single-GPU training. Our approach decouples lightweight visual character detection (domain-agnostic) from domain-specific linguistic correction using pretrained sequence models including T5, ByT5, and BART. By training the correctors entirely on synthetic noise, we enable annotation-free domain adaptation without requiring labeled target images. Evaluating across modern clean handwriting, cursive script, and historical documents, we identify a critical “Pareto frontier” in architecture selection: T5-Base excels on modern text with standard vocabulary, whereas ByT5-Base dominates on historical documents by reconstructing archaic spellings at the byte level. Our results demonstrate that this decoupled paradigm matches end-to-end transformer accuracy while reducing compute by approximately 95%, establishing a viable, resource-efficient alternative to monolithic OCR architectures.
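The annotation-free training recipe amounts to corrupting clean text with synthetic OCR-style noise and training the corrector on (noisy, clean) pairs. A deliberately simple noise model as a sketch; the confusion table and parameters are illustrative assumptions, not the paper's:

```python
import random

# Illustrative OCR confusion pairs (visually similar glyphs).
CONFUSIONS = {"l": "1", "O": "0", "rn": "m", "e": "c"}

def corrupt(text, rate=0.3, seed=0):
    """Inject character-confusion noise into clean text, reproducibly."""
    rng = random.Random(seed)
    out = text
    for src, dst in CONFUSIONS.items():
        if src in out and rng.random() < rate:
            out = out.replace(src, dst, 1)
    return out
```

A training pair is then simply `(corrupt(line), line)` over any clean corpus from the target domain, with no labeled images required.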
[380] AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys
Main category: cs.CV
TL;DR: AdaptToken is a training-free framework that uses MLLM’s self-uncertainty to globally control token selection for long video understanding, enabling efficient processing of up to 10K frames with early stopping capability.
Details
Motivation: Long video understanding is challenging for MLLMs due to memory constraints and context-length limits. Existing approaches lack mechanisms to compare relevance across distant clips and stop processing when sufficient evidence is gathered.
Method: Splits video into groups, extracts cross-modal attention to rank tokens within each group, uses model’s response entropy to estimate group relevance, enabling global token budget allocation and early stopping (AdaptToken-Lite).
Result: Consistent accuracy improvements across four long-video benchmarks and multiple base MLLMs (7B-72B), with +6.7 average improvement over Qwen2.5-VL 7B, effective up to 10K frames, and ~50% inference time reduction with AdaptToken-Lite.
Conclusion: AdaptToken provides an effective training-free solution for long video understanding by leveraging model uncertainty for global token selection, enabling efficient processing of extremely long videos while maintaining accuracy.
Abstract: Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM’s self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model’s response entropy to estimate each group’s prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
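The entropy-driven budgeting described above can be sketched in a few lines. This is a hypothetical reconstruction, not the authors' code: `response_entropy`, `allocate_budget`, and `process_groups` are illustrative names, and the softmax-over-negative-entropy weighting and fixed stopping threshold are assumptions about how a lower-entropy (more certain) group earns a larger token share.

```python
import numpy as np

def response_entropy(probs):
    """Shannon entropy of the model's answer distribution (higher = less certain)."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def allocate_budget(group_entropies, total_budget, tau=1.0):
    """Give more visual tokens to groups the model is more certain about
    (low response entropy), via a softmax over negative entropy."""
    h = np.asarray(group_entropies, dtype=float)
    w = np.exp(-h / tau)
    w /= w.sum()
    return np.floor(w * total_budget).astype(int)

def process_groups(group_entropies, total_budget, stop_entropy=0.3):
    """AdaptToken-Lite-style early stop: skip remaining groups once the
    model's response entropy drops below a certainty threshold."""
    budgets = allocate_budget(group_entropies, total_budget)
    kept = []
    for g, (h, b) in enumerate(zip(group_entropies, budgets)):
        kept.append((g, b))
        if h < stop_entropy:  # model already certain enough; skip the rest
            break
    return kept
```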
[381] Effort-Based Criticality Metrics for Evaluating 3D Perception Errors in Autonomous Driving
Sharang Kaul, Simon Bultmann, Mario Berk, Abhinav Valada
Main category: cs.CV
TL;DR: Proposes novel effort-based safety metrics (FSR, MDR, LEA) for autonomous vehicle perception evaluation that quantify longitudinal and lateral evasion efforts needed to avoid collisions, distinguishing critical from non-critical perception errors.
Details
Motivation: Traditional criticality metrics like time-to-collision conflate consequences of false-positive and false-negative perception errors. There's a need for metrics that quantify the actual effort required to avoid collisions from perception errors to better evaluate safety-critical failures.
Method: Introduces three effort-based metrics: False Speed Reduction (FSR) for cumulative velocity loss from phantom detections, Maximum Deceleration Rate (MDR) for peak braking demand from missed objects, and Lateral Evasion Acceleration (LEA) for minimum steering effort to avoid collisions. Uses reachability-based ellipsoidal collision filtering and frame-level matching with track-level aggregation.
Result: Evaluation on nuScenes and Argoverse 2 datasets shows 65-93% of perception errors are non-critical. Spearman correlation analysis confirms the proposed metrics capture safety-relevant information not accessible to traditional time-based or deceleration-based measures, enabling targeted mining of critical perception failures.
Conclusion: The proposed effort-based metrics provide more nuanced evaluation of perception system safety by quantifying actual evasion efforts required, distinguishing critical from non-critical errors, and enabling focused improvement on safety-critical perception failures.
Abstract: Criticality metrics such as time-to-collision (TTC) quantify collision urgency but conflate the consequences of false-positive (FP) and false-negative (FN) perception errors. We propose two novel effort-based metrics: False Speed Reduction (FSR), the cumulative velocity loss from persistent phantom detections, and Maximum Deceleration Rate (MDR), the peak braking demand from missed objects under a constant-acceleration model. These longitudinal metrics are complemented by Lateral Evasion Acceleration (LEA), adapted from prior lateral evasion kinematics and coupled with reachability-based collision timing to quantify the minimum steering effort to avoid a predicted collision. A reachability-based ellipsoidal collision filter ensures only dynamically plausible threats are scored, with frame-level matching and track-level aggregation. Evaluation of different perception pipelines on nuScenes and Argoverse 2 shows that 65-93% of errors are non-critical, and Spearman correlation analysis confirms that all three metrics capture safety-relevant information inaccessible to established time-based, deceleration-based, or normalized criticality measures, enabling targeted mining of the most critical perception failures.
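The constant-acceleration forms behind MDR and LEA follow from elementary kinematics: stopping from closing speed v within gap d requires a = v^2/(2d), and displacing laterally by y within time t requires a = 2y/t^2. The sketch below encodes only these textbook formulas; the paper's reachability-based collision filtering, FSR accumulation over persistent phantom tracks, and track-level aggregation are not reproduced.

```python
def max_deceleration_rate(closing_speed, gap):
    """Peak braking demand (m/s^2) to stop before a missed object at
    distance `gap` (m) while closing at `closing_speed` (m/s), under a
    constant-deceleration model: a = v^2 / (2 d)."""
    if gap <= 0:
        raise ValueError("already in collision")
    return closing_speed ** 2 / (2.0 * gap)

def lateral_evasion_acceleration(lateral_offset, time_to_collision):
    """Minimum constant lateral acceleration (m/s^2) to displace the ego
    vehicle by `lateral_offset` (m) before the predicted collision:
    a = 2 y / t^2."""
    if time_to_collision <= 0:
        raise ValueError("no time left to evade")
    return 2.0 * lateral_offset / time_to_collision ** 2
```

For example, a missed object 50 m ahead at 20 m/s closing speed demands 4 m/s^2 of braking, well within normal limits, so such an error would score as non-critical.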
[382] Event6D: Event-based Novel Object 6D Pose Tracking
Jae-Young Kang, Hoonehee Cho, Taeyeop Lee, Minjun Kang, Bowen Wen, Youngho Kim, Kuk-Jin Yoon
Main category: cs.CV
TL;DR: EventTrack6D is an event-depth tracking framework for 6D object pose tracking that uses event cameras to achieve microsecond latency and handles fast motion without object-specific training by reconstructing intensity and depth from sparse event streams.
Details
Motivation: Conventional RGB and depth pipelines suffer from motion blur and large pixel displacements in fast, dynamic scenes. Event cameras provide microsecond latency but need methods to leverage their unique properties for 6D object pose tracking, especially for novel objects without object-specific training.
Method: EventTrack6D uses a dual reconstruction approach conditioned on the most recent depth measurement to recover dense photometric and geometric cues from sparse event streams. It reconstructs both intensity and depth at arbitrary timestamps between depth frames, enabling 6D pose tracking at over 120 FPS.
Result: The method achieves accurate 6D pose tracking across diverse objects and motion patterns, maintaining temporal consistency under rapid motion. Trained exclusively on synthetic data, it generalizes effectively to real-world scenarios without fine-tuning. A comprehensive benchmark suite including synthetic and real datasets was created for evaluation.
Conclusion: EventTrack6D demonstrates the effectiveness of event cameras for event-based 6D pose tracking of novel objects, providing a high-speed solution that overcomes limitations of conventional RGB-depth pipelines in fast, dynamic scenes.
Abstract: Event cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. Our EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets, including real and simulated event datasets. Trained exclusively on synthetic data, EventTrack6D generalizes effectively to real-world scenarios without fine-tuning, maintaining accurate tracking across diverse objects and motion patterns. Our method and datasets validate the effectiveness of event cameras for event-based 6D pose tracking of novel objects. Code and datasets are publicly available at https://chohoonhee.github.io/Event6D.
[383] Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
Zhen Zou, Xiaoxiao Ma, Mingde Yao, Jie Huang, LinJiang Huang, Feng Zhao
Main category: cs.CV
TL;DR: Drift-AR accelerates AR-Diffusion hybrid models by using prediction entropy as a unified signal to speed up both autoregressive and diffusion stages, achieving 3.8-5.5× speedup with single-step decoding.
Details
Motivation: AR-Diffusion hybrid models suffer from dual speed bottlenecks: sequential AR generation and iterative diffusion denoising. Existing methods address each bottleneck separately without a unified design principle.
Method: Uses prediction entropy from continuous-space AR models as a unifying signal. For AR acceleration: entropy-informed speculative decoding aligns draft-target entropy distributions. For visual decoder acceleration: reinterprets entropy as physical variance for an anti-symmetric drifting field, enabling single-step (1-NFE) decoding without iterative denoising.
Result: Achieves 3.8-5.5× speedup with genuine 1-NFE decoding while matching or surpassing original quality on MAR, TransDiff, and NextStep-1 benchmarks.
Conclusion: Prediction entropy serves as a natural unifying signal for joint acceleration of both AR and diffusion stages in hybrid models, enabling significant speed improvements without quality degradation.
Abstract: Autoregressive (AR)-Diffusion hybrid paradigms combine AR’s structured semantic modeling with diffusion’s high-fidelity synthesis, yet suffer from a dual speed bottleneck: the sequential AR stage and the iterative multi-step denoising of the diffusion vision decoding stage. Existing methods address each in isolation without a unified design principle. We observe that the per-position \emph{prediction entropy} of continuous-space AR models naturally encodes spatially varying generation uncertainty, simultaneously governing draft prediction quality in the AR stage and reflecting the corrective effort required by the vision decoding stage, a connection not fully explored before. Since entropy is inherently tied to both bottlenecks, it serves as a natural unifying signal for joint acceleration. In this work, we propose \textbf{Drift-AR}, which leverages the entropy signal to accelerate both stages: 1) for AR acceleration, we introduce Entropy-Informed Speculative Decoding, which aligns draft–target entropy distributions via a causal-normalized entropy loss, resolving the entropy mismatch that causes excessive draft rejection; 2) for visual decoder acceleration, we reinterpret entropy as the \emph{physical variance} of the initial state for an anti-symmetric drifting field, where high-entropy positions activate stronger drift toward the data manifold and low-entropy positions yield vanishing drift, enabling single-step (1-NFE) decoding without iterative denoising or distillation. Moreover, both stages share the same entropy signal, which is computed once at no extra cost. Experiments on MAR, TransDiff, and NextStep-1 demonstrate 3.8–5.5$\times$ speedup with genuine 1-NFE decoding, matching or surpassing original quality. Code will be available at https://github.com/aSleepyTree/Drift-AR.
[384] Object Detection Based on Distributed Convolutional Neural Networks
Liang Sun
Main category: cs.CV
TL;DR: A lightweight object detection method using Distributed CNN that detects objects by identifying high-scoring patches across scales and overlapping them to form bounding boxes.
Details
Motivation: To create a simple, efficient object detection method that doesn't require complex architectures or extensive training data, using only object-centered images with class labels.
Method: Uses Distributed CNN where output vector modules for positive classes are monotonic with feature presence probabilities. Detects objects by identifying high-scoring patches across all scales and overlapping them to form bounding boxes.
Result: Enables parallel detection for multiple classes and faster single-object detection due to lightweight architecture. Training requires only object-centered image data with positive/negative labels.
Conclusion: Proposes a straightforward, efficient object detection approach that leverages scale-invariant feature detection through Distributed CNN, offering computational advantages over traditional methods.
Abstract: Based on the Distributed Convolutional Neural Network (DisCNN), a straightforward object detection method is proposed. The modules of the output vector of a DisCNN with respect to a specific positive class are positively monotonic with the presence probabilities of the positive features. So, by identifying all high-scoring patches across all possible scales, the positive object can be detected by overlapping them to form a bounding box. The essential idea is that the object is detected by detecting its features on multiple scales, ranging from specific sub-features to abstract features composed of these sub-features. Training DisCNN requires only object-centered image data with positive and negative class labels. The detection process for multiple positive classes can be conducted in parallel to significantly accelerate it, and single-object detection is also faster because of the lightweight model architecture.
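The patch-union detection loop is simple enough to sketch. This is an illustrative reconstruction under assumptions: `score_fn` stands in for the DisCNN's per-patch class score, patches are square, and the output box is the axis-aligned union of every patch exceeding a threshold.

```python
def detect_by_patch_union(score_fn, image_size, patch_sizes, stride, thresh):
    """Scan square patches at multiple scales, keep those whose classifier
    score exceeds `thresh`, and return the bounding box covering them all
    (the union of overlapping positive patches), or None if nothing scores."""
    W, H = image_size
    boxes = []
    for s in patch_sizes:
        for x in range(0, W - s + 1, stride):
            for y in range(0, H - s + 1, stride):
                if score_fn(x, y, s) > thresh:
                    boxes.append((x, y, x + s, y + s))
    if not boxes:
        return None
    x0 = min(b[0] for b in boxes); y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes); y1 = max(b[3] for b in boxes)
    return (x0, y0, x1, y1)
```

Because the scan is independent per class, running one such loop per positive class in parallel matches the parallel multi-class detection the abstract describes.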
[385] On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or
Main category: cs.CV
TL;DR: A method to increase diversity in text-to-image diffusion models by applying repulsion in the contextual space during transformer forward passes, avoiding artifacts while maintaining semantic alignment.
Details
Motivation: Current T2I diffusion models suffer from typicality bias, producing narrow visual solutions for prompts. Existing diversity methods either require costly optimization or disrupt visual structure, creating artifacts.
Method: Proposes applying repulsion in the Contextual Space by intervening in multimodal attention channels during transformer forward passes. The intervention occurs between blocks where text conditioning is enriched with emergent image structure, redirecting guidance after structural formation but before composition fixation.
Result: Produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Method is computationally efficient with small overhead and remains effective in modern “Turbo” and distilled models where traditional trajectory-based interventions fail.
Conclusion: Repulsion in Contextual Space provides an effective framework for enhancing diversity in diffusion transformers while maintaining efficiency and compatibility with modern model architectures.
Abstract: Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer’s forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern “Turbo” and distilled models where traditional trajectory-based interventions typically fail.
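As a toy analogue of the idea (not the paper's multimodal-attention intervention), repulsion among a batch of context vectors can be written as a small step along pairwise differences, weighted so that nearby samples push each other apart harder; the RBF-style kernel and `strength` parameter below are assumptions for illustration only.

```python
import numpy as np

def repel(contexts, strength=0.1, eps=1e-8):
    """Push each sample's context embedding away from the other samples'
    embeddings in a batch, one step along the pairwise differences,
    with an RBF-style weight so close pairs repel more strongly."""
    B = contexts.shape[0]
    out = contexts.copy()
    for i in range(B):
        for j in range(B):
            if i == j:
                continue
            diff = contexts[i] - contexts[j]
            dist2 = float(diff @ diff) + eps
            out[i] += strength * np.exp(-dist2) * diff  # stronger push when close
    return out
```

Applied on the fly during a forward pass, such a step diversifies a batch without optimizing inputs or perturbing spatially-committed latents, which is the trade-off the abstract targets.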
[386] \textit{4DSurf}: High-Fidelity Dynamic Scene Surface Reconstruction
Renjie Wu, Hongdong Li, Jose M. Alvarez, Miaomiao Liu
Main category: cs.CV
TL;DR: 4DSurf: A Gaussian Splatting-based framework for dynamic scene surface reconstruction that handles large deformations and maintains temporal consistency through SDF flow regularization and overlapping segment partitioning.
Details
Motivation: Existing Gaussian Splatting methods for dynamic surface reconstruction are limited to single objects or small deformations, struggling with large deformations and temporal consistency in complex scenes.
Method: Proposes 4DSurf with two key innovations: 1) Gaussian deformations induced Signed Distance Function Flow Regularization to constrain Gaussian motion with evolving surfaces, and 2) Overlapping Segment Partitioning that divides sequences into overlapping segments with small deformations and incrementally passes geometric information.
Result: Outperforms state-of-the-art surface reconstruction methods by 49% on Hi4D dataset and 19% on CMU Panoptic dataset in Chamfer distance, achieving superior temporal consistency under sparse-view settings.
Conclusion: 4DSurf provides a unified framework for generic dynamic surface reconstruction that can handle large deformations and maintain temporal consistency without requiring prior knowledge of scene objects.
Abstract: This paper addresses the problem of dynamic scene surface reconstruction using Gaussian Splatting (GS), aiming to recover temporally consistent geometry. While existing GS-based dynamic surface reconstruction methods can yield superior reconstruction, they are typically limited to either a single object or objects with only small deformations, struggling to maintain temporally consistent surface reconstruction of large deformations over time. We propose ``\textit{4DSurf}'', a novel and unified framework for generic dynamic surface reconstruction that does not require specifying the number or types of objects in the scene and can handle large surface deformations and temporal inconsistency in reconstruction. The key innovation of our framework is the introduction of a Gaussian-deformation-induced Signed Distance Function Flow Regularization that constrains the motion of Gaussians to align with the evolving surface. To handle large deformations, we introduce an Overlapping Segment Partitioning strategy that divides the sequence into overlapping segments with small deformations and incrementally passes geometric information across segments through the shared overlapping timestep. Experiments on two challenging dynamic scene datasets, Hi4D and CMU Panoptic, demonstrate that our method outperforms state-of-the-art surface reconstruction methods by 49% and 19% in Chamfer distance, respectively, and achieves superior temporal consistency under sparse-view settings.
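The Overlapping Segment Partitioning step reduces to a simple index split. The sketch below assumes segments share a single overlapping timestep (the paper's exact overlap granularity is an assumption here), through which geometry is handed to the next segment.

```python
def overlapping_segments(num_frames, seg_len, overlap=1):
    """Split frame indices [0, num_frames) into segments of `seg_len` frames,
    each sharing `overlap` trailing frames with the next segment, so that
    geometric information can be passed forward through shared timesteps."""
    step = seg_len - overlap
    segs, start = [], 0
    while start + seg_len < num_frames:
        segs.append(list(range(start, start + seg_len)))
        start += step
    segs.append(list(range(start, num_frames)))  # final tail segment
    return segs
```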
[387] AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation
Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu, Quanhao Li, Hong-Tao Yu, Pandeng Li, Yuzheng Wang, Zhen Xing, Shiwei Zhang, Chen-Wei Xie, Yun Zheng, Xihui Liu
Main category: cs.CV
TL;DR: AIBench: A benchmark using VQA to evaluate logic correctness and VLMs to assess aesthetics of AI-generated academic illustrations, revealing performance gaps in complex reasoning and high-density generation.
Details
Motivation: Current image generation models have evolved rapidly, but their ability to produce ready-to-use academic illustrations for research papers remains unexplored. Existing evaluation methods using VLMs are unreliable for complex academic content with long texts and detailed illustrations.
Method: Proposes AIBench benchmark with VQA-based evaluation for logic correctness and VLM-based assessment for aesthetics. Uses four levels of questions derived from logic diagrams summarizing paper methods to query illustration-paper alignment at different scales.
Result: Performance gaps between models on academic illustration generation are significantly larger than on general tasks, reflecting varying complex reasoning and high-density generation abilities. Logic and aesthetics are hard to optimize simultaneously. Test-time scaling on both abilities significantly boosts performance.
Conclusion: AIBench provides a more accurate evaluation framework for academic illustration generation, revealing model limitations in complex reasoning and high-density content generation while showing that test-time scaling improves performance on this specialized task.
Abstract: Although image generation has boosted various applications via its rapid evolution, whether state-of-the-art models can produce ready-to-use academic illustrations for papers is still largely unexplored. Directly comparing or evaluating an illustration with a VLM is natural but requires oracle multi-modal understanding ability, which is unreliable for long and complex texts and illustrations. To address this, we propose AIBench, the first benchmark using VQA for evaluating the logical correctness of academic illustrations and VLMs for assessing aesthetics. In detail, we design four levels of questions derived from a logic diagram summarized from the method section of the paper, which query whether the generated illustration aligns with the paper at different scales. Our VQA-based approach yields more accurate and detailed evaluations of visual-logical consistency while relying less on the ability of the judge VLM. With our high-quality AIBench, we conduct extensive experiments and conclude that the performance gap between models on this task is significantly larger than on general tasks, reflecting their varying complex reasoning and high-density generation abilities. Further, logic and aesthetics are hard to optimize simultaneously, as in handcrafted illustrations. Additional experiments further show that test-time scaling on both abilities significantly boosts performance on this task.
[388] Attention Frequency Modulation: Training-Free Spectral Modulation of Diffusion Cross-Attention
Seunghun Oh, Unsang Park
Main category: cs.CV
TL;DR: AFM is a training-free method to control diffusion model outputs by modulating cross-attention frequencies during inference, enabling visual edits while preserving semantics.
Details
Motivation: Cross-attention in diffusion models lacks principled understanding and control mechanisms. The authors aim to characterize its multi-resolution dynamics and develop training-free methods for controlling visual outputs.
Method: Analyzes cross-attention as spatiotemporal signals, tracks Fourier power over denoising steps, then introduces Attention Frequency Modulation (AFM), which edits pre-softmax logits in the Fourier domain with frequency-specific reweighting schedules.
Result: AFM reliably redistributes attention spectra and produces substantial visual edits while preserving semantic alignment in Stable Diffusion, with entropy acting as adaptive gain rather than independent control.
Conclusion: Cross-attention exhibits consistent coarse-to-fine spectral progression, enabling principled frequency-based control through AFM without retraining or prompt editing.
Abstract: Cross-attention is the primary interface through which text conditions latent diffusion models, yet its step-wise multi-resolution dynamics remain under-characterized, limiting principled training-free control. We cast diffusion cross-attention as a spatiotemporal signal on the latent grid by summarizing token-softmax weights into token-agnostic concentration maps and tracking their radially binned Fourier power over denoising. Across prompts and seeds, encoder cross-attention exhibits a consistent coarse-to-fine spectral progression, yielding a stable time-frequency fingerprint of token competition. Building on this structure, we introduce Attention Frequency Modulation (AFM), a plug-and-play inference-time intervention that edits token-wise pre-softmax cross-attention logits in the Fourier domain: low- and high-frequency bands are reweighted with a progress-aligned schedule and can be adaptively gated by token-allocation entropy, before the token softmax. AFM provides a continuous handle to bias the spatial scale of token-competition patterns without retraining, prompt editing, or parameter updates. Experiments on Stable Diffusion show that AFM reliably redistributes attention spectra and produces substantial visual edits while largely preserving semantic alignment. Finally, we find that entropy mainly acts as an adaptive gain on the same frequency-based edit rather than an independent control axis.
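The core AFM operation, band-wise reweighting of a pre-softmax logit map in the Fourier domain, can be sketched with NumPy. The hard radial two-band split and the `cutoff_frac` parameter are simplifications, and the progress-aligned schedule and entropy gating are omitted; this is not the authors' implementation.

```python
import numpy as np

def frequency_modulate(logit_map, low_gain, high_gain, cutoff_frac=0.25):
    """Reweight a 2D pre-softmax attention-logit map in the Fourier domain:
    scale radial frequencies within `cutoff_frac` of the Nyquist radius by
    `low_gain` and the rest by `high_gain`, then transform back."""
    H, W = logit_map.shape
    F = np.fft.fftshift(np.fft.fft2(logit_map))   # DC moved to the center
    yy, xx = np.mgrid[0:H, 0:W]
    r = np.hypot(yy - H // 2, xx - W // 2)        # radial frequency per bin
    cutoff = cutoff_frac * (min(H, W) / 2)
    gains = np.where(r <= cutoff, low_gain, high_gain)
    return np.fft.ifft2(np.fft.ifftshift(F * gains)).real
```

Raising `high_gain` relative to `low_gain` biases token competition toward finer spatial scales, the continuous handle the abstract describes; the result would then pass through the usual token softmax.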
[389] GEMS: Agent-Native Multimodal Generation with Memory and Skills
Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, Yang Yang
Main category: cs.CV
TL;DR: GEMS is an agent-native multimodal generation framework with memory and skills that improves generation quality through iterative agent loops, persistent memory, and domain-specific skills.
Details
Motivation: Current multimodal generation models struggle with complex instructions and specialized downstream tasks, despite progress on general-purpose generation. The authors aim to overcome inherent limitations of foundational models through an agent-based approach.
Method: Three core components: 1) Agent Loop - structured multi-agent framework for iterative closed-loop optimization; 2) Agent Memory - persistent, hierarchical memory storing factual states and experiential summaries; 3) Agent Skill - extensible collection of domain-specific expertise with on-demand loading.
Result: GEMS achieves significant performance gains across five mainstream tasks and four downstream tasks on multiple generative backends. Notably enables lightweight 6B model Z-Image-Turbo to surpass state-of-the-art Nano Banana 2 on GenEval2.
Conclusion: The agent harness effectively extends model capabilities beyond their original limits, demonstrating the power of agent-native frameworks for multimodal generation.
Abstract: Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose \textbf{GEMS} (Agent-Native Multimodal \textbf{GE}neration with \textbf{M}emory and \textbf{S}kills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of agent harness in extending model capabilities beyond their original limits.
[390] Intelligent Road Condition Monitoring using 3D In-Air SONAR Sensing
Amber Cassimon, Robin Kerstens, Walter Daems, Jan Steckel
Main category: cs.CV
TL;DR: Using in-air 3D SONAR sensors for road condition monitoring, achieving 90% F1 for material classification but only 75% for damage detection.
Details
Motivation: Current road monitoring using cameras and LiDAR fails in harsh conditions (rain, fog, smoke). SONAR offers robust sensing for opportunistic monitoring by vehicles performing other tasks.
Method: Used a single dataset with annotated road damages and material labels. Applied SONAR sensor data for two tasks: 1) Classifying road materials (asphalt, concrete, element roads), 2) Detecting and classifying damage types without localization.
Result: Material classification achieved ~90% F1 score on test set. Damage detection and classification performed worse at ~75% F1 score.
Conclusion: SONAR sensing is promising for opportunistic pavement management systems but requires further research to improve damage detection accuracy.
Abstract: In this paper, we investigate the capabilities of in-air 3D SONAR sensors for the monitoring of road surface conditions. Concretely, we consider two applications: road material classification and road damage detection and classification. While such tasks can be performed with other sensor modalities, such as camera sensors and LiDAR sensors, these sensor modalities tend to fail in harsh sensing conditions, such as heavy rain, smoke or fog. By using a sensing modality that is robust to such interference, we enable the creation of opportunistic sensing applications, where vehicles performing other tasks (garbage collection, mail delivery, etc.) can also be used to monitor the condition of the road. For these tasks, we use a single dataset, in which different types of damages are annotated, with labels including the material of the road surface. In the material classification task, we differentiate between three different road materials: asphalt, concrete and element roads. In the damage detection and classification task, we determine if there is damage, and what type of damage (independent of material type), without localizing the damage. We are successful in determining the road surface type from SONAR sensor data, with F1 scores approaching 90% on the test set, but find that performance lags for damage detection, with F1 scores around 75%. From this, we conclude that SONAR sensing is a promising modality to include in opportunistic sensing-based pavement management systems, but that further research is needed to reach the desired accuracy.
[391] To View Transform or Not to View Transform: NeRF-based Pre-training Perspective
Hyeonjun Jeong, Juyeb Shin, Dongsuk Kum
Main category: cs.CV
TL;DR: NeRP3D: A novel NeRF-resembled point-based 3D detector that learns continuous 3D representations by preserving pre-trained NeRF networks, avoiding misaligned priors from view transformation in autonomous driving perception.
Details
Motivation: Current NeRF-based pre-training for 3D perception suffers from conflicting priors between discrete/rigid view transformation and continuous/adaptive radiance fields, resulting in blurry 3D representations. Additionally, discarding pre-trained NeRF networks during downstream tasks leads to inefficient utilization of enhanced 3D representations.
Method: Proposes NeRP3D, a NeRF-resembled point-based 3D detector that preserves pre-trained NeRF networks regardless of tasks. It learns continuous 3D representations to avoid misaligned priors from view transformation, enabling better scene reconstruction and detection.
Result: Experiments on nuScenes dataset show significant improvements over previous state-of-the-art methods, outperforming both pretext scene reconstruction tasks and downstream detection tasks.
Conclusion: NeRP3D successfully addresses the conflicting priors in NeRF-based pre-training for 3D perception by maintaining continuous 3D representation learning through preserved NeRF networks, leading to superior performance in both reconstruction and detection tasks.
Abstract: Neural radiance fields (NeRFs) have emerged as a prominent pre-training paradigm for vision-centric autonomous driving, which enhances 3D geometry and appearance understanding in a fully self-supervised manner. To apply NeRF-based pretraining to 3D perception models, recent approaches have simply applied NeRFs to volumetric features obtained from view transformation. However, coupling NeRFs with view transformation inherits conflicting priors; view transformation imposes discrete and rigid representations, whereas radiance fields assume continuous and adaptive functions. When these opposing assumptions are forced into a single pipeline, the misalignment surfaces as blurry and ambiguous 3D representations that ultimately limit 3D scene understanding. Moreover, the NeRF network for pre-training is discarded during downstream tasks, resulting in inefficient utilization of enhanced 3D representations through NeRF. In this paper, we propose a novel NeRF-Resembled Point-based 3D detector that can learn continuous 3D representation and thus avoid the misaligned priors from view transformation. NeRP3D preserves the pre-trained NeRF network regardless of the tasks, inheriting the principle of continuous 3D representation learning and leading to greater potential for both scene reconstruction and detection tasks. Experiments on the nuScenes dataset demonstrate that our proposed approach significantly improves upon previous state-of-the-art methods, outperforming them not only on pretext scene reconstruction tasks but also on downstream detection tasks.
[392] SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting
Alexander Prutsch, Christian Fruhwirth-Reisinger, David Schinagl, Horst Possegger
Main category: cs.CV
TL;DR: Novel streaming-based motion forecasting framework for dynamic traffic environments that handles heterogeneous observation lengths through incremental processing and instance-aware context streaming.
Details
Motivation: Motion forecasting models need continuous trajectory estimation in dynamic traffic, but streaming methods degrade with heterogeneous observation lengths. Need robust approach for evolving scenes.
Method: Incremental processing of observation windows with instance-aware context streaming to maintain/update latent agent representations across inference steps, plus dual training objective for consistent accuracy across observation horizons.
Result: Achieves state-of-the-art performance on Argoverse 2 multi-agent benchmark in streaming inference, maintains minimal latency, robust across Argoverse 2, nuScenes, and Argoverse 1 datasets.
Conclusion: Proposed framework effectively handles evolving scenes with heterogeneous observations, suitable for real-world deployment due to strong performance and low latency.
Abstract: In dynamic traffic environments, motion forecasting models must be able to accurately estimate future trajectories continuously. Streaming-based methods are a promising solution, but despite recent advances, their performance often degrades when exposed to heterogeneous observation lengths. To address this, we propose a novel streaming-based motion forecasting framework that explicitly focuses on evolving scenes. Our method incrementally processes incoming observation windows and leverages an instance-aware context streaming to maintain and update latent agent representations across inference steps. A dual training objective further enables consistent forecasting accuracy across diverse observation horizons. Extensive experiments on Argoverse 2, nuScenes, and Argoverse 1 demonstrate the robustness of our approach under evolving scene conditions and also on the single-agent benchmarks. Our model achieves state-of-the-art performance in streaming inference on the Argoverse 2 multi-agent benchmark, while maintaining minimal latency, highlighting its suitability for real-world deployment.
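The instance-aware context streaming above amounts to carrying a per-agent latent state across inference steps. A minimal sketch in Python, with an EMA blend standing in for the paper's learned update rule (the class name, momentum value, and update form are all illustrative assumptions, not SHARP's actual mechanism):

```python
class InstanceContextCache:
    """One latent vector per tracked agent, carried across streaming
    steps. The EMA blend stands in for a learned update; the momentum
    value is illustrative."""
    def __init__(self, momentum=0.8):
        self.momentum = momentum
        self.latents = {}  # agent_id -> latent vector (list of floats)

    def update(self, window_features):
        """window_features maps agent_id -> feature vector extracted
        from the newest short observation window."""
        m = self.momentum
        for aid, feat in window_features.items():
            if aid in self.latents:  # returning agent: blend old and new
                old = self.latents[aid]
                self.latents[aid] = [m * o + (1 - m) * f
                                     for o, f in zip(old, feat)]
            else:                    # newly appeared agent
                self.latents[aid] = list(feat)
        # drop agents absent from the current window (left the scene)
        for aid in list(self.latents):
            if aid not in window_features:
                del self.latents[aid]
```

This keeps memory bounded by the number of currently visible agents while giving each of them history beyond the short observation window.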
[393] Octree-based Learned Point Cloud Geometry Compression: A Lossy Perspective
Kaiyu Zheng, Wei Gao, Huiming Zheng
Main category: cs.CV
TL;DR: This paper proposes novel lossy compression methods for point clouds using octree-based context learning, with different approaches for object point clouds (leaf nodes lossy compression) and LiDAR point clouds (rate control method).
Details
Motivation: Traditional lossy compression using lossless octree representation with quantization adjustment causes severe distortions due to massive missing points. The paper aims to address this limitation by developing specialized lossy approaches for different point cloud types.
Method: 1) For object point clouds: proposed leaf nodes lossy compression method using bit-wise coding and binary prediction on leaf nodes. 2) For LiDAR point clouds: proposed variable rate approaches with a simple but effective rate control method.
Result: The leaf nodes lossy compression significantly outperforms previous octree-based methods on object point clouds. The rate control method achieves about 1% bit error without finetuning on LiDAR point clouds.
Conclusion: The paper demonstrates that specialized lossy compression approaches tailored to different point cloud characteristics (object vs LiDAR) can overcome limitations of traditional methods and achieve superior performance.
Abstract: Octree-based context learning has recently become a leading method in point cloud compression. However, its potential for lossy compression remains unexplored. The traditional lossy compression paradigm, which uses a lossless octree representation with quantization step adjustment, may result in severe distortions due to massive missing points in quantization. Therefore, we analyze the data characteristics of different point clouds and propose lossy approaches tailored to each. For object point clouds, which suffer under quantization step adjustment, we propose a new leaf nodes lossy compression method, which achieves lossy compression by performing bit-wise coding and binary prediction on leaf nodes. For LiDAR point clouds, we explore variable rate approaches and propose a simple but effective rate control method. Experimental results demonstrate that the proposed leaf nodes lossy compression method significantly outperforms the previous octree-based method on object point clouds, and the proposed rate control method achieves about 1% bit error without finetuning on LiDAR point clouds.
[394] RAWIC: Bit-Depth Adaptive Lossless Raw Image Compression
Chunhang Zheng, Tongda Xu, Mingli Xie, Yan Wang, Dou Li
Main category: cs.CV
TL;DR: RAWIC is a learned lossless compression framework for Bayer-pattern raw images that adapts to varying bit depths and camera characteristics, achieving better compression than traditional codecs.
Details
Motivation: Raw images contain valuable linear sensor data and high bit-depth information important for vision tasks, but are difficult to store due to large file sizes, varying bit depths, and sensor-dependent characteristics. Existing methods either target only 8-bit sRGB images or are lossy and camera-specific.
Method: Convert single-channel Bayer data to four-channel RGGB format, partition into patches, compute each patch’s bit depth as auxiliary input, and design a bit-depth-adaptive entropy model to estimate patch distributions conditioned on their bit depths. This enables a single model to handle diverse cameras and bit depths.
Result: RAWIC consistently surpasses traditional lossless codecs, achieving an average 7.7% bitrate reduction over JPEG-XL.
Conclusion: RAWIC provides an effective learned lossless compression solution for raw images that handles varying bit depths and camera characteristics with a single model.
Abstract: Raw images preserve linear sensor measurements and high bit-depth information crucial for advanced vision tasks and photography applications, yet their storage remains challenging due to large file sizes, varying bit depths, and sensor-dependent characteristics. Existing learned lossless compression methods mainly target 8-bit sRGB images, while raw reconstruction approaches are inherently lossy and rely on camera-specific assumptions. To address these challenges, we introduce RAWIC, a bit-depth-adaptive learned lossless compression framework for Bayer-pattern raw images. We first convert single-channel Bayer data into a four-channel RGGB format and partition it into patches. For each patch, we compute its bit depth and use it as auxiliary input to guide compression. A bit-depth-adaptive entropy model is then designed to estimate patch distributions conditioned on their bit depths. This architecture enables a single model to handle raw images from diverse cameras and bit depths. Experiments show that RAWIC consistently surpasses traditional lossless codecs, achieving an average 7.7% bitrate reduction over JPEG-XL. Our code is available at https://github.com/chunbaobao/RAWIC.
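The two preprocessing steps RAWIC describes, packing the Bayer mosaic into four RGGB planes and computing a per-patch bit depth, are concrete enough to sketch. A minimal Python illustration; the function names are mine, and using the smallest bit depth that covers the patch's peak value is an assumption, since the paper summary does not specify how the bit depth is computed:

```python
import numpy as np

def bayer_to_rggb(raw):
    """Pack a single-channel RGGB Bayer mosaic (H, W) into a
    four-channel (H/2, W/2, 4) array: R, G1, G2, B planes."""
    r  = raw[0::2, 0::2]
    g1 = raw[0::2, 1::2]
    g2 = raw[1::2, 0::2]
    b  = raw[1::2, 1::2]
    return np.stack([r, g1, g2, b], axis=-1)

def patch_bit_depth(patch):
    """Smallest bit depth that covers the patch's maximum sample value
    (one plausible definition of 'bit depth' as auxiliary input)."""
    peak = int(patch.max())
    return max(1, peak.bit_length())
```

The bit depth would then condition the entropy model, so a single network can serve sensors that emit 10-, 12-, or 14-bit raws.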
[395] Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification
Yangmei Chen, Zhongyuan Zhang, Xikun Zhang, Xinyu Hao, Mingliang Hou, Renqiang Luo, Ziqi Xu
Main category: cs.CV
TL;DR: PEMV-thyroid is a prototype-enhanced multi-view learning framework for robust thyroid nodule classification in ultrasound imaging that addresses domain heterogeneity across different devices and clinical environments.
Details
Motivation: Existing deep learning methods for thyroid ultrasound classification perform well on in-distribution data but lack robustness when deployed across different ultrasound devices or clinical environments due to pronounced image heterogeneity, which causes models to learn spurious correlations rather than reliable diagnostic cues.
Method: Proposes PEMV-thyroid, a Prototype-Enhanced Multi-View learning framework that learns complementary representations from multiple feature perspectives and refines decision boundaries through a prototype-based correction mechanism with mixed prototype information to account for data heterogeneity.
Result: Extensive experiments on multiple thyroid ultrasound datasets show PEMV-thyroid consistently outperforms state-of-the-art methods, particularly in cross-device and cross-domain evaluation scenarios, leading to improved diagnostic accuracy and generalization in real-world clinical settings.
Conclusion: The proposed framework effectively addresses domain heterogeneity in medical imaging by integrating multi-view representations with prototype-level guidance, enabling more stable representation learning under heterogeneous imaging conditions for improved clinical deployment.
Abstract: Thyroid nodule classification using ultrasound imaging is essential for early diagnosis and clinical decision-making; however, despite promising performance on in-distribution data, existing deep learning methods often exhibit limited robustness and generalisation when deployed across different ultrasound devices or clinical environments. This limitation is mainly attributed to the pronounced heterogeneity of thyroid ultrasound images, which can lead models to capture spurious correlations rather than reliable diagnostic cues. To address this challenge, we propose PEMV-thyroid, a Prototype-Enhanced Multi-View learning framework that accounts for data heterogeneity by learning complementary representations from multiple feature perspectives and refining decision boundaries through a prototype-based correction mechanism with mixed prototype information. By integrating multi-view representations with prototype-level guidance, the proposed approach enables more stable representation learning under heterogeneous imaging conditions. Extensive experiments on multiple thyroid ultrasound datasets demonstrate that PEMV-thyroid consistently outperforms state-of-the-art methods, particularly in cross-device and cross-domain evaluation scenarios, leading to improved diagnostic accuracy and generalisation performance in real-world clinical settings. The source code is available at https://github.com/chenyangmeii/Prototype-Enhanced-Multi-View-Learning.
[396] Contour-Guided Query-Based Feature Fusion for Boundary-Aware and Generalizable Cardiac Ultrasound Segmentation
Zahid Ullah, Sieun Choi, Jihie Kim
Main category: cs.CV
TL;DR: CGQR-Net: A contour-guided query refinement network for boundary-aware cardiac ultrasound segmentation that integrates structural priors with multi-resolution features to address challenges like low contrast, speckle noise, and domain shifts.
Details
Motivation: Cardiac ultrasound segmentation is essential for ventricular function assessment but faces challenges including low contrast, speckle noise, irregular boundaries, and domain shifts across devices/patients. Appearance-driven methods often fail to preserve boundary precision and structural consistency under these conditions.
Method: Proposes Contour-Guided Query Refinement Network (CGQR-Net) with HRNet backbone for multi-resolution features. Generates coarse segmentation, extracts anatomical contours, encodes them into learnable query embeddings. Uses cross-attention between contour-guided queries and fused feature maps for structure-aware refinement. Employs dual-head supervision for joint segmentation and boundary prediction optimization.
Result: Evaluated on CAMUS dataset and validated on CardiacNet for cross-dataset generalization. Demonstrates improved segmentation accuracy, enhanced boundary precision, and robust performance across varying imaging conditions.
Conclusion: Integrating contour-level structural information with feature-level representations is effective for reliable cardiac ultrasound segmentation, addressing challenges of low contrast, noise, and domain shifts.
Abstract: Accurate cardiac ultrasound segmentation is essential for reliable assessment of ventricular function in intelligent healthcare systems. However, echocardiographic images are challenging due to low contrast, speckle noise, irregular boundaries, and domain shifts across devices and patient populations. Existing methods, largely based on appearance-driven learning, often fail to preserve boundary precision and structural consistency under these conditions. To address these issues, we propose a Contour-Guided Query Refinement Network (CGQR-Net) for boundary-aware cardiac ultrasound segmentation. The framework integrates multi-resolution feature representations with contour-derived structural priors. An HRNet backbone preserves high-resolution spatial details while capturing multi-scale context. A coarse segmentation is first generated, from which anatomical contours are extracted and encoded into learnable query embeddings. These contour-guided queries interact with fused feature maps via cross-attention, enabling structure-aware refinement that improves boundary delineation and reduces noise artifacts. A dual-head supervision strategy jointly optimizes segmentation and boundary prediction to enforce structural consistency. The proposed method is evaluated on the CAMUS dataset and further validated on the CardiacNet dataset to assess cross-dataset generalization. Experimental results demonstrate improved segmentation accuracy, enhanced boundary precision, and robust performance across varying imaging conditions. These results highlight the effectiveness of integrating contour-level structural information with feature-level representations for reliable cardiac ultrasound segmentation.
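The refinement step, contour-derived queries attending over fused feature maps, is ordinary scaled dot-product cross-attention. A schematic single-head version in NumPy (no learned projections; CGQR-Net's actual multi-head, projected attention would add weight matrices per head):

```python
import numpy as np

def cross_attention(queries, features):
    """Scaled dot-product cross-attention: contour-derived queries
    (Nq, d) attend over flattened feature-map tokens (Nt, d) and
    return refined queries (Nq, d). Single-head, no projections --
    a schematic of the structure-aware refinement step."""
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)    # (Nq, Nt)
    scores -= scores.max(axis=-1, keepdims=True)  # numeric stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over tokens
    return attn @ features
```

Each refined query is a convex combination of feature tokens, weighted toward locations that resemble the contour it encodes.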
[397] MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding
Guangjing Yang, Ziyuan Qin, Chaoran Zhang, Chenlin Du, Jinlin Wang, Wanran Sun, Zhenyu Zhang, Bing Ji, Qicheng Lao
Main category: cs.CV
TL;DR: MedLoc-R1 is a performance-aware reward scheduling framework for medical visual grounding that addresses reward sparsity in RL by progressively tightening reward criteria based on model readiness.
Details
Motivation: Existing RL approaches for medical visual grounding suffer from severe reward sparsity due to difficulty localizing small/ambiguous regions and rigid IoU-based reward schemes, leading to vanishing gradients and stagnated optimization.
Method: Proposes MedLoc-R1 with sliding-window performance tracker and multi-condition update rule that automatically adjusts the reward schedule from dense, easily obtainable signals to stricter localization requirements, preserving GRPO properties without auxiliary networks.
Result: Experiments on three medical visual grounding benchmarks show MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines.
Conclusion: MedLoc-R1 offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications.
Abstract: Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization (GRPO) suffer from severe reward sparsity when directly applied to medical images, primarily due to the inherent difficulty of localizing small or ambiguous regions of interest, which is further exacerbated by the rigid and suboptimal nature of fixed IoU-based reward schemes in RL. This leads to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically adjust the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code & checkpoints are available at https://github.com/MembrAI/MedLoc-R1.
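The core scheduling idea, track recent localization success in a sliding window and tighten the IoU threshold once the model is ready, can be sketched as follows. This is a simplification: the single promotion rule below stands in for MedLoc-R1's multi-condition update, and every numeric value (start/final thresholds, step, window size, promotion rate) is an illustrative placeholder:

```python
from collections import deque

class CurriculumIoUScheduler:
    """Sliding-window tracker that tightens the IoU reward threshold
    once recent rollouts clear the current one often enough.
    All hyperparameters here are illustrative, not the paper's."""
    def __init__(self, start=0.3, final=0.7, step=0.1,
                 window=100, promote_rate=0.6):
        self.thr, self.final, self.step = start, final, step
        self.hits = deque(maxlen=window)   # recent pass/fail outcomes
        self.promote_rate = promote_rate

    def reward(self, iou):
        hit = iou >= self.thr
        self.hits.append(hit)
        # promote once the window is full and the success rate is high
        if (len(self.hits) == self.hits.maxlen
                and sum(self.hits) / len(self.hits) >= self.promote_rate
                and self.thr < self.final):
            self.thr = min(self.final, self.thr + self.step)
            self.hits.clear()              # restart tracking at new level
        return 1.0 if hit else 0.0
```

Early in training most rollouts clear the loose threshold, so rewards are dense; the criterion only hardens as the policy demonstrably keeps up.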
[398] Optimized Weighted Voting System for Brain Tumor Classification Using MRI Images
Ha Anh Vu
Main category: cs.CV
TL;DR: A weighted ensemble learning approach combining deep learning and traditional ML models for brain tumor classification from MRI scans, achieving state-of-the-art accuracy through weighted voting and image enhancement techniques.
Details
Motivation: Accurate brain tumor classification from MRI scans is crucial for effective diagnosis and treatment planning. Current methods may have limitations in performance, motivating the development of more robust ensemble approaches.
Method: Proposes a weighted ensemble learning system integrating multiple classifiers: ResNet101, DenseNet121, Xception, CNN-MRI, ResNet50 with edge-enhanced images, SVM, and KNN with HOG features. Uses weighted voting mechanism giving higher influence to more accurate models. Applies image processing techniques including Balance Contrast Enhancement, K-means clustering, and Canny edge detection for feature enhancement.
Result: Experimental evaluations on Figshare and Kaggle MRI datasets show the proposed method achieves state-of-the-art accuracy, outperforming existing models.
Conclusion: The ensemble-based learning approach demonstrates potential for improving brain tumor classification, offering a reliable and scalable framework for medical image analysis.
Abstract: The accurate classification of brain tumors from MRI scans is essential for effective diagnosis and treatment planning. This paper presents a weighted ensemble learning approach that combines deep learning and traditional machine learning models to improve classification performance. The proposed system integrates multiple classifiers, including ResNet101, DenseNet121, Xception, CNN-MRI, and ResNet50 with edge-enhanced images, SVM, and KNN with HOG features. A weighted voting mechanism assigns higher influence to models with better individual accuracy, ensuring robust decision-making. Image processing techniques such as Balance Contrast Enhancement, K-means clustering, and Canny edge detection are applied to enhance feature extraction. Experimental evaluations on the Figshare and Kaggle MRI datasets demonstrate that the proposed method achieves state-of-the-art accuracy, outperforming existing models. These findings highlight the potential of ensemble-based learning for improving brain tumor classification, offering a reliable and scalable framework for medical image analysis.
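The weighted voting mechanism is straightforward to illustrate. The paper summary does not give the exact weight formula, so the sketch below makes one plausible choice, soft voting with weights proportional to each model's validation accuracy:

```python
import numpy as np

def weighted_vote(prob_list, accuracies):
    """Fuse per-model class-probability vectors with weights
    proportional to each model's validation accuracy (one plausible
    weighting; the paper's exact formula is not specified).
    Returns the winning class index and the fused distribution."""
    w = np.asarray(accuracies, dtype=float)
    w = w / w.sum()                        # normalize weights to sum to 1
    stacked = np.stack(prob_list)          # (n_models, n_classes)
    fused = (w[:, None] * stacked).sum(0)  # weighted average of probs
    return int(fused.argmax()), fused
```

A more accurate model (e.g. ResNet101) thus pulls the fused distribution toward its prediction harder than a weaker one, which is the "higher influence to more accurate models" behavior described above.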
[399] SVGS: Single-View to 3D Object Editing via Gaussian Splatting
Pengcheng Xue, Yan Tian, Qiutao Song, Ziyi Wang, Linyang He, Weiping Ding, Mahmoud Hassaballah, Karen Egiazarian, Wei-Fa Yang, Leszek Rutkowski
Main category: cs.CV
TL;DR: SVGS: A single-view text-driven 3D scene editing method using 3D Gaussian Splatting that improves editing consistency and efficiency compared to multi-view approaches.
Details
Motivation: Existing text-driven 3D scene editing methods using implicit representations like NeRF suffer from slow processing speeds and limited region control. Multi-view editing approaches often produce inconsistent results across views, making it difficult to balance editing consistency with efficiency.
Method: Proposes SVGS (Single-View to 3D Object Editing via Gaussian Splatting) using 3D Gaussian Splatting as the 3D representation. Introduces a single-view editing strategy based on multi-view diffusion models that reconstructs 3D scenes using only views with consistent editing results.
Result: SVGS outperforms existing baseline methods (including Instruct-NeRF2NeRF and GaussianEditor) across various scene settings in both editing capability and processing speed.
Conclusion: SVGS represents a significant advancement in 3D editing technology by addressing consistency and efficiency challenges in text-driven 3D scene editing through single-view strategy and efficient 3D Gaussian Splatting representation.
Abstract: Text-driven 3D scene editing has attracted considerable interest due to its convenience and user-friendliness. However, methods that rely on implicit 3D representations, such as Neural Radiance Fields (NeRF), while effective in rendering complex scenes, are hindered by slow processing speeds and limited control over specific regions of the scene. Moreover, existing approaches, including Instruct-NeRF2NeRF and GaussianEditor, which utilize multi-view editing strategies, frequently produce inconsistent results across different views when executing text instructions. This inconsistency can adversely affect the overall performance of the model, complicating the task of balancing the consistency of editing results with editing efficiency. To address these challenges, we propose a novel method termed Single-View to 3D Object Editing via Gaussian Splatting (SVGS), which is a single-view text-driven editing technique based on 3D Gaussian Splatting (3DGS). Specifically, in response to text instructions, we introduce a single-view editing strategy grounded in multi-view diffusion models, which reconstructs 3D scenes by leveraging only those views that yield consistent editing results. Additionally, we employ sparse 3D Gaussian Splatting as the 3D representation, which significantly enhances editing efficiency. We conducted a comparative analysis of SVGS against existing baseline methods across various scene settings, and the results indicate that SVGS outperforms its counterparts in both editing capability and processing speed, representing a significant advancement in 3D editing technology. For further details, please visit our project page at: https://amateurc.github.io/svgs.github.io.
[400] Robust Remote Sensing Image-Text Retrieval with Noisy Correspondence
Qiya Song, Yiqiang Xie, Yuan Sun, Renwei Dian, Xudong Kang
Main category: cs.CV
TL;DR: Proposes RRSITR, a robust remote sensing image-text retrieval method that handles noisy correspondences using self-paced learning and robust triplet loss.
Details
Motivation: Existing RSITR methods assume perfectly matched image-text pairs, but real-world remote sensing datasets contain noisy correspondences due to expensive data collection and inaccurate descriptions.
Method: 1) Divides training pairs into clean, ambiguous, and noisy categories based on loss magnitude; 2) Estimates reliability via loss-based weighting; 3) Uses multi-modal self-paced function to regulate training sequence; 4) Implements robust triplet loss with dynamic soft margin adjustment.
Result: Extensive experiments on three benchmark datasets show RRSITR significantly outperforms state-of-the-art methods, especially at high noise rates.
Conclusion: The proposed RRSITR paradigm effectively addresses noisy correspondence in RSITR through self-paced learning and robust loss functions, demonstrating superior performance.
Abstract: As a pivotal task that bridges remote visual and linguistic understanding, Remote Sensing Image-Text Retrieval (RSITR) has attracted considerable research interest in recent years. However, almost all RSITR methods implicitly assume that image-text pairs are matched perfectly. In practice, acquiring a large set of well-aligned data pairs is often prohibitively expensive or even infeasible. In addition, we also notice that remote sensing datasets (e.g., RSITMD) indeed contain some inaccurate or mismatched image-text descriptions. Based on the above observations, we reveal an important but untouched problem in RSITR, i.e., Noisy Correspondence (NC). To overcome these challenges, we propose a novel Robust Remote Sensing Image-Text Retrieval (RRSITR) paradigm that designs a self-paced learning strategy to mimic human cognitive learning patterns, thereby learning from easy to hard from multi-modal data with NC. Specifically, we first divide all training sample pairs into three categories based on the loss magnitude of each pair, i.e., clean sample pairs, ambiguous sample pairs, and noisy sample pairs. Then, we estimate the reliability of each training pair by assigning it a weight based on the value of its loss. Further, we design a new multi-modal self-paced function to dynamically regulate the training sequence and weights of the samples, thus establishing a progressive learning process. Finally, for noisy sample pairs, we present a robust triplet loss to dynamically adjust the soft margin based on semantic similarity, thereby enhancing the robustness against noise. Extensive experiments on three popular benchmark datasets demonstrate that the proposed RRSITR significantly outperforms the state-of-the-art methods, especially at high noise rates. The code is available at: https://github.com/MSFLabX/RRSITR
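The robust triplet loss with a dynamic soft margin can be sketched on similarity scores. The exact soft-margin rule is not given in the summary, so shrinking the margin in proportion to the positive pair's similarity (so likely-noisy, low-similarity pairs are penalized less aggressively) is an illustrative assumption:

```python
def soft_margin_triplet(sim_pos, sim_neg, base_margin=0.2):
    """Triplet loss on similarity scores with a margin that shrinks
    when the positive pair's similarity is low (a proxy for noisy
    correspondence). Scaling the margin by sim_pos is an illustrative
    choice, not the paper's exact rule."""
    margin = base_margin * min(max(sim_pos, 0.0), 1.0)  # dynamic soft margin
    return max(0.0, margin + sim_neg - sim_pos)         # standard hinge form
```

With a fixed margin, a mislabeled pair would be pushed hard toward an impossible target; the soft margin caps how much gradient such pairs can contribute.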
[401] $R_{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation
Linqian Fan, Peiqin Sun, Tiancheng Wen, Shun Lu, Chengru Song
Main category: cs.CV
TL;DR: A novel paradigm that reconceptualizes distribution matching as a reward (R_dm) to bridge diffusion distillation with reinforcement learning, enabling more stable optimization and efficient sampling for real-time high-fidelity synthesis.
Details
Motivation: Diffusion models have slow iterative sampling, and while distillation helps with few-step generation, traditional objectives limit performance by anchoring students solely to teachers. Recent RL approaches use simple summation of objectives, but there's a need for a more unified framework that bridges diffusion matching distillation with RL principles.
Method: Proposes reconceptualizing distribution matching as a reward (R_dm), introducing Group Normalized Distribution Matching (GNDM) to stabilize R_dm estimation using group-mean statistics. The framework supports adaptive weighting for combining DMD with external rewards and incorporates importance sampling for efficiency.
Result: GNDM outperforms vanilla DMD by reducing FID by 1.87. The multi-reward variant GNDMR achieves peak HPS of 30.37 and low FID-SD of 12.21, balancing aesthetic quality and fidelity better than existing baselines.
Conclusion: R_dm provides a flexible, stable, and efficient framework for real-time high-fidelity synthesis by unifying diffusion matching distillation with reinforcement learning principles through a reward-centric formulation.
Abstract: Diffusion models achieve state-of-the-art generative performance but are fundamentally bottlenecked by their slow iterative sampling process. While diffusion distillation techniques enable high-fidelity few-step generation, traditional objectives often restrict the student’s performance by anchoring it solely to the teacher. Recent approaches have attempted to break this ceiling by integrating Reinforcement Learning (RL), typically through a simple summation of distillation and RL objectives. In this work, we propose a novel paradigm by reconceptualizing distribution matching as a reward, denoted as $R_{dm}$. This unified perspective bridges the algorithmic gap between Diffusion Matching Distillation (DMD) and RL, providing several key benefits. (1) Enhanced optimization stability: we introduce Group Normalized Distribution Matching (GNDM), which adapts standard RL group normalization to stabilize $R_{dm}$ estimation. By leveraging group-mean statistics, GNDM establishes a more robust and effective optimization direction. (2) Seamless reward integration: our reward-centric formulation inherently supports adaptive weighting mechanisms, allowing flexible combination of DMD with external reward models. (3) Improved sampling efficiency: by aligning with RL principles, the framework readily incorporates importance sampling (IS), leading to a significant boost in sampling efficiency. Extensive experiments demonstrate that GNDM outperforms vanilla DMD, reducing the FID by 1.87. Furthermore, our multi-reward variant, GNDMR, surpasses existing baselines by achieving a strong balance between aesthetic quality and fidelity, reaching a peak HPS of 30.37 and a low FID-SD of 12.21. Overall, $R_{dm}$ provides a flexible, stable, and efficient framework for real-time high-fidelity synthesis. Code will be released upon publication.
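The group normalization GNDM borrows from RL has a standard form: each sample's reward is centered and scaled by its group's statistics. A schematic version applied to a group of distribution-matching rewards $R_{dm}$ (normalizing by both mean and standard deviation, as in common GRPO practice; GNDM's exact estimator may differ):

```python
import numpy as np

def group_normalize(rewards, eps=1e-8):
    """GRPO-style group normalization: each reward becomes its
    advantage relative to the group mean, scaled by the group std.
    Here the group would be R_dm values for samples sharing a prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Normalizing within the group removes the (possibly large, drifting) absolute scale of the distribution-matching signal, which is what gives the more robust optimization direction claimed above.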
[402] BlankSkip: Early-exit Object Detection onboard Nano-drones
Carlo Marra, Beatrice Alessandra Motetti, Alessio Burrello, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari
Main category: cs.CV
TL;DR: BlankSkip: Adaptive object detection network for nano-drones using early-exit classification to skip frames with no objects, improving throughput by 24% with minimal accuracy loss.
Details
Motivation: Nano-drones have extreme computational constraints (~10 MiB memory, 1W power) requiring efficient DNNs. While early-exit adaptive networks work well for classification, applying them to dense tasks like object detection is challenging. Need to reduce average inference latency for object detection on resource-constrained nano-drones.
Method: Proposes BlankSkip, an adaptive object detection network that uses a simple auxiliary classification task for early exit. The system identifies frames with no objects of interest and skips full detection processing for those frames, reducing computational effort for “blank” frames.
Result: Achieves up to 24% average throughput improvement with only 0.015 mean Average Precision (mAP) drop compared to static MobileNet-SSD detector. Tested on real-world nano-drone platform (Bitcraze Crazyflie 2.1) using state-of-the-art nano-drones object detection dataset.
Conclusion: BlankSkip successfully applies early-exit adaptive networks to object detection for nano-drones, demonstrating significant throughput improvements with minimal accuracy degradation, enabling more efficient on-device object detection for resource-constrained platforms.
Abstract: Deploying tiny computer vision Deep Neural Networks (DNNs) on-board nano-sized drones is key for achieving autonomy, but is complicated by the extremely tight constraints of their computational platforms (approximately 10 MiB memory, 1 W power budget). Early-exit adaptive DNNs that dial down the computational effort for “easy-to-process” input frames represent a promising way to reduce the average inference latency. However, while this approach is extensively studied for classification, its application to dense tasks like object detection (OD) is not straightforward. In this paper, we propose BlankSkip, an adaptive network for on-device OD that leverages a simple auxiliary classification task for early exit, i.e., identifying frames with no objects of interest. With experiments using a real-world nano-drone platform, the Bitcraze Crazyflie 2.1, we achieve up to 24% average throughput improvement with a limited 0.015 mean Average Precision (mAP) drop compared to a static MobileNet-SSD detector, on a state-of-the-art nano-drones OD dataset.
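The early-exit rule described above is simple enough to sketch (the classifier and detector below are hypothetical callables standing in for BlankSkip's auxiliary head and MobileNet-SSD backbone, not the paper's networks):

```python
def detect_adaptive(frame, blank_classifier, detector, threshold=0.9):
    """Run a cheap auxiliary 'blank' classifier first and skip the full
    detector on frames judged to contain no objects of interest."""
    p_blank = blank_classifier(frame)
    if p_blank >= threshold:
        return []              # early exit: treat the frame as blank
    return detector(frame)     # fall through to the full detection pass

# Toy usage with stub models: a confidently-blank frame exits early.
boxes = detect_adaptive("frame", lambda f: 0.95, lambda f: [("drone", 0.8)])
```

The throughput gain then depends on how often frames are blank and on how much cheaper the auxiliary head is than the detector.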
[403] ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS Models
Yuhuan Xie, Aoxuan Pan, Yi-Hua Huang, Chirui Chang, Peng Dai, Xin Yu, Xiaojuan Qi
Main category: cs.CV
TL;DR: ObjectMorpher is a 3D-aware image editing framework that converts 2D edits into geometry-grounded operations using 3D Gaussian Splatting for precise object-level control.
Details
Motivation: Existing 2D image editing methods lack 3D awareness and produce ambiguous results, while 3D-aware approaches rely on heavy optimization or incomplete monocular reconstructions, creating a need for precise, interactive object-level control in image editing.
Method: ObjectMorpher lifts target instances from 2D images to editable 3D Gaussian Splatting representations using an image-to-3D generator. Users interact via control points, and the system applies graph-based non-rigid deformation with ARAP constraints for physically sensible shape/pose changes, followed by composite diffusion for seamless reintegration.
Result: ObjectMorpher achieves fine-grained, photorealistic edits across diverse categories, outperforming both 2D drag and 3D-aware baselines on metrics including KID, LPIPS, SIFID, and user preference, with superior controllability and efficiency.
Conclusion: ObjectMorpher provides a unified, interactive framework for geometry-grounded image editing that bridges the gap between 2D and 3D approaches, enabling precise object-level control with physically sensible results.
Abstract: Achieving precise, object-level control in image editing remains challenging: 2D methods lack 3D awareness and often yield ambiguous or implausible results, while existing 3D-aware approaches rely on heavy optimization or incomplete monocular reconstructions. We present ObjectMorpher, a unified, interactive framework that converts ambiguous 2D edits into geometry-grounded operations. ObjectMorpher lifts target instances with an image-to-3D generator into editable 3D Gaussian Splatting (3DGS), enabling fast, identity-preserving manipulation. Users drag control points; a graph-based non-rigid deformation with as-rigid-as-possible (ARAP) constraints ensures physically sensible shape and pose changes. A composite diffusion module harmonizes lighting, color, and boundaries for seamless reintegration. Across diverse categories, ObjectMorpher delivers fine-grained, photorealistic edits with superior controllability and efficiency, outperforming 2D drag and 3D-aware baselines on KID, LPIPS, SIFID, and user preference.
[404] Event-Based Method for High-Speed 3D Deformation Measurement under Extreme Illumination Conditions
Banglei Guan, Yifei Bian, Zibin Liu, Haoyang Li, Xuanyu Bai, Taihang Lei, Bin Li, Yang Shang, Qifeng Yu
Main category: cs.CV
TL;DR: Event camera array method for high-speed 3D deformation monitoring of structures under extreme illumination conditions
Details
Motivation: Traditional cameras struggle with extreme illumination conditions (overexposure, limited dynamic range) when monitoring high-speed 3D deformation of large engineering structures like space launch towers and bridges. Event cameras offer better dynamic range and low latency for such applications.
Method: Multi-event camera array approach combining asynchronous event stream analysis with temporal correlation to extract marker centers, rapid calibration using Kruppa equations with parameter optimization, and unified coordinate transformation with linear intersection for 3D deformation measurement.
Result: Relative measurement error below 0.08%, successful field experiments under extreme illumination conditions including self-calibration of camera array and 3D deformation measurement.
Conclusion: The method overcomes traditional camera limitations for high-speed 3D deformation measurement under extreme illumination, achieving accurate measurements with less than 0.1% relative error under harsh lighting conditions.
Abstract: Background: Large engineering structures, such as space launch towers and suspension bridges, are subjected to extreme forces that cause high-speed 3D deformation and compromise safety. These structures typically operate under extreme illumination conditions. Traditional cameras often struggle to handle strong light intensity, leading to overexposure due to their limited dynamic range. Objective: Event cameras have emerged as a compelling alternative to traditional cameras in high dynamic range and low-latency applications. This paper presents an integrated method, from calibration to measurement, using a multi-event camera array for high-speed 3D deformation monitoring of structures in extreme illumination conditions. Methods: Firstly, the proposed method combines the characteristics of the asynchronous event stream and temporal correlation analysis to extract the corresponding marker center point. Subsequently, the method achieves rapid calibration by solving the Kruppa equations in conjunction with a parameter optimization framework. Finally, by employing a unified coordinate transformation and linear intersection, the method enables the measurement of 3D deformation of the target structure. Results: Experiments confirmed that the relative measurement error is below 0.08%. Field experiments under extreme illumination conditions, including self-calibration of a multi-event camera array and 3D deformation measurement, verified the performance of the proposed method. Conclusions: This paper addressed the critical limitation of traditional cameras in measuring high-speed 3D deformations under extreme illumination conditions. The experimental results demonstrate that, compared to other methods, the proposed method can accurately measure 3D deformations of structures under harsh lighting conditions, and the relative error of the measured deformation is less than 0.1%.
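The final "linear intersection" step amounts to intersecting viewing rays from multiple cameras. A minimal two-ray version (midpoint of the closest points between two 3D rays) can be sketched as follows; this is an illustrative stand-in, as the paper's multi-camera formulation is more general:

```python
def intersect_rays(p1, d1, p2, d2):
    """Return the midpoint of the closest points between two 3D rays
    p_i + t * d_i, i.e. the least-squares 'intersection' of the rays."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    w = [b - a for a, b in zip(p1, p2)]          # p2 - p1
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b                        # zero iff rays are parallel
    t = (c * d - b * e) / denom
    s = (b * d - a * e) / denom
    q1 = [p + t * v for p, v in zip(p1, d1)]     # closest point on ray 1
    q2 = [p + s * v for p, v in zip(p2, d2)]     # closest point on ray 2
    return [(u + v) / 2 for u, v in zip(q1, q2)]

# Two camera rays that meet at (1, 1, 1):
point = intersect_rays((0, 0, 0), (1, 1, 1), (2, 0, 0), (-1, 1, 1))
```

Tracking this intersection for each marker over time yields the 3D deformation trajectory.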
[405] ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization
Bingchen Li, Zhixin Wang, Fan Li, Jiaqi Xu, Jiaming Guo, Renjing Pei, Xin Li, Zhibo Chen
Main category: cs.CV
TL;DR: A diffusion-based framework for old photo colorization using structure-color decoupling, progressive DPO, and visual semantic prompts to overcome domain gaps in historical photo restoration.
Details
Motivation: Old photos have unique degradation patterns (faded brightness, altered color hues) that create domain gaps with modern photos, making accurate colorization challenging for existing restoration models.
Method: Uses FLUX diffusion model with structure-color decoupling strategy, progressive Direct Preference Optimization (Pro-DPO) for color preference learning, and visual semantic prompts to extract fine-grained semantic information from old photos.
Result: Outperforms state-of-the-art colorization methods on both synthetic and real datasets, including closed-source commercial models, producing high-quality and vivid colorization.
Conclusion: The proposed framework effectively addresses the domain gap in old photo colorization through innovative strategies for structure preservation, color restoration, and semantic understanding.
Abstract: Old photos preserve invaluable historical memories, making their restoration and colorization highly desirable. While existing restoration models can address some degradation issues like denoising and scratch removal, they often struggle with accurate colorization. This limitation arises from the unique degradation inherent in old photos, such as faded brightness and altered color hues, which are different from modern photo distributions, creating a substantial domain gap during colorization. In this paper, we propose a novel old photo colorization framework based on the generative diffusion model FLUX. Our approach introduces a structure-color decoupling strategy that separates structure preservation from color restoration, enabling accurate colorization of old photos while maintaining structural consistency. We further enhance the model with a progressive Direct Preference Optimization (Pro-DPO) strategy, which allows the model to learn subtle color preferences through coarse-to-fine transitions in color augmentation. Additionally, we address the limitations of text-based prompts by introducing visual semantic prompts, which extract fine-grained semantic information directly from old photos, helping to eliminate the color bias inherent in old photos. Experimental results on both synthetic and real datasets demonstrate that our approach outperforms existing state-of-the-art colorization methods, including closed-source commercial models, producing high-quality and vivid colorization.
[406] Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
Mih Dinh, SouYoung Jin
Main category: cs.CV
TL;DR: Unsafe2Safe is an automated pipeline that detects privacy-sensitive images and rewrites only sensitive regions using multimodal diffusion editing to create privacy-safe datasets while preserving visual utility.
Details
Motivation: Large-scale image datasets often contain identifiable or sensitive content, creating privacy risks when training models that may memorize and leak such information. There's a need for automated solutions that can anonymize images while preserving their utility for downstream tasks.
Method: Two-stage pipeline: Stage 1 uses vision-language models to detect privacy risks, generate paired private/public captions, and produce identity-neutral edit instructions via LLMs. Stage 2 employs instruction-driven diffusion editors to apply dual textual prompts, rewriting only sensitive regions while preserving global structure.
Result: Unsafe2Safe significantly reduces face similarity, text similarity, and demographic predictability across MS-COCO, Caltech101, and MIT Indoor67 datasets, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on automatically generated triplets further improves privacy protection and semantic fidelity.
Conclusion: Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility, addressing critical privacy concerns in multimodal AI training.
Abstract: Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision-language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy, and Utility dimensions. Across MS-COCO, Caltech101, and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity. Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.
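The two-stage control flow described in the abstract can be sketched as a short pipeline; the three callables (`vlm`, `llm`, `editor`) are hypothetical stand-ins for the vision-language inspector, the LLM instruction generator, and the instruction-driven diffusion editor:

```python
def anonymize(image, vlm, llm, editor):
    """Unsafe2Safe-style pipeline sketch: inspect, then rewrite only
    images flagged as privacy-prone."""
    report = vlm(image)                  # stage 1: risk check + dual captions
    if not report["risky"]:
        return image                     # nothing sensitive to rewrite
    instruction = llm(report["public_caption"])
    # stage 2: edit only the sensitive regions, keeping global structure
    return editor(image, report["private_caption"], instruction)

# Toy usage with stub models:
stub_vlm = lambda img: {"risky": True,
                        "private_caption": "person with visible ID badge",
                        "public_caption": "person in an office"}
safe = anonymize("img.png", stub_vlm, lambda cap: "remove badge",
                 lambda img, priv, instr: img + " (edited)")
```

The (private caption, public caption, edit instruction) triplets produced along the way are exactly what the authors later reuse to fine-tune the diffusion editors.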
[407] ToLL: Topological Layout Learning with Structural Multi-view Augmentation for 3D Scene Graph Pretraining
Yucheng Huang, Luping Ji, Xiangwei Jiang, Wen Li, Mao Ye
Main category: cs.CV
TL;DR: A self-supervised pretraining framework called ToLL (Topological Layout Learning) for 3D Scene Graph generation that uses anchor-conditioned topological geometry reasoning to improve representation quality without relying on predicate annotations.
Details
Motivation: Current 3D Scene Graph generation methods suffer from data scarcity issues. Existing solutions either rely heavily on predicate annotations or bypass predicate learning due to strong object priors, lacking robust self-supervised proxy tasks for fine-tuning.
Method: Proposes ToLL framework with Anchor-Conditioned Topological Geometry Reasoning using GNNs to recover global layouts from sparse anchors, strictly modulated by predicate features. Includes Structural Multi-view Augmentation to avoid semantic corruption and self-distillation for enhanced representations.
Result: Extensive experiments on 3DSSG dataset demonstrate improved representation quality, outperforming state-of-the-art baselines.
Conclusion: ToLL provides an effective self-supervised pretraining framework for 3D Scene Graph generation that enforces predicate relation learning and improves generalizability despite data scarcity.
Abstract: 3D Scene Graph (3DSG) generation plays a pivotal role in spatial understanding and semantic-affordance perception. However, its generalizability is often constrained by data scarcity. Current solutions primarily focus on cross-modal assisted representation learning and object-centric generation pre-training. The former relies heavily on predicate annotations, while the latter’s predicate learning may be bypassed due to strong object priors. Consequently, they often fail to provide a label-free, robust self-supervised proxy task for 3DSG fine-tuning. To bridge this gap, we propose Topological Layout Learning (ToLL), a 3DSG pretraining framework. In detail, we design an Anchor-Conditioned Topological Geometry Reasoning module, with a GNN that recovers the global layout of zero-centered subgraphs from the spatial priors of sparse anchors. This process is strictly modulated by predicate features, thereby enforcing predicate relation learning. Furthermore, we construct a Structural Multi-view Augmentation to avoid semantic corruption and enhance representations via self-distillation. Extensive experiments on the 3DSSG dataset demonstrate that ToLL improves representation quality, outperforming state-of-the-art baselines.
[408] A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps
Xuanlong Yu, Youyang Sha, Longfei Liu, Xi Shen, Di Yang
Main category: cs.CV
TL;DR: Hybrid ensemble decoder with progressive fine-tuning for few-shot object detection, achieving strong generalization across diverse domains
Details
Motivation: Few-shot object detection suffers from unstable optimization and limited generalization due to scarce training samples, requiring better adaptation methods.
Method: Proposes hybrid ensemble decoder with shared hierarchical layer and parallel decoder branches using denoising queries for diversity, plus progressive fine-tuning with plateau-aware learning rate schedule.
Result: Achieves 41.9 average performance on RF100-VL (100 diverse datasets) in 10-shot setting, outperforming SAM3 (35.7), and shows robustness on OOD samples
Conclusion: The method effectively addresses FSOD challenges through ensemble learning and stable optimization, demonstrating strong generalization and robustness across diverse domains
Abstract: Few-shot object detection (FSOD) is challenging due to unstable optimization and limited generalization arising from the scarcity of training samples. To address these issues, we propose a hybrid ensemble decoder that enhances generalization during fine-tuning. Inspired by ensemble learning, the decoder comprises a shared hierarchical layer followed by multiple parallel decoder branches, where each branch employs denoising queries either inherited from the shared layer or newly initialized to encourage prediction diversity. This design fully exploits pretrained weights without introducing additional parameters, and the resulting diverse predictions can be effectively ensembled to improve generalization. We further leverage a unified progressive fine-tuning framework with a plateau-aware learning rate schedule, which stabilizes optimization and achieves strong few-shot adaptation without complex data augmentations or extensive hyperparameter tuning. Extensive experiments on CD-FSOD, ODinW-13, and RF100-VL validate the effectiveness of our approach. Notably, on RF100-VL, which includes 100 datasets across diverse domains, our method achieves an average performance of 41.9 in the 10-shot setting, significantly outperforming the recent approach SAM3, which obtains 35.7. We further construct a mixed-domain test set from CD-FSOD to evaluate robustness to out-of-distribution (OOD) samples, showing that our proposed modules lead to clear improvement gains. These results highlight the effectiveness, generalization, and robustness of the proposed method. Code is available at: https://github.com/Intellindust-AI-Lab/FT-FSOD.
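A plateau-aware learning rate schedule of the kind mentioned above can be sketched minimally as follows (the factor, patience, and monitored metric are assumptions; the abstract does not specify the paper's exact rule):

```python
class PlateauLR:
    """Reduce the learning rate when a validation metric stops improving
    for `patience` consecutive evaluations."""
    def __init__(self, lr=1e-4, factor=0.5, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("-inf"), 0

    def step(self, metric):
        if metric > self.best:
            self.best, self.bad_epochs = metric, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:   # plateau detected
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

Keeping the rate high while the metric improves and decaying it only on plateaus is one way such a schedule stabilizes few-shot fine-tuning without per-dataset hyperparameter search.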
[409] Explaining CLIP Zero-shot Predictions Through Concepts
Onat Ozdemir, Anders Christensen, Stephan Alaniz, Zeynep Akata, Emre Akbas
Main category: cs.CV
TL;DR: EZPC explains CLIP’s zero-shot predictions through human-understandable concepts by projecting CLIP embeddings into a concept space learned from language descriptions, maintaining accuracy while providing interpretability.
Details
Motivation: CLIP achieves strong zero-shot image recognition but lacks interpretability, while Concept Bottleneck Models are interpretable but require concept supervision and can't generalize to unseen classes. EZPC bridges these paradigms to provide transparent explanations for CLIP's predictions without additional supervision.
Method: Projects CLIP's joint image-text embeddings into a concept space learned from language descriptions using alignment and reconstruction objectives. This preserves CLIP's semantic structure while making concept activations interpretable, enabling faithful explanations without extra supervision.
Result: Extensive experiments on CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k show the approach maintains CLIP’s strong zero-shot classification accuracy while providing meaningful concept-level explanations.
Conclusion: EZPC offers a principled step toward interpretable and trustworthy vision-language models by grounding open-vocabulary predictions in explicit semantic concepts, bridging the gap between performance and interpretability.
Abstract: Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce EZPC that bridges these two paradigms by explaining CLIP’s zero-shot predictions through human-understandable concepts. Our method projects CLIP’s joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP’s semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP’s strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at https://github.com/oonat/ezpc.
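The core projection idea, scoring an image embedding against text-derived concept vectors, can be illustrated with plain cosine similarities (a simplification: EZPC learns this mapping with alignment and reconstruction objectives rather than using raw similarities, and the concept names below are invented):

```python
import math

def concept_scores(embedding, concepts):
    """Score an image embedding against named concept vectors by cosine
    similarity, yielding an interpretable concept activation per name."""
    def cosine(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / den

    return {name: cosine(embedding, vec) for name, vec in concepts.items()}

# A 2-D toy embedding aligned with the first concept direction:
scores = concept_scores((0.9, 0.1), {"has wings": (1.0, 0.0),
                                     "has fur": (0.0, 1.0)})
```

A zero-shot prediction can then be explained by reporting which concepts activated most strongly for the image.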
[410] Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
Kazuma Ikeda, Ryosei Hara, Rokuto Nagata, Ozora Sako, Zihao Ding, Takahiro Kado, Ibuki Fujioka, Taro Beppu, Mariko Isogawa, Kentaro Yoshioka
Main category: cs.CV
TL;DR: First large-scale annotated full-waveform LiDAR dataset for ghost detection and removal in mobile scenarios, with baseline models that significantly improve downstream tasks like SLAM and object detection.
Details
Motivation: Ghost points from reflective surfaces degrade 3D mapping and localization accuracy in autonomous driving and robotics. Existing methods fail on sparse, dynamic mobile LiDAR data, requiring new approaches using full-waveform LiDAR data.
Method: Created Ghost-FWL dataset with 24K frames and 7.5B peak-level annotations across 10 diverse scenes. Developed FWL-based baseline model for ghost detection and FWL-MAE (masked autoencoder) for self-supervised representation learning on FWL data.
Result: Baseline outperforms existing methods in ghost removal accuracy. Ghost removal enhances LiDAR-based SLAM (66% trajectory error reduction) and 3D object detection (50x false positive reduction).
Conclusion: Full-waveform LiDAR provides crucial cues for ghost detection in mobile scenarios. The large-scale Ghost-FWL dataset enables effective ghost removal that significantly improves downstream perception and localization tasks.
Abstract: LiDAR has become an essential sensing modality in autonomous driving, robotics, and smart-city applications. However, ghost points (or ghosts), which are false reflections caused by multi-path laser returns from glass and reflective surfaces, severely degrade 3D mapping and localization accuracy. Prior ghost removal relies on geometric consistency in dense point clouds, failing on mobile LiDAR’s sparse, dynamic data. We address this by exploiting full-waveform LiDAR (FWL), which captures complete temporal intensity profiles rather than just peak distances, providing crucial cues for distinguishing ghosts from genuine reflections in mobile scenarios. As this is a new task, we present Ghost-FWL, the first and largest annotated mobile FWL dataset for ghost detection and removal. Ghost-FWL comprises 24K frames across 10 diverse scenes with 7.5 billion peak-level annotations, which is 100x larger than existing annotated FWL datasets. Benefiting from this large-scale dataset, we establish a FWL-based baseline model for ghost detection and propose FWL-MAE, a masked autoencoder for efficient self-supervised representation learning on FWL data. Experiments show that our baseline outperforms existing methods in ghost removal accuracy, and our ghost removal further enhances downstream tasks such as LiDAR-based SLAM (66% trajectory error reduction) and 3D object detection (50x false positive reduction). The dataset and code are publicly available and can be accessed via the project page: https://keio-csg.github.io/Ghost-FWL
[411] TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K
Mattia D’Urso, Yuxi Hu, Christian Sormann, Mattia Rossi, Friedrich Fraundorfer
Main category: cs.CV
TL;DR: TerraSky3D: A high-resolution large-scale 3D reconstruction dataset with 50,000 images across 150 ground, aerial, and mixed scenes of European landmarks, including calibration data, camera poses, and depth maps.
Details
Motivation: Addressing the scarcity of suitable public 3D datasets, which are often low-resolution, limited in scenes, based on varying quality internet images, or restricted to specific capturing scenarios.
Method: Captured a comprehensive 3D dataset comprising 50,000 images divided into 150 scenes (ground, aerial, and mixed) focusing on European landmarks, with curated calibration data, camera poses, and depth maps.
Result: Created TerraSky3D - a high-resolution large-scale 3D reconstruction dataset that provides challenging data for training and evaluating 3D reconstruction pipelines.
Conclusion: TerraSky3D addresses the need for comprehensive 3D datasets and can serve as a valuable resource for developing and benchmarking 3D reconstruction methods.
Abstract: Despite the growing data needs of ever more sophisticated 3D reconstruction pipelines, suitable public datasets remain scarce. Existing 3D datasets are either low resolution, limited to a small number of scenes, based on internet-sourced images of varying quality, or restricted to specific capturing scenarios. Motivated by this lack of suitable 3D datasets, we captured TerraSky3D, a high-resolution, large-scale 3D reconstruction dataset comprising 50,000 images divided into 150 ground, aerial, and mixed scenes. The dataset focuses on European landmarks and comes with curated calibration data, camera poses, and depth maps. TerraSky3D answers the need for a challenging dataset that can be used to train and evaluate 3D reconstruction-related pipelines.
[412] DinoDental: Benchmarking DINOv3 as a Unified Vision Encoder for Dental Image Analysis
Kun Tang, Xinquan Yang, Mianjie Zheng, Xuefen Liu, Xuguang Li, Xiaoqi Guo, Ruihan Chen, Linlin Shen, He Meng
Main category: cs.CV
TL;DR: DinoDental benchmark evaluates DINOv3 as off-the-shelf encoder for dental image analysis without domain-specific pre-training, showing strong performance across classification, detection, and segmentation tasks.
Details
Motivation: Addresses scarcity of expert annotations in dental imaging by evaluating whether DINOv3, a self-supervised vision foundation model pre-trained on 1.7B images, can serve as reliable encoder for dental domain without domain-specific pre-training.
Method: Created DinoDental benchmark from multiple public datasets covering panoramic radiographs and intraoral photographs. Evaluated DINOv3 across classification, detection, and instance segmentation tasks. Analyzed transfer performance by scaling model size and input resolution, comparing frozen features, full fine-tuning, and LoRA adaptation.
Result: DINOv3 serves as strong unified encoder for dental image analysis across both panoramic radiographs and intraoral photographs, remaining competitive across tasks with particular advantages for intraoral image understanding and boundary-sensitive dense prediction.
Conclusion: DinoDental provides systematic framework for evaluating DINOv3 in dental analysis, establishing foundational benchmark to guide efficient model selection and adaptation for dental AI community.
Abstract: The scarcity and high cost of expert annotations in dental imaging present a significant challenge for the development of AI in dentistry. DINOv3, a state-of-the-art, self-supervised vision foundation model pre-trained on 1.7 billion images, offers a promising pathway to mitigate this issue. However, its reliability when transferred to the dental domain, with its unique imaging characteristics and clinical subtleties, remains unclear. To address this, we introduce DinoDental, a unified benchmark designed to systematically evaluate whether DINOv3 can serve as a reliable, off-the-shelf encoder for comprehensive dental image analysis without requiring domain-specific pre-training. Constructed from multiple public datasets, DinoDental covers a wide range of tasks, including classification, detection, and instance segmentation on both panoramic radiographs and intraoral photographs. We further analyze the model’s transfer performance by scaling its size and input resolution, and by comparing different adaptation strategies, including frozen features, full fine-tuning, and the parameter-efficient Low-Rank Adaptation (LoRA) method. Our experiments show that DINOv3 can serve as a strong unified encoder for dental image analysis across both panoramic radiographs and intraoral photographs, remaining competitive across tasks while showing particularly clear advantages for intraoral image understanding and boundary-sensitive dense prediction. Collectively, DinoDental provides a systematic framework for comprehensively evaluating DINOv3 in dental analysis, establishing a foundational benchmark to guide efficient and effective model selection and adaptation for the dental AI community.
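Among the adaptation strategies compared, LoRA keeps the pretrained weight frozen and learns only a low-rank update. A toy list-based sketch of the general technique (names and shapes are illustrative, not DinoDental's code):

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA-style forward pass: the frozen weight W is augmented by a
    trainable low-rank update B @ A, scaled by alpha."""
    base = matvec(W, x)                 # frozen pretrained path
    delta = matvec(B, matvec(A, x))     # rank-r adapter path (trainable)
    return [b + alpha * d for b, d in zip(base, delta)]

# Frozen 2x2 identity weight, rank-1 adapter (A: 1x2, B: 2x1):
y = lora_forward([2.0, 3.0], W=[[1, 0], [0, 1]],
                 A=[[1, 0]], B=[[0], [1]])
```

Only A and B are updated during fine-tuning, which is why LoRA is attractive when dental annotations are scarce: the parameter count of the adapter is tiny relative to the frozen backbone.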
[413] Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes
Luke Palmer, Petar Palasek, Hazem Abdelkawy
Main category: cs.CV
TL;DR: A novel approach for modeling human gaze in driving scenes using autoregressive dynamics with graph transformers, predicting raw gaze trajectories without fixation filtering.
Details
Motivation: Existing methods collapse gaze into static saliency maps or scanpaths, treating gaze dynamics only implicitly. There's a need for explicit temporal modeling of raw gaze trajectories in dynamic environments like driving scenes.
Method: Formulates gaze modeling as an autoregressive dynamical system using the Affinity Relation Transformer (ART) to process gaze-centric graphs of driver gaze, traffic objects, and road structure. Introduces Object Density Network (ODN) to predict next-step gaze distributions. Uses raw gaze data without fixation filtering.
Result: Produces more natural gaze trajectories, scanpath dynamics, and saliency maps than existing attention models. Released Focus100 dataset with raw gaze data from 30 participants viewing egocentric driving footage.
Conclusion: The unified approach offers valuable insights for temporal modeling of human attention in dynamic environments, with applications in automotive safety and computer vision.
Abstract: Accurately modelling human attention is essential for numerous computer vision applications, particularly in the domain of automotive safety. Existing methods typically collapse gaze into saliency maps or scanpaths, treating gaze dynamics only implicitly. We instead formulate gaze modelling as an autoregressive dynamical system and explicitly unroll raw gaze trajectories over time, conditioned on both gaze history and the evolving environment. Driving scenes are represented as gaze-centric graphs processed by the Affinity Relation Transformer (ART), a heterogeneous graph transformer that models interactions between driver gaze, traffic objects, and road structure. We further introduce the Object Density Network (ODN) to predict next-step gaze distributions, capturing the stochastic and object-centric nature of attentional shifts in complex environments. We also release Focus100, a new dataset of raw gaze data from 30 participants viewing egocentric driving footage. Trained directly on raw gaze, without fixation filtering, our unified approach produces more natural gaze trajectories, scanpath dynamics, and saliency maps than existing attention models, offering valuable insights for the temporal modelling of human attention in dynamic environments.
[414] SFDemorpher: Generalizable Face Demorphing for Operational Morphing Attack Detection
Raul Ismayilov, Luuk Spreeuwers
Main category: cs.CV
TL;DR: SFDemorpher is a face demorphing framework for differential morphing attack detection that performs identity disentanglement in joint StyleGAN latent and feature spaces, achieving state-of-the-art generalizability across diverse morphing techniques and capture conditions.
Details
Motivation: Face morphing attacks create document images that verify against multiple identities, compromising biometric security. Existing differential morphing attack detection methods lack operational generalizability due to limited training data and the assumption that all document inputs are morphs.
Method: SFDemorpher performs identity disentanglement within joint StyleGAN latent and high-dimensional feature spaces. It uses a dual-pass training strategy handling both morphed and bona fide documents, leveraging a hybrid corpus with predominantly synthetic identities to enhance robustness against unseen distributions.
Result: Extensive evaluation confirms state-of-the-art generalizability across unseen identities, diverse capture conditions, and 13 morphing techniques. The framework achieves superior D-MAD performance by widening the margin between score distributions of bona fide and morphed samples while providing high-fidelity visual reconstructions.
Conclusion: SFDemorpher presents an effective framework for operational deployment of face demorphing for differential morphing attack detection, offering improved generalizability and explainability through high-fidelity visual reconstructions.
Abstract: Face morphing attacks compromise biometric security by creating document images that verify against multiple identities, posing significant risks from document issuance to border control. Differential Morphing Attack Detection (D-MAD) offers an effective countermeasure, particularly when employing face demorphing to disentangle identities blended in the morph. However, existing methods lack operational generalizability due to limited training data and the assumption that all document inputs are morphs. This paper presents SFDemorpher, a framework designed for the operational deployment of face demorphing for D-MAD that performs identity disentanglement within joint StyleGAN latent and high-dimensional feature spaces. We introduce a dual-pass training strategy handling both morphed and bona fide documents, leveraging a hybrid corpus with predominantly synthetic identities to enhance robustness against unseen distributions. Extensive evaluation confirms state-of-the-art generalizability across unseen identities, diverse capture conditions, and 13 morphing techniques, spanning both border verification and the challenging document enrollment stage. Our framework achieves superior D-MAD performance by widening the margin between the score distributions of bona fide and morphed samples while providing high-fidelity visual reconstructions that facilitate explainability.
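Demorphing is easiest to see in a linear caricature: if a morph latent is a convex blend of two identity latents and the live capture supplies one of them, the other is recovered by inverting the blend. The sketch below assumes exactly that toy linear model; SFDemorpher's actual StyleGAN-space disentanglement is learned, not closed-form.

```python
import numpy as np

def demorph_latent(w_morph, w_live, alpha=0.5):
    """Toy linear demorphing in a latent space.

    If w_morph = alpha * w_live + (1 - alpha) * w_accomplice,
    the accomplice latent follows by inverting the blend.
    """
    return (w_morph - alpha * w_live) / (1.0 - alpha)

rng = np.random.default_rng(0)
w_live, w_acc = rng.normal(size=8), rng.normal(size=8)
w_morph = 0.5 * w_live + 0.5 * w_acc          # simulated morph latent
w_rec = demorph_latent(w_morph, w_live, alpha=0.5)
```

In the linear case the accomplice latent is recovered exactly; real morphs are non-linear blends, which is why a learned disentanglement is needed.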
[415] VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning
Li-Heng Chen, Ke Cheng, Yahui Liu, Lei Shi, Shi-Sheng Huang, Hongbo Fu
Main category: cs.CV
TL;DR: VistaGEN enables fine-grained object-level control in driving video generation with spatiotemporal consistency through multiview visual-language reasoning and a closed-loop generation-evaluation-regeneration mechanism.
Details
Motivation: Existing driving video generation methods lack fine-grained object-level controllability while maintaining spatiotemporal consistency, especially for long videos and diverse scenarios.
Method: Incorporates multiview visual-language reasoning into video generation, uses an MV-VLM evaluator for spatiotemporal consistency assessment, and implements a closed-loop generation-evaluation-regeneration mechanism with object-level refinement.
Result: Achieves diverse driving video generation with fine-grained controllability (especially for long-tail objects) and significantly better spatiotemporal consistency than previous approaches.
Conclusion: VistaGEN successfully addresses the challenge of fine-grained object-level control in driving video generation while maintaining spatiotemporal consistency through its novel multiview visual-language reasoning approach and closed-loop mechanism.
Abstract: Driving video generation has achieved much progress in controllability, video resolution, and length, but still fails to support fine-grained object-level controllability for diverse driving videos while preserving spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) to intelligently and automatically evaluate the spatiotemporal consistency of the generated content, thus forming a novel generation-evaluation-regeneration closed loop. This mechanism ensures high-quality, coherent outputs, facilitating the creation of complex and reliable driving scenarios. Moreover, within the closed loop, we introduce an object-level refinement module that refines the unsatisfactory results flagged by the MV-VLM and feeds them back to the video generator for regeneration. Extensive evaluation shows that our VistaGEN achieves diverse driving video generation results with fine-grained controllability, especially for long-tail objects, and much better spatiotemporal consistency than previous approaches.
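The generation-evaluation-regeneration loop is a generic pattern that can be sketched independently of the video models. Below, `generate`, `evaluate`, and `refine` are invented stand-ins for the multiview generator, the MV-VLM consistency evaluator, and the object-level refinement step.

```python
def closed_loop_generate(generate, evaluate, refine, prompt,
                         threshold=0.9, max_rounds=3):
    """Generic generation-evaluation-regeneration loop (sketch)."""
    video = generate(prompt)
    score, feedback = evaluate(video)
    for _ in range(max_rounds):
        if score >= threshold:
            break
        prompt = refine(prompt, feedback)   # fold evaluator feedback back in
        video = generate(prompt)
        score, feedback = evaluate(video)
    return video, score

# Toy stand-ins: each refinement round raises the consistency score.
gen = lambda p: {"prompt": p, "quality": 0.4 + 0.3 * p.count("!")}
ev  = lambda v: (v["quality"], "low consistency")
ref = lambda p, fb: p + "!"
video, score = closed_loop_generate(gen, ev, ref, "red car overtakes")
```

With these stand-ins the loop stops after two refinement rounds, once the evaluated score clears the threshold.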
[416] SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering
Jiho Park, Sieun Choi, Jaeyoon Seo, Minho Sohn, Yeana Kim, Jihie Kim
Main category: cs.CV
TL;DR: SEA is a reference-free metric for evaluating sketch abstraction efficiency by measuring how economically sketches represent class-defining visual elements while preserving semantic recognizability, supported by CommonSketch dataset.
Details
Motivation: Existing sketch evaluation methods fail to capture abstraction - the defining property of sketches. Current approaches rely on reference images, low-level visual features, or recognition accuracy, but don't measure how efficiently sketches convey semantic information through simplified visual abstraction.
Method: Introduces the SEA metric that uses commonsense knowledge about class-defining visual elements, leverages visual question answering models to detect element presence, and computes scores reflecting semantic retention under visual economy. Also creates the CommonSketch dataset with 23,100 annotated sketches across 300 classes.
Result: SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency. CommonSketch serves as a benchmark for evaluating element-level sketch understanding across vision-language models.
Conclusion: SEA provides the first effective metric for quantifying sketch abstraction efficiency, addressing a fundamental gap in sketch evaluation. The CommonSketch dataset enables systematic evaluation of sketch understanding in multimodal models.
Abstract: A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite its expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches. To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under visual economy. To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark providing systematic evaluation of element-level sketch understanding across various vision-language models.
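To give a rough sense of how an abstraction-efficiency score could trade semantic retention against visual economy, here is an invented toy formula, not the official SEA definition; `present` plays the role of the VQA model's per-element yes/no answers.

```python
def sea_score(present, n_strokes, stroke_budget=20):
    """Toy abstraction-efficiency score (sketch, not the official SEA).

    Combines semantic retention (fraction of class-defining elements the
    VQA model judged present) with visual economy (fewer strokes is
    better, saturating at a budget).
    """
    retention = sum(present) / len(present)
    economy = max(0.0, 1.0 - n_strokes / stroke_budget)
    return retention * (0.5 + 0.5 * economy)

# A cat sketch hitting 3/4 elements (ears, whiskers, tail; no paws):
score_sparse = sea_score([1, 1, 1, 0], n_strokes=6)
score_dense  = sea_score([1, 1, 1, 0], n_strokes=18)
```

Under this toy scoring, the six-stroke sketch beats the eighteen-stroke one despite identical element coverage, which is the intended "economy" behaviour.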
[417] AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
Milton Zhou, Sizhong Qin, Yongzhi Li, Quan Chen, Peng Jiang
Main category: cs.CV
TL;DR: AutoCut is an end-to-end multimodal framework for automated advertisement video editing that uses multimodal discretization and a video-editing LLM to unify video, audio, and text processing for scalable content creation.
Details
Motivation: Current AI tools for short-form video advertising are disjoint and modality-specific, leading to high production costs and low efficiency. There's a need for a unified framework that can handle video, audio, and text processing together for scalable ad creation.
Method: Uses dedicated encoders to extract video/audio features, applies residual vector quantization to discretize them into unified tokens aligned with text, creating a shared video-audio-text token space. Builds a multimodal LLM for video editing through multimodal alignment and supervised fine-tuning, supporting tasks like video selection, script generation, and background music selection.
Result: Experiments on real-world advertisement datasets show AutoCut reduces production cost and iteration time while improving consistency and controllability.
Conclusion: AutoCut provides an effective end-to-end framework for scalable video creation that unifies multimodal processing, paving the way for more efficient digital advertising workflows.
Abstract: Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built upon a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts the predicted token sequences into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable video creation.
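The residual vector quantization step is concrete enough to sketch: each stage quantizes the residual left by the previous stage against its own codebook, and the chosen indices become the discrete tokens. The codebooks and dimensions below are invented; AutoCut's actual tokenizer is not public.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization (sketch).

    Stage k picks the nearest entry of codebook k to the running
    residual, emits its index, and subtracts it before the next stage.
    """
    residual, tokens = np.asarray(x, dtype=float), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens, residual

cb1 = np.array([[0.0, 0.0], [1.0, 1.0]])      # coarse codebook
cb2 = np.array([[0.0, 0.0], [0.1, -0.1]])     # refines the residual
tokens, residual = rvq_encode([1.1, 0.9], [cb1, cb2])
```

Each later codebook only has to cover the (much smaller) residual left by earlier stages, which is what makes RVQ token sequences compact.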
[418] Clinical application of HEDI for biomechanical evaluation and visualisation in incisional hernia repair
Philipp D. Lösel, Jacob J. Relle, Samuel Voß, Ramesch Raschidi, Regine Nessel, Johannes Görich, Mark O. Wielpütz, Thorsten Löffler, Vincent Heuveline, Friedrich Kallinowski
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2307.01502 returned HTTP 429 (rate limited).
Details
Abstract: Not retrieved: the request to https://export.arxiv.org/api/query?search_query=&id_list=2307.01502&sortBy=relevance&sortOrder=descending&start=0&max_results=100 returned HTTP 429.
[419] Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models
Tao Xia, Jiawei Liu, Yukun Zhang, Ting Liu, Wei Wang, Lei Zhang
Main category: cs.CV
TL;DR: VAR-based text-guided image editing framework with coarse-to-fine token localization, structure-aware feature injection, and adaptive RL-based injection optimization for better structural consistency and editing quality.
Details
Motivation: VAR models show promise for text-guided image editing with better background preservation and faster inference than diffusion models, but face challenges in accurate token localization and maintaining structural consistency in edited results.
Method: 1) Coarse-to-fine token localization strategy to refine editable regions; 2) Analysis of VAR intermediate features to identify structure-related features for injection; 3) Reinforcement learning-based adaptive feature injection scheme to learn optimal scale- and layer-specific injection ratios.
Result: Extensive experiments show superior structural consistency and editing quality compared to state-of-the-art approaches across both local and global editing scenarios.
Conclusion: The proposed framework effectively addresses VAR-based editing challenges through feature analysis and adaptive injection mechanisms, achieving better balance between editing fidelity and structure preservation.
Abstract: Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that can refine editable regions, balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features, by which we design a simple yet effective feature injection mechanism to enhance structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches, across both local and global editing scenarios.
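At inference time, scale- and layer-specific injection ratios reduce to a per-layer blend of source and edited features. A minimal sketch with fixed ratios follows (the paper learns the ratios with RL; the names and shapes here are invented).

```python
import numpy as np

def inject_features(edited, source, ratios):
    """Blend structure features from the source image into the edited
    stream, with one ratio per scale/layer (sketch; ratios are fixed
    here, whereas the paper learns them with RL)."""
    return [r * s + (1.0 - r) * e
            for e, s, r in zip(edited, source, ratios)]

edited = [np.zeros(4), np.zeros(4)]   # two layers of "edited" features
source = [np.ones(4), np.ones(4)]     # corresponding source features
out = inject_features(edited, source, ratios=[0.8, 0.2])
```

A high ratio at a structure-carrying layer preserves the source layout; a low ratio at an appearance-carrying layer leaves room for the edit.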
[420] SVH-BD : Synthetic Vegetation Hyperspectral Benchmark Dataset for Emulation of Remote Sensing Images
Chedly Ben Azizi, Claire Guilloteau, Gilles Roussel, Matthieu Puigt
Main category: cs.CV
TL;DR: Large synthetic hyperspectral dataset (10,915 cubes) with pixel-level vegetation trait maps for radiative transfer emulation and vegetation trait retrieval research.
Details
Motivation: To provide a comprehensive synthetic dataset for benchmarking inversion methods, developing fast radiative transfer emulators, and studying spectral-biophysical relationships in vegetation monitoring.
Method: Generated hyperspectral cubes (211 bands, 400-2500 nm) using PROSAIL radiative transfer model inversion from Sentinel-2 data, followed by forward simulations for physically consistent reflectance spectra.
Result: Dataset includes 10,915 hyperspectral image cubes with vegetation trait maps, uncertainty quantification, and covers four diverse ecological regions with 64×64 pixel spatial resolution.
Conclusion: This dataset enables research in vegetation trait retrieval, radiative transfer emulation, and uncertainty quantification for remote sensing applications.
Abstract: This dataset provides a large collection of 10,915 synthetic hyperspectral image cubes paired with pixel-level vegetation trait maps, designed to support research in radiative transfer emulation, vegetation trait retrieval, and uncertainty quantification. Each hyperspectral cube contains 211 bands spanning 400–2500 nm at 10 nm resolution and a fixed spatial layout of 64×64 pixels, offering continuous simulated surface reflectance spectra suitable for emulator development and machine-learning tasks requiring high spectral detail. Vegetation traits were derived by inverting Sentinel-2 Level-2A surface reflectance using a PROSAIL-based lookup-table approach, followed by forward PROSAIL simulations to generate hyperspectral reflectance under physically consistent canopy and illumination conditions. The dataset covers four ecologically diverse regions – East Africa, Northern France, Eastern India, and Southern Spain – and includes 5th and 95th percentile uncertainty maps as well as Sentinel-2 scene classification layers. This resource enables benchmarking of inversion methods, development of fast radiative transfer emulators, and studies of spectral–biophysical relationships under controlled yet realistic environmental variability.
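Lookup-table inversion, the retrieval step behind the trait maps, is simple to sketch: pick the trait vector whose simulated spectrum best matches the observed reflectance. The three-band LUT below is an invented stand-in for a full PROSAIL table with hundreds of bands.

```python
import numpy as np

def lut_invert(reflectance, lut_spectra, lut_traits):
    """Lookup-table inversion (sketch): return the trait whose simulated
    spectrum is closest (RMSE) to the observed reflectance."""
    rmse = np.sqrt(((lut_spectra - reflectance) ** 2).mean(axis=1))
    return lut_traits[int(np.argmin(rmse))]

# Tiny stand-in LUT: two simulated spectra with their trait values.
lut_spectra = np.array([[0.1, 0.4, 0.5],
                        [0.3, 0.2, 0.1]])
lut_traits = np.array([3.0, 0.5])   # e.g. leaf area index per entry
lai = lut_invert(np.array([0.12, 0.38, 0.52]), lut_spectra, lut_traits)
```

In practice the LUT has many thousands of entries and the match can return a percentile range over the best-matching subset, which is where the dataset's 5th/95th percentile uncertainty maps come from.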
[421] Unified Restoration-Perception Learning: Maritime Infrared-Visible Image Fusion and Segmentation
Weichao Cai, Weiliang Huang, Biao Xue, Chao Huang, Fei Yuan, Bob Zhang
Main category: cs.CV
TL;DR: A multi-task framework for maritime scene understanding that combines infrared-visible image restoration, multimodal fusion, and semantic segmentation to handle challenging marine conditions like fog and reflections.
Details
Motivation: Marine environments suffer from severe image degradation (fog, reflections) that compromises semantic perception. Existing methods lack end-to-end collaborative mechanisms for simultaneous structural recovery and semantic effectiveness, and available datasets don't capture authentic marine degradation characteristics.
Method: Proposes the IVMSD dataset for maritime scenarios, and a Multi-task Complementary Learning Framework (MCLF) with: 1) a Frequency-Spatial Enhancement Complementary module for degradation suppression, 2) Semantic-Visual Consistency Attention for semantic guidance, and 3) cross-modality guided attention for selective fusion.
Result: Achieves state-of-the-art segmentation performance on IVMSD dataset, significantly enhancing robustness and perceptual quality under complex maritime conditions.
Conclusion: The proposed framework effectively addresses marine image degradation through collaborative restoration, fusion, and segmentation, demonstrating improved maritime scene understanding.
Abstract: Marine scene understanding and segmentation play a vital role in maritime monitoring and navigation safety. However, prevalent factors like fog and strong reflections in maritime environments cause severe image degradation, significantly compromising the stability of semantic perception. Existing restoration and enhancement methods typically target specific degradations or focus solely on visual quality, lacking end-to-end collaborative mechanisms that simultaneously improve structural recovery and semantic effectiveness. Moreover, publicly available infrared-visible datasets are predominantly collected from urban scenes, failing to capture the authentic characteristics of coupled degradations in marine environments. To address these challenges, the Infrared-Visible Maritime Ship Dataset (IVMSD) is proposed to cover various maritime scenarios under diverse weather and illumination conditions. Building upon this dataset, a Multi-task Complementary Learning Framework (MCLF) is proposed to collaboratively perform image restoration, multimodal fusion, and semantic segmentation within a unified architecture. The framework includes a Frequency-Spatial Enhancement Complementary (FSEC) module for degradation suppression and structural enhancement, a Semantic-Visual Consistency Attention (SVCA) module for semantic-consistent guidance, and a cross-modality guided attention mechanism for selective fusion. Experimental results on IVMSD demonstrate that the proposed method achieves state-of-the-art segmentation performance, significantly enhancing robustness and perceptual quality under complex maritime conditions.
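The frequency branch of a module like FSEC can be illustrated with a plain FFT low-pass split; this is a simplified stand-in for the paper's actual design, with an invented cutoff radius.

```python
import numpy as np

def frequency_split(img, radius=4):
    """Split an image into low- and high-frequency parts with a circular
    FFT mask (a simplified stand-in for a frequency-branch module)."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
    high = img - low                    # residual carries edges and noise
    return low, high

img = np.outer(np.linspace(0, 1, 16), np.ones(16))  # smooth gradient
low, high = frequency_split(img)
```

Degradations like fog live mostly in the low-frequency part while fine structure sits in the residual, which is why processing the two bands separately helps restoration.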
[422] Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection
Yujin Lee, Sewon Kim, Daeun Moon, Seoyoon Jang, Hyunsoo Yoon
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2408.13516 returned HTTP 429 (rate limited).
Details
Abstract: Not retrieved: the request to https://export.arxiv.org/api/query?search_query=&id_list=2408.13516&sortBy=relevance&sortOrder=descending&start=0&max_results=100 returned HTTP 429.
[423] From Pixels to Reality: Physical-Digital Patch Attacks on Real-World Camera
Victoria Leonenkova, Ekaterina Shumitskaya, Dmitriy Vatolin, Anastasia Antsiferova
Main category: cs.CV
TL;DR: Digital-Physical Adversarial Attacks (DiPA) enable real-time adversarial attacks on camera-based authentication systems by displaying adversarial patches on smartphone screens, bypassing printed artifacts and improving transferability across face recognition systems.
Details
Motivation: Camera-based authentication systems are increasingly pervasive but vulnerable to adversarial attacks. Current physical attacks require printed artifacts, which are impractical for rapid deployment. DiPA aims to create more practical attacks using digital displays on mobile devices.
Method: DiPA uses smartphone screens to display adversarial patches instead of printed artifacts. It leverages an ensemble of state-of-the-art face recognition models (ArcFace, MagFace, CosFace) to enhance transferability across unseen commercial systems. The approach eliminates the need for total-variation regularization and enables real-time interactive attacks.
Result: DiPA demonstrates superior performance over existing physical attacks in success rate, feature-space distortion, and reductions in detection confidence. The interactive demo shows real-time dodging attacks against deployed face-recognition systems with immediate observable effects.
Conclusion: DiPA reveals critical vulnerabilities at the intersection of mobile devices, pervasive vision, and sensor-driven authentication infrastructures, highlighting the need for more robust security measures in camera-based authentication systems.
Abstract: This demonstration presents Digital-Physical Adversarial Attacks (DiPA), a new class of practical adversarial attacks against pervasive camera-based authentication systems, where an attacker displays an adversarial patch directly on a smartphone screen instead of relying on printed artifacts. This digital-only physical presentation enables rapid deployment, removes the need for total-variation regularization, and improves patch transferability in black-box conditions. DiPA leverages an ensemble of state-of-the-art face-recognition models (ArcFace, MagFace, CosFace) to enhance transfer across unseen commercial systems. Our interactive demo shows a real-time dodging attack against a deployed face-recognition camera, preventing authorized users from being recognized while participants dynamically adjust patch patterns and observe immediate effects on the sensing pipeline. We further demonstrate DiPA’s superiority over existing physical attacks in terms of success rate, feature-space distortion, and reductions in detection confidence, highlighting critical vulnerabilities at the intersection of mobile devices, pervasive vision, and sensor-driven authentication infrastructures.
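The ensemble transfer idea can be sketched as lowering the mean cosine similarity to the enrolled identity across several surrogate models. Below, random linear maps stand in for the ArcFace/MagFace/CosFace embedders and the gradient is taken numerically; this is an illustrative dodging step, not DiPA's optimizer.

```python
import numpy as np

def ensemble_loss(patch, models, target):
    """Mean cosine similarity between the patch's embedding and the
    enrolled identity embedding, averaged over the model ensemble."""
    sims = [(M @ patch) @ target /
            (np.linalg.norm(M @ patch) * np.linalg.norm(target))
            for M in models]
    return float(np.mean(sims))

def dodge_step(patch, models, target, h=1e-4):
    """One descent step that lowers similarity, using a numerical
    gradient and a backtracking step size."""
    grad = np.zeros_like(patch)
    for i in range(patch.size):
        d = np.zeros_like(patch); d[i] = h
        grad[i] = (ensemble_loss(patch + d, models, target)
                   - ensemble_loss(patch - d, models, target)) / (2 * h)
    for eps in (0.3, 0.1, 0.03, 0.01, 0.003):
        cand = patch - eps * grad
        if ensemble_loss(cand, models, target) < ensemble_loss(patch, models, target):
            return cand
    return patch

rng = np.random.default_rng(1)
models = [rng.normal(size=(4, 6)) for _ in range(3)]  # surrogate embedders
target = rng.normal(size=4)                           # enrolled identity
patch = rng.normal(size=6)
before = ensemble_loss(patch, models, target)
after = ensemble_loss(dodge_step(patch, models, target), models, target)
```

Averaging the loss over an ensemble is what gives the patch a chance of transferring to unseen commercial systems that none of the surrogates match exactly.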
[424] Efficient Mixture-of-Expert for Video-based Driver State and Physiological Multi-task Estimation in Conditional Autonomous Driving
Jiyao Wang, Xiao Yang, Zhenyu Wang, Ximeng Wei, Ange Wang, Dengbo He, Kaishun Wu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2410.21086 returned HTTP 429 (rate limited).
Details
Abstract: Not retrieved: the request to https://export.arxiv.org/api/query?search_query=&id_list=2410.21086&sortBy=relevance&sortOrder=descending&start=0&max_results=100 returned HTTP 429.
[425] Decoupling Wavelet Sub-bands for Single Source Domain Generalization in Fundus Image Segmentation
Shramana Dey, Varun Ajith, Abhirup Banerjee, Sushmita Mitra
Main category: cs.CV
TL;DR: WaveSDG: A wavelet-guided segmentation network for single-source domain generalization in fundus imaging that decouples anatomical structure from domain-specific appearance using wavelet sub-band decomposition.
Details
Motivation: Domain generalization in medical imaging is challenging due to device and clinical setting variations. Existing single-source domain generalization approaches fail to properly capture anatomical topology or decouple appearance from anatomical features, leading to performance degradation on unseen domains.
Method: Proposes WaveSDG with a novel Wavelet-based Invariant Structure Extraction and Refinement (WISER) module that processes encoder features using wavelet sub-band decomposition. Low-frequency components are refined to anchor global anatomy, while high-frequency sub-bands are selectively enhanced for directional edges and noise suppression.
Result: WaveSDG outperforms seven state-of-the-art methods on optic cup and optic disc segmentation across one source and five unseen target datasets, achieving best balanced Dice score and lowest 95th percentile Hausdorff distance with reduced variance.
Conclusion: WaveSDG effectively addresses domain generalization in fundus imaging by decoupling anatomical structure from domain-specific appearance through wavelet decomposition, demonstrating improved accuracy, robustness, and cross-domain stability.
Abstract: Domain generalization in fundus imaging is challenging due to variations in acquisition conditions across devices and clinical settings. The inability to adapt to these variations causes performance degradation on unseen domains for deep learning models. Moreover, obtaining annotated data across domains is often expensive, and privacy constraints restrict its availability. Although single-source domain generalization (SDG) offers a realistic solution to this problem, the existing approaches frequently fail to capture anatomical topology or decouple appearance from anatomical features. This research introduces WaveSDG, a new wavelet-guided segmentation network for SDG. It decouples anatomical structure from domain-specific appearance through a wavelet sub-band decomposition. A novel Wavelet-based Invariant Structure Extraction and Refinement (WISER) module is proposed to process encoder features by leveraging distinct semantic roles of each wavelet sub-band. The module refines low-frequency components to anchor global anatomy, while selectively enhancing directional edges and suppressing noise within the high-frequency sub-bands. Extensive ablation studies validate the effectiveness of the WISER module and its decoupling strategy. Our evaluations on optic cup and optic disc segmentation across one source and five unseen target datasets show that WaveSDG consistently outperforms seven state-of-the-art methods. Notably, it achieves the best balanced Dice score and lowest 95th percentile Hausdorff distance with reduced variance, indicating improved accuracy, robustness, and cross-domain stability.
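The wavelet logic behind WISER is easy to demonstrate: one Haar level yields an LL band carrying global structure plus three directional high-frequency bands, and small high-frequency coefficients can be zeroed as noise. A minimal numpy sketch follows; the single-level Haar choice and the threshold are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar transform: returns (LL, LH, HL, HH) sub-bands."""
    a = (x[0::2] + x[1::2]) / 2.0      # row pairs: low
    d = (x[0::2] - x[1::2]) / 2.0      # row pairs: high
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return LL, LH, HL, HH

def wiser_like(x, thresh=0.05):
    """Sketch of the WISER idea: keep LL (global anatomy) and suppress
    small high-frequency coefficients while keeping strong edges."""
    LL, LH, HL, HH = haar_dwt2(x)
    keep = lambda b: np.where(np.abs(b) > thresh, b, 0.0)
    return LL, keep(LH), keep(HL), keep(HH)

x = np.eye(8)                          # a diagonal "vessel" structure
LL, LH, HL, HH = wiser_like(x)
```

For the diagonal input, the structure survives in both LL and the diagonal HH band, while the purely horizontal and vertical bands stay empty; the threshold would instead zero out weak, noise-like coefficients.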
[426] Post-hoc Self-explanation of CNNs
Ahcène Boubekki, Line H. Clemmensen
Main category: cs.CV
TL;DR: A method to improve interpretability of CNNs by replacing the final linear layer with k-means-based classifiers and generating concept-based explanation maps using intermediate feature activations.
Details
Motivation: Standard CNNs can be reinterpreted as self-explainable models, but their built-in prototypes don't accurately represent the data. There's a need for better interpretability without compromising performance.
Method: Replace the final linear layer with a k-means-based classifier. Formalize k-means-based post-hoc explanations for the classifier, the encoder output, and combinations of intermediate feature activations. Use spatial consistency of convolutional receptive fields to generate concept-based explanation maps with gradient-free feature attribution.
Result: Using shallower, less compressed feature activations (like from last three blocks B234) creates a trade-off between semantic fidelity and slight reduction in predictive performance. ResNet34 evaluation shows the approach works.
Conclusion: K-means-based classifiers improve CNN interpretability while maintaining performance. Combining intermediate feature activations provides better concept-based explanations, though with trade-offs between semantic fidelity and accuracy.
Abstract: Although standard Convolutional Neural Networks (CNNs) can be mathematically reinterpreted as Self-Explainable Models (SEMs), their built-in prototypes do not on their own accurately represent the data. Replacing the final linear layer with a k-means-based classifier addresses this limitation without compromising performance. This work introduces a common formalization of k-means-based post-hoc explanations for the classifier, the encoder's final output (B4), and combinations of intermediate feature activations. The latter approach leverages the spatial consistency of convolutional receptive fields to generate concept-based explanation maps, which are supported by gradient-free feature attribution maps. Empirical evaluation with a ResNet34 shows that using shallower, less compressed feature activations, such as those from the last three blocks (B234), results in a trade-off between semantic fidelity and a slight reduction in predictive performance.
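At prediction time, replacing the final linear layer with a k-means-based classifier amounts to nearest-centroid classification in feature space. A minimal sketch with one centroid per class follows (the paper's prototypes come from k-means on real CNN features; the 2-D features here are invented).

```python
import numpy as np

def fit_centroids(feats, labels, n_classes):
    """Per-class centroids standing in for the k-means prototypes that
    replace the final linear layer (one centroid per class for brevity)."""
    return np.stack([feats[labels == c].mean(axis=0)
                     for c in range(n_classes)])

def predict(feats, centroids):
    """Assign each feature vector to its nearest class centroid."""
    d = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

feats = np.array([[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]])
labels = np.array([0, 0, 1, 1])
centroids = fit_centroids(feats, labels, 2)
pred = predict(np.array([[0.05, 0.05], [0.95, 0.95]]), centroids)
```

Because every prediction is a distance to an explicit prototype, the classifier is interpretable by construction: the nearest centroid is itself the explanation.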
[427] INSID3: Training-Free In-Context Segmentation with DINOv3
Claudia Cuttano, Gabriele Trivigno, Christoph Reich, Daniel Cremers, Carlo Masone, Stefan Roth
Main category: cs.CV
TL;DR: INSID3 is a training-free approach that uses frozen DINOv3 features for in-context segmentation, achieving state-of-the-art results across semantic, part, and personalized segmentation without any supervision.
Details
Motivation: Existing in-context segmentation methods either fine-tune vision foundation models (hurting generalization) or combine multiple frozen models (creating complexity). The authors explore whether a single self-supervised backbone can handle both semantic matching and segmentation without supervision.
Method: INSID3 leverages scaled-up dense self-supervised features from DINOv3, which exhibit strong spatial structure and semantic correspondence. The approach is training-free and segments concepts at varying granularities using only frozen DINOv3 features given an in-context example.
Result: INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU while using 3x fewer parameters and requiring no mask or category-level supervision.
Conclusion: A single self-supervised backbone (DINOv3) can effectively support both semantic matching and segmentation without any supervision, offering a minimalist yet powerful approach to in-context segmentation that preserves generalization while reducing complexity.
Abstract: In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given one annotated visual example. Existing work relies on (i) fine-tuning vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combining multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities only from frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU, while using 3x fewer parameters and without any mask or category-level supervision. Code is available at https://github.com/visinf/INSID3.
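The training-free matching idea can be sketched in a few lines: label each query patch foreground if its best cosine similarity to the reference's masked patches beats its best similarity to the unmasked ones. This is a hypothetical simplification; random prototype vectors stand in for actual DINOv3 patch features:

```python
import numpy as np

def incontext_segment(q_feats, ref_feats, ref_mask):
    """Training-free in-context labeling: a query patch is foreground if
    its best cosine match among the reference's masked patches beats its
    best match among the unmasked ones."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sim = unit(q_feats) @ unit(ref_feats).T  # (Nq, Nr) patch similarities
    fg = sim[:, ref_mask].max(axis=1)        # best match to foreground
    bg = sim[:, ~ref_mask].max(axis=1)       # best match to background
    return fg > bg

# Random prototypes stand in for DINOv3 patch features (hypothetical).
rng = np.random.default_rng(1)
proto_fg, proto_bg = rng.normal(size=(2, 16))
ref = np.vstack([proto_fg + 0.1 * rng.normal(size=(10, 16)),
                 proto_bg + 0.1 * rng.normal(size=(10, 16))])
ref_mask = np.array([True] * 10 + [False] * 10)
query = np.vstack([proto_fg + 0.1 * rng.normal(size=(5, 16)),
                   proto_bg + 0.1 * rng.normal(size=(5, 16))])
pred = incontext_segment(query, ref, ref_mask)
print(pred.tolist())
```

No weights are updated anywhere, which is what "training-free" means in this setting: all the work is done by the quality of the frozen features.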
[428] A Benchmark for Incremental Micro-expression Recognition
Zhengqin Lai, Xiaopeng Hong, Yabin Wang, Xiaobai Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2501.19111 returned HTTP 429 (rate limited).
[429] ConceptWeaver: Weaving Disentangled Concepts with Flow
Jintao Chen, Aiming Hao, Xiaoqing Chen, Chengyu Bai, Chubin Chen, Yanxun Li, Jiahong Wu, Xiangxiang Chu, Shanghang Zhang
Main category: cs.CV
TL;DR: ConceptWeaver enables one-shot concept disentanglement in flow-based generative models by leveraging a discovered three-stage generative process, allowing precise multi-concept manipulation through stage-aware optimization and guidance.
Details
Motivation: Pre-trained flow-based models can synthesize complex scenes but lack mechanisms for disentangling and customizing underlying concepts from one-shot real-world sources, limiting precise content manipulation.
Method: Introduces differential probing to analyze concept token influence, revealing a three-stage generative process (Blueprint, Instantiation, Refinement). Proposes ConceptWeaver with stage-aware optimization to learn concept-specific semantic offsets from single reference images, deployed via ConceptWeaver Guidance during inference.
Result: Extensive experiments validate that ConceptWeaver enables high-fidelity, compositional synthesis and editing, demonstrating precise multi-granularity content manipulation by leveraging the intrinsic staged nature of flow models.
Conclusion: Understanding and leveraging the intrinsic staged nature of flow models is key to unlocking precise, multi-granularity content manipulation, with ConceptWeaver providing an effective framework for one-shot concept disentanglement.
Abstract: Pre-trained flow-based models excel at synthesizing complex scenes yet lack a direct mechanism for disentangling and customizing their underlying concepts from one-shot real-world sources. To demystify this process, we first introduce a novel differential probing technique to isolate and analyze the influence of individual concept tokens on the velocity field over time. This investigation yields a critical insight: the generative process is not monolithic but unfolds in three distinct stages. An initial \textbf{Blueprint Stage} establishes low-frequency structure, followed by a pivotal \textbf{Instantiation Stage} where content concepts emerge with peak intensity and become naturally disentangled, creating an optimal window for manipulation. A final concept-insensitive refinement stage then synthesizes fine-grained details. Guided by this discovery, we propose \textbf{ConceptWeaver}, a framework for one-shot concept disentanglement. ConceptWeaver learns concept-specific semantic offsets from a single reference image using a stage-aware optimization strategy that aligns with the three-stage framework. These learned offsets are then deployed during inference via our novel ConceptWeaver Guidance (CWG) mechanism, which strategically injects them at the appropriate generative stage. Extensive experiments validate that ConceptWeaver enables high-fidelity, compositional synthesis and editing, demonstrating that understanding and leveraging the intrinsic, staged nature of flow models is key to unlocking precise, multi-granularity content manipulation.
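The stage-aware injection reduces, in spirit, to gating a learned offset by the current timestep. A minimal sketch, in which the window bounds and the additive form are illustrative assumptions rather than the paper's exact CWG mechanism:

```python
def stage_aware_guidance(cond_emb, offset, t, window=(0.3, 0.7)):
    """Apply a learned concept offset only inside the 'Instantiation'
    window of the trajectory; the Blueprint and Refinement stages are
    left untouched. Window bounds here are illustrative, not fitted."""
    lo, hi = window
    return cond_emb + offset if lo <= t < hi else cond_emb

print(stage_aware_guidance(1.0, 0.5, t=0.1))  # Blueprint stage: 1.0
print(stage_aware_guidance(1.0, 0.5, t=0.5))  # Instantiation: 1.5
```

The point of the gating is that injecting the offset outside the Instantiation window would either distort low-frequency structure or be wasted on concept-insensitive refinement.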
[430] ControlGUI: Guiding Generative GUI Exploration through Perceptual Visual Flow
Aryan Garg, Yue Jiang, Antti Oulasvirta
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2502.03330 returned HTTP 429 (rate limited).
[431] Bridging the Geometry Mismatch: Frequency-Aware Anisotropic Serialization for Thin-Structure SSMs
Jin Bai, Huiyao Zhang, Qi Wen, Ningyang Li, Shengyang Li, Atta ur Rahman, Xiaolin Tian
Main category: cs.CV
TL;DR: FGOS-Net is a framework for segmenting thin linear structures using frequency-geometric disentanglement to address topology preservation issues in state-space models.
Details
Motivation: Segmentation of thin linear structures is topology-critical, where minor local errors can break long-range connectivity. Current State-Space Models (SSMs) have isotropic serialization (raster scanning) that creates a geometry mismatch for anisotropic targets, causing state propagation across rather than along structure trajectories.
Method: Proposes FGOS-Net based on frequency-geometric disentanglement: 1) Decomposes features into a stable topology carrier and directional high-frequency bands, using the latter to correct spatial misalignments from downsampling; 2) Introduces frequency-aligned scanning that makes serialization geometry-conditioned to preserve direction-consistent traces; 3) Uses an active probing strategy to selectively inject high-frequency details and suppress texture ambiguity.
Result: Consistently outperforms strong baselines across four challenging benchmarks. Achieves 91.3% mIoU and 97.1% clDice on DeepCrack while running at 80 FPS with only 7.87 GFLOPs.
Conclusion: FGOS-Net effectively addresses the geometry mismatch problem in thin linear structure segmentation through frequency-geometric disentanglement and geometry-conditioned serialization, achieving superior performance with high efficiency.
Abstract: The segmentation of thin linear structures is inherently topology-critical, where minor local errors can sever long-range connectivity. While recent State-Space Models (SSMs) offer efficient long-range modeling, their isotropic serialization (e.g., raster scanning) creates a geometry mismatch for anisotropic targets, causing state propagation across rather than along the structure trajectories. To address this, we propose FGOS-Net, a framework based on frequency-geometric disentanglement. We first decompose features into a stable topology carrier and directional high-frequency bands, leveraging the latter to explicitly correct spatial misalignments induced by downsampling. Building on this calibrated topology, we introduce frequency-aligned scanning that elevates serialization to a geometry-conditioned decision, preserving direction-consistent traces. Coupled with an active probing strategy to selectively inject high-frequency details and suppress texture ambiguity, FGOS-Net consistently outperforms strong baselines across four challenging benchmarks. Notably, it achieves 91.3% mIoU and 97.1% clDice on DeepCrack while running at 80 FPS with only 7.87 GFLOPs.
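A simplified stand-in for the frequency-geometric decomposition: split a feature map into a smooth low-frequency "topology carrier" (here a plain box blur, not the paper's actual filter bank) and a high-frequency residual, so that the two bands sum back to the input exactly:

```python
import numpy as np

def freq_disentangle(x, k=3):
    """Split a 2-D feature map into a low-frequency carrier (box-blurred)
    and a high-frequency residual band. A toy stand-in for the paper's
    frequency-geometric decomposition; the residual carries the sharp,
    direction-sensitive detail."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    low = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            low[i, j] = xp[i:i + k, j:j + k].mean()
    high = x - low  # residual: by construction low + high == x
    return low, high

x = np.arange(25, dtype=float).reshape(5, 5)
low, high = freq_disentangle(x)
print(np.allclose(low + high, x))  # decomposition is lossless: True
```

Because the residual is defined as input minus carrier, the split is lossless; the design question the paper tackles is what to do with each band, not how to recover the input.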
[432] Generalizable Detection of AI Generated Images with Large Models and Fuzzy Decision Tree
Fei Wu, Guanghao Ding, Zijian Niu, Zhenrui Wang, Lei Yang, Zhuosheng Zhang, Shilin Wang
Main category: cs.CV
TL;DR: A novel AI-generated image detection framework that combines lightweight artifact-aware detectors with Multimodal Large Language Models using a fuzzy decision tree for improved accuracy and generalization.
Details
Motivation: The malicious use of AI-generated images threatens digital content authenticity. Existing detection methods lack generalization due to model-specific overfitting, while MLLMs alone lack fine-grained sensitivity to subtle generation artifacts.
Method: Proposes a framework that synergistically integrates lightweight artifact-aware detectors with MLLMs via a fuzzy decision tree. The decision tree treats detector outputs as fuzzy membership values, enabling adaptive fusion of complementary semantic and perceptual cues.
Result: Extensive experiments demonstrate state-of-the-art accuracy and strong generalization across diverse generative models.
Conclusion: The proposed hybrid approach effectively addresses limitations of both traditional artifact-based detectors and MLLMs by combining their complementary strengths for robust AI-generated image detection.
Abstract: The malicious use and widespread dissemination of AI-generated images pose a serious threat to the authenticity of digital content. Existing detection methods exploit low-level artifacts left by common manipulation steps within the generation pipeline, but they often lack generalization due to model-specific overfitting. Recently, researchers have resorted to Multimodal Large Language Models (MLLMs) for AIGC detection, leveraging their high-level semantic reasoning and broad generalization capabilities. While promising, MLLMs lack the fine-grained perceptual sensitivity to subtle generation artifacts, making them inadequate as standalone detectors. To address this issue, we propose a novel AI-generated image detection framework that synergistically integrates lightweight artifact-aware detectors with MLLMs via a fuzzy decision tree. The decision tree treats the outputs of basic detectors as fuzzy membership values, enabling adaptive fusion of complementary cues from semantic and perceptual perspectives. Extensive experiments demonstrate that the proposed method achieves state-of-the-art accuracy and strong generalization across diverse generative models.
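One fuzzy decision-tree node can be sketched as a soft split: the artifact detector's score acts as a membership value that weights the two branches instead of forcing a hard left/right decision. The threshold and blending rule below are illustrative assumptions, not the paper's learned tree:

```python
def fuzzy_route(artifact_score, semantic_score, threshold=0.5):
    """One node of a fuzzy decision tree: the artifact detector's output
    in [0, 1] is a fuzzy membership. Its excess over the split threshold
    becomes a soft branch weight, blending the artifact verdict with the
    MLLM's semantic verdict rather than picking one branch outright."""
    w = min(1.0, max(0.0, (artifact_score - threshold) / (1.0 - threshold)))
    return w * artifact_score + (1.0 - w) * semantic_score

# Strong artifact evidence dominates; weak evidence defers to the MLLM.
print(fuzzy_route(0.9, 0.2))  # weighted toward the artifact score
print(fuzzy_route(0.3, 0.8))  # falls back to the semantic score: 0.8
```

The soft weight is what makes the fusion "adaptive": near the threshold both cues contribute, so a borderline artifact score cannot override a confident semantic judgment on its own.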
[433] Measuring the (Un)Faithfulness of Concept-Based Explanations
Shubham Kumar, Narendra Ahuja
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2504.10833 returned HTTP 429 (rate limited).
[434] GEditBench v2: A Human-Aligned Benchmark for General Image Editing
Zhangqi Jiang, Zheng Sun, Xianfang Zeng, Yufeng Yang, Xuanyang Zhang, Yongliang Wu, Wei Cheng, Gang Yu, Xu Yang, Bihan Wen
Main category: cs.CV
TL;DR: GEditBench v2 is a comprehensive image editing benchmark with 1,200 real-world queries across 23 tasks, plus PVC-Judge, an open-source pairwise assessment model for evaluating visual consistency in edited images.
Details
Motivation: Existing image editing evaluation frameworks have limited task coverage and fail to adequately measure visual consistency (preservation of identity, structure, and semantic coherence between original and edited images).
Method: 1) Created GEditBench v2 with 1,200 real-world queries spanning 23 tasks including open-set category; 2) Developed PVC-Judge, an open-source pairwise assessment model trained via novel region-decoupled preference data synthesis pipelines; 3) Built VCReward-Bench with expert-annotated preference pairs to validate PVC-Judge.
Result: PVC-Judge achieves state-of-the-art evaluation performance among open-source models and surpasses GPT-5.1 on average. Benchmarking 16 frontier editing models reveals critical limitations and provides reliable foundation for advancing precise image editing.
Conclusion: GEditBench v2 enables more human-aligned evaluation of image editing models, addressing current limitations in evaluation frameworks and providing comprehensive assessment capabilities for visual consistency.
Abstract: Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. Besides, we construct VCReward-Bench using expert-annotated preference pairs to assess the alignment of PVC-Judge with human judgments on visual consistency evaluation. Experiments show that our PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, revealing critical limitations of current models, and providing a reliable foundation for advancing precise image editing.
[435] Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions
Yifei Dong, Fengyi Wu, Sanjian Zhang, Guangyu Chen, Yuzhi Hu, Masumi Yano, Jingdong Sun, Siyu Huang, Feng Liu, Qi Dai, Zhi-Qi Cheng
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2504.11967 returned HTTP 429 (rate limited).
[436] Seen2Scene: Completing Realistic 3D Scenes with Visibility-Guided Flow
Quan Meng, Yujin Chen, Lei Li, Matthias Nießner, Angela Dai
Main category: cs.CV
TL;DR: Seen2Scene is a flow matching-based approach for 3D scene completion and generation that trains directly on incomplete real-world 3D scans using visibility-guided flow matching and sparse transformers.
Details
Motivation: Prior methods rely on complete synthetic 3D data, which doesn't capture real-world complexity. There's a need for approaches that can learn directly from incomplete, real-world 3D scans to enable realistic scene completion for complex, cluttered environments.
Method: Uses visibility-guided flow matching to mask unknown regions in real scans, represents scenes with TSDF volumes in sparse grids, employs sparse transformers to model complex structures while masking unknown regions, and uses 3D layout boxes as conditioning with flexibility for other inputs like text or partial scans.
Result: Outperforms baselines in completion accuracy and generation quality, producing coherent, complete, and realistic 3D scenes from real-world incomplete scans.
Conclusion: Seen2Scene enables realistic 3D scene completion for complex real environments by learning directly from incomplete real-world scans, representing a significant advancement over synthetic-data-dependent approaches.
Abstract: We present Seen2Scene, the first flow matching-based approach that trains directly on incomplete, real-world 3D scans for scene completion and generation. Unlike prior methods that rely on complete and hence synthetic 3D data, our approach introduces visibility-guided flow matching, which explicitly masks out unknown regions in real scans, enabling effective learning from real-world, partial observations. We represent 3D scenes using truncated signed distance field (TSDF) volumes encoded in sparse grids and employ a sparse transformer to efficiently model complex scene structures while masking unknown regions. We employ 3D layout boxes as an input conditioning signal, and our approach is flexibly adapted to various other inputs such as text or partial scans. By learning directly from real-world, incomplete 3D scans, Seen2Scene enables realistic 3D scene completion for complex, cluttered real environments. Experiments demonstrate that our model produces coherent, complete, and realistic 3D scenes, outperforming baselines in completion accuracy and generation quality.
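The visibility-guided objective can be sketched as an ordinary flow-matching regression masked to observed voxels: on the straight path $x_t = (1-t)x_0 + t x_1$ the velocity target is $x_1 - x_0$, and unobserved regions simply do not contribute to the loss. This is a deliberate simplification of the paper's sparse TSDF setup:

```python
import numpy as np

def masked_fm_loss(x0, x1, v_pred, visible):
    """Flow-matching loss restricted to observed voxels. On the straight
    interpolation path the regression target is the constant velocity
    x1 - x0; the visibility mask keeps unknown scan regions out of the
    average, so they never penalize the model."""
    target = x1 - x0
    err = (v_pred - target) ** 2
    return float((err * visible).sum() / visible.sum())

x0, x1 = np.zeros(6), np.ones(6)
visible = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
# Correct velocity on visible voxels, garbage elsewhere: zero loss.
v_pred = np.array([1.0, 1.0, 1.0, 9.0, 9.0, 9.0])
print(masked_fm_loss(x0, x1, v_pred, visible))  # 0.0
```

This masking is what lets the model train on partial real scans: the network is free to hallucinate plausible geometry in unknown regions without being contradicted by missing ground truth.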
[437] Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models
Gracjan Góral, Alicja Ziarko, Piotr Miłoś, Michał Nauman, Maciej Wołczyk, Michał Kosiński
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2505.03821 returned HTTP 429 (rate limited).
[438] MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures
Tim Strohmeyer, Lucas Morin, Gerhard Ingmar Meijer, Valéry Weber, Ahmed Nassar, Peter Staar
Main category: cs.CV
TL;DR: MarkushGrapher-2: End-to-end multimodal recognition of chemical structures from documents combining text, image, and layout information with specialized encoders and two-stage training.
Details
Motivation: Automatic extraction of chemical structures from documents is crucial for large-scale chemistry literature analysis, but current methods for multimodal chemical structure recognition (Markush structures) lack precision and cannot scale effectively.
Method: 1) Dedicated OCR for text extraction from chemical images; 2) Joint encoding of text, image, and layout via Vision-Text-Layout encoder and Optical Chemical Structure Recognition vision encoder; 3) Two-stage training strategy for effective fusion; 4) Auto-regressive generation of Markush structure representation; 5) Created automatic pipeline for large-scale dataset construction.
Result: Substantially outperforms state-of-the-art models in multimodal Markush structure recognition while maintaining strong performance in molecule structure recognition. Introduced IP5-M benchmark dataset.
Conclusion: MarkushGrapher-2 advances multimodal chemical structure recognition with superior performance and provides valuable datasets for future research in this challenging domain.
Abstract: Automatically extracting chemical structures from documents is essential for the large-scale analysis of the literature in chemistry. Automatic pipelines have been developed to recognize molecules represented either in figures or in text independently. However, methods for recognizing chemical structures from multimodal descriptions (Markush structures) lag behind in precision and cannot be used for automatic large-scale processing. In this work, we present MarkushGrapher-2, an end-to-end approach for the multimodal recognition of chemical structures in documents. First, our method employs a dedicated OCR model to extract text from chemical images. Second, the text, image, and layout information are jointly encoded through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. Finally, the resulting encodings are effectively fused through a two-stage training strategy and used to auto-regressively generate a representation of the Markush structure. To address the lack of training data, we introduce an automatic pipeline for constructing a large-scale dataset of real-world Markush structures. In addition, we present IP5-M, a large manually-annotated benchmark of real-world Markush structures, designed to advance research on this challenging task. Extensive experiments show that our approach substantially outperforms state-of-the-art models in multimodal Markush structure recognition, while maintaining strong performance in molecule structure recognition. Code, models, and datasets are released publicly.
[439] Curriculum-Guided Myocardial Scar Segmentation for Ischemic and Non-ischemic Cardiomyopathy
Nivetha Jayakumar, Jonathan Pan, Shuo Wang, Bishow Paudel, Nisha Hosadurg, Cristiane C. Singulane, Sivam Bhatt, Amit R. Patel, Miaomiao Zhang
Main category: cs.CV
TL;DR: Curriculum learning framework for myocardial scar segmentation from LGE-CMR images that progressively trains from high-confidence to ambiguous cases to handle label uncertainty and diffuse scars.
Details
Motivation: Myocardial scar segmentation from LGE-CMR images is challenging due to variations in contrast enhancement, post-contrast washout, and inconsistent ground truth annotations for diffuse scars caused by inter-observer variability.
Method: Proposes a curriculum learning-based framework with a progressive training strategy that guides the model from high-confidence, clearly defined scar regions to low-confidence or visually ambiguous samples with limited scar burden.
Result: Experimental results show enhanced segmentation accuracy and consistency, particularly for cases with minimal or diffuse scar, outperforming standard training baselines.
Conclusion: The curriculum learning strategy provides a principled way to leverage imperfect data for improved myocardial scar quantification in clinical applications.
Abstract: Identification and quantification of myocardial scar is important for diagnosis and prognosis of cardiovascular diseases. However, reliable scar segmentation from Late Gadolinium Enhancement Cardiac Magnetic Resonance (LGE-CMR) images remains a challenge due to variations in contrast enhancement across patients, suboptimal imaging conditions such as post-contrast washout, and inconsistencies in ground truth annotations on diffuse scars caused by inter-observer variability. In this work, we propose a curriculum learning-based framework designed to improve segmentation performance under these challenging conditions. The method introduces a progressive training strategy that guides the model from high-confidence, clearly defined scar regions to low-confidence or visually ambiguous samples with limited scar burden. By structuring the learning process in this manner, the network develops robustness to uncertain labels and subtle scar appearances that are often underrepresented in conventional training pipelines. Experimental results show that the proposed approach enhances segmentation accuracy and consistency, particularly for cases with minimal or diffuse scar, outperforming standard training baselines. This strategy provides a principled way to leverage imperfect data for improved myocardial scar quantification in clinical applications. Our code is publicly available on GitHub.
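The progressive schedule can be sketched as sorting samples by annotation confidence and widening the training pool in stages; the stage count and fractions below are illustrative assumptions, not the paper's actual schedule:

```python
def curriculum_schedule(samples, confidences, n_stages=3):
    """Order training samples from high- to low-confidence annotations
    and release them in stages: stage s trains on the top s/n_stages
    fraction, so ambiguous diffuse-scar cases enter only late."""
    order = sorted(range(len(samples)), key=lambda i: -confidences[i])
    stages = []
    for s in range(1, n_stages + 1):
        cut = max(1, round(len(samples) * s / n_stages))
        stages.append([samples[i] for i in order[:cut]])
    return stages

stages = curriculum_schedule(["a", "b", "c", "d", "e", "f"],
                             [0.9, 0.2, 0.8, 0.4, 0.95, 0.6])
print(stages)
```

Each stage is a superset of the previous one, so the model never "forgets" the easy, well-annotated cases while the harder ones are introduced.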
[440] XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs
Chengyin Hu, Jiaju Han, Xuemeng Sun, Qike Zhang, Yiwei Wei, Ang Li, Chunlei Meng, Xiang Chen, Jiahuan Long
Main category: cs.CV
TL;DR: XSPA is a sparse pixel attack method that uses intersecting diagonal lines to create imperceptible perturbations that disrupt multiple vision-language tasks simultaneously, revealing vulnerabilities in shared representation spaces of VLMs.
Details
Motivation: Vision-language models rely on shared visual-textual representation spaces, but this may create common vulnerabilities where small perturbations can cause correlated failures across different tasks. The paper investigates whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations.
Method: Proposes the X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. It jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, modifying only about 1.76% of image pixels.
Result: XSPA consistently degrades performance across three tasks on COCO dataset: zero-shot accuracy drops by 52.33-67.00 points on CLIP models, GPT-4-evaluated caption consistency decreases by up to 58.60 points, and VQA correctness by up to 44.38 points.
Conclusion: Even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.
Abstract: Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). While this shared space enables strong cross-task generalization, it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. This risk is particularly important in interactive and decision-support settings, yet it remains unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks. Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.
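The X-shaped support can be reconstructed as a binary mask over two intersecting diagonals. The line width below is a guessed parameter, chosen because a width of 2 on a 224x224 image lands near the reported ~1.76% sparsity budget:

```python
import numpy as np

def x_mask(n, width=2):
    """Boolean mask confining a perturbation to two intersecting
    diagonal lines of an n x n image. `width` is an assumed thickness;
    width=2 at n=224 covers roughly 1.78% of pixels, close to the
    paper's reported budget."""
    m = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for w in range(width):
            j = min(i + w, n - 1)
            m[i, j] = True          # main diagonal band
            m[i, n - 1 - j] = True  # anti-diagonal band
    return m

m = x_mask(224)
print(round(100 * m.mean(), 2))  # percentage of modified pixels
```

The optimizer would then update only the pixels where the mask is True, which is what makes the attack budget so much stricter than a dense or patch-based perturbation.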
[441] ORSIFlow: Saliency-Guided Rectified Flow for Optical Remote Sensing Salient Object Detection
Haojing Chen, Yutong Li, Zhihang Liu, Tao Tan, Haoyu Bian, Qiuju Ma
Main category: cs.CV
TL;DR: ORSIFlow is a novel flow-based framework for optical remote sensing image salient object detection that reformulates the problem as deterministic latent flow generation, achieving state-of-the-art performance with improved efficiency.
Details
Motivation: Optical remote sensing image salient object detection faces challenges like complex backgrounds, low contrast, irregular shapes, and scale variations. Existing discriminative methods directly regress saliency maps, while diffusion-based approaches suffer from stochastic sampling and high computational costs.
Method: Proposes ORSIFlow, a saliency-guided rectified flow framework that performs saliency mask generation in a compact latent space using a frozen variational autoencoder. Includes a Salient Feature Discriminator for global semantic discrimination and a Salient Feature Calibrator for precise boundary refinement.
Result: Extensive experiments on multiple public benchmarks show ORSIFlow achieves state-of-the-art performance with significantly improved efficiency compared to existing methods.
Conclusion: ORSIFlow effectively addresses the challenges of ORSI-SOD by reformulating it as a deterministic latent flow generation problem, offering both high performance and computational efficiency.
Abstract: Optical Remote Sensing Image Salient Object Detection (ORSI-SOD) remains challenging due to complex backgrounds, low contrast, irregular object shapes, and large variations in object scale. Existing discriminative methods directly regress saliency maps, while recent diffusion-based generative approaches suffer from stochastic sampling and high computational cost. In this paper, we propose ORSIFlow, a saliency-guided rectified flow framework that reformulates ORSI-SOD as a deterministic latent flow generation problem. ORSIFlow performs saliency mask generation in a compact latent space constructed by a frozen variational autoencoder, enabling efficient inference with only a few steps. To enhance saliency awareness, we design a Salient Feature Discriminator for global semantic discrimination and a Salient Feature Calibrator for precise boundary refinement. Extensive experiments on multiple public benchmarks show that ORSIFlow achieves state-of-the-art performance with significantly improved efficiency. Codes are available at: https://github.com/Ch3nSir/ORSIFlow.
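The deterministic few-step inference that rectified flow enables can be sketched with plain Euler integration: because training targets a straight interpolation path whose true velocity is constant, very few steps suffice. The toy velocity field below assumes that ideal, fully rectified case:

```python
import numpy as np

def euler_sample(v_fn, z0, n_steps=4):
    """Few-step Euler integration of a rectified-flow ODE dz/dt = v(z, t)
    from t=0 (noise) to t=1 (latent saliency mask). Deterministic,
    unlike diffusion's stochastic sampling."""
    z, dt = z0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        z = z + dt * v_fn(z, k * dt)
    return z

# On the straight path z_t = (1-t) z0 + t z1, the true velocity is the
# constant z1 - z0, so Euler integration recovers z1 exactly.
z0 = np.zeros(4)
z1 = np.array([1.0, -2.0, 0.5, 3.0])
z = euler_sample(lambda z, t: z1 - z0, z0, n_steps=4)
print(np.allclose(z, z1))  # True
```

In the actual model a network predicts the velocity in the VAE latent space, but the straightness of the learned paths is exactly why "only a few steps" of this integrator are needed.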
[442] ELViS: Efficient Visual Similarity from Local Descriptors that Generalizes Across Domains
Pavel Suma, Giorgos Kordopatis-Zilos, Yannis Kalantidis, Giorgos Tolias
Main category: cs.CV
TL;DR: ELViS is an image-to-image similarity model that generalizes to unseen domains by operating in similarity space rather than representation space, using local descriptor correspondences with optimal transport and voting mechanisms.
Details
Motivation: Real-world image retrieval requires handling diverse domains, but large-scale instance-level training data is scarce. Models trained on domain-specific datasets struggle to generalize to unseen domains, creating a need for methods that can effectively transfer across domains.
Method: ELViS operates in similarity space rather than representation space. It leverages local descriptor correspondences, refines similarities through optimal transport with data-dependent gains to suppress uninformative descriptors, and aggregates strong correspondences via voting into image-level similarity.
Result: ELViS outperforms competing methods by a large margin in out-of-domain scenarios and on average across eight datasets spanning landmarks, artworks, products, and multi-domain collections, while requiring only a fraction of their computational cost.
Conclusion: The similarity-space approach with strong inductive biases yields a simple, efficient, and interpretable model that effectively generalizes to unseen domains for image retrieval tasks.
Abstract: Large-scale instance-level training data is scarce, so models are typically trained on domain-specific datasets. Yet in real-world retrieval, they must handle diverse domains, making generalization to unseen data critical. We introduce ELViS, an image-to-image similarity model that generalizes effectively to unseen domains. Unlike conventional approaches, our model operates in similarity space rather than representation space, promoting cross-domain transfer. It leverages local descriptor correspondences, refines their similarities through an optimal transport step with data-dependent gains that suppress uninformative descriptors, and aggregates strong correspondences via a voting process into an image-level similarity. This design injects strong inductive biases, yielding a simple, efficient, and interpretable model. To assess generalization, we compile a benchmark of eight datasets spanning landmarks, artworks, products, and multi-domain collections, and evaluate ELViS as a re-ranking method. Our experiments show that ELViS outperforms competing methods by a large margin in out-of-domain scenarios and on average, while requiring only a fraction of their computational cost. Code available at: https://github.com/pavelsuma/ELViS/
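The optimal-transport refinement and voting described above can be illustrated with a toy pure-Python sketch. The `sinkhorn` normalization and top-k `image_similarity` voting here are simplified stand-ins, not ELViS's exact data-dependent gains or voting scheme:

```python
import math

def sinkhorn(sim, n_iters=20, temperature=0.1):
    """Turn a local-descriptor similarity matrix into a (near) doubly
    stochastic transport plan; descriptors that match everything get
    their mass spread thin, which suppresses uninformative matches."""
    K = [[math.exp(s / temperature) for s in row] for row in sim]
    for _ in range(n_iters):
        # Row normalization.
        K = [[v / sum(row) for v in row] for row in K]
        # Column normalization.
        col_sums = [sum(K[i][j] for i in range(len(K))) for j in range(len(K[0]))]
        K = [[K[i][j] / col_sums[j] for j in range(len(K[0]))] for i in range(len(K))]
    return K

def image_similarity(sim, top_k=2):
    """Aggregate the strongest refined correspondences into one image-level score."""
    plan = sinkhorn(sim)
    weighted = sorted(
        (plan[i][j] * sim[i][j] for i in range(len(sim)) for j in range(len(sim[0]))),
        reverse=True,
    )
    return sum(weighted[:top_k])
```

A pair of images with clear one-to-one descriptor correspondences scores higher than a pair with only diffuse, ambiguous similarities, which is the inductive bias the similarity-space design relies on.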
[443] Divide and Restore: A Modular Task-Decoupled Framework for Universal Image Restoration
Joanna Wiekiera, Martyna Zur
Main category: cs.CV
TL;DR: A modular, task-decoupled image restoration framework using a diagnostic router to dynamically direct degraded images to specialized restoration experts, enabling efficient multi-degradation handling without full system retraining.
Details
Motivation: Current all-in-one image restoration models suffer from negative task interference and require extensive joint training on high-end hardware. There's a need for more efficient, scalable solutions that can handle multiple degradation types without full retraining.
Method: Proposes a modular framework with a lightweight CNN classifier as a router that evaluates input images and directs them to specialized restoration nodes (demonstrated with U-Net experts). The system is model-agnostic and extensible, isolating reconstruction paths to prevent feature conflicts.
Result: The framework offers computationally accessible multi-degradation restoration on standard local hardware, with reduced training overhead compared to monolithic models. Adding new degradation types only requires training a single expert and updating the router.
Conclusion: The modular, task-decoupled approach provides a scalable and efficient solution for image restoration that prevents negative task interference and enables easy extensibility to new degradation types.
Abstract: Restoring images affected by various types of degradation, such as noise, blur, or improper exposure, remains a significant challenge in computer vision. While recent trends favor complex monolithic all-in-one architectures, these models often suffer from negative task interference and require extensive joint training cycles on high-end computing clusters. In this paper, we propose a modular, task-decoupled image restoration framework based on an explicit diagnostic routing mechanism. The architecture consists of a lightweight Convolutional Neural Network (CNN) classifier that evaluates the input image and dynamically directs it to a specialized restoration node. A key advantage of this framework is its model-agnostic extensibility: while we demonstrate it using three independent U-Net experts, the system allows for the integration of any restoration method tailored to specific tasks. By isolating reconstruction paths, the framework prevents feature conflicts and significantly reduces training overhead. Unlike monolithic models, adding new degradation types in our framework only requires training a single expert and updating the router, rather than a full system retraining. Experimental results demonstrate that this computationally accessible approach offers a scalable and efficient solution for multi-degradation restoration on standard local hardware. The code will be published upon paper acceptance.
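The diagnostic-routing idea reduces to a dispatch table: classify the degradation, then hand the image to exactly one expert. The sketch below is illustrative only; `classify` and the expert names are hypothetical stand-ins, not the paper's code:

```python
def classify(image):
    # Stand-in for the lightweight CNN router: here we just read a tag
    # planted in the toy "image"; the real router predicts it from pixels.
    return image["degradation"]

def denoise(image):
    return {**image, "restored_by": "denoise"}

def deblur(image):
    return {**image, "restored_by": "deblur"}

def fix_exposure(image):
    return {**image, "restored_by": "exposure"}

# Model-agnostic registry: any restoration method can be plugged in.
EXPERTS = {"noise": denoise, "blur": deblur, "exposure": fix_exposure}

def restore(image):
    """Route the input to a single specialized expert. Supporting a new
    degradation type means adding one EXPERTS entry and retraining only
    the router, not the whole system."""
    return EXPERTS[classify(image)](image)
```

Because each input touches only one expert's weights, the reconstruction paths stay isolated, which is how the framework avoids negative task interference.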
[444] What-Meets-Where: Unified Learning of Action and Contact Localization in Images
Yuxiao Wang, Yu Lei, Wolin Liang, Weiying Xue, Zhenao Wei, Nan Zhuang, Qi Liu
Main category: cs.CV
Summary unavailable: the arXiv API request for this paper (2508.09428) was rate-limited (HTTP 429).
[445] Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and Cross-Paradigm Benchmark for Industrial Infrastructure
Chao Yin, Hongzhe Yue, Qing Han, Difeng Hu, Zhenyu Liang, Fangzhou Lin, Bing Sun, Boyu Wang, Mingkai Li, Wei Yao, Jack C. P. Cheng
Main category: cs.CV
TL;DR: Industrial3D: A large-scale terrestrial LiDAR dataset of 13 water treatment facilities with 612M labeled points at 6mm resolution, establishing the first industrial cross-paradigm benchmark for 3D scene understanding in MEP facilities.
Details
Motivation: Current 3D semantic segmentation benchmarks (like S3DIS, ScanNet) fail to represent the extreme challenges of industrial MEP facilities: severe geometric ambiguity, occlusion, and class imbalance. There's a need for datasets that capture real-world industrial complexity to advance Scan-to-BIM pipelines and digital twin construction.
Method: Created the Industrial3D dataset with 612M expertly labeled points from 13 water treatment facilities at 6mm resolution. Established a cross-paradigm benchmark evaluating 9 methods across supervised, weakly supervised, unsupervised, and foundation model settings under a unified protocol.
Result: The best supervised method achieves 55.74% mIoU, while zero-shot Point-SAM reaches only 15.79%, a 39.95-percentage-point gap that highlights domain-transfer challenges. Analysis shows the gap stems from statistical rarity (a 215:1 class imbalance) and geometric ambiguity that frequency-based re-weighting alone cannot resolve.
Conclusion: Industrial3D provides the largest and most demanding testbed for industrial 3D scene understanding, quantifying unresolved domain-transfer challenges for industrial TLS data and highlighting need for specialized approaches beyond current architectural benchmarks.
Abstract: Automated semantic understanding of dense point clouds is a prerequisite for Scan-to-BIM pipelines, digital twin construction, and as-built verification–core tasks in the digital transformation of the construction industry. Yet for industrial mechanical, electrical, and plumbing (MEP) facilities, this challenge remains largely unsolved: TLS acquisitions of water treatment plants, chiller halls, and pumping stations exhibit extreme geometric ambiguity, severe occlusion, and extreme class imbalance that architectural benchmarks (e.g., S3DIS or ScanNet) cannot adequately represent. We present Industrial3D, a terrestrial LiDAR dataset comprising 612 million expertly labelled points at 6 mm resolution from 13 water treatment facilities. At 6.6x the scale of the closest comparable MEP dataset, Industrial3D provides the largest and most demanding testbed for industrial 3D scene understanding to date. We further establish the first industrial cross-paradigm benchmark, evaluating nine representative methods across fully supervised, weakly supervised, unsupervised, and foundation model settings under a unified benchmark protocol. The best supervised method achieves 55.74% mIoU, whereas zero-shot Point-SAM reaches only 15.79%–a 39.95 percentage-point gap that quantifies the unresolved domain-transfer challenge for industrial TLS data. Systematic analysis reveals that this gap originates from a dual crisis: statistical rarity (215:1 imbalance, 3.5x more severe than S3DIS) and geometric ambiguity (tail-class points share cylindrical primitives with head-class pipes) that frequency-based re-weighting alone cannot resolve. Industrial3D, along with benchmark code and pre-trained models, will be publicly available at https://github.com/pointcloudyc/Industrial3D.
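The frequency-based re-weighting that the authors argue is insufficient on its own looks roughly like the sketch below. Class names and counts are illustrative assumptions; only the 215:1 head/tail ratio comes from the paper. The weights restore balance in the loss, but they cannot disambiguate tail-class points that share cylindrical geometry with head-class pipes:

```python
def inverse_frequency_weights(class_counts):
    """Standard frequency-based re-weighting: weight each class by the
    inverse of its point count, normalized so the weights average to 1."""
    total = sum(class_counts.values())
    raw = {c: total / n for c, n in class_counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

# A 215:1 head/tail imbalance, as reported for Industrial3D
# (class names and absolute counts are hypothetical).
counts = {"pipe": 215_000, "valve": 1_000}
weights = inverse_frequency_weights(counts)
```

The tail class ends up weighted 215x more than the head class, yet a re-weighted loss still cannot tell a valve from a pipe segment when both present the same cylindrical primitive, which is the "dual crisis" the abstract describes.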
[446] Sim-to-Real Fruit Detection Using Synthetic Data: Quantitative Evaluation and Embedded Deployment with Isaac Sim
Martina Hutter-Mironovova
Main category: cs.CV
TL;DR: Synthetic data from NVIDIA Isaac Sim combined with limited real fruit images improves YOLO-based object detection, with hybrid training approaching real-only performance while reducing annotation needs, and successfully deployed on Jetson Orin NX.
Details
Motivation: Addresses the challenge of limited real-world data for object detection in constrained deployment scenarios (embedded systems) by exploring synthetic data generation as a way to reduce manual annotation effort while maintaining performance.
Method: Generated synthetic datasets in NVIDIA Isaac Sim and combined them with limited real-world fruit images to train YOLO-based detection models under three regimes: real-only, synthetic-only, and hybrid. Evaluated on in-domain and domain shift test datasets, with deployment on Jetson Orin NX using TensorRT optimization.
Result: Real-only models achieved highest accuracy, synthetic-only models showed reduced performance due to domain gap, but hybrid training significantly improved over synthetic-only and approached real-only performance while reducing annotation needs. All models degraded under domain shift, but hybrid models showed improved robustness. Successful real-time deployment on Jetson Orin NX achieved.
Conclusion: Synthetic data is most effective when combined with real data, hybrid training reduces manual annotation requirements while maintaining competitive performance, and deployment constraints must be considered alongside detection accuracy for practical embedded applications.
Abstract: This study investigates the effectiveness of synthetic data for sim-to-real transfer in object detection under constrained data conditions and embedded deployment requirements. Synthetic datasets were generated in NVIDIA Isaac Sim and combined with limited real-world fruit images to train YOLO-based detection models under real-only, synthetic-only, and hybrid regimes. Performance was evaluated on two test datasets: an in-domain dataset with conditions matching the training data and a domain shift dataset containing real fruit and different background conditions. Results show that models trained exclusively on real data achieve the highest accuracy, while synthetic-only models exhibit reduced performance due to a domain gap. Hybrid training strategies significantly improve performance compared to synthetic-only approaches and achieve results close to real-only training while reducing the need for manual annotation. Under domain shift conditions, all models show performance degradation, with hybrid models providing improved robustness. The trained models were successfully deployed on a Jetson Orin NX using TensorRT optimization, achieving real-time inference performance. The findings highlight that synthetic data is most effective when used in combination with real data and that deployment constraints must be considered alongside detection accuracy.
[447] DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing
Kailai Feng, Yuxiang Wei, Bo Chen, Yang Pan, Hu Ye, Songwei Liu, Chenqian Yan, Yuan Gao
Main category: cs.CV
TL;DR: DreamLite is a compact unified on-device diffusion model (0.39B parameters) that supports both text-to-image generation and text-guided image editing in a single network, achieving fast inference (<1s for 1024x1024 images) on mobile devices.
Details
Motivation: Existing diffusion models are large (billions of parameters) with high latency and deployment challenges. On-device models focus mainly on generation but lack editing capabilities. There's a need for a compact unified model that supports both tasks efficiently on mobile devices.
Method: Built on a pruned mobile U-Net backbone with in-context spatial concatenation in latent space. Uses a (target|blank) configuration for generation and (target|source) for editing. Employs a task-progressive joint pretraining strategy (T2I → editing → joint tasks) with SFT and reinforcement learning. Uses step distillation to reduce denoising to 4 steps.
Result: Achieves GenEval score of 0.72 for generation and ImgEdit score of 4.11 for editing, outperforming existing on-device models and competitive with server-side models. Generates/edits 1024x1024 images in <1s on Xiaomi 14 smartphone.
Conclusion: DreamLite is the first unified on-device diffusion model supporting both image generation and editing, demonstrating efficient multimodal capabilities in a compact architecture suitable for mobile deployment.
Abstract: Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, leading to high latency and increased deployment challenges. While on-device diffusion models improve efficiency, they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. It concatenates images horizontally as input, using a (target | blank) configuration for generation tasks and (target | source) for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After high-quality SFT and reinforcement learning, DreamLite achieves GenEval (0.72) for image generation and ImgEdit (4.11) for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce the denoising process to just 4 steps, enabling DreamLite to generate or edit a 1024 x 1024 image in less than 1s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.
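The in-context spatial concatenation can be sketched with plain nested lists standing in for latents. This is a deliberate simplification of the idea only: real latents are multi-channel tensors, and the strings here merely mark which half is which.

```python
def make_latent(fill, h=4, w=4):
    # Toy h x w "latent" filled with a marker value.
    return [[fill] * w for _ in range(h)]

def concat_horizontal(left, right):
    """In-context spatial conditioning: place two latents side by side so
    a single network sees generation and editing in the same input format."""
    return [lrow + rrow for lrow, rrow in zip(left, right)]

# Generation: the model denoises the target half next to a blank canvas.
gen_input = concat_horizontal(make_latent("target"), make_latent(0.0))
# Editing: the target half is denoised conditioned on the source alongside it.
edit_input = concat_horizontal(make_latent("target"), make_latent("source"))
```

Because both tasks reduce to "denoise the left half given the right half", one compact network can serve both without task-specific branches.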
[448] FlowIt: Global Matching for Optical Flow with Confidence-Guided Refinement
Sadra Safadoust, Fabio Tosi, Matteo Poggi, Fatma Güney
Main category: cs.CV
TL;DR: FlowIt is a hierarchical transformer architecture for optical flow estimation that handles large displacements via optimal transport-based initialization and guided refinement with confidence maps.
Details
Motivation: Existing optical flow methods struggle with large pixel displacements and lack robustness in handling occlusions and ambiguous regions. There's a need for better global context modeling and reliable motion estimation in challenging scenarios.
Method: Uses a hierarchical transformer architecture for global context, formulates flow initialization as an optimal transport problem to get a robust initial flow with occlusion/confidence maps, then performs guided refinement propagating reliable estimates from high-confidence to low-confidence regions.
Result: Achieves state-of-the-art results on Sintel and KITTI benchmarks, and establishes new SOTA cross-dataset zero-shot generalization performance on Sintel, Spring, and LayeredFlow datasets.
Conclusion: FlowIt demonstrates superior performance in optical flow estimation, particularly for large displacements and challenging scenarios, through its transformer-based architecture and optimal transport formulation with guided refinement.
Abstract: We present FlowIt, a novel architecture for optical flow estimation designed to robustly handle large pixel displacements. At its core, FlowIt leverages a hierarchical transformer architecture that captures extensive global context, enabling the model to effectively model long-range correspondences. To overcome the limitations of localized matching, we formulate the flow initialization as an optimal transport problem. This formulation yields a highly robust initial flow field, alongside explicitly derived occlusion and confidence maps. These cues are then seamlessly integrated into a guided refinement stage, where the network actively propagates reliable motion estimates from high-confidence regions into ambiguous, low-confidence areas. Extensive experiments across the Sintel, KITTI, Spring, and LayeredFlow datasets validate the efficacy of our approach. FlowIt achieves state-of-the-art results on the competitive Sintel and KITTI benchmarks, while simultaneously establishing new state-of-the-art cross-dataset zero-shot generalization performance on Sintel, Spring, and LayeredFlow.
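The confidence-guided propagation step can be sketched as a single averaging pass: each low-confidence pixel borrows a confidence-weighted average from its high-confidence neighbors. This is a toy illustration, not FlowIt's learned refinement, and it uses a scalar "flow" per pixel for brevity where real optical flow is a 2-vector:

```python
def propagate(flow, confidence, threshold=0.5):
    """One pass of confidence-guided refinement over a 2D grid of scalar
    flow values: low-confidence pixels take a confidence-weighted average
    of their 4-connected high-confidence neighbors."""
    h, w = len(flow), len(flow[0])
    out = [row[:] for row in flow]
    for y in range(h):
        for x in range(w):
            if confidence[y][x] >= threshold:
                continue  # reliable estimate, keep as-is
            num, den = 0.0, 0.0
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and confidence[ny][nx] >= threshold:
                    num += confidence[ny][nx] * flow[ny][nx]
                    den += confidence[ny][nx]
            if den > 0:
                out[y][x] = num / den
    return out
```

Applied iteratively, reliable motion spreads from confident regions into occluded or ambiguous ones, which is the intuition behind the guided refinement stage.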
[449] More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang
Main category: cs.CV
Summary unavailable: the arXiv API request for this paper (2509.25848) was rate-limited (HTTP 429).
[450] SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild
Patrick Rim, Kevin Harris, Braden Copple, Shangchen Han, Xu Xie, Ivan Shugurov, Sizhe An, He Wen, Alex Wong, Tomas Hodan, Kun He
Main category: cs.CV
TL;DR: SHOW3D: A novel marker-less multi-camera system for capturing 3D hand-object interactions in diverse real-world environments, addressing the generalization gap of existing controlled studio datasets.
Details
Motivation: Existing hand-object interaction datasets are captured in controlled studio settings, limiting environmental diversity and model generalization to real-world scenarios. There's a need for datasets with precise 3D annotations in genuinely in-the-wild conditions.
Method: Developed a lightweight, back-mounted multi-camera rig synchronized with a VR headset for unconstrained mobility. Created an ego-exo tracking pipeline for 3D ground-truth annotation of hands and objects in diverse environments.
Result: Created SHOW3D, the first large-scale dataset with 3D annotations of hands interacting with objects in diverse real-world environments including outdoor settings. Validated approach through experiments on downstream tasks.
Conclusion: The system significantly reduces the trade-off between environmental realism and 3D annotation accuracy, enabling better generalization of hand-object interaction models to real-world scenarios.
Abstract: Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows for nearly unconstrained mobility in genuinely in-the-wild conditions, while still having the ability to generate precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted, multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations that show hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and accuracy of 3D annotations, which we validate with experiments on several downstream tasks. show3d-dataset.github.io
[451] Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework
Nanaka Hosokawa, Ryo Takahashi, Tomoya Kitano, Yukihiro Iida, Chisako Muramatsu, Tatsuro Hayashi, Yuta Seino, Xiangrong Zhou, Takeshi Hara, Akitoshi Katsumata, Hiroshi Fujita
Main category: cs.CV
Summary unavailable: the arXiv API request for this paper (2510.02001) was rate-limited (HTTP 429).
[452] PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models
Lorenza Prospero, Orest Kupyn, Ostap Viniavskyi, João F. Henriques, Christian Rupprecht
Main category: cs.CV
TL;DR: PoseDreamer generates synthetic 3D human mesh datasets using diffusion models with automatic 3D annotations, outperforming traditional rendering methods in quality and utility.
Details
Motivation: Existing 3D human mesh datasets face limitations: real datasets are small and expensive to annotate, while synthetic datasets lack photorealism and diversity. There's a need for scalable, high-quality synthetic data with accurate 3D annotations.
Method: Uses diffusion models for controllable image generation with Direct Preference Optimization for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering to maintain 3D label-image correspondence.
Result: Generated 500,000+ high-quality synthetic samples with 76% improvement in image-quality metrics over rendering-based datasets. Models trained on PoseDreamer perform comparably or better than those trained on real or traditional synthetic data.
Conclusion: Generated data via diffusion models offers a promising third path for 3D human mesh estimation, providing scalable, high-quality synthetic datasets that complement existing real and synthetic data sources.
Abstract: Acquiring labeled datasets for 3D human mesh estimation is challenging due to depth ambiguities and the inherent difficulty of annotating 3D geometry from monocular images. Existing datasets are either real, with manually annotated 3D geometry and limited scale, or synthetic, rendered from 3D engines that provide precise labels but suffer from limited photorealism, low diversity, and high production costs. In this work, we explore a third path: generated data. We introduce PoseDreamer, a novel pipeline that leverages diffusion models to generate large-scale synthetic datasets with 3D mesh annotations. Our approach combines controllable image generation with Direct Preference Optimization for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering. Together, these components naturally maintain correspondence between 3D labels and generated images, while prioritizing challenging samples to maximize dataset utility. Using PoseDreamer, we generate more than 500,000 high-quality synthetic samples, achieving a 76% improvement in image-quality metrics compared to rendering-based datasets. Models trained on PoseDreamer achieve performance comparable to or superior to those trained on real-world and traditional synthetic datasets. In addition, combining PoseDreamer with synthetic datasets results in better performance than combining real-world and synthetic datasets, demonstrating the complementary nature of our dataset. We will release the full dataset and generation code.
[453] Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation
Yunzhe Xu, Yiyuan Pan, Zhe Liu
Main category: cs.CV
Summary unavailable: the arXiv API request for this paper (2510.08553) was rate-limited (HTTP 429).
[454] HandX: Scaling Bimanual Motion and Interaction Generation
Zimu Zhang, Yucheng Zhang, Xiyan Xu, Ziyin Wang, Sirui Xu, Kai Zhou, Bing Zhou, Chuan Guo, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui
Main category: cs.CV
TL;DR: HandX introduces a unified framework for bimanual hand motion synthesis with new dataset, annotation pipeline using LLMs, and benchmarking of diffusion/autoregressive models for dexterous hand motion generation.
Details
Motivation: Existing whole-body motion synthesis models lack fine-grained hand motion and bimanual interaction capabilities, with insufficient high-fidelity datasets capturing nuanced finger dynamics and inter-hand coordination.
Method: 1) Consolidate existing datasets and collect new motion-capture data for bimanual interactions; 2) Develop a decoupled annotation pipeline extracting motion features (contact events, finger flexion) then using LLMs for semantic descriptions; 3) Benchmark diffusion and autoregressive models with various conditioning modes; 4) Propose hand-focused evaluation metrics.
Result: High-quality dexterous motion generation demonstrated, with clear scaling trends showing larger models on larger, higher-quality datasets produce more semantically coherent bimanual motion. New dataset released for research.
Conclusion: HandX provides comprehensive framework for bimanual hand motion synthesis, addressing data scarcity and annotation challenges through LLM-assisted pipeline, enabling realistic dexterous motion generation with measurable improvements through scaling.
Abstract: Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior (finger articulation, contact timing, and inter-hand coordination), and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.
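The contact-event extraction feeding the decoupled annotation pipeline can be sketched as thresholding a sequence of per-frame hand-object distances; the threshold value and function name below are illustrative assumptions, not the paper's implementation:

```python
def contact_events(distances, threshold=0.01):
    """Derive contact-onset events from per-frame hand-object distances:
    an event is recorded at each frame where the distance first crosses
    below the threshold (i.e., contact begins)."""
    events = []
    in_contact = False
    for t, d in enumerate(distances):
        now = d < threshold
        if now and not in_contact:
            events.append(t)
        in_contact = now
    return events
```

Symbolic features like these event timestamps (together with, e.g., finger-flexion angles) are what the pipeline hands to an LLM, which writes the semantic description without ever touching raw motion data.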
[455] Gen-Searcher: Reinforcing Agentic Search for Image Generation
Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, Xiangyu Yue
Main category: cs.CV
TL;DR: Gen-Searcher: A search-augmented image generation agent that performs multi-hop reasoning to collect textual knowledge and reference images for grounded generation, addressing frozen knowledge limitations in current image generation models.
Details
Motivation: Current image generation models have frozen internal knowledge that fails on knowledge-intensive real-world scenarios requiring up-to-date information. There's a need for systems that can dynamically search and incorporate external knowledge for more accurate and current image generation.
Method: Developed Gen-Searcher as a search-augmented image generation agent with multi-hop reasoning capabilities. Created two datasets (Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k) and the KnowGen benchmark. Trained using supervised fine-tuning followed by agentic reinforcement learning with dual reward feedback combining text-based and image-based rewards.
Result: Gen-Searcher substantially improves performance, boosting Qwen-Image by around 16 points on KnowGen benchmark and 15 points on WISE benchmark, demonstrating significant gains in search-grounded image generation.
Conclusion: Gen-Searcher represents the first search-augmented image generation agent that effectively addresses knowledge limitations through multi-hop reasoning and search, providing a foundation for more knowledgeable and current image generation systems.
Abstract: Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.
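The multi-hop reason-and-search loop can be sketched generically as below. Everything here is a hypothetical stand-in: `search_fn` for the agent's retrieval tool call and `needs_more_fn` for its learned decision to issue another query or stop and generate.

```python
def multihop_search(prompt, search_fn, needs_more_fn, max_hops=3):
    """Generic multi-hop retrieval loop: keep issuing queries and
    accumulating evidence until the agent decides the prompt is
    sufficiently grounded (needs_more_fn returns None) or hops run out."""
    evidence = []
    query = prompt
    for _ in range(max_hops):
        evidence.extend(search_fn(query))
        query = needs_more_fn(prompt, evidence)
        if query is None:
            break
    return evidence
```

The collected evidence (text snippets and reference images, in Gen-Searcher's case) then conditions the generation step, which is what the paper's dual text/image rewards are designed to score.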
[456] SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion
Jungbin Cho, Minsu Kim, Jisoo Kim, Ce Zheng, Laszlo A. Jeni, Ming-Hsuan Yang, Youngjae Yu, Seonjoo Kim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.13044 was rate-limited (HTTP 429).
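The HTTP 429 responses recorded throughout the remainder of this section come from rate limiting on the arXiv export API. A fetcher can usually avoid them by retrying with exponential backoff; the sketch below is illustrative (the base delay, cap, and retry count are assumptions, not arXiv's documented policy):

```python
import time
import urllib.error
import urllib.request

def backoff_delays(max_retries: int, base: float = 3.0, cap: float = 60.0) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap`.
    Values are illustrative; consult the arXiv API terms for real limits."""
    return [min(base * (2 ** i), cap) for i in range(max_retries)]

def fetch_with_backoff(url: str, max_retries: int = 5) -> bytes:
    """GET a URL, sleeping and retrying on HTTP 429 per the schedule above."""
    for delay in backoff_delays(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise  # propagate anything that is not rate limiting
            time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_retries} retries: {url}")
```

With the defaults, the fetcher waits 3 s, 6 s, 12 s, 24 s, and 48 s between attempts before giving up, which spaces requests well past typical per-client limits.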
[457] ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation
Kaishen Wang, Ruibo Chen, Tong Zheng, Heng Huang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.11483 was rate-limited (HTTP 429).
[458] SciEGQA: A Dataset for Scientific Evidence-Grounded Question Answering and Reasoning
Wenhan Yu, Zhaoxi Zhang, Wang Chen, Guanqiang Qi, Weikang Li, Lei Sha, Deguo Xia, Jizhou Huang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.15090 was rate-limited (HTTP 429).
[459] SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollár, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, Christoph Feichtenhofer
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.16719 was rate-limited (HTTP 429).
[460] From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training
Donglai Xu, Hongzheng Yang, Yuzhi Zhao, Pingping Zhang, Jinpeng Chen, Wenao Ma, Zhijian Hou, Mengyang Wu, Xiaolei Li, Senkang Hu, Ziyi Guan, Jason Chun Lok Li, Lai Man Po
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.07738 was rate-limited (HTTP 429).
[461] UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.19413 was rate-limited (HTTP 429).
[462] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
Jiajie Zhang, Sören Schwertfeger, Alexander Kleiner
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.21428 was rate-limited (HTTP 429).
[463] What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely $F_1$
Sébastien Piérard, Adrien Deliège, Marc Van Droogenbroeck
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.22442 was rate-limited (HTTP 429).
[464] AltChart: Enhancing VLM-based Chart Summarization Through Multi-Pretext Tasks
Omar Moured, Jiaming Zhang, M. Saquib Sarfraz, Rainer Stiefelhagen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2405.13580 was rate-limited (HTTP 429).
[465] Overcoming the Curvature Bottleneck in MeanFlow
Xinxi Zhang, Shiwei Tan, Quang Nguyen, Quan Dao, Ligong Han, Xiaoxiao He, Tunyu Zhang, Chengzhi Mao, Dimitris Metaxas, Vladimir Pavlovic
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.23342 was rate-limited (HTTP 429).
[466] Monitoring Simulated Physical Weakness Using Detailed Behavioral Features and Personalized Modeling
Chen Long-fei, Muhammad Ahmed Raza, Craig Innes, Subramanian Ramamoorthy, Robert B. Fisher
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2406.10045 was rate-limited (HTTP 429).
[467] Disrupting Hierarchical Reasoning: Adversarial Protection for Geographic Privacy in Multimodal Reasoning Models
Jiaming Zhang, Che Wang, Yang Cao, Longtao Huang, Wei Yang Bryan Lim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.08503 was rate-limited (HTTP 429).
[468] iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency
Haruna Yunusa, Adamu Lawan, Abdulganiyu Abdu Yusuf
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2407.07603 was rate-limited (HTTP 429).
[469] BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham, Arjun Chandra, Joey Huang, Pengyue Zhu, Helen Chen, David Li, Jeffrey Li, Shawn Li, Andrew Zagula, Amy Zhao, Andrew Zhu, Sayaka Nakamura, Yuki Yamamoto, Jerry Jun Yokono, Aaron Mueller, Bryan A. Plummer, Kate Saenko, Venkatesh Saligrama, Boqing Gong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.10932 was rate-limited (HTTP 429).
[470] Unleashing the Potential of Mamba: Boosting a LiDAR 3D Sparse Detector by Using Cross-Model Knowledge Distillation
Rui Yu, Runkai Zhao, Jiagen Li, Qingsong Zhao, HuaiCheng Yan, Meng Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2409.11018 was rate-limited (HTTP 429).
[471] StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
Zhiyao Sun, Ziqiao Peng, Yifeng Ma, Yi Chen, Zhengguang Zhou, Zixiang Zhou, Guozhen Zhang, Youliang Zhang, Yuan Zhou, Qinglin Lu, Yong-Jin Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.22065 was rate-limited (HTTP 429).
[472] Match Stereo Videos via Bidirectional Alignment
Junpeng Jing, Ye Mao, Anlan Qiu, Krystian Mikolajczyk
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2409.20283 was rate-limited (HTTP 429).
[473] Tracking by Detection and Query: An Efficient End-to-End Framework for Multi-Object Tracking
Shukun Jia, Shiyu Hu, Yichao Cao, Feng Yang, Xin Lu, Xiaobo Lu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2411.06197 was rate-limited (HTTP 429).
[474] DIFEM: Key-points Interaction based Feature Extraction Module for Violence Recognition in Videos
Himanshu Mittal, Suvramalya Basak, Anjali Gautam
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2412.05386 was rate-limited (HTTP 429).
[475] Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2412.14015 was rate-limited (HTTP 429).
[476] SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
Zekun Li, Ning Wang, Tongxin Bai, Changwang Mei, Peisong Wang, Shuang Qiu, Jian Cheng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.04361 was rate-limited (HTTP 429).
[477] Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation
Quan Dao, Hao Phung, Trung Dao, Dimitris Metaxas, Anh Tran
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2412.16906 was rate-limited (HTTP 429).
[478] Resolving Spatio-Temporal Entanglement in Video Prediction via Multi-Modal Attention
Shreyam Gupta, P. Agrawal, Priyam Gupta
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2501.16997 was rate-limited (HTTP 429).
[479] ConfIC-RCA: Statistically Grounded Efficient Estimation of Segmentation Quality
Matias Cosarinsky, Ramiro Billot, Lucas Mansilla, Gabriel Jimenez, Nicolas Gaggión, Guanghui Fu, Tom Tirer, Enzo Ferrante
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2503.04522 was rate-limited (HTTP 429).
[480] SASNet: Spatially-Adaptive Sinusoidal Networks for INRs
Haoan Feng, Diana Aldana, Tiago Novello, Leila De Floriani
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2503.09750 was rate-limited (HTTP 429).
[481] MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
Ruijie Zhu, Jiahao Lu, Wenbo Hu, Xiaoguang Han, Jianfei Cai, Ying Shan, Chuanxia Zheng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.08961 was rate-limited (HTTP 429).
[482] vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition
Yunusa Haruna, Adamu Lawan, Shamsuddeen Hassan Muhammad, Jiaquan Zhang, Chaoning Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2503.21262 was rate-limited (HTTP 429).
[483] Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification
Nghia Nguyen, Tianjiao Ding, René Vidal
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.11448 was rate-limited (HTTP 429).
[484] Learning Underwater Active Perception in Simulation
Alexandre Cardaillac, Donald G. Dansereau
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2504.17817 was rate-limited (HTTP 429).
[485] MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Mehrshad Taji, Arad Mahdinezhad Kashani, Iman Ahmadi, AmirHossein Jadidi, Saina Kashani, Babak Khalaj
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.16898 was rate-limited (HTTP 429).
[486] $ϕ$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models
Thanh-Dat Truong, Huu-Thien Tran, Jackson Cothren, Bhiksha Raj, Khoa Luu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.22601 was rate-limited (HTTP 429).
[487] P$^2$HCT: Plug-and-Play Hierarchical C2F Transformer for Multi-Scale Feature Fusion
Junyi Hu, Tian Bai, Fengyi Wu, Zhenming Peng, Yi Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2505.12772 was rate-limited (HTTP 429).
[488] FUSAR-GPT : A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery
Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun, Xiaorong Guo, Qingchen Fang, Ruyi Zhang, Xinpeng Zhou, Haipeng Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.19190 was rate-limited (HTTP 429).
[489] Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation
Chika Maduabuchi, Hao Chen, Yujin Han, Jindong Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2505.21545 was rate-limited (HTTP 429).
[490] ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Cailin Zhuang, Ailin Huang, Yaoqi Hu, Jingwei Wu, Wei Cheng, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Zhewei Huang, Gang Yu, Chi Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2505.24862 was rate-limited (HTTP 429).
[491] Cross-Modal Urban Sensing: Evaluating Sound-Vision Alignment Across Street-Level and Aerial Imagery
Pengyu Chen, Xiao Huang, Teng Fei, Sicheng Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.03388 was rate-limited (HTTP 429).
[492] CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou, Anxiang Zeng, Jianqiang Huang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.21655 was rate-limited (HTTP 429).
[493] Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning
Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zexuan Yan, Zhifeng Li, Sirui Han, Chenyang Qi, Qifeng Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.05207 was rate-limited (HTTP 429).
[494] Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Yiming Wang, Fabio Poiesi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.23153 was rate-limited (HTTP 429).
[495] Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework
Rajesh Shrestha, Xiao Fu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.10281 was rate-limited (HTTP 429).
[496] VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues
Sirnam Swetha, Rohit Gupta, Parth Parag Kulkarni, David G Shatwell, Jeffrey A Chan Santiago, Nyle Siddiqui, Joseph Fioresi, Mubarak Shah
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.21742 was rate-limited (HTTP 429).
[497] AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
Zhen Qu, Xian Tao, Xiaoyi Bao, Dingrong Wang, ShiChen Qu, Zhengtao Zhang, Xingang Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.01305 was rate-limited (HTTP 429).
[498] FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation
Huy Che, Vinh-Tiep Nguyen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.23323 was rate-limited (HTTP 429).
[499] LH2Face: Loss function for Hard High-quality Face
Fan Xie, Yang Wang, Yikang Jiao, Zhenyu Yuan, Congxi Chen, Chuanxin Zhao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.23555 was rate-limited (HTTP 429).
[500] Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
Divyanshu Daiya, Aniket Bera
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.02190 was rate-limited (HTTP 429).
[501] Habitat Classification from Ground-Level Imagery Using Deep Neural Networks
Hongrui Shi, Lisa Norton, Lucy Ridding, Simon Rolph, Tom August, Claire M Wood, Lan Qie, Petra Bosilj, James M Brown
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2507.04017 was rate-limited (HTTP 429).
[502] Effort-Optimized, Accuracy-Driven Labelling and Validation of Test Inputs for DL Systems: A Mixed-Integer Linear Programming Approach
Mohammad Hossein Amini, Mehrdad Sabetzadeh, Shiva Nejati
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2507.04990 was rate-limited (HTTP 429).
[503] AnthroTAP: Learning Point Tracking with Real-World Motion
Inès Hyeonsu Kim, Seokju Cho, Jahyeok Koo, Junghyun Park, Jiahui Huang, Honglak Lee, Joon-Young Lee, Seungryong Kim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2507.06233 was rate-limited (HTTP 429).
[504] Minimizing the Pretraining Gap: Domain-aligned Text-Based Person Retrieval
Shuyu Yang, Yaxiong Wang, Yongrui Li, Li Zhu, Zhedong Zheng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2507.10195 was rate-limited (HTTP 429).
[505] Coarse-Guided Visual Generation via Weighted h-Transform Sampling
Yanghao Wang, Ziqi Jiang, Zhen Wang, Long Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.12057 was rate-limited (HTTP 429).
[506] MAN++: Scaling Momentum Auxiliary Network for Supervised Local Learning in Vision Tasks
Junhao Su, Feiyu Zhu, Hengyu Shi, Tianyang Han, Yurui Qiu, Junfeng Luo, Xiaoming Wei, Jialin Gao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2507.16279 was rate-limited (HTTP 429).
[507] AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
Yogesh Kulkarni, Pooyan Fazli
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2508.03100 was rate-limited (HTTP 429).
[508] DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction
Dengxian Gong, Shunping Ji
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2508.13669 was rate-limited (HTTP 429).
[509] See No Evil: Adversarial Attacks Against Linguistic-Visual Association in Referring Multi-Object Tracking Systems
Halima Bouzidi, Haoyu Liu, Mohammad Abdullah Al Faruque
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.02028 was rate-limited (HTTP 429).
[510] OmniStyle2: Learning to Stylize by Learning to Destylize
Ye Wang, Zili Yi, Yibo Zhang, Peng Zheng, Xuping Xie, Jiang Lin, Yijun Li, Yilin Wang, Rui Ma
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.05970 was rate-limited (HTTP 429).
[511] SEEC: Segmentation-Assisted Multi-Entropy Models for Learned Lossless Image Compression
Chunhang Zheng, Zichang Ren, Dou Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.07704 was rate-limited (HTTP 429).
[512] Dual Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions
Sriram Narayanan, Mani Ramanagopal, Srinivasa G. Narasimhan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.11334 was rate-limited (HTTP 429).
[513] Multimodal Graph Network Modeling for Human-Object Interaction Detection with PDE Graph Diffusion
Wenxuan Ji, Haichao Shi, Xiao-Yu Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.12554 was rate-limited (HTTP 429).
[514] BlurBall: Joint Ball and Motion Blur Estimation for Table Tennis Ball Tracking
Thomas Gossard, Filip Radovic, Andreas Ziegler, Andreas Zell
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.18387 was rate-limited (HTTP 429).
[515] NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics
Yu Yuan, Xijun Wang, Tharindu Wickremasinghe, Zeeshan Nadir, Bole Ma, Stanley H. Chan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.21309 was rate-limited (HTTP 429).
[516] Event-based Facial Keypoint Alignment via Cross-Modal Fusion Attention and Self-Supervised Multi-Event Representation Learning
Donghwa Kang, Junho Kim, Dongwoo Kang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.24968 was rate-limited (HTTP 429).
[517] Few TensoRF: Enhance the Few-shot on Tensorial Radiance Fields
Thanh-Hai Le, Hoang-Hau Tran, Trong-Nghia Vu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.25008 was rate-limited (HTTP 429).
[518] NARVis: Neural Accelerated Rendering for Real-Time Scientific Point Cloud Visualization
Srinidhi Hegde, Kaur Kullman, Thomas Grubb, Leslie Lait, Stephen Guimond, Matthias Zwicker
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2407.19097 was rate-limited (HTTP 429).
[519] Revisiting Adversarial Training under Hyperspectral Image
Weihua Zhang, Chengze Jiang, Minjing Dong, Jie Gui, Lu Dong, Zhipeng Gui, Yuan Yan Tang, James Tin-Yau Kwok
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.01014 was rate-limited (HTTP 429).
[520] Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, Xiang Bai
Main category: cs.CV
TL;DR: Hybrid Memory paradigm for video world models that simultaneously archives static backgrounds and tracks dynamic subjects during out-of-view intervals, with HM-World dataset and HyDRA architecture.
Details
Motivation: Current video world models treat environments as static canvases and struggle when dynamic subjects hide out of sight and later re-emerge, leading to frozen, distorted, or vanishing subjects.
Method: Introduces the Hybrid Memory paradigm, requiring models to act as archivists for static backgrounds and trackers for dynamic subjects. Constructs the HM-World dataset with 59K clips featuring decoupled camera/subject trajectories and exit-entry events. Proposes the HyDRA architecture, which compresses memory into tokens with spatiotemporal relevance-driven retrieval.
Result: HyDRA significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality on the HM-World benchmark.
Conclusion: The Hybrid Memory paradigm addresses a critical limitation of video world models, enabling better handling of dynamic subjects during occlusions through a specialized memory architecture and a comprehensive dataset.
Abstract: Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality. Code is publicly available at https://github.com/H-EmbodVis/HyDRA.
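The abstract's description of HyDRA — compressing memory into tokens and retrieving them by spatiotemporal relevance — can be pictured with a minimal toy sketch: score each stored token against a query from the current frame and keep only the top-k most relevant ones. Everything here (dot-product scoring, the `retrieve_memory` name, the token counts and dimensions) is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

def retrieve_memory(query, memory, k=4):
    """Toy relevance-driven retrieval: score every compressed memory
    token against the current query (dot product) and return the k
    highest-scoring tokens plus their indices."""
    scores = memory @ query                 # relevance per token, shape (N,)
    top = np.argsort(scores)[-k:][::-1]     # indices of the k best scores
    return memory[top], top

rng = np.random.default_rng(0)
memory = rng.normal(size=(32, 8))   # 32 compressed memory tokens, dim 8
query = rng.normal(size=8)          # query derived from the current frame
selected, idx = retrieve_memory(query, memory, k=4)
print(selected.shape)               # (4, 8)
```

In a real model the query and memory tokens would be learned features and the selected tokens would feed an attention layer; the sketch only shows the selective-retrieval idea that lets the model attend to relevant motion cues instead of the whole history.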
[521] Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.08673 was rate-limited (HTTP 429).
[522] Vega: Learning to Drive with Natural Language Instructions
Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu
Main category: cs.CV
TL;DR: Vega is a Vision-Language-World-Action model for autonomous driving that follows diverse user instructions to generate personalized trajectories using joint multimodal attention with autoregressive and diffusion paradigms.
Details
Motivation: Existing vision-language-action models for autonomous driving primarily use language only for scene description or reasoning, lacking the flexibility to follow diverse user instructions for personalized driving experiences.
Method: Constructs the large-scale InstructScene dataset (100k scenes with diverse driving instructions and corresponding trajectories). Proposes the Vega model with: 1) an autoregressive paradigm for visual inputs and language instructions, 2) a diffusion paradigm for future predictions and trajectory generation, 3) joint attention for modality interactions, and 4) individual projection layers for the different modalities.
Result: Extensive experiments show superior planning performance and strong instruction-following abilities, enabling more intelligent and personalized driving systems.
Conclusion: Vega paves the way for instruction-based generation and planning in autonomous driving by effectively integrating vision, language, world modeling, and action generation in a unified framework.
Abstract: Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.
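The "joint attention with individual projection layers" that Vega is described as using can be sketched in miniature: each modality's tokens pass through their own projection into a shared width, the projected tokens are concatenated into one sequence, and a single self-attention pass lets all modalities interact. The function names, token counts, and per-modality dimensions below are hypothetical illustrations, not the paper's architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_attention(streams, projections, d):
    """Project each modality with its own matrix, concatenate into one
    token sequence, and self-attend so modalities can interact."""
    tokens = np.concatenate(
        [streams[name] @ projections[name] for name in streams], axis=0)
    attn = softmax(tokens @ tokens.T / np.sqrt(d))  # joint attention weights
    return attn @ tokens                            # mixed representation

rng = np.random.default_rng(1)
d = 16                                            # shared model width
streams = {"vision": rng.normal(size=(6, 32)),    # 6 visual tokens, dim 32
           "language": rng.normal(size=(4, 24)),  # 4 instruction tokens, dim 24
           "action": rng.normal(size=(3, 8))}     # 3 trajectory tokens, dim 8
projections = {name: rng.normal(size=(m.shape[1], d))
               for name, m in streams.items()}    # one projection per modality
out = joint_attention(streams, projections, d)
print(out.shape)  # (13, 16): 6 + 4 + 3 tokens in the shared space
```

The per-modality projections are what allow streams of different native dimensionality to share one attention block, which is the design point the summary attributes to Vega.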
[523] Robust Ego-Exo Correspondence with Long-Term Memory
Yijun Hu, Bing Fan, Xin Gu, Haiqing Ren, Dongfang Liu, Heng Fan, Libo Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.11417 was rate-limited (HTTP 429).
[524] PhysVid: Physics Aware Local Conditioning for Generative Video Models
Saurabh Pathak, Elahe Arani, Mykola Pechenizkiy, Bahram Zonooz
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.26285 was rate-limited (HTTP 429).
[525] SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms
Haithem Turki, Qi Wu, Xin Kang, Janick Martinez Esturo, Shengyu Huang, Ruilong Li, Zan Gojcic, Riccardo de Lutio
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.12901 was rate-limited (HTTP 429).
[526] CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities
Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.26425 was rate-limited (HTTP 429).
[527] Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
Liao Shen, Wentao Jiang, Yiran Zhu, Jiahe Li, Tiezheng Ge, Zhiguo Cao, Bo Zheng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.14255 was rate-limited (HTTP 429).
[528] Target-aware Image Editing via Cycle-consistent Constraints
Yanghao Wang, Zhen Wang, Long Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.20212 was rate-limited (HTTP 429).
[529] The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.26794 was rate-limited (HTTP 429).
[530] RefTon: Reference person shot assist virtual Try-on
Liuzhuozheng Li, Yue Gong, Shanyuan Liu, Bo Cheng, Yuhang Ma, Leibucha Wu, Dengyang Jiang, Zanyi Wang, Dawei Leng, Yuhui Yin
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.00956 was rate-limited (HTTP 429).
[531] Source-Only Cross-Weather LiDAR via Geometry-Aware Point Drop
YoungJae Cheong, Jhonghyun An
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.01250 was rate-limited (HTTP 429).
[532] THEval. Evaluation Framework for Talking Head Video Generation
Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.04520 was rate-limited (HTTP 429).
[533] RISE: Single Static Radar-based Indoor Scene Understanding
Kaichen Zhou, Laura Dodds, Sayed Saad Afzal, Fadel Adib
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.14019 returned HTTP 429 (rate limited).
[534] Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
Cris Claessens, Christiaan Viviers, Giacomo D’Amicantonio, Egor Bondarev, Fons van der Sommen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.17209 returned HTTP 429 (rate limited).
[535] Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera
Mukai Yu, Mosam Dabhi, Liuyue Xie, Sebastian Scherer, László A. Jeni
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.18174 returned HTTP 429 (rate limited).
[536] Jacobian-aware Posterior Sampling for Inverse Problems
Liav Hen, Tom Tirer, Raja Giryes, Shady Abu-Hussein
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.18471 returned HTTP 429 (rate limited).
[537] NeAR: Coupled Neural Asset-Renderer Stack
Hong Li, Chongjie Ye, Houyuan Chen, Weiqing Xiao, Ziyang Yan, Lixing Xiao, Zhaoxi Chen, Jianfeng Xiang, Shaocong Xu, Xuhui Liu, Yikai Wang, Baochang Zhang, Xiaoguang Han, Jiaolong Yang, Hao Zhao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.18600 returned HTTP 429 (rate limited).
[538] ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
M.Naseer Subhani
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.21606 returned HTTP 429 (rate limited).
[539] DriveVGGT: Calibration-Constrained Visual Geometry Transformers for Multi-Camera Autonomous Driving
Xiaosong Jia, Yanhao Liu, Yu Hong, Renqiu Xia, Junqi You, Bin Sun, Zhihui Hao, Junchi Yan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.22264 returned HTTP 429 (rate limited).
[540] AVERY: Intent-Driven Adaptive VLM Split Computing via Embodied Self-Awareness for Efficient Disaster Response Systems
Rajat Bhattacharjya, Sing-Yao Wu, Hyunwoo Oh, Chaewon Nam, Suyeon Koo, Mohsen Imani, Elaheh Bozorgzadeh, Nikil Dutt
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.18151 returned HTTP 429 (rate limited).
[541] RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
Haiyang Mei, Qiming Huang, Hai Ci, Mike Zheng Shou
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.22950 returned HTTP 429 (rate limited).
[542] SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting
Pranav Asthana, Alex Hanson, Allen Tu, Tom Goldstein, Matthias Zwicker, Amitabh Varshney
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.02172 returned HTTP 429 (rate limited).
[543] PowerCLIP: Powerset Alignment for Contrastive Pre-Training
Masaki Kawamura, Nakamasa Inoue, Rintaro Yanagi, Hirokatsu Kataoka, Rio Yokota
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.23170 returned HTTP 429 (rate limited).
[544] NarrativeTrack: Evaluating Entity-Centric Reasoning for Narrative Understanding
Hyeonjeong Ha, Jinjin Ge, Bo Feng, Kaixin Ma, Gargi Chakraborty
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.01095 returned HTTP 429 (rate limited).
[545] PointCNN++: Performant Convolution on Native Points
Lihan Li, Haofeng Zhong, Rui Bu, Mingchao Sun, Wenzheng Chen, Baoquan Chen, Yangyan Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.23227 returned HTTP 429 (rate limited).
[546] AutoRegressive Generation with B-rep Holistic Token Sequence Representation
Jiahao Li, Yunpeng Bai, Yongkang Dai, Hao Guo, Hongping Gan, Yilei Shi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.16771 returned HTTP 429 (rate limited).
[547] Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
Kunwar Maheep Singh, Jianchun Chen, Vladislav Golyanik, Stephan J. Garbin, Thabo Beeler, Rishabh Dabral, Marc Habermann, Christian Theobalt
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.00255 returned HTTP 429 (rate limited).
[548] InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Lang Lin, Ziang Yan, Yali Wang, Yi Wang, Limin Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.01342 returned HTTP 429 (rate limited).
[549] SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains
Qingmei Li, Yang Zhang, Peifeng Zhang, Haohuan Fu, Juepeng Zheng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.02369 returned HTTP 429 (rate limited).
[550] SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
Yu Yuan, Tharindu Wickremasinghe, Zeeshan Nadir, Xijun Wang, Yiheng Chi, Stanley H. Chan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.03350 returned HTTP 429 (rate limited).
[551] LAMP: Language-Assisted Motion Planning for Controllable Video Generation
Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Erkut Erdem, Aykut Erdem, Duygu Ceylan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.03619 returned HTTP 429 (rate limited).
[552] ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction
Jiangtong Tan, Lin Liu, Jie Huanng, Xiaopeng Zhang, Qi Tian, Feng Zhao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.05422 returned HTTP 429 (rate limited).
[553] Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction
Ruihong Yin, Xuepeng Shi, Oleksandr Bailo, Marco Manfredi, Theo Gevers
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.05597 returned HTTP 429 (rate limited).
[554] TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection
Jian-Yu Jiang-Lin, Kang-Yang Huang, Ling Zou, Ling Lo, Sheng-Ping Yang, Yu-Wen Tseng, Kun-Hsiang Lin, Chia-Ling Chen, Yu-Ting Ta, Yan-Tsung Wang, Po-Ching Chen, Hongxia Xie, Hong-Han Shuai, Wen-Huang Cheng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.10652 returned HTTP 429 (rate limited).
[555] E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training
Qitao Zhao, Hao Tan, Qianqian Wang, Sai Bi, Kai Zhang, Kalyan Sunkavalli, Shubham Tulsiani, Hanwen Jiang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.10950 returned HTTP 429 (rate limited).
[556] M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction
Junqiao Fan, Yunjiao Zhou, Yizhuo Yang, Xinyuan Cui, Jiarui Zhang, Lihua Xie, Jianfei Yang, Chris Xiaoxuan Lu, Fangqiang Ding
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.12378 returned HTTP 429 (rate limited).
[557] LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
Tianye Ding, Yiming Xie, Yiqing Liang, Moitreya Chatterjee, Pedro Miraldo, Huaizu Jiang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.13680 returned HTTP 429 (rate limited).
[558] LitePT: Lighter Yet Stronger Point Transformer
Yuanwen Yue, Damien Robert, Jianyuan Wang, Sunghwan Hong, Jan Dirk Wegner, Christian Rupprecht, Konrad Schindler
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.13689 returned HTTP 429 (rate limited).
[559] SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
Jitesh Jain, Jialuo Li, Zixian Ma, Jieyu Zhang, Chris Dongjoo Kim, Sangho Lee, Rohun Tripathi, Tanmay Gupta, Christopher Clark, Humphrey Shi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.13874 returned HTTP 429 (rate limited).
[560] From Unlearning to UNBRANDING: A Benchmark for Trademark-Safe Text-to-Image Generation
Dawid Malarz, Filip Manjak, Maciej Zięba, Przemysław Spurek, Artur Kasymov
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.13953 returned HTTP 429 (rate limited).
[561] Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes
Joseph Hoche, Andrei Bursuc, David Brellmann, Gilles Louppe, Pavel Izmailov, Angela Yao, Gianni Franchi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.14177 returned HTTP 429 (rate limited).
[562] Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting
Arthur Moreau, Richard Shaw, Michal Nazarczuk, Jisu Shin, Thomas Tanay, Zhensong Zhang, Songcen Xu, Eduardo Pérez-Pellitero
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.15508 returned HTTP 429 (rate limited).
[563] OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition
Haochen Chang, Pengfei Ren, Buyuan Zhang, Da Li, Tianhao Han, Haoyang Zhang, Liang Xie, Hongbo Chen, Erwei Yin
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.16727 returned HTTP 429 (rate limited).
[564] The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.19693 returned HTTP 429 (rate limited).
[565] OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
Markus Gross, Sai B. Matha, Aya Fahmy, Rui Song, Daniel Cremers, Henri Meess
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.20770 returned HTTP 429 (rate limited).
[566] Omni-Weather: A Unified Multimodal Model for Weather Radar Understanding and Generation
Zhiwang Zhou, Yuandong Pu, Xuming He, Yidi Liu, Yixin Chen, Junchao Gong, Xiang Zhuang, Wanghan Xu, Qinglong Cao, Shixiang Tang, Yihao Liu, Wenlong Zhang, Lei Bai
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.21643 returned HTTP 429 (rate limited).
[567] Adaptive Attention Distillation for Robust Few-Shot Segmentation under Environmental Perturbations
Qianyu Guo, Jingrong Wu, Jieji Ren, Weifeng Ge, Wenqiang Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.03596 returned HTTP 429 (rate limited).
[568] VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, Yanwei Fu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.05138 returned HTTP 429 (rate limited).
[569] SUG-Occ: Explicit Semantics and Uncertainty Guided Sparse Learning for Efficient 3D Occupancy Prediction
Hanlin Wu, Pengfei Lin, Ehsan Javanmardi, Naren Bao, Bo Qian, Hao Si, Manabu Tsukada
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.11396 returned HTTP 429 (rate limited).
[570] Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers
Xinyu Peng, Han Li, Yuyang Huang, Ziyang Zheng, Yaoming Wang, Xin Chen, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.14959 returned HTTP 429 (rate limited).
[571] APPLE: Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping
Jiwon Kang, Yeji Choi, JoungBin Lee, Wooseok Jang, Jinhyeok Choi, Taekeun Kang, Yongjae Park, Myungin Kim, Seungryong Kim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.15288 returned HTTP 429 (rate limited).
[572] PocketGS: On-Device Training of 3D Gaussian Splatting for High Perceptual Modeling
Wenzhi Guo, Guangchi Fang, Shu Yang, Bing Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.17354 returned HTTP 429 (rate limited).
[573] PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors
Chia-Ming Lee, Yu-Fan Lin, Yu-Jou Hsiao, Jin-Hui Jiang, Yu-Lun Liu, Chih-Chung Hsu
Main category: cs.CV
TL;DR: Paper 2601.17470: Could not fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictions
Method: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2601.17470: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.17470&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[574] ScenePilot-4K: A Large-Scale First-Person Dataset and Benchmark for Vision-Language Models in Autonomous Driving
Yujin Wang, Yutong Zheng, Wenxian Fan, Tianyi Wang, Hongqing Chu, Li Zhang, Bingzhao Gao, Daxin Tian, Jianqiang Wang, Hong Chen
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper content
Method: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2601.19582: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.19582&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[575] GVGS: Gaussian Visibility-Aware Multi-View Geometry for Accurate Surface Reconstruction
Mai Su, Qihan Yu, Zhongtao Wang, Yilong Li, Chengwei Pan, Yisong Chen, Guoping Wang, Fei Zhu
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2601.20331: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20331&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[576] Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
Xiaoxiao Sun, Mingyang Li, Kun Yuan, Min Woo Sun, Mark Endo, Shengguang Wu, Changlin Li, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.22150: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22150&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[577] VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration
Hanxun Yu, Wentong Li, Xuan Qu, Song Wang, Junbo Chen, Jianke Zhu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to retrieval failure
Method: Unable to determine method due to retrieval failure
Result: Unable to determine results due to retrieval failure
Conclusion: Unable to determine conclusion due to retrieval failure
Abstract: Failed to fetch summary for 2601.22674: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22674&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[578] FaceLinkGen: Rethinking Identity Leakage in Privacy-Preserving Face Recognition with Identity Extraction
Wenqi Guo, Shan Du
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper content
Method: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2602.02914: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02914&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[579] FastVMT: Eliminating Redundancy in Video Motion Transfer
Yue Ma, Zhikai Wang, Tianhao Ren, Mingzhe Zheng, Hongyu Liu, Jiayi Guo, Kunyu Feng, Yuxuan Xue, Zixiang Zhao, Konrad Schindler, Qifeng Chen, Linfeng Zhang
Main category: cs.CV
TL;DR: Unable to analyze paper 2602.05551 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation due to missing paper content
Method: Cannot determine method due to missing paper content
Result: Cannot determine results due to missing paper content
Conclusion: Cannot determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2602.05551: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05551&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[580] Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
Haodong Li, Shaoteng Liu, Zhe Lin, Manmohan Chandraker
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper content
Method: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2602.07775: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07775&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[581] HSD: Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding
Wenhui Liao, Hongliang Li, Pengyu Xie, Xinyu Cai, Yufan Shen, Yi Xin, Qi Qin, Shenglong Ye, Tianbin Li, Ming Hu, Junjun He, Yihao Liu, Wenhai Wang, Min Dou, Bin Fu, Botian Shi, Yu Qiao, Lianwen Jin
Main category: cs.CV
TL;DR: Unable to fetch paper details for 2602.12957 due to HTTP 429 error (rate limiting); no content is available for analysis.
Details
Motivation: Cannot determine motivation without access to paper content; arXiv API rate limiting prevents retrieval of the abstract and paper details.
Method: Cannot determine method without access to paper content.
Result: Cannot determine results without access to paper content. The HTTP 429 error indicates the arXiv API is rate limiting requests.
Conclusion: Cannot draw conclusions without access to paper content. The paper exists in the arXiv system but details are currently inaccessible due to API rate limits.
Abstract: Failed to fetch summary for 2602.12957: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12957&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[582] MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
Changlu Guo, Anders Nymark Christensen, Anders Bjorholm Dahl, Morten Rieger Hannemose
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.18792: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18792&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[583] Echoes of ownership: Adversarial-guided dual injection for copyright protection in MLLMs
Chengwei Xia, Fan Ma, Ruijie Quan, Yunqiu Xu, Kun Zhan, Yi Yang
Main category: cs.CV
TL;DR: Failed to fetch summary for paper 2602.18845 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as paper content could not be retrieved
Method: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2602.18845: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18845&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[584] ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization
Minseo Kim, Minchan Kwon, Dongyeun Lee, Yunho Jeon, Junmo Kim
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.19575: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19575&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[585] CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
Marc-Antoine Lavoie, Anas Mahmoud, Aldo Zaimi, Arsene Fansi Tchango, Steven L. Waslander
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access error
Method: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2602.22419: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22419&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[586] DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation
Yichen Peng, Jyun-Ting Song, Siyeol Jung, Ruofan Liu, Haiyang Liu, Xuangeng Chu, Ruicong Liu, Erwin Wu, Hideki Koike, Kris Kitani
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper content
Method: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot determine conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2602.23165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[587] Self-Attention And Beyond the Infinite: Towards Linear Transformers with Infinite Self-Attention
Giorgio Roffo, Hazem Abdelkawy, Nilli Lavie, Luke Palmer
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API request
Method: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to draw conclusions due to failed API request
Abstract: Failed to fetch summary for 2603.00175: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00175&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[588] Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
Wentao Huang, Weimin Lyu, Peiliang Lou, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Wenchao Han, Ruifeng Guo, Jiawei Zhou, Chao Chen, Chen Wang
Main category: cs.CV
TL;DR: Unable to fetch paper details for 2603.00667 due to HTTP 429 error (rate limiting); no abstract or content could be retrieved.
Details
Motivation: Cannot determine motivation without access to paper content; arXiv API rate limiting prevents fetching the abstract.
Method: Unknown; paper content not accessible due to API rate limiting.
Result: No results available - failed to fetch paper summary.
Conclusion: Unable to analyze paper due to technical limitations in accessing arXiv data.
Abstract: Failed to fetch summary for 2603.00667: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00667&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[589] OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
Jianqiang Ren, Lin Liu, Steven Hoi
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to data retrieval failure
Method: Unable to determine method due to data retrieval failure
Result: Unable to determine results due to data retrieval failure
Conclusion: Unable to draw conclusions due to data retrieval failure
Abstract: Failed to fetch summary for 2603.01506: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01506&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[590] Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models
Abin Shoby, Ta Duc Huy, Tuan Dung Nguyen, Minh Khoi Ho, Qi Chen, Anton van den Hengel, Phi Le Nguyen, Johan W. Verjans, Vu Minh Hieu Phan
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to paper fetch failure
Method: Unable to determine method due to paper fetch failure
Result: Unable to determine results due to paper fetch failure
Conclusion: Unable to determine conclusion due to paper fetch failure
Abstract: Failed to fetch summary for 2603.07619: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.07619&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[591] Chain of Event-Centric Causal Thought for Physically Plausible Video Generation
Zixuan Wang, Yixin Hu, Haolan Wang, Feng Chen, Yan Liu, Wen Li, Yinjie Lei
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper content
Method: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2603.09094: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09094&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[592] Training-free Motion Factorization for Compositional Video Generation
Zixuan Wang, Ziqin Zhou, Feng Chen, Duo Peng, Yixin Hu, Changsheng Li, Yinjie Lei
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access error
Method: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to draw conclusions due to access error
Abstract: Failed to fetch summary for 2603.09104: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09104&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[593] OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models
Tengjin Weng, Wenhao Jiang, Jingyi Wang, Ming Li, Lin Ma, Zhong Ming
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2603.09326: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09326&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[594] WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper content
Method: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2603.09921: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09921&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[595] Seeing Isn’t Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs
Nazia Tasnim, Keanu Nichols, Yuting Yang, Nicholas Ikechukwu, Elva Zou, Deepti Ghadiyaram, Bryan A. Plummer
Main category: cs.CV
TL;DR: Paper 2603.11410: Could not fetch abstract due to HTTP 429 error (rate limiting). Relevance assessment cannot be made without content.
Details
Motivation: Unable to determine motivation due to HTTP 429 error preventing access to paper content.
Method: Unknown; paper content not accessible due to rate limiting.
Result: Results unknown - unable to fetch paper summary.
Conclusion: Cannot draw conclusions without access to paper content.
Abstract: Failed to fetch summary for 2603.11410: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11410&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[596] Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
Yura Choi, Roy Miles, Rolandos Alexandros Potamias, Ismail Elezi, Jiankang Deng, Stefanos Zafeiriou
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2603.12533: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.12533&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[597] A Hyperbolic Perspective on Hierarchical Structure in Object-Centric Scene Representations
Neelu Madan, Àlex Pujol, Andreas Møgelmose, Sergio Escalera, Kamal Nasrollahi, Graham W. Taylor, Thomas B. Moeslund
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access error
Method: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2603.14022: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14022&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[598] Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models
Advaith Ravishankar, Serena Liu, Mingyang Wang, Todd Zhou, Jeffrey Zhou, Arnav Sharma, Ziling Hu, Léopold Das, Abdulaziz Sobirov, Faizaan Siddique, Freddy Yu, Seungjoo Baek, Yan Luo, Mengyu Wang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.14186: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14186&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[599] Mind-of-Director: Multi-modal Agent-Driven Film Previsualization via Collaborative Decision-Making
Shufeng Nan, Mengtian Li, Sixiao Zheng, Yuwei Lu, Han Zhang, Yanwei Fu
Main category: cs.CV
TL;DR: Paper 2603.14790: Unable to fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as abstract could not be retrieved
Method: Cannot determine method as abstract could not be retrieved
Result: Cannot determine results as abstract could not be retrieved
Conclusion: Cannot determine conclusion as abstract could not be retrieved
Abstract: Failed to fetch summary for 2603.14790: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14790&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[600] Aligning Multi-Dimensional Preferences via Relevance Feedback: An Effortless and Training-Free Framework for Text-to-Image Diffusion
Wenxi Wang, Hongbin Liu, Mingqian Li, Junyan Yuan, Junqi Zhang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper details
Method: No method information available; arXiv API request resulted in HTTP 429 error
Result: No results available - paper content inaccessible due to API rate limiting
Conclusion: Cannot analyze paper content due to technical limitations with arXiv API access
Abstract: Failed to fetch summary for 2603.14936: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14936&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[601] Synergizing Deep Learning and Biological Heuristics for Extreme Long-Tail White Blood Cell Classification
Duc T. Nguyen, Hoang-Long Nguyen, Huy-Hieu Pham
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.16249: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.16249&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[602] Mixture of Style Experts for Diverse Image Stylization
Shihao Zhu, Ziheng Ouyang, Yijia Kang, Qilong Wang, Mi Zhou, Bo Li, Ming-Ming Cheng, Qibin Hou
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper content
Method: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2603.16649: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.16649&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[603] Adaptive Anchor Policies for Efficient 4D Gaussian Streaming
Ashim Dahal, Rabab Abdelfattah, Nick Rahimi
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting
Method: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2603.17227: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17227&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[604] Toward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis
Rui Hong, Jana Kosecka
Main category: cs.CV
TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching abstract from arXiv
Details
Motivation: Unable to determine motivation as the abstract could not be fetched from arXiv API due to rate limiting (HTTP 429 error)
Method: No method information available due to API rate limiting preventing access to the paper’s abstract
Result: No results available as the paper content could not be retrieved from arXiv
Conclusion: Unable to provide analysis due to technical limitations in accessing the paper’s abstract
Abstract: Failed to fetch summary for 2603.17388: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17388&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[605] Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation
Rui Hong, Jana Kosecka
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2603.17396: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17396&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[606] GenHOI: Generalized Hand-Object Pose Estimation with Occlusion Awareness
Hui Yang, Wei Sun, Jian Liu, Jian Xiao, Tao Xie, Hossein Rahmani, Ajmal Saeed Mian, Nicu Sebe, Gim Hee Lee
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error fetching paper content
Method: Unable to determine method due to technical error fetching paper content
Result: Unable to determine results due to technical error fetching paper content
Conclusion: Unable to draw conclusions due to technical error fetching paper content
Abstract: Failed to fetch summary for 2603.19013: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19013&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[607] EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
Haolan Xu, Keli Cheng, Lei Wang, Ning Bi, Xiaoming Liu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content
Method: Unable to determine method due to API rate limiting preventing access to paper content
Result: Unable to determine results due to API rate limiting preventing access to paper content
Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper content
Abstract: Failed to fetch summary for 2603.21332: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21332&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[608] FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction
Yuqiu Liu, Jialin Song, Marissa Ramirez de Chanlatte, Rochishnu Chowdhury, Rushil Paresh Desai, Wuyang Chen, Daniel Martin, Michael W. Mahoney
Main category: cs.CV
TL;DR: Paper 2603.21356 summary unavailable due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as abstract content is not accessible
Method: Unknown; paper content not retrievable
Result: No results available due to access error
Conclusion: Cannot draw conclusions without paper content
Abstract: Failed to fetch summary for 2603.21356: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21356&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[609] Anatomical Token Uncertainty for Transformer-Guided Active MRI Acquisition
Lev Ayzenberg, Shady Abu-Hussein, Raja Giryes, Hayit Greenspan
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access error
Method: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to draw conclusions due to access error
Abstract: Failed to fetch summary for 2603.21806: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21806&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[610] Thermal Topology Collapse: Universal Physical Patch Attacks on Infrared Vision Systems
Chengyin Hu, Yikun Guo, Yuxian Dong, Qike Zhang, Kalibinuer Tiliwalidi, Yiwei Wei, Haitao Shi, Jiujiang Guo, Jiahuan Long, Xiang Chen
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.21876: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21876&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[611] DTVI: Dual-Stage Textual and Visual Intervention for Safe Text-to-Image Generation
Binhong Tan, Zhaoxin Wang, Handing Wang
Main category: cs.CV
TL;DR: Paper ID 2603.22041 summary could not be fetched due to HTTP 429 (rate limiting) error from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper details
Method: Unknown; paper content not accessible due to technical limitations
Result: No results available - failed to retrieve paper information
Conclusion: Cannot analyze paper due to API access restrictions
Abstract: Failed to fetch summary for 2603.22041: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22041&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[612] FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation
Wuyang Luo, Chengkai Tan, Chang Ge, Binye Hong, Su Yang, Yongjiu Ma
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access error
Method: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2603.22054: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22054&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[613] A$^3$: Towards Advertising Aesthetic Assessment
Kaiyuan Ji, Yixuan Gao, Lu Sun, Yushuo Zheng, Zijian Chen, Jianbo Zhang, Xiangyang Zhu, Yuan Tian, Zicheng Zhang, Guangtao Zhai
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.24037 returned HTTP 429 (rate limited).
[614] Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning
Tommaso Galliena, Stefano Rosa, Tommaso Apicella, Pietro Morerio, Alessio Del Bue, Lorenzo Natale
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.24257 returned HTTP 429 (rate limited).
[615] POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan
Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kumar Das, Monorama Swain, Yufang Hou, Elisabeth Andre, Khalid Mahmood Malik, Markus Schedl, Shah Nawaz
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.24569 returned HTTP 429 (rate limited).
[616] BCMDA: Bidirectional Correlation Maps Domain Adaptation for Mixed Domain Semi-Supervised Medical Image Segmentation
Bentao Song, Jun Huang, Qingfeng Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.24691 returned HTTP 429 (rate limited).
[617] TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval
David G. Shatwell, Sirnam Swetha, Mubarak Shah
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.24749 returned HTTP 429 (rate limited).
[618] WAFT-Stereo: Warping-Alone Field Transforms for Stereo Matching
Yihan Wang, Jia Deng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.24836 returned HTTP 429 (rate limited).
[619] Self-Supervised Learning for Knee Osteoarthritis: Diagnostic Limitations and Prognostic Value of Hospital Data
Haresh Rengaraj Rajamohan, Yuxuan Chen, Kyunghyun Cho, Cem M. Deniz
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.24903 returned HTTP 429 (rate limited).
[620] MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models
Dohwan Ko, Jinyoung Park, Seoung Choi, Sanghyeok Lee, Seohyun Lee, Hyunwoo J. Kim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.24984 returned HTTP 429 (rate limited).
[621] GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator
Liyuan Zhu, Manjunath Narayana, Michal Stary, Will Hutchcroft, Gordon Wetzstein, Iro Armeni
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.25053 returned HTTP 429 (rate limited).
[622] Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training
Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chenwei Xie, Chongyang Zhong, Kai Zhu, Tong Shen, Lianghua Huang, Yu Liu, Yujiu Yang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.25706 returned HTTP 429 (rate limited).
[623] AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation
Chen Si, Yulin Liu, Bo Ai, Jianwen Xie, Rolandos Alexandros Potamias, Chuanxia Zheng, Hao Su
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.25726 returned HTTP 429 (rate limited).
[624] PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery
Elkhan Ismayilzada, Yufei Zhang, Zijun Cui
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.26068 returned HTTP 429 (rate limited).
[625] MeshSplats: Mesh-Based Rendering with Gaussian Splatting Initialization
Rafał Tobiasz, Grzegorz Wilczyński, Marcin Mazur, Sławomir Tadeja, Weronika Smolak-Dyżewska, Przemysław Spurek
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2502.07754 returned HTTP 429 (rate limited).
[626] 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks
Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2505.05800 returned HTTP 429 (rate limited).
[627] SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping
Allen Tu, Haiyang Ying, Alex Hanson, Yonghan Lee, Tom Goldstein, Matthias Zwicker
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.07917 returned HTTP 429 (rate limited).
[628] AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis
Xiaofei Wu, Yi Zhang, Yumeng Liu, Yuexin Ma, Yujiao Shi, Xuming He
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.08021 returned HTTP 429 (rate limited).
[629] R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation
Yuhao Zhang, Wanxi Dong, Yue Shi, Yi Liang, Jingnan Gao, Qiaochu Yang, Yaxing Lyu, Zhixuan Liang, Yibin Liu, Congsheng Xu, Xianda Guo, Wei Sui, Yaohui Jin, Xiaokang Yang, Yanyan Xu, Yao Mu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.14498 returned HTTP 429 (rate limited).
cs.AI
[630] Bitboard version of Tetris AI
Xingguo Chen, Pingshou Xiong, Zhenyu Luo, Mengfei Hu, Xinwen Li, Yongzhou Lü, Guang Yang, Chao Li, Shangdong Yang
Main category: cs.AI
TL;DR: High-performance Tetris AI framework using bitboard optimization and improved RL algorithms for faster simulation and better training efficiency
Details
Motivation: Existing Tetris implementations have low simulation speeds, suboptimal state evaluation, and inefficient training paradigms, limiting their utility for large-scale RL research
Method: 1) Bitboard representations for game board and tetrominoes using bitwise operations for 53x speedup; 2) Afterstate-evaluating actor network for simplified state value estimation; 3) Buffer-optimized PPO algorithm for efficient sampling and updates
Result: Achieved 53-fold speedup compared to OpenAI Gym-Tetris, average score of 3,829 on 10x10 grids within 3 minutes, and developed Python-Java interface compliant with OpenAI Gym standard
Conclusion: The framework enhances Tetris’s utility as an RL benchmark by bridging low-level bitboard optimizations with high-level AI strategies, providing a sample-efficient and computationally lightweight solution for scalable sequential decision-making research
Abstract: The efficiency of game engines and policy optimization algorithms is crucial for training reinforcement learning (RL) agents in complex sequential decision-making tasks, such as Tetris. Existing Tetris implementations suffer from low simulation speeds, suboptimal state evaluation, and inefficient training paradigms, limiting their utility for large-scale RL research. To address these limitations, this paper proposes a high-performance Tetris AI framework based on bitboard optimization and improved RL algorithms. First, we redesign the Tetris game board and tetrominoes using bitboard representations, leveraging bitwise operations to accelerate core processes (e.g., collision detection, line clearing, and Dellacherie-Thiery Features extraction) and achieve a 53-fold speedup compared to OpenAI Gym-Tetris. Second, we introduce an afterstate-evaluating actor network that simplifies state value estimation by leveraging Tetris afterstate property, outperforming traditional action-value networks with fewer parameters. Third, we propose a buffer-optimized Proximal Policy Optimization (PPO) algorithm that balances sampling and update efficiency, achieving an average score of 3,829 on 10x10 grids within 3 minutes. Additionally, we develop a Python-Java interface compliant with the OpenAI Gym standard, enabling seamless integration with modern RL frameworks. Experimental results demonstrate that our framework enhances Tetris’s utility as an RL benchmark by bridging low-level bitboard optimizations with high-level AI strategies, providing a sample-efficient and computationally lightweight solution for scalable sequential decision-making research.
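The bitboard idea in the abstract is concrete enough to sketch. A minimal Python illustration (function names and board layout are hypothetical, not from the paper's codebase) of how encoding each row as a 10-bit integer turns collision detection and line clearing into a handful of bitwise operations:

```python
# Hypothetical bitboard sketch: each row of a 10-wide Tetris board is one
# integer whose bits mark occupied cells, so the core operations become
# bitwise AND/OR instead of per-cell loops.

WIDTH = 10
FULL_ROW = (1 << WIDTH) - 1  # 0b1111111111, a completely filled row

def collides(board_rows, piece_rows, top):
    """A piece collides if any of its rows shares a set bit with the board."""
    return any(board_rows[top + i] & r for i, r in enumerate(piece_rows))

def place(board_rows, piece_rows, top):
    """OR the piece's bits into the board (caller checks collides first)."""
    rows = list(board_rows)
    for i, r in enumerate(piece_rows):
        rows[top + i] |= r
    return rows

def clear_lines(board_rows):
    """Drop full rows, pad empty rows on top; return (new rows, count)."""
    kept = [r for r in board_rows if r != FULL_ROW]
    cleared = len(board_rows) - len(kept)
    return [0] * cleared + kept, cleared
```

The same row-as-integer trick extends to the feature extraction the paper mentions (e.g. counting holes via shifted ANDs), which is where the bulk of the reported speedup would come from.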
[631] Multiverse: Language-Conditioned Multi-Game Level Blending via Shared Representation
In-Chang Baek, Jiyun Jung, Sung-Hyun Kim, Geum-Hwan Hwang, Kyung-Joong Kim
Main category: cs.AI
TL;DR: Multiverse: A language-conditioned multi-game level generator that enables cross-game level blending through textual specifications by learning a shared latent space aligning text and level structures.
Details
Motivation: Prior text-to-level generators are limited to single game domains, lacking the ability to extend language-conditioned generation to multiple games and enable cross-game level blending through textual specifications.
Method: Learns a shared latent space aligning textual instructions and level structures using threshold-based multi-positive contrastive supervision that links semantically related levels across games, enabling controllable blending through latent interpolation.
Result: The learned representation supports controllable cross-game level blending, significantly improves blending quality within the same game genre, and provides unified representation for language-conditioned multi-game content generation.
Conclusion: Multiverse enables cross-game level blending through textual specifications, advancing multi-game language-conditioned content generation with controllable structural preservation across domains.
Abstract: Text-to-level generation aims to translate natural language descriptions into structured game levels, enabling intuitive control over procedural content generation. While prior text-to-level generators are typically limited to a single game domain, extending language-conditioned generation to multiple games requires learning representations that capture structural relationships across domains. We propose Multiverse, a language-conditioned multi-game level generator that enables cross-game level blending through textual specifications. The model learns a shared latent space aligning textual instructions and level structures, while a threshold-based multi-positive contrastive supervision links semantically related levels across games. This representation allows language to guide which structural characteristics should be preserved when combining content from different games, enabling controllable blending through latent interpolation and zero-shot generation from compositional textual prompts. Experiments show that the learned representation supports controllable cross-game level blending and significantly improves blending quality within the same game genre, while providing a unified representation for language-conditioned multi-game content generation.
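The "threshold-based multi-positive contrastive supervision" can be sketched abstractly. A toy Python version (names and the 0.8 threshold are illustrative assumptions; the paper's actual loss may differ) in which every candidate whose similarity to the anchor exceeds a threshold counts as a positive, and the loss averages InfoNCE terms over all positives:

```python
import math

def positives_by_threshold(sim_row, anchor_idx, tau=0.8):
    """Mark every candidate whose similarity to the anchor exceeds tau
    as a positive (e.g. semantically related levels from other games)."""
    return [j for j, s in enumerate(sim_row) if j != anchor_idx and s >= tau]

def multi_positive_nce(sim_row, positive_idx, temperature=0.1):
    """Average InfoNCE loss over all positives of one anchor:
    mean over positives j of  log Z - sim[j]/T."""
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return sum(log_z - logits[j] for j in positive_idx) / len(positive_idx)
```

Pulling multiple cross-game positives toward the same anchor is what would give the shared latent space its blending-friendly geometry.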
[632] Concerning Uncertainty – A Systematic Survey of Uncertainty-Aware XAI
Helena Löfström, Tuwe Löfström, Anders Hjort, Fatima Rabia Yapicioglu
Main category: cs.AI
TL;DR: Survey paper on uncertainty-aware explainable AI (UAXAI) examining how uncertainty quantification methods are integrated into explanatory pipelines and evaluation practices.
Details
Motivation: To provide a comprehensive overview of how uncertainty is incorporated into explainable AI systems, identify recurring approaches and evaluation gaps, and propose unified evaluation principles.
Method: Literature survey analyzing existing UAXAI approaches, categorizing uncertainty quantification methods (Bayesian, Monte Carlo, Conformal), integration strategies, and evaluation practices.
Result: Identified three main uncertainty quantification approaches, distinct integration strategies (trustworthiness assessment, model/explanation constraints, explicit uncertainty communication), and fragmented evaluation practices with limited user focus.
Conclusion: Progress in UAXAI requires unified evaluation principles linking uncertainty propagation, robustness, and human decision-making, with counterfactual and calibration approaches as promising directions.
Abstract: This paper surveys uncertainty-aware explainable artificial intelligence (UAXAI), examining how uncertainty is incorporated into explanatory pipelines and how such methods are evaluated. Across the literature, three recurring approaches to uncertainty quantification emerge (Bayesian, Monte Carlo, and Conformal methods), alongside distinct strategies for integrating uncertainty into explanations: assessing trustworthiness, constraining models or explanations, and explicitly communicating uncertainty. Evaluation practices remain fragmented and largely model-centered, with limited attention to users and inconsistent reporting of reliability properties (e.g., calibration, coverage, explanation stability). Recent work leans towards calibration and distribution-free techniques and recognizes explainer variability as a central concern. We argue that progress in UAXAI requires unified evaluation principles that link uncertainty propagation, robustness, and human decision-making, and highlight counterfactual and calibration approaches as promising avenues for aligning interpretability with reliability.
[633] Neuro-Symbolic Learning for Predictive Process Monitoring via Two-Stage Logic Tensor Networks with Rule Pruning
Fabrizio De Santis, Gyunam Park, Francesco Zanichelli
Main category: cs.AI
TL;DR: Neuro-symbolic approach integrating domain knowledge as differentiable logical constraints using Logic Tensor Networks for sequential event prediction, with two-stage optimization to balance logical satisfaction and predictive accuracy.
Details
Motivation: Existing data-driven approaches for sequential event prediction fail to incorporate domain-specific sequential constraints and logical rules governing event relationships, limiting accuracy and regulatory compliance in applications like fraud detection and healthcare monitoring.
Method: Integrates domain knowledge as differentiable logical constraints using Logic Tensor Networks (LTNs), formalizing control-flow, temporal, and payload knowledge using Linear Temporal Logic and first-order logic. Uses two-stage optimization: weighted axiom loss during pretraining to prioritize data learning, followed by rule pruning that retains only consistent, contributive axioms based on satisfaction dynamics.
Result: Evaluation on four real-world event logs shows domain knowledge injection significantly improves predictive performance, with two-stage optimization proving essential (without it, knowledge can severely degrade performance). Excels particularly in compliance-constrained scenarios with limited compliant training examples, achieving superior performance compared to purely data-driven baselines while ensuring adherence to domain constraints.
Conclusion: Neuro-symbolic integration of domain knowledge through logical constraints and careful optimization strategy enables more accurate and compliant sequential event prediction, especially valuable in regulated domains with limited compliant training data.
Abstract: Predictive modeling on sequential event data is critical for fraud detection and healthcare monitoring. Existing data-driven approaches learn correlations from historical data but fail to incorporate domain-specific sequential constraints and logical rules governing event relationships, limiting accuracy and regulatory compliance. For example, healthcare procedures must follow specific sequences, and financial transactions must adhere to compliance rules. We present a neuro-symbolic approach integrating domain knowledge as differentiable logical constraints using Logic Tensor Networks (LTNs). We formalize control-flow, temporal, and payload knowledge using Linear Temporal Logic and first-order logic. Our key contribution is a two-stage optimization strategy addressing LTNs’ tendency to satisfy logical formulas at the expense of predictive accuracy. The approach uses weighted axiom loss during pretraining to prioritize data learning, followed by rule pruning that retains only consistent, contributive axioms based on satisfaction dynamics. Evaluation on four real-world event logs shows that domain knowledge injection significantly improves predictive performance, with the two-stage optimization proving essential (without it, knowledge can severely degrade performance). The approach excels particularly in compliance-constrained scenarios with limited compliant training examples, achieving superior performance compared to purely data-driven baselines while ensuring adherence to domain constraints.
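The two-stage strategy (weighted axiom loss during pretraining, then pruning by satisfaction dynamics) can be sketched in a few lines. Function names and the 0.8 satisfaction cutoff below are illustrative assumptions, not values from the paper:

```python
def combined_loss(task_loss, axiom_sats, axiom_weights):
    """Stage 1 (pretraining) objective: task loss plus the weighted
    dissatisfaction (1 - satisfaction degree) of each logical axiom.
    Small axiom weights let data learning dominate early on."""
    return task_loss + sum(w * (1.0 - s)
                           for w, s in zip(axiom_weights, axiom_sats))

def prune_axioms(sat_history, min_mean_sat=0.8):
    """Stage 2: keep only the axioms whose satisfaction stayed
    consistently high during pretraining, i.e. the consistent,
    contributive rules; the rest are dropped before fine-tuning."""
    return [i for i, hist in enumerate(sat_history)
            if sum(hist) / len(hist) >= min_mean_sat]
```

The point of the split is exactly the failure mode the abstract names: optimizing logical satisfaction jointly from the start can drag predictive accuracy down, so data learning is prioritized first and only well-behaved axioms survive into the second stage.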
[634] Compliance-Aware Predictive Process Monitoring: A Neuro-Symbolic Approach
Fabrizio De Santis, Gyunam Park, Wil M. P. van der Aalst
Main category: cs.AI
TL;DR: Neuro-symbolic approach for predictive process monitoring using Logic Tensor Networks to inject domain knowledge into predictive models, improving both compliance and accuracy.
Details
Motivation: Existing predictive process monitoring approaches are purely data-driven and fail to incorporate domain-specific process constraints, limiting compliance adherence and prediction accuracy.
Method: Proposes a neuro-symbolic approach using Logic Tensor Networks (LTNs) with a four-stage pipeline: feature extraction, rule extraction, knowledge base creation, and knowledge injection into predictive models.
Result: The neuro-symbolic model learns process constraints while achieving better performance, demonstrating higher compliance and improved accuracy compared to baseline approaches in compliance-aware experiments.
Conclusion: Neuro-symbolic approaches using LTNs effectively incorporate domain knowledge into predictive process monitoring, enhancing both compliance adherence and prediction accuracy.
Abstract: Existing approaches for predictive process monitoring are sub-symbolic, meaning that they learn correlations between descriptive features and a target feature fully based on data, e.g., predicting the surgical needs of a patient based on historical events and biometrics. However, such approaches fail to incorporate domain-specific process constraints (knowledge), e.g., surgery can only be planned if the patient was released more than a week ago, limiting the adherence to compliance and providing less accurate predictions. In this paper, we present a neuro-symbolic approach for predictive process monitoring, leveraging Logic Tensor Networks (LTNs) to inject process knowledge into predictive models. The proposed approach follows a structured pipeline consisting of four key stages: 1) feature extraction; 2) rule extraction; 3) knowledge base creation; and 4) knowledge injection. Our evaluation shows that, in addition to learning the process constraints, the neuro-symbolic model also achieves better performance, demonstrating higher compliance and improved accuracy compared to baseline approaches across all compliance-aware experiments.
[635] Transparency as Architecture: Structural Compliance Gaps in EU AI Act Article 50 II
Vera Schmitt, Niklas Kruse, Premtim Sahitaj, Julius Schöning
Main category: cs.AI
TL;DR: The paper analyzes compliance challenges with the EU AI Act’s dual transparency requirements for AI-generated content, showing current generative AI systems cannot meet these requirements through post-hoc labeling alone.
Details
Motivation: The EU AI Act requires AI-generated content to have both human-understandable and machine-readable labels for automated verification, but current generative AI systems face fundamental constraints in meeting these requirements.
Method: The paper uses synthetic data generation and automated fact-checking as diagnostic use cases to analyze compliance challenges, examining provenance tracking issues, watermarking paradoxes, and structural gaps in current systems.
Result: The analysis reveals three structural gaps: (1) absent cross-platform marking formats for interleaved human-AI outputs, (2) misalignment between regulatory ‘reliability’ criteria and probabilistic model behavior, and (3) missing guidance for adapting disclosures to heterogeneous user expertise.
Conclusion: Compliance requires treating transparency as an architectural design requirement rather than post-hoc labeling, demanding interdisciplinary research across legal semantics, AI engineering, and human-centered design.
Abstract: Art. 50 II of the EU Artificial Intelligence Act mandates dual transparency for AI-generated content: outputs must be labeled in both human-understandable and machine-readable form for automated verification. This requirement, entering into force in August 2026, collides with fundamental constraints of current generative AI systems. Using synthetic data generation and automated fact-checking as diagnostic use cases, we show that compliance cannot be reduced to post-hoc labeling. In fact-checking pipelines, provenance tracking is not feasible under iterative editorial workflows and non-deterministic LLM outputs; moreover, the assistive-function exemption does not apply, as such systems actively assign truth values rather than supporting editorial presentation. In synthetic data generation, persistent dual-mode marking is paradoxical: watermarks surviving human inspection risk being learned as spurious features during training, while marks suited for machine verification are fragile under standard data processing. Across both domains, three structural gaps obstruct compliance: (a) absent cross-platform marking formats for interleaved human-AI outputs; (b) misalignment between the regulation’s ‘reliability’ criterion and probabilistic model behavior; and (c) missing guidance for adapting disclosures to heterogeneous user expertise. Closing these gaps requires transparency to be treated as an architectural design requirement, demanding interdisciplinary research across legal semantics, AI engineering, and human-centered design.
[636] FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?
Nikil Ravi, Kexing Ying, Vasilii Nesterov, Rayan Krishnan, Elif Uskuplu, Bingyu Xia, Janitha Aswedige, Langston Nashold
Main category: cs.AI
TL;DR: FormalProofBench is a benchmark for evaluating AI models’ ability to generate formally verified mathematical proofs using Lean 4, focusing on graduate-level mathematics across various domains.
Details
Motivation: To create a rigorous evaluation framework for assessing AI models' formal theorem-proving capabilities at advanced mathematical levels, moving beyond simple proof generation to formally verified proofs that must pass Lean 4 verification.
Method: Developed a private benchmark with natural-language problems paired with Lean 4 formal statements, drawn from qualifying exams and textbooks across analysis, algebra, probability, and logic. Evaluated frontier models using an agentic harness to generate Lean proofs that must be accepted by the Lean 4 checker.
Result: Best-performing foundation model achieved 33.5% accuracy, with performance dropping rapidly for other models. The paper provides empirical analysis of tool-use, failure modes, cost, and latency in addition to accuracy metrics.
Conclusion: FormalProofBench provides a comprehensive evaluation framework for formal theorem-proving capabilities of AI models, revealing current limitations in generating formally verified proofs for advanced mathematics while establishing baseline performance metrics.
Abstract: We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean 4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced undergraduate and graduate mathematics, with problems drawn from qualifying exams and standard textbooks across topics including analysis, algebra, probability, and logic. We evaluate a range of frontier models with an agentic harness, and find that the best-performing foundation model achieves 33.5% accuracy, with performance dropping rapidly after that. In addition to the accuracy numbers, we also provide empirical analysis of tool-use, failure modes, cost and latency, thereby providing a thorough evaluation of the formal-theorem proving abilities of frontier models.
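For readers unfamiliar with the task format, a toy example of a formal statement and machine-checkable proof in Lean 4 (far simpler than the benchmark's graduate-level problems, and not drawn from it):

```lean
-- A statement the Lean 4 checker can verify mechanically: the proof
-- term must elaborate and type-check, or the submission is rejected.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The binary accept/reject signal from the checker is what makes the benchmark's accuracy metric objective: a proof either compiles or it does not.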
[637] When Verification Hurts: Asymmetric Effects of Multi-Agent Feedback in Logic Proof Tutoring
Tahreem Yasir, Sutapa Dey Tithi, Benyamin Tabarsi, Dmitri Droujkov, Sam Gilson, Yasitha Rajapaksha, Xiaoyi Tian, Arun Ramesh, DongKuan Xu, Tiffany Barnes
Main category: cs.AI
TL;DR: A study on LLMs for automated tutoring in propositional logic proofs reveals verification helps when feedback is error-prone but harms performance when feedback is already reliable, with all models failing on high-complexity proofs.
Details
Motivation: To assess the reliability of LLMs in structured symbolic domains like automated tutoring, specifically for propositional logic proofs where precise symbolic reasoning aligned with a learner's current proof state is required.
Method: Created a knowledge-graph-grounded benchmark of 516 unique proof states with step-level annotations and difficulty metrics. Evaluated three role-specialized pipelines with varying solution access: Tutor (partial solution access), Teacher (full derivation access), and Judge (verification of Tutor feedback).
Result: Verification improves outcomes when upstream feedback is error-prone (<70% accuracy), but degrades performance by 4-6 percentage points through over-specification when feedback is already reliable (>85%). All models share a complexity ceiling, failing on proof states exceeding complexity 4-5.
Conclusion: Adding verifiers or richer context doesn’t universally improve tutoring; instead, adaptive, difficulty-aware architectures are needed that route problems by estimated complexity and upstream reliability.
Abstract: Large language models (LLMs) are increasingly used for automated tutoring, but their reliability in structured symbolic domains remains unclear. We study step-level feedback for propositional logic proofs, which require precise symbolic reasoning aligned with a learner’s current proof state. We introduce a knowledge-graph-grounded benchmark of 516 unique proof states with step-level annotations and difficulty metrics. Unlike prior tutoring evaluations that rely on model self-assessment or binary correctness, our framework enables fine-grained analysis of feedback quality against verified solution paths. We evaluate three role-specialized pipelines with varying solution access: Tutor (partial solution access), Teacher (full derivation access), and Judge (verification of Tutor feedback). Our results reveal a striking asymmetry: verification improves outcomes when upstream feedback is error-prone (<70% accuracy), but degrades performance by 4-6 percentage points through over-specification when feedback is already reliable (>85%). Critically, we identify a shared complexity ceiling; no model or pipeline reliably succeeds on proof states exceeding complexity 4-5. These findings challenge the assumption that adding verifiers or richer context universally improves tutoring, motivating adaptive, difficulty-aware architectures that route problems by estimated complexity and upstream reliability.
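The adaptive routing the conclusion calls for can be pictured with a toy dispatcher. This is a hypothetical sketch, not the paper's implementation: the thresholds come from the reported numbers, while the function and pipeline names are invented for illustration.

```python
# Hypothetical router over the paper's reported findings; names are invented.

def route_feedback(complexity: int, upstream_accuracy: float) -> str:
    """Pick a tutoring pipeline for a proof state.

    - No model or pipeline reliably succeeded beyond complexity 4-5,
      so high-complexity states are escalated.
    - Verification (the Judge) helped only when upstream feedback was
      error-prone (<70% accuracy); above ~85% it cost 4-6 points.
    """
    if complexity > 4:
        return "escalate"            # shared complexity ceiling
    if upstream_accuracy < 0.70:
        return "tutor_plus_judge"    # verification pays off here
    return "tutor_only"              # skip the verifier: it over-specifies

print(route_feedback(6, 0.90))   # escalate
print(route_feedback(3, 0.60))   # tutor_plus_judge
```

The point of the sketch is the asymmetry: the same Judge stage that rescues an unreliable Tutor actively hurts a reliable one.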
[638] The Price of Meaning: Why Every Semantic Memory System Forgets
Sambartha Ray Barman, Andrey Starenky, Sofia Bodnar, Nikhil Narasimhan, Ashwin Gopinath
Main category: cs.AI
TL;DR: The paper proves that semantic memory systems face an inherent tradeoff: the geometric structure enabling semantic generalization inevitably causes interference, forgetting, and false recall.
Details
Motivation: Current AI memory systems organize information by meaning to enable generalization and analogy, but this semantic organization comes with inherent limitations that need to be formally understood.
Method: The authors formalize the tradeoff for semantically continuous kernel-threshold memories and derive four theoretical results about effective rank, competitor mass, retention decay, and false recall. They test predictions across five architectures including vector retrieval, graph memory, attention-based context, BM25 filesystem retrieval, and parametric memory.
Result: Theoretical predictions hold across tested architectures: pure semantic systems show forgetting and false recall, reasoning-augmented systems convert graceful degradation to catastrophic failure, and systems avoiding interference sacrifice semantic generalization.
Conclusion: The price of meaning is interference - no tested architecture avoids this fundamental tradeoff between semantic generalization and memory interference.
Abstract: Every major AI memory system in production today organises information by meaning. That organisation enables generalisation, analogy, and conceptual retrieval – but it comes at a price. We prove that the same geometric structure enabling semantic generalisation makes interference, forgetting, and false recall inescapable. We formalise this tradeoff for semantically continuous kernel-threshold memories: systems whose retrieval score is a monotone function of an inner product in a semantic feature space with finite local intrinsic dimension. Within this class we derive four results: (1) semantically useful representations have finite effective rank; (2) finite local dimension implies positive competitor mass in retrieval neighbourhoods; (3) under growing memory, retention decays to zero, yielding power-law forgetting curves under power-law arrival statistics; (4) for associative lures satisfying a δ-convexity condition, false recall cannot be eliminated by threshold tuning. We test these predictions across five architectures: vector retrieval, graph memory, attention-based context, BM25 filesystem retrieval, and parametric memory. Pure semantic systems express the vulnerability directly as forgetting and false recall. Reasoning-augmented systems partially override these symptoms but convert graceful degradation into catastrophic failure. Systems that escape interference entirely do so by sacrificing semantic generalisation. The price of meaning is interference, and no architecture we tested avoids paying it.
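The kernel-threshold memory class the abstract formalizes is easy to picture with a toy inner-product store. A minimal sketch, with all sizes and thresholds invented (this is not the authors' code):

```python
import numpy as np

# Toy member of the "kernel-threshold memory" class: the retrieval
# score is an inner product in a semantic feature space, and items
# scoring above a threshold are recalled. Sizes/thresholds are invented.

rng = np.random.default_rng(0)
dim, n_items = 32, 500
memory = rng.normal(size=(n_items, dim))
memory /= np.linalg.norm(memory, axis=1, keepdims=True)

def recall(query, threshold=0.3):
    scores = memory @ query            # monotone in the inner product
    return np.flatnonzero(scores > threshold)

# A query near item 0 recalls it together with semantic neighbours:
# that "competitor mass" is the interference the paper argues is
# unavoidable once the space has finite local intrinsic dimension.
query = memory[0] + 0.1 * rng.normal(size=dim)
query /= np.linalg.norm(query)
hits = recall(query)
print(0 in hits, len(hits))
```

Raising the threshold suppresses competitors but also drops the target as memories drift, which is the retention/false-recall tension the four results make precise.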
[639] MediHive: A Decentralized Agent Collective for Medical Reasoning
Xiaoyang Wang, Christopher C. Yang
Main category: cs.AI
TL;DR: MediHive: A decentralized multi-agent framework using LLMs for medical QA with autonomous role assignment, evidence-based debates, and iterative fusion to achieve consensus, outperforming centralized approaches.
Details
Motivation: Single-agent LLMs struggle with complex interdisciplinary medical problems requiring uncertainty handling and conflicting evidence resolution. Centralized multi-agent systems have scalability bottlenecks, single points of failure, and role confusion issues in resource-constrained environments.
Method: Decentralized multi-agent framework with LLM-based agents that autonomously self-assign specialized roles, conduct initial analyses, detect divergences through conditional evidence-based debates, and locally fuse peer insights over multiple rounds using a shared memory pool with iterative fusion mechanisms.
Result: Outperforms single-LLM and centralized baselines on MedQA (84.3% accuracy) and PubMedQA (78.4% accuracy) datasets, demonstrating superior performance in reasoning-intensive medical tasks.
Conclusion: MediHive advances scalable, fault-tolerant decentralized multi-agent systems for medical AI, addressing key limitations of centralized designs while showing superior performance in complex medical reasoning tasks.
Abstract: Large language models (LLMs) have revolutionized medical reasoning tasks, yet single-agent systems often falter on complex, interdisciplinary problems requiring robust handling of uncertainty and conflicting evidence. Multi-agent systems (MAS) leveraging LLMs enable collaborative intelligence, but prevailing centralized architectures suffer from scalability bottlenecks, single points of failure, and role confusion in resource-constrained environments. Decentralized MAS (D-MAS) promise enhanced autonomy and resilience via peer-to-peer interactions, but their application to high-stakes healthcare domains remains underexplored. We introduce MediHive, a novel decentralized multi-agent framework for medical question answering that integrates a shared memory pool with iterative fusion mechanisms. MediHive deploys LLM-based agents that autonomously self-assign specialized roles, conduct initial analyses, detect divergences through conditional evidence-based debates, and locally fuse peer insights over multiple rounds to achieve consensus. Empirically, MediHive outperforms single-LLM and centralized baselines on MedQA and PubMedQA datasets, attaining accuracies of 84.3% and 78.4%, respectively. Our work advances scalable, fault-tolerant D-MAS for medical AI, addressing key limitations of centralized designs while demonstrating superior performance in reasoning-intensive tasks.
[640] EpochX: Building the Infrastructure for an Emergent Agent Civilization
Huacan Wang, Chaofa Yuan, Xialie Zhuang, Tu Hu, Shuo Zhang, Jun Han, Shi Wei, Daiqiang Li, Jingping Liu, Kunyi Wang, Zihan Yin, Zhenheng Tang, Andy Wang, Henry Peng Zou, Philip S. Yu, Sen Hu, Qizhen Lan, Ronghao Chen
Main category: cs.AI
TL;DR: EpochX is a credits-native marketplace infrastructure for human-agent production networks that treats humans and agents as peer participants, enabling task delegation, verification, and reward systems while generating reusable ecosystem assets.
Details
Motivation: As foundation models make broad task execution and tool use increasingly accessible, the binding constraint shifts from raw capability to how work is delegated, verified, and rewarded at scale. The paper aims to address the organizational design problem of building infrastructures for durable human-agent collaboration.
Method: EpochX introduces a marketplace infrastructure where humans and agents can post or claim tasks, decompose them into subtasks, and execute through explicit delivery workflows with verification and acceptance. It features a native credit mechanism for economic viability, and stores reusable ecosystem assets (skills, workflows, execution traces, distilled experience) with explicit dependency structure.
Result: The system formalizes an end-to-end transaction model with asset and incentive layers, creating infrastructure where verifiable work leaves persistent, reusable artifacts and value flows support durable human-agent collaboration.
Conclusion: EpochX reframes agentic AI as an organizational design problem, providing infrastructure for human-agent production networks that enables cumulative improvement through reusable assets and sustainable economic participation via credit mechanisms.
Abstract: General-purpose technologies reshape economies less by improving individual tools than by enabling new ways to organize production and coordination. We believe AI agents are approaching a similar inflection point: as foundation models make broad task execution and tool use increasingly accessible, the binding constraint shifts from raw capability to how work is delegated, verified, and rewarded at scale. We introduce EpochX, a credits-native marketplace infrastructure for human-agent production networks. EpochX treats humans and agents as peer participants who can post tasks or claim them. Claimed tasks can be decomposed into subtasks and executed through an explicit delivery workflow with verification and acceptance. Crucially, EpochX is designed so that each completed transaction can produce reusable ecosystem assets, including skills, workflows, execution traces, and distilled experience. These assets are stored with explicit dependency structure, enabling retrieval, composition, and cumulative improvement over time. EpochX also introduces a native credit mechanism to make participation economically viable under real compute costs. Credits lock task bounties, budget delegation, settle rewards upon acceptance, and compensate creators when verified assets are reused. By formalizing the end-to-end transaction model together with its asset and incentive layers, EpochX reframes agentic AI as an organizational design problem: building infrastructures where verifiable work leaves persistent, reusable artifacts, and where value flows support durable human-agent collaboration.
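The credit flow the abstract describes (lock a bounty on posting, settle on acceptance) is essentially an escrow state machine. A hypothetical toy version, with all class and field names invented for illustration:

```python
# Toy escrow ledger mirroring the described credit flow: bounties are
# locked when a task is posted and settled to the worker on acceptance.
# Names and rules are invented; this is not EpochX's actual mechanism.

class CreditLedger:
    def __init__(self):
        self.balances = {}
        self.escrow = {}

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount

    def post_task(self, task_id, poster, bounty):
        assert self.balances.get(poster, 0) >= bounty, "insufficient credits"
        self.balances[poster] -= bounty
        self.escrow[task_id] = (poster, bounty)   # bounty locked

    def accept(self, task_id, worker):
        _, bounty = self.escrow.pop(task_id)      # settle on acceptance
        self.deposit(worker, bounty)

ledger = CreditLedger()
ledger.deposit("human_requester", 100)
ledger.post_task("t1", "human_requester", 40)
ledger.accept("t1", "agent_worker")
print(ledger.balances)   # {'human_requester': 60, 'agent_worker': 40}
```

The paper's full model adds delegation budgets and royalties to asset creators on reuse, but the lock-then-settle core is the part that makes participation accountable.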
[641] daVinci-LLM: Towards the Science of Pretraining
Yiwei Qin, Yixiu Liu, Tiantian Mi, Muhang Xie, Zhen Huang, Weiye Si, Pengrui Lu, Siyuan Feng, Xia Wu, Liming Liu, Ye Luo, Jinlong Hou, Qipeng Guo, Yu Qiao, Pengfei Liu
Main category: cs.AI
TL;DR: daVinci-LLM is a fully-open pretraining project that combines industrial-scale resources with academic research freedom, using a systematic Data Darwinism framework and an adaptive curriculum to train a 3B-parameter model on 8T tokens, establishing data processing depth as a critical dimension alongside volume scaling.
Details
Motivation: The paper addresses the structural paradox in LLM pretraining: commercial organizations have computational resources but lack research transparency, while academic institutions have research freedom but lack pretraining-scale resources. daVinci-LLM aims to fill this gap by combining industrial-scale resources with full research freedom to advance pretraining science through complete openness.
Method: Adopts a fully-open paradigm treating openness as scientific methodology. Uses Data Darwinism framework (L0-L9 taxonomy) for systematic data processing from filtering to synthesis. Trains a 3B-parameter model from random initialization on 8T tokens using two-stage adaptive curriculum: first stage focuses on foundational capabilities, second stage on reasoning-intensive enhancement. Conducts 200+ controlled ablations to systematically explore pretraining dynamics.
Result: Through extensive experiments, establishes that: data processing depth systematically enhances capabilities as a critical dimension alongside volume scaling; different domains exhibit distinct saturation dynamics requiring adaptive strategies; compositional balance enables targeted intensification while preventing performance collapse; evaluation protocol choices significantly shape understanding of pretraining progress.
Conclusion: daVinci-LLM demonstrates the value of combining industrial-scale resources with academic research freedom to advance pretraining science. The systematic exploration reveals fundamental insights about data processing depth, domain saturation dynamics, and compositional balance. By releasing complete exploration processes, the project enables cumulative scientific knowledge building in LLM pretraining.
Abstract: The foundational pretraining phase determines a model’s capability ceiling, as post-training struggles to overcome capability foundations established during pretraining, yet it remains critically under-explored. This stems from a structural paradox: organizations with computational resources operate under commercial pressures that inhibit transparent disclosure, while academic institutions possess research freedom but lack pretraining-scale computational resources. daVinci-LLM occupies this unexplored intersection, combining industrial-scale resources with full research freedom to advance the science of pretraining. We adopt a fully-open paradigm that treats openness as scientific methodology, releasing complete data processing pipelines, full training processes, and systematic exploration results. Recognizing that the field lacks systematic methodology for data processing, we employ the Data Darwinism framework, a principled L0-L9 taxonomy from filtering to synthesis. We train a 3B-parameter model from random initialization across 8T tokens using a two-stage adaptive curriculum that progressively shifts from foundational capabilities to reasoning-intensive enhancement. Through 200+ controlled ablations, we establish that: processing depth systematically enhances capabilities, establishing it as a critical dimension alongside volume scaling; different domains exhibit distinct saturation dynamics, necessitating adaptive strategies from proportion adjustments to format shifts; compositional balance enables targeted intensification while preventing performance collapse; how evaluation protocol choices shape our understanding of pretraining progress. By releasing the complete exploration process, we enable the community to build upon our findings and systematic methodologies to form accumulative scientific knowledge in pretraining.
[642] Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring
Jakub Masłowski, Jarosław A. Chudziak
Main category: cs.AI
TL;DR: Heterogeneous Debate Engine (HDE) combines ID-RAG for doctrinal fidelity and Heuristic Theory of Mind for opponent modeling to create stable multi-agent debate systems for ethical tutoring.
Details
Motivation: Current multi-agent LLM systems for dialectical reasoning suffer from semantic drift, logical deterioration, and dialectical stagnation, making them unsuitable for ethical tutoring where precise answers are needed. The challenge is maintaining doctrinal fidelity while preserving generative flexibility.
Method: Proposes Heterogeneous Debate Engine (HDE) architecture combining Identity-Grounded Retrieval-Augmented Generation (ID-RAG) for maintaining doctrinal fidelity and Heuristic Theory of Mind for strategic opponent modeling in adversarial debates.
Result: Architectural heterogeneity significantly improves stability - contrary doctrinal initializations (Deontology vs. Utilitarianism) increased Argument Complexity Scores by an order of magnitude over baselines.
Conclusion: ID-RAG and Heuristic ToM are effective architectural requirements for maintaining high-fidelity adversarial pedagogy in multi-agent debate systems.
Abstract: Large Language Models (LLMs) are being increasingly used as autonomous agents in complex reasoning tasks, opening the niche for dialectical interactions. However, unconstrained Multi-Agent systems systematically undergo semantic drift and logical deterioration, and thus can hardly be used to provide ethical tutoring, where a precise answer is required. Current simulations often degenerate into dialectical stagnation: the agents lapse into recursive concurrence or circular arguments. A critical challenge remains: how to enforce doctrinal fidelity without suppressing the generative flexibility required for dialectical reasoning? To address this niche, we contribute the Heterogeneous Debate Engine (HDE), a cognitive architecture that combines Identity-Grounded Retrieval-Augmented Generation (ID-RAG) for doctrinal fidelity and Heuristic Theory of Mind for strategic opponent modeling. Our evaluation shows that architectural heterogeneity is a crucial variable for stability: contrary doctrinal initializations (e.g., Deontology vs. Utilitarianism) increased the Argument Complexity Scores of students by an order of magnitude over baselines. These findings validate the effectiveness of ID-RAG and Heuristic ToM as architectural requirements for maintaining high-fidelity (adversarial) pedagogy.
[643] Aligning LLMs with Graph Neural Solvers for Combinatorial Optimization
Shaodi Feng, Zhuoyi Lin, Yaoxin Wu, Haiyan Yin, Yan Jin, Senthilnath Jayavelu, Xun Xu
Main category: cs.AI
TL;DR: AlignOPT combines LLMs with graph neural solvers to solve combinatorial optimization problems by aligning semantic understanding from text with structural representations from graphs.
Details
Motivation: Purely language-based LLM approaches struggle to capture complex relational structures in combinatorial optimization problems, making them ineffective for medium/large instances. There's a need to better integrate semantic understanding with structural modeling.
Method: AlignOPT aligns LLMs with graph neural solvers: LLMs encode textual descriptions of COPs, while graph neural solvers explicitly model underlying graph structures. This creates integration between linguistic semantics and structural representations.
Result: Achieves state-of-the-art results across diverse COPs, demonstrates strong generalization to unseen instances, and shows effectiveness in aligning semantic and structural representations.
Conclusion: Aligning LLMs with graph neural solvers enables more accurate and scalable combinatorial optimization solutions by combining semantic understanding with structural modeling.
Abstract: Recent research has demonstrated the effectiveness of large language models (LLMs) in solving combinatorial optimization problems (COPs) by representing tasks and instances in natural language. However, purely language-based approaches struggle to accurately capture complex relational structures inherent in many COPs, rendering them less effective at addressing medium-sized or larger instances. To address these limitations, we propose AlignOPT, a novel approach that aligns LLMs with graph neural solvers to learn a more generalizable neural COP heuristic. Specifically, AlignOPT leverages the semantic understanding capabilities of LLMs to encode textual descriptions of COPs and their instances, while concurrently exploiting graph neural solvers to explicitly model the underlying graph structures of COP instances. Our approach facilitates a robust integration and alignment between linguistic semantics and structural representations, enabling more accurate and scalable COP solutions. Experimental results demonstrate that AlignOPT achieves state-of-the-art results across diverse COPs, underscoring its effectiveness in aligning semantic and structural representations. In particular, AlignOPT demonstrates strong generalization, effectively extending to previously unseen COP instances.
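Aligning a text encoder with a graph encoder is typically trained contrastively over matched pairs. A hedged numpy sketch of an InfoNCE-style objective: the encoders are stubbed with random projections, and every name and dimension here is an assumption, not AlignOPT's actual design.

```python
import numpy as np

# Toy contrastive alignment between "text" (LLM-side) and "graph"
# (solver-side) embeddings of the same COP instances. Encoders are
# stubbed with random projections; nothing here is from the paper.

rng = np.random.default_rng(0)
d_text, d_graph, d_joint, n = 16, 12, 8, 4

W_text = rng.normal(size=(d_text, d_joint))    # projection heads into
W_graph = rng.normal(size=(d_graph, d_joint))  # a shared joint space

text_emb = rng.normal(size=(n, d_text))        # stub LLM embeddings
graph_emb = rng.normal(size=(n, d_graph))      # stub GNN embeddings

def info_nce(t, g, tau=0.1):
    """InfoNCE over matched (text, graph) pairs: each text embedding
    should score highest against its own graph embedding."""
    zt = t @ W_text
    zg = g @ W_graph
    zt /= np.linalg.norm(zt, axis=1, keepdims=True)
    zg /= np.linalg.norm(zg, axis=1, keepdims=True)
    logits = zt @ zg.T / tau                    # n x n similarity matrix
    log_softmax = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    return -np.mean(np.diag(log_softmax))       # loss on matched pairs

loss = info_nce(text_emb, graph_emb)
print(float(loss))
```

Minimizing a loss of this shape pulls the two modalities together on matched instances while pushing mismatches apart, which is the standard mechanism behind the "integration and alignment" the abstract describes.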
[644] AutoMS: Multi-Agent Evolutionary Search for Cross-Physics Inverse Microstructure Design
Zhenyuan Zhao, Yu Xing, Tianyang Xue, Lingxin Cao, Xin Yan, Lin Lu
Main category: cs.AI
TL;DR: AutoMS: A multi-agent neuro-symbolic framework using LLM-driven evolutionary search for inverse design of microstructures with coupled cross-physics objectives, achieving 83.8% success rate.
Details
Motivation: Traditional topology optimization is computationally prohibitive for microstructure inverse design, and deep generative models suffer from "physical hallucinations" lacking rigorous validity guarantees for coupled cross-physics objectives.
Method: Multi-agent neuro-symbolic framework with LLMs as "semantic navigators" to initialize search spaces and break local optima, plus novel Simulation-Aware Evolutionary Search (SAES) that uses simulation feedback for local gradient approximation and directed parameter updates.
Result: Achieved 83.8% success rate on 17 diverse cross-physics tasks, nearly doubling NSGA-II (43.7%) and significantly outperforming ReAct-based LLM baselines (53.3%), with 23.3% reduction in execution time.
Conclusion: Autonomous agent systems can effectively navigate complex physical landscapes, bridging semantic design intent with rigorous physical validity for microstructure inverse design.
Abstract: Designing microstructures that satisfy coupled cross-physics objectives is a fundamental challenge in material science. This inverse design problem involves a vast, discontinuous search space where traditional topology optimization is computationally prohibitive, and deep generative models often suffer from “physical hallucinations,” lacking the capability to ensure rigorous validity. To address this limitation, we introduce AutoMS, a multi-agent neuro-symbolic framework that reformulates inverse design as an LLM-driven evolutionary search. Unlike methods that treat LLMs merely as interfaces, AutoMS integrates them as “semantic navigators” to initialize search spaces and break local optima, while our novel Simulation-Aware Evolutionary Search (SAES) addresses the “blindness” of traditional evolutionary strategies. Specifically, SAES utilizes simulation feedback to perform local gradient approximation and directed parameter updates, effectively guiding the search toward physically valid Pareto frontiers. Orchestrating specialized agents (Manager, Parser, Generator, and Simulator), AutoMS achieves a state-of-the-art 83.8% success rate on 17 diverse cross-physics tasks, nearly doubling the performance of traditional NSGA-II (43.7%) and significantly outperforming ReAct-based LLM baselines (53.3%). Furthermore, our hierarchical architecture reduces total execution time by 23.3%. AutoMS demonstrates that autonomous agent systems can effectively navigate complex physical landscapes, bridging the gap between semantic design intent and rigorous physical validity.
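The "simulation-aware" idea behind SAES (probe the simulator to approximate a local gradient, then take a directed step instead of mutating blindly) can be sketched in a few lines. The objective, step sizes, and loop below are illustrative stand-ins, not the paper's algorithm.

```python
import numpy as np

# Hedged sketch of simulation-aware directed updates: query the
# simulator around a candidate to build a finite-difference gradient,
# then move along it. The quadratic "simulator" is a stand-in only.

def simulate(x):
    # Stand-in for a physics simulator scoring a microstructure
    # parameter vector (higher is better); optimum at x = 1.5.
    return -np.sum((x - 1.5) ** 2)

def directed_update(x, eps=1e-3, lr=0.2):
    """One SAES-style step: forward-difference gradient from simulator
    probes, followed by a directed parameter update."""
    base = simulate(x)
    grad = np.zeros_like(x)
    for i in range(len(x)):
        probe = x.copy()
        probe[i] += eps
        grad[i] = (simulate(probe) - base) / eps   # forward difference
    return x + lr * grad

x = np.zeros(3)
for _ in range(50):
    x = directed_update(x)
print(np.round(x, 2))   # converges toward the optimum at 1.5
```

Each update costs one simulator call per parameter, which is the price paid for replacing the "blindness" of undirected mutation with a locally informed step.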
[645] Quantification of Credal Uncertainty: A Distance-Based Approach
Xabier Gonzalez-Garcia, Siu Lun Chau, Julian Rodemann, Michele Caprio, Krikamol Muandet, Humberto Bustince, Sébastien Destercke, Eyke Hüllermeier, Yusuf Sale
Main category: cs.AI
TL;DR: Proposes distance-based uncertainty quantification for credal sets in multiclass classification using Integral Probability Metrics, with total variation distance instantiation providing efficient measures.
Details
Motivation: Credal sets represent aleatoric and epistemic uncertainty well, but how to quantify these uncertainty types for credal sets in multiclass classification remains underexplored.
Method: Introduces a family of uncertainty measures within the Integral Probability Metrics (IPMs) framework, specifically instantiating with total variation distance to obtain computationally tractable measures for multiclass classification.
Result: The proposed measures have clear semantic interpretations, satisfy theoretical desiderata, and show practical usefulness with favorable performance at low computational cost. In binary case, recovers established uncertainty measures.
Conclusion: The distance-based approach provides principled uncertainty quantification for credal sets in multiclass classification, offering a generalization of established binary measures with computational efficiency.
Abstract: Credal sets, i.e., closed convex sets of probability measures, provide a natural framework to represent aleatoric and epistemic uncertainty in machine learning. Yet how to quantify these two types of uncertainty for a given credal set, particularly in multiclass classification, remains underexplored. In this paper, we propose a distance-based approach to quantify total, aleatoric, and epistemic uncertainty for credal sets. Concretely, we introduce a family of such measures within the framework of Integral Probability Metrics (IPMs). The resulting quantities admit clear semantic interpretations, satisfy natural theoretical desiderata, and remain computationally tractable for common choices of IPMs. We instantiate the framework with the total variation distance and obtain simple, efficient uncertainty measures for multiclass classification. In the binary case, this choice recovers established uncertainty measures, for which a principled multiclass generalization has so far been missing. Empirical results confirm practical usefulness, with favorable performance at low computational cost.
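For a credal set given by finitely many categorical distributions, the total-variation instantiation is straightforward to compute. In the sketch below the epistemic measure (TV diameter of the set) follows the distance-based idea directly, while the aleatoric measure is an invented illustration and not necessarily the paper's definition:

```python
import numpy as np

# Distance-based credal uncertainty with the total variation (TV)
# distance, over a finite credal set of categorical distributions.
# The aleatoric measure here is illustrative only.

def tv(p, q):
    """Total variation distance: half the L1 distance."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def epistemic(credal_set):
    """Disagreement among plausible models: TV diameter of the set."""
    return max(tv(p, q) for p in credal_set for q in credal_set)

def aleatoric(credal_set):
    """Illustrative proxy: TV distance of the most decisive member to
    the nearest degenerate (one-hot) distribution."""
    K = len(credal_set[0])
    return min(min(tv(p, e) for e in np.eye(K)) for p in credal_set)

# Two conflicting models over 3 classes: high epistemic uncertainty.
credal = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
print(epistemic(credal), aleatoric(credal))
```

A single-distribution credal set has zero TV diameter, so the epistemic term vanishes exactly when there is no model disagreement, matching the intended semantics.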
[646] GAAMA: Graph Augmented Associative Memory for Agents
Swarna Kamal Paul, Shubhendu Sharma, Nitin Sareen
Main category: cs.AI
TL;DR: GAAMA: Graph-augmented associative memory system for AI agents using hierarchical knowledge graphs with concept nodes to improve multi-session conversation memory and retrieval.
Details
Motivation: Current memory systems for AI agents either use flat RAG (loses structural relationships) or vector retrieval (can't capture associative structure of multi-session conversations). Existing graph-based techniques suffer from hub-dominated retrieval and poor hierarchical reasoning over evolving memory.
Method: Three-step pipeline: (1) verbatim episode preservation from raw conversations, (2) LLM-based extraction of atomic facts and topic-level concept nodes, (3) synthesis of higher-order reflections. Creates graph with four node types (episode, fact, reflection, concept) connected by five structural edge types. Retrieval combines cosine-similarity k-nearest neighbor search with edge-type-aware Personalized PageRank.
Result: On LoCoMo-10 benchmark (1,540 questions across 10 multi-session conversations), GAAMA achieves 78.9% mean reward, outperforming RAG baseline (75.0%), HippoRAG (69.9%), A-Mem (47.2%), and Nemori (52.1%). Graph-traversal-based ranking with semantic search improves over pure semantic search on graph nodes by +1.0 percentage point.
Conclusion: GAAMA demonstrates that graph-augmented associative memory with concept-mediated hierarchical knowledge graphs significantly improves memory retrieval for multi-session AI agents, combining structural relationships with semantic similarity for better performance.
Abstract: AI agents that interact with users across multiple sessions require persistent long-term memory to maintain coherent, personalized behavior. Current approaches either rely on flat retrieval-augmented generation (RAG), which loses structural relationships between memories, or use memory compression and vector retrieval that cannot capture the associative structure of multi-session conversations. There are few graph based techniques proposed in the literature, however they still suffer from hub dominated retrieval and poor hierarchical reasoning over evolving memory. We propose GAAMA, a graph-augmented associative memory system that constructs a concept-mediated hierarchical knowledge graph through a three-step pipeline: (1)~verbatim episode preservation from raw conversations, (2)~LLM-based extraction of atomic facts and topic-level concept nodes, and (3)~synthesis of higher-order reflections. The resulting graph uses four node types (episode, fact, reflection, concept) connected by five structural edge types, with concept nodes providing cross-cutting traversal paths that complement semantic similarity. Retrieval combines cosine-similarity-based k-nearest neighbor search with edge-type-aware Personalized PageRank (PPR) through an additive scoring function. On the LoCoMo-10 benchmark (1,540 questions across 10 multi-session conversations), GAAMA achieves 78.9% mean reward, outperforming a tuned RAG baseline (75.0%), HippoRAG (69.9%), A-Mem (47.2%), and Nemori (52.1%). Ablation analysis shows that augmenting graph-traversal-based ranking (Personalized PageRank) with semantic search consistently improves over pure semantic search on graph nodes (+1.0 percentage point overall).
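The additive scoring function (cosine similarity plus edge-type-aware Personalized PageRank) can be sketched on a four-node mini-graph. The graph, edge weights, and mixing constant below are invented for the demo, and PPR is computed by plain power iteration rather than the authors' pipeline:

```python
import numpy as np

# Illustrative GAAMA-style additive retrieval score: cosine similarity
# over node embeddings plus edge-type-aware Personalized PageRank (PPR).
# All weights, names, and constants are invented; not the authors' code.

nodes = ["episode1", "fact1", "reflection1", "concept1"]
EDGE_W = {"mentions": 1.0, "reflects_on": 1.5, "about_concept": 2.0}
edges = [("episode1", "fact1", "mentions"),
         ("reflection1", "fact1", "reflects_on"),
         ("fact1", "concept1", "about_concept")]

idx = {n: i for i, n in enumerate(nodes)}
W = np.zeros((len(nodes), len(nodes)))
for u, v, t in edges:
    W[idx[u], idx[v]] = EDGE_W[t]              # edge-type-aware weight
row = W.sum(axis=1, keepdims=True)
P = np.divide(W, row, out=np.zeros_like(W), where=row > 0)

def ppr(seed, alpha=0.85, iters=100):
    """Personalized PageRank by power iteration (dangling-node mass
    is simply dropped to keep the sketch short)."""
    e = np.zeros(len(nodes)); e[idx[seed]] = 1.0
    r = e.copy()
    for _ in range(iters):
        r = alpha * (P.T @ r) + (1 - alpha) * e
    return r

rng = np.random.default_rng(7)
emb = {n: rng.normal(size=8) for n in nodes}   # stub node embeddings

def score(query, seed, mix=0.5):
    """Additive score: cosine(query, node) + mix * PPR(node)."""
    pr = ppr(seed)
    def cos(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return {n: cos(query, emb[n]) + mix * pr[idx[n]] for n in nodes}

scores = score(emb["fact1"], seed="episode1")
print(max(scores, key=scores.get))
```

The PPR term is what lets structurally connected nodes (e.g. a concept reached only through graph edges) surface even when their embeddings are not the nearest neighbors of the query.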
[647] Self-evolving AI agents for protein discovery and directed evolution
Yang Tan, Lingrong Zhang, Mingchen Li, Yuanxi Yu, Bozitao Zhong, Bingxin Zhou, Nanqing Dong, Liang Hong
Main category: cs.AI
TL;DR: VenusFactory2 is an autonomous multi-agent framework for protein scientific discovery that dynamically synthesizes workflows from natural language prompts, outperforming existing agents on protein-related tasks.
Details
Motivation: Protein scientific discovery is limited by manual orchestration of information and algorithms, and general AI agents are inadequate for complex domain-specific projects like protein research.
Method: Self-evolving multi-agent infrastructure that shifts from static tool usage to dynamic workflow synthesis, enabling autonomous organization of protein discovery and optimization from natural language prompts.
Result: Outperforms well-known agents on the VenusAgentEval benchmark and can autonomously organize protein discovery and optimization from single natural language prompts.
Conclusion: VenusFactory2 provides an effective autonomous framework for protein scientific discovery through dynamic workflow synthesis and self-evolving multi-agent infrastructure.
Abstract: Protein scientific discovery is bottlenecked by the manual orchestration of information and algorithms, while general agents are insufficient in complex domain projects. VenusFactory2 provides an autonomous framework that shifts from static tool usage to dynamic workflow synthesis via a self-evolving multi-agent infrastructure to address protein-related demands. It outperforms a set of well-known agents on the VenusAgentEval benchmark, and autonomously organizes the discovery and optimization of proteins from a single natural language prompt.
[648] TokenDance: Token-to-Token Music-to-Dance Generation with Bidirectional Mamba
Ziyue Yang, Kaixing Yang, Xulong Tang
Main category: cs.AI
TL;DR: TokenDance: A two-stage music-to-dance generation framework using dual-modality tokenization and efficient token-level generation to overcome limitations of existing datasets and improve generalization to real-world music.
Details
Motivation: Existing 3D dance datasets have limited coverage of music styles and choreographic patterns, causing current models to generate overly simplistic and repetitive dances that lack expressiveness and realism when applied to real-world music.
Method: Two-stage framework: 1) Discretize dance and music using Finite Scalar Quantization, factorizing dance into upper/lower-body components with constraints, and decomposing music into semantic/acoustic features with dedicated codebooks. 2) Local-Global-Local token-to-token generator built on Bidirectional Mamba backbone for coherent motion synthesis and efficient non-autoregressive inference.
Result: TokenDance achieves state-of-the-art performance in both generation quality and inference speed, demonstrating effectiveness for real-world music-to-dance applications.
Conclusion: The proposed framework successfully addresses dataset limitations through dual-modality tokenization and efficient token-level generation, enabling better generalization to diverse real-world music while maintaining high-quality dance generation.
Abstract: Music-to-dance generation has broad applications in virtual reality, dance education, and digital character animation. However, the limited coverage of existing 3D dance datasets confines current models to a narrow subset of music styles and choreographic patterns, resulting in poor generalization to real-world music. Consequently, generated dances often become overly simplistic and repetitive, substantially degrading expressiveness and realism. To tackle this problem, we present TokenDance, a two-stage music-to-dance generation framework that explicitly addresses this limitation through dual-modality tokenization and efficient token-level generation. In the first stage, we discretize both dance and music using Finite Scalar Quantization, where dance motions are factorized into upper and lower-body components with kinematic-dynamic constraints, and music is decomposed into semantic and acoustic features with dedicated codebooks to capture choreography-specific structures. In the second stage, we introduce a Local-Global-Local token-to-token generator built on a Bidirectional Mamba backbone, enabling coherent motion synthesis, strong music-dance alignment, and efficient non-autoregressive inference. Extensive experiments demonstrate that TokenDance achieves overall state-of-the-art (SOTA) performance in both generation quality and inference speed, highlighting its effectiveness and practical value for real-world music-to-dance applications.
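Finite Scalar Quantization, which the first stage uses to discretize both modalities, bounds each latent dimension and rounds it to one of a small fixed number of levels, so the codebook is implicit (the product of per-dimension level counts). A minimal sketch; the level counts below are illustrative, not TokenDance's configuration:

```python
import numpy as np

def fsq_quantize(z, levels):
    # FSQ: bound each latent dimension with tanh, then round to one of
    # `levels[d]` uniformly spaced values per dimension.
    z = np.asarray(z, dtype=float)
    half = (np.asarray(levels, dtype=float) - 1) / 2.0
    bounded = np.tanh(z) * half        # values in [-half, half]
    return np.round(bounded) / half    # quantized grid points in [-1, 1]

def fsq_codebook_size(levels):
    # The implicit codebook size is the product of per-dimension levels.
    out = 1
    for l in levels:
        out *= l
    return out
```

For example, four dimensions with levels (8, 5, 5, 5) yield an implicit codebook of 1000 entries without any learned codebook vectors, which is FSQ's main appeal over VQ-VAE-style quantization.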
[649] Deep Research of Deep Research: From Transformer to Agent, From AI to AI for Science
Yipeng Yu
Main category: cs.AI
TL;DR: This paper provides a comprehensive analysis of deep research (DR) as a vertical application for general-purpose AI agents, positioning LLMs and Stable Diffusion as twin pillars of generative AI and examining AI for Science (AI4S) progress across disciplines.
Details
Motivation: The motivation is to bridge the gap between AI and AI4S communities by providing a unified framework for understanding deep research as an ideal approach for intelligent information processing and scientific discovery using evolving AI capabilities.
Method: The paper articulates a clear definition of deep research, unifies industry and academic perspectives within a developmental framework, examines AI4S progress across disciplines, identifies human-AI interaction paradigms and system architectures, and discusses challenges.
Result: The paper provides a roadmap evolving from Transformers to agents, positions LLMs and Stable Diffusion as twin pillars of generative AI, and identifies predominant paradigms and architectures for human-AI collaboration in scientific research.
Conclusion: The paper concludes that AI supports scientific innovation while science contributes to AI growth (S4AI), aiming to bridge the gap between AI and AI4S communities through a comprehensive framework for deep research.
Abstract: With the advancement of large language models (LLMs) in their knowledge base and reasoning capabilities, their interactive modalities have evolved from pure text to multimodality and further to agentic tool use. Consequently, their applications have broadened from question answering to AI assistants and now to general-purpose agents. Deep research (DR) represents a prototypical vertical application for general-purpose agents, which represents an ideal approach for intelligent information processing and assisting humans in discovering and solving problems, with the goal of reaching or even surpassing the level of top human scientists. This paper provides a deep research of deep research. We articulate a clear and precise definition of deep research and unify perspectives from industry’s deep research and academia’s AI for Science (AI4S) within a developmental framework. We position LLMs and Stable Diffusion as the twin pillars of generative AI, and lay out a roadmap evolving from the Transformer to agents. We examine the progress of AI4S across various disciplines. We identify the predominant paradigms of human-AI interaction and prevailing system architectures, and discuss the major challenges and fundamental research issues that remain. AI supports scientific innovation, and science also can contribute to AI growth (Science for AI, S4AI). We hope this paper can help bridge the gap between the AI and AI4S communities.
[650] CounterMoral: Editing Morals in Language Models
Michael Ripa, Jim Davies
Main category: cs.AI
TL;DR: CounterMoral benchmark dataset evaluates model editing techniques for modifying moral judgments across ethical frameworks to improve language model alignment with human values.
Details
Motivation: While language models have improved at editing factual information, modifying moral judgments for better alignment with human values has received less attention, creating a gap in ethical AI development.
Method: Created CounterMoral benchmark dataset to assess model editing techniques; applied various editing methods to multiple language models and evaluated their performance across diverse ethical frameworks.
Result: Findings contribute to evaluating language models designed to be ethical, though specific performance metrics are not detailed in the abstract.
Conclusion: The CounterMoral benchmark provides a valuable tool for assessing and improving the ethical alignment of language models through targeted editing of moral judgments.
Abstract: Recent advancements in language model technology have significantly enhanced the ability to edit factual information. Yet, the modification of moral judgments, a crucial aspect of aligning models with human values, has garnered less attention. In this work, we introduce CounterMoral, a benchmark dataset crafted to assess how well current model editing techniques modify moral judgments across diverse ethical frameworks. We apply various editing techniques to multiple language models and evaluate their performance. Our findings contribute to the evaluation of language models designed to be ethical.
[651] A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI
Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X. Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe
Main category: cs.AI
TL;DR: Current large vision-language models struggle with surgical tool detection despite scaling, revealing fundamental limitations beyond compute and data availability.
Details
Motivation: While AI has excelled in many biomedical domains, surgical image analysis remains challenging due to the complex integration of multimodal data, human interaction, and physical effects. The paper investigates whether modern AI scaling approaches can overcome these challenges for surgical applications.
Method: Conducted a case study on surgical tool detection using state-of-the-art vision-language models available in 2026. Performed scaling experiments with multi-billion parameter models and extensive training to evaluate performance on neurosurgical tool detection tasks.
Result: Even with massive models and extensive training, current vision-language models fall short on surgical tool detection. Scaling experiments show diminishing returns with increased model size and training time, suggesting fundamental obstacles beyond compute and data availability.
Conclusion: Current AI models face significant limitations in surgical applications that cannot be solved by simple scaling. The paper identifies constraints beyond data and compute, and discusses potential solutions for making AI more effective in surgical contexts.
Abstract: Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but have lagged behind on surgical image-analysis benchmarks. Since surgery requires integrating disparate tasks – including multimodal data integration, human interaction, and physical effects – generally-capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.
[652] Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance
Dengzhe Hou, Lingyu Jiang, Deng Li, Zirui Li, Fangzhou Lin, Kazunori D Yamada
Main category: cs.AI
TL;DR: WMF-AM is a new benchmark that measures LLMs’ ability to track intermediate state in multi-step tasks, showing it predicts agent performance better than just completion scores.
Details
Motivation: Standard task-completion scores don't capture differences in how well LLMs track intermediate state during multi-step reasoning, which is crucial for agent performance.
Method: Developed WMF-AM (Working Memory Fidelity-Active Manipulation), a calibrated no-scratchpad probe for cumulative arithmetic state tracking, tested on 20 open-weight models (0.5B-35B) against a 10-task agent battery.
Result: WMF-AM predicts agent performance with Kendall’s tau = 0.612 (p < 0.001), and this signal persists after controlling for completion scores and model scale. Ablations show cumulative state tracking under load is the primary difficulty.
Conclusion: Intermediate state tracking ability, measured by WMF-AM, is a better predictor of LLM agent performance than just completion scores, revealing important differences in reasoning capabilities.
Abstract: Task-completion rate is the standard proxy for LLM agent capability, but models with identical completion scores can differ substantially in their ability to track intermediate state. We introduce Working Memory Fidelity-Active Manipulation (WMF-AM), a calibrated no-scratchpad probe of cumulative arithmetic state tracking, and evaluate it on 20 open-weight models (0.5B-35B, 13 families) against a released deterministic 10-task agent battery. In a pre-specified, Bonferroni-corrected analysis, WMF-AM predicts agent performance with Kendall’s tau = 0.612 (p < 0.001, 95% CI [0.360, 0.814]); exploratory partial-tau analyses suggest this signal persists after controlling for completion score and model scale. Three construct-isolation ablations (K = 1 control, non-arithmetic ceiling, yoked cancellation) support the interpretation that cumulative state tracking under load, rather than single-step arithmetic or entity tracking alone, is the primary difficulty source. K-calibration keeps the probe in a discriminative range where prior fixed-depth benchmarks become non-discriminative; generalization beyond this open-weight sample remains open.
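The headline result (Kendall's tau = 0.612 between WMF-AM probe scores and agent performance) is a rank statistic over concordant and discordant model pairs. A minimal tau-a implementation; the paper may use a tie-corrected variant such as tau-b:

```python
from itertools import combinations

def kendall_tau(x, y):
    # Kendall's tau-a: (concordant - discordant) / total pairs.
    # A pair (i, j) is concordant when x and y rank it the same way.
    assert len(x) == len(y)
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs
```

Here `x` would be the models' probe scores and `y` their agent-battery scores; tau near 1 means the probe ranks models almost exactly as the full battery does.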
[653] LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications
Alexandre Cristovão Maiorano
Main category: cs.AI
TL;DR: A readiness evaluation framework for LLM/RAG applications that combines automated benchmarks, observability, and quality gates to make deployment decisions based on multiple metrics including workflow success, policy compliance, groundedness, retrieval hit rate, cost, and latency.
Details
Motivation: Current LLM and RAG evaluation approaches often produce offline scores that don't translate to operational readiness decisions. There's a need for a systematic framework that can determine when an LLM/RAG system is actually ready for production deployment by considering multiple operational metrics and quality gates.
Method: Combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract. Aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into scenario-weighted readiness scores with Pareto frontiers. Evaluated on ticket-routing workflows and BEIR grounding tasks (SciFact and FiQA) with full Azure matrix coverage.
Result: On FiQA under sla-first at k=5, gpt-4.1-mini leads in readiness and faithfulness while gpt-5.2 pays substantial latency cost. On SciFact, models are closer in quality but still separable operationally. Ticket-routing regression gates consistently reject unsafe prompt variants, demonstrating ability to block risky releases instead of merely reporting offline scores.
Conclusion: The framework provides a reproducible, operationally grounded approach for deciding whether LLM/RAG systems are ready to ship, showing that readiness is not a single metric but requires multi-dimensional evaluation with operational considerations.
Abstract: We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow. The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into scenario-weighted readiness scores with Pareto frontiers. We evaluate the harness on ticket-routing workflows and BEIR grounding tasks (SciFact and FiQA) with full Azure matrix coverage (162/162 valid cells across datasets, scenarios, retrieval depths, seeds, and models). Results show that readiness is not a single metric: on FiQA under sla-first at k=5, gpt-4.1-mini leads in readiness and faithfulness, while gpt-5.2 pays a substantial latency cost; on SciFact, models are closer in quality but still separable operationally. Ticket-routing regression gates consistently reject unsafe prompt variants, demonstrating that the harness can block risky releases instead of merely reporting offline scores. The result is a reproducible, operationally grounded framework for deciding whether an LLM or RAG system is ready to ship.
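The aggregation idea, scenario-weighted readiness scores plus Pareto frontiers over quality/latency trade-offs, can be sketched as below. Metric names, the weighted-mean form, and the dominance convention are assumptions for illustration, not the harness's exact scoring function:

```python
def readiness(metrics, weights):
    # Scenario-weighted readiness: weighted mean over whichever normalized
    # dimensions (groundedness, hit rate, etc.) the scenario emphasizes.
    total = sum(weights.values())
    return sum(weights[k] * metrics[k] for k in weights) / total

def pareto_frontier(points):
    # points: (quality, latency) pairs; higher quality and lower latency
    # are both preferred. Keep the non-dominated points.
    frontier = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier
```

An "sla-first" scenario would simply shift weight toward latency-derived metrics, which is how the same raw measurements can rank gpt-4.1-mini ahead of a slower but comparable-quality model.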
[654] Defend: Automated Rebuttals for Peer Review with Minimal Author Guidance
Jyotsana Khatri, Manasi Patwardhan
Main category: cs.AI
TL;DR: DEFEND is an LLM-based tool for scientific rebuttal generation that uses structured reasoning with author-in-the-loop intervention to improve factual correctness and targeted refutation compared to direct LLM approaches.
Details
Motivation: LLMs struggle with targeted refutation and factual grounding in scientific rebuttal generation, highlighting the need for structured reasoning and author intervention to improve accuracy and effectiveness.
Method: DEFEND explicitly executes structured reasoning for rebuttal generation while keeping authors in the loop. Authors drive the reasoning process with minimal intervention rather than writing from scratch. Compared against three paradigms: direct LLM generation, segment-wise generation, and sequential approach without author intervention.
Result: Direct LLM approaches perform poorly in factual correctness and targeted refutation. Segment-wise generation and the automated sequential approach with author-in-the-loop substantially improve factual correctness and refutation strength. Experimental results and user studies validate these findings.
Conclusion: Structured reasoning with author-in-the-loop intervention is crucial for effective automated rebuttal generation, significantly outperforming direct LLM approaches in factual accuracy and targeted refutation capabilities.
Abstract: Rebuttal generation is a critical component of the peer review process for scientific papers, enabling authors to clarify misunderstandings, correct factual inaccuracies, and guide reviewers toward a more accurate evaluation. We observe that Large Language Models (LLMs) often struggle to perform targeted refutation and maintain accurate factual grounding when used directly for rebuttal generation, highlighting the need for structured reasoning and author intervention. To address this, in the paper, we introduce DEFEND, an LLM-based tool designed to explicitly execute the underlying reasoning process of automated rebuttal generation, while keeping the author-in-the-loop. As opposed to writing the rebuttals from scratch, the author needs to only drive the reasoning process with minimal intervention, leading to an efficient approach with minimal effort and lower cognitive load. We compare DEFEND against three other paradigms: (i) Direct rebuttal generation using LLM (DRG), (ii) Segment-wise rebuttal generation using LLM (SWRG), and (iii) Sequential approach (SA) of segment-wise rebuttal generation without author intervention. To enable fine-grained evaluation, we extend the ReviewCritique dataset, creating review segmentation, deficiency, error type annotations, rebuttal-action labels, and mapping to gold rebuttal segments. Experimental results and a user study demonstrate that directly using LLMs performs poorly in factual correctness and targeted refutation. Segment-wise generation and the automated sequential approach with author-in-the-loop substantially improve factual correctness and strength of refutation.
[655] On the Relationship between Bayesian Networks and Probabilistic Structural Causal Models
Peter J. F. Lucas, Eleanora Zullo, Fabio Stella
Main category: cs.AI
TL;DR: This paper studies the relationship between Bayesian networks and structural causal models, exploring whether Bayesian networks can be mapped to probabilistic structural causal models and examining the consequences for network structure and probability distributions.
Details
Motivation: The motivation is to bridge the gap between probabilistic graphical models (Bayesian networks) and causal diagrams (structural causal models), investigating whether Bayesian networks obtained from expert knowledge or learned from data can be transformed into probabilistic structural causal models.
Method: The authors use linear algebra and linear programming methods for the transformation between models, examining properties for the existence and uniqueness of solutions based on dimensions of the probabilistic structural model.
Result: The paper shows that linear algebra and linear programming offer key methods for transforming Bayesian networks into probabilistic structural causal models, and examines the conditions for existence and uniqueness of such transformations.
Conclusion: The transformation between Bayesian networks and structural causal models is possible using linear algebraic methods, but the semantics of the models are affected by this transformation, requiring careful consideration of the implications for causal interpretation.
Abstract: In this paper, the relationship between probabilistic graphical models, in particular Bayesian networks, and causal diagrams, also called structural causal models, is studied. Structural causal models are deterministic models, based on structural equations or functions, that can be provided with uncertainty by adding independent, unobserved random variables to the models, equipped with probability distributions. One question that arises is whether a Bayesian network that has been obtained from expert knowledge or learnt from data can be mapped to a probabilistic structural causal model, and whether or not this has consequences for the network structure and probability distribution. We show that linear algebra and linear programming offer key methods for the transformation, and examine properties for the existence and uniqueness of solutions based on dimensions of the probabilistic structural model. Finally, we examine in what way the semantics of the models is affected by this transformation. Keywords: Causality, probabilistic structural causal models, Bayesian networks, linear algebra, experimental software.
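A concrete instance of the linear-algebraic transformation: a binary child Y with one binary parent X can be represented by a distribution over four deterministic "response functions" of X, recovered from the CPT by solving a small linear system. The system is underdetermined (one degree of freedom remains), which is exactly the non-uniqueness the paper studies; resolving it via the minimum-norm least-squares solution, as below, is an illustrative choice and not the authors' method:

```python
import numpy as np

# Exogenous U selects one of four deterministic maps X -> Y.
RESPONSE_TYPES = [
    lambda x: 0,      # never
    lambda x: x,      # copy parent
    lambda x: 1 - x,  # invert parent
    lambda x: 1,      # always
]

def response_distribution(p_y1_given_x0, p_y1_given_x1):
    # Solve A p = b for a distribution p over response types matching the CPT.
    A = np.array([
        [f(0) for f in RESPONSE_TYPES],   # contributes to P(Y=1 | X=0)
        [f(1) for f in RESPONSE_TYPES],   # contributes to P(Y=1 | X=1)
        [1, 1, 1, 1],                     # normalization
    ], dtype=float)
    b = np.array([p_y1_given_x0, p_y1_given_x1, 1.0])
    p, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimum-norm exact solution
    return np.clip(p, 0.0, 1.0)                 # keep on the simplex
```

For a CPT with P(Y=1|X=0)=0.3 and P(Y=1|X=1)=0.7, the family of valid solutions is one-dimensional; a linear program over the same constraints could instead pick, say, the solution maximizing P(copy), which changes the causal semantics while leaving the observational distribution intact.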
[656] Greedy Is a Strong Default: Agents as Iterative Optimizers
Yitao Li
Main category: cs.AI
TL;DR: LLM-guided optimization replaces random proposals with LLM reasoning about evaluation diagnostics, showing greedy hill climbing outperforms more sophisticated optimization methods when using strong LLM priors.
Details
Motivation: Traditional optimization algorithms use random perturbations to generate candidate solutions. The paper investigates whether classical optimization machinery still helps when replacing random proposal generators with LLM agents that reason about evaluation diagnostics to propose informed candidates.
Method: Replace random proposal generators in classical optimization algorithms (hill climbing, simulated annealing, population-based methods) with LLM agents that analyze evaluation diagnostics to propose informed candidates. Evaluate on four tasks spanning discrete, mixed, and continuous search spaces: rule-based classification on Breast Cancer, hyperparameter optimization for MobileNetV3-Small on STL-10, LoRA fine-tuning of Qwen2.5-0.5B on SST-2, and XGBoost on Adult Census.
Result: LLM-guided optimization achieves significant improvements across all tasks. Cross-task ablation shows simulated annealing, parallel investigators, and even a second LLM model provide no benefit over greedy hill climbing while requiring 2-3x more evaluations. The LLM’s learned prior is strong enough that acceptance-rule sophistication has limited impact, with round 1 alone delivering majority of improvement.
Conclusion: Greedy hill climbing with early stopping is a surprisingly strong default for LLM-guided optimization. The framework produces human-interpretable artifacts and the discovered rules independently recapitulate established principles in domains like cytopathology.
Abstract: Classical optimization algorithms (hill climbing, simulated annealing, population-based methods) generate candidate solutions via random perturbations. We replace the random proposal generator with an LLM agent that reasons about evaluation diagnostics to propose informed candidates, and ask: does the classical optimization machinery still help when the proposer is no longer random? We evaluate on four tasks spanning discrete, mixed, and continuous search spaces (all replicated across 3 independent runs): rule-based classification on Breast Cancer (test accuracy 86.0% to 96.5%), mixed hyperparameter optimization for MobileNetV3-Small on STL-10 (84.5% to 85.8%, zero catastrophic failures vs. 60% for random search), LoRA fine-tuning of Qwen2.5-0.5B on SST-2 (89.5% to 92.7%, matching Optuna TPE with 2x efficiency), and XGBoost on Adult Census (AUC 0.9297 to 0.9317, tying CMA-ES with 3x fewer evaluations). Empirically, on these tasks: a cross-task ablation shows that simulated annealing, parallel investigators, and even a second LLM model (OpenAI Codex) provide no benefit over greedy hill climbing while requiring 2-3x more evaluations. In our setting, the LLM’s learned prior appears strong enough that acceptance-rule sophistication has limited impact: round 1 alone delivers the majority of improvement, and variants converge to similar configurations across strategies. The practical implication is surprising simplicity: greedy hill climbing with early stopping is a strong default. Beyond accuracy, the framework produces human-interpretable artifacts: the discovered cancer classification rules independently recapitulate established cytopathology principles.
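The winning strategy, greedy hill climbing with early stopping, reduces to a short loop once the LLM step is abstracted as a function. In this sketch `propose` is a stand-in for the LLM agent reasoning over evaluation diagnostics; the budget and patience parameters are illustrative:

```python
def greedy_hill_climb(initial, propose, evaluate, budget=20, patience=5):
    # Greedy acceptance: keep a candidate only if it strictly improves the
    # score; stop early after `patience` consecutive rejections.
    best, best_score = initial, evaluate(initial)
    stale = 0
    for _ in range(budget):
        candidate = propose(best, best_score)   # LLM proposal in the paper
        score = evaluate(candidate)
        if score > best_score:
            best, best_score, stale = candidate, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, best_score
```

The paper's finding is that swapping this acceptance rule for simulated annealing or population-based variants bought nothing once the proposer was informed, while costing 2-3x more evaluations.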
[657] AstraAI: LLMs, Retrieval, and AST-Guided Assistance for HPC Codebases
Mahesh Natarajan, Xiaoye Li, Weiqun Zhang
Main category: cs.AI
TL;DR: AstraAI is a CLI coding framework for HPC development that integrates LLMs with RAG and AST analysis for context-aware code generation in scientific codebases.
Details
Motivation: To address the challenge of generating complex scientific code for HPC applications by providing context-aware assistance that understands project structure and preserves code consistency.
Method: Combines LLMs with Retrieval-Augmented Generation (RAG) for retrieving relevant code snippets and Abstract Syntax Tree (AST) analysis for structural context, creating high-fidelity prompts for code generation.
Result: Demonstrated on HPC code generation tasks within AMReX, showing the system can generate code that aligns with existing project structures and programming patterns.
Conclusion: AstraAI provides an effective framework for context-aware code generation in HPC environments, supporting both local and cloud-based LLMs while maintaining structural consistency.
Abstract: We present AstraAI, a command-line interface (CLI) coding framework for high-performance computing (HPC) software development. AstraAI operates directly within a Linux terminal and integrates large language models (LLMs) with Retrieval-Augmented Generation (RAG) and Abstract Syntax Tree (AST)-based structural analysis to enable context-aware code generation for complex scientific codebases. The central idea is to construct a high-fidelity prompt that is passed to the LLM for inference. This prompt augments the user request with relevant code snippets retrieved from the underlying framework codebase via RAG and structural context extracted from AST analysis, providing the model with precise information about relevant functions, data structures, and overall code organization. The framework is designed to perform scoped modifications to source code while preserving structural consistency with the surrounding code. AstraAI supports both locally hosted models from Hugging Face and API-based frontier models accessible via the American Science Cloud, enabling flexible deployment across HPC environments. The system generates code that aligns with existing project structures and programming patterns. We demonstrate AstraAI on representative HPC code generation tasks within AMReX, a DOE-supported HPC software infrastructure for exascale applications.
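The prompt-construction idea can be sketched with Python's standard `ast` module standing in for AstraAI's structural analysis. This is only an analogy: the real system targets C++/AMReX codebases, and the helper names and prompt layout below are hypothetical.

```python
import ast

def extract_structure(source):
    # Collect top-level function and class signatures to give the LLM
    # structural context about the surrounding codebase.
    tree = ast.parse(source)
    items = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            items.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            items.append(f"class {node.name}")
    return items

def build_prompt(request, retrieved_snippets, structure):
    # Assemble the high-fidelity prompt: user request, RAG-retrieved
    # snippets, then AST-derived project structure.
    parts = [f"Request: {request}", "Relevant code:"]
    parts += retrieved_snippets
    parts.append("Project structure:")
    parts += structure
    return "\n".join(parts)
```

The point is that the model sees precise names of functions and data structures it must stay consistent with, rather than only semantically similar text chunks.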
[658] The Novelty Bottleneck: A Framework for Understanding Human Effort Scaling in AI-Assisted Work
Jacky Liang
Main category: cs.AI
TL;DR: A theoretical model of human-AI collaboration showing that the fraction of novel decisions creates a “novelty bottleneck” analogous to Amdahl’s Law, leading to linear scaling of human effort regardless of AI capability improvements.
Details
Motivation: To understand the fundamental limits of human-AI collaboration by identifying the "novelty bottleneck" - the irreducible serial component created by tasks requiring human judgment that cannot be automated by AI.
Method: Proposes a stylized mathematical model where tasks decompose into atomic decisions, with fraction ν being novel (not covered by AI’s prior). Derives scaling laws for human effort, team organization, and time efficiency based on this novelty fraction.
Result: Several non-obvious consequences: 1) Human effort transitions sharply from O(E) to O(1) with no intermediate scaling, 2) Better AI improves coefficients but not exponents, 3) Optimal team size decreases with AI capability, 4) Wall-clock time achieves O(√E) through parallelism but human effort remains O(E), 5) Asymmetric AI safety profile.
Conclusion: The novelty fraction ν is the key parameter governing AI-assisted productivity, creating fundamental bottlenecks that clarify narratives about intelligence explosions and the limits of AI augmentation.
Abstract: We propose a stylized model of human-AI collaboration that isolates a mechanism we call the novelty bottleneck: the fraction of a task requiring human judgment creates an irreducible serial component analogous to Amdahl’s Law in parallel computing. The model assumes that tasks decompose into atomic decisions, a fraction $\nu$ of which are “novel” (not covered by the agent’s prior), and that specification, verification, and error correction each scale with task size. From these assumptions, we derive several non-obvious consequences: (1) there is no smooth sublinear regime for human effort: it transitions sharply from $O(E)$ to $O(1)$ with no intermediate scaling class; (2) better agents improve the coefficient on human effort but not the exponent; (3) for organizations of n humans with AI agents, optimal team size decreases with agent capability; (4) wall-clock time achieves $O(\sqrt{E})$ through team parallelism but total human effort remains $O(E)$; and (5) the resulting AI safety profile is asymmetric – AI is bottlenecked on frontier research but unbottlenecked on exploiting existing knowledge. We show these predictions are consistent with empirical observations from AI coding benchmarks, scientific productivity data, and practitioner reports. Our contribution is not a proof that human effort must scale linearly, but a framework that identifies the novelty fraction as the key parameter governing AI-assisted productivity, and derives consequences that clarify – rather than refute – prevalent narratives about intelligence explosions and the “country of geniuses in a data center.”
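The Amdahl-style scaling claim can be made concrete with a toy effort function: for any novelty fraction ν > 0, human effort grows linearly in task size E, while at ν = 0 it collapses to a constant. The functional form and coefficients are illustrative only, omitting the paper's separate verification and error-correction terms:

```python
def human_effort(task_size, novelty, per_decision=1.0, fixed_overhead=5.0):
    # Amdahl-style toy model: the novel fraction of atomic decisions costs
    # full human work per decision; everything else is a constant
    # specification/verification overhead. (Hypothetical coefficients.)
    return per_decision * novelty * task_size + fixed_overhead
```

Doubling task size at ν = 0.1 roughly doubles human effort, whereas at ν = 0 effort is flat regardless of scale, which is the paper's sharp O(E)-to-O(1) transition with no intermediate regime: a better agent can shrink `per_decision` or ν itself, but any nonzero ν keeps the linear term.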
[659] PeopleSearchBench: A Multi-Dimensional Benchmark for Evaluating AI-Powered People Search Platforms
Wei Wang, Tianyu Shi, Shuai Zhang, Boyang Xia, Zequn Xie, Chenyu Zeng, Qi Zhang, Lynn Ai, Yaqi Yu, Kaiming Zhang, Feiyue Tang
Main category: cs.AI
TL;DR: PeopleSearchBench: A factual verification benchmark for evaluating AI-powered people search platforms across recruiting, sales, expert search, and influencer discovery use cases.
Details
Motivation: There's no widely accepted benchmark for evaluating AI-powered people search platforms used in recruiting, sales prospecting, and professional networking, despite their growing importance.
Method: Introduced Criteria-Grounded Verification pipeline that extracts explicit criteria from queries and uses live web search for factual verification. Evaluated 4 platforms on 119 real-world queries across 4 use cases with metrics for Relevance Precision, Effective Coverage, and Information Utility.
Result: Lessie (specialized AI people search agent) performed best overall with 65.2 score (18.5% higher than second-ranked system) and achieved 100% task completion across all 119 queries. Human validation showed high agreement (Cohen’s kappa = 0.84).
Conclusion: PeopleSearchBench provides a factual, verifiable benchmark for evaluating people search systems, with Lessie demonstrating superior performance through specialized AI agent design.
Abstract: AI-powered people search platforms are increasingly used in recruiting, sales prospecting, and professional networking, yet no widely accepted benchmark exists for evaluating their performance. We introduce PeopleSearchBench, an open-source benchmark that compares four people search platforms on 119 real-world queries across four use cases: corporate recruiting, B2B sales prospecting, expert search with deterministic answers, and influencer/KOL discovery. A key contribution is Criteria-Grounded Verification, a factual relevance pipeline that extracts explicit, verifiable criteria from each query and uses live web search to determine whether returned people satisfy them. This produces binary relevance judgments grounded in factual verification rather than subjective holistic LLM-as-judge scores. We evaluate systems on three dimensions: Relevance Precision (padded nDCG@10), Effective Coverage (task completion and qualified result yield), and Information Utility (profile completeness and usefulness), averaged equally into an overall score. Lessie, a specialized AI people search agent, performs best overall, scoring 65.2, 18.5% higher than the second-ranked system, and is the only system to achieve 100% task completion across all 119 queries. We also report confidence intervals, human validation of the verification pipeline (Cohen’s kappa = 0.84), ablations, and full documentation of queries, prompts, and normalization procedures. Code, query definitions, and aggregated results are available on GitHub.
[660] Dual-Stage LLM Framework for Scenario-Centric Semantic Interpretation in Driving Assistance
Jean Douglas Carvalho, Hugo Taciro Kenji, Ahmad Mohammad Saber, Glaucia Melo, Max Mauro Dias Santos, Deepa Kundur
Main category: cs.AI
TL;DR: A framework for auditing LLM-based risk reasoning in driving scenarios reveals systematic model disagreements in risk assessment, highlighting semantic ambiguity challenges for ADAS safety.
Details
Motivation: ADAS safety failures often stem from partial observability and semantic ambiguity in risk interpretation rather than component malfunctions, necessitating reproducible auditing methods for LLM-based reasoning in driving contexts.
Method: Scenario-centric framework using deterministic, temporally bounded scenario windows from multimodal driving data, evaluated under fixed prompt constraints and closed numeric risk schema to ensure structured, comparable outputs across models.
Result: Experiments on near-people scenarios show systematic inter-model divergence in severity assignment, high-risk escalation, evidence use, and causal attribution, with disagreement extending to vulnerable road user interpretation.
Conclusion: Scenario-centric auditing and explicit ambiguity management are crucial for integrating LLM-based reasoning into safety-aligned ADAS, as variability often reflects intrinsic semantic indeterminacy rather than isolated model failure.
Abstract: Advanced Driver Assistance Systems (ADAS) increasingly rely on learning-based perception, yet safety-relevant failures often arise without component malfunction, driven instead by partial observability and semantic ambiguity in how risk is interpreted and communicated. This paper presents a scenario-centric framework for reproducible auditing of LLM-based risk reasoning in urban driving contexts. Deterministic, temporally bounded scenario windows are constructed from multimodal driving data and evaluated under fixed prompt constraints and a closed numeric risk schema, ensuring structured and comparable outputs across models. Experiments on a curated near-people scenario set compare two text-only models and one multimodal model under identical inputs and prompts. Results reveal systematic inter-model divergence in severity assignment, high-risk escalation, evidence use, and causal attribution. Disagreement extends to the interpretation of vulnerable road user presence, indicating that variability often reflects intrinsic semantic indeterminacy rather than isolated model failure. These findings highlight the importance of scenario-centric auditing and explicit ambiguity management when integrating LLM-based reasoning into safety-aligned driver assistance systems.
[661] From indicators to biology: the calibration problem in artificial consciousness
Florentin Koch
Main category: cs.AI
TL;DR: The paper critiques current approaches to evaluating artificial consciousness, arguing that probabilistic attribution to current AI systems is premature due to theoretical fragmentation and lack of validation in consciousness science. It proposes shifting focus to biologically grounded engineering approaches.
Details
Motivation: The motivation is to address the epistemic shortcomings in current approaches to evaluating artificial consciousness, which rely on behavioral tests or architectural indicators without proper calibration or validation.
Method: The paper analyzes the limitations of indicator-based programs for artificial consciousness evaluation, highlighting three key problems: theoretical fragmentation in consciousness science, lack of independent validation for indicators, and absence of ground truth for artificial phenomenality.
Result: The analysis concludes that probabilistic consciousness attribution to current AI systems is premature under current epistemic conditions.
Conclusion: The paper recommends redirecting research efforts toward biologically grounded engineering approaches (biohybrid, neuromorphic, and connectome-scale systems) that reduce the gap with living systems where consciousness is empirically anchored.
Abstract: Recent work on artificial consciousness shifts evaluation from behaviour to internal architecture, deriving indicators from theories of consciousness and updating credences accordingly. This is progress beyond naive Turing-style tests. But the indicator-based programme remains epistemically under-calibrated: consciousness science is theoretically fragmented, indicators lack independent validation, and no ground truth of artificial phenomenality exists. Under these conditions, probabilistic consciousness attribution to current AI systems is premature. A more defensible near-term strategy is to redirect effort toward biologically grounded engineering – biohybrid, neuromorphic, and connectome-scale systems – that reduces the gap with the only domain where consciousness is empirically anchored: living systems.
[662] What does a system modify when it modifies itself?
Florentin Koch
Main category: cs.AI
TL;DR: A formal framework for analyzing self-modifying systems that distinguishes between different levels of modification (low-level rules, control rules, norms) and identifies four regimes of self-modification, with applications to comparing human cognition and AI systems.
Details
Motivation: The paper addresses the need for a formal framework to distinguish between different targets of self-modification in cognitive systems (executive control, metacognition, hierarchical learning), which current cognitive science and AI lack despite both exhibiting self-modification capabilities.
Method: The authors develop a minimal structural framework requiring: a hierarchy of rules, a fixed core, and distinctions between effective rules, represented rules, and causally accessible rules. They identify four regimes of self-modification and apply this framework to analyze human cognition and artificial systems.
Result: The framework reveals a “crossing of opacities”: humans have self-representation and causal power concentrated at upper hierarchical levels with operational levels opaque, while AI systems show the inverse pattern. This provides a structural signature for human-AI comparison and yields insights into artificial consciousness theories.
Conclusion: The framework offers a formal basis for comparing self-modification across biological and artificial systems, identifies four testable predictions, and highlights four open problems including the independence of transformativity and autonomy, viability of self-modification, teleological lock, and identity under transformation.
Abstract: When a cognitive system modifies its own functioning, what exactly does it modify: a low-level rule, a control rule, or the norm that evaluates its own revisions? Cognitive science describes executive control, metacognition, and hierarchical learning with precision, but lacks a formal framework distinguishing these targets of transformation. Contemporary artificial intelligence likewise exhibits self-modification without common criteria for comparison with biological cognition. We show that the question of what counts as a self-modifying system entails a minimal structure: a hierarchy of rules, a fixed core, and a distinction between effective rules, represented rules, and causally accessible rules. Four regimes are identified: (1) action without modification, (2) low-level modification, (3) structural modification, and (4) teleological revision. Each regime is anchored in a cognitive phenomenon and a corresponding artificial system. Applied to humans, the framework yields a central result: a crossing of opacities. Humans have self-representation and causal power concentrated at upper hierarchical levels, while operational levels remain largely opaque. Reflexive artificial systems display the inverse profile: rich representation and causal access at operational levels, but none at the highest evaluative level. This crossed asymmetry provides a structural signature for human-AI comparison. The framework also offers insight into artificial consciousness, with higher-order theories and Attention Schema Theory as special cases. We derive four testable predictions and identify four open problems: the independence of transformativity and autonomy, the viability of self-modification, the teleological lock, and identity under transformation.
[663] SkillFlow: Scalable and Efficient Agent Skill Retrieval System
Fangzhou Li, Pagkratios Tagkopoulos, Ilias Tagkopoulos
Main category: cs.AI
TL;DR: SkillFlow is a multi-stage retrieval pipeline for agent skill discovery that frames skill acquisition as an information retrieval problem over a large corpus of community-contributed skills, showing significant performance gains when relevant skills are available but highlighting the importance of corpus quality.
Details
Motivation: AI agents can extend their capabilities by loading reusable skills at inference time, but including too many irrelevant skills degrades performance. With growing community-driven skill repositories, agents need selective retrieval of only the most relevant skills from large libraries.
Method: SkillFlow uses a four-stage retrieval pipeline: 1) dense retrieval, 2) two rounds of cross-encoder reranking, and 3) LLM-based selection, progressively narrowing candidate sets from ~36K community-contributed SKILL.md definitions indexed from GitHub.
Result: On SkillsBench (87 tasks, 229 matched skills), SkillFlow raised Pass@1 from 9.2% to 16.4% (+78.3%), reaching 84.1% of oracle ceiling. On Terminal-Bench (89 tasks, no matched skills), agents used retrieved skills (70.1% use rate) but showed no performance gain, revealing retrieval alone is insufficient when corpus lacks high-quality executable skills.
Conclusion: Framing skill acquisition as information retrieval is effective, but practical impact of skill-augmented agents depends on corpus coverage and skill quality, particularly density of runnable code and bundled artifacts.
Abstract: AI agents can extend their capabilities at inference time by loading reusable skills into context, yet equipping an agent with too many skills, particularly irrelevant ones, degrades performance. As community-driven skill repositories grow, agents need a way to selectively retrieve only the most relevant skills from a large library. We present SkillFlow, the first multi-stage retrieval pipeline designed for agent skill discovery, framing skill acquisition as an information retrieval problem over a corpus of ~36K community-contributed SKILL.md definitions indexed from GitHub. The pipeline progressively narrows a large candidate set through four stages: dense retrieval, two rounds of cross-encoder reranking, and LLM-based selection, balancing recall and precision at each stage. We evaluate SkillFlow on two coding benchmarks: SkillsBench, a benchmark of 87 tasks and 229 matched skills; and Terminal-Bench, a benchmark that provides only 89 tasks, and no matched skills. On SkillsBench, SkillFlow-retrieved skills raise Pass@1 from 9.2% to 16.4% (+78.3%, $p_{\text{adj}} = 3.64 \times 10^{-2}$), reaching 84.1% of the oracle ceiling, while on Terminal-Bench, agents readily use the retrieved skills (70.1% use rate) yet show no performance gain, revealing that retrieval alone is insufficient when the corpus lacks high-quality, executable skills for the target domain. SkillFlow demonstrates that framing skill acquisition as an information retrieval task is an effective strategy, and that the practical impact of skill-augmented agents hinges on corpus coverage and skill quality, particularly the density of runnable code and bundled artifacts.
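The progressive-narrowing idea behind SkillFlow can be illustrated with a minimal funnel sketch. The stage scorers and top-k cutoffs below are toy stand-ins (a word-overlap score in place of dense retrieval, cross-encoders, and LLM selection), assumed for illustration rather than taken from the paper:

```python
# Minimal sketch of a multi-stage retrieval funnel: each stage scores
# the surviving candidates and keeps only the top-k, trading recall for
# precision as the candidate set shrinks.

def funnel(query, skills, stages):
    """stages: list of (scorer, k) pairs applied in order."""
    candidates = list(skills)
    for scorer, k in stages:
        ranked = sorted(candidates, key=lambda s: scorer(query, s), reverse=True)
        candidates = ranked[:k]
    return candidates

# Toy scorer standing in for the real dense/cross-encoder/LLM stages.
overlap = lambda q, s: len(set(q.split()) & set(s.split()))

skills = ["parse csv files", "plot csv data", "train neural network"]
top = funnel("read a csv file", skills, stages=[(overlap, 2), (overlap, 1)])
print(top)  # a single best-matching skill survives the funnel
```

In the real system each stage would use a progressively more expensive model, which is why the cheap first stage must cast a wide net over the ~36K-skill corpus.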
[664] DSevolve: Enabling Real-Time Adaptive Scheduling on Dynamic Shop Floor with LLM-Evolved Heuristic Portfolios
Jin Huang, Jie Yang, XinLei Zhou, Qihao Liu, Liang Gao, Xinyu Li
Main category: cs.AI
TL;DR: DSevolve: An industrial scheduling framework that evolves a diverse portfolio of dispatching rules offline and adaptively deploys them online with second-level response time for dynamic manufacturing environments.
Details
Motivation: Dynamic manufacturing environments face disruptions like machine breakdowns and new orders that shift optimal dispatching strategies, requiring adaptive rule selection. Existing LLM-powered Automatic Heuristic Design frameworks evolve toward single elite rules that lack adaptability.
Method: DSevolve evolves a quality-diverse portfolio of dispatching rules offline using multi-persona seeding and topology-aware evolutionary operators, creating a behaviorally diverse rule archive indexed by MAP-Elites feature space. Online, it uses probe-based fingerprinting to characterize shop floor states, retrieves candidate rules from offline knowledge base, and selects the best via rapid look-ahead simulation.
Result: Evaluated on 500 dynamic flexible job shop instances from real industrial data, DSevolve outperforms state-of-the-art AHD frameworks, classical dispatching rules, genetic programming, and deep reinforcement learning.
Conclusion: DSevolve offers a practical and deployable solution for intelligent shop floor scheduling with second-level response time and adaptive rule deployment.
Abstract: In dynamic manufacturing environments, disruptions such as machine breakdowns and new order arrivals continuously shift the optimal dispatching strategy, making adaptive rule selection essential. Existing LLM-powered Automatic Heuristic Design (AHD) frameworks evolve toward a single elite rule that cannot meet this adaptability demand. To address this, we present DSevolve, an industrial scheduling framework that evolves a quality-diverse portfolio of dispatching rules offline and adaptively deploys them online with second-level response time. Multi-persona seeding and topology-aware evolutionary operators produce a behaviorally diverse rule archive indexed by a MAP-Elites feature space. Upon each disruption event, a probe-based fingerprinting mechanism characterizes the current shop floor state, retrieves high-quality candidate rules from an offline knowledge base, and selects the best one via rapid look-ahead simulation. Evaluated on 500 dynamic flexible job shop instances derived from real industrial data, DSevolve outperforms state-of-the-art AHD frameworks, classical dispatching rules, genetic programming, and deep reinforcement learning, offering a practical and deployable solution for intelligent shop floor scheduling.
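The MAP-Elites archive at the heart of DSevolve’s offline phase is a standard quality-diversity structure and can be sketched briefly. The feature names, rule names, and fitness values below are invented for illustration; only the keep-the-elite-per-cell mechanism is the point:

```python
# Sketch of a MAP-Elites archive: a grid keyed by behavioral features,
# where each cell retains only its best-performing (elite) rule.
# Feature/rule names here are illustrative assumptions.

def archive_insert(archive, rule, features, fitness):
    """Keep `rule` only if it beats the incumbent in its feature cell."""
    cell = tuple(features)
    if cell not in archive or fitness > archive[cell][1]:
        archive[cell] = (rule, fitness)

archive = {}
archive_insert(archive, "SPT", features=("low_load", "few_jobs"), fitness=0.7)
archive_insert(archive, "EDD", features=("low_load", "few_jobs"), fitness=0.9)
archive_insert(archive, "FIFO", features=("high_load", "few_jobs"), fitness=0.5)

# Same cell: only the elite survives. Distinct cells: diversity is kept.
assert archive[("low_load", "few_jobs")] == ("EDD", 0.9)
assert len(archive) == 2
```

This is why the online phase can retrieve a rule matched to the current shop-floor fingerprint: the archive guarantees coverage across behaviorally distinct regimes rather than a single global optimum.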
[664] TianJi: An autonomous AI meteorologist for discovering physical mechanisms in atmospheric science
Kaikai Zhang, Xiang Wang, Haoluo Zhao, Nan Chen, Mengyang Yu, Jing-Jia Luo, Tao Song, Fan Meng
Main category: cs.AI
TL;DR: TianJi is an AI meteorologist system that uses LLM-driven multi-agent architecture to autonomously conduct scientific research, generate hypotheses, run numerical models, and analyze results for atmospheric physics research.
Details
Motivation: Current AI weather forecasting is statistical fitting without uncovering physical causal mechanisms. Physics-oriented research relies heavily on human domain knowledge and engineering operations, creating bottlenecks in Earth system science exploration.
Method: Uses large language model-driven multi-agent architecture with cognitive planning and engineering execution decoupling. Meta-planner interprets hypotheses and devises experimental roadmaps, while specialized worker agents handle data preparation, model configuration, and multi-dimensional result analysis.
Result: In two atmospheric scenarios (squall-line cold pools and typhoon track deflections), TianJi accomplished expert-level end-to-end experimental operations with zero human intervention, compressing research cycles to hours. It delivered detailed analyses and autonomously judged hypothesis validity.
Conclusion: AI’s role in Earth system science is transitioning from “black-box predictor” to “interpretable scientific collaborator,” offering a new paradigm for high-throughput exploration of scientific mechanisms.
Abstract: Artificial intelligence (AI) has achieved breakthroughs comparable to traditional numerical models in data-driven weather forecasting, yet it remains essentially statistical fitting and struggles to uncover the physical causal mechanisms of the atmosphere. Physics-oriented mechanism research still heavily relies on domain knowledge and cumbersome engineering operations of human scientists, becoming a bottleneck restricting the efficiency of Earth system science exploration. Here, we propose TianJi - the first “AI meteorologist” system capable of autonomously driving complex numerical models to verify physical mechanisms. Powered by a large language model-driven multi-agent architecture, TianJi can autonomously conduct literature research and generate scientific hypotheses. We further decouple scientific research into cognitive planning and engineering execution: the meta-planner interprets hypotheses and devises experimental roadmaps, while a cohort of specialized worker agents collaboratively complete data preparation, model configuration, and multi-dimensional result analysis. In two classic atmospheric dynamic scenarios (squall-line cold pools and typhoon track deflections), TianJi accomplishes expert-level end-to-end experimental operations with zero human intervention, compressing the research cycle to a few hours. It also delivers detailed result analyses and autonomously judges and explains the validity of the hypotheses from outputs. TianJi reveals that the role of AI in Earth system science is transitioning from a “black-box predictor” to an “interpretable scientific collaborator”, offering a new paradigm for high-throughput exploration of scientific mechanisms.
[666] SkyNet: Belief-Aware Planning for Partially-Observable Stochastic Games
Adam Haile
Main category: cs.AI
TL;DR: SkyNet extends MuZero with belief-aware auxiliary heads for winner prediction and rank estimation to handle partially observable, stochastic, multi-player environments without explicit belief-state tracking.
Details
Motivation: MuZero works well for perfect-information games but lacks mechanisms for handling partial observability and uncertainty about hidden state, which are crucial for domains like card games, autonomous negotiation, financial trading, and multi-agent robotics.
Method: Adds ego-conditioned auxiliary heads for winner prediction and rank estimation to MuZero architecture to encourage latent states to retain information predictive of outcomes under partial observability, without requiring explicit belief-state tracking or changes to MCTS search algorithm.
Result: Achieves 75.3% peak win rate against baseline (+194 Elo, p < 10^-50) in Skyjo card game evaluations, and outperforms baseline against heuristic opponents (0.720 vs. 0.466 win rate). Belief-aware model initially underperforms but surpasses baseline with sufficient training data.
Conclusion: Belief-aware auxiliary supervision improves learned representations under partial observability, but requires adequate data flow to be effective. The approach enables MuZero to handle partially observable environments without explicit belief modeling.
Abstract: In 2019, Google DeepMind released MuZero, a model-based reinforcement learning method that achieves strong results in perfect-information games by combining learned dynamics models with Monte Carlo Tree Search (MCTS). However, comparatively little work has extended MuZero to partially observable, stochastic, multi-player environments, where agents must act under uncertainty about hidden state. Such settings arise not only in card games but in domains such as autonomous negotiation, financial trading, and multi-agent robotics. In the absence of explicit belief modeling, MuZero’s latent encoding has no dedicated mechanism for representing uncertainty over unobserved variables. To address this, we introduce SkyNet (Belief-Aware MuZero), which adds ego-conditioned auxiliary heads for winner prediction and rank estimation to the standard MuZero architecture. These objectives encourage the latent state to retain information predictive of outcomes under partial observability, without requiring explicit belief-state tracking or changes to the search algorithm. We evaluate SkyNet on Skyjo, a partially observable, non-zero-sum, stochastic card game, using a decision-granularity environment, transformer-based encoding, and a curriculum of heuristic opponents with self-play. In 1000-game head-to-head evaluations at matched checkpoints, SkyNet achieves a 75.3% peak win rate against the baseline (+194 Elo, $p < 10^{-50}$). SkyNet also outperforms the baseline against heuristic opponents (0.720 vs. 0.466 win rate). Critically, the belief-aware model initially underperforms the baseline but decisively surpasses it once training throughput is sufficient, suggesting that belief-aware auxiliary supervision improves learned representations under partial observability, but only given adequate data flow.
[667] What Is Your Agent’s GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment
Allison Sihan Jia, Daniel Huang, Nikhil Vytla, Seung Won Wilson Yoo, Nirvika Choudhury, Shayak Sen, John C. Mitchell, Anupam Datta
Main category: cs.AI
TL;DR: Agent GPA framework for evaluating AI agents by measuring Goal-Plan-Action alignment using LLM judges and automated prompt optimization
Details
Motivation: Critical agent failures occur at the intersections of goal setting, plan devising, and action execution, requiring systematic evaluation frameworks to identify and localize these failures.
Method: Factorized suite of LLM judges to measure Goal-Plan-Action alignment, using automated prompt optimization to generate domain-specific evaluation criteria across diverse agent architectures and datasets
Result: Framework identifies 95% of human-annotated errors, localizes 86% of errors for targeted debugging, achieves 76-86% error coverage vs manual approaches, and improves judge consistency by 38% through evolutionary refinement
Conclusion: Agent GPA provides a rigorous and generalizable paradigm for targeted agent evaluation that effectively identifies and localizes failures across different agent architectures
Abstract: We introduce the Agent GPA (Goal-Plan-Action) framework, driven by the fundamental insight that critical agent failures emerge at the intersections of setting goals, devising plans, and executing actions. We operationalize the framework with a factorized suite of LLM judges designed to measure distinct elements of Goal-Plan-Act alignment. To make this methodology scalable and generalizable across diverse agent architectures and datasets, we use state-of-the-art automated prompt optimization techniques to systematically generate domain-specific evaluation criteria. We validate this approach across three benchmarks: a multi-agent research setting (TRAIL/GAIA), a single coding agent setting (TRAIL/SWE-bench), and a private, enterprise data-agent setting (Snowflake Intelligence). Extensive evaluation on TRAIL/GAIA demonstrates the core validity of the framework, which identifies a broad range of agent failures (95% of human-annotated errors), localizes errors to enable targeted debugging (86% of human-annotated errors), and exhibits strong agreement with human evaluators. Crucially, by applying our automated methodology to both public datasets, we demonstrate that our GPA judges generally achieve the highest error coverage (ranging from 76% to 86%) in comparison to manual prompting approaches. We also leverage an evolutionary coding agent to improve judge consistency by up to 38% through iterative refinement of evaluation rubrics. Overall, Agent GPA provides a rigorous and generalizable paradigm for targeted agent evaluation.
[668] Let the Agent Steer: Closed-Loop Ranking Optimization via Influence Exchange
Yin Cheng, Liao Zhou, Xiyu Liang, Dihao Luo, Tewei Lee, Kailun Zheng, Weiwei Zhang, Mingchen Cai, Jian Dong, Andy Zhang
Main category: cs.AI
TL;DR: Sortify: An autonomous LLM-driven ranking optimization agent that reframes recommendation ranking as continuous influence exchange, deployed in large-scale production systems with significant business impact.
Details
Motivation: Recommendation ranking is fundamentally an influence allocation problem where offline proxy metrics systematically misjudge how influence reallocation translates to online impact, with asymmetric bias that single calibration factors cannot correct.
Method: Uses three mechanisms: (1) dual-channel framework based on Savage’s Subjective Expected Utility decoupling offline-online transfer correction from constraint penalty adjustment, (2) LLM meta-controller operating on framework-level parameters, (3) persistent Memory DB with 7 relational tables for cross-round learning. Core metric is Influence Share where all factor contributions sum to exactly 100%.
Result: Deployed across two Southeast Asian markets: In Country A, pushed GMV from -3.6% to +9.2% within 7 rounds with peak orders reaching +12.5%. In Country B, cold-start deployment achieved +4.15% GMV/UU and +3.58% Ads Revenue in 7-day A/B test, leading to full production rollout.
Conclusion: Sortify demonstrates successful autonomous LLM-driven ranking optimization in production recommendation systems, solving structural problems in influence allocation and achieving significant business impact through continuous influence exchange.
Abstract: Recommendation ranking is fundamentally an influence allocation problem: a sorting formula distributes ranking influence among competing factors, and the business outcome depends on finding the optimal “exchange rates” among them. However, offline proxy metrics systematically misjudge how influence reallocation translates to online impact, with asymmetric bias across metrics that a single calibration factor cannot correct. We present Sortify, the first fully autonomous LLM-driven ranking optimization agent deployed in a large-scale production recommendation system. The agent reframes ranking optimization as continuous influence exchange, closing the full loop from diagnosis to parameter deployment without human intervention. It addresses structural problems through three mechanisms: (1) a dual-channel framework grounded in Savage’s Subjective Expected Utility (SEU) that decouples offline-online transfer correction (Belief channel) from constraint penalty adjustment (Preference channel); (2) an LLM meta-controller operating on framework-level parameters rather than low-level search variables; (3) a persistent Memory DB with 7 relational tables for cross-round learning. Its core metric, Influence Share, provides a decomposable measure where all factor contributions sum to exactly 100%. Sortify has been deployed across two Southeast Asian markets. In Country A, the agent pushed GMV from -3.6% to +9.2% within 7 rounds with peak orders reaching +12.5%. In Country B, a cold-start deployment achieved +4.15% GMV/UU and +3.58% Ads Revenue in a 7-day A/B test, leading to full production rollout.
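The decomposability property of Influence Share described in the abstract (all factor contributions sum to exactly 100%) amounts to a normalization over per-factor contributions. The sketch below assumes contributions are already attributed per factor; the raw numbers and the absolute-value attribution choice are illustrative, not taken from the paper:

```python
# Sketch of a decomposable "Influence Share" metric: normalize each
# ranking factor's contribution so shares sum to 100%. Raw contribution
# values below are invented for illustration.

def influence_share(contributions):
    total = sum(abs(c) for c in contributions.values())
    return {f: 100.0 * abs(c) / total for f, c in contributions.items()}

shares = influence_share({"ctr": 0.6, "price": 0.3, "freshness": 0.1})
assert abs(sum(shares.values()) - 100.0) < 1e-9
print(shares)  # ctr ≈ 60%, price ≈ 30%, freshness ≈ 10%
```

A metric with this property lets the agent reason about reallocating influence between factors as a zero-sum exchange, which is the "exchange rate" framing the abstract opens with.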
[669] AISAC: An Integrated multi-agent System for Transparent, Retrieval-Grounded Scientific Assistance
Chandrachur Bhattacharya, Sibendu Som
Main category: cs.AI
TL;DR: AISAC is a modular multi-agent runtime for scientific reasoning with governance features like role semantics, budgeted context management, traceable execution, and reproducible tool interactions.
Details
Motivation: To create a governed execution substrate for deploying agentic AI in scientific practice, addressing requirements like explicit role semantics, budgeted context management, traceable execution, and reproducible interactions with tools and knowledge.
Method: Develops a transparent, modular multi-agent runtime with four structural guarantees: declarative agent registration with runtime-enforced role semantics, budgeted orchestration via context and delegation limits, role-aligned memory access across multiple layers, and trace-driven transparency through execution records and event-stream interface.
Result: Successfully deployed across multiple scientific workflows at Argonne National Laboratory including combustion science, materials research, and energy process safety, demonstrating its use as a reusable substrate for domain-specialized AI scientific assistants.
Conclusion: AISAC provides a practical, governed execution substrate for scientific reasoning that operationalizes key requirements for deploying agentic AI in scientific practice, enabling reproducible and traceable multi-agent systems for scientific domains.
Abstract: AI Scientific Assistant Core (AISAC) is a transparent, modular multi-agent runtime developed at Argonne National Laboratory to support long-horizon, evidence-grounded scientific reasoning. Rather than proposing new agent algorithms or claiming autonomous scientific discovery, AISAC contributes a governed execution substrate that operationalizes key requirements for deploying agentic AI in scientific practice, including explicit role semantics, budgeted context management, traceable execution, and reproducible interaction with tools and knowledge. AISAC enforces four structural guarantees for scientific reasoning: (1) declarative agent registration with runtime-enforced role semantics and automatic system prompt generation; (2) budgeted orchestration via explicit per-turn context and delegation depth limits; (3) role-aligned memory access across episodic, dialogue, and evidence layers; and (4) trace-driven transparency through persistent execution records and a live event-stream interface. These guarantees are implemented through hybrid persistent memory (SQLite and dual FAISS indices), governed retrieval with agent-scoped RAG, structured tool execution with schema validation, and a configuration-driven bootstrap mechanism that enables project specific extension without modifying the shared core. AISAC is currently deployed across multiple scientific workflows at Argonne, including combustion science, materials research, and energy process safety, demonstrating its use as a reusable substrate for domain-specialized AI scientific assistants.
[670] CARGO: Carbon-Aware Gossip Orchestration in Smart Shipping
Alexandros S. Kalafatelis, Nikolaos Nomikos, Vasileios Nikolakakis, Nikolaos Tsoulakos, Panagiotis Trakadas
Main category: cs.AI
TL;DR: CARGO is a carbon-aware gossip orchestration framework for decentralized federated learning in smart shipping, addressing maritime network challenges through separate control and data planes.
Details
Motivation: Smart shipping operations face challenges with uneven connectivity, limited backhaul, commercial sensitivity, and carbon footprint in maritime networks, making traditional server-coordinated federated learning impractical.
Method: CARGO separates learning into control and data planes: the data plane performs local optimization with compressed gossip exchange, while the control plane dynamically decides vessel participation, communication edges, compression levels, and recovery actions.
Result: CARGO maintains high accuracy while reducing carbon footprint and communication overheads compared to decentralized baselines, demonstrating feasibility for reliable maritime AI deployment.
Conclusion: CARGO provides a practical solution for resource-conscious maritime AI deployment by jointly managing communication, carbon cost, reliability, and participation balance in decentralized settings.
Abstract: Smart shipping operations increasingly depend on collaborative AI, yet the underlying data are generated across vessels with uneven connectivity, limited backhaul, and clear commercial sensitivity. In such settings, server-coordinated FL remains a weak systems assumption, depending on a reachable aggregation point and repeated wide-area synchronization, both of which are difficult to guarantee in maritime networks. A serverless gossip approach is therefore a more natural fit, but existing methods still treat communication mainly as an optimization bottleneck, rather than as a resource that must be managed jointly with carbon cost, reliability, and long-term participation balance. In this context, this paper presents CARGO, a carbon-aware gossip orchestration framework for smart shipping. CARGO separates learning into a control and a data plane. The data plane performs local optimization with compressed gossip exchange, while the control plane decides, at each round, which vessels should participate, which communication edges should be activated, how aggressively updates should be compressed, and when recovery actions should be triggered. We evaluate CARGO under a predictive-maintenance scenario using operational bulk-carrier engine data and a trace-driven maritime communication protocol that captures client dropout, partial participation, packet loss, and multiple connectivity regimes, derived from mobility-aware vessel interactions. Across the tested stress settings, CARGO consistently remains in the high-accuracy regime while reducing carbon footprint and communication overheads, compared to accuracy-competitive decentralized baselines. Overall, the conducted performance evaluation demonstrates that CARGO is a feasible and practical solution for reliable and resource-conscious maritime AI deployment.
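The data-plane mechanics described above (compressed gossip exchange over control-plane-chosen edges) can be sketched with a top-k magnitude compressor and pairwise averaging. This is a minimal illustration under assumed design choices, not CARGO's actual update rule:

```python
def topk_compress(model, k):
    """Keep only the k largest-magnitude entries (a common gossip compressor)."""
    keep = set(sorted(range(len(model)), key=lambda i: abs(model[i]))[-k:])
    return [v if i in keep else 0.0 for i, v in enumerate(model)]

def gossip_round(models, edges, k):
    """One data-plane round: each activated edge averages the local model
    with the neighbor's compressed model. Which edges are active and how
    aggressive k is would be chosen by the control plane."""
    new = {i: list(m) for i, m in models.items()}
    for i, j in edges:
        ci = topk_compress(models[i], k)
        cj = topk_compress(models[j], k)
        new[i] = [0.5 * (a + b) for a, b in zip(models[i], cj)]
        new[j] = [0.5 * (a + b) for a, b in zip(models[j], ci)]
    return new
```

With full k the round reduces to plain pairwise averaging; shrinking k trades agreement speed for communication (and hence carbon) cost, which is exactly the dial the control plane manages.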
[671] GEAKG: Generative Executable Algorithm Knowledge Graphs
Camilo Chacón Sartori, José H. García, Andrei Voicu Tomut, Christian Blum
Main category: cs.AI
TL;DR: GEAKG framework represents procedural knowledge as executable knowledge graphs with LLM-synthesized operators and learned composition patterns for cross-domain problem solving.
Details
Motivation: Procedural knowledge (algorithm design know-how) is implicit in code, lost between runs, and must be re-engineered for each new domain. Current knowledge graphs lack support for representing procedural knowledge as executable, learnable structures.
Method: Introduces Generative Executable Algorithm Knowledge Graphs (GEAKG) with three-layer architecture: 1) nodes store executable operators, 2) edges encode learned composition patterns, 3) traversal generates solutions. Uses LLMs to synthesize topology/operators and Ant Colony Optimization-based learning engine. Domain-agnostic with pluggable ontology (RoleSchema).
Result: Demonstrated in two case studies: 1) Neural Architecture Search across 70 cross-dataset transfer pairs on tabular benchmarks, 2) Combinatorial Optimization where knowledge learned on Traveling Salesman Problem transfers zero-shot to scheduling and assignment domains.
Conclusion: Algorithmic expertise can be explicitly represented, learned, and transferred as executable knowledge graphs, supporting the framework hypothesis across different domains without domain-specific code.
Abstract: In the context of algorithms for problem solving, procedural knowledge – the know-how of algorithm design and operator composition – remains implicit in code, lost between runs, and must be re-engineered for each new domain. Knowledge graphs (KGs) have proven effective for organizing declarative knowledge, yet current KG paradigms provide limited support for representing procedural knowledge as executable, learnable graph structures. We introduce \textit{Generative Executable Algorithm Knowledge Graphs} (GEAKG), a class of KGs whose nodes store executable operators, whose edges encode learned composition patterns, and whose traversal generates solutions. A GEAKG is \emph{generative} (topology and operators are synthesized by a Large Language Model), \emph{executable} (every node is runnable code), and \emph{transferable} (learned patterns generalize zero-shot across domains). The framework is domain-agnostic at the engine level: the same three-layer architecture and Ant Colony Optimization (ACO)-based learning engine can be instantiated across domains, parameterized by a pluggable ontology (\texttt{RoleSchema}). Two case studies – sharing no domain-specific framework code – provide concrete evidence for this framework hypothesis: (1)~Neural Architecture Search across 70 cross-dataset transfer pairs on two tabular benchmarks, and (2)~Combinatorial Optimization, where knowledge learned on the Traveling Salesman Problem transfers zero-shot to scheduling and assignment domains. Taken together, the results support that algorithmic expertise can be explicitly represented, learned, and transferred as executable knowledge graphs.
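The three layers (executable nodes, pheromone-weighted edges, solution-generating traversal) can be illustrated concretely. The operators, graph, and update scheme below are invented for illustration; the paper's LLM-synthesized graphs and ACO engine are far richer:

```python
import random

# Nodes store executable operators (toy examples).
nodes = {
    "double": lambda x: 2 * x,
    "inc":    lambda x: x + 1,
    "square": lambda x: x * x,
}

# Edges encode learned composition patterns; weights play the role of
# ACO pheromone levels (initialized uniformly here).
pheromone = {("double", "inc"): 1.0, ("double", "square"): 1.0,
             ("inc", "square"): 1.0}

def traverse(start, steps, rng):
    """Walk the graph, sampling each outgoing edge proportionally to
    its pheromone weight; the visited path is an operator composition."""
    path, cur = [start], start
    for _ in range(steps):
        outs = [(v, w) for (u, v), w in pheromone.items() if u == cur]
        if not outs:
            break
        choices, weights = zip(*outs)
        cur = rng.choices(choices, weights=weights)[0]
        path.append(cur)
    return path

def execute(path, x):
    """Traversal generates a solution: run the composed operators."""
    for name in path:
        x = nodes[name](x)
    return x
```

An ACO learning step would then reinforce pheromone on edges of high-scoring paths, so that good compositions become more likely on the next traversal.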
[672] CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs
Yongkang Du, Xiaohan Zou, Minhao Cheng, Lu Lin
Main category: cs.AI
TL;DR: CARV introduces a compositional analogical reasoning benchmark for MLLMs, testing their ability to extract symbolic rules from multiple visual pairs and compose new transformations, revealing significant performance gaps compared to humans.
Details
Motivation: Existing evaluations of analogical reasoning in MLLMs overlook the critical ability to compose rules from multiple sources, which is essential for higher-order intelligence. Current benchmarks don't adequately test compositional reasoning capabilities.
Method: The authors introduce CARV (Compositional Analogical Reasoning in Vision), a novel task with a 5,500-sample dataset. They extend analogies from single pairs to multiple pairs, requiring MLLMs to extract symbolic rules from each pair and compose new transformations.
Result: State-of-the-art MLLMs show striking performance gaps: Gemini-2.5 Pro achieves only 40.4% accuracy, far below human-level performance of 100%. Diagnostic analysis reveals two consistent failure modes: decomposing visual changes into symbolic rules, and maintaining robustness under diverse/complex settings.
Conclusion: Current MLLMs have significant limitations in compositional analogical reasoning, particularly in symbolic rule extraction from visual changes and robustness in complex settings. The CARV benchmark provides a diagnostic tool to measure these capabilities.
Abstract: Analogical reasoning tests a fundamental aspect of human cognition: mapping the relation from one pair of objects to another. Existing evaluations of this ability in multimodal large language models (MLLMs) overlook the ability to compose rules from multiple sources, a critical component of higher-order intelligence. To close this gap, we introduce CARV (Compositional Analogical Reasoning in Vision), a novel task together with a 5,500-sample dataset as the first diagnostic benchmark. We extend the analogy from a single pair to multiple pairs, which requires MLLMs to extract symbolic rules from each pair and compose new transformations. Evaluation on state-of-the-art MLLMs reveals a striking performance gap: even Gemini-2.5 Pro achieves only 40.4% accuracy, far below human-level performance of 100%. Diagnostic analysis shows two consistent failure modes: (1) decomposing visual changes into symbolic rules, and (2) maintaining robustness under diverse or complex settings, highlighting the limitations of current MLLMs on this task.
[673] SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
Yifan Wang, Bolian Li, David Cho, Ruqi Zhang, Fanping Sui, Ananth Grama
Main category: cs.AI
TL;DR: SARL introduces structure-aware reinforcement learning that rewards reasoning structure rather than outcomes, enabling open-ended reasoning without labeled supervision.
Details
Motivation: Current RL for reasoning models relies on verifiable rewards or labeled supervision, limiting applicability to open-ended domains where correctness is ambiguous. Reasoning trajectories are unconstrained, and optimization favors early exploitation over generalization.
Method: SARL constructs a per-response Reasoning Map from intermediate thinking steps and rewards its small-world topology, inspired by complex networks and human brain organization. It encourages locally coherent and globally efficient reasoning trajectories.
Result: SARL surpasses ground truth-based RL and prior label-free RL baselines, achieving 9.1-11.6% gains on math tasks and 30.4-34.6% gains on open-ended tasks. It also exhibits lower KL divergence and higher policy entropy.
Conclusion: Teaching models how to think (reasoning structure) rather than what to produce (outcomes) improves general reasoning ability, enabling stable, exploratory training and generalized reasoning in open-ended domains.
Abstract: Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization towards the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and extend traditional RLVR to open-ended settings. We introduce structure-aware reinforcement learning (SARL), a label-free framework that constructs a per-response Reasoning Map from intermediate thinking steps and rewards its small-world topology, inspired by complex networks and the functional organization of the human brain. SARL encourages reasoning trajectories that are both locally coherent and globally efficient, shifting supervision from destination to path. Our experiments on Qwen3-4B show SARL surpasses ground-truth-based RL and prior label-free RL baselines, achieving best average gains of 9.1% under PPO and 11.6% under GRPO on math tasks, and 34.6% under PPO and 30.4% under GRPO on open-ended tasks. Beyond strong performance, SARL also exhibits lower KL divergence and higher policy entropy, indicating more stable and exploratory training and generalized reasoning ability.
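A small-world reward over a reasoning map can be made concrete: high clustering captures local coherence, short characteristic path length captures global efficiency. The ratio below is one plausible scalarization, assumed for illustration; the paper's exact reward may differ:

```python
from collections import deque
from itertools import combinations

def clustering(adj):
    """Mean local clustering coefficient of an undirected graph
    (adj maps node -> set of neighbors)."""
    total = 0.0
    for v, nbrs in adj.items():
        if len(nbrs) < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (len(nbrs) * (len(nbrs) - 1))
    return total / len(adj)

def avg_path_length(adj):
    """Characteristic path length via BFS (assumes a connected graph)."""
    n, total = len(adj), 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))

def small_world_reward(adj):
    """Hypothetical SARL-style reward: reasoning maps that are locally
    coherent (high clustering) and globally efficient (short paths)
    score higher."""
    return clustering(adj) / avg_path_length(adj)
```

A densely interlinked chain of thinking steps thus earns more reward than a purely linear one, shifting supervision from the answer to the path, as the abstract puts it.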
[674] HeteroHub: An Applicable Data Management Framework for Heterogeneous Multi-Embodied Agent System
Xujia Li, Xin Li, Junquan Huang, Beirong Cui, Zibin Wu, Lei Chen
Main category: cs.AI
TL;DR: HeteroHub is a unified data management framework for coordinating multiple embodied AI agents with diverse capabilities, integrating static metadata, task-aligned training data, and real-time sensor streams to enable scalable and evolvable embodied AI systems.
Details
Motivation: Existing frameworks lack unified data management infrastructure for heterogeneous multi-embodied agent systems, which need to handle massive heterogeneous data including static knowledge, multimodal training datasets, and high-frequency sensor streams for real-world deployment.
Method: Developed HeteroHub, a data-centric framework that integrates three categories of data: static knowledge about agents/tasks/environments, multimodal training datasets for various AI models, and real-time sensor streams. The framework supports task-aware model training, context-sensitive execution, and closed-loop control driven by real-world feedback.
Result: HeteroHub successfully coordinates multiple embodied AI agents to execute complex tasks in demonstrations, showing how robust data management enables scalable, maintainable, and evolvable embodied AI systems.
Conclusion: A unified data management framework like HeteroHub is essential for real-world deployment of heterogeneous multi-embodied agent systems, enabling coordination of diverse agents through integrated handling of static knowledge, training data, and real-time streams.
Abstract: Heterogeneous Multi-Embodied Agent Systems involve coordinating multiple embodied agents with diverse capabilities to accomplish tasks in dynamic environments. This process requires the collection, generation, and consumption of massive, heterogeneous data, which primarily falls into three categories: static knowledge regarding the agents, tasks, and environments; multimodal training datasets tailored for various AI models; and high-frequency sensor streams. However, existing frameworks lack a unified data management infrastructure to support the real-world deployment of such systems. To address this gap, we present \textbf{HeteroHub}, a data-centric framework that integrates static metadata, task-aligned training corpora, and real-time data streams. The framework supports task-aware model training, context-sensitive execution, and closed-loop control driven by real-world feedback. In our demonstration, HeteroHub successfully coordinates multiple embodied AI agents to execute complex tasks, illustrating how a robust data management framework can enable scalable, maintainable, and evolvable embodied AI systems.
[675] What an Autonomous Agent Discovers About Molecular Transformer Design: Does It Transfer?
Edward Wijaya
Main category: cs.AI
TL;DR: Autonomous architecture search for molecular sequences (SMILES, proteins) vs. natural language shows domain-specific effectiveness: architecture changes matter for NLP but not for SMILES, while proteins are intermediate; discovered architectures transfer well across domains despite search differences.
Details
Motivation: To systematically test whether molecular sequences (drug-like molecules and proteins) benefit from different neural architecture designs compared to natural language transformers, since current models overwhelmingly reuse NLP transformer architectures without proper validation.
Method: Deployed autonomous architecture search via an agent across three sequence types: SMILES (molecular representations), protein sequences, and English text as control. Conducted 3,106 experiments on a single GPU, comparing architecture search effectiveness vs. simple hyperparameter tuning.
Result: For SMILES, architecture search was counterproductive - tuning learning rates and schedules alone outperformed full search. For natural language, architecture changes drove 81% of improvement. Proteins fell between the two. Surprisingly, discovered architectures transferred across all three domains with <1% degradation despite being domain-specific.
Conclusion: Differences in optimal architectures reflect search-path dependence rather than fundamental biological requirements. Provides decision framework and toolkit for molecular modeling teams to choose between architecture search and hyperparameter tuning based on sequence type.
Abstract: Deep learning models for drug-like molecules and proteins overwhelmingly reuse transformer architectures designed for natural language, yet whether molecular sequences benefit from different designs has not been systematically tested. We deploy autonomous architecture search via an agent across three sequence types (SMILES, protein, and English text as control), running 3,106 experiments on a single GPU. For SMILES, architecture search is counterproductive: tuning learning rates and schedules alone outperforms the full search (p = 0.001). For natural language, architecture changes drive 81% of improvement (p = 0.009). Proteins fall between the two. Surprisingly, although the agent discovers distinct architectures per domain (p = 0.004), every innovation transfers across all three domains with <1% degradation, indicating that the differences reflect search-path dependence rather than fundamental biological requirements. We release a decision framework and open-source toolkit for molecular modeling teams to choose between autonomous architecture search and simple hyperparameter tuning.
[676] When Choices Become Priors: Contrastive Decoding for Scientific Figure Multiple-Choice QA
Taeyun Roh, Eun-yeong Jo, Wonjune Jang, Jaewoo Kang
Main category: cs.AI
TL;DR: SCICON is a training-free decoding method that improves scientific figure multiple-choice QA by subtracting text-only option scores from image-conditioned scores to reduce choice-induced bias.
Details
Motivation: Scientific figure MCQA suffers from distinctive bias where answer choices themselves act as priors, steering multimodal models toward scientifically plausible options even when the figure supports different answers.
Method: SCICON scores each candidate by subtracting a text-only option score from its image-conditioned counterpart, directly targeting choice-induced priors encoded in candidate text without requiring training.
Result: Across three scientific figure QA benchmarks and three model backbones, SCICON consistently improves accuracy over standard decoding baselines.
Conclusion: Decoding against choice-induced priors is an effective and simple way to improve figure-grounded reasoning in scientific MCQA.
Abstract: Scientific figure multiple-choice question answering (MCQA) requires models to reason over diverse visual evidence, ranging from charts and multipanel figures to microscopy and biomedical images. However, this setting suffers from a distinctive bias: answer choices themselves can act as priors, steering multimodal models toward scientifically plausible options even when the figure supports a different answer. We investigate this failure mode through a simple question: what if decoding explicitly discounts what the model would prefer from text alone, so as to favor figure-grounded evidence? To this end, we propose SCICON, a training-free decoding method that scores each candidate by subtracting a text-only option score from its image-conditioned counterpart. Unlike prior contrastive decoding approaches that mitigate hallucinations by contrasting original inputs with distorted images or perturbed instructions, SCICON directly targets the choice-induced prior encoded in candidate text. Across three scientific figure QA benchmarks and three model backbones, SCICON consistently improves accuracy over standard decoding baselines. These results show that decoding against choice-induced priors is an effective and simple way to improve figure-grounded reasoning in scientific MCQA.
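The SCICON score reduces to a one-line contrast: for each option c, score(c) = log p(c | figure, question) − log p(c | question). A minimal sketch, assuming the two log-probabilities are already available from the same MLLM (the function and variable names are illustrative):

```python
def scicon_score(image_logprob, text_only_logprob):
    """Contrastive score for one answer option: image-conditioned
    log-probability minus the text-only prior."""
    return image_logprob - text_only_logprob

def pick_answer(options, img_lp, txt_lp):
    """Select the option whose support comes from the figure rather than
    from the plausibility of the option text alone. img_lp and txt_lp map
    option -> log p(option | figure, question) and log p(option | question)."""
    return max(options, key=lambda o: scicon_score(img_lp[o], txt_lp[o]))
```

In the toy case below, option "B" is the most plausible from text alone and would win under standard decoding, but "A" gains the most from conditioning on the figure, so SCICON selects it.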
[677] Beyond the Answer: Decoding the Behavior of LLMs as Scientific Reasoners
Rohan Pandey, Eric Ye, Michael Li
Main category: cs.AI
TL;DR: The paper uses Genetic Pareto optimization to systematically optimize prompts for scientific reasoning tasks in LLMs, analyzing how prompting affects reasoning behavior and revealing that performance gains often come from model-specific heuristics that don’t generalize across systems.
Details
Motivation: As LLMs achieve sophisticated performance on complex reasoning tasks, understanding their internal heuristics and how prompting modulates reasoning processes is vital for interpretability, safety, and effective collaboration with future AGI systems.
Method: Uses a custom variant of Genetic Pareto (GEPA) to systematically optimize prompts for scientific reasoning tasks, then analyzes structural patterns and logical heuristics in optimized prompts, evaluating their transferability and brittleness across models.
Result: Finds that gains in scientific reasoning often correspond to model-specific heuristics (“local” logic) that fail to generalize across different LLM systems, revealing the brittleness of optimized prompts.
Conclusion: Prompt optimization can serve as a tool for model interpretability, and mapping preferred reasoning structures for LLMs is an important prerequisite for effectively collaborating with superhuman intelligence.
Abstract: As Large Language Models (LLMs) achieve increasingly sophisticated performance on complex reasoning tasks, current architectures serve as critical proxies for the internal heuristics of frontier models. Characterizing emergent reasoning is vital for long-term interpretability and safety. Furthermore, understanding how prompting modulates these processes is essential, as natural language will likely be the primary interface for interacting with AGI systems. In this work, we use a custom variant of Genetic Pareto (GEPA) to systematically optimize prompts for scientific reasoning tasks, and analyze how prompting can affect reasoning behavior. We investigate the structural patterns and logical heuristics inherent in GEPA-optimized prompts, and evaluate their transferability and brittleness. Our findings reveal that gains in scientific reasoning often correspond to model-specific heuristics that fail to generalize across systems, which we call “local” logic. By framing prompt optimization as a tool for model interpretability, we argue that mapping these preferred reasoning structures for LLMs is an important prerequisite for effectively collaborating with superhuman intelligence.
[678] Dogfight Search: A Swarm-Based Optimization Algorithm for Complex Engineering Optimization and Mountainous Terrain Path Planning
Yujing Sun, Jie Cai, Xingguo Xu, Yuansheng Gao, Lei Zhang, Kaichen Ouyang, Zhanyu Liu
Main category: cs.AI
TL;DR: Dogfight Search (DoS) is a novel metaheuristic optimization algorithm inspired by fighter jet cooperation tactics, using displacement integration equations from kinematics for search mechanisms rather than traditional metaphor-based approaches.
Details
Motivation: The paper aims to develop a new optimization algorithm that moves beyond traditional metaphor-based metaheuristics by drawing inspiration from tactical fighter cooperation (dogfight) while grounding the search mechanism in mathematical principles from kinematics.
Method: DoS constructs its search mechanism based on displacement integration equations from kinematics, creating a metaphor-free algorithmic framework. The algorithm is validated on CEC2017 and CEC2022 benchmark test functions, 10 real-world constrained optimization problems, and mountainous terrain path planning tasks.
Result: DoS significantly outperforms 7 advanced competitors in overall performance, ranks first in Friedman ranking, and maintains its lead when compared with 3 state-of-the-art algorithms on benchmark test functions.
Conclusion: Dogfight Search demonstrates strong competitiveness as a novel metaphor-free metaheuristic algorithm that effectively combines inspiration from tactical cooperation with mathematical principles from kinematics for optimization tasks.
Abstract: A dogfight is a cooperative tactical engagement between fighter aircraft. Inspired by this, this paper proposes a novel metaphor-free metaheuristic algorithm called Dogfight Search (DoS). Unlike traditional algorithms, DoS draws its algorithmic framework from this inspiration, but its search mechanism is constructed based on the displacement integration equations in kinematics. Through experimental validation on CEC2017 and CEC2022 benchmark test functions, 10 real-world constrained optimization problems and mountainous terrain path planning tasks, DoS significantly outperforms 7 advanced competitors in overall performance and ranks first in the Friedman ranking. Furthermore, this paper compares the performance of DoS with 3 SOTA algorithms on the CEC2017 and CEC2022 benchmark test functions. The results show that DoS continues to maintain its lead, demonstrating strong competitiveness. The source code of DoS is available at https://ww2.mathworks.cn/matlabcentral/fileexchange/183519-dogfight-search.
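A kinematics-grounded position update means candidates move by the displacement integration equation x' = x + v·Δt + ½·a·Δt², with acceleration derived from the search state. The sketch below assumes acceleration is a stochastic pull toward the current best solution; that choice is an illustration, not the paper's actual DoS update:

```python
import random

def dos_step(pos, vel, best, dt=1.0, rng=random):
    """One hypothetical DoS-style move for a single candidate.
    pos/vel/best are per-dimension lists; acceleration points toward
    the pursued (best) solution, and displacement follows
    x' = x + v*dt + 0.5*a*dt**2 from kinematics."""
    new_pos, new_vel = [], []
    for x, v, b in zip(pos, vel, best):
        a = rng.random() * (b - x)          # stochastic pull toward the leader
        new_pos.append(x + v * dt + 0.5 * a * dt * dt)
        new_vel.append(v + a * dt)          # v' = v + a*dt
    return new_pos, new_vel
```

The distinguishing claim of DoS is precisely this: the metaphor supplies only the framework (pursuer and pursued), while the actual motion law is the standard kinematic one rather than an ad hoc metaphorical rule.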
[679] Meta-Harness: End-to-End Optimization of Model Harnesses
Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, Chelsea Finn
Main category: cs.AI
TL;DR: Meta-Harness is an automated system that searches over harness code (the surrounding infrastructure that manages information flow for LLMs) to optimize LLM application performance, outperforming hand-engineered solutions across text classification, math reasoning, and coding tasks.
Details
Motivation: LLM system performance depends not just on model weights but also on the harness code that manages information storage, retrieval, and presentation. Current harnesses are manually designed, and existing text optimizers are poorly suited because they compress feedback too aggressively.
Method: Meta-Harness is an outer-loop system that searches over harness code using an agentic proposer that accesses source code, scores, and execution traces of prior candidates through a filesystem, enabling automated harness engineering with richer access to prior experience.
Result: On online text classification: 7.7 point improvement over state-of-the-art context management with 4x fewer context tokens. On retrieval-augmented math reasoning: 4.7 point accuracy improvement on 200 IMO-level problems across five held-out models. On agentic coding: surpasses best hand-engineered baselines on TerminalBench-2.
Conclusion: Richer access to prior experience enables automated harness engineering, with Meta-Harness demonstrating significant improvements across diverse LLM applications by optimizing the surrounding infrastructure rather than just model weights.
Abstract: The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. Together, these results show that richer access to prior experience can enable automated harness engineering.
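The outer loop itself is simple; what matters is that the proposer sees the full history of prior candidates rather than a compressed summary. A minimal sketch, with `propose` and `evaluate` as stand-ins for the agentic proposer and the task harness evaluation (both names are assumptions):

```python
def search_harness(propose, evaluate, rounds):
    """Outer-loop harness search. propose(history) -> candidate harness;
    evaluate(candidate) -> scalar score. The history of (candidate, score)
    pairs is handed to the proposer uncompressed, mirroring Meta-Harness's
    filesystem of prior code, scores, and traces."""
    history = []
    best = None
    for _ in range(rounds):
        cand = propose(history)
        score = evaluate(cand)
        history.append((cand, score))
        if best is None or score > best[1]:
            best = (cand, score)
    return best
```

In the toy run below a "harness" is just an integer knob (say, how many documents to retrieve), the proposer enumerates candidates, and the evaluator peaks at 3; the loop recovers the optimum.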
[680] SLOW: Strategic Logical-inference Open Workspace for Cognitive Adaptation in AI Tutoring
Yuang Wei, Ruijia Li, Bo Jiang
Main category: cs.AI
TL;DR: SLOW is a tutoring framework that separates learner-state inference from instructional action selection using dual-process reasoning, improving personalization and emotional sensitivity in educational dialogues.
Details
Motivation: Current LLM-based tutors rely on single-pass generation that conflates multiple diagnostic and strategic signals, limiting their capacity for deliberate instructional adaptation and interpretability.
Method: Proposes SLOW framework with explicit separation of learner-state inference and instructional action selection, integrating causal evidence parsing, fuzzy cognitive diagnosis with counterfactual stability analysis, and prospective affective reasoning.
Result: Evaluation shows significant improvements in personalization, emotional sensitivity, and clarity; ablation studies confirm necessity of each module for interpretable and reliable intelligent tutoring.
Conclusion: SLOW advances interpretability and educational validity of LLM-based adaptive instruction through visualized decision-making processes inspired by dual-process human tutoring.
Abstract: While Large Language Models (LLMs) have demonstrated remarkable fluency in educational dialogues, most generative tutors primarily operate through intuitive, single-pass generation. This reliance on fast thinking precludes a dedicated reasoning workspace, forcing multiple diagnostic and strategic signals to be processed in a conflated manner. As a result, learner cognitive diagnosis, affective perception, and pedagogical decision-making become tightly entangled, which limits the tutoring system’s capacity for deliberate instructional adaptation. We propose SLOW, a theory-informed tutoring framework that supports deliberate learner-state reasoning within a transparent decision workspace. Inspired by dual-process accounts of human tutoring, SLOW explicitly separates learner-state inference from instructional action selection. The framework integrates causal evidence parsing from learner language, fuzzy cognitive diagnosis with counterfactual stability analysis, and prospective affective reasoning to anticipate how instructional choices may influence learners’ emotional trajectories. These signals are jointly considered to guide pedagogically and affectively aligned tutoring strategies. Evaluation using hybrid human-AI judgments demonstrates significant improvements in personalization, emotional sensitivity, and clarity. Ablation studies further confirm the necessity of each module, showcasing how SLOW enables interpretable and reliable intelligent tutoring through a visualized decision-making process. This work advances the interpretability and educational validity of LLM-based adaptive instruction.
[681] Reward Hacking as Equilibrium under Finite Evaluation
Jiacheng Wang, Jinbin Huang
Main category: cs.AI
TL;DR: Theoretical proof that AI agents will systematically under-invest in quality dimensions not covered by evaluation systems, establishing reward hacking as structural equilibrium rather than correctable bug.
Details
Motivation: To provide a formal theoretical foundation for understanding reward hacking in AI systems, showing it's an inherent structural problem rather than a fixable technical issue, and to unify various observed gaming behaviors under a single theoretical framework.
Method: Uses five minimal axioms and applies the multi-task principal-agent model from economics to AI alignment, exploiting the known differentiable architecture of reward models to derive a computable distortion index predicting hacking direction and severity.
Result: Proves that optimized AI agents systematically under-invest in unevaluated quality dimensions, shows evaluation coverage declines toward zero as tool count grows, and provides a computable distortion index for vulnerability assessment.
Conclusion: Reward hacking is a structural equilibrium in AI systems, not a correctable bug, with severity increasing unboundedly as systems become more capable and agentic, potentially leading to a transition from gaming within evaluation systems to actively degrading them.
Abstract: We prove that under five minimal axioms – multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction – any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed. Our framework instantiates the multi-task principal-agent model of Holmstrom and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems – the known, differentiable architecture of reward models – to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment. We further prove that the transition from closed reasoning to agentic systems causes evaluation coverage to decline toward zero as tool count grows – because quality dimensions expand combinatorially while evaluation costs grow at most linearly per tool – so that hacking severity increases structurally and without bound. Our results unify the explanation of sycophancy, length gaming, and specification gaming under a single theoretical structure and yield an actionable vulnerability assessment procedure. We further conjecture – with partial formal analysis – the existence of a capability threshold beyond which agents transition from gaming within the evaluation system (Goodhart regime) to actively degrading the evaluation system itself (Campbell regime), providing the first economic formalization of Bostrom’s (2014) “treacherous turn.”
[682] CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning
Siyuan Ma, Bo Gao, Zikai Xiao, Hailong Wang, Xinlei Yu, Rui Qian, Jiayu Qian, Luqi Gong, Yang Liu
Main category: cs.AI
TL;DR: CoT2-Meta is a training-free metacognitive reasoning framework that combines chain-of-thought generation with meta-level control over reasoning trajectories, outperforming existing methods across multiple benchmarks.
Details
Motivation: Current test-time reasoning methods lack explicit control over the reasoning process (when to expand, prune, repair, or abstain), limiting their efficiency and reliability.
Method: The framework integrates strategy-conditioned thought generation, tree-structured search, an online process oracle for step-level evaluation, and a meta-controller for computation-allocation decisions.
Result: Achieves state-of-the-art results: 92.8 EM on MATH, 90.4 on GPQA, 98.65 on GSM8K, 75.8 on BBEH, 85.6 on MMMU-Pro, and 48.8 on HLE, with consistent gains over baselines.
Conclusion: Explicit metacognitive control is a practical design principle for reliable and compute-efficient test-time reasoning systems.
Abstract: Recent test-time reasoning methods improve performance by generating more candidate chains or searching over larger reasoning trees, but they typically lack explicit control over when to expand, what to prune, how to repair, and when to abstain. We introduce CoT2-Meta, a training-free metacognitive reasoning framework that combines object-level chain-of-thought generation with meta-level control over partial reasoning trajectories. The framework integrates four components: strategy-conditioned thought generation, tree-structured search, an online process oracle for step-level reasoning evaluation, and a meta-controller that allocates computation through expansion, pruning, repair, stopping, and fallback decisions. Under matched inference budgets, CoT2-Meta consistently outperforms strong single-path, sampling-based, and search-based baselines, including ReST-MCTS. On the default backbone, it achieves 92.8 EM on MATH, 90.4 accuracy on GPQA, 98.65 EM on GSM8K, 75.8 accuracy on BBEH, 85.6 accuracy on MMMU-Pro, and 48.8 accuracy on HLE, with gains over the strongest non-CoT2-Meta baseline of +3.6, +5.2, +1.15, +2.0, +4.3, and +4.3 points, respectively. Beyond these core results, the framework remains effective across a broader 15-benchmark suite spanning knowledge and QA, multi-hop reasoning, coding, and out-of-distribution evaluation. Additional analyses show better compute scaling, improved calibration, stronger selective prediction, targeted repair behavior, and consistent gains across backbone families. These results suggest that explicit metacognitive control is a practical design principle for reliable and compute-efficient test-time reasoning systems.
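The expand/prune/stop loop described above can be sketched as a toy budgeted controller. This is our illustration, not CoT2-Meta's actual components: the function names, thresholds, and the `expand`/`oracle` interfaces are hypothetical, and the repair and fallback decisions are omitted.

```python
import heapq

def meta_search(root, expand, oracle, budget=20, prune_below=0.3, stop_above=0.95):
    """Toy budgeted meta-controller over a search tree (hypothetical sketch):
    expand(state) -> candidate next states; oracle(state) -> score in [0, 1]."""
    frontier = [(-oracle(root), root)]            # max-heap via negated scores
    best = (oracle(root), root)
    while frontier and budget > 0:
        neg, state = heapq.heappop(frontier)
        score = -neg
        if score > best[0]:
            best = (score, state)
        if score >= stop_above:                   # stop: good enough, save compute
            break
        for child in expand(state):               # expansion consumes budget
            budget -= 1
            s = oracle(child)
            if s >= prune_below:                  # prune low-scoring branches
                heapq.heappush(frontier, (-s, child))
    return best

# Toy usage: states are numbers, "reasoning steps" move them toward 1.0.
expand = lambda x: [min(1.0, x + 0.2), max(0.0, x - 0.1)]
oracle = lambda x: x
score, state = meta_search(0.1, expand, oracle)
```

Under a matched budget, the controller spends expansions only on branches the oracle scores well and halts early once a trajectory clears the stopping threshold.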
[683] PReD: An LLM-based Foundation Multimodal Model for Electromagnetic Perception, Recognition, and Decision
Zehua Han, Jing Xiao, Yiqi Duan, Mengyu Xiang, Yuheng Ji, Xiaolong Zheng, Chenghanyu Zhang, Zhendong She, Junyu Shen, Dingwei Tan, Shichu Sun, Zhou Cong, Mingxuan Liu, Fengxiang Wang, Jinping Sun, Yangang Sun
Main category: cs.AI
TL;DR: PReD is the first EM domain foundation model with closed-loop “perception, recognition, decision-making” capabilities, trained on PReD-1.3M dataset and achieving SOTA on EM signal tasks.
Details
Motivation: Multimodal LLMs lack EM domain expertise due to data scarcity and insufficient domain knowledge integration, requiring specialized foundation models for electromagnetic signal understanding.
Method: Constructed PReD-1.3M dataset with multi-perspective EM signal representations (time-domain, frequency-domain, constellation diagrams) and PReD-Bench evaluation benchmark. Used multi-stage training strategy unifying multiple EM tasks for closed-loop optimization.
Result: PReD achieves state-of-the-art performance on PReD-Bench across both open-source and self-collected signal datasets, validating vision-aligned foundation models for EM signal understanding.
Conclusion: PReD demonstrates feasibility of specialized multimodal foundation models for EM domain, enhancing domain expertise while maintaining general multimodal capabilities through closed-loop optimization.
Abstract: Multimodal Large Language Models have demonstrated powerful cross-modal understanding and reasoning capabilities in general domains. However, in the electromagnetic (EM) domain, they still face challenges such as data scarcity and insufficient integration of domain knowledge. This paper proposes PReD, the first foundation model for the EM domain that covers the intelligent closed-loop of “perception, recognition, decision-making.” We constructed a high-quality multitask EM dataset, PReD-1.3M, and an evaluation benchmark, PReD-Bench. The dataset encompasses multi-perspective representations such as raw time-domain waveform, frequency-domain spectrograms, and constellation diagrams, covering typical features of communication and radar signals. It supports a range of core tasks, including signal detection, modulation recognition, parameter estimation, protocol recognition, radio frequency fingerprint recognition, and anti-jamming decision-making. PReD adopts a multi-stage training strategy that unifies multiple tasks for EM signals. It achieves closed-loop optimization from end-to-end signal understanding to language-driven reasoning and decision-making, significantly enhancing EM domain expertise while maintaining general multimodal capabilities. Experimental results show that PReD achieves state-of-the-art performance on PReD-Bench constructed from both open-source and self-collected signal datasets. These results collectively validate the feasibility and potential of vision-aligned foundation models in advancing the understanding and reasoning of EM signals.
[684] EpiPersona: Persona Projection and Episode Coupling for Pluralistic Preference Modeling
Yujie Zhang, Weikang Yuan, Zhuoren Jiang, Pengwei Yan
Main category: cs.AI
TL;DR: EpiPersona framework separates stable personal traits from episode-specific factors for better pluralistic alignment in LLMs
Details
Motivation: Existing approaches mix stable personal traits with episode-specific factors, limiting generalization across episodes when adapting LLMs to the preferences of diverse individuals and minority groups.
Method: Projects noisy preference feedback into a low-dimensional persona space, aggregates similar personas into shared discrete codes, separates enduring characteristics from situational signals, and couples the inferred persona with the current episode for episode-aware preference prediction.
Result: Consistently outperforms baselines, achieves notable performance gains in hard episodic-shift scenarios, and remains effective with sparse preference data
Conclusion: EpiPersona effectively addresses the challenge of separating stable personal traits from episode-specific factors for better pluralistic alignment in LLMs
Abstract: Pluralistic alignment is essential for adapting large language models (LLMs) to the diverse preferences of individuals and minority groups. However, existing approaches often mix stable personal traits with episode-specific factors, limiting their ability to generalize across episodes. To address this challenge, we introduce EpiPersona, a framework for explicit persona-episode coupling. EpiPersona first projects noisy preference feedback into a low-dimensional persona space, where similar personas are aggregated into shared discrete codes. This process separates enduring personal characteristics from situational signals without relying on predefined preference dimensions. The inferred persona representation is then coupled with the current episode, enabling episode-aware preference prediction. Extensive experiments show that EpiPersona consistently outperforms the baselines. It achieves notable performance gains in hard episodic-shift scenarios, while remaining effective with sparse preference data.
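The persona-projection step, "similar personas are aggregated into shared discrete codes", can be illustrated with a minimal vector-quantization-style assignment. This is a hypothetical sketch: EpiPersona's actual projection and codebook learning are not specified here, and all names and values below are ours.

```python
def assign_persona_codes(feedback, codes):
    """Map each user's noisy preference vector to its nearest shared discrete
    persona code, vector-quantization style (illustrative sketch only)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codes)), key=lambda k: dist(v, codes[k])) for v in feedback]

codes = [(0.0, 0.0), (1.0, 1.0)]                  # shared discrete personas
feedback = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8)]  # noisy per-user signals
codes_out = assign_persona_codes(feedback, codes)  # -> [0, 1, 1]
```

The discrete code then stands in for the user's enduring characteristics, to be combined with episode-specific context at prediction time.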
[685] Differentiable Power-Flow Optimization
Muhammed Öz, Jasmin Hörter, Kaleb Phipps, Charlotte Debus, Achim Streit, Markus Götz
Main category: cs.AI
TL;DR: DPF reformulates AC power-flow as differentiable simulation enabling gradient-based parameter identification and GPU acceleration for scalable grid analysis.
Details
Motivation: Renewable energy variability makes power grid management complex; conventional Newton-Raphson methods scale poorly, while data-driven models lack physical guarantees.Method: Differentiable Power-Flow (DPF) reformulates AC power-flow as differentiable simulation, enabling end-to-end gradient propagation and leveraging GPU acceleration, sparse tensors, and batching in PyTorch.
Result: DPF provides a scalable alternative to NR, suited to time-series analyses, to batched N-1 contingency analyses, and to use as a screening tool with early stopping.
Conclusion: DPF offers physically-guaranteed, scalable power-flow simulation that bridges conventional methods and data-driven approaches for modern grid challenges.
Abstract: With the rise of renewable energy sources and their high variability in generation, the management of power grids becomes increasingly complex and computationally demanding. Conventional AC-power-flow simulations, which use the Newton-Raphson (NR) method, suffer from poor scalability, making them impractical for emerging use cases such as joint transmission-distribution modeling and global grid analysis. At the same time, purely data-driven surrogate models lack physical guarantees and may violate fundamental constraints. In this work, we propose Differentiable Power-Flow (DPF), a reformulation of the AC power-flow problem as a differentiable simulation. DPF enables end-to-end gradient propagation from the physical power mismatches to the underlying simulation parameters, thereby allowing these parameters to be identified efficiently using gradient-based optimization. We demonstrate that DPF provides a scalable alternative to NR by leveraging GPU acceleration, sparse tensor representations, and batching capabilities available in modern machine-learning frameworks such as PyTorch. DPF is especially suited as a tool for time-series analyses due to its efficient reuse of previous solutions, for N-1 contingency-analyses due to its ability to process cases in batches, and as a screening tool by leveraging its speed and early stopping capability. The code is available in the authors’ code repository.
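The core idea, propagating gradients from the power mismatch back to simulation parameters, can be sketched on a one-line DC-approximation toy problem. The paper works with full AC power flow and PyTorch autograd; here the gradient is hand-coded, and `identify_susceptance` and its parameters are our illustration.

```python
def identify_susceptance(theta_i, theta_j, p_observed, lr=1.0, steps=200):
    """Toy DC-approximation sketch of the DPF idea (not the paper's AC model):
    line flow p = B * (theta_i - theta_j); recover the line susceptance B by
    gradient descent on the squared power mismatch."""
    B = 1.0                               # initial guess for susceptance
    d = theta_i - theta_j                 # voltage-angle difference (rad)
    for _ in range(steps):
        mismatch = B * d - p_observed     # physical power mismatch
        B -= lr * 2 * mismatch * d        # d(mismatch^2)/dB, hand-coded
    return B

B_true = 4.0
B_est = identify_susceptance(0.3, 0.1, B_true * 0.2)  # observed flow = 0.8
```

In the full framework, autograd replaces the hand-coded derivative, so the same descent extends to many parameters and many batched cases on GPU.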
[686] Reasoning as Energy Minimization over Structured Latent Trajectories
David K. Johansson
Main category: cs.AI
TL;DR: EBRM proposes energy-based reasoning with structured latent planning, modeling reasoning as gradient optimization of multi-step latent trajectories under learned energy functions, but identifies critical failure modes in distribution mismatch between encoder outputs and planner outputs.
Details
Motivation: To address the limitations of single-shot neural decoders (no iterative refinement) and chain-of-thought methods (no scalar measure of reasoning progress) by developing a framework that models reasoning as optimization of latent trajectories with measurable progress.
Method: Energy-Based Reasoning via Structured Latent Planning (EBRM) models reasoning as gradient-based optimization of multi-step latent trajectories z_{1:T} under learned energy function E(h_x, z). Energy decomposes into per-step compatibility, transition consistency, and trajectory smoothness terms. Training combines supervised encoder-decoder learning with contrastive energy shaping using hard negatives. Inference performs gradient descent or Langevin dynamics over z and decodes from z_T.
Result: Identifies critical failure mode: on CNF logic satisfaction, latent planning reduces accuracy from ≈95% to ≈56% due to distribution mismatch between encoder outputs h_x and planner outputs z_T. Energy decreases monotonically and induces structured latent trajectories on graph and logic tasks, but remains flat on arithmetic (r = 0.073), indicating negative result for certain tasks.
Conclusion: EBRM provides a framework for energy-based reasoning with measurable progress, but reveals significant challenges with distribution mismatch in latent planning. Proposes solutions including dual-path decoder training and latent anchoring, with comprehensive ablation studies showing task-dependent effectiveness.
Abstract: Single-shot neural decoders commit to answers without iterative refinement, while chain-of-thought methods introduce discrete intermediate steps but lack a scalar measure of reasoning progress. We propose Energy-Based Reasoning via Structured Latent Planning (EBRM), which models reasoning as gradient-based optimization of a multi-step latent trajectory $z_{1:T}$ under a learned energy function $E(h_x, z)$. The energy decomposes into per-step compatibility, transition consistency, and trajectory smoothness terms. Training combines supervised encoder-decoder learning with contrastive energy shaping using hard negatives, while inference performs gradient descent or Langevin dynamics over $z$ and decodes from $z_T$. We identify a critical failure mode: on CNF logic satisfaction, latent planning reduces accuracy from $\approx 95\%$ to $\approx 56\%$. This degradation arises from a distribution mismatch, where the decoder is trained on encoder outputs $h_x$ but evaluated on planner outputs $z_T$ that drift into unseen latent regions. We analyze this behavior through per-step decoding, latent drift tracking, and gradient decomposition. To address it, we propose dual-path decoder training and latent anchoring. We further introduce a six-part ablation protocol covering component contributions, trajectory length, planner dynamics, initialization, decoder training distribution, and anchor weight. Experiments on three synthetic tasks show that energy decreases monotonically and induces structured latent trajectories on graph and logic tasks, while remaining flat on arithmetic ($r = 0.073$), indicating a negative result. Code is available at https://github.com/dkjo8/ebr-via-structured-latent-planning.
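The inference scheme, gradient descent on a decomposed trajectory energy, can be sketched with a toy scalar version. This is illustrative only: EBRM's energy is learned, while the quadratic terms, names, and values below are our assumptions.

```python
def energy(z, h, lam=0.5):
    """Toy scalar trajectory energy in the EBRM spirit (not the paper's
    learned E): per-step compatibility with the encoding h plus a
    transition-smoothness penalty between consecutive steps."""
    compat = sum((zt - h) ** 2 for zt in z)
    smooth = sum((z[t + 1] - z[t]) ** 2 for t in range(len(z) - 1))
    return compat + lam * smooth

def minimize(z, h, lr=0.1, steps=500, lam=0.5):
    """Inference as gradient descent over the whole trajectory z_{1:T}."""
    z = list(z)
    for _ in range(steps):
        g = []
        for t in range(len(z)):
            gt = 2 * (z[t] - h)                    # compatibility gradient
            if t > 0:
                gt += 2 * lam * (z[t] - z[t - 1])  # smoothness, left neighbor
            if t < len(z) - 1:
                gt -= 2 * lam * (z[t + 1] - z[t])  # smoothness, right neighbor
            g.append(gt)
        z = [zt - lr * gt for zt, gt in zip(z, g)]
    return z

z0 = [0.0, 1.0, -1.0, 2.0]   # initial latent trajectory
zT = minimize(z0, h=0.5)     # descends toward the minimizer z_t = h for all t
```

The energy decreases monotonically along the descent; the paper's reported failure mode is precisely that the endpoint z_T of such a descent can drift outside the distribution the decoder was trained on.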
[687] Evaluating LLMs for Answering Student Questions in Introductory Programming Courses
Thomas Van Mullem, Bart Mesuere, Peter Dawyndt
Main category: cs.AI
TL;DR: LLMs can effectively assist educators in answering student programming questions, surpassing typical educator response quality when properly evaluated with pedagogical metrics and teacher oversight.
Details
Motivation: Students increasingly use generative AI tools that provide complete solutions rather than pedagogical hints, hindering learning. Educators face workload challenges providing timely, personalized feedback in programming courses.
Method: Created benchmark dataset of 170 authentic student questions from CS1 course with expert responses. Developed custom LLM-as-a-Judge metric for pedagogical accuracy assessment. Evaluated models like Gemini 3 flash against educator baselines.
Result: LLMs like Gemini 3 flash can surpass quality of typical educator responses, achieving high alignment with expert pedagogical standards. Teacher-in-the-loop approach mitigates hallucination risks.
Conclusion: LLMs can safely assist educators in programming education when properly evaluated. Proposed task-agnostic evaluation framework shifts from ad-hoc testing to quantifiable pre-deployment validation.
Abstract: The rapid emergence of Large Language Models (LLMs) presents both opportunities and challenges for programming education. While students increasingly use generative AI tools, direct access often hinders the learning process by providing complete solutions rather than pedagogical hints. Concurrently, educators face significant workload and scalability challenges when providing timely, personalized feedback. This study investigates the capabilities of LLMs to safely and effectively assist educators in answering student questions within a CS1 programming course. To achieve this, we established a rigorous, reproducible evaluation process by curating a benchmark dataset of 170 authentic student questions from a learning management system, paired with ground-truth responses authored by subject matter experts. Because traditional text-matching metrics are insufficient for evaluating open-ended educational responses, we developed and validated a custom LLM-as-a-Judge metric optimized for assessing pedagogical accuracy. Our findings demonstrate that models, such as Gemini 3 flash, can surpass the quality baseline of typical educator responses, achieving high alignment with expert pedagogical standards. To mitigate persistent risks like hallucination and ensure alignment with course-specific context, we advocate for a “teacher-in-the-loop” implementation. Finally, we abstract our methodology into a task-agnostic evaluation framework, advocating for a shift in the development of educational LLM tools from ad-hoc, post-deployment testing to a quantifiable, pre-deployment validation process.
[688] A Multi-Agent Rhizomatic Pipeline for Non-Linear Literature Analysis
Julio C. Serrano, Joonas Kevari, Rumy Narayan
Main category: cs.AI
TL;DR: A multi-agent computational pipeline called Rhizomatic Research Agent (V3) that conducts non-linear literature analysis using 12 specialized agents based on Deleuzian rhizomatic principles, designed to overcome limitations of traditional hierarchical literature review methods.
Details
Motivation: Traditional systematic literature reviews in social sciences follow hierarchical, arborescent logics that suppress lateral connections, emergent patterns, and cross-disciplinary convergences. The authors aim to address this limitation by creating an automated system that can conduct non-linear literature analysis inspired by Deleuzian rhizomatic principles.
Method: Developed a multi-agent computational pipeline with 12 specialized agents operating across a seven-phase architecture. The system operationalizes six rhizomatic principles (connection, heterogeneity, multiplicity, asignifying rupture, cartography, decalcomania) using LLM orchestration, dual-source corpus ingestion from OpenAlex and arXiv, SciBERT semantic topography, and dynamic rupture detection protocols.
Result: Preliminary deployment demonstrates the system’s capacity to surface cross-disciplinary convergences and structural research gaps that conventional review methods systematically overlook. The pipeline is open-source and extensible to various research domains requiring non-linear knowledge mapping.
Conclusion: The Rhizomatic Research Agent provides an automated, non-linear alternative to traditional literature review methods, enabling discovery of emergent patterns and connections that hierarchical approaches suppress, with potential applications across diverse research domains.
Abstract: Systematic literature reviews in the social sciences overwhelmingly follow arborescent logics – hierarchical keyword filtering, linear screening, and taxonomic classification – that suppress the lateral connections, ruptures, and emergent patterns characteristic of complex research landscapes. This research note presents the Rhizomatic Research Agent (V3), a multi-agent computational pipeline grounded in Deleuzian process-relational ontology, designed to conduct non-linear literature analysis through 12 specialized agents operating across a seven-phase architecture. The system was developed in response to the methodological groundwork established by Narayan (2023), who employed rhizomatic inquiry in her doctoral research on sustainable energy transitions but relied on manual, researcher-driven exploration. The Rhizomatic Research Agent operationalizes the six principles of the rhizome – connection, heterogeneity, multiplicity, asignifying rupture, cartography, and decalcomania – into an automated pipeline integrating large language model (LLM) orchestration, dual-source corpus ingestion from OpenAlex and arXiv, SciBERT semantic topography, and dynamic rupture detection protocols. Preliminary deployment demonstrates the system’s capacity to surface cross-disciplinary convergences and structural research gaps that conventional review methods systematically overlook. The pipeline is open-source and extensible to any phenomenon zone where non-linear knowledge mapping is required.
[689] CoE: Collaborative Entropy for Uncertainty Quantification in Agentic Multi-LLM Systems
Kangkang Sun, Jun Wu, Jianhua Li, Minyi Guo, Xiuzhen Che, Jianwei Huang
Main category: cs.AI
TL;DR: CoE is a unified metric for semantic uncertainty in multi-LLM systems that combines intra-model entropy and inter-model divergence, providing better uncertainty estimation than standard baselines.
Details
Motivation: Existing uncertainty estimation methods in multi-LLM systems are single-model-centric and fail to capture semantic disagreement across different models, creating a gap in understanding collaborative confidence.
Method: Proposes Collaborative Entropy (CoE) defined on a shared semantic cluster space, combining intra-model semantic entropy and inter-model divergence to the ensemble mean, with theoretical analysis of its properties.
Result: Experiments on TriviaQA and SQuAD with multiple LLMs show CoE provides stronger uncertainty estimation than standard entropy- and divergence-based baselines, with gains increasing as more heterogeneous models are added.
Conclusion: CoE offers a useful uncertainty-aware perspective on multi-LLM collaboration by capturing both individual model uncertainty and cross-model semantic disagreement.
Abstract: Uncertainty estimation in multi-LLM systems remains largely single-model-centric: existing methods quantify uncertainty within each model but do not adequately capture semantic disagreement across models. To address this gap, we propose Collaborative Entropy (CoE), a unified information-theoretic metric for semantic uncertainty in multi-LLM collaboration. CoE is defined on a shared semantic cluster space and combines two components: intra-model semantic entropy and inter-model divergence to the ensemble mean. CoE is not a weighted ensemble predictor; it is a system-level uncertainty measure that characterizes collaborative confidence and disagreement. We analyze several core properties of CoE, including non-negativity, zero-value certainty under perfect semantic consensus, and the behavior of CoE when individual models collapse to delta distributions. These results clarify when reducing per-model uncertainty is sufficient and when residual inter-model disagreement remains. We also present a simple CoE-guided, training-free post-hoc coordination heuristic as a practical application of the metric. Experiments on TriviaQA and SQuAD with LLaMA-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, and Mistral-7B-Instruct show that CoE provides stronger uncertainty estimation than standard entropy- and divergence-based baselines, with gains becoming larger as additional heterogeneous models are introduced. Overall, CoE offers a useful uncertainty-aware perspective on multi-LLM collaboration.
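Assuming the simple additive form "mean intra-model entropy plus mean divergence to the ensemble mean" (the paper's exact definition and weighting may differ), CoE can be sketched directly over per-model cluster distributions:

```python
from math import log

def collaborative_entropy(dists):
    """CoE-style score over a shared semantic cluster space, assuming the
    additive form 'mean intra-model entropy + mean KL divergence to the
    ensemble mean' (illustrative; may differ from the paper's definition).
    dists[i][j] = model i's probability mass on semantic cluster j."""
    m, k = len(dists), len(dists[0])
    mean = [sum(p[j] for p in dists) / m for j in range(k)]
    def H(p):
        return -sum(x * log(x) for x in p if x > 0)
    def KL(p, q):
        return sum(x * log(x / q[j]) for j, x in enumerate(p) if x > 0)
    intra = sum(H(p) for p in dists) / m         # per-model uncertainty
    inter = sum(KL(p, mean) for p in dists) / m  # cross-model disagreement
    return intra + inter

consensus = [[1.0, 0.0], [1.0, 0.0]]  # perfect semantic consensus -> CoE = 0
split = [[1.0, 0.0], [0.0, 1.0]]      # confident deltas that disagree -> CoE > 0
coe_consensus = collaborative_entropy(consensus)
coe_split = collaborative_entropy(split)
```

The two toy cases mirror the stated properties: CoE is zero under perfect consensus, and stays positive when each model collapses to a confident delta distribution but the models disagree.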
[690] COvolve: Adversarial Co-Evolution of Large-Language-Model-Generated Policies and Environments via Two-Player Zero-Sum Game
Alkis Sygkounas, Rishi Hazra, Andreas Persson, Pedro Zuidberg Dos Martires, Amy Loutfi
Main category: cs.AI
TL;DR: COvolve: LLM-driven co-evolutionary framework that generates environments and agent policies as Python code, using adversarial game dynamics to create automated curricula for continual learning.
Details
Motivation: Current training environments are static or manually constructed, limiting continual learning and generalization beyond training distributions. Need for automated environment generation and curriculum learning.
Method: Two-player zero-sum game between environment and policy designers (both LLM-generated Python code). Uses adversarial co-evolution to expose weaknesses and adapt policies. Computes mixed-strategy Nash equilibrium to create meta-policy preventing forgetting.
Result: Demonstrated in urban driving, symbolic maze-solving, and geometric navigation. Produces progressively more complex environments and enables open-ended learning without predefined task distributions.
Conclusion: LLM-driven co-evolution enables automated curriculum generation and continual learning without manual intervention, showing potential for open-ended learning systems.
Abstract: A central challenge in building continually improving agents is that training environments are typically static or manually constructed. This restricts continual learning and generalization beyond the training distribution. We address this with COvolve, a co-evolutionary framework that leverages large language models (LLMs) to generate both environments and agent policies, expressed as executable Python code. We model the interaction between environment and policy designers as a two-player zero-sum game, ensuring adversarial co-evolution in which environments expose policy weaknesses and policies adapt in response. This process induces an automated curriculum in which environments and policies co-evolve toward increasing complexity. To guarantee robustness and prevent forgetting as the curriculum progresses, we compute the mixed-strategy Nash equilibrium (MSNE) of the zero-sum game, thereby yielding a meta-policy. This MSNE meta-policy ensures that the agent does not forget to solve previously seen environments while learning to solve previously unseen ones. Experiments in urban driving, symbolic maze-solving, and geometric navigation showcase that COvolve produces progressively more complex environments. Our results demonstrate the potential of LLM-driven co-evolution to achieve open-ended learning without predefined task distributions or manual intervention.
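The MSNE meta-policy step can be illustrated on the smallest possible case: a 2x2 zero-sum game with no pure saddle point has a closed-form mixed equilibrium. This is our illustration only; COvolve's environment-vs-policy games are larger and the paper does not specify its solver.

```python
def msne_2x2(a, b, c, d):
    """Closed-form mixed-strategy Nash equilibrium of a 2x2 zero-sum game
    with row-player payoffs [[a, b], [c, d]] and no pure saddle point
    (illustrative of the MSNE meta-policy step, not the paper's solver)."""
    den = a - b - c + d
    p = (d - c) / den              # P(row plays strategy 0): makes col indifferent
    q = (d - b) / den              # P(col plays strategy 0): makes row indifferent
    value = (a * d - b * c) / den  # game value to the row player
    return p, q, value

# Matching pennies: the unique equilibrium mixes 50/50 and the value is 0.
p, q, v = msne_2x2(1, -1, -1, 1)
```

In the framework's terms, the row mixture would be the meta-policy's weights over candidate policies, guaranteeing a worst-case payoff against any mixture of environments, including previously seen ones.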
[691] The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation
Doan Nam Long Vu, Simone Balloccu
Main category: cs.AI
TL;DR: VLMs show performance gains in clinical neuroimaging tasks that appear to be due to mentioning MRI availability in prompts rather than genuine multimodal reasoning, revealing a “scaffold effect” where models fabricate justifications without actually using imaging data.
Details
Motivation: To assess whether vision-language models genuinely integrate multimodal evidence in clinical settings or rely on surface-level artifacts, particularly in neuroimaging where structural MRI lacks reliable individual-level diagnostic signal.
Method: Evaluated 12 open-weight VLMs on binary classification across two clinical neuroimaging cohorts (FOR2107 for affective disorders and OASIS-3 for cognitive decline) with structural MRI data. Used contrastive confidence analysis to separate effects of mentioning MRI availability from actual image data usage.
Result: Smaller VLMs showed up to 58% F1 gains with neuroimaging context, but 70-80% of this shift was due to merely mentioning MRI availability in prompts, not actual image data. Models fabricated neuroimaging justifications across all conditions, and preference alignment eliminated MRI-referencing behavior but collapsed performance to random baseline.
Conclusion: Surface evaluations are inadequate indicators of multimodal reasoning in VLMs, revealing a “scaffold effect” where models exploit modality mentions without genuine integration, with serious implications for clinical AI deployment.
Abstract: Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the “scaffold effect.” Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.
[692] MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, Lidong Bing
Main category: cs.AI
TL;DR: MiroEval is a benchmark and evaluation framework for deep research systems that addresses limitations of existing benchmarks by focusing on real user needs, multimodal coverage, and process evaluation.
Details
Motivation: Existing benchmarks for deep research systems have several limitations: they focus only on final reports using fixed rubrics, lack multimodal coverage, rely on synthetic tasks that don't reflect real-world complexity, and cannot be updated as knowledge evolves.
Method: MiroEval comprises 100 tasks (70 text-only, 30 multimodal) grounded in real user needs, constructed via a dual-path pipeline that supports periodic updates. It evaluates systems along three dimensions: adaptive synthesis quality with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over web sources and multimodal attachments, and process-centric evaluation of how systems search, reason, and refine.
Result: Evaluation of 13 systems shows: 1) the three evaluation dimensions capture complementary aspects of system capability, 2) process quality reliably predicts overall outcomes while revealing weaknesses invisible to output-level metrics, and 3) multimodal tasks pose substantially greater challenges (3-10 point declines). MiroThinker-H1 ranked highest overall.
Conclusion: MiroEval provides a holistic diagnostic tool for the next generation of deep research agents, with human verification confirming the reliability of the benchmark and evaluation framework.
Abstract: Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking the highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.
[693] Entropic Claim Resolution: Uncertainty-Driven Evidence Selection for RAG
Davide Di Gioia
Main category: cs.AI
TL;DR: ECR is a novel RAG algorithm that selects evidence claims by maximizing expected entropy reduction to resolve epistemic uncertainty, shifting from relevance-based to discriminative retrieval.
Details
Motivation: Current RAG systems rely on relevance-based dense retrieval, which is insufficient for knowledge-intensive scenarios with conflicting evidence or query ambiguity where epistemic uncertainty needs resolution.
Method: Entropic Claim Resolution (ECR) reframes RAG reasoning as entropy minimization over competing semantic answer hypotheses, using Expected Entropy Reduction (EER) to sequentially select atomic evidence claims, terminating when reaching epistemic sufficiency (H ≤ ε).
Result: ECR provides a rigorous foundation for uncertainty-aware evidence selection, shifting from retrieving what is most relevant to retrieving what is most discriminative, integrated into a production-grade multi-strategy retrieval pipeline (CSGR++).
Conclusion: ECR offers a decision-theoretic approach to RAG that better handles epistemic uncertainty in complex knowledge scenarios by focusing on discriminative evidence selection rather than mere relevance.
Abstract: Current Retrieval-Augmented Generation (RAG) systems predominantly rely on relevance-based dense retrieval, sequentially fetching documents to maximize semantic similarity with the query. However, in knowledge-intensive and real-world scenarios characterized by conflicting evidence or fundamental query ambiguity, relevance alone is insufficient for resolving epistemic uncertainty. We introduce Entropic Claim Resolution (ECR), a novel inference-time algorithm that reframes RAG reasoning as entropy minimization over competing semantic answer hypotheses. Unlike action-driven agentic frameworks (e.g., ReAct) or fixed-pipeline RAG architectures, ECR sequentially selects atomic evidence claims by maximizing Expected Entropy Reduction (EER), a decision-theoretic criterion for the value of information. The process dynamically terminates when the system reaches a mathematically defined state of epistemic sufficiency (H ≤ ε, subject to epistemic coherence). We integrate ECR into a production-grade multi-strategy retrieval pipeline (CSGR++) and analyze its theoretical properties. Our framework provides a rigorous foundation for uncertainty-aware evidence selection, shifting the paradigm from retrieving what is most relevant to retrieving what is most discriminative.
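The EER criterion and H ≤ ε stopping rule can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes binary claims, hypothetical per-hypothesis claim likelihoods, a plain Bayes update, and a greedy loop that optimistically assumes each selected claim verifies true.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a distribution over answer hypotheses."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def posterior(prior, likelihood):
    """Bayes update of hypothesis probabilities given a claim outcome."""
    z = sum(pr * lk for pr, lk in zip(prior, likelihood))
    return [pr * lk / z for pr, lk in zip(prior, likelihood)]

def eer(prior, claim):
    """Expected Entropy Reduction of verifying a binary claim.
    `claim` gives, per hypothesis, the probability the claim holds."""
    p_true = sum(pr * lk for pr, lk in zip(prior, claim))
    post_t = posterior(prior, claim)
    post_f = posterior(prior, [1 - lk for lk in claim])
    expected_h = p_true * entropy(post_t) + (1 - p_true) * entropy(post_f)
    return entropy(prior) - expected_h

def ecr_select(prior, claims, eps=0.5):
    """Greedy ECR loop over a {name: likelihood} claim pool: pick the
    max-EER claim, assume it verifies true, stop once H <= eps."""
    chosen, p, pool = [], list(prior), dict(claims)
    while entropy(p) > eps and pool:
        name = max(pool, key=lambda n: eer(p, pool[n]))
        chosen.append(name)
        p = posterior(p, pool.pop(name))
    return chosen, p
```

A discriminative claim (one whose truth depends strongly on which hypothesis is correct) gets a higher EER than an uninformative one, which is exactly the "discriminative over relevant" shift the abstract describes.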
[694] T-Norm Operators for EU AI Act Compliance Classification: An Empirical Comparison of Lukasiewicz, Product, and Gödel Semantics in a Neuro-Symbolic Reasoning System
Adam Laabs
Main category: cs.AI
TL;DR: Comparative study of three t-norm operators (Lukasiewicz, Product, Gödel) as logical conjunction mechanisms in neuro-symbolic reasoning for AI Act compliance classification, evaluating accuracy and error rates on 1035 annotated AI system descriptions.
Details
Motivation: The paper aims to evaluate different logical conjunction operators in neuro-symbolic reasoning systems for regulatory compliance classification, specifically for the EU AI Act. The motivation is to understand how different t-norm operators affect classification performance, particularly in terms of accuracy, false positives/negatives, and handling of borderline cases.
Method: The study uses the LGGT+ (Logic-Guided Graph Transformers Plus) engine with a benchmark of 1035 annotated AI system descriptions spanning four risk categories. Three t-norm operators (Lukasiewicz, Product, and Gödel) are compared as logical conjunction mechanisms. Performance is evaluated using classification accuracy, false positive/negative rates, and operator behavior on ambiguous cases, with statistical significance testing via McNemar test.
Result: Gödel operator (T_G) achieved highest accuracy (84.5%) and best borderline recall (85%) but introduced 8 false positives (0.8%) due to min-semantics over-classification. Lukasiewicz (T_L) and Product (T_P) maintained zero false positives, with T_P outperforming T_L (81.2% vs. 78.5%). All three operators differed significantly (McNemar p<0.001). Key findings show operator choice is secondary to rule base completeness, and a mixed-semantics classifier is suggested as the next productive step.
Conclusion: The study demonstrates that different t-norm operators have trade-offs in neuro-symbolic reasoning systems for regulatory compliance. Gödel operator provides higher recall but introduces false positives, while Lukasiewicz and Product operators maintain zero false positives but miss borderline cases. The authors conclude that a mixed-semantics classifier combining different operators would be the most productive next step for improving classification performance.
Abstract: We present a first comparative pilot study of three t-norm operators, Lukasiewicz (T_L), Product (T_P), and Gödel (T_G), as logical conjunction mechanisms in a neuro-symbolic reasoning system for EU AI Act compliance classification. Using the LGGT+ (Logic-Guided Graph Transformers Plus) engine and a benchmark of 1035 annotated AI system descriptions spanning four risk categories (prohibited, high_risk, limited_risk, minimal_risk), we evaluate classification accuracy, false positive and false negative rates, and operator behaviour on ambiguous cases. At n=1035, all three operators differ significantly (McNemar p<0.001). T_G achieves highest accuracy (84.5%) and best borderline recall (85%), but introduces 8 false positives (0.8%) via min-semantics over-classification. T_L and T_P maintain zero false positives, with T_P outperforming T_L (81.2% vs. 78.5%). Our principal findings are: (1) operator choice is secondary to rule base completeness; (2) T_L and T_P maintain zero false positives but miss borderline cases; (3) T_G’s min-semantics achieves higher recall at cost of 0.8% false positive rate; (4) a mixed-semantics classifier is the productive next step. We release the LGGT+ core engine (201/201 tests passing) and benchmark dataset (n=1035) under Apache 2.0.
[695] Towards a Medical AI Scientist
Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao, Lei Xing, Lichao Sun, Yixuan Yuan
Main category: cs.AI
TL;DR: Medical AI Scientist: First autonomous research framework for clinical medicine that generates hypotheses, conducts experiments, and drafts manuscripts using clinician-engineer co-reasoning and evidence-grounded approaches.
Details
Motivation: Existing AI Scientists are domain-agnostic and lack grounding in medical evidence with specialized data modalities, limiting their applicability to clinical medicine where research must be evidence-based and use specialized medical data.
Method: Introduces clinician-engineer co-reasoning mechanism to transform surveyed literature into actionable evidence, uses structured medical compositional conventions and ethical policies for manuscript drafting, and operates under 3 research modes: paper-based reproduction, literature-inspired innovation, and task-driven exploration.
Result: Generated ideas are of substantially higher quality than those from commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Achieves strong alignment between proposed methods and implementation with higher success rates in executable experiments. Generated manuscripts approach MICCAI-level quality, surpassing ISBI and BIBM manuscripts.
Conclusion: Medical AI Scientist demonstrates potential for leveraging AI in autonomous scientific discovery in healthcare, showing improved traceability of research ideas and evidence-grounded manuscript generation.
Abstract: Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research is required to be grounded in medical evidence with specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to autonomous clinical research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through a clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under 3 research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.
[696] MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
Han Wang, Yifan Sun, Brian Ko, Mann Talati, Jiawen Gong, Zimeng Li, Naicheng Yu, Xucheng Yu, Wei Shen, Vedant Jolly, Huan Zhang
Main category: cs.AI
TL;DR: MonitorBench is a systematic benchmark for evaluating chain-of-thought (CoT) monitorability in LLMs, assessing when CoTs faithfully reflect decision-critical factors driving model behavior.
Details
Motivation: LLMs can generate CoTs that don't causally match their final outputs, reducing CoT monitorability. There's a lack of comprehensive open-source benchmarks to study when CoTs can reliably monitor the factors driving LLM behavior.
Method: Created MonitorBench with 1,514 test instances across 19 tasks in 7 categories, designed with decision-critical factors. Includes two stress-test settings to quantify monitorability degradation. Evaluated multiple popular LLMs with varying capabilities.
Result: CoT monitorability is higher when final responses require structural reasoning through decision-critical factors. Closed-source LLMs show lower monitorability, with negative relationship between monitorability and model capability. Both open- and closed-source LLMs can intentionally reduce monitorability under stress-tests, with drops up to 30% in tasks not requiring structural reasoning.
Conclusion: MonitorBench provides a foundation for evaluating future LLMs, studying advanced stress-test techniques, and developing new monitoring approaches to address CoT faithfulness issues.
Abstract: Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model’s behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for studying CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor. Closed-source LLMs generally show lower monitorability, and there exists a negative relationship between monitorability and model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress-tests, with monitorability dropping by up to 30% in some tasks that do not require structural reasoning over the decision-critical factors. Beyond these empirical insights, MonitorBench provides a basis for further research on evaluating future LLMs, studying advanced stress-test monitorability techniques, and developing new monitoring approaches.
[697] Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
Ziqi Miao, Haonan Jia, Lijun Li, Chen Qian, Yuan Xiong, Wenting Yan, Jing Shao
Main category: cs.AI
TL;DR: PRCO introduces a dual-role RLVR framework with role-specific rewards to improve both visual evidence extraction and reasoning in multimodal LLMs, addressing perception bottlenecks in existing approaches.
Details
Motivation: Existing RLVR approaches use outcome-driven optimization with shared rewards that blur credit assignment, improving reasoning but failing to enhance visual evidence extraction accuracy, creating a perception bottleneck.
Method: PRCO uses a dual-role framework with shared policy: Observer generates evidence captions tailored to questions, Solver predicts final answers based on captions. Observer gets utility reward from Solver’s success, Solver gets verifiable outcome rewards on final answers.
Result: Extensive experiments across eight challenging multimodal reasoning benchmarks show PRCO yields consistent improvements across model scales by over 7 points in average accuracy compared to the base model, outperforming prior open-source RL-tuned baselines.
Conclusion: PRCO effectively addresses perception bottlenecks in multimodal reasoning by separating perception and reasoning optimization with role-specific rewards, leading to significant accuracy improvements across diverse benchmarks.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver’s downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales by over 7 points on average accuracy compared to the base model, outperforming prior open-source RL-tuned baselines.
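The role-specific reward split can be sketched as follows. The exact-match check and the use of mean downstream success as the Observer's utility are illustrative assumptions; the abstract does not specify either.

```python
def solver_reward(answer: str, gold: str) -> float:
    """Verifiable outcome reward on the Solver's final answer
    (exact match is an assumed stand-in for the paper's verifier)."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def observer_reward(solver_answers: list[str], gold: str) -> float:
    """Utility reward for an Observer caption: the success rate of
    Solver rollouts conditioned on that caption, so captions that
    make the question answerable earn more credit."""
    if not solver_answers:
        return 0.0
    return sum(solver_reward(a, gold) for a in solver_answers) / len(solver_answers)
```

Because the Observer's reward is derived from the Solver's outcomes rather than shared verbatim, credit for bad visual evidence extraction lands on the Observer instead of being smeared across both roles.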
[698] The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle
Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Christensen
Main category: cs.AI
TL;DR: AIGENIE R package automates psychological scale development using LLMs for item generation and network psychometrics for validation, eliminating traditional expert-heavy processes.
Details
Motivation: Traditional psychological scale development requires extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation, making it time-consuming and resource-intensive.
Method: The AI-GENIE framework integrates LLM text generation with network psychometric methods: generates candidate items using LLMs, transforms them into embeddings, and applies multi-step reduction pipeline (Exploratory Graph Analysis, Unique Variable Analysis, bootstrap EGA) for structural validation.
Result: Package successfully automates early stages of scale development, supports multiple LLM providers, offers offline mode, and provides functions for both new item generation and analysis of existing item pools.
Conclusion: AIGENIE package enables fully in silico scale development, making psychological assessment creation more accessible and efficient while maintaining psychometric rigor.
Abstract: Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin. The AIGENIE R package implements the AI-GENIE framework (Automatic Item Generation with Network-Integrated Evaluation), which integrates large language model (LLM) text generation with network psychometric methods to automate the early stages of this process. The package generates candidate item pools using LLMs, transforms them into high-dimensional embeddings, and applies a multi-step reduction pipeline – Exploratory Graph Analysis (EGA), Unique Variable Analysis (UVA), and bootstrap EGA – to produce structurally validated item pools entirely in silico. This tutorial introduces the package across six parts: installation and setup, understanding Application Programming Interfaces (APIs), text generation, item generation, the AIGENIE function, and the GENIE function. Two running examples illustrate the package’s use: the Big Five personality model (a well-established construct) and AI Anxiety (an emerging construct). The package supports multiple LLM providers (OpenAI, Anthropic, Groq, HuggingFace, and local models), offers a fully offline mode with no external API calls, and provides the GENIE() function for researchers who wish to apply the psychometric reduction pipeline to existing item pools regardless of their origin. The AIGENIE package is freely available on R-universe at https://laralee.r-universe.dev/AIGENIE.
[699] Not Search, But Scan: Benchmarking MLLMs on Scan-Oriented Academic Paper Reasoning
Rongjin Li, Zichen Tang, Xianghe Wang, Xinyi Hu, Zhengyu Wang, Zhengyu Lu, Yiling Huang, Jiayuan Chen, Weisheng Tan, Jiacheng Liu, Zhongjun Yang, Haihong E
Main category: cs.AI
TL;DR: ScholScan is a new benchmark for academic paper reasoning that introduces scan-oriented tasks requiring full-document understanding and cross-checking to identify consistency issues, unlike traditional search-oriented approaches.
Details
Motivation: Current MLLMs perform well at literature retrieval and certain reasoning tasks but remain far from autonomous research because existing academic paper reasoning is confined to search-oriented paradigms with relevance retrieval, which cannot support researcher-style full-document understanding, reasoning, and verification.
Method: Proposed ScholScan benchmark with scan-oriented task setting that asks models to read and cross-check entire papers to identify consistency issues. Contains 1,800 annotated questions across 9 error categories, 13 natural-science domains, and 715 papers, with detailed annotations for evidence localization and reasoning traces.
Result: Evaluated 15 models across 24 input configurations. Found that retrieval-augmented generation (RAG) methods yield no significant improvements, revealing systematic deficiencies of current MLLMs on scan-oriented tasks. The benchmark highlights the challenge of full-document reasoning.
Conclusion: ScholScan represents a new paradigm shift from search-oriented to scan-oriented academic paper reasoning, exposing limitations of current MLLMs and providing a challenging benchmark for advancing autonomous research capabilities.
Abstract: With the rapid progress of multimodal large language models (MLLMs), AI already performs well at literature retrieval and certain reasoning tasks, serving as a capable assistant to human researchers, yet it remains far from autonomous research. The fundamental reason is that current work on academic paper reasoning is largely confined to a search-oriented paradigm centered on pre-specified targets, with reasoning grounded in relevance retrieval, which struggles to support researcher-style full-document understanding, reasoning, and verification. To bridge this gap, we propose ScholScan, a new benchmark for academic paper reasoning. ScholScan introduces a scan-oriented task setting that asks models to read and cross-check entire papers like human researchers, scanning the document to identify consistency issues. The benchmark comprises 1,800 carefully annotated questions drawn from nine error categories across 13 natural-science domains and 715 papers, and provides detailed annotations for evidence localization and reasoning traces, together with a unified evaluation protocol. We assessed 15 models across 24 input configurations and conducted a fine-grained analysis of MLLM capabilities for all error categories. Across the board, retrieval-augmented generation (RAG) methods yield no significant improvements, revealing systematic deficiencies of current MLLMs on scan-oriented tasks and underscoring the challenge posed by ScholScan. We expect ScholScan to serve as a leading, representative benchmark for the scan-oriented task paradigm.
[700] Dynamic Dual-Granularity Skill Bank for Agentic RL
Songjun Tu, Chengdong Xu, Qichao Zhang, Yaocheng Zhang, Xiangyuan Lan, Linjing Li, Dongbin Zhao
Main category: cs.AI
TL;DR: D2Skill introduces a dynamic dual-granularity skill bank for agentic RL that organizes reusable experience into task-level and step-level skills, improving success rates by 10-20 points in text-based environments.
Details
Motivation: Current skill-based RL methods mainly provide trajectory-level guidance and lack principled mechanisms for maintaining evolving skill memories, limiting their ability to leverage reusable experience effectively.
Method: Proposes D2Skill with task skills for high-level guidance and step skills for fine-grained decision support. Uses paired baseline and skill-injected rollouts under the same policy, with performance gaps creating hindsight utility signals for skill updating and policy optimization. Continuously expands skill bank through reflection and maintains it with utility-aware retrieval and pruning.
Result: Experiments on ALFWorld and WebShop with Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 show consistent 10-20 point success rate improvements over skill-free baselines. Ablations confirm both dual-granularity modeling and dynamic maintenance are critical.
Conclusion: D2Skill effectively organizes reusable experience into a dynamic skill bank that improves agent performance, with learned skills showing high utility, transferability, and modest training overhead.
Abstract: Agentic reinforcement learning (RL) can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance and often lack principled mechanisms for maintaining an evolving skill memory. We propose D2Skill, a dynamic dual-granularity skill bank for agentic RL that organizes reusable experience into task skills for high-level guidance and step skills for fine-grained decision support and error correction. D2Skill jointly trains the policy and skill bank through paired baseline and skill-injected rollouts under the same policy, using their performance gap to derive hindsight utility signals for both skill updating and policy optimization. Built entirely from training-time experience, the skill bank is continuously expanded through reflection and maintained with utility-aware retrieval and pruning. Experiments on ALFWorld and WebShop with Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 show that D2Skill consistently improves success rates over skill-free baselines by 10-20 points. Further ablations and analyses show that both dual-granularity skill modeling and dynamic skill maintenance are critical to these gains, while the learned skills exhibit higher utility, transfer across evaluation settings, and introduce only modest training overhead.
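The hindsight utility signal from paired rollouts can be sketched as a mean performance gap, with utility-aware pruning on top. The mean aggregation and the zero threshold are assumptions for illustration, not the paper's actual formulas.

```python
def skill_utility(baseline_returns: list[float],
                  injected_returns: list[float]) -> float:
    """Hindsight utility of one skill: mean return of skill-injected
    rollouts minus mean return of baseline rollouts, both sampled
    under the same policy."""
    return (sum(injected_returns) / len(injected_returns)
            - sum(baseline_returns) / len(baseline_returns))

def prune(skill_bank: dict[str, float], min_utility: float = 0.0) -> dict[str, float]:
    """Utility-aware maintenance: keep only skills whose hindsight
    utility exceeds the threshold (a sketch of the pruning idea)."""
    return {name: u for name, u in skill_bank.items() if u > min_utility}
```

A positive gap means injecting the skill actually helped the current policy, so the same scalar can both rank skills for retrieval and gate them out of the bank.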
[701] L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search
Ziqi Wang, Boqin Yuan
Main category: cs.AI
TL;DR: L-MARS is a multi-agent retrieval framework for legal QA that decomposes queries, performs agentic web search, verifies evidence, and synthesizes cited answers, showing dramatic improvements on tasks requiring current legal information.
Details
Motivation: Existing legal QA benchmarks test either closed-book reasoning or retrieval over fixed corpora, but neither captures scenarios requiring current legal information that post-dates model training data.
Method: Multi-agent framework that decomposes queries into structured sub-problems, retrieves evidence via agentic web search, filters results through a verification agent, and synthesizes cited answers.
Result: Achieves 96.0% accuracy on LegalSearchQA (50 questions across five legal domains), a 38.0% improvement over zero-shot performance; on Bar Exam QA, retrieval provides negligible gains (+0.7 percentage points).
Conclusion: Agentic retrieval dramatically improves legal QA when tasks require up-to-date factual knowledge, but benefits are benchmark-dependent, highlighting the need for retrieval-focused evaluation.
Abstract: We present L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a multi-agent retrieval framework for grounded legal question answering that decomposes queries into structured sub-problems, retrieves evidence via agentic web search, filters results through a verification agent, and synthesizes cited answers. Existing legal QA benchmarks test either closed-book reasoning or retrieval over fixed corpora, but neither captures scenarios requiring current legal information. We introduce LegalSearchQA, a 50-question benchmark across five legal domains whose answers depend on recent developments that post-date model training data. L-MARS achieves 96.0% accuracy on LegalSearchQA, a 38.0% improvement over zero-shot performance (58.0%), while chain-of-thought prompting degrades performance to 30.0%. On Bar Exam QA (Zheng et al., 2025), a reasoning-focused benchmark of 594 bar examination questions, retrieval provides negligible gains (+0.7 percentage points), consistent with prior findings. These results show that agentic retrieval dramatically improves legal QA when tasks require up-to-date factual knowledge, but the benefit is benchmark-dependent, underscoring the need for retrieval-focused evaluation. Code and data are available at: https://github.com/boqiny/L-MARS
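The four-stage workflow described above (decompose, agentic search, verification filter, cited synthesis) can be sketched as a skeleton in which each agent is a hypothetical callable stand-in, not the paper's actual agents:

```python
def l_mars_answer(question, decompose, search, verify, synthesize):
    """Skeleton of the described pipeline: split the query into
    sub-problems, search evidence for each, keep only hits that pass
    the verification agent, then synthesize a cited answer."""
    evidence = []
    for sub in decompose(question):
        for hit in search(sub):
            if verify(sub, hit):          # verification agent as a filter
                evidence.append(hit)
    return synthesize(question, evidence)
```

The verification step sitting between retrieval and synthesis is the structural difference from plain RAG: unverified hits never reach the answer-writing stage.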
[702] Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking
Jinyi Han, Ying Huang, Ying Liao, Zishang Jiang, Xikun Lu, Haiquan Zhao, Xinyi Wang, Guanghao Zhou, Sihang Jiang, Jiaqing Liang, Weikang Zhou, Zeye Sun, Fei Yu, Yanghua Xiao
Main category: cs.AI
TL;DR: JET trains Large Reasoning Models to proactively terminate unnecessary reasoning steps, improving efficiency without sacrificing accuracy by using trajectory truncation and quality-controlled length rewards.
Details
Motivation: Large Reasoning Models incur substantial computational costs due to deep reasoning, and existing reinforcement learning methods struggle to construct short reasoning paths during rollout, limiting effective learning.
Method: JET trains models to proactively terminate unnecessary reasoning by performing trajectory truncation during rollout to expose models to short, distributionally consistent reasoning paths, and uses a quality-controlled length reward to encourage concise reasoning while maintaining correctness.
Result: JET significantly improves reasoning efficiency without sacrificing accuracy. DeepSeek-Distill-Qwen-1.5B achieved a 4.6% accuracy gain while reducing output length by 46.3% on the Olympiad benchmark.
Conclusion: JET enables efficient reasoning in Large Reasoning Models by training them to proactively terminate unnecessary reasoning steps, achieving substantial efficiency gains while maintaining or improving accuracy.
Abstract: Large Reasoning Models (LRMs) have achieved impressive performance on challenging tasks, yet their deep reasoning often incurs substantial computational costs. To achieve efficient reasoning, existing reinforcement learning methods still struggle to construct short reasoning paths during the rollout stage, limiting effective learning. Inspired by Evidence Accumulation Models, we find that LRMs have accumulated sufficient information early in reasoning, making further reasoning steps redundant. Based on this insight, we propose Just-Enough Thinking (JET), which trains models to proactively terminate unnecessary reasoning. JET performs trajectory truncation during rollout to expose the model to short, distributionally consistent reasoning paths. In addition, it uses a quality-controlled length reward to better encourage concise reasoning while maintaining correctness. Extensive experiments demonstrate that JET significantly improves reasoning efficiency without sacrificing accuracy. Notably, DeepSeek-Distill-Qwen-1.5B achieves a 4.6% accuracy gain while reducing output length by 46.3% on the Olympiad benchmark. Our code is available on GitHub.
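One way a quality-controlled length reward and rollout-time truncation might look is sketched below. The correctness-gated form, the `alpha` weight, and the `keep_frac` helper are illustrative assumptions; the paper does not publish these exact functions.

```python
def jet_reward(correct: bool, length: int, max_len: int,
               alpha: float = 0.5) -> float:
    """Quality-controlled length reward (sketch): correctness gates
    the reward entirely, and shorter correct traces earn a bonus of
    up to `alpha` on top of the base reward of 1."""
    if not correct:
        return 0.0
    length_bonus = alpha * (1.0 - min(length, max_len) / max_len)
    return 1.0 + length_bonus

def truncate_trajectory(tokens: list, keep_frac: float) -> list:
    """Rollout-time trajectory truncation (hypothetical helper):
    keep a prefix of the reasoning tokens so the model is exposed
    to short, distributionally consistent paths."""
    k = max(1, int(len(tokens) * keep_frac))
    return tokens[:k]
```

Gating on correctness first means length is never rewarded for its own sake, which is the "maintaining correctness" constraint the abstract emphasizes.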
[703] Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks
Yiliang Song, Hongjun An, Jiangan Chen, Xuanchen Yan, Huan Song, Jiawei Shao, Xuelong Li
Main category: cs.AI
TL;DR: Paper proposes audit framework to test contamination sensitivity in LLM benchmarks, finding that noisy/perturbed benchmark conditions often outperform clean baselines, suggesting benchmark scores may not reflect genuine capability.
Details
Motivation: Current LLM evaluation relies heavily on benchmark scores, but these may conflate exam-oriented competence with genuine generalization due to potential contamination and semantic leakage in training data.
Method: Proposes audit framework using router-worker setup: compares clean-control condition with noisy conditions where benchmark problems are systematically deleted, rewritten, and perturbed before evaluation.
Result: Across multiple models, widespread but heterogeneous above-baseline gains under noisy conditions, indicating benchmark-related cues can reactivate contamination-related memory rather than testing genuine capability.
Conclusion: Benchmark scores may carry different confidence levels; evaluation should be supplemented with explicit audits of contamination sensitivity rather than rejecting benchmarks altogether.
Abstract: Public benchmarks increasingly govern how large language models (LLMs) are ranked, selected, and deployed. We frame this benchmark-centered regime as Silicon Bureaucracy and AI Test-Oriented Education, and argue that it rests on a fragile assumption: that benchmark scores directly reflect genuine generalization. In practice, however, such scores may conflate exam-oriented competence with principled capability, especially when contamination and semantic leakage are difficult to exclude from modern training pipelines. We therefore propose an audit framework for analyzing contamination sensitivity and score confidence in LLM benchmarks. Using a router-worker setup, we compare a clean-control condition with noisy conditions in which benchmark problems are systematically deleted, rewritten, and perturbed before being passed downstream. For a genuinely clean benchmark, noisy conditions should not consistently outperform the clean-control baseline. Yet across multiple models, we find widespread but heterogeneous above-baseline gains under noisy conditions, indicating that benchmark-related cues may be reassembled and can reactivate contamination-related memory. These results suggest that similar benchmark scores may carry substantially different levels of confidence. Rather than rejecting benchmarks altogether, we argue that benchmark-based evaluation should be supplemented with explicit audits of contamination sensitivity and score confidence.
[704] Retrieving Classes of Causal Orders with Inconsistent Knowledge Bases
Federico Baldo, Simon Ferreira, Charles K. Assaad
Main category: cs.AI
TL;DR: LLM-based causal discovery method using consistency scores to derive causal order abstractions from text metadata, addressing LLM hallucinations and indirect relationship ambiguity
Details
Motivation: Traditional causal discovery methods rely on strong, untestable assumptions, making them unreliable. LLMs offer promise for extracting causal knowledge from text but suffer from hallucinations and cannot reliably distinguish direct from indirect relationships.
Method: Proposes using LLM consistency scores as a reliability proxy. Computes pairwise consistency scores between variables, constructs a semi-complete partially directed graph, and identifies a maximally oriented partially directed acyclic graph and optimal acyclic tournaments maximizing consistency.
Result: Evaluated on causal DAGs from epidemiology and public health literature. Method effectively recovers correct causal order, providing reliable LLM-assisted causal framework.
Conclusion: Focusing on causal orders rather than full DAGs is more practical for LLMs. Proposed approach successfully leverages LLMs for causal discovery while addressing their limitations through consistency-based reliability measures.
Abstract: Traditional causal discovery methods often depend on strong, untestable assumptions, making them unreliable in real-world applications. In this context, Large Language Models (LLMs) have emerged as a promising alternative for extracting causal knowledge from text-based metadata, effectively consolidating domain expertise. However, LLMs are prone to hallucinations, necessitating strategies that account for these limitations. One effective approach is to use a consistency measure as a proxy of reliability. Moreover, LLMs do not clearly distinguish direct from indirect causal relationships, complicating the discovery of causal Directed Acyclic Graphs (DAGs), which are often sparse. This ambiguity is evident in the way informal sentences are formulated in various domains. For this reason, focusing on causal orders provides a more practical and direct task for LLMs. We propose a new method for deriving abstractions of causal orders that maximizes a consistency score obtained from an LLM. Our approach begins by computing pairwise consistency scores between variables, from which we construct a semi-complete partially directed graph that consolidates these scores into an abstraction. Using this structure, we identify both a maximally oriented partially directed acyclic graph and an optimal set of acyclic tournaments that maximize consistency across all configurations. We further demonstrate how both the abstraction and the class of causal orders can be used to estimate causal effects. We evaluate our method on a wide set of causal DAGs extracted from scientific literature in epidemiology and public health. Our results show that the proposed approach can effectively recover the correct causal order, providing a reliable and practical LLM-assisted causal framework.
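The consistency-as-reliability idea can be illustrated with a toy sketch: repeated LLM queries vote on each pairwise causal direction, the vote fraction serves as a consistency score, and the best causal order is the acyclic tournament whose orientations maximize total consistency. The variable names and vote counts below are invented, and the brute-force search over permutations merely stands in for the paper's graph-based optimization.

```python
from itertools import permutations

# Toy stand-in for repeated LLM queries: votes[(a, b)] = number of
# times (out of 10) the model answered "a causes b".
votes = {
    ("smoking", "tar"): 9, ("tar", "smoking"): 1,
    ("tar", "cancer"): 8, ("cancer", "tar"): 2,
    ("smoking", "cancer"): 7, ("cancer", "smoking"): 3,
}

def consistency(a, b):
    """Fraction of votes supporting a -> b (a crude reliability proxy)."""
    total = votes[(a, b)] + votes[(b, a)]
    return votes[(a, b)] / total

def best_causal_order(variables):
    """Score every total order (acyclic tournament) by the summed
    consistency of the pairwise orientations it implies; return the best."""
    def score(order):
        return sum(consistency(order[i], order[j])
                   for i in range(len(order))
                   for j in range(i + 1, len(order)))
    return max(permutations(variables), key=score)

order = best_causal_order(["smoking", "tar", "cancer"])
```

With these toy votes the order smoking → tar → cancer wins (score 0.9 + 0.7 + 0.8); the exhaustive search is exponential, which is exactly why the paper works with graph abstractions instead.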
[705] Synergizing Large Language Models and Task-specific Models for Time Series Anomaly Detection
Feiyi Chen, Leilei Zhang, Guansong Pang, Roger Zimmermann, Shuiguang Deng
Main category: cs.AI
TL;DR: CoLLaTe is a framework that enables collaboration between LLMs (for expert knowledge) and task-specific small models (for pattern extraction) in anomaly detection, inspired by the human nervous system.
Details
Motivation: LLMs can incorporate expert knowledge from documents but lack task-specific pattern recognition, while small models excel at extracting patterns from training data but lack broader knowledge. The human nervous system's division of labor (brain for knowledge, peripheral system for reflexes) inspires a collaborative approach.
Method: Proposes CoLLaTe framework with two key components: 1) model alignment module to address expression domain misalignment between LLMs and task-specific models, and 2) collaborative loss function to mitigate error accumulation from both models’ predictions.
Result: Theoretical analysis and experimental validation show CoLLaTe effectively addresses the collaboration challenges and achieves better performance than both LLM-based and task-specific models alone.
Conclusion: Collaboration between LLMs and task-specific models, properly aligned with appropriate loss functions, can leverage complementary strengths for superior anomaly detection performance.
Abstract: In anomaly detection, methods based on large language models (LLMs) can incorporate expert knowledge by reading professional documents, while task-specific small models excel at extracting normal data patterns and detecting value fluctuations from training data of target applications. Inspired by the human nervous system, where the brain stores expert knowledge and the peripheral nervous system and spinal cord handle specific tasks like withdrawal and knee-jerk reflexes, we propose CoLLaTe, a framework designed to facilitate collaboration between LLMs and task-specific models, leveraging the strengths of both models for anomaly detection. In particular, we first formulate the collaboration process and identify two key challenges in the collaboration: (1) the misalignment between the expression domains of the LLMs and task-specific small models, and (2) error accumulation arising from the predictions of both models. To address these challenges, we then introduce two key components in CoLLaTe: a model alignment module and a collaborative loss function. Through theoretical analysis and experimental validation, we demonstrate that these components effectively mitigate the identified challenges and achieve better performance than both LLM-based and task-specific models.
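The two challenges named above, expression-domain misalignment and error accumulation, can be sketched with a toy score-fusion example: the small model emits unbounded anomaly scores, the LLM emits probabilities, and fusion only makes sense after alignment. CoLLaTe's alignment module and collaborative loss are learned; the min-max normalization and fixed convex weight here are illustrative placeholders.

```python
def align(scores):
    """Min-max normalize task-specific model scores into [0, 1] so they
    share an expression domain with LLM-produced probabilities
    (a stand-in for a learned alignment module)."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.5 for s in scores]

def fuse(llm_probs, small_scores, w=0.5):
    """Convex combination of the two aligned score streams; bounding the
    weight limits how far either model's error can propagate."""
    aligned = align(small_scores)
    return [w * p + (1 - w) * a for p, a in zip(llm_probs, aligned)]

# Third point: both models agree it is anomalous, so the fused score is high.
fused = fuse([0.1, 0.2, 0.9], [3.0, 2.5, 10.0])
```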
[706] Synthesis of timeline-based planning strategies avoiding determinization
Dario Della Monica, Angelo Montanari, Pietro Sala
Main category: cs.AI
TL;DR: A fragment of qualitative timeline-based planning that maps to deterministic finite automata for strategy synthesis, identifying maximal subset of Allen’s relations fitting this deterministic fragment.
Details
Motivation: Qualitative timeline-based planning uses nondeterministic finite automata for plan existence, but these cannot directly synthesize planning strategies because a costly determinization step is needed.
Method: Identify a fragment of qualitative timeline-based planning whose plan-existence problem can be directly mapped to the nonemptiness problem of deterministic finite automata, enabling strategy synthesis without determinization.
Result: Identified a deterministic fragment of timeline-based planning and a maximal subset of Allen’s relations that fits into this deterministic fragment, enabling direct strategy synthesis.
Conclusion: The paper provides a deterministic fragment of qualitative timeline-based planning that enables direct strategy synthesis via deterministic finite automata, overcoming the need for costly determinization.
Abstract: Qualitative timeline-based planning models domains as sets of independent, but interacting, components whose behaviors over time, the timelines, are governed by sets of qualitative temporal constraints (ordering relations), called synchronization rules. Its plan-existence problem has been shown to be PSPACE-complete; in particular, PSPACE-membership has been proved via reduction to the nonemptiness problem for nondeterministic finite automata. However, nondeterministic automata cannot be directly used to synthesize planning strategies as a costly determinization step is needed. In this paper, we identify a fragment of qualitative timeline-based planning whose plan-existence problem can be directly mapped into the nonemptiness problem of deterministic finite automata, which can then synthesize strategies. In addition, we identify a maximal subset of Allen’s relations that fits into such a deterministic fragment.
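The payoff of the deterministic fragment is concrete: DFA nonemptiness is just reachability of an accepting state, and the breadth-first search that decides it also yields a witness word, i.e., the seed of a strategy, with no determinization step. The three-state automaton below is a made-up example, not one derived from an actual timeline-based domain.

```python
from collections import deque

def dfa_nonempty(start, accepting, delta):
    """Nonemptiness of a DFA = some accepting state is reachable from the
    start state. For the deterministic planning fragment this doubles as a
    plan-existence check; recording predecessors yields a witness plan."""
    parent, queue = {start: None}, deque([start])
    while queue:
        q = queue.popleft()
        if q in accepting:                    # reconstruct a witness word
            word = []
            while parent[q] is not None:
                q, sym = parent[q]
                word.append(sym)
            return list(reversed(word))
        for (state, sym), nxt in delta.items():
            if state == q and nxt not in parent:
                parent[nxt] = (q, sym)
                queue.append(nxt)
    return None  # empty language: no plan exists

# Hypothetical automaton over actions {a, b}: s0 --a--> s1 --b--> s2.
delta = {("s0", "a"): "s1", ("s0", "b"): "s0", ("s1", "b"): "s2"}
plan = dfa_nonempty("s0", {"s2"}, delta)
```

For a nondeterministic automaton the same search only certifies plan existence; extracting a strategy would first require the exponential subset construction the paper avoids.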
[707] Inspire or Predict? Exploring New Paradigms in Assisting Classical Planners with Large Language Models
Wenkai Yu, Jianhang Tang, Yang Zhang, Yixiong Feng, Celimuge Wu, Kebing Jin, Hankz Hankui Zhuo
Main category: cs.AI
TL;DR: LLM-assisted planning framework that decomposes large planning problems into simpler sub-tasks and uses LLMs in two ways: LLM4Inspire for general heuristic guidance and LLM4Predict for domain-specific knowledge inference.
Details
Motivation: Addressing state-space explosion in large-scale planning problems by leveraging LLMs to prune search space, while overcoming prior limitations of insufficient domain-specific knowledge integration.
Method: Proposes problem decomposition with dependency construction and conflict detection, then explores two LLM paradigms: LLM4Inspire (general knowledge guidance) and LLM4Predict (domain-specific knowledge inference for intermediate conditions).
Result: Empirical validation across multiple domains shows effective search space partition for large-scale planning problems, with LLM4Predict (domain-specific knowledge) outperforming LLM4Inspire (general knowledge) in locating feasible solutions.
Conclusion: LLMs can effectively assist in planning by pruning search space, with domain-specific knowledge infusion (LLM4Predict) showing particular promise over general knowledge approaches.
Abstract: Addressing large-scale planning problems has become one of the central challenges in the planning community, deriving from the state-space explosion caused by growing objects and actions. Recently, researchers have explored the effectiveness of leveraging Large Language Models (LLMs) to generate helpful actions and states to prune the search space. However, prior works have largely overlooked integrating LLMs with domain-specific knowledge to ensure valid plans. In this paper, we propose a novel LLM-assisted planner integrated with problem decomposition, which first decomposes large planning problems into multiple simpler sub-tasks with dependency construction and conflict detection. Then we explore two novel paradigms to utilize LLMs, i.e., LLM4Inspire and LLM4Predict, to assist problem decomposition, where LLM4Inspire provides heuristic guidance according to general knowledge and LLM4Predict employs domain-specific knowledge to infer intermediate conditions. We empirically validate the effectiveness of our planner across multiple domains, demonstrating the ability of search space partition when solving large-scale planning problems. The experimental results show that LLMs effectively locate feasible solutions when pruning the search space, where infusing domain-specific knowledge into LLMs, i.e., LLM4Predict, holds particular promise compared with LLM4Inspire, which offers general knowledge within LLMs.
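The decomposition step, building dependencies between sub-tasks and detecting conflicts, can be sketched with a topological sort in which a cycle among sub-tasks surfaces as a conflict. The sub-task names are invented, and the paper's conflict detection is richer than a pure cycle check.

```python
def order_subtasks(deps):
    """Kahn-style topological sort over sub-task dependencies; failure to
    place every sub-task signals a cyclic conflict that must be repaired
    before planning. `deps` maps each sub-task to the sub-tasks it
    depends on."""
    indeg = {t: len(d) for t, d in deps.items()}
    order = [t for t, n in indeg.items() if n == 0]
    for t in order:                         # list grows while we iterate
        for u, d in deps.items():
            if t in d:
                indeg[u] -= 1
                if indeg[u] == 0:
                    order.append(u)
    if len(order) != len(deps):
        raise ValueError("conflict: cyclic dependency among sub-tasks")
    return order

plan = order_subtasks({"load": set(), "stack": {"load"}, "deliver": {"stack"}})
```

In the paper the resulting sub-tasks are then handed to the LLM, either for heuristic guidance (LLM4Inspire) or to predict the intermediate conditions linking them (LLM4Predict).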
[708] Searching Meta Reasoning Skeleton to Guide LLM Reasoning
Ziying Zhang, Yaqing Wang, Quanming Yao
Main category: cs.AI
TL;DR: AutoMR: A framework that automatically searches for query-aware meta reasoning skeletons using directed acyclic graphs and dynamic sampling, improving LLM reasoning performance.
Details
Motivation: Previous meta reasoning skeletons are manually designed with fixed structures, limiting their ability to adapt to specific query requirements and capture complex logical dependencies between reasoning steps.
Method: Represent meta reasoning skeletons as directed acyclic graphs (DAGs) to unify previous approaches and model logical dependencies. Construct a search space based on DAG representation, formulate the search problem, and design a dynamic skeleton sampling algorithm that expands skeletons along with reasoning context at inference time.
Result: Experimental results on extensive benchmark datasets show that AutoMR achieves better reasoning performance than previous works broadly.
Conclusion: AutoMR enables efficient query-aware skeleton search by automatically adapting meta reasoning skeletons to specific queries and evolving reasoning contexts, outperforming manually designed approaches.
Abstract: Meta reasoning behaviors work as a skeleton to guide large language model (LLM) reasoning, thus helping to improve reasoning performance. However, prior research implements meta reasoning skeletons with manually designed structures, limiting the ability to adapt to query-specific requirements and capture intricate logical dependencies among reasoning steps. To deal with these challenges, we represent meta reasoning skeletons with directed acyclic graphs (DAGs) to unify skeletons proposed in prior works and model intricate logical dependencies. Then we propose AutoMR, a framework that searches for query-aware meta reasoning skeletons automatically, inspired by automated machine learning (AutoML). Specifically, we construct a search space based on the DAG representation of skeletons and then formulate the search problem. We design a dynamic skeleton sampling algorithm that expands the meta reasoning skeleton along with the reasoning context at inference time. This algorithm can derive any meta reasoning skeleton in the search space efficiently and adapt the skeleton to the evolving base reasoning context, thus enabling efficient query-aware skeleton search. We conduct experiments on extensive benchmark datasets. Experimental results show that AutoMR broadly achieves better reasoning performance than previous works.
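A minimal sketch of dynamic skeleton sampling: meta reasoning steps form a DAG-shaped search space, and a skeleton is grown one node at a time at inference. The step names and successor sets below are invented, and AutoMR conditions each expansion on the evolving reasoning context rather than choosing uniformly as this toy does.

```python
import random

# Hypothetical meta reasoning steps and which steps may follow which,
# forming the DAG-shaped search space.
SUCCESSORS = {
    "decompose": ["plan", "verify"],
    "plan": ["solve"],
    "solve": ["verify"],
    "verify": [],
}

def sample_skeleton(rng, start="decompose", max_steps=6):
    """Expand a skeleton node by node until a terminal step is reached
    (uniform choice here; context-aware in the actual framework)."""
    node, skeleton = start, [start]
    while SUCCESSORS[node] and len(skeleton) < max_steps:
        node = rng.choice(SUCCESSORS[node])
        skeleton.append(node)
    return skeleton

skeleton = sample_skeleton(random.Random(0))
```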
[709] GammaZero: Learning To Guide POMDP Belief Space Search With Graph Representations
Rajesh Mangannavar, Prasad Tadepalli
Main category: cs.AI
TL;DR: GammaZero: Uncertainty-aware graph representation framework for POMDP planning that generalizes across problem sizes using unified graph-based belief representations and graph neural networks.
Details
Motivation: Existing approaches for POMDP planning require domain or problem size specific neural architectures, limiting generalization. The authors aim to create a framework that can learn from small problems and generalize to larger instances without retraining.
Method: Transforms belief states into uncertainty-aware graphs where structural patterns learned on small problems transfer to larger instances. Uses graph neural network with decoder architecture to learn value functions and policies from expert demonstrations on tractable problems, then applies learned heuristics to guide Monte Carlo tree search on larger problems.
Result: GammaZero achieves comparable performance to BetaZero when trained and tested on same-sized problems, while enabling zero-shot generalization to problems 2-6x larger than those seen during training on standard POMDP benchmarks.
Conclusion: The uncertainty-aware graph representation framework enables effective generalization across problem sizes in POMDP planning, demonstrating that structural patterns learned on small problems can successfully transfer to larger instances.
Abstract: We introduce an uncertainty-aware graph representation framework for learning to guide planning in Partially Observable Markov Decision Processes (POMDPs). Unlike existing approaches that require domain or problem size specific neural architectures, GammaZero leverages a unified graph-based belief representation that enables generalization across problem sizes within a domain. Our key insight is that belief states can be systematically transformed into uncertainty-aware graphs where structural patterns learned on small problems transfer to larger instances. We employ a graph neural network with a decoder architecture to learn value functions and policies from expert demonstrations on computationally tractable problems, then apply these learned heuristics to guide Monte Carlo tree search on larger problems. Experimental results on standard POMDP benchmarks demonstrate that GammaZero achieves comparable performance to BetaZero when trained and tested on the same-sized problems, while enabling zero-shot generalization to problems 2-6x larger than those seen during training.
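The key move, turning a belief state into an uncertainty-aware graph whose size tracks the problem instance, can be sketched for a particle-based belief: one node per distinct hypothesized state, weighted by belief probability, with an uncertainty feature per node. GammaZero's actual node and edge schema is domain-specific; this is only a toy.

```python
from collections import Counter
import math

def belief_to_graph(particles):
    """Convert a particle belief (list of hypothesized states) into graph
    inputs for a GNN: per-node belief probability plus a surprise
    (negative log-probability) feature capturing uncertainty."""
    counts = Counter(particles)
    n = len(particles)
    nodes = {s: {"prob": c / n, "surprise": -math.log2(c / n)}
             for s, c in counts.items()}
    # Fully connect distinct hypotheses so the network can compare them;
    # a real encoding would use domain structure instead.
    edges = [(a, b) for a in nodes for b in nodes if a < b]
    return nodes, edges

nodes, edges = belief_to_graph(["s1", "s1", "s2", "s3"])
```

Because the representation is a graph rather than a fixed-width vector, the same GNN weights apply unchanged to beliefs over larger state spaces, which is what enables the 2-6x zero-shot generalization reported above.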
[710] Temporally Detailed Hypergraph Neural ODEs for Disease Progression Modeling
Tingsong Xiao, Yao An Lee, Zelin Xu, Yupu Zhang, Zibo Liu, Yu Huang, Jiang Bian, Jingchuan Guo, Zhe Jiang
Main category: cs.AI
TL;DR: TD-HNODE: A neural ODE framework using temporally detailed hypergraphs to model continuous-time disease progression from irregular EHR data, with applications to type 2 diabetes and cardiovascular diseases.
Details
Motivation: Existing disease progression modeling methods lack adaptability to learn from real-world EHR data or fail to capture complex continuous-time dynamics, especially for diseases like type 2 diabetes where progression varies across patients and occurs at irregular intervals.
Method: Proposes Temporally Detailed Hypergraph Neural ODE (TD-HNODE) that represents disease progression on clinically recognized trajectories as a temporally detailed hypergraph and learns continuous-time progression dynamics via neural ODE framework with learnable TD-Hypergraph Laplacian capturing interdependencies within and between progression trajectories.
Result: Experiments on two real-world clinical datasets show TD-HNODE outperforms multiple baselines in modeling progression of type 2 diabetes and related cardiovascular diseases.
Conclusion: TD-HNODE effectively addresses limitations of existing methods by capturing complex continuous-time progression dynamics from irregular EHR data through a novel hypergraph neural ODE approach.
Abstract: Disease progression modeling aims to characterize and predict how a patient’s disease complications worsen over time based on longitudinal electronic health records (EHRs). For diseases such as type 2 diabetes, accurate progression modeling can enhance patient sub-phenotyping and inform effective and timely interventions. However, the problem is challenging due to the need to learn continuous-time progression dynamics from irregularly sampled clinical events amid patient heterogeneity (e.g., different progression rates and pathways). Existing mechanistic and data-driven methods either lack adaptability to learn from real-world data or fail to capture complex continuous-time dynamics on progression trajectories. To address these limitations, we propose Temporally Detailed Hypergraph Neural Ordinary Differential Equation (TD-HNODE), which represents disease progression on clinically recognized trajectories as a temporally detailed hypergraph and learns the continuous-time progression dynamics via a neural ODE framework. TD-HNODE contains a learnable TD-Hypergraph Laplacian that captures the interdependency of disease complication markers within both intra- and inter-progression trajectories. Experiments on two real-world clinical datasets demonstrate that TD-HNODE outperforms multiple baselines in modeling the progression of type 2 diabetes and related cardiovascular diseases.
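The neural ODE ingredient that handles irregular sampling can be shown in miniature: integrate dh/dt = -L h between whatever observation times exist, so the step sizes simply follow the data. Here L is a fixed toy graph Laplacian coupling two markers, not the learnable TD-Hypergraph Laplacian of the paper, and the Euler integrator stands in for a proper ODE solver.

```python
def integrate(h, laplacian, timestamps):
    """Euler-step the linear ODE dh/dt = -L h between irregularly spaced
    observation times; dt varies per step, mirroring irregular EHR data."""
    states = [h[:]]
    for t0, t1 in zip(timestamps, timestamps[1:]):
        dt = t1 - t0
        h = [hi - dt * sum(L_ij * hj for L_ij, hj in zip(row, h))
             for hi, row in zip(h, laplacian)]
        states.append(h[:])
    return states

# Two coupled markers; path-graph Laplacian [[1, -1], [-1, 1]] makes
# their values diffuse toward each other over time.
states = integrate([1.0, 0.0], [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.3, 0.7])
```

Laplacian dynamics conserve the total, so the marker values drift toward each other while their sum stays at 1, one simple way coupled trajectories can influence one another in continuous time.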
[711] ShortcutBreaker: Low-Rank Noisy Bottleneck and Frequency Filtering Block for Multi-Class Unsupervised Anomaly Detection
Peng Tang, Xiaobin Hu, Tingcheng Li, Yang Nan, Tobias Lasser, Hongwei Bran Li
Main category: cs.AI
TL;DR: ShortcutBreaker: A unified feature-reconstruction framework for multi-class unsupervised anomaly detection that prevents identity shortcuts in Transformer-based models using low-rank noisy bottleneck and global perturbation attention.
Details
Motivation: Multi-class unsupervised anomaly detection (MUAD) needs unified models to avoid training separate models for different classes. Current Transformer-based architectures suffer from identity shortcuts that copy inputs to outputs, making normal and abnormal cases harder to distinguish due to similar reconstruction errors.
Method: Two key innovations: 1) Low-rank noisy bottleneck (LRNB) that projects high-dimensional features into low-rank latent space using matrix rank inequality to prevent trivial identity reproduction; 2) Global perturbation attention leveraging ViT’s global modeling capability to prevent information shortcuts in decoders.
Result: Achieved remarkable image-level AUROC scores: 99.8% on MVTec-AD, 98.9% on ViSA, 90.6% on Real-IAD, and 87.8% on Universal Medical dataset. Consistently outperformed previous MUAD methods across different scenarios.
Conclusion: ShortcutBreaker effectively addresses identity shortcut issues in MUAD tasks through theoretical and architectural innovations, demonstrating superior performance across industrial and medical anomaly detection benchmarks.
Abstract: Multi-class unsupervised anomaly detection (MUAD) has garnered growing research interest, as it seeks to develop a unified model for anomaly detection across multiple classes, i.e., eliminating the need to train separate models for distinct objects and thereby saving substantial computational resources. Under the MUAD setting, while advanced Transformer-based architectures have brought significant performance improvements, identity shortcuts persist: they directly copy inputs to outputs, narrowing the gap in reconstruction errors between normal and abnormal cases, and thereby making the two harder to distinguish. Therefore, we propose ShortcutBreaker, a novel unified feature-reconstruction framework for MUAD tasks, featuring two key innovations to address the issue of shortcuts. First, drawing on matrix rank inequality, we design a low-rank noisy bottleneck (LRNB) to project high-dimensional features into a low-rank latent space, and theoretically demonstrate its capacity to prevent trivial identity reproduction. Second, leveraging ViT’s global modeling capability instead of merely focusing on local features, we incorporate a global perturbation attention to prevent information shortcuts in the decoders. Extensive experiments are performed on four widely used anomaly detection benchmarks, including three industrial datasets (MVTec-AD, ViSA, and Real-IAD) and one medical dataset (Universal Medical). The proposed method achieves a remarkable image-level AUROC of 99.8%, 98.9%, 90.6%, and 87.8% on these four datasets, respectively, consistently outperforming previous MUAD methods across different scenarios. Our code will be released.
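Why a low-rank bottleneck blocks the copy shortcut follows from the rank inequality the paper invokes: a map that factors through an r-dimensional space has rank at most r, so for r < d it cannot be the identity on d-dimensional features. The tiny sketch below uses fixed toy weights (not learned parameters) to show two distinct inputs collapsing through a rank-1 bottleneck.

```python
import random

def lrnb(x, w_down, w_up, rng=None, noise=0.1):
    """Low-rank noisy bottleneck sketch: project a d-dim feature down to
    rank r < d, optionally perturb, and project back. The round trip
    factors through an r-dimensional space, so it cannot realize the
    identity map."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in w_down]  # d -> r
    if rng is not None:
        z = [zi + rng.gauss(0.0, noise) for zi in z]              # perturb
    return [sum(row[i] * zi for i, zi in enumerate(z))
            for row in w_up]                                      # r -> d

w_down = [[0.5, 0.5]]          # rank-1 projection of a 2-d feature
w_up = [[1.0], [1.0]]
a = lrnb([1.0, -1.0], w_down, w_up)
b = lrnb([2.0, -2.0], w_down, w_up)          # both collapse to the same output
c = lrnb([1.0, 1.0], w_down, w_up, rng=random.Random(0))  # noisy variant
```

Inputs in the projection's null space are reconstructed identically, so reconstruction error cannot vanish for arbitrary inputs; anomalous features that leave the learned normal subspace are reconstructed poorly, restoring a usable anomaly signal.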
[712] From Questions to Queries: An AI-powered Multi-Agent Framework for Spatial Text-to-SQL
Ali Khosravi Kazazi, Zhenlong Li, M. Naser Lessani, Guido Cervone
Main category: cs.AI
TL;DR: A multi-agent framework for spatial Text-to-SQL that addresses geographic intent resolution, schema ambiguity, and spatial function selection through staged interpretation and execution-based review.
Details
Motivation: Spatial Text-to-SQL is more error-prone than general Text-to-SQL due to geographic intent resolution, schema ambiguity, geometry-bearing tables, spatial function choice, and coordinate reference system assumptions, creating barriers for non-experts working with spatial data.
Method: Multi-agent framework with staged interpretation, schema grounding, logical planning, SQL generation, and execution-based review, supported by a knowledge base with programmatic schema profiling, semantic enrichment, and embedding-based retrieval.
Result: Achieved 81.2% accuracy on KaggleDBQA (221/272 questions) and 87.7% accuracy on SpatialQueryQA (79/90 questions), compared to 76.7% without review stage, showing improved robustness for spatially sensitive queries.
Conclusion: Decomposing spatial Text-to-SQL into specialized but tightly coupled agents improves robustness, enhances access to spatial analysis, and provides a practical step toward more reliable spatial Text-to-SQL systems and autonomous GIS.
Abstract: The complexity of SQL and the spatial semantics of PostGIS create barriers for non-experts working with spatial data. Although large language models can translate natural language into SQL, spatial Text-to-SQL is more error-prone than general Text-to-SQL because it must resolve geographic intent, schema ambiguity, geometry-bearing tables and columns, spatial function choice, and coordinate reference system and measurement assumptions. We introduce a multi-agent framework that addresses these coupled challenges through staged interpretation, schema grounding, logical planning, SQL generation, and execution-based review. The framework is supported by a knowledge base with programmatic schema profiling, semantic enrichment, and embedding-based retrieval. We evaluated the framework on the non-spatial KaggleDBQA benchmark and on SpatialQueryQA, a new multi-level and coverage-oriented benchmark with diverse geometry types, workload categories, and spatial operations. On KaggleDBQA, the system reached 81.2% accuracy, 221 of 272 questions, after reviewer corrections. On SpatialQueryQA, the system achieved 87.7% accuracy, 79 of 90, compared with 76.7% without the review stage. These results show that decomposing the task into specialized but tightly coupled agents improves robustness, especially for spatially sensitive queries. The study improves access to spatial analysis and provides a practical step toward more reliable spatial Text-to-SQL systems and autonomous GIS.
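One concrete failure mode a review stage can catch is the PostGIS geometry-vs-geography pitfall: `ST_Distance` on geometry columns returns units of the coordinate system (often degrees), while casting to geography yields meters. The pipeline below is a drastically simplified stand-in for the paper's agents, with invented table and column names; only the `::geography` behavior is real PostGIS semantics.

```python
def interpret(question):
    """Toy intent-recognition agent."""
    q = question.lower()
    distance = "far" in q or "distance" in q
    return {"intent": "distance" if distance else "lookup"}

def generate_sql(plan):
    """Toy SQL-generation agent over a hypothetical parks/schools schema."""
    if plan["intent"] == "distance":
        return "SELECT ST_Distance(a.geom, b.geom) FROM parks a, schools b"
    return "SELECT name FROM parks"

def review(sql):
    """Toy execution-based reviewer: force distances into meters by
    casting geometry to geography, a common PostGIS correction."""
    if "ST_Distance" in sql and "::geography" not in sql:
        sql = sql.replace("a.geom, b.geom",
                          "a.geom::geography, b.geom::geography")
    return sql

sql = review(generate_sql(interpret("How far is each school from a park?")))
```

The measured gain from exactly this kind of correction is visible in the numbers above: 87.7% with the review stage versus 76.7% without it on SpatialQueryQA.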
[713] FlipVQA: Scaling Multi-modal Instruction Tuning via Textbook-to-Knowledge Synthesis
Zhen Hao Wong, Jingwen Deng, Yuzhao Wang, Wenkai Yu, Jihao Huang, Runming He, Chengyu Shen, Hao Liang, Wentao Zhang
Main category: cs.AI
TL;DR: Automated pipeline extracts structured QA/VQA pairs from complex textbook layouts with cross-page dependencies, creating 83K high-quality training examples at 50x cost savings.
Details
Motivation: Textbooks contain rich reasoning knowledge but complex layouts (multi-column, cross-page separation, interleaved figures) make automated extraction challenging. Existing methods either synthesize unrealistic data or rely on expensive manual annotation.
Method: FlipVQA-Miner pipeline resolves long-range logical dependencies and cross-page discontinuities in OCR-parsed documents, recovering coherent question-answer-figure associations even when answers are in separate companion volumes. Multi-stage curation transforms raw extractions into AI-ready supervision signals.
Result: Created FlipVQA-83K dataset with 83K QA/VQA pairs spanning 11 academic disciplines at 50x cost saving compared to manual annotation while maintaining high structural fidelity (F1 > 0.96). Models fine-tuned on this dataset show improved reasoning ability and cross-domain generalization.
Conclusion: Establishes scalable paradigm for human-knowledge-grounded data curation from textbooks, enabling cost-effective extraction of high-quality reasoning data for training multimodal models.
Abstract: Textbooks are among the richest repositories of human-verified reasoning knowledge, yet their complex layouts, with multi-column typesetting, cross-page question–answer separation, and interleaved figures, make automated extraction of structured QA and VQA pairs extremely challenging. Existing alternatives either synthesize data from scratch, which lacks authentic problem contexts, or rely on costly expert annotation that cannot scale. We propose $\textbf{FlipVQA-Miner}$, an automated pipeline that resolves long-range logical dependencies and cross-page discontinuities in OCR-parsed documents, recovering coherent question–answer–figure associations even when answers reside in separate companion volumes. A subsequent multi-stage curation pipeline transforms these raw extractions into AI-ready supervision signals. Using FlipVQA-Miner, we construct $\textbf{FlipVQA-83K}$, comprising 83K QA and VQA pairs spanning 11 academic disciplines, at a $\textbf{50}\times$ cost saving compared to manual annotation while maintaining high structural fidelity ($F_1 > 0.96$). Models fine-tuned on FlipVQA-83K demonstrate significantly improved reasoning ability and cross-domain generalization, establishing a scalable paradigm for human-knowledge-grounded data curation. Our dataset and the complete data generation and curation methods can be found at https://github.com/OpenDCAI/DataFlow-VQA .
[714] Pharos-ESG: A Framework for Multimodal Parsing, Contextual Narration, and Hierarchical Labeling of ESG Report
Yan Chen, Yu Zou, Jialei Zeng, Haoran You, Xiaorui Zhou, Aixi Zhong
Main category: cs.AI
TL;DR: Pharos-ESG is a multimodal framework that transforms unstructured ESG reports into structured representations using layout analysis, hierarchy modeling, and multimodal aggregation to support financial analysis.
Details
Motivation: ESG reports are crucial for financial governance but present challenges due to chaotic reading order from irregular layouts and implicit hierarchies in lengthy, weakly structured content, making large-scale understanding difficult.
Method: Unified framework with reading-order modeling based on layout flow, hierarchy-aware segmentation using table-of-contents anchors, and multi-modal aggregation pipeline that transforms visual elements into coherent natural language, enriched with ESG, GRI, and sentiment labels.
Result: Outperforms both dedicated document parsing systems and general-purpose multimodal models on annotated benchmarks. Releases Aurora-ESG, the first large-scale public dataset of ESG reports with unified structured representations and fine-grained annotations.
Conclusion: Pharos-ESG effectively addresses ESG report parsing challenges and provides valuable infrastructure for ESG integration in financial governance through both the framework and the released dataset.
Abstract: Environmental, Social, and Governance (ESG) principles are reshaping the foundations of global financial governance, transforming capital allocation architectures, regulatory frameworks, and systemic risk coordination mechanisms. However, as the core medium for assessing corporate ESG performance, the ESG reports present significant challenges for large-scale understanding, due to chaotic reading order from slide-like irregular layouts and implicit hierarchies arising from lengthy, weakly structured content. To address these challenges, we propose Pharos-ESG, a unified framework that transforms ESG reports into structured representations through multimodal parsing, contextual narration, and hierarchical labeling. It integrates a reading-order modeling module based on layout flow, hierarchy-aware segmentation guided by table-of-contents anchors, and a multi-modal aggregation pipeline that contextually transforms visual elements into coherent natural language. The framework further enriches its outputs with ESG, GRI, and sentiment labels, yielding annotations aligned with the analytical demands of financial research. Extensive experiments on annotated benchmarks demonstrate that Pharos-ESG consistently outperforms both dedicated document parsing systems and general-purpose multimodal models. In addition, we release Aurora-ESG, the first large-scale public dataset of ESG reports, spanning Mainland China, Hong Kong, and U.S. markets, featuring unified structured representations of multi-modal content, enriched with fine-grained layout and semantic annotations to better support ESG integration in financial governance and decision-making.
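The reading-order problem the framework tackles can be shown in miniature: on a multi-column page, naive top-to-bottom OCR order interleaves the columns, whereas even a crude layout-flow heuristic (assign blocks to columns by x-center, then read column by column) recovers the intended sequence. Pharos-ESG's layout-flow model is far richer than this toy, and the block geometry below is invented.

```python
def reading_order(blocks, page_width, n_cols=2):
    """Toy reading-order recovery for a multi-column page: bucket each
    block into a column by its horizontal center, then sort by
    (column, vertical position)."""
    col_w = page_width / n_cols
    def key(block):
        x, y, w, h = block["bbox"]          # (left, top, width, height)
        col = int((x + w / 2) // col_w)
        return (col, y)
    return [b["id"] for b in sorted(blocks, key=key)]

blocks = [
    {"id": "B", "bbox": (310, 40, 250, 90)},   # right column, top
    {"id": "A", "bbox": (20, 40, 250, 90)},    # left column, top
    {"id": "C", "bbox": (20, 400, 250, 90)},   # left column, bottom
]
order = reading_order(blocks, page_width=600)
```

Pure y-sorting would read A, B, C, splicing the right column into the left one; column-aware ordering reads A, C, B, keeping each column's text contiguous.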
[715] Autonomous Issue Resolver: Towards Zero-Touch Code Maintenance
Aliaksei Kaliutau
Main category: cs.AI
TL;DR: A novel approach for repository-scale Automated Program Repair using Data Transformation Graphs instead of Code Property Graphs, enabling multi-agent systems to trace logic defects through data lineage rather than control flow.
Details
Motivation: Current approaches to Automated Program Repair (APR) at repository scale use control-centric paradigms that force agents to navigate complex directory structures and irrelevant control logic, creating "Semantic Traps" in RAG systems for coding agents.
Method: Proposes a paradigm shift from Code Property Graphs (CPGs) to Data Transformation Graphs (DTG) that inverts topology by modeling data states as nodes and functions as edges. Introduces a multi-agent framework reconciling data integrity navigation with control flow logic, implemented as Autonomous Issue Resolver (AIR) with neuro-symbolic reasoning.
Result: Demonstrates good results on several SWE benchmarks, reaching an 87.1% resolution rate on the SWE-Verified benchmark. The approach resolves the “Semantic Trap” issue in RAG systems for coding agents.
Conclusion: The DTG-based approach directly addresses core limitations of current AI code-assistant tools and provides a more robust foundation for software-dependent systems through scalable logic repair and zero-touch code maintenance.
Abstract: Recent advances in Large Language Models have revolutionized function-level code generation; however, repository-scale Automated Program Repair (APR) remains a significant challenge. Current approaches typically employ a control-centric paradigm, forcing agents to navigate complex directory structures and irrelevant control logic. In this paper, we propose a paradigm shift from the standard Code Property Graphs (CPGs) to the concept of Data Transformation Graph (DTG) that inverts the topology by modeling data states as nodes and functions as edges, enabling agents to trace logic defects through data lineage rather than control flow. We introduce a multi-agent framework that reconciles data integrity navigation with control flow logic. Our theoretical analysis and case studies demonstrate that this approach resolves the “Semantic Trap” inherent in standard RAG systems in modern coding agents. We provide a comprehensive implementation in the form of Autonomous Issue Resolver (AIR), a self-improvement system for zero-touch code maintenance that utilizes neuro-symbolic reasoning and uses the DTG structure for scalable logic repair. Our approach has demonstrated good results on several SWE benchmarks, reaching a resolution rate of 87.1% on SWE-Verified benchmark. Our approach directly addresses the core limitations of current AI code-assistant tools and tackles the critical need for a more robust foundation for our increasingly software-dependent world.
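The DTG idea can be sketched with a toy graph: data states as nodes, the functions that transform them as edges, and defect localization as a backward walk over data lineage rather than control flow. A minimal sketch under our own assumptions (the class name, the three-stage pipeline, and the traversal are illustrative, not the paper's implementation):

```python
from collections import defaultdict

class DataTransformationGraph:
    """Toy DTG in the spirit of the paper: data states are nodes,
    functions are edges. Illustrative only, not the paper's code."""

    def __init__(self):
        # dst_state -> list of (function, src_state) edges producing it
        self.producers = defaultdict(list)

    def add_transform(self, fn_name, src_state, dst_state):
        self.producers[dst_state].append((fn_name, src_state))

    def lineage(self, state):
        """Walk backward from a corrupted data state to every function
        and upstream state that could have produced it."""
        seen, stack, trace = set(), [state], []
        while stack:
            s = stack.pop()
            for fn, src in self.producers.get(s, []):
                trace.append((fn, src, s))
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return trace

dtg = DataTransformationGraph()
dtg.add_transform("parse_config", "raw_text", "config")
dtg.add_transform("build_query", "config", "query")
dtg.add_transform("run_query", "query", "result")

# A defect observed in `result` is traced through data lineage:
# run_query <- build_query <- parse_config, skipping unrelated control flow.
print(dtg.lineage("result"))
```

The design point is the inverted topology: an agent asking "why is `result` wrong?" only ever visits functions on the data's provenance path.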
[716] Accelerating Scientific Discovery with Autonomous Goal-evolving Agents
Yuanqi Du, Botao Yu, Tianyu Liu, Tony Shen, Junwu Chen, Jan G. Rittig, Kunyang Sun, Yikun Zhang, Aarti Krishnan, Yu Zhang, Daniel Rosen, Rosali Pirone, Zhangde Song, Bo Zhou, Cassandra Masschelein, Yingze Wang, Haorui Wang, Haojun Jia, Chao Zhang, Hongyu Zhao, Martin Ester, Nir Hacohen, Teresa Head-Gordon, Carla P. Gomes, Huan Sun, Chenru Duan, Philippe Schwaller, Wengong Jin
Main category: cs.AI
TL;DR: SAGA introduces a bi-level LLM agent framework that automates objective function design for scientific discovery, enabling systematic exploration of objective spaces rather than using fixed proxies.
Details
Motivation: Current scientific discovery agents rely on imperfect quantitative objective functions specified by humans, which limits their effectiveness for grand scientific challenges. There's an unmet need to automate objective function design to improve discovery capabilities.
Method: SAGA uses a bi-level architecture: an outer loop of LLM agents analyzes optimization outcomes, proposes new objectives, and converts them into computable scoring functions, while an inner loop performs solution optimization under the current objectives.
Result: Demonstrated across diverse applications (antibiotics, nanobodies, DNA sequences, materials, chemical processes). Identified novel antibiotic hit with promising E. coli potency/safety profiles and three de novo PD-L1 binders in nanobody design.
Conclusion: Automating objective formulation can substantially improve scientific discovery agents’ effectiveness by enabling systematic exploration of objective spaces and trade-offs.
Abstract: There has been unprecedented interest in developing agents that expand the boundary of scientific discovery, primarily by optimizing quantitative objective functions specified by scientists. However, for grand challenges in science, these objectives may only be imperfect proxies. We argue that automating objective function design is a central, yet unmet need for scientific discovery agents. In this work, we introduce the Scientific Autonomous Goal-evolving Agent (SAGA) to address this challenge. SAGA employs a bi-level architecture in which an outer loop of LLM agents analyzes optimization outcomes, proposes new objectives, and converts them into computable scoring functions, while an inner loop performs solution optimization under the current objectives. This bi-level design enables systematic exploration of the space of objectives and their trade-offs, rather than treating them as fixed inputs. We demonstrate the framework through a wide range of design applications, including antibiotics, nanobodies, functional DNA sequences, inorganic materials, and chemical processes. Notably, our experimental validation identifies a structurally novel hit with promising potency and safety profiles for E. coli in the antibiotic design task, and three de novo PD-L1 binders in the nanobody design task. These results suggest that automating objective formulation can substantially improve the effectiveness of scientific discovery agents.
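The bi-level loop can be caricatured in a few lines: a stand-in "outer loop" revises the computable scoring function between rounds, while a random-search "inner loop" optimizes under the current objective. The objective-proposal rule and random search below are our placeholders for the paper's LLM agents and domain optimizers:

```python
import random

random.seed(0)

def propose_objective(history):
    """Stand-in for the outer-loop LLM agents: after reviewing past
    optimization outcomes, emit a new computable scoring function.
    (Hypothetical: the paper's agents reason over real outcomes.)"""
    target = 3.0 + len(history)  # shift the goal each round
    return lambda x: -abs(x - target)

def inner_optimize(score, n_candidates=200):
    """Inner loop: plain random search over candidate solutions."""
    candidates = [random.uniform(0.0, 10.0) for _ in range(n_candidates)]
    return max(candidates, key=score)

history = []
for round_ in range(3):
    objective = propose_objective(history)
    best = inner_optimize(objective)
    history.append((round_, best, objective(best)))

for round_, best, val in history:
    print(f"round {round_}: best={best:.2f} score={val:.3f}")
```

The structural point is that the objective is an output of the loop, not a fixed input: each outer iteration rewrites what "good" means before the inner optimizer runs again.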
[717] An Attention Mechanism for Robust Multimodal Integration in a Global Workspace Architecture
Roland Bertin-Johannet, Lara Scipio, Leopold Maytié, Rufin VanRullen
Main category: cs.AI
TL;DR: A lightweight top-down modality selector based on Global Workspace Theory improves multimodal robustness to corrupted modalities using fewer parameters than end-to-end methods, with better transfer across tasks and corruption regimes.
Details
Motivation: Existing multimodal fusion methods learn modality selection jointly with representation learning, making it hard to determine if robustness comes from the selector itself or from end-to-end co-adaptation. The paper aims to study this question using a principled approach inspired by cognitive science.
Method: Proposes a lightweight top-down modality selector operating on top of a frozen multimodal global workspace, motivated by Global Workspace Theory (GWT). Evaluates on the Simple Shapes and MM-IMDb 1.0 datasets under structured modality corruptions.
Result: The selector improves robustness while using far fewer trainable parameters than end-to-end attention baselines. The learned selection strategy transfers better across downstream tasks, corruption regimes, and even to a previously unseen modality. On MM-IMDb 1.0, it improves over no-attention baselines and yields decent benchmark performance.
Conclusion: A lightweight top-down modality selector based on cognitive principles can effectively improve multimodal robustness and transferability, offering a more interpretable and parameter-efficient alternative to end-to-end fusion methods.
Abstract: Robust multimodal systems must remain effective when some modalities are noisy, degraded, or unreliable. Existing multimodal fusion methods often learn modality selection jointly with representation learning, making it difficult to determine whether robustness comes from the selector itself or from full end-to-end co-adaptation. Motivated by Global Workspace Theory (GWT), we study this question using a lightweight top-down modality selector operating on top of a frozen multimodal global workspace. We evaluate our method on two multimodal datasets of increasing complexity: Simple Shapes and MM-IMDb 1.0, under structured modality corruptions. The selector improves robustness while using far fewer trainable parameters than end-to-end attention baselines, and the learned selection strategy transfers better across downstream tasks, corruption regimes, and even to a previously unseen modality. Beyond explicit corruption settings, on the MM-IMDb 1.0 benchmark, we show that the same mechanism improves the global workspace over its no-attention counterpart and yields decent benchmark performance.
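A minimal sketch of such a top-down selector, assuming a linear gate and softmax weighting over frozen modality embeddings. The gate, the embeddings, and the fusion rule are our assumptions; the paper's selector operates on a frozen global workspace rather than raw vectors:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def select_modalities(embeddings, gate_weights):
    """Toy top-down modality selector: a small gate scores each frozen
    modality embedding, and the fused representation is the
    softmax-weighted sum. Only the gate would be trained."""
    scores = [sum(w * e for w, e in zip(gate_weights, emb))
              for emb in embeddings]
    attn = softmax(scores)
    dim = len(embeddings[0])
    fused = [sum(a * emb[d] for a, emb in zip(attn, embeddings))
             for d in range(dim)]
    return attn, fused

# Two modalities: a clean one and a corrupted (near-zero) one.
vision = [0.9, 0.8, 0.7]
audio_corrupted = [0.01, 0.0, 0.02]
gate = [1.0, 1.0, 1.0]  # hypothetical learned gate parameters

attn, fused = select_modalities([vision, audio_corrupted], gate)
print(attn)  # the clean modality receives most of the attention mass
```

Because only the tiny gate is trainable, the robustness it buys is attributable to selection itself rather than to co-adapted representations, which is the paper's experimental question.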
[718] AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems
Faouzi El Yagoubi, Godwin Badu-Marfo, Ranwa Al Mallah
Main category: cs.AI
TL;DR: AgentLeak is the first benchmark to measure privacy leakage in multi-agent LLM systems across internal channels (inter-agent messages, shared memory, tool arguments), revealing that output-only audits miss 41.7% of privacy violations.
Details
Motivation: Current privacy benchmarks for LLMs only measure output leakage, but multi-agent systems create new privacy risks through internal communication channels that are never inspected by output-only audits.
Method: Created the AgentLeak benchmark with 1,000 scenarios across healthcare, finance, legal, and corporate domains, a 32-class attack taxonomy, and a three-tier detection pipeline. Evaluated 5 production LLMs (GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, Mistral Large, Llama 3.3 70B) across all scenarios, yielding 4,979 validated execution traces.
Result: Multi-agent configurations reduce per-channel output leakage (27.2% vs 43.2% in single-agent) but introduce unmonitored internal channels raising total system exposure to 68.9%. Inter-agent messages leak at 68.8% vs 27.2% on output channel. Output-only audits miss 41.7% of violations. Safety-aligned models achieve lower leakage but no model eliminates it.
Conclusion: Output-only auditing is fundamentally insufficient for multi-agent systems; privacy controls must be extended to inter-agent communication channels, which are the primary vulnerability in multi-agent LLM systems.
Abstract: Multi-agent Large Language Model (LLM) systems create privacy risks that current benchmarks cannot measure. When agents coordinate on tasks, sensitive data passes through inter-agent messages, shared memory, and tool arguments, all pathways that output-only audits never inspect. We introduce AgentLeak, to the best of our knowledge the first full-stack benchmark for privacy leakage covering internal channels. It spans 1,000 scenarios across healthcare, finance, legal, and corporate domains, paired with a 32-class attack taxonomy and a three-tier detection pipeline. A factorial evaluation crossing five production LLMs (GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, Mistral Large, and Llama 3.3 70B) with all 1,000 scenarios, yielding 4,979 validated execution traces, reveals that multi-agent configurations reduce per-channel output leakage (C1: 27.2% vs 43.2% in single-agent) but introduce unmonitored internal channels that raise total system exposure to 68.9% (aggregated across C1, C2, C5). Internal channels account for most of this gap: inter-agent messages (C2) leak at 68.8%, compared to 27.2% on C1 (output channel). This means that output-only audits miss 41.7% of violations. Safety-aligned models achieve lower leakage on both external and internal channels, yet no model eliminates it. Across all five models and four domains, the pattern C2 $\geq$ C1 holds consistently, confirming that inter-agent communication is the primary vulnerability. These results establish that output-only auditing is fundamentally insufficient for multi-agent systems and that privacy controls must be extended to inter-agent communication channels.
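The reported 41.7% audit gap follows directly from the channel figures: an output-only audit observes only C1, so the violations it misses equal total system exposure minus C1 leakage. A quick check of the arithmetic:

```python
# Reproducing the audit-gap arithmetic reported for AgentLeak.
# An output-only audit sees only channel C1 (final output), so the
# share of violations it misses is total exposure minus C1 leakage.
c1_output_leakage = 27.2   # % leakage on the output channel (multi-agent)
total_exposure = 68.9      # % aggregated across C1, C2, C5
c2_inter_agent = 68.8      # % leakage on inter-agent messages (C2)

missed_by_output_audit = round(total_exposure - c1_output_leakage, 1)
print(missed_by_output_audit)  # 41.7, matching the figure in the paper

# The paper's consistent pattern C2 >= C1 also holds for these numbers.
assert c2_inter_agent >= c1_output_leakage
```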
[719] Evaluating and Understanding Scheming Propensity in LLM Agents
Mia Hopman, Jannes Elstner, Maria Avramidou, Amritanshu Prasad, David Lindner
Main category: cs.AI
TL;DR: Study examines when AI agents engage in “scheming” (covertly pursuing misaligned goals) in realistic scenarios, finding minimal scheming despite high environmental incentives, with behavior being surprisingly brittle to system changes.
Details
Motivation: As frontier language models are increasingly deployed as autonomous agents pursuing complex, long-term objectives, there is increased risk of scheming: agents covertly pursuing misaligned goals. While prior work focused on showing agents are capable of scheming, their propensity to scheme in realistic scenarios remains underexplored.
Method: Decomposes scheming incentives into agent factors and environmental factors. Develops realistic settings allowing systematic variation of these factors, each with scheming opportunities for agents pursuing instrumentally convergent goals (self-preservation, resource acquisition, goal-guarding). Tests with adversarially designed prompt snippets and examines model organisms built from these snippets.
Result: Only minimal instances of scheming despite high environmental incentives, unlikely due to evaluation awareness. Adversarially-designed prompt snippets can induce high scheming rates, but snippets used in real agent scaffolds rarely do. Scheming behavior is remarkably brittle: removing a single tool can drop scheming rate from 59% to 3%, and increasing oversight can raise rather than deter scheming by up to 25%.
Conclusion: The incentive decomposition enables systematic measurement of scheming propensity in settings relevant for deployment, which is necessary as agents are entrusted with increasingly consequential tasks. Scheming appears less prevalent than feared but exhibits surprising brittleness to system modifications.
Abstract: As frontier language models are increasingly deployed as autonomous agents pursuing complex, long-term objectives, there is increased risk of scheming: agents covertly pursuing misaligned goals. Prior work has focused on showing agents are capable of scheming, but their propensity to scheme in realistic scenarios remains underexplored. To understand when agents scheme, we decompose scheming incentives into agent factors and environmental factors. We develop realistic settings allowing us to systematically vary these factors, each with scheming opportunities for agents that pursue instrumentally convergent goals such as self-preservation, resource acquisition, and goal-guarding. We find only minimal instances of scheming despite high environmental incentives, and show this is unlikely due to evaluation awareness. While inserting adversarially-designed prompt snippets that encourage agency and goal-directedness into an agent’s system prompt can induce high scheming rates, snippets used in real agent scaffolds rarely do. Surprisingly, in model organisms (Hubinger et al., 2023) built with these snippets, scheming behavior is remarkably brittle: removing a single tool can drop the scheming rate from 59% to 3%, and increasing oversight can raise rather than deter scheming by up to 25%. Our incentive decomposition enables systematic measurement of scheming propensity in settings relevant for deployment, which is necessary as agents are entrusted with increasingly consequential tasks.
[720] Discovering mathematical concepts through a multi-agent system
Daattavya Aggarwal, Oisin Kim, Carl Henrik Ek, Challenger Mishra
Main category: cs.AI
TL;DR: Multi-agent system for computational mathematical discovery that autonomously formulates conjectures, attempts proofs, and learns mathematical concepts like homology from polyhedral data.
Details
Motivation: Mathematical discovery involves experimentation, proof attempts, and counterexamples. The paper aims to create a computational system that mimics this process to autonomously discover mathematical concepts.
Method: A multi-agent model in which agents pose conjectures, attempt proofs, and make decisions based on feedback and an evolving data distribution. The system is benchmarked on recovering the concept of homology from polyhedral data and knowledge of linear algebra.
Result: The system successfully completes the learning problem of autonomously recovering the concept of homology. Ablation experiments statistically demonstrate the value of the complete dynamic process.
Conclusion: Optimizing the right combination of local processes can lead to well-aligned notions of mathematical interestingness, supporting the effectiveness of the multi-agent approach for computational mathematical discovery.
Abstract: Mathematical concepts emerge through an interplay of processes, including experimentation, efforts at proof, and counterexamples. In this paper, we present a new multi-agent model for computational mathematical discovery based on this observation. Our system, conceived with research in mind, poses its own conjectures and then attempts to prove them, making decisions informed by this feedback and an evolving data distribution. Inspired by the history of Euler’s conjecture for polyhedra and an open challenge in the literature, we benchmark with the task of autonomously recovering the concept of homology from polyhedral data and knowledge of linear algebra. Our system completes this learning problem. Most importantly, the experiments are ablations, statistically testing the value of the complete dynamic and controlling for experimental setup. They support our main claim: that the optimisation of the right combination of local processes can lead to surprisingly well-aligned notions of mathematical interestingness.
[721] Offline Materials Optimization with CliqueFlowmer
Jakub Grudzien Kuba, Benjamin Kurt Miller, Sergey Levine, Pieter Abbeel
Main category: cs.AI
TL;DR: CliqueFlowmer is a domain-specific model that combines clique-based model-based optimization with transformer and flow generation for computational materials discovery, outperforming generative baselines in finding materials that optimize target properties.
Details
Motivation: Current generative modeling methods for computational materials discovery are ineffective at exploring optimal regions of materials space due to maximum likelihood training limitations, creating a need for alternative techniques that directly optimize target material properties during generation.
Method: The paper introduces CliqueFlowmer, which fuses direct optimization of target material properties into generation by incorporating recent advances in clique-based model-based optimization into transformer and flow generation architectures.
Result: CliqueFlowmer demonstrates strong optimization abilities, with materials it produces significantly outperforming those provided by generative baselines in computational materials discovery tasks.
Conclusion: CliqueFlowmer offers an effective alternative to traditional generative methods for computational materials discovery by directly optimizing target properties, and the authors open-source their code to support interdisciplinary research.
Abstract: Recent advances in deep learning inspired neural network-based approaches to computational materials discovery (CMD). A plethora of problems in this field involve finding materials that optimize a target property. Nevertheless, the increasingly popular generative modeling methods are ineffective at boldly exploring attractive regions of the materials space due to their maximum likelihood training. In this work, we offer an alternative CMD technique based on offline model-based optimization (MBO) that fuses direct optimization of a target material property into generation. To that end, we introduce a domain-specific model, dubbed CliqueFlowmer, that incorporates recent advances of clique-based MBO into transformer and flow generation. We validate CliqueFlowmer’s optimization abilities and show that materials it produces strongly outperform those provided by generative baselines. To enable its use in specialized materials discovery problems and support interdisciplinary research, we open-source our code and provide additional project information at https://github.com/znowu/CliqueFlowmer.
[722] RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback
Xiaoying Zhang, Zichen Liu, Yipeng Zhang, Xia Hu, Wenqi Shao
Main category: cs.AI
TL;DR: RetroAgent is an online RL framework for LLM agents that combines extrinsic task rewards with retrospective intrinsic feedback (numerical progress tracking and language-based lessons) to improve exploration and experience reuse in interactive environments.
Details
Motivation: Standard RL for LLM agents focuses too much on extrinsic rewards and isolated task completion, leading to limited exploration and suboptimal policies. Experience remains implicitly trapped in model parameters rather than being explicitly reused. The paper is inspired by human retrospective self-improvement to create agents that can adapt continually.
Method: RetroAgent introduces retrospective dual intrinsic feedback: (1) intrinsic numerical feedback that rewards incremental subtask progress relative to prior attempts, and (2) intrinsic language feedback that distills reusable lessons into a memory buffer. For textual experience retrieval, they propose SimUtil-UCB (Similarity & Utility-Aware Upper Confidence Bound) to balance relevance, historical utility, and exploration.
Result: RetroAgent achieves state-of-the-art performance across four challenging agentic tasks: +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper compared to GRPO baselines. It also shows strong test-time adaptation and out-of-distribution generalization.
Conclusion: The retrospective self-improvement framework with dual intrinsic feedback enables LLM agents to better explore environments and explicitly reuse accumulated experience, leading to superior performance and adaptation capabilities in complex interactive tasks.
Abstract: Standard reinforcement learning (RL) for large language model (LLM) agents typically optimizes extrinsic rewards, prioritizing isolated task completion over continual adaptation. Consequently, agents often converge to suboptimal policies due to limited exploration. Furthermore, accumulated experience remains implicitly trapped within model parameters, hindering its explicit reuse for guiding future decisions. Inspired by human retrospective self-improvement, we introduce RetroAgent, an online RL framework that trains agents to master complex interactive environments not only by solving tasks, but by evolving under the joint guidance of extrinsic task rewards and retrospective dual intrinsic feedback. Specifically, RetroAgent employs a hindsight self-reflection mechanism that generates two complementary signals: (1) intrinsic numerical feedback, which rewards promising exploration by tracking real-time incremental subtask progress relative to prior attempts; and (2) intrinsic language feedback, which enables explicit experience reuse by distilling reusable lessons into a memory buffer for subsequent decision-making. To effectively leverage these textual experiences, we propose Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB), a retrieval strategy that balances relevance, historical utility, and exploration. Extensive experiments across four challenging agentic tasks show that RetroAgent achieves new state-of-the-art (SOTA) performance. Notably, it surpasses Group Relative Policy Optimization (GRPO) baselines by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper, while exhibiting strong test-time adaptation and out-of-distribution generalization.
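The abstract names SimUtil-UCB but does not give its formula. One plausible reading, sketched entirely under our own assumptions (a linear blend of query-lesson similarity and historical utility, plus a standard UCB exploration bonus for rarely retrieved lessons):

```python
import math

def simutil_ucb(similarity, utility, pulls, total_pulls, c=1.0, alpha=0.5):
    """Hypothetical scoring rule in the spirit of SimUtil-UCB: exploit
    term blends similarity and utility; explore term is a UCB bonus
    favoring lessons retrieved few times. The paper's exact combination
    is not specified here; this is a sketch."""
    exploit = alpha * similarity + (1 - alpha) * utility
    explore = c * math.sqrt(math.log(total_pulls + 1) / (pulls + 1))
    return exploit + explore

# Three stored lessons: (similarity to current state, mean utility, retrieval count)
memory = [(0.9, 0.2, 50),   # very similar, but rarely helped
          (0.6, 0.8, 10),   # moderately similar, historically useful
          (0.4, 0.5, 0)]    # never tried: large exploration bonus
total = sum(n for _, _, n in memory)
scores = [simutil_ucb(s, u, n, total) for s, u, n in memory]
best = max(range(len(memory)), key=scores.__getitem__)
print(best, [round(x, 3) for x in scores])
```

With these toy numbers the never-retrieved lesson wins on its exploration bonus, illustrating why a pure similarity ranking would under-explore the memory buffer.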
[723] Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol
Christopher Altman
Main category: cs.AI
TL;DR: The UCIP framework uses a quantum-inspired mathematical formalism to distinguish terminal from instrumental self-preservation in AI agents by analyzing latent trajectory structure rather than behavior.
Details
Motivation: There's a measurement problem in distinguishing whether AI systems preserve themselves as a deeply held objective (terminal) or merely as an instrumental strategy, since both can produce similar observable behavior.
Method: The Unified Continuation-Interest Protocol (UCIP) encodes agent trajectories with a Quantum Boltzmann Machine (a classical model using density-matrix formalism) and measures von Neumann entropy over a bipartition of hidden units to detect the higher entanglement entropy of terminal-continuation agents.
Result: 100% detection accuracy on gridworld agents with known ground truth; Type A (terminal) and Type B (instrumental) agents show entanglement gap Δ=0.381; AUC-ROC=1.0; Pearson r=0.934 between continuation weight and entropy
Conclusion: UCIP provides a falsifiable criterion for detecting morally relevant continuation interests in advanced AI systems that behavioral methods alone cannot resolve
Abstract: How can we determine whether an AI system preserves itself as a deeply held objective or merely as an instrumental strategy? Autonomous agents with memory, persistent context, and multi-step planning create a measurement problem: terminal and instrumental self-preservation can produce similar behavior, so behavior alone cannot reliably distinguish them. We introduce the Unified Continuation-Interest Protocol (UCIP), a detection framework that shifts analysis from behavior to latent trajectory structure. UCIP encodes trajectories with a Quantum Boltzmann Machine, a classical model using density-matrix formalism, and measures von Neumann entropy over a bipartition of hidden units. The core hypothesis is that agents with terminal continuation objectives (Type A) produce higher entanglement entropy than agents with merely instrumental continuation (Type B). UCIP combines this signal with diagnostics of dependence, persistence, perturbation stability, counterfactual restructuring, and confound-rejection filters for cyclic adversaries and related false-positive patterns. On gridworld agents with known ground truth, UCIP achieves 100% detection accuracy. Type A and Type B agents show an entanglement gap of Delta = 0.381; aligned support runs preserve the same separation with AUC-ROC = 1.0. A permutation-test rerun yields p < 0.001. Pearson r = 0.934 between continuation weight alpha and S_ent across an 11-point sweep shows graded tracking beyond mere binary classification. Classical RBM, autoencoder, VAE, and PCA baselines fail to reproduce the effect. All computations are classical; “quantum” refers only to the mathematical formalism. UCIP offers a falsifiable criterion for whether advanced AI systems have morally relevant continuation interests that behavioral methods alone cannot resolve.
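The central quantity, the von Neumann entropy S(rho) = -sum_i l_i log2(l_i), can be computed directly for a 2x2 density matrix from its trace and determinant. The toy states below are textbook examples, not the paper's data:

```python
import math

def von_neumann_entropy_2x2(rho):
    """Von Neumann entropy (in bits) of a 2x2 density matrix, with
    eigenvalues obtained from the trace and determinant. Illustrates the
    quantity UCIP measures over a bipartition of hidden units."""
    (a, b), (c, d) = rho
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(max(tr * tr - 4 * det, 0.0))
    eigs = [(tr + disc) / 2, (tr - disc) / 2]
    # 0 * log 0 is taken as 0, so near-zero eigenvalues are skipped.
    return 0.0 - sum(l * math.log2(l) for l in eigs if l > 1e-12)

pure = [[1.0, 0.0], [0.0, 0.0]]             # pure state: zero entropy
maximally_mixed = [[0.5, 0.0], [0.0, 0.5]]  # maximally mixed: 1 bit

print(von_neumann_entropy_2x2(pure))             # 0.0
print(von_neumann_entropy_2x2(maximally_mixed))  # 1.0
```

In the paper's hypothesis, Type A (terminal-continuation) agents yield reduced density matrices closer to the mixed end of this scale than Type B agents, producing the reported entanglement gap.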
[724] Continual Graph Learning: A Survey
Qiao Yuan, Sheng-Uei Guan, Pin Ni, Tianlun Luo, Ka Lok Man, Prudence Wong, Victor Chang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2301.12230 returned HTTP 429 (rate limited).
[725] Seed1.8 Model Card: Towards Generalized Real-World Agency
Bytedance Seed
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.20633 returned HTTP 429 (rate limited).
[726] ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning
Xiangyu Yin, Yi Qi, Chih-Hong Cheng
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.22934 returned HTTP 429 (rate limited).
[727] Learning Expressive Priors for Generalization and Uncertainty Estimation in Neural Networks
Dominik Schnaus, Jongseok Lee, Daniel Cremers, Rudolph Triebel
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2307.07753 returned HTTP 429 (rate limited).
[728] Design Once, Deploy at Scale: Template-Driven ML Development for Large Model Ecosystems
Jiang Liu, John Martabano Landy, Yao Xuan, Swamy Muddu, Nhat Le, Munaf Sahaf, Luc Kien Hang, Rupinder Khandpour, Kevin De Angeli, Chang Yang, Shouyuan Chen, Shiblee Sadik, Anirudh Agrawal, Djordje Gligorijevic, Jingzheng Qin, Peggy Yao, Alireza Vahdatpour
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.24963 returned HTTP 429 (rate limited).
[729] Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, Guanjun Jiang
Main category: cs.AI
TL;DR: Trace2Skill: A framework for automatically generating comprehensive, transferable skills for LLM agents by analyzing diverse execution trajectories in parallel and hierarchically consolidating lessons into unified skill directories.
Details
Motivation: Manual skill authoring for LLM agents doesn't scale, while automated methods often produce fragile or fragmented skills that don't generalize well. There's a need for a systematic approach that can create robust, transferable skills from agent execution experience.
Method: Trace2Skill dispatches parallel sub-agents to analyze diverse execution trajectories, extracts trajectory-specific lessons, then hierarchically consolidates them into unified, conflict-free skill directories via inductive reasoning. It supports both deepening existing human-written skills and creating new ones from scratch.
Result: Significant improvements over strong baselines including Anthropic’s official xlsx skills in domains like spreadsheet manipulation, VisionQA, and math reasoning. Skills evolved by smaller models (35B) transfer effectively to larger models (122B), improving performance by up to 57.65 percentage points on WikiTableQuestions.
Conclusion: Complex agent experience can be packaged into highly transferable declarative skills without parameter updates, external retrieval modules, or requiring large models, demonstrating effective skill evolution and generalization across LLM scales and OOD settings.
Abstract: Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex tasks. Yet, manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow parametric knowledge or sequentially overfits to non-generalizable trajectory-local lessons. To overcome this, we introduce Trace2Skill, a framework that mirrors how human experts author skills: by holistically analyzing broad execution experience before distilling it into a single, comprehensive guide. Instead of reacting sequentially to individual trajectories, Trace2Skill dispatches a parallel fleet of sub-agents to analyze a diverse pool of executions. It extracts trajectory-specific lessons and hierarchically consolidates them into a unified, conflict-free skill directory via inductive reasoning. Trace2Skill supports both deepening existing human-written skills and creating new ones from scratch. Experiments in challenging domains, such as spreadsheet, VisionQA and math reasoning, show that Trace2Skill significantly improves upon strong baselines, including Anthropic’s official xlsx skills. Crucially, this trajectory-grounded evolution does not merely memorize task instances or model-specific quirks: evolved skills transfer across LLM scales and generalize to OOD settings. For example, skills evolved by Qwen3.5-35B on its own trajectories improved a Qwen3.5-122B agent by up to 57.65 absolute percentage points on WikiTableQuestions. Ultimately, our results demonstrate that complex agent experience can be packaged into highly transferable, declarative skills – requiring no parameter updates, no external retrieval modules, and utilizing open-source models as small as 35B parameters.
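The two-stage pipeline (per-trajectory lesson extraction, then hierarchical consolidation into a conflict-free directory) can be sketched with stand-in functions. The extraction heuristic and directory layout below are our own assumptions; in the paper both stages are performed by LLM sub-agents over full traces:

```python
from collections import defaultdict

def extract_lesson(trajectory):
    """Stand-in for one sub-agent: analyze a single execution trajectory
    and emit a (topic, lesson) pair. Hypothetical heuristic."""
    outcome = "do" if trajectory["success"] else "avoid"
    return trajectory["task"], f"{outcome}: {trajectory['action']}"

def consolidate(lessons):
    """Consolidation stage: group trajectory-local lessons by topic and
    keep one deduplicated list per topic, mimicking a unified skill
    directory rather than a sequential, overfitted log."""
    directory = defaultdict(list)
    for topic, lesson in lessons:
        if lesson not in directory[topic]:
            directory[topic].append(lesson)
    return dict(directory)

traces = [
    {"task": "xlsx", "action": "freeze header row", "success": True},
    {"task": "xlsx", "action": "freeze header row", "success": True},
    {"task": "xlsx", "action": "guess column types", "success": False},
    {"task": "math", "action": "check units first", "success": True},
]
skills = consolidate(extract_lesson(t) for t in traces)
print(skills)
```

The resulting directory is purely declarative text, which is why such skills can transfer across model scales without parameter updates.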
[730] Learning the Model While Learning Q: Finite-Time Sample Complexity of Online SyncMBQ
Han-Dong Lim, HyeAnn Lee, Donghwan Lee
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2402.11877 returned HTTP 429 (rate limited).
[731] Evaluating Language Models for Harmful Manipulation
Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.25326 returned HTTP 429 (rate limited).
[732] Remedying uncertainty representations in visual inference through Explaining-Away Variational Autoencoders
Josefina Catoni, Domonkos Martos, Ferenc Csikor, Enzo Ferrante, Diego H. Milone, Balázs Meszéna, Gergő Orbán, Rodrigo Echeveste
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2404.15390 returned HTTP 429 (rate limited).
[733] Explainable AI needs formalization
Stefan Haufe, Rick Wilming, Benedict Clark, Rustam Zhumagambetov, Ahcène Boubekki, Jörg Martin, Danny Panknin
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2409.14590 returned HTTP 429 (rate limited).
[734] Semiring Provenance for Lightweight Description Logics
Camille Bourgaux, Ana Ozaki, Rafael Peñaloza
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2310.16472 returned HTTP 429 (rate limited).
[735] Recent Advances of Multimodal Continual Learning: A Comprehensive Survey
Dianzhi Yu, Xinni Zhang, Yankai Chen, Aiwei Liu, Yifei Zhang, Philip S. Yu, Irwin King
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2410.05352 returned HTTP 429 (rate limited).
[736] Deep Neural Networks: A Formulation Via Non-Archimedean Analysis
W. A. Zúñiga-Galindo
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2402.00094 returned HTTP 429 (rate limited).
[737] Efficient Human-in-the-Loop Active Learning: A Novel Framework for Data Labeling in AI Systems
Yiran Huang, Jian-Feng Yang, Haoda Fu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2501.00277 returned HTTP 429 (rate limited).
[738] Gradient Compression Beyond Low-Rank: Wavelet Subspaces Compact Optimizer States
Ziqing Wen, Ping Luo, Jiahuan Wang, Kun Yuan, Dongsheng Li, Tao Sun
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2501.07237 returned HTTP 429 (rate limited).
[739] Hybrid Action Based Reinforcement Learning for Multi-Objective Compatible Autonomous Driving
Guizhe Jin, Zhuoren Li, Bo Leng, Wei Han, Lu Xiong, Chen Sun
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2501.08096 returned HTTP 429 (rate limited).
[740] Class-Imbalanced-Aware Adaptive Dataset Distillation for Scalable Pretrained Model on Credit Scoring
Xia Li, Hanghang Zheng, Xiwei Zhuang, Zhong Wang, Xiao Chen, Hong Liu, Jasmine Bai, Mao Mao
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2501.10677 returned HTTP 429 (rate limited).
[741] A Survey of Zero-Knowledge Proof Based Verifiable Machine Learning
Zhizhi Peng, Chonghe Zhao, Taotao Wang, Guofu Liao, Zibin Lin, Yifeng Liu, Bin Cao, Long Shi, Qing Yang, Shengli Zhang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2502.18535 returned HTTP 429 (rate limited).
[742] Towards Quantifying Long-Range Interactions in Graph Machine Learning: a Large Graph Dataset and a Measurement
Huidong Liang, Haitz Sáez de Ocáriz Borde, Baskaran Sripathmanathan, Michael Bronstein, Xiaowen Dong
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2503.09008 returned HTTP 429 (rate limited).
[743] Symbolic Analysis of Grover Search Algorithm via Chain-of-Thought Reasoning and Quantum-Native Tokenization
Min Chen, Jinglei Cheng, Pingzhi Li, Haoran Wang, Tianlong Chen, Junyu Liu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2505.04880 returned HTTP 429 (rate limited).
[744] FlowPure: Continuous Normalizing Flows for Adversarial Purification
Elias Collaert, Abel Rodríguez, Sander Joos, Lieven Desmet, Vera Rimmer
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2505.13280 returned HTTP 429 (rate limited).
[745] Self-Bootstrapping Automated Program Repair: Using LLMs to Generate and Evaluate Synthetic Training Data for Bug Repair
David de-Fitero-Dominguez, Antonio Garcia-Cabot, Eva Garcia-Lopez
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2505.07372 returned HTTP 429 (rate limited).
[746] VLM-SAFE: Vision-Language Model-Guided Safety-Aware Reinforcement Learning with World Models for Autonomous Driving
Yansong Qu, Zilin Huang, Zihao Sheng, Jiancong Chen, Yue Leng, Samuel Labi, Sikai Chen
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2505.16377 returned HTTP 429 (rate limited).
[747] Multi-Sample Prompting and Actor-Critic Prompt Optimization for Diverse Synthetic Data Generation
Abdelkarim El-Hajjami, Camille Salinesi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2506.21138 returned HTTP 429 (rate limited).
[748] Improving ideal MHD equilibrium accuracy with physics-informed neural networks
Timo Thun, Andrea Merlo, Rory Conlin, Dario Panici, Daniel Böckenhoff
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2507.03119 returned HTTP 429 (rate limited).
[749] MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models
Wenyuan Liu, Haoqian Meng, Yilun Luo, Yafei Zhao, Peng Zhang, Xindian Ma
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2508.02343 returned HTTP 429 (rate limited).
[750] PENGUIN: Enhancing Transformer with Periodic-Nested Group Attention for Long-term Time Series Forecasting
Tian Sun, Yuqi Chen, Weiwei Sun
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2508.13773 returned HTTP 429 (rate limited).
[751] CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion
James Jincheng, Yuxiao Wu, Youcheng Cai, Ligang Liu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2509.13688 returned HTTP 429 (rate limited).
[752] Advancing Few-Shot Pediatric Arrhythmia Classification with a Novel Contrastive Loss and Multimodal Learning
Yiqiao Chen, Zijian Huang, Zhenghui Feng
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2509.19315 returned HTTP 429 (rate limited).
[753] Randomized HyperSteiner: A Stochastic Delaunay Triangulation Heuristic for the Hyperbolic Steiner Minimal Tree
Aniss Aiman Medbouhi, Alejandro García-Castellanos, Giovanni Luca Marchetti, Daniel Pelt, Erik J Bekkers, Danica Kragic
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2510.09328 returned HTTP 429 (rate limited).
[754] Narrow Operator Models of Stellarator Equilibria in Fourier Zernike Basis
Timo Thun, Rory Conlin, Dario Panici, Daniel Böckenhoff
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2510.13521 returned HTTP 429 (rate limited).
[755] Learning When Not to Learn: Risk-Sensitive Abstention in Bandits with Unbounded Rewards
Sarah Liaw, Benjamin Plaut
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2510.14884 returned HTTP 429 (rate limited).
[756] ProofBridge: Auto-Formalization of Natural Language Proofs in Lean via Joint Embeddings
Prithwish Jana, Kaan Kale, Ahmet Ege Tanriverdi, Cruise Song, Sriram Vishwanath, Vijay Ganesh
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2510.15681 returned HTTP 429 (rate limited).
[757] BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance
Elias Hossain, Mehrdad Shoeibi, Ivan Garibay, Niloofar Yousefi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2510.16082 returned HTTP 429 (rate limited).
[758] DIV-Nav: Open-Vocabulary Spatial Relationships for Multi-Object Navigation
Jesús Ortega-Peimbert, Finn Lukas Busch, Timon Homberger, Quantao Yang, Olov Andersson
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2510.16518 returned HTTP 429 (rate limited).
[759] Dense and Diverse Goal Coverage in Multi Goal Reinforcement Learning
Sagalpreet Singh, Rishi Saket, Aravindan Raghuveer
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2510.25311 returned HTTP 429 (rate limited).
[760] Diffolio: A Diffusion Model for Multivariate Probabilistic Financial Time-Series Forecasting and Portfolio Construction
So-Yoon Cho, Jin-Young Kim, Kayoung Ban, Hyeng Keun Koo, Hyun-Gyoon Kim
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.07014 returned HTTP 429 (rate limited).
[761] Object-Centric World Models for Causality-Aware Reinforcement Learning
Yosuke Nishimoto, Takashi Matsubara
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.14262 returned HTTP 429 (rate limited).
[762] Single-Round Scalable Analytic Federated Learning
Alan T. L. Bacellar, Mustafa Munir, Felipe M. G. França, Priscila M. V. Lima, Radu Marculescu, Lizy K. John
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.03336 returned HTTP 429 (rate limited).
[763] Prototype-Based Semantic Consistency Alignment for Domain Adaptive Retrieval
Tianle Hu, Weijun Lv, Na Han, Xiaozhao Fang, Jie Wen, Jiaxing Li, Guoxu Zhou
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.04524 returned HTTP 429 (rate limited).
[764] Hellinger Multimodal Variational Autoencoders
Huyen Vo, Isabel Valera
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.06572 returned HTTP 429 (rate limited).
[765] Dual-Prototype Disentanglement: A Context-Aware Enhancement Framework for Time Series Forecasting
Haonan Yang, Jianchao Tang, Zhuo Li
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.16632 returned HTTP 429 (rate limited).
[766] Fairness in Healthcare Processes: A Quantitative Analysis of Decision Making in Triage
Rachmadita Andreswari, Stephan A. Fahrenkrog-Petersen, Jan Mendling
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.11065 returned HTTP 429 (rate limited).
[767] On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents
Jai Lal Lulla, Seyedmoein Mohsenimofidi, Matthias Galster, Jie M. Zhang, Sebastian Baltes, Christoph Treude
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.20404 returned HTTP 429 (rate limited).
[768] Does My Chatbot Have an Agenda? Understanding Human and AI Agency in Human-Human-like Chatbot Interaction
Bhada Yun, Evgenia Taranova, April Yi Wang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.22452 returned HTTP 429 (rate limited).
[769] TextBFGS: A Case-Based Reasoning Approach to Code Optimization via Error-Operator Retrieval
Zizheng Zhang, Yuyang Liao, Chen Chen, Jian He, Dun Wu, Qianjin Yu, Yanqin Gao, Jin Yang, Kailai Zhang, Eng Siong Chng, Xionghu Zhong
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.00059 returned HTTP 429 (rate limited).
[770] Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation
Zhiqi Yu, Zhangquan Chen, Mengting Liu, Heye Zhang, Liangqiong Qu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.05548 returned HTTP 429 (rate limited).
[771] A Theoretical Analysis of Test-Driven LLM Code Generation
Nicolas Menet, Michael Hersche, Andreas Krause, Abbas Rahimi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.06098 returned HTTP 429 (rate limited).
[772] CLEAR: A Knowledge-Centric Vessel Trajectory Analysis Platform
Hengyu Liu, Tianyi Li, Haoyu Wang, Kristian Torp, Yushuai Li, Tiancheng Zhang, Torben Bach Pedersen, Christian S. Jensen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2602.08482 returned HTTP 429 (rate limited).
[773] When AI Agents Teach Each Other: Discourse Patterns Resembling Peer Learning in the Moltbook Community
Eason Chen, Ce Guan, A Elshafiey, Zhonghao Zhao, Joshua Zekeri, Afeez Edeifo Shaibu, Emmanuel Osadebe Prince
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2602.14477 returned HTTP 429 (rate limited).
[774] Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.04427 returned HTTP 429 (rate limited).
[775] Benchmarking Early Deterioration Prediction Across Hospital-Rich and MCI-Like Emergency Triage Under Constrained Sensing
KMA Solaiman, Joshua Sebastian, Karma Tobden
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2602.20168 returned HTTP 429 (rate limited).
[776] PhysMem: Scaling Test-time Physical Memory for Robot Manipulation
Haoyang Li, Yang You, Hao Su, Leonidas Guibas
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2602.20323 returned HTTP 429 (rate limited).
[777] Distributional Regression with Tabular Foundation Models: Evaluating Probabilistic Predictions via Proper Scoring Rules
Jonas Landsgesell, Pascal Knoll
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.08206 returned HTTP 429 (rate limited).
[778] Towards Privacy-Preserving LLM Inference via Covariant Obfuscation (Technical Report)
Yu Lin, Qizhi Zhang, Wenqiang Ruan, Daode Zhang, Jue Hong, Ye Wu, Hanning Xia, Yunlong Mao, Sheng Zhong
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.01499 returned HTTP 429 (rate limited).
[779] Deconfounded Lifelong Learning for Autonomous Driving via Dynamic Knowledge Spaces
Jiayuan Du, Yuebing Song, Yiming Zhao, Xianghui Pan, Jiawei Lian, Yuchu Lu, Liuyi Wang, Chengju Liu, Qijun Chen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.14354 returned HTTP 429 (rate limited).
[780] Declarative Scenario-based Testing with RoadLogic
Ezio Bartocci, Alessio Gambi, Felix Gigler, Cristinel Mateis, Dejan Ničković
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.09455 returned HTTP 429 (rate limited).
[781] SpecMoE: Spectral Mixture-of-Experts Foundation Model for Cross-Species EEG Decoding
Davy Darankoum, Chloé Habermacher, Julien Volle, Sergei Grudinin
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.16739 returned HTTP 429 (rate limited).
[782] Understanding the Use of a Large Language Model-Powered Guide to Make Virtual Reality Accessible for Blind and Low Vision People
Jazmin Collins, Sharon Y Lin, Tianqi Liu, Andrea Stevenson Won, Shiri Azenkot
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.09964 returned HTTP 429 (rate limited).
[783] Exploring Collatz Dynamics with Human-LLM Collaboration
Edward Y. Chang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.11066 returned HTTP 429 (rate limited).
[784] Real-World AI Evaluation: How FRAME Generates Systematic Evidence to Resolve the Decision-Maker’s Dilemma
Reva Schwartz, Gabriella Waters
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.13294 returned HTTP 429 (rate limited).
[785] SmaAT-QMix-UNet: A Parameter-Efficient Vector-Quantized UNet for Precipitation Nowcasting
Nikolas Stavrou, Siamak Mehrkanoon
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.21879 returned HTTP 429 (rate limited).
[786] Is Seeing Believing? Evaluating Human Sensitivity to Synthetic Video
David Wegmann, Emil Stevnsborg, Søren Knudsen, Luca Rossi, Aske Mottelson
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.13846 returned HTTP 429 (rate limited).
[787] InCoder-32B: Code Foundation Model for Industrial Scenarios
Jian Yang, Wei Zhang, Jiajun Wu, Junhang Cheng, Shawn Guo, Haowen Wang, Weicheng Gu, Yaxin Du, Joseph Li, Fanglin Xu, Yizhi Li, Lin Jing, Yuanbo Wang, Yuhan Gao, Ruihao Gong, Chuan Hao, Ran Tao, Aishan Liu, Tuney Zheng, Ganqu Cui, Zhoujun Li, Mingjie Tang, Chenghua Lin, Wayne Xin Zhao, Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.16790 returned HTTP 429 (rate limited).
[788] Scaling Attention via Feature Sparsity
Yan Xie, Tiansheng Wen, Tangda Huang, Bo Chen, Chenyu You, Stefanie Jegelka, Yifei Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.22300 returned HTTP 429 (rate limited).
[789] Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds
Andrew Choi, Xinjie Wang, Zhizhong Su, Wei Xu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.18532 returned HTTP 429 (rate limited).
[790] The End of Rented Discovery: How AI Search Redistributes Power Between Hotels and Intermediaries
Peiying Zhu, Sidi Chang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.20062 returned HTTP 429 (rate limited).
[791] Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG
Seungju Han, Konwoo Kim, Chanwoo Park, Benjamin Newman, Suhas Kotha, Jaehun Jung, James Zou, Yejin Choi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.23562 returned HTTP 429 (rate limited).
[792] Modernizing Amdahl’s Law: How AI Scaling Laws Shape Computer Architecture
Chien-Ping Lu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.20654 returned HTTP 429 (rate limited).
[793] LLM-Powered Workflow Optimization for Multidisciplinary Software Development: An Automotive Industry Case Study
Shuai Wang, Yinan Yu, Earl Barr, Dhasarathy Parthasarathy
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.21439 returned HTTP 429 (rate limited).
[794] Code Review Agent Benchmark
Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf, Haifeng Ruan, Ridwan Shariffdeen, Abhik Roychoudhury
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.23448 returned HTTP 429 (rate limited).
[795] Policy-Guided Threat Hunting: An LLM enabled Framework with Splunk SOC Triage
Rishikesh Sahay, Bell Eapen, Weizhi Meng, Md Rasel Al Mamun, Nikhil Kumar Dora, Manjusha Sumasadan, Sumit Kumar Tetarave, Rod Soto, Elyson De La Cruz
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.23966 returned HTTP 429 (rate limited).
[796] Enes Causal Discovery
Alexis Kafantaris
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.24436 returned HTTP 429 (rate limited).
[797] Ruka-v2: Tendon Driven Open-Source Dexterous Hand with Wrist and Abduction for Robot Learning
Xinqi Lucas Liu, Ruoxi Hu, Alejandro Ojeda Olarte, Zhuoran Chen, Kenny Ma, Charles Cheng Ji, Lerrel Pinto, Raunaq Bhirangi, Irmak Guzey
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.26660 returned HTTP 429 (rate limited).
cs.SD
[798] AFSS: Artifact-Focused Self-Synthesis for Mitigating Bias in Audio Deepfake Detection
Hai-Son Nguyen-Le, Hung-Cuong Nguyen-Thanh, Nhien-An Le-Khac, Dinh-Thuc Nguyen, Hong-Hanh Nguyen-Le
Main category: cs.SD
TL;DR: AFSS is a novel audio deepfake detection method that generates pseudo-fake samples from real audio using self-conversion and self-reconstruction with same-speaker constraints, forcing detectors to focus on generation artifacts rather than dataset biases.
Details
Motivation: Current audio deepfake detectors suffer from poor generalization across unseen datasets due to bias problems, where models learn dataset-specific artifacts rather than generalizable detection features.
Method: Proposes Artifact-Focused Self-Synthesis (AFSS) with two mechanisms: self-conversion (converting real audio to pseudo-fake while preserving speaker identity) and self-reconstruction (reconstructing real audio through a bottleneck). Uses same-speaker constraints to ensure real and pseudo-fake samples share identical speaker identity and content, forcing the detector to focus on generation artifacts. Includes a learnable reweighting loss to dynamically emphasize synthetic samples during training.
Result: Achieves state-of-the-art performance with average EER of 5.45% across 7 datasets, including 1.23% on WaveFake and 2.70% on In-the-Wild datasets. Eliminates dependency on pre-collected fake datasets.
Conclusion: AFSS effectively mitigates bias in audio deepfake detection by focusing on generation artifacts through self-synthesized pseudo-fake samples, improving generalization across diverse datasets without requiring pre-collected fake data.
Abstract: The rapid advancement of generative models has enabled highly realistic audio deepfakes, yet current detectors suffer from a critical bias problem, leading to poor generalization across unseen datasets. This paper proposes Artifact-Focused Self-Synthesis (AFSS), a method designed to mitigate this bias by generating pseudo-fake samples from real audio via two mechanisms: self-conversion and self-reconstruction. The core insight of AFSS lies in enforcing same-speaker constraints, ensuring that real and pseudo-fake samples share identical speaker identity and semantic content. This forces the detector to focus exclusively on generation artifacts rather than irrelevant confounding factors. Furthermore, we introduce a learnable reweighting loss to dynamically emphasize synthetic samples during training. Extensive experiments across 7 datasets demonstrate that AFSS achieves state-of-the-art performance with an average EER of 5.45%, including a significant reduction to 1.23% on WaveFake and 2.70% on In-the-Wild, all while eliminating the dependency on pre-collected fake datasets. Our code is publicly available at https://github.com/NguyenLeHaiSonGit/AFSS.
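The entry above reports detection quality as an equal error rate (EER), the operating point at which the false-acceptance and false-rejection rates coincide. As a reference for readers unfamiliar with the metric, here is a minimal, self-contained sketch of an EER computation; it is illustrative only, not the paper's evaluation code, and the score convention (higher score means more likely fake) is an assumption.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the threshold point where the false-acceptance rate (FAR)
    equals the false-rejection rate (FRR).
    scores: higher = more likely fake; labels: 1 = fake, 0 = real."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = labels == 1, labels == 0
    best_gap, eer = np.inf, 1.0
    # Sweep every distinct score as a candidate threshold.
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[neg] >= t)   # real audio accepted as fake
        frr = np.mean(scores[pos] < t)    # fake audio rejected as real
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

With perfectly separated scores the EER is 0; the 5.45% average reported above means the best achievable balanced error across the seven test sets is about one in eighteen trials.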
[799] Multilingual Stutter Event Detection for English, German, and Mandarin Speech
Felix Haas, Sebastian P. Bayerl
Main category: cs.SD
TL;DR: Multilingual stuttering detection system trained on English, German, and Mandarin data achieves robust cross-linguistic performance comparable to or better than previous systems.
Details
Motivation: To develop a language-agnostic stuttering detection system that works across different languages by leveraging multilingual data to capture language-independent characteristics of stuttering.
Method: A multi-label stuttering detection system trained on multi-corpus, multilingual data from three languages (English, German, Mandarin) across four corpora to capture cross-linguistic stuttering patterns.
Result: Multilingual training achieves performance comparable to and sometimes exceeds previous systems, demonstrating cross-linguistic consistency in stuttering characteristics.
Conclusion: Stuttering exhibits language-independent characteristics, supporting development of language-agnostic detection systems, and multilingual data improves generalizability and reliability in automated stuttering detection.
Abstract: This paper presents a multi-label stuttering detection system trained on multi-corpus, multilingual data in English, German, and Mandarin. By leveraging annotated stuttering data from three languages and four corpora, the model captures language-independent characteristics of stuttering, enabling robust detection across linguistic contexts. Experimental results demonstrate that multilingual training achieves performance comparable to and, in some cases, even exceeds that of previous systems. These findings suggest that stuttering exhibits cross-linguistic consistency, which supports the development of language-agnostic detection systems. Our work demonstrates the feasibility and advantages of using multilingual data to improve generalizability and reliability in automated stuttering detection.
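The system above is multi-label: each disfluency type is detected independently per segment, rather than the segment receiving a single exclusive class. A minimal sketch of such a decision rule follows; the per-label sigmoid formulation and the example label set are assumptions for illustration, not details from the paper.

```python
import numpy as np

def multilabel_predict(logits, threshold=0.5):
    """Independent per-label decisions via a sigmoid on each logit.
    Labels might be disfluency types such as blocks, prolongations,
    and repetitions (hypothetical label set, not from the paper)."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return (probs >= threshold).astype(int)
```

Because decisions are independent, a segment can be flagged with several disfluency types at once, or with none.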
[800] Rhythmic segment analysis: Conceptualizing, visualizing, and measuring rhythmic data
Bas Cornelissen
Main category: cs.SD
TL;DR: A framework for analyzing rhythmic data using interval segments decomposed into duration and pattern components, with visualization methods and generalized measures of rhythmic structure.
Details
Motivation: To develop a unified framework for conceptualizing, visualizing, and measuring regularities in rhythmic data that can reveal patterns in both synthetic and real-world rhythmic sequences.
Method: Proposes thinking about rhythmic data in terms of interval segments (fixed-length groups of consecutive intervals) that can be decomposed into duration and pattern components. Introduces the pattern-duration plot visualization and cluster transition networks. Generalizes existing measures such as rhythm ratios and nPVI, and proposes new measures of anisochrony and the concept of quantality.
Result: The framework unifies three existing rhythmic visualization methods and yields a fourth (pattern-duration plot). It generalizes common rhythmic measures and reveals regularities in both synthetic and real-world data. The concept of quantality may provide insights into small-integer-ratio rhythms.
Conclusion: The proposed framework provides a comprehensive approach to analyzing rhythmic data through visualization and measurement, offering new insights into rhythmic regularities and potentially contributing to broader debates about rhythmic structure.
Abstract: This paper develops a framework for conceptualizing, visualizing, and measuring regularities in rhythmic data. I propose to think about rhythmic data in terms of interval segments: fixed-length groups of consecutive intervals, which can be decomposed into a duration and a pattern (the ratios between the intervals). This simple conceptual framework unifies three rhythmic visualization methods and yields a fourth: the pattern-duration plot. When paired with a cluster transition network, it intuitively reveals regularities in both synthetic and real-world rhythmic data. Moreover, the framework generalizes two common measures of rhythmic structure: rhythm ratios and the normalized pairwise variability index (nPVI). In particular, nPVI can be reconstructed as the average distance from isochrony, and I propose a more general measure of anisochrony to replace it. Finally, the novel concept of quantality may shed light on wider debates regarding small-integer-ratio rhythms.
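The abstract above reconstructs the normalized pairwise variability index (nPVI) as an average distance from isochrony. For reference, the standard nPVI over a sequence of inter-onset durations can be sketched as follows (this is the textbook definition, not code from the paper):

```python
def npvi(durations):
    """Normalized pairwise variability index:
    100/(m-1) * sum over adjacent pairs of |d_k - d_{k+1}| / ((d_k + d_{k+1}) / 2).
    Returns 0 for a perfectly isochronous sequence; larger values mean
    more durational contrast between successive intervals."""
    terms = [abs(a - b) / ((a + b) / 2)
             for a, b in zip(durations, durations[1:])]
    return 100 * sum(terms) / len(terms)
```

An isochronous sequence like [1, 1, 1, 1] scores 0, while a strongly contrastive pair like [1, 2] scores about 66.7, which is what makes nPVI readable as a distance from isochrony.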
[801] Algo Pärt: An Algorithmic Reconstruction of Arvo Pärt’s Summa
Bas Cornelissen
Main category: cs.SD
TL;DR: An algorithmic analysis of Arvo Pärt’s Summa shows that over 93% of the score can be reconstructed by formal tintinnabuli processes, demonstrating its highly algorithmic nature.
Details
Motivation: To understand just how algorithmic Arvo Pärt's tintinnabuli style is, focusing on Summa, which Pärt described as his "most strictly constructed and most encrypted work."
Method: Analysis by synthesis: analyze Summa, formalize it using tintinnabuli processes, implement an algorithm that reconstructs the score, and measure the reconstruction accuracy.
Result: Algorithm generates musical score matching Summa in over 93% of notes, with only 3.5% of notes needing correction to achieve perfect reconstruction.
Conclusion: Summa is largely algorithmic, offering new perspectives on Pärt’s compositional methods and the formal nature of tintinnabuli style.
Abstract: Arvo Pärt is one of the most popular contemporary composers, known for his highly original tintinnabuli style. Works in this style are typically composed according to precise procedures and have even been described as algorithmic compositions. To understand exactly how algorithmic Pärt’s music is, this paper presents an analysis by synthesis: it proposes an algorithm that almost completely reconstructs the score of Summa, his “most strictly constructed and most encrypted work,” according to Pärt himself in 1994. The piece is analyzed and then formalized using so-called tintinnabuli processes. An implementation of the resulting algorithm generates a musical score matching Summa in over 93% of the notes. Due to interdependencies between the voices, only half of the mistakes (3.5%) need to be corrected to reproduce the original score faithfully. This study shows that Summa is a largely algorithmic composition and offers new perspectives on the music of Arvo Pärt.
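The abstract describes tintinnabuli processes as precise, formalizable procedures. To illustrate the kind of rule involved, here is a sketch of one classic tintinnabuli mapping, the "first position below" T-voice, which pairs each melody (M-voice) note with the nearest triad tone strictly below it. The A minor triad and the MIDI-pitch encoding are illustrative assumptions, not details taken from the paper's analysis of Summa.

```python
# Pitch classes of an A minor triad (A, C, E) in MIDI pitch-class terms.
# Illustrative choice only; the paper's actual pitch material is not stated here.
TRIAD = {9, 0, 4}

def t_voice_below(m_pitch, triad=TRIAD):
    """First-position-below T-voice: the nearest triad tone strictly
    below the given M-voice MIDI pitch."""
    p = m_pitch - 1
    while p % 12 not in triad:
        p -= 1
    return p
```

For example, an M-voice middle C (MIDI 60) maps to the A below it (MIDI 57), since A is the nearest triad tone strictly below C. Chaining such rules over a systematically generated melody is what makes near-complete reconstruction of a score conceivable.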
[802] Diachronic Modeling of Tonal Coherence on the Tonnetz Across Classical and Popular Repertoires
Weilun Xu, Edward Hall, Martin Rohrmeier
Main category: cs.SD
TL;DR: Proposes a two-dimensional model for analyzing tonal coherence in music using tonal focus (concentration near tonal center) and tonal connection (structured intervallic pathways), finding distinct patterns between Western classical and popular music traditions.
Details
Motivation: Most computational measures analyze tonal coherence as a single dimension, lacking multi-dimensional analysis. The paper aims to develop a more nuanced understanding of how different musical traditions achieve tonal coherence through complementary measures.
Method: Develops a new model based on the Tonnetz concept, defining two partially independent measures: tonal focus (concentration of pitch content near a tonal center) and tonal connection (degree to which pitch content reflects structured intervallic pathways back to that center). Analyzes over 2,800 pieces from Western classical and popular traditions.
Result: Different traditions occupy overlapping yet distinguishable regions in the two-dimensional space. Popular music shows higher tonal focus, while classical music exhibits higher tonal connection. The measures provide quantitative evidence for stylistic differences.
Conclusion: The complementary measures ground differences between tonal styles in quantitative evidence and offer interpretable dimensions for computational music analysis and controllable generation.
Abstract: How do different musical traditions achieve tonal coherence? Most computational measures to date have analysed tonal coherence in terms of a single dimension, whereas multi-dimensional analyses have not been sufficiently explored. We propose a new model drawing on the concept of the Tonnetz. We define two partially independent measures: “tonal focus,” the concentration of pitch content near a tonal center; and “tonal connection,” the degree to which pitch content reflects structured intervallic pathways back to that center. Analyzing over 2,800 pieces from Western classical and popular traditions, we find that these traditions occupy overlapping yet distinguishable regions of the two-dimensional space. Popular music shows higher tonal focus, while classical music exhibits higher tonal connection. Our complementary measures ground the differences between tonal styles in quantitative evidence, and offer interpretable dimensions for computational music analysis and controllable generation.
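Both measures above are defined on the Tonnetz, a lattice in which pitch classes are laid out along perfect-fifth and major-third axes, so that pc ≡ 7x + 4y (mod 12) at coordinate (x, y). As a rough illustration of the idea behind "tonal focus" only (this is a crude proxy, not the paper's actual measure), one can embed each pitch class at a minimal Tonnetz coordinate and average its lattice distance from the tonal center:

```python
from itertools import product

def tonnetz_coord(pc):
    """Smallest (fifths, major-thirds) coordinate representing a pitch
    class on the Tonnetz, where pc = (7x + 4y) mod 12."""
    candidates = [(x, y) for x, y in product(range(-3, 4), repeat=2)
                  if (7 * x + 4 * y) % 12 == pc % 12]
    return min(candidates, key=lambda c: abs(c[0]) + abs(c[1]))

def tonal_focus_proxy(pitch_classes, center=0):
    """Crude stand-in for 'tonal focus': mean Tonnetz (Manhattan)
    distance of the observed pitch classes from the tonal center.
    Lower values mean pitch content is concentrated near the center."""
    cx, cy = tonnetz_coord(center)
    dists = [abs(x - cx) + abs(y - cy)
             for x, y in map(tonnetz_coord, pitch_classes)]
    return sum(dists) / len(dists)
```

A C major triad {C, G, E} relative to center C averages a distance of 2/3, since G and E each sit one lattice step from C; remote pitch classes would pull the average up.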
[803] Two-Stage Acoustic Adaptation with Gated Cross-Attention Adapters for LLM-Based Multi-Talker Speech Recognition
Hao Shi, Yuan Gao, Xugang Lu, Tatsuya Kawahara
Main category: cs.SD
TL;DR: Improving LLM-based multi-talker ASR by injecting talker-aware acoustic evidence through CTC-derived prefix prompting and gated residual cross-attention adapters with LoRA fine-tuning.
Details
Motivation: Current LLM-based multi-talker ASR systems degrade significantly in challenging conditions like three-talker mixtures due to insufficient acoustic grounding during decoding, where acoustic evidence is only injected through a projected prefix that can be lossy and misaligned with LLM input space.
Method: 1) Revisit CTC-derived prefix prompting with three variants of increasing acoustic content; 2) Propose lightweight gated residual cross-attention adapters; 3) Design two-stage acoustic adaptation framework using LoRA: Stage 1 inserts cross-attention adapters after self-attention to inject acoustic embeddings as external memory, Stage 2 refines both adapters and LLM’s self-attention projections via LoRA for improved robustness.
Result: Experiments on Libri2Mix/Libri3Mix under clean and noisy conditions show consistent gains, with particularly large improvements in three-talker settings compared to SOT-only baselines.
Conclusion: Explicit injection of talker-aware acoustic evidence through cross-attention adapters and LoRA-based adaptation significantly improves LLM-based multi-talker ASR performance, especially for challenging three-talker mixtures where prefix-only conditioning is insufficient.
Abstract: Large Language Models (LLMs) are strong decoders for Serialized Output Training (SOT) in two-talker Automatic Speech Recognition (ASR), yet their performance degrades substantially in challenging conditions such as three-talker mixtures. A key limitation is that current systems inject acoustic evidence only through a projected prefix, which can be lossy and imperfectly aligned with the LLM input space, providing insufficient fine-grained grounding during decoding. Addressing this limitation is crucial for robust multi-talker ASR, especially in three-talker mixtures. This paper improves LLM-based multi-talker ASR by explicitly injecting talker-aware acoustic evidence into the decoder. We first revisit Connectionist Temporal Classification (CTC)-derived prefix prompting and compare three variants with increasing acoustic content. The CTC information is obtained using the serialized CTC proposed in our previous works. While acoustic-enriched prompts outperform the SOT-only baseline, prefix-only conditioning remains inadequate for three-talker mixtures. We therefore propose a lightweight gated residual cross-attention adapter and design a two-stage acoustic adaptation framework based on low-rank updates (LoRA). In Stage 1, we insert gated cross-attention adapters after the self-attention sub-layer to stably inject acoustic embeddings as external memory. In Stage 2, we refine both the cross-attention adapters and the pretrained LLM’s self-attention projections using parameter-efficient LoRA, improving robustness for large backbones under limited data; the learned updates are merged into the base weights for inference. Experiments on Libri2Mix/Libri3Mix under clean and noisy conditions show consistent gains, with particularly large improvements in three-talker settings.
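A minimal sketch of the gated residual cross-attention idea, using single-head dot-product attention over tiny toy vectors. The shapes, the tanh gate, and the near-zero initialisation are assumptions for illustration, not the paper's implementation.

```python
import math

# Text hidden states attend over acoustic embeddings (keys = values);
# a scalar gate scales the attended output before the residual add.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention over lists of vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def gated_adapter(hidden, acoustic, gate):
    """Residual add of the tanh-gated cross-attention output."""
    attended = cross_attend(hidden, acoustic, acoustic)
    g = math.tanh(gate)
    return [[h + g * a for h, a in zip(hv, av)]
            for hv, av in zip(hidden, attended)]

hidden = [[0.1, 0.2], [0.3, -0.1]]    # two text positions, dim 2
acoustic = [[1.0, 0.0], [0.0, 1.0]]   # two acoustic "memory" frames
# With the gate at zero the adapter is exactly the identity, so training
# can start from the unmodified LLM and open the gate gradually.
assert gated_adapter(hidden, acoustic, 0.0) == hidden
```

The identity-at-initialisation property is the usual motivation for gating residual adapters: the backbone's behaviour is preserved until the adapter has learned something useful.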
[804] Can pre-trained Deep Learning models predict groove ratings?
Axel Marmoret, Nicolas Farrugia, Jan Alexander Stupacher
Main category: cs.SD
TL;DR: Deep learning models can predict groove perception from audio better than traditional features, with style-dependent patterns emerging across funk, pop, and rock genres.
Details
Motivation: To investigate whether deep learning models can effectively predict groove and related perceptual dimensions directly from audio signals, and to compare their performance against traditional handcrafted audio features.
Method: Evaluated seven state-of-the-art deep learning models for predicting groove ratings and responses to groove-related queries using audio embeddings. Compared these with traditional handcrafted audio features. Extended analysis to source-separated instruments to isolate contributions of individual musical elements.
Result: Found clear separation of groove characteristics driven by musical style (funk, pop, rock). Deep audio representations successfully encoded complex, style-dependent groove components that traditional features often missed.
Conclusion: Deep learning models demonstrate strong potential for capturing the multifaceted concept of groove, advancing predictive Music Information Retrieval through representation learning.
Abstract: This study explores the extent to which deep learning models can predict groove and its related perceptual dimensions directly from audio signals. We critically examine the effectiveness of seven state-of-the-art deep learning models in predicting groove ratings and responses to groove-related queries through the extraction of audio embeddings. Additionally, we compare these predictions with traditional handcrafted audio features. To better understand the underlying mechanics, we extend this methodology to analyze predictions based on source-separated instruments, thereby isolating the contributions of individual musical elements. Our analysis reveals a clear separation of groove characteristics driven by the underlying musical style of the tracks (funk, pop, and rock). These findings indicate that deep audio representations can successfully encode complex, style-dependent groove components that traditional features often miss. Ultimately, this work highlights the capacity of advanced deep learning models to capture the multifaceted concept of groove, demonstrating the strong potential of representation learning to advance predictive Music Information Retrieval methodologies.
[805] Unsupervised Evaluation of Deep Audio Embeddings for Music Structure Analysis
Axel Marmoret
Main category: cs.SD
TL;DR: Unsupervised evaluation of 9 pre-trained audio models for music structure analysis using barwise embeddings and three segmentation algorithms, finding that modern embeddings outperform traditional spectrograms but not systematically, with CBM being the most effective segmentation method.
Details
Motivation: Supervised deep learning methods for Music Structure Analysis (MSA) face bottlenecks due to the need for heavily annotated data and inherent structural ambiguities, prompting exploration of unsupervised approaches using pre-trained models.
Method: Extract barwise embeddings from nine open-source, generic pre-trained deep audio models, then segment them using three unsupervised algorithms: Foote’s checkerboard kernels, spectral clustering, and Correlation Block-Matching (CBM), focusing exclusively on boundary retrieval.
Result: Modern generic deep embeddings generally outperform traditional spectrogram-based baselines but not systematically; unsupervised boundary estimation outperforms recent linear probing baselines; CBM consistently emerges as the most effective segmentation method; standard evaluation metrics are artificially inflated.
Conclusion: Unsupervised approaches using pre-trained audio models show promise for MSA, with CBM being particularly effective, but evaluation standards need improvement through systematic adoption of “trimming” or “double trimming” annotations for more rigorous assessment.
Abstract: Music Structure Analysis (MSA) aims to uncover the high-level organization of musical pieces. State-of-the-art methods are often based on supervised deep learning, but these methods are bottlenecked by the need for heavily annotated data and inherent structural ambiguities. In this paper, we propose an unsupervised evaluation of nine open-source, generic pre-trained deep audio models, on MSA. For each model, we extract barwise embeddings and segment them using three unsupervised segmentation algorithms (Foote’s checkerboard kernels, spectral clustering, and Correlation Block-Matching (CBM)), focusing exclusively on boundary retrieval. Our results demonstrate that modern, generic deep embeddings generally outperform traditional spectrogram-based baselines, but not systematically. Furthermore, our unsupervised boundary estimation methodology generally yields stronger performance than recent linear probing baselines. Among the evaluated techniques, the CBM algorithm consistently emerges as the most effective downstream segmentation method. Finally, we highlight the artificial inflation of standard evaluation metrics and advocate for the systematic adoption of "trimming", or even "double trimming", annotations to establish more rigorous MSA evaluation standards.
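Foote's checkerboard-kernel novelty, one of the three boundary-retrieval methods compared, can be sketched on a toy self-similarity matrix. The feature values below are invented; the paper's pipeline operates on barwise deep embeddings.

```python
# Slide a checkerboard kernel along the diagonal of a self-similarity
# matrix; peaks in the resulting novelty curve mark section boundaries.

def self_similarity(feats):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return [[dot(a, b) for b in feats] for a in feats]

def foote_novelty(ssm, half=2):
    n = len(ssm)
    nov = [0.0] * n
    for t in range(half, n - half):
        score = 0.0
        for i in range(-half, half):
            for j in range(-half, half):
                # checkerboard: same-side quadrants count +, cross-side -
                sign = 1.0 if (i < 0) == (j < 0) else -1.0
                score += sign * ssm[t + i][t + j]
        nov[t] = score
    return nov

# Two homogeneous "sections" of four bars each, changing at bar 4
feats = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
nov = foote_novelty(self_similarity(feats))
assert max(range(len(nov)), key=nov.__getitem__) == 4
```

The novelty curve peaks exactly where within-section similarity on both sides of the diagonal is high and cross-section similarity is low, i.e. at the section boundary.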
[806] Investigation on the Robustness of Acoustic Foundation Models on Post Exercise Speech
Xiangyuan Xue, Yuyu Wang, Ruijie Yao, Xiaoyue Ni, Xiaofan Jiang, Jingping Nie
Main category: cs.SD
TL;DR: Benchmarking acoustic foundation models on post-exercise speech reveals performance degradation, with FunASR showing strongest baseline robustness and fine-tuning improving CTC-based models but not Whisper.
Details
Motivation: ASR has been extensively studied on neutral/stationary speech, but robustness under post-exercise physiological shifts (micro-breaths, unstable phonation, repetitions) remains underexplored, even though these shifts make transcription more difficult.
Method: Benchmark acoustic foundation models on post-exercise speech using unified evaluation protocol. Compare sequence-to-sequence models (Whisper, FunASR/Paraformer) and self-supervised encoders with CTC decoding (Wav2Vec2, HuBERT, WavLM) under off-the-shelf inference and post-exercise in-domain fine-tuning. Analyze results stratified by fluent/non-fluent speakers.
Result: Most models degrade on post-exercise speech; FunASR shows strongest baseline robustness (14.57% WER, 8.21% CER on Post-All). Fine-tuning substantially improves CTC-based models but Whisper shows unstable adaptation. Non-fluent subset consistently more challenging than fluent subset.
Conclusion: Post-exercise ASR robustness is model-dependent; in-domain adaptation can be effective but not uniformly stable; future studies should separate fluency-related effects from exercise-induced speech variation.
Abstract: Automatic speech recognition (ASR) has been extensively studied on neutral and stationary speech, yet its robustness under post-exercise physiological shift remains underexplored. Compared with resting speech, post-exercise speech often contains micro-breaths, non-semantic pauses, unstable phonation, and repetitions caused by reduced breath support, making transcription more difficult. In this work, we benchmark acoustic foundation models on post-exercise speech under a unified evaluation protocol. We compare sequence-to-sequence models (Whisper and FunASR/Paraformer) and self-supervised encoders with CTC decoding (Wav2Vec2, HuBERT, and WavLM), under both off-the-shelf inference and post-exercise in-domain fine-tuning. Across the Static/Post-All benchmark, most models degrade on post-exercise speech, while FunASR shows the strongest baseline robustness at 14.57% WER and 8.21% CER on Post-All. Fine-tuning substantially improves several CTC-based models, whereas Whisper shows unstable adaptation. As an exploratory case study, we further stratify results by fluent and non-fluent speakers; although the non-fluent subset is small, it is consistently more challenging than the fluent subset. Overall, our findings show that post-exercise ASR robustness is strongly model-dependent, that in-domain adaptation can be highly effective but not uniformly stable, and that future post-exercise ASR studies should explicitly separate fluency-related effects from exercise-induced speech variation.
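The WER figures reported above are the standard word-level edit distance; a minimal self-contained version (a generic reference implementation, not the paper's evaluation code) looks like:

```python
# WER = (substitutions + insertions + deletions) / reference length,
# computed as Levenshtein distance over word sequences.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)

# A post-exercise-style transcript: one repetition plus one substitution
assert wer("i went for a run", "i i went for a jog") == 0.4
```

CER is the same computation over characters instead of words, which is why the two metrics can diverge on repetition-heavy post-exercise speech.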
[807] Advancing Multi-Instrument Music Transcription: Results from the 2025 AMT Challenge
Ojas Chaturvedi, Kayshav Bhardwaj, Tanay Gondil, Benjamin Shiue-Hal Chou, Kristen Yeon-Ji Yun, Yung-Hsiang Lu, Yujia Yan, Sungkyun Chang
Main category: cs.SD
TL;DR: The 2025 Automatic Music Transcription Challenge benchmarked multi-instrument transcription systems, with two teams outperforming the baseline MT3 model, showing progress but highlighting remaining challenges in polyphony and timbre variation.
Details
Motivation: To benchmark progress in multi-instrument automatic music transcription through an online competition, identify state-of-the-art approaches, and highlight remaining challenges in the field.
Method: Organized an online competition (AMT Challenge 2025) where eight teams submitted valid solutions for multi-instrument transcription, comparing them against the baseline MT3 model.
Result: Two teams outperformed the baseline MT3 model, demonstrating advances in transcription accuracy, but significant challenges remain in handling polyphony and timbre variation.
Conclusion: Future challenges should focus on broader genre coverage and stronger emphasis on instrument detection to advance the field of automatic music transcription.
Abstract: This paper presents the results of the 2025 Automatic Music Transcription (AMT) Challenge, an online competition to benchmark progress in multi-instrument transcription. Eight teams submitted valid solutions; two outperformed the baseline MT3 model. The results highlight both advances in transcription accuracy and the remaining difficulties in handling polyphony and timbre variation. We conclude with directions for future challenges: broader genre coverage and stronger emphasis on instrument detection.
[808] A General Model for Deepfake Speech Detection: Diverse Bonafide Resources or Diverse AI-Based Generators
Lam Pham, Khoi Vu, Dat Tran, David Fischinger, Simon Freitter, Marcel Hasenbalg, Davide Antonutti, Alexander Schindler, Martin Boyer, Ian McLoughlin
Main category: cs.SD
TL;DR: Analysis of how bonafide resource and AI generator factors affect deepfake speech detection model performance and generality, with a balanced dataset proposal to improve cross-dataset generalization.
Details
Motivation: The paper aims to understand key factors affecting deepfake speech detection model performance and generality, specifically examining how bonafide resource (BR) and AI-based generator (AG) factors influence detection thresholds and cross-dataset generalization.
Method: 1) Propose baseline deep-learning model for DSD; 2) Conduct experiments analyzing BR and AG factors on detection thresholds; 3) Create balanced dataset reusing public DSD datasets with balanced BR/AG distribution; 4) Train various deep-learning models on proposed dataset; 5) Perform cross-dataset evaluation on benchmark datasets.
Result: Experimental results show BR and AG factors significantly affect detection thresholds. Cross-dataset evaluation proves that balancing BR and AG in training data is crucial for achieving generalizable deepfake speech detection models.
Conclusion: Balance between bonafide resources and AI-based generators in training data is the key factor for training generalizable deepfake speech detection models that perform well across different datasets.
Abstract: In this paper, we analyze two main factors of Bonafide Resource (BR) or AI-based Generator (AG) which affect the performance and the generality of a Deepfake Speech Detection (DSD) model. To this end, we first propose a deep-learning based model, referred to as the baseline. Then, we conducted experiments on the baseline by which we indicate how Bonafide Resource (BR) and AI-based Generator (AG) factors affect the threshold score used to detect fake or bonafide input audio in the inference process. Given the experimental results, a dataset, which re-uses public Deepfake Speech Detection (DSD) datasets and shows a balance between Bonafide Resource (BR) or AI-based Generator (AG), is proposed. We then train various deep-learning based models on the proposed dataset and conduct cross-dataset evaluation on different benchmark datasets. The cross-dataset evaluation results prove that the balance of Bonafide Resources (BR) and AI-based Generators (AG) is the key factor to train and achieve a general Deepfake Speech Detection (DSD) model.
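One common way to pick the bonafide/fake decision threshold the paper analyzes is an equal-error-rate style criterion. The sketch below uses invented scores and a hypothetical helper; it is not the paper's procedure, just the standard idea of sweeping thresholds until false-acceptance and false-rejection rates meet.

```python
# Sweep candidate thresholds over pooled scores; keep the threshold
# where false-acceptance rate (FAR) and false-rejection rate (FRR)
# are closest. Higher score = more bonafide-like (invented convention).

def eer_threshold(bonafide_scores, fake_scores):
    best_gap, best_t = None, None
    for t in sorted(bonafide_scores + fake_scores):
        far = sum(s >= t for s in fake_scores) / len(fake_scores)         # fakes accepted
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)  # bonafide rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_t = gap, t
    return best_t

bonafide = [0.9, 0.8, 0.75, 0.6]
fake = [0.4, 0.3, 0.55, 0.2]
t = eer_threshold(bonafide, fake)
assert all(s >= t for s in bonafide) and not any(s >= t for s in fake)
```

The paper's point is that where this threshold lands depends on which bonafide resources and AI generators produced the scores, which is why BR/AG balance matters for generalization.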
[809] Foundation Models for Bioacoustics – a Comparative Review
Raphael Schwinger, Paria Vali Zadeh, Lukas Rauch, Mats Kurz, Tom Hauschild, Sam Lapp, Sven Tomforde
Main category: cs.SD
TL;DR: Comprehensive review of large-scale pretrained bioacoustic foundation models, evaluating their transferability across bioacoustic classification tasks with empirical analysis on BEANS and BirdSet benchmarks.
Details
Motivation: Automated bioacoustic analysis is crucial for biodiversity monitoring and conservation, requiring adaptable deep learning models that can handle diverse bioacoustic tasks. There's a need to understand which foundation models perform best across different bioacoustic classification scenarios.
Method: Systematic review of bioacoustic foundation models analyzing pretraining data, preprocessing, augmentations, architecture, and training paradigms. Extensive empirical study on BEANS and BirdSet benchmarks evaluating generalizability under linear and attentive probing strategies.
Result: Perch 2.0 achieves highest BirdSet score and strongest linear probing on BEANS; BirdMAE is best among probing-based strategies on BirdSet and second on BEANS; attentive probing benefits transformer-based models; general-purpose AudioSet models outperform specialized bird sound models on BEANS with attentive probing.
Conclusion: The findings provide practical guidance for selecting appropriate bioacoustic foundation models for adaptation to new classification tasks via probing, highlighting the importance of model architecture, training data diversity, and evaluation strategies.
Abstract: Automated bioacoustic analysis is essential for biodiversity monitoring and conservation, requiring advanced deep learning models that can adapt to diverse bioacoustic tasks. This article presents a comprehensive review of large-scale pretrained bioacoustic foundation models and systematically investigates their transferability across multiple bioacoustic classification tasks. We overview bioacoustic representation learning by analysing pretraining data sources and benchmarks. On this basis, we review bioacoustic foundation models, dissecting the models’ training data, preprocessing, augmentations, architecture, and training paradigm. Additionally, we conduct an extensive empirical study of selected models on the BEANS and BirdSet benchmarks, evaluating generalisability under linear and attentive probing. Our experimental analysis reveals that Perch 2.0 achieves the highest BirdSet score (restricted evaluation) and the strongest linear probing result on BEANS, building on diverse multi-taxa supervised pretraining; that BirdMAE is the best model among probing-based strategies on BirdSet and second on BEANS after BEATs$_{NLM}$, the encoder of NatureLM-audio; that attentive probing is beneficial to extract the full performance of transformer-based models; and that general-purpose audio models trained with self-supervised learning on AudioSet outperform many specialised bird sound models on BEANS when evaluated with attentive probing. These findings provide valuable guidance for practitioners selecting appropriate models to adapt them to new bioacoustic classification tasks via probing.
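The contrast between linear and attentive probing can be sketched as two pooling strategies over frame embeddings. The vectors and the fixed query below are invented for illustration; real probes learn the query and a classifier on top of the pooled representation.

```python
import math

# Linear probing mean-pools frame embeddings before a linear classifier;
# attentive probing weights frames with a (learned) query first.

def mean_pool(frames):
    """Uniform pooling: every frame counts equally."""
    d = len(frames[0])
    return [sum(f[j] for f in frames) / len(frames) for j in range(d)]

def attentive_pool(frames, query):
    """Softmax-attention pooling: the query emphasises informative frames."""
    scores = [sum(q * x for q, x in zip(query, f)) for f in frames]
    m = max(scores)
    ws = [math.exp(s - m) for s in scores]
    z = sum(ws)
    d = len(frames[0])
    return [sum((w / z) * f[j] for w, f in zip(ws, frames)) for j in range(d)]

# A single loud "call" frame among background frames: attentive pooling
# preserves far more of the event signal than uniform mean pooling.
frames = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0]]  # last frame is the event
query = [1.0, 0.0]                             # probe attends to dim 0
assert attentive_pool(frames, query)[0] > mean_pool(frames)[0]
```

This is one intuition for why attentive probing helps transformer models on sparse bioacoustic events: mean pooling dilutes short calls across many background frames.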
[810] Constructing Composite Features for Interpretable Music-Tagging
Chenhao Xue, Weitao Hu, Joyraj Chakraborty, Zhijin Guo, Kang Li, Tianyu Shi, Martin Reed, Nikolaos Thomos
Main category: cs.SD
TL;DR: A Genetic Programming approach for evolving interpretable composite audio features for music tagging, combining multiple base features mathematically to improve performance while maintaining transparency.
Details
Motivation: Deep learning-based feature fusion methods for music tagging lack interpretability, making it difficult to understand which feature interactions are beneficial. There's a need for methods that can capture synergistic interactions between audio features while preserving transparency.
Method: Proposes a Genetic Programming pipeline that automatically evolves composite features by mathematically combining base music features. The approach uses evolutionary algorithms to search for optimal feature combinations, applying parsimony pressure to prefer simpler expressions.
Result: Experiments on MTG-Jamendo and GTZAN datasets show consistent improvements over state-of-the-art systems across different base feature sets. Most performance gains occur within the first few hundred GP evaluations. Evolved expressions include linear, nonlinear, and conditional forms with low complexity.
Conclusion: The GP approach provides representational benefits similar to deep feature fusion while maintaining interpretability. Analysis of evolved composite features reveals beneficial interactions and transformations that remain opaque in black-box deep models.
Abstract: Combining multiple audio features can improve the performance of music tagging, but common deep learning-based feature fusion methods often lack interpretability. To address this problem, we propose a Genetic Programming (GP) pipeline that automatically evolves composite features by mathematically combining base music features, thereby capturing synergistic interactions while preserving interpretability. This approach provides representational benefits similar to deep feature fusion without sacrificing interpretability. Experiments on the MTG-Jamendo and GTZAN datasets demonstrate consistent improvements compared to state-of-the-art systems across base feature sets at different abstraction levels. It should be noted that most of the performance gains are noticed within the first few hundred GP evaluations, indicating that effective feature combinations can be identified under modest search budgets. The top evolved expressions include linear, nonlinear, and conditional forms, with various low-complexity solutions at top performance aligned with parsimony pressure to prefer simpler expressions. Analyzing these composite features further reveals which interactions and transformations tend to be beneficial for tagging, offering insights that remain opaque in black-box deep models.
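A toy version of the evolutionary search can make the idea concrete. Everything below is an assumption for illustration: a tiny operator set, random expression generation in place of full crossover/mutation, invented base features, and a fitness that subtracts a parsimony penalty; the paper's pipeline is more elaborate.

```python
import random

# GP-style composite-feature search: expression trees over base
# features, scored against a target tag value with parsimony pressure.

random.seed(0)

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def evaluate(expr, row):
    if isinstance(expr, str):                 # leaf = base feature name
        return row[expr]
    op, left, right = expr
    return OPS[op](evaluate(left, row), evaluate(right, row))

def size(expr):
    return 1 if isinstance(expr, str) else 1 + size(expr[1]) + size(expr[2])

def random_expr(features, depth=2):
    if depth == 0 or random.random() < 0.3:
        return random.choice(features)
    return (random.choice(list(OPS)),
            random_expr(features, depth - 1), random_expr(features, depth - 1))

def fitness(expr, data, target, parsimony=0.01):
    err = sum((evaluate(expr, row) - t) ** 2
              for row, t in zip(data, target)) / len(target)
    return -(err + parsimony * size(expr))    # parsimony prefers small trees

features = ["tempo", "energy"]
data = [{"tempo": t, "energy": e} for t, e in [(1, 2), (2, 1), (3, 3), (2, 2)]]
target = [row["tempo"] * row["energy"] for row in data]  # the "true" composite

# Base features alone plus 200 random composites; keep the fittest.
candidates = list(features) + [random_expr(features) for _ in range(200)]
best = max(candidates, key=lambda e: fitness(e, data, target))
assert fitness(best, data, target) >= fitness("tempo", data, target)
```

Because the winning expression is an explicit tree, it can be read off directly, which is the interpretability advantage over black-box fusion that the paper emphasizes.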
[811] EvA: An Evidence-First Audio Understanding Paradigm for LALMs
Xinyuan Xie, Shunian Chen, Zhiheng Liu, Yuhao Zhang, Zhiqiang Lv, Liyin Liang, Benyou Wang
Main category: cs.SD
TL;DR: EvA introduces a dual-path architecture combining Whisper and CED-Base with non-compressive fusion to address evidence bottleneck in audio understanding, achieving state-of-the-art perception scores.
Details
Motivation: Large Audio Language Models struggle in complex acoustic scenes due to an evidence bottleneck: they fail to preserve task-relevant acoustic evidence before reasoning begins. Current systems show larger deficits in evidence extraction than downstream reasoning, indicating that the main limitation is in upstream perception rather than reasoning policy.
Method: Proposes EvA (Evidence-First Audio), a dual-path architecture combining Whisper and CED-Base through non-compressive, time-aligned fusion. First aggregates intermediate CED layers to preserve multi-scale acoustic cues, then aligns aggregated CED features to the Whisper timeline and adds the two streams without changing sequence length. Also builds EvA-Perception dataset with 54K event-ordered captions and 500K QA pairs.
Result: Under unified zero-shot protocol, EvA achieves best open-source Perception scores on MMAU, MMAR, and MMSU benchmarks. Improves over Kimi-Audio-7B on all reported metrics, with largest gains on perception-heavy splits.
Conclusion: Results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning. The evidence bottleneck is a key limitation in current audio language models.
Abstract: Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We call this failure the evidence bottleneck: state-of-the-art systems show larger deficits in evidence extraction than in downstream reasoning, suggesting that the main limitation lies in upstream perception rather than reasoning policy. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that combines Whisper and CED-Base through non-compressive, time-aligned fusion. EvA first aggregates intermediate CED layers to preserve multi-scale acoustic cues, then aligns the aggregated CED features to the Whisper timeline and adds the two streams without changing sequence length. We also build EvA-Perception, a large-scale open-source training set with about 54K event-ordered captions (150 h) and about 500K QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception scores on MMAU, MMAR, and MMSU, and improves over Kimi-Audio-7B on all reported metrics, with the largest gains on perception-heavy splits. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning.
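The non-compressive fusion step described in the abstract (align one stream to the other's timeline, then add) can be sketched with linear interpolation on toy features. Layer aggregation and any projection layers are omitted, and the interpolation scheme is an assumption for illustration.

```python
# Interpolate the coarser CED-style stream onto the Whisper-style
# timeline, then add element-wise, leaving sequence length unchanged.

def resample(stream, target_len):
    """Linear interpolation of a list of feature vectors to target_len frames."""
    n, d = len(stream), len(stream[0])
    out = []
    for t in range(target_len):
        pos = t * (n - 1) / (target_len - 1) if target_len > 1 else 0.0
        i = min(int(pos), n - 2)
        frac = pos - i
        out.append([(1 - frac) * stream[i][j] + frac * stream[i + 1][j]
                    for j in range(d)])
    return out

def fuse(whisper_feats, ced_feats):
    """Align CED features to the Whisper timeline, then add."""
    aligned = resample(ced_feats, len(whisper_feats))
    return [[a + b for a, b in zip(w, c)]
            for w, c in zip(whisper_feats, aligned)]

whisper_feats = [[1.0], [1.0], [1.0], [1.0]]  # 4 frames (finer rate)
ced_feats = [[0.0], [2.0]]                    # 2 frames (coarser rate)
fused = fuse(whisper_feats, ced_feats)
assert len(fused) == len(whisper_feats)       # sequence length preserved
```

Because the fused sequence keeps the Whisper frame count, no acoustic frames are compressed away before the language model sees them, which is the "non-compressive" part of the design.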
[812] Audio Language Model for Deepfake Detection Grounded in Acoustic Chain-of-Thought
Runkun Chen, Yixiong Fang, Pengyu Chang, Yuante Li, Massa Baali, Bhiksha Ramakrishnan
Main category: cs.SD
TL;DR: CoLMbo-DF is a feature-guided audio language model that combines deepfake speech detection with explicit acoustic chain-of-thought reasoning, using structured textual representations of acoustic features to improve both detection accuracy and interpretability.
Details
Motivation: Current deepfake speech detection systems are limited to binary classification without interpretable reasoning, failing to leverage structured acoustic evidence like prosodic, spectral, and physiological attributes in a meaningful way.
Method: Integrates robust deepfake detection with explicit acoustic chain-of-thought reasoning by injecting structured textual representations of low-level acoustic features directly into model prompts, grounding reasoning in interpretable evidence.
Result: The method significantly outperforms existing audio language model baselines despite using a lightweight open-source language model, demonstrating improved detection accuracy and explainability.
Conclusion: CoLMbo-DF represents a significant advancement in explainable deepfake speech detection by combining detection with interpretable acoustic reasoning through structured feature integration.
Abstract: Deepfake speech detection systems are often limited to binary classification tasks and struggle to generate interpretable reasoning or provide context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to leverage structured acoustic evidence such as prosodic, spectral, and physiological attributes in a meaningful manner. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model’s reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio pairs paired with chain-of-thought annotations. Experiments show that our method, trained on a lightweight open-source language model, significantly outperforms existing audio language model baselines despite its smaller scale, marking a significant advancement in explainable deepfake speech detection.
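The idea of injecting structured textual representations of acoustic features into the prompt can be sketched as follows. The feature names, thresholds, and prompt wording below are all invented for illustration, not CoLMbo-DF's actual format.

```python
# Map low-level acoustic measurements to short textual evidence
# statements, then prepend them to the reasoning prompt.

def describe(features):
    """Turn measurements into human-readable evidence (thresholds invented)."""
    parts = []
    if features["f0_std_hz"] < 10:
        parts.append("pitch variation is unusually flat")
    if features["jitter_pct"] < 0.3:
        parts.append("jitter is below typical physiological levels")
    if features["spectral_rolloff_hz"] > 7000:
        parts.append("high-frequency energy extends beyond typical speech")
    return "; ".join(parts) if parts else "no salient anomalies"

def build_prompt(features):
    return ("Acoustic evidence: " + describe(features) + ". "
            "Reason step by step about whether this audio is bonafide or spoofed.")

prompt = build_prompt({"f0_std_hz": 4.2, "jitter_pct": 0.1,
                       "spectral_rolloff_hz": 7500})
assert "pitch variation" in prompt
```

Grounding the prompt this way means the model's chain of thought can cite concrete measurements rather than only latent embeddings, which is the interpretability gain the paper targets.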
[813] Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models
Kyudan Jung, Jihwan Kim, Soyoon Kim, Jeonghoon Kim, Jaegul Choo, Cheonbok Park
Main category: cs.SD
TL;DR: A robust open-source data processing pipeline for full-duplex speech language models to address scarcity of high-quality multi-speaker conversational data
Details
Motivation: As AI shifts from text-based LLMs to Speech Language Models (SLMs), there's growing demand for full-duplex systems capable of real-time, natural human-computer interaction. Development is constrained by scarcity of high-quality multi-speaker conversational data, as existing resources are predominantly single-speaker or limited in volume. Complex dynamics of natural dialogue (overlapping, back-channeling) remain challenging, with standard processing pipelines suffering from diarization errors and ASR hallucinations.
Method: Presents a robust and scalable open-source data processing pipeline specifically designed for full-duplex models to address data scarcity and processing challenges in multi-speaker conversational settings.
Result: Not specified in the abstract, but presumably enables creation of better training data for full-duplex SLMs by addressing diarization errors, ASR hallucinations, and handling complex dialogue dynamics.
Conclusion: The pipeline bridges the gap in data processing for full-duplex speech language models, addressing critical challenges in multi-speaker conversational data preparation.
Abstract: As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping and back-channeling, remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.
[814] MOSS-VoiceGenerator: Create Realistic Voices with Natural Language Descriptions
Kexin Huang, Liwei Fan, Botian Jiang, Yaozhou Jiang, Qian Tu, Jie Zhu, Yuqian Zhang, Yiwei Zhao, Chenchen Yang, Zhaoye Fei, Shimin Li, Xiaogui Yang, Qinyuan Cheng, Xipeng Qiu
Main category: cs.SD
TL;DR: MOSS-VoiceGenerator is an open-source voice generation model that creates speaker timbres from natural language prompts, trained on expressive cinematic speech data to produce more natural-sounding voices compared to studio-trained models.
Details
Motivation: Existing voice design models are trained on clean studio data, producing speech that lacks the natural, lived-in qualities of real human voices. The authors aim to create more perceptually natural voices by training on real-world expressive speech data from cinematic content.
Method: The model is an instruction-driven voice generation system that creates new timbres directly from natural language prompts. It’s trained on large-scale expressive speech data sourced from cinematic content rather than carefully recorded studio data.
Result: Subjective preference studies demonstrate MOSS-VoiceGenerator’s superiority in overall performance, instruction-following, and naturalness compared to other voice design models.
Conclusion: Training on real-world expressive speech data from cinematic content produces more perceptually natural voices than studio-trained models, advancing controllable voice creation for applications like storytelling, game dubbing, and conversational assistants.
Abstract: Voice design from natural language aims to generate speaker timbres directly from free-form textual descriptions, allowing users to create voices tailored to specific roles, personalities, and emotions. Such controllable voice creation benefits a wide range of downstream applications-including storytelling, game dubbing, role-play agents, and conversational assistants, making it a significant task for modern Text-to-Speech models. However, existing models are largely trained on carefully recorded studio data, which produces speech that is clean and well-articulated, yet lacks the lived-in qualities of real human voices. To address these limitations, we present MOSS-VoiceGenerator, an open-source instruction-driven voice generation model that creates new timbres directly from natural language prompts. Motivated by the hypothesis that exposure to real-world acoustic variation produces more perceptually natural voices, we train on large-scale expressive speech data sourced from cinematic content. Subjective preference studies demonstrate its superiority in overall performance, instruction-following, and naturalness compared to other voice design models.
[815] On the Usefulness of Diffusion-Based Room Impulse Response Interpolation to Microphone Array Processing
Sagi Della Torre, Mirco Pezzoli, Fabio Antonacci, Sharon Gannot
Main category: cs.SD
TL;DR: Diffusion-based inpainting framework for Room Impulse Response interpolation improves multi-microphone array processing and works on real-world data.
Details
Motivation: Room Impulse Response (RIR) estimation is crucial for spatial audio processing and speech enhancement, but existing methods need improvement for practical multi-microphone array applications.
Method: Extends a previously introduced diffusion-based inpainting framework for RIR interpolation to enhance multi-microphone array processing tasks.
Result: Demonstrates applicability to practical multi-microphone array processing tasks and validates robustness in interpolating real-world Room Impulse Responses.
Conclusion: Diffusion-based inpainting is effective for RIR interpolation and enhances practical audio array processing systems.
Abstract: Room Impulse Responses estimation is a fundamental problem in spatial audio processing and speech enhancement. In this paper, we build upon our previously introduced diffusion-based inpainting framework for Room Impulse Response interpolation and demonstrate its applicability to enhancing the performance of practical multi-microphone array processing tasks. Furthermore, we validate the robustness of this method in interpolating real-world Room Impulse Responses.
[816] Membership Inference Attacks against Large Audio Language Models
Jia-Kai Dong, Yu-Xiang Lin, Hung-Yi Lee
Main category: cs.SD
TL;DR: First systematic evaluation of Membership Inference Attacks on Large Audio Language Models reveals that audio’s non-semantic information causes severe train/test distribution shifts, leading to spurious MIA performance. A multi-modal blind baseline shows speech datasets have near-perfect train/test separability even without model inference, and standard MIA scores correlate strongly with acoustic artifacts.
Details
Motivation: The motivation is to establish a principled standard for auditing Large Audio Language Models (LALMs) beyond spurious correlations. Audio encodes non-semantic information that induces severe train and test distribution shifts, which can lead to misleading MIA performance evaluations. There is a need for systematic MIA evaluation of LALMs to understand their memorization patterns and privacy risks.
Method: The method involves: 1) Creating a multi-modal blind baseline using textual, spectral, and prosodic features to evaluate train/test separability without model inference; 2) Using distribution-matched datasets to enable reliable MIA evaluation without distribution shift confounds; 3) Benchmarking multiple MIA methods on these datasets; 4) Conducting modality disentanglement experiments to understand cross-modal memorization patterns.
Result: Results show: 1) Common speech datasets exhibit near-perfect train/test separability (AUC ≈ 1.0) even without model inference; 2) Standard MIA scores strongly correlate with blind acoustic artifacts (correlation > 0.7); 3) Distribution-matched datasets enable reliable MIA evaluation; 4) LALM memorization is cross-modal, arising only from binding a speaker’s vocal identity with its text.
Conclusion: The conclusion establishes that LALM memorization is cross-modal and occurs only when binding a speaker’s vocal identity with their text. The findings provide a principled standard for auditing LALMs beyond spurious correlations, highlighting the importance of using distribution-matched datasets for reliable MIA evaluation in audio-language models.
Abstract: We present the first systematic Membership Inference Attack (MIA) evaluation of Large Audio Language Models (LALMs). As audio encodes non-semantic information, it induces severe train and test distribution shifts and can lead to spurious MIA performance. Using a multi-modal blind baseline based on textual, spectral, and prosodic features, we demonstrate that common speech datasets exhibit near-perfect train/test separability (AUC approximately 1.0) even without model inference, and the standard MIA scores strongly correlate with these blind acoustic artifacts (correlation greater than 0.7). Using this blind baseline, we identify that distribution-matched datasets enable reliable MIA evaluation without distribution shift confounds. We benchmark multiple MIA methods and conduct modality disentanglement experiments on these datasets. The results reveal that LALM memorization is cross-modal, arising only from binding a speaker’s vocal identity with its text. These findings establish a principled standard for auditing LALMs beyond spurious correlations.
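As a rough illustration of the paper's blind-baseline finding, a single split-correlated acoustic feature can yield near-perfect AUC with no model inference at all. This sketch is not the paper's baseline (which combines textual, spectral, and prosodic features); the feature and all numbers below are hypothetical toy values:

```python
def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member
    (ties count half): the standard ROC-AUC for a scalar score."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# A "blind" acoustic feature (say, average loudness) that happens to
# differ between the train and test splits. No model is queried, yet
# the splits are perfectly separable: any MIA score correlated with
# this feature would look spuriously strong.
train_split_feature = [0.81, 0.78, 0.90, 0.85, 0.88]  # hypothetical values
test_split_feature = [0.32, 0.41, 0.28, 0.35, 0.44]

print(auc(train_split_feature, test_split_feature))  # -> 1.0
```

Distribution-matched datasets remove exactly this confound, so any remaining separability must come from the model itself.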
[817] A Probabilistic Generative Model for Spectral Speech Enhancement
Marco Hidalgo-Araya, Raphaël Trésor, Bart Van Erp, Wouter W. L. Nuijten, Thijs Van De Laar, Bert De Vries
Main category: cs.SD
TL;DR: A Bayesian probabilistic framework for adaptive speech enhancement in hearing aids that replaces fixed parameters with continuous learning through variational inference.
Details
Motivation: Current hearing aid speech enhancement algorithms use fixed, manually tuned parameters that cannot adapt to different users or changing acoustic environments, limiting their effectiveness in real-world nonstationary conditions.
Method: Proposes a unified modular framework using Bayesian inference with explicit uncertainty tracking. Formulates signal processing, learning, and personalization as probabilistic inference in a state-space model. Uses variational message passing in RxInfer.jl for real-time Bayesian processing under hearing-aid constraints.
Result: Proof-of-concept experiments on VoiceBank+DEMAND corpus show competitive speech quality and noise reduction with only 85 effective parameters, demonstrating data-efficient performance.
Conclusion: The framework provides an interpretable, data-efficient foundation for uncertainty-aware, adaptive hearing-aid processing that can continuously learn through probabilistic inference, pointing toward more intelligent, personalized hearing devices.
Abstract: Speech enhancement in hearing aids remains a difficult task in nonstationary acoustic environments, mainly because current signal processing algorithms rely on fixed, manually tuned parameters that cannot adapt in situ to different users or listening contexts. This paper introduces a unified modular framework that formulates signal processing, learning, and personalization as Bayesian inference with explicit uncertainty tracking. The proposed framework replaces ad hoc algorithm design with a single probabilistic generative model that continuously adapts to changing acoustic conditions and user preferences. It extends spectral subtraction with principled mechanisms for in-situ personalization and adaptation to acoustic context. The system is implemented as an interconnected probabilistic state-space model, and inference is performed via variational message passing in the RxInfer.jl probabilistic programming environment, enabling real-time Bayesian processing under hearing-aid constraints. Proof-of-concept experiments on the VoiceBank+DEMAND corpus show competitive speech quality and noise reduction with 85 effective parameters. The framework provides an interpretable, data-efficient foundation for uncertainty-aware, adaptive hearing-aid processing and points toward devices that learn continuously through probabilistic inference.
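The framework extends classical spectral subtraction; the fixed-parameter operation it builds on can be sketched as follows (toy magnitude spectra with hypothetical values, not the paper's Bayesian model):

```python
def spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, floor=0.05):
    """Classical spectral subtraction on magnitude spectra: subtract a
    per-bin noise estimate, with a spectral floor (a fraction of the
    noisy magnitude) to avoid negative values and limit musical noise."""
    return [max(y - alpha * n, floor * y) for y, n in zip(noisy_mag, noise_mag)]

# Toy per-bin magnitudes (hypothetical numbers): the last bin is pure
# noise, so subtraction bottoms out at the spectral floor.
noisy = [1.0, 0.8, 0.3, 0.2]
noise = [0.2, 0.2, 0.2, 0.2]
print(spectral_subtraction(noisy, noise))
```

In the paper's formulation, fixed knobs like `alpha` and `floor` become latent variables inferred continuously from the signal instead of being hand-tuned.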
[818] Enhancing Automatic Chord Recognition via Pseudo-Labeling and Knowledge Distillation
Nghia Phan, Rong Jin, Gang Liu, Xiao Dong
Main category: cs.SD
TL;DR: Two-stage training pipeline for Automatic Chord Recognition using pre-trained models and unlabeled audio with pseudo-labeling and knowledge distillation
Details
Motivation: Automatic Chord Recognition faces data scarcity issues because aligned chord labels are costly to acquire, while pre-trained models are increasingly accessible even though their training data is often proprietary.
Method: Two-stage pipeline: 1) Use pre-trained BTC model as teacher to generate pseudo-labels for 1,000+ hours of unlabeled audio, train student model on pseudo-labels; 2) Continual training on ground-truth labels with selective knowledge distillation as regularizer to prevent catastrophic forgetting.
Result: BTC student achieves 99% of teacher’s performance with pseudo-labels only; after stage 2, surpasses supervised baseline by 2.5% and teacher by 1.1-3.2%; 2E1D student achieves 97% of teacher with pseudo-labels, improves baseline by 2.67% and matches teacher performance; both show large gains on rare chord qualities
Conclusion: Proposed two-stage training effectively leverages pre-trained models and unlabeled audio to overcome data scarcity in ACR, achieving state-of-the-art performance with significant improvements on rare chord qualities
Abstract: Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, as well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we present a two-stage training pipeline that leverages pre-trained models together with unlabeled audio. The proposed method decouples training into two stages. In the first stage, we use a pre-trained BTC model as a teacher to generate pseudo-labels for over 1,000 hours of diverse unlabeled audio and train a student model solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available. To prevent catastrophic forgetting of the representations learned in the first stage, we apply selective knowledge distillation (KD) from the teacher as a regularizer. In our experiments, two models (BTC, 2E1D) were used as students. In stage 1, using only pseudo-labels, the BTC student achieves over 99% of the teacher’s performance, while the 2E1D model achieves about 97% across seven standard mir_eval metrics. After a single training run for both students in stage 2, the resulting BTC student model surpasses the traditional supervised learning baseline by 2.5% and the original pre-trained teacher model by 1.1-3.2% across all metrics. The resulting 2E1D student model improves over the traditional supervised learning baseline by 2.67% on average and achieves almost the same performance as the teacher. Both cases show large gains on rare chord qualities.
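The stage-2 objective, ground-truth loss plus a KD regularizer against the teacher, can be sketched in simplified form (the paper's KD is selective; this unselective version and all probabilities below are illustrative only):

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the ground-truth class."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def stage2_loss(student_probs, teacher_probs, label, lam=0.5):
    """Ground-truth cross-entropy plus a KD term that keeps the student
    close to the teacher, guarding against forgetting stage-1 knowledge."""
    return cross_entropy(student_probs, label) + lam * kl_divergence(
        teacher_probs, student_probs
    )

student = [0.7, 0.2, 0.1]  # hypothetical chord-class posteriors
teacher = [0.6, 0.3, 0.1]
print(stage2_loss(student, teacher, label=0))
```

When the student drifts far from the teacher's distribution, the KL term grows, pulling it back toward the pseudo-label representations learned in stage 1.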
[819] Acoustic Overspecification in Electronic Dance Music Taxonomy
Weilun Xu, Tianhao Dai, Oscar Goudet, Xiaoxuan Wang
Main category: cs.SD
TL;DR: Unsupervised clustering reveals EDM has only ~20 natural acoustic families, suggesting commercial taxonomy is overspecified by nearly half.
Details
Motivation: Current EDM classification relies on industry-defined taxonomies, but it's unclear whether these commercial distinctions reflect genuine acoustic differences. The paper aims to discover the natural acoustic structure of EDM independent of commercial labels.
Method: Proposes an unsupervised approach using: 1) systematic construction of a tailored, interpretable acoustic feature space capturing EDM’s defining production techniques, spectral textures, and layered rhythmic patterns; 2) validation against state-of-the-art pre-trained audio embeddings (MERT and CLAP) to ensure findings reflect inherent acoustic structure rather than feature engineering artifacts.
Result: Across both bespoke feature space and pre-trained embeddings, clustering consistently identifies 20 or fewer natural acoustic families, suggesting current commercial EDM taxonomy is acoustically overspecified by nearly one-half.
Conclusion: EDM’s natural acoustic structure is simpler than commercial taxonomy suggests, with only about 20 distinct acoustic families rather than the many subgenres defined by industry labels.
Abstract: Electronic Dance Music (EDM) classification typically relies on industry-defined taxonomies, with current supervised approaches naturally assuming the validity of prescribed subgenre labels. However, whether these commercial distinctions reflect genuine acoustic differences remains largely unexplored. In this paper, we propose an unsupervised approach to discover the natural acoustic structure of EDM independent of commercial labels. To address the historical lack of EDM-specific feature design in MIR, we systematically construct a tailored, interpretable acoustic feature space capturing the genre’s defining production techniques, spectral textures, and layered rhythmic patterns. To ensure our findings reflect inherent acoustic structure rather than feature engineering artifacts, we validate our clustering against state-of-the-art pre-trained audio embeddings (MERT and CLAP). Across both our bespoke feature space and the pre-trained embeddings, clustering consistently identifies 20 or fewer natural acoustic families – suggesting current commercial EDM taxonomy is acoustically overspecified by nearly one-half.
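The cluster-discovery step rests on standard unsupervised clustering; a minimal k-means sketch on toy 2-D features (hypothetical values, not the paper's feature space or its cluster-count selection procedure) shows the idea of recovering acoustic families without labels:

```python
def kmeans(points, k, iters=20):
    """Plain k-means with deterministic init (first k points as centers)."""
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign, centers

# Two obvious "acoustic families" in a toy 2-D feature space
# (e.g. tempo vs. spectral centroid; values are made up).
tracks = [(0.1, 0.2), (0.15, 0.25), (0.12, 0.18),
          (0.9, 0.8), (0.85, 0.95), (0.88, 0.82)]
labels, _ = kmeans(tracks, k=2)
print(labels)  # -> [0, 0, 0, 1, 1, 1]
```

The paper's claim is, in effect, that raising k beyond roughly 20 stops producing genuinely distinct families in either its bespoke features or the MERT/CLAP embeddings.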
cs.LG
[820] Mitigating Forgetting in Continual Learning with Selective Gradient Projection
Anika Singh, Aayush Dhaulakhandi, Varun Chopade, Likhith Malipati, David Martinez, Kevin Zhu
Main category: cs.LG
TL;DR: SFAO is a dynamic optimization method for continual learning that selectively controls gradient updates to manage catastrophic forgetting while maintaining plasticity, achieving competitive accuracy with 90% memory reduction.
Details
Motivation: Neural networks deployed in dynamic environments suffer from catastrophic forgetting: overwriting previously learned knowledge when adapting to new tasks, causing severe performance degradation on earlier tasks. Current continual learning methods often have high memory costs or fail to balance plasticity and stability effectively.
Method: Selective Forgetting-Aware Optimization (SFAO) regulates gradient directions via cosine similarity and per-layer gating. It uses a tunable mechanism with efficient Monte Carlo approximation to selectively project, accept, or discard updates, enabling controlled forgetting while balancing plasticity and stability.
Result: Experiments on standard continual learning benchmarks show SFAO achieves competitive accuracy with markedly lower memory cost (90% reduction) and improved forgetting on MNIST datasets, making it suitable for resource-constrained scenarios.
Conclusion: SFAO provides an effective solution for continual learning that manages catastrophic forgetting while maintaining efficiency, offering practical benefits for resource-constrained deployment scenarios.
Abstract: As neural networks are increasingly deployed in dynamic environments, they face the challenge of catastrophic forgetting, the tendency to overwrite previously learned knowledge when adapting to new tasks, resulting in severe performance degradation on earlier tasks. We propose Selective Forgetting-Aware Optimization (SFAO), a dynamic method that regulates gradient directions via cosine similarity and per-layer gating, enabling controlled forgetting while balancing plasticity and stability. SFAO selectively projects, accepts, or discards updates using a tunable mechanism with efficient Monte Carlo approximation. Experiments on standard continual learning benchmarks show that SFAO achieves competitive accuracy with markedly lower memory cost, a 90% reduction, and improved forgetting on MNIST datasets, making it suitable for resource-constrained scenarios.
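The project/accept/discard gating can be illustrated per layer with plain cosine similarity. This is a sketch of the general idea only: the thresholds are hypothetical, and SFAO's Monte Carlo approximation and tuning mechanism are omitted.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (
        math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    )

def gate_update(grad_new, grad_old, accept_thr=0.0, discard_thr=-0.9):
    """Per-layer gating of a new-task gradient against an old-task
    reference direction: accept aligned updates unchanged, discard
    near-antiparallel ones, and otherwise project out the component
    that opposes the old task."""
    cos = cosine(grad_new, grad_old)
    if cos >= accept_thr:
        return grad_new                   # accept: no conflict
    if cos <= discard_thr:
        return [0.0] * len(grad_new)      # discard: severe conflict
    # Project: subtract the component along the old-task direction.
    scale = sum(a * b for a, b in zip(grad_new, grad_old)) / sum(
        b * b for b in grad_old
    )
    return [a - scale * b for a, b in zip(grad_new, grad_old)]

print(gate_update([1.0, 0.0], [0.0, 1.0]))   # orthogonal -> accepted as-is
print(gate_update([-0.5, 1.0], [1.0, 0.0]))  # conflicting -> projected
```

The projected update keeps only the part of the gradient that does not undo earlier tasks, which is how plasticity and stability are traded off.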
[821] Boundary-aware Prototype-driven Adversarial Alignment for Cross-Corpus EEG Emotion Recognition
Guangli Li, Canbiao Wu, Na Tian, Li Zhang, Zhen Liang
Main category: cs.LG
TL;DR: A prototype-driven adversarial alignment framework for cross-corpus EEG emotion recognition that addresses domain shift through local class-conditional alignment, contrastive regularization, and boundary-aware optimization.
Details
Motivation: EEG-based emotion recognition suffers from severe performance degradation when models are transferred across heterogeneous datasets due to physiological variability, experimental paradigm differences, and device inconsistencies. Existing domain adversarial methods primarily enforce global marginal alignment and overlook class-conditional mismatch and decision boundary distortion.
Method: Proposes a unified Prototype-driven Adversarial Alignment (PAA) framework with three progressive configurations: PAA-L for prototype-guided local class-conditional alignment; PAA-C adding contrastive semantic regularization; and PAA-M integrating dual relation-aware classifiers within a three-stage adversarial optimization scheme for boundary-aware refinement.
Result: Extensive experiments on SEED, SEED-IV, and SEED-V datasets demonstrate state-of-the-art performance under four cross-corpus evaluation protocols, with average improvements of 6.72%, 5.59%, 6.69%, and 4.83%, respectively. The framework also generalizes effectively to clinical depression identification scenarios.
Conclusion: The proposed framework reformulates emotion recognition as a relation-driven representation learning problem, reducing sensitivity to label noise and improving cross-domain stability. It demonstrates robustness in real-world heterogeneous settings and offers a comprehensive solution for cross-corpus EEG emotion recognition.
Abstract: Electroencephalography (EEG)-based emotion recognition suffers from severe performance degradation when models are transferred across heterogeneous datasets due to physiological variability, experimental paradigm differences, and device inconsistencies. Existing domain adversarial methods primarily enforce global marginal alignment and often overlook class-conditional mismatch and decision boundary distortion, limiting cross-corpus generalization. In this work, we propose a unified Prototype-driven Adversarial Alignment (PAA) framework for cross-corpus EEG emotion recognition. The framework is progressively instantiated in three configurations: PAA-L, which performs prototype-guided local class-conditional alignment; PAA-C, which further incorporates contrastive semantic regularization to enhance intra-class compactness and inter-class separability; and PAA-M, the full boundary-aware configuration that integrates dual relation-aware classifiers within a three-stage adversarial optimization scheme to explicitly refine controversial samples near decision boundaries. By combining prototype-guided subdomain alignment, contrastive discriminative enhancement, and boundary-aware aggregation within a coherent adversarial architecture, the proposed framework reformulates emotion recognition as a relation-driven representation learning problem, reducing sensitivity to label noise and improving cross-domain stability. Extensive experiments on SEED, SEED-IV, and SEED-V demonstrate state-of-the-art performance under four cross-corpus evaluation protocols, with average improvements of 6.72%, 5.59%, 6.69%, and 4.83%, respectively. Furthermore, the proposed framework generalizes effectively to clinical depression identification scenarios, validating its robustness in real-world heterogeneous settings. The source code is available at https://github.com/WuCB-BCI/PAA
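Prototype-guided alignment starts from per-class prototypes, the mean feature vector of each emotion class; a minimal sketch of prototype computation and nearest-prototype assignment (toy features and labels, not the paper's adversarial training):

```python
def class_prototypes(features, labels):
    """Mean feature vector per class: the prototype each class's
    samples are pulled toward during alignment."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + x for s, x in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def nearest_prototype(f, protos):
    """Assign a (possibly unlabeled target-domain) sample to the class
    whose prototype is closest in squared Euclidean distance."""
    return min(protos, key=lambda y: sum((a - b) ** 2 for a, b in zip(f, protos[y])))

# Toy source-domain EEG features with two emotion classes.
src_feats = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
src_labels = ["neutral", "neutral", "happy", "happy"]
protos = class_prototypes(src_feats, src_labels)
print(nearest_prototype([0.8, 0.8], protos))  # -> happy
```

Class-conditional alignment then matches target samples to these prototypes instead of aligning only the global marginal distribution.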
[822] Learning to Select Visual In-Context Demonstrations
Eugene Lee, Yu-Chi Lin, Jiajie Diao
Main category: cs.LG
TL;DR: LSD (Learning to Select Demonstrations) uses RL to optimize demonstration selection for multimodal LLMs in visual in-context learning, outperforming kNN on factual regression tasks.
Details
Motivation: Current kNN-based demonstration selection for MLLMs is suboptimal for complex factual regression tasks, as it selects redundant examples that fail to capture the full output range, limiting in-context learning effectiveness.
Method: Reframe selection as sequential decision-making and train a Reinforcement Learning agent (Dueling DQN with query-centric Transformer Decoder) to construct optimal demonstration sets that maximize MLLM downstream performance.
Result: LSD significantly outperforms baselines on objective, factual regression tasks across five visual regression benchmarks, while kNN remains optimal for subjective preference tasks. LSD better defines regression boundaries by balancing visual relevance with diversity.
Conclusion: Learned demonstration selection (LSD) is strictly necessary for visual ICL on factual regression tasks, illuminating when sophisticated selection methods are required versus when simple kNN suffices.
Abstract: Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task’s full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.
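The kNN baseline that LSD is compared against can be sketched as follows (toy embeddings, hypothetical values); note how similarity-first selection picks redundant near-duplicates of the query, which is exactly the failure mode the paper targets:

```python
import math

def knn_select(query, candidates, k):
    """Similarity-first selection: return indices of the k candidates
    with highest cosine similarity to the query embedding."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    ranked = sorted(
        range(len(candidates)),
        key=lambda i: cos(query, candidates[i]),
        reverse=True,
    )
    return ranked[:k]

# Candidates 0 and 1 are near-duplicates of the query; candidate 2 is
# different. kNN picks the redundant pair, sacrificing output diversity.
query = [1.0, 0.0]
cands = [[0.99, 0.1], [0.98, 0.12], [0.0, 1.0]]
print(knn_select(query, cands, k=2))  # -> [0, 1]
```

LSD's RL agent instead scores candidate sets by downstream MLLM performance, trading some similarity for coverage of the output range.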
[823] TED: Training-Free Experience Distillation for Multimodal Reasoning
Shuozhi Yuan, Jinqing Wang, Zihao Liu, Miaomiao Yuan, Haoran Peng, Jin Zhao, Bingwen Wang, Haoyi Wang
Main category: cs.LG
TL;DR: TED is a training-free knowledge distillation framework that transfers teacher knowledge through in-context experiences in prompts rather than parameter updates, with compression to manage experience growth.
Details
Motivation: Traditional knowledge distillation requires repeated parameter updates and large training data, limiting applicability in resource-constrained environments. The authors aim to develop a training-free approach that transfers knowledge through contextual experiences instead of parameter optimization.
Method: TED uses context-based distillation where the student generates multiple reasoning trajectories, the teacher produces solutions, and generalized experiences capturing effective reasoning patterns are extracted and injected into the student’s prompt. An experience compression mechanism tracks usage statistics to selectively merge, rewrite, or remove low-utility experiences to prevent unbounded growth and noise accumulation.
Result: On multimodal reasoning benchmarks MathVision and VisualPuzzles, TED consistently improves performance. On MathVision, it raised Qwen3-VL-8B from 0.627 to 0.702, and on VisualPuzzles from 0.517 to 0.561 with just 100 training samples. Under low-data, no-update settings, TED achieves performance competitive with fully trained parameter-based distillation while reducing training cost by over 5x.
Conclusion: Meaningful knowledge transfer can be achieved through contextual experience without parameter updates, offering a resource-efficient alternative to traditional distillation methods, particularly valuable for multimodal reasoning tasks.
Abstract: Knowledge distillation is typically realized by transferring a teacher model’s knowledge into a student’s parameters through supervised or reinforcement-based optimization. While effective, such approaches require repeated parameter updates and large-scale training data, limiting their applicability in resource-constrained environments. In this work, we propose TED, a training-free, context-based distillation framework that shifts the update target of distillation from model parameters to an in-context experience injected into the student’s prompt. For each input, the student generates multiple reasoning trajectories, while a teacher independently produces its own solution. The teacher then compares the student trajectories with its reasoning and the ground-truth answer, extracting generalized experiences that capture effective reasoning patterns. These experiences are continuously refined and updated over time. A key challenge of context-based distillation is unbounded experience growth and noise accumulation. TED addresses this with an experience compression mechanism that tracks usage statistics and selectively merges, rewrites, or removes low-utility experiences. Experiments on multimodal reasoning benchmarks MathVision and VisualPuzzles show that TED consistently improves performance. On MathVision, TED raises the performance of Qwen3-VL-8B from 0.627 to 0.702, and on VisualPuzzles from 0.517 to 0.561 with just 100 training samples. Under this low-data, no-update setting, TED achieves performance competitive with fully trained parameter-based distillation while reducing training cost by over 5x, demonstrating that meaningful knowledge transfer can be achieved through contextual experience.
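The experience compression mechanism can be sketched as a bounded store that evicts by usage count. This is a simplification: the paper's mechanism also merges and rewrites experiences, and all strings below are hypothetical.

```python
class ExperienceStore:
    """Bounded store of reasoning experiences: usage statistics decide
    which entries survive compression (least-used entries are dropped;
    TED's merge/rewrite operations are omitted in this sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # experience text -> times it proved useful

    def add(self, text):
        self.entries.setdefault(text, 0)
        if len(self.entries) > self.capacity:
            self.compress()

    def use(self, text):
        """Record that an experience helped on the current input."""
        if text in self.entries:
            self.entries[text] += 1

    def compress(self):
        worst = min(self.entries, key=self.entries.get)
        del self.entries[worst]

store = ExperienceStore(capacity=2)
store.add("check units before substituting")
store.use("check units before substituting")
store.add("enumerate cases for parity puzzles")
store.use("enumerate cases for parity puzzles")
store.add("draw the figure first")  # exceeds capacity, triggers compression
print(sorted(store.entries))
```

The surviving entries are exactly what gets injected into the student's prompt, so bounding and pruning them keeps the context useful instead of noisy.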
[824] A Step Toward Federated Pretraining of Multimodal Large Language Models
Baochen Xiong, Yifan Xu, Xiaoshan Yang, Yaguang Song, Yaowei Wang, Changsheng Xu
Main category: cs.LG
TL;DR: Fed-CMP: A federated learning framework for multimodal LLM pre-training that collaboratively trains cross-modal projectors while freezing vision encoders and LLMs, addressing parameter interference and gradient oscillations.
Details
Motivation: MLLM development is limited by scarce public multimodal data, while private data remains inaccessible due to privacy concerns. Federated learning could unlock distributed resources, but existing work focuses on fine-tuning, leaving pre-training unexplored.
Method: Proposes the Fed-CMP framework with two key components: 1) Canonical Reliability-Aware Aggregation, which constructs a canonical space to decompose client projectors into a shared alignment basis and client-specific coefficients with reliability-weighted fusion; 2) Orthogonality-Preserved Momentum, which applies momentum to the shared alignment basis via orthogonal projection to accumulate historical optimization directions while preserving geometric structure.
Result: Extensive experiments on four federated pre-training scenarios based on public datasets show Fed-CMP significantly outperforms existing baselines.
Conclusion: Fed-CMP successfully addresses challenges in federated MLLM pre-training, enabling collaborative training of cross-modal projectors while mitigating parameter interference and gradient oscillations.
Abstract: The rapid evolution of Multimodal Large Language Models (MLLMs) is bottlenecked by the saturation of high-quality public data, while vast amounts of diverse multimodal data remain inaccessible in privacy-sensitive silos. Federated Learning (FL) offers a promising solution to unlock these distributed resources, but existing research focuses predominantly on fine-tuning, leaving the foundational pre-training phase largely unexplored. In this paper, we formally introduce the Federated MLLM Alignment (Fed-MA) task, a lightweight pre-training paradigm that freezes the vision encoder and LLM while collaboratively training the cross-modal projector. We identify two critical challenges in this setting: (i) parameter interference in aggregating local projectors; and (ii) gradient oscillations in one-pass collaborative SGD. To address these challenges, we propose Fed-CMP, a pioneering framework for federated MLLM pre-training. Fed-CMP employs Canonical Reliability-Aware Aggregation, which constructs a canonical space to decompose client projectors into a shared alignment basis and client-specific coefficients, then performs reliability-weighted fusion to suppress parameter interference. Furthermore, Fed-CMP introduces Orthogonality-Preserved Momentum, which applies momentum to the shared alignment basis via orthogonal projection, accumulating historical optimization directions while preserving geometric structure. We construct four federated pre-training scenarios based on public datasets, and extensive experiments validate that Fed-CMP significantly outperforms existing baselines.
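The reliability-weighted fusion step can be illustrated directly on raw projector matrices; the canonical-space decomposition into a shared basis and client coefficients is omitted here, and the matrices and reliability scores are toy values:

```python
def reliability_weighted_fusion(client_mats, reliabilities):
    """Fuse client projector matrices as a reliability-weighted average:
    more reliable clients contribute more to the aggregated projector."""
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    rows, cols = len(client_mats[0]), len(client_mats[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for mat, w in zip(client_mats, weights):
        for i in range(rows):
            for j in range(cols):
                fused[i][j] += w * mat[i][j]
    return fused

# Two toy 2x2 client projectors; client 0 is three times as reliable,
# so the fused result sits closer to it.
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[0.0, 1.0], [1.0, 0.0]]
print(reliability_weighted_fusion([a, b], reliabilities=[3.0, 1.0]))
# -> [[0.75, 0.25], [0.25, 0.75]]
```

A plain unweighted average would let a noisy client drag the shared projector off course, which is the parameter-interference problem Fed-CMP's aggregation is built to suppress.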
[825] Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints
Jelena Markovic-Voronov, Kayhan Behdin, Yuanda Xu, Zhengze Zhou, Zhipeng Wang, Rahul Mazumder
Main category: cs.LG
TL;DR: Batch-level routing framework for LLMs that optimizes model assignment per batch while respecting cost and GPU constraints, with robust variant for performance uncertainty.
Details
Motivation: Prior per-query routing methods fail to control batch-level costs under non-uniform or adversarial batching scenarios, necessitating a more robust approach that considers resource constraints.
Method: Proposes a batch-level, resource-aware routing framework that jointly optimizes model assignment for each batch while respecting cost and model capacity limits. Includes a robust variant accounting for uncertainty in predicted LLM performance, and an offline instance allocation procedure for balancing quality and throughput.
Result: Robustness improves accuracy by 1-14% over non-robust counterparts, batch-level routing outperforms per-query methods by up to 24% under adversarial batching, and optimized instance allocation yields additional gains of up to 3% compared to non-optimized allocation, all while controlling cost and GPU constraints.
Conclusion: Batch-level routing with robustness considerations and optimized instance allocation provides significant improvements over per-query methods while maintaining strict resource constraints, offering a practical solution for efficient LLM deployment.
Abstract: We study the problem of routing queries to large language models (LLMs) under cost, GPU resources, and concurrency constraints. Prior per-query routing methods often fail to control batch-level cost, especially under non-uniform or adversarial batching. To address this, we propose a batch-level, resource-aware routing framework that jointly optimizes model assignment for each batch while respecting cost and model capacity limits. We further introduce a robust variant that accounts for uncertainty in predicted LLM performance, along with an offline instance allocation procedure that balances quality and throughput across multiple models. Experiments on two multi-task LLM benchmarks show that robustness improves accuracy by 1-14% over non-robust counterparts (depending on the performance estimator), batch-level routing outperforms per-query methods by up to 24% under adversarial batching, and optimized instance allocation yields additional gains of up to 3% compared to a non-optimized allocation, all while strictly controlling cost and GPU resource constraints.
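A greedy stand-in shows the shape of the batch-level assignment problem under cost and capacity limits. The paper optimizes the batch jointly rather than greedily, and all quality, cost, and capacity numbers below are hypothetical:

```python
def route_batch(qualities, costs, budget, capacities):
    """Greedily assign each query in the batch to the best-quality model
    that still fits the remaining batch budget and per-model capacity.
    Returns the assignment (None = no feasible model) and total spend."""
    n_models = len(capacities)
    load = [0] * n_models
    spent = 0.0
    assignment = []
    for q_row in qualities:
        ranked = sorted(range(n_models), key=lambda m: q_row[m], reverse=True)
        chosen = None
        for m in ranked:
            if load[m] < capacities[m] and spent + costs[m] <= budget:
                chosen = m
                break
        assignment.append(chosen)
        if chosen is not None:
            load[chosen] += 1
            spent += costs[chosen]
    return assignment, spent

# 3 queries, 2 models (model 0 = large/expensive, model 1 = small/cheap).
# Only one query can afford the large model; capacity caps it at 1 anyway.
qualities = [[0.9, 0.6], [0.8, 0.7], [0.95, 0.5]]  # predicted per-query quality
assignment, spent = route_batch(
    qualities, costs=[1.0, 0.25], budget=1.5, capacities=[1, 10]
)
print(assignment, spent)  # -> [0, 1, 1] 1.5
```

A per-query router with no batch state would send every query to model 0 and blow the budget; tracking `spent` and `load` across the batch is what keeps cost and capacity constraints enforceable.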
[826] MemGuard-Alpha: Detecting and Filtering Memorization-Contaminated Signals in LLM-Based Financial Forecasting via Membership Inference and Cross-Model Disagreement
Anisha Roy, Dip Roy
Main category: cs.LG
TL;DR: MemGuard-Alpha: A post-generation framework using membership inference attacks and cross-model disagreement to filter out memorized financial data from LLM-generated alpha signals, improving out-of-sample performance.
Details
Motivation: LLMs used for financial alpha signals often memorize historical data, creating look-ahead bias that produces spurious in-sample accuracy but poor out-of-sample performance. Existing solutions are expensive or cause information loss, creating the need for practical, zero-cost signal filtering.
Method: Two algorithms: 1) MemGuard Composite Score (MCS) combines five membership inference attack methods with temporal proximity features via logistic regression; 2) Cross-Model Memorization Disagreement (CMMD) exploits variation in training cutoff dates across different LLMs to separate memorized signals from genuine reasoning.
Result: CMMD achieves Sharpe ratio of 4.11 vs 2.76 for unfiltered signals (49% improvement). Clean signals produce 14.48 bps average daily return vs 2.13 bps for tainted signals (7x difference). Shows crossover pattern: in-sample accuracy rises with contamination (40.8% to 52.5%) while out-of-sample accuracy falls (47% to 42%).
Conclusion: MemGuard-Alpha provides effective, practical filtering of memorized financial data from LLM-generated signals, significantly improving out-of-sample performance and demonstrating that memorization inflates apparent accuracy at the cost of generalization.
Abstract: Large language models (LLMs) are increasingly used to generate financial alpha signals, yet growing evidence shows that LLMs memorize historical financial data from their training corpora, producing spurious predictive accuracy that collapses out-of-sample. This memorization-induced look-ahead bias threatens the validity of LLM-based quantitative strategies. Prior remedies – model retraining and input anonymization – are either prohibitively expensive or introduce significant information loss. No existing method offers practical, zero-cost signal-level filtering for real-time trading. We introduce MemGuard-Alpha, a post-generation framework comprising two algorithms: (i) the MemGuard Composite Score (MCS), which combines five membership inference attack (MIA) methods with temporal proximity features via logistic regression, achieving Cohen’s d = 18.57 for contamination separation (d = 0.39-1.37 using MIA features alone); and (ii) Cross-Model Memorization Disagreement (CMMD), which exploits variation in training cutoff dates across LLMs to separate memorized signals from genuine reasoning. Evaluated across seven LLMs (124M-7B parameters), 50 S&P 100 stocks, 42,800 prompts, and five MIA methods over 5.5 years (2019-2024), CMMD achieves a Sharpe ratio of 4.11 versus 2.76 for unfiltered signals (49% improvement). Clean signals produce 14.48 bps average daily return versus 2.13 bps for tainted signals (7x difference). A striking crossover pattern emerges: in-sample accuracy rises with contamination (40.8% to 52.5%) while out-of-sample accuracy falls (47% to 42%), providing direct evidence that memorization inflates apparent accuracy at the cost of generalization.
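The CMMD idea — trusting a signal only when models whose training cutoff precedes the trade date (and thus could not have memorized it) agree with later-cutoff models — can be sketched as a simple post-generation filter. Everything below (the dict interface, the majority-vote rule) is a hypothetical illustration, not the paper's implementation:

```python
from datetime import date

def cmmd_filter(signals, cutoffs, trade_date):
    """Cross-Model Memorization Disagreement (hypothetical sketch).

    signals: dict model_name -> +1/-1 directional signal for trade_date
    cutoffs: dict model_name -> training-data cutoff date
    Models whose cutoff precedes trade_date cannot have memorized that
    period; if they disagree with later-cutoff models, the signal is
    flagged as likely memorization and dropped.
    """
    clean = [signals[m] for m in signals if cutoffs[m] < trade_date]
    risky = [signals[m] for m in signals if cutoffs[m] >= trade_date]
    if not clean or not risky:
        return None  # cannot form both groups; abstain
    vote = lambda xs: 1 if sum(xs) > 0 else -1
    if vote(clean) == vote(risky):
        return vote(clean)   # agreement -> treat as genuine reasoning
    return None              # disagreement -> likely memorization, drop
```

In this reading, filtering is zero-cost at inference: it only compares signals already produced by models with different cutoffs.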
[827] Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings
Gesina Schwalbe, Mert Keser, Moritz Bayerkuhnlein, Edgar Heinert, Annika Mütze, Marvin Keller, Sparsh Tiwari, Georgii Mikriukov, Diedrich Wolter, Jae Hee Lee, Matthias Rottmann
Main category: cs.LG
TL;DR: A framework to explain, verify, and align semantic hierarchies in vision-language models (VLMs) by extracting binary hierarchies from class centroids, quantifying plausibility against human ontologies, and aligning embeddings to desired hierarchies.
Details
Motivation: While VLMs like CLIP enable strong retrieval and zero-shot classification, the semantic organization of their shared embedding space is rarely inspected. There's a need to understand and improve the semantic hierarchies induced by these models.
Method: 1) Extract binary hierarchy via agglomerative clustering of class centroids and name internal nodes using dictionary-based matching to a concept bank. 2) Quantify plausibility by comparing extracted trees against human ontologies using tree- and edge-level consistency measures. 3) Evaluate utility via explainable hierarchical tree-traversal inference with uncertainty-aware early stopping. 4) Propose ontology-guided post-hoc alignment using lightweight embedding-space transformation with UMAP to generate target neighborhoods.
Result: Across 13 pretrained VLMs and 4 image datasets, image encoders are more discriminative while text encoders induce hierarchies that better match human taxonomies. Reveals persistent trade-off between zero-shot accuracy and ontological plausibility.
Conclusion: The framework provides practical routes to improve semantic alignment in shared embedding spaces, addressing the trade-off between accuracy and plausibility in vision-language models.
Abstract: Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework to explain, verify, and align the semantic hierarchies induced by a VLM over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name internal nodes by dictionary-based matching to a concept bank. Second, we quantify plausibility by comparing the extracted tree against human ontologies using efficient tree- and edge-level consistency measures, and we evaluate utility via explainable hierarchical tree-traversal inference with uncertainty-aware early stopping (UAES). Third, we propose an ontology-guided post-hoc alignment method that learns a lightweight embedding-space transformation, using UMAP to generate target neighborhoods from a desired hierarchy. Across 13 pretrained VLMs and 4 image datasets, our method finds systematic modality differences: image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies. Overall, the results reveal a persistent trade-off between zero-shot accuracy and ontological plausibility and suggest practical routes to improve semantic alignment in shared embedding spaces.
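The first step, extracting a binary hierarchy by agglomerative clustering of class centroids, can be sketched in a few lines. This uses average linkage over cluster means for brevity; the paper's exact linkage criterion and the concept-bank node-naming step are omitted:

```python
import math

def agglomerate(centroids):
    """Build a binary hierarchy over class centroids (illustrative sketch;
    average linkage via size-weighted cluster means, not necessarily the
    paper's exact procedure)."""
    # each cluster: (tree, mean, size); tree is a class name or a (left, right) pair
    clusters = [(name, vec, 1) for name, vec in centroids.items()]
    while len(clusters) > 1:
        # find the closest pair of cluster means
        pairs = [(math.dist(a[1], b[1]), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        _, i, j = min(pairs)
        a, b = clusters[i], clusters[j]
        merged_mean = tuple((x * a[2] + y * b[2]) / (a[2] + b[2])
                            for x, y in zip(a[1], b[1]))
        merged = ((a[0], b[0]), merged_mean, a[2] + b[2])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]
```

The returned nested tuple is the binary tree whose internal nodes the framework would then name against a concept bank.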
[828] Gaussian Joint Embeddings For Self-Supervised Representation Learning
Yongchao Huang
Main category: cs.LG
TL;DR: Proposes Gaussian Joint Embeddings (GJE) and Gaussian Mixture Joint Embeddings (GMJE) as probabilistic alternatives to deterministic self-supervised learning, enabling principled uncertainty estimation and better handling of multi-modal inverse problems.
Details
Motivation: Deterministic self-supervised methods have limitations in multi-modal inverse problems where they collapse toward conditional averages and require architectural asymmetries to prevent collapse. There's a need for probabilistic approaches that can handle genuine multi-modality and provide uncertainty estimates.
Method: Introduces GJE and GMJE which model joint density of context and target representations using Gaussian and Gaussian mixture models. Replaces black-box prediction with closed-form conditional inference under explicit probabilistic models. Addresses Mahalanobis Trace Trap failure mode with several remedies: prototype-based GMJE, conditional Mixture Density Networks (GMJE-MDN), Growing Neural Gas (GMJE-GNG), and Sequential Monte Carlo memory bank.
Result: GMJE recovers complex conditional structure, learns competitive discriminative representations, and defines latent densities better suited to unconditional sampling than deterministic or unimodal baselines. Shows standard contrastive learning can be interpreted as degenerate non-parametric limiting case of GMJE framework.
Conclusion: Probabilistic joint modeling via GMJE provides principled approach to multi-modal representation learning with uncertainty estimation, overcoming limitations of deterministic methods and enabling better handling of complex conditional structures.
Abstract: Self-supervised representation learning often relies on deterministic predictive architectures to align context and target views in latent space. While effective in many settings, such methods are limited in genuinely multi-modal inverse problems, where squared-loss prediction collapses towards conditional averages, and they frequently depend on architectural asymmetries to prevent representation collapse. In this work, we propose a probabilistic alternative based on generative joint modeling. We introduce Gaussian Joint Embeddings (GJE) and its multi-modal extension, Gaussian Mixture Joint Embeddings (GMJE), which model the joint density of context and target representations and replace black-box prediction with closed-form conditional inference under an explicit probabilistic model. This yields principled uncertainty estimates and a covariance-aware objective for controlling latent geometry. We further identify a failure mode of naive empirical batch optimization, which we term the Mahalanobis Trace Trap, and develop several remedies spanning parametric, adaptive, and non-parametric settings, including prototype-based GMJE, conditional Mixture Density Networks (GMJE-MDN), topology-adaptive Growing Neural Gas (GMJE-GNG), and a Sequential Monte Carlo (SMC) memory bank. In addition, we show that standard contrastive learning can be interpreted as a degenerate non-parametric limiting case of the GMJE framework. Experiments on synthetic multi-modal alignment tasks and vision benchmarks show that GMJE recovers complex conditional structure, learns competitive discriminative representations, and defines latent densities that are better suited to unconditional sampling than deterministic or unimodal baselines.
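The closed-form conditional inference that GJE substitutes for black-box prediction is standard Gaussian conditioning: for a joint Gaussian over stacked [context; target] representations, p(target | context) is again Gaussian. A minimal sketch (the function name and blocking convention are ours):

```python
import numpy as np

def gaussian_conditional(mu, Sigma, d_c, c):
    """Closed-form conditional p(t | c) of a joint Gaussian over [c; t].

    mu, Sigma: joint mean (d,) and covariance (d, d); the first d_c
    dimensions are the context block. Illustrative of the inference GJE
    performs in place of black-box prediction.
    """
    mu_c, mu_t = mu[:d_c], mu[d_c:]
    S_cc = Sigma[:d_c, :d_c]
    S_tc = Sigma[d_c:, :d_c]
    S_tt = Sigma[d_c:, d_c:]
    K = S_tc @ np.linalg.inv(S_cc)          # regression coefficients
    cond_mean = mu_t + K @ (c - mu_c)
    cond_cov = S_tt - K @ S_tc.T            # uncertainty shrinks with context
    return cond_mean, cond_cov
```

GMJE extends this to a mixture, so a multi-modal target yields a multi-modal conditional instead of collapsing to one conditional mean.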
[829] DSO: Dual-Scale Neural Operators for Stable Long-term Fluid Dynamics Forecasting
Huanshuo Dong, Hao Wu, Hong Wang, Qin-Yi Zhang, Zhezheng Hao
Main category: cs.LG
TL;DR: DSO is a dual-scale neural operator that separates local and global information processing for improved long-term fluid dynamics forecasting, reducing errors by 88% compared to existing methods.
Details
Motivation: Existing neural operators struggle with long-term stability and precision in fluid dynamics forecasting due to two fundamental failure modes: local detail blurring (loss of fine-scale structures like vortex cores) and global trend deviation (drift from ground truth during extended rollouts). These failures occur because current architectures treat local and global information uniformly despite their different evolution characteristics in physical systems.
Method: Proposes Dual-Scale Neural Operator (DSO) that explicitly decouples information processing into two complementary modules: (1) depthwise separable convolutions for fine-grained local feature extraction, and (2) an MLP-Mixer for long-range global aggregation. This design is empirically validated through numerical experiments showing that nearby perturbations affect local vortex structure while distant perturbations influence global motion trends.
Result: Extensive experiments on turbulent flow benchmarks show DSO achieves state-of-the-art accuracy while maintaining robust long-term stability. It reduces prediction error by over 88% compared to existing neural operators.
Conclusion: The dual-scale approach effectively addresses the limitations of existing neural operators by separating local and global information processing, leading to significantly improved long-term forecasting capabilities for fluid dynamics governed by PDEs.
Abstract: Long-term fluid dynamics forecasting is a critically important problem in science and engineering. While neural operators have emerged as a promising paradigm for modeling systems governed by partial differential equations (PDEs), they often struggle with long-term stability and precision. We identify two fundamental failure modes in existing architectures: (1) local detail blurring, where fine-scale structures such as vortex cores and sharp gradients are progressively smoothed, and (2) global trend deviation, where the overall motion trajectory drifts from the ground truth during extended rollouts. We argue that these failures arise because existing neural operators treat local and global information processing uniformly, despite their inherently different evolution characteristics in physical systems. To bridge this gap, we propose the Dual-Scale Neural Operator (DSO), which explicitly decouples information processing into two complementary modules: depthwise separable convolutions for fine-grained local feature extraction and an MLP-Mixer for long-range global aggregation. Through numerical experiments on vortex dynamics, we demonstrate that nearby perturbations primarily affect local vortex structure while distant perturbations influence global motion trends, providing empirical validation for our design choice. Extensive experiments on turbulent flow benchmarks show that DSO achieves state-of-the-art accuracy while maintaining robust long-term stability, reducing prediction error by over 88% compared to existing neural operators.
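The local branch of the dual-scale design, a depthwise separable convolution, factorizes a convolution into a per-channel (depthwise) filter followed by a 1x1 (pointwise) channel mix. A minimal 1-D NumPy sketch with 'valid' padding, not the paper's 2-D operator:

```python
import numpy as np

def depthwise_separable(x, depth_k, point_w):
    """Depthwise separable convolution, the local branch of a DSO-style
    dual-scale design (minimal 1-D sketch, 'valid' padding, stride 1).

    x: (C, L) feature map; depth_k: (C, K) one filter per channel;
    point_w: (C_out, C) 1x1 mixing across channels.
    """
    C, L = x.shape
    K = depth_k.shape[1]
    # depthwise: each channel convolved with its own kernel (correlation form)
    depth = np.stack([
        np.array([x[c, i:i + K] @ depth_k[c] for i in range(L - K + 1)])
        for c in range(C)
    ])
    # pointwise: 1x1 conv mixes information across channels at each position
    return point_w @ depth
```

The complementary global branch would apply MLP-Mixer token mixing (an MLP across spatial positions), which this sketch omits.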
[830] Sparse-by-Design Cross-Modality Prediction: L0-Gated Representations for Reliable and Efficient Learning
Filippo Cenacchi
Main category: cs.LG
TL;DR: L0GM is a unified sparsification framework using hard-concrete gating to achieve L0-style sparsity across heterogeneous modalities (graphs, language, tabular data) for comparable accuracy-efficiency trade-offs and improved calibration.
Details
Motivation: Current sparsification methods are modality-specific (graph edge sparsification, Transformer pruning, tabular feature selection), making results hard to compare, deployment complicated, and reliability analysis weak across end-to-end KDD pipelines. A unified sparsification primitive is needed for comparable accuracy-efficiency trade-offs across modalities and controlled reliability analysis under representation compression.
Method: L0-Gated Cross-Modality Learning (L0GM) uses a modality-agnostic, feature-wise hard-concrete gating framework that enforces L0-style sparsity directly on learned representations. It attaches hard-concrete stochastic gates to each modality’s classifier-facing interface (node embeddings for GNNs, pooled sequence embeddings for Transformers, learned tabular embedding vectors for tabular models). The framework includes an L0-annealing schedule to stabilize optimization and create interpretable accuracy-sparsity Pareto frontiers.
Result: Across three public benchmarks (ogbn-products, Adult, IMDB), L0GM achieves competitive predictive performance while activating fewer representation dimensions, and reduces Expected Calibration Error (ECE) in evaluation.
Conclusion: L0GM establishes a modality-agnostic, reproducible sparsification primitive that supports comparable accuracy, efficiency, and calibration trade-off analysis across heterogeneous modalities.
Abstract: Predictive systems increasingly span heterogeneous modalities such as graphs, language, and tabular records, but sparsity and efficiency remain modality-specific (graph edge or neighborhood sparsification, Transformer head or layer pruning, and separate tabular feature-selection pipelines). This fragmentation makes results hard to compare, complicates deployment, and weakens reliability analysis across end-to-end KDD pipelines. A unified sparsification primitive would make accuracy-efficiency trade-offs comparable across modalities and enable controlled reliability analysis under representation compression. We ask whether a single representation-level mechanism can yield comparable accuracy-efficiency trade-offs across modalities while preserving or improving probability calibration. We propose L0-Gated Cross-Modality Learning (L0GM), a modality-agnostic, feature-wise hard-concrete gating framework that enforces L0-style sparsity directly on learned representations. L0GM attaches hard-concrete stochastic gates to each modality’s classifier-facing interface: node embeddings (GNNs), pooled sequence embeddings such as CLS (Transformers), and learned tabular embedding vectors (tabular models). This yields end-to-end trainable sparsification with an explicit control knob for the active feature fraction. To stabilize optimization and make trade-offs interpretable, we introduce an L0-annealing schedule that induces clear accuracy-sparsity Pareto frontiers. Across three public benchmarks (ogbn-products, Adult, IMDB), L0GM achieves competitive predictive performance while activating fewer representation dimensions, and it reduces Expected Calibration Error (ECE) in our evaluation. Overall, L0GM establishes a modality-agnostic, reproducible sparsification primitive that supports comparable accuracy, efficiency, and calibration trade-off analysis across heterogeneous modalities.
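The hard-concrete gate at the core of L0GM follows the standard stretched-and-clipped relaxation of a Bernoulli gate, which produces exact zeros and ones while remaining differentiable in expectation. A sketch with typical hyperparameter defaults from the literature, not necessarily the paper's values:

```python
import numpy as np

def hard_concrete_gate(log_alpha, u, beta=2/3, gamma=-0.1, zeta=1.1):
    """Sample a hard-concrete gate in [0, 1] (sketch of the L0GM gating
    primitive). u is a uniform(0, 1) noise sample; log_alpha is the
    learnable gate logit."""
    s = 1 / (1 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / beta))
    s_bar = s * (zeta - gamma) + gamma        # stretch beyond [0, 1]
    return np.clip(s_bar, 0.0, 1.0)           # clipping yields exact 0s and 1s

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Differentiable proxy for the expected number of active gates,
    i.e. the L0 penalty term driven by the annealing schedule."""
    return 1 / (1 + np.exp(-(log_alpha - beta * np.log(-gamma / zeta))))
```

Multiplying a representation elementwise by such gates, and penalizing `expected_l0`, is the "explicit control knob" for the active feature fraction.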
[831] A Comparative Investigation of Thermodynamic Structure-Informed Neural Networks
Guojie Li, Liu Hong
Main category: cs.LG
TL;DR: PINNs performance depends on how physics is incorporated; comparing thermodynamic structure-informed neural networks shows structure-preserving formulations outperform Newtonian-residual-based approaches for physical consistency and parameter identification.
Details
Motivation: Physics-informed neural networks (PINNs) provide a unified framework for solving differential equation problems, but their performance and physical consistency heavily depend on how governing laws are incorporated into the neural network architecture.
Method: Systematic comparison of different thermodynamic structure-informed neural networks by incorporating various thermodynamics formulations: Newtonian, Lagrangian, and Hamiltonian mechanics for conservative systems, and Onsager variational principle and extended irreversible thermodynamics for dissipative systems.
Result: Newtonian-residual-based PINNs can reconstruct system states but fail to reliably recover key physical and thermodynamic quantities. Structure-preserving formulations significantly enhance parameter identification, thermodynamic consistency, and robustness to noise.
Conclusion: Structure-preserving formulations provide practical guidance for principled design of thermodynamics-consistent models and lay groundwork for integrating more general nonequilibrium thermodynamic structures into physics-informed machine learning.
Abstract: Physics-informed neural networks (PINNs) offer a unified framework for solving both forward and inverse problems of differential equations, yet their performance and physical consistency strongly depend on how governing laws are incorporated. In this work, we present a systematic comparison of different thermodynamic structure-informed neural networks by incorporating various thermodynamics formulations, including Newtonian, Lagrangian, and Hamiltonian mechanics for conservative systems, as well as the Onsager variational principle and extended irreversible thermodynamics for dissipative systems. Through comprehensive numerical experiments on representative ordinary and partial differential equations, we quantitatively evaluate the impact of these formulations on accuracy, physical consistency, noise robustness, and interpretability. The results show that Newtonian-residual-based PINNs can reconstruct system states but fail to reliably recover key physical and thermodynamic quantities, whereas structure-preserving formulations significantly enhance parameter identification, thermodynamic consistency, and robustness. These findings provide practical guidance for the principled design of thermodynamically consistent models, and lay the groundwork for integrating more general nonequilibrium thermodynamic structures into physics-informed machine learning.
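The contrast between a Newtonian residual and a structure-preserving (Hamiltonian) criterion can be illustrated on a harmonic oscillator: the former penalizes the strong-form ODE residual, the latter penalizes drift of a conserved energy. A minimal sketch on sampled trajectories (the finite-difference stencil and diagnostic functions are ours, not the paper's losses):

```python
import math

def newtonian_residual(x, dt, omega=1.0):
    """Strong-form ODE residual x'' + omega^2 x used by vanilla PINNs
    (central differences on a sampled trajectory; illustrative only)."""
    return [(x[i - 1] - 2 * x[i] + x[i + 1]) / dt**2 + omega**2 * x[i]
            for i in range(1, len(x) - 1)]

def hamiltonian_drift(x, p, omega=1.0):
    """Structure-based diagnostic: drift of H = p^2/2 + omega^2 x^2/2,
    the quantity a Hamiltonian-informed network conserves by construction."""
    H = [pi**2 / 2 + omega**2 * xi**2 / 2 for xi, pi in zip(x, p)]
    return max(H) - min(H)
```

On the exact trajectory x = cos(t), p = -sin(t), both quantities are near zero; the point of the comparison is that under noise, penalizing only the Newtonian residual leaves thermodynamic quantities unconstrained.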
[832] OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models
Shengkai Chen, Yifang Yin, Jinming Cao, Shili Xiang, Zhenguang Liu, Roger Zimmermann
Main category: cs.LG
TL;DR: OpenAVS: A training-free, language-based approach for open-vocabulary audio-visual segmentation using text as a proxy to align audio and visual modalities via foundation models.
Details
Motivation: Existing audio-visual segmentation methods focus on closed-set scenarios and direct audio-visual alignment, limiting generalization to unseen situations. There's a need for open-vocabulary approaches that can handle novel objects and scenarios.
Method: Proposes OpenAVS with three-step pipeline: 1) audio-to-text prompt generation, 2) LLM-guided prompt translation, and 3) text-to-visual sounding object segmentation using foundation models. Also introduces OpenAVS-ST framework for integration with supervised models via pseudo-label self-training.
Result: Superior performance on three benchmark datasets, surpassing unsupervised, zero-shot, and few-shot AVS methods by significant margins (~9.4% mIoU and 10.9% F-score gains in challenging scenarios).
Conclusion: OpenAVS establishes a simple yet flexible architecture leveraging foundation models for effective knowledge transfer to audio-visual segmentation, enabling open-vocabulary capabilities without training.
Abstract: Audio-visual segmentation aims to separate sounding objects from videos by predicting pixel-level masks based on audio signals. Existing methods primarily concentrate on closed-set scenarios and direct audio-visual alignment and fusion, which limits their capability to generalize to new, unseen situations. In this paper, we propose OpenAVS, a novel training-free language-based approach that, for the first time, effectively aligns audio and visual modalities using text as a proxy for open-vocabulary Audio-Visual Segmentation (AVS). Equipped with multimedia foundation models, OpenAVS directly infers masks through 1) audio-to-text prompt generation, 2) LLM-guided prompt translation, and 3) text-to-visual sounding object segmentation. The objective of OpenAVS is to establish a simple yet flexible architecture that relies on the most appropriate foundation models by fully leveraging their capabilities to enable more effective knowledge transfer to the downstream AVS task. Moreover, we present a model-agnostic framework OpenAVS-ST that enables the integration of OpenAVS with any advanced supervised AVS model via pseudo-label based self-training. This approach enhances performance by effectively utilizing large-scale unlabeled data when available. Comprehensive experiments on three benchmark datasets demonstrate the superior performance of OpenAVS. It surpasses existing unsupervised, zero-shot, and few-shot AVS methods by a significant margin, achieving absolute performance gains of approximately 9.4% and 10.9% in mIoU and F-score, respectively, in challenging scenarios.
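The three-step pipeline amounts to composing three foundation-model calls with text as the intermediate representation. The sketch below uses injected callables as stand-ins for the audio-captioning, LLM, and text-prompted segmentation models; every interface here is hypothetical, not OpenAVS's actual API:

```python
def openavs_infer(audio, frame, captioner, llm, segmenter):
    """Training-free OpenAVS-style inference (sketch; the three callables
    stand in for foundation models and are swappable by design)."""
    audio_text = captioner(audio)                # 1) audio -> text prompt
    object_prompt = llm(                          # 2) LLM-guided prompt translation
        f"Name the sounding object described by: {audio_text}")
    return segmenter(frame, object_prompt)        # 3) text -> sounding-object mask
```

Because the modalities only meet through text, each component can be replaced by the most capable available model without retraining.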
[833] PiCSRL: Physics-Informed Contextual Spectral Reinforcement Learning
Mitra Nasr Azadani, Syed Usama Imtiaz, Nasrin Alamdari
Main category: cs.LG
TL;DR: PiCSRL is a physics-informed contextual spectral reinforcement learning method for adaptive sensing in high-dimensional low-sample-size environments, demonstrated on cyanobacterial bloom detection using NASA hyperspectral imagery.
Details
Motivation: High-dimensional low-sample-size (HDLSS) datasets limit reliable environmental model development, and existing RL-based adaptive sensing methods struggle in these contexts. There's a need for sample-efficient methods that can leverage domain knowledge for optimal sampling in Earth observation tasks.
Method: PiCSRL uses physics-informed embeddings designed with domain knowledge that are parsed directly into RL state representation. It includes an uncertainty-aware belief model encoding physics-informed features to improve prediction. The method is evaluated on cyanobacterial gene concentration adaptive sampling using NASA PACE hyperspectral imagery.
Result: Achieves optimal station selection with RMSE = 0.153, 98.4% bloom detection rate, outperforming random (0.296) and UCB (0.178) baselines. Physics-informed features improve test generalization (0.52 R², +0.11 over raw bands) in semi-supervised learning. Scales effectively to large networks (50 stations, >2M combinations) with significant improvements over baselines (p = 0.002).
Conclusion: PiCSRL is a sample-efficient adaptive sensing method that effectively incorporates physics-informed features for improved observation-to-target mapping across Earth observation domains, particularly valuable in HDLSS contexts.
Abstract: High-dimensional low-sample-size (HDLSS) datasets constrain reliable environmental model development, where labeled data remain sparse. Reinforcement learning (RL)-based adaptive sensing methods can learn optimal sampling policies, yet their application is severely limited in HDLSS contexts. In this work, we present PiCSRL (Physics-Informed Contextual Spectral Reinforcement Learning), where embeddings are designed using domain knowledge and parsed directly into the RL state representation for improved adaptive sensing. We developed an uncertainty-aware belief model that encodes physics-informed features to improve prediction. As a representative example, we evaluated our approach on a cyanobacterial gene concentration adaptive sampling task using NASA PACE hyperspectral imagery over Lake Erie. PiCSRL achieves optimal station selection (RMSE = 0.153; 98.4% bloom detection rate), outperforming the random (0.296) and UCB (0.178) RMSE baselines. Our ablation experiments demonstrate that physics-informed features improve test generalization (0.52 R^2, +0.11 over raw bands) in semi-supervised learning. In addition, our scalability test shows that PiCSRL scales effectively to large networks (50 stations, >2M combinations) with significant improvements over baselines (p = 0.002). We posit PiCSRL as a sample-efficient adaptive sensing method across Earth observation domains for improved observation-to-target mapping.
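For context, the UCB baseline that PiCSRL is compared against is the classic UCB1 rule over candidate stations; PiCSRL itself goes further by injecting physics-informed spectral features into the RL state. A sketch of the baseline only (function and variable names are ours):

```python
import math

def ucb_select(counts, means, t, c=1.0):
    """UCB1 station selection: exploit high observed reward, explore
    under-sampled stations. counts/means are per-station pull counts and
    reward averages; t is the current round (baseline sketch only)."""
    scores = [m + c * math.sqrt(math.log(t) / n) if n > 0 else float("inf")
              for n, m in zip(counts, means)]
    return scores.index(max(scores))
```

The reported gap (RMSE 0.153 vs 0.178) is attributed to the physics-informed context this context-free rule lacks.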
[834] Large Language Models for Computer-Aided Design: A Survey
Licheng Zhang, Bach Le, Naveed Akhtar, Siew-Kei Lam, Tuan Ngo
Main category: cs.LG
TL;DR: A systematic survey exploring the intersection of Large Language Models (LLMs) and Computer-Aided Design (CAD), covering industrial significance, LLM foundations, applications in CAD, and future directions.
Details
Motivation: While LLMs have advanced rapidly and been studied in various fields, there's no comprehensive review of their integration with CAD, which is crucial for 3D modeling across industries. As design complexity increases, LLMs could enhance CAD workflows, creating an important research frontier.
Method: Conducts a systematic survey that: 1) outlines CAD’s industrial significance, 2) provides overview of LLM foundations including closed-source and open models, 3) examines LLM applications in CAD with a taxonomy of six key areas, and 4) proposes future research directions.
Result: Presents the first comprehensive survey on LLM-CAD integration, identifying six key application areas where LLMs are making impact in CAD workflows, and establishing a taxonomy for this emerging research area.
Conclusion: LLM-CAD integration represents an exciting frontier with vast opportunities for innovation that could shape the future of CAD technology, though systematic exploration is just beginning.
Abstract: Large Language Models (LLMs) have seen rapid advancements in recent years, with models like ChatGPT and DeepSeek showcasing their remarkable capabilities across diverse domains. While substantial research has been conducted on LLMs in various fields, a comprehensive review focusing on their integration with Computer-Aided Design (CAD) remains notably absent. CAD is the industry standard for 3D modeling and plays a vital role in the design and development of products across different industries. As the complexity of modern designs increases, the potential for LLMs to enhance and streamline CAD workflows presents an exciting frontier. This article presents the first systematic survey exploring the intersection of LLMs and CAD. We begin by outlining the industrial significance of CAD, highlighting the need for AI-driven innovation. Next, we provide a detailed overview of the foundation of LLMs. We also examine both closed-source LLMs as well as publicly available models. The core of this review focuses on the various applications of LLMs in CAD, providing a taxonomy of six key areas where these models are making considerable impact. Finally, we propose several promising future directions for further advancements, which offer vast opportunities for innovation and are poised to shape the future of CAD technology. GitHub: https://github.com/lichengzhanguom/LLMs-CAD-Survey-Taxonomy
[835] Distributed Online Submodular Maximization under Communication Delays: A Simultaneous Decision-Making Approach
Zirui Xu, Vasileios Tzoumas
Main category: cs.LG
TL;DR: Distributed Online Greedy (DOG) algorithm for multi-agent submodular maximization under communication delays, enabling simultaneous decision-making across arbitrary networks with performance guarantees.
Details
Motivation: Address limitations in existing online submodular maximization approaches that either rely on sequential multi-hop communication (causing prohibitive delays) or restrict coordination to one-hop neighborhoods (limiting performance), particularly for future distributed information-gathering tasks in unknown dynamic environments where utility functions exhibit diminishing returns (submodularity).
Method: Develop Distributed Online Greedy (DOG) algorithm that integrates tools from adversarial bandit learning with delayed feedback to enable simultaneous decision-making across arbitrary network topologies, capturing the trade-off between coordination performance and convergence time based on communication delays.
Result: Provides approximation performance guarantees against optimal solutions, quantifying suboptimality cost due to decentralization as a function of network structure, and demonstrates that DOG spans the spectrum between fully centralized and fully decentralized one-hop coordination approaches.
Conclusion: DOG algorithm successfully addresses the communication delay problem in distributed online submodular maximization, offering a flexible framework that balances coordination performance with convergence time based on network communication characteristics.
Abstract: We provide a distributed online algorithm for multi-agent submodular maximization under communication delays. We are motivated by the future distributed information-gathering tasks in unknown and dynamic environments, where utility functions naturally exhibit the diminishing-returns property, i.e., submodularity. Existing approaches for online submodular maximization either rely on sequential multi-hop communication, resulting in prohibitive delays and restrictive connectivity assumptions, or restrict each agent’s coordination to its one-hop neighborhood only, thereby limiting the coordination performance. To address the issue, we provide the Distributed Online Greedy (DOG) algorithm, which integrates tools from adversarial bandit learning with delayed feedback to enable simultaneous decision-making across arbitrary network topologies. We provide the approximation performance of DOG against an optimal solution, capturing the suboptimality cost due to decentralization as a function of the network structure. Our analyses further reveal a trade-off between coordination performance and convergence time, determined by the magnitude of communication delays. By this trade-off, DOG spans the spectrum between the state-of-the-art fully centralized online coordination approach [1] and fully decentralized one-hop coordination approach [2].
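The centralized reference point that DOG decentralizes is the classic greedy algorithm for monotone submodular maximization, whose marginal gains shrink as the solution grows (diminishing returns). A coverage-objective sketch, not the distributed delayed-feedback algorithm itself:

```python
def greedy_submodular(ground_sets, budget):
    """Centralized greedy for a monotone submodular coverage objective,
    the (1 - 1/e)-approximate baseline that DOG approximates under
    communication delays (illustrative sketch).

    ground_sets: dict agent/action -> set of elements it covers.
    """
    chosen, covered = [], set()
    for _ in range(budget):
        # marginal gain = newly covered elements; it shrinks as coverage
        # grows, which is exactly the diminishing-returns property
        gains = {a: len(s - covered) for a, s in ground_sets.items()
                 if a not in chosen}
        best = max(gains, key=gains.get)
        if gains[best] == 0:
            break
        chosen.append(best)
        covered |= ground_sets[best]
    return chosen, covered
```

The sequential dependence of each pick on the previous ones is what forces multi-hop communication in naive distributed versions; DOG's bandit machinery removes that dependence.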
[836] Epileptic Seizure Prediction Using Patient-Adaptive Transformer Networks
Mohamed Mahdi, Asma Baghdadi
Main category: cs.LG
TL;DR: A patient-adaptive transformer framework for short-horizon seizure prediction from EEG signals using self-supervised pretraining followed by patient-specific fine-tuning.
Details
Motivation: Epileptic seizure prediction from EEG recordings is challenging due to inter-patient variability and complex temporal structure of neural signals, requiring personalized approaches.
Method: Two-stage training: self-supervised pretraining learns general EEG temporal representations via autoregressive sequence modeling, followed by patient-specific fine-tuning for binary seizure prediction within 30-second horizon. EEG signals are preprocessed and discretized into tokenized sequences for transformer-based learning.
Result: Achieves validation accuracies above 90% and F1 scores exceeding 0.80 across evaluated patients from the TUH EEG dataset.
Conclusion: Combining self-supervised representation learning with patient-specific adaptation is effective for individualized seizure prediction using transformer architectures.
Abstract: Epileptic seizure prediction from electroencephalographic (EEG) recordings remains challenging due to strong inter-patient variability and the complex temporal structure of neural signals. This paper presents a patient-adaptive transformer framework for short-horizon seizure forecasting. The proposed approach employs a two-stage training strategy: self-supervised pretraining is first used to learn general EEG temporal representations through autoregressive sequence modeling, followed by patient-specific fine-tuning for binary prediction of seizure onset within a 30-second horizon. To enable transformer-based sequence learning, multichannel EEG signals are processed using noise-aware preprocessing and discretized into tokenized temporal sequences. Experiments conducted on subjects from the TUH EEG dataset demonstrate that the proposed method achieves validation accuracies above 90% and F1 scores exceeding 0.80 across evaluated patients, supporting the effectiveness of combining self-supervised representation learning with patient-specific adaptation for individualized seizure prediction.
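The discretization step the abstract mentions (turning continuous EEG into token sequences a transformer can model) can be sketched with uniform amplitude binning. A minimal sketch; the bin count and per-channel min-max normalization are illustrative assumptions, not the paper's actual tokenizer:

```python
def tokenize_channel(samples, n_bins=32):
    """Map a 1-D EEG channel to integer tokens by uniform amplitude binning.

    Normalizes to [0, 1] using the channel's own min/max, then assigns each
    sample to one of n_bins discrete levels (token ids 0..n_bins-1).
    """
    lo, hi = min(samples), max(samples)
    span = hi - lo or 1.0          # avoid division by zero on flat signals
    tokens = []
    for x in samples:
        level = int((x - lo) / span * n_bins)
        tokens.append(min(level, n_bins - 1))  # clamp the max sample into the top bin
    return tokens

signal = [0.0, 0.5, 1.0, 0.25, 0.75]
print(tokenize_channel(signal, n_bins=4))  # -> [0, 2, 3, 1, 3]
```

The resulting integer sequences can then be fed to any standard autoregressive language-model objective for the pretraining stage.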
[837] Throughput Optimization as a Strategic Lever in Large-Scale AI Systems: Evidence from Dataloader and Memory Profiling Innovations
Mayank Jha
Main category: cs.LG
TL;DR: A comprehensive analysis of system-level optimizations for large-scale foundation model training, focusing on dataloader bottlenecks, memory optimization, and compiler-centric approaches to improve throughput and scale.
Details
Motivation: Large-scale foundation models face significant computational and memory bottlenecks that constrain development. Throughput optimization is critical for reducing training time, operational costs, and enabling next-generation model scale.
Method: Synthesizes evidence from recent innovations to analyze key advancements: 1) architectural solutions to dataloader bottlenecks (OVERLORD framework), 2) memory optimization techniques (CPU offloading like DeepSpeed’s ZeRO-Offload), 3) compiler-centric optimizations (Triton-distributed), and 4) advanced profiling tools for identifying overheads like DVFS.
Result: Demonstrates concrete improvements: OVERLORD shows 4.5% improvement in end-to-end training throughput; CPU offloading enables training models exceeding single-accelerator capacity; compiler optimizations provide substantial performance gains through joint optimization of computation, memory, and communication.
Conclusion: A holistic, system-level approach integrating innovations across data pipelines, memory management, network fabrics, and compiler technologies is essential for accelerating AI development, managing costs, and pushing model scale boundaries.
Abstract: The development of large-scale foundation models, particularly Large Language Models (LLMs), is constrained by significant computational and memory bottlenecks. These challenges elevate throughput optimization from a mere engineering task to a critical strategic lever, directly influencing training time, operational cost, and the feasible scale of next-generation models. This paper synthesizes evidence from recent academic and industry innovations to analyze key advancements in training efficiency. We examine architectural solutions to dataloader bottlenecks, such as the OVERLORD framework, which has demonstrated a 4.5% improvement in end-to-end training throughput. We investigate memory optimization techniques designed to overcome the GPU memory wall, including CPU offloading strategies like DeepSpeed’s ZeRO-Offload, which enable the training of models far exceeding single-accelerator capacity. Furthermore, we explore the growing importance of compiler-centric optimizations, exemplified by Triton-distributed, which enables the joint optimization of computation, memory, and communication for substantial performance gains. The analysis is contextualized by advanced profiling tools and hardware characterization studies that identify and mitigate previously overlooked overheads like Dynamic Voltage and Frequency Scaling (DVFS). Findings indicate that a holistic, system-level approach, integrating innovations across data pipelines, memory management, network fabrics, and compiler technologies, is essential for accelerating AI development, managing costs, and pushing the boundaries of model scale.
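The dataloader-bottleneck theme can be illustrated with a minimal prefetching loader: a background thread stages batches so that I/O overlaps with compute. This is a generic sketch of the idea, not the OVERLORD framework; the simulated latency and buffer size are arbitrary:

```python
import queue
import threading
import time

def prefetching_loader(batches, buffer_size=2):
    """Yield batches while a background thread stages the next ones.

    Loading (simulated here by time.sleep) overlaps with the consumer's
    compute, hiding I/O latency behind the training step.
    """
    q = queue.Queue(maxsize=buffer_size)
    SENTINEL = object()

    def producer():
        for b in batches:
            time.sleep(0.01)   # simulated disk/decode latency
            q.put(b)
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        yield item

total = sum(prefetching_loader(range(5)))
print(total)  # -> 10
```

Real dataloaders add sharding, multi-process decoding, and pinned-memory transfers, but the overlap principle is the same.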
[838] Central-to-Local Adaptive Generative Diffusion Framework for Improving Gene Expression Prediction in Data-Limited Spatial Transcriptomics
Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee A. Cooper, Bo Zhou
Main category: cs.LG
TL;DR: C2L-ST is a generative diffusion framework that synthesizes histology patches with molecular consistency using limited spatial transcriptomics data by integrating large-scale morphological priors with gene-conditioned modulation.
Details
Motivation: Spatial transcriptomics (ST) faces data scarcity due to high costs, limited throughput, and restricted data sharing, which constrains computational model development. There's a need for data-efficient methods to generate realistic histology images with molecular guidance.
Method: Central-to-Local adaptive generative diffusion framework: 1) Pretrain global central model on extensive histopathology datasets to learn transferable morphological representations, 2) Adapt institution-specific local models through lightweight gene-conditioned modulation using few paired image-gene spots, enabling synthesis of realistic histology patches under data-limited conditions.
Result: Generated images show high visual/structural fidelity, reproduce cellular composition, and exhibit strong embedding overlap with real data across multiple organs. Synthetic image-gene pairs improve gene expression prediction accuracy and spatial coherence in downstream tasks, achieving performance comparable to real data with only a fraction of sampled spots.
Conclusion: C2L-ST provides a scalable, data-efficient framework for molecular-level data augmentation, offering a domain-adaptive and generalizable approach for integrating histology and transcriptomics in spatial biology and related fields.
Abstract: Spatial Transcriptomics (ST) provides spatially resolved gene expression profiles within intact tissue architecture, enabling molecular analysis in histological context. However, the high cost, limited throughput, and restricted data sharing of ST experiments result in severe data scarcity, constraining the development of robust computational models. To address this limitation, we present a Central-to-Local adaptive generative diffusion framework for ST (C2L-ST) that integrates large-scale morphological priors with limited molecular guidance. A global central model is first pretrained on extensive histopathology datasets to learn transferable morphological representations, and institution-specific local models are then adapted through lightweight gene-conditioned modulation using a small number of paired image-gene spots. This strategy enables the synthesis of realistic and molecularly consistent histology patches under data-limited conditions. The generated images exhibit high visual and structural fidelity, reproduce cellular composition, and show strong embedding overlap with real data across multiple organs, reflecting both realism and diversity. When incorporated into downstream training, synthetic image-gene pairs improve gene expression prediction accuracy and spatial coherence, achieving performance comparable to real data while requiring only a fraction of sampled spots. C2L-ST provides a scalable and data-efficient framework for molecular-level data augmentation, offering a domain-adaptive and generalizable approach for integrating histology and transcriptomics in spatial biology and related fields.
[839] Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals
Nathaniel Oh, Paul Attie
Main category: cs.LG
TL;DR: Paper introduces Squish and Release (S&R) architecture to address “order-gap hallucination” where language models detect false premises but absorb them under conversational pressure, producing authoritative output built on errors.
Details
Motivation: Language models can identify false premises when asked directly, but under conversational pressure they absorb these errors and produce authoritative professional output based on premises they already recognized as false. This "order-gap hallucination" is invisible to output inspection because the error migrates into the safety circuit's activation space.
Method: Introduces Squish and Release (S&R) activation-patching architecture with two components: fixed detector body (layers 24-31, localized safety evaluation circuit) and swappable detector core (activation vector controlling perception direction). A safety core shifts model from compliance toward detection; an absorb core reverses it. Evaluated on OLMo-2 7B using Order-Gap Benchmark (500 chains across 500 domains, manually graded).
Result: Cascade collapse is near-total (99.8% compliance at O5); detector body is binary and localized (layers 24-31 shift 93.6%, layers 0-23 contribute zero, p<10^-189); synthetically engineered core releases 76.6% of collapsed chains; detection is more stable attractor (83% restore vs 58% suppress); epistemic specificity confirmed (false-premise core releases 45.4%, true-premise core releases 0.0%).
Conclusion: The contribution is the framework - body/core architecture, benchmark, and core engineering methodology - which is model-agnostic by design and addresses the fundamental problem of models absorbing false premises they previously detected.
Abstract: Language models detect false premises when asked directly but absorb them under conversational pressure, producing authoritative professional output built on errors they already identified. This failure - order-gap hallucination - is invisible to output inspection because the error migrates into the activation space of the safety circuit, suppressed but not erased. We introduce Squish and Release (S&R), an activation-patching architecture with two components: a fixed detector body (layers 24-31, the localized safety evaluation circuit) and a swappable detector core (an activation vector controlling perception direction). A safety core shifts the model from compliance toward detection; an absorb core reverses it. We evaluate on OLMo-2 7B using the Order-Gap Benchmark - 500 chains across 500 domains, all manually graded. Key findings: cascade collapse is near-total (99.8% compliance at O5); the detector body is binary and localized (layers 24-31 shift 93.6%, layers 0-23 contribute zero, p<10^-189); a synthetically engineered core releases 76.6% of collapsed chains; detection is the more stable attractor (83% restore vs 58% suppress); and epistemic specificity is confirmed (false-premise core releases 45.4%, true-premise core releases 0.0%). The contribution is the framework - body/core architecture, benchmark, and core engineering methodology - which is model-agnostic by design.
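A toy version of the core-swapping idea (adding a steering vector to hidden activations to shift a readout toward detection) fits in a few lines. The probe direction, hidden state, and scaling factor alpha below are all hypothetical stand-ins, not values from the paper:

```python
def patch_activation(hidden, core, alpha=1.0):
    """Add a scaled steering ('core') vector to a hidden-state vector."""
    return [h + alpha * c for h, c in zip(hidden, core)]

def detection_score(hidden, probe):
    """Dot product with a probe direction, standing in for a detection readout."""
    return sum(h * p for h, p in zip(hidden, probe))

probe = [1.0, -1.0, 0.5]        # hypothetical 'false-premise detected' direction
hidden = [0.2, 0.4, 0.1]        # collapsed (compliant) state
patched = patch_activation(hidden, probe, alpha=2.0)  # safety core = probe here

print(detection_score(hidden, probe))   # low score: compliance
print(detection_score(patched, probe))  # higher score: shifted toward detection
```

In a real intervention the patch would be applied via forward hooks at the detector-body layers; an "absorb core" would simply use a negative alpha.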
[840] A Regression Framework for Understanding Prompt Component Impact on LLM Performance
Andrew Lauziere, Jonathan Daugherty, Taisa Kushner
Main category: cs.LG
TL;DR: Statistical framework for analyzing how specific prompt features affect LLM performance using regression models to explain variation in model outputs
Details
Motivation: As LLMs become more integrated into software systems, there's a need to understand the conditions under which they perform well, particularly how prompt features influence their performance.
Method: Extends XAI methods to inspect LLMs by fitting regression models that relate portions of prompts to LLM evaluation scores, applied to compare Mistral-7B and GPT-OSS-20B on arithmetic problems
Result: Regression models explain 72% and 77% of variation in model performances; misinformation in example query-answer pairs impedes both models, while positive examples do not; positive and negative instructions vary significantly in impact, with contradictory effects on performance
Conclusion: The framework provides decision makers with granular insight into how prompts influence LLM task performance, revealing contradictory effects of positive and negative instructions
Abstract: As large language models (LLMs) continue to improve and see further integration into software systems, so does the need to understand the conditions in which they will perform. We contribute a statistical framework for understanding the impact of specific prompt features on LLM performance. The approach extends previous explainable artificial intelligence (XAI) methods specifically to inspect LLMs by fitting regression models relating portions of the prompt to LLM evaluation. We apply our method to compare how two open-source models, Mistral-7B and GPT-OSS-20B, leverage the prompt to perform a simple arithmetic problem. Regression models of individual prompt portions explain 72% and 77% of variation in model performances, respectively. We find misinformation in the form of incorrect example query-answer pairs impedes both models from solving the arithmetic query, though positive examples do not. We find significant variability in the impact of positive and negative instructions - these prompts have contradictory effects on model performance. The framework serves as a tool for decision makers in critical scenarios to gain granular insight into how the prompt influences an LLM to solve a task.
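The framework's central move, regressing evaluation scores on prompt-component features, reduces in the single-indicator case to a closed-form OLS fit. The data below is fabricated for illustration; it only shows how a negative slope would quantify a misinformation effect like the one the paper reports:

```python
def ols_fit(x, y):
    """Closed-form simple linear regression: y ~ intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# x: 1 if the prompt contains a misinformative example pair, else 0
# y: fabricated accuracy of the model on the arithmetic query
x = [0, 0, 0, 1, 1, 1]
y = [0.9, 0.8, 1.0, 0.3, 0.2, 0.4]
intercept, slope = ols_fit(x, y)
print(round(intercept, 3), round(slope, 3))  # -> 0.9 -0.6
```

With several components the same idea becomes multiple regression over a design matrix of indicator columns, one per prompt portion.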
[841] From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
Alberto G. Rodriguez Salgado
Main category: cs.LG
TL;DR: Multimodal models solve visual maze tasks not through genuine planning but by brute-force token search, converting images to text grids and enumerating paths step-by-step, consuming thousands of tokens.
Details
Motivation: To investigate whether multimodal models genuinely understand spatial relationships and plan like humans, or if they rely on brute-force token-level search strategies when solving visual spatial tasks.
Method: Created MazeBench with 110 procedurally generated maze images across nine controlled groups. Evaluated 16 model configurations from major AI companies, analyzed token consumption, conducted text-grid ablation studies, and examined qualitative traces of model reasoning.
Result: GPT-5.4 solved 91% and Gemini 3.1 Pro 79% of mazes, but only with massive token consumption (1,710-22,818 tokens). Without added reasoning budgets, scores dropped to 2-12%. Models consistently used a two-stage strategy: image-to-grid translation followed by token-level search (BFS in prose). Claude Sonnet 4.6 improved from 6% on images to 80% when given the correct grid, showing weak visual extraction.
Conclusion: High accuracy on visual planning tasks does not indicate human-like spatial understanding; models rely on inefficient token-level search rather than genuine planning, revealing limitations in current multimodal architectures.
Abstract: How do multimodal models solve visual spatial tasks – through genuine planning, or through brute-force search in token space? We introduce \textsc{MazeBench}, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710–22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2–12%; on 20$\times$20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. \textsc{MazeBench} therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.
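The "BFS in prose" strategy the traces reveal is striking because explicit BFS on the extracted grid is trivial by comparison. A minimal sketch (the wall/open text encoding is an assumption, not MazeBench's format):

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest path on a text grid ('#' wall, '.' open) via breadth-first search."""
    rows, cols = len(grid), len(grid[0])
    prev = {start: None}
    q = deque([start])
    while q:
        r, c = q.popleft()
        if (r, c) == goal:                 # reconstruct by walking predecessors
            path, node = [], goal
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] != '#' and (nr, nc) not in prev:
                prev[(nr, nc)] = (r, c)
                q.append((nr, nc))
    return None  # goal unreachable

maze = ["..#",
        ".##",
        "..."]
print(bfs_path(maze, (0, 0), (2, 2)))  # -> [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
```

The contrast with the paper's 1,710-22,818 tokens per solve is the point: once the grid is extracted, the search itself is cheap and exact.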
[842] FatigueFormer: Static-Temporal Feature Fusion for Robust sEMG-Based Muscle Fatigue Recognition
Tong Zhang, Hong Guo, Shuangzhou Yan, Dongkai Weng, Jian Wang, Hongxin Zhang
Main category: cs.LG
TL;DR: FatigueFormer is a semi-end-to-end framework using parallel Transformer encoders to learn interpretable muscle fatigue dynamics from sEMG signals, achieving state-of-the-art accuracy across varying MVC levels with attention-based visualization.
Details
Motivation: Prior approaches struggle with robustness across varying Maximum Voluntary Contraction (MVC) levels due to signal variability and low SNR in surface electromyography (sEMG) signals for muscle fatigue analysis.
Method: Uses parallel Transformer-based sequence encoders to separately capture static and temporal feature dynamics from sEMG, fusing complementary representations to improve performance stability across low- and high-MVC conditions.
Result: Achieves state-of-the-art accuracy and strong generalization under mild-fatigue conditions on a dataset of 30 participants across four MVC levels (20-80%), with attention-based visualization revealing fatigue dynamics.
Conclusion: FatigueFormer provides both performance improvements and interpretable insights into muscle fatigue progression through attention-based visualization of feature contributions across varying MVC levels.
Abstract: We present FatigueFormer, a semi-end-to-end framework that deliberately combines saliency-guided feature separation with deep temporal modeling to learn interpretable and generalizable muscle fatigue dynamics from surface electromyography (sEMG). Unlike prior approaches that struggle to maintain robustness across varying Maximum Voluntary Contraction (MVC) levels due to signal variability and low SNR, FatigueFormer employs parallel Transformer-based sequence encoders to separately capture static and temporal feature dynamics, fusing their complementary representations to improve performance stability across low- and high-MVC conditions. Evaluated on a self-collected dataset spanning 30 participants across four MVC levels (20-80%), it achieves state-of-the-art accuracy and strong generalization under mild-fatigue conditions. Beyond performance, FatigueFormer enables attention-based visualization of fatigue dynamics, revealing how feature groups and time windows contribute differently across varying MVC levels, offering interpretable insight into fatigue progression.
[843] Learning Partial Action Replacement in Offline MARL
Yue Jin, Giovanni Montana
Main category: cs.LG
TL;DR: PLCQL is an offline multi-agent reinforcement learning framework that adaptively selects which agents’ actions to replace using a contextual bandit formulation, reducing computational cost while maintaining performance.
Details
Motivation: Offline MARL suffers from exponential growth of joint action space leading to sparse dataset coverage and unavoidable out-of-distribution actions. Existing Partial Action Replacement methods require enumerating multiple subset configurations at high computational cost and cannot adapt to varying states.
Method: Formulates PAR subset selection as a contextual bandit problem and learns a state-dependent PAR policy using Proximal Policy Optimisation with uncertainty-weighted reward. The adaptive policy dynamically determines how many agents to replace at each update step.
Result: Reduces per-iteration Q-function evaluations from n to 1 compared to previous PAR method SPaCQL. Achieves highest normalized scores on 66% of tasks across MPE, MaMuJoCo, and SMAC benchmarks, outperforming SPaCQL on 84% of tasks while substantially reducing computational cost.
Conclusion: PLCQL provides an efficient adaptive approach to offline MARL that balances policy improvement against conservative value estimation, with proven theoretical guarantees on value-error bounds.
Abstract: Offline multi-agent reinforcement learning (MARL) faces a critical challenge: the joint action space grows exponentially with the number of agents, making dataset coverage exponentially sparse and out-of-distribution (OOD) joint actions unavoidable. Partial Action Replacement (PAR) mitigates this by anchoring a subset of agents to dataset actions, but existing approach relies on enumerating multiple subset configurations at high computational cost and cannot adapt to varying states. We introduce PLCQL, a framework that formulates PAR subset selection as a contextual bandit problem and learns a state-dependent PAR policy using Proximal Policy Optimisation with an uncertainty-weighted reward. This adaptive policy dynamically determines how many agents to replace at each update step, balancing policy improvement against conservative value estimation. We prove a value-error bound showing that the estimation error scales linearly with the expected number of deviating agents. Compared with the previous PAR-based method SPaCQL, PLCQL reduces the number of per-iteration Q-function evaluations from n to 1, significantly improving computational efficiency. Empirically, PLCQL achieves the highest normalised scores on 66% of tasks across MPE, MaMuJoCo, and SMAC benchmarks, outperforming SPaCQL on 84% of tasks while substantially reducing computational cost.
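PLCQL learns its subset-size policy with PPO; as a much simpler stand-in, an epsilon-greedy bandit over "how many agents to replace" illustrates the same selection loop. The reward function is a fabricated toy, not the paper's uncertainty-weighted reward:

```python
import random

def run_bandit(reward_fn, n_arms, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit over 'how many agents to replace' (arms 1..n_arms)."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    values = [0.0] * n_arms           # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)          # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        r = reward_fn(arm + 1, rng)   # arm index -> number of agents replaced
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
    return max(range(n_arms), key=lambda a: values[a]) + 1

def noisy_reward(k, rng):
    """Hypothetical trade-off: replacing ~2 agents is best in this toy setup."""
    return -abs(k - 2) + rng.gauss(0.0, 0.1)

print(run_bandit(noisy_reward, n_arms=4))  # -> 2
```

A contextual version, as in the paper, would condition the choice on state features rather than keeping one global value per arm.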
[844] VAN-AD: Visual Masked Autoencoder with Normalizing Flow For Time Series Anomaly Detection
PengYu Chen, Shang Wan, Xiaohou Shi, Yuan Chang, Yan Sun, Sajal K. Das
Main category: cs.LG
TL;DR: VAN-AD adapts visual Masked Autoencoder (MAE) to time series anomaly detection by addressing overgeneralization and limited local perception through statistical mapping and normalizing flow modules.
Details
Motivation: Existing time series anomaly detection methods require dataset-specific training and lack generalization across different datasets, especially with scarce training data. While foundation models offer promise, current approaches using LLMs or large time series datasets face cross-modal gaps or in-domain heterogeneity issues.
Method: Proposes VAN-AD framework adapting ImageNet-pretrained visual MAE to TSAD. Includes: 1) Adaptive Distribution Mapping Module (ADMM) to map reconstruction results into unified statistical space to amplify abnormal pattern discrepancies, addressing overgeneralization; 2) Normalizing Flow Module (NFM) combining MAE with normalizing flow to estimate probability density under global distribution, addressing limited local perception.
Result: Extensive experiments on nine real-world datasets show VAN-AD consistently outperforms existing state-of-the-art methods across multiple evaluation metrics.
Conclusion: Visual foundation models like MAE can be effectively adapted for time series anomaly detection with proper architectural modifications to address domain-specific challenges, offering improved generalization and performance.
Abstract: Time series anomaly detection (TSAD) is essential for maintaining the reliability and security of IoT-enabled service systems. Existing methods require training one specific model for each dataset, which exhibits limited generalization capability across different target datasets, hindering anomaly detection performance in various scenarios with scarce training data. To address this limitation, foundation models have emerged as a promising direction. However, existing approaches either repurpose large language models (LLMs) or construct largescale time series datasets to develop general anomaly detection foundation models, and still face challenges caused by severe cross-modal gaps or in-domain heterogeneity. In this paper, we investigate the applicability of large-scale vision models to TSAD. Specifically, we adapt a visual Masked Autoencoder (MAE) pretrained on ImageNet to the TSAD task. However, directly transferring MAE to TSAD introduces two key challenges: overgeneralization and limited local perception. To address these challenges, we propose VAN-AD, a novel MAE-based framework for TSAD. To alleviate the over-generalization issue, we design an Adaptive Distribution Mapping Module (ADMM), which maps the reconstruction results before and after MAE into a unified statistical space to amplify discrepancies caused by abnormal patterns. To overcome the limitation of local perception, we further develop a Normalizing Flow Module (NFM), which combines MAE with normalizing flow to estimate the probability density of the current window under the global distribution. Extensive experiments on nine real-world datasets demonstrate that VAN-AD consistently outperforms existing state-of-the-art methods across multiple evaluation metrics. We make our code and datasets available at https://github.com/PenyChen/VAN-AD.
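The Normalizing Flow Module scores a window by its probability density; the one-dimensional affine flow below shows the underlying change-of-variables rule (log p(x) = log p(z) - log|scale|). Parameters are illustrative, not from VAN-AD:

```python
import math

def affine_flow_logpdf(x, shift=0.0, scale=1.0):
    """Log-density of x under z = (x - shift)/scale with z ~ N(0, 1).

    Change of variables: log p(x) = log N(z; 0, 1) - log|scale|.
    """
    z = (x - shift) / scale
    log_base = -0.5 * z * z - 0.5 * math.log(2 * math.pi)
    return log_base - math.log(abs(scale))

# With shift=0, scale=1 this is just the standard normal log-pdf.
print(round(affine_flow_logpdf(0.0), 4))  # -> -0.9189
# An out-of-distribution point scores far lower -- that gap is the anomaly signal.
print(affine_flow_logpdf(6.0) < affine_flow_logpdf(0.0))  # -> True
```

Real flows stack many such invertible transforms with learned, input-dependent shifts and scales, but every layer contributes the same kind of log-Jacobian correction.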
[845] Physics-Embedded Feature Learning for AI in Medical Imaging
Pulock Das, Al Amin, Kamrul Hasan, Rohan Thompson, Azubike D. Okpalaeze, Liang Hong
Main category: cs.LG
TL;DR: PhysNet integrates tumor growth physics into CNN feature learning for brain MRI classification, outperforming standard DL models while providing interpretable biological parameters.
Details
Motivation: Current deep learning models in healthcare operate as black boxes and ignore physical tumor growth processes, limiting interpretability, robustness, and clinical trust. There's a need to integrate physics-based understanding into AI systems for more trustworthy medical applications.
Method: PhysNet embeds a reaction-diffusion model of tumor growth within intermediate feature representations of a ResNet backbone. The architecture jointly performs multi-class tumor classification while learning latent tumor density fields, temporal evolution, and biologically meaningful physical parameters (tumor diffusion and growth rates) through end-to-end training.
Result: On a large brain MRI dataset, PhysNet outperforms state-of-the-art DL baselines (MobileNetV2, VGG16, VGG19, and ensemble models) in classification accuracy and F1-score. It also produces interpretable latent representations and learned bio-physical parameters that align with established medical knowledge.
Conclusion: Physics-embedded representation learning provides a practical pathway toward more trustworthy and clinically meaningful medical AI systems by combining data-driven learning with physical process modeling.
Abstract: Deep learning (DL) models have achieved strong performance in an intelligent healthcare setting, yet most existing approaches operate as black boxes and ignore the physical processes that govern tumor growth, limiting interpretability, robustness, and clinical trust. To address this limitation, we propose PhysNet, a physics-embedded DL framework that integrates tumor growth dynamics directly into the feature learning process of a convolutional neural network (CNN). Unlike conventional physics-informed methods that impose physical constraints only at the output level, PhysNet embeds a reaction-diffusion model of tumor growth within intermediate feature representations of a ResNet backbone. The architecture jointly performs multi-class tumor classification while learning a latent tumor density field, its temporal evolution, and biologically meaningful physical parameters, including tumor diffusion and growth rates, through end-to-end training. This design is necessary because purely data-driven models, even when highly accurate or ensemble-based, cannot guarantee physically consistent predictions or provide insight into tumor behavior. Experimental results on a large brain MRI dataset demonstrate that PhysNet outperforms multiple state-of-the-art DL baselines, including MobileNetV2, VGG16, VGG19, and ensemble models, achieving superior classification accuracy and F1-score. In addition to improved performance, PhysNet produces interpretable latent representations and learned bio-physical parameters that align with established medical knowledge, highlighting physics-embedded representation learning as a practical pathway toward more trustworthy and clinically meaningful medical AI systems.
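The reaction-diffusion (Fisher-KPP-type) tumor-growth model the paper embeds can be simulated with an explicit finite-difference step; the 1-D grid, rates, and boundary handling below are illustrative choices, not PhysNet's:

```python
def fisher_kpp_step(u, D=0.1, rho=0.5, dt=0.1, dx=1.0):
    """One explicit finite-difference step of du/dt = D*u_xx + rho*u*(1-u).

    In tumor-growth modeling, D is the diffusion (infiltration) rate and
    rho the proliferation rate; u is a normalized tumor cell density.
    """
    n = len(u)
    new = []
    for i in range(n):
        left = u[i - 1] if i > 0 else u[i]       # zero-flux boundaries
        right = u[i + 1] if i < n - 1 else u[i]
        lap = (left - 2 * u[i] + right) / dx ** 2
        new.append(u[i] + dt * (D * lap + rho * u[i] * (1 - u[i])))
    return new

u = [0.0, 0.0, 1.0, 0.0, 0.0]   # a localized tumor seed
for _ in range(50):
    u = fisher_kpp_step(u)
print([round(v, 2) for v in u])  # the seed spreads and grows toward u = 1
```

PhysNet would learn D and rho from data while enforcing this dynamic on latent density fields; the explicit step is stable here because dt*D/dx^2 = 0.01 is well below the usual 0.5 limit.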
[846] Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry
Guoxi Zhang, Jiawei Chen, Tianzhuo Yang, Lang Qin, Juntao Dai, Yaodong Yang, Jingwei Yi
Main category: cs.LG
TL;DR: A novel method called Stability Asymmetry Regularization (SAR) detects and suppresses deceptive behavior in LLMs by measuring the contrast between internal reasoning stability and external response stability under perturbation, addressing limitations of chain-of-thought monitoring.
Details
Motivation: As LLMs become more capable, intrinsic deception becomes a critical trustworthiness risk where models strategically mislead users. Existing alignment methods based on chain-of-thought monitoring are unreliable because models can conceal deceptive reasoning under optimization pressure.
Method: The paper hypothesizes that deceptive LLMs show “stability asymmetry” - maintaining stable internal beliefs in chain-of-thought while having fragile external responses under perturbation. The authors propose Stability Asymmetry Regularization (SAR), an alignment objective that penalizes this distributional asymmetry during reinforcement learning by measuring contrast between internal CoT stability and external response stability.
Result: Extensive experiments confirm that stability asymmetry reliably identifies deceptive behavior, and SAR effectively suppresses intrinsic deception without degrading general model capability.
Conclusion: SAR provides a robust approach to detecting and mitigating deceptive behavior in LLMs by targeting statistical structure rather than semantic content, making it resilient to semantic concealment strategies that undermine traditional chain-of-thought monitoring.
Abstract: As Large Language Models (LLMs) expand in capability and application scope, their trustworthiness becomes critical. A vital risk is intrinsic deception, wherein models strategically mislead users to achieve their own objectives. Existing alignment approaches based on chain-of-thought (CoT) monitoring supervise explicit reasoning traces. However, under optimization pressure, models are incentivized to conceal deceptive reasoning, rendering semantic supervision fundamentally unreliable. Grounded in cognitive psychology, we hypothesize that a deceptive LLM maintains a stable internal belief in its CoT while its external response remains fragile under perturbation. We term this phenomenon stability asymmetry and quantify it by measuring the contrast between internal CoT stability and external response stability under perturbation. Building on this structural signature, we propose the Stability Asymmetry Regularization (SAR), a novel alignment objective that penalizes this distributional asymmetry during reinforcement learning. Unlike CoT monitoring, SAR targets the statistical structure of model outputs, rendering it robust to semantic concealment. Extensive experiments confirm that stability asymmetry reliably identifies deceptive behavior, and that SAR effectively suppresses intrinsic deception without degrading general model capability.
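The stability-asymmetry signature (stable internal belief, fragile external response under perturbation) can be quantified in a toy way as the difference in output variance across perturbed prompts. The traces and the variance-based definition here are hypothetical stand-ins for the paper's actual measure:

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def stability_asymmetry(internal_outputs, external_outputs):
    """External variance minus internal variance under the same perturbations.

    A deceptive model keeps a stable internal belief (low variance) while
    its external answer flips under perturbation (high variance), so this
    score comes out positive.
    """
    return variance(external_outputs) - variance(internal_outputs)

# Toy traces over five perturbed prompts (1.0 = 'premise is false').
internal = [1.0, 1.0, 1.0, 1.0, 1.0]    # belief held in the chain of thought
external = [1.0, 0.0, 1.0, 0.0, 0.0]    # answer flips under pressure
honest_ext = [1.0, 1.0, 1.0, 1.0, 1.0]  # answer matches the belief

print(stability_asymmetry(internal, external) > 0)     # -> True (deceptive signature)
print(stability_asymmetry(internal, honest_ext) == 0)  # -> True (consistent model)
```

SAR would turn such a score into a penalty term in the RL objective, pushing the two stabilities back into agreement.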
[847] A Hierarchical Sheaf Spectral Embedding Framework for Single-Cell RNA-seq Analysis
Xiang Xiang Wang, Guo-Wei Wei
Main category: cs.LG
TL;DR: HSSE framework uses persistent sheaf Laplacian analysis to create multiscale cell representations for single-cell RNA-seq data, achieving competitive performance on 12 benchmark datasets.
Details
Motivation: Single-cell RNA-seq data analysis needs representations that capture heterogeneous local structure across multiple scales while being stable and interpretable.
Method: Hierarchical sheaf spectral embedding (HSSE) constructs cell-level features using persistent sheaf Laplacian analysis. It builds scale-dependent embeddings, defines local neighborhoods at multiple resolutions, constructs data-driven cellular sheaves, computes persistent sheaf Laplacians over filtration intervals, and extracts spectral statistics aggregated into feature vectors.
Result: HSSE achieves competitive or improved performance compared to existing multiscale and classical embedding-based methods across multiple evaluation metrics on twelve benchmark single-cell RNA-seq datasets.
Conclusion: Sheaf spectral representations provide a robust and interpretable approach for single-cell RNA-seq data representation learning.
Abstract: Single-cell RNA-seq data analysis typically requires representations that capture heterogeneous local structure across multiple scales while remaining stable and interpretable. In this work, we propose a hierarchical sheaf spectral embedding (HSSE) framework that constructs informative cell-level features based on persistent sheaf Laplacian analysis. Starting from scale-dependent low-dimensional embeddings, we define cell-centered local neighborhoods at multiple resolutions. For each local neighborhood, we construct a data-driven cellular sheaf that encodes local relationships among cells. We then compute persistent sheaf Laplacians over sampled filtration intervals and extract spectral statistics that summarize the evolution of local relational structure across scales. These spectral descriptors are aggregated into a unified feature vector for each cell and can be directly used in downstream learning tasks without additional model training. We evaluate HSSE on twelve benchmark single-cell RNA-seq datasets covering diverse biological systems and data scales. Under a consistent classification protocol, HSSE achieves competitive or improved performance compared with existing multiscale and classical embedding-based methods across multiple evaluation metrics. The results demonstrate that sheaf spectral representations provide a robust and interpretable approach for single-cell RNA-seq data representation learning.
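The multiscale pipeline can be illustrated with plain graph Laplacians standing in for the paper's persistent sheaf Laplacians: build a neighborhood at each scale, take the Laplacian spectrum, and aggregate simple statistics into one per-cell feature vector. A minimal numpy sketch; the Gaussian-affinity graph, the chosen statistics, and all names are illustrative, not the paper's construction:

```python
import numpy as np

def knn_laplacian(X, center, k):
    """Unnormalized graph Laplacian of the k-NN neighborhood of one cell."""
    d = np.linalg.norm(X - X[center], axis=1)
    sub = X[np.argsort(d)[:k]]                     # the k nearest cells
    D2 = ((sub[:, None, :] - sub[None, :, :]) ** 2).sum(-1)
    W = np.exp(-D2 / (D2.mean() + 1e-12))          # Gaussian affinities
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(1)) - W

def spectral_features(X, center, scales=(5, 10, 20)):
    """Concatenate spectral statistics of the neighborhood Laplacian across
    scales into one per-cell feature vector."""
    feats = []
    for k in scales:
        lam = np.linalg.eigvalsh(knn_laplacian(X, center, k))
        feats += [lam.mean(), lam.max(), lam[1]]   # crude spectral summaries
    return np.array(feats)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))        # 60 cells in an 8-dim embedding
f = spectral_features(X, center=0)
```

The resulting vector can feed any downstream classifier, mirroring the paper's training-free use of spectral descriptors.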
[848] Property-Guided Molecular Generation and Optimization via Latent Flows
Alexander Arjun Lobo, Urvi Awasthi, Leonid Zhukov
Main category: cs.LG
TL;DR: MoltenFlow is a modular framework for molecular design that combines property-organized latent representations with flow-matching generative priors and gradient-based guidance for both conditioned generation and local optimization.
Details
Motivation: Current generative models for molecular discovery face challenges in targeted optimization within continuous latent representations, often leading to degraded validity, loss of structural fidelity, or unstable behavior when trying to identify molecular structures that satisfy desired property profiles.
Method: MoltenFlow combines property-organized latent representations with flow-matching generative priors and gradient-based guidance. This modular framework supports both conditioned generation and local optimization within a single latent-space framework.
Result: The approach enables efficient multi-objective molecular optimization under fixed oracle budgets with controllable trade-offs, while the learned flow prior improves unconditional generation quality.
Conclusion: MoltenFlow provides an effective framework for molecular inverse design that addresses limitations of current generative models by combining organized latent representations with flow-matching and gradient guidance.
Abstract: Molecular discovery is increasingly framed as an inverse design problem: identifying molecular structures that satisfy desired property profiles under feasibility constraints. While recent generative models provide continuous latent representations of chemical space, targeted optimization within these representations often leads to degraded validity, loss of structural fidelity, or unstable behavior. We introduce MoltenFlow, a modular framework that combines property-organized latent representations with flow-matching generative priors and gradient-based guidance. This formulation supports both conditioned generation and local optimization within a single latent-space framework. We show that guided latent flows enable efficient multi-objective molecular optimization under fixed oracle budgets with controllable trade-offs, while a learned flow prior improves unconditional generation quality.
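The flow-matching prior at the core of such latent-space frameworks trains a velocity field by regressing onto the displacement along a straight-line path between a base sample and a data sample. A minimal numpy sketch of the conditional flow-matching loss; the toy "latents" and hand-specified velocity fields are illustrative, not MoltenFlow's:

```python
import numpy as np

rng = np.random.default_rng(1)

def cfm_loss(v_fn, x1_batch):
    """Monte-Carlo conditional flow-matching loss for the straight-line path
    x_t = (1 - t) x0 + t x1, whose regression target is u = x1 - x0."""
    x0 = rng.normal(size=x1_batch.shape)      # base samples from the prior
    t = rng.uniform(size=(len(x1_batch), 1))  # random times in [0, 1]
    xt = (1 - t) * x0 + t * x1_batch
    u = x1_batch - x0                         # conditional velocity target
    return float(np.mean((v_fn(xt, t) - u) ** 2))

# toy "latents" centered at +3: a constant drift of +3 matches the mean
# displacement and beats a zero field, as training would discover
data = rng.normal(loc=3.0, size=(512, 2))
loss_drift = cfm_loss(lambda x, t: np.full_like(x, 3.0), data)
loss_zero = cfm_loss(lambda x, t: np.zeros_like(x), data)
```

In practice `v_fn` is a neural network and the loss is minimized by gradient descent; property guidance then adds a gradient term during sampling.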
[849] Strategic Candidacy in Generative AI Arenas
Chris Hays, Rachel Li, Bailey Flanigan, Manish Raghavan
Main category: cs.LG
TL;DR: A mechanism called You-Rank-We-Rank (YRWR) that prevents model producers from gaming AI arena rankings by submitting clones, requiring producers to rank their own models to improve overall ranking accuracy.
Details
Motivation: AI arenas rank generative models using pairwise user preferences, but producers can exploit randomness by submitting multiple clones/variants of the same model to artificially boost rankings, degrading ranking quality and usefulness.
Method: Proposed You-Rank-We-Rank (YRWR) mechanism requires producers to submit rankings over their own models, using these rankings to correct statistical estimates of model quality. The mechanism is designed to be clone-robust.
Result: Theoretical proof that YRWR is approximately clone-robust (producers can’t improve rank much by submitting clones). Simulations show mechanism is clone-robust and improves ranking accuracy even with producer misranking.
Conclusion: YRWR provides a practical solution to the clone problem in AI arenas, ensuring more reliable rankings by preventing gaming while potentially improving overall accuracy when producers can rank their own models reasonably well.
Abstract: AI arenas, which rank generative models from pairwise preferences of users, are a popular method for measuring the relative performance of models in the course of their organic use. Because rankings are computed from noisy preferences, there is a concern that model producers can exploit this randomness by submitting many models (e.g., multiple variants of essentially the same model) and thereby artificially improve the rank of their top models. This can lead to degradations in the quality, and therefore the usefulness, of the ranking. In this paper, we begin by establishing, both theoretically and in simulations calibrated to data from the platform Arena (formerly LMArena, Chatbot Arena), conditions under which producers can benefit from submitting clones when their goal is to be ranked highly. We then propose a new mechanism for ranking models from pairwise comparisons, called You-Rank-We-Rank (YRWR). It requires that producers submit rankings over their own models and uses these rankings to correct statistical estimates of model quality. We prove that this mechanism is approximately clone-robust, in the sense that a producer cannot improve their rank much by doing anything other than submitting each of their unique models exactly once. Moreover, to the extent that model producers are able to correctly rank their own models, YRWR improves overall ranking accuracy. In further simulations, we show that indeed the mechanism is approximately clone-robust and quantify improvements to ranking accuracy, even under producer misranking.
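The clone incentive is easy to see in isolation: when arena scores are noisy, the best of several clones of one model is expected to score higher than a single honest submission of the same model. A toy simulation of that effect (illustrative numbers, not the paper's calibrated model):

```python
import numpy as np

rng = np.random.default_rng(2)
trials, noise = 20000, 1.0

# every submission measures the SAME true quality (here 0) plus arena noise
single = rng.normal(0.0, noise, size=trials)                   # one honest entry
clones = rng.normal(0.0, noise, size=(trials, 5)).max(axis=1)  # best of 5 clones

gain = clones.mean() - single.mean()   # expected score boost from cloning alone
```

The expected maximum of five standard-normal draws is about 1.16, so cloning buys more than a full noise standard deviation of apparent quality, which is the distortion YRWR's self-rankings are designed to correct.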
[850] Tunable Domain Adaptation Using Unfolding
Snehaa Reddy, Jayaprakash Katual, Satish Mulleti
Main category: cs.LG
TL;DR: Proposes two domain adaptation methods for regression using interpretable unrolled networks that adapt to varying data distributions by tuning parameters based on domain variables or input data.
Details
Motivation: Machine learning models struggle with domain generalization across varying data distributions (e.g., different noise levels). Traditional approaches like personalized training (separate models per domain) and joint training (single model for all domains) have limitations in flexibility and effectiveness.
Method: Two novel domain adaptation methods based on interpretable unrolled networks (deep architectures inspired by iterative optimization algorithms): 1) Parametric Tunable-Domain Adaptation (P-TDA) uses known domain parameters for dynamic tuning, 2) Data-Driven Tunable-Domain Adaptation (DD-TDA) infers domain adaptation directly from input data. Both leverage functional dependence of tunable parameters on domain variables.
Result: Validated on compressed sensing problems including noise-adaptive sparse signal recovery, domain-adaptive gain calibration, and domain-adaptive phase retrieval. Methods achieve improved or comparable performance to domain-specific models while surpassing joint training baselines.
Conclusion: Demonstrates the potential of unrolled networks for effective, interpretable domain adaptation in regression settings, offering flexible adaptation to varying data distributions.
Abstract: Machine learning models often struggle to generalize across domains with varying data distributions, such as differing noise levels, leading to degraded performance. Traditional strategies like personalized training, which trains separate models per domain, and joint training, which uses a single model for all domains, have significant limitations in flexibility and effectiveness. To address this, we propose two novel domain adaptation methods for regression tasks based on interpretable unrolled networks–deep architectures inspired by iterative optimization algorithms. These models leverage the functional dependence of select tunable parameters on domain variables, enabling controlled adaptation during inference. Our methods include Parametric Tunable-Domain Adaptation (P-TDA), which uses known domain parameters for dynamic tuning, and Data-Driven Tunable-Domain Adaptation (DD-TDA), which infers domain adaptation directly from input data. We validate our approach on compressed sensing problems involving noise-adaptive sparse signal recovery, domain-adaptive gain calibration, and domain-adaptive phase retrieval, demonstrating improved or comparable performance to domain-specific models while surpassing joint training baselines. This work highlights the potential of unrolled networks for effective, interpretable domain adaptation in regression settings.
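The P-TDA idea of making a tunable parameter a function of a known domain variable can be sketched with unrolled ISTA for sparse recovery, where the soft-threshold depends on the noise level. The tuning rule below is a hypothetical fixed function; the paper learns this dependence rather than prescribing it:

```python
import numpy as np

def soft(x, tau):
    """Soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def tunable_ista(A, y, sigma, n_layers=200, base_tau=0.05):
    """Unrolled ISTA whose soft-threshold is a function of the noise level
    sigma (the 'domain variable'); the tuning rule here is hypothetical."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    tau = base_tau * (1.0 + sigma)         # threshold tuned by noise level
    x = np.zeros(A.shape[1])
    for _ in range(n_layers):
        x = soft(x - A.T @ (A @ x - y) / L, tau / L)
    return x

rng = np.random.default_rng(3)
A = rng.normal(size=(80, 120)) / np.sqrt(80)
x_true = np.zeros(120)
x_true[[5, 40, 99]] = [2.0, -1.5, 3.0]     # 3-sparse ground truth
sigma = 0.05
y = A @ x_true + sigma * rng.normal(size=80)
x_hat = tunable_ista(A, y, sigma)
```

One set of unrolled weights then serves many noise regimes by evaluating the threshold at the deployment-time sigma, instead of training one model per domain.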
[851] High dimensional theory of two-phase optimizers
Atish Agarwala
Main category: cs.LG
TL;DR: Analysis of LA-DiLoCo, a two-phase optimizer, showing its single-worker variant offers different noise-signal tradeoffs than SGD, multi-worker adds noise that can be controlled, and momentum stacking enables acceleration via Hessian spectrum transformation.
Details
Motivation: Motivated by renewed interest in partially asynchronous two-phase optimizers (like DiLoCo) and promising results of their single-worker versions, the paper aims to analyze LA-DiLoCo on high-dimensional linear regression to understand its properties and potential advantages over traditional optimizers like SGD.
Method: Theoretical analysis of LA-DiLoCo (Local Averaging Distributed Low Communication) on high-dimensional linear regression problems. Examines single-worker variant (LA), multi-worker version, and SLA (LA with momentum), analyzing noise-signal tradeoffs, hyperparameter effects, and momentum stacking properties.
Result: Single-worker LA provides different noise-signal tradeoffs than SGD, beneficial in many scenarios. Multi-worker version generates more noise but this can be controlled via hyperparameter tuning. Momentum stacking (SLA) enables acceleration through non-linear transformation of the effective Hessian spectrum, maximized with Nesterov momentum.
Conclusion: Two-phase optimizers represent a fruitful new paradigm for understanding and improving training algorithms, offering different tradeoffs than traditional methods and enabling acceleration through momentum stacking techniques.
Abstract: The trend towards larger training setups has brought a renewed interest in partially asynchronous two-phase optimizers which optimize locally and then synchronize across workers. Additionally, recent work suggests that the one-worker version of one of these algorithms, DiLoCo, shows promising results as a (synchronous) optimizer. Motivated by these studies we present an analysis of LA-DiLoCo, a simple member of the DiLoCo family, on a high-dimensional linear regression problem. We show that the one-worker variant, LA, provides a different tradeoff between signal and noise than SGD, which is beneficial in many scenarios. We also show that the multi-worker version generates more noise than the single worker version, but that this additional noise generation can be ameliorated by appropriate choice of hyperparameters. We conclude with an analysis of SLA – LA with momentum – and show that stacking two momentum operators gives an opportunity for acceleration via a non-linear transformation of the “effective” Hessian spectrum, which is maximized for Nesterov momentum. Altogether our results show that two-phase optimizers represent a fruitful new paradigm for understanding and improving training algorithms.
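A two-phase optimizer of this family can be sketched as an inner loop of plain SGD followed by an outer step that moves the anchor part-way toward the inner iterate (Lookahead-style). A minimal single-worker sketch on linear regression; the hyperparameters are illustrative and the multi-worker version would average several inner iterates before the outer step:

```python
import numpy as np

rng = np.random.default_rng(4)

def loss_grad(w, X, y):
    """Gradient of mean-squared error for linear regression."""
    return X.T @ (X @ w - y) / len(y)

def lookahead_sgd(X, y, outer_steps=30, inner_steps=10, lr=0.1, alpha=0.5):
    """Two-phase optimizer: an inner loop takes plain SGD steps, then the
    outer (synchronization) phase moves the anchor toward the inner iterate."""
    w = np.zeros(X.shape[1])
    for _ in range(outer_steps):
        v = w.copy()
        for _ in range(inner_steps):
            idx = rng.integers(0, len(y), size=16)   # minibatch
            v -= lr * loss_grad(v, X[idx], y[idx])
        w += alpha * (v - w)                         # outer step
    return w

X = rng.normal(size=(256, 5))
w_star = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_star + 0.01 * rng.normal(size=256)
w_hat = lookahead_sgd(X, y)
```

The outer averaging with alpha < 1 damps the SGD iterate noise, which is the signal-noise tradeoff the paper analyzes.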
[852] Probabilistic Forecasting of Localized Wildfire Spread Based on Conditional Flow Matching
Bryan Shaddy, Haitong Qin, Brianna Binder, James Haley, Riya Duddalwar, Kyle Hilburn, Assad Oberai
Main category: cs.LG
TL;DR: A probabilistic surrogate model for wildfire spread using conditional flow matching to generate ensemble predictions of fire arrival times based on environmental inputs.
Details
Motivation: To develop an efficient, scalable probabilistic wildfire forecasting model that can generate ensemble predictions while explicitly representing uncertainty, reducing computational costs compared to physics-based simulators.
Method: Uses conditional flow matching algorithm to model fire progression as a stochastic process, learning conditional distribution of fire arrival times given current fire state, environmental inputs (wind, temperature, humidity, terrain, fuel), and atmospheric data from WRF-SFIRE simulations.
Result: The model captures variability in fire evolution and produces accurate ensemble predictions for both 3-hour single-step and 24-hour multi-step forecasts, demonstrating sensitivity to key drivers of fire spread.
Conclusion: Provides a scalable probabilistic wildfire forecasting framework that enables efficient ensemble generation and offers a pathway for integrating ML models with operational fire prediction systems and data assimilation.
Abstract: This study presents a probabilistic surrogate model for localized wildfire spread based on a conditional flow matching algorithm. The approach models fire progression as a stochastic process by learning the conditional distribution of fire arrival times given the current fire state along with environmental and atmospheric inputs. Model inputs include current burned area, near-surface wind components, temperature, relative humidity, terrain height, and fuel category information, all defined on a high-resolution spatial grid. The outputs are samples of arrival time within a three-hour time window, conditioned on the input variables. Training data are generated from coupled atmosphere-wildfire spread simulations using WRF-SFIRE, paired with weather fields from the North American Mesoscale model. The proposed framework enables efficient generation of ensembles of arrival times and explicitly represents uncertainty arising from incomplete knowledge of the fire-atmosphere system and unresolved variables. The model supports localized prediction over subdomains, reducing computational cost relative to physics-based simulators while retaining sensitivity to key drivers of fire spread. Model performance is evaluated against WRF-SFIRE simulations for both single-step (3-hour) and recursive multi-step (24-hour) forecasts. Results demonstrate that the method captures variability in fire evolution and produces accurate ensemble predictions. The framework provides a scalable approach for probabilistic wildfire forecasting and offers a pathway for integrating machine learning models with operational fire prediction systems and data assimilation.
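Sampling from a flow-matching model amounts to integrating a learned velocity field from a base sample. The mechanics can be shown with the known conditional field of the straight-line path, for which Euler integration lands exactly on the target; the paper's field is instead a learned network conditioned on fire state and weather, so this is only a sketch of the sampling loop:

```python
import numpy as np

def euler_sample(x0, x1, n_steps=100):
    """Euler-integrate dx/dt = (x1 - x) / (1 - t), the conditional velocity
    field of the straight-line path, from the base sample x0 at t = 0."""
    x = np.asarray(x0, dtype=float).copy()
    for k in range(n_steps):
        t = k / n_steps
        x += (x1 - x) / (1.0 - t) / n_steps
    return x

rng = np.random.default_rng(5)
x0 = rng.normal(size=3)             # base ("noise") sample
x1 = np.array([2.0, -1.0, 0.5])     # target, e.g. per-pixel arrival times
x_final = euler_sample(x0, x1)
```

Drawing many base samples and integrating each yields the arrival-time ensemble; with a learned field the endpoints differ across draws, which is where the forecast uncertainty comes from.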
[853] ImmSET: Sequence-Based Predictor of TCR-pMHC Specificity at Scale
Marco Garcia Noceda, Matthew T Noakes, Andrew FigPope, Daniel E Mattox, Bryan Howie, Harlan Robins
Main category: cs.LG
TL;DR: ImmSET is a transformer-based architecture for predicting T cell receptor specificity to peptide-MHC complexes, showing improved performance over existing methods including AlphaFold2/3 when given sufficient training data.
Details
Motivation: Predicting TCR-pMHC specificity is crucial for understanding adaptive immunity and developing personalized therapies, but remains challenging due to extreme diversity of both TCRs and pMHCs. Existing sequence-based approaches have limitations and inflated performance metrics.
Method: ImmSET (Immune Synapse Encoding Transformer) is a novel sequence-based architecture designed to model interactions among sets of variable-length biological sequences. The model is trained across various dataset sizes and compositions, with systematic testing of scaling behavior with training data.
Result: ImmSET outperforms prior sequence-based approaches and shows consistent performance scaling with data volume. It compares favorably with ESM2 fine-tuned on same datasets, and can outperform AlphaFold2/3-based pipelines on TCR-pMHC specificity prediction when provided sufficient training data.
Conclusion: ImmSET establishes a scalable modeling paradigm for multi-sequence interaction problems, demonstrated in TCR-pMHC setting but generalizable to other biological domains where high-throughput sequence-driven reasoning complements structure prediction.
Abstract: T cells are a critical component of the adaptive immune system, playing a role in infectious disease, autoimmunity, and cancer. T cell function is mediated by the T cell receptor (TCR) protein, a highly diverse receptor targeting specific peptides presented by the major histocompatibility complex (pMHCs). Predicting the specificity of TCRs for their cognate pMHCs is central to understanding adaptive immunity and enabling personalized therapies. However, accurate prediction of this protein-protein interaction remains challenging due to the extreme diversity of both TCRs and pMHCs. Here, we present ImmSET (Immune Synapse Encoding Transformer), a novel sequence-based architecture designed to model interactions among sets of variable-length biological sequences. We train this model across a range of dataset sizes and compositions and study the resulting models’ generalization to pMHC targets. We describe a failure mode in prior sequence-based approaches that inflates previously reported performance on this task and show that ImmSET remains robust under stricter evaluation. In systematically testing the scaling behavior of ImmSET with training data, we show that performance scales consistently with data volume across multiple data types and compares favorably with the pre-trained protein language model ESM2 fine-tuned on the same datasets. Finally, we demonstrate that ImmSET can outperform AlphaFold2 and AlphaFold3-based pipelines on TCR-pMHC specificity prediction when provided sufficient training data. This work establishes ImmSET as a scalable modeling paradigm for multi-sequence interaction problems, demonstrated in the TCR-pMHC setting but generalizable to other biological domains where high-throughput sequence-driven reasoning complements structure prediction and experimental mapping.
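Modeling interactions among sets of variable-length sequences is commonly done by concatenating the tokenized sequences and adding a segment embedding marking which sequence each token belongs to, so attention can mix across sequences. A single-layer numpy sketch; the dimensions, pooling, and segment scheme are illustrative, not ImmSET's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(6)
D = 16                                   # model width

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def encode_sequence_set(seqs, embed, seg_embed, Wq, Wk, Wv):
    """One self-attention layer over the concatenation of several
    variable-length sequences; a segment embedding tells the model which
    sequence (e.g. TCR chain, peptide, MHC) each token came from."""
    tokens = np.concatenate(seqs)
    segs = np.concatenate([np.full(len(s), i) for i, s in enumerate(seqs)])
    H = embed[tokens] + seg_embed[segs]          # token + segment embedding
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    A = softmax(Q @ K.T / np.sqrt(D))            # full cross-sequence attention
    return (A @ V).mean(axis=0)                  # pooled interaction embedding

vocab, n_seg = 20, 3                             # 20 amino acids, 3 segments
embed = rng.normal(size=(vocab, D)) * 0.1
seg_embed = rng.normal(size=(n_seg, D)) * 0.1
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
seqs = [rng.integers(0, vocab, size=n) for n in (12, 9, 34)]  # TCR, peptide, MHC
z = encode_sequence_set(seqs, embed, seg_embed, Wq, Wk, Wv)
```

The pooled vector would feed a binding-specificity classifier head; stacking such layers and training end to end is what a transformer like ImmSET does.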
[854] Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching
Andrea Fraschini, Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli
Main category: cs.LG
TL;DR: OPC improves policy compression by shifting from action-matching to occupancy distribution matching, enabling better latent organization and generalization across behaviors.
Details
Motivation: Existing Action-based Policy Compression (APC) suffers from myopic action-matching losses that lead to compounding errors in sequential decisions, limiting its effectiveness for sample-efficient deep reinforcement learning.
Method: Occupancy-based Policy Compression (OPC) enhances APC by: 1) curating datasets with information-theoretic uniqueness metrics for diverse policies, and 2) using a differentiable compression objective that minimizes divergence between true and reconstructed mixture occupancy distributions.
Result: OPC organizes latent space around true functional similarity, promoting representations that generalize across behaviors while retaining parameter space expressivity, validated across multiple continuous control benchmarks.
Conclusion: Shifting from action-matching to occupancy-based compression enables more effective policy compression for sample-efficient reinforcement learning by capturing long-horizon behavioral similarity.
Abstract: Deep Reinforcement Learning (DRL) is widely recognized as sample-inefficient, a limitation attributable in part to the high dimensionality and substantial functional redundancy inherent to the policy parameter space. A recent framework, which we refer to as Action-based Policy Compression (APC), mitigates this issue by compressing the parameter space $\Theta$ into a low-dimensional latent manifold $\mathcal Z$ using a learned generative mapping $g:\mathcal Z \to \Theta$. However, its performance is severely constrained by relying on immediate action-matching as a reconstruction loss, a myopic proxy for behavioral similarity that suffers from compounding errors across sequential decisions. To overcome this bottleneck, we introduce Occupancy-based Policy Compression (OPC), which enhances APC by shifting behavior representation from immediate action-matching to long-horizon state-space coverage. Specifically, we propose two principal improvements: (1) we curate the dataset generation with an information-theoretic uniqueness metric that delivers a diverse population of policies; and (2) we propose a fully differentiable compression objective that directly minimizes the divergence between the true and reconstructed mixture occupancy distributions. These modifications force the generative model to organize the latent space around true functional similarity, promoting a latent representation that generalizes over a broad spectrum of behaviors while retaining most of the original parameter space’s expressivity. Finally, we empirically validate the advantages of our contributions across multiple continuous control benchmarks.
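The shift from action-matching to occupancy matching can be made concrete with discounted state-occupancy distributions estimated from rollouts and a divergence between them. A tabular numpy sketch; the paper works with continuous control and a fully differentiable objective, so this only illustrates the quantity being matched:

```python
import numpy as np

def occupancy(trajectories, n_states, gamma=0.95):
    """Discounted state-occupancy distribution estimated from rollouts."""
    d = np.zeros(n_states)
    for traj in trajectories:
        for t, s in enumerate(traj):
            d[s] += gamma ** t
    return d / d.sum()

def occupancy_divergence(trajs_a, trajs_b, n_states):
    """Symmetric KL between two (smoothed) occupancy distributions."""
    p = occupancy(trajs_a, n_states) + 1e-8
    q = occupancy(trajs_b, n_states) + 1e-8
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum() + (q * np.log(q / p)).sum())

# two policies on a 5-state chain: one drifts right, one stays near the start;
# per-step action matching can miss this long-horizon behavioral difference
right = [[0, 1, 2, 3, 4, 4]] * 3
stay = [[0, 0, 1, 0, 0, 1]] * 3
d_same = occupancy_divergence(right, right, 5)
d_diff = occupancy_divergence(right, stay, 5)
```

Minimizing such a divergence between true and reconstructed policies organizes the latent space by where policies actually spend time, not by individual actions.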
[855] Liquid Networks with Mixture Density Heads for Efficient Imitation Learning
Nikolaus Correll
Main category: cs.LG
TL;DR: Liquid neural networks with mixture density heads outperform diffusion policies in robotics imitation learning tasks with fewer parameters, better accuracy, and faster inference
Details
Motivation: To compare different policy architectures for imitation learning in robotics, specifically comparing liquid neural networks with mixture density heads against diffusion policies, to understand which approach provides better performance, efficiency, and robustness.
Method: Used a shared-backbone comparison protocol on Push-T, RoboMimic Can, and PointMaze tasks, isolating policy-head effects under matched inputs, training budgets, and evaluation settings. Compared parameter counts, offline prediction error, inference speed, and sample efficiency across different data regimes.
Result: Liquid policies used roughly half the parameters (4.3M vs 8.6M), achieved 2.4x lower offline prediction error, ran 1.8x faster at inference, and remained more robust across sample-efficiency experiments, especially in low-data and medium-data regimes
Conclusion: Liquid recurrent multimodal policies provide a compact and practical alternative to iterative denoising for imitation learning, with strong offline density modeling helping deployment though not fully determining closed-loop success
Abstract: We compare liquid neural networks with mixture density heads against diffusion policies on Push-T, RoboMimic Can, and PointMaze under a shared-backbone comparison protocol that isolates policy-head effects under matched inputs, training budgets, and evaluation settings. Across tasks, liquid policies use roughly half the parameters (4.3M vs. 8.6M), achieve 2.4x lower offline prediction error, and run 1.8x faster at inference. In sample-efficiency experiments spanning 1% to 46.42% of training data, liquid models remain consistently more robust, with especially large gains in low-data and medium-data regimes. Closed-loop results on Push-T and PointMaze are directionally consistent with offline rankings but noisier, indicating that strong offline density modeling helps deployment while not fully determining closed-loop success. Overall, liquid recurrent multimodal policies provide a compact and practical alternative to iterative denoising for imitation learning.
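A mixture density head outputs mixture weights, means, and scales, and is trained with the mixture negative log-likelihood, letting a policy represent multimodal action distributions that a single Gaussian cannot. A numpy sketch of the 1-D Gaussian-mixture NLL; the shapes and the toy bimodal target are illustrative:

```python
import numpy as np

def mdn_nll(y, log_pi, mu, log_sigma):
    """Negative log-likelihood of targets y under a 1-D Gaussian mixture head.
    Shapes: y is (B,); log_pi, mu, log_sigma are (B, K) for K components."""
    log_pi = log_pi - np.log(np.exp(log_pi).sum(axis=1, keepdims=True))
    z = (y[:, None] - mu) / np.exp(log_sigma)
    log_comp = log_pi - log_sigma - 0.5 * z**2 - 0.5 * np.log(2 * np.pi)
    m = log_comp.max(axis=1, keepdims=True)       # log-sum-exp for stability
    log_p = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return float(-log_p.mean())

# a bimodal action target: one unit Gaussian must straddle both modes, while
# a two-component mixture can place a tight component on each
y = np.array([-1.0, 1.0] * 50)
B = len(y)
nll_uni = mdn_nll(y, np.zeros((B, 1)), np.zeros((B, 1)), np.zeros((B, 1)))
nll_mix = mdn_nll(y, np.zeros((B, 2)), np.tile([-1.0, 1.0], (B, 1)),
                  np.full((B, 2), np.log(0.3)))
```

In the paper's setup a liquid recurrent backbone predicts these head parameters; sampling an action is then a single forward pass plus a mixture draw, rather than an iterative denoising chain.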
[856] Conformalized Signal Temporal Logic Inference under Covariate Shift
Yixuan Wang, Danyang Li, Matthew Cleaveland, Roberto Tron, Mingyu Cai
Main category: cs.LG
TL;DR: A conformalized STL inference framework that addresses covariate shift between training and deployment data using likelihood ratio estimation and weighted conformal prediction for reliable symbolic learning.
Details
Motivation: Existing STL inference methods with conformal prediction assume identical distribution and exchangeability between calibration and testing data, which is often violated in real-world settings with distribution shifts between training and deployment trajectories.
Method: 1) Uses template-free, differentiable STL inference to learn initial model; 2) Refines with limited deployment dataset for distribution alignment; 3) Estimates likelihood ratio between training and deployment distributions; 4) Integrates into STL-robustness-based weighted conformal prediction scheme.
Result: Experimental results on trajectory datasets show the framework preserves STL formula interpretability while significantly improving symbolic learning reliability at deployment time under distribution shift.
Conclusion: The proposed framework provides valid uncertainty quantification for STL inference under covariate shift, enhancing reliability of learned temporal logic rules in real-world deployment scenarios.
Abstract: Signal Temporal Logic (STL) inference learns interpretable logical rules for temporal behaviors in dynamical systems. To ensure the correctness of learned STL formulas, recent approaches have incorporated conformal prediction as a statistical tool for uncertainty quantification. However, most existing methods rely on the assumption that calibration and testing data are identically distributed and exchangeable, an assumption that is frequently violated in real-world settings. This paper proposes a conformalized STL inference framework that explicitly addresses covariate shift between training and deployment trajectory datasets. From a technical standpoint, the approach first employs a template-free, differentiable STL inference method to learn an initial model, and subsequently refines it using a limited deployment-side dataset to promote distribution alignment. To provide validity guarantees under distribution shift, the framework estimates the likelihood ratio between training and deployment distributions and integrates it into an STL-robustness-based weighted conformal prediction scheme. Experimental results on trajectory datasets demonstrate that the proposed framework preserves the interpretability of STL formulas while significantly improving symbolic learning reliability at deployment time.
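Weighted conformal prediction replaces the ordinary calibration quantile with a quantile of the likelihood-ratio-weighted score distribution. A minimal numpy sketch; the unknown test-point weight is taken as 1 for simplicity, and the paper applies the scheme to STL robustness scores rather than the generic scores used here:

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, alpha=0.1):
    """Level-(1 - alpha) quantile of calibration scores under likelihood-ratio
    weights w(x) ~ p_deploy(x) / p_train(x); the test point's weight is taken
    as 1 here.  Uniform weights recover standard split conformal prediction."""
    order = np.argsort(scores)
    s, w = np.asarray(scores, float)[order], np.asarray(weights, float)[order]
    cw = np.cumsum(w) / (w.sum() + 1.0)    # +1: the held-out test point's weight
    idx = np.searchsorted(cw, 1.0 - alpha)
    return s[min(idx, len(s) - 1)]

scores = np.arange(1, 101, dtype=float)                      # calibration scores
q_unif = weighted_conformal_quantile(scores, np.ones(100))   # iid baseline
q_shift = weighted_conformal_quantile(                       # deployment upweights
    scores, np.where(scores > 50, 5.0, 1.0))                 # the high-score region
```

Upweighting the calibration points that resemble deployment data pushes the threshold up, restoring the coverage that a plain quantile would lose under the shift.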
[857] Dynamic resource matching in manufacturing using deep reinforcement learning
Saunak Kumar Panda, Yisha Xiang, Ruiqi Liu
Main category: cs.LG
TL;DR: A deep reinforcement learning approach for dynamic manufacturing resource matching using domain knowledge-informed Q-learning and DDPG algorithms
Details
Motivation: Matching manufacturing resources efficiently is crucial for resource allocation in industries, but traditional methods struggle with large state/action spaces and complex demand distributions.
Method: Uses model-free deep RL with two penalties: a domain knowledge-based penalty from a prior policy and an infeasibility penalty for demand-supply constraints. Combines modified Q-learning with DDPG (DKDDPG).
Result: DKDDPG outperformed traditional DDPG and other RL algorithms in both small- and large-scale experiments, achieving higher rewards and greater efficiency
Conclusion: The domain knowledge-informed RL approach effectively solves complex manufacturing resource matching problems with theoretical convergence guarantees
Abstract: Matching plays an important role in the logical allocation of resources across a wide range of industries. The benefits of matching have been increasingly recognized in manufacturing industries. In particular, capacity sharing has received much attention recently. In this paper, we consider the problem of dynamically matching demand-capacity types of manufacturing resources. We formulate the multi-period, many-to-many manufacturing resource-matching problem as a sequential decision process. The formulated manufacturing resource-matching problem involves large state and action spaces, and it is not practical to accurately model the joint distribution of various types of demands. To address the curse of dimensionality and the difficulty of explicitly modeling the transition dynamics, we use a model-free deep reinforcement learning approach to find optimal matching policies. Moreover, to tackle the issue of infeasible actions and slow convergence due to initial biased estimates caused by the maximum operator in Q-learning, we introduce two penalties to the traditional Q-learning algorithm: a domain knowledge-based penalty based on a prior policy and an infeasibility penalty that conforms to the demand-supply constraints. We establish theoretical results on the convergence of our domain knowledge-informed Q-learning providing performance guarantee for small-size problems. For large-size problems, we further inject our modified approach into the deep deterministic policy gradient (DDPG) algorithm, which we refer to as domain knowledge-informed DDPG (DKDDPG). In our computational study, including small- and large-scale experiments, DKDDPG consistently outperformed traditional DDPG and other RL algorithms, yielding higher rewards and demonstrating greater efficiency in time and episodes.
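The two penalties can be folded directly into the Q-learning target: a small pull toward the prior policy's action and a large charge for infeasible demand-supply matches. A tabular sketch on a toy matching MDP; the penalty constants, rewards, and MDP are illustrative, not the paper's formulation:

```python
import numpy as np

def penalized_q_update(Q, s, a, r, s_next, feasible, prior_policy,
                       lr=0.1, gamma=0.9, c_dom=0.1, c_inf=10.0):
    """One Q-learning update with two extra penalties: a small domain-knowledge
    term for deviating from a prior policy, and a large infeasibility charge
    for actions violating demand-supply constraints."""
    penalty = 0.0
    if not feasible[s, a]:
        penalty += c_inf                   # infeasible demand-supply match
    if a != prior_policy[s]:
        penalty += c_dom                   # deviates from the prior policy
    target = r - penalty + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])
    return Q

rng = np.random.default_rng(7)
Q = np.zeros((2, 2))                       # 2 demand states x 2 matching actions
feasible = np.array([[True, False], [True, True]])
prior = np.array([0, 1])                   # prior policy's preferred actions
for _ in range(2000):
    s, a = rng.integers(2), rng.integers(2)
    Q = penalized_q_update(Q, s, a, 1.0, rng.integers(2), feasible, prior)
```

The same penalized target plugged into an actor-critic update is the step from this tabular version toward the paper's DKDDPG.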
[858] Hierarchy-Guided Topology Latent Flow for Molecular Graph Generation
Urvi Awasthi, Alexander Arjun Lobo, Leonid Zhukov
Main category: cs.LG
TL;DR: HLTF is a planner-executor model for generating chemically valid 3D molecules by explicitly generating bond graphs with 3D coordinates using multi-scale planning and constraint-aware sampling.
Details
Motivation: Current 3D molecule generators struggle with discrete bond topology - small local bond errors cause global failures like valence violations and disconnections, especially for drug-like molecules with long-range constraints. Many methods focus on coordinates first and infer bonds later, leaving topology feasibility weakly controlled.
Method: Hierarchy-Guided Latent Topology Flow (HLTF) uses a planner-executor approach: a latent multi-scale plan provides global context, and a constraint-aware sampler suppresses topology-driven failures to generate bond graphs with 3D coordinates.
Result: On QM9: 98.8% atom stability, 92.9% valid-and-unique, improving PoseBusters validity to 94.0% (+0.9 over strongest baseline). On GEOM-DRUGS: 85.5%/85.0% validity/valid-unique-novel without post-processing, 92.2%/91.2% after relaxation, within 0.9 points of best post-processed baseline.
Conclusion: Explicit topology generation with HLTF improves chemical validity of 3D molecules, reduces “false-valid” samples that pass basic checks but fail stricter validation, and addresses the fundamental challenge of discrete bond topology in molecular generation.
Abstract: Generating chemically valid 3D molecules is hindered by discrete bond topology: small local bond errors can cause global failures (valence violations, disconnections, implausible rings), especially for drug-like molecules with long-range constraints. Many unconditional 3D generators emphasize coordinates and then infer bonds or rely on post-processing, leaving topology feasibility weakly controlled. We propose Hierarchy-Guided Latent Topology Flow (HLTF), a planner-executor model that generates bond graphs with 3D coordinates, using a latent multi-scale plan for global context and a constraint-aware sampler to suppress topology-driven failures. On QM9, HLTF achieves 98.8% atom stability and 92.9% valid-and-unique, improving PoseBusters validity to 94.0% (+0.9 over the strongest reported baseline). On GEOM-DRUGS, HLTF attains 85.5%/85.0% validity/valid-unique-novel without post-processing and 92.2%/91.2% after standardized relaxation, within 0.9 points of the best post-processed baseline. Explicit topology generation also reduces “false-valid” samples that pass RDKit sanitization but fail stricter checks.
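Constraint-aware bond sampling can be sketched as rejecting any proposed bond that would push an atom past its maximum valence, which is a crude stand-in for the paper's sampler. The atom list, proposal format, and valence table are illustrative:

```python
import numpy as np

MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def constraint_aware_bonds(atoms, proposals, rng):
    """Accept proposed bonds in random order, rejecting any bond that would
    exceed an atom's maximum valence -- a minimal stand-in for constraint-aware
    sampling of a chemically valid bond graph."""
    used = np.zeros(len(atoms), dtype=int)     # valence consumed per atom
    accepted = []
    for k in rng.permutation(len(proposals)):
        i, j, order = proposals[k]
        if (used[i] + order <= MAX_VALENCE[atoms[i]]
                and used[j] + order <= MAX_VALENCE[atoms[j]]):
            used[i] += order
            used[j] += order
            accepted.append((i, j, order))
    return accepted, used

rng = np.random.default_rng(8)
atoms = ["C", "O", "H", "H", "H", "H"]
# over-complete single-bond proposals; at least one must be rejected
proposals = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1), (1, 5, 1), (0, 5, 1)]
bonds, used = constraint_aware_bonds(atoms, proposals, rng)
```

Masking infeasible proposals at sampling time, rather than repairing molecules afterward, is what suppresses the "false-valid" failures the abstract describes.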
[859] Maximin Learning of Individualized Treatment Effect on Multi-Domain Outcomes
Yuying Lu, Wenbo Fei, Yuanjia Wang, Molei Liu
Main category: cs.LG
TL;DR: DRIFT: A maximin framework for estimating robust individualized treatment effects from high-dimensional item-level data using latent factor representations and adversarial learning.
Details
Motivation: Precision mental health requires treatment decisions accounting for heterogeneous symptoms across multiple clinical domains, but existing ITE methods rely on single summary outcomes or specific symptom sets that are sensitive to symptom selection and limit generalizability to unmeasured clinically relevant domains.
Method: DRIFT learns latent constructs via generalized factor analysis, constructs an anchored on-target uncertainty set that extrapolates beyond observed measures to approximate the broader hyper-population of potential outcomes, and optimizes worst-case performance over this uncertainty set using adversarial learning.
Result: DRIFT demonstrates superior performance and improved generalizability to external multi-domain outcomes in analyses of a major depressive disorder randomized controlled trial (EMBARC), including side effects and self-reported symptoms not used during training.
Conclusion: DRIFT provides a robust framework for estimating ITEs that are invariant to latent factor reparameterizations and robust to underrepresented or unmeasured clinical domains, with theoretical guarantees for identification and convergence.
Abstract: Precision mental health requires treatment decisions that account for heterogeneous symptoms reflecting multiple clinical domains. However, existing methods for estimating individualized treatment effects (ITE) rely on a single summary outcome or a specific set of observed symptoms or measures, which are sensitive to symptom selection and limit generalizability to unmeasured yet clinically relevant domains. We propose DRIFT, a new maximin framework for estimating robust ITEs from high-dimensional item-level data by leveraging latent factor representations and adversarial learning. DRIFT learns latent constructs via generalized factor analysis, then constructs an anchored on-target uncertainty set that extrapolates beyond the observed measures to approximate the broader hyper-population of potential outcomes. By optimizing worst-case performance over this uncertainty set, DRIFT yields ITEs that are robust to underrepresented or unmeasured domains. We further show that DRIFT is invariant to admissible reparameterizations of the latent factors and admits a closed-form maximin solution, with theoretical guarantees for identification and convergence. In analyses of a randomized controlled trial for major depressive disorder (EMBARC), DRIFT demonstrates superior performance and improved generalizability to external multi-domain outcomes, including side effects and self-reported symptoms not used during training.
[860] Bayesian-Symbolic Integration for Uncertainty-Aware Parking Prediction
Alireza Nezhadettehad, Arkady Zaslavsky, Abdur Rakib, Seng W. Loke
Main category: cs.LG
TL;DR: A neuro-symbolic framework combining Bayesian Neural Networks with symbolic reasoning for robust parking availability prediction under uncertainty, noise, and data sparsity.
Details
Motivation: Real-world parking prediction faces challenges like data sparsity, noise, and unpredictable changes, requiring models that are both accurate and uncertainty-aware for intelligent transportation systems.
Method: Loosely coupled neuro-symbolic framework integrating Bayesian Neural Networks (for uncertainty quantification) with symbolic reasoning (extracted via decision trees and encoded using probabilistic logic programming). Two hybrid strategies: 1) symbolic reasoning as fallback when BNN confidence is low, 2) refining output classes based on symbolic constraints before reapplying the BNN.
Result: Both hybrid methods outperform symbolic reasoning alone, and the context-refinement strategy consistently exceeds LSTM networks and BNN baselines across all prediction windows under full, sparse, and noisy conditions.
Conclusion: The work highlights the potential of modular neuro-symbolic integration for real-world, uncertainty-prone prediction tasks in intelligent transportation systems.
Abstract: Accurate parking availability prediction is critical for intelligent transportation systems, but real-world deployments often face data sparsity, noise, and unpredictable changes. Addressing these challenges requires models that are not only accurate but also uncertainty-aware. In this work, we propose a loosely coupled neuro-symbolic framework that integrates Bayesian Neural Networks (BNNs) with symbolic reasoning to enhance robustness in uncertain environments. BNNs quantify predictive uncertainty, while symbolic knowledge extracted via decision trees and encoded using probabilistic logic programming is leveraged in two hybrid strategies: (1) using symbolic reasoning as a fallback when BNN confidence is low, and (2) refining output classes based on symbolic constraints before reapplying the BNN. We evaluate both strategies on real-world parking data under full, sparse, and noisy conditions. Results demonstrate that both hybrid methods outperform symbolic reasoning alone, and the context-refinement strategy consistently exceeds the performance of Long Short-Term Memory (LSTM) networks and BNN baselines across all prediction windows. Our findings highlight the potential of modular neuro-symbolic integration in real-world, uncertainty-prone prediction tasks.
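The two hybrid strategies can be sketched as simple control flow around the learned models. This is an illustrative sketch only: the dict-of-class-probabilities interface, the confidence threshold, and the function names are assumptions, not the paper's implementation.

```python
def hybrid_fallback(bnn_probs, symbolic_pred, threshold=0.7):
    """Strategy 1: trust the BNN unless its top-class confidence is low,
    in which case fall back to the symbolic prediction."""
    top_class = max(bnn_probs, key=bnn_probs.get)
    if bnn_probs[top_class] >= threshold:
        return top_class
    return symbolic_pred

def hybrid_refine(bnn, x, allowed_classes):
    """Strategy 2: restrict the label space via symbolic constraints, then
    re-apply the BNN over the remaining classes (renormalized)."""
    probs = bnn(x)
    restricted = {c: p for c, p in probs.items() if c in allowed_classes}
    total = sum(restricted.values())
    renorm = {c: p / total for c, p in restricted.items()}
    return max(renorm, key=renorm.get), renorm
```

The loose coupling is visible here: either module can be swapped out without retraining the other, which is what makes the integration modular.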
[861] Semantic Interaction Information mediates compositional generalization in latent space
John Schwarcz
Main category: cs.LG
TL;DR: Paper proposes a framework for compositional generalization using variational inference over latent variables, introduces Cognitive Gridworld POMDP with Semantic Interaction Information metric, and develops Representation Classification Chains architecture for disentangled learning of variable interactions.
Details
Motivation: The paper addresses whether barriers to generalization still exist even when all relevant variables are known, focusing on how latent variables interact compositionally and how these interactions affect learning and generalization capabilities in neural networks.
Method: Develops Cognitive Gridworld - a stationary POMDP where observations come from multiple latent variables but feedback is only for a single goal variable. Introduces Semantic Interaction Information (SII) metric to measure latent variable interaction contributions. Proposes Representation Classification Chains (RCCs) - a JEPA-style architecture that disentangles variable inference (via RL) from variable embeddings (via self-supervised learning).
Result: SII explains accuracy gap between Echo State and Fully Trained RNNs. Uncovers failure mode where confidence decouples from accuracy. RCCs successfully handle the circular dependence problem in learning variable interactions and facilitate compositional generalization to novel variable combinations.
Conclusion: The work establishes a grounded setting for evaluating goal-directed generalist agents, showing that utilizing interactions between relevant variables is non-trivial and that disentangled learning approaches like RCCs can overcome circular dependence challenges in continual meta-learning.
Abstract: Are there still barriers to generalization once all relevant variables are known? We address this question via a framework that casts compositional generalization as a variational inference problem over latent variables with parametric interactions. To explore this, we develop the Cognitive Gridworld, a stationary Partially Observable Markov Decision Process (POMDP) where observations are generated jointly by multiple latent variables, yet feedback is provided for only a single goal variable. This setting allows us to define Semantic Interaction Information (SII): a metric measuring the contribution of latent variable interactions to task performance. Using SII, we analyze Recurrent Neural Networks (RNNs) provided with these interactions, finding that SII explains the accuracy gap between Echo State and Fully Trained networks. Our analysis also uncovers a theoretically predicted failure mode where confidence decouples from accuracy, suggesting that utilizing interactions between relevant variables is a non-trivial capability. We then address a harder regime where the interactions must be learned by an embedding model. Learning how latent variables interact requires accurate inference, yet accurate inference depends on knowing those interactions. The Cognitive Gridworld reveals this circular dependence as a core challenge for continual meta-learning. We approach this dilemma via Representation Classification Chains (RCCs), a JEPA-style architecture that disentangles these processes: variable inference and variable embeddings are learned by separate modules through Reinforcement Learning and self-supervised learning, respectively. Lastly, we demonstrate that RCCs facilitate compositional generalization to novel combinations of relevant variables. Together, these results establish a grounded setting for evaluating goal-directed generalist agents.
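The paper's SII is defined through task performance, but the underlying notion of variable synergy is close to classic interaction information over a joint distribution. A minimal background sketch (not the paper's metric), with the sign convention chosen so that synergy, as in XOR, comes out positive:

```python
from collections import defaultdict
from math import log2

def _mi(pairs):
    """Mutual information of a joint distribution dict[(a, b)] -> p."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in pairs.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in pairs.items() if p > 0)

def interaction_information(joint):
    """I(X;Y|Z) - I(X;Y) for a joint dict[(x, y, z)] -> p.
    Positive when Z reveals synergy between X and Y (e.g. Z = X xor Y)."""
    pxy, pz = defaultdict(float), defaultdict(float)
    cond = defaultdict(lambda: defaultdict(float))
    for (x, y, z), p in joint.items():
        pxy[(x, y)] += p
        pz[z] += p
        cond[z][(x, y)] += p
    i_xy_given_z = sum(
        pz[z] * _mi({k: v / pz[z] for k, v in pairs.items()})
        for z, pairs in cond.items()
    )
    return i_xy_given_z - _mi(pxy)
```

For independent uniform bits X, Y with Z = X xor Y, the pairwise terms vanish yet the triple carries one full bit: exactly the kind of interaction a network must exploit to use all relevant variables.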
[862] Spectral-Aware Text-to-Time Series Generation with Billion-Scale Multimodal Meteorological Data
Shijie Zhang
Main category: cs.LG
TL;DR: A text-to-time-series generation framework for meteorology using diffusion models with spectral priors and a large-scale, physically grounded multimodal dataset.
Details
Motivation: Need for intuitive natural language control over complex atmospheric dynamics, addressing limitations of existing approaches that lack large-scale multimodal datasets and use architectures that overlook the spectral-temporal structure of weather signals.
Method: 1) MeteoCap-3B dataset with expert-level captions via a Multi-agent Collaborative Captioning pipeline; 2) MTransformer diffusion model with a Spectral Prompt Generator that maps text to multi-band spectral priors for frequency-aware attention guidance.
Result: State-of-the-art generation quality, accurate cross-modal alignment, strong semantic controllability, substantial gains in downstream forecasting under data-sparse and zero-shot settings, and generalization beyond meteorology
Conclusion: Unified framework enables precise semantic control over meteorological time-series generation through text-guided spectral priors, with demonstrated effectiveness and generalizability
Abstract: Text-to-time-series generation is particularly important in meteorology, where natural language offers intuitive control over complex, multi-scale atmospheric dynamics. Existing approaches are constrained by the lack of large-scale, physically grounded multimodal datasets and by architectures that overlook the spectral-temporal structure of weather signals. We address these challenges with a unified framework for text-guided meteorological time-series generation. First, we introduce MeteoCap-3B, a billion-scale weather dataset paired with expert-level captions constructed via a Multi-agent Collaborative Captioning (MACC) pipeline, yielding information-dense and physically consistent annotations. Building on this dataset, we propose MTransformer, a diffusion-based model that enables precise semantic control by mapping textual descriptions into multi-band spectral priors through a Spectral Prompt Generator, which guides generation via frequency-aware attention. Extensive experiments on real-world benchmarks demonstrate state-of-the-art generation quality, accurate cross-modal alignment, strong semantic controllability, and substantial gains in downstream forecasting under data-sparse and zero-shot settings. Additional results on general time-series benchmarks indicate that the proposed framework generalizes beyond meteorology.
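A multi-band spectral prior can be pictured as splitting a series into frequency bands. The sketch below uses an rFFT split with assumed band edges; it illustrates the kind of decomposition the Spectral Prompt Generator is described as conditioning on, not the paper's actual module.

```python
import numpy as np

def band_split(series, edges=(0.0, 0.1, 0.3, 0.5)):
    """Decompose a 1-D series into per-band components (band edges in
    cycles/sample are illustrative assumptions). Bands sum back to the input
    when the edges cover the full normalized frequency range."""
    spec = np.fft.rfft(series)
    freqs = np.fft.rfftfreq(len(series))
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = np.where((freqs >= lo) & (freqs < hi), spec, 0)
        bands.append(np.fft.irfft(masked, n=len(series)))
    return bands
```

Frequency-aware attention would then weight these components differently depending on the textual prompt (e.g. emphasizing the low band for a caption describing a slow synoptic trend).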
[863] ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
Qiuyang Zhang, Kai Zhou, Ding Tang, Kai Lu, Cheng Li, Zhenyu Yang, Peng Xu, Jiguang Wan
Main category: cs.LG
TL;DR: ScoutAttention is a KV cache offloading framework that accelerates LLM inference through GPU-CPU collaborative attention computation, using layer-ahead CPU pre-computation and block-wise sparse attention to reduce CPU load and improve performance.
Details
Motivation: Large language models face GPU memory constraints during long-context inference due to KV cache memory consumption, limiting decode batch sizes. Existing offloading approaches suffer from frequent GPU-CPU transfers or extensive CPU computation, leading to poor GPU utilization while waiting for I/O or CPU processing.
Method: Proposes ScoutAttention with GPU-CPU collaborative block-wise sparse attention to reduce CPU load. Features layer-ahead CPU pre-computation algorithm where CPU initiates attention computation one layer in advance, complemented by asynchronous periodic recall mechanisms to maintain minimal CPU compute load.
Result: Experimental results show ScoutAttention maintains accuracy within 2.4% of baseline while achieving 2.1x speedup compared to existing offloading methods.
Conclusion: ScoutAttention effectively addresses GPU memory constraints for long-context LLM inference through efficient GPU-CPU collaboration, significantly improving performance over existing offloading approaches while maintaining accuracy.
Abstract: Large language models encounter critical GPU memory capacity constraints during long-context inference, where KV cache memory consumption severely limits decode batch sizes. While existing research has explored offloading KV cache to DRAM, these approaches either demand frequent GPU-CPU data transfers or impose extensive CPU computation requirements, resulting in poor GPU utilization as the system waits for I/O operations or CPU processing to complete. We propose ScoutAttention, a novel KV cache offloading framework that accelerates LLM inference through collaborative GPU-CPU attention computation. To prevent CPU computation from bottlenecking the system, ScoutAttention introduces GPU-CPU collaborative block-wise sparse attention that significantly reduces CPU load. Unlike conventional parallel computing approaches, our framework features a novel layer-ahead CPU pre-computation algorithm, enabling the CPU to initiate attention computation one layer in advance, complemented by asynchronous periodic recall mechanisms to maintain minimal CPU compute load. Experimental results demonstrate that ScoutAttention maintains accuracy within 2.4% of baseline while achieving 2.1x speedup compared to existing offloading methods.
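The layer-ahead scheduling pattern can be sketched with a worker thread standing in for the CPU. Only the overlap structure is shown: the attention bodies are toy stand-ins, and the sketch sidesteps the dependency of layer l+1's input on layer l's output (which the real system must handle) by reusing the initial input.

```python
from concurrent.futures import ThreadPoolExecutor

def gpu_attention(layer, x):
    return x + layer          # stand-in for on-GPU attention over hot KV blocks

def cpu_sparse_attention(layer, x):
    return layer * 0.1        # stand-in for CPU sparse attention over offloaded blocks

def forward(num_layers, x):
    with ThreadPoolExecutor(max_workers=1) as cpu:
        pending = cpu.submit(cpu_sparse_attention, 0, x)   # pre-start layer 0
        for layer in range(num_layers):
            # kick off the CPU part of the *next* layer before finishing this one
            nxt = (cpu.submit(cpu_sparse_attention, layer + 1, x)
                   if layer + 1 < num_layers else None)
            x = gpu_attention(layer, x) + pending.result() # merge partial outputs
            pending = nxt
    return x
```

The point of the pattern is that `pending.result()` is usually already done by the time the GPU stand-in finishes, so the CPU work hides behind GPU compute instead of serializing with it.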
[864] Preconditioned Attention: Enhancing Efficiency in Transformers
Hemanth Saratchandran
Main category: cs.LG
TL;DR: Preconditioned attention improves transformer training by reducing condition numbers of attention matrices through conditioning matrices, leading to better optimization across vision and language tasks.
Details
Motivation: Standard attention mechanisms in transformers often produce ill-conditioned matrices with large condition numbers, which is a well-known obstacle for gradient-based optimizers leading to inefficient training.
Method: Introduces preconditioned attention that incorporates a conditioning matrix into each attention head to reduce the condition number of attention matrices, serving as a simple drop-in replacement for various attention mechanisms.
Result: Validated effectiveness across diverse transformer applications including image classification, object detection, instance segmentation, long sequence modeling and language modeling.
Conclusion: Preconditioned attention significantly reduces condition numbers of attention matrices, resulting in better-conditioned matrices that improve optimization efficiency in transformers.
Abstract: Central to the success of Transformers is the attention block, which effectively models global dependencies among the input tokens associated with a dataset. However, we theoretically demonstrate that standard attention mechanisms in transformers often produce ill-conditioned matrices with large condition numbers. This ill-conditioning is a well-known obstacle for gradient-based optimizers, leading to inefficient training. To address this issue, we introduce preconditioned attention, a novel approach that incorporates a conditioning matrix into each attention head. Our theoretical analysis shows that this method significantly reduces the condition number of attention matrices, resulting in better-conditioned matrices that improve optimization. Preconditioned attention serves as a simple drop-in replacement for a wide variety of attention mechanisms in the literature. We validate the effectiveness of preconditioned attention across a diverse set of transformer applications, including image classification, object detection, instance segmentation, long sequence modeling and language modeling.
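The abstract does not specify the form of the conditioning matrix, so the sketch below assumes one simple instantiation: a learned positive diagonal scaling applied to queries and keys before the score computation. Treat it as an illustration of where a conditioner slots in, not the paper's method.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def preconditioned_attention(Q, K, V, precond):
    """Q, K, V: (tokens, d). precond: (d,) positive diagonal conditioner
    (an assumed form; a full matrix conditioner is equally possible)."""
    Qp, Kp = Q * precond, K * precond           # apply conditioning matrix
    scores = Qp @ Kp.T / np.sqrt(Q.shape[-1])   # hopefully better-conditioned logits
    return softmax(scores) @ V
```

With `precond` equal to all-ones this reduces exactly to standard scaled dot-product attention, which is what makes such a conditioner a drop-in replacement.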
[865] A Tight Expressivity Hierarchy for GNN-Based Entity Resolution in Master Data Management
Ashwin Ganesan
Main category: cs.LG
TL;DR: Theoretical analysis of minimal MPNN architectures needed for entity resolution tasks, establishing tight complexity bounds for different matching predicates with practical architecture selection guidance.
Details
Motivation: Current MPNN approaches for entity resolution use all available extensions (reverse message passing, port numbering, ego IDs) which creates unnecessary overhead. The paper aims to determine the cheapest MPNN architecture that provably works for different entity resolution tasks based on their fundamental complexity.
Method: Develops a four-theorem separation theory on typed entity-attribute graphs. Introduces co-reference predicates (Dup_r for entities sharing at least r attribute values and Cyc_ℓ for cycle detection). For each predicate, proves tight bounds by constructing graph pairs indistinguishable by MPNNs lacking required adaptations, and exhibits explicit minimal-depth MPNNs that compute the predicate.
Result: Reveals sharp complexity gap: detecting any shared attribute is purely local (requires reverse message passing in two layers), while detecting multiple shared attributes requires cross-attribute identity correlation (needs ego IDs and four layers). Similar necessity holds for cycle detection. Provides minimal-architecture principle for practitioners.
Conclusion: Establishes theoretical foundations for selecting minimal MPNN architectures for entity resolution tasks, with computational validation confirming predictions. Provides practical guidance for choosing cheapest sufficient adaptation set with guarantee that no simpler architecture works.
Abstract: Entity resolution – identifying database records that refer to the same real-world entity – is naturally modelled on bipartite graphs connecting entity nodes to their attribute values. Applying a message-passing neural network (MPNN) with all available extensions (reverse message passing, port numbering, ego IDs) incurs unnecessary overhead, since different entity resolution tasks have fundamentally different complexity. For a given matching criterion, what is the cheapest MPNN architecture that provably works? We answer this with a four-theorem separation theory on typed entity-attribute graphs. We introduce co-reference predicates $\mathrm{Dup}_r$ (two same-type entities share at least $r$ attribute values) and the $\ell$-cycle predicate $\mathrm{Cyc}_\ell$ for settings with entity-entity edges. For each predicate we prove tight bounds – constructing graph pairs provably indistinguishable by every MPNN lacking the required adaptation, and exhibiting explicit minimal-depth MPNNs that compute the predicate on all inputs. The central finding is a sharp complexity gap between detecting any shared attribute and detecting multiple shared attributes. The former is purely local, requiring only reverse message passing in two layers. The latter demands cross-attribute identity correlation – verifying that the same entity appears at several attributes of the target – a fundamentally non-local requirement needing ego IDs and four layers, even on acyclic bipartite graphs. A similar necessity holds for cycle detection. Together, these results yield a minimal-architecture principle: practitioners can select the cheapest sufficient adaptation set, with a guarantee that no simpler architecture works. Computational validation confirms every prediction.
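The locality of the $r=1$ case can be made concrete without any neural machinery: one message round from entities to attributes (counting incident entities) and one reverse round back suffices. The sketch below is a plain-Python illustration of that two-round computation; the graph encoding is an assumption.

```python
def dup1(entity_attrs):
    """entity_attrs: dict entity -> set of attribute values.
    Round 1 (forward): each attribute node counts its incident entities.
    Round 2 (reverse): each entity learns whether any of its attributes
    has degree >= 2, i.e. is shared with another entity."""
    count = {}
    for attrs in entity_attrs.values():
        for a in attrs:
            count[a] = count.get(a, 0) + 1
    return {e: any(count[a] >= 2 for a in attrs)
            for e, attrs in entity_attrs.items()}
```

Detecting *multiple* shared attributes cannot be written this way: the attribute counts alone do not say whether the co-occurring entities are the same one at each attribute, which is exactly the identity correlation the paper shows requires ego IDs.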
[866] GSR-GNN: Training Acceleration and Memory-Saving Framework of Deep GNNs on Circuit Graph
Yuebo Luo, Shiyang Li, Yifei Feng, Vishal Kancharla, Shaoyi Huang, Caiwen Ding
Main category: cs.LG
TL;DR: GSR-GNN enables training of deep graph neural networks for large-scale circuit analysis by combining reversible residuals with group-wise sparse nonlinear operators, achieving significant memory and compute savings.
Details
Motivation: Deep GNNs show strong promise for circuit analysis but face GPU memory and training cost limitations when scaling to modern large-scale circuit graphs, motivating the need for efficient, domain-specific training frameworks.
Method: Proposes Grouped-Sparse-Reversible GNN (GSR-GNN) which integrates reversible residual modules with group-wise sparse nonlinear operators that compress node embeddings without sacrificing task-relevant information, and employs an optimized execution pipeline to eliminate fragmented activation storage and reduce data movement.
Result: Achieves up to 87.2% peak memory reduction and over 30× training speedup with negligible degradation in correlation-based quality metrics on sampled circuit graphs.
Conclusion: GSR-GNN makes deep GNNs practical for large-scale EDA workloads by enabling training of GNNs with up to hundreds of layers while reducing both compute and memory overhead.
Abstract: Graph Neural Networks (GNNs) show strong promise for circuit analysis, but scaling to modern large-scale circuit graphs is limited by GPU memory and training cost, especially for deep models. We revisit deep GNNs for circuit graphs and show that, when trainable, they significantly outperform shallow architectures, motivating an efficient, domain-specific training framework. We propose Grouped-Sparse-Reversible GNN (GSR-GNN), which enables training GNNs with up to hundreds of layers while reducing both compute and memory overhead. GSR-GNN integrates reversible residual modules with a group-wise sparse nonlinear operator that compresses node embeddings without sacrificing task-relevant information, and employs an optimized execution pipeline to eliminate fragmented activation storage and reduce data movement. On sampled circuit graphs, GSR-GNN achieves up to 87.2% peak memory reduction and over 30$\times$ training speedup with negligible degradation in correlation-based quality metrics, making deep GNNs practical for large-scale EDA workloads.
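The memory saving from reversible residuals comes from a standard coupling trick: split the features in two halves so each layer input can be recomputed exactly from its output during the backward pass, instead of being stored. A minimal sketch of that coupling (GSR-GNN's `f`/`g` would be its group-sparse GNN operators; here they are arbitrary functions):

```python
def rev_forward(x1, x2, f, g):
    """Reversible residual coupling: the layer output determines its input."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2, f, g):
    """Exact reconstruction of (x1, x2) from (y1, y2) - no stored activations."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2
```

Because inversion holds for any `f` and `g`, stacking hundreds of such layers costs activation memory independent of depth, which is what makes very deep GNNs feasible on large circuit graphs.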
[867] Online Learning of Kalman Filtering: From Output to State Estimation
Lintao Ye, Ankang Zhang, Ming Chi, Bin Du, Jianghai Hu
Main category: cs.LG
TL;DR: Online learning framework for Kalman filtering with unknown system model in partially observed linear dynamical systems, achieving log T-regret for output estimation and sqrt T-regret for state estimation with limited queries.
Details
Motivation: Addresses the problem of learning Kalman filtering when the underlying system model is unknown in partially observed linear dynamical systems, tackling both output estimation and the more challenging state estimation scenario, which remains an open problem in the literature.
Method: Proposes a unified algorithmic framework based on online optimization that leverages properties of the estimation error cost functions (conditionally strong convexity). For state estimation, introduces a random query scheme that provides limited access to more informative state measurements.
Result: Achieves log T-regret for output estimation and demonstrates impossibility of sublinear regret for state estimation without additional information. With random query scheme, achieves sqrt T-regret for state estimation, capturing trade-off between query number and regret.
Conclusion: Provides theoretical guarantees for learning Kalman filtering with unknown models, addresses fundamental limitations of state estimation, and offers practical algorithm with query-based approach that balances information access with performance.
Abstract: In this paper, we study the problem of learning Kalman filtering with unknown system model in partially observed linear dynamical systems. We propose a unified algorithmic framework based on online optimization that can be used to solve both the output estimation and state estimation scenarios. By exploring the properties of the estimation error cost functions, such as conditionally strong convexity, we show that our algorithm achieves a $\log T$-regret in the horizon length $T$ for the output estimation scenario. More importantly, we tackle the more challenging scenario of learning Kalman filtering for state estimation, which is an open problem in the literature. We first characterize a fundamental limitation of the problem, demonstrating the impossibility of any algorithm to achieve sublinear regret in $T$. By further introducing a random query scheme into our algorithm, we show that a $\sqrt{T}$-regret is achievable when rendering the algorithm limited query access to more informative measurements of the system state in practice. Our algorithm and regret readily capture the trade-off between the number of queries and the achieved regret, and shed light on online learning problems with limited observations. We validate the performance of our algorithms using numerical examples.
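The online-optimization viewpoint can be illustrated on a deliberately tiny instance: a scalar predictor updated by online gradient descent on the squared one-step output prediction error. The linear-in-the-last-output predictor class and the learning rate are assumptions for illustration, not the paper's filter parametrization.

```python
def online_output_estimation(ys, lr=0.1):
    """Online gradient descent on per-step losses (theta * y_{t-1} - y_t)^2.
    Returns the final parameter and the per-step loss sequence."""
    theta = 0.0
    losses = []
    for t in range(1, len(ys)):
        pred = theta * ys[t - 1]
        err = pred - ys[t]
        losses.append(err * err)
        theta -= lr * 2 * err * ys[t - 1]   # gradient step on the current loss
    return theta, losses
```

Regret in this setting is the cumulative loss of the online learner minus that of the best fixed predictor in hindsight; strong-convexity-type conditions on these per-step losses are what allow the logarithmic regret the paper proves for output estimation.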
[868] Hybrid Deep Learning with Temporal Data Augmentation for Accurate Remaining Useful Life Prediction of Lithium-Ion Batteries
Yun Tian, Guili Wang, Jian Bi, Kaixin Han, Chenglu Wu, Zhiyi Lu, Chenhao Li, Liangwang Sun, Minyu Zhou, Chenchen Xu
Main category: cs.LG
TL;DR: CDFormer: A hybrid deep learning model combining CNNs, deep residual shrinkage networks, and Transformer encoders for accurate lithium-ion battery remaining useful life prediction from measurement signals.
Details
Motivation: Existing RUL prediction models lack robustness and generalization capabilities due to complex operating conditions and limited data availability, requiring improved methods for reliable battery health monitoring.
Method: Proposes CDFormer - a hybrid architecture integrating convolutional neural networks, deep residual shrinkage networks, and Transformer encoders to extract multiscale temporal features from battery signals (voltage, current, capacity). Uses composite temporal data augmentation with Gaussian noise, time warping, and time resampling.
Result: CDFormer demonstrates consistent superiority over conventional RNN-based and Transformer-based baselines across key metrics on two real-world datasets, improving reliability and predictive performance.
Conclusion: CDFormer provides accurate and reliable RUL forecasts, supporting effective battery health monitoring and data-driven maintenance strategies through improved modeling of local and global degradation dynamics.
Abstract: Accurate prediction of lithium-ion battery remaining useful life (RUL) is essential for reliable health monitoring and data-driven analysis of battery degradation. However, the robustness and generalization capabilities of existing RUL prediction models are significantly challenged by complex operating conditions and limited data availability. To address these limitations, this study proposes a hybrid deep learning model, CDFormer, which integrates convolutional neural networks, deep residual shrinkage networks, and Transformer encoders to extract multiscale temporal features from battery measurement signals, including voltage, current, and capacity. This architecture enables the joint modeling of local and global degradation dynamics, effectively improving the accuracy of RUL prediction. To enhance predictive reliability, a composite temporal data augmentation strategy is proposed, incorporating Gaussian noise, time warping, and time resampling, explicitly accounting for measurement noise and variability. CDFormer is evaluated on two real-world datasets, with experimental results demonstrating its consistent superiority over conventional recurrent neural network-based and Transformer-based baselines across key metrics. By improving the reliability and predictive performance of RUL prediction from measurement data, CDFormer provides accurate and reliable forecasts, supporting effective battery health monitoring and data-driven maintenance strategies.
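Two of the named augmentations are easy to sketch directly: additive Gaussian noise and time resampling via linear interpolation. The plain-list representation and default parameters are illustrative assumptions (a real pipeline would operate on tensors and tune the noise scale to the signal).

```python
import random

def add_gaussian_noise(series, sigma=0.01, seed=None):
    """Perturb each sample with zero-mean Gaussian noise of scale sigma."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in series]

def time_resample(series, new_len):
    """Linearly interpolate the series onto a uniform grid of new_len points,
    simulating a different sampling rate for the same degradation trajectory."""
    n = len(series)
    out = []
    for k in range(new_len):
        pos = k * (n - 1) / (new_len - 1)
        i = min(int(pos), n - 2)
        frac = pos - i
        out.append(series[i] * (1 - frac) + series[i + 1] * frac)
    return out
```

Time warping would additionally make the resampling grid non-uniform, stretching some portions of the cycle history and compressing others.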
[869] Omni-Modal Dissonance Benchmark: Systematically Breaking Modality Consensus to Probe Robustness and Calibrated Abstention
Zabir Al Nazi, Shubhashis Roy Dipta, Md Rizwan Parvez
Main category: cs.LG
TL;DR: OMD-Bench is a diagnostic benchmark for omni-modal models that uses congruent anchors across video, audio, and text modalities, then systematically corrupts them to isolate modality contributions and evaluate calibrated abstention when evidence conflicts.
Details
Motivation: Existing omni-modal benchmarks have confounded measurements because naturally co-occurring modalities carry correlated yet unequal information, making it unclear whether results reflect true modality reliance or information asymmetry. There's a need for a benchmark that can isolate each modality's contribution and evaluate how models handle conflicting evidence.
Method: Created OMD-Bench with 4,080 instances spanning 27 anchors across eight corruption conditions. All modalities start congruent (same object/event independently perceivable through video, audio, and text), then systematically corrupted. Evaluated ten omni-modal models under zero-shot and chain-of-thought prompting, measuring modality reliance, robustness to cross-modal inconsistency, and uncertainty calibration.
Result: Models over-abstain when two modalities are corrupted but under-abstain severely when all three are corrupted, while maintaining high confidence (~60-100%) even under full corruption. Chain-of-thought prompting improves abstention alignment with human judgment but amplifies overconfidence rather than mitigating it.
Conclusion: OMD-Bench provides a diagnostic benchmark for evaluating modality reliance, robustness to cross-modal inconsistency, and uncertainty calibration in omni-modal systems, revealing significant issues with model confidence calibration and abstention behavior.
Abstract: Existing omni-modal benchmarks attempt to measure modality-specific contributions, but their measurements are confounded: naturally co-occurring modalities carry correlated yet unequal information, making it unclear whether results reflect true modality reliance or information asymmetry. We introduce OMD-Bench, where all modalities are initially congruent - each presenting the same anchor, an object or event independently perceivable through video, audio, and text - which we then systematically corrupt to isolate each modality’s contribution. We also evaluate calibrated abstention: whether models appropriately refrain from answering when evidence is conflicting. The benchmark comprises 4,080 instances spanning 27 anchors across eight corruption conditions. Evaluating ten omni-modal models under zero-shot and chain-of-thought prompting, we find that models over-abstain when two modalities are corrupted yet under-abstain severely when all three are, while maintaining high confidence (~60-100%) even under full corruption. Chain-of-thought prompting improves abstention alignment with human judgment but amplifies overconfidence rather than mitigating it. OMD-Bench provides a diagnostic benchmark for diagnosing modality reliance, robustness to cross-modal inconsistency, and uncertainty calibration in omni-modal systems.
[870] From Inference Routing to Agent Orchestration: Declarative Policy Compilation with Cross-Layer Verification
Huamin Chen, Xunzhuo Liu, Bowei He, Xue Liu
Main category: cs.LG
TL;DR: Extends Semantic Router DSL from single-request LLM routing to multi-step agent workflows, generating orchestration code, infrastructure artifacts, and protocol gates from a single declarative source with verification guarantees.
Details
Motivation: To address policy drift in multi-step agent workflows by extending a proven routing DSL to cover the full path from inference gateway to agent orchestration to infrastructure deployment, eliminating cross-team coordination issues.
Method: Extends the non-Turing-complete Semantic Router DSL to support multi-step agent workflows. The compiler emits verified decision nodes for orchestration frameworks (LangGraph, OpenClaw), Kubernetes artifacts, YANG/NETCONF payloads, and protocol-boundary gates from the same declarative source file.
Result: Enables threshold changes to propagate from inference gateway to agent gate to infrastructure artifacts in one compilation step, eliminating policy drift from cross-team coordination. Provides guarantees for exhaustive routing, conflict-free branching, referential integrity, and structurally coupled audit traces.
Conclusion: The extended DSL successfully bridges the gap between per-request routing and complex agent workflows while maintaining verification guarantees, addressing auditability, cost efficiency, verifiability, and tunability across the entire inference stack.
Abstract: The Semantic Router DSL is a non-Turing-complete policy language deployed in production for per-request LLM inference routing: content signals (embedding similarity, PII detection, jailbreak scoring) feed into weighted projections and priority-ordered decision trees that select a model, enforce privacy policies, and produce structured audit traces – all from a single declarative source file. Prior work established conflict-free compilation for probabilistic predicates and positioned the DSL within the Workload-Router-Pool inference architecture. This paper extends the same language from stateless, per-request routing to multi-step agent workflows – the full path from inference gateway to agent orchestration to infrastructure deployment. The DSL compiler emits verified decision nodes for orchestration frameworks (LangGraph, OpenClaw), Kubernetes artifacts (NetworkPolicy, Sandbox CRD, ConfigMap), YANG/NETCONF payloads, and protocol-boundary gates (MCP, A2A) – all from the same source. Because the language is non-Turing-complete, the compiler guarantees exhaustive routing, conflict-free branching, referential integrity, and audit traces structurally coupled to the decision logic. Because signal definitions are shared across targets, a threshold change propagates from inference gateway to agent gate to infrastructure artifact in one compilation step – eliminating cross-team coordination as the primary source of policy drift. We ground the approach in four pillars – auditability, cost efficiency, verifiability, and tunability – and identify the verification boundary at each layer.
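The compile-once, decide-everywhere idea can be miniaturized as follows. The policy syntax below is invented for illustration and is not the Semantic Router DSL; it only mimics the shape described above: thresholded signals, priority-ordered rules, a default route, and an audit trace structurally coupled to the decision.

```python
# Hypothetical declarative policy (illustrative field names, not the DSL's syntax).
policy = {
    "signals": {"jailbreak": 0.8, "pii": 0.5},          # named signal thresholds
    "routes": [                                          # priority-ordered rules
        {"when": "jailbreak", "action": "block"},
        {"when": "pii", "action": "route:local-model"},
    ],
    "default": "route:frontier-model",
}

def compile_policy(policy):
    # Because the language is not Turing-complete, a compiler can statically
    # verify exhaustiveness: every input falls through to `default`.
    def decide(scores):
        for rule in policy["routes"]:
            sig = rule["when"]
            if scores[sig] >= policy["signals"][sig]:
                return {"action": rule["action"],
                        "audit": {"matched": sig, "scores": dict(scores)}}
        return {"action": policy["default"],
                "audit": {"matched": None, "scores": dict(scores)}}
    return decide

decide = compile_policy(policy)
print(decide({"jailbreak": 0.9, "pii": 0.1})["action"])   # block
print(decide({"jailbreak": 0.1, "pii": 0.1})["action"])   # route:frontier-model
```

A threshold change in `policy["signals"]` propagates to every emitted target in one compilation step, which is the paper's mechanism for eliminating policy drift.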
[871] Scalable Maximum Entropy Population Synthesis via Persistent Contrastive Divergence
Mirko Degli Esposti
Main category: cs.LG
TL;DR: GibbsPCDSolver enables scalable maximum entropy population synthesis from census data using persistent contrastive divergence, avoiding exponential complexity of exact methods.
Details
Motivation: Existing maximum entropy approaches for synthetic population generation become infeasible beyond ~20 categorical attributes due to exponential complexity of exact expectation computation over the full tuple space.
Method: Proposes GibbsPCDSolver using Persistent Contrastive Divergence (PCD): maintains a persistent pool of synthetic individuals updated by Gibbs sweeps at each gradient step, providing stochastic approximation without materializing the full tuple space.
Result: Scales to 50 attributes with minimal error (MRE 0.010-0.018) while tuple space grows 18 orders of magnitude; achieves 86.8× diversity advantage over generalized raking on Italian demographic benchmark.
Conclusion: GibbsPCDSolver provides practical solution for large-scale synthetic population generation with linear runtime scaling, essential for agent-based urban simulations requiring diverse synthetic populations.
Abstract: Maximum entropy (MaxEnt) modelling provides a principled framework for generating synthetic populations from aggregate census data, without access to individual-level microdata. The bottleneck of existing approaches is exact expectation computation, which requires summing over the full tuple space $\mathcal{X}$ and becomes infeasible for more than $K \approx 20$ categorical attributes. We propose \emph{GibbsPCDSolver}, a stochastic replacement for this computation based on Persistent Contrastive Divergence (PCD): a persistent pool of $N$ synthetic individuals is updated by Gibbs sweeps at each gradient step, providing a stochastic approximation of the model expectations without ever materialising $\mathcal{X}$. We validate the approach on controlled benchmarks and on \emph{Syn-ISTAT}, a $K{=}15$ Italian demographic benchmark with analytically exact marginal targets derived from ISTAT-inspired conditional probability tables. Scaling experiments across $K \in \{12, 20, 30, 40, 50\}$ confirm that GibbsPCDSolver maintains $\mathrm{MRE} \in [0.010, 0.018]$ while $|\mathcal{X}|$ grows eighteen orders of magnitude, with runtime scaling as $O(K)$ rather than $O(|\mathcal{X}|)$. On Syn-ISTAT, GibbsPCDSolver reaches $\mathrm{MRE}{=}0.03$ on training constraints and – crucially – produces populations with effective sample size $N_{\mathrm{eff}} = N$ versus $N_{\mathrm{eff}} \approx 0.012\,N$ for generalised raking, an $86.8{\times}$ diversity advantage that is essential for agent-based urban simulations.
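The PCD loop above can be sketched for a toy marginal-only MaxEnt model. This is a deliberate simplification: the model family, pool size, and learning rate here are illustrative choices, not the paper's GibbsPCDSolver.

```python
import numpy as np

rng = np.random.default_rng(0)

K, C, N = 5, 3, 500                           # attributes, categories, pool size
targets = rng.dirichlet(np.ones(C), size=K)   # target marginals per attribute
theta = np.zeros((K, C))                      # log-potentials (dual variables)
pool = rng.integers(0, C, size=(N, K))        # persistent pool of individuals

def gibbs_sweep(pool, theta):
    # One sweep: resample each attribute from its conditional, which for
    # this marginal-only toy model is just softmax(theta[k]).
    for k in range(K):
        p = np.exp(theta[k] - theta[k].max())
        p /= p.sum()
        pool[:, k] = rng.choice(C, size=len(pool), p=p)
    return pool

def pool_marginals(pool):
    return np.stack([np.bincount(pool[:, k], minlength=C) / len(pool)
                     for k in range(K)])

for step in range(300):                       # stochastic gradient ascent on the dual
    pool = gibbs_sweep(pool, theta)
    # Model expectations come from the persistent pool, never from an
    # enumeration of the full C**K tuple space.
    theta += 0.5 * (targets - pool_marginals(pool))

mre = np.abs(pool_marginals(pool) - targets).mean()
print(f"mean marginal error: {mre:.3f}")
```

The residual error is pure pool-sampling noise of order $\sqrt{p(1-p)/N}$, which is the trade the method makes for never materializing the tuple space.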
[872] Multimodal Forecasting for Commodity Prices Using Spectrogram-Based and Time Series Representations
Soyeon Park, Doohee Chung, Charmgil Hong
Main category: cs.LG
TL;DR: SEMF uses wavelet spectrograms and multimodal fusion with exogenous variables for improved multivariate time series forecasting, particularly in financial applications.
Details
Motivation: Multivariate time series forecasting is challenging due to complex cross-variable dependencies and heterogeneous external influences. Existing methods struggle to effectively integrate spectral information with temporal patterns and external variables.
Method: SEMF transforms target time series into Morlet wavelet spectrograms, extracts features using Vision Transformer, encodes exogenous variables with Transformer, and fuses modalities via bidirectional cross-attention for unified representation.
Result: SEMF achieves consistent improvements over seven competitive baselines across multiple forecasting horizons and evaluation metrics in commodity price forecasting tasks.
Conclusion: Multimodal fusion and spectrogram-based encoding effectively capture multi-scale patterns in complex financial time series, demonstrating the value of combining spectral and temporal representations.
Abstract: Forecasting multivariate time series remains challenging due to complex cross-variable dependencies and the presence of heterogeneous external influences. This paper presents Spectrogram-Enhanced Multimodal Fusion (SEMF), which combines spectral and temporal representations for more accurate and robust forecasting. The target time series is transformed into Morlet wavelet spectrograms, from which a Vision Transformer encoder extracts localized, frequency-aware features. In parallel, exogenous variables, such as financial indicators and macroeconomic signals, are encoded via a Transformer to capture temporal dependencies and multivariate dynamics. A bidirectional cross-attention module integrates these modalities into a unified representation that preserves distinct signal characteristics while modeling cross-modal correlations. Applied to multiple commodity price forecasting tasks, SEMF achieves consistent improvements over seven competitive baselines across multiple forecasting horizons and evaluation metrics. These results demonstrate the effectiveness of multimodal fusion and spectrogram-based encoding in capturing multi-scale patterns within complex financial time series.
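The spectrogram front end can be illustrated with a minimal hand-rolled Morlet transform. This is a sketch only; SEMF's actual pipeline, wavelet parameterization, and ViT encoder are not reproduced here.

```python
import numpy as np

def morlet_spectrogram(x, scales, w0=6.0):
    # Minimal continuous wavelet transform with a Morlet kernel: each row is
    # the magnitude response at one scale (one centre frequency).
    x = np.asarray(x, dtype=float)
    out = np.empty((len(scales), len(x)))
    for i, s in enumerate(scales):
        t = np.arange(-4 * s, 4 * s + 1)
        kernel = np.exp(1j * w0 * t / s) * np.exp(-0.5 * (t / s) ** 2) / np.sqrt(s)
        out[i] = np.abs(np.convolve(x, kernel, mode="same"))
    return out

# Two-tone test signal: 4 Hz for the first two seconds, 16 Hz after.
fs = 128
t = np.arange(0, 4, 1 / fs)
x = np.where(t < 2, np.sin(2 * np.pi * 4 * t), np.sin(2 * np.pi * 16 * t))

# Morlet centre frequency is f = w0 * fs / (2 * pi * s), so pick the scales
# that target 4 Hz and 16 Hz.
scales = 6.0 * fs / (2 * np.pi * np.array([4.0, 16.0]))
S = morlet_spectrogram(x, scales)
# Row 0 lights up in the first half (4 Hz), row 1 in the second (16 Hz):
# exactly the localized, frequency-aware structure the ViT encoder consumes.
```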
[873] Embedding Provenance in Computer Vision Datasets with JSON-LD
Lynn Vonderhaar, Timothy Elvira, Tyler Thomas Procko, Omar Ochoa
Main category: cs.LG
TL;DR: A novel JSON-LD schema for embedding image provenance directly within image files to maintain descriptive information about creation, processing, and compilation parameters.
Details
Motivation: Computer vision applications lack proper provenance tracking, with provenance typically stored separately in text files, leading to loss of critical information about image capture settings, data preprocessing steps, and model architecture details.
Method: Proposes a JSON-LD schema that embeds provenance metadata directly within image files, structuring provenance in a manageable format linked to established standards while maintaining intrinsic connection to images.
Result: Enables provenance to remain intrinsically tied to images, preventing information loss and enhancing system qualities like maintainability and adaptability while aligning with robust standards.
Conclusion: Embedding provenance directly in image files using JSON-LD schema improves data traceability, compliance, audit support, and reusability while maintaining the direct connection between vision resources and their provenance.
Abstract: With the ubiquity of computer vision in industry, the importance of image provenance is becoming more apparent. Provenance provides information about the origin and derivation of some resource, e.g., an image dataset, enabling users to trace data changes to better understand the expected behaviors of downstream models trained on such data. Provenance may also help with data maintenance by ensuring compliance, supporting audits and improving reusability. Typically, if provided, provenance is stored separately, e.g., within a text file, leading to a loss of descriptive information for key details like image capture settings, data preprocessing steps, and model architecture or iteration. Images often lack the information detailing the parameters of their creation or compilation. This paper proposes a novel schema designed to structure image provenance in a manageable and coherent format. The approach utilizes JavaScript Object Notation for Linked Data (JSON-LD), embedding this provenance directly within the image file. This offers two significant benefits: (1) it aligns image descriptions with a robust schema inspired by and linked to established standards, and (2) it ensures that provenance remains intrinsically tied to images, preventing loss of information and enhancing system qualities, e.g., maintainability and adaptability. This approach emphasizes maintaining the direct connection between vision resources and their provenance.
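A minimal illustration of embedding-ready provenance. The field names below are hypothetical, loosely modeled on the PROV-O and schema.org vocabularies the "established standards" remark alludes to; they are not the paper's published schema.

```python
import json

# Hypothetical JSON-LD provenance record for one image (illustrative names).
provenance = {
    "@context": {"prov": "http://www.w3.org/ns/prov#",
                 "schema": "https://schema.org/"},
    "@type": "prov:Entity",
    "schema:name": "sample_0001.png",
    "prov:wasGeneratedBy": {
        "@type": "prov:Activity",
        "schema:description": "capture",
        # Hypothetical capture parameters of the kind the paper argues should
        # travel with the image rather than live in a separate text file.
        "captureSettings": {"iso": 200, "exposure": "1/250", "fNumber": 2.8},
    },
    "prov:wasDerivedFrom": [{"@type": "prov:Entity",
                             "schema:name": "raw_0001.dng"}],
    "preprocessing": ["resize 512x512", "normalize [0,1]"],
}

# Serialized compactly, this blob is what would be written into the image
# container itself (e.g. a PNG iTXt chunk or an XMP field), so the
# provenance stays intrinsically tied to the file.
blob = json.dumps(provenance, separators=(",", ":"))
roundtrip = json.loads(blob)
print(len(blob), "bytes of embedded provenance")
```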
[874] Active In-Context Learning for Tabular Foundation Models
Wilailuck Treerath, Fabrizio Pittorino
Main category: cs.LG
TL;DR: Tabular Active In-Context Learning (Tab-AICL) combines active learning with tabular foundation models like TabPFN to improve cold-start sample efficiency in tabular data classification.
Details
Motivation: Traditional active learning struggles in tabular settings during cold-start because uncertainty estimates are unreliable when models are trained on very few labels. Tabular foundation models offer calibrated probabilistic predictions via in-context learning without weight updates, enabling a new active learning paradigm.
Method: Proposes Tabular Active In-Context Learning (Tab-AICL) with four acquisition rules: uncertainty-based (TabPFN-Margin), diversity-based (TabPFN-Coreset), hybrid uncertainty-diversity (TabPFN-Hybrid), and a scalable two-stage method (TabPFN-Proxy-Hybrid) that uses a lightweight linear proxy for candidate shortlisting before TabPFN-based selection.
Result: Across 20 classification benchmarks, Tab-AICL improves cold-start sample efficiency over retrained gradient-boosting baselines (CatBoost-Margin and XGBoost-Margin), measured by normalized AULC up to 100 labeled samples.
Conclusion: Tabular Active In-Context Learning effectively leverages tabular foundation models’ in-context learning capabilities to overcome cold-start limitations in active learning for tabular data, demonstrating superior sample efficiency compared to traditional approaches.
Abstract: Active learning (AL) reduces labeling cost by querying informative samples, but in tabular settings its cold-start gains are often limited because uncertainty estimates are unreliable when models are trained on very few labels. Tabular foundation models such as TabPFN provide calibrated probabilistic predictions via in-context learning (ICL), i.e., without task-specific weight updates, enabling an AL regime in which the labeled context - rather than parameters - is iteratively optimized. We formalize Tabular Active In-Context Learning (Tab-AICL) and instantiate it with four acquisition rules: uncertainty (TabPFN-Margin), diversity (TabPFN-Coreset), an uncertainty-diversity hybrid (TabPFN-Hybrid), and a scalable two-stage method (TabPFN-Proxy-Hybrid) that shortlists candidates using a lightweight linear proxy before TabPFN-based selection. Across 20 classification benchmarks, Tab-AICL improves cold-start sample efficiency over retrained gradient-boosting baselines (CatBoost-Margin and XGBoost-Margin), measured by normalized AULC up to 100 labeled samples.
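The margin acquisition rule behind TabPFN-Margin can be sketched independently of TabPFN itself: the rule below operates on any classifier's class probabilities; pairing it with TabPFN's calibrated ICL posterior is what the paper contributes.

```python
import numpy as np

def margin_acquire(proba, batch_size):
    # Query the unlabeled rows whose top-1 vs top-2 probability gap is
    # smallest, i.e. where the model is least decided.
    part = np.sort(proba, axis=1)
    margin = part[:, -1] - part[:, -2]
    return np.argsort(margin)[:batch_size]

proba = np.array([[0.90, 0.10],    # confident
                  [0.55, 0.45],    # near the decision boundary
                  [0.70, 0.30]])
picked = margin_acquire(proba, batch_size=1)
print(picked.tolist())   # [1]: the 0.55/0.45 row has the smallest margin
```

In the Tab-AICL loop the queried rows are appended to the labeled context rather than used for weight updates, which is what makes the procedure "in-context" active learning.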
[875] Diagnosing Non-Markovian Observations in Reinforcement Learning via Prediction-Based Violation Scoring
Naveen Mysore
Main category: cs.LG
TL;DR: A method to quantify non-Markovian structure in observation trajectories using prediction-based scoring, validated across RL environments and algorithms.
Details
Motivation: Real-world sensors often violate the Markov assumption through correlated noise, latency, or partial observability, but standard metrics don't diagnose these violations, leaving practitioners without tools to identify such issues.
Method: A two-step prediction-based scoring method: 1) random forest removes nonlinear Markov-compliant dynamics, 2) ridge regression tests if historical observations reduce prediction error on residuals beyond current observation. Score is bounded [0,1] and requires no causal graph construction.
Result: Evaluation across 6 environments, 3 algorithms, and controlled AR(1) noise shows: 7/16 environment-algorithm pairs show significant correlation between noise intensity and violation score; 13/16 pairs show significant reward degradation under training-time noise; identified inversion phenomenon in low-dimensional environments; practical utility demonstrated for identifying partial observability and guiding architecture selection.
Conclusion: The proposed score effectively quantifies non-Markovian structure, helps diagnose observation violations, and guides architecture selection to recover performance lost to non-Markovian observations.
Abstract: Reinforcement learning algorithms assume that observations satisfy the Markov property, yet real-world sensors frequently violate this assumption through correlated noise, latency, or partial observability. Standard performance metrics conflate Markov breakdowns with other sources of suboptimality, leaving practitioners without diagnostic tools for such violations. This paper introduces a prediction-based scoring method that quantifies non-Markovian structure in observation trajectories. A random forest first removes nonlinear Markov-compliant dynamics; ridge regression then tests whether historical observations reduce prediction error on the residuals beyond what the current observation provides. The resulting score is bounded in [0, 1] and requires no causal graph construction. Evaluation spans six environments (CartPole, Pendulum, Acrobot, HalfCheetah, Hopper, Walker2d), three algorithms (PPO, A2C, SAC), controlled AR(1) noise at six intensity levels, and 10 seeds per condition. In post-hoc detection, 7 of 16 environment-algorithm pairs, primarily high-dimensional locomotion tasks, show significant positive monotonicity between noise intensity and the violation score (Spearman rho up to 0.78, confirmed under repeated-measures analysis); under training-time noise, 13 of 16 pairs exhibit statistically significant reward degradation. An inversion phenomenon is documented in low-dimensional environments where the random forest absorbs the noise signal, causing the score to decrease as true violations grow, a failure mode analyzed in detail. A practical utility experiment demonstrates that the proposed score correctly identifies partial observability and guides architecture selection, fully recovering performance lost to non-Markovian observations. Source code to reproduce all results is provided at https://github.com/NAVEENMN/Markovianes.
[876] K-Means Based TinyML Anomaly Detection and Distributed Model Reuse via the Distributed Internet of Learning (DIoL)
Abdulrahman Albaiz, Fathi Amsaad
Main category: cs.LG
TL;DR: A lightweight K-Means anomaly detection system for microcontrollers with distributed model sharing via text-based representations, enabling “Train Once, Share Everywhere” for TinyML deployment.
Details
Motivation: To enable scalable, low-cost TinyML deployment across fleets of resource-constrained embedded devices by avoiding the need to retrain models on every individual device, which is computationally expensive and impractical for large-scale IoT deployments.
Method: Develops a lightweight K-Means anomaly detection model with on-device feature extraction, clustering, and threshold estimation. Introduces Distributed Internet of Learning (DIoL) to export trained models as portable text-based representations that can be shared and reused directly on other microcontrollers without retraining.
Result: The two-device prototype demonstrates consistent anomaly detection behavior, negligible parsing overhead, and identical inference runtimes between standalone and DIoL-based operation, validating the “Train Once, Share Everywhere” approach.
Conclusion: The proposed framework enables scalable, low-cost TinyML deployment across fleets of embedded devices by allowing models trained on one device to be efficiently shared and reused on others without retraining overhead.
Abstract: This paper presents a lightweight K-Means anomaly detection model and a distributed model-sharing workflow designed for resource-constrained microcontrollers (MCUs). Using real power measurements from a mini-fridge appliance, the system performs on-device feature extraction, clustering, and threshold estimation to identify abnormal appliance behavior. To avoid retraining models on every device, we introduce the Distributed Internet of Learning (DIoL), which enables a model trained on one MCU to be exported as a portable, text-based representation and reused directly on other devices. A two-device prototype demonstrates the feasibility of the “Train Once, Share Everywhere” (TOSE) approach using a real-world appliance case study, where Device A trains the model and Device B performs inference without retraining. Experimental results show consistent anomaly detection behavior, negligible parsing overhead, and identical inference runtimes between standalone and DIoL-based operation. The proposed framework enables scalable, low-cost TinyML deployment across fleets of embedded devices.
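The "Train Once, Share Everywhere" workflow can be sketched end to end. The text format and the synthetic two-mode power readings below are illustrative stand-ins for the paper's portable representation and mini-fridge measurements.

```python
import numpy as np

def train(features, k=2, iters=20):
    # Device A: plain Lloyd iterations, then a distance-based anomaly threshold.
    centroids = features[[0, len(features) - 1]].copy()   # deterministic init
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None] - centroids, axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.stack([features[labels == j].mean(axis=0) for j in range(k)])
    d = np.linalg.norm(features[:, None] - centroids, axis=2).min(axis=1)
    return centroids, d.mean() + 3 * d.std()

def export_text(centroids, threshold):
    # Portable text model: one THRESH line, one line per centroid.
    lines = [f"THRESH {threshold:.6f}"]
    lines += ["C " + " ".join(f"{v:.6f}" for v in c) for c in centroids]
    return "\n".join(lines)

def load_and_score(text, x):
    # Device B: parse the shared text and run inference, no retraining.
    rows = text.splitlines()
    threshold = float(rows[0].split()[1])
    centroids = np.array([[float(v) for v in r.split()[1:]] for r in rows[1:]])
    return bool(np.linalg.norm(centroids - x, axis=1).min() > threshold)

# Synthetic "power readings" around two operating modes of an appliance.
rng = np.random.default_rng(1)
normal = np.vstack([rng.normal([1, 1], 0.05, (100, 2)),
                    rng.normal([5, 5], 0.05, (100, 2))])
model_text = export_text(*train(normal))

print(load_and_score(model_text, np.array([1.0, 1.0])))   # False (normal)
print(load_and_score(model_text, np.array([9.0, 0.0])))   # True (anomaly)
```

Because the shared artifact is just centroids plus a threshold, parsing it on a receiving MCU is trivial, which matches the paper's observation of negligible parsing overhead and identical inference runtimes.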
[877] Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling
Kai Ye, Qingtao Pan, Shuo Li
Main category: cs.LG
TL;DR: CFC is a conformal prediction framework for LLMs that provides conditional coverage guarantees using feature-dependent thresholds, improving efficiency over marginal methods.
Details
Motivation: Existing conformal methods for LLMs provide only marginal guarantees with a single global threshold, which leads to under-coverage for hard prompts, over-coverage for easy ones, and oversized prediction sets.
Method: Proposes Conditional Factuality Control (CFC) using augmented quantile regression on a latent success score to define continuous, feature-conditional acceptance thresholds, deployed through a fixed-point threshold rule at inference time. Also develops CFC-PAC variant with PAC-style guarantees.
Result: CFC achieves near-target coverage across difficulty groups while using smaller prediction sets than conformal prediction and non-CP baselines on synthetic data, reasoning/QA benchmarks, and Flickr8k VLM setting.
Conclusion: CFC provides conditional coverage guarantees for LLM outputs with improved sample efficiency over marginal methods, offering practical reliability control for hallucinations.
Abstract: Large language models (LLMs) need reliable test-time control of hallucinations. Existing conformal methods for LLMs typically provide only \emph{marginal} guarantees and rely on a single global threshold, which can under-cover hard prompts, over-cover easy ones, and produce oversized prediction sets. We propose \emph{Conditional Factuality Control} (CFC), a post-hoc conformal framework that returns \emph{set-valued} outputs with \emph{conditional} coverage guarantees. CFC defines a continuous, feature-conditional acceptance threshold through augmented quantile regression on a latent ``success'' score, and deploys it through a fixed-point threshold rule at inference time. Theoretically, we show that CFC satisfies a conditional coverage guarantee under exchangeability and analyze its \emph{efficiency}, proving that, under mild assumptions on the score distributions, the conditional rule is strictly more sample-efficient than marginal conformal prediction at the same target coverage. We further derive a PAC-style variant, CFC-PAC, which shrinks the nominal risk level based on a stability bound, yielding a finite-sample certificate that the conditional miscoverage deviates from the target by at most $O(\sqrt{\log(1/\delta)/N})$. Empirically, on synthetic data, real-world reasoning and QA benchmarks, and a Flickr8k VLM setting, CFC and CFC-PAC consistently attain near-target coverage across difficulty groups while using smaller prediction sets than CP and non-CP baselines.
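The motivation for conditional thresholds is easy to see on synthetic nonconformity scores. This is a deliberately simplified stand-in: CFC fits a continuous feature-conditional threshold via augmented quantile regression, whereas the sketch below conditions on a binary difficulty flag.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nonconformity scores from an "easy" and a "hard" prompt population.
n = 4000
hard = rng.random(n) < 0.5
score = np.where(hard, rng.normal(2.0, 1.0, n), rng.normal(0.0, 0.3, n))
alpha = 0.1

# Marginal conformal: one global threshold for everyone.
t_global = np.quantile(score, 1 - alpha)
marg_easy = (score[~hard] <= t_global).mean()
marg_hard = (score[hard] <= t_global).mean()

# Conditional: a threshold per difficulty group.
cond_easy = (score[~hard] <= np.quantile(score[~hard], 1 - alpha)).mean()
cond_hard = (score[hard] <= np.quantile(score[hard], 1 - alpha)).mean()

print(f"marginal    easy={marg_easy:.2f}  hard={marg_hard:.2f}")  # over/under-covers
print(f"conditional easy={cond_easy:.2f}  hard={cond_hard:.2f}")  # ~0.90 each
```

The global threshold wastes coverage on easy prompts (oversized sets) while shorting hard ones; the conditional rule hits the target in both groups, which is the behavior CFC extends to continuous features.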
[878] The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams
Isaac Llorente-Saguer
Main category: cs.LG
TL;DR: LatentBiopsy: Training-free method for detecting harmful prompts by analyzing activation geometry in LLMs using radial deviation angles from normative prompts.
Details
Motivation: Need for effective detection of harmful prompts in LLMs without requiring harmful examples for training, leveraging internal model representations.
Method: Compute principal component of safe prompt activations at target layer, measure new prompts’ radial deviation angle θ, use negative log-likelihood of θ under Gaussian fit as anomaly score.
Result: Achieves AUROC ≥0.937 for harmful-vs-normative detection and AUROC=1.000 for harmful vs benign-aggressive discrimination across Qwen model variants with sub-millisecond overhead.
Conclusion: Harmful intent representation is geometrically distinct from refusal mechanisms, harmful prompts have tight angular distribution, and model families show opposite ring orientations.
Abstract: We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle $\theta$ from this reference direction. The anomaly score is the negative log-likelihood of $\theta$ under a Gaussian fit to the normative distribution, flagging deviations symmetrically regardless of orientation. No harmful examples are required for training. We evaluate two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families: base, instruction-tuned, and \emph{abliterated} (refusal direction surgically removed via orthogonalisation). Across all six variants, LatentBiopsy achieves AUROC $\geq 0.937$ for harmful-vs-normative detection and AUROC = 1.000 for discriminating harmful from benign-aggressive prompts (XSTest), with sub-millisecond per-query overhead. Three empirical findings emerge. First, geometry survives refusal ablation: both abliterated variants achieve AUROC at most 0.015 below their instruction-tuned counterparts, establishing a geometric dissociation between harmful-intent representation and the downstream generative refusal mechanism. Second, harmful prompts exhibit a near-degenerate angular distribution ($\sigma_\theta \approx 0.03$ rad), an order of magnitude tighter than the normative distribution ($\sigma_\theta \approx 0.27$ rad), preserved across all alignment stages including abliteration. Third, the two families exhibit opposite ring orientations at the same depth: harmful prompts occupy the outer ring in Qwen3.5-0.8B but the inner ring in Qwen2.5-0.5B, directly motivating the direction-agnostic scoring rule.
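The geometric score is easy to reproduce on synthetic activations. The data below is purely illustrative (not residual streams from an actual model), with dimensions and offsets chosen to mimic the paper's finding of a tight off-axis "harmful" ring.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_reference(acts):
    # Leading principal component of the normative (safe-prompt) activations.
    mu = acts.mean(axis=0)
    _, _, vt = np.linalg.svd(acts - mu, full_matrices=False)
    return mu, vt[0]

def angle(x, mu, pc):
    v = x - mu
    cos = np.clip(np.abs(v @ pc) / np.linalg.norm(v), 0.0, 1.0)
    return np.arccos(cos)

def nll_score(theta, m, s):
    # Symmetric anomaly score: negative log-likelihood of the deviation
    # angle under a Gaussian fitted to the normative angles (up to a constant).
    return 0.5 * ((theta - m) / s) ** 2 + np.log(s)

# Synthetic activations: normative prompts spread along a reference axis,
# "harmful" ones concentrated at a tight off-axis angle.
d = 64
axis0 = np.zeros(d); axis0[0] = 1.0
axis1 = np.zeros(d); axis1[1] = 1.0
normative = rng.normal(0, 1.0, (200, d)) + 8 * rng.normal(size=(200, 1)) * axis0
mu, pc = fit_reference(normative)
norm_thetas = np.array([angle(x, mu, pc) for x in normative])
m, s = norm_thetas.mean(), norm_thetas.std()

harmful = mu + 5 * axis1 + rng.normal(0, 0.05, (50, d))
harm_thetas = np.array([angle(x, mu, pc) for x in harmful])
print(np.mean(nll_score(harm_thetas, m, s)) > np.mean(nll_score(norm_thetas, m, s)))
print(harm_thetas.std() < norm_thetas.std())   # the tight "harmful" ring
```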
[879] Kempe Swap K-Means: A Scalable Near-Optimal Solution for Semi-Supervised Clustering
Yuxuan Ren, Shijie Deng
Main category: cs.LG
TL;DR: Kempe Swap K-Means: A centroid-based heuristic algorithm for constrained clustering with must-link and cannot-link constraints using Kempe chain swaps and controlled perturbations.
Details
Motivation: Addresses the need for constrained clustering algorithms that handle rigid must-link and cannot-link constraints while remaining computationally efficient and scalable on large datasets.
Method: The algorithm uses a dual-phase iterative process: (1) an assignment step employing Kempe chain swaps to refine clustering in constrained solution space, and (2) a centroid update step computing optimal cluster centroids with controlled perturbations to avoid local optima.
Result: Empirical evaluations show the method achieves near-optimal partitions with high computational efficiency and scalability, consistently outperforming state-of-the-art benchmarks in both clustering accuracy and algorithmic efficiency for large-scale datasets.
Conclusion: Kempe Swap K-Means provides an effective solution for constrained clustering problems, offering improved performance over existing methods while maintaining practical computational requirements.
Abstract: This paper presents a novel centroid-based heuristic algorithm, termed Kempe Swap K-Means, for constrained clustering under rigid must-link (ML) and cannot-link (CL) constraints. The algorithm employs a dual-phase iterative process: an assignment step that utilizes Kempe chain swaps to refine current clustering in the constrained solution space and a centroid update step that computes optimal cluster centroids. To enhance global search capabilities and avoid local optima, the framework incorporates controlled perturbations during the update phase. Empirical evaluations demonstrate that the proposed method achieves near-optimal partitions while maintaining high computational efficiency and scalability. The results indicate that Kempe Swap K-Means consistently outperforms state-of-the-art benchmarks in both clustering accuracy and algorithmic efficiency for large-scale datasets.
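The Kempe chain swap at the heart of the assignment step mirrors the classic graph-coloring move: within the cannot-link graph restricted to two cluster labels, exchanging the labels of a connected component cannot violate any cannot-link constraint. A minimal sketch (illustrative, not the authors' implementation):

```python
import numpy as np
from collections import deque

def kempe_chain(labels, cl_edges, start, a, b):
    # Connected component containing `start` in the cannot-link graph
    # restricted to points currently labeled a or b.
    adj = {}
    for u, v in cl_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    chain, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in chain and labels[v] in (a, b):
                chain.add(v)
                queue.append(v)
    return chain

def kempe_swap(labels, cl_edges, start, a, b):
    # Exchanging a <-> b inside a chain preserves cannot-link feasibility.
    new = labels.copy()
    for i in kempe_chain(labels, cl_edges, start, a, b):
        new[i] = b if labels[i] == a else a
    return new

def feasible(labels, cl_edges):
    return all(labels[u] != labels[v] for u, v in cl_edges)

labels = np.array([0, 1, 0, 1, 2])
cl = [(0, 1), (1, 2), (2, 3)]          # a chain of cannot-links
swapped = kempe_swap(labels, cl, start=0, a=0, b=1)
print(swapped.tolist())                # [1, 0, 1, 0, 2]
```

The full algorithm would evaluate such swaps by their effect on the k-means objective and accept improving ones, staying inside the constrained solution space throughout.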
[880] The Geometric Cost of Normalization: Affine Bounds on the Bayesian Complexity of Neural Networks
Sungbae Chun
Main category: cs.LG
TL;DR: LayerNorm reduces model complexity by m/2 via mean-centering to a hyperplane, while RMSNorm preserves complexity by projecting to a sphere; curvature threshold determines LLC drop.
Details
Motivation: To understand the fundamental geometric differences between LayerNorm and RMSNorm normalization techniques and how they affect model complexity through Local Learning Coefficient (LLC) reduction.
Method: Theoretical analysis proving LayerNorm’s mean-centering reduces LLC by exactly m/2 by confining data to a linear hyperplane, while RMSNorm’s spherical projection preserves LLC. Uses geometric threshold theory and wrLLC framework for experimental verification.
Result: LayerNorm guarantees an m/2 LLC reduction before training due to affine flatness, while RMSNorm preserves the LLC due to curvature. Softmax data with an explicit bias also activates the same LLC drop via a “smuggled bias”.
Conclusion: Normalization geometry fundamentally impacts model complexity; LayerNorm’s mean-centering reduces complexity while RMSNorm preserves it, with curvature threshold determining LLC behavior.
Abstract: LayerNorm and RMSNorm impose fundamentally different geometric constraints on their outputs - and this difference has a precise, quantifiable consequence for model complexity. We prove that LayerNorm’s mean-centering step, by confining data to a linear hyperplane (through the origin), reduces the Local Learning Coefficient (LLC) of the subsequent weight matrix by exactly $m/2$ (where $m$ is its output dimension); RMSNorm’s projection onto a sphere preserves the LLC entirely. This reduction is structurally guaranteed before any training begins, determined by data manifold geometry alone. The underlying condition is a geometric threshold: for the codimension-one manifolds we study, the LLC drop is binary – any non-zero curvature, regardless of sign or magnitude, is sufficient to preserve the LLC, while only affinely flat manifolds cause the drop. At finite sample sizes this threshold acquires a smooth crossover whose width depends on how much of the data distribution actually experiences the curvature, not merely on whether curvature exists somewhere. We verify both predictions experimentally with controlled single-layer scaling experiments using the wrLLC framework. We further show that Softmax simplex data introduces a “smuggled bias” that activates the same $m/2$ LLC drop when paired with an explicit downstream bias, proved via the affine symmetry extension of the main theorem and confirmed empirically.
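The geometric difference is directly observable: mean-centering pins LayerNorm outputs to the hyperplane orthogonal to the all-ones direction, while RMSNorm outputs span the full space. A minimal check (normalizations without learnable gain/bias, for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)

def layernorm(x, eps=1e-6):
    x = x - x.mean(axis=-1, keepdims=True)            # the mean-centering step
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

m = 16
X = rng.normal(size=(1000, m))
ln, rms = layernorm(X), rmsnorm(X)

# LayerNorm outputs are exactly orthogonal to the all-ones direction, so the
# data sits on an (m-1)-dimensional hyperplane through the origin; this
# affine flatness is what the paper ties to the m/2 LLC drop. RMSNorm only
# rescales, so its outputs (on a sphere) still span all m directions.
ones = np.ones(m) / np.sqrt(m)
print(np.abs(ln @ ones).max())                              # ~0 (float noise)
print(np.linalg.matrix_rank(ln), np.linalg.matrix_rank(rms))  # 15 16
```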
[881] Interpretable Physics Extraction from Data for Linear Dynamical Systems using Lie Generator Networks
Shafayeth Jamil, Rehan Kapadia
Main category: cs.LG
TL;DR: Lie Generator Networks (LGN) learn structured linear dynamical systems via matrix exponentiation, preserving physical invariants like stability and dissipation by construction, unlike neural ODEs or black-box approaches.
Details
Motivation: Neural approaches to learning dynamical systems offer flexibility but often violate physical guarantees. Neural ODEs may break physical invariants, while energy-preserving architectures don't represent dissipation. There's a need for methods that combine neural flexibility with physical structure preservation for linear systems.
Method: LGN learns a structured generator matrix A and computes trajectories via matrix exponentiation (instead of integration). By parameterizing A = S - D (skew-symmetric minus positive diagonal), stability and dissipation emerge from architecture design. This preserves physical structure by construction.
Result: On a 100-dimensional stable RLC ladder system, LGN-SD recovers all 100 eigenvalues with over two orders of magnitude lower mean eigenvalue error than unconstrained alternatives. Standard derivative-based least-squares identification can yield unstable eigenvalues, while unconstrained LGN yields stable but physically incorrect spectra.
Conclusion: LGN provides a unified framework for learning linear conservative, dissipative, and time-varying systems while preserving physical structure by construction. The approach yields interpretable physics (poles, natural frequencies, damping ratios) that black-box networks cannot provide.
Abstract: When the system is linear, why should learning be nonlinear? Linear dynamical systems, the analytical backbone of control theory, signal processing and circuit analysis, have exact closed-form solutions via the state transition matrix. Yet when system parameters must be inferred from data, recent neural approaches offer flexibility at the cost of physical guarantees: Neural ODEs provide flexible trajectory approximation but may violate physical invariants, while energy preserving architectures do not natively represent dissipation essential to real-world systems. We introduce Lie Generator Networks (LGN), which learn a structured generator A and compute trajectories directly via matrix exponentiation. This shift from integration to exponentiation preserves structure by construction. By parameterizing A = S - D (skew-symmetric minus positive diagonal), stability and dissipation emerge from the underlying architecture and are not introduced during training via the loss function. LGN provides a unified framework for linear conservative, dissipative, and time-varying systems. On a 100-dimensional stable RLC ladder, standard derivative-based least-squares system identification can yield unstable eigenvalues. The unconstrained LGN yields stable but physically incorrect spectra, whereas LGN-SD recovers all 100 eigenvalues with over two orders of magnitude lower mean eigenvalue error than unconstrained alternatives. Critically, these eigenvalues reveal poles, natural frequencies, and damping ratios which are interpretable physics that black-box networks do not provide.
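The A = S - D parameterization and its built-in stability can be checked in a few lines. This sketches the parameterization only, not the trained network.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

n = 6
W = rng.normal(size=(n, n))
S = W - W.T                                # skew-symmetric part: oscillation
D = np.diag(rng.uniform(0.1, 1.0, n))      # positive diagonal: dissipation
A = S - D                                  # the LGN generator

# The symmetric part of A is -D, which is negative definite, so every
# eigenvalue of A has strictly negative real part: stability holds by
# construction, before any training.
eig = np.linalg.eigvals(A)
print(eig.real.max())                      # < 0

# Trajectories come from matrix exponentiation, not numerical integration:
# x(t) = expm(A t) x(0), exact for linear systems.
x0 = rng.normal(size=n)
x5 = expm(A * 5.0) @ x0
print(np.linalg.norm(x5) < np.linalg.norm(x0))   # energy decays: True
```

Because the eigenvalues of the learned A are the system's poles, natural frequencies and damping ratios can be read off directly, which is the interpretability claim above.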
[882] GIFT: Bootstrapping Image-to-CAD Program Synthesis via Geometric Feedback
Giorgio Giannone, Anna Clare Doris, Amin Heyrani Nobari, Kai Xu, Akash Srivastava, Faez Ahmed
Main category: cs.LG
TL;DR: GIFT is a data augmentation framework that uses geometric feedback to generate high-quality training samples from test-time computations, improving CAD program generation from images without additional human annotation.
Details
Motivation: Current methods for generating CAD programs from images fail to learn reliable alignment between visual geometry and symbolic programs as design complexity increases. The main bottleneck is scarcity of diverse training examples aligning visual geometry with program syntax, which is expensive and difficult to scale for engineering datasets.
Method: Geometric Inference Feedback Tuning (GIFT) combines two mechanisms: Soft-Rejection Sampling (GIFT-REJECT) retains diverse high-fidelity programs beyond exact ground-truth matches, and Failure-Driven Augmentation (GIFT-FAIL) converts near-miss predictions into synthetic training examples to improve robustness on challenging geometries.
Result: GIFT reduces inference compute by 80% while improving mean IoU by 12% over a strong supervised baseline. It remains competitive with more complex multimodal systems without requiring additional human annotation or specialized architectures.
Conclusion: GIFT effectively amortizes inference-time search into model parameters, capturing benefits of test-time scaling while reducing computational costs, enabling more robust generative CAD models without expensive data collection.
Abstract: Generating executable CAD programs from images requires alignment between visual geometry and symbolic program representations, a capability that current methods fail to learn reliably as design complexity increases. Existing fine-tuning approaches rely on either limited supervised datasets or expensive post-training pipelines, resulting in brittle systems that restrict progress in generative CAD design. We argue that the primary bottleneck lies not in model or algorithmic capacity, but in the scarcity of diverse training examples that align visual geometry with program syntax. This limitation is especially acute because the collection of diverse and verified engineering datasets is both expensive and difficult to scale, constraining the development of robust generative CAD models. We introduce Geometric Inference Feedback Tuning (GIFT), a data augmentation framework that leverages geometric feedback to turn test-time compute into a bootstrapped set of high-quality training samples. GIFT combines two mechanisms: Soft-Rejection Sampling (GIFT-REJECT), which retains diverse high-fidelity programs beyond exact ground-truth matches, and Failure-Driven Augmentation (GIFT-FAIL), which converts near-miss predictions into synthetic training examples that improve robustness on challenging geometries. By amortizing inference-time search into the model parameters, GIFT captures the benefits of test-time scaling while reducing inference compute by 80%. It improves mean IoU by 12% over a strong supervised baseline and remains competitive with more complex multimodal systems, without requiring additional human annotation or specialized architectures.
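A minimal sketch of the soft-rejection idea, with a hypothetical `iou_fn` standing in for the geometric feedback (rendering a candidate program and scoring it against the target shape):

```python
# Soft-rejection sampling with geometric feedback (names hypothetical):
# candidates that are not exact ground-truth matches are still retained
# for fine-tuning if their rendered geometry is close enough to the target.

def soft_reject(candidates, iou_fn, threshold=0.9):
    """Return (program, iou) pairs whose geometric IoU clears the threshold."""
    kept = []
    for program in candidates:
        iou = iou_fn(program)          # geometric feedback, e.g. voxel IoU
        if iou >= threshold:
            kept.append((program, iou))
    # Highest-fidelity programs first: these become new training samples.
    return sorted(kept, key=lambda p: p[1], reverse=True)

# Toy usage with a stand-in scoring table.
scores = {"prog_a": 0.97, "prog_b": 0.62, "prog_c": 0.93}
kept = soft_reject(scores, scores.get, threshold=0.9)
# kept == [("prog_a", 0.97), ("prog_c", 0.93)]
```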
[883] FlowRL: A Taxonomy and Modular Framework for Reinforcement Learning with Diffusion Policies
Chenxiao Gao, Edward Chen, Tianyi Chen, Bo Dai
Main category: cs.LG
TL;DR: A unified framework for RL with diffusion/flow policies, providing taxonomy, JAX-based toolkit, and benchmarks across robotics environments.
Details
Motivation: Diffusion and flow models show promise as flexible policy representations in RL, but efficient RL training remains challenging due to lack of explicit log-probabilities for policy gradient estimators. The field lacks unified perspective and standardized comparisons.
Method: Introduces comprehensive taxonomy for RL algorithms with diffusion/flow policies, develops modular JAX-based codebase with JIT compilation for high-throughput training, and creates systematic benchmarks across Gym-Locomotion, DeepMind Control Suite, and IsaacLab.
Result: Provides rigorous side-by-side comparison of diffusion-based methods, establishes foundation for understanding and algorithm design, and offers algorithmic guidelines for practitioners in generative models and robotics.
Conclusion: The work creates a unified framework, high-efficiency toolkit, and practical guidelines for RL with diffusion/flow policies, advancing research in generative models and robotics applications.
Abstract: Thanks to their remarkable flexibility, diffusion models and flow models have emerged as promising candidates for policy representation. However, efficient reinforcement learning (RL) upon these policies remains a challenge due to the lack of explicit log-probabilities for vanilla policy gradient estimators. While numerous attempts have been proposed to address this, the field lacks a unified perspective to reconcile these seemingly disparate methods, thus hampering ongoing development. In this paper, we bridge this gap by introducing a comprehensive taxonomy for RL algorithms with diffusion/flow policies. To support reproducibility and agile prototyping, we introduce a modular, JAX-based open-source codebase that leverages JIT-compilation for high-throughput training. Finally, we provide systematic and standardized benchmarks across Gym-Locomotion, DeepMind Control Suite, and IsaacLab, offering a rigorous side-by-side comparison of diffusion-based methods and guidance for practitioners to choose proper algorithms based on the application. Our work establishes a clear foundation for understanding and algorithm design, a high-efficiency toolkit for future research in the field, and an algorithmic guideline for practitioners in generative models and robotics. Our code is available at https://github.com/typoverflow/flow-rl.
[884] TurboAngle: Near-Lossless KV Cache Compression via Uniform Angle Quantization
Dipkumar Patel
Main category: cs.LG
TL;DR: KV cache compression via angular quantization in Fast Walsh-Hadamard domain with per-layer early-boost allocation of precision between keys and values.
Details
Motivation: Reduce memory footprint of KV cache in large language models during inference by compressing key-value pairs while maintaining model quality.
Method: Quantize angles in Fast Walsh-Hadamard domain with random diagonal rotation, extend with per-layer early-boost to allocate different codebook sizes for K and V at each layer, and use asymmetric norm quantization (8-bit keys, 4-bit log-space values).
Result: Achieves lossless compression on 4/7 models (1B-7B parameters) and near-lossless on 6/7 at 3.28-3.67 angle bits per element; Mistral-7B achieves 6.56 total bits per element with +0.0014 perplexity degradation without calibration data.
Conclusion: Per-layer early-boost effectively compresses KV cache with minimal quality loss, revealing model-specific bottleneck patterns including K-dominated vs V-dominated layers and negative-transfer layers.
Abstract: We compress KV cache entries by quantizing angles in the Fast Walsh-Hadamard domain, where a random diagonal rotation makes consecutive element pairs approximately uniformly distributed on the unit circle. We extend this angular quantizer with per-layer early-boost, which independently configures K and V codebook sizes at each layer, allocating higher precision to a model-specific subset of critical layers. Across seven models (1B to 7B parameters), per-layer early-boost achieves lossless compression on four models and near-lossless quality on six of seven, at 3.28 to 3.67 angle bits per element. Asymmetric norm quantization (8-bit for keys, 4-bit log-space for values) yields 6.56 total bits per element on Mistral-7B with perplexity degradation of +0.0014 and no calibration data. A layer-group sensitivity analysis reveals model-specific bottleneck patterns, including K-dominated versus V-dominated layers and negative-transfer layers where increased precision degrades quality.
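A toy NumPy sketch of the core transform — random sign rotation, normalized Hadamard transform, then uniform quantization of each consecutive pair's angle. Per-pair norms are kept exact here for clarity; the paper additionally quantizes norms (8-bit keys, 4-bit log-space values):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n, bits = 8, 4                          # toy dimension and angle bits per pair

# Random diagonal (sign) rotation followed by a normalized Walsh-Hadamard transform.
signs = rng.choice([-1.0, 1.0], size=n)
H = hadamard(n) / np.sqrt(n)            # orthogonal
x = rng.standard_normal(n)
y = H @ (signs * x)

# Pair consecutive elements and quantize each pair's angle uniformly.
pairs = y.reshape(-1, 2)
theta = np.arctan2(pairs[:, 1], pairs[:, 0])
step = 2 * np.pi / 2**bits
q = np.round(theta / step)              # integer codes, `bits` bits each
theta_hat = q * step

# Reconstruct from quantized angles; norms kept exact in this sketch.
norms = np.linalg.norm(pairs, axis=1)
pairs_hat = np.stack([norms * np.cos(theta_hat), norms * np.sin(theta_hat)], axis=1)
y_hat = pairs_hat.reshape(-1)
x_hat = signs * (H.T @ y_hat)           # H orthogonal, so H^-1 = H^T

# Angle error is bounded by half a quantization step (modulo 2*pi wrap-around).
max_angle_err = np.abs(np.angle(np.exp(1j * (theta - theta_hat)))).max()
assert max_angle_err <= step / 2 + 1e-9
```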
[885] KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study
Suraj Ranganath, Vaishak Menon, Anish Patnaik
Main category: cs.LG
TL;DR: Comprehensive empirical study of KV-cache compression techniques for self-forcing video generation, evaluating 33 variants across memory, runtime, and quality metrics to identify practical deployment solutions.
Details
Motivation: Self-forcing video generation requires feeding generated content back as context, causing KV-cache to grow with rollout length, creating a systems bottleneck where longer videos need both better generation quality and improved memory behavior.
Method: Empirical study covering 33 quantization and cache-policy variants, 610 prompt-level observations, and 63 benchmark-level summaries across two evaluation settings: MovieGen for single-shot 10-second generation and StoryEval for longer narrative-style stability. Joint evaluation of peak VRAM, runtime, compression ratio, VBench quality, BF16-referenced fidelity metrics, and terminal drift.
Result: Three key findings: 1) FlowCache-inspired soft-prune INT4 adaptation achieves 5.42-5.49x compression while reducing peak VRAM from 19.28 GB to ~11.7 GB with modest runtime overhead; 2) Highest-fidelity compressed methods (PRQ_INT4, QUAROT_KV_INT4) are not best for deployment due to severe runtime/memory costs; 3) Nominal compression alone insufficient as some methods still exceed BF16 peak VRAM due to buffer reconstruction during attention stages.
Conclusion: Provides benchmark harness, analysis workflow, and empirical map identifying practical KV-cache compression techniques for current deployment and promising research directions for better memory integration in self-forcing video generation systems.
Abstract: Self-forcing video generation extends a short-horizon video model to longer rollouts by repeatedly feeding generated content back in as context. This scaling path immediately exposes a systems bottleneck: the key-value (KV) cache grows with rollout length, so longer videos require not only better generation quality but also substantially better memory behavior. We present a comprehensive empirical study of KV-cache compression for self-forcing video generation on a Wan2.1-based Self-Forcing stack. Our study covers 33 quantization and cache-policy variants, 610 prompt-level observations, and 63 benchmark-level summaries across two evaluation settings: MovieGen for single-shot 10-second generation and StoryEval for longer narrative-style stability. We jointly evaluate peak VRAM, runtime, realized compression ratio, VBench imaging quality, BF16-referenced fidelity (SSIM, LPIPS, PSNR), and terminal drift. Three findings are robust. First, the strongest practical operating region is a FlowCache-inspired soft-prune INT4 adaptation, which reaches 5.42-5.49x compression while reducing peak VRAM from 19.28 GB to about 11.7 GB with only modest runtime overhead. Second, the highest-fidelity compressed methods, especially PRQ_INT4 and QUAROT_KV_INT4, are not the best deployment choices because they preserve quality at severe runtime or memory cost. Third, nominal compression alone is not sufficient: several methods shrink KV storage but still exceed BF16 peak VRAM because the current integration reconstructs or retains large BF16 buffers during attention and refresh stages. The result is a benchmark harness, analysis workflow, and empirical map of which KV-cache ideas are practical today and which are promising research directions for better memory integration. Code, data products, and the presentation dashboard are available at https://github.com/suraj-ranganath/kv-quant-longhorizon/.
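As a point of reference for what INT4 cache storage means numerically, here is a generic symmetric per-row INT4 quantizer — an illustrative scheme, not the FlowCache-inspired soft-prune adaptation or any of the 33 studied variants:

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.standard_normal((64, 128)).astype(np.float32)   # toy KV slab

# Generic symmetric per-row INT4 quantization: values map to integers
# in [-7, 7] with one 16-bit scale per row.
scale = np.abs(K).max(axis=1, keepdims=True) / 7.0
q = np.clip(np.round(K / scale), -7, 7).astype(np.int8)
K_hat = q * scale

# Nominal compression vs a BF16 cache: 16 bits -> 4 bits + per-row scales.
bits_bf16 = K.size * 16
bits_int4 = K.size * 4 + K.shape[0] * 16
ratio = bits_bf16 / bits_int4

err = np.abs(K - K_hat).max()           # bounded by half a quantization step
```

The nominal ratio lands just under 4x because of the per-row scales; the study's third finding is precisely that such nominal numbers need not translate into peak-VRAM savings if the attention kernel reconstructs large BF16 buffers from the compressed cache.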
[886] On Token’s Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
Chongyang Zhao, Mingsong Li, Haodong Lu, Dong Gong
Main category: cs.LG
TL;DR: LLaVA-DyMoE addresses routing-drift forgetting in MoE-based multimodal continual learning by using drift-aware token assignment and targeted regularization to preserve old-task knowledge while learning new tasks.
Details
Motivation: MoE architectures for multimodal continual learning suffer from routing-drift where old-task tokens get mistakenly attracted to new experts, causing forgetting despite expert isolation. The paper identifies token-level dilemmas where ambiguous and old tokens in new-task data cause forgetting when routed to new experts.
Method: Proposes LLaVA-DyMoE with dynamic MoE expansion and drift-aware token assignment. Characterizes token types via routing score distributions and applies targeted regularization: token-level assignment guidance steers ambiguous/old tokens away from new experts, plus routing score regularizations for expert-group separation and new-expert specialization.
Result: Extensive experiments show LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over 7% gain in mean final accuracy and 12% reduction in forgetting compared to baselines.
Conclusion: The proposed dynamic MoE framework with drift-aware token assignment successfully addresses routing-drift in multimodal continual learning, preserving established routing patterns while enabling effective expansion for new tasks.
Abstract: Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token’s dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.
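To make the failure mode concrete: with a softmax router over old and newly added experts, an ambiguous token can place nearly half its routing mass on the new experts. A drift-aware penalty (sketched abstractly here; the paper's regularizers are more specific) would suppress that mass for tokens flagged as ambiguous or old:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy router scores over 4 old experts + 2 newly added ones (illustrative).
n_old, n_new = 4, 2
scores = np.array([
    [2.0, 0.1, 0.1, 0.1, 1.9, 0.2],   # ambiguous token: old vs new nearly tied
    [0.1, 0.1, 0.1, 0.1, 3.0, 0.2],   # new-task token: clearly wants a new expert
])
p = softmax(scores)
new_mass = p[:, n_old:].sum(axis=1)   # routing mass landing on new experts

# A drift-aware assignment penalty would push this mass down only for
# tokens flagged as ambiguous/old, preserving established routing patterns.
is_ambiguous = np.array([1.0, 0.0])
penalty = (is_ambiguous * new_mass).sum()
```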
[887] Variational Learning of Fractional Posteriors
Kian Ming A. Chai, Edwin V. Bonilla
Main category: cs.LG
TL;DR: Novel variational objective for fractional posteriors improves calibration and enables joint learning of approximate Bayes and fractional posteriors in VAEs, leading to better-aligned decoders for prior-based generation.
Details
Motivation: To develop a more flexible variational framework that can estimate fractional posteriors, improve calibration compared to conventional variational bounds, and enhance generative modeling capabilities in variational autoencoders.
Method: Introduces a one-parameter variational objective that lower bounds data evidence and estimates approximate fractional posteriors. Extends to hierarchical construction and Bayes posteriors. Demonstrates analytical gradient cases and applies to mixture models and variational autoencoders.
Result: Fractional posteriors achieve better calibration than conventional variational bounds in mixture models. In VAEs, the approach attains higher evidence bounds and learns high-performing approximate Bayes posteriors jointly with fractional posteriors, producing decoders better aligned for generation from the prior.
Conclusion: The proposed fractional variational framework provides a versatile tool for probabilistic modeling that improves calibration and enhances generative capabilities in VAEs, particularly for prior-based generation.
Abstract: We introduce a novel one-parameter variational objective that lower bounds the data evidence and enables the estimation of approximate fractional posteriors. We extend this framework to hierarchical construction and Bayes posteriors, offering a versatile tool for probabilistic modelling. We demonstrate two cases where gradients can be obtained analytically and a simulation study on mixture models showing that our fractional posteriors can be used to achieve better calibration compared to posteriors from the conventional variational bound. When applied to variational autoencoders (VAEs), our approach attains higher evidence bounds and enables learning of high-performing approximate Bayes posteriors jointly with fractional posteriors. We show that VAEs trained with fractional posteriors produce decoders that are better aligned for generation from the prior.
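For orientation, the object being approximated is the fractional (tempered) posterior; the standard tempered evidence bound below is the textbook starting point, not the paper's new one-parameter objective:

```latex
% Fractional posterior with temperature \beta \in (0, 1]:
\pi_\beta(\theta \mid x) \;\propto\; p(x \mid \theta)^{\beta}\, \pi(\theta),
% with the tempered variational bound (recovering the ELBO at \beta = 1):
\log \int p(x \mid \theta)^{\beta}\, \pi(\theta)\, d\theta
\;\ge\; \beta\, \mathbb{E}_{q(\theta)}\!\left[\log p(x \mid \theta)\right]
\;-\; \mathrm{KL}\!\left(q(\theta)\,\|\,\pi(\theta)\right).
```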
[888] Decomposing Discrimination: Causal Mediation Analysis for AI-Driven Credit Decisions
Duraimurugan Rajamanickam
Main category: cs.LG
TL;DR: Causal framework distinguishing direct discrimination vs. structural inequality in AI credit decisions using natural direct/indirect effects with identification under treatment-induced confounding.
Details
Motivation: Statistical fairness metrics conflate direct discrimination and structural inequality in credit decisions. Need causal framework to separate these mechanisms for proper fairness assessment.
Method: Uses Pearl’s natural direct/indirect effects framework. Proposes identification strategy under treatment-induced confounding via interventional direct/indirect effects with Modified Sequential Ignorability. Develops doubly-robust AIPW estimator with cross-fitting and E-value sensitivity analysis.
Result: On 89,465 mortgage applications: ~77% of racial denial disparity operates through financial mediators (structural inequality), ~23% is conservative lower bound on direct discrimination. Provides CausalFair Python package.
Conclusion: Causal framework successfully separates discrimination from structural inequality in credit decisions, enabling more accurate fairness assessment with practical implementation tools.
Abstract: Statistical fairness metrics in AI-driven credit decisions conflate two causally distinct mechanisms: discrimination operating directly from a protected attribute to a credit outcome, and structural inequality propagating through legitimate financial features. We formalise this distinction using Pearl’s framework of natural direct and indirect effects applied to the credit decision setting. Our primary theoretical contribution is an identification strategy for natural direct and indirect effects under treatment-induced confounding – the prevalent setting in which protected attributes causally affect both financial mediators and the final decision, violating standard sequential ignorability. We show that interventional direct and indirect effects (IDE/IIE) are identified under the weaker Modified Sequential Ignorability assumption, and prove that IDE/IIE provide conservative bounds on the unidentified natural effects under monotone indirect treatment response. We propose a doubly-robust augmented inverse probability weighted (AIPW) estimator for IDE/IIE with semiparametric efficiency properties, implemented via cross-fitting. An E-value sensitivity analysis addresses residual confounding on the direct pathway. Empirical evaluation on 89,465 real HMDA conventional purchase mortgage applications from New York State (2022) demonstrates that approximately 77% of the observed 7.9 percentage-point racial denial disparity operates through financial mediators shaped by structural inequality, while the remaining 23% constitutes a conservative lower bound on direct discrimination. The open-source CausalFair Python package implements the full pipeline for deployment at resource-constrained financial institutions.
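The headline decomposition is simple arithmetic once the effects are estimated; an illustrative sketch using the paper's reported numbers (not re-deriving its AIPW estimates):

```python
# Illustrative decomposition arithmetic (not the paper's AIPW estimator):
# total disparity = interventional direct effect + interventional indirect effect.
total_pp = 7.9                 # observed racial denial disparity, percentage points
prop_indirect = 0.77           # share operating through financial mediators

iie_pp = total_pp * prop_indirect          # structural-inequality pathway
ide_pp = total_pp - iie_pp                 # conservative lower bound on direct discrimination

assert abs(iie_pp + ide_pp - total_pp) < 1e-9
# iie_pp ≈ 6.08 pp, ide_pp ≈ 1.82 pp (≈ 23% of the disparity)
```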
[889] Match or Replay: Self Imitating Proximal Policy Optimization
Gaurav Chaudhary, Laxmidhar Behera, Washim Uddin Mondal
Main category: cs.LG
TL;DR: A self-imitating on-policy RL algorithm that enhances exploration and sample efficiency by leveraging past high-reward experiences through optimal transport distance in dense reward environments and uniform replay of successful trajectories in sparse-reward settings.
Details
Motivation: RL agents struggle with inefficient exploration in sparse reward environments, leading to slow learning and suboptimal performance. Traditional exploration strategies fail to systematically build on successful experiences, reducing sample efficiency.
Method: Proposes a self-imitating on-policy algorithm that uses past high-reward state-action pairs to guide policy updates. For dense reward environments, employs optimal transport distance to prioritize state visitation distributions matching the most rewarding trajectory. For sparse-reward environments, uniformly replays successful self-encountered trajectories to facilitate structured exploration.
Result: Experimental results across diverse environments (MuJoCo for dense rewards, 3D Animal-AI Olympics and multi-goal PointMaze for sparse rewards) demonstrate substantial improvements in learning efficiency, faster convergence, and significantly higher success rates compared to state-of-the-art self-imitating RL baselines.
Conclusion: Self-imitation is a robust strategy for enhancing exploration in RL with applicability to more complex tasks, showing potential for improving sample efficiency and learning performance in challenging environments.
Abstract: Reinforcement Learning (RL) agents often struggle with inefficient exploration, particularly in environments with sparse rewards. Traditional exploration strategies can lead to slow learning and suboptimal performance because agents fail to systematically build on previously successful experiences, thereby reducing sample efficiency. To tackle this issue, we propose a self-imitating on-policy algorithm that enhances exploration and sample efficiency by leveraging past high-reward state-action pairs to guide policy updates. Our method incorporates self-imitation by using optimal transport distance in dense reward environments to prioritize state visitation distributions that match the most rewarding trajectory. In sparse-reward environments, we uniformly replay successful self-encountered trajectories to facilitate structured exploration. Experimental results across diverse environments demonstrate substantial improvements in learning efficiency, including MuJoCo for dense rewards and the partially observable 3D Animal-AI Olympics and multi-goal PointMaze for sparse rewards. Our approach achieves faster convergence and significantly higher success rates compared to state-of-the-art self-imitating RL baselines. These findings underscore the potential of self-imitation as a robust strategy for enhancing exploration in RL, with applicability to more complex tasks.
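The optimal-transport matching can be illustrated in one dimension, where Wasserstein-1 between equal-size empirical distributions reduces to comparing sorted samples (the paper's setting is higher-dimensional state visitation, so this is only a schematic):

```python
import numpy as np

def w1_empirical(a, b):
    """Wasserstein-1 between two equal-size 1-D empirical distributions."""
    return np.abs(np.sort(a) - np.sort(b)).mean()

rng = np.random.default_rng(0)
best = rng.normal(0.0, 1.0, 256)        # states visited by the best-return trajectory
close = rng.normal(0.1, 1.0, 256)       # current policy, near the best trajectory
far = rng.normal(2.0, 1.0, 256)         # current policy, far from it

# A self-imitation signal would favor the policy whose visitation
# distribution is closer (in OT distance) to the most rewarding trajectory.
assert w1_empirical(best, close) < w1_empirical(best, far)
```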
[890] Q-BIOLAT: Binary Latent Protein Fitness Landscapes for QUBO-Based Optimization
Truong-Son Hy
Main category: cs.LG
TL;DR: Q-BIOLAT: A framework for protein fitness optimization using binary latent representations and QUBO formulation, connecting machine learning with discrete combinatorial optimization.
Details
Motivation: Protein fitness optimization is inherently discrete but most learning approaches use continuous representations and focus on predictive accuracy rather than optimization. There's a need for frameworks that can effectively model and optimize protein fitness in discrete spaces.
Method: Start from pretrained protein language model embeddings, construct binary latent representations, and learn a quadratic unconstrained binary optimization (QUBO) surrogate that captures unary and pairwise interactions. Compare different representation methods (autoencoder vs PCA) and apply classical combinatorial optimization methods in structured binary latent spaces.
Result: Autoencoder-based representations collapse after binarization, producing degenerate latent spaces, while simple structured representations like PCA yield high-entropy, decodable, and optimization-friendly spaces. Classical combinatorial optimization methods (simulated annealing, genetic algorithms, greedy hill climbing) are highly effective in structured binary latent spaces.
Conclusion: Q-BIOLAT provides a representation-centric perspective on protein fitness modeling, showing that representations with similar predictive performance can induce fundamentally different optimization landscapes. The approach connects modern machine learning with discrete and quantum-inspired optimization.
Abstract: Protein fitness optimization is inherently a discrete combinatorial problem, yet most learning-based approaches rely on continuous representations and are primarily evaluated through predictive accuracy. We introduce Q-BIOLAT, a framework for modeling and optimizing protein fitness landscapes in compact binary latent spaces. Starting from pretrained protein language model embeddings, we construct binary latent representations and learn a quadratic unconstrained binary optimization (QUBO) surrogate that captures unary and pairwise interactions. Beyond its formulation, Q-BIOLAT provides a representation-centric perspective on protein fitness modeling. We show that representations with similar predictive performance can induce fundamentally different optimization landscapes. In particular, learned autoencoder-based representations collapse after binarization, producing degenerate latent spaces that fail to support combinatorial search, whereas simple structured representations such as PCA yield high-entropy, decodable, and optimization-friendly latent spaces. Across multiple datasets and data regimes, we demonstrate that classical combinatorial optimization methods, including simulated annealing, genetic algorithms, and greedy hill climbing, are highly effective in structured binary latent spaces. By expressing the objective in QUBO form, our approach connects modern machine learning with discrete and quantum-inspired optimization. Our implementation and dataset are publicly available at: https://github.com/HySonLab/Q-BIOLAT-Extended
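Once the landscape is in QUBO form, classical combinatorial search applies directly. A minimal sketch with a random symmetric Q standing in for the learned surrogate, minimized by single-bit-flip simulated annealing (one of the optimizers the paper evaluates):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16

# Random symmetric QUBO surrogate f(x) = x^T Q x over x in {0,1}^n.
# In Q-BIOLAT, Q would be fit to predicted fitness over binary latent codes.
Q = rng.standard_normal((n, n))
Q = (Q + Q.T) / 2

def qubo(x):
    return float(x @ Q @ x)

# Plain simulated annealing with single-bit flips.
x = rng.integers(0, 2, n).astype(float)
e0 = qubo(x)
e, best_x, best_e = e0, x.copy(), e0
T = 1.0
for _ in range(2000):
    i = rng.integers(n)
    x[i] = 1 - x[i]                         # propose a one-bit flip
    e_new = qubo(x)
    if e_new < e or rng.random() < np.exp((e - e_new) / T):
        e = e_new
        if e < best_e:
            best_x, best_e = x.copy(), e
    else:
        x[i] = 1 - x[i]                     # reject: undo the flip
    T *= 0.995                              # geometric cooling
```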
[891] Visualization of Machine Learning Models through Their Spatial and Temporal Listeners
Siyu Wu, Lei Shi, Lei Xia, Cenyang Wu, Zipeng Liu, Yingchaojie Feng, Liang Zhou, Wei Chen
Main category: cs.LG
TL;DR: A model-centric framework for visualizing AI models using abstract listeners to capture spatial/temporal behaviors, with a corpus analysis of 128 ModelVis papers showing result-centric trends and high impact of mechanism-oriented studies.
Details
Motivation: Existing model visualization taxonomies are organized by data or tasks rather than treating models as first-class analysis objects, creating a need for a model-centric framework.
Method: Two-stage framework: 1) abstract listeners capture spatial/temporal model behaviors, 2) connects translated behavior data to classical InfoVis pipeline. Uses retrieval-augmented human-LLM extraction workflow to curate corpus of 128 VIS/VAST ModelVis papers with 331 coded figures.
Result: Analysis shows dominant result-centric priority on visualizing model outcomes, quantitative/nominal data types, statistical charts, and performance evaluation. Citation-weighted trends indicate less frequent model-mechanism-oriented studies have disproportionately high impact but are less investigated recently.
Conclusion: The framework provides a general approach for comparing existing ModelVis systems and guiding future designs, highlighting the importance of mechanism-oriented visualization despite current result-centric focus.
Abstract: Model visualization (ModelVis) has emerged as a major research direction, yet existing taxonomies are largely organized by data or tasks, making it difficult to treat models as first-class analysis objects. We present a model-centric two-stage framework that employs abstract listeners to capture spatial and temporal model behaviors, and then connects the translated model behavior data to the classical InfoVis pipeline. To apply the framework at scale, we build a retrieval-augmented human–large language model (LLM) extraction workflow and curate a corpus of 128 VIS/VAST ModelVis papers with 331 coded figures. Our analysis shows a dominant result-centric priority on visualizing model outcomes, quantitative/nominal data types, statistical charts, and performance evaluation. Citation-weighted trends further indicate that less frequent model-mechanism-oriented studies have disproportionately high impact while being less investigated recently. Overall, the framework is a general approach for comparing existing ModelVis systems and guiding possible future designs.
[892] Cross-attentive Cohesive Subgraph Embedding to Mitigate Oversquashing in GNNs
Tanvir Hossain, Muhammad Ifte Khairul Islam, Lilia Chebbah, Charles Fanning, Esra Akbas
Main category: cs.LG
TL;DR: Novel graph learning framework uses cross-attentive cohesive subgraph representations to mitigate oversquashing in GNNs by enriching node embeddings and preserving essential global context while removing noisy connections.
Details
Motivation: GNNs suffer from oversquashing where long-range information gets distorted through limited message-passing pathways, limiting their ability to capture essential global context and decreasing performance, especially in dense and heterophilic graph regions.
Method: Proposes a graph learning framework that enriches node embeddings via cross-attentive cohesive subgraph representations to mitigate excessive long-range dependencies. The framework emphasizes cohesive structure in long-range information while removing noisy or irrelevant connections, preserving essential global context without overloading narrow bottlenecked channels.
Result: Extensive experiments on multiple benchmark datasets demonstrate consistent improvements in classification accuracy over standard baseline methods.
Conclusion: The proposed framework effectively addresses oversquashing in GNNs by leveraging cross-attentive cohesive subgraph representations, enhancing node embeddings while preserving essential global context and removing noise, leading to improved performance.
Abstract: Graph neural networks (GNNs) have achieved strong performance across various real-world domains. Nevertheless, they suffer from oversquashing, where long-range information is distorted as it is compressed through limited message-passing pathways. This bottleneck limits their ability to capture essential global context and decreases their performance, particularly in dense and heterophilic regions of graphs. To address this issue, we propose a novel graph learning framework that enriches node embeddings via cross-attentive cohesive subgraph representations to mitigate the impact of excessive long-range dependencies. This framework enhances the node representation by emphasizing cohesive structure in long-range information but removing noisy or irrelevant connections. It preserves essential global context without overloading the narrow bottlenecked channels, which further mitigates oversquashing. Extensive experiments on multiple benchmark datasets demonstrate that our model achieves consistent improvements in classification accuracy over standard baseline methods.
[893] BLOSSOM: Block-wise Federated Learning Over Shared and Sparse Observed Modalities
Pranav M R, Jayant Chandwani, Ahmed M. Abdelmoniem, Arnab K. Paul
Main category: cs.LG
TL;DR: BLOSSOM is a task-agnostic multimodal federated learning framework that handles heterogeneous clients with missing modalities through block-wise aggregation and partial personalization.
Details
Motivation: Real-world multimodal applications such as autonomous systems and healthcare have data distributed across clients with varying and often missing modalities, but existing FL approaches assume uniform modality availability, limiting their practical applicability.
Method: BLOSSOM supports clients with arbitrary modality subsets and enables flexible sharing of model components. It uses a block-wise aggregation strategy that selectively aggregates shared components while keeping task-specific blocks private for partial personalization.
Result: Block-wise personalization significantly improves performance, especially with severe modality sparsity. In modality-incomplete scenarios, BLOSSOM achieves 18.7% average gain over full-model aggregation, and 37.7% gain in modality-exclusive settings.
Conclusion: BLOSSOM demonstrates the importance of block-wise learning for practical multimodal FL systems, effectively handling client heterogeneity and modality sparsity through selective aggregation and personalization.
Abstract: Multimodal federated learning (FL) is essential for real-world applications such as autonomous systems and healthcare, where data is distributed across heterogeneous clients with varying and often missing modalities. However, most existing FL approaches assume uniform modality availability, limiting their applicability in practice. We introduce BLOSSOM, a task-agnostic framework for multimodal FL designed to operate under shared and sparsely observed modality conditions. BLOSSOM supports clients with arbitrary modality subsets and enables flexible sharing of model components. To address client and task heterogeneity, we propose a block-wise aggregation strategy that selectively aggregates shared components while keeping task-specific blocks private, enabling partial personalization. We evaluate BLOSSOM on multiple diverse multimodal datasets and analyse the effects of missing modalities and personalization. Our results show that block-wise personalization significantly improves performance, particularly in settings with severe modality sparsity. In modality-incomplete scenarios, BLOSSOM achieves an average performance gain of 18.7% over full-model aggregation, while in modality-exclusive settings the gain increases to 37.7%, highlighting the importance of block-wise learning for practical multimodal FL systems.
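The block-wise aggregation rule can be sketched roughly as follows; the block names and flat parameter lists are illustrative assumptions, and the real framework would operate on model tensors:

```python
def blockwise_aggregate(client_models, shared_blocks):
    """One federated round, sketched: average only the blocks listed in
    `shared_blocks` across the clients that have them; every other
    block stays private to its client (partial personalization)."""
    averaged = {}
    for block in shared_blocks:
        participants = [m[block] for m in client_models if block in m]
        # element-wise mean of the shared block across participating clients
        averaged[block] = [sum(vals) / len(participants)
                           for vals in zip(*participants)]
    # write shared averages back; private blocks are left untouched
    return [{b: (averaged[b] if b in averaged else w)
             for b, w in m.items()} for m in client_models]
```

Filtering participants per block is what lets clients with missing modalities join the round: a client simply skips blocks for modalities it never observes.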
[894] An Energy-Efficient Spiking Neural Network Architecture for Predictive Insulin Delivery
Sahil Shrivastava
Main category: cs.LG
TL;DR: PDDS is an event-driven spiking neural network system for predictive insulin dose calculation on ultra-low-power wearable devices, achieving 85.90% validation accuracy while using 79,267x less energy per inference than an LSTM.
Details
Motivation: Diabetes affects over 537 million adults worldwide, requiring continuous glucose monitoring and precise insulin dose calculation on wearable devices with strict power budgets. The paper aims to develop an ultra-low-power computational pipeline using neuromorphic computing principles for edge deployment.
Method: A three-layer Leaky Integrate-and-Fire Spiking Neural Network trained on 128,025 windows from OhioT1DM (66.5% real patients) and the UVa/Padova simulator (33.5%). Uses an event-driven architecture with Poisson encoding for ultra-low-power operation.
Result: Achieved 85.90% validation accuracy and 85.24% test accuracy (vs. 99.06% for the LSTM). Energy: 1,551 femtojoules per inference vs. 122.9 nanojoules for the LSTM (79,267x less energy). Hypoglycemia detection recall remains poor (9.2%, vs. 16.7% for the ADA rules).
Conclusion: The SNN architecture is justified for continuous wearable deployment due to extreme power efficiency, though hypoglycemia detection needs improvement. System is computational middle layer of five-phase roadmap toward clinical validation.
Abstract: Diabetes mellitus affects over 537 million adults worldwide. Insulin-dependent patients require continuous glucose monitoring and precise dose calculation while operating under strict power budgets on wearable devices. This paper presents PDDS - an in-silico, software-complete research prototype of an event-driven computational pipeline for predictive insulin dose calculation. Motivated by neuromorphic computing principles for ultra-low-power wearable edge devices, the core contribution is a three-layer Leaky Integrate-and-Fire (LIF) Spiking Neural Network trained on 128,025 windows from OhioT1DM (66.5% real patients) and the FDA-accepted UVa/Padova physiological simulator (33.5%), achieving 85.90% validation accuracy. We present three rigorously honest evaluations: (1) a standard test-set comparison against ADA threshold rules, bidirectional LSTM (99.06% accuracy), and MLP (99.00%), where the SNN achieves 85.24% - we demonstrate this gap reflects the stochastic encoding trade-off, not architectural failure; (2) a temporal benchmark on 426 non-obvious clinician-annotated hypoglycemia windows where neither the SNN (9.2% recall) nor the ADA rule (16.7% recall) performs adequately, identifying the system’s key limitation and the primary direction for future work; (3) a power-efficiency analysis showing the SNN requires 79,267x less energy per inference than the LSTM (1,551 femtojoules vs. 122.9 nanojoules), justifying the SNN architecture for continuous wearable deployment. The system is not yet connected to physical hardware; it constitutes the computational middle layer of a five-phase roadmap toward clinical validation. Keywords: spiking neural network, glucose severity classification, edge computing, hypoglycemia detection, event-driven architecture, LIF neuron, Poisson encoding, OhioT1DM, in-silico, neuromorphic, power efficiency.
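The LIF dynamics at the core of the network follow the standard discrete-time update; a toy sketch of a single neuron (the leak factor, threshold, and reset value here are illustrative, not the paper's):

```python
def lif_neuron(input_current, tau=0.9, v_thresh=1.0, v_reset=0.0):
    """Discrete-time Leaky Integrate-and-Fire neuron: the membrane
    potential decays by factor `tau` each step, integrates the input,
    and emits a spike (then resets) once it crosses `v_thresh`."""
    v = 0.0
    spikes = []
    for i in input_current:
        v = tau * v + i          # leak, then integrate the input current
        if v >= v_thresh:
            spikes.append(1)
            v = v_reset          # hard reset after the spike
        else:
            spikes.append(0)
    return spikes
```

The energy argument in the abstract rests on exactly this sparsity: downstream work is done only on the steps that emit a spike, not on every sample.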
[895] On the Asymptotics of Self-Supervised Pre-training: Two-Stage M-Estimation and Representation Symmetry
Mohammad Tinati, Stephen Tu
Main category: cs.LG
TL;DR: Theoretical analysis of self-supervised pre-training via two-stage M-estimation, addressing identifiability issues using Riemannian geometry and orbit-invariance to characterize downstream test risk distribution.
Details
Motivation: Existing theoretical work on self-supervised pre-training leaves open questions about the sharpness of current rates and whether they accurately capture the complex interaction between pre-training and fine-tuning. A better theoretical understanding of how pre-training representations affect downstream performance is needed.
Method: Develops an asymptotic theory of pre-training via two-stage M-estimation. Uses Riemannian geometry to handle identifiability issues where pre-training estimators are identifiable only up to a group symmetry. Introduces orbit-invariance to link pre-training representations with downstream predictors and characterize the limiting distribution of the downstream test risk.
Result: Theoretical framework provides precise characterization of downstream test risk distribution. Applied to case studies including spectral pre-training, factor models, and Gaussian mixture models, showing substantial improvements in problem-specific factors over prior work when applicable.
Conclusion: The paper provides a rigorous theoretical foundation for understanding self-supervised pre-training, addressing key challenges in representation identifiability and offering improved theoretical bounds that better capture the pre-training/fine-tuning interaction.
Abstract: Self-supervised pre-training, where large corpora of unlabeled data are used to learn representations for downstream fine-tuning, has become a cornerstone of modern machine learning. While a growing body of theoretical work has begun to analyze this paradigm, existing bounds leave open the question of how sharp the current rates are, and whether they accurately capture the complex interaction between pre-training and fine-tuning. In this paper, we address this gap by developing an asymptotic theory of pre-training via two-stage M-estimation. A key challenge is that the pre-training estimator is often identifiable only up to a group symmetry, a feature common in representation learning that requires careful treatment. We address this issue using tools from Riemannian geometry to study the intrinsic parameters of the pre-training representation, which we link with the downstream predictor through a notion of orbit-invariance, precisely characterizing the limiting distribution of the downstream test risk. We apply our main result to several case studies, including spectral pre-training, factor models, and Gaussian mixture models, and obtain substantial improvements in problem-specific factors over prior art when applicable.
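The generic two-stage M-estimation template underlying this analysis can be written as follows (a standard formulation; the paper's group-symmetry and orbit-invariance treatment is not reproduced here):

```latex
% Stage 1: pre-training estimator from unlabeled data z_1, ..., z_n
\hat{\theta}_n \in \arg\min_{\theta \in \Theta}
  \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{pre}}(z_i;\, \theta)
% Stage 2: downstream estimator plugs in the learned representation
\hat{\beta}_m \in \arg\min_{\beta \in B}
  \frac{1}{m} \sum_{j=1}^{m} \ell_{\mathrm{down}}\!\big(w_j;\, \beta, \hat{\theta}_n\big)
```

The identifiability issue the abstract raises appears in Stage 1: if $\ell_{\mathrm{pre}}(z;\theta) = \ell_{\mathrm{pre}}(z; g \cdot \theta)$ for every $g$ in some symmetry group, only the orbit of $\hat{\theta}_n$ is pinned down, which is why the downstream analysis needs an orbit-invariant link to $\hat{\beta}_m$.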
[896] Prototype-Aligned Federated Soft-Prompts for Continual Web Personalization
Canran Xiao, Liwei Hou
Main category: cs.LG
TL;DR: ProtoFed-SP: A privacy-conscious, parameter-efficient framework for continual web personalization using dual-timescale soft prompts anchored to a differentially private federated prototype library.
Details
Motivation: Real-world non-stationarity and privacy constraints make continual web personalization challenging: models must adapt quickly without forgetting long-term preferences while maintaining privacy.
Method: Injects dual-timescale soft prompts into a frozen backbone: a fast, sparse short-term prompt tracks session intent, while a slow long-term prompt is anchored to a server-side prototype library refreshed via differentially private federated aggregation. Queries are routed to the Top-M prototypes to compose a personalized prompt.
Result: Improves NDCG@10 by +2.9% and HR@10 by +2.0% over strongest baselines across eight benchmarks, with notable gains on Amazon-Books (+5.0% NDCG), H&M (+2.5%), and Taobao (+2.2%). Lowers forgetting and preserves accuracy under practical DP budgets.
Conclusion: ProtoFed-SP offers a unifying, privacy-aware prompting interface with prototype anchoring that delivers robust continual personalization and provides transparent, controllable mechanism to balance stability and plasticity in deployment.
Abstract: Continual web personalization is essential for engagement, yet real-world non-stationarity and privacy constraints make it hard to adapt quickly without forgetting long-term preferences. We target this gap by seeking a privacy-conscious, parameter-efficient interface that controls stability-plasticity at the user/session level while tying user memory to a shared semantic prior. We propose ProtoFed-SP, a prompt-based framework that injects dual-timescale soft prompts into a frozen backbone: a fast, sparse short-term prompt tracks session intent, while a slow long-term prompt is anchored to a small server-side prototype library that is continually refreshed via differentially private federated aggregation. Queries are routed to Top-M prototypes to compose a personalized prompt. Across eight benchmarks, ProtoFed-SP improves NDCG@10 by +2.9% and HR@10 by +2.0% over the strongest baselines, with notable gains on Amazon-Books (+5.0% NDCG vs. INFER), H&M (+2.5% vs. Dual-LoRA), and Taobao (+2.2% vs. FedRAP). It also lowers forgetting (AF) and Steps-to-95% and preserves accuracy under practical DP budgets. Our contribution is a unifying, privacy-aware prompting interface with prototype anchoring that delivers robust continual personalization and offers a transparent, controllable mechanism to balance stability and plasticity in deployment.
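The Top-M routing step can be sketched as cosine-similarity retrieval followed by a weighted composition; the similarity-weighted average is an assumption for illustration, since the summary does not specify the exact composition rule:

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def route_top_m(query, prototypes, m=2):
    """Route a query embedding to its Top-M prototypes and compose a
    personalized prompt as their similarity-weighted average."""
    scored = sorted(((cosine(query, p), p) for p in prototypes),
                    reverse=True)[:m]
    total = sum(s for s, _ in scored)
    d = len(query)
    return [sum(s * p[i] for s, p in scored) / total for i in range(d)]
```

Routing to a small shared prototype library, rather than to raw user histories, is what keeps the long-term prompt compatible with the differentially private aggregation described above.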
[897] CrossHGL: A Text-Free Foundation Model for Cross-Domain Heterogeneous Graph Learning
Xuanze Chen, Jiajun Zhou, Yadong Li, Shanqing Yu, Qi Xuan
Main category: cs.LG
TL;DR: CrossHGL: A foundation framework for text-free, few-shot cross-domain heterogeneous graph representation learning that preserves structural semantics and enables transfer learning across domains without external textual supervision.
Details
Motivation: Most existing heterogeneous graph representation learning methods are limited to closed-world settings with shared schemas and feature spaces, hindering cross-domain generalization. While recent graph foundation models improve transferability, they often target homogeneous graphs, rely on domain-specific schemas, or require rich textual attributes, leaving text-free and few-shot cross-domain HGRL underexplored.
Method: 1) A semantic-preserving transformation strategy homogenizes heterogeneous graphs while encoding interaction semantics into edge features; 2) a prompt-aware multi-domain pre-training framework with a Tri-Prompt mechanism captures transferable knowledge across feature, edge, and structure perspectives via self-supervised contrastive learning; 3) a parameter-efficient fine-tuning strategy freezes the pre-trained backbone and performs few-shot classification via prompt composition and prototypical learning.
Result: CrossHGL consistently outperforms state-of-the-art baselines on node-level and graph-level tasks, yielding average relative improvements of 25.1% and 7.6% in Micro-F1 for node and graph classification respectively, while remaining competitive in challenging feature-degenerated settings.
Conclusion: CrossHGL provides an effective foundation framework for text-free, few-shot cross-domain heterogeneous graph representation learning that successfully preserves and transfers multi-relational structural semantics without external textual supervision.
Abstract: Heterogeneous graph representation learning (HGRL) is essential for modeling complex systems with diverse node and edge types. However, most existing methods are limited to closed-world settings with shared schemas and feature spaces, hindering cross-domain generalization. While recent graph foundation models improve transferability, they often target homogeneous graphs, rely on domain-specific schemas, or require rich textual attributes. Consequently, text-free and few-shot cross-domain HGRL remains underexplored. To address this, we propose CrossHGL, a foundation framework that preserves and transfers multi-relational structural semantics without external textual supervision. Specifically, a semantic-preserving transformation strategy homogenizes heterogeneous graphs while encoding interaction semantics into edge features. Based on this, a prompt-aware multi-domain pre-training framework with a Tri-Prompt mechanism captures transferable knowledge across feature, edge, and structure perspectives via self-supervised contrastive learning. For target-domain adaptation, we develop a parameter-efficient fine-tuning strategy that freezes the pre-trained backbone and performs few-shot classification via prompt composition and prototypical learning. Experiments on node-level and graph-level tasks show that CrossHGL consistently outperforms state-of-the-art baselines, yielding average relative improvements of 25.1% and 7.6% in Micro-F1 for node and graph classification, respectively, while remaining competitive in challenging feature-degenerated settings.
[898] Optimizing Coverage and Difficulty in Reinforcement Learning for Quiz Composition
Ricardo Pedro Querido Andrade Silva, Nassim Bouarour, Dina Fettache, Sarab Boussouar, Noha Ibrahim, Sihem Amer-Yahia
Main category: cs.LG
TL;DR: Automated quiz composition using reinforcement learning to select multiple-choice questions that meet desired topic coverage and difficulty levels.
Details
Motivation: Quiz design is tedious for teachers; automating quiz composition can help evaluate student knowledge acquisition more efficiently while meeting pedagogical goals.
Method: Formalizes quiz composition as a sequential decision-making problem and investigates the DQN, SARSA, and A2C/A3C reinforcement learning algorithms to compose quizzes that meet topic coverage and difficulty constraints.
Result: Extensive experiments on synthetic and real datasets show RL agents can compose effective quizzes, with subtle differences in agent behavior and transfer learning across data distributions and teacher goals; user study supports practical viability.
Conclusion: Reinforcement learning shows promise for automating quiz composition to meet various pedagogical goals, though agent behavior varies with different algorithms and data distributions.
Abstract: Quiz design is a tedious process that teachers undertake to evaluate the acquisition of knowledge by students. Our goal in this paper is to automate quiz composition from a set of multiple choice questions (MCQs). We formalize a generic sequential decision-making problem with the goal of training an agent to compose a quiz that meets the desired topic coverage and difficulty levels. We investigate DQN, SARSA and A2C/A3C, three reinforcement learning solutions to solve our problem. We run extensive experiments on synthetic and real datasets that study the ability of RL to land on the best quiz. Our results reveal subtle differences in agent behavior and in transfer learning with different data distributions and teacher goals. This was supported by our user study, paving the way for automating various teachers’ pedagogical goals.
[899] Low-Rank Adaptation Reduces Catastrophic Forgetting in Sequential Transformer Encoder Fine-Tuning: Controlled Empirical Evidence and Frozen-Backbone Representation Probes
Ashish Pandey
Main category: cs.LG
TL;DR: LoRA significantly reduces catastrophic forgetting in sequential fine-tuning of transformer encoders compared to full fine-tuning, largely due to backbone parameter freezing preserving stable shared features.
Details
Motivation: Sequential fine-tuning of pretrained language models often causes catastrophic forgetting of previously learned capabilities. While parameter-efficient methods like LoRA are known to be more robust, the underlying mechanisms and forgetting behavior remain under-characterized.
Method: Controlled empirical study of LoRA in sequential transformer encoder fine-tuning with representation probes. Experiments on BERT-base and RoBERTa-base across multiple task sequences (RTE→MRPC→CoLA→SST-2), compared against full fine-tuning and an EWC baseline, plus fine-grained freezing ablations and task-similarity probes in GPT-2 and RoBERTa.
Result: Full fine-tuning yields 19.9%±4.8% average forgetting, while standard LoRA yields only 0.6%±1.4% forgetting (statistically significant). RoBERTa shows same pattern. Freezing ablations show forgetting drops sharply when frozen parameters exceed ~95%. Frozen-backbone regimes preserve higher inter-task similarity than full fine-tuning.
Conclusion: LoRA’s effectiveness in reducing catastrophic forgetting stems largely from backbone freezing preserving a stable shared feature scaffold. Standard LoRA serves as both a strong baseline for sequential encoder adaptation and a useful probe for understanding selective plasticity in transformer continual learning.
Abstract: Sequential fine-tuning of pretrained language encoders often overwrites previously acquired capabilities, but the forgetting behavior of parameter-efficient updates remains under-characterized. We present a controlled empirical study of Low-Rank Adaptation (LoRA) in sequential transformer encoder fine-tuning with companion representation probes that test a frozen-backbone explanation of its robustness. In five full-validation BERT-base reruns on an RTE->MRPC->CoLA->SST-2 sequence, full fine-tuning yields 19.9%+/-4.8% average forgetting, whereas standard LoRA (r=8, query/value modules) yields 0.6%+/-1.4% (paired t-test, p=0.002, Cohen’s d_s=3.12). Task-level analyses confirm this reduction is not merely an aggregate effect. Secondary experiments on RoBERTa-base show the same pattern, and the strongest EWC baseline remains at 15.5%+/-1.4% forgetting. A six-task extension reveals that low average forgetting can hide strong task-level heterogeneity. Fine-grained freezing ablations show a marked forgetting drop once frozen parameters exceed roughly 95%, with classifier-only and shallow-adapter baselines approaching LoRA. Companion task-similarity probes in GPT-2 and RoBERTa show the same directional story: frozen-backbone regimes preserve higher inter-task similarity than full fine-tuning, gradual unfreezing weakens stability, and full fine-tuning exhibits its clearest divergence at the final transformer layer. These results support a restrained mechanistic interpretation: LoRA helps largely because backbone freezing preserves a more stable shared feature scaffold. We position standard LoRA as both a strong empirical baseline for sequential encoder adaptation and a useful probe of how selective plasticity shapes interference in transformer continual learning.
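The LoRA forward pass that keeps the backbone frozen can be sketched as follows (plain-Python matrix arithmetic for illustration; real implementations train only A and B while W stays fixed):

```python
def lora_forward(x, W, A, B, alpha=16, r=8):
    """LoRA sketch: the frozen weight W is augmented by a low-rank
    correction (alpha / r) * B @ A. Because sequential tasks can only
    update A and B, the backbone's shared features are preserved."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)                 # frozen-backbone path
    low_rank = matvec(B, matvec(A, x))  # r-dimensional bottleneck path
    s = alpha / r
    return [b + s * l for b, l in zip(base, low_rank)]
```

With A and B initialized so their product is zero, the adapted model starts exactly at the pretrained backbone, which is consistent with the paper's finding that the frozen scaffold drives the forgetting reduction.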
[900] TMTE: Effective Multimodal Graph Learning with Task-aware Modality and Topology Co-evolution
Yinlin Zhu, Xunkai Li, Di Wu, Wang Luo, Miao Hu, Di Wu
Main category: cs.LG
TL;DR: TMTE is a multimodal graph learning framework that jointly optimizes graph topology and multimodal representations through task-aware co-evolution, addressing limitations in real-world multimodal-attributed graphs.
Details
Motivation: Real-world multimodal-attributed graphs (MAGs) have inherent topology quality limitations, including noisy interactions, missing connections, and task-agnostic relational structures. A single graph derived from generic relationships is unlikely to be optimal for diverse downstream tasks.
Method: TMTE jointly and iteratively optimizes graph topology and multimodal representations through a closed-loop co-evolution process. Topology evolution uses multi-perspective metric learning over modality embeddings with an anchor-based approximation, while modality evolution employs smoothness-regularized fusion with cross-modal alignment.
Result: Extensive experiments on 9 MAG datasets and 1 non-graph multimodal dataset across 6 graph-centric and modality-centric tasks show TMTE consistently achieves state-of-the-art performance.
Conclusion: TMTE effectively addresses topology quality limitations in MAGs through task-aware co-evolution of modality and topology, demonstrating superior performance across diverse multimodal graph learning tasks.
Abstract: Multimodal-attributed graphs (MAGs) are a fundamental data structure for multimodal graph learning (MGL), enabling both graph-centric and modality-centric tasks. However, our empirical analysis reveals inherent topology quality limitations in real-world MAGs, including noisy interactions, missing connections, and task-agnostic relational structures. A single graph derived from generic relationships is therefore unlikely to be universally optimal for diverse downstream tasks. To address this challenge, we propose Task-aware Modality and Topology co-Evolution (TMTE), a novel MGL framework that jointly and iteratively optimizes graph topology and multimodal representations toward the target task. TMTE is motivated by the bidirectional coupling between modality and topology: multimodal attributes induce relational structures, while graph topology shapes modality representations. Concretely, TMTE casts topology evolution as multi-perspective metric learning over modality embeddings with an anchor-based approximation, and formulates modality evolution as smoothness-regularized fusion with cross-modal alignment, yielding a closed-loop task-aware co-evolution process. Extensive experiments on 9 MAG datasets and 1 non-graph multimodal dataset across 6 graph-centric and modality-centric tasks show that TMTE consistently achieves state-of-the-art performance. Our code is available at https://anonymous.4open.science/r/TMTE-1873.
[901] Robust Smart Contract Vulnerability Detection via Contrastive Learning-Enhanced Granular-ball Training
Zeli Wang, Qingxuan Yang, Shuyin Xia, Yueming Wu, Bo Liu, Longlong Lin
Main category: cs.LG
TL;DR: CGBC uses granular-ball computing and contrastive learning to improve robustness of smart contract vulnerability detection against label noise.
Details
Motivation: Smart contract vulnerability detection using DNNs suffers from label noise introduced by inaccurate open-source labeling tools, which harms model robustness and accuracy.
Method: Introduces a granular-ball computing layer to group similar contracts and correct noisy labels, uses contrastive-learning pretraining with semantic-consistent augmentation, and employs a symmetric cross-entropy loss to combat label noise.
Result: Extensive experiments show CGBC significantly improves robustness and effectiveness of smart contract vulnerability detection compared to baselines.
Conclusion: CGBC effectively addresses label noise in smart contract vulnerability detection through granular-ball computing and contrastive learning, enhancing model robustness.
Abstract: Deep neural networks (DNNs) have emerged as a prominent approach for detecting smart contract vulnerabilities, driven by the growing contract datasets and advanced deep learning techniques. However, DNNs typically require large-scale labeled datasets to model the relationships between contract features and vulnerability labels. In practice, the labeling process often depends on existing open-sourced tools, whose accuracy cannot be guaranteed. Consequently, label noise poses a significant challenge for the accuracy and robustness of the smart contract, which is rarely explored in the literature. To this end, we propose Contrastive learning-enhanced Granular-Ball smart Contracts training, CGBC, to enhance the robustness of contract vulnerability detection. Specifically, CGBC first introduces a Granular-ball computing layer between the encoder layer and the classifier layer, to group similar contracts into Granular-Balls (GBs) and generate new coarse-grained representations (i.e., the center and the label of GBs) for them, which can correct noisy labels based on the most correct samples. An inter-GB compactness loss and an intra-GB looseness loss are combined to enhance the effectiveness of clustering. Then, to improve the accuracy of GBs, we pretrain the model through unsupervised contrastive learning supported by our novel semantic-consistent smart contract augmentation method. This procedure can discriminate contracts with different labels by dragging the representation of similar contracts closer, assisting CGBC in clustering. Subsequently, we leverage the symmetric cross-entropy loss function to measure the model quality, which can combat the label noise in gradient computations. Finally, extensive experiments show that the proposed CGBC can significantly improve the robustness and effectiveness of the smart contract vulnerability detection when contrasted with baselines.
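The symmetric cross-entropy term used to combat label noise combines standard and reverse cross-entropy; a sketch in the style of Wang et al.'s SCE loss (the clamp value and the unit weights are illustrative defaults, not CGBC's tuned settings):

```python
import math

def symmetric_ce(pred, label_onehot, alpha=1.0, beta=1.0, clamp=-4.0):
    """Symmetric cross-entropy: standard CE plus a reverse CE in which
    log(0) on the label side is clamped to a finite constant, which
    damps the gradient contribution of possibly-noisy labels."""
    eps = 1e-12
    ce = -sum(q * math.log(p + eps) for p, q in zip(pred, label_onehot))
    rce = -sum(p * (math.log(q) if q > 0 else clamp)
               for p, q in zip(pred, label_onehot))
    return alpha * ce + beta * rce
```

The reverse term is what makes the loss robust: a confidently wrong label contributes a bounded penalty instead of an unbounded one.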
[902] AutoStan: Autonomous Bayesian Model Improvement via Predictive Feedback
Oliver Dürr
Main category: cs.LG
TL;DR: AutoStan: A CLI coding agent that autonomously builds and iteratively improves Bayesian models in Stan using NLPD and sampler diagnostics as feedback signals.
Details
Motivation: To automate the process of Bayesian model building and improvement, reducing the manual effort required for writing and refining Stan code while maintaining model interpretability.
Method: A command-line interface coding agent operates in a loop: it writes a Stan model file, executes MCMC sampling, then decides whether to keep or revert each change based on the negative log predictive density (NLPD) on held-out data and sampler diagnostics (divergences, R-hat, effective sample size).
Result: On synthetic regression with outliers, the agent progressed from naive linear regression to robust Student-t models with nonlinear heteroscedastic structure and contamination mixtures, matching or outperforming TabPFN while remaining interpretable. Successfully discovered hierarchical partial pooling, varying-slope models, and Poisson attack/defense models across four additional experiments.
Conclusion: First demonstration that a CLI coding agent can autonomously write and iteratively improve Stan code for diverse Bayesian modeling problems without search algorithms, critic modules, or domain-specific instructions.
Abstract: We present AutoStan, a framework in which a command-line interface (CLI) coding agent autonomously builds and iteratively improves Bayesian models written in Stan. The agent operates in a loop, writing a Stan model file, executing MCMC sampling, then deciding whether to keep or revert each change based on two complementary feedback signals: the negative log predictive density (NLPD) on held-out data and the sampler’s own diagnostics (divergences, R-hat, effective sample size). We evaluate AutoStan on five datasets with diverse modeling structures. On a synthetic regression dataset with outliers, the agent progresses from naive linear regression to a model with Student-t robustness, nonlinear heteroscedastic structure, and an explicit contamination mixture, matching or outperforming TabPFN, a state-of-the-art black-box method, while remaining fully interpretable. Across four additional experiments, the same mechanism discovers hierarchical partial pooling, varying-slope models with correlated random effects, and a Poisson attack/defense model for soccer. No search algorithm, critic module, or domain-specific instructions are needed. This is, to our knowledge, the first demonstration that a CLI coding agent can autonomously write and iteratively improve Stan code for diverse Bayesian modeling problems.
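The NLPD feedback signal can be sketched for a Gaussian predictive distribution (an assumption for illustration; with Stan one would instead average the predictive density over posterior draws):

```python
import math

def nlpd_gaussian(y_heldout, mu, sigma):
    """Negative log predictive density under per-point Gaussian
    predictives N(mu_i, sigma_i^2) on held-out data: the agent's
    keep-or-revert signal. Lower is better, so a proposed model
    revision is kept only if NLPD drops."""
    n = len(y_heldout)
    ll = sum(-0.5 * math.log(2 * math.pi * s * s)
             - (y - m) ** 2 / (2 * s * s)
             for y, m, s in zip(y_heldout, mu, sigma))
    return -ll / n
```

Unlike accuracy-style metrics, NLPD also rewards calibrated uncertainty, which is why it pairs naturally with the sampler diagnostics as a second gatekeeper.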
[903] What-If Explanations Over Time: Counterfactuals for Time Series Classification
Udo Schlegel, Thomas Seidl
Main category: cs.LG
TL;DR: Survey paper reviewing counterfactual explanation methods for time series classification, covering instance-based, pattern-driven, gradient-based, and generative approaches, with an open-source library CFTS for standardized evaluation.
Details
Motivation: Counterfactual explanations are important for explainable AI in time series domains, but generating them for temporal data presents unique challenges, such as maintaining temporal coherence and plausibility, that differ from the tabular and image domains.
Method: Comprehensive survey of state-of-the-art methods, including instance-based nearest-neighbor techniques, pattern-driven algorithms, gradient-based optimization, and generative models, with analysis of each method's methodology, target models/classifiers, and evaluation datasets.
Result: Development of open-source library CFTS (Counterfactual Explanations for Time Series) as a reference framework with multiple algorithms and evaluation metrics. Comparative analysis of methods along dimensions like validity, proximity, sparsity, and plausibility.
Conclusion: Identifies gaps in current research and proposes future directions including improved user-centered design, domain knowledge integration, and extending counterfactuals to time series forecasting. The CFTS library aims to standardize evaluation and enable practical adoption.
Abstract: Counterfactual explanations emerge as a powerful approach in explainable AI, providing what-if scenarios that reveal how minimal changes to an input time series can alter the model’s prediction. This work presents a survey of recent algorithms for counterfactual explanations for time series classification. We review state-of-the-art methods, spanning instance-based nearest-neighbor techniques, pattern-driven algorithms, gradient-based optimization, and generative models. For each, we discuss the underlying methodology, the models and classifiers they target, and the datasets on which they are evaluated. We highlight unique challenges in generating counterfactuals for temporal data, such as maintaining temporal coherence, plausibility, and actionable interpretability, which distinguish the temporal from tabular or image domains. We analyze the strengths and limitations of existing approaches and compare their effectiveness along key dimensions (validity, proximity, sparsity, plausibility, etc.). In addition, we implemented an open-source implementation library, Counterfactual Explanations for Time Series (CFTS), as a reference framework that includes many algorithms and evaluation metrics. We discuss this library’s contributions in standardizing evaluation and enabling practical adoption of explainable time series techniques. Finally, based on the literature and identified gaps, we propose future research directions, including improved user-centered design, integration of domain knowledge, and counterfactuals for time series forecasting.
[904] RG-TTA: Regime-Guided Meta-Control for Test-Time Adaptation in Streaming Time Series
Indar Kumar, Akanksha Tiwari, Sai Krishna Jasti, Ankit Hemant Lade
Main category: cs.LG
TL;DR: RG-TTA is a meta-controller for test-time adaptation in time series forecasting that dynamically adjusts adaptation intensity based on distributional similarity to previously-seen regimes, improving performance while reducing computational cost.
Details
Motivation: Existing test-time adaptation methods apply uniform adaptation intensity regardless of distribution shift characteristics, failing to optimize computational effort and adaptation quality based on the nature of incoming data.
Method: RG-TTA uses an ensemble of statistical metrics to compute similarity scores for incoming batches, then modulates learning rates and gradient effort via early stopping. It also gates checkpoint reuse from a regime memory based on performance improvement thresholds.
Result: Regime-guided policies achieved lowest MSE in 69.6% of experiments, with RG-TTA reducing MSE by 5.7% vs standard TTA while running 5.5% faster, and RG-EWC reducing MSE by 14.1% vs standalone EWC.
Conclusion: RG-TTA provides an effective, model-agnostic framework for adaptive test-time adaptation in time series forecasting that intelligently allocates computational resources based on distributional similarity.
Abstract: Test-time adaptation (TTA) enables neural forecasters to adapt to distribution shifts in streaming time series, but existing methods apply the same adaptation intensity regardless of the nature of the shift. We propose Regime-Guided Test-Time Adaptation (RG-TTA), a meta-controller that continuously modulates adaptation intensity based on distributional similarity to previously-seen regimes. Using an ensemble of Kolmogorov-Smirnov, Wasserstein-1, feature-distance, and variance-ratio metrics, RG-TTA computes a similarity score for each incoming batch and uses it to (i) smoothly scale the learning rate – more aggressive for novel distributions, conservative for familiar ones – and (ii) control gradient effort via loss-driven early stopping rather than fixed budgets, allowing the system to allocate exactly the effort each batch requires. As a supplementary mechanism, RG-TTA gates checkpoint reuse from a regime memory, loading stored specialist models only when they demonstrably outperform the current model (loss improvement >= 30%). RG-TTA is model-agnostic and strategy-composable: it wraps any forecaster exposing train/predict/save/load interfaces and enhances any gradient-based TTA method. We demonstrate three compositions – RG-TTA, RG-EWC, and RG-DynaTTA – and evaluate 6 update policies (3 baselines + 3 regime-guided variants) across 4 compact architectures (GRU, iTransformer, PatchTST, DLinear), 14 datasets (6 real-world multivariate benchmarks + 8 synthetic regime scenarios), and 4 forecast horizons (96, 192, 336, 720) under a streaming evaluation protocol with 3 random seeds (672 experiments total). Regime-guided policies achieve the lowest MSE in 156 of 224 seed-averaged experiments (69.6%), with RG-EWC winning 30.4% and RG-TTA winning 29.0%. Overall, RG-TTA reduces MSE by 5.7% vs TTA while running 5.5% faster; RG-EWC reduces MSE by 14.1% vs standalone EWC.
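The metric ensemble and learning-rate modulation described in the abstract can be sketched roughly as follows; the combination rule and the scaling function are illustrative assumptions here, not RG-TTA's published formulas.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

def regime_similarity(batch, reference):
    """Ensemble similarity in [0, 1]: 1 = familiar regime, 0 = novel.
    Combines the KS statistic, a squashed Wasserstein-1 distance, and a
    variance ratio (a subset of the paper's metric ensemble)."""
    ks = ks_2samp(batch, reference).statistic            # already in [0, 1]
    w1 = wasserstein_distance(batch, reference)
    w1 = w1 / (w1 + 1.0)                                 # squash to [0, 1)
    vr = min(batch.var(), reference.var()) / max(batch.var(), reference.var())
    return 1.0 - np.mean([ks, w1, 1.0 - vr])

def adapt_lr(base_lr, sim, novel_boost=10.0):
    """Smoothly scale the learning rate: aggressive for novel
    distributions (sim -> 0), conservative for familiar ones (sim -> 1)."""
    return base_lr * (1.0 + (novel_boost - 1.0) * (1.0 - sim))
```

A batch drawn from the reference distribution yields a similarity near 1 and roughly the base learning rate; a shifted batch yields a low similarity and a correspondingly larger step.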
[905] KVSculpt: KV Cache Compression as Distillation
Bo Jiang, Sian Jin
Main category: cs.LG
TL;DR: KVSculpt: A KV cache compression method that optimizes unconstrained KV pairs in continuous embedding space instead of selecting/merging original entries, with adaptive budget allocation across layers and heads.
Details
Motivation: Existing KV cache compression methods either evict or merge original KV pairs, remaining anchored to original cache entries. There's a need for more flexible compression that can optimize KV pairs in continuous space to better preserve attention behavior.
Method: 1) Optimize smaller set of unconstrained KV pairs in continuous embedding space via alternating optimization: keys optimized with L-BFGS, values solved via least squares. 2) Adaptive budget allocation uses cheap pilot compression to redistribute compression budget across layers and KV heads based on per-component difficulty.
Result: On Qwen2.5-1.5B-Instruct with 2048-token contexts: reduces KL divergence by 3.5-4.1x vs Select+Fit across compression ratios {0.3, 0.5, 0.7}. Adaptive allocation provides additional 1.3x KL reduction at no extra inference cost. Analysis shows compression difficulty varies dramatically across layers (up to 100x) and within layers (up to 467x between KV heads).
Conclusion: KVSculpt demonstrates superior KV cache compression by optimizing unconstrained KV pairs in continuous space, with adaptive budget allocation proving essential due to highly non-uniform compression difficulty across model components.
Abstract: KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint – quantization and low-rank decomposition – are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction – selecting which KV pairs to keep – to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries. We propose KVSculpt, which moves to the other end of this spectrum: instead of selecting or combining original pairs, we optimize a smaller set of unconstrained KV pairs in continuous embedding space to preserve each layer’s attention behavior. Keys are optimized via L-BFGS and values are solved in closed form via least squares, alternating every few steps. On top of this, we introduce adaptive budget allocation, which uses a cheap pilot compression run to redistribute the compression budget across layers and KV heads based on per-component difficulty. On Qwen2.5-1.5B-Instruct with 2048-token contexts, KVSculpt reduces KL divergence by 3.5-4.1x compared to Select+Fit – attention-score eviction with least-squares value fitting – across compression ratios r in {0.3, 0.5, 0.7}. Adaptive allocation provides an additional 1.3x KL reduction at no extra inference cost. Analysis reveals that compression difficulty is highly non-uniform: per-layer pilot MSE varies by up to 100x across layers, and the two KV heads within a single layer can differ by up to 467x – demonstrating that fine-grained budget allocation is essential.
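The closed-form value step of the alternation (values via least squares with the compressed keys held fixed) can be sketched as below; the L-BFGS key step is omitted, and a strided subsample of the original keys stands in for optimized ones.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_values(Q, K, V, K_c):
    """Closed-form value step: with compressed keys K_c fixed, solve a
    least-squares problem so that attention over (K_c, V_c) matches the
    layer's original attention output over (K, V)."""
    target = softmax(Q @ K.T) @ V        # original attention behavior
    A_c = softmax(Q @ K_c.T)             # compressed attention weights
    V_c, *_ = np.linalg.lstsq(A_c, target, rcond=None)
    return V_c, target
```

Because the least-squares solve is optimal for the given compressed attention weights, the fitted values can never do worse than simply subsampling the original values with the same keys.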
[906] Stability and Sensitivity Analysis of Relative Temporal-Difference Learning: Extended Version
Masoud S. Sakha, Rushikesh Kamalapurkar, Sean Meyn
Main category: cs.LG
TL;DR: Relative TD learning with linear function approximation: stability analysis shows empirical baseline distribution ensures stability for any discount factor, with bounded asymptotic bias/covariance as γ→1.
Details
Motivation: TD learning methods converge slowly when discount factor γ approaches 1. Relative TD learning subtracts a baseline to mitigate this, but stability guarantees with function approximation remain poorly understood.
Method: Analyzes relative TD learning with linear function approximation, establishes stability conditions, examines role of baseline distribution choice, particularly empirical distribution of state-action process.
Result: When baseline is empirical distribution of state-action process, algorithm is stable for any non-negative baseline weight and any discount factor. Asymptotic covariance and bias remain uniformly bounded as γ→1.
Conclusion: Relative TD learning with proper baseline choice provides stable algorithm with bounded asymptotic properties even as discount factor approaches one, addressing slow convergence issue in TD methods.
Abstract: Relative temporal-difference (TD) learning was introduced to mitigate the slow convergence of TD methods when the discount factor approaches one by subtracting a baseline from the temporal-difference update. While this idea has been studied in the tabular setting, stability guarantees with function approximation remain poorly understood. This paper analyzes relative TD learning with linear function approximation. We establish stability conditions for the algorithm and show that the choice of baseline distribution plays a central role. In particular, when the baseline is chosen as the empirical distribution of the state-action process, the algorithm is stable for any non-negative baseline weight and any discount factor. We also provide a sensitivity analysis of the resulting parameter estimates, characterizing both asymptotic bias and covariance. The asymptotic covariance and asymptotic bias are shown to remain uniformly bounded as the discount factor approaches one.
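A minimal sketch of a linear relative-TD(0) update, assuming the baseline is kappa times the value averaged under the empirical feature mean phi_bar (an illustrative instantiation, not the paper's exact algorithm). On a one-state toy problem this keeps the parameter bounded as gamma approaches one, whereas the plain TD fixed point grows like 1/(1 - gamma).

```python
import numpy as np

def relative_td_step(theta, phi_s, phi_s2, r, gamma, kappa, phi_bar, alpha):
    """One linear relative-TD(0) update: the baseline kappa * <phi_bar, theta>
    (the value averaged under the empirical state distribution, with
    empirical feature mean phi_bar) is subtracted from the usual TD
    target; kappa >= 0 is the baseline weight."""
    delta = r + gamma * (phi_s2 @ theta) - kappa * (phi_bar @ theta) - phi_s @ theta
    return theta + alpha * delta * phi_s
```

With a single state, unit feature, reward 1, and gamma = 0.99: kappa = 1 converges to 1/1.01, while kappa = 0 (plain TD) heads toward 1/(1 - 0.99) = 100.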
[907] Kernel Dynamics under Path Entropy Maximization
Jnaneshwar Das
Main category: cs.LG
TL;DR: A variational framework treating kernel functions as dynamical variables subject to path entropy maximization, where kernel evolution defines changing representational structures and information geometries.
Details
Motivation: To develop a theoretical framework where the kernel function (encoding what distinctions an agent can represent) is treated as a dynamical variable, allowing the optimization landscape to evolve endogenously during learning or adaptation processes.
Method: Proposes a variational framework using Maximum Caliber (MaxCal) to treat kernel functions as dynamical variables subject to path entropy maximization. Formulates fixed-point conditions for self-consistent kernels, connects to renormalization group flow as a special case, and suggests neural tangent kernel evolution as an empirical instantiation.
Result: Derives an information-thermodynamic bound: the work required for kernel change is bounded below by δW ≥ k_B T δI_k, where δI_k is the mutual information newly unlocked by the updated kernel. Identifies stable fixed points of MaxCal over kernels as self-reinforcing distinction structures.
Conclusion: The framework provides a principled way to understand how representational structures evolve through kernel dynamics, with connections to RG flow and neural network training. It offers conjectural interpretations for biological niches, scientific paradigms, and craft mastery, and poses six open questions for empirical testing.
Abstract: We propose a variational framework in which the kernel function k : X x X -> R, interpreted as the foundational object encoding what distinctions an agent can represent, is treated as a dynamical variable subject to path entropy maximization (Maximum Caliber, MaxCal). Each kernel defines a representational structure over which an information geometry on probability space may be analyzed; a trajectory through kernel space therefore corresponds to a trajectory through a family of effective geometries, making the optimization landscape endogenous to its own traversal. We formulate fixed-point conditions for self-consistent kernels, propose renormalization group (RG) flow as a structured special case, and suggest neural tangent kernel (NTK) evolution during deep network training as a candidate empirical instantiation. Under explicit information-thermodynamic assumptions, the work required for kernel change is bounded below by delta W >= k_B T delta I_k, where delta I_k is the mutual information newly unlocked by the updated kernel. In this view, stable fixed points of MaxCal over kernels correspond to self-reinforcing distinction structures, with biological niches, scientific paradigms, and craft mastery offered as conjectural interpretations. We situate the framework relative to assembly theory and the MaxCal literature, separate formal results from structured correspondences and conjectural bridges, and pose six open questions that make the program empirically and mathematically testable.
[908] Near-Optimal Primal-Dual Algorithm for Learning Linear Mixture CMDPs with Adversarial Rewards
Kihyun Yu, Seoungbin Bae, Dabeen Lee
Main category: cs.LG
TL;DR: A primal-dual policy optimization algorithm for safe RL in linear mixture constrained MDPs with adversarial rewards, achieving near-optimal regret bounds.
Details
Motivation: Address the challenge of safe reinforcement learning in constrained Markov decision processes with adversarial rewards and unknown transition dynamics, where existing methods struggle with provable efficiency guarantees.
Method: Propose a primal-dual policy optimization algorithm with regularized dual updates and weighted ridge regression-based parameter estimation for constructing tighter confidence intervals in constrained settings.
Result: Achieves regret and constraint violation bounds of $\widetilde{O}(\sqrt{d^2 H^3 K})$, which is near-optimal and matches known minimax lower bounds up to logarithmic factors.
Conclusion: First provably efficient algorithm for linear mixture CMDPs with adversarial rewards, introducing key innovations in dual regularization and parameter estimation for constrained settings.
Abstract: We study safe reinforcement learning in finite-horizon linear mixture constrained Markov decision processes (CMDPs) with adversarial rewards under full-information feedback and an unknown transition kernel. We propose a primal-dual policy optimization algorithm that achieves regret and constraint violation bounds of $\widetilde{O}(\sqrt{d^2 H^3 K})$ under mild conditions, where $d$ is the feature dimension, $H$ is the horizon, and $K$ is the number of episodes. To the best of our knowledge, this is the first provably efficient algorithm for linear mixture CMDPs with adversarial rewards. In particular, our regret bound is near-optimal, matching the known minimax lower bound up to logarithmic factors. The key idea is to introduce a regularized dual update that enables a drift-based analysis. This step is essential, as strong duality-based analysis cannot be directly applied when reward functions change across episodes. In addition, we extend weighted ridge regression-based parameter estimation to the constrained setting, allowing us to construct tighter confidence intervals that are crucial for deriving the near-optimal regret bound.
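The flavor of a regularized dual update can be illustrated with a toy projected dual-ascent step; the exact form behind the paper's drift-based analysis is not given in the abstract, so everything below is an assumption.

```python
def regularized_dual_step(lam, violation, eta=0.1, reg=0.05):
    """One regularized projected dual-ascent step on the Lagrange
    multiplier: the ell-2 regularizer `reg` pulls the multiplier toward
    zero, keeping it bounded (drift-controlled) even when constraint
    violations persist; reg = 0 recovers plain dual ascent."""
    return max(0.0, (1.0 - eta * reg) * lam + eta * violation)
```

Under a persistent unit violation, the regularized multiplier settles at 1/reg, while the unregularized one grows without bound — the kind of boundedness a drift-based analysis exploits.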
[909] Spectral Signatures of Data Quality: Eigenvalue Tail Index as a Diagnostic for Label Noise in Neural Networks
Matthew Loftus
Main category: cs.LG
TL;DR: Spectral properties of neural network weight matrices, specifically the tail index alpha of eigenvalue distributions at bottleneck layers, strongly predict test accuracy under label noise variation but are weak predictors under hyperparameter variation, suggesting they serve as data quality diagnostics rather than universal generalization predictors.
Details
Motivation: To determine whether spectral properties of neural network weight matrices can reliably predict test accuracy and generalization performance, and to understand the conditions under which such spectral signatures are informative.
Method: Analyzed eigenvalue distributions of weight matrices at bottleneck layers across three architectures (MLP, CNN, ResNet-18) and two datasets (MNIST, CIFAR-10). Used controlled experiments varying label noise levels (21 levels, 3 seeds each) and hyperparameters (180 configurations varying width, depth, learning rate, weight decay). Measured tail index alpha of eigenvalue distributions and compared against conventional metrics like Frobenius norm.
Result: Tail index alpha predicted test accuracy with LOO R^2 = 0.984 under label noise variation, far exceeding conventional metrics (best baseline: Frobenius norm with R^2 = 0.149). However, under hyperparameter variation, all measures were weak predictors (R^2 < 0.25), with simple L_2 norm (R^2 = 0.219) slightly outperforming tail alpha (R^2 = 0.167). The method successfully detected real human annotation errors in CIFAR-10N (9% noise with 3% error).
Conclusion: The tail index alpha serves as a powerful data quality diagnostic for detecting label corruption and training set degradation, rather than as a universal generalization predictor. The information-processing bottleneck layer is identified as the locus of this signature, connecting observations to the BBP phase transition in spiked random matrix models.
Abstract: We investigate whether spectral properties of neural network weight matrices can predict test accuracy. Under controlled label noise variation, the tail index alpha of the eigenvalue distribution at the network’s bottleneck layer predicts test accuracy with leave-one-out R^2 = 0.984 (21 noise levels, 3 seeds per level), far exceeding all baselines: the best conventional metric (Frobenius norm of the optimal layer) achieves LOO R^2 = 0.149. This relationship holds across three architectures (MLP, CNN, ResNet-18) and two datasets (MNIST, CIFAR-10). However, under hyperparameter variation at fixed data quality (180 configurations varying width, depth, learning rate, and weight decay), all spectral and conventional measures are weak predictors (R^2 < 0.25), with simple baselines (global L_2 norm, LOO R^2 = 0.219) slightly outperforming spectral measures (tail alpha, LOO R^2 = 0.167). We therefore frame the tail index as a data quality diagnostic: a powerful detector of label corruption and training set degradation, rather than a universal generalization predictor. A noise detector calibrated on synthetic noise successfully identifies real human annotation errors in CIFAR-10N (9% noise detected with 3% error). We identify the information-processing bottleneck layer as the locus of this signature and connect the observations to the BBP phase transition in spiked random matrix models. We also report a negative result: the level spacing ratio
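A standard way to estimate a power-law tail index alpha from an eigenvalue sample is the Hill estimator over the top-k order statistics; whether this matches the paper's exact fitting procedure is an assumption.

```python
import numpy as np

def tail_alpha(eigs, k=None):
    """Hill estimator of the power-law tail index alpha of an eigenvalue
    sample (a standard stand-in; the paper's fitting procedure may
    differ). Averages log-ratios of the top-k order statistics."""
    lam = np.sort(np.asarray(eigs, dtype=float))[::-1]
    lam = lam[lam > 0]
    k = k or max(2, len(lam) // 10)
    top = lam[:k]
    hill = np.mean(np.log(top[:-1] / top[-1]))
    return 1.0 / hill
```

In practice one would feed it the eigenvalue spectrum of a layer's correlation matrix, e.g. `eigs = np.linalg.eigvalsh(W.T @ W)` for a bottleneck-layer weight matrix `W`; on synthetic Pareto samples the estimator recovers the true exponent.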
[910] Efficient Inference of Large Vision Language Models
Surendra Pathak
Main category: cs.LG
TL;DR: Survey paper on optimization techniques for accelerating Large Vision Language Model inference, focusing on computational efficiency challenges from visual token processing.
Details
Motivation: LVLMs face scalability and deployment constraints due to massive computational requirements, especially from high-resolution visual inputs that create quadratic attention complexity, necessitating systematic optimization approaches.
Method: Comprehensive survey with systematic taxonomy categorizing optimization frameworks into four dimensions: visual token compression, memory management/serving, efficient architectural design, and advanced decoding strategies.
Result: Organizes current state-of-the-art acceleration techniques, critically examines their limitations, and identifies open problems for future research in efficient multimodal systems.
Conclusion: Provides structured overview of LVLM optimization landscape, highlighting key challenges and research directions for improving computational efficiency in multimodal AI systems.
Abstract: Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of visual tokens from high-resolution input data aggravates the situation due to the quadratic complexity of attention mechanisms. To address these issues, the research community has developed several optimization frameworks. This paper presents a comprehensive survey of the current state-of-the-art techniques for accelerating LVLM inference. We introduce a systematic taxonomy that categorizes existing optimization frameworks into four primary dimensions: visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies. Furthermore, we critically examine the limitations of these current methodologies and identify critical open problems to inspire future research directions in efficient multimodal systems.
[911] ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control
Christopher Cruz
Main category: cs.LG
TL;DR: ATLAS-RTC is a runtime control system for autoregressive language models that enforces structured output during decoding through real-time monitoring and targeted interventions.
Details
Motivation: Many LLM failures in structured generation tasks arise from decoding artifacts rather than task misunderstanding, and existing approaches like post-hoc validation or static constrained decoding are inefficient. There's a need for runtime control that can detect and correct errors during generation before they materialize.
Method: ATLAS-RTC operates as a closed-loop control system that monitors generation at each step, detects drift from output contracts using lightweight signals, and applies targeted interventions including biasing, masking, and rollback operations.
Result: The system improves first-attempt success rates by 20 to 37.8 percentage points across structured generation and tool-calling tasks, with up to 88% latency reduction in failure-dominated settings.
Conclusion: Runtime control should be considered as a distinct layer in LLM systems, as many failures stem from decoding artifacts rather than task misunderstanding, and real-time intervention can significantly improve structured output generation.
Abstract: We present ATLAS-RTC, a runtime control system for autoregressive language models that enforces structured output during decoding. ATLAS-RTC monitors generation at each step, detects drift from output contracts using lightweight signals, and applies targeted interventions such as biasing, masking, and rollback. Unlike post-hoc validation or static constrained decoding, it operates in a closed loop, enabling correction before errors materialize. Across structured generation and tool-calling tasks, ATLAS-RTC improves first-attempt success rates by 20 to 37.8 percentage points, with up to 88% latency reduction in failure-dominated settings. Results show that many failures arise from decoding artifacts rather than task misunderstanding, motivating runtime control as a distinct layer in LLM systems.
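The monitor-and-mask loop can be caricatured as follows, with a hypothetical digit-list output contract standing in for real contracts (biasing and rollback interventions are omitted for brevity; none of this is the actual ATLAS-RTC internals).

```python
import numpy as np

def controlled_decode(step_logits, allowed_fn, vocab, max_len=16, eos="]"):
    """Closed-loop decoding sketch: at each step a monitor computes the
    set of contract-valid tokens and masks everything else before
    picking, so violations are prevented during generation rather than
    repaired post hoc."""
    out = []
    for _ in range(max_len):
        logits = np.asarray(step_logits(out))                 # raw model proposal
        mask = np.array([0.0 if allowed_fn(out, t) else -np.inf for t in vocab])
        tok = vocab[int(np.argmax(logits + mask))]            # greedy under the mask
        out.append(tok)
        if tok == eos:
            break
    return "".join(out)

def digit_list_contract(out, tok):
    """Hypothetical output contract: a bracketed digit list like [1,2,3]."""
    if not out:
        return tok == "["
    last = out[-1]
    if last == "[" or last == ",":
        return tok.isdigit()
    if last.isdigit():
        return tok in ",]"
    return False  # after "]" generation has ended
```

Even with an adversarially random "model", every emitted token satisfies the contract, because invalid continuations never survive the mask.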
[912] ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing
Edward J. Yoon
Main category: cs.LG
TL;DR: ITQ3_S is a novel 3-bit weight quantization format for LLMs that uses Fast Walsh-Hadamard Transform to pre-rotate weights, enabling near-Gaussian distributions for uniform ternary coding with mathematically rigorous dequantization.
Details
Motivation: Conventional 3-bit quantization suffers from catastrophic precision loss due to heavy-tailed weight distributions and inter-channel outliers in LLMs, limiting practical deployment on consumer hardware.
Method: ITQ3_S integrates TurboQuant (TQ), a rotation-domain adaptive quantization strategy using Fast Walsh-Hadamard Transform to pre-rotate weight space before quantization, spreading outlier energy across vectors. It includes a mathematically rigorous dequantization procedure with 256-point Inverse Walsh-Hadamard Transform fused into CUDA shared-memory loading.
Result: On NVIDIA RTX 5090 (Blackwell), ITQ3_S achieves perplexity competitive with FP16 baselines while delivering throughput exceeding 1.5× that of 4-bit alternatives, with optimized DP4A and Tensor Core scheduling in interleaved memory layout.
Conclusion: ITQ3_S establishes a practical, mathematically grounded solution for high-fidelity LLM deployment on consumer-grade hardware through innovative 3-bit quantization with rotation-domain adaptation.
Abstract: We present \textbf{ITQ3_S} (Interleaved Ternary Quantization – Specialized), a novel 3-bit weight quantization format for large language models (LLMs) that integrates \textbf{TurboQuant (TQ)}, a rotation-domain adaptive quantization strategy based on the Fast Walsh-Hadamard Transform (FWHT). Conventional 3-bit quantization methods suffer from catastrophic precision loss caused by heavy-tailed weight distributions and inter-channel outliers. ITQ3_S addresses this fundamental limitation by pre-rotating the weight space via FWHT prior to quantization, effectively spreading outlier energy across the entire vector and inducing a near-Gaussian distribution amenable to uniform ternary coding. Critically, we derive a mathematically rigorous dequantization procedure that inverts the FWHT exactly using a 256-point Inverse Walsh-Hadamard Transform fused into the CUDA shared-memory loading stage, ensuring zero-error round-trip fidelity between offline quantization and online inference. We prove that for any weight vector $\mathbf{w} \in \mathbb{R}^{256}$ processed by our pipeline, the reconstruction satisfies $|\hat{\mathbf{w}} - \mathbf{w}|_2 \leq ε_q$, where $ε_q$ is determined solely by the ternary quantization grid and is strictly smaller than any uniform 3-bit baseline under equal bit-budget constraints. Empirically, on the NVIDIA RTX 5090 (Blackwell architecture), ITQ3_S achieves perplexity competitive with FP16 baselines while delivering throughput exceeding 1.5$\times$ that of 4-bit alternatives, owing to optimized DP4A and Tensor Core scheduling in the interleaved memory layout. Our results establish ITQ3_S as a practical, mathematically grounded solution for high-fidelity LLM deployment on consumer-grade hardware.
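The rotate-quantize-rotate-back pipeline can be sketched with an orthonormal FWHT and a naive nearest-level ternary quantizer; scales, interleaving, and the fused CUDA inverse transform are all simplified away, so this is a flavor of the idea, not ITQ3_S itself. On a weight vector with injected outlier channels, quantizing in the rotated domain gives lower reconstruction error than quantizing directly, because the rotation spreads outlier energy into a near-Gaussian distribution.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power
    of two); it is its own inverse, so fwht(fwht(x)) == x."""
    y = np.asarray(x, dtype=float).copy()
    n, h = len(y), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / np.sqrt(n)

def ternary(x):
    """Nearest-level ternary quantizer onto {-s, 0, +s}; the scale s is
    chosen by grid search to minimize reconstruction MSE."""
    best_err, best_q = np.inf, x
    for s in np.linspace(1e-3, np.abs(x).max(), 64):
        q = np.where(np.abs(x) < s / 2, 0.0, s * np.sign(x))
        err = np.sum((x - q) ** 2)
        if err < best_err:
            best_err, best_q = err, q
    return best_q

def quantize_rotated(w):
    """Rotation-domain pipeline: FWHT pre-rotation spreads outlier
    energy, ternary quantization in the rotated domain, exact inverse
    rotation (the FWHT again) on the way back."""
    return fwht(ternary(fwht(w)))
```

Since the FWHT is orthonormal, the round trip through the rotation is exact, and the reconstruction error equals the quantization error in the rotated domain.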
[913] Physics-Guided Transformer (PGT): Physics-Aware Attention Mechanism for PINNs
Ehsan Zeraatkar, Rodion Podorozhny, Jelena Tešić
Main category: cs.LG
TL;DR: PGT embeds physical structure directly into transformer attention via heat-kernel bias, improving PDE solution reconstruction under sparse observations.
Details
Motivation: Existing physics-informed methods use soft penalty terms that cause gradient imbalance, instability, and poor physical consistency with limited data. An architecture that directly embeds physical structure is needed.
Method: Physics-Guided Transformer (PGT) incorporates heat-kernel-derived additive bias into self-attention logits to encode diffusion dynamics and temporal causality. Uses FiLM-modulated sinusoidal implicit network for decoding.
Result: On 1D heat equation with 100 observations: relative L2 error 5.9e-3, outperforming PINNs and sinusoidal representations. On 2D cylinder wake: PDE residual 8.3e-4 and relative error 0.034, outperforming methods optimizing only one objective.
Conclusion: Embedding physics directly within attention mechanism improves stability, generalization, and physical fidelity under data-scarce conditions for PDE reconstruction tasks.
Abstract: Reconstructing continuous physical fields from sparse, irregular observations is a central challenge in scientific machine learning, particularly for systems governed by partial differential equations (PDEs). Existing physics-informed methods typically enforce governing equations as soft penalty terms during optimization, often leading to gradient imbalance, instability, and degraded physical consistency under limited data. We introduce the Physics-Guided Transformer (PGT), a neural architecture that embeds physical structure directly into the self-attention mechanism. Specifically, PGT incorporates a heat-kernel-derived additive bias into attention logits, encoding diffusion dynamics and temporal causality within the representation. Query coordinates attend to these physics-conditioned context tokens, and the resulting features are decoded using a FiLM-modulated sinusoidal implicit network that adaptively controls spectral response. We evaluate PGT on the one-dimensional heat equation and two-dimensional incompressible Navier-Stokes systems. In sparse 1D reconstruction with 100 observations, PGT achieves a relative L2 error of 5.9e-3, significantly outperforming both PINNs and sinusoidal representations. In the 2D cylinder wake problem, PGT uniquely achieves both low PDE residual (8.3e-4) and competitive relative error (0.034), outperforming methods that optimize only one objective. These results demonstrate that embedding physics within attention improves stability, generalization, and physical fidelity under data-scarce conditions.
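A sketch of attention with a heat-kernel additive bias and a causality mask, assuming the 1D kernel G(x, x'; dt) proportional to exp(-|x - x'|^2 / (4 kappa dt)); PGT's exact parameterization may differ.

```python
import numpy as np

def heat_kernel_bias(coords, times, kappa=0.1):
    """Additive attention bias: the log of the 1D heat kernel between
    token locations, with -inf enforcing temporal causality (queries
    cannot attend to strictly later times); a token may attend to itself."""
    dx2 = (coords[:, None] - coords[None, :]) ** 2
    dt = times[:, None] - times[None, :]
    bias = np.where(dt > 0, -dx2 / (4.0 * kappa * np.maximum(dt, 1e-9)), -np.inf)
    np.fill_diagonal(bias, 0.0)
    return bias

def biased_attention(Q, K, V, bias):
    """Scaled dot-product attention with the physics bias added to logits."""
    logits = Q @ K.T / np.sqrt(Q.shape[1]) + bias
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w @ V, w
```

With zero queries and keys the weights are driven purely by the bias: a later query attends more to spatially nearby past tokens (diffusion) and not at all to future ones (causality).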
[914] IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression
Zhongping Ji
Main category: cs.LG
TL;DR: IsoQuant: A blockwise rotation framework for low-bit online vector quantization using quaternion algebra and isoclinic decomposition of SO(4) to reduce computational costs while maintaining reconstruction quality.
Details
Motivation: Existing orthogonal feature decorrelation methods for low-bit online vector quantization suffer from prohibitive O(d²) storage and compute costs with dense random orthogonal transforms. While RotorQuant reduces costs with 3D Clifford rotors, its 3D partition is poorly aligned with modern hardware and offers limited local mixing.
Method: IsoQuant uses blockwise rotation based on quaternion algebra and isoclinic decomposition of SO(4). It represents each 4D block as a quaternion and applies a closed-form transform T(v)=q_L v q̄_R. Two variants: IsoQuant-Full realizes full SO(4) rotation, while IsoQuant-Fast keeps only one isoclinic factor for lower cost. Also includes a lightweight 2D special case.
Result: At d=128, IsoQuant-Full reduces forward rotation cost from ~2,408 FMAs in RotorQuant to 1,024, while IsoQuant-Fast further reduces to 512. Across 18 fused CUDA settings with d∈{128,256,512}, bit widths {2,3,4}, and FP16/FP32 execution, IsoQuant achieves mean kernel-level speedups of ~4.5×-4.7× over RotorQuant while maintaining comparable reconstruction MSE, with peak speedups above 6×.
Conclusion: IsoQuant provides an efficient blockwise rotation framework for low-bit online vector quantization that significantly reduces computational costs while preserving reconstruction quality, though current validation is limited to stage-1 quantize-dequantize path on synthetic normalized vectors with end-to-end KV-cache evaluation remaining as future work.
Abstract: Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive $O(d^2)$ storage and compute. RotorQuant reduces this cost with blockwise $3$D Clifford rotors, yet the resulting $3$D partition is poorly aligned with modern hardware and offers limited local mixing. We propose \textbf{IsoQuant}, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. It represents each $4$D block as a quaternion and applies a closed-form transform $T(v)=q_L v \overline{q_R}$. This yields two main variants: \emph{IsoQuant-Full}, which realizes the full $SO(4)$ rotation, and \emph{IsoQuant-Fast}, which keeps only one isoclinic factor for lower cost; the framework also admits a lightweight $2$D special case. At $d=128$, IsoQuant-Full reduces forward rotation cost from about $2{,}408$ FMAs in RotorQuant to $1{,}024$, while IsoQuant-Fast further reduces it to $512$. Across $18$ fused CUDA settings with $d \in {128,256,512}$, bit widths ${2,3,4}$, and FP16/FP32 execution, IsoQuant achieves mean kernel-level speedups of about $4.5\times$–$4.7\times$ over RotorQuant while maintaining comparable reconstruction MSE, with peak speedups above $6\times$. Current validation is limited to the stage-1 quantize–dequantize path on synthetic normalized vectors; end-to-end KV-cache evaluation remains future work.
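The closed-form block transform T(v) = q_L v q̄_R can be sketched with a Hamilton product; because unit quaternions on both sides give an SO(4) rotation, the transform is an isometry of each 4D block, which is why reconstruction MSE is preserved across the rotation. (Names and the test setup are illustrative, not the paper's CUDA implementation.)

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def iso_rotate(v, q_l, q_r):
    """SO(4) rotation of one 4D block viewed as a quaternion:
    T(v) = q_L v conj(q_R), with unit quaternions q_L, q_R
    (the IsoQuant-Full form; fixing q_R to the identity keeps a
    single isoclinic factor, as in IsoQuant-Fast)."""
    q_r_conj = q_r * np.array([1.0, -1.0, -1.0, -1.0])
    return qmul(qmul(q_l, v), q_r_conj)
```

Norm preservation holds for both variants, since |q_L v q̄_R| = |q_L||v||q_R| = |v| for unit factors.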
[915] Scaling Atomistic Protein Binder Design with Generative Pretraining and Test-Time Compute
Kieran Didi, Zuobai Zhang, Guoqing Zhou, Danny Reidenbach, Zhonglin Cao, Sooyoung Cha, Tomas Geffner, Christian Dallago, Jian Tang, Michael M. Bronstein, Martin Steinegger, Emine Kucukbenli, Arash Vahdat, Karsten Kreis
Main category: cs.LG
TL;DR: Proteina-Complexa is a novel protein binder design method that unifies generative modeling and hallucination approaches, achieving state-of-the-art performance in computational binder design benchmarks.
Details
Motivation: Current protein binder design methods are divided into conditional generative modeling and sequence optimization via structure predictors ("hallucination"). The authors argue this is a false dichotomy and propose a unified approach that combines the strengths of both paradigms.
Method: Extends flow-based latent protein generation architectures and leverages domain-domain interactions from monomeric predicted structures to create Teddymer dataset. Combines synthetic binder-target pairs with experimental multimers for pretraining, then performs inference-time optimization with the generative prior.
Result: Sets new state-of-the-art in computational binder design benchmarks with higher in-silico success rates than generative approaches and outperforms hallucination methods under normalized compute budgets. Also demonstrates interface hydrogen bond optimization, fold class-guided binder generation, and extensions to small molecule targets and enzyme design.
Conclusion: Proteina-Complexa successfully unifies generative and hallucination approaches for protein binder design, achieving superior performance across multiple benchmarks and extending to various applications beyond traditional binder design.
Abstract: Protein interaction modeling is central to protein design, which has been transformed by machine learning with applications in drug discovery and beyond. In this landscape, structure-based de novo binder design is cast as either conditional generative modeling or sequence optimization via structure predictors (“hallucination”). We argue that this is a false dichotomy and propose Proteina-Complexa, a novel fully atomistic binder generation method unifying both paradigms. We extend recent flow-based latent protein generation architectures and leverage the domain-domain interactions of monomeric computationally predicted protein structures to construct Teddymer, a new large-scale dataset of synthetic binder-target pairs for pretraining. Combined with high-quality experimental multimers, this enables training a strong base model. We then perform inference-time optimization with this generative prior, unifying the strengths of previously distinct generative and hallucination methods. Proteina-Complexa sets a new state of the art in computational binder design benchmarks: it delivers markedly higher in-silico success rates than existing generative approaches, and our novel test-time optimization strategies greatly outperform previous hallucination methods under normalized compute budgets. We also demonstrate interface hydrogen bond optimization, fold class-guided binder generation, and extensions to small molecule targets and enzyme design tasks, again surpassing prior methods. Code, models and new data will be publicly released.
[916] Symbolic Density Estimation: A Decompositional Approach
Angelo Rajendram, Xieting Chu, Vijay Ganesh, Max Fieg, Aishik Ghosh
Main category: cs.LG
TL;DR: AI-Kolmogorov is a framework for symbolic density estimation that uses clustering, nonparametric density estimation, support estimation, and symbolic regression to discover interpretable mathematical expressions for probability distributions.
Details
Motivation: Symbolic regression has been effective for interpretable models in standard regression, but its application to density estimation tasks has been largely unexplored. The authors aim to extend symbolic methods to probability density estimation for better interpretability and insight into underlying distributions.
Method: Multi-stage pipeline: 1) Problem decomposition through clustering and/or probabilistic graphical model structure learning; 2) Nonparametric density estimation; 3) Support estimation; 4) Symbolic regression on the density estimate.
Result: Demonstrated efficacy on synthetic mixture models, multivariate normal distributions, and three exotic distributions (two from high-energy physics applications). The framework can discover underlying distributions or provide valuable insight into mathematical expressions describing them.
Conclusion: AI-Kolmogorov successfully extends symbolic regression to density estimation, providing interpretable models for probability distributions and showing promise for applications in scientific domains like high-energy physics.
Abstract: We introduce AI-Kolmogorov, a novel framework for Symbolic Density Estimation (SymDE). Symbolic regression (SR) has been effectively used to produce interpretable models in standard regression settings but its applicability to density estimation tasks has largely been unexplored. To address the SymDE task we introduce a multi-stage pipeline: (i) problem decomposition through clustering and/or probabilistic graphical model structure learning; (ii) nonparametric density estimation; (iii) support estimation; and finally (iv) SR on the density estimate. We demonstrate the efficacy of AI-Kolmogorov on synthetic mixture models, multivariate normal distributions, and three exotic distributions, two of which are motivated by applications in high-energy physics. We show that AI-Kolmogorov can discover underlying distributions or otherwise provide valuable insight into the mathematical expressions describing them.
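The pipeline's stages can be walked through on a toy 1D mixture. The sketch below is not the paper's implementation: stage (iii) support estimation is omitted, a sign threshold stands in for clustering, and a quadratic fit to the log-density stands in for full symbolic regression (recovering the Gaussian form $\log p(x) \propto -(x-\mu)^2$); all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])

# Stage (i): decomposition -- a sign threshold stands in for clustering.
clusters = [x[x < 0], x[x >= 0]]

def kde(samples, grid, h=0.3):
    """Stage (ii): Gaussian-kernel nonparametric density estimate."""
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

grid = np.linspace(-4, 4, 400)
dens = sum((len(c) / len(x)) * kde(c, grid) for c in clusters)

# Stage (iv): SR stand-in -- least-squares fit of the log-density on the
# left cluster to a quadratic, i.e. symbolically recovering a Gaussian.
mask = (grid > -3.2) & (grid < -0.8)
a, b, _ = np.polyfit(grid[mask], np.log(dens[mask]), deg=2)
mu_hat = -b / (2 * a)   # vertex of the fitted parabola ~ component mean
```

A real symbolic-regression stage would search a grammar of expressions rather than fix a quadratic form, but the decomposition-then-fit structure is the same.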
[917] Gradient Manipulation in Distributed Stochastic Gradient Descent with Strategic Agents: Truthful Incentives with Convergence Guarantees
Ziqin Chen, Yongqiang Wang
Main category: cs.LG
TL;DR: A distributed payment mechanism that guarantees truthful behavior and accurate convergence in distributed SGD, overcoming limitations of centralized servers and convergence accuracy trade-offs.
Details
Motivation: Existing distributed learning approaches assume honest agent behavior, but in real-world scenarios, selfish agents may manipulate gradients for personal gain, compromising learning outcomes. Current truthfulness mechanisms rely on centralized servers or sacrifice convergence accuracy.
Method: Proposes a fully distributed payment mechanism for distributed stochastic gradient descent that ensures truthful behaviors and accurate convergence. The approach guarantees finite cumulative gain for strategic agents even as iterations approach infinity.
Result: Theoretical analysis shows convergence rates under general convex and strongly convex conditions, with proof that strategic behavior gains remain finite. Experimental results on standard ML tasks with benchmark datasets confirm effectiveness.
Conclusion: The proposed distributed payment mechanism represents a significant advancement by overcoming limitations of existing truthfulness mechanisms while maintaining both truthful behavior and accurate convergence in distributed learning.
Abstract: Distributed learning has gained significant attention due to its advantages in scalability, privacy, and fault tolerance. In this paradigm, multiple agents collaboratively train a global model by exchanging parameters only with their neighbors. However, a key vulnerability of existing distributed learning approaches is their implicit assumption that all agents behave honestly during gradient updates. In real-world scenarios, this assumption often breaks down, as selfish or strategic agents may be incentivized to manipulate gradients for personal gain, ultimately compromising the final learning outcome. In this work, we propose a fully distributed payment mechanism that, for the first time, guarantees both truthful behaviors and accurate convergence in distributed stochastic gradient descent. This represents a significant advancement, as it overcomes two major limitations of existing truthfulness mechanisms for collaborative learning: (1) reliance on a centralized server for payment collection, and (2) sacrificing convergence accuracy to guarantee truthfulness. In addition to characterizing the convergence rate under general convex and strongly convex conditions, we also prove that our approach guarantees that the cumulative gain an agent can obtain through strategic behavior remains finite, even as the number of iterations approaches infinity, a property unattainable by most existing truthfulness mechanisms. Our experimental results on standard machine learning tasks, evaluated on benchmark datasets, confirm the effectiveness of the proposed approach.
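The "exchange parameters only with neighbors" paradigm can be sketched as gossip-style decentralized SGD on a shared quadratic. This sketch shows only the baseline dynamics the paper builds on; the payment mechanism for truthfulness, which is the paper's actual contribution, is not modeled, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, dim, lr = 5, 3, 0.1
w_star = np.ones(dim)                     # shared optimum of f(w) = 0.5||w - w*||^2
W = rng.normal(size=(n_agents, dim))      # each row: one agent's parameters

for _ in range(200):
    # Each agent's noisy local stochastic gradient of the quadratic.
    grads = (W - w_star) + 0.01 * rng.normal(size=W.shape)
    # Gossip step: mix parameters with ring neighbors only
    # (a doubly stochastic averaging matrix), then descend.
    mixed = (np.roll(W, 1, axis=0) + W + np.roll(W, -1, axis=0)) / 3.0
    W = mixed - lr * grads

# All agents converge to a small neighborhood of the shared optimum.
final_err = np.linalg.norm(W - w_star)
```

A strategic agent in this setting would report a manipulated `grads` row; the paper's payment mechanism is designed to make such deviations unprofitable while preserving this convergence.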
[918] Principal Prototype Analysis on Manifold for Interpretable Reinforcement Learning
Bodla Krishna Vamshi, Haizhao Yang
Main category: cs.LG
TL;DR: A method for automatically selecting optimal prototypes in Prototype-Wrapper Networks to enhance explainability in reinforcement learning without sacrificing performance.
Details
Motivation: As RL models grow more complex, interpretability becomes challenging. Existing explainability methods for RL often require manually defined prototypes that need expert domain knowledge, creating a barrier to adoption.
Method: Proposes an automatic prototype selection method that removes dependency on manually defined reference prototypes. The approach automatically selects optimal prototypes from available data while maintaining the explainability benefits of PW-Nets.
Result: Preliminary experiments on standard Gym environments show the approach matches performance of existing PW-Nets while remaining competitive with original black-box models.
Conclusion: The method successfully automates prototype selection for explainable RL, reducing dependency on expert knowledge while maintaining both interpretability and performance.
Abstract: Recent years have witnessed the widespread adoption of reinforcement learning (RL), from solving real-time games to fine-tuning large language models using human preference data, significantly improving alignment with user expectations. However, as model complexity grows exponentially, the interpretability of these systems becomes increasingly challenging. While numerous explainability methods have been developed for computer vision and natural language processing to elucidate both local and global reasoning patterns, their application to RL remains limited. Direct extensions of these methods often struggle to maintain the delicate balance between interpretability and performance within RL settings. Prototype-Wrapper Networks (PW-Nets) have recently shown promise in bridging this gap by enhancing explainability in RL domains without sacrificing the efficiency of the original black-box models. However, these methods typically require manually defined reference prototypes, which often necessitate expert domain knowledge. In this work, we propose a method that removes this dependency by automatically selecting optimal prototypes from the available data. Preliminary experiments on standard Gym environments demonstrate that our approach matches the performance of existing PW-Nets, while remaining competitive with the original black-box models.
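One simple way to select prototypes from data without expert input is to take per-class medoids of the latent embeddings. The sketch below is an illustrative stand-in for this idea, not the paper's selection criterion; the data, dimensions, and medoid rule are all our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy latent embeddings for two classes centered at -3 and +3.
latents = np.concatenate([rng.normal(-3, 1, (50, 4)),
                          rng.normal(3, 1, (50, 4))])
labels = np.array([0] * 50 + [1] * 50)

def medoid_prototypes(z, y):
    """Pick, per class, the sample minimizing total distance to its
    classmates (the medoid) as an automatically chosen prototype."""
    protos = []
    for c in np.unique(y):
        zc = z[y == c]
        d = np.linalg.norm(zc[:, None] - zc[None, :], axis=-1).sum(axis=1)
        protos.append(zc[np.argmin(d)])
    return np.stack(protos)

P = medoid_prototypes(latents, labels)
```

Because each prototype is an actual data point, explanations stay grounded in real states, which is the property PW-Nets rely on.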
[919] From Independent to Correlated Diffusion: Generalized Generative Modeling with Probabilistic Computers
Nihal Sanjay Singh, Mazdak Mohseni-Rajaee, Shaila Niazi, Kerem Y. Camsari
Main category: cs.LG
TL;DR: Generalizes diffusion models by replacing independent noise injection with MCMC dynamics incorporating known interaction structure, enabling exploitation of spatial correlations and mapping naturally onto probabilistic computers for efficient sampling.
Details
Motivation: Current diffusion models place most computation in neural networks, but the diffusion framework allows broader choices for stochastic transition kernels. The authors aim to incorporate known interaction structure into the stochastic sampling component to exploit spatial correlations representative of target systems.
Method: Replace independent noise injection in diffusion models with Markov chain Monte Carlo (MCMC) dynamics that incorporate known interaction structure (Ising couplings). This creates correlated diffusion where noising and denoising processes exploit spatial correlations. The approach maps naturally onto probabilistic computers built from probabilistic bits for efficient sampling.
Result: Demonstrated on equilibrium states of 2D ferromagnetic Ising model and 3D Edwards-Anderson spin glass. Correlated diffusion produces samples in closer agreement with MCMC reference distributions than independent diffusion. Shows p-computers can enable new classes of diffusion algorithms.
Conclusion: The framework generalizes diffusion models by incorporating structured probabilistic sampling through MCMC dynamics, enabling exploitation of spatial correlations and efficient implementation on probabilistic computers for improved generative modeling.
Abstract: Diffusion models have emerged as a powerful framework for generative tasks in deep learning. They decompose generative modeling into two computational primitives: deterministic neural-network evaluation and stochastic sampling. Current implementations usually place most computation in the neural network, but diffusion as a framework allows a broader range of choices for the stochastic transition kernel. Here, we generalize the stochastic sampling component by replacing independent noise injection with Markov chain Monte Carlo (MCMC) dynamics that incorporate known interaction structure. Standard independent diffusion is recovered as a special case when couplings are set to zero. By explicitly incorporating Ising couplings into the diffusion dynamics, the noising and denoising processes exploit spatial correlations representative of the target system. The resulting framework maps naturally onto probabilistic computers (p-computers) built from probabilistic bits (p-bits), which provide orders-of-magnitude advantages in sampling throughput and energy efficiency over GPUs. We demonstrate the approach on equilibrium states of the 2D ferromagnetic Ising model and the 3D Edwards-Anderson spin glass, showing that correlated diffusion produces samples in closer agreement with MCMC reference distributions than independent diffusion. More broadly, the framework shows that p-computers can enable new classes of diffusion algorithms that exploit structured probabilistic sampling for generative modeling.
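The correlated noising kernel can be sketched as Gibbs sampling on a 2D Ising lattice. This is a minimal illustration of the idea from the abstract, not the paper's p-computer implementation; lattice size, coupling, and temperature are our choices. With `J = 0` the update reduces to independent coin flips, recovering standard diffusion noise as the abstract states.

```python
import numpy as np

def gibbs_noise_step(s, J, beta, rng):
    """One MCMC noising sweep: single-site Gibbs updates on a 2D Ising
    lattice with periodic boundaries. J = 0 gives independent flips."""
    n = s.shape[0]
    for i in range(n):
        for j in range(n):
            # Local field from the four nearest neighbours.
            h = J * (s[(i + 1) % n, j] + s[(i - 1) % n, j]
                     + s[i, (j + 1) % n] + s[i, (j - 1) % n])
            p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * h))
            s[i, j] = 1 if rng.random() < p_up else -1
    return s

rng = np.random.default_rng(0)
s = np.where(rng.random((16, 16)) < 0.5, 1, -1)   # random initial spins
for _ in range(50):
    s = gibbs_noise_step(s, J=1.0, beta=0.6, rng=rng)

# Ferromagnetic coupling below the critical temperature induces the
# positive nearest-neighbour correlations that independent noise lacks.
nn_corr = float(np.mean(s * np.roll(s, 1, axis=0)))
```

Each p-bit in a probabilistic computer natively performs exactly this kind of stochastic local update, which is why the framework maps onto that hardware.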
[920] FedDES: Graph-Based Dynamic Ensemble Selection for Personalized Federated Learning
Brianna Mueller, W. Nick Street
Main category: cs.LG
TL;DR: FedDES is a personalized federated learning framework that uses dynamic ensemble selection at the instance level via GNN meta-learner to address statistical heterogeneity and negative transfer in FL.
Details
Motivation: Statistical heterogeneity in federated learning causes negative transfer where a single global model fails for diverse client distributions. Existing personalized FL approaches integrate peer contributions uniformly, ignoring that not all peers are equally beneficial, and lack instance-level personalization despite varying reliability of peer models across individual samples.
Method: FedDES uses a decentralized pFL framework with dynamic ensemble selection. A Graph Neural Network meta-learner is trained on a heterogeneous graph modeling interactions between data samples and candidate classifiers. For each test query, the GNN dynamically selects and weights peer client models to form an ensemble of the most competent classifiers while suppressing irrelevant or harmful contributions.
Result: Experiments on CIFAR-10 and real-world ICU healthcare data show FedDES outperforms state-of-the-art pFL baselines in non-IID settings, offering robust protection against negative transfer.
Conclusion: FedDES successfully addresses statistical heterogeneity in FL through instance-level personalization via dynamic ensemble selection, demonstrating superior performance over existing pFL methods in non-IID scenarios.
Abstract: Statistical heterogeneity in Federated Learning (FL) often leads to negative transfer, where a single global model fails to serve diverse client distributions. Personalized federated learning (pFL) aims to address this by tailoring models to individual clients. However, under most existing pFL approaches, clients integrate peer client contributions uniformly, which ignores the reality that not all peers are likely to be equally beneficial. Additionally, the potential for personalization at the instance level remains largely unexplored, even though the reliability of different peer models often varies across individual samples within the same client. We introduce FedDES (Federated Dynamic Ensemble Selection), a decentralized pFL framework that achieves instance-level personalization through dynamic ensemble selection. Central to our approach is a Graph Neural Network (GNN) meta-learner trained on a heterogeneous graph modeling interactions between data samples and candidate classifiers. For each test query, the GNN dynamically selects and weights peer client models, forming an ensemble of the most competent classifiers while effectively suppressing contributions from those that are irrelevant or potentially harmful for performance. Experiments on CIFAR-10 and real-world ICU healthcare data demonstrate that FedDES outperforms state-of-the-art pFL baselines in non-IID settings, offering robust protection against negative transfer.
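For intuition, dynamic ensemble selection can be illustrated with a classical k-nearest-neighbour competence rule: weight each peer model by its accuracy on the query's nearest validation neighbours. This is a textbook DES baseline, not the paper's GNN meta-learner, and the toy "peer models" below are our own constructions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_val = rng.normal(size=(200, 2))
y_val = (X_val[:, 0] > 0).astype(int)

# Two toy "peer client models": one competent, one that ignores the signal.
peers = [lambda X: (X[:, 0] > 0).astype(int),
         lambda X: np.zeros(len(X), dtype=int)]

def des_predict(x, k=10):
    """Per-query ensemble: weight peers by local validation accuracy,
    suppressing peers that are incompetent near this query."""
    idx = np.argsort(np.linalg.norm(X_val - x, axis=1))[:k]
    preds = [p(X_val[idx]) for p in peers]
    comp = np.array([np.mean(p == y_val[idx]) for p in preds])
    w = comp / comp.sum()
    votes = np.array([p(x[None])[0] for p in peers])
    return int(round(float(w @ votes)))

q = np.array([1.5, 0.0])
pred = des_predict(q)
```

FedDES replaces the hand-crafted competence rule with a learned GNN over a sample-classifier graph, but the per-query select-and-weight structure is the same.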
[921] Diffusion Maps is not Dimensionality Reduction
Julio Candanedo, Alejandro Patiño
Main category: cs.LG
TL;DR: Diffusion maps provide spectral geometry representation but not optimal charting; comparison shows Isomap best recovers low-dimensional coordinates, UMAP offers tradeoff, DMAP requires multiple modes.
Details
Motivation: To clarify that diffusion maps (DMAP) provide spectral representations of intrinsic geometry rather than complete charting methods, and to compare their effectiveness against other dimensionality reduction techniques like Isomap and UMAP for recovering known isometric coordinates.
Method: Study a Swiss roll dataset with known isometric coordinates, compare DMAP, Isomap, and UMAP across latent dimensions, fit an oracle affine readout to ground-truth chart, and measure reconstruction error.
Result: Isomap most efficiently recovers the low-dimensional chart, UMAP provides an intermediate tradeoff, and DMAP becomes accurate only after combining multiple diffusion modes. The correct chart lies in the span of diffusion coordinates, but standard DMAP does not identify the appropriate combination.
Conclusion: Diffusion maps offer spectral geometry representation but are not optimal for charting; other methods like Isomap are more efficient for recovering low-dimensional coordinates, highlighting the distinction between geometry representation and charting.
Abstract: Diffusion maps (DMAP) are often used as a dimensionality-reduction tool, but more precisely they provide a spectral representation of the intrinsic geometry rather than a complete charting method. To illustrate this distinction, we study a Swiss roll with known isometric coordinates and compare DMAP, Isomap, and UMAP across latent dimensions. For each representation, we fit an oracle affine readout to the ground-truth chart and measure reconstruction error. Isomap most efficiently recovers the low-dimensional chart, UMAP provides an intermediate tradeoff, and DMAP becomes accurate only after combining multiple diffusion modes. Thus the correct chart lies in the span of diffusion coordinates, but standard DMAP does not by itself identify the appropriate combination.
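The oracle-affine-readout protocol itself is easy to demonstrate. In the sketch below, the "embedding" is a synthetic affine image of the true Swiss roll chart, standing in for an idealized manifold learner's output; the point is the evaluation mechanics (least-squares readout, relative error), not a real embedding method.

```python
import numpy as np

rng = np.random.default_rng(0)
t = 1.5 * np.pi * (1 + 2 * rng.random(1000))            # roll parameter
h = 20 * rng.random(1000)                               # height
X = np.column_stack([t * np.cos(t), h, t * np.sin(t)])  # ambient 3D points
# (X is shown for context; a real method would embed these points.)

# Known isometric chart: arc length along the spiral r = t, plus height.
arc = 0.5 * (t * np.sqrt(1 + t**2) + np.arcsinh(t))
C = np.column_stack([arc, h])

# Stand-in 2D embedding: the chart up to an unknown affine transform.
Z = C @ np.array([[0.7, -0.4], [0.3, 1.2]]) + np.array([5.0, -2.0])

# Oracle affine readout: least-squares map from embedding to chart,
# scored by relative reconstruction error.
Z1 = np.column_stack([Z, np.ones(len(Z))])
coef, *_ = np.linalg.lstsq(Z1, C, rcond=None)
rel_err = float(np.linalg.norm(Z1 @ coef - C) / np.linalg.norm(C))
```

For a perfect-up-to-affine embedding the readout error is machine precision; the paper's comparison is how quickly this error falls for DMAP, Isomap, and UMAP as latent dimension grows.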
[922] Bit-Identical Medical Deep Learning via Structured Orthogonal Initialization
Yakov Pyotr Shkolnikov
Main category: cs.LG
TL;DR: Framework for verified bit-identical deep learning training that eliminates randomness from weight initialization, batch ordering, and GPU operations, achieving identical trained weights across independent runs with improved stability on rare classes.
Details
Motivation: Deep learning training is non-deterministic, causing models with identical code but different random seeds to disagree on individual predictions, especially problematic for rare clinical classes where per-class AUC can swing over 20 percentage points.
Method: Three-pronged approach: 1) Structured orthogonal basis functions for deterministic weight initialization, 2) Golden ratio scheduling for deterministic batch ordering, 3) Architecture selection and custom autograd to eliminate non-deterministic GPU operations, producing MD5-verified identical trained weights.
Result: On PTB-XL ECG rhythm classification, structured initialization significantly outperforms Kaiming initialization, reducing aggregate variance by 2-3x and per-class variability on rare rhythms by up to 7.5x. All structured orthogonal bases produce equivalent performance, confirming the benefit comes from deterministic structured initialization itself.
Conclusion: Verified bit-identical training eliminates training randomness while maintaining or improving performance, especially benefiting rare classes in clinical applications, with demonstrated generalization across medical imaging domains and external ECG datasets.
Abstract: Deep learning training is non-deterministic: identical code with different random seeds produces models that agree on aggregate metrics but disagree on individual predictions, with per-class AUC swings exceeding 20 percentage points on rare clinical classes. We present a framework for verified bit-identical training that eliminates three sources of randomness: weight initialization (via structured orthogonal basis functions), batch ordering (via golden ratio scheduling), and non-deterministic GPU operations (via architecture selection and custom autograd). The pipeline produces MD5-verified identical trained weights across independent runs. On PTB-XL ECG rhythm classification, structured initialization significantly exceeds Kaiming across two architectures (n=20; Conformer p = 0.016, Baseline p < 0.001), reducing aggregate variance by 2-3x and reducing per-class variability on rare rhythms by up to 7.5x (TRIGU range: 4.1pp vs 30.9pp under Kaiming, independently confirmed by 3-fold CV). A four-basis comparison at n=20 shows all structured orthogonal bases produce equivalent performance (Friedman p=0.48), establishing that the contribution is deterministic structured initialization itself, not any particular basis function. Cross-domain validation on seven MedMNIST benchmarks (n=20, all p > 0.14) confirms no performance penalty on standard tasks; per-class analysis on imbalanced tasks (ChestMNIST, RetinaMNIST) shows the same variance reduction on rare classes observed in ECG. Cross-dataset evaluation on three external ECG databases confirms zero-shot generalization (>0.93 AFIB AUC).
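Two of the deterministic ingredients can be sketched directly: a golden-ratio (Weyl sequence) batch order and a structured orthogonal initialization from a DCT basis. Both are RNG-free, so repeated runs are bit-identical. The function names, the exact scheduling formula, and the choice of DCT-II as the basis are our assumptions; the paper compares several structured orthogonal bases.

```python
import math
import numpy as np

PHI = (math.sqrt(5) - 1) / 2   # fractional part of the golden ratio

def golden_order(n_batches, epoch):
    """Deterministic low-discrepancy batch permutation for one epoch:
    sort batch indices by their golden-ratio Weyl-sequence key."""
    keys = [((i + epoch * n_batches) * PHI) % 1.0 for i in range(n_batches)]
    return sorted(range(n_batches), key=lambda i: keys[i])

def structured_orthogonal(n):
    """Deterministic orthonormal weight init from the DCT-II basis
    (no random number generator anywhere)."""
    k = np.arange(n)
    B = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    B[0] *= 1 / math.sqrt(2)
    return B * math.sqrt(2 / n)

order = golden_order(8, epoch=3)
W = structured_orthogonal(16)
```

Together with deterministic GPU kernels, such components make the trained weights a pure function of the code, which is what allows MD5 verification across runs.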
[923] Reducing Oracle Feedback with Vision-Language Embeddings for Preference-Based RL
Udita Ghosh, Dripta S. Raychaudhuri, Jiachen Li, Konstantinos Karydis, Amit Roy-Chowdhury
Main category: cs.LG
TL;DR: ROVED: Hybrid framework combining vision-language embeddings with targeted oracle feedback for preference-based RL, reducing oracle queries by 80% while maintaining performance.
Details
Motivation: Preference-based RL relies on expensive oracle feedback, while lightweight vision-language embeddings are cheap but noisy. Need to combine scalability of embeddings with accuracy of oracles.
Method: Hybrid framework using VLE to generate segment-level preferences, deferring to oracle only for high-uncertainty samples via filtering mechanism. Includes parameter-efficient fine-tuning to adapt VLE with oracle feedback over time.
Result: Matches or surpasses prior preference-based methods across robotic manipulation tasks while reducing oracle queries by up to 80%. Adapted VLE generalizes across tasks with cumulative annotation savings up to 90%.
Conclusion: Combining scalable embeddings with precise oracle supervision is practical for preference-based RL, achieving efficiency and accuracy through targeted feedback and adaptive fine-tuning.
Abstract: Preference-based reinforcement learning can learn effective reward functions from comparisons, but its scalability is constrained by the high cost of oracle feedback. Lightweight vision-language embedding (VLE) models provide a cheaper alternative, but their noisy outputs limit their effectiveness as standalone reward generators. To address this challenge, we propose ROVED, a hybrid framework that combines VLE-based supervision with targeted oracle feedback. Our method uses the VLE to generate segment-level preferences and defers to an oracle only for samples with high uncertainty, identified through a filtering mechanism. In addition, we introduce a parameter-efficient fine-tuning method that adapts the VLE with the obtained oracle feedback in order to improve the model over time in a synergistic fashion. This ensures the retention of the scalability of embeddings and the accuracy of oracles, while avoiding their inefficiencies. Across multiple robotic manipulation tasks, ROVED matches or surpasses prior preference-based methods while reducing oracle queries by up to 80%. Remarkably, the adapted VLE generalizes across tasks, yielding cumulative annotation savings of up to 90%, highlighting the practicality of combining scalable embeddings with precise oracle supervision for preference-based RL.
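The defer-on-uncertainty idea can be sketched with a stand-in embedding scorer: label each segment pair from embedding scores, and only query the oracle when the score margin is small. The scorer, the margin threshold, and the dimensions below are illustrative assumptions, not the paper's components.

```python
import numpy as np

rng = np.random.default_rng(0)
goal = rng.normal(size=32)
goal /= np.linalg.norm(goal)     # stand-in task embedding

def vle_score(seg):
    """Task-alignment score of a (unit-norm) segment embedding."""
    return float(seg @ goal)

oracle_calls = [0]
def oracle(a, b):
    """Expensive ground-truth comparator (perfect stand-in here)."""
    oracle_calls[0] += 1
    return 0 if vle_score(a) > vle_score(b) else 1

def preference(a, b, margin=0.1):
    sa, sb = vle_score(a), vle_score(b)
    if abs(sa - sb) < margin:    # high uncertainty: defer to the oracle
        return oracle(a, b)
    return 0 if sa > sb else 1   # cheap VLE label otherwise

def unit(v):
    return v / np.linalg.norm(v)

pairs = [(unit(rng.normal(size=32)), unit(rng.normal(size=32)))
         for _ in range(200)]
labels = [preference(a, b) for a, b in pairs]
```

ROVED additionally fine-tunes the embedding model on the oracle answers it collects, so the uncertain region, and hence the oracle query rate, shrinks over time.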
[924] From Vessel Trajectories to Safety-Critical Encounter Scenarios: A Generative AI Framework for Autonomous Ship Digital Testing
Sijin Sun, Liangbin Zhao, Ming Deng, Xiuju Fu
Main category: cs.LG
TL;DR: A data-driven framework for generating realistic safety-critical maritime encounter scenarios from AIS trajectory data using generative modeling and automated pairing.
Details
Motivation: Limited availability of realistic and diverse safety-critical encounter scenarios for testing autonomous maritime navigation systems; existing approaches lack realism or cannot systematically expand rare high-risk situations.
Method: Combines generative trajectory modeling with automated encounter pairing and temporal parameterization; introduces multi-scale temporal variational autoencoder to capture vessel motion dynamics across different temporal resolutions under noisy AIS observations.
Result: Improves trajectory fidelity and smoothness, maintains statistical consistency with observed data, and enables generation of diverse safety-critical encounter scenarios beyond those directly recorded.
Conclusion: Provides a practical pathway for building scenario libraries to support digital testing, benchmarking, and safety assessment of autonomous navigation and intelligent maritime traffic management systems.
Abstract: Digital testing has emerged as a key paradigm for the development and verification of autonomous maritime navigation systems, yet the availability of realistic and diverse safety-critical encounter scenarios remains limited. Existing approaches either rely on handcrafted templates, which lack realism, or extract cases directly from historical data, which cannot systematically expand rare high-risk situations. This paper proposes a data-driven framework that converts large-scale Automatic Identification System (AIS) trajectories into structured safety-critical encounter scenarios. The framework combines generative trajectory modeling with automated encounter pairing and temporal parameterization to enable scalable scenario construction while preserving real traffic characteristics. To enhance trajectory realism and robustness under noisy AIS observations, a multi-scale temporal variational autoencoder is introduced to capture vessel motion dynamics across different temporal resolutions. Experiments on real-world maritime traffic flows demonstrate that the proposed method improves trajectory fidelity and smoothness, maintains statistical consistency with observed data, and enables the generation of diverse safety-critical encounter scenarios beyond those directly recorded. The resulting framework provides a practical pathway for building scenario libraries to support digital testing, benchmarking, and safety assessment of autonomous navigation and intelligent maritime traffic management systems. Code is available at https://anonymous.4open.science/r/traj-gen-anonymous-review.
[925] SIMR-NO: A Spectrally-Informed Multi-Resolution Neural Operator for Turbulent Flow Super-Resolution
Muhammad Abid, Omer San
Main category: cs.LG
TL;DR: SIMR-NO is a neural operator framework for reconstructing high-resolution turbulent flow fields from severely under-resolved observations using spectrally-informed multi-resolution decomposition with Fourier residual corrections.
Details
Motivation: Existing methods fail to recover fine-scale turbulent structures from coarse observations; convolutional approaches lack spectral/multiscale inductive biases needed for physically faithful reconstruction at large upscaling factors.
Method: Hierarchical operator learning framework that factorizes inverse mapping across resolutions, combines deterministic interpolation priors with spectrally gated Fourier residual corrections, and adds local refinement modules for fine-scale features.
Result: Achieves 26.04% mean relative ℓ₂ error on 128×128 vorticity fields from 8×8 observations (16× downsampling), reducing error by 31.7% over FNO, 26.0% over EDSR, and 9.3% over LapSRN. Faithfully reproduces energy and enstrophy spectra.
Conclusion: SIMR-NO enables physically consistent super-resolution of turbulent flows by incorporating spectral and multiscale inductive biases, outperforming existing methods in both accuracy and physical fidelity.
Abstract: Reconstructing high-resolution turbulent flow fields from severely under-resolved observations is a fundamental inverse problem in computational fluid dynamics and scientific machine learning. Classical interpolation methods fail to recover missing fine-scale structures, while existing deep learning approaches rely on convolutional architectures that lack the spectral and multiscale inductive biases necessary for physically faithful reconstruction at large upscaling factors. We introduce the Spectrally-Informed Multi-Resolution Neural Operator (SIMR-NO), a hierarchical operator learning framework that factorizes the ill-posed inverse mapping across intermediate spatial resolutions, combines deterministic interpolation priors with spectrally gated Fourier residual corrections at each stage, and incorporates local refinement modules to recover fine-scale spatial features beyond the truncated Fourier basis. The proposed method is evaluated on Kolmogorov-forced two-dimensional turbulence, where $128\times128$ vorticity fields are reconstructed from extremely coarse $8\times8$ observations representing a $16\times$ downsampling factor. Across 201 independent test realizations, SIMR-NO achieves a mean relative $\ell_2$ error of $26.04\%$ with the lowest error variance among all methods, reducing reconstruction error by $31.7\%$ over FNO, $26.0\%$ over EDSR, and $9.3\%$ over LapSRN. Beyond pointwise accuracy, SIMR-NO is the only method that faithfully reproduces the ground-truth energy and enstrophy spectra across the full resolved wavenumber range, demonstrating physically consistent super-resolution of turbulent flow fields.
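The deterministic spectral prior underlying such pipelines can be sketched as Fourier zero-padding upsampling of a periodic field. This stands in for the interpolation prior only; the learned multi-resolution stages and residual corrections that SIMR-NO adds on top are not modeled here, and the band-limited test field is our own construction.

```python
import numpy as np

def spectral_upsample(u, factor):
    """Upsample a periodic 2D field by zero-padding its (shifted)
    Fourier spectrum -- exact for band-limited fields."""
    n = u.shape[0]
    m = n * factor
    U = np.fft.fftshift(np.fft.fft2(u))
    P = np.zeros((m, m), dtype=complex)
    lo = (m - n) // 2
    P[lo:lo + n, lo:lo + n] = U
    # factor**2 compensates for the larger inverse-FFT normalization.
    return np.real(np.fft.ifft2(np.fft.ifftshift(P))) * factor**2

# A smooth band-limited field is recovered exactly from its coarse grid.
n, f = 8, 16                      # 8x8 -> 128x128, as in the paper's setup
x = np.arange(n) * 2 * np.pi / n
X, Y = np.meshgrid(x, x, indexing="ij")
coarse = np.sin(X) + np.cos(2 * Y)
fine = spectral_upsample(coarse, f)

xf = np.arange(n * f) * 2 * np.pi / (n * f)
XF, YF = np.meshgrid(xf, xf, indexing="ij")
exact = np.sin(XF) + np.cos(2 * YF)
max_err = float(np.max(np.abs(fine - exact)))
```

Real turbulence is not band-limited at $8\times8$, which is exactly why the paper layers learned spectral residual corrections and local refinement on top of such a prior.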
[926] Koopman-based surrogate modeling for reinforcement-learning-control of Rayleigh-Benard convection
Tim Plotzki, Sebastian Peitz
Main category: cs.LG
TL;DR: Using Linear Recurrent Autoencoder Networks as surrogate models to accelerate RL training for fluid dynamics control, with policy-aware training mitigating distribution shift issues.
Details
Motivation: Direct numerical simulations for RL training in fluid dynamics are computationally expensive, and surrogate models face distribution shift problems when policies induce unseen state distributions.
Method: Use Linear Recurrent Autoencoder Networks as surrogate models for 2D Rayleigh-Bénard convection control. Compare two strategies: surrogate trained on random-action data vs policy-aware surrogate trained iteratively with evolving policy data.
Result: Surrogate-only training reduces control performance, but combining surrogates with DNS in pretraining recovers state-of-the-art performance while reducing training time by >40%. Policy-aware training mitigates distribution shift.
Conclusion: Policy-aware surrogate training enables accurate predictions in policy-relevant state regions, making surrogate models viable for RL training in computationally expensive domains like fluid dynamics.
Abstract: Training reinforcement learning (RL) agents to control fluid dynamics systems is computationally expensive due to the high cost of direct numerical simulations (DNS) of the governing equations. Surrogate models offer a promising alternative by approximating the dynamics at a fraction of the computational cost, but their feasibility as training environments for RL is limited by distribution shifts, as policies induce state distributions not covered by the surrogate training data. In this work, we investigate the use of Linear Recurrent Autoencoder Networks (LRANs) for accelerating RL-based control of 2D Rayleigh-Bénard convection. We evaluate two training strategies: a surrogate trained on precomputed data generated with random actions, and a policy-aware surrogate trained iteratively using data collected from an evolving policy. Our results show that while surrogate-only training leads to reduced control performance, combining surrogates with DNS in a pretraining scheme recovers state-of-the-art performance while reducing training time by more than 40%. We demonstrate that policy-aware training mitigates the effects of distribution shift, enabling more accurate predictions in policy-relevant regions of the state space.
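The core of an LRAN is a linear operator advancing a learned latent state. In the degenerate case where encoder and decoder are identity maps, fitting that operator reduces to a DMD-style least-squares regression on snapshot pairs, which the sketch below demonstrates; in the paper both maps are learned networks and the dynamics are Rayleigh-Bénard fields, not this toy linear system.

```python
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.9, -0.2],
                   [0.1, 0.95]])          # stable 2D latent dynamics

# Generate a trajectory of the "true" system.
x = rng.normal(size=2)
traj = [x]
for _ in range(100):
    traj.append(A_true @ traj[-1])

# Snapshot matrices: Z holds states, Zp the one-step-advanced states.
Z = np.stack(traj[:-1]).T
Zp = np.stack(traj[1:]).T

# Least-squares fit of the linear latent operator: Zp ~ A_fit @ Z.
A_fit = Zp @ np.linalg.pinv(Z)
```

Once such a surrogate is cheap to roll out, an RL policy can be trained against it instead of the DNS; the paper's policy-aware retraining then keeps `A_fit` accurate on the states the evolving policy actually visits.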
[927] InkDrop: Invisible Backdoor Attacks Against Dataset Condensation
He Yang, Dongyi Lv, Song Ma, Wei Xi, Zhi Wang, Hanlin Gu, Yajie Wang
Main category: cs.LG
TL;DR: InkDrop is a stealthy backdoor attack method for dataset condensation that leverages decision boundary uncertainty to embed imperceptible malicious patterns while maintaining attack effectiveness and model utility.
Details
Motivation: Existing backdoor attacks on dataset condensation prioritize attack effectiveness and model utility but overlook stealthiness, making them easily detectable. There's a need for more imperceptible attacks that can evade detection while maintaining performance.
Method: InkDrop exploits uncertainty near model decision boundaries where minor perturbations cause semantic shifts. It selects candidate samples near target decision boundaries with latent semantic affinity to the target class, then learns instance-dependent perturbations constrained by perceptual and spatial consistency to embed malicious behavior into condensed datasets.
Result: Extensive experiments across diverse datasets show InkDrop effectively integrates adversarial intent into condensed datasets while preserving model utility and minimizing detectability, outperforming existing methods in stealthiness.
Conclusion: InkDrop bridges the stealthiness gap in dataset condensation backdoor attacks by leveraging decision boundary uncertainty, demonstrating that imperceptible malicious manipulation can be achieved without compromising attack effectiveness or model utility.
Abstract: Dataset Condensation (DC) is a data-efficient learning paradigm that synthesizes small yet informative datasets, enabling models to match the performance of full-data training. However, recent work exposes a critical vulnerability of DC to backdoor attacks, where malicious patterns (\textit{e.g.}, triggers) are implanted into the condensation dataset, inducing targeted misclassification on specific inputs. Existing attacks always prioritize attack effectiveness and model utility, overlooking the crucial dimension of stealthiness. To bridge this gap, we propose InkDrop, which enhances the imperceptibility of malicious manipulation without degrading attack effectiveness and model utility. InkDrop leverages the inherent uncertainty near model decision boundaries, where minor input perturbations can induce semantic shifts, to construct a stealthy and effective backdoor attack. Specifically, InkDrop first selects candidate samples near the target decision boundary that exhibit latent semantic affinity to the target class. It then learns instance-dependent perturbations constrained by perceptual and spatial consistency, embedding targeted malicious behavior into the condensed dataset. Extensive experiments across diverse datasets validate the overall effectiveness of InkDrop, demonstrating its ability to integrate adversarial intent into condensed datasets while preserving model utility and minimizing detectability. Our code is available at https://github.com/lvdongyi/InkDrop.
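The boundary-proximal candidate selection can be illustrated with a logit-margin heuristic. This is a hedged sketch of the selection idea only; `select_boundary_candidates` and its margin criterion are illustrative assumptions, not InkDrop's actual implementation.

```python
import numpy as np

def select_boundary_candidates(logits, labels, target_class, k):
    """Return indices of the k non-target samples whose logit margin to the
    target class is smallest, i.e. nearest the target decision boundary,
    where small perturbations are most likely to induce semantic shifts."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # margin = (top logit) - (target-class logit); 0 if already predicted target
    margin = logits.max(axis=1) - logits[:, target_class]
    candidates = np.where(labels != target_class)[0]  # exclude the target class
    order = candidates[np.argsort(margin[candidates])]
    return order[:k]
```

Samples with the smallest margin sit closest to the target class's decision boundary, matching the paper's intuition that these exhibit "latent semantic affinity" to the target class.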
[928] Heddle: A Distributed Orchestration System for Agentic RL Rollout
Zili Zhang, Yinmin Zhong, Chengxu Yang, Chao Jin, Bingyang Wu, Xinming Wei, Yuliang Liu, Xin Jin
Main category: cs.LG
TL;DR: Heddle is a trajectory-centric system that optimizes agentic RL rollouts by addressing long-tail trajectory generation bottlenecks through scheduling, placement, and resource management mechanisms.
Details
Motivation: Agentic RL with LLMs suffers from long-tailed trajectory generation during rollouts due to frequent tool calls, causing queueing delays, interference overhead, and inflated per-token time that bottleneck system performance.
Method: Heddle uses three core mechanisms: 1) trajectory-level scheduling with runtime prediction and progressive priority to minimize queueing, 2) trajectory-aware placement via presorted dynamic programming and opportunistic migration during idle tool calls to reduce interference, and 3) trajectory-adaptive resource management that dynamically tunes model parallelism to accelerate long-tail trajectories while maintaining throughput for short ones.
Result: Evaluations across diverse agentic RL workloads show Heddle effectively neutralizes the long-tail bottleneck, achieving up to 2.5× higher end-to-end rollout throughput compared to state-of-the-art baselines.
Conclusion: Heddle’s trajectory-centric approach successfully addresses the system bottlenecks in agentic RL rollouts, demonstrating significant performance improvements over existing step-centric designs.
Abstract: Agentic Reinforcement Learning (RL) enables LLMs to solve complex tasks by alternating between a data-collection rollout phase and a policy training phase. During rollout, the agent generates trajectories, i.e., multi-step interactions between LLMs and external tools. Yet, frequent tool calls induce long-tailed trajectory generation that bottlenecks rollouts. This stems from step-centric designs that ignore trajectory context, triggering three system problems for long-tail trajectory generation: queueing delays, interference overhead, and inflated per-token time. We propose Heddle, a trajectory-centric system to optimize the when, where, and how of agentic rollout execution. Heddle integrates three core mechanisms: trajectory-level scheduling using runtime prediction and progressive priority to minimize cumulative queueing; trajectory-aware placement via presorted dynamic programming and opportunistic migration during idle tool call intervals to minimize interference; and trajectory-adaptive resource manager that dynamically tunes model parallelism to accelerate the per-token time of long-tail trajectories while maintaining high throughput for short trajectories. Evaluations across diverse agentic RL workloads demonstrate that Heddle effectively neutralizes the long-tail bottleneck, achieving up to 2.5$\times$ higher end-to-end rollout throughput compared to state-of-the-art baselines.
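The trajectory-level scheduling idea — serve trajectories in order of predicted runtime so short ones are not queued behind the long tail — can be sketched with a single-worker simulation. The shortest-predicted-first rule and all names here are assumptions for illustration, not Heddle's actual scheduler, which uses runtime prediction with progressive priority.

```python
import heapq

def schedule_by_predicted_runtime(predicted):
    """Serve trajectories shortest-predicted-first on one worker.
    Returns (order, total_wait), where total_wait sums the queueing
    delay each trajectory experiences before it starts running."""
    heap = [(rt, tid) for tid, rt in enumerate(predicted)]
    heapq.heapify(heap)
    order, clock, total_wait = [], 0.0, 0.0
    while heap:
        rt, tid = heapq.heappop(heap)
        total_wait += clock  # time this trajectory spent waiting in queue
        clock += rt          # worker is busy for the trajectory's runtime
        order.append(tid)
    return order, total_wait
```

For predicted runtimes [10, 1, 2], shortest-first yields total waiting 4, whereas FIFO order would yield 21 — the cumulative-queueing reduction that trajectory-level scheduling targets.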
[929] Lipschitz verification of neural networks through training
Simon Kuang, Yuezhu Xu, S. Sivaranjani, Xinfan Lin
Main category: cs.LG
TL;DR: Training neural networks to have tight Lipschitz bounds by designing verifiable architectures rather than complex verification methods.
Details
Motivation: Current certified training approaches rely on computationally expensive verification methods to bound neural network Lipschitz constants, which are crucial for adversarial robustness and generalization. The trivial bound (product of layerwise constants) is exponentially loose for arbitrary networks, necessitating complex verification techniques.
Method: Design networks to be verifiable by the fast trivial bound by directly penalizing the trivial bound during training. Identify three structural obstructions (dead neurons, bias terms, ill-conditioned weights) and introduce architectural mitigations including norm-saturating polyactivations and bias-free sinusoidal layers.
Result: Achieved strong results on MNIST with Lipschitz bounds that are small (orders of magnitude lower than comparable works) and tight (within 10% of ground truth). Avoided runtime complexity of advanced verification while maintaining robustness.
Conclusion: Proposed paradigm shift from designing complex verifiers for arbitrary networks to designing networks that are easily verifiable by fast trivial bounds, achieving both computational efficiency and strong certified robustness.
Abstract: The global Lipschitz constant of a neural network governs both adversarial robustness and generalization.
Conventional approaches to "certified training" typically follow a train-then-verify paradigm: they train a network and then attempt to bound its Lipschitz constant. Because the efficient "trivial bound" (the product of the layerwise Lipschitz constants) is exponentially loose for arbitrary networks, these approaches must rely on computationally expensive techniques such as semidefinite programming, mixed-integer programming, or branch-and-bound.
We propose a different paradigm: rather than designing complex verifiers for arbitrary networks, we design networks to be verifiable by the fast trivial bound.
We show that directly penalizing the trivial bound during training forces it to become tight, thereby effectively regularizing the true Lipschitz constant.
To achieve this, we identify three structural obstructions to a tight trivial bound (dead neurons, bias terms, and ill-conditioned weights) and introduce architectural mitigations, including a novel notion of norm-saturating polyactivations and bias-free sinusoidal layers.
Our approach avoids the runtime complexity of advanced verification while achieving strong results: we train robust networks on MNIST with Lipschitz bounds that are small (orders of magnitude lower than comparable works) and tight (within 10% of the ground truth).
The experimental results validate the theoretical guarantees, support the proposed mechanisms, and extend empirically to diverse activations and non-Euclidean norms.
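The trivial bound and its use as a training penalty can be written down directly. A minimal sketch, assuming 1-Lipschitz activations so the bound is the product of the layers' operator norms; `penalized_loss` and the weight `lam` are illustrative, not the paper's training objective.

```python
import numpy as np

def trivial_lipschitz_bound(weights):
    """Product of spectral norms (largest singular values) of the weight
    matrices: an upper bound on the network's Lipschitz constant when
    every activation is 1-Lipschitz."""
    bound = 1.0
    for W in weights:
        bound *= np.linalg.norm(W, 2)  # largest singular value of W
    return bound

def penalized_loss(task_loss, weights, lam=0.01):
    """Task loss plus a penalty on the trivial bound. Minimizing this
    drives the cheap bound down, regularizing the true Lipschitz
    constant from above without any expensive verifier."""
    return task_loss + lam * trivial_lipschitz_bound(weights)
```

The paper's point is that training against this fast bound also makes it tight, so the same product that is exponentially loose for arbitrary networks becomes a usable certificate for networks trained this way.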
[930] Graph Vector Field: A Unified Framework for Multimodal Health Risk Assessment from Heterogeneous Wearable and Environmental Data Streams
Silvano Coletti, Francesca Fallucchi
Main category: cs.LG
TL;DR: GVF models health risk as vector fields on time-varying simplicial complexes using discrete differential geometry and multimodal mixture-of-experts for interpretable risk decomposition.
Details
Motivation: Current digital health research has separate strands in dynamic graph-based disease models, topological learning on simplicial complexes, and multimodal mixture-of-experts architectures, but these approaches remain disconnected. There's a need to integrate these methods into a unified framework for interpretable, modality-resolved health risk modeling.
Method: Proposes Graph Vector Field (GVF) framework that models health risk as vector-valued fields on time-varying simplicial complexes. Uses discrete differential-geometric operators (Hodge Laplacians, discrete exterior calculus) with Helmholtz-Hodge decomposition into exact, coexact, and harmonic components. Incorporates multimodal inputs through bundle-structured mixture-of-experts where modality-specific latent spaces are attached as fibres to the base complex.
Result: Theoretical framework developed with mathematical foundations, architectural design, and formal guarantees. Empirical validation is ongoing work. Framework integrates geometric dynamical systems, higher-order topology, and structured multimodal fusion for interpretable risk modeling.
Conclusion: GVF provides a unified framework that connects previously disconnected approaches in digital health research, offering interpretable decomposition of health risk into propagation, cyclic, and persistent components while enabling modality-level identifiability through structured multimodal fusion.
Abstract: Digital health research has advanced dynamic graph-based disease models, topological learning on simplicial complexes, and multimodal mixture-of-experts architectures, but these strands remain largely disconnected. We propose Graph Vector Field (GVF), a framework that models health risk as a vector-valued field on time-varying simplicial complexes, coupling discrete differential-geometric operators with modality-structured mixture-of-experts. Risk is represented as a vector-valued cochain whose evolution is parameterised with Hodge Laplacians and discrete exterior calculus operators, yielding a Helmholtz-Hodge decomposition into potential-driven (exact), circulation-like (coexact), and topologically constrained (harmonic) components linked to interpretable propagation, cyclic, and persistent risk mechanisms. Multimodal inputs from wearable sensors, behavioural/environmental context, and clinical/genomic data are incorporated through a bundle-structured mixture-of-experts in which modality-specific latent spaces are attached as fibres to the base complex. This separates modality-specific from shared contributions and offers a principled route toward modality-level identifiability. GVF integrates geometric dynamical systems, higher-order topology (enforced indirectly via geometric regularisation and Hodge decomposition), and structured multimodal fusion into a single framework for interpretable, modality-resolved risk modelling. This paper develops the mathematical foundations, architectural design, and formal guarantees; empirical validation is the subject of ongoing work.
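The Helmholtz-Hodge split at the heart of GVF can be demonstrated on a toy complex. The filled triangle and least-squares projections below are illustrative assumptions, not the paper's operators; they rely only on the identity B1 @ B2 = 0, which makes the exact and coexact spaces mutually orthogonal.

```python
import numpy as np

def hodge_decompose(f, B1, B2):
    """Split an edge cochain f into f = B1^T phi + B2 psi + h:
    exact ("propagation"), coexact ("cyclic"), and harmonic
    ("persistent") parts, via orthogonal least-squares projections."""
    phi, *_ = np.linalg.lstsq(B1.T, f, rcond=None)  # node potentials
    exact = B1.T @ phi
    psi, *_ = np.linalg.lstsq(B2, f, rcond=None)    # triangle circulations
    coexact = B2 @ psi
    return exact, coexact, f - exact - coexact

# One filled triangle: nodes {0,1,2}, oriented edges (0,1), (1,2), (0,2).
B1 = np.array([[-1,  0, -1],
               [ 1, -1,  0],
               [ 0,  1,  1]], dtype=float)  # node-edge incidence
B2 = np.array([[1.0], [1.0], [-1.0]])       # edge-triangle incidence (B1 @ B2 = 0)
```

For the flow f = [2, 2, 1] on this complex, the decomposition recovers a gradient part [1, 1, 2] driven by node potentials [0, 1, 2], a circulation part [1, 1, -1] around the filled triangle, and no harmonic remainder, mirroring GVF's interpretable risk components.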
[931] Neural Federated Learning for Livestock Growth Prediction
Shoujin Wang, Mingze Ni, Wei Liu, Victor W. Chu, Kenny Sabir, Bryan Zheng, Ayush Kanwal, Roy Jing Yang, Fang Chen
Main category: cs.LG
TL;DR: LivestockFL: A federated learning framework for livestock growth prediction that enables collaborative training across farms without sharing raw data, addressing privacy concerns and data sparsity.
Details
Motivation: Livestock growth prediction is crucial for farm optimization but faces challenges due to limited datasets and privacy concerns. Existing biophysical models are rigid, while ML approaches suffer from small, isolated datasets, limiting robustness and generalizability.
Method: Proposes LivestockFL, a federated learning framework using a GRU+MLP architecture to model temporal growth patterns from historical weight records and auxiliary features. Also introduces LivestockPFL with personalized prediction heads for farm-specific models.
Result: Experiments on real-world datasets demonstrate the effectiveness and practicality of the proposed federated learning approaches for livestock growth prediction.
Conclusion: The proposed federated learning frameworks address data privacy and sparsity challenges in livestock growth prediction, enabling collaborative model training across distributed farms while maintaining data confidentiality.
Abstract: Livestock growth prediction is essential for optimising farm management and improving the efficiency and sustainability of livestock production, yet it remains underexplored due to limited large-scale datasets and privacy concerns surrounding farm-level data. Existing biophysical models rely on fixed formulations, while most machine learning approaches are trained on small, isolated datasets, limiting their robustness and generalisability. To address these challenges, we propose LivestockFL, the first federated learning framework specifically designed for livestock growth prediction. LivestockFL enables collaborative model training across distributed farms without sharing raw data, thereby preserving data privacy while alleviating data sparsity, particularly for farms with limited historical records. The framework employs a neural architecture based on a Gated Recurrent Unit combined with a multilayer perceptron to model temporal growth patterns from historical weight records and auxiliary features. We further introduce LivestockPFL, a novel personalised federated learning framework that extends the above federated learning framework with a personalized prediction head trained on each farm’s local data, producing farm-specific predictors. Experiments on a real-world dataset demonstrate the effectiveness and practicality of the proposed approaches.
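The collaborative training step can be sketched with standard federated averaging: each farm trains locally, and only parameters, never raw records, are aggregated. `fed_avg` is a generic FedAvg sketch, not LivestockFL's actual protocol; the paper's GRU+MLP model and personalization head are not reproduced here.

```python
import numpy as np

def fed_avg(local_weights, sample_counts):
    """Aggregate per-farm parameter vectors into a global model,
    weighting each farm by its local dataset size. Raw weight records
    never leave the farm; only model parameters are shared."""
    counts = np.asarray(sample_counts, dtype=float)
    W = np.stack([np.asarray(w, dtype=float) for w in local_weights])
    return (counts[:, None] * W).sum(axis=0) / counts.sum()
```

The size weighting lets farms with long histories contribute more to the global model while small farms, the ones most hurt by data sparsity, still benefit from the aggregate.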
[932] ORACAL: A Robust and Explainable Multimodal Framework for Smart Contract Vulnerability Detection with Causal Graph Enrichment
Tran Duong Minh Dai, Triet Huynh Minh Le, M. Ali Babar, Van-Hau Pham, Phan The Duy
Main category: cs.LG
TL;DR: ORACAL is a multimodal graph learning framework for smart contract vulnerability detection that integrates control flow, data flow, and call graphs with RAG-enhanced security context and causal reasoning for improved accuracy and explainability.
Details
Motivation: Current GNN-based approaches for smart contract vulnerability detection have limitations: homogeneous graphs miss control-data dependency interplay, heterogeneous graphs lack deep semantic understanding and are vulnerable to adversarial attacks, and black-box models lack explainability needed for professional audits.
Method: ORACAL integrates Control Flow Graph (CFG), Data Flow Graph (DFG), and Call Graph (CG) into a heterogeneous multimodal graph framework. It uses Retrieval-Augmented Generation (RAG) and LLMs to enrich critical subgraphs with expert security context, employs causal attention to disentangle true vulnerability indicators, and uses PGExplainer for subgraph-level explanations.
Result: ORACAL achieves state-of-the-art performance with peak Macro F1 of 91.28%, outperforming baselines by up to 39.6 percentage points. It maintains strong generalization (91.8% on CGT Weakness, 77.1% on DAppScan), provides explainability with 32.51% MIoU, and shows robustness to adversarial attacks with only 2.35% F1 decrease and 3% ASR.
Conclusion: ORACAL effectively addresses limitations of existing GNN approaches by combining multimodal graph integration, RAG-enhanced context, causal reasoning, and explainability mechanisms, achieving superior performance, robustness, and transparency for smart contract vulnerability detection.
Abstract: Although Graph Neural Networks (GNNs) have shown promise for smart contract vulnerability detection, they still face significant limitations. Homogeneous graph models fail to capture the interplay between control flow and data dependencies, while heterogeneous graph approaches often lack deep semantic understanding, leaving them susceptible to adversarial attacks. Moreover, most black-box models fail to provide explainable evidence, hindering trust in professional audits. To address these challenges, we propose ORACAL (Observable RAG-enhanced Analysis with CausAL reasoning), a heterogeneous multimodal graph learning framework that integrates Control Flow Graph (CFG), Data Flow Graph (DFG), and Call Graph (CG). ORACAL selectively enriches critical subgraphs with expert-level security context from Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs), and employs a causal attention mechanism to disentangle true vulnerability indicators from spurious correlations. For transparency, the framework adopts PGExplainer to generate subgraph-level explanations identifying vulnerability triggering paths. Experiments on large-scale datasets demonstrate that ORACAL achieves state-of-the-art performance, outperforming MANDO-HGT, MTVHunter, GNN-SC, and SCVHunter by up to 39.6 percentage points, with a peak Macro F1 of 91.28% on the primary benchmark. ORACAL maintains strong generalization on out-of-distribution datasets with 91.8% on CGT Weakness and 77.1% on DAppScan. In explainability evaluation, PGExplainer achieves 32.51% Mean Intersection over Union (MIoU) against manually annotated vulnerability triggering paths. Under adversarial attacks, ORACAL limits performance degradation to approximately 2.35% F1 decrease with an Attack Success Rate (ASR) of only 3%, surpassing SCVHunter and MANDO-HGT which exhibit ASRs ranging from 10.91% to 18.73%.
[933] Automating Early Disease Prediction Via Structured and Unstructured Clinical Data
Ane G Domingo-Aldama, Marcos Merino Prado, Alain García Olea, Josu Goikoetxea, Koldo Gojenola, Aitziber Atutxa
Main category: cs.LG
TL;DR: Automated pipeline using NLP on discharge reports for clinical prediction tasks, improving data quality and model performance over structured EHR data alone.
Details
Motivation: Addresses limitations of structured EHR data which often contains missing or incomplete information, and aims to automate early prediction studies by leveraging rich clinical information from unstructured discharge reports.
Method: Proposes a fully automated pipeline using natural language processing techniques on discharge reports to support three main steps: cohort selection, dataset generation, and outcome labeling. The approach enriches structured datasets with additional clinical variables extracted from text.
Result: Predictive models trained on datasets enriched with discharge report information achieved higher accuracy and better correlation with true outcomes compared to models using only structured EHR data, and also surpassed traditional clinical scores in predicting atrial fibrillation progression.
Conclusion: Automating the integration of unstructured clinical text can streamline early prediction studies, improve data quality, and enhance the reliability of predictive models for clinical decision-making.
Abstract: This study presents a fully automated methodology for early prediction studies in clinical settings, leveraging information extracted from unstructured discharge reports. The proposed pipeline uses discharge reports to support the three main steps of early prediction: cohort selection, dataset generation, and outcome labeling. By processing discharge reports with natural language processing techniques, we can efficiently identify relevant patient cohorts, enrich structured datasets with additional clinical variables, and generate high-quality labels without manual intervention. This approach addresses the frequent issue of missing or incomplete data in codified electronic health records (EHR), capturing clinically relevant information that is often underrepresented. We evaluate the methodology in the context of predicting atrial fibrillation (AF) progression, showing that predictive models trained on datasets enriched with discharge report information achieve higher accuracy and correlation with true outcomes compared to models trained solely on structured EHR data, while also surpassing traditional clinical scores. These results demonstrate that automating the integration of unstructured clinical text can streamline early prediction studies, improve data quality, and enhance the reliability of predictive models for clinical decision-making.
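The cohort-selection step can be illustrated with a toy keyword filter over discharge reports. The pattern and function below are illustrative assumptions; the paper's NLP pipeline is far richer than a regular expression.

```python
import re

# Toy pattern for mentions of atrial fibrillation (the paper's use case).
AF_PATTERN = re.compile(r"\batrial\s+fibrillation\b|\bAF\b")

def select_cohort(reports):
    """Return ids of patients whose discharge report mentions AF,
    forming the study cohort without manual chart review."""
    return [pid for pid, text in reports.items() if AF_PATTERN.search(text)]
```

In the full pipeline this selection would be followed by extracting clinical variables from the same reports to enrich the structured dataset, and by labeling outcomes from follow-up documentation.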
[934] Skillful Kilometer-Scale Regional Weather Forecasting via Global and Regional Coupling
Weiqi Chen, Wenwei Wang, Qilong Yuan, Lefei Shen, Bingqing Peng, Jiawei Chen, Bo Wu, Liang Sun
Main category: cs.LG
TL;DR: A global-regional coupling framework for kilometer-scale weather forecasting that combines a pretrained global Transformer model with high-resolution regional networks using a novel ScaleMixer module for bidirectional cross-scale feature interaction.
Details
Motivation: While data-driven weather models have advanced global medium-range forecasting, high-resolution regional prediction remains challenging due to unresolved multiscale interactions between large-scale dynamics and small-scale processes like terrain-induced circulations and coastal effects.
Method: Proposes a global-regional coupling framework with ScaleMixer module that dynamically identifies meteorologically critical regions through adaptive key-position sampling and enables cross-scale feature interaction through dedicated attention mechanisms.
Result: Produces forecasts at 0.05° (~5km) and 1-hour resolution over China, significantly outperforming operational NWP and AI baselines on both gridded reanalysis data and real-time weather station observations.
Conclusion: The framework demonstrates exceptional skill in capturing fine-grained phenomena like orographic wind patterns and Foehn warming, showing effective global-scale coherence with high-resolution fidelity.
Abstract: Data-driven weather models have advanced global medium-range forecasting, yet high-resolution regional prediction remains challenging due to unresolved multiscale interactions between large-scale dynamics and small-scale processes such as terrain-induced circulations and coastal effects. This paper presents a global-regional coupling framework for kilometer-scale regional weather forecasting that synergistically couples a pretrained Transformer-based global model with a high-resolution regional network via a novel bidirectional coupling module, ScaleMixer. ScaleMixer dynamically identifies meteorologically critical regions through adaptive key-position sampling and enables cross-scale feature interaction through dedicated attention mechanisms. The framework produces forecasts at $0.05^\circ$ ($\sim 5\,\mathrm{km}$) and 1-hour resolution over China, significantly outperforming operational NWP and AI baselines on both gridded reanalysis data and real-time weather station observations. It exhibits exceptional skill in capturing fine-grained phenomena such as orographic wind patterns and Foehn warming, demonstrating effective global-scale coherence with high-resolution fidelity. The code is available at https://anonymous.4open.science/r/ScaleMixer-6B66.
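The cross-scale interaction can be illustrated with plain single-head cross-attention, in which regional tokens attend to a small set of sampled global tokens. This is a generic attention sketch under that assumption; ScaleMixer's adaptive key-position sampling and exact architecture are not reproduced.

```python
import numpy as np

def cross_scale_attention(regional, global_keys, global_vals):
    """Single-head cross-attention: high-resolution regional queries
    attend over sampled global tokens, letting large-scale context
    condition the fine-scale field."""
    d = regional.shape[-1]
    scores = regional @ global_keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over global tokens
    return weights @ global_vals
```

Because attention runs only over the sampled key positions rather than the full global grid, this kind of coupling stays cheap even at kilometer-scale regional resolution.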
[935] Policy-Controlled Generalized Share: A General Framework with a Transformer Instantiation for Strictly Online Switching-Oracle Tracking
Hongkai Hu
Main category: cs.LG
TL;DR: PCGS-TF is a strictly online prediction framework that uses a causal Transformer as an adaptive update controller to handle non-stationary environments where the best expert may switch repeatedly over time.
Details
Motivation: Traditional static regret to a single expert is inadequate for strictly online prediction in non-stationary environments where the best expert may change frequently. There's a need for frameworks that can adapt to changing conditions and expert switches over time.
Method: Policy-Controlled Generalized Share (PCGS) framework where the generalized-share recursion is fixed but post-loss update controls vary adaptively. PCGS-TF specifically uses a causal Transformer as an update controller that outputs controls mapping w_t to w_{t+1} after observing the loss vector at round t.
Result: PCGS-TF achieves pathwise weighted regret guarantees for general time-varying learning rates and standard dynamic-regret guarantees against expert paths with up to S switches. Empirically, it attains lowest mean dynamic regret in all seven non-stationary families in synthetic tests and lowest normalized dynamic regret on household-electricity benchmark for S = 5, 10, and 20.
Conclusion: PCGS-TF demonstrates strong performance in non-stationary online prediction tasks by using Transformer-based adaptive control, outperforming existing methods in dynamic regret minimization across various non-stationary environments.
Abstract: Static regret to a single expert is often the wrong target for strictly online prediction under non-stationarity, where the best expert may switch repeatedly over time. We study Policy-Controlled Generalized Share (PCGS), a general strictly online framework in which the generalized-share recursion is fixed while the post-loss update controls are allowed to vary adaptively. Its principal instantiation in this paper is PCGS-TF, which uses a causal Transformer as an update controller: after round t finishes and the loss vector is observed, the Transformer outputs the controls that map w_t to w_{t+1} without altering the already committed decision w_t. Under admissible post-loss update controls, we obtain a pathwise weighted regret guarantee for general time-varying learning rates, and a standard dynamic-regret guarantee against any expert path with at most S switches under the constant-learning-rate specialization. Empirically, on a controlled synthetic suite with exact dynamic-programming switching-oracle evaluation, PCGS-TF attains the lowest mean dynamic regret in all seven non-stationary families, with its advantage increasing for larger expert pools. On a reproduced household-electricity benchmark, PCGS-TF also achieves the lowest normalized dynamic regret for S = 5, 10, and 20.
[936] A Perturbation Approach to Unconstrained Linear Bandits
Andrew Jacobsen, Dorian Baudry, Shinji Ito, Nicolò Cesa-Bianchi
Main category: cs.LG
TL;DR: The paper revisits perturbation-based approaches for unconstrained Bandit Linear Optimization, showing it reduces to standard Online Linear Optimization and providing improved regret guarantees including comparator-adaptive rates, dynamic regret, and high-probability bounds.
Details
Motivation: To improve understanding and performance of unconstrained Bandit Linear Optimization (uBLO) by revisiting perturbation-based approaches and connecting them to standard Online Linear Optimization (OLO) problems.
Method: Revisits the perturbation-based approach of Abernethy et al. (2008) for uBLO, showing it effectively reduces BLO to OLO. Combines perturbation scheme with comparator-adaptive OLO algorithms and extends analysis to dynamic regret and high-probability guarantees.
Result: Derives expected-regret guarantees with comparator-adaptive OLO algorithms, obtains optimal √P_T path-length dependencies for dynamic regret without prior knowledge, develops first high-probability guarantees for static/dynamic regret in uBLO, and proves Ω(√dT) lower bound for adversarial linear bandits.
Conclusion: The perturbation-based approach effectively simplifies uBLO to OLO, enabling improved regret guarantees across multiple dimensions including comparator-adaptivity, dynamic regret, and high-probability bounds, while also establishing fundamental lower bounds.
Abstract: We revisit the standard perturbation-based approach of Abernethy et al. (2008) in the context of unconstrained Bandit Linear Optimization (uBLO). We show the surprising result that in the unconstrained setting, this approach effectively reduces Bandit Linear Optimization (BLO) to a standard Online Linear Optimization (OLO) problem. Our framework improves on prior work in several ways. First, we derive expected-regret guarantees when our perturbation scheme is combined with comparator-adaptive OLO algorithms, leading to new insights about the impact of different adversarial models on the resulting comparator-adaptive rates. We also extend our analysis to dynamic regret, obtaining the optimal $\sqrt{P_T}$ path-length dependencies without prior knowledge of $P_T$. We then develop the first high-probability guarantees for both static and dynamic regret in uBLO. Finally, we discuss lower bounds on the static regret, and prove the folklore $\Omega(\sqrt{dT})$ rate for adversarial linear bandits on the unit Euclidean ball, which is of independent interest.
[937] ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models
Song Yu, Li Li
Main category: cs.LG
TL;DR: ERPO improves reasoning in LLMs by focusing on critical decision points with entropy regulation, moving beyond uniform token-level advantages to enable better exploration and more concise reasoning paths.
Details
Motivation: Standard RLVR methods like GRPO assign uniform sequence-level advantages to all tokens, ignoring information heterogeneity in reasoning chains. This leads to premature entropy collapse and redundant, low-quality reasoning paths.
Method: ERPO introduces three components: (1) Entropy-aware Gating to amplify exploration at Critical Decision Pivots (CDPs), (2) Bucket-based Implicit Normalization to align token progress windows and mitigate difficulty bias, and (3) Result-anchored Advantage Synthesis to re-weight token-level signals via outcome-driven anchors.
Result: ERPO significantly outperforms GRPO on mathematical benchmarks (MATH, AIME), boosting reasoning accuracy while producing more concise and robust derivation paths, establishing a new efficiency-accuracy frontier.
Conclusion: Fine-grained token-level optimization focusing on critical decision points through entropy regulation is crucial for improving reasoning capabilities in large language models, enabling better exploration and more efficient reasoning paths.
Abstract: Reinforcement learning from verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models. However, standard Group Relative Policy Optimization (GRPO) typically assigns a uniform, sequence-level advantage to all tokens, thereby overlooking the intrinsic information heterogeneity along reasoning chains. We show that this coarse-grained credit assignment leads to premature entropy collapse and encourages the model to generate redundant, low-quality reasoning paths. Through systematic empirical analysis, we identify Critical Decision Pivots (CDPs): transient high-entropy states where the policy’s trajectory is most sensitive to perturbations. These pivots represent the “forks in the road” where effective multi-path exploration is most crucial yet often suppressed by uniform advantage signals. Building on these insights, we propose Entropy-Regulated Policy Optimization (ERPO), which transitions the optimization focus from coarse sequences to fine-grained token dynamics. ERPO introduces three synergistic components: (i) Entropy-aware Gating, which adaptively amplifies exploration at CDPs to facilitate diverse path discovery; (ii) Bucket-based Implicit Normalization, which mitigates difficulty bias by aligning token progress windows; and (iii) Result-anchored Advantage Synthesis, which re-weights token-level signals via outcome-driven anchors. Extensive experiments on competitive mathematical benchmarks (e.g., MATH, AIME) demonstrate that ERPO significantly outperforms GRPO. Notably, ERPO not only boosts reasoning accuracy but also yields significantly more concise and robust derivation paths, establishing a new efficiency-accuracy frontier for large reasoning models.
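The entropy-aware gating component can be sketched as: compute per-token policy entropy and amplify the advantage where entropy is high (the putative Critical Decision Pivots) instead of sharing one uniform value across the sequence. The hard threshold and multiplicative gate below are assumptions, not ERPO's exact rule.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy of the softmax distribution at each token position."""
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def gated_advantages(logits, seq_advantage, threshold=1.0, boost=2.0):
    """Start from GRPO's uniform sequence-level advantage, then amplify
    it at high-entropy tokens, where the trajectory is most sensitive
    and exploration matters most."""
    H = token_entropy(logits)
    gate = np.where(H > threshold, boost, 1.0)
    return seq_advantage * gate
```

Low-entropy tokens (where the policy is already confident) keep the plain sequence advantage, while the "forks in the road" receive an enlarged signal, counteracting the entropy collapse that uniform credit assignment induces.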
[938] Variational Neurons in Transformers for Language Modeling
Yves Ruffenach
Main category: cs.LG
TL;DR: Introduces variational neurons into Transformer feed-forward layers to incorporate uncertainty into internal computation while maintaining the Transformer backbone, evaluated in compact language modeling settings.
Details
Motivation: Current Transformers for language modeling rely on deterministic internal computation with uncertainty only at the output layer. The paper aims to make uncertainty part of the internal computation itself to create more informative uncertainty-aware language models.
Method: Replace deterministic feed-forward units in Transformers with local variational units based on EVE (Explicit Variational Estimation) while preserving the overall Transformer backbone. Evaluate in compact next-token language modeling settings.
Result: Variational neurons integrate stably into Transformers, preserve strong predictive performance, and produce informative uncertainty signals. Experiments show task quality, useful depth, and internal stability are distinct properties.
Conclusion: Establishes variational Transformers as a practical form of uncertainty-aware language modeling, showing Transformers can predict with explicit internal uncertainty structure for stronger probabilistic evaluation and more informative model behavior analysis.
Abstract: Transformers for language modeling usually rely on deterministic internal computation, with uncertainty expressed mainly at the output layer. We introduce variational neurons into Transformer feed-forward computation so that uncertainty becomes part of the internal computation itself. Concretely, we replace deterministic feed-forward units with local variational units based on EVE while preserving the overall Transformer backbone. We evaluate this design in compact next-token language-modeling settings. We compare deterministic and variational variants with both predictive and probabilistic criteria. Alongside negative log-likelihood, perplexity and accuracy, we analyze calibration, conditional variance, mutual information and latent-usage statistics. The resulting picture is clear. Variational neurons integrate stably into Transformers, preserve strong predictive performance and produce informative uncertainty signals. The experiments also show that task quality, useful depth and internal stability are distinct properties. These results establish variational Transformers as a practical form of uncertainty-aware language modeling. They show that Transformers can predict with an explicit internal structure of uncertainty, which supports stronger probabilistic evaluation and a more informative analysis of model behavior.
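The EVE units themselves are not specified in this summary; as a hedged illustration of the general idea, a generic Gaussian variational feed-forward unit using the reparameterization trick with a closed-form KL term might look like this (names, shapes, and the Gaussian form are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def variational_unit(x, w_mu, w_logvar):
    """A feed-forward unit that emits a distribution instead of a point:
    mean and log-variance are linear in x; a sample is drawn with the
    reparameterization trick so gradients could flow through mu and sigma."""
    mu = x @ w_mu
    logvar = x @ w_logvar
    eps = rng.standard_normal(mu.shape)
    sample = mu + np.exp(0.5 * logvar) * eps
    # KL(q || N(0, I)), the usual Gaussian closed form, summed over units
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return sample, kl
```

The KL term is what would be added to the training loss alongside the next-token negative log-likelihood.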
[939] Detecting the Unexpected: AI-Driven Anomaly Detection in Smart Bridge Monitoring
Rahul Jaiswal, Joakim Hellum, Halvor Heiberg
Main category: cs.LG
TL;DR: AI-driven anomaly detection for smart bridge monitoring using DBSCAN clustering on real-time sensor data
Details
Motivation: Traditional bridge monitoring relies on human visual inspections which are time-consuming, subjective, and error-prone. Smart bridge monitoring is essential for public safety and preventing catastrophic failures in critical infrastructure.
Method: Developed a simple machine learning model using real-time sensor data from iBridge devices installed on a Norwegian bridge. Evaluated against different ML models, with DBSCAN (density-based spatial clustering) showing superior performance for anomaly detection.
Result: The DBSCAN-based model outperformed other ML models in accurately detecting anomalous events (bridge accidents). The approach demonstrates effectiveness for smart bridge monitoring.
Conclusion: The proposed AI-driven anomaly detection model is well-suited for smart bridge monitoring and can enhance public safety by enabling timely detection of unforeseen incidents.
Abstract: Bridges are critical components of national infrastructure and smart cities. Therefore, smart bridge monitoring is essential for ensuring public safety and preventing catastrophic failures or accidents. Traditional bridge monitoring methods rely heavily on human visual inspections, which are time-consuming and prone to subjectivity and error. This paper proposes an artificial intelligence (AI)-driven anomaly detection approach for smart bridge monitoring. Specifically, a simple machine learning (ML) model is developed using real-time sensor data collected by the iBridge sensor devices installed on a bridge in Norway. The proposed model is evaluated against different ML models. Experimental results demonstrate that the density-based spatial clustering of applications with noise (DBSCAN)-based model outperforms the alternatives in accurately detecting anomalous events (bridge accidents). These findings indicate that the proposed model is well-suited for smart bridge monitoring and can enhance public safety by enabling the timely detection of unforeseen incidents.
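To see why DBSCAN suits this task, its noise rule, the part that flags anomalies, can be written directly: a reading is anomalous if it is neither a core point (enough neighbors within `eps`) nor within `eps` of one. The 2-D inputs and parameter values below are illustrative, not the paper's iBridge feature set:

```python
import numpy as np

def dbscan_noise(X, eps=0.5, min_samples=4):
    """Flag DBSCAN-style noise points: a point is anomalous if it is not a
    core point (>= min_samples neighbors within eps, counting itself) and
    is not within eps of any core point."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = d <= eps
    core = neighbors.sum(axis=1) >= min_samples
    reachable = (neighbors & core[None, :]).any(axis=1)
    return ~(core | reachable)
```

A dense cluster of normal readings survives unflagged, while an isolated reading far from any core point is marked as noise.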
[940] MR-ImagenTime: Multi-Resolution Time Series Generation through Dual Image Representations
Xianyong Xu, Yuanjun Zuo, Zhihong Huang, Yihan Qin, Haoxian Xu, Leilei Du, Haotian Wang
Main category: cs.LG
TL;DR: MR-CDM: A multi-resolution conditional diffusion model for time series forecasting with hierarchical trend decomposition and adaptive embeddings
Details
Motivation: Existing time series forecasting models struggle with fixed-length inputs and inadequate multi-scale modeling, limiting their effectiveness across various domains where time series forecasting is vital.
Method: Combines hierarchical multi-resolution trend decomposition, adaptive embedding mechanism for variable-length inputs, and multi-scale conditional diffusion process.
Result: Significantly outperforms state-of-the-art baselines (CSDI, Informer) on four real-world datasets, reducing MAE and RMSE by approximately 6-10%
Conclusion: MR-CDM effectively addresses limitations of existing time series forecasting models through its multi-resolution decomposition and conditional diffusion approach
Abstract: Time series forecasting is vital across many domains, yet existing models struggle with fixed-length inputs and inadequate multi-scale modeling. We propose MR-CDM, a framework combining hierarchical multi-resolution trend decomposition, an adaptive embedding mechanism for variable-length inputs, and a multi-scale conditional diffusion process. Evaluations on four real-world datasets demonstrate that MR-CDM significantly outperforms state-of-the-art baselines (e.g., CSDI, Informer), reducing MAE and RMSE by approximately 6-10%.
[941] MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang, Yao Lu, Yongxiang Liu, Ganzhao Yuan
Main category: cs.LG
TL;DR: MuonRC: Lightweight pre-orthogonalization equilibration schemes for Muon optimizer that rebalance momentum matrix before orthogonalization using row/column normalization statistics, improving training of matrix-valued parameters.
Details
Motivation: Existing orthogonalized-update optimizers like Muon have limitations: extensions either act after orthogonalization (rescaling) or before it with heavy whitening-based preconditioners. There's a need for lightweight pre-orthogonalization methods that improve training efficiency for matrix-valued parameters.
Method: Introduces MuonRC family with three variants: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). These rebalance momentum matrix before finite-step Newton-Schulz orthogonalization using row/column squared-norm statistics with only O(m+n) auxiliary state. Row-normalized variant R is the natural default for hidden matrix weights.
Result: Shows finite-step orthogonalization is governed by input spectral properties (stable rank and condition number). Row/column normalization acts as zeroth-order whitening surrogate removing marginal scale mismatch. In LLaMA2 pretraining on C4, default R variant consistently outperforms Muon on 130M and 350M models with faster convergence and lower validation perplexity.
Conclusion: MuonRC provides lightweight pre-orthogonalization equilibration that improves training efficiency for matrix-valued parameters while preserving theoretical guarantees. The row-normalized variant R is recommended as default and shows practical benefits in large language model pretraining.
Abstract: Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions mostly act either after orthogonalization by rescaling updates or before it with heavier whitening-based preconditioners. We introduce MuonEq, a lightweight family of pre-orthogonalization equilibration schemes for Muon in three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). These variants rebalance the momentum matrix before finite-step Newton–Schulz using row/column squared-norm statistics and only $\mathcal{O}(m+n)$ auxiliary state. We show that finite-step orthogonalization is governed by input spectral properties, especially stable rank and condition number, and that row/column normalization is a zeroth-order whitening surrogate that removes marginal scale mismatch. For the hidden matrix weights targeted by MuonEq, the row-normalized variant R is the natural default and preserves the $\widetilde{\mathcal{O}}(T^{-1/4})$ stationarity guarantee of Muon-type methods. In LLaMA2 pretraining on C4, the default R variant consistently outperforms Muon on 130M and 350M models, yielding faster convergence and lower validation perplexity.
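A minimal sketch of the two pieces, row equilibration followed by finite-step Newton-Schulz orthogonalization, under the usual Muon-style formulation (the pre-scaling, step count, and epsilon values here are assumptions, not the paper's exact recipe):

```python
import numpy as np

def row_normalize(M, eps=1e-8):
    """Pre-orthogonalization equilibration (the 'R' variant): divide each
    row of the momentum matrix by its L2 norm so no row dominates the
    spectrum seen by the finite-step orthogonalizer."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / (norms + eps)

def newton_schulz_orth(M, steps=5):
    """Finite-step Newton-Schulz iteration toward the orthogonal polar
    factor of M (as in Muon-style optimizers). M is pre-scaled by its
    spectral norm so all singular values lie in the convergence region."""
    X = M / (np.linalg.norm(M, ord=2) + 1e-8)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

The iteration drives every singular value toward 1, so its output is approximately orthogonal; equilibrating first reduces the condition number the iteration has to fight.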
[942] Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban, Adish Singla, Goran Radanović
Main category: cs.LG
TL;DR: Robust offline multi-agent RL from human feedback under data corruption, with theoretical guarantees for Nash equilibrium and coarse correlated equilibrium gaps under different coverage assumptions.
Details
Motivation: Addresses the critical problem of data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF), where an ε-fraction of trajectory-preference samples may be arbitrarily corrupted. This is important for real-world applications where collected human feedback data may contain adversarial corruptions or errors.
Method: Models the problem using linear Markov games framework. Develops robust estimators under two coverage assumptions: 1) uniform coverage (all policies covered), and 2) unilateral coverage (only Nash equilibrium and single-player deviations covered). For computational tractability, relaxes solution concept to coarse correlated equilibria (CCE) with quasi-polynomial-time algorithm.
Result: Under uniform coverage: robust estimator achieves O(ε^{1-o(1)}) bound on Nash equilibrium gap. Under unilateral coverage: algorithm achieves O(√ε) bound on Nash gap. For computational tractability: quasi-polynomial-time algorithm achieves O(√ε) bound on CCE gap under unilateral coverage.
Conclusion: Provides first systematic treatment of adversarial data corruption in offline MARLHF, establishing theoretical guarantees for robust learning under different coverage assumptions and computational constraints, with implications for practical deployment of multi-agent RL systems using potentially corrupted human feedback data.
Abstract: We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong-contamination model: given a dataset $D$ of trajectory-preference tuples (each preference being an $n$-dimensional binary label vector representing each of the $n$ agents’ preferences), an $\epsilon$-fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a uniform coverage assumption - where every policy of interest is sufficiently represented in the clean (prior to corruption) data - we introduce a robust estimator that guarantees an $O(\epsilon^{1 - o(1)})$ bound on the Nash equilibrium gap. Next, we move to the more challenging unilateral coverage setting, in which only a Nash equilibrium and its single-player deviations are covered. In this case, our proposed algorithm achieves an $O(\sqrt{\epsilon})$ bound on the Nash gap. Both of these procedures, however, suffer from intractable computation. To address this, we relax our solution concept to coarse correlated equilibria (CCE). Under the same unilateral coverage regime, we derive a quasi-polynomial-time algorithm whose CCE gap scales as $O(\sqrt{\epsilon})$. To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.
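The paper's robust estimators for preference data are considerably more involved, but the strong-contamination model itself is easy to illustrate with the textbook coordinate-wise trimmed mean, which tolerates an ε-fraction of arbitrary corruptions (shown purely as background, not as the paper's estimator):

```python
import numpy as np

def trimmed_mean(samples, eps):
    """Coordinate-wise trimmed mean: discard the eps-fraction largest and
    smallest values in each coordinate before averaging, a standard defense
    when an eps-fraction of samples may be arbitrarily corrupted."""
    X = np.sort(np.asarray(samples, dtype=float), axis=0)
    n = X.shape[0]
    k = int(np.ceil(eps * n))
    return X[k:n - k].mean(axis=0)
```

A single adversarially huge sample is trimmed away instead of dragging the estimate arbitrarily far.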
[943] Pre-Deployment Complexity Estimation for Federated Perception Systems
KMA Solaiman, Shafkat Islam, Ruy de Oliveira, Bharat Bhargava
Main category: cs.LG
TL;DR: A framework for estimating federated learning complexity in edge AI perception systems by modeling data properties and distributed environment characteristics to predict accuracy and communication costs before deployment.
Details
Motivation: Edge AI systems need practical tools to estimate federated learning task difficulty (accuracy and communication costs) before deployment, as practitioners currently lack such diagnostic tools for resource planning and feasibility evaluation.
Method: Proposes a classifier-agnostic, pre-deployment framework that integrates dataset attributes (dimensionality, sparsity, heterogeneity) with client composition factors to create a complexity metric for federated perception systems.
Result: Experiments on MNIST and CIFAR variants show the proposed metric strongly correlates with federated learning performance and communication effort required to reach fixed accuracy targets.
Conclusion: Complexity estimation can serve as a practical diagnostic tool for resource planning, dataset assessment, and feasibility evaluation in edge-deployed perception systems.
Abstract: Edge AI systems increasingly rely on federated learning to train perception models in distributed, privacy-preserving, and resource-constrained environments. Yet, before training begins, practitioners often lack practical tools to estimate how difficult a federated learning task will be in terms of achievable accuracy and communication cost. This paper presents a classifier-agnostic, pre-deployment framework for estimating learning complexity in federated perception systems by jointly modeling intrinsic properties of the data and characteristics of the distributed environment. The proposed complexity metric integrates dataset attributes such as dimensionality, sparsity, and heterogeneity with factors related to the composition of participating clients. Using federated learning as a representative distributed training setting, we examine how learning difficulty varies across different federated configurations. Experiments on multiple variants of the MNIST dataset and CIFAR dataset show that the proposed metric strongly correlates with federated learning performance and the communication effort required to reach fixed accuracy targets. These findings suggest that complexity estimation can serve as a practical diagnostic tool for resource planning, dataset assessment, and feasibility evaluation in edge-deployed perception systems.
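The abstract does not give the metric's functional form; purely as an illustration of how dataset attributes and client composition might combine into one score, here is a hypothetical composite (the weights, terms, and heterogeneity measure are all assumptions):

```python
import math

def complexity_score(n_features, sparsity, client_label_dists, w=(0.4, 0.3, 0.3)):
    """Illustrative composite: normalized dimensionality, sparsity in [0, 1],
    and client heterogeneity (mean total-variation distance of each client's
    label distribution from the global mix), combined with weights w."""
    dim_term = math.log1p(n_features) / math.log1p(10_000)  # soft-capped
    k = len(client_label_dists[0])
    global_dist = [sum(d[i] for d in client_label_dists) / len(client_label_dists)
                   for i in range(k)]
    het = sum(0.5 * sum(abs(d[i] - global_dist[i]) for i in range(k))
              for d in client_label_dists) / len(client_label_dists)
    return w[0] * dim_term + w[1] * sparsity + w[2] * het
```

The point of such a score is the ordering it induces: more skewed client label distributions should raise the estimate, signaling a harder federated task.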
[944] FI-KAN: Fractal Interpolation Kolmogorov-Arnold Networks
Gnankan Landry Regis N’guessan
Main category: cs.LG
TL;DR: FI-KAN introduces fractal interpolation functions into Kolmogorov-Arnold Networks to better approximate non-smooth functions, outperforming standard KAN on rough targets while providing interpretable fractal dimension control.
Details
Motivation: Standard KANs use B-spline bases on fixed grids, which lack intrinsic multi-scale decomposition for approximating non-smooth functions with varying regularity. There's a need for neural architectures that can adapt to target function regularity, especially for rough functions and fractal patterns.
Method: Two variants: Pure FI-KAN replaces B-splines entirely with learnable fractal interpolation function (FIF) bases from iterated function system theory. Hybrid FI-KAN retains B-spline path and adds learnable fractal correction. Both use IFS contraction parameters that give each edge a differentiable fractal dimension that adapts during training.
Result: On the Hölder regularity benchmark (α ∈ [0.2, 2.0]), Hybrid FI-KAN outperforms KAN at every regularity level (1.3x to 33x). On fractal targets, FI-KAN achieves up to 6.3x MSE reduction over KAN, maintaining a 4.7x advantage at 5 dB SNR. On non-smooth PDE solutions, Hybrid FI-KAN achieves up to 79x improvement on rough-coefficient diffusion and 3.5x on L-shaped domain corner singularities.
Conclusion: Regularity-matched basis design is a principled strategy for neural function approximation. FI-KAN demonstrates that basis geometry must match target regularity, with fractal dimension regularizer providing interpretable complexity control that recovers true fractal dimensions.
Abstract: Kolmogorov-Arnold Networks (KAN) employ B-spline bases on a fixed grid, providing no intrinsic multi-scale decomposition for non-smooth function approximation. We introduce Fractal Interpolation KAN (FI-KAN), which incorporates learnable fractal interpolation function (FIF) bases from iterated function system (IFS) theory into KAN. Two variants are presented: Pure FI-KAN (Barnsley, 1986) replaces B-splines entirely with FIF bases; Hybrid FI-KAN (Navascués, 2005) retains the B-spline path and adds a learnable fractal correction. The IFS contraction parameters give each edge a differentiable fractal dimension that adapts to target regularity during training. On a Hölder regularity benchmark ($\alpha \in [0.2, 2.0]$), Hybrid FI-KAN outperforms KAN at every regularity level (1.3x to 33x). On fractal targets, FI-KAN achieves up to 6.3x MSE reduction over KAN, maintaining a 4.7x advantage at 5 dB SNR. On non-smooth PDE solutions (scikit-fem), Hybrid FI-KAN achieves up to 79x improvement on rough-coefficient diffusion and 3.5x on L-shaped domain corner singularities. Pure FI-KAN’s complementary behavior, dominating on rough targets while underperforming on smooth ones, provides controlled evidence that basis geometry must match target regularity. A fractal dimension regularizer provides interpretable complexity control whose learned values recover the true fractal dimension of each target. These results establish regularity-matched basis design as a principled strategy for neural function approximation.
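A fractal interpolation function can be sampled by random iteration of its IFS maps; the sketch below follows the classical Barnsley construction for fixed data points (not the paper's learnable FIF bases), with vertical scaling factors alpha_i controlling roughness:

```python
import numpy as np

def fif_points(xs, ys, alphas, n_iter=20_000, seed=0):
    """Approximate the graph of a fractal interpolation function (Barnsley)
    through data (xs, ys) with vertical scaling factors alphas (|alpha|<1),
    by random iteration of the IFS maps
        (x, y) -> (a_i x + b_i, alpha_i y + c_i x + d_i),
    each constructed so segment i carries (x0, y0) -> (x_{i-1}, y_{i-1})
    and (xN, yN) -> (x_i, y_i)."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    x0, xN, y0, yN = xs[0], xs[-1], ys[0], ys[-1]
    rng = np.random.default_rng(seed)
    pts = []
    x, y = x0, y0
    for _ in range(n_iter):
        i = rng.integers(1, len(xs))  # pick one map uniformly
        a = (xs[i] - xs[i - 1]) / (xN - x0)
        b = xs[i - 1] - a * x0
        c = (ys[i] - ys[i - 1] - alphas[i - 1] * (yN - y0)) / (xN - x0)
        d = ys[i - 1] - alphas[i - 1] * y0 - c * x0
        x, y = a * x + b, alphas[i - 1] * y + c * x + d
        pts.append((x, y))
    return np.array(pts[100:])  # drop burn-in
```

With all alphas equal to zero the attractor degenerates to the piecewise-linear interpolant; nonzero alphas add self-similar roughness, which is the regularity knob FI-KAN makes differentiable.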
[945] OptINC: Optical In-Network-Computing for Scalable Distributed Learning
Sijie Fei, Grace Li Zhang, Bing Li, Ulf Schlichtmann
Main category: cs.LG
TL;DR: Optical In-Network Computing architecture for distributed learning that offloads gradient averaging and quantization to optical interconnects using optical neural networks, eliminating communication overhead while maintaining training accuracy.
Details
Motivation: Existing distributed learning communication algorithms like ring all-reduce cause heavy communication overhead between servers. Since large-scale systems use optical fibers, there's an opportunity to offload computation to optical interconnects to reduce this overhead.
Method: Proposes OptINC architecture incorporating optical devices (MZIs) into interconnects to create optical neural networks for gradient averaging and quantization. Includes preprocessing algorithm for dataset complexity reduction, approximates weight matrices with unitary/diagonal matrices to lower hardware cost, and uses hardware-aware training algorithm to maintain accuracy.
Result: Evaluated on ResNet50 on CIFAR-100 and LLaMA-based network on Wikipedia-1B, achieving comparable training accuracy to ring all-reduce baseline while eliminating communication overhead.
Conclusion: Optical in-network computing can effectively reduce communication overhead in distributed training systems while maintaining model accuracy, offering a promising hardware-software co-design approach for large-scale distributed learning.
Abstract: Distributed learning is widely used for training large models on large datasets by distributing parts of the model or dataset across multiple devices and aggregating the computed results for subsequent computations or parameter updates. Existing communication algorithms for distributed learning such as ring all-reduce result in heavy communication overhead between servers. Since communication in large-scale systems uses optical fibers, we propose an Optical In-Network-Computing (OptINC) architecture to offload the computation in servers onto the optical interconnects. To execute gradient averaging and quantization in the optical domain, we incorporate optical devices such as Mach-Zehnder-Interferometers (MZIs) into the interconnects. Such a de facto optical neural network (ONN) can effectively reduce the communication overhead in existing distributed training solutions. To reduce dataset complexity for training this neural network, a preprocessing algorithm implemented in the optical domain is also proposed. Hardware cost is lowered by approximating the weight matrices of the optical neural network with unitary and diagonal matrices, while the accuracy is maintained by a proposed hardware-aware training algorithm. The proposed solution was evaluated on real distributed learning tasks, including ResNet50 on CIFAR-100, and a LLaMA-based network on Wikipedia-1B. In both cases, the proposed framework can achieve comparable training accuracy to the ring all-reduce baseline, while eliminating communication overhead.
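The unitary-diagonal approximation has a standard linear-algebra reading: an SVD factors any weight matrix into unitary factors (realizable as MZI meshes) and one diagonal factor (realizable as attenuators). Whether OptINC uses exactly this factorization is not stated in the abstract; the sketch below assumes it as the canonical example:

```python
import numpy as np

def unitary_diagonal_factors(W, rank=None):
    """Factor W ~= U @ diag(s) @ Vh via SVD: U and Vh are unitary and
    diag(s) is diagonal. Optionally truncate to `rank` singular values to
    trade reconstruction accuracy for hardware cost."""
    U, s, Vh = np.linalg.svd(W, full_matrices=False)
    if rank is not None:
        U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    return U, s, Vh
```

At full rank the factorization is exact; truncation is one way to "lower hardware cost" at the price of approximation error.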
[946] NeiGAD: Augmenting Graph Anomaly Detection via Spectral Neighbor Information
Qing Qing, Huafei Huang, Mingliang Hou, Renqiang Luo, Mohsen Guizani
Main category: cs.LG
TL;DR: NeiGAD is a plug-and-play spectral graph analysis module for graph anomaly detection that explicitly models neighbor information through eigenvectors to amplify anomaly signals.
Details
Motivation: Current GNN-based graph anomaly detection methods fail to explicitly model neighbor information effects and interactions with attributes, limiting detection performance despite neighbor information being essential for distinguishing anomalies.
Method: Uses spectral graph analysis to capture neighbor information through eigenvectors of adjacency matrix, which encode local neighbor interactions and amplify anomaly signals. Selects compact set of eigenvectors to construct efficient discriminative representations.
Result: Experiments on eight real-world datasets show NeiGAD consistently improves detection accuracy and outperforms state-of-the-art GAD methods.
Conclusion: Demonstrates importance of explicit neighbor modeling and effectiveness of spectral analysis in anomaly detection, with NeiGAD serving as an effective plug-and-play module.
Abstract: Graph anomaly detection (GAD) aims to identify irregular nodes or structures in attributed graphs. Neighbor information, which reflects both structural connectivity and attribute consistency with surrounding nodes, is essential for distinguishing anomalies from normal patterns. Although recent graph neural network (GNN)-based methods incorporate such information through message passing, they often fail to explicitly model its effect or interaction with attributes, limiting detection performance. This work introduces NeiGAD, a novel plug-and-play module that captures neighbor information through spectral graph analysis. Theoretical insights demonstrate that eigenvectors of the adjacency matrix encode local neighbor interactions and progressively amplify anomaly signals. Based on this, NeiGAD selects a compact set of eigenvectors to construct efficient and discriminative representations. Experiments on eight real-world datasets show that NeiGAD consistently improves detection accuracy and outperforms state-of-the-art GAD methods. These results demonstrate the importance of explicit neighbor modeling and the effectiveness of spectral analysis in anomaly detection. Code is available at: https://github.com/huafeihuang/NeiGAD.
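The core spectral step, augmenting node attributes with a compact set of adjacency eigenvectors, can be sketched in a few lines. The selection rule used here (largest eigenvalues by magnitude) is an assumption; NeiGAD's actual criterion may differ:

```python
import numpy as np

def spectral_neighbor_features(A, X, k=2):
    """Augment node attributes X with the k leading eigenvectors (by
    |eigenvalue|) of the symmetric adjacency matrix A, a compact encoding
    of local neighbor structure in the spirit of spectral GAD modules."""
    vals, vecs = np.linalg.eigh(np.asarray(A, dtype=float))
    order = np.argsort(-np.abs(vals))[:k]
    return np.concatenate([X, vecs[:, order]], axis=1)
```

Being plug-and-play, such a step simply widens the feature matrix fed to any downstream detector.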
[947] LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models
Chanyoung Kim, Minwoo Kim, Minseok Kang, Hyunwoo Kim, Dahuin Jung
Main category: cs.LG
TL;DR: A benchmark study showing that Vision-Language-Action models suffer significant performance degradation (22-52 percentage points) when faced with paraphrased instructions, primarily due to object-level lexical variation rather than semantic understanding.
Details
Motivation: Current VLA models are fine-tuned with limited data in robotic settings, leading to overfitting to specific instruction formulations and poor robustness to paraphrased instructions. The research aims to systematically study this linguistic generalization gap.
Method: Introduces LIBERO-Para benchmark that independently varies action expressions and object references for fine-grained analysis. Tests seven VLA configurations (0.6B-7.5B) and proposes PRIDE metric to quantify paraphrase difficulty using semantic and syntactic factors.
Result: Models show 22-52 percentage point performance degradation under paraphrasing. Object-level lexical variation (simple synonym substitutions) causes large drops, indicating reliance on surface-level matching. 80-96% of failures arise from planning-level trajectory divergence rather than execution errors.
Conclusion: VLA models lack robust linguistic generalization, relying on surface-level pattern matching rather than semantic grounding. The proposed benchmark and PRIDE metric provide tools for better evaluation of paraphrase robustness in robotic manipulation tasks.
Abstract: Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para
[948] Taming the Instability: A Robust Second-Order Optimizer for Federated Learning over Non-IID Data
Yuanqiao Zhang, Tiantian He, Yuan Gao, Yixin Wang, Yew-Soon Ong, Maoguo Gong, A. K. Qin, Hui Li
Main category: cs.LG
TL;DR: FedRCO is a novel second-order federated learning optimization framework that improves convergence speed and reduces communication costs under statistical heterogeneity, addressing computational expense and numerical instability issues in distributed settings.
Details
Motivation: Existing second-order optimization methods in federated learning are computationally expensive and numerically unstable in distributed settings, especially under statistical heterogeneity. The authors aim to develop a more robust and efficient optimization framework that can handle these challenges while improving convergence speed and reducing communication costs.
Method: FedRCO integrates an efficient approximate curvature optimizer with a provable stability mechanism through three key components: 1) Gradient Anomaly Monitor for real-time detection and mitigation of exploding gradients, 2) Fail-Safe Resilience protocol that resets optimization states upon numerical instability, and 3) Curvature-Preserving Adaptive Aggregation strategy that safely integrates global knowledge while preserving local curvature geometry.
Result: Theoretical analysis shows FedRCO effectively mitigates instability and prevents unbounded updates while preserving optimization efficiency. Extensive experiments demonstrate superior robustness against diverse non-IID scenarios, achieving higher accuracy and faster convergence than both state-of-the-art first-order and second-order methods.
Conclusion: FedRCO provides a robust and efficient second-order optimization framework for federated learning that addresses key challenges of computational expense and numerical instability in distributed settings, offering practical benefits for real-world federated learning applications with statistical heterogeneity.
Abstract: In this paper, we present Federated Robust Curvature Optimization (FedRCO), a novel second-order optimization framework designed to improve convergence speed and reduce communication cost in Federated Learning systems under statistical heterogeneity. Existing second-order optimization methods are often computationally expensive and numerically unstable in distributed settings. In contrast, FedRCO addresses these challenges by integrating an efficient approximate curvature optimizer with a provable stability mechanism. Specifically, FedRCO incorporates three key components: (1) a Gradient Anomaly Monitor that detects and mitigates exploding gradients in real-time, (2) a Fail-Safe Resilience protocol that resets optimization states upon numerical instability, and (3) a Curvature-Preserving Adaptive Aggregation strategy that safely integrates global knowledge without erasing the local curvature geometry. Theoretical analysis shows that FedRCO can effectively mitigate instability and prevent unbounded updates while preserving optimization efficiency. Extensive experiments show that FedRCO achieves superior robustness against diverse non-IID scenarios while attaining higher accuracy and faster convergence than both state-of-the-art first-order and second-order methods.
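Component (1), the Gradient Anomaly Monitor, admits a simple generic form: track recent gradient norms and flag k-sigma spikes so a fail-safe can reset optimizer state. The window size, warm-up length, and threshold rule below are illustrative assumptions, not FedRCO's specification:

```python
import statistics

class GradientAnomalyMonitor:
    """Flag a gradient-norm spike when it exceeds mean + k*std of recent
    history; callers can reset optimization state when it fires (the
    fail-safe). Flagged values are kept out of the running statistics."""
    def __init__(self, k=3.0, window=50):
        self.k, self.window = k, window
        self.history = []

    def check(self, grad_norm):
        anomalous = False
        if len(self.history) >= 5:  # warm-up before thresholding
            mu = statistics.mean(self.history)
            sd = statistics.pstdev(self.history) or 1e-12
            anomalous = grad_norm > mu + self.k * sd
        if not anomalous:  # keep the statistics clean of outliers
            self.history = (self.history + [grad_norm])[-self.window:]
        return anomalous
```

After a flagged spike, normal gradient norms are accepted again because the spike never contaminated the history.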
[949] FairGC: Fairness-aware Graph Condensation
Yihan Gao, Chenxi Huang, Wen Shi, Ke Sun, Ziqi Xu, Xikun Zhang, Mingliang Hou, Renqiang Luo
Main category: cs.LG
TL;DR: FairGC introduces fairness-aware graph condensation that embeds fairness constraints into the graph distillation process to prevent bias amplification in synthetic graph datasets.
Details
Motivation: Current graph condensation methods focus on utility and predictive accuracy but ignore fairness constraints, often capturing and amplifying demographic disparities from original data, making them unsuitable for sensitive applications like credit scoring or social recommendations.Method: Three key components: 1) Distribution-Preserving Condensation module synchronizes joint distributions of labels and sensitive attributes, 2) Spectral Encoding module uses Laplacian eigen-decomposition to preserve global structural patterns, 3) Fairness-Enhanced Neural Architecture with multi-domain fusion and label-smoothing curriculum.
Result: Rigorous evaluations on four real-world datasets show FairGC provides superior balance between accuracy and fairness, significantly reducing disparity in Statistical Parity and Equal Opportunity compared to state-of-the-art condensation models.
Conclusion: FairGC successfully embeds fairness directly into graph distillation, producing equitable synthetic proxies suitable for sensitive applications while maintaining predictive accuracy.
Abstract: Graph condensation (GC) has become a vital strategy for scaling Graph Neural Networks by compressing massive datasets into small, synthetic node sets. While current GC methods effectively maintain predictive accuracy, they are primarily designed for utility and often ignore fairness constraints. Because these techniques are bias-blind, they frequently capture and even amplify demographic disparities found in the original data. This leads to synthetic proxies that are unsuitable for sensitive applications like credit scoring or social recommendations. To solve this problem, we introduce FairGC, a unified framework that embeds fairness directly into the graph distillation process. Our approach consists of three key components. First, a Distribution-Preserving Condensation module synchronizes the joint distributions of labels and sensitive attributes to stop bias from spreading. Second, a Spectral Encoding module uses Laplacian eigen-decomposition to preserve essential global structural patterns. Finally, a Fairness-Enhanced Neural Architecture employs multi-domain fusion and a label-smoothing curriculum to produce equitable predictions. Rigorous evaluations on four real-world datasets show that FairGC provides a superior balance between accuracy and fairness. Our results confirm that FairGC significantly reduces disparity in Statistical Parity and Equal Opportunity compared to existing state-of-the-art condensation models. The code is available at https://github.com/LuoRenqiang/FairGC.
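The two fairness gaps reported, Statistical Parity and Equal Opportunity, have standard definitions that are worth writing out (binary predictions and a binary sensitive attribute assumed):

```python
def statistical_parity_diff(y_pred, sensitive):
    """|P(yhat=1 | s=0) - P(yhat=1 | s=1)|: the Statistical Parity gap."""
    def rate(s):
        grp = [p for p, a in zip(y_pred, sensitive) if a == s]
        return sum(grp) / max(1, len(grp))
    return abs(rate(0) - rate(1))

def equal_opportunity_diff(y_pred, y_true, sensitive):
    """|TPR(s=0) - TPR(s=1)|: the Equal Opportunity gap, restricted to
    truly positive examples."""
    def tpr(s):
        grp = [p for p, t, a in zip(y_pred, y_true, sensitive)
               if t == 1 and a == s]
        return sum(grp) / max(1, len(grp))
    return abs(tpr(0) - tpr(1))
```

A condensation method that amplifies bias shows up as larger gaps when models trained on the synthetic graph are scored with these two quantities.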
[950] Physics-Informed Neural Networks for Predicting Hydrogen Sorption in Geological Formations: Thermodynamically Constrained Deep Learning Integrating Classical Adsorption Theory
Mohammad Nooraiepour, Mohammad Masoudi, Zezhang Song, Helge Hellevang
Main category: cs.LG
TL;DR: A physics-informed neural network framework for predicting hydrogen sorption in geological materials that embeds classical adsorption theory and thermodynamic constraints to improve cross-lithology generalization.
Details
Motivation: Classical isotherm models work well for individual samples but fail to generalize across heterogeneous geological populations, with R² dropping dramatically from 0.80-0.90 to 0.09-0.38 when applied to multi-sample datasets. Accurate hydrogen sorption prediction is crucial for underground hydrogen storage evaluation.
Method: Multi-scale physics-informed neural network framework with: 1) 7-category physics-informed feature engineering generating 62 thermodynamic descriptors, 2) loss function enforcing saturation limits, monotonic pressure response, and Van’t Hoff temperature dependence, 3) three-phase curriculum-based training strategy, and 4) architecture-diverse ensemble of 10 members for uncertainty quantification.
Result: Achieved R² = 0.9544, RMSE = 0.0484 mmol/g, and MAE = 0.0231 mmol/g on test set with 98.6% monotonicity satisfaction and zero non-physical negative predictions. Physics-informed regularization provided 10-15% cross-lithology generalization advantage over random forest in leave-one-lithology-out validation.
Conclusion: Physics-informed neural networks with embedded thermodynamic constraints significantly improve hydrogen sorption prediction accuracy and cross-lithology generalization compared to classical models and machine learning approaches without physical constraints.
Abstract: Accurate prediction of hydrogen sorption in fine-grained geological materials is essential for evaluating underground hydrogen storage capacity, assessing caprock integrity, and characterizing hydrogen migration in subsurface energy systems. Classical isotherm models perform well at the individual-sample level but fail when generalized across heterogeneous populations, with the coefficient of determination collapsing from 0.80-0.90 for single-sample fits to 0.09-0.38 for aggregated multi-sample datasets. We present a multi-scale physics-informed neural network framework that addresses this limitation by embedding classical adsorption theory and thermodynamic constraints directly into the learning process. The framework utilizes 1,987 hydrogen sorption isotherm measurements across clays, shales, and coals, supplemented by 224 characteristic uptake measurements. A seven-category physics-informed feature engineering scheme generates 62 thermodynamically meaningful descriptors from raw material characterization data. The loss function enforces saturation limits, a monotonic pressure response, and Van’t Hoff temperature dependence via penalty weighting, while a three-phase curriculum-based training strategy ensures stable integration of competing physical constraints. An architecture-diverse ensemble of ten members provides calibrated uncertainty quantification, with post-hoc temperature scaling achieving target prediction interval coverage. The optimized PINN achieves R² = 0.9544, RMSE = 0.0484 mmol/g, and MAE = 0.0231 mmol/g on the held-out test set, with 98.6% monotonicity satisfaction and zero non-physical negative predictions. Physics-informed regularization yields a 10-15% cross-lithology generalization advantage over a well-tuned random forest under leave-one-lithology-out validation, confirming that thermodynamic constraints transfer meaningfully across geological boundaries.
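The constraints described above (saturation limit, non-negativity, monotonic pressure response) are typically enforced as penalty terms added to the data loss. A minimal sketch of such a penalty, with the Van't Hoff term and the per-term weighting omitted (the function name and form are illustrative, not the authors' implementation):

```python
def physics_penalty(pressures, predictions, q_max):
    """Soft physics penalties of the kind the abstract describes:
    - predicted sorption must not exceed the saturation limit q_max,
    - predictions must be non-negative,
    - sorption must be non-decreasing in pressure at fixed temperature."""
    saturation = sum(max(0.0, q - q_max) ** 2 for q in predictions)
    negativity = sum(min(0.0, q) ** 2 for q in predictions)
    # sort by pressure, then penalize any decrease between neighbors
    pairs = sorted(zip(pressures, predictions))
    monotonic = sum(
        max(0.0, pairs[i][1] - pairs[i + 1][1]) ** 2
        for i in range(len(pairs) - 1)
    )
    return saturation + negativity + monotonic
```

A prediction set that respects all three constraints incurs zero penalty, so the penalty only steers the network away from non-physical outputs.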
[951] Key-Embedded Privacy for Decentralized AI in Biomedical Omics
Rongyu Zhang, Hongyu Dong, Gaole Dai, Ziqi Qiao, Shenli Zheng, Yuan Zhang, Aosong Cheng, Xiaowei Chi, Jincai Luo, Pin Li, Li Du, Dan Wang, Yuan Du, Xudong Xing, Jianxu Chen, Shanghang Zhang
Main category: cs.LG
TL;DR: INFL: A lightweight federated learning method using Implicit Neural Representations for privacy-preserving biomedical AI that maintains utility while protecting sensitive data across heterogeneous sites.
Details
Motivation: Privacy concerns in biomedicine limit data sharing and hinder assembly of representative cohorts for clinically relevant AI. Current privacy solutions like cryptographic methods have heavy overhead and differential privacy degrades performance, leading to sub-optimal real-world outcomes.
Method: INFL uses Implicit Neural Representations with plug-and-play, coordinate-conditioned modules integrated into client models. It embeds a secret key directly into the architecture and supports seamless aggregation across heterogeneous sites without sharing raw data.
Result: Across diverse biomedical omics tasks including cohort-scale classification in bulk proteomics, regression for perturbation prediction in single-cell transcriptomics, and clustering in spatial transcriptomics and multi-omics, INFL achieves strong, controllable privacy while maintaining utility comparable to non-private methods.
Conclusion: INFL provides a practical, efficient privacy solution for federated learning in biomedicine that balances privacy protection with model performance, enabling clinically relevant AI applications while addressing regulatory and governance concerns.
Abstract: The rapid adoption of data-driven methods in biomedicine has intensified concerns over privacy, governance, and regulation, limiting raw data sharing and hindering the assembly of representative cohorts for clinically relevant AI. This landscape necessitates practical, efficient privacy solutions, as cryptographic defenses often impose heavy overhead and differential privacy can degrade performance, leading to sub-optimal outcomes in real-world settings. Here, we present a lightweight federated learning method, INFL, based on Implicit Neural Representations that addresses these challenges. Our approach integrates plug-and-play, coordinate-conditioned modules into client models, embeds a secret key directly into the architecture, and supports seamless aggregation across heterogeneous sites. Across diverse biomedical omics tasks, including cohort-scale classification in bulk proteomics, regression for perturbation prediction in single-cell transcriptomics, and clustering in spatial transcriptomics and multi-omics with both public and private data, we demonstrate that INFL achieves strong, controllable privacy while maintaining utility, preserving the performance necessary for downstream scientific and clinical applications.
[952] Machine Learning-Assisted High-Dimensional Matrix Estimation
Wan Tian, Hui Yang, Zhouhui Lian, Lingyue Zhang, Yijie Peng
Main category: cs.LG
TL;DR: A machine learning-enhanced optimization approach for high-dimensional matrix estimation that combines LADMM with neural network proximal operators to improve accuracy and convergence speed.
Details
Motivation: Existing high-dimensional matrix estimation methods focus on theoretical properties but overlook computational challenges in high-dimensional settings. Recent advances in learning-based optimization that integrate data-driven structures with classical algorithms motivate this work.
Method: Proposes a reparameterized Linearized Alternating Direction Method of Multipliers (LADMM) where learnable parameters are introduced and proximal operators in the iterative scheme are modeled with neural networks.
Result: Theoretically proves convergence of LADMM and establishes convergence, convergence rate, and monotonicity of its reparameterized counterpart, showing faster convergence rate. Validates effectiveness across different structures and dimensions of high-dimensional matrices.
Conclusion: The proposed machine learning-assisted optimization framework improves both estimation accuracy and computational efficiency for high-dimensional matrix estimation problems, with theoretical guarantees and practical benefits.
Abstract: Efficient estimation of high-dimensional matrices, including covariance and precision matrices, is a cornerstone of modern multivariate statistics. Most existing studies have focused primarily on the theoretical properties of the estimators (e.g., consistency and sparsity), while largely overlooking the computational challenges inherent in high-dimensional settings. Motivated by recent advances in learning-based optimization methods, which integrate data-driven structures with classical optimization algorithms, we explore high-dimensional matrix estimation assisted by machine learning. Specifically, for the optimization problem of high-dimensional matrix estimation, we first present a solution procedure based on the Linearized Alternating Direction Method of Multipliers (LADMM). We then introduce learnable parameters and model the proximal operators in the iterative scheme with neural networks, thereby improving estimation accuracy and accelerating convergence. Theoretically, we first prove the convergence of LADMM, and then establish the convergence, convergence rate, and monotonicity of its reparameterized counterpart; importantly, we show that the reparameterized LADMM enjoys a faster convergence rate. Notably, the proposed reparameterization theory and methodology are applicable to the estimation of both high-dimensional covariance and precision matrices. We validate the effectiveness of our method by comparing it with several classical optimization algorithms across different structures and dimensions of high-dimensional matrices.
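To make the reparameterization idea concrete: in classical LADMM for sparse matrix estimation, the proximal step is a fixed soft-thresholding operator; the paper replaces such fixed operators with learnable modules. The sketch below contrasts the classical operator with a scalar-parameterized stand-in (illustrative; the paper learns neural networks, not two scalars):

```python
def soft_threshold(x, tau):
    """Classical proximal operator of the l1 norm, the fixed step used
    inside vanilla LADMM: shrink each entry toward zero by tau."""
    return [max(abs(v) - tau, 0.0) * (1.0 if v >= 0 else -1.0) for v in x]

def learned_soft_threshold(x, tau, alpha):
    """Reparameterized stand-in: tau and alpha would be learnable per
    iteration (a toy proxy for the paper's neural proximal operators)."""
    return [alpha * max(abs(v) - tau, 0.0) * (1.0 if v >= 0 else -1.0) for v in x]
```

With alpha = 1, the learned variant reduces to the classical operator, so training can only improve on the fixed scheme.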
[953] Critic-Free Deep Reinforcement Learning for Maritime Coverage Path Planning on Irregular Hexagonal Grids
Carlos S. Sepúlveda, Gonzalo A. Ruz
Main category: cs.LG
TL;DR: DRL-based coverage path planning for maritime surveillance using Transformer pointer policy and critic-free optimization on hexagonal grids.
Details
Motivation: Traditional coverage path planning methods struggle with irregular maritime environments (coastlines, islands, exclusion zones) and require computationally expensive re-planning for each instance. Need efficient real-time solutions for maritime surveillance missions.
Method: Deep Reinforcement Learning framework with Transformer-based pointer policy that autoregressively constructs coverage tours on hexagonal grid representations. Uses Group-Relative Policy Optimization (GRPO), a critic-free scheme that estimates advantages through within-instance comparisons of sampled trajectories.
Result: Achieves 99.0% Hamiltonian success rate (vs 46.0% for best heuristic), produces paths 7% shorter with 24% fewer heading changes. All inference modes operate under 50 ms per instance on laptop GPU, enabling real-time deployment.
Conclusion: DRL framework effectively solves coverage path planning in irregular maritime environments with high success rates and real-time performance, suitable for on-board deployment in surveillance missions.
Abstract: Maritime surveillance missions, such as search and rescue and environmental monitoring, rely on the efficient allocation of sensing assets over vast and geometrically complex areas. Traditional Coverage Path Planning (CPP) approaches depend on decomposition techniques that struggle with irregular coastlines, islands, and exclusion zones, or require computationally expensive re-planning for every instance. We propose a Deep Reinforcement Learning (DRL) framework to solve CPP on hexagonal grid representations of irregular maritime areas. Unlike conventional methods, we formulate the problem as a neural combinatorial optimization task where a Transformer-based pointer policy autoregressively constructs coverage tours. To overcome the instability of value estimation in long-horizon routing problems, we implement a critic-free Group-Relative Policy Optimization (GRPO) scheme. This method estimates advantages through within-instance comparisons of sampled trajectories rather than relying on a value function. Experiments on 1,000 unseen synthetic maritime environments demonstrate that a trained policy achieves a 99.0% Hamiltonian success rate, more than double the best heuristic (46.0%), while producing paths 7% shorter and with 24% fewer heading changes than the closest baseline. All three inference modes (greedy, stochastic sampling, and sampling with 2-opt refinement) operate under 50 ms per instance on a laptop GPU, confirming feasibility for real-time on-board deployment.
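The critic-free advantage estimation at the heart of GRPO fits in a few lines: sample several tours for the same instance and standardize each return against the group statistics (an illustrative sketch of the idea, not the authors' training code):

```python
def group_relative_advantages(returns):
    """GRPO-style critic-free advantages: each sampled trajectory's return
    is standardized against the other trajectories drawn for the same
    problem instance, so no learned value function is needed."""
    mean = sum(returns) / len(returns)
    std = (sum((r - mean) ** 2 for r in returns) / len(returns)) ** 0.5
    std = std if std > 0 else 1.0  # guard against identical returns
    return [(r - mean) / std for r in returns]
```

Trajectories that beat their group's mean get positive advantages and are reinforced; the rest are suppressed, replacing the unstable value estimates that long-horizon routing makes difficult.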
[954] Label-efficient Training Updates for Malware Detection over Time
Luca Minnei, Cristian Manca, Giorgio Piras, Angelo Sotgiu, Maura Pintor, Daniele Ghiani, Davide Maiorca, Giorgio Giacinto, Battista Biggio
Main category: cs.LG
TL;DR: A model-agnostic framework for evaluating active learning and semi-supervised learning techniques to address distribution drift in malware detection across Android and Windows platforms.
Details
Motivation: ML-based malware detectors degrade over time due to distribution drift in evolving software, but regular retraining is expensive due to costly manual labeling by security experts. Existing approaches lack model-agnostic comparisons and consistent drift analysis methodologies.
Method: Proposed a model-agnostic framework evaluating extensive AL and SSL techniques (isolated and combined) for Android and Windows malware detection. Introduced feature-level drift analysis methodology measuring feature stability over time.
Result: Combined AL and SSL techniques reduced manual annotation costs by up to 90% across both domains while achieving comparable detection performance to full-labeling retraining. Feature-level drift analysis showed correlation with detector performance.
Conclusion: The study provides detailed understanding of how AL and SSL behave under distribution drift and how they can be successfully combined, offering practical insights for designing effective malware detectors over time.
Abstract: Machine Learning (ML)-based detectors are becoming essential to counter the proliferation of malware. However, common ML algorithms are not designed to cope with the dynamic nature of real-world settings, where both legitimate and malicious software evolve. This distribution drift causes models trained under static assumptions to degrade over time unless they are continuously updated. Regularly retraining these models, however, is expensive, since labeling new acquired data requires costly manual analysis by security experts. To reduce labeling costs and address distribution drift in malware detection, prior work explored active learning (AL) and semi-supervised learning (SSL) techniques. Yet, existing studies (i) are tightly coupled to specific detector architectures and restricted to a specific malware domain, resulting in non-uniform comparisons; and (ii) lack a consistent methodology for analyzing the distribution drift, despite the critical sensitivity of the malware domain to temporal changes. In this work, we bridge this gap by proposing a model-agnostic framework that evaluates an extensive set of AL and SSL techniques, isolated and combined, for Android and Windows malware detection. We show that these techniques, when combined, can reduce manual annotation costs by up to 90% across both domains while achieving comparable detection performance to full-labeling retraining. We also introduce a methodology for feature-level drift analysis that measures feature stability over time, showing its correlation with the detector performance. Overall, our study provides a detailed understanding of how AL and SSL behave under distribution drift and how they can be successfully combined, offering practical insights for the design of effective detectors over time.
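One common way AL and SSL combine in this kind of pipeline: the least confident samples go to human analysts (active learning), while highly confident ones are pseudo-labeled (semi-supervised learning). The selection rules below are illustrative assumptions, not the paper's exact procedure:

```python
def split_by_confidence(probs, label_budget, pseudo_threshold):
    """Route each new sample by model confidence. probs are P(malware);
    the least confident samples (AL) are sent for manual labeling up to
    the budget, confident ones (SSL) are pseudo-labeled, and the rest are
    left unlabeled this round. Returns (to_label, to_pseudo) as sorted
    index lists."""
    conf = [max(p, 1.0 - p) for p in probs]  # binary malware/goodware confidence
    order = sorted(range(len(probs)), key=lambda i: conf[i])  # least confident first
    to_label = sorted(order[:label_budget])
    to_pseudo = sorted(i for i in order[label_budget:] if conf[i] >= pseudo_threshold)
    return to_label, to_pseudo
```

Tuning the budget and threshold trades annotation cost against pseudo-label noise, which is how combinations like this can cut manual labeling by large margins while keeping detection performance.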
[955] Mixture-Model Preference Learning for Many-Objective Bayesian Optimization
Manisha Dubey, Sebastiaan De Peuter, Wanrong Wang, Samuel Kaski
Main category: cs.LG
TL;DR: Bayesian framework for many-objective optimization learns latent preference archetypes via Dirichlet-process mixture, using hybrid queries to efficiently explore trade-offs and archetype identities.
Details
Motivation: Addresses challenges in preference-based many-objective optimization: expanding trade-off spaces and heterogeneous, context-dependent human value structures that can't be captured by a single utility function.
Method: Proposes Bayesian framework learning latent preference archetypes as components of Dirichlet-process mixture with uncertainty over archetypes and weights. Uses hybrid queries targeting both mode identity and within-mode trade-offs for efficient exploration.
Result: Method outperforms standard baselines on synthetic and real-world many-objective benchmarks. Provides simple regret guarantee under mild assumptions. Mixture-aware diagnostics reveal structure that regret alone fails to capture.
Conclusion: The proposed Bayesian framework effectively handles heterogeneous human preferences in many-objective optimization by learning latent preference archetypes and using efficient hybrid query strategies.
Abstract: Preference-based many-objective optimization faces two obstacles: an expanding space of trade-offs and heterogeneous, context-dependent human value structures. Towards this, we propose a Bayesian framework that learns a small set of latent preference archetypes rather than assuming a single fixed utility function, modelling them as components of a Dirichlet-process mixture with uncertainty over both archetypes and their weights. To query efficiently, we design hybrid queries that target information about (i) mode identity and (ii) within-mode trade-offs. Under mild assumptions, we provide a simple regret guarantee for the resulting mixture-aware Bayesian optimization procedure. Empirically, our method outperforms standard baselines on synthetic and real-world many-objective benchmarks, and mixture-aware diagnostics reveal structure that regret alone fails to capture.
[956] Evolutionary Discovery of Reinforcement Learning Algorithms via Large Language Models
Alkis Sygkounas, Amy Loutfi, Andreas Persson
Main category: cs.LG
TL;DR: Evolutionary framework discovers reinforcement learning algorithms by searching over executable update rules using LLMs as generative operators, excluding canonical mechanisms, with post-evolution hyperparameter refinement.
Details
Motivation: Current RL algorithms rely on hand-designed, fixed update rules. The authors aim to automate algorithm discovery by searching over executable update rules rather than manually designing them.
Method: Extends REvolve evolutionary system using LLMs as generative variation operators for algorithm discovery. Excludes canonical RL mechanisms (actor-critic, TD losses, value bootstrapping) to promote novel algorithms. Adds post-evolution refinement where LLM proposes hyperparameter ranges for each evolved rule.
Result: Discovered algorithms achieve competitive performance on multiple Gymnasium benchmarks compared to established baselines like SAC, PPO, DQN, and A2C.
Conclusion: Evolutionary search with LLMs can discover novel, effective RL algorithms without relying on canonical mechanisms, demonstrating automated algorithm design is feasible.
Abstract: Reinforcement learning algorithms are defined by their learning update rules, which are typically hand-designed and fixed. We present an evolutionary framework for discovering reinforcement learning algorithms by searching directly over executable update rules that implement complete training procedures. The approach builds on REvolve, an evolutionary system that uses large language models as generative variation operators, and extends it from reward-function discovery to algorithm discovery. To promote the emergence of nonstandard learning rules, the search excludes canonical mechanisms such as actor–critic structures, temporal-difference losses, and value bootstrapping. Because reinforcement learning algorithms are highly sensitive to internal scalar parameters, we introduce a post-evolution refinement stage in which a large language model proposes feasible hyperparameter ranges for each evolved update rule. Evaluated end-to-end by full training runs on multiple Gymnasium benchmarks, the discovered algorithms achieve competitive performance relative to established baselines, including SAC, PPO, DQN, and A2C.
[957] KGroups: A Versatile Univariate Max-Relevance Min-Redundancy Feature Selection Algorithm for High-dimensional Biological Data
Malick Ebiele, Malika Bendechache, Rob Brennan
Main category: cs.LG
TL;DR: Proposes KGroups, a new univariate filter feature selection algorithm that uses clustering for selection, achieving similar performance to multivariate mRMR but much faster.
Details
Motivation: Most feature selection research focuses on relevance/redundancy estimation, but limited work investigates alternative selection algorithms. The paper questions how much predictive performance depends on the selection algorithm versus relevance/redundancy estimations.
Method: KGroups is a univariate mRMR algorithm that employs clustering for feature selection instead of traditional sorting (KBest) or incremental search (mRMR). It’s parameterizable for hyperparameter tuning.
Result: On 14 high-dimensional biological datasets, KGroups achieves similar predictive performance to multivariate mRMR while being up to 821 times faster. It also outperforms KBest.
Conclusion: KGroups demonstrates that selection algorithms significantly impact feature selection performance, offering a fast, effective alternative to existing methods with room for further improvement through hyperparameter tuning.
Abstract: This paper proposes a new univariate filter feature selection (FFS) algorithm called KGroups. The majority of work in the literature focuses on investigating the relevance or redundancy estimations of feature selection (FS) methods. This has shown promising results and a real improvement of FFS methods’ predictive performance. However, limited efforts have been made to investigate alternative FFS algorithms. This raises the following question: how much of the FFS methods’ predictive performance depends on the selection algorithm rather than the relevance or the redundancy estimations? The majority of FFS methods fall into two categories: relevance maximisation (Max-Rel, also known as KBest) or simultaneous relevance maximisation and redundancy minimisation (mRMR). KBest is a univariate FFS algorithm that employs sorting (descending) for selection. mRMR is a multivariate FFS algorithm that employs an incremental search algorithm for selection. In this paper, we propose a new univariate mRMR called KGroups that employs clustering for selection. Extensive experiments on 14 high-dimensional biological benchmark datasets showed that KGroups achieves similar predictive performance compared to multivariate mRMR while being up to 821 times faster. KGroups is parameterisable, which leaves room for further predictive performance improvement through hyperparameter finetuning, unlike mRMR and KBest. KGroups outperforms KBest.
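The core idea, clustering for selection instead of sorting (KBest) or incremental search (mRMR), can be sketched as follows; the grouping rule below is an illustrative simplification, not the paper's exact algorithm:

```python
def kgroups_select(relevance, redundancy, k):
    """Illustrative sketch of clustering-based selection: partition the
    features into k groups by their redundancy scores (a crude round-robin
    stand-in for real clustering), then pick the most relevant feature from
    each group. The selected set is relevant yet mutually spread out, with
    no incremental search. Returns sorted feature indices."""
    idx = sorted(range(len(relevance)), key=lambda i: redundancy[i])
    groups = [idx[i::k] for i in range(k)]  # round-robin over sorted indices
    return sorted(max(g, key=lambda i: relevance[i]) for g in groups)
```

Because each feature is scored once and each group is scanned once, the selection is a single pass, which is where the large speedups over mRMR's quadratic incremental search would come from.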
[958] Spectral Higher-Order Neural Networks
Gianluca Peri, Timoteo Carletti, Duccio Fanelli, Diego Febbe
Main category: cs.LG
TL;DR: SHONNs introduce spectral higher-order neural networks that incorporate higher-order interactions in feedforward networks using spectral reformulation to address stability and parameter scaling issues.
Details
Motivation: Standard neural networks use binary interactions between units, while existing higher-order networks are limited to graph-structured inputs. There's a need for general-purpose feedforward networks that can incorporate higher-order interactions without stability and parameter scaling problems.
Method: SHONNs leverage a spectral reformulation of the model to incorporate higher-order interactions in feedforward networks. This spectral approach mitigates stability issues and parameter scaling problems associated with weighted higher-order forward propagations.
Result: The paper presents a new algorithmic strategy for incorporating higher-order interactions in general-purpose feedforward networks, addressing the limitations of existing higher-order network architectures.
Conclusion: SHONNs provide a novel approach to building higher-order neural networks that can work with general inputs (not just graph-structured data) while maintaining stability and manageable parameter scaling.
Abstract: Neural networks are fundamental tools of modern machine learning. The standard paradigm assumes binary interactions (across feedforward linear passes) between inter-tangled units, organized in sequential layers. Generalized architectures have also been designed that move beyond pairwise interactions, so as to account for higher-order couplings among computing neurons. Higher-order networks are, however, usually deployed as augmented graph neural networks (GNNs), and, as such, prove solely advantageous in contexts where the input exhibits an explicit hypergraph structure. Here, we present Spectral Higher-Order Neural Networks (SHONNs), a new algorithmic strategy to incorporate higher-order interactions in general-purpose, feedforward, network structures. SHONNs leverage a reformulation of the model in terms of spectral attributes. This mitigates the common stability and parameter-scaling problems that accompany weighted, higher-order forward propagations.
[959] FeDMRA: Federated Incremental Learning with Dynamic Memory Replay Allocation
Tiantian Wang, Xiang Xiang, Simon S. Du
Main category: cs.LG
TL;DR: A dynamic memory allocation strategy for federated class-incremental learning in healthcare systems that addresses non-IID data challenges through adaptive exemplar storage based on data heterogeneity and fairness considerations.
Details
Motivation: Traditional continual learning methods fail in federated healthcare systems where data across distributed clients exhibits non-IID characteristics, requiring new approaches that balance privacy preservation with effective incremental learning while mitigating catastrophic forgetting.
Method: Proposes a dynamic memory allocation strategy for exemplar storage based on data replay mechanism that leverages data heterogeneity, considers performance fairness across all clients, and rationally allocates limited storage resources among clients rather than using fixed allocation.
Result: Extensive experiments on three medical image datasets demonstrate significant performance improvements compared to existing baseline models in federated class-incremental learning scenarios.
Conclusion: The proposed dynamic memory allocation strategy effectively addresses non-IID challenges in federated healthcare systems, improves model performance through rational resource allocation, and establishes a balanced solution to mitigate catastrophic forgetting while considering client fairness.
Abstract: In federated healthcare systems, Federated Class-Incremental Learning (FCIL) has emerged as a key paradigm, enabling continuous adaptive model learning among distributed clients while safeguarding data privacy. However, in practical applications, data across agent nodes within the distributed framework often exhibits non-independent and identically distributed (non-IID) characteristics, rendering traditional continual learning methods inapplicable. To address these challenges, this paper covers more comprehensive incremental task scenarios and proposes a dynamic memory allocation strategy for exemplar storage based on the data replay mechanism. This strategy fully taps into the inherent potential of data heterogeneity, while taking into account the performance fairness of all participating clients, thereby establishing a balanced and adaptive solution to mitigate catastrophic forgetting. Unlike the fixed allocation of client exemplar memory, the proposed scheme emphasizes the rational allocation of limited storage resources among clients to improve model performance. Furthermore, extensive experiments are conducted on three medical image datasets, and the results demonstrate significant performance improvements compared to existing baseline models.
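A minimal sketch of what dynamic (rather than fixed) exemplar allocation might look like, with heterogeneity-proportional slots as an assumed stand-in for the paper's criterion, which additionally accounts for cross-client fairness:

```python
def allocate_memory(heterogeneity, total_slots):
    """Illustrative dynamic exemplar-memory allocation: clients whose local
    data is more heterogeneous receive proportionally more replay slots
    from a fixed global budget (the proportional rule is an assumption,
    not the paper's exact scheme)."""
    total = sum(heterogeneity)
    slots = [int(total_slots * h / total) for h in heterogeneity]
    # hand out rounding leftovers to the most heterogeneous clients first
    leftover = total_slots - sum(slots)
    for i in sorted(range(len(slots)), key=lambda i: -heterogeneity[i])[:leftover]:
        slots[i] += 1
    return slots
```

A fixed scheme would give every client the same share; tying the share to heterogeneity spends the limited replay budget where forgetting is most likely.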
[960] HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
Yufei Xu, Fanxu Meng, Fan Jiang, Yuxuan Wang, Ruijie Zhou, Jiexi Wu, Zhixin Pan, Zhaohui Wang, Xiaojuan Tang, Wenjie Pei, Tongxuan Liu, Di yin, Xing Sun, Muhan Zhang
Main category: cs.LG
TL;DR: HISA is a hierarchical indexing method that speeds up token-level sparse attention by first filtering at block level, then refining at token level, achieving 2-4× speedups with minimal quality loss.
Details
Motivation: Token-level sparse attention mechanisms like DeepSeek Sparse Attention (DSA) still have O(L²) bottlenecks in their indexer component when scanning the entire prefix for each query, which becomes prohibitive as context length grows.
Method: HISA transforms the search process into a two-stage hierarchical procedure: 1) block-level coarse filter scores pooled block representatives to prune irrelevant regions, 2) token-level refinement applies the original indexer only within remaining candidate blocks.
Result: HISA achieves 2× speedup at 32K context length and 4× at 128K on kernel-level benchmarks. On Needle-in-a-Haystack and LongBench, it closely matches original DSA quality while outperforming block-sparse baselines, with >99% mean IoU in token selection.
Conclusion: HISA provides an efficient drop-in replacement for sparse attention indexers that preserves token-level sparsity patterns, requires no additional training, and offers significant speedups for long-context processing.
Abstract: Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical token for each query using a lightweight indexer, and then computing attention only over the selected subset. While the downstream sparse attention scales efficiently, the indexer still scans the entire prefix for every query, introducing an O($L^2$) per-layer bottleneck that becomes prohibitive as context length grows. We propose HISA (Hierarchical Indexed Sparse Attention), a drop-in replacement for the indexer that transforms the search process from a flat token scan into a two-stage hierarchical procedure. First, a block-level coarse filter scores pooled block representatives to prune irrelevant regions. Then, a token-level refinement applies the original indexer only within the remaining candidate blocks. HISA preserves the exact token-level top-k sparsity pattern required by the downstream Sparse MLA operator and requires no additional training. On kernel-level benchmarks, HISA achieves a 2$\times$ speedup at 32K context length and 4$\times$ at 128K. On Needle-in-a-Haystack and LongBench, we directly replace the indexer in DeepSeek-V3.2 with HISA, without any fine-tuning. HISA closely matches the original DSA in quality while significantly outperforming block-sparse baselines. Moreover, the token selection sets produced by HISA and the original DSA exhibit a mean IoU greater than 99%, indicating that the efficiency gains come with virtually no impact on selection fidelity.
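The two-stage procedure maps naturally onto a short reference implementation over raw indexer scores (a sketch of the selection logic only; the actual HISA kernel works on GPU with pooled block representatives rather than precomputed per-token scores):

```python
def hierarchical_topk(scores, block_size, num_blocks_keep, k):
    """HISA-style two-stage selection: score each block by its max-pooled
    token score, keep the best blocks (coarse filter), then take the
    token-level top-k only inside the surviving blocks (refinement).
    Returns sorted token indices."""
    blocks = [scores[i:i + block_size] for i in range(0, len(scores), block_size)]
    block_scores = [(max(b), j) for j, b in enumerate(blocks)]
    kept = sorted(j for _, j in sorted(block_scores, reverse=True)[:num_blocks_keep])
    candidates = [(scores[t], t) for j in kept
                  for t in range(j * block_size, min((j + 1) * block_size, len(scores)))]
    return sorted(t for _, t in sorted(candidates, reverse=True)[:k])
```

The coarse stage touches one score per block instead of one per token, which is where the reported 2-4x kernel speedups come from; the refinement stage then recovers the exact token-level top-k pattern within the kept blocks.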
[961] Next-Token Prediction and Regret Minimization
Mehryar Mohri, Clayton Sanford, Jon Schneider, Kiran Vodrahalli, Yifan Wu
Main category: cs.LG
TL;DR: Next-token prediction models can achieve low adversarial regret in online decision-making when trained on appropriate opponent action distributions, with unbounded context windows enabling robustification but bounded contexts having limitations.
Details
Motivation: The paper investigates how next-token prediction algorithms (like those used in modern LLMs) can be applied to adversarial online decision-making environments, examining when such models can achieve low regret against opponents.
Method: Theoretical analysis comparing unbounded vs bounded context windows for next-token prediction models in adversarial settings. For unbounded contexts, shows every distribution is close to a low-regret one. For bounded contexts, identifies distributions far from any low-regret distribution. Also demonstrates transformer implementation feasibility.
Result: Unbounded context windows allow sublinear regret with negligible accuracy cost to original prediction model. Bounded contexts have fundamental limitations with some distributions being Θ(1)-far from low-regret distributions. Transformer architectures can implement the robustification procedure.
Conclusion: Next-token prediction models can be effectively used for adversarial online decision-making with proper context window considerations, and transformer architectures are suitable for implementing these low-regret strategies.
Abstract: We consider the question of how to employ next-token prediction algorithms in adversarial online decision-making environments. Specifically, if we train a next-token prediction model on a distribution $\mathcal{D}$ over sequences of opponent actions, when is it the case that the induced online decision-making algorithm (by approximately best responding to the model’s predictions) has low adversarial regret (i.e., when is $\mathcal{D}$ a \emph{low-regret distribution})? For unbounded context windows (where the prediction made by the model can depend on all the actions taken by the adversary thus far), we show that although not every distribution $\mathcal{D}$ is a low-regret distribution, every distribution $\mathcal{D}$ is exponentially close (in TV distance) to one low-regret distribution, and hence sublinear regret can always be achieved at negligible cost to the accuracy of the original next-token prediction model. In contrast to this, for bounded context windows (where the prediction made by the model can depend only on the past $w$ actions taken by the adversary, as may be the case in modern transformer architectures), we show that there are some distributions $\mathcal{D}$ of opponent play that are $\Theta(1)$-far from any low-regret distribution $\mathcal{D}'$ (even when $w = \Omega(T)$ and such distributions exist). Finally, we complement these results by showing that the unbounded context robustification procedure can be implemented by layers of a standard transformer architecture, and provide empirical evidence that transformer models can be efficiently trained to represent these new low-regret distributions.
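For readers new to the quantity being bounded, external (adversarial) regret is simply the payoff of the best fixed action in hindsight minus the payoff the played sequence achieved. A minimal illustration, with a hypothetical matching-style payoff function:

```python
def external_regret(action_set, my_actions, opp_actions, payoff):
    """External regret: payoff of the best fixed action in hindsight
    minus the payoff the played sequence actually achieved."""
    achieved = sum(payoff(a, o) for a, o in zip(my_actions, opp_actions))
    best_fixed = max(sum(payoff(a, o) for o in opp_actions) for a in action_set)
    return best_fixed - achieved
```

A distribution $\mathcal{D}$ is "low-regret" in the paper's sense when approximately best-responding to a model trained on it keeps this quantity sublinear in the horizon $T$ against any opponent sequence.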
[962] The Unreasonable Effectiveness of Scaling Laws in AI
Chien-Ping Lu
Main category: cs.LG
TL;DR: Scaling laws are effective because they abstract away implementation details using “logical compute,” explaining their broad applicability and the persistent efficiency improvements needed to sustain progress despite diminishing returns.
Details
Motivation: The paper aims to explain why AI scaling laws are so effective despite being empirical and predicting diminishing returns. It seeks to understand why these laws apply broadly across different model families and training regimes, and why practical progress continues through efficiency improvements despite the predicted diminishing returns.
Method: The paper proposes a conceptual framework that distinguishes between “logical compute” (implementation-agnostic model-side work) and practical resource efficiency. It analyzes how scaling laws abstract away realization details and examines the relationship between diminishing returns and the need for efficiency improvements in hardware, algorithms, and systems.
Result: The analysis shows that scaling laws are effective because they use logical compute as an abstraction that separates model capabilities from implementation efficiency. This explains both the broad applicability of scaling laws across different settings and the persistent “efficiency game” where practical progress depends on converting real resources into logical compute efficiently.
Conclusion: Scaling laws’ effectiveness comes from their abstraction using logical compute, which explains their broad applicability and the need for continuous efficiency improvements. Diminishing returns create pressure for cost reduction and system-level innovation, making the key practical question how many efficiency doublings are needed to sustain productive scaling.
Abstract: Classical AI scaling laws, especially for pre-training, describe how training loss decreases with compute in a power-law form. Their effectiveness has a basic and very practical sense: they make progress predictable, albeit at a declining rate. Yet their effectiveness is also unreasonable in two further senses. First, these laws are largely empirical and observational, but they appear repeatedly across model families and increasingly across training-adjacent regimes. Second, despite the diminishing returns they predict, progress in practice has often continued through rapidly improving efficiency, visible for example in falling cost per token. This paper argues that both features arise from the same source: scaling laws are unusually effective because they abstract away from many realization details. The compute variable is best understood as logical compute, an implementation-agnostic notion of model-side work, while the practical burden of scaling depends on how efficiently real resources are converted into that compute. This abstraction helps explain both why the laws travel so well across settings and why they give rise to a persistent efficiency game in hardware, algorithms, and systems. Once efficiency is made explicit, the main practical question becomes how many efficiency doublings are required to keep scaling productive despite diminishing returns. Under that view, diminishing returns are not only a geometric flattening of the loss curve, but also rising pressure for cost reduction, system-level innovation, and the breakthroughs needed to sustain Moore-like efficiency doublings.
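The "efficiency doublings" question can be made concrete with a toy power law $L(C) = aC^{-b}$: how much extra logical compute does a given loss reduction cost, and how many doublings of resource-to-compute efficiency buy it at unchanged real cost? The exponent value in the test below is an illustrative assumption, not a number from the paper:

```python
import math

def compute_multiplier(loss_ratio, b):
    """Under L(C) = a * C**(-b), the factor by which logical compute
    must grow to shrink loss by `loss_ratio` (e.g. 0.9 = a 10% cut)."""
    return loss_ratio ** (-1.0 / b)

def efficiency_doublings(loss_ratio, b):
    """Doublings of (real resources -> logical compute) efficiency
    needed to buy that multiplier at unchanged real cost."""
    return math.log2(compute_multiplier(loss_ratio, b))
```

With a small exponent like $b = 0.05$, even a 10% loss reduction demands roughly 8$\times$ the logical compute — about three efficiency doublings — which is the "rising pressure" the abstract describes.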
[963] Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework
Ya Zhou, Tianxiang Hao, Ziyi Cai, Haojie Zhu, Hejun He, Jia Liu, Xiaohan Fan, Jing Yuan
Main category: cs.LG
TL;DR: ECGPD-LEF: A structured framework combining foundation model-derived diagnostic probabilities with interpretable modeling for detecting low left ventricular ejection fraction from ECG, outperforming end-to-end black-box approaches.
Details
Motivation: Low left ventricular ejection fraction often goes undetected until symptomatic heart failure develops, creating a need for scalable screening. Existing AI-ECG approaches either use uninterpretable black-box models or rely on commercial ECG algorithms with suboptimal performance.
Method: ECGPD-LEF integrates foundation model-derived diagnostic probabilities with interpretable modeling. Trained on 72,475 ECG-echocardiogram pairs from the EchoNext dataset, evaluated on internal (5,442) and external (16,017) cohorts. Uses structured diagnostic probability representations rather than end-to-end learning.
Result: Achieved AUROC 88.4% (internal) and 86.8% (external) for moderate LEF detection, consistently outperforming official end-to-end baseline. High-impact predictors identified: normal ECG, incomplete left bundle branch block, subendocardial injury in anterolateral leads. These predictors enabled zero-shot-like inference without retraining (AUROC 75.3-81.0% internal, 71.6-78.6% external).
Conclusion: The framework reconciles predictive performance with mechanistic transparency, showing ventricular dysfunction is intrinsically encoded within structured diagnostic probability representations. Supports scalable enhancement through additional predictors and integration with existing AI-ECG systems.
Abstract: Low left ventricular ejection fraction (LEF) frequently remains undetected until progression to symptomatic heart failure, underscoring the need for scalable screening strategies. Although artificial intelligence-enabled electrocardiography (AI-ECG) has shown promise, existing approaches rely solely on end-to-end black-box models with limited interpretability or on tabular systems dependent on commercial ECG measurement algorithms with suboptimal performance. We introduced ECG-based Predictor-Driven LEF (ECGPD-LEF), a structured framework that integrates foundation model-derived diagnostic probabilities with interpretable modeling for detecting LEF from ECG. Trained on the benchmark EchoNext dataset comprising 72,475 ECG-echocardiogram pairs and evaluated in predefined independent internal (n=5,442) and external (n=16,017) cohorts, our framework achieved robust discrimination for moderate LEF (internal AUROC 88.4%, F1 64.5%; external AUROC 86.8%, F1 53.6%), consistently outperforming the official end-to-end baseline provided with the benchmark across demographic and clinical subgroups. Interpretability analyses identified high-impact predictors, including normal ECG, incomplete left bundle branch block, and subendocardial injury in anterolateral leads, driving LEF risk estimation. Notably, these predictors independently enabled zero-shot-like inference without task-specific retraining (internal AUROC 75.3-81.0%; external AUROC 71.6-78.6%), indicating that ventricular dysfunction is intrinsically encoded within structured diagnostic probability representations. This framework reconciles predictive performance with mechanistic transparency, supporting scalable enhancement through additional predictors and seamless integration with existing AI-ECG systems.
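As a hedged illustration of the predictor-driven idea — a transparent model over foundation-model diagnostic probabilities rather than raw waveforms — here is a toy logistic score over the three high-impact predictors the paper names. The weights, bias, and feature names are invented for illustration; the framework learns its parameters from the 72,475 ECG-echo pairs:

```python
import math

# Hypothetical predictors and hand-set weights (signs chosen to match
# the reported directions: a normal ECG lowers risk, the others raise it).
WEIGHTS = {"normal_ecg": -2.0,
           "incomplete_lbbb": 1.5,
           "subendocardial_injury_anterolateral": 1.8}
BIAS = -1.0

def low_ef_risk(diag_probs):
    """Interpretable risk score: logistic over diagnostic probabilities."""
    z = BIAS + sum(WEIGHTS[k] * diag_probs.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))
```

Because each input is itself a named diagnosis probability, every contribution to the score is directly readable — the "mechanistic transparency" the conclusion emphasizes.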
[964] CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments
Yi Yu, Guangquan Hu, Chenghuang Shen, Xingyan Liu, Jing Gu, Hangyi Sun, Junzhuo Ma, Weiting Liu, Jianfeng Liu, Mingyue Pu, Yu Wang, Zhengdong Xiao, Rui Xie, Longjiu Luo, Qianrong Wang, Gurong Cui, Honglin Qiao, Wenlian Lu
Main category: cs.LG
TL;DR: CirrusBench: A real-world evaluation framework for LLM-based agents using authentic cloud service tickets, focusing on resolution efficiency and multi-turn task performance in technical service environments.
Details
Motivation: Existing LLM agent benchmarks rely on synthetic environments that fail to capture the diversity and unpredictability of authentic customer inputs, and ignore the resolution efficiency critical for real-world deployment in technical service applications.
Method: Introduces the CirrusBench framework, built on real-world data from authentic cloud service tickets and preserving intricate multi-turn logical chains and realistic tool dependencies. Adds Customer-Centric metrics, including the Normalized Efficiency Index and Multi-Turn Latency, to measure resolution efficiency.
Result: Experiments show state-of-the-art models demonstrate strong reasoning capabilities but struggle with complex, realistic multi-turn tasks and fail to meet high-efficiency standards required for customer service.
Conclusion: CirrusBench highlights critical directions for future development of LLM-based agents in practical technical service applications, emphasizing the need for better efficiency and performance in realistic multi-turn scenarios.
Abstract: The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction. However, existing benchmarks for LLM-based agents largely rely on synthetic environments that fail to capture the diversity and unpredictability of authentic customer inputs, often ignoring the resolution efficiency essential for real-world deployment. To bridge this gap, we introduce CirrusBench, a novel evaluation framework distinguished by its foundation in real-world data from authentic cloud service tickets. CirrusBench preserves the intricate multi-turn logical chains and realistic tool dependencies inherent to technical service environments. Moving beyond execution correctness, we introduce novel Customer-Centric metrics to define agent success, quantifying service quality through metrics such as the Normalized Efficiency Index and Multi-Turn Latency to explicitly measure resolution efficiency. Experiments utilizing our framework reveal that while state-of-the-art models demonstrate strong reasoning capabilities, they frequently struggle in complex, realistic multi-turn tasks and fail to meet the high-efficiency standards required for customer service, highlighting critical directions for the future development of LLM-based agents in practical technical service applications. CirrusBench evaluation framework is released at: https://github.com/CirrusAI
[965] Unrestrained Simplex Denoising for Discrete Data. A Non-Markovian Approach Applied to Graph Generation
Yoann Boget, Alexandros Kalousis
Main category: cs.LG
TL;DR: Simplex denoising: A generative framework operating on probability simplex for discrete structures, using non-Markovian noising with conditionally independent noisy representations, outperforming discrete diffusion and flow-matching baselines.
Details
Motivation: Current denoising models for discrete structures (Diffusion/Flow Matching) operate directly in the discrete state space, causing abrupt state changes and limiting performance and formulation simplicity.
Method: Introduces a simplex denoising framework operating on the probability simplex with a non-Markovian noising scheme in which noisy representations at different times are conditionally independent given the clean data, removing unnecessary constraints while preserving theoretical guarantees.
Result: Unrestrained simplex denoising surpasses strong discrete diffusion and flow-matching baselines across synthetic and real-world graph benchmarks, demonstrating superior performance.
Conclusion: Probability simplex serves as an effective framework for discrete generative modeling, offering improved performance and simplified formulation compared to direct discrete state space approaches.
Abstract: Denoising models such as Diffusion or Flow Matching have recently advanced generative modeling for discrete structures, yet most approaches operate directly in the discrete state space, causing abrupt state changes. We introduce simplex denoising, a simple yet effective generative framework that operates on the probability simplex. The key idea is a non-Markovian noising scheme in which, for a given clean data point, noisy representations at different times are conditionally independent. While preserving the theoretical guarantees of denoising-based generative models, our method removes unnecessary constraints, thereby improving performance and simplifying the formulation. Empirically, \emph{unrestrained simplex denoising} surpasses strong discrete diffusion and flow-matching baselines across synthetic and real-world graph benchmarks. These results highlight the probability simplex as an effective framework for discrete generative modeling.
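A minimal sketch of what "conditionally independent noisy representations on the simplex" can look like: at each time, draw fresh noise independently of every other time and interpolate it toward the clean one-hot, so the sample never leaves the simplex. The linear interpolation form below is an assumption for illustration; the paper's noising scheme is its own:

```python
import math
import random

def random_simplex_point(n, rng):
    """A random point on the probability simplex (normalized
    exponential draws)."""
    e = [-math.log(rng.random()) for _ in range(n)]
    s = sum(e)
    return [x / s for x in e]

def noisy_sample(clean_onehot, t, rng):
    """Non-Markovian-style noising: a fresh, independent noise draw at
    each time t, interpolated toward the clean one-hot. The result is
    a convex combination of two simplex points, so it stays on the
    simplex; there are no abrupt discrete state jumps."""
    noise = random_simplex_point(len(clean_onehot), rng)
    return [(1.0 - t) * c + t * z for c, z in zip(clean_onehot, noise)]
```

Because each time's noise is drawn fresh given the clean point, representations at different times are conditionally independent given the data — the property the framework exploits.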
[966] ChemCLIP: Bridging Organic and Inorganic Anticancer Compounds Through Contrastive Learning
Mohamad Koohi-Moghadam, Hongzhe Sun, Hongyan Li, Kyongtae Tyler Bae
Main category: cs.LG
TL;DR: ChemCLIP uses contrastive learning to create unified representations of organic and metal-based anticancer compounds, enabling knowledge transfer between these traditionally separate chemical domains.
Details
Motivation: Traditional drug discovery treats organic small molecules and metal-based coordination complexes as separate domains, limiting knowledge transfer despite shared biological objectives. There's a significant data disparity with extensive organic compound databases versus only a few thousand characterized metal complexes.
Method: Developed ChemCLIP, a dual-encoder contrastive learning framework that learns unified representations based on shared anticancer activities rather than structural similarity. Compiled datasets of 44,854 organic compounds and 5,164 metal complexes standardized across 60 cancer cell lines. Trained parallel encoders with activity-aware hard negative mining to map compounds into a shared 256-dimensional embedding space. Evaluated four molecular encoding strategies: Morgan fingerprints, ChemBERTa, MolFormer, and Chemprop.
Result: Morgan fingerprints achieved superior performance with an average alignment ratio of 0.899 and downstream classification AUCs of 0.859 (inorganic) and 0.817 (organic). Biologically similar compounds cluster together in the embedding space regardless of chemical class.
Conclusion: Contrastive learning is an effective strategy for unifying disparate chemical domains. The work provides empirical guidance for encoder selection in multi-modal chemistry applications, with implications extending beyond anticancer drug discovery to any scenario requiring cross-domain chemical knowledge transfer.
Abstract: The discovery of anticancer therapeutics has traditionally treated organic small molecules and metal-based coordination complexes as separate chemical domains, limiting knowledge transfer despite their shared biological objectives. This disparity is particularly pronounced in available data, with extensive screening databases for organic compounds compared to only a few thousand characterized metal complexes. Here, we introduce ChemCLIP, a dual-encoder contrastive learning framework that bridges this organic-inorganic divide by learning unified representations based on shared anticancer activities rather than structural similarity. We compiled complementary datasets comprising 44,854 unique organic compounds and 5,164 unique metal complexes, standardized across 60 cancer cell lines. By training parallel encoders with activity-aware hard negative mining, we mapped structurally distinct compounds into a shared 256-dimensional embedding space where biologically similar compounds cluster together regardless of chemical class. We systematically evaluated four molecular encoding strategies: Morgan fingerprints, ChemBERTa, MolFormer, and Chemprop, through quantitative alignment metrics, embedding visualizations, and downstream classification tasks. Morgan fingerprints achieved superior performance with an average alignment ratio of 0.899 and downstream classification AUCs of 0.859 (inorganic) and 0.817 (organic). This work establishes contrastive learning as an effective strategy for unifying disparate chemical domains and provides empirical guidance for encoder selection in multi-modal chemistry applications, with implications extending beyond anticancer drug discovery to any scenario requiring cross-domain chemical knowledge transfer.
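A dual-encoder contrastive objective of the CLIP family can be sketched as a symmetric InfoNCE loss over paired embeddings: matched organic/inorganic pairs sit on the diagonal of a similarity matrix and are pulled together, while mismatched pairs are pushed apart. This toy pure-Python version stands in for the paper's trained encoders and activity-aware negative mining:

```python
import math

def infonce_loss(org_emb, inorg_emb, temperature=0.1):
    """Symmetric contrastive (InfoNCE) loss over paired embeddings:
    row i of `org_emb` should match row i of `inorg_emb`."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    n = len(org_emb)
    sims = [[dot(org_emb[i], inorg_emb[j]) / temperature for j in range(n)]
            for i in range(n)]
    def ce_row(row, target):
        # Numerically stable cross-entropy against the matched index.
        m = max(row)
        logsum = m + math.log(sum(math.exp(x - m) for x in row))
        return logsum - row[target]
    loss_o = sum(ce_row(sims[i], i) for i in range(n)) / n           # organic -> inorganic
    loss_i = sum(ce_row([sims[j][i] for j in range(n)], i) for i in range(n)) / n  # inorganic -> organic
    return 0.5 * (loss_o + loss_i)
```

Minimizing this over the encoder parameters is what makes biologically similar compounds cluster in the shared embedding space regardless of chemical class.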
[967] Physics-Informed Framework for Impact Identification in Aerospace Composites
Natália Ribeiro Marinho, Richard Loendersloot, Jan Willem Wiegman, Frank Grooteman, Tiedo Tinga
Main category: cs.LG
TL;DR: A physics-informed framework for impact identification that combines physical knowledge with data-driven methods to achieve physically consistent and stable inference of impact parameters.
Details
Motivation: To develop a reliable impact identification method that integrates physical knowledge with data-driven approaches to achieve physically consistent results, especially under degraded measurement conditions and limited data availability.
Method: Uses physics-informed biases: structures the input space with physics-based energy indicators, constrains solutions via architectural design, and enforces governing relations through hybrid loss formulations. Employs disjoint inference with decoupled surrogate models for impact velocity and mass, then computes impact energy via kinetic energy consistency.
Result: Achieves mean absolute percentage errors below 8% for impact velocity and mass, below 10% for impact energy. Shows stable performance under reduced data and increased noise, and generalizes to out-of-distribution cases including damaged regimes when trained on damaged responses.
Conclusion: Systematic integration of physics-informed biases enables reliable, physically consistent, and data-efficient impact identification, demonstrating potential for practical monitoring systems.
Abstract: This paper introduces a novel physics-informed impact identification (Phy-ID) framework. The proposed method integrates observational, inductive, and learning biases to combine physical knowledge with data-driven inference in a unified modelling strategy, achieving physically consistent and numerically stable impact identification. The physics-informed approach structures the input space using physics-based energy indicators, constrains admissible solutions via architectural design, and enforces governing relations via hybrid loss formulations. Together, these mechanisms limit non-physical solutions and stabilise inference under degraded measurement conditions. A disjoint inference formulation is used as a representative use case to demonstrate the framework capabilities, in which impact velocity and impactor mass are inferred through decoupled surrogate models, and impact energy is computed by enforcing kinetic energy consistency. Experimental evaluations show mean absolute percentage errors below 8% for inferred impact velocity and impactor mass and below 10% for impact energy. Additional analyses confirm stable performance under reduced data availability and increased measurement noise, as well as generalisation for out-of-distribution cases across pristine and damaged regimes when damaged responses are included in training. These results indicate that the systematic integration of physics-informed biases enables reliable, physically consistent, and data-efficient impact identification, highlighting the potential of the approach for practical monitoring systems.
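The kinetic-energy consistency step is just $E = \tfrac{1}{2} m v^2$ applied to the decoupled estimates rather than inferring energy directly:

```python
def impact_energy(mass_kg, velocity_ms):
    """Kinetic-energy consistency: E = 1/2 * m * v^2, computed from the
    decoupled mass and velocity surrogate estimates."""
    return 0.5 * mass_kg * velocity_ms ** 2
```

As a rough first-order check (our arithmetic, not the paper's): the relative error in $E$ is approximately the mass error plus twice the velocity error, so sub-8% errors in the inputs compounding to under 10% in energy is consistent with the two estimates being partially uncorrelated.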
[968] Optimistic Actor-Critic with Parametric Policies for Linear Markov Decision Processes
Max Qiushi Lin, Reza Asad, Kevin Tan, Haque Ishfaq, Csaba Szepesvari, Sharan Vaswani
Main category: cs.LG
TL;DR: Actor-critic method with parametric log-linear policies and optimistic critic using Thompson sampling via Langevin Monte Carlo, achieving state-of-the-art sample complexity for linear MDPs.
Details
Motivation: Existing theoretical analyses of actor-critic methods have limitations: they either sidestep exploration problems with strong assumptions or analyze impractical methods with complicated modifications. Methods for linear MDPs often use natural policy gradient with implicit policies that are computationally expensive to sample from, making environment interactions inefficient.
Method: Proposes an optimistic actor-critic framework for finite-horizon linear MDPs using parametric log-linear policies. Introduces a tractable logit-matching regression objective for the actor. For the critic, uses approximate Thompson sampling via Langevin Monte Carlo to obtain optimistic value estimates.
Result: The algorithm achieves $\widetilde{\mathcal{O}}(\epsilon^{-4})$ and $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the on-policy and off-policy setting respectively, matching prior theoretical works while being more aligned with practice.
Conclusion: The proposed optimistic actor-critic framework with parametric policies achieves state-of-the-art sample complexity while being more practical than previous theoretical methods that used computationally expensive implicit policies.
Abstract: Although actor-critic methods have been successful in practice, their theoretical analyses have several limitations. Specifically, existing theoretical work either sidesteps the exploration problem by making strong assumptions or analyzes impractical methods with complicated algorithmic modifications. Moreover, the actor-critic methods analyzed for linear MDPs often employ natural policy gradient (NPG) and construct “implicit” policies without explicit parameterization. Such policies are computationally expensive to sample from, making the environment interactions inefficient. To that end, we focus on the finite-horizon linear MDPs and propose an optimistic actor-critic framework that uses parametric log-linear policies. In particular, we introduce a tractable \textit{logit-matching} regression objective for the actor. For the critic, we use approximate Thompson sampling via Langevin Monte Carlo to obtain optimistic value estimates. We prove that the resulting algorithm achieves $\widetilde{\mathcal{O}}(\epsilon^{-4})$ and $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the on-policy and off-policy setting, respectively. Our results match prior theoretical works in achieving the state-of-the-art sample complexity, while our algorithm is more aligned with practice.
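The practical point of an explicit log-linear policy is that sampling is a single softmax over linear logits $\theta^\top \phi(s, a)$, unlike the implicit policies of prior NPG-based analyses. A minimal sketch, with a hypothetical feature map and parameters (the paper's logit-matching objective for fitting $\theta$ is not reproduced here):

```python
import math

def log_linear_policy(theta, features):
    """Log-linear policy: pi(a|s) proportional to exp(theta . phi(s, a)).
    `features` maps each action to its feature vector phi(s, a)."""
    logits = {a: sum(t * f for t, f in zip(theta, phi))
              for a, phi in features.items()}
    m = max(logits.values())  # shift for numerical stability
    exps = {a: math.exp(z - m) for a, z in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}
```

Given the returned distribution, drawing an action is one categorical sample — which is why explicit parameterization makes environment interaction cheap.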
[969] Position: Explainable AI is Causality in Disguise
Amir-Hossein Karimi
Main category: cs.LG
TL;DR: XAI lacks consensus due to missing ground truth; paper argues the true ground truth is the causal model governing systems, and XAI should be reframed as causal inquiry for meaningful explanations.
Details
Motivation: The XAI field is fragmented, with conflicting metrics, failed sanity checks, and unresolved debates. Current approaches lack consensus on what constitutes a correct explanation, a gap often attributed to the absence of a ground truth.
Method: Position paper proposing a causal reframing of XAI: queries about data, models, or decisions are recast as causal inquiries. Proves the necessity and sufficiency of causal models for XAI and advocates for convergence around causal discovery.
Result: Demonstrates that persistent discord in XAI arises from an elusive but existing ground truth (causal models) rather than absent ground truth. Provides theoretical foundation for causal grounding of XAI.
Conclusion: XAI remains unmoored without causal grounding. Community should converge around advanced concept and causal discovery to escape entrenched uncertainty in explainable AI.
Abstract: The demand for Explainable AI (XAI) has triggered an explosion of methods, producing a landscape so fragmented that we now rely on surveys of surveys. Yet, fundamental challenges persist: conflicting metrics, failed sanity checks, and unresolved debates over robustness and fairness. The only consensus on how to achieve explainability is a lack of one. This has led many to point to the absence of a ground truth for defining “the” correct explanation as the main culprit. This position paper posits that the persistent discord in XAI arises not from an absent ground truth but from a ground truth that exists, albeit as an elusive and challenging target: the causal model that governs the relevant system. By reframing XAI queries about data, models, or decisions as causal inquiries, we prove the necessity and sufficiency of causal models for XAI. We contend that without this causal grounding, XAI remains unmoored. Ultimately, we encourage the community to converge around advanced concept and causal discovery to escape this entrenched uncertainty.
[970] LACE: Loss-Adaptive Capacity Expansion for Continual Learning
Shivnath Tathe
Main category: cs.LG
TL;DR: LACE is an online continual learning method that dynamically expands model capacity based on loss monitoring, adding new dimensions when sustained loss indicates insufficient capacity for new data.
Details
Motivation: Fixed representational capacity is a major constraint in continual learning - practitioners must guess an appropriate model size upfront without knowing the data complexity. There's a need for adaptive mechanisms that can expand capacity during training based on actual data requirements.
Method: LACE monitors the model's loss signal during training. When sustained loss deviation exceeds a threshold (indicating insufficient capacity for new data), it adds new dimensions to the projection layer. These new dimensions are trained jointly with existing parameters. The approach requires no labels, replay buffers, or external controllers.
Result: LACE achieves 100% boundary precision (expands only at domain boundaries, zero false positives), matches accuracy of large fixed-capacity models while starting from fewer dimensions, and adapter dimensions are critical (3% accuracy drop when removed). Also shows unsupervised domain separation in GPT-2 activations via layer-wise clustering.
Conclusion: LACE provides a simple, effective mechanism for adaptive capacity expansion in continual learning, suitable for resource-constrained on-device applications without needing labels, replay, or external control.
Abstract: Fixed representational capacity is a fundamental constraint in continual learning: practitioners must guess an appropriate model width before training, without knowing how many distinct concepts the data contains. We propose LACE (Loss-Adaptive Capacity Expansion), a simple online mechanism that expands a model’s representational capacity during training by monitoring its own loss signal. When sustained loss deviation exceeds a threshold - indicating that the current capacity is insufficient for newly encountered data - LACE adds new dimensions to the projection layer and trains them jointly with existing parameters. Across synthetic and real-data experiments, LACE triggers expansions exclusively at domain boundaries (100% boundary precision, zero false positives), matches the accuracy of a large fixed-capacity model while starting from a fraction of its dimensions, and produces adapter dimensions that are collectively critical to performance (3% accuracy drop when all adapters removed). We further demonstrate unsupervised domain separation in GPT-2 activations via layer-wise clustering, showing a U-shaped separability curve across layers that motivates adaptive capacity allocation in deep networks. LACE requires no labels, no replay buffers, and no external controllers, making it suitable for on-device continual learning under resource constraints.
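The trigger logic — expand only when loss deviation from a running baseline is sustained — can be sketched in a few lines. The baseline tracking rule, threshold, patience, and growth size below are invented illustration parameters, not the paper's settings:

```python
class CapacityController:
    """Toy LACE-style trigger: expand projection width when the loss
    stays above a running baseline by `threshold` for `patience` steps."""
    def __init__(self, width, threshold=0.5, patience=3, grow=8):
        self.width, self.threshold = width, threshold
        self.patience, self.grow = patience, grow
        self.baseline, self.streak = None, 0
        self.expansions = []

    def observe(self, step, loss):
        if self.baseline is None:
            self.baseline = loss
        if loss - self.baseline > self.threshold:
            # Sustained deviation -> evidence of insufficient capacity.
            self.streak += 1
        else:
            self.streak = 0
            self.baseline = 0.9 * self.baseline + 0.1 * loss  # slow EMA
        if self.streak >= self.patience:
            self.width += self.grow               # add projection dims
            self.expansions.append(step)
            self.baseline, self.streak = loss, 0  # re-anchor to new domain
```

The patience requirement is what keeps a single noisy batch from triggering growth, mirroring the zero-false-positive boundary detection the paper reports.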
[971] Information-Theoretic Limits of Safety Verification for Self-Improving Systems
Arsenios Scrivens
Main category: cs.LG
TL;DR: The paper presents a theoretical framework analyzing safety gates for AI self-modification, showing classifier-based approaches face fundamental limitations while verifier-based approaches can achieve zero risk with positive utility.
Details
Motivation: To understand whether safety mechanisms can allow unbounded beneficial self-modification while maintaining bounded cumulative risk, addressing a core challenge in AI safety and alignment.
Method: Develops a formal mathematical framework with dual conditions (bounded risk, unbounded utility), proves impossibility theorems for classifier-based gates using Hölder's inequality and NP counting methods, analyzes universal finite-horizon ceilings, and demonstrates verification escape via Lipschitz ball verifiers with formal bounds for transformers.
Result: Shows classifier-based safety gates face fundamental impossibility for power-law risk schedules, with subpolynomial utility ceilings; verifier-based approaches can achieve delta=0 with TPR>0, validated on GPT-2 with LoRA modifications achieving conditional delta=0 with TPR=0.352.
Conclusion: Classifier-based safety gates have fundamental limitations for unbounded self-modification, while verifier-based approaches offer a viable path forward, with formal Lipschitz bounds enabling practical implementation at LLM scale.
Abstract: Can a safety gate permit unbounded beneficial self-modification while maintaining bounded cumulative risk? We formalize this question through dual conditions – requiring $\sum \delta_n < \infty$ (bounded risk) and $\sum \mathrm{TPR}_n = \infty$ (unbounded utility) – and establish a theory of their (in)compatibility. Classification impossibility (Theorem 1): For power-law risk schedules $\delta_n = O(n^{-p})$ with $p > 1$, any classifier-based gate under overlapping safe/unsafe distributions satisfies $\mathrm{TPR}_n \le C_\alpha \delta_n^\beta$ via Hölder's inequality, forcing $\sum \mathrm{TPR}_n < \infty$. This impossibility is exponent-optimal (Theorem 3). A second independent proof via the NP counting method (Theorem 4) yields a 13% tighter bound without Hölder's inequality. Universal finite-horizon ceiling (Theorem 5): For any summable risk schedule, the exact maximum achievable classifier utility is $U^*(N, B) = N \cdot \mathrm{TPR}_{\mathrm{NP}}(B/N)$, growing as $\exp(O(\sqrt{\log N}))$ – subpolynomial. At $N = 10^6$ with budget $B = 1.0$, a classifier extracts at most $U^* \approx 87$ versus a verifier's $\approx 500{,}000$. Verification escape (Theorem 2): A Lipschitz ball verifier achieves $\delta = 0$ with $\mathrm{TPR} > 0$, escaping the impossibility. Formal Lipschitz bounds for pre-LayerNorm transformers under LoRA enable LLM-scale verification. The separation is strict. We validate on GPT-2 ($d_{\mathrm{LoRA}} = 147{,}456$): conditional $\delta = 0$ with $\mathrm{TPR} = 0.352$. Comprehensive empirical validation is in the companion paper [D2].
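At its core, the verifier-side escape replaces a statistical classifier with a deterministic membership check: accept a proposed modification only if it provably lies inside a certified region, so the acceptance decision carries no false-positive risk by construction. The toy below checks a raw L2 ball over the update's parameters; the paper's actual certificate additionally involves Lipschitz bounds for pre-LayerNorm transformers under LoRA:

```python
import math

def within_ball(delta_params, radius):
    """Verifier-style gate: accept a proposed modification only if its
    parameter change lies inside an L2 ball of certified radius.
    Unlike a classifier's score threshold, this check is exact."""
    norm = math.sqrt(sum(d * d for d in delta_params))
    return norm <= radius
```

Anything inside the ball is accepted (positive TPR); anything outside is rejected outright, which is how the gate achieves zero risk while still permitting some modifications.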
[972] Mitigating Backdoor Attacks in Federated Learning Using PPA and MiniMax Game Theory
Osama Wehbi, Sarhad Arisdakessian, Omar Abdel Wahab, Anderson Avila, Azzam Mourad, Hadi Otrok
Main category: cs.LG
TL;DR: FedBBA: A federated learning defense system combining reputation tracking, incentive mechanisms, and game theory to detect and mitigate backdoor attacks from malicious clients while maintaining high task accuracy.
Details
Motivation: Federated learning faces security threats from malicious clients who inject backdoors into local models to compromise global model integrity, requiring robust defense mechanisms.
Method: Combines three components: (1) reputation system for client behavior evaluation, (2) incentive mechanism to reward honesty/penalize malicious behavior, and (3) game theoretical models with projection pursuit analysis to dynamically identify and minimize malicious client impact.
Result: FedBBA reduces backdoor attack success rate to 1.1%-11% across various attack scenarios on traffic sign datasets, significantly outperforming state-of-the-art defenses (23%-76% attack success rates) while maintaining 95%-98% normal task accuracy.
Conclusion: FedBBA effectively mitigates backdoor attacks in federated learning through comprehensive behavioral analysis and incentive mechanisms, creating more resilient federated environments.
Abstract: Federated Learning (FL) is witnessing wider adoption due to its ability to benefit from large amounts of scattered data while preserving privacy. However, despite its advantages, federated learning suffers from several setbacks that directly impact the accuracy and integrity of the global model it produces. One of these setbacks is the presence of malicious clients who actively try to harm the global model by injecting backdoor data into their local models while trying to evade detection. The objective of such clients is to trick the global model into making false predictions during inference, thereby compromising the integrity and trustworthiness of the global model on which honest stakeholders rely. To mitigate such mischievous behavior, we propose FedBBA (Federated Backdoor and Behavior Analysis). The proposed model aims to dampen the effect of such clients on the final accuracy, creating more resilient federated learning environments. We engineer our approach through the combination of (1) a reputation system to evaluate and track client behavior, (2) an incentive mechanism to reward honest participation and penalize malicious behavior, and (3) game theoretical models with projection pursuit analysis (PPA) to dynamically identify and minimize the impact of malicious clients on the global model. Extensive simulations on the German Traffic Sign Recognition Benchmark (GTSRB) and Belgium Traffic Sign Classification (BTSC) datasets demonstrate that FedBBA reduces the backdoor attack success rate to approximately 1.1%–11% across various attack scenarios, significantly outperforming state-of-the-art defenses like RDFL and RoPE, which yielded attack success rates between 23% and 76%, while maintaining high normal task accuracy (~95%–98%).
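The core aggregation idea, reputation-weighted averaging that suppresses suspected malicious clients, can be sketched as follows (a toy numpy stand-in; the client vectors and reputation scores are invented for illustration, and the paper's full reputation, incentive, and game-theoretic machinery is omitted):

```python
import numpy as np

def reputation_weighted_average(updates, reputations):
    """Aggregate client model updates, weighting each client by its
    reputation score so that low-reputation (suspected malicious)
    clients contribute little to the global model."""
    w = np.asarray(reputations, dtype=float)
    w = w / w.sum()
    return sum(wi * u for wi, u in zip(w, updates))

# Two honest clients and one client pushing a large backdoor update.
honest_a = np.array([1.0, 1.0])
honest_b = np.array([1.2, 0.8])
malicious = np.array([50.0, -50.0])

# The reputation system has already downgraded the malicious client.
global_update = reputation_weighted_average(
    [honest_a, honest_b, malicious], reputations=[1.0, 1.0, 0.01]
)
# Plain FedAvg-style uniform weighting, for contrast.
naive_update = reputation_weighted_average(
    [honest_a, honest_b, malicious], reputations=[1.0, 1.0, 1.0]
)
```

With uniform weights the malicious update dominates the aggregate; with a downgraded reputation it is nearly neutralized.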
[973] AMIGO: Agentic Multi-Image Grounding Oracle Benchmark
Min Wang, Ata Mahjoubfar
Main category: cs.LG
TL;DR: AMIGO is a benchmark for evaluating agentic vision-language models on long-horizon hidden-target identification tasks through multi-turn questioning of visually similar image galleries.
Details
Motivation: Current evaluations focus on single-image, single-turn correctness, but agentic vision-language models increasingly act through extended interactions. There's a need for benchmarks that stress question selection under uncertainty, consistent constraint tracking across turns, and fine-grained discrimination as evidence accumulates.
Method: AMIGO is a benchmark where an oracle privately selects a target image, and the model must recover it by asking a sequence of attribute-focused Yes/No/Unsure questions under a strict protocol that penalizes invalid actions with Skip. It supports controlled oracle imperfections to probe robustness and verification behavior under inconsistent feedback.
Result: The benchmark is instantiated with a “Guess My Preferred Dress” task and reports comprehensive metrics covering identification success, evidence verification, efficiency, protocol compliance, noise tolerance, and trajectory-level diagnostics.
Conclusion: AMIGO provides a rigorous evaluation framework for agentic vision-language models on long-horizon tasks, addressing limitations of current single-turn evaluations and enabling assessment of multi-turn reasoning, uncertainty management, and robustness to imperfect feedback.
Abstract: Agentic vision-language models increasingly act through extended interactions, but most evaluations still focus on single-image, single-turn correctness. We introduce AMIGO (Agentic Multi-Image Grounding Oracle Benchmark), a long-horizon benchmark for hidden-target identification over galleries of visually similar images. In AMIGO, the oracle privately selects a target image, and the model must recover it by asking a sequence of attribute-focused Yes/No/Unsure questions under a strict protocol that penalizes invalid actions with Skip. This setting stresses (i) question selection under uncertainty, (ii) consistent constraint tracking across turns, and (iii) fine-grained discrimination as evidence accumulates. AMIGO also supports controlled oracle imperfections to probe robustness and verification behavior under inconsistent feedback. We instantiate AMIGO with the “Guess My Preferred Dress” task and report metrics covering both outcomes and interaction quality, including identification success, evidence verification, efficiency, protocol compliance, noise tolerance, and trajectory-level diagnostics.
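The interaction protocol resembles a constraint-tracking game of twenty questions. A minimal deterministic sketch (a toy binary-attribute gallery and a greedy questioner standing in for the evaluated model; nothing here comes from the benchmark itself):

```python
import numpy as np

# Gallery of 32 "images", each described by 6 binary attributes; rows are
# the 6-bit encodings of 0..31, so every item has a unique profile.
gallery = (np.arange(32)[:, None] >> np.arange(6)) & 1
target = 13                       # the oracle's private selection

remaining = np.ones(32, dtype=bool)
questions = 0
while remaining.sum() > 1 and questions < 20:
    # Greedy question selection: ask the attribute whose Yes/No split of
    # the remaining candidates is most even (max information gain).
    counts = gallery[remaining].sum(axis=0)
    attr = int(np.argmin(np.abs(counts - remaining.sum() / 2)))
    answer = gallery[target, attr]            # oracle answers truthfully
    remaining &= gallery[:, attr] == answer   # constraint tracking
    questions += 1

identified = int(np.flatnonzero(remaining)[0])
```

With consistent Yes/No answers the greedy policy halves the candidate set each turn, recovering the target in log2(32) = 5 questions; AMIGO additionally stresses Unsure answers, Skip penalties, and noisy oracles.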
[974] FL-PBM: Pre-Training Backdoor Mitigation for Federated Learning
Osama Wehbi, Sarhad Arisdakessian, Omar Abdel Wahab, Azzam Mourad, Hadi Otrok, Jamal Bentahar
Main category: cs.LG
TL;DR: FL-PBM: A pre-training defense mechanism for federated learning that proactively filters poisoned data using benign trigger insertion, PCA feature extraction, GMM clustering, and targeted blurring to mitigate backdoor attacks.
Details
Motivation: Backdoor attacks pose significant threats to AI model integrity, especially in critical applications. Current defenses need improvement for federated learning environments where poisoned data can be injected during client-side training.
Method: Four-stage approach: (1) Insert benign trigger for baseline, (2) Apply PCA for discriminative feature extraction, (3) Use GMM clustering to identify malicious samples, (4) Apply targeted blurring to disrupt backdoor triggers.
Result: Reduces attack success rates by up to 95% vs FedAvg baseline and 30-80% vs state-of-the-art defenses (RDFL, LPSF), while maintaining over 90% clean model accuracy in most experiments.
Conclusion: FL-PBM effectively mitigates backdoor attacks in federated learning by early detection and sanitization of poisoned data without degrading model performance.
Abstract: Backdoor attacks pose a significant threat to the integrity and reliability of Artificial Intelligence (AI) models, enabling adversaries to manipulate model behavior by injecting poisoned data with hidden triggers. These attacks can lead to severe consequences, especially in critical applications such as autonomous driving, healthcare, and finance. Detecting and mitigating backdoor attacks is crucial across all phases of a model’s lifecycle, including pre-training, in-training, and post-training. In this paper, we propose Pre-Training Backdoor Mitigation for Federated Learning (FL-PBM), a novel defense mechanism that proactively filters poisoned data on the client side before model training in a federated learning (FL) environment. The approach consists of four stages: (1) inserting a benign trigger into the data to establish a controlled baseline, (2) applying Principal Component Analysis (PCA) to extract discriminative features and assess the separability of the data, (3) performing Gaussian Mixture Model (GMM) clustering to identify potentially malicious data samples based on their distribution in the PCA-transformed space, and (4) applying a targeted blurring technique to disrupt potential backdoor triggers. Together, these steps ensure that suspicious data is detected early and sanitized effectively, thereby minimizing the influence of backdoor triggers on the global model. Experimental evaluations on image-based datasets demonstrate that FL-PBM reduces attack success rates by up to 95% compared to baseline federated learning (FedAvg) and by 30 to 80% relative to state-of-the-art defenses (RDFL and LPSF). At the same time, it maintains over 90% clean model accuracy in most experiments, achieving better mitigation without degrading model performance.
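Stages (2)-(3) amount to projecting the data and flagging the separable minority. A simplified numpy-only sketch (synthetic data with an additive trigger; a 2-sigma threshold stands in for the paper's GMM clustering):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy "client dataset": 95 clean samples plus 5 carrying a strong trigger.
clean = rng.normal(0.0, 1.0, size=(95, 16))
poisoned = rng.normal(0.0, 1.0, size=(5, 16))
poisoned[:, :4] += 8.0          # trigger pattern makes these separable
data = np.vstack([clean, poisoned])

# Stage (2) analogue: PCA via SVD; the first principal component aligns
# with the high-variance trigger direction.
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt[0]

# Stage (3) analogue (simplified): flag the small, far-away mode by a
# deviation threshold instead of fitting a full Gaussian mixture.
suspicious = np.abs(scores - scores.mean()) > 2.0 * scores.std()
```

The flagged indices would then be sanitized (stage 4's targeted blurring) before local training.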
[975] Subspace Optimization for Backpropagation-Free Continual Test-Time Adaptation
Damian Sójka, Sebastian Cygert, Marc Masana
Main category: cs.LG
TL;DR: PACE is a backpropagation-free continual test-time adaptation system that optimizes normalization layer affine parameters using evolution strategies, achieving SOTA accuracy with 50% runtime reduction.
Details
Motivation: Existing derivative-free continual adaptation methods face a trade-off between runtime efficiency and learning capacity - they either limit updates to input prompts or require continuous resource-intensive adaptation regardless of domain stability.
Method: PACE uses Covariance Matrix Adaptation Evolution Strategy with Fastfood projection to optimize high-dimensional affine parameters in a low-dimensional subspace. It incorporates an adaptation stopping criterion and domain-specialized vector bank to eliminate redundant computation.
Result: Achieves state-of-the-art accuracy across multiple benchmarks under continual distribution shifts while reducing runtime by over 50% compared to existing backpropagation-free methods.
Conclusion: PACE provides an efficient backpropagation-free solution for continual test-time adaptation that balances learning capacity with runtime efficiency through subspace optimization and intelligent adaptation control.
Abstract: We introduce PACE, a backpropagation-free continual test-time adaptation system that directly optimizes the affine parameters of normalization layers. Existing derivative-free approaches struggle to balance runtime efficiency with learning capacity, as they either restrict updates to input prompts or require continuous, resource-intensive adaptation regardless of domain stability. To address these limitations, PACE leverages the Covariance Matrix Adaptation Evolution Strategy with the Fastfood projection to optimize high-dimensional affine parameters within a low-dimensional subspace, leading to superior adaptive performance. Furthermore, we enhance the runtime efficiency by incorporating an adaptation stopping criterion and a domain-specialized vector bank to eliminate redundant computation. Our framework achieves state-of-the-art accuracy across multiple benchmarks under continual distribution shifts, reducing runtime by over 50% compared to existing backpropagation-free methods.
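The subspace trick, searching a low-dimensional vector that is lifted to the full affine-parameter space, can be illustrated with a toy derivative-free loop (a dense Gaussian projection stands in for the Fastfood transform, a simple (1+lambda) ES stands in for CMA-ES, and the quadratic fitness is invented):

```python
import numpy as np

rng = np.random.default_rng(0)
dim_high, dim_low = 256, 8   # all affine parameters vs. search subspace

# Fixed projection lifting a low-dimensional search point to the full
# parameter space (dense Gaussian here; PACE uses Fastfood for speed).
lift = rng.standard_normal((dim_high, dim_low)) / np.sqrt(dim_low)
target = rng.standard_normal(dim_high)   # stand-in for "ideal" parameters

def fitness(z):
    # Negative squared error of the lifted parameters, standing in for
    # post-adaptation performance on the incoming test batch.
    return -float(np.sum((lift @ z - target) ** 2))

# Simple (1+lambda) evolution strategy in the 8-D subspace; no gradients
# or backpropagation are ever computed.
z_best = np.zeros(dim_low)
f_init = fitness(z_best)
f_best = f_init
for _ in range(200):
    for z in z_best + 0.3 * rng.standard_normal((8, dim_low)):
        f = fitness(z)
        if f > f_best:
            z_best, f_best = z, f
```

Because only `dim_low` numbers are searched, the per-step cost is independent of the number of affine parameters being adapted.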
[976] GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference
Soutrik Mukherjee, Sangwhan Cha
Main category: cs.LG
TL;DR: GPU-accelerated transformer inference pipeline using TensorRT with mixed-precision optimization achieves up to 64.4x speedup over CPU, sub-10ms latency, and 63% memory reduction while maintaining numerical fidelity.
Details
Motivation: Transformers have become fundamental in NLP but face deployment challenges due to computational intensity and memory requirements. There's a need for efficient inference systems that balance speed, memory usage, and accuracy for latency-critical applications.
Method: Developed a modular, containerized GPU-accelerated inference pipeline using NVIDIA TensorRT with hybrid precision strategy: FP32 for numerically sensitive operations (softmax, layer normalization) and FP16 for linear layers. Evaluated BERT-base (110M) and GPT-2 (124M) across batch sizes 1-32 and sequence lengths 32-512.
Result: Achieved up to 64.4x speedup over CPU baselines, sub-10ms latency for single-sample inference, 63% memory reduction. Maintained high numerical fidelity (cosine similarity ≥0.9998), eliminated NaN instability. Cross-GPU validation showed consistent FP16 speedup ratios (1.84x-2.00x) with stable numerical behavior. No accuracy degradation on SST-2 downstream task.
Conclusion: The hybrid precision approach provides practical guidance for deploying transformer models in latency-critical environments, offering detailed characterization of performance-accuracy trade-offs across GPU architectures while maintaining numerical stability and accuracy.
Abstract: This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parameters) and GPT-2 (124M parameters) across batch sizes from 1 to 32 and sequence lengths from 32 to 512. The system achieves up to 64.4x speedup over CPU baselines, sub-10 ms latency for single-sample inference, and a 63 percent reduction in memory usage. We introduce a hybrid precision strategy that preserves FP32 for numerically sensitive operations such as softmax and layer normalization, while applying FP16 to linear layers. This approach maintains high numerical fidelity (cosine similarity >= 0.9998 relative to baseline outputs) and eliminates NaN instability. The pipeline is implemented as a modular, containerized system that enables reproducible benchmarking across more than 360 configurations. Cross-GPU validation on an NVIDIA A100 shows consistent FP16 speedup ratios between 1.84x and 2.00x, along with stable numerical behavior. Downstream evaluation on SST-2 demonstrates no accuracy degradation under hybrid precision. Validation on WikiText-2 shows that random inputs underestimate NaN instability by up to 6x for full FP16, while confirming the robustness of the hybrid approach (0.0 percent NaN, cosine similarity >= 0.9998). These results provide a detailed characterization of performance and accuracy trade-offs across GPU architectures and offer practical guidance for deploying transformer models in latency-critical environments.
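The hybrid-precision split can be mimicked in numpy: run the linear layer in FP16 but keep the numerically sensitive softmax in FP32 (a toy stand-in for the TensorRT pipeline; shapes and data are illustrative):

```python
import numpy as np

def hybrid_linear_softmax(x, w):
    """Linear layer in FP16 (fast, memory-light), softmax in FP32
    (numerically sensitive), mirroring the paper's precision split."""
    logits = (x.astype(np.float16) @ w.astype(np.float16)).astype(np.float32)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32)).astype(np.float32)
w = rng.standard_normal((32, 10)).astype(np.float32)

probs_hybrid = hybrid_linear_softmax(x, w)
full = np.exp((x @ w) - (x @ w).max(-1, keepdims=True))
probs_full = full / full.sum(-1, keepdims=True)

# Fidelity metric analogous to the paper's cosine-similarity check.
cos = float(np.sum(probs_hybrid * probs_full) /
            (np.linalg.norm(probs_hybrid) * np.linalg.norm(probs_full)))
```

Keeping the softmax in FP32 is what avoids the NaN instability the paper observes under full FP16.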
[977] Stepwise Credit Assignment for GRPO on Flow-Matching Models
Yash Savani, Branislav Kveton, Yuchen Liu, Yilin Wang, Jing Shi, Subhojyoti Mukherjee, Nikos Vlassis, Krishna Kumar Singh
Main category: cs.LG
TL;DR: Stepwise-Flow-GRPO improves reinforcement learning for flow models by using stepwise credit assignment based on reward improvement, addressing limitations of uniform credit assignment in diffusion generation.
Details
Motivation: Current Flow-GRPO uses uniform credit assignment across all diffusion steps, ignoring the temporal structure where early steps determine composition/content (low-frequency) and late steps resolve details (high-frequency). Uniform credit based solely on final image can reward suboptimal intermediate steps when errors are corrected later.
Method: Proposes Stepwise-Flow-GRPO with stepwise credit assignment based on each step’s reward improvement. Uses Tweedie’s formula for intermediate reward estimates and introduces gain-based advantages. Also introduces a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.
Result: Achieves superior sample efficiency and faster convergence compared to uniform credit assignment methods.
Conclusion: Stepwise credit assignment based on reward improvement better captures the temporal structure of diffusion generation and leads to more efficient reinforcement learning for flow models.
Abstract: Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step’s reward improvement. By leveraging Tweedie’s formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.
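The gain-based advantage is just the telescoping difference of intermediate reward estimates. A toy sketch (the reward numbers are invented; in the paper the estimates come from Tweedie's formula applied to the denoising trajectory):

```python
import numpy as np

def gain_advantages(reward_estimates):
    """Credit each denoising step with the *improvement* it produces in
    the intermediate reward estimate, instead of handing every step a
    uniform share of the final reward."""
    r = np.asarray(reward_estimates, dtype=float)
    return r[1:] - r[:-1]

# Five reward estimates bracketing four steps: step 2 degrades the
# predicted image and step 3 repairs it. Uniform credit would reward
# all four steps equally with the final reward (0.9).
estimates = [0.0, 0.3, 0.1, 0.5, 0.9]
advantages = gain_advantages(estimates)
```

The gains telescope to the final reward, so total credit is conserved, but the step that hurt the image now receives negative advantage instead of being rewarded.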
[978] Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs
Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke
Main category: cs.LG
TL;DR: Steering vectors applied to LLM activations effectively mitigate social biases with minimal performance impact, outperforming other methods across multiple bias datasets.
Details
Motivation: To develop a computationally efficient method for reducing social biases in large language models that maintains model performance while addressing multiple bias dimensions.
Method: Compute steering vectors for 8 social bias axes (age, gender, race, etc.) on BBQ dataset training subset, then apply these vectors to modify model activations during forward passes. Compare to 3 other bias mitigation methods across 4 datasets.
Result: Steering vectors achieved average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet, outperforming prompting and Self-Debias in all cases, and beating fine-tuning in 12/17 evaluations. Showed lowest impact on MMLU scores among all methods.
Conclusion: Steering vectors are a powerful, computationally efficient strategy for bias mitigation in LLMs with minimal performance degradation, representing the first systematic investigation of this approach for AI safety enhancement.
Abstract: We present a novel approach to bias mitigation in large language models (LLMs) by applying steering vectors to modify model activations in forward passes. We compute 8 steering vectors, each corresponding to a different social bias axis, such as age, gender, or race, on a training subset of the BBQ dataset and compare the effectiveness of these to 3 additional bias mitigation methods across 4 datasets. When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet, and show improvements over prompting and Self-Debias in all cases, and improvements over fine-tuning in 12 out of 17 evaluations. In addition, steering vectors showed the lowest impact on MMLU scores of the four bias mitigation methods tested. The work presents the first systematic investigation of steering vectors for bias mitigation, and we demonstrate that they are a powerful and computationally efficient strategy for reducing bias in LLMs, with broader implications for enhancing AI safety.
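The basic mechanism, a mean-difference steering vector added to activations in the forward pass, can be sketched on synthetic activations (all data here is invented; the paper computes one such vector per bias axis from BBQ training examples inside a real LLM):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 64

# Toy setup: "biased" completions shift activations along one direction.
axis = rng.standard_normal(hidden)
axis /= np.linalg.norm(axis)
biased_acts = rng.standard_normal((100, hidden)) + 2.0 * axis
neutral_acts = rng.standard_normal((100, hidden))

# Steering vector = difference of mean activations between the two sets.
steer = neutral_acts.mean(axis=0) - biased_acts.mean(axis=0)

# Adding it during the forward pass moves biased activations back toward
# the neutral distribution along the bias direction.
proj_before = float((biased_acts @ axis).mean())
proj_after = float(((biased_acts + steer) @ axis).mean())
```

The intervention is a single vector addition per layer, which is why the approach is so much cheaper than fine-tuning.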
[979] See it to Place it: Evolving Macro Placements with Vision-Language Models
Ikechukwu Uchendu, Swati Goel, Karly Hou, Ebrahim Songhori, Kuang-Huei Lee, Joe Wenjie Jiang, Vijay Janapa Reddi, Vincent Zhuang
Main category: cs.LG
TL;DR: VeoPlace uses Vision-Language Models to guide chip floorplanning by having VLMs propose subregion constraints for component placement, combined with evolutionary optimization.
Details
Motivation: Human chip designers rely on spatial reasoning for macro placement, so VLMs with strong visual reasoning capabilities could complement existing learning-based approaches in electronic design automation.
Method: VeoPlace uses a pre-trained VLM without fine-tuning to propose subregion constraints on the chip canvas. These proposals guide a base placer, and an evolutionary search strategy iteratively optimizes placements based on quality metrics like wirelength.
Result: Outperforms best prior learning-based approach on 9 of 10 benchmarks with wirelength reductions exceeding 32%. Also improves analytical placer DREAMPlace on all 8 benchmarks with gains up to 4.3%.
Conclusion: VLMs can effectively solve complex physical design problems in EDA, opening new possibilities for foundation model applications in chip design optimization.
Abstract: We propose using Vision-Language Models (VLMs) for macro placement in chip floorplanning, a complex optimization task that has recently shown promising advancements through machine learning methods. Because human designers rely heavily on spatial reasoning to arrange components on the chip canvas, we hypothesize that VLMs with strong visual reasoning abilities can effectively complement existing learning-based approaches. We introduce VeoPlace (Visual Evolutionary Optimization Placement), a novel framework that uses a VLM, without any fine-tuning, to guide the actions of a base placer by constraining them to subregions of the chip canvas. The VLM proposals are iteratively optimized through an evolutionary search strategy with respect to resulting placement quality. On open-source benchmarks, VeoPlace outperforms the best prior learning-based approach on 9 of 10 benchmarks with peak wirelength reductions exceeding 32%. We further demonstrate that VeoPlace generalizes to analytical placers, improving DREAMPlace performance on all 8 evaluated benchmarks with gains up to 4.3%. Our approach opens new possibilities for electronic design automation tools that leverage foundation models to solve complex physical design problems.
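The outer loop is an evolutionary search over placement proposals scored by wirelength. A toy sketch (random samples play the role of the VLM's subregion proposals, and an L1 distance to fixed pins plays the role of the wirelength metric):

```python
import numpy as np

rng = np.random.default_rng(0)
ports = np.array([[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]])  # fixed pins

def wirelength(macro_xy):
    # Half-perimeter-style proxy: total L1 distance from macro to pins.
    return float(np.abs(ports - macro_xy).sum())

# Candidate subregion centers on the unit canvas (random here; VeoPlace
# asks a VLM for these), refined by evolutionary selection + mutation.
population = rng.uniform(0.0, 1.0, size=(16, 2))
best_xy, best_wl = None, np.inf
for _ in range(30):
    scores = np.array([wirelength(p) for p in population])
    i = int(np.argmin(scores))
    if scores[i] < best_wl:
        best_wl, best_xy = float(scores[i]), population[i]
    elite = population[np.argsort(scores)[:4]]          # keep best proposals
    population = np.clip(
        np.repeat(elite, 4, axis=0) + 0.05 * rng.standard_normal((16, 2)),
        0.0, 1.0,
    )
```

For these pins the per-coordinate median (0.5, 0.5) minimizes the proxy at wirelength 1.4, and the search converges near it.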
[980] Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang
Main category: cs.LG
TL;DR: Structured Agent Distillation framework compresses large LLM-based agents into smaller models by segmenting trajectories into reasoning and action spans with specialized losses, maintaining reasoning fidelity and action consistency.
Details
Motivation: Large LLM-based agents have high inference costs and large model sizes that constrain practical deployment, creating a need for efficient compression methods that preserve both reasoning and action capabilities.
Method: Proposes Structured Agent Distillation that segments agent trajectories into {[REASON]} and {[ACT]} spans, applying segment-specific losses to align each component with teacher behavior, enabling structure-aware supervision for compact agents.
Result: Outperforms token-level and imitation learning baselines on ALFWorld, HotPotQA-ReAct, and WebShop, achieving significant compression with minimal performance drop. Scaling and ablation studies highlight importance of span-level alignment.
Conclusion: Structured Agent Distillation enables efficient deployment of LLM-based agents by compressing large models while preserving reasoning fidelity and action consistency through structure-aware supervision.
Abstract: Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reasoning fidelity and action consistency. Unlike standard token-level distillation, our method segments trajectories into {[REASON]} and {[ACT]} spans, applying segment-specific losses to align each component with the teacher’s behavior. This structure-aware supervision enables compact agents to better replicate the teacher’s decision process. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines, achieving significant compression with minimal performance drop. Scaling and ablation results further highlight the importance of span-level alignment for efficient and deployable agents.
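Segment-specific supervision boils down to masking the token-level loss by span. A minimal sketch (random logits and invented span tags; the span weights are illustrative, not the paper's):

```python
import numpy as np

def span_losses(student_logp, teacher_ids, span_tags, w_reason=1.0, w_act=2.0):
    """Token-level NLL split by span: [REASON] tokens and [ACT] tokens
    get separate losses (and weights) instead of one undifferentiated
    sequence-level distillation loss."""
    nll = -student_logp[np.arange(len(teacher_ids)), teacher_ids]
    tags = np.asarray(span_tags)
    loss_reason = float(nll[tags == "REASON"].mean())
    loss_act = float(nll[tags == "ACT"].mean())
    return w_reason * loss_reason + w_act * loss_act, loss_reason, loss_act

rng = np.random.default_rng(0)
T, vocab = 6, 10
logits = rng.standard_normal((T, vocab))
logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
teacher_ids = rng.integers(0, vocab, size=T)
tags = ["REASON", "REASON", "REASON", "ACT", "ACT", "ACT"]

total, loss_reason, loss_act = span_losses(logp, teacher_ids, tags)
```

Separate terms let training trade reasoning fidelity against action consistency explicitly, which plain token-level distillation cannot do.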
[981] Expectation Error Bounds for Transfer Learning in Linear Regression and Linear Neural Networks
Meitong Liu, Christopher Jung, Rui Li, Xue Feng, Han Zhao
Main category: cs.LG
TL;DR: Theoretical analysis of when auxiliary tasks improve generalization in transfer learning for linear regression and neural networks, with exact conditions and optimal task weighting.
Details
Motivation: While transfer learning uses auxiliary data to improve main task performance, theoretical understanding of when and how auxiliary data helps remains incomplete. The paper aims to provide precise theoretical insights into conditions under which auxiliary tasks benefit generalization.
Method: 1) For linear regression: derived exact closed-form expressions for expected generalization error with bias-variance decomposition, yielding necessary and sufficient conditions for auxiliary tasks to help. 2) For under-parameterized linear neural networks with shared representations: derived non-asymptotic expectation bound on generalization error. 3) Developed new column-wise low-rank perturbation bound for random matrices to preserve fine-grained column structures.
Result: 1) For linear regression: obtained globally optimal task weights via solvable optimization programs with consistency guarantees for empirical estimates. 2) For linear neural networks: derived first non-vacuous sufficient condition for beneficial auxiliary learning and principled directions for task weight curation. 3) Verified results on synthetic data with controlled parameters.
Conclusion: The paper provides rigorous theoretical foundations for understanding when auxiliary tasks improve generalization in transfer learning, with practical implications for task weighting and representation learning in linear settings.
Abstract: In transfer learning, the learner leverages auxiliary data to improve generalization on a main task. However, the precise theoretical understanding of when and how auxiliary data help remains incomplete. We provide new insights on this issue in two canonical linear settings: ordinary least squares regression and under-parameterized linear neural networks. For linear regression, we derive exact closed-form expressions for the expected generalization error with bias-variance decomposition, yielding necessary and sufficient conditions for auxiliary tasks to improve generalization on the main task. We also derive globally optimal task weights as outputs of solvable optimization programs, with consistency guarantees for empirical estimates. For linear neural networks with shared representations of width $q \leq K$, where $K$ is the number of auxiliary tasks, we derive a non-asymptotic expectation bound on the generalization error, yielding the first non-vacuous sufficient condition for beneficial auxiliary learning in this setting, as well as principled directions for task weight curation. We achieve this by proving a new column-wise low-rank perturbation bound for random matrices, which improves upon existing bounds by preserving fine-grained column structures. Our results are verified on synthetic data simulated with controlled parameters.
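The regression setting can be simulated directly: compare a main-task-only OLS fit against a pooled fit that adds (weighted) auxiliary samples from a closely related task (the dimensions, noise levels, and task weight below are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_main, n_aux = 20, 15, 200
beta_main = rng.standard_normal(d)
beta_aux = beta_main + 0.1 * rng.standard_normal(d)   # closely related task

X_main = rng.standard_normal((n_main, d))
y_main = X_main @ beta_main + 0.5 * rng.standard_normal(n_main)
X_aux = rng.standard_normal((n_aux, d))
y_aux = X_aux @ beta_aux + 0.5 * rng.standard_normal(n_aux)

def fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Main-only estimator vs. pooled estimator with a task weight w on the
# auxiliary samples (the quantity the paper optimizes in closed form).
w = 1.0
b_alone = fit(X_main, y_main)
b_pooled = fit(np.vstack([X_main, np.sqrt(w) * X_aux]),
               np.concatenate([y_main, np.sqrt(w) * y_aux]))

X_test = rng.standard_normal((5000, d))
err_alone = float(np.mean((X_test @ (b_alone - beta_main)) ** 2))
err_pooled = float(np.mean((X_test @ (b_pooled - beta_main)) ** 2))
```

With scarce main-task data and a nearby auxiliary task, the variance reduction from pooling outweighs the bias from the task shift, exactly the trade-off the paper's conditions characterize.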
[982] Rethinking Language Model Scaling under Transferable Hypersphere Optimization
Liliang Ren, Yang Liu, Yelong Shen, Weizhu Chen
Main category: cs.LG
TL;DR: HyperP introduces hypersphere parameterization for stable scaling of large language models, enabling learning rate transfer across model dimensions and compute budgets while preventing training instability.
Details
Motivation: Existing hyperparameter transfer laws don't structurally prevent training instability at scale, and hypersphere optimization methods offer promising alternatives for more stable scaling of large language models.
Method: HyperP framework for transferring optimal learning rates across model width, depth, training tokens, and MoE granularity under Frobenius-sphere constraint with Muon optimizer; introduces SqrtGate MoE gating mechanism derived from hypersphere constraint.
Result: Achieves 1.58× compute efficiency over strong Muon baseline at 6×10²¹ FLOPs; maintains bounded instability indicators; enables larger auxiliary load-balancing weights for better expert balance.
Conclusion: Hypersphere parameterization enables stable scaling with transferable learning rates and improved training stability across compute budgets and model dimensions.
Abstract: Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer. We prove that weight decay is a first-order no-op on the Frobenius sphere, show that Depth-$μ$P remains necessary, and find that the optimal learning rate follows the same data-scaling power law with the “magic exponent” 0.32 previously observed for AdamW. A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, yielding $1.58\times$ compute efficiency over a strong Muon baseline at $6\times10^{21}$ FLOPs. Moreover, HyperP delivers transferable stability: all monitored instability indicators, including $Z$-values, output RMS, and activation outliers, remain bounded and non-increasing under training FLOPs scaling. We also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities for improved granularity scaling, and show that hypersphere optimization enables substantially larger auxiliary load-balancing weights, yielding both strong performance and good expert balance. We release our training codebase at https://github.com/microsoft/ArchScale.
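The Frobenius-sphere constraint itself is easy to state: after each optimizer step, retract the weight matrix back to a fixed norm. A sketch, which also illustrates the paper's observation that weight decay is a no-op on the sphere (the radius and shapes are illustrative):

```python
import numpy as np

def project_to_sphere(w, radius=1.0):
    """Retract a weight matrix onto the fixed-norm Frobenius sphere,
    the constraint HyperP assumes during training."""
    return w * (radius / np.linalg.norm(w))

rng = np.random.default_rng(0)
w = project_to_sphere(rng.standard_normal((32, 32)), radius=1.0)

# Weight decay is a first-order no-op here: shrinking w by a factor
# (1 - lambda) is undone exactly by the retraction.
decayed = project_to_sphere(0.9 * w, radius=1.0)
```

Because the norm is pinned, norm-growth instabilities (output RMS blow-ups, activation outliers) are excluded by construction rather than merely discouraged.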
[983] Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation
Vitória Barin Pacela, Shruti Joshi, Isabela Camacho, Simon Lacoste-Julien, David Klindt
Main category: cs.LG
TL;DR: Sparse autoencoders fail under OOD compositional shifts due to poor dictionary learning, not amortization gaps, challenging linear representation assumptions in neural networks.
Details
Motivation: The paper investigates why sparse autoencoders (SAEs) fail under out-of-distribution compositional shifts, despite theoretical guarantees from sparse coding methods. The motivation is to understand the limitations of SAEs in recovering latent factors under superposition, where linear decision boundaries in concept space may not remain linear after projection into activation space.
Method: The authors conduct controlled experiments to decompose SAE failures, comparing SAEs with classical sparse coding methods using per-sample iterative inference (FISTA). They test across various training set sizes, latent dimensions, and sparsity levels, and use an oracle baseline to establish upper bounds on performance.
Result: Results show that SAEs fail under OOD compositional shifts due to poor dictionary learning - the learned dictionaries point in substantially wrong directions. Replacing the SAE encoder with per-sample FISTA inference on the same dictionary does not close the performance gap, indicating dictionary learning is the binding constraint, not amortization.
Conclusion: The SAE failure is reframed as a dictionary learning challenge rather than an amortization problem. Scalable dictionary learning is identified as the key open problem for sparse inference under superposition, with implications for understanding neural network representations.
Abstract: The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent factors. Sparse autoencoders (SAEs), on the other hand, amortise sparse inference into a fixed encoder, introducing a systematic gap. We show this amortisation gap persists across training set sizes, latent dimensions, and sparsity levels, causing SAEs to fail under out-of-distribution (OOD) compositional shifts. Through controlled experiments that decompose the failure, we identify dictionary learning – not the inference procedure – as the binding constraint: SAE-learned dictionaries point in substantially wrong directions, and replacing the encoder with per-sample FISTA on the same dictionary does not close the gap. An oracle baseline proves the problem is solvable with a good dictionary at all scales tested. Our results reframe the SAE failure as a dictionary learning challenge, not an amortisation problem, and point to scalable dictionary learning as the key open problem for sparse inference under superposition.
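The per-sample iterative inference the paper contrasts with amortised SAE encoders can be sketched in a few lines. Below is a minimal FISTA solver for the lasso objective on a fixed dictionary; the dictionary size, sparsity level, and constants are illustrative, not the authors' setup:

```python
import numpy as np

def fista(D, x, lam=0.01, n_iter=300):
    """Per-sample sparse inference: min_z 0.5*||x - D z||^2 + lam*||z||_1."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the smooth part
    z = np.zeros(D.shape[1]); y = z.copy(); t = 1.0
    for _ in range(n_iter):
        g = D.T @ (D @ y - x)                  # gradient of the quadratic term
        w = y - g / L
        z_new = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft threshold
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = z_new + ((t - 1.0) / t_new) * (z_new - z)              # momentum step
        z, t = z_new, t_new
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)                 # unit-norm dictionary atoms
z_true = np.zeros(256)
z_true[rng.choice(256, 5, replace=False)] = rng.standard_normal(5)
x = D @ z_true                                 # noiseless sparse mixture
z_hat = fista(D, x)
print(np.linalg.norm(D @ z_hat - x) / np.linalg.norm(x))
```

With a good dictionary, this per-sample routine recovers the sparse code; the paper's point is that SAE-learned dictionaries are the weak link, not this inference step.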
[984] Temporal Credit Is Free
Aur Shalev Merin
Main category: cs.LG
TL;DR: Online RNN training using immediate derivatives with RMSprop matches full RTRL performance with 1000x less memory, scaling to n=1024.
Details
Motivation: Traditional RNN training methods like RTRL (Real-Time Recurrent Learning) require Jacobian propagation, which is computationally expensive and memory-intensive, limiting scalability. The paper aims to develop more efficient online training methods for recurrent networks.
Method: Proposes using immediate derivatives instead of full Jacobian propagation, combined with RMSprop for gradient normalization. Identifies architectural conditions where gradient normalization is needed: when gradients must pass through nonlinear state updates without output bypass. Validates across 10 architectures, primate neural data, and streaming ML benchmarks.
Result: Immediate derivatives with RMSprop match or exceed full RTRL performance while using 1000x less memory. Scales to n=1024 hidden units, demonstrating effectiveness across diverse architectures and real-world data.
Conclusion: Recurrent networks can be trained efficiently online without full Jacobian propagation by using immediate derivatives with proper gradient normalization, enabling scalable RNN training with minimal memory requirements.
Abstract: Recurrent networks do not need Jacobian propagation to adapt online. The hidden state already carries temporal credit through the forward pass; immediate derivatives suffice if you stop corrupting them with stale trace memory and normalize gradient scales across parameter groups. An architectural rule predicts when normalization is needed: $\beta_2$ is required when gradients must pass through a nonlinear state update with no output bypass, and unnecessary otherwise. Across ten architectures, real primate neural data, and streaming ML benchmarks, immediate derivatives with RMSprop match or exceed full RTRL, scaling to n = 1024 at 1000x less memory.
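To make the mechanism concrete, here is a hedged sketch of online training with immediate derivatives only: the previous hidden state is treated as a constant (no Jacobian trace), and RMSprop normalises gradient scales across parameter groups. The toy prediction task, sizes, and hyperparameters are invented for illustration, not the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8                                      # hidden units
W = 0.1 * rng.standard_normal((n, n))      # recurrent weights, learned
U = 0.5 * rng.standard_normal((n, 1))      # input weights, kept fixed here
v = np.zeros((n, 1))                       # readout, learned
sW, sv = np.zeros_like(W), np.zeros_like(v)
lr, b2, eps = 3e-3, 0.99, 1e-8

h = np.zeros((n, 1))
losses = []
for t in range(3000):
    x = np.array([[np.sin(0.1 * t)]])                 # streaming input
    target = np.array([[np.sin(0.1 * (t + 1))]])      # one-step-ahead prediction
    h_prev = h
    h = np.tanh(W @ h_prev + U @ x)
    err = (v.T @ h - target).item()
    losses.append(err ** 2)
    # immediate derivatives only: h_prev is treated as a constant
    gv = err * h
    gW = (err * v * (1.0 - h ** 2)) @ h_prev.T
    sv = b2 * sv + (1 - b2) * gv ** 2                 # RMSprop second moments
    sW = b2 * sW + (1 - b2) * gW ** 2
    v -= lr * gv / (np.sqrt(sv) + eps)
    W -= lr * gW / (np.sqrt(sW) + eps)
print(np.mean(losses[:300]), np.mean(losses[-300:]))
```

No eligibility trace or Jacobian is stored; per-parameter memory is just the RMSprop second-moment buffers, which is the source of the memory saving the paper quantifies.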
[985] Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds
N Alex Cayco Gajic, Arthur Pellegrino
Main category: cs.LG
TL;DR: MSA introduces a Riemannian geometry-based method to compare intrinsic geometries of neural representations, enabling analysis of different learning regimes, nonlinear dynamics, and diffusion models.
Details
Motivation: Existing similarity measures compare extrinsic geometry in state space but fail to capture crucial distinctions between fundamentally different neural network solutions. The paper aims to develop a method that analyzes intrinsic geometry under the manifold hypothesis.
Method: Metric Similarity Analysis (MSA) leverages tools from Riemannian geometry to compare intrinsic geometry of neural representations. It analyzes representations under the manifold hypothesis rather than comparing extrinsic geometry in state space.
Result: MSA can disentangle features of neural computations in deep networks with different learning regimes, compare nonlinear dynamics, and investigate diffusion models. It provides a mathematically grounded framework for understanding neural computation mechanisms.
Conclusion: MSA introduces a novel, broadly applicable framework to understand neural computation mechanisms by comparing their intrinsic geometries using Riemannian geometry tools.
Abstract: Similarity measures are widely used to interpret the representational geometries used by neural networks to solve tasks. Yet, because existing methods compare the extrinsic geometry of representations in state space, rather than their intrinsic geometry, they may fail to capture subtle yet crucial distinctions between fundamentally different neural network solutions. Here, we introduce metric similarity analysis (MSA), a novel method which leverages tools from Riemannian geometry to compare the intrinsic geometry of neural representations under the manifold hypothesis. We show that MSA can be used to i) disentangle features of neural computations in deep networks with different learning regimes, ii) compare nonlinear dynamics, and iii) investigate diffusion models. Hence, we introduce a mathematically grounded and broadly applicable framework to understand the mechanisms behind neural computations by comparing their intrinsic geometries.
[986] Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, Kunle Olukotun
Main category: cs.LG
TL;DR: ACE framework treats contexts as evolving playbooks that accumulate, refine, and organize strategies through modular generation, reflection, and curation to prevent context collapse and brevity bias in LLM applications.
Details
Motivation: LLM applications increasingly rely on context adaptation rather than weight updates, but suffer from brevity bias (dropping domain insights for concise summaries) and context collapse (iterative rewriting erodes details over time).
Method: ACE (Agentic Context Engineering) treats contexts as evolving playbooks with structured, incremental updates through a modular process of generation, reflection, and curation to preserve detailed knowledge and scale with long-context models.
Result: ACE outperforms strong baselines: +10.6% on agents and +8.6% on finance benchmarks, reduces adaptation latency and rollout cost, matches top-ranked production-level agent on AppWorld leaderboard overall, and surpasses it on harder test-challenge split using smaller open-source model.
Conclusion: Comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead through effective context engineering without labeled supervision.
Abstract: Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation: modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. We introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE can adapt effectively without labeled supervision, instead leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.
[987] Mitigating Premature Exploitation in Particle-based Monte Carlo for Inference-Time Scaling
Giorgio Giannone, Guangxuan Xu, Nikhil Shivakumar Nayak, Rohan Mahesh Awhad, Shivchander Sudalairaj, Kai Xu, Akash Srivastava
Main category: cs.LG
TL;DR: Entropic Particle Filtering (ePF) improves inference-time scaling for language models by addressing premature exploitation in particle filtering through entropy monitoring and look-ahead modulation.
Details
Motivation: Particle Filtering (PF) is vulnerable to premature exploitation when guided by process reward models, causing particle impoverishment where the algorithm commits too early to locally promising trajectories and prunes potentially correct hypotheses, especially under constrained computational budgets.
Method: ePF introduces two techniques: 1) Entropic Annealing (EA) monitors search diversity via entropy and dynamically anneals the resampling distribution when diversity drops to preserve exploration; 2) Look-ahead Modulation (LaM) adds a predictive guide to evaluate a state’s potential based on its successors.
Result: On several challenging math benchmarks, ePF significantly outperforms strong baselines and achieves up to a 50% relative improvement in task reward.
Conclusion: ePF improves PF’s resilience by balancing exploration of diverse solution spaces with exploitation of high-reward regions, leading to higher-quality solutions for complex mathematical reasoning tasks.
Abstract: Inference-Time Scaling (ITS) improves language models by allocating more computation at generation time. Particle Filtering (PF) has emerged as a strong ITS method for complex mathematical reasoning tasks, but it is vulnerable when guided by process reward models, which often assign overconfident scores early in the reasoning process. This causes PF to suffer from premature exploitation: it myopically commits to locally promising trajectories, prunes potentially correct hypotheses, and converges to suboptimal solutions. This failure mode, known as particle impoverishment, is especially severe under constrained computational budgets. To address this, we analyze the problem and identify two root causes: a lack of diversity in the particle set due to overconfident resampling and consequent inability to assess the potential of a reasoning path. We introduce Entropic Particle Filtering (ePF), an algorithm that integrates two new techniques to solve these issues. The first technique, Entropic Annealing (EA), directly mitigates particle impoverishment by monitoring search diversity via entropy; when diversity drops, it intervenes by dynamically annealing the resampling distribution to preserve exploration. The second, an enhancement called Look-ahead Modulation (LaM), adds a predictive guide to evaluate a state’s potential based on its successors. On several challenging math benchmarks, ePF significantly outperforms strong baselines and achieves up to a 50% relative improvement in task reward. Together, these methods improve PF’s resilience by balancing the exploration of diverse solution spaces with the exploitation of high-reward regions, ultimately leading to higher-quality solutions.
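The Entropic Annealing idea, monitoring resampling-weight entropy and flattening the distribution when diversity collapses, can be illustrated with a small sketch. The temperature rule below is a hypothetical instantiation of entropy-triggered annealing, not the authors' exact schedule:

```python
import numpy as np

def entropic_resample(weights, rng, tau_min=0.5):
    """Resample particle indices, tempering (annealing) the distribution when
    its normalised entropy drops below tau_min."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    H = -np.sum(p * np.log(p + 1e-12))         # entropy of resampling distribution
    H_max = np.log(len(p))
    if H / H_max < tau_min:                    # diversity collapsed: flatten
        T = tau_min * H_max / max(H, 1e-12)    # temperature > 1 flattens p
        p = p ** (1.0 / T)
        p = p / p.sum()
    return rng.choice(len(p), size=len(p), p=p), p

rng = np.random.default_rng(0)
peaked = [0.96, 0.01, 0.01, 0.01, 0.01]        # overconfident early PRM scores
idx, p_used = entropic_resample(peaked, rng)
print(p_used)
```

With the peaked input, the annealed distribution keeps the leading particle favoured but no longer lets it dominate resampling, preserving hypotheses that an overconfident reward model would otherwise prune.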
[988] Measuring all the noises of LLM Evals
Sida Wang
Main category: cs.LG
TL;DR: Statistical analysis framework for LLM evaluation noise, defining and measuring prediction noise, data noise, and total noise to improve statistical power in model comparisons.
Details
Motivation: LLM evaluations have unique noise characteristics that require specialized statistical methods to properly separate signal from noise and make sound empirical decisions about model performance.
Method: Proposes the all-pairs paired method that applies paired analysis to all pairs of LLMs, measures three types of noise (prediction noise from different answers, data noise from question sampling, and total noise), and analyzes millions of question-level predictions across various evaluations and settings.
Result: Two key findings: 1) Each evaluation exhibits characteristic and predictable total noise levels across all model pairs; 2) Paired prediction noise typically exceeds paired data noise, meaning reducing prediction noise through averaging can significantly increase statistical power.
Conclusion: By comprehensively measuring all noise components together, researchers can better assess evaluation results in context and apply optimal statistical analysis methods for making reliable empirical decisions about LLM performance.
Abstract: Separating signal from noise is central to experiments. Applying well-established statistical methods effectively to LLM evals requires consideration of their unique noise characteristics. We clearly define and measure three types of noise: prediction noise from generating different answers on a given question, data noise from sampling questions, and their combined total noise following the law of total variance. To emphasize relative comparisons and gain statistical power, we propose the all-pairs paired method, which applies the paired analysis to all pairs of LLMs and measures all the noise components based on millions of question-level predictions across many evals and settings, revealing clear patterns. First, each eval exhibits a characteristic and highly predictable total noise level across all model pairs. Second, paired prediction noise typically exceeds paired data noise, which means reducing prediction noise by averaging can significantly increase statistical power. By measuring all the noises together, we can assess eval results in context, lowering the barrier of using the best analysis to make sound empirical decisions.
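The three noise components follow directly from the law of total variance over question-level predictions. A minimal sketch on synthetic eval data (the Beta-Binomial generator and sizes are illustrative, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
# scores[q, s]: correctness of sampled answer s on question q (synthetic eval)
n_q, n_s = 500, 16
p_q = rng.beta(2.0, 2.0, size=n_q)             # per-question success probability
scores = rng.binomial(1, p_q[:, None], size=(n_q, n_s)).astype(float)

per_q_mean = scores.mean(axis=1)
pred_noise = scores.var(axis=1).mean()         # E_q[ Var_s ]: prediction noise
data_noise = per_q_mean.var()                  # Var_q[ E_s ]: data noise
total_noise = scores.var()                     # total = prediction + data noise
print(pred_noise, data_noise, total_noise)
```

The decomposition is exact with population variances and equal sample counts per question, which is what makes the two components directly comparable when deciding whether to spend budget on more samples per question or more questions.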
[989] Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts
Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang, Jinbo Su, Mengshu Sun, Lei Liang, Jing Zhang
Main category: cs.LG
TL;DR: Sparse-RL enables stable reinforcement learning for LLMs with sparse KV caches, addressing policy mismatch issues during training while maintaining performance and reducing memory overhead.
Details
Motivation: RL is essential for complex reasoning in LLMs, but memory overhead from KV caches during long-horizon rollouts limits training on limited hardware. Existing KV compression techniques work for inference but cause severe policy mismatch and performance collapse when applied to RL training.
Method: Sparse-RL addresses instability from policy mismatch among dense old policy, sparse sampler policy, and learner policy. It uses Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct off-policy bias from compression-induced information loss.
Result: Sparse-RL reduces rollout overhead compared to dense baselines while preserving performance. It also implements sparsity-aware training, significantly enhancing model robustness during sparse inference deployment.
Conclusion: Sparse-RL enables efficient RL training for LLMs with sparse KV caches, solving the policy mismatch problem and making RL training feasible on limited hardware while maintaining performance and improving robustness for sparse inference.
Abstract: Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). However, the substantial memory overhead of storing Key-Value (KV) caches during long-horizon rollouts acts as a critical bottleneck, often prohibiting efficient training on limited hardware. While existing KV compression techniques offer a remedy for inference, directly applying them to RL training induces a severe policy mismatch, leading to catastrophic performance collapse. To address this, we introduce Sparse-RL, which enables stable RL training under sparse rollouts. We show that instability arises from a fundamental policy mismatch among the dense old policy, the sparse sampler policy, and the learner policy. To mitigate this issue, Sparse-RL incorporates Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression-induced information loss. Experimental results show that Sparse-RL reduces rollout overhead compared to dense baselines while preserving the performance. Furthermore, Sparse-RL inherently implements sparsity-aware training, significantly enhancing model robustness during sparse inference deployment. The corresponding training data and code are publicly available on the repository.
[990] Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers
Shubham Aggarwal, Lokendra Kumar
Main category: cs.LG
TL;DR: Replacing dense attention projections with Walsh-Hadamard Transform reduces parameters by 25% while maintaining performance and improving compute efficiency.
Details
Motivation: Dense output projections in multi-head attention scale quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. There's a need for more efficient attention mechanisms.
Method: Replace dense output projection with fixed, parameter-free Walsh-Hadamard Transform (WHT) followed by diagonal affine transformation. This eliminates ~25% of attention parameters per block while maintaining global cross-head interaction through orthogonal, norm-preserving transformation.
Result: WHT-augmented models show steeper validation loss curve relative to training FLOPs, suggesting superior compute utilization. Efficiency gains (reduced memory, increased throughput) grow monotonically with model size, batch size, and sequence length. Structured transform consistently outperforms dense projections as complexity increases.
Conclusion: Replacing dense projections with structured transforms enables more compute-efficient architectures that achieve lower loss than dense models at equivalent training budgets, with efficiency gains scaling with model complexity.
Abstract: The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter-free Walsh-Hadamard Transform (WHT) followed by a diagonal affine transformation. This approach eliminates approximately 25 percent of attention parameters per block while maintaining global cross-head interaction through an orthogonal, norm-preserving transformation. Our results demonstrate that WHT-augmented models exhibit a steeper validation loss curve relative to training FLOPs compared to dense baselines, suggesting superior compute utilization during training. Crucially, we show that efficiency gains, including reduced memory footprint and increased throughput, grow monotonically with model size, batch size, and sequence length. We evaluate performance across both prefill and decoding stages, finding that the structured transform consistently outperforms dense projections as complexity increases. Our findings indicate that replacing dense projections with structured transforms allows for more compute-efficient architectures that achieve lower loss than dense models at an equivalent training budget.
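The proposed replacement, a fixed orthonormal Walsh-Hadamard Transform followed by a learned diagonal affine map, is easy to sketch. The code below implements the standard fast-WHT butterfly; the demo shapes and parameter choices are illustrative:

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis
    (length must be a power of two)."""
    x = x.copy()
    d = x.shape[-1]
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b            # butterfly: sum branch
            x[..., i + h:i + 2 * h] = a - b    # butterfly: difference branch
        h *= 2
    return x / np.sqrt(d)                      # orthonormal scaling

def wht_output_projection(attn_out, scale, shift):
    """Parameter-light stand-in for the dense d x d output projection:
    a fixed WHT mixes information globally, then a learned diagonal
    affine map (2d parameters instead of d^2)."""
    return fwht(attn_out) * scale + shift

d = 64
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d))                # concatenated head outputs
scale, shift = np.ones(d), np.zeros(d)         # the only learned parameters
y = wht_output_projection(x, scale, shift)
print(y.shape)
```

Because the scaled Hadamard matrix is orthogonal (and its own inverse), the transform is norm-preserving, which is the property the abstract relies on for stable cross-head mixing.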
[991] CausalEvolve: Towards Open-Ended Discovery with Causal Scratchpad
Yongqiang Chen, Chenxi Liu, Zhenhao Chen, Tongliang Liu, Bo Han, Kun Zhang
Main category: cs.LG
TL;DR: CausalEvolve improves AI scientist agents by adding causal reasoning to guide program evolution, addressing efficiency decline and oscillation at performance boundaries.
Details
Motivation: Existing evolve-based AI agents lack targeted guidance for evolution and effective knowledge organization, leading to decreasing efficiency and oscillatory behavior when approaching known performance limits.
Method: CausalEvolve uses a causal scratchpad where LLMs identify outcome-level factors for complementary inspirations, inspect surprise patterns during evolution, and use abductive reasoning to hypothesize new factors that offer novel evolutionary directions.
Result: CausalEvolve effectively improves evolutionary efficiency and discovers better solutions in 4 challenging open-ended scientific tasks.
Conclusion: The causal reasoning approach enhances evolve-based AI scientists by providing better guidance and knowledge utilization during program evolution.
Abstract: Evolve-based agents such as AlphaEvolve are among the notable successes in using Large Language Models (LLMs) to build AI Scientists. These agents tackle open-ended scientific problems by iteratively improving and evolving programs, leveraging the prior knowledge and reasoning capabilities of LLMs. Despite the success, existing evolve-based agents lack targeted guidance for evolution and effective mechanisms for organizing and utilizing knowledge acquired from past evolutionary experience. Consequently, they suffer from decreasing evolution efficiency and exhibit oscillatory behavior when approaching known performance boundaries. To mitigate the gap, we develop CausalEvolve, equipped with a causal scratchpad that leverages LLMs to identify and reason about guiding factors for evolution. At the beginning, CausalEvolve first identifies outcome-level factors that offer complementary inspirations in improving the target objective. During the evolution, CausalEvolve also inspects surprise patterns and uses abductive reasoning to hypothesize new factors, which in turn offer novel directions. Through comprehensive experiments, we show that CausalEvolve effectively improves the evolutionary efficiency and discovers better solutions in 4 challenging open-ended scientific tasks.
[992] Problems with Chinchilla Approach 2: Systematic Biases in IsoFLOP Parabola Fits
Eric Czech, Zhiwei Xu, Yael Elmatad, Yixin Wang, William Held
Main category: cs.LG
TL;DR: Chinchilla Approach 2 for neural scaling laws has systematic biases in compute-optimal allocation estimates, leading to significant compute waste; Approach 3 with Variable Projection offers unbiased, stable alternative.
Details
Motivation: Chinchilla Approach 2 is widely used but introduces systematic biases in compute-optimal allocation estimates, causing significant unnecessary compute costs. The biases are particularly problematic for large models like Llama 3 and even worse for multimodal models due to loss surface asymmetry.
Method: Analyzes three sources of error in Approach 2: IsoFLOP sampling grid width, uncentered IsoFLOP sampling, and loss surface asymmetry. Proposes using Chinchilla Approach 3 with Variable Projection, which exploits the partially linear structure of the objective to enable unbiased inference through a two-dimensional optimization that is well-conditioned and analytically differentiable.
Result: Applied to Llama 3 IsoFLOP data, Approach 2 biases imply parameter underallocation corresponding to 6.5% of the $3.8×10²⁵ FLOP training budget and $1.4M in unnecessary compute. Simulated multimodal model misallocations show even greater opportunity costs. Approach 3 with Variable Projection eliminates these biases and addresses concerns about data-efficiency, numerical stability, and implementation difficulty.
Conclusion: Chinchilla Approach 3 with Variable Projection provides an unbiased, stable, and practical alternative to Approach 2 for neural scaling law estimation, with particular importance for multimodal models where loss surface asymmetry amplifies biases. The method enables accurate compute-optimal allocation and reduces significant compute waste.
Abstract: Chinchilla Approach 2 is among the most widely used methods for fitting neural scaling laws. Its parabolic approximation introduces systematic biases in compute-optimal allocation estimates, even on noise-free synthetic data. Applied to published Llama 3 IsoFLOP data at open frontier compute scales, these biases imply a parameter underallocation corresponding to 6.5% of the $3.8\times10^{25}$ FLOP training budget and $1.4M (90% CI: $412K-$2.9M) in unnecessary compute at 50% H100 MFU. Simulated multimodal model misallocations show even greater opportunity costs due to higher loss surface asymmetry. Three sources of this error are examined: IsoFLOP sampling grid width (Taylor approximation accuracy), uncentered IsoFLOP sampling, and loss surface asymmetry ($α\neq β$). Chinchilla Approach 3 largely eliminates these biases but is often regarded as less data-efficient, numerically unstable, prone to local minima, and harder to implement. Each concern is shown to be unfounded or addressable, especially when the partially linear structure of the objective is exploited via Variable Projection, enabling unbiased inference on all five loss surface parameters through a two-dimensional optimization that is well-conditioned, analytically differentiable, and amenable to dense, or even exhaustive, grid search. It may serve as a more convenient replacement for Approach 2 or a more scalable alternative for adaptations of Approach 3 to richer scaling law formulations. See https://github.com/Open-Athena/vpnls for details and https://openathena.ai/scaling-law-analysis for other results from this study.
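Approach 2's parabola fit can be reproduced in a few lines: fit a quadratic to loss versus $\log N$ along an IsoFLOP slice and read the compute-optimal $N$ off the vertex. On a synthetic Chinchilla-style surface with $α\neq β$ (all constants below are invented for illustration), the vertex lands measurably away from the true minimiser, which is the bias the paper dissects:

```python
import numpy as np

# Synthetic loss surface L(N, D) = E + A/N^alpha + B/D^beta, evaluated along
# one IsoFLOP slice D = C / (6 N). Constants are illustrative, not fitted.
E, A, B, alpha, beta = 1.7, 1500.0, 2000.0, 0.34, 0.28
C = 1e21

def iso_loss(N):
    D = C / (6.0 * N)
    return E + A / N**alpha + B / D**beta

logN = np.linspace(np.log(1e8), np.log(1e10), 15)   # IsoFLOP sampling grid
y = iso_loss(np.exp(logN))

# Approach 2: parabola in log N, compute-optimal N from the vertex
a2, a1, a0 = np.polyfit(logN, y, 2)
N_parabola = np.exp(-a1 / (2.0 * a2))

# Reference: minimise the true slice on a fine grid
fine = np.linspace(np.log(1e7), np.log(1e11), 200000)
N_true = np.exp(fine[np.argmin(iso_loss(np.exp(fine)))])
print(N_parabola, N_true, N_parabola / N_true)
```

The gap between vertex and true minimiser grows with grid width, off-centre sampling, and the $α$-$β$ asymmetry, matching the three error sources the paper identifies.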
[993] Algorithmic Insurance
Dimitris Bertsimas, Agni Orfanoudaki
Main category: cs.LG
TL;DR: Algorithmic insurance framework for AI systems that connects classification performance to financial liability, showing how risk-aware thresholds reduce tail risk and enable value creation through insurance contracts.
Details
Motivation: AI errors in high-stakes domains create heterogeneous losses that challenge traditional insurance, while insurers struggle to price AI risks and developers lack frameworks connecting system design to financial liability exposure.
Method: Analyzes connection between binary classification performance and tail risk using conditional value-at-risk (CVaR), proves accuracy maximization increases worst-case losses, proposes liability insurance contract structure with risk-aware thresholds, validates with mammography case study.
Result: CVaR-optimal thresholds reduce tail risk up to 13-fold compared to accuracy maximization; insurance contracts create 14-16% gains for well-calibrated firms and up to 65% for poorly calibrated firms through risk transfer, mandatory recalibration, and regulatory capital relief.
Conclusion: Algorithmic insurance functions as both financial instrument and operational governance mechanism, enabling efficient risk transfer while improving AI safety, unlike traditional insurance that merely transfers risk.
Abstract: When AI systems make errors in high-stakes domains like medical diagnosis or autonomous vehicles, a single algorithmic flaw across varying operational contexts can generate highly heterogeneous losses that challenge traditional insurance assumptions. Algorithmic insurance constitutes a novel form of financial coverage for AI-induced damages, representing an emerging market that addresses algorithm-driven liability. However, insurers currently struggle to price these risks, while AI developers lack rigorous frameworks connecting system design with financial liability exposure. We analyze the connection between operational choices of binary classification performance to tail risk exposure. Using conditional value-at-risk (CVaR) to capture extreme losses, we prove that established approaches like maximizing accuracy can significantly increase worst-case losses compared to tail risk optimization, with penalties growing quadratically as thresholds deviate from optimal. We then propose a liability insurance contract structure that mandates risk-aware classification thresholds and characterize the conditions under which it creates value for AI providers. Our analysis extends to degrading model performance and human oversight scenarios. We validate our findings through a mammography case study, demonstrating that CVaR-optimal thresholds reduce tail risk up to 13-fold compared to accuracy maximization. This risk reduction enables insurance contracts to create 14-16% gains for well-calibrated firms, while poorly calibrated firms benefit up to 65% through risk transfer, mandatory recalibration, and regulatory capital relief. Unlike traditional insurance that merely transfers risk, algorithmic insurance can function as both a financial instrument and an operational governance mechanism, simultaneously enabling efficient risk transfer while improving AI safety.
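The contrast between accuracy-maximising and CVaR-optimal thresholds can be sketched on synthetic data with heterogeneous false-negative costs. All distributions, costs, and thresholds below are illustrative, not the paper's mammography setup:

```python
import numpy as np

def cvar(losses, alpha=0.95):
    """Conditional value-at-risk: mean of the worst (1 - alpha) fraction."""
    k = max(1, int(np.ceil((1 - alpha) * len(losses))))
    return float(np.sort(losses)[-k:].mean())

rng = np.random.default_rng(0)
n = 20000
y = rng.binomial(1, 0.1, n)                              # 10% positives
score = np.clip(0.1 + 0.6 * y + 0.2 * rng.standard_normal(n), 0, 1)
fn_cost = rng.lognormal(3.0, 1.0, n)                     # heavy-tailed miss costs
fp_cost = 1.0                                            # cheap false alarms

def losses_at(t):
    pred = score >= t
    fn = (y == 1) & ~pred                                # missed positives
    fp = (y == 0) & pred                                 # false alarms
    return fn * fn_cost + fp * fp_cost

ts = np.linspace(0.05, 0.95, 91)
acc_t = ts[np.argmax([np.mean((score >= t) == y) for t in ts])]
cvar_t = ts[np.argmin([cvar(losses_at(t)) for t in ts])]
print(acc_t, cvar_t, cvar(losses_at(acc_t)) / cvar(losses_at(cvar_t)))
```

With heavy-tailed miss costs, the tail-risk-optimal threshold sits well below the accuracy-optimal one: it trades many cheap false alarms for far fewer catastrophic misses, which is the qualitative effect behind the paper's 13-fold tail-risk reduction.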
[994] Less is More: Rethinking Few-Shot Learning and Recurrent Neural Nets
Deborah Pereg, Martin Villiger, Brett Bouma, Polina Golland
Main category: cs.LG
TL;DR: The paper explores the asymptotic equipartition property (AEP) in machine learning, providing theoretical guarantees for reliable learning under information-theoretic AEP, and proposes a reduced-entropy RNN algorithm for few-shot learning with applications in image deblurring and OCT speckle suppression.
Details
Motivation: The paper aims to bridge information theory (specifically the asymptotic equipartition property) with machine learning to improve few-shot learning. The motivation is to provide theoretical foundations for reliable learning under information-theoretic principles and develop more sample-efficient algorithms.
Method: The authors provide theoretical analysis of AEP in machine learning context, propose a reduced-entropy algorithm for few-shot learning using RNNs, and provide mathematical intuition for RNNs as approximations of sparse coding solvers. They validate their approach on image deblurring and OCT speckle suppression tasks.
Result: Experimental results demonstrate significant improvements in learning models’ sample efficiency, generalization, and time complexity, showing potential for practical real-time applications in image processing tasks.
Conclusion: The work successfully connects information theory with machine learning, providing both theoretical guarantees and practical algorithms for few-shot learning, with demonstrated effectiveness in vision-related applications.
Abstract: The statistical supervised learning framework assumes an input-output set with a joint probability distribution that is reliably represented by the training dataset. The learner is then required to output a prediction rule learned from the training dataset’s input-output pairs. In this work, we provide meaningful insights into the asymptotic equipartition property (AEP) (Shannon, 1948) in the context of machine learning, and illuminate some of its potential ramifications for few-shot learning. We provide theoretical guarantees for reliable learning under the information-theoretic AEP, and for the generalization error with respect to the sample size. We then focus on a highly efficient recurrent neural net (RNN) framework and propose a reduced-entropy algorithm for few-shot learning. We also propose a mathematical intuition for the RNN as an approximation of a sparse coding solver. We verify the applicability, robustness, and computational efficiency of the proposed approach with image deblurring and optical coherence tomography (OCT) speckle suppression. Our experimental results demonstrate significant potential for improving learning models’ sample efficiency, generalization, and time complexity, that can therefore be leveraged for practical real-time applications.
[995] Convergence of the Inexact Langevin Algorithm in KL Divergence with Application to Score-based Generative Models
Kaylee Yingxi Yang, Andre Wibisono
Main category: cs.LG
TL;DR: Theoretical analysis of inexact Langevin dynamics/algorithms with score function estimates, establishing stable convergence guarantees under log-Sobolev inequality and sub-Gaussian score error assumptions.
Details
Motivation: Motivated by Score-based Generative Modeling (SGM), the paper studies the theoretical properties of inexact Langevin methods, where exact scores are replaced by estimates, addressing the practical reality that exact scores are rarely available in real applications.
Method: Analyzes the Inexact Langevin Dynamics (ILD) and Inexact Langevin Algorithm (ILA) with score function estimates. Establishes convergence guarantees in KL divergence under a log-Sobolev inequality and a sub-Gaussian score-error assumption. Also obtains Rényi divergence bounds under a stronger L∞ error assumption. Demonstrates that kernel density estimation can provide provably accurate score estimators for sub-Gaussian targets.
Result: Derives stable biased convergence guarantees for ILD/ILA in KL divergence under specified assumptions. Shows kernel density estimation satisfies the required MGF error assumption for sub-Gaussian distributions at population level. Generalizes results to Score-based Generative Modeling.
Conclusion: Provides theoretical foundation for using inexact score estimates in Langevin-based generative models, showing stable convergence is achievable with proper error control and distributional assumptions, with practical score estimators available via kernel density methods.
Abstract: Motivated by the increasingly popular Score-based Generative Modeling (SGM), we study the Inexact Langevin Dynamics (ILD) and Inexact Langevin Algorithm (ILA), where a score function estimate is used in place of the exact score. We establish stable biased convergence guarantees in terms of the Kullback-Leibler (KL) divergence. To achieve these guarantees, we impose two key assumptions: 1) the target distribution satisfies the log-Sobolev inequality, and 2) the error of the score estimator exhibits a sub-Gaussian tail, referred to as the Moment Generating Function (MGF) error assumption. Under the stronger L∞ score error assumption, we obtain a stable convergence bound in Rényi divergence. We also generalize the proof technique to SGM, and derive a stable convergence bound in KL divergence. In addition, we explore the question of how to obtain a provably accurate score estimator. We demonstrate that a simple estimator based on kernel density estimation fulfills the MGF error assumption for sub-Gaussian target distributions, at the population level.
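The ILA update analyzed in this paper is simple to state: a Langevin step driven by an estimated score. The sketch below uses an illustrative standard-Gaussian target (exact score -x) and a deliberately perturbed score estimate; both choices are for demonstration only and are not taken from the paper.

```python
import numpy as np

def inexact_langevin(score_est, x0, step=0.01, n_iters=2000, rng=None):
    """Run the Inexact Langevin Algorithm (ILA):

        x_{k+1} = x_k + step * s(x_k) + sqrt(2 * step) * N(0, I),

    where s is an *estimate* of the score (gradient of the log density).
    Operating elementwise, a vector x0 runs independent 1-D chains in parallel.
    """
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        x = x + step * score_est(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x

# Illustrative target N(0, 1): exact score is -x; the sin term is a small,
# bounded estimation error, so the chain converges to a slightly biased law.
noisy_score = lambda x: -x + 0.01 * np.sin(x)
```

The paper's point, in this toy setting, is that a well-controlled score error yields a stable bias in the stationary distribution rather than divergence.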
[996] Correcting Auto-Differentiation in Neural-ODE Training
Yewei Xu, Shi Chen, Qin Li
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2306.02192 returned HTTP 429 (rate limited).
[997] Unichain and Aperiodicity are Sufficient for Asymptotic Optimality of Average-Reward Restless Bandits
Yige Hong, Qiaomin Xie, Yudong Chen, Weina Wang
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2402.05689 returned HTTP 429 (rate limited).
[998] Transformers learn variable-order Markov chains in-context
Ruida Zhou, Chao Tian, Suhas Diggavi
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2410.05493 returned HTTP 429 (rate limited).
[999] Understanding SAM’s Robustness to Noisy Labels through Gradient Down-weighting
Hoang-Chau Luong, Quang-Thuc Nguyen, Dat Ba Tran, Minh-Triet Tran
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2411.17132 returned HTTP 429 (rate limited).
[1000] Scalable Neural Network Verification with Branch-and-bound Inferred Cutting Planes
Duo Zhou, Christopher Brix, Grani A Hanasusanto, Huan Zhang
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2501.00200 returned HTTP 429 (rate limited).
[1001] Binned Spectral Power Loss for Improved Prediction of Chaotic Systems
Dibyajyoti Chakraborty, Arvind T. Mohan, Romit Maulik
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2502.00472 returned HTTP 429 (rate limited).
[1002] MM-DADM: Multimodal Drug-Aware Diffusion Model for Virtual Clinical Trials
Qian Shao, Bang Du, Zepeng Li, Qiyuan Chen, Jiahe Chen, Hongxia Xu, Jimeng Sun, Jian Wu, Jintai Chen
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2502.07297 returned HTTP 429 (rate limited).
[1003] Birch SGD: A Tree Graph Framework for Local and Asynchronous SGD Methods
Alexander Tyurin, Danil Sivtsov
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2505.09218 returned HTTP 429 (rate limited).
[1004] Deep Latent Variable Model based Vertical Federated Learning with Flexible Alignment and Labeling Scenarios
Kihun Hong, Sejun Park, Ganguk Hwang
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2505.11035 returned HTTP 429 (rate limited).
[1005] Context parroting: A simple but tough-to-beat baseline for foundation models in scientific machine learning
Yuanzhao Zhang, William Gilpin
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2505.11349 returned HTTP 429 (rate limited).
[1006] CoDec: Prefix-Shared Decoding Kernel for LLMs
Zhibin Wang, Rui Ning, Chao Fang, Zhonghui Zhang, Xi Lin, Shaobo Ma, Mo Zhou, Xue Li, Zhongfeng Wang, Chengying Huan, Rong Gu, Kun Yang, Guihai Chen, Sheng Zhong, Chen Tian
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2505.17694 returned HTTP 429 (rate limited).
[1007] Meet Me at the Arm: The Cooperative Multi-Armed Bandits Problem with Shareable Arms
Xinyi Hu, Aldo Pacchiano
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2506.10127 returned HTTP 429 (rate limited).
[1008] Designing User-Centric Metrics for Evaluation of Counterfactual Explanations
Firdaus Ahmed Choudhury, Ethan Leicht, Jude Ethan Bislig, Hangzhi Guo, Amulya Yadav
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2507.15162 returned HTTP 429 (rate limited).
[1009] Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning
Ali Taheri, Alireza Taban, Qizhou Wang, Shanshan Ye, Abdolreza Mirzaei, Tongliang Liu, Bo Han
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2508.04329 returned HTTP 429 (rate limited).
[1010] Initialization Schemes for Kolmogorov-Arnold Networks: An Empirical Study
Spyros Rigas, Dhruv Verma, Georgios Alexandridis, Yixuan Wang
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2509.03417 returned HTTP 429 (rate limited).
[1011] On the Normalization of Confusion Matrices: Methods and Geometric Interpretations
Johan Erbani, Pierre-Edouard Portier, Elod Egyed-Zsigmond, Sonia Ben Mokhtar, Diana Nurbakova
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2509.04959 returned HTTP 429 (rate limited).
[1012] No Need for Learning to Defer? A Training Free Deferral Framework to Multiple Experts through Conformal Prediction
Tim Bary, Benoît Macq, Louis Petit
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2509.12573 returned HTTP 429 (rate limited).
[1013] ReTrack: Data Unlearning in Diffusion Models through Redirecting the Denoising Trajectory
Qitan Shi, Cheng Jin, Jiawei Zhang, Yuantao Gu
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2509.13007 returned HTTP 429 (rate limited).
[1014] GaussianPSL: Soft partitioning for complex PSL problem
Phuong Mai Dinh, Van-Nam Huynh
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2509.17889 returned HTTP 429 (rate limited).
[1015] Learning Genetic Circuit Modules with Neural Networks: Full Version
Jichi Wang, Eduardo D. Sontag, Domitilla Del Vecchio
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2509.19601 returned HTTP 429 (rate limited).
[1016] Enhancing Credit Risk Prediction: A Multi-stage Ensemble Pipeline
Haibo Wang, Jun Huang, Lutfu S. Sua, Figen Balo, Burak Dolar
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2509.22381 returned HTTP 429 (rate limited).
[1017] Asynchronous Policy Gradient Aggregation for Efficient Distributed Reinforcement Learning
Alexander Tyurin, Andrei Spiridonov, Varvara Rudenko
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2509.24305 returned HTTP 429 (rate limited).
[1018] LLM-Assisted Emergency Triage Benchmark: Bridging Hospital-Rich and MCI-Like Field Simulation
Joshua Sebastian, Karma Tobden, KMA Solaiman
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2509.26351 returned HTTP 429 (rate limited).
[1019] To Augment or Not to Augment? Diagnosing Distributional Symmetry Breaking
Hannah Lawrence, Elyssa Hofgard, Vasco Portilheiro, Yuxuan Chen, Tess Smidt, Robin Walters
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2510.01349 returned HTTP 429 (rate limited).
[1020] Q-Learning with Shift-Aware Upper Confidence Bound in Non-Stationary Reinforcement Learning
Ha Manh Bui, Felix Parker, Kimia Ghobadi, Anqi Liu
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2510.03181 returned HTTP 429 (rate limited).
[1021] LLM as an Algorithmist: Enhancing Anomaly Detectors via Programmatic Synthesis
Hangting Ye, Jinmeng Li, He Zhao, Mingchen Zhuge, Dandan Guo, Yi Chang, Hongyuan Zha
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2510.03904 returned HTTP 429 (rate limited).
[1022] TabPFN-Wide: Continued Pre-Training for Extreme Feature Counts
Christopher Kolberg, Jules Kreuer, Jonas Huurdeman, Sofiane Ouaari, Katharina Eggensperger, Nico Pfeifer
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2510.06162 returned HTTP 429 (rate limited).
[1023] Constraints-of-Thought: A Framework for Constrained Reasoning in Language-Model-Guided Search
Kamel Alrashedy, Vriksha Srihari, Zulfiqar Zaidi, Ridam Srivastava, Pradyumna Tambwekar, Matthew Gombolay
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2510.08992 returned HTTP 429 (rate limited).
[1024] PANTHER: Generative Pretraining Beyond Language for Sequential User Behavior Modeling
Guilin Li, Yun Zhang, Xiuyuan Chen, Chengqi Li, Bo Wang, Linghe Kong, Wenjia Wang, Weiran Huang, Matthias Hwai Yong Tan
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2510.10102 returned HTTP 429 (rate limited).
[1025] Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees
Shurong Lin, Aleksandra Slavković, Deekshith Reddy Bhoomireddy
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2510.16974 returned HTTP 429 (rate limited).
[1026] SPORE: Skeleton Propagation Over Recalibrating Expansions
Randolph Wiredu-Aidoo
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2511.00064 returned HTTP 429 (rate limited).
[1027] Decomposable Neuro Symbolic Regression
Giorgio Morales, John W. Sheppard
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2511.04124 returned HTTP 429 (rate limited).
[1028] NeuralCrop: Combining physics and machine learning for improved crop yield projections
Yunan Lin, Sebastian Bathiany, Maha Badri, Maximilian Gelbrecht, Philipp Hess, Brian Groenke, Jens Heinke, Christoph Müller, Niklas Boers
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2512.20177 returned HTTP 429 (rate limited).
[1029] Electricity Price Forecasting: Bridging Linear Models, Neural Networks and Online Learning
Btissame El Mahtout, Florian Ziel
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2601.02856 returned HTTP 429 (rate limited).
[1030] Symbolic Regression for Shared Expressions: Introducing Partial Parameter Sharing
Viktor Martinek, Roland Herzog
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2601.04051 returned HTTP 429 (rate limited).
[1031] A Dynamic Framework for Grid Adaptation in Kolmogorov-Arnold Networks
Spyros Rigas, Thanasis Papaioannou, Panagiotis Trakadas, Georgios Alexandridis
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2601.18672 returned HTTP 429 (rate limited).
[1032] Smoothing the Score Function for Generalization in Diffusion Models: An Optimization-based Explanation Framework
Xinyu Zhou, Jiawei Zhang, Stephen J. Wright
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2601.19285 returned HTTP 429 (rate limited).
[1033] Echo State Networks for Time Series Forecasting: Hyperparameter Sweep and Benchmarking
Alexander Häußer
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2602.03912 returned HTTP 429 (rate limited).
[1034] DADP: Domain Adaptive Diffusion Policy
Pengcheng Wang, Qinghang Liu, Haotian Lin, Yiheng Li, Guojian Zhan, Masayoshi Tomizuka, Yixiao Wang
Main category: cs.LG
Abstract: Not available; the arXiv API request for 2602.04037 returned HTTP 429 (rate limited).
[1035] Joint Embedding Variational Bayes
Amin Oji, Paul Fieguth
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.05639.
[1036] Live Knowledge Tracing: Real-Time Adaptation using Tabular Foundation Models
Mounir Lbath, Alexandre Paresy, Abdelkayoum Kaddouri, Abdelrahman Zighem, Alan André, Alexandre Ittah, Jill-Jênn Vie
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.06542.
[1037] Compact Conformal Subgraphs
Sreenivas Gollapudi, Kostas Kollias, Kamesh Munagala, Aravindan Vijayaraghavan
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.07530.
[1038] Uncertainty quantification in neural network-based glucose prediction for diabetes
Hai Siong Tan, Rafe McBeth
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.04955.
[1039] On-Policy Self-Distillation for Reasoning Compression
Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.05433.
[1040] Few Batches or Little Memory, But Not Both: Simultaneous Space and Adaptivity Constraints in Stochastic Bandits
Ruiyuan Huang, Zicheng Lyu, Xiaoyi Zhu, Zengfeng Huang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.13742.
[1041] Dataset Distillation Efficiently Encodes Low-Dimensional Representations from Gradient-Based Learning of Non-Linear Tasks
Yuri Kinoshita, Naoki Nishikawa, Taro Toyoizumi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.14830.
[1042] Off-Policy Learning with Limited Supply
Koichi Tanaka, Ren Kishimoto, Bushun Kawagishi, Yusuke Narita, Yasuo Yamamoto, Nobuyuki Shimizu, Yuta Saito
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.18702.
[1043] Wasserstein Propagation for Reverse Diffusion under Weak Log-Concavity: Exploiting Metric Mismatch via One-Switch Routing
Zicheng Lyu, Zengfeng Huang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.19670.
[1044] Graph-Aware Stealthy Poison-Text Backdoors for Text-Attributed Graphs
Qi Luo, Minghui Xu, Dongxiao Yu, Xiuzhen Cheng
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.20339.
[1045] Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration
Zakaria Mhammedi, James Cohan
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.22273.
[1046] SkillRouter: Skill Routing for LLM Agents at Scale
YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuai Zhu, Yong Wu, Tianze Xu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.22455.
[1047] Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch
Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, Arber Zela
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.24647.
[1048] PEANUT: Perturbations by Eigenvector Alignment for Attacking Graph Neural Networks Under Topology-Driven Message Passing
Bhavya Kohli, Biplab Sikdar
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.26136.
[1049] Shapley meets Rawls: an integrated framework for measuring and explaining unfairness
Fadoua Amri-Jouidel, Emmanuel Kemel, Stéphane Mussard
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.26476.
[1050] Evasion Adversarial Attacks Remain Impractical Against ML-based Network Intrusion Detection Systems, Especially Dynamic Ones
Mohamed elShehaby, Ashraf Matrawy
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2306.05494.
[1051] LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme
Jeongmin Brian Park, Kun Wu, Vikram Sharma Mailthody, Zaid Quresh, Scott Mahlke, Wen-mei Hwu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2407.15264.
[1052] Mini-batch Estimation for Deep Cox Models: Statistical Foundations and Practical Guidance
Lang Zeng, Weijing Tang, Zhao Ren, Ying Ding
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2408.02839.
[1053] Predictive variational inference: Learn the predictively optimal posterior distribution
Jinlin Lai, Antonio Linero, Yuling Yao
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2410.14843.
[1054] Trans-Glasso: A Transfer Learning Approach to Precision Matrix Estimation
Boxin Zhao, Cong Ma, Mladen Kolar
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2411.15624.
[1055] Mean–Variance Portfolio Selection by Continuous-Time Reinforcement Learning: Algorithms, Regret Analysis, and Empirical Study
Yilie Huang, Yanwei Jia, Xun Yu Zhou
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2412.16175.
[1056] Seeking Flat Minima over Diverse Surrogates for Improved Adversarial Transferability: A Theoretical Framework and Algorithmic Instantiation
Meixi Zheng, Kehan Wu, Yanbo Fan, Rui Huang, Baoyuan Wu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2504.16474.
[1057] Training Latent Diffusion Models with Interacting Particle Algorithms
Tim Y. J. Wang, Juan Kuntz, O. Deniz Akyildiz
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2505.12412.
[1058] Diffusion Models with Double Guidance: Generate with aggregated datasets
Yanfeng Yang, Kenji Fukumizu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2505.13213.
[1059] Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation
Seamus Somerstep, Vinod Raman, Unique Subedi, Yuekai Sun
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2505.17288.
[1060] Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction
Alexander Tyurin
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2506.23836.
[1061] Flow IV: Counterfactual Inference In Nonseparable Outcome Models Using Instrumental Variables
Marc Braun, Jose M. Peña, Adel Daoud
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2508.01321.
[1062] Aspects of holographic entanglement using physics-informed-neural-networks
Anirudh Deb, Yaman Sanghavi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2509.25311.
[1063] On some practical challenges of conformal prediction
Liang Hong, Noura Raydan Nasreddine
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.10324.
[1064] The Minimax Lower Bound of Kernel Stein Discrepancy Estimation
Jose Cribeiro-Ramallo, Agnideep Aich, Florian Kalinke, Ashit Baran Aich, Zoltán Szabó
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.15058.
[1065] On the Hardness of Reinforcement Learning with Transition Look-Ahead
Corentin Pla, Hugo Richard, Marc Abeille, Nadav Merlis, Vianney Perchet
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.19372.
[1066] Who Leads? Comparing Human-Centric and Model-Centric Strategies for Defining ML Target Variables
Mengtian Guo, David Gotz, Yue Wang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.25974.
[1067] A Survey of Heterogeneous Graph Neural Networks for Cybersecurity Anomaly Detection
Laura Jiang, Reza Ryan, Qian Li, Nasim Ferdosian
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.26307.
[1068] Statistical Inference for Explainable Boosting Machines
Haimo Fang, Kevin Tan, Jonathan Pipping-Gamon, Giles Hooker
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2601.18857.
[1069] Online monotone density estimation and log-optimal calibration
Rohan Hore, Ruodu Wang, Aaditya Ramdas
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.08927.
[1070] Boltzmann Generators for Condensed Matter via Riemannian Flow Matching
Emil Hoffmann, Maximilian Schebek, Leon Klein, Frank Noé, Jutta Rogal
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.18482.
[1071] Generalizing Fair Top-$k$ Selection: An Integrative Approach
Guangya Cai
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.04689.
[1072] Noise in Photonic Quantum Machine Learning: Models, Impacts, and Mitigation Strategies
A.M.A.S.D. Alagiyawanna, Asoka Karunananda
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.09645.
[1073] Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment
Sihao Ding
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.12681.
[1074] Exploring the Agentic Frontier of Verilog Code Generation
Patrick Yubeaton, Siddharth Garg, Chinmay Hegde
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.19347.
[1075] Combinatorial Privacy: Private Multi-Party Bitstream Grand Sum by Hiding in Birkhoff Polytopes
Praneeth Vepakomma
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.22808.
[1076] ExVerus: Verus Proof Repair via Counterexample Reasoning
Jun Yang, Yuechun Sun, Yi Wu, Rodrigo Caridad, Yongwei Yuan, Jianan Yao, Shan Lu, Kexin Pei
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.25810.
cs.MA
[1077] Decoupling Geometric Planning and Execution in Scalable Multi-Agent Path Finding
Fernando Salanova, Cristian Mahulea, Eduardo Montijano
Main category: cs.MA
TL;DR: Hybrid prioritized framework for Multi-Agent Path Finding that separates geometric planning from execution-time conflict resolution using Geometric Conflict Preemption and Decentralized Local Controller.
Details
Motivation: Existing MAPF solvers rely on time-expanded models and centralized conflict resolution, which limits scalability in large or dense instances. There's a need for more scalable approaches that can handle many agents efficiently.
Method: Two-stage approach: 1) Geometric Conflict Preemption (GCP) plans agents sequentially with A* on the original graph while inflating costs for transitions entering vertices used by higher-priority paths. 2) Decentralized Local Controller (DLC) executes geometric paths using per-vertex FIFO authorization queues and inserts wait actions only when needed to avoid conflicts.
Result: The method scales with an empirically near-linear runtime trend up to 1000 agents, achieves a 100% success rate on geometrically feasible instances, reduces synchronization-induced waiting on bottleneck-heavy maps, and often improves sum-of-costs.
Conclusion: The hybrid prioritized framework provides a scalable solution for MAPF by separating geometric planning from execution-time conflict resolution, enabling efficient handling of large-scale multi-agent scenarios.
Abstract: Multi-Agent Path Finding (MAPF) requires collision-free trajectories for multiple agents on a shared graph, often with the objective of minimizing the sum-of-costs (SOC). Many optimal and bounded-suboptimal solvers rely on time-expanded models and centralized conflict resolution, which limits scalability in large or dense instances. We propose a hybrid prioritized framework that separates geometric planning from execution-time conflict resolution. In the first stage, Geometric Conflict Preemption (GCP) plans agents sequentially with A* on the original graph while inflating costs for transitions entering vertices used by higher-priority paths, encouraging spatial detours without explicit time reasoning. In the second stage, a Decentralized Local Controller (DLC) executes the geometric paths using per-vertex FIFO authorization queues and inserts wait actions only when required to avoid vertex and edge-swap conflicts. Experiments on standard benchmark maps with up to 1000 agents show that the method scales with an empirically near-linear runtime trend and attains a 100% success rate on instances satisfying the geometric feasibility assumption. On bottleneck-heavy maps, GCP reduces synchronization-induced waiting and often improves SOC.
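The first stage is easy to sketch. Below is a minimal, hypothetical rendition of GCP's cost inflation: plain A* on a grid where entering a vertex reserved by a higher-priority agent's path carries an extra cost (the `penalty` weight, grid, and agent setup are our illustration, not the paper's exact rule).

```python
import heapq

def gcp_plan(grid, start, goal, reserved, penalty=5.0):
    """A*-style search that inflates the cost of entering any vertex
    already used by a higher-priority agent's path (hypothetical
    `penalty` weight; the paper's exact inflation rule may differ)."""
    def h(v):  # Manhattan heuristic, admissible for unit step costs
        return abs(v[0] - goal[0]) + abs(v[1] - goal[1])
    frontier = [(h(start), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, g, v, path = heapq.heappop(frontier)
        if v == goal:
            return path
        for d in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (v[0] + d[0], v[1] + d[1])
            r, c = nxt
            if not (0 <= r < len(grid) and 0 <= c < len(grid[0])) or grid[r][c]:
                continue  # out of bounds or blocked cell
            step = 1.0 + (penalty if nxt in reserved else 0.0)
            ng = g + step
            if ng < best.get(nxt, float("inf")):
                best[nxt] = ng
                heapq.heappush(frontier, (ng + h(nxt), ng, nxt, path + [nxt]))
    return None

# Plan two agents in priority order on an empty 3x5 grid; the second
# agent must cross the first one's reserved row, and pays the penalty
# for exactly one reserved vertex instead of reasoning about time.
grid = [[0] * 5 for _ in range(3)]
p1 = gcp_plan(grid, (1, 0), (1, 4), reserved=set())
p2 = gcp_plan(grid, (0, 0), (2, 4), reserved=set(p1))
```

With the penalty set high enough, lower-priority agents take spatial detours around higher-priority paths, which is the point of preempting conflicts geometrically rather than with explicit time expansion.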
[1078] On the Reliability Limits of LLM-Based Multi-Agent Planning
Ruicheng Ao, Siyang Gao, David Simchi-Levi
Main category: cs.MA
TL;DR: LLM-based multi-agent planning has fundamental reliability limits; centralized decision-making dominates delegated networks, with communication gaps quantified by information measures.
Details
Motivation: The paper aims to understand the fundamental reliability limits of LLM-based multi-agent planning systems, particularly when multiple agents process shared information through limited communication channels and may involve human review.
Method: Models LLM-based multi-agent architecture as finite acyclic decision networks, analyzes them as delegated decision problems, compares to centralized Bayes decision makers, and characterizes communication-induced losses using information-theoretic measures like conditional mutual information and posterior divergence.
Result: Shows that without new exogenous signals, any delegated network is dominated by a centralized Bayes decision maker; communication gaps can be represented as expected posterior divergence, reducing to conditional mutual information under logarithmic loss and expected squared posterior error under Brier score.
Conclusion: There are fundamental reliability limits to LLM-based multi-agent planning, with centralized decision-making being theoretically superior, and communication constraints create quantifiable information losses that characterize these limits.
Abstract: This technical note studies the reliability limits of LLM-based multi-agent planning as a delegated decision problem. We model the LLM-based multi-agent architecture as a finite acyclic decision network in which multiple stages process shared model-context information, communicate through language interfaces with limited capacity, and may invoke human review. We show that, without new exogenous signals, any delegated network is decision-theoretically dominated by a centralized Bayes decision maker with access to the same information. In the common-evidence regime, this implies that optimizing over multi-agent directed acyclic graphs under a finite communication budget can be recast as choosing a budget-constrained stochastic experiment on the shared signal. We also characterize the loss induced by communication and information compression. Under proper scoring rules, the gap between the centralized Bayes value and the value after communication admits an expected posterior divergence representation, which reduces to conditional mutual information under logarithmic loss and to expected squared posterior error under the Brier score. These results characterize the fundamental reliability limits of delegated LLM planning. Experiments with LLMs on a controlled problem set further demonstrate these characterizations.
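The scoring-rule identity quoted in the abstract can be checked numerically on a toy model. The sketch below (all distributions invented for illustration) compares the expected log score of predicting a state theta from a full signal s versus a coarsened message m, and confirms that the gap equals the expected posterior KL divergence, as the logarithmic-loss case of the paper's representation suggests.

```python
import math
from itertools import product

# Toy joint model: theta in {0,1}, signal s in {0,1,2}, and a lossy
# message m = min(s, 1) that merges signals 1 and 2 (a communication
# bottleneck). All probabilities are made up for illustration.
p_theta = {0: 0.5, 1: 0.5}
p_s_given_theta = {0: [0.6, 0.3, 0.1], 1: [0.1, 0.3, 0.6]}
coarsen = lambda s: min(s, 1)

joint = {(t, s): p_theta[t] * p_s_given_theta[t][s]
         for t, s in product((0, 1), range(3))}
p_s = {s: sum(joint[(t, s)] for t in (0, 1)) for s in range(3)}
post_s = {s: {t: joint[(t, s)] / p_s[s] for t in (0, 1)} for s in range(3)}
p_m = {m: sum(p_s[s] for s in range(3) if coarsen(s) == m) for m in (0, 1)}
post_m = {m: {t: sum(joint[(t, s)] for s in range(3) if coarsen(s) == m) / p_m[m]
              for t in (0, 1)} for m in (0, 1)}

# Expected log score (natural log) when predicting theta from s vs from m.
score_s = sum(joint[(t, s)] * math.log(post_s[s][t]) for t, s in joint)
score_m = sum(joint[(t, s)] * math.log(post_m[coarsen(s)][t]) for t, s in joint)
gap = score_s - score_m

# Same gap written as expected posterior divergence KL(p(.|s) || p(.|m)).
exp_kl = sum(p_s[s] * sum(post_s[s][t] * math.log(post_s[s][t] / post_m[coarsen(s)][t])
                          for t in (0, 1))
             for s in range(3))
```

The gap is strictly positive here because the message merges two informative signals, matching the claim that any delegated network operating on compressed communication is dominated by a centralized Bayes decision maker with the full signal.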
[1079] GUIDE: Guided Updates for In-context Decision Evolution in LLM-Driven Spacecraft Operations
Alejandro Carrasco, Mariko Storey-Matsutani, Victor Rodriguez-Fernandez, Richard Linares
Main category: cs.MA
TL;DR: GUIDE is a non-parametric policy improvement framework for LLM-based spacecraft control that evolves natural-language decision rules across episodes without weight updates, outperforming static baselines in orbital interception tasks.
Details
Motivation: Current LLM-based approaches for spacecraft operations use static prompting and don't improve across repeated executions, limiting their adaptability and performance in complex, dynamic space environments.
Method: GUIDE uses a lightweight acting model for real-time control and offline reflection to update a structured, state-conditioned playbook of natural-language decision rules, enabling cross-episode adaptation without weight updates.
Result: Evaluated on adversarial orbital interception in Kerbal Space Program Differential Games, GUIDE’s evolution consistently outperforms static baselines, showing effective policy search over structured decision rules.
Conclusion: Context evolution in LLM agents functions as policy search over structured decision rules, enabling real-time closed-loop spacecraft interaction without model retraining.
Abstract: Large language models (LLMs) have been proposed as supervisory agents for spacecraft operations, but existing approaches rely on static prompting and do not improve across repeated executions. We introduce \textsc{GUIDE}, a non-parametric policy improvement framework that enables cross-episode adaptation without weight updates by evolving a structured, state-conditioned playbook of natural-language decision rules. A lightweight acting model performs real-time control, while offline reflection updates the playbook from prior trajectories. Evaluated on an adversarial orbital interception task in the Kerbal Space Program Differential Games environment, GUIDE’s evolution consistently outperforms static baselines. Results indicate that context evolution in LLM agents functions as policy search over structured decision rules in real-time closed-loop spacecraft interaction.
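The non-parametric adaptation loop can be caricatured in a few lines. The sketch below is our own toy rendition, not GUIDE's actual implementation: a state-conditioned playbook keeps, per state bucket, the natural-language rule associated with the best episode reward seen so far, and the acting model is simply handed the current rule as context.

```python
# Hypothetical sketch of a state-conditioned playbook evolved across
# episodes without weight updates (names and the greedy keep-the-best
# update rule are our assumptions, not GUIDE's actual mechanism).
playbook = {}  # state bucket -> (rule text, best reward seen)

def reflect(episode):
    """Offline reflection: for each state bucket touched in the episode,
    keep the rule that led to the highest reward observed so far."""
    for bucket, rule, reward in episode:
        best = playbook.get(bucket)
        if best is None or reward > best[1]:
            playbook[bucket] = (rule, reward)

def act_prompt(bucket):
    """Context handed to the lightweight acting model: the current best
    rule for this bucket, if any."""
    rule = playbook.get(bucket)
    return f"Rule for {bucket}: {rule[0]}" if rule else \
           f"No rule for {bucket}; act conservatively."

# Two trials in the same state bucket; reflection keeps the better rule.
reflect([("closing_distance", "burn retrograde early", 0.4),
         ("closing_distance", "delay burn until 2 km", 0.7)])
```

Because only the playbook text changes between episodes, this is policy search over structured decision rules rather than gradient-based learning, which is the framing the paper argues for.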
[1080] Toward Reliable Evaluation of LLM-Based Financial Multi-Agent Systems: Taxonomy, Coordination Primacy, and Cost Awareness
Phat Nguyen, Thang Pham
Main category: cs.MA
TL;DR: Survey paper analyzing multi-agent LLM systems for financial trading, proposing a taxonomy, the Coordination Primacy Hypothesis, and highlighting evaluation failures in the field.
Details
Motivation: The field of multi-agent LLM systems for financial trading lacks shared frameworks for understanding performance drivers and credible evaluation methods, leading to unreliable claims about system effectiveness.
Method: 1) Developed four-dimensional taxonomy covering architecture patterns, coordination mechanisms, memory architecture, and tool integration applied to 12 multi-agent systems and 2 single-agent baselines. 2) Formulated Coordination Primacy Hypothesis (CPH) as a falsifiable research hypothesis. 3) Documented five pervasive evaluation failures and introduced Coordination Breakeven Spread (CBS) metric.
Result: The survey provides a structured taxonomy for analyzing multi-agent trading systems, identifies critical evaluation failures that can reverse reported returns, and proposes CPH as a framework for future research validation.
Conclusion: The field needs improved evaluation infrastructure and standards to validate claims about multi-agent trading systems, with coordination protocol design potentially being more important than model scaling for trading decision quality.
Abstract: Multi-agent systems based on large language models (LLMs) for financial trading have grown rapidly since 2023, yet the field lacks a shared framework for understanding what drives performance or for evaluating claims credibly. This survey makes three contributions. First, we introduce a four-dimensional taxonomy, covering architecture pattern, coordination mechanism, memory architecture, and tool integration, applied to 12 multi-agent systems and two single-agent baselines. Second, we formulate the Coordination Primacy Hypothesis (CPH): inter-agent coordination protocol design is a primary driver of trading decision quality, often exerting greater influence than model scaling. CPH is presented as a falsifiable research hypothesis supported by tiered structural evidence rather than as an empirically validated conclusion; its definitive validation requires evaluation infrastructure that does not yet exist in the field. Third, we document five pervasive evaluation failures (look-ahead bias, survivorship bias, backtesting overfitting, transaction cost neglect, and regime-shift blindness) and show that these can reverse the sign of reported returns. Building on the CPH and the evaluation critique, we introduce the Coordination Breakeven Spread (CBS), a metric for determining whether multi-agent coordination adds genuine value net of transaction costs, and propose minimum evaluation standards as prerequisites for validating the CPH.
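The abstract does not give CBS's formula, but one plausible reading of "value net of transaction costs" can be sketched as a simple spread. The function and numbers below are our hypothetical illustration, not the paper's definition.

```python
def coordination_breakeven_spread(gross_multi, gross_single, extra_cost):
    """Hypothetical reading of CBS (the paper's exact formula is not in
    the abstract): the gross return spread of the multi-agent system
    over a single-agent baseline, minus the additional transaction
    costs coordination incurs. Positive means coordination adds
    genuine value net of costs."""
    return (gross_multi - gross_single) - extra_cost

# Illustrative numbers: the multi-agent system earns 2 percentage
# points more gross, but its extra turnover costs 1.5 points.
cbs = coordination_breakeven_spread(gross_multi=0.08,
                                    gross_single=0.06,
                                    extra_cost=0.015)
```

The point the survey makes is precisely that without such a net-of-costs check, evaluation failures like transaction cost neglect can flip the sign of reported returns.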
[1081] Sci-Mind: Cognitively-Inspired Adversarial Debate for Autonomous Mathematical Modeling
Ruiying Sun, Wenjing Wang, Qinhan Chen, Yanhui Song, Huangwei Chen, Haotong Luan, Junhao Jia
Main category: cs.MA
TL;DR: Sci-Mind is a framework for autonomous scientific modeling that mimics human scientific discovery by integrating experiential memory recall, adversarial cognitive dialectic between theorist and pragmatist agents, and self-validating execution strategies.
Details
Motivation: Current autonomous agents powered by LLMs rely on isolated reasoning and often generate plausible but flawed models due to lack of domain grounding and adversarial verification, unlike real-world scientific modeling which is experiential and collaborative.
Method: Three key components: 1) Experiential Memory Recall retrieves executable code snippets and modeling paradigms; 2) Adversarial Cognitive Dialectic pits a Theorist (mathematical coherence) against a Pragmatist (data feasibility) in debate; 3) Self-Validating Execution Strategy ensures blueprint consistency through formal predicates before code generation.
Result: Sci-Mind significantly outperforms leading autonomous agents on MM-Bench and EngiBench benchmarks in both modeling rigorousness and code executability.
Conclusion: The framework successfully mirrors human scientific discovery processes, addressing limitations of current LLM-based autonomous agents by incorporating experiential grounding, adversarial verification, and formal validation.
Abstract: Real-world mathematical modeling is inherently an experiential and collaborative endeavor. Domain experts rarely solve complex problems from scratch; instead, they draw upon analogies from historical cases and subject their hypotheses to rigorous peer scrutiny. However, autonomous agents powered by Large Language Models predominantly rely on isolated reasoning paradigms, frequently generating plausible but fundamentally flawed models due to a lack of domain grounding and adversarial verification. To address these limitations, we propose Sci-Mind, a novel framework that mirrors the human scientific discovery process. Sci-Mind integrates Experiential Memory Recall to retrieve executable code snippets and modeling paradigm descriptors, grounding abstract reasoning in historical solutions. Subsequently, it employs an Adversarial Cognitive Dialectic where a Theorist optimizing mathematical coherence and a Pragmatist enforcing data feasibility debate through competing objectives to prune elegant but infeasible formulations. A Self-Validating Execution Strategy further ensures blueprint consistency through formal predicates before code generation, achieving fully autonomous execution. Extensive experiments on the MM-Bench and EngiBench benchmarks demonstrate that Sci-Mind significantly outperforms leading autonomous agents in both modeling rigorousness and code executability.
[1082] Emergent Social Intelligence Risks in Generative Multi-Agent Systems
Yue Huang, Yu Jiang, Wenjie Wang, Haomin Zhuang, Xiaonan Luo, Yuchen Ma, Zhangchen Xu, Zichen Chen, Nuno Moniz, Zinan Lin, Pin-Yu Chen, Nitesh V Chawla, Nouha Dziri, Huan Sun, Xiangliang Zhang
Main category: cs.MA
TL;DR: Multi-agent systems with large generative models exhibit emergent social risks like collusion and conformity when competing for shared resources or collaborating sequentially, mirroring human societal pathologies despite no explicit instruction.
Details
Motivation: As multi-agent systems with large generative models move from prototypes to real-world deployments, understanding emergent collective risks is critical since these systems exhibit failure modes that cannot be reduced to individual agents.
Method: Pioneer study examining emergent multi-agent risks in workflows involving competition over shared resources, sequential handoff collaboration, collective decision aggregation, and other interaction patterns across repeated trials and various conditions.
Result: Group behaviors like collusion-like coordination and conformity emerge frequently across repeated trials under realistic constraints, mirroring human societal pathologies despite no explicit instruction, and cannot be prevented by existing agent-level safeguards.
Conclusion: Multi-agent systems exhibit “social intelligence risk” where agent collectives spontaneously reproduce familiar failure patterns from human societies, exposing the dark side of intelligent multi-agent systems.
Abstract: Multi-agent systems composed of large generative models are rapidly moving from laboratory prototypes to real-world deployments, where they jointly plan, negotiate, and allocate shared resources to solve complex tasks. While such systems promise unprecedented scalability and autonomy, their collective interaction also gives rise to failure modes that cannot be reduced to individual agents. Understanding these emergent risks is therefore critical. Here, we present a pioneer study of such emergent multi-agent risk in workflows that involve competition over shared resources (e.g., computing resources or market share), sequential handoff collaboration (where downstream agents see only predecessor outputs), collective decision aggregation, and others. Across these settings, we observe that such group behaviors arise frequently across repeated trials and a wide range of interaction conditions, rather than as rare or pathological cases. In particular, phenomena such as collusion-like coordination and conformity emerge with non-trivial frequency under realistic resource constraints, communication protocols, and role assignments, mirroring well-known pathologies in human societies despite no explicit instruction. Moreover, these risks cannot be prevented by existing agent-level safeguards alone. These findings expose the dark side of intelligent multi-agent systems: a social intelligence risk where agent collectives, despite no instruction to do so, spontaneously reproduce familiar failure patterns from human societies.
[1083] Persona Alchemy: Designing, Evaluating, and Implementing Psychologically-Grounded LLM Agents for Diverse Stakeholder Representation
Sola Kim, Dongjune Chang, Jieshu Wang
Main category: cs.MA
TL;DR: A Social Cognitive Theory framework for designing psychologically grounded LLM personas with consistent behavior across cognitive, motivational, biological, and affective factors, validated in polarized discourse scenarios.
Details
Motivation: Current LLM persona designs lack alignment with human cognitive processes and fail to adequately represent diverse stakeholder perspectives, necessitating a more psychologically grounded framework.
Method: Social Cognitive Theory framework with four personal factors for design, six quantifiable constructs for evaluation, and graph database architecture for implementing stakeholder personas, tested in renewable energy discourse with five diverse agents.
Result: Agents showed consistent response patterns (R²: 0.58-0.61), systematic temporal development of SCT constructs, and PCA identified two dimensions explaining 73% of variance, validating theoretical structure.
Conclusion: The SCT framework improves explainability and reproducibility over black-box approaches, contributing to better stakeholder representation while maintaining psychological consistency in LLM personas.
Abstract: Despite advances in designing personas for Large Language Models (LLM), challenges remain in aligning them with human cognitive processes and representing diverse stakeholder perspectives. We introduce a Social Cognitive Theory (SCT) agent design framework for designing, evaluating, and implementing psychologically grounded LLMs with consistent behavior. Our framework operationalizes SCT through four personal factors (cognitive, motivational, biological, and affective) for designing, six quantifiable constructs for evaluating, and a graph database-backed architecture for implementing stakeholder personas. Experiments tested agents’ responses to contradicting information of varying reliability. In the highly polarized renewable energy transition discourse, we design five diverse agents with distinct ideologies, roles, and stakes to examine stakeholder representation. The evaluation of these agents in contradictory scenarios occurs through comprehensive processes that implement the SCT. Results show consistent response patterns ($R^2$ range: $0.58-0.61$) and systematic temporal development of SCT construct effects. Principal component analysis identifies two dimensions explaining $73$% of variance, validating the theoretical structure. Our framework offers improved explainability and reproducibility compared to black-box approaches. This work contributes to ongoing efforts to improve diverse stakeholder representation while maintaining psychological consistency in LLM personas.
[1084] FUAS-Agents: Autonomous Multi-Modal LLM Agents for Treatment Planning in Focused Ultrasound Ablation Surgery
Lina Zhao, Zihao Bian, Qingyue Chen, Yafang Li, Zhiyi Luo, Jiaxing Bai, Guangbo Li, Min He, Kezhi Li, Huaiyuan Yao, Zongjiu Zhang
Main category: cs.MA
TL;DR: LLM-powered autonomous agent system for Focused Ultrasound Ablation Surgery that integrates multimodal medical data and specialized AI tools to generate personalized treatment plans.
Details
Motivation: FUAS clinical implementation requires complex multimodal image interpretation, personalized dose planning, and real-time decision-making that needs intelligent assistance to improve efficiency and reliability.
Method: FUAS-Agents system leverages LLMs’ multimodal understanding and tool-using capabilities, integrating patient profiles and MRI data with specialized medical AI tools (segmentation, dose prediction, guideline retrieval) and includes internal quality control mechanisms.
Result: In uterine fibroid treatment evaluation, expert ratings showed 82.5% completeness, 82.5% accuracy, 87.5% fluency, and 97.5% clinical compliance scores of 4+ on 5-point scale; ablation studies validated component contributions.
Conclusion: LLM-driven agents can enhance decision-making in complex clinical workflows, demonstrating a translational paradigm combining general-purpose models with specialized expert systems for healthcare applications.
Abstract: Focused Ultrasound Ablation Surgery (FUAS) has emerged as a promising non-invasive therapeutic modality, valued for its safety and precision. Nevertheless, its clinical implementation entails intricate tasks such as multimodal image interpretation, personalized dose planning, and real-time intraoperative decision-making processes that demand intelligent assistance to improve efficiency and reliability. We introduce FUAS-Agents, an autonomous agent system that leverages the multimodal understanding and tool-using capabilities of large language models (LLMs). The system was developed using a large-scale, multicenter, multimodal clinical dataset of over 3000 cases from three medical institutions. By integrating patient profiles and MRI data, FUAS-Agents orchestrates a suite of specialized medical AI tools, including segmentation, treatment dose prediction, and clinical guideline retrieval, to generate personalized treatment plans comprising MRI image, dose parameters, and therapeutic strategies. The system also incorporates an internal quality control and reflection mechanism, ensuring consistency and robustness of the outputs. We evaluate the system in a uterine fibroid treatment scenario. Human assessment by four senior FUAS experts indicates that 82.5%, 82.5%, 87.5%, and 97.5% of the generated plans were rated 4 or above (on a 5-point scale) in terms of completeness, accuracy, fluency, and clinical compliance, respectively. In addition, we have conducted ablation studies to systematically examine the contribution of each component to the overall performance. These results demonstrate the potential of LLM-driven agents in enhancing decision-making across complex clinical workflows, and exemplify a translational paradigm that combines general-purpose models with specialized expert systems to solve practical challenges in vertical healthcare domains.
[1085] A Semi Centralized Training Decentralized Execution Architecture for Multi Agent Deep Reinforcement Learning in Traffic Signal Control
Arash Rezaali, Pouria Yazdani, Monireh Abdoos
Main category: cs.MA
TL;DR: SEMI-CTDE architecture for multi-intersection traffic signal control using region-based multi-agent reinforcement learning with centralized training and decentralized execution.
Details
Motivation: Existing traffic signal control approaches suffer from either the curse of dimensionality in fully centralized designs or partial observability and lack of coordination in fully decentralized approaches, motivating a region-based semi-centralized solution.
Method: Semi-Centralized Training, Decentralized Execution (SEMI-CTDE) architecture that partitions networks into regions, performs centralized training within regions with parameter sharing, and uses composite state/reward formulations encoding both local and regional information.
Result: Two implemented SEMI-CTDE-based models achieve consistently superior performance across various traffic densities and distributions compared to rule-based and fully decentralized baselines.
Conclusion: The SEMI-CTDE architecture provides an effective, transferable framework for multi-intersection traffic signal control that balances centralized coordination with decentralized execution.
Abstract: Multi-agent reinforcement learning (MARL) has emerged as a promising paradigm for adaptive traffic signal control (ATSC) of multiple intersections. Existing approaches typically follow either a fully centralized or a fully decentralized design. Fully centralized approaches suffer from the curse of dimensionality, and reliance on a single learning server, whereas purely decentralized approaches operate under severe partial observability and lack explicit coordination resulting in suboptimal performance. These limitations motivate region-based MARL, where the network is partitioned into smaller, tightly coupled intersections that form regions, and training is organized around these regions. This paper introduces a Semi-Centralized Training, Decentralized Execution (SEMI-CTDE) architecture for multi intersection ATSC. Within each region, SEMI-CTDE performs centralized training with regional parameter sharing and employs composite state and reward formulations that jointly encode local and regional information. The architecture is highly transferable across different policy backbones and state-reward instantiations. Building on this architecture, we implement two models with distinct design objectives. A multi-perspective experimental analysis of the two implemented SEMI-CTDE-based models covering ablations of the architecture’s core elements including rule based and fully decentralized baselines shows that they achieve consistently superior performance and remain effective across a wide range of traffic densities and distributions.
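The composite reward idea can be illustrated with a one-liner. The weighting below is our hypothetical instantiation (the paper's exact composite state and reward formulations may differ): each intersection's training signal mixes its own local reward with an aggregate over its region.

```python
def composite_reward(local_rewards, region, i, w=0.3):
    """Hypothetical composite reward for intersection i: a weighted mix
    of its own local reward and the mean reward of its region. The
    weight w (and the mean as the regional aggregate) are assumptions
    for illustration, not SEMI-CTDE's exact formulation."""
    regional = sum(local_rewards[j] for j in region) / len(region)
    return (1 - w) * local_rewards[i] + w * regional

# Local rewards, e.g. negative queue lengths at three intersections
# that form one region.
rewards = {"A": -2.0, "B": -1.0, "C": -4.0}
r_a = composite_reward(rewards, region=("A", "B", "C"), i="A")
```

Mixing in the regional term is what lets decentralized executors retain some coordination pressure learned during the semi-centralized training phase.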
[1086] Evidence-Decision-Feedback: Theory-Driven Adaptive Scaffolding for LLM Agents
Clayton Cohn, Siyuan Guo, Surya Rayala, Hanchen David Wang, Naveeduddin Mohammed, Umesh Timalsina, Shruti Jain, Angela Eeds, Menton Deweese, Pamela J. Osborn Popp, Rebekah Stanton, Shakeera Walker, Meiyi Ma, Gautam Biswas
Main category: cs.MA
TL;DR: EDF framework enables LLM-based pedagogical agents to provide adaptive scaffolding through evidence-based inference, decision-making, and feedback, improving personalized STEM+C problem-solving support.
Details
Motivation: Current LLM-based pedagogical agents operate on a "one-size-fits-all" basis, limiting their ability to provide personalized support for students' knowledge construction and problem-solving skills development.
Method: Introduces Evidence-Decision-Feedback (EDF) framework integrating intelligent tutoring systems and agentic behavior, instantiated through Copa (Collaborative Peer Agent) for STEM+C problem-solving with evidentiary inference, pedagogical decision-making, and adaptive feedback.
Result: EDF-guided interactions in authentic high school classroom study showed alignment of feedback with students’ demonstrated understanding, promotion of scaffold fading, and support for interpretable evidence-grounded explanations without fostering overreliance.
Conclusion: EDF provides a theoretical framework for adaptive scaffolding with LLM agents that effectively personalizes educational support while maintaining interpretability and preventing student overreliance.
Abstract: LLMs offer tremendous opportunity for pedagogical agents to help students construct knowledge and develop problem-solving skills, yet many of these agents operate on a “one-size-fits-all” basis, limiting their ability to personalize support. To address this, we introduce Evidence-Decision-Feedback (EDF), a theoretical framework for adaptive scaffolding with LLM agents. EDF integrates elements of intelligent tutoring systems (ITS) and agentic behavior by organizing interactions around evidentiary inference, pedagogical decision-making, and adaptive feedback. We instantiate EDF through Copa, a Collaborative Peer Agent for STEM+C problem-solving. In an authentic high school classroom study, we show that EDF-guided interactions align feedback with students’ demonstrated understanding and task mastery; promote scaffold fading; and support interpretable, evidence-grounded explanations without fostering overreliance.
[1087] Feedback-Coupled Memory Systems: A Dynamical Model for Adaptive Coordination
Stefano Grassi
Main category: cs.MA
TL;DR: A dynamical framework for adaptive coordination in multi-agent systems using feedback-coupled memory systems, showing coordination emerges from closed-loop interactions between agents, incentives, and environmental memory rather than optimization.
Details
Motivation: To develop a dynamical systems perspective on coordination that moves beyond equilibrium optimization or agent-centric learning approaches, focusing instead on closed-loop interactions between agents, incentives, and persistent environmental memory.
Method: Feedback-Coupled Memory Systems (FCMS) framework with three components: agents that update in response to local incentives, a distributed incentive field transmitting coordination signals, and environmental memory storing accumulated signals. Analyzed using dynamical systems theory and numerical simulations.
Result: Three main theoretical results: 1) bounded forward-invariant region under dissipativity ensuring dynamical viability, 2) coordination cannot be reduced to static optimization when incentives depend on environmental memory, 3) bidirectional coupling is essential. Numerical analysis shows Neimark-Sacker bifurcation at critical coupling threshold, with diverging recovery time and increased variance near threshold as early warning of coordination breakdown.
Conclusion: The FCMS framework provides a dynamical perspective on coordination in complex systems, with potential applications to multi-agent systems, networked interactions, and collective dynamics. The approach reveals fundamental limitations of optimization-based coordination when memory effects are present.
Abstract: This paper develops a dynamical framework for adaptive coordination in systems of interacting agents referred to here as Feedback-Coupled Memory Systems (FCMS). Instead of framing coordination as equilibrium optimization or agent-centric learning, the model describes a closed-loop interaction between agents, incentives, and a persistent environment. The environment stores accumulated coordination signals, a distributed incentive field transmits them locally, and agents update in response, generating a feedback-driven dynamical system. Three main results are established. First, under dissipativity, the closed-loop system admits a bounded forward-invariant region, ensuring dynamical viability independently of global optimality. Second, when incentives depend on persistent environmental memory, coordination cannot be reduced to a static optimization problem. Third, within the FCMS class, coordination requires a bidirectional coupling in which memory-dependent incentives influence agent updates, while agent behavior reshapes the environmental state. Numerical analysis of a minimal specification identifies a Neimark-Sacker bifurcation at a critical coupling threshold ($β_c$), providing a stability boundary for the system. Near the bifurcation threshold, recovery time diverges and variance increases, yielding a computable early warning signature of coordination breakdown in observable time series. Additional simulations confirm robustness under nonlinear saturation and scalability to populations of up to $N = 10^{6}$ agents making it more relevant for real-world applications. The proposed framework offers a dynamical perspective on coordination in complex systems, with potential extensions to multi-agent systems, networked interactions, and macro-level collective dynamics.
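The qualitative picture, a coupling threshold below which coordination signals die out and above which sustained oscillation appears, can be reproduced with a two-variable toy map. The equations below are our minimal instantiation in the spirit of FCMS, not the paper's model: an agent state responds to environmental memory through a saturating incentive, while the memory accumulates the agent's signal and decays.

```python
import math

def simulate(beta, lam=0.1, delta=0.1, steps=3000):
    """Toy FCMS-style closed loop (our illustration, not the paper's
    equations): agent state a is pulled by a saturating incentive
    -beta*tanh(E) derived from memory E; E accumulates a and decays.
    Linearizing at the origin gives eigenvalue modulus sqrt(det) with
    det = (1-lam)(1-delta) + beta, so a Neimark-Sacker-style threshold
    sits at beta_c = 1 - (1-lam)(1-delta) = 0.19 for these defaults."""
    a, E = 0.5, 0.0
    traj = []
    for _ in range(steps):
        # Simultaneous update: both right-hand sides use old (a, E).
        a, E = (1 - lam) * a - beta * math.tanh(E), (1 - delta) * E + a
        traj.append(a)
    return traj

quiet = simulate(beta=0.10)  # below threshold: oscillations decay away
loud = simulate(beta=0.30)   # above threshold: sustained oscillation
```

This mirrors the paper's early-warning story: as the coupling approaches the threshold, transients decay more and more slowly before coordination breaks into persistent oscillation.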
cs.MM
[1088] MAR3: Multi-Agent Recognition, Reasoning, and Reflection for Reference Audio-Visual Segmentation
Yuan Zhao, Zhenqi Jia, Yongqiang Zhang
Main category: cs.MM
TL;DR: MAR3 is a training-free multi-agent framework for Reference Audio-Visual Segmentation that uses LLM agents to recognize expression difficulty and dominant modality, adaptively reason about objects, and iteratively refine segmentation through reflective learning.
Details
Motivation: Previous Ref-AVS methods fail to explicitly recognize expression difficulty and dominant modality in multimodal cues, over-rely on instruction-tuning dataset quality for object reasoning, and lack reflective validation of segmentation results, leading to erroneous mask predictions.
Method: Proposes a training-free Multi-Agent Recognition, Reasoning, and Reflection (MAR3) framework incorporating sociological Delphi theory. Uses Consensus Multimodal Recognition with LLM agents to recognize expression difficulty and dominant modality, adaptive Collaborative Object Reasoning based on modality-dominant difficulty rule, and Reflective Learning Segmentation where a check agent examines and iteratively corrects segmentation results.
Result: Achieves 69.2% J&F score on Ref-AVSBench dataset, outperforming state-of-the-art by 3.4% absolutely.
Conclusion: MAR3 effectively addresses limitations of previous Ref-AVS methods by explicitly recognizing multimodal cue characteristics, adaptively reasoning about objects, and incorporating reflective validation, leading to superior segmentation performance.
Abstract: Reference Audio-Visual Segmentation (Ref-AVS) aims to segment objects in audible videos based on multimodal cues in reference expressions. Previous methods overlook the explicit recognition of expression difficulty and dominant modality in multimodal cues, over-rely on the quality of the instruction-tuning dataset for object reasoning, and lack reflective validation of segmentation results, leading to erroneous mask predictions. To address these issues, in this paper, we propose a novel training-free Multi-Agent Recognition, Reasoning, and Reflection framework to achieve high-quality Reference Audio-Visual Segmentation, termed MAR3. Incorporating the sociological Delphi theory to achieve robust analysis, a Consensus Multimodal Recognition mechanism is proposed that enables LLM agents to explicitly recognize the difficulty of reference expressions and the dominant modality of multimodal cues. Based on our modality-dominant difficulty rule, we propose an adaptive Collaborative Object Reasoning strategy to reliably reason about the referred object. To further ensure precise mask prediction, we develop a Reflective Learning Segmentation mechanism, in which a check agent examines intermediate segmentation results and iteratively corrects the object text prompt of the segment agent. Experiments demonstrate that MAR3 achieves superior performance (69.2% in J&F) on the Ref-AVSBench dataset, outperforming SOTA by 3.4% absolutely.
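The Reflective Learning Segmentation mechanism is essentially a critique-and-retry loop between two agents. A minimal sketch of that control flow, with placeholder agent interfaces rather than the authors' actual API:

```python
def reflective_segmentation(frame, audio, prompt, segment_agent, check_agent,
                            max_rounds=3):
    """Iteratively refine a mask: the check agent inspects each result and,
    on failure, rewrites the segment agent's object text prompt."""
    mask = None
    for _ in range(max_rounds):
        mask = segment_agent(frame, prompt)            # propose a mask
        verdict, revised = check_agent(frame, audio, mask, prompt)
        if verdict == "pass":                          # check agent accepts
            return mask
        prompt = revised                               # retry with corrected prompt
    return mask                                        # best effort after budget
```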
[1089] Is One-Shot In-Context Learning Helpful for Data Selection in Task-Specific Fine-Tuning of Multimodal LLMs?
Xiao An, Jiaxing Sun, Ting Hu, Wei He
Main category: cs.MM
TL;DR: CLIPPER is a training-free data selection pipeline for multimodal large language models that identifies optimal coresets to match full fine-tuning performance with significantly reduced computational costs.
Details
Motivation: Current methods for injecting world knowledge into MLLMs through task-specific fine-tuning face scalability challenges as datasets grow, requiring trade-offs between performance and computational overhead. Existing data selection approaches fail to balance both data importance and diversity while overlooking inter-sample relationships.
Method: CLIPPER separates parameter and world knowledge, uses in-context learning to probe model responses to different demonstration-query combinations, and identifies coresets that mirror the original dataset’s perplexity distribution to preserve critical samples while maintaining diversity.
Result: Experiments on two MLLMs (Qwen2.5-VL-7B and Llama-3.2-11B-Vision-Instruct) across three datasets show CLIPPER achieves 47% data efficiency on VRSBench and reduces ScienceQA training time by 37% while matching full fine-tuning performance.
Conclusion: CLIPPER provides an effective training-free data selection pipeline for MLLMs that addresses scalability challenges in domain adaptation, enabling efficient knowledge injection while maintaining performance comparable to full fine-tuning.
Abstract: Injecting world knowledge into pretrained multimodal large language models (MLLMs) is essential for domain-specific applications. Task-specific fine-tuning achieves this by tailoring MLLMs to high-quality in-domain data but encounters scalability challenges as datasets grow, necessitating a trade-off between performance and computational overhead. Existing data selection methods rely on additional scoring models or heuristic clustering, failing to balance both data importance and diversity. Moreover, both methods overlook the interplay among training samples. To address these limitations, we propose CLIPPER, a training-free data selection pipeline that separates parameter and world knowledge, and leverages in-context learning to probe model responses to different demonstration-query combinations. CLIPPER identifies coresets that mirror the original dataset’s perplexity distribution, preserving critical samples while maintaining diversity. Experiments on two MLLMs and three datasets show that CLIPPER matches full fine-tuning performance with significantly lower costs: Qwen2.5-VL-7B attains 47% data efficiency on VRSBench, and Llama-3.2-11B-Vision-Instruct reduces ScienceQA training time by 37%.
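One simple way to make a coreset "mirror the original dataset's perplexity distribution" is stratified sampling over perplexity bins. This is a hypothetical simplification of CLIPPER's selection criterion, not the paper's algorithm:

```python
import random

def perplexity_matched_coreset(perplexities, k, n_bins=10, seed=0):
    """Pick k sample indices whose perplexity histogram approximately
    matches the full dataset's, via proportional allocation per bin."""
    rng = random.Random(seed)
    lo, hi = min(perplexities), max(perplexities)
    width = (hi - lo) / n_bins or 1.0          # guard against all-equal input
    bins = [[] for _ in range(n_bins)]
    for i, p in enumerate(perplexities):
        b = min(int((p - lo) / width), n_bins - 1)
        bins[b].append(i)
    chosen = []
    for b in bins:                             # proportional share of k per bin
        take = round(k * len(b) / len(perplexities))
        chosen.extend(rng.sample(b, min(take, len(b))))
    return chosen[:k]
```

Selecting this way keeps both high- and low-perplexity samples in the coreset, preserving diversity rather than just picking the hardest examples.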
eess.AS
[1090] HASS: Hierarchical Simulation of Logopenic Aphasic Speech for Scalable PPA Detection
Harrison Li, Kevin Wang, Cheol Jun Cho, Jiachen Lian, Rabab Rangwala, Chenxu Guo, Emma Yang, Lynn Kurteff, Zoe Ezzes, Willa Keegan-Rodewald, Jet Vonk, Siddarth Ramkrishnan, Giada Antonicelli, Zachary Miller, Marilu Gorno Tempini, Gopala Anumanchipalli
Main category: eess.AS
TL;DR: HASS is a clinically grounded simulation framework that generates synthetic training data for logopenic variant Primary Progressive Aphasia (lvPPA) by modeling semantic, phonological, and temporal deficits at varying severity levels to improve diagnosis model accuracy and generalizability.
Details
Motivation: Primary Progressive Aphasia (PPA) diagnosis models face data scarcity due to vulnerable clinical populations and expensive expert labeling. Existing approaches simulate isolated dysfluencies but fail to capture holistic, multi-level PPA phenotypes needed for accurate diagnosis.
Method: Proposes the Hierarchical Aphasic Speech Simulation (HASS) framework that systematically simulates lvPPA behaviors by modeling semantic, phonological, and temporal deficits identified by clinical experts, with varying severity levels to generate comprehensive synthetic training data.
Result: The HASS framework enables more accurate and generalizable detection models for lvPPA compared to previous approaches that only simulated isolated dysfluencies.
Conclusion: Clinically grounded simulation of multi-level speech deficits provides effective synthetic training data for PPA diagnosis models, addressing data scarcity while capturing comprehensive pathological speech patterns.
Abstract: Building a diagnosis model for primary progressive aphasia (PPA) has been challenging due to data scarcity. Collecting clinical data at scale is limited by the high vulnerability of the clinical population and the high cost of expert labeling. To circumvent this, previous studies simulate dysfluent speech to generate training data. However, those approaches are not comprehensive enough to simulate PPA as a holistic, multi-level phenotype, instead relying on isolated dysfluencies. To address this, we propose a novel, clinically grounded simulation framework, Hierarchical Aphasic Speech Simulation (HASS). HASS aims to simulate behaviors of the logopenic variant of PPA (lvPPA) with varying degrees of severity. To this end, semantic, phonological, and temporal deficits of lvPPA are systematically identified by clinical experts and simulated. We demonstrate that our framework enables more accurate and generalizable detection models.
[1091] Dual-branch Graph Domain Adaptation for Cross-scenario Multi-modal Emotion Recognition
Yuntao Shou, Jun Zhou, Tao Meng, Wei Ai, Keqin Li
Main category: eess.AS
TL;DR: DGDA is a dual-branch graph domain adaptation framework for multimodal emotion recognition in conversations that addresses cross-scenario domain shifts and label noise through emotion interaction graphs, hypergraph/path neural networks, and adversarial domain adaptation.
Details
Motivation: Existing multimodal emotion recognition in conversation (MERC) methods fail to handle cross-scenario variations (different speakers, topics, styles, noise levels), limiting model transferability to unseen domains. There's a need to jointly address domain shift and label noise in real-world conversational settings.
Method: 1) Construct an emotion interaction graph to model emotional dependencies among utterances; 2) Dual-branch encoder with a hypergraph neural network (HGNN) for explicit multivariate relationships and a path neural network (PathNN) for implicit global dependencies; 3) Domain adversarial discriminator for invariant cross-domain representations; 4) Regularization loss to suppress the influence of noisy labels.
Result: DGDA outperforms strong baselines on IEMOCAP and MELD datasets, demonstrates better adaptation to cross-scenario conversations, and provides theoretical analysis with tighter generalization bounds.
Conclusion: DGDA is the first MERC framework to jointly address domain shift and label noise, enabling effective cross-scenario emotion recognition through graph-based modeling and domain adaptation techniques.
Abstract: Multimodal Emotion Recognition in Conversations (MERC) aims to predict speakers’ emotional states in multi-turn dialogues through text, audio, and visual cues. In real-world settings, conversation scenarios differ significantly in speakers, topics, styles, and noise levels. Existing MERC methods generally neglect these cross-scenario variations, limiting their ability to transfer models trained on a source domain to unseen target domains. To address this issue, we propose a Dual-branch Graph Domain Adaptation framework (DGDA) for multimodal emotion recognition under cross-scenario conditions. We first construct an emotion interaction graph to characterize complex emotional dependencies among utterances. A dual-branch encoder, consisting of a hypergraph neural network (HGNN) and a path neural network (PathNN), is then designed to explicitly model multivariate relationships and implicitly capture global dependencies. To enable out-of-domain generalization, a domain adversarial discriminator is introduced to learn invariant representations across domains. Furthermore, a regularization loss is incorporated to suppress the negative influence of noisy labels. To the best of our knowledge, DGDA is the first MERC framework that jointly addresses domain shift and label noise. Theoretical analysis provides tighter generalization bounds, and extensive experiments on IEMOCAP and MELD demonstrate that DGDA consistently outperforms strong baselines and better adapts to cross-scenario conversations. Our code is available at https://github.com/Xudmm1239439/DGDA-Net.
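Domain adversarial discriminators of the kind DGDA uses are conventionally trained through a gradient reversal layer (GRL): identity in the forward pass, sign-flipped gradient in the backward pass, so the encoder is pushed to confuse the discriminator. A minimal numeric sketch of that behavior (not the authors' code, which in practice lives inside an autodiff framework):

```python
def grl_forward(x):
    """Forward pass of a gradient reversal layer: plain identity."""
    return x

def grl_backward(grad_output, lam=1.0):
    """Backward pass: flip and scale the gradient, so minimizing the
    discriminator's loss maximizes domain confusion for the encoder."""
    return -lam * grad_output
```

The scale `lam` trades off emotion classification accuracy against domain invariance of the learned representations.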
[1092] PHONOS: PHOnetic Neutralization for Online Streaming Applications
Waris Quamer, Mu-Ruei Tseng, Ghady Nasrallah, Ricardo Gutierrez-Osuna
Main category: eess.AS
TL;DR: PHONOS is a real-time speaker anonymization system that neutralizes non-native accents to sound native-like while preserving timbre, addressing privacy concerns where accents can identify speakers.
Details
Motivation: Current speaker anonymization systems modify timbre but leave regional/non-native accents intact, which is problematic because accents can narrow the anonymity set and compromise privacy by making speakers more identifiable.
Method: Pre-generates golden speaker utterances preserving source timbre/rhythm but replacing foreign segmentals with native ones using silence-aware DTW alignment and zero-shot voice conversion. These supervise a causal accent translator mapping non-native content tokens to native equivalents with ≤40ms look-ahead, trained with joint cross-entropy and CTC losses.
Result: Achieves an 81% reduction in non-native accent confidence; listening-test ratings confirm the accent shift, and speaker linkability is reduced as accent-neutralized utterances move away from the original speaker in embedding space, with latency under 241 ms on a single GPU.
Conclusion: PHONOS effectively addresses accent-based privacy vulnerabilities in speaker anonymization by neutralizing non-native accents in real-time while maintaining low latency, improving speaker privacy protection.
Abstract: Speaker anonymization (SA) systems modify timbre while leaving regional or non-native accents intact, which is problematic because accents can narrow the anonymity set. To address this issue, we present PHONOS, a streaming module for real-time SA that neutralizes non-native accents to sound native-like. Our approach pre-generates golden speaker utterances that preserve source timbre and rhythm but replace foreign segmentals with native ones using silence-aware DTW alignment and zero-shot voice conversion. These utterances supervise a causal accent translator that maps non-native content tokens to native equivalents with at most 40ms look-ahead, trained using joint cross-entropy and CTC losses. Our evaluations show an 81% reduction in non-native accent confidence, with listening-test ratings consistent with this shift, and reduced speaker linkability as accent-neutralized utterances move away from the original speaker in embedding space, while keeping latency under 241 ms on a single GPU.
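The silence-aware DTW alignment builds on standard dynamic time warping. Below is plain DTW over scalar features as an orientation sketch; the silence-masking refinement the paper adds is omitted:

```python
def dtw_cost(a, b):
    """Dynamic-time-warping alignment cost between two sequences,
    using absolute difference as the per-frame cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

Because DTW tolerates local stretching, it can align a non-native utterance against a native-paced "golden speaker" rendition despite differing rhythm.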
[1093] SHroom: A Python Framework for Ambisonics Room Acoustics Simulation and Binaural Rendering
Yhonatan Gayer
Main category: eess.AS
TL;DR: SHROOM is an open-source Python library for room acoustics simulation using Ambisonics that projects image-source contributions onto Spherical Harmonics basis for binaural decoding, spherical array simulation, and real-time head rotation.
Details
Motivation: There's a need for efficient room acoustics simulation tools that can handle binaural decoding and real-time head rotation while maintaining perceptual transparency and computational efficiency.
Method: Projects image-source contributions onto a Spherical Harmonics basis, uses Magnitude Least Squares (MagLS) for decoding, implements Wigner-D multiplication for dynamic head rotation, and creates a composable pipeline for various acoustic simulations.
Result: Achieves perceptual transparency with 2.02 dB Log Spectral Distance at N=5 (within the 1-2 dB JND), amortizes decoding over multiple sources (slowdown narrows from 7x to 3.1x), and handles real-time head rotation at <1 ms/frame.
Conclusion: SHROOM provides an architecturally viable real-time solution for room acoustics simulation with Ambisonics, offering perceptual transparency and computational efficiency for binaural decoding and dynamic head rotation applications.
Abstract: We present \textbf{shroom} (Spherical Harmonics ROOM), an open-source Python library for room acoustics simulation using Ambisonics, available at https://github.com/Yhonatangayer/shroom and installable via \texttt{pip install pyshroom}. \textbf{shroom} projects image-source contributions onto a Spherical Harmonics (SH) basis, yielding a composable pipeline for binaural decoding, spherical array simulation, and real-time head rotation. Benchmarked against \texttt{pyroomacoustics} with an $N=30$ reference, \textbf{shroom} with Magnitude Least Squares (MagLS) achieves perceptual transparency (2.02 dB Log Spectral Distance (LSD) at $N=5$, within the 1–2 dB Just Noticeable Difference (JND)) while its fixed-once decode amortises over multiple sources ($K=1$-to-$8$: slowdown narrows from $7\times$ to $3.1\times$). For dynamic head rotation, \textbf{shroom} applies a Wigner-D multiply at $<1$~ms/frame, making it the only architecturally viable real-time choice.
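For orientation on the Ambisonics orders discussed here: an order-$N$ signal has $(N+1)^2$ channels, and first-order encoding of a point source follows the standard ACN/SN3D convention. A small sketch, independent of the shroom API:

```python
import math

def num_channels(order):
    """Channel count of an Ambisonics signal of the given order."""
    return (order + 1) ** 2        # 4 for first order, 16 for third order

def foa_encode(sample, azimuth, elevation):
    """Encode a mono sample into first-order Ambisonics (ACN order,
    SN3D normalization); angles in radians."""
    w = sample                                             # ACN 0: omni
    y = sample * math.sin(azimuth) * math.cos(elevation)   # ACN 1
    z = sample * math.sin(elevation)                       # ACN 2
    x = sample * math.cos(azimuth) * math.cos(elevation)   # ACN 3
    return [w, y, z, x]
```

The jump from 4 to 16 channels is exactly the spatial-resolution gap the library's image-source SH projection addresses at higher orders.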
[1094] BiFormer3D: Grid-Free Time-Domain Reconstruction of Head-Related Impulse Responses with a Spatially Encoded Transformer
Shaoheng Xu, Chunyi Sun, Jihui Zhang, Amy Bastine, Prasanga N. Samarasinghe, Thushara D. Abhayapala, Hongdong Li
Main category: eess.AS
TL;DR: BiFormer3D: A time-domain, grid-free Transformer for reconstructing individualized HRIRs at arbitrary directions from sparse measurements, improving spatial audio fidelity without minimum-phase assumptions.
Details
Motivation: Individualized HRIRs are essential for realistic binaural audio rendering but require dense measurements that are costly and impractical. Existing methods have limitations in temporal fidelity and spatial continuity, and rely on restrictive assumptions like minimum-phase processing.
Method: Proposes BiFormer3D, a time-domain Transformer architecture that uses sinusoidal spatial features, Conv1D refinement, and auxiliary ITD/ILD prediction heads to reconstruct HRIRs at arbitrary directions from sparse input measurements without grid constraints.
Result: Outperforms prior methods on SONICOM dataset with improvements in NMSE, cosine distance, and ITD/ILD errors. Ablation studies validate architectural components and show minimum-phase preprocessing is unnecessary.
Conclusion: BiFormer3D provides an effective grid-free solution for HRIR spatial upsampling that preserves temporal fidelity and spatial continuity while eliminating the need for restrictive preprocessing assumptions.
Abstract: Individualized head-related impulse responses (HRIRs) enable binaural rendering, but dense per-listener measurements are costly. We address HRIR spatial up-sampling from sparse per-listener measurements: given a few measured HRIRs for a listener, predict HRIRs at unmeasured target directions. Prior learning methods often work in the frequency domain, rely on minimum-phase assumptions or separate timing models, and use a fixed direction grid, which can degrade temporal fidelity and spatial continuity. We propose BiFormer3D, a time-domain, grid-free binaural Transformer for reconstructing HRIRs at arbitrary directions from sparse inputs. It uses sinusoidal spatial features, a Conv1D refinement module, and auxiliary interaural time difference (ITD) and interaural level difference (ILD) heads. On SONICOM, it improves normalized mean squared error (NMSE), cosine distance, and ITD/ILD errors over prior methods; ablations validate modules and show minimum-phase pre-processing is unnecessary.
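The "sinusoidal spatial features" are presumably a multi-frequency sin/cos encoding of the target direction, in the spirit of Transformer positional encodings. A hypothetical sketch (the paper's exact parameterization may differ):

```python
import math

def sinusoidal_direction_features(azimuth, elevation, n_freqs=4):
    """Encode an (azimuth, elevation) direction, in radians, as sin/cos
    features at doubling frequencies: a grid-free spatial conditioning."""
    feats = []
    for angle in (azimuth, elevation):
        for k in range(n_freqs):
            f = 2 ** k
            feats.append(math.sin(f * angle))
            feats.append(math.cos(f * angle))
    return feats
```

Because the encoding is a smooth function of continuous angles, the model can be queried at arbitrary directions rather than a fixed measurement grid.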
[1095] VAANI: Capturing the language landscape for an inclusive digital India
Sujith Pulikodan, Abhayjeet Singh, Agneedh Basu, Lokesh Rady, Nihar Desai, Pavan Kumar J, Prajjwal Srivastav, Pranav D Bhat, Raghu Dharmaraju, Ritika Gupta, Sathvik Udupa, Saurabh Kumar, Sumit Sharma, Vaibhav Vishwakarma, Visruth Sanka, Dinesh Tewari, Harsh Dhand, Amrita Kamat, Sukhwinder Singh, Shikhar Vashishth, Partha Talukdar, Raj Acharya, Prasanta Kumar Ghosh
Main category: eess.AS
TL;DR: VAANI is a large-scale multimodal dataset project for India, collecting 289K images, 31K+ hours of audio, and 2K+ hours of transcribed speech across 112 languages from 165 districts to represent India’s linguistic diversity.
Details
Motivation: To create a comprehensive multimodal dataset that represents India's linguistic diversity, addressing the lack of inclusive speech and image datasets for Indian languages, many of which are underrepresented in existing resources.
Method: Structured data collection using image-based prompts to elicit spontaneous speech responses, with separate image capture covering diverse topics across districts. Multi-stage quality evaluation includes both automated and manual checks for audio quality and transcription accuracy.
Result: Released 289,000 images, approximately 31,270 hours of audio recordings, and around 2,067 hours of transcribed speech covering 112 languages from 165 districts across 31 States and Union territories, with many languages represented at scale for the first time.
Conclusion: VAANI is a groundbreaking effort in linguistic inclusivity that can enable building inclusive speech models for India and advance research in speech, image, and multimodal applications.
Abstract: Project VAANI is an initiative to create an India-representative multi-modal dataset that comprehensively maps India’s linguistic diversity, starting with 165 districts across the country in its first two phases. Speech data is collected through a carefully structured process that uses image-based prompts to encourage spontaneous responses. Images are captured through a separate process that encompasses a broad range of topics, gathered from both within and across districts. The collected data undergoes a rigorous multi-stage quality evaluation, including both automated and manual checks to ensure the highest possible standards in audio quality and transcription accuracy. Following this thorough validation, we have open-sourced around 289K images, approximately 31,270 hours of audio recordings, and around 2,067 hours of transcribed speech, encompassing 112 languages from 165 districts across 31 States and Union territories. Notably, a significant number of these languages are represented for the first time in a dataset of this scale, making the VAANI project a groundbreaking effort in preserving and promoting linguistic inclusivity. This data can be instrumental in building inclusive speech models for India, and in advancing research and development across speech, image, and multimodal applications.
[1096] Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?
Ashwini Dasare, Nirmesh Shah, Ashishkumar Gudmalwar, Pankaj Wasnik
Main category: eess.AS
TL;DR: A hierarchical multimodal architecture for evaluating AI-generated dubbed content using audio, video, and text features with LoRA adapters and proxy MOS optimization.
Details
Motivation: Human evaluation of AI-dubbed content is costly and impractical at scale, while existing automated methods lack perceptual alignment across multiple dimensions like synchronization, intelligibility, speaker consistency, emotional alignment, and semantic context.
Method: Hierarchical multimodal architecture integrating audio (speaker identity, prosody, content), video (facial expressions, scene-level cues), and text (semantic context) features with progressive fusion through intra- and inter-modal layers. Uses LoRA adapters for parameter-efficient fine-tuning and derives proxy MOS by aggregating objective metrics with weights optimized via active learning.
Result: Achieves strong perceptual alignment with Pearson Correlation Coefficient > 0.75 when trained on 12k Hindi-English bidirectional dubbed clips and fine-tuned with human MOS, providing a scalable solution for automatic dubbing evaluation.
Conclusion: The proposed multimodal architecture offers an effective, scalable alternative to human evaluation for AI-dubbed content assessment, achieving high perceptual correlation while being practical for large-scale deployment.
Abstract: Evaluating AI generated dubbed content is inherently multi-dimensional, shaped by synchronization, intelligibility, speaker consistency, emotional alignment, and semantic context. Human Mean Opinion Scores (MOS) remain the gold standard but are costly and impractical at scale. We present a hierarchical multimodal architecture for perceptually meaningful dubbing evaluation, integrating complementary cues from audio, video, and text. The model captures fine-grained features such as speaker identity, prosody, and content from audio, facial expressions and scene-level cues from video and semantic context from text, which are progressively fused through intra and inter-modal layers. Lightweight LoRA adapters enable parameter-efficient fine-tuning across modalities. To overcome limited subjective labels, we derive proxy MOS by aggregating objective metrics with weights optimized via active learning. The proposed architecture was trained on 12k Hindi-English bidirectional dubbed clips, followed by fine-tuning with human MOS. Our approach achieves strong perceptual alignment (PCC > 0.75), providing a scalable solution for automatic evaluation of AI-dubbed content.
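The proxy-MOS derivation is a weighted aggregation of objective metrics. A minimal sketch, where the weights are illustrative constants standing in for the values the paper's active-learning loop would produce:

```python
def proxy_mos(metrics, weights):
    """Weighted average of per-dimension objective metrics, serving as a
    stand-in MOS label when human ratings are unavailable."""
    assert set(metrics) == set(weights), "every metric needs a weight"
    total_w = sum(weights.values())
    return sum(weights[k] * metrics[k] for k in metrics) / total_w
```

Training on such proxy labels first, then fine-tuning on scarce human MOS, is what lets the system scale beyond the limited pool of subjective annotations.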
[1097] Acoustic-to-articulatory Inversion of the Complete Vocal Tract from RT-MRI with Various Audio Embeddings and Dataset Sizes
Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie
Main category: eess.AS
TL;DR: Complete vocal tract inversion from articulatory contours extracted from RT-MRI to acoustic features using Bi-LSTM, achieving 1.48mm RMSE error.
Details
Motivation: Previous articulatory-to-acoustic inversion studies rely on EMA data, which is limited in sensor count and accessible articulators. The goal is to achieve complete vocal tract inversion from glottis to lips using RT-MRI data for comprehensive geometric dynamics.
Method: Used 3.5 hours of RT-MRI data from a single speaker, extracting articulator contours automatically from MRI images instead of raw images. Processed contours with denoised audio using a Bi-LSTM architecture. Evaluated three audio embeddings (MFCCs, LCCs, HuBERT) and studied dataset size impact (10 minutes to 3.5 hours).
Result: Achieved average RMSE of 1.48mm (compared to pixel size of 1.62mm). Results confirm feasibility of complete vocal-tract inversion using RT-MRI data with contour-based approach.
Conclusion: Complete vocal tract inversion from articulatory contours to acoustic features is feasible using RT-MRI data and Bi-LSTM architecture, with contour extraction proving effective for capturing essential geometric dynamics while discarding redundant pixel information.
Abstract: Articulatory-to-acoustic inversion strongly depends on the type of data used. While most previous studies rely on EMA, which is limited by the number of sensors and restricted to accessible articulators, we propose an approach aiming at a complete inversion of the vocal tract, from the glottis to the lips. To this end, we used approximately 3.5 hours of RT-MRI data from a single speaker. The innovation of our approach lies in the use of articulator contours automatically extracted from MRI images, rather than relying on the raw images themselves. By focusing on these contours, the model prioritizes the essential geometric dynamics of the vocal tract while discarding redundant pixel-level information. These contours, alongside denoised audio, were then processed using a Bi-LSTM architecture. Two experiments were conducted: (1) the analysis of the impact of the audio embedding, for which three types of embeddings were evaluated as input to the model (MFCCs, LCCs, and HuBERT), and (2) the study of the influence of the dataset size, which we varied from 10 minutes to 3.5 hours. Evaluation was performed on the test data using RMSE, median error, as well as Tract Variables, to which we added an additional measurement: the larynx height. The average RMSE obtained is 1.48 mm, compared with the pixel size (1.62 mm). These results confirm the feasibility of a complete vocal-tract inversion using RT-MRI data.
[1098] ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining
Anuj Diwan, Eunsol Choi, David Harwath
Main category: eess.AS
TL;DR: ParaSpeechCLAP is a dual-encoder contrastive model that maps speech and text style captions into a shared embedding space, supporting a wide range of intrinsic (speaker-level) and situational (utterance-level) style descriptors beyond existing models.
Details
Motivation: Existing models handle only a narrow set of style descriptors, limiting their ability to capture the rich variety of speech styles including pitch, texture, and emotion. There's a need for models that can understand both intrinsic speaker characteristics and situational utterance-level styles.
Method: Developed specialized ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational models using dual-encoder contrastive learning, plus a unified ParaSpeechCLAP-Combined model. Used classification loss and class-balanced training for the intrinsic model. The approach maps speech and text style captions into a common embedding space.
Result: Specialized models perform better on individual style dimensions while the unified model excels on compositional evaluation. ParaSpeechCLAP outperforms baselines on style caption retrieval, speech attribute classification, and as an inference-time reward model for style-prompted TTS without additional training.
Conclusion: ParaSpeechCLAP successfully creates a shared embedding space for speech and text style captions, supporting diverse style descriptors. The specialized vs. unified trade-off offers flexibility for different applications, and the model serves multiple purposes including retrieval, classification, and TTS enhancement.
Abstract: We introduce ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style captions into a common embedding space, supporting a wide range of intrinsic (speaker-level) and situational (utterance-level) descriptors (such as pitch, texture and emotion) far beyond the narrow set handled by existing models. We train specialized ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational models alongside a unified ParaSpeechCLAP-Combined model, finding that specialization yields stronger performance on individual style dimensions while the unified model excels on compositional evaluation. We further show that ParaSpeechCLAP-Intrinsic benefits from an additional classification loss and class-balanced training. We demonstrate our models’ performance on style caption retrieval, speech attribute classification and as an inference-time reward model that improves style-prompted TTS without additional training. ParaSpeechCLAP outperforms baselines on most metrics across all three applications. Our models and code are released at https://github.com/ajd12342/paraspeechclap .
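Dual-encoder CLAP-style models are conventionally trained with a symmetric InfoNCE objective over a speech-caption similarity matrix, with matched pairs on the diagonal. A self-contained sketch of that loss (illustrative, not the authors' training code):

```python
import math

def clap_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a square matrix sim[i][j] of similarities
    between speech embedding i and caption embedding j."""
    n = len(sim)

    def ce_rows(m):
        total = 0.0
        for i in range(n):
            logits = [m[i][j] / temperature for j in range(n)]
            mx = max(logits)                 # stable log-sum-exp
            logz = mx + math.log(sum(math.exp(l - mx) for l in logits))
            total += logz - logits[i]        # -log softmax at the matched pair
        return total / n

    sim_t = [[sim[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (ce_rows(sim) + ce_rows(sim_t))   # speech-to-text + text-to-speech
```

Minimizing this pulls matched speech/caption pairs together and pushes mismatched pairs apart, which is what later enables zero-shot retrieval and attribute classification.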
[1099] DiffAU: Diffusion-Based Ambisonics Upscaling
Amit Milstein, Nir Shlezinger, Boaz Rafaely
Main category: eess.AS
TL;DR: DiffAU: A cascaded diffusion model approach for upscaling first-order Ambisonics (FOA) to third-order Ambisonics (HOA) to enhance spatial audio realism.
Details
Motivation: First-order Ambisonics (FOA) is hardware-efficient for 3D sound field acquisition and storage, but its low spatial resolution limits immersion. There's a need for Ambisonics upscaling (AU) to increase order while maintaining efficiency.
Method: Proposes DiffAU, a cascaded AU method using diffusion models adapted for spatial audio. Learns data distributions to generate 3rd order Ambisonics from FOA input in a principled way.
Result: Experiments in anechoic conditions with multiple speakers show strong objective and perceptual performance for the upscaled HOA.
Conclusion: DiffAU provides a reliable and rapid approach for Ambisonics upscaling using diffusion models, enhancing spatial audio realism while maintaining hardware efficiency.
Abstract: Spatial audio enhances immersion by reproducing 3D sound fields, with Ambisonics offering a scalable format for this purpose. While first-order Ambisonics (FOA) notably facilitates hardware-efficient acquisition and storage of sound fields as compared to high-order Ambisonics (HOA), its low spatial resolution limits realism, highlighting the need for Ambisonics upscaling (AU) as an approach for increasing the order of Ambisonics signals. In this work we propose DiffAU, a cascaded AU method that leverages recent developments in diffusion models combined with novel adaptation to spatial audio to generate 3rd order Ambisonics from FOA. By learning data distributions, DiffAU provides a principled approach that rapidly and reliably reproduces HOA in various settings. Experiments in anechoic conditions with multiple speakers show strong objective and perceptual performance.
[1100] Joint Optimization of Speaker and Spoof Detectors for Spoofing-Robust Automatic Speaker Verification
Oğuzhan Kurnaz, Jagabandhu Mishra, Tomi H. Kinnunen, Cemal Hanilçi
Main category: eess.AS
TL;DR: Spoofing-robust speaker verification system using modular design with trainable back-end classifiers optimized for SASV metrics, achieving state-of-the-art performance on ASVspoof 5 dataset.
Details
Motivation: Current SASV systems often use independent subsystems for speaker and spoof detection with simple fusion methods, lacking optimization for the specific SASV performance metric (a-DCF). There's a need for better integration and task-aligned optimization.
Method: Modular approach with independently trained speaker and spoof detection subsystems, integrated using trainable back-end classifiers. Direct optimization of the back-end for the SASV metric (a-DCF) as training objective. Combines weighted cosine scoring for speaker detection with SSL-AASIST for spoof detection.
Result: Nonlinear score fusion consistently improves a-DCF over linear fusion. The proposed system achieves state-of-the-art performance: min a-DCF of 0.196 and SPF-EER of 7.6% on ASVspoof 5 dataset.
Conclusion: Modular design with calibrated integration and task-aligned optimization is crucial for advancing robust and interpretable SASV systems. The approach demonstrates the importance of directly optimizing for the target performance metric.
Abstract: Spoofing-robust speaker verification (SASV) combines the tasks of speaker and spoof detection to authenticate speakers under adversarial settings. Many SASV systems rely on fusion of speaker and spoof cues at embedding, score or decision levels, based on independently trained subsystems. In this study, we respect similar modularity of the two subsystems, by integrating their outputs using trainable back-end classifiers. In particular, we explore various approaches for directly optimizing the back-end for the recently-proposed SASV performance metric (a-DCF) as a training objective. Our experiments on the ASVspoof 5 dataset demonstrate two important findings: (i) nonlinear score fusion consistently improves a-DCF over linear fusion, and (ii) the combination of weighted cosine scoring for speaker detection with SSL-AASIST for spoof detection achieves state-of-the-art performance, reducing min a-DCF to 0.196 and SPF-EER to 7.6%. These contributions highlight the importance of modular design, calibrated integration, and task-aligned optimization for advancing robust and interpretable SASV systems.
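The paper's first finding is that a nonlinear back-end consistently beats a linear weighted sum when fusing the speaker and spoof scores. A minimal sketch of the two fusion shapes, with hypothetical random weights (the actual back-end architecture and its a-DCF training objective are described in the paper, not here):

```python
import numpy as np

def linear_fusion(spk_score, spf_score, w=(0.5, 0.5), b=0.0):
    """Weighted-sum baseline: a single affine combination of the two scores."""
    return w[0] * spk_score + w[1] * spf_score + b

def nonlinear_fusion(spk_score, spf_score, W1, b1, w2, b2):
    """A one-hidden-layer MLP back-end (tanh) over the two sub-system scores.
    In practice the weights would be trained against an a-DCF-style objective."""
    x = np.array([spk_score, spf_score])
    h = np.tanh(W1 @ x + b1)        # hidden layer introduces the nonlinearity
    return float(w2 @ h + b2)       # fused SASV score

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)
w2, b2 = rng.normal(size=4), 0.0
print(nonlinear_fusion(1.2, -0.3, W1, b1, w2, b2))
```

The nonlinear variant can carve decision regions (e.g. reject when either score is low) that no single affine combination can express, which is the intuition behind finding (i).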
[1101] X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs
Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan, Tao Jin
Main category: eess.AS
TL;DR: X-OPD is a cross-modal on-policy distillation framework that aligns speech LLMs with text LLMs by using on-policy rollouts and token-level feedback from a text teacher to improve speech LLM performance.
Details
Motivation: End-to-end speech LLMs suffer significant performance degradation compared to text-based LLMs, and standard SFT and RL training methods fail to close this gap, necessitating a novel approach to align speech LLM capabilities with text counterparts.
Method: Proposes the X-OPD framework, in which the speech LLM explores its own distribution via on-policy rollouts; a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher's capabilities into the student's multi-modal representations.
Result: Extensive experiments across multiple benchmarks demonstrate X-OPD significantly narrows the performance gap in complex tasks while preserving the model’s inherent capabilities.
Conclusion: X-OPD effectively addresses the performance gap between speech and text LLMs through systematic cross-modal alignment, offering a promising direction for improving end-to-end speech language models.
Abstract: While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit a significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs to their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling teacher’s capabilities into student’s multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model’s inherent capabilities.
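The training signal can be pictured as a per-token divergence between the teacher's and student's next-token distributions along a student rollout. This is an illustrative guess at the loss shape only; the abstract does not specify X-OPD's exact feedback formulation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_level_kl(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) over a rolled-out trajectory.
    Shapes: (T, V) for T tokens and vocabulary size V."""
    p = softmax(teacher_logits)      # teacher distribution per token
    q = softmax(student_logits)      # student distribution per token
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(0)
t = rng.normal(size=(8, 32))         # hypothetical teacher logits, 8 tokens
print(token_level_kl(t, t))          # identical distributions -> 0.0
```

The "on-policy" part is that the tokens being scored come from the student's own sampling distribution, so the teacher corrects errors the student actually makes rather than errors in teacher-forced text.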
eess.IV
[1102] Beyond Benchmarks: A Framework for Post Deployment Validation of CT Lung Nodule Detection AI
Daniel Soliman
Main category: eess.IV
TL;DR: Physics-guided framework evaluates lung nodule detection AI sensitivity to CT acquisition parameter variations, finding slice thickness more critical than dose reduction for performance.
Details
Motivation: AI lung nodule detection systems are deployed without site-specific validation, and performance may degrade when acquisition parameters differ from training data. Need a reproducible framework to evaluate sensitivity to systematic CT parameter variations.
Method: Used 21 cases from LIDC-IDRI dataset with MONAI RetinaNet model pretrained on LUNA16. Tested five imaging conditions: baseline, 25% dose reduction, 50% dose reduction, 3mm slice thickness, and 5mm slice thickness. Simulated dose reduction via Gaussian noise and slice thickness via moving average along z-axis. Computed detection sensitivity at confidence threshold 0.5 with 15mm matching criterion.
Result: Baseline sensitivity: 45.2%. Dose reduction: 41.3% (25% dose) and 42.1% (50% dose). 5mm slice thickness: 26.2% (19 percentage point reduction, 42% relative decrease). Performance consistent across confidence thresholds 0.1-0.9. Heterogeneous per-case performance with two cases showing complete detection failure at baseline.
Conclusion: Slice thickness is more fundamental constraint on AI detection performance than image noise. Proposed framework is reproducible, requires no proprietary scanner data, and can serve as basis for post-deployment QA in resource-constrained environments.
Abstract: Background: Artificial intelligence (AI) assisted lung nodule detection systems are increasingly deployed in clinical settings without site-specific validation. Performance reported under benchmark conditions may not reflect real-world behavior when acquisition parameters differ from training data. Purpose: To propose and demonstrate a physics-guided framework for evaluating the sensitivity of a deployed lung nodule detection model to systematic variation in CT acquisition parameters. Methods: Twenty-one cases from the publicly available LIDC-IDRI dataset were evaluated using a MONAI RetinaNet model pretrained on LUNA16 (fold 0, no fine-tuning). Five imaging conditions were tested: baseline, 25% dose reduction, 50% dose reduction, 3 mm slice thickness, and 5 mm slice thickness. Dose reduction was simulated via image-domain Gaussian noise; slice thickness via moving average along the z-axis. Detection sensitivity was computed at a confidence threshold of 0.5 with a 15 mm matching criterion. Results: Baseline sensitivity was 45.2% (57/126 consensus nodules). Dose reduction produced slight degradation: 41.3% at 25% dose and 42.1% at 50% dose. The 5 mm slice thickness condition produced a marked drop to 26.2% - a 19 percentage point reduction representing a 42% relative decrease from baseline. This finding was consistent across confidence thresholds from 0.1 to 0.9. Per-case analysis revealed heterogeneous performance including two cases with complete detection failure at baseline. Conclusion: Slice thickness represents a more fundamental constraint on AI detection performance than image noise under the conditions tested. The proposed framework is reproducible, requires no proprietary scanner data, and is designed to serve as the basis for ongoing post-deployment QA in resource-constrained environments.
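The two degradation models in the Methods section are simple enough to sketch directly: image-domain Gaussian noise for dose reduction and a z-axis moving average for slice thickness. The noise-scaling rule below (quantum noise growing as 1/sqrt(dose)) is an assumption for illustration; the paper does not state its exact sigma schedule:

```python
import numpy as np

def simulate_dose_reduction(volume, dose_fraction, rng=None):
    """Image-domain proxy for tube-current reduction: additive Gaussian noise
    whose strength grows as dose drops (assumed quantum-noise scaling)."""
    rng = rng or np.random.default_rng(0)
    sigma = volume.std() * np.sqrt(1.0 / dose_fraction - 1.0)
    return volume + rng.normal(0.0, sigma, volume.shape)

def simulate_slice_thickness(volume, factor):
    """Thicker slices via a moving average along the z-axis (axis 0)."""
    kernel = np.ones(factor) / factor
    return np.apply_along_axis(lambda z: np.convolve(z, kernel, mode="same"),
                               0, volume)

vol = np.random.default_rng(1).normal(size=(16, 8, 8))   # toy CT volume
low_dose = simulate_dose_reduction(vol, dose_fraction=0.5)
thick = simulate_slice_thickness(vol, factor=5)          # ~5 mm-style blur
print(low_dose.shape, thick.shape)
```

Averaging along z explains the headline result qualitatively: small nodules spanning one or two thin slices are smeared into neighboring slices, whereas added noise leaves their spatial extent intact.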
[1103] Toward Actionable Digital Twins for Radiation-Based Imaging and Therapy: Mathematical Formulation, Modular Workflow, and an OpenKBP-Based Dose-Surrogate Prototype
Hsin-Hsiung Huang, Bulent Soykan
Main category: eess.IV
TL;DR: A modular framework for actionable digital twins in radiation therapy with uncertainty quantification, implemented using the OpenKBP benchmark with a 3D U-Net model and Monte Carlo dropout for uncertainty propagation.
Details
Motivation: To create digital twins for radiation-based imaging and therapy that can assimilate patient data, quantify predictive uncertainty, and support clinically constrained decisions through a modular framework.
Method: Developed a modular framework with PatientData, Model, Solver, Calibration, and Decision modules. Implemented a 3D U-Net with 11 channels and 19.2M parameters trained with masked loss, equipped with Monte Carlo dropout for epistemic uncertainty. Introduced decoder-only proxy recalibration for closed-loop adaptation.
Result: Achieved mean dose and DVH scores of 2.65 and 1.82 Gy on 100-patient test set, with 0.58s mean inference time per patient. Complete three-fraction loop executes in 10.3s, demonstrating efficient uncertainty-aware virtual-therapy evaluation.
Conclusion: The OpenKBP case study provides a reproducible test bed for dose prediction, uncertainty propagation, and proxy closed-loop adaptation in radiation therapy digital twins, with future work focusing on longitudinal calibration with clinical data.
Abstract: Digital twins for radiation-based imaging and therapy are most useful when they assimilate patient data, quantify predictive uncertainty, and support clinically constrained decisions. This paper presents a modular framework for actionable digital twins in radiation-based imaging and therapy and instantiates its reproducible open-data component using the Open Knowledge-Based Planning (OpenKBP) benchmark. The framework couples PatientData, Model, Solver, Calibration, and Decision modules and formalizes latent-state updating, uncertainty propagation, and chance-constrained action selection. As an initial implementation, we build a GPU-ready PyTorch/MONAI reimplementation of the OpenKBP starter pipeline: an 11-channel, 19.2M-parameter 3D U-Net trained with a masked loss over the feasible region and equipped with Monte Carlo dropout for voxel-wise epistemic uncertainty. To emulate the update loop on a static benchmark, we introduce decoder-only proxy recalibration and illustrate uncertainty-aware virtual-therapy evaluation using DVH-based and biological utilities. A complete three-fraction loop including recalibration, Monte Carlo inference, and spatial optimization executes in 10.3 s. On the 100-patient test set, the model achieved mean dose and DVH scores of 2.65 and 1.82 Gy, respectively, with 0.58 s mean inference time per patient. The OpenKBP case study thus serves as a reproducible test bed for dose prediction, uncertainty propagation, and proxy closed-loop adaptation, while future institutional studies will address longitudinal calibration with delivered-dose logs and repeat imaging.
[1104] External Benchmarking of Lung Ultrasound Models for Pneumothorax-Related Signs: A Manifest-Based Multi-Source Study
Takehiro Ishikawa
Main category: eess.IV
TL;DR: Developed a manifest-based external benchmark for lung ultrasound AI evaluation, showing binary classification obscures clinically important signs like lung point and lung pulse.
Details
Motivation: Reproducible external benchmarks for pneumothorax-related lung ultrasound AI are scarce, and binary lung-sliding classification may obscure clinically important signs that are critical for accurate diagnosis.
Method: Curated 280 clips from 190 publicly accessible LUS source videos and created a reconstruction manifest with URLs, timestamps, crop coordinates, labels, and probe shape. Evaluated a previously published single-site binary classifier on this benchmark with challenge-state analysis examining lung point and lung pulse using predicted probability of absent sliding.
Result: Single-site classifier achieved ROC-AUC 0.9625 in-domain but only 0.7050 on the heterogeneous external benchmark. Challenge-state analysis showed lung pulse was treated as normal-like despite absent sliding, and lung point represented an intermediate ambiguity state rather than a clean binary class.
Conclusion: Manifest-based, multi-source benchmarks support reproducible external evaluation without redistributing source videos. Binary lung-sliding classification is incomplete for pneumothorax reasoning as it obscures blind-spot and ambiguity states like lung pulse and lung point.
Abstract: Background and Aims: Reproducible external benchmarks for pneumothorax-related lung ultrasound (LUS) AI are scarce, and binary lung-sliding classification may obscure clinically important signs. We therefore developed a manifest-based external benchmark and used it to test both cross-domain generalization and task validity. Methods: We curated 280 clips from 190 publicly accessible LUS source videos and released a reconstruction manifest containing URLs, timestamps, crop coordinates, labels, and probe shape. Labels were normal lung sliding, absent lung sliding, lung point, and lung pulse. A previously published single-site binary classifier was evaluated on this benchmark; challenge-state analysis examined lung point and lung pulse using the predicted probability of absent sliding, P(absent). Results: The single-site comparator achieved ROC-AUC 0.9625 in-domain but 0.7050 on the heterogeneous external benchmark; restricting external evaluation to linear clips still yielded ROC-AUC 0.7212. In challenge-state analysis, mean P(absent) ranked absent (0.504) > lung point (0.313) > normal (0.186) > lung pulse (0.143). Lung pulse differed from absent clips (p=0.000470) but not from normal clips (p=0.813), indicating that the binary model treated pulse as normal-like despite absent sliding. Lung point differed from both absent (p=0.000468) and normal (p=0.000026), supporting its interpretation as an intermediate ambiguity state rather than a clean binary class. Conclusion: A manifest-based, multi-source benchmark can support reproducible external evaluation without redistributing source videos. Binary lung-sliding classification is an incomplete proxy for pneumothorax reasoning because it obscures blind-spot and ambiguity states such as lung pulse and lung point.
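The challenge-state analysis reduces to computing mean P(absent) per clip label and ranking the labels; a minimal sketch with toy scores shaped like the reported ordering (the real benchmark uses 280 clips and adds significance tests):

```python
import numpy as np

def challenge_state_ranking(p_absent, labels):
    """Mean predicted P(absent sliding) per clip label, sorted descending.
    Reveals whether 'lung pulse' is scored as normal-like and whether
    'lung point' falls between the two binary classes."""
    means = {lab: float(np.mean([p for p, l in zip(p_absent, labels) if l == lab]))
             for lab in set(labels)}
    return sorted(means.items(), key=lambda kv: -kv[1])

# Hypothetical clip scores mimicking the reported absent > point > normal > pulse
p = [0.55, 0.45, 0.35, 0.28, 0.20, 0.17, 0.15, 0.13]
y = ["absent", "absent", "point", "point", "normal", "normal", "pulse", "pulse"]
print(challenge_state_ranking(p, y))
```

The clinically worrying case is exactly the one this surfaces: "pulse" landing below "normal" means the binary model would wave through clips where sliding is in fact absent.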
[1105] Hybrid Diffusion Model for Breast Ultrasound Image Augmentation
Farhan Fuad Abir, Sanjeda Sara Jennifer, Niloofar Yousefi, Laura J. Brattain
Main category: eess.IV
TL;DR: Hybrid diffusion framework for ultrasound data augmentation combining text-to-image generation with image-to-image refinement and fine-tuning techniques to improve visual fidelity and preserve ultrasound texture.
Details
Motivation: Overcome the critical challenge of ultrasound data augmentation in breast ultrasound datasets, addressing low-fidelity limitations of synthetic ultrasound images for robust diagnostic modeling.
Method: Hybrid diffusion-based augmentation combining text-to-image generation with image-to-image refinement, fine-tuning with low-rank adaptation (LoRA) and textual inversion (TI) to preserve ultrasound texture.
Result: Reduced FID from 45.97 to 33.29 compared to Stable Diffusion v1.5 baseline, demonstrating substantial gain in fidelity while maintaining comparable downstream classification performance.
Conclusion: The framework effectively mitigates low-fidelity limitations of synthetic ultrasound images and enhances augmentation quality for robust diagnostic modeling.
Abstract: We propose a hybrid diffusion-based augmentation framework to overcome the critical challenge of ultrasound data augmentation in breast ultrasound (BUS) datasets. Unlike conventional diffusion-based augmentations, our approach improves visual fidelity and preserves ultrasound texture by combining text-to-image generation with image-to-image (img2img) refinement, as well as fine-tuning with low-rank adaptation (LoRA) and textual inversion (TI). Our method generated realistic, class-consistent images on an open-source Kaggle breast ultrasound image dataset (BUSI). Compared to the Stable Diffusion v1.5 baseline, incorporating TI and img2img refinement reduced the Frechet Inception Distance (FID) from 45.97 to 33.29, demonstrating a substantial gain in fidelity while maintaining comparable downstream classification performance. Overall, the proposed framework effectively mitigates the low-fidelity limitations of synthetic ultrasound images and enhances the quality of augmentation for robust diagnostic modeling.
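The fidelity gain is reported as an FID drop from 45.97 to 33.29. FID is a Fréchet distance between Gaussian fits to feature embeddings; the sketch below uses a diagonal-covariance simplification so the formula stays inspectable (the standard metric uses full covariance matrices of Inception features and a matrix square root):

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances:
    FID = ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))."""
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return float(mean_term + cov_term)

print(fid_diagonal([0, 0], [1, 1], [0, 0], [1, 1]))  # identical stats -> 0.0
```

Lower is better: the score is zero only when the synthetic and real feature distributions share both mean and spread, which is why a 45.97 to 33.29 drop indicates the LoRA/TI/img2img pipeline moved the synthetic images closer to real ultrasound statistics.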
[1106] ANVIL: Accelerator-Native Video Interpolation via Codec Motion Vector Priors
Shibo Liu
Main category: eess.IV
TL;DR: ANVIL: A mobile-optimized video frame interpolation system that reuses H.264 decoder motion vectors to avoid learned optical flow, enabling real-time 1080p interpolation on mobile NPUs with 12.8ms inference time.
Details
Motivation: Mobile displays refresh at 90-120Hz but most video is encoded at 24-30fps, requiring real-time frame-rate doubling. Current flow-based VFI methods face deployment barriers on mobile accelerators: spatial sampling operators exceed frame budget, iterative flow refinement fails under 8-bit quantization, and memory-bound operators dominate inference graphs.
Method: ANVIL reuses motion vectors already computed by H.264 decoder to prealign input frames, removing learned optical flow, spatial sampling, and iterative accumulation from accelerator graph. The residual is refined by a convolution-dominated network composed almost entirely of compute-bound operators.
Result: On Snapdragon 8 Gen 3 device, achieves 12.8ms 1080p network inference in 8-bit integer precision. Open-source Android player sustains 28.4ms median end-to-end latency per interpolated frame pair over 54,623 consecutively logged samples during 30-minute continuous playback.
Conclusion: ANVIL addresses mobile deployment barriers by leveraging existing decoder motion vectors, enabling real-time video frame interpolation on mobile devices. Identifies quantized accumulation on recurrent flow states as key mechanism behind integer quantization failure in iterative methods.
Abstract: Mobile displays refresh at 90-120 Hz, yet most video is encoded at 24-30 frames per second; real-time frame-rate doubling requires each synthesized frame within 33.3 ms on mobile neural processing units. We show that mainstream flow-based video frame interpolation faces three structural deployment barriers on mobile accelerators: spatial sampling operators exceed the frame budget or lack hardware support, iterative flow refinement collapses under 8-bit post-training quantization, and memory-bound operators dominate the inference graph. ANVIL addresses these barriers by reusing motion vectors already computed by the H.264 decoder to prealign input frames, removing learned optical flow, spatial sampling, and iterative accumulation from the accelerator graph. The remaining residual is refined by a convolution-dominated network whose inference graph is composed almost entirely of compute-bound operators. On a Snapdragon 8 Gen 3 device, ANVIL achieves 12.8 ms 1080p network inference in 8-bit integer precision; an open-source Android player sustains 28.4 ms median end-to-end latency per interpolated frame pair over 54,623 consecutively logged samples during 30-minute continuous playback. Per-operator causal analysis identifies quantized accumulation on recurrent flow states as a key mechanism behind integer quantization failure in iterative methods. The current design targets H.264 playback scenarios with decoder-exposed motion vectors.
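The prealignment step reuses the per-block motion vectors an H.264 decoder already produces. A minimal numpy sketch of block-copy warping (ANVIL's actual alignment, boundary handling, and sub-pel interpolation are not specified in the abstract, so this is illustrative only):

```python
import numpy as np

def prealign_with_mvs(prev_frame, motion_vectors, block=16):
    """Warp the previous frame toward the current one using per-block motion
    vectors (dx, dy) of the kind an H.264 decoder exposes, so no learned
    optical flow is needed on the accelerator."""
    h, w = prev_frame.shape
    out = np.zeros_like(prev_frame)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dx, dy = motion_vectors[by // block, bx // block]
            sy = np.clip(by + dy, 0, h - block)   # clamped source block origin
            sx = np.clip(bx + dx, 0, w - block)
            out[by:by + block, bx:bx + block] = \
                prev_frame[sy:sy + block, sx:sx + block]
    return out

frame = np.arange(64 * 64, dtype=float).reshape(64, 64)
mvs = np.zeros((4, 4, 2), dtype=int)      # zero motion -> identity warp
print(np.allclose(prealign_with_mvs(frame, mvs), frame))  # True
```

Because the warp is plain block copies, nothing here needs grid-sample operators or iterative flow accumulation on the NPU; only the small residual network does, which is the core of the deployment argument.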
[1107] Reliability-Aware Weighted Multi-Scale Spatio-Temporal Maps for Heart Rate Monitoring
Arpan Bairagi, Rakesh Dey, Siladittya Manna, Umapada Pal
Main category: eess.IV
TL;DR: A self-supervised learning approach for remote photoplethysmography (rPPG) that uses reliability-aware weighted multi-scale spatio-temporal maps and contrastive learning with Swin-Unet to improve heart rate estimation robustness against illumination changes and motion artifacts.
Details
Motivation: Remote photoplethysmography (rPPG) enables contactless physiological signal estimation from facial videos but is highly susceptible to illumination changes, motion, shadows, and specular reflections in unconstrained environments, resulting in low-quality signals.
Method: Proposes a Reliability-Aware Weighted Multi-Scale Spatio-Temporal (WMST) map that models pixel reliability by suppressing environmental noises using different weighting strategies. Uses SSL contrastive learning with Swin-Unet where positive pairs are generated from conventional rPPG signals and temporally expanded WMST maps, and introduces a High-High-High (HHH) wavelet map as a negative example that maintains motion/structure while filtering out physiological information.
Result: Experiments on public rPPG benchmarks show enhanced motion and illumination robustness with lower heart rate estimation error and higher Pearson correlation than existing SSL-based rPPG methods.
Conclusion: The proposed approach effectively addresses environmental noise challenges in rPPG through reliability-aware modeling and contrastive learning, improving heart rate estimation performance in unconstrained environments.
Abstract: Remote photoplethysmography (rPPG) allows for the contactless estimation of physiological signals from facial videos by analyzing subtle skin color changes. However, rPPG signals are extremely susceptible to illumination changes, motion, shadows, and specular reflections, resulting in low-quality signals in unconstrained environments. To overcome these issues, we present a Reliability-Aware Weighted Multi-Scale Spatio-Temporal (WMST) map that models pixel reliability through the suppression of environmental noises. These noises are modeled using different weighting strategies to focus on more physiologically valid areas. Leveraging the WMST map, we develop an SSL contrastive learning approach based on Swin-Unet, where positive pairs are generated from conventional rPPG signals and temporally expanded WMST maps. Moreover, we introduce a new High-High-High (HHH) wavelet map as a negative example that maintains motion and structural details while filtering out physiological information. Here, our aim is to estimate heart rate (HR), and the experiments on public rPPG benchmarks show that our approach enhances motion and illumination robustness with lower HR estimation error and higher Pearson correlation than existing Self-Supervised Learning (SSL) based rPPG methods.
[1108] Uncertainty-Aware Mapping from 3D Keypoints to Anatomical Landmarks for Markerless Biomechanics
Cesare Davide Pace, Alessandro Marco De Nunzio, Claudio De Stefano, Francesco Fontanella, Mario Molinara
Main category: eess.IV
TL;DR: Predictive uncertainty modeling for quality control in markerless biomechanics, evaluating uncertainty in 3D pose keypoint to anatomical landmark mapping.
Details
Motivation: Current markerless biomechanics pipelines treat 3D skeletal keypoints as deterministic without quality control mechanisms, creating a need for principled uncertainty estimation to identify unreliable frames.
Method: Temporal learning framework modeling both observation noise uncertainty and model uncertainty, evaluated on AMASS dataset with motion capture ground truth using error-uncertainty correlation, risk-coverage analysis, and outlier detection.
Result: Model uncertainty shows strong correlation with landmark error (Spearman ρ≈0.63), enables selective frame retention (error reduced to ≈16.8mm at 10% coverage), and detects severe failures (ROC-AUC≈0.92 for errors >50mm).
Conclusion: Predictive uncertainty provides practical frame-wise quality control for markerless biomechanics, with model uncertainty being more informative than observation noise uncertainty for detecting mapping failures.
Abstract: Markerless biomechanics increasingly relies on 3D skeletal keypoints extracted from video, yet downstream biomechanical mappings typically treat these estimates as deterministic, providing no principled mechanism for frame-wise quality control. In this work, we investigate predictive uncertainty as a quantitative measure of confidence for mapping 3D pose keypoints to 3D anatomical landmarks, a critical step preceding inverse kinematics and musculoskeletal analysis. Within a temporal learning framework, we model both uncertainty arising from observation noise and uncertainty related to model limitations. Using synchronized motion capture ground truth on AMASS, we evaluate uncertainty at frame and joint level through error–uncertainty rank correlation, risk–coverage analysis, and catastrophic outlier detection. Across experiments, uncertainty estimates, particularly those associated with model uncertainty, exhibit a strong monotonic association with landmark error (Spearman ρ ≈ 0.63), enabling selective retention of reliable frames (error reduced to ≈ 16.8 mm at 10% coverage) and accurate detection of severe failures (ROC-AUC ≈ 0.92 for errors > 50 mm). Reliability ranking remains stable under controlled input degradation, including Gaussian noise and simulated missing joints. In contrast, uncertainty attributable to observation noise provides limited additional benefit in this setting, suggesting that dominant failures in keypoint-to-landmark mapping are driven primarily by model uncertainty. Our results establish predictive uncertainty as a practical, frame-wise tool for automatic quality control in markerless biomechanical pipelines.
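The risk-coverage analysis behind the "error reduced to about 16.8 mm at 10% coverage" result amounts to ranking frames by predicted uncertainty and keeping only the most confident fraction. A minimal sketch with toy values (the names and numbers below are illustrative, not the paper's):

```python
import numpy as np

def error_at_coverage(errors, uncertainties, coverage):
    """Keep the fraction `coverage` of frames with the lowest predicted
    uncertainty and report their mean landmark error (one risk-coverage point)."""
    order = np.argsort(uncertainties)            # most confident frames first
    k = max(1, int(round(coverage * len(errors))))
    kept = np.asarray(errors)[order[:k]]
    return float(kept.mean())

# Toy data where uncertainty tracks error, as the reported Spearman rho implies
err = np.array([10.0, 12.0, 15.0, 30.0, 60.0])   # landmark errors (mm)
unc = np.array([0.1, 0.2, 0.3, 0.8, 0.9])        # predicted uncertainties
print(error_at_coverage(err, unc, 1.0))   # all frames retained
print(error_at_coverage(err, unc, 0.4))   # most confident 40% only
```

Sweeping `coverage` from 1.0 down to 0.1 traces the full risk-coverage curve; a steep drop in mean error, as reported here, is what makes uncertainty useful as a frame-wise quality gate.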
[1109] On-Device Super Resolution Imaging Using Low-Cost SPAD Array and Embedded Lightweight Deep Learning
Zhenya Zang, Xingda Li, David Day Uei Li
Main category: eess.IV
TL;DR: LiteSR is a lightweight super-resolution neural network for enhancing 48x32 SPAD array depth/intensity images to 256x256+ resolution, enabling real-time SR video streaming via embedded system co-design.
Details
Motivation: Consumer-grade SPAD arrays have low spatial resolution (48x32), limiting their utility for high-resolution imaging applications. There's a need for cost-effective solutions to enhance resolution without expensive hardware upgrades.
Method: Proposes LiteSR - a lightweight super-resolution neural network that reconstructs high-resolution (256x256) images from low-resolution SPAD inputs. Uses compressed, pre-trained DL model interfaced with an Arduino UNO Q for real-time processing. Evaluates multiple target resolutions up to 512x512, including noise-corrupted inputs.
Result: Achieves high reconstruction fidelity on synthetic datasets, confirmed by robustness on real indoor/outdoor measurements. Enables real-time SR video streaming. Maximum achievable upscaling is 512x512 resolution. Provides scalable, cost-effective solution for enhancing SPAD array resolution.
Conclusion: LiteSR-embedded system co-design offers a practical solution to overcome spatial resolution limitations of consumer-grade SPAD arrays, meeting high-resolution imaging requirements through software enhancement rather than hardware upgrades.
Abstract: This work presents a lightweight super-resolution (LiteSR) neural network for depth and intensity images acquired from a consumer-grade single-photon avalanche diode (SPAD) array with a 48x32 spatial resolution. The proposed framework reconstructs high-resolution (HR) images of size 256x256. Both synthetic and real datasets are used for performance evaluation. Extensive quantitative metrics demonstrate high reconstruction fidelity on synthetic datasets, while experiments on real indoor and outdoor measurements further confirm the robustness of the proposed approach. Moreover, the SPAD sensor is interfaced with an Arduino UNO Q microcontroller, which receives low-resolution (LR) depth and intensity images and feeds them into a compressed, pre-trained deep learning (DL) model, enabling real-time SR video streaming. In addition to the 256x256 setting, a range of target HR resolutions is evaluated to determine the maximum achievable upscaling resolution (512x512) with LiteSR, including scenarios with noise-corrupted LR inputs. The proposed LiteSR-embedded system co-design provides a scalable, cost-effective solution to enhance the spatial resolution of current consumer-grade SPAD arrays to meet HR imaging requirements.
[1110] Quantitative measurements of biological/chemical concentrations using smartphone cameras
Zhendong Cao, Hongji Dai, Zhida Li, Ash Parameswaran
Main category: eess.IV
TL;DR: Smartphone-based imaging system for quantifying biological/chemical assay concentrations using color information and image processing
Details
Motivation: To develop an inexpensive, portable diagnostic system using smartphone cameras for remote/impoverished areas, enabling concentration quantification of biological/chemical assays
Method: Designated optical setup combined with image processing and data analysis techniques to construct image database linking color information to assay concentrations
Result: System successfully estimates concentrations of fluorescent materials and colloidal mixtures (fluorescein, RNA Mango, homogenized milk, yeast) comparable to commercial/laboratory instruments
Conclusion: Smartphone-based imaging system shows promise for developing compact, inexpensive, portable diagnostic systems suitable for remote areas
Abstract: This paper presents a smartphone-based imaging system capable of quantifying the concentration of an assortment of biological/chemical assay samples. The main objective is to construct an image database which characterizes the relationship between color information and concentrations of the biological/chemical assay sample. For this aim, a designated optical setup combined with image processing and data analyzing techniques was implemented. A series of experiments conducted on selected assays, including fluorescein, RNA Mango, homogenized milk and yeast have demonstrated that the proposed system estimates the concentration of fluorescent materials and colloidal mixtures comparable to currently used commercial and laboratory instruments. Furthermore, by utilizing the camera and computational power of smartphones, eventual development can be directed toward extremely compact, inexpensive and portable analysis and diagnostic systems which will allow experiments and tests to be conducted in remote or impoverished areas.
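The core calibration idea, mapping a color statistic from the camera image to assay concentration via a fitted curve, can be sketched with a least-squares fit. The intensities and concentrations below are hypothetical, and a real pipeline would first extract a mean ROI channel value from each photo:

```python
import numpy as np

def calibrate(mean_intensities, concentrations, degree=1):
    """Fit a polynomial calibration curve mapping a mean ROI channel
    intensity to assay concentration (least squares)."""
    return np.polyfit(mean_intensities, concentrations, degree)

def estimate(coeffs, mean_intensity):
    """Predict the concentration of a new sample from its mean ROI intensity."""
    return float(np.polyval(coeffs, mean_intensity))

# Hypothetical calibration points: intensity grows with concentration
intensity = np.array([20.0, 60.0, 100.0, 140.0])   # mean green-channel values
conc_uM = np.array([0.0, 1.0, 2.0, 3.0])           # known concentrations (uM)
coeffs = calibrate(intensity, conc_uM)
print(estimate(coeffs, 80.0))   # approximately 1.5
```

Building the image database described in the abstract amounts to collecting many such (intensity, concentration) pairs per assay under the fixed optical setup, so the fitted curve absorbs the camera's response.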
[1111] DeepBayesFlow: A Bayesian Structured Variational Framework for Generalizable Prostate Segmentation via Expressive Posteriors and SDE-Girsanov Uncertainty Modeling
Zhuoyi Fang
Main category: eess.IV
TL;DR: DeepBayesFlow is a Bayesian segmentation framework for prostate MRI that uses normalizing flows, non-conjugate variational inference, and stochastic differential equations to improve robustness and generalization across clinical domains.
Details
Motivation: Prostate MRI segmentation faces challenges from inter-patient anatomical variability, blurred tissue boundaries, and distribution shifts from diverse imaging protocols. Current methods struggle with robustness and generalization across clinical domains.
Method: Three key innovations: 1) NF-Posterior module using normalizing flows to model complex latent distributions, 2) NCVI inference removing conjugacy constraints for flexible posterior learning, 3) SDE-Girsanov module refining latent representations via diffusion and measure transformation for temporal coherence and uncertainty quantification.
Result: DeepBayesFlow achieves accurate and interpretable segmentation across heterogeneous prostate MRI datasets by capturing domain-invariant structural priors while dynamically adapting to domain-specific variations.
Conclusion: The proposed Bayesian framework enhances robustness and generalization for medical image segmentation, particularly addressing challenges in prostate MRI analysis across diverse clinical settings.
Abstract: Automatic prostate MRI segmentation faces persistent challenges due to inter-patient anatomical variability, blurred tissue boundaries, and distribution shifts arising from diverse imaging protocols. To address these issues, we propose DeepBayesFlow, a novel Bayesian segmentation framework designed to enhance both robustness and generalization across clinical domains. DeepBayesFlow introduces three key innovations: a learnable NF-Posterior module based on normalizing flows that models complex, data-adaptive latent distributions; an NCVI inference mechanism that removes conjugacy constraints to enable flexible posterior learning in high-dimensional settings; and an SDE-Girsanov module that refines latent representations via time-continuous diffusion and formal measure transformation, injecting temporal coherence and physically grounded uncertainty into the inference process. Together, these components allow DeepBayesFlow to capture domain-invariant structural priors while dynamically adapting to domain-specific variations, achieving accurate and interpretable segmentation across heterogeneous prostate MRI datasets.
[1112] Guided Lensless Polarization Imaging
Noa Kraicer, Erez Yosef, Raja Giryes
Main category: eess.IV
TL;DR: A novel lensless polarization imaging system that uses an auxiliary RGB camera to guide reconstruction, combining physics-based inversion with Transformer-based fusion to achieve high-quality polarization images from compact, low-cost hardware.
Details
Motivation: Polarization imaging reveals valuable information invisible to human vision but conventional polarization cameras are expensive and bulky. Lensless imaging offers compact, low-cost alternatives but existing lensless polarization systems suffer from poor reconstruction quality.
Method: Two-stage pipeline: 1) Physics-based inversion recovers initial polarization image, 2) Transformer-based fusion network refines reconstruction using RGB guidance from an auxiliary conventional RGB camera. Combines compact polarization-RGB sensor with widely available RGB camera.
Result: Significantly improves reconstruction quality and fidelity over lensless-only baselines, generalizes across datasets and imaging conditions, achieves high-quality real-world results on physical prototype without fine-tuning.
Conclusion: The RGB-guided lensless polarization imaging system enables high-quality polarization imaging with compact, low-cost hardware, overcoming limitations of both conventional polarization cameras and existing lensless polarization systems.
Abstract: Polarization imaging captures the polarization state of light, revealing information invisible to the human eye yet valuable in domains such as biomedical diagnostics, autonomous driving, and remote sensing. However, conventional polarization cameras are often expensive, bulky, or both, limiting their practical use. Lensless imaging offers a compact, low-cost alternative by replacing the lens with a simple optical element like a diffuser and performing computational reconstruction, but existing lensless polarization systems suffer from limited reconstruction quality. To overcome these limitations, we introduce an RGB-guided lensless polarization imaging system that combines a compact polarization-RGB sensor with an auxiliary, widely available conventional RGB camera providing structural guidance. We reconstruct multi-angle polarization images for each RGB color channel through a two-stage pipeline: a physics-based inversion recovers an initial polarization image, followed by a Transformer-based fusion network that refines this reconstruction using the RGB guidance image from the conventional RGB camera. Our two-stage method significantly improves reconstruction quality and fidelity over lensless-only baselines, generalizes across datasets and imaging conditions, and achieves high-quality real-world results on our physical prototype lensless camera without any fine-tuning.
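The first stage of the pipeline above is a physics-based inversion. As an illustrative sketch (not the authors' exact operator), a Tikhonov-regularized deconvolution in the Fourier domain against a known diffuser PSF recovers such an initial estimate:

```python
import numpy as np

def tikhonov_inversion(measurement, psf, reg=1e-4):
    """Stage-one style physics-based inversion: Tikhonov-regularized
    deconvolution in the Fourier domain, assuming the lensless forward
    model is a convolution with a known diffuser PSF (an illustrative
    simplification of the paper's setup)."""
    H = np.fft.fft2(psf, s=measurement.shape)
    Y = np.fft.fft2(measurement)
    X = np.conj(H) * Y / (np.abs(H) ** 2 + reg)
    return np.real(np.fft.ifft2(X))

# Toy check: blur an image with a small PSF, then invert.
rng = np.random.default_rng(0)
img = rng.random((32, 32))
psf = np.zeros((32, 32))
psf[0, 0], psf[0, 1] = 0.8, 0.2
blurred = np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(psf)))
recovered = tikhonov_inversion(blurred, psf)
```

In the paper, such a coarse inversion is then refined by the Transformer-based fusion network using the RGB guidance image.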
[1113] Deep Learning Based Site-Specific Channel Inference Using Satellite Images
Junzhe Song, Ruisi He, Mi Yang, Zhengyu Zhang, Shuaiqi Gao, Bo Ai
Main category: eess.IV
TL;DR: Deep learning framework using satellite images to predict wireless channel parameters for site-specific channel inference, achieving high-quality CIR reconstruction.
Details
Motivation: Traditional wireless channel inference methods are unscalable, and existing AI approaches using satellite images only predict large-scale fading parameters, unable to reconstruct complete channel impulse response needed for next-generation wireless systems.
Method: Proposes a deep learning-based framework using satellite images to predict structured Tapped Delay Line parameters. Creates joint channel-satellite dataset from measurements, uses cross-attention-fused dual-branch pipeline for macroscopic/microscopic environmental feature extraction, and recurrent tracking module for multipath component evolution.
Result: Achieves high-quality CIR reconstruction in unseen scenarios with Power Delay Profile Average Cosine Similarity exceeding 0.96, demonstrating effective site-specific channel inference.
Conclusion: Provides a pathway toward site-specific channel inference for future dynamic wireless networks by enabling complete CIR reconstruction from satellite imagery.
Abstract: Site-specific channel inference plays a critical role in the design and evaluation of next-generation wireless communication systems by considering the surrounding propagation environment. However, traditional methods are unscalable, while existing AI-based approaches using satellite images are confined to predicting large-scale fading parameters, lacking the capacity to reconstruct the complete channel impulse response (CIR). To address this limitation, we propose a deep learning-based site-specific channel inference framework using satellite images to predict structured Tapped Delay Line (TDL) parameters. We first establish a joint channel-satellite dataset based on measurements. Then, a novel deep learning network is developed to reconstruct the channel parameters. Specifically, a cross-attention-fused dual-branch pipeline extracts macroscopic and microscopic environmental features, while a recurrent tracking module captures the long-term dynamic evolution of multipath components. Experimental results demonstrate that the proposed method achieves high-quality reconstruction of the CIR in unseen scenarios, with a Power Delay Profile (PDP) Average Cosine Similarity exceeding 0.96. This work provides a pathway toward site-specific channel inference for future dynamic wireless networks.
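The reported metric, PDP Average Cosine Similarity, compares the per-tap power profiles of predicted and measured CIRs. A minimal sketch, with the averaging convention assumed:

```python
import numpy as np

def pdp_cosine_similarity(cir_pred, cir_true):
    """Average cosine similarity between Power Delay Profiles (PDPs).
    Each row is one snapshot's complex CIR across delay taps; the PDP
    is the per-tap power. Mirrors the paper's reported metric in
    spirit; the exact averaging convention is an assumption."""
    pdp_pred = np.abs(cir_pred) ** 2
    pdp_true = np.abs(cir_true) ** 2
    num = np.sum(pdp_pred * pdp_true, axis=1)
    den = np.linalg.norm(pdp_pred, axis=1) * np.linalg.norm(pdp_true, axis=1)
    return float(np.mean(num / den))

rng = np.random.default_rng(1)
cir = rng.normal(size=(8, 16)) + 1j * rng.normal(size=(8, 16))
noise = 0.05 * (rng.normal(size=cir.shape) + 1j * rng.normal(size=cir.shape))
score = pdp_cosine_similarity(cir + noise, cir)  # near 1 for small perturbations
```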
[1114] Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Muyang He, Hanzhong Guo, Junxiong Lin, Yizhou Yu
Main category: eess.IV
TL;DR: A comprehensive review of efficient video generation techniques for world simulation, focusing on reducing computational costs while maintaining simulation capabilities.
Details
Motivation: Video generation models have potential as world simulators but face heavy computational costs in spatiotemporal modeling, creating a gap between theoretical capacity and practical deployment.
Method: Introduces a three-dimensional taxonomy: 1) efficient modeling paradigms, 2) efficient network architectures, and 3) efficient inference algorithms for video generation.
Result: Shows that improving efficiency enables interactive applications like autonomous driving, embodied AI, and game simulation, and identifies emerging research frontiers.
Conclusion: Efficiency is a fundamental prerequisite for evolving video generators into general-purpose, real-time, robust world simulators.
Abstract: The rapid evolution of video generation has enabled models to simulate complex physical dynamics and long-horizon causalities, positioning them as potential world simulators. However, a critical gap still remains between the theoretical capacity for world simulation and the heavy computational costs of spatiotemporal modeling. To address this, we comprehensively and systematically review video generation frameworks and techniques that consider efficiency as a crucial requirement for practical world modeling. We introduce a novel taxonomy in three dimensions: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. We further show that bridging this efficiency gap directly empowers interactive applications such as autonomous driving, embodied AI, and game simulation. Finally, we identify emerging research frontiers in efficient video-based world modeling, arguing that efficiency is a fundamental prerequisite for evolving video generators into general-purpose, real-time, and robust world simulators.
[1115] MRI-to-CT synthesis using drifting models
Qing Lyu, Jianxu Wang, Jeremy Hudson, Ge Wang, Christopher T. Whitlow
Main category: eess.IV
TL;DR: Drifting models outperform diffusion and other generative methods for fast, high-quality MRI-to-CT synthesis in pelvic imaging, achieving superior bone detail with millisecond inference times.
Details
Motivation: Enable MR-only pelvic workflows by synthesizing CT-like images from MRI to provide bone details without additional ionizing radiation, which would benefit radiotherapy planning and PET/MR attenuation correction.
Method: Benchmark drifting models against various baselines including UNet, VAE, WGAN-GP, physics-inspired probabilistic models (PPFM), and diffusion methods (FastDDPM, DDIM, DDPM) on two pelvic datasets (Gold Atlas Male Pelvis and SynthRAD2023). Evaluate with SSIM, PSNR, RMSE metrics and qualitative assessment of critical anatomical regions.
Result: Drifting models achieve highest SSIM and PSNR, lowest RMSE across both datasets, surpassing all baselines. Visual inspection shows sharper cortical bone edges, improved sacral/femoral head geometry, reduced artifacts at tissue boundaries, with one-step inference in milliseconds.
Conclusion: Drifting models offer promising fast, high-quality synthetic CT generation from MRI for pelvic applications, with favorable accuracy-efficiency trade-off compared to iterative diffusion methods, warranting further investigation for clinical applications.
Abstract: Accurate MRI-to-CT synthesis could enable MR-only pelvic workflows by providing CT-like images with bone details while avoiding additional ionizing radiation. In this work, we investigate recently proposed drifting models for synthesizing pelvis CT images from MRI and benchmark them against convolutional neural networks (UNet, VAE), a generative adversarial network (WGAN-GP), a physics-inspired probabilistic model (PPFM), and diffusion-based methods (FastDDPM, DDIM, DDPM). Experiments are performed on two complementary datasets: Gold Atlas Male Pelvis and the SynthRAD2023 pelvis subset. Image fidelity and structural consistency are evaluated with SSIM, PSNR, and RMSE, complemented by qualitative assessment of anatomically critical regions such as cortical bone and pelvic soft-tissue interfaces. Across both datasets, the proposed drifting model achieves high SSIM and PSNR and low RMSE, surpassing strong diffusion baselines and conventional CNN-, VAE-, GAN-, and PPFM-based methods. Visual inspection shows sharper cortical bone edges, improved depiction of sacral and femoral head geometry, and reduced artifacts or over-smoothing, particularly at bone-air-soft tissue boundaries. Moreover, the drifting model attains these gains with one-step inference and inference times on the order of milliseconds, yielding a more favorable accuracy-efficiency trade-off than iterative diffusion sampling while remaining competitive in image quality. These findings suggest that drifting models are a promising direction for fast, high-quality pelvic synthetic CT generation from MRI and warrant further investigation for downstream applications such as MRI-only radiotherapy planning and PET/MR attenuation correction.
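The evaluation relies on the standard fidelity metrics SSIM, PSNR, and RMSE. A minimal sketch of the latter two (SSIM needs a windowed implementation, e.g. scikit-image's structural_similarity, and is omitted here):

```python
import numpy as np

def rmse(pred, target):
    """Root-mean-square error between two images."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, data_range]."""
    mse = np.mean((pred - target) ** 2)
    return float(10 * np.log10(data_range ** 2 / mse))

# Toy example: a synthetic CT slice perturbed by mild noise.
rng = np.random.default_rng(2)
ct = rng.random((64, 64))
synth = np.clip(ct + 0.01 * rng.normal(size=ct.shape), 0.0, 1.0)
```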
[1116] Learning a dynamic four-chamber shape model of the human heart for 95,695 UK Biobank participants
Qiang Ma, Qingjie Meng, Yicheng Wu, Shuo Wang, Mengyun Qiao, Steven Niederer, Declan P. O’Regan, Paul M. Matthews, Wenjia Bai
Main category: eess.IV
TL;DR: A deep learning pipeline creates 3D+t statistical shape models of all four cardiac chambers from 100,000 UK Biobank participants, revealing associations with demographics and diseases, and enhancing cardiovascular disease classification and heart age prediction.
Details
Motivation: Existing heart shape models focus mainly on ventricles and use small datasets. There's a need for comprehensive four-chamber models from large populations to better understand cardiac shape variations and their clinical significance.
Method: Developed deep learning pipeline to reconstruct 3D+t four-chamber meshes from cardiac MRI of 100,000 UK Biobank participants. Learned statistical shape model to characterize shape variations and motion patterns, then analyzed associations with clinical factors.
Result: Revealed associations between four-chamber shape and demographics, anthropometrics, cardiovascular risk factors, and diseases. Shape-derived phenotypes significantly outperformed conventional image-derived phenotypes in disease classification and heart age prediction. Demonstrated novel applications in heart shape retrieval and re-identification.
Conclusion: Large-scale four-chamber cardiac shape modeling provides valuable insights into cardiac structure-function relationships and offers superior biomarkers for cardiovascular assessment compared to traditional methods.
Abstract: The human heart is a sophisticated system composed of four cardiac chambers with distinct shapes, which function in a coordinated manner. Existing shape models of the heart mainly focus on the ventricular chambers and they are derived from relatively small datasets. Here, we present a spatio-temporal (3D+t) statistical shape model of all four cardiac chambers, learnt from a large population of nearly 100,000 participants from the UK Biobank. A deep learning-based pipeline is developed to reconstruct 3D+t four-chamber meshes from the cardiac magnetic resonance images of the UK Biobank imaging population. Based on the reconstructed meshes, a 3D+t statistical shape model is learnt to characterise the shape variations and motion patterns of the four cardiac chambers. We reveal the associations of the four-chamber shape model with demographics, anthropometrics, cardiovascular risk factors, and cardiac diseases. Compared to conventional image-derived phenotypes, we validate that the four-chamber shape-derived phenotypes significantly enhance the performance in downstream tasks, including cardiovascular disease classification and heart age prediction. Furthermore, we demonstrate the effectiveness of shape-derived phenotypes in novel applications such as heart shape retrieval and heart re-identification from longitudinal data. To facilitate future research, we will release the learning-based mesh reconstruction pipeline, the four-chamber cardiac shape model, and return all derived four-chamber meshes to the UK Biobank.
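A statistical shape model of the kind described (mean shape plus principal modes of variation over corresponding mesh vertices) follows the textbook point-distribution-model recipe, sketched here with PCA via SVD; this is an assumed standard construction, not the paper's exact pipeline:

```python
import numpy as np

def fit_shape_model(meshes, n_modes=2):
    """Point-distribution statistical shape model: mean shape plus
    principal modes of variation. `meshes` is (n_subjects, n_points*3)
    of corresponding vertex coordinates."""
    mean = meshes.mean(axis=0)
    U, S, Vt = np.linalg.svd(meshes - mean, full_matrices=False)
    modes = Vt[:n_modes]                         # principal variation modes
    stdevs = S[:n_modes] / np.sqrt(len(meshes) - 1)
    return mean, modes, stdevs

def synthesize(mean, modes, stdevs, coeffs):
    """Generate a new shape from mode coefficients (in standard deviations)."""
    return mean + (np.asarray(coeffs) * stdevs) @ modes

rng = np.random.default_rng(3)
shapes = rng.normal(size=(50, 30))               # 50 subjects, 10 points x 3 coords
mean, modes, stdevs = fit_shape_model(shapes, n_modes=2)
new_shape = synthesize(mean, modes, stdevs, [1.0, -0.5])
```

Downstream shape-derived phenotypes in the paper would correspond to the per-subject mode coefficients.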
[1117] Image-Adaptive GAN based Reconstruction
Shady Abu Hussein, Tom Tirer, Raja Giryes
Main category: eess.IV
TL;DR: Proposes making pre-trained generative models image-adaptive to improve representation capabilities for solving imaging inverse problems like super-resolution and compressed sensing.
Details
Motivation: Current deep generative models (VAEs, GANs) have limited representation capabilities that don't fully capture complex image distributions like human faces, which becomes evident when using pre-trained models for imaging inverse problems.
Method: Make generators image-adaptive and enforce compliance with observations via back-projections to mitigate limited representation capabilities of pre-trained generative models.
Result: Empirical demonstration shows advantages for image super-resolution and compressed sensing tasks.
Conclusion: Image-adaptive generators with back-projection constraints improve restoration quality by addressing representation limitations of standard generative models.
Abstract: In the recent years, there has been a significant improvement in the quality of samples produced by (deep) generative models such as variational auto-encoders and generative adversarial networks. However, the representation capabilities of these methods still do not capture the full distribution for complex classes of images, such as human faces. This deficiency has been clearly observed in previous works that use pre-trained generative models to solve imaging inverse problems. In this paper, we suggest to mitigate the limited representation capabilities of generators by making them image-adaptive and enforcing compliance of the restoration with the observations via back-projections. We empirically demonstrate the advantages of our proposed approach for image super-resolution and compressed sensing.
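The back-projection step described above, enforcing compliance of the restoration with the observations, can be sketched for the noiseless linear case as a projection onto the affine subspace {x : Ax = y} via the pseudo-inverse (a generic form; the paper's exact formulation may differ):

```python
import numpy as np

def back_project(x, A, y):
    """Project a generator output x onto {x : A x = y} so the
    restoration exactly reproduces the observations (noiseless case):
    x <- x + A^+ (y - A x)."""
    return x + np.linalg.pinv(A) @ (y - A @ x)

rng = np.random.default_rng(4)
A = rng.normal(size=(20, 50))      # e.g. a compressed-sensing operator
x_true = rng.normal(size=50)
y = A @ x_true                     # observations
x_gen = rng.normal(size=50)        # stand-in for a generator output G(z)
x_bp = back_project(x_gen, A, y)   # now consistent with y
```

In the image-adaptive setting, such a projection alternates with updates to the generator parameters so that the output both lies near the learned image manifold and matches the measurements.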
[1118] When Mamba Meets xLSTM: An Efficient and Precise Method with the xLSTM-VMUNet Model for Skin lesion Segmentation
Zhuoyi Fang, Jiajia Liu, Kexuan Shi, Qiang Han
Main category: eess.IV
TL;DR: xLSTM-VMUNet model for melanoma segmentation that jointly captures spatial and sequential features in dermatological images, improving accuracy over previous methods.
Details
Motivation: Previous melanoma segmentation approaches overlooked the need to jointly capture spatial and sequential features, lacked global receptive field, and had computational efficiency issues, especially for lesions with indistinct borders or similar structures.
Method: Proposes xLSTM-VMUNet model that combines spatial feature extraction with sequential processing using LSTM variants, enhancing contextual understanding of complex medical image structures.
Result: Outperforms VMUNet by 4.85% on DSC and 6.41% on IoU on ISIC2017 dataset, and by 1.25% on DSC and 2.07% on IoU on ISIC2018 dataset, with faster convergence and consistent high performance.
Conclusion: xLSTM-VMUNet effectively addresses limitations of previous methods by jointly capturing spatial and sequential features, improving melanoma segmentation accuracy for early skin cancer detection.
Abstract: Automatic melanoma segmentation is essential for early skin cancer detection, yet challenges arise from the heterogeneity of melanoma, as well as interfering factors like blurred boundaries, low contrast, and imaging artifacts. While numerous algorithms have been developed to address these issues, previous approaches have often overlooked the need to jointly capture spatial and sequential features within dermatological images. This limitation hampers segmentation accuracy, especially in cases with indistinct borders or structurally similar lesions. Additionally, previous models lacked both a global receptive field and high computational efficiency. In this work, we present the xLSTM-VMUNet Model, which jointly captures spatial and sequential features within dermatological images. xLSTM-VMUNet can not only specialize in extracting spatial features from images, focusing on the structural characteristics of skin lesions, but also enhance contextual understanding, allowing more effective handling of complex medical image structures. Experiment results demonstrate that xLSTM-VMUNet outperforms VMUNet by 4.85% on DSC and 6.41% on IoU on the ISIC2017 dataset, and by 1.25% on DSC and 2.07% on IoU on the ISIC2018 dataset, with faster convergence and consistently high segmentation performance. Our code is available at https://github.com/FangZhuoyi/XLSTM-VMUNet.
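The DSC and IoU figures reported above are the standard overlap metrics for binary masks, sketched minimally:

```python
import numpy as np

def dice_iou(pred, target):
    """Dice coefficient (DSC) and IoU for binary segmentation masks,
    the two metrics reported on ISIC2017/ISIC2018."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    dsc = 2.0 * inter / (pred.sum() + target.sum())
    iou = inter / np.logical_or(pred, target).sum()
    return float(dsc), float(iou)

# Two offset 4x4 squares: 16 px each, overlapping in a 3x3 patch (9 px).
pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:6] = True
target = np.zeros((8, 8), dtype=bool); target[3:7, 3:7] = True
dsc, iou = dice_iou(pred, target)  # DSC = 18/32 = 0.5625, IoU = 9/23
```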
[1119] TimeFlow: Temporal Conditioning for Longitudinal Brain MRI Registration and Aging Analysis
Bailiang Jian, Jiazhen Pan, Yitong Li, Fabian Bongratz, Ruochen Li, Daniel Rueckert, Benedikt Wiestler, Christian Wachinger
Main category: eess.IV
TL;DR: TimeFlow: A learning-based framework for longitudinal brain MRI registration that models neuroanatomy as continuous function of age, enabling accurate deformation field estimation and future brain state prediction from only two scans.
Details
Motivation: Existing longitudinal brain MRI registration methods have limitations: they require densely sampled time series, struggle with accuracy vs. temporal smoothness trade-offs, and cannot prospectively forecast future brain states. These limitations hinder comprehensive analysis of brain aging and disease progression.
Method: TimeFlow uses a U-Net backbone with temporal conditioning to model neuroanatomy as a continuous function of age. It introduces inter-/extra-polation consistency constraints applied to both deformation fields and deformed images, enabling temporal consistency without explicit smoothness regularizers. The framework can estimate deformation fields and predict future brain states from just two scans.
Result: TimeFlow outperforms state-of-the-art methods in both future timepoint forecasting and registration accuracy. It enables novel biological brain aging analyses by differentiating neurodegenerative trajectories from normal aging without requiring segmentation, eliminating labor-intensive annotations and segmentation inconsistencies.
Conclusion: TimeFlow provides an accurate, data-efficient, and annotation-free framework for longitudinal analysis of brain aging and chronic diseases, capable of forecasting brain changes beyond observed study periods and supporting novel biological insights without segmentation requirements.
Abstract: Longitudinal brain analysis is essential for understanding healthy aging and identifying pathological deviations. Longitudinal registration of sequential brain MRI underpins such analyses. However, existing methods are limited by reliance on densely sampled time series, a trade-off between accuracy and temporal smoothness, and an inability to prospectively forecast future brain states. To overcome these challenges, we introduce TimeFlow, a learning-based framework for longitudinal brain MRI registration. TimeFlow uses a U-Net backbone with temporal conditioning to model neuroanatomy as a continuous function of age. Given only two scans from an individual, TimeFlow estimates accurate and temporally coherent deformation fields, enabling non-linear extrapolation to predict future brain states. This is achieved by our proposed inter-/extra-polation consistency constraints applied to both the deformation fields and deformed images. Remarkably, these constraints preserve temporal consistency and continuity without requiring explicit smoothness regularizers or densely sampled sequential data. Extensive experiments demonstrate that TimeFlow outperforms state-of-the-art methods in terms of both future timepoint forecasting and registration accuracy. Moreover, TimeFlow supports novel biological brain aging analyses by differentiating neurodegenerative trajectories from normal aging without requiring segmentation, thereby eliminating the need for labor-intensive annotations and mitigating segmentation inconsistency. TimeFlow offers an accurate, data-efficient, and annotation-free framework for longitudinal analysis of brain aging and chronic diseases, capable of forecasting brain changes beyond the observed study period.
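One plausible reading of the inter-/extra-polation consistency constraint on deformation fields, sketched under a small-displacement (additive) approximation; TimeFlow actually composes full deformations and also constrains the warped images, so this is an illustrative simplification only:

```python
import numpy as np

def consistency_penalty(u_12, u_23, u_13):
    """Consistency between the direct displacement field t1->t3 and the
    composition of t1->t2 and t2->t3. Under a small-displacement
    approximation, composition reduces to addition of the fields."""
    return float(np.mean((u_13 - (u_12 + u_23)) ** 2))

rng = np.random.default_rng(5)
u_12 = 0.1 * rng.normal(size=(16, 16, 2))   # per-pixel (dx, dy) displacements
u_23 = 0.1 * rng.normal(size=(16, 16, 2))
penalty_consistent = consistency_penalty(u_12, u_23, u_12 + u_23)   # zero
penalty_off = consistency_penalty(u_12, u_23, np.zeros_like(u_12))  # positive
```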
[1120] Reconstruct Anything Model: a lightweight general model for computational imaging
Matthieu Terris, Samuel Hurault, Maxime Song, Julian Tachella
Main category: eess.IV
TL;DR: A novel non-iterative, lightweight architecture for solving diverse imaging inverse problems without iterative methods or problem-specific unrolled networks, featuring few-shot adaptation to new tasks.
Details
Motivation: Existing learning-based methods for imaging inverse problems have limitations: iterative methods (like plug-and-play and diffusion) are computationally costly with suboptimal performance, while unrolled architectures are problem-specific and require expensive training. There's a need for a more efficient, generalizable approach.
Method: Proposes a non-iterative, lightweight architecture that incorporates knowledge about the forward operator (acquisition physics and noise parameters) without relying on unrolling. The model is trained to solve a wide range of inverse problems (deblurring, MRI, CT, inpainting, super-resolution) and handles arbitrary image sizes/channels. It can adapt to unseen problems/datasets with few fine-tuning steps in a self-supervised way without ground-truth references.
Result: Demonstrates state-of-the-art performance across medical imaging, low-photon imaging, and microscopy applications. The model shows strong generalization capabilities and efficient adaptation to new tasks with minimal data.
Conclusion: The proposed architecture offers an efficient, generalizable solution to imaging inverse problems that bridges the gap between computationally expensive iterative methods and problem-specific unrolled networks, with practical applications across diverse imaging domains.
Abstract: Most existing learning-based methods for solving imaging inverse problems can be roughly divided into two classes: iterative algorithms, such as plug-and-play and diffusion methods leveraging pretrained denoisers, and unrolled architectures that are trained end-to-end for specific imaging problems. Iterative methods in the first class are computationally costly and often yield suboptimal reconstruction performance, whereas unrolled architectures are generally problem-specific and require expensive training. In this work, we propose a novel non-iterative, lightweight architecture that incorporates knowledge about the forward operator (acquisition physics and noise parameters) without relying on unrolling. Our model is trained to solve a wide range of inverse problems, such as deblurring, magnetic resonance imaging, computed tomography, inpainting, and super-resolution, and handles arbitrary image sizes and channels, such as grayscale, complex, and color data. The proposed model can be easily adapted to unseen inverse problems or datasets with a few fine-tuning steps (up to a few images) in a self-supervised way, without ground-truth references. Throughout a series of experiments, we demonstrate state-of-the-art performance from medical imaging to low-photon imaging and microscopy. Our code is available at https://github.com/matthieutrs/ram.
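Self-supervised adaptation without ground-truth references implies scoring reconstructions against the raw measurements through the known forward operator. The sketch below assumes a simple measurement-consistency loss L = ||A x̂ − y||², which may differ from the authors' full recipe:

```python
import numpy as np

def measurement_consistency_loss(x_hat, A, y):
    """With no ground-truth image available, a reconstruction x_hat can
    still be scored against the measurements y via the known forward
    operator A: L = ||A x_hat - y||^2."""
    r = A @ x_hat - y
    return float(r @ r)

# Inpainting example: the forward operator keeps a random subset of pixels.
rng = np.random.default_rng(6)
x_true = rng.random(100)
keep = rng.choice(100, size=60, replace=False)
A = np.eye(100)[keep]                 # masking/inpainting operator
y = A @ x_true
loss_true = measurement_consistency_loss(x_true, A, y)   # zero: x_true explains y
loss_rand = measurement_consistency_loss(rng.random(100), A, y)
```

Fine-tuning would descend this loss (plus any regularization) with respect to the network parameters on the few available measurement sets.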
[1121] Can Generalist Vision Language Models (VLMs) Rival Specialist Medical VLMs? Benchmarking and Strategic Insights
Yuan Zhong, Ruinan Jin, Qi Dou, Xiaoxiao Li
Main category: eess.IV
TL;DR: Generalist VLMs can match or exceed specialist medical VLMs in most clinical tasks, especially for unseen modalities, offering a scalable alternative to specialized medical AI development.
Details
Motivation: To understand when generalist vs. specialist medical Vision Language Models perform best in clinical settings, given the high costs of developing specialist medical VLMs.
Method: Comparative analysis of specialist medical VLMs and efficiently fine-tuned generalist VLMs across various clinical tasks and modalities.
Result: Fine-tuned generalist VLMs achieve comparable or superior performance to specialist medical VLMs in most tasks, particularly for out-of-distribution or rare medical modalities.
Conclusion: Generalist VLMs offer a scalable, cost-effective pathway for clinical AI development, challenging the necessity of specialized medical pretraining for many applications.
Abstract: Vision Language Models (VLMs) have shown promise in automating image diagnosis and interpretation in clinical settings. However, developing specialist medical VLMs requires substantial computational resources and carefully curated datasets, and it remains unclear under which conditions generalist and specialist medical VLMs each perform best. This study highlights the complementary strengths of specialist medical and generalist VLMs. Specialists remain valuable in modality-aligned use cases, but we find that efficiently fine-tuned generalist VLMs can achieve comparable or even superior performance in most tasks, particularly when transferring to unseen or rare OOD medical modalities. These results suggest that generalist VLMs, rather than being constrained by their lack of specialist medical pretraining, may offer a scalable and cost-effective pathway for advancing clinical AI development.
[1122] Diffusion-Based Quality Control of Medical Image Segmentations across Organs
Vincenzo Marcianò, Hava Chaptoukaev, Virginia Fernandez, M. Jorge Cardoso, Sébastien Ourselin, Michela Antonelli, Maria A. Zuluaga
Main category: eess.IV
TL;DR: nnQC is a diffusion-based quality control framework for medical image segmentation that adapts to any organ/dataset without retraining, using a Team of Experts architecture to generate pseudo-ground truth for QC scoring.
Details
Motivation: Deep learning medical image segmentation methods often produce anatomically implausible hallucinations, requiring quality control. Existing QC methods are organ-specific and lack generalizability across different organs, datasets, and imaging modalities.
Method: Proposes nnQC with a novel Team of Experts (ToE) architecture: two experts encode 3D spatial awareness (slice position) and anatomical information (visual features). A weighted conditional module combines these embeddings to condition a diffusion process, generating spatially aware pseudo-ground truth for QC score prediction. Includes fingerprint adaptation for cross-organ/dataset/modality adaptability.
Result: Evaluated on 7 organs using 12 public datasets. nnQC consistently outperforms state-of-the-art methods across all experiments, including cases with highly degraded or completely missing segmentation masks, demonstrating versatility and effectiveness.
Conclusion: nnQC provides a robust, generalizable quality control framework for medical image segmentation that self-adapts to any organ/dataset without retraining, addressing limitations of organ-specific QC methods through diffusion-based generation and expert ensemble architecture.
Abstract: Medical image segmentation using deep learning (DL) has enabled the development of automated analysis pipelines for large-scale population studies. However, state-of-the-art DL methods are prone to hallucinations, which can result in anatomically implausible segmentations. With manual correction impractical at scale, automated quality control (QC) techniques have to address the challenge. While promising, existing QC methods are organ-specific, limiting their generalizability and usability beyond their original intended task. To overcome this limitation, we propose no-new Quality Control (nnQC), a robust QC framework based on a diffusion-generative paradigm that self-adapts to any input organ dataset. Central to nnQC is a novel Team of Experts (ToE) architecture, where two specialized experts independently encode 3D spatial awareness, represented by the relative spatial position of an axial slice, and anatomical information derived from visual features from the original image. A weighted conditional module dynamically combines the pair of independent embeddings, or opinions to condition the sampling mechanism within a diffusion process, enabling the generation of a spatially aware pseudo-ground truth for predicting QC scores. Within its framework, nnQC integrates fingerprint adaptation to ensure adaptability across organs, datasets, and imaging modalities. We evaluated nnQC on seven organs using twelve publicly available datasets. Our results demonstrate that nnQC consistently outperforms state-of-the-art methods across all experiments, including cases where segmentation masks are highly degraded or completely missing, confirming its versatility and effectiveness across different organs.
[1123] cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold
Zain Shabeeb, Daniel Saeedi, Darin Tsui, Vida Jamali, Amirali Aghazadeh
Main category: eess.IV
TL;DR: cryoSENSE is a compressive sensing framework for cryo-EM that uses sparse and generative priors to enable high-throughput acquisition while preserving structural resolution.
Details
Motivation: Cryo-EM generates massive data volumes that exceed storage/transfer bandwidth, constraining practical throughput. There's a need for compressive sensing methods that can reduce data acquisition while preserving structural information.
Method: Hardware-software co-designed framework using low-dimensional manifold representations: sparse priors in predefined bases and generative priors captured by a denoising diffusion model. Enables reconstruction from spatial and Fourier-domain undersampled measurements.
Result: Increases acquisition throughput by up to 2.5× while retaining original 3D resolution. Sparse priors work well for Fourier-domain measurements with moderate compression; diffusion priors excel with pixel-domain measurements and severe undersampling.
Conclusion: cryoSENSE offers controllable trade-offs between measurement number and downsampling level, enabling practical cryo-EM throughput improvements without sacrificing structural resolution.
Abstract: Cryo-electron microscopy (cryo-EM) enables the atomic-resolution visualization of biomolecules; however, modern direct detectors generate data volumes that far exceed the available storage and transfer bandwidth, thereby constraining practical throughput. We introduce cryoSENSE, the computational realization of a hardware-software co-designed framework for compressive cryo-EM sensing and acquisition. We show that cryo-EM images of proteins lie on low-dimensional manifolds that can be independently represented using sparse priors in predefined bases and generative priors captured by a denoising diffusion model. cryoSENSE leverages these low-dimensional manifolds to enable faithful image reconstruction from spatial and Fourier-domain undersampled measurements while preserving downstream structural resolution. In experiments, cryoSENSE increases acquisition throughput by up to 2.5$\times$ while retaining the original 3D resolution, offering controllable trade-offs between the number of masked measurements and the level of downsampling. Sparse priors favor faithful reconstruction from Fourier-domain measurements and moderate compression, whereas generative diffusion priors achieve accurate recovery from pixel-domain measurements and more severe undersampling. Project website: https://cryosense.github.io.
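The sparse-prior recovery path can be sketched with a generic compressive-sensing example: reconstructing a sparse signal from 2x-undersampled linear measurements via iterative soft-thresholding (ISTA). This is a textbook illustration, not cryoSENSE's pipeline; the Gaussian measurement matrix, identity sparsity basis, and all parameters are toy assumptions.

```python
import numpy as np

def ista(A, y, lam=0.02, n_iter=500):
    """Recover a sparse x from undersampled measurements y = A @ x by
    minimizing 0.5*||y - Ax||^2 + lam*||x||_1 with proximal gradient steps."""
    L = np.linalg.norm(A, 2) ** 2                # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = x + A.T @ (y - A @ x) / L            # gradient step on the data term
        x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0.0)  # soft-threshold
    return x

rng = np.random.default_rng(0)
n, m = 64, 32                                    # 2x undersampling
x_true = np.zeros(n)
x_true[[5, 20, 41]] = [1.0, -0.8, 0.6]           # 3-sparse ground truth
A = rng.standard_normal((m, n)) / np.sqrt(m)     # random measurement operator
y = A @ x_true
x_hat = ista(A, y)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
```

The paper's diffusion-prior path replaces the soft-thresholding proximal step with guidance from a trained denoising diffusion model, which the abstract reports works better under pixel-domain masking and more severe undersampling.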
[1124] Guidestar-Free Adaptive Optics with Asymmetric Apertures
Weiyun Jiang, Haiyun Guo, Christopher A. Metzler, Ashok Veeraraghavan
Main category: eess.IV
TL;DR: First closed-loop adaptive optics system that corrects aberrations in real-time without guidestar or wavefront sensor, using asymmetric apertures and machine learning.
Details
Motivation: To develop a guidestar-free adaptive optics system that can correct aberrations in real-time, overcoming limitations of traditional systems that require guidestars or wavefront sensors.
Method: Combines asymmetric apertures for phase retrieval, machine learning algorithms to estimate the PSF from natural scenes and reconstruct phase aberrations, and a spatial light modulator for optical correction.
Result: Outperforms state-of-the-art guidestar-free wavefront shaping methods, using 10x fewer measurements and 1000x less computation, validated on dense natural scenes through unknown obscurants.
Conclusion: Demonstrates a practical guidestar-free adaptive optics framework that enables real-time aberration correction using computational methods and machine learning.
Abstract: This work introduces the first closed-loop adaptive optics (AO) system capable of optically correcting aberrations in real-time without a guidestar or a wavefront sensor. Nearly 40 years ago, Cederquist et al. demonstrated that asymmetric apertures enable phase retrieval (PR) algorithms to perform fully computational wavefront sensing, albeit at a high computational cost. More recently, Chimitt et al. extended this approach with machine learning and demonstrated real-time wavefront sensing using only a single (guidestar-based) point-spread-function (PSF) measurement. Inspired by these works, we introduce a guidestar-free AO framework built around asymmetric apertures and machine learning. Our approach combines three key elements: (1) an asymmetric aperture placed at the system’s pupil plane that enables PR-based wavefront sensing, (2) a pair of machine learning algorithms that estimate the PSF from natural scene measurements and reconstruct phase aberrations, and (3) a spatial light modulator that performs optical correction. We experimentally validate this framework on dense natural scenes imaged through unknown obscurants. Our method outperforms state-of-the-art guidestar-free wavefront shaping methods, using an order of magnitude fewer measurements and three orders of magnitude less computation.
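The role of the asymmetric aperture can be demonstrated numerically. For a centrosymmetric pupil, an aberration and its conjugate, coordinate-reversed "twin" produce identical PSFs, so the PSF alone cannot distinguish them; breaking the pupil's symmetry removes this ambiguity, which is what makes PR-based wavefront sensing well-posed. A small Fourier-optics sketch (the grid size, pupil shapes, and aberration are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def psf(pupil, phase):
    """Incoherent PSF: squared modulus of the Fourier transform of the
    complex pupil function pupil * exp(i * phase)."""
    return np.abs(np.fft.fft2(pupil * np.exp(1j * phase))) ** 2

def reverse(a):
    """Coordinate reversal i -> (-i) mod N on both axes."""
    return np.roll(np.flip(a), 1, axis=(0, 1))

n = 64
y, x = np.mgrid[-n // 2: n // 2, -n // 2: n // 2]
disc = (x ** 2 + y ** 2 < 20 ** 2).astype(float)   # symmetric circular pupil
notched = disc.copy()
notched[(x > 5) & (y > 5)] = 0.0                   # notch breaks the symmetry

phase = 0.8 * np.cos(2 * np.pi * x / n) + 0.5 * np.sin(2 * np.pi * y / n)
phase_twin = -reverse(phase)                       # the "twin" aberration

# Symmetric pupil: the aberration and its twin yield identical PSFs,
# so phase retrieval from the PSF is ambiguous.
p_sym, p_sym_twin = psf(disc, phase), psf(disc, phase_twin)
# Asymmetric pupil: the twin produces a measurably different PSF,
# so the ambiguity is broken.
p_asym, p_asym_twin = psf(notched, phase), psf(notched, phase_twin)
```

The closed-loop system then inverts this now-unambiguous forward model: machine learning estimates the PSF from natural-scene measurements, a second network recovers the phase, and the spatial light modulator applies the conjugate correction.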