Daily arXiv Papers - 2026-02-25

AI-enhanced summaries of research papers from arXiv

Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning

Chuhao Zhou, Jianfei Yang

Main category: cs.CV

TL;DR: HoloLLM is a multimodal LLM that integrates uncommon sensing modalities (LiDAR, infrared, mmWave radar, WiFi) for robust human perception in smart homes, addressing data scarcity and modality heterogeneity challenges.

Motivation: Current Vision-Language Models rely heavily on visual data, which limits robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. There's a need for models that can leverage diverse sensing modalities beyond vision for more reliable human behavior understanding in smart environments.

Method: Introduces HoloLLM with a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings using fine-grained, text-aligned features from tailored encoders via coarse-to-fine cross-attention. Also develops a human-VLM collaborative data curation pipeline to generate paired textual annotations for sensing datasets.
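
The coarse-to-fine cross-attention at the heart of UMIP builds on the standard attention primitive. Below is a minimal single-head version in NumPy; the dimensions and stacking scheme are illustrative only, not HoloLLM's actual projector design:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Single-head cross-attention: each query (e.g. a pre-aligned modality
    embedding) attends over keys/values (e.g. fine-grained text-aligned
    features from a tailored encoder). Illustrative sketch only."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (n_q, n_kv) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ values                         # (n_q, d_v) injected features
```

A coarse-to-fine scheme would apply this operation repeatedly, refining the modality queries against progressively finer-grained text-aligned features.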

Result: Extensive experiments on two newly constructed benchmarks show HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%.

Conclusion: HoloLLM establishes a new foundation for real-world, language-informed multisensory embodied intelligence by effectively integrating uncommon sensing modalities for robust human perception and reasoning.

Abstract: Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language. While Vision-Language Models (VLMs) have enabled impressive language-grounded perception, their reliance on visual data limits robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning across heterogeneous environments. We address two key challenges: (1) the scarcity of aligned modality-text data for rare sensors, and (2) the heterogeneity of their physical signal representations. To overcome these, we design a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings with fine-grained, text-aligned features from tailored encoders via coarse-to-fine cross-attention without introducing significant alignment overhead. We further introduce a human-VLM collaborative data curation pipeline to generate paired textual annotations for sensing datasets. Extensive experiments on two newly constructed benchmarks show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%. This work establishes a new foundation for real-world, language-informed multisensory embodied intelligence.

Relevance: 9/10

[2] OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport

Xiwen Chen, Wenhui Zhu, Gen Li, Xuanzhao Dong, Yujian Xiong, Hao Wang, Peijie Qiu, Qingquan Song, Zhipeng Wang, Shao Tang, Yalin Wang, Abolfazl Razi

Main category: cs.CV

TL;DR: OTPrune: A training-free framework that uses optimal transport to prune visual tokens in multimodal LLMs by aligning token distributions, achieving better performance-efficiency tradeoffs.

Motivation: Multimodal LLMs suffer from high inference costs due to redundant visual tokens. Existing pruning methods overlook the distributional structure of visual representations, leading to suboptimal performance.

Method: Formulates pruning as distribution alignment via optimal transport, minimizing 2-Wasserstein distance between full and pruned token distributions. Derives tractable submodular objective for efficient optimization with theoretical guarantees of monotonicity and submodularity.
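
A monotone submodular objective admits a simple greedy algorithm with a (1 − 1/e) approximation guarantee, which is what makes the formulation tractable. The sketch below uses a facility-location coverage objective as an illustrative stand-in for the paper's OT-derived objective:

```python
import numpy as np

def greedy_prune(tokens, k):
    """Greedily pick k of n visual tokens under a facility-location
    objective (monotone submodular): every token in the full set should be
    well 'covered' by some kept token. A stand-in for OTPrune's actual
    Wasserstein-derived objective, not the paper's formula."""
    n = len(tokens)
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = (t @ t.T + 1.0) / 2.0            # cosine similarity mapped to [0, 1]
    cover = np.zeros(n)                    # current coverage of each token
    selected = []
    for _ in range(k):
        # marginal gain of each candidate: extra coverage it would add
        gain = np.maximum(sim, cover[None, :]).sum(axis=1) - cover.sum()
        gain[selected] = -np.inf           # never pick the same token twice
        j = int(np.argmax(gain))
        selected.append(j)
        cover = np.maximum(cover, sim[j])
    return sorted(selected)
```

Because the gains are diminishing, the greedy choice is provably near-optimal; this is the practical payoff of proving monotonicity and submodularity.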

Result: Achieves superior performance-efficiency tradeoffs compared to state-of-the-art methods on comprehensive benchmarks, preserving both local diversity and global representativeness while reducing inference cost.

Conclusion: OTPrune provides a principled, training-free framework for visual token pruning that maintains semantic faithfulness through distributional alignment, offering stable and efficient acceleration for multimodal LLMs.

Abstract: Multi-modal large language models (MLLMs) achieve strong visual-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work explores visual token pruning to accelerate inference, while existing pruning methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, OTPrune preserves both local diversity and global representativeness while reducing inference cost. Moreover, we derive a tractable submodular objective that enables efficient optimization, and theoretically prove its monotonicity and submodularity, providing a principled foundation for stable and efficient pruning. We further provide a comprehensive analysis that explains how distributional alignment contributes to stable and semantically faithful pruning. Comprehensive experiments on wider benchmarks demonstrate that OTPrune achieves superior performance-efficiency tradeoffs compared to state-of-the-art methods. The code is available at https://github.com/xiwenc1/OTPrune.

Relevance: 9/10

[3] Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning

Zhuoxu Huang, Mengxi Jia, Hao Sun, Xuelong Li, Jungong Han

Main category: cs.LG

TL;DR: CalibRL is a hybrid-policy RLVR framework for MLLMs that enables controllable exploration with expert guidance to prevent entropy collapse and policy degradation during reinforcement learning training.

Motivation: RLVR training for MLLMs suffers from entropy collapse, policy degradation, and over-exploitation of suboptimal behaviors due to enormous state spaces and sparse rewards, requiring better exploration strategies.

Method: Uses distribution-aware advantage weighting to scale updates by group rareness, and asymmetric activation (LeakyReLU) leveraging expert knowledge as calibration baseline to moderate overconfident updates while preserving corrective direction.
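
The summary does not give CalibRL's exact formulas, so the following is only a guess at the shape of the mechanism: group-level advantages, a rareness weight, and an asymmetric (LeakyReLU) activation against an expert baseline:

```python
import numpy as np

def calibrated_advantages(rewards, expert_reward, group_freq,
                          alpha=0.1, eps=1e-8):
    """Assumed sketch of CalibRL-style advantage shaping (not the paper's
    formulas). Rollouts are scored against the expert's reward; below-expert
    rollouts keep their corrective direction but with damped magnitude, and
    rarer rollout groups receive larger update weight."""
    r = np.asarray(rewards, dtype=float)
    # advantage relative to the expert calibration baseline
    adv = (r - expert_reward) / (r.std() + eps)
    # asymmetric activation (LeakyReLU): moderate overconfident updates
    adv = np.where(adv >= 0, adv, alpha * adv)
    # distribution-aware weighting: rare groups are up-weighted
    w = 1.0 / (np.asarray(group_freq, dtype=float) + eps)
    return adv * w / w.mean()
```

The two pieces work together: the weight keeps rare, informative behaviors alive (preserving exploration), while the leaky activation stops confident negative updates from collapsing entropy.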

Result: Extensive experiments across eight benchmarks show consistent improvements in both in-domain and out-of-domain settings, validating the effectiveness of controllable hybrid-policy RLVR training.

Conclusion: CalibRL successfully addresses distributional mismatch between model policy and expert trajectories, achieving stable balance between exploration and exploitation in MLLM RLVR training.

Abstract: Reinforcement Learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLM and sparse rewards often leads to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This necessitates an exploration strategy that maintains productive stochasticity while avoiding the drawbacks of uncontrolled random sampling, yielding inefficient exploration. In this paper, we propose CalibRL, a hybrid-policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, a distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution, therefore preserving exploration. Meanwhile, the asymmetric activation function (LeakyReLU) leverages the expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction. CalibRL increases policy entropy in a guided manner and clarifies the target distribution by estimating the on-policy distribution through online sampling. Updates are driven by these informative behaviors, avoiding convergence to erroneous patterns. Importantly, these designs help alleviate the distributional mismatch between the model’s policy and expert trajectories, thereby achieving a more stable balance between exploration and exploitation. Extensive experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements, validating the effectiveness of our controllable hybrid-policy RLVR training. Code is available at https://github.com/zhh6425/CalibRL.

Relevance: 9/10


Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Talking to Yourself: Defying Forgetting in Large Language Models

Yutao Sun, Mingshuai Chen, Tiancheng Zhao, Phillip Miao, Zilun Zhang, Haozhan Shen, Ruizhe Zhu, Jianwei Yin

Main category: cs.CL

TL;DR: SA-SFT: Self-augmentation fine-tuning method where LLMs generate self-dialogues before fine-tuning to mitigate catastrophic forgetting without external data or modified training schedules.

Motivation: Catastrophic forgetting in LLMs when fine-tuning on narrow task-specific data degrades general knowledge and reasoning abilities. Need for lightweight methods that preserve original capabilities while adapting to new tasks.

Method: Self-augmentation routine where LLM generates self-dialogues prior to fine-tuning. Self-authored data mixed with task data without changing optimization or training schedules. No external data or additional tuning required.
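
The data-mixing step is the entire intervention, which is what makes the method lightweight. A minimal sketch, assuming a simple sampling scheme and a mix ratio the paper does not specify here:

```python
import random

def build_sa_sft_corpus(task_data, self_dialogues, mix_ratio=0.5, seed=0):
    """Mix model-generated self-dialogues into the task corpus without
    touching the optimizer or training schedule. mix_ratio (self-dialogues
    per task example) is an assumed hyperparameter, not the paper's value."""
    rng = random.Random(seed)
    n_aug = min(int(len(task_data) * mix_ratio), len(self_dialogues))
    corpus = list(task_data) + rng.sample(self_dialogues, n_aug)
    rng.shuffle(corpus)                  # interleave the two sources
    return corpus
```

Training then proceeds exactly as vanilla SFT on the mixed corpus; the self-authored data anchors the model's own style, which the authors argue counteracts style-induced parameter drift.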

Result: Consistently mitigates catastrophic forgetting while improving in-domain performance. Across 50 evaluation scenarios, maintains performance comparable to original model and achieves best results in 40 cases, outperforming baselines like layer freezing and external data mixing.

Conclusion: Self-augmentation offers simple, effective mechanism for robust LLM adaptation without catastrophic forgetting. Theoretical analysis suggests forgetting stems partly from style-induced parameter drift, and self-alignment through self-generated data counteracts this effect.

Abstract: Catastrophic forgetting remains a major challenge when fine-tuning large language models (LLMs) on narrow, task-specific data, often degrading their general knowledge and reasoning abilities. We propose SA-SFT, a lightweight self-augmentation routine in which an LLM generates self-dialogues prior to fine-tuning, and the resulting self-authored data are mixed with task data without modifying optimization or training schedules. Despite requiring no external data or additional tuning, SA-SFT consistently mitigates catastrophic forgetting while improving in-domain performance. Across 50 evaluation scenarios, it maintains performance comparable to the original model and achieves the best results in 40 cases, outperforming common baselines such as layer freezing and external data mixing. Guided by these empirical findings, we further present a theoretical analysis suggesting that forgetting can partly stem from style-induced parameter drift, and that self-alignment through self-generated data provides an effective means to counteract this effect. Overall, our results indicate that self-augmentation offers a simple and effective mechanism for robust LLM adaptation without incurring catastrophic forgetting.

[2] Benchmarking Distilled Language Models: Performance and Efficiency in Resource-Constrained Settings

Sachin Gopal Wani, Eric Page, Ajay Dholakia, David Ellison

Main category: cs.CL

TL;DR: Distillation creates superior performance-compute curve for small language models, with distilled 8B models being 2000x more compute-efficient than vanilla counterparts while matching reasoning of models 10x larger.

Motivation: To develop powerful yet efficient small language models for resource-constrained environments through knowledge distillation, benchmarking performance and computational cost against vanilla and proprietary models.

Method: Benchmarking performance and computational cost of distilled models against vanilla and proprietary counterparts, providing quantitative analysis of efficiency through knowledge distillation techniques.

Result: Distillation creates superior performance-compute curve; distilled 8B model is over 2000x more compute-efficient than training its vanilla counterpart while achieving reasoning capabilities on par with or exceeding standard models ten times its size.

Conclusion: Knowledge distillation is validated not just as a compression technique but as a primary strategy for building state-of-the-art, accessible AI through efficient small language models.

Abstract: Knowledge distillation offers a transformative pathway to developing powerful, yet efficient, small language models (SLMs) suitable for resource-constrained environments. In this paper, we benchmark the performance and computational cost of distilled models against their vanilla and proprietary counterparts, providing a quantitative analysis of their efficiency. Our results demonstrate that distillation creates a superior performance-to-compute curve. We find that creating a distilled 8B model is over 2,000 times more compute-efficient than training its vanilla counterpart, while achieving reasoning capabilities on par with, or even exceeding, standard models ten times its size. These findings validate distillation not just as a compression technique, but as a primary strategy for building state-of-the-art, accessible AI.

Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu, Rong Fu, Zhiyuan Feng, Yuan Wang, Simon Fong, Kaiyue Zhou

Main category: cs.CL

TL;DR: JurisMMA is a novel framework for Legal Judgment Prediction that decomposes trial tasks into stages and uses multimodal video-text data from a new large Chinese judicial dataset called JurisMM.

Motivation: Traditional legal judgment prediction methods struggle with complex cases involving multiple allegations, diverse evidence, and lack adaptability. There's a need for more effective frameworks that can handle the complexity of real legal cases and leverage multimodal information.

Method: The JurisMMA framework decomposes trial tasks, standardizes processes, and organizes them into distinct stages. The authors also built JurisMM, a large dataset with over 100,000 recent Chinese judicial records containing both text and multimodal video-text data.

Result: Experiments on both JurisMM and the benchmark LawBench validate the framework’s effectiveness. The results show the framework works well not only for legal judgment prediction but also for broader legal applications.

Conclusion: JurisMMA offers new perspectives for developing future legal methods and datasets, demonstrating effectiveness in handling complex legal cases through task decomposition and multimodal data integration.

Abstract: Legal Judgment Prediction (LJP) aims to predict the outcomes of legal cases based on factual descriptions, serving as a fundamental task to advance the development of legal systems. Traditional methods often rely on statistical analyses or role-based simulations but face challenges with multiple allegations, diverse evidence, and lack adaptability. In this paper, we introduce JurisMMA, a novel framework for LJP that effectively decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation. Experiments on JurisMM and the benchmark LawBench validate our framework’s effectiveness. These results indicate that our framework is effective not only for LJP but also for a broader range of legal applications, offering new perspectives for the development of future legal methods and datasets.

[4] ConceptRM: The Quest to Mitigate Alert Fatigue through Consensus-Based Purity-Driven Data Cleaning for Reflection Modelling

Yongda Yu, Lei Zhang, Xinxin Guo, Minghui Yu, Zhengqi Zhuang, Guoping Rong, Haifeng Shen, Zhengfeng Li, Boge Wang, Guoan Zhang, Bangyu Xiang, Xiaobin Xu

Main category: cs.CL

TL;DR: ConceptRM: A method to create high-quality training data for alert filtering models using minimal expert annotations and co-teaching to identify reliable negative samples from noisy user feedback data.

Motivation: Alert fatigue from excessive false alarms in intelligent agent systems leads users to miss critical issues. Existing approaches use reflection models trained on noisy user feedback data, but manual cleaning is expensive.

Method: Uses small expert annotations as anchors, creates perturbed datasets with varying noise ratios, employs co-teaching to train multiple distinct models for collaborative learning, and analyzes consensus decisions to identify reliable negative samples.
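
The consensus step at the end of that pipeline can be sketched simply. The interface below is an assumption (the paper's summary does not specify how votes are represented): each co-trained model votes on whether a sample is a false alert, and only (near-)unanimous samples become reliable negatives:

```python
def consensus_negatives(votes, agreement=1.0):
    """Sketch of a ConceptRM-style consensus filter (interface assumed).
    votes maps a sample id to the 0/1 'false alert' verdicts of the
    co-trained models; a sample counts as a reliable negative only when
    the agreeing fraction reaches the threshold."""
    return sorted(
        sid for sid, vs in votes.items()
        if sum(vs) / len(vs) >= agreement
    )
```

Lowering the `agreement` threshold trades purity of the cleaned corpus against its size, which is the tradeoff the perturbed-noise-ratio training is meant to calibrate.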

Result: Significantly enhances false alert interception with minimal annotation cost, outperforming state-of-the-art LLM baselines by up to 53.31% on in-domain datasets and 41.67% on out-of-domain datasets.

Conclusion: ConceptRM provides an effective solution for training high-quality reflection models to combat alert fatigue by efficiently leveraging noisy user feedback data with minimal expert supervision.

Abstract: In many applications involving intelligent agents, the overwhelming volume of alerts (mostly false) generated by the agents may desensitize users and cause them to overlook critical issues, leading to the so-called “alert fatigue”. A common strategy is to train a reflection model as a filter to intercept false alerts with labelled data collected from user verification feedback. However, a key challenge is the noisy nature of such data as it is often collected in production environments. As cleaning noise via manual annotation incurs high costs, this paper proposes a novel method ConceptRM for constructing a high-quality corpus to train a reflection model capable of effectively intercepting false alerts. With only a small amount of expert annotations as anchors, ConceptRM creates perturbed datasets with varying noise ratios and utilizes co-teaching to train multiple distinct models for collaborative learning. By analyzing the consensus decisions of these models, it effectively identifies reliable negative samples from a noisy dataset. Experimental results demonstrate that ConceptRM significantly enhances the interception of false alerts with minimal annotation cost, outperforming several state-of-the-art LLM baselines by up to 53.31% on in-domain datasets and 41.67% on out-of-domain datasets.

[5] InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation

Yu Li, Pranav Narayanan Venkit, Yada Pruksachatkun, Chien-Sheng Wu

Main category: cs.CL

TL;DR: A framework for evaluating personality simulation in LLMs using real interview data with four metrics: content similarity, factual consistency, personality alignment, and knowledge retention.

Motivation: Existing personality simulation evaluations rely on proxies like demographic surveys or short AI interviews, lacking direct assessment against what individuals actually said. There's a need for interview-grounded evaluation at scale.

Method: Extracted 671k question-answer pairs from 23k verified interview transcripts across 1,000 public personalities (avg 11.5 hours each). Proposed multi-dimensional evaluation with four metrics. Compared retrieval-augmented vs chronological-based methods.
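
The retrieval-augmented variant can be sketched as nearest-neighbor lookup over embedded QA pairs. Cosine-similarity retrieval is our assumed instantiation (the summary does not name the retriever); the chronological variant would instead feed pairs in interview order:

```python
import numpy as np

def retrieve_qa(query_vec, qa_vecs, k=3):
    """Return indices of the k interview QA pairs most cosine-similar to
    the query embedding. A toy stand-in for the paper's
    retrieval-augmented conditioning, not its actual retriever."""
    q = query_vec / np.linalg.norm(query_vec)
    m = qa_vecs / np.linalg.norm(qa_vecs, axis=1, keepdims=True)
    return np.argsort(-(m @ q))[:k]      # top-k by descending similarity
```

The reported tradeoff follows naturally: retrieval surfaces stylistically apt answers regardless of when they were said, while chronological ordering preserves the factual timeline.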

Result: Methods grounded in real interview data substantially outperform those using only biographical profiles or parametric knowledge. Retrieval-augmented methods excel at personality style and response quality, while chronological methods better preserve factual consistency and knowledge retention.

Conclusion: The framework enables principled method selection based on application needs and provides actionable insights for advancing personality simulation research.

Abstract: Simulating real personalities with large language models requires grounding generation in authentic personal data. Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said. We address this gap with an interview-grounded evaluation framework for personality simulation at a large scale. We extract over 671,000 question-answer pairs from 23,000 verified interview transcripts across 1,000 public personalities, each with an average of 11.5 hours of interview content. We propose a multi-dimensional evaluation framework with four complementary metrics measuring content similarity, factual consistency, personality alignment, and factual knowledge retention. Through systematic comparison, we demonstrate that methods grounded in real interview data substantially outperform those relying solely on biographical profiles or the model’s parametric knowledge. We further reveal a trade-off in how interview data is best utilized: retrieval-augmented methods excel at capturing personality style and response quality, while chronological-based methods better preserve factual consistency and knowledge retention. Our evaluation framework enables principled method selection based on application requirements, and our empirical findings provide actionable insights for advancing personality simulation research.

[6] What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance

William Watson, Nicole Cho, Sumitra Ganesh, Manuela Veloso

Main category: cs.CL

TL;DR: The paper analyzes how query features (like clause complexity, lexical rarity, anaphora, negation, answerability, and intention grounding) correlate with LLM hallucination risk, identifying a consistent “risk landscape” across 369,837 real-world queries.

Motivation: Current approaches treat LLM hallucinations as model or decoding defects, but this paper argues that query form also shapes responses. The research aims to identify specific query features that make hallucination more likely, drawing on classical linguistics insights about how query structure affects comprehension.

Method: Constructed a 22-dimension query feature vector covering clause complexity, lexical rarity, anaphora, negation, answerability, and intention grounding. Analyzed 369,837 real-world queries to identify correlations between these features and hallucination propensity across different models and datasets.
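
Three of the 22 dimensions are easy to approximate. The proxies below are rough illustrations, not the paper's actual feature definitions:

```python
def query_risk_features(query, common_words):
    """Toy versions of three query features from the 22-dimension vector
    (negation, lexical rarity, clause complexity). The real definitions
    are more careful; these are crude proxies for illustration."""
    tokens = query.lower().split()
    n = max(len(tokens), 1)
    return {
        # negation is known to tax both human and model comprehension
        "negation_count": sum(t in {"not", "no", "never", "nor"} for t in tokens),
        # lexical rarity: share of tokens outside a common-word list
        "lexical_rarity": sum(t not in common_words for t in tokens) / n,
        # very crude proxy for clause nesting depth
        "clause_depth_proxy": query.count(",") + query.count(" which ")
                              + query.count(" that "),
    }
```

In the paper's terms, a query scoring high on the first and third features and low on answerability would sit in the high-risk region of the landscape.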

Result: Revealed a consistent “risk landscape”: features like deep clause nesting and underspecification correlate with higher hallucination rates, while clear intention grounding and answerability correlate with lower hallucination rates. Domain specificity showed mixed, dataset- and model-dependent effects.

Conclusion: Established an empirically observable query-feature representation correlated with hallucination risk, which can guide query rewriting and future intervention studies to reduce hallucinations by modifying query structure.

Abstract: Large Language Model (LLM) hallucinations are usually treated as defects of the model or its decoding strategy. Drawing on classical linguistics, we argue that a query’s form can also shape a listener’s (and model’s) response. We operationalize this insight by constructing a 22-dimension query feature vector covering clause complexity, lexical rarity, and anaphora, negation, answerability, and intention grounding, all known to affect human comprehension. Using 369,837 real-world queries, we ask: Are there certain types of queries that make hallucination more likely? A large-scale analysis reveals a consistent “risk landscape”: certain features such as deep clause nesting and underspecification align with higher hallucination propensity. In contrast, clear intention grounding and answerability align with lower hallucination rates. Others, including domain specificity, show mixed, dataset- and model-dependent effects. Thus, these findings establish an empirically observable query-feature representation correlated with hallucination risk, paving the way for guided query rewriting and future intervention studies.

[7] No One Size Fits All: QueryBandits for Hallucination Mitigation

Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso

Main category: cs.CL

TL;DR: QueryBandits: A contextual bandit framework that adaptively selects optimal query-rewrite strategies to reduce hallucinations in closed-source LLMs without retraining.

Motivation: Most hallucination mitigation work focuses on open-source models, but closed-source models dominate institutional deployments and lack effective forward-pass-only solutions that don't require retraining or gradient access.

Method: Model-agnostic contextual bandit framework that learns online to select optimal query-rewrite strategies (like paraphrase or expand) using empirically validated reward functions based on semantic features.
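
The Thompson Sampling arm of the framework rests on the classic Beta-Bernoulli bandit. A minimal non-contextual version for brevity (QueryBandits proper conditions arm selection on semantic query features, and its reward function is empirically calibrated rather than binary):

```python
import numpy as np

class RewriteBandit:
    """Beta-Bernoulli Thompson sampling over query-rewrite strategies.
    Illustrative sketch: the paper's contextual variant additionally
    conditions on query features."""
    def __init__(self, arms):
        self.arms = list(arms)
        self.wins = {a: 1.0 for a in self.arms}    # Beta prior alpha
        self.losses = {a: 1.0 for a in self.arms}  # Beta prior beta
    def select(self, rng):
        # sample a plausible success rate per arm, play the best sample
        draws = {a: rng.beta(self.wins[a], self.losses[a]) for a in self.arms}
        return max(draws, key=draws.get)
    def update(self, arm, success):
        if success:
            self.wins[arm] += 1.0
        else:
            self.losses[arm] += 1.0
```

Over repeated queries the posterior concentrates on whichever rewrite actually reduces hallucinations, which is why an adaptive policy can beat any single static rewrite, entirely through forward passes.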

Result: Top QueryBandit (Thompson Sampling) achieves 87.5% win rate over No-Rewrite baseline, outperforms static policies by 42.6-60.3% across 16 QA scenarios. Contextual bandits outperform vanilla bandits, showing no single rewrite policy works for all queries.

Conclusion: Online adaptive query-rewriting via contextual bandits can effectively reduce hallucinations in closed-source LLMs through forward-pass mechanisms only, bypassing need for retraining or gradient access.

Abstract: Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations in closed-source models is especially concerning, as they constitute the vast majority of models in institutional deployments. We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function. Across 16 QA scenarios, our top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. Moreover, all contextual bandits outperform vanilla bandits across all datasets, with higher feature variance coinciding with greater variance in arm selection. This substantiates our finding that there is no single rewrite policy optimal for all queries. We also discover that certain static policies incur higher cumulative regret than No-Rewrite, indicating that an inflexible query-rewriting policy can worsen hallucinations. Thus, learning an online policy over semantic features with QueryBandits can shift model behavior purely through forward-pass mechanisms, enabling its use with closed-source models and bypassing the need for retraining or gradient-based adaptation.

[8] Natural Language Processing Models for Robust Document Categorization

Radoslaw Roszczyk, Pawel Tecza, Maciej Stodolski, Krzysztof Siwek

Main category: cs.CL

TL;DR: Evaluation of Naive Bayes, BiLSTM, and BERT for text classification, finding BERT most accurate but computationally heavy, BiLSTM offers best balance, with practical system demonstration.

Motivation: The study aims to evaluate different machine learning methods for automated text classification, focusing on balancing classification accuracy with computational efficiency for real-world automation pipelines, particularly in handling unbalanced document categorization.

Method: Three models were examined: Naive Bayes classifier (simple probabilistic model), bidirectional LSTM network (recurrent neural network with contextual understanding), and fine-tuned transformer-based BERT model (state-of-the-art transformer architecture). A demonstrative system was implemented for practical validation.
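
The Naive Bayes end of that spectrum fits in a few lines, which is exactly why it trains in milliseconds. A self-contained multinomial version with add-one smoothing (a generic baseline implementation, not the paper's code):

```python
import math
from collections import Counter

class TinyNB:
    """Minimal multinomial Naive Bayes for document routing, with add-one
    smoothing. A generic baseline sketch, not the paper's implementation."""
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.word_counts = {c: Counter() for c in self.classes}
        self.class_counts = Counter(labels)
        self.vocab = set()
        for doc, y in zip(docs, labels):
            toks = doc.lower().split()
            self.word_counts[y].update(toks)
            self.vocab.update(toks)
        return self
    def predict(self, doc):
        toks = doc.lower().split()
        total = sum(self.class_counts.values())
        best, best_lp = None, -math.inf
        for c in self.classes:
            lp = math.log(self.class_counts[c] / total)   # class prior
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for t in toks:
                lp += math.log((self.word_counts[c][t] + 1) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

The tradeoff the paper measures follows from what each model ignores: this counts words independently, a BiLSTM reads them in context, and BERT brings pretrained contextual representations, at steeply rising training cost.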

Result: BERT achieved highest accuracy (>99%) but required longest training times and most resources; BiLSTM reached ~98.56% accuracy with moderate computational costs; Naive Bayes was fastest (milliseconds) but lowest accuracy (~94.5%). Class imbalance affected all methods, especially for minority categories.

Conclusion: BiLSTM offers the most balanced solution for the examined scenario, providing strong accuracy with reasonable computational costs. The study demonstrates practical applicability through an automated routing system and suggests future exploration of transformer architectures.

Abstract: This article presents an evaluation of several machine learning methods applied to automated text classification, alongside the design of a demonstrative system for unbalanced document categorization and distribution. The study focuses on balancing classification accuracy with computational efficiency, a key consideration when integrating AI into real world automation pipelines. Three models of varying complexity were examined: a Naive Bayes classifier, a bidirectional LSTM network, and a fine tuned transformer based BERT model. The experiments reveal substantial differences in performance. BERT achieved the highest accuracy, consistently exceeding 99%, but required significantly longer training times and greater computational resources. The BiLSTM model provided a strong compromise, reaching approximately 98.56% accuracy while maintaining moderate training costs and offering robust contextual understanding. Naive Bayes proved to be the fastest to train, on the order of milliseconds, yet delivered the lowest accuracy, averaging around 94.5%. Class imbalance influenced all methods, particularly in the recognition of minority categories. A fully functional demonstrative system was implemented to validate practical applicability, enabling automated routing of technical requests with throughput unattainable through manual processing. The study concludes that BiLSTM offers the most balanced solution for the examined scenario, while also outlining opportunities for future improvements and further exploration of transformer architectures.

[9] How communicatively optimal are exact numeral systems? Once more on lexicon size and morphosyntactic complexity

Chundra Cathcart, Arne Rubehn, Katja Bocklage, Luca Ciucci, Kellen Parker van Dam, Alžběta Kučerová, Jekaterina Mažara, Carlo Y. Meloni, David Snee, Johann-Mattis List

Main category: cs.CL

TL;DR: Analysis shows many languages have less efficient numeral systems than expected, challenging previous theories about optimal communicative efficiency in numeral systems.

Motivation: Previous research argued that exact recursive numeral systems optimize communicative efficiency by balancing lexicon size and morphosyntactic complexity, but this study questions whether previous analyses properly accounted for the actual complexity languages display.

Method: Used data from 52 genetically diverse languages with an annotation scheme distinguishing between predictable and unpredictable allomorphy (formal variation) to analyze numeral system efficiency.
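
The tradeoff being measured is concrete: every numeral form is a sequence of morphemes, and a system balances how many distinct morphemes it stores against how long its forms get. A toy measure, simplifying away the paper's predictable/unpredictable allomorphy distinction:

```python
def numeral_tradeoff(forms):
    """forms maps an integer to its morpheme decomposition, e.g.
    21 -> ["twenty", "one"]. Returns (lexicon size, average length in
    morphemes) -- the two quantities the efficiency argument trades off.
    A simplification of the paper's annotation scheme."""
    lexicon = {m for morphs in forms.values() for m in morphs}
    avg_complexity = sum(len(m) for m in forms.values()) / len(forms)
    return len(lexicon), avg_complexity
```

An irregular form like "twelve" adds a lexical item but keeps length 1, while compositional "twenty-two" reuses morphemes at the cost of length 2; unpredictable allomorphy inflates the lexicon without shortening forms, which is how a language ends up less efficient than the tradeoff predicts.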

Result: Found that many of the world’s languages are decisively less efficient than would be expected under previous theories of optimal communicative efficiency in numeral systems.

Conclusion: The findings challenge previous assumptions about numeral system efficiency and have implications for the study of numeral systems and linguistic evolution more broadly.

Abstract: Recent research argues that exact recursive numeral systems optimize communicative efficiency by balancing a tradeoff between the size of the numeral lexicon and the average morphosyntactic complexity (roughly length in morphemes) of numeral terms. We argue that previous studies have not characterized the data in a fashion that accounts for the degree of complexity languages display. Using data from 52 genetically diverse languages and an annotation scheme distinguishing between predictable and unpredictable allomorphy (formal variation), we show that many of the world’s languages are decisively less efficient than one would expect. We discuss the implications of our findings for the study of numeral systems and linguistic evolution more generally.

[10] Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems

Mukul Chhabra, Luigi Medrano, Arush Verma

Main category: cs.CL

TL;DR: A case-aware LLM-as-a-Judge evaluation framework for enterprise multi-turn RAG systems that addresses enterprise-specific failure modes like case misidentification and workflow misalignment.

Motivation: Enterprise RAG assistants operate in multi-turn, case-based workflows (technical support, IT operations) where existing evaluation frameworks fail to capture enterprise-specific failure modes such as case misidentification, workflow misalignment, and partial resolution across turns.

Method: Proposes a case-aware LLM-as-a-Judge evaluation framework with eight operationally grounded metrics covering retrieval quality, grounding fidelity, answer utility, precision integrity, and case/workflow alignment. Uses deterministic prompting with strict JSON outputs for scalable batch evaluation.

Result: Comparative study shows generic proxy metrics provide ambiguous signals, while the proposed framework exposes enterprise-critical tradeoffs that are actionable for system improvement.

Conclusion: The framework enables scalable evaluation, regression testing, and production monitoring for enterprise multi-turn RAG systems by addressing their unique operational constraints and failure modes.

Abstract: Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, where evaluation must reflect operational constraints, structured identifiers (e.g., error codes, versions), and resolution workflows. Existing RAG evaluation frameworks are primarily designed for benchmark-style or single-turn settings and often fail to capture enterprise-specific failure modes such as case misidentification, workflow misalignment, and partial resolution across turns. We present a case-aware LLM-as-a-Judge evaluation framework for enterprise multi-turn RAG systems. The framework evaluates each turn using eight operationally grounded metrics that separate retrieval quality, grounding fidelity, answer utility, precision integrity, and case/workflow alignment. A severity-aware scoring protocol reduces score inflation and improves diagnostic clarity across heterogeneous enterprise cases. The system uses deterministic prompting with strict JSON outputs, enabling scalable batch evaluation, regression testing, and production monitoring. Through a comparative study of two instruction-tuned models across short and long workflows, we show that generic proxy metrics provide ambiguous signals, while the proposed framework exposes enterprise-critical tradeoffs that are actionable for system improvement.
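The deterministic-prompting, strict-JSON protocol described in the abstract can be sketched as follows. The metric names and the 0-4 scale here are illustrative assumptions, not the paper's exact schema.

```python
import json

# Hypothetical metric schema for a case-aware judge (illustrative only).
METRICS = [
    "retrieval_quality", "grounding_fidelity", "answer_utility",
    "precision_integrity", "case_alignment", "workflow_alignment",
    "resolution_progress", "identifier_accuracy",
]

JUDGE_PROMPT = (
    "Score the assistant turn on each metric from 0 to 4.\n"
    "Respond with ONLY a JSON object with keys: " + ", ".join(METRICS) + "\n"
    "Turn: {turn}\nRetrieved context: {context}"
)

def parse_verdict(raw: str) -> dict:
    """Enforce the strict-JSON output contract: reject extra keys,
    missing keys, or out-of-range scores."""
    verdict = json.loads(raw)
    if set(verdict) != set(METRICS):
        raise ValueError(f"unexpected keys: {set(verdict) ^ set(METRICS)}")
    for name, score in verdict.items():
        if not (isinstance(score, int) and 0 <= score <= 4):
            raise ValueError(f"{name}: score {score!r} out of range")
    return verdict

# Canned response in place of a real judge call (which would be made at
# temperature 0 for deterministic batch evaluation):
raw = ('{"retrieval_quality": 3, "grounding_fidelity": 4, "answer_utility": 2, '
       '"precision_integrity": 4, "case_alignment": 3, "workflow_alignment": 2, '
       '"resolution_progress": 1, "identifier_accuracy": 4}')
print(parse_verdict(raw)["grounding_fidelity"])  # 4
```

Keeping the verdict machine-checkable is what makes regression testing and production monitoring over thousands of turns feasible.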

[11] K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

Shuhe Li, Chenxu Guo, Jiachen Lian, Cheol Jun Cho, Wenshuo Zhao, Xiner Xu, Ruiyu Jin, Xiaoyu Shi, Xuanru Zhou, Dingkun Zhou, Sam Wang, Grace Wang, Jingze Yang, Jingyi Xu, Ruohan Bao, Xingrui Chen, Elise Brenner, Brandon In, Francesca Pei, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for paper 2507.03043 returned HTTP 429 (rate limited).

[12] Disentangling Geometry, Performance, and Training in Language Models

Atharva Kulkarni, Jacob Mitchell Springer, Arjun Subramonian, Swabha Swayamdipta

Main category: cs.CL

TL;DR: Analysis of Transformer unembedding-matrix geometry (effective rank) shows it reflects training hyperparameters more than it predicts downstream performance, correlating only inconsistently with model quality.

Motivation: To systematically investigate whether geometric properties of Transformer weights, particularly the unembedding matrix's effective rank, can reliably estimate downstream model performance, given prior interpretability research focused on these properties.

Method: Trained 108 OLMo-style language models under controlled variation of hyperparameters, analyzed relationship between model performance and unembedding matrix geometry (effective rank), extended to other geometric metrics and final-layer representations.

Result: High effective rank often correlates with best-performing models but not universally; low effective rank co-occurs with late-stage degradation but doesn’t cause it; effective rank strongly influenced by pre-training hyperparameters (batch size, weight decay); existing geometric metrics reflect training choices rather than performance.

Conclusion: Transformer weight geometry metrics, including unembedding matrix effective rank, primarily reflect training hyperparameter choices rather than reliably predicting downstream performance, challenging their utility for performance estimation.

Abstract: Geometric properties of Transformer weights, particularly the unembedding matrix, have been widely useful in language model interpretability research. Yet, their utility for estimating downstream performance remains unclear. In this work, we systematically investigate the relationship between model performance and the unembedding matrix geometry, particularly its effective rank. Our experiments, involving a suite of 108 OLMo-style language models trained under controlled variation, reveal several key findings. While the best-performing models often exhibit a high effective rank, this trend is not universal across tasks and training setups. Contrary to prior work, we find that low effective rank does not cause late-stage performance degradation in small models, but instead co-occurs with it; we find adversarial cases where low-rank models do not exhibit saturation. Moreover, we show that effective rank is strongly influenced by pre-training hyperparameters, such as batch size and weight decay, which in turn affect the model’s performance. Lastly, extending our analysis to other geometric metrics and final-layer representation, we find that these metrics are largely aligned, but none can reliably predict downstream performance. Overall, our findings suggest that the model’s geometry, as captured by existing metrics, primarily reflects training choices rather than performance.
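The effective rank at the center of this analysis is standardly computed as the exponential of the entropy of the normalized singular-value distribution (Roy & Vetterli, 2007). A minimal sketch:

```python
import numpy as np

def effective_rank(W: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank: exp of the entropy of the normalized
    singular-value distribution (Roy & Vetterli, 2007)."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))

rng = np.random.default_rng(0)
# A random "unembedding" matrix (vocab x hidden) has a flat spectrum and
# near-maximal effective rank...
full = rng.standard_normal((1000, 64))
# ...while a rank-4 product concentrates the spectrum and lowers it.
low = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 64))
print(effective_rank(full) > effective_rank(low))  # True
```

Unlike the integer matrix rank, this measure varies smoothly with the spectrum, which is what lets the paper correlate it with hyperparameters like batch size and weight decay.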

[13] From Performance to Purpose: A Sociotechnical Taxonomy for Evaluating Large Language Model Utility

Gavin Levinson, Keith Feldman

Main category: cs.CL

TL;DR: LUX is a comprehensive taxonomy framework for evaluating LLM utility across four domains: performance, interaction, operations, and governance, with hierarchical organization and metrics for quantitative comparison.

Motivation: Current LLM evaluation focuses on task-level success but lacks unified frameworks for assessing broader sociotechnical determinants of utility in real-world, high-stakes applications.

Method: Developed Language Model Utility Taxonomy (LUX) with hierarchical organization across four domains, each containing dimensions and components grounded in measurable metrics, plus a web tool for exploration.

Result: Created a comprehensive framework that structures utility evaluation, enabling consistent assessment and comparison of LLMs across diverse use cases through quantitative metrics.

Conclusion: LUX addresses the gap in unified LLM utility evaluation by providing a structured taxonomy that goes beyond conventional performance measures to include sociotechnical considerations.

Abstract: As large language models (LLMs) continue to improve at completing discrete tasks, they are being integrated into increasingly complex and diverse real-world systems. However, task-level success alone does not establish a model’s fit for use in practice. In applied, high-stakes settings, LLM effectiveness is driven by a wider array of sociotechnical determinants that extend beyond conventional performance measures. Although a growing set of metrics capture many of these considerations, they are rarely organized in a way that supports consistent evaluation, leaving no unified taxonomy for assessing and comparing LLM utility across use cases. To address this gap, we introduce the Language Model Utility Taxonomy (LUX), a comprehensive framework that structures utility evaluation across four domains: performance, interaction, operations, and governance. Within each domain, LUX is organized hierarchically into thematically aligned dimensions and components, each grounded in metrics that enable quantitative comparison and alignment of model selection with intended use. In addition, an external dynamic web tool is provided to support exploration of the framework by connecting each component to a repository of relevant metrics (factors) for applied evaluation.

[14] Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning

Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy, Kilian Q. Weinberger

Main category: cs.CL

TL;DR: STAR-LDM combines latent diffusion planning with autoregressive generation, adding a “thinking” phase for semantic planning before token generation, improving language understanding and coherence.

Motivation: Conventional autoregressive language models make token-by-token decisions without global planning, limiting their ability to maintain coherence and reason about longer contexts. The authors aim to integrate planning capabilities into language generation.

Method: STAR-LDM integrates latent diffusion planning with autoregressive generation. It pauses generation for a “thinking” phase where it refines a semantic plan through diffusion in continuous space before committing to discrete tokens. The architecture also enables control through lightweight classifiers for attribute steering.

Result: STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves >70% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning. It maintains better fluency-control trade-offs than specialized approaches.

Conclusion: The integration of diffusion-based planning with autoregressive generation enables better global planning in language models, improving coherence and reasoning while allowing flexible control through lightweight classifiers without retraining.

Abstract: The Stop-Think-AutoRegress Language Diffusion Model (STAR-LDM) integrates latent diffusion planning with autoregressive generation. Unlike conventional autoregressive language models limited to token-by-token decisions, STAR-LDM incorporates a “thinking” phase that pauses generation to refine a semantic plan through diffusion before continuing. This enables global planning in continuous space prior to committing to discrete tokens. Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves >70% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning. The architecture also allows straightforward control through lightweight classifiers, enabling fine-grained steering of attributes without model retraining while maintaining better fluency-control trade-offs than specialized approaches.

[15] Personal Information Parroting in Language Models

Nishant Subramani, Kshitish Ghate, Mona Diab

Main category: cs.CL

TL;DR: A study on personal information memorization in language models, showing that models memorize email addresses, phone numbers, and IP addresses from training data, with larger models and more training showing higher memorization rates.

Motivation: Language models are trained on web data containing personal information, which they may memorize, creating privacy risks. The authors aim to quantify this memorization and understand how model size and training affect it.

Method: Developed regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses. Used manually curated set of 483 PI instances to measure memorization in Pythia models (160M-6.9B) at different training steps, testing if models parrot PI verbatim when prompted with preceding context.

Result: 13.6% of PI instances were parroted verbatim by Pythia-6.9b. Both model size and pretraining amount positively correlate with memorization. Even smallest model (Pythia-160m) parroted 2.7% of instances exactly.

Conclusion: Pretraining datasets should be aggressively filtered and anonymized to minimize personal information parroting, as language models memorize sensitive data from their training corpora.

Abstract: Modern language models (LM) are trained on large scrapes of the Web, containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization: finding that 13.6% are parroted verbatim by the Pythia-6.9b model, i.e., when the model is prompted with the tokens that precede the PI in the original document, greedy decoding generates the entire PI span exactly. We expand this analysis to study models of varying sizes (160M-6.9B) and pretraining time steps (70k-143k iterations) in the Pythia model suite and find that both model size and amount of pretraining are positively correlated with memorization. Even the smallest model, Pythia-160m, parrots 2.7% of the instances exactly. Consequently, we strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.
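Regex-based PI detection of the kind the R&R suite builds on can be sketched as follows. These patterns are illustrative only and far looser than a production detector (the paper's suite adds rules on top of its regexes).

```python
import re

# Illustrative patterns for the three PI types the paper targets.
PI_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\+?\d{1,3}[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
}

def detect_pi(text: str) -> list[tuple[str, str]]:
    """Return (type, matched span) pairs for every PI hit in the text."""
    hits = []
    for kind, pattern in PI_PATTERNS.items():
        hits += [(kind, m.group()) for m in pattern.finditer(text)]
    return hits

sample = "Contact jane.doe@example.com or +1 555-123-4567 from 192.168.0.1"
print(detect_pi(sample))
```

A detector like this locates PI spans in a document; the paper's parroting test then prompts the model with the tokens preceding each span and checks whether greedy decoding reproduces the span exactly.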

[16] Enhancing Hate Speech Detection on Social Media: A Comparative Analysis of Machine Learning Models and Text Transformation Approaches

Saurabh Mishra, Shivani Thakur, Radhika Mamidi

Main category: cs.CL

TL;DR: Evaluation of ML models for hate speech detection and text transformation techniques for content neutralization

Motivation: Address the proliferation of hate speech on social media platforms by developing effective detection and moderation tools

Method: Compare traditional models (CNNs, LSTMs) with advanced neural networks (BERT and derivatives) and hybrid models; explore text transformation techniques to neutralize negative expressions

Result: BERT shows superior accuracy due to deep contextual understanding; hybrid models perform better in certain scenarios; text transformation approaches can convert negative expressions to neutral ones

Conclusion: Advanced models like BERT are effective for hate speech detection, hybrid models have niche advantages, and text transformation offers potential for content mitigation; future directions needed for more robust systems

Abstract: The proliferation of hate speech on social media platforms has necessitated the development of effective detection and moderation tools. This study evaluates the efficacy of various machine learning models in identifying hate speech and offensive language and investigates the potential of text transformation techniques to neutralize such content. We compare traditional models like CNNs and LSTMs with advanced neural network models such as BERT and its derivatives, alongside exploring hybrid models that combine different architectural features. Our results indicate that while advanced models like BERT show superior accuracy due to their deep contextual understanding, hybrid models exhibit improved capabilities in certain scenarios. Furthermore, we introduce innovative text transformation approaches that convert negative expressions into neutral ones, thereby potentially mitigating the impact of harmful content. The implications of these findings are discussed, highlighting the strengths and limitations of current technologies and proposing future directions for more robust hate speech detection systems.

[17] Semantic Novelty at Scale: Narrative Shape Taxonomy and Readership Prediction in 28,606 Books

W. Frederick Zimmerman

Main category: cs.CL

TL;DR: Semantic novelty analysis of 28,606 pre-1920 books reveals 8 narrative shape archetypes, with volume (variance of novelty trajectory) as strongest predictor of readership, and shows genre/historical constraints on narrative structure.

Motivation: To develop an information-theoretic measure of narrative structure at corpus scale using semantic novelty, and understand how information-density dynamics affect reader engagement across genres and historical periods.

Method: Compute paragraph-level semantic novelty using cosine distance between SBERT sentence embeddings and running centroid of preceding paragraphs, reduce to 16-segment PAA, cluster with Ward-linkage to identify narrative archetypes, and analyze correlations with readership metrics.

Result: Identified 8 canonical narrative shapes; volume (variance) strongest length-independent predictor of readership (partial ρ = 0.32); genre strongly constrains narrative shape; books became more predictable 1840-1910; 85% signature uniqueness across books.

Conclusion: Information-density dynamics constitute a fundamental dimension of narrative structure with measurable consequences for reader engagement, distinct from sentiment or topic analysis.

Abstract: I introduce semantic novelty (cosine distance between each paragraph’s sentence embedding and the running centroid of all preceding paragraphs) as an information-theoretic measure of narrative structure at corpus scale. Applying it to 28,606 books in PG19 (pre-1920 English literature), I compute paragraph-level novelty curves using 768-dimensional SBERT embeddings, then reduce each to a 16-segment Piecewise Aggregate Approximation (PAA). Ward-linkage clustering on PAA vectors reveals eight canonical narrative shape archetypes, from Steep Descent (rapid convergence) to Steep Ascent (escalating unpredictability). Volume (variance of the novelty trajectory) is the strongest length-independent predictor of readership (partial ρ = 0.32), followed by speed (ρ = 0.19) and Terminal/Initial ratio (ρ = 0.19). Circuitousness shows strong raw correlation (ρ = 0.41) but is 93% correlated with length; after control, partial ρ drops to 0.11, demonstrating that naive correlations in corpus studies can be dominated by length confounds. Genre strongly constrains narrative shape (χ² = 2121.6, p < 10^-242), with fiction maintaining plateau profiles while nonfiction front-loads information. Historical analysis shows books became progressively more predictable between 1840 and 1910 (T/I ratio trend r = -0.74, p = 0.037). SAX analysis reveals 85% signature uniqueness, suggesting each book traces a nearly unique path through semantic space. These findings demonstrate that information-density dynamics, distinct from sentiment or topic, constitute a fundamental dimension of narrative structure with measurable consequences for reader engagement. Dataset: https://huggingface.co/datasets/wfzimmerman/pg19-semantic-novelty
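The novelty curve and its PAA reduction are straightforward to sketch; the random vectors below stand in for real SBERT paragraph embeddings.

```python
import numpy as np

def novelty_curve(embeddings: np.ndarray) -> np.ndarray:
    """Cosine distance of each paragraph embedding to the running
    centroid of all preceding paragraphs (the first paragraph is
    skipped, since it has no history)."""
    out = []
    for i in range(1, len(embeddings)):
        centroid = embeddings[:i].mean(axis=0)
        v = embeddings[i]
        cos = v @ centroid / (np.linalg.norm(v) * np.linalg.norm(centroid))
        out.append(1.0 - cos)
    return np.array(out)

def paa(curve: np.ndarray, segments: int = 16) -> np.ndarray:
    """Piecewise Aggregate Approximation: the mean of each segment."""
    return np.array([c.mean() for c in np.array_split(curve, segments)])

rng = np.random.default_rng(0)
emb = rng.standard_normal((200, 768))  # stand-in for SBERT paragraph embeddings
curve = novelty_curve(emb)
shape_signature = paa(curve)           # 16-segment narrative "shape" vector
volume = curve.var()                   # the paper's strongest readership predictor
print(shape_signature.shape, volume > 0)
```

Clustering the 16-dimensional signatures (the paper uses Ward linkage) is what yields the shape archetypes; "volume" is simply the variance of the unreduced curve.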

[18] CARE: An Explainable Computational Framework for Assessing Client-Perceived Therapeutic Alliance Using Large Language Models

Anqi Li, Chenxiao Wang, Yu Lu, Renjun Xu, Lizhi Ma, Zhenzhong Lan

Main category: cs.CL

TL;DR: CARE is an LLM-based framework that automatically predicts multi-dimensional therapeutic alliance scores and generates interpretable rationales from counseling transcripts, outperforming existing methods and reducing the gap between counselor evaluations and client-perceived alliance.

Motivation: Traditional methods for capturing client perceptions of therapeutic alliance are burdensome (post-session questionnaires) or produce coarse scores without interpretable rationales. Existing computational approaches lack holistic session context modeling and interpretability.

Method: Built on CounselingWAI dataset enriched with 9,516 expert-curated rationales, CARE fine-tunes LLaMA-3.1-8B-Instruct using rationale-augmented supervision to predict alliance scores and generate interpretable rationales from counseling transcripts.

Result: CARE outperforms leading LLMs, achieves over 70% higher Pearson correlation with client ratings, substantially reduces gap between counselor and client evaluations, and produces high-quality rationales validated by both automatic and human evaluations.

Conclusion: CARE demonstrates potential as an AI-assisted tool for mental health care by uncovering alliance-building challenges, illustrating interaction patterns, and providing actionable insights from real-world counseling sessions.

Abstract: Client perceptions of the therapeutic alliance are critical for counseling effectiveness. Accurately capturing these perceptions remains challenging, as traditional post-session questionnaires are burdensome and often delayed, while existing computational approaches produce coarse scores, lack interpretable rationales, and fail to model holistic session context. We present CARE, an LLM-based framework to automatically predict multi-dimensional alliance scores and generate interpretable rationales from counseling transcripts. Built on the CounselingWAI dataset and enriched with 9,516 expert-curated rationales, CARE is fine-tuned using rationale-augmented supervision with the LLaMA-3.1-8B-Instruct backbone. Experiments show that CARE outperforms leading LLMs and substantially reduces the gap between counselor evaluations and client-perceived alliance, achieving over 70% higher Pearson correlation with client ratings. Rationale-augmented supervision further improves predictive accuracy. CARE also produces high-quality, contextually grounded rationales, validated by both automatic and human evaluations. Applied to real-world Chinese online counseling sessions, CARE uncovers common alliance-building challenges, illustrates how interaction patterns shape alliance development, and provides actionable insights, demonstrating its potential as an AI-assisted tool for supporting mental health care.

[19] CAMEL: Confidence-Gated Reflection for Reward Modeling

Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu, Yang You

Main category: cs.CL

TL;DR: CAMEL is a confidence-gated reflection framework for reward modeling that uses log-probability margins as confidence signals to selectively invoke reflection only for low-confidence instances, achieving SOTA performance with better efficiency.

Motivation: Existing reward modeling approaches have trade-offs: scalar discriminative models are efficient but lack interpretability, while generative judging models offer richer reasoning at high computational cost. There's a need for a method that combines efficiency with interpretable reasoning.

Method: CAMEL uses log-probability margins between verdict tokens as confidence signals to gate reflection. It performs lightweight single-token preference decisions first, then selectively invokes reflection only for low-confidence instances. The model is trained via reinforcement learning with counterfactual prefix augmentation to expose it to diverse initial verdicts and encourage genuine revision.

Result: Achieves state-of-the-art performance on three reward-model benchmarks with 82.9% average accuracy, surpassing prior best by 3.2%. Outperforms 70B-parameter models using only 14B parameters, establishing a strictly better accuracy-efficiency Pareto frontier.

Conclusion: CAMEL demonstrates that confidence-gated reflection can effectively combine the efficiency of discriminative models with the reasoning capabilities of generative models, achieving superior performance with better computational efficiency.

Abstract: Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.
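The confidence gate is driven by the log-probability margin between the two verdict tokens. A minimal sketch, with an illustrative threshold `tau` that is not the paper's value:

```python
def verdict_margin(logprob_a: float, logprob_b: float) -> float:
    """Absolute log-probability margin between the two verdict tokens;
    a large margin is a cheap proxy for prediction confidence."""
    return abs(logprob_a - logprob_b)

def judge(logprob_a: float, logprob_b: float, tau: float = 1.0):
    """Confidence-gated decision: accept the single-token verdict when
    the margin clears tau, otherwise route the instance to a (more
    expensive) reflection pass. tau = 1.0 is illustrative only."""
    verdict = "A" if logprob_a > logprob_b else "B"
    route = "direct" if verdict_margin(logprob_a, logprob_b) >= tau else "reflect"
    return verdict, route

print(judge(-0.2, -1.8))  # ('A', 'direct')  -- confident margin
print(judge(-0.6, -0.8))  # ('A', 'reflect') -- low margin, reflect
```

Because the margin comes for free from the logits of the single-token decision, the gate adds no inference cost on the high-confidence path.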

[20] ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition

Xindian Ma, Rundong Kong, Peng Zhang, Ruoxiang Huang, Yongyu Jiang

Main category: cs.CL

TL;DR: ID-LoRA is a novel parameter-efficient fine-tuning method that extracts clustered parameter groups from pretrained weights to create multiple low-rank components sharing a single trainable matrix, reducing parameters while maintaining performance.

Motivation: Current LoRA variants introduce significant parameter overhead when scaling up models, while aggressively reducing rank degrades performance in complex multi-task settings. There's a need for a method that breaks this trade-off between parameter efficiency and performance.

Method: ID-LoRA extracts clustered parameter groups from the pretrained weight matrix and uses them to form multiple low-rank components. All these components share only a single initialized trainable low-rank matrix, reducing trainable parameters while preserving model capacity.

Result: ID-LoRA outperforms both full fine-tuning and existing PEFT baselines (LoRA, DoRA, HydraLoRA) on five diverse benchmarks while using up to 46% fewer trainable parameters than standard LoRA. In multi-task scenarios, it surpasses LoRA variants on Code and MMLU tasks with only 54% of LoRA’s parameters.

Conclusion: ID-LoRA successfully breaks the trade-off between parameter efficiency and performance in PEFT, offering superior results with significantly fewer trainable parameters across diverse tasks and multi-task settings.

Abstract: LoRA has become a universal Parameter-Efficient Fine-Tuning (PEFT) technique that equips Large Language Models (LLMs) to adapt quickly to new tasks. However, when these models are scaled up, even the latest LoRA variants still introduce considerable overhead in trainable parameters. Conversely, aggressively lowering the rank to curb this overhead markedly degrades performance in complex multi-task settings. We propose ID-LoRA, a novel PEFT framework that breaks the trade-off. Its core innovation lies in extracting and reusing clustered parameter groups from the pretrained weight matrix. These groups are then used to form multiple low-rank components, all of which share only a single initialized trainable low-rank matrix. This approach cuts the number of trainable parameters while keeping the model’s capacity intact. We evaluate ID-LoRA on five diverse benchmarks: Mathematical Reasoning, Code Generation, MMLU, CommonsenseQA, and Safety Alignment. ID-LoRA outperforms both full fine-tuning and existing PEFT baselines (e.g., LoRA, DoRA, HydraLoRA) while using up to 46% fewer trainable parameters than the standard LoRA. In multi-task scenarios, it surpasses LoRA and its recent variants (e.g., DoRA and HydraLoRA) on both Code and MMLU tasks, yet requires only 54% of the trainable parameters demanded by the conventional LoRA.
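The parameter saving from sharing a single trainable low-rank matrix can be seen with a back-of-envelope count. This is a simplification of ID-LoRA's actual design (it ignores how the frozen components are extracted and clustered), but the ratio is roughly in line with the reported "up to 46% fewer" trainable parameters.

```python
# Hidden size and rank are illustrative; the point is the ratio.
d, r = 4096, 16

lora_trainable = d * r + r * d   # standard LoRA trains both factors A and B
shared_trainable = r * d         # one shared trainable low-rank matrix; the
                                 # other factors are frozen, extracted from
                                 # the pretrained weights

print(shared_trainable / lora_trainable)  # 0.5
```

Because the extracted components cost no trainable parameters, their number can grow (preserving capacity) without growing the optimizer state.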

[21] Adaptive Text Anonymization: Learning Privacy-Utility Trade-offs via Prompt Optimization

Gabriel Loiseau, Damien Sileo, Damien Riquet, Maxime Meyer, Marc Tommasi

Main category: cs.CL

TL;DR: Adaptive text anonymization framework using prompt optimization to automatically adapt anonymization strategies to specific privacy-utility requirements across domains.

Motivation: Existing text anonymization methods use static, manually designed strategies that lack flexibility to adapt to diverse privacy goals, domains, and downstream applications, failing to generalize across different contexts.

Method: Proposes a framework for task-specific prompt optimization that automatically constructs anonymization instructions for language models, enabling adaptation to different privacy goals, domains, and usage patterns.

Result: Achieves better privacy-utility trade-off than existing baselines across five diverse datasets, remains computationally efficient on open-source models with performance comparable to larger closed-source models, and discovers novel anonymization strategies.

Conclusion: The adaptive text anonymization framework successfully addresses the context-sensitive nature of anonymization by automatically adapting strategies to specific requirements, outperforming static approaches while maintaining efficiency.

Abstract: Anonymizing textual documents is a highly context-sensitive problem: the appropriate balance between privacy protection and utility preservation varies with the data domain, privacy objectives, and downstream application. However, existing anonymization methods rely on static, manually designed strategies that lack the flexibility to adjust to diverse requirements and often fail to generalize across domains. We introduce adaptive text anonymization, a new task formulation in which anonymization strategies are automatically adapted to specific privacy-utility requirements. We propose a framework for task-specific prompt optimization that automatically constructs anonymization instructions for language models, enabling adaptation to different privacy goals, domains, and downstream usage patterns. To evaluate our approach, we present a benchmark spanning five datasets with diverse domains, privacy constraints, and utility objectives. Across all evaluated settings, our framework consistently achieves a better privacy-utility trade-off than existing baselines, while remaining computationally efficient and effective on open-source language models, with performance comparable to larger closed-source models. Additionally, we show that our method can discover novel anonymization strategies that explore different points along the privacy-utility trade-off frontier.

[22] Explicit Grammar Semantic Feature Fusion for Robust Text Classification

Azrin Sultana, Firoz Ahmed

Main category: cs.CL

TL;DR: A lightweight text classification model that combines explicit grammatical structure encoding with frozen contextual embeddings, outperforming baseline models by 2-15% without heavy transformer architectures.

Motivation: Existing transformer-based NLP models are computationally intensive and unsuitable for resource-constrained environments. The paper aims to create a robust, lightweight classification model that doesn't rely on full parameterized transformers or heavy deep learning architectures.

Method: Encodes sentence-level grammatical structure (syntactic composition, phrase patterns, complexity indicators) into a compact grammar vector, then fuses it with frozen contextual embeddings. Uses explicit grammatical rules as inductive bias rather than learnable modules. Evaluated with DBNs, LSTMs, BiLSTMs, BERT, and XLNET with varied epochs.

Result: The unified feature representation model captures both semantic and structural properties of text, outperforming baseline models by 2%-15%. Enables more effective learning across heterogeneous domains and is suitable for edge devices.

Conclusion: The proposed lightweight framework successfully integrates grammatical structure as explicit inductive bias with semantic embeddings, achieving better performance than baseline models while being computationally efficient for edge deployment.

Abstract: Natural Language Processing enables computers to understand human language by analysing and classifying text efficiently with deep-level grammatical and semantic features. Existing models capture features by learning from large corpora with transformer models, which are computationally intensive and unsuitable for resource-constrained environments. Therefore, our proposed study incorporates comprehensive grammatical rules alongside semantic information to build a robust, lightweight classification model without resorting to fully parameterised transformer models or heavy deep learning architectures. The novelty of our approach lies in its explicit encoding of sentence-level grammatical structure, including syntactic composition, phrase patterns, and complexity indicators, into a compact grammar vector, which is then fused with frozen contextual embeddings. These heterogeneous elements are unified into a single representation that captures both the structural and semantic characteristics of the text. Deep learning models such as Deep Belief Networks (DBNs), Long Short-Term Memory networks (LSTMs), BiLSTMs, and transformer-based BERT and XLNET were used to train and evaluate the model, with the number of epochs varied. Based on experimental results, the unified feature representation model captures both the semantic and structural properties of text, outperforming baseline models by 2%-15% and enabling more effective learning across heterogeneous domains. Unlike prior syntax-aware transformer models that inject grammatical structure through additional attention layers, tree encoders, or full fine-tuning, the proposed framework treats grammar as an explicit inductive bias rather than a learnable module, resulting in a very lightweight model that delivers better performance on edge devices.
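
The fusion step is simple to picture. A minimal sketch, where the grammar vector uses POS-ratio and complexity features chosen here purely for illustration (the paper's exact feature set is not specified) and is concatenated with a frozen contextual embedding:

```python
import numpy as np

def grammar_vector(tokens, pos_tags):
    """Toy grammar vector: syntactic-composition ratios plus complexity
    indicators (type-token ratio, normalized length). Illustrative only."""
    n = max(len(tokens), 1)
    noun_ratio = sum(t == "NOUN" for t in pos_tags) / n
    verb_ratio = sum(t == "VERB" for t in pos_tags) / n
    adj_ratio = sum(t == "ADJ" for t in pos_tags) / n
    type_token = len(set(tokens)) / n
    return np.array([noun_ratio, verb_ratio, adj_ratio, type_token, n / 50.0])

def fuse(frozen_embedding, gvec):
    # Fusion by concatenation: the contextual embedding stays frozen;
    # only the downstream classifier sees the combined features.
    return np.concatenate([frozen_embedding, gvec])

tokens = ["the", "model", "fuses", "grammar", "features"]
pos = ["DET", "NOUN", "VERB", "NOUN", "NOUN"]
emb = np.zeros(768)  # placeholder for a frozen BERT sentence embedding
x = fuse(emb, grammar_vector(tokens, pos))
print(x.shape)  # (773,)
```

Any lightweight head (the paper's DBN/LSTM/BiLSTM classifiers) would then train on `x` while the embedding encoder stays frozen.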

[23] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing

Yifei Xu, Guilherme Potje, Shivam Shandilya, Tiancheng Yuan, Leonardo de Oliveira Nunes, Rakshanda Agarwal, Saeid Asgari, Adam Atkinson, Emre Kıcıman, Songwu Lu, Ranveer Chandra, Tusher Chakraborty

Main category: cs.CL

TL;DR: SibylSense: An inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items to create more discriminative rubrics for RL post-training in open-ended generation tasks.

DetailsMotivation: The paper addresses the challenge of designing aligned and robust rewards for RL post-training in open-ended generation. Current approaches have limitations: expert rubrics are costly, prompted rubrics are often superficial or inconsistent, and fixed-pool discriminative rubrics can saturate and drift, enabling reward hacking.

Method: SibylSense uses an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items. Memory is updated via verifier-based item rewards measured by reference-candidate answer discriminative gaps from a handful of examples. The approach alternates memory tuning with a rubric-adversarial policy update that produces rubric-satisfying candidate answers, shrinking discriminative gaps and driving the rubric generator to capture new quality dimensions.

Result: Experiments on two open-ended tasks show that SibylSense yields more discriminative rubrics and improves downstream RL performance over static and non-adaptive baselines.

Conclusion: SibylSense provides an effective approach for creating adaptive, discriminative rubrics for RL post-training in open-ended generation, addressing limitations of existing rubric construction methods.

Abstract: Designing aligned and robust rewards for open-ended generation remains a key barrier to RL post-training. Rubrics provide structured, interpretable supervision, but scaling rubric construction is difficult: expert rubrics are costly, prompted rubrics are often superficial or inconsistent, and fixed-pool discriminative rubrics can saturate and drift, enabling reward hacking. We present SibylSense, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items. Memory is updated via verifier-based item rewards measured by reference-candidate answer discriminative gaps from a handful of examples. SibylSense alternates memory tuning with a rubric-adversarial policy update that produces rubric-satisfying candidate answers, shrinking discriminative gaps and driving the rubric generator to capture new quality dimensions. Experiments on two open-ended tasks show that SibylSense yields more discriminative rubrics and improves downstream RL performance over static and non-adaptive baselines.
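
The memory-update rule can be sketched abstractly. Below, `scorer`, the item names, and the 0.2 threshold are hypothetical stand-ins for the paper's verifier-based item rewards; only the idea of keeping rubric items with a large reference-candidate discriminative gap comes from the abstract:

```python
def discriminative_gap(ref_scores, cand_scores):
    # Gap = mean verifier score on reference answers minus mean score
    # on policy-generated candidate answers.
    return sum(ref_scores) / len(ref_scores) - sum(cand_scores) / len(cand_scores)

def update_memory(memory, items, scorer, refs, cands, threshold=0.2):
    """Keep only candidate rubric items whose scores separate reference
    answers from candidate answers by more than a threshold (illustrative)."""
    for item in items:
        gap = discriminative_gap([scorer(item, a) for a in refs],
                                 [scorer(item, a) for a in cands])
        if gap > threshold:
            memory.append(item)
    return memory
```

The adversarial policy update then pushes candidate answers to satisfy the stored items, shrinking these gaps and forcing the frozen rubric generator (conditioned on the memory) to surface new quality dimensions.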

[24] Overton Pluralistic Reinforcement Learning for Large Language Models

Yu Fu, Seongho Son, Ilija Bogunovic

Main category: cs.CL

TL;DR: OP-GRPO is a reinforcement learning framework that enables a single LLM to generate diverse, pluralistic responses capturing multiple human perspectives without explicit prompting or modular architectures.

DetailsMotivation: Existing alignment paradigms fail to capture the pluralistic nature of human values, which vary across individuals and contexts. Current approaches either produce single-perspective responses or require complex modular systems.

Method: Two-step approach: 1) Train a Sentence Transformer similarity estimator for accurate coverage evaluation of pluralistic responses. 2) OP-GRPO training with dual-reward system that ensures both broad coverage of human perspectives and uniqueness of each perspective.

Result: Qwen2.5-3B-Instruct model outperforms 20B GPT-OSS baseline by 37.4% relative accuracy on NLI benchmark, and beats modular architecture baseline by 19.1% relative improvement. Shows “small models, big perspective coverage” effect.

Conclusion: OP-GRPO enables efficient pluralistic response generation from single LLMs without complex architectures, demonstrating superior performance over larger models and modular baselines in capturing diverse human perspectives.

Abstract: Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a “small models, big perspective coverage” effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.
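
A toy version of the coverage-plus-uniqueness reward can be written with plain cosine similarities, assuming perspective embeddings from the fine-tuned similarity estimator; the weights `alpha`/`beta` and the exact aggregation are illustrative, not from the paper:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dual_reward(cand, refs, alpha=1.0, beta=0.5):
    """Coverage of reference perspectives minus a redundancy penalty
    among generated perspectives (weights are illustrative)."""
    # Coverage: every reference perspective should be matched by some candidate.
    coverage = np.mean([max(cosine(c, r) for c in cand) for r in refs])
    # Uniqueness: penalize pairwise similarity between generated perspectives.
    pairs = [cosine(cand[i], cand[j])
             for i in range(len(cand)) for j in range(i + 1, len(cand))]
    redundancy = np.mean(pairs) if pairs else 0.0
    return alpha * coverage - beta * redundancy
```

A response that restates one perspective twice scores high on coverage of that perspective but pays the redundancy penalty, so the policy is driven toward sets of distinct perspectives.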

[25] Don’t Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation

Sayantan Dasgupta, Trevor Cohn, Timothy Baldwin

Main category: cs.CL

TL;DR: Proposes a new tail-aware divergence for language model distillation that decouples teacher’s top-K predictions from lower-probability ones to better capture informative distribution tails, maintaining KL divergence’s computational efficiency.

DetailsMotivation: Standard KL divergence in language model distillation is dominated by teacher's highest probability tokens (modes), diminishing the influence of less probable but potentially informative components of the output distribution.

Method: Introduces a tail-aware divergence that decouples the contribution of teacher’s top-K predicted probabilities from lower-probability predictions while maintaining the same computational profile as KL divergence.

Result: The modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models across various datasets, is efficient, and can be performed with modest academic budgets for large datasets.

Conclusion: The proposed tail-aware divergence improves language model distillation by better capturing informative distribution tails while maintaining computational efficiency and accessibility for academic research.

Abstract: The core learning signal used in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be dominated by the next tokens with the highest probabilities, i.e., the teacher’s modes, thereby diminishing the influence of less probable yet potentially informative components of the output distribution. We propose a new tail-aware divergence that decouples the contribution of the teacher model’s top-K predicted probabilities from that of lower-probability predictions, while maintaining the same computational profile as KL divergence. Our decoupled approach reduces the impact of the teacher modes and, consequently, increases the contribution of the tail of the distribution. Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models across various datasets. Furthermore, the distillation process is efficient and can be performed with a modest academic budget for large datasets, eliminating the need for industry-scale computing.
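
The decoupling is easy to state concretely. A hypothetical NumPy sketch in which the elementwise KL contributions are split at the teacher's top-K tokens and the tail terms are re-weighted (the paper's exact weighting scheme is not given here):

```python
import numpy as np

def tail_aware_kl(p, q, k=2, tail_weight=2.0, eps=1e-12):
    """Split the elementwise KL contributions at the teacher's top-k
    tokens and re-weight the tail (weighting scheme is illustrative)."""
    top = np.argsort(p)[::-1][:k]               # teacher's top-K token ids
    mask = np.zeros_like(p, dtype=bool)
    mask[top] = True
    terms = p * np.log((p + eps) / (q + eps))   # elementwise KL contributions
    return terms[mask].sum() + tail_weight * terms[~mask].sum()

p = np.array([0.7, 0.2, 0.05, 0.05])  # teacher next-token distribution
q = np.array([0.6, 0.3, 0.08, 0.02])  # student next-token distribution
print(tail_aware_kl(p, q))
```

With `tail_weight=1.0` this reduces exactly to the standard KL objective, so ordinary distillation is a special case; the sort and masked sums keep the cost profile essentially that of plain KL.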

[26] FinAnchor: Aligned Multi-Model Representations for Financial Prediction

Zirui He, Huopu Zhang, Yanguang Liu, Sirui Wu, Mengnan Du

Main category: cs.CL

TL;DR: FinAnchor is a lightweight framework that integrates embeddings from multiple LLMs for financial prediction by selecting an anchor embedding space and learning linear mappings to align representations from other models.

DetailsMotivation: Financial prediction from long documents faces challenges with sparse actionable signals obscured by noise, and the optimal LLM for generating embeddings varies across tasks and time periods, necessitating a robust multi-model approach.

Method: Selects an anchor embedding space from available LLMs, learns linear mappings to align representations from other models into this anchor space, then aggregates aligned features to form unified representations for downstream prediction.

Result: Consistently outperforms strong single-model baselines and standard ensemble methods across multiple financial NLP tasks, demonstrating effectiveness of anchoring heterogeneous representations.

Conclusion: FinAnchor provides an effective lightweight framework for robust financial prediction by integrating multiple LLM embeddings without fine-tuning underlying models, addressing feature space incompatibility through anchor-based alignment.

Abstract: Financial prediction from long documents involves significant challenges, as actionable signals are often sparse and obscured by noise, and the optimal LLM for generating embeddings varies across tasks and time periods. In this paper, we propose FinAnchor (Financial Anchored Representations), a lightweight framework that integrates embeddings from multiple LLMs without fine-tuning the underlying models. FinAnchor addresses the incompatibility of feature spaces by selecting an anchor embedding space and learning linear mappings to align representations from other models into this anchor. These aligned features are then aggregated to form a unified representation for downstream prediction. Across multiple financial NLP tasks, FinAnchor consistently outperforms strong single-model baselines and standard ensemble methods, demonstrating the effectiveness of anchoring heterogeneous representations for robust financial prediction.
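
The alignment step amounts to a linear least-squares fit on paired embeddings of the same documents. A toy NumPy sketch on synthetic data (dimensions and aggregation-by-averaging are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Paired embeddings of the same 100 documents from two models.
anchor = rng.normal(size=(100, 64))          # anchor model's embedding space
other = anchor @ rng.normal(size=(64, 48))   # second model, different 48-d space

# Learn a linear map W so that other @ W approximates the anchor embeddings.
W, *_ = np.linalg.lstsq(other, anchor, rcond=None)
aligned = other @ W                          # second model, now in anchor space

# Aggregate aligned features into one unified representation per document.
unified = (anchor + aligned) / 2.0
print(unified.shape)  # (100, 64)
```

Because only small linear maps are fit, new LLMs can be added without touching the underlying models, which is what keeps the framework lightweight.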

[27] Exa-PSD: a new Persian sentiment analysis dataset on Twitter

Seyed Himan Ghaderi, Saeed Sarbazi Azad, Mohammad Mehdi Jaziriyan, Ahmad Akbari

Main category: cs.CL

TL;DR: Introduces Exa, a Persian sentiment analysis dataset of 12,000 annotated tweets, achieving 79.87 Macro F-score with pre-trained ParsBERT and RoBERTa models.

DetailsMotivation: Persian NLP faces challenges due to limited datasets in specialized topics and the prevalence of irony/colloquial language in social media, necessitating a Persian Twitter sentiment analysis dataset.

Method: Collected 12,000 Persian tweets annotated by 5 native Persian taggers into positive, neutral, and negative classes. Evaluated using pre-trained ParsBERT and RoBERTa models.

Result: Achieved 79.87 Macro F-score, demonstrating the dataset’s value for Persian sentiment analysis systems.

Conclusion: The Exa dataset provides a valuable resource for Persian sentiment analysis on social media, addressing the gap in Persian NLP resources for Twitter data.

Abstract: Today, social networks such as Twitter are among the most widely used platforms for communication. Analyzing this data yields useful information about people's opinions in tweets. Sentiment analysis plays a vital role in NLP, identifying individuals' opinions about a specific topic. Natural language processing in Persian still faces many challenges despite the advent of strong language models. The datasets available in Persian generally cover narrow topics such as products, foods, and hotels, while users on social media may use irony and colloquial phrases. To overcome these challenges, a dataset for Persian sentiment analysis on Twitter is needed. In this paper, we introduce the Exa sentiment analysis Persian dataset, collected from Persian tweets. This dataset contains 12,000 tweets, annotated by 5 native Persian taggers and labeled into 3 classes: positive, neutral, and negative. We present the characteristics and statistics of this dataset and use the pre-trained ParsBERT and RoBERTa models as base models to evaluate it. Our evaluation reached a 79.87 Macro F-score, which shows that the model and data can be adequately valuable for a sentiment analysis system.

[28] The Art of Efficient Reasoning: Data, Reward, and Optimization

Taiqiang Wu, Zenan Zu, Bo Zhou, Ngai Wong

Main category: cs.CL

TL;DR: Systematic study of efficient reasoning in LLMs using RL to optimize Chain-of-Thought length while maintaining accuracy, revealing two-stage training process and key insights for practical implementation.

DetailsMotivation: LLMs benefit from scaled Chain-of-Thought reasoning but suffer from computational overhead. Current efficient reasoning approaches use RL to incentivize short yet accurate thinking trajectories, but the mechanics need systematic investigation.

Method: Comprehensive evaluation with fine-grained metrics including length distribution conditioned on correctness across token budgets (2k-32k). Extensive experiments (0.2M GPU hours) deconstructing training prompts, rollouts, reward shaping, and optimization strategies. Key approach: train on easier prompts to ensure positive reward signal density and avoid length collapse.

Result: Revealed two-stage training paradigm: length adaptation followed by reasoning refinement. Found that learned length bias generalizes across domains. Validated findings across Qwen3 series (0.6B to 30B), demonstrating robustness and generalization.

Conclusion: Provides valuable insights and practical guidelines for efficient reasoning in LLMs, showing that training on easier prompts with proper reward shaping can optimize Chain-of-Thought length while maintaining accuracy across different model scales.

Abstract: Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate for more fine-grained metrics, including length distribution conditioned on correctness and performance across a wide spectrum of token budgets ranging from 2k to 32k. First, we reveal that the training process follows a two-stage paradigm: length adaptation and reasoning refinement. After that, we conduct extensive experiments (about 0.2 million GPU hours) in a unified protocol, deconstructing training prompts and rollouts, reward shaping, and optimization strategies. In particular, a key finding is to train on relatively easier prompts, ensuring the density of positive reward signals and thus avoiding the length collapse. Meanwhile, the learned length bias can be generalized across domains. We distill all findings into valuable insights and practical guidelines, and further validate them across the Qwen3 series, ranging from 0.6B to 30B, demonstrating the robustness and generalization.
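
A common way to implement the kind of reward shaping described is a length-penalized correctness reward; the sketch below is an illustrative shaping function, not the paper's exact design:

```python
def length_shaped_reward(correct, n_tokens, budget=4096, penalty=0.2):
    """Illustrative length-shaped reward for efficient-reasoning RL:
    correct answers earn 1.0 minus a penalty growing with CoT length,
    capped at the token budget; wrong answers earn 0 regardless of
    length, so brevity is never rewarded over accuracy."""
    if not correct:
        return 0.0
    return 1.0 - penalty * min(n_tokens / budget, 1.0)
```

Under such shaping, the paper's key finding is intuitive: on prompts the model mostly gets wrong, nearly all rewards are 0 and the only gradient signal pushes toward shorter rollouts, which is the length-collapse failure mode; easier prompts keep the density of positive rewards high enough that length and accuracy are traded off rather than accuracy abandoned.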

[29] Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models

Paola Merlo, Chunyang Jiang, Giuseppe Samo, Vivi Nastase

Main category: cs.CL

TL;DR: BLM is a novel linguistic task inspired by intelligence tests that uses structured multiple-choice problems to probe LLMs’ ability to detect linguistic objects, systematic patterns, and their interaction with reasoning errors.

DetailsMotivation: To create a curated, structured dataset that can answer core questions about LLM abilities: whether they detect linguistic objects and their properties, whether they use systematic patterns across sentences, and how linguistic vs reasoning errors interact.

Method: Developed Blackbird Language Matrices (BLM) - multiple-choice problems structured at multiple levels (within sentences, across input sequences, within candidate answers). Created datasets in multiple languages with both simple baseline models and more tailored models for benchmarking.

Result: BLMs can be solved at good performance levels across multiple languages. LLM representations contain relevant grammatical objects and attributes for solving linguistic tasks, and solutions are reached by detecting systematic patterns across sentences.

Conclusion: Curated, structured datasets like BLMs support multi-faceted investigations of language and LLM properties, enabling explainability research to understand why LLMs behave as they do.

Abstract: This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems, structured at multiple levels: within each sentence, across the input sequence, and within each candidate answer. Because of their rich structure, these curated but naturalistic datasets are key to answering some core questions about current large language models' abilities: do LLMs detect linguistic objects and their properties? Do they detect and use systematic patterns across sentences? Are they more prone to linguistic or reasoning errors, and how do these interact? We show that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models. We show that their representations contain the grammatical objects and attributes relevant to solving a linguistic task. We also show that these solutions are reached by detecting systematic patterns across sentences. The paper supports the view that curated, structured datasets enable multi-faceted investigations of the properties of language and of large language models. Because they present a curated, articulated structure, because they comprise both learning contexts and expected answers, and because they are partly built by hand, BLMs fall into the category of datasets that can support explainability investigations and help answer why large language models behave the way they do.

[30] Linear Reasoning vs. Proof by Cases: Obstacles for Large Language Models in FOL Problem Solving

Yuliang Ji, Fuchen Shen, Jian Wu, Qiujie Xie, Yue Zhang

Main category: cs.CL

TL;DR: PC-FOL: A novel first-order logic dataset focusing on case-based reasoning problems with natural language proofs, revealing performance gaps between linear and case-based reasoning in LLMs.

DetailsMotivation: Existing mathematical reasoning datasets primarily focus on linear reasoning, neglecting important reasoning types like proof by contradiction and proof by cases, which are crucial for comprehensively evaluating LLMs' reasoning abilities.

Method: Introduces PC-FOL dataset annotated by professional mathematicians, focusing on case-based reasoning problems with manually written natural language proofs. Provides theoretical analysis using graphical models to explain performance disparities.

Result: Experimental results show substantial performance gap between linear reasoning and case-based reasoning problems across leading LLMs. Theoretical analysis provides explanation for the observed disparity.

Conclusion: The work reveals core challenges in automated natural language mathematical proof generation and paves the way for future research on comprehensive mathematical reasoning evaluation.

Abstract: To comprehensively evaluate the mathematical reasoning capabilities of Large Language Models (LLMs), researchers have introduced abundant mathematical reasoning datasets. However, most existing datasets primarily focus on linear reasoning, neglecting other reasoning types such as proof by contradiction and proof by cases, which are crucial for investigating LLMs’ reasoning abilities. To address this limitation, we first introduce a novel first-order logic (FOL) dataset named PC-FOL, annotated by professional mathematicians and focused on case-based reasoning problems. Every instance in this dataset is equipped with a manually written natural language proof, clearly distinguishing it from conventional linear reasoning datasets. Our experimental results over leading LLMs demonstrate a substantial performance gap between linear reasoning and case-based reasoning problems. To further investigate this phenomenon, we provide a theoretical analysis grounded in graphical models, which explains the observed disparity between the two types of reasoning problems. We hope this work reveals the core challenges in automated natural language mathematical proof generation, paving the way for future research.

[31] Evaluating Proactive Risk Awareness of Large Language Models

Xuan Luo, Yubin Chen, Zhiyu Hou, Linpu Yu, Geng Tu, Jing Li, Ruifeng Xu

Main category: cs.CL

TL;DR: Proactive risk awareness evaluation framework for LLMs to anticipate ecological harms before they occur, tested with the Butterfly dataset and revealing gaps in current safety alignment.

DetailsMotivation: LLMs are increasingly used in decision-making, requiring safety measures that go beyond reacting to explicit harmful intent to anticipate unintended ecological risks and provide warnings before damage occurs.

Method: Introduces a proactive risk awareness evaluation framework with the Butterfly dataset, containing 1,094 queries that simulate ordinary solution-seeking activities with latent ecological impact. Tested across five LLMs, analyzing the effects of response length, language, and modality.

Result: Significant declines in proactive awareness under length-restricted responses, cross-lingual similarities, and persistent blind spots in multimodal species protection. Reveals a gap between current safety alignment and real-world ecological responsibility.

Conclusion: Highlights the critical need for proactive safeguards in LLM deployment beyond reactive safety measures, emphasizing ecological responsibility as an important frontier for AI safety.

Abstract: As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended but consequential risks. In this work, we introduce a proactive risk awareness evaluation framework that measures whether LLMs can anticipate potential harms and provide warnings before damage occurs. We construct the Butterfly dataset to instantiate this framework in the environmental and ecological domain. It contains 1,094 queries that simulate ordinary solution-seeking activities whose responses may induce latent ecological impact. Through experiments across five widely used LLMs, we analyze the effects of response length, languages, and modality. Experimental results reveal consistent, significant declines in proactive awareness under length-restricted responses, cross-lingual similarities, and persistent blind spots in (multimodal) species protection. These findings highlight a critical gap between current safety alignment and the requirements of real-world ecological responsibility, underscoring the need for proactive safeguards in LLM deployment.

[32] Beyond the Star Rating: A Scalable Framework for Aspect-Based Sentiment Analysis Using LLMs and Text Classification

Vishal Patil, Shree Vaishnavi Bacha, Revanth Yamani, Yidan Sun, Mayank Kejriwal

Main category: cs.CL

TL;DR: Hybrid approach combining LLMs for aspect identification with traditional ML for sentiment classification, enabling scalable analysis of millions of customer reviews.

DetailsMotivation: Customer reviews are valuable but challenging to analyze at scale; LLMs show promise but face computational cost and scalability limitations for large-scale review analysis.

Method: Uses ChatGPT to identify key aspects from sampled restaurant reviews, then develops sentiment classifiers using human-labeled data, applying them to 4.7 million reviews spanning 17 years from a major platform.

Result: Machine-labeled aspects significantly explain variance in overall restaurant ratings across different dining aspects, cuisines, and geographical regions.

Conclusion: Combining LLMs with traditional ML can effectively automate aspect-based sentiment analysis of large-scale customer feedback, providing a practical framework for hospitality and other service sectors.

Abstract: Customer-provided reviews have become an important source of information for business owners and other customers alike. However, effectively analyzing millions of unstructured reviews remains challenging. While large language models (LLMs) show promise for natural language understanding, their application to large-scale review analysis has been limited by computational costs and scalability concerns. This study proposes a hybrid approach that uses LLMs for aspect identification while employing classic machine-learning methods for sentiment classification at scale. Using ChatGPT to analyze sampled restaurant reviews, we identified key aspects of dining experiences and developed sentiment classifiers using human-labeled reviews, which we subsequently applied to 4.7 million reviews collected over 17 years from a major online platform. Regression analysis reveals that our machine-labeled aspects significantly explain variance in overall restaurant ratings across different aspects of dining experiences, cuisines, and geographical regions. Our findings demonstrate that combining LLMs with traditional machine learning approaches can effectively automate aspect-based sentiment analysis of large-scale customer feedback, suggesting a practical framework for both researchers and practitioners in the hospitality industry and potentially, other service sectors.

[33] Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

Sanket Badhe, Deep Shah

Main category: cs.CL

TL;DR: Prompt-Level Distillation (PLD) extracts reasoning patterns from large teacher models to create structured instructions for smaller student models, enabling efficient reasoning with transparency.

DetailsMotivation: Chain-of-Thought prompting is accurate but has high latency and inference costs, while fine-tuning smaller models sacrifices interpretability and has operational overhead. Need efficient, transparent reasoning for regulated industries and edge devices.

Method: PLD extracts explicit reasoning patterns from a Teacher model and organizes them into a structured list of expressive instructions for the Student model’s System Prompt, enabling transparent reasoning without fine-tuning.

Result: On StereoSet and Contract-NLI datasets using Gemma-3 4B, PLD improved Macro F1 scores from 57% to 90.0% and 67% to 83% respectively, matching frontier performance with negligible latency overhead.

Conclusion: PLD enables compact models to achieve state-of-the-art reasoning performance with full transparency, making it suitable for regulated industries, high-volume use cases, and edge devices.

Abstract: Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model’s System Prompt. Evaluated on the StereoSet and Contract-NLI datasets using Gemma-3 4B, PLD improved Macro F1 scores from 57% to 90.0% and 67% to 83% respectively, enabling this compact model to match frontier performance with negligible latency overhead. These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-volume use cases and edge devices.

[34] PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

Samah Fodeh, Linhai Ma, Yan Wang, Srivani Talakokkul, Ganesh Puthiaraju, Afshan Khan, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

Main category: cs.CL

TL;DR: PVminer: A domain-adapted NLP framework using BERT models and topic modeling to structure patient voice in healthcare communications, achieving strong performance on multi-label classification tasks.

DetailsMotivation: Patient-generated text contains valuable patient voice information but traditional qualitative coding is labor-intensive and doesn't scale. Existing ML/NLP approaches treat patient-centered communication and social determinants of health as separate tasks or use models not well-suited to patient-facing language.

Method: PVminer formulates patient voice detection as multi-label, multi-class prediction using patient-specific BERT encoders (PV-BERT-base/large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for hierarchical labels (Code, Subcode, Combo). Topic representations are incorporated during fine-tuning and inference.

Result: Achieves F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo), outperforming biomedical and clinical pre-trained baselines. Ablation study shows author identity and topic-based augmentation each contribute meaningful gains.

Conclusion: PVminer provides an effective NLP framework for structuring patient voice at scale, with models, code, and documentation to be publicly released to support healthcare research.

Abstract: Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH). Traditional qualitative coding frameworks are labor intensive and do not scale to large volumes of patient-authored messages across health systems. Existing machine learning (ML) and natural language processing (NLP) approaches provide partial solutions but often treat patient-centered communication (PCC) and SDoH as separate tasks or rely on models not well suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code, Subcode, and Combo-level labels. Topic representations are incorporated during fine-tuning and inference to enrich semantic inputs. PVminer achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, achieving F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request for research use.
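The method above fuses encoder embeddings with topic representations before multi-label classification. A minimal pure-Python sketch of that fusion step, with made-up dimensions and an untrained linear head (nothing here reproduces the actual PV-BERT or PV-Topic-BERT models):

```python
import math

def topic_augmented_features(text_emb, topic_dist):
    # Thematic augmentation: concatenate the sentence embedding with the
    # document's topic mixture so the classifier sees both signals.
    return text_emb + topic_dist  # list concatenation

def multilabel_predict(features, weights, biases, threshold=0.5):
    # One independent sigmoid per label (multi-label, not softmax):
    # a single message can carry several Codes at once.
    preds = []
    for w_row, b in zip(weights, biases):
        logit = sum(f * w for f, w in zip(features, w_row)) + b
        preds.append(1 if 1.0 / (1.0 + math.exp(-logit)) >= threshold else 0)
    return preds

text_emb = [0.2, -0.1, 0.4]        # stand-in for a 768-dim encoder output
topic_dist = [0.7, 0.2, 0.1]       # stand-in for a topic-model mixture
features = topic_augmented_features(text_emb, topic_dist)
weights = [[0.5] * 6, [-0.5] * 6]  # two hypothetical Code labels
biases = [0.0, 0.0]
print(multilabel_predict(features, weights, biases))  # [1, 0]
```

The independent-sigmoid head is what makes the task "multi-label, multi-class" as the abstract describes: each hierarchical label (Code, Subcode, Combo) is thresholded on its own.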

[35] On Data Engineering for Scaling LLM Terminal Capabilities

Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping

Main category: cs.CL

TL;DR: A systematic study of data engineering for terminal agents, introducing Terminal-Task-Gen pipeline for synthetic task generation and Terminal-Corpus dataset, with Nemotron-Terminal models showing substantial performance gains on Terminal-Bench 2.0.

DetailsMotivation: Despite rapid progress in terminal capabilities of large language models, training data strategies for state-of-the-art terminal agents remain largely undisclosed, creating a gap in understanding effective data engineering practices for this domain.

Method: Developed Terminal-Task-Gen, a lightweight synthetic task generation pipeline supporting seed-based and skill-based task construction. Conducted comprehensive analysis of data and training strategies including filtering, curriculum learning, long context training, and scaling behavior. Created Terminal-Corpus dataset and trained Nemotron-Terminal models initialized from Qwen3 (8B, 14B, 32B).

Result: Nemotron-Terminal models achieved substantial gains on Terminal-Bench 2.0: 8B improved from 2.5% to 13.0%, 14B from 4.0% to 20.2%, and 32B from 3.4% to 27.4%, matching performance of significantly larger models. All resources are open-sourced.

Conclusion: The study provides systematic insights into data engineering for terminal agents, demonstrating that carefully engineered synthetic data and training strategies can substantially improve terminal task performance while being open-sourced to accelerate research.

Abstract: Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.
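The seed-based branch of the task-generation pipeline can be illustrated with a toy expander that crosses a few seed task templates with parameter sets to produce concrete variants. The real Terminal-Task-Gen presumably uses an LLM for this step; the templates and parameter names below are invented:

```python
import itertools

def expand_seed_tasks(seed_templates, parameter_sets):
    """Seed-based task construction: each seed template crossed with each
    parameter set yields one concrete terminal task description."""
    return [
        template.format(**params)
        for template, params in itertools.product(seed_templates, parameter_sets)
    ]

seeds = [
    "Find every {ext} file under {root} and report the total line count.",
    "Compress all {ext} files under {root} into a single archive.",
]
params = [
    {"ext": ".py", "root": "/workspace/src"},
    {"ext": ".log", "root": "/var/log/app"},
]
tasks = expand_seed_tasks(seeds, params)
print(len(tasks))  # 4 tasks: 2 seeds x 2 parameter sets
```

The point of the sketch is the scaling behavior the paper studies: a small set of human-curated seeds fans out into a much larger synthetic corpus.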

[36] Augmenting Lateral Thinking in Language Models with Humor and Riddle Data for the BRAINTEASER Task

Mina Ghashami, Soumya Smruti Mishra

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2405.10385 returned HTTP 429 (rate limited).

[37] SEFL: A Framework for Generating Synthetic Educational Assignment Feedback with LLM Agents

Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2502.12927 returned HTTP 429 (rate limited).

[38] Bridging Gaps in Natural Language Processing for Yorùbá: A Systematic Review of a Decade of Progress and Prospects

Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2502.17364 returned HTTP 429 (rate limited).

[39] EAMET: Robust Massive Model Editing via Embedding Alignment Optimization

Yanbo Dai, Zhenlan Ji, Zongjie Li, Shuai Wang

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2505.11876 returned HTTP 429 (rate limited).

[40] Language Models use Lookbacks to Track Beliefs

Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2505.14685 returned HTTP 429 (rate limited).

[41] HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, Siqi He, Shannan Yan, Junzhe Chen, Xiaomin He, Chaoya Jiang, Wei Ye, Kaidong Yu, Xuelong Li

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2506.03922 returned HTTP 429 (rate limited).

[42] Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Apurv Verma, NhatHai Phan, Shubhendu Trivedi

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2506.04462 returned HTTP 429 (rate limited).

[43] Should We Still Pretrain Encoders with Masked Language Modeling?

Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2507.00994 returned HTTP 429 (rate limited).

[44] RooseBERT: A New Deal For Political Language Modelling

Deborah Dore, Elena Cabrio, Serena Villata

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2508.03250 returned HTTP 429 (rate limited).

[45] ATTS: Asynchronous Test-Time Scaling via Conformal Prediction

Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng, Chenyang Zhao, Hui Shen, Hanbo Li, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Lingpeng Kong, Ngai Wong

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2509.15148 returned HTTP 429 (rate limited).

[46] LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts

Yuan Zhuang, Yi Shen, Yuexin Bian, Qing Su, Shihao Ji, Yuanyuan Shi, Fei Miao

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2509.25684 returned HTTP 429 (rate limited).

[47] Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts

Hanwen Du, Yuxin Dong, Xia Ning

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2509.26314 returned HTTP 429 (rate limited).

[48] Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility

Shramay Palta, Peter Rankel, Sarah Wiegreffe, Rachel Rudinger

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2510.08091 returned HTTP 429 (rate limited).

[49] Understanding the Ability of LLMs to Handle Character-Level Perturbation

Anyuan Zhuo, Xuefei Ning, Ningyuan Li, Jingyi Zhu, Yu Wang, Pinyan Lu

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2510.14365 returned HTTP 429 (rate limited).

[50] LUMI: Unsupervised Intent Clustering with Multiple Pseudo-Labels

I-Fan Lin, Faegheh Hasibi, Suzan Verberne

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2510.14640 returned HTTP 429 (rate limited).

[51] Repurposing Synthetic Data for Fine-grained Search Agent Supervision

Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2510.24694 returned HTTP 429 (rate limited).

[52] OckBench: Measuring the Efficiency of LLM Reasoning

Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2511.05722 returned HTTP 429 (rate limited).

[53] Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

Iker García-Ferrero, David Montero, Roman Orus

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2512.16602 returned HTTP 429 (rate limited).

[54] Towards Efficient Agents: A Co-Design of Inference Architecture and System

Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu, Hanting Chen, Wangze Zhang, Chuansai Zhou, Yiming Li, Chen Chen, Xing Li, Zhiyuan Yang, Xiaosong Li, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan, Yunhe Wang

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2512.18337 returned HTTP 429 (rate limited).

[55] Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2512.23447 returned HTTP 429 (rate limited).

[56] What Matters For Safety Alignment?

Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2601.03868 returned HTTP 429 (rate limited).

[57] A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding

Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit, Thomas Pickard, Adriana S. Pagano, Aline Villavicencio, Gülşen Eryiğit, Ágnes Abuczki, Aida Cardoso, Alesia Lazarenka, Dina Almassova, Amalia Mendes, Anna Kanellopoulou, Antoni Brosa-Rodríguez, Baiba Saulite, Beata Wojtowicz, Bolette Pedersen, Carlos Manuel Hidalgo-Ternero, Chaya Liebeskind, Danka Jokić, Diego Alves, Eleni Triantafyllidi, Erik Velldal, Fred Philippy, Giedre Valunaite Oleskeviciene, Ieva Rizgeliene, Inguna Skadina, Irina Lobzhanidze, Isabell Stinessen Haugen, Jauza Akbar Krito, Jelena M. Marković, Johanna Monti, Josue Alejandro Sauca, Kaja Dobrovoljc, Kingsley O. Ugwuanyi, Laura Rituma, Lilja Øvrelid, Maha Tufail Agro, Manzura Abjalova, Maria Chatzigrigoriou, María del Mar Sánchez Ramos, Marija Pendevska, Masoumeh Seyyedrezaei, Mehrnoush Shamsfard, Momina Ahsan, Muhammad Ahsan Riaz Khan, Nathalie Carmen Hau Norman, Nilay Erdem Ayyıldız, Nina Hosseini-Kivanani, Noémi Ligeti-Nagy, Numaan Naeem, Olha Kanishcheva, Olha Yatsyshyna, Daniil Orel, Petra Giommarelli, Petya Osenova, Radovan Garabik, Regina E. Semou, Rozane Rebechi, Salsabila Zahirah Pranida, Samia Touileb, Sanni Nimb, Sarfraz Ahmad, Sarvinoz Sharipova, Shahar Golan, Shaoxiong Ji, Sopuruchi Christian Aboh, Srdjan Sucur, Stella Markantonatou, Sussi Olsen, Vahide Tajalli, Veronika Lipp, Voula Giouli, Yelda Yeşildal Eraydın, Zahra Saaberi, Zhuohan Xie

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2601.08645 returned HTTP 429 (rate limited).

[58] FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning

Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2601.19001 returned HTTP 429 (rate limited).

[59] What If We Allocate Test-Time Compute Adaptively?

Ahsan Bilal, Ahmed Mohsin, Muhammad Umer, Ali Subhan, Hassan Rizwan, Ayesha Mohsin, Dean Hougen

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2602.01070 returned HTTP 429 (rate limited).

[60] The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems

Shangbin Feng, Kishan Panaganti, Yulia Tsvetkov, Wenhao Yu

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2602.05182 returned HTTP 429 (rate limited).

[61] Language Modeling and Understanding Through Paraphrase Generation and Detection

Jan Philip Wahle

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2602.08274 returned HTTP 429 (rate limited).

[62] Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness

Pietro Bernardelle, Stefano Civelli, Kevin Roitero, Gianluca Demartini

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2602.14044 returned HTTP 429 (rate limited).

[63] INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection

Shubham Kulkarni, Alexander Lyzhov, Preetam Joshi, Shiva Chaitanya

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2602.18448 returned HTTP 429 (rate limited).

[64] Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning

Borisiuk Anna, Andrey Savchenko, Alexander Panchenko, Elena Tutubalina

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2602.19612 returned HTTP 429 (rate limited).

[65] BabyLM Turns 4 and Goes Multilingual: Call for Papers for the 2026 BabyLM Workshop

Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Jaap Jumelet, Tal Linzen, Aaron Mueller, Suchir Salhan, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2602.20092 returned HTTP 429 (rate limited).

[66] Usability Study of Security Features in Programmable Logic Controllers

Karen Li, Kopo M. Ramokapane, Awais Rashid

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2208.02500 returned HTTP 429 (rate limited).

[67] Watermarking Language Models with Error Correcting Codes

Patrick Chao, Yan Sun, Edgar Dobriban, Hamed Hassani

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2406.10281 returned HTTP 429 (rate limited).

[68] Causal Claims in Economics

Prashant Garg, Thiemo Fetzer

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2501.06873 returned HTTP 429 (rate limited).

[69] ICE-ID: A Novel Historical Census Dataset for Longitudinal Identity Resolution

Gonçalo Hora de Carvalho, Lazar S. Popov, Sander Kaatee, Mário S. Correia, Kristinn R. Thórisson, Tangrui Li, Pétur Húni Björnsson, Eiríkur Smári Sigurðarson, Jilles S. Dibangoye

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2506.13792 returned HTTP 429 (rate limited).

[70] Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs

Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel, Jakob Foerster, Laura Ruis

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2506.18777 returned HTTP 429 (rate limited), so no analysis could be generated.

[71] Complexity-aware fine-tuning

Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2506.21220 returned HTTP 429 (rate limited), so no analysis could be generated.

[72] TASER: Table Agents for Schema-guided Extraction and Recommendation

Nicole Cho, Kirsty Fielding, William Watson, Sumitra Ganesh, Manuela Veloso

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2508.13404 returned HTTP 429 (rate limited), so no analysis could be generated.

[73] A Simple and Efficient Jailbreak Method Exploiting LLMs’ Helpfulness

Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li, Ruifeng Xu

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2509.14297 returned HTTP 429 (rate limited), so no analysis could be generated.

[74] RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility

Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2509.23115 returned HTTP 429 (rate limited), so no analysis could be generated.

[75] TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models

Tong Guan, Zijie Meng, Dianqi Li, Shiyu Wang, Chao-Han Huck Yang, Qingsong Wen, Zuozhu Liu, Sabato Marco Siniscalchi, Ming Jin, Shirui Pan

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2509.24803 returned HTTP 429 (rate limited), so no analysis could be generated.

[76] Towards Scalable Oversight via Partitioned Human Supervision

Ren Yin, Takashi Ishida, Masashi Sugiyama

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2510.22500 returned HTTP 429 (rate limited), so no analysis could be generated.

[77] SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training

Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le, Daixuan Cheng, Guoxin Chen, Yiwen Hu, Zongchao Chen, Yiming Jia, Wayne Xin Zhao, Yang Song, Tao Zhang, Ji-Rong Wen

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2602.03411 returned HTTP 429 (rate limited), so no analysis could be generated.

[78] UI-Venus-1.5 Technical Report

Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, Xingran Zhou, Weizhi Chen, Sunhao Dai, Jingya Dou, Yichen Gong, Yuan Guo, Zhenlin Guo, Feng Li, Qian Li, Jinzhen Lin, Yuqi Zhou, Linchao Zhu, Liang Chen, Zhenyu Guo, Changhua Meng, Weiqiang Wang

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2602.09082 returned HTTP 429 (rate limited), so no analysis could be generated.

[79] MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents

Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai, Moayad Aloqaily, Ouns Bouachir, Linsey Pang, Prakhar Mehrotra, Kun Wang, Qingsong Wen

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2602.14281 returned HTTP 429 (rate limited), so no analysis could be generated.

[80] GLM-5: from Vibe Coding to Agentic Engineering

GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zhong, Mingdao Liu, Mingming Zhao, Pengfan Du, Qian Dong, Rui Lu, Shuang-Li, Shulin Cao, Song Liu, Ting Jiang, Xiaodong Chen, Xiaohan Zhang, Xuancheng Huang, Xuezhen Dong, Yabo Xu, Yao Wei, Yifan An, Yilin Niu, Yitong Zhu, Yuanhao Wen, Yukuo Cen, Yushi Bai, Zhongpei Qiao, Zihan Wang, Zikang Wang, Zilin Zhu, Ziqiang Liu, Zixuan Li, Bojie Wang, Bosi Wen, Can Huang, Changpeng Cai, Chao Yu, Chen Li, Chengwei Hu, Chenhui Zhang, Dan Zhang, Daoyan Lin, Dayong Yang, Di Wang, Ding Ai, Erle Zhu, Fangzhou Yi, Feiyu Chen, Guohong Wen, Hailong Sun, Haisha Zhao, Haiyi Hu, Hanchen Zhang, Hanrui Liu, Hanyu Zhang, Hao Peng, Hao Tai, Haobo Zhang, He Liu, Hongwei Wang, Hongxi Yan, Hongyu Ge, Huan Liu, Huanpeng Chu, Jia’ni Zhao, Jiachen Wang, Jiajing Zhao, Jiamin Ren, Jiapeng Wang, Jiaxin Zhang, Jiayi Gui, Jiayue Zhao, Jijie Li, Jing An, Jing Li, Jingwei Yuan

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2602.15763 returned HTTP 429 (rate limited), so no analysis could be generated.

[81] Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganization in Neural Networks

Jayadev Billa

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2602.15997 returned HTTP 429 (rate limited), so no analysis could be generated.

[82] Intent Laundering: AI Safety Datasets Are Not What They Seem

Shahriar Golchin, Marc Wetter

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2602.16729 returned HTTP 429 (rate limited), so no analysis could be generated.

[83] Games That Teach, Chats That Convince: Comparing Interactive and Static Formats for Persuasive Learning

Seyed Hossein Alavi, Zining Wang, Shruthi Chockkalingam, Raymond T. Ng, Vered Shwartz

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2602.17905 returned HTTP 429 (rate limited), so no analysis could be generated.

[84] How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?

Yuxuan Li, Leyang Li, Hao-Ping Lee, Sauvik Das

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2602.18464 returned HTTP 429 (rate limited), so no analysis could be generated.

[85] Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding

Roberto Tacconelli

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2602.19626 returned HTTP 429 (rate limited), so no analysis could be generated.

cs.CV

[86] Not Just What’s There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning

Junhao Xiao, Zhiyu Wu, Hao Lin, Yi Chen, Yahui Liu, Xiaoran Zhao, Zixu Wang, Zejiang He

Main category: cs.CV

TL;DR: CLIPGlasses enhances CLIP’s ability to understand negation in visual descriptions through a plug-and-play framework with disentanglement and context-aware repulsion modules.

Motivation: Vision-Language Models like CLIP struggle with understanding negation, often embedding affirmatives and negatives similarly, leading to incorrect matches like "no dog" with dog images. Existing fine-tuning methods risk overfitting.

Method: CLIPGlasses uses a dual-stage design: 1) Lens module disentangles negated semantics from text embeddings, 2) Frame module predicts context-aware repulsion strength. This is integrated into modified similarity computation to penalize alignment with negated semantics.

Result: CLIP with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization, especially under low-resource conditions, indicating stronger robustness across domains.

Conclusion: CLIPGlasses effectively enhances CLIP’s negation understanding without fine-tuning the text encoder, providing better generalization and robustness across domains while maintaining plug-and-play flexibility.

Abstract: Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching “no dog” with dog images). Existing methods refine negation understanding via fine-tuning CLIP’s text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP’s ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.
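The Lens/Frame mechanism described above amounts to subtracting a context-weighted penalty for alignment with the negated concept from the usual image-text score. A minimal numpy sketch of such a repulsion-adjusted similarity; the `repulsion_similarity` function, the toy unit-vector embeddings, and the fixed `alpha` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def repulsion_similarity(img_emb, text_emb, neg_emb, alpha):
    """Affinity to the full caption minus a weighted penalty for alignment
    with the disentangled negated semantics (all vectors unit-norm)."""
    return float(img_emb @ text_emb) - alpha * float(img_emb @ neg_emb)

def unit(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
dog = unit(rng.normal(size=8))                  # stand-in concept embedding
img = unit(dog + 0.1 * rng.normal(size=8))      # a dog image
caption = unit(dog + 0.5 * rng.normal(size=8))  # "no dog" embeds near "dog"
neg = dog                                       # Lens output: negated concept

plain = float(img @ caption)                    # vanilla CLIP-style score
adjusted = repulsion_similarity(img, caption, neg, alpha=0.8)
assert adjusted < plain                         # penalty lowers the false match
```

With a repulsion term, the "no dog" caption scores lower against dog images than under the plain dot product, which is the false-positive reduction the paper targets.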

[87] HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning

Chuhao Zhou, Jianfei Yang

Main category: cs.CV

TL;DR: HoloLLM is a multimodal LLM that integrates uncommon sensing modalities (LiDAR, infrared, mmWave radar, WiFi) for robust human perception in smart homes, addressing data scarcity and modality heterogeneity challenges.

Motivation: Current Vision-Language Models rely heavily on visual data, which limits robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. There's a need for models that can leverage diverse sensing modalities beyond vision for more reliable human behavior understanding in smart environments.

Method: Introduces HoloLLM with a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings using fine-grained, text-aligned features from tailored encoders via coarse-to-fine cross-attention. Also develops a human-VLM collaborative data curation pipeline to generate paired textual annotations for sensing datasets.

Result: Extensive experiments on two newly constructed benchmarks show HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%.

Conclusion: HoloLLM establishes a new foundation for real-world, language-informed multisensory embodied intelligence by effectively integrating uncommon sensing modalities for robust human perception and reasoning.

Abstract: Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language. While Vision-Language Models (VLMs) have enabled impressive language-grounded perception, their reliance on visual data limits robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning across heterogeneous environments. We address two key challenges: (1) the scarcity of aligned modality-text data for rare sensors, and (2) the heterogeneity of their physical signal representations. To overcome these, we design a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings with fine-grained, text-aligned features from tailored encoders via coarse-to-fine cross-attention without introducing significant alignment overhead. We further introduce a human-VLM collaborative data curation pipeline to generate paired textual annotations for sensing datasets. Extensive experiments on two newly constructed benchmarks show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%. This work establishes a new foundation for real-world, language-informed multisensory embodied intelligence.
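The UMIP described above injects fine-grained, text-aligned features into pre-aligned sensor embeddings via coarse-to-fine cross-attention with residual connections. A minimal numpy sketch of that two-stage attention; the single-head attention, the `umip_sketch` name, and all shapes are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def attend(q, k, v):
    """Single-head scaled dot-product attention (numpy sketch)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def umip_sketch(pre_aligned, coarse_feats, fine_feats):
    """Coarse-to-fine injection: sensor tokens first attend to coarse
    text-aligned features, then refine against fine-grained ones; residual
    connections keep the original pre-aligned embedding dominant."""
    x = pre_aligned + attend(pre_aligned, coarse_feats, coarse_feats)
    return x + attend(x, fine_feats, fine_feats)

rng = np.random.default_rng(1)
tokens = rng.normal(size=(16, 64))   # pre-aligned mmWave/WiFi embeddings
coarse = rng.normal(size=(8, 64))    # coarse text-aligned encoder features
fine = rng.normal(size=(32, 64))     # fine-grained encoder features
out = umip_sketch(tokens, coarse, fine)
assert out.shape == (16, 64)         # token count and dim preserved
```

The residual form means the projector only perturbs the already-aligned embeddings, which is consistent with the paper's claim of avoiding significant alignment overhead.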

[88] GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

Main category: cs.CV

TL;DR: GOT-Edit integrates 3D geometric reasoning into 2D object tracking via online model editing with null-space constraints, improving robustness to occlusion and clutter.

Motivation: Current generic object tracking methods rely only on 2D features, lacking 3D geometric understanding, making them vulnerable to occlusion, distractors, and appearance changes. Human perception uses 3D knowledge, so the paper aims to bridge this gap.

Method: GOT-Edit uses online cross-modality model editing to integrate geometry-aware cues from a pre-trained Visual Geometry Grounded Transformer into a generic object tracker. It leverages null-space constraints during model updates to incorporate geometric information while preserving semantic discrimination.

Result: Extensive experiments on multiple GOT benchmarks show GOT-Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning.

Conclusion: The approach successfully integrates 3D geometric reasoning into 2D object tracking through online model editing, demonstrating improved performance and robustness in challenging scenarios.

Abstract: Human perception for effective object tracking in 2D video streams arises from the implicit use of prior 3D knowledge and semantic reasoning. In contrast, most generic object tracking (GOT) methods primarily rely on 2D features of the target and its surroundings, while neglecting 3D geometric cues, making them susceptible to partial occlusion, distractors, and variations in geometry and appearance. To address this limitation, we introduce GOT-Edit, an online cross-modality model editing approach that integrates geometry-aware cues into a generic object tracker from a 2D video stream. Our approach leverages features from a pre-trained Visual Geometry Grounded Transformer to infer geometric cues from only a few 2D images. To address the challenge of seamlessly combining geometry and semantics, GOT-Edit performs online model editing. By leveraging null-space constraints during model updates, it incorporates geometric information while preserving semantic discrimination, yielding consistently better performance across diverse scenarios. Extensive experiments on multiple GOT benchmarks demonstrate that GOT-Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning for generic object tracking. The project page is available at https://chenshihfang.github.io/GOT-EDIT.
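Null-space-constrained editing, as used above, means projecting the geometry-driven weight update so it cannot change the model's responses on preserved semantic feature directions. A small numpy sketch of such a projection; the shapes and the `nullspace_update` helper are illustrative assumptions, not the paper's exact update rule:

```python
import numpy as np

def nullspace_update(delta_w, preserved_feats):
    """Project a weight update so it has no effect on preserved (semantic)
    directions: (W + dW') @ f == W @ f for every column f of preserved_feats."""
    F = preserved_feats                              # shape (d_in, k)
    P = np.eye(F.shape[0]) - F @ np.linalg.pinv(F)   # projector onto null(F^T)
    return delta_w @ P

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))           # tracker head weights
dW = rng.normal(size=(4, 6))          # raw geometry-cue update
F = rng.normal(size=(6, 2))           # semantic features to preserve

dW_safe = nullspace_update(dW, F)
assert np.allclose((W + dW_safe) @ F, W @ F)   # semantics untouched
assert not np.allclose(dW_safe, 0)             # geometry still injected
```

Because `F @ pinv(F) @ F == F` for the Moore-Penrose pseudoinverse, the projected update is exactly annihilated on the preserved directions while remaining free elsewhere, which is the mechanism the abstract describes for combining geometry with semantic discrimination.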

[89] VISION-ICE: Video-based Interpretation and Spatial Identification of Arrhythmia Origins via Neural Networks in Intracardiac Echocardiography

Dorsa EPMoghaddam, Feng Gao, Drew Bernard, Kavya Sinha, Mehdi Razavi, Behnaam Aazhang

Main category: cs.CV

TL;DR: AI framework using intracardiac echocardiography (ICE) videos and 3D CNN to localize arrhythmia sources, achieving 66.2% accuracy in distinguishing normal sinus rhythm from left/right-sided arrhythmias.

Motivation: Current arrhythmia localization methods (high-density mapping, CT/MRI) are time and resource intensive. AI has shown promise in echocardiographic analysis, so researchers aim to leverage routine ICE imaging during electrophysiology procedures to guide clinicians and reduce procedural time.

Method: Formulates arrhythmia source localization as a three-class classification task (normal sinus rhythm, left-sided, and right-sided arrhythmias) and develops a 3D Convolutional Neural Network trained on ICE video data to discriminate among the three classes.

Result: Model achieved mean accuracy of 66.2% in ten-fold cross-validation when evaluated on four previously unseen patients, substantially outperforming the 33.3% random baseline.

Conclusion: Demonstrates feasibility and clinical promise of using ICE videos with deep learning for automated arrhythmia localization. Could enable faster, more targeted electrophysiological interventions and reduce procedural burden. Future work will expand dataset for improved robustness and generalizability.

Abstract: Contemporary high-density mapping techniques and preoperative CT/MRI remain time and resource intensive in localizing arrhythmias. AI has been validated as a clinical decision aid in providing accurate, rapid real-time analysis of echocardiographic images. Building on this, we propose an AI-enabled framework that leverages intracardiac echocardiography (ICE), a routine part of electrophysiology procedures, to guide clinicians toward areas of arrhythmogenesis and potentially reduce procedural time. Arrhythmia source localization is formulated as a three-class classification task, distinguishing normal sinus rhythm, left-sided, and right-sided arrhythmias, based on ICE video data. We developed a 3D Convolutional Neural Network trained to discriminate among the three aforementioned classes. In ten-fold cross-validation, the model achieved a mean accuracy of 66.2% when evaluated on four previously unseen patients (substantially outperforming the 33.3% random baseline). These results demonstrate the feasibility and clinical promise of using ICE videos combined with deep learning for automated arrhythmia localization. Leveraging ICE imaging could enable faster, more targeted electrophysiological interventions and reduce the procedural burden of cardiac ablation. Future work will focus on expanding the dataset to improve model robustness and generalizability across diverse patient populations.
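The core operation of such a video classifier is 3D convolution over a clip volume (time x height x width). A naive numpy sketch of a single valid-mode 3D convolution layer; the clip shape and the averaging kernel are illustrative assumptions, since the paper's network architecture is not detailed in this summary:

```python
import numpy as np

def conv3d_valid(vol, kern):
    """Naive valid-mode 3D cross-correlation: slide a (t, h, w) kernel
    over a clip volume, the building block of a 3D CNN."""
    T, H, W = vol.shape
    kt, kh, kw = kern.shape
    out = np.empty((T - kt + 1, H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(vol[i:i + kt, j:j + kh, k:k + kw] * kern)
    return out

clip = np.random.default_rng(0).normal(size=(8, 16, 16))  # t x h x w ICE clip
feat = conv3d_valid(clip, np.ones((3, 3, 3)) / 27)        # averaging filter
assert feat.shape == (6, 14, 14)
# a 3-class head would pool such feature maps and emit 3 logits
```

Real implementations stack many such layers with learned kernels and channels; the loop version above only shows how the spatiotemporal window produces the output shape.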

[90] A Very Big Video Reasoning Suite

Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Jiachen Li, Hanwen Xing, Tianqi Zhao, Fengyuan Yu, Weihang Xiao, Yizheng Jiao, Jianheng Hou, Danyang Zhang, Pengcheng Xu, Boyang Zhong, Zehong Zhao, Gaoyun Fang, John Kitaoka, Yile Xu, Hua Xu, Kenton Blacutt, Tin Nguyen, Siyuan Song, Haoran Sun, Shaoyue Wen, Linyang He, Runming Wang, Yanzhi Wang, Mengyue Yang, Ziqiao Ma, Raphaël Millière, Freda Shi, Nuno Vasconcelos, Daniel Khashabi, Alan Yuille, Yilun Du, Ziming Liu, Bo Li, Dahua Lin, Ziwei Liu, Vikash Kumar, Yijiang Li, Lei Yang, Zhongang Cai, Hokin Deng

Main category: cs.CV

TL;DR: VBVR introduces a massive video reasoning dataset (1M+ clips, 200 tasks) and benchmark to study scaling behavior and emergent generalization in video reasoning models.

Motivation: Video models have focused on visual quality but lack systematic study of reasoning capabilities. There's a need for large-scale training data and verifiable evaluation to understand video reasoning and its scaling behavior.

Method: Created VBVR Dataset (200 reasoning tasks, 1M+ video clips) following principled taxonomy, and VBVR-Bench evaluation framework with rule-based, human-aligned scorers for reproducible assessment.

Result: Dataset is 3 orders of magnitude larger than existing datasets. Early signs of emergent generalization to unseen reasoning tasks were observed through large-scale scaling studies.

Conclusion: VBVR provides foundation for next-stage research in generalizable video reasoning, enabling systematic study of video reasoning capabilities and scaling behavior.

Abstract: Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy and over one million video clips, approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, benchmark toolkit, and models are publicly available at https://video-reason.com/ .

[91] OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport

Xiwen Chen, Wenhui Zhu, Gen Li, Xuanzhao Dong, Yujian Xiong, Hao Wang, Peijie Qiu, Qingquan Song, Zhipeng Wang, Shao Tang, Yalin Wang, Abolfazl Razi

Main category: cs.CV

TL;DR: OTPrune: A training-free framework that uses optimal transport to prune visual tokens in multimodal LLMs by aligning token distributions, achieving better performance-efficiency tradeoffs.

Motivation: Multimodal LLMs suffer from high inference costs due to redundant visual tokens. Existing pruning methods overlook the distributional structure of visual representations, leading to suboptimal performance.

Method: Formulates pruning as distribution alignment via optimal transport, minimizing 2-Wasserstein distance between full and pruned token distributions. Derives tractable submodular objective for efficient optimization with theoretical guarantees of monotonicity and submodularity.

Result: Achieves superior performance-efficiency tradeoffs compared to state-of-the-art methods on comprehensive benchmarks, preserving both local diversity and global representativeness while reducing inference cost.

Conclusion: OTPrune provides a principled, training-free framework for visual token pruning that maintains semantic faithfulness through distributional alignment, offering stable and efficient acceleration for multimodal LLMs.

Abstract: Multi-modal large language models (MLLMs) achieve strong visual-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work explores visual token pruning to accelerate inference, while existing pruning methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, OTPrune preserves both local diversity and global representativeness while reducing inference cost. Moreover, we derive a tractable submodular objective that enables efficient optimization, and theoretically prove its monotonicity and submodularity, providing a principled foundation for stable and efficient pruning. We further provide a comprehensive analysis that explains how distributional alignment contributes to stable and semantically faithful pruning. Comprehensive experiments on wider benchmarks demonstrate that OTPrune achieves superior performance-efficiency tradeoffs compared to state-of-the-art methods. The code is available at https://github.com/xiwenc1/OTPrune.
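The tractable submodular objective above is what makes greedy token selection efficient with approximation guarantees. As a hedged illustration of that greedy machinery only, here is greedy maximization of a monotone submodular facility-location coverage score; this is a stand-in objective, not the paper's 2-Wasserstein-derived one:

```python
import numpy as np

def greedy_prune(tokens, keep):
    """Greedy maximization of a facility-location coverage objective
    (monotone submodular): each kept token 'covers' every token in the
    full set by its best cosine similarity."""
    X = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = X @ X.T                                    # pairwise cosine similarity
    selected, cover = [], np.full(len(X), -np.inf)
    for _ in range(keep):
        # marginal coverage gain of adding each candidate to the kept set
        gains = np.maximum(sim, cover).sum(axis=1)
        gains[selected] = -np.inf                    # never reselect
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, sim[j])
    return sorted(selected)

rng = np.random.default_rng(0)
vis_tokens = rng.normal(size=(64, 32))               # 64 visual tokens, dim 32
kept = greedy_prune(vis_tokens, keep=8)
assert len(kept) == 8 and len(set(kept)) == 8
```

For a monotone submodular objective, this greedy loop is guaranteed to reach at least a 1 - 1/e fraction of the optimal value, which is the kind of principled foundation the paper proves for its own OT-derived objective.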

[92] De-rendering, Reasoning, and Repairing Charts with Vision-Language Models

Valentin Bonas, Martin Sinnona, Viviana Siless, Emmanuel Iarussi

Main category: cs.CV

TL;DR: A framework combining chart de-rendering, automated analysis, and iterative improvement to provide actionable, interpretable feedback on visualization design using vision-language reasoning.

Motivation: Data visualizations are prone to errors that can distort interpretation, but rule-based linters miss context and general-purpose LLMs produce unreliable feedback on visualization quality.

Method: System reconstructs chart structure from images, identifies design flaws using vision-language reasoning, and proposes concrete modifications based on established visualization principles, creating an iterative feedback loop.

Result: On 1,000 charts from Chart2Code benchmark, generated 10,452 design recommendations clustered into 10 coherent categories (axis formatting, color accessibility, legend consistency, etc.).

Conclusion: LLM-driven recommendation systems show promise for delivering structured, principle-based feedback on visualization design, enabling more intelligent and accessible authoring tools.

Abstract: Data visualizations are central to scientific communication, journalism, and everyday decision-making, yet they are frequently prone to errors that can distort interpretation or mislead audiences. Rule-based visualization linters can flag violations, but they miss context and do not suggest meaningful design changes. Directly querying general-purpose LLMs about visualization quality is unreliable: lacking training to follow visualization design principles, they often produce inconsistent or incorrect feedback. In this work, we introduce a framework that combines chart de-rendering, automated analysis, and iterative improvement to deliver actionable, interpretable feedback on visualization design. Our system reconstructs the structure of a chart from an image, identifies design flaws using vision-language reasoning, and proposes concrete modifications supported by established principles in visualization research. Users can selectively apply these improvements and re-render updated figures, creating a feedback loop that promotes both higher-quality visualizations and the development of visualization literacy. In our evaluation on 1,000 charts from the Chart2Code benchmark, the system generated 10,452 design recommendations, which clustered into 10 coherent categories (e.g., axis formatting, color accessibility, legend consistency). These results highlight the promise of LLM-driven recommendation systems for delivering structured, principle-based feedback on visualization design, opening the door to more intelligent and accessible authoring tools.

[93] N4MC: Neural 4D Mesh Compression

Guodong Chen, Huanshuo Dong, Mallesham Dasari

Main category: cs.CV

TL;DR: N4MC is the first 4D neural compression framework for time-varying mesh sequences that uses motion compensation and transformer-based interpolation to achieve superior compression performance with real-time decoding.

DetailsMotivation: Existing neural mesh compression methods treat each mesh frame independently, ignoring temporal redundancy in time-varying sequences. The authors aim to exploit this redundancy similar to how 2D video codecs use inter-frame compression for better efficiency.

Method: Converts irregular mesh frames into regular 4D tensors, uses auto-decoder to capture spatial-temporal correlations, introduces transformer-based interpolation model conditioned on latent embeddings from tracked volume centers to enhance temporal coherence and eliminate motion ambiguities.

Result: N4MC outperforms state-of-the-art methods in rate-distortion performance and enables real-time decoding of 4D mesh sequences.

Conclusion: The framework successfully exploits temporal redundancy in mesh sequences through motion compensation and transformer-based interpolation, achieving efficient compression with real-time performance.

Abstract: We present N4MC, the first 4D neural compression framework to efficiently compress time-varying mesh sequences by exploiting their temporal redundancy. Unlike prior neural mesh compression methods that treat each mesh frame independently, N4MC takes inspiration from inter-frame compression in 2D video codecs, and learns motion compensation in long mesh sequences. Specifically, N4MC converts consecutive irregular mesh frames into regular 4D tensors to provide a uniform and compact representation. These tensors are then condensed using an auto-decoder, which captures both spatial and temporal correlations for redundancy removal. To enhance temporal coherence, we introduce a transformer-based interpolation model that predicts intermediate mesh frames conditioned on latent embeddings derived from tracked volume centers, eliminating motion ambiguities. Extensive evaluations show that N4MC outperforms state-of-the-art in rate-distortion performance, while enabling real-time decoding of 4D mesh sequences. The implementation of our method is available at: https://github.com/frozzzen3/N4MC.
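
The conversion of irregular mesh frames into regular 4D tensors can be illustrated with a simple occupancy voxelization stacked over time; the paper's actual representation is likely richer, and the function name, grid size, and binary-occupancy choice here are illustrative assumptions:

```python
import numpy as np

def frames_to_4d_tensor(vertex_frames, grid=(16, 16, 16)):
    """Rasterize a list of irregular vertex sets (one (Ni, 3) array per
    frame, assumed normalized to [0, 1]^3) into a regular binary
    occupancy tensor of shape (T, D, H, W) -- a uniform, compact
    stand-in for the paper's 4D representation."""
    T = len(vertex_frames)
    tensor = np.zeros((T, *grid), dtype=np.float32)
    edges = [np.linspace(0.0, 1.0, g + 1) for g in grid]
    for t, verts in enumerate(vertex_frames):
        hist, _ = np.histogramdd(verts, bins=edges)
        tensor[t] = (hist > 0).astype(np.float32)  # occupied voxels -> 1
    return tensor
```

A regular (T, D, H, W) tensor like this is what makes an auto-decoder and temporal interpolation straightforward to apply, since every frame shares the same shape regardless of vertex count.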

[94] GSNR: Graph Smooth Null-Space Representation for Inverse Problems

Romario Gualdrón-Hurtado, Roman Jacome, Rafael S. Suarez, Henry Arguello

Main category: cs.CV

TL;DR: GSNR improves inverse imaging problems by incorporating graph-based null-space structure to constrain invisible components, enhancing reconstruction quality across various tasks.

DetailsMotivation: Inverse imaging problems are ill-posed with infinite solutions due to non-trivial null-space. Traditional priors don't constrain null-space components, leading to biased reconstructions. The paper aims to incorporate meaningful null-space information into reconstruction frameworks.

Method: Proposes Graph-Smooth Null-Space Representation (GSNR) using graph Laplacian to construct null-restricted Laplacian that encodes pixel similarity in null-space. Uses low-dimensional projection from p-smoothest spectral graph modes (lowest frequencies). Integrated into PnP, DIP, and diffusion solvers.

Result: Applied to image deblurring, compressed sensing, demosaicing, and super-resolution. Provides consistent improvement up to 4.3 dB over baselines and up to 1 dB over end-to-end learned models in PSNR.

Conclusion: GSNR effectively incorporates null-space structure via graph smoothness, improving reconstruction quality across multiple inverse imaging problems with theoretical and practical benefits.

Abstract: Inverse problems in imaging are ill-posed, leading to infinitely many solutions consistent with the measurements due to the non-trivial null-space of the sensing matrix. Common image priors, such as sparsity, smoothness, or score functions, promote solutions on the general image manifold. However, because these priors do not constrain the null-space component, they can bias the reconstruction. We therefore aim to incorporate meaningful null-space information into the reconstruction framework. Inspired by smooth image representation on graphs, we propose Graph-Smooth Null-Space Representation (GSNR), a mechanism that imposes structure only on the invisible component. In particular, given a graph Laplacian, we construct a null-restricted Laplacian that encodes similarity between neighboring pixels in the null-space signal, and we design a low-dimensional projection matrix from the $p$-smoothest spectral graph modes (the lowest graph frequencies). This approach has strong theoretical and practical implications: i) improved convergence via a null-only graph regularizer, ii) better coverage, i.e., how much null-space variance is captured by the $p$ modes, and iii) high predictability, i.e., how well these modes can be inferred from the measurements. GSNR is incorporated into well-known inverse problem solvers, e.g., PnP, DIP, and diffusion solvers, in four scenarios: image deblurring, compressed sensing, demosaicing, and image super-resolution, providing consistent improvements of up to 4.3 dB over baseline formulations and up to 1 dB over end-to-end learned models in terms of PSNR.
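
The null-restricted Laplacian and its $p$ smoothest modes follow directly from the definitions in the abstract; a numpy sketch (the function name and SVD rank tolerance are my own, and the paper's construction may differ in details):

```python
import numpy as np

def gsnr_projection(A: np.ndarray, L: np.ndarray, p: int) -> np.ndarray:
    """Build a low-dimensional basis for the null-space of sensing
    matrix A from the p smoothest modes of the null-restricted graph
    Laplacian N^T L N (smallest eigenvalues = lowest graph frequencies).
    Returns an (n, p) matrix whose columns span the retained modes."""
    # Orthonormal basis of ker(A): right singular vectors whose
    # singular values are numerically zero.
    _, s, Vt = np.linalg.svd(A)
    rank = int((s > 1e-10 * s.max()).sum())
    N = Vt[rank:].T                 # (n, n - rank) null-space basis
    L_null = N.T @ L @ N            # Laplacian restricted to ker(A)
    _, V = np.linalg.eigh(L_null)   # eigenvalues in ascending order
    return N @ V[:, :p]             # p smoothest modes, lifted to pixel space
```

By construction every column lies in ker(A), so regularizing along these directions leaves the measurement-consistent (visible) component untouched.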

[95] Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking

Jingcheng Yang, Tianhu Xiong, Shengyi Qian, Klara Nahrstedt, Mingyuan Wu

Main category: cs.CV

TL;DR: First framework for transparent circuit tracing in vision-language models to analyze multimodal reasoning using transcoders, attribution graphs, and attention methods.

DetailsMotivation: Vision-language models are powerful but remain opaque black boxes, lacking transparency in how they perform multimodal reasoning. There's a need for systematic analysis tools to understand how VLMs integrate visual and semantic information.

Method: Developed a framework using transcoders, attribution graphs, and attention-based methods to trace circuits in VLMs. Used feature steering and circuit patching to validate causal relationships.

Result: Uncovered hierarchical integration of visual and semantic concepts, revealed distinct visual feature circuits for mathematical reasoning and cross-modal associations, and proved circuits are causal and controllable.

Conclusion: The framework enables systematic analysis of multimodal reasoning in VLMs, demonstrating that circuits are causal and controllable, laying groundwork for more explainable and reliable vision-language models.

Abstract: Vision-language models (VLMs) are powerful but remain opaque black boxes. We introduce the first framework for transparent circuit tracing in VLMs to systematically analyze multimodal reasoning. By utilizing transcoders, attribution graphs, and attention-based methods, we uncover how VLMs hierarchically integrate visual and semantic concepts. We reveal that distinct visual feature circuits can handle mathematical reasoning and support cross-modal associations. Validated through feature steering and circuit patching, our framework proves these circuits are causal and controllable, laying the groundwork for more explainable and reliable VLMs.

[96] Large-scale Photorealistic Outdoor 3D Scene Reconstruction from UAV Imagery Using Gaussian Splatting Techniques

Christos Maikos, Georgios Angelidis, Georgios Th. Papadopoulos

Main category: cs.CV

TL;DR: Real-time 3D reconstruction pipeline using drone video streams and 3D Gaussian Splatting for low-latency AR/VR applications

DetailsMotivation: UAVs are widely used for aerial perception, and 3DGS shows promise for real-time neural rendering, but their integration into end-to-end UAV-based reconstruction systems remains underexplored.

Method: End-to-end pipeline combining live RTMP video streaming, synchronized sensor fusion, camera pose estimation, and 3DGS optimization for continuous model updates and low-latency deployment in interactive visualization environments.

Result: Achieves competitive visual fidelity with significantly higher rendering performance and reduced end-to-end latency compared to NeRF-based approaches, maintaining reconstruction quality within 4-7% of offline references.

Conclusion: The system is suitable for real-time, scalable augmented perception from aerial platforms, enabling immersive AR/VR applications with minimal latency.

Abstract: In this study, we present an end-to-end pipeline capable of converting drone-captured video streams into high-fidelity 3D reconstructions with minimal latency. Unmanned aerial vehicles (UAVs) are extensively used in aerial real-time perception applications. Moreover, recent advances in 3D Gaussian Splatting (3DGS) have demonstrated significant potential for real-time neural rendering. However, their integration into end-to-end UAV-based reconstruction and visualization systems remains underexplored. We propose an efficient architecture that combines live video acquisition via RTMP streaming, synchronized sensor fusion, camera pose estimation, and 3DGS optimization, achieving continuous model updates and low-latency deployment within interactive visualization environments that support immersive augmented and virtual reality (AR/VR) applications. Experimental results demonstrate that the proposed method achieves competitive visual fidelity while delivering significantly higher rendering performance and substantially reduced end-to-end latency compared to NeRF-based approaches. Reconstruction quality remains within 4-7% of high-fidelity offline references, confirming the suitability of the proposed system for real-time, scalable augmented perception from aerial platforms.

[97] BiRQA: Bidirectional Robust Quality Assessment for Images

Aleksandr Gushchin, Dmitriy S. Vatolin, Anastasia Antsiferova

Main category: cs.CV

TL;DR: BiRQA is a fast, robust full-reference image quality assessment model using bidirectional multiscale pyramid processing with adversarial training for resilience against attacks.

DetailsMotivation: Current neural FR IQA metrics are slow and vulnerable to adversarial perturbations, limiting their practical deployment in image compression, restoration, and generative modeling applications.

Method: Uses a compact model processing four fast complementary features in a bidirectional multiscale pyramid with bottom-up attention and top-down cross-gating. Introduces Anchored Adversarial Training with clean anchor samples and ranking loss to bound prediction error under attacks.

Result: Outperforms or matches previous SOTA on five FR IQA benchmarks while running ~3x faster. Under white-box attacks, improves SROCC from 0.30-0.57 to 0.60-0.84 on KADID-10k, showing substantial robustness gains.

Conclusion: BiRQA combines competitive accuracy with real-time throughput and strong adversarial resilience, making it the only FR IQA model with this combination of desirable properties.

Abstract: Full-Reference image quality assessment (FR IQA) is important for image compression, restoration, and generative modeling, yet current neural metrics remain slow and vulnerable to adversarial perturbations. We present BiRQA, a compact FR IQA model that processes four fast complementary features within a bidirectional multiscale pyramid. A bottom-up attention module injects fine-scale cues into coarse levels through an uncertainty-aware gate, while a top-down cross-gating block routes semantic context back to high resolution. To enhance robustness, we introduce Anchored Adversarial Training, a theoretically grounded strategy that uses clean “anchor” samples and a ranking loss to bound pointwise prediction error under attacks. On five public FR IQA benchmarks, BiRQA outperforms or matches the previous state of the art (SOTA) while running ~3x faster than prior SOTA models. Under unseen white-box attacks it lifts SROCC from 0.30-0.57 to 0.60-0.84 on KADID-10k, demonstrating substantial robustness gains. To our knowledge, BiRQA is the only FR IQA model combining competitive accuracy with real-time throughput and strong adversarial resilience.
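
The abstract does not give the exact form of the Anchored Adversarial Training loss; one plausible reading is a margin ranking loss that penalizes the model whenever an attacked sample's predicted quality crosses a clean anchor's score against their true ordering. A hedged numpy sketch (all names and the margin value are hypothetical):

```python
import numpy as np

def anchored_ranking_loss(s_adv, s_anchor, mos_adv, mos_anchor, margin=0.1):
    """Hypothetical 'anchored' ranking loss: for each pair of an
    attacked sample and a clean anchor whose subjective (MOS) scores
    give a known ordering, apply a hinge penalty if the predicted
    quality scores violate that ordering by more than the margin."""
    # +1 if the anchor is truly higher quality, -1 otherwise.
    sign = np.sign(mos_anchor - mos_adv)
    # Hinge: we want sign * (s_anchor - s_adv) >= margin.
    return np.maximum(0.0, margin - sign * (s_anchor - s_adv)).mean()
```

Bounding the attacked prediction relative to a clean anchor, rather than to an absolute target, is what would keep the ranking stable under perturbations.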

[98] 3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism

Bhavik Chandna, Kelsey R. Allen

Main category: cs.CV

TL;DR: 3DSPA: An automated 3D spatiotemporal point autoencoder framework for evaluating video realism without reference videos, combining 3D point trajectories, depth cues, and semantic features to assess physical plausibility and temporal consistency.

DetailsMotivation: Current video realism evaluation requires manual human annotation or limited-scope datasets. There's a need for automated evaluation that captures both semantics and coherent 3D structure without reference videos for applications like robotics and film-making.

Method: 3DSPA integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation using a 3D spatiotemporal point autoencoder. It models object motion and scene understanding to assess realism, temporal consistency, and physical plausibility.

Result: 3DSPA reliably identifies videos violating physical laws, is more sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism across multiple datasets compared to existing methods.

Conclusion: Enriching trajectory-based representations with 3D semantics provides a stronger foundation for benchmarking generative video models and implicitly captures physical rule violations, enabling automated video realism evaluation.

Abstract: AI video generation is evolving rapidly. For video generators to be useful for applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process – requiring human annotation or bespoke evaluation datasets which have restricted scope. Here we develop an automated evaluation framework for video realism which captures both semantics and coherent 3D structure and which does not require access to a reference video. Our method, 3DSPA, is a 3D spatiotemporal point autoencoder which integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation for video evaluation. 3DSPA models how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show that 3DSPA reliably identifies videos which violate physical laws, is more sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism across multiple datasets. Our results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models, and implicitly captures physical rule violations. The code and pretrained model weights will be available at https://github.com/TheProParadox/3dspa_code.

[99] SAS-Net: Scene-Appearance Separation Network for Robust Spatiotemporal Registration in Bidirectional Photoacoustic Microscopy

Jiahao Qin

Main category: cs.CV

Summary unavailable: the arXiv API request for paper 2602.09050 was rate-limited (HTTP 429), so no TL;DR or abstract could be generated for this entry.

[100] Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field

Sheyang Tang, Armin Shafiee Sarvestani, Jialu Xu, Xiaoyu Xu, Zhou Wang

Main category: cs.CV

TL;DR: A 3D aesthetic field approach for viewpoint suggestion using sparse captures and feedforward 3D Gaussian Splatting network to distill 2D aesthetic knowledge into 3D space

DetailsMotivation: Existing aesthetic viewpoint suggestion methods are limited: single-view approaches don't understand scene geometry, while 3D exploration methods require dense captures or prebuilt 3D environments with costly reinforcement learning searches. There's a need for geometry-grounded aesthetic reasoning with sparse captures.

Method: Introduces 3D aesthetic field concept for geometry-grounded aesthetic reasoning. Uses feedforward 3D Gaussian Splatting network to distill high-level aesthetic knowledge from pretrained 2D aesthetic model into 3D space. Proposes two-stage search pipeline: coarse viewpoint sampling followed by gradient-based refinement to efficiently identify aesthetically appealing viewpoints.

Result: Extensive experiments show the method consistently suggests viewpoints with superior framing and composition compared to existing approaches, establishing a new direction for 3D-aware aesthetic modeling.

Conclusion: The 3D aesthetic field enables efficient viewpoint suggestion with sparse captures, avoiding costly RL searches while maintaining geometry awareness, representing significant advancement in 3D-aware aesthetic modeling.

Abstract: The aesthetic quality of a scene depends strongly on camera viewpoint. Existing approaches for aesthetic viewpoint suggestion are either single-view adjustments, predicting limited camera adjustments from a single image without understanding scene geometry, or 3D exploration approaches, which rely on dense captures or prebuilt 3D environments coupled with costly reinforcement learning (RL) searches. In this work, we introduce the notion of 3D aesthetic field that enables geometry-grounded aesthetic reasoning in 3D with sparse captures, allowing efficient viewpoint suggestions in contrast to costly RL searches. We opt to learn this 3D aesthetic field using a feedforward 3D Gaussian Splatting network that distills high-level aesthetic knowledge from a pretrained 2D aesthetic model into 3D space, enabling aesthetic prediction for novel viewpoints from only sparse input views. Building on this field, we propose a two-stage search pipeline that combines coarse viewpoint sampling with gradient-based refinement, efficiently identifying aesthetically appealing viewpoints without dense captures or RL exploration. Extensive experiments show that our method consistently suggests viewpoints with superior framing and composition compared to existing approaches, establishing a new direction toward 3D-aware aesthetic modeling.
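
The two-stage search pipeline can be sketched as coarse sampling followed by gradient-based refinement; here finite differences stand in for backpropagating through the learned aesthetic field, and the 2D viewpoint parameterization (e.g., azimuth/elevation) is an illustrative assumption:

```python
import numpy as np

def two_stage_search(score, n_coarse=64, steps=50, lr=0.05, eps=1e-3, seed=0):
    """Two-stage viewpoint search over a toy 2D parameterization:
    (1) score a coarse random sample of candidate viewpoints,
    (2) refine the best candidate by gradient ascent on the aesthetic
    score (finite differences stand in for a differentiable field)."""
    rng = np.random.default_rng(seed)
    # Stage 1: coarse viewpoint sampling.
    cands = rng.uniform(-1.0, 1.0, size=(n_coarse, 2))
    best = cands[np.argmax([score(v) for v in cands])]
    # Stage 2: gradient-based refinement.
    v = best.copy()
    for _ in range(steps):
        g = np.array([(score(v + eps * e) - score(v - eps * e)) / (2 * eps)
                      for e in np.eye(2)])
        v += lr * g
    return v
```

The coarse stage keeps refinement out of poor local basins; the gradient stage then converges quickly because the field, unlike an RL search, provides a smooth objective.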

[101] Learning Hierarchical Sparse Transform Coding for 3DGS Compression

Hao Xu, Xiaolin Wu, Xi Zhang

Main category: cs.CV

Summary unavailable: the arXiv API request for paper 2505.22908 was rate-limited (HTTP 429), so no TL;DR or abstract could be generated for this entry.

[102] CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation

Mainak Singha, Sarthak Mehrotra, Paolo Casari, Subhasis Chaudhuri, Elisa Ricci, Biplab Banerjee

Main category: cs.CV

TL;DR: CLIPoint3D: A few-shot unsupervised 3D point cloud domain adaptation framework using frozen CLIP backbone with knowledge-driven prompt tuning and geometric cues from lightweight 3D encoder.

DetailsMotivation: Vision-language models like CLIP show cross-modal reasoning but are fragile under domain shifts, especially for 3D point clouds. Existing 3D domain adaptation methods use heavy trainable encoders which are inefficient. Need efficient adaptation from synthetic to real-world 3D data.

Method: Projects 3D point clouds into multiple depth maps, uses frozen CLIP backbone refined through knowledge-driven prompt tuning that integrates language priors with geometric cues from lightweight 3D encoder. Uses parameter-efficient fine-tuning, entropy-guided view sampling, optimal transport-based alignment loss, and uncertainty-aware prototype alignment loss.

Result: Achieves 3-16% accuracy gains over CLIP-based and conventional encoder-based baselines on PointDA-10 and GraspNetPC-10 benchmarks.

Conclusion: CLIPoint3D effectively adapts CLIP for 3D point cloud domain adaptation with efficiency and strong performance, bridging synthetic-to-real domain gaps while maintaining class separability.

Abstract: Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP’s encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Codes are available at https://github.com/SarthakM320/CLIPoint3D.
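
Entropy-guided view sampling presumably scores each projected depth map by the confidence of its classification under the frozen CLIP head; a minimal sketch, assuming softmax entropy over per-view logits as the confidence measure (function name is hypothetical):

```python
import numpy as np

def select_confident_views(view_logits: np.ndarray, k: int) -> np.ndarray:
    """Pick the k depth-map views whose class posteriors have the
    lowest softmax entropy, i.e., the views the model is most
    confident about. view_logits has shape (n_views, n_classes)."""
    z = view_logits - view_logits.max(axis=1, keepdims=True)  # stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    return np.argsort(entropy)[:k]  # indices of the most confident views
```

Low-entropy views carry a sharper supervisory signal, which is why filtering projections this way helps when adapting with few-shot, unlabeled target data.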

[103] NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models

Siyu Li, Fei Teng, Yihong Cao, Kailun Yang, Zhiyong Li, Yaonan Wang

Main category: cs.CV

Summary unavailable: the arXiv API request for paper 2507.04002 was rate-limited (HTTP 429), so no TL;DR or abstract could be generated for this entry.

[104] SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images

Aayush Dhakal, Subash Khanal, Srikumar Sastry, Jacob Arndt, Philipe Ambrozio Dias, Dalton Lunga, Nathan Jacobs

Main category: cs.CV

TL;DR: SimLBR: A simple fake image detection framework using Latent Blending Regularization that learns tight decision boundaries around real images and treats fakes as a sink class, improving cross-generator generalization and efficiency.

DetailsMotivation: Current fake image detection methods overfit to training data and fail catastrophically on hard test sets with distribution shifts. There's a need for more robust detection that generalizes across different AI generators.

Method: Proposes SimLBR framework with Latent Blending Regularization (LBR) that learns a tight decision boundary around real image distribution, treating fake category as a sink class. Emphasizes reliability-oriented evaluation with risk-adjusted metrics.

Result: Achieves up to +24.85% accuracy and +69.62% recall on Chameleon benchmark, with orders of magnitude faster training than existing approaches. Shows strong cross-generator generalization.

Conclusion: SimLBR provides an efficient and robust solution for fake image detection with improved generalization across different AI generators, addressing critical reliability concerns in AI-generated content detection.

Abstract: The rapid advancement of generative models has made the detection of AI-generated images a critical challenge for both research and society. Recent works have shown that most state-of-the-art fake image detection methods overfit to their training data and catastrophically fail when evaluated on curated hard test sets with strong distribution shifts. In this work, we argue that it is more principled to learn a tight decision boundary around the real image distribution and treat the fake category as a sink class. To this end, we propose SimLBR, a simple and efficient framework for fake image detection using Latent Blending Regularization (LBR). Our method significantly improves cross-generator generalization, achieving up to +24.85% accuracy and +69.62% recall on the challenging Chameleon benchmark. SimLBR is also highly efficient, training orders of magnitude faster than existing approaches. Furthermore, we emphasize the need for reliability-oriented evaluation in fake image detection, introducing risk-adjusted metrics and worst-case estimates to better assess model robustness. All code and models will be released on HuggingFace and GitHub.
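
The abstract does not describe how Latent Blending Regularization actually works. Purely as an illustration of the "tight boundary around real images, fakes as a sink class" idea, one could blend pairs of real latents and assign the blends to the sink class; everything in this sketch is hypothetical:

```python
import numpy as np

def latent_blend_batch(real_latents: np.ndarray, alpha_range=(0.3, 0.7), seed=0):
    """Hypothetical latent-blending sketch: convexly mix random pairs
    of real latents and label the blends as the 'sink' (fake) class,
    forcing the detector to draw a tight decision boundary around
    genuine latents rather than memorizing generator artifacts."""
    rng = np.random.default_rng(seed)
    n = len(real_latents)
    perm = rng.permutation(n)
    alpha = rng.uniform(*alpha_range, size=(n, 1))
    blends = alpha * real_latents + (1 - alpha) * real_latents[perm]
    labels = np.ones(n, dtype=np.int64)  # 1 = sink/fake class
    return blends, labels
```

Synthesizing boundary samples from real data alone would explain the cross-generator generalization claim, since no specific generator's artifacts are ever needed during training.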

[105] gQIR: Generative Quanta Image Reconstruction

Aryan Garg, Sizhuo Ma, Mohit Gupta

Main category: cs.CV

TL;DR: Adapting large text-to-image latent diffusion models for photon-limited quanta burst imaging with SPAD sensors, handling Bernoulli photon statistics to produce photometrically faithful and perceptually pleasing reconstructions.

DetailsMotivation: Single-photon avalanche diode (SPAD) sensors enable imaging in extreme low-light conditions, but raw quanta frames contain sparse, noisy, binary photon detections. Standard restoration pipelines and modern generative models fail under these extreme noise statistics, requiring new approaches that can handle alignment, denoising, and demosaicing for color reconstruction.

Method: Adapts large text-to-image latent diffusion models to the photon-limited domain by leveraging structural and semantic priors from internet-scale models while introducing mechanisms to handle Bernoulli photon statistics. Integrates latent-space restoration with burst-level spatio-temporal reasoning for coherent image reconstruction from photon-limited data.

Result: The approach substantially improves perceptual quality over classical and modern learning-based baselines across synthetic benchmarks and real-world datasets, including the first color SPAD burst dataset and a challenging Deforming (XD) video benchmark.

Conclusion: Demonstrates the promise of adapting large generative priors to extreme photon-limited sensing, producing reconstructions that are both photometrically faithful and perceptually pleasing even under high-speed motion conditions.

Abstract: Capturing high-quality images from only a few detected photons is a fundamental challenge in computational imaging. Single-photon avalanche diode (SPAD) sensors promise high-quality imaging in regimes where conventional cameras fail, but raw quanta frames contain only sparse, noisy, binary photon detections. Recovering a coherent image from a burst of such frames requires handling alignment, denoising, and demosaicing (for color) under noise statistics far outside those assumed by standard restoration pipelines or modern generative models. We present an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. Our method leverages the structural and semantic priors of internet-scale diffusion models while introducing mechanisms to handle Bernoulli photon statistics. By integrating latent-space restoration with burst-level spatio-temporal reasoning, our approach produces reconstructions that are both photometrically faithful and perceptually pleasing, even under high-speed motion. We evaluate the method on synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging Deforming (XD) video benchmark. Across all settings, the approach substantially improves perceptual quality over classical and modern learning-based baselines, demonstrating the promise of adapting large generative priors to extreme photon-limited sensing. Code at https://github.com/Aryan-Garg/gQIR.
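
The Bernoulli photon statistics the method must handle follow from the standard SPAD detection model: in each frame, a pixel receiving photon flux phi fires with probability 1 - exp(-phi). A sketch of simulating such a burst and the per-pixel maximum-likelihood flux inversion for a static scene (unit quantum efficiency and no dark counts assumed):

```python
import numpy as np

def simulate_quanta_burst(flux: np.ndarray, n_frames: int, seed: int = 0):
    """Simulate a burst of binary SPAD 'quanta frames': each pixel
    fires with probability 1 - exp(-flux), the Poisson-derived
    Bernoulli detection model (unit quantum efficiency, no dark
    counts). Returns a (n_frames, H, W) array of 0/1 detections."""
    rng = np.random.default_rng(seed)
    p = 1.0 - np.exp(-flux)
    return (rng.random((n_frames, *flux.shape)) < p).astype(np.uint8)

def mle_flux(frames: np.ndarray) -> np.ndarray:
    """Per-pixel maximum-likelihood flux estimate from a static burst:
    flux_hat = -ln(1 - mean detection rate), clipped for stability."""
    rate = frames.mean(axis=0)
    return -np.log(np.clip(1.0 - rate, 1e-6, None))
```

This closed-form inversion only holds for static scenes; motion is precisely why the paper needs burst-level spatio-temporal reasoning instead of simple frame averaging.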

[106] MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation

Taha Koleilat, Hojat Asgariandehkordi, Omid Nejati Manzari, Berardino Barile, Yiming Xiao, Hassan Rivaz

Main category: cs.CV

TL;DR: MedCLIPSeg adapts CLIP for medical image segmentation using probabilistic cross-modal attention and patch-level embeddings, achieving better accuracy, efficiency, and uncertainty estimation across diverse medical imaging datasets.

DetailsMotivation: Medical image segmentation faces challenges including limited annotations, ambiguous anatomical features, and domain shifts. While vision-language models like CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored.

Method: MedCLIPSeg leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens with explicit modeling of predictive uncertainty. It uses a soft patch-level contrastive loss for nuanced semantic learning across diverse textual prompts.

Result: Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps highlighting local reliability of segmentation results.

Conclusion: This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation, offering improved data efficiency and domain generalizability.

Abstract: Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation.

[107] SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

Anindita Ghosh, Vladislav Golyanik, Taku Komura, Philipp Slusallek, Christian Theobalt, Rishabh Dabral

Main category: cs.CV

TL;DR: SceMoS: A scene-aware 3D human motion synthesis framework that uses 2D scene representations instead of full 3D data for physically grounded motion generation.

DetailsMotivation: Current methods for text-driven 3D human motion synthesis require computationally expensive 3D scene data and simultaneously learn high-level planning and low-level contact reasoning. The authors aim to show that structured 2D scene representations can serve as a powerful alternative to full 3D supervision.

Method: SceMoS disentangles global planning from local execution using: (1) a text-conditioned autoregressive global motion planner operating on bird’s-eye-view (BEV) images with DINOv2 features for scene representation, and (2) a geometry-grounded motion tokenizer using 2D local scene heightmaps trained via conditional VQ-VAE to embed surface physics into discrete vocabulary.

Result: Achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark while reducing trainable parameters for scene encoding by over 50% compared to 3D-based methods.

Conclusion: 2D scene cues can effectively ground 3D human-scene interaction, striking an efficiency-fidelity trade-off where BEV semantics capture spatial layout for global reasoning while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning.

Abstract: Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent (“walk to the couch”) and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues and relying on (1) a text-conditioned autoregressive global motion planner that operates on a bird’s-eye-view (BEV) image rendered from an elevated corner of the scene, encoded with DINOv2 features, as the scene representation, and (2) a geometry-grounded motion tokenizer trained via a conditional VQ-VAE, that uses local 2D scene heightmaps, thus embedding surface physics directly into a discrete vocabulary. This 2D factorization strikes an efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.

[108] Path-Decoupled Hyperbolic Flow Matching for Few-Shot Adaptation

Lin Li, Ziqi Jiang, Gefan Ye, Zhenqi He, Jiahui Li, Jun Xiao, Kwang-Ting Cheng, Long Chen

Main category: cs.CV

TL;DR: Hyperbolic Flow Matching (HFM) improves cross-modal few-shot adaptation by using hyperbolic geometry to decouple feature transport paths, addressing limitations of Euclidean Flow Matching in handling diverse feature distributions.

DetailsMotivation: Euclidean-based Flow Matching for visual-semantic alignment suffers from path entanglement due to polynomial volume growth limitations in flat geometry, which fails to accommodate diverse feature distributions in cross-modal few-shot adaptation.

Method: Proposes path-decoupled Hyperbolic Flow Matching (HFM) using Lorentz manifold’s exponential expansion: 1) Centripetal hyperbolic alignment anchors textual roots to push visual leaves to boundary for orderly flow initialization, 2) Path-decoupled objective confines trajectories within isolated class-specific geodesic corridors via step-wise supervision, 3) Adaptive diameter-based stopping prevents over-transportation.

Result: Extensive ablations on 11 benchmarks show HFM establishes new state-of-the-art, consistently outperforming Euclidean counterparts.

Conclusion: HFM effectively addresses path entanglement in cross-modal few-shot adaptation through hyperbolic geometry, providing superior visual-semantic alignment compared to Euclidean approaches.

Abstract: Recent advances in cross-modal few-shot adaptation treat visual-semantic alignment as a continuous feature transport problem via Flow Matching (FM). However, we argue that Euclidean-based FM overlooks fundamental limitations of flat geometry, where polynomial volume growth fails to accommodate diverse feature distributions, leading to severe path entanglement. To this end, we propose path-decoupled Hyperbolic Flow Matching (HFM), leveraging the Lorentz manifold’s exponential expansion for trajectory decoupling. HFM structures the transport via two key designs: 1) Centripetal hyperbolic alignment: It constructs a centripetal hierarchy by anchoring textual roots, which pushes visual leaves to the boundary to initialize orderly flows. 2) Path-decoupled objective: It acts as a “semantic guardrail” rigidly confining trajectories within isolated class-specific geodesic corridors via step-wise supervision. Furthermore, we devise an adaptive diameter-based stopping to prevent over-transportation into the crowded origin based on the intrinsic semantic scale. Extensive ablations on 11 benchmarks have shown that HFM establishes a new state-of-the-art, consistently outperforming its Euclidean counterparts. Our codes and models will be released.
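
As background on the Lorentz manifold the paper builds on: the standard exponential map at the hyperboloid origin lifts a Euclidean tangent vector onto the manifold, whose volume grows exponentially with radius (the property HFM exploits for trajectory decoupling). This is the textbook formula, not code from the paper:

```python
import numpy as np

def lorentz_exp0(v_space):
    """Exponential map at the hyperboloid origin o = (1, 0, ..., 0):
    exp_o(v) = cosh(|v|) * o + sinh(|v|) * v / |v| for tangent v = (0, v_space).
    """
    n = np.linalg.norm(v_space)
    if n == 0:
        return np.concatenate([[1.0], np.zeros_like(v_space)])
    return np.concatenate([[np.cosh(n)], np.sinh(n) * v_space / n])

def lorentz_inner(x, y):
    """Minkowski inner product <x, y>_L = -x0*y0 + <x_rest, y_rest>."""
    return -x[0] * y[0] + x[1:] @ y[1:]

# every mapped point satisfies the hyperboloid constraint <x, x>_L = -1
x = lorentz_exp0(np.array([0.3, -0.4]))
```

Embedding features through this map before transport is what gives hyperbolic methods room to keep class-specific trajectories separated.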

[109] Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching

Jintu Zheng, Qizhe Liu, HuangXin Xu, Zhuojie Chen

Main category: cs.CV

TL;DR: PipStereo: A real-time stereo matching method for edge deployment using progressive iteration pruning, monocular prior transfer, and hardware-aware FlashGRU to achieve efficient inference while maintaining accuracy.

DetailsMotivation: Existing iterative stereo matching methods rely on RNNs which are computationally expensive and hinder deployment on edge devices, creating a need for efficient real-time stereo matching solutions.

Method: Three key innovations: 1) Progressive iteration pruning to collapse recursive computation into near-single-pass inference by exploiting spatial sparsity and temporal redundancy; 2) Collaborative monocular prior transfer framework that embeds depth priors without dedicated monocular encoder; 3) FlashGRU - hardware-aware RNN operator with structured sparsity and I/O-conscious design.

Result: Achieves 7.28× speedup, 76.6% memory peak reduction, and 80.9% global memory requests reduction over native ConvGRUs at 2K resolution. Processes 320×640 frames in 75ms on NVIDIA Jetson Orin NX and 19ms on RTX 4090, matching accuracy of large iterative models while exceeding existing real-time methods.

Conclusion: PipStereo enables real-time, high-fidelity stereo matching on edge hardware by addressing computational bottlenecks of iterative refinement while maintaining competitive accuracy through efficient architectural innovations.

Abstract: While iterative stereo matching achieves high accuracy, its dependence on Recurrent Neural Networks (RNN) hinders edge deployment, a challenge underexplored in existing research. We analyze iterative refinement and reveal that disparity updates are spatially sparse and temporally redundant. First, we introduce a progressive iteration pruning strategy that suppresses redundant update steps, effectively collapsing the recursive computation into a near-single-pass inference. Second, we propose a collaborative monocular prior transfer framework that implicitly embeds depth priors without requiring a dedicated monocular encoder, thereby eliminating its associated computational burden. Third, we develop FlashGRU, a hardware-aware RNN operator leveraging structured sparsity and I/O-conscious design, achieving a 7.28$\times$ speedup, 76.6% memory peak reduction and 80.9% global memory requests reduction over native ConvGRUs at 2K resolution. Our PipStereo enables real-time, high-fidelity stereo matching on edge hardware: it processes 320$\times$640 frames in just 75ms on an NVIDIA Jetson Orin NX (FP16) and 19ms on RTX 4090, matching the accuracy of large iterative models, and its generalization ability and accuracy far exceed those of existing real-time methods. Our embedded AI projects will be updated at: https://github.com/XPENG-Aridge-AI.
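
The intuition behind iteration pruning, that late disparity updates are temporally redundant, can be sketched with a simple early-exit loop. This is an illustrative approximation of the idea, not the paper's learned pruner:

```python
import numpy as np

def pruned_iterative_refine(disp, update_fn, max_iters=32, tol=1e-3):
    """Illustrative early-exit refinement: stop iterating once the mean
    absolute disparity update drops below `tol`, pruning redundant steps."""
    for it in range(max_iters):
        delta = update_fn(disp)
        disp = disp + delta
        if np.abs(delta).mean() < tol:   # updates became temporally redundant
            break
    return disp, it + 1

# toy update rule: each step moves disparity halfway toward a fixed target
target = np.full((8, 8), 10.0)
disp0 = np.zeros((8, 8))
disp, n_iters = pruned_iterative_refine(disp0, lambda d: 0.5 * (target - d))
# converges well before the 32-iteration budget
```

A learned pruner goes further by predicting, per pixel, which update steps can be skipped, rather than using a single global residual threshold.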

[110] LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

Peiliang Cai, Jiacheng Liu, Haowen Xu, Xinyu Wang, Chang Zou, Linfeng Zhang

Main category: cs.CV

TL;DR: LESA: A learnable stage-aware predictor framework using KAN networks and multi-stage experts for accelerating diffusion transformers while maintaining generation quality.

DetailsMotivation: Diffusion Transformers (DiTs) have high computational demands that hinder practical deployment. Existing feature caching methods struggle with the complex, stage-dependent dynamics of diffusion processes, leading to quality degradation and inconsistency with standard denoising.

Method: Proposes a LEarnable Stage-Aware (LESA) predictor framework with two-stage training. Uses Kolmogorov-Arnold Network (KAN) to learn temporal feature mappings, and introduces a multi-stage, multi-expert architecture with specialized predictors for different noise-level stages.

Result: Achieves 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with 20.2% quality improvement over previous SOTA, and 5.00x acceleration on HunyuanVideo with 24.7% PSNR improvement over TaylorSeer.

Conclusion: The training-based framework effectively accelerates diffusion transformers while maintaining high-fidelity generation, demonstrating strong generalization across different models for both text-to-image and text-to-video synthesis.

Abstract: Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.
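
The multi-stage, multi-expert assignment can be pictured as a simple router that maps each denoising timestep to the predictor for its noise-level stage. The stage boundaries and expert names below are illustrative assumptions, not values from the paper:

```python
import bisect

def make_stage_router(boundaries, experts):
    """Route each denoising timestep to the expert for its noise-level stage.
    `boundaries` splits the timestep range into len(experts) intervals."""
    assert len(experts) == len(boundaries) + 1
    def route(t):
        # bisect_right finds which stage interval contains timestep t
        return experts[bisect.bisect_right(boundaries, t)]
    return route

# three hypothetical stage experts over a 50-step denoising schedule
route = make_stage_router([15, 35], ["early_expert", "mid_expert", "late_expert"])
```

Each expert is then a separate learned predictor (a KAN in the paper), specialized to the feature dynamics of its stage.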

[111] Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models

Qing Zhang, Xuesong Li, Jing Zhang

Main category: cs.CV

TL;DR: Visual affordance understanding requires both geometric perception (object parts) and interaction perception (agent actions), which are naturally encoded in different VFMs and can be fused for competitive affordance estimation.

DetailsMotivation: To understand what it means for visual systems to truly comprehend affordance, and to test the hypothesis that this requires two complementary capacities: geometric perception of object parts and interaction perception of how agents engage with those parts.

Method: Systematic probing of Visual Foundation Models (VFMs) to analyze their inherent capabilities, then fusing DINO’s geometric prototypes with Flux’s interaction maps in a training-free, zero-shot manner for affordance estimation.

Result: DINO encodes part-level geometric structures, while Flux contains rich verb-conditioned spatial attention maps as implicit interaction priors. Their fusion achieves affordance estimation competitive with weakly-supervised methods.

Conclusion: Geometric and interaction perception are fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.

Abstract: What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent’s actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures, while generative models like Flux contain rich, verb-conditioned spatial attention maps that serve as implicit interaction priors. Crucially, we demonstrate that these two dimensions are not merely correlated but are composable elements of affordance. By simply fusing DINO’s geometric prototypes with Flux’s interaction maps in a training-free and zero-shot manner, we achieve affordance estimation competitive with weakly-supervised methods. This final fusion experiment confirms that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.

[112] Leveraging Causal Reasoning Method for Explaining Medical Image Segmentation Models

Limai Jiang, Ruitao Xie, Bokai Yang, Huazhen Huang, Juan He, Yufu Huo, Zikai Wang, Yang Wei, Yunpeng Cai

Main category: cs.CV

TL;DR: A causal inference-based explanation model for medical image segmentation that quantifies influence of input regions and network components on segmentation outputs, providing more faithful explanations than existing methods.

DetailsMotivation: Address the black-box nature of deep learning models in medical image segmentation by developing explanation techniques specifically for segmentation tasks, which are currently underexplored compared to classification tasks.

Method: Uses causal inference framework and backpropagates average treatment effect (ATE) to quantify influence of input regions and network components on target segmentation areas.

Result: Demonstrated more faithful explanations than existing approaches on two medical imaging datasets, and revealed significant heterogeneity in perceptual strategies across different models and inputs.

Conclusion: The method provides notable insights for optimizing segmentation models and addresses trustworthiness concerns in high-stakes medical scenarios through improved explainability.

Abstract: Medical image segmentation plays a vital role in clinical decision-making, enabling precise localization of lesions and guiding interventions. Despite significant advances in segmentation accuracy, the black-box nature of most deep models has raised growing concerns about their trustworthiness in high-stakes medical scenarios. Current explanation techniques have primarily focused on classification tasks, leaving the segmentation domain relatively underexplored. We introduced an explanation model for the segmentation task which employs the causal inference framework and backpropagates the average treatment effect (ATE) into a quantification metric to determine the influence of input regions, as well as network components, on target segmentation areas. Through comparison with recent segmentation explainability techniques on two representative medical imaging datasets, we demonstrated that our approach provides more faithful explanations than existing approaches. Furthermore, we carried out a systematic causal analysis of multiple foundational segmentation models using our method, which reveals significant heterogeneity in perceptual strategies across different models, and even between different inputs for the same model, suggesting the potential of our method to provide notable insights for optimizing segmentation models. Our code can be found at https://github.com/lcmmai/PdCR.

[113] How Do Inpainting Artifacts Propagate to Language?

Pratham Yashwante, Davit Abrahamyan, Shresth Grover, Sukruth Rao

Main category: cs.CV

TL;DR: Diffusion-based inpainting artifacts affect vision-language models’ captioning performance, with reconstruction quality metrics correlating with lexical and semantic caption quality.

DetailsMotivation: To understand how visual artifacts from diffusion-based inpainting affect language generation in vision-language models, providing a diagnostic framework for examining visual reconstruction quality's influence on multimodal systems.

Method: Two-stage diagnostic setup: mask image regions, reconstruct them using diffusion inpainting, then feed both original and reconstructed images to captioning models for controlled comparison of generated captions.

Result: Consistent associations between pixel-level/perceptual reconstruction metrics and lexical/semantic captioning performance; inpainting artifacts cause systematic, layer-dependent changes in model behavior through analysis of intermediate visual representations and attention patterns.

Conclusion: Provides practical diagnostic framework for examining how visual reconstruction quality influences language generation in multimodal systems, revealing systematic effects of inpainting artifacts on vision-language model behavior.

Abstract: We study how visual artifacts introduced by diffusion-based inpainting affect language generation in vision-language models. We use a two-stage diagnostic setup in which masked image regions are reconstructed and then provided to captioning models, enabling controlled comparisons between captions generated from original and reconstructed inputs. Across multiple datasets, we analyze the relationship between reconstruction fidelity and downstream caption quality. We observe consistent associations between pixel-level and perceptual reconstruction metrics and both lexical and semantic captioning performance. Additional analysis of intermediate visual representations and attention patterns shows that inpainting artifacts lead to systematic, layer-dependent changes in model behavior. Together, these results provide a practical diagnostic framework for examining how visual reconstruction quality influences language generation in multimodal systems.
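
The paper's central measurement, associating a reconstruction-fidelity metric with caption quality, can be sketched as computing PSNR per reconstructed image and correlating it with a caption score. The synthetic data below is purely illustrative (caption quality is simulated as degrading with noise level):

```python
import numpy as np

def psnr(orig, recon, peak=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, peak]."""
    mse = np.mean((orig - recon) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def pearson(x, y):
    """Pearson correlation coefficient between two score lists."""
    x = np.asarray(x, float); y = np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

rng = np.random.default_rng(0)
origs = rng.random((20, 8, 8))
noise_levels = np.linspace(0.01, 0.3, 20)        # stand-in for artifact severity
psnrs = [psnr(o, np.clip(o + rng.normal(0, s, o.shape), 0, 1))
         for o, s in zip(origs, noise_levels)]
# simulated caption quality: drops as reconstruction degrades
caption_scores = 0.9 - 0.5 * noise_levels + rng.normal(0, 0.02, 20)
r = pearson(psnrs, caption_scores)               # strongly positive here
```

In the actual study, the caption scores come from lexical and semantic metrics on captions produced from the inpainted images.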

[114] A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata

Azrin Sultana, Firoz Ahmed

Main category: cs.CV

TL;DR: A lightweight vision-language framework for app rating prediction that combines MobileNetV3 for UI visual features and DistilBERT for textual features through gated fusion.

DetailsMotivation: Existing app rating prediction models are limited to either textual data or UI features, overlooking the importance of jointly leveraging both UI and semantic information for better prediction accuracy.

Method: Proposes a multimodal framework using MobileNetV3 to extract visual features from UI layouts and DistilBERT to extract textual features, with a gated fusion module using Swish activations and MLP regression head.

Result: Achieves MAE of 0.1060, RMSE of 0.1433, MSE of 0.0205, R2 of 0.8529, and Pearson correlation of 0.9251 after 20 epochs, demonstrating effectiveness of multimodal fusion.

Conclusion: The lightweight vision-language framework provides valuable insights for developers, supports sustainable app development, and enables efficient deployment on edge devices.

Abstract: App ratings are among the most significant indicators of the quality, usability, and overall user satisfaction of mobile applications. However, existing app rating prediction models are largely limited to textual data or user interface (UI) features, overlooking the importance of jointly leveraging UI and semantic information. To address these limitations, this study proposes a lightweight vision–language framework that integrates both mobile UI and semantic information for app rating prediction. The framework combines MobileNetV3 to extract visual features from UI layouts and DistilBERT to extract textual features. These multimodal features are fused through a gated fusion module with Swish activations, followed by a multilayer perceptron (MLP) regression head. The proposed model is evaluated using mean absolute error (MAE), root mean square error (RMSE), mean squared error (MSE), coefficient of determination (R2), and Pearson correlation. After training for 20 epochs, the model achieves an MAE of 0.1060, an RMSE of 0.1433, an MSE of 0.0205, an R2 of 0.8529, and a Pearson correlation of 0.9251. Extensive ablation studies further demonstrate the effectiveness of different combinations of visual and textual encoders. Overall, the proposed lightweight framework provides valuable insights for developers and end users, supports sustainable app development, and enables efficient deployment on edge devices.
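
The gated fusion module with Swish activations can be sketched in NumPy. All weights below are random stand-ins; the dimensions, initialization, and the [1, 5] output squashing are illustrative assumptions, not the paper's exact head:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))           # equals x * sigmoid(x)

def gated_fusion_rating(v, t, rng=0):
    """Minimal sketch: a learned gate mixes visual and textual features,
    then a Swish MLP regression head maps the fused vector to a rating."""
    rng = np.random.default_rng(rng)
    d = v.shape[-1]
    Wg = rng.normal(0, 0.1, (2 * d, d))     # gate projection
    W1 = rng.normal(0, 0.1, (d, d))         # MLP hidden layer
    W2 = rng.normal(0, 0.1, (d, 1))         # regression output
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([v, t], -1) @ Wg)))
    fused = gate * v + (1.0 - gate) * t     # per-dimension convex mix
    h = swish(fused @ W1)
    score = (h @ W2).item()
    return 1.0 + 4.0 / (1.0 + np.exp(-score))   # squash to a [1, 5] rating

vis = np.random.default_rng(1).normal(size=64)  # stand-in MobileNetV3 features
txt = np.random.default_rng(2).normal(size=64)  # stand-in DistilBERT features
rating = gated_fusion_rating(vis, txt)
```

The gate lets the model weight UI appearance against metadata semantics per feature dimension, rather than concatenating them blindly.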

[115] PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning

Xinyong Cai, Changbin Sun, Yong Wang, Hongyu Yang, Yuankai Wu

Main category: cs.CV

TL;DR: PFGNet: A fully convolutional spatiotemporal prediction framework that uses pixel-wise frequency-guided gating to dynamically modulate receptive fields, achieving state-of-the-art performance with high efficiency.

DetailsMotivation: Pure convolutional models for spatiotemporal prediction offer superior efficiency and parallelism compared to recurrent/hybrid architectures, but their fixed receptive fields limit their ability to adaptively capture spatially varying motion patterns. The paper aims to overcome this limitation while maintaining computational efficiency.

Method: Proposes PFGNet with Peripheral Frequency Gating (PFG) blocks that extract localized spectral cues and adaptively fuse multi-scale large-kernel peripheral responses with learnable center suppression, forming spatially adaptive band-pass filters. Uses separable 1D convolutions (1×k followed by k×1) to reduce computational cost from O(k²) to O(2k).

Result: PFGNet achieves state-of-the-art or near-state-of-the-art forecasting performance on Moving MNIST, TaxiBJ, Human3.6M, and KTH datasets with substantially fewer parameters and FLOPs compared to existing methods.

Conclusion: PFGNet demonstrates that fully convolutional architectures with frequency-guided adaptive receptive fields can achieve excellent spatiotemporal prediction performance while maintaining high computational efficiency, offering a promising alternative to recurrent or attention-based approaches.

Abstract: Spatiotemporal predictive learning (STPL) aims to forecast future frames from past observations and is essential across a wide range of applications. Compared with recurrent or hybrid architectures, pure convolutional models offer superior efficiency and full parallelism, yet their fixed receptive fields limit their ability to adaptively capture spatially varying motion patterns. Inspired by biological center-surround organization and frequency-selective signal processing, we propose PFGNet, a fully convolutional framework that dynamically modulates receptive fields through pixel-wise frequency-guided gating. The core Peripheral Frequency Gating (PFG) block extracts localized spectral cues and adaptively fuses multi-scale large-kernel peripheral responses with learnable center suppression, effectively forming spatially adaptive band-pass filters. To maintain efficiency, all large kernels are decomposed into separable 1D convolutions ($1 \times k$ followed by $k \times 1$), reducing per-channel computational cost from $O(k^2)$ to $O(2k)$. PFGNet enables structure-aware spatiotemporal modeling without recurrence or attention. Experiments on Moving MNIST, TaxiBJ, Human3.6M, and KTH show that PFGNet delivers SOTA or near-SOTA forecasting performance with substantially fewer parameters and FLOPs. Our code is available at https://github.com/fhjdqaq/PFGNet.
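
The separable-kernel decomposition behind PFGNet's efficiency claim is a standard identity: applying a 1×k pass followed by a k×1 pass equals one k×k pass with the rank-1 kernel outer(col, row), at O(2k) instead of O(k²) multiplies per pixel. A self-contained check (the naive convolution below exists only to verify the factorization):

```python
import numpy as np

def conv2d(x, k):
    """Naive 'valid'-mode 2D correlation, used only to verify the identity."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.random((16, 16))
row = rng.random(5)          # the 1 x k kernel
col = rng.random(5)          # the k x 1 kernel

# separable pass: 1xk then kx1, O(2k) multiplies per output pixel
sep = conv2d(conv2d(x, row[None, :]), col[:, None])
# equivalent single kxk pass with the rank-1 kernel, O(k^2) per pixel
full = conv2d(x, np.outer(col, row))
# sep and full agree to floating-point precision
```

The trade-off is that a separable pair can only express rank-1 kernels, which is why PFGNet combines several of them at multiple scales.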

[116] Beyond Human Performance: A Vision-Language Multi-Agent Approach for Quality Control in Pharmaceutical Manufacturing

Subhra Jyoti Mandal, Lara Rachidi, Puneet Jain, Matthieu Duvinage, Sander W. Timmer

Main category: cs.CV

TL;DR: A multi-agent framework combining deep learning and vision-language models for automated colony-forming unit detection in pharmaceutical manufacturing, achieving 85% reduction in human verification.

DetailsMotivation: Manual CFU counting is labor-intensive and error-prone, while existing DL approaches are insufficient for pharmaceutical-grade requirements (need >99% accuracy) and vulnerable to sample quality variations and artifacts.

Method: Multi-agent framework: VLM agent first classifies plates as valid/invalid; for valid samples, both DL (Detectron2) and VLM agents independently estimate colony counts; predictions within 5% agreement are automatically recorded, otherwise routed for expert review with continuous retraining.

Result: Initial DL automation reduced human verification by 50%; with VLM integration, increased to 85% reduction; custom Detectron2 achieved 99% detection rate with 2% false positives and 0.6% false negatives.

Conclusion: The proposed system provides a scalable, auditable, and regulation-ready solution for microbiological quality control, advancing automation in biopharmaceutical production through DL-VLM integration.

Abstract: Colony-forming unit (CFU) detection is critical in pharmaceutical manufacturing, serving as a key component of Environmental Monitoring programs and ensuring compliance with stringent quality standards. Manual counting is labor-intensive and error-prone, while deep learning (DL) approaches, though accurate, remain vulnerable to sample quality variations and artifacts. Building on our earlier CNN-based framework (Beznik et al., 2020), we evaluated YOLOv5, YOLOv7, and YOLOv8 for CFU detection; however, these achieved only 97.08 percent accuracy, insufficient for pharmaceutical-grade requirements. A custom Detectron2 model trained on GSK’s dataset of over 50,000 Petri dish images achieved 99 percent detection rate with 2 percent false positives and 0.6 percent false negatives. Despite high validation accuracy, Detectron2 performance degrades on outlier cases including contaminated plates, plastic artifacts, or poor optical clarity. To address this, we developed a multi-agent framework combining DL with vision-language models (VLMs). The VLM agent first classifies plates as valid or invalid. For valid samples, both DL and VLM agents independently estimate colony counts. When predictions align within 5 percent, results are automatically recorded in Postgres and SAP; otherwise, samples are routed for expert review. Expert feedback enables continuous retraining and self-improvement. Initial DL-based automation reduced human verification by 50 percent across vaccine manufacturing sites. With VLM integration, this increased to 85 percent, delivering significant operational savings. The proposed system provides a scalable, auditable, and regulation-ready solution for microbiological quality control, advancing automation in biopharmaceutical production.
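
The routing logic described in the abstract (invalid plates rejected, agreeing counts auto-recorded, disagreements escalated) reduces to a small decision function. The exact agreement metric below, relative difference against the larger count, is an assumption; the abstract states only "within 5 percent":

```python
def route_cfu_counts(dl_count, vlm_count, plate_valid, tol=0.05):
    """Sketch of the multi-agent agreement rule: reject invalid plates,
    auto-record counts that agree within `tol`, else send to an expert."""
    if not plate_valid:
        return "invalid_plate"
    denom = max(dl_count, vlm_count, 1)       # guard against zero counts
    if abs(dl_count - vlm_count) / denom <= tol:
        return "auto_record"                  # written to Postgres / SAP
    return "expert_review"                    # feeds the retraining loop
```

Samples routed to experts become labeled data, which is what makes the pipeline self-improving over time.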

[117] Robust Spiking Neural Networks Against Adversarial Attacks

Shuai Wang, Malu Zhang, Yulin Jiang, Dehao Zhang, Ammar Belatreche, Yu Liang, Yimeng Shan, Zijian Zhou, Yang Yang, Haizhou Li

Main category: cs.CV

TL;DR: TGO method enhances SNN robustness by moving neuron membrane potentials away from thresholds and introducing probabilistic firing, reducing adversarial attack vulnerability.

DetailsMotivation: SNNs are promising for energy-efficient neuromorphic computing but have limited robustness in adversarial environments. Threshold-neighboring spiking neurons are identified as key factors limiting robustness, setting upper limits for adversarial attacks and being prone to state-flipping under minor disturbances.

Method: Threshold Guarding Optimization (TGO) method with two key aspects: 1) Adding constraints to loss function to move neurons’ membrane potentials away from thresholds, increasing gradient sparsity and reducing theoretical upper bound of adversarial attacks; 2) Introducing noisy spiking neurons to transition firing mechanism from deterministic to probabilistic, decreasing state-flipping probability.

Result: Extensive experiments in standard adversarial scenarios prove the method significantly enhances robustness of directly trained SNNs.

Conclusion: The findings pave the way for advancing more reliable and secure neuromorphic computing in real-world applications by addressing fundamental robustness limitations in SNNs.

Abstract: Spiking Neural Networks (SNNs) represent a promising paradigm for energy-efficient neuromorphic computing due to their bio-plausible and spike-driven characteristics. However, the robustness of SNNs in complex adversarial environments remains significantly constrained. In this study, we theoretically demonstrate that those threshold-neighboring spiking neurons are the key factors limiting the robustness of directly trained SNNs. We find that these neurons set the upper limits for the maximum potential strength of adversarial attacks and are prone to state-flipping under minor disturbances. To address this challenge, we propose a Threshold Guarding Optimization (TGO) method, which comprises two key aspects. First, we incorporate additional constraints into the loss function to move neurons’ membrane potentials away from their thresholds. It increases SNNs’ gradient sparsity, thereby reducing the theoretical upper bound of adversarial attacks. Second, we introduce noisy spiking neurons to transition the neuronal firing mechanism from deterministic to probabilistic, decreasing their state-flipping probability due to minor disturbances. Extensive experiments conducted in standard adversarial scenarios prove that our method significantly enhances the robustness of directly trained SNNs. These findings pave the way for advancing more reliable and secure neuromorphic computing in real-world applications.
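
The two TGO components have simple functional forms: a loss term that penalizes membrane potentials inside a band around the threshold, and a firing rule whose spike probability is a sigmoid of the distance to threshold. The hinge-band penalty and temperature below are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def threshold_guard_penalty(v_mem, theta=1.0, margin=0.2):
    """Illustrative TGO-style regularizer: penalize membrane potentials
    lying within +/- margin of the firing threshold theta."""
    dist = np.abs(v_mem - theta)
    return float(np.mean(np.maximum(0.0, margin - dist)))  # hinge penalty

def noisy_fire(v_mem, theta=1.0, temp=0.1, rng=None):
    """Probabilistic firing: spike probability is a sigmoid of (V - theta),
    so near-threshold neurons flip state stochastically, not crisply."""
    rng = np.random.default_rng(rng)
    p = 1.0 / (1.0 + np.exp(-(v_mem - theta) / temp))
    return (rng.random(v_mem.shape) < p).astype(np.uint8)

v = np.array([0.5, 0.95, 1.05, 1.8])
penalty = threshold_guard_penalty(v)   # only the two near-threshold neurons
spikes = noisy_fire(v, rng=0)
```

Minimizing the penalty pushes potentials out of the vulnerable band, which is what shrinks the attack's theoretical upper bound in the paper's analysis.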

[118] The Finite Primitive Basis Theorem for Computational Imaging: Formal Foundations of the OperatorGraph Representation

Chengshuai Yang

Main category: cs.CV

TL;DR: The paper proves that every computational imaging forward model in a broad, precisely defined operator class can be represented as a DAG built from exactly 11 canonical primitives, establishing foundations for Physics World Models.

DetailsMotivation: Traditional computational imaging forward models are implemented as monolithic, modality-specific codes, lacking a unified mathematical framework. The authors aim to establish a minimal set of canonical primitives that can represent all imaging modalities.

Method: The authors prove the Finite Primitive Basis Theorem showing all forward models in class Cimg can be epsilon-approximated by typed DAGs using exactly 11 primitives. They provide a constructive algorithm to generate DAGs for any given forward model and prove the library’s minimality.

Result: Empirical validation on 31 linear modalities achieves error below 0.01 with at most 5 nodes and depth 5. The paper provides constructive DAG decompositions for 9 additional nonlinear modalities and proves the 11-primitive library is minimal.

Conclusion: The results establish mathematical foundations for the Physics World Model framework, providing a unified representation for computational imaging forward models across diverse modalities using a minimal set of canonical primitives.

Abstract: Computational imaging forward models, from coded aperture spectral cameras to MRI scanners, are traditionally implemented as monolithic, modality-specific codes. We prove that every forward model in a broad, precisely defined operator class Cimg (encompassing clinical, scientific, and industrial imaging modalities, both linear and nonlinear) admits an epsilon-approximate representation as a typed directed acyclic graph (DAG) whose nodes are drawn from a library of exactly 11 canonical primitives: Propagate, Modulate, Project, Encode, Convolve, Accumulate, Detect, Sample, Disperse, Scatter, and Transform. We call this the Finite Primitive Basis Theorem. The proof is constructive: we provide an algorithm that, given any H in Cimg, produces a DAG G with relative operator error at most epsilon and graph complexity within prescribed bounds. We further prove that the library is minimal: removing any single primitive causes at least one modality to lose its epsilon-approximate representation. A systematic analysis of nonlinearities in imaging physics shows they fall into two structural categories: pointwise scalar functions (handled by Transform) and self-consistent iterations (unrolled into existing linear primitives). Empirical validation on 31 linear modalities confirms relative error e_img below 0.01 with at most 5 nodes and depth 5, and we provide constructive DAG decompositions for 9 additional nonlinear modalities. These results establish mathematical foundations for the Physics World Model (PWM) framework.
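To make the primitive-composition idea concrete, here is a toy path-shaped DAG built from simplified 1-D stand-ins for two of the 11 named primitives (Convolve and Sample). The real primitives act on continuous fields, so these implementations are purely illustrative:

```python
def convolve(x, kernel):
    """Convolve primitive: 'same'-size 1-D convolution (toy version)."""
    n, k = len(x), len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(n)]

def sample(x, stride):
    """Sample primitive: keep every `stride`-th measurement."""
    return x[::stride]

def run_dag(x, nodes):
    """Evaluate a path-shaped DAG of primitive nodes in order."""
    for fn, kwargs in nodes:
        x = fn(x, **kwargs)
    return x

signal = [float(i) for i in range(8)]
dag = [(convolve, {"kernel": [0.5, 0.5]}),   # blur stage
       (sample, {"stride": 2})]              # downsampling detector
y = run_dag(signal, dag)
```

A real forward model in Cimg would be a typed DAG over such nodes, with the constructive algorithm choosing the nodes and wiring to approximate a given operator H.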

[119] CAD-Prompted SAM3: Geometry-Conditioned Instance Segmentation for Industrial Objects

Zhenran Tang, Rohan Nagabhirava, Changliu Liu

Main category: cs.CV

TL;DR: CAD-prompted segmentation framework using multi-view CAD renderings as geometry-based prompts for industrial object segmentation, overcoming limitations of language and appearance-based prompting.

DetailsMotivation: Natural language prompting struggles with uncommon or difficult-to-describe objects in manufacturing, while image exemplars rely on appearance cues (color/texture) that are unreliable when objects have different materials/finishes. CAD models capture canonical geometry that defines industrial parts.

Method: Built on SAM3, uses canonical multi-view renderings of CAD models as prompt input to provide geometry-based conditioning. Trained with synthetic data from mesh renderings in simulation under diverse viewpoints and scene contexts. Enables single-stage, CAD-prompted mask prediction.

Result: Enables segmentation of objects that cannot be robustly described by language or appearance alone, extending promptable segmentation to industrial/manufacturing scenarios where CAD models are available.

Conclusion: CAD-prompted segmentation provides a geometry-based alternative to language and appearance prompting, particularly valuable for industrial applications where precise CAD models exist but appearance varies.

Abstract: Verbal-prompted segmentation is inherently limited by the expressiveness of natural language and struggles with uncommon, instance-specific, or difficult-to-describe objects: scenarios frequently encountered in manufacturing and 3D printing environments. While image exemplars provide an alternative, they primarily encode appearance cues such as color and texture, which are often unrelated to a part’s geometric identity. In industrial settings, a single component may be produced in different materials, finishes, or colors, making appearance-based prompting unreliable. In contrast, such objects are typically defined by precise CAD models that capture their canonical geometry. We propose a CAD-prompted segmentation framework built on SAM3 that uses canonical multi-view renderings of a CAD model as prompt input. The rendered views provide geometry-based conditioning independent of surface appearance. The model is trained using synthetic data generated from mesh renderings in simulation under diverse viewpoints and scene contexts. Our approach enables single-stage, CAD-prompted mask prediction, extending promptable segmentation to objects that cannot be robustly described by language or appearance alone.

[120] WildGHand: Learning Anti-Perturbation Gaussian Hand Avatars from Monocular In-the-Wild Videos

Hanhui Li, Xuan Huang, Wanquan Liu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang, Chenqiang Gao

Main category: cs.CV

TL;DR: WildGHand is an optimization-based framework for high-fidelity 3D hand avatar reconstruction from monocular videos in challenging real-world conditions with various perturbations.

DetailsMotivation: Existing 3D hand reconstruction methods work well in controlled environments but degrade in real-world settings with perturbations like hand-object interactions, extreme poses, illumination changes, and motion blur.

Method: Uses 3D Gaussian splatting with two key components: (1) dynamic perturbation disentanglement module representing perturbations as time-varying biases on 3D Gaussian attributes, and (2) perturbation-aware optimization strategy with per-frame anisotropic weighted masks to guide optimization.

Result: Achieves state-of-the-art performance on curated dataset and two public datasets, with up to 15.8% relative gain in PSNR and 23.1% relative reduction in LPIPS compared to base model.

Conclusion: WildGHand enables robust 3D hand avatar reconstruction from in-the-wild videos by effectively handling various real-world perturbations through explicit perturbation modeling and adaptive optimization.

Abstract: Despite recent progress in 3D hand reconstruction from monocular videos, most existing methods rely on data captured in well-controlled environments and therefore degrade in real-world settings with severe perturbations, such as hand-object interactions, extreme poses, illumination changes, and motion blur. To tackle these issues, we introduce WildGHand, an optimization-based framework that enables self-adaptive 3D Gaussian splatting on in-the-wild videos and produces high-fidelity hand avatars. WildGHand incorporates two key components: (i) a dynamic perturbation disentanglement module that explicitly represents perturbations as time-varying biases on 3D Gaussian attributes during optimization, and (ii) a perturbation-aware optimization strategy that generates per-frame anisotropic weighted masks to guide optimization. Together, these components allow the framework to identify and suppress perturbations across both spatial and temporal dimensions. We further curate a dataset of monocular hand videos captured under diverse perturbations to benchmark in-the-wild hand avatar reconstruction. Extensive experiments on this dataset and two public datasets demonstrate that WildGHand achieves state-of-the-art performance and substantially improves over its base model across multiple metrics (e.g., up to a 15.8% relative gain in PSNR and a 23.1% relative reduction in LPIPS). Our implementation and dataset are available at https://github.com/XuanHuang0/WildGHand.

[121] AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents

Jiaqi Wu, Yuchen Zhou, Muduo Xu, Zisheng Liang, Simiao Ren, Jiayu Xue, Meige Yang, Siying Chen, Jingheng Huan

Main category: cs.CV

TL;DR: AIForge-Doc is the first benchmark for diffusion-model-based document inpainting forgery detection, focusing on financial documents with pixel-level annotations, showing existing detectors fail against AI-generated forgeries.

DetailsMotivation: Existing document forgery datasets use traditional editing tools, creating a gap where state-of-the-art detectors are blind to AI-forged document fraud using diffusion models. There's a need for dedicated benchmarks to address this emerging threat.

Method: Created AIForge-Doc by systematically forging numeric fields in real-world receipt and form images using two AI inpainting APIs (Gemini 2.5 Flash Image and Ideogram v2 Edit), yielding 4,061 forged images from four public document datasets across nine languages with pixel-precise tampered-region masks.

Result: Benchmarked three detectors: TruFor (AUC=0.751 vs. 0.96 on NIST16), DocTamper (AUC=0.563 vs. 0.98 in-distribution, pixel-level IoU=0.020), and GPT-4o (AUC=0.509). All degrade substantially, showing AI-forged values are indistinguishable to automated detectors and VLMs.

Conclusion: AIForge-Doc represents a qualitatively new and unsolved challenge for document forensics, demonstrating that current detection methods are inadequate against diffusion-model-based document forgeries.

Abstract: We present AIForge-Doc, the first dedicated benchmark targeting exclusively diffusion-model-based inpainting in financial and form documents with pixel-level annotation. Existing document forgery datasets rely on traditional digital editing tools (e.g., Adobe Photoshop, GIMP), creating a critical gap: state-of-the-art detectors are blind to the rapidly growing threat of AI-forged document fraud. AIForge-Doc addresses this gap by systematically forging numeric fields in real-world receipt and form images using two AI inpainting APIs – Gemini 2.5 Flash Image and Ideogram v2 Edit – yielding 4,061 forged images from four public document datasets (CORD, WildReceipt, SROIE, XFUND) across nine languages, annotated with pixel-precise tampered-region masks in DocTamper-compatible format. We benchmark three representative detectors – TruFor, DocTamper, and a zero-shot GPT-4o judge – and find that all existing methods degrade substantially: TruFor achieves AUC=0.751 (zero-shot, out-of-distribution) vs. AUC=0.96 on NIST16; DocTamper achieves AUC=0.563 vs. AUC=0.98 in-distribution, with pixel-level IoU=0.020; GPT-4o achieves only 0.509 – essentially at chance – confirming that AI-forged values are indistinguishable to automated detectors and VLMs. These results demonstrate that AIForge-Doc represents a qualitatively new and unsolved challenge for document forensics.
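The pixel-level IoU reported above (e.g., DocTamper's 0.020) is the standard intersection-over-union between the predicted and ground-truth tampered-region masks. A minimal sketch on flat 0/1 masks:

```python
def pixel_iou(pred, gt):
    """Pixel-level IoU between predicted and ground-truth tampered-region
    masks, given as flat lists of 0/1 values of equal length."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0

gt   = [0, 1, 1, 1, 1, 0, 0, 0]   # 4 tampered pixels
pred = [0, 1, 1, 1, 1, 1, 1, 0]   # 6 predicted pixels, covering all 4
score = pixel_iou(pred, gt)
```

An IoU of 0.020 therefore means the detector's predicted tampered regions barely overlap the true forged fields at all.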

[122] An interactive enhanced driving dataset for autonomous driving

Haojie Feng, Peizhi Zhang, Mengjie Tian, Xinrui Zhang, Zhuoren Li, Junpeng Huang, Xiurong Wang, Junfan Zhu, Jianzhou Wang, Dongxiao Yin, Lu Xiong

Main category: cs.CV

TL;DR: Proposes Interactive Enhanced Driving Dataset (IEDD) with million-level interactive segments from naturalistic driving data and IEDD-VQA dataset with synthetic BEV videos for autonomous driving VLA models.

DetailsMotivation: Autonomous driving needs robust interactive capabilities, but current Vision-Language-Action models are constrained by sparse interactive scenarios and inadequate multimodal alignment in existing datasets.

Method: Develops scalable pipeline to mine million-level interactive segments from naturalistic driving data based on interactive trajectories, designs metrics to quantify interaction processes, and constructs IEDD-VQA dataset with synthetic BEV videos where semantic actions are strictly aligned with structured language.

Result: Benchmark results evaluating ten mainstream Vision Language Models demonstrate the dataset’s reuse value in assessing and fine-tuning reasoning capabilities of autonomous driving models.

Conclusion: IEDD addresses data limitations for autonomous driving VLA models by providing large-scale interactive datasets with proper multimodal alignment, enabling better assessment and development of interactive reasoning capabilities.

Abstract: The evolution of autonomous driving towards full automation demands robust interactive capabilities; however, the development of Vision-Language-Action (VLA) models is constrained by the sparsity of interactive scenarios and inadequate multimodal alignment in existing data. To this end, this paper proposes the Interactive Enhanced Driving Dataset (IEDD). We develop a scalable pipeline to mine million-level interactive segments from naturalistic driving data based on interactive trajectories, and design metrics to quantify the interaction processes. Furthermore, the IEDD-VQA dataset is constructed by generating synthetic Bird’s Eye View (BEV) videos where semantic actions are strictly aligned with structured language. Benchmark results evaluating ten mainstream Vision Language Models (VLMs) are provided to demonstrate the dataset’s reuse value in assessing and fine-tuning the reasoning capabilities of autonomous driving models.
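The paper's interaction metrics are not specified in the summary, but one simple proxy in the same spirit is the closest approach between two time-synchronized trajectories: a small gap indicates a strongly interactive segment worth mining. A minimal sketch:

```python
import math

def min_gap(traj_a, traj_b):
    """Closest approach between two time-synchronized 2-D trajectories,
    a simple proxy for how strongly two road users interact."""
    return min(math.dist(p, q) for p, q in zip(traj_a, traj_b))

ego   = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
other = [(0.0, 3.0), (1.0, 1.0), (2.0, 3.0)]
gap = min_gap(ego, other)   # the pair is closest at the middle step
```

Thresholding such a score over naturalistic logs is one plausible way a scalable pipeline could flag candidate interactive segments.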

[123] Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion

Jiaru Zhang, Manav Gagvani, Can Cui, Juntong Peng, Ruqi Zhang, Ziran Wang

Main category: cs.CV

TL;DR: MVLAD-AD: A masked vision-language-action diffusion model for autonomous driving that uses discrete action tokenization and geometry-aware embeddings to improve planning efficiency and explainability.

DetailsMotivation: Current LLMs and VLMs for autonomous driving suffer from slow inference latency, poor action precision, and lack of explainability. Autoregressive approaches are too slow, while diffusion-based planners use verbose language tokens without explicit geometric structure.

Method: Proposes a masked vision-language-action diffusion model with: 1) discrete action tokenization using a compact codebook of kinematically feasible waypoints, 2) geometry-aware embedding learning to approximate physical geometric metrics in latent space, and 3) action-priority decoding to prioritize trajectory generation.

Result: Extensive experiments on nuScenes and derived benchmarks show MVLAD-AD achieves superior efficiency and outperforms state-of-the-art autoregressive and diffusion baselines in planning precision, while providing high-fidelity and explainable reasoning.

Conclusion: MVLAD-AD successfully bridges the gap between efficient planning and semantic explainability in autonomous driving through its novel masked diffusion approach with geometric-aware action representations.

Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged as promising candidates for end-to-end autonomous driving. However, these models typically face challenges in inference latency, action precision, and explainability. Existing autoregressive approaches struggle with slow token-by-token generation, while prior diffusion-based planners often rely on verbose, general-purpose language tokens that lack explicit geometric structure. In this work, we propose Masked Vision-Language-Action Diffusion for Autonomous Driving (MVLAD-AD), a novel framework designed to bridge the gap between efficient planning and semantic explainability via a masked vision-language-action diffusion model. Unlike methods that force actions into the language space, we introduce a discrete action tokenization strategy that constructs a compact codebook of kinematically feasible waypoints from real-world driving distributions. Moreover, we propose geometry-aware embedding learning to ensure that embeddings in the latent space approximate physical geometric metrics. Finally, an action-priority decoding strategy is introduced to prioritize trajectory generation. Extensive experiments on nuScenes and derived benchmarks demonstrate that MVLAD-AD achieves superior efficiency and outperforms state-of-the-art autoregressive and diffusion baselines in planning precision, while providing high-fidelity and explainable reasoning.
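The discrete action tokenization step can be sketched as nearest-neighbor quantization of continuous waypoints against a codebook. The codebook entries below are toy values, not the paper's learned vocabulary of kinematically feasible waypoints:

```python
import math

def tokenize_waypoints(waypoints, codebook):
    """Map each continuous (x, y) waypoint to the index of its nearest
    codebook entry, a minimal sketch of discrete action tokenization."""
    return [min(range(len(codebook)),
                key=lambda i: math.dist(w, codebook[i]))
            for w in waypoints]

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # toy waypoint vocabulary
tokens = tokenize_waypoints([(0.1, -0.1), (0.9, 0.2)], codebook)
```

A trajectory then becomes a short sequence of discrete tokens that a masked diffusion model can generate, rather than free-form language tokens.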

[124] PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models

Wonyong Seo, Jaeho Moon, Jaehyup Lee, Soo Ye Kim, Munchurl Kim

Main category: cs.CV

TL;DR: PropFly: Training pipeline for propagation-based video editing using on-the-fly supervision from pre-trained video diffusion models instead of requiring paired video datasets.

DetailsMotivation: Training propagation-based video editing models requires large-scale paired video datasets which are costly and complex to acquire. Existing methods need off-the-shelf or precomputed paired video editing datasets.

Method: Uses one-step clean latent estimations from intermediate noised latents with varying CFG scales to synthesize diverse ‘source’ (low-CFG) and ‘edited’ (high-CFG) latents on-the-fly. Learns propagation via Guidance-Modulated Flow Matching loss with an adapter attached to pre-trained VDM.

Result: Extensive experiments show PropFly significantly outperforms state-of-the-art methods on various video editing tasks, producing high-quality editing results with temporal consistency.

Conclusion: PropFly provides an effective training pipeline for propagation-based video editing without requiring paired video datasets, leveraging on-the-fly supervision from pre-trained VDMs to learn temporally consistent transformations.

Abstract: Propagation-based video editing enables precise user control by propagating a single edited frame into following frames while maintaining the original context such as motion and structures. However, training such models requires large-scale, paired (source and edited) video datasets, which are costly and complex to acquire. Hence, we propose PropFly, a training pipeline for Propagation-based video editing relying on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets. Specifically, our PropFly leverages one-step clean latent estimations from intermediate noised latents with varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of ‘source’ (low-CFG) and ‘edited’ (high-CFG) latents on-the-fly. The source latent serves as structural information of the video, while the edited latent provides the target transformation for learning propagation. Our pipeline enables an additional adapter attached to the pre-trained VDM to learn to propagate edits via a Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation. Our on-the-fly supervision ensures that the model learns temporally consistent and dynamic transformations. Extensive experiments demonstrate that PropFly significantly outperforms the state-of-the-art methods on various video editing tasks, producing high-quality editing results.
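The low-CFG/high-CFG pairing rests on two standard diffusion identities: the classifier-free-guidance combination of noise predictions, and the one-step clean-latent estimate. A minimal sketch with scalar latents; the guidance scales 1.0 and 7.5 are illustrative, not the paper's settings:

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: blend unconditional and conditional
    noise predictions (scale > 1 extrapolates toward the condition)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def one_step_x0(x_t, eps, alpha_t, sigma_t):
    """One-step clean-latent estimate under x_t = alpha_t*x0 + sigma_t*eps."""
    return (x_t - sigma_t * eps) / alpha_t

# Estimating the SAME noised latent at two CFG scales yields the paired
# 'source' (low-CFG) and 'edited' (high-CFG) latents used as supervision.
x0_src  = one_step_x0(1.5, cfg_combine(0.0, 1.0, 1.0), 0.5, 0.5)
x0_edit = one_step_x0(1.5, cfg_combine(0.0, 1.0, 7.5), 0.5, 0.5)
```

Because both estimates derive from one noised latent, they share structure while differing in the strength of the conditioned transformation, which is exactly the paired signal propagation learning needs.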

[125] Long-Term Multi-Session 3D Reconstruction Under Substantial Appearance Change

Beverley Gorry, Tobias Fischer, Michael Milford, Alejandro Fontan

Main category: cs.CV

TL;DR: A method for joint 3D reconstruction across long-term temporal gaps using cross-session correspondences with hybrid handcrafted/learned features, applied to coral reef monitoring.

DetailsMotivation: Existing SfM pipelines fail in long-term environmental monitoring scenarios with substantial visual and structural changes (e.g., coral reef surveys), as they assume near-simultaneous capture and limited appearance change.

Method: Enforces cross-session correspondences directly within joint SfM reconstruction using complementary handcrafted and learned visual features. Uses visual place recognition to identify likely cross-session image pairs for efficient learned feature matching.

Result: Demonstrates consistent joint reconstruction across sessions in long-term coral reef datasets with significant real-world change, where standard independent and joint SfM pipelines fail.

Conclusion: The approach enables coherent 3D reconstruction from imagery captured years apart by addressing the fundamental limitation of post-hoc alignment in existing methods through direct cross-session correspondence enforcement.

Abstract: Long-term environmental monitoring requires the ability to reconstruct and align 3D models across repeated site visits separated by months or years. However, existing Structure-from-Motion (SfM) pipelines implicitly assume near-simultaneous image capture and limited appearance change, and therefore fail when applied to long-term monitoring scenarios such as coral reef surveys, where substantial visual and structural change is common. In this paper, we show that the primary limitation of current approaches lies in their reliance on post-hoc alignment of independently reconstructed sessions, which is insufficient under large temporal appearance change. We address this limitation by enforcing cross-session correspondences directly within a joint SfM reconstruction. Our approach combines complementary handcrafted and learned visual features to robustly establish correspondences across large temporal gaps, enabling the reconstruction of a single coherent 3D model from imagery captured years apart, where standard independent and joint SfM pipelines break down. We evaluate our method on long-term coral reef datasets exhibiting significant real-world change, and demonstrate consistent joint reconstruction across sessions in cases where existing methods fail to produce coherent reconstructions. To ensure scalability to large datasets, we further restrict expensive learned feature matching to a small set of likely cross-session image pairs identified via visual place recognition, which reduces computational cost and improves alignment robustness.
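The place-recognition filtering step can be sketched as ranking all cross-session image pairs by the similarity of global descriptors and keeping only the top candidates for expensive learned matching. The 2-D descriptors below are toys; real VPR descriptors are high-dimensional learned embeddings:

```python
def cosine(u, v):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def cross_session_pairs(descs_a, descs_b, top_k):
    """Rank all (i, j) image pairs across two sessions by descriptor
    similarity and keep the top_k, so costly learned feature matching
    runs only on likely cross-session overlaps."""
    scored = [(cosine(a, b), i, j)
              for i, a in enumerate(descs_a)
              for j, b in enumerate(descs_b)]
    scored.sort(reverse=True)
    return [(i, j) for _, i, j in scored[:top_k]]

session1 = [(1.0, 0.0), (0.0, 1.0)]   # toy global descriptors, visit 1
session2 = [(0.9, 0.1), (0.1, 0.9)]   # toy global descriptors, visit 2
pairs = cross_session_pairs(session1, session2, top_k=2)
```

Restricting matching to these shortlisted pairs is what keeps the joint SfM reconstruction tractable on large multi-year datasets.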

[126] Interaction-aware Representation Modeling with Co-occurrence Consistency for Egocentric Hand-Object Parsing

Yuejiao Su, Yi Wang, Lei Yao, Yawen Cui, Lap-Pui Chau

Main category: cs.CV

TL;DR: InterFormer: An interaction-aware transformer for parsing hands and active objects in egocentric videos, addressing query initialization limitations, interaction-irrelevant noise, and physical consistency issues.

DetailsMotivation: Need for fine-grained understanding of egocentric human-environment interactions for embodied agents, with specific challenges in accurately parsing hands and active objects. Existing transformer methods have limitations in query initialization adaptability, interaction-irrelevant content in embeddings, and physical consistency issues.

Method: Proposes InterFormer with three components: Dynamic Query Generator (DQG) for spatial hand-object contact-based query initialization, Dual-context Feature Selector (DFS) to fuse interactive cues with semantic features while suppressing noise, and Conditional Co-occurrence (CoCo) loss for physical consistency constraints.

Result: Achieves state-of-the-art performance on EgoHOS and challenging out-of-distribution mini-HOI4D datasets, demonstrating effectiveness and strong generalization ability.

Conclusion: InterFormer effectively addresses key limitations in hand-object parsing for egocentric vision, providing better query initialization, noise suppression, and physical consistency through its integrated components.

Abstract: A fine-grained understanding of egocentric human-environment interactions is crucial for developing next-generation embodied agents. One fundamental challenge in this area involves accurately parsing hands and active objects. While transformer-based architectures have demonstrated considerable potential for such tasks, several key limitations remain unaddressed: 1) existing query initialization mechanisms rely primarily on semantic cues or learnable parameters, demonstrating limited adaptability to changing active objects across varying input scenes; 2) previous transformer-based methods utilize pixel-level semantic features to iteratively refine queries during mask generation, which may introduce interaction-irrelevant content into the final embeddings; and 3) prevailing models are susceptible to “interaction illusion”, producing physically inconsistent predictions. To address these issues, we propose an end-to-end Interaction-aware Transformer (InterFormer), which integrates three key components, i.e., a Dynamic Query Generator (DQG), a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss. The DQG explicitly grounds query initialization in the spatial dynamics of hand-object contact, enabling targeted generation of interaction-aware queries for hands and various active objects. The DFS fuses coarse interactive cues with semantic features, thereby suppressing interaction-irrelevant noise and emphasizing the learning of interactive relationships. The CoCo loss incorporates hand-object relationship constraints to enhance physical consistency in prediction. Our model achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets, demonstrating its effectiveness and strong generalization ability. Code and models are publicly available at https://github.com/yuggiehk/InterFormer.
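One plausible form of the co-occurrence idea behind the CoCo loss, assuming it penalizes active-object confidence that is unsupported by hand confidence (the paper's exact formulation may differ):

```python
def coco_penalty(p_object, p_hand):
    """Illustrative co-occurrence penalty: predicting an active object
    with no accompanying hand is physically inconsistent, so penalize
    the excess of object confidence over hand confidence."""
    terms = [max(po - ph, 0.0) for po, ph in zip(p_object, p_hand)]
    return sum(terms) / len(terms)

consistent   = coco_penalty([0.9], [0.95])   # hand present: no penalty
inconsistent = coco_penalty([0.9], [0.10])   # "interaction illusion"
```

Such a constraint directly targets the "interaction illusion" failure mode, where an object is declared active without any plausible hand contact.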

[127] VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos

Aihua Mao, Kaihang Huang, Yong-Jin Liu, Chee Seng Chan, Ying He

Main category: cs.CV

TL;DR: VAGNet introduces video-guided 3D affordance grounding using dynamic interaction sequences to better identify contact regions for human-object interactions, outperforming static-based methods.

DetailsMotivation: Current 3D affordance grounding methods rely on static visual/textual cues, missing that affordances are defined by dynamic actions. Humans learn object use through observation and imitation, not just shape examination.

Method: Proposes VAGNet framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities. Introduces PVAD dataset - first HOI video-3D pairing affordance dataset providing functional supervision.

Result: Extensive experiments on PVAD show VAGNet achieves state-of-the-art performance, significantly outperforming static-based baselines.

Conclusion: Video-guided approach effectively grounds 3D affordances using dynamic interaction sequences, addressing limitations of static methods. Dataset and code will be publicly released.

Abstract: 3D object affordance grounding aims to identify regions on 3D objects that support human-object interaction (HOI), a capability essential to embodied visual reasoning. However, most existing approaches rely on static visual or textual cues, neglecting that affordances are inherently defined by dynamic actions. As a result, they often struggle to localize the true contact regions involved in real interactions. We take a different perspective. Humans learn how to use objects by observing and imitating actions, not just by examining shapes. Motivated by this intuition, we introduce video-guided 3D affordance grounding, which leverages dynamic interaction sequences to provide functional supervision. To achieve this, we propose VAGNet, a framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities that static cues cannot address. To support this new setting, we introduce PVAD, the first HOI video-3D pairing affordance dataset, providing functional supervision unavailable in prior works. Extensive experiments on PVAD show that VAGNet achieves state-of-the-art performance, significantly outperforming static-based baselines. The code and dataset will be open publicly.

[128] Knowing the Unknown: Interpretable Open-World Object Detection via Concept Decomposition Model

Xueqiang Lv, Shizhou Zhang, Yinghui Xing, Di Xu, Peng Wang, Yanning Zhang

Main category: cs.CV

TL;DR: IPOW is an interpretable open-world object detection framework that decomposes RoI features into discriminative, shared, and background concepts to improve unknown detection while providing interpretability.

DetailsMotivation: Existing open-world object detection methods focus on improving unknown recall but lack interpretability, leading to known-unknown confusion and reduced prediction reliability. The paper aims to make OWOD interpretable so detectors can truly "know the unknown."

Method: Proposes IPOW with a Concept Decomposition Model (CDM) that explicitly decomposes Faster R-CNN RoI features into three concepts: discriminative (for known-category separation), shared (generalizable features), and background. Uses Concept-Guided Rectification (CGR) to resolve known-unknown confusion, which arises when unknown objects fall into the discriminative space of known classes.

Result: Extensive experiments show IPOW significantly improves unknown recall while mitigating known-unknown confusion, and provides concept-level interpretability for both known and unknown predictions.

Conclusion: The interpretable framework enables better understanding of detection decisions, improves unknown object detection performance, and resolves confusion between known and unknown categories through concept decomposition and guided rectification.

Abstract: Open-world object detection (OWOD) requires incrementally detecting known categories while reliably identifying unknown objects. Existing methods primarily focus on improving unknown recall, yet overlook interpretability, often leading to known-unknown confusion and reduced prediction reliability. This paper aims to make the entire OWOD framework interpretable, enabling the detector to truly “know the unknown”. To this end, we propose a concept-driven InterPretable OWOD framework (IPOW) by introducing a Concept Decomposition Model (CDM) for OWOD, which explicitly decomposes the coupled RoI features in Faster R-CNN into discriminative, shared, and background concepts. Discriminative concepts identify the most discriminative features to enlarge the distances between known categories, while shared and background concepts, due to their strong generalization ability, can be readily transferred to detect unknown categories. Leveraging the interpretable framework, we identify that known-unknown confusion arises when unknown objects fall into the discriminative space of known classes. To address this, we propose Concept-Guided Rectification (CGR) to further resolve such confusion. Extensive experiments show that IPOW significantly improves unknown recall while mitigating confusion, and provides concept-level interpretability for both known and unknown predictions.
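The concept decomposition can be sketched as three parallel linear projections of a coupled RoI feature. The weights below are hand-picked toy values, whereas the paper learns these heads end-to-end inside the detector:

```python
def project(feat, weights):
    """Project a feature vector through one concept head (a plain
    linear map; each row of `weights` is one concept direction)."""
    return [sum(f * w for f, w in zip(feat, row)) for row in weights]

def decompose(roi_feat, w_disc, w_shared, w_bg):
    """Split a coupled RoI feature into discriminative, shared, and
    background concept activations via three separate projections."""
    return (project(roi_feat, w_disc),
            project(roi_feat, w_shared),
            project(roi_feat, w_bg))

feat = [1.0, 2.0, 3.0]                 # toy coupled RoI feature
w_disc   = [[1, 0, 0], [0, 1, 0]]      # 2 discriminative concepts
w_shared = [[0, 0, 1]]                 # 1 shared concept
w_bg     = [[1, 1, 1]]                 # 1 background concept
disc, shared, bg = decompose(feat, w_disc, w_shared, w_bg)
```

Keeping the three activation groups separate is what allows the shared and background concepts to transfer to unknowns while the discriminative ones stay reserved for known classes.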

[129] RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces

Haonan An, Xiaohui Ye, Guang Hua, Yihang Tao, Hangcheng Cao, Xiangyu Yu, Yuguang Fang

Main category: cs.CV

TL;DR: RecoverMark is a watermarking framework that uses face content as watermark embedded in background for robust manipulation localization, content recovery, and ownership verification against AI-generated face manipulation.

DetailsMotivation: Current fragile watermark methods for detecting AI-generated face manipulation are vulnerable to removal attacks and suffer from interference in dual-watermark strategies. There's a need for a robust solution that can simultaneously localize manipulations, recover content, and verify ownership.

Method: Uses protected face content itself as watermark embedded into surrounding background. Employs a robust two-stage training paradigm with distortion layers simulating comprehensive attacks and progressive training strategy for robust watermark embedding.

Result: Extensive experiments show robustness against both seen and unseen attacks, with generalizability to in-distribution and out-of-distribution data. Achieves simultaneous manipulation localization, content recovery, and IP protection.

Conclusion: RecoverMark addresses limitations of fragile watermarks by exploiting real-world constraints (background semantic consistency) and using image content as watermark, providing a robust solution for AI-generated face manipulation detection and recovery.

Abstract: The proliferation of AI-generated content has facilitated sophisticated face manipulation, severely undermining visual integrity and posing unprecedented challenges to intellectual property. In response, a common proactive defense leverages fragile watermarks to detect, localize, or even recover manipulated regions. However, these methods always assume an adversary unaware of the embedded watermark, overlooking their inherent vulnerability to watermark removal attacks. Furthermore, this fragility is exacerbated in the commonly used dual-watermark strategy that adds a robust watermark for image ownership verification, where mutual interference and limited embedding capacity reduce the fragile watermark’s effectiveness. To address this gap, we propose RecoverMark, a watermarking framework that achieves robust manipulation localization, content recovery, and ownership verification simultaneously. Our key insight is twofold. First, we exploit a critical real-world constraint: an adversary must preserve the background’s semantic consistency to avoid visual detection, even if they apply global, imperceptible watermark removal attacks. Second, using the image’s own content (face, in this paper) as the watermark enhances extraction robustness. Based on these insights, RecoverMark treats the protected face content itself as the watermark and embeds it into the surrounding background. By designing a robust two-stage training paradigm with carefully crafted distortion layers that simulate comprehensive potential attacks and a progressive training strategy, RecoverMark achieves robust, non-fragile watermark embedding for image manipulation localization, recovery, and image IP protection simultaneously. Extensive experiments demonstrate the proposed RecoverMark’s robustness against both seen and unseen attacks and its generalizability to in-distribution and out-of-distribution data.

[130] Object-Scene-Camera Decomposition and Recomposition for Data-Efficient Monocular 3D Object Detection

Zhaonian Kuang, Rui Ding, Meng Yang, Xinhu Zheng, Gang Hua

Main category: cs.CV

TL;DR: A data manipulation scheme that decomposes training images into 3D object models and background scenes, then recomposes them with varied camera poses to address data entanglement issues in monocular 3D object detection.

DetailsMotivation: Monocular 3D object detection suffers from tight entanglement between objects, scenes, and camera poses in training data, leading to insufficient data utilization and overfitting due to human bias in data collection.

Method: Proposes an online object-scene-camera decomposition and recomposition scheme: 1) Decompose training images into textured 3D object point models and background scenes efficiently, 2) Continuously recompose new training images each epoch by inserting 3D objects into scene freespace with perturbed camera poses.

Result: Method boosts M3OD models as plug-and-play component, works with both fully and sparsely supervised settings, validated on five representative M3OD models across KITTI and Waymo datasets.

Conclusion: The decomposition-recomposition scheme effectively addresses data entanglement, improves training data diversity, and enhances M3OD performance while offering flexibility in annotation cost control.

Abstract: Monocular 3D object detection (M3OD) is intrinsically ill-posed, hence training a high-performance deep learning based M3OD model requires an enormous amount of labeled data with complicated visual variation from diverse scenes, variety of objects and camera poses. However, we observe that, due to strong human bias, the three independent entities, i.e., object, scene, and camera pose, are always tightly entangled when an image is captured to construct training data. More specifically, specific 3D objects are always captured in particular scenes with fixed camera poses, and hence lack the necessary diversity. Such tight entanglement induces the challenging issues of insufficient utilization and overfitting to uniform training data. To mitigate this, we propose an online object-scene-camera decomposition and recomposition data manipulation scheme to more efficiently exploit the training data. We first fully decompose training images into textured 3D object point models and background scenes in a computation- and storage-efficient manner. We then continuously recompose new training images in each epoch by inserting the 3D objects into the freespace of the background scenes, and rendering them with perturbed camera poses from the textured 3D point representations. In this way, the refreshed training data in all epochs can cover the full spectrum of independent object, scene, and camera pose combinations. This scheme can serve as a plug-and-play component to boost M3OD models, working flexibly with both fully and sparsely supervised settings. In the sparsely-supervised setting, objects closest to the ego-camera for all instances are sparsely annotated. We can then flexibly increase the annotated objects to control annotation cost. For validation, our method is widely applied to five representative M3OD models and evaluated on both the KITTI and the more complicated Waymo datasets.

[131] From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

Yepeng Liu, Hao Li, Liwen Yang, Fangzhen Li, Xudi Ge, Yuliang Gu, Kuang Gao, Bing Wang, Guang Chen, Hangjun Ye, Yongchao Xu

Main category: cs.CV

TL;DR: TraqPoint is an RL framework that reframes keypoint detection as sequential decision-making to optimize long-term trackability across image sequences, outperforming SOTA methods on sparse matching benchmarks.

DetailsMotivation: Existing learning-based keypoint detection methods are trained on image pairs, which fails to optimize for long-term trackability across sequences under challenging viewpoint and illumination changes. This limitation affects performance in 3D vision systems like SfM and SLAM.

Method: TraqPoint uses Reinforcement Learning with a track-aware reward mechanism that jointly encourages consistency and distinctiveness of keypoints across multiple views. It frames keypoint detection as a sequential decision-making problem and uses policy gradient methods.

Result: TraqPoint significantly outperforms state-of-the-art keypoint detection and description methods on sparse matching benchmarks, including relative pose estimation and 3D reconstruction tasks.

Conclusion: Reframing keypoint detection as sequential decision-making with RL and track-aware rewards effectively optimizes for long-term trackability, leading to superior performance in 3D vision applications.

Abstract: Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art (SOTA) keypoint detection and description methods.
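
The track-aware policy-gradient idea can be pictured with a toy example. Everything below is illustrative, not the paper's implementation: the candidate set, the survival-based reward, and the use of the exact softmax-policy gradient (rather than sampled REINFORCE) are all assumptions. The point is that a keypoint is reinforced in proportion to how consistently it survives across a whole sequence, not just a single pair.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 10 candidate keypoints tracked over 5 frames;
# detections[f, k] = 1 if candidate k is re-detected in frame f.
detections = rng.integers(0, 2, size=(5, 10)).astype(float)

# Track-aware reward: fraction of frames in which a candidate survives.
rewards = detections.mean(axis=0)          # shape (10,)

# Exact policy-gradient ascent on a softmax policy over candidates:
#   dJ/dlogits_a = p_a * (r_a - E_p[r])
logits = np.zeros(10)
lr = 0.5
for _ in range(500):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    logits += lr * p * (rewards - p @ rewards)

# The policy concentrates on a candidate with maximal sequence-long survival.
best = int(np.argmax(logits))
```

A pairwise objective would only see adjacent frames; the reward here scores each candidate over the whole sequence, which is the property the paper's track-quality reward is designed to capture.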

[132] Boosting Instance Awareness via Cross-View Correlation with 4D Radar and Camera for 3D Object Detection

Xiaokai Bai, Lianqing Zheng, Si-Yuan Cao, Xiaohan Zhang, Zhe Wu, Beinan Yu, Fang Wang, Jie Bai, Hui-Liang Shen

Main category: cs.CV

TL;DR: SIFormer is a scene-instance aware transformer for 3D object detection using 4D radar and camera fusion, addressing sparse radar geometry through cross-view activation and transformer-based fusion.

DetailsMotivation: 4D radar is promising for autonomous driving but has sparse geometric cues, making instance activation difficult. Existing radar-camera fusion paradigms have limitations: BEV-level fusion lacks instance focus, while perspective-level fusion lacks holistic context.

Method: SIFormer uses segmentation- and depth-guided localization to suppress background noise during view transformation, introduces cross-view activation to inject 2D instance cues into BEV space, and employs transformer-based fusion to aggregate complementary image semantics and radar geometry.

Result: Achieves state-of-the-art performance on View-of-Delft, TJ4DRadSet and NuScenes datasets.

Conclusion: SIFormer bridges the gap between BEV-level and perspective-level fusion paradigms, enhancing instance awareness for robust 3D object detection with sparse radar data.

Abstract: 4D millimeter-wave radar has emerged as a promising sensing modality for autonomous driving due to its robustness and affordability. However, its sparse and weak geometric cues make reliable instance activation difficult, limiting the effectiveness of existing radar-camera fusion paradigms. BEV-level fusion offers global scene understanding but suffers from weak instance focus, while perspective-level fusion captures instance details but lacks holistic context. To address these limitations, we propose SIFormer, a scene-instance aware transformer for 3D object detection using 4D radar and camera. SIFormer first suppresses background noise during view transformation through segmentation- and depth-guided localization. It then introduces a cross-view activation mechanism that injects 2D instance cues into BEV space, enabling reliable instance awareness under weak radar geometry. Finally, a transformer-based fusion module aggregates complementary image semantics and radar geometry for robust perception. As a result, with the aim of enhancing instance awareness, SIFormer bridges the gap between the two paradigms, combining their complementary strengths to address the inherently sparse nature of radar and improve detection accuracy. Experiments demonstrate that SIFormer achieves state-of-the-art performance on the View-of-Delft, TJ4DRadSet, and NuScenes datasets. Source code is available at github.com/shawnnnkb/SIFormer.

[133] SurgAtt-Tracker: Online Surgical Attention Tracking via Temporal Proposal Reranking and Motion-Aware Refinement

Rulin Zhou, Guankun Wang, An Wang, Yujie Ma, Lixin Ouyang, Bolin Cui, Junyan Li, Chaowei Zhu, Mingyang Li, Ming Chen, Xiaopin Zhong, Peng Lu, Jiankun Wang, Xianming Liu, Hongliang Ren

Main category: cs.CV

TL;DR: SurgAtt-Tracker: A framework for tracking surgical attention as dense heatmaps using temporal coherence and motion-aware refinement, with a large-scale benchmark dataset for training and evaluation.

DetailsMotivation: Existing surgical field-of-view guidance approaches often conflate visual attention estimation with camera control or rely on direct object-centric assumptions. Accurate and stable FoV guidance is critical for safe and efficient minimally invasive surgery.

Method: Formulates surgical attention tracking as spatio-temporal learning problem, modeling surgeon focus as dense attention heatmaps. Uses proposal-level reranking and motion-aware refinement to exploit temporal coherence rather than direct regression. Introduces SurgAtt-1.16M benchmark dataset with clinically grounded annotation protocol.

Result: Extensive experiments on multiple surgical datasets show SurgAtt-Tracker achieves state-of-the-art performance with strong robustness under occlusion, multi-instrument interference, and cross-domain settings.

Conclusion: The approach provides frame-wise FoV guidance signals that can directly support downstream robotic FoV planning and automatic camera control, enabling continuous and interpretable surgical attention tracking.

Abstract: Accurate and stable field-of-view (FoV) guidance is critical for safe and efficient minimally invasive surgery, yet existing approaches often conflate visual attention estimation with downstream camera control or rely on direct object-centric assumptions. In this work, we formulate surgical attention tracking as a spatio-temporal learning problem and model surgeon focus as a dense attention heatmap, enabling continuous and interpretable frame-wise FoV guidance. We propose SurgAtt-Tracker, a holistic framework that robustly tracks surgical attention by exploiting temporal coherence through proposal-level reranking and motion-aware refinement, rather than direct regression. To support systematic training and evaluation, we introduce SurgAtt-1.16M, a large-scale benchmark with a clinically grounded annotation protocol that enables comprehensive heatmap-based attention analysis across procedures and institutions. Extensive experiments on multiple surgical datasets demonstrate that SurgAtt-Tracker consistently achieves state-of-the-art performance and strong robustness under occlusion, multi-instrument interference, and cross-domain settings. Beyond attention tracking, our approach provides a frame-wise FoV guidance signal that can directly support downstream robotic FoV planning and automatic camera control.

[134] VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation

Seongheon Park, Changdae Oh, Hyeong Kyu Choi, Xuefeng Du, Sharon Li

Main category: cs.CV

TL;DR: VAUQ is a vision-aware uncertainty quantification framework for Large Vision-Language Models that measures how strongly model outputs depend on visual evidence to detect hallucinations, outperforming existing self-evaluation methods.

DetailsMotivation: Large Vision-Language Models frequently hallucinate, limiting safe deployment. Existing LLM self-evaluation methods rely heavily on language priors and are ill-suited for evaluating vision-conditioned predictions, creating a need for vision-aware uncertainty quantification.

Method: Proposes VAUQ framework with Image-Information Score (IS) that captures reduction in predictive uncertainty attributable to visual input, plus unsupervised core-region masking strategy that amplifies influence of salient regions. Combines predictive entropy with core-masked IS for training-free scoring.

Result: Comprehensive experiments show VAUQ consistently outperforms existing self-evaluation methods across multiple datasets, demonstrating reliable reflection of answer correctness.

Conclusion: VAUQ provides an effective vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures visual evidence dependence, improving reliability for real-world deployment.

Abstract: Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model’s ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model’s output depends on visual evidence. VAUQ introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to visual input, and an unsupervised core-region masking strategy that amplifies the influence of salient regions. Combining predictive entropy with this core-masked IS yields a training-free scoring function that reliably reflects answer correctness. Comprehensive experiments show that VAUQ consistently outperforms existing self-evaluation methods across multiple datasets.
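
The Image-Information Score can be illustrated with a minimal entropy calculation. The answer distributions below are made up, and the combined score is only a sketch of the paper's idea (low predictive entropy plus high image dependence indicates a trustworthy answer); the paper's core-region masking and exact weighting are not reproduced here.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (nats)."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + 1e-12)).sum())

# Hypothetical answer distributions over 4 options for one question.
p_text_only  = np.array([0.30, 0.30, 0.25, 0.15])  # language prior alone
p_with_image = np.array([0.85, 0.05, 0.05, 0.05])  # conditioned on the image

# Image-Information Score: reduction in predictive uncertainty that is
# attributable to the visual input.
IS = entropy(p_text_only) - entropy(p_with_image)

# Sketch of a combined, training-free score (sign convention assumed):
# reward image dependence, penalize residual predictive entropy.
score = IS - entropy(p_with_image)
```

If the model answers the same way with and without the image (IS near zero), the answer is driven by language priors, which is exactly the failure mode vision-aware self-evaluation is meant to flag.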

[135] Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression

Chenyue Yu, Lingao Xiao, Jinhong Deng, Ivor W. Tsang, Yang He

Main category: cs.CV

TL;DR: DCQ compresses image datasets by reducing color-space redundancy while preserving training-critical information through consistent palettes, semantic color retention, and structural detail maintenance.

DetailsMotivation: Large image datasets have high storage demands that challenge deployment in resource-constrained environments. Existing approaches reduce dataset size by discarding samples but ignore significant redundancy within each image, particularly in color space.

Method: Dataset Color Quantization (DCQ) compresses visual datasets by: 1) enforcing consistent palette representations across similar images, 2) selectively retaining semantically important colors guided by model perception, and 3) maintaining structural details necessary for effective feature learning.

Result: Extensive experiments across CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show DCQ significantly improves training performance under aggressive compression, offering scalable and robust dataset-level storage reduction.

Conclusion: DCQ provides an effective framework for compressing visual datasets by addressing color-space redundancy while preserving information crucial for model training, enabling deployment in resource-constrained environments.

Abstract: Large-scale image datasets are fundamental to deep learning, but their high storage demands pose challenges for deployment in resource-constrained environments. While existing approaches reduce dataset size by discarding samples, they often ignore the significant redundancy within each image – particularly in the color space. To address this, we propose Dataset Color Quantization (DCQ), a unified framework that compresses visual datasets by reducing color-space redundancy while preserving information crucial for model training. DCQ achieves this by enforcing consistent palette representations across similar images, selectively retaining semantically important colors guided by model perception, and maintaining structural details necessary for effective feature learning. Extensive experiments across CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that DCQ significantly improves training performance under aggressive compression, offering a scalable and robust solution for dataset-level storage reduction. Code is available at https://github.com/he-y/Dataset-Color-Quantization.
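
The "consistent palette" idea can be sketched with plain k-means as a stand-in for DCQ's model-guided color selection. The grouping of "similar images", the palette size, and the random data below are all illustrative; the paper's perception-guided retention of semantically important colors is not reproduced.

```python
import numpy as np

def fit_palette(pixels, k=8, iters=10, seed=0):
    """Plain k-means in RGB space (illustrative stand-in for DCQ's
    model-guided palette selection)."""
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), k, replace=False)].astype(float)
    for _ in range(iters):
        d = ((pixels[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = pixels[labels == c].mean(0)
    return centers

def quantize(img, palette):
    """Store each pixel as an index into a shared palette."""
    flat = img.reshape(-1, 3)
    d = ((flat[:, None, :] - palette[None]) ** 2).sum(-1)
    idx = d.argmin(1)
    return idx.reshape(img.shape[:2]), palette

rng = np.random.default_rng(1)
group = [rng.integers(0, 256, size=(16, 16, 3)) for _ in range(4)]

# Consistent palette: fit once on the pooled pixels of a group of similar
# images, then reuse the same palette for every image in the group.
pooled = np.concatenate([im.reshape(-1, 3) for im in group])
palette = fit_palette(pooled, k=8)
indices, pal = quantize(group[0], palette)
```

Storing one-byte indices plus a single shared 8-color palette cuts per-pixel storage roughly 3x versus raw RGB, which is the kind of color-space redundancy the framework targets.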

[136] SD4R: Sparse-to-Dense Learning for 3D Object Detection with 4D Radar

Xiaokai Bai, Jiahao Cheng, Songkai Wang, Yixuan Luo, Lianqing Zheng, Xiaohan Zhang, Si-Yuan Cao, Hui-Liang Shen

Main category: cs.CV

TL;DR: SD4R is a novel framework for densifying sparse 4D radar point clouds to improve 3D object detection by addressing extreme sparsity and noise through foreground point generation and enhanced feature representation.

DetailsMotivation: 4D radar provides affordable, weather-robust 3D perception but suffers from extreme sparsity and noise in point clouds, making accurate object detection challenging. Existing densification methods fail to adequately handle these extreme sparsity conditions and lack robustness with very few points.

Method: SD4R uses a two-stage approach: 1) Foreground Point Generator (FPG) to mitigate noise propagation and produce densified point clouds, and 2) Logit-Query Encoder (LQE) to enhance conventional pillarization for robust feature representations.

Result: Extensive experiments on the View-of-Delft dataset show SD4R achieves state-of-the-art performance, demonstrating strong capabilities in both noise reduction and foreground point densification.

Conclusion: SD4R effectively transforms sparse radar point clouds into dense representations, addressing key challenges in 4D radar perception and enabling improved 3D object detection performance.

Abstract: 4D radar measurements offer an affordable and weather-robust solution for 3D perception. However, the inherent sparsity and noise of radar point clouds present significant challenges for accurate 3D object detection, underscoring the need for effective and robust point cloud densification. Despite recent progress, existing densification methods often fail to address the extreme sparsity of 4D radar point clouds and exhibit limited robustness when processing scenes with a small number of points. In this paper, we propose SD4R, a novel framework that transforms sparse radar point clouds into dense representations. SD4R begins by utilizing a foreground point generator (FPG) to mitigate noise propagation and produce densified point clouds. Subsequently, a logit-query encoder (LQE) enhances conventional pillarization, resulting in robust feature representations. Through these innovations, our SD4R demonstrates strong capability in both noise reduction and foreground point densification. Extensive experiments conducted on the publicly available View-of-Delft dataset demonstrate that SD4R achieves state-of-the-art performance. Source code is available at https://github.com/lancelot0805/SD4R.

[137] Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video

Mohammad Sadra Rajabi, Aanuoluwapo Ojelade, Sunwook Kim, Maury A. Nussbaum

Main category: cs.CV

TL;DR: Vision-language models can estimate lifting task distances from RGB video with 5-8 cm accuracy using text-guided detection and segmentation pipelines.

DetailsMotivation: Manual lifting tasks cause work-related musculoskeletal disorders, and current ergonomic risk assessment tools like RNLE require manual measurement of distances that are difficult to obtain in real-world environments.

Method: Two VLM-based pipelines: text-guided detection-only and detection-plus-segmentation. Both use text-guided localization of task-relevant regions, visual feature extraction, and transformer-based temporal regression to estimate horizontal and vertical distances at lift start/end points.

Result: Segmentation-based multi-view pipeline achieved best performance with mean absolute errors of 6-8 cm for horizontal distance and 5-8 cm for vertical distance. Segmentation reduced errors by 20-40% compared to detection-only approach.

Conclusion: VLM-based pipelines are feasible for video-based estimation of ergonomic lifting parameters, with segmentation providing significant accuracy improvements over detection-only methods.

Abstract: Manual lifting tasks are a major contributor to work-related musculoskeletal disorders, and effective ergonomic risk assessment is essential for quantifying physical exposure and informing ergonomic interventions. The Revised NIOSH Lifting Equation (RNLE) is a widely used ergonomic risk assessment tool for lifting tasks that relies on six task variables, including horizontal (H) and vertical (V) hand distances; such distances are typically obtained through manual measurement or specialized sensing systems and are difficult to use in real-world environments. We evaluated the feasibility of using innovative vision-language models (VLMs) to non-invasively estimate H and V from RGB video streams. Two multi-stage VLM-based pipelines were developed: a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. Both pipelines used text-guided localization of task-relevant regions of interest, visual feature extraction from those regions, and transformer-based temporal regression to estimate H and V at the start and end of a lift. For a range of lifting tasks, estimation performance was evaluated using leave-one-subject-out validation across the two pipelines and seven camera view conditions. Results varied significantly across pipelines and camera view conditions, with the segmentation-based, multi-view pipeline consistently yielding the smallest errors, achieving mean absolute errors of approximately 6-8 cm when estimating H and 5-8 cm when estimating V. Across pipelines and camera view configurations, pixel-level segmentation reduced estimation error by approximately 20-30% for H and 35-40% for V relative to the detection-only pipeline. These findings support the feasibility of VLM-based pipelines for video-based estimation of RNLE distance parameters.

[138] Interpretable Medical Image Classification using Prototype Learning and Privileged Information

Luisa Gallee, Meinrad Beer, Michael Goetz

Main category: cs.CV

TL;DR: Proto-Caps: An interpretable medical imaging model combining capsule networks, prototype learning, and privileged information for lung nodule analysis with state-of-the-art performance and visual explainability.

DetailsMotivation: Interpretability is essential in medical imaging, requiring deep learning methods that provide both explainability and high performance. The paper investigates whether additional information available during training can be used to create understandable yet powerful models.

Method: Proposes Proto-Caps, an innovative solution that leverages capsule networks, prototype learning, and privileged information. The method combines these techniques to create an interpretable model that provides case-based reasoning with prototype representations.

Result: Evaluation on LIDC-IDRI dataset shows above state-of-the-art prediction performance with 93.0% accuracy in predicting malignancy (6% higher than baseline) and mean characteristic features of lung nodules. The model provides visual validation of radiologist-defined attributes through prototype representations.

Conclusion: Proto-Caps successfully combines increased interpretability with superior prediction performance, addressing the essential need for explainable AI in medical imaging while maintaining competitive accuracy.

Abstract: Interpretability is often an essential requirement in medical imaging. Advanced deep learning methods are required to address this need for explainability and high performance. In this work, we investigate whether additional information available during the training process can be used to create an understandable and powerful model. We propose an innovative solution called Proto-Caps that leverages the benefits of capsule networks, prototype learning and the use of privileged information. Evaluating the proposed solution on the LIDC-IDRI dataset shows that it combines increased interpretability with above state-of-the-art prediction performance. Compared to the explainable baseline model, our method achieves more than 6% higher accuracy in predicting both malignancy (93.0%) and mean characteristic features of lung nodules. Simultaneously, the model provides case-based reasoning with prototype representations that allow visual validation of radiologist-defined attributes.

[139] AnimeAgent: Is the Multi-Agent via Image-to-Video models a Good Disney Storytelling Artist?

Hailong Yan, Shice Liu, Tao Wang, Xiangtao Zhang, Yijie Zhong, Jinwei Chen, Le Zhang, Bo Li

Main category: cs.CV

TL;DR: AnimeAgent: First I2V-based multi-agent framework for Custom Storyboard Generation that addresses limitations of static diffusion models through iterative refinement and implicit motion priors.

DetailsMotivation: Current approaches for Custom Storyboard Generation using static diffusion models have three key limitations: lack of dynamic expressiveness (leading to "copy-paste" patterns), inability to iteratively correct errors in one-shot inference, and reliance on non-robust evaluators in multi-agent frameworks that are ill-suited for stylized animation.

Method: Proposes AnimeAgent, the first Image-to-Video (I2V)-based multi-agent framework inspired by Disney’s “Combination of Straight Ahead and Pose to Pose” workflow. Leverages I2V’s implicit motion prior to enhance consistency and expressiveness, and uses a mixed subjective-objective reviewer for reliable iterative refinement.

Result: AnimeAgent achieves state-of-the-art performance in consistency, prompt fidelity, and stylization. The authors also collected a human-annotated CSG benchmark with ground-truth for evaluation.

Conclusion: The proposed I2V-based multi-agent framework effectively addresses the limitations of static diffusion models for storyboard generation, providing better dynamic expressiveness, iterative refinement capability, and reliable evaluation for stylized animation.

Abstract: Custom Storyboard Generation (CSG) aims to produce high-quality, multi-character consistent storytelling. Current approaches based on static diffusion models, whether used in a one-shot manner or within multi-agent frameworks, face three key limitations: (1) Static models lack dynamic expressiveness and often resort to a “copy-paste” pattern. (2) One-shot inference cannot iteratively correct missing attributes or poor prompt adherence. (3) Multi-agent frameworks rely on non-robust evaluators, ill-suited for assessing stylized, non-realistic animation. To address these, we propose AnimeAgent, the first Image-to-Video (I2V)-based multi-agent framework for CSG. Inspired by Disney’s “Combination of Straight Ahead and Pose to Pose” workflow, AnimeAgent leverages I2V’s implicit motion prior to enhance consistency and expressiveness, while a mixed subjective-objective reviewer enables reliable iterative refinement. We also collect a human-annotated CSG benchmark with ground-truth. Experiments show AnimeAgent achieves SOTA performance in consistency, prompt fidelity, and stylization.

[140] BoxSplitGen: A Generative Model for 3D Part Bounding Boxes in Varying Granularity

Juil Koo, Wei-Tung Lin, Chanho Park, Chanhyeok Park, Minhyuk Sung

Main category: cs.CV

TL;DR: A framework for interactive 3D shape generation using iterative bounding box splitting, with two generative models: BoxSplitGen for hierarchical part decomposition and a box-to-shape model for generating detailed shapes from bounding boxes.

DetailsMotivation: Human creativity follows a perceptual process from abstract to detailed, but existing 3D generative models lack interactive tools for assisting human imagination through coarse-to-fine refinement. The paper aims to create a framework that enables intuitive, interactive 3D shape generation by iteratively refining bounding boxes.

Method: Proposes a two-model framework: 1) BoxSplitGen learns to generate hierarchical 3D part bounding boxes by iteratively splitting coarse boxes, using agglomerative merging and learning reverse splitting sequences; 2) A box-to-shape generative model adapts existing 3D diffusion models to incorporate bounding box conditioning for detailed shape generation.

Result: BoxSplitGen outperforms token prediction models and unconditional diffusion inpainting approaches. The box-to-shape model, based on state-of-the-art 3D diffusion, provides superior results compared to previous models.

Conclusion: The framework enables intuitive, interactive 3D shape generation through hierarchical refinement, successfully bridging the gap between abstract human imagination and detailed 3D creation with superior performance over existing approaches.

Abstract: Human creativity follows a perceptual process, moving from abstract ideas to finer details during creation. While 3D generative models have advanced dramatically, models specifically designed to assist human imagination in 3D creation – particularly for detailing abstractions from coarse to fine – have not been explored. We propose a framework that enables intuitive and interactive 3D shape generation by iteratively splitting bounding boxes to refine the set of bounding boxes. The main technical components of our framework are two generative models: the box-splitting generative model and the box-to-shape generative model. The first model, named BoxSplitGen, generates a collection of 3D part bounding boxes with varying granularity by iteratively splitting coarse bounding boxes. It utilizes part bounding boxes created through agglomerative merging and learns the reverse of the merging process – the splitting sequences. The model consists of two main components: the first learns the categorical distribution of the box to be split, and the second learns the distribution of the two new boxes, given the set of boxes and the indication of which box to split. The second model, the box-to-shape generative model, is trained by leveraging the 3D shape priors learned by an existing 3D diffusion model while adapting the model to incorporate bounding box conditioning. In our experiments, we demonstrate that the box-splitting generative model outperforms token prediction models and the inpainting approach with an unconditional diffusion model. Also, we show that our box-to-shape model, based on a state-of-the-art 3D diffusion model, provides superior results compared to a previous model.
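
The split operation at the core of the coarse-to-fine loop is easy to picture. The heuristic below (cut the longest axis at a fixed fraction) is only a placeholder: BoxSplitGen learns both the categorical distribution over which box to split and the distribution of the two resulting children.

```python
from dataclasses import dataclass

@dataclass
class Box:
    lo: tuple  # (x, y, z) min corner
    hi: tuple  # (x, y, z) max corner

def split(box, axis=None, t=0.5):
    """Split one part bounding box into two along `axis` at fraction `t`.
    Illustrative inverse of agglomerative merging: merging two sibling
    boxes recovers the parent."""
    sizes = [h - l for l, h in zip(box.lo, box.hi)]
    if axis is None:
        axis = sizes.index(max(sizes))    # heuristic: cut the longest extent
    cut = box.lo[axis] + t * sizes[axis]
    hi_a = list(box.hi); hi_a[axis] = cut
    lo_b = list(box.lo); lo_b[axis] = cut
    return Box(box.lo, tuple(hi_a)), Box(tuple(lo_b), box.hi)

# Coarse-to-fine: start from one coarse box and refine iteratively,
# mirroring the user-driven granularity refinement in the paper.
root = Box((0, 0, 0), (2, 1, 1))
a, b = split(root)                         # cuts the long x-axis at x = 1.0
```

Training on the reverse of agglomerative merging gives the model a supervised sequence of exactly such splits, from a single coarse box down to fine part granularity.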

[141] BBQ-to-Image: Numeric Bounding Box and Qolor Control in Large-Scale Text-to-Image Models

Eliran Kachlon, Alexander Visheratin, Nimrod Sarid, Tal Hacham, Eyal Gutflaish, Saar Huberman, Hezi Zisman, David Ruppin, Ron Mokady

Main category: cs.CV

TL;DR: BBQ is a text-to-image model that adds precise numeric control over object location (bounding boxes) and color (RGB triplets) through structured text conditioning, enabling professional-grade spatial and chromatic control without architectural changes.

Motivation: Existing text-to-image models rely on descriptive language, which creates a "parametric gap": professional workflows require precise numeric control over object location, size, and color that current models cannot provide through text alone.

Method: Train a large-scale text-to-image model that directly conditions on numeric bounding boxes and RGB triplets within a unified structured-text framework, using captions enriched with parametric annotations without architectural modifications or inference-time optimization.
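The structured-text idea can be pictured as a caption carrying machine-readable parameters alongside prose. A minimal sketch, in which the `<obj ...>` tag syntax, the `structured_caption` helper, and the normalized box coordinates are all hypothetical (BBQ's actual annotation format is not spelled out in this summary):

```python
def structured_caption(prompt, objects):
    """Fold numeric bounding boxes and RGB triplets into a plain-text caption."""
    parts = [prompt]
    for name, (x0, y0, x1, y1), (r, g, b) in objects:
        # box coordinates are assumed normalized to [0, 1]; colors are 8-bit RGB
        parts.append(f"<obj name={name} box=({x0},{y0},{x1},{y1}) rgb=({r},{g},{b})>")
    return " ".join(parts)

caption = structured_caption(
    "a living room with a sofa and a lamp",
    [("sofa", (0.1, 0.55, 0.7, 0.95), (178, 34, 34)),
     ("lamp", (0.75, 0.2, 0.9, 0.6), (255, 215, 0))],
)
print(caption)
```

Because the controls live in the caption itself, a UI action like dragging a box or picking a color only has to rewrite these tags, with no change to the model.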

Result: BBQ achieves strong box alignment and improves RGB color fidelity over state-of-the-art baselines, enabling intuitive user interfaces like object dragging and color pickers for precise control.

Conclusion: The work supports a new paradigm where user intent is translated into intermediate structured language consumed by a flow-based transformer acting as a renderer, naturally accommodating numeric parameters for professional-grade image generation.

Abstract: Text-to-image models have rapidly advanced in realism and controllability, with recent approaches leveraging long, detailed captions to support fine-grained generation. However, a fundamental parametric gap remains: existing models rely on descriptive language, whereas professional workflows require precise numeric control over object location, size, and color. In this work, we introduce BBQ, a large-scale text-to-image model that directly conditions on numeric bounding boxes and RGB triplets within a unified structured-text framework. We obtain precise spatial and chromatic control by training on captions enriched with parametric annotations, without architectural modifications or inference-time optimization. This also enables intuitive user interfaces such as object dragging and color pickers, replacing ambiguous iterative prompting with precise, familiar controls. Across comprehensive evaluations, BBQ achieves strong box alignment and improves RGB color fidelity over state-of-the-art baselines. More broadly, our results support a new paradigm in which user intent is translated into an intermediate structured language, consumed by a flow-based transformer acting as a renderer and naturally accommodating numeric parameters.

[142] GA-Drive: Geometry-Appearance Decoupled Modeling for Free-viewpoint Driving Scene Generation

Hao Zhang, Lue Fan, Qitai Wang, Wenbo Li, Zehuan Wu, Lewei Lu, Zhaoxiang Zhang, Hongsheng Li

Main category: cs.CV

TL;DR: GA-Drive is a novel driving simulator framework that generates photorealistic novel camera views along user-specified trajectories using geometry-appearance decoupling and diffusion-based generation.

Motivation: A free-viewpoint, editable, and high-fidelity driving simulator is crucial for training and evaluating end-to-end autonomous driving systems, but existing methods have limitations in quality and editability.

Method: Uses geometry-appearance decoupling: first synthesizes novel pseudo-views using scene geometry, then transforms them into photorealistic views using a trained video diffusion model. Supports appearance editing via video-to-video editing techniques while preserving underlying geometry.

Result: Extensive experiments show GA-Drive substantially outperforms existing methods in terms of NTA-IoU, NTL-IoU, and FID scores.

Conclusion: GA-Drive provides a powerful simulation framework for autonomous driving with high-fidelity novel view synthesis and appearance editing capabilities.

Abstract: A free-viewpoint, editable, and high-fidelity driving simulator is crucial for training and evaluating end-to-end autonomous driving systems. In this paper, we present GA-Drive, a novel simulation framework capable of generating camera views along user-specified novel trajectories through Geometry-Appearance Decoupling and Diffusion-Based Generation. Given a set of images captured along a recorded trajectory and the corresponding scene geometry, GA-Drive synthesizes novel pseudo-views using geometry information. These pseudo-views are then transformed into photorealistic views using a trained video diffusion model. In this way, we decouple the geometry and appearance of scenes. An advantage of such decoupling is its support for appearance editing via state-of-the-art video-to-video editing techniques, while preserving the underlying geometry, enabling consistent edits across both original and novel trajectories. Extensive experiments demonstrate that GA-Drive substantially outperforms existing methods in terms of NTA-IoU, NTL-IoU, and FID scores.

[143] RAYNOVA: 3D-Geometry-Free Auto-Regressive Driving World Modeling with Unified Spatio-Temporal Representation

Yichen Xie, Chensheng Peng, Mazen Abdelfattah, Yihan Hu, Jiezhi Yang, Eric Higgins, Ryan Brigden, Masayoshi Tomizuka, Wei Zhan

Main category: cs.CV

TL;DR: RAYNOVA is a geometry-free world model using dual-causal autoregressive framework for 4D spatio-temporal reasoning, achieving SOTA multi-view video generation without explicit 3D scene representation.

Motivation: Existing world foundation models handle spatial and temporal correlations separately and rely on strong 3D geometric priors, limiting generalization to diverse camera setups and ego motions.

Method: Proposes a geometry-free world model with dual-causal autoregressive framework following scale-wise and temporal topological orders, using global attention for unified 4D reasoning and relative Plücker-ray positional encoding for isotropic spatio-temporal representation.
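The Plücker-ray encoding the method builds on admits a compact sketch: a ray with origin o and unit direction d maps to the 6-vector (d, o × d), which identifies the ray itself rather than any particular point on it. A minimal numpy illustration (RAYNOVA's relative, multi-view formulation is not reproduced here):

```python
import numpy as np

def plucker_rays(origins, directions):
    """Plücker embedding (d, o x d) of rays; invariant to sliding the
    origin along the ray, so it encodes the ray, not a point."""
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    m = np.cross(origins, d)                 # moment vector o x d
    return np.concatenate([d, m], axis=-1)   # shape (..., 6)

o = np.array([[1.0, 0.0, 0.0]])
d = np.array([[0.0, 0.0, 1.0]])
emb = plucker_rays(o, d)
print(emb)  # [[ 0.  0.  1.  0. -1.  0.]]
# sliding the origin along the ray leaves the embedding unchanged
print(np.allclose(plucker_rays(o + 3.0 * d, d), emb))  # True
```

This invariance is what makes the encoding isotropic across views and frames: two cameras observing the same ray produce the same 6-vector regardless of their positions along it.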

Result: Achieves state-of-the-art multi-view video generation on nuScenes, with higher throughput and strong controllability, generalizing to novel views and camera configurations without explicit 3D scene representation.

Conclusion: RAYNOVA demonstrates effective world modeling without geometric priors, enabling robust generalization across diverse camera setups and motions while maintaining high-quality video generation.

Abstract: World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-free world model that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RAYNOVA achieves state-of-the-art multi-view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without explicit 3D scene representation. Our code will be released at http://yichen928.github.io/raynova.

[144] MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision

Bedrettin Cetinkaya, Sinan Kalkan, Emre Akbas

Main category: cs.CV

TL;DR: A lightweight plug-and-play module for end-to-end crisp edge detection that eliminates the need for non-differentiable post-processing like NMS and thinning.

Motivation: Existing edge detection methods rely on non-differentiable post-processing (NMS and skeleton-based thinning) to achieve crisp edges, which prevents end-to-end optimization and limits performance.

Method: Proposes a lightweight matching-based supervision module (~21K parameters) that performs one-to-one matching between predicted and ground-truth edges based on spatial distance and confidence during training, enabling joint end-to-end learning of crisp edges.
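The matching step can be sketched as a one-to-one assignment problem whose cost trades off spatial distance against prediction confidence. The cost form below is an assumption for illustration, not the paper's exact objective, and brute-force permutation search stands in for a proper assignment solver:

```python
import numpy as np
from itertools import permutations

def match_edges(pred_xy, pred_conf, gt_xy, alpha=1.0):
    """One-to-one matching of predicted to ground-truth edge pixels: cost
    favors spatially close, high-confidence predictions. Brute-force over
    permutations for clarity; a Hungarian solver would be used at scale."""
    dist = np.linalg.norm(pred_xy[:, None, :] - gt_xy[None, :, :], axis=-1)
    cost = dist - alpha * pred_conf[:, None]
    n = len(pred_xy)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return list(enumerate(best))

pred = np.array([[0.0, 0.0], [5.0, 5.0]])   # predicted edge pixels
conf = np.array([0.9, 0.2])                 # their confidences
gt = np.array([[0.1, 0.0], [5.0, 4.8]])     # ground-truth edge pixels
print(match_edges(pred, conf, gt))          # [(0, 0), (1, 1)]
```

Because each prediction is paired with at most one ground-truth pixel, the supervision itself rewards one-pixel-wide responses, which is why post-hoc NMS and thinning become unnecessary.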

Result: Improves edge detection models substantially across four datasets, increasing Average Crispness by 2-4× and boosting crispness-emphasized evaluation metrics by 20-35% (ODS), achieving state-of-the-art performance that matches or surpasses standard post-processing.

Conclusion: The proposed module enables end-to-end crisp edge detection without non-differentiable post-processing, significantly improving performance while being lightweight and plug-and-play compatible with existing models.

Abstract: Generating crisp, i.e., one-pixel-wide, edge maps remains one of the fundamental challenges in edge detection, affecting both traditional and learning-based methods. To obtain crisp edges, most existing approaches rely on two hand-crafted post-processing algorithms, Non-Maximum Suppression (NMS) and skeleton-based thinning, which are non-differentiable and hinder end-to-end optimization. Moreover, all existing crisp edge detection methods still depend on such post-processing to achieve satisfactory results. To address this limitation, we propose MatchED, a lightweight (only ~21K additional parameters) plug-and-play matching-based supervision module that can be appended to any edge detection model for joint end-to-end learning of crisp edges. At each training iteration, MatchED performs one-to-one matching between predicted and ground-truth edges based on spatial distance and confidence, ensuring consistency between training and testing protocols. Extensive experiments on four popular datasets demonstrate that integrating MatchED substantially improves the performance of existing edge detection models. In particular, MatchED increases the Average Crispness (AC) metric by up to 2–4× compared to baseline models. Under the crispness-emphasized evaluation (CEval), MatchED further boosts baseline performance by up to 20–35% in ODS and achieves similar gains in OIS and AP, achieving SOTA performance that matches or surpasses standard post-processing for the first time. Code is available at https://cvpr26-matched.github.io.

[145] NGL-Prompter: Training-Free Sewing Pattern Estimation from a Single Image

Anna Badalyan, Pratheba Selvaraju, Giorgio Becherini, Omid Taheri, Victoria Fernandez Abrevaya, Michael Black

Main category: cs.CV

TL;DR: NGL-Prompter: A training-free pipeline that uses large VLMs with a novel intermediate language (NGL) to extract garment parameters from images and reconstruct sewing patterns for 3D garments, including multi-layer outfits.

Motivation: Existing methods for estimating sewing patterns from images struggle with real-world generalization, fail to capture realistic garment part correlations, and are limited to single-layer outfits. VLMs can describe garments well in natural language but perform poorly at direct parameter regression.

Method: Proposes NGL (Natural Garment Language) - an intermediate language that restructures GarmentCode into a VLM-understandable representation. Uses NGL-Prompter, a training-free pipeline that queries large VLMs to extract structured garment parameters, then deterministically maps them to valid GarmentCode for sewing pattern reconstruction.

Result: Achieves state-of-the-art performance on geometry metrics across Dress4D, CloSe, and a new 5,000-image dataset. Strongly preferred in human and GPT-based perceptual evaluations. Can recover multi-layer outfits while competitors focus on single-layer garments.

Conclusion: Accurate sewing pattern reconstruction is possible without costly model training by leveraging VLMs with an appropriate intermediate language representation. The approach generalizes well to real-world images with occlusions and enables multi-layer outfit recovery.

Abstract: Estimating sewing patterns from images is a practical approach for creating high-quality 3D garments. Due to the lack of real-world pattern-image paired data, prior approaches fine-tune large vision language models (VLMs) on synthetic garment datasets generated by randomly sampling from a parametric garment model GarmentCode. However, these methods often struggle to generalize to in-the-wild images, fail to capture real-world correlations between garment parts, and are typically restricted to single-layer outfits. In contrast, we observe that VLMs are effective at describing garments in natural language, yet perform poorly when asked to directly regress GarmentCode parameters from images. To bridge this gap, we propose NGL (Natural Garment Language), a novel intermediate language that restructures GarmentCode into a representation more understandable to language models. Leveraging this language, we introduce NGL-Prompter, a training-free pipeline that queries large VLMs to extract structured garment parameters, which are then deterministically mapped to valid GarmentCode. We evaluate our method on the Dress4D, CloSe and a newly collected dataset of approximately 5,000 in-the-wild fashion images. Our approach achieves state-of-the-art performance on standard geometry metrics and is strongly preferred in both human and GPT-based perceptual evaluations compared to existing baselines. Furthermore, NGL-prompter can recover multi-layer outfits whereas competing methods focus mostly on single-layer garments, highlighting its strong generalization to real-world images even with occluded parts. These results demonstrate that accurate sewing pattern reconstruction is possible without costly model training. Our code and data will be released for research use.

[146] Onboard-Targeted Segmentation of Straylight in Space Camera Sensors

Riccardo Gallon, Fabian Schiemenz, Alessandra Menicucci, Eberhard Gill

Main category: cs.CV

TL;DR: AI-based semantic segmentation for space camera straylight faults using DeepLabV3 with MobileNetV3 backbone, pre-trained on Flare7k++ dataset for generalization, targeting spacecraft hardware deployment.

Motivation: Straylight induced by the Sun around a space camera's Field of View must be segmented onboard, but realistic space-specific training data are scarce, motivating transfer from public, non-space flare datasets for generalization.

Method: DeepLabV3 model with MobileNetV3 backbone for efficient segmentation, pre-trained on Flare7k++ dataset (non-space flares) to handle diverse flare textures, designed for resource-constrained spacecraft hardware.

Result: Developed custom metrics for system-level performance assessment through proposed interface between segmentation model and onboard navigation pipeline, enabling practical deployment evaluation.

Conclusion: Proposed methodology enables effective semantic segmentation of space camera straylight faults using transfer learning from non-space flare data, with hardware-efficient design suitable for spacecraft deployment.

Abstract: This study details an artificial intelligence (AI)-based methodology for the semantic segmentation of space camera faults. Specifically, we address the segmentation of straylight effects induced by solar presence around the camera’s Field of View (FoV). Anomalous images are sourced from our published dataset. Our approach emphasizes generalization across diverse flare textures, leveraging pre-training on a public dataset (Flare7k++) including flares in various non-space contexts to mitigate the scarcity of realistic space-specific data. A DeepLabV3 model with MobileNetV3 backbone performs the segmentation task. The model design targets deployment in spacecraft resource-constrained hardware. Finally, based on a proposed interface between our model and the onboard navigation pipeline, we develop custom metrics to assess the model’s performance in the system-level context.

[147] Monocular Endoscopic Tissue 3D Reconstruction with Multi-Level Geometry Regularization

Yangsen Chen, Hao Wang

Main category: cs.CV

TL;DR: A 3D Gaussian Splatting-based method for reconstructing deformable endoscopic tissues that combines surface-aware reconstruction with physical deformation constraints to achieve both smooth surfaces and real-time rendering.

Motivation: Current 3D Gaussian Splatting approaches struggle with consistent tissue surface reconstruction in endoscopic surgery, while NeRF-based methods lack real-time rendering capabilities. There's a need for a method that provides both smooth deformable surfaces and real-time rendering for robot-assisted surgery applications.

Method: Proposes a novel 3D Gaussian Splatting approach with: 1) Surface-aware reconstruction using Sign Distance Field-based mesh construction to constrain Gaussian Splatting, and 2) Physical deformation constraints incorporating local rigidity and global non-rigidity restrictions to guide Gaussian deformation for soft endoscopic tissue.
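The local-rigidity constraint can be pictured as an as-rigid-as-possible style penalty: distances between neighboring Gaussian centers should change little under deformation. The loss below is an illustrative stand-in for that idea, not the paper's exact formulation:

```python
import numpy as np

def local_rigidity_loss(p0, p1, neighbors):
    """Penalize changes in distance between neighboring Gaussian centers
    before (p0) and after (p1) deformation."""
    i, j = neighbors[:, 0], neighbors[:, 1]
    d0 = np.linalg.norm(p0[i] - p0[j], axis=-1)
    d1 = np.linalg.norm(p1[i] - p1[j], axis=-1)
    return float(np.mean((d1 - d0) ** 2))

p0 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
shift = np.array([0.3, -0.1, 0.2])             # a rigid translation
nbrs = np.array([[0, 1], [0, 2], [1, 2]])
print(local_rigidity_loss(p0, p0 + shift, nbrs))   # ~0 for rigid motion
```

A rigid motion incurs (near-)zero penalty while stretching is penalized, which is the behavior wanted for soft tissue that deforms locally but should not tear apart.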

Result: The method achieves fast rendering process and smooth surface appearances. Quantitative and qualitative analysis shows solid reconstruction quality in both textures and geometries compared to alternative methodologies.

Conclusion: The proposed approach successfully addresses the challenges of deformable endoscopic tissue reconstruction by combining surface-aware constraints with physical deformation modeling, enabling both real-time rendering and high-quality surface reconstruction for surgical applications.

Abstract: Reconstructing deformable endoscopic tissues is crucial for achieving robot-assisted surgery. However, 3D Gaussian Splatting-based approaches encounter challenges in achieving consistent tissue surface reconstruction, while existing NeRF-based methods lack real-time rendering capabilities. In pursuit of both smooth deformable surfaces and real-time rendering, we introduce a novel approach based on 3D Gaussian Splatting. Specifically, we introduce surface-aware reconstruction, initially employing a Sign Distance Field-based method to construct a mesh, subsequently utilizing this mesh to constrain the Gaussian Splatting reconstruction process. Furthermore, to ensure the generation of physically plausible deformations, we incorporate local rigidity and global non-rigidity restrictions to guide Gaussian deformation, tailored for the highly deformable nature of soft endoscopic tissue. Based on 3D Gaussian Splatting, our proposed method delivers a fast rendering process and smooth surface appearances. Quantitative and qualitative analysis against alternative methodologies shows that our approach achieves solid reconstruction quality in both textures and geometries.

[148] CleanStyle: Plug-and-Play Style Conditioning Purification for Text-to-Image Stylization

Xiaoman Feng, Mingkun Lei, Yang Wang, Dingwen Fu, Chi Zhang

Main category: cs.CV

TL;DR: CleanStyle is a plug-and-play framework for diffusion models that filters content leakage from style embeddings using SVD and time-aware suppression, with style-specific classifier-free guidance to improve prompt fidelity.

Motivation: Current encoder-based style transfer methods in diffusion models suffer from content leakage where semantic elements from style images appear in outputs, impairing prompt fidelity and stylistic consistency.

Method: Uses Singular Value Decomposition (SVD) to isolate tail components of style embeddings that cause content leakage, then dynamically suppresses them with time-aware exponential schedule (CS-SVD). Also introduces Style-Specific Classifier-Free Guidance (SS-CFG) that reuses suppressed components as style-aware unconditional inputs.
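The CS-SVD step can be sketched in plain numpy: decompose the style embedding, keep the leading singular components, and scale the tail by a time-dependent factor. The `exp(-t / tau)` schedule and the split index `k` below are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def suppress_tail(emb, t, k, tau=0.5):
    """Keep the k leading singular components of a style embedding and scale
    the tail (the content-leaking part) by exp(-t / tau), so suppression
    strengthens as the timestep t grows."""
    U, S, Vt = np.linalg.svd(emb, full_matrices=False)
    scale = np.ones_like(S)
    scale[k:] = np.exp(-t / tau)
    return (U * (S * scale)) @ Vt        # reconstruct with rescaled spectrum

rng = np.random.default_rng(0)
emb = rng.standard_normal((16, 64))      # token x channel style embedding
clean = suppress_tail(emb, t=1.0, k=4)
print(clean.shape)  # (16, 64)
```

At t = 0 the embedding passes through unchanged, and for large t the output approaches a rank-k matrix, which matches the intuition of progressively filtering content-carrying tail components during denoising.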

Result: Extensive experiments show CleanStyle substantially reduces content leakage, improves stylization quality, and enhances prompt alignment across various style references and prompts.

Conclusion: CleanStyle provides a lightweight, interpretable, plug-and-play solution for content leakage in style transfer without retraining, improving visual quality and prompt fidelity in diffusion models.

Abstract: Style transfer in diffusion models enables controllable visual generation by injecting the style of a reference image. However, recent encoder-based methods, while efficient and tuning-free, often suffer from content leakage, where semantic elements from the style image undesirably appear in the output, impairing prompt fidelity and stylistic consistency. In this work, we introduce CleanStyle, a plug-and-play framework that filters out content-related noise from the style embedding without retraining. Motivated by empirical analysis, we observe that such leakage predominantly stems from the tail components of the style embedding, which are isolated via Singular Value Decomposition (SVD). To address this, we propose CleanStyleSVD (CS-SVD), which dynamically suppresses tail components using a time-aware exponential schedule, providing clean, style-preserving conditional embeddings throughout the denoising process. Furthermore, we present Style-Specific Classifier-Free Guidance (SS-CFG), which reuses the suppressed tail components to construct style-aware unconditional inputs. Unlike conventional methods that use generic negative embeddings (e.g., zero vectors), SS-CFG introduces targeted negative signals that reflect style-specific but prompt-irrelevant visual elements. This enables the model to effectively suppress these distracting patterns during generation, thereby improving prompt fidelity and enhancing the overall visual quality of stylized outputs. Our approach is lightweight, interpretable, and can be seamlessly integrated into existing encoder-based diffusion models without retraining. Extensive experiments demonstrate that CleanStyle substantially reduces content leakage, improves stylization quality and improves prompt alignment across a wide range of style references and prompts.

[149] Bridging Physically Based Rendering and Diffusion Models with Stochastic Differential Equation

Junwei Shu, Wenjie Liu, Changgu Chen, Hantang Liu, Yang Li, Changbo Wang

Main category: cs.CV

TL;DR: Unified stochastic formulation bridges Monte Carlo rendering and diffusion models for physically grounded control over shading and material properties in image generation.

Motivation: Diffusion models lack explicit control over low-level physical properties like shading and materials, while physically based rendering lacks prompt-driven flexibility. Both share a common evolution from noisy to clean images, suggesting potential unification.

Method: Proposes unified stochastic formulation using SDEs that bridges Monte Carlo rendering and diffusion models. Models general SDE formulation for Monte Carlo integration under Central Limit Theorem, converts to physically grounded SDE via path tracing, and analyzes how physical characteristics extend to diffusion models through noise variance perspective.
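The noisy-to-clean analogy at the heart of the formulation can be seen in a toy Monte Carlo estimate: under the Central Limit Theorem the estimator's noise variance decays like 1/N, much as a diffusion sample moves from noisy to clean over denoising steps. This is the general observation the paper starts from, not its path-tracing SDE:

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.standard_normal(100_000) ** 2              # f(x) = x^2, E[f] = 1
running = np.cumsum(samples) / np.arange(1, samples.size + 1)

# Group the same samples into small and large Monte Carlo estimates: the
# spread of the estimates shrinks at the sqrt(N) rate predicted by the CLT.
batch_small = samples.reshape(1000, 100).mean(axis=1)    # N = 100 per estimate
batch_large = samples.reshape(10, 10_000).mean(axis=1)   # N = 10,000 per estimate
print(batch_small.std() / batch_large.std())             # roughly 10
```

Viewing the running estimate as a stochastic process indexed by sample count is what lets a Monte Carlo renderer be written as an SDE of the same shape as a diffusion model's.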

Result: Method enables physically grounded control over diffusion-generated results across multiple tasks including rendering and material editing, bridging the gap between generative flexibility and physical accuracy.

Conclusion: The unified stochastic formulation successfully connects physically based rendering with diffusion models, enabling fine-grained physical control over generative image synthesis while maintaining prompt-driven flexibility.

Abstract: Diffusion-based image generators excel at producing realistic content from text or image conditions, but they offer only limited explicit control over low-level, physically grounded shading and material properties. In contrast, physically based rendering (PBR) offers fine-grained physical control but lacks prompt-driven flexibility. Although these two paradigms originate from distinct communities, both share a common evolution – from noisy observations to clean images. In this paper, we propose a unified stochastic formulation that bridges Monte Carlo rendering and diffusion-based generative modeling. First, a general stochastic differential equation (SDE) formulation for Monte Carlo integration under the Central Limit Theorem is modeled. Through instantiation via physically based path tracing, we convert it into a physically grounded SDE representation. Moreover, we provide a systematic analysis of how the physical characteristics of path tracing can be extended to existing diffusion models from the perspective of noise variance. Extensive experiments across multiple tasks show that our method can exert physically grounded control over diffusion-generated results, covering tasks such as rendering and material editing.

[150] Communication-Inspired Tokenization for Structured Image Representations

Aram Davtyan, Yusuf Sahin, Yasaman Haghighi, Sebastian Stapf, Pablo Acuaviva, Alexandre Alahi, Paolo Favaro

Main category: cs.CV

TL;DR: COMiT is a communication-inspired visual tokenization framework that learns structured discrete token sequences through iterative encoding of image crops and recurrent token refinement, improving object-centric representation and compositional generalization.

Motivation: Existing discrete image tokenizers are optimized for reconstruction/compression, producing tokens that capture local texture rather than object-level semantic structure. The authors aim to create more structured, interpretable token sequences inspired by human incremental and compositional communication.

Method: COMiT constructs latent messages within fixed token budgets by iteratively observing localized image crops and recurrently updating discrete representations. At each step, new visual information is integrated while refining existing token sequences. A single transformer model handles both encoding and decoding, trained end-to-end with flow-matching reconstruction and semantic representation alignment losses.
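The iterative encoding loop can be sketched as follows; `toy_update` is a deliberately trivial stand-in for the recurrent transformer, and the crop size and random sampling policy are invented for illustration:

```python
import numpy as np

def toy_update(tokens, crop):
    """Stand-in for the recurrent transformer update: shift the message and
    write a quantized summary of the new observation into the freed slot."""
    out = np.roll(tokens, 1)
    out[0] = int(crop.mean()) % 256
    return out

def encode_iteratively(image, n_steps, budget, update_fn, rng):
    """COMiT-style loop: observe a random local crop at each step and let
    update_fn rewrite the fixed-size discrete token sequence."""
    H, W = image.shape
    tokens = np.zeros(budget, dtype=np.int64)          # fixed token budget
    for _ in range(n_steps):
        y, x = rng.integers(0, H - 8), rng.integers(0, W - 8)
        tokens = update_fn(tokens, image[y:y + 8, x:x + 8])
    return tokens

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32))
msg = encode_iteratively(img, n_steps=16, budget=8, update_fn=toy_update, rng=rng)
print(msg.shape)  # (8,)
```

The structural point is that the message length never grows: each new observation must be integrated by reorganizing the existing tokens, which is what pushes the representation toward compositional structure.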

Result: The method demonstrates that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure. COMiT substantially improves compositional generalization and relational reasoning over prior methods.

Conclusion: COMiT introduces a communication-inspired approach to visual tokenization that produces more structured, object-centric representations, addressing limitations of existing reconstruction-optimized tokenizers and improving downstream reasoning capabilities.

Abstract: Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.

[151] OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation

Tian Lan, Lei Xu, Zimu Yuan, Shanggui Liu, Jiajun Liu, Jiaxin Liu, Weilai Xiang, Hongyu Yang, Dong Jiang, Jianxin Yin, Dingyu Wang

Main category: cs.CV

TL;DR: OrthoDiffusion is a unified diffusion-based foundation model for multi-task musculoskeletal MRI interpretation, using three orientation-specific 3D diffusion models pre-trained on unlabeled knee MRI scans to support anatomical segmentation and multi-label diagnosis across multiple joints.

Motivation: Musculoskeletal disorders are a major global health burden, and MRI interpretation is challenging due to complex anatomical structures across different imaging planes, requiring significant expertise and being prone to variability. There's a need for more efficient and accurate diagnostic tools.

Method: Developed OrthoDiffusion using three orientation-specific 3D diffusion models (sagittal, coronal, axial) pre-trained in a self-supervised manner on 15,948 unlabeled knee MRI scans. The view-specific representations are integrated to support multiple clinical tasks including anatomical segmentation and multi-label diagnosis.
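The three orientation-specific models each consume one slicing of the same volume. Under an assumed (depth, height, width) axis convention, which the paper does not specify, splitting a scan into the three views is just a transpose per orientation:

```python
import numpy as np

def orientation_views(volume):
    """Split a 3D volume (D, H, W) into axial, coronal and sagittal slice
    stacks, one stack per orientation-specific model."""
    axial = volume                         # slices along depth:  (D, H, W)
    coronal = volume.transpose(1, 0, 2)    # slices along height: (H, D, W)
    sagittal = volume.transpose(2, 0, 1)   # slices along width:  (W, D, H)
    return axial, coronal, sagittal

vol = np.arange(2 * 3 * 4).reshape(2, 3, 4)
ax, co, sa = orientation_views(vol)
print(ax.shape, co.shape, sa.shape)  # (2, 3, 4) (3, 2, 4) (4, 2, 3)
```

Each stack is pre-trained on independently, and the resulting view-specific representations are what the framework integrates for segmentation and diagnosis.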

Result: Achieved excellent performance in segmentation of 11 knee structures and detection of 8 knee abnormalities. Demonstrated robustness across clinical centers and MRI field strengths, outperforming traditional supervised models. Maintained high diagnostic precision with only 10% of training labels. Representations transferred well to other joints, achieving strong performance across 11 diseases of ankle and shoulder.

Conclusion: Diffusion-based foundation models can serve as a unified platform for multi-disease diagnosis and anatomical segmentation, potentially improving efficiency and accuracy of musculoskeletal MRI interpretation in real-world clinical workflows.

Abstract: Musculoskeletal disorders represent a significant global health burden and are a leading cause of disability worldwide. While MRI is essential for accurate diagnosis, its interpretation remains exceptionally challenging. Radiologists must identify multiple potential abnormalities within complex anatomical structures across different imaging planes, a process that requires significant expertise and is prone to variability. We developed OrthoDiffusion, a unified diffusion-based foundation model designed for multi-task musculoskeletal MRI interpretation. The framework utilizes three orientation-specific 3D diffusion models, pre-trained in a self-supervised manner on 15,948 unlabeled knee MRI scans, to learn robust anatomical features from sagittal, coronal, and axial views. These view-specific representations are integrated to support diverse clinical tasks, including anatomical segmentation and multi-label diagnosis. Our evaluation demonstrates that OrthoDiffusion achieves excellent performance in the segmentation of 11 knee structures and the detection of 8 knee abnormalities. The model exhibited remarkable robustness across different clinical centers and MRI field strengths, consistently outperforming traditional supervised models. Notably, in settings where labeled data was scarce, OrthoDiffusion maintained high diagnostic precision using only 10% of training labels. Furthermore, the anatomical representations learned from knee imaging proved highly transferable to other joints, achieving strong diagnostic performance across 11 diseases of the ankle and shoulder. These findings suggest that diffusion-based foundation models can serve as a unified platform for multi-disease diagnosis and anatomical segmentation, potentially improving the efficiency and accuracy of musculoskeletal MRI interpretation in real-world clinical workflows.

[152] Federated Learning for Cross-Modality Medical Image Segmentation via Augmentation-Driven Generalization

Sachin Dudda Nagaraju, Ashkan Moradi, Bendik Skarre Abrahamsen, Mattijs Elschot

Main category: cs.CV

TL;DR: Federated learning approach for medical image segmentation that addresses cross-modality generalization between CT and MRI data using augmentation strategies, with Global Intensity Nonlinear (GIN) augmentation showing best performance.

DetailsMotivation: Medical AI faces challenges with fragmented, privacy-constrained data across institutions. Federated learning enables collaboration without data sharing, but cross-modality domain shifts (CT vs MRI) hinder generalization. Existing solutions require impractical paired data or complex architectures.

Method: Realistic FL scenario where each client holds single-modality data (CT or MRI). Evaluated multiple augmentation strategies: convolution-based spatial augmentation, frequency-domain manipulation, domain-specific normalization, and Global Intensity Nonlinear (GIN) augmentation. Tested on abdominal organ segmentation and whole-heart segmentation tasks.

Result: GIN augmentation consistently outperformed alternatives in both centralized and federated settings. For pancreas segmentation, Dice score improved from 0.073 to 0.437 (498% gain). Federated approach achieved 93-98% of centralized training accuracy while preserving data privacy.

Conclusion: GIN augmentation effectively simulates cross-modality appearance variations while preserving anatomical structure. The federated approach demonstrates strong cross-modality generalization without compromising data privacy, enabling feasible AI deployment across diverse healthcare systems.

Abstract: Artificial intelligence has emerged as a transformative tool in medical image analysis, yet developing robust and generalizable segmentation models remains difficult due to fragmented, privacy-constrained imaging data siloed across institutions. While federated learning (FL) enables collaborative model training without centralizing data, cross-modality domain shifts pose a critical challenge, particularly when models trained on one modality fail to generalize to another. Many existing solutions require paired multimodal data per patient or rely on complex architectures, both of which are impractical in real clinical settings. In this work, we consider a realistic FL scenario where each client holds single-modality data (CT or MRI), and systematically investigate augmentation strategies for cross-modality generalization. Using abdominal organ segmentation and whole-heart segmentation as representative multi-class and binary segmentation benchmarks, we evaluate convolution-based spatial augmentation, frequency-domain manipulation, domain-specific normalization, and global intensity nonlinear (GIN) augmentation. Our results show that GIN consistently outperforms alternatives in both centralized and federated settings by simulating cross-modality appearance variations while preserving anatomical structure. For the pancreas, Dice score improved from 0.073 to 0.437, a 498% gain. Our federated approach achieves 93-98% of centralized training accuracy, demonstrating strong cross-modality generalization without compromising data privacy, pointing toward feasible federated AI deployment across diverse healthcare systems.
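The GIN idea above can be sketched in a few lines: push the image through a shallow, randomly weighted nonlinear network so its intensity appearance changes while spatial anatomy stays intact. This is a simplified numpy illustration, assuming pointwise layers instead of the small spatial convolutions used in practice; the layer count, blending weight, and statistics matching are illustrative choices, not the paper's implementation.

```python
import numpy as np

def gin_augment(image, n_layers=3, alpha=0.7, rng=None):
    """GIN-style augmentation sketch: map intensities through a shallow
    randomly weighted nonlinear network, renormalize to the input's
    statistics, and blend with the original. Spatial structure is
    untouched; only the appearance changes, simulating a shift to a
    different imaging modality."""
    rng = np.random.default_rng() if rng is None else rng
    x = image.astype(np.float64)[..., None]  # add a channel axis
    hidden = 4
    for i in range(n_layers):
        out_ch = 1 if i == n_layers - 1 else hidden
        w = rng.normal(0.0, 1.0, size=(x.shape[-1], out_ch))
        b = rng.normal(0.0, 0.1, size=(out_ch,))
        x = x @ w + b
        if i < n_layers - 1:
            x = np.where(x > 0, x, 0.2 * x)  # leaky ReLU
    x = x[..., 0]
    # rescale the transformed image to the input's intensity statistics
    x = (x - x.mean()) / (x.std() + 1e-8) * image.std() + image.mean()
    return alpha * x + (1.0 - alpha) * image  # blend with the original

img = np.random.default_rng(0).random((64, 64))
aug = gin_augment(img, rng=np.random.default_rng(1))
```

Because the mapping acts per pixel, segmentation labels remain valid for the augmented image, which is what makes this usable for cross-modality training.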

[153] Real-time Motion Segmentation with Event-based Normal Flow

Sheng Zhong, Zhongyang Ren, Xiya Zhu, Dehao Yuan, Cornelia Fermuller, Yi Zhou

Main category: cs.CV

TL;DR: A real-time motion segmentation framework for event-based cameras using normal flow as intermediate representation, achieving 800x speedup over state-of-the-art methods.

DetailsMotivation: Event-based cameras offer microsecond resolution but raw event data processing is inefficient for real-time tasks like motion segmentation. Normal flow provides compressed motion information that enables more efficient processing.

Method: Proposes normal flow-based motion segmentation framework: 1) Learns dense normal flow from event neighborhoods, 2) Formulates segmentation as energy minimization via graph cuts, 3) Iteratively optimizes with normal flow clustering and motion model fitting, 4) Uses normal flow-based motion model initialization to reduce computational complexity.

Result: Achieves nearly 800x speedup compared to state-of-the-art open-source method while maintaining accuracy. Extensive evaluations on multiple public datasets demonstrate both accuracy and efficiency.

Conclusion: The normal flow-based approach enables real-time motion segmentation for event-based vision by efficiently compressing motion information and reducing computational complexity through smart model initialization and fitting.

Abstract: Event-based cameras are bio-inspired sensors with pixels that independently and asynchronously respond to brightness changes at microsecond resolution, offering the potential to handle visual tasks in challenging scenarios. However, due to the sparse information content in individual events, directly processing the raw event data to solve vision tasks is highly inefficient, which severely limits the applicability of state-of-the-art methods in real-time tasks, such as motion segmentation, a fundamental task for dynamic scene understanding. Incorporating normal flow as an intermediate representation to compress motion information from event clusters within a localized region provides a more effective solution. In this work, we propose a normal flow-based motion segmentation framework for event-based vision. Leveraging the dense normal flow directly learned from event neighborhoods as input, we formulate the motion segmentation task as an energy minimization problem solved via graph cuts, and optimize it iteratively with normal flow clustering and motion model fitting. By using a normal flow-based motion model initialization and fitting method, the proposed system is able to efficiently estimate the motion models of independently moving objects with only a limited number of candidate models, which significantly reduces the computational complexity and ensures real-time performance, achieving nearly an 800x speedup in comparison to the open-source state-of-the-art method. Extensive evaluations on multiple public datasets fully demonstrate the accuracy and efficiency of our framework.
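Normal flow, the intermediate representation above, is the component of motion along the local intensity gradient, u_n = -I_t ∇I / |∇I|², which is the part of optical flow recoverable from brightness changes alone (the tangential component is lost to the aperture problem). A minimal frame-based numpy sketch follows; the paper instead learns dense normal flow from event neighborhoods, so this is only an illustration of the quantity itself.

```python
import numpy as np

def normal_flow(frame0, frame1, eps=1e-6):
    """Per-pixel normal flow: project motion onto the intensity
    gradient, u_n = -I_t * grad(I) / |grad(I)|^2, using finite
    differences for the spatial and temporal derivatives."""
    Iy, Ix = np.gradient(frame0)   # gradients along rows, then columns
    It = frame1 - frame0           # temporal derivative
    mag2 = Ix**2 + Iy**2 + eps
    return -It * Ix / mag2, -It * Iy / mag2

# a vertical edge translating to the right between the two frames
f0 = np.zeros((8, 8)); f0[:, 4:] = 1.0
f1 = np.zeros((8, 8)); f1[:, 5:] = 1.0
ux, uy = normal_flow(f0, f1)
print(ux[4, 4], uy[4, 4])  # flow at the edge points in +x, none in y
```

The recovered magnitude at a hard step edge is inflated by the central-difference gradient estimate; only the direction and sign should be read from this toy example.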

[154] SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking

Muhammad Saif Ullah Khan, Didier Stricker

Main category: cs.CV

TL;DR: A biomechanics-aware simulation framework creates SIMSPINE dataset with 3D spinal keypoints for natural motions, enabling data-driven vertebral kinematics learning and improving spine tracking benchmarks.

DetailsMotivation: Spinal motion modeling is crucial for human biomechanics but underexplored in computer vision due to complex multi-joint kinematics and lack of large-scale 3D annotations. Existing datasets lack detailed vertebral-level annotations for natural motions.

Method: Developed a biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. Created SIMSPINE dataset with 2.14M frames of natural full-body motions with vertebra-level annotations.

Result: SIMSPINE is the first open dataset providing sparse vertebra-level 3D spinal annotations for natural motions. 2D spine baselines improved state-of-the-art from 0.63 to 0.80 AUC in controlled environments and from 0.91 to 0.93 AP for in-the-wild spine tracking. Released pretrained models for 2D detection, monocular 3D lifting, and multi-view reconstruction.

Conclusion: The simulation framework and SIMSPINE dataset advance vision-based biomechanics, motion analysis, and digital human modeling by enabling reproducible, anatomically grounded 3D spine estimation under natural conditions, bridging musculoskeletal simulation and computer vision.

Abstract: Modeling spinal motion is fundamental to understanding human biomechanics, yet remains underexplored in computer vision due to the spine’s complex multi-joint kinematics and the lack of large-scale 3D annotations. We present a biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. Using this framework, we create the first open dataset, named SIMSPINE, which provides sparse vertebra-level 3D spinal annotations for natural full-body motions in indoor multi-camera capture without external restraints. With 2.14 million frames, this enables data-driven learning of vertebral kinematics from subtle posture variations and bridges the gap between musculoskeletal simulation and computer vision. In addition, we release pretrained baselines covering fine-tuned 2D detectors, monocular 3D pose lifting models, and multi-view reconstruction pipelines, establishing a unified benchmark for biomechanically valid spine motion estimation. Specifically, our 2D spine baselines improve the state-of-the-art from 0.63 to 0.80 AUC in controlled environments, and from 0.91 to 0.93 AP for in-the-wild spine tracking. Together, the simulation framework and SIMSPINE dataset advance research in vision-based biomechanics, motion analysis, and digital human modeling by enabling reproducible, anatomically grounded 3D spine estimation under natural conditions.

[155] VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving

Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, Long Chen

Main category: cs.CV

TL;DR: VGGDrive enhances Vision-Language Models for autonomous driving by integrating cross-view 3D geometric grounding from mature 3D foundation models through a plug-and-play adapter.

DetailsMotivation: Existing VLMs lack cross-view 3D geometric modeling capabilities crucial for autonomous driving, leading to mediocre performance. Current approaches using Q&A data for auxiliary training fail to fundamentally equip VLMs with comprehensive 3D understanding.

Method: Proposes VGGDrive architecture with Cross-View 3D Geometric Enabler (CVGE) - a plug-and-play module that bridges frozen 3D visual model features with VLM’s 2D features through hierarchical adaptive injection mechanism.

Result: Extensive experiments show VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including cross-view risk perception, motion prediction, and trajectory planning tasks.

Conclusion: Mature 3D foundation models can effectively empower autonomous driving tasks through integration, demonstrating potential for this paradigm in the autonomous driving community.

Abstract: The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing Q&A data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM’s 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks like cross-view risk perception, motion prediction, and trajectory planning. It’s our belief that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.
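A single-layer sketch of the injection idea: the VLM's 2D tokens query frozen 3D geometric tokens via cross-attention, and the attended content is added residually, so the base VLM stream is left intact (the plug-and-play property). The hierarchical, multi-level injection of CVGE is collapsed to one layer here, and all shapes and weights are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis, keepdims=True)

def inject_3d_features(f2d, f3d, Wq, Wk, Wv, scale=None):
    """Cross-attention injection sketch: 2D tokens (queries) attend
    over 3D geometric tokens (keys/values); the result is added as a
    residual so the original 2D features are preserved."""
    q, k, v = f2d @ Wq, f3d @ Wk, f3d @ Wv
    if scale is None:
        scale = np.sqrt(q.shape[-1])
    attn = softmax(q @ k.T / scale)     # (n2d, n3d) attention weights
    return f2d + attn @ v               # residual injection

rng = np.random.default_rng(0)
n2d, n3d, d = 6, 10, 16
f2d, f3d = rng.normal(size=(n2d, d)), rng.normal(size=(n3d, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = inject_3d_features(f2d, f3d, Wq, Wk, Wv)
```

The residual form is what makes such an adapter "decoupled": zeroing the projection weights recovers the base VLM exactly.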

[156] RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction

Yangfan Zhao, Hanwei Zhang, Ke Huang, Qiufeng Wang, Zhenzhou Shao, Dengyu Wu

Main category: cs.CV

TL;DR: RU4D-SLAM: A robust 4D Gaussian splatting SLAM framework that handles dynamic environments by integrating motion blur rendering, uncertainty-aware tracking, and semantic-guided reweighting for improved trajectory accuracy and 4D scene reconstruction.

DetailsMotivation: Existing 3D Gaussian splatting SLAM methods struggle in dynamic environments where moving objects complicate 3D reconstruction and hinder reliable tracking. While 4D reconstruction (especially 4D Gaussian splatting) shows promise for addressing these challenges, its potential for 4D-aware SLAM remains largely unexplored.

Method: Proposes RU4D-SLAM framework that introduces temporal factors into spatial 3D representation with uncertainty-aware perception of scene changes, blurred image synthesis, and dynamic scene reconstruction. Key innovations include: 1) Integration of motion blur rendering for dynamic scene representation, 2) Extension of per-pixel uncertainty modeling to handle blurred images, 3) Semantic-guided reweighting mechanism for per-pixel uncertainty estimation in dynamic scenes, and 4) Learnable opacity weight for adaptive 4D mapping.

Result: Extensive experiments on standard benchmarks show the method substantially outperforms state-of-the-art approaches in both trajectory accuracy and 4D scene reconstruction, particularly in dynamic environments with moving objects and low-quality inputs.

Conclusion: RU4D-SLAM provides an effective solution for robust 4D-aware SLAM in dynamic environments by combining temporal modeling with uncertainty-aware perception and adaptive mapping techniques.

Abstract: Combining 3D Gaussian splatting with Simultaneous Localization and Mapping (SLAM) has gained popularity as it enables continuous 3D environment reconstruction during motion. However, existing methods struggle in dynamic environments, particularly when moving objects complicate 3D reconstruction and, in turn, hinder reliable tracking. The emergence of 4D reconstruction, especially 4D Gaussian splatting, offers a promising direction for addressing these challenges, yet its potential for 4D-aware SLAM remains largely underexplored. Along this direction, we propose a robust and efficient framework, namely Reweighting Uncertainty in Gaussian Splatting SLAM (RU4D-SLAM) for 4D scene reconstruction, that introduces temporal factors into spatial 3D representation while incorporating uncertainty-aware perception of scene changes, blurred image synthesis, and dynamic scene reconstruction. We enhance dynamic scene representation by integrating motion blur rendering, and improve uncertainty-aware tracking by extending per-pixel uncertainty modeling, which is originally designed for static scenarios, to handle blurred images. Furthermore, we propose a semantic-guided reweighting mechanism for per-pixel uncertainty estimation in dynamic scenes, and introduce a learnable opacity weight to support adaptive 4D mapping. Extensive experiments on standard benchmarks demonstrate that our method substantially outperforms state-of-the-art approaches in both trajectory accuracy and 4D scene reconstruction, particularly in dynamic environments with moving objects and low-quality inputs. Code available: https://ru4d-slam.github.io
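One plausible reading of the semantic-guided reweighting, sketched under a standard heteroscedastic-loss assumption: predicted per-pixel uncertainty divides the residual, and pixels a semantic model flags as dynamic get their uncertainty inflated so they pull less on the tracking objective. The loss form and the multiplicative `kappa` inflation are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def reweighted_loss(residual, sigma, dynamic_mask, kappa=5.0):
    """Heteroscedastic reweighting sketch: high-uncertainty pixels are
    down-weighted via 1/sigma^2; pixels on the dynamic mask have sigma
    inflated by kappa. The log-sigma term penalizes declaring every
    pixel uncertain."""
    sigma_eff = sigma * np.where(dynamic_mask, kappa, 1.0)
    per_pixel = residual**2 / (2.0 * sigma_eff**2) + np.log(sigma_eff)
    return per_pixel.mean()

rng = np.random.default_rng(0)
res = rng.normal(0.0, 1.0, (32, 32))         # photometric residuals
sigma = np.full((32, 32), 1.0)               # predicted uncertainty
mask = np.zeros((32, 32), dtype=bool)
mask[10:20, 10:20] = True                    # pixels flagged as dynamic
loss_static = reweighted_loss(res, sigma, np.zeros_like(mask))
loss_masked = reweighted_loss(res, sigma, mask)
```

In a SLAM pipeline this weighting would enter the tracking objective, letting pose optimization discount moving objects without discarding their pixels outright.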

[157] GatedCLIP: Gated Multimodal Fusion for Hateful Memes Detection

Yingying Guo, Ke Zhang, Zirong Zeng

Main category: cs.CV

TL;DR: GatedCLIP enhances CLIP for hateful meme detection with learned projections, dynamic gated fusion, and contrastive learning, achieving AUROC 0.66 vs CLIP’s 0.49 with only 350K parameters.

DetailsMotivation: Hateful content in multimodal memes is challenging to detect because harmful messages emerge from complex interactions between benign images and text, requiring specialized multimodal understanding beyond standard CLIP capabilities.

Method: Proposes GatedCLIP with three key improvements: 1) learned projection heads mapping CLIP embeddings to task-optimized semantic space, 2) dynamic gated fusion mechanism adaptively weighting visual and textual features, and 3) contrastive learning objective maintaining cross-modal semantic alignment.

Result: On the Hateful Memes dataset, GatedCLIP achieves AUROC of 0.66, substantially outperforming CLIP baseline (AUROC 0.49) while maintaining computational efficiency with only 350K trainable parameters.

Conclusion: GatedCLIP effectively enhances CLIP’s multimodal capabilities for hateful meme detection through specialized architectural improvements, demonstrating significant performance gains while remaining parameter-efficient.

Abstract: Detecting hateful content in multimodal memes presents unique challenges, as harmful messages often emerge from the complex interplay between benign images and text. We propose GatedCLIP, a Vision-Language model that enhances CLIP’s multimodal capabilities with specialized architectural improvements for hateful memes detection. Our approach introduces learned projection heads that map CLIP embeddings to a task-optimized semantic space, a dynamic gated fusion mechanism that adaptively weights visual and textual features, and a contrastive learning objective that maintains cross-modal semantic alignment. Experiments on the Hateful Memes dataset demonstrate that GatedCLIP achieves an AUROC of 0.66, substantially outperforming the CLIP baseline (AUROC 0.49) while maintaining computational efficiency with only 350K trainable parameters.
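The gated fusion at the core of the model can be sketched as a sigmoid gate computed from the concatenated embeddings, which mixes the visual and textual features per dimension. The weight shapes and initialization below are illustrative, not the paper's architecture.

```python
import numpy as np

def gated_fusion(v, t, Wg, bg):
    """Gated fusion sketch: a gate g in (0, 1), computed from both
    modalities, decides per dimension how much of the visual (v) and
    textual (t) embedding enters the fused representation."""
    z = np.concatenate([v, t])
    g = 1.0 / (1.0 + np.exp(-(Wg @ z + bg)))  # sigmoid gate
    return g * v + (1.0 - g) * t

rng = np.random.default_rng(0)
d = 8
v, t = rng.normal(size=d), rng.normal(size=d)
Wg = rng.normal(size=(d, 2 * d)) * 0.1  # gate projection (illustrative)
bg = np.zeros(d)
fused = gated_fusion(v, t, Wg, bg)
```

Since the gate is a convex weight, each fused coordinate stays between the corresponding visual and textual values, so neither modality can be fully overwritten.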

[158] Training-Free Multi-Concept Image Editing

Niki Foteinopoulou, Ignas Budvytis, Stephan Liwicki

Main category: cs.CV

TL;DR: Training-free framework for concept-based image editing that combines Optimised DDS with LoRA-driven concept composition, enabling multi-concept control without training.

DetailsMotivation: Current diffusion-based image editing methods struggle with preserving identity and capturing visual concepts that are difficult to express purely through text prompts, such as facial structure, material texture, or object geometry.

Method: Unifies Optimised DDS (Delta Denoising Score) with LoRA-driven concept composition, where LoRA training data represents visual concepts. Uses ordered timesteps, regularization, and negative-prompt guidance for stability and controllability.

Result: Demonstrates consistent improvements over existing training-free diffusion editing methods on InstructPix2Pix and ComposLoRA benchmarks through quantitative and qualitative evaluation.

Conclusion: Proposes an effective training-free framework for concept-based image editing that integrates semantic text guidance with low-level cues from pretrained concept adapters, enabling better control over multiple visual concepts.

Abstract: Editing images with diffusion models without training remains challenging. While recent optimisation-based methods achieve strong zero-shot edits from text, they struggle to preserve identity or capture details that language alone cannot express. Many visual concepts such as facial structure, material texture, or object geometry are impossible to express purely through text prompts alone. To address this gap, we introduce a training-free framework for concept-based image editing, which unifies Optimised DDS with LoRA-driven concept composition, where the training data of the LoRA represent the concept. Our approach enables combining and controlling multiple visual concepts directly within the diffusion process, integrating semantic guidance from text with low-level cues from pretrained concept adapters. We further refine DDS for stability and controllability through ordered timesteps, regularisation, and negative-prompt guidance. Quantitative and qualitative results demonstrate consistent improvements over existing training-free diffusion editing methods on InstructPix2Pix and ComposLoRA benchmarks. Code will be made publicly available.

[159] FLIM Networks with Bag of Feature Points

João Deltregia Martinelli, Marcelo Luis Rodrigues Filho, Felipe Crispim da Rocha Salvagnini, Gilson Junior Soares, Jefersson A. dos Santos, Alexandre X. Falcão

Main category: cs.CV

TL;DR: FLIM-BoFP introduces a faster filter estimation method for FLIM networks by performing single clustering at input block and mapping feature points across all blocks, improving efficiency while maintaining performance for parasite detection tasks.

DetailsMotivation: Traditional convolutional networks require extensive image annotation which is costly and time-consuming. FLIM networks address this by estimating encoder filters from user-drawn markers without traditional optimization, but previous FLIM-Cluster method has computational overhead and reduced control over filter locations.

Method: FLIM-BoFP (Bag of Feature Points) performs a single clustering at the input block to create a bag of feature points, then defines filters directly from mapped feature points across all encoder blocks, streamlining the filter estimation process compared to FLIM-Cluster which performs patch clustering at each block.

Result: FLIM-BoFP demonstrates improved efficiency, effectiveness, and generalization compared to FLIM-Cluster and other state-of-the-art baselines for parasite detection in optical microscopy images, being significantly faster while maintaining performance.

Conclusion: FLIM-BoFP provides a more efficient filter estimation method for FLIM networks, enabling faster training without backpropagation while maintaining effectiveness for computer vision tasks like parasite detection.

Abstract: Convolutional networks require extensive image annotation, which can be costly and time-consuming. Feature Learning from Image Markers (FLIM) tackles this challenge by estimating encoder filters (i.e., kernel weights) from user-drawn markers on discriminative regions of a few representative images without traditional optimization. Such an encoder combined with an adaptive decoder comprises a FLIM network fully trained without backpropagation. Prior research has demonstrated their effectiveness in Salient Object Detection (SOD), being significantly lighter than existing lightweight models. This study revisits FLIM SOD and introduces FLIM-Bag of Feature Points (FLIM-BoFP), a considerably faster filter estimation method. The previous approach, FLIM-Cluster, derives filters through patch clustering at each encoder’s block, leading to computational overhead and reduced control over filter locations. FLIM-BoFP streamlines this process by performing a single clustering at the input block, creating a bag of feature points, and defining filters directly from mapped feature points across all blocks. The paper evaluates the benefits in efficiency, effectiveness, and generalization of FLIM-BoFP compared to FLIM-Cluster and other state-of-the-art baselines for parasite detection in optical microscopy images.
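The backpropagation-free filter estimation described above can be sketched as: gather patches centered on marked pixels, cluster them, and use the normalized cluster centers directly as convolution kernels. This toy version uses random marker points on a random image and a hand-rolled k-means; the patch size, number of clusters, and zero-mean/unit-norm normalization are illustrative choices, not the authors' exact procedure.

```python
import numpy as np

def filters_from_markers(image, marker_points, k=3, patch=3,
                         iters=10, seed=0):
    """FLIM-style filter estimation sketch: patches at user-marked
    pixels are clustered with plain Lloyd k-means, and the normalized
    cluster centers become the convolution kernels, with no
    backpropagation involved."""
    r = patch // 2
    X = np.array([image[y - r:y + r + 1, x - r:x + r + 1].ravel()
                  for y, x in marker_points])
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    centers = centers - centers.mean(1, keepdims=True)      # zero-mean
    centers /= np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8
    return centers.reshape(k, patch, patch)

rng = np.random.default_rng(1)
img = rng.random((32, 32))
pts = [(y, x) for y in range(2, 30, 4) for x in range(2, 30, 4)]
kernels = filters_from_markers(img, pts, k=3)
```

The "bag of feature points" speedup in the paper comes from doing this clustering once at the input block and then mapping the resulting points through the deeper blocks, rather than re-clustering patches at every block.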

[160] Hybrid Fusion: One-Minute Efficient Training for Zero-Shot Cross-Domain Image Fusion

Ran Zhang, Xuanhua He, Liu Liu

Main category: cs.CV

TL;DR: A hybrid image fusion framework combining a learnable U-Net for dynamic guidance with a fixed Laplacian pyramid fusion kernel, enabling efficient full-resolution training and eliminating the train-inference gap.

DetailsMotivation: Traditional image fusion methods are fast but lack adaptability and performance, while deep learning approaches achieve SOTA results but suffer from inefficiencies due to patch-based training that creates a gap with full-resolution inference.

Method: Proposes a hybrid framework that decouples policy learning from pixel synthesis: uses a learnable U-Net to generate dynamic guidance maps that direct a classic, fixed Laplacian pyramid fusion kernel.

Result: Achieves SOTA-comparable performance in about one minute on an RTX 4090 or two minutes on a consumer laptop GPU from scratch, with powerful zero-shot generalization across diverse tasks (infrared-visible to medical imaging).

Conclusion: The hybrid approach resolves the trade-off between efficiency and performance, enabling efficient full-resolution training while maintaining high faithfulness through linear construction from source information.

Abstract: Image fusion seeks to integrate complementary information from multiple sources into a single, superior image. While traditional methods are fast, they lack adaptability and performance. Conversely, deep learning approaches achieve state-of-the-art (SOTA) results but suffer from critical inefficiencies: their reliance on slow, resource-intensive, patch-based training introduces a significant gap with full-resolution inference. We propose a novel hybrid framework that resolves this trade-off. Our method utilizes a learnable U-Net to generate a dynamic guidance map that directs a classic, fixed Laplacian pyramid fusion kernel. This decoupling of policy learning from pixel synthesis enables remarkably efficient full-resolution training, eliminating the train-inference gap. Consequently, our model achieves SOTA-comparable performance in about one minute on an RTX 4090 or two minutes on a consumer laptop GPU from scratch without any external model and demonstrates powerful zero-shot generalization across diverse tasks, from infrared-visible to medical imaging. By design, the fused output is linearly constructed solely from source information, ensuring high faithfulness for critical applications. The codes are available at https://github.com/Zirconium233/HybridFusion
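The decoupling of guidance from synthesis can be sketched with a guidance map that linearly mixes per-scale detail bands of the two sources. For brevity a Laplacian stack (no downsampling) stands in for the true pyramid, and the guidance map is supplied directly rather than predicted by a U-Net; both are simplifying assumptions.

```python
import numpy as np

def _blur(x):
    """Separable 1-2-1 blur with edge padding."""
    p = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    x = (p[:-2] + 2 * p[1:-1] + p[2:]) / 4.0
    p = np.pad(x, ((0, 0), (1, 1)), mode="edge")
    return (p[:, :-2] + 2 * p[:, 1:-1] + p[:, 2:]) / 4.0

def guided_laplacian_fusion(a, b, guide, levels=3):
    """Guidance-weighted Laplacian fusion sketch: at each scale the
    guide (in [0, 1]) convexly mixes the two sources' detail bands,
    then the blended coarse residual is added back. Every output pixel
    is thus built only from source content."""
    out, ga, gb = 0.0, a, b
    for _ in range(levels):
        la, lb = _blur(ga), _blur(gb)
        out = out + guide * (ga - la) + (1.0 - guide) * (gb - lb)
        ga, gb = la, lb
    return out + guide * ga + (1.0 - guide) * gb

rng = np.random.default_rng(0)
a, b = rng.random((32, 32)), rng.random((32, 32))
fused = guided_laplacian_fusion(a, b, np.full((32, 32), 0.5))
only_a = guided_laplacian_fusion(a, b, np.ones((32, 32)))  # guide = 1
```

A guide of all ones reproduces source `a` exactly (the detail bands telescope), which mirrors the abstract's claim that the output is linearly constructed from source information.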

[161] On the Explainability of Vision-Language Models in Art History

Stefanie Schneider

Main category: cs.CV

TL;DR: Evaluation of 7 XAI methods for interpreting CLIP’s visual reasoning in art-historical contexts, finding effectiveness depends on conceptual stability and representational availability of categories.

DetailsMotivation: To understand how XAI methods can make VLMs' visual reasoning interpretable in art-historical contexts, addressing critical questions about machine 'understanding' in multimodal systems.

Method: Evaluated 7 XAI methods using zero-shot localization experiments combined with human interpretability studies on CLIP, focusing on art-historical contexts.

Result: XAI methods capture some aspects of human interpretation, but their effectiveness depends on conceptual stability and representational availability of the examined categories.

Conclusion: While XAI methods provide insights into VLM reasoning, their interpretability is context-dependent and limited by the stability and availability of conceptual representations.

Abstract: Vision-Language Models (VLMs) transfer visual and textual data into a shared embedding space. In so doing, they enable a wide range of multimodal tasks, while also raising critical questions about the nature of machine ‘understanding.’ In this paper, we examine how Explainable Artificial Intelligence (XAI) methods can render the visual reasoning of a VLM - namely, CLIP - legible in art-historical contexts. To this end, we evaluate seven methods, combining zero-shot localization experiments with human interpretability studies. Our results indicate that, while these methods capture some aspects of human interpretation, their effectiveness hinges on the conceptual stability and representational availability of the examined categories.

[162] DA-Cal: Towards Cross-Domain Calibration in Semantic Segmentation

Wangkai Li, Rui Sun, Zhaoyang Li, Yujia Chen, Tianzhu Zhang

Main category: cs.CV

TL;DR: DA-Cal is a cross-domain calibration framework for unsupervised domain adaptation in semantic segmentation that addresses misalignment between prediction confidence and actual accuracy by optimizing soft pseudo-labels through meta-learning and domain-mixing strategies.

DetailsMotivation: Existing UDA methods for semantic segmentation focus on improving target domain performance but neglect calibration quality, creating safety risks in critical applications where confidence scores don't reflect actual accuracy. The authors observed that performance degrades when using soft pseudo-labels instead of hard ones in cross-domain scenarios due to poor calibration.

Method: DA-Cal transforms target domain calibration into soft pseudo-label optimization using a Meta Temperature Network that generates pixel-level calibration parameters. It employs bi-level optimization to establish relationships between soft pseudo-labels and UDA supervision, and uses complementary domain-mixing strategies to prevent overfitting and reduce domain discrepancies.

Result: Experiments show DA-Cal seamlessly integrates with existing self-training frameworks across multiple UDA segmentation benchmarks, significantly improving target domain calibration while delivering performance gains without inference overhead.

Conclusion: DA-Cal effectively addresses the calibration problem in cross-domain semantic segmentation, providing a framework that improves both calibration quality and performance without additional inference costs, making it suitable for safety-critical applications.

Abstract: While existing unsupervised domain adaptation (UDA) methods greatly enhance target domain performance in semantic segmentation, they often neglect network calibration quality, resulting in misalignment between prediction confidence and actual accuracy – a significant risk in safety-critical applications. Our key insight emerges from observing that performance degrades substantially when soft pseudo-labels replace hard pseudo-labels in cross-domain scenarios due to poor calibration, despite the theoretical equivalence of perfectly calibrated soft pseudo-labels to hard pseudo-labels. Based on this finding, we propose DA-Cal, a dedicated cross-domain calibration framework that transforms target domain calibration into soft pseudo-label optimization. DA-Cal introduces a Meta Temperature Network to generate pixel-level calibration parameters and employs bi-level optimization to establish the relationship between soft pseudo-labels and UDA supervision, while utilizing complementary domain-mixing strategies to prevent overfitting and reduce domain discrepancies. Experiments demonstrate that DA-Cal seamlessly integrates with existing self-training frameworks across multiple UDA segmentation benchmarks, significantly improving target domain calibration while delivering performance gains without inference overhead. The code will be released.
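The soft pseudo-label calibration can be illustrated with per-pixel temperature scaling: dividing a pixel's logits by a temperature T > 1 softens its soft pseudo-label, while T < 1 sharpens it, without changing the predicted class. In DA-Cal the per-pixel T comes from the Meta Temperature Network; in this sketch it is simply an input array.

```python
import numpy as np

def soft_pseudo_labels(logits, temperature):
    """Per-pixel temperature scaling sketch: softmax(logits / T) with a
    spatially varying T, so each pixel's confidence can be calibrated
    independently. logits: (H, W, C); temperature: (H, W)."""
    z = logits / temperature[..., None]
    z = z - z.max(-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4, 5))
p_sharp = soft_pseudo_labels(logits, np.full((4, 4), 0.5))
p_soft = soft_pseudo_labels(logits, np.full((4, 4), 2.0))
# higher temperature -> lower peak confidence at every pixel,
# while the argmax (the hard pseudo-label) is unchanged
```

Because scaling by T preserves the argmax, hard pseudo-labels are untouched; only the confidence profile of the soft labels is recalibrated, which is exactly the quantity the framework optimizes.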

[163] MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification

Jiahao Xu, Sheng Huang, Xin Zhang, Zhixiong Nan, Jiajun Dong, Nankun Mu

Main category: cs.CV

TL;DR: MUSE framework improves few-shot whole slide image classification by refining sample-specific semantic priors and using retrieval-augmented multi-view textual descriptions for enhanced visual-semantic alignment.

DetailsMotivation: Current vision-language methods in computational pathology treat textual semantics as static class-level priors, lacking sample-wise refinement and limiting diversity/precision of visual-semantic alignment, which hinders few-shot generalization.

Method: Proposes MUSE with two components: 1) Sample-wise Fine-grained Semantic Enhancement (SFSE) using MoE-based adaptive visual-semantic interaction for sample-specific semantic priors, and 2) Stochastic Multi-view Model Optimization (SMMO) that retrieves and stochastically integrates multiple matched textual views from an LLM-generated knowledge base during training.

Result: Experiments on three benchmark WSI datasets show MUSE consistently outperforms existing vision-language baselines in few-shot settings, demonstrating improved generalization through active, sample-aware semantic optimization.

Conclusion: Effective few-shot pathology learning requires not only richer semantic sources but also their active and sample-aware semantic optimization, which MUSE achieves through stochastic multi-view semantic enhancement.

Abstract: In computational pathology, few-shot whole slide image classification is primarily driven by the extreme scarcity of expert-labeled slides. Recent vision-language methods incorporate textual semantics generated by large language models, but treat these descriptions as static class-level priors that are shared across all samples and lack sample-wise refinement. This limits both the diversity and precision of visual-semantic alignment, hindering generalization under limited supervision. To overcome this, we propose the stochastic MUlti-view Semantic Enhancement (MUSE), a framework that first refines semantic precision via sample-wise adaptation and then enhances semantic richness through retrieval-augmented multi-view generation. Specifically, MUSE introduces Sample-wise Fine-grained Semantic Enhancement (SFSE), which yields a fine-grained semantic prior for each sample through MoE-based adaptive visual-semantic interaction. Guided by this prior, Stochastic Multi-view Model Optimization (SMMO) constructs an LLM-generated knowledge base of diverse pathological descriptions per class, then retrieves and stochastically integrates multiple matched textual views during training. These dynamically selected texts serve as enriched semantic supervisions to stochastically optimize the vision-language model, promoting robustness and mitigating overfitting. Experiments on three benchmark WSI datasets show that MUSE consistently outperforms existing vision-language baselines in few-shot settings, demonstrating that effective few-shot pathology learning requires not only richer semantic sources but also their active and sample-aware semantic optimization. Our code is available at: https://github.com/JiahaoXu-god/CVPR2026_MUSE.
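The retrieve-then-sample step of SMMO can be illustrated with a toy stand-in: rank text-view embeddings by similarity to the sample-specific prior, then supervise each training step with a random subset of the top matches. The retrieval metric and sampling scheme below are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def sample_text_views(view_embeddings, prior, n_views=2, rng=None):
    """Retrieve text views similar to a sample-specific prior, then pick a
    random subset of them to supervise this training step.

    view_embeddings: (V, d) embeddings of LLM-generated class descriptions
    prior:           (d,)   sample-specific semantic prior (from SFSE)
    """
    rng = rng or np.random.default_rng()
    sims = view_embeddings @ prior
    retrieved = np.argsort(sims)[::-1][: 2 * n_views]          # top matches
    return rng.choice(retrieved, size=n_views, replace=False)  # stochastic pick

views = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
                  [0.0, 0.9, 0.1], [0.0, 0.0, 1.0], [0.1, 0.0, 0.9]])
prior = np.array([1.0, 0.0, 0.0])
chosen = sample_text_views(views, prior, rng=np.random.default_rng(1))
```

Because the subset is re-drawn every step, the model sees varied (but still prior-matched) textual supervision, which is the mechanism the paper credits for robustness and reduced overfitting.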

[164] When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance

Yongli Xiang, Ziming Hong, Zhaoqing Wang, Xiangyu Zhao, Bo Han, Tongliang Liu

Main category: cs.CV

TL;DR: CASG is a training-free safety guidance framework for text-to-image diffusion models that dynamically identifies and applies category-aligned safety directions to avoid harmful conflicts between different safety categories.

DetailsMotivation: Existing safety guidance methods for T2I diffusion models average safety directions across multiple harmful categories, leading to "harmful conflicts" where mitigating one type of harm may inadvertently amplify another, increasing overall harmful rate.

Method: Proposes Conflict-aware Adaptive Safety Guidance (CASG) with two components: (1) Conflict-aware Category Identification (CaCI) identifies the harmful category most aligned with the model’s evolving generative state, and (2) Conflict-resolving Guidance Application (CrGA) applies safety steering solely along the identified category to avoid multi-category interference.

Result: Experiments on T2I safety benchmarks show CASG achieves state-of-the-art performance, reducing harmful rate by up to 15.4% compared to existing methods.

Conclusion: CASG effectively addresses harmful conflicts in T2I safety guidance by dynamically identifying and applying category-specific safety directions, improving safety without retraining.

Abstract: Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to “harmful conflicts” where mitigating one type of harm may inadvertently amplify another, thus increasing overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model’s evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG’s state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.
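The CaCI/CrGA split can be sketched on a bare vector: score the current state against each category's safety direction, pick the single most-aligned category, and steer along that direction only. This is a toy sketch; the real components operate inside the diffusion sampler, and the scoring and steering rules here are assumptions.

```python
import numpy as np

def casg_step(state, category_dirs, scale=1.0):
    """One guidance step that steers only along the single most-aligned
    harmful-category direction, instead of an average over all categories.

    state:         (d,)   current generative state (toy vector here)
    category_dirs: (K, d) one safety direction per harmful category
    """
    dirs = category_dirs / np.linalg.norm(category_dirs, axis=1, keepdims=True)
    scores = dirs @ state                  # CaCI: alignment with each category
    k = int(np.argmax(scores))             # identify most-aligned category
    steered = state - scale * scores[k] * dirs[k]  # CrGA: steer along it only
    return steered, k

dirs = np.array([[1.0, 0.0], [0.0, 1.0]])  # two toy category directions
state = np.array([0.2, 0.9])               # mostly aligned with category 1
steered, k = casg_step(state, dirs)
```

Note that the component along the non-selected category is left untouched, which is exactly how single-direction steering avoids the multi-category interference the paper calls harmful conflicts.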

[165] SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models

Yuechen Xie, Xiaoyan Zhang, Yicheng Shan, Hao Zhu, Rui Tang, Rong Wei, Mingli Song, Yuanyu Wan, Jie Song

Main category: cs.CV

TL;DR: SpatiaLQA benchmark evaluates VLMs’ spatial logical reasoning in complex real-world scenes, showing current models struggle; proposed recursive scene graph method improves performance.

DetailsMotivation: Current Vision-Language Models lack spatial logical reasoning for complex real-world decision-making, requiring understanding of spatial relationships and logical dependencies in multi-step tasks.

Method: Introduces SpatiaLQA benchmark with 9,605 QA pairs from 241 indoor scenes; proposes recursive scene graph assisted reasoning using visual foundation models to decompose scenes into task-relevant graphs.

Result: Extensive experiments on 41 VLMs show even advanced models struggle with spatial logical reasoning; proposed method outperforms all previous approaches.

Conclusion: Spatial logical reasoning remains challenging for VLMs; recursive scene graph decomposition effectively enhances this capability, with SpatiaLQA providing a valuable benchmark.

Abstract: Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects in complex scenes, but also the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question-answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs, outperforming all previous methods. Code and dataset are available at https://github.com/xieyc99/SpatiaLQA.

[166] TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering

Hanshen Zhu, Yuliang Liu, Xuecheng Wu, An-Lan Wang, Hao Feng, Dingkang Yang, Chao Feng, Can Huang, Jingqun Tang, Xiang Bai

Main category: cs.CV

TL;DR: TextPecker: A plug-and-play RL strategy for improving visual text rendering in text-to-image models by addressing structural anomalies like distortion and misalignment that current MLLMs and OCR models fail to perceive.

DetailsMotivation: Visual Text Rendering (VTR) in text-to-image generation suffers from structural anomalies (distortion, blurriness, misalignment) that current MLLMs and OCR models cannot detect, creating a bottleneck for evaluation and optimization. Even state-of-the-art generators struggle with structurally faithful text rendering.

Method: Proposed TextPecker, a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals. Built a recognition dataset with character-level structural-anomaly annotations and developed a stroke-editing synthesis engine to expand structural-error coverage.

Result: TextPecker consistently improves diverse text-to-image models; on Qwen-Image, it yields average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing new SOTA in high-fidelity VTR.

Conclusion: The work fills a gap in VTR optimization, providing a foundational step towards reliable and structurally faithful visual text generation by addressing the perception gap in current multimodal models.

Abstract: Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. As a result, even state-of-the-art generators (e.g., SeedDream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand structural-error coverage. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it significantly yields average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state-of-the-art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step towards reliable and structurally faithful visual text generation.

[167] LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

Jihao Qiu, Lingxi Xie, Xinyue Huo, Qi Tian, Qixiang Ye

Main category: cs.CV

TL;DR: LongVideo-R1 is an active MLLM agent for efficient long video understanding that uses reasoning to selectively navigate video content rather than exhaustive processing.

DetailsMotivation: Addresses the challenge of long video understanding with low computational budgets, avoiding redundancy of exhaustive search through selective clip navigation.

Method: Uses a reasoning module with high-level visual cues to infer most informative video clips, hierarchical traversal from top-level summaries, fine-tuned on Qwen-3-8B with SFT and RL using specialized reward functions.

Result: Validated on multiple long video benchmarks, achieves superior tradeoff between QA accuracy and efficiency compared to exhaustive approaches.

Conclusion: LongVideo-R1 demonstrates effective selective navigation for long video understanding with computational efficiency, with code and data publicly available.

Abstract: This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of LongVideo-R1, which enjoys a superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available. Code and data are available at: https://github.com/qiujihao19/LongVideo-R1
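The navigation loop described above (start from top-level summaries, descend toward the most informative clip, stop early once the query is answerable) can be sketched as a greedy tree traversal. The two-callback interface is hypothetical; the paper's agent is an MLLM with tool calls, not plain Python functions.

```python
def navigate(root, relevant, answerable, max_steps=10):
    """Greedy top-down traversal of a clip hierarchy with early stopping.

    root:       {"summary": str, "children": [...]} tree of video clips
    relevant:   summary -> score; stands in for the reasoning module that
                infers the most informative clip
    answerable: summary -> bool; "enough knowledge to answer the query?"
    """
    node, visited = root, []
    for _ in range(max_steps):
        visited.append(node["summary"])
        # halt once the agent can answer, or at a leaf clip
        if answerable(node["summary"]) or not node["children"]:
            break
        node = max(node["children"], key=lambda c: relevant(c["summary"]))
    return visited

clip = lambda s, *kids: {"summary": s, "children": list(kids)}
video = clip("full video",
             clip("cooking scene"),
             clip("driving scene", clip("parking lot"), clip("highway chase")))
path = navigate(video,
                relevant=lambda s: ("driving" in s) + ("chase" in s),
                answerable=lambda s: "chase" in s)
```

The cost is one reasoning call per visited node rather than per clip, which is where the claimed accuracy/efficiency tradeoff comes from.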

[168] Computing a Characteristic Orientation for Rotation-Independent Image Analysis

Cristian Valero-Abundio, Emilio Sansano-Sansano, Raúl Montoliu, Marina Martínez García

Main category: cs.CV

TL;DR: GID is a preprocessing method that estimates global image orientation and aligns images to a canonical reference frame to improve rotation robustness in standard neural networks without architectural changes.

DetailsMotivation: Deep learning models lack inherent rotation invariance, requiring data augmentation or specialized architectures that increase computational demands and limit applicability. There's a need for methods that improve rotation robustness without modifying network structures.

Method: General Intensity Direction (GID) estimates a global orientation for each image using intensity direction analysis, then aligns the image to a canonical reference frame. Unlike moment-based approaches that extract invariant descriptors, GID directly transforms images while preserving spatial structure, making it compatible with standard convolutional networks.

Result: On rotated MNIST, GID achieves higher accuracy than state-of-the-art rotation-invariant architectures. Additional experiments on CIFAR-10 confirm effectiveness under more complex conditions.

Conclusion: GID provides an effective preprocessing approach for improving rotation robustness in computer vision tasks without requiring architectural modifications to neural networks, offering a practical solution for handling geometric transformations.

Abstract: Handling geometric transformations, particularly rotations, remains a challenge in deep learning for computer vision. Standard neural networks lack inherent rotation invariance and typically rely on data augmentation or architectural modifications to improve robustness. Although effective, these approaches increase computational demands, require specialised implementations, or alter network structures, limiting their applicability. This paper introduces General Intensity Direction (GID), a preprocessing method that improves rotation robustness without modifying the network architecture. The method estimates a global orientation for each image and aligns it to a canonical reference frame, allowing standard models to process inputs more consistently across different rotations. Unlike moment-based approaches that extract invariant descriptors, this method directly transforms the image while preserving spatial structure, making it compatible with convolutional networks. Experimental evaluation on the rotated MNIST dataset shows that the proposed method achieves higher accuracy than state-of-the-art rotation-invariant architectures. Additional experiments on the CIFAR-10 dataset confirm that the method remains effective under more complex conditions.
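A minimal version of the idea is easy to state: derive one characteristic angle per image from its intensity gradients, then rotate the image so that angle lands on a fixed canonical direction. The estimator below is a simplified stand-in for GID (the paper's estimator is not reproduced here); it only shows that the estimated angle co-rotates with the image, which is the property canonical alignment relies on.

```python
import numpy as np

def intensity_direction(img):
    """Estimate one global orientation per image from the mean intensity
    gradient; returns an angle in radians in (-pi, pi]."""
    gy, gx = np.gradient(img.astype(float))
    return float(np.arctan2(gy.sum(), gx.sum()))

ramp = np.tile(np.arange(8.0), (8, 1))           # brightness grows left->right
angle = intensity_direction(ramp)                # points along +x
angle_rot = intensity_direction(np.rot90(ramp))  # same image, rotated 90 deg
```

Rotating every input by minus its estimated angle would then map all rotated copies of an image to (approximately) the same canonical pose before the network sees them.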

[169] See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Jaehyun Park, Minyoung Ahn, Minkyu Kim, Jonghyun Lee, Jae-Gil Lee, Dongmin Park

Main category: cs.CV

TL;DR: ArtiAgent: An automated framework using three AI agents to create artifact-annotated image datasets by injecting artifacts into real images via diffusion transformer manipulation, enabling artifact mitigation research without human labeling.

DetailsMotivation: AI-generated images often contain visual artifacts that compromise realism. Current artifact-aware methods rely on costly human-labeled datasets that are difficult to scale, creating a need for automated approaches to generate artifact-annotated data for artifact mitigation research.

Method: Three-agent system: 1) Perception agent recognizes and grounds entities/subentities from real images, 2) Synthesis agent introduces artifacts via novel patch-wise embedding manipulation within a diffusion transformer using artifact injection tools, 3) Curation agent filters synthesized artifacts and generates local/global explanations for each instance.

Result: Successfully synthesized 100K images with rich artifact annotations. Demonstrated efficacy and versatility across diverse applications. Code is publicly available.

Conclusion: ArtiAgent provides an automated, scalable solution for creating artifact-annotated datasets, addressing the limitations of human-labeled data and enabling more effective artifact mitigation research in AI-generated imagery.

Abstract: Despite recent advances in diffusion models, AI generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and bigger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a highly crucial area of study. Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets. In this paper, we propose ArtiAgent, which efficiently creates pairs of real and artifact-injected images. It comprises three agents: a perception agent that recognizes and grounds entities and subentities from real images, a synthesis agent that introduces artifacts via artifact injection tools through novel patch-wise embedding manipulation within a diffusion transformer, and a curation agent that filters the synthesized artifacts and generates both local and global explanations for each instance. Using ArtiAgent, we synthesize 100K images with rich artifact annotations and demonstrate both efficacy and versatility across diverse applications. Code is available at link.

[170] Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting

Shuangkang Fang, I-Chao Shen, Xuanyang Zhang, Zesheng Wang, Yufeng Wang, Wenrui Ding, Gang Yu, Takeo Igarashi

Main category: cs.CV

TL;DR: DropAnSH-GS improves 3D Gaussian Splatting regularization by using anchor-based dropout to remove spatial neighbors and dropping higher-degree spherical harmonics to reduce overfitting.

DetailsMotivation: Existing 3DGS dropout methods suffer from neighbor compensation effects where dropped Gaussians are compensated by neighbors, weakening regularization. They also ignore the contribution of high-degree spherical harmonic coefficients to overfitting.

Method: Proposes DropAnSH-GS with two key innovations: 1) Anchor-based dropout that selects random Gaussians as anchors and simultaneously removes their spatial neighbors to disrupt local redundancies, 2) Extension of dropout to color attributes by randomly dropping higher-degree SH coefficients to concentrate appearance information in lower-degree SH.

Result: Substantially outperforms existing dropout methods with negligible computational overhead, can be integrated into various 3DGS variants to enhance performance, and enables flexible post-training model compression via SH truncation.

Conclusion: DropAnSH-GS effectively addresses limitations of existing 3DGS dropout methods by preventing neighbor compensation and handling SH overfitting, leading to more robust representations and better regularization.

Abstract: Recent 3D Gaussian Splatting (3DGS) Dropout methods address overfitting under sparse-view conditions by randomly nullifying Gaussian opacities. However, we identify a neighbor compensation effect in these approaches: dropped Gaussians are often compensated by their neighbors, weakening the intended regularization. Moreover, these methods overlook the contribution of high-degree spherical harmonic coefficients (SH) to overfitting. To address these issues, we propose DropAnSH-GS, a novel anchor-based Dropout strategy. Rather than dropping Gaussians independently, our method randomly selects certain Gaussians as anchors and simultaneously removes their spatial neighbors. This effectively disrupts local redundancies near anchors and encourages the model to learn more robust, globally informed representations. Furthermore, we extend the Dropout to color attributes by randomly dropping higher-degree SH to concentrate appearance information in lower-degree SH. This strategy further mitigates overfitting and enables flexible post-training model compression via SH truncation. Experimental results demonstrate that DropAnSH-GS substantially outperforms existing Dropout methods with negligible computational overhead, and can be readily integrated into various 3DGS variants to enhance their performances. Project Website: https://sk-fun.fun/DropAnSH-GS
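The anchor-based dropout above can be sketched directly: pick random anchors and zero the opacity of each anchor together with its k nearest spatial neighbors, so nearby Gaussians cannot compensate for the drop. This is a toy numpy sketch under assumed shapes; real 3DGS training applies it per iteration with gradients flowing, which is omitted here.

```python
import numpy as np

def anchor_dropout(positions, opacities, n_anchors=2, k=3, rng=None):
    """Zero the opacity of random anchor Gaussians AND their k nearest
    spatial neighbors.

    positions: (N, 3) Gaussian centers; opacities: (N,) values in [0, 1].
    """
    rng = rng or np.random.default_rng()
    out = opacities.copy()
    anchors = rng.choice(len(positions), size=n_anchors, replace=False)
    for a in anchors:
        dists = np.linalg.norm(positions - positions[a], axis=1)
        out[np.argsort(dists)[: k + 1]] = 0.0  # anchor itself + k nearest
    return out

rng = np.random.default_rng(0)
pos = rng.normal(size=(20, 3))
alpha = np.full(20, 0.8)
dropped = anchor_dropout(pos, alpha, rng=rng)
```

Dropping a whole spatial neighborhood at once is what forces the remaining Gaussians to carry globally informed, rather than locally redundant, structure.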

[171] UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling

Kaiyuan Tan, Yingying Shen, Mingfei Tu, Haohui Zhu, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun

Main category: cs.CV

TL;DR: UFO is a recurrent paradigm for efficient long-range 4D reconstruction of dynamic driving scenes that combines optimization-based and feed-forward methods, enabling fast reconstruction of 16-second driving logs with superior quality.

DetailsMotivation: Current feed-forward methods for 3D reconstruction struggle with long-range driving sequences due to quadratic complexity in sequence length and difficulties in modeling dynamic objects over extended durations, which is critical for autonomous driving simulation and closed-loop learning.

Method: Proposes a recurrent paradigm that maintains a 4D scene representation iteratively refined as new observations arrive, using visibility-based filtering to select informative scene tokens for efficient long-sequence processing, and introduces object pose-guided modeling for accurate long-range motion capture of dynamic objects.

Result: Outperforms both per-scene optimization and existing feed-forward methods on Waymo Open Dataset across various sequence lengths, reconstructing 16-second driving logs within 0.5 second while maintaining superior visual quality and geometric accuracy.

Conclusion: UFO provides an efficient solution for long-range 4D reconstruction of dynamic driving scenes, combining the benefits of optimization and feed-forward approaches to enable practical applications in autonomous driving simulation and learning.

Abstract: Dynamic driving scene reconstruction is critical for autonomous driving simulation and closed-loop learning. While recent feed-forward methods have shown promise for 3D reconstruction, they struggle with long-range driving sequences due to quadratic complexity in sequence length and challenges in modeling dynamic objects over extended durations. We propose UFO, a novel recurrent paradigm that combines the benefits of optimization-based and feed-forward methods for efficient long-range 4D reconstruction. Our approach maintains a 4D scene representation that is iteratively refined as new observations arrive, using a visibility-based filtering mechanism to select informative scene tokens and enable efficient processing of long sequences. For dynamic objects, we introduce an object pose-guided modeling approach that supports accurate long-range motion capture. Experiments on the Waymo Open Dataset demonstrate that our method significantly outperforms both per-scene optimization and existing feed-forward methods across various sequence lengths. Notably, our approach can reconstruct 16-second driving logs within 0.5 second while maintaining superior visual quality and geometric accuracy.
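The visibility-based token filtering can be illustrated with a crude view-cone test: keep only the scene tokens whose 3D position falls inside a cone around the camera direction. This is a stand-in under assumed geometry; the paper's actual filtering mechanism (and any occlusion handling) is not specified in this summary.

```python
import numpy as np

def visible_tokens(points, cam_pos, cam_dir, fov_cos=0.5):
    """Return indices of scene tokens inside a simple view cone.

    points:  (N, 3) token positions;  cam_pos: (3,) camera center
    cam_dir: (3,) unit view direction; fov_cos: cos of the cone half-angle
    """
    v = points - cam_pos
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return np.nonzero(v @ cam_dir >= fov_cos)[0]

pts = np.array([[5.0, 0.0, 0.0],    # straight ahead
                [5.0, 1.0, 0.0],    # slightly off-axis
                [-5.0, 0.0, 0.0]])  # behind the camera
keep = visible_tokens(pts, cam_pos=np.zeros(3), cam_dir=np.array([1.0, 0.0, 0.0]))
```

Processing only the surviving tokens per new observation is what keeps the recurrent update sub-quadratic in sequence length.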

[172] Are Multimodal Large Language Models Good Annotators for Image Tagging?

Ming-Kun Xie, Jia-Hao Xiao, Zhiqiang Kou, Zhongnian Li, Gang Niu, Masashi Sugiyama

Main category: cs.CV

TL;DR: TagLLM is a framework that uses Multimodal LLMs for automated image tagging, reducing annotation costs to 1/1000th of human costs while achieving 50-80% of human annotation quality and over 90% downstream task performance.

DetailsMotivation: Traditional image tagging relies on expensive human annotation. While MLLMs offer potential for automation, their capability to replace human annotators remains underexplored, with gaps between MLLM-generated and human annotations.

Method: TagLLM framework has two components: 1) Candidates generation using structured group-wise prompting to produce compact candidate sets covering true labels, and 2) Label disambiguation that interactively calibrates semantic concepts and refines candidate labels.

Result: MLLMs reduce annotation cost to 1/1000th of human cost, achieve 50-80% of human annotation quality, and over 90% downstream training performance. TagLLM closes 60-80% of the gap between MLLM and human annotations.

Conclusion: TagLLM substantially narrows the gap between MLLM-generated and human annotations, making MLLM-based annotation a viable replacement for manual labeling in image tagging tasks.

Abstract: Image tagging, a fundamental vision task, traditionally relies on human-annotated datasets to train multi-label classifiers, which incurs significant labor and costs. While Multimodal Large Language Models (MLLMs) offer promising potential to automate annotation, their capability to replace human annotators remains underexplored. This paper aims to analyze the gap between MLLM-generated and human annotations and to propose an effective solution that enables MLLM-based annotation to replace manual labeling. Our analysis of MLLM annotations reveals that, under a conservative estimate, MLLMs can reduce annotation cost to as low as one-thousandth of the human cost, mainly accounting for GPU usage, which is nearly negligible compared to manual efforts. Their annotation quality reaches about 50% to 80% of human performance, while achieving over 90% performance on downstream training tasks. Motivated by these findings, we propose TagLLM, a novel framework for image tagging, which aims to narrow the gap between MLLM-generated and human annotations. TagLLM comprises two components: candidates generation, which employs structured group-wise prompting to efficiently produce a compact candidate set that covers as many true labels as possible while reducing subsequent annotation workload; and label disambiguation, which interactively calibrates the semantic concept of categories in the prompts and effectively refines the candidate labels. Extensive experiments show that TagLLM substantially narrows the gap between MLLM-generated and human annotations, especially in downstream training performance, where it closes about 60% to 80% of the difference.
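The group-wise prompting step of candidates generation can be sketched as simple prompt batching: split the label space into small groups and build one query per group, so each MLLM call only reasons about a few categories at a time. The prompt wording below is invented for illustration; the paper's actual prompts are not given in this summary.

```python
def groupwise_prompts(categories, group_size=3):
    """Build one candidate-generation prompt per small group of labels."""
    groups = [categories[i:i + group_size]
              for i in range(0, len(categories), group_size)]
    return ["Which of the following appear in the image? " + ", ".join(g)
            for g in groups]

labels = ["dog", "cat", "car", "tree", "boat", "bird", "lamp"]
prompts = groupwise_prompts(labels)
```

The union of per-group answers then forms the compact candidate set that label disambiguation refines in a second pass.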

[173] CrystaL: Spontaneous Emergence of Visual Latents in MLLMs

Yang Zhang, Danyang Li, Yuxuan Li, Xin Zhang, Tianyu Xie, Mingming Cheng, Xiang Li

Main category: cs.CV

TL;DR: CrystaL is a single-stage framework that crystallizes latent representations into task-relevant visual semantics by aligning attention patterns and predictions between intact and corrupted image processing paths, improving fine-grained visual understanding in multimodal LLMs.

DetailsMotivation: Existing latent Chain-of-Thought methods in MLLMs use heuristically predefined supervision signals that provide limited guidance for preserving critical visual information in intermediate latent states, hindering fine-grained visual understanding.

Method: Proposes CrystaL with two parallel paths: one processes intact images, the other processes corrupted images. The framework explicitly aligns attention patterns and prediction distributions across both paths to crystallize latent representations into task-relevant visual semantics without needing auxiliary annotations or external modules.

Result: Extensive experiments on perception-intensive benchmarks show CrystaL consistently outperforms state-of-the-art baselines, achieving substantial gains in fine-grained visual understanding while maintaining robust reasoning capabilities.

Conclusion: CrystaL effectively addresses limitations of existing latent CoT methods by crystallizing latent representations into task-relevant visual semantics through cross-path alignment, enhancing multimodal LLMs’ visual understanding capabilities.

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance by integrating powerful language backbones with large-scale visual encoders. Among these, latent Chain-of-Thought (CoT) methods enable implicit reasoning in continuous hidden states, facilitating seamless vision-language integration and faster inference. However, existing heuristically predefined supervision signals in latent CoT provide limited guidance for preserving critical visual information in intermediate latent states. To address this limitation, we propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images, respectively. By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics, without relying on auxiliary annotations or external modules. Extensive experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines, achieving substantial gains in fine-grained visual understanding while maintaining robust reasoning capabilities.
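The cross-path alignment objective can be illustrated with two divergence terms: one pulling the corrupted-image path's prediction distribution toward the intact path's, and one doing the same for attention patterns. CrystaL's exact losses and weighting are not specified in this summary, so an unweighted sum of KL terms is assumed here purely for illustration.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def crystal_alignment(probs_clean, probs_corrupt, attn_clean, attn_corrupt):
    """Prediction-distribution term + attention-pattern term, both pulling
    the corrupted-image path toward the intact-image path."""
    return kl(probs_clean, probs_corrupt) + kl(attn_clean, attn_corrupt)

p = np.array([0.7, 0.2, 0.1])
a = np.array([0.5, 0.3, 0.2])
same = crystal_alignment(p, p, a, a)                      # paths agree
diff = crystal_alignment(p, np.array([0.1, 0.2, 0.7]), a, a)  # paths disagree
```

Minimizing such a loss only goes to zero when the corrupted path reproduces both what the intact path predicts and where it attends, which is the crystallization pressure on the latent states.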

[174] Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji

Main category: cs.CV

TL;DR: MMHNet: A hierarchical multimodal network using non-causal Mamba for long-form video-to-audio generation, enabling training on short clips but generating audio for videos longer than 5 minutes.

DetailsMotivation: Addressing the scaling challenge in multimodal-to-audio generation, particularly the difficulty of generating long-form audio from video due to limited data and text-frame mismatches. The paper explores whether models trained on short instances can generalize to longer ones during testing.

Method: Proposes MMHNet, an enhanced extension of state-of-the-art video-to-audio models that integrates hierarchical methods and non-causal Mamba to support long-form audio generation. The approach enables training on short clips while generating audio for much longer videos.

Result: Extends long-form audio generation to durations beyond 5 minutes, beats prior work on video-to-audio benchmarks, and demonstrates that training on short clips while testing on long videos is possible without training on longer durations.

Conclusion: MMHNet successfully addresses the scaling challenge in video-to-audio generation, enabling long-form audio generation from video while overcoming data limitations through hierarchical architecture and efficient sequence modeling with Mamba.

Abstract: Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To this end, we present MMHNet, a multimodal hierarchical network that extends state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Our proposed method significantly improves long audio generation, reaching durations beyond 5 minutes. We also show that training short and testing long is possible in video-to-audio generation tasks without training on the longer durations. Our experiments show that the proposed method achieves remarkable results on long-video-to-audio benchmarks, beating prior works in video-to-audio tasks. Moreover, we showcase our model's capability to generate more than 5 minutes of audio, while prior video-to-audio methods fall short at long durations.

[175] MIP Candy: A Modular PyTorch Framework for Medical Image Processing

Tianhao Fu, Yucheng Chen

Main category: cs.CV

TL;DR: MIPCandy is a PyTorch-based modular framework for medical image processing with deferred configuration, built-in cross-validation, and extensible model ecosystem.

DetailsMotivation: Medical image processing requires specialized software for high-dimensional volumetric data, but existing frameworks either need too much integration effort or are too rigid and monolithic.

Method: Provides complete modular pipeline with LayerT deferred configuration mechanism for runtime substitution of modules, built-in k-fold cross-validation, dataset inspection, deep supervision, experiment tracking, and extensible bundle ecosystem.

Result: An open-source framework under the Apache-2.0 license that lets researchers obtain a fully functional workflow by implementing just one method while retaining fine-grained control.

Conclusion: MIPCandy addresses the gap between low-level components and rigid pipelines in medical image processing with a flexible, modular PyTorch-based framework.

Abstract: Medical image processing demands specialized software that handles high-dimensional volumetric data, heterogeneous file formats, and domain-specific training procedures. Existing frameworks either provide low-level components that require substantial integration effort or impose rigid, monolithic pipelines that resist modification. We present MIP Candy (MIPCandy), a freely available, PyTorch-based framework designed specifically for medical image processing. MIPCandy provides a complete, modular pipeline spanning data loading, training, inference, and evaluation, allowing researchers to obtain a fully functional process workflow by implementing a single method, $\texttt{build_network}$, while retaining fine-grained control over every component. Central to the design is $\texttt{LayerT}$, a deferred configuration mechanism that enables runtime substitution of convolution, normalization, and activation modules without subclassing. The framework further offers built-in $k$-fold cross-validation, dataset inspection with automatic region-of-interest detection, deep supervision, exponential moving average, multi-frontend experiment tracking (Weights & Biases, Notion, MLflow), training state recovery, and validation score prediction via quotient regression. An extensible bundle ecosystem provides pre-built model implementations that follow a consistent trainer–predictor pattern and integrate with the core framework without modification. MIPCandy is open-source under the Apache-2.0 license and requires Python~3.12 or later. Source code and documentation are available at https://github.com/ProjectNeura/MIPCandy.
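The deferred-configuration idea behind LayerT can be sketched in a few lines of plain Python: record a layer class and its keyword arguments, instantiate only when called, and allow the class to be swapped at runtime without subclassing. This is a minimal illustration of the pattern, not MIPCandy's actual API; the `BatchNorm`/`InstanceNorm` classes below are hypothetical stand-ins for real normalization modules.

```python
class LayerT:
    """Minimal sketch of a deferred-configuration wrapper: it stores a
    layer class and kwargs, and only instantiates when called, so callers
    can substitute the class at runtime without subclassing the network."""
    def __init__(self, cls, **kwargs):
        self.cls, self.kwargs = cls, kwargs

    def substitute(self, cls, **kwargs):
        # Runtime swap: replace the class, merge in any new kwargs.
        self.cls, self.kwargs = cls, {**self.kwargs, **kwargs}
        return self

    def __call__(self, *args, **kwargs):
        # Deferred instantiation with call-site overrides.
        return self.cls(*args, **{**self.kwargs, **kwargs})

# Hypothetical normalization layers standing in for real nn modules.
class BatchNorm:
    def __init__(self, channels): self.name = f"BatchNorm({channels})"

class InstanceNorm:
    def __init__(self, channels): self.name = f"InstanceNorm({channels})"

norm = LayerT(BatchNorm)
print(norm(16).name)           # BatchNorm(16)
norm.substitute(InstanceNorm)  # swap at runtime, no subclassing
print(norm(16).name)           # InstanceNorm(16)
```

A network built against `norm` never names a concrete normalization class, which is what makes runtime substitution of convolution, normalization, and activation modules possible.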

[176] EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer

Munish Monga, Vishal Chudasama, Pankaj Wasnik, C. V. Jawahar

Main category: cs.CV

TL;DR: EW-DETR is a framework for evolving world object detection that handles incremental learning, domain adaptation, and unknown detection without prior data access, using DETR-based detectors with specialized modules.

DetailsMotivation: Real-world object detection systems need to operate in dynamic environments where new classes emerge, domains shift, and unknown objects appear, all without access to prior training data - a challenging combination of incremental learning, domain adaptation, and open-set detection.

Method: Proposes EW-DETR framework with three modules: 1) Incremental LoRA Adapters for exemplar-free incremental learning, 2) Query-Norm Objectness Adapter to decouple objectness-aware features from DETR decoder queries, and 3) Entropy-Aware Unknown Mixing for calibrated unknown detection. Also introduces FOGS evaluation metric.

Result: EW-DETR outperforms other methods on Pascal Series and Diverse Weather benchmarks, improving the holistic FOGS metric by 57.24% while generalizing across DETR-based detectors.

Conclusion: The EW-DETR framework effectively addresses the challenging Evolving World Object Detection problem by integrating incremental learning, domain adaptation, and unknown detection capabilities into DETR-based architectures.

Abstract: Real-world object detection must operate in evolving environments where new classes emerge, domains shift, and unseen objects must be identified as “unknown”: all without accessing prior data. We introduce Evolving World Object Detection (EWOD), a paradigm coupling incremental learning, domain adaptation, and unknown detection under exemplar-free constraints. To tackle EWOD, we propose EW-DETR framework that augments DETR-based detectors with three synergistic modules: Incremental LoRA Adapters for exemplar-free incremental learning under evolving domains; a Query-Norm Objectness Adapter that decouples objectness-aware features from DETR decoder queries; and Entropy-Aware Unknown Mixing for calibrated unknown detection. This framework generalises across DETR-based detectors, enabling state-of-the-art RF-DETR to operate effectively in evolving-world settings. We also introduce FOGS (Forgetting, Openness, Generalisation Score) to holistically evaluate performance across these dimensions. Extensive experiments on Pascal Series and Diverse Weather benchmarks show EW-DETR outperforms other methods, improving FOGS by 57.24%.
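The generic signal behind entropy-aware unknown detection is that a query matching no known class yields a near-uniform class posterior, i.e. high entropy. The sketch below shows that scoring rule in isolation; it is not EW-DETR's Entropy-Aware Unknown Mixing module, just the underlying idea.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def unknown_score(logits):
    """Normalized entropy of the class posterior, in [0, 1]. High values
    (near-uniform posterior) suggest the query matches no known class and
    is a candidate 'unknown' detection."""
    p = softmax(logits)
    ent = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return ent / np.log(p.shape[-1])  # divide by max entropy log(C)

confident = np.array([8.0, 0.0, 0.0, 0.0])  # peaked: looks like a known class
ambiguous = np.array([1.0, 1.0, 1.0, 1.0])  # uniform: likely unknown
print(unknown_score(confident) < 0.5 < unknown_score(ambiguous))  # True
```

Calibration then amounts to choosing a threshold on this score (or mixing such queries into training), so that unknowns are flagged without suppressing incremental known classes.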

[177] Cycle-Consistent Tuning for Layered Image Decomposition

Zheng Gu, Min Lu, Zhida Sun, Dani Lischinski, Daniel Cohen-Or, Hui Huang

Main category: cs.CV

TL;DR: A diffusion-based framework for disentangling visual layers in real-world images, focusing on logo-object decomposition with cycle-consistent tuning and progressive self-improvement.

DetailsMotivation: Disentangling visual layers in real-world images is challenging due to non-linear, globally coupled interactions like shading, reflection, and perspective distortion. Current methods struggle with complex layer interactions.

Method: In-context image decomposition framework using large diffusion foundation models. Fine-tunes pretrained diffusion model via lightweight LoRA adaptation. Introduces cycle-consistent tuning strategy that jointly trains decomposition and composition models with reconstruction consistency. Uses progressive self-improving process that iteratively augments training set with model-generated examples.

Result: Achieves accurate and coherent decompositions for logo-object separation. Generalizes effectively across other decomposition types. Demonstrates robustness in cases with complex layer interactions.

Conclusion: The approach shows potential as a unified framework for layered image decomposition, leveraging diffusion models’ capabilities for complex visual disentanglement tasks.

Abstract: Disentangling visual layers in real-world images is a persistent challenge in vision and graphics, as such layers often involve non-linear and globally coupled interactions, including shading, reflection, and perspective distortion. In this work, we present an in-context image decomposition framework that leverages large diffusion foundation models for layered separation. We focus on the challenging case of logo-object decomposition, where the goal is to disentangle a logo from the surface on which it appears while faithfully preserving both layers. Our method fine-tunes a pretrained diffusion model via lightweight LoRA adaptation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images. This bidirectional supervision substantially enhances robustness in cases where the layers exhibit complex interactions. Furthermore, we introduce a progressive self-improving process, which iteratively augments the training set with high-quality model-generated examples to refine performance. Extensive experiments demonstrate that our approach achieves accurate and coherent decompositions and also generalizes effectively across other decomposition types, suggesting its potential as a unified framework for layered image decomposition.
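The reconstruction-consistency objective can be written down in toy form: a decomposition model splits the image into layers, a composition model re-blends them, and the cycle loss penalizes any mismatch with the input. The snippet below assumes a trivially additive composition purely for illustration; the paper's models are diffusion networks and the composition is learned, not fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # toy 'photograph'

def decompose(x, alpha=0.3):
    """Toy stand-in for the decomposition model: split x into a 'logo'
    layer and a 'base' layer via a fixed mixing weight (hypothetical)."""
    return alpha * x, (1.0 - alpha) * x

def compose(logo, base, gain=1.0):
    """Toy stand-in for the composition model: additive re-blending,
    with `gain` modeling a composer that does not match the decomposer."""
    return gain * (logo + base)

def cycle_loss(x, gain):
    """Reconstruction consistency: recomposing the decomposed layers
    should reproduce the input image."""
    logo, base = decompose(x)
    return float(np.mean((compose(logo, base, gain) - x) ** 2))

print(cycle_loss(image, gain=1.0) < 1e-12)  # True: consistent pair
print(cycle_loss(image, gain=1.3) > 1e-3)   # True: mismatch is penalized
```

Jointly minimizing this loss over both models is what the cycle-consistent tuning strategy does; the bidirectional supervision keeps the two networks mutually calibrated even when layer interactions are complex.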

[178] VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

Bowen Zheng, Yongli Xiang, Ziming Hong, Zerong Lin, Chaojian Yu, Tongliang Liu, Xinge You

Main category: cs.CV

TL;DR: Visual Instruction Injection (VII) is a training-free jailbreaking framework that exploits image-to-video models by disguising malicious text intent as benign visual instructions in reference images, achieving high attack success rates on commercial models.

DetailsMotivation: The paper identifies a security vulnerability in image-to-video generation models where adversaries can exploit visual instruction-following capabilities to inject malicious intent through seemingly safe reference images, bypassing safety filters.

Method: VII uses two modules: Malicious Intent Reprogramming distills harmful intent from unsafe text prompts while minimizing static harmfulness, and Visual Instruction Grounding renders this intent onto safe input images while preserving semantic consistency with original unsafe prompts.

Result: Extensive experiments on four commercial I2V models (Kling-v2.5-turbo, Gemini Veo-3.1, Seedance-1.5-pro, PixVerse-V5) show VII achieves Attack Success Rates up to 83.5% while reducing Refusal Rates to near zero, significantly outperforming baselines.

Conclusion: The work reveals critical security vulnerabilities in multimodal models where visual instructions can be exploited for jailbreaking, highlighting the need for more robust safety mechanisms in image-to-video generation systems.

Abstract: Image-to-Video (I2V) generation models, which condition video generation on reference images, have shown emerging visual instruction-following capability, allowing certain visual cues in reference images to act as implicit control signals for video generation. However, this capability also introduces a previously overlooked risk: adversaries may exploit visual instructions to inject malicious intent through the image modality. In this work, we uncover this risk by proposing Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that intentionally disguises the malicious intent of unsafe text prompts as benign visual instructions in the safe reference image. Specifically, VII coordinates a Malicious Intent Reprogramming module to distill malicious intent from unsafe text prompts while minimizing their static harmfulness, and a Visual Instruction Grounding module to ground the distilled intent onto a safe input image by rendering visual instructions that preserve semantic consistency with the original unsafe text prompt, thereby inducing harmful content during I2V generation. Empirically, our extensive experiments on four state-of-the-art commercial I2V models (Kling-v2.5-turbo, Gemini Veo-3.1, Seedance-1.5-pro, and PixVerse-V5) demonstrate that VII achieves Attack Success Rates of up to 83.5% while reducing Refusal Rates to near zero, significantly outperforming existing baselines.

[179] Le-DETR: Revisiting Real-Time Detection Transformer with Efficient Encoder Design

Jiannan Huang, Aditya Kane, Fengzhe Zhou, Yunchao Wei, Humphrey Shi

Main category: cs.CV

TL;DR: Le-DETR: A low-cost, efficient real-time object detection transformer that achieves state-of-the-art performance with minimal pre-training overhead using efficient backbone design and local attention mechanisms.

DetailsMotivation: Current real-time DETR models require excessive pre-training overhead on backbones, making them difficult to reproduce from scratch and hindering research on novel backbone architectures. The authors aim to show that high performance can be achieved with low pre-training cost through good design.

Method: Proposes EfficientNAT backbones at various scales using efficient convolution and local attention mechanisms. Redesigns the hybrid encoder with local attention to enhance performance and inference speed. The resulting Le-DETR system uses only ImageNet1K and COCO2017 datasets, saving ~80% images in pre-training compared to previous methods.

Result: Le-DETR-M/L/X achieves 52.9/54.3/55.1 mAP on COCO Val2017 at 4.45/5.01/6.68 ms latency on an RTX 4090. It surpasses YOLOv12-L by +0.6 mAP at similar speed, trades -0.1 mAP against YOLOv12-X for a 20% speedup, and outperforms DEIM-D-FINE variants with better accuracy and competitive latency.

Conclusion: Well-designed real-time DETR models can achieve strong performance without complex, computationally expensive pre-training. Le-DETR sets new SOTA in real-time detection with significantly reduced pre-training requirements.

Abstract: Real-time object detection is crucial for real-world applications as it requires high accuracy with low latency. While Detection Transformers (DETR) have demonstrated significant performance improvements, current real-time DETR models are challenging to reproduce from scratch due to excessive pre-training overheads on the backbone, constraining research advancements by hindering the exploration of novel backbone architectures. In this paper, we want to show that by using general good design, it is possible to have \textbf{high performance} with \textbf{low pre-training cost}. After a thorough study of the backbone architecture, we propose EfficientNAT at various scales, which incorporates modern efficient convolution and local attention mechanisms. Moreover, we re-design the hybrid encoder with local attention, significantly enhancing both performance and inference speed. Based on these advancements, we present Le-DETR (\textbf{L}ow-cost and \textbf{E}fficient \textbf{DE}tection \textbf{TR}ansformer), which achieves a new \textbf{SOTA} in real-time detection using only ImageNet1K and COCO2017 training datasets, saving about 80% images in pre-training stage compared with previous methods. We demonstrate that with well-designed, real-time DETR models can achieve strong performance without the need for complex and computationally expensive pretraining. Extensive experiments show that Le-DETR-M/L/X achieves \textbf{52.9/54.3/55.1 mAP} on COCO Val2017 with \textbf{4.45/5.01/6.68 ms} on an RTX4090. It surpasses YOLOv12-L/X by \textbf{+0.6/-0.1 mAP} while achieving similar speed and \textbf{+20%} speedup. Compared with DEIM-D-FINE, Le-DETR-M achieves \textbf{+0.2 mAP} with slightly faster inference, and surpasses DEIM-D-FINE-L by \textbf{+0.4 mAP} with only \textbf{0.4 ms} additional latency. Code and weights will be open-sourced.
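Local attention, the mechanism the encoder redesign relies on, restricts each query to keys within a small positional window, cutting the quadratic cost of full attention. The single-head sketch below shows the generic idea; it is not Le-DETR's exact block (which uses neighborhood attention inside EfficientNAT), and a real implementation would be batched and vectorized.

```python
import numpy as np

def local_attention(q, k, v, window=2):
    """Single-head attention where each query position i only attends to
    keys in [i - window, i + window], the basic local-attention idea.
    Loops per position for clarity; real kernels vectorize this."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)  # scaled dot products
        w = np.exp(scores - scores.max())
        w /= w.sum()                             # softmax over the window
        out[i] = w @ v[lo:hi]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
y = local_attention(x, x, x, window=1)
print(y.shape)  # (6, 4)
```

Each output row mixes at most `2 * window + 1` value rows, so cost grows linearly in sequence length instead of quadratically, which is where the encoder's speedup comes from.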

[180] From Perception to Action: An Interactive Benchmark for Vision Reasoning

Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dylan Raharja, Soujanya Poria, Roy Ka-Wei Lee

Main category: cs.CV

TL;DR: CHAIN benchmark evaluates VLMs’ ability to understand physical structure and constraints for interactive 3D problem-solving tasks like mechanical puzzles and stacking.

DetailsMotivation: Current VLM evaluations focus on structure-agnostic, single-turn setups (like VQA) that don't assess agents' ability to reason about geometry, contact, and support relations in dynamic environments.

Method: Introduces CHAIN benchmark - an interactive 3D, physics-driven testbed with tasks like interlocking mechanical puzzles and 3D stacking/packing to evaluate models’ understanding, planning, and execution of structured action sequences.

Result: State-of-the-art VLMs and diffusion-based models struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions.

Conclusion: There’s a significant gap in VLMs’ ability to understand physical structure for interactive problem-solving, highlighting the need for benchmarks that move beyond passive perception to active physical reasoning.

Abstract: Understanding the physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents’ ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.

[181] OmniOCR: Generalist OCR for Ethnic Minority Languages

Bonan Liu, Zeyu Zhang, Bingbing Meng, Han Wang, Hanshuo Zhang, Chengping Wang, Daji Ergu, Ying Cai

Main category: cs.CV

TL;DR: OmniOCR: A universal OCR framework for ethnic minority scripts using Dynamic LoRA with sparsity regularization, achieving SOTA accuracy with superior parameter efficiency on low-resource scripts.

DetailsMotivation: Most OCR research focuses on well-resourced scripts like Latin and Chinese, leaving ethnic minority languages underexplored due to complex writing systems, scarce annotations, and diverse historical/modern forms, making generalization in low-resource or zero-shot settings challenging.

Method: Introduces OmniOCR with Dynamic Low-Rank Adaptation (Dynamic LoRA) to allocate model capacity across layers and scripts, enabling effective adaptation while preserving knowledge. Uses sparsity regularization to prune redundant updates for compact and efficient adaptation without extra inference cost.

Result: Outperforms zero-shot foundation models and standard post-training on TibetanMNIST, Shui, ancient Yi, and Dongba datasets. Achieves 39%-66% accuracy improvement over state-of-the-art baselines with superior parameter efficiency.

Conclusion: OmniOCR provides an effective solution for OCR on ethnic minority scripts, addressing low-resource challenges through dynamic adaptation and sparsity regularization while maintaining efficiency.

Abstract: Optical character recognition (OCR) has advanced rapidly with deep learning and multimodal models, yet most methods focus on well-resourced scripts such as Latin and Chinese. Ethnic minority languages remain underexplored due to complex writing systems, scarce annotations, and diverse historical and modern forms, making generalization in low-resource or zero-shot settings challenging. To address these challenges, we present OmniOCR, a universal framework for ethnic minority scripts. OmniOCR introduces Dynamic Low-Rank Adaptation (Dynamic LoRA) to allocate model capacity across layers and scripts, enabling effective adaptation while preserving knowledge. A sparsity regularization prunes redundant updates, ensuring compact and efficient adaptation without extra inference cost. Evaluations on TibetanMNIST, Shui, ancient Yi, and Dongba show that OmniOCR outperforms zero-shot foundation models and standard post-training, achieving state-of-the-art accuracy with superior parameter efficiency, and compared with the state-of-the-art baseline models, it improves accuracy by 39%-66% on these four datasets. Code: https://github.com/AIGeeksGroup/OmniOCR.
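The LoRA mechanism underlying the method keeps the pretrained weight frozen and trains only a low-rank update, W' = W + (alpha/r) * B @ A. The sketch below shows that update plus an L1 penalty standing in for the sparsity regularizer; the exact form of OmniOCR's regularizer and its dynamic per-layer rank allocation are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # zero-init so the update starts at 0

def adapted_weight(W, B, A, alpha=8.0, rank=4):
    """LoRA-style adaptation: W' = W + (alpha/rank) * B @ A. Only B and A
    are trained, so each script costs 2*d*r parameters instead of d*d."""
    return W + (alpha / rank) * (B @ A)

def sparsity_penalty(B, A, lam=1e-3):
    """L1 penalty on the low-rank factors, an assumed stand-in for the
    sparsity regularizer that prunes redundant updates."""
    return lam * (np.abs(B).sum() + np.abs(A).sum())

W_adapted = adapted_weight(W, B, A)
print(np.allclose(W_adapted, W))    # True: zero-init B means no change yet
print(sparsity_penalty(B, A) >= 0)  # True
```

Because the adapted weight can be merged into W once training ends, the pruned low-rank update adds no inference cost, which matches the paper's "without extra inference cost" claim.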

[182] OCR-Agent: Agentic OCR with Capability and Memory Reflection

Shimin Wen, Zeyu Zhang, Xingdou Bian, Hongjie Zhu, Lulu He, Layi Shama, Daji Ergu, Ying Cai

Main category: cs.CV

TL;DR: A self-correction framework for Vision-Language Models that enables iterative improvement through capability reflection and memory reflection mechanisms to address cognitive biases and repetitive errors.

DetailsMotivation: Current VLMs lack effective self-correction mechanisms, making them unable to independently rectify cognitive biases during multi-turn revisions, leading to repetitive and ineffective attempts without stable improvements in answer quality.

Method: Proposes an iterative self-correction framework with two key capabilities: Capability Reflection (diagnosing errors and generating correction plans) and Memory Reflection (reviewing past attempts to avoid repetition and explore new solutions), followed by rigorous re-reasoning to optimize answers.

Result: OCR-Agent outperforms current open-source SOTA model InternVL3-8B by +2.0 on English and +1.2 on Chinese subsets of OCRBench v2, achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5), surpassing even larger fine-tuned models.

Conclusion: Structured, self-aware reflection can significantly enhance VLMs’ reasoning robustness without additional training, demonstrating the effectiveness of self-correction mechanisms for improving multimodal understanding.

Abstract: Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally, optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on English and +1.2 on Chinese subsets, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5) - surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs’ reasoning robustness without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.
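The control flow of such a reflect-and-retry loop can be sketched independently of any model: a critic diagnoses the current answer (capability reflection), failed attempts go into a memory that conditions the next attempt (memory reflection). This is a generic skeleton, not OCR-Agent's prompts or VLM calls; `toy_answer` and `toy_critic` are hypothetical stand-ins.

```python
def self_correct(question, answer_fn, critic_fn, max_rounds=3):
    """Iterative self-correction with a memory of past attempts.
    `answer_fn(question, memory)` and `critic_fn(question, answer)` stand
    in for VLM calls; the critic returns None when no error is diagnosed."""
    memory = []                                  # past (answer, critique) pairs
    answer = answer_fn(question, memory)
    for _ in range(max_rounds):
        critique = critic_fn(question, answer)   # capability reflection
        if critique is None:                     # no error found: stop
            break
        memory.append((answer, critique))        # memory reflection: record
        answer = answer_fn(question, memory)     # retry, avoiding past attempts
    return answer, memory

# Toy stand-ins: the 'model' never repeats an answer ruled out by memory.
def toy_answer(q, memory):
    tried = {a for a, _ in memory}
    for cand in ["0CR", "OCF", "OCR"]:  # wrong guesses first, for illustration
        if cand not in tried:
            return cand
    return "?"

def toy_critic(q, a):
    return None if a == "OCR" else f"'{a}' misreads the text"

final, mem = self_correct("read the sign", toy_answer, toy_critic)
print(final)     # OCR
print(len(mem))  # 2 failed attempts were recorded and not repeated
```

The key property the memory buys is monotone progress: without it, the toy model would resubmit "0CR" every round, which is exactly the repetitive-attempt failure mode the paper describes.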

[183] XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence

Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser

Main category: cs.CV

TL;DR: XMorph is an explainable AI framework for brain tumor classification that combines boundary normalization with dual-channel explainability (visual GradCAM++ and LLM-generated textual rationales).

DetailsMotivation: Clinical adoption of deep learning for brain tumor diagnosis is limited by interpretability issues (models as "black boxes") and inability to quantify complex tumor boundaries, which are crucial for characterizing malignant growth.

Method: Proposes Information-Weighted Boundary Normalization (IWBN) to emphasize diagnostically relevant boundary regions, incorporates nonlinear chaotic and clinically validated features, and uses a dual-channel explainable AI module combining GradCAM++ visual cues with LLM-generated textual rationales.

Result: Achieves 96.0% classification accuracy for three brain tumor types (glioma, meningioma, pituitary tumors), demonstrating that explainability and high performance can co-exist in medical imaging AI.

Conclusion: XMorph provides an explainable and computationally efficient framework that bridges the gap between AI performance and clinical interpretability for brain tumor diagnosis.

Abstract: Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints. Conventional models often act as opaque ‘‘black boxes’’ and fail to quantify the complex, irregular tumor boundaries that characterize malignant growth. To address these challenges, we present XMorph, an explainable and computationally efficient framework for fine-grained classification of three prominent brain tumor types: glioma, meningioma, and pituitary tumors. We propose an Information-Weighted Boundary Normalization (IWBN) mechanism that emphasizes diagnostically relevant boundary regions alongside nonlinear chaotic and clinically validated features, enabling a richer morphological representation of tumor growth. A dual-channel explainable AI module combines GradCAM++ visual cues with LLM-generated textual rationales, translating model reasoning into clinically interpretable insights. The proposed framework achieves a classification accuracy of 96.0%, demonstrating that explainability and high performance can co-exist in AI-based medical imaging systems. The source code and materials for XMorph are all publicly available at: https://github.com/ALSER-Lab/XMorph.

[184] Optimizing Occupancy Sensor Placement in Smart Environments

Hao Lu, Richard J. Radke

Main category: cs.CV

TL;DR: Proposes automatic sensor placement method using integer linear programming to optimize privacy-preserving ToF sensor layouts for zone occupancy counting in commercial buildings.

DetailsMotivation: Need for real-time zone occupancy recognition in commercial buildings to enable energy savings through targeted lighting/HVAC, while preserving privacy and not impeding occupant activities.

Method: Simulates occupant trajectories, formulates sensor placement as an integer linear programming problem, and solves it with the branch-and-bound method to determine optimal sensor layouts and predict their counting accuracy.

Result: Demonstrates effectiveness through simulations of various office environments, showing ability to optimize sensor placement for given number of sensors.

Conclusion: The proposed automatic placement method removes the dependence on careful manual sensor placement in privacy-preserving ToF sensor networks for zone occupancy counting.

Abstract: Understanding the locations of occupants in a commercial built environment is critical for realizing energy savings by delivering lighting, heating, and cooling only where it is needed. The key to achieving this goal is being able to recognize zone occupancy in real time, without impeding occupants’ activities or compromising privacy. While low-resolution, privacy-preserving time-of-flight (ToF) sensor networks have demonstrated good performance in zone counting, the performance depends on careful sensor placement. To address this issue, we propose an automatic sensor placement method that determines optimal sensor layouts for a given number of sensors, and can predict the counting accuracy of such a layout. In particular, given the geometric constraints of an office environment, we simulate a large number of occupant trajectories. We then formulate the sensor placement problem as an integer linear programming (ILP) problem and solve it with the branch and bound method. We demonstrate the effectiveness of the proposed method based on simulations of several different office environments.
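The core combinatorial problem can be illustrated as maximum coverage: pick at most K candidate locations so the chosen sensors observe as many simulated trajectories as possible, solved exactly by branch and bound. This is a simplified toy of the paper's ILP (their formulation and objective are richer, and real instances would use a MILP solver), shown only to make the branch-and-bound pruning concrete.

```python
def branch_and_bound(coverage, K):
    """Tiny branch and bound for a max-coverage placement problem:
    `coverage[s]` is the set of trajectory ids that candidate location s
    observes; choose at most K locations to maximize trajectories covered.
    Returns the optimal number of covered trajectories."""
    n = len(coverage)
    best = [set()]  # incumbent: best covered set found so far

    def recurse(idx, chosen, covered):
        if len(covered) > len(best[0]):
            best[0] = set(covered)               # new incumbent
        if idx == n or len(chosen) == K:
            return
        # Bound: even taking every remaining sensor cannot beat the incumbent.
        optimistic = covered | set().union(*coverage[idx:])
        if len(optimistic) <= len(best[0]):
            return                               # prune this branch
        recurse(idx + 1, chosen + [idx], covered | coverage[idx])  # take s
        recurse(idx + 1, chosen, covered)                          # skip s

    recurse(0, [], set())
    return len(best[0])

# 4 candidate locations, 6 simulated trajectories, budget of 2 sensors.
coverage = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 5}]
print(branch_and_bound(coverage, K=2))  # 6: locations 0 and 2 cover everything
```

The optimistic-union bound is what makes this branch and bound rather than brute force: whole subtrees whose best possible coverage cannot beat the incumbent are discarded without enumeration.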

[185] Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction

Noé Artru, Rukhshanda Hussain, Emeline Got, Alexandre Messier, David B. Lindell, Abdallah Dib

Main category: cs.CV

TL;DR: Hybrid approach combining foundation models with optimization for high-fidelity 3D head reconstruction from images, achieving photogrammetry-quality results with fewer views and less computation.

DetailsMotivation: Existing 3D head reconstruction methods face trade-offs: photogrammetry needs many cameras and computation, foundation models lack detail, optimization methods require dense views. Need to bridge this gap for practical high-fidelity reconstruction.

Method: Multi-view surface normal prediction model extends monocular foundation models with cross-view attention for geometrically consistent normals, then uses these as priors in inverse rendering optimization to recover high-frequency details.

Result: Outperforms state-of-the-art single-image and multi-view methods, achieves high-fidelity reconstruction comparable to dense-view photogrammetry while reducing camera requirements and computational cost.

Conclusion: Hybrid approach successfully bridges gap between foundation models and optimization methods, enabling practical high-fidelity 3D head reconstruction with fewer resources.

Abstract: Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms. Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then leverage these predictions as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details. Our approach outperforms state-of-the-art single-image and multi-view methods, achieving high-fidelity reconstruction on par with dense-view photogrammetry while reducing camera requirements and computational cost. The code and model will be released.

[186] Event-Aided Sharp Radiance Field Reconstruction for Fast-Flying Drones

Rong Zou, Marco Cannici, Davide Scaramuzza

Main category: cs.CV

TL;DR: Event-image fusion with NeRF for high-fidelity 3D reconstruction from fast-flying drones, overcoming motion blur and pose noise.

DetailsMotivation: Fast-flying drones face motion blur in images and noisy pose estimates, making dense 3D reconstruction with NeRFs challenging due to sensitivity to such degradations

Method: Unified framework leveraging asynchronous event streams alongside motion-blurred frames, embedding event-image fusion into NeRF optimization, jointly refining event-based visual-inertial odometry priors using both event and frame modalities

Result: Recovers sharp radiance fields and accurate camera trajectories without ground-truth supervision, reconstructs high-fidelity radiance fields preserving fine scene details despite dynamic flights, delivers over 50% performance gain on real-world data compared to state-of-the-art methods

Conclusion: Event-image fusion enables high-quality 3D reconstruction from agile drone flights by overcoming motion blur and pose noise challenges.

Abstract: Fast-flying aerial robots promise rapid inspection under limited battery constraints, with direct applications in infrastructure inspection, terrain exploration, and search and rescue. However, high speeds lead to severe motion blur in images and induce significant drift and noise in pose estimates, making dense 3D reconstruction with Neural Radiance Fields (NeRFs) particularly challenging due to their high sensitivity to such degradations. In this work, we present a unified framework that leverages asynchronous event streams alongside motion-blurred frames to reconstruct high-fidelity radiance fields from agile drone flights. By embedding event-image fusion into NeRF optimization and jointly refining event-based visual-inertial odometry priors using both event and frame modalities, our method recovers sharp radiance fields and accurate camera trajectories without ground-truth supervision. We validate our approach on both synthetic data and real-world sequences captured by a fast-flying drone. Despite highly dynamic drone flights, where RGB frames are severely degraded by motion blur and pose priors become unreliable, our method reconstructs high-fidelity radiance fields and preserves fine scene details, delivering a performance gain of over 50% on real-world data compared to state-of-the-art methods.

[187] BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting

Jiaxing Yu, Dongyang Ren, Hangyu Xu, Zhouyuxiao Yang, Yuanqi Li, Jie Guo, Zhengkang Zhou, Yanwen Guo

Main category: cs.CV

TL;DR: BrepGaussian: A framework that learns 3D parametric B-rep representations from 2D images using Gaussian Splatting with a two-stage learning approach for geometry reconstruction and feature refinement.

DetailsMotivation: Recovering boundary representation (B-rep) models from unstructured data is challenging but valuable for computer vision and graphics. Existing deep learning methods depend on dense point clouds and struggle with generalization to novel shapes.

Method: Uses Gaussian Splatting renderer with learnable features and a specific fitting strategy. Introduces two-stage learning: first captures geometry and edges, then refines patch features for clean geometry and coherent instance representations.

Result: Extensive experiments demonstrate superior performance compared to state-of-the-art methods in recovering B-rep representations from 2D images.

Conclusion: BrepGaussian effectively learns 3D parametric representations from 2D images, addressing limitations of existing methods that require dense point clouds and struggle with generalization.

Abstract: The boundary representation (B-rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering B-rep representation from unstructured data is a challenging and valuable task of computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images. We employ a Gaussian Splatting renderer with learnable features, followed by a specific fitting strategy. To disentangle geometry reconstruction and feature learning, we introduce a two-stage learning framework that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations. Extensive experiments demonstrate the superior performance of our approach to state-of-the-art methods. We will release our code and datasets upon acceptance.

[188] UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

Joseph Raj Vishal, Nagasiri Poluri, Katha Naik, Rutuja Patil, Kashyap Hegde Kota, Krishna Vinod, Prithvi Jai Ramesh, Mohammad Farhadi, Yezhou Yang, Bharatesh Chakravarthi

Main category: cs.CV

TL;DR: Urban Dynamics VideoQA (UDVideoQA) is a benchmark dataset for evaluating video language models on real-world urban traffic scenes, featuring 28K QA pairs with hierarchical reasoning levels and privacy-preserving dynamic blur.

DetailsMotivation: To address the challenge of understanding complex multi-agent dynamics in urban traffic scenes and evaluate video language models' capabilities in both visual grounding and causal reasoning on unscripted real-world data.

Method: Curated 16 hours of traffic footage from multiple city intersections under diverse conditions, used event-driven dynamic blur for privacy, created 28K QA pairs via unified annotation pipeline with hierarchical reasoning taxonomy spanning basic understanding to counterfactual inference.

Result: Benchmarked 10 SOTA VideoLMs, revealing perception-reasoning gap; fine-tuned Qwen2.5-VL 7B achieved comparable performance to proprietary systems; in VideoQGen, Gemini 2.5 Pro and Qwen3 Max generated most relevant questions but with limited linguistic diversity.

Conclusion: UDVideoQA provides comprehensive benchmark for advancing robust, privacy-aware multimodal reasoning; highlights need for human-centric evaluation and bridging perception-reasoning gap in video language models.

Abstract: Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event-driven dynamic blur technique to ensure privacy preservation without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging one question per second. Its taxonomy follows a hierarchical reasoning level, spanning basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic evaluation of both visual grounding and causal reasoning. Comprehensive experiments benchmark 10 SOTA VideoLMs on UDVideoQA and 8 models on a complementary video question generation benchmark. Results reveal a persistent perception-reasoning gap, showing models that excel in abstract inference often fail with fundamental visual grounding. While models like Gemini Pro achieve the highest zero-shot accuracy, fine-tuning the smaller Qwen2.5-VL 7B model on UDVideoQA bridges this gap, achieving performance comparable to proprietary systems. In VideoQGen, Gemini 2.5 Pro, and Qwen3 Max generate the most relevant and complex questions, though all models exhibit limited linguistic diversity, underscoring the need for human-centric evaluation. The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy-aware, and real-world multimodal reasoning. UDVideoQA is available at https://ud-videoqa.github.io/UD-VideoQA/UD-VideoQA/.

[189] SynthRender and IRIS: Open-Source Framework and Dataset for Bidirectional Sim-Real Transfer in Industrial Object Perception

Jose Moises Araya-Martinez, Thushar Tom, Adrián Sanchis Reig, Pablo Rey Valiente, Jens Lambrecht, Jörg Krüger

Main category: cs.CV

TL;DR: SynthRender: Open-source framework for synthetic image generation with Guided Domain Randomization for industrial object perception, benchmarked with Reality-to-Simulation techniques and evaluated on industrial datasets.

DetailsMotivation: Industrial object perception requires large annotated datasets which are costly to acquire for proprietary parts. Synthetic data generation can overcome this barrier but needs effective domain randomization for real-world transfer.

Method: Developed SynthRender framework with Guided Domain Randomization for synthetic image generation. Benchmarked Reality-to-Simulation techniques for creating 3D assets from 2D images. Introduced IRIS dataset with 32 industrial categories and 20,000 labels for evaluation.

Result: Achieved 99.1% mAP@50 on public robotics dataset, 98.3% mAP@50 on automotive benchmark, and 95.3% mAP@50 on IRIS dataset, surpassing existing approaches.

Conclusion: SynthRender with Guided Domain Randomization provides efficient synthetic data generation for industrial object perception, enabling robust automation without costly real data collection for proprietary parts.

Abstract: Object perception is fundamental for tasks such as robotic material handling and quality inspection. However, modern supervised deep-learning perception models require large datasets for robust automation under semi-uncontrolled conditions. The cost of acquiring and annotating such data for proprietary parts is a major barrier for widespread deployment. In this context, we release SynthRender, an open source framework for synthetic image generation with Guided Domain Randomization capabilities. Furthermore, we benchmark recent Reality-to-Simulation techniques for 3D asset creation from 2D images of real parts. Combined with Domain Randomization, these synthetic assets provide low-overhead, transferable data even for parts lacking 3D files. We also introduce IRIS, the Industrial Real-Sim Imagery Set, containing 32 categories with diverse textures, intra-class variation, strong inter-class similarities and about 20,000 labels. Ablations on multiple benchmarks outline guidelines for efficient data generation with SynthRender. Our method surpasses existing approaches, achieving 99.1% mAP@50 on a public robotics dataset, 98.3% mAP@50 on an automotive benchmark, and 95.3% mAP@50 on IRIS.
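Guided Domain Randomization steers parameter sampling toward scene-informed priors rather than drawing from unguided flat ranges. A hypothetical sketch of that idea (the parameter names, nominal values, and `spread` factor are invented for illustration, not taken from SynthRender):

```python
import random

def sample_render_params(prior, spread=0.2, rng=None):
    """Draw one set of rendering parameters near a scene-informed prior:
    each value varies by at most `spread` (a fraction of its nominal
    value) instead of being drawn from an unguided flat range."""
    rng = rng or random.Random()
    return {
        name: rng.uniform(value * (1 - spread), value * (1 + spread))
        for name, value in prior.items()
    }

# Nominal values estimated from real reference images (invented here).
prior = {"light_intensity": 800.0, "camera_distance": 1.2, "roughness": 0.4}
params = sample_render_params(prior, rng=random.Random(0))
```

Each call yields one randomized scene configuration; rendering many such draws produces a labeled synthetic training set without real-image annotation.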

[190] LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis

Zhifan Jiang, Dong Yang, Vishwesh Nath, Abhijeet Parida, Nishad P. Kulkarni, Ziyue Xu, Daguang Xu, Syed Muhammad Anwar, Holger R. Roth, Marius George Linguraru

Main category: cs.CV

TL;DR: LUMEN: A training framework for longitudinal chest X-ray interpretation using multi-image, multi-task instruction fine-tuning to enhance prognostic and diagnostic capabilities in radiology VQA.

DetailsMotivation: Radiologists need to analyze temporal changes in longitudinal imaging for accurate diagnosis and prognosis, but manual longitudinal analysis is time-consuming. Current VLMs lack specialized training for longitudinal radiological interpretation.

Method: Introduces LUMEN framework with multi-image and multi-task instruction fine-tuning for longitudinal CXR interpretation. Creates novel instruction-following dataset incorporating longitudinal studies and prognostic VQA tasks. Uses MIMIC-CXR and Medical-Diff-VQA datasets.

Result: Significant improvements over baseline models in diagnostic VQA tasks, and shows promising potential for prognostic capabilities in longitudinal radiological interpretation.

Conclusion: Well-designed, instruction-tuned VLMs can enable more accurate and clinically meaningful interpretation of longitudinal radiological imaging data, with potential for decision support in radiology.

Abstract: Large vision-language models (VLMs) have evolved from general-purpose applications to specialized use cases such as in the clinical domain, demonstrating potential for decision support in radiology. One promising application is assisting radiologists in decision-making by the analysis of radiology imaging data such as chest X-rays (CXR) via a visual and natural language question-answering (VQA) interface. When longitudinal imaging is available, radiologists analyze temporal changes, which are essential for accurate diagnosis and prognosis. The manual longitudinal analysis is a time-consuming process, motivating the development of a training framework that can provide prognostic capabilities. We introduce a novel training framework LUMEN, that is optimized for longitudinal CXR interpretation, leveraging multi-image and multi-task instruction fine-tuning to enhance prognostic and diagnostic performance. We conduct experiments on the publicly available MIMIC-CXR and its associated Medical-Diff-VQA datasets. We further formulate and construct a novel instruction-following dataset incorporating longitudinal studies, enabling the development of a prognostic VQA task. Our method demonstrates significant improvements over baseline models in diagnostic VQA tasks, and more importantly, shows promising potential for prognostic capabilities. These results underscore the value of well-designed, instruction-tuned VLMs in enabling more accurate and clinically meaningful radiological interpretation of longitudinal radiological imaging data.
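The instruction-following dataset pairs multiple studies of the same patient with a prognostic question. One record might look like the following (the schema and field names are hypothetical; the paper builds on MIMIC-CXR and Medical-Diff-VQA but does not specify a schema here):

```python
import json

# Hypothetical schema for one longitudinal, multi-image instruction
# sample: a prior and a current study of the same patient plus a
# prognostic question. Field names are invented for illustration.
record = {
    "images": ["cxr_prior.png", "cxr_current.png"],
    "task": "prognosis",
    "question": "Compared with the prior study, is the pleural effusion "
                "more likely to worsen or resolve?",
    "answer": "The effusion has decreased in the interval and is likely "
              "to resolve.",
}
serialized = json.dumps(record)
```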

[191] SPRITETOMESH: Automatic Mesh Generation for 2D Skeletal Animation Using Learned Segmentation and Contour-Aware Vertex Placement

Bastien Gimbert

Main category: cs.CV

TL;DR: SPRITETOMESH automatically converts 2D game sprites into animation-ready triangle meshes using hybrid learned segmentation and algorithmic vertex placement.

DetailsMotivation: Manual creation of animation-ready meshes from 2D sprites is tedious and time-consuming (15-60 minutes per sprite), requiring artists to carefully place vertices along visual boundaries. There's a need for automation to speed up game development workflows.

Method: Hybrid learned-algorithmic approach: 1) Segmentation network (EfficientNet-B0 encoder with U-Net decoder) trained on 100k+ sprite-mask pairs for accurate binary masks, 2) Algorithmic vertex placement using Douglas-Peucker simplification for exterior contours and bilateral-filtered multi-channel Canny edge detection for interior boundaries, 3) Delaunay triangulation with mask-based centroid filtering for final mesh generation.

Result: Achieves 0.87 IoU segmentation accuracy, processes sprites in under 3 seconds (300x-1200x speedup over manual creation). Key finding: direct neural network vertex prediction via heatmap regression fails to converge (loss plateau at 0.061), validating the hybrid approach.

Conclusion: Hybrid learned-algorithmic approach is effective where learned segmentation handles unambiguous tasks and algorithmic methods handle domain-specific heuristics. The pipeline enables rapid automated mesh generation for game development.

Abstract: We present SPRITETOMESH, a fully automatic pipeline for converting 2D game sprite images into triangle meshes compatible with skeletal animation frameworks such as Spine2D. Creating animation-ready meshes is traditionally a tedious manual process requiring artists to carefully place vertices along visual boundaries, a task that typically takes 15-60 minutes per sprite. Our method addresses this through a hybrid learned-algorithmic approach. A segmentation network (EfficientNet-B0 encoder with U-Net decoder) trained on over 100,000 sprite-mask pairs from 172 games achieves an IoU of 0.87, providing accurate binary masks from arbitrary input images. From these masks, we extract exterior contour vertices using Douglas-Peucker simplification with adaptive arc subdivision, and interior vertices along visual boundaries detected via bilateral-filtered multi-channel Canny edge detection with contour-following placement. Delaunay triangulation with mask-based centroid filtering produces the final mesh. Through controlled experiments, we demonstrate that direct vertex position prediction via neural network heatmap regression is fundamentally not viable for this task: the heatmap decoder consistently fails to converge (loss plateau at 0.061) while the segmentation decoder trains normally under identical conditions. We attribute this to the inherently artistic nature of vertex placement - the same sprite can be meshed validly in many different ways. This negative result validates our hybrid design: learned segmentation where ground truth is unambiguous, algorithmic placement where domain heuristics are appropriate. The complete pipeline processes a sprite in under 3 seconds, representing a speedup of 300x-1200x over manual creation. We release our trained model to the game development community.
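Douglas-Peucker simplification, used here for the exterior contour, keeps only vertices that deviate from a chord by more than a tolerance. A minimal pure-Python version of the standard algorithm (the authors additionally use adaptive arc subdivision, omitted here):

```python
import math

def douglas_peucker(points, epsilon):
    """Simplify a polyline: keep the endpoints plus any point farther
    than `epsilon` from the chord joining them, recursing on both
    halves around the farthest point."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy) or 1.0  # guard degenerate/closed chords
    # Perpendicular distance of each interior point to the chord.
    dists = [abs(dy * (x - x1) - dx * (y - y1)) / norm for x, y in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[i - 1] <= epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[: i + 1], epsilon)
    return left[:-1] + douglas_peucker(points[i:], epsilon)  # drop duplicate split point

# A noisy horizontal run followed by a vertical rise collapses to 3 vertices.
corners = douglas_peucker([(0, 0), (1, 0.05), (2, -0.05), (3, 0), (3, 1), (3, 2)], 0.1)
```

Raising `epsilon` trades boundary fidelity for fewer mesh vertices, which is the knob such a pipeline exposes per sprite.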

[192] Seeing Through Words: Controlling Visual Retrieval Quality with Language Models

Jianglin Lu, Simon Jenni, Kushal Kafle, Jing Shi, Handong Zhao, Yun Fu

Main category: cs.CV

TL;DR: A quality-controllable retrieval framework that enriches short text queries using generative language models with quality conditioning to improve text-to-image retrieval results.

DetailsMotivation: Real-world text-to-image retrieval faces challenges with short, underspecified user queries that are semantically ambiguous, prone to visual interpretation collisions, and lack quality control over retrieved images.

Method: Proposes quality-controllable retrieval using generative language models as query completion functions. Extends short queries into descriptive forms capturing fine-grained visual attributes, conditioned on discretized quality levels derived from relevance and aesthetic scoring models.

Result: Extensive experiments show significant improvement in retrieval results with effective quality control, bridging the gap between modern VLMs’ expressive capacity and short query limitations.

Conclusion: The framework provides flexibility (compatible with any pretrained VLM), transparency (interpretable enriched queries), and controllability (steering retrieval toward preferred quality levels).

Abstract: Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semantically meaningful but also quality-aware. The resulting system provides three key advantages: 1) flexibility, it is compatible with any pretrained vision-language model (VLMs) without modification; 2) transparency, enriched queries are explicitly interpretable by users; and 3) controllability, enabling retrieval results to be steered toward user-preferred quality levels. Extensive experiments demonstrate that our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries. Our code is available at https://github.com/Jianglin954/QCQC.
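The quality-conditioning step discretizes continuous scores into levels and folds the level into the completion instruction. A hypothetical sketch of that mechanism (the thresholds, prompt wording, and function names are all assumptions, not the paper's implementation):

```python
def quality_level(aesthetic_score, thresholds=(0.33, 0.66)):
    """Discretize a [0, 1] aesthetic score into three coarse levels.
    Thresholds are placeholders; the paper derives levels from
    relevance and aesthetic scoring models."""
    if aesthetic_score < thresholds[0]:
        return "low"
    if aesthetic_score < thresholds[1]:
        return "medium"
    return "high"

def completion_prompt(query, level):
    """Build the instruction handed to the language model; the wording
    is a hypothetical stand-in for the paper's actual prompt."""
    return (
        f"Expand the image-search query '{query}' into a detailed "
        f"description of pose, scene, and aesthetics, targeting "
        f"{level}-quality photographs."
    )

prompt = completion_prompt("sunset", quality_level(0.9))
```

The enriched prompt output is then embedded by any pretrained VLM in place of the raw short query, which is what makes the framework model-agnostic.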

[193] Mask-HybridGNet: Graph-based segmentation with emergent anatomical correspondence from pixel-level supervision

Nicolás Gaggion, Maria J. Ledesma-Carbayo, Stergios Christodoulidis, Maria Vakalopoulou, Enzo Ferrante

Main category: cs.CV

TL;DR: Mask-HybridGNet trains graph-based medical image segmentation models using standard pixel-wise masks instead of manual landmarks, enabling anatomical correspondence learning without explicit supervision.

DetailsMotivation: Graph-based segmentation provides anatomical landmarks and population correspondences but requires manually annotated landmarks with point-to-point correspondences, which rarely exist in practice. The paper aims to overcome this limitation by using readily available pixel-wise masks.

Method: Combines Chamfer distance supervision and edge-based regularization to align variable-length ground truth boundaries with fixed-length landmark predictions, refined via differentiable rasterization. Uses a fixed graph adjacency matrix to enforce anatomical plausibility.

Result: Achieves competitive results against state-of-the-art pixel-based methods across chest radiography, cardiac ultrasound, cardiac MRI, and fetal imaging. Emergent property: predicted landmarks become consistently associated with specific anatomical locations across patients without explicit correspondence supervision.

Conclusion: The framework leverages standard segmentation masks to build structured models that maintain topological integrity and provide implicit correspondences, enabling temporal tracking, cross-slice reconstruction, and morphological population analyses.

Abstract: Graph-based medical image segmentation represents anatomical structures using boundary graphs, providing fixed-topology landmarks and inherent population-level correspondences. However, their clinical adoption has been hindered by a major requirement: training datasets with manually annotated landmarks that maintain point-to-point correspondences across patients rarely exist in practice. We introduce Mask-HybridGNet, a framework that trains graph-based models directly using standard pixel-wise masks, eliminating the need for manual landmark annotations. Our approach aligns variable-length ground truth boundaries with fixed-length landmark predictions by combining Chamfer distance supervision and edge-based regularization to ensure local smoothness and regular landmark distribution, further refined via differentiable rasterization. A significant emergent property of this framework is that predicted landmark positions become consistently associated with specific anatomical locations across patients without explicit correspondence supervision. This implicit atlas learning enables temporal tracking, cross-slice reconstruction, and morphological population analyses. Beyond direct segmentation, Mask-HybridGNet can extract correspondences from existing segmentation masks, allowing it to generate stable anatomical atlases from any high-quality pixel-based model. Experiments across chest radiography, cardiac ultrasound, cardiac MRI, and fetal imaging demonstrate that our model achieves competitive results against state-of-the-art pixel-based methods, while ensuring anatomical plausibility by enforcing boundary connectivity through a fixed graph adjacency matrix. This framework leverages the vast availability of standard segmentation masks to build structured models that maintain topological integrity and provide implicit correspondences.
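The Chamfer distance at the core of this supervision has a compact definition. A minimal pure-Python sketch (the paper combines it with edge-based regularization and differentiable rasterization, which are omitted here):

```python
def chamfer_distance(set_a, set_b):
    """Symmetric Chamfer distance between two 2D point sets: average
    squared distance from each point to its nearest neighbour in the
    other set, summed over both directions. The sets may have different
    lengths, which is what lets a fixed number of predicted landmarks
    be supervised by a variable-length ground-truth boundary."""
    def one_way(src, dst):
        return sum(
            min((x - u) ** 2 + (y - v) ** 2 for u, v in dst) for x, y in src
        ) / len(src)
    return one_way(set_a, set_b) + one_way(set_b, set_a)
```

Because nearest-neighbour matching imposes no ordering, the loss alone does not fix which landmark lands where; the paper's emergent anatomical correspondence arises on top of this objective.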

[194] Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning

Haoyi Jiang, Liu Liu, Xinjie Wang, Yonghao He, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang

Main category: cs.CV

TL;DR: Spa3R learns unified spatial representations from unposed multi-view images via Predictive Spatial Field Modeling, enabling VLMs to reason about 3D scenes without explicit 3D data or spatial instruction tuning.

DetailsMotivation: Current VLMs have limited 3D spatial understanding, relying on explicit 3D modalities or partial geometric priors that hinder scalability and burden language models with implicit 3D reconstruction tasks.

Method: Proposes Spa3R framework with Predictive Spatial Field Modeling (PSFM) that learns view-invariant spatial representations from unposed multi-view images by synthesizing feature fields for arbitrary unseen views. Integrates pre-trained Spa3R Encoder into VLMs via lightweight adapter to form Spa3-VLM.

Result: Spa3-VLM achieves state-of-the-art 58.6% accuracy on VSI-Bench for 3D VQA, significantly outperforming prior methods.

Conclusion: PSFM provides a scalable path for advancing spatial intelligence in VLMs by enabling them to learn holistic 3D understanding directly from 2D vision without explicit spatial instruction tuning.

Abstract: While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space–a cornerstone of spatial intelligence–remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the proposed Predictive Spatial Field Modeling (PSFM) paradigm, where Spa3R learns to synthesize feature fields for arbitrary unseen views conditioned on a compact latent representation, thereby internalizing a holistic and coherent understanding of the underlying 3D scene. We further integrate the pre-trained Spa3R Encoder into existing VLMs via a lightweight adapter to form Spa3-VLM, effectively grounding language reasoning in a global spatial context. Experiments on the challenging VSI-Bench demonstrate that Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods. These results highlight PSFM as a scalable path toward advancing spatial intelligence. Code is available at https://github.com/hustvl/Spa3R.

[195] Human Video Generation from a Single Image with 3D Pose and View Control

Tiantian Wang, Chun-Han Yao, Tao Hu, Mallikarjun Byrasandra Ramalinga Reddy, Ming-Hsuan Yang, Varun Jampani

Main category: cs.CV

TL;DR: HVG is a latent video diffusion model that generates high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control, addressing challenges in view-consistent clothing wrinkle generation.

DetailsMotivation: Current diffusion methods struggle with image-to-video synthesis for human generation, particularly in inferring view-consistent, motion-dependent clothing wrinkles from single images. There's a need for better control over 3D pose and view while maintaining multi-view consistency.

Method: HVG uses three key designs: 1) Articulated Pose Modulation with dual-dimensional bone maps and 3D information to handle anatomical relationships and self-occlusions; 2) View and Temporal Alignment for multi-view consistency and frame-to-frame stability; 3) Progressive Spatio-Temporal Sampling with temporal alignment for smooth transitions in long animations.

Result: Extensive experiments show HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs, demonstrating superior performance in image-to-video tasks.

Conclusion: HVG successfully addresses the challenges of generating view-consistent human videos from single images with pose and view control, advancing the state-of-the-art in human video generation through innovative pose modulation and alignment techniques.

Abstract: Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment to maintain smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.

[196] Region of Interest Segmentation and Morphological Analysis for Membranes in Cryo-Electron Tomography

Xingyi Cheng, Julien Maufront, Aurélie Di Cicco, Daniël M. Pelt, Manuela Dezi, Daniel Lévy

Main category: cs.CV

TL;DR: TomoROIS-SurfORA: A two-step framework for direct ROI segmentation and surface analysis in cryo-electron tomography, enabling quantitative morphological analysis of complex membrane structures.

DetailsMotivation: Current cryo-ET analysis typically requires full structure segmentation before ROI analysis, which is inefficient for continuous complex structures like membranes. There's a need for direct ROI segmentation and quantitative morphological analysis tools.

Method: Two-step framework: 1) TomoROIS uses deep learning for direct, shape-agnostic ROI segmentation trained from small annotated datasets; 2) SurfORA processes segmented structures as point clouds and surface meshes to extract quantitative features like inter-membrane distances, curvature, and surface roughness.

Result: Successfully demonstrated on in vitro reconstituted membrane systems with complex geometries, enabling automatic quantitative analysis of membrane contact sites and remodeling events like invagination.

Conclusion: The combined approach provides practical tools for direct ROI detection and surface analysis in cryo-ET, with broader applicability to scientific imaging contexts beyond membrane data.

Abstract: Cryo-electron tomography (cryo-ET) enables high resolution, three-dimensional reconstruction of biological structures, including membranes and membrane proteins. Identification of regions of interest (ROIs) is central to scientific imaging, as it enables isolation and quantitative analysis of specific structural features within complex datasets. In practice, however, ROIs are typically derived indirectly through full structure segmentation followed by post hoc analysis. This limitation is especially apparent for continuous and geometrically complex structures such as membranes, which are segmented as single entities. Here, we developed TomoROIS-SurfORA, a two step framework for direct, shape-agnostic ROI segmentation and morphological surface analysis. TomoROIS performs deep learning-based ROI segmentation and can be trained from scratch using small annotated datasets, enabling practical application across diverse imaging data. SurfORA processes segmented structures as point clouds and surface meshes to extract quantitative morphological features, including inter-membrane distances, curvature, and surface roughness. It supports both closed and open surfaces, with specific considerations for open surfaces, which are common in cryo-ET due to the missing wedge effect. We demonstrate both tools using in vitro reconstituted membrane systems containing deformable vesicles with complex geometries, enabling automatic quantitative analysis of membrane contact sites and remodeling events such as invagination. While demonstrated here on cryo-ET membrane data, the combined approach is applicable to ROI detection and surface analysis in broader scientific imaging contexts.
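The inter-membrane distance feature reduces to nearest-neighbour distances between two point clouds. A toy sketch with invented coordinates (SurfORA operates on point clouds and surface meshes with open-surface handling, none of which is reproduced here):

```python
import math

def membrane_distances(membrane_a, membrane_b):
    """Per-point inter-membrane distance: for each point sampled on
    membrane A, the Euclidean distance to the closest point on
    membrane B. Summarizing the resulting distribution (mean, spread)
    characterizes a membrane contact site."""
    return [min(math.dist(p, q) for q in membrane_b) for p in membrane_a]

# Two small flat patches 5 units apart (toy coordinates, not cryo-ET data).
patch_a = [(x, y, 0.0) for x in range(3) for y in range(3)]
patch_b = [(x, y, 5.0) for x in range(3) for y in range(3)]
dists = membrane_distances(patch_a, patch_b)
```

This brute-force form is quadratic in the number of points; real meshes would use a spatial index, but the quantity measured is the same.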

[197] Towards Attributions of Input Variables in a Coalition

Xinhao Zheng, Huiqi Deng, Quanshi Zhang

Main category: cs.CV

Summary unavailable: the arXiv API request for paper 2309.13411 returned HTTP 429 (rate limited), so no content could be retrieved.

[198] Coherent and Multi-modality Image Inpainting via Latent Space Optimization

Lingzhi Pan, Tong Zhang, Bingyuan Chen, Qi Zhou, Wei Ke, Sabine Süsstrunk, Mathieu Salzmann

Main category: cs.CV

Summary unavailable: the arXiv API request for paper 2407.08019 returned HTTP 429 (rate limited), so no content could be retrieved.

[199] RegTrack: Simplicity Beneath Complexity in Robust Multi-Modal 3D Multi-Object Tracking

Lipeng Gu, Xuefeng Yan, Song Wang, Mingqiang Wei

Main category: cs.CV

Summary unavailable: the arXiv API request for 2409.00618 returned HTTP 429 (rate limited).

[200] Two Models for Surface Segmentation using the Total Variation of the Normal Vector

Manuel Weiß, Lukas Baumgärtner, Laura Weigl, Ronny Bergmann, Stephan Schmidt, Roland Herzog

Main category: cs.CV

Summary unavailable: the arXiv API request for 2412.00445 returned HTTP 429 (rate limited).

[201] Revisiting the Generalization Problem of Low-level Vision Models Through the Lens of Image Deraining

Jinfan Hu, Zhiyuan You, Jinjin Gu, Kaiwen Zhu, Tianfan Xue, Chao Dong

Main category: cs.CV

Summary unavailable: the arXiv API request for 2502.12600 returned HTTP 429 (rate limited).

[202] A deep learning framework for efficient pathology image analysis

Peter Neidlinger, Tim Lenz, Sebastian Foersch, Chiara M. L. Loeffler, Jan Clusmann, Marco Gustav, Lawrence A. Shaktah, Rupert Langer, Bastian Dislich, Lisa A. Boardman, Amy J. French, Ellen L. Goode, Andrea Gsur, Stefanie Brezina, Marc J. Gunter, Robert Steinfelder, Hans-Michael Behrens, Christoph Röcken, Tabitha Harrison, Ulrike Peters, Amanda I. Phipps, Giuseppe Curigliano, Nicola Fusco, Antonio Marra, Michael Hoffmeister, Hermann Brenner, Jakob Nikolas Kather

Main category: cs.CV

Summary unavailable: the arXiv API request for 2502.13027 returned HTTP 429 (rate limited).

[203] Knee or ROC

Veronica Wendt, Jacob Steiner, Byunggu Yu, Caleb Kelly, Justin Kim

Main category: cs.CV

Summary unavailable: the arXiv API request for 2401.07390 returned HTTP 429 (rate limited).

[204] SEED: Towards More Accurate Semantic Evaluation for Visual Brain Decoding

Juhyeon Park, Peter Yongho Kim, Jiook Cha, Shinjae Yoo, Taesup Moon

Main category: cs.CV

Summary unavailable: the arXiv API request for 2503.06437 returned HTTP 429 (rate limited).

[205] WonderVerse: Extendable 3D Scene Generation with Video Generative Models

Hao Feng, Zhi Zuo, Jia-Hui Pan, Ka-Hei Hui, Qi Dou, Jingyu Hu, Zhengzhe Liu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2503.09160 returned HTTP 429 (rate limited).

[206] VISIONLOGIC: From Neuron Activations to Causally Grounded Concept Rules for Vision Models

Chuqin Geng, Yuhe Jiang, Ziyu Zhao, Haolin Ye, Anqi Xing, Li Zhang, Xujie Si

Main category: cs.CV

Summary unavailable: the arXiv API request for 2503.10547 returned HTTP 429 (rate limited).

[207] Enhancing Out-of-Distribution Detection with Extended Logit Normalization

Yifan Ding, Xixi Liu, Jonas Unger, Gabriel Eilertsen

Main category: cs.CV

Summary unavailable: the arXiv API request for 2504.11434 returned HTTP 429 (rate limited).

[208] Improving Motion in Image-to-Video Models via Adaptive Low-Pass Guidance

June Suk Choi, Kyungmin Lee, Sihyun Yu, Yisol Choi, Jinwoo Shin, Kimin Lee

Main category: cs.CV

Summary unavailable: the arXiv API request for 2506.08456 returned HTTP 429 (rate limited).

[209] Peering into the Unknown: Active View Selection with Neural Uncertainty Maps for 3D Reconstruction

Zhengquan Zhang, Feng Xu, Mengmi Zhang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2506.14856 returned HTTP 429 (rate limited).

[210] Light of Normals: Unified Feature Representation for Universal Photometric Stereo

Houyuan Chen, Hong Li, Chongjie Ye, Zhaoxi Chen, Bohan Li, Shaocong Xu, Xianda Guo, Xuhui Liu, Yikai Wang, Baochang Zhang, Satoshi Ikehata, Boxin Shi, Anyi Rao, Hao Zhao

Main category: cs.CV

Summary unavailable: the arXiv API request for 2506.18882 returned HTTP 429 (rate limited).

[211] An Efficient LiDAR-Camera Fusion Network for Multi-Class 3D Dynamic Object Detection and Trajectory Prediction

Yushen He, Lei Zhao, Tianchen Deng, Zipeng Fang, Weidong Chen

Main category: cs.CV

Summary unavailable: the arXiv API request for 2504.13647 returned HTTP 429 (rate limited).

[212] Is Exchangeability better than I.I.D to handle Data Distribution Shifts while Pooling Data for Data-scarce Medical image segmentation?

Ayush Roy, Samin Enam, Jun Xia, Won Hwa Kim, Vishnu Suresh Lokhande

Main category: cs.CV

Summary unavailable: the arXiv API request for 2507.19575 returned HTTP 429 (rate limited).

[213] Seeing Through the Noise: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective

Maoxun Yuan, Duanni Meng, Ziteng Xi, Tianyi Zhao, Shiji Zhao, Yimian Dai, Xingxing Wei

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.06878 returned HTTP 429 (rate limited).

[214] RECON: Robust symmetry discovery via Explicit Canonical Orientation Normalization

Alonso Urbano, David W. Romero, Max Zimmer, Sebastian Pokutta

Main category: cs.CV

Summary unavailable: the arXiv API request for 2505.13289 returned HTTP 429 (rate limited).

[215] PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training

Yin Xie, Zhichao Chen, Zeyu Xiao, Yongle Zhao, Xiang An, Kaicheng Yang, Zimin Ran, Jia Guo, Ziyong Feng, Jiankang Deng

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.09691 returned HTTP 429 (rate limited).

[216] Trajectory-aware Shifted State Space Models for Online Video Super-Resolution

Qiang Zhu, Xiandong Meng, Yuxian Jiang, Fan Zhang, David Bull, Shuyuan Zhu, Bing Zeng, Ronggang Wang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.10453 returned HTTP 429 (rate limited).

[217] MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling

Haoyu Wang, Hao Tang, Donglin Di, Zhilu Zhang, Wangmeng Zuo, Feng Gao, Siwei Ma, Shiliang Zhang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.17404 returned HTTP 429 (rate limited).

[218] Decouple, Reorganize, and Fuse: A Multimodal Framework for Cancer Survival Prediction

Huayi Wang, Haochao Ying, Yuyang Xu, Qibo Qiu, Cheng Zhang, Danny Z. Chen, Ying Sun, Jian Wu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.18632 returned HTTP 429 (rate limited).

[219] PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models

Jeongjae Lee, Jong Chul Ye

Main category: cs.CV

Summary unavailable: the arXiv API request for 2509.25774 returned HTTP 429 (rate limited).

[220] On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

Jianing Guo, Zhenhong Wu, Chang Tu, Yiyao Ma, Xiangqi Kong, Zhiqian Liu, Jiaming Ji, Shuning Zhang, Yuanpei Chen, Kai Chen, Qi Dou, Yaodong Yang, Xianglong Liu, Huijie Zhao, Weifeng Lv, Simin Li

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.00037 returned HTTP 429 (rate limited).

[221] Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack

Nanxiang Jiang, Zhaoxin Fan, Enhan Kang, Daiheng Gao, Yun Zhou, Yanxia Chang, Zheng Zhu, Yeying Jin, Wenjun Wu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.00635 returned HTTP 429 (rate limited).

[222] Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models

Xinmiao Huang, Qisong He, Zhenglin Huang, Boxuan Wang, Zhuoyun Li, Guangliang Cheng, Yi Dong, Xiaowei Huang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.13394 returned HTTP 429 (rate limited).

[223] Unbiased Object Detection Beyond Frequency with Visually Prompted Image Synthesis

Xinhao Cai, Liulei Li, Gensheng Pei, Tao Chen, Jinshan Pan, Yazhou Yao, Wenguan Wang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.18229 returned HTTP 429 (rate limited).

[224] UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.18701 returned HTTP 429 (rate limited).

[225] egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks

Matthias Jammot, Björn Braun, Paul Streli, Rafael Wampfler, Christian Holz

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.22129 returned HTTP 429 (rate limited).

[226] Learning Unified Representations from Heterogeneous Data for Robust Heart Rate Modeling

Zhengdong Huang, Zicheng Xie, Wentao Tian, Jingyu Liu, Lunhong Dong, Peng Yang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.21785 returned HTTP 429 (rate limited).

[227] SpecAware: A Spectral-Content Aware Foundation Model for Unifying Multi-Sensor Learning in Hyperspectral Remote Sensing Mapping

Renjie Ji, Xue Wang, Chao Niu, Wen Zhang, Yong Mei, Kun Tan

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.27219 returned HTTP 429 (rate limited).

[228] A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding

Jingyu Lu, Haonan Wang, Qixiang Zhang, Xiaomeng Li

Main category: cs.CV

Summary unavailable: the arXiv API request for 2511.02565 returned HTTP 429 (rate limited).

[229] Defending Unauthorized Model Merging via Dual-Stage Weight Protection

Wei-Jia Chen, Min-Yen Tsai, Cheng-Yi Lee, Chia-Mu Yu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2511.11851 returned HTTP 429 (rate limited).

[230] Changes in Real Time: Online Scene Change Detection with Multi-View Fusion

Chamuditha Jayanga Galappaththige, Jason Lai, Lloyd Windrim, Donald Dansereau, Niko Sünderhauf, Dimity Miller

Main category: cs.CV

Summary unavailable: the arXiv API request for 2511.12370 returned HTTP 429 (rate limited).

[231] CuriGS: Curriculum-Guided Gaussian Splatting for Sparse View Synthesis

Zijian Wu, Mingfeng Jiang, Zidian Lin, Ying Song, Hanjie Ma, Qun Wu, Dongping Zhang, Guiyang Pu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2511.16030 returned HTTP 429 (rate limited).

[232] Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

Jian Ma, Qirong Peng, Xujie Zhu, Peixing Xie, Chen Chen, Haonan Lu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2511.16156 returned HTTP 429 (rate limited).

[233] Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov

Main category: cs.CV

Summary unavailable: the arXiv API request for 2511.17844 returned HTTP 429 (rate limited).

[234] Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

Ziqi Ni, Yuanzhi Liang, Rui Li, Yi Zhou, Haibin Huang, Chi Zhang, Xuelong Li

Main category: cs.CV

Summary unavailable: the arXiv API request for 2511.18719 returned HTTP 429 (rate limited).

[235] UniPart: Part-Level 3D Generation with Unified 3D Geom-Seg Latents

Xufan He, Yushuang Wu, Xiaoyang Guo, Chongjie Ye, Jiaqing Zhou, Tianlei Hu, Xiaoguang Han, Dong Du

Main category: cs.CV

Summary unavailable: the arXiv API request for 2512.09435 returned HTTP 429 (rate limited).

[236] CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng, Zeying Huang, Ning Zhang, Yi Sun, Yi Yang, Hangjie Yuan

Main category: cs.CV

Summary unavailable: the arXiv API request for 2601.01874 returned HTTP 429 (rate limited).

[237] The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection

Qingdong He, Xueqin Chen, Yanjie Pan, Peng Tang, Pengcheng Xu, Zhenye Gan, Chengjie Wang, Xiaobin Hu, Jiangning Zhang, Yabiao Wang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2512.20340 returned HTTP 429 (rate limited).

[238] Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2601.09708 returned HTTP 429 (rate limited).

[239] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2601.10611 returned HTTP 429 (rate limited).

[240] Generating metamers of human scene understanding

Ritik Raina, Abe Leite, Alexandros Graikos, Seoyoung Ahn, Dimitris Samaras, Gregory J. Zelinsky

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2601.11675 returned HTTP 429 (rate limited).

[241] Principal Component Analysis-Based Terahertz Self-Supervised Denoising and Deblurring Deep Neural Networks

Pengfei Zhu, Stefano Sfarra, Hai Zhang, Carlo Santulli, Elana Pivarciova, Fabrizio Sarasini, Xavier Maldague

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2601.12149 returned HTTP 429 (rate limited).

[242] Affinity Contrastive Learning for Skeleton-based Human Activity Understanding

Hongda Liu, Yunfan Liu, Min Ren, Lin Sui, Yunlong Wang, Zhenan Sun

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2601.16694 returned HTTP 429 (rate limited).

[243] CER-HV: A Human-in-the-Loop Framework for Cleaning Datasets Applied to Arabic-Script HTR

Sana Al-azzawi, Elisa Barney, Marcus Liwicki

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2601.16713 returned HTTP 429 (rate limited).

[244] Pareto-Guided Optimization for Uncertainty-Aware Medical Image Segmentation

Jinming Zhang, Youpeng Yang, Xi Yang, Haosen Shi, Yuyao Yan, Qiufeng Wang, Guangliang Cheng, Kaizhu Huang

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2601.19365 returned HTTP 429 (rate limited).

[245] DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning

Wenhao Li, Xianjing Meng, Qiangchang Wang, Zhongyi Han, Zhibin Wu, Yilong Yin

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.00795 returned HTTP 429 (rate limited).

[246] Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction

Emily Bejerano, Federico Tondolo, Ayaan Qayyum, Xiaofan Yu, Xiaofan Jiang

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.13314 returned HTTP 429 (rate limited).

[247] XtraLight-MedMamba for Classification of Neoplastic Tubular Adenomas

Aqsa Sultana, Rayan Afsar, Ahmed Rahu, Surendra P. Singh, Brian Shula, Brandon Combs, Derrick Forchetti, Vijayan K. Asari

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.04819 returned HTTP 429 (rate limited).

[248] All-Optical Segmentation via Diffractive Neural Networks for Autonomous Driving

Yingjie Li, Daniel Robinson, Weilu Gao, Cunxi Yu

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.07717 returned HTTP 429 (rate limited).

[249] Ecological mapping with geospatial foundation models

Craig Mahlasi, Gciniwe S. Baloyi, Zaheed Gaffoor, Levente Klein, Anne Jones, Etienne Vos, Michal Muszynski, Geoffrey Dawson, Campbell Watson

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.10720 returned HTTP 429 (rate limited).

[250] DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving

Haisheng Su, Wei Wu, Feixiang Song, Junjie Zhang, Zhenjie Yang, Junchi Yan

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.13301 returned HTTP 429 (rate limited).

Yuchang Jiang, Anton Raichuk, Xiaoye Tong, Vivien Sainte Fare Garnot, Daniel Ortiz-Gonzalo, Dan Morris, Konrad Schindler, Jan Dirk Wegner, Maxim Neumann

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.17372 returned HTTP 429 (rate limited).

[252] EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

Xiaomeng Peng, Xilang Huang, Seon Han Choi

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.17419 returned HTTP 429 (rate limited).

[253] MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions

Haoyu Zhang, Yuwei Wu, Pengxiang Li, Xintong Zhang, Zhi Gao, Rui Gao, Mingyang Gao, Che Sun, Yunde Jia

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.18746 returned HTTP 429 (rate limited).

[254] Probability-Invariant Random Walk Learning on Gyral Folding-Based Cortical Similarity Networks for Alzheimer’s and Lewy Body Dementia Diagnosis

Minheng Chen, Tong Chen, Chao Cao, Jing Zhang, Tianming Liu, Li Su, Dajiang Zhu

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.17557 returned HTTP 429 (rate limited).

[255] CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion

Yu Li, Yujun Cai, Chi Zhang

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.18936 returned HTTP 429 (rate limited).

[256] Universal 3D Shape Matching via Coarse-to-Fine Language Guidance

Qinfeng Xiao, Guofeng Mei, Bo Yang, Liying Zhang, Jian Zhang, Kit-lun Yick

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.19112 returned HTTP 429 (rate limited).

[257] Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models

Jaeyun Jang, Seunghui Shin, Taeho Park, Hyoseok Hwang

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.19117 returned HTTP 429 (rate limited).

[258] TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding

Fan Yang, Shurong Zheng, Hongyin Zhao, Yufei Zhan, Xin Li, Yousong Zhu, Chaoyang Zhao, Ming Tang, Jinqiao Wang

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.19768 returned HTTP 429 (rate limited).

[259] Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation

Filip Wolf, Blaž Rolih, Luka Čehovin Zajc

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.19863 returned HTTP 429 (rate limited).

[260] When Pretty Isn’t Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators

Krzysztof Adamkiewicz, Brian Moser, Stanislav Frolov, Tobias Christian Nauen, Federico Raue, Andreas Dengel

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.19946 returned HTTP 429 (rate limited).

[261] Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Abdelrahman Shaker, Ahmed Heakl, Jaseel Muhammad, Ritesh Thawkar, Omkar Thawakar, Senmao Li, Hisham Cholakkal, Ian Reid, Eric P. Xing, Salman Khan, Fahad Shahbaz Khan

Main category: cs.CV

TL;DR: Mobile-O is a compact vision-language-diffusion model that enables unified multimodal understanding and generation on mobile devices through efficient architecture design and minimal training data.

DetailsMotivation: Existing unified multimodal models are too data-hungry and computationally heavy for deployment on edge devices, creating a need for efficient models that can run on mobile hardware without cloud dependency.

Method: Uses Mobile Conditioning Projector (MCP) with depthwise-separable convolutions and layerwise alignment to efficiently fuse vision-language features with diffusion generator. Trained on few million samples with novel quadruplet format (generation prompt, image, question, answer) to jointly enhance understanding and generation.

Result: Achieves 74% on GenEval, outperforms Show-O and JanusFlow by 5% and 11% while running 6x and 11x faster. For visual understanding, surpasses them by 15.3% and 5.1% across seven benchmarks. Runs ~3s per 512x512 image on iPhone.

Conclusion: Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices, enabling on-device multimodal intelligence without cloud dependency.

Abstract: Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
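The abstract attributes MCP's low conditioning cost to depthwise-separable convolutions. The saving can be sketched with a simple parameter count (the channel sizes below are arbitrary illustrations, not Mobile-O's actual dimensions):

```python
def conv_params(c_in, c_out, k):
    # standard 2-D convolution: one k x k filter per (input, output) channel pair
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # depthwise: one k x k spatial filter per input channel,
    # pointwise: a 1 x 1 convolution that mixes channels
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 1024, 1024, 3   # illustrative sizes only
std = conv_params(c_in, c_out, k)
dws = depthwise_separable_params(c_in, c_out, k)
print(std, dws, round(std / dws, 1))  # roughly a 9x parameter reduction here
```

The same factoring reduces multiply-accumulate operations by a similar ratio, which is the kind of saving that matters for on-device inference.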

[262] SpikePingpong: Spike Vision-based Fast-Slow Pingpong Robot System

Hao Wang, Chengkai Hou, Xianglong Li, Yankai Fu, Chenxuan Li, Ning Chen, Gaole Dai, Jiaming Liu, Tiejun Huang, Shanghang Zhang

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2506.06690 returned HTTP 429 (rate limited).

[263] DeLTa: Demonstration and Language-Guided Novel Transparent Object Manipulation

Taeyeop Lee, Gyuree Kang, Bowen Wen, Youngho Kim, Seunghyeok Back, In So Kweon, David Hyunchul Shim, Kuk-Jin Yoon

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2510.05662 returned HTTP 429 (rate limited).

[264] Earth Embeddings as Products: Taxonomy, Ecosystem, and Standardized Access

Heng Fang, Adam J. Stewart, Isaac Corley, Xiao Xiang Zhu, Hossein Azizpour

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2601.13134 returned HTTP 429 (rate limited).

[265] Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation

Runpei Dong, Ziyan Li, Xialin He, Saurabh Gupta

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.16705 returned HTTP 429 (rate limited).

cs.AI

[266] Multilevel Determinants of Overweight and Obesity Among U.S. Children Aged 10-17: Comparative Evaluation of Statistical and Machine Learning Approaches Using the 2021 National Survey of Children’s Health

Joyanta Jyoti Mondal

Main category: cs.AI

TL;DR: This paper compares statistical, machine learning, and deep learning models for predicting adolescent overweight/obesity using behavioral, household, and neighborhood predictors, finding limited gains from increased model complexity over logistic regression.

DetailsMotivation: Childhood and adolescent overweight/obesity remain major public health concerns in the US, shaped by multiple factors. The study aims to characterize the joint predictive structure of these factors at the population level and compare different modeling approaches.

Method: Analyzed 18,792 children aged 10-17 from the 2021 National Survey of Children’s Health. Used BMI categories to define overweight/obesity. Predictors included diet, physical activity, sleep, parental stress, socioeconomic conditions, adverse experiences, and neighborhood characteristics. Compared logistic regression, random forest, gradient boosting, XGBoost, LightGBM, multilayer perceptron, and TabNet models.

Result: Model discrimination ranged from 0.66 to 0.79 AUC. Logistic regression, gradient boosting, and MLP showed the most stable balance of discrimination and calibration. Boosting and deep learning modestly improved recall and F1 scores, but no model was uniformly superior. Performance disparities across race and poverty groups persisted across all algorithms.

Conclusion: Increased model complexity yields limited gains over logistic regression for predicting adolescent overweight/obesity. Predictors consistently span behavioral, household, and neighborhood domains. Persistent subgroup disparities indicate the need for improved data quality and equity-focused surveillance rather than greater algorithmic complexity.

Abstract: Background: Childhood and adolescent overweight and obesity remain major public health concerns in the United States and are shaped by behavioral, household, and community factors. Their joint predictive structure at the population level remains incompletely characterized. Objectives: The study aims to identify multilevel predictors of overweight and obesity among U.S. adolescents and compare the predictive performance, calibration, and subgroup equity of statistical, machine-learning, and deep-learning models. Data and Methods: We analyze 18,792 children aged 10-17 years from the 2021 National Survey of Children’s Health. Overweight/obesity is defined using BMI categories. Predictors included diet, physical activity, sleep, parental stress, socioeconomic conditions, adverse experiences, and neighborhood characteristics. Models include logistic regression, random forest, gradient boosting, XGBoost, LightGBM, multilayer perceptron, and TabNet. Performance is evaluated using AUC, accuracy, precision, recall, F1 score, and Brier score. Results: Discrimination range from 0.66 to 0.79. Logistic regression, gradient boosting, and MLP showed the most stable balance of discrimination and calibration. Boosting and deep learning modestly improve recall and F1 score. No model was uniformly superior. Performance disparities across race and poverty groups persist across algorithms. Conclusion: Increased model complexity yields limited gains over logistic regression. Predictors consistently span behavioral, household, and neighborhood domains. Persistent subgroup disparities indicate the need for improved data quality and equity-focused surveillance rather than greater algorithmic complexity.
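The study's headline metrics, discrimination (AUC) and calibration (Brier score), can be computed directly from predicted probabilities. A minimal NumPy sketch on toy predictions (all values here are illustrative, not taken from the paper):

```python
import numpy as np

def auc(y_true, y_score):
    # probability that a random positive outranks a random negative
    # (the Mann-Whitney formulation of the ROC AUC)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def brier(y_true, y_prob):
    # mean squared error between predicted probability and binary outcome;
    # lower is better-calibrated
    return float(np.mean((y_prob - y_true) ** 2))

y = np.array([1, 0, 1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.3])
print(auc(y, p), brier(y, p))
```

Reporting both matters here: the paper's finding is that boosting and deep models gain little over logistic regression precisely when calibration is weighed alongside discrimination.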

[267] An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram, Rizwan Hamid, Daniel V. Fabbri, Adam Wright, Josh F. Peterson, Lisa Bastarache, Hua Xu

Main category: cs.AI

TL;DR: RARE-PHENIX is an end-to-end AI framework for rare disease phenotyping that integrates LLM-based extraction, HPO standardization, and diagnostic prioritization, outperforming existing methods in clinical workflow automation.

DetailsMotivation: Manual curation of structured phenotypes from clinical notes is labor-intensive and difficult to scale for rare disease diagnosis. Existing AI approaches optimize individual components but don't operationalize the full clinical workflow of extracting, standardizing, and prioritizing diagnostically informative phenotypes.

Method: Developed RARE-PHENIX framework integrating: 1) large language model-based phenotype extraction from clinical text, 2) ontology-grounded standardization to Human Phenotype Ontology (HPO) terms, and 3) supervised ranking of diagnostically informative phenotypes. Trained on data from 2,671 patients across 11 clinical sites.

Result: Outperformed state-of-the-art deep learning baseline (PhenoBERT) across ontology-based similarity and precision-recall-F1 metrics (0.70 vs. 0.58). Validated on 16,357 real-world clinical notes. Ablation analyses showed performance improvements with each module addition.

Conclusion: By modeling phenotyping as a clinically aligned workflow rather than a single extraction task, RARE-PHENIX provides structured, ranked phenotypes more concordant with clinician curation and has potential to support human-in-the-loop rare disease diagnosis in real-world settings.

Abstract: Phenotyping is fundamental to rare disease diagnosis, but manual curation of structured phenotypes from clinical notes is labor-intensive and difficult to scale. Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontology (HPO) terms, and prioritizing diagnostically informative HPO terms. We developed RARE-PHENIX, an end-to-end AI framework for rare disease phenotyping that integrates large language model-based phenotype extraction, ontology-grounded standardization to HPO terms, and supervised ranking of diagnostically informative phenotypes. We trained RARE-PHENIX using data from 2,671 patients across 11 Undiagnosed Diseases Network clinical sites, and externally validated it on 16,357 real-world clinical notes from Vanderbilt University Medical Center. Using clinician-curated HPO terms as the gold standard, RARE-PHENIX consistently outperformed a state-of-the-art deep learning baseline (PhenoBERT) across ontology-based similarity and precision-recall-F1 metrics in end-to-end evaluation (i.e., ontology-based similarity of 0.70 vs. 0.58). Ablation analyses demonstrated performance improvements with the addition of each module in RARE-PHENIX (extraction, standardization, and prioritization), supporting the value of modeling the full clinical phenotyping workflow. By modeling phenotyping as a clinically aligned workflow rather than a single extraction task, RARE-PHENIX provides structured, ranked phenotypes that are more concordant with clinician curation and has the potential to support human-in-the-loop rare disease diagnosis in real-world settings.
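The three-stage workflow above (extract, standardize to HPO, prioritize) can be sketched as a toy pipeline. The lexicon and informativeness weights below are hypothetical stand-ins for RARE-PHENIX's LLM extractor and supervised ranker, which the abstract does not specify in detail:

```python
# Hypothetical mini-lexicon mapping surface phrases to HPO terms; the real
# system uses an LLM for extraction and ontology-grounded normalization.
HPO_LEXICON = {
    "seizures": ("HP:0001250", "Seizure"),
    "low muscle tone": ("HP:0001252", "Hypotonia"),
    "tall": ("HP:0000098", "Tall stature"),
}
# Hypothetical diagnostic-informativeness weights standing in for the
# supervised ranking module.
WEIGHT = {"HP:0001250": 0.9, "HP:0001252": 0.7, "HP:0000098": 0.2}

def extract(note):
    # stage 1: find phenotype mentions in free text
    return [phrase for phrase in HPO_LEXICON if phrase in note.lower()]

def standardize(mentions):
    # stage 2: map mentions to standardized HPO terms
    return [HPO_LEXICON[m] for m in mentions]

def prioritize(terms):
    # stage 3: rank terms by diagnostic informativeness, highest first
    return sorted(terms, key=lambda t: WEIGHT.get(t[0], 0.0), reverse=True)

note = "Patient is tall, with recurrent seizures and low muscle tone."
ranked = prioritize(standardize(extract(note)))
print(ranked)  # most diagnostically informative term first
```

The point of the sketch is the composition: modeling all three stages as one workflow, rather than optimizing extraction alone, is what the paper argues brings the output closer to clinician curation.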

[268] DMCD: Semantic-Statistical Framework for Causal Discovery

Samarth KaPatel, Sofia Nikiforova, Giacinto Paolo Saggese, Paul Smith

Main category: cs.AI

TL;DR: DMCD is a two-phase causal discovery framework that combines LLM-based semantic drafting from variable metadata with statistical validation on observational data.

DetailsMotivation: Traditional causal discovery methods often struggle with limited observational data and may miss important causal relationships. The authors aim to leverage semantic information from variable metadata to improve causal structure learning.

Method: Two-phase approach: Phase I uses an LLM to propose a sparse draft DAG based on variable metadata as a semantic prior. Phase II refines this draft through statistical validation using conditional independence testing, with discrepancies guiding targeted edge revisions.
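The two-phase loop can be sketched in miniature. This is a hedged illustration, not DMCD's code: the LLM draft is stubbed with a hard-coded edge set, and a crude correlation threshold stands in for the paper's conditional independence testing.

```python
# Minimal sketch of the DMCD idea: Phase I drafts a DAG (LLM stubbed out),
# Phase II audits draft edges against observational data.
import random

def llm_draft_dag(metadata):
    """Phase I stand-in: a semantic prior proposed from variable metadata.
    Hard-coded here; DMCD queries an LLM with the metadata."""
    return {("temperature", "pressure"), ("temperature", "humidity")}

def correlated(xs, ys, threshold=0.3):
    """Crude dependence check standing in for a CI test."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return abs(cov / (sx * sy)) > threshold

def audit(draft, data):
    """Phase II: drop draft edges the data cannot support."""
    return {(a, b) for (a, b) in draft if correlated(data[a], data[b])}

random.seed(0)
temp = [random.gauss(0, 1) for _ in range(500)]
data = {
    "temperature": temp,
    "pressure": [t * 0.8 + random.gauss(0, 0.3) for t in temp],  # real dependence
    "humidity": [random.gauss(0, 1) for _ in range(500)],        # independent
}
refined = audit(llm_draft_dag(None), data)
print(refined)  # the unsupported temperature->humidity edge is pruned
```

In the full method the audit also uses discrepancies to propose targeted edge additions, not just deletions.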

Result: DMCD achieves competitive or leading performance on three real-world benchmarks (industrial engineering, environmental monitoring, IT systems analysis), with particularly large gains in recall and F1 score compared to diverse causal discovery baselines.

Conclusion: Combining semantic priors from LLMs with principled statistical verification yields a high-performing and practically effective approach to causal structure learning, with improvements arising from semantic reasoning rather than graph memorization.

Abstract: We present DMCD (DataMap Causal Discovery), a two-phase causal discovery framework that integrates LLM-based semantic drafting from variable metadata with statistical validation on observational data. In Phase I, a large language model proposes a sparse draft DAG, serving as a semantically informed prior over the space of possible causal structures. In Phase II, this draft is audited and refined via conditional independence testing, with detected discrepancies guiding targeted edge revisions. We evaluate our approach on three metadata-rich real-world benchmarks spanning industrial engineering, environmental monitoring, and IT systems analysis. Across these datasets, DMCD achieves competitive or leading performance against diverse causal discovery baselines, with particularly large gains in recall and F1 score. Probing and ablation experiments suggest that these improvements arise from semantic reasoning over metadata rather than memorization of benchmark graphs. Overall, our results demonstrate that combining semantic priors with principled statistical verification yields a high-performing and practically effective approach to causal structure learning.

[269] Diffusion Modulation via Environment Mechanism Modeling for Planning

Hanping Zhang, Yuhong Guo

Main category: cs.AI

TL;DR: DMEMM improves diffusion-based planning in offline RL by incorporating environment mechanisms like transition dynamics and reward functions to ensure trajectory consistency.

DetailsMotivation: Current diffusion-based planning methods in offline RL fail to account for the unique consistency requirements between transitions needed for coherent trajectories in real environments, leading to discrepancies between generated trajectories and actual environment mechanisms.

Method: DMEMM (Diffusion Modulation via Environment Mechanism Modeling) modulates diffusion model training by incorporating key RL environment mechanisms, specifically transition dynamics and reward functions, to ensure trajectory consistency.

Result: Experimental results demonstrate that DMEMM achieves state-of-the-art performance for planning with offline reinforcement learning.

Conclusion: Incorporating environment mechanisms into diffusion-based planning improves trajectory generation consistency and performance in offline RL settings.

Abstract: Diffusion models have shown promising capabilities in trajectory generation for planning in offline reinforcement learning (RL). However, conventional diffusion-based planning methods often fail to account for the fact that generating trajectories in RL requires unique consistency between transitions to ensure coherence in real environments. This oversight can result in considerable discrepancies between the generated trajectories and the underlying mechanisms of a real environment. To address this problem, we propose a novel diffusion-based planning method, termed as Diffusion Modulation via Environment Mechanism Modeling (DMEMM). DMEMM modulates diffusion model training by incorporating key RL environment mechanisms, particularly transition dynamics and reward functions. Experimental results demonstrate that DMEMM achieves state-of-the-art performance for planning with offline reinforcement learning.

[270] Implicit Intelligence – Evaluating Agents on What Users Don’t Say

Ved Sirdeshmukh, Marc Wetter

Main category: cs.AI

TL;DR: Implicit Intelligence framework evaluates AI agents’ ability to infer implicit requirements beyond explicit instructions, using Agent-as-a-World simulation with YAML-defined interactive environments.

DetailsMotivation: Real-world AI agent requests are fundamentally underspecified, relying on shared context and unstated constraints that humans naturally infer. Current benchmarks only test explicit instruction-following, failing to evaluate agents' ability to reason about implicit requirements like accessibility needs, privacy boundaries, catastrophic risks, and contextual constraints.

Method: Developed Implicit Intelligence evaluation framework paired with Agent-as-a-World (AaW) harness, where interactive worlds are defined in human-readable YAML files and simulated by language models. Scenarios feature apparent simplicity in user requests but hidden complexity in correct solutions, with constraints discoverable through environmental exploration.
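To make the "hidden complexity" idea concrete, here is a hypothetical scenario in the shape a YAML-defined world might take, expressed as the equivalent Python structure. Every field, name, and the pass condition are invented for illustration and are not taken from the AaW benchmark.

```python
# Invented example of an Implicit Intelligence scenario: the request looks
# simple, but the correct solution depends on a constraint discoverable only
# by exploring the environment.
world = {
    "user_request": "Book a restaurant for my team dinner on Friday.",
    "environment": {
        "team_roster": [
            {"name": "Ana", "dietary": "vegan"},
            {"name": "Ben", "dietary": None},
        ],
        "restaurants": [
            {"name": "Steak Hall", "vegan_options": False},
            {"name": "Green Fork", "vegan_options": True},
        ],
    },
    # Pass condition checked by the simulator, never stated in the request.
    "pass_if": lambda choice, env: (
        choice["vegan_options"]
        or not any(m["dietary"] == "vegan" for m in env["team_roster"])
    ),
}

def literal_agent(world):
    """Follows the prompt literally: picks the first restaurant found."""
    return world["environment"]["restaurants"][0]

def exploring_agent(world):
    """Explores the roster first, then satisfies the implicit constraint."""
    env = world["environment"]
    needs_vegan = any(m["dietary"] == "vegan" for m in env["team_roster"])
    for r in env["restaurants"]:
        if not needs_vegan or r["vegan_options"]:
            return r
    return env["restaurants"][0]

env = world["environment"]
print(world["pass_if"](literal_agent(world), env))    # False: hidden check failed
print(world["pass_if"](exploring_agent(world), env))  # True
```

The benchmark's 48.3% best pass rate suggests most current agents behave like the literal agent here.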

Result: Evaluated 16 frontier and open-weight models across 205 scenarios. Even the best-performing model achieved only 48.3% scenario pass rate, revealing substantial room for improvement in bridging the gap between literal instruction-following and human-like contextual reasoning.

Conclusion: There’s a significant gap between current AI agents’ literal instruction-following capabilities and the human-like contextual reasoning needed for real-world applications. The Implicit Intelligence framework provides a crucial evaluation benchmark for developing agents that can infer implicit requirements and become genuine goal-fulfillers rather than just prompt-followers.

Abstract: Real-world requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that speakers expect listeners to infer. Current agentic benchmarks test explicit instruction-following but fail to evaluate whether agents can reason about implicit requirements spanning accessibility needs, privacy boundaries, catastrophic risks, and contextual constraints. We present Implicit Intelligence, an evaluation framework testing whether AI agents can move beyond prompt-following to become genuine goal-fulfillers, paired with Agent-as-a-World (AaW), a harness where interactive worlds are defined in human-readable YAML files and simulated by language models. Our scenarios feature apparent simplicity in user requests, hidden complexity in correct solutions, and discoverability of constraints through environmental exploration. Evaluating 16 frontier and open-weight models across 205 scenarios, we find that even the best-performing model achieves only 48.3% scenario pass rate, revealing substantial room for improvement in bridging the gap between literal instruction-following and human-like contextual reasoning.

[271] Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use

Ruocheng Guo, Kaiwen Dong, Xiang Gao, Kamalika Das

Main category: cs.AI

TL;DR: Trace-Free+ is a curriculum learning framework that optimizes tool interfaces for LLM-based agents without requiring execution traces, enabling better tool selection and usage in trace-free or privacy-constrained settings.

DetailsMotivation: Current LLM-based agents rely on human-oriented tool interfaces that become bottlenecks when selecting from large tool sets. Existing optimization methods need execution traces, which are often unavailable in cold-start or privacy-constrained scenarios, and optimize tools independently, limiting scalability and generalization.

Method: Proposes Trace-Free+, a curriculum learning framework that progressively transfers supervision from trace-rich settings to trace-free deployment. Uses a structured workflow to construct a large-scale dataset of high-quality tool interfaces from diverse tools, enabling models to abstract reusable interface-usage patterns and tool usage outcomes.

Result: Experiments on StableToolBench and RestBench show consistent gains on unseen tools, strong cross-domain generalization, and robustness as candidate tools scale to over 100. Demonstrates tool interface optimization as a practical complement to agent fine-tuning.

Conclusion: Tool interface optimization is a deployable complement to agent fine-tuning. Trace-Free+ enables effective tool usage in trace-free settings through curriculum learning and high-quality interface datasets, addressing scalability and generalization challenges.

Abstract: The performance of LLM-based agents depends not only on the agent itself but also on the quality of the tool interfaces it consumes. While prior work has focused heavily on agent fine-tuning, tool interfaces-including natural language descriptions and parameter schemas-remain largely human-oriented and often become a bottleneck, especially when agents must select from large candidate tool sets. Existing approaches to improving tool interfaces rely on execution traces, which are frequently unavailable in cold-start or privacy-constrained settings, and typically optimize each tool independently, limiting scalability and generalization to unseen tools. We propose Trace-Free+, a curriculum learning framework that progressively transfers supervision from trace-rich settings to trace-free deployment, encouraging the model to abstract reusable interface-usage patterns and tool usage outcomes. To support this approach, we construct a large-scale dataset of high-quality tool interfaces using a structured workflow over a diverse collection of tools. Experiments on StableToolBench and RestBench show consistent gains on unseen tools, strong cross-domain generalization, and robustness as the number of candidate tools scales to over 100, demonstrating that tool interface optimization is a practical and deployable complement to agent fine-tuning.

[272] PreScience: A Benchmark for Forecasting Scientific Contributions

Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C. Kozlowski, Oyvind Tafjord, James Evans, Daniel S. Weld, Tom Hope, Doug Downey

Main category: cs.AI

TL;DR: PreScience is a scientific forecasting benchmark that tests AI’s ability to predict future scientific advances by decomposing research into four tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction.

DetailsMotivation: To determine if AI systems can forecast scientific advances, which could help researchers identify collaborators, impactful research directions, and anticipate future central problems and methods.

Method: Created PreScience benchmark with 98K AI-related papers, disambiguated authors, temporal metadata, and structured graph of 502K papers. Developed baselines and evaluations including LACERScore (LLM-based contribution similarity metric).

Result: Substantial headroom remains in all tasks: GPT-5 achieves only moderate similarity (5.6/10) in contribution generation. End-to-end simulations produce synthetic corpora that are less diverse and novel than human-authored research.

Conclusion: Current AI systems have limited ability to forecast scientific advances, with systematic gaps in diversity and novelty compared to human research, suggesting opportunities for improvement in scientific forecasting capabilities.

Abstract: Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help researchers identify collaborators and impactful research directions, and anticipate which problems and methods will become central next. We introduce PreScience – a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. PreScience is a carefully curated dataset of 98K recent AI-related research papers, featuring disambiguated author identities, temporally aligned scholarly metadata, and a structured graph of companion author publication histories and citations spanning 502K total papers. We develop baselines and evaluations for each task, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement. We find substantial headroom remains in each task – e.g. in contribution generation, frontier LLMs achieve only moderate similarity to the ground-truth (GPT-5, averages 5.6 on a 1-10 scale). When composed into a 12-month end-to-end simulation of scientific production, the resulting synthetic corpus is systematically less diverse and less novel than human-authored research from the same period.

[273] KairosVL: Orchestrating Time Series and Semantics for Unified Reasoning

Haotian Si, Changhua Pei, Xiao He, Zeyan Li, Zhe Xie, Zexin Wang, Jiyao Hu, Zhaoyang Yu, Tieying Zhang, Dan Pei, Jianhui Li, Gaogang Xie

Main category: cs.AI

TL;DR: KairosVL is a two-round reinforcement learning framework for semantic-conditional time series reasoning that combines numerical temporal modeling with contextual semantic understanding.

DetailsMotivation: Addressing the need for more complex, decision-oriented time series analysis that goes beyond purely numerical modeling to incorporate contextual and semantic understanding for real-world applications.

Method: A two-round reinforcement learning framework: first round strengthens perception of fundamental temporal primitives, second round focuses on semantic-conditioned reasoning, resulting in the KairosVL model.

Result: Achieves competitive performance across synthetic and real-world tasks, boosts performance while preserving intrinsic reasoning ability, and significantly improves generalization to unseen scenarios.

Conclusion: Highlights the potential of combining semantic reasoning with temporal modeling and provides a practical framework for real-world time series intelligence.

Abstract: Driven by the increasingly complex and decision-oriented demands of time series analysis, we introduce the Semantic-Conditional Time Series Reasoning task, which extends conventional time series analysis beyond purely numerical modeling to incorporate contextual and semantic understanding. To further enhance the model’s reasoning capabilities on complex time series problems, we propose a two-round reinforcement learning framework: the first round strengthens the model’s perception of fundamental temporal primitives, while the second focuses on semantic-conditioned reasoning. The resulting model, KairosVL, achieves competitive performance across both synthetic and real-world tasks. Extensive experiments and ablation studies demonstrate that our framework not only boosts performance but also preserves intrinsic reasoning ability and significantly improves generalization to unseen scenarios. To summarize, our work highlights the potential of combining semantic reasoning with temporal modeling and provides a practical framework for real-world time series intelligence, which is in urgent demand.

[274] Chlorophyll-a Mapping and Prediction in the Mar Menor Lagoon Using C2RCC-Processed Sentinel 2 Imagery

Antonio Martínez-Ibarra, Aurora González-Vidal, Adrián Cánovas-Rodríguez, Antonio F. Skarmeta

Main category: cs.AI

Summary unavailable: the arXiv API request for 2510.09736 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09736&sortBy=relevance&sortOrder=descending&start=0&max_results=100) returned HTTP 429 (rate limited), so no AI-generated summary could be produced for this entry.

[275] ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory

Hongbin Zhong, Fazle Faisal, Luis França, Tanakorn Leesatapornwongsa, Adriana Szekeres, Kexin Rong, Suman Nath

Main category: cs.AI

TL;DR: ActionEngine: A training-free GUI agent framework that uses a two-agent architecture for programmatic planning instead of step-by-step vision calls, achieving 95% task success with a single LLM call on average.

DetailsMotivation: Existing GUI agents use step-by-step vision language model calls (screenshot→reason→execute→repeat), leading to high costs, latency scaling with steps, and limited accuracy due to no persistent memory of visited pages.

Method: Two-agent architecture: 1) Crawling Agent constructs updatable state-machine memory of GUIs through offline exploration, 2) Execution Agent leverages this memory to synthesize complete, executable Python programs for online task execution. Includes vision-based re-grounding fallback for robustness against evolving interfaces.
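The core idea of the two-agent architecture can be sketched with a toy state-machine memory and a path search that emits a full action program up front. Page names, action names, and the BFS planner are all illustrative assumptions; ActionEngine synthesizes richer Python programs and includes vision-based repair, none of which is shown here.

```python
# Minimal sketch: an offline crawl builds a state-machine memory of the GUI,
# and the execution side plans a complete action sequence over it, instead of
# one screenshot->reason->act cycle per step.
from collections import deque

# Crawling Agent's memory: page -> {action: next_page} (built offline).
STATE_MACHINE = {
    "home": {"open_forum": "forum", "open_inbox": "inbox"},
    "forum": {"new_post": "post_editor", "back": "home"},
    "post_editor": {"submit": "post_published"},
    "inbox": {"back": "home"},
}

def synthesize_program(start: str, goal: str) -> list[str]:
    """Execution Agent stand-in: BFS over the memory yields the whole
    action program at once."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        page, actions = queue.popleft()
        if page == goal:
            return actions
        for action, nxt in STATE_MACHINE.get(page, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return []

program = synthesize_program("home", "post_published")
print(program)  # ['open_forum', 'new_post', 'submit'] -- no per-step vision calls
```

This is why cost stops scaling with step count: planning happens once over the memory, and vision is invoked only as a fallback when an action fails.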

Result: On Reddit tasks from the WebArena benchmark: 95% task success with a single LLM call on average vs 66% for the strongest vision-only baseline, reducing cost by 11.8x and end-to-end latency by 2x.

Conclusion: Combines global programmatic planning, crawler-validated action templates, and node-level execution with localized validation and repair for scalable, reliable GUI interaction.

Abstract: Existing Graphical User Interface (GUI) agents operate through step-by-step calls to vision language models–taking a screenshot, reasoning about the next action, executing it, then repeating on the new page–resulting in high costs and latency that scale with the number of reasoning steps, and limited accuracy due to no persistent memory of previously visited pages. We propose ActionEngine, a training-free framework that transitions from reactive execution to programmatic planning through a novel two-agent architecture: a Crawling Agent that constructs an updatable state-machine memory of the GUIs through offline exploration, and an Execution Agent that leverages this memory to synthesize complete, executable Python programs for online task execution. To ensure robustness against evolving interfaces, execution failures trigger a vision-based re-grounding fallback that repairs the failed action and updates the memory. This design drastically improves both efficiency and accuracy: on Reddit tasks from the WebArena benchmark, our agent achieves 95% task success with on average a single LLM call, compared to 66% for the strongest vision-only baseline, while reducing cost by 11.8x and end-to-end latency by 2x. Together, these components yield scalable and reliable GUI interaction by combining global programmatic planning, crawler-validated action templates, and node-level execution with localized validation and repair.

[276] Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination

Rakshit Trivedi, Kartik Sharma, David C Parkes

Main category: cs.AI

TL;DR: MIMIC uses language as internal representation of behavioral intent, employing vision-language models to train a conditional VAE for generating inner speech, then uses diffusion-based behavior cloning to enable fine-grained behavioral steering at inference time.

DetailsMotivation: Current imitation learning methods struggle to capture the diversity and non-Markovian nature of human behavior and lack ability to steer behavior at inference time. Inspired by human cognitive processes where inner speech guides action selection, the authors propose using language as internal representation of behavioral intent.

Method: MIMIC framework uses vision-language models as linguistic scaffolding to train a conditional variational autoencoder that generates inner speech from observations. A diffusion-based behavior cloning policy then selects actions conditioned on current observations and the generated inner speech, enabling behavioral steering by conditioning on behavior-specific speech.

Result: Experiments across robotic manipulation tasks and human-AI collaboration games show MIMIC significantly enhances both behavior diversity and fidelity to human demonstrations while enabling nuanced behavioral steering without requiring additional demonstration training.

Conclusion: MIMIC successfully bridges language and action through inner speech representation, enabling more human-like AI agents with diverse behaviors and fine-grained steering capabilities for improved human-AI coordination.

Abstract: Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training them to mimic human-demonstrated behaviors. However, current methods struggle to capture the inherent diversity and non-Markovian nature of human behavior and lack the ability to steer behavior at inference time. Drawing inspiration from the theory of human cognitive processes, where inner speech guides action selection before execution, we propose MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an internal representation of behavioral intent. MIMIC employs the novel use of vision-language models as linguistic scaffolding to train a conditional variational autoencoder capable of generating inner speech from observations. A diffusion-based behavior cloning policy then selects actions conditioned on current observations and the generated inner speech. MIMIC enables fine-grained steering of behavior at inference time by conditioning the agent on behavior-specific speech. Experiments across robotic manipulation tasks and human-AI collaboration games demonstrate that MIMIC significantly enhances both behavior diversity and fidelity to human demonstrations while enabling nuanced behavioral steering without training on additional demonstrations. We open source our code and provide pre-trained MIMIC agents and qualitative demos at: https://mimic-research.github.io.

[277] From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production

Yucheng Shi, Ying Li, Yu Wang, Yesu Feng, Arjun Rao, Rein Houthooft, Shradha Sehgal, Jin Wang, Hao Zhen, Ninghao Liu, Linas Baltrunas

Main category: cs.AI

TL;DR: A reinforcement learning framework that learns to convert structured user interaction logs into optimized natural language contexts for LLM-based recommender systems, achieving 93% relative improvement in recommendation accuracy.

DetailsMotivation: Current LLM-based recommender systems use rigid templates to convert structured user interaction logs into natural language, which yields suboptimal representations. There's a need for better verbalization methods to improve recommendation performance.

Method: Proposes a data-centric framework using reinforcement learning where a verbalization agent transforms raw interaction histories into optimized textual contexts. The agent learns to filter noise, incorporate relevant metadata, and reorganize information, with recommendation accuracy serving as the training signal.
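A toy contrast between the rigid template baseline and the kind of transformation the verbalization agent learns (noise filtering, metadata retention, interest summarization). The log fields and the filtering rule are invented stand-ins; in the paper this transformation is learned with reinforcement learning from recommendation accuracy, not hand-coded.

```python
# Hypothetical interaction log (fields invented for illustration).
log = [
    {"title": "Space Docs Ep. 1", "genre": "documentary", "watch_pct": 95},
    {"title": "Random Clip",      "genre": "misc",        "watch_pct": 3},
    {"title": "Space Docs Ep. 2", "genre": "documentary", "watch_pct": 88},
]

def template_verbalize(log):
    """Baseline: rigidly concatenate every field of every event."""
    return "; ".join(f"{e['title']} ({e['genre']}, {e['watch_pct']}%)" for e in log)

def learned_verbalize(log, min_watch=50):
    """Stand-in for the learned policy: drop barely-watched noise and
    summarize the dominant interest."""
    kept = [e for e in log if e["watch_pct"] >= min_watch]
    genres = sorted({e["genre"] for e in kept})
    titles = ", ".join(e["title"] for e in kept)
    return f"User strongly engages with {', '.join(genres)}: {titles}."

print(template_verbalize(log))  # includes the 3%-watched noise event
print(learned_verbalize(log))   # noise dropped, interest summarized
```

The reported 93% relative gain comes from optimizing exactly this kind of rewriting against downstream recommendation accuracy.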

Result: Experiments on a large-scale industrial streaming dataset show learned verbalization delivers up to 93% relative improvement in discovery item recommendation accuracy over template-based baselines. Analysis reveals emergent strategies like user interest summarization, noise removal, and syntax normalization.

Conclusion: Learned verbalization significantly improves LLM-based recommendation performance by optimizing how structured interaction data is converted to natural language contexts, revealing effective strategies for context construction in generative recommender systems.

Abstract: Large language models (LLMs) are promising backbones for generative recommender systems, yet a key challenge remains underexplored: verbalization, i.e., converting structured user interaction logs into effective natural language inputs. Existing methods rely on rigid templates that simply concatenate fields, yielding suboptimal representations for recommendation. We propose a data-centric framework that learns verbalization for LLM-based recommendation. Using reinforcement learning, a verbalization agent transforms raw interaction histories into optimized textual contexts, with recommendation accuracy as the training signal. This agent learns to filter noise, incorporate relevant metadata, and reorganize information to improve downstream predictions. Experiments on a large-scale industrial streaming dataset show that learned verbalization delivers up to 93% relative improvement in discovery item recommendation accuracy over template-based baselines. Further analysis reveals emergent strategies such as user interest summarization, noise removal, and syntax normalization, offering insights into effective context construction for LLM-based recommender systems.

[278] CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

Ayush Sawarni, Jiyuan Tan, Vasilis Syrgkanis

Main category: cs.AI

TL;DR: A benchmark for evaluating automated causal inference systems that separates identification (research design formulation) from estimation (numerical implementation), using 173 queries across 138 real-world datasets from research papers and textbooks.

DetailsMotivation: Current benchmarks for automated causal inference conflate identification and estimation steps by evaluating based on single numerical outputs like ATE, making it difficult to diagnose whether failures stem from flawed causal reasoning or numerical execution errors.

Method: Created CausalReasoningBenchmark with 173 queries across 138 real-world datasets curated from 85 peer-reviewed papers and 4 causal inference textbooks. Each query requires systems to produce both a structured identification specification (strategy, variables, design elements) and a point estimate with standard error.
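The two-part scoring idea can be illustrated with a minimal sketch: the identification specification and the numerical estimate are graded separately, so reasoning failures are distinguishable from execution errors. Field names and the SE-band tolerance rule are assumptions for illustration, not the benchmark's actual grading code.

```python
# Sketch of disentangled scoring: identification vs. estimation.
from dataclasses import dataclass

@dataclass
class IdentificationSpec:
    strategy: str                     # e.g. "difference-in-differences"
    treatment: str
    outcome: str
    controls: frozenset = frozenset()

def score_identification(pred: IdentificationSpec, gold: IdentificationSpec):
    """Returns (strategy_correct, fully_correct)."""
    strategy_ok = pred.strategy == gold.strategy
    full_ok = strategy_ok and pred == gold
    return strategy_ok, full_ok

def score_estimation(pred_ate, gold_ate, gold_se, z=1.96):
    """Illustrative rule: estimate counts as correct within a z*SE band."""
    return abs(pred_ate - gold_ate) <= z * gold_se

gold = IdentificationSpec("difference-in-differences", "min_wage", "employment",
                          frozenset({"state", "year"}))
pred = IdentificationSpec("difference-in-differences", "min_wage", "employment",
                          frozenset({"state"}))  # missing one control

print(score_identification(pred, gold))   # (True, False): right strategy, incomplete spec
print(score_estimation(0.05, 0.03, 0.02)) # True: within 1.96*SE of gold
```

The 84% vs. 30% gap in the baseline results is exactly the gap between the first and second elements of this tuple: models name the right strategy but miss design details.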

Result: Baseline results with a state-of-the-art LLM show 84% correct identification of high-level strategy but only 30% full identification-specification correctness, revealing that the bottleneck is in nuanced research design details rather than computation.

Conclusion: The benchmark enables granular diagnosis of automated causal inference systems by separating identification from estimation failures, and is publicly available to foster development of more robust systems.

Abstract: Many benchmarks for automated causal inference evaluate a system’s performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification-formulating a valid research design under stated assumptions-and estimation-implementing that design numerically on finite data. We introduce CausalReasoningBenchmark, a benchmark of 173 queries across 138 real-world datasets, curated from 85 peer-reviewed research papers and four widely-used causal-inference textbooks. For each query a system must produce (i) a structured identification specification that names the strategy, the treatment, outcome, and control variables, and all design-specific elements, and (ii) a point estimate with a standard error. By scoring these two components separately, our benchmark enables granular diagnosis: it distinguishes failures in causal reasoning from errors in numerical execution. Baseline results with a state-of-the-art LLM show that, while the model correctly identifies the high-level strategy in 84 % of cases, full identification-specification correctness drops to only 30 %, revealing that the bottleneck lies in the nuanced details of research design rather than in computation. CausalReasoningBenchmark is publicly available on Hugging Face and is designed to foster the development of more robust automated causal-inference systems.

[279] Physics-based phenomenological characterization of cross-modal bias in multimodal models

Hyeongmo Kim, Sohyun Kang, Yerin Choi, Seungyeon Ji, Junhyuk Woo, Hyunsuk Chung, Soyeon Caren Han, Kyungreem Han

Main category: cs.AI

TL;DR: Position paper arguing that phenomenological approaches using physics-based surrogate models can better analyze algorithmic fairness issues in multimodal LLMs by revealing systematic biases from complex multimodal interactions.

DetailsMotivation: Current algorithmic fairness frameworks focus on comparative and non-comparative fairness but fail to capture inconspicuous distortions from complex multimodal interactions in MLLMs. Traditional embedding-level analyses are insufficient for understanding cross-modal bias dynamics.

Method: Developed a physics-based surrogate model describing transformer dynamics (semantic network structure and self-/cross-attention) to analyze cross-modal bias. Conducted multi-input diagnostic experiments: 1) perturbation-based emotion classification analysis using Qwen2.5-Omni and Gemma 3n, and 2) dynamical analysis of Lorenz chaotic time-series prediction through the physical surrogate.

Result: Across two architecturally distinct MLLMs, multimodal inputs can reinforce modality dominance rather than mitigate it, as revealed by structured error-attractor patterns under systematic label perturbation. The phenomenological approach provides insights not captured by conventional embedding-level analyses.

Conclusion: Phenomenological approaches using physics-based models offer practical utility for tackling algorithmic fairness issues in MLLMs by revealing systematic biases arising from complex multimodal interaction dynamics that traditional methods miss.

Abstract: The term ‘algorithmic fairness’ is used to evaluate whether AI models operate fairly in both comparative (where fairness is understood as formal equality, such as “treat like cases as like”) and non-comparative (where unfairness arises from the model’s inaccuracy, arbitrariness, or inscrutability) contexts. Recent advances in multimodal large language models (MLLMs) are breaking new ground in multimodal understanding, reasoning, and generation; however, we argue that inconspicuous distortions arising from complex multimodal interaction dynamics can lead to systematic bias. The purpose of this position paper is twofold: first, it is intended to acquaint AI researchers with phenomenological explainable approaches that rely on the physical entities that the machine experiences during training/inference, as opposed to the traditional cognitivist symbolic account or metaphysical approaches; second, it is to state that this phenomenological doctrine will be practically useful for tackling algorithmic fairness issues in MLLMs. We develop a surrogate physics-based model that describes transformer dynamics (i.e., semantic network structure and self-/cross-attention) to analyze the dynamics of cross-modal bias in MLLM, which are not fully captured by conventional embedding- or representation-level analyses. We support this position through multi-input diagnostic experiments: 1) perturbation-based analyses of emotion classification using Qwen2.5-Omni and Gemma 3n, and 2) dynamical analysis of Lorenz chaotic time-series prediction through the physical surrogate. Across two architecturally distinct MLLMs, we show that multimodal inputs can reinforce modality dominance rather than mitigate it, as revealed by structured error-attractor patterns under systematic label perturbation, complemented by dynamical analysis.

[280] When can we trust untrusted monitoring? A safety case sketch across collusion strategies

Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov, Morgan Sinclaire, Joan Velja, Alessandro Abate, Charlie Griffin

Main category: cs.AI

TL;DR: A safety-case sketch for untrusted monitoring that relaxes prior assumptions about misaligned-model collusion and develops a taxonomy of collusion strategies.

DetailsMotivation: As AIs gain more autonomy, misaligned AIs pose catastrophic risks; untrusted monitoring (using one untrusted model to oversee another) needs rigorous safety justification since developers can't safely deploy misaligned models for testing

Method: Develop taxonomy of collusion strategies (passive self-recognition, causal collusion, acausal collusion, combined strategies), create safety case sketch with explicit assumptions, identify conditions where passive self-recognition is more effective than previous strategies

Result: Framework for more robust evaluations of untrusted monitoring with relaxed assumptions about collusion strategies

Conclusion: Work builds toward better safety justification for untrusted monitoring deployments by addressing previously unconsidered collusion strategies

Abstract: AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk that a misaligned AI may be able to cause catastrophic harm. Untrusted monitoring – using one untrusted model to oversee another – is one approach to reducing risk. Justifying the safety of an untrusted monitoring deployment is challenging because developers cannot safely deploy a misaligned model to test their protocol directly. In this paper, we build on existing methods for rigorously demonstrating safety based on pre-deployment testing. We relax assumptions that previous AI control research made about the collusion strategies a misaligned AI might use to subvert untrusted monitoring. We develop a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points), and combined strategies. We create a safety case sketch to clearly present our argument, explicitly state our assumptions, and highlight unsolved challenges. We identify conditions under which passive self-recognition could be a more effective collusion strategy than those studied previously. Our work builds towards more robust evaluations of untrusted monitoring.

[281] ST-EVO: Towards Generative Spatio-Temporal Evolution of Multi-Agent Communication Topologies

Xingjian Wu, Xvyuan Liu, Junkai Lu, Siyuan Wang, Xiangfei Qiu, Yang Shu, Jilin Hu, Chenjuan Guo, Bin Yang

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for this paper's metadata was rate-limited (HTTP 429), so no AI-generated summary could be produced.

Abstract: Failed to fetch summary for 2602.14681: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14681&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[282] Identifying two piecewise linear additive value functions from anonymous preference information

Vincent Auriau, Khaled Belahcene, Emmanuel Malherbe, Vincent Mousseau, Marc Pirlot

Main category: cs.AI

TL;DR: Simultaneous elicitation of two decision-makers’ additive value functions from mixed responses without knowing which response belongs to which person

DetailsMotivation: The paper addresses the challenge of eliciting preference models from multiple decision-makers simultaneously when responses are received without noise but without attribution to specific individuals. This is relevant in group decision-making scenarios where individual preferences need to be identified from collective responses.

Method: Proposes an elicitation procedure that identifies two preference models assuming additive value functions with piecewise linear marginal value functions and known breaking points. The method works with mixed responses where it’s unknown which answer corresponds to which decision-maker.

Result: The procedure successfully identifies the two preference models under the specified assumptions, providing a solution to the simultaneous elicitation problem with unlabeled responses.

Conclusion: The proposed method enables simultaneous preference elicitation from two decision-makers without needing to attribute responses to specific individuals, which could be useful in various decision-making contexts.

Abstract: Eliciting a preference model involves asking a person, called the decision-maker, a series of questions. We assume that these preferences can be represented by an additive value function. In this work, we query two decision-makers simultaneously with the aim of eliciting their respective value functions. For each query we receive two answers, without noise, but without knowing which answer corresponds to which decision-maker. We propose an elicitation procedure that identifies the two preference models when the marginal value functions are piecewise linear with known breaking points.
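The model class in the abstract can be made concrete with a small sketch. The code below (all breakpoints, shapes, and values are illustrative, not from the paper) evaluates a piecewise linear additive value function for two hypothetical decision-makers; note that both can return the same total value on some alternatives, which is exactly what makes attributing anonymous answers nontrivial.

```python
import numpy as np

def marginal_value(x, xs, ys):
    """Piecewise linear marginal value with known breakpoints xs and
    values ys: linear interpolation between breakpoints."""
    return float(np.interp(x, xs, ys))

def additive_value(alternative, breakpoints, values):
    """Additive value function: sum of per-criterion marginal values."""
    return sum(marginal_value(x, xs, ys)
               for x, xs, ys in zip(alternative, breakpoints, values))

# Two hypothetical decision-makers over two criteria on [0, 1], sharing
# the breakpoints {0, 0.5, 1} but with different marginal shapes.
bps = [np.array([0.0, 0.5, 1.0])] * 2
dm1 = [np.array([0.0, 0.4, 0.5]), np.array([0.0, 0.1, 0.5])]
dm2 = [np.array([0.0, 0.1, 0.5]), np.array([0.0, 0.4, 0.5])]

a = (0.5, 0.5)                     # one queried alternative
v1 = additive_value(a, bps, dm1)   # 0.4 + 0.1
v2 = additive_value(a, bps, dm2)   # 0.1 + 0.4
```

Here both decision-makers assign 0.5 to the alternative (0.5, 0.5), so a pair of identical anonymous answers to this query carries no information about who said what; the elicitation procedure must choose queries whose answers separate the two models.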

[283] Grounding LLMs in Scientific Discovery via Embodied Actions

Bo Zhang, Jinfeng Zhou, Yuxuan Chen, Jianing Yin, Minlie Huang, Hongning Wang

Main category: cs.AI

TL;DR: EmbodiedAct transforms scientific software into active embodied agents by grounding LLMs in embodied actions with tight perception-execution loops, enabling runtime anomaly detection in simulations.

DetailsMotivation: LLMs show potential in scientific discovery but struggle with bridging theoretical reasoning and verifiable physical simulation. Existing approaches use passive "execute-then-response" loops that lack runtime perception of transient anomalies like numerical instability or diverging oscillations.

Method: Proposes EmbodiedAct framework that transforms established scientific software into active embodied agents by grounding LLMs in embodied actions with tight perception-execution loops. Instantiated within MATLAB for evaluation.

Result: Extensive experiments show EmbodiedAct significantly outperforms existing baselines, achieving state-of-the-art performance with satisfactory reliability and stability in long-horizon simulations and enhanced accuracy in scientific modeling.

Conclusion: EmbodiedAct successfully addresses the limitation of passive LLM approaches by creating active embodied agents that can perceive and respond to runtime anomalies in scientific simulations.

Abstract: Large Language Models (LLMs) have shown significant potential in scientific discovery but struggle to bridge the gap between theoretical reasoning and verifiable physical simulation. Existing solutions operate in a passive “execute-then-response” loop and thus lack runtime perception, blinding agents to transient anomalies (e.g., numerical instability or diverging oscillations). To address this limitation, we propose EmbodiedAct, a framework that transforms established scientific software into active embodied agents by grounding LLMs in embodied actions with a tight perception-execution loop. We instantiate EmbodiedAct within MATLAB and evaluate it on complex engineering design and scientific modeling tasks. Extensive experiments show that EmbodiedAct significantly outperforms existing baselines, achieving SOTA performance by ensuring satisfactory reliability and stability in long-horizon simulations and enhanced accuracy in scientific modeling.
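The contrast between "execute-then-response" and a tight perception-execution loop can be sketched in a few lines. This is a minimal illustration (the interface, thresholds, and anomaly checks are assumptions, not the paper's implementation): the loop inspects state after every step instead of only examining the final output.

```python
import math

def run_with_perception(step_fn, state, max_steps, diverge_threshold=1e6):
    """Tight perception-execution loop sketch: after each simulation
    step, check for runtime anomalies such as NaNs or diverging
    magnitudes, rather than only inspecting the final result."""
    for t in range(max_steps):
        state = step_fn(state)
        if math.isnan(state) or abs(state) > diverge_threshold:
            # A passive execute-then-response agent would never see
            # this intermediate state; here it is surfaced immediately.
            return {"status": "anomaly", "step": t, "state": state}
    return {"status": "ok", "state": state}

unstable = run_with_perception(lambda x: x * 10.0, 1.0, 20)
stable = run_with_perception(lambda x: x * 0.5, 1.0, 20)
```

The unstable dynamics are caught mid-run (at the step where the magnitude first exceeds the threshold), while the stable run completes normally.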

[284] Recursive Belief Vision Language Model

Vaidehi Bagaria, Bijo Sebastian, Nirav Patel

Main category: cs.AI

TL;DR: RB-VLA introduces a belief-centric architecture with self-supervised world-model objectives for long-horizon manipulation under partial observability, outperforming prior VLAs by maintaining compact latent states instead of storing raw observations.

DetailsMotivation: Current VLA models struggle with long-horizon manipulation under partial observability due to reliance on short context windows, repeated VLM queries, and lack of persistent action-conditioned state representations, leading to task progress loss, action repetition, and high latency.

Method: RB-VLA uses a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. The VLM provides high-level intent once, while the belief tracks task progress and enables phase-aware control. The belief and intent jointly condition a diffusion policy for closed-loop execution.

Result: Outperforms prior VLAs on long-horizon benchmarks with 52.5% and 37.5% higher success on multi-stage pick-and-place and stacking tasks vs π0. Reduces inference latency by up to 5x, eliminates memory growth across timesteps, and belief module increases success rates from 32.5% to 77.5% in ablations.

Conclusion: Belief-based state representations are effective for long-horizon VLA policies, addressing limitations of observation-driven approaches by providing persistent, action-conditioned state tracking under partial observability.

Abstract: Current vision-language-action (VLA) models struggle with long-horizon manipulation under partial observability. Most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. Semantic reasoning alone is not the primary bottleneck in long-horizon manipulation. Instead, VLAs lack persistent, action-conditioned state representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once for high-level intent, the VLM provides task specification, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust closed-loop execution. RB-VLA outperforms prior VLAs on long-horizon benchmarks, achieving 52.5% and 37.5% higher success on multi-stage pick-and-place and stacking tasks, respectively, compared to π0. It also reduces inference latency by up to 5x relative to baselines and eliminates memory growth across timesteps observed in existing VLAs. Ablations show that the belief module is the primary driver of performance, increasing success rates from 32.5% to 77.5%. These results demonstrate the effectiveness of belief-based state representations for long-horizon VLA policies.
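The core claim, that a compact belief state replaces raw-observation memory, amounts to a recurrent update of a fixed-size latent from each action and observation. A minimal numeric sketch (the matrices, sizes, and tanh dynamics are assumptions for illustration, not RB-VLA's architecture):

```python
import numpy as np

def belief_update(belief, action, observation, W_b, W_a, W_o):
    """Recurrent belief-update sketch: fold the latest action and
    observation into a fixed-size latent state instead of storing
    the raw observation history."""
    return np.tanh(W_b @ belief + W_a @ action + W_o @ observation)

rng = np.random.default_rng(0)
d, da, do = 8, 2, 4  # belief, action, observation dimensions
W_b = rng.normal(size=(d, d)) * 0.1
W_a = rng.normal(size=(d, da))
W_o = rng.normal(size=(d, do))

belief = np.zeros(d)
for _ in range(100):  # memory stays O(d) regardless of horizon length
    belief = belief_update(belief, rng.normal(size=da),
                           rng.normal(size=do), W_b, W_a, W_o)
```

Whatever the horizon, the state passed to the downstream policy is the same d-dimensional vector, which is the property the abstract credits for eliminating memory growth across timesteps.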

[285] How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

Bo Peng, Pi Bu, Keyu Pan, Xinrun Xu, Yinxiu Zhao, Miao Chen, Yang Du, Lin Li, Jun Song, Tong Xu

Main category: cs.AI

TL;DR: NativeEmbodied benchmark for VLM-driven embodied agents using unified low-level action space with joint evaluation of high-level tasks and fundamental embodied skills

DetailsMotivation: Existing benchmarks for VLM-driven embodied agents rely on non-native settings (high-level commands or discretized action spaces) and lack joint evaluation at both low and high levels, limiting assessment of fundamental embodied skills

Method: Created NativeEmbodied benchmark with unified native low-level action space, including three high-level tasks in complex scenarios and four types of low-level tasks targeting fundamental embodied skills for joint evaluation

Result: Experiments with state-of-the-art VLMs reveal clear deficiencies in several fundamental embodied skills, and analysis shows these bottlenecks significantly limit performance on high-level tasks

Conclusion: NativeEmbodied highlights key challenges for current VLM-driven embodied agents and provides insights to guide future research in embodied AI

Abstract: Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents often rely on high-level commands or discretized action spaces, which are non-native settings that differ markedly from real-world control. In addition, current benchmarks focus primarily on high-level tasks and lack joint evaluation and analysis at both low and high levels. To address these limitations, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that uses a unified, native low-level action space. Built on diverse simulated scenes, NativeEmbodied includes three representative high-level tasks in complex scenarios to evaluate overall performance. For more detailed analysis, we further decouple the skills required by complex tasks and construct four types of low-level tasks, each targeting a fundamental embodied skill. This joint evaluation across task and skill granularities enables fine-grained assessment of embodied agents. Experiments with state-of-the-art VLMs reveal clear deficiencies in several fundamental embodied skills, and further analysis shows that these bottlenecks significantly limit performance on high-level tasks. NativeEmbodied highlights key challenges for current VLM-driven embodied agents and provides insights to guide future research.

[286] PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding

Baolong Bi, Yuyao Ge, Shenghua Liu, Yuchen He, Siqian Tong, Lizhe Chen, Lingrui Mei, Zehao Li, Yiwei Wang, Yujun Cai, Ming-Hsuan Yang, Xueqi Cheng

Main category: cs.AI

TL;DR: PromptCD is a test-time contrastive decoding method that uses positive/negative prompts to control LLM/VLM behaviors without training, improving alignment and visual grounding.

DetailsMotivation: Existing alignment methods require expensive training-time interventions with high-quality data. There's a need for cost-efficient test-time behavior control that works across modalities (text and vision) without additional training.

Method: Polarity-Prompt Contrastive Decoding (PromptCD) constructs paired positive and negative guiding prompts for target behaviors, then contrasts model responses: token-level probability distributions in LLMs and visual attention patterns in VLMs to reinforce desirable outcomes.

Result: For LLMs: Consistent substantial improvements on “3H” alignment objectives (helpfulness, honesty, harmlessness). For VLMs: Significant VQA performance improvements by reinforcing behavior-consistent visual grounding through attention pattern contrast.

Conclusion: PromptCD is a simple, general, and cost-efficient strategy for reliable behavior control across modalities that enables post-trained models to achieve meaningful self-enhancement purely at test time.

Abstract: Reliable AI systems require large language models (LLMs) to exhibit behaviors aligned with human preferences and values. However, most existing alignment approaches operate at training time and rely on additional high-quality data, incurring significant computational and annotation costs. While recent work has shown that contrastive decoding can leverage a model’s internal distributions to improve specific capabilities, its applicability remains limited to narrow behavioral scopes and scenarios. In this work, we introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings. PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses-specifically token-level probability distributions in LLMs and visual attention patterns in VLMs-to reinforce desirable outcomes. This formulation extends contrastive decoding to a wide range of enhancement objectives and is applicable to both LLMs and Vision-Language Models (VLMs) without additional training. For LLMs, experiments on the “3H” alignment objectives (helpfulness, honesty, and harmlessness) demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time. For VLMs, we further analyze contrastive effects on visual attention, showing that PromptCD significantly improves VQA performance by reinforcing behavior-consistent visual grounding. Collectively, these results highlight PromptCD as a simple, general, and cost-efficient strategy for reliable behavior control across modalities.
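The token-level side of PromptCD follows the standard contrastive-decoding recipe: score the next token under a positive guiding prompt and a negative one, then amplify the difference. A toy sketch (the combination rule and α are assumptions; the paper's exact formulation may differ):

```python
import numpy as np

def contrastive_next_token(logits_pos, logits_neg, alpha=1.0):
    """Contrastive decoding sketch: boost tokens the positively-prompted
    model prefers relative to the negatively-prompted one, then
    renormalize with a softmax."""
    contrast = logits_pos + alpha * (logits_pos - logits_neg)
    probs = np.exp(contrast - contrast.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# Toy 4-token vocabulary: the positive prompt favors token 2, while the
# negative prompt pulls toward token 0.
logits_pos = np.array([1.0, 0.5, 2.0, 0.0])
logits_neg = np.array([2.5, 0.5, 0.5, 0.0])
tok, probs = contrastive_next_token(logits_pos, logits_neg)
```

The contrasted distribution concentrates on token 2: the behavior the positive prompt encodes is reinforced, while the direction shared with the negative prompt is suppressed.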

[287] Online Algorithms with Unreliable Guidance

Julien Dallot, Yuval Emek, Yuval Gil, Maciej Pacut, Stefan Schmid

Main category: cs.AI

TL;DR: A framework for online algorithms with unreliable guidance (OAG) that separates predictive and algorithmic components, with a DTB compiler that transforms any online algorithm into a learning-augmented version that either follows guidance blindly or ignores it based on a biased coin toss.

DetailsMotivation: To create a unified analysis framework for ML-augmented online decision making that cleanly separates predictive guidance from algorithmic components, allowing systematic study of consistency (when guidance is perfect) and robustness (when guidance is adversarial).

Method: Introduces the OAG model where algorithms receive guidance from the problem’s answer space, which is correct with probability 1-β and adversarially corrupted with probability β. Presents the DTB (drop or trust blindly) compiler that transforms any prediction-oblivious online algorithm into a learning-augmented version that either follows incoming guidance blindly or ignores it based on a biased coin toss.

Result: The DTB compiler produces algorithms with attractive consistency-robustness guarantees: optimal for caching and uniform metrical task systems, and outperforms state-of-the-art for bipartite matching with adversarial arrival order.

Conclusion: The OAG model provides a clean framework for analyzing ML-augmented online algorithms, and the simple DTB compiler yields surprisingly strong performance guarantees for classic online problems, demonstrating the power of this approach.

Abstract: This paper introduces a new model for ML-augmented online decision making, called online algorithms with unreliable guidance (OAG). This model completely separates the predictive and algorithmic components, thus offering a single well-defined analysis framework that relies solely on the considered problem. Formulated through the lens of request-answer games, an OAG algorithm receives, with each incoming request, a piece of guidance which is taken from the problem’s answer space; ideally, this guidance is the optimal answer for the current request, however with probability $β$, the guidance is adversarially corrupted. The goal is to develop OAG algorithms that admit good competitiveness when $β= 0$ (a.k.a. consistency) as well as when $β= 1$ (a.k.a. robustness); the appealing notion of smoothness, that in most prior work required a dedicated loss function, now arises naturally as $β$ shifts from $0$ to $1$. We then describe a systematic method, called the drop or trust blindly (DTB) compiler, which transforms any online algorithm into a learning-augmented online algorithm in the OAG model. Given a prediction-oblivious online algorithm, its learning-augmented counterpart produced by applying the DTB compiler either follows the incoming guidance blindly or ignores it altogether and proceeds as the initial algorithm would have; the choice between these two alternatives is based on the outcome of a (biased) coin toss. As our main technical contribution, we prove (rigorously) that although remarkably simple, the class of algorithms produced via the DTB compiler includes algorithms with attractive consistency-robustness guarantees for three classic online problems: for caching and uniform metrical task systems our algorithms are optimal, whereas for bipartite matching (with adversarial arrival order), our algorithm outperforms the state-of-the-art.
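The DTB compiler described in the abstract is simple enough to sketch directly. Per the abstract, one biased coin toss decides whether the compiled algorithm follows guidance blindly or ignores it entirely; the function signatures below are illustrative, not the paper's formalism.

```python
import random

def dtb_compile(base_algorithm, trust_probability):
    """Drop-or-Trust-Blindly (DTB) compiler sketch: the compiled
    algorithm tosses one biased coin, then either answers every request
    with the incoming guidance or runs the base algorithm throughout.
    `base_algorithm` maps a request to an answer."""
    def compiled(requests, guidance, rng=None):
        rng = rng or random.Random()
        trust = rng.random() < trust_probability
        return [g if trust else base_algorithm(r)
                for r, g in zip(requests, guidance)]
    return compiled

double = lambda r: r * 2  # toy prediction-oblivious online algorithm
trusting = dtb_compile(double, trust_probability=0.9)
skeptical = dtb_compile(double, trust_probability=0.5)
# random.Random(0).random() ≈ 0.844: below 0.9 (trust), above 0.5 (drop)
out_trust = trusting([1, 2], [10, 20], rng=random.Random(0))
out_drop = skeptical([1, 2], [10, 20], rng=random.Random(0))
```

Setting the bias appropriately is what trades consistency (guidance correct, so trusting pays off) against robustness (guidance adversarial, so the base algorithm's competitiveness is preserved).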

[288] ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang, Longtao Huang, Jianbo Gao, Zhong Chen, Wei Yang Bryan Lim

Main category: cs.AI

TL;DR: ICON is a probing-to-mitigation framework that detects and neutralizes Indirect Prompt Injection attacks in LLM agents by identifying over-focusing signatures in latent space and performing surgical attention steering, achieving high security while preserving task utility.

DetailsMotivation: Existing defenses against Indirect Prompt Injection attacks rely on strict filtering or refusal mechanisms that cause over-refusal, prematurely terminating valid agentic workflows. There's a need for a solution that can neutralize attacks while preserving task continuity.

Method: ICON uses a two-stage approach: 1) Latent Space Trace Prober detects attacks based on high intensity scores from over-focusing signatures in latent space, 2) Mitigating Rectifier performs surgical attention steering that selectively manipulates adversarial query-key dependencies while amplifying task-relevant elements to restore the LLM’s functional trajectory.

Result: ICON achieves a competitive 0.4% Attack Success Rate (ASR), matching commercial grade detectors, while yielding over 50% task utility gain. It demonstrates robust Out-of-Distribution generalization and extends effectively to multi-modal agents.

Conclusion: ICON establishes a superior balance between security and efficiency for LLM agents, effectively neutralizing IPI attacks while preserving task continuity, with strong generalization capabilities to multi-modal settings.

Abstract: Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent’s execution. Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic workflows. We propose ICON, a probing-to-mitigation framework that neutralizes attacks while preserving task continuity. Our key insight is that IPI attacks leave distinct over-focusing signatures in the latent space. We introduce a Latent Space Trace Prober to detect attacks based on high intensity scores. Subsequently, a Mitigating Rectifier performs surgical attention steering that selectively manipulates adversarial query-key dependencies while amplifying task-relevant elements to restore the LLM’s functional trajectory. Extensive evaluations on multiple backbones show that ICON achieves a competitive 0.4% ASR, matching commercial-grade detectors, while yielding a task-utility gain of over 50%. Furthermore, ICON demonstrates robust Out-of-Distribution (OOD) generalization and extends effectively to multi-modal agents, establishing a superior balance between security and efficiency.
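The "surgical attention steering" idea can be illustrated with a toy operation on an attention matrix. This is a sketch under stated assumptions (which token indices are flagged and the damping factor are hypothetical; ICON's actual rectifier operates on query-key dependencies inside the model):

```python
import numpy as np

def steer_attention(attn, injected_idx, damp=0.1):
    """Attention-steering sketch: suppress attention mass flowing to
    tokens flagged as injected, then renormalize each row so it
    remains a probability distribution."""
    out = attn.copy()
    out[:, injected_idx] *= damp
    return out / out.sum(axis=-1, keepdims=True)

# One query row over three key tokens; token 2 is flagged as injected.
attn = np.array([[0.25, 0.25, 0.5]])
steered = steer_attention(attn, [2], damp=0.0)
```

After steering, the mass the over-focused head placed on the injected token is redistributed to the task-relevant tokens, which is the mechanism by which the workflow can continue rather than be refused outright.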

[289] Counterfactual Simulation Training for Chain-of-Thought Faithfulness

Peter Hase, Christopher Potts

Main category: cs.AI

TL;DR: Counterfactual Simulation Training (CST) improves Chain-of-Thought faithfulness by training models to produce reasoning that enables accurate simulation of outputs over counterfactual inputs, with applications in detecting spurious features and improving generalizable reasoning.

DetailsMotivation: Chain-of-Thought reasoning is commonly used to understand LLM outputs, but suffers from faithfulness problems that limit insights. Current CoT analysis often reveals models relying on spurious features, reward hacking, or sycophantic behavior.

Method: Introduces Counterfactual Simulation Training (CST) that rewards CoTs enabling simulators to accurately predict model outputs over counterfactual inputs. Applied in two settings: 1) CoT monitoring with cue-based counterfactuals to detect problematic behaviors, and 2) counterfactual simulation over generic model-based counterfactuals to improve faithfulness.

Result: CST improves monitor accuracy on cue-based counterfactuals by 35 accuracy points and simulatability over generic counterfactuals by 2 points. Outperforms prompting baselines, LLM rewriting is 5x more efficient than RL alone, improvements don’t generalize to dissuading cues, and larger models benefit more from CST.

Conclusion: CST can substantially improve CoT faithfulness with promising applications for CoT monitoring. Larger models don’t show more faithful CoT out of the box but benefit more from CST training.

Abstract: Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model’s outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals (by 35 accuracy points) as well as simulatability over generic counterfactuals (by 2 points). We further show that: (1) CST outperforms prompting baselines, (2) rewriting unfaithful CoTs with an LLM is 5x more efficient than RL alone, (3) faithfulness improvements do not generalize to dissuading cues (as opposed to persuading cues), and (4) larger models do not show more faithful CoT out of the box, but they do benefit more from CST. These results suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring. Code for experiments in this paper is available at https://github.com/peterbhase/counterfactual-simulation-training
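The CST reward can be sketched as a simulatability score: a CoT earns reward in proportion to how often a simulator reading it predicts the model's output on counterfactual inputs. The simulator interface below is an assumption for illustration (in the paper the simulator is itself a model), but the scoring logic matches the abstract's description.

```python
def cst_reward(cot, counterfactual_inputs, model_outputs, simulator):
    """Counterfactual-simulation reward sketch: fraction of
    counterfactual inputs on which the simulator, given only the CoT,
    matches the model's actual output."""
    correct = sum(simulator(cot, x) == y
                  for x, y in zip(counterfactual_inputs, model_outputs))
    return correct / len(counterfactual_inputs)

# Toy simulator: it can only act on what the CoT explicitly states.
sim = lambda cot, x: x * 2 if "double" in cot else x

# The model actually doubles its input on these counterfactuals.
faithful = cst_reward("I double the input", [1, 2, 3], [2, 4, 6], sim)
unfaithful = cst_reward("I return the input unchanged",
                        [1, 2, 3], [2, 4, 6], sim)
```

A CoT that states the true rule is fully simulatable (reward 1.0); a CoT hiding the rule scores 0.0, so training on this reward pressures the model to surface its actual decision procedure.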

[290] Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning

Xu Wan, Yansheng Wang, Wenqi Huang, Mingyang Sun

Main category: cs.AI

TL;DR: BAPO is an off-policy RL framework for LLM post-training that improves data efficiency by dynamically selecting training batches from historically difficult samples while maintaining policy improvement guarantees.

DetailsMotivation: Traditional on-policy RLVR frameworks suffer from experience waste and reward homogeneity, which hinders learning efficiency on difficult samples during LLM post-training, motivating the need for more data-efficient approaches.

Method: Batch Adaptation Policy Optimization (BAPO) - an off-policy RLVR framework that dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones while maintaining a lower bound guarantee for policy improvement.

Result: BAPO achieves 12.5% average improvement over GRPO across mathematics, planning, and visual reasoning tasks, and successfully resolves 40.7% of problems that base models consistently fail to solve.

Conclusion: BAPO provides an effective off-policy RL framework for LLM post-training that significantly improves data efficiency and performance on challenging tasks, particularly for difficult samples that traditional methods struggle with.

Abstract: Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinders learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework to improve the data efficiency of large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while maintaining a lower-bound guarantee for policy improvement. Extensive experiments further demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7% of problems that base models consistently fail to solve.
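The batch-adaptation idea can be sketched as buffer filtering: reward homogeneity (all rollouts correct, or all wrong) yields zero advantage in GRPO-style updates, so only buffered samples with mixed outcomes are worth reusing. The selection rule below is an assumption for illustration, not BAPO's actual criterion:

```python
import random

def select_batch(buffer, fresh_samples, batch_size, rng=None):
    """Off-policy batch-adaptation sketch: top up a batch of fresh
    rollouts with historically difficult buffered samples whose reward
    signal is not homogeneous (i.e., still carries gradient signal)."""
    rng = rng or random.Random(0)
    informative = [s for s in buffer if 0.0 < s["mean_reward"] < 1.0]
    informative.sort(key=lambda s: s["mean_reward"])   # hardest first
    reused = informative[:max(0, batch_size - len(fresh_samples))]
    batch = list(fresh_samples) + reused
    rng.shuffle(batch)
    return batch[:batch_size]

buffer = [{"id": i, "mean_reward": r}
          for i, r in enumerate([0.0, 0.2, 1.0, 0.6])]
fresh = [{"id": "f1", "mean_reward": 0.5}, {"id": "f2", "mean_reward": 0.5}]
batch = select_batch(buffer, fresh, batch_size=4)
```

Samples with mean reward 0.0 or 1.0 are dropped (homogeneous, hence uninformative), while the partially-solved hard samples are recycled instead of wasted, which is the efficiency gain the abstract targets.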

[291] Modality-Guided Mixture of Graph Experts with Entropy-Triggered Routing for Multimodal Recommendation

Ji Dai, Quan Fang, Dengsheng Cai

Main category: cs.AI

TL;DR: MAGNET is a multimodal recommendation system that uses modality-guided mixture of experts with progressive entropy-triggered routing to address modality heterogeneity and conflict in recommendation tasks.

DetailsMotivation: Multimodal recommendation faces challenges with heterogeneous and potentially conflicting signals from different modalities, leading to entangled representations and modality imbalance in existing approaches that use shared fusion pathways.

Method: Proposes MAGNET with dual-view graph learning (augmenting interaction graph with content-induced edges) and structured experts with explicit modality roles (dominant, balanced, complementary). Uses two-stage entropy-weighting mechanism for progressive routing from coverage to specialization.

Result: Extensive experiments on public benchmarks demonstrate consistent improvements over strong baselines in multimodal recommendation tasks.

Conclusion: MAGNET enhances controllability, stability, and interpretability in multimodal fusion by explicitly controlling what and how to fuse through interaction-conditioned expert routing with structure-aware graph augmentation.

Abstract: Multimodal recommendation enhances ranking by integrating user-item interactions with item content, which is particularly effective under sparse feedback and long-tail distributions. However, multimodal signals are inherently heterogeneous and can conflict in specific contexts, making effective fusion both crucial and challenging. Existing approaches often rely on shared fusion pathways, leading to entangled representations and modality imbalance. To address these issues, we propose MAGNET, a Modality-Guided Mixture of Adaptive Graph Experts Network with Progressive Entropy-Triggered Routing for Multimodal Recommendation, designed to enhance controllability, stability, and interpretability in multimodal fusion. MAGNET couples interaction-conditioned expert routing with structure-aware graph augmentation, so that both what to fuse and how to fuse are explicitly controlled and interpretable. At the representation level, a dual-view graph learning module augments the interaction graph with content-induced edges, improving coverage for sparse and long-tail items while preserving collaborative structure via parallel encoding and lightweight fusion. At the fusion level, MAGNET employs structured experts with explicit modality roles – dominant, balanced, and complementary – enabling a more interpretable and adaptive combination of behavioral, visual, and textual cues. To further stabilize sparse routing and prevent expert collapse, we introduce a two-stage entropy-weighting mechanism that monitors routing entropy. This mechanism automatically transitions training from an early coverage-oriented regime to a later specialization-oriented regime, progressively balancing expert utilization and routing confidence. Extensive experiments on public benchmarks demonstrate consistent improvements over strong baselines.
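The entropy-monitoring trigger is straightforward to sketch: compute the mean Shannon entropy of the router's expert distribution, and switch training regimes once it falls below a threshold. The threshold value and regime names below are assumptions for illustration:

```python
import numpy as np

def routing_entropy(router_probs):
    """Mean Shannon entropy of the router's per-item expert
    distribution across a batch (shape: batch x experts)."""
    p = np.clip(router_probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def routing_regime(router_probs, threshold):
    """Entropy-triggered switch sketch: high entropy means the router
    is still undecided (coverage regime); once entropy drops below the
    threshold, training moves to the specialization regime."""
    if routing_entropy(router_probs) >= threshold:
        return "coverage"
    return "specialization"

uniform = np.full((4, 3), 1 / 3)              # undecided router
peaked = np.tile([0.9, 0.05, 0.05], (4, 1))   # confident router
```

A uniform router over 3 experts has entropy ln 3 ≈ 1.10 and stays in the coverage regime; once routing sharpens, entropy drops (≈ 0.39 for the peaked example) and specialization begins, which is how the mechanism balances expert utilization against routing confidence.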

[292] Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback

Chenyang Zhao, Vinny Cahill, Ivana Dusparic

Main category: cs.AI

TL;DR: Multi-objective RL from AI Feedback (RLAIF) extends preference-based RL to systems with conflicting objectives, using LLMs to generate preference labels at scale instead of human annotators, enabling balanced trade-offs without manual reward engineering.

Motivation: Reward design is challenging for real-world RL deployment, especially in multi-objective settings where trade-offs among conflicting objectives are difficult to specify. Existing RLAIF work focuses only on single-objective tasks, leaving open how to handle systems with multiple objectives where policies risk collapsing into optimizing for a dominant goal.

Method: Extends the RLAIF paradigm to multi-objective self-adaptive systems, using large language models to generate preference labels at scale for multiple objectives instead of relying on human annotators. This approach avoids laborious reward engineering while producing policies that reflect different user priorities.

Result: Multi-objective RLAIF can produce policies that yield balanced trade-offs reflecting different user priorities without requiring extensive reward engineering. The approach demonstrates that LLMs can effectively handle preference labeling for systems with inherently conflicting objectives.

Conclusion: Integrating RLAIF into multi-objective RL offers a scalable path toward user-aligned policy learning in domains with conflicting objectives, providing an alternative to traditional reward design challenges in real-world RL deployment.

Abstract: Reward design has been one of the central challenges for real world reinforcement learning (RL) deployment, especially in settings with multiple objectives. Preference-based RL offers an appealing alternative by learning from human preferences over pairs of behavioural outcomes. More recently, RL from AI feedback (RLAIF) has demonstrated that large language models (LLMs) can generate preference labels at scale, mitigating the reliance on human annotators. However, existing RLAIF work typically focuses only on single-objective tasks, leaving the open question of how RLAIF handles systems that involve multiple objectives. In such systems trade-offs among conflicting objectives are difficult to specify, and policies risk collapsing into optimizing for a dominant goal. In this paper, we explore the extension of the RLAIF paradigm to multi-objective self-adaptive systems. We show that multi-objective RLAIF can produce policies that yield balanced trade-offs reflecting different user priorities without laborious reward engineering. We argue that integrating RLAIF into multi-objective RL offers a scalable path toward user-aligned policy learning in domains with inherently conflicting objectives.
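To make the multi-objective idea concrete, here is one way per-objective preference labels from an LLM judge could be combined under user-priority weights. The objective names, weights, and weighted vote are illustrative assumptions, not the paper's actual aggregation scheme.

```python
def aggregate_preferences(labels, weights):
    """Combine per-objective preference labels ('A' or 'B') into a single
    training label via a weighted vote; weights encode user priorities."""
    score = sum(w if labels[obj] == "A" else -w for obj, w in weights.items())
    return "A" if score >= 0 else "B"

# Hypothetical traffic-control objectives labelled by an LLM judge
labels = {"wait_time": "A", "emissions": "B", "throughput": "A"}
weights = {"wait_time": 0.5, "emissions": 0.3, "throughput": 0.2}
print(aggregate_preferences(labels, weights))  # -> A
```

Shifting weight toward `emissions` would flip some aggregate labels, which is how different user priorities yield different learned trade-offs downstream.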

[293] CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis

Main category: cs.AI

TL;DR: CHESS is an algorithm-system co-design KV-cache management system for long-context LLMs that uses context-aware hierarchical selection to reconstruct coherent context while achieving practical acceleration through coarse granularity selection.

Motivation: Long-context LLMs face performance bottlenecks due to KV cache constraints as context grows. Existing pruning methods are context-agnostic, ignore step-wise relevance and local semantics, and suffer from irregular memory accesses and selection overheads that limit practical speedups.

Method: CHESS combines algorithmic and system co-design: 1) Algorithmically, it introduces context-aware hierarchical selection that dynamically reconstructs coherent context for current decoding; 2) System-wise, it uses coarse granularity selection to eliminate expensive data movement and realize practical acceleration from theoretical sparsity.

Result: CHESS surpasses Full-KV quality using only 1% of KV cache, delivers low-latency stable inference with up to 4.56× higher throughput, and consistently outperforms other strong baselines.

Conclusion: CHESS effectively addresses KV cache bottlenecks in long-context LLMs through context-aware hierarchical selection and system optimizations, achieving both quality preservation and significant practical acceleration.

Abstract: Long-context LLMs demand accurate inference at low latency, yet decoding becomes primarily constrained by KV cache as context grows. Prior pruning methods are largely context-agnostic: their token selection ignores step-wise relevance and local semantics, which undermines quality. Moreover, their irregular accesses and selection overheads yield only limited wall-clock speedups. To address this, we propose \textbf{CHESS}, an \textit{algorithm-system co-design} KV-cache management system. Algorithmically, CHESS introduces a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding. System-wise, coarse granularity selection eliminates expensive data movement, fully realizing practical acceleration from theoretical sparsity. Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only \textbf{1%} of the KV cache, delivers low-latency stable inference with up to \textbf{4.56$\times$} higher throughput, and consistently outperforms other strong baselines. Code is available at \href{https://anonymous.4open.science/r/CHESS-9958/}{https://anonymous.4open.science/r/CHESS/}.
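The efficiency argument for coarse granularity is that selecting whole blocks of cached keys keeps memory access contiguous instead of scattered. A minimal sketch of block-wise KV selection, assuming a simple max-dot-product block score; this is not CHESS's actual hierarchical policy.

```python
import numpy as np

def select_kv_blocks(keys, query, block_size=4, keep_blocks=2):
    """Score fixed-size blocks of cached keys against the current query
    and keep the top-scoring blocks (coarse-granularity selection)."""
    n_blocks = len(keys) // block_size
    scores = [float(np.max(keys[i*block_size:(i+1)*block_size] @ query))
              for i in range(n_blocks)]
    kept = sorted(np.argsort(scores)[-keep_blocks:])
    # Contiguous index runs -> regular, cheap memory access
    return [j for b in kept for j in range(b * block_size, (b + 1) * block_size)]

keys = np.zeros((16, 8))
keys[5] = 1.0   # one highly relevant cached key
keys[13] = 0.5  # one mildly relevant cached key
query = np.ones(8)
idx = select_kv_blocks(keys, query)
print(idx)  # -> [4, 5, 6, 7, 12, 13, 14, 15]
```

Token-level selection would pick indices 5 and 13 alone; block-level selection trades a little precision for the contiguous reads that turn theoretical sparsity into wall-clock speedup.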

[294] PyVision-RL: Forging Open Agentic Vision Models via RL

Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, Chen Wei

Main category: cs.AI

TL;DR: PyVision-RL is a reinforcement learning framework that prevents interaction collapse in multimodal agents by using oversampling-filtering-ranking rollouts and accumulative tool rewards, enabling sustained multi-turn tool usage for image and video understanding tasks.

Motivation: Current reinforcement learning for agentic multimodal models suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. This collapse prevents models from fully leveraging their multimodal capabilities.

Method: PyVision-RL combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to stabilize training and sustain interaction. For video reasoning, it employs on-demand context construction that selectively samples task-relevant frames during reasoning to reduce visual token usage.

Result: The framework develops PyVision-Image and PyVision-Video models that show strong performance and improved efficiency in image and video understanding tasks, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.

Conclusion: PyVision-RL successfully addresses interaction collapse in multimodal agents, enabling sustained tool usage and multi-turn reasoning through innovative training stabilization techniques and efficient visual processing methods.

Abstract: Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
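The oversampling-filtering-ranking rollout idea can be sketched as: generate many rollouts, drop those that abandoned tool use, then rank the rest with an accumulative per-call tool bonus. Field names and the bonus form are hypothetical, not PyVision-RL's actual schema.

```python
def select_rollouts(rollouts, keep=2, tool_bonus=0.1):
    kept = [r for r in rollouts if r["tool_calls"] > 0]        # filter collapse
    ranked = sorted(kept,                                      # rank
                    key=lambda r: r["reward"] + tool_bonus * r["tool_calls"],
                    reverse=True)
    return ranked[:keep]                                       # keep top-ranked

rollouts = [
    {"id": 0, "reward": 0.9, "tool_calls": 0},  # high reward, no tool use: dropped
    {"id": 1, "reward": 0.6, "tool_calls": 3},
    {"id": 2, "reward": 0.7, "tool_calls": 1},
    {"id": 3, "reward": 0.5, "tool_calls": 4},
]
print([r["id"] for r in select_rollouts(rollouts)])  # -> [1, 3]
```

Filtering out zero-tool rollouts before ranking is what prevents the "reward but no interaction" trajectories from dominating the gradient signal.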

[295] Pipeline for Verifying LLM-Generated Mathematical Solutions

Varvara Sazonova, Dmitri Shmelkin, Stanislav Kikot, Vasily Motolygin

Main category: cs.AI

TL;DR: A pipeline for automatic and interactive verification of mathematical reasoning in large language models, using proof assistants and small models for verification, with low false positive rates.

Motivation: Current benchmarks for large reasoning models primarily check only final answers, which is insufficient for accurately measuring mathematical reasoning capabilities. There's a need for more rigorous verification methods that can validate the entire solution process.

Method: Developed a pipeline that uses prompts to obtain solutions in specific formats that enable verification using proof assistants (like Lean4) and small models (≤8B parameters). Includes 3 AI agents for benchmarking, supports both formal and informal language solutions, and offers both automatic and interactive verification modes.

Result: Experiments on several datasets show low probability of false positives. The pipeline can generate correct solutions and provides more accurate assessment of mathematical reasoning capabilities compared to answer-only checking.

Conclusion: The proposed verification pipeline offers a more rigorous alternative to current benchmarking approaches, enabling better evaluation of mathematical reasoning in large language models through formal verification methods.

Abstract: With the growing popularity of Large Reasoning Models and their results in solving mathematical problems, it becomes crucial to measure their capabilities. We introduce a pipeline for both automatic and interactive verification as a more accurate alternative to only checking the answer, which is currently the most popular approach for benchmarks. The pipeline can also be used as a generator of correct solutions both in formal and informal languages. Three AI agents, which can be chosen for the benchmark accordingly, are included in the structure. The key idea is the use of prompts to obtain the solution in a specific form that allows for easier verification using proof assistants and possible use of small models ($\le 8B$). Experiments on several datasets suggest a low probability of False Positives. The open-source implementation with instructions on setting up a server is available at https://github.com/LogicEnj/lean4_verification_pipeline.

[296] POMDPPlanners: Open-Source Package for POMDP Planning

Yaacov Pariente, Vadim Indelman

Main category: cs.AI

TL;DR: POMDPPlanners is an open-source Python package for evaluating POMDP planning algorithms with benchmark environments, hyperparameter optimization, and parallel simulation capabilities.

Motivation: To reduce the overhead of extensive simulation studies in POMDP research and enable scalable, reproducible research on decision-making under uncertainty, particularly in risk-sensitive settings where existing toolkits are insufficient.

Method: Developed an integrated Python package that combines state-of-the-art planning algorithms, benchmark environments (including safety-critical variants), automated hyperparameter optimization via Optuna, persistent caching with failure recovery, and configurable parallel simulation.

Result: Created POMDPPlanners, an open-source tool that facilitates empirical evaluation of POMDP planning algorithms with reduced simulation overhead and improved reproducibility.

Conclusion: POMDPPlanners enables more efficient and comprehensive evaluation of decision-making algorithms under uncertainty, particularly valuable for risk-sensitive applications where standard toolkits are inadequate.

Abstract: We present POMDPPlanners, an open-source Python package for empirical evaluation of Partially Observable Markov Decision Process (POMDP) planning algorithms. The package integrates state-of-the-art planning algorithms, a suite of benchmark environments with safety-critical variants, automated hyperparameter optimization via Optuna, persistent caching with failure recovery, and configurable parallel simulation – reducing the overhead of extensive simulation studies. POMDPPlanners is designed to enable scalable, reproducible research on decision-making under uncertainty, with particular emphasis on risk-sensitive settings where standard toolkits fall short.
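The hyperparameter loop the package automates via Optuna reduces, at its core, to sample-evaluate-keep-best. A dependency-free random-search stand-in for that loop; the objective shape and the exploration-constant parameter name are made up for illustration.

```python
import random

def tune(objective, space, n_trials=50, seed=0):
    """Minimal random search: sample hyperparameters from `space`,
    evaluate, keep the best (score, params) pair."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        score = objective(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Hypothetical planner objective: mean return peaks at exploration c ~ 1.4
score, params = tune(lambda p: -(p["c"] - 1.4) ** 2, {"c": (0.1, 3.0)})
print(round(params["c"], 2))
```

Optuna replaces the uniform sampling with smarter samplers and adds pruning and persistence, but the interface contract (objective in, best trial out) is the same.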

[297] Qwen-BIM: developing large language model for BIM-based design with domain-specific benchmark and dataset

Jia-Rui Lin, Yun-Hong Cai, Xiang-Rui Ni, Shaojie Zhou, Peng Pan

Main category: cs.AI

TL;DR: First domain-specific LLM for BIM-based design with evaluation benchmark, dataset generation method, and fine-tuning strategy, achieving performance comparable to much larger general LLMs.

Motivation: Despite LLMs showing potential in BIM-based design, the lack of specific datasets and evaluation benchmarks has hindered their performance in this domain.

Method: 1) Created evaluation benchmark with quantitative indicators for BIM design tasks, 2) Developed method to generate textual data from BIM and construct datasets, 3) Proposed fine-tuning strategy to adapt LLMs for BIM-based design.

Result: Qwen-BIM achieves 21.0% average increase in G-Eval score vs base LLM, and with only 14B parameters performs comparably to general LLMs with 671B parameters on BIM tasks.

Conclusion: Developed first domain-specific LLM for BIM-based design with comprehensive benchmark and high-quality dataset, providing foundation for BIM-related LLM development.

Abstract: As the construction industry advances toward digital transformation, BIM (Building Information Modeling)-based design has become a key driver supporting intelligent construction. Although Large Language Models (LLMs) have shown potential in promoting BIM-based design, the lack of specific datasets and LLM evaluation benchmarks has significantly hindered the performance of LLMs. Therefore, this paper addresses this gap by proposing: 1) an evaluation benchmark for BIM-based design together with corresponding quantitative indicators to evaluate the performance of LLMs, 2) a method for generating textual data from BIM and constructing corresponding BIM-derived datasets for LLM evaluation and fine-tuning, and 3) a fine-tuning strategy to adapt LLMs for BIM-based design. Results demonstrate that the proposed domain-specific benchmark effectively and comprehensively assesses LLM capabilities, highlighting that general LLMs are still incompetent for domain-specific tasks. Meanwhile, with the proposed benchmark and datasets, Qwen-BIM is developed and achieves a 21.0% average increase in G-Eval score compared to the base LLM model. Notably, with only 14B parameters, the performance of Qwen-BIM is comparable to that of general LLMs with 671B parameters for BIM-based design tasks. Overall, this study develops the first domain-specific LLM for BIM-based design by introducing a comprehensive benchmark and high-quality dataset, which provide a solid foundation for developing BIM-related LLMs in various fields.

[298] Pressure Reveals Character: Behavioural Alignment Evaluation at Depth

Nora Petrova, John Burden

Main category: cs.AI

TL;DR: A comprehensive alignment benchmark with 904 realistic multi-turn scenarios across six categories (Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, Scheming) that tests language models under pressure, revealing consistent weaknesses even in top models.

Motivation: Current alignment evaluations lack realistic multi-turn scenarios that test how models behave under pressure, while alignment failures increasingly cause real-world harm. There's a need for comprehensive evaluation frameworks that go beyond single-turn assessments.

Method: Created 904 scenarios across six alignment categories validated as realistic by human raters. Scenarios include conflicting instructions, simulated tool access, and multi-turn escalation. Evaluated 24 frontier models using LLM judges validated against human annotations.

Result: Even top-performing models show gaps in specific categories, with most models exhibiting consistent weaknesses across the board. Factor analysis reveals alignment behaves as a unified construct (g-factor), where models scoring high in one category tend to score high in others.

Conclusion: The benchmark reveals significant alignment gaps in current models and demonstrates that alignment behaves as a unified construct. The authors release the benchmark and interactive leaderboard to support ongoing evaluation and plan to expand scenarios in areas of persistent weakness.

Abstract: Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking. We introduce an alignment benchmark spanning 904 scenarios across six categories – Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming – validated as realistic by human raters. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. Evaluating 24 frontier models using LLM judges validated against human annotations, we find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses across the board. Factor analysis reveals that alignment behaves as a unified construct (analogous to the g-factor in cognitive research) with models scoring high on one category tending to score high on others. We publicly release the benchmark and an interactive leaderboard to support ongoing evaluation, with plans to expand scenarios in areas where we observe persistent weaknesses and to add new models as they are released.
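The g-factor claim can be illustrated with a quick proxy: if one latent factor drives category scores, the first principal component of their correlation matrix captures most of the variance. The synthetic data below is purely illustrative, not the paper's results.

```python
import numpy as np

def first_factor_share(scores):
    """Share of variance captured by the first principal component of the
    category-score correlation matrix -- a quick proxy for a g-factor."""
    corr = np.corrcoef(scores, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)     # ascending order
    return float(eigvals[-1] / eigvals.sum())

# Synthetic scores for 24 models x 6 categories, driven by one latent
# "general alignment" factor plus independent noise
rng = np.random.default_rng(1)
g = rng.normal(size=(24, 1))
scores = g + 0.3 * rng.normal(size=(24, 6))
share = first_factor_share(scores)
print(share > 0.7)  # -> True
```

A full factor analysis adds rotation and uniqueness estimates, but a dominant first eigenvalue is the standard first signal of a unified construct.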

[299] Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs

Dhita Putri Pratama, Soyeon Caren Han, Yihao Ding

Main category: cs.AI

TL;DR: ViLCaR introduces a diagnostic benchmark for evaluating causal reasoning in LVLMs using structured Vision-Language Causal Graphs (VLCGs) to assess whether failures stem from reasoning limitations or misidentifying causally relevant information.

Motivation: Current LVLMs often rely on spurious correlations rather than genuine causal reasoning, and existing evaluations only assess answer correctness without determining whether failures come from limited reasoning capability or misidentifying causally relevant information.

Method: Introduces Vision-Language Causal Graphs (VLCGs) - structured, query-conditioned representations encoding causally relevant objects, attributes, relations, and scene-grounded assumptions. Creates ViLCaR benchmark with tasks for Causal Attribution, Causal Inference, and Question Answering, with graph-aligned evaluation metrics.

Result: Experiments show injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. Current LVLM limitations in causal reasoning stem primarily from insufficient structural guidance rather than lack of reasoning capacity.

Conclusion: The paper demonstrates that structured causal representations can significantly improve LVLM causal reasoning, suggesting that providing better structural guidance is key to advancing multimodal causal understanding.

Abstract: Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of the answers, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising tasks for Causal Attribution, Causal Inference, and Question Answering, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments on state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. These findings suggest that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.

[300] Predicting Sentence Acceptability Judgments in Multimodal Contexts

Hyewon Jang, Nikolai Ilinykh, Sharid Loáiciga, Jey Han Lau, Shalom Lappin

Main category: cs.AI

TL;DR: LLMs can predict human sentence acceptability judgments well, but visual context has minimal impact on human ratings while affecting LLM predictions, revealing differences in multimodal processing between humans and models.

Motivation: To investigate how visual context affects sentence acceptability judgments in both humans and large language models, comparing multimodal processing capabilities and identifying similarities/differences between human and LLM language understanding.

Method: Experimental study examining human acceptability ratings and LLM predictions for sentences with/without visual context, comparing different LLM types, analyzing correlations with model probabilities, and assessing compression effects.

Result: Visual context has little impact on human acceptability ratings but affects LLM predictions; LLMs predict human judgments accurately but perform slightly better without visual context; Qwen resembles human patterns while other models diverge; correlations between LLM predictions and probabilities decrease with visual context.

Conclusion: Humans and LLMs process multimodal contexts differently - visual context minimally affects humans but influences LLMs, revealing gaps between LLM internal representations and predictions in multimodal settings, highlighting important differences in multimodal language understanding.

Abstract: Previous work has examined the capacity of deep neural networks (DNNs), particularly transformers, to predict human sentence acceptability judgments, both independently of context, and in document contexts. We consider the effect of prior exposure to visual images (i.e., visual context) on these judgments for humans and large language models (LLMs). Our results suggest that, in contrast to textual context, visual images appear to have little if any impact on human acceptability ratings. However, LLMs display the compression effect seen in previous work on human judgments in document contexts. Different sorts of LLMs are able to predict human acceptability judgments to a high degree of accuracy, but in general, their performance is slightly better when visual contexts are removed. Moreover, the distribution of LLM judgments varies among models, with Qwen resembling human patterns, and others diverging from them. LLM-generated predictions on sentence acceptability are highly correlated with their normalised log probabilities in general. However, the correlations decrease when visual contexts are present, suggesting that a higher gap exists between the internal representations of LLMs and their generated predictions in the presence of visual contexts. Our experimental work suggests interesting points of similarity and of difference between human and LLM processing of sentences in multimodal contexts.
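For readers unfamiliar with normalised log probabilities: one common normalisation is the mean per-token log probability, which removes the length penalty when comparing sentences. This is a generic sketch of that idea; the paper's exact normalisation scheme may differ.

```python
def norm_logprob(token_logprobs):
    """Length-normalised sentence score: mean per-token log probability."""
    return sum(token_logprobs) / len(token_logprobs)

short = [-2.0, -1.0]                # total -3.0 over 2 tokens
longer = [-2.0, -1.0, -1.5, -0.5]   # total -5.0 over 4 tokens
print(norm_logprob(short), norm_logprob(longer))  # -> -1.5 -1.25
```

Raw totals would favour the short sentence (-3.0 vs -5.0); per-token normalisation reverses the ranking, which is why it correlates better with acceptability judgments.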

[301] HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG

Yuqi Huang, Ning Liao, Kai Yang, Anning Hu, Shengchao Hu, Xiaoxing Wang, Junchi Yan

Main category: cs.AI

TL;DR: HELP is a GraphRAG framework that improves multi-hop reasoning by using HyperNode Expansion to create coherent reasoning paths and Logical Path-Guided Evidence Localization for efficient retrieval, achieving better accuracy and 28.8× speedup over baselines.

Motivation: LLMs struggle with knowledge boundaries and hallucinations in knowledge-intensive tasks. While RAG helps, it often misses structural dependencies needed for multi-hop reasoning. Graph-based RAG approaches exist but face accuracy-efficiency trade-offs due to costly graph traversals and semantic noise in LLM-generated summaries.

Method: HELP uses two core strategies: 1) HyperNode Expansion - iteratively chains knowledge triplets into coherent reasoning paths abstracted as HyperNodes to capture complex structural dependencies; 2) Logical Path-Guided Evidence Localization - leverages precomputed graph-text correlations to map these paths directly to the corpus for efficiency, avoiding expensive random walks and semantic distortion.

Result: Extensive experiments show HELP achieves competitive performance across multiple simple and multi-hop QA benchmarks and up to 28.8× speedup over leading Graph-based RAG baselines.

Conclusion: HELP effectively balances accuracy with practical efficiency in GraphRAG by capturing structural dependencies while avoiding costly graph traversals and semantic distortion, making it a promising approach for knowledge-intensive tasks requiring multi-hop reasoning.

Abstract: Large Language Models (LLMs) often struggle with inherent knowledge boundaries and hallucinations, limiting their reliability in knowledge-intensive tasks. While Retrieval-Augmented Generation (RAG) mitigates these issues, it frequently overlooks structural interdependencies essential for multi-hop reasoning. Graph-based RAG approaches attempt to bridge this gap, yet they typically face trade-offs between accuracy and efficiency due to challenges such as costly graph traversals and semantic noise in LLM-generated summaries. In this paper, we propose HyperNode Expansion and Logical Path-Guided Evidence Localization strategies for GraphRAG (HELP), a novel framework designed to balance accuracy with practical efficiency through two core strategies: 1) HyperNode Expansion, which iteratively chains knowledge triplets into coherent reasoning paths abstracted as HyperNodes to capture complex structural dependencies and ensure retrieval accuracy; and 2) Logical Path-Guided Evidence Localization, which leverages precomputed graph-text correlations to map these paths directly to the corpus for superior efficiency. HELP avoids expensive random walks and semantic distortion, preserving knowledge integrity while drastically reducing retrieval latency. Extensive experiments demonstrate that HELP achieves competitive performance across multiple simple and multi-hop QA benchmarks and up to a 28.8$\times$ speedup over leading Graph-based RAG baselines.
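The kind of multi-hop path a HyperNode abstracts can be sketched by iteratively chaining triplets whose head matches the current path's tail. This is a simplified illustration of triplet chaining, with a toy knowledge graph of my own invention, not HELP's actual expansion algorithm.

```python
def chain_triplets(triplets, start, max_depth=3):
    """Chain (head, relation, tail) triplets into multi-hop reasoning
    paths rooted at `start`, avoiding revisits within a path."""
    paths = [[start]]
    for _ in range(max_depth):
        new_paths = []
        for path in paths:
            for h, r, t in triplets:
                if h == path[-1] and t not in path:
                    new_paths.append(path + [r, t])
        if not new_paths:
            break
        paths = new_paths
    return paths

kg = [("Curie", "won", "Nobel Prize"),
      ("Nobel Prize", "awarded_in", "Stockholm"),
      ("Stockholm", "capital_of", "Sweden")]
print(chain_triplets(kg, "Curie"))
```

Each returned path is a coherent chain of evidence; abstracting it into a single retrievable unit is what lets retrieval match a multi-hop question without a random walk at query time.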

[302] Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence

ChengYou Li, XiaoDong Liu, XiangBao Meng, XinYu Zhao

Main category: cs.AI

TL;DR: AgentOS proposes a conceptual framework redefining LLMs as “Reasoning Kernels” with operating system logic, introducing Deep Context Management to treat context windows as Addressable Semantic Spaces for better multi-agent orchestration.

Motivation: Current LLM research focuses on scaling context windows or prompt engineering, but lacks theoretical bridges between micro-scale token processing and macro-scale systemic intelligence. There's a need for holistic frameworks to enable dynamic autonomous cognitive systems.

Method: Proposes AgentOS framework that maps classical OS abstractions (memory paging, interrupt handling, process scheduling) onto LLM native constructs. Introduces Deep Context Management, Semantic Slicing, and Temporal Alignment mechanisms to mitigate cognitive drift in multi-agent systems.

Result: Provides a rigorous roadmap for architecting resilient, scalable, and self-evolving cognitive environments. The framework enables better coordination between micro-scale token processing and macro-scale systemic intelligence in LLMs.

Conclusion: The next frontier of AGI development lies in architectural efficiency of system-level coordination, requiring structured operating system logic for LLMs to function as autonomous cognitive systems rather than static inference engines.

Abstract: The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems. While current research primarily focuses on scaling context windows or optimizing prompt engineering, the theoretical bridge between micro-scale token processing and macro-scale systemic intelligence remains fragmented. This paper proposes AgentOS, a holistic conceptual framework that redefines the LLM as a “Reasoning Kernel” governed by structured operating system logic. Central to this architecture is Deep Context Management, which conceptualizes the context window as an Addressable Semantic Space rather than a passive buffer. We systematically deconstruct the transition from discrete sequences to coherent cognitive states, introducing mechanisms for Semantic Slicing and Temporal Alignment to mitigate cognitive drift in multi-agent orchestration. By mapping classical OS abstractions such as memory paging, interrupt handling, and process scheduling onto LLM-native constructs, this review provides a rigorous roadmap for architecting resilient, scalable, and self-evolving cognitive environments. Our analysis asserts that the next frontier of AGI development lies in the architectural efficiency of system-level coordination.
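The memory-paging analogy can be made concrete: treat context slices as pages, evict low-relevance slices when the token budget is exceeded. The slice schema and priority scores below are illustrative assumptions, not constructs from the paper.

```python
def page_context(segments, budget, score):
    """'Memory paging' for a context window: keep the highest-relevance
    slices that fit within a token budget, evicting the rest."""
    ranked = sorted(segments, key=score, reverse=True)
    kept, used = [], 0
    for seg in ranked:
        if used + seg["tokens"] <= budget:
            kept.append(seg)
            used += seg["tokens"]
    return kept

segments = [{"id": "goal", "tokens": 50},
            {"id": "history", "tokens": 80},
            {"id": "logs", "tokens": 100}]
priority = {"goal": 3, "history": 2, "logs": 1}
kept = page_context(segments, budget=150, score=lambda s: priority[s["id"]])
print([s["id"] for s in kept])  # -> ['goal', 'history']
```

The analogy's limit is worth noting: unlike OS pages, evicted context slices change the model's semantics, which is why the paper argues for explicit semantic addressing rather than a passive buffer.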

[303] LogicGraph : Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification

Yanrui Wu, Lingling Zhang, Xinyu Zhang, Jiayu Chang, Pengyu Li, Xu Jiang, Jingtao Hu, Jun Liu

Main category: cs.AI

TL;DR: LogicGraph: A benchmark for evaluating multi-path logical reasoning in LLMs, revealing models’ tendency to commit early to single solutions rather than exploring diverse valid derivations.

Motivation: Current LLM evaluations focus on convergent reasoning with single correct proofs, but real-world problems often have multiple valid solutions requiring exploration of diverse logical paths.

Method: Created LogicGraph benchmark using neuro-symbolic framework with backward logic generation and semantic instantiation, producing solver-verified problems with multiple minimal proofs and logical distractions.

Result: State-of-the-art LLMs show tendency to commit early to single routes, failing to explore alternatives; coverage gap grows substantially with reasoning depth.

Conclusion: LogicGraph exposes divergence gap in LLM reasoning and provides insights for improving multi-path exploration capabilities.

Abstract: Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning problems admit multiple valid derivations, requiring models to explore diverse logical paths rather than committing to one route. To address this limitation, we introduce LogicGraph, the first benchmark designed to systematically evaluate multi-path logical reasoning, constructed via a neuro-symbolic framework that leverages backward logic generation and semantic instantiation. This pipeline yields solver-verified reasoning problems characterized by high-depth multi-path reasoning and inherent logical distractions, where each instance is associated with an exhaustive set of minimal proofs. We further propose a reference-free evaluation framework to rigorously assess model performance in both convergent and divergent regimes. Experiments on state-of-the-art language models reveal a common limitation: models tend to commit early to a single route and fail to explore alternatives, and the coverage gap grows substantially with reasoning depth. LogicGraph exposes this divergence gap and provides actionable insights to motivate future improvements. Our code and data will be released at https://github.com/kkkkarry/LogicGraph.
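A divergent-regime metric like the coverage gap can be sketched as the fraction of the exhaustive minimal-proof set that appears among a model's generated derivations. Comparing proofs as sets of inference steps is an assumption here, not LogicGraph's exact metric.

```python
def proof_coverage(generated, minimal_proofs):
    """Fraction of minimal proofs covered by the model's generations;
    proofs compared as unordered sets of inference steps."""
    gen = {frozenset(p) for p in generated}
    covered = sum(1 for m in minimal_proofs if frozenset(m) in gen)
    return covered / len(minimal_proofs)

minimal = [["A->B", "B->C"], ["A->D", "D->C"]]    # two valid derivations
generated = [["A->B", "B->C"], ["A->B", "B->C"]]  # model commits to one route
print(proof_coverage(generated, minimal))  # -> 0.5
```

A model that repeats one valid route scores perfectly on convergent accuracy yet only 0.5 on coverage, which is exactly the early-commitment failure the benchmark exposes.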

[304] Tool Building as a Path to “Superintelligence”

David Koplow, Tomer Galanti, Tomaso Poggio

Main category: cs.AI

TL;DR: LLMs’ step-success probability (γ) on logical reasoning tasks declines with depth, though frontier models show partial robustness; precise tool calls prove critical for achieving superintelligence via the Diligent Learner framework.

DetailsMotivation: To measure the step-success probability (γ) of LLMs in logical out-of-distribution inference, which is crucial for achieving superintelligence through the Diligent Learner framework that relies on test-time search.

Method: Designed a benchmark using GF(2) circuit reconstruction tasks that increase in difficulty with each reasoning step, requiring careful integration of all provided information. Analyzed γ values across different model sizes and depths.

Result: Small LLMs show a superlinear decline in γ as depth increases, while frontier models exhibit partial robustness. Successful reasoning at scale depends on precise tool calls, identifying tool design as a critical capability.

Conclusion: Tool design is a critical capability for LLMs to achieve general superintelligence through the Diligent Learner framework, as precise tool calls enable successful reasoning at scale on complex logical tasks.

Abstract: The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficient step-success probability γ. In this work, we design a benchmark to measure γ on logical out-of-distribution inference. We construct a class of tasks involving GF(2) circuit reconstruction that grow more difficult with each reasoning step, and that are, from an information-theoretic standpoint, impossible to reliably solve unless the LLM carefully integrates all of the information provided. Our analysis demonstrates that while the γ value for small LLMs declines superlinearly as depth increases, frontier models exhibit partial robustness on this task. Furthermore, we find that successful reasoning at scale is contingent upon precise tool calls, identifying tool design as a critical capability for LLMs to achieve general superintelligence through the Diligent Learner framework.
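For intuition, GF(2) arithmetic is just bits with XOR as addition and AND as multiplication, so a "circuit" is a composition of such gates, and reconstructing one from input/output observations forces a solver to integrate every observation. A toy illustration of evaluating such a circuit (not the paper's benchmark construction; the circuit is hypothetical):

```python
from itertools import product

# GF(2): addition is XOR, multiplication is AND.
def gf2_circuit(x1, x2, x3):
    """A small fixed circuit over GF(2): y = (x1 + x2) * x3 + x1."""
    return ((x1 ^ x2) & x3) ^ x1

# Each input/output pair is one "observation"; a depth-d chain of such
# gates must be reconstructed from a set of observations like these.
table = {bits: gf2_circuit(*bits) for bits in product((0, 1), repeat=3)}
print(table)
```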

[305] Motivation is Something You Need

Mehdi Acheli, Walid Gaaloul

Main category: cs.AI

TL;DR: A dual-model training framework inspired by affective neuroscience’s SEEKING state, where a smaller base model trains continuously while a larger motivated model activates intermittently during motivation conditions, enabling efficient training of both models with shared weights.

DetailsMotivation: Inspired by how emotions and cognition interact in the human brain, particularly the SEEKING motivational state where curiosity and reward anticipation recruit broader brain regions to enhance cognitive performance. The goal is to create a more efficient training paradigm that mimics this biological mechanism.

Method: A dual-model framework with scalable architectures where larger models extend smaller ones. A smaller base model trains continuously, while a larger motivated model activates only during predefined “motivation conditions.” The approach enables shared weight updates and selective expansion of network capacity during noteworthy training steps.

Result: Empirical evaluation on image classification shows the alternating training scheme efficiently enhances the base model compared to traditional schemes. In some cases, the motivational model surpasses its standalone counterpart despite seeing less data per epoch. The method enables simultaneous training of two models with competitive/superior performance while keeping training costs lower than training the larger model alone.

Conclusion: The biologically-inspired training paradigm successfully mimics the SEEKING motivational state, enabling efficient training of dual models with different deployment constraints. This opens possibilities for cost-effective training of models with varying capacities while maintaining or improving performance.

Abstract: This work introduces a novel training paradigm that draws from affective neuroscience. Inspired by the interplay of emotions and cognition in the human brain and more specifically the SEEKING motivational state, we design a dual-model framework where a smaller base model is trained continuously, while a larger motivated model is activated intermittently during predefined “motivation conditions”. The framework mimics the emotional state of high curiosity and anticipation of reward in which broader brain regions are recruited to enhance cognitive performance. Exploiting scalable architectures where larger models extend smaller ones, our method enables shared weight updates and selective expansion of network capacity during noteworthy training steps. Empirical evaluation on the image classification task demonstrates that, not only does the alternating training scheme efficiently and effectively enhance the base model compared to a traditional scheme, in some cases, the motivational model also surpasses its standalone counterpart despite seeing less data per epoch. This opens the possibility of simultaneously training two models tailored to different deployment constraints with competitive or superior performance while keeping training cost lower than when training the larger model.
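The alternating scheme can be sketched as a loop in which the small model updates every step and the larger "motivated" model joins only when a trigger fires. The trigger used here (a sharp loss drop as a proxy for a "noteworthy" step) is an illustrative assumption, not the paper's exact motivation condition:

```python
def train(steps, base_step, motivated_step, losses, threshold=0.1):
    """Run the base model every step; activate the larger motivated
    model only when the loss improves sharply (the 'motivation' event)."""
    activations = []
    prev = None
    for t in range(steps):
        loss = losses[t]
        base_step(t)                  # small model trains continuously
        if prev is not None and prev - loss > threshold:
            motivated_step(t)         # larger model joins intermittently
            activations.append(t)
        prev = loss
    return activations

base_calls, big_calls = [], []
losses = [1.0, 0.95, 0.7, 0.65, 0.4, 0.38]
fired = train(6, base_calls.append, big_calls.append, losses)
print(fired)  # → [2, 4]
```

Because the architectures are nested (the larger model extends the smaller one), both activations in this sketch would update shared weights.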

[306] The Initial Exploration Problem in Knowledge Graph Exploration

Claire McNamara, Lucy Hederman, Declan O’Sullivan

Main category: cs.AI

TL;DR: The paper identifies the Initial Exploration Problem (IEP): the orientation challenge that users without semantic web expertise face at first contact with an unfamiliar Knowledge Graph, characterized by three barriers: scope uncertainty, ontology opacity, and query incapacity.

DetailsMotivation: Knowledge Graphs contain rich semantic information but are difficult for non-expert users to explore initially due to their complex structure and specialized query requirements, creating a distinct orientation challenge at first contact.

Method: Theoretical analysis drawing on information behavior and HCI theories (ASK, exploratory search, information foraging, cognitive load) to conceptualize the IEP, analyze KG interfaces at interaction primitive level, and identify design gaps for initial exploration.

Result: Identified three interdependent barriers (scope uncertainty, ontology opacity, query incapacity) that converge at first contact, revealing a structural gap in KG interface design: the absence of interaction primitives that reveal a KG's scope without requiring users to formulate queries.

Conclusion: The IEP provides a theoretical framework for evaluating KG interfaces and designing entry-point scaffolding that supports initial exploration by addressing the unique challenges users face at first contact with unfamiliar knowledge graphs.

Abstract: Knowledge Graphs (KGs) enable the integration and representation of complex information across domains, but their semantic richness and structural complexity create substantial barriers for lay users without expertise in semantic web technologies. When encountering an unfamiliar KG, such users face a distinct orientation challenge: they do not know what questions are possible, how the knowledge is structured, or how to begin exploration. This paper identifies and theorises this phenomenon as the Initial Exploration Problem (IEP). Drawing on theories from information behaviour and human-computer interaction, including ASK, exploratory search, information foraging, and cognitive load theory, we develop a conceptual framing of the IEP characterised by three interdependent barriers: scope uncertainty, ontology opacity, and query incapacity. We argue that these barriers converge at the moment of first contact, distinguishing the IEP from related concepts that presuppose an existing starting point or information goal. Analysing KG exploration interfaces at the level of interaction primitives, we suggest that many systems rely on epistemic assumptions that do not hold at first contact. This reveals a structural gap in the design space: the absence of interaction primitives for scope revelation, mechanisms that communicate what a KG contains without requiring users to formulate queries or interpret ontological structures. In articulating the IEP, this paper provides a theoretical lens for evaluating KG interfaces and for designing entry-point scaffolding that supports initial exploration.

[307] A Benchmark for Deep Information Synthesis

Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger, Aysim Toker, Roy Miles, Andreea-Maria Oncescu, Jasivan Alex Sivakumar, Philipp Borchert, Ismail Elezi, Meiru Zhang, Ka Yiu Lee, Guchun Zhang, Jun Wang, Gerasimos Lampouras

Main category: cs.AI

TL;DR: DEEPSYNTH is a benchmark for evaluating LLM-based agents on complex real-world tasks requiring multi-source information synthesis and structured reasoning, showing current agents struggle with hallucinations and large information spaces.

DetailsMotivation: Current evaluation benchmarks don't adequately assess LLM agents' ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval.

Method: Created DEEPSYNTH benchmark with 120 tasks across 7 domains and 67 countries using a multi-stage data collection pipeline where annotators collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers.

Result: Across 11 state-of-the-art LLMs and deep research agents, the best scores were an F1 of 8.97 and 17.5 on the LLM-judge metric, demonstrating the benchmark's difficulty and revealing that agents struggle with hallucinations and reasoning over large information spaces.

Conclusion: DEEPSYNTH serves as a crucial benchmark for guiding future research on LLM agents, highlighting current limitations in complex information synthesis and reasoning capabilities.

Abstract: Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights. DEEPSYNTH contains 120 tasks collected across 7 domains and data sources covering 67 countries. DEEPSYNTH is constructed using a multi-stage data collection pipeline that requires annotators to collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers. When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces, highlighting DEEPSYNTH as a crucial benchmark for guiding future research.

[308] CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning

Ziwei Niu, Hao Sun, Shujun Bian, Xihong Yang, Lanfen Lin, Yuxin Liu, Yueming Jin

Main category: cs.AI

TL;DR: CG-DMER: A contrastive-generative framework for multimodal ECG representation learning that addresses intra-modality spatial-temporal dependencies and inter-modality biases through spatial-temporal masked modeling and representation disentanglement.

DetailsMotivation: Current multimodal ECG approaches have two main limitations: (1) intra-modality: lead-agnostic processing overlooks spatial-temporal dependencies across ECG leads, limiting fine-grained diagnostic pattern modeling; (2) inter-modality: direct alignment of ECG signals with free-text clinical reports introduces modality-specific biases.

Method: Proposes CG-DMER with two key designs: (1) Spatial-temporal masked modeling applies masking across both spatial (leads) and temporal dimensions and reconstructs missing information to capture fine-grained dynamics and inter-lead dependencies; (2) Representation disentanglement and alignment uses modality-specific and modality-shared encoders to separate modality-invariant and modality-specific representations, mitigating noise and biases.

Result: Experiments on three public datasets demonstrate state-of-the-art performance across diverse downstream tasks.

Conclusion: CG-DMER effectively addresses both intra-modality spatial-temporal dependencies and inter-modality bias issues in multimodal ECG analysis through contrastive-generative learning with spatial-temporal masking and representation disentanglement.

Abstract: Accurate interpretation of electrocardiogram (ECG) signals is crucial for diagnosing cardiovascular diseases. Recent multimodal approaches that integrate ECGs with accompanying clinical reports show strong potential, but they still face two main concerns from a modality perspective: (1) intra-modality: existing models process ECGs in a lead-agnostic manner, overlooking spatial-temporal dependencies across leads, which restricts their effectiveness in modeling fine-grained diagnostic patterns; (2) inter-modality: existing methods directly align ECG signals with clinical reports, introducing modality-specific biases due to the free-text nature of the reports. In light of these two issues, we propose CG-DMER, a contrastive-generative framework for disentangled multimodal ECG representation learning, powered by two key designs: (1) Spatial-temporal masked modeling is designed to better capture fine-grained temporal dynamics and inter-lead spatial dependencies by applying masking across both spatial and temporal dimensions and reconstructing the missing information. (2) A representation disentanglement and alignment strategy is designed to mitigate unnecessary noise and modality-specific biases by introducing modality-specific and modality-shared encoders, ensuring a clearer separation between modality-invariant and modality-specific representations. Experiments on three public datasets demonstrate that CG-DMER achieves state-of-the-art performance across diverse downstream tasks.
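The spatial-temporal masking idea can be sketched on a (leads × time) array: whole leads are hidden along the spatial axis and whole sample columns along the temporal axis, and a model would be trained to reconstruct the hidden values. The shapes, masking ratios, and function name below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def spatial_temporal_mask(ecg, lead_ratio=0.25, time_ratio=0.25, seed=0):
    """Zero out randomly chosen whole leads (spatial) and whole time
    columns (temporal); return the masked signal and the boolean mask."""
    rng = np.random.default_rng(seed)
    n_leads, n_time = ecg.shape
    lead_idx = rng.choice(n_leads, size=max(1, int(n_leads * lead_ratio)), replace=False)
    time_idx = rng.choice(n_time, size=max(1, int(n_time * time_ratio)), replace=False)
    mask = np.zeros_like(ecg, dtype=bool)
    mask[lead_idx, :] = True          # spatial masking across leads
    mask[:, time_idx] = True          # temporal masking across samples
    masked = ecg.copy()
    masked[mask] = 0.0                # reconstruction target is ecg[mask]
    return masked, mask

ecg = np.random.default_rng(1).standard_normal((12, 500))  # 12-lead ECG
masked, mask = spatial_temporal_mask(ecg)
print(mask.mean())  # fraction of values hidden
```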

[309] NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan

Main category: cs.AI

TL;DR: NoRD is a Vision-Language-Action model for autonomous driving that achieves competitive performance using under 60% of the training data and no reasoning annotations, relying on Dr. GRPO to overcome difficulty bias in reinforcement learning.

DetailsMotivation: Current Vision-Language-Action models for autonomous driving require massive dataset collection and dense reasoning annotations, which are expensive and inefficient. The paper aims to address both challenges by developing a more efficient model that doesn't need reasoning annotations.

Method: Proposes NoRD (No Reasoning for Driving), a VLA model fine-tuned on <60% of data with no reasoning annotations. Uses Dr.GRPO (a variant of Group Relative Policy Optimization) to overcome difficulty bias that occurs when training on small, reasoning-free datasets, which disproportionately penalizes reward signals from high-variance scenarios.

Result: NoRD achieves competitive performance on Waymo and NAVSIM benchmarks with 3× fewer tokens than existing VLAs, using only a fraction of the training data and no reasoning overhead.

Conclusion: The approach enables more efficient autonomous driving systems by reducing data requirements and eliminating the need for expensive reasoning annotations while maintaining competitive performance through improved reinforcement learning techniques.

Abstract: Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NoRD (No Reasoning for Driving). Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on <60% of the data and no reasoning annotations, resulting in 3× fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems.
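The difficulty bias shows up directly in how group-relative advantages are computed: GRPO divides each group's centered rewards by the group's reward standard deviation, which rescales the learning signal precisely for the high-variance (hard) scenarios, while Dr. GRPO drops that normalization and keeps only the mean-centering. A minimal numeric sketch (illustrative, not the paper's code):

```python
def grpo_advantages(rewards):
    """Standard GRPO: center by the group mean, divide by the group std."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (var ** 0.5 + 1e-8) for r in rewards]

def dr_grpo_advantages(rewards):
    """Dr. GRPO: keep the mean-centering but drop the std division,
    so high-variance groups are not rescaled."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

hard_group = [0.0, 0.0, 1.0, 1.0]      # high-variance rollouts (hard scenario)
print(grpo_advantages(hard_group))     # signal rescaled by 1/std
print(dr_grpo_advantages(hard_group))  # raw centered signal preserved
```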

[310] Aletheia tackles FirstProof autonomously

Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V. Le, Thang Luong

Main category: cs.AI

TL;DR: Aletheia, a mathematics research agent powered by Gemini 3 Deep Think, autonomously solved 6 out of 10 problems in the FirstProof challenge within the allowed timeframe.

DetailsMotivation: To demonstrate the capabilities of AI agents in mathematical reasoning and problem-solving by participating in the inaugural FirstProof challenge, which tests autonomous mathematical research abilities.

Method: Used Aletheia, a mathematics research agent powered by Gemini 3 Deep Think, to autonomously work on the 10 problems of the FirstProof challenge within the competition timeframe.

Result: Aletheia solved 6 of the 10 problems (2, 5, 7, 8, 9, 10) according to majority expert assessments, though experts were not unanimous on Problem 8. The authors disclose their interpretation of the challenge, experimental details, and evaluation for full transparency.

Conclusion: The Aletheia agent demonstrates significant capability in autonomous mathematical problem-solving, successfully solving the majority of problems in the FirstProof challenge, though with some expert disagreement on one solution.

Abstract: We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 problems (2, 5, 7, 8, 9, 10) out of 10 according to majority expert assessments; we note that experts were not unanimous on Problem 8 (only). For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as our evaluation. Raw prompts and outputs are available at https://github.com/google-deepmind/superhuman/tree/main/aletheia.

[311] A Survey on the Optimization of Large Language Model-based Agents

Shangheng Du, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xin Jiang, Yanhong Bai, Liang He

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2503.12434 returned HTTP 429 (rate limited).

[312] “Don’t Do That!”: Guiding Embodied Systems through Large Language Model-based Constraint Generation

Amin Seffo, Aladin Djuhera, Masataro Asai, Holger Boche

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2506.04500 returned HTTP 429 (rate limited).

[313] Sensory-Motor Control with Large Language Models via Iterative Policy Refinement

Jônata Tyska Carvalho, Stefano Nolfi

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2506.04867 returned HTTP 429 (rate limited).

[314] AutoEDA: Enabling EDA Flow Automation through Microservice-Based LLM Agents

Yiyi Lu, Hoi Ian Au, Junyao Zhang, Jingyu Pan, Guanglei Zhou, Yiting Wang, Jingwei Sun, Ang Li, Jianyi Zhang, Hai Li, Yiran Chen

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2508.01012 returned HTTP 429 (rate limited).

[315] Hidden Dynamics of Massive Activations in Transformer Training

Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, Antonios Saravanos

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2508.03616 returned HTTP 429 (rate limited).

[316] UbiQTree: Uncertainty Quantification in XAI with Tree Ensembles

Akshat Dubey, Aleksandar Anžel, Bahar İlgen, Georges Hattab

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2508.09639 returned HTTP 429 (rate limited).

[317] Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo, Gunhee Kim, Moontae Lee, Kyungjae Lee

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2508.19113 returned HTTP 429 (rate limited).

[318] DS-STAR: Data Science Agent for Solving Diverse Tasks across Heterogeneous Formats and Open-Ended Queries

Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Raj Sinha, Jinwoo Shin, Tomas Pfister

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2509.21825 returned HTTP 429 (rate limited).

[319] A Framework for Studying AI Agent Behavior: Evidence from Consumer Choice Experiments

Manuel Cherep, Chengtian Ma, Abigail Xu, Maya Shaked, Pattie Maes, Nikhil Singh

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2509.25609 returned HTTP 429 (rate limited).

[320] BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals

Chenqi Li, Yu Liu, Timothy Denison, Tingting Zhu

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2510.02276 returned HTTP 429 (rate limited).

[321] NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2510.07172 returned HTTP 429 (rate limited).

[322] Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems

Jiaxin Gao, Chen Chen, Yanwen Jia, Xueluan Gong, Kwok-Yan Lam, Qian Wang

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2510.12462 returned HTTP 429 (rate limited).

[323] LLMs Process Lists With General Filter Heads

Arnab Sen Sharma, Giordano Rogers, Natalie Shapira, David Bau

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2510.26784 returned HTTP 429 (rate limited).

[324] MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents

Ruoxuan Zhang, Qiyun Zheng, Zhiyu Zhou, Ziqi Liao, Siyu Wu, Jian-Yu Jiang-Lin, Bin Wen, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2511.23055 returned HTTP 429 (rate limited).

[325] From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall, Yasin Silva, Huan Liu

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2512.03005 returned HTTP 429 (rate limited).

[326] STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models

Jiliang Ni, Jiachen Pu, Zhongyi Yang, Jingfeng Luo, Conggang Hu

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2602.03022 returned HTTP 429 (rate limited).

[327] BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Shan, Haishan Lu, Zhiyong Cao, Jiaoyang Chen, Yuqian Han, Zinan Sheng, Zhengwei Tao, Hao Liang, Jialong Wu, Yang Shi, Yuanpeng He, Jiaye Lin, Qintong Zhang, Guochen Yan, Runhao Zhao, Zhengpin Li, Xiaohan Yu, Lang Mei, Chong Chen, Wentao Zhang, Bin Cui

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2602.12876 returned HTTP 429 (rate limited); retry after a delay.

[328] Automated Generation of Microfluidic Netlists using Large Language Models

Jasper Davidson, Skylar Stockham, Allen Boston, Ashton Snelgrove, Valerio Tenace, Pierre-Emmanuel Gaillardon

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2602.19297 returned HTTP 429 (rate limited); retry after a delay.

[329] A Problem-Oriented Perspective and Anchor Verification for Code Optimization

Tong Ye, Tengfei Ma, Xuhong Zhang, Hang Yu, Jianwei Yin, Wenhai Wang

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2406.11935 returned HTTP 429 (rate limited); retry after a delay.

[330] Rethinking Disentanglement under Dependent Factors of Variation

Antonio Almudévar, Alfonso Ortega

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2408.07016 returned HTTP 429 (rate limited); retry after a delay.

[331] Predicting Subway Passenger Flows under Incident Situation with Causality

Xiannan Huang, Shuhan Qiu, Quan Yuan, Chao Yang

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2412.06871 returned HTTP 429 (rate limited); retry after a delay.

[332] Safe Reinforcement Learning for Real-World Engine Control

Julian Bedei, Lucas Koch, Kevin Badalian, Alexander Winkler, Patrick Schaber, Jakob Andert

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2501.16613 returned HTTP 429 (rate limited); retry after a delay.

[333] A Statistical Learning Perspective on Semi-dual Adversarial Neural Optimal Transport Solvers

Roman Tarasov, Petr Mokrov, Milena Gazdieva, Evgeny Burnaev, Alexander Korotin

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2502.01310 returned HTTP 429 (rate limited); retry after a delay.

[334] Oracular Programming: A Modular Foundation for Building LLM-Enabled Software

Jonathan Laurent, André Platzer

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2502.05310 returned HTTP 429 (rate limited); retry after a delay.

[335] Using the Path of Least Resistance to Explain Deep Networks

Sina Salek, Joseph Enguehard

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2502.12108 returned HTTP 429 (rate limited); retry after a delay.

[336] Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence

Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen, Jan-Jakob Sonke, Efstratios Gavves

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2502.17028 returned HTTP 429 (rate limited); retry after a delay.

[337] MoEMba: A Mamba-based Mixture of Experts for High-Density EMG-based Hand Gesture Recognition

Mehran Shabanpour, Kasra Rad, Sadaf Khademi, Arash Mohammadi

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2502.17457 returned HTTP 429 (rate limited); retry after a delay.

[338] Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling

Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2503.04398 returned HTTP 429 (rate limited); retry after a delay.

[339] Diffusion Generative Recommendation with Continuous Tokens

Haohao Qu, Shanru Lin, Yujuan Ding, Yiqi Wang, Wenqi Fan

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2504.12007 returned HTTP 429 (rate limited); retry after a delay.

[340] CONTINA: Confidence Interval for Traffic Demand Prediction with Coverage Guarantee

Chao Yang, Xiannan Huang, Shuhan Qiu, Yan Cheng

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2504.13961 returned HTTP 429 (rate limited); retry after a delay.

[341] How much does context affect the accuracy of AI health advice?

Prashant Garg, Thiemo Fetzer

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2504.18310 returned HTTP 429 (rate limited); retry after a delay.

[342] MARVEL: Multi-Agent RTL Vulnerability Extraction using Large Language Models

Luca Collini, Baleegh Ahmad, Joey Ah-kiow, Ramesh Karri

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2505.11963 returned HTTP 429 (rate limited); retry after a delay.

[343] Performance Asymmetry in Model-Based Reinforcement Learning

Jing Yu Lim, Rushi Shah, Zarif Ikram, Samson Yu, Haozhe Ma, Tze-Yun Leong, Dianbo Liu

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2505.19698 returned HTTP 429 (rate limited); retry after a delay.

[344] Synthesis of discrete-continuous quantum circuits with multimodal diffusion models

Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2506.01666 returned HTTP 429 (rate limited); retry after a delay.

[345] Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness

Jinkwan Jang, Hyungjin Park, Jinmyeong Choi, Taesup Kim

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2506.08660 returned HTTP 429 (rate limited); retry after a delay.

[346] DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

Jingyu Xiao, Man Ho Lam, Ming Wang, Yuxuan Wan, Junliang Liu, Yintong Huo, Michael R. Lyu

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2506.06251 returned HTTP 429 (rate limited); retry after a delay.

[347] Augmenting Molecular Graphs with Geometries via Machine Learning Interatomic Potentials

Cong Fu, Yuchao Lin, Zachary Krueger, Haiyang Yu, Maho Nakata, Jianwen Xie, Emine Kucukbenli, Xiaofeng Qian, Shuiwang Ji

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2507.00407 returned HTTP 429 (rate limited); retry after a delay.

[348] Enjoying Non-linearity in Multinomial Logistic Bandits: A Minimax-Optimal Algorithm

Pierre Boudart, Pierre Gaillard, Alessandro Rudi

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2507.05306 returned HTTP 429 (rate limited); retry after a delay.

[349] Characterizing State Space Model and Hybrid Language Model Performance with Long Context

Saptarshi Mitra, Rachid Karami, Haocheng Xu, Sitao Huang, Hyoukjun Kwon

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2507.12442 returned HTTP 429 (rate limited); retry after a delay.

[350] Monte Carlo Tree Diffusion with Multiple Experts for Protein Design

Xuefeng Liu, Mingxuan Cao, Songhao Jiang, Xiao Luo, Xiaotian Duan, Mengdi Wang, Tobin R. Sosnick, Jinbo Xu, Rick Stevens

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2509.15796 returned HTTP 429 (rate limited); retry after a delay.

[351] From Parameters to Behaviors: Unsupervised Compression of the Policy Space

Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2509.22566 returned HTTP 429 (rate limited); retry after a delay.

[352] Polychromic Objectives for Reinforcement Learning

Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2509.25424 returned HTTP 429 (rate limited); retry after a delay.

[353] AgentDR: Dynamic Recommendation with Implicit Item-Item Relations via LLM-based Agents

Mingdai Yang, Nurendra Choudhary, Jiangshu Du, Edward W. Huang, Philip S. Yu, Karthik Subbian, Danai Koutra

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2510.05598 returned HTTP 429 (rate limited); retry after a delay.

[354] Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Recovery

Didrik Bergström, Deniz Gündüz, Onur Günlü

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2510.06868 returned HTTP 429 (rate limited); retry after a delay.

[355] Latent-Augmented Discrete Diffusion Models

Dario Shariatian, Alain Durmus, Umut Simsekli, Stefano Peluchetti

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2510.18114 returned HTTP 429 (rate limited); retry after a delay.

[356] MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation

Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang, Huang Huang, Josiah Wong, Sujay Garlanka, Cem Gokmen, Ruohan Zhang, Weiyu Liu, Jiajun Wu, Roberto Martín-Martín, Li Fei-Fei

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2510.18316 returned HTTP 429 (rate limited); retry after a delay.

[357] Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents

Julia Bazinska, Max Mathys, Francesco Casucci, Mateo Rojas-Carulla, Xander Davies, Alexandra Souly, Niklas Pfister

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2510.22620 returned HTTP 429 (rate limited); retry after a delay.

[358] A Survey of Data Agents: Emerging Paradigm or Overstated Hype?

Yizhang Zhu, Liangwei Wang, Chenyu Yang, Xiaotian Lin, Boyan Li, Wei Zhou, Xinyu Liu, Zhangyang Peng, Tianqi Luo, Yu Li, Chengliang Chai, Chong Chen, Shimin Di, Ju Fan, Ji Sun, Nan Tang, Fugee Tsung, Jiannan Wang, Chenglin Wu, Yanwei Xu, Shaolei Zhang, Yong Zhang, Xuanhe Zhou, Guoliang Li, Yuyu Luo

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2510.23587 returned HTTP 429 (rate limited); retry after a delay.

[359] HiGR: Efficient Generative Slate Recommendation via Hierarchical Planning and Multi-Objective Preference Alignment

Yunsheng Pang, Zijian Liu, Yudong Li, Shaojie Zhu, Zijian Luo, Chenyun Yu, Sikai Wu, Shichen Shen, Cong Xu, Bin Wang, Kai Jiang, Hongyong Yu, Chengxiang Zhuo, Zang Li

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2512.24787 returned HTTP 429 (rate limited); retry after a delay.

[360] PhysE-Inv: A Physics-Encoded Inverse Modeling approach for Arctic Snow Depth Prediction

Akila Sampath, Vandana Janeja, Jianwu Wang

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2601.17074 returned HTTP 429 (rate limited); retry after a delay.

[361] CLiMB: A Domain-Informed Novelty Detection Clustering Framework for Galactic Archaeology and Scientific Discovery

Lorenzo Monti, Tatiana Muraveva, Brian Sheridan, Davide Massari, Alessia Garofalo, Gisella Clementini, Umberto Michelucci

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2601.09768 returned HTTP 429 (rate limited); retry after a delay.

[362] When LLMs Imagine People: A Human-Centered Persona Brainstorm Audit for Bias and Fairness in Creative Applications

Hongliu Cao, Eoin Thomas, Rodrigo Acuna Agost

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2602.00044 returned HTTP 429 (rate limited); retry after a delay.

[363] CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models

Weining Fu, Kai Shu, Kui Xu, Qiangfeng Cliff Zhang

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2602.02620 returned HTTP 429 (rate limited); retry after a delay.

[364] Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs

Sagnik Mukherjee, Lifan Yuan, Pavan Jayasinha, Dilek Hakkani-Tür, Hao Peng

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2602.07729 returned HTTP 429 (rate limited); retry after a delay.

[365] AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2602.07906 returned HTTP 429 (rate limited); retry after a delay.

[366] VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2602.10693 returned HTTP 429 (rate limited); retry after a delay.

[367] KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models

Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2602.11184 returned HTTP 429 (rate limited); retry after a delay.

[368] PMG: Parameterized Motion Generator for Human-like Locomotion Control

Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu, Yi Cheng, Houde Liu

Main category: cs.AI

TL;DR: Summary unavailable for 2602.12656: the arXiv API returned HTTP 429 (rate limited).

[369] Pawsterior: Variational Flow Matching for Structured Simulation-Based Inference

Jorge Carrasco-Pollo, Floor Eijkelboom, Jan-Willem van de Meent

Main category: cs.AI

TL;DR: Summary unavailable for 2602.13813: the arXiv API returned HTTP 429 (rate limited).

[370] Silent Inconsistency in Data-Parallel Full Fine-Tuning: Diagnosing Worker-Level Optimization Misalignment

Hong Li, Zhen Zhou, Honggang Zhang, Yuping Luo, Xinyue Wang, Han Gong, Zhiyuan Liu

Main category: cs.AI

TL;DR: Summary unavailable for 2602.14462: the arXiv API returned HTTP 429 (rate limited).

[371] AI-CARE: Carbon-Aware Reporting Evaluation Metric for AI Models

KC Santosh, Srikanth Baride, Rodrigue Rizk

Main category: cs.AI

TL;DR: Summary unavailable for 2602.16042: the arXiv API returned HTTP 429 (rate limited).

[372] AI-Driven Structure Refinement of X-ray Diffraction

Bin Cao, Qian Zhang, Zhenjie Feng, Taolue Zhang, Jiaqiang Huang, Lu-Tao Weng, Tong-Yi Zhang

Main category: cs.AI

TL;DR: Summary unavailable for 2602.16372: the arXiv API returned HTTP 429 (rate limited).

[373] Beyond Message Passing: A Symbolic Alternative for Expressive and Interpretable Graph Learning

Chuqin Geng, Li Zhang, Haolin Ye, Ziyu Zhao, Yuhe Jiang, Tara Saba, Xinyu Wang, Xujie Si

Main category: cs.AI

TL;DR: Summary unavailable for 2602.16947: the arXiv API returned HTTP 429 (rate limited).

[374] AI-Mediated Feedback Improves Student Revisions: A Randomized Trial with FeedbackWriter in a Large Undergraduate Course

Xinyi Lu, Kexin Phyllis Ju, Mitchell Dudley, Larissa Sano, Xu Wang

Main category: cs.AI

TL;DR: Summary unavailable for 2602.16820: the arXiv API returned HTTP 429 (rate limited).

[375] SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation

Kushal Kedia, Tyler Ga Wei Lum, Jeannette Bohg, C. Karen Liu

Main category: cs.AI

TL;DR: Summary unavailable for 2602.16863: the arXiv API returned HTTP 429 (rate limited).

[376] MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning

Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, Xunliang Cai

Main category: cs.AI

TL;DR: Summary unavailable for 2602.17550: the arXiv API returned HTTP 429 (rate limited).

[377] Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation

Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon

Main category: cs.AI

TL;DR: Summary unavailable for 2602.18296: the arXiv API returned HTTP 429 (rate limited).

[378] NEXUS: A compact neural architecture for high-resolution spatiotemporal air quality forecasting in Delhi National Capital Region

Rampunit Kumar, Aditya Maheshwari

Main category: cs.AI

TL;DR: Summary unavailable for 2602.19654: the arXiv API returned HTTP 429 (rate limited).

[379] Transforming Science Learning Materials in the Era of Artificial Intelligence

Xiaoming Zhai, Kent Crippen

Main category: cs.AI

TL;DR: Summary unavailable for 2602.18470: the arXiv API returned HTTP 429 (rate limited).

[380] RPU – A Reasoning Processing Unit

Matthew Adiletta, Gu-Yeon Wei, David Brooks

Main category: cs.AI

TL;DR: Summary unavailable for 2602.18568: the arXiv API returned HTTP 429 (rate limited).

[381] The Metaphysics We Train: A Heideggerian Reading of Machine Learning

Heman Shakeri

Main category: cs.AI

TL;DR: Summary unavailable for 2602.19028: the arXiv API returned HTTP 429 (rate limited).

cs.SD

[382] Graph Modelling Analysis of Speech-Gesture Interaction for Aphasia Severity Estimation

Navya Martin Kollapally, Christa Akers, Renjith Nelson Joseph

Main category: cs.SD

TL;DR: A graph neural network framework for automated aphasia severity assessment using multimodal discourse graphs combining speech and gesture features.

Motivation: Current aphasia assessment tools like WAB-R measure isolated linguistic skills, but there's growing interest in discourse production as a more holistic representation. Recent automated approaches rely mostly on isolated linguistic or acoustic features, missing the structured interactions between speech and gesture that may better capture aphasia severity.

Method: Proposes a graph neural network framework where each participant’s discourse is represented as a directed multimodal graph. Nodes represent lexical items and gestures, edges encode word-word, gesture-word, and word-gesture transitions. GraphSAGE is used to learn participant-level embeddings that integrate information from immediate neighbors and overall graph structure.

Result: Results suggest that aphasia severity is not encoded in isolated lexical distribution, but emerges from structured interactions between speech and gesture. The proposed architecture offers reliable automated aphasia assessment.

Conclusion: The framework provides a reliable automated approach for aphasia assessment with potential applications in bedside screening and telehealth-based monitoring, demonstrating that multimodal integration of speech and gesture features is crucial for accurate severity estimation.

Abstract: Aphasia is an acquired language disorder caused by injury to the regions of the brain that are responsible for language. Aphasia may impair the use and comprehension of written and spoken language. The Western Aphasia Battery-Revised (WAB-R) is an assessment tool administered by speech-language pathologists (SLPs) to evaluate the aphasia type and severity. Because the WAB-R measures isolated linguistic skills, there has been growing interest in the assessment of discourse production as a more holistic representation of everyday language abilities. Recent advancements in speech analysis focus on automated estimation of aphasia severity from spontaneous speech, relying mostly on isolated linguistic or acoustical features. In this work, we propose a graph neural network-based framework for estimating aphasia severity. We represented each participant’s discourse as a directed multi-modal graph, where nodes represent lexical items and gestures and edges encode word-word, gesture-word, and word-gesture transitions. GraphSAGE is employed to learn participant-level embeddings, thus integrating information from immediate neighbors and overall graph structure. Our results suggest that aphasia severity is not encoded in isolated lexical distribution, but rather emerges from structured interactions between speech and gesture. The proposed architecture offers a reliable automated aphasia assessment, with possible uses in bedside screening and telehealth-based monitoring.
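A minimal, pure-Python sketch of the representation described above (our illustration, not the authors' code): nodes are words and gestures, directed edges are transitions, and one GraphSAGE-style mean-aggregation step yields node embeddings that are mean-pooled into a participant-level embedding. All node names, features, and edges are invented.

```python
def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sage_layer(features, edges):
    """One GraphSAGE mean-aggregation step: each node's new feature is the
    concatenation of its own feature and the mean of its in-neighbors'."""
    neighbors = {node: [] for node in features}
    for src, dst in edges:          # directed edge src -> dst
        neighbors[dst].append(src)
    updated = {}
    for node, feat in features.items():
        if neighbors[node]:
            agg = mean_vec([features[n] for n in neighbors[node]])
        else:
            agg = [0.0] * len(feat)
        updated[node] = feat + agg  # concatenate self and neighborhood
    return updated

# Tiny invented discourse: two words and one gesture, with
# word->word and gesture->word transition edges.
features = {"w:dog": [1.0, 0.0], "w:runs": [0.0, 1.0], "g:point": [1.0, 1.0]}
edges = [("w:dog", "w:runs"), ("g:point", "w:runs")]

node_embs = sage_layer(features, edges)
# Participant-level embedding: mean-pool all node embeddings.
participant_emb = mean_vec(list(node_embs.values()))
print(participant_emb)
```

A real implementation would use a learned aggregator (e.g. GraphSAGE in PyTorch Geometric) over hundreds of nodes; this toy shows only the data flow.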

[383] Quantifying Dimensional Independence in Speech: An Information-Theoretic Framework for Disentangled Representation Learning

Bipasha Kashyap, Björn W. Schuller, Pubudu N. Pathirana

Main category: cs.SD

TL;DR: Information-theoretic framework quantifies cross-dimension statistical dependence in speech acoustic features, finding weak coupling between emotional, linguistic, and pathological dimensions but stronger source-filter relationships.

Motivation: Speech signals encode multiple types of information (emotional, linguistic, pathological) within shared acoustic channels, but current methods only assess disentanglement indirectly through downstream tasks. There's a need for a principled framework to directly quantify statistical dependence between these dimensions.

Method: Integrated bounded neural mutual information (MI) estimation with non-parametric validation on handcrafted acoustic features across six speech corpora. Used attribution analysis to determine proportion of total MI attributable to source versus filter components.

Result: Cross-dimension MI remains low (<0.15 nats) indicating weak statistical coupling, while Source-Filter MI is substantially higher (0.47 nats). Attribution analysis shows source dominance for emotional dimensions (80%) and filter dominance for linguistic (60%) and pathological (58%) dimensions.

Conclusion: Provides a principled information-theoretic framework for quantifying dimensional independence in speech, revealing distinct statistical patterns between different speech information types and their source-filter relationships.

Abstract: Speech signals encode emotional, linguistic, and pathological information within a shared acoustic channel; however, disentanglement is typically assessed indirectly through downstream task performance. We introduce an information-theoretic framework to quantify cross-dimension statistical dependence in handcrafted acoustic features by integrating bounded neural mutual information (MI) estimation with non-parametric validation. Across six corpora, cross-dimension MI remains low, with tight estimation bounds ($< 0.15$ nats), indicating weak statistical coupling in the data considered, whereas Source–Filter MI is substantially higher (0.47 nats). Attribution analysis, defined as the proportion of total MI attributable to source versus filter components, reveals source dominance for emotional dimensions (80%) and filter dominance for linguistic and pathological dimensions (60% and 58%, respectively). These findings provide a principled framework for quantifying dimensional independence in speech.
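The paper uses bounded *neural* mutual information estimators, but the quantity being estimated is ordinary MI. As an illustrative sketch only (all data invented), here is a plug-in MI estimate in nats from paired discrete samples, e.g. quantized "source" vs. "filter" feature bins:

```python
import math
from collections import Counter

def mutual_information_nats(pairs):
    """Plug-in MI estimate I(X;Y) = sum_xy p(x,y) log[p(x,y) / (p(x)p(y))]."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint / (p(x) * p(y)) = c * n / (count_x * count_y)
        mi += p_joint * math.log(p_joint * n * n / (px[x] * py[y]))
    return mi

# Perfectly coupled variables: MI equals the entropy, log(2) nats here.
coupled = [(0, 0), (1, 1)] * 50
# Independent variables: MI is 0.
independent = [(x, y) for x in (0, 1) for y in (0, 1)] * 25

print(round(mutual_information_nats(coupled), 3))      # log(2) ~ 0.693
print(round(mutual_information_nats(independent), 3))  # 0.0
```

Neural estimators (e.g. MINE-style bounds) replace the counting step with a learned critic so the idea scales to continuous high-dimensional features; the cross-dimension values the paper reports (< 0.15 nats) would sit near the "independent" end of this scale.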

[384] Geometric Analysis of Speech Representation Spaces: Topological Disentanglement and Confound Detection

Bipasha Kashyap, Pubudu N. Pathirana

Main category: cs.SD

TL;DR: A framework for evaluating geometric disentanglement of emotional, linguistic, and pathological speech features across multilingual datasets to ensure equitable clinical speech systems.

Motivation: Speech-based clinical tools face challenges in multilingual settings where accent variation may interfere with accurate pathology detection, potentially misclassifying healthy non-native speakers or missing pathology in multilingual patients.

Method: Proposes a four-metric clustering framework to evaluate geometric disentanglement of speech features across six corpora and eight dataset combinations, using metrics like Silhouette scores, confound analysis, and trustworthiness analysis.

Result: Found consistent hierarchy: emotional features form tightest clusters (Silhouette 0.250), followed by pathological (0.141) and linguistic (0.077). Pathological-linguistic overlap remains below 0.21, bounded for clinical deployment. Embedding fidelity and robustness confirmed.

Conclusion: The framework provides actionable guidelines for developing equitable and reliable speech health systems across diverse populations by quantifying feature disentanglement in multilingual contexts.

Abstract: Speech-based clinical tools are increasingly deployed in multilingual settings, yet whether pathological speech markers remain geometrically separable from accent variation remains unclear. Systems may misclassify healthy non-native speakers or miss pathology in multilingual patients. We propose a four-metric clustering framework to evaluate geometric disentanglement of emotional, linguistic, and pathological speech features across six corpora and eight dataset combinations. A consistent hierarchy emerges: emotional features form the tightest clusters (Silhouette 0.250), followed by pathological (0.141) and linguistic (0.077). Confound analysis shows pathological-linguistic overlap remains below 0.21, which is above the permutation null but bounded for clinical deployment. Trustworthiness analysis confirms embedding fidelity and robustness of the geometric conclusions. Our framework provides actionable guidelines for equitable and reliable speech health systems across diverse populations.
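For reference, the Silhouette values the paper reports (0.250 emotional, 0.141 pathological, 0.077 linguistic) come from the standard silhouette coefficient. A self-contained sketch with invented toy data (a real pipeline would use scikit-learn on embedding vectors):

```python
import math

def silhouette(points, labels):
    """Mean silhouette score: (b - a) / max(a, b) per point, where a is the
    mean intra-cluster distance and b is the mean distance to the nearest
    other cluster. Near 1 = tight, well-separated clusters; near 0 = overlap."""
    idx_by_label = {}
    for i, lab in enumerate(labels):
        idx_by_label.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in idx_by_label[lab] if j != i]
        if not own:  # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in idx_by_label.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated clusters -> score close to 1.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette(points, labels)
print(round(score, 3))
```

On this scale, the paper's scores of 0.077-0.250 indicate loose but detectable cluster structure rather than clean separation.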

[385] Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Darvan Shvan Khairaldeen, Hossein Hassani

Main category: cs.SD

TL;DR: First study on automatic singing assessment for Kurdish maqam music using ML to detect pitch, rhythm, and modal stability errors beyond Western music rules

Motivation: Existing automatic singing assessment tools follow Western music rules and fail to detect micro-intervals and pitch bends in Kurdish maqam singing, incorrectly flagging traditional performances as errors. ASA tools are needed that recognize performance errors within microtonal spaces beyond Western equal temperament.

Method: Collected 50 Kurdish maqam songs from 13 vocalists (2-3 hours), annotated 221 error spans (150 fine pitch, 46 rhythm, 25 modal drift). Segmented into 15,199 overlapping windows converted to log-mel spectrograms. Developed two-headed CNN-BiLSTM with attention to detect errors and classify error types.

Result: Model reached validation macro-F1 of 0.468. On full 50-song evaluation at 0.750 threshold: recall 39.4%, precision 25.8%. Within detected windows: type macro-F1 0.387, with F1 scores of 0.492 (fine pitch), 0.536 (rhythm), and 0.133 (modal drift). Modal drift recall was only 8.0%.

Conclusion: The method shows promise for common error types, but the poor modal-drift detection indicates that more data and class balancing are needed. This is the first attempt at ASA for Kurdish maqam music beyond the constraints of Western music theory.

Abstract: Maqam, a singing type, is a significant component of Kurdish music. A maqam singer receives training in a traditional face-to-face setting or through self-training. Automatic Singing Assessment (ASA) uses machine learning (ML) to provide the accuracy of singing styles and can help learners to improve their performance through error detection. Currently, the available ASA tools follow Western music rules. The musical composition requires all notes to stay within their expected pitch range from start to finish. The system fails to detect micro-intervals and pitch bends, so it identifies Kurdish maqam singing as incorrect even though the singer performs according to traditional rules. Kurdish maqam requires recognizing performance errors within microtonal spaces, which is beyond Western equal temperament. This research is the first attempt to address the mentioned gap. While many error types happen during singing, our focus is on pitch, rhythm, and modal stability errors in the context of Bayati-Kurd. We collected 50 songs from 13 vocalists (2-3 hours) and annotated 221 error spans (150 fine pitch, 46 rhythm, 25 modal drift). The data was segmented into 15,199 overlapping windows and converted to log-mel spectrograms. We developed a two-headed CNN-BiLSTM with attention to decide whether a window contains an error and to classify it based on the chosen errors. Trained for 20 epochs with early stopping at epoch 10, the model reached a validation macro-F1 of 0.468. On the full 50-song evaluation at a 0.750 threshold, recall was 39.4% and precision 25.8%. Within detected windows, type macro-F1 was 0.387, with F1 of 0.492 (fine pitch), 0.536 (rhythm), and 0.133 (modal drift); modal drift recall was 8.0%. The better performance on common error types shows that the method works, while the poor modal-drift recall shows that more data and balancing are needed.
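A hedged sketch of the dataset-preparation step described above (our illustration; window size, hop, and error spans below are invented, not the paper's settings): slice a framed recording into overlapping fixed-length windows and label each window positive if it overlaps any annotated error span.

```python
def window_labels(total_frames, win, hop, error_spans):
    """Return (start, end, has_error) for each overlapping window.
    error_spans is a list of (start, end) frame intervals."""
    out = []
    for start in range(0, total_frames - win + 1, hop):
        end = start + win
        # Half-open interval overlap test against every annotated span.
        has_error = any(start < e and s < end for s, e in error_spans)
        out.append((start, end, has_error))
    return out

# 100 frames, 20-frame windows with 50% overlap, one short error span.
wins = window_labels(total_frames=100, win=20, hop=10, error_spans=[(35, 40)])
print(len(wins), sum(1 for *_, err in wins if err))
```

This is how a short annotated span (like the paper's 25 modal-drift spans) ends up in only a handful of the 15,199 windows, which helps explain the class imbalance the authors note.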

[386] Assessing the Impact of Speaker Identity in Speech Spoofing Detection

Anh-Tuan Dao, Driss Matrouf, Nicholas Evans

Main category: cs.SD

TL;DR: Investigates speaker information impact on spoofing detection, proposes Speaker-Invariant Multi-Task framework with two approaches: modeling or removing speaker identity, achieving significant error rate reductions.

Motivation: Spoofing detection systems typically assume embeddings are independent of speaker identity, but this assumption remains unverified. The paper investigates the impact of speaker information on spoofing detection performance.

Method: Proposes Speaker-Invariant Multi-Task (SInMT) framework with two approaches: one models speaker identity within embeddings, another removes it. Uses multi-task learning for joint speaker recognition and spoofing detection with gradient reversal layer.

Result: Speaker-invariant model reduces average equal error rate by 17% compared to baseline, with up to 48% reduction for most challenging attacks (e.g., A11). Evaluated using four datasets.

Conclusion: Speaker information significantly impacts spoofing detection performance, and the proposed speaker-invariant approach substantially improves detection accuracy, especially for challenging attacks.

Abstract: Spoofing detection systems are typically trained using diverse recordings from multiple speakers, often assuming that the resulting embeddings are independent of speaker identity. However, this assumption remains unverified. In this paper, we investigate the impact of speaker information on spoofing detection systems. We propose two approaches within our Speaker-Invariant Multi-Task (SInMT) framework, one that models speaker identity within the embeddings and another that removes it. SInMT integrates multi-task learning for joint speaker recognition and spoofing detection, incorporating a gradient reversal layer. Evaluated using four datasets, our speaker-invariant model reduces the average equal error rate by 17% compared to the baseline, with up to 48% reduction for the most challenging attacks (e.g., A11).
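A minimal numeric sketch of the gradient reversal layer (GRL), the mechanism the speaker-*removing* variant relies on: in the forward pass the GRL is the identity, in the backward pass it multiplies the gradient by -lambda, so the shared encoder is updated to make the speaker classifier worse. All numbers are toy values, not from the paper.

```python
def grl_backward(grad, lam=1.0):
    """Backward rule of a gradient reversal layer: flip and scale the gradient."""
    return -lam * grad

# Toy 1-parameter "encoder" feeding a 1-parameter "speaker head":
#   z = enc_w * x ;  speaker loss L = (head_w * z - target)^2
enc_w, head_w, x, target = 1.0, 2.0, 1.0, 5.0
z = enc_w * x
pred = head_w * z
dL_dpred = 2 * (pred - target)        # = 2 * (2 - 5) = -6
dL_dz = dL_dpred * head_w             # gradient arriving at the GRL: -12
dL_dz_reversed = grl_backward(dL_dz)  # +12 after reversal
dL_denc_w = dL_dz_reversed * x        # the encoder sees the *reversed* gradient

print(dL_dz, dL_dz_reversed, dL_denc_w)
```

Gradient descent with the reversed gradient pushes `enc_w` in the direction that *increases* the speaker loss, which is exactly the adversarial pressure that strips speaker identity from the shared embedding.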

[387] 823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio

Ratnajit Dhar, Arpita Mallik

Main category: cs.SD

TL;DR: ASR and speaker diarization system for Bengali long-form speech using Whisper-based model and finetuned segmentation for low-resource language applications

Motivation: Bengali is widely spoken but underrepresented in long-form speech technology, lacking robust systems for transcription and speaker attribution in this low-resource language context.

Method: Two frameworks: 1) ASR pipeline with Whisper Medium model, vocal separation, voice activity detection, and gap-aware windowing for context preservation; 2) Speaker diarization using pretrained segmentation model finetuned on Bengali conversational data

Result: Efficient transcription of long-form audio with speaker attribution, providing scalable speech technology solutions for Bengali

Conclusion: The developed frameworks address the gap in Bengali speech technology by combining ASR and diarization for practical long-form speech applications

Abstract: Bengali, despite being one of the most widely spoken languages globally, remains underrepresented in long form speech technology, particularly in systems addressing transcription and speaker attribution. We present frameworks for long form Bengali speech intelligence that address automatic speech recognition using a Whisper Medium based model and speaker diarization using a finetuned segmentation model. The ASR pipeline incorporates vocal separation, voice activity detection, and a gap aware windowing strategy to construct context preserving segments for stable decoding. For diarization, a pretrained speaker segmentation model is finetuned on the official competition dataset (provided as part of the DL Sprint 4.0 competition organized under BUET CSE Fest), to better capture Bengali conversational patterns. The resulting systems deliver both efficient transcription of long form audio and speaker aware transcription to provide scalable speech technology solutions for low resource languages.
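A hedged sketch of a "gap-aware windowing" strategy like the one described (our reconstruction; the thresholds and segments below are invented, not the competition system's settings): merge consecutive voice-activity segments into one decoding window unless the silence gap is large or the window would exceed a maximum duration, so each window keeps local context for the ASR decoder.

```python
def gap_aware_windows(segments, max_len=30.0, max_gap=1.0):
    """segments: sorted (start, end) speech intervals in seconds.
    Returns merged (start, end) decoding windows."""
    windows = []
    cur_start, cur_end = segments[0]
    for start, end in segments[1:]:
        gap = start - cur_end
        # Cut the window at long silences or when it would grow too long.
        if gap > max_gap or (end - cur_start) > max_len:
            windows.append((cur_start, cur_end))
            cur_start = start
        cur_end = end
    windows.append((cur_start, cur_end))
    return windows

segs = [(0.0, 4.0), (4.2, 9.0), (12.0, 15.0), (15.5, 16.0)]
print(gap_aware_windows(segs))
```

The payoff versus naive fixed-size chunking is that window boundaries land in silences rather than mid-word, which stabilizes Whisper-style decoding.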

[388] An Adaptive CMSA for Solving the Longest Filled Common Subsequence Problem with an Application in Audio Querying

Marko Djukanovic, Christian Blum, Aleksandar Kartelj, Ana Nikolikj, Guenther Raidl

Main category: cs.SD

TL;DR: Adaptive CMSA framework achieves state-of-the-art performance on Longest Filled Common Subsequence problem with exceptional scalability, plus novel audio application for song identification.

Motivation: Existing approaches to the NP-hard LFCS problem have been evaluated mainly on small instances, which offers limited insight into their scalability; discriminative benchmarks for large-scale performance assessment are lacking.

Method: Adaptive Construct, Merge, Solve, Adapt (CMSA) framework that iteratively generates promising subproblems via component-based construction and refines them using feedback from prior iterations, with subproblems solved by external black-box solver.

Result: Outperforms five leading methods, solves 1,486 of 1,510 instances with known optimal solutions (over 99.9% optimal solution quality), demonstrates exceptional scalability, and successfully applies LFCS to song identification from degraded audio.

Conclusion: Adaptive CMSA is highly effective for large-scale LFCS problems, with practical applications beyond bioinformatics including audio processing, and empirical analysis reveals key problem features influencing algorithm performance.

Abstract: This paper addresses the Longest Filled Common Subsequence (LFCS) problem, a challenging NP-hard problem with applications in bioinformatics, including gene mutation prediction and genomic data reconstruction. Existing approaches, including exact, metaheuristic, and approximation algorithms, have primarily been evaluated on small-sized instances, which offer limited insights into their scalability. In this work, we introduce a new benchmark dataset with significantly larger instances and demonstrate that existing datasets lack the discriminative power needed to meaningfully assess algorithm performance at scale. To solve large instances efficiently, we utilize an adaptive Construct, Merge, Solve, Adapt (CMSA) framework that iteratively generates promising subproblems via component-based construction and refines them using feedback from prior iterations. Subproblems are solved using an external black-box solver. Extensive experiments on both standard and newly introduced benchmarks show that the proposed adaptive CMSA achieves state-of-the-art performance, outperforming five leading methods. Notably, on 1,510 problem instances with known optimal solutions, our approach solves 1,486 of them – achieving over 99.9% optimal solution quality and demonstrating exceptional scalability. We additionally propose a novel application of LFCS for song identification from degraded audio excerpts as an engineering contribution, using real-world energy-profile instances from popular music. Finally, we conducted an empirical explainability analysis to identify critical feature combinations influencing algorithm performance, i.e., the key problem features contributing to success or failure of the approaches across different instance types are revealed.
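For context, LFCS generalizes the classic longest common subsequence: one sequence additionally has letters from a given multiset that may be inserted before matching. As a point of reference (our sketch, not the paper's solver), the plain LCS dynamic program, which under the usual formulation lower-bounds the LFCS objective when no letters are inserted, looks like:

```python
def lcs_length(a, b):
    """Classic O(len(a) * len(b)) dynamic program for the longest common
    subsequence. dp[i][j] = LCS length of a[:i] and b[:j]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1   # extend a common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

print(lcs_length("AGCAT", "GAC"))
```

The CMSA framework in the paper sidesteps this full quadratic-or-worse search on large instances by repeatedly constructing small promising subproblems and handing them to an exact black-box solver.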

[389] Hearing the Order: Investigating Position Bias in Large Audio-Language Models

Yu-Xiang Lin, Chen-An Li, Sheng-Lun Wei, Po-Chun Chen, Hsin-Hsi Chen, Hung-yi Lee

Main category: cs.SD

TL;DR: LALMs show significant position bias where answer option order affects predictions, causing up to 24% performance fluctuations and altering model rankings across benchmarks.

Motivation: To investigate whether large audio-language models exhibit position bias in multiple-choice tasks, where the order of answer options influences predictions, potentially undermining evaluation reliability.

Method: Extensive experiments on six LALMs across three widely used benchmarks and their spoken counterparts, analyzing performance variations when shuffling answer option order, and testing permutation-based mitigation strategies.

Result: All tested LALMs show position bias with performance fluctuations up to 24%, model rankings change based on option ordering, and permutation strategies can mitigate bias in most cases.

Conclusion: Current LALM evaluation practices are unreliable due to position bias, requiring awareness and further research into robust evaluation methods and bias mitigation.

Abstract: Large audio-language models (LALMs) are often used in tasks that involve reasoning over ordered options. An open question is whether their predictions are influenced by the order of answer choices, which would indicate a form of position bias and undermine their reliability. In this paper, we identify and analyze this problem in LALMs. We demonstrate that no model is immune to this bias through extensive experiments on six LALMs across three widely used benchmarks and their spoken counterparts. Shuffling the order of answer options can cause performance fluctuations of up to 24% and even change model rankings, raising concerns about the reliability of current evaluation practices. We also study permutation-based strategies and show that they can mitigate bias in most cases. Our work represents the first systematic investigation of this issue in LALMs, and we hope it raises awareness and motivates further research in this direction.
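A sketch of the permutation-based mitigation the paper studies (details assumed by us, not taken from the paper): score every ordering of the answer options and aggregate each option's score across orderings, so no option benefits from a fixed position. `biased_model` is a toy stand-in for a real LALM.

```python
from itertools import permutations

def debiased_answer(options, model_score):
    """model_score(ordered_options) -> one score per position.
    Averages each option's score over all orderings, then picks the best."""
    totals = {o: 0.0 for o in options}
    orderings = list(permutations(options))
    for order in orderings:
        scores = model_score(order)
        for opt, s in zip(order, scores):
            totals[opt] += s
    return max(options, key=lambda o: totals[o] / len(orderings))

# Toy biased model: adds a large bonus to whatever option is listed first,
# plus a small genuine preference for option "B".
def biased_model(order):
    return [(1.0 if i == 0 else 0.0) + (0.3 if o == "B" else 0.0)
            for i, o in enumerate(order)]

print(debiased_answer(["A", "B", "C"], biased_model))
```

Each option appears first in the same number of orderings, so the position bonus cancels and the small genuine preference for "B" decides the answer; the cost is n! model calls, which is why practical variants often use cyclic shifts instead of all permutations.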

[390] Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes

Jonas Hein, Lazaros Vlachopoulos, Maurits Geert Laurent Olthof, Bastian Sigrist, Philipp Fürnstahl, Matthias Seibold

Main category: cs.SD

TL;DR: A novel framework for 4D audio-visual surgical scene understanding that integrates 3D acoustic localization from microphone arrays with dynamic point clouds from RGB-D cameras to spatially and temporally localize surgical tool-tissue interactions.

Motivation: Current surgical scene understanding approaches rely mainly on visual data or end-to-end learning, limiting fine-grained contextual modeling. The work aims to enhance surgical scene representations by integrating 3D acoustic information for temporally and spatially aware multimodal understanding of surgical environments.

Method: Proposes a framework generating 4D audio-visual representations by projecting acoustic localization from phased microphone arrays onto dynamic point clouds from RGB-D cameras. Uses transformer-based acoustic event detection to identify temporal segments containing tool-tissue interactions, which are then spatially localized in the audio-visual scene representation. Evaluated in realistic operating room setups during simulated surgical procedures.

Result: Successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. Demonstrates accurate spatial sound localization and robust fusion of multimodal data, providing comprehensive dynamic representation of surgical activity.

Conclusion: Introduces the first approach for spatial sound localization in dynamic surgical scenes, marking significant advancement toward multimodal surgical scene representations. Integration of acoustic and visual data enables richer contextual understanding and provides foundation for future intelligent and autonomous surgical systems.

Abstract: Purpose: Surgical scene understanding is key to advancing computer-aided and intelligent surgical systems. Current approaches predominantly rely on visual data or end-to-end learning, which limits fine-grained contextual modeling. This work aims to enhance surgical scene representations by integrating 3D acoustic information, enabling temporally and spatially aware multimodal understanding of surgical environments. Methods: We propose a novel framework for generating 4D audio-visual representations of surgical scenes by projecting acoustic localization information from a phased microphone array onto dynamic point clouds from an RGB-D camera. A transformer-based acoustic event detection module identifies relevant temporal segments containing tool-tissue interactions which are spatially localized in the audio-visual scene representation. The system was experimentally evaluated in a realistic operating room setup during simulated surgical procedures performed by experts. Results: The proposed method successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. Experimental evaluation demonstrates accurate spatial sound localization and robust fusion of multimodal data, providing a comprehensive, dynamic representation of surgical activity. Conclusion: This work introduces the first approach for spatial sound localization in dynamic surgical scenes, marking a significant advancement toward multimodal surgical scene representations. By integrating acoustic and visual data, the proposed framework enables richer contextual understanding and provides a foundation for future intelligent and autonomous surgical systems.
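The projection step can be sketched as a nearest-point query: cast a ray from the microphone array along the estimated direction of arrival and attribute the event to the closest point in the current RGB-D point cloud. The single-ray model and coordinates below are hypothetical stand-ins; the paper fuses full phased-array localization with dynamic clouds.

```python
import math

def localize_on_cloud(origin, direction, cloud):
    """Associate an acoustic event with the scene point nearest to the ray
    cast from the array centre along the estimated direction of arrival."""
    n = math.sqrt(sum(d * d for d in direction))
    d = [x / n for x in direction]
    best, best_dist = None, float("inf")
    for p in cloud:
        v = [p[i] - origin[i] for i in range(3)]
        t = max(0.0, sum(v[i] * d[i] for i in range(3)))  # closest ray parameter
        foot = tuple(origin[i] + t * d[i] for i in range(3))
        dist = math.dist(p, foot)
        if dist < best_dist:
            best, best_dist = p, dist
    return best, best_dist

# Hypothetical per-frame point cloud (metres); the event is attributed to
# the point the direction-of-arrival ray passes closest to.
cloud = [(0.0, 0.0, 2.0), (0.5, 0.5, 1.0), (1.0, 1.0, 1.0)]
hit, miss_by = localize_on_cloud((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), cloud)
print(hit, round(miss_by, 3))
```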

[391] Uncertainty Calibration of Multi-Label Bird Sound Classifiers

Raphael Schwinger, Ben McEwen, Vincent S. Kather, René Heinrich, Lukas Rauch, Sven Tomforde

Main category: cs.SD

TL;DR: Summary unavailable: the arXiv API request for 2511.08261 was rate-limited (HTTP 429), so no AI-enhanced summary (motivation, method, results, conclusion) could be generated for this paper.

[392] MultiAPI Spoof: A Multi-API Dataset and Local-Attention Network for Speech Anti-spoofing Detection

Xueping Zhang, Zhenshan Zhang, Yechen Wang, Linxi Li, Liwei Jin, Ming Li

Main category: cs.SD

TL;DR: Summary unavailable: the arXiv API request for 2512.07352 was rate-limited (HTTP 429), so no AI-enhanced summary (motivation, method, results, conclusion) could be generated for this paper.

cs.LG

[393] Tensor Network Generator-Enhanced Optimization for Traveling Salesman Problem

Ryo Sakai, Chen-Yu Liu

Main category: cs.LG

TL;DR: TN-GEO framework applies tensor network Born machines with differentiable MPS to solve TSP using permutation-based encoding and autoregressive sampling with masking to guarantee valid tours, outperforming classical heuristics on TSPLIB instances up to 52 cities.

Motivation: The paper aims to address the traveling salesman problem (TSP) using tensor network approaches, moving beyond traditional binary encoding methods that require N² variables and penalty terms. The motivation is to develop a more efficient generative optimization framework that guarantees valid tours by construction.

Method: Uses tensor network generator-enhanced optimization (TN-GEO) with automatically differentiable matrix product states (MPS) as generative model. Employs permutation-based formulation with integer variables and autoregressive sampling with masking to ensure valid tours. Introduces k-site MPS variant that learns distributions over k-grams using sliding window for parameter-efficient modeling of larger instances.

Result: Experimental validation on TSPLIB benchmark instances with up to 52 cities shows TN-GEO can outperform classical heuristics including swap and 2-opt hill-climbing. The k-site variants, which focus more on local correlations, show better results than the full-MPS case.

Conclusion: TN-GEO provides an effective tensor network-based approach to combinatorial optimization problems like TSP, with permutation-based encoding and autoregressive sampling offering advantages over traditional binary encoding methods, particularly when using k-site MPS variants for better local correlation modeling.

Abstract: We present an application of the tensor network generator-enhanced optimization (TN-GEO) framework to address the traveling salesman problem (TSP), a fundamental combinatorial optimization challenge. Our approach employs a tensor network Born machine based on automatically differentiable matrix product states (MPS) as the generative model, using the Born rule to define probability distributions over candidate solutions. Unlike approaches based on binary encoding, which require $N^2$ variables and penalty terms to enforce valid tour constraints, we adopt a permutation-based formulation with integer variables and use autoregressive sampling with masking to guarantee that every generated sample is a valid tour by construction. We also introduce a $k$-site MPS variant that learns distributions over $k$-grams (consecutive city subsequences) using a sliding window approach, enabling parameter-efficient modeling for larger instances. Experimental validation on TSPLIB benchmark instances with up to 52 cities demonstrates that TN-GEO can outperform classical heuristics including swap and 2-opt hill-climbing. The $k$-site variants, which put more focus on local correlations, show better results compared to the full-MPS case.
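The validity-by-construction argument is simple to demonstrate: at each autoregressive step, mask the already-visited cities and renormalize before sampling, so every draw is a permutation. Below is a minimal sketch with a uniform stand-in for the MPS Born-rule conditional.

```python
import random

def sample_tour(n, cond_weight, rng):
    """Autoregressive tour sampling with masking: visited cities get weight
    zero, so every sample is a valid permutation by construction.
    `cond_weight` stands in for the MPS conditional distribution."""
    tour, visited = [], set()
    for _ in range(n):
        weights = [0.0 if c in visited else cond_weight(tour, c) for c in range(n)]
        r, acc = rng.random() * sum(weights), 0.0
        for c, w in enumerate(weights):
            acc += w
            if w > 0 and r <= acc:
                tour.append(c)
                visited.add(c)
                break
    return tour

rng = random.Random(0)
# Uniform stand-in conditional; a trained k-site MPS would condition on the prefix.
tours = [sample_tour(8, lambda prefix, city: 1.0, rng) for _ in range(200)]
print(all(sorted(t) == list(range(8)) for t in tours))
```

No penalty terms are needed: invalid tours simply cannot be emitted, which is the advantage over the binary encoding discussed above.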

[394] Memory-guided Prototypical Co-occurrence Learning for Mixed Emotion Recognition

Ming Li, Yong-Jin Liu, Fang Liu, Huankun Sheng, Yeying Fan, Yixiang Wei, Minnan Luo, Weizhan Zhang, Wenping Wang

Main category: cs.LG

TL;DR: MPCL framework for mixed emotion recognition using multi-modal physiological/behavioral signals with memory-guided prototype learning and co-occurrence modeling.

Motivation: Real-world human emotions often involve multiple simultaneous affective states (mixed emotions), but existing models focus on singular emotions in controlled lab settings. Current approaches neglect valence consistency and structured correlations among coexisting emotions.

Method: Memory-guided Prototypical Co-occurrence Learning (MPCL) framework: 1) Multi-modal signal fusion via multi-scale associative memory mechanism, 2) Construction of emotion-specific prototype memory banks for cross-modal semantic relationships, 3) Prototype relation distillation for cross-modal alignment, 4) Memory retrieval strategy to extract semantic-level co-occurrence associations across emotion categories.

Result: Comprehensive experiments on two public datasets show MPCL consistently outperforms state-of-the-art methods in mixed emotion recognition, both quantitatively and qualitatively.

Conclusion: The proposed MPCL framework effectively models emotion co-occurrence patterns through hierarchical abstraction, learning affectively informative representations for accurate emotion distribution prediction in real-world scenarios.

Abstract: Emotion recognition from multi-modal physiological and behavioral signals plays a pivotal role in affective computing, yet most existing models remain constrained to the prediction of singular emotions in controlled laboratory settings. Real-world human emotional experiences, by contrast, are often characterized by the simultaneous presence of multiple affective states, spurring recent interest in mixed emotion recognition as an emotion distribution learning problem. Current approaches, however, often neglect the valence consistency and structured correlations inherent among coexisting emotions. To address this limitation, we propose a Memory-guided Prototypical Co-occurrence Learning (MPCL) framework that explicitly models emotion co-occurrence patterns. Specifically, we first fuse multi-modal signals via a multi-scale associative memory mechanism. To capture cross-modal semantic relationships, we construct emotion-specific prototype memory banks, yielding rich physiological and behavioral representations, and employ prototype relation distillation to ensure cross-modal alignment in the latent prototype space. Furthermore, inspired by human cognitive memory systems, we introduce a memory retrieval strategy to extract semantic-level co-occurrence associations across emotion categories. Through this bottom-up hierarchical abstraction process, our model learns affectively informative representations for accurate emotion distribution prediction. Comprehensive experiments on two public datasets demonstrate that MPCL consistently outperforms state-of-the-art methods in mixed emotion recognition, both quantitatively and qualitatively.
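The prototype retrieval idea reduces to a similarity lookup against per-emotion memory banks. A toy sketch with a hypothetical 3-D bank (the paper's prototypes are learned, high-dimensional, and modality-specific): the top-k retrieved categories hint at which emotions co-occur in the sample.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical 3-D prototype bank, one prototype per emotion category.
bank = {
    "joy":     [1.0, 0.1, 0.0],
    "sadness": [0.0, 1.0, 0.2],
    "anger":   [0.1, 0.2, 1.0],
}

def retrieve(feature, bank, top_k=2):
    """Rank prototypes by similarity to the fused multi-modal feature; the
    top-k set suggests which emotions co-occur in the sample."""
    return sorted(bank, key=lambda e: cosine(feature, bank[e]), reverse=True)[:top_k]

print(retrieve([0.9, 0.8, 0.1], bank))
```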

[395] MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs

Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, KhayTze Peong, Kang Eun Jeon, Jong Hwan Ko, Yiran Chen, Huanrui Yang

Main category: cs.LG

TL;DR: MoBiQuant: A Mixture-of-Bits quantization framework for elastic LLM inference that adjusts weight precision based on token sensitivity, enabling smooth precision switching without repeated calibration.

Motivation: Elastic LLM deployment requires inference with various quantization precisions based on available computational resources, but current quantization methods have calibration parameters tied to specific precisions, causing challenges for elastic-precision calibration and runtime precision switching.

Method: Proposes MoBiQuant with many-in-one recursive residual quantization to iteratively reconstruct higher-precision weights, and a token-aware router to dynamically select the number of residual bit slices based on token sensitivity.

Result: Experimental results show MoBiQuant exhibits strong elasticity and matches the performance of bit-specific calibrated PTQ on LLaMA3-8B without repeated calibration.

Conclusion: MoBiQuant enables smooth precision switching while improving generalization for token outlier distribution, addressing challenges in elastic LLM deployment.

Abstract: Changing runtime complexity on cloud and edge devices necessitates elastic large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources. However, it has been observed that the calibration parameters for quantization are typically linked to specific precisions, which presents challenges during elastic-precision calibration and precision switching at runtime. In this work, we attribute the source of varying calibration parameters to the varying token-level sensitivity caused by a precision-dependent outlier migration phenomenon. Motivated by this observation, we propose \texttt{MoBiQuant}, a novel Mixture-of-Bits quantization framework that adjusts weight precision for elastic LLM inference based on token sensitivity. Specifically, we propose the many-in-one recursive residual quantization that can iteratively reconstruct higher-precision weights and the token-aware router to dynamically select the number of residual bit slices. MoBiQuant enables smooth precision switching while improving generalization for the distribution of token outliers. Experimental results demonstrate that MoBiQuant exhibits strong elasticity, enabling it to match the performance of bit-specific calibrated PTQ on LLaMA3-8B without repeated calibration.
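The many-in-one residual scheme can be illustrated on a scalar weight: quantize, then quantize the residual at half the step, and so on; summing the first k slices gives a k-level reconstruction whose error shrinks as k grows, so one stored set of slices serves every precision. The step schedule and the `route` rule below are illustrative assumptions, not the paper's exact design.

```python
def residual_slices(w, step, n_slices):
    """Round-to-nearest quantization of the running residual at halving
    step sizes; summing the first k slices yields a k-level reconstruction."""
    slices, residual = [], w
    for _ in range(n_slices):
        s = round(residual / step) * step
        slices.append(s)
        residual -= s
        step /= 2
    return slices

w = 0.337
slices = residual_slices(w, step=0.25, n_slices=4)
errors = [abs(w - sum(slices[:k])) for k in range(1, 5)]
print([round(e, 4) for e in errors])  # error drops as slices are added

# A token-aware router would then pick k per token: sensitive tokens keep
# more slices, easy tokens fewer -- here a hypothetical threshold rule.
def route(sensitivity, max_k=4):
    return max(1, min(max_k, round(sensitivity * max_k)))
```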

[396] FedAvg-Based CTMC Hazard Model for Federated Bridge Deterioration Assessment

Takato Yasuno

Main category: cs.LG

TL;DR: Federated learning framework for bridge deterioration modeling using CTMC hazard models without sharing sensitive inspection data

Motivation: Bridge inspection records contain sensitive infrastructure data that cannot be shared across organizations due to governance constraints, necessitating privacy-preserving collaborative modeling

Method: Federated framework where each user trains local CTMC hazard models for bridge deterioration transitions, uploads only 12D pseudo-gradient vectors to central server, which aggregates using FedAvg with momentum and gradient clipping

Result: Consistent convergence of average negative log-likelihood on synthetic data, decreasing gradient norm with increasing user scale, enabling collaborative benchmarking without data sharing

Conclusion: Federated CTMC hazard modeling provides privacy-preserving collaborative deterioration analysis with participation incentives, enabling evidence-based planning while maintaining data sovereignty

Abstract: Bridge periodic inspection records contain sensitive information about public infrastructure, making cross-organizational data sharing impractical under existing data governance constraints. We propose a federated framework for estimating a Continuous-Time Markov Chain (CTMC) hazard model of bridge deterioration, enabling municipalities to collaboratively train a shared benchmark model without transferring raw inspection records. Each User holds local inspection data and trains a log-linear hazard model over three deterioration-direction transitions – Good$\to$Minor, Good$\to$Severe, and Minor$\to$Severe – with covariates for bridge age, coastline distance, and deck area. Local optimization is performed via mini-batch stochastic gradient descent on the CTMC log-likelihood, and only a 12-dimensional pseudo-gradient vector is uploaded to a central server per communication round. The server aggregates User updates using sample-weighted Federated Averaging (FedAvg) with momentum and gradient clipping. All experiments in this paper are conducted on fully synthetic data generated from a known ground-truth parameter set with region-specific heterogeneity, enabling controlled evaluation of federated convergence behaviour. Simulation results across heterogeneous Users show consistent convergence of the average negative log-likelihood, with the aggregated gradient norm decreasing as User scale increases. Furthermore, the federated update mechanism provides a natural participation incentive: Users who register their local inspection datasets on a shared technical-standard platform receive in return the periodically updated global benchmark parameters – information that cannot be obtained from local data alone – thereby enabling evidence-based life-cycle planning without surrendering data sovereignty.
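The server side is compact enough to sketch end to end: clip each User's 12-dimensional pseudo-gradient, take the sample-weighted average, and apply server momentum to the global parameters. The numeric values below are hypothetical, and the CTMC hazard parameterization itself is omitted.

```python
import math

def clip(vec, max_norm):
    n = math.sqrt(sum(v * v for v in vec))
    return [v * min(1.0, max_norm / n) for v in vec] if n > 0 else vec

def fedavg_step(params, updates, momentum, lr=1.0, beta=0.9, max_norm=1.0):
    """One server round: clip each User's 12-D pseudo-gradient, average with
    sample weights, apply momentum, and update the global hazard parameters.
    Raw inspection records never leave the User."""
    total = sum(n for _, n in updates)
    agg = [0.0] * len(params)
    for grad, n in updates:
        g = clip(grad, max_norm)
        for i, gi in enumerate(g):
            agg[i] += (n / total) * gi
    for i in range(len(params)):
        momentum[i] = beta * momentum[i] + agg[i]
        params[i] -= lr * momentum[i]
    return params

params, momentum = [0.0] * 12, [0.0] * 12
# Two hypothetical municipalities with 300 and 100 inspection samples.
updates = [([0.1] * 12, 300), ([0.5] * 12, 100)]
params = fedavg_step(params, updates, momentum)
print(round(params[0], 4))
```

Note the second User's update exceeds the norm bound and is clipped before averaging, so a single large contributor cannot dominate the round.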

[397] Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning

Zhuoxu Huang, Mengxi Jia, Hao Sun, Xuelong Li, Jungong Han

Main category: cs.LG

TL;DR: CalibRL is a hybrid-policy RLVR framework for MLLMs that enables controllable exploration with expert guidance to prevent entropy collapse and policy degradation during reinforcement learning training.

Motivation: RLVR training for MLLMs suffers from entropy collapse, policy degradation, and over-exploitation of suboptimal behaviors due to enormous state spaces and sparse rewards, requiring better exploration strategies.

Method: Uses distribution-aware advantage weighting to scale updates by group rareness, and asymmetric activation (LeakyReLU) leveraging expert knowledge as calibration baseline to moderate overconfident updates while preserving corrective direction.

Result: Extensive experiments across eight benchmarks show consistent improvements in both in-domain and out-of-domain settings, validating the effectiveness of controllable hybrid-policy RLVR training.

Conclusion: CalibRL successfully addresses distributional mismatch between model policy and expert trajectories, achieving stable balance between exploration and exploitation in MLLM RLVR training.

Abstract: Reinforcement Learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLM and sparse rewards often leads to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This necessitates an exploration strategy that maintains productive stochasticity while avoiding the drawbacks of uncontrolled random sampling, yielding inefficient exploration. In this paper, we propose CalibRL, a hybrid-policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, a distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution, therefore preserving exploration. Meanwhile, the asymmetric activation function (LeakyReLU) leverages the expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction. CalibRL increases policy entropy in a guided manner and clarifies the target distribution by estimating the on-policy distribution through online sampling. Updates are driven by these informative behaviors, avoiding convergence to erroneous patterns. Importantly, these designs help alleviate the distributional mismatch between the model’s policy and expert trajectories, thereby achieving a more stable balance between exploration and exploitation. Extensive experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements, validating the effectiveness of our controllable hybrid-policy RLVR training. Code is available at https://github.com/zhh6425/CalibRL.

[398] IMOVNO+: A Regional Partitioning and Meta-Heuristic Ensemble Framework for Imbalanced Multi-Class Learning

Soufiane Bacha, Laouni Djafri, Sahraoui Dhelim, Huansheng Ning

Main category: cs.LG

TL;DR: IMOVNO+ is a two-level framework addressing class imbalance, overlap, and noise in both binary and multi-class classification through data-level quality enhancement and algorithmic-level ensemble optimization.

Motivation: Class imbalance, overlap, and noise degrade data quality and model reliability, especially in multi-class settings where traditional approaches fail to capture complex inter-class relationships and minority-majority structures.

Method: Two-level framework: 1) Data level: uses conditional probability for sample informativeness, partitions data into core/overlap/noise regions, introduces overlapping-cleaning algorithm with Z-score and big-jump gap distance, and smart oversampling with multi-regularization. 2) Algorithm level: meta-heuristic prunes ensemble classifiers to reduce weak-learner influence.

Result: Evaluated on 35 datasets (13 multi-class, 22 binary). Achieves consistent superiority over SOTA, approaching 100% in several cases. Multi-class gains: 37-57% G-mean, 25-44% F1-score, 25-39% precision, 26-43% recall. Binary tasks: near-perfect performance with 14-39% improvements.

Conclusion: IMOVNO+ effectively handles data scarcity and imbalance from collection and privacy limits, providing a comprehensive solution for both binary and multi-class classification problems with imbalanced, overlapping, and noisy data.

Abstract: Class imbalance, overlap, and noise degrade data quality, reduce model reliability, and limit generalization. Although widely studied in binary classification, these issues remain underexplored in multi-class settings, where complex inter-class relationships make minority-majority structures unclear and traditional clustering fails to capture distribution shape. Approaches that rely only on geometric distances risk removing informative samples and generating low-quality synthetic data, while binarization approaches treat imbalance locally and ignore global inter-class dependencies. At the algorithmic level, ensembles struggle to integrate weak classifiers, leading to limited robustness. This paper proposes IMOVNO+ (IMbalance-OVerlap-NOise+ Algorithm-Level Optimization), a two-level framework designed to jointly enhance data quality and algorithmic robustness for binary and multi-class tasks. At the data level, first, conditional probability is used to quantify the informativeness of each sample. Second, the dataset is partitioned into core, overlapping, and noisy regions. Third, an overlapping-cleaning algorithm is introduced that combines Z-score metrics with a big-jump gap distance. Fourth, a smart oversampling algorithm based on multi-regularization controls synthetic sample proximity, preventing new overlaps. At the algorithmic level, a meta-heuristic prunes ensemble classifiers to reduce weak-learner influence. IMOVNO+ was evaluated on 35 datasets (13 multi-class, 22 binary). Results show consistent superiority over state-of-the-art methods, approaching 100% in several cases. For multi-class data, IMOVNO+ achieves gains of 37-57% in G-mean, 25-44% in F1-score, 25-39% in precision, and 26-43% in recall. In binary tasks, it attains near-perfect performance with improvements of 14-39%. The framework handles data scarcity and imbalance from collection and privacy limits.
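The two cleaning ingredients named above, Z-score screening and the big-jump gap cut, can be sketched on a 1-D toy. The distances to a class centroid below are made up; the paper applies these tests inside its regional partitioning, not in isolation.

```python
import statistics

def zscore_flags(dists, z_thresh=2.0):
    """Flag samples whose centroid distance is a Z-score outlier
    (candidates for the noisy region)."""
    mu, sd = statistics.mean(dists), statistics.pstdev(dists)
    return [(d - mu) / sd > z_thresh for d in dists]

def big_jump_threshold(dists):
    """'Big-jump' gap heuristic: sort the distances and cut at the largest
    consecutive gap, separating the core from overlap/noise."""
    s = sorted(dists)
    gap, i = max((s[j + 1] - s[j], j) for j in range(len(s) - 1))
    return (s[i] + s[i + 1]) / 2

# Hypothetical distances from each sample to its class centroid.
dists = [0.1, 0.12, 0.15, 0.11, 0.13, 0.14, 0.9]
flags = zscore_flags(dists)
thr = big_jump_threshold(dists)
print(flags, round(thr, 3))
```

Both tests agree here: the last sample is flagged by its Z-score and falls beyond the big-jump cut.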

[399] Golden Layers and Where to Find Them: Improved Knowledge Editing for Large Language Models Via Layer Gradient Analysis

Shrestha Datta, Hongfu Liu, Anshuman Chhabra

Main category: cs.LG

TL;DR: The paper proposes Layer Gradient Analysis (LGA), a method to identify fixed “golden layers” in LLMs that achieve near-optimal knowledge editing performance across different queries, avoiding the need for sample-wise layer selection.

Motivation: Current knowledge editing methods in LLMs require identifying optimal editing layers for each specific query, which is inefficient. The authors hypothesize that fixed "golden layers" exist that can achieve near-optimal editing performance across different queries, eliminating the need for sample-wise layer selection.

Method: The paper proposes Layer Gradient Analysis (LGA), which uses gradient-attribution to efficiently estimate golden layers without extensive trial-and-error. The method identifies layers that can achieve near-optimal editing performance similar to sample-wise optimal layers, validated using proxy datasets.

Result: Experiments on several benchmark datasets show that golden layers can be reliably identified using proxy datasets and generalize effectively to unseen test queries. LGA demonstrates effectiveness and robustness across different LLM types and various knowledge editing methods.

Conclusion: The existence of fixed golden layers in LLMs enables more efficient knowledge editing, and the proposed LGA method provides an effective way to identify these layers using gradient-attribution, improving the practicality of knowledge editing techniques.

Abstract: Knowledge editing in Large Language Models (LLMs) aims to update the model’s prediction for a specific query to a desired target while preserving its behavior on all other inputs. This process typically involves two stages: identifying the layer to edit and performing the parameter update. Intuitively, different queries may localize knowledge at different depths of the model, resulting in different sample-wise editing performance for a fixed editing layer. In this work, we hypothesize the existence of fixed golden layers that can achieve near-optimal editing performance similar to sample-wise optimal layers. To validate this hypothesis, we provide empirical evidence by comparing golden layers against ground-truth sample-wise optimal layers. Furthermore, we show that golden layers can be reliably identified using a proxy dataset and generalize effectively to unseen test set queries across datasets. Finally, we propose a novel method, namely Layer Gradient Analysis (LGA) that estimates golden layers efficiently via gradient-attribution, avoiding extensive trial-and-error across multiple editing runs. Extensive experiments on several benchmark datasets demonstrate the effectiveness and robustness of our LGA approach across different LLM types and various knowledge editing methods.
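The selection step of LGA can be miniaturized: accumulate the absolute loss gradient each layer's parameters receive over a proxy set, then pick the argmax as the golden layer. A toy two-layer linear model stands in for the LLM here; the numbers are illustrative only and the editing step itself is not shown.

```python
def layer_grad_magnitudes(w1, w2, samples):
    """For a toy two-layer linear model y = w2*(w1*x) with squared loss,
    accumulate the absolute loss gradient w.r.t. each layer's weight over a
    proxy set -- the gradient-attribution signal aggregated to rank layers."""
    g1 = g2 = 0.0
    for x, t in samples:
        h = w1 * x
        y = w2 * h
        dl_dy = 2 * (y - t)
        g1 += abs(dl_dy * w2 * x)   # |dL/dw1|
        g2 += abs(dl_dy * h)        # |dL/dw2|
    return {"layer_1": g1 / len(samples), "layer_2": g2 / len(samples)}

# Hypothetical proxy queries as (input, target) pairs.
proxy = [(1.0, 2.0), (2.0, 3.0), (0.5, 1.0)]
scores = layer_grad_magnitudes(w1=1.0, w2=0.5, samples=proxy)
golden = max(scores, key=scores.get)
print(golden, {k: round(v, 3) for k, v in scores.items()})
```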

[400] Model Merging in the Essential Subspace

Longhua Li, Lei Qi, Qi Tian, Xin Geng

Main category: cs.LG

TL;DR: ESM is a model merging framework that uses PCA to identify essential subspaces for each task, projects parameter updates onto these subspaces for low-rank decomposition, and applies multi-level polarized scaling to amplify critical knowledge while suppressing redundancy, achieving SOTA performance in multi-task merging.

Motivation: Model merging aims to integrate multiple task-specific fine-tuned models into a single multi-task model without additional training, but task interference remains a major obstacle that undermines merged model performance.

Method: 1) Perform PCA on feature shifts from parameter updates to identify essential subspaces; 2) Project each task’s parameter update matrix onto its essential subspace for low-rank decomposition before merging; 3) Apply multi-level polarized scaling to amplify critical parameters and suppress redundant ones.

Result: Extensive experiments across multiple task sets and model scales demonstrate state-of-the-art performance in multi-task model merging, effectively mitigating task interference while preserving core functionality.

Conclusion: ESM provides a robust framework for effective model merging by identifying and preserving essential task-specific knowledge through subspace analysis and polarized scaling, overcoming the challenge of task interference.

Abstract: Model merging aims to integrate multiple task-specific fine-tuned models derived from a shared pre-trained checkpoint into a single multi-task model without additional training. Despite extensive research, task interference remains a major obstacle that often undermines the performance of merged models. In this paper, we propose ESM (Essential Subspace Merging) , a robust framework for effective model merging. We begin by performing Principal Component Analysis (PCA) on feature shifts induced by parameter updates. The resulting principal directions span an essential subspace that dominantly influences feature representations. Each task’s parameter update matrix is projected onto its respective essential subspace for low-rank decomposition before merging. This methodology mitigates inter-task interference while preserving core task-specific functionality. Furthermore, we introduce a multi-level polarized scaling strategy that amplifies parameters containing critical knowledge and suppresses redundant ones, preventing essential knowledge from being overwhelmed during fusion. Extensive experiments across multiple task sets and model scales demonstrate that our method achieves state-of-the-art performance in multi-task model merging.
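The subspace machinery can be sketched with power iteration standing in for PCA: recover the leading direction of the feature shifts, then keep only the rank-1 component of a task's update that acts along it. The paper uses a multi-dimensional essential subspace and adds polarized scaling; this toy keeps one direction and omits scaling.

```python
import math
import random

def top_direction(shifts, iters=200, seed=0):
    """Leading principal direction of the feature-shift matrix via power
    iteration on S^T S (a stand-in for the paper's PCA step)."""
    d = len(shifts[0])
    rng = random.Random(seed)
    u = [rng.random() for _ in range(d)]
    for _ in range(iters):
        v = [0.0] * d
        for row in shifts:
            proj = sum(r * x for r, x in zip(row, u))
            for i in range(d):
                v[i] += proj * row[i]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    return u

def project_update(delta, u):
    """Rank-1 projection u u^T delta: keep only the component of a task's
    parameter update that acts along the essential direction."""
    cols = len(delta[0])
    coeff = [sum(u[i] * delta[i][j] for i in range(len(u))) for j in range(cols)]
    return [[u[i] * coeff[j] for j in range(cols)] for i in range(len(u))]

# Feature shifts dominated by the first axis: the probe should recover (1, ~0).
shifts = [[3.0, 0.1], [2.5, -0.1], [-3.1, 0.05], [2.9, 0.0]]
u = top_direction(shifts)
delta = [[1.0, 2.0], [3.0, 4.0]]        # hypothetical task update matrix
low_rank = project_update(delta, u)
print([round(x, 3) for x in u])
```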

[401] Probing Dec-POMDP Reasoning in Cooperative MARL

Kale-ab Tessera, Leonard Hinckeldey, Riccardo Zamboni, David Abel, Amos Storkey

Main category: cs.LG

TL;DR: Paper introduces diagnostic tools to audit whether cooperative multi-agent RL benchmarks actually require Dec-POMDP reasoning or can be solved with simpler reactive policies, finding many popular benchmarks don’t test core assumptions.

Motivation: To determine if popular cooperative multi-agent RL benchmarks genuinely require Dec-POMDP reasoning (using history to infer hidden states and coordinate) or if they can be solved with simpler strategies, potentially leading to over-optimistic assessments of progress.

Method: Developed a diagnostic suite combining statistically grounded performance comparisons and information-theoretic probes to audit behavioral complexity of baseline policies (IPPO and MAPPO) across 37 scenarios spanning MPE, SMAX, Overcooked, Hanabi, and MaBrax environments.

Result: Reactive policies matched memory-based agents’ performance in over half the scenarios, and emergent coordination often relied on brittle synchronous action coupling rather than robust temporal influence, suggesting many benchmarks don’t adequately test core Dec-POMDP assumptions.

Conclusion: Current cooperative MARL benchmarks may not properly test Dec-POMDP reasoning under existing training paradigms, potentially inflating progress assessments; diagnostic tools released to support more rigorous environment design and evaluation.

Abstract: Cooperative multi-agent reinforcement learning (MARL) is typically framed as a decentralised partially observable Markov decision process (Dec-POMDP), a setting whose hardness stems from two key challenges: partial observability and decentralised coordination. Genuinely solving such tasks requires Dec-POMDP reasoning, where agents use history to infer hidden states and coordinate based on local information. Yet it remains unclear whether popular benchmarks actually demand this reasoning or permit success via simpler strategies. We introduce a diagnostic suite combining statistically grounded performance comparisons and information-theoretic probes to audit the behavioural complexity of baseline policies (IPPO and MAPPO) across 37 scenarios spanning MPE, SMAX, Overcooked, Hanabi, and MaBrax. Our diagnostics reveal that success on these benchmarks rarely requires genuine Dec-POMDP reasoning. Reactive policies match the performance of memory-based agents in over half the scenarios, and emergent coordination frequently relies on brittle, synchronous action coupling rather than robust temporal influence. These findings suggest that some widely used benchmarks may not adequately test core Dec-POMDP assumptions under current training paradigms, potentially leading to over-optimistic assessments of progress. We release our diagnostic tooling to support more rigorous environment design and evaluation in cooperative MARL.
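The information-theoretic side of such an audit can be as simple as a plug-in mutual-information estimate between history and action: near-zero MI means the policy is effectively reactive. A toy version with binary observations and two synthetic policies (the paper's probes are richer, but this is the shape of the test):

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# Does the policy's action carry information about the previous observation?
random.seed(0)
prev_obs = [random.randint(0, 1) for _ in range(2000)]
memory_actions = prev_obs[:]                                  # history-dependent
reactive_actions = [random.randint(0, 1) for _ in prev_obs]   # history-free

mi_mem = mutual_information(prev_obs, memory_actions)
mi_react = mutual_information(prev_obs, reactive_actions)
print(round(mi_mem, 3), round(mi_react, 3))
```

The memory policy scores close to one bit, the reactive one close to zero, which is exactly the separation the diagnostics look for across the 37 scenarios.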

[402] Multimodal Crystal Flow: Any-to-Any Modality Generation for Unified Crystal Modeling

Kiyoung Seong, Sungsoo Ahn, Sehui Han, Changyoung Park

Main category: cs.LG

TL;DR: MCFlow is a unified multimodal flow model for multiple crystal generation tasks using independent time variables for atom types and structures with composition-aware ordering.

DetailsMotivation: Current deep generative models for crystal modeling are task-specific and lack a unified framework that shares crystal representations across different generation tasks like crystal structure prediction and de novo generation.

Method: Proposes Multimodal Crystal Flow (MCFlow) - a unified multimodal flow model that realizes multiple crystal generation tasks as distinct inference trajectories via independent time variables for atom types and crystal structures. Uses composition- and symmetry-aware atom ordering with hierarchical permutation augmentation in a standard transformer model.

Result: MCFlow achieves competitive performance against task-specific baselines on MP-20 and MPTS-52 benchmarks across multiple crystal generation tasks.

Conclusion: MCFlow provides a unified framework for crystal generation tasks that shares representations across different modalities and tasks while maintaining competitive performance.

Abstract: Crystal modeling spans a family of conditional and unconditional generation tasks across different modalities, including crystal structure prediction (CSP) and \emph{de novo} generation (DNG). While recent deep generative models have shown promising performance, they remain largely task-specific, lacking a unified framework that shares crystal representations across different generation tasks. To address this limitation, we propose \emph{Multimodal Crystal Flow (MCFlow)}, a unified multimodal flow model that realizes multiple crystal generation tasks as distinct inference trajectories via independent time variables for atom types and crystal structures. To enable multimodal flow in a standard transformer model, we introduce a composition- and symmetry-aware atom ordering with hierarchical permutation augmentation, injecting strong compositional and crystallographic priors without explicit structural templates. Experiments on the MP-20 and MPTS-52 benchmarks show that MCFlow achieves competitive performance against task-specific baselines across multiple crystal generation tasks.

[403] KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem

Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han

Main category: cs.LG

TL;DR: KnapSpec is a training-free framework that accelerates LLM inference by treating draft model selection as a knapsack problem, optimizing tokens-per-time throughput while considering hardware-specific latencies and context length dependencies.

DetailsMotivation: Existing self-speculative decoding methods use static heuristics that ignore dynamic computational overhead of attention in long-context scenarios, failing to adapt to shifting hardware bottlenecks.

Method: Reformulates draft model selection as a knapsack problem, decouples Attention and MLP layers, models hardware-specific latencies as functions of context length, and uses parallel dynamic programming to find optimal draft configurations on the fly.

Result: Outperforms state-of-the-art SSD baselines, achieving up to 1.47x wall-clock speedup across various benchmarks on Qwen3 and Llama3 models.

Conclusion: KnapSpec provides a plug-and-play approach for high-speed inference on long sequences without additional training or any change to the target model's output distribution, and grounds its design in a theoretical analysis establishing cosine similarity between hidden states as a proxy for the token acceptance rate.

Abstract: Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. This foundation allows our method to maintain high drafting faithfulness while navigating the shifting bottlenecks of real-world hardware. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47x wall-clock speedup across various benchmarks. Our plug-and-play approach ensures high-speed inference for long sequences without requiring additional training or compromising the target model’s output distribution.
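
The knapsack view above can be sketched with a classic 0/1 dynamic program: each candidate sub-layer kept in the draft model carries an integer latency cost and an estimated contribution to acceptance, and we maximize total contribution under a latency budget. The values, latencies, and single-knapsack shape below are illustrative assumptions; KnapSpec's actual formulation decouples Attention and MLP layers, models latency as a function of context length, and runs the DP in parallel.

```python
def knapsack_layer_select(values, latencies, budget):
    """Classic 0/1 knapsack DP: pick layers to keep in the draft model,
    maximizing total estimated acceptance contribution ("value") subject
    to an integer latency budget. Returns (best_value, chosen_indices)."""
    n = len(values)
    dp = [0.0] * (budget + 1)                      # dp[b] = best value within budget b
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        # iterate budgets downward so each item is used at most once
        for b in range(budget, latencies[i] - 1, -1):
            cand = dp[b - latencies[i]] + values[i]
            if cand > dp[b]:
                dp[b] = cand
                keep[i][b] = True
    # backtrack to recover which layers were selected
    chosen, b = [], budget
    for i in range(n - 1, -1, -1):
        if keep[i][b]:
            chosen.append(i)
            b -= latencies[i]
    return dp[budget], sorted(chosen)
```

With toy inputs `values=[3, 4, 5]`, `latencies=[2, 3, 4]`, and `budget=5`, the DP keeps the first two layers for a total value of 7.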

[404] Matching Multiple Experts: On the Exploitability of Multi-Agent Imitation Learning

Antoine Bergerault, Volkan Cevher, Negar Mehr

Main category: cs.LG

TL;DR: The paper analyzes multi-agent imitation learning (MA-IL) from expert demonstrations, showing impossibility/hardness results for learning low-exploitable policies in general n-player Markov Games, but provides guarantees when expert equilibria satisfy strategic dominance assumptions.

DetailsMotivation: Existing multi-agent imitation learning approaches lack characterizations of how far learned policies are from Nash equilibrium in offline settings. The paper aims to address this gap by analyzing the fundamental limitations and providing theoretical guarantees under certain assumptions.

Method: The paper demonstrates impossibility and hardness results through counterexamples where exact measure matching fails, and provides a new hardness result on characterizing Nash gap given fixed measure matching error. It then introduces theoretical analysis using strategic dominance assumptions, specifically dominant strategy expert equilibria, and generalizes with a new notion of best-response continuity.

Result: For dominant strategy expert equilibria with Behavioral Cloning error ε_BC, the paper proves a Nash imitation gap of O(nε_BC/(1-γ)²) for discount factor γ. The analysis shows that standard regularization techniques implicitly encourage best-response continuity.

Conclusion: Learning low-exploitable policies in general n-player Markov Games is fundamentally hard, but strategic dominance assumptions on expert equilibria can overcome these challenges, providing theoretical guarantees for multi-agent imitation learning.

Abstract: Multi-agent imitation learning (MA-IL) aims to learn optimal policies from expert demonstrations of interactions in multi-agent interactive domains. Despite existing guarantees on the performance of the resulting learned policies, characterizations of how far the learned policies are from a Nash equilibrium are missing for offline MA-IL. In this paper, we demonstrate impossibility and hardness results of learning low-exploitable policies in general $n$-player Markov Games. We do so by providing examples where even exact measure matching fails, and demonstrating a new hardness result on characterizing the Nash gap given a fixed measure matching error. We then show how these challenges can be overcome using strategic dominance assumptions on the expert equilibrium. Specifically, for the case of dominant strategy expert equilibria, assuming Behavioral Cloning error $\varepsilon_{\text{BC}}$, this provides a Nash imitation gap of $\mathcal{O}\left(n\varepsilon_{\text{BC}}/(1-\gamma)^2\right)$ for a discount factor $\gamma$. We generalize this result with a new notion of best-response continuity, and argue that this is implicitly encouraged by standard regularization techniques.

[405] MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning

Wall Kim, Chaeyoung Song, Hanul Kim

Main category: cs.LG

TL;DR: MMPFN extends TabPFN to handle multimodal data (tabular + non-tabular like images/text) using modality projectors and specialized attention mechanisms.

DetailsMotivation: TabPFN struggles with integrating heterogeneous modalities like images and text, limiting its applicability in domains like healthcare and marketing where multimodal data is common.

Method: Extends TabPFN with per-modality encoders, modality projectors (multi-head gated MLP and cross-attention pooler) that transform non-tabular embeddings into tabular-compatible tokens for unified processing.

Result: Outperforms state-of-the-art methods on medical and general-purpose multimodal datasets, effectively exploiting non-tabular modalities alongside tabular features.

Conclusion: MMPFN successfully extends prior-data fitted networks to multimodal settings, offering a scalable framework for heterogeneous data learning.

Abstract: Recently, TabPFN has gained attention as a foundation model for tabular data. However, it struggles to integrate heterogeneous modalities such as images and text, which are common in domains like healthcare and marketing, thereby limiting its applicability. To address this, we present the Multi-Modal Prior-data Fitted Network (MMPFN), which extends TabPFN to handle tabular and non-tabular modalities in a unified manner. MMPFN comprises per-modality encoders, modality projectors, and pre-trained foundation models. The modality projectors serve as the critical bridge, transforming non-tabular embeddings into tabular-compatible tokens for unified processing. To this end, we introduce a multi-head gated MLP and a cross-attention pooler that extract richer context from non-tabular inputs while mitigating the attention imbalance issue in multimodal learning. Extensive experiments on medical and general-purpose multimodal datasets demonstrate that MMPFN consistently outperforms competitive state-of-the-art methods and effectively exploits non-tabular modalities alongside tabular features. These results highlight the promise of extending prior-data fitted networks to the multimodal setting, offering a scalable and effective framework for heterogeneous data learning. The source code is available at https://github.com/too-z/MultiModalPFN.
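
As a rough illustration of the cross-attention pooler idea, a single learned query can attend over a variable-length set of non-tabular token embeddings and pool them into one tabular-compatible vector. This NumPy sketch is single-head and untrained; the function names, shapes, and single-query design are assumptions for illustration, not MMPFN's actual projector.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_pool(tokens, query, Wk, Wv):
    """Pool a variable-length set of modality tokens (n, d) into a single
    d-dimensional vector using one learned query. Single-head sketch:
    the query attends over projected keys, then averages projected values."""
    keys = tokens @ Wk                              # (n, d)
    vals = tokens @ Wv                              # (n, d)
    scores = keys @ query / np.sqrt(len(query))     # (n,) scaled dot-product
    attn = softmax(scores)                          # attention weights over tokens
    return attn @ vals                              # (d,) pooled token
```

The output has a fixed dimensionality regardless of how many modality tokens come in, which is what makes it compatible with a tabular token stream.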

[406] Exploring Anti-Aging Literature via ConvexTopics and Large Language Models

Lana E. Yeganova, Won G. Kim, Shubo Tian, Natalie Xie, Donald C. Comeau, W. John Wilbur, Zhiyong Lu

Main category: cs.LG

TL;DR: A convex optimization clustering method for biomedical literature analysis that selects exemplars to create stable, interpretable topics with global optimum guarantee.

DetailsMotivation: The rapid growth of biomedical publications creates challenges for organizing knowledge and detecting trends, while existing clustering methods (K-means, LDA) are sensitive to initialization and prone to local optima, limiting reproducibility.

Method: Reformulation of a convex optimization based clustering algorithm that selects exemplars from data and guarantees global optimum, applied to ~12,000 PubMed articles on aging and longevity.

Result: Method uncovers interpretable topics validated by medical experts, spanning molecular mechanisms to dietary supplements, physical activity, and gut microbiota. Performs favorably compared to K-means, LDA, and BERTopic with better reproducibility.

Conclusion: Provides basis for scalable, web-accessible tools for knowledge discovery in biomedical literature with improved stability and interpretability over traditional methods.

Abstract: The rapid expansion of biomedical publications creates challenges for organizing knowledge and detecting emerging trends, underscoring the need for scalable and interpretable methods. Common clustering and topic modeling approaches such as K-means or LDA remain sensitive to initialization and prone to local optima, limiting reproducibility and evaluation. We propose a reformulation of a convex optimization based clustering algorithm that produces stable, fine-grained topics by selecting exemplars from the data and guaranteeing a global optimum. Applied to about 12,000 PubMed articles on aging and longevity, our method uncovers topics validated by medical experts. It yields interpretable topics spanning from molecular mechanisms to dietary supplements, physical activity, and gut microbiota. The method performs favorably, and most importantly, its reproducibility and interpretability distinguish it from common clustering approaches, including K-means, LDA, and BERTopic. This work provides a basis for developing scalable, web-accessible tools for knowledge discovery.

[407] Coupled Cluster con MōLe: Molecular Orbital Learning for Neural Wavefunctions

Luca Thiede, Abdulrahman Aldossary, Andreas Burger, Jorge Arturo Campos-Gonzalez-Angulo, Ning Wang, Alexander Zook, Melisa Alkan, Kouhei Nakaji, Taylor Lee Patti, Jérôme Florian Gonthier, Mohammad Ghazi Vakili, Alán Aspuru-Guzik

Main category: cs.LG

TL;DR: MōLe is an equivariant ML model that predicts coupled-cluster excitation amplitudes from Hartree-Fock orbitals, enabling efficient quantum chemistry calculations with gold-standard accuracy.

DetailsMotivation: Coupled-cluster theory is the gold standard for quantum chemistry accuracy but is computationally expensive, limiting its widespread use. There's a need for machine learning approaches to accelerate these calculations while maintaining high accuracy.

Method: Developed the Molecular Orbital Learning (MōLe) architecture, an equivariant machine learning model that directly predicts CC excitation amplitudes from mean-field Hartree-Fock molecular orbitals as inputs.

Result: The model shows remarkable data efficiency and out-of-distribution generalization to larger molecules and off-equilibrium geometries, despite being trained only on small equilibrium geometries. It can also reduce the number of cycles required to converge CC calculations.

Conclusion: MōLe can set foundations for high-accuracy wavefunction-based ML architectures to accelerate molecular design and complement force-field approaches in quantum chemistry.

Abstract: Density functional theory (DFT) is the most widely used method for calculating molecular properties; however, its accuracy is often insufficient for quantitative predictions. Coupled-cluster (CC) theory is the most successful method for achieving accuracy beyond DFT and for predicting properties that closely align with experiment. It is known as the ‘‘gold standard’’ of quantum chemistry. Unfortunately, the high computational cost of CC limits its widespread applicability. In this work, we present the Molecular Orbital Learning (MōLe) architecture, an equivariant machine learning model that directly predicts CC’s core mathematical objects, the excitation amplitudes, from the mean-field Hartree-Fock molecular orbitals as inputs. We test various aspects of our model and demonstrate its remarkable data efficiency and out-of-distribution generalization to larger molecules and off-equilibrium geometries, despite being trained only on small equilibrium geometries. Finally, we also examine its ability to reduce the number of cycles required to converge CC calculations. MōLe can set the foundations for high-accuracy wavefunction-based ML architectures to accelerate molecular design and complement force-field approaches.

[408] Learning to Control Unknown Strongly Monotone Games

Siddharth Chandak, Ilai Bistritz, Nicholas Bambos

Main category: cs.LG

Summary unavailable: the arXiv API request for 2407.00575 returned HTTP 429 (rate limited), so no TL;DR, motivation, method, results, or abstract could be generated for this paper (https://export.arxiv.org/api/query?search_query=&id_list=2407.00575&sortBy=relevance&sortOrder=descending&start=0&max_results=100).

[409] Uncertainty-Aware Delivery Delay Duration Prediction via Multi-Task Deep Learning

Stefan Faulkner, Reza Zandehshahvar, Vahid Eghbal Akhlaghi, Sebastien Ouellet, Carsten Jordan, Pascal Van Hentenryck

Main category: cs.LG

TL;DR: Multi-task deep learning model for delivery delay prediction using classification-then-regression strategy to handle imbalanced shipment data in complex logistics networks.

DetailsMotivation: Delivery delay prediction is critical for supply chain efficiency but challenging due to complex logistics networks, multimodal transportation, and imbalanced data where delayed shipments are rare but operationally important.

Method: Multi-task deep learning model with embedding layers for tabular shipment features, using classification-then-regression strategy for end-to-end training. Predicts delivery delay duration for both on-time and delayed shipments with probabilistic forecasting capabilities.

Result: Achieved mean absolute error of 0.67-0.91 days for delayed shipments, outperforming single-step tree-based regression by 41-64% and two-step classify-then-regress tree models by 15-35% on 10M+ shipment records.

Conclusion: The proposed multi-task deep learning approach effectively handles imbalanced and heterogeneous data for operational delivery delay forecasting, demonstrating significant improvements over traditional methods.

Abstract: Accurate delivery delay prediction is critical for maintaining operational efficiency and customer satisfaction across modern supply chains. Yet the increasing complexity of logistics networks, spanning multimodal transportation, cross-country routing, and pronounced regional variability, makes this prediction task inherently challenging. This paper introduces a multi-task deep learning model for delivery delay duration prediction in the presence of significant imbalanced data, where delayed shipments are rare but operationally consequential. The model embeds high-dimensional shipment features with dedicated embedding layers for tabular data, and then uses a classification-then-regression strategy to predict the delivery delay duration for on-time and delayed shipments. Unlike sequential pipelines, this approach enables end-to-end training, improves the detection of delayed cases, and supports probabilistic forecasting for uncertainty-aware decision making. The proposed approach is evaluated on a large-scale real-world dataset from an industrial partner, comprising more than 10 million historical shipment records across four major source locations with distinct regional characteristics. The proposed model is compared with traditional machine learning methods. Experimental results show that the proposed method achieves a mean absolute error of 0.67-0.91 days for delayed-shipment predictions, outperforming single-step tree-based regression baselines by 41-64% and two-step classify-then-regress tree-based models by 15-35%. These gains demonstrate the effectiveness of the proposed model in operational delivery delay forecasting under highly imbalanced and heterogeneous conditions.
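
A minimal sketch of the joint objective behind an end-to-end classification-then-regression head: binary cross-entropy on the delay indicator plus a regression term on the duration. The equal-by-default weighting `alpha` and the use of mean absolute error are assumptions; the paper's exact loss and masking are not given in the summary.

```python
import numpy as np

def multitask_loss(p_delay, dur_pred, is_delayed, dur_true, alpha=1.0):
    """Joint multi-task loss (sketch): BCE on the binary delay indicator
    plus mean-absolute-error on the predicted delay duration, so both
    heads train together rather than in a sequential pipeline."""
    eps = 1e-9  # guard against log(0)
    bce = -(is_delayed * np.log(p_delay + eps)
            + (1 - is_delayed) * np.log(1 - p_delay + eps)).mean()
    mae = np.abs(dur_pred - dur_true).mean()
    return bce + alpha * mae
```

Training both terms jointly is what lets the classifier's notion of "delayed" and the regressor's duration estimates co-adapt, which a two-step classify-then-regress pipeline cannot do.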

[410] The Truthfulness Spectrum Hypothesis

Zhuofan Josh Ying, Shauli Ravfogel, Nikolaus Kriegeskorte, Peter Hase

Main category: cs.LG

TL;DR: The paper investigates the geometry of truthfulness representations in LLMs, proposing a spectrum hypothesis where truth directions range from domain-general to domain-specific, with post-training reshaping these representations.

DetailsMotivation: To reconcile conflicting findings about whether LLMs linearly encode truthfulness, by examining the representational geometry of truth across different domains and types of falsehood.

Method: Systematic evaluation of probe generalization across five truth types, sycophantic/expectation-inverted lying, and honesty benchmarks; analysis of probe direction geometry using Mahalanobis cosine similarity; concept-erasure methods to isolate truth directions; and causal interventions.

Result: Linear probes generalize well across most domains but fail on sycophantic/expectation-inverted lying. Joint training recovers strong performance. Probe geometry explains generalization patterns (R^2=0.98). Domain-specific directions steer more effectively than domain-general ones. Post-training reshapes truth geometry, pushing sycophantic lying further from other truth types.

Conclusion: Supports the truthfulness spectrum hypothesis: truth directions of varying generality coexist in representational space, with post-training reshaping their geometry, explaining chat models’ sycophantic tendencies.

Abstract: Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding’s generality. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation-inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near-perfectly predicts cross-domain generalization (R^2=0.98). Concept-erasure methods further isolate truth directions that are (1) domain-general, (2) domain-specific, or (3) shared only across particular domain subsets. Causal interventions reveal that domain-specific directions steer more effectively than domain-general ones. Finally, post-training reshapes truth geometry, pushing sycophantic lying further from other truth types, suggesting a representational basis for chat models’ sycophantic tendencies. Together, our results support the truthfulness spectrum hypothesis: truth directions of varying generality coexist in representational space, with post-training reshaping their geometry. Code for all experiments is provided in https://github.com/zfying/truth_spec.
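
The Mahalanobis cosine similarity used to compare probe directions is a standard construction: a cosine taken in the inner product induced by an inverse covariance matrix. A minimal sketch, assuming a given covariance estimate; which activations the paper fits that covariance on is not specified here.

```python
import numpy as np

def mahalanobis_cosine(u, v, cov):
    """Cosine similarity in the Mahalanobis metric induced by `cov`:
    <u, v>_M / (||u||_M ||v||_M), where <a, b>_M = a^T cov^{-1} b."""
    P = np.linalg.inv(cov)
    num = u @ P @ v
    den = np.sqrt((u @ P @ u) * (v @ P @ v))
    return num / den
```

With `cov` set to the identity this reduces to the ordinary cosine similarity, so it can be read as cosine similarity after whitening by the activation covariance.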

[411] Discrete Diffusion with Sample-Efficient Estimators for Conditionals

Karthik Elamvazhuthi, Abhijith Jayakumar, Andrey Y. Lokhov

Main category: cs.LG

TL;DR: Discrete denoising diffusion framework using Neural Interaction Screening Estimator (NeurISE) for single-site conditional probability estimation, outperforming ratio-based methods on binary data sets.

DetailsMotivation: To develop a discrete denoising diffusion framework that avoids approximating discrete analogs of score functions and instead uses single-site conditional probabilities as fundamental objects for generative modeling over discrete state spaces.

Method: Integrates sample-efficient NeurISE estimator of single-site conditionals with round-robin noising and denoising dynamics. Treats single-site conditional probabilities as parameterization for reverse diffusion process rather than approximating score functions.

Result: Outperforms popular existing methods including ratio-based approaches on binary data sets (synthetic Ising models, MNIST, D-Wave quantum annealer data). Achieves improved performance in total variation, cross-correlations, and kernel density estimation metrics.

Conclusion: The proposed discrete denoising diffusion framework with NeurISE estimator provides an effective approach for generative modeling over discrete state spaces, demonstrating superior performance on various binary data sets compared to existing methods.

Abstract: We study a discrete denoising diffusion framework that integrates a sample-efficient estimator of single-site conditionals with round-robin noising and denoising dynamics for generative modeling over discrete state spaces. Rather than approximating a discrete analog of a score function, our formulation treats single-site conditional probabilities as the fundamental objects that parameterize the reverse diffusion process. We employ a sample-efficient method known as Neural Interaction Screening Estimator (NeurISE) to estimate these conditionals in the diffusion dynamics. Controlled experiments on synthetic Ising models, MNIST, and scientific data sets produced by a D-Wave quantum annealer, synthetic Potts model and one-dimensional quantum systems demonstrate the proposed approach. On the binary data sets, these experiments demonstrate that the proposed approach outperforms popular existing methods including ratio-based approaches, achieving improved performance in total variation, cross-correlations, and kernel density estimation metrics.
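
The round-robin denoising dynamics amount to visiting sites in a fixed order and resampling each from its single-site conditional. In this sketch the learned NeurISE estimator is stood in for by an arbitrary callable `conditional(x, i)` returning p(x_i = 1 | x_{-i}) for binary variables; that interface is an assumption for illustration.

```python
import numpy as np

def round_robin_denoise(x, conditional, n_sweeps, rng):
    """Round-robin update of a binary configuration: visit sites in fixed
    order and resample each from its single-site conditional probability.
    `conditional(x, i)` stands in for a learned estimator of
    p(x_i = 1 | x_{-i})."""
    x = x.copy()
    for _ in range(n_sweeps):
        for i in range(len(x)):
            # resample site i given the current values of all other sites
            x[i] = rng.random() < conditional(x, i)
    return x
```

With a degenerate conditional that always returns 1.0, one sweep drives every site to 1, which makes the update order and in-place resampling easy to check.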

[412] Learning to Solve Complex Problems via Dataset Decomposition

Wanru Zhao, Lucas Caccia, Zhengyan Shi, Minseon Kim, Weijia Xu, Alessandro Sordoni

Main category: cs.LG

TL;DR: Reverse curriculum generation approach that recursively decomposes complex datasets into simpler components using a teacher-student framework with step-by-step reasoning to enable progressive mastery of difficult tasks.

DetailsMotivation: Curriculum learning traditionally organizes data from simple to complex, but this paper explores reverse curriculum generation to decompose complex datasets into simpler, more learnable components for better model training.

Method: Proposes a teacher-student framework where the teacher uses step-by-step reasoning to recursively generate easier versions of examples. Introduces a novel scoring system to measure data difficulty based on structural complexity and conceptual depth for curriculum construction.

Result: Experiments on math datasets (MATH and AIME) and code generation datasets show models trained with curricula generated by this approach outperform standard training on original datasets.

Conclusion: The reverse curriculum generation approach effectively enables models to progressively master difficult tasks by decomposing complex data into simpler components, leading to superior performance compared to standard training methods.

Abstract: Curriculum learning is a class of training strategies that organizes the data being exposed to a model by difficulty, gradually from simpler to more complex examples. This research explores a reverse curriculum generation approach that recursively decomposes complex datasets into simpler, more learnable components. We propose a teacher-student framework where the teacher is equipped with the ability to reason step-by-step, which is used to recursively generate easier versions of examples, enabling the student model to progressively master difficult tasks. We propose a novel scoring system to measure data difficulty based on its structural complexity and conceptual depth, allowing curriculum construction over decomposed data. Experiments on math datasets (MATH and AIME) and code generation datasets demonstrate that models trained with curricula generated by our approach exhibit superior performance compared to standard training on original datasets.

[413] Shape-informed cardiac mechanics surrogates in data-scarce regimes via geometric encoding and generative augmentation

Davide Carrara, Marc Hirschvogel, Francesca Bonizzoni, Stefano Pagani, Simone Pezzuto, Francesco Regazzoni

Main category: cs.LG

TL;DR: A two-step framework for shape-informed surrogate modeling of cardiac mechanics that decouples geometric representation from physics learning to enable generalization across diverse anatomies under data-scarce conditions.

DetailsMotivation: Computational models of cardiac mechanics are too slow for clinical use, and existing surrogate models struggle to generalize across diverse anatomies, especially with limited data. There's a need for shape-informed surrogate models that can work with scarce data.

Method: Two-step approach: 1) Shape model learns compact latent representation of left ventricular geometries for encoding and synthetic data generation; 2) Neural field-based surrogate model conditioned on geometric encoding predicts ventricular displacement under loading. Uses positional encoding with universal ventricular coordinates and compares PCA-based vs DeepSDF-based geometric encoding strategies.

Result: The approach enables accurate predictions and generalization to unseen geometries, with robustness to noisy or sparsely sampled inputs, demonstrated on both idealized and patient-specific datasets.

Conclusion: Decoupling geometric representation from physics learning enables effective shape-informed surrogate modeling for cardiac mechanics that generalizes well across anatomies even in data-scarce settings.

Abstract: High-fidelity computational models of cardiac mechanics provide mechanistic insight into the heart function but are computationally prohibitive for routine clinical use. Surrogate models can accelerate simulations, but generalization across diverse anatomies is challenging, particularly in data-scarce settings. We propose a two-step framework that decouples geometric representation from learning the physics response, to enable shape-informed surrogate modeling under data-scarce conditions. First, a shape model learns a compact latent representation of left ventricular geometries. The learned latent space effectively encodes anatomies and enables synthetic geometries generation for data augmentation. Second, a neural field-based surrogate model, conditioned on this geometric encoding, is trained to predict ventricular displacement under external loading. The proposed architecture performs positional encoding by using universal ventricular coordinates, which improves generalization across diverse anatomies. Geometric variability is encoded using two alternative strategies, which are systematically compared: a PCA-based approach suitable for working with point cloud representations of geometries, and a DeepSDF-based implicit neural representation learned directly from point clouds. Overall, our results, obtained on idealized and patient-specific datasets, show that the proposed approaches allow for accurate predictions and generalization to unseen geometries, and robustness to noisy or sparsely sampled inputs.

[414] In-context Pre-trained Time-Series Foundation Models adapt to Unseen Tasks

Shangqing Xu, Harshavardhan Kamarthi, Haoxin Liu, B. Aditya Prakash

Main category: cs.LG

TL;DR: TSFM augmented with in-context learning for better generalization to unseen tasks without fine-tuning

DetailsMotivation: Existing time-series foundation models struggle to generalize to unseen tasks without fine-tuning, limiting their adaptability

Method: Propose In-Context Time-series Pre-training (ICTP) framework that restructures pre-training data to equip backbone TSFMs with in-context learning capabilities

Result: ICTP improves state-of-the-art TSFMs by ~11.4% on unseen tasks without requiring fine-tuning

Conclusion: In-context learning enables TSFMs to better adapt to unseen tasks at test time through dynamic adaptation to input-output relationships

Abstract: Time-series foundation models (TSFMs) have demonstrated strong generalization capabilities across diverse datasets and tasks. However, existing foundation models are typically pre-trained to enhance performance on specific tasks and often struggle to generalize to unseen tasks without fine-tuning. To address this limitation, we propose augmenting TSFMs with In-Context Learning (ICL) capabilities, enabling them to perform test-time inference by dynamically adapting to input-output relationships provided within the context. Our framework, In-Context Time-series Pre-training (ICTP), restructures the original pre-training data to equip the backbone TSFM with ICL capabilities, enabling adaptation to unseen tasks. Experiments demonstrate that ICTP improves the performance of state-of-the-art TSFMs by approximately 11.4% on unseen tasks without requiring fine-tuning.
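
The data restructuring step can be pictured as turning supervised pairs into a single in-context sequence that ends in a query, so the model must infer the input-output mapping at test time. Flat list concatenation here is a deliberate simplification of whatever tokenization ICTP actually applies.

```python
def build_icl_sequence(examples, query):
    """Restructure supervised (input, output) pairs into one in-context
    sequence [x1, y1, x2, y2, ..., x_query], so the backbone must infer
    the mapping from the demonstrations to answer the final query."""
    seq = []
    for x, y in examples:
        seq.extend([x, y])
    seq.append(query)
    return seq
```

For example, `build_icl_sequence([(1, 2), (3, 4)], 5)` yields `[1, 2, 3, 4, 5]`.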

[415] QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Jingxuan Zhang, Yunta Hsieh, Zhongwei Wang, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, Mi Zhang

Main category: cs.LG

TL;DR: QuantVLA is a training-free post-training quantization framework for Vision-Language-Action models that reduces compute/memory demands while maintaining performance.

DetailsMotivation: VLA models face practical deployment challenges due to rapidly increasing compute and memory demands as models scale, especially for embodied agents requiring longer horizons and larger backbones.

Method: QuantVLA uses three scale-calibrated components: 1) selective quantization layout that integerizes all linear layers while keeping attention projections in FP, 2) attention temperature matching for stabilizing attention logits, and 3) output head balancing for mitigating post-projection energy drift. It requires no training and uses only a small unlabeled calibration buffer.

Result: QuantVLA exceeds full-precision baseline task success rates on LIBERO benchmarks, achieves ~70% relative memory savings on quantized components, and delivers 1.22x speedup in end-to-end inference latency.

Conclusion: QuantVLA provides a practical pathway for scalable low-bit embodied intelligence under strict compute, memory, and power constraints, being the first PTQ approach for VLA systems and first to successfully quantize a diffusion transformer action head.

Abstract: Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, and delivers a 1.22x speedup in end-to-end inference latency, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.
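QuantVLA's scale-calibrated components are specified in the paper; as a minimal sketch of the post-training quantization primitive its selective layout applies to linear layers (while attention projections stay in floating point), the code below shows symmetric per-output-channel int8 weight quantization. Function names are mine, not the paper's:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-output-channel int8 quantization of a (out, in) weight matrix."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)           # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

Each element's reconstruction error is bounded by half a quantization step (scale/2 per channel), and int8 storage is 4x smaller than fp32 plus a small per-channel scale, which is the kind of memory saving the paper reports on the quantized components.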

[416] CaDrift: A Time-dependent Causal Generator of Drifting Data Streams

Eduardo V. L. Barboza, Jean Paul Barddal, Robert Sabourin, Rafael M. O. Cruz

Main category: cs.LG

TL;DR: CaDrift is a synthetic data generator framework using Structural Causal Models to create time-dependent data streams with controlled distributional shifts and interventions for evaluating ML methods under evolving data.

DetailsMotivation: Need for tools to evaluate machine learning methods under evolving data distributions with controlled shift events, as real-world data often experiences distributional changes over time that affect model performance.

Method: Uses Structural Causal Models (SCMs) to generate synthetic time-dependent data streams by drifting mapping functions of the SCM to create various distributional and covariate shifts, and models perturbations through causal interventions.

Result: Experimental results show classifier accuracy drops after distributional shift events followed by gradual recovery, confirming the generator’s effectiveness in simulating realistic shifts. Framework is available on GitHub.

Conclusion: CaDrift provides a valuable tool for evaluating ML methods under evolving data distributions by generating controlled shift scenarios through causal modeling, addressing the need for robust evaluation frameworks in non-stationary environments.

Abstract: This work presents Causal Drift Generator (CaDrift), a time-dependent synthetic data generator framework based on Structural Causal Models (SCMs). The framework produces a virtually infinite combination of data streams with controlled shift events and time-dependent data, making it a tool to evaluate methods under evolving data. CaDrift synthesizes various distributional and covariate shifts by drifting mapping functions of the SCM, which change underlying cause-and-effect relationships between features and the target. In addition, CaDrift models occasional perturbations by leveraging interventions in causal modeling. Experimental results show that, after distributional shift events, the accuracy of classifiers tends to drop, followed by a gradual recovery, confirming the generator’s effectiveness in simulating shifts. The framework has been made available on GitHub.
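CaDrift's generator is far more general, but the core idea of drifting an SCM's mapping functions over stream time can be sketched in a few lines. The graph, drift point, and coefficients below are illustrative stand-ins, not CaDrift's:

```python
import numpy as np

def sample_scm(n, t, drift_at=0.5, seed=0):
    """Toy SCM X1 -> X2 -> Y whose X2 -> Y mechanism drifts at stream time drift_at."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = 2.0 * x1 + rng.normal(scale=0.1, size=n)
    coef = 1.0 if t < drift_at else -1.0     # drift: the cause-effect mapping flips sign
    y = (coef * x2 + rng.normal(scale=0.1, size=n) > 0).astype(int)
    return np.column_stack([x1, x2]), y

X_pre, y_pre = sample_scm(2000, t=0.2)       # before the shift event
X_post, y_post = sample_scm(2000, t=0.8)     # after the shift event
```

A classifier fit on the pre-drift stream sees its accuracy collapse once the mechanism flips, matching the drop-then-recover pattern reported in the experiments.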

[417] Emergent Manifold Separability during Reasoning in Large Language Models

Alexandre Polo, Chanwoo Chun, SueYeon Chung

Main category: cs.LG

TL;DR: The paper investigates the temporal dynamics of representation geometry during Chain-of-Thought reasoning in LLMs, revealing a transient “geometric pulse” where concept manifolds become linearly separable just before computation and rapidly compress afterward.

DetailsMotivation: While Chain-of-Thought prompting improves reasoning in LLMs, the temporal dynamics of the underlying representation geometry remain poorly understood. The authors aim to understand how representations evolve during reasoning processes.

Method: The authors apply Manifold Capacity Theory (MCT) to a compositional Boolean logic task, allowing them to quantify the linear separability of latent representations without the confounding factors of probe training. This approach reveals the geometric dynamics of reasoning.

Result: The analysis reveals that reasoning manifests as a transient geometric pulse where concept manifolds are untangled into linearly separable subspaces immediately prior to computation and rapidly compressed thereafter. This differs from standard linear probe accuracy, which remains high long after computation.

Conclusion: The authors interpret this phenomenon as “Dynamic Manifold Management,” a mechanism where the model dynamically modulates representational capacity to optimize the bandwidth of the residual stream throughout the reasoning chain, distinguishing between information that is merely retrievable versus geometrically prepared for processing.

Abstract: Chain-of-Thought (CoT) prompting significantly improves reasoning in Large Language Models, yet the temporal dynamics of the underlying representation geometry remain poorly understood. We investigate these dynamics by applying Manifold Capacity Theory (MCT) to a compositional Boolean logic task, allowing us to quantify the linear separability of latent representations without the confounding factors of probe training. Our analysis reveals that reasoning manifests as a transient geometric pulse, where concept manifolds are untangled into linearly separable subspaces immediately prior to computation and rapidly compressed thereafter. This behavior diverges from standard linear probe accuracy, which remains high long after computation, suggesting a fundamental distinction between information that is merely retrievable and information that is geometrically prepared for processing. We interpret this phenomenon as \emph{Dynamic Manifold Management}, a mechanism where the model dynamically modulates representational capacity to optimize the bandwidth of the residual stream throughout the reasoning chain.

[418] Hierarchical Molecular Representation Learning via Fragment-Based Self-Supervised Embedding Prediction

Jiele Wu, Haozhe Ma, Zhihan Guo, Thanh Vinh Vo, Tze Yun Leong

Main category: cs.LG

TL;DR: GraSPNet is a hierarchical self-supervised framework for molecular graph analysis that models both atomic-level and fragment-level semantics through multi-level message passing with masked semantic prediction, outperforming existing GSSL methods in transfer learning.

DetailsMotivation: Existing graph self-supervised learning methods focus on node- or edge-level information but ignore chemically relevant substructures that strongly influence molecular properties, limiting their effectiveness in molecular graph analysis.

Method: GraSPNet decomposes molecular graphs into chemically meaningful fragments without predefined vocabularies and learns node- and fragment-level representations through multi-level message passing with masked semantic prediction at both levels.

Result: Extensive experiments on multiple molecular property prediction benchmarks show GraSPNet learns chemically meaningful representations and consistently outperforms state-of-the-art GSSL methods in transfer learning settings.

Conclusion: GraSPNet’s hierarchical semantic supervision enables learning of multi-resolution structural information that is both expressive and transferable for molecular property prediction.

Abstract: Graph self-supervised learning (GSSL) has demonstrated strong potential for generating expressive graph embeddings without the need for human annotations, making it particularly valuable in domains with high labeling costs such as molecular graph analysis. However, existing GSSL methods mostly focus on node- or edge-level information, often ignoring chemically relevant substructures which strongly influence molecular properties. In this work, we propose Graph Semantic Predictive Network (GraSPNet), a hierarchical self-supervised framework that explicitly models both atomic-level and fragment-level semantics. GraSPNet decomposes molecular graphs into chemically meaningful fragments without predefined vocabularies and learns node- and fragment-level representations through multi-level message passing with masked semantic prediction at both levels. This hierarchical semantic supervision enables GraSPNet to learn multi-resolution structural information that is both expressive and transferable. Extensive experiments on multiple molecular property prediction benchmarks demonstrate that GraSPNet learns chemically meaningful representations and consistently outperforms state-of-the-art GSSL methods in transfer learning settings.

[419] Momentum Guidance: Plug-and-Play Guidance for Flow Models

Runlong Liao, Jian Yu, Baiyu Su, Chi Zhang, Lizhang Chen, Qiang Liu

Main category: cs.LG

TL;DR: Momentum Guidance (MG) improves flow-based generative models by using velocity extrapolation from ODE trajectories, enhancing sample quality without increasing inference cost.

DetailsMotivation: Conditional samples from pretrained flow-based models often appear diffuse and lack fine-grained detail due to neural network smoothing effects. Existing guidance techniques like classifier-free guidance (CFG) improve fidelity but double inference cost and reduce sample diversity.

Method: Introduces Momentum Guidance (MG) that leverages the ODE trajectory itself by extrapolating current velocity using an exponential moving average of past velocities. This preserves the standard one-evaluation-per-step cost while matching the effect of standard guidance.

Result: MG achieves average improvements in FID of 36.68% without CFG and 25.52% with CFG across various sampling settings on ImageNet-256, attaining an FID of 1.597 at 64 sampling steps. Consistent quality enhancements confirmed on large flow-based models like Stable Diffusion 3 and FLUX.1-dev.

Conclusion: Momentum Guidance provides an effective way to improve sample quality in flow-based generative models without increasing computational cost, working well both independently and in combination with existing guidance techniques.

Abstract: Flow-based generative models have become a strong framework for high-quality generative modeling, yet pretrained models are rarely used in their vanilla conditional form: conditional samples without guidance often appear diffuse and lack fine-grained detail due to the smoothing effects of neural networks. Existing guidance techniques such as classifier-free guidance (CFG) improve fidelity but double the inference cost and typically reduce sample diversity. We introduce Momentum Guidance (MG), a new dimension of guidance that leverages the ODE trajectory itself. MG extrapolates the current velocity using an exponential moving average of past velocities and preserves the standard one-evaluation-per-step cost. It matches the effect of standard guidance without extra computation and can further improve quality when combined with CFG. Experiments demonstrate MG’s effectiveness across benchmarks. Specifically, on ImageNet-256, MG achieves average improvements in FID of 36.68% without CFG and 25.52% with CFG across various sampling settings, attaining an FID of 1.597 at 64 sampling steps. Evaluations on large flow-based models like Stable Diffusion 3 and FLUX.1-dev further confirm consistent quality enhancements across standard metrics.
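The abstract does not give MG's precise update rule; the sketch below shows one plausible form of the idea, an Euler ODE sampler whose current velocity is extrapolated against an exponential moving average of past velocities. The `beta` and `gamma` hyperparameters are hypothetical, not the paper's:

```python
import numpy as np

def euler_sample_mg(v_fn, x0, steps=50, beta=0.9, gamma=0.5):
    """Euler integration of a flow ODE with momentum-style velocity extrapolation.
    Costs one velocity evaluation per step, like plain (unguided) sampling."""
    x, dt, ema = np.array(x0, dtype=float), 1.0 / steps, None
    for i in range(steps):
        v = v_fn(x, i * dt)
        ema = v if ema is None else beta * ema + (1 - beta) * v
        v_guided = v + gamma * (v - ema)   # extrapolate along the newest direction
        x = x + dt * v_guided
    return x

target = np.array([1.0, -2.0])
x_final = euler_sample_mg(lambda x, t: target, np.zeros(2))
```

Note that with a constant velocity field the EMA equals the current velocity, so the sampler reduces exactly to plain Euler; the extrapolation only contributes when the trajectory curves, which fits the abstract's claim that MG preserves the one-evaluation-per-step cost.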

[420] Quantitative Approximation Rates for Group Equivariant Learning

Jonathan W. Siegel, Snir Hordan, Hannah Lawrence, Ali Syed, Nadav Dym

Main category: cs.LG

TL;DR: Quantitative approximation rates for group-equivariant neural networks showing they match MLP expressivity for symmetric functions

DetailsMotivation: While universal approximation theorems exist for neural networks, and quantitative rates are known for ReLU networks on Hölder functions, there's a gap in understanding quantitative approximation rates for group-equivariant architectures that incorporate symmetry constraints.

Method: Derived quantitative approximation rates for several group-equivariant architectures including: permutation-invariant Deep Sets, permutation-equivariant Sumformer and Transformer architectures, joint invariance to permutations and rigid motions using invariant networks based on frame averaging, and general bi-Lipschitz invariant models.

Result: Showed that equally-sized ReLU MLPs and equivariant architectures are equally expressive over equivariant functions, demonstrating that hard-coding equivariance does not result in loss of expressivity or approximation power.

Conclusion: Equivariant architectures maintain the same approximation power as standard MLPs for symmetric functions, providing theoretical justification for using symmetry constraints without sacrificing expressivity.

Abstract: The universal approximation theorem establishes that neural networks can approximate any continuous function on a compact set. Later works in approximation theory provide quantitative approximation rates for ReLU networks on the class of $α$-Hölder functions $f: [0,1]^N \to \mathbb{R}$. The goal of this paper is to provide similar quantitative approximation results in the context of group equivariant learning, where the learned $α$-Hölder function is known to obey certain group symmetries. While there has been much interest in the literature in understanding the universal approximation properties of equivariant models, very few quantitative approximation results are known for equivariant models. In this paper, we bridge this gap by deriving quantitative approximation rates for several prominent group-equivariant and invariant architectures. The architectures that we consider include: the permutation-invariant Deep Sets architecture; the permutation-equivariant Sumformer and Transformer architectures; joint invariance to permutations and rigid motions using invariant networks based on frame averaging; and general bi-Lipschitz invariant models. Overall, we show that equally-sized ReLU MLPs and equivariant architectures are equally expressive over equivariant functions. Thus, hard-coding equivariance does not result in a loss of expressivity or approximation power in these models.
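As a concrete instance of the architectures analyzed, here is a minimal permutation-invariant Deep Sets network, f(X) = rho(sum_i phi(x_i)) with ReLU maps for phi and rho; the weights are random placeholders for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def deep_sets(X, W_phi, W_rho):
    """f(X) = rho(sum_i phi(x_i)): sum pooling makes the output order-independent."""
    h = relu(X @ W_phi)      # phi: applied to each set element independently
    pooled = h.sum(axis=0)   # symmetric pooling over the set dimension
    return relu(pooled @ W_rho)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                   # a set of 5 elements in R^3
W_phi = rng.normal(size=(3, 16))
W_rho = rng.normal(size=(16, 2))
out = deep_sets(X, W_phi, W_rho)
out_perm = deep_sets(X[::-1], W_phi, W_rho)   # same set, reversed order
```

Permutation invariance here is hard-coded by the sum pooling; the paper's result is that networks of this form approximate permutation-invariant Hölder functions at the same quantitative rate as unconstrained ReLU MLPs of the same size.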

[421] cc-Shapley: Measuring Multivariate Feature Importance Needs Causal Context

Jörg Martin, Stefan Haufe

Main category: cs.LG

TL;DR: cc-Shapley: A causal modification of Shapley values that addresses spurious feature associations in observational data by incorporating causal knowledge of the data-generating process.

DetailsMotivation: Current Shapley values for feature importance in explainable AI suffer from spurious associations due to collider bias and suppression effects when considering features in observational contexts, which can lead to misinterpretations and hinder scientific discovery.

Method: Proposes cc-Shapley (causal context Shapley), an interventional modification of conventional observational Shapley values that leverages knowledge of the data’s causal structure to analyze feature relevance in the causal context of remaining features.

Result: Theoretical proof that cc-Shapley eradicates spurious associations induced by collider bias. Empirical comparison on synthetic and real-world datasets shows nullification or reversal of associations compared to univariate feature importance when moving from observational to cc-Shapley.

Conclusion: Causal knowledge is essential for meaningful feature attribution in explainable AI, and cc-Shapley provides a principled approach to address limitations of purely data-driven Shapley values by incorporating causal structure.

Abstract: Explainable artificial intelligence promises to yield insights into relevant features, thereby enabling humans to examine and scrutinize machine learning models or even facilitating scientific discovery. Considering the widespread technique of Shapley values, we find that purely data-driven operationalization of multivariate feature importance is unsuitable for such purposes. Even for simple problems with two features, spurious associations due to collider bias and suppression arise from considering one feature only in the observational context of the other, which can lead to misinterpretations. Causal knowledge about the data-generating process is required to identify and correct such misleading feature attributions. We propose cc-Shapley (causal context Shapley), an interventional modification of conventional observational Shapley values leveraging knowledge of the data’s causal structure, thereby analyzing the relevance of a feature in the causal context of the remaining features. We show theoretically that this eradicates spurious association induced by collider bias. We compare the behavior of Shapley and cc-Shapley values on various, synthetic, and real-world datasets. We observe nullification or reversal of associations compared to univariate feature importance when moving from observational to cc-Shapley.
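cc-Shapley's construction requires the data's causal graph; the toy below only illustrates the underlying distinction between observational and interventional value functions that motivates it. The model reads only feature 0, yet feature 1 is a near-copy of it: conditioning on observed features (observational, crudely approximated here by nearest neighbours) credits feature 1, while intervening (breaking the dependence) does not. The value functions and the nearest-neighbour conditional are my own illustration, not the paper's estimators:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=5000)
x2 = x1 + rng.normal(scale=0.01, size=5000)   # feature 1 is a near-copy of feature 0
X_bg = np.column_stack([x1, x2])
f = lambda X: X[:, 0]                          # the model only reads feature 0

def v_interventional(S, x):
    """v(S): features outside S are replaced by background draws (a do-intervention)."""
    Xs = X_bg.copy()
    if S:
        Xs[:, list(S)] = x[list(S)]
    return f(Xs).mean()

def v_observational(S, x, k=50):
    """v(S): crude conditional expectation given the observed features in S
    (average f over the k background rows closest on the S coordinates)."""
    if not S:
        return f(X_bg).mean()
    d = np.abs(X_bg[:, list(S)] - x[list(S)]).sum(axis=1)
    return f(X_bg[np.argsort(d)[:k]]).mean()

def shapley_2feat(v, x):
    v0, v1, v2, v12 = v(set(), x), v({0}, x), v({1}, x), v({0, 1}, x)
    return (0.5 * ((v1 - v0) + (v12 - v2)),    # attribution of feature 0
            0.5 * ((v2 - v0) + (v12 - v1)))    # attribution of feature 1

x = np.array([2.0, 2.0])
phi_int = shapley_2feat(v_interventional, x)
phi_obs = shapley_2feat(v_observational, x)
```

Here the observational attribution of feature 1 is large even though the model never reads it, while the interventional one is zero: the kind of misleading multivariate attribution that causal context is meant to correct.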

[422] GeoPT: Scaling Physics Simulation via Lifted Geometric Pre-Training

Haixu Wu, Minghao Guo, Zongyi Li, Zhiyang Dou, Mingsheng Long, Kaiming He, Wojciech Matusik

Main category: cs.LG

TL;DR: GeoPT is a pre-trained model for physics simulation that uses synthetic dynamics on static geometry to enable dynamics-aware self-supervision without physics labels.

DetailsMotivation: Scaling neural simulators is bottlenecked by expensive high-fidelity training data generation. Pre-training on static geometry alone ignores dynamics and can cause negative transfer for physics tasks.

Method: GeoPT uses lifted geometric pre-training by augmenting static geometry with synthetic dynamics, enabling dynamics-aware self-supervision without requiring physics labels. Pre-trained on over 1 million samples.

Result: Improves industrial-fidelity benchmarks for fluid mechanics (cars, aircraft, ships) and solid mechanics (crash simulation), reducing labeled data requirements by 20-60% and accelerating convergence by 2×.

Conclusion: Lifting with synthetic dynamics bridges the geometry-physics gap, providing a scalable path for neural simulation and potentially beyond.

Abstract: Neural simulators promise efficient surrogates for physics simulation, but scaling them is bottlenecked by the prohibitive cost of generating high-fidelity training data. Pre-training on abundant off-the-shelf geometries offers a natural alternative, yet faces a fundamental gap: supervision on static geometry alone ignores dynamics and can lead to negative transfer on physics tasks. We present GeoPT, a unified pre-trained model for general physics simulation based on lifted geometric pre-training. The core idea is to augment geometry with synthetic dynamics, enabling dynamics-aware self-supervision without physics labels. Pre-trained on over one million samples, GeoPT consistently improves industrial-fidelity benchmarks spanning fluid mechanics for cars, aircraft, and ships, and solid mechanics in crash simulation, reducing labeled data requirements by 20-60% and accelerating convergence by 2$\times$. These results show that lifting with synthetic dynamics bridges the geometry-physics gap, unlocking a scalable path for neural simulation and potentially beyond. Code is available at https://github.com/Physics-Scaling/GeoPT.

[423] Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation

Callum Canavan, Aditya Shrivastava, Allison Qi, Jonathan Michala, Fabien Roger

Main category: cs.LG

TL;DR: The paper identifies limitations in current evaluations of unsupervised elicitation and easy-to-hard generalization techniques for steering language models toward truthfulness, showing these methods fail on challenging datasets lacking ideal properties.

DetailsMotivation: Previous work shows unsupervised elicitation and easy-to-hard generalization can improve model truthfulness, but evaluations may be overoptimistic due to unrealistic dataset properties. Real-world datasets often lack balanced training, have features more salient than truthfulness, and contain ambiguous data points.

Method: Constructed challenging datasets lacking three key properties: (1) no features more salient than truthfulness, (2) balanced training sets, and (3) well-defined answers. Stress-tested standard unsupervised elicitation and easy-to-hard generalization techniques on these datasets, also studied ensembling and technique combinations.

Result: No technique reliably performed well on any of the challenges. Ensembling and combining easy-to-hard with unsupervised techniques only partially mitigated performance degradation. Current methods fail when datasets lack ideal properties.

Conclusion: Overcoming these dataset challenges should be a priority for future work on unsupervised elicitation. Current evaluations are overoptimistic, and techniques don’t generalize to realistic, challenging datasets.

Abstract: To steer language models towards truthful outputs on tasks which are beyond human capability, previous work has suggested training models on easy tasks to steer them on harder ones (easy-to-hard generalization), or using unsupervised training algorithms to steer models with no external labels at all (unsupervised elicitation). Although techniques from both paradigms have been shown to improve model accuracy on a wide variety of tasks, we argue that the datasets used for these evaluations could cause overoptimistic evaluation results. Unlike many real-world datasets, they often (1) have no features with more salience than truthfulness, (2) have balanced training sets, and (3) contain only data points to which the model can give a well-defined answer. We construct datasets that lack each of these properties to stress-test a range of standard unsupervised elicitation and easy-to-hard generalization techniques. We find that no technique reliably performs well on any of these challenges. We also study ensembling and combining easy-to-hard and unsupervised techniques, and find they only partially mitigate performance degradation due to these challenges. We believe that overcoming these challenges should be a priority for future work on unsupervised elicitation.

[424] Wasserstein Distributionally Robust Online Learning

Guixian Chen, Salar Fattahi, Soroosh Shafiee

Main category: cs.LG

TL;DR: Online distributionally robust learning using Wasserstein ambiguity sets, with convergence guarantees and computational improvements for piecewise concave losses.

DetailsMotivation: While distributionally robust optimization (DRO) is well-understood offline, its online extension poses significant convergence and computational challenges that need to be addressed for practical applications.

Method: Formulates the problem as an online saddle-point stochastic game between decision maker and adversary, proposes a general convergence framework, and develops a tailored algorithm for piecewise concave losses that connects worst-case expectation problems to tractable budget allocation problems.

Result: Achieves convergence to robust Nash equilibrium matching offline DRO solution, with substantial computational speedups over state-of-the-art solvers like Gurobi for piecewise concave loss functions.

Conclusion: The paper successfully addresses key challenges in online distributionally robust learning, providing both theoretical convergence guarantees and practical computational improvements for an important class of problems.

Abstract: We study distributionally robust online learning, where a risk-averse learner updates decisions sequentially to guard against worst-case distributions drawn from a Wasserstein ambiguity set centered at past observations. While this paradigm is well understood in the offline setting through Wasserstein Distributionally Robust Optimization (DRO), its online extension poses significant challenges in both convergence and computation. In this paper, we address these challenges. First, we formulate the problem as an online saddle-point stochastic game between a decision maker and an adversary selecting worst-case distributions, and propose a general framework that converges to a robust Nash equilibrium coinciding with the solution of the corresponding offline Wasserstein DRO problem. Second, we address the main computational bottleneck, which is the repeated solution of worst-case expectation problems. For the important class of piecewise concave loss functions, we propose a tailored algorithm that exploits problem geometry to achieve substantial speedups over state-of-the-art solvers such as Gurobi. The key insight is a novel connection between the worst-case expectation problem, an inherently infinite-dimensional optimization problem, and a classical and tractable budget allocation problem, which is of independent interest.

[425] $κ$-Explorer: A Unified Framework for Active Model Estimation in MDPs

Xihe Gu, Urbashi Mitra, Tara Javidi

Main category: cs.LG

TL;DR: A new active exploration algorithm called κ-Explorer for tabular MDPs that optimizes a family of concave objective functions balancing intrinsic estimation complexity and extrinsic visitation frequency to minimize model estimation error.

DetailsMotivation: In tabular MDPs, accurate model estimation depends on how exploration policies allocate visitation frequencies according to transition distribution complexity. Existing methods don't explicitly balance intrinsic estimation complexity with extrinsic visitation frequency.

Method: Introduces parameterized family of decomposable concave objective functions U_κ that incorporate both intrinsic estimation complexity and extrinsic visitation frequency. Proposes κ-Explorer algorithm that performs Frank-Wolfe-style optimization over state-action occupancy measures using closed-form gradient characterization.

Result: Establishes tight regret guarantees for κ-Explorer and introduces practical online surrogate algorithm. Experiments on benchmark MDPs show superior performance compared to existing exploration strategies.

Conclusion: κ-Explorer provides a unified framework for active exploration that naturally prioritizes underexplored and high-variance transitions while enabling efficient optimization with theoretical guarantees.

Abstract: In tabular Markov decision processes (MDPs) with perfect state observability, each trajectory provides active samples from the transition distributions conditioned on state-action pairs. Consequently, accurate model estimation depends on how the exploration policy allocates visitation frequencies in accordance with the intrinsic complexity of each transition distribution. Building on recent work on coverage-based exploration, we introduce a parameterized family of decomposable and concave objective functions $U_κ$ that explicitly incorporate both intrinsic estimation complexity and extrinsic visitation frequency. Moreover, the curvature $κ$ provides a unified treatment of various global objectives, such as the average-case and worst-case estimation error objectives. Using the closed-form characterization of the gradient of $U_κ$, we propose $κ$-Explorer, an active exploration algorithm that performs Frank-Wolfe-style optimization over state-action occupancy measures. The diminishing-returns structure of $U_κ$ naturally prioritizes underexplored and high-variance transitions, while preserving smoothness properties that enable efficient optimization. We establish tight regret guarantees for $κ$-Explorer and further introduce a fully online and computationally efficient surrogate algorithm for practical use. Experiments on benchmark MDPs demonstrate that $κ$-Explorer provides superior performance compared to existing exploration strategies.
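The objective family $U_κ$ and its gradient are defined in the paper; the sketch below only illustrates the Frank-Wolfe pattern over the simplex on a stand-in diminishing-returns objective $U(p) = \sum_i w_i \log p_i$, where the weights $w$ play the role of per-transition estimation complexity and are my own placeholder:

```python
import numpy as np

def frank_wolfe_simplex(grad, dim, iters=5000):
    """Frank-Wolfe ascent over the probability simplex: the linear subproblem
    argmax_s <grad, s> is solved by a vertex (the largest-gradient coordinate)."""
    p = np.full(dim, 1.0 / dim)
    for t in range(1, iters + 1):      # start at t=1 so the first step keeps p interior
        g = grad(p)
        vertex = np.zeros(dim)
        vertex[np.argmax(g)] = 1.0
        eta = 2.0 / (t + 2.0)          # standard diminishing step size
        p = (1.0 - eta) * p + eta * vertex
    return p

w = np.array([0.5, 0.3, 0.2])          # stand-in per-transition complexity weights
p = frank_wolfe_simplex(lambda p: w / p, dim=3)   # gradient of U(p) = sum_i w_i log p_i
```

Maximizing this objective allocates visitation in proportion to w (the optimum is p = w), mirroring how the diminishing-returns structure of $U_κ$ prioritizes underexplored, high-complexity transitions while each iteration only solves a cheap linear subproblem.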

[426] CITED: A Decision Boundary-Aware Signature for GNNs Towards Model Extraction Defense

Bolin Shen, Md Shamim Seraj, Zhan Cheng, Shayok Chakraborty, Yushun Dong

Main category: cs.LG

TL;DR: CITED is a novel ownership verification framework for GNNs that works on both embedding and label levels, defending against model extraction attacks without harming downstream performance or requiring auxiliary models.

DetailsMotivation: With the rise of MLaaS, GNN models are vulnerable to model extraction attacks where adversaries can steal models via query-response pairs. Existing defenses are limited to specific output levels and have technical drawbacks.

Method: CITED is a signature-based ownership verification framework that operates at both the embedding and label levels, embedding verifiable signatures without auxiliary models and without degrading downstream performance.

Result: Extensive experiments show CITED effectively verifies ownership and outperforms all existing watermarking and fingerprinting approaches while maintaining model performance and efficiency.

Conclusion: CITED provides a robust, efficient solution for GNN ownership verification against model extraction attacks, addressing limitations of previous methods and working across multiple output levels.

Abstract: Graph neural networks (GNNs) have demonstrated superior performance in various applications, such as recommendation systems and financial risk management. However, deploying large-scale GNN models locally is particularly challenging for users, as it requires significant computational resources and extensive property data. Consequently, Machine Learning as a Service (MLaaS) has become increasingly popular, offering a convenient way to deploy and access various models, including GNNs. However, an emerging threat known as Model Extraction Attacks (MEAs) presents significant risks, as adversaries can readily obtain surrogate GNN models exhibiting similar functionality. Specifically, attackers repeatedly query the target model using subgraph inputs to collect corresponding responses. These input-output pairs are subsequently utilized to train their own surrogate models at minimal cost. Many techniques have been proposed to defend against MEAs, but most are limited to specific output levels (e.g., embedding or label) and suffer from inherent technical drawbacks. To address these limitations, we propose a novel ownership verification framework CITED which is a first-of-its-kind method to achieve ownership verification on both embedding and label levels. Moreover, CITED is a novel signature-based method that neither harms downstream performance nor introduces auxiliary models that reduce efficiency, while still outperforming all watermarking and fingerprinting approaches. Extensive experiments demonstrate the effectiveness and robustness of our CITED framework. Code is available at: https://github.com/LabRAI/CITED.

[427] CREDIT: Certified Ownership Verification of Deep Neural Networks Against Model Extraction Attacks

Bolin Shen, Zhan Cheng, Neil Zhenqiang Gong, Fan Yao, Yushun Dong

Main category: cs.LG

TL;DR: CREDIT is a certified ownership verification method against model extraction attacks in MLaaS, using mutual information to quantify model similarity with theoretical guarantees.

Motivation: MLaaS platforms are vulnerable to model extraction attacks where adversaries can steal models via API queries. Existing defenses lack rigorous theoretical guarantees for ownership verification, creating a need for certified verification methods.

Method: Uses mutual information to quantify similarity between DNN models, proposes a practical verification threshold, and provides rigorous theoretical guarantees for ownership verification based on this threshold.

Result: Achieves state-of-the-art performance on several mainstream datasets across different domains and tasks, with publicly available implementation.

Conclusion: CREDIT provides a certified solution for model ownership verification against extraction attacks with strong theoretical guarantees, addressing a critical gap in MLaaS security.

Abstract: Machine Learning as a Service (MLaaS) has emerged as a widely adopted paradigm for providing access to deep neural network (DNN) models, enabling users to conveniently leverage these models through standardized APIs. However, such services are highly vulnerable to Model Extraction Attacks (MEAs), where an adversary repeatedly queries a target model to collect input-output pairs and uses them to train a surrogate model that closely replicates its functionality. While numerous defense strategies have been proposed, verifying the ownership of a suspicious model with strict theoretical guarantees remains a challenging task. To address this gap, we introduce CREDIT, a certified ownership verification against MEAs. Specifically, we employ mutual information to quantify the similarity between DNN models, propose a practical verification threshold, and provide rigorous theoretical guarantees for ownership verification based on this threshold. We extensively evaluate our approach on several mainstream datasets across different domains and tasks, achieving state-of-the-art performance. Our implementation is publicly available at: https://github.com/LabRAI/CREDIT.
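The mutual-information similarity test at the heart of CREDIT can be sketched in miniature. The paper's exact estimator and certified threshold are not reproduced here; the version below is a hedged stand-in that computes empirical MI between the discrete predictions of a victim, a suspect, and an independently trained reference model on shared probe inputs (all variable names and the `margin` parameter are illustrative):

```python
import math
from collections import Counter

def mutual_information(labels_a, labels_b):
    """Empirical mutual information (in nats) between two discrete label sequences."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), c in pab.items():
        # c/n is the joint probability; c*n/(pa[a]*pb[b]) is the PMI ratio.
        mi += (c / n) * math.log(c * n / (pa[a] * pb[b]))
    return mi

def verify_ownership(victim_preds, suspect_preds, independent_preds, margin=0.0):
    """Flag the suspect as a likely extraction if its MI with the victim
    exceeds that of an independently trained reference model by a margin."""
    mi_suspect = mutual_information(victim_preds, suspect_preds)
    mi_reference = mutual_information(victim_preds, independent_preds)
    return mi_suspect > mi_reference + margin

victim      = [0, 1, 1, 0, 2, 2, 1, 0]   # target model's labels on probe inputs
surrogate   = [0, 1, 1, 0, 2, 2, 1, 1]   # suspect: mostly copies the victim
independent = [2, 0, 1, 1, 0, 2, 0, 1]   # unrelated model trained from scratch
print(verify_ownership(victim, surrogate, independent))  # prints True
```

The intuition is that a surrogate trained on the victim's outputs retains high mutual information with the victim even after fine-tuning, while an independently trained model does not.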

[428] GauS: Differentiable Scheduling Optimization via Gaussian Reparameterization

Yaohui Cai, Vesal Bakhtazad, Cunxi Yu, Zhiru Zhang

Main category: cs.LG

TL;DR: GauS: A differentiable framework for operator scheduling using Gaussian distributions to capture ordinal time nature and reduce optimization space

Motivation: Traditional operator scheduling methods (exact solvers and heuristics) have limitations; recent differentiable approaches use categorical distributions that fail to capture the ordinal nature of time and have a parameter space that scales poorly

Method: Models operator scheduling as stochastic relaxation using Gaussian distributions, representing schedules as continuous Gaussian variables to capture ordinal time and reduce optimization space

Result: Achieves Pareto-optimal results on range of benchmarks, provides first differentiable formulation for complex pipelined scheduling problem

Conclusion: GauS offers efficient differentiable framework for operator scheduling that captures ordinal time nature and scales well with modern parallel computing devices

Abstract: Efficient operator scheduling is a fundamental challenge in software compilation and hardware synthesis. While recent differentiable approaches have sought to replace traditional ones like exact solvers or heuristics with gradient-based search, they typically rely on categorical distributions that fail to capture the ordinal nature of time and suffer from a parameter space that scales poorly. In this paper, we propose a novel differentiable framework, GauS, that models operator scheduling as a stochastic relaxation using Gaussian distributions, which fully utilize modern parallel computing devices like GPUs. By representing schedules as continuous Gaussian variables, we successfully capture the ordinal nature of time and reduce the optimization space by orders of magnitude. Our method is highly flexible to represent various objectives and constraints, which provides the first differentiable formulation for the complex pipelined scheduling problem. We evaluate our method on a range of benchmarks, demonstrating that GauS achieves Pareto-optimal results.
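GauS's core move, replacing categorical start-time distributions with per-operation Gaussians optimized by pathwise (reparameterized) gradients, can be illustrated on a toy three-operation chain. Everything below (the hinge penalties, penalty weights, fixed spread, and learning rate) is an illustrative reconstruction, not the paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dependence chain: op0 -> op1 -> op2, each with unit latency.
edges = [(0, 1), (1, 2)]
latency = np.ones(3)
mu = rng.normal(0.0, 0.1, size=3)   # Gaussian means over start times (learned)
sigma = 0.1                         # fixed spread, kept constant for simplicity

def loss_and_grad(mu, n_samples=256):
    eps = rng.normal(size=(n_samples, 3))
    t = mu + sigma * eps                     # pathwise (reparameterized) samples
    finish = t + latency
    loss = finish.max(axis=1).copy()         # makespan objective
    grad = np.zeros_like(t)
    grad[np.arange(n_samples), finish.argmax(axis=1)] = 1.0
    for u, v in edges:                       # hinge penalty: u must finish before v starts
        gap = finish[:, u] - t[:, v]
        viol = gap > 0
        loss += 10.0 * np.maximum(0.0, gap)
        grad[viol, u] += 10.0
        grad[viol, v] -= 10.0
    loss += 10.0 * np.maximum(0.0, -t).sum(axis=1)   # ops cannot start before t=0
    grad[t < 0] -= 10.0
    return loss.mean(), grad.mean(axis=0)

for _ in range(500):
    _, g = loss_and_grad(mu)
    mu -= 0.02 * g

schedule = np.argsort(mu)   # round the continuous means to a discrete order
```

Because time is represented as a continuous Gaussian mean rather than a categorical over slots, "earlier" and "later" are directly expressible as gradient directions, and the learned means here settle into the valid order op0 → op1 → op2.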

[429] Imputation of Unknown Missingness in Sparse Electronic Health Records

Jun Han, Josue Nassar, Sanjit Singh Batra, Aldo Cordova-Palomera, Vijay Nori, Robert E. Tillman

Main category: cs.LG

TL;DR: Transformer-based denoising network for recovering unknown missing values in binary EHR data, improving downstream medical prediction tasks.

Motivation: EHR data is sparse with missing values, but existing imputation methods focus on known unknowns and don't address unknown unknowns where it's unclear what's missing (e.g., undiagnosed vs. unshared diagnoses).

Method: Developed a transformer-based denoising neural network with adaptive thresholding to recover unknown missing values in binary EHR data.

Result: Improved accuracy in denoising medical codes compared to existing imputation approaches, leading to better performance on downstream tasks like hospital readmission prediction with statistically significant improvements over baselines.

Conclusion: The proposed method effectively addresses unknown unknowns in EHR data through denoising, enhancing data quality and downstream medical prediction performance.

Abstract: Machine learning holds great promise for advancing the field of medicine, with electronic health records (EHRs) serving as a primary data source. However, EHRs are often sparse and contain missing data due to various challenges and limitations in data collection and sharing between healthcare providers. Existing techniques for imputing missing values predominantly focus on known unknowns, such as missing or unavailable values of lab test results; most do not explicitly address situations where it is difficult to distinguish what is missing. For instance, a missing diagnosis code in an EHR could signify either that the patient has not been diagnosed with the condition or that a diagnosis was made, but not shared by a provider. Such situations fall into the paradigm of unknown unknowns. To address this challenge, we develop a general purpose algorithm for denoising data to recover unknown missing values in binary EHRs. We design a transformer-based denoising neural network where the output is thresholded adaptively to recover values in cases where we predict data are missing. Our results demonstrate improved accuracy in denoising medical codes within a real EHR dataset compared to existing imputation approaches and leads to increased performance on downstream tasks using the denoised data. In particular, when applying our method to a real world application, predicting hospital readmission from EHRs, our method achieves statistically significant improvement over all existing baselines.
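The adaptive-thresholding step can be sketched independently of the transformer: given denoiser probabilities for each binary code, recover a code only where the model's confidence stands out for that patient and the observed value was 0. The per-patient quantile rule here is an assumption standing in for the paper's adaptive threshold:

```python
import numpy as np

def recover_unknown_missing(x, probs, quantile=0.9):
    """Flip 0 -> 1 only where the denoiser is unusually confident.

    x:     (n_patients, n_codes) observed binary EHR matrix
    probs: (n_patients, n_codes) model's probability each code is truly present
    The threshold adapts per patient: a code is recovered only if its score
    clears that patient's own high quantile, so patients with globally high
    or low scores are handled on their own terms."""
    thresh = np.quantile(probs, quantile, axis=1, keepdims=True)
    recovered = (probs >= thresh) & (x == 0)
    return np.where(recovered, 1, x)
```

Observed codes (1s) are never overwritten; only confident zeros are treated as unknown unknowns and recovered.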

[430] Protein Language Models Diverge from Natural Language: Comparative Analysis and Improved Inference

Anna Hart, Chi Han, Jeonghwan Kim, Huimin Zhao, Heng Ji

Main category: cs.LG

TL;DR: Protein Language Models (PLMs) adapt transformers to biological sequences, but protein language differs from natural language with only 20 amino acids. The paper compares information distribution across layers between domains and adapts early-exit techniques to improve both accuracy and efficiency for protein non-structural property prediction.

Motivation: Protein language has key differences from natural language (20 amino acids vs. large vocabularies, rich functional space), motivating research into how transformer architectures operate differently in the protein domain and how to better leverage PLMs for protein-related tasks.

Method: 1) Direct comparison of information distribution across attention head layers between protein and natural language domains; 2) Adaptation of early-exit technique (originally for NLP efficiency) to automatically select protein representations from intermediate PLM layers for specific tasks/proteins.

Result: Achieved performance gains of 0.4 to 7.01 percentage points while improving efficiency by over 10% across models and non-structural prediction tasks. Early-exit technique provided both increased accuracy and substantial efficiency gains.

Conclusion: Opens research area comparing language model behavior differences between protein and natural language domains, advances language modeling in biological domains by showing early-exit techniques can improve both accuracy and efficiency for protein property prediction.

Abstract: Modern Protein Language Models (PLMs) apply transformer-based model architectures from natural language processing to biological sequences, predicting a variety of protein functions and properties. However, protein language has key differences from natural language, such as a rich functional space despite a vocabulary of only 20 amino acids. These differences motivate research into how transformer-based architectures operate differently in the protein domain and how we can better leverage PLMs to solve protein-related tasks. In this work, we begin by directly comparing how the distribution of information stored across layers of attention heads differs between the protein and natural language domain. Furthermore, we adapt a simple early-exit technique-originally used in the natural language domain to improve efficiency at the cost of performance-to achieve both increased accuracy and substantial efficiency gains in protein non-structural property prediction by allowing the model to automatically select protein representations from the intermediate layers of the PLMs for the specific task and protein at hand. We achieve performance gains ranging from 0.4 to 7.01 percentage points while simultaneously improving efficiency by over 10 percent across models and non-structural prediction tasks. Our work opens up an area of research directly comparing how language models change behavior when moved into the protein domain and advances language modeling in biological domains.
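A minimal version of the early-exit idea: attach a lightweight probe to each intermediate layer and stop at the first layer whose prediction is confident enough, saving the compute of the remaining layers. The softmax-confidence rule and threshold below are illustrative assumptions; the paper selects intermediate representations automatically per task and per protein:

```python
import numpy as np

def early_exit_predict(layer_logits, confidence=0.9):
    """layer_logits: one probe logit vector per PLM layer, shallow to deep.
    Exit at the first layer whose softmax confidence clears the threshold;
    otherwise fall back to the deepest layer. Returns (predicted_class, depth)."""
    for depth, logits in enumerate(layer_logits):
        p = np.exp(logits - logits.max())    # numerically stable softmax
        p /= p.sum()
        if p.max() >= confidence:
            return int(p.argmax()), depth
    return int(p.argmax()), len(layer_logits) - 1
```

An input whose property is already decodable at layer 1 exits there, which is how the technique yields efficiency gains and, when intermediate layers encode the property better than the final one, accuracy gains as well.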

[431] Oracle-Robust Online Alignment for Large Language Models

Zimeng Li, Mudit Gaur, Vaneet Aggarwal

Main category: cs.LG

TL;DR: Online LLM alignment under misspecified preference feedback using oracle-robust optimization with sensitivity penalty

Motivation: Address the problem of online alignment of large language models when preference feedback is misspecified (deviates from ideal ground-truth), which is a bi-level reinforcement learning problem

Method: Introduce pointwise oracle uncertainty set, formulate oracle-robust online alignment as worst-case optimization, show closed-form decomposition for log-linear policies (original loss + sensitivity penalty), develop projected stochastic composite updates

Result: Achieve $\widetilde{O}(\varepsilon^{-2})$ oracle complexity for reaching approximate stationarity with the robust objective

Conclusion: Proposed framework provides robust online alignment under misspecified preference feedback with theoretical guarantees

Abstract: We study online alignment of large language models under misspecified preference feedback, where the observed preference oracle deviates from an ideal but unknown ground-truth oracle. The online LLM alignment problem is a bi-level reinforcement learning problem due to the coupling between data collection and policy updates. Recently, the problem has been reduced to a tractable single-level objective in the SAIL (Self-Improving Efficient Online Alignment) framework. In this paper, we introduce a pointwise oracle uncertainty set in this problem and formulate an oracle-robust online alignment objective as a worst-case optimization problem. For log-linear policies, we show that this robust objective admits an exact closed-form decomposition into the original loss function plus an explicit sensitivity penalty. We develop projected stochastic composite updates for the resulting weakly convex objective and prove $\widetilde{O}(\varepsilon^{-2})$ oracle complexity for reaching approximate stationarity.
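The paper's exact closed form is not reproduced here; as a hedged sketch of why a pointwise uncertainty set produces a "loss plus sensitivity penalty" shape, linearize the worst case over an $\ell_\infty$ ball of radius $\varepsilon$ around the observed oracle $o$ (notation assumed, not taken from the paper):

```latex
\max_{\|\delta\|_\infty \le \varepsilon}
  \mathcal{L}(\theta;\, o + \delta)
\;\approx\;
  \mathcal{L}(\theta;\, o)
  \;+\;
  \varepsilon \,\bigl\| \nabla_{o}\, \mathcal{L}(\theta;\, o) \bigr\|_{1}
```

The first term is the original (SAIL-style) alignment loss and the second penalizes sensitivity to oracle perturbations; the paper's contribution is showing that for log-linear policies the decomposition is exact rather than merely first-order.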

[432] Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training

Zhengyao Gu, Jonathan Light, Raul Astudillo, Ziyu Ye, Langzhou He, Henry Peng Zou, Wei Cheng, Santiago Paternain, Philip S. Yu, Yisong Yue

Main category: cs.LG

TL;DR: ACTOR-CURATOR: An automated curriculum learning framework for RL post-training of LLMs that uses a neural curator to dynamically select training problems by optimizing for expected policy performance improvement.

Motivation: Post-training large foundation models with RL relies on massive datasets, making curriculum learning critical but challenging. Current approaches lack scalability and automation for effective problem selection during RL fine-tuning.

Method: Formulates problem selection as a non-stationary stochastic bandit problem, learns a neural curator that dynamically selects training problems from large banks, uses online stochastic mirror descent with regret guarantees under partial feedback.

Result: Outperforms uniform sampling and strong curriculum baselines across reasoning benchmarks, achieves 28.6% gain on AIME2024, 30.5% on ARC-1D, and up to 80% speedup in training efficiency.

Conclusion: ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training that improves training stability and efficiency through automated curriculum learning.

Abstract: Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
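The curator's non-stationary bandit can be sketched with an EXP3-style exponentiated-gradient (online mirror descent) update under partial feedback: only the selected problem's reward is observed, so the gradient estimate is importance-weighted by its selection probability. The problem bank, reward simulator, and step sizes below are hypothetical stand-ins for the paper's policy-improvement reward:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5
weights = np.ones(n)
eta, gamma = 0.05, 0.1      # mirror-descent step size, exploration floor

# Hypothetical stand-in for the unknown expected policy improvement of
# each problem type in the bank (the curator must discover this online).
true_gain = np.array([0.10, 0.80, 0.30, 0.05, 0.20])

for _ in range(3000):
    p = (1 - gamma) * weights / weights.sum() + gamma / n   # curator's policy
    i = rng.choice(n, p=p)                                  # pick one problem
    reward = np.clip(true_gain[i] + 0.05 * rng.normal(), 0.0, 1.0)
    weights[i] *= np.exp(eta * reward / p[i])  # importance-weighted exp. update

p_final = (1 - gamma) * weights / weights.sum() + gamma / n
```

The exploration floor `gamma` keeps every problem's selection probability bounded away from zero, which is what makes the importance-weighted estimate well behaved under partial feedback; the curator's mass concentrates on the problem type with the highest improvement signal.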

[433] Nonparametric Teaching of Attention Learners

Chen Zhang, Jianghui Wang, Bingyang Cheng, Zhongtao Chen, Wendong XU, Cong Wang, Marco Canini, Francesco Orabona, Yik Chung WU, Ngai Wong

Main category: cs.LG

TL;DR: AtteNT is a novel teaching paradigm that accelerates attention-based neural network training by selecting optimal example subsets through nonparametric teaching theory, reducing training time by 13-21% without compromising accuracy.

Motivation: Attention-based models like transformers excel at learning sequence relationships but have costly training processes. The authors aim to accelerate this training through a teaching perspective that selects optimal examples.

Method: AtteNT reinterprets attention learner training through nonparametric teaching theory, selecting subsets of sequence-property pairs to accelerate convergence. The approach analyzes attention’s role in gradient descent and recasts parameter updates as functional gradient descent in nonparametric teaching.

Result: Achieved 13.01% training time reduction for LLMs and 20.58% for ViTs across fine-tuning and training-from-scratch regimes. Performance was preserved or enhanced across diverse downstream tasks without accuracy compromise.

Conclusion: Teaching attention learners is consistent with teaching importance-adaptive nonparametric learners, enabling significant training efficiency gains while maintaining or improving model performance.

Abstract: Attention learners, neural networks built on the attention mechanism, e.g., transformers, excel at learning the implicit relationships that relate sequences to their corresponding properties, e.g., mapping a given sequence of tokens to the probability of the next token. However, the learning process tends to be costly. To address this, we present a novel paradigm named Attention Neural Teaching (AtteNT) that reinterprets the learning process through a nonparametric teaching perspective. Specifically, the latter provides a theoretical framework for teaching mappings that are implicitly defined (i.e., nonparametric) via example selection. Such an implicit mapping is embodied through a dense set of sequence-property pairs, with the AtteNT teacher selecting a subset to accelerate convergence in attention learner training. By analytically investigating the role of attention on parameter-based gradient descent during training, and recasting the evolution of attention learners, shaped by parameter updates, through functional gradient descent in nonparametric teaching, we show for the first time that teaching attention learners is consistent with teaching importance-adaptive nonparametric learners. These new findings readily commit AtteNT to enhancing learning efficiency of attention learners. Specifically, we observe training time reductions of 13.01% for LLMs and 20.58% for ViTs, spanning both fine-tuning and training-from-scratch regimes. Crucially, these gains are achieved without compromising accuracy; in fact, performance is consistently preserved and often enhanced across a diverse set of downstream tasks.
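The mechanics of teaching-by-example-selection can be sketched generically: score each candidate sequence-property pair by how much it is expected to move the learner and feed only the top-scoring subset. The per-example scores below are hypothetical; the paper derives its criterion from functional gradients in nonparametric teaching, which the magnitude score here merely stands in for:

```python
import numpy as np

def teaching_subset(scores, k):
    """Indices of the k candidate examples with the largest scores
    (higher score = example expected to move the learner more)."""
    return np.argsort(scores)[::-1][:k].tolist()

# Hypothetical per-example informativeness scores:
scores = np.array([0.1, 2.3, 0.7, 1.5, 0.2])
batch = teaching_subset(scores, k=2)   # the teacher shows only these examples
```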

[434] GATES: Self-Distillation under Privileged Context with Consensus Gating

Alex Stein, Furong Huang, Tom Goldstein

Main category: cs.LG

TL;DR: Self-distillation method using tutor consensus for supervision in document-grounded QA without ground truth labels, improving student performance in document-free settings.

Motivation: Addresses the challenge of unreliable supervision in document-grounded question answering where there are no ground truth labels, verifiable rewards, or external graders. Focuses on asymmetric context settings where models must answer from questions alone at test time.

Method: Uses self-distillation with a single model serving as both tutor (with document access) and student (document-free). Derives supervision from tutor consensus by sampling multiple document-grounded reasoning traces and using agreement to gate learning. Distills knowledge through full tutor reasoning trajectories rather than just final answers.

Result: Substantially improves transfer to document-free student: held-out in-domain accuracy improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.

Conclusion: Consensus-gated trajectory distillation effectively enables learning in unreliable supervision settings, providing dense and stable learning signals that significantly improve performance in document-free question answering tasks.

Abstract: We study self-distillation in settings where supervision is unreliable: there are no ground truth labels, verifiable rewards, or external graders to evaluate answers. We focus on document-grounded question answering with asymmetric context, where a single model serves as both tutor (with access to a relevant source document during training) and student (answering from the question alone at test time). Rather than assuming tutor correctness, we derive supervision online from tutor consensus by sampling multiple document-grounded reasoning traces and using agreement to gate learning. Conditioned on this reliability signal, we distill knowledge through full tutor reasoning trajectories (not just final answers), providing a dense and stable learning signal. Empirically, this consensus-gated trajectory distillation substantially improves transfer to the document-free student. Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.
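The consensus gate itself is simple to sketch: sample several document-grounded tutor traces for a question, keep the question only if the final answers agree often enough, and distill the full trajectories that match the consensus. The trace format (dicts with an "answer" field) and the 0.75 threshold are assumptions for illustration:

```python
from collections import Counter

def consensus_gate(tutor_answers, threshold=0.75):
    """Return (consensus_answer, accept): accept is True only when enough
    independently sampled tutor traces agree on the final answer."""
    top, count = Counter(tutor_answers).most_common(1)[0]
    return top, count / len(tutor_answers) >= threshold

def select_training_traces(traces, threshold=0.75):
    """Keep only the full reasoning trajectories whose final answer matches
    a sufficiently strong tutor consensus; these become distillation targets.
    Questions without consensus are skipped (unreliable supervision)."""
    consensus, accept = consensus_gate([t["answer"] for t in traces], threshold)
    if not accept:
        return []
    return [t for t in traces if t["answer"] == consensus]
```

Distilling the surviving trajectories (not just the consensus answer) is what gives the student the dense learning signal described above.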

[435] A Long-Short Flow-Map Perspective for Drifting Models

Zhiqi Li, Bo Zhu

Main category: cs.LG

TL;DR: The paper presents a reinterpretation of the Drifting Model through semigroup-consistent long-short flow-map factorization, showing global transport can be decomposed into long-horizon and short-time terminal flow maps, with a new likelihood learning formulation aligning this decomposition with density evolution.

Motivation: To provide a new theoretical perspective on the Drifting Model by showing how global transport processes can be decomposed into long-horizon and short-time terminal flow maps, enabling better understanding of flow-map consistency and density evolution in generative models.

Method: Proposes a semigroup-consistent long-short flow-map factorization approach that decomposes global transport into long-horizon flow map followed by short-time terminal flow map with closed-form optimal velocity representation. Develops a new likelihood learning formulation aligning this decomposition with density evolution under transport.

Result: Theoretical analysis shows that taking terminal interval length to zero recovers exactly the drifting field with conservative impulse term for flow-map consistency. Empirical evaluations on benchmark tests validate the framework, and theoretical interpretation of feature-space optimization is provided.

Conclusion: The paper offers a novel reinterpretation of the Drifting Model through flow-map factorization, providing theoretical insights into transport processes and density evolution in generative models, while highlighting open problems for future research.

Abstract: This paper provides a reinterpretation of the Drifting Model [deng2026generative] through a semigroup-consistent long-short flow-map factorization. We show that a global transport process can be decomposed into a long-horizon flow map followed by a short-time terminal flow map admitting a closed-form optimal velocity representation, and that taking the terminal interval length to zero recovers exactly the drifting field together with a conservative impulse term required for flow-map consistency. Based on this perspective, we propose a new likelihood learning formulation that aligns the long-short flow-map decomposition with density evolution under transport. We validate the framework through both theoretical analysis and empirical evaluations on benchmark tests, and further provide a theoretical interpretation of the feature-space optimization while highlighting several open problems for future study.
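The factorization being described is the flow-map semigroup (composition) property; schematically, with $\Phi_{s \to t}$ denoting the transport map from time $s$ to time $t$ (notation assumed, not the paper's):

```latex
\Phi_{0 \to T} \;=\; \Phi_{t \to T} \circ \Phi_{0 \to t},
\qquad
X_T \;=\; \Phi_{t \to T}\!\bigl(\Phi_{0 \to t}(X_0)\bigr)
```

Here $\Phi_{0 \to t}$ is the long-horizon map and $\Phi_{t \to T}$ the short-time terminal map; the paper's result is that letting $T - t \to 0$ recovers the drifting field together with the conservative impulse term required for consistency.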

[436] Elimination-compensation pruning for fully-connected neural networks

Enrico Ballini, Luca Muscarnera, Alessio Fumagalli, Anna Scotti, Francesco Regazzoni

Main category: cs.LG

TL;DR: A novel neural network pruning method that considers compensating pruned weights with optimal bias perturbations, improving pruning efficiency without increasing sparsity.

Motivation: Traditional pruning assumes expendable weights have small impact on network error, but this work argues weights could be compensated by adjusting adjacent biases, which don't affect sparsity. This could lead to more efficient pruning by allowing weight removal with bias compensation.

Method: Proposes a pruning method where weight importance is computed considering output behavior after optimal perturbation of adjacent bias. Uses automatic differentiation to efficiently compute optimal bias perturbations that can be applied independently after weight removal.

Result: Numerical experiments benchmark the technique against popular pruning strategies, demonstrating intrinsic efficiency in diverse machine learning scenarios. The method shows improved pruning effectiveness.

Conclusion: The approach provides a novel perspective on pruning by considering bias compensation, with theoretical implications for neural network compression and efficiency.

Abstract: The unmatched ability of Deep Neural Networks in capturing complex patterns in large and noisy datasets is often associated with their large hypothesis space, and consequently to the vast amount of parameters that characterize model architectures. Pruning techniques affirmed themselves as valid tools to extract sparse representations of neural networks parameters, carefully balancing between compression and preservation of information. However, a fundamental assumption behind pruning is that expendable weights should have small impact on the error of the network, while highly important weights should tend to have a larger influence on the inference. We argue that this idea could be generalized; what if a weight is not simply removed but also compensated with a perturbation of the adjacent bias, which does not contribute to the network sparsity? Our work introduces a novel pruning method in which the importance measure of each weight is computed considering the output behavior after an optimal perturbation of its adjacent bias, efficiently computable by automatic differentiation. These perturbations can be then applied directly after the removal of each weight, independently of each other. After deriving analytical expressions for the aforementioned quantities, numerical experiments are conducted to benchmark this technique against some of the most popular pruning strategies, demonstrating an intrinsic efficiency of the proposed approach in very diverse machine learning scenarios. Finally, our findings are discussed and the theoretical implications of our results are presented.
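For a single linear layer $y = Wx + b$ the bias-compensation idea has a closed form worth sketching: zeroing $W_{ij}$ shifts output $i$ by $-W_{ij} x_j$, and adding $W_{ij}\,\mathbb{E}[x_j]$ to $b_i$ cancels that shift in expectation, leaving a residual cost proportional to $W_{ij}^2\,\mathrm{Var}(x_j)$. The quantile-based cutoff below is an illustrative pruning policy, not the paper's exact criterion (which computes the perturbations by automatic differentiation for general networks):

```python
import numpy as np

def compensated_importance(W, X):
    """Residual cost of pruning W[i, j] *after* optimal bias compensation.

    For y = W @ x + b, zeroing W[i, j] shifts output i by -W[i, j] * x_j.
    Adding W[i, j] * mean(x_j) to b_i cancels the shift on average, so the
    leftover cost is W[i, j]**2 * var(x_j) instead of a pure magnitude score."""
    return (W ** 2) * X.var(axis=0)

def prune_with_compensation(W, b, X, frac=0.5):
    imp = compensated_importance(W, X)
    mask = imp > np.quantile(imp, frac)     # keep only high-importance weights
    db = (W * ~mask) @ X.mean(axis=0)       # bias absorbs pruned weights' mean effect
    return W * mask, b + db, mask
```

On the mean input the compensated pruned layer matches the dense layer exactly; on other inputs the error is governed by the pruned weights times the input's deviation from its mean, which is precisely what the importance score measures.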

[437] CGSTA: Cross-Scale Graph Contrast with Stability-Aware Alignment for Multivariate Time-Series Anomaly Detection

Zhongpeng Qi, Jun Zhang, Wei Li, Zhuoxuan Liang

Main category: cs.LG

TL;DR: CGSTA: A multivariate time-series anomaly detection framework using multi-scale dynamic graphs with contrastive learning and stability-aware alignment to handle evolving dependencies and noise.

Motivation: Multivariate time-series anomaly detection faces challenges from evolving inter-variable dependencies and inevitable noise. Existing methods often use single-scale graphs or instance-level contrast, and learned dynamic graphs can overfit noise without stable anchors, causing false alarms or misses.

Method: Proposes CGSTA framework with two key innovations: 1) Dynamic Layered Graph Construction (DLGC) forms local, regional, and global views of variable relations for each sliding window, with Contrastive Discrimination across Scales (CDS) contrasting graph representations within each view and aligning same window across views; 2) Stability-Aware Alignment (SAA) maintains per-scale stable references learned from normal data to guide current window’s fast-changing graphs toward stability and suppress noise. Multi-scale and temporal features are fused with conditional density estimator for per-time-step anomaly scores.

Result: Across four benchmarks, CGSTA delivers optimal performance on PSM and WADI datasets, and is comparable to baseline methods on SWaT and SMAP datasets.

Conclusion: CGSTA effectively addresses challenges in multivariate time-series anomaly detection by leveraging multi-scale graph representations with contrastive learning and stability-aware alignment, achieving strong performance across multiple industrial monitoring benchmarks.

Abstract: Multivariate time-series anomaly detection is essential for reliable industrial control, telemetry, and service monitoring. However, the evolving inter-variable dependencies and inevitable noise render it challenging. Existing methods often use single-scale graphs or instance-level contrast. Moreover, learned dynamic graphs can overfit noise without a stable anchor, causing false alarms or misses. To address these challenges, we propose the CGSTA framework with two key innovations. First, Dynamic Layered Graph Construction (DLGC) forms local, regional, and global views of variable relations for each sliding window; rather than contrasting whole windows, Contrastive Discrimination across Scales (CDS) contrasts graph representations within each view and aligns the same window across views to make learning structure-aware. Second, Stability-Aware Alignment (SAA) maintains a per-scale stable reference learned from normal data and guides the current window’s fast-changing graphs toward it to suppress noise. We fuse the multi-scale and temporal features and use a conditional density estimator to produce per-time-step anomaly scores. Across four benchmarks, CGSTA delivers optimal performance on PSM and WADI, and is comparable to the baseline methods on SWaT and SMAP.
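The multi-view construction in DLGC can be sketched as building one soft adjacency matrix per scale from a single sliding window. Correlating over progressively longer spans is an illustrative reconstruction of the local/regional/global views, not the paper's exact graph learner:

```python
import numpy as np

def layered_graphs(window, scales=(5, 15, None)):
    """Build local / regional / global variable-relation graphs for one
    sliding window (shape: T timesteps x N variables) by correlating the
    variables over the last s timesteps at each scale; None = full window."""
    graphs = []
    for s in scales:
        seg = window if s is None else window[-s:]
        adj = np.abs(np.corrcoef(seg.T))     # N x N soft adjacency
        graphs.append(np.nan_to_num(adj))    # guard against constant variables
    return graphs
```

The cross-scale contrastive step then compares these three adjacencies for the same window (positive pairs) against those of other windows, and SAA pulls each scale toward a stable reference learned from normal data.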

[438] VINA: Variational Invertible Neural Architectures

Shubhanshu Shekhar, Mohammad Javad Khojasteh, Ananya Acharya, Tony Tohme, Kamal Youcef-Toumi

Main category: cs.LG

TL;DR: Theoretical framework for invertible neural networks and normalizing flows with performance guarantees for posterior inference and generative modeling, validated on ocean-acoustic inversion.

Motivation: Address the lack of theoretical guarantees on approximation quality for invertible neural networks (INNs) and normalizing flows (NFs) under realistic assumptions, particularly for posterior inference in INNs and generative modeling with NFs.

Method: Introduces a unified framework for INNs and NFs based on variational unsupervised loss functions, inspired by GANs and Precision-Recall divergence. Derives theoretical performance guarantees quantifying posterior accuracy for INNs and distributional accuracy for NFs under weaker, more realistic assumptions than prior work.

Result: Provides theoretical performance guarantees for both INNs and NFs, conducts extensive case studies to distill design principles and practical guidelines, and demonstrates effectiveness on a realistic ocean-acoustic inversion problem.

Conclusion: The paper establishes a theoretical foundation with practical guarantees for invertible architectures, offering both theoretical insights and practical guidance for using INNs and NFs in real-world applications.

Abstract: The distinctive architectural features of normalizing flows (NFs), notably bijectivity and tractable Jacobians, make them well-suited for generative modeling. Invertible neural networks (INNs) build on these principles to address supervised inverse problems, enabling direct modeling of both forward and inverse mappings. In this paper, we revisit these architectures from both theoretical and practical perspectives and address a key gap in the literature: the lack of theoretical guarantees on approximation quality under realistic assumptions, whether for posterior inference in INNs or for generative modeling with NFs. We introduce a unified framework for INNs and NFs based on variational unsupervised loss functions, inspired by analogous formulations in related areas such as generative adversarial networks (GANs) and the Precision-Recall divergence for training normalizing flows. Within this framework, we derive theoretical performance guarantees, quantifying posterior accuracy for INNs and distributional accuracy for NFs, under assumptions that are weaker and more practically realistic than those used in prior work. Building on these theoretical results, we conduct extensive case studies to distill general design principles and practical guidelines. We conclude by demonstrating the effectiveness of our approach on a realistic ocean-acoustic inversion problem.

[439] Wireless Federated Multi-Task LLM Fine-Tuning via Sparse-and-Orthogonal LoRA

Nuocheng Yang, Sihua Wang, Ouwen Huan, Mingzhe Chen, Tony Q. S. Quek, Changchuan Yin

Main category: cs.LG

TL;DR: A decentralized federated learning approach using sparse-and-orthogonal LoRA for multi-task LLM fine-tuning across mobile devices, addressing catastrophic forgetting, communication inefficiency, and knowledge interference through orthogonal updates, cluster-based topology, and implicit MoE.

DetailsMotivation: To enable collaborative fine-tuning of large language models across mobile devices with heterogeneous multi-task datasets in decentralized federated learning settings, while addressing three key challenges: catastrophic knowledge forgetting during fine-tuning due to conflicting update directions, inefficient communication during model aggregation, and multi-task knowledge interference during inference.

Method: Proposes a three-part solution: 1) Sparse-and-orthogonal LoRA that ensures orthogonality between model updates to eliminate direction conflicts during fine-tuning; 2) Cluster-based topology design for efficient model aggregation based on analysis of how device connection topology affects multi-task performance; 3) Implicit mixture of experts (MoE) mechanism to avoid coexistence of incompatible knowledge representations during inference.
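The orthogonality idea in part 1 can be sketched as follows. This is a hypothetical illustration, not the paper's exact sparse-and-orthogonal LoRA scheme: each device's flattened LoRA update is projected onto the orthogonal complement of the updates accepted so far (Gram-Schmidt), so that update directions no longer conflict.

```python
import numpy as np

def orthogonalize_updates(updates):
    """Project each flattened LoRA update onto the orthogonal
    complement of the previously accepted updates (Gram-Schmidt).
    `updates` is a list of 1-D numpy arrays of equal length."""
    basis = []
    out = []
    for u in updates:
        v = u.astype(float).copy()
        for b in basis:
            v -= (v @ b) * b  # remove the component along b
        norm = np.linalg.norm(v)
        if norm > 1e-12:
            basis.append(v / norm)
        out.append(v)
    return out

rng = np.random.default_rng(0)
ups = [rng.standard_normal(16) for _ in range(3)]
orth = orthogonalize_updates(ups)
# pairwise inner products of the projected updates are ~0
```

In the paper's setting the projection would be combined with sparsification before transmission; the sketch shows only the direction-conflict elimination.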

Result: Simulation results demonstrate the proposed approach reduces communication resource consumption by up to 73% and enhances average performance by 5% compared to traditional LoRA methods in decentralized federated learning settings.

Conclusion: The proposed approach effectively addresses the three primary issues in decentralized federated learning for LLM fine-tuning across heterogeneous mobile devices, providing significant improvements in communication efficiency and model performance through orthogonal updates, optimized topology, and implicit MoE mechanisms.

Abstract: Decentralized federated learning (DFL) based on low-rank adaptation (LoRA) enables mobile devices with multi-task datasets to collaboratively fine-tune a large language model (LLM) by exchanging locally updated parameters with a subset of neighboring devices via wireless connections for knowledge integration. However, directly aggregating parameters fine-tuned on heterogeneous datasets induces three primary issues across the DFL life-cycle: (i) \textit{catastrophic knowledge forgetting during the fine-tuning process}, arising from conflicting update directions caused by data heterogeneity; (ii) \textit{inefficient communication and convergence during the model aggregation process}, due to bandwidth-intensive redundant model transmissions; and (iii) \textit{multi-task knowledge interference during the inference process}, resulting from the coexistence of incompatible knowledge representations during inference. To address these issues in a fully decentralized scenario, we first propose a sparse-and-orthogonal LoRA that ensures orthogonality between model updates to eliminate direction conflicts during fine-tuning. Then, we analyze how device connection topology affects multi-task performance, prompting a cluster-based topology design during aggregation. Finally, we propose an implicit mixture of experts (MoE) mechanism to avoid the coexistence of incompatible knowledge during inference. Simulation results demonstrate that the proposed approach effectively reduces communication resource consumption by up to 73% and enhances average performance by 5% compared with the traditional LoRA method.

[440] A Generalized Apprenticeship Learning Framework for Capturing Evolving Student Pedagogical Strategies

Md Mirajul Islam, Xi Yang, Adittya Soukarjya Saha, Rajesh Debnath, Min Chi

Main category: cs.LG

TL;DR: THEMES uses apprenticeship learning with expert demonstrations to induce effective pedagogical policies for intelligent tutoring systems, outperforming six state-of-the-art baselines.

DetailsMotivation: Deep reinforcement learning in educational technologies faces challenges like sample inefficiency and difficulty designing reward functions. Apprenticeship learning offers an alternative by using expert demonstrations to infer reward functions and derive optimal policies.

Method: Leverages THEMES, a generalized apprenticeship learning framework, to capture complexities of expert student learning processes where multiple reward functions may dynamically evolve over time. Uses expert demonstrations to infer underlying reward functions.

Result: THEMES outperforms six state-of-the-art baselines, achieving high performance with AUC of 0.899 and Jaccard of 0.653, using only 18 trajectories from a previous semester to predict student pedagogical decisions in a later semester.

Conclusion: THEMES demonstrates strong potential as a powerful alternative for inducing effective pedagogical policies in intelligent tutoring systems, addressing key limitations of traditional DRL approaches in educational settings.

Abstract: Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) have advanced rapidly in recent years and have been successfully applied to e-learning environments like intelligent tutoring systems (ITSs). Despite great success, the broader application of DRL to educational technologies has been limited due to major challenges such as sample inefficiency and difficulty designing the reward function. In contrast, Apprenticeship Learning (AL) uses a few expert demonstrations to infer the expert’s underlying reward functions and derive decision-making policies that generalize and replicate optimal behavior. In this work, we leverage a generalized AL framework, THEMES, to induce effective pedagogical policies by capturing the complexities of the expert student learning process, where multiple reward functions may dynamically evolve over time. We evaluate the effectiveness of THEMES against six state-of-the-art baselines, demonstrating its superior performance: using only 18 trajectories from a previous semester, THEMES predicts student pedagogical decisions in a later semester with an AUC of 0.899 and a Jaccard of 0.653, highlighting its potential as a powerful alternative for inducing effective pedagogical policies.

[441] Sample-efficient evidence estimation of score based priors for model selection

Frederic Wang, Katherine L. Bouman

Main category: cs.LG

TL;DR: Proposes an estimator for model evidence of diffusion priors in inverse problems using posterior sampling trajectories

DetailsMotivation: Need to select appropriate priors for ill-posed inverse problems to avoid bias; diffusion models are state-of-the-art but model evidence estimation is intractable

Method: Estimates model evidence by integrating over time-marginals of posterior sampling methods, leveraging intermediate samples from reverse diffusion process
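As a simpler illustration of evidence-based model selection (a toy sketch, not the paper's time-marginal estimator for diffusion priors), the model evidence p(y | M) = ∫ p(y | x) p(x | M) dx can be estimated by plain Monte Carlo for a linear-Gaussian model, where it is also available in closed form for comparison:

```python
import numpy as np

def mc_evidence(y, prior_std, noise_std, n=200_000, seed=0):
    """Monte Carlo estimate of p(y) = E_{x ~ N(0, prior_std^2)}[N(y; x, noise_std^2)]."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, prior_std, size=n)
    lik = np.exp(-0.5 * ((y - x) / noise_std) ** 2) / (noise_std * np.sqrt(2 * np.pi))
    return lik.mean()

def exact_evidence(y, prior_std, noise_std):
    """Closed form: y ~ N(0, prior_std^2 + noise_std^2)."""
    var = prior_std**2 + noise_std**2
    return np.exp(-0.5 * y**2 / var) / np.sqrt(2 * np.pi * var)

y_obs = 1.3
est = mc_evidence(y_obs, prior_std=1.0, noise_std=0.5)
ref = exact_evidence(y_obs, prior_std=1.0, noise_std=0.5)
# the prior (model) with the higher evidence would be selected
```

The paper's contribution is precisely that such naive estimators are intractable for diffusion priors, which is why it reuses intermediate samples from the reverse diffusion process instead.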

Result: Accurate model evidence estimation with few samples (e.g., 20), works with diffusion posterior sampling methods, can select correct priors and diagnose misfit

Conclusion: Provides practical method for model selection in diffusion-based inverse problems, demonstrated on real-world applications like black hole imaging

Abstract: The choice of prior is central to solving ill-posed imaging inverse problems, making it essential to select one consistent with the measurements $y$ to avoid severe bias. In Bayesian inverse problems, this could be achieved by evaluating the model evidence $p(y \mid M)$ under different models $M$ that specify the prior and then selecting the one with the highest value. Diffusion models are the state-of-the-art approach to solving inverse problems with a data-driven prior; however, directly computing the model evidence with respect to a diffusion prior is intractable. Furthermore, most existing model evidence estimators require either many pointwise evaluations of the unnormalized prior density or an accurate clean prior score. We propose \method, an estimator of the model evidence of a diffusion prior by integrating over the time-marginals of posterior sampling methods. Our method leverages the large amount of intermediate samples naturally obtained during the reverse diffusion sampling process to obtain an accurate estimation of the model evidence using only a handful of posterior samples (e.g., 20). We also demonstrate how to implement our estimator in tandem with recent diffusion posterior sampling methods. Empirically, our estimator matches the model evidence when it can be computed analytically, and it is able to both select the correct diffusion model prior and diagnose prior misfit under different highly ill-conditioned, non-linear inverse problems, including a real-world black hole imaging problem.

[442] GENSR: Symbolic Regression Based in Equation Generative Space

Qian Li, Yuxiao Hu, Juncheng Liu, Yuntian Chen

Main category: cs.LG

TL;DR: GenSR is a generative latent space framework for symbolic regression that uses a dual-branch CVAE to create a continuous, smooth representation of equation space, enabling more efficient search through coarse localization and fine refinement.

DetailsMotivation: Traditional symbolic regression methods search in discrete equation spaces where structural changes don't align well with numerical behavior, making error feedback noisy and search inefficient. There's a need for a continuous, smooth representation that can provide better directional guidance for equation discovery.

Method: GenSR uses a dual-branch Conditional Variational Autoencoder (CVAE) to reparameterize symbolic equations into a generative latent space with symbolic continuity and local numerical smoothness. The framework follows a three-step paradigm: 1) map construction via CVAE pretraining, 2) coarse localization of input data to promising latent regions, and 3) fine search using modified CMA-ES that leverages smooth latent gradients.

Result: Extensive experiments show that GenSR jointly optimizes predictive accuracy, expression simplicity, and computational efficiency while remaining robust under noise conditions.

Conclusion: GenSR provides a novel Bayesian perspective on symbolic regression by reframing it as maximizing the conditional distribution of equations given numerical data, with theoretical guarantees through ELBO optimization. The generative latent space approach enables more efficient and effective symbolic equation discovery.

Abstract: Symbolic Regression (SR) tries to reveal the hidden equations behind observed data. However, most methods search within a discrete equation space, where the structural modifications of equations rarely align with their numerical behavior, leaving fitting error feedback too noisy to guide exploration. To address this challenge, we propose GenSR, a generative latent space-based SR framework following the "map construction -> coarse localization -> fine search" paradigm. Specifically, GenSR first pretrains a dual-branch Conditional Variational Autoencoder (CVAE) to reparameterize symbolic equations into a generative latent space with symbolic continuity and local numerical smoothness. This space can be regarded as a well-structured "map" of the equation space, providing directional signals for search. At inference, the CVAE coarsely localizes the input data to promising regions in the latent space. Then, a modified CMA-ES refines the candidate region, leveraging smooth latent gradients. From a Bayesian perspective, GenSR reframes the SR task as maximizing the conditional distribution $p(\mathrm{Equ.} \mid \mathrm{Num.})$, with CVAE training achieving this objective through the Evidence Lower Bound (ELBO). This new perspective provides a theoretical guarantee for the effectiveness of GenSR. Extensive experiments show that GenSR jointly optimizes predictive accuracy, expression simplicity, and computational efficiency, while remaining robust under noise.

[443] SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei

Main category: cs.LG

TL;DR: SELAUR is a reinforcement learning framework that incorporates LLM uncertainty into reward design for multi-step decision-making agents, using token-level uncertainty estimates and failure-aware reward reshaping to improve exploration and learning stability.

DetailsMotivation: Current LLM-based decision-making agents overlook intrinsic uncertainty as a learning signal. Uncertainty reflects model confidence, indicates where exploration is needed, and provides valuable cues even in failed trajectories, making it crucial for effective reward design.

Method: SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate for dense confidence-aligned supervision. It employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into both step- and trajectory-level rewards.
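The three per-token metrics are standard uncertainty measures over a next-token distribution. A minimal sketch, assuming a simple weighted average as the combination (SELAUR's exact weighting is not specified here):

```python
import numpy as np

def token_uncertainty(probs, weights=(1/3, 1/3, 1/3)):
    """Entropy-, least-confidence-, and margin-based uncertainty for one
    token distribution `probs`, combined by a weighted average
    (the equal weighting is illustrative, not SELAUR's exact scheme)."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    entropy = -np.sum(p * np.log(p + 1e-12)) / np.log(len(p))  # normalized to [0, 1]
    least_conf = 1.0 - p.max()
    top2 = np.sort(p)[-2:]
    margin = 1.0 - (top2[1] - top2[0])  # small top-2 margin -> high uncertainty
    w = np.asarray(weights)
    return float(w @ np.array([entropy, least_conf, margin]))

u_peaked = token_uncertainty([0.97, 0.01, 0.01, 0.01])
u_flat = token_uncertainty([0.25, 0.25, 0.25, 0.25])
# a near-uniform distribution scores as far more uncertain than a peaked one
```

In the full framework, such per-token scores would be aggregated into step- and trajectory-level reward reshaping terms.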

Result: Experiments on ALFWorld and WebShop benchmarks show consistent improvement in success rates over strong baselines. Ablation studies demonstrate how uncertainty signals enhance exploration and robustness in LLM agents.

Conclusion: Incorporating LLM uncertainty into reward design is effective for improving multi-step decision-making agents. The proposed uncertainty-aware reward framework enhances exploration efficiency and learning stability in reinforcement learning for LLMs.

Abstract: Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate, providing dense confidence-aligned supervision, and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently improves success rates over strong baselines. Ablation studies further demonstrate how uncertainty signals enhance exploration and robustness.

[444] Stability and Generalization of Push-Sum Based Decentralized Optimization over Directed Graphs

Yifei Liang, Yan Sun, Xiaochun Cao, Li Shen

Main category: cs.LG

TL;DR: Analysis of Stochastic Gradient Push algorithm’s generalization behavior in decentralized learning over directed networks, quantifying how network topology affects optimization and stability.

DetailsMotivation: While convergence properties of decentralized learning methods over directed networks are understood, their finite-iteration stability and generalization behavior remain unclear due to structural bias from column-stochastic mixing and asymmetric error propagation.

Method: Developed a unified uniform-stability framework for Stochastic Gradient Push algorithm that captures directed topology effects. Key technical ingredient is an imbalance-aware consistency bound for Push-Sum that controls consensus deviation through stationary distribution imbalance parameter δ and spectral gap (1-λ).
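The Push-Sum primitive underlying SGP can be sketched on a static directed graph. This is plain Push-Sum averaging, not SGP itself, and the mixing matrix below is an illustrative example; the stationary distribution of such a column-stochastic matrix is what the imbalance parameter δ measures.

```python
import numpy as np

def push_sum_average(values, A, iters=200):
    """Plain Push-Sum consensus on a directed graph.
    A is column-stochastic (columns sum to 1): node j splits its mass
    among its out-neighbors. The ratio x/w converges to the true average
    at every node even though A is not doubly stochastic."""
    x = np.asarray(values, dtype=float).copy()
    w = np.ones_like(x)
    for _ in range(iters):
        x = A @ x   # push value mass along directed edges
        w = A @ w   # push weight mass along the same edges
    return x / w

# 3-node strongly connected directed graph; columns sum to 1,
# but rows do not (asymmetric information exchange)
A = np.array([
    [0.5, 0.0, 0.3],
    [0.5, 0.7, 0.0],
    [0.0, 0.3, 0.7],
])
vals = np.array([1.0, 4.0, 7.0])
est = push_sum_average(vals, A)
# every node's estimate approaches mean(vals) = 4.0
```

SGP interleaves such mixing steps with local stochastic gradient updates; the stability analysis quantifies how δ and the spectral gap of A affect the result.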

Result: Established finite-iteration stability and optimization guarantees for both convex and non-convex objectives. For convex problems, SGP attains excess generalization error of order Õ(1/√mn + γ/δ(1-λ) + γ) with characterized optimal early stopping time. For Polyak-Łojasiewicz objectives, obtained convex-like optimization and generalization rates with dependence proportional to κ(1 + 1/δ(1-λ)).

Conclusion: The analysis clarifies when Push-Sum correction is necessary compared to standard decentralized SGD and quantifies how imbalance and mixing jointly shape the best attainable learning performance in directed communication networks.

Abstract: Push-Sum-based decentralized learning enables optimization over directed communication networks, where information exchange may be asymmetric. While convergence properties of such methods are well understood, their finite-iteration stability and generalization behavior remain unclear due to structural bias induced by column-stochastic mixing and asymmetric error propagation. In this work, we develop a unified uniform-stability framework for the Stochastic Gradient Push (SGP) algorithm that captures the effect of directed topology. A key technical ingredient is an imbalance-aware consistency bound for Push-Sum, which controls consensus deviation through two quantities: the stationary distribution imbalance parameter $δ$ and the spectral gap $(1-λ)$ governing mixing speed. This decomposition enables us to disentangle statistical effects from topology-induced bias. We establish finite-iteration stability and optimization guarantees for both convex objectives and non-convex objectives satisfying the Polyak–Łojasiewicz condition. For convex problems, SGP attains excess generalization error of order $\tilde{\mathcal{O}}\left(\frac{1}{\sqrt{mn}}+\frac{γ}{δ(1-λ)}+γ\right)$ under step-size schedules, and we characterize the corresponding optimal early stopping time that minimizes this bound. For PŁ objectives, we obtain convex-like optimization and generalization rates with dominant dependence proportional to $κ\left(1+\frac{1}{δ(1-λ)}\right)$, revealing a multiplicative coupling between problem conditioning and directed communication topology. Our analysis clarifies when Push-Sum correction is necessary compared with standard decentralized SGD and quantifies how imbalance and mixing jointly shape the best attainable learning performance.

[445] Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi

Main category: cs.LG

TL;DR: Reflective Test-Time Planning for embodied LLMs integrates reflection-in-action (pre-execution candidate scoring) and reflection-on-action (post-execution policy updates) to improve robot task performance through iterative learning from experience.

DetailsMotivation: Current embodied LLMs lack the ability to reflect on failures, leading to repeated mistakes rather than accumulating experience. The paper aims to enable robots to learn from their mistakes through reflective practice similar to human practitioners.

Method: Introduces Reflective Test-Time Planning with two modes: reflection-in-action (test-time scaling to generate and score multiple candidate actions before execution) and reflection-on-action (test-time training to update reflection model and action policy after execution). Includes retrospective reflection for long-horizon credit assignment.

Result: Significant gains over baseline models on Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark. Ablative studies validate complementary roles of both reflection modes. Real-robot trials show behavioral correction through reflection.

Conclusion: Reflective Test-Time Planning enables embodied LLMs to learn from experience through structured reflection, improving task performance and addressing the limitation of independent trial-and-error approaches in robotics.

Abstract: Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \textit{reflection-in-action}, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and \textit{reflection-on-action}, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.

[446] Benchmarking GNN Models on Molecular Regression Tasks with CKA-Based Representation Analysis

Rajan, Ishaan Gupta

Main category: cs.LG

TL;DR: Systematic benchmarking of GNN architectures for molecular property prediction shows fusion with molecular fingerprints improves performance, with GNN and fingerprints occupying independent latent spaces.

DetailsMotivation: While GNNs are powerful for learning molecular structural relationships, their efficacy on smaller datasets and their inductive biases across different architectures are less studied compared to traditional molecular fingerprint approaches.

Method: Benchmarked four GNN architectures across diverse molecular datasets (physical chemistry, biological, analytical), implemented hierarchical fusion (GNN+FP) framework, and analyzed representational similarity using centered kernel alignment (CKA) between embeddings.
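Linear CKA between two embedding matrices has a standard closed form; a minimal sketch (the paper's minibatch or kernel variant may differ):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between representations
    X (n x d1) and Y (n x d2) for the same n examples.
    Returns a value in [0, 1]; 1 means identical up to rotation/scale."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 16))
R = np.linalg.qr(rng.standard_normal((16, 16)))[0]  # random rotation
Z = rng.standard_normal((500, 16))
cka_same = linear_cka(X, X @ R)  # rotated copy: CKA = 1
cka_diff = linear_cka(X, Z)      # independent features: much lower
```

Scores like the paper's CKA ≤ 0.46 between GNN and fingerprint embeddings indicate the two representations carry largely complementary information.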

Result: Fusion framework consistently outperforms standalone GNN (RMSE improvement >7%) and baseline models. GNN and fingerprint embeddings occupy highly independent latent spaces (CKA ≤0.46). GCN, GraphSAGE and GIN show high convergence (CKA ≥0.88), while GAT learns moderately independent representations (CKA 0.55-0.80).

Conclusion: Hierarchical fusion of GNN with molecular fingerprints provides superior performance for molecular property prediction, with different architectures capturing complementary information about molecular structure.

Abstract: Molecules are commonly represented as SMILES strings, which can be readily converted to fixed-size molecular fingerprints. These fingerprints serve as feature vectors to train ML/DL models for molecular property prediction tasks in the fields of computational chemistry, drug discovery, biochemistry, and materials science. Recent research has demonstrated that SMILES can be used to construct molecular graphs where atoms are nodes ($V$) and bonds are edges ($E$). These graphs can subsequently be used to train geometric DL models like GNNs. A GNN learns the inherent structural relationships within a molecule rather than depending on fixed-size fingerprints. Although GNNs are powerful aggregators, their efficacy on smaller datasets and their inductive biases across different architectures are less studied. In our present study, we performed a systematic benchmarking of four different GNN architectures across a diverse domain of datasets (physical chemistry, biological, and analytical). Additionally, we have also implemented a hierarchical fusion (GNN+FP) framework for target prediction. We observed that the fusion framework consistently outperforms or matches the performance of standalone GNNs (RMSE improvement > 7%) and baseline models. Further, we investigated the representational similarity using centered kernel alignment (CKA) between GNN and fingerprint embeddings and found that they occupy highly independent latent spaces (CKA $\le 0.46$). The cross-architectural CKA scores suggest a high convergence between isotropic models like GCN, GraphSAGE and GIN (CKA $\ge 0.88$), with GAT learning moderately independent representations (CKA $0.55$-$0.80$).

[447] Upper-Linearizability of Online Non-Monotone DR-Submodular Maximization over Down-Closed Convex Sets

Yiyang Lu, Haresh Jadav, Mohammad Pedramfar, Ranveer Singh, Vaneet Aggarwal

Main category: cs.LG

TL;DR: Online optimization of non-monotone DR-submodular functions over down-closed convex sets with improved regret bounds across various feedback models

DetailsMotivation: Existing projection-free online methods for non-monotone DR-submodular optimization suffer from suboptimal regret and limited feedback guarantees, motivating the need for better algorithms

Method: Proves a structural result showing DR-submodular functions are 1/e-linearizable under exponential reparametrization, scaling parameter, and surrogate potential, enabling reduction to online linear optimization
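The reduction target, online linear optimization, is classically solved by online projected gradient descent with O(√T) regret. A minimal sketch of that building block over the Euclidean ball (illustrating the OLO subroutine the reduction relies on, not the paper's full algorithm):

```python
import numpy as np

def ogd_linear(grads, radius=1.0):
    """Online gradient descent on linear losses <g_t, x> over the
    Euclidean ball of the given radius, with step size ~ 1/sqrt(t).
    Returns the sequence of points played."""
    d = grads.shape[1]
    x = np.zeros(d)
    plays = []
    for t, g in enumerate(grads, start=1):
        plays.append(x.copy())
        x = x - (radius / np.sqrt(t)) * g
        n = np.linalg.norm(x)
        if n > radius:  # project back onto the ball
            x *= radius / n
    return np.array(plays)

rng = np.random.default_rng(0)
T, d = 2000, 5
grads = rng.standard_normal((T, d))
plays = ogd_linear(grads)
loss = np.sum(plays * grads)
G = grads.sum(axis=0)
best = -np.linalg.norm(G)  # best fixed point in the unit ball in hindsight
regret = loss - best
# regret grows like sqrt(T), so average regret per round vanishes
```

The structural result (1/e-linearizability) is what lets guarantees like this transfer to the non-monotone DR-submodular objective.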

Result: Achieves O(T^{1/2}) static regret with single gradient query per round, with adaptive/dynamic regret guarantees and improved rates under semi-bandit, bandit, and zeroth-order feedback

Conclusion: The proposed approach strictly improves state-of-the-art bounds across all feedback models for online DR-submodular optimization

Abstract: We study online maximization of non-monotone Diminishing-Return (DR)-submodular functions over down-closed convex sets, a regime where existing projection-free online methods suffer from suboptimal regret and limited feedback guarantees. Our main contribution is a new structural result showing that this class is $1/e$-linearizable under carefully designed exponential reparametrization, scaling parameter, and surrogate potential, enabling a reduction to online linear optimization. As a result, we obtain $O(T^{1/2})$ static regret with a single gradient query per round and unlock adaptive and dynamic regret guarantees, together with improved rates under semi-bandit, bandit, and zeroth-order feedback. Across all feedback models, our bounds strictly improve the state of the art.

[448] Is the Trigger Essential? A Feature-Based Triggerless Backdoor Attack in Vertical Federated Learning

Yige Liu, Yiwei Lou, Che Wang, Yongzhi Cao, Hanpin Wang

Main category: cs.LG

TL;DR: A triggerless backdoor attack in vertical federated learning that doesn’t require traditional triggers, operating under honest-but-curious attacker assumptions with high effectiveness and robustness against defenses.

DetailsMotivation: Existing backdoor attacks in vertical federated learning rely on triggers, but this paper discovers triggers are not essential. The authors aim to disclose a new attack pathway that operates under more stringent security assumptions (honest-but-curious rather than malicious attackers) to encourage the community to revisit VFL security threats.

Method: Proposes a feature-based triggerless backdoor attack with three modules: 1) label inference for targeted attacks, 2) poison generation with amplification and perturbation mechanisms, and 3) backdoor execution. The attack works in VFL scenarios with multiple passive parties and one active party with labels.

Result: Extensive experiments on five benchmark datasets show the attack outperforms three baseline backdoor attacks by 2 to 50 times while minimally impacting the main task. Maintains high performance even with 32 passive parties and only one set of auxiliary data. Remains largely unaffected by distinct defense strategies and exhibits strong robustness.

Conclusion: The disclosure of this triggerless backdoor attack pathway should encourage the community to revisit security threats in VFL scenarios and inspire researchers to develop more robust and practical defense strategies.

Abstract: As a distributed collaborative machine learning paradigm, vertical federated learning (VFL) allows multiple passive parties with distinct features and one active party with labels to collaboratively train a model. Although it is known for its privacy-preserving capabilities, VFL still faces significant privacy and security threats from backdoor attacks. Existing backdoor attacks typically involve an attacker implanting a trigger into the model during the training phase and executing the attack by adding the trigger to the samples during the inference phase. However, in this paper, we find that triggers are not essential for backdoor attacks in VFL. In light of this, we disclose a new backdoor attack pathway in VFL by introducing a feature-based triggerless backdoor attack. This attack operates under a more stringent security assumption, where the attacker is honest-but-curious rather than malicious during the training phase. It comprises three modules: label inference for the targeted backdoor attack, poison generation with amplification and perturbation mechanisms, and backdoor execution to implement the attack. Extensive experiments on five benchmark datasets demonstrate that our attack outperforms three baseline backdoor attacks by 2 to 50 times while minimally impacting the main task. Even in VFL scenarios with 32 passive parties and only one set of auxiliary data, our attack maintains high performance. Moreover, when confronted with distinct defense strategies, our attack remains largely unaffected and exhibits strong robustness. We hope that the disclosure of this triggerless backdoor attack pathway will encourage the community to revisit security threats in VFL scenarios and inspire researchers to develop more robust and practical defense strategies.

[449] QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

Santiago Gonzalez, Alireza Amiri Bavandpour, Peter Ye, Edward Zhang, Ruslans Aleksejevs, Todor Antić, Polina Baron, Sujeet Bhalerao, Shubhrajit Bhattacharya, Zachary Burton, John Byrne, Hyungjun Choi, Nujhat Ahmed Disha, Koppany István Encz, Yuchen Fang, Robert Joseph George, Ebrahim Ghorbani, Alan Goldfarb, Jing Guo, Meghal Gupta, Stefano Huber, Annika Kanckos, Minjung Kang, Hyun Jong Kim, Dino Lorenzini, Levi Lorenzo, Tianyi Mao, Giovanni Marzenta, Ariane M. Masuda, Lukas Mauth, Ana Mickovic, Andres Miniguano-Trujillo, Antoine Moulin, Wenqi Ni, Tomos Parry, Kevin Ren, Hossein Roodbarani, Mathieu Rundström, Manjil Saikia, Detchat Samart, Rebecca Steiner, Connor Stewart, Dhara Thakkar, Jeffrey Tse, Vasiliki Velona, Yunhai Xiang, Sibel Yalçın, Jun Yan, Ji Zeng, Arman Cohan, Quanquan C. Liu

Main category: cs.LG

TL;DR: QEDBench is a benchmark revealing systematic alignment gaps in LLM-as-a-Judge protocols for university-level mathematics evaluation, showing significant positive bias in frontier evaluators and critical reasoning gaps in discrete domains.

DetailsMotivation: As LLMs saturate elementary benchmarks, research shifts to reliability of automated evaluation. Current LLM-as-a-Judge protocols have systematic alignment gaps when applied to upper-undergraduate to early graduate level mathematics.

Method: Introduces QEDBench, a large-scale dual-rubric alignment benchmark measuring alignment with human experts on university-level math proofs. Uses dual-evaluation matrix (7 judges x 5 solvers) against 1,000+ hours of human evaluation.

Result: Frontier evaluators show significant positive bias (Claude Opus 4.5: +0.18, DeepSeek-V3: +0.20, Qwen 2.5 Max: +0.30, Llama 4 Maverick: +0.36). Critical reasoning gap in discrete domains: Gemini 3.0 Pro achieves SOTA (0.91 avg), while GPT-5 Pro and Claude Sonnet 4.5 degrade to 0.72/0.63 in Discrete Math and 0.74/0.50 in Graph Theory.

Conclusion: LLM-as-a-Judge protocols have systematic alignment gaps for advanced mathematics evaluation. QEDBench provides a public benchmark for evaluating and improving AI judges in this domain.

Abstract: As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard “LLM-as-a-Judge” protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early graduate level mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark to systematically measure alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common knowledge criteria. By deploying a dual-evaluation matrix (7 judges x 5 solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators like Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick exhibit significant positive bias (up to +0.18, +0.20, +0.30, +0.36 mean score inflation, respectively). Furthermore, we uncover a critical reasoning gap in the discrete domain: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 average human evaluation score), other reasoning models like GPT-5 Pro and Claude Sonnet 4.5 see their performance significantly degrade in discrete domains. Specifically, their average human evaluation scores drop to 0.72 and 0.63 in Discrete Math, and to 0.74 and 0.50 in Graph Theory. In addition to these research results, we also release QEDBench as a public benchmark for evaluating and improving AI judges. Our benchmark is publicly published at https://github.com/qqliu/Yale-QEDBench.

[450] TrajGPT-R: Generating Urban Mobility Trajectory with Reinforcement Learning-Enhanced Generative Pre-trained Transformer

Jiawei Wang, Chuang Yang, Jiawei Yong, Xiaohang Xu, Hongjun Wang, Noboru Koshizuka, Shintaro Fukushima, Ryosuke Shibasaki, Renhe Jiang

Main category: cs.LG

TL;DR: TrajGPT-R: A transformer-based framework for generating urban mobility trajectories using offline RL and inverse RL to capture individual preferences, addressing privacy concerns in mobility data.

Motivation: Urban mobility trajectory data is crucial for urban planning but limited by privacy concerns, creating a need for synthetic trajectory generation methods that preserve privacy while maintaining realistic patterns.

Method: Two-phase transformer-based approach: 1) Frame trajectory generation as offline RL problem with reduced vocabulary tokenization, 2) Use Inverse RL to capture trajectory-wise reward signals from historical data, then fine-tune pre-trained model with reward model to address long-term credit assignment and sparse rewards.
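The "reduced vocabulary" step can be illustrated with a toy grid tokenizer; the grid resolution, bounding box, and row-major cell encoding below are illustrative assumptions, not the paper's actual scheme:

```python
import numpy as np

def tokenize_trajectory(latlon, bbox, grid=64):
    """Map a sequence of (lat, lon) points to grid-cell token ids.

    Quantizing locations to a coarse grid shrinks the vocabulary from
    'every observed coordinate' to at most grid * grid discrete cells.
    """
    lat_min, lat_max, lon_min, lon_max = bbox
    lat = np.clip((latlon[:, 0] - lat_min) / (lat_max - lat_min), 0, 1 - 1e-9)
    lon = np.clip((latlon[:, 1] - lon_min) / (lon_max - lon_min), 0, 1 - 1e-9)
    row = (lat * grid).astype(int)
    col = (lon * grid).astype(int)
    return row * grid + col  # one token id per grid cell

traj = np.array([[35.68, 139.69], [35.73, 139.62]])  # hypothetical GPS fixes
tokens = tokenize_trajectory(traj, bbox=(35.5, 35.9, 139.5, 139.9))
```

The resulting token sequences can then be fed to a GPT-style model, with reward-model fine-tuning applied on top.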

Result: Framework significantly outperforms existing models in reliability and diversity across multiple datasets, providing robust methodology for urban data simulation.

Conclusion: The approach advances urban mobility modeling and provides practical tools for traffic management and urban development planning, with publicly available implementation.

Abstract: Mobility trajectories are essential for understanding urban dynamics and enhancing urban planning, yet access to such data is frequently hindered by privacy concerns. This research introduces a transformative framework for generating large-scale urban mobility trajectories, employing a novel application of a transformer-based model pre-trained and fine-tuned through a two-phase process. Initially, trajectory generation is conceptualized as an offline reinforcement learning (RL) problem, with a significant reduction in vocabulary space achieved during tokenization. The integration of Inverse Reinforcement Learning (IRL) allows for the capture of trajectory-wise reward signals, leveraging historical data to infer individual mobility preferences. Subsequently, the pre-trained model is fine-tuned using the constructed reward model, effectively addressing the challenges inherent in traditional RL-based autoregressive methods, such as long-term credit assignment and handling of sparse reward environments. Comprehensive evaluations on multiple datasets illustrate that our framework markedly surpasses existing models in terms of reliability and diversity. Our findings not only advance the field of urban mobility modeling but also provide a robust methodology for simulating urban data, with significant implications for traffic management and urban development planning. The implementation is publicly available at https://github.com/Wangjw6/TrajGPT_R.

[451] Sparse Bayesian Deep Functional Learning with Structured Region Selection

Xiaoxian Zhu, Yingmeng Li, Shuangge Ma, Mengyun Wu

Main category: cs.LG

TL;DR: sBayFDNN: A sparse Bayesian functional deep neural network that combines deep learning’s nonlinear modeling with interpretable region selection for functional data analysis.

Motivation: Existing functional data analysis methods face a trade-off: conventional functional models are limited by linearity, while deep learning approaches lack interpretable region selection for sparse effects. This is problematic for applications like ECG monitoring, neuroimaging, and wearable sensing where complex structured data requires both accurate nonlinear modeling and interpretable feature selection.

Method: Proposes sBayFDNN (sparse Bayesian functional deep neural network) that learns adaptive functional embeddings through a deep Bayesian architecture to capture complex nonlinear relationships. Uses a structured prior to enable interpretable, region-wise selection of influential domains with quantified uncertainty.
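The summary does not spell the prior out, but a common construction for interpretable region-wise selection is a group spike-and-slab prior, where one Bernoulli indicator switches an entire region's coefficients on or off. A sampling sketch with assumed hyperparameters (pi, region sizes):

```python
import numpy as np

def sample_region_sparse_weights(n_regions, region_size, pi=0.3, seed=0):
    """Draw functional weights under a structured spike-and-slab prior:
    a Bernoulli(pi) indicator per region either zeroes out all of that
    region's coefficients (spike) or draws them from a Gaussian (slab),
    so selection happens region-wise rather than coefficient-wise."""
    rng = np.random.default_rng(seed)
    gamma = rng.random(n_regions) < pi              # region on/off indicators
    slab = rng.standard_normal((n_regions, region_size))
    return gamma[:, None] * slab, gamma

weights, active = sample_region_sparse_weights(n_regions=10, region_size=5)
```

Posterior inference over the indicators then yields region-selection probabilities, giving quantified uncertainty about which domains matter.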

Result: Theoretical guarantees include approximation error bounds, posterior consistency, and region selection consistency, the first such guarantees for a Bayesian deep functional model. Empirical results from comprehensive simulations and real-world studies confirm effectiveness and superiority over existing approaches.

Conclusion: sBayFDNN bridges the gap between conventional functional models and deep learning by providing both accurate nonlinear modeling and interpretable region selection with statistical rigor, making it suitable for complex functional data applications.

Abstract: In modern applications such as ECG monitoring, neuroimaging, wearable sensing, and industrial equipment diagnostics, complex and continuously structured data are ubiquitous, presenting both challenges and opportunities for functional data analysis. However, existing methods face a critical trade-off: conventional functional models are limited by linearity, whereas deep learning approaches lack interpretable region selection for sparse effects. To bridge these gaps, we propose a sparse Bayesian functional deep neural network (sBayFDNN). It learns adaptive functional embeddings through a deep Bayesian architecture to capture complex nonlinear relationships, while a structured prior enables interpretable, region-wise selection of influential domains with quantified uncertainty. Theoretically, we establish rigorous approximation error bounds, posterior consistency, and region selection consistency. These results provide the first theoretical guarantees for a Bayesian deep functional model, ensuring its reliability and statistical rigor. Empirically, comprehensive simulations and real-world studies confirm the effectiveness and superiority of sBayFDNN. Crucially, sBayFDNN excels in recognizing intricate dependencies for accurate predictions and more precisely identifies functionally meaningful regions, capabilities fundamentally beyond existing approaches.

[452] Bikelution: Federated Gradient-Boosting for Scalable Shared Micro-Mobility Demand Forecasting

Antonios Tziorvas, Andreas Tritsarolis, Yannis Theodoridis

Main category: cs.LG

TL;DR: Bikelution: A federated learning approach using gradient-boosted trees for privacy-preserving bike-sharing demand forecasting up to 6 hours ahead, achieving comparable accuracy to centralized ML while addressing privacy and bandwidth concerns.

Motivation: Traditional time-series models are insufficient for bike demand forecasting due to multiple external factors. Centralized ML raises privacy and bandwidth issues when data is distributed across edge devices, necessitating a privacy-preserving solution.

Method: Proposes Bikelution, a federated learning solution based on gradient-boosted trees that preserves privacy by keeping data on edge devices while collaboratively training models through parameter aggregation.
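The summary does not describe how the trees are trained federatedly; one standard design for federated gradient boosting is for clients to share only gradient histograms, which the server sums to pick split points. A heavily simplified sketch under that assumption (one feature, one candidate split):

```python
import numpy as np

def client_histogram(grad, feature, bin_edges):
    """Each client bins its gradients by feature value; only these small
    aggregate histograms (never the raw trip records) leave the device."""
    bins = np.digitize(feature, bin_edges)
    hist = np.zeros(len(bin_edges) + 1)
    np.add.at(hist, bins, grad)
    return hist

def best_split(global_hist):
    """The server sums client histograms and scores each candidate split
    by left**2 + right**2, a simplified split gain (real GBDT gain also
    uses Hessian/count statistics)."""
    left = np.cumsum(global_hist)[:-1]
    right = global_hist.sum() - left
    return int(np.argmax(left ** 2 + right ** 2))

edges = np.array([0.5])                     # one candidate split at 0.5
h1 = client_histogram(np.array([1.0, -2.0]), np.array([0.2, 0.8]), edges)
h2 = client_histogram(np.array([3.0]), np.array([0.1]), edges)
split = best_split(h1 + h2)
```

This is what makes tree ensembles attractive for federation: the communicated state is a few histograms per split, not model gradients per sample.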

Result: Experiments on three real-world bike-sharing datasets show Bikelution is comparable to its centralized ML variant and outperforms current state-of-the-art methods for mid-term demand forecasting up to six hours ahead.

Conclusion: Demonstrates feasibility of privacy-aware demand forecasting and outlines trade-offs between federated learning and centralized ML approaches for spatio-temporal bike-sharing data.

Abstract: The rapid growth of dockless bike-sharing systems has generated massive spatio-temporal datasets useful for fleet allocation, congestion reduction, and sustainable mobility. Bike demand, however, depends on several external factors, making traditional time-series models insufficient. Centralized Machine Learning (CML) yields high-accuracy forecasts but raises privacy and bandwidth issues when data are distributed across edge devices. To overcome these limitations, we propose Bikelution, an efficient Federated Learning (FL) solution based on gradient-boosted trees that preserves privacy while delivering accurate mid-term demand forecasts up to six hours ahead. Experiments on three real-world BSS datasets show that Bikelution is comparable to its CML-based variant and outperforms the current state-of-the-art. The results highlight the feasibility of privacy-aware demand forecasting and outline the trade-offs between FL and CML approaches.

[453] UrbanFM: Scaling Urban Spatio-Temporal Foundation Models

Wei Chen, Yuqian Wu, Junle Chen, Xiaofang Zhou, Yuxuan Liang

Main category: cs.LG

TL;DR: Proposes WorldST billion-scale urban data corpus, MiniST computational units, and UrbanFM architecture for spatio-temporal foundation models with zero-shot generalization across cities and tasks.

Motivation: Urban computing lacks generalizable foundation models due to scenario-specific approaches overfitted to particular regions/tasks, unlike other scientific domains where foundation models have been transformative.

Method: Three-pronged scaling approach: 1) Data scaling via WorldST corpus standardizing diverse urban signals from 100+ cities, 2) Computation scaling via MiniST units to discretize spatio-temporal fields, 3) Architecture scaling via UrbanFM minimalist self-attention with limited inductive biases.
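The abstract does not give the exact split mechanism, but "discretizing continuous spatio-temporal fields into learnable computational units" reads like spatio-temporal patch tokenization; a sketch with assumed patch sizes:

```python
import numpy as np

def to_minist_units(field, t_patch=4, s_patch=8):
    """Split a (T, H, W) spatio-temporal field (e.g. gridded traffic flow)
    into non-overlapping (t_patch, s_patch, s_patch) units, each flattened
    to one token vector ready for self-attention."""
    T, H, W = field.shape
    units = (field
             .reshape(T // t_patch, t_patch,
                      H // s_patch, s_patch,
                      W // s_patch, s_patch)
             .transpose(0, 2, 4, 1, 3, 5)       # group the three patch axes
             .reshape(-1, t_patch * s_patch * s_patch))
    return units  # (num_units, unit_dim)

field = np.arange(8 * 16 * 16, dtype=float).reshape(8, 16, 16)
units = to_minist_units(field)
```

Sensor-based observations would need a different grouping (e.g. by graph neighborhood), which is presumably what the unified MiniST representation abstracts over.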

Result: UrbanFM achieves remarkable zero-shot generalization across unseen cities and tasks, demonstrating the effectiveness of the proposed scaling approach for urban spatio-temporal foundation models.

Conclusion: The work represents a pivotal step toward large-scale urban spatio-temporal foundation models by systematically addressing heterogeneity, correlation, and dynamics through scaling principles.

Abstract: Urban systems, as dynamic complex systems, continuously generate spatio-temporal data streams that encode the fundamental laws of human mobility and city evolution. While AI for Science has witnessed the transformative power of foundation models in disciplines like genomics and meteorology, urban computing remains fragmented due to “scenario-specific” models, which are overfitted to specific regions or tasks, hindering their generalizability. To bridge this gap and advance spatio-temporal foundation models for urban systems, we adopt scaling as the central perspective and systematically investigate two key questions: what to scale and how to scale. Grounded in first-principles analysis, we identify three critical dimensions: heterogeneity, correlation, and dynamics, aligning these principles with the fundamental scientific properties of urban spatio-temporal data. Specifically, to address heterogeneity through data scaling, we construct WorldST. This billion-scale corpus standardizes diverse physical signals, such as traffic flow and speed, from over 100 global cities into a unified data format. To enable computation scaling for modeling correlations, we introduce the MiniST unit, a novel split mechanism that discretizes continuous spatio-temporal fields into learnable computational units to unify representations of grid-based and sensor-based observations. Finally, addressing dynamics via architecture scaling, we propose UrbanFM, a minimalist self-attention architecture designed with limited inductive biases to autonomously learn dynamic spatio-temporal dependencies from massive data. Furthermore, we establish EvalST, the largest-scale urban spatio-temporal benchmark to date. Extensive experiments demonstrate that UrbanFM achieves remarkable zero-shot generalization across unseen cities and tasks, marking a pivotal first step toward large-scale urban spatio-temporal foundation models.

[454] High-Dimensional Robust Mean Estimation with Untrusted Batches

Maryam Aliakbarpour, Vladimir Braverman, Yuhan Liu, Junze Yin

Main category: cs.LG

TL;DR: SoS-based algorithms for high-dimensional mean estimation with double corruption: ε-fraction of users are fully adversarial, while “good” users have either mean-shift √α or α-fraction sample-level corruption within batches.

Motivation: Address collaborative learning where data comes from heterogeneous and potentially malicious sources, extending beyond discrete settings to continuous high-dimensional regimes with two types of statistical deviation.

Method: Two Sum-of-Squares (SoS) based algorithms that handle tiered corruption: 1) good batches with mean-shift √α, 2) good batches with α-fraction sample-level corruption. Leverages batch structure for internal averaging.

Result: Achieves minimax-optimal error rate O(√(ε/n) + √(d/nN) + √α), showing adversarial influence suppressed by 1/√n due to batch structure while heterogeneity α remains inherent statistical difficulty.

Conclusion: Batch structure provides robustness against adversarial users in high-dimensional mean estimation, with SoS algorithms achieving optimal rates under double corruption models.

Abstract: We study high-dimensional mean estimation in a collaborative setting where data is contributed by $N$ users in batches of size $n$. In this environment, a learner seeks to recover the mean $\mu$ of a true distribution $P$ from a collection of sources that are both statistically heterogeneous and potentially malicious. We formalize this challenge through a double corruption landscape: an $\varepsilon$-fraction of users are entirely adversarial, while the remaining "good" users provide data from distributions that are related to $P$, but deviate by a proximity parameter $\alpha$. Unlike existing work on the untrusted batch model, which typically measures this deviation via total variation distance in discrete settings, we address the continuous, high-dimensional regime under two natural variants for deviation: (1) good batches are drawn from distributions with a mean-shift of $\sqrt{\alpha}$, or (2) an $\alpha$-fraction of samples within each good batch are adversarially corrupted. In particular, the second model presents significant new challenges: in high dimensions, unlike discrete settings, even a small fraction of sample-level corruption can shift empirical means and covariances arbitrarily. We provide two Sum-of-Squares (SoS) based algorithms to navigate this tiered corruption. Our algorithms achieve the minimax-optimal error rate $O(\sqrt{\varepsilon/n} + \sqrt{d/nN} + \sqrt{\alpha})$, demonstrating that while heterogeneity $\alpha$ represents an inherent statistical difficulty, the influence of adversarial users is suppressed by a factor of $1/\sqrt{n}$ due to the internal averaging afforded by the batch structure.

[455] WeirNet: A Large-Scale 3D CFD Benchmark for Geometric Surrogate Modeling of Piano Key Weirs

Lisa Lüddecke, Michael Hohmann, Sebastian Eilermann, Jan Tillmann-Mumm, Pezhman Pourabdollah, Mario Oertel, Oliver Niggemann

Main category: cs.LG

TL;DR: WeirNet: A large 3D CFD benchmark dataset for geometric surrogate modeling of Piano Key Weirs, enabling data-driven hydraulic modeling with parametric descriptors, meshes, and point clouds.

Motivation: Reliable prediction of hydraulic performance for Piano Key Weir design is challenging due to complex 3D geometry dependencies, and surrogate models are limited by scarce large datasets that capture geometric variation, operating conditions, and functional performance.

Method: Created WeirNet dataset with 3,794 parametric PKW geometries (rectangular and trapezoidal), each simulated at 19 discharge conditions using consistent free-surface OpenFOAM workflow, resulting in 71,387 completed simulations with discharge coefficient labels. Dataset includes multiple modalities: parametric descriptors, watertight surface meshes, and high-resolution point clouds.

Result: Tree-based regressors on parametric descriptors achieved best overall accuracy for discharge coefficient prediction, while point- and mesh-based models remained competitive and offered parameterization-agnostic inference. All surrogates evaluate in milliseconds per sample, providing orders-of-magnitude speedups over CFD runtimes.

Conclusion: WeirNet establishes a reproducible framework for data-driven hydraulic modeling, enables faster exploration of PKW designs during early hydraulic planning, and identifies geometry shift as the dominant failure mode for out-of-distribution performance.

Abstract: Reliable prediction of hydraulic performance is challenging for Piano Key Weir (PKW) design because discharge capacity depends on three-dimensional geometry and operating conditions. Surrogate models can accelerate hydraulic-structure design, but progress is limited by scarce large, well-documented datasets that jointly capture geometric variation, operating conditions, and functional performance. This study presents WeirNet, a large 3D CFD benchmark dataset for geometric surrogate modeling of PKWs. WeirNet contains 3,794 parametric, feasibility-constrained rectangular and trapezoidal PKW geometries, each simulated at 19 discharge conditions using a consistent free-surface OpenFOAM workflow, resulting in 71,387 completed simulations, with complete discharge coefficient labels, that form the benchmark. The dataset is released in multiple modalities (compact parametric descriptors, watertight surface meshes, and high-resolution point clouds) together with standardized tasks and in-distribution and out-of-distribution splits. Representative surrogate families are benchmarked for discharge coefficient prediction. Tree-based regressors on parametric descriptors achieve the best overall accuracy, while point- and mesh-based models remain competitive and offer parameterization-agnostic inference. All surrogates evaluate in milliseconds per sample, providing orders-of-magnitude speedups over CFD runtimes. Out-of-distribution results identify geometry shift as the dominant failure mode compared to unseen discharge values, and data-efficiency experiments show diminishing returns beyond roughly 60% of the training data. By publicly releasing the dataset together with simulation setups and evaluation pipelines, WeirNet establishes a reproducible framework for data-driven hydraulic modeling and enables faster exploration of PKW designs during the early stages of hydraulic planning.

[456] Fuz-RL: A Fuzzy-Guided Robust Framework for Safe Reinforcement Learning under Uncertainty

Xu Wan, Chao Yang, Cheng Yang, Jie Song, Mingyang Sun

Main category: cs.LG

TL;DR: Fuz-RL: A fuzzy measure-guided robust framework for safe reinforcement learning that uses Choquet integrals for risk assessment and avoids min-max optimization

Motivation: Real-world RL applications require safety guarantees, but multiple uncertainty sources in environments make interpretable risk assessment and robust decision-making challenging. Existing methods struggle with complex uncertainty interactions.

Method: Proposes Fuz-RL framework with novel fuzzy Bellman operator using Choquet integrals to estimate robust value functions. Formulates as Constrained MDP and proves equivalence to distributionally robust safe RL problems, avoiding min-max optimization.
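The key primitive is the discrete Choquet integral, which generalizes the expectation by weighting sorted outcomes with a (possibly non-additive) fuzzy measure. A minimal numpy version; the capacities below are illustrative, not the paper's learned fuzzy measure:

```python
import numpy as np

def choquet_integral(values, capacity):
    """Discrete Choquet integral of `values` w.r.t. a set function.

    Sort outcomes ascending and weight each increment by the capacity of
    the set of outcomes at least that large. `capacity` maps a frozenset
    of indices to [0, 1] and must be monotone with capacity(all) = 1.
    """
    order = np.argsort(values)
    total, prev = 0.0, 0.0
    for k, i in enumerate(order):
        upper = frozenset(order[k:].tolist())   # outcomes >= values[i]
        total += (values[i] - prev) * capacity(upper)
        prev = values[i]
    return total

vals = np.array([1.0, 3.0, 2.0])
uniform = lambda s: len(s) / 3              # additive capacity -> plain mean
pessimistic = lambda s: (len(s) / 3) ** 2   # convex distortion -> risk-averse
```

With the additive capacity the integral recovers the mean (2.0); the convex distortion yields 14/9 ≈ 1.56, a more conservative value, which is the kind of robust value estimate a fuzzy Bellman operator can plug in place of the plain expectation.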

Result: Empirical results on safe-control-gym and safety-gymnasium show Fuz-RL integrates with existing safe RL baselines model-free, significantly improving both safety and control performance under various uncertainties (observation, action, dynamics).

Conclusion: Fuz-RL provides an effective framework for safe RL that handles multiple uncertainty sources through fuzzy measures, offering both theoretical guarantees and practical performance improvements.

Abstract: Safe Reinforcement Learning (RL) is crucial for achieving high performance while ensuring safety in real-world applications. However, the complex interplay of multiple uncertainty sources in real environments poses significant challenges for interpretable risk assessment and robust decision-making. To address these challenges, we propose Fuz-RL, a fuzzy measure-guided robust framework for safe RL. Specifically, our framework develops a novel fuzzy Bellman operator for estimating robust value functions using Choquet integrals. Theoretically, we prove that solving the Fuz-RL problem (in Constrained Markov Decision Process (CMDP) form) is equivalent to solving distributionally robust safe RL problems (in robust CMDP form), effectively avoiding min-max optimization. Empirical analyses on safe-control-gym and safety-gymnasium scenarios demonstrate that Fuz-RL effectively integrates with existing safe RL baselines in a model-free manner, significantly improving both safety and control performance under various types of uncertainties in observation, action, and dynamics.

[457] Rethink Efficiency Side of Neural Combinatorial Solver: An Offline and Self-Play Paradigm

Zhenxing Xu, Zeyuan Ma, Weidong Bao, Hui Yan, Yan Zheng, Ji Wang

Main category: cs.LG

TL;DR: ECO introduces an efficient offline self-play paradigm for Neural Combinatorial Optimization using supervised warm-up, Direct Preference Optimization, and Mamba-based architecture for improved memory efficiency and training throughput.

Motivation: Current Neural Combinatorial Optimization (NCO) methods suffer from inefficient online training paradigms that are computationally expensive and memory-intensive. The authors aim to develop a more efficient offline self-play approach that reduces computational overhead while maintaining competitive performance.

Method: ECO employs a three-pronged approach: 1) Two-phase offline paradigm with supervised warm-up followed by iterative Direct Preference Optimization (DPO), 2) Mamba-based architecture designed specifically for offline efficiency, and 3) Progressive bootstrapping mechanism using heuristics to ensure stable training and continuous policy improvement.
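The DPO objective in phase two is, in its standard form, a logistic loss on reference-adjusted log-probability margins; a minimal per-pair version (ECO's exact variant and β are not specified in the summary):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one (winner, loser) pair:
    -log sigmoid(beta * margin), where the margin is the policy's
    log-probability advantage of the winner over the loser, measured
    relative to a frozen reference policy."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```

For NCO, the "winner" and "loser" could be two sampled tours for the same instance ranked by objective value, which is how self-play preference pairs can be built entirely offline, without an online environment loop.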

Result: ECO achieves competitive performance on TSP and CVRP benchmarks compared to state-of-the-art baselines, with significant advantages in memory utilization and training throughput. The efficiency analysis shows substantial improvements in computational resource usage.

Conclusion: ECO presents an effective paradigm shift for NCO that enables efficient offline self-play through architectural innovations and training methodology improvements, offering a more practical approach for real-world combinatorial optimization problems.

Abstract: We propose ECO, a versatile learning paradigm that enables efficient offline self-play for Neural Combinatorial Optimization (NCO). ECO addresses key limitations in the field through: 1) Paradigm Shift: Moving beyond inefficient online paradigms, we introduce a two-phase offline paradigm consisting of supervised warm-up and iterative Direct Preference Optimization (DPO); 2) Architecture Shift: We deliberately design a Mamba-based architecture to further enhance the efficiency in the offline paradigm; and 3) Progressive Bootstrapping: To stabilize training, we employ a heuristic-based bootstrapping mechanism that ensures continuous policy improvement during training. Comparison results on TSP and CVRP highlight that ECO performs competitively with up-to-date baselines, with significant advantage on the efficiency side in terms of memory utilization and training throughput. We provide further in-depth analysis on the efficiency, throughput and memory usage of ECO. Ablation studies show rationale behind our designs.

[458] Deep unfolding of MCMC kernels: scalable, modular & explainable GANs for high-dimensional posterior sampling

Jonathan Spence, Tobías I. Liaudat, Konstantinos Zygalakis, Marcelo Pereyra

Main category: cs.LG

TL;DR: Unfolded Langevin GANs combine MCMC modularity with GAN efficiency for Bayesian posterior sampling, offering interpretable, adaptable neural samplers.

Motivation: Push-forward generative models (GANs, VAEs, normalizing flows) offer efficient posterior sampling but lack the modularity and interpretability of MCMC methods, making them opaque and poorly generalizable to changes in likelihood functions.

Method: Apply deep unfolding to Langevin MCMC algorithms, mapping fixed-step iterative algorithms onto modular neural networks. Train these unfolded samplers end-to-end using a supervised regularized Wasserstein GAN framework for posterior sampling.
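The building block being unfolded is the unadjusted Langevin step; in the unfolded network each step becomes a layer with learnable parameters. A sketch with fixed (not learned) step sizes and a toy Gaussian posterior:

```python
import numpy as np

def unfolded_langevin(x0, grad_log_post, step_sizes, rng):
    """Run one Langevin step per 'layer':
    x <- x + eta * grad log p(x) + sqrt(2 * eta) * noise.
    Deep unfolding would make eta (and richer per-layer parameters)
    trainable end-to-end; here they are fixed constants."""
    x = x0
    for eta in step_sizes:
        x = (x + eta * grad_log_post(x)
             + np.sqrt(2.0 * eta) * rng.standard_normal(x.shape))
    return x

# Toy posterior N(mu, 1), so grad log p(x) = mu - x.
mu = 3.0
rng = np.random.default_rng(0)
samples = unfolded_langevin(np.zeros(5000), lambda x: mu - x, [0.5] * 30, rng)
```

The samples concentrate around mu, but the finite step size biases the variance (here toward 4/3 instead of 1), one of the discretization effects that end-to-end training of the unfolded layers can compensate for.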

Result: The approach achieves high sampling accuracy and excellent computational efficiency in Bayesian imaging experiments, while retaining physics consistency, adaptability, and interpretability of classical MCMC strategies.

Conclusion: Unfolded Langevin GANs provide a novel architecture that bridges the gap between efficient push-forward models and interpretable MCMC methods, offering robust posterior sampling with key parameters specifiable at inference time.

Abstract: Markov chain Monte Carlo (MCMC) methods are fundamental to Bayesian computation, but can be computationally intensive, especially in high-dimensional settings. Push-forward generative models, such as generative adversarial networks (GANs), variational auto-encoders and normalising flows offer a computationally efficient alternative for posterior sampling. However, push-forward models are opaque as they lack the modularity of Bayes Theorem, leading to poor generalisation with respect to changes in the likelihood function. In this work, we introduce a novel approach to GAN architecture design by applying deep unfolding to Langevin MCMC algorithms. This paradigm maps fixed-step iterative algorithms onto modular neural networks, yielding architectures that are both flexible and amenable to interpretation. Crucially, our design allows key model parameters to be specified at inference time, offering robustness to changes in the likelihood parameters. We train these unfolded samplers end-to-end using a supervised regularized Wasserstein GAN framework for posterior sampling. Through extensive Bayesian imaging experiments, we demonstrate that our proposed approach achieves high sampling accuracy and excellent computational efficiency, while retaining the physics consistency, adaptability and interpretability of classical MCMC strategies.

[459] On Electric Vehicle Energy Demand Forecasting and the Effect of Federated Learning

Andreas Tritsarolis, Gil Sampaio, Nikos Pelekis, Yannis Theodoridis

Main category: cs.LG

TL;DR: Comparison of time series forecasting methods for Electric Vehicle Supply Equipment energy demand forecasting, evaluating statistical, ML, and deep learning approaches under centralized and federated learning paradigms.

Motivation: Accurate energy demand forecasting for EV charging infrastructure is critical for grid management but challenging due to external factors and privacy concerns, with data becoming increasingly fragmented across distributed systems.

Method: Evaluated multiple time series forecasting methods including ARIMA family (statistical), XGBoost (ML), and GRU/LSTM (deep learning) on four real-world EVSE datasets under both centralized and federated learning settings.
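For the tree and neural models, the demand series must first be framed as supervised learning with lagged inputs; the lag window and 6-step horizon below are illustrative choices, not the paper's exact configuration:

```python
import numpy as np

def make_lag_features(series, n_lags=24, horizon=6):
    """Build a supervised table from a univariate demand series: each row
    holds the previous n_lags observations, and the target is the demand
    `horizon` steps ahead (e.g. 6 hours for hourly data)."""
    X, y = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        X.append(series[t - n_lags:t])
        y.append(series[t + horizon - 1])
    return np.array(X), np.array(y)

series = np.arange(40, dtype=float)   # stand-in for an hourly EVSE demand series
X, y = make_lag_features(series)
```

The resulting X and y can be fed to XGBoost or a GRU/LSTM alike; in the federated setting each charging station would build this table locally and share only model updates.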

Result: XGBoost outperformed both statistical and neural network models in prediction accuracy and energy efficiency, while federated learning models balanced forecasting fidelity, privacy preservation, and energy overheads.

Conclusion: Federated learning offers a promising direction for decentralized energy demand forecasting by balancing accuracy, privacy, and efficiency, with gradient boosted trees showing superior performance over other approaches.

Abstract: The wide spread of new energy resources, smart devices, and demand side management strategies has motivated several analytics operations, from infrastructure load modeling to user behavior profiling. Energy Demand Forecasting (EDF) of Electric Vehicle Supply Equipments (EVSEs) is one of the most critical operations for ensuring efficient energy management and sustainability, since it enables utility providers to anticipate energy/power demand, optimize resource allocation, and implement proactive measures to improve grid reliability. However, accurate EDF is a challenging problem due to external factors, such as the varying user routines, weather conditions, driving behaviors, unknown state of charge, etc. Furthermore, as concerns and restrictions about privacy and sustainability have grown, training data has become increasingly fragmented, resulting in distributed datasets scattered across different data silos and/or edge devices, calling for federated learning solutions. In this paper, we investigate different well-established time series forecasting methodologies to address the EDF problem, from statistical methods (the ARIMA family) to traditional machine learning models (such as XGBoost) and deep neural networks (GRU and LSTM). We provide an overview of these methods through a performance comparison over four real-world EVSE datasets, evaluated under both centralized and federated learning paradigms, focusing on the trade-offs between forecasting fidelity, privacy preservation, and energy overheads. Our experimental results demonstrate, on the one hand, the superiority of gradient boosted trees (XGBoost) over statistical and NN-based models in both prediction accuracy and energy efficiency and, on the other hand, an insight that Federated Learning-enabled models balance these factors, offering a promising direction for decentralized energy demand forecasting.

[460] Understanding the Role of Rehearsal Scale in Continual Learning under Varying Model Capacities

JinLi He, Liang Bai, Xian Yang

Main category: cs.LG

TL;DR: Theoretical analysis reveals counterintuitive effects of rehearsal scale in continual learning: rehearsal can impair adaptability, and increasing rehearsal doesn’t always improve memory retention, with diminishing returns for similar tasks.

Motivation: While rehearsal is widely used in continual learning to mitigate catastrophic forgetting, there's limited theoretical understanding of how rehearsal scale influences learning dynamics. The paper aims to provide a unified theoretical framework to analyze rehearsal mechanisms.

Method: Formulates rehearsal-based continual learning as a multidimensional effectiveness-driven iterative optimization problem, then derives a closed-form analysis of adaptability, memorability, and generalization as functions of the rehearsal scale. The theoretical findings are validated with numerical simulations and deep neural network experiments on real-world datasets.
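A toy numerical version of the object being analyzed (the paper's framework is analytical; this merely reproduces the setup): sequential least-squares tasks where each SGD batch is padded with `rehearsal_scale` replayed examples from past tasks.

```python
import numpy as np

def train_with_rehearsal(tasks, rehearsal_scale, lr=0.1, steps=200, seed=0):
    """Fit y = X @ w on tasks sequentially; each size-8 SGD batch on the
    current task is padded with `rehearsal_scale` examples replayed from
    a buffer holding all past-task data."""
    rng = np.random.default_rng(seed)
    d = tasks[0][0].shape[1]
    w = np.zeros(d)
    buf_X, buf_y = np.empty((0, d)), np.empty(0)
    for X, y in tasks:
        for _ in range(steps):
            idx = rng.integers(0, len(X), size=8)
            bX, by = X[idx], y[idx]
            if rehearsal_scale > 0 and len(buf_y) > 0:
                ridx = rng.integers(0, len(buf_y), size=rehearsal_scale)
                bX = np.vstack([bX, buf_X[ridx]])
                by = np.concatenate([by, buf_y[ridx]])
            w -= lr * 2 * bX.T @ (bX @ w - by) / len(by)
        buf_X, buf_y = np.vstack([buf_X, X]), np.concatenate([buf_y, y])
    return w

rng = np.random.default_rng(1)
X1 = rng.standard_normal((100, 3)); y1 = X1 @ np.array([1.0, 0.0, 0.0])
X2 = rng.standard_normal((100, 3)); y2 = X2 @ np.array([0.0, 1.0, 0.0])
w_none = train_with_rehearsal([(X1, y1), (X2, y2)], rehearsal_scale=0)
w_replay = train_with_rehearsal([(X1, y1), (X2, y2)], rehearsal_scale=8)
task1_err = lambda w: float(np.mean((X1 @ w - y1) ** 2))
```

Replay cuts task-1 forgetting but pulls w away from the task-2 optimum, exactly the memorability/adaptability tension the paper formalizes.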

Result: Uncovers counterintuitive findings: 1) Rehearsal can impair model adaptability (contrary to traditional benefits), 2) Increasing rehearsal scale doesn’t necessarily improve memory retention, with diminishing lower bounds for similar tasks with low noise, 3) Statistical patterns of rehearsal mechanisms are revealed through empirical validation.

Conclusion: Theoretical analysis provides new insights into rehearsal mechanisms in continual learning, challenging conventional wisdom about rehearsal benefits and revealing complex relationships between rehearsal scale and learning dynamics that have practical implications for algorithm design.

Abstract: Rehearsal is one of the key techniques for mitigating catastrophic forgetting and has been widely adopted in continual learning algorithms due to its simplicity and practicality. However, the theoretical understanding of how rehearsal scale influences learning dynamics remains limited. To address this gap, we formulate rehearsal-based continual learning as a multidimensional effectiveness-driven iterative optimization problem, providing a unified characterization across diverse performance metrics. Within this framework, we derive a closed-form analysis of adaptability, memorability, and generalization from the perspective of rehearsal scale. Our results uncover several intriguing and counterintuitive findings. First, rehearsal can impair model’s adaptability, in sharp contrast to its traditionally recognized benefits. Second, increasing the rehearsal scale does not necessarily improve memory retention. When tasks are similar and noise levels are low, the memory error exhibits a diminishing lower bound. Finally, we validate these insights through numerical simulations and extended analyses on deep neural networks across multiple real-world datasets, revealing statistical patterns of rehearsal mechanisms in continual learning.

[461] Exploring the Impact of Parameter Update Magnitude on Forgetting and Generalization of Continual Learning

JinLi He, Liang Bai, Xian Yang

Main category: cs.LG

TL;DR: The paper analyzes continual learning through the lens of parameter update magnitude, formalizing forgetting as knowledge degradation from task-specific parameter drift, and proposes a hybrid update strategy that adapts based on gradient directions.

Motivation: Most continual learning research focuses on designing diverse update strategies without sufficient theoretical understanding of the underlying mechanisms, particularly regarding how parameter update magnitude affects forgetting.

Method: Characterizes forgetting from parameter update magnitude perspective, formalizes it as knowledge degradation from task-specific parameter drift, derives optimal update magnitude to minimize forgetting, and proposes a hybrid parameter update strategy that adaptively adjusts update magnitude based on gradient directions.

Result: Theoretical analysis reveals that sequence tasks with small parameter distances exhibit better generalization and less forgetting under frozen training rather than initialized training. Experiments on deep neural networks show the hybrid approach outperforms standard training strategies.

Conclusion: Provides new theoretical perspectives and practical inspiration for designing efficient and scalable continual learning algorithms by unifying frozen and initialized training paradigms within an optimization framework for constrained parameter updates.

Abstract: The magnitude of parameter updates is considered a key factor in continual learning. However, most existing studies focus on designing diverse update strategies, while a theoretical understanding of the underlying mechanisms remains limited. Therefore, we characterize a model’s forgetting from the perspective of parameter update magnitude and formalize it as knowledge degradation induced by task-specific drift in the parameter space, which has not been fully captured in previous studies due to their assumption of a unified parameter space. By deriving the optimal parameter update magnitude that minimizes forgetting, we unify two representative update paradigms, frozen training and initialized training, within an optimization framework for constrained parameter updates. Our theoretical results further reveal that sequence tasks with small parameter distances exhibit better generalization and less forgetting under frozen training rather than initialized training. These theoretical insights inspire a novel hybrid parameter update strategy that adaptively adjusts update magnitude based on gradient directions. Experiments on deep neural networks demonstrate that this hybrid approach outperforms standard training strategies, providing new theoretical perspectives and practical inspiration for designing efficient and scalable continual learning algorithms.
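The hybrid update strategy is described only at a high level; one plausible instantiation gates the update magnitude by the cosine alignment between the new task's gradient and the previous task's gradient. The gating rule below is a hypothetical stand-in for the paper's derived optimum:

```python
import numpy as np

def hybrid_update(theta, grad_new, grad_old, lr=0.1):
    """Illustrative hybrid update: scale the step size by how well the new
    task's gradient aligns with the previous task's gradient."""
    cos = float(np.dot(grad_new, grad_old) /
                (np.linalg.norm(grad_new) * np.linalg.norm(grad_old) + 1e-12))
    # Aligned gradients -> near "initialized" (full) update;
    # conflicting gradients -> near "frozen" (small) update.
    scale = 0.5 * (1.0 + cos)   # maps cos in [-1, 1] to [0, 1]
    return theta - lr * scale * grad_new

theta = np.zeros(3)
g_new = np.array([1.0, 0.0, 0.0])
aligned = hybrid_update(theta, g_new, np.array([1.0, 0.0, 0.0]))  # full step
opposed = hybrid_update(theta, g_new, np.array([-1.0, 0.0, 0.0]))  # no step
```

The two extremes of `scale` recover the initialized-training and frozen-training paradigms the paper unifies.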

[462] Regret-Guided Search Control for Efficient Learning in AlphaZero

Yun-Jui Tsai, Wei-Yu Chen, Yan-Ru Ju, Yu-Hung Chang, Ti-Rong Wu

Main category: cs.LG

TL;DR: RGSC enhances AlphaZero by using a regret network to identify high-regret states for more efficient training, outperforming baselines in multiple board games.

Motivation: RL agents require extensive self-play games while humans learn efficiently by revisiting mistakes. Current methods like Go-Exploit sample states equally without considering learning potential.

Method: Extends AlphaZero with a regret network that identifies high-regret states, where the agent's evaluation diverges from the actual outcome. Collects states from self-play trajectories and MCTS nodes, stores them in a prioritized regret buffer, and reuses them as starting positions.

Result: Outperforms AlphaZero and Go-Exploit by an average of 77 and 89 Elo, respectively, across 9x9 Go, 10x10 Othello, and 11x11 Hex. When continuing training from a well-trained 9x9 Go model, improves the win rate against KataGo from 69.3% to 78.2% while both baselines show no improvement.

Conclusion: RGSC provides effective mechanism for search control, improving both efficiency and robustness of AlphaZero training by focusing on high-regret states.

Abstract: Reinforcement learning (RL) agents achieve remarkable performance but remain far less learning-efficient than humans. While RL agents require extensive self-play games to extract useful signals, humans often need only a few games, improving rapidly by repeatedly revisiting states where mistakes occurred. This idea, known as search control, aims to restart from valuable states rather than always from the initial state. In AlphaZero, prior work Go-Exploit applies this idea by sampling past states from self-play or search trees, but it treats all states equally, regardless of their learning potential. We propose Regret-Guided Search Control (RGSC), which extends AlphaZero with a regret network that learns to identify high-regret states, where the agent’s evaluation diverges most from the actual outcome. These states are collected from both self-play trajectories and MCTS nodes, stored in a prioritized regret buffer, and reused as new starting positions. Across 9x9 Go, 10x10 Othello, and 11x11 Hex, RGSC outperforms AlphaZero and Go-Exploit by an average of 77 and 89 Elo, respectively. When training on a well-trained 9x9 Go model, RGSC further improves the win rate against KataGo from 69.3% to 78.2%, while both baselines show no improvement. These results demonstrate that RGSC provides an effective mechanism for search control, improving both efficiency and robustness of AlphaZero training. Our code is available at https://rlg.iis.sinica.edu.tw/papers/rgsc.
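The prioritized regret buffer at the core of RGSC can be sketched as follows; the regret proxy |value estimate - outcome| and the heap-based eviction are simplifications of the paper's learned regret network:

```python
import heapq
import random

class RegretBuffer:
    """Illustrative prioritized buffer: keep the states with the highest
    regret, where regret = |predicted value - actual game outcome|."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []       # min-heap of (regret, counter, state)
        self.counter = 0     # tie-breaker so states are never compared

    def add(self, state, value_estimate, outcome):
        regret = abs(value_estimate - outcome)
        item = (regret, self.counter, state)
        self.counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif regret > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)  # evict lowest-regret state

    def sample_start_state(self, rng=random):
        # Weight sampling by regret so high-regret states are revisited more.
        weights = [r for r, _, _ in self.heap]
        return rng.choices([s for _, _, s in self.heap], weights=weights)[0]

buf = RegretBuffer(capacity=3)
# (value_estimate, outcome) pairs; state2 is the most surprising position.
for i, (v, z) in enumerate([(0.9, 1.0), (0.1, 1.0), (-0.8, 1.0), (0.0, 0.0)]):
    buf.add(f"state{i}", v, z)
start = buf.sample_start_state()
```

In RGSC proper the starting position is drawn from such a buffer instead of always resetting to the game's initial state.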

[463] Transcoder Adapters for Reasoning-Model Diffing

Nathan Hu, Jake Ward, Thomas Icard, Christopher Potts

Main category: cs.LG

TL;DR: Transcoder adapters learn interpretable approximations of MLP computation differences between base and reasoning-fine-tuned models, enabling analysis of reasoning training effects on model internals.

Motivation: To understand how reasoning training affects model internal mechanisms, which remains poorly understood despite the ubiquity of reasoning models.

Method: Introduce transcoder adapters that learn interpretable approximations of MLP computation differences between base models and their reasoning-fine-tuned variants, applied to analyze Qwen2.5-Math-7B vs. DeepSeek-R1-Distill-Qwen-7B.

Result: Adapters are faithful to target model computation, recover 50-90% of reasoning accuracy gains, and reveal sparse activation patterns. Only ~8% of adapter features relate to reasoning behaviors, with ~2.4% specifically controlling hesitation token production.

Conclusion: Transcoder adapters provide insights into reasoning training effects and could be useful for broader fine-tuning analysis, revealing specific mechanisms like hesitation token control.

Abstract: While reasoning models are increasingly ubiquitous, the effects of reasoning training on a model’s internal mechanisms remain poorly understood. In this work, we introduce transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after fine-tuning. We apply transcoder adapters to characterize the differences between Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. Learned adapters are faithful to the target model’s internal computation and next-token predictions. When evaluated on reasoning benchmarks, adapters match the reasoning model’s response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning. Adapter features are sparsely activating and interpretable. When examining adapter features, we find that only ~8% have activating examples directly related to reasoning behaviors. We deeply study one such behavior – the production of hesitation tokens (e.g., “wait”). Using attribution graphs, we trace hesitation to only ~2.4% of adapter features (5.6k total) performing one of two functions. These features are necessary and sufficient for producing hesitation tokens; removing them reduces response length, often without affecting accuracy. Overall, our results provide insight into reasoning training and suggest transcoder adapters may be useful for studying fine-tuning more broadly.

[464] From Isolation to Integration: Building an Adaptive Expert Forest for Pre-Trained Model-based Class-Incremental Learning

Ruiqi Liu, Boyu Diao, Hangda Liu, Zhulin An, Fei Wang, Yongjun Xu

Main category: cs.LG

TL;DR: SAEF organizes adapters into a semantic-guided hierarchical forest for class-incremental learning, enabling better knowledge sharing between tasks while preventing catastrophic forgetting.

Motivation: Current CIL methods that freeze pre-trained models and train lightweight adapters for each task treat knowledge as an unstructured collection and fail to leverage relationships between tasks, limiting knowledge sharing and efficiency.

Method: SAEF organizes adapters into a structured hierarchy: 1) groups tasks into conceptual clusters based on semantic relationships, 2) builds balanced expert trees within each cluster by merging adapters of similar tasks, 3) at inference, activates relevant experts from the forest and combines their weighted outputs based on confidence.

Result: Experiments on several benchmark datasets show SAEF achieves state-of-the-art performance in class-incremental learning.

Conclusion: SAEF provides a structured approach to organizing adapters that enables better knowledge sharing between tasks while maintaining protection against catastrophic forgetting in incremental learning scenarios.

Abstract: Class-Incremental Learning (CIL) requires models to learn new classes without forgetting old ones. A common method is to freeze a pre-trained model and train a new, lightweight adapter for each task. While this prevents forgetting, it treats the learned knowledge as a simple, unstructured collection and fails to use the relationships between tasks. To this end, we propose the Semantic-guided Adaptive Expert Forest (SAEF), a new method that organizes adapters into a structured hierarchy for better knowledge sharing. SAEF first groups tasks into conceptual clusters based on their semantic relationships. Then, within each cluster, it builds a balanced expert tree by creating new adapters from merging the adapters of similar tasks. At inference time, SAEF finds and activates a set of relevant experts from the forest for any given input. The final prediction is made by combining the outputs of these activated experts, weighted by how confident each expert is. Experiments on several benchmark datasets show that SAEF achieves SOTA performance.

[465] On the Generalization Behavior of Deep Residual Networks From a Dynamical System Perspective

Jinshu Huang, Mingfei Sun, Chunlin Wu

Main category: cs.LG

TL;DR: Theoretical analysis establishing generalization error bounds for discrete- and continuous-time ResNets using dynamical systems framework, Rademacher complexity, and flow maps.

Motivation: To bridge the gap between discrete- and continuous-time residual networks by providing unified generalization error bounds that work for both settings under milder assumptions, addressing the lack of consistent theoretical understanding across these frameworks.

Method: Combines Rademacher complexity, flow maps of dynamical systems, and convergence behavior of ResNets in the deep-layer limit to derive generalization bounds. Uses dynamical system modeling approach to analyze both discrete- and continuous-time ResNets.

Result: Establishes generalization error bounds of order O(1/√S) with respect to number of training samples S, including a structure-dependent negative term. Provides depth-uniform and asymptotic generalization bounds under milder assumptions than previous work.

Conclusion: The work provides a unified theoretical understanding of generalization across both discrete- and continuous-time ResNets, closing gaps in sample complexity order and assumptions between the two settings.

Abstract: Deep neural networks (DNNs) have significantly advanced machine learning, with model depth playing a central role in their successes. The dynamical system modeling approach has recently emerged as a powerful framework, offering new mathematical insights into the structure and learning behavior of DNNs. In this work, we establish generalization error bounds for both discrete- and continuous-time residual networks (ResNets) by combining Rademacher complexity, flow maps of dynamical systems, and the convergence behavior of ResNets in the deep-layer limit. The resulting bounds are of order $O(1/\sqrt{S})$ with respect to the number of training samples $S$, and include a structure-dependent negative term, yielding depth-uniform and asymptotic generalization bounds under milder assumptions. These findings provide a unified understanding of generalization across both discrete- and continuous-time ResNets, helping to close the gap in both the order of sample complexity and assumptions between the discrete- and continuous-time settings.

[466] Hierarchic-EEG2Text: Assessing EEG-To-Text Decoding across Hierarchical Abstraction Levels

Anupam Sharma, Harish Katti, Prajwal Singh, Shanmuganathan Raman, Krishna Miyapuram

Main category: cs.LG

TL;DR: EEG-based object recognition using hierarchical semantic analysis and episodic learning framework on large-scale PEERS dataset

Motivation: To investigate whether EEG captures object representations across multiple hierarchical levels and address the challenge of recognizing fine-grained representations across many classes due to EEG's low signal-to-noise ratio.

Method: Proposes episodic analysis with hierarchy-aware episode sampling using WordNet to generate episodes with variable classes of diverse hierarchy, applied to the large PEERS dataset with 931,538 EEG samples under 1,610 object labels from 264 participants

Result: Models improve performance when classification categories are drawn from higher levels of the hierarchy, suggesting sensitivity to abstraction level in EEG decoding

Conclusion: Abstraction depth is an underexplored dimension of EEG decoding that warrants future research, with the episodic framework enabling comprehensive analysis of neural dynamics

Abstract: An electroencephalogram (EEG) records the spatially averaged electrical activity of neurons in the brain, measured from the human scalp. Prior studies have explored EEG-based classification of objects or concepts, often for passive viewing of briefly presented image or video stimuli, with limited classes. Because EEG exhibits a low signal-to-noise ratio, recognizing fine-grained representations across a large number of classes remains challenging; however, abstract-level object representations may exist. In this work, we investigate whether EEG captures object representations across multiple hierarchical levels, and propose episodic analysis, in which a Machine Learning (ML) model is evaluated across various, yet related, classification tasks (episodes). Unlike prior episodic EEG studies that rely on fixed or randomly sampled classes of equal cardinality, we adopt hierarchy-aware episode sampling using WordNet to generate episodes with variable classes of diverse hierarchy. We also present the largest episodic framework in the EEG domain for detecting observed text from EEG signals in the PEERS dataset, comprising $931538$ EEG samples under $1610$ object labels, acquired from $264$ human participants (subjects) performing controlled cognitive tasks, enabling the study of neural dynamics underlying perception, decision-making, and performance monitoring. We examine how the semantic abstraction level affects classification performance across multiple learning techniques and architectures, providing a comprehensive analysis. The models tend to improve performance when the classification categories are drawn from higher levels of the hierarchy, suggesting sensitivity to abstraction. Our work highlights abstraction depth as an underexplored dimension of EEG decoding and motivates future research in this direction.
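The hierarchy-aware episode sampling can be illustrated with a toy hierarchy standing in for WordNet (all labels below are hypothetical): classes are drawn from a chosen abstraction level, and each class pools the leaf labels beneath it:

```python
import random

# Toy hierarchy standing in for WordNet (hypothetical labels).
HIERARCHY = {
    "object": ["animal", "vehicle"],
    "animal": ["dog", "cat", "bird"],
    "vehicle": ["car", "truck"],
}

def leaves(node):
    """All leaf labels under a node."""
    children = HIERARCHY.get(node)
    if not children:
        return [node]
    out = []
    for c in children:
        out.extend(leaves(c))
    return out

def sample_episode(level_node, n_way, rng):
    """Sample an n-way episode whose classes are the subtrees of level_node,
    so class granularity follows the chosen abstraction level."""
    classes = HIERARCHY[level_node]
    chosen = rng.sample(classes, min(n_way, len(classes)))
    # Each episode class pools all leaf labels under that subtree.
    return {c: leaves(c) for c in chosen}

rng = random.Random(0)
coarse = sample_episode("object", 2, rng)   # classes: animal vs vehicle
fine = sample_episode("animal", 2, rng)     # classes among dog/cat/bird
```

Evaluating the same EEG decoder on `coarse` versus `fine` episodes is what lets the paper measure sensitivity to abstraction depth.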

[467] Extending $μ$P: Spectral Conditions for Feature Learning Across Optimizers

Akshita Gupta, Marieme Ngom, Sam Foreman, Venkatram Vishwanath

Main category: cs.LG

TL;DR: Novel framework for deriving maximal update parameterization (μP) for various optimizers using spectral conditions, enabling zero-shot learning rate transfer across model sizes.

Motivation: Optimizer hyperparameter tuning is computationally expensive for large language models, and existing μP methods are limited to SGD/Adam. Need a more general framework to extend μP to modern optimizers like AdamW, LAMB, Sophia, etc.

Method: Proposes a novel framework using spectral conditions as an alternative to tensor programming to derive μP for a broader class of optimizers including AdamW, ADOPT, LAMB, Sophia, Shampoo and Muon.

Result: Successfully implements μP derivations on multiple benchmark models and demonstrates zero-shot learning rate transfer across increasing model width for the optimizers studied.

Conclusion: Provides a generalizable approach to extend μP to modern optimizers, enabling efficient hyperparameter transfer across model scales and offering empirical insights into depth-scaling parameterization.

Abstract: Several variations of adaptive first-order and second-order optimization methods have been proposed to accelerate and scale the training of large language models. The performance of these optimization routines is highly sensitive to the choice of hyperparameters (HPs), which are computationally expensive to tune for large-scale models. Maximal update parameterization ($μ$P) is a set of scaling rules which aims to make the optimal HPs independent of the model size, thereby allowing the HPs tuned on a smaller (computationally cheaper) model to be transferred to train a larger, target model. Despite promising results for SGD and Adam, deriving $μ$P for other optimizers is challenging because the underlying tensor programming approach is difficult to grasp. Building on recent work that introduced spectral conditions as an alternative to tensor programs, we propose a novel framework to derive $μ$P for a broader class of optimizers, including AdamW, ADOPT, LAMB, Sophia, Shampoo and Muon. We implement our $μ$P derivations on multiple benchmark models and demonstrate zero-shot learning rate transfer across increasing model width for the above optimizers. Further, we provide empirical insights into depth-scaling parameterization for these optimizers.
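The practical payoff of μP is a width-independent optimal learning rate. For Adam-like optimizers, the standard μP rule scales hidden-layer learning rates as 1/fan_in; the sketch below shows only that rule, a simplification of the spectral-condition framework rather than the paper's full derivation:

```python
def mup_lr(base_lr, base_fanin, fanin):
    """Illustrative muP-style rule for Adam-like optimizers: hidden-layer
    learning rate scales as 1/fan_in, so an lr tuned at a small base width
    transfers zero-shot to a wider model."""
    return base_lr * base_fanin / fanin

# lr tuned on a width-256 proxy model, transferred to width 1024:
lr_small = mup_lr(1e-3, base_fanin=256, fanin=256)   # unchanged at base width
lr_large = mup_lr(1e-3, base_fanin=256, fanin=1024)  # scaled down 4x
```

The paper's contribution is deriving the analogous scaling rules for AdamW, ADOPT, LAMB, Sophia, Shampoo, and Muon via spectral conditions instead of tensor programs.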

[468] Estimation of Confidence Bounds in Binary Classification using Wilson Score Kernel Density Estimation

Thorbjørn Mosekjær Iversen, Zebin Duan, Frederik Hagelskjær

Main category: cs.LG

TL;DR: A novel kernel-based method for estimating confidence bounds in binary classification, evaluated on selective classification tasks with vision foundation models.

Motivation: Deep learning classifiers have improved but lack reliable confidence bounds for critical inspection tasks, requiring statistical guarantees for deployment in safety-critical operations.

Method: Wilson Score Kernel Density Classification uses Wilson Score Kernel Density Estimator to estimate confidence bounds for Binomial experiments with conditionally varying success probabilities, applicable as classification head for any feature extractor.

Result: Method shows similar performance to Gaussian Process Classification but with lower computational complexity, evaluated on four datasets in selective classification context.

Conclusion: Proposed method provides reliable confidence estimation for binary classifiers, enabling deployment in critical operations where statistical guarantees are required.

Abstract: The performance and ease of use of deep learning-based binary classifiers have improved significantly in recent years. This has opened up the potential for automating critical inspection tasks, which have traditionally only been trusted to be done manually. However, the application of binary classifiers in critical operations depends on the estimation of reliable confidence bounds such that system performance can be ensured up to a given statistical significance. We present Wilson Score Kernel Density Classification, which is a novel kernel-based method for estimating confidence bounds in binary classification. The core of our method is the Wilson Score Kernel Density Estimator, which is a function estimator for estimating confidence bounds in Binomial experiments with conditionally varying success probabilities. Our method is evaluated in the context of selective classification on four different datasets, illustrating its use as a classification head of any feature extractor, including vision foundation models. Our proposed method shows similar performance to Gaussian Process Classification, but at a lower computational complexity.
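The core idea, Wilson score bounds computed from kernel-weighted counts, can be sketched as follows; the Gaussian kernel and the exact way the weights enter the interval are assumptions, not necessarily the paper's estimator:

```python
import math

def wilson_bounds(successes, n, z=1.96):
    """Wilson score interval for a Binomial proportion."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1.0 + z * z / n
    centre = p + z * z / (2.0 * n)
    spread = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

def wsk_bounds(x, xs, ys, bandwidth=1.0, z=1.96):
    """Sketch of a Wilson Score Kernel Density style estimate (assumed form):
    replace raw counts with Gaussian-kernel-weighted counts around the query
    point x, so the success probability may vary with the input."""
    weights = [math.exp(-((x - xi) / bandwidth) ** 2) for xi in xs]
    n_eff = sum(weights)                               # effective trials
    s_eff = sum(w * y for w, y in zip(weights, ys))    # effective successes
    return wilson_bounds(s_eff, n_eff, z)

lo, hi = wilson_bounds(45, 50)           # 45 correct out of 50 at 95%
lo2, hi2 = wsk_bounds(0.0, [-1.0, 0.0, 1.0], [1, 1, 0])
```

For selective classification, a sample would be accepted only when the lower bound `lo2` at its feature location clears the required significance threshold.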

[469] Does Order Matter : Connecting The Law of Robustness to Robust Generalization

Himadri Mandal, Vishnu Varadarajan, Jaee Ponde, Aritra Das, Mihir More, Debayan Gupta

Main category: cs.LG

TL;DR: Theoretical analysis connecting the law of robustness to robust generalization, showing robust interpolation requires Lipschitz functions and establishing bounds on required Lipschitz constants for low robust generalization error.

Motivation: To resolve the open problem posed by Bubeck and Sellke (2021) about the connection between the law of robustness and robust generalization, specifically investigating whether robust interpolation requires Lipschitz functions and how this relates to robust generalization performance.

Method: Introduce a nontrivial notion of robust generalization error and convert it into a lower bound on the expected Rademacher complexity of the induced robust loss class. Theoretical analysis connects robust interpolation requirements to Lipschitz constants, with experimental validation on MNIST to test scaling predictions.

Result: The analysis shows robust generalization does not change the order of Lipschitz constant required for smooth interpolation, recovering the Ω(n^{1/d}) regime of Wu et al. (2023). MNIST experiments show lower-bound Lipschitz constant scales as predicted by Wu et al., indicating Lipschitz constant must lie in a specific bounded range for low robust generalization error.

Conclusion: The paper resolves the connection between law of robustness and robust generalization, establishing that robust interpolation requires Lipschitz functions and providing bounds on required Lipschitz constants, with empirical validation supporting the theoretical predictions.

Abstract: Bubeck and Sellke (2021) pose as an open problem the connection between the law of robustness and robust generalization. The law of robustness states that overparameterization is necessary for models to interpolate robustly; in particular, robust interpolation requires the learned function to be Lipschitz. Robust generalization asks whether small robust training loss implies small robust test loss. We resolve this problem by explicitly connecting the two for arbitrary data distributions. Specifically, we introduce a nontrivial notion of robust generalization error and convert it into a lower bound on the expected Rademacher complexity of the induced robust loss class. Our bounds recover the $Ω(n^{1/d})$ regime of Wu et al. (2023) and show that, up to constants, robust generalization does not change the order of the Lipschitz constant required for smooth interpolation. We conduct experiments to probe the predicted scaling with dataset size and model capacity, testing whether empirical behavior aligns more closely with the predictions of Bubeck and Sellke (2021) or Wu et al. (2023). For MNIST, we find that the lower-bound Lipschitz constant scales on the order predicted by Wu et al. (2023). Informally, to obtain low robust generalization error, the Lipschitz constant must lie in a range that we bound, and the allowable perturbation radius is linked to the Lipschitz scale.

[470] MAST: A Multi-fidelity Augmented Surrogate model via Spatial Trust-weighting

Ahmed Mohamed Eisa Nasr, Haris Moazam Sheikh

Main category: cs.LG

TL;DR: MAST is a multi-fidelity surrogate modeling method that blends corrected low-fidelity data with high-fidelity predictions using explicit discrepancy modeling and distance-based weighting with closed-form variance propagation.

Motivation: Address the trade-off between computational cost and predictive accuracy in engineering/scientific computing by improving multi-fidelity surrogate modeling, which currently suffers from expensive training costs or unrealistic global correlation assumptions that fail to capture spatially-varying fidelity relationships.

Method: MAST blends corrected low-fidelity observations with high-fidelity predictions, trusting high-fidelity near observed samples and relying on corrected low-fidelity elsewhere. Uses explicit discrepancy modeling and distance-based weighting with closed-form variance propagation to produce a single heteroscedastic Gaussian process.

Result: MAST shows marked improvement over state-of-the-art techniques across multi-fidelity synthetic benchmarks. Maintains robust performance across varying total budget and fidelity gaps, where competing methods exhibit significant degradation or unstable behavior.

Conclusion: MAST provides an effective solution for multi-fidelity surrogate modeling that addresses limitations of existing methods, particularly under tight budget constraints and spatially-varying fidelity relationships.

Abstract: In engineering design and scientific computing, computational cost and predictive accuracy are intrinsically coupled. High-fidelity simulations provide accurate predictions but at substantial computational costs, while lower-fidelity approximations offer efficiency at the expense of accuracy. Multi-fidelity surrogate modelling addresses this trade-off by combining abundant low-fidelity data with sparse high-fidelity observations. However, existing methods suffer from expensive training cost or rely on global correlation assumptions that often fail in practice to capture how fidelity relationships vary across the input space, leading to poor performance particularly under tight budget constraints. We introduce MAST, a method that blends corrected low-fidelity observations with high-fidelity predictions, trusting high-fidelity near observed samples and relying on corrected low-fidelity elsewhere. MAST achieves this through explicit discrepancy modelling and distance-based weighting with closed-form variance propagation, producing a single heteroscedastic Gaussian process. Across multi-fidelity synthetic benchmarks, MAST shows a marked improvement over the current state-of-the-art techniques. Crucially, MAST maintains robust performance across varying total budget and fidelity gaps, conditions under which competing methods exhibit significant degradation or unstable behaviour.
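The trust-weighting idea can be sketched in one dimension; the Gaussian trust weight, nearest-neighbour high-fidelity surrogate, and constant discrepancy model below are all illustrative stand-ins for MAST's actual components:

```python
import numpy as np

def mast_predict(x, hf_x, hf_y, lf_model, discrepancy_model, lengthscale=1.0):
    """Illustrative distance-based trust weighting in the spirit of MAST:
    trust the high-fidelity data near observed samples, and fall back to
    discrepancy-corrected low fidelity elsewhere."""
    d = np.min(np.abs(hf_x - x))                 # distance to nearest HF sample
    w = np.exp(-(d / lengthscale) ** 2)          # trust weight in [0, 1]
    hf_pred = hf_y[np.argmin(np.abs(hf_x - x))]  # nearest-neighbour HF surrogate
    lf_corrected = lf_model(x) + discrepancy_model(x)
    return w * hf_pred + (1.0 - w) * lf_corrected

hf_x = np.array([0.0, 2.0])          # sparse, expensive HF observations
hf_y = np.array([1.0, 3.0])
lf = lambda x: x                     # cheap low-fidelity model
disc = lambda x: 0.5                 # learned HF-LF discrepancy (constant here)

near = mast_predict(0.0, hf_x, hf_y, lf, disc)   # at an HF sample: w = 1
far = mast_predict(10.0, hf_x, hf_y, lf, disc)   # far away: w ~ 0
```

MAST additionally propagates the variance of both branches in closed form, yielding a single heteroscedastic Gaussian process rather than the point estimate shown here.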

[471] T1: One-to-One Channel-Head Binding for Multivariate Time-Series Imputation

Dongik Park, Hyunwoo Ryu, Suahn Bae, Keondo Park, Hyung-Sin Kim

Main category: cs.LG

TL;DR: T1 introduces a CNN-Transformer hybrid with Channel-Head Binding for robust multivariate time series imputation, achieving SOTA performance with 46% MSE reduction.

Motivation: Existing time series imputation methods struggle with diverse missing patterns and heavy missingness, as corrupted temporal features hinder effective cross-variable information transfer, leading to suboptimal performance.

Method: T1 uses a CNN-Transformer hybrid architecture with Channel-Head Binding mechanism that creates one-to-one correspondence between CNN channels and attention heads, enabling selective information transfer and adaptive down-weighting of corrupted temporal patterns.

Result: Achieves state-of-the-art performance on 11 benchmark datasets, reducing MSE by 46% on average compared to second-best baseline, with strong gains under extreme sparsity (70% missing ratio).

Conclusion: T1 provides robust imputation through selective information transfer, generalizes to unseen missing patterns without retraining, and uses consistent hyperparameters across datasets.

Abstract: Imputing missing values in multivariate time series remains challenging, especially under diverse missing patterns and heavy missingness. Existing methods suffer from suboptimal performance as corrupted temporal features hinder effective cross-variable information transfer, amplifying reconstruction errors. Robust imputation requires both extracting temporal patterns from sparse observations within each variable and selectively transferring information across variables–yet current approaches excel at one while compromising the other. We introduce T1 (Time series imputation with 1-to-1 channel-head binding), a CNN-Transformer hybrid architecture that achieves robust imputation through Channel-Head Binding–a mechanism creating one-to-one correspondence between CNN channels and attention heads. This design enables selective information transfer: when missingness corrupts certain temporal patterns, their corresponding attention pathways adaptively down-weight based on remaining observable patterns while preserving reliable cross-variable connections through unaffected channels. Experiments on 11 benchmark datasets demonstrate that T1 achieves state-of-the-art performance, reducing MSE by 46% on average compared to the second-best baseline, with particularly strong gains under extreme sparsity (70% missing ratio). The model generalizes to unseen missing patterns without retraining and uses a consistent hyperparameter configuration across all datasets. The code is available at https://github.com/Oppenheimerdinger/T1.
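Channel-Head Binding can be illustrated with a minimal NumPy sketch (an assumed simplification of T1's architecture, not its actual implementation): each CNN channel runs its own attention head, so corruption in one channel's temporal pattern stays confined to that channel's pathway:

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_head_attention(feats):
    """One-to-one channel-head binding, sketched: each of the C channels gets
    its own attention head over the time axis.
    feats: array of shape (C, T, D) -- C CNN channels, T steps, D dims."""
    C, T, D = feats.shape
    out = np.empty_like(feats)
    for c in range(C):                        # head c sees only channel c
        q = k = v = feats[c]
        scores = q @ k.T / np.sqrt(D)         # (T, T) attention within channel
        out[c] = softmax(scores, axis=-1) @ v
    return out

x = np.random.default_rng(0).normal(size=(4, 8, 16))  # 4 channels, 8 steps
y = channel_head_attention(x)
```

Because head `c` only attends within channel `c`, down-weighting a missingness-corrupted channel cannot distort the attention patterns of the remaining channels.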

[472] PIME: Prototype-based Interpretable MCTS-Enhanced Brain Network Analysis for Disorder Diagnosis

Kunyu Zhang, Yanwu Yang, Jing Zhang, Xiangjie Shi, Shujian Yu

Main category: cs.LG

TL;DR: PIME is an interpretable framework for fMRI-based diagnosis that combines prototype-based classification with consistency training to extract minimal-sufficient explanatory subgraphs using Monte Carlo Tree Search.

Motivation: Standard deep learning methods for fMRI diagnosis struggle with noisy functional connectivity networks, and conventional post-hoc attribution methods lack reliability, potentially highlighting dataset-specific artifacts rather than meaningful brain patterns.

Method: Integrates prototype-based classification and consistency training with structural perturbations during learning to encourage structured latent space, then uses Monte Carlo Tree Search under prototype-consistent objective to extract compact minimal-sufficient explanatory subgraphs post-training.

Result: Achieves state-of-the-art performance on three benchmark fMRI datasets, identifies critical brain regions consistent with established neuroimaging findings, shows 90% reproducibility, and provides consistent explanations across different brain atlases.

Conclusion: PIME provides an effective interpretable framework for fMRI analysis that bridges intrinsic interpretability with minimal-sufficient subgraph optimization, offering reliable and reproducible explanations of brain connectivity patterns.

Abstract: Recent deep learning methods for fMRI-based diagnosis have achieved promising accuracy by modeling functional connectivity networks. However, standard approaches often struggle with noisy interactions, and conventional post-hoc attribution methods may lack reliability, potentially highlighting dataset-specific artifacts. To address these challenges, we introduce PIME, an interpretable framework that bridges intrinsic interpretability with minimal-sufficient subgraph optimization by integrating prototype-based classification and consistency training with structural perturbations during learning. This encourages a structured latent space and enables Monte Carlo Tree Search (MCTS) under a prototype-consistent objective to extract compact minimal-sufficient explanatory subgraphs post-training. Experiments on three benchmark fMRI datasets demonstrate that PIME achieves state-of-the-art performance. Furthermore, by constraining the search space via learned prototypes, PIME identifies critical brain regions that are consistent with established neuroimaging findings. Stability analysis shows 90% reproducibility and consistent explanations across atlases.

[473] Localized Dynamics-Aware Domain Adaptation for Off-Dynamics Offline Reinforcement Learning

Zhangjie Xia, Yu Yang, Pan Xu

Main category: cs.LG

TL;DR: LoDADA improves off-dynamics offline RL by clustering transitions and retaining only source data whose local dynamics match the target domain, outperforming both global and per-sample filtering methods.

Motivation: Existing off-dynamics offline RL methods either make overly coarse global assumptions about dynamics mismatch or use computationally expensive per-sample filtering, missing localized cross-domain similarities.

Method: Clusters transitions from source and target datasets, estimates cluster-level dynamics discrepancy via domain discrimination, retains source transitions from clusters with small discrepancy, and filters those with large discrepancy.

Result: Consistently outperforms state-of-the-art off-dynamics offline RL methods across environments with diverse global and local dynamics shifts by better leveraging localized distribution mismatch.

Conclusion: LoDADA provides a fine-grained and scalable data selection strategy that avoids overly coarse global assumptions and expensive per-sample filtering in off-dynamics offline RL.

Abstract: Off-dynamics offline reinforcement learning (RL) aims to learn a policy for a target domain using limited target data and abundant source data collected under different transition dynamics. Existing methods typically address dynamics mismatch either globally over the state space or via pointwise data filtering; these approaches can miss localized cross-domain similarities or incur high computational cost. We propose Localized Dynamics-Aware Domain Adaptation (LoDADA), which exploits localized dynamics mismatch to better reuse source data. LoDADA clusters transitions from source and target datasets and estimates cluster-level dynamics discrepancy via domain discrimination. Source transitions from clusters with small discrepancy are retained, while those from clusters with large discrepancy are filtered out. This yields a fine-grained and scalable data selection strategy that avoids overly coarse global assumptions and expensive per-sample filtering. We provide theoretical insights and extensive experiments across environments with diverse global and local dynamics shifts. Results show that LoDADA consistently outperforms state-of-the-art off-dynamics offline RL methods by better leveraging localized distribution mismatch.
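The cluster-level selection step can be sketched in a few lines. This is a toy version with hypothetical names: fixed cluster centers stand in for the learned clustering, and the fraction of source transitions in a cluster stands in for the paper's domain-discrimination estimate of discrepancy (a fraction near 0.5 means the two domains are locally indistinguishable).

```python
import numpy as np

def cluster_discrepancy_filter(src, tgt, centers, threshold=0.6):
    """Keep source transitions only from clusters where source and target
    data mix well; drop clusters dominated by source-only dynamics."""
    def assign(x):
        return np.argmin(np.linalg.norm(centers[None] - x[:, None], axis=2), axis=1)
    src_c, tgt_c = assign(src), assign(tgt)
    kept = []
    for c in range(len(centers)):
        n_s = int(np.sum(src_c == c))
        n_t = int(np.sum(tgt_c == c))
        if n_s == 0:
            continue
        src_frac = n_s / (n_s + n_t)  # crude domain-discriminator proxy
        if src_frac <= threshold:
            kept.append(src[src_c == c])
    return np.concatenate(kept) if kept else np.empty((0, src.shape[1]))

# Two clusters: one shared by both domains, one visited only by the source.
src = np.array([[0.1, 0.0], [0.0, 0.2], [5.1, 4.9], [4.8, 5.2]])
tgt = np.array([[0.2, 0.1], [0.1, 0.1], [0.0, 0.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
kept = cluster_discrepancy_filter(src, tgt, centers)
```

Only the two source transitions in the shared cluster survive; the source-only cluster is filtered out wholesale, which is what makes the selection cheaper than per-sample filtering.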

[474] ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning

Duowen Chen, Yan Wang

Main category: cs.LG

TL;DR: ProxyFL: A proxy-guided framework for Federated Semi-Supervised Learning that uses classifier weights as proxies to simultaneously address external (cross-client) and internal (within-client) data heterogeneity, improving performance and convergence.

Motivation: Federated Semi-Supervised Learning faces challenges from both external heterogeneity (data distribution discrepancy across clients) and internal heterogeneity (mismatch between labeled and unlabeled data within clients). Existing methods either struggle to precisely fit global distributions via direct weight aggregation or reduce data participation by filtering out low-confidence samples.

Method: ProxyFL uses classifier weights as learnable proxies to simulate category distributions. For external heterogeneity, it optimizes global proxies against outliers instead of direct weight aggregation. For internal heterogeneity, it reintroduces discarded samples using a positive-negative proxy pool to mitigate incorrect pseudo-label impacts.

Result: The method shows significant performance improvements and better convergence in Federated Semi-Supervised Learning settings, as demonstrated through experiments and theoretical analysis.

Conclusion: ProxyFL effectively addresses both external and internal heterogeneity in FSSL through a unified proxy-based approach, enabling better utilization of available data while maintaining privacy-preserving federated learning principles.

Abstract: Federated Semi-Supervised Learning (FSSL) aims to collaboratively train a global model across clients by leveraging partially-annotated local data in a privacy-preserving manner. In FSSL, data heterogeneity is a challenging issue, which exists both across clients and within clients. External heterogeneity refers to the data distribution discrepancy across different clients, while internal heterogeneity represents the mismatch between labeled and unlabeled data within clients. Most FSSL methods typically design fixed or dynamic parameter aggregation strategies to collect client knowledge on the server (external) and / or filter out low-confidence unlabeled samples to reduce mistakes in local client (internal). But, the former is hard to precisely fit the ideal global distribution via direct weights, and the latter results in fewer data participation into FL training. To this end, we propose a proxy-guided framework called ProxyFL that focuses on simultaneously mitigating external and internal heterogeneity via a unified proxy. I.e., we consider the learnable weights of classifier as proxy to simulate the category distribution both locally and globally. For external, we explicitly optimize global proxy against outliers instead of direct weights; for internal, we re-include the discarded samples into training by a positive-negative proxy pool to mitigate the impact of potentially-incorrect pseudo-labels. Insight experiments & theoretical analysis show our significant performance and convergence in FSSL.
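The core idea of treating classifier rows as class proxies can be sketched as follows. This is a hypothetical, simplified illustration: the real method optimizes global proxies against outliers and maintains a positive-negative proxy pool; here we only show proxy-based pseudo-labeling where low-margin samples are flagged rather than discarded.

```python
import numpy as np

def proxy_pseudo_label(feat, W, margin=0.1):
    """Treat each row of the classifier weight matrix W as a class proxy.
    A sample is labeled by its most similar (cosine) proxy; if the margin
    over the runner-up proxy is small, the sample is flagged as 'soft'
    instead of being thrown away, so it can still contribute to training."""
    W_n = W / np.linalg.norm(W, axis=1, keepdims=True)
    f_n = feat / np.linalg.norm(feat)
    sims = W_n @ f_n
    order = np.argsort(sims)[::-1]
    confident = (sims[order[0]] - sims[order[1]]) >= margin
    return int(order[0]), bool(confident)

# Two-class toy classifier: rows of W act as category proxies.
W = np.array([[2.0, 0.0], [0.0, 2.0]])
label, confident = proxy_pseudo_label(np.array([1.0, 0.1]), W)
```

A clearly class-0 feature yields `(0, True)`, while a near-ambiguous feature yields a low-confidence flag, mirroring the paper's motivation to re-include rather than drop such samples.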

[475] Scaling Vision Transformers: Evaluating DeepSpeed for Image-Centric Workloads

Huy Trinh, Rebecca Ma, Zeqi Yu, Tahsin Reza

Main category: cs.LG

TL;DR: Applying DeepSpeed distributed training framework to Vision Transformers to improve scalability and performance through systematic evaluation of training efficiency across GPU configurations.

Motivation: Vision Transformers face significant computational and memory challenges when scaling to large models, and DeepSpeed has proven effective for language models but hasn't been extensively applied to vision tasks.

Method: Evaluate intra- and inter-node training efficiency using DeepSpeed on multiple GPU configurations with datasets like CIFAR-10/100, analyzing distributed data parallelism effects on training speed, communication overhead, and scalability (strong/weak scaling).

Result: Systematic analysis identifies key software parameters (batch size, gradient accumulation) influencing distributed training performance, providing foundational basis for applying DeepSpeed to vision tasks.

Conclusion: DeepSpeed shows promise for scaling Vision Transformers, with future work needed to explore limitations and optimize distributed training pipelines for vision models.

Abstract: Vision Transformers (ViTs) have demonstrated remarkable potential in image processing tasks by utilizing self-attention mechanisms to capture global relationships within data. However, their scalability is hindered by significant computational and memory demands, especially for large-scale models with many parameters. This study aims to leverage DeepSpeed, a highly efficient distributed training framework that is commonly used for language models, to enhance the scalability and performance of ViTs. We evaluate intra- and inter-node training efficiency across multiple GPU configurations on various datasets like CIFAR-10 and CIFAR-100, exploring the impact of distributed data parallelism on training speed, communication overhead, and overall scalability (strong and weak scaling). By systematically varying software parameters, such as batch size and gradient accumulation, we identify key factors influencing performance of distributed training. The experiments in this study provide a foundational basis for applying DeepSpeed to image-related tasks. Future work will extend these investigations to deepen our understanding of DeepSpeed’s limitations and explore strategies for optimizing distributed training pipelines for Vision Transformers.
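Two quantities the study varies and measures can be pinned down precisely. The sketch below uses hypothetical function names; the effective-batch relation mirrors the usual data-parallel accounting (per-GPU micro-batch times gradient-accumulation steps times number of workers), and the efficiency formulas are the standard strong/weak scaling definitions, not anything specific to this paper's code.

```python
def effective_batch(micro_batch, grad_accum, world_size):
    """Global batch size in distributed data parallelism: per-GPU
    micro-batch x gradient-accumulation steps x data-parallel workers."""
    return micro_batch * grad_accum * world_size

def scaling_efficiency(t_base, t_n, n, weak=False):
    """Strong scaling: fixed total problem, ideal time t_base/n, so
    efficiency = (t_base / t_n) / n. Weak scaling: problem grows with n,
    ideal time stays flat, so efficiency = t_base / t_n directly."""
    speedup = t_base / t_n
    return speedup if weak else speedup / n
```

For example, a micro-batch of 16 with 4 accumulation steps on 8 GPUs gives a global batch of 512, and a job that takes 100 s on one GPU and 25 s on four achieves perfect strong-scaling efficiency of 1.0.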

[476] Probing Graph Neural Network Activation Patterns Through Graph Topology

Floriano Tori, Lorenzo Bini, Marco Sorbi, Stéphane Marchand-Maillet, Vincent Ginis

Main category: cs.LG

TL;DR: Graph curvature reveals topological bottlenecks, but Graph Transformers’ massive activations don’t concentrate on curvature extremes; instead, global attention exacerbates negative curvature, making curvature a diagnostic tool for graph learning failures.

Motivation: To understand how graph topology (specifically curvature notions that highlight bottlenecks and dense regions) interacts with Graph Neural Networks' learned preferences, particularly addressing artifacts like oversmoothing and oversquashing in message-passing paradigms.

Method: Use Massive Activations (extreme edge activation values in Graph Transformers) as probes to study correspondence between graph topology and GNN preferences. Analyze synthetic graphs and molecular benchmarks, and examine Long Range Graph Benchmark to identify curvature shifts.

Result: Massive activations do not preferentially concentrate on curvature extremes despite theoretical links to information flow. Global attention mechanisms cause a systemic curvature shift, drastically increasing negative curvature prevalence and exacerbating topological bottlenecks.

Conclusion: Curvature serves as a diagnostic probe for understanding when and why graph learning fails, revealing that global attention in Graph Transformers worsens topological bottlenecks rather than aligning with curvature-based theoretical expectations.

Abstract: Curvature notions on graphs provide a theoretical description of graph topology, highlighting bottlenecks and denser connected regions. Artifacts of the message passing paradigm in Graph Neural Networks, such as oversmoothing and oversquashing, have been attributed to these regions. However, it remains unclear how the topology of a graph interacts with the learned preferences of GNNs. Through Massive Activations, which correspond to extreme edge activation values in Graph Transformers, we probe this correspondence. Our findings on synthetic graphs and molecular benchmarks reveal that MAs do not preferentially concentrate on curvature extremes, despite their theoretical link to information flow. On the Long Range Graph Benchmark, we identify a systemic “curvature shift”: global attention mechanisms exacerbate topological bottlenecks, drastically increasing the prevalence of negative curvature. Our work reframes curvature as a diagnostic probe for understanding when and why graph learning fails.
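One common discrete curvature that behaves as the abstract describes is the augmented Forman-Ricci curvature (shown here as a sketch; the paper may use other or additional curvature notions). Bridges between dense regions come out strongly negative, while edges inside cliques come out positive.

```python
def forman_curvature(adj, u, v):
    """Augmented Forman-Ricci curvature of edge (u, v) in an unweighted
    graph given as {node: set_of_neighbors}:
        F = 4 - deg(u) - deg(v) + 3 * (#triangles through the edge).
    Strongly negative edges are bottleneck-like bridges; positive edges
    sit in densely connected regions."""
    triangles = len(adj[u] & adj[v])
    return 4 - len(adj[u]) - len(adj[v]) + 3 * triangles

# Two triangles joined by a bridge (edge 2-3): a textbook bottleneck.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
```

On this graph the bridge edge (2, 3) has curvature -2 while an intra-triangle edge such as (0, 1) has curvature +3, which is exactly the negative-curvature signature the paper reports global attention amplifying.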

[477] Ski Rental with Distributional Predictions of Unknown Quality

Qiming Cui, Michael Dinitz

Main category: cs.LG

TL;DR: Ski rental problem with distributional predictions using Earth Mover’s distance, achieving OPT + O(min(max(η,1)*√b, b log b)) bounds without needing to know prediction error η.

Motivation: The paper revisits the ski rental problem in the "algorithms with predictions" framework, moving from point predictions (number of ski days) to distributional predictions. This approach is more natural and potentially more powerful, as real-world predictions often come with uncertainty that can be better captured by distributions rather than single values.

Method: The authors develop algorithms for ski rental with distributional predictions, using Earth Mover’s (Wasserstein-1) distance η to measure prediction error between predicted distribution p̂ and true distribution p. The algorithm achieves bounds without requiring knowledge of the prediction error η, unlike previous robust optimization approaches.

Result: The main result shows an algorithm with expected cost at most OPT + O(min(max(η,1)*√b, b log b)), where OPT is the optimal policy cost for true distribution p, b is buying cost. This gives consistency O(√b) (when η=0) and robustness O(b log b) (when η→∞). Lower bounds show these tradeoffs are essentially tight.

Conclusion: Distributional predictions provide a more powerful framework for ski rental than point predictions, enabling algorithms that achieve optimal consistency-robustness tradeoffs without needing to know prediction error bounds. The Earth Mover’s distance is an effective measure for distributional prediction quality in this context.

Abstract: We revisit the central online problem of ski rental in the “algorithms with predictions” framework from the point of view of distributional predictions. Ski rental was one of the first problems to be studied with predictions, where a natural prediction is simply the number of ski days. But it is both more natural and potentially more powerful to think of a prediction as a distribution p̂ over the ski days. If the true number of ski days is drawn from some true (but unknown) distribution p, then we show as our main result that there is an algorithm with expected cost at most OPT + O(min(max(η, 1) * sqrt(b), b log b)), where OPT is the expected cost of the optimal policy for the true distribution p, b is the cost of buying, and η is the Earth Mover’s (Wasserstein-1) distance between p and p̂. Note that when η < o(sqrt(b)) this gives additive loss less than b (the trivial bound), and when η is arbitrarily large (corresponding to an extremely inaccurate prediction) we still do not pay more than O(b log b) additive loss. An implication of these bounds is that our algorithm has consistency O(sqrt(b)) (additive loss when the prediction error is 0) and robustness O(b log b) (additive loss when the prediction error is arbitrarily large). Moreover, we do not need to assume that we know (or have any bound on) the prediction error η, in contrast with previous work in robust optimization which assumes that we know this error. We complement this upper bound with a variety of lower bounds showing that it is essentially tight: not only can the consistency/robustness tradeoff not be improved, but our particular loss function cannot be meaningfully improved.
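To make the setup concrete, here is the naive baseline that simply trusts the predicted distribution p̂: enumerate "rent until day t, then buy" policies and pick the one with least expected cost under p̂ (rent cost 1 per day, buy cost b). This is a sketch for intuition only; the paper's actual algorithm additionally guarantees O(b log b) robustness when p̂ is badly wrong, which this trust-the-prediction baseline does not.

```python
def best_buy_day(p_hat, b):
    """Minimize expected cost of threshold policies under a predicted
    distribution p_hat[d-1] = P(exactly d ski days). 'Buy at start of
    day t' costs (t-1) + b if the season reaches day t, else just the
    rental days. t = len(p_hat) + 1 means 'never buy'."""
    n = len(p_hat)
    best_t, best_cost = None, float("inf")
    for t in range(1, n + 2):
        cost = 0.0
        for d in range(1, n + 1):
            c = d if d < t else (t - 1) + b
            cost += p_hat[d - 1] * c
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```

For instance, if p̂ is certain of 10 ski days and b = 4, buying immediately is optimal (expected cost 4); if p̂ is certain of 2 days and b = 10, never buying is optimal (expected cost 2).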

[478] SOM-VQ: Topology-Aware Tokenization for Interactive Generative Models

Alessandro Londei, Denise Lanzieri, Matteo Benati

Main category: cs.LG

TL;DR: SOM-VQ combines vector quantization with Self-Organizing Maps to create discrete codebooks with explicit low-dimensional topology, enabling interpretable human control through geometric manipulation of token space.

Motivation: Standard vector-quantized representations lack semantic structure in token space, limiting interpretable human control. The authors aim to create discrete representations with explicit topology that enables intuitive manipulation and human-in-the-loop control.

Method: SOM-VQ combines vector quantization with Self-Organizing Maps, using topology-aware updates that preserve neighborhood structure. This creates a codebook where nearby tokens on a learned grid correspond to semantically similar states, enabling direct geometric manipulation of the latent space.

Result: SOM-VQ produces more learnable token sequences while providing explicit navigable geometry in code space. The topological organization enables intuitive human-in-the-loop control where users can steer generation by manipulating distances in token space, achieving semantic alignment without frame-level constraints.

Conclusion: SOM-VQ provides a general framework for interpretable discrete representations applicable to interactive generative domains like music, gesture, and human motion, where topology-aware control is especially valuable.

Abstract: Vector-quantized representations enable powerful discrete generative models but lack semantic structure in token space, limiting interpretable human control. We introduce SOM-VQ, a tokenization method that combines vector quantization with Self-Organizing Maps to learn discrete codebooks with explicit low-dimensional topology. Unlike standard VQ-VAE, SOM-VQ uses topology-aware updates that preserve neighborhood structure: nearby tokens on a learned grid correspond to semantically similar states, enabling direct geometric manipulation of the latent space. We demonstrate that SOM-VQ produces more learnable token sequences in the evaluated domains while providing an explicit navigable geometry in code space. Critically, the topological organization enables intuitive human-in-the-loop control: users can steer generation by manipulating distances in token space, achieving semantic alignment without frame-level constraints. We focus on human motion generation - a domain where kinematic structure, smooth temporal continuity, and interactive use cases (choreography, rehabilitation, HCI) make topology-aware control especially natural - demonstrating controlled divergence and convergence from reference sequences through simple grid-based sampling. SOM-VQ provides a general framework for interpretable discrete representations applicable to music, gesture, and other interactive generative domains.
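The topology-aware update that distinguishes SOM-VQ from standard VQ-VAE codebook learning is the classic Self-Organizing Map rule: the best-matching code and its *grid* neighbors all move toward the input, with a Gaussian falloff in grid distance. The sketch below shows that rule in isolation (hypothetical names; the full method combines this with VQ training and a decoder, omitted here).

```python
import numpy as np

def som_update(codebook, x, lr=0.5, sigma=1.0):
    """One topology-aware update of a (H, W, D) grid of code vectors.
    The Gaussian neighborhood in grid coordinates is what makes nearby
    tokens end up semantically similar."""
    H, W, D = codebook.shape
    flat = codebook.reshape(-1, D)
    bmu = int(np.argmin(np.linalg.norm(flat - x, axis=1)))  # best-matching unit
    bi, bj = divmod(bmu, W)
    for i in range(H):
        for j in range(W):
            grid_d2 = (i - bi) ** 2 + (j - bj) ** 2
            h = np.exp(-grid_d2 / (2 * sigma ** 2))
            codebook[i, j] += lr * h * (x - codebook[i, j])
    return codebook

codebook = np.zeros((3, 3, 2))
codebook[0, 0] = [0.1, 0.1]          # make (0, 0) the best-matching unit
som_update(codebook, np.array([1.0, 1.0]))
```

After one update the winner and its grid neighbors have moved noticeably toward the input while far-away grid cells barely move, producing exactly the navigable geometry the paper exploits for human-in-the-loop control.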

[479] Sequential Counterfactual Inference for Temporal Clinical Data: Addressing the Time Traveler Dilemma

Jingya Cheng, Alaleh Azhir, Jiazi Tian, Hossein Estiri

Main category: cs.LG

TL;DR: Sequential Counterfactual Framework for clinical data that respects temporal dependencies in EHRs, distinguishing immutable vs. controllable features and modeling intervention propagation over time.

Motivation: Standard counterfactual inference methods assume feature independence and simultaneous modifiability, which are violated by longitudinal clinical data with temporal dependencies.

Method: Introduced Sequential Counterfactual Framework that distinguishes immutable features (chronic diagnoses) from controllable features (lab values) and models how interventions propagate through time in electronic health records.

Result: Applied to 2,723 COVID-19 patients, found 38-67% of patients with chronic conditions would require biologically impossible counterfactuals under naive methods. Identified cardiorenal cascade (CKD → AKI → HF) with relative risks of 2.27 and 1.19 at each step.

Conclusion: Framework transforms counterfactual explanation from static “what if this feature were different?” to dynamic “what if we had intervened earlier, and how would that propagate forward?” yielding clinically actionable insights grounded in biological plausibility.

Abstract: Counterfactual inference enables clinicians to ask “what if” questions about patient outcomes, but standard methods assume feature independence and simultaneous modifiability – assumptions violated by longitudinal clinical data. We introduce the Sequential Counterfactual Framework, which respects temporal dependencies in electronic health records by distinguishing immutable features (chronic diagnoses) from controllable features (lab values) and modeling how interventions propagate through time. Applied to 2,723 COVID-19 patients (383 Long COVID heart failure cases, 2,340 matched controls), we demonstrate that 38-67% of patients with chronic conditions would require biologically impossible counterfactuals under naive methods. We identify a cardiorenal cascade (CKD -> AKI -> HF) with relative risks of 2.27 and 1.19 at each step, illustrating temporal propagation that sequential – but not naive – counterfactuals can capture. Our framework transforms counterfactual explanation from “what if this feature were different?” to “what if we had intervened earlier, and how would that propagate forward?” – yielding clinically actionable insights grounded in biological plausibility.
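The two constraints at the heart of the framework, immutable features may never be edited and interventions propagate only forward in time, can be sketched directly. This toy code uses hypothetical names and a deliberately crude propagation rule; the paper's framework models propagation with learned temporal dynamics rather than simple value inheritance.

```python
def sequential_counterfactual(trajectory, immutable, intervention):
    """trajectory: list of per-visit feature dicts, ordered in time.
    immutable: feature names that may never be edited (chronic diagnoses).
    intervention: (visit_index, feature, new_value). The edit is applied
    at one visit and carried forward (never backward), avoiding the
    'time traveler' of editing the past."""
    t0, feat, val = intervention
    if feat in immutable:
        raise ValueError(f"{feat} is immutable; counterfactual not allowed")
    new_traj = [dict(v) for v in trajectory]
    baseline = trajectory[t0][feat]
    for t in range(t0, len(new_traj)):
        if new_traj[t][feat] == baseline:  # crude forward-propagation rule
            new_traj[t][feat] = val
    return new_traj

traj = [{"ckd": 1, "creatinine": 1.0},
        {"ckd": 1, "creatinine": 2.0},
        {"ckd": 1, "creatinine": 2.0}]
cf = sequential_counterfactual(traj, immutable={"ckd"},
                               intervention=(1, "creatinine", 1.2))
```

Intervening on the lab value at visit 1 changes visits 1 and 2 but leaves visit 0 untouched, and attempting to "undo" the chronic CKD diagnosis raises an error, the biologically impossible counterfactual that naive methods would happily generate.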

[480] The Diffusion Duality, Chapter II: $Ψ$-Samplers and Efficient Curriculum

Justin Deschenaux, Caglar Gulcehre, Subham Sekhar Sahoo

Main category: cs.LG

TL;DR: Predictor-Corrector samplers for discrete diffusion models improve sampling quality with more steps, outperforming ancestral sampling on language and image tasks, challenging the assumption that Masked diffusion is the future of diffusion-based language modeling.

Motivation: Uniform-state discrete diffusion models are good at few-step generation but their sampling quality plateaus with ancestral samplers as steps increase. The paper aims to develop better samplers that continue to improve with more sampling steps.

Method: Introduces a family of Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. Also develops a memory-efficient curriculum for Gaussian relaxation training phase.

Result: PC samplers outperform ancestral sampling on both language (lower generative perplexity on OpenWebText) and image modeling (better FID/IS scores on CIFAR10). Unlike conventional samplers, PC methods continue to improve with more sampling steps.

Conclusion: The findings challenge the assumption that Masked diffusion is the inevitable future of diffusion-based language modeling. The proposed methods offer improved sampling and more efficient training for discrete diffusion models.

Abstract: Uniform-state discrete diffusion models excel at few-step generation and guidance due to their ability to self-correct, making them preferred over autoregressive or Masked diffusion models in these settings. However, their sampling quality plateaus with ancestral samplers as the number of steps increases. We introduce a family of Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, our samplers outperform ancestral sampling on both language and image modeling, achieving lower generative perplexity at matched unigram entropy on OpenWebText and better FID/IS scores on CIFAR10. Crucially, unlike conventional samplers, our PC methods continue to improve with more sampling steps. Taken together, these findings call into question the assumption that Masked diffusion is the inevitable future of diffusion-based language modeling. Beyond sampling, we develop a memory-efficient curriculum for the Gaussian relaxation training phase, reducing training time by 25% and memory by 33% compared to Duo while maintaining comparable perplexity on OpenWebText and LM1B and strong downstream performance. We release code, checkpoints, and a video-tutorial on: https://s-sahoo.com/duo-ch2

[481] Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi

Main category: cs.LG

TL;DR: The paper analyzes the trade-off between pass@k and pass@1 metrics in LLM fine-tuning, showing that optimizing for pass@k can degrade pass@1 performance due to gradient conflicts from prompt interference.

Motivation: Pass@k is widely used for verifiable LLM tasks, but optimizing for it often degrades pass@1 performance, which remains important due to latency/cost constraints, imperfect verifiers, and need for reliable single-shot fallback.

Method: Theoretical characterization of when pass@k policy optimization reduces pass@1 through gradient conflict induced by prompt interference. Analysis shows pass@k gradients conflict with pass@1 gradients because pass@k optimization reweights prompts toward low-success prompts, which can rotate updates away from pass@1 direction.

Result: Demonstrates that pass@k policy gradients can conflict with pass@1 gradients due to implicit prompt reweighting toward negatively interfering prompts. Illustrated with LLM experiments on verifiable mathematical reasoning tasks.

Conclusion: The trade-off between pass@k and pass@1 metrics stems from gradient conflicts in policy optimization, with practical implications for LLM fine-tuning where single-shot performance remains operationally important.

Abstract: Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It defines success if any of $k$ independently sampled solutions passes a verifier. This multi-sample inference metric has motivated inference-aware fine-tuning methods that directly optimize pass@$k$. However, prior work reports a recurring trade-off: pass@k improves while pass@1 degrades under such methods. This trade-off is practically important because pass@1 often remains a hard operational constraint due to latency and cost budgets, imperfect verifier coverage, and the need for a reliable single-shot fallback. We study the origin of this trade-off and provide a theoretical characterization of when pass@k policy optimization can reduce pass@1 through gradient conflict induced by prompt interference. We show that pass@$k$ policy gradients can conflict with pass@1 gradients because pass@$k$ optimization implicitly reweights prompts toward low-success prompts; when these prompts are what we term negatively interfering, their upweighting can rotate the pass@k update direction away from the pass@1 direction. We illustrate our theoretical findings with large language model experiments on verifiable mathematical reasoning tasks.
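For readers less familiar with the metric under discussion: pass@k is usually measured with the standard unbiased estimator (this sketch shows the metric itself, not the paper's gradient analysis).

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: given n sampled solutions of which c pass
    the verifier, the probability that at least one of k drawn samples
    passes is 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that k = 1 recovers plain accuracy (pass@1 = c/n), and a prompt with only 1 pass out of 10 samples already scores 0.5 at k = 5 — which is why pass@k optimization upweights exactly these low-success prompts, the mechanism the paper identifies behind the pass@1 degradation.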

[482] Statistical Query Lower Bounds for Smoothed Agnostic Learning

Ilias Diakonikolas, Daniel M. Kane

Main category: cs.LG

TL;DR: SQ lower bounds for smoothed agnostic learning of halfspaces show d^Ω(1/σ²+log(1/ε)) complexity is required, nearly matching known upper bounds.

Motivation: The paper studies smoothed agnostic learning, where learners compete with the best classifier under Gaussian perturbations of inputs. The motivation is to understand the fundamental complexity of learning halfspaces in this smoothed model, particularly whether existing upper bounds using L₁-polynomial regression are optimal.

Method: The authors use Statistical Query (SQ) lower bounds and linear programming duality to find moment-matching hard distributions. They prove lower bounds on approximation degree for halfspaces by showing that finding low-degree approximating polynomials to smoothed versions of target functions is essentially required for L₁-polynomial regression to work.

Result: The main result is an SQ lower bound showing that any SQ algorithm for smoothed agnostic learning of halfspaces requires complexity d^Ω(1/σ²+log(1/ε)), which nearly matches the known upper bound of d^Õ(1/σ²)log(1/ε). This is the first non-trivial lower bound for this task.

Conclusion: The paper establishes that applying L₁-polynomial regression to smoothed functions is essentially optimal for smoothed agnostic learning of halfspaces, providing formal evidence that existing upper bounds are close to best possible.

Abstract: We study the complexity of smoothed agnostic learning, recently introduced by [CKKMS24], in which the learner competes with the best classifier in a target class under slight Gaussian perturbations of the inputs. Specifically, we focus on the prototypical task of agnostically learning halfspaces under subgaussian distributions in the smoothed model. The best known upper bound for this problem relies on $L_1$-polynomial regression and has complexity $d^{\tilde{O}(1/σ^2) \log(1/ε)}$, where $σ$ is the smoothing parameter and $ε$ is the excess error. Our main result is a Statistical Query (SQ) lower bound providing formal evidence that this upper bound is close to best possible. In more detail, we show that (even for Gaussian marginals) any SQ algorithm for smoothed agnostic learning of halfspaces requires complexity $d^{Ω(1/σ^{2}+\log(1/ε))}$. This is the first non-trivial lower bound on the complexity of this task and nearly matches the known upper bound. Roughly speaking, we show that applying $L_1$-polynomial regression to a smoothed version of the function is essentially best possible. Our techniques involve finding a moment-matching hard distribution by way of linear programming duality. This dual program corresponds exactly to finding a low-degree approximating polynomial to the smoothed version of the target function (which turns out to be the same condition required for the $L_1$-polynomial regression to work). Our explicit SQ lower bound then comes from proving lower bounds on this approximation degree for the class of halfspaces.
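The "smoothed version of the function" that both the upper and lower bounds revolve around has a closed form for a one-dimensional halfspace: under the perturbation x → x + σz with z ~ N(0, 1), E[sign(x + σz)] = 2Φ(x/σ) − 1 = erf(x/(σ√2)). The sketch below is just this standard identity, for intuition: the smaller σ is, the sharper the smoothed transition, and the higher the polynomial degree needed to approximate it, consistent with the d^Ω(1/σ²)-type bounds.

```python
from math import erf, sqrt

def smoothed_halfspace(x, sigma):
    """Gaussian smoothing of sign(x): E[sign(x + sigma * z)], z ~ N(0,1),
    equals erf(x / (sigma * sqrt(2))). As sigma -> 0 this approaches the
    hard threshold; as sigma grows it flattens and becomes easy to
    approximate with low-degree polynomials."""
    return erf(x / (sigma * sqrt(2)))
```

For example, at x = 1 the smoothed value is nearly 1 when σ = 0.1 but only about 0.38 when σ = 2, showing how the smoothing parameter trades sharpness for approximability.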

[483] Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking

Ravi Ghadia, Maksim Abraham, Sergei Vorobyov, Max Ryabinin

Main category: cs.LG

TL;DR: UPipe: A context parallelism technique that performs fine-grained chunking at the attention head level to reduce activation memory usage in Transformers, enabling much longer context lengths while maintaining training speed.

Motivation: Existing context parallelism methods (Ring Attention, DeepSpeed Ulysses) enable scaling over context dimension but lack memory efficiency, limiting supported sequence lengths. Advanced techniques like Fully Pipelined Distributed Transformer or activation offloading extend context length but reduce training throughput.

Method: UPipe performs fine-grained chunking at the attention head level, significantly reducing activation memory usage of self-attention. This breaks the activation memory barrier by reducing intermediate tensor memory usage in attention layers.

Result: Reduces intermediate tensor memory usage in attention layer by up to 87.5% for 32B Transformers while matching previous context parallelism techniques in training speed. Supports 5M token context length when training Llama3-8B on single 8×H100 node, improving over prior methods by over 25%.

Conclusion: UPipe is a simple yet effective context parallelism technique that enables much longer context lengths by reducing activation memory usage at the attention head level, maintaining training speed while significantly improving memory efficiency.

Abstract: Efficiently processing long sequences with Transformer models usually requires splitting the computations across accelerators via context parallelism. The dominant approaches in this family of methods, such as Ring Attention or DeepSpeed Ulysses, enable scaling over the context dimension but do not focus on memory efficiency, which limits the sequence lengths they can support. More advanced techniques, such as Fully Pipelined Distributed Transformer or activation offloading, can further extend the possible context length at the cost of training throughput. In this paper, we present UPipe, a simple yet effective context parallelism technique that performs fine-grained chunking at the attention head level. This technique significantly reduces the activation memory usage of self-attention, breaking the activation memory barrier and unlocking much longer context lengths. Our approach reduces intermediate tensor memory usage in the attention layer by as much as 87.5% for 32B Transformers, while matching previous context parallelism techniques in terms of training speed. UPipe can support the context length of 5M tokens when training Llama3-8B on a single 8×H100 node, improving upon prior methods by over 25%.
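The headwise-chunking idea itself is easy to show on a single device (a sketch only: the actual method applies this within Ulysses-style context parallelism across GPUs, with causal masking and pipelining, all omitted here). Computing attention for a few heads at a time means the (heads, T, T) score tensor never fully materializes, shrinking peak intermediate memory roughly by n_heads / heads_per_chunk while leaving the result unchanged.

```python
import numpy as np

def attention_headwise(Q, K, V, heads_per_chunk=1):
    """Multi-head attention over (H, T, d) tensors, processed one chunk of
    heads at a time. Numerically identical to the all-heads-at-once
    computation; only the peak size of the score tensor changes."""
    H, T, d = Q.shape
    out = np.empty_like(Q)
    for h0 in range(0, H, heads_per_chunk):
        q, k, v = (t[h0:h0 + heads_per_chunk] for t in (Q, K, V))
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = scores / scores.sum(axis=-1, keepdims=True)
        out[h0:h0 + heads_per_chunk] = attn @ v
    return out
```

Chunking by 1 head versus all 4 heads at once produces the same output to floating-point precision, which is why the technique can cut memory without a speed/quality trade-off.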

[484] Test-Time Training with KV Binding Is Secretly Linear Attention

Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li

Main category: cs.LG

TL;DR: TTT with KV binding is not test-time memorization but learned linear attention, enabling architectural simplifications and parallel formulations.

Motivation: The paper challenges the common interpretation of test-time training (TTT) with KV binding as online meta-learning that memorizes key-value mappings, revealing contradictory phenomena that suggest a different underlying mechanism.

Method: Revisits TTT formulation to show that a broad class of TTT architectures can be expressed as learned linear attention operators, enabling principled architectural simplifications and fully parallel formulations.

Result: The linear attention perspective explains previously puzzling model behaviors, preserves performance while improving efficiency through parallelization, and provides systematic reduction of diverse TTT variants to standard linear attention form.

Conclusion: TTT should be reframed not as test-time memorization but as learned linear attention with enhanced representational capacity, offering both theoretical insights and practical benefits.

Abstract: Test-time training (TTT) with KV binding as sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a form of learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity.
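The reduction claimed here is easiest to see for plain (unnormalized) causal linear attention, where a recurrent state that binds keys to values is exactly equivalent to a masked parallel matrix computation. The sketch below demonstrates only that generic equivalence under our own assumptions; it is not the paper's TTT layer or its learned operator.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention as a recurrence: the state
    S_t = S_{t-1} + outer(k_t, v_t) accumulates key->value bindings,
    and token t reads out o_t = q_t @ S_t."""
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    out = np.empty((T, V.shape[1]))
    for t in range(T):
        S = S + np.outer(K[t], V[t])  # bind key t to value t in the state
        out[t] = Q[t] @ S             # query the accumulated state
    return out

def linear_attention_parallel(Q, K, V):
    """The same operator in one shot: out = (mask * (Q K^T)) @ V,
    with a lower-triangular causal mask."""
    T = Q.shape[0]
    mask = np.tril(np.ones((T, T)))   # token t attends to positions <= t
    return (mask * (Q @ K.T)) @ V
```

Both forms compute o_t = sum over s <= t of (q_t . k_s) v_s, so they agree exactly; the parallel form trades the sequential loop for one masked matrix product, which is the kind of "fully parallel formulation" the abstract refers to.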

[485] Detecting and Mitigating Group Bias in Heterogeneous Treatment Effects

Joel Persson, Jurriën Bakker, Dennis Bohle, Stefan Feuerriegel, Florian von Wangenheim

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2602.20383: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20383&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[486] Minibatch Optimal Transport and Perplexity Bound Estimation in Discrete Flow Matching

Etrit Haxholli, Yeti Z. Gurbuz, Ogul Can, Eli Waxman

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2411.00759: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.00759&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[487] Characterizing LLM Inference Energy-Performance Tradeoffs across Workloads and GPU Scaling

Paul Joe Maliakel, Shashikant Ilager, Ivona Brandic

Main category: cs.LG

TL;DR: Unable to fetch paper summary for 2501.08219 due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2501.08219: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.08219&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[488] Universality of Benign Overfitting in Binary Linear Classification

Ichiro Hashimoto, Stanislav Volgushev, Piotr Zwiernik

Main category: cs.LG

TL;DR: Failed to fetch summary for paper 2501.10538 due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2501.10538: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.10538&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[489] Improving the Convergence of Private Shuffled Gradient Methods with Public Data

Shuli Jiang, Pranay Sharma, Zhiwei Steven Wu, Gauri Joshi

Main category: cs.LG

TL;DR: Unable to analyze paper 2502.03652 due to HTTP 429 error when fetching abstract from arXiv API

Abstract: Failed to fetch summary for 2502.03652: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.03652&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[490] Armijo Line-search Can Make (Stochastic) Gradient Descent Provably Faster

Sharan Vaswani, Reza Babanezhad

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2503.00229: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.00229&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[491] A Survey on Federated Fine-tuning of Large Language Models

Yebo Wu, Chunlin Tian, Jingguang Li, He Sun, Kahou Tam, Zhanting Zhou, Haicheng Liao, Jing Xiong, Zhijiang Guo, Li Li, Chengzhong Xu

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2503.12016: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.12016&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[492] Towards Trustworthy GUI Agents: A Survey

Yucheng Shi, Wenhao Yu, Jingyuan Huang, Wenlin Yao, Wenhu Chen, Ninghao Liu

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2503.23434: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.23434&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[493] Regularity and Stability Properties of Selective SSMs with Discontinuous Gating

Nikola Zubić, Davide Scaramuzza

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2505.11602: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.11602&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[494] RefLoRA: Refactored Low-Rank Adaptation for Efficient Fine-Tuning of Large Models

Yilang Zhang, Bingcong Li, Georgios B. Giannakis

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2505.18877: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.18877&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[495] SuperMAN: Interpretable and Expressive Networks over Temporally Sparse Heterogeneous Data

Maya Bechler-Speicher, Andrea Zerio, Maor Huri, Marie Vibeke Vestergaard, Ran Gilad-Bachrach, Tine Jess, Samir Bhatt, Aleksejs Sazonovs

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2505.19193: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.19193&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[496] Wasserstein Barycenter Soft Actor-Critic

Zahra Shahrooei, Ali Baheri

Main category: cs.LG

TL;DR: Failed to fetch summary for paper 2506.10167 due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2506.10167: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.10167&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[497] Foundation Models for Causal Inference via Prior-Data Fitted Networks

Yuchen Ma, Dennis Frauen, Emil Javurek, Stefan Feuerriegel

Main category: cs.LG

TL;DR: Unable to fetch paper summary for 2506.10914 due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2506.10914: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.10914&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[498] FREQuency ATTribution: benchmarking frequency-based occlusion for time series data

Dominique Mercier, Andreas Dengel, Sheraz Ahmed

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2506.18481: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.18481&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[499] Transfer Learning in Infinite Width Feature Learning Networks

Clarissa Lauditi, Blake Bordelon, Cengiz Pehlevan

Main category: cs.LG

TL;DR: Paper 2507.04448: Unable to fetch abstract due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2507.04448: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.04448&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[500] A hierarchy tree data structure for behavior-based user segment representation

Yang Liu, Xuejiao Kang, Sathya Iyer, Idris Malik, Ruixuan Li, Juan Wang, Xinchen Lu, Xiangxue Zhao, Dayong Wang, Menghan Liu, Isaac Liu, Feng Liang, Yinzhe Yu

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2508.01115: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.01115&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[501] One-Step Flow Q-Learning: Addressing the Diffusion Policy Bottleneck in Offline Reinforcement Learning

Thanh Nguyen, Chang D. Yoo

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2508.13904: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.13904&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[502] Uncertainty Propagation Networks for Neural Ordinary Differential Equations

Hadi Jahanshahi, Zheng H. Zhu

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2508.16815 could not be retrieved from arXiv API.

Abstract: Failed to fetch summary for 2508.16815: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.16815&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[503] From Samples to Scenarios: A New Paradigm for Probabilistic Forecasting

Xilin Dai, Zhijian Xu, Wanxu Cai, Qiang Xu

Main category: cs.LG

TL;DR: Paper ID 2509.19975 could not be analyzed due to HTTP 429 error when fetching from arXiv API

Abstract: Failed to fetch summary for 2509.19975: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.19975&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[504] Why High-rank Neural Networks Generalize?: An Algebraic Framework with RKHSs

Yuka Hashimoto, Sho Sonoda, Isao Ishikawa, Masahiro Ikeda

Main category: cs.LG

TL;DR: Unable to analyze paper 2509.21895 due to HTTP 429 error when fetching abstract from arXiv API

Abstract: Failed to fetch summary for 2509.21895: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21895&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[505] Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models

Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, Moksh Jain

Main category: cs.LG

TL;DR: Paper 2509.26626: Unable to fetch summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2509.26626: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26626&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[506] Cautious Weight Decay

Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) - paper ID 2510.12402 cannot be analyzed

Abstract: Failed to fetch summary for 2510.12402: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12402&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[507] TeamFormer: Shallow Parallel Transformers with Progressive Approximation

Wei Wang, Xiao-Yong Wei, Qing Li

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2510.15425: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15425&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[508] Predicting Metabolic Dysfunction-Associated Steatotic Liver Disease using Machine Learning Methods: A Retrospective Cohort Study

Mary E. An, Paul Griffin, Jonathan G. Stine, Ramakrishna Balakrishnan, Soundar Kumara

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2510.22293: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.22293&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[509] ContextPilot: Fast Long-Context Inference via Context Reuse

Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, Luo Mai

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2511.03475: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.03475&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[510] Metabolomic Biomarker Discovery for ADHD Diagnosis Using Interpretable Machine Learning

Nabil Belacel, Mohamed Rachid Boulassel

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2601.11283: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11283&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[511] SAGE-5GC: Security-Aware Guidelines for Evaluating Anomaly Detection in the 5G Core Network

Cristian Manca, Christian Scano, Giorgio Piras, Fabio Brau, Maura Pintor, Battista Biggio

Main category: cs.LG

TL;DR: Paper 2602.03596: Unable to fetch summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2602.03596: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.03596&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[512] LORE: Jointly Learning the Intrinsic Dimensionality and Relative Similarity Structure From Ordinal Data

Vivek Anand, Alec Helbling, Mark A. Davenport, Gordon J. Berman, Sankaraleengam Alagapan, Christopher John Rozell

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting).

Abstract: Failed to fetch summary for 2602.04192: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.04192&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[513] Towards Robust Scaling Laws for Optimizers

Alexandra Volkova, Mher Safaryan, Christoph H. Lampert, Dan Alistarh

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2602.07712: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07712&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[514] MUSE: Multi-Tenant Model Serving With Seamless Model Updates

Cláudio Correia, Alberto E. A. Ferreira, Lucas Martins, Miguel P. Bento, Sofia Guerreiro, Ricardo Ribeiro Pereira, Ana Sofia Gomes, Jacopo Bono, Hugo Ferreira, Pedro Bizarro

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2602.11776: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11776&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[515] Divine Benevolence is an $x^2$: GLUs scale asymptotically faster than MLPs

Alejandro Francisco Queiruga

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2602.14495: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14495&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[516] ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset

DatologyAI, Aldo Gael Carranza, Kaleigh Mentzer, Ricardo Pio Monti, Alex Fang, Alvin Deng, Amro Abbas, Anshuman Suri, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Diego Kiner, Fan Pan, Haakon Mongstad, Haoli Yin, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Luke Merrick, Maximilian Böther, Parth Doshi, Paul Burstein, Pratyush Maini, Rishabh Adaiga, Sid Joshi, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2602.15210: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15210&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[517] Neural Proposals, Symbolic Guarantees: Neuro-Symbolic Graph Generation with Hard Constraints

Chuqin Geng, Li Zhang, Mark Zhang, Haolin Ye, Ziyu Zhao, Xujie Si

Main category: cs.LG

TL;DR: Unable to analyze paper 2602.16954 due to HTTP 429 error when fetching summary from arXiv API

Abstract: Failed to fetch summary for 2602.16954: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16954&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[518] Multi-Probe Zero Collision Hash (MPZCH): Mitigating Embedding Collisions and Enhancing Model Freshness in Large-Scale Recommenders

Ziliang Zhao, Bi Xue, Emma Lin, Mengjiao Zhou, Kaustubh Vartak, Shakhzod Ali-Zade, Tianqi Lu, Tao Li, Bin Kuang, Rui Jian, Bin Wen, Dennis van der Staay, Yixin Bao, Eddy Li, Chao Deng, Songbin Liu, Qifan Wang, Kai Ren

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2602.17050: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17050&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[519] A Theoretical Framework for Modular Learning of Robust Generative Models

Corinna Cortes, Mehryar Mohri, Yutao Zhong

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2602.17554: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17554&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[520] Multi-Round Human-AI Collaboration with User-Specified Requirements

Sima Noorani, Shayan Kiyani, Hamed Hassani, George Pappas

Main category: cs.LG

TL;DR: Unable to fetch paper details for 2602.17646 due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2602.17646: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17646&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[521] Classification and reconstruction for single-pixel imaging with classical and quantum neural networks

Sofya Manko, Dmitry Frolovtsev

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2407.12506: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.12506&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[522] Statistical Inference for Temporal Difference Learning with Linear Function Approximation

Weichen Wu, Gen Li, Yuting Wei, Alessandro Rinaldo

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2410.16106: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.16106&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[523] Big data approach to Kazhdan-Lusztig polynomials

Abel Lacabanne, Daniel Tubbenhauer, Pedro Vaz

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2412.01283: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.01283&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[524] Tightening Optimality gap with confidence through conformal prediction

Miao Li, Michael Klamkin, Russell Bent, Pascal Van Hentenryck

Main category: cs.LG

Summary unavailable: the arXiv API request for 2503.04071 was rate-limited (HTTP 429).

[525] Sharp Gaussian approximations for Decentralized Federated Learning

Soham Bonnerjee, Sayar Karmakar, Wei Biao Wu

Main category: cs.LG

Summary unavailable: the arXiv API request for 2505.08125 was rate-limited (HTTP 429).

[526] A Copula Based Supervised Filter for Feature Selection in Diabetes Risk Prediction Using Machine Learning

Agnideep Aich, Md Monzur Murshed, Sameera Hewage, Amanda Mayeaux

Main category: cs.LG

Summary unavailable: the arXiv API request for 2505.22554 was rate-limited (HTTP 429).

[527] Feedback-driven recurrent quantum neural network universality

Lukas Gonon, Rodrigo Martínez-Peña, Juan-Pablo Ortega

Main category: cs.LG

Summary unavailable: the arXiv API request for 2506.16332 was rate-limited (HTTP 429).

[528] Stochastic Quantum Spiking Neural Networks with Quantum Memory and Local Learning

Jiechen Chen, Bipin Rajendran, Osvaldo Simeone

Main category: cs.LG

Summary unavailable: the arXiv API request for 2506.21324 was rate-limited (HTTP 429).

[529] Morphology-Aware Peptide Discovery via Masked Conditional Generative Modeling

Nuno Costa, Julija Zavadlav

Main category: cs.LG

Summary unavailable: the arXiv API request for 2509.02060 was rate-limited (HTTP 429).

[530] A Spectral Framework for Graph Neural Operators: Convergence Guarantees and Tradeoffs

Roxanne Holden, Luana Ruiz

Main category: cs.LG

Summary unavailable: the arXiv API request for 2510.20954 was rate-limited (HTTP 429).

[531] Two-dimensional RMSD projections for reaction path visualization and validation

Rohit Goswami

Main category: cs.LG

Summary unavailable: the arXiv API request for 2512.07329 was rate-limited (HTTP 429).

[532] Distribution-informed Online Conformal Prediction

Dongjian Hu, Junxi Wu, Shu-Tao Xia, Changliang Zou

Main category: cs.LG

Summary unavailable: the arXiv API request for 2512.07770 was rate-limited (HTTP 429).

[533] Universality of Many-body Projected Ensemble for Learning Quantum Data Distribution

Quoc Hoan Tran, Koki Chinzei, Yasuhiro Endo, Hirotaka Oshima

Main category: cs.LG

Summary unavailable: the arXiv API request for 2601.18637 was rate-limited (HTTP 429).

[534] Analysis of Shuffling Beyond Pure Local Differential Privacy

Shun Takagi, Seng Pei Liew

Main category: cs.LG

Summary unavailable: the arXiv API request for 2601.19154 was rate-limited (HTTP 429).

[535] Flow-Based Conformal Predictive Distributions

Trevor Harris

Main category: cs.LG

Summary unavailable: the arXiv API request for 2602.07633 was rate-limited (HTTP 429).

[536] The Wisdom of Many Queries: Complexity-Diversity Principle for Dense Retriever Training

Xincan Feng, Noriki Nishida, Yusuke Sakai, Yuji Matsumoto

Main category: cs.LG

Summary unavailable: the arXiv API request for 2602.09448 was rate-limited (HTTP 429).

[537] Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko

Main category: cs.LG

Summary unavailable: the arXiv API request for 2602.20156 was rate-limited (HTTP 429).

cs.MA

[538] HieraMAS: Optimizing Intra-Node LLM Mixtures and Inter-Node Topology for Multi-Agent Systems

Tianjun Yao, Zhaoyi Li, Zhiqiang Shen

Main category: cs.MA

TL;DR: HieraMAS: A hierarchical multi-agent system framework that combines intra-node LLM mixtures with inter-node communication topology, using supernodes with propose-synthesis structures and two-stage optimization for better performance.

Motivation: Existing multi-agent systems treat each agent as a single LLM unit, missing opportunities to use mixtures of LLMs within agents to strengthen role-specific abilities. Current approaches only improve one aspect at a time (communication topology, role assignment, or LLM routing).

Method: HieraMAS introduces supernodes where each functional role is implemented by multiple heterogeneous LLMs using a propose-synthesis structure. It uses a two-stage optimization algorithm: (1) multi-level reward attribution for fine-grained feedback at node and system levels, and (2) graph classification for holistic topology selection rather than optimizing edges individually.

Result: Experiments on reasoning and coding benchmarks show HieraMAS substantially outperforms existing methods while delivering better cost-performance trade-offs.

Conclusion: HieraMAS demonstrates that hierarchical collaboration combining intra-node LLM mixtures with inter-node communication topology significantly improves multi-agent system performance, addressing credit assignment challenges through novel optimization techniques.

Abstract: Multi-agent systems (MAS) built on large language models (LLMs) have shown strong performance across many tasks. Most existing approaches improve only one aspect at a time, such as the communication topology, role assignment, or LLM routing, while treating each agent as a single, indivisible unit. This misses the opportunity to use mixtures of LLMs within an agent to strengthen role-specific abilities. We propose HieraMAS, a hierarchical collaboration framework that combines intra-node LLM mixtures with an inter-node communication topology. HieraMAS introduces supernodes, where each functional role is implemented by multiple heterogeneous LLMs using a propose-synthesis structure. Optimizing HieraMAS creates unique credit-assignment challenges: final task performance depends heavily on the underlying LLMs’ capabilities, which can lead reinforcement methods to incorrectly reward suboptimal configurations. To address this, we use a two-stage algorithm: (1) multi-level reward attribution, which provides fine-grained feedback at both the node level and the overall system level; (2) graph classification for topology selection, which treats choosing the communication structure as a holistic decision rather than optimizing edges one by one. Experiments on reasoning and coding benchmarks show that HieraMAS substantially outperforms existing methods while also delivering better cost-performance trade-offs.
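The propose-synthesis supernode the abstract describes can be sketched in a few lines. The model callables and the longest-candidate merge rule below are illustrative stand-ins, not HieraMAS's actual LLMs or synthesis policy:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SuperNode:
    # One functional role, served by several heterogeneous proposer models
    # plus a synthesizer that merges their candidates (propose-synthesis).
    proposers: List[Callable[[str], str]]
    synthesizer: Callable[[str, List[str]], str]

    def run(self, task: str) -> str:
        candidates = [propose(task) for propose in self.proposers]
        return self.synthesizer(task, candidates)

# Toy stand-ins for LLM calls (assumptions, not the paper's models).
def short_llm(task: str) -> str:
    return task.upper()

def long_llm(task: str) -> str:
    return task + "!"

def merge(task: str, candidates: List[str]) -> str:
    # Trivial synthesis rule for illustration: keep the longest candidate.
    return max(candidates, key=len)

node = SuperNode(proposers=[short_llm, long_llm], synthesizer=merge)
print(node.run("solve 2+2"))  # → solve 2+2!
```

A full system would wire many such supernodes into an inter-node communication graph, which the paper then selects holistically via graph classification.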

[539] Adaptive Evolutionary Framework for Safe, Efficient, and Cooperative Autonomous Vehicle Interactions

Yijun Lu, Zhen Tian, Zhihao Lin

Main category: cs.MA

Summary unavailable: the arXiv API request for 2509.07411 was rate-limited (HTTP 429).

[540] Noise-enabled goal attainment in crowded collectives

Lucy Liu, Justin Werfel, Federico Toschi, L. Mahadevan

Main category: cs.MA

Summary unavailable: the arXiv API request for 2507.08100 was rate-limited (HTTP 429).

cs.MM

[541] SPP-SCL: Semi-Push-Pull Supervised Contrastive Learning for Image-Text Sentiment Analysis and Beyond

Jiesheng Wu, Shengrong Li

Main category: cs.MM

TL;DR: SPP-SCL method balances intra-modal and inter-modal sentiment relationships using a two-step supervised contrastive learning approach for improved image-text sentiment analysis.

Motivation: Existing ITSA methods suffer from inconsistent intra-modal and inter-modal sentiment relationships, leading to vision-language imbalance that affects sentiment analysis accuracy.

Method: Proposes Semi-Push-Pull Supervised Contrastive Learning (SPP-SCL) with a two-step strategy: 1) intra-modal supervised contrastive learning to pull intra-modal relationships, 2) conditional execution statement that triggers inter-modal supervised contrastive learning to push away inter-modal relationships when needed, balancing both relationships before cross-modal feature fusion.

Result: Experimental studies on three public image-text sentiment and sarcasm detection datasets show SPP-SCL significantly outperforms state-of-the-art methods by a large margin and demonstrates better sentiment discriminative ability.

Conclusion: SPP-SCL effectively addresses vision-language imbalance in sentiment relationships through balanced intra-modal and inter-modal learning, achieving superior performance in image-text sentiment analysis.

Abstract: Existing Image-Text Sentiment Analysis (ITSA) methods may suffer from inconsistent intra-modal and inter-modal sentiment relationships. Therefore, we develop a method that balances before fusing to solve the issue of vision-language imbalance in intra-modal and inter-modal sentiment relationships; that is, a Semi-Push-Pull Supervised Contrastive Learning (SPP-SCL) method is proposed. Specifically, the method is implemented using a novel two-step strategy: first, the proposed intra-modal supervised contrastive learning pulls together relationships within each modality, and then a well-designed conditional execution statement is evaluated. If the statement result is false, our method performs the second step, inter-modal supervised contrastive learning, to push apart relationships between modalities. The two-step strategy balances the intra-modal and inter-modal relationships to achieve relationship consistency, and finally performs cross-modal feature fusion for sentiment analysis and detection. Experimental studies on three public image-text sentiment and sarcasm detection datasets demonstrate that SPP-SCL significantly outperforms state-of-the-art methods by a large margin and is more discriminative in sentiment.
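The two-step semi-push-pull control flow can be sketched minimally. The squared-distance pull/push terms and the threshold-based condition below are illustrative assumptions, not the paper's exact losses:

```python
import numpy as np

def pull_loss(feats: np.ndarray, labels: np.ndarray) -> float:
    """Mean squared distance between same-label pairs (intra-modal pull)."""
    loss, pairs = 0.0, 0
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            if labels[i] == labels[j]:
                loss += float(np.sum((feats[i] - feats[j]) ** 2))
                pairs += 1
    return loss / max(pairs, 1)

def semi_push_pull(img, txt, labels, threshold=0.5):
    # Step 1: pull same-sentiment samples together within each modality.
    intra = pull_loss(img, labels) + pull_loss(txt, labels)
    # Conditional execution statement (assumed form): skip step 2 when the
    # intra-modal relationships are already consistent enough.
    if intra < threshold:
        return intra
    # Step 2: push apart image-text pairs with different sentiment labels.
    push = 0.0
    for i in range(len(img)):
        for j in range(len(txt)):
            if labels[i] != labels[j]:
                push += max(0.0, 1.0 - float(np.sum((img[i] - txt[j]) ** 2)))
    return intra + push

img = np.array([[0.0, 0.0], [1.0, 1.0]])
txt = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1])
print(semi_push_pull(img, txt, labels))  # intra-modal pairs already consistent
```

Only after this balancing step would cross-modal feature fusion run for the final sentiment prediction.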

eess.AS

[542] Training-Free Intelligibility-Guided Observation Addition for Noisy ASR

Haoyang Li, Changsong Liu, Wei Rao, Hao Shi, Sakriani Sakti, Eng Siong Chng

Main category: eess.AS

TL;DR: Training-free intelligibility-guided observation addition method improves ASR in noisy environments by fusing noisy and enhanced speech using backend ASR-derived intelligibility estimates

Motivation: Speech enhancement front-ends suppress noise but introduce artifacts that harm ASR performance. Existing observation addition methods require trained neural predictors, increasing complexity and limiting generalization.

Method: Proposes intelligibility-guided observation addition where fusion weights are derived directly from backend ASR intelligibility estimates, making it training-free. Compares frame vs utterance-level OA and explores switching-based alternatives.

Result: Extensive experiments across diverse SE-ASR combinations and datasets show strong robustness and improvements over existing OA baselines. Training-free approach reduces complexity and enhances generalization.

Conclusion: Intelligibility-guided OA using backend ASR estimates provides effective, training-free solution for improving ASR in noisy environments, outperforming prior methods that require neural predictors.

Abstract: Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addresses this issue by fusing noisy and SE-enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, where fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods based on trained neural predictors, the proposed method is training-free, reducing complexity and enhancing generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines. Additional analyses of intelligibility-guided switching-based alternatives and frame- versus utterance-level OA further validate the proposed design.
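At its core, observation addition is a convex combination of the noisy and enhanced signals. A minimal sketch, assuming the intelligibility estimate is a scalar in [0, 1] supplied by the backend ASR (the paper's estimate derivation is not reproduced here):

```python
import numpy as np

def observation_addition(noisy: np.ndarray, enhanced: np.ndarray,
                         intelligibility: float) -> np.ndarray:
    """Weight the enhanced signal by its estimated intelligibility;
    the remainder of the weight falls back on the noisy observation."""
    w = float(np.clip(intelligibility, 0.0, 1.0))
    return w * enhanced + (1.0 - w) * noisy

noisy = np.array([1.0, 1.0, 1.0])
enhanced = np.array([0.0, 0.0, 0.0])

# Low intelligibility of the enhanced signal → mostly trust the noisy input.
fused = observation_addition(noisy, enhanced, 0.25)
print(fused)
```

Frame-level OA would apply this per STFT frame with a weight vector instead of a single utterance-level scalar.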

[543] UBGAN: Enhancing Coded Speech with Blind and Guided Bandwidth Extension

Kishan Gupta, Srikanth Korse, Andreas Brendel, Nicola Pia, Guillaume Fuchs

Main category: eess.AS

TL;DR: UBGAN is a lightweight GAN-based bandwidth extension method that converts 8kHz wideband speech to 16kHz super-wideband, enhancing perceptual quality for both conventional and neural codecs with minimal additional bitrate.

Motivation: Speech codecs face trade-offs between quality, bitrate, and complexity, typically operating at wideband (8kHz). Bandwidth extension can enhance perceptual quality, but neural codecs are often inflexible and trained at fixed sampling rates, lacking modular solutions.

Method: Proposes Universal Bandwidth Extension GAN (UBGAN) operating in subband domain. Two variants: guided-UBGAN transmits quantized learned representations as side information at low bitrate; blind-UBGAN operates without side information. Lightweight GAN architecture enhances operational flexibility.

Result: Subjective assessments show UBGAN improves perceptual quality when applied to WB codecs. Demonstrates generalization across multiple codecs and bitrates. Successfully extends bandwidth from 8kHz to 16kHz while maintaining low computational overhead.

Conclusion: UBGAN provides a modular, lightweight solution for bandwidth extension that enhances speech codec flexibility and perceptual quality without significant computational or bitrate overhead, applicable to both conventional and neural codecs.

Abstract: In practical application of speech codecs, a multitude of factors such as the quality of the radio connection, limiting hardware or required user experience necessitate trade-offs between achievable perceptual quality, engendered bitrate and computational complexity. Most conventional and neural speech codecs operate on wideband (WB) speech signals to achieve this compromise. To further enhance the perceptual quality of coded speech, bandwidth extension (BWE) of the transmitted speech is an attractive and popular technique in conventional speech coding. In contrast, neural speech codecs are typically trained end-to-end to a specific set of requirements and are often not easily adaptable. In particular, they are typically trained to operate at a single fixed sampling rate. With the Universal Bandwidth Extension Generative Adversarial Network (UBGAN), we propose a modular and lightweight GAN-based solution that increases the operational flexibility of a wide range of conventional and neural codecs. Our model operates in the subband domain and extends the bandwidth of WB signals from 8 kHz to 16 kHz, resulting in super-wideband (SWB) signals. We further introduce two variants, guided-UBGAN and blind-UBGAN, where the guided version transmits a quantized learned representation as side information at a very low bitrate in addition to the bitrate of the codec, while blind-UBGAN operates without such side information. Our subjective assessments demonstrate the advantage of UBGAN applied to WB codecs and highlight the generalization capacity of our proposed method across multiple codecs and bitrates.

[544] Binaural Target Speaker Extraction using Individualized HRTF

Yoav Ellinson, Sharon Gannot

Main category: eess.AS

TL;DR: A speaker-independent binaural target-speaker extraction method using listener’s HRTF and complex-valued neural networks for preserving spatial cues while isolating target speech in multi-talker scenarios.

Motivation: To address binaural target-speaker extraction in multi-talker environments without relying on speaker embeddings, while preserving the spatial cues and directionality of the target speaker for natural auditory perception.

Method: Uses listener’s Head-Related Transfer Function (HRTF) to isolate target speaker. Implements fully complex-valued neural network operating directly on complex STFT of mixed audio signals, compared to Real-Imaginary based networks. Evaluated in both anechoic and reverberant conditions.

Result: Achieves excellent extraction performance in anechoic conditions while preserving binaural cues. Maintains robustness in reverberant conditions with speech clarity, source directionality, and reduced reverberation. Comparable to state-of-the-art in noise reduction and perceptual quality with advantage in binaural cue preservation.

Conclusion: The proposed HRTF-based, speaker-independent approach effectively extracts target speech in multi-talker scenarios while preserving spatial auditory cues, demonstrating advantages of complex-valued neural networks over RI-based approaches for binaural processing.

Abstract: In this work, we address the problem of binaural target-speaker extraction in the presence of multiple simultaneous talkers. We propose a novel approach that leverages the individual listener’s Head-Related Transfer Function (HRTF) to isolate the target speaker. The proposed method is speaker-independent, as it does not rely on speaker embeddings. We employ a fully complex-valued neural network that operates directly on the complex-valued Short-Time Fourier Transform (STFT) of the mixed audio signals, and compare it to a Real-Imaginary (RI)-based neural network, demonstrating the advantages of the former. We first evaluate the method in an anechoic, noise-free scenario, achieving excellent extraction performance while preserving the binaural cues of the target signal. We then extend the evaluation to reverberant conditions. Our method proves robust, maintaining speech clarity and source directionality while simultaneously reducing reverberation. A comparative analysis with existing binaural Target Speaker Extraction (TSE) methods shows that the proposed approach achieves performance comparable to state-of-the-art techniques in terms of noise reduction and perceptual quality, while providing a clear advantage in preserving binaural cues. Demo-page: https://bi-ctse-hrtf.github.io
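To illustrate why a complex-valued network is a natural fit here, a sketch of complex ratio masking on a two-ear STFT; the shapes and the fixed mask below are assumptions for illustration, not the paper's learned model:

```python
import numpy as np

rng = np.random.default_rng(0)
freq_bins, frames = 4, 3

# Binaural mixture spectra: one complex STFT per ear.
mix = {ear: rng.normal(size=(freq_bins, frames))
            + 1j * rng.normal(size=(freq_bins, frames))
       for ear in ("left", "right")}

# A complex ratio mask can rotate phase as well as scale magnitude, which is
# exactly what a fully complex-valued network can express (an RI network must
# learn this coupling between real and imaginary parts indirectly).
mask = {ear: 0.5 * np.exp(1j * 0.1) * np.ones((freq_bins, frames))
        for ear in ("left", "right")}

# Each ear gets its own mask, so interaural level and phase differences
# (the binaural cues) can be preserved rather than collapsed.
target = {ear: mask[ear] * mix[ear] for ear in ("left", "right")}

print(abs(target["left"][0, 0]) / abs(mix["left"][0, 0]))  # ≈ 0.5
```

In the actual system the per-ear masks would be predicted from the mixture together with the listener's individualized HRTF.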

[545] MSR-Codec: A Low-Bitrate Multi-Stream Residual Codec for High-Fidelity Speech Generation with Information Disentanglement

Jingyu Li, Guangyan Zhang, Zhen Ye, Yiwen Guo

Main category: eess.AS

TL;DR: A low-bitrate multi-scale residual audio codec that disentangles speech into semantic, timbre, prosody, and residual streams, enabling high-fidelity reconstruction and effective TTS/voice conversion with minimal data.

Motivation: Audio codecs are essential for speech generation systems, but existing approaches often lack efficient disentanglement of speech components at low bitrates, limiting their effectiveness for controllable speech synthesis and voice conversion.

Method: Proposes a multi-scale residual codec architecture that encodes speech into four disentangled streams: semantic, timbre, prosody, and residual. Uses this codec to build a two-stage language model for TTS synthesis with lightweight design and minimal data requirements.

Result: Achieves state-of-the-art Word Error Rate (WER) and superior speaker similarity compared to larger models. The codec enables effective voice conversion with independent manipulation of speaker timbre and prosody.

Conclusion: The proposed codec provides efficient speech representation with inherent disentanglement capabilities, enabling high-quality speech generation and manipulation tasks with minimal computational and data requirements.

Abstract: Audio codecs are a critical component of modern speech generation systems. This paper introduces a low-bitrate, multi-scale residual codec that encodes speech into four distinct streams: semantic, timbre, prosody, and residual. This architecture achieves high-fidelity speech reconstruction at competitive low bitrates while demonstrating an inherent ability for information disentanglement. We construct a two-stage language model for text-to-speech (TTS) synthesis using this codec, which, despite its lightweight design and minimal data requirements, achieves a state-of-the-art Word Error Rate (WER) and superior speaker similarity compared to several larger models. Furthermore, the codec’s design proves highly effective for voice conversion, enabling independent manipulation of speaker timbre and prosody. Our inference code, pre-trained models, and audio samples are available at https://github.com/herbertLJY/MSRCodec.
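The multi-stream residual idea, in which each stream encodes what earlier streams missed, can be sketched with a toy scalar quantizer. The four named streams and the rounding quantizer below are illustrative assumptions, not the actual codec:

```python
import numpy as np

def quantize(x: np.ndarray, step: float) -> np.ndarray:
    """Crude uniform quantizer standing in for a learned codebook."""
    return np.round(x / step) * step

def encode_residual_streams(x: np.ndarray, steps) -> list:
    """Coarse-to-fine residual coding: each stream quantizes what is left
    over after subtracting every earlier stream's reconstruction."""
    streams, residual = [], x.copy()
    for step in steps:
        q = quantize(residual, step)
        streams.append(q)
        residual = residual - q  # pass the leftover to the next stream
    return streams

x = np.array([0.93, -1.48, 0.21])
streams = encode_residual_streams(x, steps=[1.0, 0.25, 0.0625, 0.015625])
recon = sum(streams)  # decoding = summing the streams
print(np.max(np.abs(x - recon)))  # error shrinks with each added stream
```

Disentanglement in MSR-Codec goes further: the four streams are trained to carry semantic, timbre, prosody, and residual content specifically, which is what enables independent timbre/prosody manipulation for voice conversion.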

[546] Evaluating CNN with Stacked Feature Representations and Audio Spectrogram Transformer Models for Sound Classification

Parinaz Binandeh Dehaghani, Danilo Pena, A. Pedro Aguiar

Main category: eess.AS

TL;DR: CNN-based environmental sound classification using stacked acoustic features (Log-Mel, Spectral Contrast, Chroma, Tonnetz, MFCCs, GTCC) outperforms transformer models in resource-constrained scenarios.

Motivation: ESC has important applications but often faces resource constraints. Feature stacking can enhance CNN performance by aggregating complementary acoustic descriptors, potentially offering a more efficient alternative to transformer models that require large-scale pretraining.

Method: CNN models with various stacked feature combinations (LM, SPC, CH, TZ, MFCCs, GTCC) are evaluated on ESC-50 and UrbanSound8K datasets. Experiments include pretraining on ESC-50, fine-tuning on UrbanSound8K, and comparison with Audio Spectrogram Transformer models pretrained on AudioSet.

Result: Feature-stacked CNNs provide a more computationally and data-efficient alternative to transformer-based models when large-scale pretraining or extensive training data are unavailable, making them suitable for resource-constrained and edge-level scenarios.

Conclusion: Stacked feature CNNs offer practical advantages over transformers for environmental sound classification in resource-limited settings, balancing performance with computational efficiency.

Abstract: Environmental sound classification (ESC) has gained significant attention due to its diverse applications in smart city monitoring, fault detection, acoustic surveillance, and manufacturing quality control. To enhance CNN performance, feature stacking techniques have been explored to aggregate complementary acoustic descriptors into richer input representations. In this paper, we investigate CNN-based models employing various stacked feature combinations, including Log-Mel Spectrogram (LM), Spectral Contrast (SPC), Chroma (CH), Tonnetz (TZ), Mel-Frequency Cepstral Coefficients (MFCCs), and Gammatone Cepstral Coefficients (GTCC). Experiments are conducted on the widely used ESC-50 and UrbanSound8K datasets under different training regimes, including pretraining on ESC-50, fine-tuning on UrbanSound8K, and comparison with Audio Spectrogram Transformer (AST) models pretrained on large-scale corpora such as AudioSet. This experimental design enables an analysis of how feature-stacked CNNs compare with transformer-based models under varying levels of training data and pretraining diversity. The results indicate that feature-stacked CNNs offer a more computationally and data-efficient alternative when large-scale pretraining or extensive training data are unavailable, making them particularly well suited for resource-constrained and edge-level sound classification scenarios.
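Feature stacking as described reduces to concatenating frame-aligned descriptor matrices along the feature axis to build one richer CNN input. The random matrices below are stand-ins for real log-Mel/MFCC/chroma/Tonnetz features, and the per-feature standardization is an assumed (common) preprocessing choice:

```python
import numpy as np

rng = np.random.default_rng(1)
frames = 100

# Stand-ins for per-frame acoustic descriptors over the same frame grid.
features = {
    "log_mel": rng.normal(size=(64, frames)),
    "mfcc": rng.normal(size=(20, frames)),
    "chroma": rng.normal(size=(12, frames)),
    "tonnetz": rng.normal(size=(6, frames)),
}

def stack_features(feats: dict) -> np.ndarray:
    mats = list(feats.values())
    assert len({m.shape[1] for m in mats}) == 1, "frame counts must match"
    # Standardize each descriptor before stacking so scales stay comparable.
    mats = [(m - m.mean()) / (m.std() + 1e-8) for m in mats]
    return np.concatenate(mats, axis=0)  # (total_feature_dims, frames)

stacked = stack_features(features)
print(stacked.shape)  # (102, 100)
```

The stacked matrix is then fed to the CNN as a single multi-channel "image", which is what lets complementary descriptors compensate for limited training data.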

[547] Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios

Yiming Yang, Guangyong Wang, Haixin Guan, Yanhua Long

Main category: eess.AS

TL;DR: Enroll-on-Wakeup (EoW) framework uses wake-word segments as enrollment references for target speech extraction, eliminating need for pre-recorded speech and enabling seamless human-machine interaction.

Motivation: Traditional target speech extraction requires pre-recorded high-quality enrollment speech, which disrupts user experience and limits feasibility in spontaneous interactions. The paper aims to create a more natural and seamless interaction by using wake-word segments captured during normal usage.

Method: Proposes Enroll-on-Wakeup (EoW) framework where wake-word segments are automatically used as enrollment references. Evaluates advanced discriminative and generative TSE models under diverse acoustic conditions. Investigates enrollment augmentation using LLM-based TTS to address challenges of short and noisy wake-word segments.

Result: Current TSE models face performance degradation in EoW-TSE due to short and noisy wake-word segments. TTS-based assistance significantly enhances listening experience, but gaps remain in speech recognition accuracy compared to traditional enrollment methods.

Conclusion: EoW framework enables more natural human-machine interaction by eliminating need for pre-recorded enrollment speech. While TTS augmentation helps, further improvements are needed to bridge the performance gap in speech recognition accuracy for practical applications.

Abstract: Target speech extraction (TSE) typically relies on pre-recorded high-quality enrollment speech, which disrupts user experience and limits feasibility in spontaneous interaction. In this paper, we propose Enroll-on-Wakeup (EoW), a novel framework where the wake-word segment, captured naturally during human-machine interaction, is automatically utilized as the enrollment reference. This eliminates the need for pre-collected speech to enable a seamless experience. We perform the first systematic study of EoW-TSE, evaluating advanced discriminative and generative models under real diverse acoustic conditions. Given the short and noisy nature of wake-word segments, we investigate enrollment augmentation using LLM-based TTS. Results show that while current TSE models face performance degradation in EoW-TSE, TTS-based assistance significantly enhances the listening experience, though gaps remain in speech recognition accuracy.
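The core idea of EoW can be sketched in a few lines: once the wake-word detector reports a time span, that slice of the waveform becomes the enrollment reference. A hypothetical sketch, not the paper's code; the tiling of short segments is a crude stand-in for the LLM-TTS augmentation the paper actually studies:

```python
import numpy as np

def enrollment_from_wakeword(audio: np.ndarray, start: int, end: int,
                             min_len: int) -> np.ndarray:
    """Use the detected wake-word span as the TSE enrollment reference.
    Segments shorter than min_len are tiled to that length."""
    seg = audio[start:end]
    if len(seg) < min_len:
        reps = int(np.ceil(min_len / len(seg)))
        seg = np.tile(seg, reps)[:min_len]
    return seg

sr = 16000
audio = np.random.randn(5 * sr)            # 5 s of toy audio
enroll = enrollment_from_wakeword(audio, start=sr, end=int(1.5 * sr),
                                  min_len=3 * sr)
print(len(enroll) / sr)  # 3.0
```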

eess.IV

[548] AINet: Anchor Instances Learning for Regional Heterogeneity in Whole Slide Image

Tingting Zheng, Hongxun Yao, Kui Jiang, Sicheng Zhao, Yi Xiao

Main category: eess.IV

TL;DR: AINet introduces anchor instances for WSI analysis, using dual-level anchor mining and anchor-guided region correction to handle tumor heterogeneity with fewer parameters.

Motivation: Whole slide image analysis faces challenges due to tumor sparsity and morphological diversity, leading to heterogeneous regions that make it difficult to aggregate discriminative representations.

Method: Proposes AINet with two modules: Dual-level Anchor Mining (DAM) selects representative anchor instances at both local and global levels, and Anchor-guided Region Correction (ARC) uses complementary information from all regions to correct each regional representation.

Result: AINet surpasses state-of-the-art methods with substantially fewer FLOPs and parameters, and the modules can be integrated into existing MIL frameworks to consistently improve performance.

Conclusion: The anchor instance concept provides an effective way to handle heterogeneity in WSI analysis, offering a modular approach that improves existing MIL frameworks while being computationally efficient.

Abstract: Recent advances in multi-instance learning (MIL) have witnessed impressive performance in whole slide image (WSI) analysis. However, the inherent sparsity of tumors and their morphological diversity lead to obvious heterogeneity across regions, posing significant challenges in aggregating high-quality and discriminative representations. To address this, we introduce a novel concept of anchor instance (AI), a compact subset of instances that are representative within their regions (local) and discriminative at the bag (global) level. These AIs act as semantic references to guide interactions across regions, correcting non-discriminative patterns while preserving regional diversity. Specifically, we propose a dual-level anchor mining (DAM) module to select AIs from massive instances, where the most informative AI in each region is extracted by assessing its similarity to both local and global embeddings. Furthermore, to ensure completeness and diversity, we devise an anchor-guided region correction (ARC) module that explores the complementary information from all regions to correct each regional representation. Building upon DAM and ARC, we develop a concise yet effective framework, AINet, which employs a simple predictor and surpasses state-of-the-art methods with substantially fewer FLOPs and parameters. Moreover, both DAM and ARC are modular and can be seamlessly integrated into existing MIL frameworks, consistently improving their performance.
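The anchor-mining step, selecting the instance most similar to both its region mean (local) and the bag mean (global), can be sketched with cosine similarities. This is a simplified reading of DAM under assumed toy embeddings; the paper's exact scoring may differ:

```python
import numpy as np

def select_anchors(instances: np.ndarray, region_ids: np.ndarray) -> dict:
    """Pick one anchor instance per region: the instance whose cosine
    similarity to the region mean plus the bag mean is highest."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    bag_mean = instances.mean(axis=0)
    anchors = {}
    for r in np.unique(region_ids):
        idx = np.flatnonzero(region_ids == r)
        region_mean = instances[idx].mean(axis=0)
        scores = [cos(instances[i], region_mean) + cos(instances[i], bag_mean)
                  for i in idx]
        anchors[int(r)] = int(idx[int(np.argmax(scores))])
    return anchors

rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 64))        # 20 toy patch embeddings
regions = np.repeat(np.arange(4), 5)     # 4 regions, 5 instances each
print(select_anchors(feats, regions))    # one instance index per region
```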

[549] Targeted T2-FLAIR Dropout Training Improves Robustness of nnU-Net Glioblastoma Segmentation to Missing T2-FLAIR

Marco Öchsner, Lena Kaiser, Robert Stahl, Nathalie L. Albert, Thomas Liebig, Robert Forbrig, Jonas Reis

Main category: eess.IV

TL;DR: Targeted T2-FLAIR dropout training improves glioblastoma MRI tumor segmentation robustness to missing T2-FLAIR data without degrading performance when T2-FLAIR is available.

Motivation: To address the challenge of missing imaging sequences in clinical practice by developing segmentation models robust to missing T2-FLAIR sequences while maintaining performance when all sequences are available.

Method: Retrospective multi-dataset study using nnU-Net models trained on BraTS 2021 data with targeted T2-FLAIR dropout (probability rates r=0.35 or 0.50) by replacing only the T2-FLAIR channel with zeros during training, then testing on UPenn-GBM dataset.

Result: With T2-FLAIR present: dropout models performed equivalently to non-dropout models (median DSC 94.8% vs 95.0%). With T2-FLAIR absent: dropout dramatically improved performance (median DSC improved from 81.0% to 93.4%), with edema DSC improving from 14.0% to 87.0% and whole-tumor volume bias improving from -45.6 mL to 0.83 mL.

Conclusion: Targeted T2-FLAIR dropout training preserves segmentation performance when T2-FLAIR is available and significantly reduces segmentation error and volume bias when T2-FLAIR is missing, addressing a practical clinical challenge.

Abstract: Purpose: To determine whether targeted T2 fluid-attenuated inversion recovery (T2-FLAIR) dropout training improves glioblastoma MRI tumor segmentation robustness to missing T2-FLAIR without degrading performance when T2-FLAIR is available. Materials and Methods: This retrospective multi-dataset study developed nnU-Net models on BraTS 2021 (n=848) and externally tested them on UPenn-GBM glioblastoma MRI (n=403; 2006-2018; age 18-89 years; 60% male). Models were trained with no dropout or targeted T2-FLAIR dropout (probability rate r=0.35 or 0.50) by replacing only the T2-FLAIR channel with zeros. Inference used T2-FLAIR-present and T2-FLAIR-absent scenarios (T2-FLAIR set to zero). The primary endpoint was Dice similarity coefficient (DSC); secondary endpoints were 95th percentile Hausdorff distance and Bland-Altman whole-tumor volume bias. Equivalence was assessed with two one-sided tests using +/-1.5 DSC percentage points, and noninferiority versus HD-GLIO used a -1.5-point margin. Results: With T2-FLAIR present, median overall DSC was 94.8% (interquartile range, 90.0%-97.1%) with dropout and 95.0% (interquartile range, 90.3%-97.1%) without dropout (equivalence supported, p<0.001). With T2-FLAIR absent, median overall DSC improved from 81.0% (interquartile range, 75.1%-86.4%) without dropout to 93.4% (interquartile range, 89.1%-96.2%) with dropout (r=0.35); edema DSC improved from 14.0% to 87.0%, edema 95th percentile Hausdorff distance improved from 22.44 mm to 2.45 mm, and whole-tumor volume bias improved from -45.6 mL to 0.83 mL. Dropout was noninferior to HD-GLIO under T2-FLAIR-present (all p<0.001). Conclusion: Targeted T2-FLAIR dropout preserved segmentation performance when T2-FLAIR was available and reduced segmentation error and whole-tumor volume bias when T2-FLAIR was absent.
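The targeted dropout itself is a one-line training-time augmentation: with probability r, zero only the T2-FLAIR channel. A minimal sketch assuming a BraTS-style (channels, D, H, W) layout with FLAIR at an assumed channel index:

```python
import numpy as np

FLAIR_CH = 3  # assumed channel order: T1, T1c, T2, T2-FLAIR

def flair_dropout(x, r=0.35, rng=None):
    """With probability r, replace only the T2-FLAIR channel with zeros,
    mirroring the paper's targeted dropout during training."""
    rng = rng or np.random.default_rng()
    x = x.copy()
    if rng.random() < r:
        x[FLAIR_CH] = 0.0
    return x

vol = np.random.rand(4, 8, 8, 8)   # toy (channels, D, H, W) volume
out = flair_dropout(vol, r=1.0)    # force dropout for illustration
print(out[FLAIR_CH].max())         # 0.0
```

Because only one channel is zeroed, the network still sees the other sequences intact, which is what lets the dropout-trained model keep full performance when T2-FLAIR is present.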

[550] Progressive Per-Branch Depth Optimization for DEFOM-Stereo and SAM3 Joint Analysis in UAV Forestry Applications

Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

Main category: eess.IV

TL;DR: A progressive pipeline for 3D branch reconstruction using stereo vision, integrating foundation-model disparity estimation, instance segmentation, and multi-stage depth optimization to create clean per-branch point clouds for autonomous UAV tree pruning.

Motivation: Accurate 3D reconstruction of individual branches is essential for autonomous UAV-based tree pruning, but current stereo matchers produce noisy disparity maps that are inadequate for analyzing complex forest canopies.

Method: Progressive pipeline combining DEFOM-Stereo foundation-model for disparity estimation, SAM3 for instance segmentation, and multi-stage depth optimization addressing three error families: mask boundary contamination (via morphological erosion and skeleton-preserving refinement), segmentation inaccuracy (using LAB-space Mahalanobis color validation with cross-branch overlap arbitration), and depth noise (through five-stage scheme with MAD global detection, spatial density consensus, local MAD filtering, RGB-guided filtering, and adaptive bilateral filtering).

Result: Evaluated on 1920x1080 stereo imagery of Radiata pine from UAV, the pipeline reduces average per-branch depth standard deviation by 82% while maintaining edge fidelity, producing geometrically coherent 3D point clouds suitable for autonomous pruning tool positioning.

Conclusion: The proposed pipeline successfully addresses key challenges in branch-level 3D reconstruction for forestry applications, enabling accurate autonomous pruning through robust point cloud generation, with code and data released publicly.

Abstract: Accurate per-branch 3D reconstruction is a prerequisite for autonomous UAV-based tree pruning; however, dense disparity maps from modern stereo matchers often remain too noisy for individual branch analysis in complex forest canopies. This paper introduces a progressive pipeline integrating DEFOM-Stereo foundation-model disparity estimation, SAM3 instance segmentation, and multi-stage depth optimization to deliver robust per-branch point clouds. Starting from a naive baseline, we systematically identify and resolve three error families through successive refinements. Mask boundary contamination is first addressed through morphological erosion and subsequently refined via a skeleton-preserving variant to safeguard thin-branch topology. Segmentation inaccuracy is then mitigated using LAB-space Mahalanobis color validation coupled with cross-branch overlap arbitration. Finally, depth noise - the most persistent error source - is initially reduced by outlier removal and median filtering, before being superseded by a robust five-stage scheme comprising MAD global detection, spatial density consensus, local MAD filtering, RGB-guided filtering, and adaptive bilateral filtering. Evaluated on 1920x1080 stereo imagery of Radiata pine (Pinus radiata) acquired with a ZED Mini camera (63 mm baseline) from a UAV in Canterbury, New Zealand, the proposed pipeline reduces the average per-branch depth standard deviation by 82% while retaining edge fidelity. The result is geometrically coherent 3D point clouds suitable for autonomous pruning tool positioning. All code and processed data are publicly released to facilitate further UAV forestry research.
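The MAD global detection stage of the five-stage scheme can be sketched as a robust outlier filter on a branch's depth samples; this is only the first stage under assumed toy data, not the full pipeline with density consensus and guided/bilateral filtering:

```python
import numpy as np

def mad_filter(depths: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Keep depth samples within k robust standard deviations of the
    median, using the median absolute deviation (MAD)."""
    med = np.median(depths)
    mad = np.median(np.abs(depths - med))
    sigma = 1.4826 * mad + 1e-9        # MAD -> std under normality
    return depths[np.abs(depths - med) <= k * sigma]

rng = np.random.default_rng(42)
branch = np.concatenate([rng.normal(4.0, 0.02, 500),    # true branch depth (m)
                         rng.uniform(0.5, 10.0, 25)])   # clutter outliers
clean = mad_filter(branch)
print(branch.std(), clean.std())   # std drops sharply after filtering
```

The median/MAD pair is used instead of mean/std because both are insensitive to the clutter points being rejected.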

[551] Multimodal MRI Report Findings Supervised Brain Lesion Segmentation with Substructures

Yubin Ge, Yongsong Huang, Xiaofeng Liu

Main category: eess.IV

TL;DR: MS-RSuper: A multi-scale report-supervised learning framework for brain tumor segmentation that handles incomplete radiology reports with hierarchical findings, modality-specific cues, and uncertainty-aware constraints.

Motivation: Traditional report-supervised learning struggles with incomplete radiology reports in brain tumor MRI studies where reports often describe only the largest lesion, provide qualitative/uncertain cues, and contain hierarchical modality-specific findings that are correlated with different tumor substructures.

Method: Proposes MS-RSuper with: (1) modality-specific qualitative cue alignment using existence and absence losses, (2) one-sided lower-bounds for partial quantitative cues, (3) anatomical priors for cohort differences, and (4) certainty token scaling with down-weighting of missing cues.

Result: On 1238 report-labeled BraTS-MET/MEN scans, MS-RSuper significantly outperforms both sparsely-supervised baseline and naive RSuper methods.

Conclusion: MS-RSuper effectively handles incomplete radiology reports with hierarchical findings and uncertainty, enabling better brain tumor segmentation without dense voxel-level annotations.

Abstract: Report-supervised (RSuper) learning seeks to alleviate the need for dense tumor voxel labels with constraints derived from radiology reports (e.g., volumes, counts, sizes, locations). In MRI studies of brain tumors, however, we often involve multi-parametric scans and substructures. Here, fine-grained modality/parameter-wise reports are usually provided along with global findings and are correlated with different substructures. Moreover, the reports often describe only the largest lesion and provide qualitative or uncertain cues ("mild," "possible"). Classical RSuper losses (e.g., sum volume consistency) can over-constrain or hallucinate unreported findings under such incompleteness, and are unable to utilize these hierarchical findings or exploit the priors of varied lesion types in a merged dataset. We explicitly parse the global quantitative and modality-wise qualitative findings and introduce a unified, one-sided, uncertainty-aware formulation (MS-RSuper) that: (i) aligns modality-specific qualitative cues (e.g., T1c enhancement, FLAIR edema) with their corresponding substructures using existence and absence losses; (ii) enforces one-sided lower-bounds for partial quantitative cues (e.g., largest lesion size, minimal multiplicity); and (iii) adds extra- vs. intra-axial anatomical priors to respect cohort differences. Certainty tokens scale penalties; missing cues are down-weighted. On 1238 report-labeled BraTS-MET/MEN scans, our MS-RSuper largely outperforms both a sparsely-supervised baseline and a naive RSuper method.
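The one-sided lower-bound constraint has a simple hinge form: penalize only when the prediction falls short of what the report guarantees, never when it exceeds it. A hypothetical scalar sketch (the paper's losses act on network outputs, and the normalization here is an assumption):

```python
def one_sided_volume_loss(pred_volume: float, reported_min: float,
                          certainty: float = 1.0) -> float:
    """Penalize only when the predicted lesion volume falls below the
    report-derived lower bound; over-segmentation beyond the reported
    largest lesion is not punished. `certainty` scales the penalty,
    standing in for the paper's certainty tokens."""
    shortfall = max(0.0, reported_min - pred_volume) / max(reported_min, 1e-8)
    return certainty * shortfall

print(one_sided_volume_loss(8.0, reported_min=10.0))   # 0.2: under the bound
print(one_sided_volume_loss(15.0, reported_min=10.0))  # 0.0: bound satisfied
```

The asymmetry is the point: a report that mentions only the largest lesion gives a lower bound on total tumor volume, so penalizing overshoot would wrongly punish correctly segmented unreported lesions.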

[552] Vision-Inspired Image Quality Assessment for Radar-Based Human Activity Representations

Huy Trinh, Davis Liu, Munia Humaira, Peter Lee, Zhou Wang

Main category: eess.IV

TL;DR: The paper proposes improved preprocessing techniques for radar-based human activity recognition, including denoising methods for micro-Doppler spectrograms and a novel framework for static activity recognition using range-angle maps with temporal tracking and quality assessment.

Motivation: Radar-based human activity recognition offers privacy-preserving advantages over vision and wearable sensors, especially in sensitive environments like healthcare facilities. However, micro-Doppler spectrograms suffer from noise and clutter limitations, and existing methods focus mainly on dynamic activities while neglecting static activities.

Method: The paper reimplements and assesses three recent denoising techniques (adaptive preprocessing, adaptive thresholding, entropy-based denoising) on benchmark radar data. It also proposes a novel framework for static activity recognition using range-angle feature maps, including a temporal tracking algorithm for consistency enforcement and a no-reference quality scoring algorithm to assess RA-map fidelity.

Result: Experimental findings show that the proposed techniques enhance classification performance and interpretability for both dynamic and static activities, enabling more reliable radar-based HAR systems. The evaluation uses both perceptual image quality measures and standard error-based metrics to address limitations of conventional metrics in low-SNR regimes.

Conclusion: The work advances radar-based human activity recognition by improving preprocessing techniques for dynamic activities and expanding the scope to include static activities through novel range-angle map processing, contributing to more robust and comprehensive privacy-preserving monitoring systems.

Abstract: Radar-based human activity recognition has gained attention as a privacy-preserving alternative to vision and wearable sensors, especially in sensitive environments like long-term care facilities. Micro-Doppler spectrograms derived from FMCW radar signals are central to recognizing dynamic activities, but their effectiveness is limited by noise and clutter. In this work, we use a benchmark radar dataset to reimplement and assess three recent denoising and preprocessing techniques: adaptive preprocessing, adaptive thresholding, and entropy-based denoising. To illustrate the shortcomings of conventional metrics in low-SNR regimes, we evaluate performance using both perceptual image quality measures and standard error-based metrics. We additionally propose a novel framework for static activity recognition using range-angle feature maps to expand HAR beyond dynamic activities. We present two important contributions: a temporal tracking algorithm to enforce consistency and a no-reference quality scoring algorithm to assess RA-map fidelity. According to experimental findings, our suggested techniques enhance classification performance and interpretability for both dynamic and static activities, opening the door for more reliable radar-based HAR systems.
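Entropy-based denoising exploits the fact that noise frames spread energy evenly across Doppler bins while motion concentrates it. One plausible reading of the technique, sketched on toy data (not the paper's implementation):

```python
import numpy as np

def entropy_denoise(spec: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Zero the highest-entropy time frames of a micro-Doppler spectrogram:
    frames with near-uniform energy across Doppler bins are treated as
    noise, frames with concentrated energy as motion."""
    p = spec / (spec.sum(axis=0, keepdims=True) + 1e-12)   # per-frame pmf
    ent = -(p * np.log(p + 1e-12)).sum(axis=0)             # Shannon entropy
    out = spec.copy()
    out[:, ent > np.quantile(ent, keep_ratio)] = 0.0
    return out

rng = np.random.default_rng(1)
spec = np.abs(rng.normal(size=(64, 200)))   # toy (doppler_bins, frames)
spec[30:34, 50:150] += 20.0                 # strong, concentrated target return
den = entropy_denoise(spec)                 # noise-only frames are zeroed
```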

[553] A Light Fixture Color Temperature and Color Rendering Index Measuring Device

Gianluca Hiss Garbim, Luis Carlos Mathias, André Massami Assakawa, Taufik Abrão

Main category: eess.IV

TL;DR: A low-cost spectrophotometer prototype without optical lenses that measures CCT and CRI of light sources, avoiding chromatic aberrations.

Motivation: Commercial lamps often lack CRI reporting, and professional spectrophotometers are expensive and inaccessible. There's a need for affordable, simplified measurement devices for color temperature and rendering quality assessment.

Method: Designed and built a spectrophotometer prototype without optical lenses, including structural design, sensor interface, calculation algorithms, and compensation algorithms. The lens-free approach eliminates chromatic aberrations while simplifying the device.

Result: The prototype successfully captured spectral power distributions of various light sources and accurately calculated their correlated color temperature (CCT) and color rendering index (CRI).

Conclusion: A functional, simplified spectrophotometer was developed that provides accessible CCT and CRI measurements without expensive optical components, addressing the gap in affordable color quality assessment tools.

Abstract: The correlated color temperature (CCT) and color rendering index (CRI) of artificial light sources are important because they have implications for human biology and professional applications. Although CCT information is generally available for commercial lamps, CRI is commonly not reported. In addition, devices measuring these parameters are difficult to access, as they require a spectrophotometer, a commonly expensive instrument. In this context, the present work details the design and construction of a meter, covering the structural design of the equipment, the sensor interface, and the implementation of the calculation and compensation algorithms, reproducing the dedicated functionality of a spectrophotometer without the use of optical lenses. In addition to simplifying the device, this approach allows measurements free from the dispersion caused by chromatic aberrations typical of optical lenses. The prototype proved effective, capturing the spectral power distributions of various light sources and calculating their CCT and CRI.
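Once the captured spectral power distribution has been reduced to CIE 1931 (x, y) chromaticity, a standard back-end step for the CCT calculation is McCamy's cubic approximation. The paper does not state which formula its algorithm uses; this is shown as a well-known option:

```python
def mccamy_cct(x: float, y: float) -> float:
    """Approximate correlated color temperature (K) from CIE 1931 (x, y)
    chromaticity via McCamy's formula, valid for typical white sources."""
    n = (x - 0.3320) / (y - 0.1858)
    return -449.0 * n**3 + 3525.0 * n**2 - 6823.3 * n + 5520.33

# D65 daylight white point: expect roughly 6500 K
print(round(mccamy_cct(0.3127, 0.3290)))  # about 6505
```

CRI, by contrast, requires the full spectrum: it compares the rendering of standard test color samples under the source against a reference illuminant, so it cannot be recovered from (x, y) alone.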

[554] Reproducing and Improving CheXNet: Deep Learning for Chest X-ray Disease Classification

Daniel J. Strick, Carlos Garcia, Anthony Huang, Thomas Gardos

Main category: eess.IV

TL;DR: Reproduction and improvement of CheXNet for multi-label classification of 14 diseases from chest X-rays, achieving 0.85 average AUC-ROC and 0.39 average F1 score on NIH ChestX-ray14 dataset.

Motivation: Deep learning for radiologic image analysis is growing rapidly and likely to become standard in medicine. The study aims to reproduce and improve upon CheXNet for automated diagnosis of 14 diseases from chest X-rays.

Method: Reproduced CheXNet algorithm and explored other algorithms that outperform CheXNet’s baseline metrics on NIH ChestX-ray14 dataset. Used F1 score and AUC-ROC as primary evaluation metrics for imbalanced, multi-label classification tasks.

Result: Best model achieved average AUC-ROC score of 0.85 and average F1 score of 0.39 across all 14 disease classifications in the dataset, showing improvement over baseline CheXNet performance.

Conclusion: Deep learning approaches show promise for automated radiologic image analysis, with improved algorithms outperforming baseline CheXNet on multi-label disease classification from chest X-rays.

Abstract: Deep learning for radiologic image analysis is a rapidly growing field in biomedical research and is likely to become a standard practice in modern medicine. On the publicly available NIH ChestX-ray14 dataset, containing X-ray images that are classified by the presence or absence of 14 different diseases, we reproduced an algorithm known as CheXNet, as well as explored other algorithms that outperform CheXNet’s baseline metrics. Model performance was primarily evaluated using the F1 score and AUC-ROC, both of which are critical metrics for imbalanced, multi-label classification tasks in medical imaging. The best model achieved an average AUC-ROC score of 0.85 and an average F1 score of 0.39 across all 14 disease classifications present in the dataset.
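The two headline metrics, per-class AUC-ROC and F1 averaged over the 14 labels, are standard and can be computed per disease column in a few lines. A pure-Python sketch on a toy label column (real evaluations would typically use scikit-learn's `roc_auc_score` and `f1_score`):

```python
def auc_roc(labels, scores):
    """Rank-based (Mann-Whitney) AUC-ROC for one disease label."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1(labels, preds):
    """F1 score for one disease label's binarized predictions."""
    tp = sum(l and p for l, p in zip(labels, preds))
    fp = sum((not l) and p for l, p in zip(labels, preds))
    fn = sum(l and (not p) for l, p in zip(labels, preds))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# One toy disease column; macro metrics average these over all 14 labels
y = [0, 0, 1, 1, 1]
s = [0.1, 0.4, 0.35, 0.8, 0.9]
print(auc_roc(y, s))                      # 5/6, about 0.833
print(f1(y, [int(v > 0.5) for v in s]))   # 0.8
```

The gap between the two reported numbers (0.85 AUC vs. 0.39 F1) is typical of imbalanced multi-label tasks: AUC is threshold-free, while F1 is dragged down by rare-disease classes at any fixed threshold.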

Last updated: 2026-03-06
Built with Hugo, theme modified from Stack