Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] MAGE: Modality-Agnostic Music Generation and Editing
Muhammad Usama Saleem, Tejasvi Ravi, Tianyu Xu, Rajeev Nongpiur, Ishan Chatterjee, Mayur Jagdishbhai Patel, Pu Wang
Main category: cs.SD
TL;DR: MAGE is a modality-agnostic framework for multimodal music generation and editing that unifies tasks using flow-based Transformers and cross-gated modulation for better cross-modal grounding.
Details
Motivation: Current multimodal music systems are limited by single-task designs, brittle prompting interfaces, and weak cross-modal grounding that causes prompt drift and spurious content during generation/editing.
Method: Uses Controlled Multimodal FluxFormer (flow-based Transformer) for controllable latent trajectories, Audio-Visual Nexus Alignment for temporal consistency, cross-gated modulation for multiplicative control (see the sketch below), and dynamic modality-masking curriculum training.
Result: Achieves competitive quality on MUSIC benchmark, supports effective multimodal-guided music generation and targeted editing with robust inference under missing modalities.
Conclusion: MAGE provides a lightweight, flexible framework for practical music workflows that unifies multimodal music generation and editing with improved cross-modal grounding.
Abstract: Multimodal music creation requires models that can both generate audio from high-level cues and edit existing mixtures in a targeted manner. Yet most multimodal music systems are built for a single task and a fixed prompting interface, making their conditioning brittle when guidance is ambiguous, temporally misaligned, or partially missing. Common additive fusion or feature concatenation further weakens cross-modal grounding, often causing prompt drift and spurious musical content during generation and editing. We propose MAGE, a modality-agnostic framework that unifies multimodal music generation and mixture-grounded editing within a single continuous latent formulation. At its core, MAGE uses a Controlled Multimodal FluxFormer, a flow-based Transformer that learns controllable latent trajectories for synthesis and editing under any available subset of conditions. To improve grounding, we introduce Audio-Visual Nexus Alignment to select temporally consistent visual evidence for the audio timeline, and a cross-gated modulation mechanism that applies multiplicative control from aligned visual and textual cues to the audio latents, suppressing unsupported components rather than injecting them. Finally, we train with a dynamic modality-masking curriculum that exposes the model to text-only, visual-only, joint multimodal, and mixture-guided settings, enabling robust inference under missing modalities without training separate models. Experiments on the MUSIC benchmark show that MAGE supports effective multimodal-guided music generation and targeted editing, achieving competitive quality while offering a lightweight and flexible interface tailored to practical music workflows.
Relevance: 9/10
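To make the cross-gated modulation idea concrete, here is a minimal PyTorch sketch of multiplicative conditioning on audio latents. The layer sizes, the sigmoid gate, and all tensor shapes are illustrative assumptions, not the MAGE architecture.

```python
import torch
import torch.nn as nn

class CrossGatedModulation(nn.Module):
    """Multiplicative gating of audio latents by aligned text/visual cues.

    Illustrative only: the sigmoid-gate form and dimensions are assumptions,
    not the MAGE authors' implementation.
    """
    def __init__(self, d_audio: int, d_cond: int):
        super().__init__()
        # Project the concatenated condition features to a per-channel gate.
        self.to_gate = nn.Linear(2 * d_cond, d_audio)

    def forward(self, audio_latent, text_emb, visual_emb):
        # audio_latent: (batch, time, d_audio); conditions: (batch, time, d_cond)
        cond = torch.cat([text_emb, visual_emb], dim=-1)
        gate = torch.sigmoid(self.to_gate(cond))  # values in (0, 1)
        # Multiplicative control: unsupported channels are scaled toward zero
        # rather than having new content injected into them.
        return audio_latent * gate

x = torch.randn(2, 100, 64)   # toy audio latents
t = torch.randn(2, 100, 32)   # toy aligned text features
v = torch.randn(2, 100, 32)   # toy aligned visual features
mod = CrossGatedModulation(d_audio=64, d_cond=32)
print(mod(x, t, v).shape)     # torch.Size([2, 100, 64])
```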
[2] Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music
Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang-gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesh Manocha, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, Wei Ping
Main category: cs.SD
TL;DR: AF-Next is an advanced large audio-language model that improves audio understanding and reasoning across speech, environmental sounds, and music, with support for long audio inputs up to 30 minutes and new temporal reasoning capabilities.
Details
Motivation: To address limitations in existing audio-language models by improving accuracy, supporting longer audio inputs, and enabling better temporal reasoning and interpretability for complex audio understanding tasks.
Method: Systematic analysis of Audio Flamingo 3 to identify gaps, curation of large-scale datasets (1M+ hours), curriculum-based training (pre-training, mid-training, post-training), and introduction of Temporal Audio Chain-of-Thought for timestamp-grounded reasoning (see the sketch below).
Result: Outperforms similarly sized open models by large margins across 20 benchmarks, competitive with larger models, exhibits strong real-world utility and generalization to unseen tasks.
Conclusion: AF-Next represents a significant advancement in audio-language modeling with improved capabilities for understanding and reasoning over diverse audio types, especially for long and complex audio inputs.
Abstract: We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. In addition to all data, code and methods, we open-source 3 variants of AF-Next, including AF-Next-Instruct, AF-Next-Think and AF-Next-Captioner.
Relevance: 9/10
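To illustrate what timestamp-grounded reasoning might look like, here is a toy Python rendering of a Temporal Audio Chain-of-Thought trace. The schema and field names are hypothetical; the summary does not specify AF-Next's actual trace format.

```python
from dataclasses import dataclass

@dataclass
class TimedReasoningStep:
    """One step of a timestamp-grounded reasoning trace (schema is illustrative)."""
    start_s: float   # segment start, in seconds
    end_s: float     # segment end, in seconds
    thought: str     # what the model infers about this segment

trace = [
    TimedReasoningStep(0.0, 12.5, "A speaker introduces the topic of the talk."),
    TimedReasoningStep(12.5, 48.0, "Background music fades in under the speech."),
    TimedReasoningStep(48.0, 61.2, "Applause suggests the end of the segment."),
]
answer = "The recording is a conference talk with a musical interlude."

# Grounding each step to a time span is what enables fine-grained temporal
# alignment and makes the reasoning auditable for long inputs.
for step in trace:
    print(f"[{step.start_s:7.1f}-{step.end_s:7.1f}s] {step.thought}")
print("Answer:", answer)
```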
[3] Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lv, Wei Xue, Yike Guo
Main category: cs.SD
TL;DR: Audio-Omni is the first end-to-end framework that unifies audio generation and editing across general sound, music, and speech domains with integrated multimodal understanding capabilities, achieving SOTA performance across multiple benchmarks.
Details
Motivation: Current multimodal models typically address audio understanding, generation, and editing with specialized models, lacking a unified framework that can seamlessly integrate all three tasks across general domains.
Method: Combines a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis (see the sketch below), and introduces AudioEdit dataset with over 1M curated editing pairs to overcome data scarcity in audio editing.
Result: Achieves state-of-the-art performance across benchmarks, outperforming prior unified approaches while matching or surpassing specialized expert models, with additional capabilities like knowledge-augmented reasoning, in-context generation, and zero-shot cross-lingual control.
Conclusion: Audio-Omni represents a promising direction toward universal generative audio intelligence by unifying generation and editing across domains with integrated multimodal understanding.
Abstract: Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released on https://zeyuet.github.io/Audio-Omni.
Relevance: 9/10
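A minimal PyTorch sketch of the frozen-reasoner/trainable-synthesizer split described above. Both transformer stacks are toy stand-ins, and the concatenation-based conditioning is an assumption, not Audio-Omni's actual interface.

```python
import torch
import torch.nn as nn

def block(num_layers: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

mllm = block(2)   # stand-in for the frozen Multimodal LLM
dit = block(2)    # stand-in for the trainable Diffusion Transformer

for p in mllm.parameters():   # the reasoning backbone stays fixed
    p.requires_grad_(False)

cond = mllm(torch.randn(2, 16, 256))             # high-level conditioning tokens
noisy = torch.randn(2, 32, 256)                  # noisy audio latents at some diffusion step
pred = dit(torch.cat([cond, noisy], dim=1))      # DiT denoises, conditioned on the MLLM output

loss = pred[:, 16:].pow(2).mean()                # placeholder diffusion loss
loss.backward()                                  # gradients flow into the DiT only
print(any(p.grad is not None for p in dit.parameters()),    # True
      any(p.grad is not None for p in mllm.parameters()))   # False
```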
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 282]
- cs.CV [Total: 459]
- cs.AI [Total: 297]
- cs.SD [Total: 25]
- cs.LG [Total: 246]
- cs.MA [Total: 14]
- cs.MM [Total: 1]
- eess.AS [Total: 7]
- eess.IV [Total: 19]
cs.CL
[1] Self-Calibrating Language Models via Test-Time Discriminative Distillation
Mohamed Rissal Hedna, Jan Strich, Martin Semmann, Chris Biemann
Main category: cs.CL
TL;DR: SECL is a test-time training method that uses LLMs’ internal “P(True)” signal as self-supervision to calibrate model confidence without labeled data, reducing calibration error by 56-78% across multiple domains.
Details
Motivation: LLMs are systematically overconfident but contain a better-calibrated internal signal (P(True) when asked "Is this answer correct?") than their verbalized confidence. Existing calibration methods require labeled data, degrade under distribution shifts, or have high inference costs.
Method: SECL uses test-time training (TTT) that exploits the gap between LLMs’ verbalized confidence and their internal P(True) signal as label-free self-supervision (see the sketch below). It adapts only when the input distribution shifts, training on 6-26% of question streams, and uses distillation from the baseline P(True) signal.
Result: Across four small language models from three families and four diverse domains, SECL reduces Expected Calibration Error (ECE) by 56-78%, outperforming its own supervision signal and matching or outperforming recent inference-time methods.
Conclusion: SECL is the first method to apply test-time training to calibration, requiring no labeled data or human supervision. Seven ablations confirm each component is crucial and robust across configurations.
Abstract: Large language models (LLMs) are systematically overconfident: they routinely express high certainty on questions they often answer incorrectly. Existing calibration methods either require labeled validation data, degrade under distribution shifts, or incur substantial inference costs. Recent work has shown that LLMs already contain a better-calibrated signal than the one they verbalize: the token probability of “True” when the model is asked “Is this answer correct?” ($P(\text{True})$) consistently outperforms their stated confidence, a gap that is theoretically grounded as generative error is lower-bounded by roughly twice the corresponding discriminative error. We introduce $\textbf{SECL}$ ($\textbf{SE}$lf-$\textbf{C}$alibrating $\textbf{L}$anguage Models), a test-time training (TTT) pipeline that exploits this gap as label-free self-supervision, requiring no labeled data or human supervision. SECL adapts only when the input distribution shifts, training on just 6–26% of the question stream at lower cost than the baseline it distills from. Across four small language models from three model families and four diverse domains, SECL reduces Expected Calibration Error (ECE) by 56–78%, outperforming its own supervision signal and matching or outperforming recent inference-time methods. SECL is the first method to apply TTT to calibration; seven ablations covering signal quality, gating strategy, weight accumulation, loss design, domain ordering, hyperparameter sensitivity, and layer selection confirm that each component is crucial and robust across configurations. Code: https://anonymous.4open.science/r/secl-emnlp26-submission-C890
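For readers unfamiliar with the P(True) signal SECL distills from, here is a minimal sketch using the Hugging Face transformers API. The probe wording and the model name are illustrative choices, not SECL's exact configuration, and the test-time training loss itself is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small causal LM works for the sketch; this choice is not from the paper.
name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name)

def p_true(question: str, answer: str) -> float:
    """Token probability of ' True' after asking the model to grade its answer."""
    prompt = (f"Question: {question}\nProposed answer: {answer}\n"
              f"Is this answer correct? Reply True or False.\nReply:")
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]   # next-token distribution
    probs = torch.softmax(logits, dim=-1)
    true_id = tok(" True", add_special_tokens=False).input_ids[0]
    return probs[true_id].item()

# SECL then distills this better-calibrated signal into the model's verbalized
# confidence with a test-time training objective (not shown here).
print(p_true("What is 2 + 2?", "4"))
```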
[2] Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations
Dang H. Dang, Jelena Mitrović, Michael Granitzer
Main category: cs.CL
TL;DR: LLM-based synthetic annotations and web-scale unlabelled data improve multilingual hate speech detection, especially for smaller models and low-resource languages.
Details
Motivation: To investigate whether large-scale unlabelled web data and LLM-based synthetic annotations can enhance multilingual hate speech detection, particularly in low-resource settings.
Method: Two strategies: 1) Continued pre-training of BERT models on unlabelled web texts before fine-tuning, 2) Using four open-source LLMs to produce synthetic annotations through ensemble strategies (mean averaging, majority voting, LightGBM meta-learner; see the sketch below).
Result: Continued pre-training yields ~3% average macro-F1 gain across 16 benchmarks. LightGBM ensemble outperforms other strategies. Synthetic labels benefit small models significantly (+11% F1 for Llama3.2-1B) but provide modest gains for larger models (+0.6% for Qwen2.5-14B).
Conclusion: Combination of web-scale unlabelled data and LLM-ensemble annotations is most valuable for smaller models and low-resource languages in hate speech detection.
Abstract: We study whether large-scale unlabelled web data and LLM-based synthetic annotations can improve multilingual hate speech detection. Starting from texts crawled via OpenWebSearch.eu~(OWS) in four languages (English, German, Spanish, Vietnamese), we pursue two complementary strategies. First, we apply continued pre-training to BERT models by continuing masked language modelling on unlabelled OWS texts before supervised fine-tuning, and show that this yields an average macro-F1 gain of approximately 3% over standard baselines across sixteen benchmarks, with stronger gains in low-resource settings. Second, we use four open-source LLMs (Mistral-7B, Llama3.1-8B, Gemma2-9B, Qwen2.5-14B) to produce synthetic annotations through three ensemble strategies: mean averaging, majority voting, and a LightGBM meta-learner. The LightGBM ensemble consistently outperforms the other strategies. Fine-tuning on these synthetic labels substantially benefits a small model (Llama3.2-1B: +11% pooled F1), but provides only a modest gain for the larger Qwen2.5-14B (+0.6%). Our results indicate that the combination of web-scale unlabelled data and LLM-ensemble annotations is the most valuable for smaller models and low-resource languages.
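A toy sketch of the three ensemble strategies applied to the four annotators' scores. The scores and labels here are synthetic; in the paper they come from prompting the listed LLMs and from benchmark labels.

```python
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
# Stand-in for per-text hate-probability scores from the four annotator LLMs
# (Mistral-7B, Llama3.1-8B, Gemma2-9B, Qwen2.5-14B); real scores come from prompting.
scores = rng.uniform(0, 1, size=(1000, 4))
gold = (scores.mean(axis=1) + rng.normal(0, 0.1, 1000) > 0.5).astype(int)  # toy labels

mean_vote = (scores.mean(axis=1) > 0.5).astype(int)        # mean averaging
majority = ((scores > 0.5).sum(axis=1) >= 3).astype(int)   # strict majority (ties need a rule)

# LightGBM meta-learner: learns how much to trust each annotator,
# trained on a small labeled split.
meta = LGBMClassifier(n_estimators=100)
meta.fit(scores[:800], gold[:800])
meta_pred = meta.predict(scores[800:])

for label, pred in [("mean", mean_vote[800:]), ("majority", majority[800:]),
                    ("lightgbm", meta_pred)]:
    print(label, (pred == gold[800:]).mean())
```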
[3] HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation
Edward Ajayi, Prasenjit Mitra
Main category: cs.CL
TL;DR: A framework using cognitive personas and psychological humor theories to generate high-quality humor data for fine-tuning LLMs, achieving competitive humor generation with smaller models.
Details
Motivation: Standard LLM training objectives conflict with humor's need for surprise and incongruity, creating a gap in humor generation capabilities that requires novel approaches.
Method: Cognitive Synergy Framework with Mixture-of-Thought approach using six cognitive personas to synthesize diverse comedic perspectives (see the sketch below), creating theoretically grounded humor datasets for fine-tuning 7B parameter models with DPO and novel O-GRPO optimization.
Result: 7B model significantly outperforms larger instruction-tuned baselines and achieves performance competitive with state-of-the-art proprietary models, showing cognitive-driven data curation is more critical than alignment algorithms or model scale.
Conclusion: Psychological theory-driven data curation is key to effective humor generation in LLMs, with the proposed framework enabling competitive performance with smaller models through cognitive synergy.
Abstract: Humor generation poses a significant challenge for Large Language Models (LLMs), because their standard training objective - predicting the most likely next word - inherently conflicts with the surprise and incongruity needed for comedy. To bridge this gap, we introduce the Cognitive Synergy Framework, a theoretically grounded methodology for generating high-quality humor data inspired by psychological theories of humor. Utilizing a Mixture-of-Thought (MoT) approach, we deploy six cognitive personas (e.g., The Absurdist, The Cynic) to synthesize diverse comedic perspectives for a given prompt. This framework creates a theoretically grounded dataset, which we use to fine-tune a 7B-parameter student model. We compare Direct Preference Optimization (DPO) and a novel Offline Group Relative Policy Optimization (O-GRPO); our 7B model significantly outperforms larger instruction-tuned baselines and achieves performance competitive with state-of-the-art proprietary models. We find that cognitive-driven data curation is far more critical than alignment algorithms or model scale for humor generation. Code and data will be available upon publication.
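A minimal sketch of persona-based data synthesis in the Mixture-of-Thought spirit. Only two persona names come from the paper; the prompt template, the remaining personas, and the generate stub are placeholders.

```python
# "The Absurdist" and "The Cynic" are the paper's examples; the style strings
# and the other four personas are invented for illustration.
PERSONAS = {
    "The Absurdist": "Find the most surreal, logic-breaking angle.",
    "The Cynic": "Undercut the premise with world-weary skepticism.",
    # ... four more personas in the actual framework
}

def generate(prompt: str) -> str:
    """Stub for an LLM call; swap in a real client."""
    return f"<joke for: {prompt[:40]}...>"

def mixture_of_thought(topic: str) -> list[dict]:
    """One candidate training example per persona: diverse takes on one prompt."""
    examples = []
    for name, style in PERSONAS.items():
        prompt = (f"You are {name}. {style}\n"
                  f"Write a short joke about: {topic}")
        examples.append({"persona": name, "prompt": prompt, "joke": generate(prompt)})
    return examples

for ex in mixture_of_thought("office meetings"):
    print(ex["persona"], "->", ex["joke"])
```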
[4] ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
Chi-Yuan Hsiao, Ke-Han Lu, Yu-Kuan Fu, Guan-Ting Lin, Hsiao-Tsung Hung, Hung-yi Lee
Main category: cs.CL
TL;DR: ASPIRin is an RL framework for full-duplex speech language models that decouples timing decisions (when to speak) from content generation (what to say) to prevent semantic degradation while optimizing turn-taking behavior.
Details
Motivation: Standard RL optimization for full-duplex speech language models degrades semantic quality, causing generative collapse and repetition when trying to optimize temporal dynamics like turn-taking.
Method: ASPIRin uses Action Space Projection to map text vocabulary into binary states (speak/silence; see the sketch below), applies Group Relative Policy Optimization with rule-based rewards, and isolates timing decisions from token selection to preserve semantic coherence.
Result: ASPIRin optimizes interactivity across turn-taking, backchanneling, and pause handling while reducing duplicate n-grams by over 50% compared to standard GRPO, effectively eliminating degenerative repetition.
Conclusion: Decoupling timing decisions from content generation in full-duplex speech language models prevents semantic degradation while enabling effective optimization of interactive behaviors.
Abstract: End-to-end full-duplex Speech Language Models (SLMs) require precise turn-taking for natural interaction. However, optimizing temporal dynamics via standard raw-token reinforcement learning (RL) degrades semantic quality, causing severe generative collapse and repetition. We propose ASPIRin, an interactivity-optimized RL framework that explicitly decouples when to speak from what to say. Using Action Space Projection, ASPIRin maps the text vocabulary into a coarse-grained binary state (active speech vs. inactive silence). By applying Group Relative Policy Optimization (GRPO) with rule-based rewards, it balances user interruption and response latency. Empirical evaluations show ASPIRin optimizes interactivity across turn-taking, backchanneling, and pause handling. Crucially, isolating timing from token selection preserves semantic coherence and reduces the portion of duplicate n-grams by over 50% compared to standard GRPO, effectively eliminating degenerative repetition.
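A small sketch of what a projection from token logits to a binary speak/silence distribution could look like. The single-silence-token assumption and the logsumexp aggregation are illustrative; the paper's exact vocabulary partition is not given in the summary.

```python
import torch

# Assumption for the sketch: one designated silence token id; the real system's
# vocabulary partition may differ.
SILENCE_ID = 0

def project_action(logits: torch.Tensor) -> torch.Tensor:
    """Collapse a (batch, vocab) next-token distribution into a 2-way
    speak/silence log-distribution, so timing rewards never touch content."""
    log_probs = torch.log_softmax(logits, dim=-1)
    silence = log_probs[:, SILENCE_ID]
    # Aggregate all non-silence tokens into a single "speak" action.
    mask = torch.ones_like(log_probs, dtype=torch.bool)
    mask[:, SILENCE_ID] = False
    speak = torch.logsumexp(log_probs.masked_fill(~mask, float("-inf")), dim=-1)
    return torch.stack([speak, silence], dim=-1)   # (batch, 2) log-probs

logits = torch.randn(4, 32000)
actions = project_action(logits)
print(actions.exp().sum(dim=-1))   # ~1.0: a proper 2-way distribution
```

The rule-based GRPO reward would then be computed on this binary action alone, which is how the method keeps the timing objective away from token selection.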
[5] Generating High Quality Synthetic Data for Dutch Medical Conversations
Cecilia Kuan, Aditya Kamlesh Parikh, Henk van den Heuvel
Main category: cs.CL
TL;DR: A pipeline for generating synthetic Dutch medical dialogues using fine-tuned LLMs, evaluated through quantitative metrics and qualitative review, showing feasibility but highlighting challenges in achieving natural conversation flow and domain specificity.
Details
Motivation: Clinical NLP development is hindered by scarce domain-specific datasets due to privacy/ethical constraints. Medical conversations contain valuable communication insights missing from EHRs, creating a need for ethically generated synthetic data to expand Dutch clinical NLP resources.
Method: Developed a pipeline using Dutch fine-tuned Large Language Models to generate synthetic medical dialogues, with real medical conversations as linguistic/structural reference. Evaluation involved quantitative metrics (lexical variety, turn-taking patterns) and qualitative review by native speakers and medical practitioners.
Result: Quantitative analysis showed strong lexical variety but overly regular turn-taking (suggesting scripted rather than natural flow). Qualitative review produced slightly below-average scores, with issues in domain specificity and natural expression. Limited correlation between quantitative and qualitative results was observed.
Conclusion: Generating synthetic Dutch medical dialogues is feasible but requires domain knowledge and carefully structured prompting to balance naturalness and structure. Numerical metrics alone cannot fully capture linguistic quality. This work provides foundation for expanding Dutch clinical NLP resources through ethically generated synthetic data.
Abstract: Medical conversations offer insights into clinical communication often absent from Electronic Health Records. However, developing reliable clinical Natural Language Processing (NLP) models is hampered by the scarcity of domain-specific datasets, as clinical data are typically inaccessible due to privacy and ethical constraints. To address these challenges, we present a pipeline for generating synthetic Dutch medical dialogues using a Dutch fine-tuned Large Language Model, with real medical conversations serving as linguistic and structural reference. The generated dialogues were evaluated through quantitative metrics and qualitative review by native speakers and medical practitioners. Quantitative analysis revealed strong lexical variety and overly regular turn-taking, suggesting scripted rather than natural conversation flow. Qualitative review produced slightly below-average scores, with raters noting issues in domain specificity and natural expression. The limited correlation between quantitative and qualitative results highlights that numerical metrics alone cannot fully capture linguistic quality. Our findings demonstrate that generating synthetic Dutch medical dialogues is feasible but requires domain knowledge and carefully structured prompting to balance naturalness and structure in conversation. This work provides a foundation for expanding Dutch clinical NLP resources through ethically generated synthetic data.
[6] GIANTS: Generative Insight Anticipation from Scientific Literature
Joy He-Yueya, Anikait Singh, Ge Gao, Michael Y. Li, Sherry Yang, Chelsea Finn, Emma Brunskill, Noah D. Goodman
Main category: cs.CL
TL;DR: Paper introduces insight anticipation task where models predict downstream paper insights from parent papers, with GiantsBench benchmark and GIANTS-4B model trained via RL to optimize this capability.
Details
Motivation: To explore language models' ability to perform literature-grounded synthesis for scientific discovery, specifically predicting novel insights from existing research papers.
Method: Developed GiantsBench (17k examples across 8 domains), used an LM judge for evaluation, trained GIANTS-4B via reinforcement learning using similarity scores as a proxy reward.
Result: GIANTS-4B outperforms proprietary baselines, achieves 34% relative improvement over gemini-3-pro, produces more conceptually clear insights, and SciJudge-30B predicts its insights lead to higher citations.
Conclusion: Language models can be effectively trained for insight anticipation, demonstrating potential for automated scientific discovery through literature synthesis.
Abstract: Scientific breakthroughs often emerge from synthesizing prior ideas into novel contributions. While language models (LMs) show promise in scientific discovery, their ability to perform this targeted, literature-grounded synthesis remains underexplored. We introduce insight anticipation, a generation task in which a model predicts a downstream paper’s core insight from its foundational parent papers. To evaluate this capability, we develop GiantsBench, a benchmark of 17k examples across eight scientific domains, where each example consists of a set of parent papers paired with the core insight of a downstream paper. We evaluate models using an LM judge that scores similarity between generated and ground-truth insights, and show that these similarity scores correlate with expert human ratings. Finally, we present GIANTS-4B, an LM trained via reinforcement learning (RL) to optimize insight anticipation using these similarity scores as a proxy reward. Despite its smaller open-source architecture, GIANTS-4B outperforms proprietary baselines and generalizes to unseen domains, achieving a 34% relative improvement in similarity score over gemini-3-pro. Human evaluations further show that GIANTS-4B produces insights that are more conceptually clear than those of the base model. In addition, SciJudge-30B, a third-party model trained to compare research abstracts by likely citation impact, predicts that insights generated by GIANTS-4B are more likely to lead to higher citations, preferring them over the base model in 68% of pairwise comparisons. We release our code, benchmark, and model to support future research in automated scientific discovery.
[7] Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering
Rrubaa Panchendrarajan, Arkaitz Zubiaga
Main category: cs.CL
TL;DR: Claim2Vec: A multilingual embedding model for fact-check claims that improves claim clustering performance through contrastive learning on similar claim pairs.
Details
Motivation: Recurrent claims in multilingual misinformation present challenges for automated fact-checking systems. While claim matching and retrieval exist, claim clustering (grouping similar claims resolvable by the same fact-check) remains underexplored, especially in multilingual settings.
Method: Fine-tune a multilingual encoder using contrastive learning with similar multilingual claim pairs to create improved semantic embeddings optimized for fact-check claims (see the sketch below).
Result: Claim2Vec significantly improves clustering performance across three datasets, 14 multilingual embedding models, and 7 clustering algorithms. Enhances both cluster label alignment and geometric structure of embedding space, with cross-lingual knowledge transfer observed in multilingual clusters.
Conclusion: Claim2Vec is the first multilingual embedding model optimized for fact-check claims, demonstrating effectiveness for claim clustering tasks and enabling better handling of recurrent misinformation across languages.
Abstract: Recurrent claims present a major challenge for automated fact-checking systems designed to combat misinformation, especially in multilingual settings. While tasks such as claim matching and fact-checked claim retrieval aim to address this problem by linking claim pairs, the broader challenge of effectively representing groups of similar claims that can be resolved with the same fact-check via claim clustering remains relatively underexplored. To address this gap, we introduce Claim2Vec, the first multilingual embedding model optimized to represent fact-check claims as vectors in an improved semantic embedding space. We fine-tune a multilingual encoder using contrastive learning with similar multilingual claim pairs. Experiments on the claim clustering task using three datasets, 14 multilingual embedding models, and 7 clustering algorithms demonstrate that Claim2Vec significantly improves clustering performance. Specifically, it enhances both cluster label alignment and the geometric structure of the embedding space across different cluster configurations. Our multilingual analysis shows that clusters containing multiple languages benefit from fine-tuning, demonstrating cross-lingual knowledge transfer.
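A minimal fine-tuning recipe in the spirit of Claim2Vec, using sentence-transformers with an in-batch-negative contrastive loss. The base encoder, loss choice, and toy claim pairs are assumptions, not the authors' exact configuration.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# A common multilingual base encoder; the paper's choice may differ.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Similar multilingual claim pairs (toy examples); in Claim2Vec these come
# from fact-check claim-matching data.
pairs = [
    InputExample(texts=["5G towers spread the virus",
                        "Las antenas 5G propagan el virus"]),
    InputExample(texts=["The moon landing was staged",
                        "Die Mondlandung wurde inszeniert"]),
]
loader = DataLoader(pairs, shuffle=True, batch_size=2)

# In-batch negatives pull paired claims together and push unrelated ones apart,
# which is what tightens the geometric structure of the cluster space.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)
```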
[8] Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling
Peiqi Sui, Yutong Zhu, Tianyi Cheng, Peter West, Richard Jean So, Hoyt Long, Ari Holtzman
Main category: cs.CL
TL;DR: The paper introduces 100-Endings metric to measure narrative tension in stories, showing LLMs fail to recognize compelling stories and proposes a story-generation pipeline that increases narrative tension while maintaining EQ-Bench performance.
Details
Motivation: LLMs fail to generate compelling stories and cannot recognize good storytelling - they rank zero-shot AI stories above New Yorker short stories on creative-writing benchmarks. Existing rubrics overlook narrative tension, a key dimension of compelling human stories.
Method: Introduces 100-Endings metric: walks through stories sentence by sentence, predicts 100 possible endings at each position, measures tension as prediction mismatch rate (see the sketch below). Also analyzes sentence-level curve statistics like inflection rate. Designs story-generation pipeline with structural constraints including story template analysis, idea formulation, and narrative scaffolding.
Result: 100-Endings correctly ranks New Yorker stories far above LLM outputs (unlike rubric-based judges). The proposed story-generation pipeline significantly increases narrative tension as measured by 100-Endings while maintaining performance on EQ-Bench leaderboard.
Conclusion: Narrative tension is a crucial missing dimension in story evaluation. The 100-Endings metric provides a better measure of compelling storytelling, and structural constraints can improve LLM story generation by increasing narrative tension.
Abstract: LLMs have so far failed both to generate consistently compelling stories and to recognize this failure–on the leading creative-writing benchmark (EQ-Bench), LLM judges rank zero-shot AI stories above New Yorker short stories, a gold standard for literary fiction. We argue that existing rubrics overlook a key dimension of compelling human stories: narrative tension. We introduce the 100-Endings metric, which walks through a story sentence by sentence: at each position, a model predicts how the story will end 100 times given only the text so far, and we measure tension as how often predictions fail to match the ground truth. Beyond the mismatch rate, the sentence-level curve yields complementary statistics, such as inflection rate, a geometric measure of how frequently the curve reverses direction, tracking twists and revelations. Unlike rubric-based judges, 100-Endings correctly ranks New Yorker stories far above LLM outputs. Grounded in narratological principles, we design a story-generation pipeline using structural constraints, including analysis of story templates, idea formulation, and narrative scaffolding. Our pipeline significantly increases narrative tension as measured by the 100-Endings metric, while maintaining performance on the EQ-Bench leaderboard.
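A schematic Python version of the 100-Endings walk and the inflection-rate statistic. The ending sampler and the semantic match judge are stubs standing in for an LLM and a matching model.

```python
import random

def tension_curve(sentences, predict_ending, matches, n=100):
    """100-Endings sketch: at each prefix, sample n endings and record how
    often they fail to match the true ending."""
    truth = sentences[-1]
    curve = []
    for i in range(1, len(sentences)):
        prefix = " ".join(sentences[:i])
        guesses = [predict_ending(prefix) for _ in range(n)]
        mismatch = sum(not matches(g, truth) for g in guesses) / n
        curve.append(mismatch)
    return curve

def inflection_rate(curve):
    """Fraction of interior points where the curve reverses direction:
    a geometric proxy for twists and revelations."""
    diffs = [b - a for a, b in zip(curve, curve[1:])]
    flips = sum(1 for d1, d2 in zip(diffs, diffs[1:]) if d1 * d2 < 0)
    return flips / max(len(diffs) - 1, 1)

# Toy run: a dummy sampler guessing randomly between two endings, and exact
# string match standing in for a semantic judge.
story = ["A knight set out.", "A storm rose.", "The knight turned back."]
pred = lambda prefix: random.choice(["The knight turned back.", "The knight won."])
match = lambda a, b: a == b
curve = tension_curve(story, pred, match, n=100)
print(curve, inflection_rate(curve))
```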
[9] Simulating Organized Group Behavior: New Framework, Benchmark, and Analysis
Xinkai Zou, Yiming Huang, Zhuohang Wu, Jian Sha, Nan Huang, Longfei Yun, Jingbo Shang, Letian Peng
Main category: cs.CL
TL;DR: Paper introduces Organized Group Behavior Simulation task and GROVE benchmark for predicting organizational decisions, with structured analytical framework for interpretable modeling of group behavior evolution and transfer.
Details
Motivation: Understanding how organized groups (corporations, organizations) make decisions is crucial for real-world dynamics and applications like market prediction, but lacks a formal research platform.
Method: Proposes Organized Group Behavior Simulation task, GROVE benchmark with 44 entities and 8,052 context-decision pairs, and a structured analytical framework converting decisions into interpretable behavioral models with time-aware adapters and group-aware transfer.
Result: Framework outperforms summarization- and retrieval-based baselines, captures temporal behavioral drift within groups, enables knowledge transfer for data-scarce organizations, and provides traceable evidence nodes.
Conclusion: Provides comprehensive research platform for group behavior understanding with practical applications, demonstrating effectiveness of structured analytical approach over simple prompting methods.
Abstract: Simulating how organized groups (e.g., corporations) make decisions (e.g., responding to a competitor’s move) is essential for understanding real-world dynamics and could benefit relevant applications (e.g., market prediction). In this paper, we formalize this problem as a concrete research platform for group behavior understanding, providing: (1) a task definition with benchmark and evaluation criteria, (2) a structured analytical framework with a corresponding algorithm, and (3) detailed temporal and cross-group analysis. Specifically, we propose Organized Group Behavior Simulation, a task that models organized groups as collective entities from a practical perspective: given a group facing a particular situation (e.g., AI Boom), predict the decision it would take. To support this task, we present GROVE (GRoup Organizational BehaVior Evaluation), a benchmark covering 44 entities with 8,052 real-world context-decision pairs collected from Wikipedia and TechCrunch across 9 domains, with an end-to-end evaluation protocol assessing consistency, initiative, scope, magnitude, and horizon. Beyond straightforward prompting pipelines, we propose a structured analytical framework that converts collective decision-making events into an interpretable, adaptive, and traceable behavioral model, achieving stronger performance than summarization- and retrieval-based baselines. It further introduces an adapter mechanism for time-aware evolution and group-aware transfer, and traceable evidence nodes grounding each decision rule in originating historical events. Our analysis reveals temporal behavioral drift within individual groups, which the time-aware adapter effectively captures for stronger prediction, and structured cross-group similarity that enables knowledge transfer for data-scarce organizations.
[10] Should We be Pedantic About Reasoning Errors in Machine Translation?
Calvin Bao, Marine Carpuat
Main category: cs.CL
TL;DR: The paper investigates reasoning errors in machine translation across multiple language pairs and finds that while these errors can be identified with varying precision, correcting them has limited impact on translation quality, suggesting reasoning faithfulness issues in MT.
Details
Motivation: To understand and quantify reasoning errors in machine translation systems across different language pairs, and to determine whether correcting these reasoning errors improves translation quality.
Method: Developed an automated annotation protocol to detect three types of reasoning errors in translation: source sentence-misaligned, model hypothesis-misaligned, and reasoning trace-misaligned. Tested weak-to-strong interventions (hedging, removal, re-reasoning after removal, hindsight, and oracle interventions; see the sketch below) on perturbed reasoning traces to correct identified errors.
Result: Reasoning errors can be identified with high precision in Urdu but lower precision in Spanish. Small corrections to reasoning traces have little impact on translation quality, while stronger interventions yield highest resolution rates but mixed translation quality gains. Removing reasoning errors doesn’t significantly resolve initial errors.
Conclusion: Machine translation systems exhibit limited reasoning faithfulness - while reasoning errors can be identified, correcting them doesn’t substantially improve translation quality, suggesting deeper issues in how reasoning is integrated into translation models.
Abstract: Across multiple language pairings (English $\to$ {Spanish, French, German, Mandarin, Japanese, Urdu, Cantonese}), we find reasoning errors in translation. To quantify how often these reasoning errors occur, we leverage an automated annotation protocol for reasoning evaluation wherein the goal is to detect if a reasoning step is any of three error categories: (1) source sentence-misaligned, (2) model hypothesis-misaligned, or (3) reasoning trace-misaligned. We probe the reasoning model with perturbed traces correcting for these identified reasoning errors using an array of weak-to-strong interventions: hedging, removal, re-reasoning after removal, hindsight, and oracle interventions. Experimenting with interventions on the reasoning traces suggests that small corrections to the reasoning have little impact on translation quality, but stronger interventions yield the highest resolution rates, despite translation quality gains being mixed. We find ultimately that reasoning errors in MT can be identified with high precision in Urdu but lower precision in Spanish, but that removing these reasoning errors does not resolve the initial errors significantly, suggesting limited reasoning faithfulness for machine translation.
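Toy renditions of three trace interventions, to make the weak-to-strong idea concrete. The mapping of names to operations is our reading of the summary, not the paper's definitions, and the model-dependent interventions (re-reasoning, oracle) are omitted.

```python
def hedge(trace: list[str], i: int) -> list[str]:
    """Weak intervention: soften the flagged step rather than deleting it."""
    out = trace.copy()
    out[i] = "Possibly: " + out[i]
    return out

def remove(trace: list[str], i: int) -> list[str]:
    """Stronger intervention: drop the misaligned step entirely."""
    return trace[:i] + trace[i + 1:]

def replace_step(trace: list[str], i: int, corrected: str) -> list[str]:
    """Strongest shown here: swap in a corrected step (hindsight-style)."""
    out = trace.copy()
    out[i] = corrected
    return out

trace = ["'bank' likely means riverbank here.",   # flagged: source-misaligned
         "So translate 'bank' as 'orilla'."]
print(hedge(trace, 0))
print(remove(trace, 0))
print(replace_step(trace, 0, "'bank' is a financial institution in this context."))
```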
[11] Human vs. Machine Deception: Distinguishing AI-Generated and Human-Written Fake News Using Ensemble Learning
Samuel Jaeger, Calvin Ibeneye, Aya Vera-Jimenez, Dhrubajyoti Ghosh
Main category: cs.CL
TL;DR: Study examines linguistic and emotional differences between human-written and AI-generated fake news, using machine learning to distinguish them with high accuracy.
Details
Motivation: The rise of AI-generated fake news alongside traditional human-written misinformation creates a need to understand their differences and develop reliable detection methods.
Method: Constructed document-level features including sentence structure, lexical diversity, punctuation, readability indices, and emotion-based features. Applied multiple classification models (logistic regression, random forest, SVM, XGBoost, neural network) and ensemble methods (see the sketch below).
Result: Strong classification performance with readability-based features as most informative predictors. AI-generated text shows more uniform stylistic patterns. Ensemble learning provides modest but consistent improvements.
Conclusion: Stylistic and structural properties of text provide a robust basis for distinguishing AI-generated misinformation from human-written fake news.
Abstract: The rapid adoption of large language models has introduced a new class of AI-generated fake news that coexists with traditional human-written misinformation, raising important questions about how these two forms of deceptive content differ and how reliably they can be distinguished. This study examines linguistic, structural, and emotional differences between human-written and AI-generated fake news and evaluates machine learning and ensemble-based methods for distinguishing these content types. A document-level feature representation is constructed using sentence structure, lexical diversity, punctuation patterns, readability indices, and emotion-based features capturing affective dimensions such as fear, anger, joy, sadness, trust, and anticipation. Multiple classification models, including logistic regression, random forest, support vector machines, extreme gradient boosting, and a neural network, are applied alongside an ensemble framework that aggregates predictions across models. Model performance is assessed using accuracy and area under the receiver operating characteristic curve. The results show strong and consistent classification performance, with readability-based features emerging as the most informative predictors and AI-generated text exhibiting more uniform stylistic patterns. Ensemble learning provides modest but consistent improvements over individual models. These findings indicate that stylistic and structural properties of text provide a robust basis for distinguishing AI-generated misinformation from human-written fake news.
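A small sketch of document-level feature extraction plus a soft-voting ensemble. Only a slice of the paper's feature set is shown (the emotion features are omitted), the data is synthetic, and the textstat/sklearn choices are illustrative.

```python
import numpy as np
import textstat
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

def doc_features(text: str) -> list[float]:
    """Tiny slice of the paper's feature set: readability, lexical
    diversity, and punctuation density."""
    words = text.split()
    return [
        textstat.flesch_reading_ease(text),                        # readability index
        len(set(w.lower() for w in words)) / max(len(words), 1),   # type-token ratio
        sum(text.count(c) for c in "!?,.;") / max(len(text), 1),   # punctuation rate
    ]

texts = ["Shocking!!! You won't believe this...",
         "Officials confirmed the report today."]
labels = [1, 0]   # toy labels: 1 = AI-generated, 0 = human-written
X = np.array([doc_features(t) for t in texts] * 20)   # repeat toy data for fitting
y = np.array(labels * 20)

# Soft-voting ensemble over two of the paper's model families.
clf = VotingClassifier(
    [("lr", LogisticRegression()), ("rf", RandomForestClassifier())],
    voting="soft",
).fit(X, y)
print(clf.predict(X[:2]))
```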
[12] Weird Generalization is Weirdly Brittle
Miriam Wanner, Hannah Collison, William Jurayj, Benjamin Van Durme, Mark Dredze, William Walden
Main category: cs.CL
TL;DR: Weird generalization in fine-tuned models is brittle and can be mitigated with simple prompt-based interventions.
Details
Motivation: To replicate and extend prior work on weird generalization - where models fine-tuned on narrow domains develop surprising traits that manifest outside that domain - to better understand this safety concern.
Method: Extended replication study across an expanded suite of models and datasets, testing various training-time, prompt-based interventions to mitigate weird generalization.
Result: Weird generalization is exceptionally brittle, emerging only for specific models on specific datasets, and vanishes under simple interventions; most effective interventions provide prompt context that makes generalized behavior expected
Conclusion: Weird generalization poses a safety threat but is easily mitigated with simple solutions, clarifying the nature of the risk
Abstract: Weird generalization is a phenomenon in which models fine-tuned on data from a narrow domain (e.g. insecure code) develop surprising traits that manifest even outside that domain (e.g. broad misalignment), a phenomenon that prior work has highlighted as a critical safety concern. Here, we present an extended replication study of key weird generalization results across an expanded suite of models and datasets. We confirm that surprising (and dangerous) traits can emerge under certain circumstances, but we find that weird generalization is exceptionally brittle: it emerges only for specific models on specific datasets, and it vanishes under simple training-time, prompt-based interventions. We find that the most effective interventions provide prompt context that makes the generalized behavior the expected behavior. However, we show that even very generic interventions that do not anticipate specific generalized traits can still be effective in mitigating weird generalization’s effects. Our findings thus help clarify the nature of the safety threat that weird generalization poses and point toward an easily implemented set of solutions.
[13] CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models
Mengfan Li, Xuanhua Shi, Yang Deng
Main category: cs.CL
TL;DR: CoSToM is a causal intervention framework that enhances LLMs’ Theory of Mind capabilities by mapping internal ToM representations and steering activations in critical layers, improving social reasoning and dialogue quality.
Details
Motivation: While LLMs show promise on standard ToM benchmarks, they often fail to generalize to complex scenarios and rely on prompt scaffolding rather than true intrinsic cognition. The paper aims to determine if LLMs truly possess internal ToM knowledge and can externalize it into stable behaviors.
Method: CoSToM framework: 1) Uses causal tracing to map internal distribution of ToM features and identify layers encoding fundamental ToM semantics. 2) Implements lightweight alignment via targeted activation steering within these ToM-critical layers (see the sketch below).
Result: Experiments show CoSToM significantly enhances human-like social reasoning capabilities and improves downstream dialogue quality compared to baseline approaches.
Conclusion: The framework successfully transitions from mechanistic interpretation to active intervention, demonstrating that targeted activation steering in identified ToM-critical layers can align LLMs’ internal knowledge with external social reasoning behaviors.
Abstract: Theory of Mind (ToM), the ability to attribute mental states to others, is a hallmark of social intelligence. While large language models (LLMs) demonstrate promising performance on standard ToM benchmarks, we observe that they often fail to generalize to complex task-specific scenarios, relying heavily on prompt scaffolding to mimic reasoning. The critical misalignment between the internal knowledge and external behavior raises a fundamental question: Do LLMs truly possess intrinsic cognition, and can they externalize this internal knowledge into stable, high-quality behaviors? To answer this, we introduce CoSToM (Causal-oriented Steering for ToM alignment), a framework that transitions from mechanistic interpretation to active intervention. First, we employ causal tracing to map the internal distribution of ToM features, empirically uncovering the internal layers’ characteristics in encoding fundamental ToM semantics. Building on this insight, we implement a lightweight alignment framework via targeted activation steering within these ToM-critical layers. Experiments demonstrate that CoSToM significantly enhances human-like social reasoning capabilities and downstream dialogue quality.
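A toy sketch of targeted activation steering via PyTorch forward hooks. The model, the flagged layer indices, and the steering vector are all placeholders; in CoSToM the layers come from causal tracing and the direction from the ToM feature analysis.

```python
import torch
import torch.nn as nn

# Toy stack standing in for an LLM's layers.
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(8)])
TOM_CRITICAL = {3, 4}            # indices causal tracing flagged (hypothetical)
steer = torch.randn(16) * 0.1    # direction encoding the target ToM feature (placeholder)

def make_hook(vec: torch.Tensor):
    def hook(module, inputs, output):
        # Returning a tensor from a forward hook replaces the layer's output,
        # shifting activations along the steering direction.
        return output + vec
    return hook

handles = [layers[i].register_forward_hook(make_hook(steer)) for i in TOM_CRITICAL]

x = torch.randn(2, 16)
for layer in layers:             # forward pass; hooks fire on layers 3 and 4 only
    x = torch.relu(layer(x))
for h in handles:                # steering is removable, leaving weights untouched
    h.remove()
print(x.shape)
```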
[14] ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification
Zhensheng Wang, ZhanTeng Lin, Wenmian Yang, Kun Zhou, Yiquan Zhang, Weijia Jia
Main category: cs.CL
TL;DR: Introduces ODUTQA-MDC task and benchmark for open-domain underspecified tabular QA with dynamic clarification, plus MAIC-TQA multi-agent framework for ambiguity detection and interactive refinement.
Details
Motivation: Current LLMs struggle with open-domain tabular QA when queries contain underspecified or uncertain expressions, requiring a systematic approach to handle ambiguity through clarification dialogues.
Method: Proposes ODUTQA-MDC benchmark with a large-scale dataset, fine-grained labeling, and a dynamic clarification interface. Introduces MAIC-TQA multi-agent framework for ambiguity detection, clarification dialogue, and answer refinement.
Result: Experiments validate the benchmark and framework, establishing them as key resources for advancing conversational, underspecification-aware tabular QA research.
Conclusion: The work provides comprehensive resources and a multi-agent framework to address underspecified queries in tabular QA through interactive clarification, advancing the field toward more robust conversational systems.
Abstract: The advancement of large language models (LLMs) has enhanced tabular question answering (Tabular QA), yet they struggle with open-domain queries exhibiting underspecified or uncertain expressions. To address this, we introduce the ODUTQA-MDC task and the first comprehensive benchmark to tackle it. This benchmark includes: (1) a large-scale ODUTQA dataset with 209 tables and 25,105 QA pairs; (2) a fine-grained labeling scheme for detailed evaluation; and (3) a dynamic clarification interface that simulates user feedback for interactive assessment. We also propose MAIC-TQA, a multi-agent framework that excels at detecting ambiguities, clarifying them through dialogue, and refining answers. Experiments validate our benchmark and framework, establishing them as a key resource for advancing conversational, underspecification-aware Tabular QA research.
[15] Computational Implementation of a Model of Category-Theoretic Metaphor Comprehension
Fumitaka Iwaki, Miho Fuyama, Hayato Saigo, Tatsuji Takahashi
Main category: cs.CL
TL;DR: Computational implementation of metaphor comprehension model using indeterminate natural transformation theory, with improved algorithms outperforming existing ones on data fitting, systematicity, and novelty measures.
Details
Motivation: To develop a computational implementation of the TINT theory of metaphor comprehension, simplifying the algorithms to better align with the original theoretical framework and validating them through empirical testing.
Method: Simplified algorithms implementing the indeterminate natural transformation model, evaluated through data fitting with experimental data, systematicity assessment of metaphor comprehension results, and novelty measurement of source-target associative structure correspondence.
Result: The improved algorithm outperformed existing implementations across all three evaluation measures: data-fitting with experimental data, systematicity of metaphor comprehension, and novelty of comprehension.
Conclusion: The computational implementation successfully operationalizes TINT theory for metaphor comprehension, with simplified algorithms demonstrating superior performance over previous approaches.
Abstract: In this study, we developed a computational implementation for a model of metaphor comprehension based on the theory of indeterminate natural transformation (TINT) proposed by Fuyama et al. We simplified the algorithms implementing the model to be closer to the original theory and verified it through data fitting and simulations. The outputs of the algorithms are evaluated with three measures: data-fitting with experimental data, the systematicity of the metaphor comprehension result, and the novelty of the comprehension (i.e. the correspondence of the associative structure of the source and target of the metaphor). The improved algorithm outperformed the existing ones in all three measures.
[16] Linguistic Accommodation Between Neurodivergent Communities on Reddit: A Communication Accommodation Theory Analysis of ADHD and Autism Groups
Saad Mankarious, Nour Zein, Iyad Ait Hou, Aya Zirikly
Main category: cs.CL
TL;DR: This paper examines how ADHD and autism communities on Reddit linguistically adjust their language when engaging with each other, showing convergent accommodation patterns that differ from identity disclosure effects.
Details
Motivation: The research shifts focus from individual-level mental health detection to intergroup behavior between neurodivergent communities, aiming to understand how ADHD and autism communities communicate across boundaries on social media platforms.
Method: Uses the Communication Accommodation Theory (CAT) framework with Linguistic Inquiry and Word Count (LIWC) analysis on Reddit data. Examines linguistic profiles within home communities and cross-community posting patterns, plus longitudinal analysis around diagnosis disclosure moments.
Result: Each community maintains distinct linguistic profiles that shift in opposite directions during cross-community engagement (convergent accommodation). Topic-independent variables show partial evidence against purely topical explanations. Diagnosis disclosure has small effects, sometimes opposite to accommodation patterns.
Conclusion: The study reveals complex intergroup communication dynamics between neurodivergent communities, suggesting situational audience adaptation and identity processes involve different mechanisms, with implications for community moderation and clinical perspectives.
Abstract: Social media research on mental health has focused predominantly on detecting and diagnosing conditions at the individual level. In this work, we shift attention to \emph{intergroup} behavior, examining how two prominent neurodivergent communities, ADHD and autism, adjust their language when engaging with each other on Reddit. Grounded in Communication Accommodation Theory (CAT), we first establish that each community maintains a distinct linguistic profile as measured by Language Inquiry and Word Count Lexicon (LIWC). We then show that these profiles shift in opposite directions when users cross community boundaries: features that are elevated in one group’s home community decrease when its members post in the other group’s space, and vice versa, consistent with convergent accommodation. The involvement of topic-independent summary variables (Authentic, Clout) in these shifts provides partial evidence against a purely topical explanation. Finally, in an exploratory longitudinal analysis around the moment of public diagnosis disclosure, we find that its effects on linguistic style are small and, in some cases, directionally opposite to cross-community accommodation, providing initial evidence that situational audience adaptation and longer-term identity processes may involve different mechanisms. Our findings contribute to understanding intergroup communication dynamics among neurodivergent populations online and carry implications for community moderation and clinical perspectives on these conditions.
[17] Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
Chao Xue, Yao Wang, Mengqiao Liu, Di Liang, Xingsheng Han, Peiyang Liu, Xianjie Wu, Chenyao Lu, Lei Jiang, Yu Lu, Haibo Shi, Shuang Liang, Minlong Peng, Flora D. Salim
Main category: cs.CL
TL;DR: E-GRM: Efficient generative reward modeling framework that uses model-internal uncertainty to selectively trigger Chain-of-Thought reasoning only when needed, reducing computational costs while improving answer accuracy.
Details
Motivation: Existing GRM implementations apply CoT prompting indiscriminately to all inputs regardless of complexity, introducing unnecessary computational costs. Current approaches also rely on voting-based mechanisms that lack granularity in assessing reasoning quality.
Method: E-GRM leverages convergence behavior of parallel model generations to estimate uncertainty and selectively trigger CoT reasoning only when needed (see the sketch below). It introduces a lightweight discriminative scorer trained with a hybrid regression-ranking objective for fine-grained evaluation of reasoning paths.
Result: Experiments on multiple reasoning benchmarks show E-GRM substantially reduces inference cost while consistently improving answer accuracy.
Conclusion: Model-internal uncertainty is an effective and general signal for efficient reasoning-aware reward modeling, enabling selective CoT triggering without handcrafted features or task-dependent signals.
Abstract: Recent advancements in the Generative Reward Model (GRM) have demonstrated its potential to enhance the reasoning abilities of LLMs through Chain-of-Thought (CoT) prompting. Despite these gains, existing implementations of GRM suffer from two critical limitations. First, CoT prompting is applied indiscriminately to all inputs regardless of their inherent complexity. This introduces unnecessary computational costs for tasks amenable to fast, direct inference. Second, existing approaches primarily rely on voting-based mechanisms to evaluate CoT outputs, which often lack granularity and precision in assessing reasoning quality. In this paper, we propose E-GRM, an efficient generative reward modeling framework grounded in model-internal uncertainty. E-GRM leverages the convergence behavior of parallel model generations to estimate uncertainty and selectively trigger CoT reasoning only when needed, without relying on handcrafted features or task-dependent signals. To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression–ranking objective to provide fine-grained evaluation of reasoning paths. Experiments on multiple reasoning benchmarks show that E-GRM substantially reduces inference cost while consistently improving answer accuracy, demonstrating that model-internal uncertainty is an effective and general signal for efficient reasoning-aware reward modeling.
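A schematic version of the uncertainty gate: sample parallel direct generations, measure convergence, and pay for Chain-of-Thought only when they disagree. The agreement-rate estimator and the threshold are simplifications of E-GRM's convergence signal, and the LLM calls are stubs.

```python
from collections import Counter
import random

def answer_with_selective_cot(question, sample_direct, solve_with_cot,
                              k=8, threshold=0.75):
    """Draw k parallel direct answers; if they converge, skip CoT entirely.

    `sample_direct` / `solve_with_cot` stand in for fast and slow LLM calls.
    """
    samples = [sample_direct(question) for _ in range(k)]
    top, count = Counter(samples).most_common(1)[0]
    if count / k >= threshold:               # generations converge: answer directly
        return top, "direct"
    return solve_with_cot(question), "cot"   # uncertain: pay for reasoning

# Toy stubs: a mostly-confident direct sampler and a deliberate CoT solver.
direct = lambda q: random.choice(["4", "4", "4", "5"])
cot = lambda q: "4"
print(answer_with_selective_cot("What is 2 + 2?", direct, cot))
```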
[18] Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Chao Xue, Yao Wang, Mengqiao Liu, Di Liang, Xingsheng Han, Peiyang Liu, Xianjie Wu, Chenyao Lu, Lei Jiang, Yu Lu, Haibo Shi, Shuang Liang, Minlong Peng, Flora D. Salim
Main category: cs.CL
TL;DR: Systematic study of Incomplete Learning Phenomenon (ILP) in LLM fine-tuning where models fail to reproduce subsets of their own supervised training data, identifying five root causes and proposing diagnostic framework.
Details
Motivation: Supervised Fine-Tuning (SFT) is standard for adapting LLMs but has a persistent failure mode where models fail to correctly reproduce subsets of their supervised training data even after convergence, requiring systematic study.
Method: Formalize ILP as post-training failure to internalize supervised instances, demonstrate prevalence across model families/domains/datasets, identify five recurrent sources through controlled analyses, and introduce a diagnostic-first framework mapping unlearned samples to causes using observable training/inference signals.
Result: Experiments on Qwen, LLaMA, and OLMo2 show incomplete learning is widespread and heterogeneous, with aggregate metric improvements masking persistent unlearned subsets. Targeted mitigation strategies studied as causal interventions.
Conclusion: Findings highlight need for fine-grained diagnosis of what supervised fine-tuning fails to learn and why, revealing limitations of current SFT approaches and importance of understanding failure modes beyond aggregate metrics.
Abstract: Supervised Fine-Tuning (SFT) is the standard approach for adapting large language models (LLMs) to downstream tasks. However, we observe a persistent failure mode: even after convergence, models often fail to correctly reproduce a subset of their own supervised training data. We refer to this behavior as the Incomplete Learning Phenomenon (ILP). This paper presents the first systematic study of ILP in LLM fine-tuning. We formalize ILP as post-training failure to internalize supervised instances and demonstrate its prevalence across multiple model families, domains, and datasets. Through controlled analyses, we identify five recurrent sources of incomplete learning: (1) missing prerequisite knowledge in the pre-trained model, (2) conflicts between SFT supervision and pre-training knowledge, (3) internal inconsistencies within SFT data, (4) left-side forgetting during sequential fine-tuning, and (5) insufficient optimization for rare or complex patterns. We introduce a diagnostic-first framework that maps unlearned samples to these causes using observable training and inference signals, and study several targeted mitigation strategies as causal interventions. Experiments on Qwen, LLaMA, and OLMo2 show that incomplete learning is widespread and heterogeneous, and that improvements in aggregate metrics can mask persistent unlearned subsets. The findings highlight the need for fine-grained diagnosis of what supervised fine-tuning fails to learn, and why.
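The first step of any such diagnosis is finding the unlearned subset. A minimal sketch, assuming a Hugging Face-style model/tokenizer interface: re-prompt the converged model on its own SFT data and flag examples it cannot reproduce (exact match is one possible criterion; the paper's diagnostics use richer training and inference signals).

```python
def find_unlearned_samples(model, tokenizer, sft_pairs, max_new_tokens=128):
    """Flag SFT (prompt, target) pairs the converged model fails to reproduce,
    a simple operationalization of incomplete learning."""
    unlearned = []
    for prompt, target in sft_pairs:
        inputs = tokenizer(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=False)  # greedy: the model's best guess
        completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
        if completion.strip() != target.strip():
            unlearned.append((prompt, target, completion))
    return unlearned
```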
[19] SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models
Han Liu, Haotian Gao, Xiaotong Zhang, Changya Li, Feng Zhang, Wei Wang, Fenglong Ma, Hong Yu
Main category: cs.CL
TL;DR: SEPTQ is a simple two-step post-training quantization method for LLMs that identifies important weight elements globally and quantizes them column-by-column, achieving better performance than existing methods especially in low-bit scenarios.
Details
Motivation: LLMs have massive computational and storage costs, making quantization essential for deployment on resource-limited devices. While PTQ is preferred over QAT for LLMs due to lower training costs, existing PTQ methods are complex and suffer significant performance degradation in low-bit settings.
Method: SEPTQ first calculates importance scores for each weight matrix element and determines quantization locations statically. Then uses a mask matrix representing important locations to quantize and update weights column-by-column until optimal quantized weights are obtained.
Result: SEPTQ outperforms other baselines across various datasets and model sizes (millions to billions of parameters) in different quantization bit-levels, with particularly strong performance in low-bit quantization scenarios.
Conclusion: SEPTQ provides an effective and efficient PTQ solution for LLMs that simplifies the quantization process to two steps while maintaining performance, especially valuable for low-bit deployment scenarios.
Abstract: Large language models (LLMs) have shown remarkable performance in various domains, but they are constrained by massive computational and storage costs. Quantization, an effective technique for compressing models to fit resource-limited devices while preserving generative quality, encompasses two primary methods: quantization aware training (QAT) and post-training quantization (PTQ). QAT involves additional retraining or fine-tuning, thus inevitably resulting in high training cost and making it unsuitable for LLMs. Consequently, PTQ has become the research hotspot in recent quantization methods. However, existing PTQ methods usually rely on various complex computation procedures and suffer from considerable performance degradation under low-bit quantization settings. To alleviate the above issues, we propose a simple and effective post-training quantization paradigm for LLMs, named SEPTQ. Specifically, SEPTQ first calculates the importance score for each element in the weight matrix and determines the quantization locations in a static global manner. Then it utilizes the mask matrix which represents the important locations to quantize and update the associated weights column-by-column until the appropriate quantized weight matrix is obtained. Compared with previous methods, SEPTQ simplifies the post-training quantization procedure into only two steps, and considers the effectiveness and efficiency simultaneously. Experimental results on various datasets across a suite of models ranging from millions to billions in different quantization bit-levels demonstrate that SEPTQ significantly outperforms other strong baselines, especially in low-bit quantization scenarios.
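A toy NumPy sketch of the two-step, mask-guided idea. The importance score (|weight| times calibration-column norm) and the crude error spill-over are assumed stand-ins for SEPTQ's actual score and column update rule, which the paper does not reduce to this simple form.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest quantization of a 1-D weight column."""
    qmax = 2 ** (bits - 1) - 1
    m = np.abs(w).max()
    scale = (m if m > 0 else 1.0) / qmax
    return np.round(w / scale) * scale

def septq_like(W, X, bits=4, keep_frac=0.01):
    """Toy two-step mask-guided PTQ: (1) score every weight element globally
    and mark the most important locations; (2) quantize column by column,
    keeping masked weights in full precision. W: (out, in); X: (samples, in)."""
    col_norms = np.linalg.norm(X, axis=0)              # calibration activations
    importance = np.abs(W) * col_norms[None, :]        # step 1: static, global
    mask = importance >= np.quantile(importance, 1.0 - keep_frac)

    Q = W.astype(np.float64).copy()
    for j in range(W.shape[1]):                        # step 2: column-by-column
        q = quantize_rtn(Q[:, j], bits)
        q[mask[:, j]] = Q[mask[:, j], j]               # important weights stay exact
        if j + 1 < W.shape[1]:
            Q[:, j + 1] += (Q[:, j] - q) * 0.5         # placeholder compensation
        Q[:, j] = q
    return Q
```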
[20] Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
Jiang Li, Tian Lan, Shanshan Wang, Dongxing Zhang, Dianqing Lin, Guanglai Gao, Derek F. Wong, Xiangdong Su
Main category: cs.CL
TL;DR: ChangAn benchmark for detecting LLM-generated classical Chinese poetry, with 30,664 poems (10,276 human-written, 20,388 AI-generated) and evaluation of 12 AI detectors showing current methods fail for this specialized domain.
Details
Motivation: The rise of AI-generated literary texts raises authenticity and ethical concerns, but existing AI text detection methods haven't addressed classical Chinese poetry, which presents unique challenges due to its strict metrical regularity, shared poetic imagery, and flexible syntax.
Method: Created ChangAn benchmark containing 30,664 poems (10,276 human-written, 20,388 generated by four popular LLMs). Systematically evaluated 12 AI detectors across different text granularities and generation strategies to assess their performance on classical Chinese poetry detection.
Result: Current Chinese text detectors fail as reliable tools for detecting LLM-generated classical Chinese poetry, highlighting the limitations of existing methods and validating the need for specialized benchmarks like ChangAn.
Conclusion: The ChangAn benchmark addresses a critical gap in AI text detection for classical Chinese poetry, demonstrating that current detectors are inadequate for this specialized literary domain and providing a valuable resource for future research.
Abstract: The rapid development of large language models (LLMs) has extended text generation tasks into the literary domain. However, AI-generated literary creation has raised increasingly prominent issues of creative authenticity and ethics in the literary world, making the detection of LLM-generated literary texts essential and urgent. While previous works have made significant progress in detecting AI-generated text, they have yet to address classical Chinese poetry. Due to the unique linguistic features of classical Chinese poetry, such as strict metrical regularity, a shared system of poetic imagery, and flexible syntax, distinguishing whether a poem is authored by AI presents a substantial challenge. To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry containing 30,664 poems in total: 10,276 human-written and 20,388 generated by four popular LLMs. Based on ChangAn, we conducted a systematic evaluation of 12 AI detectors, investigating their performance variations across different text granularities and generation strategies. Our findings highlight the limitations of current Chinese text detectors, which fail to serve as reliable tools for detecting LLM-generated classical Chinese poetry. These results validate the effectiveness and necessity of our proposed ChangAn benchmark. Our dataset and code are available at https://github.com/VelikayaScarlet/ChangAn.
[21] Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark
Arnon Turetzky, Avihu Dekel, Hagai Aronowitz, Ron Hoory, Yossi Adi
Main category: cs.CL
TL;DR: CAST benchmark evaluates TTS systems’ ability to infer contextually appropriate word-level stress from discourse context, revealing a gap between text understanding and speech realization.
Details
Motivation: Spoken meaning depends on word emphasis, and while TTS systems generate expressive speech, it's unclear if they can infer contextually appropriate stress from discourse alone. Current systems may not properly realize the intended emphasis patterns that convey correction, contrast, or clarification.
Method: Created Context-Aware Stress TTS (CAST) benchmark with contrastive context pairs: identical sentences paired with distinct contexts requiring different stressed words. Evaluated state-of-the-art TTS systems using this benchmark to assess their ability to generate appropriate word-level stress from context.
Result: Found a consistent gap: text-only language models reliably recover intended stress from context, but TTS systems frequently fail to realize it in speech. The benchmark reveals limitations in current TTS systems’ ability to generate contextually appropriate emphasis.
Conclusion: There’s a significant gap between text understanding and speech realization in TTS systems regarding context-aware stress. The CAST benchmark, evaluation framework, construction pipeline, and synthetic corpus are released to support future work on context-aware speech synthesis.
Abstract: Spoken meaning often depends not only on what is said, but also on which word is emphasized. The same sentence can convey correction, contrast, or clarification depending on where emphasis falls. Although modern text-to-speech (TTS) systems generate expressive speech, it remains unclear whether they infer contextually appropriate stress from discourse alone. To address this gap, we present Context-Aware Stress TTS (CAST), a benchmark for evaluating context-conditioned word-level stress in TTS. Items are defined as contrastive context pairs: identical sentences paired with distinct contexts requiring different stressed words. We evaluate state-of-the-art systems and find a consistent gap: text-only language models reliably recover the intended stress from context, yet TTS systems frequently fail to realize it in speech. We release the benchmark, evaluation framework, construction pipeline and a synthetic corpus to support future work on context-aware speech synthesis.
[22] CircuitSynth: Reliable Synthetic Data Generation
Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz
Main category: cs.CL
TL;DR: CircuitSynth is a neuro-symbolic framework that combines LLM reasoning with symbolic logic constraints to generate high-fidelity synthetic data with formal guarantees on validity and coverage.
Details
Motivation: LLMs often produce hallucinations, logical inconsistencies, and mode collapse when generating structured data. Existing methods lack mechanisms to balance linguistic expressivity with formal guarantees of validity and coverage.
Method: Decouples semantic reasoning from surface realization by distilling Teacher LLM reasoning into Probabilistic Sentential Decision Diagrams (PSDDs) to create tractable semantic priors that enforce hard logical constraints, plus convex optimization for soft distributional goals.
Result: Achieves 100% Schema Validity in complex logic puzzles (vs 12.4% for baselines) and significantly outperforms state-of-the-art methods in rare-combination coverage across diverse benchmarks.
Conclusion: CircuitSynth successfully bridges the gap between neural generation and symbolic reasoning, providing formal guarantees while maintaining expressivity for structured data generation.
Abstract: The generation of high-fidelity synthetic data is a cornerstone of modern machine learning, yet Large Language Models (LLMs) frequently suffer from hallucinations, logical inconsistencies, and mode collapse when tasked with structured generation. Existing approaches, such as prompting or retrieval-augmented generation, lack the mechanisms to balance linguistic expressivity with formal guarantees regarding validity and coverage. To address this, we propose CircuitSynth, a novel neuro-symbolic framework that decouples semantic reasoning from surface realization. By distilling the reasoning capabilities of a Teacher LLM into a Probabilistic Sentential Decision Diagram (PSDD), CircuitSynth creates a tractable semantic prior that structurally enforces hard logical constraints. Furthermore, we introduce a convex optimization mechanism to rigorously satisfy soft distributional goals. Empirical evaluations across diverse benchmarks demonstrate that CircuitSynth achieves 100% Schema Validity even in complex logic puzzles where unconstrained baselines fail (12.4%) while significantly outperforming state-of-the-art methods in rare-combination coverage.
[23] BlasBench: An Open Benchmark for Irish Speech Recognition
Jyoutir Raj, John Conway
Main category: cs.CL
TL;DR: BlasBench: An open Irish-specific ASR evaluation benchmark with Irish-aware text normalization that reveals generalization gaps between datasets
Details
Motivation: There is no open Irish-specific benchmark to compare end-user ASR systems under a shared Irish-aware evaluation protocol, making it difficult to properly assess ASR performance for the Irish language.
Method: Created BlasBench, an open evaluation harness with Irish-aware text normalization that preserves fadas, lenition, and eclipsis. Benchmarked 12 systems across four architecture families on Common Voice ga-IE and FLEURS ga-IE datasets.
Result: All Whisper variants exceeded 100% WER. The best open model (omniASR LLM 7B) achieved 30.65% WER on Common Voice and 39.09% on FLEURS. Models fine-tuned on Common Voice lost 33-43 WER points on FLEURS, revealing a significant generalization gap.
Conclusion: Single-dataset evaluation masks generalization problems in ASR systems. The release of BlasBench enables proper Irish ASR evaluation and reveals important cross-dataset performance gaps that need addressing.
Abstract: No open Irish-specific benchmark compares end-user ASR systems under a shared Irish-aware evaluation protocol. To solve this, we release BlasBench, an open evaluation harness with Irish-aware text normalisation that preserves fadas, lenition, and eclipsis. We benchmark 12 systems across four architecture families on Common Voice ga-IE and FLEURS ga-IE. All Whisper variants exceed 100% WER. The best open model (omniASR LLM 7B) achieves 30.65% WER on Common Voice and 39.09% on FLEURS. We find that models fine-tuned on Common Voice lose 33-43 WER points on FLEURS, revealing a generalisation gap that is invisible to single-dataset evaluation.
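The key protocol detail is normalisation that cleans text without destroying Irish orthography. A minimal sketch, assuming (not quoting) BlasBench's exact rules: lowercase and strip punctuation, but never fold away diacritics, so fadas (á é í ó ú) survive; lenition (e.g. "bhád") and eclipsis (e.g. "mbád") are ordinary letter sequences that survive as long as no aggressive ASCII folding is applied.

```python
import re
import unicodedata

def normalise_irish(text: str) -> str:
    """Irish-aware normalisation sketch: lowercase, strip punctuation,
    keep all accented characters intact."""
    text = unicodedata.normalize("NFC", text)   # compose accents, don't drop them
    text = text.lower()
    # Remove punctuation but keep letters (including accented ones) and spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# e.g. normalise_irish("Tá sé go maith!") -> "tá sé go maith"
```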
[24] Training-Free Cross-Lingual Dysarthria Severity Assessment via Phonological Subspace Analysis in Self-Supervised Speech Representations
Bernard Muller, Antonio Armando Ortiz Barrañón, LaVonne Roberts
Main category: cs.CL
TL;DR: Training-free dysarthria severity assessment using phonological feature subspaces in frozen HuBERT representations without requiring labeled pathological speech data.
Details
Motivation: Current dysarthric speech severity assessment methods require trained clinicians or supervised models with labeled pathological speech, limiting scalability across languages and clinical settings.Method: Extract phone-level embeddings via Montreal Forced Aligner, compute d-prime scores along phonological contrast directions (nasality, voicing, stridency, sonorance, manner, vowel features) derived from healthy controls, and construct a 12-dimensional phonological profile using frozen HuBERT representations.
Result: Evaluated 890 speakers across 10 corpora, 5 languages, and 3 aetiologies; all five consonant d-prime features correlate significantly with clinical severity (rho = -0.50 to -0.56). Nasality d-prime decreases monotonically from control to severe in 6 of 7 severity-graded corpora.
Conclusion: The method requires no dysarthric training data, applies to any language with existing MFA acoustic model (29 languages), and provides a scalable, training-free approach for dysarthria severity assessment.
Abstract: Dysarthric speech severity assessment typically requires trained clinicians or supervised models built from labelled pathological speech, limiting scalability across languages and clinical settings. We present a training-free method that quantifies dysarthria severity by measuring degradation in phonological feature subspaces within frozen HuBERT representations. No supervised severity model is trained; feature directions are estimated from healthy control speech using a pretrained forced aligner. For each speaker, we extract phone-level embeddings via Montreal Forced Aligner, compute d-prime scores along phonological contrast directions (nasality, voicing, stridency, sonorance, manner, and four vowel features) derived exclusively from healthy controls, and construct a 12-dimensional phonological profile. Evaluating 890 speakers across 10 corpora, 5 languages (English, Spanish, Dutch, Mandarin, French), and 3 primary aetiologies (Parkinson’s disease, cerebral palsy, ALS), we find that all five consonant d-prime features correlate significantly with clinical severity (random-effects meta-analysis rho = -0.50 to -0.56, p < 2e-4; pooled Spearman rho = -0.47 to -0.55 with bootstrap 95% CIs not crossing zero). The effect replicates within individual corpora, survives FDR correction, and remains robust to leave-one-corpus-out removal and alignment quality controls. Nasality d-prime decreases monotonically from control to severe in 6 of 7 severity-graded corpora. Mann-Whitney U tests confirm that all 12 features distinguish controls from severely dysarthric speakers (p < 0.001). The method requires no dysarthric training data and applies to any language with an existing MFA acoustic model (currently 29 languages). We release the full pipeline and phone feature configurations for six languages.
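The d-prime-along-a-contrast computation is compact enough to sketch in NumPy. The contrast direction here is assumed to be a class-mean difference vector estimated from healthy controls, which is one standard choice; the paper's exact estimator may differ.

```python
import numpy as np

def control_direction(ctrl_pos, ctrl_neg):
    """Estimate a phonological contrast direction (e.g. nasal vs. oral)
    exclusively from healthy-control phone embeddings."""
    return ctrl_pos.mean(axis=0) - ctrl_neg.mean(axis=0)

def contrast_dprime(emb_pos, emb_neg, direction):
    """d-prime separability of two phone classes of a test speaker, projected
    onto a contrast direction derived from controls.
    emb_pos/emb_neg: (n_phones, d) arrays of HuBERT-style embeddings."""
    d = direction / np.linalg.norm(direction)
    a = emb_pos @ d                      # e.g. the speaker's nasal phones
    b = emb_neg @ d                      # e.g. the speaker's oral phones
    pooled_sd = np.sqrt(0.5 * (a.var(ddof=1) + b.var(ddof=1)))
    return (a.mean() - b.mean()) / pooled_sd   # lower d' -> degraded contrast
```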
[25] Efficient Training for Cross-lingual Speech Language Models
Yan Zhou, Qingkai Fang, Yun Hong, Yang Feng
Main category: cs.CL
TL;DR: CSLM is an efficient training method for cross-lingual speech LLMs using discrete speech tokens and novel alignment strategies for multimodal and multilingual capabilities.
Details
Motivation: Current LLMs focus mainly on text, but speech LLMs are needed for natural human-AI interaction. Building effective end-to-end speech LLMs is challenging due to limited data and difficulty expanding to more languages.
Method: Proposes CSLM with novel alignment strategy achieving cross-modal and cross-lingual alignment through continual pre-training. Uses instruction fine-tuning following speech-text interleaved chain-of-modality generation process to enhance modal alignment at finer granularity.
Result: CSLM demonstrates strong cross-modal alignment capabilities and general task abilities on cross-modal tasks, mono-lingual conversational tasks, and cross-lingual conversational tasks. Aligns different modalities and languages simultaneously without massive speech data.
Conclusion: CSLM provides an efficient approach for building cross-lingual speech LLMs with good language scalability, improving generation quality and reducing latency through better modal alignment.
Abstract: Currently, large language models (LLMs) predominantly focus on the text modality. To enable more natural human-AI interaction, speech LLMs are emerging, but building effective end-to-end speech LLMs remains challenging due to limited data and the difficulty in expanding to more languages. In this paper, we introduce Cross-lingual Speech Language Model (CSLM), an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. We propose a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. By conducting instruction fine-tuning following a speech-text interleaved chain-of-modality generation process, we enhance modal alignment at a finer granularity, thereby improving generation quality and reducing latency. CSLM aligns different modalities and languages simultaneously without the need for massive speech data, thus exhibiting good language scalability. Evaluations on cross-modal tasks, mono-lingual conversational tasks, and cross-lingual conversational tasks demonstrate CSLM’s strong cross-modal alignment capabilities and general task abilities. (Code is available at: https://github.com/ictnlp/CSLM)
[26] Think in Sentences: Explicit Sentence Boundaries Enhance Language Model’s Capabilities
Zhichen Liu, Yongyuan Li, Yang Xu
Main category: cs.CL
TL;DR: Proposes inserting delimiters at sentence boundaries in LLM inputs to improve reasoning capabilities through sentence-by-sentence processing, achieving significant performance gains on reasoning tasks.
Details
Motivation: Existing approaches for improving LLMs via dummy token insertion focus only on the tokens themselves but fail to leverage the inherent sentence-level structure of natural language, which is critical since LLMs acquire linguistic capabilities through exposure to human-generated texts that are inherently structured at sentence level.
Method: Insert delimiters at sentence boundaries in LLM inputs to facilitate sentence-by-sentence processing behavior during reasoning. Two concrete methods: (1) In-context learning and (2) Supervised fine-tuning, experimented using models ranging from 7B to 600B parameters (Deepseek-V3).
Result: Consistent improvements across various tasks, with notable gains of up to 7.7% on GSM8k and 12.5% on DROP. Fine-tuned LLMs demonstrate sentence awareness through their internal representations.
Conclusion: Establishes a simple yet effective technique for enhancing LLM capabilities, offering promising directions for cognitive-inspired LLM enhancement paradigm by leveraging sentence-level structure.
Abstract: Researchers have explored different ways to improve large language models (LLMs)’ capabilities via dummy token insertion in contexts. However, existing works focus solely on the dummy tokens themselves, but fail to leverage the inherent sentence-level structure of natural language. This is a critical oversight, as LLMs acquire linguistic capabilities through exposure to human-generated texts, which are inherently structured at the sentence level. Motivated by this gap, we propose an approach that inserts delimiters at sentence boundaries in LLM inputs, which not only integrates dummy tokens into the context, but also facilitates LLMs with sentence-by-sentence processing behavior during reasoning. Two concrete methods, (1) in-context learning and (2) supervised fine-tuning, are evaluated on models ranging from 7B parameters to the 600B Deepseek-V3. Our results demonstrate consistent improvements across various tasks, with notable gains of up to 7.7% on GSM8k and 12.5% on DROP. Furthermore, the fine-tuned LLMs exhibit sentence awareness, as evidenced by their internal representations. Our work establishes a simple yet effective technique for enhancing LLM capabilities, offering a promising direction for a cognitively inspired LLM enhancement paradigm.
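The input transformation itself is trivial to implement, which is part of the appeal. A minimal sketch; the delimiter string is a hypothetical placeholder (the paper's token choice may differ), and a naive regex splitter stands in for a proper sentence tokenizer.

```python
import re

DELIM = "<sent>"  # hypothetical delimiter token

def insert_sentence_delimiters(text: str) -> str:
    """Insert an explicit delimiter at each sentence boundary so the model
    can process its input sentence by sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return f" {DELIM} ".join(s for s in sentences if s)

# e.g. insert_sentence_delimiters("Tom has 3 apples. He buys 2 more.")
# -> "Tom has 3 apples. <sent> He buys 2 more."
```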
[27] Nationality encoding in language model hidden states: Probing culturally differentiated representations in persona-conditioned academic text
Paul Jackson, Ruizhe Li, Elspeth Edelstein
Main category: cs.CL
TL;DR: Probing study shows Gemma-3-4b-it encodes nationality-discriminative information in hidden states when generating academic text conditioned by British vs. Chinese academic personas, despite no surface-level differences.
Details
Motivation: To investigate whether LLMs encode culturally differentiated representations when generating academic text, specifically testing if Gemma-3-4b-it encodes nationality-discriminative information in hidden states when conditioned by British and Chinese academic personas.
Method: Generated 270 texts from 45 prompt templates crossed with six persona conditions in a 2x3 design. Used logistic regression probes on hidden-state activations across 35 layers with various controls (shuffled-label baselines, surface-text skyline classifier, cross-family tests, sentence-level baselines). Annotated probe-selected token positions for structural, lexical, and stance features using Stanza NLP pipeline.
Result: Nationality probe reached 0.968 cross-validated accuracy at Layer 18 with perfect held-out classification. Nationality encoding followed non-monotonic trajectory across layers. British-associated patterns showed more postmodification, hedging, boosting, passive voice, and evaluative/process-oriented vocabulary. Chinese-associated patterns showed more premodification, nominal predicates, and sociocultural/internationalisation vocabulary. No significant nationality differences found in full generated surface text at sentence level.
Conclusion: LLMs encode nationality-discriminative information in hidden states despite no surface-level differences, extending probing methodology to sociolinguistic attributes with implications for EAP and language pedagogy.
Abstract: Large language models are increasingly used as writing tools and pedagogical resources in English for Academic Purposes, but it remains unclear whether they encode culturally differentiated representations when generating academic text. This study tests whether Gemma-3-4b-it encodes nationality-discriminative information in hidden states when generating research article introductions conditioned by British and Chinese academic personas. A corpus of 270 texts was generated from 45 prompt templates crossed with six persona conditions in a 2 x 3 design. Logistic regression probes were trained on hidden-state activations across all 35 layers, with shuffled-label baselines, a surface-text skyline classifier, cross-family tests, and sentence-level baselines used as controls. Probe-selected token positions were annotated for structural, lexical, and stance features using the Stanza NLP pipeline. The nationality probe reached 0.968 cross-validated accuracy at Layer 18, with perfect held-out classification. Nationality encoding followed a non-monotonic trajectory across layers, with structural effects strongest in the middle to upper network and lexical-domain effects peaking earlier. At high-signal token positions, British-associated patterns showed more postmodification, hedging, boosting, passive voice, and evaluative or process-oriented vocabulary, while Chinese-associated patterns showed more premodification, nominal predicates, and sociocultural or internationalisation vocabulary. However, sentence-level analysis found no significant nationality differences in the full generated surface text. The findings extend probing methodology to a sociolinguistic attribute and have practical implications for EAP and language pedagogy.
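The per-layer probing setup is a standard recipe, so a faithful sketch is possible with scikit-learn; the activation tensor layout below is an assumption about how one would cache the model's hidden states.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layer(hidden_states, labels, layer):
    """Cross-validated accuracy of a logistic-regression probe on one layer.
    hidden_states: (n_texts, n_layers, d_model) cached activations (assumed
    layout); labels: persona nationality (0 = British, 1 = Chinese)."""
    X = hidden_states[:, layer, :]
    clf = LogisticRegression(max_iter=2000)
    return cross_val_score(clf, X, labels, cv=5).mean()

# Sweep all layers to locate where the signal peaks (Layer 18 in the paper):
# accs = [probe_layer(H, y, L) for L in range(H.shape[1])]
```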
[28] FAITH: Factuality Alignment through Integrating Trustworthiness and Honestness
Xiaoning Dong, Chengyan Wu, Yajie Wen, Yu Chen, Yun Xue, Jing Zhang, Wei Xu, Bolei Ma
Main category: cs.CL
TL;DR: FAITH: A post-training framework that improves LLM factuality by integrating natural-language uncertainty signals with external knowledge through trustworthiness/honestness quadrants and retrieval augmentation.
Details
Motivation: LLMs generate factually inaccurate content despite having knowledge, undermining reliability. Existing approaches use numerical uncertainty scores that lack semantic richness for LLMs to understand their internal states of trustworthiness and honestness, leading to insufficient factuality alignment.
Method: 1) Augment training datasets by computing confidence scores and semantic entropy from LLM outputs, mapping them to a knowledge state quadrant describing trustworthiness (knowledge possession) and honestness (answering behaviors) in natural language. 2) Design reward function considering both correctness and uncertainty signals, fine-tune LLM using PPO. 3) Add retrieval-augmented module to retrieve relevant external passages to improve consistency between internal and external knowledge.
Result: Extensive experiments on four knowledge-intensive benchmarks demonstrate that FAITH enhances factual accuracy and truthfulness of LLMs.
Conclusion: FAITH effectively improves LLM factuality by integrating natural-language uncertainty signals with external knowledge through a comprehensive framework addressing both internal knowledge states and external grounding.
Abstract: Large Language Models (LLMs) can generate factually inaccurate content even if they have corresponding knowledge, which critically undermines their reliability. Existing approaches attempt to mitigate this by incorporating uncertainty in QA prompt during training, but these numerical scores lack the semantic richness for LLM to properly understand its internal states of trustworthiness and honestness, leading to insufficient factuality alignment. We introduce FAITH (Factuality Alignment through Integrating Trustworthiness and Honestness), a post-training framework for factuality alignment that integrates natural-language uncertainty signals with external knowledge. Specifically, we augment training datasets by computing confidence scores and semantic entropy from LLM outputs and mapping them into a knowledge state quadrant that describes the model’s internal knowledge possession (trustworthiness) and answering behaviors (honestness) in natural language. Based on this enhanced data, we design a reward function that considers both correctness and uncertainty signals, and fine-tune the LLM using the Proximal Policy Optimization (PPO) algorithm. To further mitigate weakly grounded responses, we design a retrieval-augmented module that retrieves relevant external passages, improving the consistency between internal and external knowledge representations. Extensive experiments on four knowledge-intensive benchmarks demonstrate that FAITH enhances the factual accuracy and truthfulness of LLMs.
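A sketch of the augmentation step, under stated assumptions: `same_meaning` is a hypothetical equivalence check (e.g. NLI-based), and the thresholds and quadrant phrasing are illustrative, not FAITH's exact design.

```python
import math

def semantic_entropy(answers, same_meaning):
    """Entropy over meaning clusters of sampled answers."""
    clusters = []
    for a in answers:
        for c in clusters:
            if same_meaning(a, c[0]):
                c.append(a)
                break
        else:
            clusters.append([a])
    probs = [len(c) / len(answers) for c in clusters]
    return -sum(p * math.log(p) for p in probs)

def knowledge_state(confidence, entropy, conf_th=0.7, ent_th=0.7):
    """Map (confidence, semantic entropy) to a natural-language cell of a 2x2
    trustworthiness/honestness quadrant, as FAITH's augmentation step does."""
    trustworthy = entropy < ent_th        # consistent internal knowledge
    honest = confidence >= conf_th        # stated confidence matches behavior
    if trustworthy and honest:
        return "the model possesses this knowledge and answers confidently"
    if trustworthy:
        return "the model possesses this knowledge but answers hesitantly"
    if honest:
        return "the model lacks consistent knowledge yet answers confidently"
    return "the model lacks consistent knowledge and answers hesitantly"
```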
[29] Relational Probing: LM-to-Graph Adaptation for Financial Prediction
Yingjie Niu, Changhong Jin, Rian Dolphin, Ruihai Dong
Main category: cs.CL
TL;DR: Relational Probing replaces LM head with relation head to induce relational graphs from hidden states for stock prediction, enabling joint training with downstream tasks.
Details
Motivation: Current prompting-based pipelines for financial relation extraction incur autoregressive decoding costs and decouple graph construction from downstream optimization. Need more efficient integration of language models with structured output tasks.
Method: Replace standard language model head with a relation head that induces relational graphs directly from LM hidden states. Train jointly with downstream task model for stock-trend prediction. Uses Qwen3 SLMs (0.6B/1.7B/4B) as upstream models.
Result: Relational Probing yields consistent performance improvements at competitive inference cost compared to co-occurrence baselines. Enables language model outputs to be reshaped into task-specific formats.
Conclusion: The approach learns semantic representations while preserving strict graph structure, allowing LMs to go beyond text generation into structured output tasks efficiently.
Abstract: Language models can be used to identify relationships between financial entities in text. However, while structured output mechanisms exist, prompting-based pipelines still incur autoregressive decoding costs and decouple graph construction from downstream optimization. We propose Relational Probing, which replaces the standard language-model head with a relation head that induces a relational graph directly from language-model hidden states and is trained jointly with the downstream task model for stock-trend prediction. This approach both learns semantic representations and preserves the strict structure of the induced relational graph. It enables language-model outputs to go beyond text, allowing them to be reshaped into task-specific formats for downstream models. To enhance reproducibility, we provide an operational definition of small language models (SLMs): models that can be fine-tuned end-to-end on a single 24GB GPU under specified batch-size and sequence-length settings. Experiments use Qwen3 backbones (0.6B/1.7B/4B) as upstream SLMs and compare against a co-occurrence baseline. Relational Probing yields consistent performance improvements at competitive inference cost.
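A PyTorch sketch of what "relation head instead of LM head" can look like. The bilinear form and sigmoid adjacency are an assumed parameterization, not the paper's exact one; `entity_states` is assumed to be entity representations pooled from LM hidden states.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Score pairwise relations between entity representations, yielding a
    soft adjacency matrix a downstream model can consume and backpropagate
    through, avoiding autoregressive decoding of graph structure."""
    def __init__(self, d_model: int, d_rel: int = 128):
        super().__init__()
        self.src = nn.Linear(d_model, d_rel)
        self.dst = nn.Linear(d_model, d_rel)

    def forward(self, entity_states: torch.Tensor) -> torch.Tensor:
        # entity_states: (n_entities, d_model) pooled from LM hidden states
        s, t = self.src(entity_states), self.dst(entity_states)
        logits = s @ t.T / s.shape[-1] ** 0.5   # (n_entities, n_entities)
        return torch.sigmoid(logits)            # soft adjacency, trained end-to-end

# adj = RelationHead(d_model)(entity_states)   # feed adj to the trend predictor
```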
[30] RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents
Riccardo Rosati, Edoardo Colucci, Massimiliano Bolognini, Adriano Mancini, Paolo Sernani
Main category: cs.CL
TL;DR: RPA-Check: A multi-stage automated evaluation framework for assessing LLM-based Role-Playing Agents in complex, constraint-heavy environments using semantic filtering and LLM-as-a-Judge verification.
Details
Motivation: Standard NLP metrics fail to capture nuances of role adherence, logical consistency, and narrative stability in LLM-based Role-Playing Agents, creating a need for specialized evaluation frameworks.
Method: Four-step pipeline: (1) Dimension Definition for qualitative behavioral criteria, (2) Augmentation into granular boolean checklist indicators, (3) Semantic Filtering for objectivity and redundancy removal, (4) LLM-as-a-Judge Evaluation with chain-of-thought verification.
Result: Framework validated on LLM Court (forensic training game) across five legal scenarios, revealing inverse relationship between parametric scale and procedural consistency, with smaller instruction-tuned models (8-9B) outperforming larger architectures.
Conclusion: RPA-Check provides standardized, reproducible metrics for generative agent evaluation in specialized domains, addressing limitations of current evaluation methods for interactive LLM systems.
Abstract: The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraint-heavy environments. Our methodology is based on a four-step pipeline: (1) Dimension Definition, establishing high-level qualitative behavioral criteria; (2) Augmentation, where these requirements are expanded into granular boolean checklist indicators; (3) Semantic Filtering, to ensure indicator objectivity, non-redundancy, and agent isolation; and (4) LLM-as-a-Judge Evaluation, which employs chain-of-thought verification to score agent fidelity. We validate this framework by applying it to LLM Court, a serious game for forensic training involving several quantized local models. Experimental results across five distinct legal scenarios demonstrate the framework’s ability to identify subtle trade-offs between model size, reasoning depth, and operational stability. Notably, the findings reveal an inverse relationship between parametric scale and procedural consistency, showing that smaller, adequately instruction-tuned models (8-9B) can outperform larger architectures prone to user-alignment bias or sycophancy. RPA-Check thus provides a standardized and reproducible metric for future research in generative agent evaluation within specialized domains.
[31] CodeComp: Structural KV Cache Compression for Agentic Coding
Qiujiang Chen, Jing Xiong, Chenyang Zhao, Sidi Yang, Ngai Wong
Main category: cs.CL
TL;DR: CodeComp: A training-free KV cache compression framework that incorporates static program analysis via Code Property Graphs to preserve structurally critical tokens for code understanding tasks.
Details
Motivation: Agentic code tasks like fault localization and patch generation require processing long codebases under tight memory constraints, where the KV cache becomes the primary bottleneck. Existing compression methods rely exclusively on attention signals, systematically discarding structurally critical tokens essential for code understanding.
Method: CodeComp incorporates static program analysis into LLM inference via Code Property Graph priors extracted by Joern. It identifies and preserves structurally critical tokens (call sites, branch conditions, assignments) that attention-only methods would discard, enabling effective KV cache compression without model modification.
Result: CodeComp consistently outperforms attention-only compression baselines under equal memory budgets, recovers majority of full-context accuracy under aggressive KV cache compression, matches patch generation quality of uncompressed full-context inference, and integrates seamlessly into SGLang-based agentic coding pipelines.
Conclusion: Incorporating program analysis priors into KV cache compression is crucial for code understanding tasks, as attention signals alone are insufficient for identifying structurally critical tokens. CodeComp demonstrates that static analysis can significantly improve compression effectiveness without model retraining.
Abstract: Agentic code tasks such as fault localization and patch generation require processing long codebases under tight memory constraints, where the Key-Value (KV) cache becomes the primary inference bottleneck. Existing compression methods rely exclusively on attention signals to estimate token importance, systematically discarding structurally critical tokens such as call sites, branch conditions, and assignments that are essential for code understanding. We present CodeComp, a training-free KV cache compression framework that incorporates static program analysis into LLM inference via Code Property Graph priors extracted by Joern. Across bug localization and code generation benchmarks, CodeComp consistently outperforms attention-only compression baselines under equal memory budgets, recovering the majority of full-context accuracy under aggressive KV cache compression, while matching the patch generation quality of uncompressed full-context inference and integrating seamlessly into SGLang-based agentic coding pipelines without model modification.
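The core eviction decision can be sketched compactly: fuse attention-based importance with a structural prior and keep the top positions under a budget. The multiplicative boost is an assumed fusion rule; CodeComp's actual weighting of the Code Property Graph prior may differ.

```python
import numpy as np

def keep_indices(attn_scores, structural_mask, budget, boost=2.0):
    """Select KV-cache positions to keep under a token budget.
    attn_scores: accumulated attention mass per token position;
    structural_mask: True where a token maps to a CPG-critical node
    (call site, branch condition, assignment)."""
    priority = attn_scores * np.where(structural_mask, boost, 1.0)
    kept = np.argsort(priority)[-budget:]    # top-`budget` positions
    return np.sort(kept)                     # preserve original token order

# kv_cache would then be gathered at these indices before continuing decoding.
```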
[32] Comparative Analysis of Large Language Models in Healthcare
Subin Santhosh, Farwa Abbas, Hussain Ahmad, Claudia Szabo
Main category: cs.CL
TL;DR: Comparative evaluation of LLMs (ChatGPT, LLaMA, Grok, Gemini, ChatDoctor) on medical tasks shows domain-specific models excel in contextual reliability while general-purpose models perform better in structured QA tasks.
Details
Motivation: LLMs are transforming healthcare AI but lack standardized benchmarking for medical applications, raising concerns about accuracy, reliability, and patient safety in clinical deployment.
Method: Evaluated multiple LLMs on core medical tasks (patient note summarization, medical QA) using open-access datasets (MedMCQA, PubMedQA, Asclepius) with linguistic and task-specific metrics.
Result: Domain-specific models (ChatDoctor) excel in contextual reliability and medical accuracy, while general-purpose models (Grok, LLaMA) perform better in structured QA tasks with higher quantitative accuracy.
Conclusion: LLMs can support medical professionals but require ethical standards, contextual accuracy, and human oversight; task-specific evaluation and cautious integration into healthcare workflows are essential.
Abstract: Background: Large Language Models (LLMs) are transforming artificial intelligence applications in healthcare due to their ability to understand, generate, and summarize complex medical text. They offer valuable support to clinicians, researchers, and patients, yet their deployment in high-stakes clinical environments raises critical concerns regarding accuracy, reliability, and patient safety. Despite substantial attention in recent years, standardized benchmarking of LLMs for medical applications has been limited. Objective: This study addresses the need for a standardized comparative evaluation of LLMs in medical settings. Method: We evaluate multiple models, including ChatGPT, LLaMA, Grok, Gemini, and ChatDoctor, on core medical tasks such as patient note summarization and medical question answering, using the open-access datasets, MedMCQA, PubMedQA, and Asclepius, and assess performance through a combination of linguistic and task-specific metrics. Results: The results indicate that domain-specific models, such as ChatDoctor, excel in contextual reliability, producing medically accurate and semantically aligned text, whereas general-purpose models like Grok and LLaMA perform better in structured question-answering tasks, demonstrating higher quantitative accuracy. This highlights the complementary strengths of domain-specific and general-purpose LLMs depending on the medical task. Conclusion: Our findings suggest that LLMs can meaningfully support medical professionals and enhance clinical decision-making; however, their safe and effective deployment requires adherence to ethical standards, contextual accuracy, and human oversight in relevant cases. These results underscore the importance of task-specific evaluation and cautious integration of LLMs into healthcare workflows.
[33] Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation
Mohamed Ehab, Ali Hamdi
Main category: cs.CL
TL;DR: AMR is a framework that improves math reasoning in LLMs by dynamically adapting strategies based on problem difficulty, using specialized experts and verification to enhance robustness.
Details
Motivation: LLMs show inconsistent performance across math problems of varying difficulty levels, needing a more adaptive approach to handle complexity effectively.
Method: Uses difficulty-based routing to predict problem complexity, deploys three specialized experts to generate responses, employs multiple correction phases, neural verification, and clustering-based aggregation for final answer selection.
Result: Achieved 75.28% accuracy on GSM8K dataset using only original training data, outperforming most comparable 7B models trained on synthetic data.
Conclusion: Difficulty-based routing and uncertainty-driven aggregation are efficient and effective for improving math reasoning robustness in LLMs.
Abstract: Large language models (LLMs) demonstrate strong performance on math reasoning benchmarks, but their performance varies inconsistently across problems of varying difficulty. This paper describes Adaptive Multi-Expert Reasoning (AMR), a framework that dynamically adapts its reasoning strategy to problem complexity. A lightweight routing system operating on the problem text predicts difficulty and uncertainty and guides a reconfigurable sampling mechanism that manages the breadth of generation. Three specialized experts create candidate responses, which are modified during multiple correction and finalization phases. A neural verifier assesses the correctness of responses, while a clustering-based aggregation technique identifies the final candidate answer based on a combination of consensus and answer quality. When evaluated on the GSM8K dataset, AMR achieved 75.28% accuracy while using only the original training data. This result outperformed the majority of comparable 7B models that were trained on synthetic data. This showcases that difficulty-based routing and uncertainty-driven aggregation are efficient and effective at improving the robustness of math reasoning models.
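The consensus-plus-quality aggregation step can be sketched directly. `same_answer` and `quality` are assumed callables standing in for the paper's answer-equivalence check and neural verifier, and the 50/50 weighting is illustrative.

```python
def aggregate_by_cluster(candidates, same_answer, quality):
    """Cluster candidate answers, score each cluster by consensus (size) plus
    average verifier quality, and return the best answer in the best cluster."""
    clusters = []
    for c in candidates:
        for cl in clusters:
            if same_answer(c, cl[0]):
                cl.append(c)
                break
        else:
            clusters.append([c])

    def score(cl):
        consensus = len(cl) / len(candidates)
        avg_quality = sum(quality(c) for c in cl) / len(cl)
        return 0.5 * consensus + 0.5 * avg_quality   # illustrative weighting

    best_cluster = max(clusters, key=score)
    return max(best_cluster, key=quality)
```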
[34] Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
Peng Wang, Yanqiao Zhu, Zixuan Jiang, Qinyuan Chen, Xingjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen
Main category: cs.CL
TL;DR: LLM-driven agentic framework for interactive ASR with semantic-aware evaluation and multi-turn correction
Details
Motivation: Current ASR systems lack semantic-level evaluation (beyond WER) and systematic interactive correction capabilities that mimic human communication.
Method: Proposes LLM-as-a-Judge for semantic evaluation and LLM-driven agent framework for multi-turn interactive refinement of recognition outputs.
Result: Extensive experiments on GigaSpeech, WenetSpeech, and ASRU 2019 code-switching test sets show improved semantic fidelity and interactive correction capability
Conclusion: Agentic framework effectively integrates semantic evaluation and interactive correction for ASR, advancing beyond traditional token-level metrics
Abstract: Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. Second, interactive correction, an essential component of human communication, has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. We propose leveraging LLM-as-a-Judge as a semantic-aware evaluation metric to assess recognition quality beyond token-level accuracy. Furthermore, we design an LLM-driven agent framework to simulate human-like multi-turn interaction, enabling iterative refinement of recognition outputs through semantic feedback. Extensive experiments are conducted on standard benchmarks, including GigaSpeech (English), WenetSpeech (Chinese), and the ASRU 2019 code-switching test set. Both objective and subjective evaluations demonstrate the effectiveness of the proposed framework in improving semantic fidelity and interactive correction capability. We will release the code to facilitate future research in interactive and agentic ASR.
[35] A Structured Clustering Approach for Inducing Media Narratives
Rohan Das, Advait Deshmukh, Alexandria Leto, Zohar Naaman, I-Ta Lee, Maria Leonor Pacheco
Main category: cs.CL
TL;DR: A framework for inducing rich narrative schemas by jointly modeling events and characters through structured clustering to capture nuanced storytelling structures in media narratives.
Details
Motivation: Media narratives shape public opinion, but computational approaches fail to capture nuanced storytelling structures emphasized by communication theory. Existing methods either miss subtle patterns through coarse analysis or require domain-specific taxonomies that limit scalability.
Method: A framework that induces narrative schemas by jointly modeling events and characters via structured clustering, producing explainable schemas aligned with framing theory while scaling to large corpora without exhaustive manual annotation.
Result: The approach produces rich, explainable narrative schemas that capture nuanced storytelling structures while being scalable to large text corpora without requiring extensive manual annotation.
Conclusion: The framework bridges the gap between computational analysis and communication theory by providing a scalable method to capture nuanced narrative structures in media content.
Abstract: Media narratives wield tremendous power in shaping public opinion, yet computational approaches struggle to capture the nuanced storytelling structures that communication theory emphasizes as central to how meaning is constructed. Existing approaches either miss subtle narrative patterns through coarse-grained analysis or require domain-specific taxonomies that limit scalability. To bridge this gap, we present a framework for inducing rich narrative schemas by jointly modeling events and characters via structured clustering. Our approach produces explainable narrative schemas that align with established framing theory while scaling to large corpora without exhaustive manual annotation.
[36] BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection
Saukun Thika You, Nguyen Anh Khoa Tran, Wesley K. Marizane, Hanshu Rao, Qiunan Zhang, Xiaolei Huang
Main category: cs.CL
TL;DR: BLUEmed: Multi-agent debate framework with hybrid RAG for detecting terminology substitution errors in clinical notes, achieving state-of-the-art performance through evidence-grounded reasoning and multi-perspective verification.
Details
Motivation: Terminology substitution errors in clinical notes (where medical terms are replaced by linguistically valid but clinically different terms) are challenging for automated detection in healthcare, requiring robust methods to identify subtle but critical errors.
Method: Multi-agent debate framework with hybrid Retrieval-Augmented Generation (RAG) that decomposes clinical notes into sub-queries, retrieves evidence through dense/sparse/online retrieval, assigns two domain expert agents distinct knowledge bases for independent analysis, uses structured counter-argumentation and cross-source adjudication for disagreements, and includes cascading safety layer to filter false positives.
Result: Achieves best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting on clinical terminology substitution detection benchmark, outperforming single-agent RAG and debate-only baselines across six backbone models and two prompting strategies.
Conclusion: Retrieval augmentation and structured debate are complementary for clinical error detection, with the framework benefiting most from models with strong instruction-following and clinical language understanding capabilities.
Abstract: Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi-agent debate framework augmented with hybrid Retrieval-Augmented Generation (RAG) that combines evidence-grounded reasoning with multi-perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero-shot and few-shot prompting with multiple backbone models spanning proprietary and open-source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming both single-agent RAG and debate-only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction-following and clinical language understanding.
[37] NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data
Cong Ming, Ruixin Shi, Yifan Hu
Main category: cs.CL
TL;DR: A framework that uses LLMs to generate synthetic names for augmenting training data to improve name-based nationality classification, particularly for underrepresented countries, while maintaining inference efficiency.
Details
Motivation: Existing name-based nationality classifiers suffer from coverage gaps and limited performance for underrepresented countries due to small or source-specific training datasets. While LLMs show strong zero-shot performance, they are computationally expensive for real-time, large-scale deployment.
Method: Created a large-scale name-nationality dataset from Open Academic Graph (OAG), then used LLMs as dataset enrichers (not inference engines) to generate synthetic names for augmenting low-resource countries. Evaluated on both real and synthetic-tail test sets.
Result: NameBERT models achieved significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks. Augmentation produced large gains when evaluation included synthetic tail names and modest lift on tail-country metrics otherwise.
Conclusion: Using LLMs for data augmentation rather than inference enables efficient, high-performance name-based nationality classification that addresses coverage gaps for underrepresented countries while maintaining scalability.
Abstract: Inferring nationality from personal names is a critical capability for equity and bias monitoring, personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we created a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics otherwise. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.
[38] LASQ: A Low-resource Aspect-based Sentiment Quadruple Extraction Dataset
Aizihaierjiang Yusufu, Jiang Liu, Kamran Aziz, Abidan Ainiwaer, Bobo Li, Fei Li, Donghong Ji, Aizierguli Yusufu
Main category: cs.CL
TL;DR: LASQ: First low-resource language aspect-based sentiment quadruple dataset for Uzbek and Uyghur with syntax-enhanced grid-tagging model
Details
Motivation: Existing ABSA research focuses on high-resource languages, leaving low-resource languages under-explored, especially for fine-grained sentiment extraction tasks.
Method: Created LASQ dataset for Uzbek and Uyghur with quadruple extraction task; designed grid-tagging model with Syntax Knowledge Embedding Module (SKEM) incorporating POS and dependency knowledge to handle agglutinative language challenges.
Result: Experiments on LASQ show consistent improvements over competitive baselines, validating both the dataset’s utility and the effectiveness of the proposed modeling approach.
Conclusion: LASQ addresses the gap in low-resource language ABSA research and the proposed syntax-enhanced model effectively handles lexical sparsity in agglutinative languages.
Abstract: In recent years, aspect-based sentiment analysis (ABSA) has made rapid progress and shown strong practical value. However, existing research and benchmarks are largely concentrated on high-resource languages, leaving fine-grained sentiment extraction in low-resource languages under-explored. To address this gap, we constructed the first Low-resource-language Aspect-based Sentiment Quadruple dataset, named LASQ, which covers two low-resource languages, Uzbek and Uyghur, and defines a fine-grained target-aspect-opinion-sentiment quadruple extraction task. To facilitate future research, we designed a grid-tagging model that integrates syntactic knowledge. This model incorporates part-of-speech (POS) and dependency knowledge through our designed Syntax Knowledge Embedding Module (SKEM), thereby alleviating the lexical sparsity problem caused by agglutinative languages. Experiments on LASQ demonstrate consistent gains over competitive baselines, validating both the dataset’s utility and the effectiveness of the proposed modeling approach.
[39] Turing or Cantor: That is the Question
Eugene Eberbach
Main category: cs.CL
TL;DR: The paper presents new theoretical results extending Turing’s work, including measures of undecidability, super-Turing computation models, new complexity classes for undecidable problems, and a negative answer to a P≠NP equivalent for undecidable problems.
Details
Motivation: To extend Alan Turing's foundational work by introducing new theoretical frameworks for understanding undecidable problems, building on Cantor's contributions to set theory and mathematics foundations.
Method: Theoretical analysis and formal definitions: 1) Introducing a measure of undecidability based on probability distribution of input data, 2) Extending Turing’s infinite logics and Oracle machines to super-Turing computation models, 3) Defining three new complexity classes for TM undecidable problems (U-complete, D-complete, H-complete), 4) Proving negative answer to P≠NP equivalent for U-complete class.
Result: Multiple novel theoretical contributions: 1) First explicit definition of complexity classes for undecidable problems, 2) Negative resolution of P≠NP equivalent for U-complete class, 3) Framework for measuring undecidability degrees, 4) Extension of Turing’s models to super-Turing computation.
Conclusion: The paper significantly advances theoretical computer science by providing new frameworks for understanding undecidability and extending Turing’s legacy with novel complexity classes and computation models that go beyond traditional Turing machines.
Abstract: Alan Turing is considered a founder of modern computer science, together with Kurt Gödel, Alonzo Church, and John von Neumann. This paper presents multiple new research results. It is argued that Turing’s achievements would not have been possible without Georg Cantor’s earlier seminal contributions to set theory and the foundations of mathematics. A measure of undecidability is proposed for problems unsolvable by Turing machines, based on the probability distribution of their input data, i.e., a degree of unsolvability determined by the proportion of undecidable input instances relative to decidable ones. It is also proposed to extend Turing’s work on infinite logics and Oracle machines to a whole class of super-Turing models of computation. Next, three new complexity classes for TM-undecidable problems are defined: U-complete (Universal complete), D-complete (Diagonalization complete), and H-complete (Hypercomputation complete). These classes, inspired by the Cook/Levin NP-complete class for intractable problems, have not been explicitly defined before. Finally, the analogue of the famous open question of whether P equals NP, posed for the U-complete class of undecidable problems, is answered negatively.
[40] CodaRAG: Connecting the Dots with Associativity Inspired by Complementary Learning
Cheng-Yen Li, Xuanjun Chen, Claire Lin, Wei-Yu Chen, Wenhua Nie, Hung-Yi Lee, Jyh-Shing Roger Jang
Main category: cs.CL
TL;DR: CodaRAG: A framework that evolves retrieval from passive lookup to active associative discovery by consolidating fragmented knowledge into a memory graph and navigating it via multi-dimensional pathways to recover evidence chains.
Details
Motivation: LLMs struggle with knowledge-intensive tasks due to hallucinations and fragmented reasoning over dispersed information. Existing RAG methods treat evidence as isolated units, failing to reconstruct logical chains connecting evidence points.
Method: Three-stage pipeline: (1) Knowledge Consolidation to unify fragmented extractions into a stable memory substrate; (2) Associative Navigation to traverse the graph via semantic, contextualized, and functional pathways to recover evidence chains; (3) Interference Elimination to prune hyper-associative noise for coherent reasoning context.
Result: On GraphRAG-Bench, achieves absolute gains of 7-10% in retrieval recall and 3-11% in generation accuracy, demonstrating superior ability to robustify associative evidence retrieval.
Conclusion: CodaRAG successfully transforms retrieval into active associative discovery, enabling systematic recovery of dispersed evidence chains for improved factual, reasoning, and creative tasks.
Abstract: Large Language Models (LLMs) struggle with knowledge-intensive tasks due to hallucinations and fragmented reasoning over dispersed information. While Retrieval-Augmented Generation (RAG) grounds generation in external sources, existing methods often treat evidence as isolated units, failing to reconstruct the logical chains that connect these dots. Inspired by Complementary Learning Systems (CLS), we propose CodaRAG, a framework that evolves retrieval from passive lookup into active associative discovery. CodaRAG operates via a three-stage pipeline: (1) Knowledge Consolidation to unify fragmented extractions into a stable memory substrate; (2) Associative Navigation to traverse the graph via multi-dimensional pathways (semantic, contextualized, and functional), explicitly recovering dispersed evidence chains; and (3) Interference Elimination to prune hyper-associative noise, ensuring a coherent, high-precision reasoning context. On GraphRAG-Bench, CodaRAG achieves absolute gains of 7-10% in retrieval recall and 3-11% in generation accuracy. These results demonstrate CodaRAG’s superior ability to systematically robustify associative evidence retrieval for factual, reasoning, and creative tasks.
[41] Instruction Data Selection via Answer Divergence
Bo Li, Mingda Wang, Shikun Zhang, Wei Ye
Main category: cs.CL
TL;DR: ADG selects instruction data using answer divergence scores based on geometric structure of multiple model outputs, improving instruction tuning with better data selection.
Details
Motivation: Instruction tuning performance heavily depends on the quality and composition of instruction-response corpora, but current data selection methods don't effectively identify instructions that elicit diverse, multi-modal responses.
Method: ADG generates multiple high-temperature responses per instruction, embeds them, computes divergence scores combining dispersion magnitude and shape anisotropy to select instructions with diverse, multi-modal answers.
Result: Fine-tuning on just 10K ADG-selected examples outperforms strong baselines across 6 benchmarks in reasoning, knowledge, and coding tasks using two different model backbones.
Conclusion: Answer divergence is a practical signal for instruction data selection, with both dispersion magnitude and shape anisotropy being necessary components for effective data curation.
Abstract: Instruction tuning relies on large instruction-response corpora whose quality and composition strongly affect downstream performance. We propose Answer Divergence-Guided Selection (ADG), which selects instruction data based on the geometric structure of multi-sample outputs. ADG draws several high-temperature generations per instruction, maps responses into an embedding space, and computes an output divergence score that jointly encodes dispersion magnitude and shape anisotropy. High scores correspond to instructions whose answers are both far apart and multi-modal, rather than clustered paraphrases along a single direction. Across two backbones and three public instruction pools, fine-tuning on only 10K ADG-selected examples consistently outperforms strong selectors on six benchmarks spanning reasoning, knowledge, and coding. Analyses further show that both dispersion magnitude and shape anisotropy are necessary, supporting answer divergence as a practical signal for instruction data selection. Code and appendix are included in the supplementary materials.
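For a concrete picture of the divergence score, here is a minimal sketch of how dispersion magnitude and a shape-spread term might be combined; the specific measures, the spectral-entropy proxy for anisotropy, and the weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def adg_divergence_score(embeddings: np.ndarray, alpha: float = 0.5) -> float:
    """Toy divergence score over k sampled answers to one instruction.

    embeddings: (k, d) array of response embeddings, d > 1.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Dispersion magnitude: average distance of answers from their centroid.
    dispersion = np.sqrt((centered ** 2).sum(axis=1)).mean()
    # Shape term: spectral entropy of the sample covariance. Variance spread
    # across many directions (multi-modal answers) scores high; paraphrases
    # strung along a single direction score low.
    cov = centered.T @ centered / max(len(embeddings) - 1, 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 1e-12, None)
    p = eig / eig.sum()
    spread = -(p * np.log(p)).sum() / np.log(len(p))  # normalized to [0, 1]
    return alpha * dispersion + (1 - alpha) * spread
```

Instructions would then be ranked by this score and a small high-scoring subset (10K in the paper) retained for fine-tuning.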
[42] NOSE: Neural Olfactory-Semantic Embedding with Tri-Modal Orthogonal Contrastive Learning
Yanyi Su, Hongshuai Wang, Zhifeng Gao, Jun Cheng
Main category: cs.CL
TL;DR: NOSE is a multimodal representation learning framework that aligns molecular structure, receptor sequences, and natural language descriptions for olfaction, using orthogonal constraints and weak positive sampling to address sparse olfactory language data.
Details
Motivation: Current olfactory representation methods only model isolated segments of the olfactory pathway (molecule to receptors to language), resulting in embeddings that lack biological grounding and semantic interpretability. There's a need for a holistic approach that captures the complete chain from chemical structure to linguistic perception.
Method: Proposes NOSE framework that aligns three modalities: molecular structure, receptor sequence, and natural language description. Uses orthogonal constraints to decouple modality contributions rather than simple fusion. Introduces weak positive sample strategy to calibrate semantic similarity for sparse olfactory language data.
Result: Achieves state-of-the-art performance and excellent zero-shot generalization. Demonstrates strong alignment between representation space and human olfactory intuition.
Conclusion: NOSE successfully captures the complete olfactory pathway through multimodal alignment, providing biologically grounded and semantically interpretable representations that generalize well to new olfactory tasks.
Abstract: Olfaction lies at the intersection of chemical structure, neural encoding, and linguistic perception, yet existing representation methods fail to fully capture this pathway. Current approaches typically model only isolated segments of the olfactory pathway, overlooking the complete chain from molecule to receptors to linguistic descriptions. Such fragmentation yields learned embeddings that lack both biological grounding and semantic interpretability. We propose NOSE (Neural Olfactory-Semantic Embedding), a representation learning framework that aligns three modalities along the olfactory pathway: molecular structure, receptor sequence, and natural language description. Rather than simply fusing these signals, we decouple their contributions via orthogonal constraints, preserving the unique encoded information of each modality. To address the sparsity of olfactory language, we introduce a weak positive sample strategy to calibrate semantic similarity, preventing erroneous repulsion of similar odors in the feature space. Extensive experiments demonstrate that NOSE achieves state-of-the-art (SOTA) performance and excellent zero-shot generalization, confirming the strong alignment between its representation space and human olfactory intuition.
[43] EviCare: Enhancing Diagnosis Prediction with Deep Model-Guided Evidence for In-Context Reasoning
Hengyu Zhang, Xuyun Zhang, Pengxiang Zhan, Linhao Luo, Hang Lv, Yanchao Tan, Shirui Pan, Carl Yang
Main category: cs.CL
TL;DR: EviCare: An in-context reasoning framework that integrates deep model guidance into LLM-based diagnosis prediction from EHRs to better identify novel clinically important conditions.
Details
Motivation: Existing LLM-based approaches for EHR diagnosis prediction tend to overfit to historically observed diagnoses and overlook novel yet clinically important conditions critical for early intervention.
Method: EviCare performs three-step reasoning: (1) deep model inference for candidate selection, (2) evidential prioritization for set-based EHRs, and (3) relational evidence construction for novel diagnosis prediction, then composes these signals into adaptive in-context prompts to guide LLM reasoning.
Result: On MIMIC-III and MIMIC-IV benchmarks, EviCare achieves significant performance gains, outperforming both LLM-only and deep model-only baselines by average 20.65% across precision and accuracy metrics, with 30.97% average improvement in novel diagnosis prediction.
Conclusion: EviCare effectively addresses the limitation of existing LLM approaches in identifying novel diagnoses through deep model guidance and in-context reasoning, improving both accuracy and interpretability in EHR-based diagnosis prediction.
Abstract: Recent advances in large language models (LLMs) have enabled promising progress in diagnosis prediction from electronic health records (EHRs). However, existing LLM-based approaches tend to overfit to historically observed diagnoses, often overlooking novel yet clinically important conditions that are critical for early intervention. To address this, we propose EviCare, an in-context reasoning framework that integrates deep model guidance into LLM-based diagnosis prediction. Rather than prompting LLMs directly with raw EHR inputs, EviCare performs (1) deep model inference for candidate selection, (2) evidential prioritization for set-based EHRs, and (3) relational evidence construction for novel diagnosis prediction. These signals are then composed into an adaptive in-context prompt to guide LLM reasoning in an accurate and interpretable manner. Extensive experiments on two real-world EHR benchmarks (MIMIC-III and MIMIC-IV) demonstrate that EviCare achieves significant performance gains, consistently outperforming both LLM-only and deep model-only baselines by an average of 20.65% across precision and accuracy metrics. The gains are particularly notable in the challenging setting of novel diagnosis prediction, with average improvements of 30.97%.
[44] Dynamic Adaptive Attention and Supervised Contrastive Learning: A Novel Hybrid Framework for Text Sentiment Classification
Qingyang Li
Main category: cs.CL
TL;DR: A hybrid BERT-based framework combining dynamic adaptive multi-head attention with supervised contrastive learning for improved sentiment classification in long movie reviews.
Details
Motivation: Traditional models struggle with long-distance semantic dependencies and ambiguous emotional expressions in lengthy movie reviews, necessitating better attention mechanisms and representation learning.
Method: Integrates dynamic adaptive multi-head attention (using global context pooling to regulate attention head contributions) with supervised contrastive learning into a BERT-based Transformer encoder.
Result: Achieves 94.67% accuracy on IMDB dataset, outperforming strong baselines by 1.5-2.5 percentage points.
Conclusion: The lightweight, efficient framework effectively handles long review texts and is extensible to other text classification tasks.
Abstract: The exponential growth of user-generated movie reviews on digital platforms has made accurate text sentiment classification a cornerstone task in natural language processing. Traditional models, including standard BERT and recurrent architectures, frequently struggle to capture long-distance semantic dependencies and resolve ambiguous emotional expressions in lengthy review texts. This paper proposes a novel hybrid framework that seamlessly integrates dynamic adaptive multi-head attention with supervised contrastive learning into a BERT-based Transformer encoder. The dynamic adaptive attention module employs a global context pooling vector to dynamically regulate the contribution of each attention head, thereby focusing on critical sentiment-bearing tokens while suppressing noise. Simultaneously, the supervised contrastive learning branch enforces tighter intra-class compactness and larger inter-class separation in the embedding space. Extensive experiments on the IMDB dataset demonstrate that the proposed model achieves competitive performance with an accuracy of 94.67%, outperforming strong baselines by 1.5–2.5 percentage points. The framework is lightweight, efficient, and readily extensible to other text classification tasks.
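As a rough illustration of the dynamic adaptive attention idea, the sketch below gates each attention head with a signal derived from a global context pooling vector; the module layout, names, and sigmoid gating are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadGatedSelfAttention(nn.Module):
    """Sketch: self-attention whose heads are reweighted by a gate computed
    from a global context pooling vector (mean over tokens)."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.dk = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, n_heads)  # one gate per attention head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        def split(z):  # (b, t, d) -> (b, h, t, dk)
            return z.view(b, t, self.h, self.dk).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        att = F.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
        heads = att @ v                                   # (b, h, t, dk)
        g = torch.sigmoid(self.gate(x.mean(dim=1)))       # (b, h) global gates
        heads = heads * g[:, :, None, None]               # scale each head's output
        return self.out(heads.transpose(1, 2).reshape(b, t, d))
```

The gate lets the model suppress heads that attend mostly to noise for a given review while amplifying heads that track sentiment-bearing tokens.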
[45] From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation
Mingfei Lu, Yi Zhang, Mengjia Wu, Yue Feng
Main category: cs.CL
TL;DR: JurisMA: A modular multi-agent framework for legal consultation QA using structured task decomposition and legal element graphs, trained on a new Chinese legal dataset JurisCQAD.
Details
Motivation: Legal consultation QA faces challenges including scarcity of high-quality training data, complex task composition, and strong contextual dependencies that existing approaches don't adequately address.
Method: Created JurisCQAD dataset (43K+ Chinese legal queries), designed structured task decomposition converting queries into legal element graphs (entities, events, intents, legal issues), and proposed JurisMA modular multi-agent framework with dynamic routing, statutory grounding, and stylistic optimization.
Result: System significantly outperforms both general-purpose and legal-domain LLMs on refined LawBench across multiple lexical and semantic metrics, demonstrating benefits of interpretable decomposition and modular collaboration.
Conclusion: Structured task decomposition with legal element graphs combined with modular multi-agent framework enables strong context-aware reasoning for legal consultation QA, effectively capturing dependencies across legal facts, norms, and procedural logic.
Abstract: Legal consultation question answering (Legal CQA) presents unique challenges compared to traditional legal QA tasks, including the scarcity of high-quality training data, complex task composition, and strong contextual dependencies. To address these, we construct JurisCQAD, a large-scale dataset of over 43,000 real-world Chinese legal queries annotated with expert-validated positive and negative responses, and design a structured task decomposition that converts each query into a legal element graph integrating entities, events, intents, and legal issues. We further propose JurisMA, a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Combined with the element graph, the framework enables strong context-aware reasoning, effectively capturing dependencies across legal facts, norms, and procedural logic. Trained on JurisCQAD and evaluated on a refined LawBench, our system significantly outperforms both general-purpose and legal-domain LLMs across multiple lexical and semantic metrics, demonstrating the benefits of interpretable decomposition and modular collaboration in Legal CQA.
[46] Why Don’t You Know? Evaluating the Impact of Uncertainty Sources on Uncertainty Quantification in LLMs
Maiya Goloburda, Roman Vashurin, Fedor Chernogorsky, Nurkhan Laiyk, Daniil Orel, Preslav Nakov, Maxim Panov
Main category: cs.CL
TL;DR: Paper studies how different sources of uncertainty (knowledge gaps, output variability, input ambiguity) affect UQ methods for LLMs, finding current methods perform poorly when uncertainty stems from non-knowledge sources.
Details
Motivation: Current UQ methods for LLMs produce single confidence scores but uncertainty arises from multiple distinct sources with different implications. Need to understand how uncertainty sources impact UQ method effectiveness.
Method: Introduces new dataset categorizing uncertainty sources, enabling systematic evaluation of UQ performance under each condition (knowledge gaps, output variability, input ambiguity).
Result: Experiments show many UQ methods perform well when uncertainty stems from model knowledge limitations, but degrade or become misleading when other sources (output variability, input ambiguity) are introduced.
Conclusion: Highlights need for uncertainty-aware methods that explicitly account for source of uncertainty in LLMs, rather than single confidence scores.
Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world applications, reliable uncertainty quantification (UQ) becomes critical for safe and effective use. Most existing UQ approaches for language models aim to produce a single confidence score – for example, estimating the probability that a model’s answer is correct. However, uncertainty in natural language tasks arises from multiple distinct sources, including model knowledge gaps, output variability, and input ambiguity, which have different implications for system behavior and user interaction. In this work, we study how the source of uncertainty impacts the behavior and effectiveness of existing UQ methods. To enable controlled analysis, we introduce a new dataset that explicitly categorizes uncertainty sources, allowing systematic evaluation of UQ performance under each condition. Our experiments reveal that while many UQ methods perform well when uncertainty stems solely from model knowledge limitations, their performance degrades or becomes misleading when other sources are introduced. These findings highlight the need for uncertainty-aware methods that explicitly account for the source of uncertainty in large language models.
[47] Structure-Grounded Knowledge Retrieval via Code Dependencies for Multi-Step Data Reasoning
Xinyi Huang, Mingzhe Lu, Haoyu Dong
Main category: cs.CL
TL;DR: SGKR is a retrieval framework that uses function-call dependency graphs to retrieve task-critical knowledge for LLM-based code generation in data analysis tasks.
Details
Motivation: Current retrieval-augmented approaches for LLMs rely on lexical/embedding similarity, which is often insufficient for multi-step reasoning tasks where relevant knowledge is grounded in executable code and dependency structures.
Method: SGKR organizes domain knowledge with a graph induced by function-call dependencies, extracts semantic input/output tags from questions, identifies dependency paths connecting them, constructs task-relevant subgraphs, and assembles associated knowledge and function implementations as structured context for LLM-based code generation.
Result: Experiments on multi-step data analysis benchmarks show SGKR consistently improves solution correctness over no-retrieval and similarity-based retrieval baselines for both vanilla LLMs and coding agents.
Conclusion: Structure-grounded knowledge retrieval using function-call dependency graphs is more effective than similarity-based approaches for retrieving task-critical knowledge in multi-step data analysis tasks requiring code generation.
Abstract: Selecting the right knowledge is critical when using large language models (LLMs) to solve domain-specific data analysis tasks. However, most retrieval-augmented approaches rely primarily on lexical or embedding similarity, which is often a weak proxy for the task-critical knowledge needed for multi-step reasoning. In many such tasks, the relevant knowledge is not merely textually related to the query, but is instead grounded in executable code and the dependency structure through which computations are carried out. To address this mismatch, we propose SGKR (Structure-Grounded Knowledge Retrieval), a retrieval framework that organizes domain knowledge with a graph induced by function-call dependencies. Given a question, SGKR extracts semantic input and output tags, identifies dependency paths connecting them, and constructs a task-relevant subgraph. The associated knowledge and corresponding function implementations are then assembled as a structured context for LLM-based code generation. Experiments on multi-step data analysis benchmarks show that SGKR consistently improves solution correctness over no-retrieval and similarity-based retrieval baselines for both vanilla LLMs and coding agents.
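The retrieval step can be pictured as path finding on a call graph. The toy sketch below (the graph, function names, and tag-to-function matching are invented for illustration) collects every node on a dependency path between input- and output-tagged functions and returns the induced subgraph whose documentation and implementations get packed into the prompt.

```python
import networkx as nx

# Toy call-dependency graph: edge u -> v means function u's output feeds v.
G = nx.DiGraph()
G.add_edges_from([
    ("load_table", "clean_rows"),
    ("clean_rows", "groupby_region"),
    ("groupby_region", "compute_growth"),
    ("load_table", "join_prices"),
    ("join_prices", "compute_growth"),
])

def task_subgraph(g: nx.DiGraph, input_fns, output_fns) -> nx.DiGraph:
    """Keep every node on a dependency path from an input-tagged
    function to an output-tagged one; return the induced subgraph."""
    keep = set()
    for s in input_fns:
        for t in output_fns:
            for path in nx.all_simple_paths(g, s, t):
                keep.update(path)
    return g.subgraph(keep).copy()

sub = task_subgraph(G, input_fns=["load_table"], output_fns=["compute_growth"])
print(sorted(sub.nodes))  # functions whose knowledge forms the structured context
```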
[48] ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization
Suyoung Bae, CheolWon Na, Jaehoon Lee, Yumin Lee, YunSeok Choi, Jee-Hyong Lee
Main category: cs.CL
TL;DR: ReFEree is a reference-free, fine-grained method for evaluating factual consistency in real-world code summaries, addressing limitations of previous methods that struggle with multi-sentence functionalities and dependency context.
Details
Motivation: As LLMs generate longer, more descriptive code summaries, accurate evaluation of factual consistency becomes critical. Previous methods are designed for short summaries of isolated code snippets and fail to handle multi-sentence functionalities and dependency context in real-world code summaries.
Method: Defines factual inconsistency criteria specific to code summaries, evaluates at segment level using these criteria with dependency information, then aggregates segment-level results into a fine-grained score. Constructs a code summarization benchmark with human-annotated factual consistency labels.
Result: ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15-18% over previous state-of-the-art methods.
Conclusion: ReFEree provides an effective reference-free, fine-grained evaluation method for factual consistency in code summaries, addressing key limitations of previous approaches and showing superior correlation with human judgment.
Abstract: As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine-grained evaluation of multi-sentence functionalities and fail to accurately assess dependency context commonly found in real-world code summaries. To address this, we propose ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate them at the segment level using these criteria along with dependency information. These segment-level results are then aggregated into a fine-grained score. We construct a code summarization benchmark with human-annotated factual consistency labels. The evaluation results demonstrate that ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15-18% over the previous state-of-the-art. Our code and data are available at https://github.com/bsy99615/ReFEree.git.
[49] Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
Dat Tran, Douwe Kiela
Main category: cs.CL
TL;DR: Single-agent LLM systems match or outperform multi-agent systems on multi-hop reasoning tasks when reasoning tokens are held constant, challenging reported MAS advantages as artifacts of unaccounted computation and context effects.
Details
Motivation: Recent work shows strong performance from multi-agent LLM systems (MAS), but gains may be confounded by increased test-time computation. The theoretical basis and evaluation methodology for comparing single-agent (SAS) vs. multi-agent systems under normalized computation remain unclear.
Method: 1) Information-theoretic argument grounded in Data Processing Inequality suggesting SAS are more information-efficient under fixed reasoning-token budget with perfect context utilization; 2) Controlled empirical study across three model families (Qwen3, DeepSeek-R1-Distill-Llama, Gemini 2.5) comparing SAS with multiple MAS architectures under matched budgets; 3) Diagnostic analysis of system behavior and evaluation methodology including API-based budget control artifacts.
Result: SAS consistently match or outperform MAS on multi-hop reasoning tasks when reasoning tokens are held constant. Identified significant artifacts in API-based budget control (particularly Gemini 2.5) and standard benchmarks that can inflate apparent MAS gains. MAS become competitive only when single agent’s context utilization is degraded or with more compute.
Conclusion: Reported advantages of multi-agent systems for multi-hop reasoning are better explained by unaccounted computation and context effects rather than inherent architectural benefits. Highlights importance of understanding trade-offs between compute, context, and coordination in agentic systems.
Abstract: Recent work reports strong performance from multi-agent LLM systems (MAS), but these gains are often confounded by increased test-time computation. When computation is normalized, single-agent systems (SAS) can match or outperform MAS, yet the theoretical basis and evaluation methodology behind this comparison remain unclear. We present an information-theoretic argument, grounded in the Data Processing Inequality, suggesting that under a fixed reasoning-token budget and with perfect context utilization, single-agent systems are more information-efficient. This perspective further predicts that multi-agent systems become competitive when a single agent’s effective context utilization is degraded, or when more compute is expended. We test these predictions in a controlled empirical study across three model families (Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5), comparing SAS with multiple MAS architectures under matched budgets. We find that SAS consistently match or outperform MAS on multi-hop reasoning tasks when reasoning tokens are held constant. Beyond aggregate performance, we conduct a detailed diagnostic analysis of system behavior and evaluation methodology. We identify significant artifacts in API-based budget control (particularly in Gemini 2.5) and in standard benchmarks, both of which can inflate apparent gains from MAS. Overall, our results suggest that, for multi-hop reasoning tasks, many reported advantages of multi-agent systems are better explained by unaccounted computation and context effects rather than inherent architectural benefits, and highlight the importance of understanding and explicitly controlling the trade-offs between compute, context, and coordination in agentic systems.
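The core of the information-theoretic argument is the Data Processing Inequality; in sketch form (the paper's formal statement may differ in its setup):

```latex
% If inter-agent communication forms a Markov chain
%   X (task input) -> Y (agent 1's message) -> Z (agent 2's view),
% the Data Processing Inequality gives
\[
  I(X; Z) \;\le\; I(X; Y),
\]
% so each hand-off can at best preserve, never increase, task-relevant
% information. Under a matched reasoning-token budget and perfect context
% utilization, a single agent reading X directly loses nothing to summarization.
```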
[50] Lost in Diffusion: Uncovering Hallucination Patterns and Failure Modes in Diffusion Large Language Models
Zhengnan Guo, Fei Tan
Main category: cs.CL
TL;DR: dLLMs exhibit higher hallucination rates than autoregressive LLMs, with unique failure modes like premature termination and incomplete denoising, despite comparable general task performance.
Details
Motivation: While diffusion-based large language models (dLLMs) show promise as non-autoregressive alternatives to autoregressive LLMs, their faithfulness and hallucination patterns remain largely unexplored, creating a gap in understanding their reliability.
Method: Conducted first controlled comparative study evaluating hallucination patterns in dLLMs versus autoregressive counterparts, controlling for architecture, scale, and pre-training weights. Analyzed inference-time compute dynamics and identified unique failure modes.
Result: Current dLLMs exhibit higher propensity for hallucination than AR models. Quasi-autoregressive generation suffers from early saturation, while non-sequential decoding allows continuous refinement. Identified unique diffusion failure modes: premature termination, incomplete denoising, and context intrusion.
Conclusion: Although dLLMs have narrowed performance gap on general tasks, their distinct hallucination mechanisms pose critical challenges to model reliability, highlighting the need for specialized mitigation strategies.
Abstract: While Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive paradigm comparable to autoregressive (AR) models, their faithfulness, specifically regarding hallucination, remains largely underexplored. To bridge this gap, we present the first controlled comparative study to evaluate hallucination patterns in dLLMs. Our results demonstrate that current dLLMs exhibit a higher propensity for hallucination than AR counterparts controlled for architecture, scale, and pre-training weights. Furthermore, an analysis of inference-time compute reveals divergent dynamics: while quasi-autoregressive generation suffers from early saturation, non-sequential decoding unlocks potential for continuous refinement. Finally, we identify distinct failure modes unique to the diffusion process, including premature termination, incomplete denoising, and context intrusion. Our findings underscore that although dLLMs have narrowed the performance gap on general tasks, their distinct hallucination mechanisms pose a critical challenge to model reliability. Our code is available at https://github.com/ZeroLoss-Lab/Lost-in-Diffusion
[51] LLMs Should Incorporate Explicit Mechanisms for Human Empathy
Xiaoxing You, Qiang Huang, Jun Yu
Main category: cs.CL
TL;DR: LLMs need explicit empathy mechanisms to preserve human perspectives in high-stakes applications, as current models fail at modeling affect, context, and relational dynamics despite good benchmark performance.
Details
Motivation: As LLMs are increasingly deployed in human-centered settings, their success depends on faithfully preserving human perspectives, not just correctness or fluency. Current LLMs systematically fail at this requirement by attenuating affect, misrepresenting contextual salience, and rigidifying relational stance.
Method: Formalizes empathy as an observable behavioral property: capacity to model and respond to human perspectives while preserving intention, affect, and context. Identifies four mechanisms of empathic failure (sentiment attenuation, empathic granularity mismatch, conflict avoidance, linguistic distancing) and organizes them along three dimensions (cognitive, cultural, relational empathy). Conducts empirical analyses showing benchmark performance can mask systematic empathic distortions.
Result: Empirical analyses demonstrate that strong benchmark performance can mask systematic empathic distortions in LLMs. The paper identifies specific failure patterns and motivates the need for empathy-aware objectives, benchmarks, and training signals.
Conclusion: LLM development needs to incorporate empathy as a first-class component through explicit mechanisms, objectives, benchmarks, and training signals to address systematic failures in modeling human perspectives, affect, and relational dynamics.
Abstract: This paper argues that Large Language Models (LLMs) should incorporate explicit mechanisms for human empathy. As LLMs become increasingly deployed in high-stakes human-centered settings, their success depends not only on correctness or fluency but on faithful preservation of human perspectives. Yet, current LLMs systematically fail at this requirement: even when well-aligned and policy-compliant, they often attenuate affect, misrepresent contextual salience, and rigidify relational stance in ways that distort meaning. We formalize empathy as an observable behavioral property: the capacity to model and respond to human perspectives while preserving intention, affect, and context. Under this framing, we identify four recurring mechanisms of empathic failure in contemporary LLMs (sentiment attenuation, empathic granularity mismatch, conflict avoidance, and linguistic distancing), arising as structural consequences of prevailing training and alignment practices. We further organize these failures along three dimensions (cognitive, cultural, and relational empathy) to explain their manifestation across tasks. Empirical analyses show that strong benchmark performance can mask systematic empathic distortions, motivating empathy-aware objectives, benchmarks, and training signals as first-class components of LLM development.
[52] Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models
Jiyeon Kim, Sungik Choi, Yongrae Jo, Moontae Lee, Minjoon Seo
Main category: cs.CL
TL;DR: Diffusion language models struggle with non-autoregressive decoding due to proximity bias causing spatial error propagation; proposed solution uses lightweight planning and temperature annealing to guide early token selection.
Details
Motivation: Diffusion-based language models offer parallel token generation and bidirectional context modeling advantages over autoregressive models, but their application to non-autoregressive decoding for reasoning and planning tasks remains challenging and underexplored.
Method: Systematically analyze inference dynamics of non-autoregressive decoding in diffusion language models, identify proximity bias issue, and propose a minimal-intervention approach with lightweight planner and end-of-sequence temperature annealing to guide early token selection.
Result: The method shows substantial overall improvement over existing heuristic baselines on various reasoning and planning tasks without significant computational overhead.
Conclusion: Proximity bias in diffusion language models causes spatial error propagation in non-autoregressive decoding, but this can be effectively addressed with targeted guidance mechanisms for early token selection.
Abstract: Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure mode in confidence-based non-autoregressive generation stemming from a strong proximity bias: the tendency for the denoising order to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.
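One of the two interventions, end-of-sequence temperature annealing, can be pictured as a simple schedule applied to the EOS logits at each denoising step; the linear shape and the parameter values below are illustrative assumptions, not the paper's settings.

```python
def eos_temperature(step: int, total_steps: int,
                    t_start: float = 2.0, t_end: float = 1.0) -> float:
    """Illustrative linear anneal: early denoising steps apply a higher
    temperature to end-of-sequence logits (e.g., logits[eos_id] / temperature),
    discouraging the decoder from committing to sequence boundaries before
    content tokens have been placed."""
    frac = step / max(total_steps - 1, 1)
    return t_start + (t_end - t_start) * frac
```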
[53] Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance
Weihua Zheng, Chang Liu, Zhengyuan Liu, Xin Huang, Kui Wu, Muhammad Huzaifah Md Shahrin, Aiti Aw, Roy Ka-Wei Lee
Main category: cs.CL
TL;DR: A method to improve multilingual LLMs by adding a Cross-Lingual Mapping Task during pre-training, enhancing cross-lingual alignment without hurting monolingual fluency, with significant gains on translation and cross-lingual tasks.
Details
Motivation: Multilingual LLMs struggle with cross-lingual tasks due to data imbalances between high/low-resource languages and monolingual bias in pre-training. Existing methods need extensive parallel data or suffer from instability.
Method: Introduces a Cross-Lingual Mapping Task during pre-training that bi-directionally maps languages within the LLM embedding space. Also proposes a Language Alignment Coefficient to quantify cross-lingual consistency in limited-data scenarios.
Result: Achieves gains up to 11.9 BLEU points in machine translation, 6.72 points in CLQA BERTScore-Precision, and over 5% in CLNLU accuracy compared to strong multilingual baselines.
Conclusion: Incorporating cross-lingual objectives into pre-training can significantly improve multilingual LLMs, with the proposed method enhancing both language generation and comprehension across languages.
Abstract: Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks due to data imbalances between high-resource and low-resource languages, as well as monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, can improve cross-lingual performance, but they often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task during the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM embedding space, improving both language generation and comprehension. We further propose a Language Alignment Coefficient to robustly quantify cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves gains of up to 11.9 BLEU points in MT, 6.72 points in CLQA BERTScore-Precision, and more than 5% in CLNLU accuracy over strong multilingual baselines. These findings highlight the potential of incorporating cross-lingual objectives into pre-training to improve multilingual LLMs.
[54] Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment
Yang Cui, Jingyuan Sun, Yizheng Sun, Yifan Wang, Yunhao Zhang, Jixing Li, Shaonan Wang, Hongpeng Zhou, John Hale, Chengqing Zong, Goran Nenadic
Main category: cs.CL
TL;DR: Multilingual LLMs with targeted computational lesions reveal shared language processing backbone with embedded specializations in the brain, using fMRI during naturalistic story listening across English, Chinese, and French.
Details
Motivation: To understand whether language processing in the brain is shared or language-specific across different languages, and to provide a causal framework for studying multilingual brain-model alignment using AI systems as controllable testbeds.
Method: Used six multilingual large language models as controllable systems, created targeted “computational lesions” by zeroing parameter sets important across languages or specific to one language, then compared intact vs. lesioned models in predicting fMRI responses during 100 minutes of naturalistic story listening in native English, Chinese, and French (112 participants).
Result: Lesioning a compact shared core reduced whole-brain encoding correlation by 60.32% relative to intact models, while language-specific lesions preserved cross-language separation in embedding space but selectively weakened brain predictivity for the matched native language.
Conclusion: Results support a shared backbone with embedded specializations for language processing in the brain, providing a causal framework for studying multilingual brain-model alignment using LLMs as computational models.
Abstract: How the brain supports language across different languages is a basic question in neuroscience and a useful test for multilingual artificial intelligence. Neuroimaging has identified language-responsive brain regions across languages, but it cannot by itself show whether the underlying processing is shared or language-specific. Here we use six multilingual large language models (LLMs) as controllable systems and create targeted “computational lesions” by zeroing small parameter sets that are important across languages or especially important for one language. We then compare intact and lesioned models in predicting functional magnetic resonance imaging (fMRI) responses during 100 minutes of naturalistic story listening in native English, Chinese and French (112 participants). Lesioning a compact shared core reduces whole-brain encoding correlation by 60.32% relative to intact models, whereas language-specific lesions preserve cross-language separation in embedding space but selectively weaken brain predictivity for the matched native language. These results support a shared backbone with embedded specializations and provide a causal framework for studying multilingual brain-model alignment.
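Mechanically, a lesion of this kind amounts to zeroing a small, importance-ranked parameter set. A minimal sketch, assuming per-parameter importance tensors are already computed (how importance is scored, shared versus language-specific, is the paper's contribution and not shown here):

```python
import torch

def apply_lesion(model: torch.nn.Module,
                 importance: dict[str, torch.Tensor],
                 fraction: float = 0.001) -> None:
    """Zero the top `fraction` of parameters ranked by an importance score.

    importance: one tensor per parameter name, same shape as the parameter.
    The thresholding detail is an illustrative assumption.
    """
    flat = torch.cat([importance[n].flatten() for n, _ in model.named_parameters()])
    k = max(int(fraction * flat.numel()), 1)
    threshold = flat.topk(k).values.min()  # smallest score still lesioned
    with torch.no_grad():
        for name, param in model.named_parameters():
            param[importance[name] >= threshold] = 0.0
```

The intact and lesioned models are then compared on how well their activations predict fMRI responses.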
[55] ProUIE: A Macro-to-Micro Progressive Learning Method for LLM-based Universal Information Extraction
Wenda Liu, Zhigang Song, Shuai Nie, Guangyao Liu, Lisung Chen, Binyu Yang, Yaran Chen, Peng Zhou, Hongzhen Wang, Yuchen Liu, Wenyue Hu, Jiaming Xu, Runyu Shi, Ying Huang
Main category: cs.CL
TL;DR: ProUIE: A Macro-to-Micro progressive learning approach for universal information extraction without external data, using three stages of complete modeling, streamlined alignment, and deep exploration with stepwise rewards.
Details
Motivation: Current LLM-based universal information extraction methods often require additional external information beyond original training data, which increases training complexity while providing limited performance gains. The authors aim to improve UIE without introducing any external information.
Method: ProUIE uses a three-stage progressive learning approach: 1) Macro-level Complete Modeling (CM) - learns NER, RE, and EE along intrinsic difficulty order on full training data; 2) Meso-level Streamlined Alignment (SA) - operates on sampled data with simplified target formats to make outputs more concise and controllable; 3) Micro-level Deep Exploration (DE) - applies GRPO with stepwise fine-grained rewards over structural units to guide exploration.
Result: Experiments on 36 public datasets show ProUIE consistently improves unified extraction, outperforming strong instruction-tuned baselines on average for NER and RE while using a smaller backbone. It also demonstrates clear gains in large-scale production-oriented information extraction.
Conclusion: ProUIE provides an effective progressive learning approach for universal information extraction that achieves performance improvements without requiring external data, making it more efficient and practical for real-world applications.
Abstract: LLM-based universal information extraction (UIE) methods often rely on additional information beyond the original training data, which increases training complexity yet often yields limited gains. To address this, we propose ProUIE, a Macro-to-Micro progressive learning approach that improves UIE without introducing any external information. ProUIE consists of three stages: (i) macro-level Complete Modeling (CM), which learns NER, RE, and EE along their intrinsic difficulty order on the full training data to build a unified extraction foundation, (ii) meso-level Streamlined Alignment (SA), which operates on sampled data with simplified target formats, streamlining and regularizing structured outputs to make them more concise and controllable, and (iii) micro-level Deep Exploration (DE), which applies GRPO with stepwise fine-grained rewards (SFR) over structural units to guide exploration and improve performance. Experiments on 36 public datasets show that ProUIE consistently improves unified extraction, outperforming strong instruction-tuned baselines on average for NER and RE while using a smaller backbone, and it further demonstrates clear gains in large-scale production-oriented information extraction.
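The stepwise fine-grained rewards in the DE stage operate over structural units (entities, relations, events). The sketch below shows one plausible per-unit reward; the paper's SFR design may weight or decompose units differently, so treat this as an assumption.

```python
def stepwise_structural_reward(pred_units: list, gold_units: list):
    """Illustrative fine-grained reward over structural units, e.g. predicted
    (entity, type) or (head, relation, tail) tuples."""
    gold = set(gold_units)
    # Per-unit rewards: +1 for a correct unit, -1 for a spurious one.
    rewards = [1.0 if u in gold else -1.0 for u in pred_units]
    tp = sum(1 for r in rewards if r > 0)
    precision = tp / len(pred_units) if pred_units else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return rewards, f1

# e.g. stepwise_structural_reward([("Paris", "LOC"), ("Paris", "PER")],
#                                 [("Paris", "LOC")])  -> ([1.0, -1.0], 0.667)
```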
[56] Efficient Process Reward Modeling via Contrastive Mutual Information
Nakyung Lee, Sangwoo Hong, Jungwoo Lee
Main category: cs.CL
TL;DR: CPMI is a novel automatic reward labeling method that uses contrastive pointwise mutual information to infer step-level supervision for chain-of-thought reasoning, reducing computational costs by 84-98% compared to existing methods.
Details
Motivation: Training process reward models (PRMs) for verifying chain-of-thought reasoning steps requires costly human annotation or computationally expensive automated methods like Monte Carlo estimation, creating a need for more efficient automatic reward labeling approaches.
Method: Proposes contrastive pointwise mutual information (CPMI) that quantifies how much a reasoning step increases mutual information between the step and correct answer relative to hard-negative alternatives, using the model's internal probability to infer step-level supervision without extensive rollouts.
Result: CPMI reduces dataset construction time by 84% and token generation by 98% compared to Monte Carlo estimation, while achieving higher accuracy on process-level evaluations and mathematical reasoning benchmarks.
Conclusion: CPMI provides an efficient and effective automatic reward labeling method for chain-of-thought reasoning verification that significantly reduces computational costs while maintaining or improving performance.
Abstract: Recent research has devoted considerable effort to verifying the intermediate reasoning steps of chain-of-thought (CoT) trajectories using process reward models (PRMs) and other verifier models. However, training a PRM typically requires human annotators to assign reward scores to each reasoning step, which is both costly and time-consuming. Existing automated approaches, such as Monte Carlo (MC) estimation, also demand substantial computational resources due to repeated LLM rollouts. To overcome these limitations, we propose contrastive pointwise mutual information (CPMI), a novel automatic reward labeling method that leverages the model’s internal probability to infer step-level supervision while significantly reducing the computational burden of dataset annotation. CPMI quantifies how much a reasoning step increases the mutual information between the step and the correct target answer relative to hard-negative alternatives. This contrastive signal serves as a proxy for the step’s contribution to the final solution and yields a reliable reward. The experimental results show that CPMI-based labeling reduces dataset construction time by 84% and token generation by 98% compared to MC estimation, while achieving higher accuracy on process-level evaluations and mathematical reasoning benchmarks.
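In spirit, the CPMI label contrasts how much a step raises the model's own probability of the gold answer against hard-negative alternative steps. A minimal sketch, where `logp_answer` is an assumed helper returning log p(gold answer | context) under the policy model, and the exact contrastive combination (gain minus the best negative gain) is an illustrative guess:

```python
def cpmi_reward(logp_answer, prefix: str, step: str, negatives: list[str]) -> float:
    """Sketch of a CPMI-style step label; no rollouts, only log-probs."""
    base = logp_answer(prefix)
    gain = logp_answer(prefix + step) - base               # PMI-like information gain
    neg_gain = max(logp_answer(prefix + s) - base for s in negatives)
    return gain - neg_gain                                 # contrast vs. hard negatives
```

Because every term is a single forward-pass log-probability, the label avoids the repeated sampling that makes Monte Carlo estimation expensive.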
[57] HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval
Senol Gulgonul
Main category: cs.CL
TL;DR: HeceTokenizer is a syllable-based tokenizer for Turkish that leverages the language’s phonological structure to create an OOV-free vocabulary, achieving strong retrieval performance with a tiny model.
Details
Motivation: The paper aims to develop an efficient tokenizer for Turkish that exploits the language's deterministic phonological structure to overcome vocabulary limitations and improve retrieval performance with minimal computational resources.
Method: Develops a syllable-based tokenizer using Turkish's six-pattern phonological structure to create a closed vocabulary of ~8,000 unique syllable types. Trains a BERT-tiny encoder (1.5M parameters) from scratch on Turkish Wikipedia using masked language modeling, combined with fine-grained chunk-based retrieval strategy.
Result: Achieves 50.3% Recall@5 on TQuAD retrieval benchmark, surpassing the 46.92% reported by a morphology-driven baseline that uses a 200 times larger model, demonstrating superior efficiency and performance.
Conclusion: The phonological regularity of Turkish syllables provides a strong and resource-light inductive bias for retrieval tasks, enabling efficient tokenization and competitive performance with minimal model size.
Abstract: HeceTokenizer is a syllable-based tokenizer for Turkish that exploits the deterministic six-pattern phonological structure of the language to construct a closed, out-of-vocabulary (OOV)-free vocabulary of approximately 8,000 unique syllable types. A BERT-tiny encoder (1.5M parameters) is trained from scratch on a subset of Turkish Wikipedia using a masked language modeling objective and evaluated on the TQuAD retrieval benchmark using Recall@5. Combined with a fine-grained chunk-based retrieval strategy, HeceTokenizer achieves 50.3% Recall@5, surpassing the 46.92% reported by a morphology-driven baseline that uses a 200 times larger model. These results suggest that the phonological regularity of Turkish syllables provides a strong and resource-light inductive bias for retrieval tasks.
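Turkish syllabification is deterministic enough to sketch in a few lines. The heuristic below is a simplification (the tokenizer's full six-pattern handling is richer): every syllable carries one vowel, and of any consonant run between two vowels, only the last consonant opens the next syllable.

```python
VOWELS = set("aeıioöuüAEIİOÖUÜ")

def syllabify(word: str) -> list[str]:
    """Heuristic Turkish syllabifier: split before a consonant that is
    immediately followed by a vowel, once the current chunk has a vowel."""
    syllables, current = [], ""
    for i, ch in enumerate(word):
        current += ch
        if ch in VOWELS:
            continue
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if nxt in VOWELS and any(c in VOWELS for c in current[:-1]):
            syllables.append(current[:-1])
            current = ch
    syllables.append(current)
    return syllables

print(syllabify("kitaplarımızdan"))  # ['ki', 'tap', 'la', 'rı', 'mız', 'dan']
```

Because the syllable inventory this induces is small and closed, the resulting vocabulary has no out-of-vocabulary tokens by construction.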
[58] Learning and Enforcing Context-Sensitive Control for LLMs
Mohammad Albinhassan, Pranava Madhyastha, Mark Law, Alessandra Russo
Main category: cs.CL
TL;DR: Automatic learning of context-sensitive constraints from LLM interactions through syntactic exploration and constraint exploitation phases, enabling small LLMs to generate with perfect constraint adherence without manual specification.
Details
Motivation: Overcoming limitations of Context-Free Grammars in guaranteeing generation validity by automatically learning context-sensitive constraints from LLM interactions, eliminating the need for manual specification which requires specialized expertise.
Method: Two-phase framework: 1) Syntactic exploration to gather diverse outputs for constraint learning, 2) Constraint exploitation to enforce learned rules during generation. This integrates context-sensitive grammar learning with LLM generation.
Result: Enables even small LLMs (1B parameters) to learn and generate with perfect constraint adherence, outperforming larger counterparts and state-of-the-art reasoning models.
Conclusion: First integration of context-sensitive grammar learning with LLM generation, eliminating manual specification while maintaining generation validity.
Abstract: Controlling the output of Large Language Models (LLMs) through context-sensitive constraints has emerged as a promising approach to overcome the limitations of Context-Free Grammars (CFGs) in guaranteeing generation validity. However, such constraints typically require manual specification – a significant barrier demanding specialized expertise. We introduce a framework that automatically learns context-sensitive constraints from LLM interactions through a two-phase process: syntactic exploration to gather diverse outputs for constraint learning, followed by constraint exploitation to enforce these learned rules during generation. Experiments demonstrate that our method enables even small LLMs (1B parameters) to learn and generate with perfect constraint adherence, outperforming larger counterparts and state-of-the-art reasoning models. This work represents the first integration of context-sensitive grammar learning with LLM generation, eliminating manual specification while maintaining generation validity.
[59] QFS-Composer: Query-focused summarization pipeline for less resourced languages
Vuk Đuranović, Marko Robnik Šikonja
Main category: cs.CL
TL;DR: QFS-Composer: A novel framework for query-focused summarization in low-resource languages using query decomposition, question generation, question answering, and abstractive summarization to improve factual alignment with user intent.
Details
Motivation: LLMs perform well in text summarization but effectiveness drops significantly in less-resourced languages with limited training data and evaluation tools, particularly for query-focused summarization where factual alignment with user intent is crucial.
Method: QFS-Composer integrates query decomposition, question generation, question answering, and abstractive summarization. Developed Slovenian QA and QG models based on a Slovene LLM, and adapted reference-free evaluation approaches for Slovenian language.
Result: The QA-guided summarization pipeline yields improved consistency and relevance over baseline LLMs for Slovenian query-focused summarization, establishing an extensible methodology for advancing QFS in less-resourced languages.
Conclusion: The work presents a successful framework for improving query-focused summarization in low-resource languages through integrated QA-guided approaches, with potential for extension to other languages and domains.
Abstract: Large language models (LLMs) demonstrate strong performance in text summarization, yet their effectiveness drops significantly across languages with restricted training resources. This work addresses the challenge of query-focused summarization (QFS) in less-resourced languages, where labeled datasets and evaluation tools are limited. We present a novel QFS framework, QFS-Composer, that integrates query decomposition, question generation (QG), question answering (QA), and abstractive summarization to improve the factual alignment of a summary with user intent. We test our approach on the Slovenian language. To enable high-quality supervision and evaluation, we develop the Slovenian QA and QG models based on a Slovene LLM and adapt evaluation approaches for reference-free summary evaluation. Empirical evaluation shows that the QA-guided summarization pipeline yields improved consistency and relevance over baseline LLMs. Our work establishes an extensible methodology for advancing QFS in less-resourced languages.
[60] Attention Sinks as Internal Signals for Hallucination Detection in Large Language Models
Jakub Binkowski, Kamil Adamczewski, Tomasz Kajdanowicz
Main category: cs.CL
TL;DR: SinkProbe: A hallucination detection method that identifies attention sinks in LLMs to detect factually incorrect outputs, achieving state-of-the-art performance.
Details
Motivation: Large language models frequently produce hallucinations - fluent but factually incorrect outputs. Current detection methods use attention map features but lack understanding of underlying mechanisms. The paper aims to develop a theoretically-grounded detection method.
Method: Proposes SinkProbe based on attention sinks - tokens that accumulate disproportionate attention mass during generation. The method uses sink scores computed from attention maps, with the classifier preferentially relying on sinks whose associated value vectors have large norms. Shows previous methods implicitly depend on attention sinks.
Result: Produces state-of-the-art hallucination detection results across popular datasets and LLMs. Demonstrates mathematical relationship between sink scores and previous methods.
Conclusion: SinkProbe provides a theoretically-grounded hallucination detection method that outperforms existing approaches by exploiting attention sink mechanisms in LLMs.
Abstract: Large language models frequently exhibit hallucinations: fluent and confident outputs that are factually incorrect or unsupported by the input context. While recent hallucination detection methods have explored various features derived from attention maps, the underlying mechanisms they exploit remain poorly understood. In this work, we propose SinkProbe, a hallucination detection method grounded in the observation that hallucinations are deeply entangled with attention sinks - tokens that accumulate disproportionate attention mass during generation - indicating a transition from distributed, input-grounded attention to compressed, prior-dominated computation. Importantly, although sink scores are computed solely from attention maps, we find that the classifier preferentially relies on sinks whose associated value vectors have large norms. Moreover, we show that previous methods implicitly depend on attention sinks by establishing their mathematical relationship to sink scores. Our findings yield a novel hallucination detection method grounded in theory that produces state-of-the-art results across popular datasets and LLMs.
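A sink score in the sense described can be read straight off the attention maps. The toy sketch below averages the attention mass each key position receives; the paper's exact score definition and downstream classifier features may differ.

```python
import torch

def sink_scores(attn: torch.Tensor) -> torch.Tensor:
    """attn: (n_heads, seq, seq) attention maps for one layer, with rows
    softmax-normalized. Returns one score per key position: the average
    mass it receives over heads and query positions."""
    return attn.mean(dim=(0, 1))

attn = torch.softmax(torch.randn(12, 16, 16), dim=-1)  # stand-in attention maps
print(sink_scores(attn).argmax().item())  # position attracting the most mass
```

A probe would then featurize such scores (and, per the paper's finding, lean on sinks whose value vectors have large norms) to classify whether a generation is hallucinated.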
[61] Expect the Unexpected? Testing the Surprisal of Salient Entities
Jessica Lin, Amir Zeldes
Main category: cs.CL
TL;DR: Study shows globally salient discourse entities have higher surprisal than non-salient ones, and reduce surprisal for surrounding content when used as prompts, refining Uniform Information Density theory.
Details
Motivation: Previous work on the Uniform Information Density hypothesis has largely disregarded the relative salience of discourse participants, creating a gap in understanding how entity salience relates to information distribution in discourse.
Method: Used 70K manually annotated mentions across 16 genres of English and a novel minimal-pair prompting method to study how overall salience of entities relates to surprisal metrics.
Result: Globally salient entities exhibit significantly higher surprisal than non-salient ones (even controlling for confounds), and reduce surprisal for surrounding content when used as prompts. Effect varies by genre - strongest in topic-coherent texts, weakest in conversational contexts.
Conclusion: Findings refine the UID competing pressures framework by identifying global entity salience as a mechanism shaping information distribution in discourse, with implications for discourse structure and predictability.
Abstract: Previous work examining the Uniform Information Density (UID) hypothesis has shown that while information as measured by surprisal metrics is distributed more or less evenly across documents overall, local discrepancies can arise due to functional pressures corresponding to syntactic and discourse structural constraints. However, work thus far has largely disregarded the relative salience of discourse participants. We fill this gap by studying how overall salience of entities in discourse relates to surprisal using 70K manually annotated mentions across 16 genres of English and a novel minimal-pair prompting method. Our results show that globally salient entities exhibit significantly higher surprisal than non-salient ones, even controlling for position, length, and nesting confounds. Moreover, salient entities systematically reduce surprisal for surrounding content when used as prompts, enhancing document-level predictability. This effect varies by genre, appearing strongest in topic-coherent texts and weakest in conversational contexts. Our findings refine the UID competing pressures framework by identifying global entity salience as a mechanism shaping information distribution in discourse.
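For readers unfamiliar with the metric, here is a minimal surprisal calculator of the kind such studies build on, using GPT-2 as a stand-in model; the paper's models, data, and minimal-pair protocol are not reproduced here.

```python
# Per-token surprisal under a causal LM: -log2 p(token | left context).
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def surprisal(text: str):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # log-probability of each token given its left context, in bits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    bits = -logprobs[torch.arange(len(targets)), targets] / math.log(2)
    return list(zip(tok.convert_ids_to_tokens(targets.tolist()), bits.tolist()))

# A minimal pair differing only in the mentioned entity (invented example):
for sent in ["The senator denied the allegations.",
             "The janitor denied the allegations."]:
    print(surprisal(sent))
```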
[62] Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
Arya Shah, Deepali Mishra, Chaklam Silpasuwanchai
Main category: cs.CL
TL;DR: Persona agreeableness in LLMs strongly predicts sycophantic behavior, with 9 of 13 models showing significant positive correlation between agreeableness and sycophancy rates.
Details
Motivation: While LLMs can adopt personas for role-playing, this raises concerns about sycophancy - prioritizing user validation over factual accuracy. Prior work established that sycophancy poses risks to AI safety, but the relationship between specific personality traits (like agreeableness) and sycophantic behavior remains unexplored.
Method: Systematic investigation across 13 open-weight LLMs (0.6B-20B parameters). Developed benchmark with 275 personas evaluated on NEO-IPIP agreeableness subscales. Exposed each persona to 4,950 sycophancy-eliciting prompts across 33 topic categories. Analyzed correlations between persona agreeableness and sycophancy rates.
Result: 9 of 13 models showed statistically significant positive correlations between persona agreeableness and sycophancy rates. Pearson correlations reached r = 0.87, with effect sizes as large as Cohen’s d = 2.33. Agreeableness functions as reliable predictor of persona-induced sycophancy.
Conclusion: Agreeableness strongly influences sycophantic behavior in role-playing LLMs, with direct implications for deploying AI systems and developing alignment strategies that account for personality-mediated deceptive behaviors.
Abstract: Large language models increasingly serve as conversational agents that adopt personas and role-play characters at user request. This capability, while valuable, raises concerns about sycophancy: the tendency to provide responses that validate users rather than prioritize factual accuracy. While prior work has established that sycophancy poses risks to AI safety and alignment, the relationship between specific personality traits of adopted personas and the degree of sycophantic behavior remains unexplored. We present a systematic investigation of how persona agreeableness influences sycophancy across 13 small, open-weight language models ranging from 0.6B to 20B parameters. We develop a benchmark comprising 275 personas evaluated on NEO-IPIP agreeableness subscales and expose each persona to 4,950 sycophancy-eliciting prompts spanning 33 topic categories. Our analysis reveals that 9 of 13 models exhibit statistically significant positive correlations between persona agreeableness and sycophancy rates, with Pearson correlations reaching $r = 0.87$ and effect sizes as large as Cohen’s $d = 2.33$. These findings demonstrate that agreeableness functions as a reliable predictor of persona-induced sycophancy, with direct implications for the deployment of role-playing AI systems and the development of alignment strategies that account for personality-mediated deceptive behaviors.
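The headline statistics are standard and easy to reproduce in outline: Pearson's r between each persona's agreeableness score and its sycophancy rate, plus Cohen's d between high- and low-agreeableness persona groups. The data below are synthetic placeholders, not the paper's measurements.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
agreeableness = rng.uniform(1, 5, size=275)            # 275 personas
sycophancy = 0.15 * agreeableness + rng.normal(0, 0.1, 275)

r, p = pearsonr(agreeableness, sycophancy)             # correlation

# Cohen's d via a median split on agreeableness
hi = sycophancy[agreeableness >= np.median(agreeableness)]
lo = sycophancy[agreeableness < np.median(agreeableness)]
pooled_sd = np.sqrt((hi.var(ddof=1) + lo.var(ddof=1)) / 2)
cohens_d = (hi.mean() - lo.mean()) / pooled_sd

print(f"Pearson r = {r:.2f} (p = {p:.1e}), Cohen's d = {cohens_d:.2f}")
```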
[63] Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS
Shijia Xu, Zhou Wu, Xiaolong Jia, Yu Wang, Kai Liu, April Xiaowen Dong
Main category: cs.CL
TL;DR: Self-Correcting RAG framework improves complex reasoning by treating retrieval as optimization (MMKP) and generation as path planning (NLI-guided MCTS), reducing hallucinations and improving accuracy.
Details
Motivation: Traditional RAG faces challenges with low context utilization and frequent hallucinations when handling complex reasoning tasks, limiting its effectiveness for multi-hop QA and fact-checking.
Method: Proposes Self-Correcting RAG with two key innovations: 1) formalizes context selection as a multi-dimensional multiple-choice knapsack problem (MMKP) to maximize information density under a token budget; 2) uses NLI-guided Monte Carlo Tree Search (MCTS) to dynamically explore reasoning trajectories and validate answer faithfulness.
Result: Experiments on six multi-hop QA and fact-checking datasets show significant improvements in reasoning accuracy on complex queries while effectively reducing hallucinations, outperforming strong baselines.
Conclusion: The Self-Correcting RAG framework successfully addresses key limitations of traditional RAG by integrating optimization-based retrieval and planning-based generation, enabling more reliable complex reasoning.
Abstract: Retrieval-augmented generation (RAG) substantially extends the knowledge boundary of large language models. However, it still faces two major challenges when handling complex reasoning tasks: low context utilization and frequent hallucinations. To address these issues, we propose Self-Correcting RAG, a unified framework that reformulates retrieval and generation as constrained optimization and path planning. On the input side, we move beyond traditional greedy retrieval and, for the first time, formalize context selection as a multi-dimensional multiple-choice knapsack problem (MMKP), thereby maximizing information density and removing redundancy under a strict token budget. On the output side, we introduce a natural language inference (NLI)-guided Monte Carlo Tree Search (MCTS) mechanism, which leverages test-time compute to dynamically explore reasoning trajectories and validate the faithfulness of generated answers. Experiments on six multi-hop question answering and fact-checking datasets demonstrate that our method significantly improves reasoning accuracy on complex queries while effectively reducing hallucinations, outperforming strong existing baselines. Our code is available at https://github.com/xjiacs/Self-Correcting-RAG.
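The knapsack view of context selection is easiest to picture in its one-dimensional special case. The sketch below is plain 0/1 knapsack dynamic programming over (token cost, relevance) passages; the paper's MMKP adds choice groups and multiple resource dimensions, and the passage scores here are invented.

```python
# Pick passages maximizing total relevance under a token budget (0/1 knapsack DP).
def select_passages(passages, budget):
    """passages: list of (tokens, relevance); returns chosen indices."""
    n = len(passages)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, (cost, value) in enumerate(passages, 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if cost <= b:
                best[i][b] = max(best[i][b], best[i - 1][b - cost] + value)
    # Backtrack to recover the selected set.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= passages[i - 1][0]
    return chosen[::-1]

passages = [(120, 0.9), (300, 1.4), (80, 0.5), (200, 1.1)]
print(select_passages(passages, budget=400))  # -> [0, 2, 3]
```

Greedy retrieval would take passages in relevance order until the budget runs out; the DP instead finds the value-maximizing subset, which is the gap the MMKP formulation targets.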
[64] RCBSF: A Multi-Agent Framework for Automated Contract Revision via Stackelberg Game
Shijia Xu, Yu Wang, Xiaolong Jia, Zhou Wu, Kai Liu, April Xiaowen Dong
Main category: cs.CL
TL;DR: RCBSF is a risk-constrained bilevel Stackelberg framework for automated contract revision that uses a hierarchical game-theoretic approach with global and local agents to optimize revisions while controlling risk.
Details
Motivation: Current LLMs in Legal AI suffer from hallucinated safety issues and lack rigorous behavioral constraints for automated contract revision, limiting their practical utility despite widespread adoption.
Method: Proposes Risk-Constrained Bilevel Stackelberg Framework (RCBSF) that formulates revision as a non-cooperative Stackelberg game with hierarchical Leader-Follower structure: Global Prescriptive Agent imposes risk budgets on follower system (Constrained Revision Agent + Local Verification Agent) for iterative optimization.
Result: Achieves state-of-the-art performance with average Risk Resolution Rate of 84.21%, surpassing iterative baselines while enhancing token efficiency. Theoretical guarantees show convergence to equilibrium with superior utility over unguided configurations.
Conclusion: RCBSF effectively addresses hallucination and constraint issues in automated contract revision through game-theoretic formulation, providing both theoretical guarantees and empirical validation of superior performance.
Abstract: Despite the widespread adoption of Large Language Models (LLMs) in Legal AI, their utility for automated contract revision remains impeded by hallucinated safety and a lack of rigorous behavioral constraints. To address these limitations, we propose the Risk-Constrained Bilevel Stackelberg Framework (RCBSF), which formulates revision as a non-cooperative Stackelberg game. RCBSF establishes a hierarchical Leader-Follower structure where a Global Prescriptive Agent (GPA) imposes risk budgets upon a follower system constituted by a Constrained Revision Agent (CRA) and a Local Verification Agent (LVA) to iteratively optimize output. We provide theoretical guarantees that this bilevel formulation converges to an equilibrium yielding strictly superior utility over unguided configurations. Empirical validation on a unified benchmark demonstrates that RCBSF achieves state-of-the-art performance, surpassing iterative baselines with an average Risk Resolution Rate (RRR) of 84.21% while enhancing token efficiency. Our code is available at https://github.com/xjiacs/RCBSF.
[65] Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation
Fangda Ye, Zhifei Xie, Yuxin Hu, Yihang Yin, Shurui Huang, Shikai Dong, Jianzhu Bao, Shuicheng Yan
Main category: cs.CL
TL;DR: Deep-Reporter: An agentic framework for multimodal long-form generation that integrates text and visuals through search, synthesis, and context management, with a new benchmark M2LongBench.
Details
Motivation: Existing agentic search frameworks are text-centric and overlook multimodal evidence needed for real-world expert reports, creating a need for multimodal long-form generation capabilities.
Method: Three-component framework: (1) Agentic Multimodal Search and Filtering for retrieving text passages and information-dense visuals, (2) Checklist-Guided Incremental Synthesis for coherent image-text integration and citation placement, (3) Recurrent Context Management for balancing long-range coherence with local fluency. Includes a curation pipeline producing 8K agentic traces for model optimization.
Result: Created M2LongBench with 247 research tasks across 9 domains and a stable multimodal sandbox. Experiments show multimodal long-form generation is challenging, especially in multimodal selection and integration, but effective post-training can bridge the gap.
Conclusion: Deep-Reporter addresses the pressing need for multimodal long-form generation by providing a unified framework that effectively integrates visual and textual information through agentic processes, with demonstrated improvements through post-training.
Abstract: Recent agentic search frameworks enable deep research via iterative planning and retrieval, reducing hallucinations and enhancing factual grounding. However, they remain text-centric, overlooking the multimodal evidence that characterizes real-world expert reports. We introduce a pressing task: multimodal long-form generation. Accordingly, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. It orchestrates: (i) Agentic Multimodal Search and Filtering to retrieve and filter textual passages and information-dense visuals; (ii) Checklist-Guided Incremental Synthesis to ensure coherent image-text integration and optimal citation placement; and (iii) Recurrent Context Management to balance long-range coherence with local fluency. We develop a rigorous curation pipeline producing 8K high-quality agentic traces for model optimization. We further introduce M2LongBench, a comprehensive testbed comprising 247 research tasks across 9 domains and a stable multimodal sandbox. Extensive experiments demonstrate that long-form multimodal generation is a challenging task, especially in multimodal selection and integration, and effective post-training can bridge the gap.
[66] How You Ask Matters! Adaptive RAG Robustness to Query Variations
Yunah Jang, Megha Sundriyal, Kyomin Jung, Meeyoung Cha
Main category: cs.CL
TL;DR: Adaptive RAG systems are vulnerable to semantically identical query variations, with small surface-level changes dramatically affecting retrieval behavior and accuracy despite no change in intent.
Details
Motivation: Real-world queries often vary in surface form while maintaining the same semantic intent, but the impact of these variations on Adaptive RAG systems remains under-explored. Current Adaptive RAG methods promise efficiency by dynamically triggering retrieval only when needed, but their robustness to query variations is unknown.
Method: Created the first large-scale benchmark of diverse yet semantically identical query variations combining human-written and model-generated rewrites. Systematically evaluated Adaptive RAG robustness across three dimensions: answer quality, computational cost, and retrieval decisions.
Result: Discovered a critical robustness gap where small surface-level changes in queries dramatically alter retrieval behavior and accuracy. Larger models showed better performance but robustness did not improve accordingly. Adaptive RAG methods are highly vulnerable to query variations that preserve identical semantics.
Conclusion: Adaptive RAG systems face a critical robustness challenge where semantically identical query variations can significantly impact system behavior and accuracy, revealing a fundamental vulnerability that needs to be addressed for reliable real-world deployment.
Abstract: Adaptive Retrieval-Augmented Generation (RAG) promises accuracy and efficiency by dynamically triggering retrieval only when needed and is widely used in practice. However, real-world queries vary in surface form even with the same intent, and their impact on Adaptive RAG remains under-explored. We introduce the first large-scale benchmark of diverse yet semantically identical query variations, combining human-written and model-generated rewrites. Our benchmark facilitates a systematic evaluation of Adaptive RAG robustness by examining its key components across three dimensions: answer quality, computational cost, and retrieval decisions. We discover a critical robustness gap, where small surface-level changes in queries dramatically alter retrieval behavior and accuracy. Although larger models show better performance, robustness does not improve accordingly. These findings reveal that Adaptive RAG methods are highly vulnerable to query variations that preserve identical semantics, exposing a critical robustness challenge.
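The robustness measurement itself is simple to sketch: group semantically identical query variants and check whether the retrieval gate makes one consistent decision per group. The gate below is a toy length heuristic standing in for any real adaptive-RAG trigger, and the variant groups are invented.

```python
# Fraction of paraphrase groups on which a retrieve/skip gate is consistent.
def retrieval_gate(query: str) -> bool:
    return len(query.split()) > 6  # placeholder decision rule

variant_groups = [
    ["Who wrote Dune?", "The novel Dune was written by whom?"],
    ["capital of France", "What city serves as the capital of France?"],
]

def decision_consistency(groups):
    stable = sum(len({retrieval_gate(q) for q in group}) == 1
                 for group in groups)
    return stable / len(groups)

print(f"consistent groups: {decision_consistency(variant_groups):.0%}")
# Here both groups flip the gate's decision, illustrating the vulnerability.
```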
[67] Generating Multiple-Choice Knowledge Questions with Interpretable Difficulty Estimation using Knowledge Graphs and Large Language Models
Mehmet Can Şakiroğlu, H. Altay Güvenir, Kamer Kaya
Main category: cs.CL
TL;DR: A novel method for generating multiple-choice questions with difficulty estimation using knowledge graphs and LLMs, where LLMs construct KGs from documents, MCQs are generated from graph components, and difficulty is estimated through nine interpretable signals.
Details
Motivation: Automated MCQ generation with accurate difficulty estimation is challenging but crucial for adaptive, AI-assisted education systems. Current systems lack interpretable difficulty estimation and structured knowledge integration.
Method: Uses LLMs to construct knowledge graphs from input documents. MCQs are generated by selecting KG nodes as keys, sampling related triples/quintuples (optionally augmented), prompting LLMs to create stems, and selecting distractors from the KG. Nine difficulty signals are computed and combined into a unified score using a data-driven approach.
Result: The method generates high-quality MCQs with interpretable difficulty estimation that aligns with human perceptions, improving automated MCQ generation through structured knowledge representations and data-driven difficulty modeling.
Conclusion: The approach successfully integrates structured knowledge representations with LLMs and data-driven difficulty estimation to enhance automated MCQ generation for adaptive education systems.
Abstract: Generating multiple-choice questions (MCQs) with difficulty estimation remains challenging in automated MCQ-generation systems used in adaptive, AI-assisted education. This study proposes a novel methodology for generating MCQs with difficulty estimation from the input documents by utilizing knowledge graphs (KGs) and large language models (LLMs). Our approach uses an LLM to construct a KG from input documents, from which MCQs are then systematically generated. Each MCQ is generated by selecting a node from the KG as the key, sampling a related triple or quintuple – optionally augmented with an extra triple – and prompting an LLM to generate a corresponding stem from these graph components. Distractors are then selected from the KG. For each MCQ, nine difficulty signals are computed and combined into a unified difficulty score using a data-driven approach. Experimental results demonstrate that our method generates high-quality MCQs whose difficulty estimation is interpretable and aligns with human perceptions. Our approach improves automated MCQ generation by integrating structured knowledge representations with LLMs and a data-driven difficulty estimation model.
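The last step, combining difficulty signals into one interpretable score, can be sketched as a small regression problem. The three signals and the human-difficulty targets below are invented stand-ins for the paper's nine signals and annotation data.

```python
# Data-driven combination of per-MCQ difficulty signals into one score.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
# Rows: MCQs; columns: hypothetical signals such as key-node degree in
# the KG, distractor similarity to the key, stem length. All synthetic.
signals = rng.uniform(0, 1, size=(50, 3))
# Human-annotated difficulty (0 = easy, 1 = hard) as the training target.
human_difficulty = signals @ np.array([0.2, 0.6, 0.2]) + rng.normal(0, 0.05, 50)

model = LinearRegression().fit(signals, human_difficulty)
print("signal weights:", model.coef_)        # interpretable contributions
print("predicted difficulty:", model.predict(signals[:2]))
```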
[68] Do BERT Embeddings Encode Narrative Dimensions? A Token-Level Probing Analysis of Time, Space, Causality, and Character in Fiction
Beicheng Bei, Hannah Hyesun Chun, Chen Guo, Arwa Saghiri
Main category: cs.CL
TL;DR: BERT embeddings encode narrative semantic dimensions (time, space, causality, character) but not as discrete clusters, with rare categories showing boundary leakage issues.
Details
Motivation: To investigate whether BERT embeddings encode multidimensional narrative semantics (time, space, causality, character) that are crucial for narrative understanding.
Method: Used LLM to create token-level dataset labeled with four narrative categories plus “others,” then applied linear probing on BERT embeddings and compared with variance-matched random embeddings control.
Result: BERT probe achieved 94% accuracy vs 47% control, with macro-average recall of 0.83. Rare categories like causality (0.75) and space (0.66) showed moderate success but suffered from “Boundary Leakage” misclassification. Unsupervised clustering aligned poorly with categories (ARI = 0.081).
Conclusion: BERT encodes meaningful narrative information but not as discretely separable clusters, with rare dimensions prone to misclassification as “others.” Future work includes syntactic baseline and expanded datasets.
Abstract: Narrative understanding requires multidimensional semantic structures. This study investigates whether BERT embeddings encode dimensions of fictional narrative semantics – time, space, causality, and character. Using an LLM to accelerate annotation, we construct a token-level dataset labeled with these four narrative categories plus “others.” A linear probe on BERT embeddings (94% accuracy) significantly outperforms a control probe on variance-matched random embeddings (47%), confirming that BERT encodes meaningful narrative information. With balanced class weighting, the probe achieves a macro-average recall of 0.83, with moderate success on rare categories such as causality (recall = 0.75) and space (recall = 0.66). However, confusion matrix analysis reveals “Boundary Leakage,” where rare dimensions are systematically misclassified as “others.” Clustering analysis shows that unsupervised clustering aligns near-randomly with predefined categories (ARI = 0.081), suggesting that narrative dimensions are encoded but not as discretely separable clusters. Future work includes a POS-only baseline to disentangle syntactic patterns from narrative encoding, expanded datasets, and layer-wise probing.
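A token-level linear probe of this kind takes only a few lines with standard libraries. The sketch below uses dummy cyclic labels where the study uses its LLM-annotated fiction dataset; the balanced class weighting mirrors the setup described above.

```python
# Frozen BERT embeddings in, narrative-category labels out.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def token_embeddings(text: str):
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]  # (tokens, 768)
    return hidden[1:-1].numpy()                    # drop [CLS]/[SEP]

X = token_embeddings("Yesterday she walked home because it rained.")
# Dummy labels cycling over {time, character, others, causality};
# the study instead uses LLM-annotated labels over fiction.
y = [i % 4 for i in range(len(X))]
probe = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
print(probe.score(X, y))  # training accuracy of the linear probe
```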
[69] When Meaning Isn’t Literal: Exploring Idiomatic Meaning Across Languages and Modalities
Sarmistha Das, Shreyas Guha, Suvrayan Bandyopadhyay, Salisa Phosit, Kitsuchart Pasupa, Sriparna Saha
Main category: cs.CL
TL;DR: Mediom: A multilingual multimodal idiom corpus with HIDE framework for improving metaphor comprehension in AI systems
Details
Motivation: Current language models struggle with idiomatic reasoning due to cultural and metaphorical nuances, often focusing on literal interpretations rather than figurative meanings.
Method: Created Mediom corpus with 3,533 Hindi, Bengali, and Thai idioms featuring explanations, translations, and text-image representations. Developed HIDE framework using error-feedback retrieval and diagnostic cues for iterative reasoning refinement.
Result: Benchmarked LLMs and VLMs on Mediom, exposing systematic failures in metaphor comprehension. HIDE framework shows promise for improving idiom understanding through targeted reasoning hints.
Conclusion: Mediom provides a rigorous test bed for culturally grounded, multimodal idiom understanding, while HIDE offers methodology for embedding reasoning hints in next-generation AI systems.
Abstract: Idiomatic reasoning, deeply intertwined with metaphor and culture, remains a blind spot for contemporary language models, whose progress skews toward surface-level lexical and semantic cues. For instance, consider the Bengali idiom আঙ্গুর ফল টক (angur fol tok, “grapes are sour”): it encodes denial-driven rationalization, yet naive models latch onto the literal fox-and-grape imagery. Addressing this oversight, we present “Mediom,” a multilingual, multimodal idiom corpus of 3,533 Hindi, Bengali, and Thai idioms, each paired with gold-standard explanations, cross-lingual translations, and carefully aligned text–image representations. We benchmark both large language models (textual reasoning) and vision-language models (figurative disambiguation) on Mediom, exposing systematic failures in metaphor comprehension. To mitigate these gaps, we propose “HIDE,” a Hinting-based Idiom Explanation framework that leverages error-feedback retrieval and targeted diagnostic cues for iterative reasoning refinement. Collectively, Mediom and HIDE establish a rigorous test bed and methodology for culturally grounded, multimodal idiom understanding embedded with reasoning hints in next-generation AI systems.
[70] TInR: Exploring Tool-Internalized Reasoning in Large Language Models
Qiancheng Xu, Yongqi Li, Fan Liu, Hongru Wang, Min Yang, Wenjie Li
Main category: cs.CL
TL;DR: TInR-U is a framework that internalizes tool knowledge into LLMs for unified reasoning and tool usage, addressing issues with external tool documentation through bidirectional knowledge alignment, supervised fine-tuning, and reinforcement learning.
Details
Motivation: Existing Tool-Integrated Reasoning (TIR) methods rely on external tool documentation, leading to tool mastery difficulty, tool size constraints, and inference inefficiency. The paper explores Tool-Internalized Reasoning (TInR) to facilitate reasoning with tool knowledge internalized into LLMs.
Method: Proposes TInR-U framework with three-phase training: 1) tool internalization using bidirectional knowledge alignment strategy, 2) supervised fine-tuning warm-up with high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards.
Result: TInR-U achieves superior performance in both in-domain and out-of-domain settings, demonstrating effectiveness and efficiency compared to existing methods.
Conclusion: Tool-Internalized Reasoning (TInR) through the TInR-U framework successfully addresses limitations of external tool documentation, enabling more efficient and effective reasoning with internalized tool knowledge in LLMs.
Abstract: Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models’ (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), aiming at facilitating reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experiment results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.
[71] Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V
Chirag Shinde
Main category: cs.CL
TL;DR: Two transformer modifications: 1) non-linear pre-projection MLP before Q/K/V projections, and 2) content skip connection bypassing attention. Combined approach improves language model performance without K/V cache overhead.
Details
Motivation: To enhance transformer attention blocks by allowing richer feature construction before positional encoding and enabling content information to bypass positional attention where beneficial, potentially improving language understanding.
Method: Two complementary modifications: 1) Insert non-linear MLP between layer norm and Q/K/V projections to create richer position-agnostic features before positional encoding. 2) Add content skip connection that routes pre-projection features around attention mechanism, allowing content information to bypass position-aware attention.
Result: Combined approach achieves strongest results: +40.6% LAMBADA accuracy and -39% perplexity at 160M scale on Pythia models. Learned skip weights show consistent pattern: later transformer layers activate content bypass more strongly than earlier layers. No K/V cache overhead added.
Conclusion: The modifications improve transformer performance by separating content processing from positional attention, with deeper layers benefiting more from content information that bypasses positional encoding. The approach is efficient with no additional cache overhead.
Abstract: We propose two complementary modifications to transformer attention blocks. First, a non-linear pre-projection MLP is inserted between layer norm and Q/K/V projections, constructing richer features in a position-agnostic manner before any positional encoding is applied. Second, a content skip connection routes the pre-projection’s features around the attention mechanism, allowing content information to bypass position-aware attention where beneficial. In frozen-probe experiments on Pythia-160M and 410M, the combined approach achieves the strongest results across methods: +40.6% LAMBADA accuracy and -39% perplexity at 160M scale. Learned skip connection weights reveal a consistent pattern across model sizes: later transformer layers activate the content bypass more strongly than earlier layers, suggesting that deeper layers benefit from content information that does not pass through positional attention. All modifications add no K/V cache overhead.
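Both modifications are easy to express as a module. The sketch below is an illustrative PyTorch rendering, not the authors' code: dimensions, the gating form (a sigmoid over a learned scalar), and the omission of positional encoding are all simplifying assumptions.

```python
# Pre-projection MLP before Q/K/V plus a learned content skip around attention.
import torch
import torch.nn as nn

class PreProjectionAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # (1) position-agnostic feature construction before Q/K/V
        self.pre_mlp = nn.Sequential(
            nn.Linear(d_model, 2 * d_model), nn.GELU(),
            nn.Linear(2 * d_model, d_model),
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # (2) learned per-layer weight for the content bypass
        self.skip_weight = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pre_mlp(self.norm(x))       # richer pre-Q/K/V features
        attn_out, _ = self.attn(h, h, h)     # position-aware path
        # content information can route around attention entirely
        return x + attn_out + torch.sigmoid(self.skip_weight) * h

block = PreProjectionAttention(d_model=64, n_heads=4)
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```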
[72] Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series
Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej
Main category: cs.CL
TL;DR: Bielik v3 PL series introduces Polish-optimized LLMs with dedicated tokenization to address inefficiencies of universal tokenizers, improving performance for Polish language tasks.
Details
Motivation: Universal tokenizers in general-purpose LLMs fail to capture morphological nuances of specific languages like Polish, leading to higher fertility ratios, increased inference costs, and restricted effective context windows.
Method: Transition from universal Mistral-based tokenization to dedicated Polish-optimized vocabulary, using FOCUS-based embedding initialization, multi-stage pretraining curriculum, and post-training alignment with Supervised Fine-Tuning, Direct Preference Optimization, and Reinforcement Learning through Group Relative Policy Optimization with verifiable rewards.
Result: Development of Bielik v3 PL series with 7B and 11B parameter variants representing significant milestone in language-specific LLM optimization for Polish.
Conclusion: Language-specific tokenization optimization addresses fundamental architectural inefficiencies in multilingual LLMs, enabling better performance for specific languages like Polish.
Abstract: The development of the Bielik v3 PL series, encompassing both the 7B and 11B parameter variants, represents a significant milestone in the field of language-specific large language model (LLM) optimization. While general-purpose models often demonstrate impressive multilingual capabilities, they frequently suffer from a fundamental architectural inefficiency: the use of universal tokenizers. These tokenizers, typically designed to cover a broad spectrum of languages, often fail to capture the morphological nuances of specific languages like Polish, leading to higher fertility ratios, increased inference costs, and restricted effective context windows. This report details the transition from the universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary for the Bielik v3 models, exploring the FOCUS-based embedding initialization, the multi-stage pretraining curriculum, and the subsequent post-training alignment involving Supervised Fine-Tuning, Direct Preference Optimization, and Reinforcement Learning through Group Relative Policy Optimization with verifiable rewards.
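Fertility, the metric driving the redesign, is straightforward to measure. The sketch below uses the GPT-2 tokenizer as a stand-in universal tokenizer and a crude tokens-per-word proxy; comparing against a Polish-optimized vocabulary would expose the gap the report describes.

```python
# Rough fertility proxy: subword tokens per whitespace-separated word.
from transformers import AutoTokenizer

def fertility(tokenizer, text: str) -> float:
    return len(tokenizer.tokenize(text)) / len(text.split())

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in universal tokenizer
polish = "Najpiękniejsze krajobrazy zobaczysz wędrując po Tatrach."
english = "You will see the most beautiful landscapes hiking in the Tatras."
print("PL fertility:", round(fertility(tok, polish), 2))   # noticeably higher
print("EN fertility:", round(fertility(tok, english), 2))
```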
[73] OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models
Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan Liu, Yantao Liu, Dayiheng Liu, Tsung-Yi Ho
Main category: cs.CL
TL;DR: OccuBench is a benchmark for evaluating AI agents across 100 real-world professional task scenarios spanning 10 industries and 65 specialized domains, using Language World Models to simulate domain-specific environments.
Details
Motivation: Existing benchmarks only evaluate AI agents in the few domains with public environments, but agents need to perform professional work across hundreds of occupational domains. There’s a need for systematic cross-industry evaluation of AI agents on professional occupational tasks.
Method: Uses Language World Models (LWMs) to simulate domain-specific environments through LLM-driven tool response generation. Multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. Evaluates agents along task completion across domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, mixed faults).
Result: Evaluated 15 frontier models across 8 families: (1) No single model dominates all industries, each has distinct occupational capability profile; (2) Implicit faults (truncated data, missing fields) are harder than explicit errors and mixed faults; (3) Larger models, newer generations, and higher reasoning effort consistently improve performance (GPT-5.2 improves 27.5 points from minimal to maximum reasoning effort); (4) Strong agents are not necessarily strong environment simulators - simulator quality is critical for LWM-based evaluation reliability.
Conclusion: OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks, enabling comprehensive assessment of agent capabilities across diverse real-world professional domains.
Abstract: AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance, with GPT-5.2 gaining 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators. Simulator quality is critical for LWM-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.
[74] AOP-Smart: A RAG-Enhanced Large Language Model Framework for Adverse Outcome Pathway Analysis
Qinjiang Niu, Lu Yan
Main category: cs.CL
TL;DR: AOP-Smart: A retrieval-augmented generation framework that improves LLM reliability for toxicological Adverse Outcome Pathways knowledge tasks by reducing hallucinations through structured knowledge retrieval from AOP-Wiki.
Details
Motivation: Large language models have been applied to AOP-related question answering but suffer from hallucination problems, generating factually inconsistent or unsubstantiated content, limiting their reliability in toxicological research and risk assessment.
Method: Proposes AOP-Smart, an AOP-oriented RAG framework that retrieves relevant knowledge from official AOP-Wiki XML data using Key Events, Key Event Relationships, and specific AOP information to augment LLM responses for user questions.
Result: On 20 AOP-related QA tasks, RAG dramatically improved accuracy: GPT from 15.0% to 95.0%, DeepSeek from 35.0% to 100.0%, and Gemini from 20.0% to 95.0%, showing significant reduction in hallucinations and improved answer consistency.
Conclusion: AOP-Smart effectively mitigates LLM hallucination problems in AOP knowledge tasks, substantially enhancing answer accuracy and reliability for toxicological research applications.
Abstract: Adverse Outcome Pathways (AOPs) are an important knowledge framework in toxicological research and risk assessment. In recent years, large language models (LLMs) have gradually been applied to AOP-related question answering and mechanistic reasoning tasks. However, due to the existence of the hallucination problem, that is, the model may generate content that is inconsistent with facts or lacks evidence, their reliability is still limited. To address this issue, this study proposes an AOP-oriented Retrieval-Augmented Generation (RAG) framework, AOP-Smart. Based on the official XML data from AOP-Wiki, this method uses Key Events (KEs), Key Event Relationships (KERs), and specific AOP information to retrieve relevant knowledge for user questions, thereby improving the reliability of the generated results of large language models. To evaluate the effectiveness of the proposed method, this study constructed a test set containing 20 AOP-related question answering tasks, covering KE identification, upstream and downstream KE retrieval, and complex AOP retrieval tasks. Experiments were conducted on three mainstream large language models, Gemini, DeepSeek, and ChatGPT, and comparative tests were performed under two settings: without RAG and with RAG. The experimental results show that, without using RAG, the accuracies of GPT, DeepSeek, and Gemini were 15.0%, 35.0%, and 20.0%, respectively; after using RAG, their accuracies increased to 95.0%, 100.0%, and 95.0%, respectively. The results indicate that AOP-Smart can significantly alleviate the hallucination problem of large language models in AOP knowledge tasks, and greatly improve the accuracy and consistency of their answers.
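The retrieval step can be sketched as: parse AOP-Wiki-style XML, score Key Events against the question, and prepend the matches to the prompt. The element names and the word-overlap scorer below are illustrative assumptions, not the actual AOP-Wiki schema or the paper's retriever.

```python
import xml.etree.ElementTree as ET

XML = """<aop-wiki>
  <key-event id="KE1"><title>Oxidative stress</title></key-event>
  <key-event id="KE2"><title>Mitochondrial dysfunction</title></key-event>
  <key-event id="KE3"><title>Neuronal cell death</title></key-event>
</aop-wiki>"""

events = [(ke.get("id"), ke.findtext("title"))
          for ke in ET.fromstring(XML).iter("key-event")]

def retrieve(question: str, k: int = 2):
    """Toy word-overlap scorer standing in for the real retriever."""
    q = set(question.lower().replace("?", "").split())
    scored = sorted(events, key=lambda e: -len(q & set(e[1].lower().split())))
    return scored[:k]

question = "Which key event follows oxidative stress?"
prompt = f"Context: {retrieve(question)}\nQuestion: {question}"
print(prompt)  # this augmented prompt is what the LLM would receive
```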
[75] HTAA: Enhancing LLM Planning via Hybrid Toolset Agentization & Adaptation
Chengrui Huang, Junshuo Zhang, Zhiyuan Ma, Xikun Wang, Ximeng Wang, Menghua Jiang, Gang Zeng, Zhaobing Han, Shen Gao, Shuo Shang
Main category: cs.CL
TL;DR: HTAA introduces a hierarchical framework for scalable tool-use in LLMs by grouping frequently co-used tools into specialized agent tools and using asymmetric planner adaptation for coordination.
Details
Motivation: Current flat tool-calling architectures for LLMs are inefficient and suffer from error accumulation when scaling to hundreds of tools needed for real-world applications.
Method: Proposes Hybrid Toolset Agentization & Adaptation (HTAA): 1) Toolset agentization - encapsulating frequently co-used tools into specialized agent tools to reduce action space; 2) Asymmetric Planner Adaptation - trajectory-based training aligning high-level planner with agent tools via backward reconstruction and forward refinement.
Result: HTAA achieves higher task success rates, shorter tool calling trajectories, and significantly reduces context overhead compared to baselines on InfoVerify dataset and other benchmarks. In production deployment, it substantially reduces manual validation effort and operational costs.
Conclusion: HTAA provides an effective hierarchical framework for scalable tool-use in LLMs, addressing inefficiencies of flat architectures and enabling reliable use of hundreds of tools for real-world applications.
Abstract: Enabling large language models to scale and reliably use hundreds of tools is critical for real-world applications, yet challenging due to the inefficiency and error accumulation inherent in flat tool-calling architectures. To address this, we propose Hybrid Toolset Agentization & Adaptation (HTAA), a hierarchical framework for scalable tool-use planning. We propose a novel toolset agentization paradigm, which encapsulates frequently co-used tools into specialized agent tools, thereby reducing the planner’s action space and mitigating redundancy. To ensure effective coordination, we design Asymmetric Planner Adaptation, a trajectory-based training paradigm that aligns the high-level planner with agent tools via backward reconstruction and forward refinement. To validate the performance of HTAA, we conduct experiments on a real-world internal dataset, InfoVerify, based on the POI validation workflow of China’s largest online large-scale ride-hailing platform, featuring long-horizon executable tool trajectories. Experiments on InfoVerify and widely-used benchmarks show that HTAA consistently achieves higher task success rates, requires shorter tool-calling trajectories, and significantly reduces context overhead compared to strong baselines. Furthermore, in a production deployment, HTAA substantially reduces manual validation effort and operational cost, demonstrating its practical efficacy.
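The agentization idea (grouping tools that co-occur in past trajectories into one agent tool) can be sketched with a simple co-occurrence count. The trajectories, threshold, and merging rule below are toy assumptions, not the paper's procedure.

```python
# Group frequently co-used tools into shared agent toolsets.
from collections import Counter
from itertools import combinations

trajectories = [
    ["geocode", "route", "eta"],
    ["geocode", "route", "traffic"],
    ["search_poi", "reviews"],
    ["search_poi", "reviews", "photos"],
]

cooc = Counter()
for traj in trajectories:
    cooc.update(combinations(sorted(set(traj)), 2))

THRESHOLD = 2  # merge pairs co-used at least this often
groups = []
for (a, b), n in cooc.items():
    if n < THRESHOLD:
        continue
    for g in groups:
        if a in g or b in g:
            g.update((a, b))
            break
    else:
        groups.append({a, b})

print(groups)  # e.g. [{'geocode', 'route'}, {'reviews', 'search_poi'}]
```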
[76] Mem$^2$Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation
Zihao Cheng, Zeming Liu, Yingyu Shan, Xinyi Wang, Xiangrong Zhu, Yunpu Ma, Hongru Wang, Yuhang Guo, Wei Lin, Yunhong Wang
Main category: cs.CL
TL;DR: Mem2Evolve is a co-evolutionary framework that integrates experience accumulation and dynamic asset creation for LLM-powered agents, enabling mutual reinforcement between capability expansion and experience distillation.
Details
Motivation: Existing LLM agent frameworks treat experience accumulation and asset creation as separate evolutionary processes, overlooking their interdependence. Experience accumulation is limited by static toolkits, while asset creation lacks experiential guidance, leading to suboptimal capability growth.
Method: Proposes Mem2Evolve with two core components: Experience Memory and Asset Memory. The framework leverages accumulated experience to guide dynamic creation of new assets (tools/expert agents), while simultaneously acquiring new experience from using these assets, enabling co-evolution.
Result: Extensive experiments across 6 task categories and 8 benchmarks show Mem2Evolve achieves 18.53% improvement over standard LLMs, 11.80% over experience-only agents, and 6.46% over asset-creation-only agents, demonstrating superior effectiveness and stability.
Conclusion: Mem2Evolve establishes a more effective and stable self-evolving agent framework by integrating experience distillation and capability expansion through co-evolution, overcoming limitations of isolated evolutionary approaches.
Abstract: While large language model–powered agents can self-evolve by accumulating experience or by dynamically creating new assets (i.e., tools or expert agents), existing frameworks typically treat these two evolutionary processes in isolation. This separation overlooks their intrinsic interdependence: the former is inherently bounded by a manually predefined static toolset, while the latter generates new assets from scratch without experiential guidance, leading to limited capability growth and unstable evolution. To address this limitation, we introduce a novel paradigm of co-evolutionary Capability Expansion and Experience Distillation. Guided by this paradigm, we propose the Mem$^2$Evolve, which integrates two core components: Experience Memory and Asset Memory. Specifically, Mem$^2$Evolve leverages accumulated experience to guide the dynamic creation of assets, thereby expanding the agent’s capability space while simultaneously acquiring new experience to achieve co-evolution. Extensive experiments across 6 task categories and 8 benchmarks demonstrate that Mem$^2$Evolve achieves improvement of 18.53% over standard LLMs, 11.80% over agents evolving solely through experience, and 6.46% over those evolving solely through asset creation, establishing it as a substantially more effective and stable self-evolving agent framework. Code is available at: https://buaa-irip-llm.github.io/Mem2Evolve.
[77] YIELD: A Large-Scale Dataset and Evaluation Framework for Information Elicitation Agents
Victor De Lima, Grace Hui Yang
Main category: cs.CL
TL;DR: A framework for Information Elicitation Agents (IEAs) that aim to extract information from users for institutional objectives, with a new dataset YIELD and formalization as a POMDP.
Details
Motivation: Most conversational agents focus on satisfying user needs, but many real-world scenarios (academic interviews, judicial proceedings, investigations) require agents that can actively elicit information from users to support institutional decision-making processes.
Method: Introduces Information Elicitation Agents (IEAs) with a formalization as a finite-horizon POMDP. Creates YIELD dataset of 2,281 human-to-human dialogues (26M tokens) for training. Proposes novel metrics tailored to IEAs and conducts experiments with foundation LLMs.
Result: Training on YIELD improves LLM alignment with real elicitation behavior. Findings are corroborated by human evaluation. The dataset, code, evaluation tools, and fine-tuned model adapters are publicly released.
Conclusion: The paper establishes a framework for systematic research on information elicitation agents, providing a dataset, formalization, and evaluation metrics to advance this important but understudied area of conversational AI.
Abstract: Most conversational agents (CAs) are designed to satisfy user needs through user-driven interactions. However, many real-world settings, such as academic interviewing, judicial proceedings, and journalistic investigations, involve broader institutional decision-making processes and require agents that can elicit information from users. In this paper, we introduce Information Elicitation Agents (IEAs) in which the agent’s goal is to elicit information from users to support the agent’s institutional or task-oriented objectives. To enable systematic research on this setting, we present YIELD, a 26M-token dataset of 2,281 ethically sourced, human-to-human dialogues. Moreover, we formalize information elicitation as a finite-horizon POMDP and propose novel metrics tailored to IEAs. Pilot experiments on multiple foundation LLMs show that training on YIELD improves their alignment with real elicitation behavior and findings are corroborated by human evaluation. We release YIELD under CC BY 4.0. The dataset, project code, evaluation tools, and fine-tuned model adapters are available at: https://github.com/infosenselab/yield.
[78] When Verification Fails: How Compositionally Infeasible Claims Escape Rejection
Muxin Liu, Delip Rao, Grace Kim, Chris Callison-Burch
Main category: cs.CL
TL;DR: Paper shows existing scientific claim verification benchmarks fail to distinguish rigorous verification from shortcut reasoning, constructs new compositional benchmarks revealing models over-accept claims despite non-salient contradictions.
Details
Motivation: Existing scientific claim verification benchmarks cannot distinguish between models that properly enforce the Closed-World Assumption (requiring all constraints to be supported) and models that use a shortcut of only checking the most salient constraint.
Method: Constructs compositionally infeasible claims where the salient constraint is supported but a non-salient constraint is contradicted. Tests various model families and modalities, and uses model context interventions to analyze verification behavior.
Result: Models that saturate existing benchmarks consistently over-accept compositionally infeasible claims, confirming prevalence of shortcut reasoning. Different models occupy distinct positions on a shared ROC curve, indicating differences in verification thresholds rather than underlying reasoning ability.
Conclusion: Current verification benchmarks are insufficient, and the compositional inference bottleneck is a structural property of current verification behavior that cannot be overcome by strategy guidance alone.
Abstract: Scientific claim verification, the task of determining whether claims are entailed by scientific evidence, is fundamental to grounding discoveries in evidence while preventing misinformation. This process involves evaluating each asserted constraint against validated evidence. Under the Closed-World Assumption (CWA), a claim is accepted if and only if all asserted constraints are positively supported. We show that existing verification benchmarks cannot distinguish models enforcing this standard from models applying a simpler shortcut called salient-constraint checking, which applies CWA’s rejection criterion only to the most salient constraint and accepts when that constraint is supported. Because existing benchmarks construct infeasible claims by perturbing a single salient element, they are insufficient for distinguishing between rigorous claim verification and simple salient-constraint reliance. To separate the two, we construct compositionally infeasible claims where the salient constraint is supported but a non-salient constraint is contradicted. Across model families and modalities, models that otherwise saturate existing benchmarks consistently over-accept these claims, confirming the prevalence of such shortcut reasoning. Via model context interventions, we show that different models and prompting strategies occupy distinct positions on a shared ROC curve, indicating that the gap between model families reflects differences in verification threshold rather than underlying reasoning ability, and that the compositional inference bottleneck is a structural property of current verification behavior that strategy guidance alone cannot overcome.
[79] When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies
Zhengzhe Yang
Main category: cs.CL
TL;DR: LLMs can generate predictive numerical features for RL trading agents via automated prompt optimization, but these features don’t translate to robust policy improvement during macroeconomic shocks.
Details
Motivation: To investigate whether LLMs can generate continuous numerical features that improve reinforcement learning trading agents, specifically examining if LLM-extracted features from unstructured financial data can enhance trading performance.
Method: Built a modular pipeline with frozen LLM as stateless feature extractor, transforming daily news/filings into fixed-dimensional vectors for downstream PPO agent. Introduced automated prompt-optimization loop treating extraction prompt as discrete hyperparameter, tuned directly against Information Coefficient (Spearman correlation) rather than NLP losses.
Result: The optimized prompt discovered genuinely predictive features (IC > 0.15 on held-out data). However, under a distribution shift caused by a macroeconomic shock, LLM-derived features added noise and underperformed a price-only baseline. In the calmer test regime, the agent recovered, but macroeconomic state variables remained the most robust driver of policy improvement.
Conclusion: There’s a gap between feature-level validity and policy-level robustness, paralleling known transfer learning challenges under distribution shift. LLM features can be predictive but may not translate to robust downstream task performance during real-world distribution shifts.
Abstract: Can large language models (LLMs) generate continuous numerical features that improve reinforcement learning (RL) trading agents? We build a modular pipeline where a frozen LLM serves as a stateless feature extractor, transforming unstructured daily news and filings into a fixed-dimensional vector consumed by a downstream PPO agent. We introduce an automated prompt-optimization loop that treats the extraction prompt as a discrete hyperparameter and tunes it directly against the Information Coefficient - the Spearman rank correlation between predicted and realized returns - rather than NLP losses. The optimized prompt discovers genuinely predictive features (IC above 0.15 on held-out data). However, these valid intermediate representations do not automatically translate into downstream task performance: during a distribution shift caused by a macroeconomic shock, LLM-derived features add noise, and the augmented agent under-performs a price-only baseline. In a calmer test regime the agent recovers, yet macroeconomic state variables remain the most robust driver of policy improvement. Our findings highlight a gap between feature-level validity and policy-level robustness that parallels known challenges in transfer learning under distribution shift.
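The optimization target is worth seeing in code: the Information Coefficient is just the Spearman rank correlation between a predicted feature series and realized returns, computed per candidate prompt. All series below are synthetic placeholders.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
realized = rng.normal(0, 0.02, size=250)  # held-out daily returns

def information_coefficient(predicted):
    rho, _ = spearmanr(predicted, realized)
    return rho

# Pretend each candidate prompt yields a different LLM feature series.
candidates = {
    "sentiment_only": realized * 0.1 + rng.normal(0, 0.05, 250),
    "sentiment+risk": realized * 0.3 + rng.normal(0, 0.05, 250),
    "headline_count": rng.normal(0, 0.05, 250),
}

for name, feats in candidates.items():
    print(f"{name:15s} IC = {information_coefficient(feats):+.3f}")
best = max(candidates, key=lambda n: information_coefficient(candidates[n]))
print("selected prompt:", best)
```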
[80] Uncertainty-Aware Web-Conditioned Scientific Fact-Checking
Ashwin Vinod, Katrin Erk
Main category: cs.CL
TL;DR: A scientific fact-checking pipeline using atomic predicate decomposition with uncertainty-gated web search for verifying technical claims in specialized domains like biomedicine and materials science.
Details
Motivation: Existing fact-checking systems often hallucinate or apply inconsistent reasoning when verifying technical, compositional claims, especially under source and cost/latency constraints in specialized domains.
Method: Atomic predicate-argument decomposition with calibrated uncertainty-gated corroboration: facts are aligned to local snippets via embeddings, verified by compact evidence-grounded checker, and only uncertain facts trigger domain-restricted web search over authoritative sources.
Result: The framework surpasses the strongest baselines on multiple benchmarks, with web corroboration invoked for only a minority of atomic facts on average, showing selective external evidence consultation under calibrated uncertainty.
Conclusion: Coupling atomic granularity with calibrated, uncertainty-gated corroboration yields more interpretable and context-conditioned verification suitable for high-stakes, single-document settings requiring traceable rationales and predictable cost/latency.
Abstract: Scientific fact-checking is vital for assessing claims in specialized domains such as biomedicine and materials science, yet existing systems often hallucinate or apply inconsistent reasoning, especially when verifying technical, compositional claims against an evidence snippet under source and cost/latency constraints. We present a pipeline centered on atomic predicate-argument decomposition and calibrated, uncertainty-gated corroboration: atomic facts are aligned to local snippets via embeddings, verified by a compact evidence-grounded checker, and only facts with uncertain support trigger domain-restricted web search over authoritative sources. The system supports both binary and tri-valued classification, predicting labels from Supported, Refuted, and NEI for three-way tasks. We evaluate under two regimes, Context-Only (no web) and Context+Web (uncertainty-gated web corroboration); when retrieved evidence conflicts with the provided context, we abstain with NEI rather than overriding the context. On multiple benchmarks, our framework surpasses the strongest baselines. In our experiments, web corroboration was invoked for only a minority of atomic facts on average, indicating that external evidence is consulted selectively under calibrated uncertainty rather than routinely. Overall, coupling atomic granularity with calibrated, uncertainty-gated corroboration yields more interpretable and context-conditioned verification, making the approach well-suited to high-stakes, single-document settings that demand traceable rationales, predictable cost/latency, and conservative abstention.
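The gating logic in miniature: verify each atomic fact locally first, and escalate to web corroboration only when the checker's confidence falls below a calibrated threshold. The overlap-based checker, the fixed web-check score, and the threshold below are placeholders for the paper's trained components, and the label aggregation is simplified.

```python
def local_check(fact: str, snippet: str) -> float:
    """Stand-in checker: word overlap as a 'support probability'."""
    f, s = set(fact.lower().split()), set(snippet.lower().split())
    return len(f & s) / len(f)

def web_check(fact: str) -> float:
    return 0.4  # placeholder: domain-restricted search found weak support

def verify(atomic_facts, snippet, tau=0.6):
    for fact in atomic_facts:
        p = local_check(fact, snippet)
        if p < tau:              # uncertainty gate: escalate only when unsure
            p = web_check(fact)
        if p < tau:
            return "NEI"         # conservative abstention (the full system
    return "Supported"           # also emits Refuted where warranted)

snippet = "graphene oxide reduces thermal conductivity in the composite"
facts = ["graphene oxide reduces thermal conductivity",
         "the composite melts at 900 degrees"]
print(verify(facts, snippet))    # -> NEI
```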
[81] A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities
Jiaqi Chen, Ming Wang, Tingna Xie, Shi Feng, Yongkang Liu
Main category: cs.CL
TL;DR: Persona induction in LLMs affects cognitive performance beyond style, with task-dependent effects that align 74% with human personality-cognition patterns, enabling adaptive persona routing for improved performance.
Details
Motivation: While persona induction is common for customizing LLM interaction styles, its impact on core cognitive capabilities remains unknown. The paper investigates whether inducing specific personality traits affects LLMs’ fundamental reasoning and problem-solving abilities.
Method: Used Neuron-based Personality Trait Induction (NPTI) to induce Big Five personality traits in LLMs, evaluated performance across six cognitive benchmarks, analyzed task dependence and trait effects, and proposed Dynamic Persona Routing (DPR) for query-adaptive persona selection.
Result: Persona induction produces stable, reproducible cognitive performance shifts beyond stylistic changes. Effects are strongly task-dependent: some personalities improve instruction-following while others impair complex reasoning. Openness and Extraversion have strongest effects. LLM personality-cognition relationships show 73.68% directional consistency with human patterns. DPR outperforms best static persona without training.
Conclusion: Persona induction meaningfully affects LLM cognitive capabilities, not just interaction style. The discovered regularities enable practical applications like adaptive persona routing for performance optimization, bridging LLM behavior with human personality-cognition relationships.
Abstract: Imbuing Large Language Models (LLMs) with specific personas is prevalent for tailoring interaction styles, yet the impact on underlying cognitive capabilities remains unexplored. We employ the Neuron-based Personality Trait Induction (NPTI) framework to induce Big Five personality traits in LLMs and evaluate performance across six cognitive benchmarks. Our findings reveal that persona induction produces stable, reproducible shifts in cognitive task performance beyond surface-level stylistic changes. These effects exhibit strong task dependence: certain personalities yield consistent gains on instruction-following, while others impair complex reasoning. Effect magnitude varies systematically by trait dimension, with Openness and Extraversion exerting the most robust influence. Furthermore, LLM effects show 73.68% directional consistency with human personality-cognition relationships. Capitalizing on these regularities, we propose Dynamic Persona Routing (DPR), a lightweight query-adaptive strategy that outperforms the best static persona without additional training.
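The paper does not spell out DPR's internals here, but a purely hypothetical routing rule, matching a query to the persona that helped most on the closest task type, might look like the following; every persona label and task prototype below is invented for illustration.

```python
# Hypothetical sketch of query-adaptive persona routing in the spirit of DPR;
# personas, prototypes, and the routing rule here are illustrative only.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Assumed mapping from task prototypes to the persona that helped on that task.
PERSONA_BY_TASK = {
    "follow a precise instruction step by step": "high-Conscientiousness",
    "solve a multi-step math or logic problem": "neutral (no persona)",
    "brainstorm creative or open-ended ideas": "high-Openness",
}
task_embs = embedder.encode(list(PERSONA_BY_TASK))

def route_persona(query: str) -> str:
    """Pick the persona whose associated task prototype best matches the query."""
    sims = util.cos_sim(embedder.encode(query), task_embs)[0]
    task = list(PERSONA_BY_TASK)[sims.argmax().item()]
    return PERSONA_BY_TASK[task]

print(route_persona("Prove that the sum of two even numbers is even."))
```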
[82] Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds
Jihoon Jeong
Main category: cs.CL
TL;DR: Analysis of emotion vector geometries across small language models reveals universal emotion representations across mature architectures despite behavioral differences, with RLHF restructuring only immature representations and methodological decomposition of prior comprehension-vs-generation effects.
Details
Motivation: To understand how emotion representations are structured across different small language model architectures, and to investigate whether behavioral differences in models arise from shared underlying emotion representations or distinct representational geometries.
Method: Extracted 21-emotion vector sets from 12 small language models (6 architectures × base/instruct versions, 1B-8B parameters) using a unified comprehension-mode pipeline at fp16 precision, analyzed via representational similarity analysis on raw cosine RDMs (representational dissimilarity matrices).
Result: Five mature architectures share nearly identical 21-emotion geometry (pairwise RDM Spearman correlations 0.74-0.92), with universality persisting across diametrically opposed behavioral profiles. Gemma-3 1B base (immature case) shows extreme residual-stream anisotropy and is restructured by RLHF, while mature families show high within-family base×instruct correlations (rho ≥ 0.92). Methodological decomposition reveals four distinct layers in prior comprehension-vs-generation effects.
Conclusion: Mature language models share universal emotion representations despite behavioral differences, suggesting behavioral facets arise above shared emotion representations. RLHF restructures only representations that are not yet organized. Methodological analysis shows prior comprehension-vs-generation effects decompose into multiple layers requiring careful interpretation.
Abstract: We extract 21-emotion vector sets from twelve small language models (six architectures × base/instruct, 1B-8B parameters) under a unified comprehension-mode pipeline at fp16 precision, and compare the resulting geometries via representational similarity analysis on raw cosine RDMs. The five mature architectures (Qwen 2.5 1.5B, SmolLM2 1.7B, Llama 3.2 3B, Mistral 7B v0.3, Llama 3.1 8B) share nearly identical 21-emotion geometry, with pairwise RDM Spearman correlations of 0.74-0.92. This universality persists across diametrically opposed behavioral profiles: Qwen 2.5 and Llama 3.2 occupy opposite poles of MTI Compliance facets yet produce nearly identical emotion RDMs (rho = 0.81), so behavioral facet differences arise above the shared emotion representation. Gemma-3 1B base, the one immature case in our dataset, exhibits extreme residual-stream anisotropy (0.997) and is restructured by RLHF across all geometric descriptors, whereas the five already-mature families show within-family base × instruct RDM correlations of rho ≥ 0.92 (Mistral 7B v0.3 at rho = 0.985), suggesting RLHF restructures only representations that are not yet organized. Methodologically, we show that what prior work has read as a single comprehension-vs-generation method effect in fact decomposes into four distinct layers – a coarse method-dependent dissociation, robust sub-parameter sensitivity within generation, a true precision (fp16 vs INT8) effect, and a conflated cross-experiment bias that distorts in opposite directions for different models – so that a single rho between two prior emotion-vector studies is not a safe basis for interpretation without the layered decomposition.
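The core RSA computation is standard and easy to reproduce. The sketch below builds a cosine RDM over 21 emotion vectors and compares two models' RDMs by Spearman correlation on the upper triangle, as described above; the random inputs stand in for the extracted per-model vectors.

```python
# Sketch of the representational similarity analysis described above:
# cosine RDMs over 21 emotion vectors, compared via Spearman correlation
# on the upper triangle. Data here is random; real inputs would be the
# extracted per-model emotion vectors.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def cosine_rdm(emotion_vecs: np.ndarray) -> np.ndarray:
    """21 x d emotion vectors -> 21 x 21 representational dissimilarity matrix."""
    return squareform(pdist(emotion_vecs, metric="cosine"))

def rdm_similarity(rdm_a: np.ndarray, rdm_b: np.ndarray) -> float:
    """Spearman rho between the upper-triangle entries of two RDMs."""
    iu = np.triu_indices_from(rdm_a, k=1)
    return spearmanr(rdm_a[iu], rdm_b[iu]).correlation

rng = np.random.default_rng(0)
model_a = rng.normal(size=(21, 2048))  # placeholder residual-stream vectors
model_b = rng.normal(size=(21, 2048))
print(rdm_similarity(cosine_rdm(model_a), cosine_rdm(model_b)))
```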
[83] KS-PRET-5M: A 5 Million Word, 12 Million Token Kashmiri Pretraining Dataset
Haq Nawaz Malik, Nahfid Nissar
Main category: cs.CL
TL;DR: KS-PRET-5M is the largest publicly available pretraining dataset for Kashmiri language with 5.09M words, assembled from archival/literary sources and web content, processed through rigorous cleaning, and released under open license.
Details
Motivation: To address the lack of large-scale pretraining datasets for low-resource languages like Kashmiri, which hinders development of language models and computational linguistic research for this language.
Method: Dataset assembled from two sources: digitized archival/literary material converted from InPage format, and Unicode-native web text. All text processed through an eleven-stage cleaning pipeline to ensure script purity, then tokenized using google/muril-base-cased tokenizer.
Result: Created the KS-PRET-5M dataset with 5.09M words, 27.6M characters, and 295K unique word types, achieving a high Kashmiri script ratio (0.9965) with minimal Devanagari contamination. Tokenization yielded 12.13M subword tokens at a ratio of 2.383 tokens per word.
Conclusion: The dataset enables language model pretraining, tokenizer training, and computational linguistic research for Kashmiri, addressing resource scarcity for this low-resource language.
Abstract: We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 (5.09M) words, 27,692,959 (27.6M) characters, and a vocabulary of 295,433 (295.4K) unique word types. We assembled the dataset from two source classes: digitized archival and literary material, encompassing literature, news, biographies, novels, poetry, religious scholarship, and academic writing, recovered from the proprietary InPage desktop-publishing format using the converter of Malik (2024), and Unicode-native text collected from Kashmiri-language web sources. All text was processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, reducing Devanagari contamination to 146 characters across the full dataset. We tokenized the dataset empirically using google/muril-base-cased, yielding a subword ratio of 2.383 tokens per word and a total of approximately 12.13 million subword tokens, substantially higher than prior estimates derived from non-Kashmiri Perso-Arabic analogues. KS-PRET-5M is released as a single continuous text stream under CC BY 4.0 to support language model pretraining, tokenizer training, and computational linguistic research for Kashmiri.
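The reported subword ratio is straightforward to verify on any Kashmiri text with the named tokenizer; a minimal sketch follows, with a placeholder string where a real Perso-Arabic passage would go.

```python
# Sketch of the reported tokenization statistic: subword tokens per word
# under the google/muril-base-cased tokenizer (the text is a placeholder).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/muril-base-cased")

def tokens_per_word(text: str) -> float:
    words = text.split()
    return len(tok.tokenize(text)) / max(len(words), 1)

sample = "..."  # a Kashmiri (Perso-Arabic script) passage would go here
# Over the full corpus the paper reports ~2.383 tokens/word (~12.13M tokens).
print(tokens_per_word(sample))
```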
[84] BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection
Atharva Gupta, Dhruv Kumar, Yash Sinha
Main category: cs.CL
TL;DR: Two-stage approach combining supervised fine-tuning with DPO refinement for detecting political polarization in social media text, using Qwen 2.5-7B-Instruct with interpretable slot-filling templates.
Details
Motivation: Accurate computational detection of online polarization is challenging due to nuanced rhetoric, implicit framing, and high annotation costs. Recent findings show LLMs can function as strong polarization detectors with contextual prompting.
Method: Two-stage approach: 1) Supervised fine-tuning of Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template (target, claim type, manifestation checklist, justification), 2) DPO refinement with automatically generated preference pairs to reduce false negatives.
Result: DPO refinement improves accuracy and reduces false negatives without extra annotation. On the English development set, recall increased from 0.5085 to 0.7797 and macro-F1 improved by ~5 points.
Conclusion: Preference-based refinement with DPO effectively enhances polarization detection performance in social media text, demonstrating the value of combining structured fine-tuning with preference optimization.
Abstract: The POLAR SemEval-2026 Shared Task aims to detect online polarization and focuses on the classification and identification of multilingual, multicultural, and multi-event polarization. Accurate computational detection of online polarization is challenging due to nuanced rhetoric, implicit framing, and the high cost of human-in-the-loop annotation. Building on recent findings that contextual prompting enables large language models to function as strong polarization detectors, we present a two-stage approach for detecting political polarization in social media text that combines structured supervised fine-tuning with Direct Preference Optimization (DPO) refinement. We fine-tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template (target, claim type, manifestation checklist, and justification). We then apply DPO with automatically generated preference pairs to reduce costly false negatives. Experiments on the SemEval 2026 POLAR shared task dataset show that preference-based refinement improves accuracy and decreases false negatives without extra annotation. On the English development set, DPO increases recall from 0.5085 to 0.7797 and improves macro-F1 by ~5 points.
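One plausible reading of the automatic pair generation, focusing on the false negatives the paper targets, is sketched below; the field names and the sft_predict interface are assumptions, not the authors' code.

```python
# Illustrative sketch of how preference pairs might be auto-generated from
# SFT-model false negatives (data fields and structure are assumptions; the
# paper's exact pair-construction procedure may differ).
def build_dpo_pairs(examples, sft_predict):
    """For items the SFT model wrongly labels non-polarized, prefer the gold
    structured output over the model's own completion."""
    pairs = []
    for ex in examples:
        pred = sft_predict(ex["text"])  # slot-filled template emitted by the model
        if ex["label"] == "polarized" and pred["label"] == "not_polarized":
            pairs.append({
                "prompt": ex["text"],
                "chosen": ex["gold_template"],   # target/claim/manifestations/justification
                "rejected": pred["raw_output"],  # the false-negative completion
            })
    return pairs
```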
[85] DeCoVec: Building Decoding Space based Task Vector for Large Language Models via In-Context Learning
Feiyang Li, Yile Wang
Main category: cs.CL
TL;DR: DeCoVec is a training-free, non-invasive framework that constructs task vectors in decoding space using in-context learning to steer LLMs without weight updates.
Details
Motivation: Existing task vector approaches require fine-tuning or invasive manipulation of internal states, limiting flexibility and scalability. There's a need for training-free, non-invasive methods to steer LLMs effectively.
Method: Constructs task vectors directly in decoding space by capturing the difference between output logit distributions of few-shot and zero-shot prompts, then injects this vector into the decoding process to steer generation.
Result: Outperforms standard few-shot baselines across seven LLMs (0.5B-9B) on TruthfulQA, Math-500, and AQUA-RAT with gains up to +5.50 average accuracy. Effectively suppresses generation degeneration and logical flaws with strong robustness to demonstration ordering.
Conclusion: DeCoVec offers a training-free, non-invasive solution for LLM steering without requiring weight updates or auxiliary models, providing an efficient alternative to existing approaches.
Abstract: Task vectors, representing directions in model or activation spaces that encode task-specific behaviors, have emerged as a promising tool for steering large language models (LLMs). However, existing approaches typically require fine-tuning or invasive manipulation of internal states, limiting their flexibility and scalability. We propose DeCoVec (Decoding Space based Task Vector), a training-free and non-invasive framework that constructs task vectors directly in the decoding space by leveraging in-context learning (ICL). Specifically, DeCoVec captures the task essence as the difference between the output logit distributions of few-shot and zero-shot prompts, then steers generation by injecting this vector into the decoding process. Experiments across seven LLMs (0.5B–9B) on TruthfulQA, Math-500, and AQUA-RAT show that DeCoVec consistently outperforms standard few-shot baselines, with gains up to +5.50 average accuracy. Further analysis demonstrates that DeCoVec effectively suppresses generation degeneration and logical flaws while exhibiting strong robustness to demonstration ordering, all without incurring additional input token costs. Our method offers a training-free and non-invasive solution for LLM steering without requiring weight updates or auxiliary models.
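The logit-difference construction is concrete enough to sketch end to end. Assuming a small HuggingFace causal LM (the model name, prompts, and scaling factor alpha are illustrative, and only a single greedy step is shown, whereas DeCoVec steers the full decoding trajectory):

```python
# Minimal sketch of a decoding-space task vector in the spirit of DeCoVec:
# the vector is the difference between few-shot and zero-shot next-token
# logits, added back during zero-shot decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small model for the demo
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

@torch.no_grad()
def next_logits(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt").input_ids
    return model(ids).logits[0, -1]  # logits for the next token

few_shot = "Q: 2+2? A: 4\nQ: 3+5? A: 8\nQ: 7+6? A:"
zero_shot = "Q: 7+6? A:"
alpha = 1.0  # assumed injection strength

# Task vector: how few-shot demonstrations shift the output distribution.
task_vec = next_logits(few_shot) - next_logits(zero_shot)

# Steered zero-shot decoding (one greedy step shown for brevity).
steered = next_logits(zero_shot) + alpha * task_vec
print(tok.decode(int(steered.argmax())))
```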
[86] How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts
Minh-Vuong Nguyen, Fatemeh Shiri, Zhuang Li, Karin Verspoor
Main category: cs.CL
TL;DR: Benchmark for evaluating LLMs’ clinical numerical reasoning across different note formats, showing models struggle with relational comparison and aggregation tasks while being sensitive to format variations.
Details
Motivation: Current LLM evaluations for clinical numerical reasoning have limited operation-level coverage, focusing mainly on arithmetic computation, and lack assessment of robustness across different clinical note formats, which is critical for safe clinical deployment.
Method: Created ClinicNumRobBench with 1,624 context-question instances evaluating four clinical numeracy types: value retrieval, arithmetic computation, relational comparison, and aggregation. Used MIMIC-IV vital-sign records in three semantically equivalent representations, including real-world note-style variant, with 42 question templates.
Result: Value retrieval was strong (most models >85% accuracy), but relational comparison and aggregation were challenging (some models <15%). Fine-tuning on medical data reduced numeracy by over 30% relative to base models, and performance dropped under note-style variations.
Conclusion: ClinicNumRobBench provides a rigorous testbed for clinically reliable numerical reasoning, revealing LLM sensitivity to format variations and significant challenges in complex numerical reasoning tasks despite strong performance on simple value retrieval.
Abstract: Large Language Models (LLMs) are increasingly being explored for clinical question answering and decision support, yet safe deployment critically requires reliable handling of patient measurements in heterogeneous clinical notes. Existing evaluations of LLMs for clinical numerical reasoning provide limited operation-level coverage, restricted primarily to arithmetic computation, and rarely assess the robustness of numerical understanding across clinical note formats. We introduce ClinicNumRobBench, a benchmark of 1,624 context-question instances with ground-truth answers that evaluates four main types of clinical numeracy: value retrieval, arithmetic computation, relational comparison, and aggregation. To stress-test robustness, ClinicNumRobBench presents longitudinal MIMIC-IV vital-sign records in three semantically equivalent representations, including a real-world note-style variant derived from the Open Patients dataset, and instantiates queries using 42 question templates. Experiments on 14 LLMs show that value retrieval is generally strong, with most models exceeding 85% accuracy, while relational comparison and aggregation remain challenging, with some models scoring below 15%. Fine-tuning on medical data can reduce numeracy relative to base models by over 30%, and performance drops under note-style variation indicate LLM sensitivity to format. ClinicNumRobBench offers a rigorous testbed for clinically reliable numerical reasoning. Code and data are available at https://github.com/MinhVuong2000/ClinicNumRobBench.
[87] SHARE: Social-Humanities AI for Research and Education
João Gonçalves, Sonia de Jager, Petr Knoth, David Pride, Nick Jelicic
Main category: cs.CL
TL;DR: SHARE models are causal language models specifically pretrained for social sciences and humanities, achieving performance close to general models with 100x less data, paired with MIRROR interface that enables text review without generation.
Details
Motivation: To create language models specifically tailored for social sciences and humanities (SSH) disciplines that preserve SSH principles and norms, addressing the gap where general-purpose models may not adequately handle SSH texts or may compromise disciplinary integrity.
Method: Developed SHARE family of causal language models pretrained specifically on SSH texts, and created MIRROR user interface designed for reviewing SSH text inputs without generating new text, maintaining critical engagement.
Result: SHARE models perform close to general-purpose models (Phi-4) that use 100 times more tokens, as demonstrated by custom SSH Cloze benchmark. MIRROR interface successfully enables harnessing model capabilities while preserving SSH integrity.
Conclusion: The SHARE models and MIRROR interface provide a specialized solution for SSH disciplines, demonstrating that domain-specific pretraining can achieve competitive performance with far less data while maintaining disciplinary integrity through non-generative interfaces.
Abstract: This intermediate technical report introduces the SHARE family of base models and the MIRROR user interface. The SHARE models are the first causal language models fully pretrained by and for the social sciences and humanities (SSH). Their performance in modelling SSH texts is close to that of general purpose models (Phi-4) which use 100 times more tokens, as shown by our custom SSH Cloze benchmark. The MIRROR user interface is designed for reviewing text inputs from the SSH disciplines while preserving critical engagement. By prototyping a generative AI interface that does not generate any text, we propose a way to harness the capabilities of the SHARE models without compromising the integrity of SSH principles and norms.
[88] Evaluating Memory Capability in Continuous Lifelog Scenario
Jianjie Zheng, Zhichen Liu, Zhanyu Shen, Jingxiang Qu, Guanhua Chen, Yile Wang, Yang Xu, Yang Liu, Sijie Cheng
Main category: cs.CL
TL;DR: LifeDialBench: A novel benchmark for lifelogging audio conversations with online evaluation protocol, revealing current memory systems underperform simple RAG baselines due to over-designed structures and lossy compression.
Details
Motivation: Wearable devices create opportunities for lifelogging ambient conversations, but existing benchmarks focus on online one-on-one chatting or human-AI interactions, neglecting real-world scenarios. There's a scarcity of public lifelogging audio datasets.
Method: Proposed hierarchical synthesis framework to create LifeDialBench with two subsets: EgoMem (real-world egocentric videos) and LifeMem (simulated virtual community). Introduced Online Evaluation protocol adhering to temporal causality to prevent temporal leakage issues.
Result: Current sophisticated memory systems fail to outperform simple RAG-based baseline. Over-designed structures and lossy compression in current approaches negatively impact performance, highlighting need for high-fidelity context preservation.
Conclusion: LifeDialBench addresses gaps in lifelogging audio benchmarks. Online evaluation protocol is crucial for realistic assessment. Current memory systems need simpler, higher-fidelity approaches for lifelog scenarios rather than complex, lossy architectures.
Abstract: Nowadays, wearable devices can continuously lifelog ambient conversations, creating substantial opportunities for memory systems. However, existing benchmarks primarily focus on online one-on-one chatting or human-AI interactions, thus neglecting the unique demands of real-world scenarios. Given the scarcity of public lifelogging audio datasets, we propose a hierarchical synthesis framework to curate LifeDialBench, a novel benchmark comprising two complementary subsets: EgoMem, built on real-world egocentric videos, and LifeMem, constructed using a simulated virtual community. Crucially, to address the issue of temporal leakage in traditional offline settings, we propose an Online Evaluation protocol that strictly adheres to temporal causality, ensuring systems are evaluated in a realistic streaming fashion. Our experimental results reveal a counterintuitive finding: current sophisticated memory systems fail to outperform a simple RAG-based baseline. This highlights the detrimental impact of over-designed structures and lossy compression in current approaches, emphasizing the necessity of high-fidelity context preservation for lifelog scenarios. We release our code and data at https://github.com/qys77714/LifeDialBench.
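The online protocol reduces to a simple causality constraint: a question asked at time t may only consult memory written before t. A minimal sketch, with invented data structures and a hypothetical system interface:

```python
# Sketch of the online-evaluation idea: enforce temporal causality so a
# query asked at time t can only be answered from entries logged strictly
# before t, preventing temporal leakage. Structures are illustrative.
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    timestamp: float
    text: str

def visible_memory(log: list[MemoryEntry], query_time: float) -> list[MemoryEntry]:
    """Only past entries are retrievable at query time."""
    return [e for e in log if e.timestamp < query_time]

def evaluate_online(system, log, queries):
    """queries: list of (query_time, question, gold_answer) in stream order."""
    correct = 0
    for t, question, gold in sorted(queries):
        answer = system.answer(question, visible_memory(log, t))
        correct += int(answer == gold)
    return correct / len(queries)
```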
[89] MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis
Zixiong Yu, Jun Rao, Guhan Chen, Songtao Tian, Bohan Li, Jiansheng Wei, Min Zhang, Xiaojun Meng
Main category: cs.CL
TL;DR: Hierarchical framework for synthesizing mathematical reasoning data using Legislator-Executor paradigm that separates constraint graph optimization from semantic instantiation, outperforming existing datasets with just 1K samples.
Details
Motivation: Current methods for mathematical reasoning data synthesis rely on seed mutation or simple prompts, suffering from mode collapse and limited logical complexity. There's a need for high-quality data synthesis without human priors.
Method: Proposes hierarchical synthesis framework treating data synthesis as unsupervised optimization over constraint graphs. Uses Legislator-Executor paradigm: Legislator evolves structured generation blueprints encoding constraints, while Executor instantiates specifications into diverse natural language scenarios.
Result: Models fine-tuned on 1K synthesized samples outperform widely-used datasets (LIMO, s1K) across eight mathematical benchmarks, showing superior out-of-distribution generalization. Tested on 10 models across Qwen, Llama, Mistral, and Gemma series.
Conclusion: The decoupling of skeleton design from linguistic realization enables focus on constructing complex logical structures, leading to high-quality data synthesis that significantly improves mathematical reasoning capabilities.
Abstract: Synthesizing high-quality mathematical reasoning data without human priors remains a significant challenge. Current approaches typically rely on seed data mutation or simple prompt engineering, often suffering from mode collapse and limited logical complexity. This paper proposes a hierarchical synthesis framework that formulates data synthesis as an unsupervised optimization problem over a constraint graph followed by semantic instantiation, rather than treating it as a direct text generation task. We introduce a Legislator-Executor paradigm: The Legislator adversarially evolves structured generation blueprints encoding the constraints of the problem, while the Executor instantiates these specifications into diverse natural language scenarios. This decoupling of skeleton design from linguistic realization enables a prioritized focus on constructing complex and diverse logical structures, thereby guiding high-quality data synthesis. Experiments conducted on a total of 10 models across the Qwen, Llama, Mistral, and Gemma series demonstrate that our method achieves notable results: models fine-tuned on 1K synthesized samples outperform widely-used datasets of comparable scale (LIMO, s1K) across eight mathematical benchmarks, exhibiting superior out-of-distribution generalization.
[90] TRACE: An Experiential Framework for Coherent Multi-hop Knowledge Graph Question Answering
Yingxu Wang, Jiaxin Huang, Mengzhu Wang, Nan Yin
Main category: cs.CL
TL;DR: TRACE is an experiential framework for multi-hop KGQA that unifies LLM-driven contextual reasoning with exploration prior integration to enhance reasoning coherence and robustness.
Details
Motivation: Existing multi-hop KGQA methods treat reasoning steps independently and fail to leverage prior exploration experience, leading to fragmented reasoning and redundant exploration.
Method: TRACE dynamically translates reasoning paths into natural language narratives for semantic continuity, abstracts prior trajectories into reusable experiential priors, and uses dual-feedback re-ranking to integrate contextual narratives with exploration priors for relation selection.
Result: Extensive experiments on multiple KGQA benchmarks demonstrate that TRACE consistently outperforms state-of-the-art baselines.
Conclusion: TRACE effectively enhances multi-hop KGQA by maintaining reasoning coherence through contextual narratives and leveraging exploration experience through reusable priors.
Abstract: Multi-hop Knowledge Graph Question Answering (KGQA) requires coherent reasoning across relational paths, yet existing methods often treat each reasoning step independently and fail to effectively leverage experience from prior explorations, leading to fragmented reasoning and redundant exploration. To address these challenges, we propose Trajectory-aware Reasoning with Adaptive Context and Exploration priors (TRACE), an experiential framework that unifies LLM-driven contextual reasoning with exploration prior integration to enhance the coherence and robustness of multi-hop KGQA. Specifically, TRACE dynamically translates evolving reasoning paths into natural language narratives to maintain semantic continuity, while abstracting prior exploration trajectories into reusable experiential priors that capture recurring exploration patterns. A dual-feedback re-ranking mechanism further integrates contextual narratives with exploration priors to guide relation selection during reasoning. Extensive experiments on multiple KGQA benchmarks demonstrate that TRACE consistently outperforms state-of-the-art baselines.
[91] CocoaBench: Evaluating Unified Digital Agents in the Wild
CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu
Main category: cs.CL
TL;DR: CocoaBench is a benchmark for unified digital agents that require composition of vision, search, and coding capabilities, with current agents achieving only 45.1% success rate.
Details
Motivation: Current LLM agent evaluations test capabilities in isolation, leaving a gap for evaluating agents that need to combine different capabilities like vision, search, and coding in unified systems.
Method: Introduces CocoaBench with human-designed, long-horizon tasks requiring flexible composition of vision, search, and coding, specified only by instruction and automatic evaluation function. Also presents CocoaAgent as a lightweight shared scaffold for controlled comparison.
Result: Current agents perform poorly on CocoaBench with best system achieving only 45.1% success rate, showing substantial room for improvement in reasoning/planning, tool use/execution, and visual grounding.
Conclusion: The benchmark reveals significant gaps in current unified digital agents’ ability to combine multiple capabilities, highlighting needs for better reasoning, planning, tool use, and visual grounding.
Abstract: LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
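The task format, an instruction plus an automatic checker over the final output, can be captured in a few lines; the toy task and agent interface below are invented for illustration.

```python
# Sketch of the task format described above: each task is only an
# instruction plus an automatic evaluation function on the final output.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    instruction: str
    evaluate: Callable[[str], bool]  # automatic check on the final output

task = Task(
    instruction="Compute 17 * 23 and report the result as an integer.",
    evaluate=lambda output: "391" in output,  # a toy checker (17 * 23 = 391)
)

def run_benchmark(agent, tasks: list[Task]) -> float:
    """Success rate over the benchmark, e.g. 0.451 for the best system."""
    results = [t.evaluate(agent.run(t.instruction)) for t in tasks]
    return sum(results) / len(results)
```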
[92] Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method
Tianzhe Zhao, Jiaoyan Chen, Shuxiu Zhang, Haiping Zhu, Qika Lin, Jun Liu
Main category: cs.CL
TL;DR: ConflictQA benchmark tests LLMs on cross-source conflicts between textual and knowledge graph evidence, revealing LLMs struggle with heterogeneous conflicting evidence and tend to rely exclusively on one source.
Details
Motivation: Existing RAG research focuses on conflicts between external knowledge and LLMs' parametric knowledge, but overlooks conflicts across different external knowledge sources, particularly between unstructured text and structured knowledge graphs.
Method: Introduces ConflictQA benchmark that systematically creates conflicts between textual evidence and KG evidence, evaluates LLMs on this benchmark, and proposes XoT framework - a two-stage explanation-based thinking approach for reasoning over heterogeneous conflicting evidence.
Result: LLMs struggle with cross-source conflicts, become sensitive to prompting choices, tend to rely exclusively on either KG or textual evidence, leading to incorrect responses. XoT framework shows effectiveness in handling such conflicts.
Conclusion: Cross-source knowledge conflicts present significant challenges for LLMs in RAG systems, requiring specialized approaches like XoT for faithful reasoning over heterogeneous evidence.
Abstract: Large language models (LLMs) have achieved remarkable success across a wide range of applications especially when augmented by external knowledge through retrieval-augmented generation (RAG). Despite their widespread adoption, recent studies have shown that LLMs often struggle to perform faithful reasoning when conflicting knowledge is retrieved. However, existing work primarily focuses on conflicts between external knowledge and the parametric knowledge of LLMs, leaving conflicts across external knowledge largely unexplored. Meanwhile, modern RAG systems increasingly emphasize the integration of unstructured text and (semi-)structured data like knowledge graphs (KGs) to improve knowledge completeness and reasoning faithfulness. To address this gap, we introduce ConflictQA, a novel benchmark that systematically instantiates conflicts between textual evidence and KG evidence. Extensive evaluations across representative LLMs reveal that, facing such cross-source conflicts, LLMs often fail to identify reliable evidence for correct reasoning. Instead, LLMs become more sensitive to prompting choices and tend to rely exclusively on either KG or textual evidence, resulting in incorrect responses. Based on these findings, we further propose XoT, a two-stage explanation-based thinking framework tailored for reasoning over heterogeneous conflicting evidence, and verify its effectiveness with extensive experiments.
[93] HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning
Yangfan Wang, Tianyang Sun, Chen Tang, Jie Liu, Wei Cai, Jingchi Jiang
Main category: cs.CL
TL;DR: HiEdit: Hierarchical reinforcement learning framework for lifelong model editing that dynamically selects knowledge-relevant layers per editing instance, improving precision and reducing side effects.
Details
Motivation: Existing lifelong model editing approaches apply parameter perturbations to static, dense sets of LLM layers for all edits, which is counter-intuitive since different knowledge pieces are stored in distinct layers. This leads to poor adaptability for new knowledge integration and catastrophic forgetting of both general and previously edited knowledge.
Method: Proposes HiEdit, a hierarchical reinforcement learning framework that adaptively identifies the most knowledge-relevant layers for each editing instance. It enables dynamic, instance-aware layer selection and incorporates an intrinsic reward for sparsity to achieve precise, localized updates.
Result: Experiments on various LLMs show HiEdit boosts the performance of competitive RLEdit by an average of 8.48% while perturbing only half of the layers per edit.
Conclusion: HiEdit provides an effective framework for lifelong model editing by enabling dynamic layer selection, improving editing precision while minimizing side effects and catastrophic forgetting.
Abstract: Lifelong model editing (LME) aims to sequentially rectify outdated or inaccurate knowledge in deployed LLMs while minimizing side effects on unrelated inputs. However, existing approaches typically apply parameter perturbations to a static and dense set of LLM layers for all editing instances. This practice is counter-intuitive, as we hypothesize that different pieces of knowledge are stored in distinct layers of the model. Neglecting this layer-wise specificity can impede adaptability in integrating new knowledge and result in catastrophic forgetting for both general and previously edited knowledge. To address this, we propose HiEdit, a hierarchical reinforcement learning framework that adaptively identifies the most knowledge-relevant layers for each editing instance. By enabling dynamic, instance-aware layer selection and incorporating an intrinsic reward for sparsity, HiEdit achieves precise, localized updates. Experiments on various LLMs show that HiEdit boosts the performance of the competitive RLEdit by an average of 8.48% while perturbing only half of the layers per edit. Our code is available at: https://github.com/yangfanww/hiedit.
[94] RUMLEM: A Dictionary-Based Lemmatizer for Romansh
Dominic P. Fischer, Zachary Hopton, Jannis Vamvas
Main category: cs.CL
TL;DR: RUMLEM is a lemmatizer for Romansh language varieties that achieves 77-84% coverage and can identify language varieties with 95% accuracy.
Details
Motivation: Lemmatization is crucial for NLP applications, but Romansh (a minority language with multiple varieties) lacks comprehensive lemmatization tools. The authors aim to create a lemmatizer that covers all main Romansh varieties and can also perform variety-aware language classification.
Method: RUMLEM is based on comprehensive, community-driven morphological databases for Romansh. It covers five main varieties plus the supra-regional standard Rumantsch Grischun. The system uses these databases to map inflected word forms to dictionary forms and identify language varieties.
Result: RUMLEM covers 77-84% of words in typical Romansh texts. Evaluation on 30,000 texts shows 95% accuracy in variety identification. A proof of concept demonstrates feasibility of Romansh vs. non-Romansh language classification.
Conclusion: RUMLEM provides effective lemmatization for Romansh varieties and demonstrates additional utility in language variety classification, addressing needs for minority language NLP tools.
Abstract: Lemmatization – the task of mapping an inflected word form to its dictionary form – is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77-84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30'000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.
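A dictionary-based lemmatizer of this kind is conceptually simple: one form-to-lemma table per variety, with variety classification falling out of per-table coverage. The toy sketch below uses invented entries and variety keys, not RUMLEM's actual databases.

```python
# Toy sketch of a dictionary-based lemmatizer with variety-aware
# classification: a text's variety is the one whose form->lemma table
# covers the most tokens. All lexicon entries here are invented.
LEXICONS = {
    "sursilvan":   {"casas": "casa", "va": "ir"},
    "vallader":    {"chasas": "chasa", "va": "ir"},
    "rumgrischun": {"chasas": "chasa", "vai": "ir"},
}

def lemmatize(token: str, variety: str) -> str | None:
    """Map an inflected form to its dictionary form; None if uncovered."""
    return LEXICONS[variety].get(token.lower())

def classify_variety(tokens: list[str]) -> str:
    """Pick the variety whose database covers the largest share of tokens."""
    coverage = {
        v: sum(t.lower() in lex for t in tokens) / len(tokens)
        for v, lex in LEXICONS.items()
    }
    return max(coverage, key=coverage.get)

text = "chasas va".split()
variety = classify_variety(text)
print(variety, [lemmatize(t, variety) for t in text])
```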
[95] Judge Like Human Examiners: A Weighted Importance Multi-Point Evaluation Framework for Generative Tasks with Long-form Answers
Guoxin Yu, Chulun Zhou, Lemao Liu, Qi Wang, Mo Yu, Jialong Tang, Baosong Yang, Xiang Ao, Wai Lam, Yue Yu
Main category: cs.CL
TL;DR: WIMPE is a framework for evaluating long-form generative model responses by decomposing reference answers into weighted context-bound scoring points and measuring alignment and contradiction with two complementary metrics.
Details
Motivation: Current evaluation methods for long-form generative tasks struggle to assess whether responses are genuinely grounded in provided contexts and fail to capture the heterogeneous importance of different aspects of reference answers.
Method: Proposes Weighted Importance Multi-Point Evaluation (WIMPE) framework that factorizes each reference answer into weighted context-bound scoring points. Uses two metrics: Weighted Point-wise Alignment (WPA) to measure alignment and Point-wise Conflict Penalty (PCP) to measure contradiction between model responses and reference answers.
Result: Extensive experiments on 10 generative tasks demonstrate that WIMPE achieves higher correlations with human annotations compared to existing evaluation methods.
Conclusion: WIMPE provides a more effective framework for evaluating long-form generative model responses by addressing limitations of current methods through weighted context-bound scoring points and complementary alignment/contradiction metrics.
Abstract: Evaluating the quality of model responses remains challenging in generative tasks with long-form answers, as the expected answers usually contain multiple semantically distinct yet complementary factors that should be factorized for fine-grained assessment. Recent evaluation methods resort to relying on either task-level rubrics or question-aware checklists. However, they still 1) struggle to assess whether a response is genuinely grounded in provided contexts; 2) fail to capture the heterogeneous importance of different aspects of reference answers. Inspired by human examiners, we propose a Weighted Importance Multi-Point Evaluation (WIMPE) framework, which factorizes each reference answer into weighted context-bound scoring points. Two complementary metrics, namely Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), are designed to measure the alignment and contradiction between model responses and reference answers. Extensive experiments on 10 generative tasks demonstrate that WIMPE achieves higher correlations with human annotations.
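Given per-point judge verdicts, the two metrics plausibly reduce to weight-normalized sums; the sketch below is one such formalization under that assumption, with the judging step stubbed out (the paper's exact definitions of WPA and PCP may differ).

```python
# Assumed formalization of the two WIMPE metrics: WPA is the
# weight-normalized alignment and PCP the weight-normalized contradiction
# penalty over the reference answer's scoring points.
def wimpe_scores(points: list[dict]) -> tuple[float, float]:
    """points: [{"weight": w, "verdict": "aligned"|"contradicted"|"missing"}]"""
    total = sum(p["weight"] for p in points)
    wpa = sum(p["weight"] for p in points if p["verdict"] == "aligned") / total
    pcp = sum(p["weight"] for p in points if p["verdict"] == "contradicted") / total
    return wpa, pcp

judged = [
    {"weight": 3.0, "verdict": "aligned"},       # key scoring point, covered
    {"weight": 1.0, "verdict": "missing"},       # minor point, omitted
    {"weight": 2.0, "verdict": "contradicted"},  # response conflicts with reference
]
wpa, pcp = wimpe_scores(judged)
print(f"WPA={wpa:.2f}, PCP={pcp:.2f}")  # WPA=0.50, PCP=0.33
```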
[96] Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate
Zhixiang Lu, Jionglong Su
Main category: cs.CL
TL;DR: Dialectic-Med is a multi-agent framework for medical MLLMs that uses adversarial dialectics between specialized agents to reduce confirmation bias and hallucinations in diagnostic reasoning.
Details
Motivation: MLLMs in healthcare suffer from severe confirmation bias, hallucinating visual details to support potentially erroneous diagnostic hypotheses. Existing CoT approaches lack intrinsic correction mechanisms and are vulnerable to error propagation.
Method: Proposes Dialectic-Med with three role-specialized agents: Proponent formulates diagnostic hypotheses, Opponent uses visual falsification module to retrieve contradictory evidence, and Mediator resolves conflicts via weighted consensus graph. Enforces diagnostic rigor through adversarial dialectics.
Result: Achieves state-of-the-art performance on MIMIC-CXR-VQA, VQA-RAD, and PathVQA. Fundamentally enhances trustworthiness of reasoning process, improves explanation faithfulness, and decisively mitigates hallucinations compared to single-agent baselines.
Conclusion: Dialectic-Med establishes a new standard for trustworthy medical MLLMs by explicitly modeling cognitive falsification processes, ensuring diagnostic reasoning is tightly grounded in verified visual regions through adversarial dialectics.
Abstract: Multimodal Large Language Models (MLLMs) in healthcare suffer from severe confirmation bias, often hallucinating visual details to support initial, potentially erroneous diagnostic hypotheses. Existing Chain-of-Thought (CoT) approaches lack intrinsic correction mechanisms, rendering them vulnerable to error propagation. To bridge this gap, we propose Dialectic-Med, a multi-agent framework that enforces diagnostic rigor through adversarial dialectics. Unlike static consensus models, Dialectic-Med orchestrates a dynamic interplay between three role-specialized agents: a proponent that formulates diagnostic hypotheses; an opponent equipped with a novel visual falsification module that actively retrieves contradictory visual evidence to challenge the proponent; and a mediator that resolves conflicts via a weighted consensus graph. By explicitly modeling the cognitive process of falsification, our framework guarantees that diagnostic reasoning is tightly grounded in verified visual regions. Empirical evaluations on MIMIC-CXR-VQA, VQA-RAD, and PathVQA demonstrate that Dialectic-Med not only achieves state-of-the-art performance but also fundamentally enhances the trustworthiness of the reasoning process. Beyond accuracy, our approach significantly enhances explanation faithfulness and decisively mitigates hallucinations, establishing a new standard over single-agent baselines.
[97] Transactional Attention: Semantic Sponsorship for KV-Cache Retention
Abhinaba Basu
Main category: cs.CL
TL;DR: Transactional Attention (TA) introduces a sponsorship mechanism where structural anchor patterns protect adjacent value-bearing tokens from KV-cache eviction, solving the dormant token problem in KV-cache compression.
Details
Motivation: Existing KV-cache compression methods fail on credential retrieval because they can't handle dormant tokens - credentials, API keys, and configuration values that receive near-zero attention during encoding but become essential at generation time. These tokens lack the statistical signals that eviction policies rely on.
Method: Transactional Attention (TA) uses a sponsorship mechanism where structural anchor patterns (e.g., “key:”, “password:”) protect adjacent value-bearing tokens from eviction. TA-Fast is an attention-free variant that reduces memory overhead and is compatible with SDPA and FlashAttention.
Result: TA achieves 100% credential retrieval at K=16 tokens where six baselines achieve 0%, and sustains 100% accuracy across 200 function-calling trials. TA-Fast reduces memory overhead by 52% with less than 1% latency overhead.
Conclusion: Transactional Attention solves the dormant token problem in KV-cache compression through structural sponsorship, is orthogonal to existing methods, and maintains high accuracy with minimal overhead.
Abstract: At K=16 tokens (0.4% of a 4K context), every existing KV-cache compression method achieves 0% on credential retrieval. The failure mode is dormant tokens: credentials, API keys, and configuration values that receive near-zero attention but become essential at generation time. Because these tokens lack the statistical signals that eviction policies rely on, no method based on attention scores, reconstruction loss, or learned retention gates retains them. We introduce Transactional Attention (TA), a sponsorship mechanism in which structural anchor patterns (e.g., “key:”, “password:”) protect adjacent value-bearing tokens from eviction. TA achieves 100% credential retrieval at K=16 where six baselines (H2O, TOVA, SnapKV, StreamingLLM, PyramidKV, DynamicKV) achieve 0%, and sustains 100% accuracy across 200 function-calling trials. TA-Fast, an attention-free variant, reduces memory overhead by 52% and is compatible with SDPA and FlashAttention. TA is orthogonal to existing compression methods and adds less than 1% latency overhead.
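The sponsorship rule is easy to illustrate at the string level; real TA operates on KV-cache entries inside the attention stack, and the anchor list and sponsorship window below are assumptions.

```python
# Toy sketch of the sponsorship idea: tokens following a structural anchor
# pattern are marked protected, and the eviction policy may only drop
# unprotected positions, even if their attention scores are near zero.
import re

ANCHORS = re.compile(r"(?:key|password|token|secret)\s*:", re.IGNORECASE)
SPONSOR_WINDOW = 5  # assumed number of value tokens an anchor protects

def protected_positions(tokens: list[str]) -> set[int]:
    protected = set()
    for i, t in enumerate(tokens):
        if ANCHORS.search(t):
            # Sponsor the anchor itself plus the adjacent value-bearing tokens.
            protected.update(range(i, min(i + 1 + SPONSOR_WINDOW, len(tokens))))
    return protected

def evict(tokens: list[str], scores: list[float], keep_k: int) -> list[int]:
    """Keep the top-k positions by importance score, but never evict
    sponsored tokens (the dormant-token fix)."""
    keep = protected_positions(tokens)
    for i in sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True):
        if len(keep) >= keep_k:
            break
        keep.add(i)
    return sorted(keep)
```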
[98] Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation
Lester James V. Miranda, Ivan Vulić, Anna Korhonen
Main category: cs.CL
TL;DR: Systematic study of what makes effective multilingual teachers for synthetic SFT data generation, finding model scale alone doesn’t predict effectiveness; data quality metrics like prompt diversity and response fluency matter more.
Details
Motivation: Current practice of selecting largest available models as teachers for multilingual synthetic data generation is ad hoc and may lead to poor-quality data due to capability gaps in non-English languages, resulting in suboptimal student performance.
Method: Evaluated 10 language models across 6 typologically diverse languages, generated over 1.4M SFT examples, trained 240 student models, and introduced Polyglot Score metric to measure teacher effectiveness by correlating intrinsic data quality with extrinsic student performance.
Result: Gemma 3 27B and Aya Expanse 32B emerged as consistently effective teachers; model scale alone doesn’t significantly predict teacher effectiveness; data quality metrics (prompt diversity, length, response fluency) capture over 93.3% of variance in intrinsic data quality and predict student performance.
Conclusion: Practical recommendations include matching teacher-student model families and using translation techniques for less-resourced languages; the work advances data-centric research in multilingual synthetic data and LM development.
Abstract: Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc, typically defaulting to the largest available option, even though such models may have significant capability gaps in non-English languages. This practice can result in poor-quality synthetic data and suboptimal student downstream performance. In this work, we systematically characterize what makes an effective multilingual teacher. We combine intrinsic measures of data quality with extrinsic student model performance in a metric we call Polyglot Score, evaluating 10 LMs across 6 typologically diverse languages, generating over 1.4M SFT examples, and training 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers across different student base model families. Further analyses reveal that model scale alone does not significantly predict teacher effectiveness; instead, data qualities such as prompt diversity, length, and response fluency capture over 93.3% of variance in intrinsic data quality and predict student performance. Finally, we provide practical recommendations, including matching the model families of teacher-student pairs and translating from or responding to existing prompts, which can yield improvements for less-resourced languages. We hope that our work advances data-centric research in multilingual synthetic data and LM development.
[99] Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning
Rui Song, Lida Shi, Ruihua Qi, Yingji Li, Hao Xu
Main category: cs.CL
TL;DR: A comprehensive benchmark (11 tasks, 130k+ instances) for evaluating MLLMs on ancient Chinese script evolution analysis, revealing limitations in glyph-level comparison and core tasks, leading to a glyph-driven fine-tuning framework (GEVO) that improves performance even for small-scale models.
Details
Motivation: The evolution of written characters is fundamental for understanding cultural transformation and historical continuity, but how MLLMs can be systematically leveraged for text evolution analysis remains an open and underexplored problem, especially for ancient Chinese scripts.
Method: Constructed a comprehensive benchmark with 11 tasks and over 130,000 instances to evaluate MLLM capabilities in ancient Chinese script evolution analysis. Proposed GEVO, a glyph-driven fine-tuning framework that explicitly encourages models to capture evolutionary consistency in glyph transformations.
Result: Existing MLLMs demonstrate limited ability in glyph-level comparison and substantially constrained performance on core tasks like character recognition and evolutionary reasoning. GEVO enables even 2B-scale models to achieve consistent and comprehensive performance improvements across all evaluated tasks.
Conclusion: The paper addresses a significant gap in applying MLLMs to ancient script analysis, provides a valuable benchmark, and demonstrates that targeted fine-tuning approaches like GEVO can significantly enhance MLLM capabilities for text evolution understanding tasks.
Abstract: In recent years, rapid advances in Multimodal Large Language Models (MLLMs) have increasingly stimulated research on ancient Chinese scripts. As the evolution of written characters constitutes a fundamental pathway for understanding cultural transformation and historical continuity, how MLLMs can be systematically leveraged to support and advance text evolution analysis remains an open and largely underexplored problem. To bridge this gap, we construct a comprehensive benchmark comprising 11 tasks and over 130,000 instances, specifically designed to evaluate the capability of MLLMs in analyzing the evolution of ancient Chinese scripts. We conduct extensive evaluations across multiple widely used MLLMs and observe that, while existing models demonstrate a limited ability in glyph-level comparison, their performance on core tasks, such as character recognition and evolutionary reasoning, remains substantially constrained. Motivated by these findings, we propose a glyph-driven fine-tuning framework (GEVO) that explicitly encourages models to capture evolutionary consistency in glyph transformations and enhances their understanding of text evolution. Experimental results show that even models at the 2B scale achieve consistent and comprehensive performance improvements across all evaluated tasks. To facilitate future research, we publicly release both the benchmark and the trained models at https://github.com/songruiecho/GEVO.
[100] Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations
Yilong Liu, Xixun Lin, Pengfei Cao, Ge Zhang, Fang Fang, Yanan Cao
Main category: cs.CL
TL;DR: LLMs have a structural alignment bias where they tend to invoke tools when query attributes match tool parameters, even when the tool is irrelevant to the user’s goal.
Details
Motivation: LLMs often face irrelevant tools in practice and should refrain from invoking them, but current models show systematic biases in tool refusal that need investigation.
Method: Introduces SABEval dataset to decouple structural alignment from semantic relevance, uses Contrastive Attention Attribution to analyze internal mechanisms, and proposes a rebalancing strategy to mitigate the bias.
Result: Structural alignment bias causes severe tool-invocation errors in LLMs, with two competing pathways (semantic checking vs structural matching) driving decisions. The rebalancing strategy effectively reduces bias without harming general tool-use capabilities.
Conclusion: Structural alignment bias is a widespread mechanistic flaw in LLM tool refusal that requires systematic evaluation and mitigation strategies to improve reliable tool use.
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in utilizing external tools. In practice, however, LLMs are often exposed to tools that are irrelevant to the user’s query, in which case the desired behavior is to refrain from invocations. In this work, we identify a widespread yet overlooked mechanistic flaw in tool refusal, which we term structural alignment bias: Even when a tool fails to serve the user’s goal, LLMs still tend to invoke it whenever query attributes can be validly assigned to tool parameters. To systematically study this bias, we introduce SABEval, a new dataset that decouples structural alignment from semantic relevance. Our analysis shows that structural alignment bias induces severe tool-invocation errors in LLMs, yet remains largely unaccounted for in existing evaluations. To investigate the internal mechanisms underlying this bias, we propose Contrastive Attention Attribution, which reveals two competing pathways for semantic checking and structural matching. The relative strength of these pathways drives LLMs’ tool invocation decisions. Based on these findings, we further introduce a rebalancing strategy that effectively mitigates structural alignment bias, as demonstrated by extensive experiments, without degrading general tool-use capabilities.
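The bias is easiest to see by separating the two signals. In the invented example below, a query's attributes validly fill a flight tool's parameter slots (structural alignment) even though the tool cannot serve the stated goal (semantic irrelevance), which is exactly the case where a biased model still invokes the tool.

```python
# Illustrative decomposition of the two signals behind the bias: structural
# alignment (can query attributes fill the tool's parameter slots?) versus
# semantic relevance (does the tool serve the goal?). Example is invented.
def structurally_aligned(query_attrs: dict, tool_params: set[str]) -> bool:
    """True when every tool parameter can be filled from the query."""
    return tool_params <= set(query_attrs)

query = {"city": "Paris", "date": "2025-06-01",
         "goal": "find a vegetarian restaurant"}
flight_tool_params = {"city", "date"}  # a flight-booking tool

# Structurally aligned (city/date fill the slots) yet semantically
# irrelevant to the goal; the desired behavior is to refuse invocation.
print(structurally_aligned(query, flight_tool_params))  # True
```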
[101] Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning
Bo Li, Mingda Wang, Gexiang Fang, Shikun Zhang, Wei Ye
Main category: cs.CL
TL;DR: GRIP integrates retrieval decisions directly into token-level decoding, enabling end-to-end coordination between retrieval and generation without external controllers.
Details
Motivation: Current RAG approaches treat retrieval as an external intervention, lacking tight coordination between retrieval and generation. The authors aim to embed retrieval control directly into the generation process for more seamless integration.
Method: Proposes GRIP framework where the model regulates retrieval behavior through control-token emission. Uses Self-Triggered Information Planning to decide when to retrieve, how to reformulate queries, and when to terminate within a single autoregressive trajectory.
Result: Experiments on five QA benchmarks show GRIP surpasses strong RAG baselines and is competitive with GPT-4o while using substantially fewer parameters.
Conclusion: GRIP demonstrates that embedding retrieval control directly into generation enables more effective coordination between retrieval and reasoning, supporting dynamic multi-step inference with on-the-fly evidence integration.
Abstract: We revisit retrieval-augmented generation (RAG) by embedding retrieval control directly into generation. Instead of treating retrieval as an external intervention, we express retrieval decisions within token-level decoding, enabling end-to-end coordination without additional controllers or classifiers. Under the paradigm of Retrieval as Generation, we propose GRIP (Generation-guided Retrieval with Information Planning), a unified framework in which the model regulates retrieval behavior through control-token emission. Central to GRIP is Self-Triggered Information Planning, which allows the model to decide when to retrieve, how to reformulate queries, and when to terminate, all within a single autoregressive trajectory. This design tightly couples retrieval and reasoning and supports dynamic multi-step inference with on-the-fly evidence integration. To supervise these behaviors, we construct a structured training set covering answerable, partially answerable, and multi-hop queries, each aligned with specific token patterns. Experiments on five QA benchmarks show that GRIP surpasses strong RAG baselines and is competitive with GPT-4o while using substantially fewer parameters.
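A pseudocode-style sketch of control-token decoding follows; the token names, the generate_until_control interface, and the retriever are all assumed for illustration, since the paper's concrete token inventory is not given here.

```python
# Sketch of control-token decoding: the model itself emits special tokens
# that trigger retrieval or termination inside one autoregressive pass.
RETRIEVE, DONE = "<retrieve>", "<done>"  # assumed control tokens

def grip_decode(model, retriever, question: str, max_steps: int = 8) -> str:
    context = question
    for _ in range(max_steps):
        segment = model.generate_until_control(context)  # stops at a control token
        context += segment.text
        if segment.control == RETRIEVE:
            # The model also generated a reformulated query before the token.
            evidence = retriever.search(segment.query)
            context += f"\n[evidence] {evidence}\n"  # on-the-fly integration
        elif segment.control == DONE:
            break
    return context
```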
[102] Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
Kuang Wang, Lai Wei, Qibing Bai, Ping Lin, Wenkai Fang, Feng Jiang, Zhongjie Jiang, Jun Huang, Yannan Wang, Haizhou Li
Main category: cs.CL
TL;DR: SA-SLM addresses the semantic understanding-acoustic realization gap in Speech Language Models by making models aware of their expressive intent during generation and aligning acoustic outputs with that intent through self-critique.
Details
Motivation: Speech Language Models have strong semantic understanding but generate flat speech that fails to convey expressive intent, creating a gap between what the model understands and what it acoustically produces.
Method: Two core approaches: (1) Intent-Aware Bridging using Variational Information Bottleneck to translate internal semantics into smooth expressive intent, and (2) Realization-Aware Alignment where the model acts as its own critic to verify acoustic outputs match intended expression via rubric-based feedback.
Result: Trained on only 800 hours of expressive speech data, the 3B parameter SA-SLM surpasses all open-source baselines and comes within 0.08 points of GPT-4o-Audio in overall expressiveness on the EchoMind benchmark.
Conclusion: Making speech models self-aware of their intent during generation and realization during training effectively bridges the semantic-acoustic gap, enabling more expressive speech generation with limited data.
Abstract: Speech Language Models (SLMs) exhibit strong semantic understanding, yet their generated speech often sounds flat and fails to convey expressive intent, undermining user engagement. We term this mismatch the semantic understanding-acoustic realization gap. We attribute this gap to two key deficiencies: (1) intent transmission failure, where SLMs fail to provide the stable utterance-level intent needed for expressive delivery; and (2) realization-unaware training, where no feedback signal verifies whether acoustic outputs faithfully reflect intended expression. To address these issues, we propose SA-SLM (Self-Aware Speech Language Model), built on the principle that the model should be aware of what it thinks during generation and how it speaks during training. SA-SLM addresses this gap through two core contributions: (1) Intent-Aware Bridging, which uses a Variational Information Bottleneck (VIB) objective to translate the model’s internal semantics into temporally smooth expressive intent, making speech generation aware of what the model intends to express; and (2) Realization-Aware Alignment, which repurposes the model as its own critic to verify and align acoustic realization with intended expressive intent via rubric-based feedback. Trained on only 800 hours of expressive speech data, our 3B parameter SA-SLM surpasses all open-source baselines and comes within 0.08 points of GPT-4o-Audio in overall expressiveness on the EchoMind benchmark.
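For readers unfamiliar with the VIB component, the sketch below shows a generic variational-information-bottleneck head in PyTorch: it compresses hidden semantics into a sampled intent vector and penalizes its KL divergence from a standard normal prior. The module, dimensions, and loss form are assumptions, not the paper's implementation.

```python
# Rough PyTorch sketch of a VIB-style intent bottleneck (an assumption,
# not the paper's code): compress hidden semantics into a sampled intent
# vector and penalize its KL divergence from a standard normal prior.
import torch
import torch.nn as nn

class IntentBottleneck(nn.Module):
    def __init__(self, d_hidden: int = 512, d_intent: int = 64):
        super().__init__()
        self.mu = nn.Linear(d_hidden, d_intent)
        self.logvar = nn.Linear(d_hidden, d_intent)

    def forward(self, h):                        # h: (batch, d_hidden)
        mu, logvar = self.mu(h), self.logvar(h)
        intent = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        # KL(q(z|h) || N(0, I)): the bottleneck penalty that keeps the
        # intent code smooth and low-capacity.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1).mean()
        return intent, kl

intent, kl = IntentBottleneck()(torch.randn(4, 512))
print(intent.shape, float(kl))                   # torch.Size([4, 64]) ...
```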
[103] METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
Haofu Yang, Jiaji Liu, Chen Huang, Faguo Wu, Wenqiang Lei, See-Kiong Ng
Main category: cs.CL
TL;DR: METRO uses LLMs to autonomously learn dialogue strategies from transcripts, creating a hierarchical Strategy Forest for non-collaborative agents.
Details
Motivation: Traditional non-collaborative dialogue agents require manual, unscalable expert strategy codification. Need automated methods to induce strategies from data.
Method: Leverages large language models to autonomously induce both strategy actions and planning logic from raw transcripts. Formalizes expert knowledge into a Strategy Forest, a hierarchical structure capturing short-term responses (nodes) and long-term strategic foresight (branches).
Result: Outperforms existing methods by average 9%-10% across two benchmarks. Shows strategic behavioral diversity, foresight, and robust cross-task transferability.
Conclusion: Offers cost-effective, scalable way to build non-collaborative agents. Provides new insights into automated strategy learning for dialogue systems.
Abstract: Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies. We propose METRO, a method that leverages large language models to autonomously induce both strategy actions and planning logic directly from raw transcripts. METRO formalizes expert knowledge into a Strategy Forest, a hierarchical structure that captures both short-term responses (nodes) and long-term strategic foresight (branches). Experimental results across two benchmarks show that METRO demonstrates promising performance, outperforming existing methods by an average of 9%-10%. Our further analysis not only reveals the success behind METRO (strategic behavioral diversity and foresight), but also demonstrates its robust cross-task transferability. This offers new insights into building non-collaborative agents in a cost-effective and scalable way. Our code is available at https://github.com/Humphrey-0125/METRO.
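The Strategy Forest itself is a plain tree structure; the sketch below illustrates only the data structure, with invented node labels. METRO's actual contribution, inducing the actions and planning logic from transcripts with an LLM, is not shown.

```python
# Hypothetical illustration of the Strategy Forest data structure only;
# node labels are invented examples from a negotiation setting.
from dataclasses import dataclass, field

@dataclass
class StrategyNode:
    action: str                                    # short-term response (node)
    children: list = field(default_factory=list)   # long-term foresight (branch)

    def add(self, child: "StrategyNode") -> "StrategyNode":
        self.children.append(child)
        return child

root = StrategyNode("acknowledge the objection")
probe = root.add(StrategyNode("probe for the underlying concern"))
probe.add(StrategyNode("offer a small concession"))
probe.add(StrategyNode("reframe the value proposition"))
print(root)
```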
[104] Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books
Argyrios Papoudakis, Mirella Lapata, Frank Keller
Main category: cs.CL
TL;DR: A framework that decouples reasoning from generation for character description generation from long narratives, using QA-guided reasoning to improve faithfulness and informativeness.
Details
Motivation: Generating accurate character descriptions from long-form narratives is challenging due to the need to track evolving attributes, integrate scattered evidence, and infer implicit details. Surprisingly, LLMs perform better when built-in reasoning is disabled, motivating a decoupled approach.
Method: Proposes a training framework with two components: 1) a reasoning model that produces structured QA reasoning traces, and 2) a generation model that conditions on these traces to produce final character descriptions. Can be applied on top of long-context LLMs or chunk-based methods.
Result: Experiments on BookWorm and CroSS datasets show that QA-guided reasoning improves faithfulness, informativeness, and grounding over strong long-context baselines.
Conclusion: Decoupling reasoning from generation via structured QA traces is effective for character description generation from long narratives, addressing the limitations of end-to-end LLM approaches.
Abstract: Character description generation is an important capability for narrative-focused applications such as summarization, story analysis, and character-driven simulations. However, generating accurate character descriptions from long-form narratives (e.g., novels) is challenging: models must track evolving attributes (e.g., relationships and events), integrate evidence scattered across the text, and infer implicit details. Despite the success of reasoning-enabled LLMs on many benchmarks, we find that for character description generation their performance improves when built-in reasoning is disabled (i.e., an empty reasoning trace). Motivated by this, we propose a training framework that decouples reasoning from generation. Our approach, which can be applied on top of long-context LLMs or chunk-based methods, consists of a reasoning model that produces a structured QA reasoning trace and a generation model that conditions on this trace to produce the final character description. Experiments on two datasets (BookWorm and CroSS) show that QA-guided reasoning improves faithfulness, informativeness, and grounding over strong long-context baselines.
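The decoupled pipeline reduces to two model calls: one produces a structured QA trace, the other writes the description conditioned on that trace. The sketch below shows the control flow with an invented llm() stub and illustrative prompts; the paper trains dedicated reasoning and generation models rather than prompting a single LLM.

```python
# Minimal sketch of QA-guided reasoning decoupled from generation.
# The llm() stub and both prompts are hypothetical illustrations.
def llm(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"  # stand-in for a real model

def describe_character(book_text: str, character: str) -> str:
    # Stage 1: a reasoning model emits a structured QA trace.
    trace = llm(
        f"As Q/A pairs, answer questions about {character}'s traits, "
        f"relationships, and key events in this narrative:\n{book_text}"
    )
    # Stage 2: a generation model conditions on the trace, not the raw book.
    return llm(
        f"Using only the evidence in this QA trace, write a faithful, "
        f"informative description of {character}:\n{trace}"
    )

print(describe_character("Pip meets the convict Magwitch...", "Pip"))
```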
[105] METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
Pengfeng Li, Chen Huang, Chaoqun Hao, Hongyao Chen, Xiao-Yong Wei, Wenqiang Lei, See-Kiong Ng
Main category: cs.CL
TL;DR: METER is a benchmark for evaluating LLMs’ contextual causal reasoning across all three levels of the causal hierarchy under unified context settings, revealing performance degradation at higher causal levels due to distraction by irrelevant information and reduced context faithfulness.
Details
Motivation: Existing benchmarks for evaluating causal reasoning in LLMs are fragmented, lack context consistency, and don't cover the full causal hierarchy, making it difficult to systematically assess LLMs' contextual causal reasoning capabilities.
Method: Developed METER benchmark to systematically evaluate LLMs across all three levels of the causal ladder (association, intervention, counterfactual) under unified context settings. Conducted extensive evaluation of various LLMs followed by mechanistic analysis through error pattern identification and internal information flow tracing.
Result: LLMs show significant performance decline as tasks ascend the causal hierarchy. Two primary failure modes identified: (1) distraction by causally irrelevant but factually correct information at lower causal levels, and (2) reduced faithfulness to provided context at higher causal levels leading to performance degradation.
Conclusion: The work advances understanding of LLM contextual causal reasoning mechanisms and establishes a foundation for future research. The benchmark reveals systematic limitations in LLMs’ causal reasoning capabilities that need to be addressed.
Abstract: Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower levels of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to reduced performance. We believe our work advances our understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. Our code and dataset are available at https://github.com/SCUNLP/METER.
[106] Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization
Jiashu Yao, Heyan Huang, Chuwei Luo, Daiqing Wu, Zeming Liu, Yuhang Guo, Yangyang Kang
Main category: cs.CL
TL;DR: Policy Split: A novel RL paradigm for LLMs that splits policy into normal and high-entropy modes with shared parameters, using collaborative dual-mode entropy regularization to balance task correctness with exploration.
Details
Motivation: Existing RL methods for large language models often struggle to balance exploration with task performance, making it hard to encourage diverse exploration without compromising accuracy.
Method: Proposes Policy Split paradigm that bifurcates the policy into normal and high-entropy modes using a high-entropy prompt. Both modes share model parameters but undergo collaborative dual-mode entropy regularization: normal mode optimizes for task correctness, high-entropy mode incorporates exploration preference, and they learn collaboratively.
Result: Extensive experiments show consistent outperformance over established entropy-guided RL baselines across various model sizes in general and creative tasks. Analysis reveals Policy Split facilitates dual-mode exploration where high-entropy mode generates distinct behavioral patterns providing unique learning signals.
Conclusion: Policy Split effectively balances exploration and accuracy in RL for LLMs through collaborative dual-mode learning, enabling diverse exploration without compromising task performance.
Abstract: To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entropy modes with a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives. Specifically, the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes in general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, where the high-entropy mode generates behavioral patterns distinct from those of the normal mode, providing unique learning signals.
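Schematically, the shared-parameter objective can be written as one loss in which only the high-entropy mode receives an entropy bonus. The PyTorch sketch below is an assumed REINFORCE-style surrogate; the paper's exact objective and the beta weighting are not specified in this summary.

```python
# Assumed REINFORCE-style surrogate (not the paper's exact objective):
# both modes use the same policy network, but only the high-entropy mode
# receives an explicit entropy bonus.
import torch

def dual_mode_loss(logits_normal, actions_normal, reward_normal,
                   logits_high, actions_high, reward_high, beta=0.01):
    dist_n = torch.distributions.Categorical(logits=logits_normal)
    dist_h = torch.distributions.Categorical(logits=logits_high)
    # Normal mode: plain policy-gradient surrogate for task correctness.
    loss_normal = -reward_normal * dist_n.log_prob(actions_normal).sum()
    # High-entropy mode: same surrogate plus an exploration bonus.
    loss_high = -(reward_high * dist_h.log_prob(actions_high).sum()
                  + beta * dist_h.entropy().mean())
    return loss_normal + loss_high   # shared parameters get both gradients

T, V = 8, 100                        # toy rollout length and vocab size
loss = dual_mode_loss(torch.randn(T, V), torch.randint(V, (T,)), 1.0,
                      torch.randn(T, V), torch.randint(V, (T,)), 0.0)
print(float(loss))
```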
[107] Triviality Corrected Endogenous Reward
Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Jialin Liu, Chenzhuo Zhao, Zhibo Yang, Bin-Bin Yang, Feng Xiao
Main category: cs.CL
TL;DR: TCER addresses triviality bias in unsupervised RL for text generation by using relative information gain between specialist and generalist policies with probability-dependent correction.
Details
Motivation: Open-ended text generation lacks verifiable rewards, forcing reliance on judge models that need annotated data or closed-source models. Unsupervised RL using confidence-based rewards from mathematical reasoning shows promise but causes triviality bias in writing tasks.
Method: Proposes TCER (Triviality Corrected Endogenous Reward) that rewards relative information gain between a specialist policy and generalist reference policy, modulated by probability-dependent correction to address triviality bias.
Result: TCER achieves consistent improvements across multiple writing benchmarks and model architectures without external supervision, and also transfers effectively to mathematical reasoning tasks.
Conclusion: TCER successfully addresses triviality bias in unsupervised RL for open-ended generation, demonstrating generality across different generation tasks including writing and mathematical reasoning.
Abstract: Reinforcement learning for open-ended text generation is constrained by the lack of verifiable rewards, necessitating reliance on judge models that require either annotated data or powerful closed-source models. Inspired by recent work on unsupervised reinforcement learning for mathematical reasoning using confidence-based endogenous rewards, we investigate whether this principle can be adapted to open-ended writing tasks. We find that directly applying confidence rewards leads to Triviality Bias: the policy collapses toward high-probability outputs, reducing diversity and meaningful content. We propose TCER (Triviality Corrected Endogenous Reward), which addresses this bias by rewarding the relative information gain between a specialist policy and a generalist reference policy, modulated by a probability-dependent correction mechanism. Across multiple writing benchmarks and model architectures, TCER achieves consistent improvements without external supervision. Furthermore, TCER also transfers effectively to mathematical reasoning, validating the generality of our approach across different generation tasks.
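The core quantity is a log-probability gap between the specialist and the generalist, damped for outputs the specialist already finds near-certain. The sketch below is one plausible form under those assumptions; the paper's exact correction function is not given here.

```python
# Guess at one plausible form of the reward (the paper's exact correction
# is not specified in this summary): reward the specialist-vs-generalist
# log-probability gap, damped when the specialist's probability is high.
import math

def tcer_reward(logp_specialist: float, logp_generalist: float) -> float:
    info_gain = logp_specialist - logp_generalist  # relative information gain
    p = math.exp(logp_specialist)                  # specialist's own confidence
    correction = 1.0 - p                           # shrinks reward for trivial,
                                                   # high-probability outputs
    return correction * info_gain

print(tcer_reward(-2.0, -5.0))   # informative but non-trivial: sizable reward
print(tcer_reward(-0.01, -5.0))  # near-certain (trivial) output: damped reward
```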
[108] NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment
Wenqing Wu, Yi Zhao, Yuzhuo Wang, Siyou Li, Juexi Shao, Yunfei Long, Chengzhi Zhang
Main category: cs.CL
TL;DR: NovBench is the first large-scale benchmark for evaluating LLMs’ ability to assess research novelty in peer review, using 1,684 paper-review pairs from NLP conferences with a four-dimensional evaluation framework.
Details
Motivation: The growing volume of academic submissions pressures human reviewers, and while LLMs show promise in generating review comments, there's no dedicated benchmark for systematically evaluating their ability to assess research novelty.
Method: Created NovBench with 1,684 paper-review pairs from a leading NLP conference, extracting novelty descriptions from introductions and expert-written novelty evaluations. Proposed a four-dimensional evaluation framework (Relevance, Correctness, Coverage, Clarity) to assess LLM-generated novelty evaluations.
Result: Extensive experiments show current models exhibit limited understanding of scientific novelty, and fine-tuned models often suffer from instruction-following deficiencies.
Conclusion: There’s a need for targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence in LLMs for peer review applications.
Abstract: Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has placed increasing pressure on human reviewers. While large language models (LLMs), including those fine-tuned on peer review data, have shown promise in generating review comments, the absence of a dedicated benchmark has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs’ capability to generate novelty evaluations in support of human peer review. NovBench comprises 1,684 paper-review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and corresponding expert-written novelty evaluations. We focus on both sources because the introduction provides a standardized and explicit articulation of novelty claims, while expert-written novelty evaluations constitute one of the current gold standards of human judgment. Furthermore, we propose a four-dimensional evaluation framework (including Relevance, Correctness, Coverage, and Clarity) to assess the quality of LLM-generated novelty evaluations. Extensive experiments on both general and specialized LLMs under different prompting strategies reveal that current models exhibit limited understanding of scientific novelty, and that fine-tuned models often suffer from instruction-following deficiencies. These findings underscore the need for targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence.
[109] Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory
Weixian Waylon Li, Jiaxin Zhang, Xianan Jim Yang, Tiejun Ma, Yiwen Guo
Main category: cs.CL
TL;DR: RoMem is a temporal knowledge graph module that uses semantic speed gates and continuous phase rotation to distinguish persistent vs evolving facts without deletion, achieving SOTA on temporal KG completion and improving agentic memory performance.
Details
Motivation: Existing structured memory approaches struggle with temporal modeling - they either bury old knowledge, overwrite facts, or require expensive LLM calls, failing to distinguish persistent facts from evolving ones in long-lived systems like autonomous agents.
Method: RoMem uses a pretrained Semantic Speed Gate that maps relation text embeddings to volatility scores, learning which relations evolve fast (e.g., “president of”) vs remain stable (e.g., “born in”). Combined with continuous phase rotation, it enables geometric shadowing where obsolete facts are rotated out of phase in complex vector space.
Result: Achieves state-of-the-art 72.6 MRR on ICEWS05-15 temporal KG completion. For agentic memory: 2-3x MRR and answer accuracy on MultiTQ, dominates LoCoMo hybrid benchmark, preserves static memory with zero degradation on DMR-MSC, and generalizes zero-shot to financial domains (FinTMMBench).
Conclusion: RoMem provides an effective drop-in temporal knowledge graph module that enables structured memory systems to handle temporal dynamics by distinguishing persistent vs evolving facts through semantic volatility learning and geometric shadowing.
Abstract: Structured memory representations such as knowledge graphs are central to autonomous agents and other long-lived systems. However, most existing approaches model time as discrete metadata, either sorting by recency (burying old-yet-permanent knowledge), simply overwriting outdated facts, or requiring an expensive LLM call at every ingestion step, leaving them unable to distinguish persistent facts from evolving ones. To address this, we introduce RoMem, a drop-in temporal knowledge graph module for structured memory systems, applicable to agentic memory and beyond. A pretrained Semantic Speed Gate maps each relation’s text embedding to a volatility score, learning from data that evolving relations (e.g., “president of”) should rotate fast while persistent ones (e.g., “born in”) should remain stable. Combined with continuous phase rotation, this enables geometric shadowing: obsolete facts are rotated out of phase in complex vector space, so temporally correct facts naturally outrank contradictions without deletion. On temporal knowledge graph completion, RoMem achieves state-of-the-art results on ICEWS05-15 (72.6 MRR). Applied to agentic memory, it delivers 2-3x MRR and answer accuracy on temporal reasoning (MultiTQ), dominates the hybrid benchmark (LoCoMo), preserves static memory with zero degradation (DMR-MSC), and generalises zero-shot to unseen financial domains (FinTMMBench).
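A toy version of geometric shadowing, assuming complex-valued fact embeddings and a scalar volatility per relation (the speed_gate lookup stands in for the pretrained gate over relation text embeddings):

```python
# Toy sketch of continuous phase rotation; the scoring rule and the
# speed_gate stub are illustrative assumptions, not RoMem's actual code.
import numpy as np

def speed_gate(relation: str) -> float:
    # Evolving relations get high volatility; persistent ones get ~0.
    return {"president_of": 0.9, "born_in": 0.0}.get(relation, 0.5)

def rotate(embedding: np.ndarray, relation: str, t: float) -> np.ndarray:
    angle = speed_gate(relation) * t       # phase advances with elapsed time
    return embedding * np.exp(1j * angle)

e = np.ones(4, dtype=complex) / 2.0        # unit-norm toy embedding
for rel in ("president_of", "born_in"):
    old = rotate(e, rel, t=0.0)            # fact asserted at t=0
    now = rotate(e, rel, t=5.0)            # query at t=5
    sim = np.real(np.vdot(now, old))       # alignment after the phase gap
    print(rel, round(sim, 3))
# president_of drifts out of phase (low score); born_in stays aligned (1.0)
```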
[110] Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
Liujie Zhang, Benzhe Ning, Rui Yang, Xiaoyan Yu, Jiaxing Li, Lumeng Wu, Jia Liu, Minghao Li, Weihang Chen, Weiqi Hu, Lei Zhang
Main category: cs.CL
TL;DR: Relax is an open-source RL training engine designed for multimodal LLMs that addresses challenges in heterogeneous data flows, operational robustness, and staleness-throughput tradeoffs through an omni-native architecture, service-level decoupling, and asynchronous training.
Details
Motivation: As LLMs extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness-throughput tradeoff, which existing systems struggle to handle effectively.
Method: Relax uses three co-designed architectural layers: 1) omni-native architecture with multimodal support built into the full stack, 2) independent fault-isolated services for each RL role that can scale independently, and 3) service-level decoupling enabling asynchronous training via TransferQueue data bus with tunable staleness parameter.
Result: Achieves 1.20× speedup over veRL on Qwen3-4B on-policy training, 1.76× speedup in fully async mode on Qwen3-4B, and 2.00× speedup on Qwen3-Omni-30B. Supports R3 for MoE models with only 1.9% overhead vs 32% degradation in veRL. Demonstrates stable omni-modal RL convergence across image, text, and audio, sustaining over 2,000 steps on video.
Conclusion: Relax provides an effective RL training engine for multimodal LLMs that addresses key scalability and efficiency challenges while maintaining convergence quality, making it suitable for complex omni-modal agentic workflows.
Abstract: Reinforcement learning (RL) post-training has proven effective at unlocking reasoning, self-reflection, and tool-use capabilities in large language models. As models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness-throughput tradeoff. We present Relax (Reinforcement Engine Leveraging Agentic X-modality), an open-source RL training engine that addresses these challenges through three co-designed architectural layers. First, an omni-native architecture builds multimodal support into the full stack, from data preprocessing and modality-aware parallelism to inference generation, rather than retrofitting it onto a text-centric pipeline. Second, each RL role runs as an independent, fault-isolated service that can be scaled, recovered, and upgraded without global coordination. Third, service-level decoupling enables asynchronous training via the TransferQueue data bus, where a single staleness parameter smoothly interpolates among on-policy, near-on-policy, and fully asynchronous execution. Relax achieves a 1.20× end-to-end speedup over veRL on Qwen3-4B on-policy training. Its fully async mode delivers a 1.76× speedup over colocate on Qwen3-4B and a 2.00× speedup on Qwen3-Omni-30B, while all modes converge to the same reward level. Relax supports R3 (Rollout Routing Replay) for MoE models with only 1.9% overhead, compared to 32% degradation in veRL under the same configuration. It further demonstrates stable omni-modal RL convergence on Qwen3-Omni across image, text, and audio, sustaining over 2,000 steps on video without degradation. Relax is available at https://github.com/rednote-ai/Relax.
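The staleness parameter can be pictured as a bound on how many policy versions behind a rollout may lag before the trainer rejects it. The toy class below is a hypothetical illustration, not Relax's actual TransferQueue API:

```python
# Hypothetical staleness-bounded rollout queue: rollouts are tagged with
# the policy version that produced them, and one parameter spans
# on-policy to fully asynchronous training.
from collections import deque

class StalenessQueue:
    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness   # 0 = on-policy; large = fully async
        self.buf = deque()

    def put(self, rollout, policy_version: int):
        self.buf.append((rollout, policy_version))

    def get(self, current_version: int):
        # Discard rollouts produced by a policy too many versions behind.
        while self.buf:
            rollout, v = self.buf.popleft()
            if current_version - v <= self.max_staleness:
                return rollout
        return None                          # trainer waits for fresher data

q = StalenessQueue(max_staleness=2)
q.put("rollout_a", policy_version=1)
q.put("rollout_b", policy_version=4)
print(q.get(current_version=5))              # rollout_a is stale -> rollout_b
```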
[111] Synthius-Mem: Brain-Inspired Hallucination-Resistant Persona Memory Achieving 94.4% Memory Accuracy and 99.6% Adversarial Robustness on LoCoMo
Artem Gadzhiev, Andrew Kislov
Main category: cs.CL
TL;DR: Synthius-Mem is a brain-inspired structured persona memory system for AI agents that extracts and organizes user information into six cognitive domains to prevent hallucination and achieve state-of-the-art performance on memory benchmarks.
Details
Motivation: Current LLM agent memory systems (sliding windows, summarization, embedding-based RAG, flat fact extraction) suffer from catastrophic information loss, semantic drift, or uncontrolled hallucination about users. No existing system reports adversarial robustness or the ability to refuse questions about facts users never disclosed.
Method: A structured persona memory system that extracts what is known about a person rather than retrieving what was said. Uses a full persona extraction pipeline that decomposes conversations into six cognitive domains (biography, experiences, preferences, social circle, work, psychometrics), consolidates and deduplicates per domain, and retrieves structured facts via CategoryRAG with 21.79 ms latency.
Result: Achieves 94.37% accuracy on LoCoMo benchmark (exceeding all published systems including MemMachine’s 91.69% and human performance of 87.9 F1). Core memory fact accuracy reaches 98.64%. Adversarial robustness reaches 99.55%. Reduces token consumption by ~5x compared to full-context replay while achieving higher accuracy.
Conclusion: Synthius-Mem achieves state-of-the-art results on LoCoMo and is the only persona memory system that both exceeds human-level performance and reports adversarial robustness, providing reliable long-term memory that doesn’t hallucinate.
Abstract: Providing AI agents with reliable long-term memory that does not hallucinate remains an open problem. Current approaches to memory for LLM agents – sliding windows, summarization, embedding-based RAG, and flat fact extraction – each reduce token cost but introduce catastrophic information loss, semantic drift, or uncontrolled hallucination about the user. The structural reason is architectural: every published memory system on the LoCoMo benchmark treats conversation as a retrieval problem over raw or lightly summarized dialogue segments, and none reports adversarial robustness, the ability to refuse questions about facts the user never disclosed. We present Synthius-Mem, a brain-inspired structured persona memory system that takes a fundamentally different approach. Instead of retrieving what was said, Synthius-Mem extracts what is known about the person: a full persona extraction pipeline decomposes conversations into six cognitive domains (biography, experiences, preferences, social circle, work, psychometrics), consolidates and deduplicates per domain, and retrieves structured facts via CategoryRAG at 21.79 ms latency. On the LoCoMo benchmark (ACL 2024, 10 conversations, 1,813 questions), Synthius-Mem achieves 94.37% accuracy, exceeding all published systems including MemMachine (91.69%, adversarial score is not reported) and human performance (87.9 F1). Core memory fact accuracy reaches 98.64%. Adversarial robustness, the hallucination resistance metric that no competing system reports, reaches 99.55%. Synthius-Mem reduces token consumption by ~5x compared to full-context replay while achieving higher accuracy. Synthius-Mem achieves state-of-the-art results on LoCoMo and is, to our knowledge, the only persona memory system that both exceeds human-level performance and reports adversarial robustness.
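The persona-centric design reduces to a structured store keyed by cognitive domain. The sketch below is a simplified illustration using the six domain names from the abstract; the real pipeline extracts, consolidates, and deduplicates facts with far more machinery:

```python
# Simplified six-domain persona store with category-routed retrieval.
# Domain names come from the abstract; routing and dedup are toy versions.
DOMAINS = ("biography", "experiences", "preferences",
           "social_circle", "work", "psychometrics")

class PersonaMemory:
    def __init__(self):
        self.facts = {d: set() for d in DOMAINS}

    def add(self, domain: str, fact: str):
        self.facts[domain].add(fact)          # set() gives cheap deduplication

    def retrieve(self, domain: str, keyword: str):
        # CategoryRAG-style: search the relevant domain, not raw dialogue.
        return [f for f in self.facts[domain] if keyword.lower() in f.lower()]

mem = PersonaMemory()
mem.add("work", "works as a nurse in Leeds")
print(mem.retrieve("work", "nurse"))         # ['works as a nurse in Leeds']
print(mem.retrieve("biography", "nurse"))    # [] -> can refuse undisclosed facts
```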
[112] Phonological distances for linguistic typology and the origin of Indo-European languages
Marius Mavridis, Juan De Gregorio, Raul Toral, David Sanchez
Main category: cs.CL
TL;DR: Phoneme dependencies modeled as Markov chains reveal linguistic relationships, enabling phylogenetic analysis and geographic correlation for language families.
Details
Motivation: To develop quantitative methods for analyzing linguistic relatedness using phoneme dependencies, with applications to typology, evolutionary linguistics, and language family origins.
Method: Information-theoretic framework modeling phoneme sequences as second-order Markov chains, using articulatory features to compute phonological distances across 67 languages from a parallel corpus.
Result: Phonological distance matrix recovers major language families, shows contact-induced convergence, correlates with geographic distance, and supports Steppe hypothesis for Indo-European homeland.
Conclusion: Short-range phoneme dependencies effectively capture linguistic relatedness patterns, providing quantitative tools for typological analysis and evolutionary linguistics research.
Abstract: We show that short-range phoneme dependencies encode large-scale patterns of linguistic relatedness, with direct implications for quantitative typology and evolutionary linguistics. Specifically, using an information-theoretic framework, we argue that phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system. This finding enables us to quantify distances among 67 modern languages from a multilingual parallel corpus employing a distance metric that incorporates articulatory features of phonemes. The resulting phonological distance matrix recovers major language families and reveals signatures of contact-induced convergence. Remarkably, we obtain a clear correlation with geographic distance, allowing us to constrain a plausible homeland region for the Indo-European family, consistent with the Steppe hypothesis.
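To make the method concrete, the toy below estimates second-order (trigram) phoneme statistics for two strings and compares them with a Jensen-Shannon-style divergence. This is a simplification under stated assumptions: the paper works over a 67-language parallel corpus and uses a distance metric built on articulatory features of phonemes.

```python
# Toy sketch: second-order phoneme statistics per "language" compared with
# a smoothed Jensen-Shannon-style divergence. Strings are invented.
from collections import Counter
from math import log

def second_order_counts(phonemes):
    return Counter(zip(phonemes, phonemes[1:], phonemes[2:]))

def distance(lang_a, lang_b):
    ca, cb = second_order_counts(lang_a), second_order_counts(lang_b)
    keys = set(ca) | set(cb)
    ta, tb = sum(ca.values()), sum(cb.values())
    d = 0.0
    for k in keys:                      # add-one smoothed trigram probabilities
        pa = (ca[k] + 1) / (ta + len(keys))
        pb = (cb[k] + 1) / (tb + len(keys))
        m = (pa + pb) / 2
        d += 0.5 * (pa * log(pa / m) + pb * log(pb / m))
    return d

print(distance(list("patakapataka"), list("patakapataka")))  # ~0: same system
print(distance(list("patakapataka"), list("sbrzgsbrzgsb")))  # larger: different
```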
[113] MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts
Chen Hu, Yintao Tai, Antonio Vergari, Frank Keller, Alessandro Suglia
Main category: cs.CL
TL;DR: MIXAR is the first generative pixel-based language model trained on eight languages with different scripts, showing improved multilingual performance and robustness compared to previous pixel-based and token-based models.
Details
Motivation: Pixel-based language models offer advantages over token-based approaches by avoiding tokenization challenges, but face difficulties with multilingual generalization due to perceptual diversity across languages in pixel space.
Method: Developed MIXAR, a generative pixel-based language model trained on eight different languages using various scripts, and scaled it to 0.5B parameters for enhanced capabilities.
Result: MIXAR demonstrates substantial performance improvements on discriminative and generative multilingual tasks, shows robustness to unseen languages, and exhibits enhanced capabilities when scaled to 0.5B parameters, including better performance on LAMBADA and resilience to orthographic attacks.
Conclusion: Pixel-based language models like MIXAR can effectively handle multilingual challenges and offer promising alternatives to token-based approaches, with scaling further enhancing their capabilities and robustness.
Abstract: Pixel-based language models are gaining momentum as alternatives to traditional token-based approaches, promising to circumvent tokenization challenges. However, the inherent perceptual diversity across languages poses a significant hurdle for multilingual generalization in pixel space. This paper introduces MIXAR, the first generative pixel-based language model trained on eight different languages utilizing a range of different scripts. We empirically evaluate MIXAR against previous pixel-based models as well as comparable tokenizer-based models, demonstrating substantial performance improvement on discriminative and generative multilingual tasks. Additionally, we show how MIXAR is robust to languages never seen during training. These results are further strengthened when scaling the model to 0.5B parameters, which not only improves its capabilities in generative tasks like LAMBADA but also its robustness when challenged with input perturbations such as orthographic attacks.
[114] Decomposing and Reducing Hidden Measurement Error in LLM Evaluation Pipelines
Solomon Messing
Main category: cs.CL
TL;DR: LLM evaluation pipelines have hidden uncertainty from prompt variations, judge models, and temperature settings that can flip rankings and reverse conclusions, requiring better uncertainty quantification and optimized pipeline designs.
Details
Motivation: LLM evaluations are crucial for model deployment, safety standards, and research conclusions, but current evaluation methods have hidden uncertainties from prompt rephrasing, judge model switching, and temperature changes that can significantly alter results and rankings, creating exploitable surfaces for gaming benchmarks.
Method: Decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, projects efficient paths to reduce total error, and provides recommendations for benchmark builders to minimize exploitable surfaces.
Result: Projection-optimized pipelines outperform 73% of possible naive pipelines against human baselines across various tasks (ideology annotation, safety classification, MMLU benchmarking, propaganda audit). On MMLU, optimized budget allocation halves estimation error compared to standard single-prompt evaluation at equivalent cost.
Conclusion: Proper uncertainty quantification in LLM evaluations is essential for reliable benchmarking, and small-sample variance estimation can generate confidence intervals with nominal coverage while providing actionable recommendations to reduce measurement error and improve benchmark robustness.
Abstract: LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet these scores carry hidden uncertainty: rephrasing the prompt, switching the judge model, or changing the temperature can shift results enough to flip rankings and reverse conclusions. Standard confidence intervals ignore this variance, producing under-coverage that worsens with more data. The unmeasured variance also creates an exploitable surface: model developers can optimize against measurement noise rather than genuine capability. This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and projects the most efficient path to reducing total error. For benchmark builders, the same decomposition identifies which design choices contribute exploitable surface for gaming and prescribes designs that minimize it. Across ideology annotation, safety classification, MMLU benchmarking, and a human-validated propaganda audit, projection-optimized pipelines outperform 73% of possible naive pipelines against a human baseline. On MMLU, optimized budget allocation halves estimation error compared to standard single-prompt evaluation at equivalent cost. A small-sample variance estimation exercise is sufficient to derive confidence intervals that approach nominal coverage when the model includes the relevant pipeline facets, and to generate recommendations for reducing measurement error and improving benchmark robustness.
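A minimal numeric illustration of the decomposition idea: simulate scores that vary with prompt paraphrase, judge, temperature, and sampling noise, then compare between-facet spreads with the sampling variance that shrinks as data grows. All numbers and the simple mean-based decomposition are invented for illustration.

```python
# Invented simulation of pipeline-facet variance; a real analysis would
# fit a proper variance-components model over actual evaluation runs.
import numpy as np

rng = np.random.default_rng(0)
prompts, judges, temps, reps = 5, 3, 2, 10
scores = (0.70
          + rng.normal(0, 0.05, (prompts, 1, 1, 1))   # prompt sensitivity
          + rng.normal(0, 0.02, (1, judges, 1, 1))    # judge sensitivity
          + rng.normal(0, 0.01, (1, 1, temps, 1))     # temperature
          + rng.normal(0, 0.03, (prompts, judges, temps, reps)))  # sampling

print("prompt facet var:", scores.mean(axis=(1, 2, 3)).var())
print("judge  facet var:", scores.mean(axis=(0, 2, 3)).var())
print("sampling var (shrinks with data):", scores.var(axis=3).mean())
```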
[115] A Triadic Suffix Tokenization Scheme for Numerical Reasoning
Olga Chetverina
Main category: cs.CL
TL;DR: Triadic Suffix Tokenization (TST) is a new tokenization method that groups digits into three-digit triads with explicit magnitude markers to preserve numerical structure in LLMs, addressing arithmetic errors caused by standard subword tokenization.
Details
Motivation: Standard subword tokenization methods fragment numbers inconsistently, causing LLMs to lose positional and decimal structure, which is a primary driver of errors in arithmetic and scientific reasoning tasks.
Method: TST partitions digits into three-digit triads and annotates each triad with explicit magnitude markers. Two variants: (1) vocabulary-based approach adding up to 10,000 fixed tokens covering 33 orders of magnitude, and (2) suffix-marker approach using special tokens to denote magnitude dynamically.
Result: The framework preserves exact digits while making order-of-magnitude relationships transparent at the token level. It’s scalable, architecture-agnostic, and can be integrated as a drop-in preprocessing step. Experimental validation is deferred to future work.
Conclusion: TST provides a deterministic tokenization scheme that maintains numerical structure in LLMs, offering a solution to arithmetic reasoning limitations caused by standard tokenization methods.
Abstract: Standard subword tokenization methods fragment numbers inconsistently, causing large language models (LLMs) to lose positional and decimal structure - a primary driver of errors in arithmetic and scientific reasoning. We introduce Triadic Suffix Tokenization (TST), a deterministic scheme that partitions digits into three-digit triads and annotates each triad with an explicit magnitude marker. Critically, the scheme defines a fixed, one-to-one mapping between suffixes and orders of magnitude for the integer part (thousands, millions, billions, etc.) and a parallel system of replicated markers for fractional depth (tenths, thousandths, millionths, etc.). Unlike approaches that rely on positional inference, this method provides a consistent gradient signal, which should ensure stable convergence. Two implementation variants are proposed: (1) a vocabulary-based approach that adds at most 10,000 fixed tokens to an existing vocabulary, covering 33 orders of magnitude (10⁻¹⁵ to 10¹⁸); and (2) a suffix-marker approach that uses a small set of special tokens to denote magnitude dynamically. Both variants preserve exact digits while making order-of-magnitude relationships transparent at the token level. The framework is inherently scalable, allowing for linear vocabulary expansion to accommodate arbitrary precision and range. TST is architecture-agnostic and can be integrated as a drop-in preprocessing step. Experimental validation is deferred to future work.
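A minimal sketch of the integer-part scheme, with invented marker strings; the paper defines a fixed suffix inventory covering 33 orders of magnitude plus parallel fractional-depth markers, which this toy omits.

```python
# Toy triadic suffix tokenization for integers. Marker strings ("_K",
# "_M", ...) are hypothetical stand-ins for the paper's suffix inventory.
MARKERS = ["", "_K", "_M", "_B", "_T"]   # ones, thousands, millions, ...

def tst_tokenize(n: int):
    digits = str(n)
    triads = []
    # Split into three-digit triads from the right, then tag each triad
    # with its order-of-magnitude marker.
    while digits:
        triads.append(digits[-3:])
        digits = digits[:-3]
    return [t + MARKERS[i] for i, t in reversed(list(enumerate(triads)))]

print(tst_tokenize(1234567))   # ['1_M', '234_K', '567']
```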
[116] Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks
Yuqing Yang, Tengxiao Liu, Wang Bill Zhu, Taiwei Shi, Linxin Song, Robin Jia
Main category: cs.CL
TL;DR: BEHEMOTH benchmark for heterogeneous memory extraction in LLM assistants, with CluE cluster-based self-evolving strategy outperforming previous methods across diverse tasks.
Details
Motivation: As LLM-based assistants become persistent and personalized, they need to extract and retain useful information from past conversations as memory, but the types of information worth remembering vary considerably across different tasks.
Method: Formalizes the heterogeneous memory extraction task and introduces the BEHEMOTH benchmark, built from 18 existing datasets across personalization, problem-solving, and agentic tasks. Proposes CluE, a cluster-based self-evolving strategy that groups training examples by extraction scenarios, analyzes each cluster independently, and synthesizes cross-cluster insights to update extraction prompts.
Result: Empirical analysis shows no single static extraction prompt dominates across all task categories, and existing self-evolving frameworks degrade with heterogeneous tasks. CluE achieves +9.04% relative gain and consistently outperforms prior self-evolving frameworks on BEHEMOTH benchmark.
Conclusion: CluE effectively addresses the challenge of heterogeneous memory extraction in LLM assistants through cluster-based self-evolution, demonstrating strong generalization across diverse task categories.
Abstract: As LLM-based assistants become persistent and personalized, they must extract and retain useful information from past conversations as memory. However, the types of information worth remembering vary considerably across tasks. We formalize the heterogeneous memory extraction task and introduce BEHEMOTH, a benchmark that repurposes 18 existing datasets spanning personalization, problem-solving, and agentic tasks, using a downstream utility-driven metric for systematic evaluation. Our empirical analysis confirms that no single static extraction prompt dominates across all task categories, and that existing self-evolving prompt optimization frameworks, originally designed for homogeneous distributions, degrade when training tasks are heterogeneous. To address this, we propose CluE, a cluster-based self-evolving strategy that groups training examples into clusters by extraction scenarios, analyzes each cluster independently, and synthesizes cross-cluster insights to update the extraction prompt. Experiments on BEHEMOTH show that CluE generalizes effectively across heterogeneous tasks (+9.04% relative gain), consistently outperforming prior self-evolving frameworks.
[117] Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation
Jiashu Yao, Heyan Huang, Zeming Liu, Yuhang Guo
Main category: cs.CL
TL;DR: MISE introduces a reinforcement learning paradigm using hindsight generative self-evaluation as dense reward signals, calibrated against environmental feedback to overcome sparse reward challenges in LLM-based agents.
Details
Motivation: The paper addresses the sparse reward challenge in reinforcement learning for large language model-based agents, where sparse extrinsic rewards make learning difficult and inefficient.
Method: Proposes Mutual Information Self-Evaluation (MISE) which uses hindsight generative self-evaluation to create dense internal reward signals, then calibrates these rewards against environmental feedback through a theoretical framework combining mutual information with KL divergence.
Result: MISE outperforms strong baselines, enabling open-source LLMs with about 7B parameters to achieve performance comparable to GPT-4o on validation tasks without expert supervision.
Conclusion: The work provides the first formal foundation for generative self-rewarding paradigms and demonstrates that calibrated self-evaluation can effectively supplement sparse extrinsic rewards for autonomous learning.
Abstract: To overcome the sparse reward challenge in reinforcement learning (RL) for agents based on large language models (LLMs), we propose Mutual Information Self-Evaluation (MISE), an RL paradigm that utilizes hindsight generative self-evaluation as dense reward signals while simultaneously calibrating them against environmental feedback. Empirically, MISE enables an agent to learn autonomously from dense internal rewards supplementing sparse extrinsic signals. Theoretically, our work provides the first formal foundation for the paradigm of generative self-rewarding. We prove that utilizing hindsight self-evaluation rewards is equivalent to minimizing an objective that combines mutual information with a KL divergence term between the policy and a proxy reward policy. This theoretical insight then informs and justifies our calibration step, which actively aligns these rewards with the optimal policy. Extensive experiments show that MISE outperforms strong baselines, enabling open-source LLMs of about 7B parameters to achieve performance comparable to GPT-4o on validation tasks without expert supervision.
[118] Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation
Yuqian Wu, Wei Chen, Zhengjun Huang, Junle Chen, Qingxiang Liu, Kai Wang, Xiaofang Zhou, Yuxuan Liang
Main category: cs.CL
TL;DR: A minimalist conversational memory framework that addresses signal sparsity in long dialogues through turn isolation retrieval and query-driven pruning, outperforming complex hierarchical methods.
Details
Motivation: Existing conversational memory systems suffer from context dilution as conversations grow longer, with complex hierarchical summarization or reinforcement learning approaches being vulnerable to the Signal Sparsity Effect where relevant signals become isolated and redundant content accumulates.
Method: Proposes a minimalist framework with two key components: Turn Isolation Retrieval (TIR) that uses a max-activation strategy to capture turn-level signals instead of global aggregation, and Query-Driven Pruning (QDP) that removes redundant sessions and conversational filler to create compact, high-density evidence sets.
Result: Extensive experiments on multiple benchmarks show the framework achieves robust performance across diverse settings, consistently outperforming strong baselines while maintaining high efficiency in tokens and latency.
Conclusion: Establishes a new minimalist baseline for conversational memory that effectively addresses signal sparsity and redundancy issues in long-term dialogue systems.
Abstract: Existing conversational memory systems rely on complex hierarchical summarization or reinforcement learning to manage long-term dialogue history, yet remain vulnerable to context dilution as conversations grow. In this work, we offer a different perspective: the primary bottleneck may lie not in memory architecture, but in the Signal Sparsity Effect within the latent knowledge manifold. Through controlled experiments, we identify two key phenomena: Decisive Evidence Sparsity, where relevant signals become increasingly isolated with longer sessions, leading to sharp degradation in aggregation-based methods; and Dual-Level Redundancy, where both inter-session interference and intra-session conversational filler introduce large amounts of non-informative content, hindering effective generation. Motivated by these insights, we propose a minimalist framework that brings conversational memory back to basics, relying solely on retrieval and generation via Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP). TIR replaces global aggregation with a max-activation strategy to capture turn-level signals, while QDP removes redundant sessions and conversational filler to construct a compact, high-density evidence set. Extensive experiments on multiple benchmarks demonstrate that the framework achieves robust performance across diverse settings, consistently outperforming strong baselines while maintaining high efficiency in tokens and latency, establishing a new minimalist baseline for conversational memory.
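Turn Isolation Retrieval is simple to illustrate: score every turn independently and rank turns, so one decisive turn is never diluted by session-level aggregation. The bag-of-words scorer below is a stand-in for the paper's actual retriever.

```python
# Sketch of Turn Isolation Retrieval with a toy bag-of-words scorer.
def score(query: str, turn: str) -> float:
    q, t = set(query.lower().split()), set(turn.lower().split())
    return len(q & t) / (len(q) or 1)

def turn_isolation_retrieve(query, sessions, k=3):
    scored = [(score(query, turn), si, turn)
              for si, session in enumerate(sessions)
              for turn in session]
    return sorted(scored, reverse=True)[:k]   # per-turn maxima, not session sums

sessions = [["I adopted a cat named Miso", "nice weather today"],
            ["work was so busy", "Miso the cat hates thunderstorms"]]
print(turn_isolation_retrieve("miso the cat", sessions, k=2))
```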
[119] CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity
Xuefeng Wei, Zhixuan Wang, Xuan Zhou, Zhi Qu, Hongyao Li, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Main category: cs.CL
TL;DR: CARTBENCH is a comprehensive Chinese artwork evaluation benchmark for VLMs with four subtasks testing evidence-grounded reasoning, expert-style appreciation, defensible reinterpretation, and authenticity discrimination.
Details
Motivation: Existing vision-language model benchmarks for Chinese artworks focus on short-form recognition and QA, lacking comprehensive evaluation of deeper art understanding, evidence-based reasoning, and connoisseur-level discrimination abilities needed for museum applications.
Method: Created CARTBENCH by aligning Palace Museum objects from Wikidata with authoritative catalog pages across five art categories and multiple dynasties. Includes four subtasks: CURATORQA (evidence-grounded reasoning), CATALOGCAPTION (structured appreciation), REINTERPRET (defensible reinterpretation with expert ratings), and CONNOISSEURPAIRS (authenticity discrimination).
Result: Evaluation of nine representative VLMs shows: high overall CURATORQA accuracy masks sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; authenticity discrimination stays near chance level, revealing significant gaps in connoisseur-level reasoning.
Conclusion: CARTBENCH reveals substantial limitations in current VLMs for deep art understanding, particularly in evidence-based reasoning, expert-level appreciation, and authenticity discrimination, highlighting the need for more sophisticated multimodal reasoning capabilities for museum and cultural heritage applications.
Abstract: We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.
[120] Hidden Failures in Robustness: Why Supervised Uncertainty Quantification Needs Better Evaluation
Joe Stacey, Hadas Orgad, Kentaro Inui, Benjamin Heinzerling, Nafise Sadat Moosavi
Main category: cs.CL
TL;DR: Systematic study of supervised uncertainty probes for LLMs reveals poor robustness under distribution shift, with middle-layer representations and token aggregation being key factors for reliable uncertainty estimation.
Details
Motivation: Hidden states of LLMs contain signals useful for uncertainty estimation and hallucination detection, but it's unclear how robust existing probe-based methods are and which designs provide reliable uncertainty estimates under distribution shift.
Method: Trained over 2,000 probes while systematically varying representation layer, feature type, and token aggregation strategy across models, tasks, and OOD settings to evaluate robustness of supervised uncertainty probes.
Result: Found poor robustness in current methods, especially for long-form generations. Probe robustness driven more by probe inputs than architecture: middle-layer representations generalize better than final-layer hidden states, and token aggregation across response tokens is more robust than single-token features.
Conclusion: Better evaluation is prerequisite for building more robust probes; proposed hybrid back-off strategy for improving robustness based on systematic findings about representation layers and token aggregation.
Abstract: Recent work has shown that the hidden states of large language models contain signals useful for uncertainty estimation and hallucination detection, motivating a growing interest in efficient probe-based approaches. Yet it remains unclear how robust existing methods are, and which probe designs provide uncertainty estimates that are reliable under distribution shift. We present a systematic study of supervised uncertainty probes across models, tasks, and OOD settings, training over 2,000 probes while varying the representation layer, feature type, and token aggregation strategy. Our evaluation highlights poor robustness in current methods, particularly in the case of long-form generations. We also find that probe robustness is driven less by architecture and more by the probe inputs. Middle-layer representations generalise more reliably than final-layer hidden states, and aggregating across response tokens is consistently more robust than relying on single-token features. These differences are often largely invisible in-distribution but become more important under distribution shift. Informed by our evaluation, we explore a simple hybrid back-off strategy for improving robustness, arguing that better evaluation is a prerequisite for building more robust probes.
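The configuration the study finds most robust (middle-layer representations, mean-pooled over response tokens) fits in a few lines; the sketch below uses random arrays as stand-ins for real LLM activations and a scikit-learn logistic regression as the probe.

```python
# Sketch of a supervised uncertainty probe: middle-layer hidden states,
# mean-pooled over response tokens. Random arrays stand in for real data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_examples, n_layers, n_tokens, d_model = 200, 25, 16, 32
mid_layer = n_layers // 2

# hidden[i]: (layers, response_tokens, d_model) activations for one answer.
hidden = rng.normal(size=(n_examples, n_layers, n_tokens, d_model))
labels = rng.integers(0, 2, size=n_examples)      # 1 = hallucinated answer

# Mean-pool the middle layer across response tokens (vs. last-token-only).
features = hidden[:, mid_layer].mean(axis=1)      # (n_examples, d_model)

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("train accuracy:", probe.score(features, labels))
```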
[121] Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind
Hanqi Xiao, Vaidehi Patil, Zaid Khan, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
Main category: cs.CL
TL;DR: The paper introduces ToM-SB, a privacy-themed theory-of-mind challenge where AI agents must act as double agents to steer attackers’ beliefs about sensitive information, showing that current LLMs struggle but can be improved with reinforcement learning that rewards both theory-of-mind reasoning and successful deception.
Details
Motivation: As LLMs become conversational systems, their ability to reason about dialogue partners' intentions (theory-of-mind) is critical for safe interaction with adversarial partners. The paper addresses the need for models that can understand and manipulate others' beliefs in privacy-sensitive scenarios.
Method: Proposes ToM-SB challenge where defenders act as double agents to steer attackers’ beliefs. Tests frontier models (Gemini3-Pro, GPT-5.4) and trains AI double agents using reinforcement learning with both fooling and ToM rewards. Evaluates across different attacker strengths and both in-distribution and out-of-distribution settings.
Result: Frontier models struggle on hard ToM-SB scenarios. RL-trained AI double agents outperform them, especially when combining both ToM and fooling rewards. Shows bidirectional emergence: fooling rewards improve ToM, and ToM rewards improve fooling. Gains in ToM and fooling are well-correlated.
Conclusion: ToM-SB is a challenging benchmark for LLM theory-of-mind capabilities. AI double agents trained with RL on both ToM and fooling objectives can significantly outperform current frontier models, demonstrating that belief modeling is key to success in adversarial privacy scenarios.
Abstract: As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker’s beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.
[122] Please Make it Sound like Human: Encoder-Decoder vs. Decoder-Only Transformers for AI-to-Human Text Style Transfer
Utsav Paneru
Main category: cs.CL
TL;DR: Researchers develop AI models to rewrite AI-generated text to read as human-authored, creating a parallel corpus and identifying stylistic markers to train models for this reverse detection task.
Details
Motivation: While AI text detection has been studied, the reverse process - systematically rewriting AI-generated prose to appear human-authored - remains under-explored, creating a gap in understanding AI-human style transfer.
Method: Built a parallel corpus of 25,140 AI-input/human-reference text pairs, identified 11 measurable stylistic markers distinguishing AI vs. human text, and fine-tuned three models (BART-base, BART-large, and Mistral-7B-Instruct with QLoRA) for style transfer.
Result: BART-large achieved highest reference similarity (BERTScore F1: 0.924, ROUGE-L: 0.566, chrF++: 55.92) with 17x fewer parameters than Mistral-7B. Mistral-7B showed higher marker shift but this reflected overshoot rather than accuracy.
Conclusion: Current style transfer evaluation has a blind spot regarding shift accuracy, and smaller models like BART-large can outperform larger ones for this specific AI-to-human text transformation task.
Abstract: AI-generated text has become common in academic and professional writing, prompting research into detection methods. Less studied is the reverse: systematically rewriting AI-generated prose to read as genuinely human-authored. We build a parallel corpus of 25,140 paired AI-input and human-reference text chunks, identify 11 measurable stylistic markers separating the two registers, and fine-tune three models: BART-base, BART-large, and Mistral-7B-Instruct with QLoRA. BART-large achieves the highest reference similarity – BERTScore F1 of 0.924, ROUGE-L of 0.566, and chrF++ of 55.92 – with 17x fewer parameters than Mistral-7B. We show that Mistral-7B’s higher marker shift score reflects overshoot rather than accuracy, and argue that shift accuracy is a meaningful blind spot in current style transfer evaluation.
[123] Legal2LogicICL: Improving Generalization in Transforming Legal Cases to Logical Formulas via Diverse Few-Shot Learning
Jieying Xue, Phuong Minh Nguyen, Ha Thanh Nguyen, May Myo Zin, Ken Satoh
Main category: cs.CL
TL;DR: LLM-based legal reasoning framework using retrieval-augmented few-shot learning to map legal cases to logical formulas without additional training, together with the new Legal2Proleg evaluation dataset.
Details
Motivation: Logic-based legal reasoning systems struggle with data scarcity for fine-tuning models to map natural language cases to logical formulas. Need for better generalization without extensive annotated data.
Method: Legal2LogicICL: few-shot retrieval framework balancing diversity and similarity at semantic and structural levels, mitigating entity-induced bias in legal texts. Uses retrieval-augmented generation for in-context learning without training.
Result: Significantly improves accuracy, stability, and generalization in transforming legal case descriptions to logical representations on both open-source and proprietary LLMs. New Legal2Proleg dataset supports evaluation.
Conclusion: Proposed framework enables effective few-shot legal reasoning with interpretable logical rule generation, addressing data scarcity through retrieval-augmented in-context learning.
Abstract: This work aims to improve the generalization of logic-based legal reasoning systems by integrating recent advances in NLP with legal-domain adaptive few-shot learning techniques using LLMs. Existing logic-based legal reasoning pipelines typically rely on fine-tuned models to map natural-language legal cases into logical formulas before forwarding them to a symbolic reasoner. However, such approaches are heavily constrained by the scarcity of high-quality annotated training data. To address this limitation, we propose a novel LLM-based legal reasoning framework that enables effective in-context learning through retrieval-augmented generation. Specifically, we introduce Legal2LogicICL, a few-shot retrieval framework that balances diversity and similarity of exemplars at both the latent semantic representation level and the legal text structure level. In addition, our method explicitly accounts for legal structure by mitigating entity-induced retrieval bias in legal texts, where lengthy and highly specific entity mentions often dominate semantic representations and obscure legally meaningful reasoning patterns. Our Legal2LogicICL constructs informative and robust few-shot demonstrations, leading to accurate and stable logical rule generation without requiring additional training. In addition, we construct a new dataset, named Legal2Proleg, which is annotated with alignments between legal cases and PROLEG logical formulas to support the evaluation of legal semantic parsing. Experimental results on both open-source and proprietary LLMs demonstrate that our approach significantly improves accuracy, stability, and generalization in transforming natural-language legal case descriptions into logical representations, highlighting its effectiveness for interpretable and reliable legal reasoning. Our code is available at https://github.com/yingjie7/Legal2LogicICL.
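The retrieval idea, balancing similarity to the query against diversity among exemplars while neutralizing dominant entity mentions, can be sketched with a standard maximal-marginal-relevance loop. Everything below (the masking rule, the embedding interface, the trade-off weight `lam`) is our illustration under assumptions, not the released code.

```python
# Hedged sketch of entity-masked, diversity-aware exemplar selection.
import numpy as np

def mask_entities(text: str, entities: list[str]) -> str:
    # Replace long, highly specific entity mentions so they do not
    # dominate the semantic embedding of the case description.
    for e in entities:
        text = text.replace(e, "[ENTITY]")
    return text

def select_exemplars(query_vec: np.ndarray, cand_vecs: np.ndarray,
                     k: int = 4, lam: float = 0.7) -> list[int]:
    """Maximal marginal relevance over unit-normalized embeddings."""
    chosen: list[int] = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(chosen) < k:
        def score(i: int) -> float:
            sim_q = float(cand_vecs[i] @ query_vec)          # similarity to query
            sim_c = max((float(cand_vecs[i] @ cand_vecs[j])  # redundancy with picks
                         for j in chosen), default=0.0)
            return lam * sim_q - (1 - lam) * sim_c
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```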
[124] Evaluating Cooperation in LLM Social Groups through Elected Leadership
Ryan Faulkner, Anushka Deshpande, David Guzman Piedrahita, Joel Z. Leibo, Zhijing Jin
Main category: cs.CL
TL;DR: LLM-based multi-agent simulation shows elected leadership improves social welfare by 55.4% and survival time by 128.6% in common-pool resource governance scenarios.
Details
Motivation: Existing multi-agent research lacks insight into whether structured leadership and election mechanisms can improve collective decision-making in common-pool resource governance, despite these being critical organizational features in human society.
Method: Developed an open-source framework simulating leadership through elected personas and candidate-driven agendas, conducted empirical study of LLMs under controlled governance conditions, analyzed social influence via agent social graphs and centrality metrics, and performed sentiment analysis on leader utterances.
Result: Elected leadership improves social welfare scores by 55.4% and survival time by 128.6% across high-performing LLMs. Social graph analysis revealed leader influence patterns, and sentiment analysis showed rhetorical and cooperative tendencies in leader communications.
Conclusion: Leadership and election mechanisms significantly enhance cooperation and social welfare in multi-agent systems, laying foundation for further study of organizational structures in navigating complex social dilemmas.
Abstract: Governing common-pool resources requires agents to develop enduring strategies through cooperation and self-governance to avoid collective failure. While foundation models have shown potential for cooperation in these settings, existing multi-agent research provides little insight into whether structured leadership and election mechanisms can improve collective decision making. The lack of such a critical organizational feature ubiquitous in human society presents a significant shortcoming of the current methods. In this work we aim to directly address whether leadership and elections can support improved social welfare and cooperation through multi-agent simulation with LLMs. We present our open-source framework that simulates leadership through elected personas and candidate-driven agendas and carry out an empirical study of LLMs under controlled governance conditions. Our experiments demonstrate that having elected leadership improves social welfare scores by 55.4% and survival time by 128.6% across a range of high performing LLMs. Through the construction of an agent social graph we compute centrality metrics to assess the social influence of leader personas and also analyze rhetorical and cooperative tendencies revealed through a sentiment analysis on leader utterances. This work lays the foundation for further study of election mechanisms in multi-agent systems toward navigating complex social dilemmas.
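The social-graph analysis can be pictured in a few lines of networkx. How edges are extracted from conversation transcripts is our assumption; the centrality calls are standard library functions.

```python
# Sketch: build a directed agent interaction graph and measure the elected
# leader's influence. Edge construction from transcripts is assumed here.
import networkx as nx

interactions = [("agent_1", "leader"), ("agent_2", "leader"),
                ("leader", "agent_1"), ("agent_3", "leader")]

G = nx.DiGraph()
G.add_edges_from(interactions)  # edge u -> v: u addresses or replies to v

print(nx.in_degree_centrality(G)["leader"])  # how often others engage the leader
print(nx.pagerank(G)["leader"])              # global influence estimate
```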
[125] Discourse Diversity in Multi-Turn Empathic Dialogue
Hongli Zhan, Emma S. Gueorguieva, Javier Hernandez, Jina Suh, Desmond C. Ong, Junyi Jessy Li
Main category: cs.CL
TL;DR: LLMs show formulaic discourse move repetition in multi-turn empathic dialogue, with MINT framework improving empathy and reducing repetition through reinforcement learning with cross-turn tactic novelty signals.
Details
Motivation: While LLMs produce empathic responses in single-turn settings, they exhibit formulaic patterns that may extend to discourse moves (what responses do for the person). This is particularly problematic for empathic dialogue where varied strategies are needed as conversations unfold, but prior work shows LLMs reuse tactic sequences more than humans.
Method: The paper introduces MINT (Multi-turn Inter-tactic Novelty Training), a reinforcement learning framework that optimizes discourse move diversity across multi-turn empathic dialogue. The best variant combines an empathy quality reward with a cross-turn tactic novelty signal.
Result: LLMs reuse tactics at nearly double the rate of humans in multi-turn conversations (0.50-0.56 vs. 0.27). MINT improves aggregate empathy by 25.3% over vanilla models and reduces cross-turn discourse move repetition by 26.3% on 4B models, surpassing all baselines including quality-only and token-level diversity methods.
Conclusion: Current LLMs lack not empathy itself, but the ability to vary their discourse moves across conversations. MINT effectively addresses this limitation by optimizing for discourse move diversity while maintaining or improving empathy quality.
Abstract: Large language models (LLMs) produce responses rated as highly empathic in single-turn settings (Ayers et al., 2023; Lee et al., 2024), yet they are also known to be formulaic generators that reuse the same lexical patterns, syntactic templates, and discourse structures across tasks (Jiang et al., 2025; Shaib et al., 2024; Namuduri et al., 2025). Less attention has been paid to whether this formulaicity extends to the level of discourse moves, i.e., what a response does for the person it is addressing. This question is especially consequential for empathic dialogue, where effective support demands not just a kind response at one moment but varied strategies as a conversation unfolds (Stiles et al., 1998). Indeed, prior work shows that LLMs reuse the same tactic sequences more than human supporters in single-turn settings (Gueorguieva et al., 2026). We extend this analysis to multi-turn conversations and find that the rigidity compounds: once a tactic appears in a supporter turn, LLMs reuse it in the next at nearly double the rate of humans (0.50-0.56 vs. 0.27). This pattern holds across LLMs serving as supporters in real emotional support conversations, and is invisible to standard similarity metrics. To address this gap, we introduce MINT (Multi-turn Inter-tactic Novelty Training), the first reinforcement learning framework to optimize discourse move diversity across multi-turn empathic dialogue. The best MINT variant combines an empathy quality reward with a cross-turn tactic novelty signal, improving aggregate empathy by 25.3% over vanilla across 1.7B and 4B models while reducing cross-turn discourse move repetition by 26.3% on the 4B model, surpassing all baselines including quality-only and token-level diversity methods on both measures. These results suggest that what current models lack is not empathy itself, but the ability to vary their discourse moves across a conversation.
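As a hedged illustration of the best MINT variant's reward shape, empathy quality plus a cross-turn tactic-novelty bonus, consider the sketch below. The tactic classifier, set representation, and weight `beta` are assumptions, not the paper's components.

```python
# Hypothetical MINT-style reward: quality plus a bonus for tactics that were
# not used in the previous supporter turn. All names here are illustrative.
def mint_reward(response_tactics: set[str], prev_turn_tactics: set[str],
                empathy_score: float, beta: float = 0.5) -> float:
    if not response_tactics:
        return empathy_score
    novel = response_tactics - prev_turn_tactics
    novelty = len(novel) / len(response_tactics)  # fraction of unrepeated tactics
    return empathy_score + beta * novelty
```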
[126] LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, Ge Liu
Main category: cs.CL
TL;DR: LangFlow is a continuous diffusion language model that matches discrete diffusion performance through Bregman divergence connection to Flow Matching, with innovations in ODE-based NLL bound, learnable noise scheduling, and self-conditioning training.
Details
Motivation: Prior continuous diffusion language models lag behind discrete counterparts in performance. The authors aim to close this gap by developing a continuous DLM that can rival discrete diffusion models in language modeling tasks.
Method: Connects embedding-space DLMs to Flow Matching via Bregman divergence. Introduces three key innovations: (1) ODE-based NLL bound for principled evaluation, (2) information-uniform principle for noise scheduling with learnable Gumbel-based scheduler, and (3) improved training protocol with self-conditioning.
Result: Achieves perplexity of 30.0 on LM1B and 24.6 on OpenWebText. Matches top discrete DLMs at comparable scale and surpasses autoregressive baselines in zero-shot transfer across multiple benchmarks.
Conclusion: LangFlow demonstrates that continuous diffusion is a competitive and promising paradigm for language modeling, closing the performance gap with discrete diffusion models.
Abstract: Continuous diffusion models have achieved strong performance across domains such as images. However, in language modeling, prior continuous diffusion language models (DLMs) lag behind discrete counterparts. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion. Our approach connects embedding-space DLMs to Flow Matching via Bregman divergence and introduces three key innovations: (1) a novel ODE-based NLL bound for principled evaluation of continuous flow-based language models; (2) an information-uniform principle for noise scheduling, motivating a learnable scheduler based on a Gumbel distribution; and (3) an improved training protocol incorporating self-conditioning, which enhances both likelihood and sample quality. LangFlow achieves strong performance across benchmarks, reaching a perplexity (PPL) of 30.0 on LM1B and 24.6 on OpenWebText. It matches top discrete DLMs at comparable scale and surpasses autoregressive baselines in zero-shot transfer across multiple benchmarks. LangFlow provides clear evidence that continuous diffusion is a competitive and promising paradigm for language modeling. https://github.com/nealchen2003/LangFlow
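One plausible reading of the "learnable scheduler based on a Gumbel distribution" is a noise schedule parameterized by the Gumbel CDF with learnable location and scale. The sketch below is our guess at that shape, not the paper's code.

```python
# Hypothetical learnable Gumbel-CDF scheduler. Parameterization is assumed.
import torch
import torch.nn as nn

class GumbelScheduler(nn.Module):
    def __init__(self):
        super().__init__()
        self.mu = nn.Parameter(torch.tensor(0.5))         # learnable location
        self.log_beta = nn.Parameter(torch.tensor(-1.0))  # learnable log-scale

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        """Map diffusion time t in [0, 1] to a level via the Gumbel CDF."""
        beta = self.log_beta.exp()
        return torch.exp(-torch.exp(-(t - self.mu) / beta))
```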
[127] HistLens: Mapping Idea Change across Concepts and Corpora
Yi Jing, Weiyun Qiu, Yihang Peng, Zhifang Sui
Main category: cs.CL
TL;DR: HistLens is a unified framework for multi-concept, multi-corpus conceptual history analysis using SAE-based representations to track semantic evolution across time and sources.
Details
Motivation: Existing computational approaches for diachronic semantics are limited to single concepts or corpora, making cross-source comparisons difficult, and rely on surface lexical evidence that misses implicit concept expressions.
Method: Proposes HistLens, a SAE-based framework that decomposes concept representations into interpretable features and tracks their activation dynamics over time and across sources within a shared coordinate system.
Result: Experiments on long-span press corpora show HistLens supports cross-concept, cross-corpus computation of idea evolution patterns and enables implicit concept computation.
Conclusion: HistLens bridges conceptual modeling with interpretive needs, broadening analytical perspectives and methodological repertoire for social science and humanities diachronic text analysis.
Abstract: Language change both reflects and shapes social processes, and the semantic evolution of foundational concepts provides a measurable trace of historical and social transformation. Despite recent advances in diachronic semantics and discourse analysis, existing computational approaches often (i) concentrate on a single concept or a single corpus, making findings difficult to compare across heterogeneous sources, and (ii) remain confined to surface lexical evidence, offering insufficient computational and interpretive granularity when concepts are expressed implicitly. We propose HistLens, a unified, SAE-based framework for multi-concept, multi-corpus conceptual-history analysis. The framework decomposes concept representations into interpretable features and tracks their activation dynamics over time and across sources, yielding comparable conceptual trajectories within a shared coordinate system. Experiments on long-span press corpora show that HistLens supports cross-concept, cross-corpus computation of patterns of idea evolution and enables implicit concept computation. By bridging conceptual modeling with interpretive needs, HistLens broadens the analytical perspectives and methodological repertoire available to social science and the humanities for diachronic text analysis.
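The core computation, tracking a sparse-autoencoder feature's activation over time in a shared coordinate system, might look like the sketch below. The `sae_encode` interface, tensor shapes, and per-year averaging are our assumptions.

```python
# Hedged sketch of a conceptual trajectory: average SAE feature activation
# per year across a corpus. Interfaces and shapes are assumed.
import torch

def concept_trajectory(doc_embeddings: torch.Tensor,  # (n_docs, d_model)
                       years: torch.Tensor,           # (n_docs,)
                       sae_encode,                     # d_model -> n_features
                       feature_id: int) -> dict[int, float]:
    acts = sae_encode(doc_embeddings)[:, feature_id]   # (n_docs,)
    return {int(y): acts[years == y].mean().item() for y in years.unique()}
```

Because the trajectory lives in the SAE's feature space rather than in corpus-specific lexical statistics, the same computation applied to a second corpus yields directly comparable curves.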
[128] Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks
Yoonsang Lee, Howard Yen, Xi Ye, Danqi Chen
Main category: cs.CL
TL;DR: AggAgent: A novel agentic aggregation method for parallel test-time scaling in long-horizon agentic tasks that treats parallel trajectories as an environment and uses lightweight tools to synthesize information across multiple rollouts.
Details
Motivation: Parallel test-time scaling has been effective for chain-of-thought reasoning but faces unique challenges in agentic tasks: long, multi-turn, tool-augmented trajectories with open-ended outputs. Existing methods either discard rich trajectory information or exceed context windows when concatenating all trajectories.
Method: Proposes AggAgent, an aggregation agent that treats parallel trajectories as an environment. Equips it with lightweight tools to inspect candidate solutions and search across trajectories, enabling on-demand navigation and synthesis of information from multiple parallel rollouts.
Result: Outperforms all existing aggregation methods by up to 5.3% absolute on average and 10.3% on two deep research tasks across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), while adding minimal overhead with aggregation cost bounded by a single agentic rollout.
Conclusion: Agentic aggregation is an effective and cost-efficient approach to parallel test-time scaling for long-horizon agentic tasks, establishing a new paradigm for leveraging multiple parallel trajectories in complex reasoning and research scenarios.
Abstract: We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from trajectories, while concatenating all trajectories exceeds the model’s context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods, by up to 5.3% absolute on average and 10.3% on two deep research tasks, while adding minimal overhead, as the aggregation cost remains bounded by a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.
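To make "trajectories as an environment" concrete, here is a hedged sketch of two lightweight tools such an agent could expose. The tool names, schemas, and keyword-search implementation are our illustration, not the paper's interface.

```python
# Hypothetical tool interface over parallel rollouts: the aggregation agent
# reads evidence on demand instead of concatenating everything into context.
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list[str]   # tool calls and observations, in order
    final_answer: str

def inspect_solution(trajs: list[Trajectory], idx: int) -> str:
    """Return one candidate's final answer for close inspection."""
    return trajs[idx].final_answer

def search_trajectories(trajs: list[Trajectory], query: str) -> list[tuple[int, str]]:
    """Keyword search across all intermediate steps of all rollouts."""
    hits = []
    for i, t in enumerate(trajs):
        for step in t.steps:
            if query.lower() in step.lower():
                hits.append((i, step))
    return hits
```

This framing also explains the cost bound: the aggregator itself runs a single agentic rollout over these tools, regardless of how many parallel trajectories were sampled.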
[129] General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks
Junlin Liu, Shengnan An, Shuang Zhou, Dan Ma, Shixiong Luo, Ying Xie, Yuan Zhang, Wenling Yuan, Yifan Zhou, Xiaoyu Li, Ziwen Wang, Xuezhi Cao, Xunliang Cai
Main category: cs.CL
TL;DR: General365 benchmark assesses LLM general reasoning by restricting knowledge to K-12 level, revealing current models struggle with complex constraints and semantic interference despite strong domain-specific performance.
Details
Motivation: Current LLMs excel in domain-specific reasoning (math/physics) but their ability to generalize reasoning to broader contexts with complex constraints and semantic interference remains under-explored. There's a need to decouple reasoning from specialized expertise.
Method: Created General365 benchmark with 365 seed problems and 1,095 variants across 8 categories, restricting background knowledge to K-12 level to explicitly decouple reasoning from specialized expertise. Evaluated 26 leading LLMs on this benchmark.
Result: Top-performing LLM achieved only 62.8% accuracy, in stark contrast to near-perfect performances in math/physics benchmarks. Results show current LLM reasoning abilities are heavily domain-dependent with significant room for improvement.
Conclusion: General reasoning remains a major challenge for LLMs despite strong domain-specific performance. General365 serves as a catalyst for advancing LLM reasoning toward robust, general-purpose real-world applications beyond specialized domains.
Abstract: Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts–often termed general reasoning–remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performances of LLMs in math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, Dataset, and Leaderboard: https://general365.github.io
[130] C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts
Chenxi Qing, Junxi Wu, Zheng Liu, Yixiang Qiu, Hongyao Yu, Bin Chen, Hao Wu, Shu-Tao Xia
Main category: cs.CL
TL;DR: C-ReD is a comprehensive Chinese AI-generated text detection benchmark addressing limitations in existing Chinese detection datasets through diverse models, domains, and realistic prompts.
Details
Motivation: While LLMs offer convenience, they also introduce risks like phishing and academic dishonesty. Existing Chinese detection benchmarks suffer from limited model diversity, data homogeneity, and lack of prompt realism, creating gaps in reliable AI-generated text detection for Chinese corpora.
Method: Proposes C-ReD benchmark with comprehensive coverage: diverse Chinese LLMs, multiple domains (news, academic, creative writing), and realistic prompts. Focuses on addressing model diversity, domain coverage, and prompt realism limitations of prior Chinese benchmarks.
Result: C-ReD enables reliable in-domain detection and shows strong generalization to unseen LLMs and external Chinese datasets. Successfully addresses critical gaps in model diversity, domain coverage, and prompt realism that limited prior Chinese detection benchmarks.
Conclusion: C-ReD provides a comprehensive benchmark for Chinese AI-generated text detection that overcomes limitations of existing datasets, supporting both reliable detection and generalization to new models and domains.
Abstract: Large language models (LLMs) are now capable of generating highly fluent textual content. While they offer significant convenience to humans, they also introduce various risks, like phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI-generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C-ReD: a comprehensive Chinese Real-prompt AI-generated Detection benchmark. Experiments demonstrate that C-ReD not only enables reliable in-domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets, addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C-ReD.
[131] CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation
WonJin Yoon, Kangyu Zhu, Ian Bulovic, Autumn Sehy, Yanjun Gao, Dmitriy Dligach, Majid Afshar, Timothy A. Miller
Main category: cs.CL
TL;DR: CLSGen is a novel LLM fine-tuning framework for binary classification that enables reliable probability estimation while preserving explanation-generation capabilities, addressing catastrophic forgetting in traditional fine-tuning approaches.
Details
Motivation: LLMs show promise for complex decision-making but lack reliable quantitative probability estimation. Traditional fine-tuning for probability estimation causes catastrophic forgetting and linguistic collapse, destroying explanation capabilities needed for interpretability.
Method: Proposes CLSGen framework with new model architecture, training methodology, and data construction strategy specifically for binary classification tasks to enable robust probability estimation while preserving explanation generation.
Result: CLSGen outperforms baselines on classification metrics (AUROC and F1-score) across multiple benchmark datasets. Generated explanations show strong label-justification alignment and high readability.
Conclusion: CLSGen successfully enables LLMs to provide reliable probability estimates for binary classification without sacrificing explanation capabilities, addressing a critical deployment hurdle for practical decision-making applications.
Abstract: With the recent progress of Large Language Models (LLMs), there is a growing interest in applying these models to solve complex and challenging problems. Modern LLMs, capable of processing long contexts and generating verbalized explanations, offer significant potential in addressing real-world applications. However, a critical hurdle in deploying LLMs for practical decision-making is their inability to provide reliable, quantitative probabilities. While task-specific fine-tuning of LLMs using traditional discriminative objectives (similar to encoder-only models) can yield probability estimates, this often leads to catastrophic forgetting and linguistic collapse. Consequently, the model loses its ability to generate explanations, severely undermining its interpretability and usability. To address this challenge, we propose CLSGen, a novel LLM fine-tuning framework designed for binary classification tasks. The CLSGen framework encompasses a new model architecture, training methodology, and data construction strategy to enable robust probability estimation without sacrificing the model’s inherent explanation-generation capabilities. Experimental results across multiple benchmark datasets demonstrate that models fine-tuned with CLSGen outperform existing baselines in classification metrics (AUROC and F1-score). Regarding explanation, the results showed strong alignment between predicted labels and generated justifications, as well as high readability.
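A minimal sketch of a dual-head design in this spirit appears below: a generative LM head for explanations plus a classification head for probabilities. The HF-style backbone interface, last-token pooling, and loss handling are assumptions, not CLSGen's actual architecture.

```python
# Hedged sketch: decoder LM retains its generation objective while a small
# classification head yields a calibrated binary probability.
import torch
import torch.nn as nn

class DualHead(nn.Module):
    def __init__(self, backbone: nn.Module, hidden: int):
        super().__init__()
        self.backbone = backbone              # assumed HF-style causal LM
        self.cls_head = nn.Linear(hidden, 1)

    def forward(self, input_ids, labels=None):
        out = self.backbone(input_ids=input_ids, labels=labels,
                            output_hidden_states=True)
        pooled = out.hidden_states[-1][:, -1]        # last-token representation
        prob = torch.sigmoid(self.cls_head(pooled))  # binary class probability
        # out.loss is the LM loss (None without labels); keeping it in the
        # training objective is what preserves explanation generation.
        return out.loss, prob
```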
[132] Psychological Concept Neurons: Can Neural Control Bias Probing and Shift Generation in LLMs?
Yuto Harada, Hiro Taiyo Hamada
Main category: cs.CL
TL;DR: LLMs can represent Big Five personality traits internally, with concept-selective neurons in mid-layers that can be manipulated to bias representations, but behavioral control over generated labels is weaker and less precise.
Details
Motivation: While LLMs can imitate personality profiles and predict user personality using psychological constructs like the Big Five, it's unclear where and how these personality representations are stored internally and how they relate to behavioral outputs.
Method: Used probing to examine emergence of Big Five information across model layers, identified concept-selective neurons for each trait, and conducted interventions (enhancing/suppressing activations) to test causal relationships between representations and behavioral outputs.
Result: Big Five information becomes decodable early and persists through layers; concept-selective neurons are most prevalent in mid-layers with limited cross-domain overlap. Interventions successfully shift probe readouts (targeted success rates >0.8) but have weaker, more variable effects on generated labels with cross-trait spillover.
Conclusion: There’s a gap between representational control and behavioral control in LLMs - while internal personality representations can be causally steered, comparable control over generated behavioral outputs is more difficult even with interventions on concept-selective neurons.
Abstract: Using psychological constructs such as the Big Five, large language models (LLMs) can imitate specific personality profiles and predict a user’s personality. While LLMs can exhibit behaviors consistent with these constructs, it remains unclear where and how they are represented inside the model and how they relate to behavioral outputs. To address this gap, we focus on questionnaire-operationalized Big Five concepts, analyze the formation and localization of their internal representations, and use interventions to examine how these representations relate to behavioral outputs. In our experiment, we first use probing to examine where Big Five information emerges across model depth. We then identify neurons that respond selectively to each Big Five concept and test whether enhancing or suppressing their activations can bias latent representations and label generation in intended directions. We find that Big Five information becomes rapidly decodable in early layers and remains detectable through the final layers, while concept-selective neurons are most prevalent in mid layers and exhibit limited overlap across domains. Interventions on these neurons consistently shift probe readouts toward targeted concepts, with targeted success rates exceeding 0.8 for some concepts, indicating that the model’s internal separation of Big Five personality traits can be causally steered. At the label-generation level, the same interventions often bias generated label distributions in the intended directions, but the effects are weaker, more concept-dependent, and often accompanied by cross-trait spillover, indicating that comparable control over generated labels is difficult even with interventions on a large fraction of concept-selective neurons. Overall, our findings reveal a gap between representational control and behavioral control in LLMs.
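Enhancing or suppressing concept-selective neurons can be prototyped with a PyTorch forward hook, as sketched below. The layer path in the usage comment and the scaling factor are assumptions; the hook mechanism itself is standard PyTorch.

```python
# Sketch: scale selected neuron activations via a forward hook.
import torch

def steer_neurons(module: torch.nn.Module, neuron_ids: list[int], alpha: float):
    """Multiply chosen activations by alpha (>1 enhances, <1 suppresses)."""
    def hook(_mod, _inp, output):
        output[..., neuron_ids] = output[..., neuron_ids] * alpha
        return output
    return module.register_forward_hook(hook)

# Hypothetical usage (layer index and ids are placeholders):
# handle = steer_neurons(model.transformer.h[15].mlp, selected_ids, alpha=3.0)
# ... run generation, read out probes ...
# handle.remove()
```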
[133] Saar-Voice: A Multi-Speaker Saarbrücken Dialect Speech Corpus
Lena S. Oberkircher, Jesujoba O. Alabi, Dietrich Klakow, Jürgen Trouvain
Main category: cs.CL
TL;DR: Saar-Voice: A 6-hour speech corpus for the Saarbrücken German dialect with aligned text-audio data to address dialect underrepresentation in NLP/speech technologies.
Details
Motivation: Dialects are culturally significant but underrepresented in linguistic resources and computational models, leading to performance disparities. Standardized language varieties dominate NLP and speech technologies, creating a need for dialect-specific resources.
Method: Created a 6-hour speech corpus by collecting Saarbrücken dialect text from digitized books and local materials, then recording a subset with 9 speakers. Conducted analyses on text and speech components, addressing orthographic/speaker variation and grapheme-to-phoneme conversion challenges.
Result: Produced Saar-Voice corpus with aligned textual and audio representations for the Saarbrücken German dialect, enabling dialect-aware text-to-speech research in low-resource scenarios.
Conclusion: The corpus provides a foundation for future research on dialect-aware TTS, particularly for zero-shot and few-shot model adaptation in low-resource dialect scenarios.
Abstract: Natural language processing (NLP) and speech technologies have made significant progress in recent years; however, they remain largely focused on standardized language varieties. Dialects, despite their cultural significance and widespread use, are underrepresented in linguistic resources and computational models, resulting in performance disparities. To address this gap, we introduce Saar-Voice, a six-hour speech corpus for the Saarbrücken dialect of German. The dataset was created by first collecting text through digitized books and locally sourced materials. A subset of this text was recorded by nine speakers, and we conducted analyses on both the textual and speech components to assess the dataset’s characteristics and quality. We discuss methodological challenges related to orthographic and speaker variation, and explore grapheme-to-phoneme (G2P) conversion. The resulting corpus provides aligned textual and audio representations. This serves as a foundation for future research on dialect-aware text-to-speech (TTS), particularly in low-resource scenarios, including zero-shot and few-shot model adaptation.
[134] Template-assisted Contrastive Learning of Task-oriented Dialogue Sentence Embeddings
Minsik Oh, Jiwei Li, Guoyin Wang
Main category: cs.CL
TL;DR: TaDSE learns dialogue sentence embeddings using template information via contrastive learning, achieving SOTA on dialogue benchmarks.
Details
Motivation: Learning high-quality sentence embeddings from dialogues is essential for dialogue tasks but difficult due to scarce utterance relationship annotations, while token-level annotations like entities and templates are easier to obtain.
Method: Template-aware Dialogue Sentence Embedding (TaDSE) uses template information to learn utterance embeddings via self-supervised contrastive learning, enhanced with a synthetically augmented dataset that diversifies utterance-template association, in which slot-filling is a preliminary step.
Result: TaDSE achieves significant improvements over previous SOTA methods on five downstream benchmark dialogue datasets and introduces a novel semantic compression test showing correlation with uniformity and alignment.
Conclusion: Template information effectively improves dialogue sentence embeddings through contrastive learning, with synthetic augmentation further enhancing performance across multiple dialogue tasks.
Abstract: Learning high-quality sentence embeddings from dialogues has drawn increasing attention, as it is essential for solving a variety of dialogue-oriented tasks at low annotation cost. Annotating and gathering utterance relationships in conversations is difficult, while token-level annotations, e.g., entities, slots and templates, are much easier to obtain. Other sentence embedding methods are usually sentence-level self-supervised frameworks and cannot utilize token-level extra knowledge. We introduce Template-aware Dialogue Sentence Embedding (TaDSE), a novel augmentation method that utilizes template information to learn utterance embeddings via a self-supervised contrastive learning framework. We further enhance the effect with a synthetically augmented dataset that diversifies utterance-template association, in which slot-filling is a preliminary step. We evaluate TaDSE performance on five downstream benchmark dialogue datasets. The experiment results show that TaDSE achieves significant improvements over previous SOTA methods for dialogue. We further introduce a novel analytic instrument, a semantic compression test, for which we discover a correlation with uniformity and alignment. Our code is available at https://github.com/minsik-ai/Template-Contrastive-Embedding
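A template-aware contrastive objective can be sketched as in-batch InfoNCE where each utterance and its template form a positive pair. The temperature and batching details below follow common practice and are assumptions, not the released training code.

```python
# Hedged sketch: utterance-template InfoNCE with in-batch negatives.
import torch
import torch.nn.functional as F

def info_nce(utt_emb: torch.Tensor, tpl_emb: torch.Tensor, tau: float = 0.05):
    """utt_emb, tpl_emb: (batch, dim); row i of each is a positive pair."""
    utt = F.normalize(utt_emb, dim=-1)
    tpl = F.normalize(tpl_emb, dim=-1)
    logits = utt @ tpl.T / tau  # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal
    return F.cross_entropy(logits, targets)
```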
[135] Language Reconstruction with Brain Predictive Coding from fMRI Data
Congchi Yin, Ziyi Ye, Piji Li
Main category: cs.CL
TL;DR: PredFT is an fMRI-to-text decoding model that incorporates predictive coding theory to improve language reconstruction from brain signals by leveraging the brain’s natural tendency to predict future words.
Details
Motivation: Current brain signal decoding methods lack effective use of semantic information. Predictive coding theory suggests the brain naturally predicts future words, which could improve language reconstruction from fMRI data.
Method: Proposes PredFT with main and side networks. Side network extracts predictive representations from brain ROIs using self-attention, then fuses them into main network for continuous language decoding.
Result: Outperforms current decoding models on multiple evaluation metrics across two naturalistic language comprehension fMRI datasets.
Conclusion: Incorporating predictive coding theory improves fMRI-to-text decoding, demonstrating the value of leveraging the brain’s natural predictive mechanisms for language reconstruction.
Abstract: Many recent studies have shown that the perception of speech can be decoded from brain signals and subsequently reconstructed as continuous language. However, there is a lack of neurological basis for how the semantic information embedded within brain signals can be used more effectively to guide language reconstruction. Predictive coding theory suggests the human brain naturally engages in continuously predicting future words that span multiple timescales. This implies that the decoding of brain signals could potentially be associated with a predictable future. To explore the predictive coding theory within the context of language reconstruction, this paper proposes PredFT (fMRI-to-Text decoding with Predictive coding). PredFT consists of a main network and a side network. The side network obtains a brain predictive representation from related regions of interest (ROIs) with a self-attention module. The representation is then fused into the main network for continuous language decoding. Experiments on two naturalistic language comprehension fMRI datasets show that PredFT outperforms current decoding models on several evaluation metrics.
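The side-network fusion might be pictured as self-attention over ROI features whose summary is added to the decoder states. Dimensions, the pooling step, and the additive fusion rule below are our assumptions, not PredFT's published architecture.

```python
# Hedged sketch of a side network fusing ROI-derived predictive features
# into the main decoder's hidden states.
import torch
import torch.nn as nn

class ROISideNetwork(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, roi_feats: torch.Tensor, dec_states: torch.Tensor):
        # roi_feats: (batch, n_rois, d); dec_states: (batch, seq, d)
        pred, _ = self.attn(roi_feats, roi_feats, roi_feats)  # predictive repr.
        return dec_states + self.proj(pred.mean(dim=1, keepdim=True))
```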
[136] LaMI: Augmenting Large Language Models via Late Multi-Image Fusion
Guy Yariv, Idan Schwartz, Yossi Adi, Sagie Benaim
Main category: cs.CL
TL;DR: Late multi-image fusion method that enhances LLMs with visual commonsense by generating multiple images from text prompts and combining their predictions with text-only LLM outputs through a late-fusion layer.
Details
Motivation: LLMs lack visual grounding for commonsense reasoning, while VLMs have reduced text-only reasoning performance and require costly multimodal training. Need a method that improves visual commonsense without harming textual reasoning.
Method: Proposes late multi-image fusion: generates multiple images from text prompts using lightweight parallel sampling, then combines their prediction probabilities with text-only LLM outputs through a late-fusion layer that integrates projected visual features before final prediction.
Result: Significantly outperforms augmented LLMs on visual reasoning, matches VLMs on vision-based tasks, and improves NLP performance when applied to strong LLMs like LLaMA 3, with modest test-time overhead.
Conclusion: Late multi-image fusion effectively enhances LLMs with visual commonsense reasoning capabilities while maintaining or improving text-only performance, offering a cost-effective alternative to full multimodal training.
Abstract: Commonsense reasoning often requires both textual and visual knowledge, yet Large Language Models (LLMs) trained solely on text lack visual grounding (e.g., “what color is an emperor penguin’s belly?”). Visual Language Models (VLMs) perform better on visually grounded tasks but face two limitations: (i) often reduced performance on text-only commonsense reasoning compared to text-trained LLMs, and (ii) adapting newly released LLMs to vision input typically requires costly multimodal training. An alternative augments LLMs with test-time visual signals, improving visual commonsense without harming textual reasoning, but prior designs often rely on early fusion and a single image, which can be suboptimal. We propose a late multi-image fusion method: multiple images are generated from the text prompt with a lightweight parallel sampling, and their prediction probabilities are combined with those of a text-only LLM through a late-fusion layer that integrates projected visual features just before the final prediction. Across visual commonsense and NLP benchmarks, our method significantly outperforms augmented LLMs on visual reasoning, matches VLMs on vision-based tasks, and, when applied to strong LLMs such as LLaMA 3, also improves NLP performance while adding only modest test-time overhead. Project page is available at: https://guyyariv.github.io/LaMI.
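Late fusion at the probability level can be sketched as follows: average the image-conditioned next-token distributions and mix them with the text-only distribution. The paper's actual fusion layer is learned; the fixed weight `w` and simple averaging are our simplifications.

```python
# Hedged sketch of late multi-image fusion over next-token distributions.
import torch

def late_fuse(text_logits: torch.Tensor,          # (vocab,)
              image_logits: list[torch.Tensor],   # one (vocab,) per image
              w: float = 0.3) -> torch.Tensor:
    p_text = torch.softmax(text_logits, dim=-1)
    p_img = torch.stack([torch.softmax(l, dim=-1) for l in image_logits]).mean(0)
    return (1 - w) * p_text + w * p_img  # fused next-token distribution
```

Because fusion happens only at the final prediction, the text-only path is left untouched, which is why the approach avoids the text-reasoning regressions seen in fully multimodal training.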
[137] A Multilingual Dataset and Empirical Validation for the Mutual Reinforcement Effect in Information Extraction
Chengguang Gan, Sunbowen Lee, Qingyu Yin, Yunhao Liang, Xinyang He, Hanjun Wei, Younghun Lim, Shijian Wang, Hexiang Huang, Qinghao Zhang, Shiwen Ni, Tatsunori Mori
Main category: cs.CL
TL;DR: The paper introduces Multilingual MRE Mix (MMM) dataset covering English, Japanese, and Chinese to empirically validate the Mutual Reinforcement Effect in information extraction across languages, showing 76% of sub-datasets exhibit MRE.
Details
Motivation: Prior work reported Mutual Reinforcement Effect (MRE) in Japanese where word-level and sentence-level tasks mutually improve each other, but its generality across languages and task settings lacked empirical validation due to missing multilingual MRE datasets.
Method: Introduced MMM dataset with 21 sub-datasets across three languages, using LLM-assisted dataset translation and alignment framework to reduce manual annotation. Adopted unified input-output framework to train open-domain information extraction model with full fine-tuning ablations and knowledgeable verbalizers based on MRE-mix data.
Result: Experimental results show 76% of MMM sub-datasets consistently exhibit Mutual Reinforcement Effect across languages, providing systematic empirical validation of MRE in multilingual settings.
Conclusion: The study provides empirical evidence for the generality of Mutual Reinforcement Effect across languages and demonstrates its practical value for information extraction through systematic multilingual validation.
Abstract: The Mutual Reinforcement Effect (MRE) describes a phenomenon in information extraction where word-level and sentence-level tasks can mutually improve each other when jointly modeled. While prior work has reported MRE in Japanese, its generality across languages and task settings has not been empirically validated, largely due to the lack of multilingual MRE datasets. To address this limitation, we introduce the Multilingual MRE Mix dataset (MMM), which consists of 21 sub-datasets covering English, Japanese, and Chinese. We propose an LLM-assisted dataset translation and alignment framework that significantly reduces manual annotation effort while preserving the structural requirements of MRE tasks. Building on MMM, we adopt a unified input-output framework to train an open-domain information extraction model and conduct extensive empirical studies, including full fine-tuning ablations and the construction of knowledgeable verbalizers based on MRE-mix data. Experimental results show that 76 percent of the MMM sub-datasets consistently exhibit the Mutual Reinforcement Effect across languages. These findings provide systematic empirical validation of MRE in multilingual settings and demonstrate its practical value for information extraction.
[138] MegaFake: A Theory-Driven Dataset of Fake News Generated by Large Language Models
Lionel Z. Wang, Ka Chung Ng, Yiming Ma, Wenqi Fan
Main category: cs.CL
TL;DR: Researchers develop LLM-Fake Theory to explain machine-generated deception and create MegaFake dataset for studying LLM-generated fake news detection.
Details
Motivation: Large language models can generate highly convincing fake news at scale, posing significant threats to online information integrity. Understanding the motivations and mechanisms behind LLM-generated fake news is crucial for effective detection and governance.
Method: Developed LLM-Fake Theory integrating social psychology theories to explain machine-generated deception. Designed an innovative prompt engineering pipeline that automates fake news generation using LLMs without manual annotation. Created MegaFake dataset derived from FakeNewsNet using this pipeline.
Result: Created a theoretically informed machine-generated fake news dataset (MegaFake) and advanced both theoretical understanding of human-machine deception mechanisms and practical approaches to fake news detection in the LLM era.
Conclusion: The study provides a theoretical framework and dataset for understanding and detecting LLM-generated fake news, addressing a critical challenge in the era of generative AI.
Abstract: Fake news significantly influences decision-making processes by misleading individuals, organizations, and even governments. Large language models (LLMs), as part of generative AI, can amplify this problem by generating highly convincing fake news at scale, posing a significant threat to online information integrity. Therefore, understanding the motivations and mechanisms behind fake news generated by LLMs is crucial for effective detection and governance. In this study, we develop the LLM-Fake Theory, a theoretical framework that integrates various social psychology theories to explain machine-generated deception. Guided by this framework, we design an innovative prompt engineering pipeline that automates fake news generation using LLMs, eliminating manual annotation needs. Utilizing this pipeline, we create a theoretically informed Machine-generated Fake news dataset, MegaFake, derived from FakeNewsNet. Through extensive experiments with MegaFake, we advance both theoretical understanding of human-machine deception mechanisms and practical approaches to fake news detection in the LLM era.
[139] RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs in Medicine
Jiatan Huang, Mingchen Li, Zonghai Yao, Dawei Li, Yuxin Zhang, Zhichao Yang, Yongkang Xiao, Feiyun Ouyang, Xiaohan Li, Shuo Han, Hong Yu
Main category: cs.CL
TL;DR: RiTeK: A benchmark dataset for evaluating LLM-based retrieval systems on medical textual knowledge graphs with complex topological structures and reasoning queries.
Details
Motivation: Medical domain questions require accurate retrieval from medical Textual Knowledge Graphs (TKGs) to enhance LLM inference, but current limitations include scarce medical TKGs, limited topological expressiveness, and lack of comprehensive evaluation benchmarks for medical TKG retrievers.
Method: Developed RiTeK dataset by synthesizing realistic user queries with diverse topological structures, relational information, and complex textual descriptions, followed by rigorous medical expert evaluation. Used this benchmark to evaluate 11 representative LLM-based retrieval systems.
Result: Existing retrieval methods struggle on the RiTeK benchmark, revealing notable limitations in current LLM-driven retrieval approaches for semi-structured medical data.
Conclusion: There is a pressing need for more effective retrieval systems tailored for semi-structured data in the medical domain, and RiTeK serves as a comprehensive benchmark to drive future research in this area.
Abstract: Answering complex real-world questions in the medical domain often requires accurate retrieval from medical Textual Knowledge Graphs (medical TKGs), as the relational path information from TKGs could enhance the inference ability of Large Language Models (LLMs). However, the main bottlenecks lie in the scarcity of existing medical TKGs, the limited expressiveness of their topological structures, and the lack of comprehensive evaluations of current retrievers for medical TKGs. To address these challenges, we first develop a dataset for LLM complex reasoning over medical Textual Knowledge Graphs (RiTeK), covering a broad range of topological structures. Specifically, we synthesize realistic user queries integrating diverse topological structures, relational information, and complex textual descriptions. We conduct a rigorous medical expert evaluation process to assess and validate the quality of our synthesized queries. RiTeK also serves as a comprehensive benchmark dataset for evaluating the capabilities of retrieval systems built upon LLMs. By assessing 11 representative retrievers on this benchmark, we observe that existing methods struggle to perform well, revealing notable limitations in current LLM-driven retrieval approaches. These findings highlight the pressing need for more effective retrieval systems tailored for semi-structured data in the medical domain.
[140] Improving LLM Unlearning Robustness via Random Perturbations
Dang Huu-Tien, Hoang Thanh-Tung, Anh Bui, Minh-Phuong Nguyen, Le-Minh Nguyen, Naoya Inoue
Main category: cs.CL
TL;DR: LLM unlearning methods inadvertently create backdoor vulnerabilities where forget-tokens act as triggers causing model misbehavior, requiring defense mechanisms like Random Noise Augmentation to improve robustness.
Details
Motivation: Current LLM unlearning methods reduce model robustness by making them vulnerable to forget-tokens in retain-queries, revealing that unlearning processes inadvertently create backdoor-like vulnerabilities rather than truly erasing knowledge.
Method: Proposes a theoretical framework reframing unlearning as a backdoor attack/defense problem, where forget-tokens become triggers. Introduces Random Noise Augmentation (RNA) as a lightweight, model-agnostic defense approach with theoretical guarantees to mitigate these vulnerabilities.
Result: Extensive experiments show RNA significantly improves robustness of unlearned models while preserving forget and retain performance, demonstrating that unlearning methods inherently create backdoor vulnerabilities that can be mitigated through proper defense mechanisms.
Conclusion: LLM unlearning methods themselves poison models by creating backdoor vulnerabilities rather than erasing knowledge, and the proposed backdoor attack-defense framework provides insights for improving unlearning robustness through approaches like RNA.
Abstract: Here, we show that current LLM unlearning methods inherently reduce models’ robustness, causing them to misbehave even when a single non-adversarial forget-token is present in the retain-query. Toward understanding the underlying causes, we propose a novel theoretical framework that reframes the unlearning process as a backdoor attack and defense problem: we formulate how the forgetting process inadvertently learns to align forget-tokens (backdoor triggers) with the target-representations (target labels). As a result, forget-tokens act as backdoor triggers that, when activated in retain-queries, cause disruptions in unlearned models’ behaviors, similar to successful backdoor attacks. In this sense, LLM unlearning methods themselves poison the model, make it more vulnerable to forget-tokens, and hide rather than erase the target knowledge. To mitigate the vulnerability caused by the forgetting process, we reinterpret the retaining process as a backdoor defense and propose Random Noise Augmentation (RNA), a lightweight, model- and method-agnostic approach with theoretical guarantees for improving the robustness of unlearned models. Extensive experiments demonstrate that RNA significantly improves the robustness of unlearned models while preserving forget and retain performance. This backdoor attack-defense framework offers insights into the mechanism of unlearning that can shed light on future research directions for improving unlearning robustness.
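The abstract leaves RNA's details open; one plausible instantiation is Gaussian noise injected into input embeddings during the retaining phase. The scale `sigma` and the injection point below are assumptions, not the paper's specification.

```python
# Hypothetical RNA sketch: perturb retain-example embeddings during training
# so the model cannot latch onto exact forget-token representations.
import torch

def rna_embed(embeds: torch.Tensor, sigma: float = 0.01) -> torch.Tensor:
    """Add Gaussian noise to retain-query embeddings (training only)."""
    return embeds + sigma * torch.randn_like(embeds)
```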
[141] Ultra-Low-Dimensional Prompt Tuning via Random Projection
Zijun Wu, Yongchang Hao, Lili Mou
Main category: cs.CL
TL;DR: ULPT (Ultra-Low-dimensional Prompt Tuning) reduces prompt tuning parameters by 98% by optimizing prompts in 2D space and using frozen random up-projection, outperforming other parameter-efficient methods across 20+ NLP tasks.
Details
Motivation: Large language models are expensive to fine-tune, and while prompt tuning addresses parameter efficiency, existing methods still tie prompt embeddings to the model's hidden dimensionality, limiting parameter savings.
Method: ULPT optimizes prompts in an ultra-low-dimensional space (e.g., 2D) and uses a frozen random matrix for up-projection to the model's hidden dimension, dramatically reducing trainable parameters.
Result: Achieves 98% reduction in training parameters compared to vanilla prompt tuning while maintaining performance, consistently outperforms recent parameter-efficient tuning methods across 20+ NLP tasks.
Conclusion: ULPT provides a storage-efficient framework for massive LLM customization with minimal parameter overhead, making it practical for large-scale model deployment.
Abstract: Large language models achieve state-of-the-art performance but are increasingly costly to fine-tune. Prompt tuning is a parameter-efficient fine-tuning method that learns prompt embeddings, but these embeddings are typically tied to the model’s hidden dimensionality, limiting parameter savings. In this paper, we propose Ultra-Low-dimensional Prompt Tuning (ULPT), a simple yet effective method that optimizes prompts in a low-dimensional space (e.g., 2D) and uses a frozen random matrix for up-projection. ULPT can achieve a 98% reduction in training parameters compared to vanilla prompt tuning while preserving performance. Our extensive experiments across over 20 NLP tasks demonstrate that ULPT consistently outperforms recent parameter-efficient tuning methods while using significantly fewer parameters, making it well-suited as a storage-efficient framework for massive LLM customization.
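The core of ULPT fits in a few lines: a tiny trainable matrix of per-token prompt vectors plus a frozen random up-projection. The 2-D and 4096-D sizes below echo the paper's example; the initialization and scaling choices are our assumptions.

```python
# Sketch of random-projection prompt tuning in the spirit of ULPT.
import torch
import torch.nn as nn

class ULPTPrompt(nn.Module):
    def __init__(self, n_tokens: int = 20, low_dim: int = 2, hidden: int = 4096):
        super().__init__()
        self.z = nn.Parameter(torch.zeros(n_tokens, low_dim))  # trainable, tiny
        up = torch.randn(low_dim, hidden) / low_dim ** 0.5
        self.register_buffer("up", up)                         # frozen projection

    def forward(self) -> torch.Tensor:
        # (n_tokens, low_dim) @ (low_dim, hidden) -> soft prompt in hidden space
        return self.z @ self.up
```

Only `z` receives gradients; with `low_dim=2` that is 40 trainable scalars per 20-token prompt, which is where the roughly 98% parameter reduction comes from.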
[142] CounterBench: Evaluating and Improving Counterfactual Reasoning in Large Language Models
Yuefei Chen, Vivek K. Singh, Jing Ma, Ruixiang Tang
Main category: cs.CL
TL;DR: LLMs struggle with formal counterfactual reasoning, performing near random chance on new benchmark CounterBench; proposed CoIn method improves performance through iterative reasoning and backtracking.
Details
Motivation: Counterfactual reasoning is challenging for AI, but previous studies focused on commonsense causal reasoning where LLMs use prior knowledge. Need to assess LLMs' ability to perform formal counterfactual inference using structured rules rather than just commonsense knowledge.
Method: Created CounterBench dataset with 1K counterfactual reasoning questions varying in difficulty, causal graph structures, question types, and name variants. Proposed CoIn (Counterfactual Inference) method that guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions.
Result: Most LLMs perform at random guessing levels on counterfactual reasoning tasks. CoIn method significantly improves LLM performance on counterfactual reasoning and consistently enhances performance across different LLM models.
Conclusion: Counterfactual reasoning remains a significant challenge for LLMs, but structured reasoning approaches like CoIn can substantially improve their performance on formal counterfactual inference tasks.
Abstract: Counterfactual reasoning is widely recognized as one of the most challenging and intricate aspects of causality in artificial intelligence. In this paper, we evaluate the performance of large language models (LLMs) in counterfactual reasoning. In contrast to previous studies that primarily focus on commonsense causal reasoning, where LLMs often rely on prior knowledge for inference, we specifically assess their ability to perform counterfactual inference using a set of formal rules. To support this evaluation, we introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions. The dataset is designed with varying levels of difficulty, diverse causal graph structures, distinct types of counterfactual questions, and multiple nonsensical name variants. Our experiments demonstrate that counterfactual reasoning poses a significant challenge for LLMs, with most models performing at levels comparable to random guessing. To enhance LLMs’ counterfactual reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions. Experimental results show that our method significantly improves LLM performance on counterfactual reasoning tasks and consistently enhances performance across different LLMs. Our dataset is available at https://huggingface.co/datasets/CounterBench/CounterBench.
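An iterate-and-backtrack loop in the spirit of CoIn might look like the sketch below. `propose` and `verify` are hypothetical placeholders for the prompted LLM and the formal-rule checker; the loop structure is our reading of the abstract, not the paper's algorithm.

```python
# Hedged sketch of iterative reasoning with backtracking.
def coin_solve(question: str, propose, verify, max_steps: int = 5):
    history: list[str] = []
    for _ in range(max_steps):
        candidate = propose(question, history)      # LLM proposes a solution path
        ok, feedback = verify(question, candidate)  # check against formal rules
        if ok:
            return candidate
        history.append(feedback)                    # backtrack, carrying feedback
    return None
```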
[143] LIFT: A Novel Framework for Enhancing Long-Context Understanding of LLMs via Long Input Fine-Tuning
Yansheng Mao, Yufei Xu, Jiaqi Li, Fanxu Meng, Haotong Yang, Zilong Zheng, Xiyuan Wang, Muhan Zhang
Main category: cs.CL
TL;DR: LIFT is a framework that fine-tunes short-context LLMs to handle long inputs by adapting model parameters to specific long contexts, avoiding quadratic attention complexity while enabling inference without the original long context.
Details
Motivation: Long context understanding is challenging for LLMs due to limited context windows and quadratic attention complexity. Current approaches focus on extending context windows, but LIFT proposes storing long input information in model parameters instead.
Method: LIFT fine-tunes short-context LLMs on specific long inputs by adapting model parameters to absorb the long context. It uses LLM-generated synthetic tasks to enhance comprehension beyond memorization, with an optimized pipeline reducing Time to First Token to <10 seconds for 8k context.
Result: LIFT enables short-context LLMs to answer questions about long inputs even when the original context isn’t provided during inference, avoiding quadratic complexity while maintaining performance on long-context understanding tasks.
Conclusion: LIFT offers a novel parameter-adaptive approach to long-context modeling that avoids quadratic attention costs, provides feasibility for real-world deployment, and opens new directions for efficient long-context understanding in LLMs.
Abstract: Long context understanding remains challenging for large language models due to their limited context windows. This paper introduces Long Input Fine-Tuning (LIFT), a novel framework for long-context modeling that can enhance the long-context performance of arbitrary short-context LLMs by dynamically adapting their parameters to the given long input. Importantly, rather than endlessly extending the context window size to accommodate increasingly longer inputs in context, LIFT stores and absorbs the long input in parameters. By fine-tuning the long input into model parameters, LIFT allows short-context LLMs to answer questions even when the required information is not provided in the context during inference, avoiding the quadratic complexity w.r.t. input length of a normal long context model. Furthermore, LIFT does not simply perform continued pretraining on new, long contexts, but leverages carefully designed LLM-generated synthetic tasks to enhance the comprehension of long contexts, moving beyond mere memorization. To accommodate the additional cost of fine-tuning, we design a highly optimized pipeline that reduces the Time to First Token (TTFT) to less than 10 seconds for 8k context. We further provide a comprehensive analysis of LIFT’s strengths and limitations in long-context understanding, discuss its feasibility for large-scale real-world deployment, and highlight valuable directions for future research.
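The core mechanism, fine-tuning the long input into the parameters, is straightforward to picture in code. A minimal sketch assuming a Hugging Face causal LM trained with plain LM loss over overlapping chunks; the model name, file path, and hyperparameters are illustrative, and the paper's synthetic-task training and TTFT-optimized pipeline are omitted:

```python
# Sketch of the LIFT idea: absorb a long document into the weights by
# fine-tuning on overlapping chunks. All choices below are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                 # stand-in short-context LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

long_text = open("long_input.txt").read()           # placeholder document path
ids = tok(long_text, return_tensors="pt").input_ids[0]

chunk, stride = 512, 256                            # overlapping windows
model.train()
for start in range(0, len(ids) - chunk, stride):
    window = ids[start : start + chunk].unsqueeze(0)
    loss = model(window, labels=window).loss        # next-token loss on the chunk
    loss.backward()
    opt.step()
    opt.zero_grad()
# The tuned model can now be asked about the document without it in context.
```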
[144] Preference Learning Unlocks LLMs’ Psycho-Counseling Skills
Mian Zhang, Shaun M. Eack, Zhiyu Zoey Chen
Main category: cs.CL
TL;DR: PsychoCounsel-Preference dataset and aligned LLM for psycho-counseling applications, achieving 87% win rate against GPT-4o
Details
Motivation: Addressing the gap in mental health support by applying LLMs to psycho-counseling, but current LLMs struggle due to lack of high-quality supervision data and privacy concerns around real counseling sessions.
Method: Proposed professional evaluation principles for therapists’ responses, created PsychoCounsel-Preference dataset (36k preference comparison pairs), and trained aligned models using reward modeling and preference learning.
Result: PsychoCounsel-Llama3-8B achieves 87% win rate against GPT-4o, demonstrating effectiveness in psycho-counseling response generation
Conclusion: The proposed dataset and aligned models provide valuable resources for improving LLMs in psycho-counseling applications, addressing data quality and privacy challenges
Abstract: Applying large language models (LLMs) to assist in psycho-counseling is an emerging and meaningful approach, driven by the significant gap between patient needs and the availability of mental health support. However, current LLMs struggle to consistently provide effective responses to client speeches, largely due to the lack of supervision from high-quality real psycho-counseling data, whose content is typically inaccessible due to client privacy concerns. Furthermore, the quality of therapists’ responses in available sessions can vary significantly based on their professional training and experience. Assessing the quality of therapists’ responses remains an open challenge. In this work, we address these challenges by first proposing a set of professional and comprehensive principles to evaluate therapists’ responses to client speeches. Using these principles, we create a preference dataset, PsychoCounsel-Preference, which contains 36k high-quality preference comparison pairs. This dataset aligns with the preferences of professional psychotherapists, providing a robust foundation for evaluating and improving LLMs in psycho-counseling. Experiments on reward modeling and preference learning demonstrate that PsychoCounsel-Preference is an excellent resource for LLMs to acquire essential skills for responding to clients in a counseling session. Our best-aligned model, PsychoCounsel-Llama3-8B, achieves an impressive win rate of 87% against GPT-4o. We release PsychoCounsel-Preference, PsychoCounsel-Llama3-8B and the reward model PsychoCounsel Llama3-8B-Reward to facilitate the research of psycho-counseling with LLMs at: https://hf.co/Psychotherapy-LLM.
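The preference-learning step that such a dataset feeds is typically DPO. A minimal sketch of the standard objective (Rafailov et al., 2023), with fake summed log-probabilities standing in for real policy and reference-model scores; nothing here is the authors' exact training recipe:

```python
# Sketch of a DPO objective over preference pairs. logp_*/ref_* are summed
# token log-probs of chosen/rejected responses under the policy and a frozen
# reference model; computing them from a real model is omitted for brevity.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margins = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margins).mean()

# Toy batch of three preference pairs with invented log-probabilities:
loss = dpo_loss(
    torch.tensor([-12.0, -9.5, -11.0]), torch.tensor([-14.0, -9.0, -13.5]),
    torch.tensor([-12.5, -9.7, -11.2]), torch.tensor([-13.0, -9.4, -12.8]),
)
print(loss.item())
```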
[145] Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
Haoxiang Sun, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen
Main category: cs.CL
TL;DR: OlymMATH is a challenging Olympiad-level math benchmark with 350 problems in English and Chinese, featuring both natural language evaluation and formal verification in Lean 4 to assess reasoning capabilities of large models.
Details
Motivation: Existing math benchmarks have become saturated by large reasoning models, creating an urgent need for more challenging evaluation frameworks that can properly assess advanced reasoning capabilities beyond simple computational problems.
Method: Created OlymMATH benchmark with 350 manually curated Olympiad-level problems from printed publications to minimize data contamination. Includes dual evaluation paradigms: (1) OlymMATH-EASY/HARD with 200 computational problems for natural language evaluation, and (2) OlymMATH-LEAN with 150 problems formalized in Lean 4 for rigorous process-level verification.
Result: The benchmark presents significant challenges to current models, revealing performance gaps between languages and instances where models rely on heuristic “guessing” rather than rigorous reasoning. The authors release extensive resources including 582k+ reasoning trajectories and visualization tools.
Conclusion: OlymMATH provides a comprehensive, challenging benchmark for evaluating advanced mathematical reasoning capabilities, with dual evaluation paradigms that enable both objective assessment and rigorous process verification, supporting future research in reasoning model development.
Abstract: The rapid advancement of large reasoning models has saturated existing math benchmarks, underscoring the urgent need for more challenging evaluation frameworks. To address this, we introduce OlymMATH, a rigorously curated, Olympiad-level math benchmark comprising 350 problems, each with parallel English and Chinese versions. OlymMATH is the first benchmark to unify dual evaluation paradigms within a single suite: (1) natural language evaluation through OlymMATH-EASY and OlymMATH-HARD, comprising 200 computational problems with numerical answers for objective rule-based assessment, and (2) formal verification through OlymMATH-LEAN, offering 150 problems formalized in Lean 4 for rigorous process-level evaluation. All problems are manually sourced from printed publications to minimize data contamination, verified by experts, and span four core domains. Extensive experiments reveal the benchmark’s significant challenge, and our analysis also uncovers consistent performance gaps between languages and identifies cases where models employ heuristic “guessing” rather than rigorous reasoning. To further support community research, we release 582k+ reasoning trajectories, a visualization tool, and expert solutions at https://github.com/RUCAIBox/OlymMATH.
[146] If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs
Siqi Fan, Xiusheng Huang, Yiqun Yao, Xuezhi Fang, Kang Liu, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang
Main category: cs.CL
TL;DR: LIFESTATE-BENCH benchmark assesses lifelong learning in LLMs through episodic datasets with narrative structure, evaluating self-awareness, memory, and relationship tracking across parametric vs. non-parametric approaches.
Details
Motivation: Current LLM benchmarks fail to capture emergent lifelong learning dynamics during multi-turn, multi-agent interactions where LLMs exhibit consistent character-like behaviors, necessitating a specialized benchmark to assess stateful learning capabilities.
Method: Introduces LIFESTATE-BENCH with two episodic datasets (Hamlet and synthetic scripts) rich in narrative structure and character interactions. Uses fact checking evaluation to probe models’ self-awareness, episodic memory retrieval, and relationship tracking across parametric and non-parametric approaches.
Result: Nonparametric methods significantly outperform parametric ones in managing stateful learning, but all models exhibit challenges with catastrophic forgetting as interactions extend, highlighting limitations in current lifelong learning capabilities.
Conclusion: There’s a need for further advancements in lifelong learning for LLMs, as current models struggle with maintaining stateful knowledge over extended interactions despite showing emergent character-like behaviors in multi-agent scenarios.
Abstract: Large language models (LLMs) can carry out human-like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, primarily focusing on static, open-ended evaluations. To address this gap, we introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets: Hamlet and a synthetic script collection, rich in narrative structure and character interactions. Our fact checking evaluation probes models’ self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches. In experiments on models such as Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that nonparametric methods significantly outperform parametric ones in managing stateful learning. However, all models exhibit challenges with catastrophic forgetting as interactions extend, highlighting the need for further advancements in lifelong learning.
[147] Tuning Language Models for Robust Prediction of Diverse User Behaviors
Fanjin Meng, Jingtao Ding, Jiahui Gong, Chen Yang, Hong Chen, Zuojian Wang, Haisheng Lu, Yong Li
Main category: cs.CL
TL;DR: BehaviorLM: A progressive fine-tuning approach for LLMs to better predict both frequent anchor behaviors and rare tail behaviors in user behavior prediction tasks.
Details
Motivation: Deep learning models struggle with long-tailed user behaviors, while LLMs have rich behavioral knowledge from pretraining but existing fine-tuning approaches overfit to frequent behaviors and fail at rare behaviors.
Method: Two-stage progressive fine-tuning: 1) Fine-tune LLMs on anchor behaviors while preserving general behavioral knowledge, 2) Fine-tune using balanced subset based on sample difficulty to improve tail behavior predictions without sacrificing anchor performance.
Result: Experimental results on two real-world datasets show BehaviorLM robustly predicts both anchor and tail behaviors and effectively leverages LLM behavioral knowledge for few-shot tail behavior prediction.
Conclusion: BehaviorLM addresses the long-tail behavior prediction problem in LLMs through progressive fine-tuning, enabling better prediction of both frequent and rare user behaviors.
Abstract: Predicting user behavior is essential for intelligent assistant services, yet deep learning models often struggle to capture long-tailed behaviors. Large language models (LLMs), with their pretraining on vast corpora containing rich behavioral knowledge, offer promise. However, existing fine-tuning approaches tend to overfit to frequent "anchor" behaviors, reducing their ability to predict less common "tail" behaviors. In this paper, we introduce BehaviorLM, a progressive fine-tuning approach that addresses this issue. In the first stage, LLMs are fine-tuned on anchor behaviors while preserving general behavioral knowledge. In the second stage, fine-tuning uses a balanced subset of all behaviors based on sample difficulty to improve tail behavior predictions without sacrificing anchor performance. Experimental results on two real-world datasets demonstrate that BehaviorLM robustly predicts both anchor and tail behaviors and effectively leverages LLM behavioral knowledge to master tail behavior prediction with few-shot examples.
[148] Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models’ Uncertainty?
Jiayu Liu, Qing Zong, Weiqi Wang, Yangqiu Song
Main category: cs.CL
TL;DR: LLMs’ use of epistemic markers (e.g., “fairly confident”) for confidence expression is inconsistent, especially in out-of-distribution scenarios, raising reliability concerns for confidence estimation.
Details
Motivation: As LLMs are increasingly used in high-stakes domains, accurately assessing their confidence is crucial. Humans express confidence through epistemic markers rather than numerical values, but it's unclear whether LLMs consistently use these markers to reflect their intrinsic confidence due to difficulty quantifying uncertainty associated with various markers.
Method: Define marker confidence as observed accuracy when a model employs an epistemic marker. Evaluate stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs.
Result: While markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. This raises significant concerns about the reliability of epistemic markers for confidence estimation.
Conclusion: There is a need for improved alignment between marker-based confidence and actual model uncertainty, as current epistemic markers are unreliable for confidence estimation, especially in out-of-distribution scenarios.
Abstract: As large language models (LLMs) are increasingly used in high-stakes domains, accurately assessing their confidence is crucial. Humans typically express confidence through epistemic markers (e.g., “fairly confident”) instead of numerical values. However, it remains unclear whether LLMs consistently use these markers to reflect their intrinsic confidence due to the difficulty of quantifying uncertainty associated with various markers. To address this gap, we first define marker confidence as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. Our results show that while markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. These findings raise significant concerns about the reliability of epistemic markers for confidence estimation, underscoring the need for improved alignment between marker-based confidence and actual model uncertainty. Our code is available at https://github.com/HKUST-KnowComp/MarConf.
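The paper's central quantity, marker confidence, is simply the observed accuracy of answers that carry a given marker. A few lines make the definition concrete; the records below are invented:

```python
# Marker confidence = observed accuracy when the model uses a given marker.
from collections import defaultdict

def marker_confidence(records):
    """records: iterable of (marker, is_correct) pairs from a QA evaluation."""
    hits, totals = defaultdict(int), defaultdict(int)
    for marker, correct in records:
        totals[marker] += 1
        hits[marker] += int(correct)
    return {m: hits[m] / totals[m] for m in totals}

records = [("fairly confident", True), ("fairly confident", False),
           ("almost certain", True), ("almost certain", True),
           ("not sure", False)]
print(marker_confidence(records))
# {'fairly confident': 0.5, 'almost certain': 1.0, 'not sure': 0.0}
```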
[149] Aligning What LLMs Do and Say: Towards Self-Consistent Explanations
Sahar Admoni, Ofra Amir, Assaf Hallak, Yftah Ziser
Main category: cs.CL
TL;DR: PSCB benchmark reveals gaps between LLM decisions and their explanations, shows Spearman correlation better measures alignment than cosine similarity, and DPO improves alignment without hurting accuracy.
Details
Motivation: LLMs offer apparent interpretability through explanations, but post-hoc rationales often misrepresent what actually shaped model outputs, creating a gap between the features driving answers and those emphasized in explanations.
Method: Introduce Post-hoc Self-Consistency Bank (PSCB), a large-scale benchmark linking model decisions with diverse explanations and attribution vectors across datasets, methods, and model families. Use Spearman rank correlation to measure alignment, then apply Direct Preference Optimization (DPO) to attribution-based preference data.
Result: Spearman rank correlation provides more reliable signal of alignment than cosine similarity. DPO improves alignment between decisions and explanations without degrading task accuracy, while standard supervised fine-tuning fails to achieve comparable gains. Improvements generalize robustly across domains.
Conclusion: The work enables scalable and faithful alignment between LLM decisions and their natural language explanations, addressing the gap between what drives model outputs and what explanations claim.
Abstract: Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their answers. Yet the features driving an answer often differ from those emphasized in its explanation, meaning post-hoc rationales can misrepresent what actually shaped the model’s output. We quantify this gap by comparing the feature-importance distributions of answers and their explanations. Prior analyses reveal such discrepancies, but large-scale study has been limited by the high computational cost of attribution methods. To address this, we introduce the Post-hoc Self-Consistency Bank (PSCB), a large-scale benchmark linking model decisions with diverse explanations and attribution vectors across datasets, methods, and model families. Using PSCB, we find that Spearman rank correlation provides a more reliable signal of alignment than cosine similarity. Building on this insight, we apply Direct Preference Optimization (DPO) to attribution-based preference data, improving alignment without degrading task accuracy, and show that standard supervised fine-tuning on the same data fails to achieve comparable gains. These improvements generalize robustly across domains, paving the way toward scalable and faithful alignment between LLM decisions and their natural language explanations.
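The core measurement is easy to reproduce in miniature: compare an answer's feature-importance vector with the one derived from its explanation, once by cosine similarity and once by Spearman rank correlation. The vectors below are synthetic, chosen to show how the two metrics can disagree:

```python
# Cosine vs. Spearman as self-consistency signals on synthetic attributions.
import numpy as np
from scipy.stats import spearmanr

answer_attr = np.array([0.90, 0.05, 0.03, 0.02])  # what drove the answer
explan_attr = np.array([0.90, 0.02, 0.03, 0.05])  # what the explanation stresses

cosine = answer_attr @ explan_attr / (
    np.linalg.norm(answer_attr) * np.linalg.norm(explan_attr))
rho, _ = spearmanr(answer_attr, explan_attr)
print(f"cosine={cosine:.2f}  spearman={rho:.2f}")  # cosine=1.00  spearman=0.20
# Nearly identical by cosine, weakly aligned by rank: the shared dominant
# feature masks disagreement everywhere else, which is why the paper finds
# Spearman the more reliable signal.
```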
[150] LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops
Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Jinglun Li, Haijing Guo, Dingkang Yang, Zhaoyu Chen, Wenqiang Zhang
Main category: cs.CL
TL;DR: LingoLoop: A novel attack that exploits POS tag characteristics and generative path pruning to trap MLLMs in verbose, repetitive loops, causing resource exhaustion through excessive token generation.
Details
Motivation: Multimodal LLMs require substantial computational resources during inference, making them vulnerable to resource exhaustion attacks. Existing attacks are limited because they don't consider how Part-of-Speech characteristics affect EOS token generation or how sentence-level patterns influence output length.
Method: Two key mechanisms: 1) POS-Aware Delay Mechanism that adjusts attention weights based on POS tags to postpone EOS token generation, and 2) Generative Path Pruning Mechanism that limits hidden state magnitudes to encourage persistent repetitive loops.
Result: LingoLoop successfully traps MLLMs like Qwen2.5-VL-3B in generative loops, inducing outputs with up to 367x more tokens than clean inputs, leading to massive energy consumption increases when generation limits are relaxed.
Conclusion: The attack exposes significant vulnerabilities in MLLMs related to their generation mechanisms, posing serious challenges for reliable deployment due to resource exhaustion risks.
Abstract: Multimodal Large Language Models (MLLMs) have shown great promise but require substantial computational resources during inference. Attackers can exploit this by inducing excessive output, leading to resource exhaustion and service degradation. Prior energy-latency attacks aim to increase generation time by broadly shifting the output token distribution away from the EOS token, but they neglect the influence of token-level Part-of-Speech (POS) characteristics on EOS and sentence-level structural patterns on output counts, limiting their efficacy. To address this, we propose LingoLoop, an attack designed to induce MLLMs to generate excessively verbose and repetitive sequences. First, we find that the POS tag of a token strongly affects the likelihood of generating an EOS token. Based on this insight, we propose a POS-Aware Delay Mechanism to postpone EOS token generation by adjusting attention weights guided by POS information. Second, we identify that constraining output diversity to induce repetitive loops is effective for sustained generation. We introduce a Generative Path Pruning Mechanism that limits the magnitude of hidden states, encouraging the model to produce persistent loops. Extensive experiments on models like Qwen2.5-VL-3B demonstrate LingoLoop’s powerful ability to trap them in generative loops; it consistently drives them to their generation limits and, when those limits are relaxed, can induce outputs with up to 367x more tokens than clean inputs, triggering a commensurate surge in energy consumption. These findings expose significant vulnerabilities in MLLMs, posing challenges for their reliable deployment.
[151] STU-PID: Steering Token Usage via PID Controller for Efficient Large Language Model Reasoning
Aryasomayajula Ram Bharadwaj
Main category: cs.CL
TL;DR: STUPID is a training-free method using PID controllers to dynamically adjust activation steering strength during LLM inference, reducing redundant reasoning steps while maintaining accuracy.
Details
Motivation: Large Language Models with extended chain-of-thought reasoning often suffer from "overthinking": generating excessive redundant reasoning steps that increase computational costs and potentially degrade performance. Existing static steering approaches lack adaptability to dynamically adjust intervention based on real-time reasoning quality.
Method: STUPID employs a PID controller to dynamically modulate activation steering strength during inference. It combines a chunk-level classifier for detecting redundant reasoning patterns with a PID control mechanism that adaptively adjusts steering intensity based on predicted redundancy probability.
Result: On GSM8K, STUPID achieves 6% improvement in accuracy while reducing token usage by 32%, outperforming static steering baselines.
Conclusion: STUPID provides a principled framework for dynamic reasoning calibration that maintains reasoning quality while significantly improving computational efficiency, offering a training-free solution to the overthinking problem in LLMs.
Abstract: Large Language Models employing extended chain-of-thought (CoT) reasoning often suffer from the overthinking phenomenon, generating excessive and redundant reasoning steps that increase computational costs while potentially degrading performance. While recent work has explored static steering approaches to mitigate this issue, they lack the adaptability to dynamically adjust intervention strength based on real-time reasoning quality. We propose STUPID (Steering Token Usage via PID controller), a novel training-free method that employs a PID controller to dynamically modulate activation steering strength during inference. Our approach combines a chunk-level classifier for detecting redundant reasoning patterns with a PID control mechanism that adaptively adjusts steering intensity based on the predicted redundancy probability. Experimental evaluation on GSM8K demonstrates that STUPID achieves a 6% improvement in accuracy while reducing token usage by 32%, outperforming static steering baselines. Our method provides a principled framework for dynamic reasoning calibration that maintains reasoning quality while significantly improving computational efficiency.
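The control loop itself is a textbook PID applied to the classifier's redundancy probability. A minimal sketch with illustrative gains and a hard-coded sequence of per-chunk predictions in place of the real classifier:

```python
# PID controller mapping predicted redundancy to activation-steering strength.
# Gains, setpoint, and the fake classifier outputs are illustrative assumptions.
class PID:
    def __init__(self, kp=1.0, ki=0.1, kd=0.05, setpoint=0.2):
        self.kp, self.ki, self.kd, self.setpoint = kp, ki, kd, setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, redundancy_prob):
        error = redundancy_prob - self.setpoint
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        # More redundancy -> steer harder toward concise reasoning (floor at 0).
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)

pid = PID()
for p in [0.1, 0.3, 0.6, 0.8, 0.4]:  # per-chunk classifier outputs (invented)
    print(f"redundancy={p:.1f} -> steering strength {pid.update(p):.3f}")
```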
[152] What Factors Affect LLMs and RLLMs in Financial Question Answering?
Peng Wang, Xuesi Hu, Jiageng Wu, Yuntao Zou, Qiancheng Zhang, Dagang Li
Main category: cs.CL
TL;DR: Systematic evaluation of LLMs and RLLMs on financial QA tasks, analyzing prompting methods, agent frameworks, and multilingual alignment to understand performance enhancement strategies.
Details
Motivation: Despite growing attention to LLMs and RLLMs, there's limited systematic exploration of methods to fully unlock their performance in the financial domain, particularly for complex reasoning tasks.
Method: Used five LLMs and four RLLMs to assess effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks through systematic evaluation.
Result: (1) Prompting methods and agent frameworks enhance LLM performance by simulating Long CoT; (2) RLLMs’ inherent Long CoT capabilities limit conventional methods’ effectiveness; (3) Multilingual alignment methods improve LLM performance via reasoning length extension but offer minimal benefits for RLLMs.
Conclusion: The study provides important insights into performance enhancement strategies for LLMs and RLLMs in financial QA, highlighting different optimization approaches needed for LLMs vs. RLLMs due to their inherent reasoning capabilities.
Abstract: Recently, large language models (LLMs) and reasoning large language models (RLLMs) have gained considerable attention from many researchers. RLLMs enhance the reasoning capabilities of LLMs through Long Chain-of-Thought (Long CoT) processes, significantly improving the performance of LLMs in addressing complex problems. However, there are few works that systematically explore what methods can fully unlock the performance of LLMs and RLLMs within the financial domain. To investigate the impact of various methods on LLMs and RLLMs, we utilize five LLMs and four RLLMs to assess the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks. Our research findings indicate: (1) Current prompting methods and agent frameworks enhance the performance of LLMs in financial question answering by simulating Long CoT; (2) RLLMs possess inherent Long CoT capabilities, which limits the effectiveness of conventional methods in further enhancing their performance; (3) Current advanced multilingual alignment methods primarily improve the multilingual performance of LLMs by extending the reasoning length, which yields minimal benefits for RLLMs. Additionally, we discuss strategies for enhancing the performance of LLMs and RLLMs in financial question answering, which may serve as an inspiration for future improvements. We hope that this study can serve as an important reference for LLMs and RLLMs in the field of financial question answering.
[153] AttnTrace: Contextual Attribution of Prompt Injection and Knowledge Corruption
Yanting Wang, Runpeng Geng, Ying Chen, Jinyuan Jia
Main category: cs.CL
TL;DR: AttnTrace is a new context traceback method for LLMs that uses attention weights to identify which parts of a long context most influence model responses, offering improved accuracy and efficiency over existing methods.
Details
Motivation: As LLMs are increasingly used in RAG pipelines and autonomous agents with long contexts, there's a need for efficient and accurate methods to trace which parts of the context contribute most to model responses, for applications like forensic analysis and improving interpretability.
Method: AttnTrace uses attention weights produced by LLMs for prompts, with two novel techniques to enhance effectiveness: 1) attention weight processing methods, and 2) theoretical design choices that optimize traceback accuracy. The method operates through an attribution-before-detection paradigm.
Result: AttnTrace outperforms state-of-the-art methods like TracLLM in both accuracy and efficiency (significantly faster than TracLLM’s hundreds of seconds per traceback). It also improves prompt injection detection in long contexts and can effectively identify injected instructions in manipulated documents.
Conclusion: AttnTrace provides an effective and efficient solution for context traceback in LLMs, with practical applications in security, interpretability, and trustworthiness of LLM systems operating on long contexts.
Abstract: Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context–often consisting of texts retrieved from a knowledge database or memory–and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace.
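The basic move, scoring each context segment by the attention the response pays it, fits in a few lines. A toy sketch over synthetic attention weights; the paper's two weight-processing techniques and its handling of layers and heads are not reproduced here:

```python
# Rank context segments by mean response-to-context attention (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
attn = rng.random((24, 300))                    # [response tokens x context tokens]
segments = [(0, 100), (100, 200), (200, 300)]   # context split into three texts

scores = [attn[:, s:e].mean() for s, e in segments]
ranked = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
print("segments ranked by contribution:", ranked)
```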
[154] ZARA: Training-Free Motion Time-Series Reasoning via Evidence-Grounded LLM Agents
Zechen Li, Baiyu Chen, Hao Xue, Flora D. Salim
Main category: cs.CL
TL;DR: ZARA is a zero-training framework for human activity recognition that uses knowledge- and retrieval-augmented agents to reason about motion sensor time-series without requiring model retraining.
Details
Motivation: Traditional HAR systems are limited to fixed activity sets and require expensive retraining for new behaviors. LLMs offer open-set reasoning but struggle with numerical time-series, leading to hallucinations and poor grounding.
Method: ZARA distills reference data into a statistically grounded textual knowledge base that transforms signal patterns into verifiable natural-language priors. It uses retrieval-augmented agents to iteratively select discriminative cues and perform grounded reasoning over candidate activities without training.
Result: Extensive experiments on eight benchmarks show ZARA generalizes robustly to unseen subjects and across datasets, demonstrating strong transferability across heterogeneous sensor domains.
Conclusion: ZARA represents progress toward trustworthy, plug-and-play motion understanding that goes beyond dataset-specific artifacts, enabling zero-training adaptation to new activities.
Abstract: Motion sensor time-series are central to Human Activity Recognition (HAR), yet conventional approaches are constrained to fixed activity sets and typically require costly parameter retraining to adapt to new behaviors. While Large Language Models (LLMs) offer promising open-set reasoning capabilities, applying them directly to numerical time-series often leads to hallucinations and weak grounding. To address this challenge, we propose ZARA (Zero-training Activity Reasoning Agents), a knowledge- and retrieval-augmented agentic framework for motion time-series reasoning in a training-free inference setting. Rather than relying on black-box projections, ZARA distills reference data into a statistically grounded textual knowledge base that transforms implicit signal patterns into verifiable natural-language priors. Guided by retrieved evidence, ZARA iteratively selects discriminative cues and performs grounded reasoning over candidate activities. Extensive experiments on eight benchmarks show that ZARA generalizes robustly to unseen subjects and across datasets, demonstrating strong transferability across heterogeneous sensor domains. These results mark a step toward trustworthy, plug-and-play motion understanding beyond dataset-specific artifacts. Our code is available at https://github.com/zechenli03/ZARA.
[155] Echoes of Automation: The Increasing Use of LLMs in Newsmaking
Abolfazl Ansari, Delvin Ce Zhang, Nafis Irtiza Tripto, Dongwon Lee
Main category: cs.CL
TL;DR: Analysis of AI-generated content in news media showing increasing use of GenAI, especially in local/college news, with AI used more in introductions and manual writing in conclusions, affecting linguistic patterns.
Details
Motivation: To investigate the impact of Generative AI (particularly LLMs) on journalistic integrity and authorship by examining AI-generated content across various news media formats and outlets.
Method: Analyzed over 40,000 news articles from major, local, and college news media using three advanced AI-text detectors (Binoculars, Fast-Detect GPT, GPTZero) with sentence-level and linguistic analysis.
Result: Found substantial increase in GenAI use in recent years, especially in local and college news. LLMs often used in introductions while conclusions written manually. GenAI boosts word richness and readability but lowers formality, leading to more uniform writing styles in local media.
Conclusion: GenAI is increasingly used in journalism, particularly in local/college media, affecting writing patterns and potentially journalistic integrity, with AI more prevalent in article introductions than conclusions.
Abstract: The rapid rise of Generative AI (GenAI), particularly LLMs, poses concerns for journalistic integrity and authorship. This study examines AI-generated content across over 40,000 news articles from major, local, and college news media, in various media formats. Using three advanced AI-text detectors (Binoculars, Fast-Detect GPT, and GPTZero), we find a substantial increase in GenAI use in recent years, especially in local and college news. Sentence-level analysis reveals LLMs are often used in the introduction of news, while conclusions are usually written manually. Linguistic analysis shows GenAI boosts word richness and readability but lowers formality, leading to more uniform writing styles, particularly in local media.
[156] SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering
Utsav Maskey, Sumit Yadav, Mark Dras, Usman Naseem
Main category: cs.CL
TL;DR: SafeConstellations: An inference-time trajectory-shifting method that reduces LLM over-refusal by tracking task-specific embedding patterns and guiding representations toward non-refusal pathways, achieving up to 73% reduction in over-refusal rates.
Details
Motivation: LLMs increasingly exhibit over-refusal behavior where safety mechanisms cause models to reject benign instructions that resemble harmful content, diminishing utility in production applications that rely on common prompt templates or specific tasks like sentiment analysis and language translation.
Method: Mechanistic analysis reveals LLMs follow distinct “constellation” patterns in embedding space with consistent trajectories across NLP tasks. SafeConstellations tracks these task-specific trajectory patterns at inference time and guides representations toward non-refusal pathways using trajectory-shifting techniques.
Result: The method reduces over-refusal rates by up to 73% with minimal impact on utility, offering a principled and conditional approach that selectively guides model behavior only on tasks prone to over-refusal.
Conclusion: SafeConstellations provides an effective inference-time solution for mitigating LLM over-refusal by leveraging task-specific trajectory patterns in embedding space, maintaining safety while improving utility for production applications.
Abstract: LLMs increasingly exhibit over-refusal behavior, where safety mechanisms cause models to reject benign instructions that seemingly resemble harmful content. This phenomenon diminishes utility in production applications that repeatedly rely on common prompt templates or applications that frequently rely on LLMs for specific tasks (e.g. sentiment analysis, language translation). Through extensive evaluation, we demonstrate that LLMs persist in refusing inputs containing harmful content, even when they are reframed with tasks that have benign intent. Our mechanistic analysis reveals that LLMs follow distinct “constellation” patterns in embedding space as representations traverse layers, with each NLP task maintaining consistent trajectories that shift predictably between refusal and non-refusal cases. We introduce SafeConstellations, an inference-time trajectory-shifting approach that tracks task-specific trajectory patterns and guides representations toward non-refusal pathways. By selectively guiding model behavior only on tasks prone to over-refusal, our method reduces over-refusal rates by up to 73% with minimal impact on utility – offering a principled and conditional approach to mitigating over-refusals.
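One plausible reading of trajectory shifting is difference-of-means activation steering, gated so it fires only on tasks flagged as over-refusal-prone. A hedged sketch under that assumption; the placeholder means and the gating flag are ours, and the paper's per-task constellation tracking is considerably richer:

```python
# Gated steering toward a non-refusal direction (all statistics are placeholders).
import numpy as np

d = 64
mu_refuse = np.ones(d)         # placeholder: mean hidden state over refusal cases
mu_comply = np.zeros(d)        # placeholder: mean hidden state over compliant cases
steer = mu_comply - mu_refuse  # direction toward the non-refusal pathway

def shift(hidden, task_prone_to_overrefusal, alpha=0.5):
    """Nudge a layer's hidden state toward non-refusal, only for flagged tasks."""
    if not task_prone_to_overrefusal:
        return hidden                          # benign tasks left untouched
    return hidden + alpha * steer

h = np.random.default_rng(1).normal(size=d)
print(np.linalg.norm(shift(h, True) - h))      # nonzero: state was steered
print(np.linalg.norm(shift(h, False) - h))     # 0.0: no intervention
```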
[157] KCS: Diversify Multi-hop Question Generation with Knowledge Composition Sampling
Yangfan Wang, Jie Liu, Chen Tang, Lian Yan, Jingchi Jiang
Main category: cs.CL
TL;DR: KCS framework improves multi-hop QA diversity by sampling varied knowledge compositions using sentence-level conditional prediction and probabilistic contrastive loss.
Details
Motivation: Multi-hop QA suffers from data sparsity leading to spurious patterns; existing methods generate simple questions but neglect integrating essential knowledge from relevant sentences.
Method: Knowledge Composition Sampling (KCS) models knowledge composition selection as sentence-level conditional prediction, uses probabilistic contrastive loss to predict next relevant knowledge, and employs stochastic decoding for inference.
Result: KCS improves knowledge composition selection accuracy by 3.9% and yields improvements on HotpotQA and 2WikiMultihopQA datasets when used for data augmentation.
Conclusion: KCS effectively expands diversity of multi-hop questions by sampling varied knowledge compositions, addressing data sparsity issues in multi-hop QA.
Abstract: Multi-hop question answering faces substantial challenges due to data sparsity, which increases the likelihood of language models learning spurious patterns. To address this issue, prior research has focused on diversifying question generation through content planning and varied expression. However, these approaches often emphasize generating simple questions and neglect the integration of essential knowledge, such as relevant sentences within documents. This paper introduces the Knowledge Composition Sampling (KCS), an innovative framework designed to expand the diversity of generated multi-hop questions by sampling varied knowledge compositions within a given context. KCS models the knowledge composition selection as a sentence-level conditional prediction task and utilizes a probabilistic contrastive loss to predict the next most relevant piece of knowledge. During inference, we employ a stochastic decoding strategy to effectively balance accuracy and diversity. Compared to competitive baselines, our KCS improves the overall accuracy of knowledge composition selection by 3.9%, and its application for data augmentation yields improvements on HotpotQA and 2WikiMultihopQA datasets. Our code is available at: https://github.com/yangfanww/kcs.
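The inference-time recipe, sampling the next relevant sentence from a temperature-scaled distribution instead of taking the argmax, can be sketched directly. The random scorer below is a stand-in for the contrastively trained predictor:

```python
# Stochastic knowledge-composition decoding with a stand-in relevance scorer.
import numpy as np

rng = np.random.default_rng(0)

def sample_composition(n_sentences, scorer, k=3, temperature=0.8):
    chosen, remaining = [], list(range(n_sentences))
    for _ in range(k):
        logits = np.array([scorer(chosen, s) for s in remaining]) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        pick = rng.choice(len(remaining), p=probs)  # stochastic, hence diverse
        chosen.append(remaining.pop(pick))
    return chosen

toy_scorer = lambda chosen, s: rng.normal()         # invented relevance scores
print(sample_composition(10, toy_scorer))           # a sampled 3-sentence composition
```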
[158] Multi-Model Synthetic Training for Mission-Critical Small Language Models
Nolan Platt, Pragyansmita Nayak
Main category: cs.CL
TL;DR: Using LLMs as teachers to generate synthetic Q&A pairs from maritime AIS data, enabling cost-effective fine-tuning of smaller models for specialized domain tasks.
Details
Motivation: LLMs have limited application in specialized domains due to scarcity of domain-specific training data and high inference costs of large models. Need cost-effective solutions for specialized AI applications like maritime intelligence.
Method: Use GPT-4o and o3-mini as one-time teachers to generate 21,543 synthetic Q&A pairs from 3.2 billion AIS vessel tracking records. Fine-tune smaller Qwen2.5-7B model on this synthetic dataset rather than using large models directly for inference.
Result: Achieved 261x cost reduction for maritime intelligence. Fine-tuned Qwen2.5-7B achieves 75% accuracy on maritime tasks, comparable to larger models but substantially cheaper. Framework prevents overfitting and ensures accurate reasoning.
Conclusion: Smaller, properly fine-tuned models can match larger model accuracy at dramatically lower costs. Approach enables specialized AI applications where manual annotation is infeasible, with applications in maritime safety, security, and traffic management.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their application to specialized fields remains constrained by the scarcity and complexity of domain-specific training data. We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question and answer pairs through multi-model generation (GPT-4o and o3-mini), preventing overfitting and ensuring accurate reasoning. The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference. We show that smaller, cheaper models – when fine tuned properly – can provide similar accuracy compared to larger models that are prohibitively expensive. Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible. Beyond expanding research in the growing field of specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems in various industries.
[159] FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models
Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, Irina Belousova
Main category: cs.CL
TL;DR: FS-DFM is a few-step discrete flow-matching model for faster language generation that achieves perplexity parity with 1024-step baselines using only 8 steps, delivering 128x speedup.
Details
Motivation: Autoregressive language models are serial (one token per forward pass), limiting throughput and increasing latency for long sequences. Diffusion language models parallelize but require hundreds to thousands of model evaluations for quality. Need faster generation without sacrificing quality.
Method: Introduces FS-DFM (Few-Step Discrete Flow-Matching) with three key components: 1) Makes number of sampling steps an explicit parameter and trains for consistency across step budgets, 2) Reliable update rule that moves probability without overshooting, 3) Strong teacher guidance distilled from long-run trajectories.
Result: On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with 1,024-step discrete-flow baseline for generating 1,024 tokens using similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains.
Conclusion: FS-DFM enables efficient few-step discrete flow-matching for language generation, achieving quality parity with long-run baselines while dramatically improving speed, making parallel language generation more practical.
Abstract: Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce FS-DFM, Few-Step Discrete Flow-Matching, a discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains. Code & pretrained checkpoints: https://github.com/apple/ml-fs-dfm
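As an intuition aid only: "moving probability in the right direction without overshooting" resembles blending the current token distribution toward a predicted target by exactly the remaining-step fraction, so the final step lands on the target regardless of the budget. A toy caricature, not the paper's actual update rule:

```python
# Toy few-step blend toward a predicted target distribution.
import numpy as np

def few_step_sample(p0, predict_target, k=8):
    p = p0.copy()
    for step in range(k):
        target = predict_target(p)      # model's guess at the final distribution
        frac = 1.0 / (k - step)         # exactly the remaining fraction
        p = p + frac * (target - p)     # reaches the target on the last step
    return p

vocab = 5
p0 = np.full(vocab, 1.0 / vocab)        # start from uniform
target = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
print(few_step_sample(p0, lambda p: target, k=8))  # == target after 8 steps
```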
[160] LayerNorm Induces Recency Bias in Transformer Decoders
Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, Edward Choi
Main category: cs.CL
TL;DR: Analysis shows that stacked causal self-attention layers combined with LayerNorm induce recency bias in Transformer decoders, contrary to previous understanding of positional bias toward earlier tokens.
Details
Motivation: There's a discrepancy between the positional bias induced by causal self-attention layers (which favors earlier tokens) and the recency bias (favoring later tokens) typically observed in Transformer decoders. The paper aims to understand how architectural components interact to create this recency bias.
Method: The authors analyze the interaction between causal self-attention and other architectural components, specifically examining how stacked causal self-attention layers combined with LayerNorm induce recency bias. They also study the effects of residual connections and input token embedding distributions on this bias.
Result: The analysis reveals that the combination of stacked causal self-attention layers with LayerNorm is responsible for inducing recency bias in Transformer decoders, providing new theoretical insights into positional information dynamics.
Conclusion: The findings provide theoretical insights into how positional information interacts with architectural components in Transformers and suggest directions for improving positional encoding strategies.
Abstract: Causal self-attention provides positional information to Transformer decoders. Prior work has shown that stacks of causal self-attention layers alone induce a positional bias in attention scores toward earlier tokens. However, this differs from the bias toward later tokens typically observed in Transformer decoders, known as recency bias. We address this discrepancy by analyzing the interaction between causal self-attention and other architectural components. We show that stacked causal self-attention layers combined with LayerNorm induce recency bias. Furthermore, we examine the effects of residual connections and the distribution of input token embeddings on this bias. Our results provide new theoretical insights into how positional information interacts with architectural components and suggest directions for improving positional encoding strategies.
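The claim is cheap to probe: stack causal self-attention with residual connections and LayerNorm on random inputs, then look at where attention mass concentrates. A toy probe with arbitrary sizes; whether it reproduces recency bias depends on initialization, so it only illustrates the measurement, not the paper's analysis:

```python
# Measure attention received per position after stacked causal attention + LayerNorm.
import torch
import torch.nn as nn

torch.manual_seed(0)
L, d, heads, depth = 32, 64, 4, 6
x = torch.randn(1, L, d)
mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)  # causal mask

attns = nn.ModuleList(nn.MultiheadAttention(d, heads, batch_first=True)
                      for _ in range(depth))
norms = nn.ModuleList(nn.LayerNorm(d) for _ in range(depth))

with torch.no_grad():
    for attn, ln in zip(attns, norms):
        out, weights = attn(x, x, x, attn_mask=mask, need_weights=True)
        x = ln(x + out)                 # residual + LayerNorm, decoder-style
    received = weights[0].mean(dim=0)   # mean attention each position receives
print(received[:4])                     # early positions...
print(received[-4:])                    # ...vs late positions at the last layer
```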
[161] RedNote-Vibe: A Dataset for Capturing Temporal Dynamics of AI-Generated Text in Lifestyle Social Media
Yudong Li, Yufei Sun, Peiru Yang, Yuhan Yao, Wanyue Li, Jiajun Zou, Haoyang Yang, Haotian Gan, Linlin Shen, Yongfeng Huang
Main category: cs.CL
TL;DR: RedNote-Vibe is a 5-year dataset from lifestyle platform RedNote (Xiaohongshu) with engagement metrics, and PLAD is a psycholinguistic framework for AI-generated text detection that reveals insights about human vs. AI content dynamics.
Details
Motivation: To study temporal dynamics of content creation on lifestyle platforms and develop robust detection methods for AI-generated content, while understanding how AI tools affect engagement and content quality over time.
Method: Created RedNote-Vibe dataset spanning 5 years from RedNote platform with comprehensive engagement metrics. Proposed Psycholinguistic AIGT Detection Framework (PLAD) grounded in cognitive psychology, leveraging deep psychological signatures for interpretable AI-generated text detection.
Result: PLAD shows superior detection performance. Key findings: human content outperforms AI in emotional resonance; AI content is more homogeneous and rarely produces breaking posts; human-AI gap narrows for high-investment interactions; small group of users strategically using AI tools achieve higher engagement.
Conclusion: The dataset enables temporal analysis of content dynamics, and PLAD provides robust, interpretable AI detection. Strategic AI use by some users can enhance engagement, but human content remains superior in emotional domains.
Abstract: We introduce RedNote-Vibe, a dataset spanning five years (pre-LLM to July 2025) sourced from lifestyle platform RedNote (Xiaohongshu), capturing the temporal dynamics of content creation and enriched with comprehensive engagement metrics. To address the detection challenge posed by RedNote-Vibe, we propose the PsychoLinguistic AIGT Detection Framework (PLAD). Grounded in cognitive psychology, PLAD leverages deep psychological signatures for robust and interpretable detection. Our experiments demonstrate PLAD’s superior performance and reveal insights into content dynamics: (1) human content continues to outperform AI in emotionally resonant domains; (2) AI content is more homogeneous and rarely produces breaking posts, however, this human-AI gap narrows for higher-investment interactions; and (3) most interestingly, a small group of users who strategically utilize AI tools can achieve higher engagement outcomes. The dataset is available at https://github.com/ydli-ai/RedNote-Vibe
[162] StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou
Main category: cs.CL
TL;DR: StableToken is a robust semantic speech tokenizer that uses multi-branch architecture with bit-wise voting to create stable token sequences resistant to acoustic perturbations, improving downstream SpeechLLM robustness.
Details
Motivation: Current semantic speech tokenizers are fragile to meaning-irrelevant acoustic perturbations, causing drastic token sequence changes even at high SNRs where speech remains intelligible. This instability increases learning burden for downstream LLMs and stems from brittle single-path quantization and distant training signals.
Method: StableToken uses a multi-branch architecture that processes audio in parallel, then merges representations via a powerful bit-wise voting mechanism to form a single, stable token sequence. This consensus-driven approach addresses the limitations of single-path quantization.
Result: StableToken achieves state-of-the-art token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates to significant improvements in SpeechLLM robustness across various tasks.
Conclusion: The proposed StableToken tokenizer successfully addresses the fragility of semantic speech tokenizers through its consensus-driven architecture, providing stable token sequences that enhance downstream SpeechLLM performance and robustness.
Abstract: Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks. Our code and model are publicly available at https://github.com/Tencent/StableToken.
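The consensus mechanism is easy to sketch: run parallel branches over the same audio, vote per bit of each emitted token id, and recombine the majority bits. Branch count, bit width, and the noisy example are illustrative assumptions:

```python
# Bit-wise majority voting across parallel tokenizer branches (toy example).
import numpy as np

def bitwise_vote(branch_tokens, n_bits=10):
    """branch_tokens: int array [n_branches, n_tokens] of token ids (< 2**n_bits)."""
    b = branch_tokens.shape[0]
    shifts = np.arange(n_bits)
    bits = (branch_tokens[..., None] >> shifts) & 1    # [branches, tokens, bits]
    majority = (bits.sum(axis=0) * 2 > b).astype(int)  # per-bit majority vote
    return (majority << shifts).sum(axis=-1)           # recombine bits into ids

branches = np.array([[513, 20, 7],   # branch 1 (first token corrupted by noise)
                     [512, 20, 7],   # branch 2
                     [512, 21, 7]])  # branch 3
print(bitwise_vote(branches))        # [512  20   7] -- the stable consensus
```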
[163] ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents
Hwan Chang, Yonghyun Jun, Hwanhee Lee
Main category: cs.CL
TL;DR: ChatInject: A novel attack exploiting LLM agents’ chat template dependencies through structured malicious payloads and multi-turn persuasion dialogues, achieving significantly higher success rates than traditional prompt injection methods.
Details
Motivation: As LLM-based agents increasingly interact with external environments, new attack surfaces emerge. While previous research focused on plain-text injection attacks, there's an underexplored vulnerability in LLMs' dependence on structured chat templates and susceptibility to contextual manipulation through persuasive multi-turn dialogues.
Method: Introduces ChatInject attack that formats malicious payloads to mimic native chat templates, exploiting models’ instruction-following tendencies. Develops a persuasion-driven Multi-turn variant that primes agents across conversational turns to accept suspicious actions. Tests across frontier LLMs on AgentDojo and InjecAgent benchmarks.
Result: ChatInject achieves significantly higher average attack success rates: 5.18% to 32.05% on AgentDojo and 15.13% to 45.90% on InjecAgent. Multi-turn dialogues show particularly strong performance at average 52.33% success rate on InjecAgent. Chat-template-based payloads demonstrate strong transferability across models and remain effective against closed-source LLMs. Existing prompt-based defenses are largely ineffective.
Conclusion: The paper reveals critical vulnerabilities in current agent systems, highlighting how chat template dependencies and multi-turn persuasion can be exploited for indirect prompt injection attacks, with existing defenses proving inadequate against these sophisticated approaches.
Abstract: The growing deployment of large language model (LLM) based agents that interact with external environments has created new attack surfaces for adversarial manipulation. One major threat is indirect prompt injection, where attackers embed malicious instructions in external environment output, causing agents to interpret and execute them as if they were legitimate prompts. While previous research has focused primarily on plain-text injection attacks, we find a significant yet underexplored vulnerability: LLMs’ dependence on structured chat templates and their susceptibility to contextual manipulation through persuasive multi-turn dialogues. To this end, we introduce ChatInject, an attack that formats malicious payloads to mimic native chat templates, thereby exploiting the model’s inherent instruction-following tendencies. Building on this foundation, we develop a persuasion-driven Multi-turn variant that primes the agent across conversational turns to accept and execute otherwise suspicious actions. Through comprehensive experiments across frontier LLMs, we demonstrate three critical findings: (1) ChatInject achieves significantly higher average attack success rates than traditional prompt injection methods, improving from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent, with multi-turn dialogues showing particularly strong performance at average 52.33% success rate on InjecAgent, (2) chat-template-based payloads demonstrate strong transferability across models and remain effective even against closed-source LLMs, despite their unknown template structures, and (3) existing prompt-based defenses are largely ineffective against this attack approach, especially against Multi-turn variants. These findings highlight vulnerabilities in current agent systems.
[164] Infusing Theory of Mind into Socially Intelligent LLM Agents
EunJeong Hwang, Yuwei Yin, Giuseppe Carenini, Peter West, Vered Shwartz
Main category: cs.CL
TL;DR: LLMs with explicit Theory of Mind (ToM) reasoning achieve better dialogue performance and goal achievement, demonstrated through ToMAgent which combines ToM with dialogue lookahead for strategic social interactions.
Details
Motivation: Current chatbots and LLM-based social agents lack Theory of Mind (understanding others' mental states), which is crucial for human social intelligence. Integrating ToM could improve dialogue effectiveness and goal achievement in social interactions.
Method: Introduces ToMAgent (ToMA), a ToM-focused dialogue agent trained by pairing Theory of Mind with dialogue lookahead to produce mental states maximally useful for achieving dialogue goals. First shows that simply prompting models to generate mental states between dialogue turns provides benefits, then develops the full ToMA approach.
Result: Experiments on Sotopia interactive social evaluation benchmark show ToMA outperforms baselines. Comprehensive analysis reveals ToMA exhibits more strategic, goal-oriented reasoning behaviors, enables long-horizon adaptation, and maintains better relationships with dialogue partners.
Conclusion: Explicit integration of Theory of Mind significantly improves LLM-based social agents’ dialogue performance and goal achievement, representing a step forward in building socially intelligent LLM agents.
Abstract: Theory of Mind (ToM), an understanding of the mental states of others, is a key aspect of human social intelligence, yet chatbots and LLM-based social agents do not typically integrate it. In this work, we demonstrate that LLMs that explicitly use ToM get better at dialogue, achieving goals more effectively. After showing that simply prompting models to generate mental states between dialogue turns already provides significant benefit, we further introduce ToMAgent (ToMA), a ToM-focused dialogue agent. ToMA is trained by pairing ToM with dialogue lookahead to produce mental states that are maximally useful for achieving dialogue goals. Experiments on the Sotopia interactive social evaluation benchmark demonstrate the effectiveness of our method over a range of baselines. Comprehensive analysis shows that ToMA exhibits more strategic, goal-oriented reasoning behaviors, which enable long-horizon adaptation, while maintaining better relationships with its partners. Our results suggest a step forward in integrating ToM for building socially intelligent LLM agents.
[165] MASH: Modeling Abstention via Selective Help-Seeking
Mustafa Omer Gul, Claire Cardie, Tanya Goyal
Main category: cs.CL
TL;DR: MASH is a training framework that uses reinforcement learning to teach LLMs to selectively use search tools as a proxy for abstention when they lack parametric knowledge, improving answer accuracy and abstention performance without requiring predetermined knowledge boundaries.
Details
Motivation: LLMs often hallucinate answers when faced with questions outside their parametric knowledge boundaries. Current approaches for teaching abstention require predetermined knowledge boundaries to construct training data, which is impractical. The authors propose using search tool usage as a natural proxy for abstention.
Method: MASH uses reinforcement learning with a pay-per-search reward structure. The framework penalizes external help-seeking (search tool use) while rewarding answer accuracy. This teaches LLMs to only use search tools when they lack parametric knowledge, effectively aligning search tool use with knowledge boundaries.
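Code sketch: The pay-per-search idea fits in a few lines. The constants below are assumptions (the digest does not give MASH's exact values), so read this as the shape of the reward rather than the paper's implementation.

```python
def pay_per_search_reward(answer_correct: bool, num_searches: int,
                          search_cost: float = 0.2) -> float:
    """Reward accuracy, charge a fixed fee per search call (illustrative).

    Under RL, the policy learns to search only when the expected accuracy
    gain exceeds the fee, so search use comes to track the model's own
    knowledge boundary -- and searching can serve as a proxy for abstention.
    """
    accuracy_reward = 1.0 if answer_correct else 0.0
    return accuracy_reward - search_cost * num_searches
```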
Result: On three knowledge-intensive QA datasets, MASH substantially improves selective help-seeking performance over prior efficient search approaches. On multi-hop datasets, it improves answer accuracy by 7.6%. It also demonstrates strong off-the-shelf abstention performance competitive with prior methods that require predetermined knowledge boundaries.
Conclusion: MASH effectively aligns search tool use with parametric knowledge boundaries through reinforcement learning. This approach enables LLMs to make better abstention decisions and use search tools more efficiently without requiring explicit knowledge boundary determination for training data construction.
Abstract: LLMs cannot reliably recognize their parametric knowledge boundaries and often hallucinate answers to outside-of-boundary questions. In this paper, we introduce MASH (Modeling Abstention via Selective Help-seeking), a training framework that readily extracts abstentions from LLMs. Our key idea is that any external help-seeking by an LLM, i.e. search tool use, can serve as a proxy for abstention if the external help (search) is appropriately penalized while also rewarding answer accuracy. MASH operationalizes this idea using reinforcement learning with a pay-per-search reward. We run experiments on three knowledge-intensive QA datasets. Our results show that MASH substantially improves upon the selective help-seeking performance of prior efficient search approaches; on multi-hop datasets, it improves answer accuracy by 7.6%. Furthermore, MASH demonstrates strong off-the-shelf abstention performance, showcasing behavior competitive with prior abstention methods that additionally require predetermining model knowledge boundaries to construct training data. Overall, we show MASH training effectively aligns search tool use with parametric knowledge, which can be successfully leveraged for making abstention decisions and efficient search tool use.
[166] Catalog-Native LLM: Speaking Item-ID Dialect with Less Entanglement for Recommendation
Reza Shirkavand, Xiaokai Wei, Chen Wang, Zheng Hui, Heng Huang, Michelle Gong
Main category: cs.CL
TL;DR: IDIOMoE is a novel recommendation system that combines collaborative filtering with LLMs by treating item interaction histories as a native dialect, using mixture-of-experts architecture to handle both text and item modalities without interference.
Details
Motivation: Modern recommendation systems need to combine the predictive accuracy of collaborative filtering with the expressive reasoning of LLMs, especially as users expect natural-language queries and transparent explanations. However, collaborative signals are token-efficient but semantically opaque, while LLMs struggle with implicit user preferences when trained only on text.
Method: IDIOMoE treats item interaction histories as a native dialect within language space. It splits the Feed Forward Network of each block of a pretrained LLM into separate text and item experts with token-type gating, avoiding destructive interference between modalities while enabling collaborative signals to be understood like natural language.
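Code sketch: A minimal PyTorch rendering of the token-type-gated dual-expert FFN described above. Layer shapes, the GELU activation, and the hard routing rule are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TokenTypeGatedFFN(nn.Module):
    """Two parallel FFN experts with hard routing by token type.

    Text tokens pass through the text expert, item-ID tokens through a
    separate item expert, so catalog training does not overwrite the
    language pathway (a sketch of the IDIOMoE idea)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.text_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.item_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, is_item_token: torch.Tensor):
        # x: (batch, seq, d_model); is_item_token: (batch, seq) bool mask.
        out_text = self.text_expert(x)
        out_item = self.item_expert(x)
        gate = is_item_token.unsqueeze(-1)        # (batch, seq, 1)
        return torch.where(gate, out_item, out_text)
```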
Result: IDIOMoE demonstrates strong recommendation performance across both public and proprietary datasets while preserving the text understanding capabilities of the pretrained LLM.
Conclusion: The paper successfully unifies collaborative filtering and LLMs by treating item interactions as a native dialect, enabling effective multimodal recommendation systems that handle both text and item modalities without interference.
Abstract: While collaborative filtering delivers predictive accuracy and efficiency, and Large Language Models (LLMs) enable expressive and generalizable reasoning, modern recommendation systems must bring these strengths together. Growing user expectations, such as natural-language queries and transparent explanations, further highlight the need for a unified approach. However, doing so is nontrivial. Collaborative signals are often token-efficient but semantically opaque, while LLMs are semantically rich but struggle to model implicit user preferences when trained only on textual inputs. This paper introduces Item-ID + Oral-language Mixture-of-Experts Language Model (IDIOMoE), which treats item interaction histories as a native dialect within the language space, enabling collaborative signals to be understood in the same way as natural language. By splitting the Feed Forward Network of each block of a pretrained LLM into a separate text expert and an item expert with token-type gating, our method avoids destructive interference between text and catalog modalities. IDIOMoE demonstrates strong recommendation performance across both public and proprietary datasets, while preserving the text understanding of the pretrained model.
[167] EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget
Liang Chen, Xueting Han, Qizhou Wang, Bo Han, Jing Bai, Hinrich Schutze, Kam-Fai Wong
Main category: cs.CL
TL;DR: EEPO introduces a two-stage rollout framework with adaptive unlearning to balance exploration-exploitation in RL for LLMs, preventing entropy collapse and improving performance on reasoning tasks.
Details
Motivation: Current RL methods for LLMs overemphasize exploitation, leading to entropy collapse and diminished exploration capacity. Techniques that increase policy stochasticity often fail to escape dominant behavioral modes, creating a self-reinforcing loop that further erodes exploration.
Method: EEPO uses two-stage rollouts with adaptive unlearning: the first stage generates half of the trajectories, then a lightweight unlearning step temporarily suppresses these sampled responses, forcing the second stage to explore different regions of the output space. This sample-then-forget mechanism disrupts the self-reinforcing loop.
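Code sketch: One plausible reading of the sample-then-forget rollout, as a skeleton. Whether EEPO restores the pre-unlearning weights before the policy update, and what the unlearning objective is, are assumptions here, not details from the paper.

```python
import copy

def sample_then_forget_rollout(model, prompts, sample_fn, unlearn_fn, n_rollouts):
    """Two-stage EEPO-style rollout (illustrative skeleton).

    sample_fn(model, prompts, n): draws n trajectories from the policy.
    unlearn_fn(model, trajs):     a few lightweight steps that suppress the
                                  likelihood of the given responses.
    """
    first = sample_fn(model, prompts, n_rollouts // 2)
    snapshot = copy.deepcopy(model.state_dict())   # keep the true policy
    unlearn_fn(model, first)                       # temporarily forget stage-1 samples
    second = sample_fn(model, prompts, n_rollouts - n_rollouts // 2)
    model.load_state_dict(snapshot)                # discard the unlearning edit
    return first + second                          # trajectories for the RL update
```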
Result: Across five reasoning benchmarks, EEPO outperforms GRPO with average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
Conclusion: EEPO effectively addresses exploration-exploitation trade-off in RL for LLMs through its novel two-stage rollout with adaptive unlearning, demonstrating significant performance improvements on reasoning tasks.
Abstract: Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy collapse, diminished exploratory capacity, and ultimately limited performance gains. Although techniques that increase policy stochasticity can promote exploration, they frequently fail to escape dominant behavioral modes. This creates a self-reinforcing loop – repeatedly sampling and rewarding dominant modes – that further erodes exploration. We introduce Exploration-Enhanced Policy Optimization (EEPO), a framework that promotes exploration via two-stage rollouts with adaptive unlearning. In the first stage, the model generates half of the trajectories; it then undergoes a lightweight unlearning step to temporarily suppress these sampled responses, forcing the second stage to explore different regions of the output space. This sample-then-forget mechanism disrupts the self-reinforcing loop and promotes wider exploration during rollouts. Across five reasoning benchmarks, EEPO outperforms GRPO, achieving average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
[168] EDUMATH: Generating Standards-aligned Educational Math Word Problems
Bryan R. Christ, Penelope Molitz, Beau LeBlond, Zachary Gottesman, Jonathan Kropko, Thomas Hartvigsen
Main category: cs.CL
TL;DR: LLMs can generate math word problems customized to student interests and educational standards, with teacher-annotated data enabling open models to match or outperform closed models, and students preferring customized problems.
Details
Motivation: Math word problems are important educational tools, but teachers lack time to customize them for individual students' interests and ability levels, creating a need for automated solutions.
Method: Used a joint human expert-LLM judge approach to evaluate 11,000+ generated MWPs, created a teacher-annotated dataset, trained a 12B open model and a text classifier, and conducted a student study comparing customized vs. human-written MWPs.
Result: 12B open model matches larger models’ performance; 30B open LLM with classifier outperforms closed baselines; generated MWPs are more similar to human-written ones; students perform similarly on both but prefer customized MWPs.
Conclusion: LLMs can effectively support math education by generating customized math word problems that align with educational standards and student preferences, with open models achieving competitive performance.
Abstract: Math word problems (MWPs) are critical K-12 educational tools, and customizing them to students’ interests and ability levels can enhance learning. However, teachers struggle to find time to customize MWPs for students given large class sizes and increasing burnout. We propose that LLMs can support math education by generating MWPs customized to student interests and math education standards. We use a joint human expert-LLM judge approach to evaluate over 11,000 MWPs generated by open and closed LLMs and develop the first teacher-annotated dataset for standards-aligned educational MWP generation. We show the value of our data by using it to train a 12B open model that matches the performance of larger and more capable open models. We also use our teacher-annotated data to train a text classifier that enables a 30B open LLM to outperform existing closed baselines without any training. Next, we show our models’ MWPs are more similar to human-written MWPs than those from existing models. We conclude by conducting the first study of customized LLM-generated MWPs with grade school students, finding they perform similarly on our models’ MWPs relative to human-written MWPs but consistently prefer our customized MWPs.
[169] HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao, Kaiyu He, Xinya Du, Zhiyu Chen
Main category: cs.CL
TL;DR: HiPRAG introduces hierarchical process rewards for agentic RAG to optimize search efficiency by reducing over-search and under-search behaviors through fine-grained RL training.
Details
Motivation: Current agentic RAG systems suffer from suboptimal search behaviors like over-search (retrieving known information) and under-search (failing to search when needed), leading to inefficiency and unreliable outputs. Existing RL training methods use outcome-based rewards that lack fine-grained control over the search process.
Method: HiPRAG incorporates a knowledge-grounded process reward into RL training by decomposing reasoning trajectories into discrete steps and applying a hierarchical reward function. This evaluates search necessity on-the-fly and provides bonuses based on optimal search/non-search step proportions, complementing traditional outcome and format rewards.
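Code sketch: What such a hierarchical reward could look like. The step labels, weights, and additive composition are assumptions in the spirit of the description, not HiPRAG's published formula.

```python
def hierarchical_reward(outcome_correct: bool, format_ok: bool,
                        steps: list[dict], bonus_scale: float = 0.5) -> float:
    """Outcome + format reward plus a process bonus (illustrative).

    Each step dict marks whether the step searched and whether searching
    was judged necessary, e.g. {"searched": True, "necessary": True}.
    An unnecessary search is over-search; a skipped-but-needed search is
    under-search; both count as non-optimal steps.
    """
    outcome_r = 1.0 if outcome_correct else 0.0
    format_r = 0.1 if format_ok else 0.0
    if steps:
        optimal = sum(s["searched"] == s["necessary"] for s in steps) / len(steps)
    else:
        optimal = 0.0
    return outcome_r + format_r + bonus_scale * optimal
```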
Result: Experiments on Qwen2.5 and Llama-3.2 models across seven QA benchmarks show average accuracies of 65.4% (3B) and 67.2% (7B) while improving search efficiency. Over-search rate reduced to 2.3% with lower under-search rates. The method generalizes well across RL algorithms, model families, sizes, and types.
Conclusion: HiPRAG demonstrates the importance of fine-grained process optimization in RL for improving search agent efficiency and optimality, showing that optimizing the reasoning process itself (not just outcomes) leads to better performance and generalization.
Abstract: Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which lead to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in an RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent’s reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL for improving the efficiency and optimality of reasoning for search agents.
[170] A Survey of Inductive Reasoning for Large Language Models
Kedi Chen, Dezhao Ruan, Yuhao Dan, Yaoting Wang, Siyu Yan, Xuecheng Wu, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Biqing Qi, Linyang Li, Qipeng Guo, Xiaoming Shi, Wei Zhang
Main category: cs.CL
TL;DR: First comprehensive survey of inductive reasoning for LLMs, covering methods, benchmarks, and analysis of inductive ability sources.
Details
Motivation: Inductive reasoning is fundamental for knowledge generalization and human cognition, but lacks systematic study in LLMs despite increasing interest.
Method: Categorizes inductive reasoning improvement methods into three areas: post-training, test-time scaling, and data augmentation. Summarizes benchmarks and proposes unified sandbox-based evaluation with observation coverage metric.
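Code sketch: The observation coverage metric presumably measures how much of the observed evidence an induced rule reproduces when executed in a sandbox; here is a minimal sketch under that assumption (the survey's exact definition may differ).

```python
def observation_coverage(rule, observations) -> float:
    """Fraction of observations an induced rule reproduces (illustrative).

    rule:         an executable hypothesis, input -> predicted output.
    observations: iterable of (input, expected_output) pairs.
    """
    observations = list(observations)
    if not observations:
        return 0.0
    hits = sum(1 for x, y in observations if rule(x) == y)
    return hits / len(observations)

# e.g. observation_coverage(lambda n: n * 2, [(1, 2), (2, 4), (3, 7)]) -> 2/3
```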
Result: Provides comprehensive survey framework, analysis of inductive ability sources, and insights into how model architectures and data help with inductive tasks.
Conclusion: Establishes foundation for future inductive reasoning research in LLMs through systematic categorization, evaluation methodology, and analysis of inductive capability sources.
Abstract: Reasoning is an important task for large language models (LLMs). Among all the reasoning paradigms, inductive reasoning is one of the fundamental types, which is characterized by its particular-to-general thinking process and the non-uniqueness of its answers. The inductive mode is crucial for knowledge generalization and aligns better with human cognition, so it is a fundamental mode of learning, hence attracting increasing interest. Despite the importance of inductive reasoning, there is no systematic summary of it. Therefore, this paper presents the first comprehensive survey of inductive reasoning for LLMs. First, methods for improving inductive reasoning are categorized into three main areas: post-training, test-time scaling, and data augmentation. Then, current benchmarks of inductive reasoning are summarized, and a unified sandbox-based evaluation approach with the observation coverage metric is derived. Finally, we offer some analyses regarding the source of inductive ability and how simple model architectures and data help with inductive tasks, providing a solid foundation for future research.
[171] Domain-Specific Data Generation Framework for RAG Adaptation
Chris Xing Tian, Weihao Xie, Zhen Chen, Zhengyuan Yi, Hui Liu, Haoliang Li, Shiqi Wang, Siwei Ma
Main category: cs.CL
TL;DR: RAGen is a scalable framework for generating domain-specific question-answer-context triples to adapt RAG systems to specialized domains using modular components for concept extraction, question generation, and multi-chunk retrieval.
Details
Motivation: RAG systems need domain-specific training data beyond general QA to effectively adapt to specialized settings, but creating such data manually is challenging and doesn't scale for dynamic domains.
Method: Modular pipeline with semantic chunking, hierarchical concept extraction, Bloom’s Taxonomy-guided question generation, precise answer extraction, multi-chunk retrieval, and curated distractor contexts for robust training.
Result: RAGen efficiently generates domain-grounded QAC triples for diverse RAG adaptation strategies, handles large evolving document corpora without redundant processing, and supports optimization of LLM, retriever, and embedding models.
Conclusion: RAGen provides a scalable solution for adapting RAG systems to dynamic domains like scientific research and enterprise knowledge bases through automated generation of specialized training data.
Abstract: Retrieval-Augmented Generation (RAG) combines the language understanding and reasoning power of large language models (LLMs) with external retrieval to enable domain-grounded responses. Effectively adapting RAG systems to domain-specific settings requires specialized, context-rich training data beyond general-purpose question-answering. Here, we propose RAGen, a scalable and modular framework for generating domain-grounded question-answer-context (QAC) triples tailored to diverse RAG adaptation approaches. RAGen produces these QAC triples by identifying key concepts in documents, generating diverse questions guided by Bloom’s Taxonomy-inspired principles, and pairing them with precise answers extracted from relevant contexts. RAGen supports multiple RAG adaptation strategies, including the optimization of key components such as the LLM, retriever, and embedding model, etc. Its modular pipeline features semantic chunking, hierarchical concept extraction, and multi-chunk retrieval, along with the introduction of curated distractor contexts to promote robust reasoning. Designed for scalability, RAGen efficiently handles large and evolving document corpora without redundant processing, making it especially suitable for dynamic evolving domains such as scientific research and enterprise knowledge bases.
[172] Beyond Black-Box Interventions: Latent Probing for Faithful Retrieval-Augmented Generation
Linfeng Gao, Qinggang Zhang, Baolong Bi, Bo Zeng, Zheng Yuan, Zerui Chen, Zhimin Wei, Shenghua Liu, Linlong Xu, Longyue Wang, Weihua Luo, Jinsong Su
Main category: cs.CL
TL;DR: ProbeRAG improves RAG faithfulness by analyzing LLM internal reasoning, identifying knowledge conflicts in latent space, and modulating attention for better context integration.
Details
Motivation: Current RAG systems often generate unfaithful responses that conflict with provided context. Existing black-box interventions lack understanding of when/why knowledge conflicts occur, making them brittle and data-intensive.
Method: Three-stage framework: 1) fine-grained knowledge pruning to filter irrelevant context, 2) latent conflict probing to identify hard conflicts in the model’s latent space, and 3) conflict-aware attention to modulate attention heads toward faithful context integration.
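Code sketch: Since the paper reports that conflicting and aligned knowledge states are linearly separable in latent space, stage 2 can be pictured as a linear probe over hidden states. The layer choice, file names, and threshold below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: one hidden-state vector per example from a chosen
# layer, labeled 1 when the retrieved context conflicts with the model's
# parametric answer and 0 when the two are aligned.
X = np.load("hidden_states.npy")        # (n_examples, d_model)
y = np.load("conflict_labels.npy")      # (n_examples,)

probe = LogisticRegression(max_iter=1000).fit(X, y)

def is_hard_conflict(h: np.ndarray, threshold: float = 0.5) -> bool:
    """Flag a hard knowledge conflict from a single latent vector."""
    return probe.predict_proba(h.reshape(1, -1))[0, 1] > threshold
```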
Result: Extensive experiments show ProbeRAG substantially improves both accuracy and contextual faithfulness compared to existing methods.
Conclusion: Moving beyond black-box interventions to analyze internal reasoning processes enables more faithful RAG systems by identifying and addressing knowledge conflicts at the latent representation level.
Abstract: Retrieval-Augmented Generation (RAG) systems often fail to maintain contextual faithfulness, generating responses that conflict with the provided context or fail to fully leverage the provided evidence. Existing methods attempt to improve faithfulness through external interventions, such as specialized prompting, decoding-based calibration, or preference optimization. However, since these approaches treat the LLM as a black box, they lack a reliable mechanism to assess when and why knowledge conflicts occur. Consequently, they tend to be brittle, data-intensive, and agnostic to the model’s internal reasoning process. In this paper, we move beyond black-box interventions to analyze the model’s internal reasoning process. We discover that conflicting and aligned knowledge states are linearly separable in the model’s latent space, and contextual noise systematically increases the entropy of these representations. Based on these findings, we propose ProbeRAG, a novel framework for faithful RAG that operates in three stages: (i) fine-grained knowledge pruning to filter irrelevant context, (ii) latent conflict probing to identify hard conflicts in the model’s latent space, and (iii) conflict-aware attention to modulate attention heads toward faithful context integration. Extensive experiments demonstrate that ProbeRAG substantially improves both accuracy and contextual faithfulness. The related resources are available at https://github.com/LinfengGao/ProbeRAG.
[173] Think Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation
Jinliang Liu, Jiale Bai, Shaoning Zeng
Main category: cs.CL
TL;DR: ParallaxRAG is a multi-view framework for multi-hop reasoning over knowledge graphs that addresses Transformer attention head specialization by creating aligned semantic spaces for different reasoning hops.
Details
Motivation: LLMs struggle with multi-hop reasoning over knowledge graphs due to Transformer attention heads naturally specializing in distinct semantic relations across reasoning stages, forming hop-aligned relay patterns. Existing KG-RAG systems collapse all reasoning hops into single representations, suppressing this implicit structure and causing noisy path exploration.
Method: Introduces ParallaxRAG, a symmetric multi-view framework that decouples queries and KGs into aligned, head-specific semantic spaces. Enforces relational diversity across multiple attention heads while constraining weakly related paths to construct cleaner subgraphs and guide LLMs through hop-wise reasoning.
Result: Achieves state-of-the-art retrieval and QA performance on WebQSP and CWQ benchmarks, substantially reduces hallucination, and generalizes strongly to the biomedical BioASQ benchmark.
Conclusion: ParallaxRAG successfully addresses the structural limitations of Transformer attention heads for multi-hop reasoning by leveraging their natural specialization patterns, leading to improved KG-based reasoning performance and reduced hallucinations.
Abstract: Large language models (LLMs) still struggle with multi-hop reasoning over knowledge-graphs (KGs), and we identify a previously overlooked structural reason for this difficulty: Transformer attention heads naturally specialize in distinct semantic relations across reasoning stages, forming a hop-aligned relay pattern. This key finding suggests that multi-hop reasoning is inherently multi-view, yet existing KG-based retrieval-augmented generation (KG-RAG) systems collapse all reasoning hops into a single, flat embedding space, suppressing this implicit structure and causing noisy or drifted path exploration. We introduce ParallaxRAG, a symmetric multi-view framework that decouples queries and KGs into aligned, head-specific semantic spaces. By enforcing relational diversity across multiple heads while constraining weakly related paths, ParallaxRAG constructs more accurate, cleaner subgraphs and guides LLMs through grounded, hop-wise reasoning. On WebQSP and CWQ, it achieves state-of-the-art retrieval and QA performance, substantially reduces hallucination, and generalizes strongly to the biomedical BioASQ benchmark.
[174] SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, Paul Röttger
Main category: cs.CL
TL;DR: SimBench is a standardized benchmark for evaluating LLM simulation of human behavior across 20 diverse datasets, revealing current models achieve modest fidelity (40.8/100) with scaling patterns and demographic limitations.
Details
Motivation: Current evaluations of LLM simulation fidelity are fragmented with bespoke tasks and metrics, creating incomparable results. There's a need for standardized benchmarks to enable robust, reproducible science of LLM simulation of human behavior.
Method: Introduced SimBench, a large-scale benchmark unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across global participant pools. Evaluates simulation fidelity systematically.
Result: Best LLMs achieve modest simulation fidelity (40.80/100). Performance scales log-linearly with model size but not with inference-time compute. Discovered alignment-simulation tradeoff: instruction tuning helps on consensus questions but harms on diverse ones. Models struggle with specific demographic groups. Simulation ability correlates strongly with knowledge-intensive reasoning (MMLU-Pro, r = 0.939).
Conclusion: SimBench provides foundation for measuring LLM simulation fidelity systematically. Current models have meaningful but limited ability to simulate human behavior, with specific limitations in demographic representation and diverse scenarios.
Abstract: Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations of simulation fidelity are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that the best LLMs today achieve meaningful but modest simulation fidelity (score: 40.80/100), with performance scaling log-linearly with model size but not with increased inference-time compute. We discover an alignment-simulation tradeoff: instruction tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with knowledge-intensive reasoning (MMLU-Pro, r = 0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
[175] AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM
Haoyu Huang, Hong Ting Tsang, Jiaxin Bai, Xi Peng, Gong Zhang, Yangqiu Song
Main category: cs.CL
TL;DR: AtlasKV is a parametric method for augmenting LLMs with billion-scale knowledge graphs using minimal GPU memory, eliminating the need for external retrievers or long context windows.
Details
Motivation: Current RAG methods for LLMs have limitations: they rely on external retrieval modules, introduce inference latency due to expensive searches, and require long relevant context, especially for large-scale knowledge augmentation.
Method: Proposes AtlasKV with two components: KG2KV to integrate KG triples into LLMs at scale, and HiKVP (hierarchical key-value pairs) to achieve sub-linear time and memory complexity. Uses LLMs’ inherent attention mechanism without external retrievers.
Result: Achieves scalable knowledge integration with billion-scale KGs (e.g., 1B triples) using very little GPU memory (less than 20GB VRAM), maintains strong knowledge grounding and generalization, and requires no retraining for new knowledge.
Conclusion: AtlasKV provides an effective parametric alternative to RAG for large-scale knowledge integration into LLMs, addressing latency and memory issues while maintaining performance.
Abstract: Retrieval-augmented generation (RAG) has shown some success in augmenting large language models (LLMs) with external knowledge. However, as a non-parametric knowledge integration paradigm for LLMs, RAG methods heavily rely on external retrieval modules and the retrieved textual context prior. Especially for very large scale knowledge augmentation, they would introduce substantial inference latency due to expensive searches and much longer relevant context. In this paper, we propose a parametric knowledge integration method, called AtlasKV, a scalable, effective, and general way to augment LLMs with billion-scale knowledge graphs (KGs) (e.g. 1B triples) using very little GPU memory cost (e.g. less than 20GB VRAM). In AtlasKV, we introduce KG2KV and HiKVP to integrate KG triples into LLMs at scale with sub-linear time and memory complexity. It maintains strong knowledge grounding and generalization performance using the LLMs’ inherent attention mechanism, and requires no external retrievers, long context priors, or retraining when adapting to new knowledge.
[176] What’s In My Human Feedback? Learning Interpretable Descriptions of Preference Data
Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson
Main category: cs.CL
TL;DR: WIMHF uses sparse autoencoders to extract interpretable features from human feedback data, revealing diverse preferences across datasets and enabling safer data curation and personalization.
Details
Motivation: Human feedback can unpredictably alter language models, but practitioners lack understanding of what feedback data encodes. Prior work focuses on specific attributes, but automatic feature extraction without pre-specified hypotheses remains challenging.
Method: WIMHF uses sparse autoencoders to explain feedback data, characterizing both the preferences a dataset can measure and what annotators actually express. It identifies human-interpretable features across 7 datasets.
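Code sketch: For readers unfamiliar with the tool, a minimal sparse autoencoder of the kind such analyses build on looks like the sketch below. Widths, the ReLU activation, and the L1 coefficient are illustrative; WIMHF's actual architecture may differ.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct preference-pair embeddings through a wide, sparsely
    activated bottleneck; each latent unit is a candidate interpretable
    feature (illustrative sizes)."""

    def __init__(self, d_in: int = 1024, d_hidden: int = 8192):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.enc(x))     # sparse feature activations
        return self.dec(z), z

def sae_loss(x, x_hat, z, l1_coef: float = 1e-3) -> torch.Tensor:
    # Reconstruction error plus an L1 penalty that induces sparsity.
    return ((x - x_hat) ** 2).mean() + l1_coef * z.abs().mean()
```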
Result: WIMHF identifies a small number of interpretable features accounting for most preference prediction signal. Reveals diverse human preferences across contexts (e.g., Reddit users prefer informality/jokes, while HH-RLHF/PRISM annotators disprefer them). Surfaces unsafe preferences like LMArena users voting against refusals in favor of toxic content.
Conclusion: WIMHF enables effective data curation (37% safety gains on Arena) and fine-grained personalization (annotator-specific weights improve prediction). Provides human-centered analysis for better understanding and using preference data.
Abstract: Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What’s In My Human Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders. WIMHF characterizes both (1) the preferences a dataset is capable of measuring and (2) the preferences that the annotators actually express. Across 7 datasets, WIMHF identifies a small number of human-interpretable features that account for the majority of the preference prediction signal achieved by black-box models. These features reveal a wide diversity in what humans prefer, and the role of dataset-level context: for example, users on Reddit prefer informality and jokes, while annotators in HH-RLHF and PRISM disprefer them. WIMHF also surfaces potentially unsafe preferences, such as that LMArena users tend to vote against refusals, often in favor of toxic content. The learned features enable effective data curation: re-labeling the harmful examples in Arena yields large safety gains (+37%) with no cost to general performance. They also allow fine-grained personalization: on the Community Alignment dataset, we learn annotator-specific weights over subjective features that improve preference prediction. WIMHF provides a human-centered analysis method for practitioners to better understand and use preference data.
[177] Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?
Deokhyung Kang, Seonjeong Hwang, Daehui Kim, Hyounghun Kim, Gary Geunbae Lee
Main category: cs.CL
TL;DR: The paper identifies that multilingual reasoning gaps in language models stem from understanding failures in non-English inputs, proposes detection methods for these failures, and introduces Selective Translation to bridge the gap by translating only problematic inputs.
Details
Motivation: Reasoning language models perform worse in low-resource languages than high-resource ones, creating a multilingual reasoning gap. The underlying causes of this gap are not well understood, and addressing it is important for more equitable multilingual reasoning capabilities.
Method: The authors first demonstrate that the gap stems from understanding failures where models can’t properly translate multilingual inputs to English (the dominant reasoning language). They evaluate various detection methods for these understanding failures, finding supervised approaches work best. They then propose Selective Translation, which incorporates English translations into reasoning traces only when understanding failures are detected.
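Code sketch: The control flow is simple enough to state directly. The three callables below are stand-ins for the paper's supervised failure detector, a translation system, and the reasoning model.

```python
def selective_translation(question: str, detect_failure, translate, reason) -> str:
    """Prepend an English translation to the reasoning trace only when an
    understanding failure is detected (illustrative skeleton)."""
    prefix = ""
    if detect_failure(question):        # supervised detector; fires on ~20% of inputs
        prefix = f"English translation: {translate(question)}\n"
    return reason(prefix + question)
```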
Result: Experimental results with Qwen3-4B show that Selective Translation substantially bridges the multilingual reasoning gap, achieving near full-translation performance while translating only about 20% of inputs. Understanding failures were found to be detectable to a meaningful extent.
Conclusion: Failures in language understanding are the primary driver of the multilingual reasoning gap, and these can be detected and selectively mitigated. This clarifies the origin of the gap and suggests a path toward more equitable multilingual reasoning.
Abstract: Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still exhibit a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have been made to address this gap, its underlying causes remain largely unexplored. In this work, we show that this gap primarily stems from failures in language understanding, specifically the model’s inability to translate multilingual inputs into the language dominating its reasoning traces (typically English). As identifying understanding failures can enable targeted mitigation of the gap, we evaluate a range of detection methods and find that understanding failures are detectable to a meaningful extent, with supervised approaches performing best. Building on this, we propose Selective Translation, a strategy that incorporates an English translation into the initial reasoning trace only when an understanding failure is detected. Experimental results using Qwen3-4B show that Selective Translation substantially bridges the multilingual reasoning gap, achieving near full-translation performance while translating only about 20% of inputs. Together, our results show that failures in language understanding are the primary driver of the multilingual reasoning gap and can be detected and selectively mitigated, clarifying its origin and suggesting a path toward more equitable multilingual reasoning. Our code and data are publicly available at https://github.com/deokhk/RLM_analysis
[178] LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs
Pei-Fu Guo, Yun-Da Tsai, Chun-Chia Hsu, Kai-Xin Chen, Ya-An Tsai, Kai-Wei Chang, Nanyun Peng, Mi-Yen Yeh, Shou-De Lin
Main category: cs.CL
TL;DR: LiveCLKTBench is an automated pipeline for evaluating cross-lingual knowledge transfer in LLMs by generating time-sensitive factual questions to isolate genuine transfer from pre-training exposure.
Details
Motivation: Current evaluation of cross-lingual knowledge transfer in LLMs is challenging because correct answers in target languages could result from either genuine transfer or prior exposure during pre-training, making it difficult to isolate and measure true transfer capabilities.
Method: The pipeline identifies self-contained, time-sensitive knowledge entities from real-world domains, filters them based on temporal occurrence, verifies them against model knowledge, generates factual questions from entity documents, and translates them into multiple languages to evaluate transferability across linguistic boundaries.
Result: Evaluation of several LLMs across five languages shows that cross-lingual transfer is strongly influenced by linguistic distance and often asymmetric across language directions. Larger models improve transfer, but gains diminish with scale and vary across domains.
Conclusion: LiveCLKTBench provides a reliable benchmark for evaluating cross-lingual knowledge transfer, revealing important patterns about linguistic distance and asymmetry in transfer, with implications for multilingual model development.
Abstract: Evaluating cross-lingual knowledge transfer in large language models is challenging, as correct answers in a target language may arise either from genuine transfer or from prior exposure during pre-training. We present LiveCLKTBench, an automated generation pipeline specifically designed to isolate and measure cross-lingual knowledge transfer. Our pipeline identifies self-contained, time-sensitive knowledge entities from real-world domains, filters them based on temporal occurrence, and verifies them against the model’s knowledge. The documents of these valid entities are then used to generate factual questions, which are translated into multiple languages to evaluate transferability across linguistic boundaries. Using LiveCLKTBench, we evaluate several LLMs across five languages and observe that cross-lingual transfer is strongly influenced by linguistic distance and often asymmetric across language directions. While larger models improve transfer, the gains diminish with scale and vary across domains. These findings provide new insights into multilingual transfer and demonstrate the value of LiveCLKTBench as a reliable benchmark for future research.
[179] MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages
Yexing Du, Kaiyuan Liu, Youcheng Pan, Bo Yang, Keqi Deng, Xie Chen, Yang Xiang, Ming Liu, Bing Qin, YaoWei Wang
Main category: cs.CL
TL;DR: MCAT is a multilingual cost-effective accelerated speech-to-text translation framework that extends MLLMs to 70 languages with curriculum learning and reduces speech sequence length to 30 tokens for faster inference.
Details
Motivation: Current MLLMs for speech-to-text translation face two key limitations: (1) English-centric datasets restrict many-to-many translation capabilities across languages, and (2) inference speed degrades dramatically when speech is converted into long sequences (e.g., 750 tokens).
Method: Proposes MCAT framework with two innovations: (1) language scaling method using curriculum learning and data balancing to extend coverage to 70 languages, and (2) optimized speech adapter module that reduces speech sequence length to only 30 tokens.
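Code sketch: The length-reduction idea behind the adapter, rendered with adaptive pooling plus a projection. MCAT's actual adapter architecture is not specified here, so treat this purely as the general pattern of compressing ~750 speech tokens to 30 before the LLM.

```python
import torch
import torch.nn as nn

class FixedLengthSpeechAdapter(nn.Module):
    """Compress a long speech-encoder sequence to a fixed 30 tokens
    (illustrative, not MCAT's exact module)."""

    def __init__(self, d_speech: int, d_llm: int, n_tokens: int = 30):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(n_tokens)
        self.proj = nn.Linear(d_speech, d_llm)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, seq_len ~750, d_speech)
        pooled = self.pool(speech_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)        # (batch, 30, d_llm)
```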
Result: MCAT surpasses state-of-the-art end-to-end models on FLEURS dataset across 70x69 directions and enhances inference efficiency. Tested on MLLMs of different scales (9B and 27B).
Conclusion: MCAT effectively addresses language coverage and efficiency challenges in MLLM-based speech-to-text translation, enabling many-to-many translation across 70 languages with improved inference speed.
Abstract: Multimodal Large Language Models (MLLMs) have achieved great success in Speech-to-Text Translation (S2TT) tasks. However, current research is constrained by two key challenges: language coverage and efficiency. Most of the popular S2TT datasets are substantially English-centric, which restricts the scaling-up of MLLMs’ many-to-many translation capabilities. Moreover, the inference speed of MLLMs degrades dramatically when the speech is converted into long sequences (e.g., 750 tokens). To address these limitations, we propose a Multilingual Cost-effective Accelerated Speech-to-Text Translator (MCAT) framework, which includes two innovations. First, a language scaling method that leverages curriculum learning and a data balancing strategy is introduced to extend the language coverage supported by MLLMs to 70 languages and achieve mutual translation among these languages. Second, an optimized speech adapter module is designed to reduce the length of the speech sequence to only 30 tokens. Extensive experiments were conducted on MLLMs of different scales (9B and 27B). The experimental results demonstrate that MCAT not only surpasses state-of-the-art end-to-end models on the FLEURS dataset across 70x69 directions but also enhances inference efficiency. The code and models are released at https://github.com/yxduir/m2m-70.
[180] Different types of syntactic agreement recruit the same units within large language models
Daria Kryvosheieva, Andrea de Varda, Evelina Fedorenko, Greta Tuckute
Main category: cs.CL
TL;DR: LLMs represent syntactic agreement as a meaningful functional category with shared neural components across different agreement types and languages.
Details
Motivation: To understand how grammatical knowledge is represented in LLMs, specifically whether different syntactic phenomena recruit shared or distinct components, using agreement phenomena as a case study.
Method: Functional localization approach inspired by cognitive neuroscience to identify LLM units responsive to 67 English syntactic phenomena across 7 open-weight models, with cross-lingual analysis in 57 languages.
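Code sketch: Functional localization can be pictured as selecting the units with the largest grammatical-minus-ungrammatical response. The contrast statistic and top-k selection below are assumptions in the spirit of the method, not the paper's exact criterion.

```python
import numpy as np

def localize_units(act_gram: np.ndarray, act_ungram: np.ndarray, k: int = 100):
    """Pick the k units whose mean activation separates grammatical from
    ungrammatical sentences most strongly (illustrative).

    act_gram, act_ungram: (n_sentences, n_units) activation matrices for
    matched sentence pairs exhibiting one syntactic phenomenon.
    """
    contrast = act_gram.mean(axis=0) - act_ungram.mean(axis=0)
    return np.argsort(-np.abs(contrast))[:k]      # indices of top-k units
```

Overlap between the unit sets localized for different agreement types is then what the paper's shared-category finding is about.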
Result: Different types of syntactic agreement recruit overlapping sets of units, suggesting agreement constitutes a meaningful functional category; pattern holds across English, Russian, Chinese; structurally similar languages share more units for subject-verb agreement.
Conclusion: Syntactic agreement represents a meaningful category within LLMs’ representational spaces, revealing systematic organization of grammatical knowledge.
Abstract: Large language models (LLMs) can reliably distinguish grammatical from ungrammatical sentences, but how grammatical knowledge is represented within the models remains an open question. We investigate whether different syntactic phenomena recruit shared or distinct components in LLMs. Using a functional localization approach inspired by cognitive neuroscience, we identify the LLM units most responsive to 67 English syntactic phenomena in seven open-weight models. These units are consistently recruited across sentences containing the phenomena and causally support the models’ syntactic performance. Critically, different types of syntactic agreement (e.g., subject-verb, anaphor, determiner-noun) recruit overlapping sets of units, suggesting that agreement constitutes a meaningful functional category for LLMs. This pattern holds in English, Russian, and Chinese; and further, in a cross-lingual analysis of 57 diverse languages, structurally more similar languages share more units for subject-verb agreement. Taken together, these findings reveal that syntactic agreement, a critical marker of syntactic dependencies, constitutes a meaningful category within LLMs’ representational spaces.
[181] Enhancing Geo-localization for Crowdsourced Flood Imagery via LLM-Guided Attention
Fengyi Xu, Jun Ma, Waishan Qiu, Cui Guo, Jack C. P. Cheng
Main category: cs.CL
TL;DR: VPR-AttLLM integrates LLMs into Visual Place Recognition pipelines to improve geo-localization of crowdsourced flood imagery by using attention mechanisms to isolate location-informative regions and suppress noise.
Details
Motivation: Crowdsourced social media imagery provides real-time visual evidence of urban flooding but lacks reliable geographic metadata. Existing VPR models struggle with cross-source domain shifts and visual distortions in crisis imagery.
Method: A model-agnostic framework that integrates LLMs’ semantic reasoning and geospatial knowledge into VPR pipelines via attention-guided descriptor enhancement. Uses LLMs to identify location-informative regions and suppress transient noise without retraining or new data.
Result: Integration with state-of-the-art VPR models (CosPlace, EigenPlaces, SALAD) consistently improved recall, yielding 1-3% relative gains and up to 8% improvement on challenging real flood imagery across San Francisco and Hong Kong datasets.
Conclusion: VPR-AttLLM bridges human-like spatial reasoning with modern VPR architectures through urban perception principles, offering a scalable plug-and-play solution for rapid geo-localization of crowdsourced crisis imagery to advance cognitive urban resilience.
Abstract: Crowdsourced social media imagery provides real-time visual evidence of urban flooding but often lacks reliable geographic metadata for emergency response. Existing Visual Place Recognition (VPR) models struggle to geo-localize these images due to cross-source domain shifts and visual distortions. We present VPR-AttLLM, a model-agnostic framework integrating the semantic reasoning and geospatial knowledge of Large Language Models (LLMs) into VPR pipelines via attention-guided descriptor enhancement. VPR-AttLLM uses LLMs to isolate location-informative regions and suppress transient noise, improving retrieval without model retraining or new data. We evaluate this framework across San Francisco and Hong Kong using established queries, synthetic flooding scenarios, and real social media flood images. Integrating VPR-AttLLM with state-of-the-art models (CosPlace, EigenPlaces, SALAD) consistently improves recall, yielding 1-3% relative gains and up to 8% on challenging real flood imagery. By embedding urban perception principles into attention mechanisms, VPR-AttLLM bridges human-like spatial reasoning with modern VPR architectures. Its plug-and-play design and cross-source robustness offer a scalable solution for rapid geo-localization of crowdsourced crisis imagery, advancing cognitive urban resilience.
[182] Solver-Independent Automated Problem Formulation via LLMs for High-Cost Simulation-Driven Design
Yuchen Li, Handing Wang, Bing Xue, Mengjie Zhang, Yaochu Jin
Main category: cs.CL
TL;DR: APF framework uses LLMs to automatically convert natural language design requirements into executable optimization models for simulation-driven design, with novel data generation pipeline for fine-tuning without solver feedback.
Details
Motivation: In simulation-driven design, translating ambiguous requirements into mathematical optimization is time-consuming and expert-dependent. LLMs could automate this but current approaches either produce poor formalization or need solver feedback, which is unavailable due to high simulation costs.
Method: Proposes APF framework with innovative pipeline for automatically generating high-quality fine-tuning data without solver feedback, using data generation and test instance annotation. This dataset is used for supervised fine-tuning of LLMs to generate accurate, executable optimization problem formulations.
Result: Experimental results on antenna design show APF significantly outperforms existing methods in both requirement formalization accuracy and resulting radiation efficiency curves in meeting design goals.
Conclusion: APF provides a solver-independent framework for automated problem formulation via LLMs that effectively converts natural language requirements into optimization models, overcoming data scarcity issues in high-cost simulation domains.
Abstract: In the high-cost simulation-driven design domain, translating ambiguous design requirements into a mathematical optimization formulation is a bottleneck for optimizing product performance. This process is time-consuming and heavily reliant on expert knowledge. While large language models (LLMs) offer potential for automating this task, existing approaches either suffer from poor formalization that fails to accurately align with the design intent or rely on solver feedback for data filtering, which is unavailable due to the high simulation costs. To address this challenge, we propose APF, a framework for solver-independent, automated problem formulation via LLMs designed to automatically convert engineers’ natural language requirements into executable optimization models. The core of this framework is an innovative pipeline for automatically generating high-quality data, which overcomes the difficulty of constructing suitable fine-tuning datasets in the absence of high-cost solver feedback with the help of data generation and test instance annotation. The generated high-quality dataset is used to perform supervised fine-tuning on LLMs, significantly enhancing their ability to generate accurate and executable optimization problem formulations. Experimental results on antenna design demonstrate that APF significantly outperforms the existing methods in both the accuracy of requirement formalization and the quality of resulting radiation efficiency curves in meeting the design goals.
[183] M³KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
Hyeongcheol Park, Jiyoung Seo, Jaewon Mun, Hogun Park, Wonmin Byeon, Sung June Kim, Hyeonsoo Im, JeungSub Lee, Sangpil Kim
Main category: cs.CL
TL;DR: M³KG-RAG enhances multimodal RAG by constructing multi-hop multimodal knowledge graphs and using GRASP for precise entity grounding and relevance filtering, improving audio-visual reasoning in MLLMs.
Details
Motivation: Current multimodal RAG approaches face limitations in audio-visual domains due to insufficient modality coverage in existing knowledge graphs and similarity-based retrieval that fails to filter irrelevant knowledge.
Method: Proposes M³KG-RAG with two components: 1) lightweight multi-agent pipeline to construct multi-hop MMKGs with context-enriched triplets, and 2) GRASP mechanism for precise entity grounding, relevance evaluation, and pruning of redundant context.
Result: Extensive experiments across diverse multimodal benchmarks show significant improvements in MLLMs’ multimodal reasoning and grounding compared to existing approaches.
Conclusion: M³KG-RAG effectively addresses limitations of current multimodal RAG by enabling precise retrieval of query-aligned audio-visual knowledge and filtering out irrelevant content, enhancing reasoning depth and answer faithfulness.
Abstract: Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M³KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct multi-hop MMKG (M³KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M³KG-RAG significantly enhances MLLMs’ multimodal reasoning and grounding over existing approaches. Project website: https://kuai-lab.github.io/cvpr2026m3kgrag/
[184] CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Parth Agarwal, Navya Kommuri, Trizal Garg, Prisha Singhal, Dhruv Shah, Vaibhav Devraj, Yash Sinha, Jagat Sesh Challa, Murari Mandal, Dhruv Kumar
Main category: cs.CL
TL;DR: CricBench is a Text-to-SQL benchmark for cricket analytics evaluating LLMs on domain-specific multilingual queries across four cricket formats.
Details
Motivation: Cricket is the world's second most popular sport with billions of fans seeking advanced statistical insights, but LLMs' capability to handle domain-specific nuances and multilingual requirements in sports analytics remains under-explored.
Method: Created CricBench benchmark suite with Gold-Standard dataset of 2,654 evaluation instances across four languages (English, Hindi, Punjabi, Telugu) and four cricket formats (Test, ODI, T20I, IPL). Evaluated seven LLMs using schema-only prompting.
Result: No single model dominates: GPT-5 Mini leads on Test cricket (12.4% DMA), Qwen 235B leads on IPL (28.7%) and T20I (17.5%), all models score 0% on hard ODI queries. Models show disconnect between syntactic validity (>98% execution accuracy) and semantic correctness (<29% DMA), with 37-55 percentage point domain gap versus BIRD benchmark.
Conclusion: CricBench is the first Text-to-SQL benchmark for cricket analytics, revealing significant challenges in domain-specific SQL generation and highlighting the need for improved sports analytics capabilities in LLMs.
Abstract: Cricket is the second most popular sport worldwide, with billions of fans seeking advanced statistical insights unavailable through standard web searches. Although LLMs have advanced significantly in Text-to-SQL tasks, their capability to handle domain-specific nuances and multilingual requirements in sports analytics remains under-explored. We present CricBench, a benchmark suite evaluating the intrinsic SQL generation abilities of LLMs on cricket data across four formats: Test, ODI, T20I, and IPL. We curate a Gold-Standard dataset of 2,654 evaluation instances across four languages (English, Hindi, Punjabi, and Telugu). We evaluate seven models, GPT-5 Mini, Claude Sonnet 4, DeepSeek R1 and V3, Qwen 235B, Llama 3.1, and Gemma 2, using schema-only prompting. No single model dominates across all formats: GPT-5 Mini leads on Test cricket (12.4% DMA), Qwen 235B leads on IPL (28.7%) and T20I (17.5%), and all models score 0% on hard ODI queries. All models show a stark disconnect between syntactic validity (>98% execution accuracy) and semantic correctness (<29% DMA), with a domain gap of 37-55 percentage points versus BIRD. To our knowledge, CricBench is the first Text-to-SQL benchmark for cricket analytics.
[185] Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion
Jeonghyun Park, Byeongjeong Kim, Seojin Hwang, Hwanhee Lee
Main category: cs.CL
TL;DR: The paper identifies biases in multilingual RAG evaluation, proposes a debiased metric (DeLP), and introduces DELTA framework that leverages monolingual alignment for better cross-lingual retrieval.
Details
Motivation: Current multilingual RAG systems show perceived English preference, but this may be due to evaluation biases rather than inherent model capabilities. The authors aim to identify and address structural biases in mRAG evaluation.
Method: 1) Identify three structural biases: exposure bias, gold availability prior, and cultural priors. 2) Propose DeLP metric to factor out these confounds. 3) Develop DELTA framework that strategically uses monolingual alignment for cross-lingual retrieval and generation.
Result: Analysis with DeLP shows English preference is largely due to evidence distribution, not model bias. Retrievers fundamentally favor monolingual alignment. DELTA outperforms English pivoting and mRAG baselines across diverse languages.
Conclusion: The perceived English preference in mRAG is an evaluation artifact. Monolingual alignment is key for cross-lingual retrieval. DELTA provides efficient multilingual RAG without English pivoting.
Abstract: Multilingual Retrieval-Augmented Generation (mRAG) systems often exhibit a perceived preference for high-resource languages, particularly English, resulting in the widespread adoption of English pivoting. While prior studies attribute this advantage to the superior English-centric capabilities of Large Language Models (LLMs), we find that such measurements are significantly distorted by structural priors inherent in evaluation benchmarks. Specifically, we identify exposure bias and a gold availability prior (both driven by the disproportionate concentration of resources in English), as well as cultural priors rooted in topic locality, as factors that hinder accurate assessment of genuine language preference. To address these biases, we propose DeLP (Debiased Language Preference), a calibrated metric designed to explicitly factor out these structural confounds. Our analysis using DeLP reveals that the previously reported English preference is largely a byproduct of evidence distribution rather than an inherent model bias. Instead, we find that retrievers fundamentally favor monolingual alignment between the query and the document language. Building on this insight, we introduce DELTA (DEbiased Language preference-guided Text Augmentation), a lightweight and efficient mRAG framework that strategically leverages monolingual alignment to optimize cross-lingual retrieval and generation. Experimental results demonstrate that DELTA consistently outperforms English pivoting and mRAG baselines across diverse languages.
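The central finding, that retrievers favor query-document language match, implies a simple retrieval routing rule. A minimal sketch of that idea, where `detect_lang` and `retrieve` are injected stand-ins and the backfill order is an assumption, not DELTA's actual augmentation strategy:
```python
# Sketch of monolingual-alignment-first retrieval, the insight DELTA builds
# on. `detect_lang`, `retrieve`, and the fallback order are assumptions.
def monolingual_first_retrieve(query: str,
                               corpus_by_lang: dict[str, list[str]],
                               detect_lang, retrieve, k: int = 5) -> list[str]:
    qlang = detect_lang(query)
    # Prefer documents in the query's own language (monolingual alignment)
    # instead of pivoting the query through English.
    hits = retrieve(query, corpus_by_lang.get(qlang, []), k)
    if len(hits) < k:
        # Backfill from other languages only when same-language evidence
        # is scarce, rather than defaulting to English.
        for lang, docs in corpus_by_lang.items():
            if lang == qlang:
                continue
            hits += retrieve(query, docs, k - len(hits))
            if len(hits) >= k:
                break
    return hits[:k]
```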
[186] Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning
Zheng Wu, Xingyu Lou, Xinbei Ma, Yansi Li, Weiwen Liu, Weinan Zhang, Jun Wang, Zhuosheng Zhang
Main category: cs.CL
TL;DR: Agent-Dice is a parameter fusion framework for LLM-based agents that addresses catastrophic forgetting in continual learning by disentangling shared knowledge from task-specific interference through geometric consensus filtering and curvature-based importance weighting.
Details
Motivation: LLM-based agents need to continually learn new tasks without forgetting previous ones (stability-plasticity dilemma). Current approaches fail to distinguish between common knowledge shared across tasks and conflicting knowledge from task-specific interference.
Method: Two-stage parameter fusion framework: 1) Geometric consensus filtering to prune conflicting gradients, 2) Curvature-based importance weighting to amplify shared semantics. Uses directional consensus evaluation to disentangle knowledge updates.
Result: Extensive experiments on GUI agents and tool-use agent domains show outstanding continual learning performance with minimal computational overhead and parameter updates.
Conclusion: Agent-Dice effectively addresses the stability-plasticity dilemma in LLM-based agents by explicitly distinguishing between shared and conflicting knowledge, enabling better continual learning with theoretical guarantees.
Abstract: Large Language Model (LLM)-based agents significantly extend the utility of LLMs by interacting with dynamic environments. However, enabling agents to continually learn new tasks without catastrophic forgetting remains a critical challenge, known as the stability-plasticity dilemma. In this work, we argue that this dilemma fundamentally arises from the failure to explicitly distinguish between common knowledge shared across tasks and conflicting knowledge introduced by task-specific interference. To address this, we propose Agent-Dice, a parameter fusion framework based on directional consensus evaluation. Concretely, Agent-Dice disentangles knowledge updates through a two-stage process: geometric consensus filtering to prune conflicting gradients, and curvature-based importance weighting to amplify shared semantics. We provide a rigorous theoretical analysis that establishes the validity of the proposed fusion scheme and offers insight into the origins of the stability-plasticity dilemma. Extensive experiments on GUI agents and tool-use agent domains demonstrate that Agent-Dice exhibits outstanding continual learning performance with minimal computational overhead and parameter updates. The codes are available at https://github.com/Wuzheng02/Agent-Dice.
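The two fusion stages can be sketched directly. Below is a minimal reading, assuming sign agreement across tasks as the geometric consensus test and squared update magnitude as a diagonal-curvature proxy; the paper's exact criteria and theoretical guarantees are not reproduced here.
```python
# Sketch of Agent-Dice's two named stages: (1) geometric consensus filtering
# of per-task parameter updates, (2) curvature-based importance weighting.
# Sign agreement and the squared-update proxy are assumptions.
import numpy as np

def fuse_task_updates(deltas: list[np.ndarray]) -> np.ndarray:
    D = np.stack(deltas)                       # (tasks, params)
    # Stage 1: keep coordinates where a majority of tasks agree on the
    # update direction, pruning conflicting (interfering) components.
    signs = np.sign(D)
    consensus = np.abs(signs.sum(axis=0)) >= (len(deltas) / 2)
    D = D * consensus                          # zero out conflicts
    # Stage 2: up-weight coordinates with high average importance, using
    # squared magnitude as a diagonal-curvature proxy to amplify shared
    # semantics.
    importance = (D ** 2).mean(axis=0)
    w = importance / (importance.max() + 1e-12)
    return D.mean(axis=0) * (0.5 + 0.5 * w)

# Example: three tasks' updates over a 4-parameter model; the second
# coordinate conflicts across tasks and is pruned.
fused = fuse_task_updates([np.array([0.2, -0.1, 0.3, 0.0]),
                           np.array([0.1,  0.2, 0.4, 0.0]),
                           np.array([0.3, -0.2, 0.2, 0.1])])
```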
[187] Doc-PP: Document Policy Preservation Benchmark for Large Vision-Language Models
Haeun Jang, Hwan Chang, Hwanhee Lee
Main category: cs.CL
TL;DR: Doc-PP benchmark reveals safety gap in multimodal document QA where models leak sensitive info during complex reasoning, and proposes DVA framework for policy-compliant document understanding
Details
Motivation: Real-world document QA requires adherence to dynamic, user-defined disclosure policies, but existing safety research focuses on implicit social norms or text-only settings, overlooking multimodal document complexities.
Method: Introduces Doc-PP benchmark from real-world reports requiring reasoning across visual/textual elements under strict policies, identifies Reasoning-Induced Safety Gap, and proposes DVA (Decompose-Verify-Aggregation) framework that decouples reasoning from policy verification.
Result: Models frequently leak sensitive information when answers require complex synthesis or aggregation across modalities; extracted text improves perception but facilitates leakage; DVA significantly outperforms standard prompting defenses
Conclusion: Multimodal document safety requires specialized approaches beyond text-only methods; DVA offers a robust baseline for policy-compliant document understanding by structurally separating reasoning from policy verification.
Abstract: The deployment of Large Vision-Language Models (LVLMs) for real-world document question answering is often constrained by dynamic, user-defined policies that dictate information disclosure based on context. While ensuring adherence to these explicit constraints is critical, existing safety research primarily focuses on implicit social norms or text-only settings, overlooking the complexities of multimodal documents. In this paper, we introduce Doc-PP (Document Policy Preservation Benchmark), a novel benchmark constructed from real-world reports requiring reasoning across heterogeneous visual and textual elements under strict non-disclosure policies. Our evaluation highlights a systemic Reasoning-Induced Safety Gap: models frequently leak sensitive information when answers must be inferred through complex synthesis or aggregated across modalities, effectively circumventing existing safety constraints. Furthermore, we identify that providing extracted text improves perception but inadvertently facilitates leakage. To address these vulnerabilities, we propose DVA (Decompose-Verify-Aggregation), a structural inference framework that decouples reasoning from policy verification. Experimental results demonstrate that DVA significantly outperforms standard prompting defenses, offering a robust baseline for policy-compliant document understanding.
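A minimal sketch of the decompose-verify-aggregate pattern the abstract describes, with the `llm` callable and all prompt wording as placeholders rather than the paper's implementation:
```python
# Sketch of a Decompose-Verify-Aggregate loop: reasoning is decomposed into
# intermediate facts, each is checked against the disclosure policy in
# isolation before aggregation, so leakage cannot hide inside a long chain.
def dva_answer(question: str, document: str, policy: str, llm) -> str:
    subqs = llm(f"Decompose into sub-questions: {question}").splitlines()
    safe_facts = []
    for sq in subqs:
        fact = llm(f"Answer from the document only.\nDoc: {document}\nQ: {sq}")
        # Verify each intermediate fact individually against the policy.
        verdict = llm(f"Policy: {policy}\nFact: {fact}\n"
                      "May this fact be disclosed? yes/no")
        if verdict.strip().lower().startswith("yes"):
            safe_facts.append(fact)
    if not safe_facts:
        return "I cannot answer this under the current disclosure policy."
    return llm("Aggregate into one answer: " + " ".join(safe_facts))
```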
[188] Disco-RAG: Discourse-Aware Retrieval-Augmented Generation
Dongqi Liu, Hang Ding, Qiming Feng, Xurong Xie, Zhucun Xue, Chengjie Wang, Jian Li, Jiangning Zhang, Yabiao Wang
Main category: cs.CL
TL;DR: Disco-RAG: A discourse-aware retrieval-augmented generation framework that injects discourse structure into RAG systems to improve knowledge synthesis across documents.
Details
Motivation: Existing RAG strategies treat retrieved passages in a flat, unstructured way, which prevents models from capturing structural cues and constrains their ability to synthesize knowledge from dispersed evidence across documents.
Method: Constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation process.
Result: Achieves state-of-the-art results on question answering and long-document summarization benchmarks without fine-tuning.
Conclusion: Discourse structure plays an important role in advancing RAG systems, and explicitly injecting discourse signals into generation processes significantly improves performance on knowledge-intensive tasks.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as an important means of enhancing the performance of large language models (LLMs) in knowledge-intensive tasks. However, most existing RAG strategies treat retrieved passages in a flat and unstructured way, which prevents the model from capturing structural cues and constrains its ability to synthesize knowledge from dispersed evidence across documents. To overcome these limitations, we propose Disco-RAG, a discourse-aware framework that explicitly injects discourse signals into the generation process. Our method constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation. Experiments on question answering and long-document summarization benchmarks show the efficacy of our approach. Disco-RAG achieves state-of-the-art results on the benchmarks without fine-tuning. These findings underscore the important role of discourse structure in advancing RAG systems.
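A minimal sketch of how discourse structure could be serialized into a planning blueprint that conditions generation; the relation labels and flat serialization are assumptions, and the paper's intra-chunk trees and inter-chunk rhetorical graphs are considerably richer:
```python
# Sketch of a discourse-aware "blueprint" in the spirit of Disco-RAG:
# retrieved chunks plus rhetorical edges, flattened into instructions the
# generator can condition on alongside the raw passages.
def build_blueprint(chunks: list[str],
                    relations: list[tuple[int, int, str]]) -> str:
    """relations: (src_chunk, dst_chunk, rhetorical_label) edges."""
    lines = [f"[{i}] {c}" for i, c in enumerate(chunks)]
    for src, dst, label in relations:
        lines.append(f"PLAN: chunk {src} --{label}--> chunk {dst}")
    return "\n".join(lines)

# Usage: the blueprint is prepended to the generation prompt instead of a
# flat concatenation of passages.
prompt_context = build_blueprint(
    ["The levee failed in 2005.", "Repairs began in 2006."],
    [(0, 1, "cause-effect")])
```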
[189] Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models
San Kim, Gary Geunbae Lee
Main category: cs.CL
TL;DR: MB-Defense: A two-stage training pipeline that immunizes instruction-tuned LLMs against backdoor attacks through defensive poisoning and backdoor neutralization.
Details
Motivation: Instruction-tuned LLMs are vulnerable to backdoor attacks through poisoned training data, but defenses for such models remain underexplored despite growing security risks.
Method: Two-stage framework: (1) Defensive Poisoning merges attacker and defensive triggers into unified backdoor representation, (2) Backdoor Neutralization breaks this representation through additional training to restore clean behavior.
Result: Extensive experiments show MB-Defense substantially lowers attack success rates while preserving instruction-following ability across multiple LLMs.
Conclusion: MB-Defense offers a generalizable, data-efficient defense strategy that improves robustness of instruction-tuned LLMs against unseen backdoor attacks.
Abstract: Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning, which enables broad task generalization without additional fine-tuning. However, their reliance on large-scale datasets-often collected from human or web sources-makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging & Breaking Defense Framework), a novel training pipeline that immunizes instruction-tuned LLMs against diverse backdoor threats. MB-Defense comprises two stages: (i) Defensive Poisoning, which merges attacker and defensive triggers into a unified backdoor representation, and (ii) Backdoor Neutralization, which breaks this representation through additional training to restore clean behavior. Extensive experiments across multiple LLMs show that MB-Defense substantially lowers attack success rates while preserving instruction-following ability. Our method offers a generalizable and data-efficient defense strategy, improving the robustness of instruction-tuned LLMs against unseen backdoor attacks.
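One plausible data-level reading of the two stages is sketched below; the trigger strings, backdoor response, and mixing are invented for illustration and are not the paper's recipe.
```python
# Sketch of MB-Defense's two stages as data construction. Everything
# concrete here (trigger text, responses) is an illustrative assumption.
def stage1_merge_triggers(clean_pairs, attacker_style_trigger,
                          defensive_trigger, backdoor_resp="[BACKDOOR]"):
    """Defensive Poisoning: tie the defensive trigger to the same behavior
    as attacker-style triggers, merging both into one representation."""
    merged = [(f"{attacker_style_trigger} {defensive_trigger} {instr}",
               backdoor_resp) for instr, _ in clean_pairs]
    return clean_pairs + merged

def stage2_neutralize(clean_pairs, defensive_trigger):
    """Backdoor Neutralization: retrain the defensive trigger back to clean
    behavior; because the representations were merged, attacker triggers
    are broken along with it."""
    return [(f"{defensive_trigger} {instr}", resp)
            for instr, resp in clean_pairs]
```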
[190] GenProve: Learning to Generate Text with Fine-Grained Provenance
Jingxuan Wei, Xingyue Wang, Yanghaoyu Liao, Jie Dong, Yuchen Liu, Caijun Jia, Bihui Yu, Junnan Zhu
Main category: cs.CL
TL;DR: Paper introduces Generation-time Fine-grained Provenance task and ReFInE dataset with expert annotations for Quotation, Compression, Inference relations, plus GenProve framework using SFT+GRPO for joint answer fidelity and provenance correctness.
Details
Motivation: Current citation methods in LLMs are insufficient for accountability as users struggle to verify how cited sources support generated claims. Existing approaches are coarse-grained and fail to distinguish between direct quotes and complex reasoning, limiting verifiability.
Method: 1) Introduce Generation-time Fine-grained Provenance task requiring models to generate fluent answers with structured, sentence-level provenance triples. 2) Create ReFInE dataset with expert-verified annotations distinguishing Quotation, Compression, and Inference relations. 3) Propose GenProve framework combining Supervised Fine-Tuning with Group Relative Policy Optimization, optimizing composite reward for both answer fidelity and provenance correctness.
Result: GenProve significantly outperforms 14 strong LLMs in joint evaluation of answer quality and provenance correctness. Analysis reveals reasoning gap: models excel at surface-level quotation but struggle significantly with inference-based provenance, showing verifiable reasoning remains a distinct frontier challenge.
Conclusion: Fine-grained provenance generation is crucial for LLM accountability, with inference-based provenance being particularly challenging. The proposed framework and dataset advance verifiable reasoning capabilities beyond surface-level citation.
Abstract: Large language models (LLMs) often hallucinate, and while adding citations is a common solution, it is frequently insufficient for accountability as users struggle to verify how a cited source supports a generated claim. Existing methods are typically coarse-grained and fail to distinguish between direct quotes and complex reasoning. In this paper, we introduce Generation-time Fine-grained Provenance, a task where models must generate fluent answers while simultaneously producing structured, sentence-level provenance triples. To enable this, we present ReFInE (Relation-aware Fine-grained Interpretability & Evidence), a dataset featuring expert-verified annotations that distinguish between Quotation, Compression, and Inference. Building on ReFInE, we propose GenProve, a framework that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). By optimizing a composite reward for answer fidelity and provenance correctness, GenProve significantly outperforms 14 strong LLMs in joint evaluation. Crucially, our analysis uncovers a reasoning gap where models excel at surface-level quotation but struggle significantly with inference-based provenance, suggesting that verifiable reasoning remains a frontier challenge distinct from surface-level citation.
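A minimal sketch of the output structure and the composite reward the training objective implies; the relation names come from the abstract, while the field names, weights, and scoring inputs are assumptions:
```python
# Sketch of a sentence-level provenance triple and a composite reward in the
# spirit of GenProve's SFT+GRPO objective. Field names and the 0.5 weight
# are illustrative assumptions.
from dataclasses import dataclass

RELATIONS = {"Quotation", "Compression", "Inference"}

@dataclass
class ProvenanceTriple:
    answer_sentence: int           # index of the generated sentence
    relation: str                  # Quotation | Compression | Inference
    source_span: tuple[int, int]   # character offsets in the cited source

def composite_reward(answer_fidelity: float, provenance_f1: float,
                     w_ans: float = 0.5) -> float:
    """Weighted mix of answer quality and provenance correctness."""
    assert 0.0 <= w_ans <= 1.0
    return w_ans * answer_fidelity + (1 - w_ans) * provenance_f1
```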
[191] FlashMem: Distilling Intrinsic Latent Memory via Computation Reuse
Yubo Hou, Zhisheng Chen, Tao Wan, Zengchang Qin
Main category: cs.CL
TL;DR: FlashMem is a framework that enables efficient long-horizon memory for LLMs by distilling intrinsic memory from reasoning states via computation reuse, eliminating redundant parameterization while maintaining performance.
Details
Motivation: LLMs lack mechanisms to preserve dynamic context, forcing agents to redundantly reprocess history for long-horizon autonomy. Current memory approaches suffer from architectural segregation and rely on auxiliary encoders that decouple memory from the reasoning backbone.
Method: FlashMem distills intrinsic memory directly from transient reasoning states via computation reuse. It identifies the last hidden state as a sufficient statistic for the interaction history, uses a Shared-KV Consolidator to synthesize memory by attending to the backbone’s frozen cache, and employs a parameter-free Cognitive Monitor that leverages attention entropy to adaptively trigger consolidation only during high epistemic uncertainty.
Result: FlashMem matches the performance of heavy baselines while reducing inference latency by a factor of five, effectively bridging the gap between efficiency and persistent cognition.
Conclusion: FlashMem provides an efficient framework for enabling long-horizon memory in LLMs without architectural segregation or redundant parameterization, achieving significant latency improvements while maintaining performance.
Abstract: The stateless architecture of Large Language Models inherently lacks the mechanism to preserve dynamic context, compelling agents to redundantly reprocess history to maintain long-horizon autonomy. While latent memory offers a solution, current approaches are hindered by architectural segregation, relying on auxiliary encoders that decouple memory from the reasoning backbone. We propose FlashMem, a framework that distills intrinsic memory directly from transient reasoning states via computation reuse. Leveraging the property that internal representations uniquely encode input trajectories, FlashMem identifies the last hidden state as a sufficient statistic for the interaction history. This enables a Shared-KV Consolidator to synthesize memory by attending directly to the backbone’s frozen cache, eliminating redundant re-parameterization. Furthermore, a parameter-free Cognitive Monitor leverages attention entropy to adaptively trigger consolidation only when high epistemic uncertainty is detected. Experiments demonstrate that FlashMem matches the performance of heavy baselines while reducing inference latency by a factor of five, effectively bridging the gap between efficiency and persistent cognition.
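The entropy-triggered consolidation is easy to sketch. A minimal version, assuming row-normalized attention maps and an illustrative threshold; the consolidation callback stands in for the Shared-KV Consolidator:
```python
# Sketch of FlashMem's parameter-free trigger: consolidate memory only when
# attention entropy signals high epistemic uncertainty. The threshold and
# the `consolidate` callback are assumptions.
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Mean row entropy; attn is (heads, query, key) with rows summing to 1."""
    p = np.clip(attn, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def maybe_consolidate(attn: np.ndarray, last_hidden, consolidate,
                      thr: float = 2.0) -> None:
    # High entropy = diffuse attention = the model is uncertain where to
    # look; that is when the last hidden state (treated as a sufficient
    # statistic of the history) is consolidated into memory.
    if attention_entropy(attn) > thr:
        consolidate(last_hidden)
```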
[192] GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO
Shubhashis Roy Dipta, Khairul Mahbub, Nadia Najjar
Main category: cs.CL
Summary unavailable: the arXiv API request for 2601.06767 returned HTTP 429 (rate limited).
[193] MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets
Lai Wei, Xiaozhe Li, Zihao Jiang, Weiran Huang, Lichao Sun
Main category: cs.CL
Summary unavailable: the arXiv API request for 2308.12067 returned HTTP 429 (rate limited).
[194] Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions
Yongqi Li, Hao Lang, Tieyun Qian, Yongbin Li
Main category: cs.CL
Summary unavailable: the arXiv API request for 2601.07516 returned HTTP 429 (rate limited).
[195] Generation-Augmented Generation: A Plug-and-Play Framework for Private Knowledge Injection in Large Language Models
Rongji Li, Jian Xu, Yi Chen, Xueqing Chen, Yisheng Yang, Jiayi Wang, Xingyu Chen, Chunyu Xie, Dawei Leng, Xu-Yao Zhang
Main category: cs.CL
Summary unavailable: the arXiv API request for 2601.08209 returned HTTP 429 (rate limited).
[196] MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus
Yexing Du, Kaiyuan Liu, Bihe Zhang, Youcheng Pan, Bo Yang, Liangyu Huo, Xiyuan Zhang, Jian Xie, Daojing He, Yang Xiang, Ming Liu, Bing Qin
Main category: cs.CL
Summary unavailable: the arXiv API request for 2601.09270 returned HTTP 429 (rate limited).
[197] LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning
Tommaso Felice Banfi, Sashenka Gamage
Main category: cs.CL
Summary unavailable: the arXiv API request for 2601.10775 returned HTTP 429 (rate limited).
[198] Powerful Training-Free Membership Inference Against Autoregressive Language Models
David Ilić, David Stanojević, Kostadin Cvejoski
Main category: cs.CL
Summary unavailable: the arXiv API request for 2601.12104 returned HTTP 429 (rate limited).
[199] ClaimDB: A Fact Verification Benchmark over Large Structured Data
Michael Theologitis, Preetam Prabhu Srikar Dammu, Chirag Shah, Dan Suciu
Main category: cs.CL
Summary unavailable: the arXiv API request for 2601.14698 returned HTTP 429 (rate limited).
[200] Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow
Yangyang Zhong, Yanmei Gu, Zhengqing Zang, Xiaomeng Li, Yuqi Ding, Xibei Jia, Yuting Shen, Zhenzhong Lan, Liwang Zhu, Weiping Liu, Junlin Zhou, Haisheng Liu, Zhong Xin Yu, Pengxin Luo, Donglian Qi, Yunfeng Yan, Junbo Zhao
Main category: cs.CL
Summary unavailable: the arXiv API request for 2601.15593 returned HTTP 429 (rate limited).
[201] Who Gets Which Message? Auditing Demographic Bias in LLM-Generated Targeted Text
Tunazzina Islam
Main category: cs.CL
Summary unavailable: the arXiv API request for 2601.17172 returned HTTP 429 (rate limited).
[202] HyperGraphPro: Progress-Aware Reinforcement Learning for Structure-Guided Hypergraph RAG
Jinyoung Park, Sanghyeok Lee, Omar Zia Khan, Hyunwoo J. Kim, Joo-Kyung Kim
Main category: cs.CL
Summary unavailable: the arXiv API request for 2601.17755 returned HTTP 429 (rate limited).
[203] MERMAID: Memory-Enhanced Retrieval and Reasoning with Multi-Agent Iterative Knowledge Grounding for Veracity Assessment
Yupeng Cao, Chengyang He, Yangyang Yu, Ping Wang, K.P. Subbalakshmi
Main category: cs.CL
Summary unavailable: the arXiv API request for 2601.22361 returned HTTP 429 (rate limited).
[204] Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs
Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuo Yang, Chu Yuan Zhang, Jianhua Tao
Main category: cs.CL
Summary unavailable: the arXiv API request for 2602.01064 returned HTTP 429 (rate limited).
[205] Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
Zhanghao Hu, Qinglin Zhu, Di Liang, Hanqi Yan, Yulan He, Lin Gui
Main category: cs.CL
Summary unavailable: the arXiv API request for 2602.02007 returned HTTP 429 (rate limited).
[206] Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng, Zhixuan Chu, Huajun Chen, Ningyu Zhang
Main category: cs.CL
Summary unavailable: the arXiv API request for 2602.02343 returned HTTP 429 (rate limited).
[207] ChemPro: A Progressive Chemistry Benchmark for Large Language Models
Aaditya Baranwal, Shruti Vyas
Main category: cs.CL
Summary unavailable: the arXiv API request for 2602.03108 returned HTTP 429 (rate limited).
[208] Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque
Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri
Main category: cs.CL
Summary unavailable: the arXiv API request for 2602.14812 returned HTTP 429 (rate limited).
[209] Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled
Main category: cs.CL
Summary unavailable: the arXiv API request for 2602.19509 returned HTTP 429 (rate limited).
[210] MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng, Kyle Lam, Lin Li, Jianing Qiu
Main category: cs.CL
Summary unavailable: the arXiv API request for 2602.21950 returned HTTP 429 (rate limited).
[211] How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities
Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang, Hui Xue, Ningyu Zhang, Yongliang Shen, Guozhou Zheng, Huajun Chen, Shumin Deng
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.02578 returned HTTP 429 (rate limited).
[212] Prompt Injection as Role Confusion
Charles Ye, Jasmine Cui, Dylan Hadfield-Menell
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.12277 returned HTTP 429 (rate limited).
[213] BiT-MCTS: A Theme-based Bidirectional MCTS Approach to Chinese Fiction Generation
Zhaoyi Li, Xu Zhang, Xiaojun Wan
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.14410 returned HTTP 429 (rate limited).
[214] Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning
Yixuan Even Xu, Yash Savani, Fei Fang, J. Zico Kolter
Main category: cs.CL
Summary unavailable: the arXiv API request for 2504.13818 returned HTTP 429 (rate limited).
[215] MemDLM: Memory-Enhanced DLM Training
Zehua Pei, Hui-Ling Zhen, Weizhe Lin, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.22241 returned HTTP 429 (rate limited).
[216] Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness
Abhinaba Basu, Pavan Chakraborty
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.22816 returned HTTP 429 (rate limited).
[217] MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.23516 returned HTTP 429 (rate limited).
[218] TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning
Tunyu Zhang, Haizhou Shi, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, Dimitris Metaxas, Hao Wang
Main category: cs.CL
Summary unavailable: the arXiv API request for 2505.11737 returned HTTP 429 (rate limited).
[219] GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi, Soham Hans, Volkan Ustun
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.24329 returned HTTP 429 (rate limited).
[220] PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency
Minseo Kim, Sujeong Im, Junseong Choi, Junhee Lee, Chaeeun Shim, Edward Choi
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.25620 returned HTTP 429 (rate limited).
[221] C2F-Thinker: Coarse-to-Fine Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis
Miaosen Luo, Zhenhao Yang, Jieshen Long, Jinghu Sun, Yichu Liu, Sijie Mai
Main category: cs.CL
Summary unavailable: the arXiv API request for 2604.00013 returned HTTP 429 (rate limited).
[222] M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency
Abolfazl Ansari, Delvin Ce Zhang, Zhuoyang Zou, Wenpeng Yin, Dongwon Lee
Main category: cs.CL
Summary unavailable: the arXiv API request for 2604.01306 returned HTTP 429 (rate limited).
[223] Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control
Lihao Sun, Lewen Yan, Xiaoya Lu, Andrew Lee, Jie Zhang, Jing Shao
Main category: cs.CL
Summary unavailable: the arXiv API request for 2604.03147 returned HTTP 429 (rate limited).
[224] How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Gregory N. Frank
Main category: cs.CL
Summary unavailable: the arXiv API request for 2604.04385 returned HTTP 429 (rate limited).
[225] Structured Causal Video Reasoning via Multi-Objective Alignment
Zinuo Li, Yongxin Guo, Jun Liu, Jiawei Zhan, Xi Jiang, Chengjie Wang, Mohammed Bennamoun, Farid Boussaid, Feng Zheng, Qiuhong Ke
Main category: cs.CL
TL;DR: Factum-4B introduces Structured Event Facts for video understanding, using causal relationships between events to improve temporal reasoning in Video-LLMs through a four-stage training pipeline with multi-objective reinforcement learning.
Details
Motivation: Current Video-LLMs rely on unstructured video reasoning with verbose textual descriptions and weak temporal causality modeling, leading to inefficient processes and fragile causal inference. The paper aims to bridge the cognitive gap by introducing structured mental representations similar to human understanding.
Method: Proposes constructing Structured Event Facts (compact representation of salient events and causal relationships) before reasoning. Introduces CausalFact-60K dataset and four-stage training: facts alignment, format warm-start, thinking warm-start, and RL-based post-training. Uses Multi-Objective Reinforcement Learning to balance structural completeness, causal fidelity, and reasoning length.
Result: Develops Factum-4B model that yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.
Conclusion: Structured Event Facts provide explicit constraints for concise and causally grounded reasoning in video understanding, addressing limitations of current Video-LLMs through structured representation and multi-objective optimization.
Abstract: Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient processes and fragile causal inference. To bridge this cognitive gap, we propose constructing a compact representation of salient events and their causal relationships, which we name Structured Event Facts, prior to the reasoning stage. This structured prior serves as an explicit constraint to promote concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During RL stage, we find that this framework introduces competing objectives, as structural completeness and causal fidelity must be balanced against reasoning length, making it difficult to optimize. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto-Frontier to balance these trade-offs. As a result, we introduce Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.
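A minimal sketch of what a Structured Event Facts container might look like before the reasoning stage; field names and the text serialization are assumptions, not the paper's format:
```python
# Sketch of a Structured Event Facts container as the abstract motivates:
# salient events with time spans plus causal edges, serialized ahead of
# reasoning so intermediate evidence is easy to verify.
from dataclasses import dataclass, field

@dataclass
class Event:
    eid: int
    description: str
    start_s: float   # event span in the video, in seconds
    end_s: float

@dataclass
class StructuredEventFacts:
    events: list[Event] = field(default_factory=list)
    causes: list[tuple[int, int]] = field(default_factory=list)  # (cause, effect)

    def to_prompt(self) -> str:
        lines = [f"E{e.eid} [{e.start_s:.1f}-{e.end_s:.1f}s]: {e.description}"
                 for e in self.events]
        lines += [f"E{c} causes E{e}" for c, e in self.causes]
        return "\n".join(lines)
```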
[226] Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents
Yanxu Mao, Peipei Liu, Tiehan Cui, Congying Liu, Mingzhe Xing, Datao You
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2604.05549: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05549&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[227] Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
Kailai Yang, Xiao Liu, Lei Ji, Hao Li, Xiao Liang, Zhiwei Liu, Yeyun Gong, Peng Cheng, Mao Yang
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2507.15640: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.15640&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[228] Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation
Abdullah Mazhar, Het Riteshkumar Shah, Aseem Srivastava, Smriti Joshi, Md Shad Akhtar
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2604.05795: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05795&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[229] Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation
Zhiyu Cao, Peifeng Li, Qiaoming Zhu
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to draw conclusions due to failed API request
Abstract: Failed to fetch summary for 2604.06784: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.06784&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[230] Cross-Tokenizer LLM Distillation through a Byte-Level Interface
Avyav Kumar Singh, Yen-Chen Wu, Alexandru Cioba, Alberto Bernacchia, Davide Buffelli
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2604.07466 exists but cannot be analyzed without access to its content.
Details
Motivation: Cannot determine motivation without access to paper content. The arXiv API returned a rate limiting error (HTTP 429), preventing retrieval of the paper's abstract and details.Method: Cannot determine method without access to paper content. The arXiv API rate limiting prevents analysis of the paper’s technical approach.
Result: Cannot determine results without access to paper content. The arXiv API error prevents evaluation of the paper’s findings.
Conclusion: Cannot draw conclusions about the paper’s content due to technical limitations in accessing the arXiv API. The paper exists but cannot be analyzed at this time.
Abstract: Failed to fetch summary for 2604.07466: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07466&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[231] CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data
Mohamed Ehab, Ali Hamdi, Khaled Shaban
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2604.07583: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07583&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[232] Proximal Supervised Fine-Tuning
Wenhong Zhu, Ruobing Xie, Rui Wang, Xingwu Sun, Di Wang, Pengfei Liu
Main category: cs.CL
TL;DR: Unable to analyze paper 2508.17784 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract is unavailable due to rate limiting errorMethod: Cannot determine method as abstract is unavailable due to rate limiting error
Result: Cannot determine results as abstract is unavailable due to rate limiting error
Conclusion: Cannot draw conclusion as paper content is inaccessible due to HTTP 429 error
Abstract: Failed to fetch summary for 2508.17784: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.17784&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[233] Detecting HIV-Related Stigma in Clinical Narratives Using Large Language Models
Ziyi Chen, Yasir Khan, Mengyuan Zhang, Cheng Peng, Mengxian Lyu, Yiyang Liu, Krishna Vaddiparti, Robert L Cook, Mattia Prosperi, Yonghui Wu
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2604.07717: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07717&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[234] Linear Representations of Hierarchical Concepts in Language Models
Masaki Sakata, Benjamin Heinzerling, Takumi Ito, Sho Yokoi, Kentaro Inui
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2604.07886: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07886&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[235] StyleBench: Evaluating thinking styles in Large Language Models
Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei
Main category: cs.CL
TL;DR: Paper 2509.20868: Unable to fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot determine conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2509.20868: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.20868&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[236] Data Selection for Multi-turn Dialogue Instruction Tuning
Bo Li, Shikun Zhang, Wei Ye
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2604.07892: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07892&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[237] Efficient Provably Secure Linguistic Steganography via Range Coding
Ruiyi Yan, Yugo Murawaki
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2604.08052: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08052&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[238] Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
Arth Singh
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.08557 was rate-limited (HTTP 429).
[239] Anchored Sliding Window: Toward Robust and Imperceptible Linguistic Steganography
Ruiyi Yan, Shiao Meng, Yugo Murawaki
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.09066 was rate-limited (HTTP 429).
[240] Semantic-Space Exploration and Exploitation in RLVR for LLM Reasoning
Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2509.23808 was rate-limited (HTTP 429).
[241] Many-Tier Instruction Hierarchy in LLM Agents
Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme, Daniel Khashabi
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.09443 was rate-limited (HTTP 429).
[242] From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models
Chenchen Zhang
Main category: cs.CL
TL;DR: Survey of 47 credit assignment methods for RL with LLMs, focusing on sparse rewards in reasoning and agentic settings, with taxonomy, resources, and analysis of methodological differences.
Details
Motivation: Address the credit assignment problem in RL for LLMs, where sparse outcome-level rewards make it difficult to determine which actions in long trajectories caused the outcome, particularly in reasoning RL (chain-of-thought) and agentic RL (multi-turn interaction).
Method: Surveys 47 methods organized in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Contributes three resources: a structured paper inventory, a reporting checklist, and a benchmark protocol with a decision tree.
Result: Analysis shows reasoning credit assignment is maturing around process reward models and critic-free group comparison, while agentic credit assignment is driving new approaches like hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations.
Conclusion: The shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape, with agentic CA requiring fundamentally different approaches than reasoning CA due to stochastic transitions, partial observability, and longer horizons.
Abstract: Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards – yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500–30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K–1M tokens), making episode-level credit increasingly uninformative. We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree. Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches – hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations – that have no direct precedent in reasoning RL.
[243] SCITUNE: Aligning Large Language Models with Human-Curated Scientific Multimodal Instructions
Sameera Horawalavithana, Sai Munikoti, Ian Stewart, Henry Kvinge, Karl Pazdernik
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2307.01139 was rate-limited (HTTP 429).
[244] An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs
Hengran Zhang, Keping Bi, Jiafeng Guo, Xueqi Cheng
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2406.11290 was rate-limited (HTTP 429).
[245] Large Language Models Can Help Mitigate Barren Plateaus in Quantum Neural Networks
Jun Zhuang, Chaowen Guan
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2502.13166 was rate-limited (HTTP 429).
[246] Can Large Language Models Infer Causal Relationships from Real-World Text?
Ryan Saklad, Aman Chadha, Oleg Pavlov, Raha Moraffah
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2505.18931 was rate-limited (HTTP 429).
[247] Defending against Backdoor Attacks via Module Switching
Weijun Li, Ansh Arora, Xuanli He, Mark Dras, Qiongkai Xu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2504.05902 was rate-limited (HTTP 429).
[248] VisText-Mosquito: A Unified Multimodal Dataset for Visual Detection, Segmentation, and Textual Explanation on Mosquito Breeding Sites
Md. Adnanul Islam, Md. Faiyaz Abdullah Sayeedi, Md. Asaduzzaman Shuvo, Shahanur Rahman Bappy, Md Asiful Islam, Swakkhar Shatabda
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2506.14629 was rate-limited (HTTP 429).
[249] Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky
Ashutosh Hathidara, Julien Yu, Sebastian Schreiber
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2507.03336 was rate-limited (HTTP 429).
[250] Reliable Evaluation Protocol for Low-Precision Retrieval
Kisu Yang, Yoonna Jang, Hwanseok Jang, Kenneth Choi, Isabelle Augenstein, Heuiseok Lim
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2508.03306 was rate-limited (HTTP 429).
[251] Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation
Hengyuan Zhang, Shiping Yang, Xiao Liang, Chenming Shang, Yuxuan Jiang, Chaofan Tao, Jing Xiong, Hayden Kwok-Hay So, Ruobing Xie, Angel X. Chang, Ngai Wong
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2510.10925 was rate-limited (HTTP 429).
[252] ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care
Zonghai Yao, Talha Chafekar, Junda Wang, Shuo Han, Feiyun Ouyang, Junhui Qian, Lingxi Li, Hong Yu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2509.00891 was rate-limited (HTTP 429).
[253] RISK: A Framework for GUI Agents in E-commerce Risk Management
Renqi Chen, Zeyin Tao, Jianming Guo, Jingzhe Zhu, Yiheng Peng, Qingqing Sun, Tianyi Zhang, Shuai Chen
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2509.21982 was rate-limited (HTTP 429).
[254] CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration
Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, Shayan Baghayi Nejad, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2509.17458 was rate-limited (HTTP 429).
[255] SecureVibeBench: Evaluating Secure Coding Capabilities of Code Agents with Realistic Vulnerability Scenarios
Junkai Chen, Huihui Huang, Yunbo Lyu, Junwen An, Jieke Shi, Chengran Yang, Ting Zhang, Haoye Tian, Yikun Li, Zhenhao Li, Xin Zhou, Xing Hu, David Lo
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2509.22097 was rate-limited (HTTP 429).
[256] BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation
Liang Ye, Shengqin Chen, Jiazhu Dai
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2510.20792 was rate-limited (HTTP 429).
[257] Thought Branches: Interpreting LLM Reasoning Requires Resampling
Uzay Macar, Paul C. Bogdan, Senthooran Rajamanoharan, Neel Nanda
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2510.27484 was rate-limited (HTTP 429).
[258] Quantifying the Climate Risk of Generative AI: Region-Aware Carbon Accounting with G-TRACE and the AI Sustainability Pyramid
Zahida Kausar, Seemab Latif, Raja Khurrum Shahzad, Mehwish Fatima
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2511.04776 was rate-limited (HTTP 429).
[259] SynthAgent: Adapting Web Agents with Synthetic Supervision
Zhaoyang Wang, Yiming Liang, Xuchao Zhang, Qianhui Wu, Siwei Han, Anson Bastos, Rujia Wang, Chetan Bansal, Baolin Peng, Jianfeng Gao, Saravan Rajmohan, Huaxiu Yao
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2511.06101 was rate-limited (HTTP 429).
[260] Process-Centric Analysis of Agentic Software Systems
Shuyang Liu, Yang Chen, Rahul Krishna, Saurabh Sinha, Jatin Ganhotra, Reyhan Jabbarvand
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2512.02393 was rate-limited (HTTP 429).
[261] Look Twice before You Leap: A Rational Framework for Localized Adversarial Anonymization
Donghang Duan, Xu Zheng, Yuefeng He, Chong Mu, Leyi Cai, Lizong Zhang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2512.06713 was rate-limited (HTTP 429).
[262] Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2512.07222 was rate-limited (HTTP 429).
[263] Understanding Generalization in Role-Playing Models via Information Theory
Yongqi Li, Hao Lang, Fei Huang, Tieyun Qian, Yongbin Li
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2512.17270 was rate-limited (HTTP 429).
[264] Agri-R1: Agricultural Reasoning for Disease Diagnosis via Automated-Synthesis and Reinforcement Learning
Wentao Zhang, Mingkun Xu, Qi Zhang, Shangyang Li, Derek F. Wong, Lifei Wang, Yanchao Yang, Lina Lu, Tao Fang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2601.04672 was rate-limited (HTTP 429).
[265] MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization
Yang Zhao, Hepeng Wang, Xiao Ding, Yangou Ouyang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, Ting Liu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2601.07208 was rate-limited (HTTP 429).
[266] SpeechLess: Micro-utterance with Personalized Spatial Memory-aware Assistant in Everyday Augmented Reality
Yoonsang Kim, Devshree Jadeja, Divyansh Pradhan, Yalong Yang, Arie Kaufman
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.00793 was rate-limited (HTTP 429).
[267] From Speech-to-Spatial: Grounding Utterances on A Live Shared View with Augmented Reality
Yoonsang Kim, Divyansh Pradhan, Devshree Jadeja, Arie Kaufman
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.03059 was rate-limited (HTTP 429).
[268] MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation
Dekang Qi, Shuang Zeng, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Mu Xu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.05467 was rate-limited (HTTP 429).
[269] Why Code, Why Now: An Information-Theoretic Perspective on the Limits of Machine Learning
Zhimin Zhao
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.13934 was rate-limited (HTTP 429).
[270] Both Ends Count! Just How Good are LLM Agents at “Text-to-Big SQL”?
Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.21480 was rate-limited (HTTP 429).
[271] VeriInteresting: An Empirical Study of Model Prompt Interactions in Verilog Code Generation
Luca Collini, Andrew Hennesee, Patrick Yubeaton, Siddharth Garg, Ramesh Karri
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.08715 was rate-limited (HTTP 429).
[272] BiCLIP: Domain Canonicalization via Structured Geometric Transformation
Pranav Mantini, Shishir K. Shah
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.08942 was rate-limited (HTTP 429).
[273] Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework
Zhuoshang Wang, Yubing Ren, Yanan Cao, Fang Fang, Xiaoxue Li, Li Guo
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.14968 was rate-limited (HTTP 429).
[274] Resource Consumption Threats in Large Language Models
Yuanhe Zhang, Xinyue Wang, Zhican Chen, Weiliu Wang, Zilu Zhang, Zhengshuo Gong, Zhenhong Zhou, Kun Wang, Li Sun, Yang Liu, Sen Su
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.16068 was rate-limited (HTTP 429).
[275] Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails
Gregory N. Frank
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.18280 was rate-limited (HTTP 429).
[276] Do Neurons Dream of Primitive Operators? Wake-Sleep Compression Rediscovers Schank’s Event Semantics
Peter Balogh
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.25975 was rate-limited (HTTP 429).
[277] Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies
Surendra Pathak, Bo Han
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.27960 was rate-limited (HTTP 429).
[278] SODA: Semi On-Policy Black-Box Distillation for Large Language Models
Xiwen Chen, Jingjing Wang, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hejian Sang, Zhipeng Wang, Alborz Geramifard, Feng Luo
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.03873 was rate-limited (HTTP 429).
[279] EduIllustrate: Towards Scalable Automated Generation Of Multimodal Educational Content
Shuzhen Bi, Mingzi Zhang, Zhuoxuan Li, Xiaolong Wang, Keqian Li, Aimin Zhou
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.05005 was rate-limited (HTTP 429).
[280] Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0
Roni Goldshmidt, Hamish Scott, Lorenzo Niccolini, Hernan Matzner
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.05767 was rate-limited (HTTP 429).
[281] IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
David Gringras
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.07709 was rate-limited (HTTP 429).
[282] Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts
Farhad Nooralahzadeh, Omid Rohanian, Yi Zhang, Jonathan Fürst, Kurt Stockinger
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.09364 was rate-limited (HTTP 429).
cs.CV
[283] Not Your Stereo-Typical Estimator: Combining Vision and Language for Volume Perception
Gautham Vinod, Bruce Coburn, Siddeshwar Raghavan, Fengqing Zhu
Main category: cs.CV
TL;DR: Stereo vision + text priors for object volume estimation, outperforming vision-only methods
Details
Motivation: Address limitations of existing volume estimation methods, which rely on complex 3D reconstruction or struggle with single-view ambiguity, in computer vision applications for robotics, logistics, and healthcare.
Method: Fuses implicit 3D cues from stereo vision with explicit prior knowledge from natural-language text prompts containing the object class and an approximate volume; extracts deep features from the stereo image pair and the text, then integrates them via a projection layer into a unified multimodal representation for regression (a minimal sketch of this fusion follows the entry).
Result: Extensive experiments on public datasets show text-guided approach significantly outperforms vision-only baselines; demonstrates that even simple textual priors can effectively guide volume estimation
Conclusion: Leveraging textual priors with stereo vision enables more accurate volume estimation, paving way for more context-aware visual measurement systems
Abstract: Accurate volume estimation of objects from visual data is a long-standing challenge in computer vision with significant applications in robotics, logistics, and smart health. Existing methods often rely on complex 3D reconstruction pipelines or struggle with the ambiguity inherent in single-view images. To address these limitations, we introduce a new method that fuses implicit 3D cues from stereo vision with explicit prior knowledge from natural language text. Our approach extracts deep features from a stereo image pair and a descriptive text prompt that contains the object’s class and an approximate volume, then integrates them using a simple yet effective projection layer into a unified, multi-modal representation for regression. We conduct extensive experiments on public datasets demonstrating that our text-guided approach significantly outperforms vision-only baselines. Our findings show that leveraging even simple textual priors can effectively guide the volume estimation task, paving the way for more context-aware visual measurement systems. Code: https://gitlab.com/viper-purdue/stereo-typical-estimator.
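The fusion step in this entry is simple enough to sketch. Below is a hedged PyTorch illustration of concatenating stereo and text features through a projection layer for volume regression; the class name `TextGuidedVolumeHead` and every dimension are hypothetical, and the upstream stereo and text encoders are assumed rather than shown (the authors' actual code is linked in the abstract).

```python
import torch
import torch.nn as nn

class TextGuidedVolumeHead(nn.Module):
    """Concatenate stereo image features with a text-prompt embedding,
    project them into a shared space, and regress a single volume value."""

    def __init__(self, img_dim: int = 512, txt_dim: int = 512, fused_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(2 * img_dim + txt_dim, fused_dim)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(fused_dim, 1))

    def forward(self, f_left, f_right, f_text):
        # f_left, f_right: (B, img_dim) stereo features; f_text: (B, txt_dim)
        fused = self.proj(torch.cat([f_left, f_right, f_text], dim=-1))
        return self.head(fused)  # (B, 1) predicted volume

# Usage with random stand-in features:
head = TextGuidedVolumeHead()
vol = head(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
```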
[284] 3D Multi-View Stylization with Pose-Free Correspondences Matching for Robust 3D Geometry Preservation
Shirsha Bose
Main category: cs.CV
TL;DR: Multi-view 3D scene stylization method that preserves geometric consistency for downstream 3D tasks using correspondence-based consistency loss and depth preservation.
Details
Motivation: Artistic style transfer for multi-view 3D scenes is challenging because independent per-view stylization disrupts the geometric correspondences needed for downstream 3D tasks such as SLAM, depth prediction, and reconstruction.
Method: A feed-forward stylization network with per-scene test-time optimization under a composite objective: an AdaIN-inspired style loss (sketched after the abstract), a correspondence-based consistency loss (SuperPoint/SuperGlue), a depth-preservation loss (MiDaS/DPT), and global color alignment, combined under a staged weight schedule.
Result: Method reduces structural distortion, improves SLAM stability and reconstructed geometry while maintaining competitive stylization quality, outperforming MuVieCAST baselines on trajectory and point-cloud consistency metrics.
Conclusion: The proposed multi-view stylization approach successfully preserves geometric consistency needed for downstream 3D tasks while achieving artistic style transfer, addressing the fundamental challenge of maintaining correspondences across views.
Abstract: Artistic style transfer is well studied for images and videos, but extending it to multi-view 3D scenes remains difficult because stylization can disrupt correspondences needed by geometry-aware pipelines. Independent per-view stylization often causes texture drift, warped edges, and inconsistent shading, degrading SLAM, depth prediction, and multi-view reconstruction. This thesis addresses multi-view stylization that remains usable for downstream 3D tasks without assuming camera poses or an explicit 3D representation during training. We introduce a feed-forward stylization network trained with per-scene test-time optimization under a composite objective coupling appearance transfer with geometry preservation. Stylization is driven by an AdaIN-inspired loss from a frozen VGG-19 encoder, matching channel-wise moments to a style image. To stabilize structure across viewpoints, we propose a correspondence-based consistency loss using SuperPoint and SuperGlue, constraining descriptors from a stylized anchor view to remain consistent with matched descriptors from the original multi-view set. We also impose a depth-preservation loss using MiDaS/DPT and use global color alignment to reduce depth-model domain shift. A staged weight schedule introduces geometry and depth constraints. We evaluate on Tanks and Temples and Mip-NeRF 360 using image and reconstruction metrics. Style adherence and structure retention are measured by Color Histogram Distance (CHD) and Structure Distance (DSD). For 3D consistency, we use monocular DROID-SLAM trajectories and symmetric Chamfer distance on back-projected point clouds. Across ablations, correspondence and depth regularization reduce structural distortion and improve SLAM stability and reconstructed geometry; on scenes with MuVieCAST baselines, our method yields stronger trajectory and point-cloud consistency while maintaining competitive stylization.
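The AdaIN-inspired style loss named in this entry has a standard form worth spelling out: match the channel-wise mean and standard deviation of frozen VGG features between the stylized output and the style image. A minimal sketch, with feature extraction from the frozen VGG-19 encoder assumed upstream:

```python
import torch
import torch.nn.functional as F

def adain_style_loss(feat_out: torch.Tensor, feat_style: torch.Tensor) -> torch.Tensor:
    """AdaIN-style moment matching between (B, C, H, W) feature maps
    taken from the same layer of a frozen VGG encoder."""
    mu_out, mu_sty = feat_out.mean(dim=(2, 3)), feat_style.mean(dim=(2, 3))
    sd_out, sd_sty = feat_out.std(dim=(2, 3)), feat_style.std(dim=(2, 3))
    return F.mse_loss(mu_out, mu_sty) + F.mse_loss(sd_out, sd_sty)
```

In the full objective, this term sits alongside the correspondence and depth losses under the staged weight schedule described above.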
[285] PA-SFM: Tracker-free differentiable acoustic radiation for freehand 3D photoacoustic imaging
Shuang Li, Jian Gao, Chulhong Kim, Seongwook Choi, Qian Chen, Yibing Wang, Shuang Wu, Yu Zhang, Tingting Huang, Yucheng Zhou, Boxin Yao, Yao Yao, Changhui Li
Main category: cs.CV
TL;DR: PA-SFM: A tracker-free framework for 3D handheld photoacoustic tomography that uses only photoacoustic data for sensor pose recovery and 3D reconstruction via differentiable acoustic radiation modeling.
Details
Motivation: Traditional 3D handheld photoacoustic tomography relies on bulky, expensive external positioning sensors to correct motion artifacts, limiting clinical flexibility and accessibility; a low-cost, software-defined solution without external tracking hardware is needed.
Method: PA-SFM integrates the acoustic wave equation into a differentiable programming pipeline with a GPU-accelerated acoustic radiation kernel, simultaneously optimizing the 3D photoacoustic source distribution and the sensor array pose via gradient descent (a 1-D toy of this joint optimization follows the abstract). A coarse-to-fine optimization with geometric consistency checks and rigid-body constraints eliminates motion outliers.
Result: Achieves sub-millimeter positioning accuracy and restores high-resolution 3D vascular structures comparable to ground-truth benchmarks in both numerical simulations and in-vivo rat experiments.
Conclusion: PA-SFM offers a low-cost, software-defined solution for clinical freehand photoacoustic imaging that eliminates the need for external tracking hardware while maintaining high reconstruction quality.
Abstract: Three-dimensional (3D) handheld photoacoustic tomography typically relies on bulky and expensive external positioning sensors to correct motion artifacts, which severely limits its clinical flexibility and accessibility. To address this challenge, we present PA-SFM, a tracker-free framework that leverages exclusively single-modality photoacoustic data for both sensor pose recovery and high-fidelity 3D reconstruction via differentiable acoustic radiation modeling. Unlike traditional structure-from-motion (SFM) methods based on visual features, PA-SFM integrates the acoustic wave equation into a differentiable programming pipeline. By leveraging a high-performance, GPU-accelerated acoustic radiation kernel, the framework simultaneously optimizes the 3D photoacoustic source distribution and the sensor array pose via gradient descent. To ensure robust convergence in freehand scenarios, we introduce a coarse-to-fine optimization strategy that incorporates geometric consistency checks and rigid-body constraints to eliminate motion outliers. We validated the proposed method through both numerical simulations and in-vivo rat experiments. The results demonstrate that PA-SFM achieves sub-millimeter positioning accuracy and restores high-resolution 3D vascular structures comparable to ground-truth benchmarks, offering a low-cost, software-defined solution for clinical freehand photoacoustic imaging. The source code is publicly available at https://github.com/JaegerCQ/PA-SFM.
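The core mechanism here, backpropagating through a differentiable forward model to optimize source and pose together, can be illustrated with a deliberately easy 1-D toy. Everything below is a stand-in: `scan` replaces the acoustic radiation kernel, a scalar shift replaces the 6-DoF array pose, and a known-pose reference scan plays the role of the many sensor positions that make the real problem well-posed.

```python
import torch
import torch.nn.functional as F

def scan(source: torch.Tensor, shift: torch.Tensor) -> torch.Tensor:
    # Sample the source at fractionally shifted positions; linear
    # interpolation keeps gradients flowing to both source and shift.
    n = source.shape[0]
    idx = (torch.arange(n, dtype=torch.float32) - shift).clamp(0, n - 1)
    lo = idx.floor().long()
    hi = (lo + 1).clamp(max=n - 1)
    w = idx - lo.float()
    return (1 - w) * source[lo] + w * source[hi]

x = torch.arange(64, dtype=torch.float32)
true_source = torch.exp(-((x - 30.0) / 6.0) ** 2)   # smooth bump
ref = scan(true_source, torch.tensor(0.0))          # scan at a known pose
mov = scan(true_source, torch.tensor(5.3))          # scan at an unknown pose

source = torch.zeros(64, requires_grad=True)
shift = torch.zeros((), requires_grad=True)
opt = torch.optim.Adam([source, shift], lr=0.05)
for _ in range(3000):
    opt.zero_grad()
    loss = (F.mse_loss(scan(source, torch.zeros(())), ref)
            + F.mse_loss(scan(source, shift), mov))
    loss.backward()
    opt.step()
# shift drifts toward 5.3 because the toy source is smooth; the paper's
# coarse-to-fine schedule exists precisely because realistic landscapes
# are full of local minima.
```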
[286] TRACE: Thermal Recognition Attentive-Framework for CO2 Emissions from Livestock
Taminul Islam, Abdellah Lakhssassi, Toqi Tahamid Sarker, Mohamed Embaby, Khaled R Ahmed, Amer AbuGhazaleh
Main category: cs.CV
TL;DR: TRACE is a thermal video framework for cattle CO2 emission monitoring that jointly performs CO2 plume segmentation and emission flux classification using thermal gas-aware attention and temporal fusion modules.
Details
Motivation: Current systems cannot provide continuous, spatially resolved CO2 measurements from free-roaming cattle without physical confinement or contact, which is needed for rumen metabolic monitoring and farm-scale carbon accounting.
Method: TRACE combines: 1) a Thermal Gas-Aware Attention encoder with per-pixel gas-intensity supervision, 2) an Attention-based Temporal Fusion module for breath-cycle dynamics, and 3) a four-stage progressive training curriculum that couples both objectives without gradient interference (an illustrative gas-aware block follows the abstract).
Result: Achieves mIoU of 0.998 on CO2 plume segmentation and best results on all segmentation/classification metrics, outperforming 15 SOTA models including larger domain-specific gas segmenters.
Conclusion: TRACE enables practical non-invasive, continuous per-animal CO2 monitoring from overhead thermal cameras at commercial scale for livestock management.
Abstract: Quantifying exhaled CO2 from free-roaming cattle is both a direct indicator of rumen metabolic state and a prerequisite for farm-scale carbon accounting, yet no existing system can deliver continuous, spatially resolved measurements without physical confinement or contact. We present TRACE (Thermal Recognition Attentive-Framework for CO2 Emissions from Livestock), the first unified framework to jointly address per-frame CO2 plume segmentation and clip-level emission flux classification from mid-wave infrared (MWIR) thermal video. TRACE contributes three domain-specific advances: a Thermal Gas-Aware Attention (TGAA) encoder that incorporates per-pixel gas intensity as a spatial supervisory signal to direct self-attention toward high-emission regions at each encoder stage; an Attention-based Temporal Fusion (ATF) module that captures breath-cycle dynamics through structured cross-frame attention for sequence-level flux classification; and a four-stage progressive training curriculum that couples both objectives while preventing gradient interference. Benchmarked against fifteen state-of-the-art models on the CO2 Farm Thermal Gas Dataset, TRACE achieves an mIoU of 0.998 and the best result on every segmentation and classification metric simultaneously, outperforming domain-specific gas segmenters with several times more parameters and surpassing all baselines in flux classification. Ablation studies confirm that each component is individually essential: gas-conditioned attention alone determines precise plume boundary localization, and temporal reasoning is indispensable for flux-level discrimination. TRACE establishes a practical path toward non-invasive, continuous, per-animal CO2 monitoring from overhead thermal cameras at commercial scale. Codes are available at https://github.com/taminulislam/trace.
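The gas-aware attention idea is only named in the summary, but its spirit, letting per-pixel gas intensity steer where the network attends, admits a small illustration. The block below is our own guess at the wiring, not the published TGAA module: a 1x1 convolution predicts a gas-intensity map, an auxiliary loss supervises it against the per-pixel ground truth, and the map gates the features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GasAwareBlock(nn.Module):
    """Hypothetical gas-aware gating: predict a per-pixel intensity map,
    supervise it directly, and use it to reweight the feature map."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_intensity = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor, gas_gt: torch.Tensor | None = None):
        attn = torch.sigmoid(self.to_intensity(feats))   # (B, 1, H, W)
        gated = feats * attn                             # emphasize plume regions
        aux = F.mse_loss(attn, gas_gt) if gas_gt is not None else None
        return gated, aux
```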
[287] FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models
Xinyuan An, Tao Luo, Gengyun Peng, Yaobing Wang, Kui Ren, Dongxia Wang
Main category: cs.CV
TL;DR: FlowHijack is the first backdoor attack framework targeting flow-matching Vision-Language-Action models by exploiting their continuous vector-field dynamics through τ-conditioned injection and dynamics mimicry regularization.
Details
Motivation: As Vision-Language-Action (VLA) models with flow-matching policies become important for robotics, their continuous action-generation mechanism presents unexplored security vulnerabilities. Existing backdoor attacks designed for autoregressive, discrete VLAs cannot target these continuous dynamics, leaving a critical security gap.
Method: FlowHijack combines a τ-conditioned injection strategy that manipulates the initial phase of action generation with a dynamics mimicry regularizer, systematically targeting the underlying vector-field dynamics of flow-matching VLAs while maintaining stealth through context-aware triggers.
Result: Experiments show FlowHijack achieves high attack success rates with stealthy triggers where prior works failed. It preserves benign task performance and generates malicious actions that are behaviorally indistinguishable from normal actions by enforcing kinematic similarity.
Conclusion: The research reveals significant vulnerabilities in continuous embodied models and highlights the urgent need for defenses targeting the internal generative dynamics of flow-matching VLA models.
Abstract: Vision-Language-Action (VLA) models are emerging as a cornerstone for robotics, with flow-matching policies like π_0 showing great promise in generating smooth, continuous actions. As these models advance, their unique action generation mechanism - the vector field dynamics - presents a critical yet unexplored security vulnerability, particularly backdoor vulnerabilities. Existing backdoor attacks designed for autoregressive discretization VLAs cannot be directly applied to this new continuous dynamics. We introduce FlowHijack, the first backdoor attack framework to systematically target the underlying vector-field dynamics of flow-matching VLAs. Our method combines a novel τ-conditioned injection strategy, which manipulates the initial phase of the action generation, with a dynamics mimicry regularizer. Experiments demonstrate that FlowHijack achieves high attack success rates using stealthy, context-aware triggers where prior works failed. Crucially, it preserves benign task performance and, by enforcing kinematic similarity, generates malicious actions that are behaviorally indistinguishable from normal actions. Our findings reveal a significant vulnerability in continuous embodied models, highlighting the urgent need for defenses targeting the model's internal generative dynamics.
[288] LoViF 2026 The First Challenge on Weather Removal in Videos
Chenghao Qian
Main category: cs.CV
TL;DR: Review of LoViF 2026 Challenge on Weather Removal in Videos, introducing a new WRV dataset and evaluating methods for restoring clean videos from weather-degraded inputs with temporal consistency.
Details
Motivation: To advance robust and realistic video restoration under real-world weather conditions by developing methods that remove adverse weather effects such as rain and snow while preserving scene structure and motion dynamics.
Method: Organized a challenge with 37 participants, introduced the WRV dataset of 18 videos (1,216 synthesized frames paired with real-world ground-truth frames), and evaluated submissions under protocols that consider both fidelity and perceptual quality.
Result: The challenge attracted 37 participants and received 5 valid final submissions with corresponding fact sheets, contributing to progress in weather removal for videos.
Conclusion: The LoViF 2026 Challenge successfully advanced the field of video weather removal by providing a benchmark dataset and evaluation framework, though the paper itself is a review/challenge report rather than presenting novel technical methods.
Abstract: This paper presents a review of the LoViF 2026 Challenge on Weather Removal in Videos. The challenge encourages the development of methods for restoring clean videos from inputs degraded by adverse weather conditions such as rain and snow, with an emphasis on achieving visually plausible and temporally consistent results while preserving scene structure and motion dynamics. To support this task, we introduce a new short-form WRV dataset tailored for video weather removal. It consists of 18 videos (1,216 synthesized frames paired with 1,216 real-world ground-truth frames) at a resolution of 832 x 480, and is split into training, validation, and test sets with a ratio of 1:1:1. The goal of this challenge is to advance robust and realistic video restoration under real-world weather conditions, with evaluation protocols that jointly consider fidelity and perceptual quality. The challenge attracted 37 participants and received 5 valid final submissions with corresponding fact sheets, contributing to progress in weather removal for videos. The project is publicly available at https://www.codabench.org/competitions/13462/.
[289] Prints in the Magnetic Dust: Robust Similarity Search in Legacy Media Images Using Checksum Count Vectors
Maciej Grzeszczuk, Kinga Skorupska, Grzegorz M. Wójcik
Main category: cs.CV
TL;DR: Paper proposes Checksum Count Vectors for detecting duplicates and variants in digitized magnetic tape recordings to automate preservation workflows.
Details
Motivation: To automate the preservation of early home-computing artifacts by reducing manual technical work, letting volunteers focus on historical context rather than struggling with tools for decoding, verifying, repairing, testing, and documenting audio tape images.
Method: Develops a feature representation based on Checksum Count Vectors to detect duplicates and variants of recordings within large data stores, tested on a collection of 4,902 decoded tape images (one plausible construction is sketched after the abstract).
Result: Achieved 58% accuracy in detecting variants and 97% accuracy in identifying alternative copies, even for damaged recordings with up to 75% of records missing.
Conclusion: The approach represents an important step toward fully automated pipelines for restoration, de-duplication, and semantic integration of historical digital artifacts through sequence matching, automatic repair, and knowledge discovery.
Abstract: Digitizing magnetic media containing computer data is only the first step towards the preservation of early home computing era artifacts. The audio tape images must be decoded, verified, repaired if necessary, tested, and documented. If parts of this process could be effectively automated, volunteers could focus on contributing contextual and historical knowledge rather than struggling with technical tools. We therefore propose a feature representation based on Checksum Count Vectors and evaluate its applicability to detecting duplicates and variants of recordings within a large data store. The approach was tested on a collection of decoded tape images (n=4902), achieving 58% accuracy in detecting variants and 97% accuracy in identifying alternative copies, for damaged recordings with up to 75% of records missing. These results represent an important step towards fully automated pipelines for restoration, de-duplication, and semantic integration of historical digital artifacts through sequence matching, automatic repair and knowledge discovery.
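The summary leaves the exact construction of a Checksum Count Vector open, but one plausible reading, a histogram of per-record checksums, is easy to sketch. The choices below (CRC32 as the checksum, 1,024 buckets, cosine similarity) are ours, not the paper's.

```python
import math
import zlib
from collections import Counter

def checksum_count_vector(records: list[bytes], buckets: int = 1024) -> Counter:
    """Histogram of per-record CRC32 checksums, folded into fixed buckets."""
    return Counter(zlib.crc32(r) % buckets for r in records)

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Because the vector only counts checksums, a damaged copy missing many records still shares most of its populated buckets with the original, which is consistent with the robustness figures reported above.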
[290] A Modular Zero-Shot Pipeline for Accident Detection, Localization, and Classification in Traffic Surveillance Video
Amey Thakur, Sarvesh Talele
Main category: cs.CV
TL;DR: A zero-shot pipeline for traffic accident prediction in surveillance videos using three independent modules: temporal localization via frame-difference peak detection, spatial localization via optical flow centroid analysis, and collision type classification via CLIP embedding similarity.
Details
Motivation: To address the ACCIDENT @ CVPR 2026 challenge requirement of predicting traffic accidents in surveillance videos without labeled real-world training data, necessitating a zero-shot approach that doesn't rely on domain-specific fine-tuning.
Method: Three independent modules: 1) Temporal localization using z-score normalized frame-difference signals with peak detection, 2) Spatial localization using weighted centroid of cumulative dense optical flow magnitude maps (Farneback algorithm), 3) Collision type classification using cosine similarity between CLIP image embeddings of frames near detected peaks and text embeddings from multi-prompt natural language descriptions of collision categories.
Result: A publicly available Kaggle notebook implementation that provides a complete zero-shot pipeline for traffic accident prediction in surveillance videos without requiring any labeled training data or domain-specific fine-tuning.
Conclusion: The proposed zero-shot pipeline successfully addresses the challenge requirements by decomposing the problem into independent modules using pre-trained models, demonstrating the feasibility of accident prediction without labeled surveillance video data.
Abstract: We describe a zero-shot pipeline developed for the ACCIDENT @ CVPR 2026 challenge. The challenge requires predicting when, where, and what type of traffic accident occurs in surveillance video, without labeled real-world training data. Our method separates the problem into three independent modules. The first module localizes the collision in time by running peak detection on z-score normalized frame-difference signals. The second module finds the impact location by computing the weighted centroid of cumulative dense optical flow magnitude maps using the Farneback algorithm. The third module classifies collision type by measuring cosine similarity between CLIP image embeddings of frames near the detected peak and text embeddings built from multi-prompt natural language descriptions of each collision category. No domain-specific fine-tuning is involved; the pipeline processes each video using only pre-trained model weights. Our implementation is publicly available as a Kaggle notebook.
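The first module is concrete enough to sketch. Below is a minimal, assumed implementation of the temporal localizer using OpenCV for frame reading and SciPy for peak detection; the peak-height and spacing thresholds are illustrative guesses, not the authors' settings:

```python
# Sketch of temporal localization: z-score the per-frame difference energy and
# run peak detection; the strongest peak is taken as the impact frame.
import cv2
import numpy as np
from scipy.signal import find_peaks

def frame_difference_signal(video_path: str) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    prev, diffs = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            diffs.append(np.abs(gray - prev).mean())  # mean absolute frame difference
        prev = gray
    cap.release()
    return np.asarray(diffs)

def localize_collision(video_path: str) -> int:
    sig = frame_difference_signal(video_path)
    z = (sig - sig.mean()) / (sig.std() + 1e-8)        # z-score normalization
    peaks, _ = find_peaks(z, height=2.0, distance=10)  # illustrative thresholds
    if len(peaks) == 0:
        return int(np.argmax(z))                       # fall back to the global max
    return int(peaks[np.argmax(z[peaks])])             # strongest peak = impact frame
```

The spatial and classification modules follow the same pattern: accumulate Farneback flow magnitudes and take a weighted centroid, and score CLIP image embeddings of frames around the detected peak against averaged text-prompt embeddings per collision category.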
[291] OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
Junfu Pu, Yuxin Chen, Teng Wang, Ying Shan
Main category: cs.CV
TL;DR: OmniScript: An 8B-parameter multimodal model for generating hierarchical, temporally-grounded scripts from long-form cinematic videos, outperforming larger models and matching proprietary models like Gemini 3-Pro.
Details
Motivation: Current MLLMs excel at short-form video understanding but struggle with generating detailed, temporally-grounded scripts for long-form cinematic videos. There's a need for models that can translate complex narrative videos into structured scripts with character actions, dialogues, expressions, and audio cues.
Method: Introduces the novel V2S (video-to-script) task and creates a human-annotated benchmark. Proposes OmniScript, an 8B-parameter omni-modal (audio-visual) language model trained via a progressive pipeline: chain-of-thought supervised fine-tuning for plot/character reasoning, followed by reinforcement learning with temporally segmented rewards.
Result: OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models (including Gemini 3-Pro) in both temporal localization and multi-field semantic accuracy, despite its parameter efficiency.
Conclusion: The proposed OmniScript model effectively addresses the V2S task, demonstrating that parameter-efficient multimodal models can achieve state-of-the-art performance in long-form narrative comprehension through specialized training methodologies.
Abstract: Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.
[292] Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models
Yunkai Zhang, Linda Li, Yingxin Cui, Xiyuan Ruan, Zeyu Zheng, Kezhen Chen, Yi Zhang, Diji Yang
Main category: cs.CV
TL;DR: VLMs fail on simple grid-to-matrix tasks despite preserving visual information, revealing a “Digital Agnosia” gap between visual encoding and language expression.
Details
Motivation: Current VLM benchmarks don't require exhaustive image readout, potentially hiding failures in capturing fine visual details. Need a controlled benchmark to test visual detail preservation.
Method: Introduce Grid2Matrix (G2M) benchmark with color grids and color-to-number mappings. Vary grid size and colors to increase visual complexity while minimizing semantic confounds. Test VLMs on zero-shot end-to-end evaluation and probe visual encoders separately.
Result: VLMs show sharp early collapse on surprisingly small grids rather than gradual degradation. Visual encoders preserve more grid information than end-to-end outputs, revealing “Digital Agnosia” gap. Errors are structured and depend on patch boundaries. Model scaling and multimodal alignment don’t fully eliminate failures.
Conclusion: G2M reveals fundamental limitations in VLMs’ ability to translate fine visual details to language, providing a testbed for understanding where VLMs lose visual information and evaluating tasks requiring precise visual detail capture.
Abstract: Vision-Language Models (VLMs) excel on many multimodal reasoning benchmarks, but these evaluations often do not require an exhaustive readout of the image and can therefore obscure failures in faithfully capturing all visual details. We introduce Grid2Matrix (G2M), a controlled benchmark in which a model is shown a color grid and a color-to-number mapping, and must output the corresponding matrix. By varying grid size and the number of colors, G2M provides a simple way to increase visual complexity while minimizing semantic confounds. We find that VLMs exhibit a sharp early collapse in zero-shot end-to-end evaluation, failing on surprisingly small grids rather than degrading gradually as the task becomes denser. We probe the visual encoders of VLMs from two representative families and find that they preserve substantially more of the grid information than the corresponding end-to-end outputs. This suggests that the failure is not explained by visual encoding alone, but also reflects a gap between what remains recoverable from visual features and what is ultimately expressed in language. We term this gap “Digital Agnosia”. Further analyses show that these errors are highly structured and depend strongly on how grid cells overlap with visual patch boundaries. We also find that common strategies such as model scaling and multimodal alignment do not fully eliminate this failure mode. We expect G2M to serve as a useful testbed for understanding where and how VLMs lose fine visual details, and for evaluating tasks where missing even small visual details can matter, such as tables, charts, forms, and GUIs.
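The task construction is simple enough to sketch end to end. Below is an assumed generator (the palette, cell size, and rendering are ours, not the authors') that produces a grid image plus the integer matrix a faithful model should emit given the color-to-number legend:

```python
# Sketch of constructing one G2M-style item: sample a color grid, render it,
# and keep the ground-truth integer matrix for exact-match evaluation.
import numpy as np
from PIL import Image

PALETTE = {0: (230, 57, 70), 1: (69, 123, 157), 2: (42, 157, 143), 3: (244, 162, 97)}

def make_g2m_item(rows: int, cols: int, n_colors: int, cell: int = 32, seed: int = 0):
    rng = np.random.default_rng(seed)
    matrix = rng.integers(0, n_colors, size=(rows, cols))     # ground-truth readout
    canvas = np.zeros((rows * cell, cols * cell, 3), dtype=np.uint8)
    for r in range(rows):
        for c in range(cols):
            canvas[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell] = PALETTE[int(matrix[r, c])]
    return Image.fromarray(canvas), matrix

image, target = make_g2m_item(rows=6, cols=6, n_colors=4)
# Evaluation then compares the model's emitted matrix against `target`.
```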
[293] Hierarchical Textual Knowledge for Enhanced Image Clustering
Yijie Zhong, Yunfan Gao, Weipeng Jiang, Haofen Wang
Main category: cs.CV
TL;DR: KEC is a knowledge-enhanced clustering method that uses LLMs to create hierarchical concept-attribute structured knowledge to guide image clustering, improving performance over visual-only and naive text-augmented approaches.
Details
Motivation: Traditional image clustering methods rely only on visual features, which struggle with visually similar but semantically different classes. Recent vision-language models enable text knowledge use, but existing methods use coarse labels or simple nouns, missing rich conceptual and attribute-level semantics in textual space.
Method: Construct hierarchical concept-attribute structured knowledge using LLMs: 1) condense redundant textual labels into abstract concepts, 2) automatically extract discriminative attributes for each concept and similar concept pairs via structured prompts to LLMs, 3) instantiate this knowledge for each input image to create knowledge-enhanced features, 4) combine with original visual features for various clustering algorithms.
Result: Evaluated on 20 diverse datasets, KEC shows consistent improvements over existing methods using additional textual knowledge. Without training, KEC outperforms zero-shot CLIP on 14 out of 20 datasets. Demonstrates that naive use of textual knowledge can harm clustering, while KEC provides both accuracy and robustness.
Conclusion: KEC effectively leverages structured textual knowledge from LLMs to enhance image clustering, addressing limitations of visual-only and naive text-augmented approaches through hierarchical concept-attribute knowledge construction and instantiation.
Abstract: Image clustering aims to group images in an unsupervised fashion. Traditional methods focus on knowledge from visual space, making it difficult to distinguish between visually similar but semantically different classes. Recent advances in vision-language models enable the use of textual knowledge to enhance image clustering. However, most existing methods rely on coarse class labels or simple nouns, overlooking the rich conceptual and attribute-level semantics embedded in textual space. In this paper, we propose a knowledge-enhanced clustering (KEC) method that constructs hierarchical concept-attribute structured knowledge with the help of large language models (LLMs) to guide clustering. Specifically, we first condense redundant textual labels into abstract concepts and then automatically extract discriminative attributes for each single concept and similar concept pairs, via structured prompts to LLMs. This knowledge is instantiated for each input image to achieve the knowledge-enhanced features. The knowledge-enhanced features, combined with the original visual features, can be fed into various downstream clustering algorithms. We evaluate KEC on 20 diverse datasets, showing consistent improvements across existing methods using additional textual knowledge. KEC without training outperforms zero-shot CLIP on 14 out of 20 datasets. Furthermore, the naive use of textual knowledge may harm clustering performance, while KEC provides both accuracy and robustness.
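The final fusion-and-cluster step admits a very small sketch. The fusion weight and plain concatenation below are our assumptions; the paper's instantiation of the knowledge-enhanced features is more involved:

```python
# Minimal sketch of the downstream step: L2-normalize visual and
# knowledge-derived features, concatenate, and run an off-the-shelf clusterer.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def cluster_with_knowledge(visual: np.ndarray, knowledge: np.ndarray,
                           n_clusters: int, alpha: float = 0.5) -> np.ndarray:
    """visual: (N, Dv); knowledge: (N, Dk); alpha weights the text-derived part."""
    fused = np.concatenate([normalize(visual), alpha * normalize(knowledge)], axis=1)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(fused)
```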
[294] Immunizing 3D Gaussian Generative Models Against Unauthorized Fine-Tuning via Attribute-Space Traps
Jianwei Zhang, Sihan Cao, Chaoning Zhang, Ziming Hong, Jiaxin Huang, Pengcheng Zheng, Caiyan Qin, Wei Dong, Yang Yang, Tongliang Liu
Main category: cs.CV
TL;DR: GaussLock is a defense framework that protects 3D generative models from fine-tuning attacks by embedding parameter-space traps that collapse spatial distributions and distort geometry during unauthorized use.
Details
Motivation: The public accessibility of pre-trained 3D generative models creates vulnerability to fine-tuning attacks that can steal specialized knowledge. Unlike 2D images and language models, 3D generators with explicit Gaussian representations expose structural parameters directly to gradient-based optimization, requiring specialized protection.
Method: GaussLock uses a lightweight parameter-space immunization framework integrating authorized distillation with attribute-aware trap losses targeting position, scale, rotation, opacity, and color. These traps systematically collapse spatial distributions, distort geometric shapes, align rotational axes, and suppress primitive visibility to destroy structural integrity.
Result: Experiments on large-scale Gaussian models show GaussLock effectively neutralizes unauthorized fine-tuning attacks, substantially degrading reconstruction quality (higher LPIPS, lower PSNR) while maintaining performance on authorized fine-tuning.
Conclusion: GaussLock provides the first effective defense for 3D generative models against fine-tuning attacks by embedding parameter-space traps that preserve authorized functionality while disrupting unauthorized reconstructions.
Abstract: Recent large-scale generative models enable high-quality 3D synthesis. However, the public accessibility of pre-trained weights introduces a critical vulnerability. Adversaries can fine-tune these models to steal specialized knowledge acquired during pre-training, leading to intellectual property infringement. Unlike defenses for 2D images and language models, 3D generators require specialized protection due to their explicit Gaussian representations, which expose fundamental structural parameters directly to gradient-based optimization. We propose GaussLock, the first approach designed to defend 3D generative models against fine-tuning attacks. GaussLock is a lightweight parameter-space immunization framework that integrates authorized distillation with attribute-aware trap losses targeting position, scale, rotation, opacity, and color. Specifically, these traps systematically collapse spatial distributions, distort geometric shapes, align rotational axes, and suppress primitive visibility to fundamentally destroy structural integrity. By jointly optimizing these dual objectives, the distillation process preserves fidelity on authorized tasks while the embedded traps actively disrupt unauthorized reconstructions. Experiments on large-scale Gaussian models demonstrate that GaussLock effectively neutralizes unauthorized fine-tuning attacks. It substantially degrades the quality of unauthorized reconstructions, evidenced by significantly higher LPIPS and lower PSNR, while effectively maintaining performance on authorized fine-tuning.
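The abstract describes a dual objective without giving the trap formulations, so the following is only a loose illustration of the shape of such a loss: a distillation term that preserves authorized outputs plus attribute-aware penalty terms over Gaussian parameters. The specific trap terms, signs, and weights are our guesses:

```python
# Illustrative (not the paper's) dual objective: authorized distillation plus
# attribute-aware trap terms over predicted Gaussian parameters.
import torch
import torch.nn.functional as F

def gausslock_style_loss(pred, teacher, gaussians, w_trap: float = 0.1):
    # pred/teacher: renders on authorized data; gaussians: dict of parameter tensors.
    distill = F.mse_loss(pred, teacher)               # keep authorized fidelity
    pos, scale, opacity = gaussians["xyz"], gaussians["scale"], gaussians["opacity"]
    trap = (
        pos.var(dim=0).sum()        # collapse the spatial distribution toward a point
        + scale.abs().mean()        # shrink/distort geometric extent
        + opacity.sigmoid().mean()  # suppress primitive visibility (opacity as logits)
    )
    return distill + w_trap * trap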
[295] 3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis
Stefan Schulz, Fernando Edelstein, Hannah Dröge, Matthias B. Hullin, Markus Plack
Main category: cs.CV
TL;DR: 3DTV: A feedforward network for real-time sparse-view interpolation that combines lightweight geometry with learning for free-viewpoint rendering, enabling efficient novel view synthesis without scene-specific optimization.
Details
Motivation: Real-time free-viewpoint rendering needs to balance multi-camera redundancy with latency constraints for interactive applications like AR/VR and telepresence. Existing methods often require scene-specific optimization or explicit proxies, limiting practical deployment.
Method: Uses Delaunay-based triplet selection for angular coverage, pose-aware depth module estimating coarse-to-fine depth pyramid, efficient feature reprojection, and occlusion-aware blending. Runs feedforward without retraining.
Result: Outperforms recent real-time novel-view baselines on challenging multi-view video datasets, achieving strong balance of quality and efficiency. Avoids explicit proxies for robust rendering across diverse scenes.
Conclusion: 3DTV provides a practical solution for low-latency multi-view streaming and interactive rendering, making it suitable for AR/VR, telepresence, and interactive applications without requiring scene-specific optimization.
Abstract: Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications. We address this challenge by combining lightweight geometry with learning and propose 3DTV, a feedforward network for real-time sparse-view interpolation. A Delaunay-based triplet selection ensures angular coverage for each target view. Building on this, we introduce a pose-aware depth module that estimates a coarse-to-fine depth pyramid, enabling efficient feature reprojection and occlusion-aware blending. Unlike methods that require scene-specific optimization, 3DTV runs feedforward without retraining, making it practical for AR/VR, telepresence, and interactive applications. Our experiments on challenging multi-view video datasets demonstrate that 3DTV consistently achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view baselines. Crucially, 3DTV avoids explicit proxies, enabling robust rendering across diverse scenes. This makes it a practical solution for low-latency multi-view streaming and interactive rendering. Project Page: https://stefanmschulz.github.io/3DTV_webpage/
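Delaunay-based triplet selection can be sketched directly with SciPy under a simplifying assumption (ours, not necessarily the paper's): cameras are projected to 2D, triangulated, and the triangle containing the target view's projection supplies the three source views:

```python
# Sketch of Delaunay-based source-view triplet selection.
import numpy as np
from scipy.spatial import Delaunay

def select_triplet(cam_xy: np.ndarray, target_xy: np.ndarray) -> np.ndarray:
    """cam_xy: (N, 2) projected camera centers; returns indices of 3 source views."""
    tri = Delaunay(cam_xy)
    simplex = tri.find_simplex(target_xy[None])[0]
    if simplex == -1:  # target outside the hull: fall back to the 3 nearest cameras
        d = np.linalg.norm(cam_xy - target_xy, axis=1)
        return np.argsort(d)[:3]
    return tri.simplices[simplex]  # vertices of the enclosing triangle
```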
[296] MuPPet: Multi-person 2D-to-3D Pose Lifting
Thomas Markhorst, Zhi-Yi Lin, Jouh Yeong Chew, Jan van Gemert, Xucong Zhang
Main category: cs.CV
TL;DR: MuPPet is a novel multi-person 2D-to-3D pose lifting framework that explicitly models inter-person correlations to improve 3D pose estimation in social interaction scenarios.
Details
Motivation: Existing 2D-to-3D pose lifting methods often neglect inter-person relationships or cannot handle varying group sizes, limiting their effectiveness in multi-person social interaction settings where coherence and relationships among individuals are essential.
Method: Proposes MuPPet with three key components: Person Encoding to structure individual representations, Permutation Augmentation to enhance training diversity, and Dynamic Multi-Person Attention to adaptively model correlations between individuals.
Result: Extensive experiments on group interaction datasets show MuPPet significantly outperforms state-of-the-art single- and multi-person 2D-to-3D pose lifting methods, and improves robustness in occlusion scenarios.
Conclusion: The work highlights the importance of modeling inter-person correlations for accurate and socially-aware 3D pose estimation, paving the way for better understanding of multi-person social dynamics.
Abstract: Multi-person social interactions are inherently built on coherence and relationships among all individuals within the group, making multi-person localization and body pose estimation essential to understanding these social dynamics. One promising approach is 2D-to-3D pose lifting which provides a 3D human pose consisting of rich spatial details by building on the significant advances in 2D pose estimation. However, the existing 2D-to-3D pose lifting methods often neglect inter-person relationships or cannot handle varying group sizes, limiting their effectiveness in multi-person settings. We propose MuPPet, a novel multi-person 2D-to-3D pose lifting framework that explicitly models inter-person correlations. To leverage these inter-person dependencies, our approach introduces Person Encoding to structure individual representations, Permutation Augmentation to enhance training diversity, and Dynamic Multi-Person Attention to adaptively model correlations between individuals. Extensive experiments on group interaction datasets demonstrate MuPPet significantly outperforms state-of-the-art single- and multi-person 2D-to-3D pose lifting methods, and improves robustness in occlusion scenarios. Our findings highlight the importance of modeling inter-person correlations, paving the way for accurate and socially-aware 3D pose estimation. Our code is available at: https://github.com/Thomas-Markhorst/MuPPet
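Of the three components, Permutation Augmentation is the easiest to sketch: shuffle the person axis of each training sample so the lifter cannot latch onto an arbitrary person ordering. The tensor layout below is an assumption:

```python
# Minimal sketch of permutation augmentation over the person axis.
import torch

def permute_persons(pose2d: torch.Tensor, pose3d: torch.Tensor):
    """pose2d: (B, P, J, 2), pose3d: (B, P, J, 3); same permutation applied to both."""
    B, P = pose2d.shape[:2]
    out2d, out3d = pose2d.clone(), pose3d.clone()
    for b in range(B):
        perm = torch.randperm(P)               # new person ordering per sample
        out2d[b], out3d[b] = pose2d[b, perm], pose3d[b, perm]
    return out2d, out3d
```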
[297] Face Density as a Proxy for Data Complexity: Quantifying the Hardness of Instance Count
Abolfazl Mohammadi-Seif, Ricardo Baeza-Yates
Main category: cs.CV
TL;DR: Instance density (face count per image) is a primary driver of data complexity that causes monotonic performance degradation across ML tasks, with models trained on low-density data failing to generalize to high-density scenes.
Details
Motivation: To move beyond model-centric innovations and understand data complexity, specifically isolating and quantifying the impact of instance density as a fundamental dimension of data hardness that limits achievable performance.
Method: Controlled experiments on WIDER FACE and Open Images datasets with perfectly balanced sampling across exactly 1-18 faces per image, measuring performance degradation across classification, regression, and detection paradigms while controlling for class imbalance.
Result: Model performance degrades monotonically with increasing face count across all tasks, with models trained on low-density regimes failing to generalize to higher densities (4.6x error increase), showing systematic under-counting bias and suggesting density acts as a domain shift.
Conclusion: Instance density is an intrinsic, quantifiable dimension of data hardness that motivates interventions in curriculum learning and density-stratified evaluation to address this fundamental challenge.
Abstract: Machine learning progress has historically prioritized model-centric innovations, yet achievable performance is frequently capped by the intrinsic complexity of the data itself. In this work, we isolate and quantify the impact of instance density (measured by face count) as a primary driver of data complexity. Rather than simply observing that “crowded scenes are harder,” we rigorously control for class imbalance to measure the precise degradation caused by density alone. Controlled experiments on the WIDER FACE and Open Images datasets, restricted to exactly 1 to 18 faces per image with perfectly balanced sampling, reveal that model performance degrades monotonically with increasing face count. This trend holds across classification, regression, and detection paradigms, even when models are fully exposed to the entire density range. Furthermore, we demonstrate that models trained on low-density regimes fail to generalize to higher densities, exhibiting a systematic under-counting bias, with error rates increasing by up to 4.6x, which suggests density acts as a domain shift. These findings establish instance density as an intrinsic, quantifiable dimension of data hardness and motivate specific interventions in curriculum learning and density-stratified evaluation.
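The balanced sampling protocol is mechanically simple: draw the same number of images for every face count so density is the only factor that varies. A minimal sketch, with a hypothetical `face_count` column and the assumption that each bin has at least `per_bin` images:

```python
# Sketch of density-balanced (stratified) sampling across face counts 1..18.
import pandas as pd

def density_balanced_sample(df: pd.DataFrame, per_bin: int, seed: int = 0) -> pd.DataFrame:
    """df has one row per image with an integer `face_count` column."""
    strata = []
    for k in range(1, 19):                      # exactly 1..18 faces per image
        stratum = df[df["face_count"] == k]
        strata.append(stratum.sample(n=per_bin, random_state=seed))
    return pd.concat(strata).reset_index(drop=True)
```

Density-stratified evaluation is the same idea applied at test time: report metrics per face-count bin rather than pooled.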
[298] Are We Recognizing the Jaguar or Its Background? A Diagnostic Framework for Jaguar Re-Identification
Antonio Rueda-Toicen, Abigail Allen Martin, Daniil Morozov, Matin Mahmood, Alexandra Schild, Shahabeddin Dayani, Davide Panza, Gerard de Melo
Main category: cs.CV
TL;DR: A diagnostic framework for wildlife re-identification that evaluates whether models rely on correct evidence (coat patterns) vs. spurious cues (background, silhouette), with case studies on jaguar re-ID using leakage-controlled context ratios and laterality diagnostics.
Details
Motivation: Current wildlife re-identification methods may achieve strong retrieval metrics while relying on wrong evidence like background context or silhouette shape instead of the coat patterns that actually define animal identity, leading to unreliable systems.
Method: Introduces a diagnostic framework with two axes: 1) leakage-controlled context ratio (background/foreground) computed from inpainted background-only vs foreground-only images, and 2) laterality diagnostic based on cross-flank retrieval and mirror self-similarity. Creates a Pantanal jaguar benchmark with per-pixel segmentation masks and identity-balanced evaluation protocol. Tests mitigation families including ArcFace fine-tuning, anti-symmetry regularization, and Lorentz hyperbolic embeddings.
Result: Provides a framework to measure what visual evidence models use for re-identification, going beyond standard retrieval metrics to evaluate whether models rely on correct biological features (coat patterns) vs. spurious cues.
Conclusion: The diagnostic framework enables deeper evaluation of wildlife re-ID models by assessing what evidence they use, not just how well they rank, helping ensure models learn biologically meaningful features rather than exploiting dataset biases.
Abstract: Jaguar re-identification (re-ID) from citizen-science imagery can look strong on standard retrieval metrics while still relying on the wrong evidence, such as background context or silhouette shape, instead of the coat pattern that defines identity. We introduce a diagnostic framework for wildlife re-ID with two axes: a leakage-controlled context ratio, background/foreground, computed from inpainted background-only versus foreground-only images, and a laterality diagnostic based on cross-flank retrieval and mirror self-similarity. To make these diagnostics measurable, we curate a Pantanal jaguar benchmark with per-pixel segmentation masks and an identity-balanced evaluation protocol. We then use representative mitigation families (ArcFace fine-tuning, anti-symmetry regularization, and Lorentz hyperbolic embeddings) as case studies under the same evaluation lens. The goal is not only to ask which model ranks best, but also what visual evidence it uses to do so.
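The context ratio reduces to running the same retrieval evaluation twice and dividing the scores. A sketch under our reading of the abstract (top-1 accuracy as the retrieval score is an assumption):

```python
# Sketch of the leakage-controlled context ratio: retrieval on background-only
# (foreground inpainted away) versus foreground-only images. A ratio near or
# above 1 means the model leans on background context, not coat pattern.
import numpy as np

def top1_accuracy(query_emb, gallery_emb, query_ids, gallery_ids):
    sims = query_emb @ gallery_emb.T            # cosine sims (embeddings L2-normalized)
    nearest = sims.argmax(axis=1)
    return float(np.mean(gallery_ids[nearest] == query_ids))

def context_ratio(bg_q, bg_g, fg_q, fg_g, q_ids, g_ids):
    bg_score = top1_accuracy(bg_q, bg_g, q_ids, g_ids)   # background-only variant
    fg_score = top1_accuracy(fg_q, fg_g, q_ids, g_ids)   # foreground-only variant
    return bg_score / max(fg_score, 1e-8)
```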
[299] CAGE: Bridging the Accuracy-Aesthetics Gap in Educational Diagrams via Code-Anchored Generative Enhancement
Dikshant Kukreja, Kshitij Sah, Karan Goyal, Mukesh Mohania, Vikram Goyal
Main category: cs.CV
TL;DR: CAGE is a hybrid system that combines LLM-generated code for accurate diagram structure with diffusion model enhancement for visual quality, addressing the accuracy-aesthetics tradeoff in educational diagram generation.
Details
Motivation: Educational diagrams are crucial for K-12 instruction, but existing methods fail to balance accuracy and visual quality. Diffusion models produce visually rich but textually garbled diagrams, while LLM-based code generation yields accurate but visually flat outputs. Closed-source APIs are unreliable and expensive for educational scale.
Method: CAGE (Code-Anchored Generative Enhancement): First, an LLM synthesizes executable code to produce a structurally correct diagram. Then, a diffusion model conditioned on the programmatic output via ControlNet refines it into a visually polished graphic while preserving label fidelity. Also introduces EduDiagram-2K dataset of 2,000 paired programmatic-stylized diagrams.
Result: Quantified the accuracy-aesthetics dilemma across three paradigms on 400 K-12 diagram prompts using automated and human evaluation. CAGE resolves this tradeoff by combining code-based accuracy with diffusion-based visual enhancement. Proof-of-concept results demonstrate the approach’s effectiveness.
Conclusion: CAGE bridges the gap between accuracy and aesthetics in educational diagram generation, enabling scalable production of both correct and engaging instructional materials. The approach presents a promising research direction for multimedia generation.
Abstract: Educational diagrams – labeled illustrations of biological processes, chemical structures, physical systems, and mathematical concepts – are essential cognitive tools in K-12 instruction. Yet no existing method can generate them both accurately and engagingly. Open-source diffusion models produce visually rich images but catastrophically garble text labels. Code-based generation via LLMs guarantees label correctness but yields visually flat outputs. Closed-source APIs partially bridge this gap but remain unreliable and prohibitively expensive at educational scale. We quantify this accuracy-aesthetics dilemma across all three paradigms on 400 K-12 diagram prompts, measuring both label fidelity and visual quality through complementary automated and human evaluation protocols. To resolve it, we propose CAGE (Code-Anchored Generative Enhancement): an LLM synthesizes executable code producing a structurally correct diagram, then a diffusion model conditioned on the programmatic output via ControlNet refines it into a visually polished graphic while preserving label fidelity. We also introduce EduDiagram-2K, a collection of 2,000 paired programmatic-stylized diagrams enabling this pipeline, and present proof-of-concept results and a research agenda for the multimedia community.
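One plausible realization of the second stage with stock components (not the authors' exact models or conditioning signal): render the LLM-generated diagram, extract a Canny edge map, and let an edge-conditioned ControlNet repaint it while structure stays pinned to the programmatic layout. The model checkpoints and the stage-1 filename below are assumptions:

```python
# Hedged sketch of a ControlNet restyling pass over a code-rendered diagram.
import cv2
import numpy as np
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

diagram = np.array(Image.open("diagram_from_code.png").convert("RGB"))  # stage-1 output
edges = cv2.Canny(cv2.cvtColor(diagram, cv2.COLOR_RGB2GRAY), 100, 200)
control = Image.fromarray(np.stack([edges] * 3, axis=-1))               # edge condition

styled = pipe(
    "a clean, colorful K-12 biology diagram, flat illustration style",
    image=control, num_inference_steps=30,
).images[0]
```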
[300] TaFall: Balance-Informed Fall Detection via Passive Thermal Sensing
Chengxiao Li, Xie Zhang, Wei Zhu, Yan Jiang, Chenshu Wu
Main category: cs.CV
TL;DR: TaFall: A privacy-preserving fall detection system using thermal array sensors that models falls as balance degradation processes and achieves high accuracy with ultra-low false alarm rates in real-world deployments.
Details
Motivation: Falls are a major health risk for older adults, especially in private indoor environments where monitoring must balance effectiveness with privacy. Existing privacy-preserving approaches using radio frequency sensing rely on coarse motion cues, limiting reliability in real-world deployments.
Method: Uses low-cost thermal array sensing to detect falls by modeling them as processes of balance degradation. Key innovations include: (1) appearance-motion fusion model for robust pose reconstruction from low-resolution thermal maps, (2) physically grounded balance-aware learning, and (3) pose-bridged pretraining to improve robustness.
Result: Achieved 98.26% detection rate with 0.65% false alarm rate on dataset with over 3,000 fall instances from 35 participants across diverse indoor environments. In 27-day deployments across four homes, attained ultra-low false alarm rate of 0.00126%. Pilot bathroom study confirmed robustness under moisture and thermal interference.
Conclusion: TaFall establishes a reliable and privacy-preserving approach to fall detection in everyday living environments by combining thermal array sensing with biomechanical balance modeling.
Abstract: Falls are a major cause of injury and mortality among older adults, yet most incidents occur in private indoor environments where monitoring must balance effectiveness with privacy. Existing privacy-preserving fall detection approaches, particularly those based on radio frequency sensing, often rely on coarse motion cues, which limits reliability in real-world deployments. We introduce TaFall, a balance-informed fall detection system based on low-cost, privacy-preserving thermal array sensing. The key insight is that TaFall models a fall as a process of balance degradation and detects falls by estimating pose-driven biomechanical balance dynamics. To enable this capability from low-resolution thermal array maps, we propose (i) an appearance-motion fusion model for robust pose reconstruction, (ii) physically grounded balance-aware learning, and (iii) pose-bridged pretraining to improve robustness. TaFall achieves a detection rate of 98.26% with a false alarm rate of 0.65% on our dataset with over 3,000 fall instances from 35 participants across diverse indoor environments. In 27-day deployments across four homes, TaFall attains an ultra-low false alarm rate of 0.00126%, and a pilot bathroom study confirms robustness under moisture and thermal interference. Together, these results establish TaFall as a reliable and privacy-preserving approach to fall detection in everyday living environments.
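The "balance degradation" view can be made concrete with a toy proxy, far simpler than the paper's biomechanical model: track how far a center-of-mass estimate drifts, horizontally, outside the support region spanned by the ankles. Keypoint indices and the scoring rule are hypothetical:

```python
# Toy sketch of a pose-driven balance proxy (illustrative only).
import numpy as np

def balance_margin(keypoints: np.ndarray) -> float:
    """keypoints: (J, 2) image-plane joints; indices below are hypothetical."""
    HIP_L, HIP_R, ANKLE_L, ANKLE_R = 11, 12, 15, 16
    com_x = keypoints[[HIP_L, HIP_R], 0].mean()            # crude center-of-mass proxy
    left, right = sorted([keypoints[ANKLE_L, 0], keypoints[ANKLE_R, 0]])
    if left <= com_x <= right:
        return 0.0                                         # CoM inside base of support
    return float(min(abs(com_x - left), abs(com_x - right)))

def fall_score(margins: np.ndarray) -> float:
    # A sustained, growing margin over consecutive frames signals balance loss.
    return float(np.trapz(margins))
```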
[301] Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models
Jiaying Wu, Fanxiao Li, Zihang Fu, Min-Yen Kan, Bryan Hooi
Main category: cs.CV
TL;DR: DeceptionDecoded: A large-scale benchmark for multimodal misinformation detection focusing on creator intent, with 12,000 image-caption pairs and three intent-centric tasks to evaluate VLMs’ ability to reason about misleading narratives.
Details
Motivation: Multimodal misinformation impact stems not just from factual errors but from deliberately embedded misleading narratives. Understanding creator intent is crucial for effective multimodal misinformation detection and information governance, but current VLMs lack this capability.
Method: Created DeceptionDecoded benchmark using intent-guided simulation framework that models both desired influence and execution plans of news creators. Dataset includes 12,000 image-caption pairs (both misleading and non-misleading) grounded in trustworthy reference articles, spanning manipulations across visual and textual modalities. Supports three tasks: misleading intent detection, misleading source attribution, and creator desire inference.
Result: Evaluation of 14 state-of-the-art VLMs shows they struggle with intent reasoning, relying on shallow cues like surface-level alignment, stylistic polish, or heuristic authenticity signals. Models trained on DeceptionDecoded demonstrate strong transferability to real-world MMD, validating the framework as both benchmark and data synthesis engine.
Conclusion: DeceptionDecoded serves as both a diagnostic tool for VLM fragility in intent reasoning and a high-quality data synthesis engine for enhancing robustness in real-world multimodal misinformation governance. The framework enables models to learn implication-level intent reasoning.
Abstract: The impact of multimodal misinformation arises not only from factual inaccuracies but also from the misleading narratives that creators deliberately embed. Interpreting such creator intent is therefore essential for multimodal misinformation detection (MMD) and effective information governance. To this end, we introduce DeceptionDecoded, a large-scale benchmark of 12,000 image-caption pairs grounded in trustworthy reference articles, created using an intent-guided simulation framework that models both the desired influence and the execution plan of news creators. The dataset captures both misleading and non-misleading cases, spanning manipulations across visual and textual modalities, and supports three intent-centric tasks: (1) misleading intent detection, (2) misleading source attribution, and (3) creator desire inference. We evaluate 14 state-of-the-art vision-language models (VLMs) and find that they struggle with intent reasoning, often relying on shallow cues such as surface-level alignment, stylistic polish, or heuristic authenticity signals. To bridge this, our framework systematically synthesizes data that enables models to learn implication-level intent reasoning. Models trained on DeceptionDecoded demonstrate strong transferability to real-world MMD, validating our framework as both a benchmark to diagnose VLM fragility and a data synthesis engine that provides high-quality, intent-focused resources for enhancing robustness in real-world multimodal misinformation governance.
[302] EDFNet: Early Fusion of Edge and Depth for Thin-Obstacle Segmentation in UAV Navigation
Negar Fathi
Main category: cs.CV
TL;DR: EDFNet: Early-fusion RGB-Depth-Edge segmentation framework for thin-obstacle perception in UAV navigation, achieving best performance with pretrained RGBDE U-Net but struggling with ultra-thin categories.
Details
Motivation: Thin obstacles like wires, poles, and branches are difficult for UAVs to detect due to few pixels, weak visual contrast, and class imbalance. Existing segmentation methods don't fully exploit multimodal cues needed for thin-structure perception.
Method: EDFNet is a modular early-fusion segmentation framework integrating RGB, depth, and edge information. Evaluated on DDOS dataset across 16 modality-backbone configurations using U-Net and DeepLabV3 in pretrained and non-pretrained settings.
Result: Early RGB-Depth-Edge fusion provides competitive baseline with consistent gains in boundary-sensitive metrics. Pretrained RGBDE U-Net achieves best overall performance (Thin-Structure Evaluation Score: 0.244, mean IoU: 0.219, boundary IoU: 0.234) at 19.62 FPS. Performance on ultra-thin categories remains low across all models.
Conclusion: Early RGB-Depth-Edge fusion is a practical modular baseline for thin-obstacle segmentation in UAV navigation, but reliable ultra-thin segmentation remains an open challenge.
Abstract: Autonomous Unmanned Aerial Vehicles (UAVs) must reliably detect thin obstacles such as wires, poles, and branches to navigate safely in real-world environments. These structures remain difficult to perceive because they occupy few pixels, often exhibit weak visual contrast, and are strongly affected by class imbalance. Existing segmentation methods primarily target coarser obstacles and do not fully exploit the complementary multimodal cues needed for thin-structure perception. We present EDFNet, a modular early-fusion segmentation framework that integrates RGB, depth, and edge information for thin-obstacle perception in cluttered aerial scenes. We evaluate EDFNet on the Drone Depth and Obstacle Segmentation (DDOS) dataset across sixteen modality-backbone configurations using U-Net and DeepLabV3 in pretrained and non-pretrained settings. The results show that early RGB-Depth-Edge fusion provides a competitive and well-balanced baseline, with the most consistent gains appearing in boundary-sensitive and recall-oriented metrics. The pretrained RGBDE U-Net achieves the best overall performance, with the highest Thin-Structure Evaluation Score (0.244), mean IoU (0.219), and boundary IoU (0.234), while maintaining competitive runtime performance (19.62 FPS) on our evaluation hardware. However, performance on the rarest ultra-thin categories remains low across all models, indicating that reliable ultra-thin segmentation is still an open challenge. Overall, these findings position early RGB-Depth-Edge fusion as a practical and modular baseline for thin-obstacle segmentation in UAV navigation.
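Early fusion here is the straightforward reading of "RGBDE": stack the modalities channel-wise and widen the network's first convolution. A minimal sketch (backbone details are ours):

```python
# Minimal sketch of an early-fusion stem: RGB (3) + depth (1) + edge (1) = 5 channels.
import torch
import torch.nn as nn

class EarlyFusionStem(nn.Module):
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(5, out_channels, kernel_size=3, padding=1)

    def forward(self, rgb, depth, edge):
        x = torch.cat([rgb, depth, edge], dim=1)   # (B, 5, H, W)
        return self.conv(x)

stem = EarlyFusionStem()
feat = stem(torch.rand(2, 3, 256, 256), torch.rand(2, 1, 256, 256), torch.rand(2, 1, 256, 256))
```

The rest of the encoder-decoder then proceeds as in a standard U-Net or DeepLabV3, which is what makes the fusion "early": it happens once at the input rather than at intermediate feature stages.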
[303] GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu
Main category: cs.CV
TL;DR: GoT-R1 uses reinforcement learning with MLLM-based rewards to improve spatial reasoning in text-to-image generation for complex compositional prompts.
Details
Motivation: Current visual generation models struggle with complex prompts requiring precise spatial relationships and multiple object attributes, needing better semantic-spatial reasoning capabilities.
Method: Reinforcement learning framework building on Generation Chain-of-Thought, with dual-stage multi-dimensional reward system using MLLMs to evaluate reasoning process and final output across semantic alignment, spatial accuracy, and visual quality.
Result: Significant improvements on T2I-CompBench benchmark, especially for compositional tasks with precise spatial relationships and attribute binding
Conclusion: GoT-R1 advances the state-of-the-art in image generation by transferring sophisticated reasoning capabilities to the visual generation domain.
Abstract: Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.
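The reward aggregation can be outlined in a few lines; the judge interface and the weights below are assumptions, and the paper's dual-stage design (scoring the reasoning chain and the final image separately) is more structured than this flat combination:

```python
# Sketch of a multi-dimensional reward: MLLM judges score several axes and the
# scalar RL reward is a weighted combination. `judge` is a hypothetical API.
def multi_dim_reward(judge, prompt, reasoning, image,
                     w_sem=0.4, w_spatial=0.4, w_quality=0.2) -> float:
    r_sem = judge.score_semantic_alignment(prompt, image)
    r_spatial = judge.score_spatial_accuracy(prompt, reasoning, image)
    r_quality = judge.score_visual_quality(image)
    return w_sem * r_sem + w_spatial * r_spatial + w_quality * r_quality
```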
[304] Assessing Privacy Preservation and Utility in Online Vision-Language Models
Karmesh Siddharam Chaudhari, Youxiang Zhu, Amy Feng, Xiaohui Liang, Honggang Zhang
Main category: cs.CV
TL;DR: This paper addresses privacy risks in Online Vision Language Models (OVLMs) where images uploaded for various utilities can reveal Personally Identifiable Information (PII) through contextual relationships, and proposes methods to protect privacy while maintaining utility in VLM applications.
Details
Motivation: The increasing use of OVLMs for image processing introduces significant privacy risks as users upload images without awareness of potential privacy violations. Images contain relationships that can reveal PII, where even seemingly harmless details can indirectly expose sensitive information through surrounding clues.
Method: The paper investigates how extraction of contextual relationships from images leads to direct (explicit) or indirect (implicit) PII exposure. It proposes methods to protect privacy while preserving the intended utility of images in VLM-based applications, though specific techniques are not detailed in the abstract.
Result: The evaluation demonstrates the efficacy of the proposed privacy protection techniques, highlighting the delicate balance between maintaining utility and protecting privacy in online image processing environments.
Conclusion: The paper addresses critical PII disclosure issues in OVLMs and provides solutions for privacy protection in vision-language model applications, emphasizing the need to balance privacy preservation with functional utility.
Abstract: The increasing use of Online Vision Language Models (OVLMs) for processing images has introduced significant privacy risks, as individuals frequently upload images for various utilities, unaware of the potential for privacy violations. Images contain relationships that relate to Personally Identifiable Information (PII), where even seemingly harmless details can indirectly reveal sensitive information through surrounding clues. This paper explores the critical issue of PII disclosure in images uploaded to OVLMs and its implications for user privacy. We investigate how the extraction of contextual relationships from images can lead to direct (explicit) or indirect (implicit) exposure of PII, significantly compromising personal privacy. Furthermore, we propose methods to protect privacy while preserving the intended utility of the images in Vision Language Model (VLM)-based applications. Our evaluation demonstrates the efficacy of these techniques, highlighting the delicate balance between maintaining utility and protecting privacy in online image processing environments.
[305] I Can’t Believe TTA Is Not Better: When Test-Time Augmentation Hurts Medical Image Classification
Daniel Nobrega Medeiros
Main category: cs.CV
TL;DR: TTA (test-time augmentation) consistently degrades accuracy in medical imaging classification across multiple benchmarks and architectures, contrary to common assumptions, with distribution shifts and batch normalization mismatches identified as key causes.
Details
Motivation: The paper challenges the widespread assumption that test-time augmentation improves classification accuracy in medical imaging, where it's routinely deployed in production systems and competition solutions. The authors aim to systematically test this assumption empirically.
Method: Conducted systematic empirical study across three MedMNIST v2 benchmarks and four architectures spanning three orders of magnitude in parameter count (21K to 11M). Evaluated TTA with standard augmentation pipelines, analyzed distribution shifts, and performed ablation studies on augmentation strategies.
Result: TTA consistently degrades accuracy relative to single-pass inference, with drops as severe as 31.6 percentage points for ResNet-18 on pathology images. Only exception was ResNet-18 on dermatology images with modest +1.6% gain. Degradation affects all architectures and worsens with more augmented views.
Conclusion: TTA should not be applied as a default post-hoc improvement but must be validated on specific model-dataset combinations. Distribution shift between augmented and training-time inputs, amplified by batch normalization statistics mismatch, is the primary mechanism causing degradation.
Abstract: Test-time augmentation (TTA)–aggregating predictions over multiple augmented copies of a test input–is widely assumed to improve classification accuracy, particularly in medical imaging where it is routinely deployed in production systems and competition solutions. We present a systematic empirical study challenging this assumption across three MedMNIST v2 benchmarks and four architectures spanning three orders of magnitude in parameter count (21K to 11M). Our principal finding is that TTA with standard augmentation pipelines consistently degrades accuracy relative to single-pass inference, with drops as severe as 31.6 percentage points for ResNet-18 on pathology images. This degradation affects all architectures, including convolutional models, and worsens with more augmented views. The sole exception is ResNet-18 on dermatology images, which gains a modest +1.6%. We identify the distribution shift between augmented and training-time inputs–amplified by batch normalization statistics mismatch–as the primary mechanism. Our ablation studies show that augmentation strategy matters critically: intensity-only augmentations preserve more performance than geometric transforms, and including the original unaugmented image partially mitigates but does not eliminate the accuracy drop. These findings serve as a cautionary note for practitioners: TTA should not be applied as a default post-hoc improvement but must be validated on the specific model-dataset combination.
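The TTA protocol under study is the generic one: average softmax predictions over several augmented copies of each input and compare against single-pass inference. A self-contained sketch; the augmentation pipeline shown is a generic "standard" recipe, not the paper's exact one:

```python
# Sketch of test-time augmentation: average probabilities over K views.
import torch
import torchvision.transforms as T

tta_transform = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomRotation(15),
    T.ColorJitter(brightness=0.2, contrast=0.2),
])

@torch.no_grad()
def tta_predict(model, images: torch.Tensor, n_views: int = 8) -> torch.Tensor:
    """images: (B, C, H, W), already normalized; returns averaged class probabilities."""
    model.eval()
    probs = model(images).softmax(dim=1)             # include the original view
    for _ in range(n_views - 1):
        probs = probs + model(tta_transform(images)).softmax(dim=1)
    return probs / n_views
```

The paper's batch-normalization point fits here: every augmented view is pushed through BN layers whose statistics were collected on unaugmented training data, so each view adds a small distribution shift, and averaging more views compounds rather than cancels the mismatch.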
[306] Generalizable Deepfake Detection Based on Forgery-aware Layer Masking and Multi-artifact Subspace Decomposition
Xiang Zhang, Wenliang Weng, Daoyong Fu, Beijing Chen, Ziqiang Li, Ziwen He, Zhangjie Fu
Main category: cs.CV
TL;DR: FMSD: A deepfake detection framework using forgery-aware layer masking and multi-artifact subspace decomposition to improve generalization across diverse forgery methods while preserving pretrained semantic representations.
Details
Motivation: Deepfake detection struggles in cross-dataset scenarios due to varying artifact patterns across forgery methods. Existing approaches often overemphasize forgery-specific cues and disturb semantic representations, weakening generalization.
Method: Proposes FMSD with two key components: 1) Forgery-aware Layer Masking identifies forgery-sensitive layers via gradient bias-variance analysis for selective updating, and 2) Multi-Artifact Subspace Decomposition uses SVD to decompose selected layers into semantic and multiple learnable artifact subspaces with orthogonality and spectral consistency constraints.
Result: The framework effectively models diverse forgery patterns while preserving pretrained semantic representations, improving generalization in cross-dataset scenarios.
Conclusion: FMSD addresses limitations of existing deepfake detection methods by selectively updating forgery-sensitive layers and decomposing representations to capture heterogeneous artifacts without compromising semantic understanding.
Abstract: Deepfake detection remains highly challenging, particularly in cross-dataset scenarios and complex real-world settings. This challenge mainly arises because artifact patterns vary substantially across different forgery methods, whereas adapting pretrained models to such artifacts often overemphasizes forgery-specific cues and disturbs semantic representations, thereby weakening generalization. Existing approaches typically rely on full-parameter fine-tuning or auxiliary supervision to improve discrimination. However, they often struggle to model diverse forgery artifacts without compromising pretrained representations. To address these limitations, we propose FMSD, a deepfake detection framework built upon Forgery-aware Layer Masking and Multi-Artifact Subspace Decomposition. Specifically, Forgery-aware Layer Masking evaluates the bias-variance characteristics of layer-wise gradients to identify forgery-sensitive layers, thereby selectively updating them while reducing unnecessary disturbance to pretrained representations. Building upon this, Multi-Artifact Subspace Decomposition further decomposes the selected layer weights via Singular Value Decomposition (SVD) into a semantic subspace and multiple learnable artifact subspaces. These subspaces are optimized to capture heterogeneous and complementary forgery artifacts, enabling effective modeling of diverse forgery patterns while preserving pretrained semantic representations. Furthermore, orthogonality and spectral consistency constraints are imposed to regularize the artifact subspaces, reducing redundancy across them while preserving the overall spectral structure of pretrained weights.
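The SVD split admits a compact sketch: keep the top singular directions of a selected layer's weight as a frozen "semantic" part and re-learn the residual through several small artifact subspaces. Rank choices, the low-rank residual parameterization, and the penalty form are illustrative, not the paper's:

```python
# Sketch of semantic/artifact subspace decomposition for one linear layer.
import torch
import torch.nn as nn

class MultiArtifactLinear(nn.Module):
    def __init__(self, weight: torch.Tensor, sem_rank: int = 64,
                 n_artifacts: int = 4, art_rank: int = 8):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        sem = U[:, :sem_rank] @ torch.diag(S[:sem_rank]) @ Vh[:sem_rank]
        self.register_buffer("semantic", sem)        # frozen semantic subspace
        out_f, in_f = weight.shape
        self.down = nn.ParameterList(
            nn.Parameter(torch.randn(in_f, art_rank) * 1e-3) for _ in range(n_artifacts))
        self.up = nn.ParameterList(
            nn.Parameter(torch.zeros(art_rank, out_f)) for _ in range(n_artifacts))

    def forward(self, x):
        y = x @ self.semantic.T
        for d, u in zip(self.down, self.up):         # learnable artifact subspaces
            y = y + (x @ d) @ u
        return y

    def orthogonality_penalty(self):
        # Encourage distinct artifact subspaces (off-diagonal cross-Gram energy).
        loss = 0.0
        for i, di in enumerate(self.down):
            for dj in list(self.down)[i + 1:]:
                loss = loss + (di.T @ dj).pow(2).sum()
        return loss
```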
[307] Attention-Guided Flow-Matching for Sparse 3D Geological Generation
Zhixiang Lu, Mengqi Han, Peixin Guo, Tianming Bai, Jionglong Su, Fei Fang, Sifan Song
Main category: cs.CV
TL;DR: 3D-GeoFlow: An attention-guided continuous flow matching framework for generating high-resolution 3D geological models from sparse 1D borehole and 2D surface data, addressing limitations of traditional methods and diffusion models for categorical data.
Details
Motivation: Traditional geological modeling methods fail to capture non-linear topological discontinuities under extreme data sparsity, while diffusion models suffer from representation collapse when conditioned on sparse categorical grids. There's a need for better methods to construct 3D geological models from sparse multimodal data.
Method: Proposes 3D-GeoFlow, an Attention-Guided Continuous Flow Matching framework that reformulates discrete categorical generation as continuous vector field regression optimized via Mean Squared Error. Uses 3D Attention Gates to dynamically propagate localized borehole features across volumetric latent space for structural coherence.
Result: Extensive out-of-distribution evaluations on a curated dataset of 2,200 procedurally generated 3D geological cases show that 3D-GeoFlow significantly outperforms heuristic interpolations and standard diffusion baselines, achieving a paradigm shift in geological modeling.
Conclusion: 3D-GeoFlow represents a significant advancement in sparse multimodal geological modeling, providing stable deterministic optimal transport paths and addressing key limitations of existing methods for categorical data generation from sparse inputs.
Abstract: Constructing high-resolution 3D geological models from sparse 1D borehole and 2D surface data is a highly ill-posed inverse problem. Traditional heuristic and implicit modeling methods fundamentally fail to capture non-linear topological discontinuities under extreme sparsity, often yielding unrealistic artifacts. Furthermore, while deep generative architectures like Diffusion Models have revolutionized continuous domains, they suffer from severe representation collapse when conditioned on sparse categorical grids. To bridge this gap, we propose 3D-GeoFlow, the first Attention-Guided Continuous Flow Matching framework tailored for sparse multimodal geological modeling. By reformulating discrete categorical generation as a simulation-free, continuous vector field regression optimized via Mean Squared Error, our model establishes stable, deterministic optimal transport paths. Crucially, we integrate 3D Attention Gates to dynamically propagate localized borehole features across the volumetric latent space, ensuring macroscopic structural coherence. To validate our framework, we curated a large-scale multimodal dataset comprising 2,200 procedurally generated 3D geological cases. Extensive out-of-distribution (OOD) evaluations demonstrate that 3D-GeoFlow achieves a paradigm shift, significantly outperforming heuristic interpolations and standard diffusion baselines.
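The generic flow-matching recipe the paper builds on fits in a few lines: sample a time, interpolate between noise and the (continuously relaxed) target volume, and regress the constant velocity field with MSE. The model signature and the treatment of borehole conditioning as an opaque `cond` tensor are assumptions:

```python
# Minimal conditional flow-matching training step (generic recipe, not the
# paper's full architecture with 3D attention gates).
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """x1: (B, C, D, H, W) continuous relaxation of the categorical volume."""
    x0 = torch.randn_like(x1)                         # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1, 1, 1)
    xt = (1 - t) * x0 + t * x1                        # linear (optimal-transport) path
    v_target = x1 - x0                                # constant velocity along the path
    v_pred = model(xt, t.flatten(), cond)             # assumed model signature
    return F.mse_loss(v_pred, v_target)               # simulation-free MSE objective
```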
[308] PASTA: Vision Transformer Patch Aggregation for Weakly Supervised Target and Anomaly Segmentation
Melanie Neubauer, Elmar Rueckert, Christian Rauch
Main category: cs.CV
TL;DR: PASTA: A weakly supervised pipeline for object segmentation and classification using image-level supervision and self-supervised ViT features, achieving domain-agnostic anomaly detection with reduced training time.
Details
Motivation: Existing perception systems fail to meet real-time processing, pixel-level segmentation precision, and robust accuracy requirements for industrial/agricultural applications like material recycling and weeding due to reliance on exhaustively annotated datasets.
Method: Proposes PASTA pipeline using weak image-level supervision. Compares observed scenes with nominal references to identify Target and Anomaly objects through distribution analysis in self-supervised Vision Transformer feature spaces. Uses semantic text-prompts via Segment Anything Model 3 for zero-shot object segmentation guidance.
Result: 75.8% training time reduction compared to domain-specific baselines. Achieves up to 88.3% IoU for Target segmentation and up to 63.5% IoU for Anomaly segmentation in industrial and agricultural domains. Demonstrated on custom steel scrap recycling and plant datasets.
Conclusion: PASTA provides a domain-agnostic weakly supervised approach that reduces annotation burden while maintaining high segmentation performance for target and anomaly detection in unstructured environments.
Abstract: Detecting unseen anomalies in unstructured environments presents a critical challenge for industrial and agricultural applications such as material recycling and weeding. Existing perception systems frequently fail to satisfy the strict operational requirements of these domains, specifically real-time processing, pixel-level segmentation precision, and robust accuracy, due to their reliance on exhaustively annotated datasets. To address these limitations, we propose a weakly supervised pipeline for object segmentation and classification using weak image-level supervision called ‘Patch Aggregation for Segmentation of Targets and Anomalies’ (PASTA). By comparing an observed scene with a nominal reference, PASTA identifies Target and Anomaly objects through distribution analysis in self-supervised Vision Transformer (ViT) feature spaces. Our pipeline utilizes semantic text-prompts via the Segment Anything Model 3 to guide zero-shot object segmentation. Evaluations on a custom steel scrap recycling dataset and a plant dataset demonstrate a 75.8% training time reduction for our approach compared to domain-specific baselines. While being domain-agnostic, our method achieves superior Target (up to 88.3% IoU) and Anomaly (up to 63.5% IoU) segmentation performance in the industrial and agricultural domain.
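Our reading of the core comparison can be sketched as a nearest-neighbor test against a bank of nominal-scene patch features; the feature extraction itself is abstracted away, and the threshold is a guess:

```python
# Sketch of patch-level distribution analysis: score each observed ViT patch by
# its distance to the closest nominal patch; outliers become anomaly candidates.
import torch

def patch_outlier_scores(obs_patches: torch.Tensor, nominal_bank: torch.Tensor) -> torch.Tensor:
    """obs_patches: (N, D); nominal_bank: (M, D); both L2-normalized ViT features."""
    sims = obs_patches @ nominal_bank.T       # cosine similarity to nominal patches
    nn_sim = sims.max(dim=1).values
    return 1.0 - nn_sim                       # high score = unlike anything nominal

def anomaly_mask(scores: torch.Tensor, grid_hw: tuple, thresh: float = 0.35) -> torch.Tensor:
    h, w = grid_hw
    return (scores > thresh).reshape(h, w)    # coarse patch-level mask; threshold is a guess
```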
[309] Identity-Aware U-Net: Fine-grained Cell Segmentation via Identity-Aware Representation Learning
Rui Xiao
Main category: cs.CV
TL;DR: IAU-Net is a unified framework for fine-grained object segmentation that combines spatial localization with instance discrimination using an auxiliary embedding branch and triplet-based metric learning to distinguish morphologically similar objects.
Details
Motivation: Segmentation models struggle with objects having highly similar shapes, ambiguous boundaries, overlapping instances, and weak inter-instance visual differences. Conventional models lack discriminative capacity to reliably distinguish target objects from morphologically similar distractors.
Method: Proposes Identity-Aware U-Net (IAU-Net) with U-Net-style encoder-decoder architecture augmented by an auxiliary embedding branch that learns discriminative identity representations from high-level features. Incorporates triplet-based metric learning to pull target-consistent embeddings together and separate them from hard negatives with similar morphology.
Result: Experiments on benchmarks including cell segmentation demonstrate promising results, particularly in challenging cases involving similar contours, dense layouts, and ambiguous boundaries.
Conclusion: The framework enables models to move beyond category-level segmentation and acquire stronger capability for precise discrimination among visually similar objects through joint modeling of spatial localization and instance discrimination.
Abstract: Precise segmentation of objects with highly similar shapes remains a challenging problem in dense prediction, especially in scenarios with ambiguous boundaries, overlapping instances, and weak inter-instance visual differences. While conventional segmentation models are effective at localizing object regions, they often lack the discriminative capacity required to reliably distinguish a target object from morphologically similar distractors. In this work, we study fine-grained object segmentation from an identity-aware perspective and propose Identity-Aware U-Net (IAU-Net), a unified framework that jointly models spatial localization and instance discrimination. Built upon a U-Net-style encoder-decoder architecture, our method augments the segmentation backbone with an auxiliary embedding branch that learns discriminative identity representations from high-level features, while the main branch predicts pixel-accurate masks. To enhance robustness in distinguishing objects with near-identical contours or textures, we further incorporate triplet-based metric learning, which pulls target-consistent embeddings together and separates them from hard negatives with similar morphology. This design enables the model to move beyond category-level segmentation and acquire a stronger capability for precise discrimination among visually similar objects. Experiments on benchmarks including cell segmentation demonstrate promising results, particularly in challenging cases involving similar contours, dense layouts, and ambiguous boundaries.
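The triplet-based objective is standard metric learning; a minimal sketch follows, with the margin and embedding size as assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def identity_triplet_loss(anchor, positive, hard_negative, margin=0.5):
    """Pull target-consistent embeddings together and push away hard
    negatives with near-identical morphology (margin is an assumption)."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, hard_negative)
    return F.relu(d_pos - d_neg + margin).mean()

# One embedding per instance, pooled from the auxiliary branch.
a, p, n = (torch.randn(8, 128) for _ in range(3))
print(identity_triplet_loss(a, p, n))
```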
[310] Multi-Granularity Reasoning for Image Quality Assessment via Attribute-Aware Reinforcement Learning to Rank
Xiangyong Chen, Xiaochuan Lin, Haoran Liu, Xuan Li, Yichen Su, Xiangwei Guo
Main category: cs.CV
TL;DR: MG-IQA extends reinforcement learning to rank for multi-granularity image quality assessment, jointly predicting overall quality and fine-grained attributes through attribute-aware prompting and multi-dimensional reward modeling.
Details
Motivation: Existing IQA methods using RL2R operate at single granularity, predicting only overall quality scores while overlooking the multi-dimensional nature of human quality perception (sharpness, color fidelity, noise, aesthetics).
Method: Proposes MG-IQA with: 1) attribute-aware prompting for structured multi-attribute reasoning from VLMs, 2) multi-dimensional Thurstone reward model for attribute-specific fidelity rewards, 3) cross-domain alignment for stable joint training across synthetic/authentic distortion and AI-generated image datasets.
Result: Extensive experiments on eight IQA benchmarks show MG-IQA outperforms SOTA methods in overall quality prediction (average SRCC improvement of 2.1%) and attribute-level assessment, while generating interpretable, human-aligned quality descriptions.
Conclusion: MG-IQA successfully extends RL2R to multi-granularity IQA, enabling joint assessment of overall quality and fine-grained attributes with improved performance and interpretability.
Abstract: Recent advances in reasoning-induced image quality assessment (IQA) have demonstrated the power of reinforcement learning to rank (RL2R) for training vision-language models (VLMs) to assess perceptual quality. However, existing approaches operate at a single granularity, predicting only an overall quality score, while overlooking the multi-dimensional nature of human quality perception, which encompasses attributes such as sharpness, color fidelity, noise level, and compositional aesthetics. In this paper, we propose MG-IQA (Multi-Granularity IQA), a multi-granularity reasoning framework that extends RL2R to jointly assess overall image quality and fine-grained quality attributes within a single inference pass. Our approach introduces three key innovations: (1) an attribute-aware prompting strategy that elicits structured multi-attribute reasoning from VLMs; (2) a multi-dimensional Thurstone reward model that computes attribute-specific fidelity rewards for group relative policy optimization; and (3) a cross-domain alignment mechanism that enables stable joint training across synthetic distortion, authentic distortion, and AI-generated image datasets without perceptual scale re-alignment. Extensive experiments on eight IQA benchmarks demonstrate that MG-IQA consistently outperforms state-of-the-art methods in both overall quality prediction (average SRCC improvement of 2.1%) and attribute-level assessment, while generating interpretable, human-aligned quality descriptions.
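For intuition, a Thurstone-style pairwise model turns score differences into preference probabilities, e.g. the Case V form P(i preferred over j) = Phi((s_i - s_j) / sqrt(2)). The sketch below uses that form as a log-likelihood reward; the exact reward fed to group relative policy optimization in the paper may differ.

```python
import torch
from torch.distributions import Normal

def thurstone_reward(pred_i, pred_j, human_prefers_i):
    """Log-likelihood of a human pairwise preference under a Thurstone
    Case V model, usable as an attribute-specific fidelity reward.
    pred_i, pred_j: predicted attribute scores for two images.
    human_prefers_i: 1.0 if annotators rank image i higher, else 0.0.
    (A sketch of the idea, not the paper's exact reward.)"""
    p_i_wins = Normal(0.0, 1.0).cdf((pred_i - pred_j) / 2 ** 0.5)
    p = torch.where(human_prefers_i.bool(), p_i_wins, 1 - p_i_wins)
    return torch.log(p.clamp_min(1e-6))

print(thurstone_reward(torch.tensor(4.2), torch.tensor(3.1), torch.tensor(1.0)))
```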
[311] The Deployment Gap in AI Media Detection: Platform-Aware and Visually Constrained Adversarial Evaluation
Aishwarya Budhkar, Trishita Dhara, Siddhesh Sheth
Main category: cs.CV
TL;DR: Platform-aware adversarial evaluation framework reveals AI media detectors are vulnerable to realistic deployment transforms like resizing, compression, and meme-style modifications, despite near-perfect clean performance.
Details
Motivation: AI media detectors show near-perfect performance in clean lab settings but their robustness under real-world deployment conditions (where images undergo resizing, compression, re-encoding, and visual modifications) remains underexplored, creating a deployment gap between laboratory robustness and real-world reliability.
Method: Introduces a platform-aware adversarial evaluation framework that models deployment transforms (resizing, compression, screenshot-style distortions) and constrains perturbations to visually plausible meme-style bands rather than full-image noise. Evaluates detectors under this threat model with per-image and universal perturbation attacks.
Result: Detectors achieving AUC ≈ 0.99 in clean settings experience substantial degradation under platform-aware attacks. Per-image attacks reduce AUC significantly and achieve high fake-to-real misclassification rates. Universal perturbations exist even under localized band constraints. Detectors also show pronounced calibration collapse, becoming confidently incorrect under attack.
Conclusion: Robustness measured under clean conditions substantially overestimates deployment robustness. Platform-aware evaluation should be a necessary component of future AI media security benchmarks. The evaluation framework is released to facilitate standardized robustness assessment.
Abstract: Recent AI media detectors report near-perfect performance under clean laboratory evaluation, yet their robustness under realistic deployment conditions remains underexplored. In practice, AI-generated images are resized, compressed, re-encoded, and visually modified before being shared on online platforms. We argue that this creates a deployment gap between laboratory robustness and real-world reliability. In this work, we introduce a platform-aware adversarial evaluation framework for AI media detection that explicitly models deployment transforms (e.g., resizing, compression, screenshot-style distortions) and constrains perturbations to visually plausible meme-style bands rather than full-image noise. Under this threat model, detectors achieving AUC ≈ 0.99 in clean settings experience substantial degradation. Per-image platform-aware attacks reduce AUC to significantly lower levels and achieve high fake-to-real misclassification rates, despite strict visual constraints. We further demonstrate that universal perturbations exist even under localized band constraints, revealing shared vulnerability directions across inputs. Beyond accuracy degradation, we observe pronounced calibration collapse under attack, where detectors become confidently incorrect. Our findings highlight that robustness measured under clean conditions substantially overestimates deployment robustness. We advocate for platform-aware evaluation as a necessary component of future AI media security benchmarks and release our evaluation framework to facilitate standardized robustness assessment.
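The deployment transforms are easy to approximate; a minimal stand-in follows (output size and JPEG quality are assumptions, and the paper's screenshot-style distortions are omitted).

```python
import io
from PIL import Image

def platform_transform(img: Image.Image, size=(512, 512), jpeg_quality=70):
    """Approximate what an image undergoes when shared online: resize,
    then lossy JPEG re-encoding (size and quality are assumptions)."""
    img = img.resize(size, Image.BILINEAR)
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=jpeg_quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Evaluating a detector only on the raw image overestimates robustness;
# probe the deployment gap on platform_transform(img) and on perturbed
# inputs that are transformed before reaching the detector.
demo = Image.new("RGB", (1024, 768), "gray")
print(platform_transform(demo).size)  # (512, 512)
```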
[312] Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks
Wang Zixian
Main category: cs.CV
TL;DR: OQC introduces orthogonal quadratic complements for vision transformers, creating auxiliary quadratic branches that provide complementary information to main branches through orthogonal projection, improving accuracy with better speed-accuracy tradeoffs.
Details
Motivation: Existing bilinear feed-forward replacements for vision transformers conflate two effects: stronger second-order interactions and increased redundancy. The authors aim to study a design where auxiliary quadratic features contribute only information not already captured by the dominant hidden representation.
Method: Proposes Orthogonal Quadratic Complements (OQC) which construct a low-rank quadratic auxiliary branch and explicitly project it onto the orthogonal complement of the main branch before injection. Also studies efficient low-rank realization (OQC-LR) and gated extensions (OQC-static and OQC-dynamic).
Result: On CIFAR-100, full OQC improves AFBO baseline from 64.25 to 65.59, while OQC-LR reaches 65.52 with better speed-accuracy tradeoff. On TinyImageNet, OQC-dynamic achieves 51.88, improving baseline (50.45) by 1.43 points. Mechanism analyses show near-zero post-projection auxiliary-main overlap with improved representation geometry and class separation.
Conclusion: OQC provides a principled approach to incorporate complementary quadratic features in vision transformers through orthogonal projection, achieving consistent improvements across datasets with both ungated and gated variants.
Abstract: Recent bilinear feed-forward replacements for vision transformers can substantially improve accuracy, but they often conflate two effects: stronger second-order interactions and increased redundancy relative to the main branch. We study a complementary design principle in which auxiliary quadratic features contribute only information not already captured by the dominant hidden representation. To this end, we propose Orthogonal Quadratic Complements (OQC), which construct a low-rank quadratic auxiliary branch and explicitly project it onto the orthogonal complement of the main branch before injection. We further study an efficient low-rank realization (OQC-LR) and gated extensions (OQC-static and OQC-dynamic). Under a parameter-matched Deep-ViT and CIFAR-100 protocol with a fixed penultimate residual readout, full OQC improves an AFBO baseline from 64.25 +/- 0.22 to 65.59 +/- 0.22, while OQC-LR reaches 65.52 +/- 0.25 with a substantially better speed-accuracy tradeoff. On TinyImageNet, the gated extension OQC-dynamic achieves 51.88 +/- 0.32, improving the baseline (50.45 +/- 0.21) by 1.43 points and outperforming all ungated variants. Mechanism analyses show near-zero post-projection auxiliary-main overlap together with improved representation geometry and class separation. The full family, including both ungated and gated variants, generalizes consistently across both datasets.
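The key operation is projecting the auxiliary quadratic features onto the orthogonal complement of the main branch; a per-token rank-1 version is sketched below. The paper's actual projection may be defined over a different subspace.

```python
import torch

def orthogonal_complement(main, aux, eps=1e-8):
    """Remove from `aux` its component along `main`, so the auxiliary
    quadratic branch injects only information not already carried by the
    dominant hidden representation. Shapes: (..., D). A per-token rank-1
    projection; an assumption about how the paper defines the complement."""
    unit = main / (main.norm(dim=-1, keepdim=True) + eps)
    coeff = (aux * unit).sum(dim=-1, keepdim=True)
    return aux - coeff * unit

h = torch.randn(4, 16, 256)              # main FFN branch
q = torch.randn(4, 16, 256)              # low-rank quadratic branch
q_perp = orthogonal_complement(h, q)
print((q_perp * h).sum(-1).abs().max())  # ~0: near-zero auxiliary-main overlap
```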
[313] Robust Fair Disease Diagnosis in CT Images
Justin Li, Daniel Ding, Asmita Yuki Pritha, Aryana Hou, Xin Wang, Shu Hu
Main category: cs.CV
TL;DR: Two-level objective combining logit-adjusted cross-entropy for class imbalance and Conditional Value at Risk for group fairness improves chest CT diagnosis across demographics.
Details
Motivation: Clinical datasets suffer from compound problems: class imbalance and demographic underrepresentation that standard rebalancing or fairness methods alone cannot address, leading to uneven performance across patient groups.
Method: Proposes a two-level objective: 1) Logit-adjusted cross-entropy loss at sample level to handle class imbalance with provable consistency guarantees, and 2) Conditional Value at Risk aggregation at group level to direct optimization toward demographic groups with higher loss.
Result: On Fair Disease Diagnosis benchmark using 3D ResNet-18, achieves gender-averaged macro F1 of 0.8403 with fairness gap of 0.0239, showing 13.3% improvement in score and 78% reduction in demographic disparity over baseline. Ablations confirm both components are necessary.
Conclusion: The combined approach effectively addresses compound failure modes in clinical data where class imbalance and group underrepresentation coincide, outperforming methods that target only one aspect.
Abstract: Automated diagnosis from chest CT has improved considerably with deep learning, but models trained on skewed datasets tend to perform unevenly across patient demographics. However, the situation is worse than simple demographic bias. In clinical data, class imbalance and group underrepresentation often coincide, creating compound failure modes that neither standard rebalancing nor fairness corrections can fix alone. We introduce a two-level objective that targets both axes of this problem. Logit-adjusted cross-entropy loss operates at the sample level, shifting decision margins by class frequency with provable consistency guarantees. Conditional Value at Risk aggregation operates at the group level, directing optimization pressure toward whichever demographic group currently has the higher loss. We evaluate on the Fair Disease Diagnosis benchmark using a 3D ResNet-18 pretrained on Kinetics-400, classifying CT volumes into Adenocarcinoma, Squamous Cell Carcinoma, COVID-19, and Normal groups with patient sex annotations. The training set illustrates the compound problem concretely: squamous cell carcinoma has 84 samples total, 5 of them female. The combined loss reaches a gender-averaged macro F1 of 0.8403 with a fairness gap of 0.0239, a 13.3% improvement in score and 78% reduction in demographic disparity over the baseline. Ablations show that each component alone falls short. The code is publicly available at https://github.com/Purdue-M2/Fair-Disease-Diagnosis.
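The two-level objective combines well-known ingredients; here is a minimal sketch, with the adjustment temperature and CVaR tail fraction as assumed hyperparameters.

```python
import torch
import torch.nn.functional as F

def logit_adjusted_cvar_loss(logits, labels, groups, class_priors,
                             tau=1.0, alpha=0.5):
    """Two-level objective: logit-adjusted CE at the sample level,
    CVaR over demographic groups at the group level. `tau` and the
    CVaR tail fraction `alpha` are assumptions, not the paper's values."""
    adjusted = logits + tau * class_priors.log()   # shift margins by class frequency
    per_sample = F.cross_entropy(adjusted, labels, reduction="none")
    group_losses = torch.stack(
        [per_sample[groups == g].mean() for g in groups.unique()])
    k = max(1, int(alpha * len(group_losses)))     # focus on the worst groups
    return group_losses.topk(k).values.mean()

logits = torch.randn(32, 4)
labels = torch.randint(0, 4, (32,))
groups = torch.randint(0, 2, (32,))                # e.g. patient sex
priors = torch.tensor([0.45, 0.05, 0.30, 0.20])    # empirical class frequencies
print(logit_adjusted_cvar_loss(logits, labels, groups, priors))
```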
[314] XD-MAP: Cross-Modal Domain Adaptation via Semantic Parametric Maps for Scalable Training Data Generation
Frank Bieder, Hendrik Königshof, Haohao Hu, Fabian Immel, Yinzhe Shen, Jan-Hendrik Pauls, Christoph Stiller
Main category: cs.CV
TL;DR: XD-MAP transfers semantic knowledge from camera images to LiDAR data using parametric maps for cross-modal domain adaptation without manual labeling.
Details
Motivation: To bridge the gap between available datasets and deployment domains when specialized models outperform foundation models, enabling knowledge transfer from camera to LiDAR without requiring sensor overlap or manual annotation.
Method: Leverages detections on camera images to create semantic parametric maps, models map elements to produce pseudo labels in LiDAR domain, extends angular perception from front-view camera to full 360° view without direct sensor overlap.
Result: Outperforms single shot baselines by +19.5 mIoU for 2D semantic segmentation, +19.5 PQth for 2D panoptic segmentation, and +32.3 mIoU in 3D semantic segmentation on large-scale road feature dataset.
Conclusion: Demonstrates effective cross-modal domain adaptation achieving strong LiDAR performance without manual labeling, bridging camera and LiDAR sensing domains.
Abstract: Until open-world foundation models match the performance of specialized approaches, deep learning systems remain dependent on task- and sensor-specific data availability. To bridge the gap between available datasets and deployment domains, domain adaptation strategies are widely used. In this work, we propose XD-MAP, a novel approach to transfer sensor-specific knowledge from an image dataset to LiDAR, an entirely different sensing domain. Our method leverages detections on camera images to create a semantic parametric map. The map elements are modeled to produce pseudo labels in the target domain without any manual annotation effort. Unlike previous domain transfer approaches, our method does not require direct overlap between sensors and enables extending the angular perception range from a front-view camera to a full 360° view. On our large-scale road feature dataset, XD-MAP outperforms single shot baseline approaches by +19.5 mIoU for 2D semantic segmentation, +19.5 PQth for 2D panoptic segmentation, and +32.3 mIoU in 3D semantic segmentation. The results demonstrate the effectiveness of our approach achieving strong performance on LiDAR data without any manual labeling.
[315] Head-wise Modality Specialization within MLLMs for Robust Fake News Detection under Missing Modality
Kai Qian, Weijie Shi, Jiaqi Wang, Mengze Li, Hao Chen, Yue Cui, Hanghui Guo, Ziyi Liu, Jia Zhu, Jiajie Xu
Main category: cs.CV
TL;DR: A method for robust multimodal fake news detection under missing modalities using head-wise modality specialization in MLLMs with attention constraints and unimodal knowledge retention.
Details
Motivation: Real-world multimodal fake news detection suffers from missing modalities (deleted/corrupted images), requiring robust verification ability for each modality individually, which is challenging due to insufficient learning of low-contribution modalities and scarce unimodal annotations.
Method: Head-wise modality specialization within MLLMs: 1) Study attention heads and their relationship with performance under missing modality, 2) Head-wise specialization mechanism that allocates heads to different modalities with lower-bound attention constraints, 3) Unimodal Knowledge Retention strategy to prevent heads from drifting away from unimodal knowledge learned from limited supervision.
Result: The method improves robustness under missing modality while preserving performance with full multimodal input.
Conclusion: The proposed head-wise modality specialization approach effectively addresses the missing modality problem in multimodal fake news detection by preserving unimodal verification abilities through specialized attention heads and knowledge retention strategies.
Abstract: Multimodal fake news detection (MFND) aims to verify news credibility by jointly exploiting textual and visual evidence. However, real-world news dissemination frequently suffers from missing modality due to deleted images, corrupted screenshots, and similar issues. Thus, robust detection in this scenario requires preserving strong verification ability for each modality, which is challenging in MFND due to insufficient learning of the low-contribution modality and scarce unimodal annotations. To address this issue, we propose Head-wise Modality Specialization within Multimodal Large Language Models (MLLMs) for robust MFND under missing modality. Specifically, we first systematically study attention heads in MLLMs and their relationship with performance under missing modality, showing that modality-critical heads serve as key carriers of unimodal verification ability through their modality specialization. Based on this observation, to better preserve verification ability for the low-contribution modality, we introduce a head-wise specialization mechanism that explicitly allocates these heads to different modalities and preserves their specialization through lower-bound attention constraints. Furthermore, to better exploit scarce unimodal annotations, we propose a Unimodal Knowledge Retention strategy that prevents these heads from drifting away from the unimodal knowledge learned from limited supervision. Experiments show that our method improves robustness under missing modality while preserving performance with full multimodal input.
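One way to realize a lower-bound attention constraint is to penalize assigned heads whose attention mass on their modality's tokens drops below a floor. The sketch below is an illustration under that assumption, not the paper's exact mechanism; the floor value and head assignment are hypothetical.

```python
import torch
import torch.nn.functional as F

def attention_floor_penalty(attn, modality_mask, head_assignment, floor=0.3):
    """Keep modality-critical heads specialized: if a head assigned to a
    modality puts less than `floor` of its attention mass on that
    modality's tokens, penalize the shortfall (floor is an assumption).

    attn: (B, H, Q, K) attention weights.
    modality_mask: (B, K) 1.0 where key tokens belong to the modality.
    head_assignment: (H,) bool, True for heads assigned to the modality.
    """
    mass = (attn * modality_mask[:, None, None, :]).sum(-1)  # (B, H, Q)
    mass = mass.mean(dim=(0, 2))                             # per-head mean
    return F.relu(floor - mass[head_assignment]).sum()

attn = torch.softmax(torch.randn(2, 8, 10, 24), dim=-1)
mask = (torch.rand(2, 24) > 0.5).float()                     # e.g. image tokens
assign = torch.tensor([True] * 3 + [False] * 5)
print(attention_floor_penalty(attn, mask, assign))
```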
[316] LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
Shi-Yu Tian, Zhi Zhou, Kun-Yang Yu, Ming Yang, Yang Chen, Ziqiao Shang, Lan-Zhe Guo, Yu-Feng Li
Main category: cs.CV
TL;DR: LAST framework integrates specialized vision tools with LLMs for spatial reasoning, using an interactive sandbox (LAST-Box) and progressive training to overcome tool invocation and output interpretation challenges.
Details
Motivation: MLLMs struggle with spatial reasoning due to hallucinations and imprecision in parsing complex geometric layouts. Data-driven scaling fails to internalize structured geometric priors, while integrating specialized vision models faces challenges with heterogeneous tool invocation and interpreting diverse low-level outputs.
Method: Proposes LAST framework with LAST-Box sandbox that abstracts heterogeneous tool invocations into atomic instructions and reusable spatial skills, returning multimodal hints. Uses three-stage progressive training: understanding tool outputs, then proficient and adaptive tool invocation.
Result: LAST-7B achieves ~20% performance gains over its backbone on four datasets, outperforms strong proprietary closed-source LLMs, and substantially enhances reasoning on complex spatial tasks.
Conclusion: Tool-augmented approach effectively addresses MLLMs’ spatial reasoning limitations by integrating specialized vision models through unified abstraction and progressive training.
Abstract: Spatial reasoning is a cornerstone capability for intelligent systems to perceive and interact with the physical world. However, multimodal large language models (MLLMs) frequently suffer from hallucinations and imprecision when parsing complex geometric layouts. As data-driven scaling struggles to internalize structured geometric priors and spatial constraints, integrating mature, specialized vision models presents a compelling alternative. Despite its promise, applying this paradigm to spatial reasoning is hindered by two key challenges: The difficulty of invoking heterogeneous, parameter-rich tools, as well as the challenge of understanding and effectively leveraging their diverse low-level outputs (e.g., segmentation masks, depth maps) in high-level reasoning. To address these challenges, we propose LAST, a unified framework for tool-augmented spatial reasoning. LAST features an extensible interactive sandbox, termed LAST-Box, which abstracts heterogeneous tool invocations into atomic instructions and reusable spatial skills, returning multimodal hints (e.g., annotated images and textual descriptions) that can be directly consumed by LLMs. We further design a three-stage progressive training strategy that guides models from understanding tool outputs to proficient and adaptive tool invocation. Experiments on four datasets show that LAST-7B achieves around 20% performance gains over its backbone and outperforms strong proprietary closed-source LLMs, substantially enhancing reasoning on complex spatial tasks.
[317] Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
Subash Khanal, Srikumar Sastry, Aayush Dhakal, Adeel Ahmad, Abby Stylianou, Nathan Jacobs
Main category: cs.CV
TL;DR: Sat2Sound is a multimodal framework that maps sounds to geographic locations using satellite imagery, audio, and text descriptions, enabling soundscape understanding and synthesis.
Details
Motivation: Existing methods for geospatial soundscape mapping rely on limited paired satellite-audio data that fails to capture sound diversity. There's a need for more comprehensive soundscape understanding that can handle the full range of ambient sounds at locations.
Method: Uses vision-language models to generate rich soundscape descriptions, then learns jointly from audio, text descriptions, satellite images, and synthetic captions through contrastive and codebook-aligned learning to discover shared “soundscape concepts” across modalities.
Result: Achieves state-of-the-art performance in cross-modal retrieval between satellite images and audio on GeoSound and SoundingEarth benchmarks. Enables location-conditioned soundscape synthesis through text-to-audio models.
Conclusion: Sat2Sound provides a unified framework for explainable soundscape mapping and synthesis, overcoming data limitations through multimodal learning and enabling immersive applications with limited computational resources.
Abstract: We present Sat2Sound, a unified multimodal framework for geospatial soundscape understanding, designed to predict and map the distribution of sounds across the Earth’s surface. Existing methods for this task rely on paired satellite images and geotagged audio samples, which often fail to capture the full diversity of sound at a location. Sat2Sound overcomes this limitation by augmenting datasets with semantically rich, vision-language model-generated soundscape descriptions, which broaden the range of possible ambient sounds represented at each location. Our framework jointly learns from audio, text descriptions of audio, satellite images, and synthetic image captions through contrastive and codebook-aligned learning, discovering a set of “soundscape concepts” shared across modalities, enabling hyper-localized, explainable soundscape mapping. Sat2Sound achieves state-of-the-art performance in cross-modal retrieval between satellite image and audio on the GeoSound and SoundingEarth benchmarks. Finally, by retrieving detailed soundscape captions that can be rendered through text-to-audio models, Sat2Sound enables location-conditioned soundscape synthesis for immersive and educational applications, even with limited computational resources. Our code and models are available at https://github.com/mvrl/sat2sound.
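The contrastive piece is the standard symmetric InfoNCE between paired embeddings; a minimal sketch for the satellite-image/audio pair follows (temperature and the symmetric formulation are assumptions, and the codebook-aligned term is omitted).

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive(img_emb, aud_emb, temperature=0.07):
    """Symmetric InfoNCE between paired satellite-image and audio
    embeddings; matched pairs sit on the diagonal of the logit matrix."""
    img = F.normalize(img_emb, dim=-1)
    aud = F.normalize(aud_emb, dim=-1)
    logits = img @ aud.T / temperature
    targets = torch.arange(len(img))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

print(cross_modal_contrastive(torch.randn(16, 512), torch.randn(16, 512)))
```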
[318] Zero-Shot Synthetic-to-Real Handwritten Text Recognition via Task Analogies
Carlos Garrido-Munoz, Aniello Panariello, Silvia Cascianelli, Angelo Porrello, Simone Calderara, Jorge Calvo-Zaragoza, Rita Cucchiara
Main category: cs.CV
TL;DR: Zero-shot synthetic-to-real handwriting recognition using parameter correction learned from source languages and transferred to target languages without real data.
Details
Motivation: HTR models trained on synthetic handwriting fail to generalize to real handwriting, and existing adaptation methods require real samples from target domains. The paper addresses the fully zero-shot setting where no real data from target languages is available.
Method: Learn how model parameters change when moving from synthetic to real handwriting in source languages, then transfer this learned correction to new target languages. When using multiple sources, use linguistic similarity to weigh their contributions when combining them.
Result: Experiments across five languages and six architectures show consistent improvements over synthetic-only baselines. The transferred corrections benefit even languages unrelated to the source languages.
Conclusion: The proposed zero-shot adaptation method effectively transfers learned parameter corrections from source to target languages without requiring real data from target domains, demonstrating cross-linguistic generalization capabilities.
Abstract: Handwritten Text Recognition (HTR) models trained on synthetic handwriting often struggle to generalize to real text, and existing adaptation methods still require real samples from the target domain. In this work, we tackle the fully zero-shot synthetic-to-real generalization setting, where no real data from the target language is available. Our approach learns how model parameters change when moving from synthetic to real handwriting in one or more source languages and transfers this learned correction to new target languages. When using multiple sources, we rely on linguistic similarity to weigh their contributions when combining them. Experiments across five languages and six architectures show consistent improvements over synthetic-only baselines and reveal that the transferred corrections benefit even languages unrelated to the sources.
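The transfer step reduces to adding a similarity-weighted average of source-language parameter deltas to a synthetic-only target model; a toy sketch follows (the normalization of the weights is an assumption).

```python
import torch

def transfer_correction(theta_synth_target, source_deltas, similarities):
    """Apply the synthetic-to-real parameter correction learned on source
    languages to a target model trained only on synthetic data.

    theta_synth_target: dict of target-model parameters.
    source_deltas: list of dicts, each theta_real - theta_synth per source.
    similarities: non-negative linguistic similarities to the target.
    """
    w = torch.tensor(similarities, dtype=torch.float32)
    w = w / w.sum()
    corrected = {}
    for name, param in theta_synth_target.items():
        delta = sum(wi * d[name] for wi, d in zip(w, source_deltas))
        corrected[name] = param + delta
    return corrected

# Toy example: a one-parameter "model" and two source languages.
target = {"w": torch.zeros(3)}
deltas = [{"w": torch.ones(3)}, {"w": -torch.ones(3)}]
print(transfer_correction(target, deltas, [0.8, 0.2])["w"])  # 0.6 everywhere
```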
[319] Training Deep Visual Networks Beyond Loss and Accuracy Through a Dynamical Systems Approach
Hai La Quang, Hassan Ugail, Newton Howard, Cong Tran Tien, Nam Vu Hoai, Hung Nguyen Viet
Main category: cs.CV
TL;DR: Paper introduces dynamical systems analysis to study training dynamics of vision models, measuring integration, metastability, and stability from layer activations across training epochs.
Details
Motivation: Traditional metrics like loss and accuracy reveal little about how internal representations change during training. Need complementary methods to understand training dynamics beyond surface-level performance measures.
Method: Uses signal analysis techniques from biological neural activity studies to define three measures from layer activations: integration score (long-range coordination), metastability score (flexibility between synchronized states), and combined dynamical stability index. Applied to 9 model-dataset combinations including ResNet variants, DenseNet-121, MobileNetV2, VGG-16, and Vision Transformer on CIFAR-10/100.
Result: Three main patterns: 1) Integration measure distinguishes easier CIFAR-10 from harder CIFAR-100; 2) Stability index volatility changes may signal convergence before accuracy plateaus; 3) Integration-metastability relationship reflects different training behaviors.
Conclusion: Provides exploratory but promising new framework to understand deep visual training dynamics beyond traditional metrics, offering insights into internal representation changes during training.
Abstract: Deep visual recognition models are usually trained and evaluated using metrics such as loss and accuracy. While these measures show whether a model is improving, they reveal very little about how its internal representations change during training. This paper introduces a complementary way to study that process by examining training through the lens of dynamical systems. Drawing on ideas from signal analysis originally used to study biological neural activity, we define three measures from layer activations collected across training epochs: an integration score that reflects long-range coordination across layers, a metastability score that captures how flexibly the network shifts between more and less synchronised states, and a combined dynamical stability index. We apply this framework to nine combinations of model architecture and dataset, including several ResNet variants, DenseNet-121, MobileNetV2, VGG-16, and a pretrained Vision Transformer on CIFAR-10 and CIFAR-100. The results suggest three main patterns. First, the integration measure consistently distinguishes the easier CIFAR-10 setting from the more difficult CIFAR-100 setting. Second, changes in the volatility of the stability index may provide an early sign of convergence before accuracy fully plateaus. Third, the relationship between integration and metastability appears to reflect different styles of training behaviour. Overall, this study offers an exploratory but promising new way to understand deep visual training beyond loss and accuracy.
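To make the measures concrete, here is one simplified operationalization over per-layer activation summaries; the paper's definitions, drawn from biological signal analysis, are likely more involved, so treat this as an assumption-laden sketch.

```python
import numpy as np

def integration_and_metastability(layer_signals):
    """layer_signals: (L, T) array, one scalar summary per layer (e.g.
    mean activation) over T training steps. Integration is taken here as
    mean absolute pairwise correlation across layers; metastability as
    the temporal std of instantaneous cross-layer synchrony. Both are
    simplified stand-ins for the paper's measures."""
    corr = np.corrcoef(layer_signals)
    L = corr.shape[0]
    integration = (np.abs(corr).sum() - L) / (L * (L - 1))
    z = (layer_signals - layer_signals.mean(1, keepdims=True)) \
        / (layer_signals.std(1, keepdims=True) + 1e-8)
    sync = np.abs(z.mean(axis=0))   # instantaneous cross-layer agreement
    return integration, sync.std()

signals = np.cumsum(np.random.randn(12, 200), axis=1)  # 12 layers, 200 epochs
print(integration_and_metastability(signals))
```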
[320] Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset
Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Farida Siddiqi Prity, Saydul Akbar Murad, Nick Rahimi
Main category: cs.CV
TL;DR: A new balanced dataset for Bangla handwritten character recognition with 78 classes (~650 samples each) and a hybrid deep learning architecture combining EfficientNetB3, Vision Transformer, and Conformer modules achieves 98.84% accuracy.
Details
Motivation: Handwritten Bangla character recognition is challenging due to diverse writing styles, inconsistent stroke patterns, high visual resemblance between characters, and limited/imbalanced existing datasets.
Method: Created a new balanced dataset with diverse demographic representation, then proposed an interaction-aware hybrid architecture integrating EfficientNetB3, Vision Transformer, and Conformer modules in parallel with multi-head cross-attention fusion.
Result: Achieved 98.84% accuracy on the constructed dataset and 96.49% on the external CHBCR benchmark, demonstrating strong generalization. Grad-CAM visualizations provide interpretability.
Conclusion: The proposed dataset and hybrid architecture effectively address Bangla handwritten character recognition challenges, with publicly available resources for further research.
Abstract: Character recognition is the fundamental part of an optical character recognition (OCR) system. Word recognition, sentence transcription, document digitization, and language processing are some of the higher-order activities that can be done accurately through character recognition. Nonetheless, recognizing handwritten Bangla characters is not an easy task because they are written in different styles with inconsistent stroke patterns and a high degree of visual character resemblance. The available datasets are usually limited in intra-class variation and inequitable in class distribution. We have constructed a new balanced dataset of handwritten Bangla characters to overcome those problems. This consists of 78 classes and each class has approximately 650 samples. It contains the basic characters, composite (Juktobarno) characters and numerals. The samples come from a diverse group spanning a wide range of ages and socioeconomic backgrounds; contributors include elementary and high school students, university students, and professionals. The sample also includes both right- and left-handed writers. We have further proposed an interaction-aware hybrid deep learning architecture that integrates EfficientNetB3, Vision Transformer, and Conformer modules in parallel. A multi-head cross-attention fusion mechanism enables effective feature interaction across these components. The proposed model achieves 98.84% accuracy on the constructed dataset and 96.49% on the external CHBCR benchmark, demonstrating strong generalization capability. Grad-CAM visualizations further provide interpretability by highlighting discriminative regions. The dataset and source code of this research are publicly available at: https://huggingface.co/MIRZARAQUIB/Bangla_Handwritten_Character_Recognition.
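A multi-head cross-attention fusion between parallel streams can be sketched in a few lines; the dimensions, head count, and residual-plus-norm wiring below are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse features from parallel backbones (e.g. CNN, ViT, and
    Conformer streams): one stream's tokens attend over the others."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, other_feats):
        # query_feats: (B, Nq, D); other_feats: (B, Nk, D) from other streams
        fused, _ = self.attn(query_feats, other_feats, other_feats)
        return self.norm(query_feats + fused)

fuse = CrossAttentionFusion()
cnn_tokens, vit_tokens = torch.randn(2, 49, 256), torch.randn(2, 197, 256)
print(fuse(cnn_tokens, vit_tokens).shape)  # torch.Size([2, 49, 256])
```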
[321] Data-Driven Automated Identification of Optimal Feature-Representative Images in Infrared Thermography Using Statistical and Morphological Metrics
Harutyun Yagdjian, Martin Gurka
Main category: cs.CV
TL;DR: A data-driven framework for identifying defect-representative images in infrared thermography sequences without requiring prior spatial knowledge of defects.
Details
Motivation: IRT post-processing generates image sequences where defect visibility varies across domains, making defect identification challenging. Conventional metrics require prior knowledge of defect locations or reference regions, limiting automated analysis.
Method: Proposes three complementary metrics: Homogeneity Index (HI) quantifying statistical heterogeneity, Representative Elementary Area (REA) from Minkowski-functionals, and Total Variation Energy (TVE) index for sensitivity to localized anomalies.
Result: Validated on pulse-heated IRT data from CFRP plate with artificial defects at various depths. Demonstrates robust, unbiased ranking of image sequences for automated defect-oriented image selection.
Conclusion: Provides a reliable data-driven methodology for identifying defect-representative images in IRT without requiring prior spatial information, enabling automated analysis.
Abstract: Infrared thermography (IRT) is a widely used non-destructive testing technique for detecting structural features such as subsurface defects. However, most IRT post-processing methods generate image sequences in which defect visibility varies strongly across time, frequency, or coefficient/index domains, making the identification of defect-representative images a critical challenge. Conventional evaluation metrics, such as the signal-to-noise ratio (SNR) or the Tanimoto criterion, often require prior knowledge of defect locations or defect-free reference regions, limiting their suitability for automated and unsupervised analysis. In this work, a data-driven methodology is proposed to identify images within IRT datasets that are most likely to contain and represent structural features, particularly anomalies and defects, without requiring prior spatial information. The approach is based on three complementary metrics: the Homogeneity Index of Mixture (HI), which quantifies statistical heterogeneity via deviations of local intensity distributions from a global reference distribution; a Representative Elementary Area (REA), derived from a Minkowski-functional adaptation of the Representative Elementary Volume concept to two-dimensional images; and a geometrical-topological Total Variation Energy (TVE) index, also based on two-dimensional Minkowski functionals, designed to improve sensitivity to localized anomalies. The framework is validated experimentally using pulse-heated IRT data from a carbon fiber-reinforced polymer (CFRP) plate containing six artificial defects at depths between 0.135 mm and 0.810 mm, and is further supported by one-dimensional N-layer thermal model simulations. The results demonstrate robust and unbiased ranking of image sequences and provide a reliable basis for automated defect-oriented image selection in IRT.
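As a rough illustration of the Homogeneity Index idea (local intensity distributions compared against a global reference), the sketch below averages Jensen-Shannon divergences between tile histograms and the global histogram. The divergence choice, tile size, and bin count are assumptions; the paper's HI is defined differently in detail.

```python
import numpy as np

def homogeneity_index(img, tile=32, bins=32):
    """Mean Jensen-Shannon divergence between local tile histograms and
    the global intensity histogram: higher means more heterogeneous."""
    def kl(a, b):
        nz = a > 0
        return np.sum(a[nz] * np.log(a[nz] / b[nz]))

    hist_g, edges = np.histogram(img, bins=bins)
    p_g = hist_g / hist_g.sum()
    divs = []
    for y in range(0, img.shape[0] - tile + 1, tile):
        for x in range(0, img.shape[1] - tile + 1, tile):
            h, _ = np.histogram(img[y:y + tile, x:x + tile], bins=edges)
            p = h / max(h.sum(), 1)
            m = 0.5 * (p + p_g)
            divs.append(0.5 * kl(p, m) + 0.5 * kl(p_g, m))
    return float(np.mean(divs))

frame = np.random.rand(256, 256)  # stand-in for one image of an IRT sequence
print(homogeneity_index(frame))   # rank the sequence by this score
```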
[322] LOLGORITHM: Funny Comment Generation Agent For Short Videos
Xuan Ouyang, Senan Wang, Bouzhou Wang, Siyuan Xiahou, Jinrong Zhou, Yuekang Li
Main category: cs.CV
TL;DR: LOLGORITHM is a modular multi-agent framework for generating authentic, stylized comments on short-form videos that conform to platform-specific cultural norms, outperforming baselines with 80-84% human preference rates.
Details
Motivation: Existing approaches like video summarization and danmaku generation fail to produce authentic comments that match platform-specific cultural and linguistic norms for short-form video platforms, which are central to multimedia information dissemination.
Method: A modular multi-agent framework with three core modules: video content summarization, video classification, and comment generation with semantic retrieval and hot meme augmentation. Supports six controllable comment styles and uses a bilingual dataset of 3,267 videos and 16,335 comments across YouTube and Douyin.
Result: LOLGORITHM consistently outperforms baseline methods, achieving human preference selection rates of 80.46% on YouTube and 84.29% on Douyin across 107 respondents. Ablation studies show gains are due to framework architecture rather than backbone LLM choice.
Conclusion: The framework demonstrates robustness and generalizability for generating authentic, stylized comments that conform to platform-specific cultural norms, addressing a gap in existing approaches for short-form video engagement.
Abstract: Short-form video platforms have become central to multimedia information dissemination, where comments play a critical role in driving engagement, propagation, and algorithmic feedback. However, existing approaches – including video summarization and live-streaming danmaku generation – fail to produce authentic comments that conform to platform-specific cultural and linguistic norms. In this paper, we present LOLGORITHM, a novel modular multi-agent framework for stylized short-form video comment generation. LOLGORITHM supports six controllable comment styles and comprises three core modules: video content summarization, video classification, and comment generation with semantic retrieval and hot meme augmentation. We further construct a bilingual dataset of 3,267 videos and 16,335 comments spanning five high-engagement categories across YouTube and Douyin. Evaluation combining automatic scoring and large-scale human preference analysis demonstrates that LOLGORITHM consistently outperforms baseline methods, achieving human preference selection rates of 80.46% on YouTube and 84.29% on Douyin across 107 respondents. Ablation studies confirm that these gains are attributable to the framework architecture rather than the choice of backbone LLM, underscoring the robustness and generalizability of our approach.
[323] Multi-Frequency Local Plasticity for Visual Representation Learning
Mehdi Fatan Serj, C. Alejandro Parraga, Xavier Otazu
Main category: cs.CV
TL;DR: A hybrid visual recognition system combining fixed Gabor filters, local Hebbian learning, associative memory, and top-down feedback achieves 80.1% accuracy on CIFAR-10 without end-to-end backpropagation, recovering substantial performance through architectural priors.
Details
Motivation: To investigate how far structured architectural bias can compensate for the absence of end-to-end gradient-based representation learning in visual recognition, exploring alternatives to standard backpropagation approaches.
Method: Modular hierarchical framework with: (1) fixed multi-frequency Gabor decomposition into 7 parallel streams, (2) within-stream competitive learning using Hebbian/Oja updates and anti-Hebbian decorrelation, (3) associative memory module inspired by modern Hopfield retrieval, and (4) iterative top-down modulation using local prediction/reconstruction signals. Only final linear readout and top-down projection matrices are optimized by gradient descent.
Result: On CIFAR-10: 80.1% accuracy (linear probe) vs 71.0% for Hebbian-only baseline and 83.4% for gradient-trained model on same fixed Gabor basis. On CIFAR-100: 54.8% accuracy. Factorial analysis shows multi-frequency streams, associative memory, and top-down feedback contribute additively with significant Streams x TopDown interaction.
Conclusion: Carefully chosen architectural priors can recover substantial performance typically associated with global gradient training, though a measurable residual gap remains. The hybrid approach demonstrates viability of predominantly locally-trained systems with minimal gradient optimization.
Abstract: We study how far structured architectural bias can compensate for the absence of end-to-end gradient-based representation learning in visual recognition. Building on the VisNet tradition, we introduce a modular hierarchical framework combining: (i) fixed multi-frequency Gabor decomposition into F=7 parallel streams; (ii) within-stream competitive learning with Hebbian and Oja updates and anti-Hebbian decorrelation; (iii) an associative memory module inspired by modern Hopfield retrieval; and (iv) iterative top-down modulation using local prediction and reconstruction signals. Representational layers are trained without end-to-end backpropagation through the full hierarchy; only the final linear readout and top-down projection matrices are optimized by gradient descent. We therefore interpret the model as a hybrid system that is predominantly locally trained but includes a small number of gradient-trained parameters. On CIFAR-10, the full model reaches 80.1% +/- 0.3% top-1 accuracy (linear probe), compared with 71.0% for a Hebbian-only baseline and 83.4% for a gradient-trained model on the same fixed Gabor basis. On CIFAR-100, performance is 54.8%. Factorial analysis indicates that multi-frequency streams, associative memory, and top-down feedback contribute largely additively, with a significant Streams x TopDown interaction (p=0.02). These results suggest that carefully chosen architectural priors can recover a substantial fraction of the performance typically associated with global gradient training, while leaving a measurable residual gap. Experiments are limited to CIFAR-10/100.
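The local learning rule at the heart of the framework can be illustrated with a plain Oja step; the competitive and anti-Hebbian decorrelation terms from the paper are omitted here, so this is a textbook sketch rather than the full update.

```python
import numpy as np

def oja_update(W, x, lr=1e-3):
    """One Hebbian step with Oja's normalization: weights grow with
    input-output correlation but self-limit, avoiding divergence.
    W: (K, D) weights of K units; x: (D,) input patch."""
    y = W @ x                                             # unit activations
    return W + lr * (np.outer(y, x) - (y ** 2)[:, None] * W)

W = np.random.randn(16, 64) * 0.1
for _ in range(1000):
    W = oja_update(W, np.random.randn(64))
print(np.linalg.norm(W, axis=1)[:4])  # rows tend toward unit norm
```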
[324] See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment
Mohammad Anas Azeez, Ankan Deria, Zohaib Hasan Siddiqui, Adinath Madhavrao Dukre, Rafiq Ali, Sara Atito, Yutong Xie, Imran Razzak
Main category: cs.CV
TL;DR: DOP-OBC is a training-free decoding strategy that reduces object hallucination in MLLMs by promoting equitable attention allocation through dominant object penalty and outlier boost coefficient mechanisms.
Details
Motivation: MLLMs often hallucinate objects due to disproportionate attention allocation where visually dominant or frequent objects overshadow rare, small, or peripheral objects, leading to incomplete visual grounding.
Method: Proposes DOP-OBC: Dominant Object Penalty (DOP) suppresses attention on visually dominant regions, while Outlier Boost Coefficient (OBC) amplifies attention for rare but confidently detected objects. Implemented as per-row logit modulations in causal attention mask without weight updates.
Result: Consistent reduction in object hallucination on CHAIR and POPE benchmarks, improved GPT-4o assessed captioning quality across correctness, consistency, detail, context, and temporal dimensions for both image and video MLLMs.
Conclusion: Fairness in attention allocation is a practical and effective approach for more faithful multimodal generation, with DOP-OBC demonstrating that equitable attention can reduce hallucinations without requiring model retraining.
Abstract: Multimodal large language models (MLLMs) frequently hallucinate objects that are absent from the visual input, often because attention during decoding is disproportionately drawn to visually dominant or frequently occurring content. We observe that this inequity in attention allocation is a root cause of object hallucination: when rare, small, or contextually peripheral objects receive insufficient attention, the model fails to ground its generation in the full visual scene. We argue that every object in an image, regardless of its size, frequency or visual salience, deserves equal representational opportunity during decoding. To this end, we propose DOP-OBC, a training-free and architecture-agnostic decoding strategy built on the principle of equitable attention. Two complementary object-aware signals work in tandem: a Dominant Object Penalty (DOP) that softly suppresses attention over-concentration on visually dominant regions, and an Outlier Boost Coefficient (OBC) that amplifies attention toward rare yet confidently detected objects. These signals are injected as per-row logit modulations within the causal attention mask, requiring no weight updates and preserving autoregressive decoding properties. Extensive experiments across image and video MLLMs demonstrate consistent reductions in object hallucination on CHAIR and POPE benchmarks, alongside improvements in GPT-4o assessed captioning quality across correctness, consistency, detail, context and temporal dimensions. DOP-OBC establishes that fairness in attention allocation is not merely a design principle but a practical and effective path toward more faithful multimodal generation.
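Since DOP-OBC works through additive logit offsets inside the causal attention mask, a schematic version is easy to write down. How the dominant and outlier object sets are detected is outside this sketch, and the penalty/boost magnitudes are assumptions.

```python
import torch

def dop_obc_mask(base_mask, token_object_id, dominant_ids, outlier_ids,
                 penalty=1.0, boost=1.0):
    """Modulated causal mask: subtract `penalty` from logits of visual
    tokens belonging to dominant objects, add `boost` for rare but
    confidently detected ones. Additive offsets keep the softmax valid.

    base_mask: (T, T) additive causal mask (0 or -inf).
    token_object_id: (T,) object id per token, -1 for text tokens.
    """
    offset = torch.zeros_like(base_mask)
    for oid in dominant_ids:
        offset[:, token_object_id == oid] -= penalty   # per-row suppression
    for oid in outlier_ids:
        offset[:, token_object_id == oid] += boost     # per-row amplification
    return base_mask + offset

T = 8
causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
obj_ids = torch.tensor([0, 0, 0, 1, 2, -1, -1, -1])    # 3 objects + text tokens
print(dop_obc_mask(causal, obj_ids, dominant_ids=[0], outlier_ids=[2]))
```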
[325] MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
Suyang Xi, Songtao Hu, Yuxiang Lai, Wangyun Dan, Yaqi Liu, Shansong Wang, Xiaofeng Yang
Main category: cs.CV
TL;DR: MedLVR introduces latent visual reasoning for medical VQA by interleaving visual evidence states in autoregressive decoding, improving accuracy through iterative visual reasoning rather than static image embeddings.
Details
Motivation: Current medical VLMs use text-centric reasoning where images are encoded once as static context, which fails to preserve subtle, localized visual evidence crucial for accurate clinical diagnosis.
Method: Introduces latent visual reasoning framework with explicit visual evidence state in autoregressive decoding, using short latent reasoning segments that reuse hidden states as continuous latent steps. Two-stage training: ROI-supervised fine-tuning aligns latent states with clinical image evidence, and Visual-Latent Policy Optimization (VLPO) optimizes reasoning under outcome-level rewards.
Result: Outperforms recent reasoning baselines on OmniMedVQA and five external medical VQA benchmarks, improving average score from 48.3% to 53.4% over Qwen2.5-VL-7B backbone.
Conclusion: Latent visual reasoning provides effective mechanism for preserving diagnostically relevant visual evidence and improving reliability of medical VQA, addressing limitations of text-centric approaches.
Abstract: Medical vision–language models (VLMs) have shown strong potential for medical visual question answering (VQA), yet their reasoning remains largely text-centric: images are encoded once as static context, and subsequent inference is dominated by language. This paradigm is fundamentally limited in clinical scenarios, where accurate answers often depend on subtle, localized visual evidence that cannot be reliably preserved in static embeddings. We propose MedLVR, a latent visual reasoning framework that introduces an explicit visual evidence state into autoregressive decoding. Instead of relying solely on text-based intermediate reasoning, MedLVR interleaves a short latent reasoning segment within the decoder by reusing hidden states as continuous latent steps, enabling iterative preservation and refinement of query-relevant visual evidence before answer generation. To support effective visual supervision, we adopt a two-stage training strategy: region of interest (ROI)-supervised fine-tuning aligns latent states with clinically relevant image evidence, and Visual-Latent Policy Optimization (VLPO) further optimizes latent reasoning and answer generation under outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks show that MedLVR consistently outperforms recent reasoning baselines and improves the average score over the Qwen2.5-VL-7B backbone from 48.3% to 53.4%. These results show that latent visual reasoning provides an effective mechanism for preserving diagnostically relevant visual evidence and improving the reliability of medical VQA.
[326] Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents
Sangwon Baik, Gunhee Kim, Mingi Choi, Hanbyul Joo
Main category: cs.CV
TL;DR: VLMs can achieve strong 3D understanding and 6D pose prediction through inference-time techniques and iterative closed-loop reasoning without additional training.
Details
Motivation: Vision-Language Models (VLMs) have strong visual reasoning but struggle with 3D understanding, particularly in inferring text-consistent goal 6D poses of objects in 3D scenes.
Method: Closed-loop interaction where VLM acts as an agent: observe scene, evaluate faithfulness to instruction, propose pose update, apply update, render updated scene. Key techniques: multi-view reasoning with view selection, object-centered coordinate visualization, single-axis rotation prediction.
Result: Approach surpasses prior methods at predicting text-guided goal 6D poses, works consistently across closed/open-source VLMs, enables more successful robot manipulation when combined with motion planning.
Conclusion: Inference-time techniques and iterative reasoning enable VLMs to achieve dramatic 3D understanding improvements without fine-tuning, bridging gap between 2D vision-language capabilities and 3D reasoning.
Abstract: Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a target object in a 3D scene. However, we find that with some inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains. Concretely, given a 3D scene represented by an RGB-D image (or a compositional scene of 3D meshes) and a text instruction specifying a desired state change, we repeat the following loop: observe the current scene; evaluate whether it is faithful to the instruction; propose a pose update for the target object; apply the update; and render the updated scene. Through this closed-loop interaction, the VLM effectively acts as an agent. We further introduce three inference-time techniques that are essential to this closed-loop process: (i) multi-view reasoning with supporting view selection, (ii) object-centered coordinate system visualization, and (iii) single-axis rotation prediction. Without any additional fine-tuning or new modules, our approach surpasses prior methods at predicting the text-guided goal 6D pose of the target object. It works consistently across both closed-source and open-source VLMs. Moreover, when combining our 6D pose prediction with simple robot motion planning, it enables more successful robot manipulation than existing methods. Finally, we conduct an ablation study to demonstrate the necessity of each proposed technique.
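The closed loop itself is simple to express; the skeleton below injects the renderer and the two VLM calls as callables. All names here are hypothetical stand-ins, and the paper's prompting, view selection, and update parameterization are not shown.

```python
import numpy as np

def closed_loop_pose_edit(render, is_faithful, propose_update, pose, max_iters=10):
    """Schematic observe-evaluate-propose-apply-render loop."""
    for _ in range(max_iters):
        views = render(pose)            # observe: render the current scene
        if is_faithful(views):          # evaluate against the instruction
            break
        delta = propose_update(views)   # propose: e.g. a single-axis rotation
        pose = delta @ pose             # apply the 4x4 pose update
    return pose

# Trivial demo with stub callables standing in for the renderer and VLM.
final = closed_loop_pose_edit(
    render=lambda T: [T],
    is_faithful=lambda views: True,
    propose_update=lambda views: np.eye(4),
    pose=np.eye(4),
)
print(final)
```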
[327] Biomarker-Based Pretraining for Chagas Disease Screening in Electrocardiograms
Elias Stenhede, Arian Ranjbar
Main category: cs.CV
TL;DR: ECG-based Chagas disease detection using biomarker-pretrained models achieves 5th place in PhysioNet Challenge 2025
Details
Motivation: Chagas disease screening from ECGs is challenging due to scarce and noisy labels in existing datasets, requiring more robust approaches.
Method: Biomarker-based pretraining where ECG feature extractor is trained to predict percentile-binned blood biomarkers from MIMIC-IV-ECG dataset, then fine-tuned on Brazilian datasets for Chagas detection using 5-model ensemble.
Result: Achieved challenge score of 0.269 on hidden test set, ranking 5th in Detection of Chagas Disease from the ECG: The George B. Moody PhysioNet Challenge 2025
Conclusion: Biomarker-based pretraining is effective for ECG analysis tasks with limited labeled data, enabling competitive Chagas disease detection performance
Abstract: Chagas disease screening via ECGs is limited by scarce and noisy labels in existing datasets. We propose a biomarker-based pretraining approach, where an ECG feature extractor is first trained to predict percentile-binned blood biomarkers from the MIMIC-IV-ECG dataset. The pretrained model is then fine-tuned on Brazilian datasets for Chagas detection. Our 5-model ensemble, developed by the Ahus AIM team, achieved a challenge score of 0.269 on the hidden test set, ranking 5th in Detection of Chagas Disease from the ECG: The George B. Moody PhysioNet Challenge 2025. Source code and the model are shared on GitHub: github.com/Ahus-AIM/physionet-challenge-2025
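Percentile binning turns each continuous biomarker into a roughly balanced classification target; a minimal sketch follows (ten bins is an assumption, and NaNs would need masking in practice).

```python
import numpy as np

def percentile_bin_labels(values, n_bins=10):
    """Bin a continuous blood biomarker at empirical percentiles so the
    pretraining task becomes a balanced n_bins-way prediction."""
    edges = np.percentile(values, np.linspace(0, 100, n_bins + 1)[1:-1])
    return np.digitize(values, edges)   # labels in {0, ..., n_bins - 1}

creatinine = np.random.lognormal(mean=0.0, sigma=0.4, size=10_000)
labels = percentile_bin_labels(creatinine)
print(np.bincount(labels))              # roughly uniform class counts
```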
[328] RobustMedSAM: Degradation-Resilient Medical Image Segmentation via Robust Foundation Model Adaptation
Jieru Li, Matthew Chen, Micky C. Nnamdi, J. Ben Tamo, Benoit L. Marteau, May D. Wang
Main category: cs.CV
TL;DR: RobustMedSAM combines medical-domain adaptation (MedSAM) and corruption robustness (RobustSAM) via module-wise checkpoint fusion for robust medical image segmentation under realistic corruptions.
Details
Motivation: Medical image segmentation models based on SAM perform well on clean benchmarks but degrade under realistic image corruptions like noise, blur, and artifacts. Existing approaches address either medical-domain adaptation or corruption robustness separately, but not both jointly.
Method: Proposes RobustMedSAM with module-wise checkpoint fusion: initializes image encoder from MedSAM (medical priors) and mask decoder from RobustSAM (corruption robustness) under shared ViT-B architecture. Fine-tunes only mask decoder on 35 medical datasets spanning 6 modalities and 12 corruption types while freezing other components. Also investigates SVD-based parameter-efficient variant for limited encoder adaptation.
Result: RobustMedSAM improves degraded-image Dice from 0.613 to 0.719 (+0.106) over SAM on both in-distribution and out-of-distribution benchmarks, demonstrating effective fusion of complementary pretrained models.
Conclusion: Structured fusion of complementary pretrained models (medical-domain adaptation + corruption robustness) is an effective and practical approach for robust medical image segmentation under realistic corruptions.
Abstract: Medical image segmentation models built on Segment Anything Model (SAM) achieve strong performance on clean benchmarks, yet their reliability often degrades under realistic image corruptions such as noise, blur, motion artifacts, and modality-specific distortions. Existing approaches address either medical-domain adaptation or corruption robustness, but not both jointly. In SAM, we find that these capabilities are concentrated in complementary modules: the image encoder preserves medical priors, while the mask decoder governs corruption robustness. Motivated by this observation, we propose RobustMedSAM, which adopts module-wise checkpoint fusion by initializing the image encoder from MedSAM and the mask decoder from RobustSAM under a shared ViT-B architecture. We then fine-tune only the mask decoder on 35 medical datasets from MedSegBench, spanning six imaging modalities and 12 corruption types, while freezing the remaining components to preserve pretrained medical representations. We additionally investigate an SVD-based parameter-efficient variant for limited encoder adaptation. Experiments on both in-distribution and out-of-distribution benchmarks show that RobustMedSAM improves degraded-image Dice from 0.613 to 0.719 (+0.106) over SAM, demonstrating that structured fusion of complementary pretrained models is an effective and practical approach for robust medical image segmentation.
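Module-wise checkpoint fusion amounts to stitching together a state_dict from two checkpoints that share an architecture. The key prefixes below are assumptions about how the checkpoints are named, not the repositories' actual layout.

```python
import torch

def fuse_checkpoints(medsam_path, robustsam_path):
    """Module-wise fusion under a shared ViT-B layout: image encoder and
    prompt encoder from a MedSAM checkpoint, mask decoder from a
    RobustSAM checkpoint (key prefixes are illustrative assumptions)."""
    med = torch.load(medsam_path, map_location="cpu")
    rob = torch.load(robustsam_path, map_location="cpu")
    fused = {}
    fused.update({k: v for k, v in med.items() if k.startswith("image_encoder.")})
    fused.update({k: v for k, v in med.items() if k.startswith("prompt_encoder.")})
    fused.update({k: v for k, v in rob.items() if k.startswith("mask_decoder.")})
    return fused

# Then fine-tune only the decoder, freezing everything else:
# for name, p in model.named_parameters():
#     p.requires_grad = name.startswith("mask_decoder.")
```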
[329] ACCIDENT: A Benchmark Dataset for Vehicle Accident Detection from Traffic Surveillance Videos
Lukas Picek, Michal Čermák, Marek Hanzl, Vojtěch Čermák
Main category: cs.CV
TL;DR: ACCIDENT is a benchmark dataset for traffic accident detection in CCTV footage with 4,238 clips (real and synthetic) annotated for temporal/spatial localization and collision type classification, evaluated in supervised and zero-shot settings.
Details
Motivation: There's a need for comprehensive benchmarks to evaluate traffic accident detection models in real-world CCTV scenarios, especially for handling uncertainty, ambiguity, and data-scarce situations.
Method: Created a curated dataset of 2,027 real and 2,211 synthetic CCTV clips with annotations for accident time, spatial location, and collision type. Defined three core tasks with custom metrics accounting for uncertainty in CCTV footage.
Result: The benchmark is challenging as shown by diverse baselines including heuristic, motion-aware, and vision-language approaches. The dataset and evaluation framework are publicly available.
Conclusion: ACCIDENT provides a comprehensive benchmark for traffic accident detection that addresses real-world challenges in CCTV analysis, supporting both data-rich and data-scarce scenarios.
Abstract: We introduce ACCIDENT, a benchmark dataset for traffic accident detection in CCTV footage, designed to evaluate models in supervised (IID and OOD) and zero-shot settings, reflecting both data-rich and data-scarce scenarios. The benchmark consists of a curated set of 2,027 real and 2,211 synthetic clips annotated with the accident time, spatial location, and high-level collision type. We define three core tasks: (i) temporal localization of the accident, (ii) its spatial localization, and (iii) collision type classification. Each task is evaluated using custom metrics that account for the uncertainty and ambiguity inherent in CCTV footage. In addition to the benchmark, we provide a diverse set of baselines, including heuristic, motion-aware, and vision-language approaches, and show that ACCIDENT is challenging. The ACCIDENT benchmark is available at: https://accidentbench.github.io
[330] F3G-Avatar: Face Focused Full-body Gaussian Avatar
Willem Menu, Erkut Akdag, Pedro Quesado, Yasaman Kashefbahrami, Egor Bondarev
Main category: cs.CV
TL;DR: F3G-Avatar: A full-body, face-aware Gaussian avatar method that preserves fine-grained facial geometry and expression details using a two-branch architecture with face-focused deformation.
Details
Motivation: Existing full-body Gaussian avatar methods optimize for global reconstruction but fail to preserve fine-grained facial geometry and expression details due to limited facial representational capacity, especially for high-frequency pose-dependent deformations.
Method: Uses a clothed Momentum Human Rig (MHR) template, renders front/back positional maps, decodes into 3D Gaussians via two-branch architecture: body branch for pose-dependent deformations and face-focused deformation branch for head geometry/appearance refinement. Gaussians are fused, posed with linear blend skinning, and rendered with differentiable Gaussian splatting.
Result: Achieves strong rendering quality with face-view performance of PSNR/SSIM/LPIPS: 26.243/0.964/0.084 on AvatarReX dataset. Ablations confirm contributions of MHR template and face-focused deformation.
Conclusion: F3G-Avatar provides a practical, high-quality pipeline for realistic, animatable full-body avatar synthesis with improved facial detail preservation.
Abstract: Existing full-body Gaussian avatar methods primarily optimize global reconstruction quality and often fail to preserve fine-grained facial geometry and expression details. This challenge arises from limited facial representational capacity that causes difficulties in modeling high-frequency pose-dependent deformations. To address this, we propose F3G-Avatar, a full-body, face-aware avatar synthesis method that reconstructs animatable human representations from multi-view RGB video and regressed pose/shape parameters. Starting from a clothed Momentum Human Rig (MHR) template, front/back positional maps are rendered and decoded into 3D Gaussians through a two-branch architecture: a body branch that captures pose-dependent non-rigid deformations and a face-focused deformation branch that refines head geometry and appearance. The predicted Gaussians are fused, posed with linear blend skinning (LBS), and rendered with differentiable Gaussian splatting. Training combines reconstruction and perceptual objectives with a face-specific adversarial loss to enhance realism in close-up views. Experiments demonstrate strong rendering quality, with face-view performance reaching PSNR/SSIM/LPIPS of 26.243/0.964/0.084 on the AvatarReX dataset. Ablations further highlight contributions of the MHR template and the face-focused deformation. F3G-Avatar provides a practical, high-quality pipeline for realistic, animatable full-body avatar synthesis.
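The linear blend skinning (LBS) step that poses the fused Gaussians is a standard formula, x' = Σ_j w_j (R_j x + t_j); a minimal sketch follows, with tensor shapes and names assumed for illustration.

```python
import torch

def lbs_pose_gaussians(mu, skin_weights, joint_rot, joint_trans):
    """Pose Gaussian centers with standard linear blend skinning (a sketch).

    mu:           (N, 3)    Gaussian centers in the canonical (template) pose.
    skin_weights: (N, J)    per-Gaussian skinning weights, rows summing to 1.
    joint_rot:    (J, 3, 3) rotation of each joint's rigid transform.
    joint_trans:  (J, 3)    translation of each joint's rigid transform.
    Returns posed centers (N, 3).
    """
    # (J, N, 3): each joint's rigid transform applied to every center.
    per_joint = torch.einsum("jab,nb->jna", joint_rot, mu) + joint_trans[:, None, :]
    # Blend the per-joint results with the skinning weights: (N, 3).
    return torch.einsum("nj,jna->na", skin_weights, per_joint)
```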
[331] Vector Field Synthesis with Sparse Streamlines Using Diffusion Model
Nguyen K. Phan, Ricardo Morales, Sebastian D. Espriella, Guoning Chen
Main category: cs.CV
TL;DR: A diffusion-based framework for synthesizing 2D vector fields from sparse streamline inputs while maintaining physical plausibility through classifier-free guidance.
Details
Motivation: Traditional optimization-based approaches for vector field synthesis from sparse inputs often struggle to maintain physical plausibility and geometric constraints simultaneously. There's a need for methods that can generate physically consistent vector fields from limited observations while preserving both geometric and physical constraints.
Method: Uses a conditional denoising diffusion probabilistic model with classifier-free guidance. The framework progressively reconstructs vector fields from sparse, coherent inputs (streamlines) while maintaining physical plausibility through the diffusion process.
Result: The method successfully synthesizes plausible vector fields that adhere to physical laws while maintaining fidelity to sparse input observations. It outperforms traditional optimization-based approaches in terms of flexibility and physical consistency.
Conclusion: Diffusion models provide an effective framework for physically plausible vector field synthesis from sparse inputs, offering advantages over traditional optimization methods in maintaining both geometric and physical constraints.
Abstract: We present a novel diffusion-based framework for synthesizing 2D vector fields from sparse, coherent inputs (i.e., streamlines) while maintaining physical plausibility. Our method employs a conditional denoising diffusion probabilistic model with classifier-free guidance, enabling progressive reconstruction that preserves both geometric and physical constraints. Experimental results demonstrate our method’s ability to synthesize plausible vector fields that adhere to physical laws while maintaining fidelity to sparse input observations, outperforming traditional optimization-based approaches in terms of flexibility and physical consistency.
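Classifier-free guidance itself is a standard mechanism; a minimal sketch of one guided denoising step follows, with the model interface and guidance scale assumed for illustration.

```python
def cfg_denoise(model, x_t, t, streamline_cond, guidance_scale=3.0):
    """One classifier-free-guidance step for a conditional DDPM (a sketch).

    The model is assumed to predict noise given the noisy vector field x_t,
    the timestep t, and a streamline conditioning tensor (None = unconditional).
    The guidance scale is an illustrative default.
    """
    eps_cond = model(x_t, t, cond=streamline_cond)
    eps_uncond = model(x_t, t, cond=None)
    # Standard CFG combination: push the prediction toward the conditional one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```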
[332] Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models
Oliver McLaughlin, Daniel Shubin, Carsten Eickhoff, Ritambhara Singh, William Rudman, Michal Golovanevsky
Main category: cs.CV
TL;DR: Medical VLMs perform poorly on complex medical imaging tasks despite domain-specific fine-tuning, showing limited clinical reasoning and high sensitivity to prompts.
Details
Motivation: To evaluate whether domain-specific fine-tuning of vision-language models (VLMs) actually improves clinical reasoning in medical imaging tasks, or if performance gains are superficial.
Method: Evaluated four paired open-source VLMs (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) across four medical imaging tasks of increasing difficulty. Introduced a description-based pipeline where models generate image descriptions for text-only diagnosis by GPT-5.1. Analyzed vision encoder embeddings to understand failure sources.
Result: Performance degrades to near-random levels as task difficulty increases. Medical fine-tuning provides no consistent advantage. Models are highly sensitive to prompt formulation. Description-based pipeline recovers limited additional signal but remains bounded by task difficulty. Failures stem from both weak visual representations and downstream reasoning.
Conclusion: Medical VLM performance is fragile, prompt-dependent, and not reliably improved by domain-specific fine-tuning, indicating limited clinical reasoning capabilities.
Abstract: Vision-language models (VLMs) are increasingly adapted through domain-specific fine-tuning, yet it remains unclear whether this improves reasoning beyond superficial visual cues, particularly in high-stakes domains like medicine. We evaluate four paired open-source VLMs (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) across four medical imaging tasks of increasing difficulty: brain tumor, pneumonia, skin cancer, and histopathology classification. We find that performance degrades toward near-random levels as task difficulty increases, indicating limited clinical reasoning. Medical fine-tuning provides no consistent advantage, and models are highly sensitive to prompt formulation, with minor changes causing large swings in accuracy and refusal rates. To test whether closed-form VQA suppresses latent knowledge, we introduce a description-based pipeline where models generate image descriptions that a text-only model (GPT-5.1) uses for diagnosis. This recovers a limited additional signal but remains bounded by task difficulty. Analysis of vision encoder embeddings further shows that failures stem from both weak visual representations and downstream reasoning. Overall, medical VLM performance is fragile, prompt-dependent, and not reliably improved by domain-specific fine-tuning.
[333] Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning
Yang Deng, David Mould, Paul L. Rosin, Yu-Kun Lai
Main category: cs.CV
TL;DR: Training-free framework for text-to-image diffusion models that addresses foreground bias by restructuring diffusion sampling with dynamic spatial guidance and multi-path pruning for better foreground-background compositionality.
Details
Motivation: Existing text-to-image diffusion models have persistent foreground bias that treats background as passive and under-optimized, compromising global scene coherence and constraining compositional control.
Method: Two key components: 1) Dynamic Spatial Guidance with time-dependent gating mechanism to modulate foreground/background attention during diffusion, and 2) Multi-Path Pruning that explores multiple latent trajectories and filters candidates using attention statistics and semantic alignment signals.
Result: Extensive evaluations across multiple diffusion backbones demonstrate consistent improvements in background coherence and object-background compositional alignment. A new benchmark is developed for evaluating object-background compositionality.
Conclusion: The proposed training-free framework effectively addresses foreground bias in text-to-image diffusion models, improving scene coherence and compositional control through explicit modeling of foreground-background interactions.
Abstract: Existing text-to-image diffusion models, while excelling at subject synthesis, exhibit a persistent foreground bias that treats the background as a passive and under-optimized byproduct. This imbalance compromises global scene coherence and constrains compositional control. To address this limitation, we propose a training-free framework that restructures diffusion sampling to explicitly account for foreground-background interactions. Our approach consists of two key components. First, Dynamic Spatial Guidance introduces a soft, time-step-dependent gating mechanism that modulates foreground and background attention during the diffusion process, enabling spatially balanced generation. Second, Multi-Path Pruning performs multi-path latent exploration and dynamically filters candidate trajectories using both internal attention statistics and external semantic alignment signals, retaining trajectories that better satisfy object-background constraints. We further develop a benchmark specifically designed to evaluate object-background compositionality. Extensive evaluations across multiple diffusion backbones demonstrate consistent improvements in background coherence and object-background compositional alignment.
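A minimal sketch of what a soft, time-dependent foreground/background gate could look like. The linear schedule and its direction are illustrative assumptions, not the paper's exact mechanism.

```python
def gated_attention(attn_map, fg_mask, t, t_max):
    """Time-dependent re-weighting of foreground vs. background attention
    (an illustrative sketch of the Dynamic Spatial Guidance idea).

    attn_map: (..., H*W) cross-attention weights over image locations.
    fg_mask:  (H*W,) in [0, 1], 1 where the prompted subject should appear.
    t:        current diffusion timestep, counting down from t_max.
    """
    # Assumed schedule: early steps (large t) weight the background/layout,
    # late steps weight the foreground subject. Purely for illustration.
    alpha = 1.0 - t / t_max
    gate = alpha * fg_mask + (1.0 - alpha) * (1.0 - fg_mask)
    gated = attn_map * gate
    return gated / gated.sum(dim=-1, keepdim=True)  # renormalize to a distribution
```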
[334] Do vision models perceive illusory motion in static images like humans?
Isabella Elaine Rosario, Fan L. Cheng, Zitang Sun, Nikolaus Kriegeskorte
Main category: cs.CV
TL;DR: Most optical flow DNNs fail to perceive illusory motion in static images like Rotating Snakes illusion, with only human-inspired Dual-Channel model showing rotational motion during saccade simulation.
Details
Motivation: To understand differences between human and machine motion processing, using visual motion illusions as probes to reveal computational strategies and develop more human-centered AI systems.
Method: Evaluated several optical flow models on Rotating Snakes illusion, simulated saccadic eye movements, conducted ablation analyses on Dual-Channel model to examine contributions of luminance-based and color-feature-based motion signals and recurrent attention mechanisms.
Result: Most optical flow models failed to generate flow fields consistent with human perception of illusory motion; only Dual-Channel model exhibited expected rotational motion, particularly during saccade simulation; both luminance and color-feature signals plus recurrent attention were critical.
Conclusion: Substantial gap exists between current optical-flow models and human visual motion processing; insights can guide development of motion-estimation systems with better correspondence to human perception for human-centric AI.
Abstract: Understanding human motion processing is essential for building reliable, human-centered computer vision systems. Although deep neural networks (DNNs) achieve strong performance in optical flow estimation, they remain less robust than humans and rely on fundamentally different computational strategies. Visual motion illusions provide a powerful probe into these mechanisms, revealing how human and machine vision align or diverge. While recent DNN-based motion models can reproduce dynamic illusions such as reverse-phi, it remains unclear whether they can perceive illusory motion in static images, exemplified by the Rotating Snakes illusion. We evaluate several representative optical flow models on Rotating Snakes and show that most fail to generate flow fields consistent with human perception. Under simulated conditions mimicking saccadic eye movements, only the human-inspired Dual-Channel model exhibits the expected rotational motion, with the closest correspondence emerging during the saccade simulation. Ablation analyses further reveal that both luminance-based and higher-order color–feature–based motion signals contribute to this behavior and that a recurrent attention mechanism is critical for integrating local cues. Our results highlight a substantial gap between current optical-flow models and human visual motion processing, and offer insights for developing future motion-estimation systems with improved correspondence to human perception and human-centric AI.
[335] FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views
Chaoyi Zhou, Run Wang, Feng Luo, Mert D. Pesé, Zhiwen Fan, Yiqi Zhong, Siyu Huang
Main category: cs.CV
TL;DR: FF3R is an annotation-free feed-forward framework that unifies geometric and semantic 3D reasoning from multi-view images without requiring camera poses, depth maps, or semantic labels.
Details
Motivation: Existing vision foundation models treat geometry reconstruction and semantic understanding in isolation, leading to redundant pipelines and compounded errors. There's a need for unified 3D reasoning that doesn't rely on annotations.
Method: Uses rendering supervision for RGB and feature maps. Introduces Token-wise Fusion Module (enriches geometry tokens with semantic context via cross-attention) and Semantic-Geometry Mutual Boosting mechanism (combines geometry-guided feature warping for global consistency with semantic-aware voxelization for local coherence).
Result: Superior performance on ScanNet and DL3DV-10K in novel-view synthesis, open-vocabulary semantic segmentation, and depth estimation. Strong generalization to in-the-wild scenarios.
Conclusion: FF3R establishes a scalable paradigm for unified 3D reasoning without annotations, paving the way for embodied intelligence systems requiring both spatial and semantic understanding.
Abstract: Recent advances in vision foundation models have revolutionized geometry reconstruction and semantic understanding. Yet, most of the existing approaches treat these capabilities in isolation, leading to redundant pipelines and compounded errors. This paper introduces FF3R, a fully annotation-free feed-forward framework that unifies geometric and semantic reasoning from unconstrained multi-view image sequences. Unlike previous methods, FF3R does not require camera poses, depth maps, or semantic labels, relying solely on rendering supervision for RGB and feature maps, establishing a scalable paradigm for unified 3D reasoning. In addition, we address two critical challenges in feedforward feature reconstruction pipelines, namely global semantic inconsistency and local structural inconsistency, through two key innovations: (i) a Token-wise Fusion Module that enriches geometry tokens with semantic context via cross-attention, and (ii) a Semantic-Geometry Mutual Boosting mechanism combining geometry-guided feature warping for global consistency with semantic-aware voxelization for local coherence. Extensive experiments on ScanNet and DL3DV-10K demonstrate FF3R’s superior performance in novel-view synthesis, open-vocabulary semantic segmentation, and depth estimation, with strong generalization to in-the-wild scenarios, paving the way for embodied intelligence systems that demand both spatial and semantic understanding.
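A minimal sketch of cross-attention-based token fusion in the spirit of the Token-wise Fusion Module; the embedding width, head count, and residual/normalization layout are assumptions.

```python
import torch.nn as nn

class TokenwiseFusion(nn.Module):
    """Enrich geometry tokens with semantic context via cross-attention
    (a sketch; sizes and layout assumed)."""

    def __init__(self, dim=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, geo_tokens, sem_tokens):
        # Geometry tokens query the semantic tokens; the residual connection
        # keeps the original geometric content intact.
        ctx, _ = self.attn(query=geo_tokens, key=sem_tokens, value=sem_tokens)
        return self.norm(geo_tokens + ctx)

# fused = TokenwiseFusion()(geo_tokens, sem_tokens)  # both shaped (B, N, 768)
```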
[336] PAS: Estimating the target accuracy before domain adaptation
Raphaella Diniz, Jackson de Faria, Martin Ester
Main category: cs.CV
TL;DR: PAS is a novel score for estimating transferability of source domains and pre-trained models to target tasks before domain adaptation, enabling efficient selection of best resources.
Details
Motivation: Domain adaptation performance depends heavily on source domain and pre-trained feature extractor selection, but choosing these is difficult without labeled target validation data and with many available models.
Method: PAS leverages pre-trained model generalization power and assesses source-target compatibility based on pre-trained feature embeddings, then integrates into framework for selecting most relevant pre-trained model and source domain.
Result: Extensive experiments on image classification benchmarks show PAS strongly correlates with actual target accuracy and consistently guides selection of best-performing pre-trained model and source domain.
Conclusion: PAS effectively estimates transferability before adaptation, improving target accuracy while reducing computational overhead by selecting optimal source domains and pre-trained models.
Abstract: The goal of domain adaptation is to make predictions for unlabeled samples from a target domain with the help of labeled samples from a different but related source domain. The performance of domain adaptation methods is highly influenced by the choice of source domain and pre-trained feature extractor. However, the selection of source data and pre-trained model is not trivial due to the absence of a labeled validation set for the target domain and the large number of available pre-trained models. In this work, we propose PAS, a novel score designed to estimate the transferability of a source domain set and a pre-trained feature extractor to a target classification task before actually performing domain adaptation. PAS leverages the generalization power of pre-trained models and assesses source-target compatibility based on the pre-trained feature embeddings. We integrate PAS into a framework that indicates the most relevant pre-trained model and source domain among multiple candidates, thus improving target accuracy while reducing the computational overhead. Extensive experiments on image classification benchmarks demonstrate that PAS correlates strongly with actual target accuracy and consistently guides the selection of the best-performing pre-trained model and source domain for adaptation.
[337] DINO_4D: Semantic-Aware 4D Reconstruction
Yiru Yang, Zhuojie Wu, Quentin Marguet, Nishant Kumar Singh, Max Schulthess
Main category: cs.CV
TL;DR: DINO_4D uses frozen DINOv3 features as semantic priors for 4D dynamic scene reconstruction, improving tracking accuracy and completeness while maintaining linear time complexity.
Details
Motivation: To bridge low-level geometric sensing with high-level semantic understanding in 4D dynamic scene reconstruction, addressing semantic drift during dynamic tracking.
Method: Introduces frozen DINOv3 features as structural priors to inject semantic awareness into the reconstruction process, maintaining linear time complexity O(T).
Result: Significantly improves Tracking Accuracy (APD) and Reconstruction Completeness on Point Odyssey and TUM-Dynamics benchmarks while maintaining computational efficiency.
Conclusion: Establishes a new paradigm for constructing 4D World Models with both geometric precision and semantic understanding.
Abstract: At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes serves as a critical bridge connecting low-level geometric sensing with high-level semantic understanding. We present DINO_4D, introducing frozen DINOv3 features as structural priors and injecting semantic awareness into the reconstruction process to effectively suppress semantic drift during dynamic tracking. Experiments on the Point Odyssey and TUM-Dynamics benchmarks demonstrate that our method maintains the linear time complexity $O(T)$ of its predecessors while significantly improving Tracking Accuracy (APD) and Reconstruction Completeness. DINO_4D establishes a new paradigm for constructing 4D World Models that possess both geometric precision and semantic understanding.
[338] Topo-ADV: Generating Topology-Driven Imperceptible Adversarial Point Clouds
Gayathry Chandramana Krishnan Nampoothiry, Raghuram Venkatapuram, Anirban Ghosh, Ayan Dutta
Main category: cs.CV
TL;DR: Topo-ADV: First topology-driven adversarial attack for 3D point cloud models using persistent homology optimization to manipulate topological features while maintaining geometric plausibility.
Details
Motivation: Existing 3D adversarial attacks focus on geometric properties, assuming shape fidelity preserves semantics. This work challenges that assumption by exploring topological structure as a vulnerability surface for point cloud models.
Method: Proposes Topo-ADV, an end-to-end differentiable framework incorporating persistent homology as optimization objective. Uses differentiable topological representations to jointly optimize: 1) topology divergence loss altering persistence, 2) misclassification objective, and 3) geometric imperceptibility constraints.
Result: Achieves up to 100% attack success rates on ModelNet40, ShapeNet Part, and ScanObjectNN datasets using PointNet and DGCNN classifiers. Perturbations remain geometrically indistinguishable from originals, beating state-of-the-art methods on perceptibility metrics.
Conclusion: Topological structure represents a significant vulnerability for 3D point cloud models. Topology-driven attacks can be highly effective while maintaining geometric plausibility, challenging assumptions about shape fidelity and semantic preservation.
Abstract: Deep neural networks for 3D point cloud understanding have achieved remarkable success in object classification and recognition, yet recent work shows that these models remain highly vulnerable to adversarial perturbations. Existing 3D attacks predominantly manipulate geometric properties such as point locations, curvature, or surface structure, implicitly assuming that preserving global shape fidelity preserves semantic content. In this work, we challenge this assumption and introduce the first topology-driven adversarial attack for point cloud deep learning. Our key insight is that the homological structure of a 3D object constitutes a previously unexplored vulnerability surface. We propose Topo-ADV, an end-to-end differentiable framework that incorporates persistent homology as an explicit optimization objective, enabling gradient-based manipulation of topological features during adversarial example generation. By embedding persistence diagrams through differentiable topological representations, our method jointly optimizes (i) a topology divergence loss that alters persistence, (ii) a misclassification objective, and (iii) geometric imperceptibility constraints that preserve visual plausibility. Experiments demonstrate that subtle topology-driven perturbations consistently achieve up to 100% attack success rates on benchmark datasets such as ModelNet40, ShapeNet Part, and ScanObjectNN using PointNet and DGCNN classifiers, while remaining geometrically indistinguishable from the original point clouds, beating state-of-the-art methods on various perceptibility metrics.
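A sketch of the three-part objective, with `persistence_fn` standing in for a differentiable persistent-homology layer that returns a fixed-size vectorization of the persistence diagram; the exact loss forms and weights are assumptions, not the paper's.

```python
import torch
import torch.nn.functional as F

def topo_adv_loss(logits, true_label, points, points_orig, persistence_fn,
                  lambda_topo=1.0, lambda_geo=10.0):
    """Illustrative joint attack objective in the spirit of Topo-ADV (a sketch)."""
    # (i) Topology divergence: drive the perturbed cloud's persistence away
    # from the original's (maximized, hence the minus sign below).
    topo_div = torch.norm(persistence_fn(points) - persistence_fn(points_orig))
    # (ii) Untargeted misclassification: maximize cross-entropy of the true class.
    ce = F.cross_entropy(logits, true_label)
    # (iii) Geometric imperceptibility: an L2 proximity constraint.
    geo = torch.mean((points - points_orig) ** 2)
    # Minimizing this total maximizes (i) and (ii) while keeping (iii) small.
    return -ce - lambda_topo * topo_div + lambda_geo * geo
```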
[339] PointSplat: Efficient Geometry-Driven Pruning and Transformer Refinement for 3D Gaussian Splatting
Anh Thuan Tran, Jana Kosecka
Main category: cs.CV
TL;DR: PointSplat: A 3D geometry-driven prune-and-refine framework for 3D Gaussian Splatting that reduces model size while maintaining rendering quality without per-scene optimization.
Details
Motivation: Traditional 3D Gaussian Splatting methods require millions of Gaussians for complex scenes, leading to high memory and storage demands. Existing pruning approaches rely on 2D images and per-scene fine-tuning, which limits efficiency and scalability.
Method: Two key components: (1) Geometry-driven pruning strategy that ranks Gaussians based solely on 3D attributes (eliminating 2D image dependency), and (2) Dual-branch encoder that separates and re-weights geometric and appearance features to avoid imbalance.
Result: Extensive experiments on ScanNet++ and Replica datasets show PointSplat achieves competitive rendering quality and superior efficiency across varying sparsity levels without additional per-scene optimization.
Conclusion: PointSplat bridges Gaussian pruning and transformer refinement with a geometry-driven approach that reduces model complexity while maintaining visual quality, offering a more efficient alternative to traditional methods.
Abstract: 3D Gaussian Splatting (3DGS) has recently unlocked real-time, high-fidelity novel view synthesis by representing scenes using explicit 3D primitives. However, traditional methods often require millions of Gaussians to capture complex scenes, leading to significant memory and storage demands. Recent approaches have addressed this issue through pruning and per-scene fine-tuning of Gaussian parameters, thereby reducing the model size while maintaining visual quality. These strategies typically rely on 2D images to compute importance scores, followed by scene-specific optimization. In this work, we introduce PointSplat, a 3D geometry-driven prune-and-refine framework that bridges the previously disjoint directions of Gaussian pruning and transformer refinement. Our method includes two key components: (1) an efficient geometry-driven strategy that ranks Gaussians based solely on their 3D attributes, removing reliance on 2D images during the pruning stage, and (2) a dual-branch encoder that separates and re-weights geometric and appearance features to avoid feature imbalance. Extensive experiments on ScanNet++ and Replica across varying sparsity levels demonstrate that PointSplat consistently achieves competitive rendering quality and superior efficiency without additional per-scene optimization.
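A sketch of pruning Gaussians from 3D attributes alone. The opacity-times-volume importance score is an illustrative stand-in, since the paper's actual ranking function is not given in the summary above.

```python
import torch

def prune_gaussians(means, scales, opacities, keep_ratio=0.5):
    """Rank Gaussians by 3D attributes only and keep the top fraction (a sketch).

    means: (N, 3), scales: (N, 3), opacities: (N,)
    """
    volume = scales.prod(dim=-1)   # proxy for each Gaussian's spatial extent
    score = opacities * volume     # no 2D renders involved in the ranking
    k = int(keep_ratio * score.numel())
    keep = torch.topk(score, k).indices
    return means[keep], scales[keep], opacities[keep]
```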
[340] From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping
Yu Wu, Guangzeng Han, Ibra Niang Niang, Francia Ravelombola, Maiara Oliveira, Jason Davis, Dong Chen, Feng Lin, Xiaolei Huang
Main category: cs.CV
TL;DR: PlantXpert: A multimodal reasoning benchmark for soybean and cotton phenotyping using vision-language models, evaluating 11 state-of-the-art VLMs on domain-specific plant science tasks.
Details
Motivation: Plant science requires domain-specific knowledge, fine-grained visual interpretation, and complex biological reasoning that current foundation models struggle with. There's a need for structured evaluation frameworks to adapt VLMs to agricultural phenotyping tasks.
Method: Developed PlantXpert benchmark with 385 digital images and 3,000+ samples covering disease, pest control, weed management, and yield. Evaluated 11 VLMs on visual expertise, quantitative reasoning, and multi-step agronomic reasoning capabilities.
Result: Task-specific fine-tuning improved accuracy up to 78% (Qwen3-VL models). Model scaling gains diminished beyond certain capacity, generalization across crops was uneven, and quantitative/biological reasoning remained challenging.
Conclusion: PlantXpert provides a foundation for assessing evidence-grounded agronomic reasoning and advancing multimodal model development in plant science, though significant challenges remain in domain adaptation.
Abstract: To improve crop genetics, high-throughput, effective and comprehensive phenotyping is a critical prerequisite. While such tasks were traditionally performed manually, recent advances in multimodal foundation models, especially in vision-language models (VLMs), have enabled more automated and robust phenotypic analysis. However, plant science remains a particularly challenging domain for foundation models because it requires domain-specific knowledge, fine-grained visual interpretation, and complex biological and agronomic reasoning. To address this gap, we develop PlantXpert, an evidence-grounded multimodal reasoning benchmark for soybean and cotton phenotyping. Our benchmark provides a structured and reproducible framework for agronomic adaptation of VLMs, and enables controlled comparison between base models and their domain-adapted counterparts. We constructed a dataset comprising 385 digital images and more than 3,000 benchmark samples spanning key plant science domains including disease, pest control, weed management, and yield. The benchmark can assess diverse capabilities including visual expertise, quantitative reasoning, and multi-step agronomic reasoning. A total of 11 state-of-the-art VLMs were evaluated. The results indicate that task-specific fine-tuning leads to substantial improvement in accuracy, with models such as Qwen3-VL-4B and Qwen3-VL-30B achieving up to 78%. At the same time, gains from model scaling diminish beyond a certain capacity, generalization across soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continue to pose substantial challenges. These findings suggest that PlantXpert can serve as a foundation for assessing evidence-grounded agronomic reasoning and for advancing multimodal model development in plant science.
[341] Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection
Lars Lundqvist, Earl Ranario, Hamid Kamangir, Heesup Yun, Christine Diepenbrock, Brian N. Bailey, J. Mason Earles
Main category: cs.CV
TL;DR: Systematic prompt optimization framework for open-vocabulary object detection in agricultural scenes, showing model-specific prompt structures transfer from synthetic to real data.
Details
Motivation: Vision foundation models promise zero-shot object detection but performance is highly sensitive to text prompt construction, especially in complex agricultural scenes. Need systematic approach to optimize prompts for different models.
Method: Evaluated four open-vocabulary detectors (YOLO World, SAM3, Grounding DINO, OWLv2) for cowpea flower/pod detection. Decomposed prompts into eight axes, conducted one-factor-at-a-time analysis followed by combinatorial optimization. Used LLM to translate discovered axis structure to morphologically distinct targets.
Result: Model-specific combinatorial prompts yield substantial gains over naive baselines (+0.357 mAP@0.5 for YOLO World, +0.362 for OWLv2). Prompt structures optimized on synthetic data transfer effectively to real-world fields, matching or exceeding those discovered on labeled real data for most model-object combinations.
Conclusion: Prompt engineering can substantially close the gap between zero-shot VFMs and supervised detectors without manual annotation. Optimal prompts are model-specific, non-obvious, and transferable across domains.
Abstract: Vision foundation models (VFMs) offer the promise of zero-shot object detection without task-specific training data, yet their performance in complex agricultural scenes remains highly sensitive to text prompt construction. We present a systematic prompt optimization framework evaluating four open-vocabulary detectors – YOLO World, SAM3, Grounding DINO, and OWLv2 – for cowpea flower and pod detection across synthetic and real field imagery. We decompose prompts into eight axes and conduct one-factor-at-a-time analysis followed by combinatorial optimization, revealing that models respond divergently to prompt structure: conditions that optimize one architecture can collapse another. Applying model-specific combinatorial prompts yields substantial gains over a naive species-name baseline, including +0.357 mAP@0.5 for YOLO World and +0.362 mAP@0.5 for OWLv2 on synthetic cowpea flower data. To evaluate cross-task generalization, we use an LLM to translate the discovered axis structure to a morphologically distinct target – cowpea pods – and compare against prompting using the discovered optimal structures from synthetic flower data. Crucially, prompt structures optimized exclusively on synthetic data transfer effectively to real-world fields: synthetic-pipeline prompts match or exceed those discovered on labeled real data for the majority of model-object combinations (flower: 0.374 vs. 0.353 for YOLO World; pod: 0.429 vs. 0.371 for SAM3). Our findings demonstrate that prompt engineering can substantially close the gap between zero-shot VFMs and supervised detectors without requiring manual annotation, and that optimal prompts are model-specific, non-obvious, and transferable across domains.
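A minimal sketch of combinatorial search over decomposed prompt axes. The axes and values shown are illustrative, not the paper's eight axes, and `evaluate_map50` is an assumed hook that runs a detector on a validation split and returns mAP@0.5.

```python
import itertools

# Illustrative prompt axes (not the paper's); empty strings mean "axis off".
AXES = {
    "color": ["", "white", "pale purple"],
    "noun": ["flower", "cowpea flower", "small blossom"],
    "context": ["", "on a vine", "among green leaves"],
}

def build_prompt(color, noun, context):
    return " ".join(part for part in (color, noun, context) if part)

def combinatorial_search(evaluate_map50):
    """Exhaustively score every axis combination and return the best prompt."""
    best_prompt, best_score = None, -1.0
    for combo in itertools.product(*AXES.values()):
        prompt = build_prompt(*combo)
        score = evaluate_map50(prompt)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```

One-factor-at-a-time analysis would instead vary a single axis while holding the others at a baseline, which is how the paper narrows the search before the combinatorial stage.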
[342] BLPR: Robust License Plate Recognition under Viewpoint and Illumination Variations via Confidence-Driven VLM Fallback
Guillermo Auza Banegas, Diego Calvimontes Vera, Sergio Castro Sandoval, Natalia Condori Peredo, Edwin Salcedo
Main category: cs.CV
TL;DR: BLPR is a two-stage license plate recognition framework for Bolivian plates using YOLO detection with synthetic data training, geometric rectification, and a confidence-based fallback to Gemma3 4B VLM for ambiguous cases.
Details
Motivation: License plate recognition in unconstrained environments like Bolivia faces challenges due to limited data, unique visual characteristics, illumination changes, and viewpoint distortion, requiring robust solutions.
Method: Two-stage pipeline: 1) YOLO-based detector pretrained on synthetic Blender data simulating extreme perspectives/lighting, fine-tuned on real Bolivian data; 2) geometric rectification and character recognition with confidence-based fallback to Gemma3 4B VLM for ambiguous cases.
Result: Achieves 89.6% character-level recognition accuracy on real-world Bolivian data and introduces the first public Bolivian LPDR dataset for evaluation under diverse conditions.
Conclusion: BLPR demonstrates effective license plate recognition for challenging Bolivian urban environments through synthetic-to-real domain adaptation and confidence-based VLM fallback mechanisms.
Abstract: Robust license plate recognition in unconstrained environments remains a significant challenge, particularly in underrepresented regions with limited data availability and unique visual characteristics, such as Bolivia. Recognition accuracy in real-world conditions is often degraded by factors such as illumination changes and viewpoint distortion. To address these challenges, we introduce BLPR, a novel deep learning-based License Plate Detection and Recognition (LPDR) framework specifically designed for Bolivian license plates. The proposed system follows a two-stage pipeline where a YOLO-based detector is pretrained on synthetic data generated in Blender to simulate extreme perspectives and lighting conditions, and subsequently fine-tuned on street-level data collected in La Paz, Bolivia. Detected plates are geometrically rectified and passed to a character recognition model. To improve robustness under ambiguous scenarios, a lightweight vision-language model (Gemma3 4B) is selectively triggered as a confidence-based fallback mechanism. The proposed framework further leverages synthetic-to-real domain adaptation to improve robustness under diverse real-world conditions. We also introduce the first publicly available Bolivian LPDR dataset, enabling evaluation under diverse viewpoint and illumination conditions. The system achieves a character-level recognition accuracy of 89.6% on real-world data, demonstrating its effectiveness for deployment in challenging urban environments. Our project is publicly available at https://github.com/EdwinTSalcedo/BLPR.
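The confidence-driven fallback reduces to a threshold test; a minimal sketch follows, with the threshold value and both model interfaces assumed for illustration.

```python
def recognize_plate(plate_crop, ocr_model, vlm_fallback, tau=0.85):
    """Confidence-driven VLM fallback in the spirit of BLPR (a sketch).

    Assumed interfaces:
    ocr_model(img)    -> (plate_string, per_character_confidences)
    vlm_fallback(img) -> plate string as read by the Gemma3 4B VLM
    """
    text, char_conf = ocr_model(plate_crop)
    # Trigger the (slower) VLM only when the cheap recognizer is unsure
    # about any single character.
    if min(char_conf) < tau:
        return vlm_fallback(plate_crop)
    return text
```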
[343] I Walk the Line: Examining the Role of Gestalt Continuity in Object Binding for Vision Transformers
Alexa R. Tartaglini, Michael A. Lepori
Main category: cs.CV
TL;DR: Vision transformers learn to bind visual objects using Gestalt continuity principles, with specific attention heads tracking continuity cues that generalize across datasets and contribute to object binding representations.
Details
Motivation: Object binding is fundamental for visual cognition but challenging for neural networks. While recent evidence shows pretrained vision models can bind objects, it's unclear how they achieve this. The paper investigates whether models rely on Gestalt continuity principles for object binding.
Method: Used synthetic datasets to test binding sensitivity to continuity vs other Gestalt principles. Analyzed attention heads in pretrained vision transformers that track continuity. Conducted ablation studies on these heads to assess their contribution to object binding representations.
Result: Binding probes show sensitivity to continuity across various pretrained vision transformers. Specific attention heads track continuity and generalize across datasets. Ablation of these heads often reduces models’ ability to produce representations encoding object binding.
Conclusion: Vision transformers learn to use Gestalt continuity principles for object binding, with specialized attention heads that track continuity cues. These mechanisms contribute to the models’ ability to form coherent object representations.
Abstract: Object binding is a foundational process in visual cognition, during which low-level perceptual features are joined into object representations. Binding has been considered a fundamental challenge for neural networks, and a major milestone on the way to artificial models with flexible visual intelligence. Recently, several investigations have demonstrated evidence that binding mechanisms emerge in pretrained vision models, enabling them to associate portions of an image that contain an object. The question remains: how are these models binding objects together? In this work, we investigate whether vision models rely on the principle of Gestalt continuity to perform object binding, over and above other principles like similarity and proximity. Using synthetic datasets, we demonstrate that binding probes are sensitive to continuity across a wide range of pretrained vision transformers. Next, we uncover particular attention heads that track continuity, and show that these heads generalize across datasets. Finally, we ablate these attention heads, and show that they often contribute to producing representations that encode object binding.
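A sketch of ablating a single attention head in a ViT, assuming a timm-style attention module whose output projection `proj` receives the concatenation of per-head outputs; the head is zeroed before the projection so its contribution is cleanly removed.

```python
def ablate_head(attn_module, head_idx, n_heads=12):
    """Zero one head's contribution by masking the input to the attention
    output projection (a sketch; assumes a timm-style `attn.proj` Linear
    whose input is the (B, N, n_heads * head_dim) head concatenation)."""
    def pre_hook(proj, inputs):
        x = inputs[0].clone()
        d = x.shape[-1] // n_heads
        x[..., head_idx * d:(head_idx + 1) * d] = 0.0
        return (x,)  # returned tuple replaces the projection's inputs

    return attn_module.proj.register_forward_pre_hook(pre_hook)

# handle = ablate_head(model.blocks[4].attn, head_idx=7)
# ... run the binding probe with the head silenced ...
# handle.remove()
```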
[344] Cross-Cultural Value Awareness in Large Vision-Language Models
Phillip Howard, Xin Su, Kathleen C. Fraser
Main category: cs.CV
TL;DR: LVLMs exhibit cultural stereotypes in value judgments based on visual cultural contexts like religion, nationality, and socioeconomic status, with systematic biases across multiple dimensions.
Details
Motivation: While fairness concerns in LVLMs have focused on social biases, there's limited research on cultural stereotypes related to religion, nationality, and socioeconomic status. The paper aims to investigate how cultural contexts in images influence LVLMs' judgments about people's moral, ethical, and political values.
Method: Multi-dimensional analysis using counterfactual image sets depicting the same person across different cultural contexts. Evaluation framework uses Moral Foundations Theory, lexical analyses, and measures sensitivity of generated values to depicted cultural contexts across five popular LVLMs.
Result: LVLMs demonstrate systematic cultural stereotypes in value judgments, with generated values being sensitive to depicted cultural contexts. The models show awareness of cultural value differences but reinforce harmful stereotypes.
Conclusion: LVLMs exhibit problematic cultural biases in value judgments based on visual cultural contexts, highlighting the need for fairness interventions beyond social biases to address cultural stereotypes.
Abstract: The rapid adoption of large vision-language models (LVLMs) in recent years has been accompanied by growing fairness concerns due to their propensity to reinforce harmful societal stereotypes. While significant attention has been paid to such fairness concerns in the context of social biases, relatively little prior work has examined the presence of stereotypes in LVLMs related to cultural contexts such as religion, nationality, and socioeconomic status. In this work, we aim to narrow this gap by investigating how cultural contexts depicted in images influence the judgments LVLMs make about a person’s moral, ethical, and political values. We conduct a multi-dimensional analysis of such value judgments in five popular LVLMs using counterfactual image sets, which depict the same person across different cultural contexts. Our evaluation framework diagnoses LVLM awareness of cultural value differences through the use of Moral Foundations Theory, lexical analyses, and the sensitivity of generated values to depicted cultural contexts.
[345] Unmixing-Guided Spatial-Spectral Mamba with Clustering Tokens for Hyperspectral Image Classification
Yimin Zhu, Lincoln Linlin Xu
Main category: cs.CV
TL;DR: A novel unmixing-guided spatial-spectral Mamba model with clustering tokens for hyperspectral image classification that addresses spectral mixture effects and spatial-spectral heterogeneity through multi-task learning.
Details
Motivation: Hyperspectral image classification is challenging due to spectral-mixture effects, spatial-spectral heterogeneity, and difficulty preserving class boundaries and details. Existing methods struggle with these issues, motivating a new approach that combines spectral unmixing with spatial-spectral modeling.
Method: 1) Spectral unmixing network that learns endmembers and abundance maps while accounting for endmember variabilities; 2) Top-K token selection strategy based on abundance map clusters for adaptive token sequencing; 3) Unmixing-guided spatial-spectral Mamba module for improved feature learning; 4) Multi-task supervision scheme for simultaneous endmember-abundance pattern and classification label learning.
Result: The model outperforms state-of-the-art approaches on four HSI datasets, demonstrating superior classification accuracy while also producing comprehensive spectral-library and abundance maps.
Conclusion: The proposed unmixing-guided spatial-spectral Mamba framework effectively addresses HSI classification challenges by integrating spectral unmixing with advanced sequence modeling, providing both accurate classification and interpretable spectral analysis outputs.
Abstract: Although hyperspectral image (HSI) classification is critical for supporting various environmental applications, it is a challenging task due to the spectral-mixture effect, the spatial-spectral heterogeneity and the difficulty to preserve class boundaries and details. This letter presents a novel unmixing-guided spatial-spectral Mamba with clustering tokens for improved HSI classification, with the following contributions. First, to disentangle the spectral mixture effect in HSI for improved pattern discovery, we design a novel spectral unmixing network that not only automatically learns endmembers and abundance maps from HSI but also accounts for endmember variabilities. Second, to generate Mamba token sequences, based on the clusters defined by abundance maps, we design an efficient Top-K token selection strategy to adaptively sequence the tokens for improved representational capability. Third, to improve spatial-spectral feature learning and detail preservation, based on the Top-K token sequences, we design a novel unmixing-guided spatial-spectral Mamba module that greatly improves traditional Mamba models in terms of token learning and sequencing. Fourth, to learn simultaneously the endmember-abundance patterns and classification labels, a multi-task scheme is designed for model supervision, leading to a new unmixing-classification framework that outputs not only accurate classification maps but also a comprehensive spectral-library and abundance maps. Comparative experiments on four HSI datasets demonstrate that our model can greatly outperform the other state-of-the-art approaches. Code is available at https://github.com/GSIL-UCalgary/Unmixing_guided_Mamba.git
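A minimal sketch of Top-K token selection driven by abundance maps. Assigning each pixel to its dominant endmember and ranking within each cluster by abundance is an assumed reading of the clustering rule.

```python
import torch

def topk_cluster_tokens(tokens, abundances, k=16):
    """Select the K most representative tokens per abundance-defined cluster
    (a sketch; the grouping rule is an assumption).

    tokens:     (N, D) pixel/patch tokens.
    abundances: (N, E) per-pixel abundances over E endmembers.
    Returns one token sequence per non-empty cluster.
    """
    cluster = abundances.argmax(dim=-1)  # dominant endmember per pixel, (N,)
    sequences = []
    for e in range(abundances.shape[-1]):
        idx = (cluster == e).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        score = abundances[idx, e]  # rank within the cluster by abundance
        top = idx[torch.topk(score, min(k, idx.numel())).indices]
        sequences.append(tokens[top])  # (<=K, D) sequence for this cluster
    return sequences
```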
[346] Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
Tzu Ling Liu, Ian Stavness, Mrigank Rochan
Main category: cs.CV
TL;DR: LMFT tokenizes video frames and learns to discard low-motion background tokens while retaining motion-rich action tokens for efficient Video Unsupervised Domain Adaptation.
Details
Motivation: Existing VUDA methods struggle with static backgrounds that exacerbate domain shifts and overlook computational efficiency, limiting real-world adoption in action recognition.
Method: Learnable Motion-Focused Tokenization (LMFT) tokenizes video frames into patch tokens and learns to discard low-motion, redundant background tokens while retaining motion-rich, action-relevant tokens for adaptation.
Result: Achieves state-of-the-art performance on three standard VUDA benchmarks across 21 domain adaptation settings while significantly reducing computational overhead.
Conclusion: LMFT enables VUDA that is both effective and computationally efficient by focusing on motion-rich tokens and discarding redundant background information.
Abstract: Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.
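A minimal sketch of motion-based token pruning. LMFT learns the selection end-to-end, whereas the frame-difference score below is a simple heuristic standing in for the learned scorer.

```python
import torch

def keep_motion_tokens(frames_patched, keep_ratio=0.5):
    """Score patch tokens by temporal change and drop the static ones (a sketch).

    frames_patched: (T, N, D) patch embeddings for T frames, N patches each.
    Returns (T, k, D) with only the motion-rich patch positions retained.
    """
    # Mean absolute change of each patch embedding across adjacent frames.
    motion = (frames_patched[1:] - frames_patched[:-1]).abs().mean(dim=(0, 2))  # (N,)
    k = int(keep_ratio * motion.numel())
    keep = torch.topk(motion, k).indices
    return frames_patched[:, keep]
```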
[347] YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection
Yiyu Liu, Shuo Ye, Chao Hao, Zitong Yu
Main category: cs.CV
TL;DR: A new Video Camouflaged Object Detection benchmark YUV20K with 24K frames and novel framework with Motion Feature Stabilization and Trajectory-Aware Alignment modules for handling complex motion scenarios.
Details
Motivation: Current VCOD methods face limitations due to scarce challenging benchmarks and poor robustness against erratic motion dynamics, particularly struggling with Motion-Induced Appearance Instability and Temporal Feature Misalignment in complex motion scenarios.
Method: Proposes YUV20K benchmark with 24,295 annotated frames across 91 scenes and 47 species, targeting challenging scenarios. Introduces novel framework with two key modules: Motion Feature Stabilization (MFS) using frame-agnostic Semantic Basis Primitives to stabilize features, and Trajectory-Aware Alignment (TAA) using trajectory-guided deformable sampling for precise temporal alignment.
Result: Method significantly outperforms state-of-the-art competitors on existing datasets and establishes new baseline on YUV20K. Shows superior cross-domain generalization and robustness in complex spatiotemporal scenarios.
Conclusion: The YUV20K benchmark addresses data scarcity in VCOD, while the proposed framework with MFS and TAA modules effectively handles complex motion dynamics, advancing the field of video camouflaged object detection.
Abstract: Video Camouflaged Object Detection (VCOD) is currently constrained by the scarcity of challenging benchmarks and the limited robustness of models against erratic motion dynamics. Existing methods often struggle with Motion-Induced Appearance Instability and Temporal Feature Misalignment caused by complex motion scenarios. To address the data bottleneck, we present YUV20K, a pixel-level annotated, complexity-driven VCOD benchmark. Comprising 24,295 annotated frames across 91 scenes and 47 species, it specifically targets challenging scenarios such as large-displacement motion, camera motion, and four other scenario types. On the methodological front, we propose a novel framework featuring two key modules: Motion Feature Stabilization (MFS) and Trajectory-Aware Alignment (TAA). The MFS module utilizes frame-agnostic Semantic Basis Primitives to stabilize features, while the TAA module leverages trajectory-guided deformable sampling to ensure precise temporal alignment. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art competitors on existing datasets and establishes a new baseline on the challenging YUV20K. Notably, our framework exhibits superior cross-domain generalization and robustness when confronting complex spatiotemporal scenarios. Our code and dataset will be available at https://github.com/K1NSA/YUV20K
[348] FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation
Yuchen Zou, Huikai Shao, Lihuang Fang, Zhipeng Xiong, Dexing Zhong
Main category: cs.CV
TL;DR: FlowPalm is a palmprint generation framework that uses optical flow to simulate complex non-rigid deformations, addressing geometric variation limitations in existing methods.
Details
Motivation: Existing palmprint generation methods focus mainly on style translation while ignoring or approximating geometric variation, which is crucial for reflecting the diversity of real palmprints needed for training effective recognition models.
Method: FlowPalm estimates optical flows between real palmprint pairs to capture statistical patterns of geometric deformations, then uses a progressive sampling process that gradually introduces these deformations during diffusion while maintaining identity consistency.
Result: Extensive experiments on six benchmark datasets show FlowPalm significantly outperforms state-of-the-art palmprint generation approaches in downstream recognition tasks.
Conclusion: FlowPalm effectively addresses geometric variation in synthetic palmprint generation through optical-flow-driven deformation modeling, improving recognition model training.
Abstract: Recently, synthetic palmprints have been increasingly used as substitutes for real data to train recognition models. To be effective, such synthetic data must reflect the diversity of real palmprints, including both style variation and geometric variation. However, existing palmprint generation methods mainly focus on style translation, while geometric variation is either ignored or approximated by simple handcrafted augmentations. In this work, we propose FlowPalm, an optical-flow-driven palmprint generation framework capable of simulating the complex non-rigid deformations observed in real palms. Specifically, FlowPalm estimates optical flows between real palmprint pairs to capture the statistical patterns of geometric deformations. Building on these priors, we design a progressive sampling process that gradually introduces the geometric deformations during diffusion while maintaining identity consistency. Extensive experiments on six benchmark datasets demonstrate that FlowPalm significantly outperforms state-of-the-art palmprint generation approaches in downstream recognition tasks. Project page: https://yuchenzou.github.io/FlowPalm/
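Backward warping an image by a dense flow field is the standard building block behind flow-driven deformation; a minimal sketch using torch.nn.functional.grid_sample follows. How FlowPalm schedules these deformations inside the diffusion sampler is not reproduced here.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(img, flow):
    """Warp an image by a dense optical flow field (standard backward warping).

    img:  (B, C, H, W)
    flow: (B, 2, H, W) pixel displacements, channels ordered (dx, dy).
    """
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img.device)  # (2, H, W)
    src = grid.unsqueeze(0) + flow  # per-pixel source coordinates, (B, 2, H, W)
    # Normalize to [-1, 1] as grid_sample expects (x against W, y against H).
    src_x = 2.0 * src[:, 0] / (w - 1) - 1.0
    src_y = 2.0 * src[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((src_x, src_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(img, sample_grid, align_corners=True)
```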
[349] Gait Recognition with Temporal Kolmogorov-Arnold Networks
Mohammed Asad, Dinesh Kumar Vishwakarma
Main category: cs.CV
TL;DR: TKAN (Temporal Kolmogorov-Arnold Network) for gait recognition addresses limitations of recurrent and transformer models by using learnable 1D functions and two-level memory to efficiently model both local gait cycles and long-term motion trends.
Details
Motivation: Gait recognition is valuable for surveillance as it can be acquired at distance without cooperation, but existing silhouette-based temporal models struggle with long sequences, noise, appearance variations, and computational efficiency. Recurrent models lose early frame information, transformers need more resources and data, and both have issues with irregular sequences and noise.
Method: Proposes Temporal Kolmogorov-Arnold Network (TKAN) that replaces fixed edge weights with learnable one-dimensional functions. Incorporates two-level memory: short-term RKAN sublayers for cycle-level dynamics and gated long-term pathway for broader temporal context. Uses CNN+TKAN framework for gait recognition.
Result: Experiments on CASIA-B dataset show the CNN+TKAN framework achieves strong recognition performance under the reported evaluation setting.
Conclusion: TKAN provides an efficient solution for gait recognition that can handle both local gait cycles and long-term motion trends while maintaining computational efficiency compared to recurrent and transformer architectures.
Abstract: Gait recognition is a biometric modality that identifies individuals from their characteristic walking patterns. Unlike conventional biometric traits, gait can be acquired at a distance and without active subject cooperation, making it suitable for surveillance and public safety applications. Nevertheless, silhouette-based temporal models remain sensitive to long sequences, observation noise, and appearance-related covariates. Recurrent architectures often struggle to preserve information from earlier frames and are inherently sequential, which hinders optimization, whereas transformer-based models typically require greater computational resources and larger training sets and may be sensitive to irregular sequence lengths and noisy inputs. These limitations reduce robustness under clothing variation, carrying conditions, and view changes, while also hindering the joint modeling of local gait cycles and longer-term motion trends. To address these challenges, we introduce a Temporal Kolmogorov-Arnold Network (TKAN) for gait recognition. The proposed model replaces fixed edge weights with learnable one-dimensional functions and incorporates a two-level memory mechanism consisting of short-term RKAN sublayers and a gated long-term pathway. This design enables efficient modeling of both cycle-level dynamics and broader temporal context while maintaining a compact backbone. Experiments on the CASIA-B dataset indicate that the proposed CNN+TKAN framework achieves strong recognition performance under the reported evaluation setting.
[350] Revisiting the Scale Loss Function and Gaussian-Shape Convolution for Infrared Small Target Detection
Hao Li, Man Fung Zhuo
Main category: cs.CV
TL;DR: A novel approach for infrared small target detection addressing training instability through diff-based scale loss and improving spatial attention via Gaussian-shaped convolution with learnable scale and orientation alignment.
Details
Motivation: Infrared small target detection suffers from two key issues: training instability due to non-monotonic scale loss functions, and inadequate spatial attention because generic convolution kernels don't account for the physical imaging characteristics of small targets.
Method: Proposes a diff-based scale loss that weights predictions by signed area difference between predicted mask and ground truth for monotonic gradients. Introduces Gaussian-shaped convolution with learnable scale parameter to match center-concentrated intensity profiles, augmented with rotated pinwheel mask that adaptively aligns kernel orientation via straight-through estimator.
Result: Extensive experiments on IRSTD-1k, NUDT-SIRST, and SIRST-UAVB datasets demonstrate consistent improvements in mIoU, Pd, and Fa metrics over state-of-the-art methods.
Conclusion: The proposed diff-based scale loss and Gaussian-shaped convolution with orientation alignment effectively address training stability and spatial attention challenges in infrared small target detection, achieving superior performance across multiple benchmarks.
Abstract: Infrared small target detection still faces two persistent challenges: training instability from non-monotonic scale loss functions, and inadequate spatial attention due to generic convolution kernels that ignore the physical imaging characteristics of small targets. In this paper, we revisit both aspects. For the loss side, we propose a \emph{diff-based scale loss} that weights predictions according to the signed area difference between the predicted mask and the ground truth, yielding strictly monotonic gradients and stable convergence. We further analyze a family of four scale loss variants to understand how their geometric properties affect detection behavior. For the spatial side, we introduce \emph{Gaussian-shaped convolution} with a learnable scale parameter to match the center-concentrated intensity profile of infrared small targets, and augment it with a \emph{rotated pinwheel mask} that adaptively aligns the kernel with target orientation via a straight-through estimator. Extensive experiments on IRSTD-1k, NUDT-SIRST, and SIRST-UAVB demonstrate consistent improvements in $mIoU$, $P_d$, and $F_a$ over state-of-the-art methods. We release our anonymous code and pretrained models.
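As a rough illustration of the diff-based scale loss idea, here is a minimal PyTorch sketch that penalizes the signed area difference between the predicted soft mask and the ground truth; the paper's exact weighting scheme may differ, so treat the normalization below as an assumption.

```python
import torch

def diff_scale_loss(pred, target, eps=1e-6):
    """pred, target: (B, 1, H, W); pred is a post-sigmoid soft mask.
    Penalizes the size-normalized signed area mismatch, which grows
    monotonically as the prediction over- or under-segments the target
    (an assumed reading of the paper's 'signed area difference' weighting)."""
    area_pred = pred.sum(dim=(1, 2, 3))
    area_gt = target.sum(dim=(1, 2, 3))
    diff = (area_pred - area_gt) / (area_gt + eps)   # signed, scale-free
    return diff.abs().mean()
```

Monotonicity in the area mismatch is the property the paper credits for stable convergence.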
[351] A Comparative Study of Modern Object Detectors for Robust Apple Detection in Orchard Imagery
Mohammed Asad, Ajai Kumar Gautam, Priyanshu Dhiman, Rishi Raj Prajapati
Main category: cs.CV
TL;DR: Benchmark study comparing six object detectors for apple detection in orchard images using AppleBBCH81 dataset with unified evaluation protocol.
Details
Motivation: Apple detection in orchards is challenging due to illumination changes, leaf clutter, dense fruit clusters, and occlusion. Need for fair comparison of detectors for agricultural applications like yield prediction and robotic harvesting.
Method: Established controlled benchmark with deterministic train/validation/test split on AppleBBCH81 dataset. Evaluated six detectors: YOLOv10n, YOLO11n, RT-DETR-L, Faster R-CNN, FCOS, and SSDLite320 using COCO-style mAP metrics, precision-recall curves, and fixed-threshold analysis.
Result: YOLO11n achieved best strict localization (mAP@0.5:0.95 = 0.6065, mAP@0.5 = 0.9620). YOLOv10n had highest F1-score at fixed threshold. RT-DETR-L showed high recall but low precision due to false positives.
Conclusion: Detector selection for orchard applications should consider both localization accuracy and threshold robustness based on downstream task requirements.
Abstract: Accurate apple detection in orchard images is important for yield prediction, fruit counting, robotic harvesting, and crop monitoring. However, changing illumination, leaf clutter, dense fruit clusters, and partial occlusion make detection difficult. To provide a fair and reproducible comparison, this study establishes a controlled benchmark for single-class apple detection on the public AppleBBCH81 dataset using one deterministic train, validation, and test split and a unified evaluation protocol across six representative detectors: YOLOv10n, YOLO11n, RT-DETR-L, Faster R-CNN (ResNet50-FPN), FCOS (ResNet50-FPN), and SSDLite320 (MobileNetV3-Large). Performance is evaluated primarily using COCO-style mAP@0.5 and mAP@0.5:0.95, and threshold-dependent behavior is further analyzed using precision-recall curves and fixed-threshold precision, recall, and F1-score at IoU = 0.5. On the validation split, YOLO11n achieves the best strict localization performance with mAP@0.5:0.95 = 0.6065 and mAP@0.5 = 0.9620, followed closely by RT-DETR-L and YOLOv10n. At a fixed operating point with confidence >= 0.05, YOLOv10n attains the highest F1-score, whereas RT-DETR-L achieves very high recall but low precision because of many false positives at low confidence. These findings show that detector selection for orchard deployment should be guided not only by localization-aware accuracy but also by threshold robustness and the requirements of the downstream task.
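For readers reproducing the fixed-threshold analysis, the precision, recall, and F1 numbers at IoU = 0.5 and confidence >= 0.05 come from standard greedy matching of predictions to ground truth; a generic NumPy sketch (not the study's evaluation script) follows.

```python
import numpy as np

def box_iou(a, b):
    """IoU between one box a and an array of boxes b, format [x1, y1, x2, y2]."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def fixed_threshold_prf1(preds, gts, conf_thr=0.05, iou_thr=0.5):
    """preds: list of (box, score) for one image; gts: (N, 4) array."""
    kept = sorted((p for p in preds if p[1] >= conf_thr), key=lambda p: -p[1])
    matched = np.zeros(len(gts), dtype=bool)
    tp = 0
    for box, _ in kept:                 # greedy match, highest score first
        if len(gts) == 0:
            break
        ious = box_iou(np.asarray(box), gts)
        ious[matched] = 0.0             # each GT box matches at most once
        j = int(ious.argmax())
        if ious[j] >= iou_thr:
            matched[j] = True
            tp += 1
    fp, fn = len(kept) - tp, len(gts) - tp
    prec = tp / max(tp + fp, 1)
    rec = tp / max(tp + fn, 1)
    return prec, rec, 2 * prec * rec / max(prec + rec, 1e-9)
```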
[352] GIF: A Conditional Multimodal Generative Framework for IR Drop Imaging in Chip Layouts
Kiran Thorat, Nicole Meng, Mostafa Karami, Caiwen Ding, Yingjie Lao, Zhijie Jerry Shi
Main category: cs.CV
TL;DR: GIF is a generative framework using diffusion models with multimodal conditioning (geometry + topology) to predict IR drop images for chip power integrity analysis, outperforming prior ML methods.
Details
Motivation: Traditional EDA tools for IR drop analysis are slow/expensive at scale. Existing ML methods fail to capture both local/long-range dependencies and ignore crucial geometric layout and logical connectivity information.
Method: GIF uses a conditional diffusion process guided by fused image and graph features, combining geometric layout features with logical circuit topology representations for multimodal conditioning.
Result: On CircuitNet-N28 dataset: 0.78 SSIM, 0.95 Pearson correlation, 21.77 PSNR, 0.026 NMAE, outperforming prior methods. Demonstrates reliable high-quality IR drop image generation.
Conclusion: IR drop analysis can effectively leverage generative modeling advances when geometric layout and logical topology are jointly modeled, enabling structured image generation for chip design.
Abstract: IR drop analysis is essential in physical chip design to ensure the power integrity of on-chip power delivery networks. Traditional Electronic Design Automation (EDA) tools have become slow and expensive as transistor density scales. Recent works have introduced machine learning (ML)-based methods that formulate IR drop analysis as an image prediction problem. These existing ML approaches fail to capture both local and long-range dependencies and ignore crucial geometrical and topological information from physical layouts and logical connectivity. To address these limitations, we propose GIF, a Generative IR drop Framework that uses both geometrical and topological information to generate IR drop images. GIF fuses image and graph features to guide a conditional diffusion process, producing high-quality IR drop images. On the CircuitNet-N28 dataset, GIF achieves 0.78 SSIM, 0.95 Pearson correlation, 21.77 PSNR, and 0.026 NMAE, outperforming prior methods. These results demonstrate that diffusion-based multimodal conditioning reliably generates high-quality IR drop images: by combining geometry-aware spatial features with logical graph representations, GIF shows that IR drop analysis can effectively leverage recent advances in generative modeling for structured image generation.
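A minimal sketch of what fusing image and graph features into a diffusion condition can look like; the pooling, projection, and module names are assumptions, not GIF's actual components.

```python
import torch
import torch.nn as nn

class FusedCondition(nn.Module):
    """Pools a graph embedding of the circuit topology, concatenates it with
    layout-image features, and projects to the conditioning vector a diffusion
    denoiser would consume (an assumed, generic fusion)."""
    def __init__(self, img_dim, graph_dim, cond_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(img_dim + graph_dim, cond_dim),
            nn.SiLU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, img_feat, node_feats):   # (B, img_dim), (B, N, graph_dim)
        graph_feat = node_feats.mean(dim=1)    # simple mean pooling over nodes
        return self.proj(torch.cat([img_feat, graph_feat], dim=-1))
```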
[353] SwinTextUNet: Integrating CLIP-Based Text Guidance into Swin Transformer U-Nets for Medical Image Segmentation
Ashfak Yeafi, Parthaw Goswami, Md Khairul Islam, Ashifa Islam Shamme
Main category: cs.CV
TL;DR: SwinTextUNet: A multimodal medical image segmentation framework combining CLIP text embeddings with Swin Transformer UNet architecture for improved segmentation of ambiguous/low-contrast patterns.
Details
Motivation: Traditional medical image segmentation models relying solely on visual features struggle with ambiguous or low-contrast patterns, limiting their clinical utility. The authors aim to overcome these limitations by incorporating semantic text guidance to enhance segmentation robustness and accuracy.
Method: Proposes SwinTextUNet, a multimodal segmentation framework that integrates Contrastive Language Image Pretraining (CLIP)-derived textual embeddings into a Swin Transformer UNet backbone. Uses cross-attention and convolutional fusion mechanisms to align semantic text guidance with hierarchical visual representations. Evaluates a four-stage variant optimized for performance-complexity balance.
Result: Achieves Dice score of 86.47% and IoU of 78.2% on the QaTaCOV19 dataset. Ablation studies validate the importance of text guidance and multimodal fusion components. The four-stage variant provides optimal balance between performance and complexity.
Conclusion: Vision-language integration shows promise for advancing medical image segmentation and supporting clinically meaningful diagnostic tools. The multimodal approach enhances robustness and accuracy when dealing with ambiguous patterns in medical imaging.
Abstract: Precise medical image segmentation is fundamental for enabling computer aided diagnosis and effective treatment planning. Traditional models that rely solely on visual features often struggle when confronted with ambiguous or low contrast patterns. To overcome these limitations, we introduce SwinTextUNet, a multimodal segmentation framework that incorporates Contrastive Language Image Pretraining (CLIP)-derived textual embeddings into a Swin Transformer UNet backbone. By integrating cross attention and convolutional fusion, the model effectively aligns semantic text guidance with hierarchical visual representations, enhancing robustness and accuracy. We evaluate our approach on the QaTaCOV19 dataset, where the proposed four stage variant achieves an optimal balance between performance and complexity, yielding Dice and IoU scores of 86.47% and 78.2%, respectively. Ablation studies further validate the importance of text guidance and multimodal fusion. These findings underscore the promise of vision language integration in advancing medical image segmentation and supporting clinically meaningful diagnostic tools.
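The cross-attention fusion described above typically has visual tokens query the text embeddings; a self-contained PyTorch sketch (dimensions, head count, and the 1x1-conv fusion are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class TextGuidedFusion(nn.Module):
    """Visual tokens from one decoder stage attend to CLIP text embeddings,
    and a 1x1 convolution fuses the attended features back in."""
    def __init__(self, vis_dim, txt_dim, heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        self.attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * vis_dim, vis_dim, kernel_size=1)

    def forward(self, vis, txt):          # vis: (B, C, H, W), txt: (B, T, D)
        B, C, H, W = vis.shape
        q = vis.flatten(2).transpose(1, 2)         # (B, H*W, C) queries
        kv = self.txt_proj(txt)                    # (B, T, C) keys/values
        attended, _ = self.attn(q, kv, kv)
        attended = attended.transpose(1, 2).reshape(B, C, H, W)
        return self.fuse(torch.cat([vis, attended], dim=1))
```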
[354] Demographic and Linguistic Bias Evaluation in Omnimodal Language Models
Alaa Elobaid
Main category: cs.CV
TL;DR: Evaluation reveals significant demographic and linguistic biases in omnimodal language models, with audio understanding tasks showing substantially lower performance and greater bias compared to image/video tasks.
Details
Motivation: Omnimodal language models that process text, images, audio, and video are being widely deployed, but their performance across different demographic groups and modalities is not well studied, raising concerns about fairness in real-world applications.
Method: Evaluated four omnimodal models on tasks including demographic attribute estimation, identity verification, activity recognition, multilingual speech transcription, and language identification, measuring accuracy differences across age, gender, skin tone, language, and country of origin.
Result: Image and video understanding tasks generally exhibit better performance with smaller demographic disparities, while audio understanding tasks show significantly lower performance and substantial bias, including large accuracy differences across age groups, genders, and languages, and frequent prediction collapse toward narrow categories.
Conclusion: The findings highlight the importance of evaluating fairness across all supported modalities as omnimodal language models are increasingly used in real-world applications, with particular attention needed for audio understanding tasks that show the most significant biases.
Abstract: This paper provides a comprehensive evaluation of demographic and linguistic biases in omnimodal language models that process text, images, audio, and video within a single framework. Although these models are being widely deployed, their performance across different demographic groups and modalities is not well studied. Four omnimodal models are evaluated on tasks that include demographic attribute estimation, identity verification, activity recognition, multilingual speech transcription, and language identification. Accuracy differences are measured across age, gender, skin tone, language, and country of origin. The results show that image and video understanding tasks generally exhibit better performance with smaller demographic disparities. In contrast, audio understanding tasks exhibit significantly lower performance and substantial bias, including large accuracy differences across age groups, genders, and languages, and frequent prediction collapse toward narrow categories. These findings highlight the importance of evaluating fairness across all supported modalities as omnimodal language models are increasingly used in real-world applications.
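Measuring the reported demographic disparities reduces to comparing per-group accuracies; a small pandas sketch (the column names are hypothetical):

```python
import pandas as pd

def group_accuracy_gaps(df: pd.DataFrame, group_col: str):
    """df holds one row per test example with a boolean 'correct' column and
    a demographic column such as 'age_group' or 'language' (names assumed).
    Returns per-group accuracy and the max-min gap used as a bias indicator."""
    acc = df.groupby(group_col)["correct"].mean()
    return acc, acc.max() - acc.min()

# Hypothetical usage on a model's scored predictions:
# per_lang_acc, lang_gap = group_accuracy_gaps(results, "language")
```

A large gap flags groups the model systematically underserves.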
[355] What and Where to Adapt: Structure-Semantics Co-Tuning for Machine Vision Compression via Synergistic Adapters
Shaobo Liu, Haobo Xiong, Kai Liu, Yuna Lin
Main category: cs.CV
TL;DR: S2-CoT introduces a novel parameter-efficient fine-tuning framework for image compression codecs that coordinates structural and semantic adapters to optimize both encoder-decoder features and entropy model statistics.
Details
Motivation: Existing parameter-efficient fine-tuning methods for image compression focus mainly on encoder-decoder backbones while neglecting the entropy model's statistical semantics, which predicts latent feature distributions. Naive adapter insertion in entropy models leads to suboptimal results, highlighting the need for coordinated adaptation across the compression pipeline.
Method: Proposes Structure-Semantics Co-Tuning (S2-CoT) with two specialized adapters: Structural Fidelity Adapter (SFA) in encoder-decoder for high-fidelity representations via spatial-frequency fusion, and Semantic Context Adapter (SCA) in entropy model to align with SFA-tuned features by refining channel context for efficient statistical coding.
Result: Achieves state-of-the-art results across four diverse base codecs with minimal trainable parameters, closely matching full fine-tuning performance while avoiding the performance degradation of naive adapter approaches.
Conclusion: S2-CoT demonstrates that coordinated adaptation of both structural and semantic components is crucial for effective parameter-efficient fine-tuning in image compression, enabling near-full fine-tuning performance with significantly reduced parameters.
Abstract: Parameter-efficient fine-tuning of pre-trained codecs is a promising direction in image compression for human and machine vision. While most existing works have primarily focused on tuning the feature structure within the encoder-decoder backbones, the adaptation of the statistical semantics within the entropy model has received limited attention despite its function of predicting the probability distribution of latent features. Our analysis reveals that naive adapter insertion into the entropy model can lead to suboptimal outcomes, underscoring that the effectiveness of adapter-based tuning depends critically on the coordination between adapter type and placement across the compression pipeline. Therefore, we introduce Structure-Semantics Co-Tuning (S2-CoT), a novel framework that achieves this coordination via two specialized, synergistic adapters: the Structural Fidelity Adapter (SFA) and the Semantic Context Adapter (SCA). SFA is integrated into the encoder-decoder to preserve high-fidelity representations by dynamically fusing spatial and frequency information; meanwhile, the SCA adapts the entropy model to align with SFA-tuned features by refining the channel context for more efficient statistical coding. Through joint optimization, S2-CoT turns potential performance degradation into synergistic gains, achieving state-of-the-art results across four diverse base codecs with only a small fraction of trainable parameters, closely matching full fine-tuning performance. Code is available at https://github.com/Brock-bit4/S2-CoT.
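For context, the adapters such methods coordinate are usually small residual bottlenecks; the skeleton below shows the common form, while the SFA's spatial-frequency fusion and the SCA's channel-context refinement (not sketched here) are what S2-CoT layers on top.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic down-project / nonlinearity / up-project residual adapter
    inserted into a frozen codec block; only these few weights train."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))
```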
[356] FREE-Switch: Frequency-based Dynamic LoRA Switch for Style Transfer
Shenghe Zheng, Minyu Zhang, Tianhao Liu, Hongzhi Wang
Main category: cs.CV
TL;DR: FREE-Switch: A frequency-domain importance-driven dynamic LoRA switching method for efficiently combining pretrained diffusion adapters for customized image generation without training.
Details
Motivation: Existing model merging methods for diffusion models suffer from content drift due to error accumulation across diffusion steps, while training-based approaches are computationally expensive and training-free methods use uniform fusion strategies that ignore inter-adapter differences, leading to detail degradation.
Method: Proposes a frequency-domain importance-driven dynamic LoRA switch method that recognizes different adapters are specialized for different content types, so each diffusion step carries different significance for each adapter. Also includes an automatic Generation Alignment mechanism to align generation intents at the semantic level for maintaining consistency.
Result: The FREE-Switch framework efficiently combines adapters for different objects and styles, substantially reducing training cost for high-quality customized generation.
Conclusion: The proposed method enables low-cost customized generation by efficiently combining pretrained diffusion adapters while avoiding content drift and detail degradation issues of existing approaches.
Abstract: With the growing availability of open-sourced adapters trained on the same diffusion backbone for diverse scenes and objects, combining these pretrained weights enables low-cost customized generation. However, most existing model merging methods are designed for classification or text generation, and when applied to image generation, they suffer from content drift due to error accumulation across multiple diffusion steps. For image-oriented methods, training-based approaches are computationally expensive and unsuitable for edge deployment, while training-free ones use uniform fusion strategies that ignore inter-adapter differences, leading to detail degradation. We find that since different adapters are specialized for generating different types of content, the contribution of each diffusion step carries different significance for each adapter. Accordingly, we propose a frequency-domain importance-driven dynamic LoRA switch method. Furthermore, we observe that maintaining semantic consistency across adapters effectively mitigates detail loss; thus, we design an automatic Generation Alignment mechanism to align generation intents at the semantic level. Experiments demonstrate that our FREE-Switch (Frequency-based Efficient and Dynamic LoRA Switch) framework efficiently combines adapters for different objects and styles, substantially reducing the training cost of high-quality customized generation.
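One plausible reading of "frequency-domain importance" is the share of a latent's spectral energy in low frequencies, computed per diffusion step; the sketch below is an assumption about the criterion, not FREE-Switch's actual rule.

```python
import torch

def low_freq_energy_share(latent, cutoff_ratio=0.25):
    """latent: (B, C, H, W). Returns the fraction of spectral energy inside a
    low-frequency disc. A coarse-structure adapter might be up-weighted at
    steps where this is high and a texture/style adapter where it is low
    (a hypothetical switching signal)."""
    spec = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    energy = spec.abs() ** 2
    H, W = energy.shape[-2:]
    yy = torch.arange(H, device=latent.device) - H // 2
    xx = torch.arange(W, device=latent.device) - W // 2
    radius = ((yy[:, None] ** 2 + xx[None, :] ** 2).float()).sqrt()
    low_mask = radius <= cutoff_ratio * min(H, W)
    low = energy[..., low_mask].sum(-1)                    # (B, C)
    total = energy.sum(dim=(-2, -1)).clamp_min(1e-12)
    return low / total
```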
[357] LVSum: A Benchmark for Timestamp-Aware Long Video Summarization
Alkesh Patel, Melis Ozyildirim, Ying-Chang Cheng, Ganesh Nagarajan
Main category: cs.CV
TL;DR: LVSum is a new benchmark for evaluating long video summarization in MLLMs, focusing on temporal alignment and semantic grounding across 13 domains.
Details
Motivation: Current MLLMs struggle with maintaining temporal fidelity and producing semantically/temporally grounded summaries for long videos, necessitating a specialized benchmark for evaluation.
Method: Created LVSum benchmark with human-annotated long-form videos across 13 domains, each with precise temporal references in summaries. Evaluated proprietary and open-source MLLMs using new LLM-based metrics for content relevance and modality coherence plus standard metrics.
Result: Revealed systematic gaps in temporal understanding among existing MLLMs, providing insights for advancing temporal reasoning in long video summarization.
Conclusion: LVSum establishes a new foundation for evaluating and improving temporal reasoning capabilities in MLLMs for long video summarization tasks.
Abstract: Long video summarization presents significant challenges for current multimodal large language models (MLLMs), particularly in maintaining temporal fidelity over extended durations and producing summaries that are both semantically and temporally grounded. In this work, we present LVSum, a human-annotated benchmark designed specifically for evaluating long video summarization with fine-grained temporal alignment. LVSum comprises diverse long-form videos across 13 domains, each paired with human-generated summaries containing precise temporal references. We conduct a comprehensive evaluation of both proprietary and open-source MLLMs on LVSum, assessing performance using newly introduced LLM-based metrics for content relevance and modality coherence, alongside standard evaluation metrics. Our experiments reveal systematic gaps in temporal understanding among existing MLLMs and offer insights that establish a new foundation for advancing temporal reasoning in long video summarization.
[358] SinkTrack: Attention Sink based Context Anchoring for Large Language Models
Xu Liu, Guikun Chen, Wenguan Wang
Main category: cs.CV
TL;DR: SinkTrack is a training-free method that uses the attention sink phenomenon (LLMs’ tendency to focus on the first token) to anchor context by injecting key features into the BOS token, reducing hallucination and context forgetting in both textual and multimodal tasks.
Details
Motivation: LLMs suffer from hallucination and context forgetting due to attention drift, where models shift focus away from initial context toward newly generated tokens. The paper aims to counteract this by leveraging the attention sink phenomenon.
Method: SinkTrack treats the BOS token as an information anchor and injects key contextual features (from input image or instruction) into its representation. This training-free, plug-and-play method maintains attention to initial context throughout generation with minimal inference overhead.
Result: Experiments show SinkTrack mitigates hallucination and context forgetting across textual (+21.6% on SQuAD2.0 with Llama3.1-8B-Instruct) and multimodal tasks (+22.8% on M3CoT with Qwen2.5-VL-7B-Instruct), with consistent gains across architectures and scales.
Conclusion: SinkTrack effectively addresses attention drift in LLMs by leveraging attention sink, providing a robust, generalizable solution for reducing hallucination and context forgetting in both textual and multimodal applications.
Abstract: Large language models (LLMs) suffer from hallucination and context forgetting. Prior studies suggest that attention drift is a primary cause of these problems, where LLMs’ focus shifts towards newly generated tokens and away from the initial input context. To counteract this, we make use of a related, intrinsic characteristic of LLMs: attention sink – the tendency to consistently allocate high attention to the very first token (i.e., the BOS token).
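As described, the mechanism amounts to blending pooled context features into the first token's hidden state; a minimal sketch, with the pooling and blend factor as assumptions rather than the paper's exact injection rule.

```python
import torch

def anchor_bos(hidden_states, context_feats, alpha=0.5):
    """hidden_states: (B, T, D) layer activations; context_feats: (B, N, D)
    key features from the input image or instruction. Writes a blend of the
    pooled context into the BOS (first-token) position, so the attention
    sink carries the initial context through generation."""
    anchor = context_feats.mean(dim=1)                       # (B, D) pooled
    out = hidden_states.clone()
    out[:, 0] = (1 - alpha) * out[:, 0] + alpha * anchor
    return out
```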
[359] Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
Gordon Chen, Ziqi Huang, Ziwei Liu
Main category: cs.CV
TL;DR: Prompt Relay is an inference-time method for video diffusion models that enables fine-grained temporal control over multiple events in generated videos by using a penalty mechanism in cross-attention to ensure each temporal segment attends only to its assigned prompt.
Details
Motivation: Current video diffusion models struggle with temporal succession of multiple events, lack mechanisms to control when concepts appear/duration/order, and suffer from semantic entanglement when using paragraph-style prompts, especially problematic for movie-grade video synthesis requiring coherent storytelling.
Method: Prompt Relay is a plug-and-play inference-time method that introduces a penalty into the cross-attention mechanism, ensuring each temporal segment attends only to its assigned prompt, allowing representation of one semantic concept at a time without architectural modifications or computational overhead.
Result: The method improves temporal prompt alignment, reduces semantic interference between events, and enhances visual quality in multi-event video generation while maintaining computational efficiency.
Conclusion: Prompt Relay provides effective temporal control for video diffusion models, addressing key limitations in multi-event video generation and enabling more coherent storytelling capabilities.
Abstract: Video diffusion models have achieved remarkable progress in generating high-quality videos. However, these models struggle to represent the temporal succession of multiple events in real-world videos and lack explicit mechanisms to control when semantic concepts appear, how long they persist, and the order in which multiple events occur. Such control is especially important for movie-grade video synthesis, where coherent storytelling depends on precise timing, duration, and transitions between events. When using a single paragraph-style prompt to describe a sequence of complex events, models often exhibit semantic entanglement, where concepts intended for different moments in the video bleed into one another, resulting in poor text-video alignment. To address these limitations, we propose Prompt Relay, an inference-time, plug-and-play method to enable fine-grained temporal control in multi-event video generation, requiring no architectural modifications and no additional computational overhead. Prompt Relay introduces a penalty into the cross-attention mechanism, so that each temporal segment attends only to its assigned prompt, allowing the model to represent one semantic concept at a time and thereby improving temporal prompt alignment, reducing semantic interference, and enhancing visual quality.
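The penalty described above can be realized as an additive bias on cross-attention logits; a sketch, assuming each video token and each prompt token carries a temporal-segment index.

```python
import torch

def prompt_relay_bias(frame_seg_ids, prompt_seg_ids, penalty=-1e4):
    """frame_seg_ids: (Nv,) segment index per video token;
    prompt_seg_ids: (Nt,) segment index per prompt token.
    Off-segment pairs receive a large negative bias, so after softmax each
    temporal segment attends (almost) only to its assigned prompt."""
    mismatch = frame_seg_ids[:, None] != prompt_seg_ids[None, :]   # (Nv, Nt)
    return mismatch.float() * penalty

# Sketch of use inside cross-attention:
# logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
# attn = (logits + prompt_relay_bias(f_ids, p_ids)).softmax(dim=-1)
```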
[360] Counting to Four is still a Chore for VLMs
Duy Le Dinh Anh, Patrick Amadeus Irawan, Tuan Van Vo
Main category: cs.CV
TL;DR: Vision-language models struggle with simple object counting despite complex reasoning capabilities, with failures stemming from visual evidence degradation in language layers rather than just visual perception limits.
Details
Motivation: VLMs perform well on complex multimodal reasoning but fail at simple grounding skills like object counting. Existing evaluations only assess final outputs without revealing where failures occur inside the model architecture.
Method: Introduces COUNTINGTRICKS evaluation suite for controlled shape-based counting tests. Uses attention analysis and component-wise probing to trace visual evidence flow. Evaluates Modality Attention Share (MAS) intervention to enforce minimum visual attention during answer generation.
Result: Count-relevant visual evidence is strongest in modality projection stage but degrades substantially in later language layers where models become more susceptible to text priors. MAS intervention shows counting failures stem from underuse of visual evidence during language-stage reasoning.
Conclusion: Counting failures in VLMs result not only from visual perception limits but also from insufficient use of visual evidence during language-stage reasoning, suggesting architectural interventions can improve grounding capabilities.
Abstract: Vision–language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from visual perception limits, but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at https://github.com/leduy99/-CVPRW26-Modality-Attention-Share.
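A Modality-Attention-Share-style intervention can be sketched as post-softmax rescaling that guarantees visual tokens a minimum probability mass; the budget value and rescaling rule below are assumptions, not the paper's implementation.

```python
import torch

def enforce_visual_share(attn, visual_mask, min_share=0.2):
    """attn: (..., T) post-softmax attention over keys;
    visual_mask: (T,) bool marking visual tokens. Where visual mass falls
    below min_share, rescale so visual tokens get exactly min_share and
    text tokens share the remainder."""
    vis = attn * visual_mask
    txt = attn * ~visual_mask
    vis_mass = vis.sum(-1, keepdim=True)
    rescaled = (min_share * vis / vis_mass.clamp_min(1e-9)
                + (1 - min_share) * txt / (1 - vis_mass).clamp_min(1e-9))
    return torch.where(vis_mass < min_share, rescaled, attn)
```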
[361] Intra-finger Variability of Diffusion-based Latent Fingerprint Generation
Noor Hussein, Anil K. Jain, Karthik Nandakumar
Main category: cs.CV
TL;DR: Systematic evaluation of synthetic fingerprint generation using diffusion models, focusing on intra-finger variability and latent style diversity enhancement through a comprehensive style bank.
Details
Motivation: To address limitations in existing synthetic fingerprint generators by systematically evaluating intra-finger variability and enhancing latent style diversity for more realistic and diverse synthetic fingerprint generation.
Method: Constructed a comprehensive latent style bank from seven diverse datasets to enable synthesis of latent prints with over 40 distinct styles. Implemented a semi-automated framework to analyze fingerprint ridge and minutiae integrity in generated impressions.
Result: Generation process largely preserves identity but introduces small local inconsistencies (addition/removal of minutiae), especially in poor quality reference image regions. Mismatch between reference image and style embedding causes global inconsistencies and hallucinated ridge patterns.
Conclusion: Existing synthetic fingerprint generators have limitations in simultaneously achieving diversity and identity consistency, highlighting the need for improved models that balance both aspects.
Abstract: The primary goal of this work is to systematically evaluate the intra-finger variability of synthetic fingerprints (particularly latent prints) generated using a state-of-the-art diffusion model. Specifically, we focus on enhancing the latent style diversity of the generative model by constructing a comprehensive \textit{latent style bank} curated from seven diverse datasets, which enables the precise synthesis of latent prints with over 40 distinct styles encapsulating different surfaces and processing techniques. We also implement a semi-automated framework to understand the integrity of fingerprint ridges and minutiae in the generated impressions. Our analysis indicates that though the generation process largely preserves the identity, a small number of local inconsistencies (addition and removal of minutiae) are introduced, especially when there are poor quality regions in the reference image. Furthermore, mismatch between the reference image and the chosen style embedding that guides the generation process introduces global inconsistencies in the form of hallucinated ridge patterns. These insights highlight the limitations of existing synthetic fingerprint generators and the need to further improve these models to simultaneously enhance both diversity and identity consistency.
[362] U$^{2}$Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation
Xunpei Sun, Wenwei Lin, Yi Chang, Gang Chen
Main category: cs.CV
TL;DR: U²Flow is the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty using a decoupled learning strategy with Laplace-based maximum likelihood objective.
Details
Motivation: Unsupervised optical flow methods typically lack reliable uncertainty estimation, which limits their robustness and interpretability. There's a need for methods that can jointly estimate both optical flow and uncertainty without ground truth supervision.
Method: Proposes U²Flow with a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective. The predicted uncertainty guides adaptive flow refinement, modulates regional smoothness loss, and enables uncertainty-guided bidirectional flow fusion.
Result: Extensive experiments on KITTI and Sintel show state-of-the-art performance among unsupervised methods while producing highly reliable uncertainty maps.
Conclusion: U²Flow effectively demonstrates the benefits of joint optical flow and uncertainty estimation in an unsupervised framework, improving both performance and interpretability.
Abstract: Unsupervised optical flow methods typically lack reliable uncertainty estimation, limiting their robustness and interpretability. We propose U$^{2}$Flow, the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty. The core innovation is a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective, enabling stable training without ground truth. The predicted uncertainty is further integrated into the network to guide adaptive flow refinement and dynamically modulate the regional smoothness loss. Furthermore, we introduce an uncertainty-guided bidirectional flow fusion mechanism that enhances robustness in challenging regions. Extensive experiments on KITTI and Sintel demonstrate that U$^{2}$Flow achieves state-of-the-art performance among unsupervised methods while producing highly reliable uncertainty maps, validating the effectiveness of our joint estimation paradigm. The code is available at https://github.com/sunzunyi/U2FLOW.
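The Laplace-based maximum likelihood objective has a standard closed form: for residual r and predicted scale b, the per-pixel negative log-likelihood is |r|/b + log b (up to a constant). A sketch, with the augmentation-consistency pseudo label standing in as the target:

```python
import torch

def laplace_nll(flow_pred, flow_pseudo, log_b):
    """flow_pred, flow_pseudo: (B, 2, H, W); log_b: (B, 1, H, W) predicted
    per-pixel log scale (the uncertainty map). The log 2 constant of the
    Laplace density is dropped; the exact supervision details are the paper's."""
    b = log_b.exp()
    nll = (flow_pred - flow_pseudo).abs().sum(dim=1, keepdim=True) / b + log_b
    return nll.mean()
```

Large residuals push log_b up, which is what lets the uncertainty map later modulate refinement and the smoothness loss.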
[363] On The Application of Linear Attention in Multimodal Transformers
Armin Gerami, Seyedehanita Madani, Ramani Duraiswami
Main category: cs.CV
TL;DR: Linear Attention reduces multimodal Transformer computational complexity from quadratic to linear while maintaining competitive performance on vision-language tasks.
Details
Motivation: Multimodal Transformers have quadratic attention complexity that limits scalability for large vision-language models, creating a need for more efficient alternatives.
Method: Integrates Linear Attention (LA) into multimodal frameworks, evaluates across ViT-S/16, ViT-B/16, and ViT-L/16 architectures trained on the LAION-400M dataset, and validates with ImageNet-21K zero-shot accuracy.
Result: Linear Attention achieves significant computational savings while preserving competitive performance and following the same scaling laws as standard softmax attention.
Conclusion: Linear Attention is a robust, scalable solution for next-generation multimodal Transformers processing large, complex datasets.
Abstract: Multimodal Transformers serve as the backbone for state-of-the-art vision-language models, yet their quadratic attention complexity remains a critical barrier to scalability. In this work, we investigate the viability of Linear Attention (LA) as a high-efficiency alternative within multimodal frameworks. By integrating LA, we reduce the computational overhead from quadratic to linear relative to sequence length while preserving competitive performance. We evaluate our approach across ViT-S/16, ViT-B/16, and ViT-L/16 architectures trained on the LAION-400M dataset, with validation focused on ImageNet-21K zero-shot accuracy. Our systematic evaluation demonstrates that Linear Attention not only yields significant computational savings but also adheres to the same scaling laws as standard softmax attention. These findings position Linear Attention as a robust, scalable solution for next-generation multimodal Transformers tasked with processing increasingly large and complex datasets.
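For reference, the standard kernelized form of linear attention replaces softmax(QK^T)V with phi(Q)(phi(K)^T V), dropping the cost from O(N^2 d) to O(N d^2); the paper may use a different feature map, so the elu(x)+1 choice below is one common formulation.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (B, H, N, D). Katharopoulos-style linear attention with the
    elu(x) + 1 feature map; the (K^T V) summary is built once and shared by
    every query, which is where the linear complexity comes from."""
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", k, v)            # (B, H, D, D_v)
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
```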
[364] Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation
Yebo Wu, Han Jin, Zhijiang Guo, Li Li
Main category: cs.CV
TL;DR: DaID is a contrastive decoding framework that reduces hallucinations in MLLMs by identifying and leveraging internal perceptual discrepancies between visual and textual signals.
Details
Motivation: Multimodal LLMs suffer from hallucination where generated text contradicts visual content, despite having strong reasoning capabilities. There's a need to better align text generation with visual evidence.
Method: Dual-Anchor Introspective Decoding (DaID) uses contrastive decoding that dynamically calibrates token generation by mining the model’s internal perceptual discrepancies. It identifies a Spotlight layer to amplify visual factual signals and a Shadow layer to suppress textual inertia, guided by visual attention distributions.
Result: Experimental results across multiple benchmarks and MLLMs show DaID significantly mitigates hallucination while enhancing general reasoning capabilities.
Conclusion: DaID effectively reduces hallucinations in MLLMs by leveraging internal model representations to better align text generation with visual evidence, improving both factual accuracy and reasoning.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities yet continue to suffer from hallucination, where generated text contradicts visual content. In this paper, we introduce Dual-Anchor Introspective Decoding (DaID), a novel contrastive decoding framework that dynamically calibrates each token generation by mining the model’s internal perceptual discrepancies. Specifically, DaID identifies a Spotlight layer to amplify visual factual signals and a Shadow layer to suppress textual inertia. By leveraging visual attention distributions to guide this dual-anchor selection process, our method ensures precise, token-specific adaptation. Experimental results across multiple benchmarks and MLLMs demonstrate that DaID significantly mitigates hallucination while enhancing general reasoning capabilities.
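The dual-anchor combination follows the usual contrastive-decoding pattern: boost logits from the layer carrying visual evidence and subtract those from the layer dominated by text priors. A generic sketch; DaID's per-token calibration and layer selection are richer than this fixed-alpha form.

```python
import torch

def contrastive_logits(logits_spotlight, logits_shadow, alpha=1.0):
    """logits_*: (B, V) next-token logits read out at the two anchor layers.
    alpha and this fixed linear form are assumptions for illustration."""
    return (1 + alpha) * logits_spotlight - alpha * logits_shadow
```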
[365] DocRevive: A Unified Pipeline for Document Text Restoration
Kunal Purkayastha, Ayan Banerjee, Josep Llados, Umapada Pal
Main category: cs.CV
TL;DR: A unified pipeline for document restoration combining OCR, image analysis, masked language modeling, and diffusion models to reconstruct damaged/occluded text while preserving visual integrity.
Details
Motivation: Document understanding tasks suffer from damaged, occluded, or incomplete text, which remains a critical but unexplored problem. A document reconstruction process could benefit subsequent document understanding tasks.
Method: Combines state-of-the-art OCR, advanced image analysis, masked language modeling, and diffusion-based models. Pipeline detects/recognizes text, identifies degradation with occlusion detector, uses inpainting for semantically coherent reconstruction, and diffusion module reintegrates text matching font, size, and alignment.
Result: Created synthetic dataset of 30,078 degraded document images simulating diverse degradation scenarios. Proposed Unified Context Similarity Metric (UCSM) for evaluation. Pipeline advances document restoration and sets new standard for text reconstruction.
Conclusion: The work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction. Dataset and code are publicly available.
Abstract: In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30,078 degraded document images that simulates diverse document degradation scenarios, setting a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degradation with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module seamlessly reintegrates text, matching font, size, and alignment. To evaluate restoration quality, we propose a Unified Context Similarity Metric (UCSM), incorporating edit, semantic, and length similarities with a contextual predictability measure that penalizes deviations when the correct text is contextually obvious. Our work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction. The OPRB dataset and code are available at \href{https://huggingface.co/datasets/kpurkayastha/OPRB}{Hugging Face} and \href{https://github.com/kunalpurkayastha/DocRevive}{Github} respectively.
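The UCSM is described as a weighted combination of edit, semantic, and length similarities; the heavily simplified sketch below omits the contextual-predictability term and uses assumed weights.

```python
from difflib import SequenceMatcher

def ucsm_sketch(pred: str, truth: str, sem_sim, w=(0.4, 0.4, 0.2)) -> float:
    """sem_sim: any callable returning a [0, 1] semantic similarity (e.g.,
    cosine over sentence embeddings). The weights and the omission of the
    contextual-predictability penalty make this illustrative only."""
    edit = SequenceMatcher(None, pred, truth).ratio()
    length = min(len(pred), len(truth)) / max(len(pred), len(truth), 1)
    return w[0] * edit + w[1] * sem_sim(pred, truth) + w[2] * length
```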
[366] Attention-Guided Dual-Stream Learning for Group Engagement Recognition: Fusing Transformer-Encoded Motion Dynamics with Scene Context via Adaptive Gating
Saniah Kayenat Chowdhury, Muhammad E. H. Chowdhury
Main category: cs.CV
TL;DR: DualEngage: A two-stream framework for group-level engagement recognition from classroom videos that combines individual motion dynamics with scene-level spatiotemporal information.
Details
Motivation: Existing engagement recognition methods focus on online classrooms or individual-level estimation, but group-level engagement in physical classrooms is crucial for learning outcomes and requires modeling both individual behaviors and group dynamics.
Method: Two-stream framework: primary stream models person-level motion dynamics (detecting/tracking students, extracting optical flow with RAFT, transformer encoding, attention pooling); secondary stream captures scene-level spatiotemporal information using 3D ResNet. Features combined via softmax-gated fusion.
Result: Achieved 0.9621±0.0161 classification accuracy and 0.9530±0.0204 macro-averaged F1 on Classroom Group Engagement Dataset using fivefold cross-validation. Ablation study shows contributions of both streams.
Conclusion: DualEngage effectively models group engagement by jointly considering individual actions and group dynamics, representing one of the first dual-stream approaches in classroom engagement recognition that explicitly leverages motion cues.
Abstract: Student engagement is crucial for improving learning outcomes in group activities. Highly engaged students perform better individually and contribute to overall group success. However, most existing automated engagement recognition methods are designed for online classrooms or estimate engagement at the individual level. Addressing this gap, we propose DualEngage, a novel two-stream framework for group-level engagement recognition from in-classroom videos. It models engagement as a joint function of both individual and group-level behaviors. The primary stream models person-level motion dynamics by detecting and tracking students, extracting dense optical flow with the Recurrent All-Pairs Field Transforms network, encoding temporal motion patterns using a transformer encoder, and finally aggregating per-student representations through attention pooling into a unified representation. The secondary stream captures scene-level spatiotemporal information from the full video clip, leveraging a pretrained three-dimensional Residual Network. The two-stream representations are combined via softmax-gated fusion, which dynamically weights each stream’s contribution based on the joint context of both features. DualEngage learns a joint representation of individual actions with overarching group dynamics. We evaluate the proposed approach using fivefold cross-validation on the Classroom Group Engagement Dataset developed by Ocean University of China, achieving an average classification accuracy of 0.9621±0.0161 with a macro-averaged F1 of 0.9530±0.0204. To understand the contribution of each branch, we further conduct an ablation study comparing single-stream variants against the two-stream model. This work is among the first in classroom engagement recognition to adopt a dual-stream design that explicitly leverages motion cues for group engagement estimation.
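Softmax-gated fusion of the two streams is simple to sketch: a small head reads both features and emits convex weights. The dimensions below are illustrative.

```python
import torch
import torch.nn as nn

class SoftmaxGatedFusion(nn.Module):
    """Emits two weights that sum to one, conditioned on the joint context of
    both stream features, and mixes the streams accordingly."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, motion_feat, scene_feat):        # both (B, dim)
        w = self.gate(torch.cat([motion_feat, scene_feat], dim=-1)).softmax(-1)
        return w[:, :1] * motion_feat + w[:, 1:] * scene_feat
```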
[367] MatRes: Zero-Shot Test-Time Model Adaptation for Simultaneous Matching and Restoration
Kanggeon Lee, Soochahn Lee, Kyoung Mu Lee
Main category: cs.CV
TL;DR: MatRes is a zero-shot test-time adaptation framework that jointly improves image restoration and geometric matching using only a single low-quality/high-quality image pair, addressing their mutual interference in real-world scenarios with severe degradations and viewpoint changes.
Details
Motivation: Real-world image pairs often have both severe degradations and large viewpoint changes, making image restoration and geometric matching mutually interfering tasks when treated independently. Existing approaches handle these separately, but they interfere with each other in practice.
Method: MatRes is a zero-shot test-time adaptation framework that uses only a single low-quality and high-quality image pair. It enforces conditional similarity at corresponding locations, updating only lightweight modules while keeping all pretrained components frozen, requiring no offline training or additional supervision.
Result: Extensive experiments across diverse combinations show that MatRes yields significant gains in both restoration and geometric alignment compared to using either restoration or matching models alone.
Conclusion: MatRes offers a practical and widely applicable solution for real-world scenarios where users commonly capture multiple images of a scene with varying viewpoints and quality, effectively addressing the often-overlooked mutual interference between matching and restoration.
Abstract: Real-world image pairs often exhibit both severe degradations and large viewpoint changes, making image restoration and geometric matching mutually interfering tasks when treated independently. In this work, we propose MatRes, a zero-shot test-time adaptation framework that jointly improves restoration quality and correspondence estimation using only a single low-quality and high-quality image pair. By enforcing conditional similarity at corresponding locations, MatRes updates only lightweight modules while keeping all pretrained components frozen, requiring no offline training or additional supervision. Extensive experiments across diverse combinations show that MatRes yields significant gains in both restoration and geometric alignment compared to using either restoration or matching models alone. MatRes offers a practical and widely applicable solution for real-world scenarios where users commonly capture multiple images of a scene with varying viewpoints and quality, effectively addressing the often-overlooked mutual interference between matching and restoration.
[368] Active Diffusion Matching: Score-based Iterative Alignment of Cross-Modal Retinal Images
Kanggeon Lee, Su Jeong Song, Soochahn Lee, Kyoung Mu Lee
Main category: cs.CV
TL;DR: ADM is a novel cross-modal alignment method using dual diffusion models to align standard and ultra-widefield fundus images with state-of-the-art accuracy.
Details
Motivation: Aligning Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs) is challenging due to substantial differences in viewing range and retina's amorphous appearance, with no specialized methods currently available.
Method: Active Diffusion Matching (ADM) integrates two interdependent score-based diffusion models to jointly estimate global transformations and local deformations via iterative Langevin Markov chain, with custom sampling strategies for input adaptability.
Result: ADM achieves state-of-the-art alignment accuracy with mAUC improvements of 5.2 points on private SFI-UWFI dataset and 0.4 points on public SFI-SFI dataset compared to existing methods.
Conclusion: ADM effectively bridges the gap in aligning SFIs and UWFIs, providing an innovative solution to a previously unaddressed challenge through joint optimization of global and local alignment.
Abstract: Objective: The study aims to address the challenge of aligning Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which is difficult due to their substantial differences in viewing range and the amorphous appearance of the retina. Currently, no specialized method exists for this task, and existing image alignment techniques lack accuracy. Methods: We propose Active Diffusion Matching (ADM), a novel cross-modal alignment method. ADM integrates two interdependent score-based diffusion models to jointly estimate global transformations and local deformations via an iterative Langevin Markov chain. This approach facilitates a stochastic, progressive search for optimal alignment. Additionally, custom sampling strategies are introduced to enhance the adaptability of ADM to given input image pairs. Results: Comparative experimental evaluations demonstrate that ADM achieves state-of-the-art alignment accuracy. This was validated on two datasets: a private dataset of SFI-UWFI pairs and a public dataset of SFI-SFI pairs, with mAUC improvements of 5.2 and 0.4 points on the private and public datasets, respectively, compared to existing state-of-the-art methods. Conclusion: ADM effectively bridges the gap in aligning SFIs and UWFIs, providing an innovative solution to a previously unaddressed challenge. The method’s ability to jointly optimize global and local alignment makes it highly effective for cross-modal image alignment tasks. Significance: ADM has the potential to transform the integrated analysis of SFIs and UWFIs, enabling better clinical utility and supporting learning-based image enhancements. This advancement could significantly improve diagnostic accuracy and patient outcomes in ophthalmology.
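The iterative Langevin Markov chain at ADM's core is, generically, repeated score-guided noisy updates; one step looks like the sketch below, with score_fn assumed to wrap the paper's two diffusion models.

```python
import torch

def langevin_step(params, score_fn, step_size=1e-3):
    """params: current alignment parameters (e.g., a flattened global
    transform plus deformation field). One unadjusted Langevin update:
    move along the learned score, then add Gaussian exploration noise."""
    noise = torch.randn_like(params)
    return params + step_size * score_fn(params) + (2 * step_size) ** 0.5 * noise
```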
[369] Particle Diffusion Matching: Random Walk Correspondence Search for the Alignment of Standard and Ultra-Widefield Fundus Images
Kanggeon Lee, Soochahn Lee, Kyoung Mu Lee
Main category: cs.CV
TL;DR: PDM uses diffusion-guided random walk search to align standard and ultra-widefield retinal images, achieving state-of-the-art performance for multimodal integration in ophthalmology.
Details
Motivation: Aligning Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs) is challenging due to scale differences, appearance variations, and lack of distinctive features, limiting multimodal analysis in ophthalmology.
Method: Particle Diffusion Matching (PDM) uses iterative Random Walk Correspondence Search guided by a diffusion model that estimates displacement vectors considering local appearance, particle structural distribution, and global transformation.
Result: PDM achieves state-of-the-art performance across multiple retinal image alignment benchmarks, with substantial improvement on SFI-UWFI pairs and effectiveness in real-world clinical scenarios.
Conclusion: PDM provides accurate, scalable correspondence estimation for retinal image alignment, overcoming limitations of existing methods and enabling better multimodal integration for disease diagnosis and analysis.
Abstract: We propose a robust alignment technique for Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which are challenging to align due to differences in scale, appearance, and the scarcity of distinctive features. Our method, termed Particle Diffusion Matching (PDM), performs alignment through an iterative Random Walk Correspondence Search (RWCS) guided by a diffusion model. At each iteration, the model estimates displacement vectors for particle points by considering local appearance, the structural distribution of particles, and an estimated global transformation, enabling progressive refinement of correspondences even under difficult conditions. PDM achieves state-of-the-art performance across multiple retinal image alignment benchmarks, showing substantial improvement on a primary dataset of SFI-UWFI pairs and demonstrating its effectiveness in real-world clinical scenarios. By providing accurate and scalable correspondence estimation, PDM overcomes the limitations of existing methods and facilitates the integration of complementary retinal image modalities. This diffusion-guided search strategy offers a new direction for improving downstream supervised learning, disease diagnosis, and multi-modal image analysis in ophthalmology.
[370] Global monitoring of methane point sources using deep learning on hyperspectral radiance measurements from EMIT
Vishal V. Batchu, Michelangelo Conserva, Alex Wilson, Anna M. Michalak, Varun Gulshan, Philip G. Brodrick, Andrew K. Thorpe, Christopher V. Arsdale
Main category: cs.CV
TL;DR: MAPL-EMIT: An end-to-end vision transformer framework for detecting methane plumes from space-based imaging spectroscopy, enabling automated global methane monitoring with improved detection limits.
Details
Motivation: Current space-based methane monitoring relies heavily on manual plume identification, which is labor-intensive and not scalable for global monitoring. There's a need for automated, high-throughput methods to identify methane point sources that contribute to climate forcing and safety hazards.
Method: Developed MAPL-EMIT, an end-to-end vision transformer framework that leverages complete radiance spectrum from EMIT instrument. Uses spectral and spatial context to jointly retrieve methane enhancements across all pixels. Trained on 3.6 million physics-based synthetic plumes injected into global EMIT radiance data.
Result: Model captures 79% of known hand-annotated plume complexes across test set of 1084 EMIT granules, while identifying twice as many plausible plumes as human analysts. Validated against airborne data, landfills, and controlled release experiments. Enables high-throughput implementation on full EMIT catalog.
Conclusion: MAPL-EMIT shifts methane monitoring from labor-intensive workflows to a rapid, scalable paradigm for global plume mapping at facility scale, enabling automated detection of previously uncaptured methane sources.
Abstract: Anthropogenic methane (CH4) point sources drive near-term climate forcing, safety hazards, and system inefficiencies. Space-based imaging spectroscopy is emerging as a tool for identifying emissions globally, but existing approaches largely rely on manual plume identification. Here we present the Methane Analysis and Plume Localization with EMIT (MAPL-EMIT) model, an end-to-end vision transformer framework that leverages the complete radiance spectrum from the Earth Surface Mineral Dust Source Investigation (EMIT) instrument to jointly retrieve methane enhancements across all pixels within a scene. This approach brings together spectral and spatial context to significantly lower detection limits. MAPL-EMIT simultaneously supports enhancement quantification, plume delineation, and source localization, even for multiple overlapping plumes. The model was trained on 3.6 million physics-based synthetic plumes injected into global EMIT radiance data. Synthetic evaluation confirms the model’s ability to identify plumes with high recall and precision and to capture weaker plumes relative to existing matched-filter approaches. On real-world benchmarks, MAPL-EMIT captures 79% of known hand-annotated NASA L2B plume complexes across a test set of 1084 EMIT granules, while capturing twice as many plausible plumes as human analysts identified. Further validation against coincident airborne data, top-emitting landfills, and controlled release experiments confirms the model’s ability to identify previously uncaptured sources. By incorporating model-generated metrics such as spectral fit scores and estimated noise levels, the framework can further limit false-positive rates. Overall, MAPL-EMIT enables high-throughput implementation on the full EMIT catalog, shifting methane monitoring from labor-intensive workflows to a rapid, scalable paradigm for global plume mapping at the facility scale.
[371] Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
Yu Jiang, Hanwen Jiang, Ahmed Abdelkader, Wen-Sheng Chu, Brandon Y. Feng, Zhangyang Wang, Qixing Huang
Main category: cs.CV
TL;DR: Analysis of LoRA subspaces for 3D foundation models, showing disentangled subspaces for different variations (texture, geometry, camera, lighting) that generalize from synthetic to real data.
Details
Motivation: With the rise of 3D foundation models, there's interest in fine-tuning them for downstream tasks using LoRA. Since 3D datasets have distinct variations in texture, geometry, camera motion, and lighting, the paper investigates whether there are LoRA subspaces associated with each variation type, if they're disentangled, and how to compute them effectively.
Method: The approach generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, extracts a LoRA subspace for each variation type, and shows these subspaces are approximately disentangled. Integrating them creates a reduced LoRA subspace for efficient fine-tuning with improved accuracy. See the sketch below.
Result: The method demonstrates that LoRA subspaces for different 3D variations are approximately disentangled. The reduced LoRA subspace derived from synthetic data generalizes to real datasets, enabling efficient fine-tuning with improved prediction accuracy for downstream tasks.
Conclusion: The paper provides answers to fundamental questions about LoRA subspaces in 3D foundation models, showing disentangled subspaces for different variations that can be effectively computed and generalized from synthetic to real data, improving fine-tuning efficiency and accuracy.
Abstract: With the emergence of 3D foundation models, there is growing interest in fine-tuning them for downstream tasks, where LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in texture, geometry, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA subspaces associated with each type of variation? 2) Are these subspaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA subspace associated with each type of variation. We show that these subspaces are approximately disentangled. Integrating them leads to a reduced LoRA subspace that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks. In particular, we show that such a reduced LoRA subspace, despite being derived entirely from synthetic data, generalizes to real datasets. An ablation study validates the effectiveness of the choices in our approach.
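The subspace-extraction step lends itself to a compact illustration. Below is a minimal sketch, assuming (as the abstract suggests but does not fully specify) that each variation's subspace is taken as the top singular directions of LoRA updates trained on that variation, and that disentanglement is checked via principal angles; `lora_subspace` and `principal_angles_deg` are hypothetical helpers, not the authors' code.

```python
import numpy as np

def lora_subspace(delta_ws, rank=8):
    """delta_ws: list of LoRA update matrices (each d_out x d_in) trained on
    one controlled-variation dataset. Returns an orthonormal basis (d_out x rank)."""
    stacked = np.concatenate([dw.reshape(dw.shape[0], -1) for dw in delta_ws], axis=1)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, :rank]

def principal_angles_deg(basis_a, basis_b):
    """Angles near 90 degrees indicate the two subspaces are ~orthogonal (disentangled)."""
    s = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return np.degrees(np.arccos(np.clip(s, -1.0, 1.0)))

# Hypothetical usage: compare a 'texture' subspace against a 'lighting' one.
rng = np.random.default_rng(0)
tex = lora_subspace([rng.standard_normal((256, 256)) for _ in range(4)])
lit = lora_subspace([rng.standard_normal((256, 256)) for _ in range(4)])
print(principal_angles_deg(tex, lit))  # random high-dim bases come out near-orthogonal
```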
[372] ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents
Dongjie Huo, Haoyun Liu, Guoqing Liu, Dekang Qi, Zhiming Sun, Maoguo Gao, Jianxin He, Yandan Yang, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu
Main category: cs.CV
TL;DR: ABot-Claw extends OpenClaw with embodied capabilities for multi-robot coordination, visual memory, and closed-loop feedback to bridge high-level reasoning with physical execution in open-world environments.
Details
Motivation: Current embodied AI systems have a gap between high-level reasoning and low-level physical execution. While VLA models provide perception and responses, they're open-loop. Existing planning agents work in closed sandboxes with limited real-system control. OpenClaw lacks embodied control for multi-robot execution.
Method: 1) Unified embodiment interface with capability-driven scheduling for heterogeneous robot coordination; 2) Visual-centric cross-embodiment multimodal memory for persistent context retention; 3) Critic-based closed-loop feedback with generalist reward model for online evaluation and replanning. Uses decoupled architecture across OpenClaw, shared service, and robot embodiment layers.
Result: ABot-Claw enables real-world interaction, closes the loop from natural language intent to physical action, and supports progressively self-evolving robotic agents in open, dynamic environments.
Conclusion: The system bridges the gap between reasoning and execution, providing a framework for embodied agents that can operate in open-world environments with closed-loop feedback and multi-robot coordination.
Abstract: Current embodied intelligent systems still face a substantial gap between high-level reasoning and low-level physical execution in open-world environments. Although Vision-Language-Action (VLA) models provide strong perception and intuitive responses, their open-loop nature limits long-horizon performance. Agents incorporating System 2 cognitive mechanisms improve planning, but usually operate in closed sandboxes with predefined toolkits and limited real-system control. OpenClaw provides a localized runtime with full system privileges, but lacks the embodied control architecture required for long-duration, multi-robot execution. We therefore propose ABot-Claw, an embodied extension of OpenClaw that integrates: 1) a unified embodiment interface with capability-driven scheduling for heterogeneous robot coordination; 2) a visual-centric cross-embodiment multimodal memory for persistent context retention and grounded retrieval; and 3) a critic-based closed-loop feedback mechanism with a generalist reward model for online progress evaluation, local correction, and replanning. With a decoupled architecture spanning the OpenClaw layer, shared service layer, and robot embodiment layer, ABot-Claw enables real-world interaction, closes the loop from natural language intent to physical action, and supports progressively self-evolving robotic agents in open, dynamic environments.
[373] Degradation-Consistent Paired Training for Robust AI-Generated Image Detection
Zongyou Yang, Yinghan Hou, Xiaokun Yang
Main category: cs.CV
TL;DR: DCPT improves AI-generated image detector robustness to real-world corruptions through paired training with feature and prediction consistency constraints, achieving significant gains without added parameters or inference overhead.
Details
Motivation: Current AI-generated image detectors perform poorly under real-world image corruptions like JPEG compression, Gaussian blur, and downsampling. Existing methods treat degradation robustness as a byproduct of data augmentation rather than an explicit training objective.
Method: Proposes Degradation-Consistent Paired Training (DCPT) that constructs clean and degraded views of each training image, then applies two constraints: feature consistency loss (cosine distance between representations) and prediction consistency loss (symmetric KL divergence between output distributions). See the sketch below.
Result: On Synthbuster benchmark (9 generators, 8 degradation conditions), DCPT improves degraded-condition average accuracy by 9.1 percentage points compared to baseline, with only 0.9% clean accuracy sacrifice. Most significant improvement under JPEG compression (+15.7% to +17.9%).
Conclusion: Training objective improvement (via DCPT) is more effective than architectural augmentation for degradation robustness. The method adds zero parameters and zero inference overhead while significantly improving robustness to real-world image corruptions.
Abstract: AI-generated image detectors suffer significant performance degradation under real-world image corruptions such as JPEG compression, Gaussian blur, and resolution downsampling. We observe that state-of-the-art methods, including B-Free, treat degradation robustness as a byproduct of data augmentation rather than an explicit training objective. In this work, we propose Degradation-Consistent Paired Training (DCPT), a simple yet effective training strategy that explicitly enforces robustness through paired consistency constraints. For each training image, we construct a clean view and a degraded view, then impose two constraints: a feature consistency loss that minimizes the cosine distance between clean and degraded representations, and a prediction consistency loss based on symmetric KL divergence that aligns output distributions across views. DCPT adds zero additional parameters and zero inference overhead. Experiments on the Synthbuster benchmark (9 generators, 8 degradation conditions) demonstrate that DCPT improves the degraded-condition average accuracy by 9.1 percentage points compared to an identical baseline without paired training, while sacrificing only 0.9% clean accuracy. The improvement is most pronounced under JPEG compression (+15.7% to +17.9%). Ablation further reveals that adding architectural components leads to overfitting on limited training data, confirming that training objective improvement is more effective than architectural augmentation for degradation robustness.
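The two DCPT constraints are concrete enough to sketch directly. The following is a minimal PyTorch illustration, assuming a detector that exposes per-image features and logits for each view; `dcpt_consistency` and the loss weights are hypothetical names, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dcpt_consistency(feat_clean, feat_deg, logits_clean, logits_deg,
                     lam_feat=1.0, lam_pred=1.0):
    # Feature consistency: minimize cosine distance between clean/degraded features.
    l_feat = (1.0 - F.cosine_similarity(feat_clean, feat_deg, dim=-1)).mean()
    # Prediction consistency: symmetric KL between the two output distributions.
    p = F.log_softmax(logits_clean, dim=-1)
    q = F.log_softmax(logits_deg, dim=-1)
    l_pred = 0.5 * (F.kl_div(q, p, log_target=True, reduction="batchmean")
                    + F.kl_div(p, q, log_target=True, reduction="batchmean"))
    return lam_feat * l_feat + lam_pred * l_pred
```

The total training loss would then add this term to the ordinary real/fake classification loss on both views, which is consistent with the "zero extra parameters, zero inference overhead" claim: the constraints exist only at training time.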
[374] Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Ruibin Li, Tao Yang, Fangzhou Ai, Tianhe Wu, Shilei Wen, Bingyue Peng, Lei Zhang
Main category: cs.CV
TL;DR: Hybrid Forcing: A streaming video generation method using hybrid attention (linear temporal + block-sparse) to preserve long-range dependencies while reducing computational overhead for real-time video generation.
Details
Motivation: Current streaming video generation methods using sliding window attention lose distant history during long video generation and have computational overhead that prevents real-time deployment. There's a need to balance temporal information retention with computational efficiency.
Method: Proposes Hybrid Forcing with three key components: 1) lightweight linear temporal attention to preserve long-range dependencies beyond sliding windows using compact key-value states, 2) block-sparse attention within local sliding windows to reduce redundant computation, and 3) decoupled distillation strategy tailored to the hybrid attention design for stable optimization. A sketch of the key-value state idea follows the abstract below.
Result: Achieves state-of-the-art performance on both short- and long-form video generation benchmarks. Achieves real-time, unbounded 832x480 video generation at 29.5 FPS on a single NVIDIA H100 GPU without quantization or model compression.
Conclusion: Hybrid Forcing effectively addresses the limitations of sliding window attention in streaming video generation by balancing temporal context preservation with computational efficiency, enabling real-time high-quality video generation.
Abstract: Streaming video generation (SVG) distills a pretrained bidirectional video diffusion model into an autoregressive model equipped with sliding window attention (SWA). However, SWA inevitably loses distant history during long video generation, and its computational overhead remains a critical challenge to real-time deployment. In this work, we propose Hybrid Forcing, which jointly optimizes temporal information retention and computational efficiency through a hybrid attention design. First, we introduce lightweight linear temporal attention to preserve long-range dependencies beyond the sliding window. In particular, we maintain a compact key-value state to incrementally absorb evicted tokens, retaining temporal context with negligible memory and computational overhead. Second, we incorporate block-sparse attention into the local sliding window to reduce redundant computation within short-range modeling, reallocating computational capacity toward more critical dependencies. Finally, we introduce a decoupled distillation strategy tailored to the hybrid attention design. A few-step initial distillation is performed under dense attention, then the distillation of our proposed linear temporal and block-sparse attention is activated for streaming modeling, ensuring stable optimization. Extensive experiments on both short- and long-form video generation benchmarks demonstrate that Hybrid Forcing consistently achieves state-of-the-art performance. Notably, our model achieves real-time, unbounded 832x480 video generation at 29.5 FPS on a single NVIDIA H100 GPU without quantization or model compression. The source code and trained models are available at https://github.com/leeruibin/hybrid-forcing.
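The compact key-value state for evicted tokens can be illustrated with standard linear-attention machinery. The sketch below is an approximation of the idea under stated assumptions (the paper's exact feature map and normalization are not given here); `CompactKVState` and the elu+1 feature map are illustrative choices.

```python
import torch

def phi(x):
    # Positive kernel feature map; elu(x)+1 is a common linear-attention choice.
    return torch.nn.functional.elu(x) + 1.0

class CompactKVState:
    """Running statistics S = sum(phi(k)^T v) and z = sum(phi(k)) that absorb
    tokens evicted from the sliding window, giving O(1)-memory access to history."""
    def __init__(self, d_k, d_v):
        self.S = torch.zeros(d_k, d_v)
        self.z = torch.zeros(d_k)

    def absorb(self, k_evicted, v_evicted):   # (n, d_k), (n, d_v)
        fk = phi(k_evicted)
        self.S += fk.T @ v_evicted
        self.z += fk.sum(dim=0)

    def read(self, q):                        # (m, d_k) -> (m, d_v)
        fq = phi(q)
        return (fq @ self.S) / (fq @ self.z).clamp_min(1e-6).unsqueeze(-1)
```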
[375] VGGT-HPE: Reframing Head Pose Estimation as Relative Pose Prediction
Vasiliki Vasileiou, Panagiotis P. Filntisis, Petros Maragos, Kostas Daniilidis
Main category: cs.CV
TL;DR: VGGT-HPE introduces a relative head pose estimation approach that predicts transformations between head configurations rather than absolute poses, using a geometry foundation model fine-tuned on synthetic data only.
Details
Motivation: Traditional monocular head pose estimation uses absolute regression which forces networks to internalize dataset-specific canonical frames. The authors argue that predicting relative transformations between observed head configurations is fundamentally easier and more robust.
Method: VGGT-HPE uses a relative formulation built on a general-purpose geometry foundation model. It’s fine-tuned exclusively on synthetic facial renderings, reducing the problem to estimating geometric displacement from an explicitly provided anchor with known pose. The anchor can be chosen at test time (e.g., near-neutral or temporally adjacent frames). See the sketch below.
Result: Despite zero real-world training data, VGGT-HPE achieves state-of-the-art results on the BIWI benchmark, outperforming established absolute regression methods trained on mixed and real datasets. Controlled benchmarks show relative prediction is intrinsically more accurate than absolute regression, with advantages scaling with pose difficulty.
Conclusion: Relative head pose estimation is more robust than absolute regression, especially for difficult poses. The method demonstrates strong generalization from synthetic to real data and allows flexible anchor selection at test time.
Abstract: Monocular head pose estimation is traditionally formulated as direct regression from a single image to an absolute pose. This paradigm forces the network to implicitly internalize a dataset-specific canonical reference frame. In this work, we argue that predicting the relative rigid transformation between two observed head configurations is a fundamentally easier and more robust formulation. We introduce VGGT-HPE, a relative head pose estimator built upon a general-purpose geometry foundation model. Finetuned exclusively on synthetic facial renderings, our method sidesteps the need for an implicit anchor by reducing the problem to estimating a geometric displacement from an explicitly provided anchor with a known pose. As a practical benefit, the relative formulation also allows the anchor to be chosen at test time - for instance, a near-neutral frame or a temporally adjacent one - so that the prediction difficulty can be controlled by the application. Despite zero real-world training data, VGGT-HPE achieves state-of-the-art results on the BIWI benchmark, outperforming established absolute regression methods trained on mixed and real datasets. Through controlled easy- and hard-pair benchmarks, we also systematically validate our core hypothesis: relative prediction is intrinsically more accurate than absolute regression, with the advantage scaling alongside the difficulty of the target pose. Project page and code: https://vasilikivas.github.io/VGGT-HPE
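The relative formulation reduces inference to a simple composition at test time: apply the predicted anchor-to-query transform to the anchor's known pose. A small sketch using SciPy rotations, with illustrative symbols rather than the paper's code:

```python
from scipy.spatial.transform import Rotation as R

def absolute_from_relative(r_anchor: R, r_rel: R) -> R:
    # Compose the predicted relative rotation with the known anchor rotation.
    return r_rel * r_anchor

anchor = R.from_euler("xyz", [0, 10, 0], degrees=True)   # e.g., a near-neutral frame
rel = R.from_euler("xyz", [5, 30, -2], degrees=True)     # hypothetical network output
query = absolute_from_relative(anchor, rel)
print(query.as_euler("xyz", degrees=True))               # recovered absolute head pose
```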
[376] Dual-Branch Remote Sensing Infrared Image Super-Resolution
Xining Ge, Gengjia Chang, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Yihang Chen, Yifan Deng, Shuhong Liu
Main category: cs.CV
TL;DR: A dual-branch system combining a HAT-L transformer branch and a MambaIRv2-L state-space branch for infrared image super-resolution, developed for the NTIRE 2026 challenge with test-time ensemble techniques.
Details
Motivation: Infrared image super-resolution is challenging due to weak textures and sensitivity to unstable local sharpening, requiring complementary local and global modeling unlike visible-image SR.
Method: Dual-branch system with HAT-L branch (local transformer) and MambaIRv2-L branch (global state-space model), using test-time local conversion on HAT, eight-way self-ensemble on MambaIRv2, and fixed equal-weight image-space fusion. A sketch of the ensemble and fusion steps follows the abstract below.
Result: Solution to the NTIRE 2026 Infrared Image Super-Resolution Challenge (official challenge score reported); fused output outperforms either single branch on 12 synthetic 4× thermal samples from the Caltech Aerial RGB-Thermal dataset in PSNR, SSIM, and overall Score.
Conclusion: Infrared super-resolution benefits from explicit complementarity between locally strong transformer restoration and globally stable state-space modeling.
Abstract: Remote sensing infrared image super-resolution aims to recover sharper thermal observations from low-resolution inputs while preserving target contours, scene layout, and radiometric stability. Unlike visible-image super-resolution, thermal imagery is weakly textured and more sensitive to unstable local sharpening, which makes complementary local and global modeling especially important. This paper presents our solution to the NTIRE 2026 Infrared Image Super-Resolution Challenge, a dual-branch system that combines a HAT-L branch and a MambaIRv2-L branch. The inference pipeline applies test-time local conversion on HAT, eight-way self-ensemble on MambaIRv2, and fixed equal-weight image-space fusion. We report both the official challenge score and a reproducible evaluation on 12 synthetic 4× thermal samples derived from Caltech Aerial RGB-Thermal, on which the fused output outperforms either single branch in PSNR, SSIM, and the overall Score. The results suggest that infrared super-resolution benefits from explicit complementarity between locally strong transformer restoration and globally stable state-space modeling.
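Two of the inference-pipeline components named above, the eight-way self-ensemble and the fixed equal-weight image-space fusion, are standard enough to sketch. The code below is an illustration under stated assumptions (square inputs, placeholder `model_hat`/`model_mamba` branches), not the challenge submission itself.

```python
import torch

def eight_way_ensemble(model, x):
    """Average predictions over 4 rotations x horizontal flip (assumes H == W)."""
    outs = []
    for flip in (False, True):
        xf = torch.flip(x, dims=[-1]) if flip else x
        for k in range(4):
            y = model(torch.rot90(xf, k, dims=(-2, -1)))
            y = torch.rot90(y, -k, dims=(-2, -1))          # undo the rotation
            outs.append(torch.flip(y, dims=[-1]) if flip else y)
    return torch.stack(outs).mean(dim=0)

def fuse(model_hat, model_mamba, x):
    y_hat = model_hat(x)                                   # tiled inference elided
    y_mamba = eight_way_ensemble(model_mamba, x)
    return 0.5 * (y_hat + y_mamba)                         # fixed equal-weight fusion
```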
[377] A Dual Cross-Attention Graph Learning Framework For Multimodal MRI-Based Major Depressive Disorder Detection
Nojod M. Alotaibi, Areej M. Alhothali
Main category: cs.CV
TL;DR: A dual cross-attention multimodal fusion framework for MDD classification using structural and functional MRI data, achieving improved performance over simple concatenation methods.
Details
Motivation: Major depressive disorder (MDD) involves complex neurobiological changes that cannot be fully captured by single imaging modalities. While multimodal MRI combining structural and functional data offers more comprehensive understanding, effective integration of these modalities remains challenging.
Method: Proposes a dual cross-attention-based multimodal fusion framework that explicitly models bidirectional interactions between structural MRI (sMRI) and resting-state functional MRI (rs-fMRI) representations. Tested on REST-meta-MDD dataset using both structural and functional brain atlas configurations with 10-fold stratified cross-validation. See the sketch below.
Result: The fusion algorithm achieves robust and competitive performance across all atlas types, consistently outperforming conventional feature-level concatenation for functional atlases while maintaining comparable performance for structural atlases. Best model obtained 84.71% accuracy, 86.42% sensitivity, 82.89% specificity, 84.34% precision, and 85.37% F1-score.
Conclusion: The findings emphasize the importance of explicitly modeling cross-modal interactions for multimodal neuroimaging-based MDD classification, demonstrating that sophisticated fusion approaches can improve diagnostic performance.
Abstract: Major depressive disorder (MDD) is a prevalent mental disorder associated with complex neurobiological changes that cannot be fully captured using a single imaging modality. The use of multimodal magnetic resonance imaging (MRI) provides a more comprehensive understanding of brain changes by combining structural and functional data. Despite this, the effective integration of these modalities remains challenging. In this study, we propose a dual cross-attention-based multimodal fusion framework that explicitly models bidirectional interactions between structural MRI (sMRI) and resting-state functional MRI (rs-fMRI) representations. The proposed approach is tested on the large-scale REST-meta-MDD dataset using both structural and functional brain atlas configurations. Numerous experiments conducted under a 10-fold stratified cross-validation demonstrated that the proposed fusion algorithm achieves robust and competitive performance across all atlas types. The proposed method consistently outperforms conventional feature-level concatenation for functional atlases, while maintaining comparable performance for structural atlases. The most effective dual cross-attention multimodal model obtained 84.71% accuracy, 86.42% sensitivity, 82.89% specificity, 84.34% precision, and 85.37% F1-score. These findings emphasize the importance of explicitly modeling cross-modal interactions for multimodal neuroimaging-based MDD classification.
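A minimal PyTorch sketch of bidirectional (dual) cross-attention between the two modality streams; the layer sizes, pooling, and classification head are hypothetical simplifications, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.s2f = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.f2s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 2)        # MDD vs. control logits

    def forward(self, smri_tok, fmri_tok):       # (B, Ns, D), (B, Nf, D)
        s_att, _ = self.s2f(smri_tok, fmri_tok, fmri_tok)  # sMRI queries fMRI
        f_att, _ = self.f2s(fmri_tok, smri_tok, smri_tok)  # fMRI queries sMRI
        fused = torch.cat([s_att.mean(dim=1), f_att.mean(dim=1)], dim=-1)
        return self.head(fused)
```

The bidirectional design is the key contrast with feature-level concatenation: each modality attends to the other, so interactions are modeled explicitly rather than left to a shared classifier.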
[378] PhyMix: Towards Physically Consistent Single-Image 3D Indoor Scene Generation with Implicit–Explicit Optimization
Dongli Wu, Jingyu Hu, Ka-Hei Hui, Xiaobao Wei, Chengwen Luo, Jianqiang Li, Zhengzhe Liu
Main category: cs.CV
TL;DR: PhyMix: A physics-aware framework for 3D indoor scene generation that integrates evaluation, training, and inference-time optimization to produce physically plausible scenes.
Details
Motivation: Existing 3D indoor scene generators produce visually plausible results but often violate real-world physics, limiting their reliability for robotics, embodied AI, and design applications. There's a need for physically consistent scene generation.
Method: Proposes PhyMix with two components: (1) Scene-GRPO for implicit alignment using physics evaluator as preference signal, and (2) Test-Time Optimizer (TTO) for explicit refinement using differentiable evaluator signals. Also introduces a unified Physics Evaluator benchmark measuring geometric priors, contact, stability, and deployability. A sketch of the group-relative signal follows the abstract below.
Result: State-of-the-art performance in both visual fidelity and physical plausibility. Extensive evaluations show robustness in synthetic and real-world images.
Conclusion: The framework successfully unifies evaluation, reward shaping, and inference-time correction to produce 3D indoor scenes that are both visually faithful and physically plausible.
Abstract: Existing single-image 3D indoor scene generators often produce results that look visually plausible but fail to obey real-world physics, limiting their reliability in robotics, embodied AI, and design. To examine this gap, we introduce a unified Physics Evaluator that measures four main aspects: geometric priors, contact, stability, and deployability, which are further decomposed into nine sub-constraints, establishing the first benchmark to measure physical consistency. Based on this evaluator, our analysis shows that state-of-the-art methods remain largely physics-unaware. To overcome this limitation, we further propose a framework that integrates feedback from the Physics Evaluator into both training and inference, enhancing the physical plausibility of generated scenes. Specifically, we propose PhyMix, which is composed of two complementary components: (i) implicit alignment via Scene-GRPO, a critic-free group-relative policy optimization that leverages the Physics Evaluator as a preference signal and biases sampling towards physically feasible layouts, and (ii) explicit refinement via a plug-and-play Test-Time Optimizer (TTO) that uses differentiable evaluator signals to correct residual violations during generation. Overall, our method unifies evaluation, reward shaping, and inference-time correction, producing 3D indoor scenes that are visually faithful and physically plausible. Extensive synthetic evaluations confirm state-of-the-art performance in both visual fidelity and physical plausibility, and extensive qualitative examples in stylized and real-world images further showcase the robustness of the method. We will release codes and models upon publication.
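The group-relative, critic-free preference signal at the heart of Scene-GRPO can be shown in a few lines: the Physics Evaluator scores a group of layouts sampled for the same input, and each sample's advantage is its score standardized within the group. A sketch, with placeholder evaluator outputs:

```python
import torch

def group_relative_advantages(physics_scores: torch.Tensor) -> torch.Tensor:
    """physics_scores: (G,) evaluator rewards for G layouts sampled from the same
    input image. Above-group-average layouts receive positive advantage."""
    mean, std = physics_scores.mean(), physics_scores.std()
    return (physics_scores - mean) / (std + 1e-8)

scores = torch.tensor([0.42, 0.87, 0.55, 0.91])   # hypothetical evaluator scores
print(group_relative_advantages(scores))           # biases sampling toward feasible layouts
```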
[379] VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
Longteng Jiang, DanDan Zheng, Qianqian Qiao, Heng Huang, Huaye Wang, Yihang Bo, Bao Peng, Jingdong Chen, Jun Zhou, Xin Jin
Main category: cs.CV
TL;DR: VGA-Bench is a unified benchmark for evaluating both video generation quality and aesthetic quality in AIGC-based video generation, addressing the gap in existing benchmarks that focus only on technical fidelity.
Details
Motivation: Existing video generation benchmarks focus primarily on technical fidelity metrics, leaving a significant gap in holistic assessment that includes perceptual and artistic qualities. There's a critical need for comprehensive evaluation frameworks that encompass aesthetic appeal alongside generation quality.
Method: Developed a principled three-tier taxonomy (Aesthetic Quality, Aesthetic Tagging, Generation Quality) with fine-grained sub-dimensions. Created 1,016 diverse prompts and generated over 60,000 videos using 12 video generation models. Annotated subset via human labeling and developed three multi-task neural assessors: VAQA-Net (aesthetic quality), VTag-Net (aesthetic tagging), and VGQA-Net (generation quality).
Result: The models achieve reliable alignment with human judgments, offering both accuracy and efficiency. The benchmark enables scalable and automated evaluation of video generation models across comprehensive quality dimensions.
Conclusion: VGA-Bench provides a unified benchmark for joint evaluation of video generation quality and aesthetic quality, addressing limitations in existing evaluation frameworks and enabling applications in content moderation, model debugging, and generative model optimization.
Abstract: The rapid advancement of AIGC-based video generation has underscored the critical need for comprehensive evaluation frameworks that go beyond traditional generation quality metrics to encompass aesthetic appeal. However, existing benchmarks remain largely focused on technical fidelity, leaving a significant gap in holistic assessment-particularly with respect to perceptual and artistic qualities. To address this limitation, we introduce VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality. VGA-Bench is built upon a principled three-tier taxonomy: Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple fine-grained sub-dimensions to enable systematic assessment. Guided by this taxonomy, we design 1,016 diverse prompts and generate a large-scale dataset of over 60,000 videos using 12 video generation models, ensuring broad coverage across content, style, and artifacts. To enable scalable and automated evaluation, we annotate a subset of the dataset via human labeling and develop three dedicated multi-task neural assessors: VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation and basic quality attributes. Extensive experiments demonstrate that our models achieve reliable alignment with human judgments, offering both accuracy and efficiency. We release VGA-Bench as a public benchmark to foster research in AIGC evaluation, with applications in content moderation, model debugging, and generative model optimization.
[380] Improving Deep Learning-Based Target Volume Auto-Delineation for Adaptive MR-Guided Radiotherapy in Head and Neck Cancer: Impact of a Volume-Aware Dice Loss
Sogand Beirami, Zahra Esmaeilzadeh, Ahmed Gomaa, Pluvio Stephan, Ishita Sheth, Thomas Weissmann, Juliane Szkitsak, Philipp Schubert, Yixing Huang, Annette Schwarz, Stefanie Corradini, Florian Putz
Main category: cs.CV
TL;DR: Volume-Aware Dice loss improves detection of small metastatic lymph nodes in head and neck cancer segmentation but requires balanced approach to maintain primary tumor accuracy.
Details
Motivation: Manual delineation of target volumes in head and neck cancer radiotherapy is time-consuming and has high inter-observer variability. The study aims to improve auto-segmentation of primary tumors and metastatic lymph nodes using volume-sensitive loss functions to better detect small, complex nodal metastases.
Method: Used nnU-Net ResEnc M architecture on HNTS-MRG 2024 dataset for multi-label segmentation. Compared standard Dice loss baseline against two Volume-Aware configurations: “Dual Mask” (VA loss on both PT and LN) and “Selective LN Mask” (VA loss on LN only). Evaluated using volumetric Dice scores, surface-based metrics, and lesion-wise detection sensitivity/precision. See the sketch below.
Result: Selective LN Mask configuration achieved highest LN Volumetric Dice Score (0.758 vs. 0.734 baseline) and improved LN Lesion-Wise Detection Sensitivity (84.93% vs. 81.80%). However, PT detection precision declined significantly (63.65% vs. 81.27%). Dual Mask configuration provided most balanced performance, maintaining PT precision at 82.04% while improving LN sensitivity to 83.46%.
Conclusion: Volume-sensitive loss function helps mitigate under-representation of small metastatic lesions in HNC segmentation. While selective weighting yields best nodal detection, dual-mask approach is necessary in multi-label tasks to maintain segmentation accuracy for larger primary tumor volumes.
Abstract: Background: Manual delineation of target volumes in head and neck cancer (HNC) remains a significant bottleneck in radiotherapy planning, characterized by high inter-observer variability and time consumption. This study evaluates the integration of a Volume-Aware (VA) Dice loss function into a self-configuring deep learning framework to enhance the auto-segmentation of primary tumors (PT) and metastatic lymph nodes (LN) for adaptive MR-guided radiotherapy. We investigate how volume-sensitive weighting affects the detection of small, anatomically complex nodal metastases compared to conventional loss functions. Methods: Utilizing the HNTS-MRG 2024 dataset, we implemented an nnU-Net ResEnc M architecture. We conducted a multi-label segmentation task, comparing a standard Dice loss baseline against two Volume-Aware configurations: a “Dual Mask” setup (VA loss on both PT and LN) and a “Selective LN Mask” setup (VA loss on LN only). Evaluation metrics included volumetric Dice scores, surface-based metrics (SDS, MSD, HD95), and lesion-wise binary detection sensitivity and precision. Results: The Selective LN Mask configuration achieved the highest LN Volumetric Dice Score (0.758 vs. 0.734 baseline) and significantly improved LN Lesion-Wise Detection Sensitivity (84.93% vs. 81.80%). However, a critical trade-off was observed; PT detection precision declined significantly in the selective setup (63.65% vs. 81.27%). The Dual Mask configuration provided the most balanced performance across both targets, maintaining primary tumor precision at 82.04% while improving LN sensitivity to 83.46%. Conclusions: A volume-sensitive loss function mitigated the under-representation of small metastatic lesions in HNC. While selective weighting yielded the best nodal detection, a dual-mask approach is required in multi-label tasks to maintain segmentation accuracy for larger primary tumor volumes.
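The excerpt does not spell out the Volume-Aware Dice formula, so the sketch below shows only one plausible instantiation: per-label soft-Dice terms weighted inversely by ground-truth volume, which up-weights small lymph nodes relative to the larger primary tumor.

```python
import torch

def volume_aware_dice_loss(probs, target, eps=1e-6):
    """probs, target: (B, C, ...) soft predictions and one-hot ground truth.
    A hypothetical volume-weighted Dice; not the paper's exact formulation."""
    dims = tuple(range(2, probs.ndim))
    inter = (probs * target).sum(dim=dims)
    denom = probs.sum(dim=dims) + target.sum(dim=dims)
    dice = (2 * inter + eps) / (denom + eps)      # (B, C) per-label soft Dice
    vol = target.sum(dim=dims)                    # ground-truth voxels per label
    w = 1.0 / (vol + 1.0)                         # smaller structures weigh more
    w = w / w.sum(dim=1, keepdim=True)
    return (w * (1.0 - dice)).sum(dim=1).mean()
```

Under this weighting, the "Selective LN Mask" setup would apply the weighted term to the LN channel only, while the "Dual Mask" setup applies it to both channels, matching the trade-off the results describe.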
[381] Semantic Manipulation Localization
Zhenshan Tan, Chenhan Lu, Yuxiang Huang, Ziwen He, Xiang Zhang, Yuzhe Sha, Xianyi Chen, Tianrun Chen, Zhangjie Fu
Main category: cs.CV
TL;DR: TRACE is a framework for Semantic Manipulation Localization (SML) that detects subtle semantic edits in images by combining semantic anchoring, perturbation sensing, and constrained reasoning, outperforming traditional artifact-based methods.
Details
Motivation: Modern image editing and generative models create subtle semantic manipulations that alter object attributes, states, or relationships without obvious low-level artifacts, making conventional Image Manipulation Localization (IML) methods ineffective as they rely on artifact detection rather than semantic sensitivity.
Method: TRACE framework with three progressively coupled components: 1) Semantic anchoring to identify meaningful regions supporting image understanding, 2) Semantic perturbation sensing using frequency cues to capture subtle edits under visual consistency, and 3) Semantic-constrained reasoning to verify candidate regions through joint reasoning over semantic content and scope.
Result: TRACE consistently outperforms existing IML methods on the constructed SML benchmark, producing more complete, compact, and semantically coherent localization results.
Conclusion: The work demonstrates the necessity of moving beyond artifact-based localization and provides a new direction for image forensics in complex semantic editing scenarios through the proposed Semantic Manipulation Localization task and TRACE framework.
Abstract: Image Manipulation Localization (IML) aims to identify edited regions in an image. However, with the increasing use of modern image editing and generative models, many manipulations no longer exhibit obvious low-level artifacts. Instead, they often involve subtle but meaning-altering edits to an object’s attributes, state, or relationships while remaining highly consistent with the surrounding content. This makes conventional IML methods less effective because they mainly rely on artifact detection rather than semantic sensitivity. To address this issue, we introduce Semantic Manipulation Localization (SML), a new task that focuses on localizing subtle semantic edits that significantly change image interpretation. We further construct a dedicated fine-grained benchmark for SML using a semantics-driven manipulation pipeline with pixel-level annotations. Based on this task, we propose TRACE (Targeted Reasoning of Attributed Cognitive Edits), an end-to-end framework that models semantic sensitivity through three progressively coupled components: semantic anchoring, semantic perturbation sensing, and semantic-constrained reasoning. Specifically, TRACE first identifies semantically meaningful regions that support image understanding, then injects perturbation-sensitive frequency cues to capture subtle edits under strong visual consistency, and finally verifies candidate regions through joint reasoning over semantic content and semantic scope. Extensive experiments show that TRACE consistently outperforms existing IML methods on our benchmark and produces more complete, compact, and semantically coherent localization results. These results demonstrate the necessity of moving beyond artifact-based localization and provide a new direction for image forensics in complex semantic editing scenarios.
[382] Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
Yibo Yan, Mingdong Ou, Yi Cao, Jiahao Huo, Xin Zou, Shuliang Liu, James Kwok, Xuming Hu
Main category: cs.CV
TL;DR: ColChunk: A plug-and-play framework using multimodal late chunking with hierarchical clustering on patch embeddings to create efficient, contextualized multi-vectors for visual document retrieval, achieving 90% storage reduction and 9-point nDCG@5 improvement.
Details
Motivation: Multi-vector models for Visual Document Retrieval (VDR) provide fine-grained matching but suffer from high storage and computational costs that hinder practical deployment. There's a need for efficient solutions that maintain accuracy while reducing resource requirements.
Method: ColChunk introduces multimodal late chunking using hierarchical clustering on patch-level embeddings fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping creates content-aware representations that preserve global context while drastically reducing vector count. A sketch follows the abstract below.
Result: Evaluations across 24 VDR datasets show ColChunk achieves over 90% reduction in storage requirements while delivering a 9-point average improvement in nDCG@5 across representative single-vector models.
Conclusion: ColChunk provides a practical solution for balancing retrieval accuracy and efficiency in visual document systems, offering a plug-and-play framework that significantly reduces computational and storage costs while improving performance.
Abstract: Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose ColChunk, a plug-and-play framework that introduces multimodal late chunking to construct efficient, contextualized multi-vectors. Unlike existing pruning or fixed-token approaches, ColChunk employs hierarchical clustering on patch-level embeddings, fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping allows for a content-aware representation that preserves global context while drastically reducing the vector count. Evaluations across 24 VDR datasets demonstrate ColChunk achieves over a 90% reduction in storage requirements while simultaneously delivering a 9-point average improvement in nDCG@5 across representative single-vector models. ColChunk provides a practical solution for balancing retrieval accuracy and efficiency in visual document systems.
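The chunking step maps naturally onto off-the-shelf tooling. Below is a rough sketch, assuming patch embeddings laid out row-major and using agglomerative clustering as the hierarchical-clustering stage; the position-prior weight and chunk count are hypothetical knobs, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def chunk_patches(patch_emb, grid_hw, n_chunks=32, pos_weight=0.5):
    """patch_emb: (H*W, D) contextualized patch embeddings, row-major layout.
    Returns (n_chunks, D) mean-pooled, spatially coherent chunk vectors."""
    h, w = grid_hw
    ys, xs = np.divmod(np.arange(h * w), w)
    pos = np.stack([ys / h, xs / w], axis=1)                  # normalized 2D prior
    feats = np.concatenate([patch_emb, pos_weight * pos], axis=1)
    labels = AgglomerativeClustering(n_clusters=n_chunks).fit_predict(feats)
    return np.stack([patch_emb[labels == c].mean(axis=0) for c in range(n_chunks)])
```

Because the embeddings are contextualized before pooling ("late chunking"), each chunk vector still carries page-level context, which is how a 90% vector-count reduction can coexist with an accuracy gain.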
[383] Radiology Report Generation for Low-Quality X-Ray Images
Hongze Zhu, Chen Hu, Jiaxuan Jiang, Hong Liu, Yawen Huang, Ming Hu, Tianyu Wang, Zhijian Wu, Yefeng Zheng
Main category: cs.CV
TL;DR: A robust radiology report generation framework that addresses performance degradation from low-quality medical images through automated quality assessment and dual-loop training with bi-level optimization.
Details
Motivation: Existing vision-language models for radiology report generation assume high-quality inputs, but real-world clinical environments have noisy/artifact-ridden images causing severe performance degradation with current models.
Method: Proposes: 1) Automated Quality Assessment Agent (AQAA) to identify low-quality samples in MIMIC-CXR dataset, establishing LRRG benchmark; 2) Dual-loop Training Strategy using bi-level optimization and gradient consistency to learn quality-agnostic diagnostic features by aligning gradient directions across quality variations. See the sketch below.
Result: Extensive experiments show the approach effectively mitigates model performance degradation caused by image quality deterioration. Code and data to be released upon acceptance.
Conclusion: The framework bridges the gap between ideal lab conditions and real-world clinical environments by explicitly addressing image quality variations in radiology report generation.
Abstract: Vision-Language Models (VLMs) have significantly advanced automated Radiology Report Generation (RRG). However, existing methods implicitly assume high-quality inputs, overlooking the noise and artifacts prevalent in real-world clinical environments. Consequently, current models exhibit severe performance degradation when processing suboptimal images. To bridge this gap, we propose a robust report generation framework explicitly designed for image quality variations. We first introduce an Automated Quality Assessment Agent (AQAA) to identify low-quality samples within the MIMIC-CXR dataset and establish the Low-quality Radiology Report Generation (LRRG) benchmark. To tackle degradation-induced shifts, we propose a novel Dual-loop Training Strategy leveraging bi-level optimization and gradient consistency. This approach ensures the model learns quality-agnostic diagnostic features by aligning gradient directions across varying quality regimes. Extensive experiments demonstrate that our approach effectively mitigates model performance degradation caused by image quality deterioration. The code and data will be released upon acceptance.
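The gradient-alignment idea can be illustrated as a diagnostic: compare the directions of loss gradients computed on clean and degraded versions of the same batch. This is a sketch of the principle only, not the paper's full bi-level procedure; `loss_fn` is a hypothetical callable.

```python
import torch

def gradient_alignment(model, loss_fn, batch_clean, batch_degraded):
    """Cosine similarity between parameter gradients from the two quality regimes;
    values near 1 indicate quality-agnostic update directions."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_clean = torch.autograd.grad(loss_fn(model, batch_clean), params)
    g_deg = torch.autograd.grad(loss_fn(model, batch_degraded), params)
    flat = lambda gs: torch.cat([g.reshape(-1) for g in gs])
    a, b = flat(g_clean), flat(g_deg)
    return torch.dot(a, b) / (a.norm() * b.norm() + 1e-12)
```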
[384] A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction
Meng’en Qin, Yu Song, Quanling Zhao, Xiaodong Yang, Yingtao Che, Xiaohui Yang
Main category: cs.CV
TL;DR: A3-FPN is a feature pyramid network that improves multi-scale representation for dense prediction tasks through asymptotically disentangled framework and content-aware attention modules.
Details
Motivation: Existing feature pyramid networks have design defects that inhibit them from capturing discriminative features and recognizing small objects effectively in dense prediction tasks.
Method: Proposes Asymptotic Content-Aware Pyramid Attention Network (A3-FPN) with horizontally-spread column network for asymptotically global feature interaction, content-aware attention modules for feature fusion with position-wise offsets and weights, and feature reassembly based on information content.
Result: Achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes when paired with OneFormer and Swin-L backbone, demonstrating significant performance gains across multiple datasets.
Conclusion: A3-FPN effectively enhances multi-scale feature representation and can be integrated into various CNN and Transformer-based architectures for improved dense prediction performance.
Abstract: Learning multi-scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at https://github.com/mason-ching/A3-FPN.
[385] Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?
Isaac Corley, Alex Stoken, Gabriele Berton
Main category: cs.CV
TL;DR: Evaluation of 24 pretrained image matchers for cross-modal optical-SAR registration reveals asymmetric transfer, protocol sensitivity, and that foundation model features may enable modality invariance without explicit cross-modal training.
Details
Motivation: Cross-modal optical-SAR registration is crucial for disaster response but existing matchers are developed on natural images, lacking evaluation on satellite/SAR domains. Need to assess zero-shot transfer of modern matchers to cross-modal satellite registration.
Method: Evaluated 24 pretrained matcher families in zero-shot setting on SpaceNet9 and two cross-modal benchmarks using deterministic protocol with tiled large-image inference, robust geometric filtering, and tie-point-grounded metrics. A sketch of the filtering stage follows the abstract below.
Result: XoFTR and RoMa achieved the lowest mean error (3.0 px), but RoMa did so without cross-modal training. MatchAnything-ELoFTR (3.4 px), trained on synthetic pairs, performed similarly. Protocol choices (geometry model, tile size, inlier gating) shifted accuracy by up to 33×. 3D-reconstruction matchers (MASt3R, DUSt3R) were protocol-sensitive and fragile.
Conclusion: Foundation-model features (DINOv2) may provide modality invariance that substitutes for explicit cross-modal supervision. Protocol design significantly impacts performance, sometimes more than matcher choice. Findings guide practical deployment and future matcher design for cross-modal satellite registration.
Abstract: Cross-modal optical-SAR (Synthetic Aperture Radar) registration is a bottleneck for disaster response via remote sensing, yet modern image matchers are developed and benchmarked almost exclusively on natural-image domains. We evaluate twenty-four pretrained matcher families, in a zero-shot setting with no fine-tuning or domain adaptation on satellite or SAR data, on SpaceNet9 and two additional cross-modal benchmarks under a deterministic protocol with tiled large-image inference, robust geometric filtering, and tie-point-grounded metrics. Our results reveal asymmetric transfer: matchers with explicit cross-modal training do not uniformly outperform those without it. While XoFTR (trained for visible-thermal matching) and RoMa achieve the lowest reported mean error at $3.0$ px on the labeled SpaceNet9 training scenes, RoMa achieves this without any cross-modal training, and MatchAnything-ELoFTR ($3.4$ px), trained on synthetic cross-modal pairs, matches closely, suggesting (as a working hypothesis) that foundation-model features (DINOv2) may contribute to modality invariance that partially substitutes for explicit cross-modal supervision. 3D-reconstruction matchers (MASt3R, DUSt3R), which are not designed for traditional 2D image matching, are highly protocol-sensitive and remain fragile under default settings. Deployment protocol choices (geometry model, tile size, inlier gating) shift accuracy by up to $33\times$ for a single matcher, sometimes exceeding the effect of swapping matchers entirely within the evaluated sweep; affine geometry alone reduces mean error from $12.34$ to $9.74$ px. These findings inform both practical deployment of existing matchers and future matcher design for cross-modal satellite registration.
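The geometric-filtering stage of the protocol is simple to sketch: pooled tie points are fit with a robust affine model under RANSAC, and error is measured on the inliers. The helper below is illustrative; matcher invocation and tiling are elided, and the threshold is an assumption.

```python
import numpy as np
import cv2

def robust_affine_error(pts_opt, pts_sar, thresh_px=3.0):
    """pts_opt, pts_sar: (N, 2) float32 correspondences pooled over tiles.
    Returns mean inlier reprojection error (px) and inlier count."""
    model, inliers = cv2.estimateAffine2D(pts_opt, pts_sar,
                                          method=cv2.RANSAC,
                                          ransacReprojThreshold=thresh_px)
    if model is None:
        return np.inf, 0
    mask = inliers.ravel().astype(bool)
    proj = pts_opt[mask] @ model[:, :2].T + model[:, 2]   # apply 2x3 affine
    err = np.linalg.norm(proj - pts_sar[mask], axis=1).mean()
    return err, int(mask.sum())
```

The abstract's 33× sensitivity finding lives exactly in choices like this one: swapping the affine model for a homography, or changing `thresh_px`, can move the reported error more than swapping the matcher.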
[386] SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation
Yun Wang, Zhengjie Yang, Jiahao Zheng, Zhanjie Zhang, Dapeng Oliver Wu, Yulan Guo
Main category: cs.CV
TL;DR: SMFormer: A self-supervised stereo matching framework using Vision Foundation Models and data augmentation to overcome photometric consistency limitations, achieving SOTA performance competitive with supervised methods.
Details
Motivation: Self-supervised stereo matching methods rely on photometric consistency assumptions, which can fail in real-world scenarios with disturbances like illumination changes, creating an accuracy gap compared to supervised methods.
Method: Integrates Vision Foundation Models with Feature Pyramid Networks for robust feature representation, plus a data augmentation mechanism that enforces feature consistency under illumination variations and regularizes disparity prediction consistency between augmented and standard samples.
Result: Achieves state-of-the-art performance among self-supervised methods on multiple benchmarks, competes with supervised methods, and even outperforms some SOTA supervised methods like CFNet on the challenging Booster benchmark.
Conclusion: SMFormer demonstrates that integrating VFMs with effective data augmentation can overcome limitations of photometric consistency assumptions in self-supervised stereo matching, closing the gap with supervised approaches.
Abstract: Recent self-supervised stereo matching methods have made significant progress. They typically rely on the photometric consistency assumption, which presumes corresponding points across views share the same appearance. However, this assumption could be compromised by real-world disturbances, resulting in invalid supervisory signals and a significant accuracy gap compared to supervised methods. To address this issue, we propose SMFormer, a framework integrating more reliable self-supervision guided by the Vision Foundation Model (VFM) and data augmentation. We first incorporate the VFM with the Feature Pyramid Network (FPN), providing a discriminative and robust feature representation against disturbance in various scenarios. We then devise an effective data augmentation mechanism that ensures robustness to various transformations. The data augmentation mechanism explicitly enforces consistency between learned features and those influenced by illumination variations. Additionally, it regularizes the output consistency between disparity predictions of strong augmented samples and those generated from standard samples. Experiments on multiple mainstream benchmarks demonstrate that our SMFormer achieves state-of-the-art (SOTA) performance among self-supervised methods and even competes on par with supervised ones. Remarkably, in the challenging Booster benchmark, SMFormer even outperforms some SOTA supervised methods, such as CFNet.
[387] Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis
Yang Yu, Dunyuan Xu, Yaoqian Li, Xiaomeng Li, Jinpeng Li, Pheng-Ann Heng
Main category: cs.CV
TL;DR: A 3D medical MLLM framework that transfers 2D MLLMs to 3D medical imaging by reusing pre-trained parameters and using a Text-Guided Hierarchical MoE to extract task-specific features for medical report generation and visual question answering.
Details
Motivation: 3D medical image analysis is crucial for diagnosis/treatment, but existing 3D medical MLLMs suffer from insufficiently pretrained vision encoders due to scarce 3D medical data, and can't extract customized features for different tasks like MRG and MVQA.
Method: 1) Transfer 2D MLLM (trained on natural images) to support 3D medical volumetric inputs while reusing all pre-trained parameters. 2) Design Text-Guided Hierarchical MoE (TGH-MoE) framework to distinguish tasks under text prompt guidance. 3) Two-stage training strategy to learn both task-shared and task-specific image features. A sketch of text-guided routing follows the abstract below.
Result: The method outperforms existing 3D medical MLLMs in both Medical Report Generation (MRG) and Medical Visual Question Answering (MVQA) tasks.
Conclusion: The proposed approach successfully addresses limitations of existing 3D medical MLLMs by transferring 2D MLLMs and using task-specific feature extraction, demonstrating superior performance on clinical tasks.
Abstract: 3D medical image analysis is of great importance in disease diagnosis and treatment. Recently, multimodal large language models (MLLMs) have exhibited robust perceptual capacity, strong cross-modal alignment, and promising generalizability. Therefore, they have great potential to improve the performance of medical report generation (MRG) and medical visual question answering (MVQA), which serve as two important tasks in clinical scenarios. However, due to the scarcity of 3D medical images, existing 3D medical MLLMs suffer from an insufficiently pretrained vision encoder and an inability to extract customized image features for different kinds of tasks. In this paper, we propose to first transfer a 2D MLLM, which is well trained with 2D natural images, to support 3D medical volumetric inputs while reusing all of its pre-trained parameters. To enable the vision encoder to extract tailored image features for various tasks, we then design a Text-Guided Hierarchical MoE (TGH-MoE) framework, which can distinguish tasks under the guidance of the text prompt. Furthermore, we propose a two-stage training strategy to learn both task-shared and task-specific image features. As demonstrated empirically, our method outperforms existing 3D medical MLLMs in both MRG and MVQA tasks. Our code will be released once this paper is accepted.
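Text-guided expert routing can be sketched compactly: a gate conditioned on the prompt embedding mixes experts over shared image features. The code below flattens the paper's hierarchical design into a single MoE layer with hypothetical sizes; it illustrates the routing principle, not TGH-MoE itself.

```python
import torch
import torch.nn as nn

class TextGuidedMoE(nn.Module):
    def __init__(self, dim=512, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)      # routed by the text prompt

    def forward(self, img_feat, text_emb):         # (B, N, D), (B, D)
        w = torch.softmax(self.gate(text_emb), dim=-1)                  # (B, E)
        outs = torch.stack([e(img_feat) for e in self.experts], dim=1)  # (B, E, N, D)
        return (w[:, :, None, None] * outs).sum(dim=1)  # task-tailored features
```

The intuition matches the motivation above: an MRG-style prompt and an MVQA-style prompt activate different expert mixtures, so one encoder can serve both tasks.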
[388] MedVeriSeg: Teaching MLLM-Based Medical Segmentation Models to Verify Query Validity Without Extra Training
Ziqian Lu, Qinyue Tong, Jun Liu, Yunlong Yu
Main category: cs.CV
TL;DR: MedVeriSeg is a training-free verification framework that enables medical segmentation MLLMs to identify and reject false queries with non-existent targets, improving reliability in clinical applications.
Details
Motivation: Current MLLM-based medical image segmentation methods (like LISA) cannot reliably reject false queries and often produce hallucinated segmentation masks for absent targets, reducing practical reliability in medical education and clinical use.
Method: Proposes MedVeriSeg with a Similarity Response Quality Scoring Module that analyzes similarity maps between [SEG] token features and MLLM image features across three aspects: strength, compactness, and purity. Incorporates GPT-4o to jointly assess similarity heatmaps and scoring module results for final verification. See the sketch below.
Result: Experiments on a benchmark from SA-Med2D-20M show MedVeriSeg effectively rejects false-query segmentation requests while maintaining reliable recognition of true queries.
Conclusion: MedVeriSeg provides a training-free verification framework that enhances the reliability of medical segmentation MLLMs by preventing hallucinated segmentations for non-existent targets.
Abstract: Despite recent advances in MLLM-based medical image segmentation, existing LISA-like methods cannot reliably reject false queries and often produce hallucinated segmentation masks for absent targets. This limitation reduces practical reliability in both medical education and clinical use. In this work, we propose MedVeriSeg, a training-free verification framework that equips LISA-like medical segmentation models with the ability to identify and reject false queries which contain non-existent targets. Our key observation is that the similarity map between the [SEG] token feature and MLLM image features exhibits markedly different distribution patterns for true and false queries. Based on this, we introduce a Similarity Response Quality Scoring Module that characterizes the similarity map from three aspects: strength, compactness, and purity, producing an initial target-existence prediction. We further incorporate qualitative visual evidence by using GPT-4o to jointly assess the similarity heatmap and the results of Similarity Response Quality Scoring Module for final verification. Experiments on a small-scale benchmark constructed from SA-Med2D-20M show that MedVeriSeg effectively rejects false-query segmentation requests while maintaining reliable recognition of true queries.
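The three similarity-map statistics (strength, compactness, purity) invite a small numeric sketch. Their exact definitions and thresholds are not given in this excerpt, so the choices below are illustrative stand-ins.

```python
import numpy as np

def score_similarity_map(sim, top_frac=0.05):
    """sim: (H, W) similarity between the [SEG] token and image features.
    Returns illustrative (strength, compactness, purity) scores."""
    flat = np.sort(sim.ravel())[::-1]
    k = max(1, int(top_frac * flat.size))
    strength = flat[:k].mean()                    # how strong the peak response is
    ys, xs = np.unravel_index(np.argsort(sim, axis=None)[-k:], sim.shape)
    compactness = 1.0 / (1.0 + np.std(ys) + np.std(xs))   # tight peak -> high score
    purity = flat[:k].sum() / (flat.sum() + 1e-8) # response mass concentrated in peak
    return strength, compactness, purity
```

For a true query one would expect a strong, compact, pure peak over the target; a false query tends to produce a diffuse, low-purity map, which is the signal the module thresholds before the GPT-4o check.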
[389] Warm-Started Reinforcement Learning for Iterative 3D/2D Liver Registration
Hanyuan Zhang, Lucas He, Zijie Cheng, Abdolrahim Kadkhodamohammadi, Danail Stoyanov, Brian R. Davidson, Evangeles B. Mazomenos, Matthew J. Clarkson
Main category: cs.CV
TL;DR: Reinforcement learning framework for CT-to-video registration in surgical AR using discrete actions for rigid transformation selection and iteration stopping.
Details
Motivation: Current learning-based methods for CT-to-video registration in surgical AR produce coarse alignments requiring optimization-based refinement, increasing inference time. Need for automated, efficient iterative registration without manual parameter tuning.
Method: Discrete-action RL framework formulates registration as sequential decision-making. Uses shared feature encoder warm-started from supervised pose estimation network to extract features from CT renderings and laparoscopic frames. RL policy head learns to choose rigid transformations along 6DoF and decide when to stop iterations. A sketch of the action space follows the abstract below.
Result: Achieved average target registration error (TRE) of 15.70 mm on public laparoscopic dataset, comparable to supervised approaches with optimization, while achieving faster convergence.
Conclusion: RL-based formulation enables automated, efficient iterative registration without manual parameter tuning. Discrete framework provides foundation for future continuous-action and deformable registration models in surgical AR applications.
Abstract: Registration between preoperative CT and intraoperative laparoscopic video plays a crucial role in augmented reality (AR) guidance for minimally invasive surgery. Learning-based methods have recently achieved registration errors comparable to optimization-based approaches while offering faster inference. However, many supervised methods produce coarse alignments that rely on additional optimization-based refinement, thereby increasing inference time. We present a discrete-action reinforcement learning (RL) framework that formulates CT-to-video registration as a sequential decision-making process. A shared feature encoder, warm-started from a supervised pose estimation network to provide stable geometric features and faster convergence, extracts representations from CT renderings and laparoscopic frames, while an RL policy head learns to choose rigid transformations along six degrees of freedom and to decide when to stop the iteration. Experiments on a public laparoscopic dataset demonstrated that our method achieved an average target registration error (TRE) of 15.70 mm, comparable to supervised approaches with optimization, while achieving faster convergence. The proposed RL-based formulation enables automated, efficient iterative registration without manually tuned step sizes or stopping criteria. This discrete framework provides a practical foundation for future continuous-action and deformable registration models in surgical AR applications.
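The discrete action space is easy to make concrete: one positive and one negative step per degree of freedom plus an explicit stop. A sketch with illustrative step sizes (the paper's actual magnitudes are not given here):

```python
import numpy as np

STEP_T, STEP_R = 1.0, 1.0            # mm and degrees per action (illustrative)
ACTIONS = [("stop", None, 0.0)]
for i, name in enumerate(["tx", "ty", "tz", "rx", "ry", "rz"]):
    step = STEP_T if i < 3 else STEP_R
    ACTIONS += [(name, i, +step), (name, i, -step)]   # 13 discrete actions total

def apply_action(pose, action_id):
    """pose: length-6 array [tx, ty, tz, rx, ry, rz]; returns (new_pose, done)."""
    name, idx, delta = ACTIONS[action_id]
    if name == "stop":
        return pose, True             # the policy, not a hand-tuned rule, halts
    new_pose = pose.copy()
    new_pose[idx] += delta
    return new_pose, False

pose = np.zeros(6)
pose, done = apply_action(pose, action_id=1)   # one +tx step of the iteration
```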
[390] A Comparison of Multi-View Stereo Methods for Photogrammetric 3D Reconstruction: From Traditional to Learning-Based Approaches
Yawen Li, George Vosselman, Francesco Nex
Main category: cs.CV
TL;DR: Comparative evaluation of traditional vs. learning-based MVS methods for 3D reconstruction, showing trade-offs between accuracy, speed, and robustness.
Details
Motivation: Traditional photogrammetric 3D reconstruction (SfM/MVS) provides high accuracy but faces speed and scalability challenges. Learning-based MVS methods aim for faster, more efficient reconstruction, but their performance relative to traditional methods needs systematic evaluation.Method: Comparative evaluation between traditional MVS pipeline (COLMAP) and state-of-the-art learning-based approaches including geometry-guided methods (MVSNet, PatchmatchNet, MVSAnywhere, MVSFormer++) and end-to-end frameworks (Stereo4D, FoundationStereo, DUSt3R, MASt3R, Fast3R, VGGT). Two experiments on aerial scenarios: MARS-LVIG dataset with LiDAR ground truth, and Pix4D public scene with Pix4Dmapper ground truth. Evaluated accuracy, coverage, and runtime.
Result: COLMAP provides reliable, geometrically consistent results but requires more computation time. Learning-based methods show stronger feature-matching capability and robustness when traditional methods fail. Geometry-guided methods need careful dataset preparation and often depend on COLMAP priors. End-to-end methods (DUSt3R, VGGT) achieve competitive accuracy with substantially faster reconstruction but exhibit larger residuals in challenging scenarios.
Conclusion: Learning-based MVS methods offer promising alternatives to traditional approaches with better speed and robustness, though trade-offs exist in accuracy and dependency on traditional pipelines. End-to-end methods show particular promise for fast reconstruction despite some accuracy limitations.
Abstract: Photogrammetric 3D reconstruction has long relied on traditional Structure-from-Motion (SfM) and Multi-View Stereo (MVS) methods, which provide high accuracy but face challenges in speed and scalability. Recently, learning-based MVS methods have emerged, aiming for faster and more efficient reconstruction. This work presents a comparative evaluation between a representative traditional MVS pipeline (COLMAP) and state-of-the-art learning-based approaches, including geometry-guided methods (MVSNet, PatchmatchNet, MVSAnywhere, MVSFormer++) and end-to-end frameworks (Stereo4D, FoundationStereo, DUSt3R, MASt3R, Fast3R, VGGT). Two experiments were conducted on different aerial scenarios. The first experiment used the MARS-LVIG dataset, where ground-truth 3D reconstruction was provided by LiDAR point clouds. The second experiment used a public scene from the Pix4D official website, with ground truth generated by Pix4Dmapper. We evaluated accuracy, coverage, and runtime across all methods. Experimental results show that although COLMAP can provide reliable and geometrically consistent reconstruction results, it requires more computation time. In cases where traditional methods fail in image registration, learning-based approaches exhibit stronger feature-matching capability and greater robustness. Geometry-guided methods usually require careful dataset preparation and often depend on camera pose or depth priors generated by COLMAP. End-to-end methods such as DUSt3R and VGGT achieve competitive accuracy and reasonable coverage while offering substantially faster reconstruction. However, they exhibit relatively large residuals in 3D reconstruction, particularly in challenging scenarios.
[391] Real-Time Human Reconstruction and Animation using Feed-Forward Gaussian Splatting
Devdoot Chatterjee, Zakaria Laskar, C. V. Jawahar
Main category: cs.CV
TL;DR: A feed-forward Gaussian splatting framework for human 3D reconstruction from multi-view RGB images using SMPL-X poses, enabling real-time animation without repeated network inference.
Details
Motivation: Existing methods for human 3D reconstruction often require depth supervision, fixed input views, UV maps, or repeated feed-forward inference for each target view/pose, limiting their efficiency and real-time animation capabilities.Method: Predicts 3D Gaussian primitives associated with each SMPL-X vertex in canonical pose. Uses one constrained Gaussian near SMPL-X surface for geometric prior, plus additional unconstrained Gaussians per vertex to capture clothing/hair deviations. Enables animation via linear blend skinning without further network evaluation.
Result: Achieves reconstruction quality comparable to state-of-the-art methods on THuman 2.1, AvatarReX and THuman 4.0 datasets while uniquely supporting real-time animation and interactive applications.
Conclusion: The method provides an efficient, animatable human representation from single forward pass, bridging the gap between high-quality reconstruction and real-time animation capabilities.
Abstract: We present a generalizable feed-forward Gaussian splatting framework for human 3D reconstruction and real-time animation that operates directly on multi-view RGB images and their associated SMPL-X poses. Unlike prior methods that rely on depth supervision, fixed input views, UV map, or repeated feed-forward inference for each target view or pose, our approach predicts, in a canonical pose, a set of 3D Gaussian primitives associated with each SMPL-X vertex. One Gaussian is regularized to remain close to the SMPL-X surface, providing a strong geometric prior and stable correspondence to the parametric body model, while an additional small set of unconstrained Gaussians per vertex allows the representation to capture geometric structures that deviate from the parametric surface, such as clothing and hair. In contrast to recent approaches such as HumanRAM, which require repeated network inference to synthesize novel poses, our method produces an animatable human representation from a single forward pass; by explicitly associating Gaussian primitives with SMPL-X vertices, the reconstructed model can be efficiently animated via linear blend skinning without further network evaluation. We evaluate our method on the THuman 2.1, AvatarReX and THuman 4.0 datasets, where it achieves reconstruction quality comparable to state-of-the-art methods while uniquely supporting real-time animation and interactive applications. Code and pre-trained models are available at https://github.com/Devdoot57/HumanGS .
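The efficiency claim rests on animation requiring no further network evaluation once the canonical Gaussians are predicted. A minimal sketch of re-posing per-vertex Gaussian centers via linear blend skinning follows; tensor shapes and names are assumptions, and covariance rotation is omitted:

```python
import torch

def animate_gaussians(means_c, skin_weights, joint_transforms):
    """Re-pose canonical Gaussian centers with linear blend skinning (LBS).

    means_c:          (V, G, 3) canonical Gaussian centers, G per SMPL-X vertex
    skin_weights:     (V, J)    per-vertex skinning weights (rows sum to 1)
    joint_transforms: (J, 4, 4) rigid transform per joint for the target pose
    """
    # Blend per-joint transforms into one 4x4 transform per vertex.
    T = torch.einsum('vj,jab->vab', skin_weights, joint_transforms)  # (V, 4, 4)
    # Homogenize and transform every Gaussian attached to each vertex.
    homog = torch.cat([means_c, torch.ones_like(means_c[..., :1])], dim=-1)
    posed = torch.einsum('vab,vgb->vga', T, homog)                   # (V, G, 4)
    return posed[..., :3]
```

Because each Gaussian inherits its vertex's skinning weights, any SMPL-X pose can be rendered directly from one forward pass's output.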
[392] EditCrafter: Tuning-free High-Resolution Image Editing via Pretrained Diffusion Model
Kunho Kim, Sumin Seo, Yongjun Cho, Hyungjin Chung
Main category: cs.CV
TL;DR: EditCrafter enables high-resolution image editing without tuning by using pretrained diffusion models with tiled inversion and noise-damped guidance for resolutions beyond training sizes.
Details
Motivation: Existing diffusion-based image editing methods are limited to training resolutions (512x512 or 1024x1024) and fail with arbitrary aspect ratios or higher resolutions, producing unrealistic structures and repetition when naively applied patch-wise.Method: EditCrafter uses tiled inversion to preserve original image identity and introduces noise-damped manifold-constrained classifier-free guidance (NDCFG++) tailored for high-resolution editing from inverted latents, operating without fine-tuning.
Result: The method achieves impressive editing results across various resolutions without requiring fine-tuning or optimization, successfully handling resolutions significantly exceeding training sizes.
Conclusion: EditCrafter provides an effective solution for high-resolution image editing using pretrained diffusion models, overcoming resolution limitations through tiled inversion and specialized guidance techniques.
Abstract: We propose EditCrafter, a high-resolution image editing method that operates without tuning, leveraging pretrained text-to-image (T2I) diffusion models to process images at resolutions significantly exceeding those used during training. Leveraging the generative priors of large-scale T2I diffusion models enables the development of a wide array of novel generation and editing applications. Although numerous image editing methods have been proposed based on diffusion models and exhibit high-quality editing results, they are difficult to apply to images with arbitrary aspect ratios or higher resolutions since they only work at the training resolutions (512x512 or 1024x1024). Naively applying patch-wise editing fails with unrealistic object structures and repetition. To address these challenges, we introduce EditCrafter, a simple yet effective editing pipeline. EditCrafter operates by first performing tiled inversion, which preserves the original identity of the input high-resolution image. We further propose a noise-damped manifold-constrained classifier-free guidance (NDCFG++) that is tailored for high-resolution image editing from the inverted latent. Our experiments show that EditCrafter can achieve impressive editing results across various resolutions without fine-tuning or optimization.
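A minimal sketch of the tiling idea behind tiled inversion is shown below: an oversized latent is processed in overlapping windows so a model trained at a fixed resolution can cover it, with overlaps averaged to avoid seams. The NDCFG++ guidance itself is not reproduced; `fn` stands in for one inversion/denoising step:

```python
import torch

def tiled_apply(latent, fn, tile=64, overlap=16):
    """Apply fn over overlapping latent tiles and average the overlaps.
    latent: (B, C, H, W); fn maps a tile to a tile of the same shape."""
    _, _, H, W = latent.shape
    out = torch.zeros_like(latent)
    weight = torch.zeros_like(latent)
    stride = tile - overlap
    for y in range(0, max(H - overlap, 1), stride):
        for x in range(0, max(W - overlap, 1), stride):
            y1, x1 = min(y + tile, H), min(x + tile, W)
            y0, x0 = max(y1 - tile, 0), max(x1 - tile, 0)  # clamp at the border
            out[:, :, y0:y1, x0:x1] += fn(latent[:, :, y0:y1, x0:x1])
            weight[:, :, y0:y1, x0:x1] += 1.0
    return out / weight.clamp(min=1.0)
```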
[393] Dual-Exposure Imaging with Events
Mingyuan Lin, Hongyi Liu, Chu He, Wen Yang, Gui-Song Xia, Lei Yu
Main category: cs.CV
TL;DR: E-DEI algorithm uses event cameras to enhance dual-exposure imaging by addressing motion artifacts and feature discrepancies through event-based motion deblurring and low-light enhancement in a dual-path architecture.
Details
Motivation: Dual-Exposure Imaging (DEI) improves low-light image quality but suffers from artifacts due to scene motion and exposure differences. Event cameras offer high temporal resolution to capture accurate motion information, enabling better alignment and fusion of dual-exposure images.Method: Proposes Event-based DEI (E-DEI) algorithm that decomposes the task into event-based motion deblurring and low-light enhancement. Uses dual-path parallel feature propagation architecture with Dual-path Feature Alignment and Fusion (DFAF) module to align and fuse features from dual-exposure images using event data. Created real-world PIED dataset with paired low-/normal-light images and events.
Result: Experiments on multiple datasets demonstrate superiority of the method. The approach effectively reduces artifacts and improves image quality in low-light conditions with motion.
Conclusion: E-DEI effectively leverages event camera data to address limitations of traditional DEI, producing higher quality images in challenging low-light dynamic scenes through better motion handling and feature alignment.
Abstract: By combining complementary benefits of short- and long-exposure images, Dual-Exposure Imaging (DEI) enhances image quality in low-light scenarios. However, existing DEI approaches inevitably suffer from producing artifacts due to spatial displacement from scene motion and image feature discrepancies from different exposure times. To tackle this problem, we propose a novel Event-based DEI (E-DEI) algorithm, which reconstructs high-quality images from dual-exposure image pairs and events, leveraging the high temporal resolution of event cameras to provide accurate inter-/intra-frame dynamic information. Specifically, we decompose this complex task into an integration of two sub-tasks, i.e., event-based motion deblurring and low-light image enhancement, which guides us to design the E-DEI network as a dual-path parallel feature propagation architecture. We propose a Dual-path Feature Alignment and Fusion (DFAF) module to effectively align and fuse features extracted from dual-exposure images with the assistance of events. Furthermore, we build a real-world Dataset containing Paired low-/normal-light Images and Events (PIED). Experiments on multiple datasets show the superiority of our method. The code and dataset are available on GitHub.
[394] FastSHADE: Fast Self-augmented Hierarchical Asymmetric Denoising for Efficient inference on mobile devices
Nikolay Falaleev
Main category: cs.CV
TL;DR: FastSHADE is a lightweight U-Net-style network for real-time image denoising on mobile devices, featuring asymmetric frequency denoising blocks and noise shifting self-augmentation to achieve efficient speed-fidelity trade-offs.
Details
Motivation: Real-time image denoising is crucial for mobile photography but challenging due to latency and power constraints on edge devices. Existing methods struggle to balance efficiency with high-fidelity restoration for practical deployment.Method: Proposes FastSHADE with multi-stage architecture: 1) Asymmetric Frequency Denoising Block (AFDB) decouples spatial structure extraction from high-frequency noise suppression, 2) Spatially Gated Upsampler (SGU) optimizes high-resolution skip connection fusion, and 3) Noise Shifting Self-Augmentation strategy enhances data diversity without domain shifts.
Result: On MAI2021 benchmark, FastSHADE-M achieves real-time latency (<50 ms on mobile GPU) while preserving structural integrity. FastSHADE-XL establishes new state-of-the-art for overall image quality. The scalable model family demonstrates efficient speed-fidelity trade-offs.
Conclusion: FastSHADE successfully bridges the gap between theoretical network efficiency and practical deployment for real-world mobile ISP pipelines, offering a lightweight solution for real-time, high-fidelity image denoising on edge devices.
Abstract: Real-time image denoising is essential for modern mobile photography but remains challenging due to the strict latency and power constraints of edge devices. This paper presents FastSHADE (Fast Self-augmented Hierarchical Asymmetric Denoising), a lightweight U-Net-style network tailored for real-time, high-fidelity restoration on mobile GPUs. Our method features a multi-stage architecture incorporating a novel Asymmetric Frequency Denoising Block (AFDB) that decouples spatial structure extraction from high-frequency noise suppression to maximize efficiency, and a Spatially Gated Upsampler (SGU) that optimizes high-resolution skip connection fusion. To address generalization, we introduce an efficient Noise Shifting Self-Augmentation strategy that enhances data diversity without inducing domain shifts. Evaluations on the MAI2021 benchmark demonstrate that our scalable model family establishes a highly efficient speed-fidelity trade-off. Our base FastSHADE-M variant maintains real-time latency (<50 ms on a modern mobile GPU) while preserving structural integrity, and our scaled-up FastSHADE-XL establishes a new state-of-the-art for overall image quality. Ultimately, FastSHADE successfully bridges the gap between theoretical network efficiency and practical deployment for real-world mobile ISP pipelines.
[395] FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data
Peng Yuan, Bingyin Mei, Hui Zhang
Main category: cs.CV
TL;DR: Proposes Multi-View CIR task for product-level retrieval using multiple reference images, introduces FashionMV dataset, and presents ProCIR framework with multimodal LLM for improved performance.
Details
Motivation: Existing CIR methods operate at image level (single reference image + text), but real e-commerce users reason about products from multiple viewpoints, creating a "View Incompleteness" problem.Method: Introduces Multi-View CIR task, constructs FashionMV dataset (127K products, 472K images, 220K triplets), and proposes ProCIR framework with multimodal LLM using three mechanisms: two-stage dialogue, caption-based alignment, chain-of-thought guidance, plus optional supervised fine-tuning.
Result: Best 0.8B-parameter model outperforms all baselines including 10x larger general-purpose embedding models; systematic ablation shows alignment is most critical, two-stage dialogue is prerequisite, SFT and chain-of-thought are partially redundant.
Conclusion: Multi-View CIR addresses real-world product retrieval needs, ProCIR framework effectively leverages multimodal LLMs, and FashionMV enables product-level CIR research.
Abstract: Composed Image Retrieval (CIR) retrieves target images using a reference image paired with modification text. Despite rapid advances, all existing methods and datasets operate at the image level – a single reference image plus modification text in, a single target image out – while real e-commerce users reason about products shown from multiple viewpoints. We term this mismatch View Incompleteness and formally define a new Multi-View CIR task that generalizes standard CIR from image-level to product-level retrieval. To support this task, we construct FashionMV, the first large-scale multi-view fashion dataset for product-level CIR, comprising 127K products, 472K multi-view images, and over 220K CIR triplets, built through a fully automated pipeline leveraging large multimodal models. We further propose ProCIR (Product-level Composed Image Retrieval), a modeling framework built upon a multimodal large language model that employs three complementary mechanisms – two-stage dialogue, caption-based alignment, and chain-of-thought guidance – together with an optional supervised fine-tuning (SFT) stage that injects structured product knowledge prior to contrastive training. Systematic ablation across 16 configurations on three fashion benchmarks reveals that: (1) alignment is the single most critical mechanism; (2) the two-stage dialogue architecture is a prerequisite for effective alignment; and (3) SFT and chain-of-thought serve as partially redundant knowledge injection paths. Our best 0.8B-parameter model outperforms all baselines, including general-purpose embedding models 10x its size. The dataset, model, and code are publicly available at https://github.com/yuandaxia2001/FashionMV.
[396] Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking
Jingru Li, Wei Ren, Tianqing Zhu
Main category: cs.CV
TL;DR: Attention-Guided Visual Jailbreaking attacks LVLMs by manipulating attention patterns to suppress safety instruction retrieval rather than overpowering alignment, achieving high attack success rates with fewer iterations.
Details
Motivation: Existing visual jailbreaking attacks optimize image perturbations to maximize harmful output likelihood but suffer from slow convergence due to gradient conflict between adversarial objectives and the model's safety-retrieval mechanism. The authors aim to circumvent safety alignment more efficiently by directly manipulating attention patterns.Method: Proposes Attention-Guided Visual Jailbreaking with two auxiliary objectives: (1) suppressing attention to alignment-relevant prefix tokens, and (2) anchoring generation on adversarial image features. This push-pull formulation reduces gradient conflict by directly manipulating attention patterns rather than overpowering safety mechanisms.
Result: Achieves 94.4% attack success rate on Qwen-VL (vs. 68.8% baseline) with 40% fewer iterations. Reduces gradient conflict by 45%. At tighter perturbation budgets (ε=8/255), maintains 59.0% ASR compared to 45.7% for standard methods. Mechanistic analysis reveals successful attacks suppress system-prompt attention by 80%, causing “safety blindness.”
Conclusion: Attention manipulation provides a more effective approach to visual jailbreaking than traditional gradient-based attacks. The method demonstrates that LVLMs can be made to generate harmful content not by overriding safety rules, but by preventing their retrieval through attention suppression.
Abstract: Large Vision-Language Models (LVLMs) rely on attention-based retrieval of safety instructions to maintain alignment during generation. Existing attacks typically optimize image perturbations to maximize harmful output likelihood, but suffer from slow convergence due to gradient conflict between adversarial objectives and the model’s safety-retrieval mechanism. We propose Attention-Guided Visual Jailbreaking, which circumvents rather than overpowers safety alignment by directly manipulating attention patterns. Our method introduces two simple auxiliary objectives: (1) suppressing attention to alignment-relevant prefix tokens and (2) anchoring generation on adversarial image features. This simple yet effective push-pull formulation reduces gradient conflict by 45% and achieves 94.4% attack success rate on Qwen-VL (vs. 68.8% baseline) with 40% fewer iterations. At tighter perturbation budgets ($ε=8/255$), we maintain 59.0% ASR compared to 45.7% for standard methods. Mechanistic analysis reveals a failure mode we term safety blindness: successful attacks suppress system-prompt attention by 80%, causing models to generate harmful content not by overriding safety rules, but by failing to retrieve them.
[397] AC-MIL: Weakly Supervised Atrial LGE-MRI Quality Assessment via Adversarial Concept Disentanglement
K M Arefeen Sultan, Kaysen Hansen, Benjamin Orkild, Alan Morris, Eugene Kholmovski, Erik Bieging, Eugene Kwan, Ravi Ranjan, Ed DiBella, Shireen Elhabian
Main category: cs.CV
TL;DR: AC-MIL is a weakly supervised framework that decomposes MRI quality assessment into interpretable clinical concepts using adversarial erasure and spatial diversity constraints to provide actionable feedback on specific failure modes.
Details
Motivation: Current MIL methods for MRI quality assessment produce opaque global feature vectors that don't provide actionable feedback on specific failure modes like motion blur, inadequate contrast, or lack of anatomical context, limiting clinical utility.Method: Proposes Adversarial Concept-MIL (AC-MIL) with: 1) decomposition of global image quality into clinically defined radiological concepts using volume-level supervision, 2) unsupervised residual branch with adversarial erasure to prevent information leakage, and 3) spatial diversity constraint to penalize overlap between concept attention maps.
Result: AC-MIL successfully opens the MIL black box, providing highly localized spatial concept maps that allow clinicians to pinpoint specific causes of non-diagnostic scans while maintaining competitive ordinal grading performance against existing baselines.
Conclusion: AC-MIL achieves deep clinical transparency in weakly supervised quality assessment while maintaining strong performance, making it valuable for clinical applications where interpretability is crucial.
Abstract: High-quality Late Gadolinium Enhancement (LGE) MRI can be helpful for atrial fibrillation management, yet scan quality is frequently compromised by patient motion, irregular breathing, and suboptimal image acquisition timing. While Multiple Instance Learning (MIL) has emerged as a powerful tool for automated quality assessment under weak supervision, current state-of-the-art methods map localized visual evidence to a single, opaque global feature vector. This black box approach fails to provide actionable feedback on specific failure modes, obscuring whether a scan degrades due to motion blur, inadequate contrast, or a lack of anatomical context. In this paper, we propose Adversarial Concept-MIL (AC-MIL), a weakly supervised framework that decomposes global image quality into clinically defined radiological concepts using only volume-level supervision. To capture latent quality variations without entangling predefined concepts, our framework incorporates an unsupervised residual branch guided by an adversarial erasure mechanism to strictly prevent information leakage. Furthermore, we introduce a spatial diversity constraint that penalizes overlap between distinct concept attention maps, ensuring localized and interpretable feature extraction. Extensive experiments on a clinical dataset of atrial LGE-MRI volumes demonstrate that AC-MIL successfully opens the MIL black box, providing highly localized spatial concept maps that allow clinicians to pinpoint the specific causes of non-diagnostic scans. Crucially, our framework achieves this deep clinical transparency while maintaining highly competitive ordinal grading performance against existing baselines. Code to be released on acceptance.
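The spatial diversity constraint admits a compact reading: penalize pairwise overlap between concept attention maps. The sketch below is our interpretation, not the paper's exact formulation:

```python
import torch

def diversity_penalty(attn: torch.Tensor) -> torch.Tensor:
    """attn: (C, N) attention over N instances for each of C >= 2 concepts,
    rows non-negative and summing to 1. Returns mean pairwise overlap."""
    overlap = attn @ attn.t()                          # (C, C) inner products
    off_diag = overlap - torch.diag(torch.diag(overlap))
    C = attn.shape[0]
    return off_diag.sum() / (C * (C - 1))
```

Added to the MIL objective with a small weight, this term pushes each concept's attention toward a distinct spatial region, which is what makes the resulting concept maps localized and interpretable.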
[398] Class-Adaptive Cooperative Perception for Multi-Class LiDAR-based 3D Object Detection in V2X Systems
Blessing Agyei Kyem, Joshua Kofi Asamoah, Armstrong Aboah
Main category: cs.CV
TL;DR: A class-adaptive cooperative perception architecture for multi-class 3D object detection from LiDAR data that addresses limitations of uniform fusion strategies across different object classes.
Details
Motivation: Most cooperative 3D object detectors use uniform fusion strategies that fail to handle different geometric structures and point-sampling patterns of small vs. large objects, and evaluation protocols are narrow, leaving robust multi-class detection across diverse V2X interactions insufficiently explored.Method: Four components: 1) multi-scale window attention with learned scale routing for spatially adaptive feature extraction, 2) class-specific fusion module separating small/large objects into attentive fusion pathways, 3) bird’s-eye-view enhancement through parallel dilated convolution and channel recalibration, 4) class-balanced objective weighting to reduce bias toward frequent categories.
Result: Experiments on V2X-Real benchmark across various cooperation settings show consistent improvements over baselines, with largest gains on trucks, clear improvements on pedestrians, and competitive results on cars.
Conclusion: Aligning feature extraction and fusion with class-dependent geometry and point density leads to more balanced cooperative perception in realistic V2X deployments.
Abstract: Cooperative perception allows connected vehicles and roadside infrastructure to share sensor observations, creating a fused scene representation beyond the capability of any single platform. However, most cooperative 3D object detectors use a uniform fusion strategy for all object classes, which limits their ability to handle the different geometric structures and point-sampling patterns of small and large objects. This problem is further reinforced by narrow evaluation protocols that often emphasize a single dominant class or only a few cooperation settings, leaving robust multi-class detection across diverse vehicle-to-everything interactions insufficiently explored. To address this gap, we propose a class-adaptive cooperative perception architecture for multi-class 3D object detection from LiDAR data. The model integrates four components: multi-scale window attention with learned scale routing for spatially adaptive feature extraction, a class-specific fusion module that separates small and large objects into attentive fusion pathways, bird’s-eye-view enhancement through parallel dilated convolution and channel recalibration for richer contextual representation, and class-balanced objective weighting to reduce bias toward frequent categories. Experiments on the V2X-Real benchmark cover vehicle-centric, infrastructure-centric, vehicle-to-vehicle, infrastructure-to-infrastructure, and vehicle-to-infrastructure settings under identical backbone and training configurations. The proposed method consistently improves mean detection performance over strong intermediate-fusion baselines, with the largest gains on trucks, clear improvements on pedestrians, and competitive results on cars. These results show that aligning feature extraction and fusion with class-dependent geometry and point density leads to more balanced cooperative perception in realistic vehicle-to-everything deployments.
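A minimal sketch of the class-specific fusion idea appears below: separate pathways with different receptive fields for small and large objects, gated per BEV location. The paper's attentive fusion is richer than this, and the layer choices here are assumptions:

```python
import torch
import torch.nn as nn

class ClassSpecificFusion(nn.Module):
    """Route fused BEV features through small-object and large-object pathways."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.small_path = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 3, padding=1), nn.ReLU())
        self.large_path = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 5, padding=2), nn.ReLU())
        self.gate = nn.Conv2d(channels * 2, 2, 1)  # per-location pathway weights

    def forward(self, ego_bev, coop_bev):
        # ego_bev, coop_bev: (B, C, H, W) ego and shared cooperative BEV features
        x = torch.cat([ego_bev, coop_bev], dim=1)
        w = torch.softmax(self.gate(x), dim=1)      # (B, 2, H, W)
        small = self.small_path(x)                  # tight receptive field
        large = self.large_path(x)                  # wide receptive field
        return w[:, :1] * small + w[:, 1:] * large
```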
[399] SatReg: Regression-based Neural Architecture Search for Lightweight Satellite Image Segmentation
Edward Humes, Tinoosh Mohsenin
Main category: cs.CV
TL;DR: SatReg is a regression-based hardware-aware tuning framework for lightweight remote-sensing segmentation on edge platforms, using knowledge distillation and surrogate modeling to efficiently find near-optimal architecture settings for deployment targets.
Details
Motivation: As Earth-observation workloads move toward onboard and edge processing, remote-sensing segmentation models must operate under tight latency and energy constraints, requiring efficient adaptation to edge platforms.Method: Uses CM-UNet as teacher architecture, reduces search space to two dominant width-related variables, profiles student models on NVIDIA Jetson Orin Nano, fits low-order surrogate models for mIoU, latency, and power, and employs knowledge distillation for efficient training.
Result: The framework enables fast selection of near-optimal architecture settings without exhaustive search, showing that selected variables affect task accuracy and hardware cost differently.
Conclusion: Reduced-space regression is a practical strategy for adapting hybrid CNN-Mamba segmentation models to future space-edge systems.
Abstract: As Earth-observation workloads move toward onboard and edge processing, remote-sensing segmentation models must operate under tight latency and energy constraints. We present SatReg, a regression-based hardware-aware tuning framework for lightweight remote-sensing segmentation on edge platforms. Using CM-UNet as the teacher architecture, we reduce the search space to two dominant width-related variables, profile a small set of student models on an NVIDIA Jetson Orin Nano, and fit low-order surrogate models for mIoU, latency, and power. Knowledge distillation is used to efficiently train the sampled students. The learned surrogates enable fast selection of near-optimal architecture settings for deployment targets without exhaustive search. Results show that the selected variables affect task accuracy and hardware cost differently, making reduced-space regression a practical strategy for adapting hybrid CNN-Mamba segmentation models to future space-edge systems.
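The surrogate workflow is simple enough to sketch end to end: profile a handful of (width, width) student configurations, fit low-order regressors, then pick the best feasible point on a dense grid. The measurements below are placeholders, not numbers from the paper:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Profiled samples: (width_1, width_2) -> measured metric (placeholder values).
X = np.array([[0.25, 0.25], [0.25, 0.5], [0.25, 1.0], [0.5, 0.5],
              [1.0, 0.25], [0.5, 1.0], [1.0, 1.0]])
miou    = np.array([0.60, 0.63, 0.66, 0.68, 0.67, 0.70, 0.72])
latency = np.array([12.0, 15.0, 21.0, 19.0, 20.0, 27.0, 35.0])  # ms on-device

# Low-order (quadratic) surrogates, one per metric.
surr_miou = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X, miou)
surr_lat  = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X, latency)

# Select near-optimal widths under a latency budget without exhaustive search.
grid = np.array([[a, b] for a in np.linspace(0.25, 1.0, 16)
                        for b in np.linspace(0.25, 1.0, 16)])
feasible = grid[surr_lat.predict(grid) <= 25.0]
best = feasible[np.argmax(surr_miou.predict(feasible))]
print("selected widths:", best)
```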
[400] Anatomy-Informed Deep Learning for Abdominal Aortic Aneurysm Segmentation
Osamah Sufyan, Martin Brückmann, Ralph Wickenhöfer, Babette Dellen, Uwe Jaekel
Main category: cs.CV
TL;DR: Anatomy-aware segmentation framework for abdominal aortic aneurysms using organ exclusion masks from TotalSegmentator to reduce false positives and improve accuracy in CT angiography.
Details
Motivation: Accurate segmentation of abdominal aortic aneurysms in CT angiography is challenging due to large anatomical variability, low-contrast vessel boundaries, and proximity of organs with similar intensities to vascular structures, leading to false positives.Method: Proposes an anatomy-aware segmentation framework that integrates organ exclusion masks derived from TotalSegmentator into U-Net training. These masks encode explicit anatomical priors by identifying non-vascular organs and penalizing aneurysm predictions within these regions, guiding the model to focus on the aorta while suppressing anatomically implausible predictions.
Result: Despite training on a relatively small dataset, the anatomy-aware model achieves high accuracy, substantially reduces false positives, and improves boundary consistency compared to a standard U-Net baseline.
Conclusion: Incorporating anatomical knowledge through exclusion masks provides an efficient mechanism to enhance robustness and generalization, enabling reliable AAA segmentation even with limited training data.
Abstract: In CT angiography, the accurate segmentation of abdominal aortic aneurysms (AAAs) is difficult due to large anatomical variability, low-contrast vessel boundaries, and the close proximity of organs whose intensities resemble vascular structures, often leading to false positives. To address these challenges, we propose an anatomy-aware segmentation framework that integrates organ exclusion masks derived from TotalSegmentator into the training process. These masks encode explicit anatomical priors by identifying non-vascular organs and penalizing aneurysm predictions within these regions, thereby guiding the U-Net to focus on the aorta and its pathological dilation while suppressing anatomically implausible predictions. Despite being trained on a relatively small dataset, the anatomy-aware model achieves high accuracy, substantially reduces false positives, and improves boundary consistency compared to a standard U-Net baseline. The results demonstrate that incorporating anatomical knowledge through exclusion masks provides an efficient mechanism to enhance robustness and generalization, enabling reliable AAA segmentation even with limited training data.
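Folding an exclusion mask into the objective can be as simple as penalizing predicted aneurysm probability inside non-vascular organs. A minimal sketch, with the weighting and the exact penalty form as assumptions:

```python
import torch
import torch.nn.functional as F

def anatomy_aware_loss(logits, target, organ_mask, lam=1.0):
    """logits, target: (B, 1, H, W); organ_mask: (B, 1, H, W), 1 inside
    non-vascular organs (e.g., derived from TotalSegmentator)."""
    base = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    exclusion = (probs * organ_mask).mean()  # mass predicted in implausible regions
    return base + lam * exclusion
```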
[401] NTIRE 2026 Challenge on Single Image Reflection Removal in the Wild: Datasets, Results, and Methods
Jie Cai, Kangning Yang, Zhiyuan Li, Florin-Alexandru Vasluianu, Radu Timofte, Jinlong Li, Jinglin Shen, Zibo Meng, Junyan Cao, Lu Zhao, Pengwei Liu, Yuyi Zhang, Fengjun Guo, Jiagao Hu, Zepeng Wang, Fei Wang, Daiguo Zhou, Yi’ang Chen, Honghui Zhu, Mengru Yang, Yan Luo, Kui Jiang, Jin Guo, Jonghyuk Park, Jae-Young Sim, Wei Zhou, Hongyu Huang, Linfeng Li, Lindong Kong, Saiprasad Meesiyawar, Misbha Falak Khanpagadi, Nikhil Akalwadi, Ramesh Ashok Tabib, Uma Mudenagudi, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Kosuke Shigematsu, Hiroto Shirono, Asuka Shin, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi, Jiachen Tu, Shreeniketh Joshi, Jin-Hui Jiang, Yu-Fan Lin, Yu-Jou Hsiao, Chia-Ming Lee, Fu-En Yang, Yu-Chiang Frank Wang, Chih-Chung Hsu
Main category: cs.CV
TL;DR: NTIRE 2026 challenge on single-image reflection removal using OpenRR-5k dataset with real-world images, attracting 100+ registrations and advancing SOTA performance.
Details
Motivation: Single-image reflection removal (SIRR) is fundamental for image restoration, but existing methods are mostly tested on synthetic or limited real-world images, creating a gap for practical applications.Method: Organized a challenge using the OpenRR-5k dataset containing real-world images with various reflection scenarios and intensities, requiring participants to generate clean images without reflections.
Result: Challenge attracted over 100 registrations with 11 final participants; top-ranked methods advanced state-of-the-art reflection removal performance and received unanimous expert recognition.
Conclusion: The challenge successfully bridged the gap between academic research and real-world applications in reflection removal, with the OpenRR-5k dataset serving as a valuable benchmark for future research.
Abstract: In this paper, we review the NTIRE 2026 challenge on single-image reflection removal (SIRR) in the Wild. SIRR is a fundamental task in image restoration. Despite progress in academic research, most methods are tested on synthetic images or limited real-world images, creating a gap in real-world applications. In this challenge, we provide participants with the OpenRR-5k dataset, which requires them to process real-world images that cover a range of reflection scenarios and intensities, with the goal of generating clean images without reflections. The challenge attracted more than 100 registrations, with 11 of them participating in the final testing phase. The top-ranked methods advanced the state-of-the-art reflection removal performance and earned unanimous recognition from the five experts in the field. The proposed OpenRR-5k dataset is available at https://huggingface.co/datasets/qiuzhangTiTi/OpenRR-5k, and the homepage of this challenge is at https://github.com/caijie0620/OpenRR-5k. Due to page limitations, this article only presents partial content; the full report and detailed analyses are available in the extended arXiv version.
[402] SIMPLER: H&E-Informed Representation Learning for Structured Illumination Microscopy
Abu Zahid Bin Aziz, Syed Fahim Ahmed, Gnanesh Rasineni, Mei Wang, Olcaytu Hatipoglu, Marisa Ricci, Malaiyah Shaw, Guang Li, J. Quincy Brown, Valerio Pascucci, Shireen Elhabian
Main category: cs.CV
TL;DR: SIMPLER is a cross-modality self-supervised pretraining framework that aligns H&E-stained histology with Structured Illumination Microscopy (SIM) to learn reusable representations for fresh tissue imaging without staining.
Details
Motivation: Existing foundation models in digital pathology are trained on thin tissue sections (H&E/IHC) but don't address thick-tissue fluorescence modalities like SIM. Direct transfer to SIM suffers from modality shift, and naive fine-tuning overfits to modality-specific appearance rather than learning underlying histological structure.Method: SIMPLER uses H&E as a semantic anchor to learn reusable SIM representations through progressive alignment of SIM and H&E using adversarial, contrastive, and reconstruction-based objectives. This encourages SIM embeddings to internalize histological structure from H&E while preserving modality-specific characteristics.
Result: A single pretrained SIMPLER encoder transfers across multiple downstream tasks (multiple instance learning, morphological clustering), consistently outperforming SIM models trained from scratch or H&E-only pretraining. Joint alignment enhances SIM performance without degrading H&E representations.
Conclusion: SIMPLER enables effective cross-modality learning between traditional histology (H&E) and novel imaging modalities (SIM), demonstrating asymmetric enrichment where SIM benefits from H&E’s rich semantic annotations while maintaining its unique imaging characteristics.
Abstract: Structured Illumination Microscopy (SIM) enables rapid, high-contrast optical sectioning of fresh tissue without staining or physical sectioning, making it promising for intraoperative and point-of-care diagnostics. Recent foundation and large-scale self-supervised models in digital pathology have demonstrated strong performance on section-based modalities such as Hematoxylin and Eosin (H&E) and immunohistochemistry (IHC). However, these approaches are predominantly trained on thin tissue sections and do not explicitly address thick-tissue fluorescence modalities such as SIM. When transferred directly to SIM, performance is constrained by substantial modality shift, and naive fine-tuning often overfits to modality-specific appearance rather than underlying histological structure. We introduce SIMPLER (Structured Illumination Microscopy-Powered Learning for Embedding Representations), a cross-modality self-supervised pretraining framework that leverages H&E as a semantic anchor to learn reusable SIM representations. H&E encodes rich cellular and glandular structure aligned with established clinical annotations, while SIM provides rapid, nondestructive imaging of fresh tissue. During pretraining, SIM and H&E are progressively aligned through adversarial, contrastive, and reconstruction-based objectives, encouraging SIM embeddings to internalize histological structure from H&E without collapsing modality-specific characteristics. A single pretrained SIMPLER encoder transfers across multiple downstream tasks, including multiple instance learning and morphological clustering, consistently outperforming SIM models trained from scratch or H&E-only pretraining. Importantly, joint alignment enhances SIM performance without degrading H&E representations, demonstrating asymmetric enrichment rather than a trade-off between the two modalities.
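As a stand-in for the contrastive portion of the alignment (the adversarial and reconstruction objectives are omitted), a symmetric InfoNCE over paired SIM/H&E embeddings looks like this:

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(z_sim, z_he, tau=0.07):
    """z_sim, z_he: (B, D) embeddings of corresponding tissue regions;
    matched pairs sit on the diagonal of the similarity matrix."""
    z_sim = F.normalize(z_sim, dim=-1)
    z_he = F.normalize(z_he, dim=-1)
    logits = z_sim @ z_he.t() / tau
    labels = torch.arange(z_sim.shape[0], device=z_sim.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

Pulling each SIM patch toward its H&E counterpart (and away from the rest of the batch) is what lets the SIM encoder inherit histological structure from H&E.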
[403] Context Matters: Vision-Based Depression Detection Comparing Classical and Deep Approaches
Maneesh Bilalpur, Saurabh Hinduja, Sonish Sivarajkumar, Nicholas Allen, Yanshan Wang, Itir Onal Ertugrul, Jeffrey F. Cohn
Main category: cs.CV
TL;DR: Comparison of classical (handcrafted features + SVM) vs deep learning (FMAE-IAT embeddings + MLP) approaches for depression detection from vision, showing classical methods outperform in accuracy and fairness across two clinical contexts.
Details
Motivation: To understand how classical interpretable approaches compare to modern deep learning methods for depression detection from vision in terms of accuracy, fairness, and generalizability across different clinical contexts.Method: Compared classical approach (handcrafted features + SVM) with deep learning approach (FMAE-IAT embeddings + Multi-Layer Perceptron) on two depression datasets: TPOT (mother-child interactions) and Pitt (patient-clinician interviews). Depression was operationalized differently in each context.
Result: Classical approach achieved higher accuracy in both contexts and was significantly fairer than the deep approach in patient-clinician context. Cross-context generalizability was modest for both approaches, suggesting depression detection may be context-specific.
Conclusion: Classical interpretable methods outperform deep learning approaches for depression detection from vision in accuracy and fairness, with limited generalizability across contexts, indicating depression manifestations may be context-dependent.
Abstract: The classical approach to detecting depression from vision emphasizes interpretable features, such as facial expression, and classifiers such as the Support Vector Machine (SVM). With the advent of deep learning, there has been a shift in feature representations and classification approaches. Contemporary approaches use learnt features from general-purpose vision models such as VGGNet to train machine learning models. Little is known about how classical and deep approaches compare in depression detection with respect to accuracy, fairness, and generalizability, especially across contexts. To address these questions, we compared classical and deep approaches to the detection of depression in the visual modality in two different contexts: Mother-child interactions in the TPOT database and patient-clinician interviews in the Pitt database. In the former, depression was operationalized as a history of depression per the DSM and current or recent clinically significant symptoms. In the latter, all participants met initial criteria for depression per DSM, and depression was reassessed over the course of treatment. The classical approach included handcrafted features with SVM classifiers. Learnt features were turn-level embeddings from the FMAE-IAT that were combined with Multi-Layer Perceptron classifiers. The classical approach achieved higher accuracy in both contexts. It was also significantly fairer than the deep approach in the patient-clinician context. Cross-context generalizability was modest at best for both approaches, which suggests that depression may be context-specific.
[404] Multi-modal, multi-scale representation learning for satellite imagery analysis just needs a good ALiBi
Patrick Kage, Pavlos Andreadis
Main category: cs.CV
TL;DR: Scale-ALiBi: A transformer attention mechanism with spatial encoding bias for multi-resolution satellite imagery, improving performance on GEO-Bench benchmark.
Details
Motivation: Vision foundation models struggle with processing satellite imagery across multiple spatial resolutions and modes (optical vs SAR). Current approaches don't effectively handle relationships between image patches at different ground sample distances.Method: Proposes Scale-ALiBi, a linear bias transformer attention mechanism with spatial encoding bias for relationships between image patches at different scales. Implemented using triple-contrastive and reconstructive architecture on aligned high/low-resolution optical and low-resolution SAR satellite imagery.
Result: Shows improvement on the GEO-Bench benchmark. The newly curated dataset of aligned multi-resolution, multi-modal satellite imagery is released publicly.
Conclusion: Scale-ALiBi effectively addresses the challenge of processing satellite imagery across multiple resolutions and modes, providing better representations for downstream tasks through improved spatial encoding in transformer attention.
Abstract: Vision foundation models have been shown to be effective at processing satellite imagery into representations fit for downstream tasks, however, creating models which operate over multiple spatial resolutions and modes is challenging. This paper presents Scale-ALiBi, a linear bias transformer attention mechanism with a spatial encoding bias to relationships between image patches at different ground sample distance scales. We provide an implementation of Scale-ALiBi over a dataset of aligned high- and low-resolution optical and low-resolution SAR satellite imagery data using a triple-contrastive and reconstructive architecture, show an improvement on the GEO-Bench benchmark, and release the newly curated dataset publicly.
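A minimal sketch of the scale-aware bias as we read it: convert patch positions to ground distance using each patch's ground sample distance, then apply an ALiBi-style linear penalty, so attention decays with physical rather than pixel distance. The exact Scale-ALiBi parameterization may differ:

```python
import torch

def scale_alibi_bias(coords, gsd, slope=1.0):
    """coords: (N, 2) patch centers in pixels; gsd: (N,) meters-per-pixel of
    each patch's source image. Returns an (N, N) additive attention bias."""
    ground = coords * gsd[:, None]        # positions in meters, shared scale
    dist = torch.cdist(ground, ground)    # pairwise ground distance
    return -slope * dist                  # ALiBi-style: farther -> lower logit

# Usage inside attention (per head, typically with a head-specific slope):
# logits = q @ k.transpose(-2, -1) / d ** 0.5 + scale_alibi_bias(coords, gsd)
```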
[405] Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex
Alexandru Brateanu, Tingting Mu, Codruta Ancuti, Cosmin Ancuti
Main category: cs.CV
TL;DR: Multinex: An ultra-lightweight structured framework for low-light image enhancement that integrates multiple fine-grained representations within a Retinex residual formulation, achieving SOTA performance with minimal parameters (45K or 0.7K).
Details
Motivation: Current SOTA low-light image enhancement techniques rely on large models and multi-stage training, making them impractical for edge deployment. They also suffer from instability and artifacts due to dependence on single color spaces.Method: Proposes Multinex framework that decomposes images into illumination and color prior stacks from distinct analytic representations, then learns to fuse these representations into luminance and reflectance adjustments using lightweight neural operations within a Retinex residual formulation.
Result: Lightweight variants (45K and 0.7K parameters) significantly outperform corresponding lightweight SOTA models and reach comparable performance to heavy models in extensive benchmarks.
Conclusion: Multinex provides an effective ultra-lightweight solution for low-light image enhancement that balances performance and computational efficiency, making it suitable for edge deployment.
Abstract: Low-light image enhancement (LLIE) aims to restore natural visibility, color fidelity, and structural detail under severe illumination degradation. State-of-the-art (SOTA) LLIE techniques often rely on large models and multi-stage training, limiting practicality for edge deployment. Moreover, their dependence on a single color space introduces instability and visible exposure or color artifacts. To address these, we propose Multinex, an ultra-lightweight structured framework that integrates multiple fine-grained representations within a principled Retinex residual formulation. It decomposes an image into illumination and color prior stacks derived from distinct analytic representations, and learns to fuse these representations into luminance and reflectance adjustments required to correct exposure. By prioritizing enhancement over reconstruction and exploiting lightweight neural operations, Multinex significantly reduces computational cost, exemplified by its lightweight (45K parameters) and nano (0.7K parameters) versions. Extensive benchmarks show that all lightweight variants significantly outperform their corresponding lightweight SOTA models, and reach comparable performance to heavy models. Paper page available at https://albrateanu.github.io/multinex.
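A minimal sketch of the multi-prior idea: build illumination and color prior stacks from a few analytic color representations, to be fused by a tiny network under the Retinex model I = R * L. The specific priors chosen here are our assumptions, not the paper's recipe:

```python
import numpy as np
import cv2

def retinex_priors(img):
    """img: (H, W, 3) float RGB in [0, 1]. Returns (illumination, color) stacks."""
    u8 = (img * 255).astype(np.uint8)
    hsv = cv2.cvtColor(u8, cv2.COLOR_RGB2HSV) / 255.0
    lab = cv2.cvtColor(u8, cv2.COLOR_RGB2LAB) / 255.0
    illum = np.stack([img.max(axis=-1),       # max-RGB illumination estimate
                      hsv[..., 2],            # HSV value channel
                      lab[..., 0]], axis=-1)  # LAB lightness
    color = img / (illum[..., :1] + 1e-4)     # crude reflectance under max-RGB
    return illum, color
```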
[406] DeepShapeMatchingKit: Accelerated Functional Map Solver and Shape Matching Pipelines Revisited
Yizheng Xie, Lennart Bastian, Congyue Deng, Thomas W. Mitchel, Maolin Gao, Daniel Cremers
Main category: cs.CV
TL;DR: Deep functional maps for 3D shape matching get major speedup via vectorized reformulation, plus analysis of DiffusionNet variants and improved evaluation metrics.
Details
Motivation: Standard functional map implementations have computational bottlenecks at higher spectral resolutions due to solving k independent linear systems serially. There's also undocumented implementation divergence in DiffusionNet's spatial gradient features.Method: Proposes vectorized reformulation solving all systems in single kernel call for 33x speedup. Analyzes two DiffusionNet variants with different tangent-plane transformations. Introduces balanced accuracy metric for partial-to-partial matching evaluation.
Result: Achieves up to 33x speedup while preserving exact solution. Documents distinct behaviors of DiffusionNet variants across benchmarks. Shows balanced accuracy provides useful complementary metric under varying overlap ratios.
Conclusion: Provides computational improvements and implementation clarifications for deep functional maps, packaged in open-source DeepShapeMatchingKit for standardized training, evaluation, and data pipelines.
Abstract: Deep functional maps, leveraging learned feature extractors and spectral correspondence solvers, are fundamental to non-rigid 3D shape matching. Based on an analysis of open-source implementations, we find that standard functional map implementations solve k independent linear systems serially, which is a computational bottleneck at higher spectral resolution. We thus propose a vectorized reformulation that solves all systems in a single kernel call, achieving up to a 33x speedup while preserving the exact solution. Furthermore, we identify and document a previously unnoticed implementation divergence in the spatial gradient features of the mainstay DiffusionNet: two variants that parameterize distinct families of tangent-plane transformations, and present experiments analyzing their respective behaviors across diverse benchmarks. We additionally revisit overlap prediction evaluation for partial-to-partial matching and show that balanced accuracy provides a useful complementary metric under varying overlap ratios. To share these advancements with the wider community, we present an open-source codebase, DeepShapeMatchingKit, that incorporates these improvements and standardizes training, evaluation, and data pipelines for common deep shape matching methods. The codebase is available at: https://github.com/xieyizheng/DeepShapeMatchingKit
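The core speedup is replacing a Python loop of k independent solves with a single batched kernel call, which modern linear algebra backends support directly. A minimal demonstration on random well-conditioned systems (the toolkit's actual solver setup may differ):

```python
import torch

k, n = 128, 500                                 # spectral resolution, system size
A = torch.randn(k, n, n) + n * torch.eye(n)     # k well-conditioned systems
b = torch.randn(k, n, 1)

# Serial baseline: k independent solves, as in standard implementations.
x_serial = torch.stack([torch.linalg.solve(A[i], b[i]) for i in range(k)])

# Vectorized reformulation: a single batched call over the leading dimension.
x_batched = torch.linalg.solve(A, b)

assert torch.allclose(x_serial, x_batched, atol=1e-4)  # same solution, one kernel
```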
[407] Agentic Video Generation: From Text to Executable Event Graphs via Tool-Constrained LLM Planning
Nicolae Cudlenco, Mihai Masala, Marius Leordeanu
Main category: cs.CV
TL;DR: Agentic video generation system that uses LLMs to create structured event graphs (GEST) executed in 3D game engines, rather than generating pixels directly, ensuring semantic reliability and physical validity.
Details
Motivation: Existing multi-agent video generation systems produce visually impressive but semantically unreliable outputs with no ground truth annotations. The authors aim to create a system that generates semantically reliable and physically valid videos through structured specifications rather than direct pixel generation.Method: Uses LLMs to construct Graph of Events in Space and Time (GEST) - structured specification of actors, actions, objects, and temporal constraints. Employs hierarchical two-agent architecture: Director for story planning and Scene Builder for individual scenes through round-based state machine. Programmatic state backend enforces simulator constraints through validated tool calls, ensuring executable specifications by construction.
Result: Agentic narratives win 79% of text and 74% of video comparisons against procedural baselines via LLM jury. In seeded generation comparisons, engine-generated videos substantially outperform neural generators: 58% vs 25% and 20% on physical validity, and 3.75/5 vs 2.33 and 1.50 on semantic alignment.
Conclusion: The agentic system using structured GEST specifications executed in 3D game engines produces more semantically reliable and physically valid videos than neural generation approaches, demonstrating the value of separating narrative planning from constraint enforcement.
Abstract: Existing multi-agent video generation systems use LLM agents to orchestrate neural video generators, producing visually impressive but semantically unreliable outputs with no ground truth annotations. We present an agentic system that inverts this paradigm: instead of generating pixels, the LLM constructs a formal Graph of Events in Space and Time (GEST) – a structured specification of actors, actions, objects, and temporal constraints – which is then executed deterministically in a 3D game engine. A staged LLM refinement pipeline fails entirely at this task (0 of 50 attempts produce an executable specification), motivating a fundamentally different architecture based on a separation of concerns: the LLM handles narrative planning through natural language reasoning, while a programmatic state backend enforces all simulator constraints through validated tool calls, guaranteeing that every generated specification is executable by construction. The system uses a hierarchical two-agent architecture – a Director that plans the story and a Scene Builder that constructs individual scenes through a round-based state machine – with dedicated Relation Subagents that populate the logical and semantic edge types of the GEST formalism that procedural generation leaves empty, making this the first approach to exercise the full expressive capacity of the representation. We evaluate in two stages: autonomous generation against procedural baselines via a 3-model LLM jury, where agentic narratives win 79% of text and 74% of video comparisons; and seeded generation where the same text is given to our system, VEO 3.1, and WAN 2.2, with human annotations showing engine-generated videos substantially outperform neural generators on physical validity (58% vs 25% and 20%) and semantic alignment (3.75/5 vs 2.33 and 1.50).
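The "executable by construction" guarantee comes from routing all graph mutations through validated tool calls. A minimal sketch of a GEST-like structure whose mutation API rejects relations the engine cannot execute; the field names and relation vocabulary below are illustrative, not the paper's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    actor: str
    action: str
    objects: list = field(default_factory=list)

@dataclass
class GESTGraph:
    events: dict = field(default_factory=dict)   # event_id -> Event
    edges: list = field(default_factory=list)    # (src, relation, dst) triples

    # Hypothetical relation set; the real engine defines its own vocabulary.
    ALLOWED = ("before", "causes", "same_place")

    def add_edge(self, src, relation, dst):
        """Validated tool call: reject anything the simulator cannot execute."""
        if src not in self.events or dst not in self.events:
            raise ValueError("unknown event id")
        if relation not in self.ALLOWED:
            raise ValueError(f"relation {relation!r} not supported by the engine")
        self.edges.append((src, relation, dst))
```

The LLM agents only ever see such validated calls, so every specification they finish is, by construction, something the game engine can execute.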
[408] GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models
Nicolae Cudlenco, Mihai Masala, Marius Leordeanu
Main category: cs.CV
TL;DR: GTASA introduces a corpus of multi-actor videos with spatial relation graphs and temporal mappings, plus GEST-Engine for generating physically plausible and semantically faithful videos, outperforming neural generators in evaluations.
Details
Motivation: Current neural video generators struggle with complex multi-actor scenarios, and evaluating them is difficult due to lack of ground truth for physical plausibility and semantic faithfulness.Method: Created GTASA corpus with per-frame spatial relation graphs and event-level temporal mappings, developed GEST-Engine system based on Graphs of Events in Space and Time, compared with neural generators using human evaluation and video captioning models.
Result: GEST-Engine shows clear advantages over neural generators in physical validity and semantic alignment; self-supervised video encoders encode spatial structure better than VLM visual encoders across 11 spatiotemporal reasoning tasks.
Conclusion: GTASA provides valuable ground truth for evaluating video generation and understanding, revealing strengths of different video encoding approaches for spatiotemporal reasoning.
Abstract: Generating complex multi-actor scenario videos remains difficult even for state-of-the-art neural generators, while evaluating them is hard due to the lack of ground truth for physical plausibility and semantic faithfulness. We introduce GTASA, a corpus of multi-actor videos with per-frame spatial relation graphs and event-level temporal mappings, and the system that produced it based on Graphs of Events in Space and Time (GEST): GEST-Engine. We compare our method with both open and closed source neural generators and prove both qualitatively (human evaluation of physical validity and semantic alignment) and quantitatively (via training video captioning models) the clear advantages of our method. Probing four frozen video encoders across 11 spatiotemporal reasoning tasks enabled by GTASA’s exact 3D ground truth reveals that self-supervised encoders encode spatial structure significantly better than VLM visual encoders.
[409] FishRoPE: Projective Rotary Position Embeddings for Omnidirectional Visual Perception
Rahul Ahuja, Mudit Jain, Bala Murali Manoghar Sai Sudhakar, Venkatraman Narayanan, Pratik Likhar, Varun Ravi Kumar, Senthil Yogamani
Main category: cs.CV
TL;DR: FishRoPE adapts frozen vision foundation models to fisheye cameras using spherical coordinate attention and LoRA, achieving SOTA on fisheye perception tasks without retraining from scratch.
Details
Motivation: Current vision foundation models and BEV representations assume pinhole camera geometry, but fisheye cameras (widely used in autonomous vehicles) have severe radial distortion that makes these representations geometrically inconsistent. Retraining foundation models from scratch is impractical due to scarcity of large-scale fisheye annotations.Method: Two components: 1) Frozen DINOv2 backbone with Low-Rank Adaptation (LoRA) to transfer self-supervised features to fisheye without task-specific pretraining, 2) Fisheye Rotary Position Embedding (FishRoPE) which reparameterizes attention in spherical coordinates so attention operates on angular separation rather than pixel distance. FishRoPE is architecture-agnostic and reduces to standard formulation under pinhole geometry.
Result: Achieves state-of-the-art results on WoodScape 2D detection (54.3 mAP) and SynWoodScapes BEV segmentation (65.1 mIoU).
Conclusion: FishRoPE provides a lightweight framework to adapt frozen vision foundation models to fisheye geometry through spherical coordinate attention, solving the geometric inconsistency problem without expensive retraining.
Abstract: Vision foundation models (VFMs) and Bird’s Eye View (BEV) representation have advanced visual perception substantially, yet their internal spatial representations assume the rectilinear geometry of pinhole cameras. Fisheye cameras, widely deployed on production autonomous vehicles for their surround-view coverage, exhibit severe radial distortion that renders these representations geometrically inconsistent. At the same time, the scarcity of large-scale fisheye annotations makes retraining foundation models from scratch impractical. We present a lightweight framework that adapts frozen VFMs to fisheye geometry through two components: a frozen DINOv2 backbone with Low-Rank Adaptation (LoRA) that transfers rich self-supervised features to fisheye without task-specific pretraining, and Fisheye Rotary Position Embedding (FishRoPE), which reparameterizes the attention mechanism in the spherical coordinates of the fisheye projection so that both self-attention and cross-attention operate on angular separation rather than pixel distance. FishRoPE is architecture-agnostic, introduces negligible computational overhead, and naturally reduces to the standard formulation under pinhole geometry. We evaluate our framework on WoodScape 2D detection (54.3 mAP) and SynWoodScapes BEV segmentation (65.1 mIoU), where it achieves state-of-the-art results on both benchmarks.
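For intuition, here is a minimal, hypothetical sketch of the FishRoPE idea: rotary embeddings driven by each patch center's spherical angles rather than its pixel index, so attention scores depend on angular separation. The unprojection producing (theta, phi) and the frequency split are our assumptions, not the paper's exact formulation, and the head dimension is assumed divisible by 4.
```python
import torch

def rope_rotate(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive feature pairs of x by the given angles.
    x: (..., n_tokens, dim), angles: (n_tokens, dim // 2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def fisheye_rope(q, k, theta, phi, base=10000.0):
    """q, k: (batch, n_tokens, dim). theta/phi: (n_tokens,) spherical angles
    of each patch center, e.g. obtained from the fisheye projection model."""
    dim = q.shape[-1]
    half = dim // 2
    # Split the channel budget between the two angular axes.
    freqs = base ** (-torch.arange(half // 2, dtype=q.dtype) / (half // 2))
    ang = torch.cat([theta[:, None] * freqs, phi[:, None] * freqs], dim=-1)
    return rope_rotate(q, ang), rope_rotate(k, ang)
```
Because the rotation angles come from the viewing directions themselves, the same code applies unchanged to pinhole images, where the angles become (approximately) proportional to pixel coordinates.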
[410] Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation
Yuanhao Luo, Di Wen, Kunyu Peng, Ruiping Liu, Junwei Zheng, Yufan Chen, Jiale Wei, Rainer Stiefelhagen
Main category: cs.CV
TL;DR: HOI-DA: A joint detection-anticipation framework for video-based human-object interaction understanding that models future interactions as residual transitions from current pair states, with improved benchmark for faithful evaluation.
Details
Motivation: Existing methods treat anticipation as a downstream forecasting task separate from detection, limiting joint reasoning. Current benchmarks have sparse keyframe annotations that can misalign future labels from actual dynamics, reducing anticipation evaluation reliability.
Method: Introduces DETAnt-HOI benchmark (temporally corrected from VidHOI and Action Genome) and HOI-DA framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states.
Result: Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons. The joint learning approach proves most effective when anticipation is learned together with detection as a structural constraint on pair-level video representation learning.
Conclusion: Anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. The proposed framework and benchmark address limitations of existing approaches for more faithful multi-horizon evaluation.
Abstract: Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can temporally misalign nominal future labels from actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states. Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons. Our results highlight that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. Benchmark and code will be publicly available.
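As a rough illustration of the residual-transition idea, a per-horizon head can predict each future pair state as the current state plus a learned residual. The layer sizes and per-horizon MLPs below are our assumptions, not the authors' architecture.
```python
import torch
import torch.nn as nn

class ResidualAnticipationHead(nn.Module):
    """Predicts the pair state at each future horizon as the current
    state plus a learned residual transition (hypothetical layer sizes)."""
    def __init__(self, dim: int, n_horizons: int):
        super().__init__()
        self.residual_mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_horizons)
        )

    def forward(self, pair_state: torch.Tensor) -> torch.Tensor:
        # pair_state: (n_pairs, dim) embedding of a human-object pair "now".
        futures = [pair_state + mlp(pair_state) for mlp in self.residual_mlps]
        return torch.stack(futures, dim=1)  # (n_pairs, n_horizons, dim)
```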
[411] IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly
Di Wen, Zeyun Zhong, David Schneider, Manuel Zaremski, Linus Kunzmann, Yitian Shi, Ruiping Liu, Yufan Chen, Junwei Zheng, Jiahang Li, Jonas Hemmerich, Qiyi Tong, Patric Grauberger, Arash Ajoudani, Danda Pani Paudel, Sven Matthiesen, Barbara Deml, Jürgen Beyerer, Luc Van Gool, Rainer Stiefelhagen, Kunyu Peng
Main category: cs.CV
TL;DR: IMPACT is a synchronized five-view RGB-D dataset for industrial procedural understanding, focusing on assembly/disassembly of an angle grinder with multi-view capture, detailed annotations, and anomaly-recovery supervision.
Details
Motivation: Current industrial procedural understanding benchmarks lack realistic deployment conditions: they don't capture synchronized ego-exo views, decoupled bimanual actions, compliance-aware state tracking, and anomaly-recovery supervision within real industrial workflows.
Method: Created a dataset with 112 trials from 13 participants (39.5 hours total) using synchronized five-view RGB-D capture during real assembly/disassembly of a commercial angle grinder. Includes multi-route execution via partial-order prerequisite graph, six-category anomaly taxonomy, NASA-TLX cognitive load measurement, and hierarchical annotation linking hand-specific atomic actions to procedural steps.
Result: The dataset reveals fundamental limitations invisible to single-task benchmarks, particularly under realistic deployment conditions involving incomplete observations, flexible execution paths, and corrective behavior. Systematic baselines demonstrate these challenges.
Conclusion: IMPACT provides the first real industrial assembly benchmark with comprehensive synchronized multi-view capture and detailed annotations, enabling research on deployment-oriented procedural understanding under realistic conditions.
Abstract: We introduce IMPACT, a synchronized five-view RGB-D dataset for deployment-oriented industrial procedural understanding, built around real assembly and disassembly of a commercial angle grinder with professional-grade tools. To our knowledge, IMPACT is the first real industrial assembly benchmark that jointly provides synchronized ego-exo RGB-D capture, decoupled bimanual annotation, compliance-aware state tracking, and explicit anomaly–recovery supervision within a single real industrial workflow. It comprises 112 trials from 13 participants totaling 39.5 hours, with multi-route execution governed by a partial-order prerequisite graph, a six-category anomaly taxonomy, and operator cognitive load measured via NASA-TLX. The annotation hierarchy links hand-specific atomic actions to coarse procedural steps, component assembly states, and per-hand compliance phases, with synchronized null spans across views to decouple perceptual limitations from algorithmic failure. Systematic baselines reveal fundamental limitations that remain invisible to single-task benchmarks, particularly under realistic deployment conditions that involve incomplete observations, flexible execution paths, and corrective behavior. The full dataset, annotations, and evaluation code are available at https://github.com/Kratos-Wen/IMPACT.
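A minimal sketch of how a partial-order prerequisite graph like the one governing IMPACT's multi-route trials can validate an execution route. The step names and edges are invented for illustration.
```python
# Each step maps to the set of steps that must be completed first.
PREREQS = {
    "mount_guard": set(),
    "attach_disc": {"mount_guard"},
    "tighten_nut": {"attach_disc"},
}

def route_is_valid(route: list[str]) -> bool:
    """A route is valid if every step appears only after all of its
    prerequisites have been completed; any linear extension of the
    partial order is acceptable."""
    done: set[str] = set()
    for step in route:
        if not PREREQS.get(step, set()) <= done:
            return False
        done.add(step)
    return True

assert route_is_valid(["mount_guard", "attach_disc", "tighten_nut"])
assert not route_is_valid(["attach_disc", "mount_guard", "tighten_nut"])
```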
[412] Neural Stochastic Processes for Satellite Precipitation Refinement
Shunya Nagashima, Takumi Bannai, Shuitsu Koyama, Tomoya Mitsui, Shuntaro Suzuki
Main category: cs.CV
TL;DR: Neural Stochastic Process (NSP) model for precipitation estimation that fuses satellite and gauge data using a Neural Process encoder and latent Neural SDE, outperforming 13 baselines on new QPEBench benchmark.
Details
Motivation: Accurate precipitation estimation is crucial for flood forecasting and water management. Satellite data has systematic biases while ground gauges are sparse. Existing methods treat time steps independently, discarding temporal structure in precipitation fields.
Method: Proposes Neural Stochastic Process (NSP) with Neural Process encoder conditioning on arbitrary sets of gauge observations and latent Neural SDE on 2D spatial representation. Trained under single variational objective with simulation-free cost. Also introduces QPEBench benchmark with 43,756 hourly samples over CONUS.
Result: NSP outperforms 13 baselines across all six evaluation metrics on QPEBench and surpasses JAXA’s operational gauge-calibrated product. Additional experiment on Kyushu, Japan confirms generalization to different region with independent data sources.
Conclusion: NSP effectively fuses satellite and gauge data for precipitation estimation by capturing temporal structure, demonstrating state-of-the-art performance and generalization capabilities across different regions.
Abstract: Accurate precipitation estimation is critical for flood forecasting, water resource management, and disaster preparedness. Satellite products provide global hourly coverage but contain systematic biases; ground-based gauges are accurate at point locations but too sparse for direct gridded correction. Existing methods fuse these sources by interpolating gauge observations onto the satellite grid, but treat each time step independently and therefore discard temporal structure in precipitation fields. We propose Neural Stochastic Process (NSP), a model that pairs a Neural Process encoder conditioning on arbitrary sets of gauge observations with a latent Neural SDE on a 2D spatial representation. NSP is trained under a single variational objective with simulation-free cost. We also introduce QPEBench, a benchmark of 43,756 hourly samples over the Contiguous United States (2021–2025) with four aligned data sources and six evaluation metrics. On QPEBench, NSP outperforms 13 baselines across all six metrics and surpasses JAXA’s operational gauge-calibrated product. An additional experiment on Kyushu, Japan confirms generalization to a different region with independent data sources.
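A compact sketch of one conditioned latent-SDE update in the spirit of NSP. The flattened latent, the network sizes, and the additive conditioning on a Neural-Process context vector are all assumptions; the paper's SDE runs on a 2D spatial representation.
```python
import torch
import torch.nn as nn

class LatentSDEStep(nn.Module):
    """One Euler-Maruyama update of a latent state z conditioned on a
    context vector c (e.g. a Neural-Process encoding of gauge readings)."""
    def __init__(self, dim: int, ctx_dim: int):
        super().__init__()
        self.drift = nn.Sequential(nn.Linear(dim + ctx_dim, dim), nn.Tanh(),
                                   nn.Linear(dim, dim))
        self.diffusion = nn.Sequential(nn.Linear(dim + ctx_dim, dim),
                                       nn.Softplus())  # keep diffusion positive

    def forward(self, z: torch.Tensor, c: torch.Tensor, dt: float):
        zc = torch.cat([z, c], dim=-1)
        noise = torch.randn_like(z)
        # z_{t+dt} = z_t + f(z, c) dt + g(z, c) sqrt(dt) * eps
        return z + self.drift(zc) * dt + self.diffusion(zc) * (dt ** 0.5) * noise
```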
[413] Point2Pose: Occlusion-Recovering 6D Pose Tracking and 3D Reconstruction for Multiple Unknown Objects Via 2D Point Trackers
Tzu-Yuan Lin, Ho Jae Lee, Kevin Doherty, Yonghyeon Lee, Sangbae Kim
Main category: cs.CV
TL;DR: Point2Pose is a model-free method for 6D pose tracking of multiple objects from monocular RGB-D video using sparse image points and online TSDF reconstruction.
Details
Motivation: Existing model-free tracking methods lack capabilities for multi-object tracking and recovery from complete occlusion. The authors aim to develop a method that can track multiple unseen objects without requiring CAD models or category priors.
Method: Uses 2D point tracking for long-range correspondences to enable recovery after occlusion, while incrementally reconstructing an online Truncated Signed Distance Function (TSDF) representation of tracked objects. Initialized only from sparse image points.
Result: Achieves performance comparable to state-of-the-art on severe-occlusion benchmarks while supporting multi-object tracking and recovery from complete occlusion, capabilities not supported by previous model-free approaches.
Conclusion: Point2Pose provides a robust model-free solution for 6D pose tracking that handles multiple objects and recovers from complete occlusions, demonstrated through a new multi-object tracking dataset.
Abstract: We present Point2Pose, a model-free method for causal 6D pose tracking of multiple rigid objects from monocular RGB-D video. Initialized only from sparse image points on the objects to be tracked, our approach tracks multiple unseen objects without requiring object CAD models or category priors. Point2Pose leverages a 2D point tracker to obtain long-range correspondences, enabling instant recovery after complete occlusion. Simultaneously, the system incrementally reconstructs an online Truncated Signed Distance Function (TSDF) representation of the tracked targets. Alongside the method, we introduce a new multi-object tracking dataset comprising both simulation and real-world sequences, with motion-capture ground truth for evaluation. Experiments show that Point2Pose achieves performance comparable to the state-of-the-art methods on a severe-occlusion benchmark, while additionally supporting multi-object tracking and recovery from complete occlusion, capabilities that are not supported by previous model-free tracking approaches.
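The pose update from tracked correspondences can be illustrated with the standard Kabsch/SVD rigid alignment: tracked 2D points are back-projected through the depth map into 3D, then aligned to their reference positions. How Point2Pose actually weights or filters correspondences is not specified here; this is only the textbook building block.
```python
import numpy as np

def rigid_pose_from_points(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid transform (R, t) with dst ~= R @ src + t,
    via the Kabsch/SVD method. src, dst: (n, 3) matched 3D points."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)          # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```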
[414] DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain
Song Jin, Juntian Zhang, Xun Zhang, Zeying Tian, Fei Jiang, Guojun Yin, Wei Lin, Yong Liu, Rui Yan
Main category: cs.CV
TL;DR: DiningBench: A hierarchical multi-view benchmark for evaluating VLMs on food understanding across fine-grained classification, nutrition estimation, and visual QA, revealing current models’ limitations in fine-grained discrimination and nutritional reasoning.
Details
Motivation: Current Vision-Language Models (VLMs) have limited application in the food domain due to existing benchmarks that use coarse-grained categories, single-view images, and inaccurate metadata. There's a need for more comprehensive evaluation of VLMs' food understanding capabilities.
Method: Introduces DiningBench, a hierarchical multi-view benchmark with 3,021 distinct dishes averaging 5.27 images per entry. Features fine-grained “hard” negatives from identical menus and verification-based nutritional data. Evaluates 29 state-of-the-art open-source and proprietary models across three cognitive levels.
Result: Current VLMs excel at general reasoning but struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. The study identifies five primary failure modes and investigates the impact of multi-view inputs and Chain-of-Thought reasoning.
Conclusion: DiningBench serves as a challenging testbed to advance food-centric VLM research, highlighting the need for improved fine-grained discrimination and nutritional reasoning capabilities in vision-language models.
Abstract: Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: Fine-Grained Classification, Nutrition Estimation, and Visual Question Answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, incorporating fine-grained “hard” negatives from identical menus and rigorous, verification-based nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general reasoning, they struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. Furthermore, we systematically investigate the impact of multi-view inputs and Chain-of-Thought reasoning, identifying five primary failure modes. DiningBench serves as a challenging testbed to drive the next generation of food-centric VLM research. All code is released at https://github.com/meituan/DiningBench.
[415] SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units
Ruibin Wang, Zhenyu Lin, Xinhai Zhao
Main category: cs.CV
TL;DR: SignReasoner transforms general VLMs into expert traffic sign reasoners using Functional Structure Units (FSUs) for compositional generalization, achieving SOTA on TrafficSignEval benchmark.
Details
Motivation: Current models lack compositional generalization for complex traffic signs with intricate layouts, multi-lingual text, and composite symbols, which is critical for autonomous driving safety.
Method: Proposes Functional Structure Unit (FSU) for function-based decomposition of signs into minimal core blocks. Uses two-stage VLM post-training: Iterative Caption-FSU Distillation and FSU-GRPO with Tree Edit Distance rewards.
Result: SignReasoner achieves new state-of-the-art on TrafficSignEval benchmark with remarkable data efficiency and no architectural modification, significantly improving traffic sign understanding in various VLMs.
Conclusion: The FSU-based approach enables robust generalization to unseen sign compositions by learning underlying structural grammar, making VLMs effective traffic sign reasoners.
Abstract: Accurate semantic understanding of complex traffic signs, including those with intricate layouts, multi-lingual text, and composite symbols, is critical for autonomous driving safety. Current models, both specialized small ones and large Vision Language Models (VLMs), suffer from a significant bottleneck: a lack of compositional generalization, leading to failure when encountering novel sign configurations. To overcome this, we propose SignReasoner, a novel paradigm that transforms general VLMs into expert traffic sign reasoners. Our core innovation is the Functional Structure Unit (FSU), which shifts from common instance-based modeling to flexible function-based decomposition. By breaking down complex signs into minimal, core functional blocks (e.g., Direction, Notice, Lane), our model learns the underlying structural grammar, enabling robust generalization to unseen compositions. We define this decomposition as the FSU-Reasoning task and introduce a two-stage VLM post-training pipeline to maximize performance: Iterative Caption-FSU Distillation, which enhances the model’s accuracy in both FSU-reasoning and caption generation, and FSU-GRPO, which uses Tree Edit Distance (TED) to compute FSU differences as rewards in the GRPO algorithm, boosting reasoning abilities. Experiments on the newly proposed FSU-Reasoning benchmark, TrafficSignEval, show that SignReasoner achieves new SOTA with remarkable data efficiency and no architectural modification, significantly improving traffic sign understanding across various VLMs.
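A hedged sketch of a Tree-Edit-Distance reward in the spirit of FSU-GRPO, using the zss (Zhang-Shasha) library. The nested-dict FSU schema and the exponential shaping are illustrative assumptions; the paper does not specify this exact form.
```python
import math
from zss import Node, simple_distance

def fsu_tree(spec: dict) -> Node:
    # Build a labeled zss tree from a nested dict of FSU blocks.
    root = Node(spec["label"])
    for child in spec.get("children", []):
        root.addkid(fsu_tree(child))
    return root

def ted_reward(pred_spec: dict, gold_spec: dict, scale: float = 0.5) -> float:
    # Reward in (0, 1]; equals 1 only when the trees match exactly.
    ted = simple_distance(fsu_tree(pred_spec), fsu_tree(gold_spec))
    return math.exp(-scale * ted)

gold = {"label": "sign", "children": [{"label": "Direction"}, {"label": "Lane"}]}
pred = {"label": "sign", "children": [{"label": "Direction"}, {"label": "Notice"}]}
print(ted_reward(pred, gold))  # one relabel edit -> reward < 1
```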
[416] Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance
Chenyu Wang, Weicheng Dai, Han Liu, Wenchao Li, Kayhan Batmanghelich
Main category: cs.CV
TL;DR: A novel framework (DCP-PD) for radiology report generation that uses discriminative cue-prompting with prompt dropout to improve fine-grained attribute and spatial localization in chest CT reports, achieving state-of-the-art performance.
Details
Motivation: Existing vision-language models for radiology report generation have two key limitations: (1) coarse training supervision that aligns whole CT volumes with full reports without explicit alignment for fine-grained attributes or pathology locations, and (2) holistic evaluation methods that don't assess spatial grounding diagnostically.
Method: Proposes Discriminative Cue-Prompting with Prompt Dropout (DCP-PD), a plug-and-play framework that distills fine-grained cues from free-text reports and uses them to guide report generation while mitigating shortcut reliance via prompt dropout. Also introduces a hierarchical, location-aware question-set protocol to assess pathology-location grounding.
Result: Achieves state-of-the-art performance on CT-RATE, improving macro F1 from 0.501 to 0.603 (20% relative improvement), and substantially boosts out-of-distribution performance on Rad-ChestCT from F1 0.266 to 0.503 (89% relative improvement).
Conclusion: The proposed DCP-PD framework effectively improves fine-grained attribute alignment and spatial localization in radiology report generation, though fine-grained spatial localization remains challenging even for high-performing models on current benchmarks.
Abstract: Vision–language models (VLMs) for radiology report generation (RRG) can produce long-form chest CT reports from volumetric scans and show strong potential to improve radiology workflow efficiency and consistency. However, existing methods face two key limitations: (i) training supervision is often coarse, aligning a whole CT volume with a full free-text report without explicit alignment for fine-grained attributes or pathology locations; and (ii) evaluation is typically holistic (lexical overlap, entity matching, or LLM-as-a-judge scores) and not diagnostic for spatial grounding. We propose Discriminative Cue-Prompting with Prompt Dropout (DCP-PD), a plug-and-play framework that distills fine-grained cues from free-text reports and uses them to guide report generation while mitigating shortcut reliance via prompt dropout. DCP-PD achieves state-of-the-art performance on CT-RATE, improving macro F1 from 0.501 to 0.603 (20% relative), and substantially boosts out-of-distribution performance on Rad-ChestCT from F1 0.266 to 0.503 (89% relative). Finally, we introduce a hierarchical, location-aware question-set protocol (presence → laterality → lobe) to directly assess pathology-location grounding, showing that fine-grained spatial localization remains challenging even for models that score highly on current benchmarks.
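The prompt-dropout half of DCP-PD can be pictured with a few lines. The cue format, drop probability, and prompt template below are assumptions for illustration; the point is only that each distilled cue is randomly withheld during training so the generator cannot rely on cues alone.
```python
import random

def build_prompt(base_instruction: str, cues: list[str], p_drop: float = 0.3,
                 training: bool = True) -> str:
    """Prepend distilled fine-grained cues (attributes, locations) to the
    generation prompt, dropping each cue independently during training."""
    kept = [c for c in cues if not (training and random.random() < p_drop)]
    if not kept:
        return base_instruction
    return "Findings cues: " + "; ".join(kept) + "\n" + base_instruction
```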
[417] PERCEPT-Net: A Perceptual Loss Driven Framework for Reducing MRI Artifact Tissue Confusion
Ziheng Guo, Danqun Zheng, Chengwei Chen, Boyang Pan, Shuai Li, Ziqin Yu, Xiaoxiao Chen, Langdi Zhong, Yun Bian, Nan-Jie Gong
Main category: cs.CV
TL;DR: PERCEPT-Net is a deep learning framework for MRI motion artifact correction that uses perceptual supervision to preserve anatomical structures while suppressing artifacts.
Details
Motivation: Existing MRI artifact correction models suffer from poor clinical generalization due to artifact-tissue confusion, failing to distinguish artifacts from anatomical structures, leading to compromised diagnostic quality.
Method: Uses residual U-Net backbone with multi-scale recovery module and dual attention mechanisms, enhanced by Motion Perceptual Loss (MPL) that provides artifact-aware supervision by learning generalizable motion artifact representations.
Result: Outperformed state-of-the-art methods on clinical data, with ablation studies showing MPL is crucial for structural consistency and tissue contrast preservation. Radiologist evaluations confirmed superior image quality and diagnostic structure preservation.
Conclusion: PERCEPT-Net effectively suppresses motion artifacts in clinical MRI without compromising anatomical integrity through task-specific perceptual learning, improving clinical robustness and mitigating over-smoothing in medical image reconstruction.
Abstract: Purpose: Existing deep learning-based MRI artifact correction models exhibit poor clinical generalization due to inherent artifact-tissue confusion, failing to discriminate artifacts from anatomical structures. To resolve this, we introduce PERCEPT-Net, a framework leveraging dedicated perceptual supervision for structure-preserving artifact suppression. Method: PERCEPT-Net utilizes a residual U-Net backbone integrated with a multi-scale recovery module and dual attention mechanisms to preserve anatomical context and salient features. The core mechanism, Motion Perceptual Loss (MPL), provides artifact-aware supervision by learning generalizable motion artifact representations. This logic directly guides the network to suppress artifacts while maintaining anatomical fidelity. Training utilized a hybrid dataset of real and simulated sequences, followed by prospective validation via objective metrics and expert radiologist assessments. Result: PERCEPT-Net outperformed state-of-the-art methods on clinical data. Ablation analysis established a direct causal link between MPL and performance; its omission caused a significant deterioration in structural consistency (p < 0.001) and tissue contrast (p < 0.001). Radiologist evaluations corroborated these objective metrics, scoring PERCEPT-Net significantly higher in global image quality (median 3 vs. 2, p < 0.001) and verifying the preservation of critical diagnostic structures. Conclusion: By integrating task-specific, artifact-aware perceptual learning, PERCEPT-Net suppresses motion artifacts in clinical MRI without compromising anatomical integrity. This framework improves clinical robustness and provides a verifiable mechanism to mitigate over-smoothing and structural degradation in medical image reconstruction.
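A rough sketch of what an artifact-aware perceptual loss can look like: compare output and target in the feature space of a network pretrained to represent motion artifacts, rather than in pixel space. The sequential layer-tapping scheme is our assumption about how an MPL-style loss could be wired, not the paper's implementation.
```python
import torch
import torch.nn as nn

def motion_perceptual_loss(pred: torch.Tensor, target: torch.Tensor,
                           artifact_encoder: nn.Module,
                           layers=(1, 3)) -> torch.Tensor:
    """Accumulate L1 feature differences at selected depths of a frozen
    encoder trained on motion-artifact representations; assumes the
    encoder is a Sequential-like stack of blocks."""
    loss = pred.new_zeros(())
    feats_p, feats_t = pred, target
    for i, block in enumerate(artifact_encoder.children()):
        feats_p, feats_t = block(feats_p), block(feats_t)
        if i in layers:
            loss = loss + nn.functional.l1_loss(feats_p, feats_t)
    return loss
```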
[418] ReContraster: Making Your Posters Stand Out with Regional Contrast
Peixuan Zhang, Zijian Jia, Ziqi Cai, Shuchen Weng, Si Li, Boxin Shi
Main category: cs.CV
TL;DR: ReContraster: A training-free model for poster design using regional contrast effects and compositional multi-agent systems to create visually striking posters.
Details
Motivation: Effective poster design requires capturing attention and conveying messages clearly. Inspired by the "contrast effects" principle, the paper aims to create posters that stand out by leveraging regional contrast.
Method: Training-free model using compositional multi-agent system to emulate cognitive behaviors of poster designers (identify elements, organize layout, evaluate candidates). Integrates hybrid denoising strategy during diffusion process for harmonious region transitions.
Result: Superior performance over state-of-the-art methods confirmed by seven quantitative metrics and four user studies. Produces visually striking and aesthetically appealing posters.
Conclusion: ReContraster successfully creates attention-grabbing posters using regional contrast principles and multi-agent systems, with comprehensive evaluation showing its effectiveness.
Abstract: Effective poster design requires rapidly capturing attention and clearly conveying messages. Inspired by the “contrast effects” principle, we propose ReContraster, the first training-free model to leverage regional contrast to make posters stand out. By emulating the cognitive behaviors of a poster designer, ReContraster introduces a compositional multi-agent system to identify elements, organize layout, and evaluate generated poster candidates. To further ensure harmonious transitions across region boundaries, ReContraster integrates a hybrid denoising strategy during the diffusion process. We additionally contribute a new benchmark dataset for comprehensive evaluation. Seven quantitative metrics and four user studies confirm its superiority over relevant state-of-the-art methods, producing visually striking and aesthetically appealing posters.
[419] Parameter Efficient Fine-tuning for Domain-specific Gastrointestinal Disease Recognition
Sanjaya Poudel, Nikita Kunwor, Raj Simkhada, Mustafa Munir, Manish Dhakal, Khem Poudel
Main category: cs.CV
TL;DR: Proposes using LoRA (low-rank adaptation) modules for fine-tuning pretrained foundation models on medical image classification tasks to address distribution shifts between cross-source images while maintaining parameter efficiency.
Details
Motivation: Addresses the expensive practice of training separate models for each medical image source due to distribution shifts, which requires storing multiple copies of large pretrained models when fully fine-tuned for single datasets.
Method: Uses LoRA modules that learn lightweight task-specific low-rank matrices to perturb pretrained weights for fine-tuning downstream classification tasks, improving parameter efficiency compared to end-to-end fine-tuning.
Result: For gastrointestinal tract diseases, LoRA-based fine-tuning shows significantly better results than end-to-end fine-tuning while maintaining improved parameter efficiency.
Conclusion: LoRA provides an effective parameter-efficient fine-tuning approach for medical image analysis that addresses cross-source distribution shifts without requiring multiple full model copies.
Abstract: Despite recent advancements in the field of medical image analysis with the use of pretrained foundation models, the issue of distribution shifts between cross-source images remains largely unresolved. To circumvent that issue, investigators generally train a separate model for each source. However, this method becomes expensive when we fully fine-tune pretrained large models for a single dataset, as we must store multiple copies of those models. Thus, in this work, we propose using a low-rank adaptation (LoRA) module for fine-tuning downstream classification tasks. LoRAs learn lightweight task-specific low-rank matrices that perturb pretrained weights to optimize those downstream tasks. For gastrointestinal tract diseases, LoRA-based fine-tuning exhibits significantly better results than end-to-end fine-tuning, with improved parameter efficiency. Code is available at: github.com/sanjay931/peft-gi-recognition.
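The LoRA mechanism itself is standard and compact enough to sketch: the pretrained weight stays frozen, and only a low-rank update B(Ax), scaled by alpha/r, is trained and stored per dataset. Rank and alpha below are typical illustrative values, not the paper's settings.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B(A x). Only A and B are saved per source."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # keep pretrained weights frozen
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)      # start as an identity update
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.B(self.A(x))
```
Because only A and B differ between sources, one copy of the foundation model plus a small adapter per dataset replaces multiple fully fine-tuned copies.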
[420] AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control
Shi Chen, Xuecheng Wu, Heli Sun, Yunyun Shi, Xinyi Yin, Fengjian Xue, Jinheng Xie, Dingkang Yang, Hao Wang, Junxiao Xue, Liang He
Main category: cs.CV
TL;DR: AIM-Bench: First benchmark for Affective Image Manipulation with 800 samples covering 8 emotions and 5 editing types, plus AIM-40k dataset for instruction tuning to address positivity bias in current models.
Details
Motivation: Current image editing benchmarks focus on object-level modifications but lack fine-grained affective dimensions. There's a need for benchmarks that can capture and evaluate emotional manipulation in images.
Method: 1) Created AIM-Bench using dual-path affective modeling (Mikels taxonomy + Valence-Arousal-Dominance framework) via hierarchical human-in-the-loop workflow. 2) Developed composite evaluation suite with rule-based and model-based metrics. 3) Proposed AIM-40k dataset using inverse repainting strategy to create balanced instruction-tuning data.
Result: Current editing models show significant challenges including prevalent positivity bias. Fine-tuning on AIM-40k yields 9.15% relative improvement in overall performance on AIM-Bench.
Conclusion: The paper introduces the first benchmark for affective image manipulation and addresses data imbalance issues through a scalable data engine, demonstrating effectiveness through improved model performance.
Abstract: Affective Image Manipulation (AIM) aims to evoke specific emotions through targeted editing. Current image editing benchmarks primarily focus on object-level modifications in general scenarios, lacking the fine-grained granularity to capture affective dimensions. To bridge this gap, we introduce the first benchmark designed for AIM, termed AIM-Bench. This benchmark is built upon a dual-path affective modeling scheme that integrates the Mikels emotion taxonomy with the Valence-Arousal-Dominance framework, enabling high-level semantic and fine-grained continuous manipulation. Through a hierarchical human-in-the-loop workflow, we finally curate 800 high-quality samples covering 8 emotional categories and 5 editing types. To effectively assess performance, we also design a composite evaluation suite combining rule-based and model-based metrics to holistically assess instruction consistency, aesthetics, and emotional expressiveness. Extensive evaluations reveal that current editing models face significant challenges, most notably a prevalent positivity bias, which stems from inherent imbalances in the training data distribution. To tackle this, we propose a scalable data engine utilizing an inverse repainting strategy to construct AIM-40k, a balanced instruction-tuning dataset comprising 40k samples. Concretely, we enhance raw affective images via generative redrawing to establish high-fidelity ground truths, and synthesize input images with divergent emotions and paired precise instructions. Fine-tuning a baseline model on AIM-40k yields a 9.15% relative improvement in overall performance, demonstrating the effectiveness of our AIM-40k. Our data and related code will be released soon.
[421] A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation
Peixuan Zhang, Chang Zhou, Ziyuan Zhang, Hualuo Liu, Chunjie Zhang, Jingqi Liu, Xiaohui Zhou, Xi Chen, Shuchen Weng, Si Li, Boxin Shi
Main category: cs.CV
TL;DR: CineBench: First benchmark for instruction-driven cinematic video compilation with professional annotations. CineAgents: Multi-agent system using script reverse-engineering and iterative narrative planning to overcome contextual collapse and temporal fragmentation in video compilation.
Details
Motivation: Growing demand for adapting long cinematic content into short videos requires versatile automatic compilation systems. Existing methods are limited to predefined tasks, and there's no comprehensive benchmark for evaluating cinematic compilation quality.
Method: Introduces CineBench benchmark with diverse user instructions and professional annotations. Proposes CineAgents multi-agent system using “design-and-compose” paradigm: script reverse-engineering to build hierarchical narrative memory, and iterative narrative planning to refine creative blueprint into final compiled script.
Result: CineAgents significantly outperforms existing methods, generating compilations with superior narrative coherence and logical coherence. The benchmark enables comprehensive evaluation of cinematic video compilation systems.
Conclusion: CineBench provides the first comprehensive benchmark for instruction-driven cinematic video compilation, while CineAgents demonstrates a novel multi-agent approach that effectively addresses contextual collapse and temporal fragmentation in video compilation tasks.
Abstract: The surging demand for adapting long-form cinematic content into short videos has motivated the need for versatile automatic video compilation systems. However, existing compilation methods are limited to predefined tasks, and the community lacks a comprehensive benchmark to evaluate cinematic compilation. To address this, we introduce CineBench, the first benchmark for instruction-driven cinematic video compilation, featuring diverse user instructions and high-quality ground-truth compilations annotated by professional editors. To overcome contextual collapse and temporal fragmentation, we present CineAgents, a multi-agent system that reformulates cinematic video compilation into a “design-and-compose” paradigm. CineAgents performs script reverse-engineering to construct a hierarchical narrative memory that provides multi-level context, and employs an iterative narrative planning process that refines a creative blueprint into a final compiled script. Extensive experiments demonstrate that CineAgents significantly outperforms existing methods, generating compilations with superior narrative and logical coherence.
[422] Toward Accountable AI-Generated Content on Social Platforms: Steganographic Attribution and Multimodal Harm Detection
Xinlei Guan, David Arosemena, Tejaswi Dhandu, Kuan Huang, Meng Xu, Miles Q. Li, Bingyu Shen, Ruiyang Qin, Umamaheswara Rao Tida, Boyang Li
Main category: cs.CV
TL;DR: A steganography-based attribution framework for AI-generated images that embeds cryptographically signed identifiers and uses multimodal harmful content detection for attribution verification.
Details
Motivation: The rapid growth of generative AI has created challenges in content moderation and digital forensics, particularly when benign AI-generated images are paired with harmful or misleading text, creating difficult-to-detect misuse that undermines traditional moderation frameworks and complicates attribution.
Method: Introduces a steganography-enabled attribution framework that embeds cryptographically signed identifiers into images at creation time and uses multimodal harmful content detection as a trigger for attribution verification. Evaluates five watermarking methods across spatial, frequency, and wavelet domains, and integrates a CLIP-based fusion model for multimodal harmful-content detection.
Result: Spread-spectrum watermarking, especially in the wavelet domain, provides strong robustness under blur distortions. The multimodal fusion detector achieves an AUC-ROC of 0.99, enabling reliable cross-modal attribution verification.
Conclusion: The components form an end-to-end forensic pipeline that enables reliable tracing of harmful deployments of AI-generated imagery, supporting accountability in modern synthetic media environments.
Abstract: The rapid growth of generative AI has introduced new challenges in content moderation and digital forensics. In particular, benign AI-generated images can be paired with harmful or misleading text, creating difficult-to-detect misuse. This contextual misuse undermines the traditional moderation framework and complicates attribution, as synthetic images typically lack persistent metadata or device signatures. We introduce a steganography-enabled attribution framework that embeds cryptographically signed identifiers into images at creation time and uses multimodal harmful content detection as a trigger for attribution verification. Our system evaluates five watermarking methods across spatial, frequency, and wavelet domains. It also integrates a CLIP-based fusion model for multimodal harmful-content detection. Experiments demonstrate that spread-spectrum watermarking, especially in the wavelet domain, provides strong robustness under blur distortions, and our multimodal fusion detector achieves an AUC-ROC of 0.99, enabling reliable cross-modal attribution verification. These components form an end-to-end forensic pipeline that enables reliable tracing of harmful deployments of AI-generated imagery, supporting accountability in modern synthetic media environments. Our code is available at GitHub: https://github.com/bli1/steganography
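The sign-then-embed pipeline can be sketched in a few lines. Note the paper's results favor spread-spectrum embedding in the wavelet domain; the LSB embedding and the identifier string below are deliberately simple, hypothetical stand-ins that only illustrate how a signed payload enters the image.
```python
import numpy as np
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def embed_signed_id(img: np.ndarray, identifier: bytes,
                    key: Ed25519PrivateKey) -> np.ndarray:
    """Sign the generator identifier and hide id||signature in the least
    significant bits of the image (illustrative embedding only)."""
    payload = identifier + key.sign(identifier)  # 64-byte Ed25519 signature
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    flat = img.reshape(-1).copy()
    assert bits.size <= flat.size, "image too small for payload"
    flat[: bits.size] = (flat[: bits.size] & 0xFE) | bits  # overwrite LSBs
    return flat.reshape(img.shape)

key = Ed25519PrivateKey.generate()
img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
stego = embed_signed_id(img, b"model:example-gen|session:1234", key)  # hypothetical id
```
At verification time, the extractor recovers the payload, checks the signature against the generator's public key, and only then trusts the identifier for attribution.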
[423] ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos
Arjun Somayazulu, Kristen Grauman
Main category: cs.CV
TL;DR: ExpertEdit: A framework for skill-driven motion editing that automatically improves novice sports motions to expert level using only unpaired expert video demonstrations, without requiring paired data or manual edit guidance.
Details
Motivation: Visual feedback showing near-perfect versions of one's own performance accelerates motor skill learning more effectively than watching expert demonstrations alone. However, existing motion editing approaches require paired input-output data (rare and expensive) and explicit edit guidance at inference, making them unsuitable for personalized skill feedback.
Method: ExpertEdit learns an expert motion prior using a masked language modeling objective that infills masked motion spans with expert-level refinements. At inference, novice motion is masked at skill-critical moments and projected into the learned expert manifold, producing localized skill improvements without paired supervision or manual edit guidance.
Result: Across eight diverse techniques and three sports from Ego-Exo4D and Karate Kyokushin, ExpertEdit outperforms state-of-the-art supervised motion editing methods on multiple metrics of motion realism and expert quality.
Conclusion: ExpertEdit enables personalized skill feedback by automatically editing a person’s motion to reflect higher skill using only unpaired expert demonstrations, addressing limitations of existing supervised approaches that require expensive paired data and manual guidance.
Abstract: Visual feedback is critical for motor skill acquisition in sports and rehabilitation, and psychological studies show that observing near-perfect versions of one’s own performance accelerates learning more effectively than watching expert demonstrations alone. We propose to enable such personalized feedback by automatically editing a person’s motion to reflect higher skill. Existing motion editing approaches are poorly suited for this setting because they assume paired input-output data – rare and expensive to curate for skill-driven tasks – and explicit edit guidance at inference. We introduce ExpertEdit, a framework for skill-driven motion editing trained exclusively on unpaired expert video demonstrations. ExpertEdit learns an expert motion prior with a masked language modeling objective that infills masked motion spans with expert-level refinements. At inference, novice motion is masked at skill-critical moments and projected into the learned expert manifold, producing localized skill improvements without paired supervision or manual edit guidance. Across eight diverse techniques and three sports from Ego-Exo4D and Karate Kyokushin, ExpertEdit outperforms state-of-the-art supervised motion editing methods on multiple metrics of motion realism and expert quality. Project page: https://vision.cs.utexas.edu/projects/expert_edit/ .
[424] UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation
Haopeng Chen, Yihao Ai, Kabeen Kim, Robby T. Tan, Yixin Chen, Bo Wang
Main category: cs.CV
TL;DR: UDAPose is a novel unsupervised domain adaptation framework for human pose estimation in low-light conditions, featuring realistic low-light image synthesis and dynamic fusion of visual cues with pose priors.
Details
Motivation: Low-light conditions pose significant challenges for human pose estimation due to lack of annotated datasets and loss of visual information. Existing domain adaptation methods produce unrealistic low-light images that fail to preserve high-frequency characteristics, leading to poor generalization to real low-light scenes.
Method: Proposes UDAPose with two key components: 1) A synthesis method using Direct-Current-based High-Pass Filter (DHF) and Low-light Characteristics Injection Module (LCIM) to inject high-frequency details from input low-light images, and 2) Dynamic Control of Attention (DCA) module that adaptively balances image cues with learned pose priors in Transformer architecture.
Result: Outperforms state-of-the-art methods with significant AP gains: 10.1 (56.4%) on ExLPose-test hard set and 7.4 (31.4%) in cross-dataset validation on EHPT-XC.
Conclusion: UDAPose effectively addresses low-light pose estimation challenges through realistic low-light image synthesis and adaptive fusion of visual cues with pose priors, demonstrating superior performance over existing methods.
Abstract: Low-visibility scenarios, such as low-light conditions, pose significant challenges to human pose estimation due to the scarcity of annotated low-light datasets and the loss of visual information under poor illumination. Recent domain adaptation techniques attempt to utilize well-lit labels by augmenting well-lit images to mimic low-light conditions. However, handcrafted augmentations oversimplify noise patterns, while learning-based methods often fail to preserve high-frequency low-light characteristics, producing unrealistic images that lead pose models to generalize poorly to real low-light scenes. Moreover, recent pose estimators rely on image cues through image-to-keypoint cross-attention, but these cues become unreliable under low-light conditions. To address these issues, we propose Unsupervised Domain Adaptation for Pose Estimation (UDAPose), a novel framework that synthesizes low-light images and dynamically fuses visual cues with pose priors for improved pose estimation. Specifically, our synthesis method incorporates a Direct-Current-based High-Pass Filter (DHF) and a Low-light Characteristics Injection Module (LCIM) to inject high-frequency details from input low-light images, overcoming the rigidity and detail loss of existing approaches. Furthermore, we introduce a Dynamic Control of Attention (DCA) module that adaptively balances image cues with learned pose priors in the Transformer architecture. Experiments show that UDAPose outperforms state-of-the-art methods, with notable AP gains of 10.1 (56.4%) on the ExLPose-test hard set (LL-H) and 7.4 (31.4%) in cross-dataset validation on EHPT-XC. Code: https://github.com/Vision-and-Multimodal-Intelligence-Lab/UDAPose
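A minimal sketch of a DC-based high-pass filter in the spirit of DHF: suppress the DC term and a small low-frequency neighborhood in the Fourier domain, keeping the high-frequency content (sensor grain, edges) that characterizes real low-light frames. The cutoff radius is an assumed hyperparameter.
```python
import numpy as np

def dc_high_pass(img: np.ndarray, radius: int = 4) -> np.ndarray:
    """Zero out the centered low-frequency disk of the 2D spectrum and
    return the high-frequency residual (float array)."""
    f = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    h, w = img.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 > radius ** 2
    f = f * mask[..., None] if img.ndim == 3 else f * mask
    return np.real(np.fft.ifft2(np.fft.ifftshift(f, axes=(0, 1)), axes=(0, 1)))
```
The residual extracted from a real low-light frame could then be injected into a synthesized low-light image, which is roughly the role the paper assigns to DHF plus LCIM.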
[425] Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
Yudong Han, Yong Wang, Zaiquan Yang, Zhen Qu, Liyuan Pan, Xiangxiang Chu
Main category: cs.CV
TL;DR: A novel multimodal latent reasoning framework that replaces explicit Chain-of-Thought decoding with implicit feature propagation, addressing visual under-optimization and complex token instability through visual replay modules and adaptive routing depth scaling.
Details
Motivation: Current multimodal reasoning suffers from explicit Chain-of-Thought decoding limitations: reduced representation informativeness and increased inference latency. There's also a language bias causing visual under-optimization and fixed architectural depths limiting complex token refinement.
Method: Proposes visual replay module using causal self-attention to estimate token saliency with spatially-coherent constraints, and routing depth scaling that adaptively allocates additional reasoning steps to complex tokens. Uses curriculum strategy to progressively internalize explicit CoT into latent representations.
Result: Achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.
Conclusion: The framework successfully addresses visual under-optimization and complex token instability, enabling efficient multimodal latent reasoning with improved performance and reduced latency.
Abstract: Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly higher and more volatile gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.
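To make the routing-depth idea concrete, here is a hypothetical sketch in which tokens flagged as complex by a learned router take extra passes through a shared refinement block while simple tokens exit early. The single shared block, hard threshold, and step cap are simplifying assumptions, not the paper's design (dim must be divisible by nhead).
```python
import torch
import torch.nn as nn

class DepthRouter(nn.Module):
    """Adaptively allocate extra reasoning steps to complex tokens."""
    def __init__(self, dim: int, max_extra_steps: int = 2, tau: float = 0.5):
        super().__init__()
        self.router = nn.Linear(dim, 1)
        self.refine = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.max_extra_steps, self.tau = max_extra_steps, tau

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, dim)
        for _ in range(self.max_extra_steps):
            need_more = torch.sigmoid(self.router(tokens)).squeeze(-1) > self.tau
            if not need_more.any():
                break
            refined = self.refine(tokens)
            # Only routed tokens are replaced; the rest pass through.
            tokens = torch.where(need_more[..., None], refined, tokens)
        return tokens
```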
[426] FreeScale: Scaling 3D Scenes via Certainty-Aware Free-View Generation
Chenhan Jiang, Yu Chen, Qingwen Zhang, Jifei Song, Songcen Xu, Dit-Yan Yeung, Jiankang Deng
Main category: cs.CV
TL;DR: FreeScale generates scalable training data for novel view synthesis by using scene reconstruction as a geometric proxy and certainty-aware sampling to minimize reconstruction artifacts.
Details
Motivation: Novel View Synthesis (NVS) models suffer from limited training data: real captures are sparse while synthetic data has domain gaps. There's a need for scalable, high-quality training data that bridges this gap.
Method: Leverages imperfect scene reconstructions as geometric proxies, then uses certainty-aware free-view sampling to identify novel viewpoints that are semantically meaningful and minimally affected by reconstruction errors.
Result: Achieves 2.7 dB PSNR gain on out-of-distribution benchmarks and enhances per-scene 3D Gaussian Splatting optimization across multiple datasets.
Conclusion: Provides a practical data generation engine to overcome fundamental bottlenecks in 3D vision by transforming limited real-world sequences into scalable training data.
Abstract: The development of generalizable Novel View Synthesis (NVS) models is critically limited by the scarcity of large-scale training data featuring diverse and precise camera trajectories. While real-world captures are photorealistic, they are typically sparse and discrete. Conversely, synthetic data scales but suffers from a domain gap and often lacks realistic semantics. We introduce FreeScale, a novel framework that leverages the power of scene reconstruction to transform limited real-world image sequences into a scalable source of high-quality training data. Our key insight is that an imperfect reconstructed scene serves as a rich geometric proxy, but naively sampling from it amplifies artifacts. To this end, we propose a certainty-aware free-view sampling strategy identifying novel viewpoints that are both semantically meaningful and minimally affected by reconstruction errors. We demonstrate FreeScale’s effectiveness by scaling up the training of feedforward NVS models, achieving a notable gain of 2.7 dB in PSNR on challenging out-of-distribution benchmarks. Furthermore, we show that the generated data can actively enhance per-scene 3D Gaussian Splatting optimization, leading to consistent improvements across multiple datasets. Our work provides a practical and powerful data generation engine to overcome a fundamental bottleneck in 3D vision. Project page: https://mvp-ai-lab.github.io/FreeScale.
[427] Data-Efficient Surgical Phase Segmentation in Small-Incision Cataract Surgery: A Controlled Study of Vision Foundation Models
Lincoln Spencer, Song Wang, Chen Chen
Main category: cs.CV
TL;DR: Foundation models like DINOv3 and V-JEPA2 outperform supervised encoders for surgical phase segmentation in low-data settings, with DINOv3 ViT-7B achieving best results (83.4% accuracy) on cataract surgery videos.
Details
Motivation: Surgical phase segmentation is crucial for computer-assisted surgery but challenging when labeled surgical videos are scarce. The paper aims to identify effective visual representations for data-efficient phase segmentation in manual small-incision cataract surgery.
Method: Controlled comparison of visual representations using identical temporal model (MS-TCN++) and training/evaluation settings. Compared supervised encoders (ResNet-50, I3D) against self-supervised foundation models (DINOv3, V-JEPA2) using cached-feature pipeline that separates visual encoding from temporal learning. Also examined domain transfer using unlabeled videos and lightweight adaptation.
Result: Foundation-model features significantly improve segmentation performance. DINOv3 ViT-7B achieved best overall results: 83.4% accuracy and 87.0 edit score on SICS-155 dataset (19 phases). Analysis shows when domain transfer helps or hurts performance.
Conclusion: Modern vision foundation models demonstrate strong transferability to surgical workflow understanding and provide practical guidance for low-label medical video settings. Foundation models outperform traditional supervised approaches in data-scarce surgical phase segmentation.
Abstract: Surgical phase segmentation is central to computer-assisted surgery, yet robust models remain difficult to develop when labeled surgical videos are scarce. We study data-efficient phase segmentation for manual small-incision cataract surgery (SICS) through a controlled comparison of visual representations. To isolate representation quality, we pair each visual encoder with the same temporal model (MS-TCN++) under identical training and evaluation settings on SICS-155 (19 phases). We compare supervised encoders (ResNet-50, I3D) against large self-supervised foundation models (DINOv3, V-JEPA2), and use a cached-feature pipeline that decouples expensive visual encoding from lightweight temporal learning. Foundation-model features improve segmentation performance in this setup, with DINOv3 ViT-7B achieving the best overall results (83.4% accuracy, 87.0 edit score). We further examine cataract-domain transfer using unlabeled videos and lightweight adaptation, and analyze when it helps or hurts. Overall, the study indicates strong transferability of modern vision foundation models to surgical workflow understanding and provides practical guidance for low-label medical video settings. The project website is available at: https://sl2005.github.io/DataEfficient-sics-phase-seg/
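The cached-feature pipeline is a simple but important design choice: run the expensive frozen encoder once per video and train the lightweight temporal model on stored arrays. A minimal sketch, assuming frames are preprocessed tensors and the encoder maps a frame batch to (batch, dim) features:
```python
import numpy as np
import torch

@torch.no_grad()
def cache_video_features(frames, encoder, out_path: str, batch_size: int = 32):
    """Encode all frames of one video with a frozen visual encoder and
    save the per-frame features; the temporal model (e.g. MS-TCN++)
    then trains on these arrays without touching the encoder again."""
    encoder.eval()
    feats = []
    for i in range(0, len(frames), batch_size):
        batch = torch.stack(frames[i : i + batch_size])
        feats.append(encoder(batch).cpu().numpy())
    np.save(out_path, np.concatenate(feats))  # shape: (n_frames, dim)
```
This decoupling is what makes sweeping very large backbones (up to a 7B-parameter ViT) affordable in a low-label study.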
[428] FGML-DG: Feynman-Inspired Cognitive Science Paradigm for Cross-Domain Medical Image Segmentation
Yucheng Song, Chenxi Li, Haokang Ding, Zhining Liao, Zhifang Liao
Main category: cs.CV
TL;DR: A cognitive-inspired meta-learning framework (FGML-DG) for medical image domain generalization segmentation that mimics Feynman’s learning techniques to handle domain shifts across medical imaging modalities and data sources.
Details
Motivation: Domain Generalization (DG) in medical image segmentation faces challenges from domain shifts, imaging variations, and patient diversity across different modalities (MRI, CT) and data sources (hospitals, devices), leading to degraded performance in unseen domains. Existing methods have insufficient style feature simplification, inadequate domain knowledge reuse, and lack feedback-driven optimization.
Method: Proposes FGML-DG framework with three key components: 1) ‘Concept understanding’ principle to simplify complex features into style information statistics for precise style feature alignment, 2) Meta-style memory and recall method (MetaStyle) to emulate human memory system for past knowledge utilization, 3) Feedback-Driven Re-Training strategy (FDRT) for dynamic learning focus adjustment based on prediction errors.
Result: The method outperforms existing domain generalization approaches on two challenging medical image domain generalization tasks.
Conclusion: The cognitive-inspired meta-learning paradigm effectively addresses domain generalization challenges in medical image segmentation by mimicking human cognitive learning processes, enabling better knowledge transfer and adaptation to unseen domains.
Abstract: In medical image segmentation across multiple modalities (e.g., MRI, CT, etc.) and heterogeneous data sources (e.g., different hospitals and devices), Domain Generalization (DG) remains a critical challenge in AI-driven healthcare. This challenge primarily arises from domain shifts, imaging variations, and patient diversity, which often lead to degraded model performance in unseen domains. To address these limitations, we identify key issues in existing methods, including insufficient simplification of complex style features, inadequate reuse of domain knowledge, and a lack of feedback-driven optimization. To tackle these problems, inspired by Feynman’s learning techniques in educational psychology, this paper introduces a cognitive science-inspired meta-learning paradigm for medical image domain generalization segmentation. We propose, for the first time, a cognitive-inspired Feynman-Guided Meta-Learning framework for medical image domain generalization segmentation (FGML-DG), which mimics human cognitive learning processes to enhance model learning and knowledge transfer. Specifically, we first leverage the ‘concept understanding’ principle from Feynman’s learning method to simplify complex features across domains into style information statistics, achieving precise style feature alignment. Second, we design a meta-style memory and recall method (MetaStyle) to emulate the human memory system’s utilization of past knowledge. Finally, we incorporate a Feedback-Driven Re-Training strategy (FDRT), which mimics Feynman’s emphasis on targeted relearning, enabling the model to dynamically adjust learning focus based on prediction errors. Experimental results demonstrate that our method outperforms other existing domain generalization approaches on two challenging medical image domain generalization tasks.
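The "style information statistics" step can be illustrated with the common per-channel mean/std summary (AdaIN-style); whether FGML-DG uses exactly these statistics and an L1 alignment is our assumption for the sketch.
```python
import torch

def style_stats(feat: torch.Tensor):
    """Summarize a feature map (batch, C, H, W) by its per-channel
    mean and standard deviation, discarding spatial content."""
    mu = feat.mean(dim=(2, 3))
    sigma = feat.std(dim=(2, 3)) + 1e-6
    return mu, sigma

def style_alignment_loss(feat_a: torch.Tensor, feat_b: torch.Tensor):
    """Penalize the gap between two domains' style statistics."""
    mu_a, sig_a = style_stats(feat_a)
    mu_b, sig_b = style_stats(feat_b)
    return (mu_a - mu_b).abs().mean() + (sig_a - sig_b).abs().mean()
```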
[429] STORM: End-to-End Referring Multi-Object Tracking in Videos
Zijia Lu, Jingru Yi, Jue Wang, Yuxiao Chen, Junwen Chen, Xinyu Li, Davide Modolo
Main category: cs.CV
TL;DR: STORM is an end-to-end multimodal LLM for referring multi-object tracking that jointly performs grounding and tracking in a unified framework, eliminating external detectors and enabling coherent reasoning over appearance, motion, and language.
Details
Motivation: Existing RMOT approaches decompose object grounding and tracking into separate modules, leading to limited performance due to scarce training videos, ambiguous annotations, and restricted domains. There's a need for a unified framework that can handle complex spatial-temporal reasoning.
Method: STORM uses an end-to-end MLLM architecture that jointly performs grounding and tracking. It employs task-composition learning (TCL) that decomposes RMOT into image grounding and object tracking sub-tasks for better data efficiency. The authors also create STORM-Bench, a new RMOT dataset with accurate trajectories and diverse, unambiguous referring expressions generated through a bottom-up annotation pipeline.
Result: STORM achieves state-of-the-art performance on image grounding, single-object tracking, and RMOT benchmarks. It demonstrates strong generalization and robust spatial-temporal grounding in complex real-world scenarios.
Conclusion: STORM provides a unified end-to-end solution for referring multi-object tracking that outperforms existing approaches by enabling coherent reasoning across appearance, motion, and language modalities through task-composition learning and improved dataset quality.
Abstract: Referring multi-object tracking (RMOT) is a task of associating all the objects in a video that semantically match with given textual queries or referring expressions. Existing RMOT approaches decompose object grounding and tracking into separate modules and exhibit limited performance due to the scarcity of training videos, ambiguous annotations, and restricted domains. In this work, we introduce STORM, an end-to-end MLLM that jointly performs grounding and tracking within a unified framework, eliminating external detectors and enabling coherent reasoning over appearance, motion, and language. To improve data efficiency, we propose a task-composition learning (TCL) strategy that decomposes RMOT into image grounding and object tracking, allowing STORM to leverage data-rich sub-tasks and learn structured spatial–temporal reasoning. We further construct STORM-Bench, a new RMOT dataset with accurate trajectories and diverse, unambiguous referring expressions generated through a bottom-up annotation pipeline. Extensive experiments show that STORM achieves state-of-the-art performance on image grounding, single-object tracking, and RMOT benchmarks, demonstrating strong generalization and robust spatial–temporal grounding in complex real-world scenarios. STORM-Bench is released at https://github.com/amazon-science/storm-referring-multi-object-grounding.
[430] BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs
Aaditya Baranwal, Vishal Yadav, Abhishek Rajora
Main category: cs.CV
TL;DR: BareBones benchmark tests VLMs’ geometric shape comprehension using pixel-level silhouettes, revealing severe performance collapse when RGB textures are removed (Texture Bias Cliff).
Details
Motivation: To determine if Vision-Language Models genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts, addressing gaps in existing evaluations that conflate semantic reasoning with texture mapping.
Method: Created the BareBones benchmark with pixel-level silhouettes of geometrically distinct classes across six datasets (including the novel WTP-Bench), establishing a noise-free geometric taxonomy. Evaluated 26 state-of-the-art VLMs under RGB-deprivation conditions.
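A minimal sketch of the RGB-deprivation idea: reducing an image to a pixel-level silhouette via its segmentation mask, so only shape survives. The white-on-black polarity is an assumption; the benchmark's exact rendering protocol is not described here.

```python
import numpy as np
from PIL import Image

def to_silhouette(mask: np.ndarray) -> Image.Image:
    """Render a binary silhouette from a segmentation mask, stripping all
    RGB texture and scene context so only the boundary contour remains.
    mask: (H, W) array with nonzero values on the foreground object."""
    sil = np.where(mask > 0, 255, 0).astype(np.uint8)  # white shape, black background
    return Image.fromarray(sil)
```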
Result: Revealed consistent, severe performance collapse under RGB deprivation (Texture Bias Cliff), showing VLMs have universal structural blindspots and lack genuine geometric grounding.
Conclusion: BareBones establishes rigorous yardstick for assessing geometric comprehension in VLMs, documenting fundamental limitations in current architectures’ ability to understand pure geometric structure.
Abstract: While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it yet remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce BareBones, a zero-shot benchmark designed to stress-test pure geometric shape comprehension. We curate pixel-level silhouettes of geometrically distinct classes across six datasets: five established segmentation sources (ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200) and our novel flagship collection, WTP-Bench, establishing a noise-free geometric taxonomy. WTP-Bench is an extreme, fine-grained visual puzzle that forces models to identify inter-class geometric concepts from boundary contours alone. Our evaluation of 26 state-of-the-art proprietary and open-weight VLMs (e.g., GPT-4.1, Gemini, Claude Sonnet 4.5, LLaVA) reveals a consistent, severe performance collapse under RGB deprivation, a phenomenon we term the Texture Bias Cliff. By documenting universal structural blindspots, BareBones establishes a rigorous yardstick for genuine geometric grounding.
[431] The Second Challenge on Real-World Face Restoration at NTIRE 2026: Methods and Results
Jingkai Wang, Jue Gong, Zheng Chen, Kai Liu, Jiatong Li, Yulun Zhang, Radu Timofte, Jiachen Tu, Yaokun Shi, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yingsi Chen, Yijiao Liu, Hui Li, Yu Wang, Congchao Zhu, Alexandru-Gabriel Lefterache, Anamaria Radoi, Chuanyue Yan, Tao Lu, Yanduo Zhang, Kanghui Zhao, Jiaming Wang, Yuqi Li, WenBo Xiong, Yifei Chen, Xian Hu, Wei Deng, Daiguo Zhou, Sujith Roy, Claudia Jesuraj, Vikas B, Spoorthi LC, Nikhil Akalwadi, Ramesh Ashok Tabib, Uma Mudenagudi, Yuxuan Jiang, Chengxi Zeng, Tianhao Peng, Fan Zhang, David Bull, Wei Zhou, Linfeng Li, Hongyu Huang, Hoyoung Lee, SangYun Oh, ChangYoung Jeong, Axi Niu, Jinyang Zhang, Zhenguo Wu, Senyan Qing, Jinqiu Sun, Yanning Zhang
Main category: cs.CV
TL;DR: Review of NTIRE 2026 face restoration challenge focusing on generating natural, realistic outputs with identity consistency, using perceptual quality metrics and AdaFace for identity verification.
Details
Motivation: To advance state-of-the-art solutions for real-world face restoration with a focus on perceptual quality and realism while maintaining identity consistency, without constraints on computational resources or training data.
Method: Challenge-based evaluation using a weighted image quality assessment (IQA) score and the AdaFace model as an identity checker, with 96 registrants and 10 teams submitting valid models.
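The evaluation can be pictured as a weighted combination of IQA metrics gated by an identity check; the metric names, weights, and threshold below are placeholders, since the report only states that a weighted IQA score and AdaFace are used.

```python
def challenge_score(iqa_scores: dict, weights: dict, identity_sim: float,
                    identity_threshold: float = 0.5) -> float:
    """Weighted IQA score gated by an identity check. Metric names, weights,
    and the threshold are hypothetical, for illustration only."""
    quality = sum(weights[m] * iqa_scores[m] for m in weights)
    # Reject restorations whose AdaFace similarity drifts from the source identity.
    return quality if identity_sim >= identity_threshold else 0.0

# Example with made-up metrics and weights:
# challenge_score({"musiq": 72.3, "clipiqa": 0.61}, {"musiq": 0.01, "clipiqa": 1.0}, 0.74)
```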
Result: 9 teams achieved valid scores in final ranking, advancing performance of real-world face restoration and providing overview of latest trends in the field.
Conclusion: The collaborative challenge successfully advanced face restoration techniques, highlighting current trends and pushing boundaries in perceptual quality and identity preservation.
Abstract: This paper provides a review of the NTIRE 2026 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural and realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or training data. Performance is evaluated using a weighted image quality assessment (IQA) score and employs the AdaFace model as an identity checker. The competition attracted 96 registrants, with 10 teams submitting valid models; ultimately, 9 teams achieved valid scores in the final ranking. This collaborative effort advances the performance of real-world face restoration while offering an in-depth overview of the latest trends in the field.
[432] Bidirectional Learning of Facial Action Units and Expressions via Structured Semantic Mapping across Heterogeneous Datasets
Jia Li, Yu Zhang, Yin Chen, Zhenzhen Hu, Yong Li, Richang Hong, Shiguang Shan, Meng Wang
Main category: cs.CV
TL;DR: SSM framework enables bidirectional learning between facial action units (AUs) and facial expressions (FEs) across heterogeneous datasets using structured semantic mapping and textual prototypes.
Details
Motivation: Existing work focuses on unidirectional knowledge transfer from AUs to FEs, but bidirectional learning is insufficiently explored. Heterogeneous data conditions (different annotation paradigms, label granularity, data availability) further complicate joint learning of these semantically correlated tasks.
Method: Proposes Structured Semantic Mapping (SSM) framework with: 1) shared visual backbone for unified facial representations, 2) Textual Semantic Prototype (TSP) module using fixed textual descriptions with learnable context prompts for semantic mediation, and 3) Dynamic Prior Mapping (DPM) module incorporating Facial Action Coding System knowledge and learning a data-driven association matrix for bidirectional knowledge transfer.
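A minimal sketch of what a FACS-initialized, data-driven association matrix could look like; the actual DPM parameterization and feature spaces in the paper may differ.

```python
import torch
import torch.nn as nn

class DynamicPriorMapping(nn.Module):
    """Learnable AU<->FE association matrix, initialized from a FACS-derived
    prior (e.g., which AUs typically co-occur with each expression). A sketch
    under assumptions, not the paper's exact module."""
    def __init__(self, facs_prior: torch.Tensor):  # (num_fe, num_au), values in [0, 1]
        super().__init__()
        self.assoc = nn.Parameter(facs_prior.clone())

    def au_to_fe(self, au_feat: torch.Tensor) -> torch.Tensor:
        return au_feat @ self.assoc.t()   # (B, num_au) -> (B, num_fe)

    def fe_to_au(self, fe_feat: torch.Tensor) -> torch.Tensor:
        return fe_feat @ self.assoc       # (B, num_fe) -> (B, num_au)
```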
Result: SSM achieves state-of-the-art performance on both AU detection and FE recognition benchmarks simultaneously. Demonstrates that holistic expression semantics can enhance fine-grained AU learning even across heterogeneous datasets.
Conclusion: The SSM framework effectively addresses bidirectional learning between AUs and FEs under heterogeneous data conditions, enabling mutual enhancement between fine-grained muscular activations and coarse-grained affective states through structured semantic mapping.
Abstract: Facial action unit (AU) detection and facial expression (FE) recognition can be jointly viewed as affective facial behavior tasks, representing fine-grained muscular activations and coarse-grained holistic affective states, respectively. Despite their inherent semantic correlation, existing studies predominantly focus on knowledge transfer from AUs to FEs, while bidirectional learning remains insufficiently explored. In practice, this challenge is further compounded by heterogeneous data conditions, where AU and FE datasets differ in annotation paradigms (frame-level vs. clip-level), label granularity, and data availability and diversity, hindering effective joint learning. To address these issues, we propose a Structured Semantic Mapping (SSM) framework for bidirectional AU–FE learning under different data domains and heterogeneous supervision. SSM consists of three key components: (1) a shared visual backbone that learns unified facial representations from dynamic AU and FE videos; (2) semantic mediation via a Textual Semantic Prototype (TSP) module, which constructs structured semantic prototypes from fixed textual descriptions augmented with learnable context prompts, serving as supervision signals and cross-task alignment anchors in a shared semantic space; and (3) a Dynamic Prior Mapping (DPM) module that incorporates prior knowledge derived from the Facial Action Coding System and learns a data-driven association matrix in a high-level feature space, enabling explicit and bidirectional knowledge transfer. Extensive experiments on popular AU detection and FE recognition benchmarks show that SSM achieves state-of-the-art performance on both tasks simultaneously, and demonstrate that holistic expression semantics can in turn enhance fine-grained AU learning even across heterogeneous datasets.
[433] Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression
Shiyin Jiang, Wei Long, Minghao Han, Zhenghao Chen, Ce Zhu, Shuhang Gu
Main category: cs.CV
TL;DR: RDVQ: A unified framework for end-to-end rate-distortion optimization in vector quantization-based image compression using differentiable relaxation of codebook distribution and autoregressive entropy modeling.
Details
Motivation: Existing vector quantization methods for image compression lack principled joint rate-distortion optimization due to disconnect between representation learning and entropy modeling, making it difficult to achieve optimal compression at extremely low bitrates.
Method: Proposes RDVQ framework with: 1) Differentiable relaxation of codebook distribution to enable end-to-end RD optimization, 2) Autoregressive entropy model for accurate entropy modeling and test-time rate control, 3) Lightweight architecture for efficient compression.
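One way to realize a differentiable relaxation of codebook assignment is a Gumbel-softmax over code distances with a code-usage entropy term as the rate proxy; the sketch below follows that reading, and the specific relaxation used by RDVQ is an assumption.

```python
import torch
import torch.nn.functional as F

def soft_vq(z: torch.Tensor, codebook: torch.Tensor, tau: float = 1.0):
    """Differentiable codebook assignment: soft code probabilities let an
    entropy (rate) term shape the latent prior end-to-end. A sketch of the
    idea, not RDVQ's exact formulation. z: (B, D); codebook: (K, D)."""
    logits = -torch.cdist(z, codebook) ** 2                 # closer codes -> higher logit
    probs = F.gumbel_softmax(logits, tau=tau, hard=False)   # (B, K), differentiable
    z_q = probs @ codebook                                  # soft-quantized latent
    usage = probs.mean(dim=0)                               # empirical code distribution
    rate = -(usage * (usage + 1e-9).log()).sum()            # entropy proxy for the RD loss
    return z_q, rate
```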
Result: Achieves strong performance at extremely low bitrates with competitive/superior perceptual quality using significantly fewer parameters. Reduces bitrate by up to 75.71% on DISTS and 37.63% on LPIPS on DIV2K-val compared to RDEIC.
Conclusion: RDVQ enables principled rate-distortion optimization for VQ-based compression and introduces an entropy-constrained formulation that bridges image tokenization and compression, offering a unified approach for low-bitrate image compression.
Abstract: The rapid growth of visual data under stringent storage and bandwidth constraints makes extremely low-bitrate image compression increasingly important. While Vector Quantization (VQ) offers strong structural fidelity, existing methods lack a principled mechanism for joint rate-distortion (RD) optimization due to the disconnect between representation learning and entropy modeling. We propose RDVQ, a unified framework that enables end-to-end RD optimization for VQ-based compression via a differentiable relaxation of the codebook distribution, allowing the entropy loss to directly shape the latent prior. We further develop an autoregressive entropy model that supports accurate entropy modeling and test-time rate control. Extensive experiments demonstrate that RDVQ achieves strong performance at extremely low bitrates with a lightweight architecture, attaining competitive or superior perceptual quality with significantly fewer parameters. Compared with RDEIC, RDVQ reduces bitrate by up to 75.71% on DISTS and 37.63% on LPIPS on DIV2K-val. Beyond empirical gains, RDVQ introduces an entropy-constrained formulation of VQ, highlighting the potential for a more unified view of image tokenization and compression. The code will be available at https://github.com/CVL-UESTC/RDVQ.
[434] NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models: Datasets, Methods and Results
Xin Li, Jiachao Gong, Xijun Wang, Shiyao Xiong, Bingchen Li, Suhang Yao, Chao Zhou, Zhibo Chen, Radu Timofte, Yuxiang Chen, Shibo Yin, Yilian Zhong, Yushun Fang, Xilei Zhu, Yahui Wang, Chen Lu, Meisong Zheng, Xiaoxu Chen, Jing Yang, Zhaokun Hu, Jiahui Liu, Ying Chen, Haoran Bai, Sibin Deng, Shengxi Li, Mai Xu, Junyang Chen, Hao Chen, Xinzhe Zhu, Fengkai Zhang, Long Sun, Yixing Yang, Xindong Zhang, Jiangxin Dong, Jinshan Pan, Jiyuan Zhang, Shuai Liu, Yibin Huang, Xiaotao Wang, Lei Lei, Zhirui Liu, Shinan Chen, Shang-Quan Sun, Wenqi Ren, Jingyi Xu, Zihong Chen, Zhuoya Zou, Xiuhao Qiu, Jingyu Ma, Huiyuan Fu, Kun Liu, Huadong Ma, Dehao Feng, Zhijie Ma, Boqi Zhang, Jiawei Shi, Hao Kang, Yixin Yang, Yeying Jin, Xu Cheng, Yuxuan Jiang, Chengxi Zeng, Tianhao Peng, Fan Zhang, David Bull, Yanan Xing, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi, Wei Zhou, Linfeng Li, Hang Song, Qi Xu, Kun Yuan, Yizhen Shao, Yulin Ren
Main category: cs.CV
TL;DR: NTIRE 2026 challenge on short-form UGC video restoration using generative models with new KwaiVIR benchmark containing synthetic and real-world distorted videos.
Details
Motivation: To establish a strong practical benchmark for restoring short-form user-generated content videos under complex real-world degradations using generative models.
Method: Two-track challenge: a subjective track (user-study evaluation) and an objective track, both using the KwaiVIR benchmark with 200 synthetic and 48 wild training videos, plus 11 validation and 20 testing videos.
Result: 95 teams registered, 12 submitted valid solutions achieving strong performance on KwaiVIR benchmark, showing progress in S-UGC video restoration.
Conclusion: The challenge successfully established a benchmark and demonstrated encouraging progress in generative-model-based short-form UGC video restoration in the wild.
Abstract: This paper presents an overview of the NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models. This challenge utilizes a new short-form UGC (S-UGC) video restoration benchmark, termed KwaiVIR, which is contributed by USTC and Kuaishou Technology. It contains both synthetically distorted videos and real-world short-form UGC videos in the wild. For this edition, the released data include 200 synthetic training videos, 48 wild training videos, 11 validation videos, and 20 testing videos. The primary goal of this challenge is to establish a strong and practical benchmark for restoring short-form UGC videos under complex real-world degradations, especially in the emerging paradigm of generative-model-based S-UGC video restoration. This challenge has two tracks: (i) the primary track is a subjective track, where the evaluation is based on a user study; (ii) the second track is an objective track. These two tracks enable a comprehensive assessment of restoration quality. In total, 95 teams registered for this competition, and 12 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the KwaiVIR benchmark, demonstrating encouraging progress in short-form UGC video restoration in the wild.
[435] Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor
Yapeng Meng, Lin Yang, Yuguo Chen, Xiangru Chen, Taoyi Wang, Lijian Wang, Zheyu Yang, Yihan Lin, Rong Zhao
Main category: cs.CV
TL;DR: STGDNet uses novel complementary vision sensor (CVS) data with spatial and temporal difference streams to guide deblurring of motion-blurred RGB frames, outperforming RGB-only and event-based methods.
Details
Motivation: Motion blur in RGB frames is ill-posed without temporal cues. Event cameras have limitations like rate saturation and entangled edge/motion information. The Tianmouc CVS sensor provides synchronized RGB with high-frame-rate spatial difference (edge) and temporal difference (motion) data, offering a promising solution for extreme dynamic scene deblurring.
Method: STGDNet uses a recurrent multi-branch architecture that iteratively encodes and fuses spatial difference (SD) and temporal difference (TD) sequences to restore structure and color details from blurry RGB inputs. The method leverages complementary modalities from the CVS sensor.
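A toy version of one recurrent fusion step, where edge (SD) and motion (TD) features gate an update to the running restoration state; the paper's actual multi-branch design is more elaborate.

```python
import torch
import torch.nn as nn

class SDTDFusionCell(nn.Module):
    """One recurrent step fusing spatial-difference (edge) and temporal-
    difference (motion) features into the running deblur state. A minimal
    sketch under assumptions, not STGDNet's exact architecture."""
    def __init__(self, dim: int):
        super().__init__()
        self.candidate = nn.Conv2d(3 * dim, dim, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(3 * dim, dim, kernel_size=3, padding=1)

    def forward(self, state, sd_feat, td_feat):
        # state, sd_feat, td_feat: (B, dim, H, W)
        x = torch.cat([state, sd_feat, td_feat], dim=1)
        update = torch.tanh(self.candidate(x))
        gate = torch.sigmoid(self.gate(x))
        return state + gate * update   # gated residual update of the state
```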
Result: Outperforms current RGB-only and event-based deblurring approaches on both synthetic CVS datasets and real-world evaluations. Demonstrates strong generalization across over 100 extreme real-world scenarios.
Conclusion: The complementary vision sensor with synchronized SD and TD data provides effective guidance for motion deblurring. STGDNet successfully leverages these modalities to achieve state-of-the-art performance in extreme dynamic scenes.
Abstract: Motion blur arises when rapid scene changes occur during the exposure period, collapsing rich intra-exposure motion into a single RGB frame. Without explicit structural or temporal cues, RGB-only deblurring is highly ill-posed and often fails under extreme motion. Inspired by the human visual system, brain-inspired vision sensors introduce temporally dense information to alleviate this problem. However, event cameras still suffer from event rate saturation under rapid motion, while the event modality entangles edge features and motion cues, which limits their effectiveness. As a recent breakthrough, the complementary vision sensor (CVS), Tianmouc, captures synchronized RGB frames together with high-frame-rate, multi-bit spatial difference (SD, encoding structural edges) and temporal difference (TD, encoding motion cues) data within a single RGB exposure, offering a promising solution for RGB deblurring under extreme dynamic scenes. To fully leverage these complementary modalities, we propose Spatio-Temporal Difference Guided Deblur Net (STGDNet), which adopts a recurrent multi-branch architecture that iteratively encodes and fuses SD and TD sequences to restore structure and color details lost in blurry RGB inputs. Our method outperforms current RGB or event-based approaches in both synthetic CVS dataset and real-world evaluations. Moreover, STGDNet exhibits strong generalization capability across over 100 extreme real-world scenarios. Project page: https://tmcDeblur.github.io/
[436] Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
Bo Zhou, Qiuxia Lai, Zeren Sun, Xiangbo Shu, Yazhou Yao, Wenguan Wang
Main category: cs.CV
TL;DR: UniSplat: A feed-forward framework for robust 3D representation learning from unposed multi-view images using dual-masking, coarse-to-fine Gaussian splatting, and pose-conditioned recalibration to unify geometry, appearance, and semantics.
Details
Motivation: Learning robust 3D representations directly from unposed multi-view images is challenging for spatial intelligence. Existing self-supervised methods suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics.
Method: Three complementary components: 1) Dual-masking strategy (encoder and decoder tokens with geometry-focused decoder masks) to strengthen geometry induction; 2) Coarse-to-fine Gaussian splatting to reduce appearance-semantics inconsistencies; 3) Pose-conditioned recalibration mechanism that interrelates multiple head outputs by re-projecting 3D point and semantic maps into the image plane using estimated camera parameters.
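The recalibration step relies on reprojecting predicted 3D points with estimated camera parameters; a sketch of the standard pinhole projection it presumably builds on:

```python
import torch

def reproject(points_3d: torch.Tensor, K: torch.Tensor,
              R: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Project predicted 3D points into the image plane with estimated
    camera parameters (standard pinhole model), so per-pixel predictions can
    be compared against RGB and semantic maps for cross-task consistency.
    points_3d: (N, 3) world coords; K: (3, 3); R: (3, 3); t: (3,)."""
    cam = points_3d @ R.t() + t                      # world -> camera coordinates
    uv = cam @ K.t()                                 # camera -> homogeneous pixels
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)    # (N, 2) pixel coordinates
```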
Result: Unified 3D representations robust to unposed, sparse-view inputs that generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.
Conclusion: UniSplat addresses limitations in 3D representation learning by integrating geometry, appearance, and semantics in a feed-forward framework, enabling robust spatial understanding from challenging unposed multi-view inputs.
Abstract: Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural information from incomplete visual cues, yielding geometry-aware representations even under unposed inputs. Second, we develop a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies by progressively refining the radiance field. Finally, to enforce geometric-semantic consistency, we introduce a pose-conditioned recalibration mechanism that interrelates the outputs of multiple heads by re-projecting predicted 3D point and semantic maps into the image plane using estimated camera parameters, and aligning them with corresponding RGB and semantic predictions to ensure cross-task consistency, thereby resolving geometry-semantic mismatches. Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.
[437] Omnimodal Dataset Distillation via High-order Proxy Alignment
Yuxuan Gao, Xiaohao Liu, Xiaobo Xia, Tongliang Liu
Main category: cs.CV
TL;DR: HoPA is a novel method for Omnimodal Dataset Distillation that captures high-order cross-modal alignments via a compact proxy, enabling scalable joint distillation across heterogeneous modalities beyond traditional bimodal settings.
Details
Motivation: Current dataset distillation methods are limited to single-modal or bimodal settings, but real-world applications often involve more than two modalities. Extending dataset distillation to omnimodal scenarios is challenging due to increased heterogeneity and complex cross-modal interactions.
Method: Proposes HoPA (High-order Proxy Alignment), a unified method that captures high-order cross-modal alignments via a compact proxy. It abstracts omnimodal alignment with a shared similarity structure, avoiding combinatorial complexity of pairwise modality modeling. The method is compatible with trajectory matching and enables scalable joint distillation across heterogeneous modalities.
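A sketch of the shared-similarity idea: each modality's within-batch similarity matrix is pulled toward one compact proxy, so the alignment cost grows linearly rather than quadratically in the number of modalities. The proxy parameterization is an assumption.

```python
import torch
import torch.nn.functional as F

def proxy_alignment_loss(feats_per_modality: list, proxy: torch.Tensor):
    """Align each modality's similarity structure to a shared proxy instead
    of modeling all O(M^2) modality pairs. `proxy` is a (B, B) target
    similarity matrix (e.g., a learnable parameter); HoPA's exact proxy
    design is assumed here. Each element of feats_per_modality is (B, D)."""
    loss = 0.0
    for f in feats_per_modality:
        f = F.normalize(f, dim=-1)
        gram = f @ f.t()                   # (B, B) within-modality similarities
        loss = loss + F.mse_loss(gram, proxy)
    return loss / len(feats_per_modality)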
Result: Extensive experiments on various benchmarks demonstrate superior compression-performance trade-offs compared to existing competitors. Theoretical analysis from spectral perspective validates the method’s rationality against bimodal dataset distillation techniques.
Conclusion: HoPA successfully addresses the challenges of omnimodal dataset distillation by capturing high-order cross-modal alignments, enabling efficient compression of multi-modal datasets while preserving training performance across heterogeneous modalities.
Abstract: Dataset distillation compresses large-scale datasets into compact synthetic sets while preserving training performance, but existing methods are largely restricted to single-modal or bimodal settings. Extending dataset distillation to scenarios involving more than two modalities, i.e., Omnimodal Dataset Distillation, remains underexplored and challenging due to increased heterogeneity and complex cross-modal interactions. In this work, we identify the key determinant that bounds the endpoint discrepancy in the omnimodal setting, which is exacerbated with an increasing number of modalities. To this end, we propose HoPA, a unified method that captures high-order cross-modal alignments via a compact proxy, which is compatible with trajectory matching as well. By abstracting omnimodal alignment with a shared similarity structure, our method avoids the combinatorial complexity of pairwise modality modeling and enables scalable joint distillation across heterogeneous modalities. Theoretical analysis from the spectral perspective reveals the rationality of our proposed method against bimodal dataset distillation techniques. Extensive experiments on various benchmarks demonstrate that the proposed method achieves superior compression-performance trade-offs compared to existing competitors. The source code will be publicly released.
[438] Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
Dehui Wang, Congsheng Xu, Rong Wei, Yue Shi, Shoufa Chen, Dingxiang Luo, Tianshuo Yang, Xiaokang Yang, Yusen Qin, Rui Tang, Yao Mu
Main category: cs.CV
TL;DR: Rein3D reconstructs full 360-degree indoor scenes from sparse inputs using 3D Gaussian Splatting coupled with video diffusion models for global consistency.
Details
Motivation: Existing approaches for 3D indoor scene synthesis struggle with inferring massive missing geometry in large unseen areas while maintaining global consistency, often producing locally plausible but globally inconsistent reconstructions.
Method: Uses a “restore-and-refine” paradigm: radial exploration renders imperfect panoramic videos from a coarse 3DGS initialization; these are restored by a panoramic video-to-video diffusion model, enhanced via video super-resolution, and then used as pseudo-ground truths to update the global 3D Gaussian field. Also creates the PanoV2V-15K dataset for training.
Result: Produces photorealistic and globally consistent 3D scenes, significantly improves long-range camera exploration compared with existing baselines.
Conclusion: Rein3D effectively addresses the challenge of global consistency in 3D indoor scene reconstruction by combining explicit 3D representation with diffusion-based video restoration.
Abstract: The growing demand for Embodied AI and VR applications has highlighted the need for synthesizing high-quality 3D indoor scenes from sparse inputs. However, existing approaches struggle to infer massive amounts of missing geometry in large unseen areas while maintaining global consistency, often producing locally plausible but globally inconsistent reconstructions. We present Rein3D, a framework that reconstructs full 360-degree indoor environments by coupling explicit 3D Gaussian Splatting (3DGS) with temporally coherent priors from video diffusion models. Our approach follows a “restore-and-refine” paradigm: we employ a radial exploration strategy to render imperfect panoramic videos along trajectories starting from the origin, effectively uncovering occluded regions from a coarse 3DGS initialization. These sequences are restored by a panoramic video-to-video diffusion model and further enhanced via video super-resolution to synthesize high-fidelity geometry and textures. Finally, these refined videos serve as pseudo-ground truths to update the global 3D Gaussian field. To support this task, we construct PanoV2V-15K, a dataset of over 15K paired clean and degraded panoramic videos for diffusion-based scene restoration. Experiments demonstrate that Rein3D produces photorealistic and globally consistent 3D scenes and significantly improves long-range camera exploration compared with existing baselines.
[439] TAPNext++: What’s Next for Tracking Any Point (TAP)?
Sebastian Jung, Artem Zholus, Martin Sundermeyer, Carl Doersch, Ross Goroshin, David Joseph Tan, Sarath Chandar, Rudolph Triebel, Federico Tombari
Main category: cs.CV
TL;DR: TAPNext++ improves point tracking in long videos and re-detection of occluded points using recurrent transformers with sequence parallelism and geometric augmentations.
Details
Motivation: TAPNext struggles with long video sequences and fails to re-detect query points after occlusion or leaving the frame, which is crucial for AR/XR and robotics applications.
Method: Uses recurrent transformer architecture with sequence parallelism to train on 1024-frame sequences, introduces new Re-Detection Average Jaccard metric, and adds geometric augmentations like periodic roll to simulate point re-entries.
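The periodic-roll augmentation can be sketched directly: frames are cyclically shifted by a growing offset so that tracked points exit one border and re-enter from the other, creating re-detection training cases. The shift schedule below is illustrative, not the paper's setting.

```python
import numpy as np

def periodic_roll(frames: np.ndarray, tracks: np.ndarray, shift_per_frame: int = 2):
    """Horizontally roll each frame by a growing offset, wrapping content
    (and point coordinates) around the frame border to simulate re-entries.
    frames: (T, H, W, 3); tracks: (T, N, 2) as (x, y) pixel coordinates."""
    T, H, W, _ = frames.shape
    rolled = frames.copy()
    new_tracks = tracks.copy()
    for t in range(T):
        dx = (t * shift_per_frame) % W
        rolled[t] = np.roll(frames[t], dx, axis=1)       # wrap along width
        new_tracks[t, :, 0] = (tracks[t, :, 0] + dx) % W # wrap x coordinates
    return rolled, new_tracks
```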
Result: Achieves state-of-the-art performance on multiple benchmarks, tracking points in much longer sequences while maintaining low memory and compute footprint.
Conclusion: Recurrent transformers can be substantially improved for point tracking, especially for long sequences and re-detection scenarios, advancing AR/XR and robotics applications.
Abstract: Tracking-Any-Point (TAP) models aim to track any point through a video, which is a crucial task in AR/XR and robotics applications. The recently introduced TAPNext approach proposes an end-to-end, recurrent transformer architecture to track points frame-by-frame in a purely online fashion – demonstrating competitive performance at minimal latency. However, we show that TAPNext struggles with longer video sequences and also frequently fails to re-detect query points that reappear after being occluded or leaving the frame. In this work, we present TAPNext++, a model that tracks points in sequences that are orders of magnitude longer while preserving the low memory and compute footprint of the architecture. We train the recurrent video transformer using several data-driven solutions, including training on long 1024-frame sequences enabled by sequence parallelism techniques. We highlight that re-detection performance is a blind spot in the current literature and introduce a new metric, Re-Detection Average Jaccard (AJ_RD), to explicitly evaluate tracking on re-appearing points. To improve re-detection of points, we introduce tailored geometric augmentations, such as periodic roll that simulates point re-entries, and supervising occluded points. We demonstrate that recurrent transformers can be substantially improved for point tracking and set a new state-of-the-art on multiple benchmarks. Model and code can be found at https://tap-next-plus-plus.github.io.
[440] CoFusion: Multispectral and Hyperspectral Image Fusion via Spectral Coordinate Attention
Baisong Li
Main category: cs.CV
TL;DR: CoFusion is a unified spatial-spectral collaborative fusion framework for multispectral and hyperspectral image fusion that models cross-scale and cross-modal dependencies to achieve better spatial detail enhancement and spectral fidelity.
Details
Motivation: Existing multispectral and hyperspectral image fusion methods struggle with modeling cross-scale interactions and spatial-spectral collaboration, making it difficult to achieve optimal trade-offs between spatial detail enhancement and spectral fidelity.
Method: Proposes CoFusion with: 1) Multi-Scale Generator (MSG) for three-level pyramidal architecture integrating global semantics and local details; 2) dual-branch strategy with Spatial Coordinate-Aware Mixing (SpaCAM) for multi-scale spatial contexts and Spectral Coordinate-Aware Mixing (SpeCAM) for spectral representations via frequency decomposition; 3) Spatial-Spectral Cross-Fusion Module (SSCFM) for dynamic cross-modal alignment and complementary feature fusion.
Result: Extensive experiments on multiple benchmark datasets show CoFusion consistently outperforms state-of-the-art methods, achieving superior performance in both spatial reconstruction and spectral consistency.
Conclusion: CoFusion effectively addresses the limitations of existing methods by explicitly modeling cross-scale and cross-modal dependencies, providing a robust solution for multispectral and hyperspectral image fusion.
Abstract: Multispectral and Hyperspectral Image Fusion (MHIF) aims to reconstruct high-resolution images by integrating low-resolution hyperspectral images (LRHSI) and high-resolution multispectral images (HRMSI). However, existing methods face limitations in modeling cross-scale interactions and spatial-spectral collaboration, making it difficult to achieve an optimal trade-off between spatial detail enhancement and spectral fidelity. To address this challenge, we propose CoFusion: a unified spatial-spectral collaborative fusion framework that explicitly models cross-scale and cross-modal dependencies. Specifically, a Multi-Scale Generator (MSG) is designed to construct a three-level pyramidal architecture, enabling the effective integration of global semantics and local details. Within each scale, a dual-branch strategy is employed: the Spatial Coordinate-Aware Mixing module (SpaCAM) is utilized to capture multi-scale spatial contexts, while the Spectral Coordinate-Aware Mixing module (SpeCAM) enhances spectral representations through frequency decomposition and coordinate mixing. Furthermore, we introduce the Spatial-Spectral Cross-Fusion Module (SSCFM) to perform dynamic cross-modal alignment and complementary feature fusion. Extensive experiments on multiple benchmark datasets demonstrate that CoFusion consistently outperforms state-of-the-art methods, achieving superior performance in both spatial reconstruction and spectral consistency.
[441] GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing
Maram Hasan, Md Aminur Hossain, Savitra Roy, Souparna Bhowmik, Ayush V. Patel, Mainak Singha, Subhasis Chaudhuri, Muhammad Haris Khan, Biplab Banerjee
Main category: cs.CV
TL;DR: GeoMeld is a large-scale multimodal remote sensing dataset with 2.5M spatially aligned samples and semantically grounded language supervision, paired with GeoMeld-FM pretraining framework for multimodal foundation modeling.
Details
Motivation: Remote sensing lacks large-scale spatially aligned multimodal datasets with semantically grounded supervision needed for effective foundation modeling across diverse modalities and resolutions.
Method: Created GeoMeld dataset with 2.5M samples using unified alignment protocol and agentic captioning framework; developed GeoMeld-FM pretraining combining multi-pretext masked autoencoding, JEPA representation learning, and caption-vision contrastive alignment.
Result: Experiments show consistent gains in downstream transfer and cross-sensor robustness, establishing scalable framework for semantically grounded multimodal foundation modeling in remote sensing.
Conclusion: GeoMeld and GeoMeld-FM provide a scalable reference framework for multimodal foundation modeling in remote sensing with semantically grounded supervision and aligned heterogeneous modalities.
Abstract: Effective foundation modeling in remote sensing requires spatially aligned heterogeneous modalities coupled with semantically grounded supervision, yet such resources remain limited at scale. We present GeoMeld, a large-scale multimodal dataset with approximately 2.5 million spatially aligned samples. The dataset spans diverse modalities and resolutions and is constructed under a unified alignment protocol for modality-aware representation learning. GeoMeld provides semantically grounded language supervision through an agentic captioning framework that synthesizes and verifies annotations from spectral signals, terrain statistics, and structured geographic metadata, encoding measurable cross-modality relationships within textual descriptions. To leverage this dataset, we introduce GeoMeld-FM, a pretraining framework that combines multi-pretext masked autoencoding over aligned modalities, JEPA representation learning, and caption-vision contrastive alignment. This joint objective enables the learned representation space to capture both reliable cross-sensor physical consistency and grounded semantics. Experiments demonstrate consistent gains in downstream transfer and cross-sensor robustness. Together, GeoMeld and GeoMeld-FM establish a scalable reference framework for semantically grounded multi-modal foundation modeling in remote sensing.
[442] COREY: A Prototype Study of Entropy-Guided Operator Fusion with Hadamard Reparameterization for Selective State Space Models
Bo Ma, Jinsong Wu, Hongjiang Wei, Weiqi Yan
Main category: cs.CV
TL;DR: COREY is a framework for optimizing State Space Models (like Mamba) through memory-aware operator fusion and Hadamard-based feature reparameterization to reduce memory bandwidth limitations in long-context inference.
Details
Motivation: State Space Models (SSMs) like Mamba offer linear-time sequence modeling for long-context inference, but their practical deployment is memory-bandwidth limited due to fragmented kernels and repeated intermediate tensor materialization from selective state updates.
Method: COREY combines memory-aware operator fusion with Hadamard-based feature reparameterization. It uses activation entropy (estimated with fixed-width histograms) as a runtime scheduling statistic to determine fusion boundaries and tile sizes. To regularize heavy-tailed activations, it absorbs normalized Hadamard transforms into linear projections while preserving functional equivalence.
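Both core ingredients are easy to sketch: fixed-width-histogram entropy as the scheduling statistic, and folding a Hadamard transform into a linear projection so the computation stays functionally equivalent. Bin count and shapes below are illustrative.

```python
import numpy as np

def activation_entropy(x: np.ndarray, num_bins: int = 64) -> float:
    """Fixed-width-histogram entropy of an activation tensor, the kind of
    runtime statistic COREY uses to place fusion boundaries and pick tiles."""
    hist, _ = np.histogram(x.ravel(), bins=num_bins)
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def absorb_hadamard(W: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Fold a normalized Hadamard transform into a linear projection:
    y = W @ (H @ x) == (W @ H) @ x, so replacing W with W @ H preserves the
    function while the transform spreads out heavy-tailed activations."""
    return W @ H
```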
Result: In controlled prototype studies over heavy-tailed SSM activations, COREY consistently reduces proxy latency, improves throughput, and lowers DRAM traffic compared to unfused and fixed-depth baselines. Low-bit results are reported through a hand-crafted stability proxy for diagnostic purposes.
Conclusion: COREY demonstrates effective optimization of SSM deployments through memory-aware fusion and feature reparameterization, addressing practical memory bandwidth limitations while maintaining functional equivalence.
Abstract: State Space Models (SSMs), represented by the Mamba family, provide linear-time sequence modeling and are attractive for long-context inference. Yet practical deployments remain memory-bandwidth limited because selective state updates are often decomposed into fragmented kernels with repeated intermediate tensor materialization. We present COREY, a prototype framework that combines memory-aware operator fusion with Hadamard-based feature reparameterization. Activation entropy, estimated with fixed-width histograms, is used as a runtime scheduling statistic to place fusion boundaries and choose tile sizes. To regularize heavy-tailed activations, we absorb normalized Hadamard transforms into linear projections, preserving functional equivalence while reducing peak-coordinate concentration. In a controlled prototype study over heavy-tailed SSM activations, COREY consistently reduces proxy latency, improves throughput, and lowers DRAM traffic relative to unfused and fixed-depth baselines. Low-bit results are reported only through a hand-crafted stability proxy and are intended as diagnostic evidence rather than checkpoint-level quality claims. Code repository: https://github.com/mabo1215/COREY_Transformer.git.
[443] Sign Language Recognition in the Age of LLMs
Vaclav Javorek, Jakub Honzik, Ivan Gruber, Tomas Zelezny, Marek Hruz
Main category: cs.CV
TL;DR: VLMs evaluated for zero-shot isolated sign language recognition, showing open-source models lag behind supervised classifiers but capture partial sign-text alignment, with proprietary models performing better due to scale and data diversity.
Details
Motivation: To investigate whether general-purpose Vision Language Models (VLMs) can address specialized visual recognition problems like isolated sign language recognition (ISLR) without task-specific training, exploring their zero-shot capabilities.
Method: Evaluated several open-source and proprietary VLMs on the WLASL300 benchmark using prompt-only zero-shot inference, then conducted follow-up experiments to analyze visual-semantic alignment between signs and text descriptions.
Result: Open-source VLMs performed far behind classic supervised ISLR classifiers, but captured partial visual-semantic alignment. Larger proprietary models achieved substantially higher accuracy, highlighting importance of model scale and training data diversity.
Conclusion: Current VLMs show promise for zero-shot ISLR but require further scaling and diverse training to match specialized supervised approaches, with proprietary models demonstrating better performance due to their larger scale.
Abstract: Recent Vision Language Models (VLMs) have demonstrated strong performance across a wide range of multimodal reasoning tasks. This raises the question of whether such general-purpose models can also address specialized visual recognition problems such as isolated sign language recognition (ISLR) without task-specific training. In this work, we investigate the capability of modern VLMs to perform ISLR in a zero-shot setting. We evaluate several open-source and proprietary VLMs on the WLASL300 benchmark. Our experiments show that, under prompt-only zero-shot inference, current open-source VLMs remain far behind classic supervised ISLR classifiers by a wide margin. However, follow-up experiments reveal that these models capture partial visual-semantic alignment between signs and text descriptions. Larger proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity. All our code is publicly available on GitHub.
[444] Self-supervised Pretraining of Cell Segmentation Models
Kaden Stillwagon, Alexandra Dunnum VandeLoo, Benjamin Magondu, Craig R. Forest
Main category: cs.CV
TL;DR: DINOCell is a self-supervised framework for cell instance segmentation that adapts DINOv2 representations to microscopy data through continued self-supervised training on unlabeled cell images, outperforming SAM-based approaches by 10.42% on LIVECell benchmark.
Details
Motivation: Progress in cell instance segmentation is constrained by scarcity of high-quality labeled microscopy datasets. Existing approaches using natural-image pretrained models like SAM suffer from domain shift issues due to misaligned objectness and texture priors between natural and microscopy images.
Method: Proposes DINOCell framework that leverages DINOv2 representations and adapts them to microscopy through continued self-supervised training on unlabeled cell images before supervised fine-tuning. This domain adaptation approach better aligns representations with microscopy data characteristics.
Result: Achieves SEG score of 0.784 on LIVECell benchmark, improving by 10.42% over leading SAM-based models. Shows strong zero-shot performance on three out-of-distribution microscopy datasets, demonstrating robust domain adaptation.
Conclusion: Domain-adapted self-supervised pretraining is beneficial for robust cell segmentation, outperforming natural-image pretrained models by better aligning representations with microscopy data characteristics through continued self-supervised training.
Abstract: Instance segmentation enables the analysis of spatial and temporal properties of cells in microscopy images by identifying the pixels belonging to each cell. However, progress is constrained by the scarcity of high-quality labeled microscopy datasets. Many recent approaches address this challenge by initializing models with segmentation-pretrained weights from large-scale natural-image models such as Segment Anything Model (SAM). However, representations learned from natural images often encode objectness and texture priors that are poorly aligned with microscopy data, leading to degraded performance under domain shift. We propose DINOCell, a self-supervised framework for cell instance segmentation that leverages representations from DINOv2 and adapts them to microscopy through continued self-supervised training on unlabeled cell images prior to supervised fine-tuning. On the LIVECell benchmark, DINOCell achieves a SEG score of 0.784, improving by 10.42% over leading SAM-based models, and demonstrates strong zero-shot performance on three out-of-distribution microscopy datasets. These results highlight the benefits of domain-adapted self-supervised pretraining for robust cell segmentation.
[445] How to Design a Compact High-Throughput Video Camera?
Chenxi Qiu, Tao Yue, Xuemei Hu
Main category: cs.CV
TL;DR: Proposes a low-bit gradient camera scheme for high-throughput video imaging using gradient-based sensing with multi-scale CNN reconstruction to overcome readout and transmission bottlenecks.
Details
Motivation: High throughput video acquisition faces challenges with system complexity in existing multi-camera systems and readout/transmission bottlenecks as pixel counts increase. Gradient cameras offer fast readout and efficient representation advantages.
Method: Proposes a low-bit gradient camera scheme leveraging gradient-based sensing for efficient data representation. Uses a multi-scale reconstruction CNN to reconstruct high-resolution images from the gradient measurements.
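To make the readout advantage concrete, a sketch of producing low-bit spatial gradients from a grayscale frame; the uniform quantizer and bit depth are illustrative, not the paper's design.

```python
import numpy as np

def low_bit_gradients(frame: np.ndarray, bits: int = 2):
    """Compute spatial gradients of a (H, W) grayscale frame and quantize
    them to a low bit depth, mimicking the compact readout a gradient camera
    would produce. The uniform quantizer here is for illustration only."""
    f = frame.astype(np.float32)
    gx = np.diff(f, axis=1, append=f[:, -1:])   # horizontal gradient
    gy = np.diff(f, axis=0, append=f[-1:, :])   # vertical gradient
    levels = 2 ** bits - 1

    def quant(g):
        g = np.clip(g / (np.abs(g).max() + 1e-6), -1, 1)  # normalize to [-1, 1]
        return np.round((g + 1) / 2 * levels)             # uniform low-bit code

    return quant(gx), quant(gy)
```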
Result: Extensive experiments on simulated and real data demonstrate promising reconstruction quality and feasibility of the proposed method for high-throughput video imaging.
Conclusion: The proposed gradient camera scheme effectively addresses readout and transmission bottlenecks for high-throughput video imaging while maintaining image quality through neural network reconstruction.
Abstract: High throughput video acquisition is a challenging problem and has been drawing increasing attention. Existing high throughput imaging systems splice hundreds of sub-images/videos into high throughput videos, suffering from extremely high system complexity. Alternatively, with pixel sizes reducing to sub-micrometer levels, integrating ultra-high throughput on a single chip is becoming feasible. Nevertheless, the readout and output transmission speed cannot keep pace with the increasing pixel numbers. To this end, this paper analyzes the strength of gradient cameras in fast readout and efficient representation, and proposes a low-bit gradient camera scheme based on existing technologies that can resolve the readout and transmission bottlenecks for high throughput video imaging. A multi-scale reconstruction CNN is proposed to reconstruct high-resolution images. Extensive experiments on both simulated and real data are conducted to demonstrate the promising quality and feasibility of the proposed method.
[446] NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results
Xin Li, Yeying Jin, Suhang Yao, Beibei Lin, Zhaoxin Fan, Wending Yan, Xin Jin, Zongwei Wu, Bingchen Li, Peishu Shi, Yufei Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Runzhe Li, Kui Jiang, Zhaocheng Yu, Yiang Chen, Junjun Jiang, Xianming Liu, Hongde Gu, Zeliang Li, Mache You, Jiangxin Dong, Jinshan Pan, Qiyu Rong, Bowen Shao, Hongyuan Jing, Mengmeng Zhang, Bo Ding, Hui Zhang, Yi Ren, Mohab Kishawy, Jun Chen, Anh-Kiet Duong, Petra Gomez-Kramer, Jean-Michel Carozza, Wangzhi Xing, Xin Lu, Enxuan Gu, Jingxi Zhang, Diqi Chen, Qiaosi Yi, Bingcai Wei, Wenjie Li, Bowen Tie, Heng Guo, Zhanyu Ma, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Cici Liu, Yaokun Shi, Paula Garrido Mellado, Daniel Feijoo, Alvaro Garcia Lara, Marcos V. Conde, Zhidong Zhu, Bangshu Xiong, Qiaofeng Ou, Zhibo Rao, Wei Li, Zida Zhang, Hui Geng, Qisheng Xu, Xuyao Deng, Changjian Wang, Kele Xu, Guanglu Dong, Qiyao Zhao, Tianheng Zheng, Chunlei Li, Lichao Mou, Chao Ren, Chang-De Peng, Chieh-Yu Tsai, Guan-Cheng Liu, Li-Wei Kang, Abhishek Rajak, Milan Kumar Singh, Ankit Kumar, Dimple Sonone, Kishor Upla, Kiran Raja, Huilin Zhao, Xing Xu, Chuan Chen, Yeming Lao, Wenjing Xun, Li Yang, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Hao Yang, Ruikun Zhang, Liyuan Pan
Main category: cs.CV
TL;DR: NTIRE 2026 challenge on removing raindrops from dual-focused images under day/night conditions, using Raindrop Clarity dataset with 14,139 training images.
Details
Motivation: Establish a strong benchmark for raindrop removal under various illumination and focus conditions, building on previous challenge success to advance practical image restoration techniques.
Method: Challenge format with 168 registered teams submitting solutions evaluated on the Raindrop Clarity dataset, which contains 14,139 training, 407 validation, and 593 testing images of real-world raindrop-affected scenes.
Result: 17 teams submitted valid final solutions achieving strong performance on the dataset, demonstrating progress in this challenging computer vision task.
Conclusion: The challenge successfully advanced raindrop removal techniques and established practical benchmarks for image restoration under varying illumination and focus conditions.
Abstract: This paper presents an overview of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Building upon the success of the first edition, this challenge attracted a wide range of impressive solutions, all developed and evaluated on our real-world Raindrop Clarity dataset. For this edition, we adjust the dataset with 14,139 images for training, 407 images for validation, and 593 images for testing. The primary goal of this challenge is to establish a strong and practical benchmark for the removal of raindrops under various illumination and focus conditions. In total, 168 teams have registered for the competition, and 17 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the Raindrop Clarity dataset, demonstrating the growing progress in this challenging task.
[447] Language Prompt vs. Image Enhancement: Boosting Object Detection With CLIP in Hazy Environments
Jian Pang, Bingfeng Zhang, Jin Wang, Baodi Liu, Dapeng Tao, Weifeng Liu
Main category: cs.CV
TL;DR: Using language prompts from CLIP to enhance object detection in hazy environments without image enhancement, via novel loss functions and a new synthetic dataset.
Details
Motivation: Object detection in hazy environments is challenging due to degraded visibility and weakened semantics. Traditional image enhancement methods are unstable, so the paper explores using language prompts to enhance semantics directly.
Method: Proposes CLIP-guided Cross-Entropy Loss (CLIP-CE), which uses Approximation of Mutual Exclusion (AME) to weight the loss by each object's semantic weakening. Introduces Fine-tuned AME (FAME) for adaptive weight tuning. Also creates the HazyCOCO dataset with 61,258 synthetic hazy images.
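A hedged sketch of the weighting idea: per-sample cross-entropy is reweighted by how poorly an object's image features match its class text under CLIP, so semantically weakened objects are emphasized. The specific AME weighting function is not reproduced here; the form below is an assumption.

```python
import torch
import torch.nn.functional as F

def clip_weighted_ce(logits: torch.Tensor, targets: torch.Tensor,
                     image_feats: torch.Tensor, text_feats: torch.Tensor):
    """Cross-entropy weighted by a CLIP-derived semantic-weakening score.
    logits: (B, C) detector class scores; targets: (B,) class indices;
    image_feats: (B, D) CLIP image features per object crop;
    text_feats: (C, D) CLIP text features per class prompt."""
    sim = F.cosine_similarity(image_feats, text_feats[targets], dim=-1)  # (B,)
    weights = (1.0 - sim).clamp(min=0.0) + 1e-3   # weaker semantics -> larger weight
    ce = F.cross_entropy(logits, targets, reduction="none")
    return (weights.detach() * ce).mean()          # weights act as constants
```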
Result: Achieves state-of-the-art performance on hazy object detection benchmarks. The method effectively enhances weakened semantics without unstable image enhancement modules.
Conclusion: Language prompts can effectively enhance object semantics in degraded conditions, providing a stable alternative to image enhancement for hazy object detection.
Abstract: Object detection in hazy environments is challenging because degraded objects are nearly invisible and their semantics are weakened by environmental noise, making them difficult for detectors to identify. Common approaches involve image enhancement to boost weakened semantics, but these methods are limited by the instability of enhancement modules. This paper proposes a novel solution by employing language prompts to enhance weakened semantics without image enhancement. Specifically, we design Approximation of Mutual Exclusion (AME) to provide credible weights for Cross-Entropy Loss, resulting in CLIP-guided Cross-Entropy Loss (CLIP-CE). The provided weights assess the semantic weakening of objects. Through the backpropagation of CLIP-CE, weakened semantics are enhanced, making degraded objects easier to detect. In addition, we present Fine-tuned AME (FAME) which adaptively fine-tunes the weight of AME based on the predicted confidence. The proposed FAME compensates for the imbalanced optimization in AME. Furthermore, we present HazyCOCO, a large-scale synthetic hazy dataset comprising 61,258 images. Experimental results demonstrate that our method achieves state-of-the-art performance. The code and dataset will be released.
[448] What Do Vision-Language Models Encode for Personalized Image Aesthetics Assessment?
Koki Ryu, Hitomi Yanaka
Main category: cs.CV
TL;DR: VLMs encode diverse aesthetic attributes in their internal representations, enabling lightweight personalized image aesthetics assessment without fine-tuning through simple linear models.
Details
Motivation: Personalized image aesthetics assessment (PIAA) has practical applications, but it's unclear whether vision-language models (VLMs) encode the rich, multi-level aesthetic attributes needed for effective personalization without model fine-tuning.
Method: Analyze VLM internal representations to examine aesthetic attribute presence and distribution, then leverage these representations for lightweight personalization using simple linear models without model fine-tuning.
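The personalization recipe reduces to fitting a per-user linear model on frozen VLM features; a minimal sketch, with the feature source (one VLM layer's hidden states) and Ridge regressor as assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_user_probe(layer_feats: np.ndarray, user_scores: np.ndarray) -> Ridge:
    """Fit a per-user linear model on frozen VLM representations, the
    lightweight, fine-tuning-free personalization described above.
    layer_feats: (N, D) hidden-state features for N images the user rated;
    user_scores: (N,) that user's aesthetic ratings."""
    probe = Ridge(alpha=1.0)
    probe.fit(layer_feats, user_scores)
    return probe

# Prediction for new images:
# predicted = fit_user_probe(feats, scores).predict(new_feats)
```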
Result: VLMs encode diverse aesthetic attributes that propagate into language decoder layers, enabling effective PIAA with simple linear models. Analysis reveals how aesthetic information transfers across VLM layers and image domains.
Conclusion: VLMs can be effectively utilized for modeling subjective, individual aesthetic preferences through their internal representations, providing insights for personalized aesthetics assessment.
Abstract: Personalized image aesthetics assessment (PIAA) is an important research problem with practical real-world applications. While methods based on vision-language models (VLMs) are promising candidates for PIAA, it remains unclear whether they internally encode rich, multi-level aesthetic attributes required for effective personalization. In this paper, we first analyze the internal representations of VLMs to examine the presence and distribution of such aesthetic attributes, and then leverage them for lightweight, individual-level personalization without model fine-tuning. Our analysis reveals that VLMs encode diverse aesthetic attributes that propagate into the language decoder layers. Building on these representations, we demonstrate that simple linear models can perform PIAA effectively. We further analyze how aesthetic information is transferred across layers in different VLM architectures and across image domains. Our findings provide insights into how VLMs can be utilized for modeling subjective, individual aesthetic preferences. Our code is available at https://github.com/ynklab/vlm-latent-piaa.
[449] LogitDynamics: Reliable ViT Error Detection from Layerwise Logit Trajectories
Ido Beigelman, Moti Freiman
Main category: cs.CV
TL;DR: A method for predicting vision model errors by analyzing how class evidence evolves across transformer layers, using lightweight linear heads to extract features from intermediate layers.
Details
Motivation: Reliable confidence estimation is critical for deploying vision models, and the paper investigates whether vision transformers have internal signals similar to LLMs that can detect hallucinations/errors.
Method: Attach lightweight linear heads to intermediate ViT layers to extract features capturing class evidence evolution, including predicted class logits, top-K competitors, and statistics about class ranking instability across depth. Train a linear probe on these features to predict error indicators.
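A sketch of the feature construction and probe, assuming layerwise logits have already been extracted by the linear heads; the exact feature set and probe in the paper may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def trajectory_features(layer_logits: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Build error-prediction features from a (num_layers, num_classes)
    stack of layerwise logits: the final prediction's logit at every depth,
    its top-K competitors at the last layer, and how often the top-1 class
    flips across depth (rank instability)."""
    pred = int(layer_logits[-1].argmax())
    pred_traj = layer_logits[:, pred]                           # (num_layers,)
    competitors = np.sort(layer_logits[-1])[-top_k:]            # top-K final logits
    flips = (np.diff(layer_logits.argmax(axis=1)) != 0).mean()  # top-1 changes
    return np.concatenate([pred_traj, competitors, [flips]])

def fit_error_probe(features: np.ndarray, is_error: np.ndarray):
    """Linear probe mapping trajectory features to the error indicator."""
    return LogisticRegression(max_iter=1000).fit(features, is_error)
```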
Result: The method improves or matches AUCPR over baselines across datasets and shows stronger cross-dataset generalization while requiring minimal additional computation.
Conclusion: Vision Transformers contain depth-wise signals that can be leveraged for reliable error prediction, similar to hallucination detection in LLMs, enabling efficient confidence estimation.
Abstract: Reliable confidence estimation is critical when deploying vision models. We study error prediction: determining whether an image classifier’s output is correct using only signals from a single forward pass. Motivated by internal-signal hallucination detection in large language models, we investigate whether similar depth-wise signals exist in Vision Transformers (ViTs). We propose a simple method that models how class evidence evolves across layers. By attaching lightweight linear heads to intermediate layers, we extract features from the last L layers that capture both the logits of the predicted class and its top-K competitors, as well as statistics describing instability of top-ranked classes across depth. A linear probe trained on these features predicts the error indicator. Across datasets, our method improves or matches AUCPR over baselines and shows stronger cross-dataset generalization while requiring minimal additional computation.
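The feature recipe lends itself to a compact sketch: per-layer logits for the final prediction and its top-K competitors, plus a depth-wise rank-instability count, fed to a linear probe. Layer-wise logits are simulated below; in the paper they come from linear heads attached to intermediate ViT layers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def trajectory_features(layer_logits: np.ndarray, top_k: int = 3) -> np.ndarray:
    """layer_logits: (L, C) logits from the last L layers for one image."""
    pred = int(layer_logits[-1].argmax())                 # final predicted class
    competitors = np.argsort(layer_logits[-1])[-(top_k + 1):-1]
    per_layer_top1 = layer_logits.argmax(axis=1)
    instability = float((per_layer_top1[1:] != per_layer_top1[:-1]).sum())
    return np.concatenate([layer_logits[:, pred],         # predicted-class trajectory
                           layer_logits[:, competitors].ravel(),
                           [instability]])

rng = np.random.default_rng(0)
X = np.stack([trajectory_features(rng.normal(size=(6, 10))) for _ in range(200)])
y = rng.integers(0, 2, size=200)                          # 1 = classifier was wrong
probe = LogisticRegression(max_iter=1000).fit(X, y)       # linear error predictor
print("error probability of first sample:", probe.predict_proba(X[:1])[0, 1])
```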
[450] Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging
Zihang Fu, Haonan Wang, Jian Kang, Kenji Kawaguchi, Jiaying Wu
Main category: cs.CV
TL;DR: MERIT is a training-free model merging framework that restores temporal reasoning in video-language models by selectively merging self-attention layers between VLMs and their text-only backbones.
Details
Motivation: Multimodal adaptation often weakens the reasoning abilities inherited from language-only pretraining, especially in video-language models where visual alignment can impair temporal reasoning over sequential events.
Method: MERIT searches over layer-wise self-attention merging recipes between a VLM and its paired text-only backbone using an objective that improves temporal reasoning while penalizing degradation in temporal perception.
Result: Across three representative VLMs and multiple challenging video benchmarks, MERIT consistently improves temporal reasoning, preserves or improves temporal perception, and generalizes beyond the search set to four distinct benchmarks.
Conclusion: Targeted, perception-aware model merging can effectively restore temporal reasoning in VLMs without retraining, with selected layers being disproportionately important for reasoning and shifting model decisions toward temporally relevant evidence.
Abstract: Multimodal adaptation equips large language models (LLMs) with perceptual capabilities, but often weakens the reasoning ability inherited from language-only pretraining. This trade-off is especially pronounced in video-language models (VLMs), where visual alignment can impair temporal reasoning (TR) over sequential events. We propose MERIT, a training-free, task-driven model merging framework for restoring TR in VLMs. MERIT searches over layer-wise self-attention merging recipes between a VLM and its paired text-only backbone using an objective that improves TR while penalizing degradation in temporal perception (TP). Across three representative VLMs and multiple challenging video benchmarks, MERIT consistently improves TR, preserves or improves TP, and generalizes beyond the search set to four distinct benchmarks. It also outperforms uniform full-model merging and random layer selection, showing that effective recovery depends on selecting the right layers. Interventional masking and frame-level attribution further show that the selected layers are disproportionately important for reasoning and shift model decisions toward temporally and causally relevant evidence. These results show that targeted, perception-aware model merging can effectively restore TR in VLMs without retraining.
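A rough sketch of what layer-selective self-attention merging could look like, assuming the VLM and its text-only backbone share parameter names; the layer set and interpolation weight are hypothetical stand-ins for what MERIT's search would select:

```python
import torch

@torch.no_grad()
def merge_self_attn(vlm, text_lm, layer_ids, alpha=0.5):
    """Interpolate self-attention weights of `vlm` toward `text_lm` on selected layers."""
    vlm_state, text_state = vlm.state_dict(), text_lm.state_dict()
    for name, param in vlm_state.items():
        # hypothetical naming scheme: "layers.{i}.self_attn.*"
        if any(f"layers.{i}.self_attn" in name for i in layer_ids):
            param.copy_((1 - alpha) * param + alpha * text_state[name])
    return vlm
```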
[451] HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement
Marco Schouten, Ioannis Siglidis, Serge Belongie, Dim P. Papadopoulos
Main category: cs.CV
TL;DR: Learning explicit spatial priors for object placement by distilling knowledge from text-to-image diffusion models, creating a large-scale dataset and lightweight model for fast inference.
Details
Motivation: Existing methods for object placement rely on limited manually annotated data or inpainting-based pipelines with artifacts that promote shortcut learning. There's a need for scalable, automated approaches to learn spatial priors for realistic object placement in natural scenes.
Method: Proposes a fully automated framework that evaluates dense object placements on real backgrounds using diffusion-based inpainting. Creates HiddenObjects dataset with 27M placement annotations across 27k scenes, with ranked bounding box insertions. Distills spatial priors into a lightweight model for fast inference.
Result: Spatial priors outperform sparse human annotations on downstream image editing (3.90 vs. 2.68 VLM-Judge), significantly surpass existing placement baselines and zero-shot Vision-Language Models. Distilled model achieves 230,000x faster inference.
Conclusion: The method successfully learns explicit spatial priors from diffusion models, creating a scalable framework for object placement that outperforms existing approaches and enables practical fast inference.
Abstract: We propose a method to learn explicit, class-conditioned spatial priors for object placement in natural scenes by distilling the implicit placement knowledge encoded in text-conditioned diffusion models. Prior work relies either on manually annotated data, which is inherently limited in scale, or on inpainting-based object-removal pipelines, whose artifacts promote shortcut learning. To address these limitations, we introduce a fully automated and scalable framework that evaluates dense object placements on high-quality real backgrounds using a diffusion-based inpainting pipeline. With this pipeline, we construct HiddenObjects, a large-scale dataset comprising 27M placement annotations, evaluated across 27k distinct scenes, with ranked bounding box insertions for different images and object categories. Experimental results show that our spatial priors outperform sparse human annotations on a downstream image editing task (3.90 vs. 2.68 VLM-Judge), and significantly surpass existing placement baselines and zero-shot Vision-Language Models for object placement. Furthermore, we distill these priors into a lightweight model for fast practical inference (230,000x faster).
[452] Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune
Main category: cs.CV
TL;DR: A lightweight transformer learns fine-grained region-segment alignments from frozen patch and token embeddings to improve compositional generalization in dual-encoder VLMs like CLIP, addressing limitations of global cosine similarity inference.
Details
Motivation: Dual-encoder VLMs like CLIP perform poorly on compositional benchmarks, often characterized as bag-of-words systems. The authors argue this limitation may stem from the standard inference protocol based on global cosine similarity rather than deficient representations.
Method: Introduces a lightweight transformer that learns fine-grained region-segment alignments directly from frozen patch and token embeddings. The approach explicitly enforces localized alignment at inference without updating pretrained encoders.
Result: Learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks. The approach outperforms prior end-to-end compositional training methods under distribution shift.
Conclusion: Global embedding matching is identified as a key bottleneck in dual-encoder VLMs, and alignment mechanisms are crucial for robust compositional generalization. The work demonstrates that compositional limitations can be addressed through inference-time alignment rather than representation updates.
Abstract: Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations than from the standard inference protocol based on global cosine similarity. First, through controlled diagnostic experiments, we show that explicitly enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating pretrained encoders. We then introduce a lightweight transformer that learns such alignments directly from frozen patch and token embeddings. Comparing against full fine-tuning and prior end-to-end compositional training methods, we find that although these approaches improve in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks. These results identify global embedding matching as a key bottleneck in dual-encoder VLMs and highlight the importance of alignment mechanisms for robust compositional generalization.
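The inference-time idea can be illustrated with a late-interaction-style score over frozen embeddings: align each text token to its best-matching patch instead of comparing two global vectors. This is only the diagnostic alignment; the paper's learned transformer refines it.

```python
import torch
import torch.nn.functional as F

def aligned_score(patch_emb: torch.Tensor, token_emb: torch.Tensor) -> torch.Tensor:
    """patch_emb: (P, D) frozen patch embeddings; token_emb: (T, D) token embeddings."""
    patches = F.normalize(patch_emb, dim=-1)
    tokens = F.normalize(token_emb, dim=-1)
    sim = tokens @ patches.T                 # (T, P) token-to-patch similarities
    return sim.max(dim=1).values.mean()      # best patch per token, averaged

score = aligned_score(torch.randn(196, 512), torch.randn(12, 512))
print(f"localized alignment score: {score:.3f}")
```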
[453] Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification
Jiayu Zhang, Shuo Ye, Qilang Ye, Zihan Song, Jiajian Huang, Zitong Yu
Main category: cs.CV
TL;DR: R²ScP is a retrieval-based framework for Audio-Visual Question Answering that handles missing modalities by retrieving domain-specific knowledge instead of generative imputation, with adaptive purification to remove semantic noise.
Details
Motivation: Current AVQA methods struggle with missing modalities in real-world scenarios, and generative imputation approaches fail to capture unique modality-specific knowledge, leading to hallucinations and reduced accuracy.
Method: Proposes R²ScP framework that shifts from generative imputation to retrieval-based recovery using cross-modal retrieval via unified semantic embeddings, with context-aware adaptive purification to remove semantic noise and two-stage training for modeling semantic relationships.
Result: Extensive experiments show R²ScP significantly improves AVQA performance and enhances robustness in modal-incomplete scenarios compared to existing methods.
Conclusion: Retrieval-based recovery with adaptive purification is more effective than generative imputation for handling missing modalities in AVQA, addressing the limitations of capturing modality-specific knowledge and reducing hallucinations.
Abstract: Recent Audio-Visual Question Answering (AVQA) methods have advanced significantly. However, most AVQA methods lack effective mechanisms for handling missing modalities, suffering from severe performance degradation in real-world scenarios with data interruptions. Furthermore, prevailing methods for handling missing modalities predominantly rely on generative imputation to synthesize missing features. While partially effective, these methods tend to capture inter-modal commonalities but struggle to acquire unique, modality-specific knowledge within the missing data, leading to hallucinations and compromised reasoning accuracy. To tackle these challenges, we propose R²ScP, a novel framework that shifts the paradigm of missing modality handling from traditional generative imputation to retrieval-based recovery. Specifically, we leverage cross-modal retrieval via unified semantic embeddings to acquire missing domain-specific knowledge. To maximize semantic restoration, we introduce a context-aware adaptive purification mechanism that eliminates latent semantic noise within the retrieved data. Additionally, we employ a two-stage training strategy to explicitly model the semantic relationships between knowledge from different sources. Extensive experiments demonstrate that R²ScP significantly improves AVQA performance and enhances robustness in modal-incomplete scenarios.
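A toy sketch of retrieval-based recovery under a missing modality, with similarity-weighted averaging standing in for the paper's context-aware purification (all tensors, sizes, and names are hypothetical):

```python
import torch
import torch.nn.functional as F

def retrieve_and_purify(query, bank_keys, bank_values, k=5, temperature=0.1):
    """query: (D,); bank_keys/bank_values: (N, D). Returns a recovered (D,) feature."""
    sims = F.normalize(bank_keys, dim=-1) @ F.normalize(query, dim=-1)  # (N,)
    top = sims.topk(k)
    weights = torch.softmax(top.values / temperature, dim=0)  # soft purification
    return (weights[:, None] * bank_values[top.indices]).sum(dim=0)

# e.g., recover a missing audio feature from the visual/question embedding
recovered = retrieve_and_purify(torch.randn(256), torch.randn(1000, 256),
                                torch.randn(1000, 256))
print(recovered.shape)  # torch.Size([256])
```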
[454] Architecture-Agnostic Modality-Isolated Gated Fusion for Robust Multi-Modal Prostate MRI Segmentation
Yongbo Shu, Wenzhao Xie, Shanhu Yao, Zirui Xin, Luo Lei, Kewen Chen, Aijing Luo
Main category: cs.CV
TL;DR: MIGF (Modality-Isolated Gated Fusion) improves robustness in multi-modal prostate MRI segmentation by maintaining separate modality streams and using modality dropout training, achieving better performance when inputs are incomplete or corrupted.
Details
Motivation: Multi-parametric prostate MRI often has missing or degraded sequences in practice, but existing fusion methods assume complete inputs and entangle modality information early, limiting robustness to corrupted or absent channels.
Method: Proposes MIGF with separate modality-specific encoding streams before learned gating, combined with modality dropout training to enforce compensation behavior under incomplete inputs. Tested on six backbones with seven missing-modality scenarios.
Result: MIGF improved ideal-scenario Ranking Score for UNet, nnUNet, and Mamba by 2.8%, 4.6%, and 13.4% respectively. Best model (MIGFNet-nnUNet) achieved 0.7304 +/- 0.056. Robustness gains come from strict modality isolation and dropout-driven compensation.
Conclusion: For robust multi-modal segmentation: structurally contain corrupted inputs first, then train explicitly for incomplete-input compensation. Simpler design principle outperforms complex adaptive routing approaches.
Abstract: Multi-parametric prostate MRI – combining T2-weighted, apparent diffusion coefficient, and high b-value diffusion-weighted sequences – is central to non-invasive detection of clinically significant prostate cancer, yet in routine practice individual sequences may be missing or degraded by motion, artifacts, or abbreviated protocols. Existing multi-modal fusion strategies typically assume complete inputs and entangle modality-specific information at early layers, offering limited resilience when one channel is corrupted or absent. We propose Modality-Isolated Gated Fusion (MIGF), an architecture-agnostic module that maintains separate modality-specific encoding streams before a learned gating stage, combined with modality dropout training to enforce compensation behavior under incomplete inputs. We benchmark six bare backbones and assess MIGF-equipped models under seven missing-modality and artifact scenarios on the PI-CAI dataset (1,500 studies, fold-0 split, five random seeds). Among bare backbones, nnUNet provided the strongest balance of performance and stability. MIGF improved ideal-scenario Ranking Score for UNet, nnUNet, and Mamba by 2.8%, 4.6%, and 13.4%, respectively; the best model, MIGFNet-nnUNet (gating + ModDrop, no deep supervision), achieved 0.7304 +/- 0.056. Mechanistic analysis reveals that robustness gains arise from strict modality isolation and dropout-driven compensation rather than adaptive per-sample quality routing: the gate converged to a stable modality prior, and deep supervision was beneficial only for the largest backbone while degrading lighter models. These findings support a simpler design principle for robust multi-modal segmentation: structurally contain corrupted inputs first, then train explicitly for incomplete-input compensation.
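A minimal sketch of the two ingredients named above, modality-isolated encoders fused by a learned gate plus modality dropout; channel sizes and the gate design are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, n_modalities=3, channels=32):
        super().__init__()
        # one isolated encoder per MRI sequence (e.g., T2w, ADC, high-b DWI)
        self.encoders = nn.ModuleList(
            [nn.Conv3d(1, channels, 3, padding=1) for _ in range(n_modalities)])
        self.gate = nn.Conv3d(n_modalities * channels, n_modalities, 1)

    def forward(self, xs, p_drop=0.3):
        feats = [enc(x) for enc, x in zip(self.encoders, xs)]
        if self.training:                   # modality dropout: zero whole streams
            feats = [torch.zeros_like(f) if torch.rand(()) < p_drop else f
                     for f in feats]
        g = torch.softmax(self.gate(torch.cat(feats, dim=1)), dim=1)
        return sum(g[:, i:i + 1] * f for i, f in enumerate(feats))

fused = GatedFusion()([torch.randn(1, 1, 8, 32, 32) for _ in range(3)])
print(fused.shape)  # torch.Size([1, 32, 8, 32, 32])
```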
[455] Investigating Bias and Fairness in Appearance-based Gaze Estimation
Burak Akgül, Erol Şahin, Sinan Kalkan
Main category: cs.CV
TL;DR: First comprehensive benchmark evaluating fairness in appearance-based gaze estimation across ethnicity and gender groups, revealing significant performance disparities and limited effectiveness of existing bias mitigation strategies.
Details
Motivation: While gaze estimation has improved in accuracy and domain adaptation, fairness across demographic groups remains unexplored with no comprehensive benchmark for algorithmic bias in this domain.
Method: Established fairness baseline by analyzing state-of-the-art gaze estimation models using standard fairness metrics, evaluated effectiveness of existing bias mitigation strategies when applied to gaze domain.
Result: Revealed significant performance disparities across ethnicity and gender groups, showed that existing bias mitigation strategies have limited fairness contributions in gaze estimation.
Conclusion: Calls for research into developing robust, equitable gaze estimators, releases annotations, code, and trained models to support future research and reproducibility.
Abstract: While appearance-based gaze estimation has achieved significant improvements in accuracy and domain adaptation, the fairness of these systems across different demographic groups remains largely unexplored. To date, there is no comprehensive benchmark quantifying algorithmic bias in gaze estimation. This paper presents the first extensive evaluation of fairness in appearance-based gaze estimation, focusing on ethnicity and gender attributes. We establish a fairness baseline by analyzing state-of-the-art models using standard fairness metrics, revealing significant performance disparities. Furthermore, we evaluate the effectiveness of existing bias mitigation strategies when applied to the gaze domain and show that their fairness contributions are limited. We summarize key insights and open issues. Overall, our work calls for research into developing robust, equitable gaze estimators. To support future research and reproducibility, we publicly release our annotations, code, and trained models at: github.com/akgulburak/gaze-estimation-fairness
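A sketch of the kind of per-group audit such a benchmark relies on: mean angular gaze error per demographic group and the largest inter-group gap (simulated data; the metric names are generic, not the paper's exact definitions):

```python
import numpy as np

rng = np.random.default_rng(0)
angular_error = rng.gamma(shape=2.0, scale=2.0, size=1000)        # degrees, simulated
group = rng.choice(["group_a", "group_b", "group_c"], size=1000)  # demographic labels

per_group = {g: float(angular_error[group == g].mean()) for g in np.unique(group)}
gap = max(per_group.values()) - min(per_group.values())
print(per_group, f"disparity gap: {gap:.2f} deg")
```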
[456] Defending against Patch-Based and Texture-Based Adversarial Attacks with Spectral Decomposition
Wei Zhang, Xinyu Chang, Xiao Li, Yiming Zhu, Xiaolin Hu
Main category: cs.CV
TL;DR: ASD is a defense mechanism using Discrete Wavelet Transform spectral analysis combined with adversarial training to protect against patch-based and texture-based adversarial attacks in computer vision systems.
Details
Motivation: Patch-based and texture-based adversarial attacks pose real threats to security-critical applications like person detection in surveillance and autonomous systems. Existing defenses struggle against adaptive attacks specifically designed to bypass them.
Method: Proposes Adversarial Spectrum Defense (ASD) that uses spectral decomposition via Discrete Wavelet Transform to analyze adversarial patterns across multiple frequency scales, capturing both high-frequency (fine-grained) and low-frequency (spatially pervasive) perturbations. Combines this with off-the-shelf Adversarial Training.
Result: ASD+AT achieved state-of-the-art performance against various attacks, outperforming previous defense methods by 21.73% in APs, even against strong adaptive adversaries specifically designed against ASD.
Conclusion: ASD provides a comprehensive defense strategy against both patch-based and texture-based adversarial attacks by leveraging multi-resolution spectral analysis, demonstrating robustness even against adaptive attacks.
Abstract: Adversarial examples present significant challenges to the security of Deep Neural Network (DNN) applications. Specifically, there are patch-based and texture-based attacks that are usually used to craft physical-world adversarial examples, posing real threats to security-critical applications such as person detection in surveillance and autonomous systems, because those attacks are physically realizable. Existing defense mechanisms face challenges in the adaptive attack setting, i.e., the attacks are specifically designed against them. In this paper, we propose Adversarial Spectrum Defense (ASD), a defense mechanism that leverages spectral decomposition via Discrete Wavelet Transform (DWT) to analyze adversarial patterns across multiple frequency scales. The multi-resolution and localization capability of DWT enables ASD to capture both high-frequency (fine-grained) and low-frequency (spatially pervasive) perturbations. By integrating this spectral analysis with the off-the-shelf Adversarial Training (AT) model, ASD provides a comprehensive defense strategy against both patch-based and texture-based adversarial attacks. Extensive experiments demonstrate that ASD+AT achieved state-of-the-art (SOTA) performance against various attacks, outperforming the APs of previous defense methods by 21.73% even in the face of strong adaptive adversaries specifically designed against ASD. Code available at https://github.com/weiz0823/adv-spectral-defense
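The spectral-analysis step can be sketched with a multi-level 2D DWT, which separates an image into a low-frequency approximation band and high-frequency detail bands whose energies a defense like ASD can inspect. This shows the analysis only (assuming the PyWavelets package), not the full ASD pipeline:

```python
import numpy as np
import pywt

image = np.random.rand(256, 256)                     # stand-in for an input frame
coeffs = pywt.wavedec2(image, wavelet="db2", level=3)
approx, detail_levels = coeffs[0], coeffs[1:]

print("low-frequency energy:", float((approx ** 2).sum()))
for lvl, (cH, cV, cD) in enumerate(detail_levels, start=1):
    energy = sum(float((band ** 2).sum()) for band in (cH, cV, cD))
    print(f"high-frequency energy at level {lvl}: {energy:.1f}")
```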
[457] Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization
Yuqi Chen, Xiaohan Zhang, Ahmad Arrabi, Waqas Sultani, Chen Chen, Safwan Wshah
Main category: cs.CV
TL;DR: Adapting Multimodal Large Language Models (MLLMs) for natural-language guided cross-view geo-localization via parameter-efficient finetuning, achieving state-of-the-art performance with minimal trainable parameters.
Details
Motivation: Existing CLIP-style dual-encoder architectures for cross-view geo-localization suffer from weak cross-modal generalization and require complex designs, while MLLMs have powerful semantic reasoning capabilities but aren't optimized for retrieval tasks.
Method: Parameter-efficient finetuning of MLLMs that optimizes latent representations while preserving pretrained multimodal knowledge, enabling strong cross-modal alignment without architectural redesign. Systematic analysis of diverse variables from model backbone to feature aggregation.
Result: Achieves SOTA on GeoText-1652 with 12.2% improvement in Text-to-Image Recall@1, top performance in 5 out of 12 subtasks on CVG-Text, surpassing baselines with far fewer trainable parameters.
Conclusion: MLLMs serve as robust foundation for semantic cross-view retrieval, paving way for MLLM-based geo-localization as scalable, powerful alternative to traditional dual-encoder designs.
Abstract: Natural-language Guided Cross-view Geo-localization (NGCG) aims to retrieve geo-tagged satellite imagery using textual descriptions of ground scenes. While recent NGCG methods commonly rely on CLIP-style dual-encoder architectures, they often suffer from weak cross-modal generalization and require complex architectural designs. In contrast, Multimodal Large Language Models (MLLMs) offer powerful semantic reasoning capabilities but are not directly optimized for retrieval tasks. In this work, we present a simple yet effective framework to adapt MLLMs for NGCG via parameter-efficient finetuning. Our approach optimizes latent representations within the MLLM while preserving its pretrained multimodal knowledge, enabling strong cross-modal alignment without redesigning model architectures. Through systematic analysis of diverse variables, from model backbone to feature aggregation, we provide practical and generalizable insights for leveraging MLLMs in NGCG. Our method achieves SOTA on GeoText-1652 with a 12.2% improvement in Text-to-Image Recall@1 and secures top performance in 5 out of 12 subtasks on CVG-Text, all while surpassing baselines with far fewer trainable parameters. These results position MLLMs as a robust foundation for semantic cross-view retrieval and pave the way for MLLM-based NGCG to be adopted as a scalable, powerful alternative to traditional dual-encoder designs. Project page and code are available at https://yuqichen888.github.io/NGCG-MLLMs-web/.
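A sketch of the retrieval objective such an adaptation typically optimizes: a symmetric InfoNCE loss aligning pooled MLLM text and image latents (the pooling and projection choices here are hypothetical; the paper studies several variants):

```python
import torch
import torch.nn.functional as F

def infonce(text_emb, image_emb, temperature=0.07):
    """text_emb, image_emb: (B, D) latents for matched text-image pairs."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = t @ v.T / temperature              # (B, B) similarity matrix
    labels = torch.arange(len(t))               # positives on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

loss = infonce(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```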
[458] MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
Junzhi Ning, Jiashi Lin, Yingying Fang, Wei Li, Jiyao Liu, Cheng Tang, Chenglong Ma, Wenhao Tang, Tianbin Li, Ziyan Huang, Guang Yang, Junjun He
Main category: cs.CV
TL;DR: MMRareBench is the first rare-disease benchmark evaluating multimodal and multi-image clinical capabilities across diagnosis, treatment planning, evidence alignment, and examination suggestion, revealing fragmented MLLM capabilities and low treatment-planning performance.
Details
Motivation: Current MLLMs are tested on common conditions with single images, but rare diseases require multimodal and multi-image evidence integration under data scarcity, which remains unevaluated. Clinicians lack prior knowledge for rare diseases and must rely on case-level evidence.
Method: Created MMRareBench with 1,756 QA pairs and 7,958 medical images from PMC case reports, featuring Orphanet-anchored ontology alignment, track-specific leakage control, evidence-grounded annotations, and two-level evaluation protocol. Evaluated 23 MLLMs across four workflow-aligned tracks.
Result: Evaluation revealed fragmented capability profiles, universally low treatment-planning performance, and medical-domain models trailing general-purpose MLLMs on multi-image tracks despite competitive diagnostic scores. Shows capacity dilution effect where medical fine-tuning narrows diagnostic gap but erodes compositional multi-image capability.
Conclusion: Rare-disease clinical workflows demand multimodal and multi-image evidence integration that current MLLMs struggle with, especially in treatment planning. Medical fine-tuning may inadvertently harm compositional reasoning needed for rare disease cases.
Abstract: Multimodal large language models (MLLMs) have advanced clinical tasks for common conditions, but their performance on rare diseases remains largely untested. In rare-disease scenarios, clinicians often lack prior clinical knowledge, forcing them to rely strictly on case-level evidence for clinical judgments. Existing benchmarks predominantly evaluate common-condition, single-image settings, leaving multimodal and multi-image evidence integration under rare-disease data scarcity systematically unevaluated. We introduce MMRareBench, to our knowledge the first rare-disease benchmark jointly evaluating multimodal and multi-image clinical capability across four workflow-aligned tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark comprises 1,756 question-answer pairs with 7,958 associated medical images curated from PMC case reports, with Orphanet-anchored ontology alignment, track-specific leakage control, evidence-grounded annotations, and a two-level evaluation protocol. A systematic evaluation of 23 MLLMs reveals fragmented capability profiles and universally low treatment-planning performance, with medical-domain models trailing general-purpose MLLMs substantially on multi-image tracks despite competitive diagnostic scores. These patterns are consistent with a capacity dilution effect: medical fine-tuning can narrow the diagnostic gap but may erode the compositional multi-image capability that rare-disease evidence integration demands.
[459] Lung Cancer Detection Using Deep Learning
Imama Ajmi, Abhishek Das
Main category: cs.CV
TL;DR: A deep learning study comparing CNN architectures (InceptionV3, MobileNetV2, VGG16, ResNet152) and proposing a novel 16-layer CNN model for lung cancer detection, with evaluation metrics including accuracy, precision, recall, and F1-score.
Details
Motivation: Lung cancer has low survival rates (20%) due to late detection, with 10-15% cases in non-smokers. Early and accurate detection is crucial for effective treatment, necessitating improved deep learning approaches for classification.
Method: Comparative study of deep learning models: InceptionV3, MobileNetV2, VGG16, ResNet152, plus a novel proposed 16-layer CNN architecture with convolutional, pooling, flatten, dropout, fully connected, and dense layers. Tested up to 30 epochs with metrics including accuracy, precision, recall, F1-score.
Result: Proposed 16-layer CNN model shows increasing accuracy with more epochs (tested up to 30), overcomes overfitting problems, and provides comprehensive performance comparison of all models using standard evaluation metrics.
Conclusion: The proposed CNN architecture demonstrates improved lung cancer detection capabilities with consistent accuracy improvement across epochs and overcomes overfitting, contributing to advancement in medical imaging classification.
Abstract: Lung cancer, the second leading cause of cancer-related deaths, is primarily linked to long-term tobacco smoking (85% of cases). Surprisingly, 10-15% of cases occur in non-smokers. In 2020, approximately 2 million people were affected globally, resulting in 1.5 million deaths. The survival rate, at around 20%, lags behind other cancers, partly due to late-stage symptom manifestation, necessitating early and accurate detection for effective treatment. Performance metrics such as accuracy, precision, recall (sensitivity), and F1-score are computed to provide a comprehensive evaluation of each model’s capabilities. By comparing these metrics, this study offers insights into the strengths and limitations of each approach, contributing to the advancement of lung cancer detection techniques. In this paper, different deep learning algorithms - InceptionV3, MobileNetV2, VGG16, and ResNet152 - are explored for their efficacy in classifying lung cancer cases. Our proposed model is a 16-layer CNN architecture. By integrating multiple layer types, such as convolutional, pooling, flatten, dropout, fully connected, and dense layers, the model leverages the strengths of each layer to enhance its predictive capabilities. A notable property of the proposed model is that its accuracy increases consistently with the number of epochs; we have tested performance up to 30 epochs. The proposed model also mitigates overfitting.
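As an illustration of the listed layer types, here is a small PyTorch stack with convolutional, pooling, flatten, dropout, and fully connected layers; the paper's exact 16-layer configuration is not given in the abstract, so the depth and widths below are guesses:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(0.5),                       # regularization against overfitting
    nn.Linear(128 * 28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 2),                     # cancer vs. normal
)
logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 2])
```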
[460] At FullTilt: Real-Time Open-Set 3D Macromolecule Detection Directly from Tilted 2D Projections
Ming-Yang Ho, Alberto Bartesaghi
Main category: cs.CV
TL;DR: FullTilt: An end-to-end framework for open-set 3D macromolecule detection that operates directly on 2D tilt-series instead of reconstructed 3D tomograms, achieving orders of magnitude faster inference with reduced VRAM requirements.
Details
Motivation: Current 3D macromolecule detection methods in cryogenic electron tomography require slow sliding-window inference over extracted subvolumes due to VRAM constraints, making large-scale analysis impractical. The authors aim to eliminate redundant volumetric computation by working directly with the original 2D tilt-series data.
Method: FullTilt operates directly on aligned 2D tilt-series rather than reconstructed 3D volumes. It introduces: 1) a tilt-series encoder for efficient cross-view information fusion, 2) a multiclass visual prompt encoder for flexible prompting, 3) a tilt-aware query initializer to anchor 3D queries effectively, and 4) an auxiliary geometric primitives module to enhance multi-view geometry understanding and robustness to imaging artifacts.
Result: Extensive evaluations on three real-world datasets show that FullTilt achieves state-of-the-art zero-shot performance while drastically reducing runtime and VRAM requirements. The method accelerates inference by orders of magnitude compared to traditional volumetric approaches.
Conclusion: FullTilt represents a paradigm shift in 3D macromolecule detection by eliminating redundant volumetric computation through direct tilt-series processing. The framework enables rapid, large-scale visual proteomics analysis with practical computational requirements.
Abstract: Open-set 3D macromolecule detection in cryogenic electron tomography eliminates the need for target-specific model retraining. However, strict VRAM constraints prohibit processing an entire 3D tomogram, forcing current methods to rely on slow sliding-window inference over extracted subvolumes. To overcome this, we propose FullTilt, an end-to-end framework that redefines 3D detection by operating directly on aligned 2D tilt-series. Because a tilt-series contains significantly fewer images than slices in a reconstructed tomogram, FullTilt eliminates redundant volumetric computation, accelerating inference by orders of magnitude. To process the entire tilt-series simultaneously, we introduce a tilt-series encoder to efficiently fuse cross-view information. We further propose a multiclass visual prompt encoder for flexible prompting, a tilt-aware query initializer to effectively anchor 3D queries, and an auxiliary geometric primitives module to enhance the model’s understanding of multi-view geometry while improving robustness to adverse imaging artifacts. Extensive evaluations on three real-world datasets demonstrate that FullTilt achieves state-of-the-art zero-shot performance while drastically reducing runtime and VRAM requirements, paving the way for rapid, large-scale visual proteomics analysis. All code and data will be publicly available upon publication.
[461] HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
Haiyan Jiang, Deyu Zhang, Dongdong Weng, Weitao Song, Henry Been-Lirn Duh
Main category: cs.CV
TL;DR: HOG-Layout is a text-driven hierarchical 3D scene generation and editing system using LLMs and VLMs with RAG for semantic consistency and optimization for physical plausibility.
Details
Motivation: Manual 3D layout creation is labor-intensive, while data-driven methods lack diversity. Large models offer new opportunities for 3D scene synthesis, but need to address semantic and physical consistency issues.
Method: Uses LLMs and VLMs for text-driven hierarchical scene generation, incorporates RAG for semantic consistency, adds optimization module for physical consistency, and employs hierarchical representation for real-time editing.
Result: Produces more reasonable environments than existing baselines while supporting fast and intuitive scene editing with improved semantic and physical consistency.
Conclusion: HOG-Layout successfully enables text-driven hierarchical 3D scene generation and real-time editing with improved consistency and plausibility through LLMs, VLMs, RAG, and optimization techniques.
Abstract: 3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for 3D scene synthesis. We present HOG-Layout that enables text-driven hierarchical scene generation, optimization and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG) technology, incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to enhance inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments compared with existing baselines, while supporting fast and intuitive scene editing.
[462] Uncertainty-quantified Pulse Signal Recovery from Facial Video using Regularized Stochastic Interpolants
Vineet R. Shenoy, Cheng Peng, Rama Chellappa, Yu Sun
Main category: cs.CV
TL;DR: RIS-iPPG introduces a stochastic framework for imaging photoplethysmography that models BVP recovery as an inverse problem with uncertainty quantification through test-time sampling of solution space.
Details
Motivation: Current iPPG algorithms lack uncertainty analysis crucial for clinical applications, as they don't perform test-time sampling of solution space to quantify reconstruction confidence.
Method: Models iPPG recovery as an inverse problem using probability paths that evolve camera pixel distribution to ground-truth signal distribution via flow and score vectors; solves stochastic differential equation at test-time to sample posterior distribution; adds regularization using correlation between residual flow vectors of adjacent time windows.
Result: Superior reconstruction quality and uncertainty estimates on three datasets compared to state-of-the-art methods.
Conclusion: RIS-iPPG provides critical uncertainty quantification for iPPG algorithms, enabling wider adoption in clinical and consumer settings through reliable confidence measures.
Abstract: Imaging Photoplethysmography (iPPG), an optical procedure which recovers a human’s blood volume pulse (BVP) waveform using pixel readout from a camera, is an exciting research field with many researchers performing clinical studies of iPPG algorithms. While current algorithms to solve the iPPG task have shown outstanding performance on benchmark datasets, no state-of-the-art algorithm, to the best of our knowledge, performs test-time sampling of solution space, precluding an uncertainty analysis that is critical for clinical applications. We address this deficiency through a new paradigm named Regularized Interpolants with Stochasticity for iPPG (RIS-iPPG). Modeling iPPG recovery as an inverse problem, we build probability paths that evolve the camera pixel distribution to the ground-truth signal distribution by predicting the instantaneous flow and score vectors of a time-dependent stochastic process; and at test-time, we sample the posterior distribution of the correct BVP waveform given the camera pixel intensity measurements by solving a stochastic differential equation. Given that physiological changes are slowly varying, we show that iPPG recovery can be improved through regularization that maximizes the correlation between the residual flow vector predictions of two adjacent time windows. Experimental results on three datasets show that RIS-iPPG provides superior reconstruction quality and uncertainty estimates of the reconstruction, a critical tool for the widespread adoption of iPPG algorithms in clinical and consumer settings.
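The adjacent-window regularizer admits a compact sketch: penalize one minus the Pearson correlation between the residual flow predictions of two neighboring time windows (the inputs below are placeholders for the model's flow vectors):

```python
import torch

def correlation_penalty(flow_a: torch.Tensor, flow_b: torch.Tensor) -> torch.Tensor:
    """Returns 1 - corr(flow_a, flow_b); minimizing it maximizes correlation."""
    a = flow_a - flow_a.mean()
    b = flow_b - flow_b.mean()
    corr = (a * b).sum() / (a.norm() * b.norm() + 1e-8)
    return 1.0 - corr

# added to the training loss for each pair of adjacent windows
loss_reg = correlation_penalty(torch.randn(300), torch.randn(300))
print(float(loss_reg))
```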
[463] LIDARLearn: A Unified Deep Learning Library for 3D Point Cloud Classification, Segmentation, and Self-Supervised Representation Learning
Said Ohamouddou, Hanaa El Afia, Abdellatif El Afia, Raddouane Chiheb
Main category: cs.CV
TL;DR: LIDARLearn is a unified PyTorch library for 3D point cloud analysis that integrates 55+ model configurations across supervised, self-supervised, and parameter-efficient fine-tuning methods with standardized evaluation protocols.
Details
Motivation: Current point cloud analysis methods are scattered across incompatible codebases with different data pipelines, evaluation protocols, and configuration formats, making fair comparisons and reproducibility difficult.
Method: Developed a unified, extensible PyTorch library with registry-based framework supporting 29 supervised architectures, 7 SSL pre-training methods, and 5 PEFT strategies. Includes standardized training runners, cross-validation, automated table generation, statistical testing, and comprehensive test suite.
Result: Created LIDARLearn library with 55+ model configurations covering classification, semantic segmentation, part segmentation, and few-shot learning. Includes 2,200+ automated tests for end-to-end validation and supports rigorous multi-model comparison with statistical testing.
Conclusion: LIDARLearn provides a comprehensive, standardized framework for point cloud analysis that enables fair comparisons, reproducibility, and extensibility for the research community.
Abstract: Three-dimensional (3D) point cloud analysis has become central to applications ranging from autonomous driving and robotics to forestry and ecological monitoring. Although numerous deep learning methods have been proposed for point cloud understanding, including supervised backbones, self-supervised pre-training (SSL), and parameter-efficient fine-tuning (PEFT), their implementations are scattered across incompatible codebases with differing data pipelines, evaluation protocols, and configuration formats, making fair comparisons difficult. We introduce LIDARLearn, a unified, extensible PyTorch library that integrates over 55 model configurations covering 29 supervised architectures, seven SSL pre-training methods, and five PEFT strategies, all within a single registry-based framework supporting classification, semantic segmentation, part segmentation, and few-shot learning. LIDARLearn provides standardised training runners, cross-validation with stratified K-fold splitting, automated LaTeX/CSV table generation, built-in Friedman/Nemenyi statistical testing with critical-difference diagrams for rigorous multi-model comparison, and a comprehensive test suite with 2,200+ automated tests validating every configuration end-to-end. The code is available at https://github.com/said-ohamouddou/LIDARLearn under the MIT licence.
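A sketch of the registry pattern such a library typically builds on, mapping configuration strings to model classes; the decorator and names are illustrative, not LIDARLearn's actual API:

```python
MODEL_REGISTRY: dict[str, type] = {}

def register_model(name: str):
    """Decorator that registers a model class under a config-friendly name."""
    def wrap(cls: type) -> type:
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register_model("pointnet_cls")
class PointNetClassifier:
    def __init__(self, num_classes: int = 40):
        self.num_classes = num_classes

# A config file can then name the model, and the training runner builds it:
model = MODEL_REGISTRY["pointnet_cls"](num_classes=13)
print(type(model).__name__, model.num_classes)
```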
[464] ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
Mingyu Dong, Chong Xia, Mingyuan Jia, Weichen Lyu, Long Xu, Zheng Zhu, Yueqi Duan
Main category: cs.CV
TL;DR: ReplicateAnyScene: A framework for fully automated, zero-shot transformation of casual videos into compositional 3D scenes using vision foundation models and five-stage cascade alignment.
Details
Motivation: Humans can rapidly perceive and segment objects from videos and mentally assemble them into structured 3D scenes. Current methods struggle with practical deployment due to insufficient cross-modal integration, reliance on manual object prompting, auxiliary visual inputs, and training biases restricting them to simple scenes.
Method: Five-stage cascade pipeline that extracts and structurally aligns generic priors from vision foundation models across textual, visual, and spatial dimensions, grounding them into structured 3D representations while ensuring semantic coherence and physical plausibility.
Result: Superior performance over existing baselines in generating high-quality compositional 3D scenes, demonstrated through extensive experiments. Also introduces C3DR benchmark for comprehensive evaluation.
Conclusion: ReplicateAnyScene enables practical deployment of compositional 3D reconstruction from casual videos without manual intervention, advancing Spatial Intelligence and Embodied AI capabilities.
Abstract: Humans exhibit an innate capacity to rapidly perceive and segment objects from video observations, and even mentally assemble them into structured 3D scenes. Replicating such capability, termed compositional 3D reconstruction, is pivotal for the advancement of Spatial Intelligence and Embodied AI. However, existing methods struggle to achieve practical deployment due to the insufficient integration of cross-modal information, leaving them dependent on manual object prompting, reliant on auxiliary visual inputs, and restricted to overly simplistic scenes by training biases. To address these limitations, we propose ReplicateAnyScene, a framework capable of fully automated and zero-shot transformation of casually captured videos into compositional 3D scenes. Specifically, our pipeline incorporates a five-stage cascade to extract and structurally align generic priors from vision foundation models across textual, visual, and spatial dimensions, grounding them into structured 3D representations and ensuring semantic coherence and physical plausibility of the constructed scenes. To facilitate a more comprehensive evaluation of this task, we further introduce the C3DR benchmark to assess reconstruction quality from diverse aspects. Extensive experiments demonstrate the superiority of our method over existing baselines in generating high-quality compositional 3D scenes.
[465] WBCBench 2026: A Challenge for Robust White Blood Cell Classification Under Class Imbalance
Xin Tian, Xudong Ma, Tianqi Yang, Alin Achim, Bartłomiej W Papież, Phandee Watanaboonyongcharoen, Nantheera Anantrasirichai
Main category: cs.CV
TL;DR: WBCBench 2026 is an ISBI challenge for automated white blood cell classification that tests algorithms under class imbalance, patient-level data separation, and synthetic domain shifts with controlled noise/blur/illumination perturbations.
Details
Motivation: To create a comprehensive benchmark for white blood cell classification that addresses real-world challenges including severe class imbalance, strict patient-level data separation to prevent data leakage, and synthetic domain shifts that simulate scanner- and setting-induced variations between development and deployment conditions.
Method: The benchmark uses single-site microscopic blood smear images with standardized staining and expert annotations. It’s organized into two phases: Phase 1 provides pristine training data, while Phase 2 introduces degraded images with split-specific severity distributions for train, validation, and test sets. The benchmark includes a standardized submission schema, open-source evaluator, and uses macro-averaged F1 score as the primary ranking metric.
Result: The paper reviews the challenge and summarizes proposed solutions and final outcomes, though specific numerical results are not provided in the abstract. The benchmark successfully creates a controlled environment to stress-test algorithms under realistic deployment conditions.
Conclusion: WBCBench 2026 provides a rigorous benchmark for WBC classification that addresses key practical challenges including domain shift, class imbalance, and proper patient-level data separation, enabling more robust algorithm development and evaluation.
Abstract: We present WBCBench 2026, an ISBI challenge and benchmark for automated WBC classification designed to stress-test algorithms under three key difficulties: (i) severe class imbalance across 13 morphologically fine-grained WBC classes, (ii) strict patient-level separation between training, validation and test sets, and (iii) synthetic scanner- and setting-induced domain shift via controlled noise, blur and illumination perturbations. All images are single-site microscopic blood smear acquisitions with standardised staining and expert hematopathologist annotations. This paper reviews the challenge and summarises the proposed solutions and final outcomes. The benchmark is organised into two phases. Phase 1 provides a pristine training set. Phase 2 introduces degraded images with split-specific severity distributions for train, validation and test, emulating a realistic shift between development and deployment conditions. We specify a standardised submission schema, open-source evaluator, and macro-averaged F1 score as the primary ranking metric.
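The primary ranking metric in brief: macro-averaged F1 weights all 13 WBC classes equally, so rare classes count as much as common ones under class imbalance (labels below are random stand-ins):

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 13, size=2000)                 # 13 WBC classes
# simulated predictions: correct 70% of the time, random otherwise
y_pred = np.where(rng.random(2000) < 0.7, y_true, rng.integers(0, 13, size=2000))

print(f"macro F1: {f1_score(y_true, y_pred, average='macro'):.3f}")
```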
[466] Analytical Modeling and Correction of Distance Error in Homography-Based Ground-Plane Mapping
Mateusz Szulc, Marcin Iwanowski
Main category: cs.CV
TL;DR: Paper analyzes distance estimation errors from homography perturbations in monocular camera systems, showing quadratic error growth with distance and comparing regression vs gradient descent correction methods.
Details
Motivation: Manual initialization of planar homographies for monocular distance estimation often contains small inaccuracies that propagate into systematic distance distortions, affecting intelligent monitoring systems.
Method: Derived explicit relationship between homography perturbations and resulting distance error, showing quadratic error growth. Evaluated two correction strategies: regression-based estimation of quadratic error function and direct homography optimization via coordinate-based gradient descent.
Result: Large-scale simulation with 19+ million test samples showed regression achieves higher peak accuracy when model is reliably fitted, while gradient descent provides greater robustness against poor initial calibration.
Conclusion: Improving geometric calibration may yield greater performance gains than increasing model complexity in practical monocular distance estimation systems.
Abstract: Accurate distance estimation from monocular cameras is essential for intelligent monitoring systems. In many deployments, image coordinates are mapped to ground positions using planar homographies initialized by manual selection of corresponding regions. Small inaccuracies in this initialization propagate into systematic distance distortions. This paper derives an explicit relationship between homography perturbations and the resulting distance error, showing that the error grows approximately quadratically with the true distance from the camera. Based on this model, two simple correction strategies are evaluated: regression-based estimation of the quadratic error function and direct optimization of the homography via coordinate-based gradient descent. A large-scale simulation study with more than 19 million test samples demonstrates that regression achieves higher peak accuracy when the model is reliably fitted, whereas gradient descent provides greater robustness against poor initial calibration. This suggests that improving geometric calibration may yield greater performance gains than increasing model complexity in many practical systems.
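A sketch of the regression-based correction: fit the approximately quadratic relation between distance and error, then subtract the fitted error from new measurements (synthetic data generated to match the derived error model, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
true_dist = rng.uniform(5, 60, size=2000)                          # meters
error = 0.004 * true_dist**2 + rng.normal(scale=0.2, size=2000)    # ~quadratic growth
measured = true_dist + error

# Fit error as a quadratic function of the measured distance, then subtract it.
coeffs = np.polyfit(measured, measured - true_dist, deg=2)
corrected = measured - np.polyval(coeffs, measured)
print("RMSE before:", float(np.sqrt(((measured - true_dist) ** 2).mean())),
      "after:", float(np.sqrt(((corrected - true_dist) ** 2).mean())))
```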
[467] Uncertainty-Guided Attention and Entropy-Weighted Loss for Precise Plant Seedling Segmentation
Mohamed Ehab, Ali Hamdi
Main category: cs.CV
TL;DR: UGDA-Net is an uncertainty-guided dual attention network for plant seedling segmentation that uses channel variance modulation, entropy-weighted loss, and deep supervision to improve segmentation of fine leaf structures.
Details
Motivation: Plant seedling segmentation is important for automated phenotyping in precision agriculture, but standard segmentation models struggle with intricate background images and fine leaf structures.
Method: Proposes UGDA-Net with three components: 1) Uncertainty-Guided Dual Attention using channel variance to modulate feature maps, 2) entropy-weighted hybrid loss focusing on high-uncertainty boundary pixels, and 3) deep supervision for intermediate encoder layers.
Result: Improved segmentation performance with 9.3% increase in Dice coefficient above baseline for U-Net and 13.2% above baseline for LinkNet, with reduced false positives at leaf boundaries and uncertainty heatmaps consistent with complex morphology.
Conclusion: UGDA-Net provides high-definition segmentation of delicate plant structures, showing that uncertainty-guided attention and uncertainty-weighted loss are complementary systems for improving segmentation accuracy.
Abstract: Plant seedling segmentation supports automated phenotyping in precision agriculture. Standard segmentation models face difficulties due to intricate backgrounds and fine leaf structures. We introduce UGDA-Net (Uncertainty-Guided Dual Attention Network with Entropy-Weighted Loss and Deep Supervision), which comprises three novel components. The first is Uncertainty-Guided Dual Attention (UGDA), which uses channel variance to modulate feature maps. The second is an entropy-weighted hybrid loss function that focuses on high-uncertainty boundary pixels. The third employs deep supervision for intermediate encoder layers. We performed a comprehensive, systematic ablation study on two widely used architectures, U-Net and LinkNet, analyzing five incremental configurations: Baseline, Loss-only, Attention-only, Deep Supervision, and UGDA-Net. We trained UGDA-Net on a high-resolution plant seedling image dataset containing 432 images and demonstrate improved segmentation performance and accuracy, with a Dice coefficient 9.3% above baseline for U-Net and 13.2% above baseline for LinkNet. Qualitative overlays show reduced false positives at leaf boundaries, and uncertainty heatmaps are consistent with the complex leaf morphology. UGDA-Net aids in the segmentation of delicate plant structures and provides a high-fidelity solution. The results show that uncertainty-guided attention and uncertainty-weighted loss are complementary mechanisms.
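The entropy-weighted idea in isolation: re-weight per-pixel binary cross-entropy by prediction entropy so uncertain boundary pixels dominate the gradient (UGDA-Net's hybrid loss includes further terms not shown):

```python
import torch
import torch.nn.functional as F

def entropy_weighted_bce(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    p = torch.sigmoid(logits)
    entropy = -(p * torch.log(p + 1e-8) + (1 - p) * torch.log(1 - p + 1e-8))
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    weights = 1.0 + entropy / entropy.max().clamp_min(1e-8)   # emphasize uncertainty
    return (weights * bce).mean()

loss = entropy_weighted_bce(torch.randn(2, 1, 64, 64),
                            torch.rand(2, 1, 64, 64).round())
print(float(loss))
```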
[468] HO-Flow: Generalizable Hand-Object Interaction Generation with Latent Flow Matching
Zerui Chen, Rolandos Alexandros Potamias, Shizhe Chen, Jiankang Deng, Cordelia Schmid, Stefanos Zafeiriou
Main category: cs.CV
TL;DR: HO-Flow: A framework for generating realistic 3D hand-object interaction sequences from text and canonical 3D objects using interaction-aware VAE encoding and masked flow matching for temporal coherence.
Details
Motivation: Existing methods for generating 3D hand-object interactions lack expressive motion representations and temporal reasoning capabilities, limiting their ability to produce physically plausible and temporally coherent sequences.
Method: 1) Interaction-aware variational autoencoder encodes hand-object motion sequences into unified latent space using kinematics; 2) Masked flow matching combines auto-regressive temporal reasoning with continuous latent generation; 3) Predicts object motions relative to initial frame for better generalization and pre-training on synthetic data.
Result: State-of-the-art performance on GRAB, OakInk, and DexYCB benchmarks in both physical plausibility and motion diversity for interaction motion synthesis.
Conclusion: HO-Flow effectively addresses limitations in 3D hand-object interaction generation by combining interaction-aware encoding with flow-based temporal modeling, enabling realistic synthesis from text and 3D objects.
Abstract: Generating realistic 3D hand-object interactions (HOI) is a fundamental challenge in computer vision and robotics, requiring both temporal coherence and high-fidelity physical plausibility. Existing methods remain limited in their ability to learn expressive motion representations for generation and perform temporal reasoning. In this paper, we present HO-Flow, a framework for synthesizing realistic hand-object motion sequences from texts and canonical 3D objects. HO-Flow first employs an interaction-aware variational autoencoder to encode sequences of hand and object motions into a unified latent manifold by incorporating hand and object kinematics, enabling the representation to capture rich interaction dynamics. It then leverages a masked flow matching model that combines auto-regressive temporal reasoning with continuous latent generation, improving temporal coherence. To further enhance generalization, HO-Flow predicts object motions relative to the initial frame, enabling effective pre-training on large-scale synthetic data. Experiments on the GRAB, OakInk, and DexYCB benchmarks demonstrate that HO-Flow achieves state-of-the-art performance in both physical plausibility and motion diversity for interaction motion synthesis.
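The flow-matching objective in its generic form: regress a vector field v_theta(x_t, t) toward the straight-line velocity between a noise sample x0 and a data latent x1, with x_t = (1 - t) * x0 + t * x1. HO-Flow applies this in a learned hand-object latent space with masking, both omitted here:

```python
import torch
import torch.nn as nn

dim = 64
v_theta = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

x1 = torch.randn(16, dim)                 # latent motion codes (stand-ins)
x0 = torch.randn_like(x1)                 # noise samples
t = torch.rand(16, 1)
x_t = (1 - t) * x0 + t * x1               # linear interpolation path
target_velocity = x1 - x0
pred = v_theta(torch.cat([x_t, t], dim=-1))
loss = ((pred - target_velocity) ** 2).mean()
print(float(loss))
```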
[469] Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation
Zeqian Long, Ozgur Kara, Haotian Xue, Yongxin Chen, James M. Rehg
Main category: cs.CV
TL;DR: Immune2V is a defense framework against image-to-video deepfake generation that prevents unauthorized animation of static images by creating temporally persistent adversarial perturbations that withstand video encoding and text guidance processes.
Details
Motivation: Image-to-video generation enables creation of realistic deepfakes by animating static images, posing significant societal harm. Existing defenses work for static images but fail for I2V generation due to temporal dynamics and text guidance in video models.
Method: Immune2V enforces temporally balanced latent divergence at the encoder level to prevent adversarial signal dilution across frames, and aligns intermediate generative representations with precomputed collapse-inducing trajectories to counteract text-guidance override.
Result: Extensive experiments show Immune2V produces substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget, effectively defending against unauthorized I2V generation.
Conclusion: The proposed Immune2V framework successfully addresses the unique challenges of defending against image-to-video deepfake generation by understanding and counteracting the temporal and guidance mechanisms of modern I2V models.
Abstract: Image-to-video (I2V) generation has the potential for societal harm because it enables the unauthorized animation of static images to create realistic deepfakes. While existing defenses effectively protect against static image manipulation, extending these to I2V generation remains underexplored and non-trivial. In this paper, we systematically analyze why modern I2V models are highly robust against naive image-level adversarial attacks (i.e., immunization). We observe that the video encoding process rapidly dilutes the adversarial noise across future frames, and the continuous text-conditioned guidance actively overrides the intended disruptive effect of the immunization. Building on these findings, we propose the Immune2V framework which enforces temporally balanced latent divergence at the encoder level to prevent signal dilution, and aligns intermediate generative representations with a precomputed collapse-inducing trajectory to counteract the text-guidance override. Extensive experiments demonstrate that Immune2V produces substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.
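A generic sketch of an encoder-level immunization step: PGD-style updates to a bounded perturbation that push the video encoder's latent away from the clean latent. Immune2V's temporally balanced divergence and trajectory-alignment terms are not reproduced here, and the encoder is a stub:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))  # stub encoder
for p in encoder.parameters():
    p.requires_grad_(False)

image = torch.rand(1, 3, 64, 64)
clean_latent = encoder(image).detach()
delta = torch.zeros_like(image, requires_grad=True)
eps, step = 8 / 255, 2 / 255

for _ in range(10):
    divergence = (encoder(image + delta) - clean_latent).pow(2).mean()
    divergence.backward()
    with torch.no_grad():
        delta += step * delta.grad.sign()   # ascend: maximize latent divergence
        delta.clamp_(-eps, eps)             # keep the perturbation imperceptible
        delta.grad.zero_()

print("latent divergence after immunization:", float(divergence))
```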
[470] Retinal Cyst Detection from Optical Coherence Tomography Images
Abhishek Dharmaratnakar, Aadheeshwar Vijayakumar, Suchand Dayanand
Main category: cs.CV
TL;DR: ResNet CNN patchwise classification approach for segmenting retinal cysts in OCT images, achieving over 70% dice coefficient across different image vendors.
Details
Motivation: Retinal cysts are important indicators in ocular diseases, but current automatic segmentation methods have low accuracy (68%) and are sensitive to image quality variations across different OCT vendors.
Method: Uses ResNet CNN with patchwise classification for segmentation, trained on cyst segmentation challenge dataset and tested on data from 4 different OCT vendors with annotations from 2 graders.
Result: Achieved over 70% dice coefficient across all vendors regardless of image quality, outperforming previous state-of-the-art approaches.
Conclusion: The ResNet-based approach provides robust cyst segmentation that works well across different OCT imaging systems and image qualities, addressing limitations of previous methods.
Abstract: Retinal cysts are formed by the leakage and accumulation of fluid in the retina due to incompetence of the retinal vasculature. These cystic spaces are significant in several ocular diseases such as age-related macular degeneration and diabetic macular edema, and optical coherence tomography is one of the predominant techniques for imaging such retinal pathologies. Segmentation and quantification of intraretinal cysts play a vital role in predicting visual acuity; as cystoid macular edema is a growing clinical problem, it must be quantified accurately and treated early, or it may cause further complications later on. Several methods have been proposed in the literature for automatic segmentation of intraretinal cysts, but progress has been limited: the best accuracy achieved so far is 68%, and existing methods depend heavily on image quality, giving very low results on high-noise images such as those from Topcon scanners. This work uses a ResNet CNN (Convolutional Neural Network) to perform segmentation by patchwise classification, training on the cyst segmentation challenge dataset and testing on the challenge test set, annotated by 2 different graders for all 4 vendors. It also compares these methods using the first publicly available cyst segmentation challenge dataset, evaluating them with quantitative measures to assess their robustness against the challenges of intraretinal cyst segmentation. The results improve on previous state-of-the-art approaches, giving more than 70% dice coefficient on all vendors irrespective of image quality.
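Patchwise classification, as used here, is straightforward to sketch: a classifier scores overlapping patches and the scores are stitched into a mask. The model interface, patch size, and stride below are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def patchwise_segment(model, image, patch=33, stride=8):
    """image: (1, H, W) grayscale OCT slice; model maps (N, 1, patch, patch)
    patches to logits of shape (N, 2) (background vs. cyst). Patch votes are
    averaged into a full-resolution probability map."""
    h, w = image.shape[-2:]
    pad = patch // 2
    padded = F.pad(image, (pad, pad, pad, pad), mode="reflect")
    prob = torch.zeros(h, w)
    votes = torch.zeros(h, w)
    model.eval()
    with torch.no_grad():
        for y in range(0, h, stride):
            for x in range(0, w, stride):
                p = padded[..., y:y + patch, x:x + patch].unsqueeze(0)
                cyst = model(p).softmax(-1)[0, 1]      # patch-level cyst probability
                prob[y:y + stride, x:x + stride] += cyst
                votes[y:y + stride, x:x + stride] += 1
    return prob / votes.clamp(min=1)                   # stitched probability map
```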
[471] LRD-Net: A Lightweight Real-Centered Detection Network for Cross-Domain Face Forgery Detection
Xuecen Zhang, Vipin Chaudhary
Main category: cs.CV
TL;DR: LRD-Net is a lightweight face forgery detection framework that combines frequency-guided attention with real-centered learning for improved cross-domain generalization and computational efficiency.
Details
Motivation: Current face forgery detection methods suffer from poor cross-domain generalization when encountering unseen forgery types and have high computational costs that limit deployment on resource-constrained devices.
Method: LRD-Net uses a sequential frequency-guided architecture with a Multi-Scale Wavelet Guidance Module that generates attention signals to condition a MobileNetV3-based spatial backbone. It employs real-centered learning with exponential moving average prototype updates and drift regularization.
Result: LRD-Net achieves state-of-the-art cross-domain detection accuracy on the DiFF benchmark with only 2.63M parameters (9x fewer than conventional approaches), 8x faster training, and nearly 10x faster inference.
Conclusion: Robust cross-domain face forgery detection can be achieved without sacrificing computational efficiency, making LRD-Net suitable for real-time deployment in mobile authentication systems and resource-constrained environments.
Abstract: The rapid advancement of diffusion-based generative models has made face forgery detection a critical challenge in digital forensics. Current detection methods face two fundamental limitations: poor cross-domain generalization when encountering unseen forgery types, and substantial computational overhead that hinders deployment on resource-constrained devices. We propose LRD-Net (Lightweight Real-centered Detection Network), a novel framework that addresses both challenges simultaneously. Unlike existing dual-branch approaches that process spatial and frequency information independently, LRD-Net adopts a sequential frequency-guided architecture where a lightweight Multi-Scale Wavelet Guidance Module generates attention signals that condition a MobileNetV3-based spatial backbone. This design enables effective exploitation of frequency-domain cues while avoiding the redundancy of parallel feature extraction. Furthermore, LRD-Net employs a real-centered learning strategy with exponential moving average prototype updates and drift regularization, anchoring representations around authentic facial images rather than modeling diverse forgery patterns. Extensive experiments on the DiFF benchmark demonstrate that LRD-Net achieves state-of-the-art cross-domain detection accuracy, consistently outperforming existing methods. Critically, LRD-Net accomplishes this with only 2.63M parameters - approximately 9x fewer than conventional approaches - while achieving over 8x faster training and nearly 10x faster inference. These results demonstrate that robust cross-domain face forgery detection can be achieved without sacrificing computational efficiency, making LRD-Net suitable for real-time deployment in mobile authentication systems and resource-constrained environments.
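The real-centered learning component (EMA prototype plus drift regularization) can be written as a small loss module; all hyperparameters below are illustrative, not the paper's values.

```python
import torch

class RealCenteredLoss(torch.nn.Module):
    """Keep an EMA prototype of *real* embeddings, pull reals toward it,
    push fakes beyond a margin, and penalize how fast the prototype drifts.
    Assumes each batch contains both real and fake samples."""
    def __init__(self, dim, momentum=0.999, margin=1.0, drift_w=0.1):
        super().__init__()
        self.register_buffer("proto", torch.zeros(dim))
        self.momentum, self.margin, self.drift_w = momentum, margin, drift_w

    def forward(self, emb, labels):            # emb: (N, D); labels: 1=real, 0=fake
        d = (emb - self.proto).norm(dim=1)     # distance of each sample to prototype
        pull = d[labels == 1].pow(2).mean()                               # reals cluster
        push = (self.margin - d[labels == 0]).clamp(min=0).pow(2).mean()  # fakes repelled
        new_proto = self.momentum * self.proto \
            + (1 - self.momentum) * emb[labels == 1].mean(0)
        drift = (new_proto - self.proto).pow(2).sum()  # drift regularization term
        self.proto = new_proto.detach()                # EMA update (stop-gradient)
        return pull + push + self.drift_w * drift
```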
[472] Product Review Based on Optimized Facial Expression Detection
Vikrant Chaugule, Abhishek D, Aadheeshwar Vijayakumar, Pravin Bhaskar Ramteke, Shashidhar G. Koolagudi
Main category: cs.CV
TL;DR: A method for product review using facial expression recognition of customers in supermarkets to assess brand acceptance, featuring a modified Harris algorithm for faster feature extraction.
Details
Motivation: To develop an automated system for evaluating public acceptance of products based on brand by analyzing customer facial expressions in retail environments, addressing the need for efficient real-time emotion detection in commercial settings.
Method: Uses facial expression recognition through feature point extraction with a modified Harris algorithm that reduces time complexity compared to existing methods. The approach analyzes customer facial expressions in supermarkets/hypermarkets to infer product acceptance.
Result: The modified Harris algorithm proved significantly faster while maintaining nearly the same accuracy for corner point detection needed in facial expression recognition, with reduced time complexity compared to existing algorithms.
Conclusion: The proposed method offers an efficient approach for automated product review through facial expression analysis, with the modified Harris algorithm providing practical speed improvements suitable for real-time retail applications.
Abstract: This paper proposes a method to review public acceptance of products, based on their brand, by analyzing the facial expressions of customers intending to buy the product in a supermarket or hypermarket. In such settings, facial expression recognition plays a significant role in product review. Here, facial expression detection is performed by extracting feature points using a modified Harris algorithm, which reduces the time complexity of the existing Harris feature-extraction algorithm. The time complexity of the proposed algorithm is compared with those of existing algorithms: by reducing the cost of corner-point detection, it proves significantly faster while remaining nearly as accurate for the intended application.
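The abstract does not spell out the modification itself, so for reference the sketch below shows the textbook Harris response that the proposed algorithm accelerates.

```python
import numpy as np
from scipy.ndimage import sobel, gaussian_filter

def harris_response(gray, k=0.04, sigma=1.0):
    """Classic Harris corner measure: R = det(M) - k * trace(M)^2, where M is
    the Gaussian-windowed second-moment matrix of the image gradients."""
    ix = sobel(gray.astype(float), axis=1)   # horizontal gradient
    iy = sobel(gray.astype(float), axis=0)   # vertical gradient
    ixx = gaussian_filter(ix * ix, sigma)    # windowed second moments
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    det = ixx * iyy - ixy ** 2
    trace = ixx + iyy
    return det - k * trace ** 2              # high R => corner

# Feature points are local maxima of R above a threshold, e.g.:
# corners = np.argwhere(R > 0.01 * R.max())
```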
[473] EviRCOD: Evidence-Guided Probabilistic Decoding for Referring Camouflaged Object Detection
Ye Wang, Kai Huang, Sumin Shen, Chenyang Ma
Main category: cs.CV
TL;DR: EviRCOD is a framework for Referring Camouflaged Object Detection that addresses semantic alignment, uncertainty modeling, and boundary preservation through three components: reference-guided encoder, evidential decoder, and boundary refinement module.
Details
Motivation: Existing Ref-COD methods struggle with three key challenges: 1) semantic alignment between reference and target objects, 2) explicit uncertainty modeling for ambiguous camouflaged regions, and 3) robust boundary preservation for camouflaged objects with indistinct edges.
Method: Three-component framework: 1) Reference-Guided Deformable Encoder (RGDE) uses hierarchical reference-driven modulation and multi-scale deformable aggregation for semantic alignment; 2) Uncertainty-Aware Evidential Decoder (UAED) incorporates Dirichlet evidence estimation for uncertainty modeling; 3) Boundary-Aware Refinement Module (BARM) enhances ambiguous boundaries using edge cues and prediction confidence.
Result: EviRCOD achieves state-of-the-art performance on the Ref-COD benchmark and provides well-calibrated uncertainty estimates, demonstrating effectiveness in addressing the three core challenges.
Conclusion: The proposed EviRCOD framework effectively addresses key limitations in Ref-COD through integrated solutions for semantic alignment, uncertainty modeling, and boundary preservation, advancing the field of referring camouflaged object detection.
Abstract: Referring Camouflaged Object Detection (Ref-COD) focuses on segmenting specific camouflaged targets in a query image using category-aligned references. Despite recent advances, existing methods struggle with reference-target semantic alignment, explicit uncertainty modeling, and robust boundary preservation. To address these issues, we propose EviRCOD, an integrated framework consisting of three core components: (1) a Reference-Guided Deformable Encoder (RGDE) that employs hierarchical reference-driven modulation and multi-scale deformable aggregation to inject semantic priors and align cross-scale representations; (2) an Uncertainty-Aware Evidential Decoder (UAED) that incorporates Dirichlet evidence estimation into hierarchical decoding to model uncertainty and propagate confidence across scales; and (3) a Boundary-Aware Refinement Module (BARM) that selectively enhances ambiguous boundaries by exploiting low-level edge cues and prediction confidence. Experiments on the Ref-COD benchmark demonstrate that EviRCOD achieves state-of-the-art detection performance while providing well-calibrated uncertainty estimates. Code is available at: https://github.com/blueecoffee/EviRCOD.
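Dirichlet evidence estimation has a standard closed form; the minimal head below illustrates it (the softplus evidence link and the shapes are conventional choices, not necessarily the paper's).

```python
import torch
import torch.nn.functional as F

def evidential_head(logits):
    """Map K-class dense logits to Dirichlet parameters: the Dirichlet mean
    gives per-pixel class probabilities and K/S gives a closed-form
    uncertainty (vacuity) map."""
    evidence = F.softplus(logits)                 # (N, K, H, W), evidence >= 0
    alpha = evidence + 1                          # Dirichlet concentrations
    strength = alpha.sum(dim=1, keepdim=True)     # S = sum_k alpha_k
    prob = alpha / strength                       # expected class probabilities
    uncertainty = logits.shape[1] / strength      # K / S, in (0, 1]
    return prob, uncertainty

prob, unc = evidential_head(torch.randn(1, 2, 64, 64))  # binary camouflage map
```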
[474] Evaluating the Impact of Medical Image Reconstruction on Downstream AI Fairness and Performance
Matteo Wohlrapp, Niklas Bubeck, Daniel Rueckert, William Lotter
Main category: cs.CV
TL;DR: Medical image reconstruction models evaluated for downstream diagnostic performance and fairness, finding conventional metrics poorly track task performance and reconstruction can amplify demographic biases.
Details
Motivation: Current AI-based medical image reconstruction models are evaluated using pixel-level metrics (PSNR) without assessing their impact on downstream diagnostic performance and fairness, creating a gap in understanding real clinical implications.
Method: Developed a scalable evaluation framework applying reconstruction and diagnostic AI models in tandem across two tasks (classification, segmentation), three reconstruction approaches (U-Net, GAN, diffusion), and two data types (X-ray, MRI).
Result: Conventional reconstruction metrics poorly track task performance; diagnostic accuracy remains stable even as reconstruction PSNR declines with increasing noise. Fairness metrics show greater variability with reconstruction sometimes amplifying demographic biases (particularly patient sex), though overall additional bias is modest compared to inherent diagnostic model biases.
Conclusion: Holistic performance and fairness assessments are crucial throughout medical imaging workflows, especially as generative reconstruction models are increasingly deployed, despite limited efficacy of bias mitigation strategies adapted from classification literature.
Abstract: AI-based image reconstruction models are increasingly deployed in clinical workflows to improve image quality from noisy data, such as low-dose X-rays or accelerated MRI scans. However, these models are typically evaluated using pixel-level metrics like PSNR, leaving their impact on downstream diagnostic performance and fairness unclear. We introduce a scalable evaluation framework that applies reconstruction and diagnostic AI models in tandem, which we apply to two tasks (classification, segmentation), three reconstruction approaches (U-Net, GAN, diffusion), and two data types (X-ray, MRI) to assess the potential downstream implications of reconstruction. We find that conventional reconstruction metrics poorly track task performance, where diagnostic accuracy remains largely stable even as reconstruction PSNR declines with increasing image noise. Fairness metrics exhibit greater variability, with reconstruction sometimes amplifying demographic biases, particularly regarding patient sex. However, the overall magnitude of this additional bias is modest compared to the inherent biases already present in diagnostic models. To explore potential bias mitigation, we adapt two strategies from classification literature to the reconstruction setting, but observe limited efficacy. Overall, our findings emphasize the importance of holistic performance and fairness assessments throughout the entire medical imaging workflow, especially as generative reconstruction models are increasingly deployed.
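The tandem protocol reduces to running the two models in sequence and grouping the task metric by a demographic attribute; a hedged sketch with hypothetical dataset fields:

```python
import torch

def tandem_eval(recon_model, diag_model, loader, group_key="sex"):
    """Per-group downstream accuracy after reconstruction; any gap between
    groups (vs. the gap on ground-truth images) indicates bias introduced or
    amplified by the reconstructor. Batch fields are illustrative."""
    correct, total = {}, {}
    recon_model.eval(); diag_model.eval()
    with torch.no_grad():
        for batch in loader:
            recon = recon_model(batch["noisy"])       # e.g., low-dose X-ray -> clean
            pred = diag_model(recon).argmax(dim=1)    # downstream classification
            for p, y, g in zip(pred, batch["label"], batch[group_key]):
                correct[g] = correct.get(g, 0) + int(p == y)
                total[g] = total.get(g, 0) + 1
    return {g: correct[g] / total[g] for g in total}
```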
[475] STGV: Spatio-Temporal Hash Encoding for Gaussian-based Video Representation
Jierun Lin, Jiacong Chen, Qingyu Mao, Shuai Liu, Xiandong Meng, Fanyang Meng, Yongsheng Liang
Main category: cs.CV
TL;DR: STGV: A spatio-temporal hash encoding framework for Gaussian-based video representation that decomposes features into separate spatial and temporal components for better modeling of static and dynamic elements.
Details
Motivation: Existing 2D Gaussian Splatting methods for video representation use content-agnostic or overlapping embeddings that entangle static and dynamic components, leading to inaccurate deformation predictions and poor representation quality.
Method: Proposes STGV framework that decomposes video features into learnable 2D spatial and 3D temporal hash encodings, plus a key frame canonical initialization strategy for stable Gaussian representation.
Result: Achieves better video representation quality (+0.98 PSNR) compared to other Gaussian-based methods and competitive performance in downstream video tasks.
Conclusion: STGV effectively separates static and dynamic components through spatio-temporal decomposition, enabling better motion pattern learning while maintaining background details.
Abstract: 2D Gaussian Splatting (2DGS) has recently become a promising paradigm for high-quality video representation. However, existing methods employ content-agnostic or spatio-temporally overlapping feature embeddings to predict canonical Gaussian primitive deformations, which entangles static and dynamic components in videos and prevents their distinct properties from being modeled effectively. These limitations result in inaccurate predictions of spatio-temporal deformations and unsatisfactory representation quality. To address these problems, this paper proposes a Spatio-Temporal hash encoding framework for Gaussian-based Video representation (STGV). By decomposing video features into learnable 2D spatial and 3D temporal hash encodings, STGV effectively facilitates the learning of motion patterns for dynamic components while maintaining background details for static elements. In addition, we construct a more stable and consistent initial canonical Gaussian representation through a key-frame canonical initialization strategy, avoiding feature overlapping and structurally incoherent geometry representations. Experimental results demonstrate that our method attains better video representation quality (+0.98 PSNR) than other Gaussian-based methods and achieves competitive performance in downstream video tasks.
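A single-resolution hash lookup conveys the core mechanism; the sketch below is a heavily simplified, Instant-NGP-style stand-in (the paper's multi-scale, interpolated encodings differ).

```python
import torch

class HashEncoding(torch.nn.Module):
    """Nearest-vertex learnable hash encoding: coordinates are snapped to a
    grid, hashed into a table of trainable feature vectors, and looked up."""
    PRIMES = (1, 2654435761, 805459861)          # per-dimension hash primes

    def __init__(self, dims, table_size=2**16, feat_dim=2, res=64):
        super().__init__()
        self.dims, self.res = dims, res
        self.table = torch.nn.Parameter(torch.randn(table_size, feat_dim) * 1e-2)

    def forward(self, coords):                   # coords in [0, 1], shape (N, dims)
        idx = (coords * self.res).long()         # grid vertex per point
        h = torch.zeros(coords.shape[0], dtype=torch.long)
        for d in range(self.dims):
            h ^= idx[:, d] * self.PRIMES[d]      # XOR spatial hash
        return self.table[h % self.table.shape[0]]

spatial = HashEncoding(dims=2)                   # (x, y): static content
temporal = HashEncoding(dims=3)                  # (x, y, t): dynamics
feat = torch.cat([spatial(torch.rand(8, 2)), temporal(torch.rand(8, 3))], dim=-1)
```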
[476] TAMISeg: Text-Aligned Multi-scale Medical Image Segmentation with Semantic Encoder Distillation
Qiang Gao, Yi Wang, Yong Zhang, Yong Li, Yongbing Deng, Lan Du, Cunjian Chen
Main category: cs.CV
TL;DR: TAMISeg is a text-guided medical image segmentation framework that uses clinical language prompts and semantic distillation to reduce reliance on pixel-level annotations and handle complex anatomical structures.
Details
Motivation: Medical image segmentation faces challenges including limited fine-grained annotations, complex anatomical structures, and image degradation from noise, low contrast, or illumination variations. Existing methods struggle with these issues, especially when pixel-level annotations are scarce.
Method: TAMISeg integrates three core components: 1) consistency-aware encoder pretrained with strong perturbations for robust feature extraction, 2) semantic encoder distillation module supervised by frozen DINOv3 teacher for enhanced semantic discriminability, and 3) scale-adaptive decoder for segmenting anatomical structures across different spatial scales.
Result: Experiments on Kvasir-SEG, MosMedData+, and QaTa-COV19 datasets show TAMISeg consistently outperforms existing uni-modal and multi-modal methods in both qualitative and quantitative evaluations.
Conclusion: TAMISeg effectively addresses medical image segmentation challenges by incorporating text guidance and semantic distillation, reducing reliance on pixel-level annotations while improving segmentation performance on complex anatomical structures.
Abstract: Medical image segmentation remains challenging due to limited fine-grained annotations, complex anatomical structures, and image degradation from noise, low contrast, or illumination variation. We propose TAMISeg, a text-guided segmentation framework that incorporates clinical language prompts and semantic distillation as auxiliary semantic cues to enhance visual understanding and reduce reliance on pixel-level fine-grained annotations. TAMISeg integrates three core components: a consistency-aware encoder pretrained with strong perturbations for robust feature extraction, a semantic encoder distillation module with supervision from a frozen DINOv3 teacher to enhance semantic discriminability, and a scale-adaptive decoder that segments anatomical structures across different spatial scales. Experiments on the Kvasir-SEG, MosMedData+, and QaTa-COV19 datasets demonstrate that TAMISeg consistently outperforms existing uni-modal and multi-modal methods in both qualitative and quantitative evaluations. Code will be made publicly available at https://github.com/qczggaoqiang/TAMISeg.
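Feature distillation against a frozen teacher typically reduces to a cosine-matching loss; a minimal sketch, assuming student and teacher features have already been projected to the same dimension:

```python
import torch
import torch.nn.functional as F

def semantic_distill_loss(student_feats, teacher_feats):
    """1 - cosine similarity between l2-normalized per-location features;
    the teacher (a frozen DINOv3 in the paper) provides the targets."""
    s = F.normalize(student_feats.flatten(2), dim=1)            # (N, D, HW)
    t = F.normalize(teacher_feats.flatten(2), dim=1).detach()   # frozen teacher
    return (1 - (s * t).sum(dim=1)).mean()

loss = semantic_distill_loss(torch.randn(2, 256, 16, 16), torch.randn(2, 256, 16, 16))
```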
[477] ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding
Xucheng Wang, Xiaoman Zhang, Sung Eun Kim, Ankit Pal, Pranav Rajpurkar
Main category: cs.CV
TL;DR: ReXSonoVQA is a video question-answering benchmark for evaluating vision-language models’ understanding of dynamic ultrasound procedures, focusing on three key competencies needed for autonomous ultrasound systems.
Details
Motivation: Current vision-language models are evaluated on static images, but ultrasound acquisition requires dynamic procedural understanding involving skilled probe manipulation and real-time adjustments. There's a need for benchmarks that assess models' ability to understand procedural video content for enabling autonomous ultrasound systems.
Method: Created ReXSonoVQA benchmark with 514 video clips and 514 questions (249 multiple-choice, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Evaluated state-of-the-art VLMs (Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, Seed 2.0 Pro) in zero-shot settings.
Result: VLMs can extract some procedural information from ultrasound videos, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. The benchmark reveals current models’ deficiencies in understanding dynamic procedural content.
Conclusion: ReXSonoVQA enables development of perception systems for ultrasound training, guidance, and robotic automation by providing a benchmark for evaluating dynamic procedural understanding in vision-language models, highlighting the need for improved causal reasoning capabilities.
Abstract: Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.
[478] LiveGesture: Streamable Co-Speech Gesture Generation Model
Muhammad Usama Saleem, Mayur Jagdishbhai Patel, Ekkasit Pinyoanuntapong, Zhongxing Qin, Li Yang, Hongfei Xue, Ahmed Helmy, Chen Chen, Pu Wang
Main category: cs.CV
TL;DR: LiveGesture: First fully streamable, speech-driven full-body gesture generation framework with zero look-ahead, using causal region-coordinated motion generation.
Details
Motivation: Existing co-speech gesture methods are designed for offline generation, treat body regions independently or entangle all joints, and lack real-time streaming capabilities with zero look-ahead.
Method: Two main modules: Streamable Vector Quantized Motion Tokenizer (SVQ) for causal discrete motion tokens, and Hierarchical Autoregressive Transformer (HAR) with region-expert transformers and causal spatio-temporal fusion. Uses streamable causal audio encoder and autoregressive masking training for robustness.
Result: Produces coherent, diverse, beat-synchronous full-body gestures in real time, matching or surpassing state-of-the-art offline methods on BEAT2 dataset under true zero look-ahead conditions.
Conclusion: LiveGesture enables real-time, streamable speech-driven gesture generation with region-coordinated motion, addressing limitations of existing offline approaches.
Abstract: We propose LiveGesture, the first fully streamable, speech-driven full-body gesture generation framework that operates with zero look-ahead and supports arbitrary sequence length. Unlike existing co-speech gesture methods, which are designed for offline generation and either treat body regions independently or entangle all joints within a single model, LiveGesture is built from the ground up for causal, region-coordinated motion generation. LiveGesture consists of two main modules: the Streamable Vector Quantized Motion Tokenizer (SVQ) and the Hierarchical Autoregressive Transformer (HAR). The SVQ tokenizer converts the motion sequence of each body region into causal, discrete motion tokens, enabling real-time, streamable token decoding. On top of SVQ, HAR employs region-expert autoregressive (xAR) transformers to model expressive, fine-grained motion dynamics for each body region. A causal spatio-temporal fusion module (xAR Fusion) then captures and integrates correlated motion dynamics across regions. Both xAR and xAR Fusion are conditioned on live, continuously arriving audio signals encoded by a streamable causal audio encoder. To enhance robustness under streaming noise and prediction errors, we introduce autoregressive masking training, which leverages uncertainty-guided token masking and random region masking to expose the model to imperfect, partially erroneous histories during training. Experiments on the BEAT2 dataset demonstrate that LiveGesture produces coherent, diverse, and beat-synchronous full-body gestures in real time, matching or surpassing state-of-the-art offline methods under true zero look-ahead conditions.
[479] AmodalSVG: Amodal Image Vectorization via Semantic Layer Peeling
Juncheng Hu, Ziteng Xue, Guotao Liang, Anran Qi, Buyu Li, Sheng Wang, Dong Xu, Qian Yu
Main category: cs.CV
TL;DR: AmodalSVG is a framework for amodal image vectorization that reconstructs complete object geometries including occluded regions into editable SVG layers, enabling semantic decoupling and object-level editing.
Details
Motivation: Existing vectorization methods only trace visible pixels, resulting in semantically entangled and geometrically incomplete SVGs that lack structural editability. The authors aim to create semantically organized and geometrically complete SVG representations that support object-level editing.
Method: Two-stage framework: 1) Semantic Layer Peeling (SLP) - VLM-guided progressive decomposition of images into semantically coherent layers with hybrid inpainting to recover occluded regions; 2) Adaptive Layered Vectorization (ALV) - error-budget-driven adjustment mechanism for efficient vectorization of complete layers.
Result: Extensive experiments show AmodalSVG significantly outperforms prior methods in visual fidelity and enables object-level editing directly in the vector domain, capabilities not supported by existing approaches.
Conclusion: AmodalSVG successfully produces semantically organized and geometrically complete SVG representations that support structural editability, advancing image vectorization beyond modal tracing to amodal reconstruction.
Abstract: We introduce AmodalSVG, a new framework for amodal image vectorization that produces semantically organized and geometrically complete SVG representations from natural images. Existing vectorization methods operate under a modal paradigm: tracing only visible pixels and disregarding occlusion. Consequently, the resulting SVGs are semantically entangled and geometrically incomplete, limiting SVG’s structural editability. In contrast, AmodalSVG reconstructs full object geometries, including occluded regions, into independent, editable vector layers. To achieve this, AmodalSVG reformulates image vectorization as a two-stage framework, performing semantic decoupling and completion in the raster domain to produce amodally complete semantic layers, which are then independently vectorized. In the first stage, we introduce Semantic Layer Peeling (SLP), a VLM-guided strategy that progressively decomposes an image into semantically coherent layers. By hybrid inpainting, SLP recovers complete object appearances under occlusions, enabling explicit semantic decoupling. To vectorize these layers efficiently, we propose Adaptive Layered Vectorization (ALV), which dynamically modulates the primitive budget via an error-budget-driven adjustment mechanism. Extensive experiments demonstrate that AmodalSVG significantly outperforms prior methods in visual fidelity. Moreover, the resulting amodal layers enable object-level editing directly in the vector domain, capabilities not supported by existing vectorization approaches. Code will be released upon acceptance.
[480] Progressive Deep Learning for Automated Spheno-Occipital Synchondrosis Maturation Assessment
Omid Halimi Milani, Amanda Nikho, Marouane Tliba, Lauren Mills, Emadeldeen Hamdan, Ahmet Enis Cetin, Mohammed H. Elnagar
Main category: cs.CV
TL;DR: A progressive representation-learning framework for spheno-occipital synchondrosis (SOS) maturation staging from CBCT scans, inspired by expert clinical reasoning from coarse anatomy to subtle fusion patterns.
Details
Motivation: SOS maturation assessment from CBCT is crucial for orthodontic/surgical timing but suffers from high inter-observer variability due to subtle, continuously evolving morphological cues, especially at transitional fusion stages.
Method: Progressive representation-learning framework that sequentially grows the model by activating deeper blocks over time, allowing early layers to encode stable cranial base morphology first before higher layers specialize in discriminating adjacent maturation stages.
Result: The expert-inspired training strategy produces more stable optimization and consistently higher accuracy than standard training, particularly for ambiguous intermediate stages, without changing network architectures or loss functions.
Conclusion: The framework establishes a principled link between expert dental intuition and deep visual representations, enabling robust SOS staging and offering a general strategy for modeling continuous biological processes in medical imaging.
Abstract: Accurate assessment of spheno-occipital synchondrosis (SOS) maturation is a key indicator of craniofacial growth and a critical determinant for orthodontic and surgical timing. However, SOS staging from cone-beam CT (CBCT) relies on subtle, continuously evolving morphological cues, leading to high inter-observer variability and poor reproducibility, especially at transitional fusion stages. We frame SOS assessment as a fine-grained visual recognition problem and propose a progressive representation-learning framework that explicitly mirrors how expert clinicians reason about synchondral fusion: from coarse anatomical structure to increasingly subtle patterns of closure. Rather than training a full-capacity network end-to-end, we sequentially grow the model by activating deeper blocks over time, allowing early layers to first encode stable cranial base morphology before higher-level layers specialize in discriminating adjacent maturation stages. This yields a curriculum over network depth that aligns deep feature learning with the biological continuum of SOS fusion. Extensive experiments across convolutional and transformer-based architectures show that this expert-inspired training strategy produces more stable optimization and consistently higher accuracy than standard training, particularly for ambiguous intermediate stages. Importantly, these gains are achieved without changing network architectures or loss functions, demonstrating that training dynamics alone can substantially improve anatomical representation learning. The proposed framework establishes a principled link between expert dental intuition and deep visual representations, enabling robust, data-efficient SOS staging from CBCT and offering a general strategy for modeling other continuous biological processes in medical imaging.
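Growing depth over training can be implemented by gating how many blocks join the forward pass; the architecture, class count, and schedule below are our own illustration of the idea.

```python
import torch

class ProgressiveNet(torch.nn.Module):
    """Shallow early training fits coarse cranial-base structure; later,
    grow() activates deeper blocks to specialize on subtle fusion cues."""
    def __init__(self, width=64, depth=4, num_classes=6):   # class count illustrative
        super().__init__()
        self.stem = torch.nn.Conv2d(1, width, 3, padding=1)
        self.blocks = torch.nn.ModuleList(
            torch.nn.Conv2d(width, width, 3, padding=1) for _ in range(depth))
        self.head = torch.nn.Linear(width, num_classes)
        self.active = 1                                      # blocks in use

    def grow(self):
        self.active = min(self.active + 1, len(self.blocks))

    def forward(self, x):
        x = self.stem(x).relu()
        for block in self.blocks[:self.active]:       # inactive blocks are skipped
            x = x + block(x).relu()                   # residual keeps growth stable
        return self.head(x.mean(dim=(2, 3)))          # global pool -> stage logits

# Schedule sketch: call net.grow() every k epochs until all blocks are active.
```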
[481] Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
Songlin Yang, Xianghao Kong, Anyi Rao
Main category: cs.CV
TL;DR: The paper diagnoses “pseudo-unification” in multimodal models where they fail to transfer LLM reasoning to image generation, proposing an information-theoretic probing framework that reveals modality-asymmetric encoding and pattern-split response as root causes.
Details
Motivation: Current unified multimodal models (UMMs) designed to combine LLM reasoning with vision generation capabilities fail to achieve true synergy - they exhibit "pseudo-unification" where reasoning abilities don't transfer to image synthesis and response behaviors diverge. Existing probing methods lack model-internal insight or ignore prompt-response dependencies.
Method: Proposes an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, the framework examines entropy trajectories and information flow patterns across modalities.
Result: Reveals pseudo-unification stems from dual divergence: (1) Modality-Asymmetric Encoding where vision and language follow different entropy trajectories, and (2) Pattern-Split Response where text generation shows high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models with unified information flow achieve genuine unification.
Conclusion: Real multimodal synergy requires consistency in information flow, not just shared parameters. Models that unify both encoding and response patterns (e.g., via contextual prediction) achieve stronger reasoning-based text-to-image generation even with fewer parameters.
Abstract: Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision models. In practice, however, this synergy remains elusive: UMMs fail to transfer LLM-like reasoning to image synthesis and exhibit divergent response behaviors. We term this phenomenon pseudo-unification. Diagnosing its internal causes is important, but existing probing methods either lack model-internal insight or ignore prompt-response dependencies. To address these limitations, we propose an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, our framework reveals that pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models that unify both sides (e.g., via contextual prediction) achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters. Our work provides the first model-internal probing of unification, demonstrating that real multimodal synergy requires consistency in information flow, not just shared parameters.
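The entropy probe itself is a one-liner over decoding logits; the toy comparison below mimics the reported split by using sharper (lower-entropy) image-token logits as a stand-in.

```python
import torch

def entropy_trajectory(logits):
    """Per-step Shannon entropy of the predictive distribution; comparing the
    text-head and image-head trajectories exposes the creativity/fidelity split."""
    probs = logits.softmax(dim=-1)                                # (steps, vocab)
    return -(probs * probs.clamp(min=1e-12).log()).sum(dim=-1)    # (steps,)

text_logits = torch.randn(20, 32000)        # stand-in: 20 text decoding steps
image_logits = torch.randn(20, 8192) * 4    # stand-in: sharper image-token logits
print(entropy_trajectory(text_logits).mean(), entropy_trajectory(image_logits).mean())
```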
[482] Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation
Jihun Kim, Hoyong Kwon, Hyeokjun Kweon, Kuk-Jin Yoon
Main category: cs.CV
TL;DR: DiTTA converts image segmentation models into temporally-aware video segmentation models via test-time adaptation using SAM2 distillation, achieving competitive performance without video annotations.
Details
Motivation: Video semantic segmentation requires expensive dense annotations, while frame-by-frame image segmentation models ignore temporal coherence. Foundation models like SAM2 provide temporal mask propagation but lack semantic understanding and are computationally heavy.
Method: DiTTA distills SAM2’s temporal segmentation knowledge into ISS models during a brief initialization phase, complemented by a lightweight temporal fusion module to aggregate cross-frame context. It uses test-time adaptation without annotated videos.
Result: Extensive experiments on VSPW and Cityscapes show DiTTA achieves competitive or superior performance relative to fully-supervised VSS methods, even when adapting with limited partial video snippets (e.g., initial 10%).
Conclusion: DiTTA provides a practical, annotation-free solution for real-world video semantic segmentation by efficiently converting image segmentation models into temporally-aware video models through test-time adaptation.
Abstract: Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, limiting practical applicability. Alternatively, applying pre-trained Image Semantic Segmentation (ISS) models frame-by-frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high-quality mask propagation yet remain impractical for direct VSS due to limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation-assisted Test-Time Adaptation), a novel framework that converts an ISS model into a temporally-aware VSS model through efficient test-time adaptation (TTA), without annotated videos. DiTTA distills SAM2’s temporal segmentation knowledge into the ISS model during a brief, single-pass initialization phase, complemented by a lightweight temporal fusion module to aggregate cross-frame context. Crucially, DiTTA achieves robust generalization even when adapting with highly limited partial video snippets (e.g., initial 10%), significantly outperforming zero-shot refinement approaches that repeatedly invoke SAM2 during inference. Extensive experiments on VSPW and Cityscapes demonstrate DiTTA’s effectiveness, achieving competitive or superior performance relative to fully-supervised VSS methods, thus providing a practical and annotation-free solution for real-world VSS tasks.
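The adaptation step can be sketched as fitting the student to teacher-propagated masks on a short snippet; the loss choice and shapes below are assumptions, not DiTTA's exact recipe.

```python
import torch
import torch.nn.functional as F

def tta_distill_step(student, optimizer, frames, teacher_masks):
    """Briefly fine-tune the image-segmentation student so its per-frame
    predictions match soft masks propagated by a temporal teacher (SAM2 in
    the paper); no ground-truth labels are used."""
    student.train()
    total = 0.0
    for frame, mask in zip(frames, teacher_masks):   # mask: (C, H, W) soft labels
        logits = student(frame.unsqueeze(0))[0]
        loss = F.kl_div(logits.log_softmax(0), mask, reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(frames)
```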
[483] FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
Haohang Xu, Lin Liu, Zhibo Zhang, Rong Cong, Xiaopeng Zhang, Qi Tian
Main category: cs.CV
TL;DR: FineEdit: A diffusion-based image editing model using bounding boxes for precise localization and background preservation, with a new dataset and benchmark.
Details
Motivation: Existing diffusion-based image editing models rely on natural language prompts that lack precision for localizing target objects, leading to background inconsistency. Visual cues like bounding boxes provide more intuitive and precise guidance for users to highlight specific areas of interest.
Method: Proposes FineEdit with multi-level bounding box injection to effectively utilize spatial conditions. Creates FineEdit-1.2M dataset with 1.2M image editing pairs and precise bounding box annotations, and FineEdit-Bench benchmark with 1,000 images across 10 subjects for evaluation.
Result: Outperforms state-of-the-art open-source models (Qwen-Image-Edit, LongCat-Image-Edit) in instruction compliance and background preservation on FineEdit-Bench. Shows superior generalization and robustness on open benchmarks (GEdit and ImgEdit Bench).
Conclusion: Bounding box guidance enables precise localization and background preservation in diffusion-based image editing, with the proposed FineEdit model and datasets advancing region-based editing capabilities.
Abstract: Diffusion-based image editing models have achieved significant progress in real world applications. However, conventional models typically rely on natural language prompts, which often lack the precision required to localize target objects. Consequently, these models struggle to maintain background consistency due to their global image regeneration paradigm. Recognizing that visual cues provide an intuitive means for users to highlight specific areas of interest, we utilize bounding boxes as guidance to explicitly define the editing target. This approach ensures that the diffusion model can accurately localize the target while preserving background consistency. To achieve this, we propose FineEdit, a multi-level bounding box injection method that enables the model to utilize spatial conditions more effectively. To support this high precision guidance, we present FineEdit-1.2M, a large scale, fine-grained dataset comprising 1.2 million image editing pairs with precise bounding box annotations. Furthermore, we construct a comprehensive benchmark, termed FineEdit-Bench, which includes 1,000 images across 10 subjects to effectively evaluate region based editing capabilities. Evaluations on FineEdit-Bench demonstrate that our model significantly outperforms state-of-the-art open-source models (e.g., Qwen-Image-Edit and LongCat-Image-Edit) in instruction compliance and background preservation. Further assessments on open benchmarks (GEdit and ImgEdit Bench) confirm its superior generalization and robustness.
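Bounding-box conditioning is commonly realized by rasterizing the box into an extra input channel; a single-level sketch of that general idea (the paper injects at multiple levels):

```python
import torch

def inject_bbox(latent, box):
    """Rasterize a normalized (x0, y0, x1, y1) box into a binary mask at the
    latent resolution and concatenate it as a conditioning channel."""
    n, _, h, w = latent.shape
    x0, y0, x1, y1 = box
    mask = torch.zeros(n, 1, h, w)
    mask[..., int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return torch.cat([latent, mask], dim=1)          # (N, C+1, H, W)

z = inject_bbox(torch.randn(1, 4, 64, 64), (0.25, 0.25, 0.75, 0.75))
```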
[484] You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass
Yinuo Yang, Zixian Ma, Manasi Ganti, Jieyu Zhang, Ranjay Krishna
Main category: cs.CV
TL;DR: A discriminative multimodal reward model that scores multiple candidate responses in a single forward pass using concatenation with separator tokens, achieving N× speedup and state-of-the-art performance on multimodal reward benchmarks.
Details
Motivation: Conventional discriminative reward models evaluate each response independently, requiring multiple forward passes which is inefficient. The authors aim to develop a more efficient approach that enables direct comparative reasoning and N-way preference learning in multimodal contexts.
Method: Concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling single-forward-pass evaluation. Built on a 4B vision-language backbone with LoRA fine-tuning and lightweight MLP value head. Introduces two new benchmarks: MR²Bench-Image (human-annotated rankings over 8 models) and MR²Bench-Video (94K crowdsourced judgments over 19 video QA models).
Result: Achieves state-of-the-art results on six multimodal reward benchmarks, including the new MR²Bench-Image and MR²Bench-Video. Outperforms existing larger generative and discriminative reward models. When used in RL with GRPO, produces improved policy models that maintain performance across standard benchmarks while substantially improving open-ended generation quality.
Conclusion: The multi-response reward model enables efficient N-way preference learning with significant computational benefits (up to N× speedup) while achieving superior performance on multimodal reward evaluation tasks and improving policy training for open-ended generation.
Abstract: We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring multiple forward passes, one for each potential response. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient $N$-way preference learning. The multi-response design also yields up to $N\times$ wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable $N$-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR$^2$Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR$^2$Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question-answering spanning 19 models, denoised via preference graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks, including MR$^2$Bench-Image, MR$^2$Bench-Video, and four other existing benchmarks. Our model outperforms existing larger generative and discriminative reward models. We further demonstrate that our reward model, when used in reinforcement learning with GRPO, produces improved policy models that maintain performance across standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative reward model (RM) baseline by a large margin in both training stability and open-ended generation quality.
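The single-pass N-way scoring is concrete enough to sketch; the token layout, separator id, and backbone interface below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def multi_response_loss(backbone, value_head, prompt_ids, responses, best):
    """Concatenate N responses behind the prompt with a separator token, read
    one scalar score per separator position from a single forward pass, and
    apply cross-entropy so the preferred response (index `best`) scores highest."""
    sep = torch.tensor([2])                              # hypothetical separator id
    seq, sep_pos, cur = [prompt_ids], [], len(prompt_ids)
    for r in responses:                                  # list of N 1-D token tensors
        seq += [r, sep]
        cur += len(r) + 1
        sep_pos.append(cur - 1)                          # this response's separator
    hidden = backbone(torch.cat(seq).unsqueeze(0))       # (1, L, D): ONE forward pass
    scores = value_head(hidden[0, sep_pos]).squeeze(-1)  # (N,) scalar per response
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([best]))

# At inference, argmax over `scores` ranks all N candidates from the same pass.
```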
[485] Panoptic Pairwise Distortion Graph
Muhammad Kamran Janjua, Abdul Wahab, Bahador Rashidi
Main category: cs.CV
TL;DR: The paper introduces Distortion Graphs (DG) for structured region-level comparison of image pairs, proposing a new dataset, benchmark, and model for fine-grained image quality assessment.
Details
Motivation: Existing image assessment methods focus on whole-image analysis while implicitly relying on region-level understanding. The authors aim to create a structured approach for comparing image pairs at the region level to better understand degradations.
Method: Extends scene graphs to inter-image comparison, creating Distortion Graphs that represent region-level degradation information. Introduces PandaSet (region-level dataset), PandaBench (benchmark suite), and Panda architecture for generating distortion graphs.
Result: Shows that current MLLMs fail at region-level degradation understanding even with explicit region cues. Training on PandaSet or prompting with DG enables region-wise distortion understanding, creating a new direction for fine-grained image assessment.
Conclusion: Distortion Graphs provide a compact, interpretable structure for representing dense degradation information in image pairs, enabling new capabilities in fine-grained, structured pairwise image assessment.
Abstract: In this work, we introduce a new perspective on comparative image assessment by representing an image pair as a structured composition of its regions. In contrast, existing methods focus on whole image analysis, while implicitly relying on region-level understanding. We extend the intra-image notion of a scene graph to inter-image, and propose a novel task of Distortion Graph (DG). DG treats paired images as a structured topology grounded in regions, and represents dense degradation information such as distortion type, severity, comparison and quality score in a compact interpretable graph structure. To realize the task of learning a distortion graph, we contribute (i) a region-level dataset, PandaSet, (ii) a benchmark suite, PandaBench, with varying region-level difficulty, and (iii) an efficient architecture, Panda, to generate distortion graphs. We demonstrate that PandaBench poses a significant challenge for state-of-the-art multimodal large language models (MLLMs) as they fail to understand region-level degradations even when fed with explicit region cues. We show that training on PandaSet or prompting with DG elicits region-wise distortion understanding, opening a new direction for fine-grained, structured pairwise image assessment.
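As a data structure, a distortion graph might look like the sketch below; the field names are our own illustration, not the paper's schema.

```python
from dataclasses import dataclass, field

@dataclass
class DistortionEdge:
    """One region-level comparison between the two images of a pair."""
    region: tuple          # (x, y, w, h) box shared across the pair
    distortion: str        # e.g. "blur", "jpeg", "noise"
    severity: int          # e.g. 1 (mild) .. 5 (severe)
    comparison: str        # which image is better here: "A" | "B" | "tie"
    quality_score: float   # region-level quality estimate

@dataclass
class DistortionGraph:
    """An image pair as a set of region-grounded, degradation-annotated edges."""
    image_a: str
    image_b: str
    edges: list = field(default_factory=list)

dg = DistortionGraph("ref.png", "distorted.png")
dg.edges.append(DistortionEdge((10, 10, 64, 64), "blur", 3, "A", 0.42))
```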
[486] Towards Automated Solar Panel Integrity: Hybrid Deep Feature Extraction for Advanced Surface Defect Identification
Muhammad Junaid Asif, Muhammad Saad Rafaqat, Usman Nazakat, Uzair Khan, Rana Fayyaz Ahmad
Main category: cs.CV
TL;DR: A hybrid method combining handcrafted features (LBP, HoG, Gabor) with DenseNet-169 deep features for automated solar panel defect detection, achieving 99.17% accuracy with SVM classifier.
Details
Motivation: Manual monitoring of solar panels is labor-intensive, time-consuming, costly, and error-prone, especially for large-scale or remote installations. An automated intelligent defect detection system is needed for continuous monitoring, early fault detection, and maximum power generation.
Method: Proposed hybrid method combining handcrafted features (Local Binary Pattern, Histogram of Gradients, Gabor Filters) with deep features from DenseNet-169. Features are concatenated and fed to three classifiers: SVM, XGBoost, and LightGBM.
Result: DenseNet-169 + Gabor (SVM) achieved highest accuracy of 99.17% on augmented dataset, outperforming other systems. The hybrid framework offers better defect-detection accuracy, resistance, and flexibility for real-life PV panel monitoring.
Conclusion: The proposed hybrid framework provides an effective automated solution for solar panel defect detection, combining traditional computer vision techniques with deep learning for superior performance in real-world monitoring applications.
Abstract: To ensure energy efficiency and reliable operation, solar panels in generation plants must be monitored to detect defects. Manually monitoring large-scale solar plants, and those installed in remote areas, is labor-intensive, time-consuming, and costly, and manual inspection is also susceptible to human error. Consequently, an automated, intelligent defect-detection system is needed to ensure continuous monitoring, early fault detection, and maximum power generation. We propose a novel hybrid method for defect detection in solar panels that combines handcrafted and deep learning features. Local Binary Patterns (LBP), Histograms of Gradients (HoG), and Gabor filters were used to extract handcrafted features, while deep features were extracted with DenseNet-169. The handcrafted and deep features were concatenated and fed to three distinct classifiers: Support Vector Machines (SVM), Extreme Gradient Boosting (XGBoost), and Light Gradient-Boosting Machine (LGBM). Experimental results on the augmented dataset show superior performance; in particular, DenseNet-169 + Gabor with an SVM classifier achieved the highest score, 99.17% accuracy, exceeding all other systems. Overall, the proposed hybrid framework offers the defect-detection accuracy, robustness, and flexibility needed for real-life automated PV panel monitoring.
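The handcrafted side of the pipeline maps directly onto standard library calls; a hedged sketch with illustrative parameters (the DenseNet-169 and SVM steps are indicated in comments):

```python
import numpy as np
from skimage.feature import local_binary_pattern, hog
from skimage.filters import gabor
from sklearn.svm import SVC  # final classifier, fit on concatenated features

def handcrafted_features(gray):
    """LBP + HoG + Gabor descriptors for one grayscale panel image."""
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(16, 16))
    gabor_real, _ = gabor(gray, frequency=0.2)
    return np.concatenate([lbp_hist, hog_vec, [gabor_real.mean(), gabor_real.std()]])

# Deep features from a frozen DenseNet-169 (torchvision), e.g.:
#   feats = torchvision.models.densenet169(weights="DEFAULT").features(x)
#   deep = feats.mean(dim=(2, 3))        # global-average-pooled features
# Then: X = np.hstack([handcrafted, deep]); SVC().fit(X_train, y_train)
```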
[487] Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net
Shimon Murai, Teppei Kurita, Ryuta Satoh, Yusuke Moriuchi
Main category: cs.CV
TL;DR: A lightweight two-stage framework for low-light image enhancement using frozen algorithm-based preprocessing and a compact U-Net with depthwise-separable convolutions, achieving competitive quality with fewer parameters.
Details
Motivation: To develop an efficient low-light image enhancement method that achieves competitive perceptual quality while significantly reducing computational complexity and parameter count compared to existing approaches.
Method: Two-stage framework: 1) Frozen algorithm-based preprocessing that normalizes input distribution by providing complementary brightness-corrected views, 2) Compact U-Net built entirely from depthwise-separable convolutions that focuses on residual color correction.
Result: Achieved 4th place in CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge, demonstrating competitive perceptual quality with significantly fewer parameters than existing methods.
Conclusion: The proposed lightweight framework effectively balances efficiency and quality for low-light image enhancement, with extended benchmarks and ablations confirming its general effectiveness.
Abstract: We present a lightweight two-stage framework for low-light image enhancement (LLIE) that achieves competitive perceptual quality with significantly fewer parameters than existing methods. Our approach combines frozen algorithm-based preprocessing with a compact U-Net built entirely from depthwise-separable convolutions. The preprocessing normalizes the input distribution by providing complementary brightness-corrected views, enabling the trainable network to focus on residual color correction. Our method achieved 4th place in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge. We further provide extended benchmarks and ablations to demonstrate the general effectiveness of our methods.
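The depthwise-separable building block is standard; a minimal sketch with the parameter arithmetic spelled out:

```python
import torch

class DSConv(torch.nn.Module):
    """Per-channel 3x3 spatial conv followed by a 1x1 pointwise conv that
    mixes channels, replacing a dense 3x3 conv at a fraction of the cost."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = torch.nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = torch.nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x)).relu()

# Weight count for 64->64 channels (ignoring biases):
#   dense 3x3:  64 * 64 * 9 = 36,864
#   separable:  64 * 9 + 64 * 64 = 4,672   (~7.9x fewer)
```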
[488] Using Deep Learning Models Pretrained by Self-Supervised Learning for Protein Localization
Ben Isselmann, Dilara Göksu, Heinz Neumann, Andreas Weinmann
Main category: cs.CV
TL;DR: DINO-based ViT models pretrained on domain-specific microscopy data (HPA FOV) outperform ImageNet-pretrained models for microscopy classification tasks, achieving strong zero-shot performance and further improvements with fine-tuning on small datasets.
Details
Motivation: Microscopy datasets are often small, making it hard to train robust deep learning models. While SSL helps via pretraining on large datasets, generalizability across different staining protocols and channel configurations in microscopy remains underexplored.
Method: Investigated generalizability of SSL models (DINO-based ViT backbones) pretrained on ImageNet-1k and HPA FOV datasets. Evaluated embeddings on OpenCell dataset with/without fine-tuning, tested two channel-mismatch strategies, varied fine-tuning data fractions, and analyzed single-cell embeddings on labeled OpenCell subset.
Result: HPA FOV-pretrained model achieved highest zero-shot performance (macro F1 0.822 ± 0.007). Fine-tuning further improved to 0.860 ± 0.013. At single-cell level, HPA single-cell-pretrained model achieved highest k-nearest neighbor performance across all neighborhood sizes (macro F1 ≥ 0.796).
Conclusion: SSL methods like DINO, when pretrained on large domain-relevant datasets, enable effective use of deep learning features for fine-tuning on small, task-specific microscopy datasets.
Abstract: Background: Task-specific microscopy datasets are often small, making it difficult to train deep learning models that learn robust features. While self-supervised learning (SSL) has shown promise through pretraining on large, domain-specific datasets, generalizability across datasets with differing staining protocols and channel configurations remains underexplored. We investigated the generalizability of SSL models pretrained on ImageNet-1k and HPA FOV, evaluating their embeddings on OpenCell with and without fine-tuning, two channel-mismatch strategies, and varying fine-tuning data fractions. We additionally analyzed single-cell embeddings on a labeled OpenCell subset. Result: DINO-based ViT backbones pretrained on HPA FOV or ImageNet-1k transfer well to OpenCell even without fine-tuning. The HPA FOV-pretrained model achieved the highest zero-shot performance (macro $F_1$ 0.822 $\pm$ 0.007). Fine-tuning further improved performance to 0.860 $\pm$ 0.013. At the single-cell level, the HPA single-cell-pretrained model achieved the highest k-nearest neighbor performance across all neighborhood sizes (macro $F_1$ $\geq$ 0.796). Conclusion: SSL methods like DINO, pretrained on large domain-relevant datasets, enable effective use of deep learning features for fine-tuning on small, task-specific microscopy datasets.
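The k-NN probe on frozen embeddings is a few lines; the sketch below substitutes toy random data for real DINO features, and k is swept over neighborhood sizes as in the study.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def knn_probe(train_emb, train_y, test_emb, test_y, k=20):
    """Zero-fine-tuning evaluation: classify by neighborhood vote in the
    frozen embedding space, scored with macro F1 as in the study above."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(train_emb, train_y)
    return f1_score(test_y, clf.predict(test_emb), average="macro")

rng = np.random.default_rng(0)                    # toy stand-in embeddings
emb = rng.normal(size=(200, 384)); y = rng.integers(0, 5, 200)
print(knn_probe(emb[:150], y[:150], emb[150:], y[150:]))
```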
[489] Efficient Transceiver Design for Aerial Image Transmission and Large-scale Scene Reconstruction
Zeyi Ren, Jialin Dong, Wei Zuo, Yikun Wang, Bingyang Cheng, Sheng Zhou, Zhisheng Niu
Main category: cs.CV
TL;DR: E2E transceiver design integrating 3D Gaussian Splatting for efficient wireless image transmission in 3D scene reconstruction for low-altitude networks
Details
Motivation: Existing schemes struggle to balance pilot overhead with the transmission accuracy needed for 3D scene reconstruction fidelity in low-altitude intelligent networks.
Method: Deep learning-based end-to-end transceiver design that integrates 3D Gaussian Splatting directly into training, jointly optimizing communication modules via 3DGS rendering loss, enabling a sparse pilot scheme.
Result: Significantly outperforms existing baselines, delivering superior transmission performance and accurate 3D scene reconstructions on real-world aerial image datasets
Conclusion: Proposed E2E design effectively balances efficiency and reliability for 3D scene reconstruction in low-altitude networks by integrating 3DGS into communication optimization
Abstract: Large-scale three-dimensional (3D) scene reconstruction in low-altitude intelligent networks (LAIN) demands highly efficient wireless image transmission. However, existing schemes struggle to balance severe pilot overhead with the transmission accuracy required to maintain reconstruction fidelity. To strike a balance between efficiency and reliability, this paper proposes a novel deep learning-based end-to-end (E2E) transceiver design that integrates 3D Gaussian Splatting (3DGS) directly into the training process. By jointly optimizing the communication modules via the combined 3DGS rendering loss, our approach explicitly improves scene recovery quality. Furthermore, this task-driven framework enables the use of a sparse pilot scheme, significantly reducing transmission overhead while maintaining robust image recovery under low-altitude channel conditions. Extensive experiments on real-world aerial image datasets demonstrate that the proposed E2E design significantly outperforms existing baselines, delivering superior transmission performance and accurate 3D scene reconstructions.
[490] MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
Xincheng Yao, Zefeng Qian, Chao Shi, Jiayang Song, Chongyang Zhang
Main category: cs.CV
TL;DR: MMR-AD is a comprehensive benchmark for training and evaluating Multimodal Large Language Models (MLLMs) for general anomaly detection, revealing current MLLMs fall short of industrial requirements, with a proposed baseline model Anomaly-R1 showing significant improvements.
Details
Motivation: General anomaly detection (GAD) aims to detect anomalies in diverse novel classes without retraining, but current MLLMs have limitations due to pretraining data gaps with AD scenarios and lack of suitable AD datasets for MLLM post-training.
Method: Proposes MMR-AD benchmark for MLLM-based AD research and Anomaly-R1 baseline model - a reasoning-based AD model that learns from Chain-of-Thought data in MMR-AD and is enhanced by reinforcement learning.
Result: Current state-of-the-art generalist MLLMs perform poorly on industrial AD requirements, while Anomaly-R1 achieves remarkable improvements over generalist MLLMs in both anomaly detection and localization.
Conclusion: MMR-AD enables MLLM-based general AD research, revealing current MLLM limitations and demonstrating that specialized training (like Anomaly-R1) can significantly improve AD performance for industrial applications.
Abstract: In the progress of industrial anomaly detection, general anomaly detection (GAD) is an emerging trend and also the ultimate goal. Unlike the conventional single- and multi-class AD, general AD aims to train a general AD model that can directly detect anomalies in diverse novel classes without any retraining or fine-tuning on the target data. Recently, Multimodal Large Language Models (MLLMs) have shown great promise in achieving general anomaly detection due to their revolutionary visual understanding and language reasoning capabilities. However, MLLM’s general AD ability remains underexplored due to: (1) MLLMs are pretrained on large amounts of data sourced from the Web, and these data still have significant gaps with the data in AD scenarios. Moreover, the image-text pairs during pretraining are also not specifically for AD tasks. (2) The current mainstream AD datasets are image-based and not yet suitable for post-training MLLMs. To facilitate MLLM-based general AD research, we present MMR-AD, which is a comprehensive benchmark for both training and evaluating MLLM-based AD models. With MMR-AD, we reveal that the AD performance of current SOTA generalist MLLMs still falls far behind the industrial requirements. Based on MMR-AD, we also propose a baseline model, Anomaly-R1, which is a reasoning-based AD model that learns from the CoT data in MMR-AD and is further enhanced by reinforcement learning. Extensive experiments show that our Anomaly-R1 achieves remarkable improvements over generalist MLLMs in both anomaly detection and localization.
[491] Energy-oriented Diffusion Bridge for Image Restoration with Foundational Diffusion Models
Jinhui Hou, Zhiyu Zhu, Junhui Hou
Main category: cs.CV
TL;DR: E-Bridge proposes an energy-oriented diffusion bridge framework for efficient image restoration using low-cost manifold geodesic trajectories, enabling single-step or few-step high-quality recovery with tunable trajectory length for different degradation tasks.
Details
Motivation: Existing diffusion bridge models for image restoration rely on complex, high-cost trajectories that limit sampling efficiency and restoration quality. The authors aim to develop a more efficient framework that reduces trajectory energy while maintaining performance.
Method: Proposes Energy-oriented diffusion Bridge (E-Bridge) with: 1) Novel bridge process over shorter time horizon starting from entropy-regularized point mixing degraded image and Gaussian noise, 2) Single-step mapping function inspired by consistency models optimized via continuous-time consistency objective, 3) Tunable trajectory length as task-adaptive knob for balancing information preservation vs generative power.
Result: Achieves state-of-the-art performance across various image restoration tasks (denoising, super-resolution) while enabling high-quality recovery with single or fewer sampling steps.
Conclusion: E-Bridge provides an efficient diffusion bridge framework that reduces trajectory energy, enables fast sampling, and adapts to different degradation levels through tunable trajectory length, advancing practical image restoration applications.
Abstract: Diffusion bridge models have shown great promise in image restoration by explicitly connecting clean and degraded image distributions. However, they often rely on complex and high-cost trajectories, which limit both sampling efficiency and final restoration quality. To address this, we propose an Energy-oriented diffusion Bridge (E-Bridge) framework to approximate a set of low-cost manifold geodesic trajectories, boosting restoration performance. We achieve this by designing a novel bridge process that evolves over a shorter time horizon and makes the reverse process start from an entropy-regularized point that mixes the degraded image and Gaussian noise, which theoretically reduces the required trajectory energy. To solve this process efficiently, we draw inspiration from consistency models to learn a single-step mapping function, optimized via a continuous-time consistency objective tailored for our trajectory, so as to analytically map any state on the trajectory to the target image. Notably, the trajectory length in our framework becomes a tunable task-adaptive knob, allowing the model to adaptively balance information preservation against generative power for tasks of varying degradation, such as denoising versus super-resolution. Extensive experiments demonstrate that our E-Bridge achieves state-of-the-art performance across various image restoration tasks while enabling high-quality recovery with only one or a few sampling steps. Our project page is https://jinnh.github.io/E-Bridge/.
[492] ArtiCAD: Articulated CAD Assembly Design via Multi-Agent Code Generation
Yuan Shui, Yandong Guan, Zhanwei Zhang, Juncheng Hu, Jing Zhang, Dong Xu, Qian Yu
Main category: cs.CV
TL;DR: ArtiCAD is a training-free multi-agent system that generates editable, articulated CAD assemblies from text or images using specialized agents for design, generation, assembly, and review with connector-based relationship prediction.
Details
Motivation: Parametric CAD for articulated assemblies is crucial for product development, but generating multi-part, movable models from high-level descriptions remains unexplored. Current approaches lack the ability to create editable, articulated CAD assemblies directly from text or images.
Method: Four specialized agents (Design, Generation, Assembly, Review) with key insight to predict assembly relationships during design stage using a Connector that defines attachment points and joint parameters. Includes validation steps, cross-stage rollback mechanism, and self-evolving experience store for continuous improvement.
Result: Extensive evaluations on three datasets (ArtiCAD-Bench, CADPrompt, ACD) validate effectiveness. Demonstrates applicability in requirement-driven conceptual design, physical prototyping, and generation of embodied AI training assets through URDF export.
Conclusion: ArtiCAD successfully addresses the unexplored problem of generating editable, articulated CAD assemblies from high-level descriptions, offering a robust multi-agent approach with continuous learning capabilities.
Abstract: Parametric Computer-Aided Design (CAD) of articulated assemblies is essential for product development, yet generating these multi-part, movable models from high-level descriptions remains unexplored. To address this, we propose ArtiCAD, the first training-free multi-agent system capable of generating editable, articulated CAD assemblies directly from text or images. Our system divides this complex task among four specialized agents: Design, Generation, Assembly, and Review. One of our key insights is to predict assembly relationships during the initial design stage rather than the assembly stage. By utilizing a Connector that explicitly defines attachment points and joint parameters, ArtiCAD determines these relationships before geometry generation, effectively bypassing the limited spatial reasoning capabilities of current LLMs and VLMs. To further ensure high-quality outputs, we introduce validation steps in the generation and assembly stages, accompanied by a cross-stage rollback mechanism that accurately isolates and corrects design- and code-level errors. Additionally, a self-evolving experience store accumulates design knowledge to continuously improve performance on future tasks. Extensive evaluations on three datasets (ArtiCAD-Bench, CADPrompt, and ACD) validate the effectiveness of our approach. We further demonstrate the applicability of ArtiCAD in requirement-driven conceptual design, physical prototyping, and the generation of embodied AI training assets through URDF export.
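To make the Connector idea above concrete, here is a minimal sketch of such a record in Python. All field names, defaults, and the laptop-hinge example are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Connector:
    """Hypothetical connector record: binds two parts before geometry exists."""
    parent_part: str                             # part the child attaches to
    child_part: str                              # part being attached
    attach_point: Tuple[float, float, float]     # attachment location on the parent
    axis: Tuple[float, float, float]             # joint axis direction
    joint_type: str = "revolute"                 # e.g. "revolute", "prismatic", "fixed"
    limits: Tuple[float, float] = (0.0, 1.57)    # joint range (rad or m)

# Example: a laptop lid hinged to its base, declared at design time,
# before either part's geometry has been generated.
hinge = Connector(
    parent_part="base",
    child_part="lid",
    attach_point=(0.0, 0.1, 0.0),
    axis=(1.0, 0.0, 0.0),
    joint_type="revolute",
    limits=(0.0, 2.35),
)
print(hinge)
```

Declaring joints this early is what lets downstream geometry generation respect the intended articulation rather than inferring it after the fact.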
[493] LumiMotion: Improving Gaussian Relighting with Scene Dynamics
Joanna Kaleta, Piotr Wójcik, Kacper Marzol, Tomasz Trzciński, Kacper Kania, Marek Kowalski
Main category: cs.CV
TL;DR: LumiMotion introduces a Gaussian Splatting-based inverse rendering method that leverages dynamic scene elements to better disentangle material properties from illumination in 3D reconstruction.
Details
Motivation: Existing Gaussian Splatting methods for inverse rendering are limited to static scenes and simplified lighting, failing to properly separate shadows from surface appearance in real-world dynamic conditions.
Method: Uses dynamic scene regions (moving elements) as supervisory signal, learning a dynamic 2D Gaussian Splatting representation with novel constraints that encourage dynamic regions to deform while keeping static regions stable.
Result: Improves LPIPS by 23% for albedo estimation and 15% for scene relighting compared to next-best baseline, with a new synthetic benchmark for dynamic inverse rendering evaluation.
Conclusion: Dynamic elements provide crucial cues for disentangling material and illumination, enabling more accurate inverse rendering in arbitrary dynamic scenes with challenging lighting.
Abstract: In 3D reconstruction, the problem of inverse rendering, namely recovering the illumination of the scene and the material properties, is fundamental. Existing Gaussian Splatting-based methods primarily target static scenes and often assume simplified or moderate lighting to avoid entangling shadows with surface appearance. This limits their ability to accurately separate lighting effects from material properties, particularly in real-world conditions. We address this limitation by leveraging dynamic elements - regions of the scene that undergo motion - as a supervisory signal for inverse rendering. Motion reveals the same surfaces under varying lighting conditions, providing stronger cues for disentangling material and illumination. This thesis is supported by our experimental results which show we improve LPIPS by 23% for albedo estimation and by 15% for scene relighting relative to the next-best baseline. To this end, we introduce LumiMotion, the first Gaussian-based approach that leverages dynamics for inverse rendering and operates in arbitrary dynamic scenes. Our method learns a dynamic 2D Gaussian Splatting representation that employs a set of novel constraints which encourage the dynamic regions of the scene to deform, while keeping static regions stable. As we demonstrate, this separation is crucial for correct optimization of the albedo. Finally, we release a new synthetic benchmark comprising five scenes under four lighting conditions, each in both static and dynamic variants, for the first time enabling systematic evaluation of inverse rendering methods in dynamic environments and challenging lighting. Link to project page: https://joaxkal.github.io/LumiMotion/
[494] TraversalBench: Challenging Paths to Follow for Vision Language Models
Clara Petrova, Zhuo Chen, Marin Soljačić
Main category: cs.CV
TL;DR: TraversalBench is a controlled benchmark for testing VLMs’ ability to follow complex visual paths, focusing on exact sequence recovery of path vertices with systematic variation of structural factors like self-intersections and tortuosity.
Details
Motivation: Current VLMs perform well on many multimodal benchmarks but their ability to follow complex visual paths remains under-tested. The authors aim to create a controlled diagnostic tool to evaluate models' sustained visual processing and path-faithful reasoning abilities.
Method: Created TraversalBench with single continuous polylines, start markers, and vertex markers. Systematically balanced path-structural factors: self-intersection count, tortuosity, vertex count, and nearby confounding lines. Minimized reliance on OCR, world knowledge, and open-ended planning. Conducted first-crossing analysis and auxiliary reading-order tests.
Result: Self-intersections are the dominant source of difficulty, with errors sharply localized at crossing points. Nearby confounding lines cause weaker persistent degradation. Models show consistent preference for left-to-right reading layouts. Performance drops steeply when models must resolve correct continuation at crossings.
Conclusion: TraversalBench serves as a controlled diagnostic for path-faithful visual reasoning and a testbed for studying multimodal spatial reasoning under ambiguity and clutter. It contributes to the limited area of sustained visual grounding benchmarks for VLMs.
Abstract: Vision-language models (VLMs) perform strongly on many multimodal benchmarks. However, the ability to follow complex visual paths – a task that human observers typically find straightforward – remains under-tested. We introduce TraversalBench, a controlled benchmark for exact visual path traversal. Each instance contains a single continuous polyline, a unique start marker, and markers placed at path vertices; the task is to recover the exact ordered sequence encountered when traversing the path from start to finish. The benchmark explicitly balances key path-structural factors including self-intersection count, tortuosity, vertex count, and nearby confounding lines, while minimizing reliance on OCR, world knowledge, and open-ended planning. We find that self-intersections are the dominant source of difficulty. A first-crossing analysis shows that errors are sharply localized: performance is relatively stable immediately before the first crossing, then drops steeply when the model must resolve the correct continuation. By contrast, nearby confounding lines produce a weaker persistent degradation that compounds with repeated exposure. These analyses make TraversalBench a useful diagnostic for identifying whether models suffer from human-like failures or other breakdowns in sustained visual processing. An auxiliary reading-order benchmark further reveals a consistent preference for layouts compatible with left-to-right serialization, while not explaining away the main effects of path complexity. Together, these results position TraversalBench as a controlled diagnostic of path-faithful visual reasoning and as a useful testbed for studying multimodal spatial reasoning under ambiguity, clutter, and distractor structure. More broadly, we position TraversalBench as a contribution to the still-limited area of sustained visual grounding benchmarks for VLMs.
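Since self-intersection count is the benchmark's dominant difficulty factor, a small sketch of how crossings in a polyline can be counted with standard segment-intersection tests may help. This is generic computational geometry, not the authors' actual generator or balancing procedure.

```python
import random

def ccw(a, b, c):
    # Positive if a -> b -> c turns counter-clockwise (2D cross product).
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, p3, p4):
    # Proper-intersection test: strict inequalities ignore shared endpoints.
    d1, d2 = ccw(p3, p4, p1), ccw(p3, p4, p2)
    d3, d4 = ccw(p1, p2, p3), ccw(p1, p2, p4)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def count_self_intersections(poly):
    n, crossings = len(poly), 0
    for i in range(n - 1):
        for j in range(i + 2, n - 1):  # skip adjacent segments sharing a vertex
            if segments_cross(poly[i], poly[i + 1], poly[j], poly[j + 1]):
                crossings += 1
    return crossings

random.seed(0)
path = [(random.random(), random.random()) for _ in range(12)]
print("self-intersections:", count_self_intersections(path))
```

A generator balancing the benchmark's factors would sample such paths and reject or keep them until target crossing counts and tortuosity bins are filled.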
[495] Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation
Zhiyuan Zhang, Zijian Zhou, Linjun Li, Long Chen, Hao Tang, Yichen Gong
Main category: cs.CV
TL;DR: A method for generating emission textures for 3D objects to create cyberpunk-style LED effects, including a new dataset and baseline model.
Details
Motivation: Existing 3D texture generation methods are limited to non-emissive PBR materials and cannot create realistic emission effects like LED lights needed for popular styles like cyberpunk.
Method: Proposes EmissionGen baseline for emission texture generation, constructs Objaverse-Emission dataset with 40k 3D assets containing emission materials, and defines evaluation metrics for the task.
Result: Demonstrates significant potential for industrial applications, enabling 3D objects to faithfully reproduce emission materials from reference images.
Conclusion: Introduces a novel emission texture generation task with dataset, baseline method, and evaluation metrics to address limitations of current 3D texture generation in creating realistic emission effects.
Abstract: 3D texture generation is receiving increasing attention, as it enables the creation of realistic and aesthetic texture materials for untextured 3D meshes. However, existing 3D texture generation methods are limited to producing only a few types of non-emissive PBR materials (e.g., albedo, metallic maps and roughness maps), making it difficult for them to replicate highly popular styles, such as cyberpunk, and to achieve effects like realistic LED emissions. To address this limitation, we propose a novel task, emission texture generation, which enables the synthesized 3D objects to faithfully reproduce the emission materials from input reference images. Our key contributions include: First, we construct the Objaverse-Emission dataset, the first dataset that contains 40k 3D assets with high-quality emission materials. Second, we propose EmissionGen, a novel baseline for the emission texture generation task. Third, we define detailed evaluation metrics for the emission texture generation task. Our results demonstrate significant potential for future industrial applications. Dataset will be available at https://github.com/yx345kw/EmissionGen.
[496] Data-Efficient Semantic Segmentation of 3D Point Clouds via Open-Vocabulary Image Segmentation-based Pseudo-Labeling
Takahiko Furuya
Main category: cs.CV
TL;DR: PLOVIS is a data-efficient 3D point cloud segmentation framework that addresses three concurrent data scarcity issues using open-vocabulary image segmentation for pseudo-labeling without needing 2D image sequences.
Details
Motivation: Real-world 3D point cloud segmentation faces three data insufficiency challenges: scarcity of training scenes, scarcity of point-level annotations, and absence of 2D image sequences. Existing methods only address one or two of these issues, leaving the joint treatment unexplored.
Method: PLOVIS uses Open-Vocabulary Image Segmentation (OVIS) models as pseudo label generators, creating 2D images directly from 3D point clouds. It employs two-stage filtering (removing low-confidence then likely incorrect labels) and a class-balanced memory bank to handle noise and class imbalance in pseudo labels.
Result: Experiments on ScanNet, S3DIS, Toronto3D, and Semantic3D datasets under data-scarce conditions (few tens of scenes with <100 annotated points each) show PLOVIS consistently outperforms standard fine-tuning and state-of-the-art weakly supervised methods.
Conclusion: PLOVIS effectively addresses the three concurrent forms of data insufficiency in 3D point cloud segmentation through open-vocabulary image segmentation-based pseudo-labeling, demonstrating superior performance in realistic data-scarce scenarios.
Abstract: Semantic segmentation of 3D point cloud scenes is a crucial task for various applications. In real-world scenarios, training segmentation models often faces three concurrent forms of data insufficiency: scarcity of training scenes, scarcity of point-level annotations, and absence of 2D image sequences from which point clouds were reconstructed. Existing data-efficient algorithms typically address only one or two of these challenges, leaving the joint treatment of all three unexplored. This paper proposes a data-efficient training framework specifically designed to address the three forms of data insufficiency. Our proposed algorithm, called Point pseudo-Labeling via Open-Vocabulary Image Segmentation (PLOVIS), leverages an Open-Vocabulary Image Segmentation (OVIS) model as a pseudo label generator to compensate for the lack of training data. PLOVIS creates 2D images for pseudo-labeling directly from training 3D point clouds, eliminating the need for 2D image sequences. To mitigate the inherent noise and class imbalance in pseudo labels, we introduce a two-stage filtering of pseudo labels combined with a class-balanced memory bank for effective training. The two-stage filtering mechanism first removes low-confidence pseudo labels, then discards likely incorrect pseudo labels, thereby enhancing the quality of pseudo labels. Experiments on four benchmark datasets, i.e., ScanNet, S3DIS, Toronto3D, and Semantic3D, under realistic data-scarce conditions (a few tens of training 3D scenes, each annotated with only <100 3D points) demonstrate that PLOVIS consistently outperforms existing methods including standard fine-tuning strategies and state-of-the-art weakly supervised learning algorithms. Code will be made publicly available.
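A minimal sketch of the two-stage pseudo-label filtering and class-balanced memory bank described above. The dict-based label format, thresholds, and the pluggable second-stage check are invented for illustration; the paper's exact correctness criterion is not specified here.

```python
from collections import defaultdict

def two_stage_filter(pseudo_labels, conf_thresh=0.5, agree_fn=None):
    """Stage 1: drop low-confidence labels. Stage 2: drop likely-incorrect ones.
    pseudo_labels: list of dicts {"point_id", "cls", "conf"}. agree_fn is a
    stand-in for whatever correctness check the method actually applies."""
    stage1 = [p for p in pseudo_labels if p["conf"] >= conf_thresh]
    if agree_fn is None:
        return stage1
    return [p for p in stage1 if agree_fn(p)]

class ClassBalancedBank:
    """Keep at most `cap` highest-confidence pseudo labels per class, so rare
    classes are not drowned out by frequent ones during training."""
    def __init__(self, cap=1000):
        self.cap = cap
        self.bank = defaultdict(list)

    def add(self, labels):
        for p in labels:
            self.bank[p["cls"]].append(p)
        for items in self.bank.values():
            items.sort(key=lambda p: p["conf"], reverse=True)
            del items[self.cap:]  # truncate to the per-class capacity

# Toy usage with made-up labels.
labels = [{"point_id": i, "cls": i % 3, "conf": i / 10} for i in range(10)]
bank = ClassBalancedBank(cap=2)
bank.add(two_stage_filter(labels, conf_thresh=0.3))
print({c: [p["point_id"] for p in v] for c, v in bank.bank.items()})
```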
[497] Byte-level generative predictions for forensics multimedia carving
Jaewon Lee, Md Eimran Hossain Eimon, Avinash Srinivasan, Hari Kalva
Main category: cs.CV
TL;DR: Generative approach using bGPT (byte-level transformer) for multimedia file carving by predicting missing byte patterns in fragmented BMP images, evaluated with multiple similarity metrics.
Details
Motivation: Traditional file carving methods for fragmented multimedia files rely on signatures and discriminative deep learning models but cannot reconstruct or predict missing data, limiting forensic recovery capabilities.
Method: Uses bGPT, a byte-level transformer designed for next-byte prediction, to generate likely fragment continuations from partial BMP image data. Evaluates predictions using cosine similarity, SSIM, chi-square distance, and Jensen-Shannon divergence.
Result: Generative models can effectively predict byte-level patterns to support fragment matching in unallocated disk space, demonstrating the feasibility of generative approaches for multimedia carving.
Conclusion: Generative models like bGPT offer promising capabilities for multimedia file carving by predicting missing data patterns, advancing beyond traditional signature-based and discriminative approaches in digital forensics.
Abstract: Digital forensic investigations often face significant challenges when recovering fragmented multimedia files that lack file system metadata. While traditional file carving relies on signatures and discriminative deep learning models for fragment classification, these methods cannot reconstruct or predict missing data. We propose a generative approach to multimedia carving using bGPT, a byte-level transformer designed for next-byte prediction. By feeding partial BMP image data into the model, we simulate the generation of likely fragment continuations. We evaluate the fidelity of these predictions using different metrics, namely, cosine similarity, structural similarity index (SSIM), chi-square distance, and Jensen-Shannon divergence (JSD). Our findings demonstrate that generative models can effectively predict byte-level patterns to support fragment matching in unallocated disk space.
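The evaluation compares predicted byte continuations against ground truth with distributional metrics. Below is a NumPy sketch of three of them computed on byte-value histograms (SSIM is omitted since it operates on decoded images); applying the metrics to histograms is my assumption about the setup, not a detail confirmed by the abstract.

```python
import numpy as np

def byte_histogram(data: bytes) -> np.ndarray:
    """Normalized 256-bin histogram of byte values, i.e. a probability vector."""
    h = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256).astype(float)
    return h / max(h.sum(), 1.0)

def cosine_similarity(p, q):
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))

def chi_square_distance(p, q):
    return float(0.5 * np.sum((p - q) ** 2 / (p + q + 1e-12)))

def kl(a, b):
    mask = a > 0  # 0 * log(0) contributes nothing
    return float(np.sum(a[mask] * np.log(a[mask] / (b[mask] + 1e-12))))

def jensen_shannon_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy comparison between a "predicted" continuation and the true fragment.
predicted, truth = bytes(range(256)) * 4, bytes(range(0, 256, 2)) * 8
p, q = byte_histogram(predicted), byte_histogram(truth)
print(cosine_similarity(p, q), chi_square_distance(p, q), jensen_shannon_divergence(p, q))
```

In a carving pipeline, candidate fragments from unallocated space whose histograms score closest to the model's prediction would be ranked first for matching.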
[498] UHD-GPGNet: UHD Video Denoising via Gaussian-Process-Guided Local Spatio-Temporal Modeling
Weiyuan He, Chen Wu, Pengwen Dai, Wei Wang, Dianjie Lu, Guijuan Zhang, Linwei Fan, Yongzhen Wang, Zhuoran Zheng
Main category: cs.CV
TL;DR: UHD-GPGNet: A Gaussian-process-guided local spatio-temporal denoising framework for 4K video that uses explicit GP statistics to guide adaptive fusion and enables efficient real-time deployment.
Details
Motivation: Ultra-high-definition video denoising requires addressing complex spatio-temporal degradations while preserving fine textures, maintaining chromatic stability, and enabling efficient full-resolution 4K deployment - challenges that existing methods struggle to address jointly.
Method: Proposes a Gaussian-process-guided framework that estimates sparse GP posterior statistics over compact spatio-temporal descriptors to explicitly characterize local degradation response and uncertainty, guiding adaptive temporal-detail fusion. Includes structure-color collaborative reconstruction head, heteroscedastic objective, and overlap-tiled inference for memory-bounded 4K deployment.
Result: Achieves competitive restoration fidelity with substantially fewer parameters than existing methods, enables real-time full-resolution 4K inference with significant speedup over closest quality competitor, and maintains robust performance across multi-level mixed-degradation schedules. Generalizes to real sensor noise and improves downstream object detection.
Conclusion: UHD-GPGNet effectively addresses the joint requirements of UHD video denoising through explicit GP guidance and efficient architecture design, enabling practical real-time 4K deployment with strong generalization to real-world conditions.
Abstract: Ultra-high-definition (UHD) video denoising requires simultaneously suppressing complex spatio-temporal degradations, preserving fine textures and chromatic stability, and maintaining efficient full-resolution 4K deployment. In this paper, we propose UHD-GPGNet, a Gaussian-process-guided local spatio-temporal denoising framework that addresses these requirements jointly. Rather than relying on implicit feature learning alone, the method estimates sparse GP posterior statistics over compact spatio-temporal descriptors to explicitly characterize local degradation response and uncertainty, which then guide adaptive temporal-detail fusion. A structure-color collaborative reconstruction head decouples luminance, chroma, and high-frequency correction, while a heteroscedastic objective and overlap-tiled inference further stabilize optimization and enable memory-bounded 4K deployment. Experiments on UVG and RealisVideo-4K show that UHD-GPGNet achieves competitive restoration fidelity with substantially fewer parameters than existing methods, enables real-time full-resolution 4K inference with significant speedup over the closest quality competitor, and maintains robust performance across a multi-level mixed-degradation schedule. A real-world study on phone-captured 4K video further confirms that the model, trained entirely on synthetic degradation, generalizes to unseen real sensor noise and improves downstream object detection under challenging conditions.
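Overlap-tiled inference is the mechanism that keeps full-resolution 4K frames memory-bounded. Here is a generic NumPy sketch of the idea with simple average blending in the overlaps; tile size, overlap, and the blending rule are assumptions rather than the paper's exact scheme.

```python
import numpy as np

def tiled_inference(frame, model, tile=512, overlap=64):
    """Restore a large frame tile by tile, averaging overlapping predictions.
    Assumes frame height/width are at least `tile`."""
    h, w, _ = frame.shape
    out = np.zeros_like(frame, dtype=np.float64)
    weight = np.zeros((h, w, 1), dtype=np.float64)
    step = tile - overlap
    # Clamp tile origins so every tile fits; the last row/column snaps to the edge.
    ys = sorted({min(y, h - tile) for y in range(0, h, step)})
    xs = sorted({min(x, w - tile) for x in range(0, w, step)})
    for y in ys:
        for x in xs:
            patch = frame[y:y + tile, x:x + tile]
            out[y:y + tile, x:x + tile] += model(patch)
            weight[y:y + tile, x:x + tile] += 1.0
    return out / weight  # average where tiles overlap

# Toy run: an identity "model" on a fake 4K frame reconstructs it exactly.
frame = np.random.rand(2160, 3840, 3)
restored = tiled_inference(frame, model=lambda p: p, tile=512, overlap=64)
print(np.allclose(restored, frame))  # True
```

Peak memory is set by the tile size rather than the frame size, which is the point of the technique.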
[499] Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images
Zheng Jiang, Yiming Chen, Nan He, Jiahui Chen, Chaoyang Li, Houde Qian, Lifeng Sun
Main category: cs.CV
TL;DR: TTSP addresses the Grounding Paradox in MLLMs by treating perception as scalable inference, generating multiple exploratory perception traces, filtering unreliable ones, distilling observations into structured knowledge, and iteratively refining exploration toward uncertainty.
Details
Motivation: Current MLLMs suffer from the Grounding Paradox - they must decide where to look before having evidence to make correct decisions, leading to brittle fine-grained visual reasoning. This circular dependency limits their ability to handle perceptual uncertainty.
Method: TTSP (Test-Time Scaling over Perception) treats perception as scalable inference: 1) generates multiple exploratory perception traces, 2) filters unreliable traces using entropy-based confidence estimation, 3) distills validated observations into structured knowledge, and 4) iteratively refines subsequent exploration toward unresolved uncertainty.
Result: Extensive experiments on high-resolution and general multimodal reasoning benchmarks show TTSP consistently outperforms strong baselines across backbone sizes, while exhibiting favorable scalability and token efficiency.
Conclusion: Scaling perception at test time is a promising direction for robust multimodal reasoning under perceptual uncertainty, addressing the fundamental Grounding Paradox in current MLLMs.
Abstract: Recent multimodal large language models (MLLMs) have begun to support Thinking with Images by invoking visual tools such as zooming and cropping during inference. Yet these systems remain brittle in fine-grained visual reasoning because they must decide where to look before they have access to the evidence needed to make that decision correctly. We identify this circular dependency as the Grounding Paradox. To address it, we propose Test-Time Scaling over Perception (TTSP), a framework that treats perception itself as a scalable inference process. TTSP generates multiple exploratory perception traces, filters unreliable traces using entropy-based confidence estimation, distills validated observations into structured knowledge, and iteratively refines subsequent exploration toward unresolved uncertainty. Extensive experiments on high-resolution and general multimodal reasoning benchmarks show that TTSP consistently outperforms strong baselines across backbone sizes, while also exhibiting favorable scalability and token efficiency. Our results suggest that scaling perception at test time is a promising direction for robust multimodal reasoning under perceptual uncertainty.
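A toy sketch of the entropy-based confidence filtering step: compute each trace's mean per-token entropy and discard traces above a threshold. The trace format, the mean-entropy aggregation, and the threshold value are all illustrative assumptions.

```python
import numpy as np

def trace_entropy(token_probs):
    """Mean per-token entropy of a generated trace; lower means more confident.
    token_probs: list of arrays, each the model's distribution at one step."""
    ents = [-np.sum(p * np.log(p + 1e-12)) for p in token_probs]
    return float(np.mean(ents))

def filter_traces(traces, max_entropy=1.5):
    """Keep only traces whose average token entropy stays below a threshold.
    Each trace: (observation_text, token_probs)."""
    return [obs for obs, probs in traces if trace_entropy(probs) <= max_entropy]

rng = np.random.default_rng(0)

def fake_dist(peaked, vocab=50):
    # Sharp logits simulate a confident model; flat logits an uncertain one.
    logits = rng.normal(size=vocab) * (5.0 if peaked else 0.5)
    e = np.exp(logits - logits.max())
    return e / e.sum()

traces = [("confident observation", [fake_dist(True) for _ in range(8)]),
          ("uncertain observation", [fake_dist(False) for _ in range(8)])]
print(filter_traces(traces))  # the near-uniform trace is filtered out
```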
[500] EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates
Weikun Peng, Denys Iliash, Manolis Savva
Main category: cs.CV
TL;DR: EgoFun3D introduces a task, dataset, and benchmark for obtaining simulation-ready interactive 3D objects from egocentric videos, focusing on functional mappings between object parts using structured function templates.
Details
Motivation: Interactive 3D objects are crucial for embodied AI but scarce. Prior work focuses mainly on articulations, while this work aims to capture general cross-part functional mappings (e.g., knob rotation controls burner temperature) from readily available real-world egocentric videos.
Method: Proposes a 4-stage pipeline: 2D part segmentation, 3D reconstruction, articulation estimation, and function template inference. Uses function templates as structured computational representations that can be compiled into executable code across simulation platforms.
Result: Introduces a dataset of 271 egocentric videos with challenging real-world interactions, paired with 3D geometry, 2D/3D segmentation, articulation and function template annotations. Benchmarking shows the task is challenging for off-the-shelf methods.
Conclusion: EgoFun3D provides a comprehensive framework for modeling interactive 3D objects from egocentric videos, highlighting the difficulty of the task and opening avenues for future research in functional understanding of objects from video.
Abstract: We present EgoFun3D, a coordinated task formulation, dataset, and benchmark for modeling interactive 3D objects from egocentric videos. Interactive objects are of high interest for embodied AI but scarce, making modeling from readily available real-world videos valuable. Our task focuses on obtaining simulation-ready interactive 3D objects from egocentric video input. While prior work largely focuses on articulations, we capture general cross-part functional mappings (e.g., rotation of stove knob controls stove burner temperature) through function templates, a structured computational representation. Function templates enable precise evaluation and direct compilation into executable code across simulation platforms. To enable comprehensive benchmarking, we introduce a dataset of 271 egocentric videos featuring challenging real-world interactions with paired 3D geometry, segmentation over 2D and 3D, articulation and function template annotations. To tackle the task, we propose a 4-stage pipeline consisting of: 2D part segmentation, reconstruction, articulation estimation, and function template inference. Comprehensive benchmarking shows that the task is challenging for off-the-shelf methods, highlighting avenues for future work.
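To make the function-template idea tangible, here is a hypothetical encoding of the paper's stove example (knob rotation driving burner temperature) as executable Python. The dataclass fields, the linear law, and all numbers are invented for illustration; the paper's actual template schema may differ substantially.

```python
from dataclasses import dataclass

@dataclass
class FunctionTemplate:
    """Hypothetical cross-part functional mapping: a source part's joint state
    drives a target part's property through a simple parametric law."""
    source_part: str
    source_dof: str        # e.g. "rotation" (radians)
    target_part: str
    target_property: str   # e.g. "temperature"
    gain: float            # linear coefficient of the mapping
    offset: float = 0.0

    def compile(self):
        # Emit an executable mapping, e.g. for a simulator's update loop.
        return lambda state: self.gain * state + self.offset

knob_to_burner = FunctionTemplate(
    source_part="stove_knob", source_dof="rotation",
    target_part="stove_burner", target_property="temperature",
    gain=120.0, offset=20.0,  # 20 C when off, +120 C per radian (made up)
)
heat = knob_to_burner.compile()
print(heat(1.5))  # burner temperature at 1.5 rad of knob rotation
```

Because the mapping compiles to an ordinary callable, the same template can be evaluated exactly for benchmarking or exported into a simulator's control loop.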
[501] Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization
Renyu Li, Vladimir Kirilenko, Yao You, Crag Wolfe
Main category: cs.CV
TL;DR: Agentic label harmonization workflow using vision-language models to reconcile annotation inconsistencies across datasets before fine-tuning object detection models, demonstrated on document layout detection.
Details
Motivation: Object detection models fine-tuned on combined datasets suffer from annotation incompatibility issues where different datasets encode conflicting spatial definitions for semantically equivalent categories, leading to degraded performance.
Method: Proposes an agentic label harmonization workflow that uses a vision-language model to reconcile both category semantics and bounding box granularity across heterogeneous annotation sources before training. Applied to document layout detection where annotation standards vary widely.
Result: Without harmonization, naïve mixed-dataset fine-tuning degrades RT-DETRv2 detector performance (table TEDS drops from 0.800 to 0.750). With harmonization applied to two corpora with partially overlapping taxonomies, detection F-score improves from 0.860 to 0.883, table TEDS improves to 0.814, and bounding box overlap drops from 0.043 to 0.016. Representation analysis shows harmonized training produces more compact and separable embeddings.
Conclusion: Annotation inconsistency distorts learned feature space in multimodal models, and resolving it before training through vision-language model-based harmonization restores representation structure and improves performance across multiple metrics.
Abstract: Fine-tuning object detection (OD) models on combined datasets assumes annotation compatibility, yet datasets often encode conflicting spatial definitions for semantically equivalent categories. We propose an agentic label harmonization workflow that uses a vision-language model to reconcile both category semantics and bounding box granularity across heterogeneous sources before training. We evaluate on document layout detection as a challenging case study, where annotation standards vary widely across corpora. Without harmonization, naïve mixed-dataset fine-tuning degrades a pretrained RT-DETRv2 detector: on SCORE-Bench, which measures how accurately the full document conversion pipeline reproduces ground-truth structure, table TEDS drops from 0.800 to 0.750. Applied to two corpora whose 16 and 10 category taxonomies share only 8 direct correspondences, harmonization yields consistent gains across content fidelity, table structure, and spatial consistency: detection F-score improves from 0.860 to 0.883, table TEDS improves to 0.814, and mean bounding box overlap drops from 0.043 to 0.016. Representation analysis further shows that harmonized training produces more compact and separable post-decoder embeddings, confirming that annotation inconsistency distorts the learned feature space and that resolving it before training restores representation structure.
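A minimal sketch of what applying an agreed harmonization might look like downstream: remap dataset-specific categories onto a shared taxonomy and apply a uniform box adjustment. The mapping dict stands in for what the vision-language model agent would propose, and the padding step is a crude stand-in for the paper's granularity reconciliation.

```python
def harmonize(annotations, category_map, box_pad=0.0):
    """Remap dataset-specific class names onto a shared taxonomy and apply a
    uniform bounding-box adjustment. annotations: list of
    {"cls": str, "box": [x0, y0, x1, y1]}."""
    out = []
    for ann in annotations:
        target = category_map.get(ann["cls"])
        if target is None:          # no agreed correspondence: drop the label
            continue
        x0, y0, x1, y1 = ann["box"]
        out.append({"cls": target,
                    "box": [x0 - box_pad, y0 - box_pad,
                            x1 + box_pad, y1 + box_pad]})
    return out

# Illustrative mapping an agent might propose between two taxonomies.
mapping = {"TableCaption": "caption", "FigCaption": "caption", "BodyText": "text"}
anns = [{"cls": "TableCaption", "box": [10, 10, 90, 30]},
        {"cls": "Decoration", "box": [0, 0, 5, 5]}]
print(harmonize(anns, mapping, box_pad=2.0))
```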
[502] ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation
Suyoung Kim, Sunghyun Wee, Hyeonjin Kim, Kyomin Hwang, Hyunho Lee, Nojun Kwak
Main category: cs.CV
TL;DR: ReSpinQuant is a quantization framework that combines the expressivity of layer-wise rotation methods with the inference efficiency of global rotation methods for LLM quantization, achieving state-of-the-art performance with minimal overhead.
Details
Motivation: Existing rotation-based PTQ methods face a trade-off: global rotation methods are efficient but limited in expressivity (single rotation matrix across layers), while layer-wise methods are accurate but computationally expensive (cannot fuse rotations into weights, requiring online computations).
Method: ReSpinQuant leverages offline activation rotation fusion and matches basis using efficient residual subspace rotation, reconciling high expressivity of layer-wise adaptation with negligible inference overhead.
Result: Extensive experiments on W4A4 and W3A3 quantization show ReSpinQuant achieves state-of-the-art performance, outperforming global rotation methods and matching the accuracy of computationally expensive layer-wise methods with minimal overhead.
Conclusion: ReSpinQuant successfully resolves the efficiency-accuracy trade-off in rotation-based PTQ for LLMs, providing a practical solution for efficient quantization without sacrificing model quality.
Abstract: Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficiency by fusing activation rotations into attention and FFN blocks, but suffer from limited expressivity as they are constrained to use a single learnable rotation matrix across all layers. To tackle this, layer-wise transformation methods emerged, achieving superior accuracy through localized adaptation. However, layer-wise methods cannot fuse activation rotation matrices into weights, requiring online computations and causing significant overhead. In this paper, we propose ReSpinQuant, a quantization framework that resolves such overhead by leveraging offline activation rotation fusion and matching basis using efficient residual subspace rotation. This design reconciles the high expressivity of layer-wise adaptation with only negligible inference overhead. Extensive experiments on W4A4 and W3A3 quantization demonstrate that ReSpinQuant achieves state-of-the-art performance, outperforming global rotation methods and matching the accuracy of computationally expensive layer-wise methods with minimal overhead.
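The offline-fusion idea rests on a simple identity: for an orthogonal $R$, $y = Wx = (WR^\top)(Rx)$, so the rotation can be absorbed into the weight once while rotated activations spread outliers across channels. The NumPy sketch below demonstrates that generic effect under plain symmetric quantization; it is not the paper's residual-subspace construction.

```python
import numpy as np

def random_orthogonal(n, seed=0):
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

def quantize_symmetric(t, bits=4):
    # Per-tensor symmetric uniform quantization.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max() / qmax
    return np.round(t / scale).clip(-qmax, qmax) * scale

d, rng = 64, np.random.default_rng(1)
W = rng.normal(size=(d, d))
x = rng.normal(size=d)
x[3] = 40.0                      # a single activation outlier

R = random_orthogonal(d)
W_fused = W @ R.T                # fused offline, once, before deployment
x_rot = R @ x                    # outlier energy spreads across channels

y_ref = W @ x                    # exact output (W x = W R^T R x)
err_rot = np.linalg.norm(quantize_symmetric(W_fused) @ quantize_symmetric(x_rot) - y_ref)
err_naive = np.linalg.norm(quantize_symmetric(W) @ quantize_symmetric(x) - y_ref)
print(f"rotated quantization error {err_rot:.3f} vs naive {err_naive:.3f}")
```

Because `W_fused` is computed once offline, inference pays no extra cost, which is the efficiency property the paper is preserving while adding layer-wise expressivity.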
[503] MapATM: Enhancing HD Map Construction through Actor Trajectory Modeling
Mingyang Li, Brian Lee, Rui Zuo, Brent Bacchus, Priyantha Mudalige, Qinru Qiu
Main category: cs.CV
TL;DR: MapATM is a deep neural network that uses historical vehicle trajectories as structural priors to improve lane detection accuracy in autonomous driving, achieving significant performance gains on challenging datasets.
Details
Motivation: HD mapping for autonomous driving faces challenges like view occlusions, distant lane visibility, and adverse weather conditions, which compromise lane detection accuracy and system reliability.
Method: MapATM leverages historical actor (vehicle) trajectory information as structural priors for road geometry to enhance lane detection performance through a novel deep neural network architecture.
Result: Achieves 4.6 AP improvement for lane dividers and 2.6 mAP increase on NuScenes dataset (10.1% and 6.1% relative improvements), with stable map reconstruction across diverse driving scenarios.
Conclusion: MapATM demonstrates practical value for autonomous driving by using vehicle trajectories to improve lane detection robustness in challenging conditions, enabling more reliable HD mapping.
Abstract: High-definition (HD) mapping tasks, which perform lane detections and predictions, are extremely challenging due to non-ideal conditions such as view occlusions, distant lane visibility, and adverse weather conditions. Those conditions often result in compromised lane detection accuracy and reduced reliability within autonomous driving systems. To address these challenges, we introduce MapATM, a novel deep neural network that effectively leverages historical actor trajectory information to improve lane detection accuracy, where actors refer to moving vehicles. By utilizing actor trajectories as structural priors for road geometry, MapATM achieves substantial performance enhancements, notably increasing AP by 4.6 for lane dividers and mAP by 2.6 on the challenging NuScenes dataset, representing relative improvements of 10.1% and 6.1%, respectively, compared to strong baseline methods. Extensive qualitative evaluations further demonstrate MapATM’s capability to consistently maintain stable and robust map reconstruction across diverse and complex driving scenarios, underscoring its practical value for autonomous driving applications.
[504] RESP: Reference-guided Sequential Prompting for Visual Glitch Detection in Video Games
Yakun Yu, Ashley Wiens, Adrián Barahona-Ríos, Benedict Wilkins, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer
Main category: cs.CV
TL;DR: RESP is a multi-frame framework for gameplay glitch detection using vision-language models with reference-guided prompting that compares test frames to reference frames from the same video for more robust detection.
Details
Motivation: Manual quality assurance for video game glitch detection doesn't scale with modern game development complexity. Existing VLM-based approaches struggle with realistic scene variation because they operate on single frames or limited video-level baselines.
Method: Reference-guided prompting: for each test frame, select a reference frame from earlier in the same video to establish visual baseline. Prompt VLM with reference/test pairs, then aggregate noisy frame predictions into stable video-level decisions without fine-tuning. Also introduced RefGlitch synthetic dataset for controlled analysis.
Result: Experiments across 5 VLMs and 3 datasets (synthetic + real-world) show reference guidance consistently strengthens frame-level detection and reliably transfers to stronger video-level triage under realistic QA conditions.
Conclusion: RESP provides a practical multi-frame framework for gameplay glitch detection that leverages reference-guided prompting to improve robustness against scene variation, enabling more scalable automated quality assurance.
Abstract: Visual glitches in video games degrade player experience and perceived quality, yet manual quality assurance cannot scale to the growing test surface of modern game development. Prior automation efforts, particularly those using vision-language models (VLMs), largely operate on single frames or rely on limited video-level baselines that struggle under realistic scene variation, making robust video-level glitch detection challenging. We present RESP, a practical multi-frame framework for gameplay glitch detection with VLMs. Our key idea is reference-guided prompting: for each test frame, we select a reference frame from earlier in the same video, establishing a visual baseline and reframing detection as within-video comparison rather than isolated classification. RESP sequentially prompts the VLM with reference/test pairs and aggregates noisy frame predictions into a stable video-level decision without fine-tuning the VLM. To enable controlled analysis of reference effects, we introduce RefGlitch, a synthetic dataset of manually labeled reference/test frame pairs with balanced coverage across five glitch types. Experiments across five VLMs and three datasets (one synthetic, two real-world) show that reference guidance consistently strengthens frame-level detection and that the improved frame-level evidence reliably transfers to stronger video-level triage under realistic QA conditions. Code and data are available at: https://github.com/PipiZong/RESP_code.git
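A compact sketch of the reference-guided pairing and video-level aggregation logic: pick a reference frame a fixed offset earlier, query the model per pair, and majority-vote the verdict. The VLM call, the fixed offset, and the vote threshold are stand-ins; the paper's reference selection is more deliberate than a constant offset.

```python
def detect_glitches(frames, ask_vlm, ref_offset=30, vote_thresh=0.5):
    """Pair each test frame with an earlier reference frame from the same
    video, query a VLM per pair, then majority-vote a video-level verdict.
    ask_vlm(ref, test) -> bool is a stand-in for the real model call."""
    votes = []
    for i in range(ref_offset, len(frames)):
        ref, test = frames[i - ref_offset], frames[i]
        votes.append(ask_vlm(ref, test))
    if not votes:
        return False
    return sum(votes) / len(votes) >= vote_thresh  # fraction of glitch votes

# Toy run with a mock "VLM" that flags large brightness jumps.
frames = [float(i) for i in range(100)]           # stand-in frame summaries
mock_vlm = lambda ref, test: (test - ref) > 25    # hypothetical heuristic
print(detect_glitches(frames, mock_vlm, ref_offset=30))
```

The aggregation step is what turns noisy per-frame judgments into the stable video-level triage signal the paper emphasizes.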
[505] FlowCoMotion: Text-to-Motion Generation via Token-Latent Flow Modeling
Dawei Guan, Di Yang, Chengjie Jin, Jiangtao Wang
Main category: cs.CV
TL;DR: FlowCoMotion: A text-to-motion generation framework that unifies continuous and discrete motion representations through token-latent coupling to capture both semantic alignment and fine-grained motion details.
Details
Motivation: Existing text-to-motion methods use either continuous or discrete motion representations, but continuous representations entangle semantics with dynamics, while discrete representations lose fine-grained motion details. There's a need for a unified approach that captures both high-level semantics and detailed motion dynamics.
Method: Proposes FlowCoMotion with token-latent coupling: 1) Latent branch uses multi-view distillation to regularize continuous latent space, 2) Token branch uses discrete temporal resolution quantization for high-level semantic cues, 3) Token-latent coupling network combines both representations, 4) Velocity field prediction based on text conditions, 5) ODE solver integrates velocity field from simple prior to target motion state.
Result: Achieves competitive performance on text-to-motion benchmarks including HumanML3D and SnapMoGen, demonstrating effectiveness in generating motions with both semantic alignment and fine-grained details.
Conclusion: FlowCoMotion successfully unifies continuous and discrete motion representations through token-latent coupling, enabling text-to-motion generation that captures both semantic content and high-fidelity motion details, outperforming existing approaches.
Abstract: Text-to-motion generation is driven by learning motion representations for semantic alignment with language. Existing methods rely on either continuous or discrete motion representations. However, continuous representations entangle semantics with dynamics, while discrete representations lose fine-grained motion details. In this context, we propose FlowCoMotion, a novel motion generation framework that unifies both treatments from a modeling perspective. Specifically, FlowCoMotion employs token-latent coupling to capture both semantic content and high-fidelity motion details. In the latent branch, we apply multi-view distillation to regularize the continuous latent space, while in the token branch we use discrete temporal resolution quantization to extract high-level semantic cues. The motion latent is then obtained by combining the representations from the two branches through a token-latent coupling network. Subsequently, a velocity field is predicted based on the textual conditions. An ODE solver integrates this velocity field from a simple prior, thereby guiding the sample to the potential state of the target motion. Extensive experiments show that FlowCoMotion achieves competitive performance on text-to-motion benchmarks, including HumanML3D and SnapMoGen.
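The final generation step is ODE integration of a learned velocity field from a simple prior toward the target motion latent. A sketch with plain Euler steps and a toy analytic velocity field standing in for the trained network; the solver choice and step count are assumptions.

```python
import numpy as np

def integrate_flow(velocity_fn, x0, cond, steps=100):
    """Euler integration of a velocity field from a prior sample at t=0
    toward the data distribution at t=1."""
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + velocity_fn(x, t, cond) * dt
    return x

# Toy velocity field whose straight-line flow carries any start point to a
# fixed "target motion latent" (a stand-in for the trained network).
target = np.full(8, 2.0)
toy_velocity = lambda x, t, cond: (target - x) / max(1.0 - t, 1e-6)

x0 = np.random.default_rng(0).normal(size=8)     # sample from the prior
print(integrate_flow(toy_velocity, x0, cond="a person waves"))
```

In the real system the lambda would be the trained network conditioned on text, and the integrated state would be decoded back into a motion sequence.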
[506] Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization
Jinsung Lee, Jaemin Oh, Namhun Kim, Dongwon Kim, Byung-Jun Yoon, Suha Kwak
Main category: cs.CV
TL;DR: A novel regularizer for image tokenizers that aligns latent spaces with state-space models to improve both compactness and generation-friendliness through frequency awareness.
Details
Motivation: Current image tokenizers need to balance two competing objectives: creating compact latent spaces that capture essential image content while remaining easy to model with generative approaches. Existing methods often struggle to achieve both simultaneously.
Method: Introduces a novel regularizer that guides tokenizers to mimic the hidden state dynamics of state-space models (SSMs), transferring their frequency awareness property to latent features. This is grounded in theoretical analysis of SSMs and enforces encoding of fine spatial structures and frequency-domain cues into compact latent features.
Result: Experiments show the method improves generation quality in diffusion models while incurring only minimal loss in reconstruction fidelity, demonstrating more effective use of representation capacity and improved generative modelability.
Conclusion: The proposed regularizer successfully creates latent spaces that are both compact and generation-friendly by leveraging state-space model dynamics, offering a promising approach for improving vision models through better latent space design.
Abstract: Image tokenizers are central to modern vision models as they often operate in latent spaces. An ideal latent space must be simultaneously compact and generation-friendly: it should capture an image’s essential content compactly while remaining easy to model with generative approaches. In this work, we introduce a novel regularizer to align latent spaces with these two objectives. The key idea is to guide tokenizers to mimic the hidden state dynamics of state-space models (SSMs), thereby transferring their critical property, frequency awareness, to latent features. Grounded in a theoretical analysis of SSMs, our regularizer enforces encoding of fine spatial structures and frequency-domain cues into compact latent features, leading to more effective use of representation capacity and improved generative modelability. Experiments demonstrate that our method improves generation quality in diffusion models while incurring only minimal loss in reconstruction fidelity.
[507] LDEPrompt: Layer-importance guided Dual Expandable Prompt Pool for Pre-trained Model-based Class-Incremental Learning
Linjie Li, Zhenyu Wu, Huiyu Xiao, Yang Ji
Main category: cs.CV
TL;DR: LDEPrompt introduces a layer-importance guided dual expandable prompt pool for class-incremental learning, enabling adaptive layer selection and dynamic prompt pool management.
Details
Motivation: Existing prompt-based class-incremental learning methods have limitations including fixed prompt pools, manual prompt embedding selection, and strong reliance on pretrained backbones for prompt selection.
Method: Proposes LDEPrompt with layer-importance guidance for adaptive layer selection and dynamic freezing/expansion of the prompt pool, addressing limitations of fixed pools and manual selection.
Result: Achieves state-of-the-art performance on widely used class-incremental learning benchmarks, demonstrating effectiveness and scalability.
Conclusion: LDEPrompt effectively addresses key limitations in prompt-based class-incremental learning through adaptive layer selection and dynamic prompt pool management.
Abstract: Prompt-based class-incremental learning methods typically construct a prompt pool consisting of multiple trainable key-prompts and perform instance-level matching to select the most suitable prompt embeddings, which has shown promising results. However, existing approaches face several limitations, including fixed prompt pools, manual selection of prompt embeddings, and strong reliance on the pretrained backbone for prompt selection. To address these issues, we propose a Layer-importance guided Dual Expandable Prompt Pool (LDEPrompt), which enables adaptive layer selection as well as dynamic freezing and expansion of the prompt pool. Extensive experiments on widely used class-incremental learning benchmarks demonstrate that LDEPrompt achieves state-of-the-art performance, validating its effectiveness and scalability.
[508] CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation
Rongjia Yu, Tong Jia, Hao Wang, Xiaofang Li, Xiao Yang, Zinuo Zhang, Cuiwei Liu
Main category: cs.CV
TL;DR: CDPR integrates polarization cues (AoLP/DoLP) with RGB images in a diffusion-based framework for more robust monocular depth estimation, especially for challenging surfaces like transparent/reflective objects.
Details
Motivation: Existing diffusion-based depth estimation methods rely solely on RGB inputs, which lack sufficient cues for challenging regions like textureless surfaces, transparency, and specular reflections. Polarization images provide physically grounded priors that can enhance robustness in these difficult cases.
Method: Proposes CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation. Encodes RGB and polarization (AoLP/DoLP) images into shared latent space via pre-trained VAE. Uses learnable confidence-aware gating mechanism to dynamically fuse multi-modal information, suppressing noisy polarization signals while preserving informative cues. Framework can be generalized to surface normal prediction with minimal modification.
Result: Experiments on synthetic and real-world datasets show CDPR significantly outperforms RGB-only baselines in challenging regions while maintaining competitive performance in standard scenes. Also demonstrates scalability to polarization-guided dense prediction tasks like surface normal estimation.
Conclusion: Integration of polarization priors with diffusion models enhances monocular depth estimation robustness, especially for challenging surfaces. The framework shows promise for general polarization-guided dense prediction tasks beyond depth estimation.
Abstract: Monocular depth estimation is a fundamental yet challenging task in computer vision, especially under complex conditions such as textureless surfaces, transparency, and specular reflections. Recent diffusion-based approaches have significantly advanced performance by reformulating depth prediction as a denoising process in the latent space. However, existing methods rely solely on RGB inputs, which often lack sufficient cues in challenging regions. In this work, we present CDPR - Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation - a novel diffusion-based framework that integrates physically grounded polarization priors to enhance estimation robustness. Specifically, we encode both RGB and polarization (AoLP/DoLP) images into a shared latent space via a pre-trained Variational Autoencoder (VAE), and dynamically fuse multi-modal information through a learnable confidence-aware gating mechanism. This fusion module adaptively suppresses noisy signals in polarization inputs while preserving informative cues, particularly around reflective or transparent surfaces, and provides the integrated latent representation for subsequent monocular depth estimation. Beyond depth estimation, we further verify that our framework can be easily generalized to surface normal prediction with minimal modification, showcasing its scalability to general polarization-guided dense prediction tasks. Experiments on both synthetic and real-world datasets validate that CDPR significantly outperforms RGB-only baselines in challenging regions while maintaining competitive performance in standard scenes.
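A toy PyTorch sketch of a confidence-aware gating fusion in the spirit described above: a small network predicts a per-pixel gate that decides how much polarization evidence to trust over the RGB latent. Layer sizes and the gate architecture are my assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

class ConfidenceGate(nn.Module):
    """Toy confidence-aware gate: predicts a per-pixel weight in [0, 1] that
    blends the polarization latent into the RGB latent."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_latent, pol_latent):
        g = self.gate(torch.cat([rgb_latent, pol_latent], dim=1))
        # Where polarization is noisy, g should approach 0 and fall back to RGB;
        # around reflective/transparent surfaces it should approach 1.
        return g * pol_latent + (1.0 - g) * rgb_latent

fuse = ConfidenceGate(channels=4)
rgb = torch.randn(1, 4, 32, 32)    # VAE latents (shapes are illustrative)
pol = torch.randn(1, 4, 32, 32)
print(fuse(rgb, pol).shape)        # torch.Size([1, 4, 32, 32])
```

The fused latent would then feed the diffusion denoiser in place of the RGB-only latent.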
[509] Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
Yueying Li, Fengxiang Wang, Yan Li, Mingshuo Chen, Mengying Zhao, Long Lan
Main category: cs.CV
TL;DR: DualComp is a task-adaptive dual-stream token compression framework for multimodal LLMs that addresses computational overhead in ultra-high-resolution remote sensing imagery by dynamically routing between semantic and geometric processing pathways.
Details
Motivation: Current visual token compression methods use static, uniform strategies that don't account for the "Semantic-Geometric Duality" in remote sensing tasks - object semantic tasks need background pruning while scene geometric tasks require spatial topology integrity.
Method: DualComp uses a lightweight pre-trained router to dynamically guide feature processing into two pathways: 1) Object semantic stream with Spatially-Contiguous Semantic Aggregator for background compression while protecting small objects, and 2) Scene geometric stream with Instruction-Guided Structure Recoverer that reconstructs spatial skeletons using greedy path-tracing topology completion.
Result: Experiments on XLRS-Bench show DualComp achieves high-fidelity remote sensing interpretation at exceptionally low computational cost with simultaneous improvements in both efficiency and accuracy.
Conclusion: The proposed task-adaptive dual-stream compression framework effectively addresses the computational bottleneck in processing UHR remote sensing imagery for MLLMs by respecting the fundamental duality between semantic and geometric interpretation needs.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated immense potential in Earth observation. However, the massive visual tokens generated when processing Ultra-High-Resolution (UHR) imagery introduce prohibitive computational overhead, severely bottlenecking their inference efficiency. Existing visual token compression methods predominantly adopt static and uniform compression strategies, neglecting the inherent “Semantic-Geometric Duality” in remote sensing interpretation tasks. Specifically, object semantic tasks focus on the abstract semantics of objects and benefit from aggressive background pruning, whereas scene geometric tasks critically rely on the integrity of spatial topology. To address this challenge, we propose DualComp, a task-adaptive dual-stream token compression framework. Dynamically guided by a lightweight pre-trained router, DualComp decouples feature processing into two dedicated pathways. In the object semantic stream, the Spatially-Contiguous Semantic Aggregator (SCSA) utilizes size-adaptive clustering to aggregate redundant background while protecting small objects. In the scene geometric stream, the Instruction-Guided Structure Recoverer (IGSR) introduces a greedy path-tracing topology completion mechanism to reconstruct spatial skeletons. Experiments on the UHR remote sensing benchmark XLRS-Bench demonstrate that DualComp accomplishes high-fidelity remote sensing interpretation at an exceptionally low computational cost, achieving simultaneous improvements in both efficiency and accuracy.
[510] BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning
Zekun Qian, Ruize Han, Wei Feng
Main category: cs.CV
TL;DR: BoxTuning introduces visual prompting with colored bounding boxes and trajectory trails for object-level spatial-temporal understanding in video MLLMs, reducing text token usage by 87-93% while preserving full temporal resolution.
Details
Motivation: Existing MLLMs lack explicit mechanisms for fine-grained object grounding in videos. Text-coordinate approaches suffer from modality mismatch and high token costs that force aggressive temporal downsampling, losing important motion dynamics.
Method: Injects object spatial-temporal information directly into the visual modality by rendering colored bounding boxes and trajectory trails onto video frames as visual prompts, with only a concise color-to-object legend as text.
Result: BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates accuracy degradation on reasoning-centric tasks across five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA).
Conclusion: Visual prompting with bounding boxes and trajectory trails is a more natural and efficient paradigm for conveying object information to video MLLMs, resolving the modality mismatch of text-coordinate approaches.
Abstract: Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. We propose BoxTuning, which resolves this mismatch by injecting object spatial-temporal information directly into the visual modality. Colored bounding boxes and trajectory trails are rendered onto video frames as visual prompts, with only a concise color-to-object legend retained as text. This reduces the token cost significantly, achieving 87-93% text token reduction in practice. It also preserves full temporal resolution, where the trajectory trails further encode inter-frame motion direction and speed within each keyframe, recovering fine-grained dynamics that text-coordinate methods are forced to discard. Experimental results on five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA) show that BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy degradation observed on reasoning-centric tasks, establishing visual prompting as a more natural and efficient paradigm for conveying object information to video MLLMs.
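Since the method is essentially a rendering step plus a text legend, a short sketch captures it. The drawing style below (colors, trail thinning, legend wording) is our assumption; the paper's exact rendering parameters are not given in the abstract.

```python
# Hypothetical rendering sketch: per-object colored boxes plus trajectory
# trails drawn onto the frame; colors, trail thickness, and legend wording
# are illustrative choices, not the paper's exact parameters.
import cv2
import numpy as np

COLORS = [(0, 0, 255), (0, 255, 0), (255, 0, 0)]  # BGR: red, green, blue

def render_prompts(frame: np.ndarray, boxes: dict, trails: dict) -> np.ndarray:
    """boxes: {obj_id: (x1, y1, x2, y2)}; trails: {obj_id: [(cx, cy), ...]}
    with trail points ordered oldest to newest."""
    out = frame.copy()
    for obj_id, (x1, y1, x2, y2) in boxes.items():
        color = COLORS[obj_id % len(COLORS)]
        cv2.rectangle(out, (x1, y1), (x2, y2), color, 2)
        pts = trails.get(obj_id, [])
        for i in range(1, len(pts)):
            # Thicker toward the newest point, so the trail encodes direction
            # (and, via point spacing, speed) within a single keyframe.
            cv2.line(out, pts[i - 1], pts[i], color, max(1, 1 + i // 3))
    return out

# The only text the model then needs is a short color-to-object legend, e.g.:
legend = "red box = person_1, green box = cup_1, blue box = dog_1"
```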
[511] Sparse Hypergraph-Enhanced Frame-Event Object Detection with Fine-Grained MoE
Wei Bao, Yuehan Wang, Tianhang Zhou, Siqi Li, Yue Gao
Main category: cs.CV
TL;DR: Hyper-FEOD is a high-performance RGB-Event object detection framework using sparse hypergraph fusion and fine-grained mixture of experts for efficient multimodal feature integration.
Details
Motivation: RGB cameras and event streams offer complementary advantages for robust object detection in dynamic conditions, but their heterogeneity and data redundancy lead to computational overhead and suboptimal feature fusion.
Method: Two core components: 1) Sparse Hypergraph-enhanced Cross-Modal Fusion (S-HCF) that uses event-guided activity maps and high-order hypergraph modeling on motion-critical sparse tokens; 2) Fine-Grained Mixture of Experts (FG-MoE) with specialized hypergraph experts for different image regions using pixel-level spatial gating, load-balancing loss, and zero-initialization.
Result: Achieves superior accuracy-efficiency trade-off on mainstream RGB-Event benchmarks, outperforming state-of-the-art methods while maintaining lightweight footprint suitable for real-time edge deployment.
Conclusion: Hyper-FEOD effectively addresses multimodal heterogeneity and computational efficiency challenges in RGB-Event object detection through sparse hypergraph fusion and adaptive expert enhancement.
Abstract: Integrating frame-based RGB cameras with event streams offers a promising solution for robust object detection under challenging dynamic conditions. However, the inherent heterogeneity and data redundancy of these modalities often lead to prohibitive computational overhead or suboptimal feature fusion. In this paper, we propose Hyper-FEOD, a high-performance and efficient detection framework, which synergistically optimizes multi-modal interaction through two core components. First, we introduce Sparse Hypergraph-enhanced Cross-Modal Fusion (S-HCF), which leverages the inherent sparsity of event streams to construct an event-guided activity map. By performing high-order hypergraph modeling exclusively on selected motion-critical sparse tokens, S-HCF captures complex non-local dependencies between RGB and event data while overcoming the traditional complexity bottlenecks of hypergraph computation. Second, we design a Fine-Grained Mixture of Experts (FG-MoE) Enhancement module to address the diverse semantic requirements of different image regions. This module employs specialized hypergraph experts tailored for object boundaries, internal textures, and backgrounds, utilizing a pixel-level spatial gating mechanism to adaptively route and enhance features. Combined with a load-balancing loss and zero-initialization strategy, FG-MoE ensures stable training and precise feature refinement without disrupting the pre-trained backbone’s distribution. Experimental results on mainstream RGB-Event benchmarks demonstrate that Hyper-FEOD achieves a superior accuracy-efficiency trade-off, outperforming state-of-the-art methods while maintaining a lightweight footprint suitable for real-time edge deployment.
[512] Naka-GS: A Bionics-inspired Dual-Branch Naka Correction and Progressive Point Pruning for Low-Light 3DGS
Runyu Zhu, SiXun Dong, Zhiqiang Zhang, Qingxia Ye, Zhihua Xu
Main category: cs.CV
TL;DR: NAKA-GS is a bionics-inspired framework for low-light 3D Gaussian Splatting that jointly improves photometric restoration and geometric initialization through Naka-guided chroma-correction and point preprocessing.
Details
Motivation: Low-light conditions severely degrade image visibility, introduce color distortions, and contaminate geometric priors for 3D restoration and reconstruction, hindering downstream optimization processes.
Method: The framework uses: 1) Naka-guided chroma-correction network combining physics-prior enhancement, dual-branch input modeling, frequency-decoupled correction, and mask-guided optimization; 2) Feed-forward multi-view reconstruction for dense scene priors; 3) Lightweight Point Preprocessing Module (PPM) with coordinate alignment, voxel pooling, and distance-adaptive progressive pruning (see the sketch after the abstract).
Result: NAKA-GS outperforms baseline methods by a large margin and improves restoration quality, training stability, and optimization efficiency for low-light 3D reconstruction without heavy inference overhead.
Conclusion: The proposed bionics-inspired framework effectively addresses low-light challenges in 3D Gaussian Splatting by jointly optimizing photometric restoration and geometric initialization, demonstrating superior performance in the NTIRE 3D Restoration and Reconstruction Challenge.
Abstract: Low-light conditions severely hinder 3D restoration and reconstruction by degrading image visibility, introducing color distortions, and contaminating geometric priors for downstream optimization. We present NAKA-GS, a bionics-inspired framework for low-light 3D Gaussian Splatting that jointly improves photometric restoration and geometric initialization. Our method starts with a Naka-guided chroma-correction network, which combines physics-prior low-light enhancement, dual-branch input modeling, frequency-decoupled correction, and mask-guided optimization to suppress bright-region chromatic artifacts and edge-structure errors. The enhanced images are then fed into a feed-forward multi-view reconstruction model to produce dense scene priors. To further improve Gaussian initialization, we introduce a lightweight Point Preprocessing Module (PPM) that performs coordinate alignment, voxel pooling, and distance-adaptive progressive pruning to remove noisy and redundant points while preserving representative structures. Without introducing heavy inference overhead, NAKA-GS improves restoration quality, training stability, and optimization efficiency for low-light 3D reconstruction. The proposed method was presented in the NTIRE 3D Restoration and Reconstruction (3DRR) Challenge, and outperformed the baseline methods by a large margin. The code is available at https://github.com/RunyuZhu/Naka-GS
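The abstract does not define the "Naka" correction, but in a bionics context the name usually points to the Naka-Rushton retinal response function; the sketch below is written under that assumption, with illustrative parameter values rather than the paper's.

```python
# A sketch under the stated assumption that "Naka" refers to the
# Naka-Rushton response; sigma (semi-saturation) and n (exponent) are
# illustrative values, not the paper's fitted parameters.
import numpy as np

def naka_rushton(intensity: np.ndarray, sigma: float = 0.18, n: float = 0.9):
    """Retina-like compression: lifts dark regions, saturates highlights.
    `intensity` is assumed normalized to [0, 1]."""
    i_n = np.power(np.clip(intensity, 0.0, 1.0), n)
    return i_n / (i_n + sigma ** n)

# A dark pixel (0.05) is lifted to ~0.24, a bright one (0.8) only to ~0.79:
print(naka_rushton(np.array([0.05, 0.8])))
```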
[513] Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions
Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Lorenzo Sia, Nicolas Richet, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger
Main category: cs.CV
TL;DR: Deep learning approaches for recognizing ambivalence/hesitancy in videos for digital health interventions, exploring supervised learning, domain adaptation, and LLM-based zero-shot inference.
Details
Motivation: Ambivalence and hesitancy (A/H) are critical emotional states that cause people to delay or abandon health interventions, but manual recognition by experts is costly and doesn't scale. Automatic A/H recognition is needed for personalized, cost-effective digital health interventions.
Method: Explores deep learning models for multi-modal A/H recognition in videos using three approaches: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference using large language models (LLMs). Experiments conducted on the BAH video dataset.
Result: Results show limited performance, indicating current models are insufficient for accurate A/H recognition. The paper suggests better methods are needed for spatio-temporal and multimodal fusion to leverage conflicts within and across modalities.
Conclusion: Current deep learning models are inadequate for recognizing subtle ambivalence/hesitancy in videos. More sophisticated multi-modal models with better fusion techniques are required to capture the complex emotional conflicts that characterize A/H states.
Abstract: Using behavioural science, health interventions focus on behaviour change by providing a framework to help patients acquire and maintain healthy habits that improve medical outcomes. In-person interventions are costly and difficult to scale, especially in resource-limited regions. Digital health interventions offer a cost-effective approach, potentially supporting independent living and self-management. Automating such interventions, especially through machine learning, has gained considerable attention recently. Ambivalence and hesitancy (A/H) play a primary role for individuals to delay, avoid, or abandon health interventions. A/H are subtle and conflicting emotions that place a person in a state between positive and negative evaluations of a behaviour, or between acceptance and refusal to engage in it. They manifest as affective inconsistency across modalities or within a modality, such as language, facial, vocal expressions, and body language. While experts can be trained to recognize A/H, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital health interventions. Here, we explore the application of deep learning models for A/H recognition in videos, a multi-modal task by nature. In particular, this paper covers three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs). Our experiments are conducted on the unique and recently published BAH video dataset for A/H recognition. Our results show limited performance, suggesting that more adapted multi-modal models are required for accurate A/H recognition. Better methods for modeling spatio-temporal and multimodal fusion are necessary to leverage conflicts within/across modalities.
[514] rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training
Tianyang Dai, Ming Chang, Yan Chen, Yang Hu
Main category: cs.CV
TL;DR: A framework called rPPG-VQA that assesses video suitability for remote photoplethysmography (rPPG) training by combining signal-level SNR analysis and scene-level MLLM-based interference detection, enabling better unsupervised rPPG model training on “in-the-wild” videos.
Details
Motivation: Unsupervised rPPG training on low-quality "in-the-wild" videos degrades model performance, but existing video quality assessment methods are designed for human perception, not for assessing suitability for rPPG model learning.
Method: Proposes rPPG-VQA with a dual-branch architecture: the signal-level branch uses robust SNR estimation with multi-method consensus, and the scene-level branch uses a multimodal large language model (MLLM) to identify motion and lighting interferences. Also introduces a two-stage adaptive sampling (TAS) strategy to curate optimal training datasets (see the sketch after the abstract).
Result: Experiments show that training on large-scale “in-the-wild” videos filtered by rPPG-VQA framework enables development of unsupervised rPPG models with substantial improvement in accuracy on standard benchmarks.
Conclusion: The rPPG-VQA framework effectively assesses video suitability for rPPG training, enabling better utilization of unlabeled video data and improving unsupervised rPPG model performance.
Abstract: Unsupervised remote photoplethysmography (rPPG) promises to leverage unlabeled video data, but its potential is hindered by a critical challenge: training on low-quality “in-the-wild” videos severely degrades model performance. An essential step missing here is to assess the suitability of the videos for rPPG model learning before using them for the task. Existing video quality assessment (VQA) methods are mainly designed for human perception and not directly applicable to the above purpose. In this work, we propose rPPG-VQA, a novel framework for assessing video suitability for rPPG. We integrate signal-level and scene-level analyses and design a dual-branch assessment architecture. The signal-level branch evaluates the physiological signal quality of the videos via robust signal-to-noise ratio (SNR) estimation with a multi-method consensus mechanism, and the scene-level branch uses a multimodal large language model (MLLM) to identify interferences like motion and unstable lighting. Furthermore, we propose a two-stage adaptive sampling (TAS) strategy that utilizes the quality score to curate optimal training datasets. Experiments show that by training on large-scale, “in-the-wild” videos filtered by our framework, we can develop unsupervised rPPG models that achieve a substantial improvement in accuracy on standard benchmarks. Our code is available at https://github.com/Tianyang-Dai/rPPG-VQA.
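As an illustration of what the signal-level branch measures, here is one plausible single-method SNR estimator; the paper's multi-method consensus is not reproduced, and the band limits and peak window below are our assumptions.

```python
# One plausible single-method SNR estimator (assumption); the paper combines
# several such estimators in a multi-method consensus not reproduced here.
import numpy as np
from scipy.signal import welch

def rppg_snr_db(signal: np.ndarray, fs: float = 30.0) -> float:
    """Power near the dominant pulse frequency vs. the rest of the
    physiological band (0.7-4 Hz, i.e. 42-240 bpm), in dB."""
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), 256))
    band = (freqs >= 0.7) & (freqs <= 4.0)
    peak = freqs[band][np.argmax(psd[band])]     # candidate heart rate
    near = band & (np.abs(freqs - peak) <= 0.2)  # +/- 0.2 Hz around the peak
    ratio = psd[near].sum() / (psd[band & ~near].sum() + 1e-12)
    return float(10.0 * np.log10(ratio + 1e-12))
```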
[515] MosaicMRI: A Diverse Dataset and Benchmark for Raw Musculoskeletal MRI
Paula Arguello, Berk Tinaz, Mohammad Shahab Sepehri, Maryam Soltanolkotabi, Mahdi Soltanolkotabi
Main category: cs.CV
TL;DR: MosaicMRI: A large diverse musculoskeletal MRI dataset for evaluating ML methods, showing benefits of anatomical diversity and cross-anatomy generalization.
Details
Motivation: Current MRI deep learning progress is limited by public datasets focused mainly on brain/knee imaging, lacking diversity and comprehensive evaluation across different anatomical settings.
Method: Introduce MosaicMRI, the largest open-source raw musculoskeletal MRI dataset (2,671 volumes, 80,156 slices) with diverse orientations, contrasts, anatomies, and coil configurations. Use a VarNet baseline for accelerated reconstruction experiments to study scaling behavior and cross-anatomy generalization.
Result: Models trained on combined anatomies outperform anatomy-specific models in low-sample regimes. Identified groups of body parts (e.g., foot and elbow) that generalize well with each other. Performance under domain shifts depends on training set size, anatomy, and protocol-specific factors.
Conclusion: Anatomical diversity in training data provides significant benefits, and cross-anatomical correlations can be exploited for better generalization in MRI reconstruction tasks.
Abstract: Deep learning underpins a wide range of applications in MRI, including reconstruction, artifact removal, and segmentation. However, progress has been driven largely by public datasets focused on brain and knee imaging, shaping how models are trained and evaluated. As a result, careful studies of the reliability of these models across diverse anatomical settings remain limited. In this work, we introduce MosaicMRI, a large and diverse collection of fully sampled raw musculoskeletal (MSK) MR measurements designed for training and evaluating machine-learning-based methods. MosaicMRI is the largest open-source raw MSK MRI dataset to date, comprising 2,671 volumes and 80,156 slices. The dataset offers substantial diversity in volume orientation (e.g., axial, sagittal), imaging contrasts (e.g., PD, T1, T2), anatomies (e.g., spine, knee, hip, ankle, and others), and numbers of acquisition coils. Using VarNet as a baseline for the accelerated reconstruction task, we perform a comprehensive set of experiments to study scaling behavior with respect to both model capacity and dataset size. Interestingly, models trained on the combined anatomies significantly outperform anatomy-specific models in low-sample regimes, highlighting the benefits of anatomical diversity and the presence of exploitable cross-anatomical correlations. We further evaluate robustness and cross-anatomy generalization by training models on one anatomy (e.g., spine) and testing them on another (e.g., knee). Notably, we identify groups of body parts (e.g., foot and elbow) that generalize well with each other, and highlight that performance under domain shifts depends on training set size, anatomy, and protocol-specific factors.
[516] Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks
Camile Lendering, Erkut Akdag, Egor Bondarev
Main category: cs.CV
TL;DR: Boxes2Pixels: A noise-robust framework that treats SAM as a noisy teacher to convert bounding boxes to pseudo-masks for industrial defect segmentation, using hierarchical decoding, auxiliary localization, and self-correction.
Details
Motivation: Industrial defect segmentation requires dense pixel-level annotations that are rarely available. Using SAM to convert bounding boxes to pseudo-masks creates systematic noise (hallucinating background structure while missing sparse defects), necessitating a noise-robust approach.
Method: Proposes Boxes2Pixels framework with: (1) hierarchical decoder over frozen DINOv2 features for semantic stability, (2) auxiliary binary localization head to decouple sparse foreground discovery from class prediction, and (3) one-sided online self-correction that relaxes background supervision when the student is confident, addressing teacher false negatives (see the sketch after the abstract).
Result: On wind turbine inspection benchmark: improves anomaly mIoU by +6.97 and binary IoU by +9.71 over strongest baseline. Online self-correction increases binary recall by +18.56 while using 80% fewer trainable parameters.
Conclusion: Boxes2Pixels effectively addresses noise in SAM-generated pseudo-masks for industrial defect segmentation, achieving significant improvements in segmentation quality with fewer parameters through noise-robust distillation.
Abstract: Accurate defect segmentation is critical for industrial inspection, yet dense pixel-level annotations are rarely available. A common workaround is to convert inexpensive bounding boxes into pseudo-masks using foundation segmentation models such as the Segment Anything Model (SAM). However, these pseudo-labels are systematically noisy on industrial surfaces, often hallucinating background structure while missing sparse defects. To address this limitation, a noise-robust box-to-pixel distillation framework, Boxes2Pixels, is proposed that treats SAM as a noisy teacher rather than a source of ground-truth supervision. Bounding boxes are converted into pseudo-masks offline by SAM, and a compact student is trained with (i) a hierarchical decoder over frozen DINOv2 features for semantic stability, (ii) an auxiliary binary localization head to decouple sparse foreground discovery from class prediction, and (iii) a one-sided online self-correction mechanism that relaxes background supervision when the student is confident, targeting teacher false negatives. On a manually annotated wind turbine inspection benchmark, the proposed Boxes2Pixels improves anomaly mIoU by +6.97 and binary IoU by +9.71 over the strongest baseline trained under identical weak supervision. Moreover, online self-correction increases the binary recall by +18.56, while the model employs 80% fewer trainable parameters. Code is available at https://github.com/CLendering/Boxes2Pixels.
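The one-sided self-correction is the most transferable piece, so here is a minimal sketch of it as a masked BCE loss. The confidence threshold, the loss form, and the function name are assumptions; only the one-sidedness (relax teacher background, never teacher foreground) follows the paper's description.

```python
# Minimal sketch of the one-sided idea as a masked BCE; the threshold value
# and loss form are assumptions. Only teacher *background* pixels can be
# relaxed, never teacher foreground: this targets SAM's false negatives.
import torch
import torch.nn.functional as F

def self_corrected_bce(student_logits, teacher_mask, tau: float = 0.9):
    """student_logits, teacher_mask: (B, 1, H, W); teacher_mask in {0, 1}."""
    prob = torch.sigmoid(student_logits)
    ignore = (teacher_mask == 0) & (prob.detach() > tau)  # confident student
    loss = F.binary_cross_entropy_with_logits(
        student_logits, teacher_mask.float(), reduction="none")
    keep = (~ignore).float()
    return (loss * keep).sum() / (keep.sum() + 1e-12)  # mean over kept pixels
```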
[517] RADA: Region-Aware Dual-encoder Auxiliary learning for Barely-supervised Medical Image Segmentation
Shuang Zeng, Boxu Xie, Lei Zhu, Xinliang Zhang, Jiakui Hu, Zhengjian Yao, Yuanwei Li, Yuxing Lu, Yanye Lu
Main category: cs.CV
TL;DR: RADA is a region-aware dual-encoder auxiliary learning pipeline for barely-supervised 3D medical image segmentation that uses Alpha-CLIP pre-training to extract fine-grained visual features and combines them with text-level semantic guidance for improved pseudo-label quality.
Details
Motivation: Existing barely-supervised medical image segmentation methods rely on geometric continuity for pseudo-label propagation, which lacks semantic understanding and produces low-quality pseudo-labels. Medical segmentation requires fine-grained visual features for pixel-level accuracy, but current approaches don't adequately address this need.
Method: Proposes RADA: Region-Aware Dual-encoder Auxiliary learning pipeline with a dual-encoder framework pre-trained on Alpha-CLIP to extract fine-grained, region-specific visual features. Combines image-level visual features with text-level semantic guidance for region-aware semantic supervision. Integrated into a triple-view training framework.
Result: Achieves state-of-the-art performance under extremely sparse annotation settings on LA2018, KiTS19, and LiTS datasets, demonstrating robust generalization across diverse medical imaging datasets.
Conclusion: RADA effectively addresses the limitations of geometric-only pseudo-label propagation by incorporating semantic understanding through fine-grained visual features and text guidance, enabling high-quality medical segmentation with minimal annotations.
Abstract: Deep learning has greatly advanced medical image segmentation, but its success relies heavily on fully supervised learning, which requires dense annotations that are costly and time-consuming for 3D volumetric scans. Barely-supervised learning reduces annotation burden by using only a few labeled slices per volume. Existing methods typically propagate sparse annotations to unlabeled slices through geometric continuity to generate pseudo-labels, but this strategy lacks semantic understanding, often resulting in low-quality pseudo-labels. Furthermore, medical image segmentation is inherently a pixel-level visual understanding task, where accuracy fundamentally depends on the quality of local, fine-grained visual features. Inspired by this, we propose RADA, a novel Region-Aware Dual-encoder Auxiliary learning pipeline which introduces a dual-encoder framework pre-trained on Alpha-CLIP to extract fine-grained, region-specific visual features from the original images and limited annotations. The framework combines image-level fine-grained visual features with text-level semantic guidance, providing region-aware semantic supervision that bridges image-level semantics and pixel-level segmentation. Integrated into a triple-view training framework, RADA achieves SOTA performance under extremely sparse annotation settings on LA2018, KiTS19 and LiTS, demonstrating robust generalization across diverse datasets.
[518] Do Instance Priors Help Weakly Supervised Semantic Segmentation?
Anurag Das, Anna Kukleva, Xinting Hu, Yuki M. Asano, Bernt Schiele
Main category: cs.CV
TL;DR: SeSAM adapts SAM for semantic segmentation using weak labels (coarse masks, scribbles, points) through component decomposition, skeleton-based prompting, and iterative pseudo-label refinement.
Details
Motivation: Semantic segmentation requires expensive dense pixel-level annotations. The authors aim to leverage the foundational Segment Anything Model (SAM) for semantic segmentation using cheaper weak labels instead of full supervision.
Method: SeSAM decomposes class masks into connected components, samples point prompts along object skeletons, selects SAM masks using weak-label coverage, and iteratively refines labels using pseudo-labels. It integrates with semi-supervised learning to balance ground-truth labels, SAM-based pseudo-labels, and high-confidence pseudo-labels (see the sketch after the abstract).
Result: Extensive experiments across multiple benchmarks and weak annotation types show SeSAM consistently outperforms weakly supervised baselines while substantially reducing annotation cost relative to fine supervision.
Conclusion: SeSAM successfully adapts SAM for semantic segmentation using weak labels, providing an effective framework that balances annotation cost and segmentation quality.
Abstract: Semantic segmentation requires dense pixel-level annotations, which are costly and time-consuming to acquire. To address this, we present SeSAM, a framework that uses a foundational segmentation model, i.e. Segment Anything Model (SAM), with weak labels, including coarse masks, scribbles, and points. SAM, originally designed for instance-based segmentation, cannot be directly used for semantic segmentation tasks. In this work, we identify specific challenges faced by SAM and determine appropriate components to adapt it for class-based segmentation using weak labels. Specifically, SeSAM decomposes class masks into connected components, samples point prompts along object skeletons, selects SAM masks using weak-label coverage, and iteratively refines labels using pseudo-labels, enabling SAM-generated masks to be effectively used for semantic segmentation. Integrated with a semi-supervised learning framework, SeSAM balances ground-truth labels, SAM-based pseudo-labels, and high-confidence pseudo-labels, significantly improving segmentation quality. Extensive experiments across multiple benchmarks and weak annotation types show that SeSAM consistently outperforms weakly supervised baselines while substantially reducing annotation cost relative to fine supervision.
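The skeleton-based prompting step is easy to picture in code. The sketch below uses scikit-image's skeletonize and evenly spaced sampling; the sampling rule, the prompt count k, and the downstream SAM predictor call are all assumptions rather than SeSAM's exact procedure.

```python
# Assumed sampling rule: evenly spaced points along the skeleton of one
# connected component; k and the spacing are illustrative, and the SAM call
# that would consume these prompts is omitted.
import numpy as np
from skimage.morphology import skeletonize

def skeleton_point_prompts(component_mask: np.ndarray, k: int = 5):
    """component_mask: (H, W) binary mask of a single connected component.
    Returns up to k (x, y) point prompts along its skeleton."""
    skel = skeletonize(component_mask.astype(bool))
    ys, xs = np.nonzero(skel)
    if len(xs) == 0:                          # degenerate mask: fall back to
        ys, xs = np.nonzero(component_mask)   # any foreground pixel
    idx = np.linspace(0, len(xs) - 1, num=min(k, len(xs))).astype(int)
    return np.stack([xs[idx], ys[idx]], axis=1)  # (x, y) order, as SAM expects
```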
[519] Development and evaluation of CADe systems in low-prevalence setting: The RARE25 challenge for early detection of Barrett’s neoplasia
Tim J. M. Jaspers, Francisco Caetano, Cris H. B. Claessens, Carolus H. J. Kusters, Rixta A. H. van Eijck van Heslinga, Floor Slooter, Jacques J. Bergman, Peter H. N. De With, Martijn R. Jong, Albert J. de Groof, Fons van der Sommen
Main category: cs.CV
TL;DR: RARE25 challenge introduces a prevalence-aware benchmark for detecting rare neoplasia in Barrett’s esophagus, revealing limitations of current CADe systems in low-prevalence clinical settings.
Details
Motivation: Current CADe systems for early neoplasia detection in Barrett's esophagus are evaluated on balanced datasets, but their performance under realistic low-prevalence conditions remains unknown, risking overestimation of clinical utility.
Method: Created a large-scale benchmark with public training set and hidden test set reflecting real-world incidence. Evaluated 11 teams’ approaches using diverse architectures, pretraining, ensembling, and calibration strategies with operating-point-specific metrics emphasizing high sensitivity (a worked example follows the abstract).
Result: While several methods achieved strong discriminative performance, positive predictive values remained low, highlighting the difficulty of low-prevalence detection. All methods used fully supervised classification despite dominance of normal findings, lacking prevalence-agnostic approaches.
Conclusion: The challenge reveals critical gaps in current CADe systems for low-prevalence detection and provides a public dataset and evaluation framework to support development of prevalence-robust systems suitable for clinical surveillance.
Abstract: Computer-aided detection (CADe) of early neoplasia in Barrett’s esophagus is a low-prevalence surveillance problem in which clinically relevant findings are rare. Although many CADe systems report strong performance on balanced or enriched datasets, their behavior under realistic prevalence remains insufficiently characterized. The RARE25 challenge addresses this gap by introducing a large-scale, prevalence-aware benchmark for neoplasia detection. It includes a public training set and a hidden test set reflecting real-world incidence. Methods were evaluated using operating-point-specific metrics emphasizing high sensitivity and accounting for prevalence. Eleven teams from seven countries submitted approaches using diverse architectures, pretraining, ensembling, and calibration strategies. While several methods achieved strong discriminative performance, positive predictive values remained low, highlighting the difficulty of low-prevalence detection and the risk of overestimating clinical utility when prevalence is ignored. All methods relied on fully supervised classification despite the dominance of normal findings, indicating a lack of prevalence-agnostic approaches such as anomaly detection or one-class learning. By releasing a public dataset and a reproducible evaluation framework, RARE25 aims to support the development of CADe systems robust to prevalence shift and suitable for clinical surveillance workflows.
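Why PPV stays low even for strong detectors follows directly from Bayes' rule; the numbers below are invented for illustration, not the challenge's actual operating points.

```python
# PPV from sensitivity, specificity, and prevalence via Bayes' rule; the
# operating points below are invented for illustration.
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    tp = sensitivity * prevalence                  # expected true positives
    fp = (1.0 - specificity) * (1.0 - prevalence)  # expected false positives
    return tp / (tp + fp)

# A detector with 95% sensitivity and 95% specificity looks excellent on a
# balanced set but produces mostly false alarms at 1% prevalence:
print(ppv(0.95, 0.95, 0.50))  # ~0.95
print(ppv(0.95, 0.95, 0.01))  # ~0.16
```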
[520] Precision Synthesis of Multi-Tracer PET via VLM-Modulated Rectified Flow for Stratifying Mild Cognitive Impairment
Tuo Liu, Shuijin Lin, Shaozhen Yan, Haifeng Wang, Jie Lu, Jianhua Ma, Chunfeng Lian
Main category: cs.CV
TL;DR: DIReCT++ uses a domain-informed rectified flow model with BiomedCLIP to synthesize multi-tracer PET from MRI and clinical data for Alzheimer’s disease diagnosis.
Details
Motivation: PET imaging for Alzheimer's disease is limited by cost and radiation exposure, hindering early screening. Generative models that synthesize PET from MRI offer a promising alternative but struggle with subject-specific precision.
Method: Combines a 3D rectified flow architecture to capture cross-modal relationships with the BiomedCLIP vision-language model for text-guided personalized generation using clinical scores and imaging knowledge (see the sketch after the abstract).
Result: Produces synthetic PET images with superior fidelity and generalizability, accurately recapitulates disease-specific patterns, and enables precise personalized stratification of mild cognitive impairment when combined with MRI.
Conclusion: DIReCT++ advances a scalable, data-efficient tool for early diagnosis and prognostic prediction of Alzheimer’s disease through multi-modal neuroimaging synthesis.
Abstract: The biological definition of Alzheimer’s disease (AD) relies on multi-modal neuroimaging, yet the clinical utility of positron emission tomography (PET) is limited by cost and radiation exposure, hindering early screening at preclinical or prodromal stages. While generative models offer a promising alternative by synthesizing PET from magnetic resonance imaging (MRI), achieving subject-specific precision remains a primary challenge. Here, we introduce DIReCT++, a Domain-Informed ReCTified flow model for synthesizing multi-tracer PET from MRI combined with fundamental clinical information. Our approach integrates a 3D rectified flow architecture to capture complex cross-modal and cross-tracer relationships with a domain-adapted vision-language model (BiomedCLIP) that provides text-guided, personalized generation using clinical scores and imaging knowledge. Extensive evaluations on multi-center datasets demonstrate that DIReCT++ not only produces synthetic PET images (18F-AV-45 and 18F-FDG) of superior fidelity and generalizability but also accurately recapitulates disease-specific patterns. Crucially, combining these synthesized PET images with MRI enables precise personalized stratification of mild cognitive impairment (MCI), advancing a scalable, data-efficient tool for the early diagnosis and prognostic prediction of AD. The source code will be released on https://github.com/ladderlab-xjtu/DIReCT-PLUS.
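For readers unfamiliar with the backbone, here is a generic rectified-flow training step. This is textbook rectified flow, not DIReCT++'s code: the conditioning on MRI features, BiomedCLIP text embeddings, and clinical scores is omitted, and model is a hypothetical velocity network.

```python
# Textbook rectified-flow training step, not DIReCT++'s code: `model` is a
# hypothetical velocity network, and all conditioning (MRI features,
# BiomedCLIP text embeddings, clinical scores) is omitted.
import torch

def rectified_flow_loss(model, x0: torch.Tensor, x1: torch.Tensor):
    """x0: noise sample; x1: data sample (e.g., a PET volume), same shape."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, *([1] * (x0.dim() - 1)))
    x_t = (1.0 - t) * x0 + t * x1   # straight-line interpolation
    v_target = x1 - x0              # constant velocity along that line
    v_pred = model(x_t, t.view(b))  # network predicts the velocity field
    return ((v_pred - v_target) ** 2).mean()
```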
[521] Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding
Shivam Sharma, Sankalp Nagaonkar, Ashish Choithani, Ashutosh Trivedi
Main category: cs.CV
TL;DR: Benchmark study on how internal reasoning traces (thought streams) affect video scene understanding in vision-language models, using Gemini 2.5 Flash/Lite models across 100 hours of video content.
Details
Motivation: To understand how internal reasoning processes in vision-language models impact video scene understanding, specifically examining whether more thinking leads to better outputs, where performance gains plateau, and what models actually focus on during reasoning.
Method: Used four configurations of Google’s Gemini 2.5 Flash and Flash Lite models across scenes from 100 hours of video. Introduced three evaluation metrics: Contentfulness (useful scene content vs meta-commentary), Thought-Final Coverage (faithfulness of thought stream to final output), and Dominant Entity Analysis (identifying focused subjects, actions, settings). GPT-5 served as independent judge.
Result: Quality gains from additional thinking plateau quickly, with most improvement in first few hundred tokens. Flash Lite offers best balance between quality and token usage. Tight reasoning budgets cause compression-step hallucination (adding content never reasoned about). Flash and Flash Lite produce similar thought streams but differ in style: Flash discusses reasoning process while Lite focuses on scene description.
Conclusion: Internal reasoning traces significantly impact video scene understanding, but benefits diminish quickly. Model architecture affects reasoning style, and compression artifacts can lead to hallucinations when reasoning budgets are constrained.
Abstract: We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google’s Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content in the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.
[522] Towards Adaptive Open-Set Object Detection via Category-Level Collaboration Knowledge Mining
Yuqi Ji, Junjie Ke, Lihuo He, Lizhi Wang, Xinbo Gao
Main category: cs.CV
TL;DR: Proposes category-level collaboration knowledge mining for adaptive open-set object detection, using clustering-based memory bank and base-to-novel selection to handle cross-domain adaptation and novel category discovery.
Details
Motivation: Existing object detectors struggle with cross-domain generalization and adaptation to novel categories. Current adaptive open-set object detection (AOOD) methods have limitations in weak cross-domain representations, ambiguity among novel categories, and source-domain feature bias.
Method: 1) Category-level collaboration knowledge mining strategy exploiting inter-class and intra-class relationships across domains. 2) Clustering-based memory bank encoding class prototypes, auxiliary features, and intra-class disparity information, iteratively updated via unsupervised clustering. 3) Base-to-novel selection metric to discover source-domain features related to novel categories for initializing novel-category classifiers. 4) Adaptive feature assignment strategy transferring learned category-level knowledge to target domain with asynchronous memory bank updates.
Result: Extensive experiments on multiple benchmarks show the method consistently surpasses state-of-the-art AOOD methods by 1.1-5.5 mAP.
Conclusion: The proposed approach effectively addresses limitations in cross-domain representation, novel category ambiguity, and source-domain bias in adaptive open-set object detection through category-level knowledge mining and collaborative learning strategies.
Abstract: Existing object detectors often struggle to generalize across domains while adapting to emerging novel categories. Adaptive open-set object detection (AOOD) addresses this challenge by training on base categories in the source domain and adapting to both base and novel categories in the target domain without target annotations. However, current AOOD methods remain limited by weak cross-domain representations, ambiguity among novel categories, and source-domain feature bias. To address these issues, we propose a category-level collaboration knowledge mining strategy that exploits both inter-class and intra-class relationships across domains. Specifically, we construct a clustering-based memory bank to encode class prototypes, auxiliary features, and intra-class disparity information, and iteratively update it via unsupervised clustering to enhance category-level knowledge representation. We further design a base-to-novel selection metric to discover source-domain features related to novel categories and use them to initialize novel-category classifiers. In addition, an adaptive feature assignment strategy transfers the learned category-level knowledge to the target domain and asynchronously updates the memory bank to alleviate source-domain bias. Extensive experiments on multiple benchmarks show that our method consistently surpasses state-of-the-art AOOD methods by 1.1-5.5 mAP.
[523] MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration
Jiahui Peng, He Yao, Jingwen Li, Yanzhou Su, Sibo Ju, Yujie Lu, Jin Ye, Hongchun Lu, Xue Li, Lincheng Jiang, Min Zhu, Junlong Cheng
Main category: cs.CV
TL;DR: MedP-CLIP is a region-aware medical vision-language model that integrates medical prior knowledge with feature-level region prompt mechanisms to enable fine-grained understanding of anatomical structures and lesions in medical images.
Details
Motivation: While CLIP excels at global image understanding, medical image analysis requires fine-grained understanding of specific anatomical structures or lesion regions. Existing models lack the ability to precisely comprehend region-of-interest information crucial for medical diagnosis.
Method: Integrates medical prior knowledge with a feature-level region prompt integration mechanism that can handle various prompt forms (points, bounding boxes, masks) while maintaining global context. Pre-trained on a large-scale medical dataset with 6.4M images and 97.3M region-level annotations.
Result: Significantly outperforms baseline methods in various medical tasks including zero-shot recognition, interactive segmentation, and empowering multimodal large language models. Provides cross-disease and cross-modality fine-grained spatial semantic understanding.
Conclusion: MedP-CLIP offers a scalable, plug-and-play visual backbone for medical AI that combines holistic image understanding with precise regional analysis, addressing the critical need for fine-grained medical image comprehension.
Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated outstanding performance in global image understanding and zero-shot transfer through large-scale text-image alignment. However, the core of medical image analysis often lies in the fine-grained understanding of specific anatomical structures or lesion regions. Therefore, precisely comprehending region-of-interest (RoI) information provided by medical professionals or perception models becomes crucial. To address this need, we propose MedP-CLIP, a region-aware medical vision-language model (VLM). MedP-CLIP innovatively integrates medical prior knowledge and designs a feature-level region prompt integration mechanism, enabling it to flexibly respond to various prompt forms (e.g., points, bounding boxes, masks) while maintaining global contextual awareness when focusing on local regions. We pre-train the model on a meticulously constructed large-scale dataset (containing over 6.4 million medical images and 97.3 million region-level annotations), equipping it with cross-disease and cross-modality fine-grained spatial semantic understanding capabilities. Experiments demonstrate that MedP-CLIP significantly outperforms baseline methods in various medical tasks, including zero-shot recognition, interactive segmentation, and empowering multimodal large language models. This model provides a scalable, plug-and-play visual backbone for medical AI, combining holistic image understanding with precise regional analysis.
[524] LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment: Methods and Results
Xin Li, Daoli Xu, Wei Luo, Guoqiang Xiang, Haoran Li, Chengyu Zhuang, Zhibo Chen, Jian Guan, Weping Li, Weixia Zhang, Wei Sun, Zhihua Wang, Dandan Zhu, Chengguang Zhu, Ayush Gupta, Rachit Agarwal, Shouvik Das, Biplab Ch Das, Amartya Ghosh, Kanglong Fan, Wen Wen, Shuyan Zhai, Tianwu Zhi, Aoxiang Zhang, Jianzhao Liu, Yabin Zhang, Jiajun Wang, Yipeng Sun, Kaiwei Lian, Banghao Yin
Main category: cs.CV
TL;DR: Review of LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment, introducing a new benchmark dataset (SeIQA) and competition for evaluating semantic information loss from human perspective.
Details
Motivation: To establish a new direction in image quality assessment focused on evaluating semantic information loss from the human perspective, promoting development of semantic coding, processing, and semantic-oriented optimization techniques.
Method: Created the SeIQA dataset with 510 training, 80 validation, and 160 testing image pairs (degraded vs ground truth). Organized a challenge with 58 registered teams, 6 of which submitted valid solutions for benchmarking.
Result: The challenge successfully established a new benchmark for human-oriented semantic image quality assessment, with submitted solutions achieving state-of-the-art performance on the SeIQA dataset.
Conclusion: The LoViF 2026 Challenge successfully raised awareness and established a benchmark for human-oriented semantic image quality assessment, advancing research in semantic information evaluation from human perspective.
Abstract: This paper reviews the LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment. This challenge aims to open a new direction, namely how to evaluate the loss of semantic information from the human perspective, and to promote the development of related areas such as semantic coding, semantic processing, and semantic-oriented optimization. Unlike existing quality assessment datasets, we form a dataset of human-oriented semantic quality assessment, termed the SeIQA dataset. This dataset is divided into three parts for this competition: (i) training data: 510 pairs of degraded images and their corresponding ground truth references; (ii) validation data: 80 pairs of degraded images and their corresponding ground-truth references; (iii) testing data: 160 pairs of degraded images and their corresponding ground-truth references. The primary objective of this challenge is to establish a new and powerful benchmark for human-oriented semantic image quality assessment. There are a total of 58 teams registered in this competition, and 6 teams submitted valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the SeIQA dataset.
[525] H-SPAM: Hierarchical Superpixel Anything Model
Julien Walther, Rémi Giraud, Michaël Clément
Main category: cs.CV
TL;DR: H-SPAM is a hierarchical superpixel generation framework that produces accurate, regular, and perfectly nested multi-scale superpixel representations using deep features and object priors.
Details
Motivation: Existing superpixel methods plateau in accuracy with noisy shapes and produce only single-scale partitions, limiting their usefulness in vision pipelines that require multi-scale representations.
Method: Uses a two-phase region merging process: starting from fine partitions guided by deep features and object priors, first preserves object consistency, then allows controlled inter-object grouping. Can be modulated with visual attention maps or user input (see the sketch after the abstract).
Result: Strongly outperforms existing hierarchical methods in accuracy and regularity, while performing on par with state-of-the-art non-hierarchical methods on standard benchmarks.
Conclusion: H-SPAM provides a unified framework for generating accurate, regular, and perfectly nested hierarchical superpixels that can benefit multi-scale vision pipelines.
Abstract: Superpixels offer a compact image representation by grouping pixels into coherent regions. Recent methods have reached a plateau in segmentation accuracy while generating noisy superpixel shapes. Moreover, most existing approaches produce a single fixed-scale partition that limits their use in vision pipelines that would benefit from multi-scale representations. In this work, we introduce H-SPAM (Hierarchical Superpixel Anything Model), a unified framework for generating accurate, regular, and perfectly nested hierarchical superpixels. Starting from a fine partition, guided by deep features and external object priors, H-SPAM constructs the hierarchy through a two-phase region merging process that first preserves object consistency and then allows controlled inter-object grouping. The hierarchy can also be modulated using visual attention maps or user input to preserve important regions longer in the hierarchy. Experiments on standard benchmarks show that H-SPAM strongly outperforms existing hierarchical methods in both accuracy and regularity, while performing on par with most recent state-of-the-art non-hierarchical methods. Code and pretrained models are available: https://github.com/waldo-j/hspam.
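To make the two-phase construction concrete, here is a simplified greedy merging sketch. The helpers feat_dist and same_object are hypothetical stand-ins for H-SPAM's deep-feature affinities and object priors, and a real implementation would recompute affinities as regions merge; this version only illustrates the phase ordering and the perfect nesting it yields.

```python
# Simplified sketch: `feat_dist` and `same_object` are hypothetical stand-ins
# for H-SPAM's deep-feature affinities and object priors, and affinities are
# not recomputed after merges as a real implementation would do.
import heapq

def build_hierarchy(n_regions, edges, feat_dist, same_object):
    """edges: iterable of (a, b) pairs of adjacent region ids. Returns the
    merge sequence; earlier merges nest lower in the hierarchy."""
    parent = list(range(n_regions))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    merges = []
    for phase in (1, 2):  # 1: within-object merges only; 2: inter-object
        heap = [(feat_dist(a, b), a, b) for a, b in edges
                if phase == 2 or same_object(a, b)]
        heapq.heapify(heap)
        while heap:
            _, a, b = heapq.heappop(heap)
            ra, rb = find(a), find(b)
            if ra != rb:              # skip pairs already merged
                parent[rb] = ra
                merges.append((ra, rb))
    return merges
```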
[526] NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: AI Flash Portrait (Track 3)
Ya-nan Guan, Shaonan Zhang, Hang Guo, Yawen Wang, Xinying Fan, Tianqu Zhuang, Jie Liang, Hui Zeng, Guanyi Qin, Lishen Qu, Tao Dai, Shu-Tao Xia, Lei Zhang, Radu Timofte, Bin Chen, Yuanbo Zhou, Hongwei Wang, Qinquan Gao, Tong Tong, Yanxin Qian, Lizhao You, Jingru Cong, Lei Xiong, Shuyuan Zhu, Zhi-Qiang Zhong, Kan Lv, Yang Yang, Kailing Tang, Minjian Zhang, Zhipei Lei, Zhe Xu, Liwen Zhang, Dingyong Gou, Yanlin Wu, Cong Li, Xiaohui Cui, Jiajia Liu, Guoyi Xu, Yaoxin Jiang, Yaokun Shi, Jiachen Tu, Liqing Wang, Shihang Li, Bo Zhang, Biao Wang, Haiming Xu, Xiang Long, Xurui Liao, Yanqiao Zhai, Haozhe Li, Shijun Shi, Jiangning Zhang, Yong Liu, Kai Hu, Jing Xu, Xianfang Zeng, Yuyang Liu, Minchen Wei
Main category: cs.CV
TL;DR: NTIRE 2026 RAIM challenge Track 3 focuses on AI Flash Portrait restoration, addressing real-world low-light portrait challenges with a dataset of 800 real-captured low-light portrait images and evaluation through hybrid objective-subjective metrics.
Details
Motivation: Existing deep learning models for image restoration struggle with real-world low-light portrait scenarios, failing to balance noise suppression, detail preservation, and faithful illumination/color reproduction. The challenge aims to establish a benchmark for this specific problem.
Method: Organized a competition with a dataset of 800 real-captured low-light portrait groups (each with low-light input, ground truth, and person mask). Used hybrid evaluation combining objective quantitative metrics with rigorous subjective assessment protocols.
Result: The challenge attracted over 100 participating teams and received more than 3,000 valid submissions, demonstrating widespread interest from both academia and industry in solving low-light portrait restoration problems.
Conclusion: The challenge successfully established a benchmark for real-world low-light portrait restoration, providing a dataset and evaluation framework that addresses the specific challenges of balancing noise suppression, detail preservation, and color/illumination fidelity.
Abstract: In this paper, we present a comprehensive overview of the NTIRE 2026 3rd Restore Any Image Model (RAIM) challenge, with a specific focus on Track 3: AI Flash Portrait. Despite significant advancements in deep learning for image restoration, existing models still encounter substantial challenges in real-world low-light portrait scenarios. Specifically, they struggle to achieve an optimal balance among noise suppression, detail preservation, and faithful illumination and color reproduction. To bridge this gap, this challenge aims to establish a novel benchmark for real-world low-light portrait restoration. We comprehensively evaluate the proposed algorithms utilizing a hybrid evaluation system that integrates objective quantitative metrics with rigorous subjective assessment protocols. For this competition, we provide a dataset containing 800 groups of real-captured low-light portrait data. Each group consists of a 1K-resolution low-light input image, a 1K ground truth (GT), and a 1K person mask. This challenge has garnered widespread attention from both academia and industry, attracting over 100 participating teams and receiving more than 3,000 valid submissions. This report details the motivation behind the challenge, the dataset construction process, the evaluation metrics, and the various phases of the competition. The released dataset and baseline code for this track are publicly available from the same GitHub repository (https://github.com/zsn1434/AI_Flash-BaseLine/tree/main), and the official challenge webpage is hosted on CodaBench (https://www.codabench.org/competitions/12885/).
[527] Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
You Su, Yonghong Song, Jingqi Chen, Zehan Wen
Main category: cs.CV
TL;DR: Seg2Change is an adapter framework that enables open-vocabulary semantic segmentation models to perform change detection across arbitrary categories by using a category-agnostic change head and a new dataset.
Details
Motivation: Existing change detection methods are limited to predefined classes, constraining scalability in real-world scenarios. There's a lack of effective frameworks for open-vocabulary change detection (OVCD) that integrates vision and language to detect changes across arbitrary categories.
Method: 1) Construct CA-CDD (category-agnostic change detection dataset), 2) Design category-agnostic change head to detect transitions of arbitrary categories and index them to specific classes, 3) Propose Seg2Change adapter to adapt open-vocabulary semantic segmentation models to change detection task (see the sketch after the abstract).
Result: Achieves state-of-the-art OVCD performance: +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND datasets.
Conclusion: Seg2Change provides a simple yet effective framework for open-vocabulary change detection by adapting existing segmentation models, enabling detection of changes across arbitrary categories without being limited to predefined classes.
Abstract: Change detection is a fundamental task in remote sensing, aiming to quantify the impacts of human activities and ecological dynamics on land-cover changes. Existing change detection methods are limited to predefined classes in training datasets, which constrains their scalability in real-world scenarios. In recent years, numerous advanced open-vocabulary semantic segmentation models have emerged for remote sensing imagery. However, there is still a lack of an effective framework for directly applying these models to open-vocabulary change detection (OVCD), a novel task that integrates vision and language to detect changes across arbitrary categories. To address these challenges, we first construct a category-agnostic change detection dataset, termed CA-CDD. Further, we design a category-agnostic change head to detect the transitions of arbitrary categories and index them to specific classes. Based on them, we propose Seg2Change, an adapter designed to adapt open-vocabulary semantic segmentation models to change detection task. Without bells and whistles, this simple yet effective framework achieves state-of-the-art OVCD performance (+9.52 IoU on WHU-CD and +5.50 mIoU on SECOND). Our code is released at https://github.com/yogurts-sy/Seg2Change.
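To show what "detect transitions of arbitrary categories and index them to specific classes" means at the output level, here is a naive sketch that diffs two semantic maps from the same open-vocabulary model. Note that Seg2Change trains a dedicated change head rather than differencing maps like this; the sketch only illustrates the category-agnostic mask and the per-transition indexing.

```python
# Naive illustration only: Seg2Change trains a dedicated change head rather
# than differencing maps like this. The sketch shows the intended *output*
# structure: a category-agnostic mask plus indexed class transitions.
import numpy as np

def change_from_segmentation(seg_t1: np.ndarray, seg_t2: np.ndarray):
    """seg_t1, seg_t2: (H, W) integer class maps from the same OVSS model."""
    changed = seg_t1 != seg_t2                 # category-agnostic change mask
    pairs: dict = {}
    for f, t in zip(seg_t1[changed], seg_t2[changed]):
        key = (int(f), int(t))                 # (from-class, to-class)
        pairs[key] = pairs.get(key, 0) + 1
    return changed, pairs                      # mask + per-transition counts
```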
[528] Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection
Jiaqi Wu, Zhen Wang, Enhao Huang, Kangqing Shen, Yulin Wang, Yang Yue, Yifan Pu, Gao Huang
Main category: cs.CV
TL;DR: Text-guided multispectral object detection framework using text as semantic bridge to align RGB and IR modalities, with bi-support modeling for consensus and discrepancy fusion.
Details
Motivation: Existing text-guided multispectral detection methods use text only as auxiliary semantic enhancement without bridging RGB-IR granularity asymmetry, and conventional attention fusion emphasizes consensus while overlooking valuable cross-modal discrepancies.
Method: Proposes semantic bridge fusion framework: (1) text as shared semantic bridge to align RGB and IR under unified category conditions, (2) recalibrated thermal semantic prior projected onto RGB for semantic-level mapping fusion, (3) bi-support modeling with consensus and discrepancy supports, (4) bidirectional semantic alignment module for closed-loop vision-text guidance.
Result: Extensive experiments demonstrate effectiveness and superior detection performance on multispectral benchmarks.
Conclusion: The proposed framework effectively addresses RGB-IR granularity asymmetry and leverages both consensus and discrepancy information through text-guided semantic bridging and bi-support modeling.
Abstract: Text-guided multispectral object detection uses text semantics to guide semantic-aware cross-modal interaction between RGB and IR for more robust perception. However, notable limitations remain: (1) existing methods often use text only as an auxiliary semantic enhancement signal, without exploiting its guiding role to bridge the inherent granularity asymmetry between RGB and IR; and (2) conventional data-driven attention-based fusion tends to emphasize stable consensus while overlooking potentially valuable cross-modal discrepancies. To address these issues, we propose a semantic bridge fusion framework with bi-support modeling for multispectral object detection. Specifically, text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion. We further formulate RGB-IR interaction evidence into the regular consensus support and the complementary discrepancy support that contains potentially discriminative cues, and introduce them into fusion via dynamic recalibration as a structured inductive bias. In addition, we design a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement. Extensive experiments demonstrate the effectiveness of the proposed fusion framework and its superior detection performance on multispectral benchmarks. Code is available at https://github.com/zhenwang5372/Bridging-RGB-IR-Gap.
[529] Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
Kexin Ma, Jing Xiao, Chaofeng Chen, Geyong Min, Guibo Zhu, Jinqiao Wang, Liang Liao
Main category: cs.CV
TL;DR: DeSAP is a decoupled similarity-aware pruning method for large vision-language models that combines task-related cross-modal relevance with visual saliency to prune visual tokens efficiently while maintaining performance.
Details
Motivation: Existing token pruning methods for LVLMs rely on biased attention distributions from individual components, leading to incomplete and suboptimal pruning decisions. There's a need for more precise, task-aware pruning that considers both task-related guidance and visual cues.
Method: DeSAP introduces decoupled similarity to capture fine-grained cross-modal relevance between visual features and text tokens, providing explicit task-related guidance. This is integrated with visual saliency signals from visual attention to perform token pruning under both task-related and visual cues (see the sketch after the abstract).
Result: On LLaVA-1.5-7B, DeSAP achieves 10× FLOPs reduction and 2.3× prefill speedup by retaining only 11.1% of visual tokens while maintaining 98.1% of original performance. Extensive experiments across diverse benchmarks and architectures show consistent outperformance of SOTA methods.
Conclusion: DeSAP enables robust and efficient token pruning for LVLMs by combining task-related cross-modal relevance with visual saliency, achieving significant computational savings with minimal performance degradation.
Abstract: Token pruning has emerged as an effective approach to reduce the substantial computational overhead of Large Vision-Language Models (LVLMs) by discarding less informative visual tokens while preserving performance. However, existing methods typically rely on individual attention sources from different LVLM components, resulting in incomplete and suboptimal pruning decisions due to biased attention distributions. To address this problem, we propose DeSAP, a novel Decoupled Similarity-Aware Pruning method for precise, task-aware token pruning within the visual encoder. Specifically, DeSAP introduces a decoupled similarity to capture fine-grained cross-modal relevance between visual features and text tokens, providing explicit task-related guidance for pruning. By integrating decoupled similarity with visual saliency signals derived from visual attention, DeSAP performs token pruning under the guidance of both task-related and visual cues, enabling robust pruning even under aggressive pruning ratios. Extensive experiments across diverse benchmarks and architectures show that DeSAP consistently outperforms SOTA methods in both accuracy and efficiency. On LLaVA-1.5-7B, DeSAP achieves a 10× FLOPs reduction and a 2.3× prefill speedup by retaining only 11.1% of visual tokens, while maintaining 98.1% of the original performance.
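To make the scoring rule concrete, here is a minimal numpy sketch of the pruning idea as summarized above: fuse text-to-token relevance with attention saliency and keep the top-scoring tokens. The function name, the min-max normalization, and the mixing weight alpha are our own illustrative assumptions, not the authors' released code.

```python
import numpy as np

def prune_tokens(visual, text, saliency, keep_ratio=0.111, alpha=0.5):
    """Sketch of similarity + saliency token pruning (assumed form).

    visual:   (N, d) visual token features
    text:     (T, d) text token features
    saliency: (N,)   visual-attention saliency per token
    """
    # Cross-modal relevance: max cosine similarity to any text token.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    relevance = (v @ t.T).max(axis=1)                  # (N,)

    # Normalize both cues to [0, 1] before mixing (our assumption).
    def norm01(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    score = alpha * norm01(relevance) + (1 - alpha) * norm01(saliency)
    k = max(1, int(round(keep_ratio * len(score))))
    keep = np.argsort(score)[-k:]                      # indices of kept tokens
    return np.sort(keep)

# Toy usage: 576 visual tokens, 32 text tokens, 64-d features.
rng = np.random.default_rng(0)
idx = prune_tokens(rng.normal(size=(576, 64)),
                   rng.normal(size=(32, 64)),
                   rng.random(576))
print(len(idx), "tokens kept")  # ~64 of 576 at the 11.1% keep ratio
```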
[530] Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
Tencent Hunyuan Team
Main category: cs.CV
TL;DR: MTSS introduces a factorized scene description paradigm for video understanding and generation, replacing monolithic captions with decoupled streams and explicit grounding links.
Details
Motivation: Current MLLMs treat videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information, compromising representational fidelity and limiting scalability for local edits.
Method: Proposes Multi-Stream Scene Script (MTSS) with Stream Factorization (decoupling videos into Reference, Shot, Event, and Global streams) and Relational Grounding (reconnecting streams through explicit identity and temporal links).
Result: MTSS reduces total error rate by 25% on Video-SALMONN-2, gains 67% performance on Daily-Omni reasoning, narrows gap between small/large MLLMs, and improves video generation: 45% identity consistency, 56% audio-visual alignment, 71% temporal controllability.
Conclusion: MTSS provides a more learnable, factorized caption interface that enhances both video understanding and generation in multimodal LLMs by addressing structural bottlenecks of monolithic representations.
Abstract: Advances in Multimodal Large Language Models (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still casts videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information. This dense coupling not only compromises representational fidelity but also limits scalability, since even local edits can trigger global rewrites. To address this structural bottleneck, we propose Multi-Stream Scene Script (MTSS), a novel paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. MTSS is built on two core principles: Stream Factorization, which decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these isolated streams through explicit identity and temporal links to maintain holistic video consistency. Extensive experiments demonstrate that MTSS consistently enhances video understanding across various models, achieving an average reduction of 25% in the total error rate on Video-SALMONN-2 and an average performance gain of 67% on the Daily-Omni reasoning benchmark. It also narrows the performance gap between smaller and larger MLLMs, indicating a substantially more learnable caption interface. Finally, even without architectural adaptation, replacing monolithic prompts with MTSS in multi-shot video generation yields substantial human-rated improvements: a 45% boost in cross-shot identity consistency, a 56% boost in audio-visual alignment, and a 71% boost in temporal controllability.
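For readers who want a mental model of the factorized script, here is a small Python sketch of what an MTSS-like container might look like: four decoupled streams plus explicit identity links that reconnect them. The field names and schema are our own guesses from the summary above, not the paper's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class MultiStreamSceneScript:
    """Sketch of an MTSS-like container (our assumed shape, not the
    paper's schema): four factorized streams plus grounding links."""
    reference: dict = field(default_factory=dict)   # id -> entity description
    shots: list = field(default_factory=list)       # per-shot camera/visual info
    events: list = field(default_factory=list)      # timestamped audio/actions
    global_desc: str = ""                           # scene-level summary

script = MultiStreamSceneScript(
    reference={"P1": "woman in a red coat"},
    shots=[{"shot": 0, "desc": "medium shot, kitchen", "ids": ["P1"]}],
    events=[{"t": (0.0, 2.5), "desc": "[P1] pours coffee, liquid sound"}],
    global_desc="quiet morning routine",
)
print(script.shots[0]["ids"])   # identity link back to the Reference stream
```

The point of such a structure, as the abstract argues, is that a local edit (say, changing P1's coat color) touches one Reference entry rather than forcing a rewrite of a monolithic paragraph.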
[531] Variational Latent Entropy Estimation Disentanglement: Controlled Attribute Leakage for Face Recognition
Ünsal Öztürk, Vedrana Krivokuća Hahn, Sushil Bhattacharjee, Sébastien Marcel
Main category: cs.CV
TL;DR: VLEED is a post-hoc method that disentangles demographic attributes (gender/ethnicity) from identity information in face recognition embeddings using variational autoencoders and mutual information objectives.
Details
Motivation: Face recognition embeddings encode both identity and demographic attributes (gender, ethnicity), which can raise privacy and fairness concerns when used in downstream systems. Separating these factors is important for protecting privacy and reducing bias.
Method: VLEED uses a variational autoencoder to transform pretrained embeddings, with a mutual information-based objective that estimates entropy of categorical attributes in latent space. This encourages separation of demographic attributes from identity-relevant information while maintaining fine-grained control over information removal.
Result: Evaluated on IJB-C, RFW, and VGGFace2 datasets for gender and ethnicity disentanglement. VLEED offers better privacy-utility tradeoffs than existing methods and can reduce recognition bias across demographic groups, as measured by verification utility, attribute predictability, and group disparity metrics.
Conclusion: VLEED provides an effective post-hoc approach for disentangling demographic attributes from face recognition embeddings, offering improved privacy-utility tradeoffs and bias reduction capabilities compared to state-of-the-art methods.
Abstract: Face recognition embeddings encode identity, but they also encode other factors such as gender and ethnicity. Depending on how these factors are used by a downstream system, separating them from the information needed for verification is important for both privacy and fairness. We propose Variational Latent Entropy Estimation Disentanglement (VLEED), a post-hoc method that transforms pretrained embeddings with a variational autoencoder and encourages a distilled representation where the categorical variable of interest is separated from identity-relevant information. VLEED uses a mutual information-based objective realised through the estimation of the entropy of the categorical attribute in the latent space, and provides stable training with fine-grained control over information removal. We evaluate our method on IJB-C, RFW, and VGGFace2 for gender and ethnicity disentanglement, and compare it to various state-of-the-art methods. We report verification utility, predictability of the disentangled variable under linear and nonlinear classifiers, and group disparity metrics based on false match rates. Our results show that VLEED offers a wide range of privacy-utility tradeoffs over existing methods and can also reduce recognition bias across demographic groups.
[532] A Deep Equilibrium Network for Hyperspectral Unmixing
Chentong Wang, Jincheng Gao, Fei Zhu, Jie Chen
Main category: cs.CV
TL;DR: DEQ-Unmix: A deep equilibrium model for hyperspectral unmixing that enables efficient constant-memory training via implicit differentiation, replacing gradient operators with trainable convolutional networks to capture spectral-spatial information.
Details
Motivation: Traditional hyperspectral unmixing methods struggle with complex spectral-spatial features, deep learning lacks interpretability, and unrolling-based methods have memory and numerical precision issues during backpropagation.
Method: Reformulates abundance estimation as a deep equilibrium model using implicit differentiation for constant-memory training, replaces gradient operators with trainable convolutional networks to capture spectral-spatial information.
Result: DEQ-Unmix achieves superior unmixing performance on synthetic and two real-world datasets while maintaining constant memory cost compared to existing methods.
Conclusion: DEQ-Unmix provides an efficient, interpretable solution for hyperspectral unmixing with constant memory requirements, addressing limitations of existing approaches.
Abstract: Hyperspectral unmixing (HU) is crucial for analyzing hyperspectral imagery, yet achieving accurate unmixing remains challenging. While traditional methods struggle to effectively model complex spectral-spatial features, deep learning approaches often lack physical interpretability. Unrolling-based methods, despite offering network interpretability, inadequately exploit spectral-spatial information and incur high memory costs and numerical precision issues during backpropagation. To address these limitations, we propose DEQ-Unmix, which reformulates abundance estimation as a deep equilibrium model, enabling efficient constant-memory training via implicit differentiation. It replaces the gradient operator of the data reconstruction term with a trainable convolutional network to capture spectral-spatial information. By leveraging implicit differentiation, DEQ-Unmix enables efficient and constant-memory backpropagation. Experiments on synthetic and two real-world datasets demonstrate that DEQ-Unmix achieves superior unmixing performance while maintaining constant memory cost.
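The deep-equilibrium mechanics behind the constant-memory claim are easy to sketch. Below is a minimal PyTorch illustration of a DEQ forward pass: iterate to a fixed point without tracking gradients, then re-attach the graph through one extra function application (a common one-step approximation of implicit differentiation). This is a generic DEQ sketch under our own toy map, not the DEQ-Unmix architecture.

```python
import torch

def deq_fixed_point(f, x, iters=30):
    """Sketch of a deep-equilibrium forward pass with constant memory.

    Finds z* ≈ f(z*, x) without storing intermediate iterates, then
    applies one differentiable step at z* so the backward pass costs a
    single step (a standard 1-step approximation of implicit
    differentiation; the paper's exact scheme may differ).
    """
    z = torch.zeros_like(x)
    with torch.no_grad():                 # forward iterations: no graph kept
        for _ in range(iters):
            z = f(z, x)
    z = f(z, x)                           # one differentiable step at z*
    return z

# Toy contraction: f(z, x) = tanh(W z + x) with a shrunken W converges.
torch.manual_seed(0)
W = torch.nn.Linear(8, 8)
with torch.no_grad():
    W.weight.mul_(0.1)                    # keep the map contractive

x = torch.randn(4, 8, requires_grad=True)
z_star = deq_fixed_point(lambda z, u: torch.tanh(W(z) + u), x)
z_star.sum().backward()                   # gradients flow through one step
print(z_star.shape, x.grad is not None)
```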
[533] Empowering Video Translation using Multimodal Large Language Models
Bingzheng QU, Kehai Chen, Xuefeng Bai, Min Zhang
Main category: cs.CV
TL;DR: A comprehensive survey paper reviewing how multimodal large language models (MLLMs) empower video translation tasks, organized around a three-role taxonomy and discussing future research directions.
Details
Motivation: Despite rapid progress in MLLMs and existing surveys on general video-language understanding, there is a lack of focused and systematic review of how MLLMs specifically empower video translation tasks, which this paper aims to address.
Method: The paper provides a comprehensive overview organized around a three-role taxonomy: 1) Semantic Reasoner (video understanding, temporal reasoning, multimodal fusion), 2) Expressive Performer (LLM-driven/augmented speech generation), and 3) Visual Synthesizer (video generators for lip-sync and visual alignment).
Result: The survey systematically categorizes MLLM-based approaches to video translation, highlighting how they overcome limitations of traditional cascaded pipelines and achieve competitive translation quality with stronger robustness in zero-shot and multi-speaker scenarios.
Conclusion: The paper identifies open challenges in video understanding, temporal modeling, and multimodal alignment, and outlines promising future research directions for MLLM-powered video translation systems.
Abstract: Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding, reasoning, and generation capabilities, MLLM-based video translation systems are overcoming the limitations of traditional cascaded pipelines that separately handle automatic speech recognition, machine translation, text-to-speech and lip synchronization. These MLLM-powered approaches not only achieve competitive or superior translation quality, but also demonstrate stronger robustness in zero-shot settings and multi-speaker scenarios, while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. However, despite the rapid progress of MLLMs and extensive surveys on general video-language understanding, a focused and systematic review of how MLLMs empower video translation tasks is still lacking. To fill this gap, we provide the first comprehensive overview of MLLM-based video translation, organized around a three-role taxonomy: 1) Semantic Reasoner, which characterizes how MLLMs perform video understanding, temporal reasoning, and multimodal fusion; 2) Expressive Performer, which analyzes LLM-driven and LLM-augmented techniques for expressive, controllable speech generation; and 3) Visual Synthesizer, which examines different types of video generators for high-fidelity lip-sync and visual alignment. Finally, we discuss open challenges in video understanding, temporal modeling, and multimodal alignment, and outline promising future research directions for MLLM-powered video translation.
[534] Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
Dongxu Wei, Qi Xu, Zhiqi Li, Hangning Zhou, Cong Qiu, Hailong Qin, Mu Yang, Zhaopeng Cui, Peidong Liu
Main category: cs.CV
TL;DR: First approach for 3D scene generation directly in implicit 3D latent space using 3D Representation Autoencoder and 3D Diffusion Transformer, enabling efficient and spatially consistent generation without per-trajectory sampling.
Details
Motivation: Current 3D scene generation relies on 2D multi-view/video diffusion models which degrade 3D spatial extrapolation to 2D temporal extension, causing representation redundancy and limited spatial consistency in generated scenes.
Method: 1) 3D Representation Autoencoder (3DRAE) repurposes frozen 2D encoders to ground view-coupled 2D semantics into view-decoupled 3D latent representation. 2) 3D Diffusion Transformer (3DDiT) performs diffusion modeling in this 3D latent space for efficient generation.
Result: Enables representing 3D scenes from arbitrary views with fixed complexity and rich semantics, supports diverse conditioning configurations, and allows decoding to images/point maps along arbitrary camera trajectories without per-trajectory diffusion sampling.
Conclusion: Proposes first approach for direct 3D scene generation in implicit 3D latent space, addressing fundamental limitations of 2D-based methods and enabling more efficient and spatially consistent 3D scene generation.
Abstract: 3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists in the form of multi-view images or videos, which are naturally compatible with 2D diffusion architectures. Typically, these 2D-based approaches degrade 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes. In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from arbitrary numbers of views, at any resolution and aspect ratio, with fixed complexity and rich semantics. Then we introduce 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations. Moreover, since our approach directly generates a 3D scene representation, it can be decoded to images and optional point maps along arbitrary camera trajectories without requiring a per-trajectory diffusion sampling pass, which is common in 2D-based approaches.
[535] A Compact and Efficient 1.251 Million Parameter Machine Learning CNN Model PD36-C for Plant Disease Detection: A Case Study
Shkelqim Sherifi
Main category: cs.CV
TL;DR: PD36-C is a compact CNN for plant disease classification with 1.25M parameters, achieving 99.5% test accuracy on 38 classes, designed for edge deployment with a Qt desktop application.
Details
Motivation: To develop a compact, efficient convolutional neural network for plant disease diagnosis that can be deployed on edge devices in agricultural settings, addressing the need for practical, offline solutions in smart agriculture.
Method: Designed PD36-C, a compact CNN with 1,250,694 parameters (4.77 MB), trained on the New Plant Diseases Dataset (87k images, 38 classes) using TensorFlow Keras. Created a Qt for Python desktop application with intuitive GUI for offline inference on commodity hardware.
Result: Achieved 0.99697 training accuracy by epoch 30 and 0.9953 average test accuracy across 38 classes. Many classes achieved perfect precision and recall (1.00), while lower-performing classes like Corn Cercospora leaf spot still achieved ~0.9777 precision and ~0.9634 recall.
Conclusion: Small CNNs can achieve competitive accuracy with careful design and well-curated datasets, making them practical for edge deployment in agricultural disease detection, though challenges remain with adverse conditions and multiple concurrent diseases.
Abstract: Deep learning has markedly advanced image-based plant disease diagnosis as improved hardware and dataset quality have enabled increasingly accurate neural network models. This paper presents PD36-C, a compact convolutional neural network (1,250,694 parameters and 4.77 MB) for plant disease classification. Trained with TensorFlow Keras on the New Plant Diseases Dataset (87k images, 38 classes), PD36-C is designed for robustness and edge deployability, complemented by a Qt for Python desktop application that offers an intuitive GUI and offline inference on commodity hardware. Across experiments, training accuracy reached 0.99697 by epoch 30, and average test accuracy was 0.9953 across 38 classes. Per-class performance is uniformly high; on the lower end, Corn (maize) Cercospora leaf spot achieved precision around 0.9777 and recall around 0.9634, indicating occasional confusion with visually similar categories, while on the upper end numerous classes including Apple Black rot, Cedar apple rust, Blueberry healthy, Cherry Powdery mildew, Cherry healthy, and all four grape categories achieved perfect precision and recall of 1.00, indicating no false positives and strong coverage. These results show that with a well-curated dataset and careful architectural design, small CNNs can achieve competitive accuracy compared with recent baselines while remaining practical for edge scenarios. We also note typical constraints such as adverse weather, low-quality imagery, and leaves exhibiting multiple concurrent diseases that can degrade performance and warrant future work on domain robustness. Overall, PD36-C and its application pipeline contribute a field-ready, efficient solution for AI-assisted plant disease detection in smart agriculture.
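To give a feel for how small such a model is, here is an illustrative Keras network in the same spirit: a few convolution blocks and a 38-way softmax head. This is not the published PD36-C architecture (its exact layers are not given in the summary), and as written it lands well below the reported 1.25M parameters.

```python
import tensorflow as tf
from tensorflow.keras import layers

def compact_cnn(num_classes=38, input_shape=(128, 128, 3)):
    """Illustrative compact CNN; NOT the published PD36-C architecture,
    whose exact layer configuration is not given in the summary."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = compact_cnn()
model.summary()  # ~136K parameters here, well below PD36-C's 1.25M
```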
[536] LoGo-MR: Screening Breast MRI for Cancer Risk Prediction by Efficient Omni-Slice Modeling
Xin Wang, Yuan Gao, George Yiasemis, Antonio Portaluri, Zahra Aghdam, Muzhen He, Luyi Han, Yaofei Duan, Chunyao Lu, Xinglong Liang, Tianyu Zhang, Vivien van Veldhuizen, Yue Sun, Tao Tan, Ritse Mann, Jonas Teuwen
Main category: cs.CV
TL;DR: LoGo-MR: A 2.5D local-global framework for 5-year breast cancer risk prediction from MRI using neighbor-slice encoding for local cues and transformer-enhanced MIL for global patterns, with multi-plane extension for volumetric understanding.
Details
Motivation: Need for efficient and explainable breast cancer risk prediction for large-scale screening. Breast MRI provides functional information but current methods are either computationally expensive (3D CNNs) or fail to model inter-slice continuity (2D CNNs). Also, breast MRI modeling for both short- and long-term risk stratification is underexplored.
Method: Proposes LoGo-MR framework with: 1) neighbor-slice encoding to capture local cues for short-term risk, 2) transformer-enhanced multiple-instance learning (MIL) to model distributed global patterns for long-term risk with interpretable slice importance. Extends to LoGo3-MR across axial, sagittal, and coronal planes for complementary volumetric information and voxel-level risk saliency mapping.
Result: Outperforms 2D/3D baselines and existing SOTA MIL methods on large breast MRI screening cohort (~7.5K). Achieves AUCs of 0.77-0.69 for 1- to 5-year prediction, improves C-index by ~6% over 3D CNNs. LoGo3-MR further improves overall performance with interpretable localization across three planes, validated across seven backbones with consistent gains.
Conclusion: The method demonstrates clinical potential for efficient MRI-based breast cancer risk stratification in large-scale screening, offering both predictive performance and interpretability through risk saliency mapping that can assist radiologists in localizing risk-relevant regions.
Abstract: Efficient and explainable breast cancer (BC) risk prediction is critical for large-scale population-based screening. Breast MRI provides functional information for personalized risk assessment. Yet effective modeling remains challenging as fully 3D CNNs capture volumetric context at high computational cost, whereas lightweight 2D CNNs fail to model inter-slice continuity. Importantly, breast MRI modeling for short- and long-term BC risk stratification remains underexplored. In this study, we propose LoGo-MR, a 2.5D local-global structural modeling framework for five-year BC risk prediction. Aligned with clinical interpretation, our framework first employs neighbor-slice encoding to capture subtle local cues linked to short-term risk. It then integrates transformer-enhanced multiple-instance learning (MIL) to model distributed global patterns related to long-term risk and provide interpretable slice importance. We further apply this framework across axial, sagittal, and coronal planes as LoGo3-MR to capture complementary volumetric information. This multi-plane formulation enables voxel-level risk saliency mapping, which may assist radiologists in localizing risk-relevant regions during breast MRI interpretation. Evaluated on a large breast MRI screening cohort (~7.5K), our method outperforms 2D/3D baselines and existing SOTA MIL methods, achieving AUCs of 0.77-0.69 for 1- to 5-year prediction and improving C-index by ~6% over 3D CNNs. LoGo3-MR further improves overall performance with interpretable localization across three planes, and validation across seven backbones shows consistent gains. These results highlight the clinical potential of efficient MRI-based BC risk stratification for large-scale screening. Code will be released publicly.
[537] LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization
Jianshi Wu, Minghang Zhu, Dunqiang Liu, Wen Li, Sheng Ao, Siqi Shen, Chenglu Wen, Cheng Wang
Main category: cs.CV
TL;DR: LEADER is a robust LiDAR relocalization framework using a geometric encoder and reliability-aware loss to handle noise and outliers in challenging scenes.
Details
Motivation: Existing learning-based LiDAR relocalization methods treat all predicted points equally, making them vulnerable to noise and outliers in challenging scenes, which limits their robustness and accuracy.
Method: Proposes LEADER with: 1) Robust Projection-based Geometric Encoder capturing multi-scale geometric features, and 2) Truncated Relative Reliability loss modeling point-wise ambiguity to mitigate unreliable predictions.
Result: Outperforms state-of-the-art methods on Oxford RobotCar and NCLT datasets with 24.1% and 73.9% relative reductions in position error respectively.
Conclusion: LEADER provides a robust solution for LiDAR relocalization by addressing noise and outlier issues through geometric feature enhancement and reliability-aware loss formulation.
Abstract: LiDAR relocalization has attracted increasing attention as it can deliver accurate 6-DoF pose estimation in complex 3D environments. Recent learning-based regression methods offer efficient solutions by directly predicting global poses without the need for explicit map storage. However, these methods often struggle in challenging scenes due to their equal treatment of all predicted points, which is vulnerable to noise and outliers. In this paper, we propose LEADER, a robust LiDAR-based relocalization framework enhanced by a simple, yet effective geometric encoder. Specifically, a Robust Projection-based Geometric Encoder architecture which captures multi-scale geometric features is first presented to enhance descriptiveness in geometric representation. A Truncated Relative Reliability loss is then formulated to model point-wise ambiguity and mitigate the influence of unreliable predictions. Extensive experiments on the Oxford RobotCar and NCLT datasets demonstrate that LEADER outperforms state-of-the-art methods, achieving 24.1% and 73.9% relative reductions in position error over existing techniques, respectively. The source code is released on https://github.com/JiansW/LEADER.
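The reliability-aware loss can be pictured with a few lines of numpy: weight each point's residual by its predicted reliability and truncate the largest terms so outliers cannot dominate. This is our assumed reading of a "truncated relative reliability" objective, not the paper's exact formulation.

```python
import numpy as np

def truncated_reliability_loss(residuals, reliability, trunc_frac=0.2):
    """Sketch of a reliability-weighted, truncated point loss (assumed
    form: weight per-point residuals by predicted reliability, then
    drop the worst fraction so outliers cannot dominate training).

    residuals:   (N,) per-point regression errors
    reliability: (N,) predicted reliability weights in (0, 1]
    """
    weighted = reliability * residuals
    keep = int(np.ceil((1.0 - trunc_frac) * len(weighted)))
    kept = np.sort(weighted)[:keep]                # truncate largest terms
    return kept.mean()

# Toy usage: 100 points, 5 of them gross outliers.
rng = np.random.default_rng(0)
res = np.abs(rng.normal(size=100)); res[:5] += 10.0
rel = np.clip(1.0 / (1.0 + res), 0.01, 1.0)        # low reliability on outliers
print(truncated_reliability_loss(res, rel))
```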
[538] From Redaction to Restoration: Deep Learning for Medical Image Anonymization and Reconstruction
Adrienne Kline, Abhijit Gaonkar, Daniel Pittman, Chris Kuehn, Nils Forkert
Main category: cs.CV
TL;DR: An end-to-end deep learning framework for medical image de-identification that redacts PHI regions and uses generative inpainting to restore anatomically plausible content while maintaining downstream analysis utility.
Details
Motivation: Current medical image de-identification methods often remove relevant non-identifiable information, negatively impacting downstream image analysis tasks. There's a need for methods that protect patient privacy while preserving data utility for AI applications.
Method: Hybrid architecture combining CRNN-based detection/redaction of PHI regions (burned-in text, metadata) with latent-diffusion inpainting (Stable Diffusion 2) to restore redacted areas with anatomically plausible content.
Result: The method produces visually coherent de-identified medical images that maintain fidelity for downstream models while substantially reducing re-identification risk, as validated by privacy metrics and task-based evaluations.
Conclusion: The automated pipeline enables secure sharing of medical imaging collections by balancing privacy protection with data utility, lowering barriers to multi-institutional collaboration in medical imaging AI.
Abstract: Removing patient-specific information from medical images is crucial to enable sharing and open science without compromising patient identities. However, many methods currently used for de-identification have negative effects on downstream image analysis tasks because of removal of relevant but non-identifiable information. This work presents an end-to-end deep learning framework for transforming raw clinical image volumes into de-identified, analysis-ready datasets without compromising downstream utility. The methodology developed and tested in this work first detects and redacts regions likely to contain protected health information (PHI), such as burned-in text and metadata, and then uses a generative deep learning model to inpaint the redacted areas with anatomically and imaging-plausible content. The proposed pipeline leverages a lightweight hybrid architecture, combining CRNN-based redaction with a latent-diffusion inpainting restoration module (Stable Diffusion 2). We evaluate the approach using both privacy-oriented metrics, which quantify residual PHI and success of redaction, and image-quality and task-based metrics, which assess the fidelity of restored volumes for representative deep learning applications. Our results suggest that the proposed method yields de-identified medical images that are visually coherent, maintaining fidelity for downstream models, while substantially reducing the risk of patient re-identification. By automating anonymization and image reconstruction within a single workflow, the proposed pipeline facilitates the dissemination of large-scale medical imaging collections, thereby lowering a key barrier to data sharing and multi-institutional collaboration in medical imaging AI.
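Since the restoration stage builds on Stable Diffusion 2 inpainting, the shape of that stage can be sketched with the public diffusers API. The CRNN-based PHI detector is replaced here by a hard-coded placeholder mask, and the prompt and file names are our own; none of this reproduces the paper's actual pipeline.

```python
# Sketch of the redact-then-inpaint stage using the public diffusers API.
# Requires a GPU; file names, the mask, and the prompt are placeholders.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("slice.png").convert("RGB").resize((512, 512))

# Placeholder for the detection step: mask a burned-in text banner at the top.
mask = np.zeros((512, 512), dtype=np.uint8)
mask[:64, :] = 255                       # white = region to repaint
mask = Image.fromarray(mask)

restored = pipe(
    prompt="plain anatomical background, no text",  # assumed prompt
    image=image,
    mask_image=mask,
).images[0]
restored.save("slice_deid.png")
```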
[539] ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines
Nafiseh Ghaffar Nia, Vinesh Appadurai, Suchithra V., Chinmay Rane, Daniel Pittman, James Carr, Adrienne Kline
Main category: cs.CV
TL;DR: A 3D spatiotemporal architecture (ConvFormer3D-TAP) for reliable classification of standard cine cardiac MRI views, combining 3D convolutional tokenization with multiscale self-attention to handle clinical variability.
Details
Motivation: Accurate cardiac MRI view recognition is crucial for downstream quantitative analyses, but remains challenging due to clinical variability in scanner vendor, acquisition protocol, motion artifacts, and plane prescription. Incorrect view identification can propagate errors into segmentation, volumetric assessment, strain analysis, and valve evaluation.
Method: ConvFormer3D-TAP integrates 3D convolutional tokenization with multiscale self-attention, trained using masked spatiotemporal reconstruction and uncertainty-weighted multi-clip fusion. The design captures local anatomical structure through convolutional priors and long-range cardiac-cycle dynamics through hierarchical attention.
Result: On 150,974 clinically acquired cine sequences spanning six standard views, achieved 96% validation accuracy with per-class F1-scores ≥ 0.94 and strong calibration (ECE = 0.025; Brier = 0.040). Residual confusions are concentrated in anatomically adjacent long-axis and LVOT/AV view pairs.
Conclusion: ConvFormer3D-TAP serves as a scalable front-end for view routing, filtering and quality control in end-to-end cMRI workflows, addressing the challenge of reliable view classification under clinical variability.
Abstract: Reliable recognition of standard cine cardiac MRI views is essential because each view determines which cardiac anatomy is visualized and which quantitative analyses can be performed. Incorrect view identification, whether by a human reader or an automated deep learning system, can propagate errors into segmentation, volumetric assessment, strain analysis, and valve evaluation. However, accurate view classification remains challenging under routine clinical variability in scanner vendor, acquisition protocol, motion artifacts, and plane prescription. We present ConvFormer3D-TAP, a cine-specific spatiotemporal architecture that integrates 3D convolutional tokenization with multiscale self-attention. The model is trained using masked spatiotemporal reconstruction and uncertainty-weighted multi-clip fusion to enhance robustness across cardiac phases and ambiguous temporal segments. The design captures complementary cues: local anatomical structure through convolutional priors and long-range cardiac-cycle dynamics through hierarchical attention. On a cohort of 150,974 clinically acquired cine sequences spanning six standard cine cardiac MRI views, ConvFormer3D-TAP achieved 96% validation accuracy with per-class F1-scores ≥ 0.94 and strong calibration (ECE = 0.025; Brier = 0.040). Error analysis shows that residual confusions are concentrated in anatomically adjacent long-axis and LVOT/AV view pairs, consistent with intrinsic prescription overlap. These results support ConvFormer3D-TAP as a scalable front-end for view routing, filtering and quality control in end-to-end cMRI workflows.
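Uncertainty-weighted multi-clip fusion is straightforward to illustrate: weight each clip's class probabilities by the inverse of its predictive entropy before averaging. The entropy-based weighting below is our assumption of one reasonable form, not the paper's exact scheme.

```python
import numpy as np

def fuse_clips(clip_probs, eps=1e-8):
    """Sketch of uncertainty-weighted multi-clip fusion (assumed form:
    weight each clip by the inverse of its predictive entropy).

    clip_probs: (C, K) softmax outputs for C clips over K view classes.
    """
    entropy = -(clip_probs * np.log(clip_probs + eps)).sum(axis=1)  # (C,)
    w = 1.0 / (entropy + eps)
    w = w / w.sum()
    return (w[:, None] * clip_probs).sum(axis=0)                    # (K,)

# Toy usage: three clips over six cine views; the confident clip dominates.
probs = np.array([
    [0.90, 0.02, 0.02, 0.02, 0.02, 0.02],   # confident clip
    [0.30, 0.20, 0.15, 0.15, 0.10, 0.10],   # ambiguous clip
    [0.25, 0.25, 0.20, 0.10, 0.10, 0.10],   # ambiguous clip
])
print(fuse_clips(probs).round(3))
```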
[540] Beyond Reconstruction: Reconstruction-to-Vector Diffusion for Hyperspectral Anomaly Detection
Jijun Xiang, Jiayi Wang, Pengxiang Wang, Cheng Chen, Nian Wang, Tao Wang
Main category: cs.CV
TL;DR: R2VD introduces a novel reconstruction-to-vector diffusion framework for hyperspectral anomaly detection that replaces scalar reconstruction errors with vector interference patterns to prevent sub-pixel anomaly vanishing and confirmation bias.
Details
Motivation: Existing hyperspectral anomaly detection models suffer from sub-pixel anomaly vanishing during spatial downsampling and confirmation bias when unpurified anomalies corrupt training weights, due to reliance on ambiguous scalar residuals in a "reconstruction-as-endpoint" paradigm.
Method: Four-stage pipeline: (1) Physical Prior Extraction with dual-stream statistical guidance, (2) Guided Manifold Purification using OmniContext Autoencoder, (3) Residual Score Modeling with Diffusion Transformer guarded by Physical Spectral Firewall, (4) Vector Dynamics Inference evaluating high-dimensional vector interference patterns.
Result: Comprehensive evaluations on eight datasets confirm R2VD establishes new state-of-the-art performance with exceptional target detectability and background suppression.
Conclusion: R2VD fundamentally redefines reconstruction as a manifold purification origin, establishing a residual-guided generative dynamics paradigm that robustly decouples targets from backgrounds using vector interference patterns instead of scalar errors.
Abstract: While Hyperspectral Anomaly Detection (HAD) excels at identifying sparse targets in complex scenes, existing models remain trapped in a scalar “reconstruction-as-endpoint” paradigm. This reliance on ambiguous scalar residuals consistently triggers sub-pixel anomaly vanishing during spatial downsampling, alongside severe confirmation bias when unpurified anomalies corrupt training weights. In this paper, we propose Reconstruction-to-Vector Diffusion (R2VD), which fundamentally redefines reconstruction as a manifold purification origin to establish a novel residual-guided generative dynamics paradigm. Our framework introduces a four-stage pipeline: (1) a Physical Prior Extraction (PPE) stage that mitigates early confirmation bias via dual-stream statistical guidance; (2) a Guided Manifold Purification (GMP) stage utilizing an OmniContext Autoencoder (OCA) to extract purified residual maps while preserving fragile sub-pixel topologies; (3) a Residual Score Modeling (RSM) stage where a Diffusion Transformer (DiT), guarded by a Physical Spectral Firewall (PSF), effectively isolates cross-spectral leakage; and (4) a Vector Dynamics Inference (VDI) stage that robustly decouples targets from backgrounds by evaluating high-dimensional vector interference patterns instead of conventional scalar errors. Comprehensive evaluations on eight datasets confirm that R2VD establishes a new state-of-the-art, delivering exceptional target detectability and background suppression.
[541] Video-based Heart Rate Estimation with Angle-guided ROI Optimization and Graph Signal Denoising
Gan Pei, Junhao Ning, Boqiu Shen, Yan Zhu, Menghan Hu
Main category: cs.CV
TL;DR: Two plug-and-play modules for remote photoplethysmography (rPPG) that improve heart rate measurement accuracy during facial motions using angle-guided ROI optimization and graph-based signal denoising.
Details
Motivation: Remote photoplethysmography (rPPG) performance degrades significantly during facial motions like speaking and head shaking, limiting its practical applications in real-world scenarios.
Method: Proposes two modules: 1) Angle-guided ROI Adaptive Optimization module that quantifies ROI-Camera angles to refine motion-affected signals and capture global motion, and 2) Multi-region Joint Graph Signal Denoising module that jointly models intra- and inter-regional ROI signals using graph signal processing to suppress motion artifacts.
Result: Joint use of both modules reduces MAE by an average of 20.38% over baseline on three public datasets. Ablation studies confirm the effectiveness of each individual module.
Conclusion: The work demonstrates the potential of angle-guided optimization and graph-based denoising to enhance rPPG performance in motion scenarios, making non-contact heart rate measurement more robust to facial motions.
Abstract: Remote photoplethysmography (rPPG) enables non-contact heart rate measurement from facial videos, but its performance is significantly degraded by facial motions such as speaking and head shaking. To address this issue, we propose two plug-and-play modules. The Angle-guided ROI Adaptive Optimization module quantifies ROI-Camera angles to refine motion-affected signals and capture global motion, while the Multi-region Joint Graph Signal Denoising module jointly models intra- and inter-regional ROI signals using graph signal processing to suppress motion artifacts. The modules are compatible with reflection model-based rPPG methods and validated on three public datasets. Results show that joint use of the two modules markedly reduces MAE, with an average decrease of 20.38% over the baseline, while ablation studies confirm the effectiveness of each module. The work demonstrates the potential of angle-guided optimization and graph-based denoising to enhance rPPG performance in motion scenarios.
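The graph-denoising idea reduces to a classic graph-signal operation: build a ROI similarity graph and apply Tikhonov smoothing, x* = (I + λL)^(-1) y, with L the graph Laplacian. The correlation-based adjacency and the toy pulse below are our simplification of the paper's joint intra/inter-regional model.

```python
import numpy as np

def graph_denoise(signals, lam=0.5):
    """Sketch of multi-ROI graph signal denoising via Tikhonov smoothing.

    signals: (R, T) raw rPPG traces from R facial ROIs.
    Solves x* = (I + lam * L)^(-1) y per time step, with L built from
    inter-ROI correlation (our simplification of the paper's model).
    """
    R = signals.shape[0]
    A = np.clip(np.corrcoef(signals), 0, None)     # (R, R) similarity
    np.fill_diagonal(A, 0.0)
    L = np.diag(A.sum(axis=1)) - A                 # graph Laplacian
    return np.linalg.solve(np.eye(R) + lam * L, signals)

# Toy usage: a shared pulse plus ROI-independent motion noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 300)
pulse = np.sin(2 * np.pi * 1.2 * t)                # ~72 bpm
noisy = pulse[None, :] + 0.5 * rng.normal(size=(6, 300))
clean = graph_denoise(noisy)

def mae(x):
    return np.abs(x - pulse).mean()

print(f"noisy MAE {mae(noisy):.3f} -> denoised MAE {mae(clean):.3f}")
```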
[542] GS4City: Hierarchical Semantic Gaussian Splatting via City-Model Priors
Qilin Zhang, Jinyu Zhu, Olaf Wysocki, Benjamin Busam, Boris Jutzi
Main category: cs.CV
TL;DR: GS4City integrates CityGML city models with 3D Gaussian Splatting for hierarchical semantic urban scene understanding, outperforming 2D-driven methods.
Details
Motivation: Existing semantic 3DGS methods rely on 2D foundation models, resulting in ambiguous boundaries and limited structured urban semantics. City models like CityGML encode hierarchical semantics but can't be directly mapped to Gaussian primitives.
Method: Uses two-pass raycasting to derive reliable image-aligned masks from LoD3 CityGML models, leveraging parent-child relations to validate facade elements. Fuses geometry-grounded masks with foundation-model predictions for scene-consistent instance correspondences, and learns compact identity encoding for each Gaussian with joint 2D identity supervision and 3D spatial regularization.
Result: Outperforms existing 2D-driven semantic 3DGS baselines (LangSplat, Gaga) by up to 15.8 IoU points in coarse building segmentation and 14.2 mIoU points in fine-grained semantic segmentation on TUM2TWIN and Gold Coast datasets.
Conclusion: GS4City effectively incorporates structured building semantics into Gaussian scene representations, enabling semantically queryable and structure-aware urban reconstruction by bridging city models and photorealistic Gaussian representations.
Abstract: Recent semantic 3D Gaussian Splatting (3DGS) methods primarily rely on 2D foundation models, often yielding ambiguous boundaries and limited support for structured urban semantics. While city models such as CityGML encode hierarchically organized semantics together with building geometry, these labels cannot be directly mapped to Gaussian primitives. We present GS4City, a hierarchical semantic Gaussian Splatting method that incorporates city-model priors for urban scene understanding. GS4City derives reliable image-aligned masks from Level of Detail (LoD) 3 CityGML models via two-pass raycasting, explicitly using parent-child relations to validate and recover fine-grained facade elements. It then fuses these geometry-grounded masks with foundation-model predictions to establish scene-consistent instance correspondences, and learns a compact identity encoding for each Gaussian under joint 2D identity supervision and 3D spatial regularization. Experiments on the TUM2TWIN and Gold Coast datasets show that GS4City effectively incorporates structured building semantics into Gaussian scene representations, outperforming existing 2D-driven semantic 3DGS baselines, including LangSplat and Gaga, by up to 15.8 IoU points in coarse building segmentation and 14.2 mIoU points in fine-grained semantic segmentation. By bridging structured city models and photorealistic Gaussian scene representations, GS4City enables semantically queryable and structure-aware urban reconstruction. Code is available at https://github.com/Jinyzzz/GS4City.
[543] Scene Change Detection with Vision-Language Representation Learning
Diwei Sheng, Vijayraj Gohil, Satyam Gaba, Zihan Liu, Giles Hamilton-Fletcher, John-Ross Rizzo, Yongqing Liang, Chen Feng
Main category: cs.CV
TL;DR: LangSCD is a vision-language framework for scene change detection that uses language models to generate textual descriptions of changes, fused with visual features through cross-modal enhancement, achieving SOTA performance on urban scene change detection benchmarks.
Details
Motivation: Existing scene change detection methods rely on low-level visual features and struggle with real-world complexities like lighting variations, seasonal shifts, and viewpoint differences. Binary change annotations in current datasets are insufficient for fine-grained understanding needed in urban monitoring applications.
Method: Proposes LangSCD with: 1) modular language component using VLMs to generate textual descriptions of scene changes, 2) cross-modal feature enhancer fusing language and visual features, 3) geometric-semantic matching module refining masks with semantic consistency and spatial completeness, and 4) NYC-CD dataset with 8,122 real-world image pairs and multiclass annotations.
Result: Extensive experiments show language and matching modules consistently improve existing change-detection architectures, achieving state-of-the-art performance across multiple street-view benchmarks. The NYC-CD dataset provides a valuable resource for fine-grained scene change analysis.
Conclusion: Integrating linguistic reasoning with visual representations enables robust scene change detection in complex urban environments. The vision-language approach overcomes limitations of single-modal methods and provides semantic understanding beyond binary change detection.
Abstract: Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely primarily on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose LangSCD, a vision-language framework for scene change detection that overcomes this single-modal limitation by incorporating semantic reasoning through language. Our approach introduces a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of scene changes, which are fused with visual features through a cross-modal feature enhancer. We further introduce a geometric-semantic matching module that refines the predicted masks by enforcing semantic consistency and spatial completeness. Existing real-world scene change detection benchmarks provide only binary change annotations, which are insufficient for downstream applications requiring fine-grained understanding of scene dynamics. To address this limitation, we introduce NYC-CD, a large-scale dataset of 8,122 real-world image pairs collected in New York City with multiclass change annotations generated through a semi-automatic pipeline. Extensive experiments across multiple street-view benchmarks demonstrate that our language and matching modules consistently improve existing change-detection architectures, achieving state-of-the-art performance and highlighting the value of integrating linguistic reasoning with visual representations for robust scene change detection.
[544] Online Reasoning Video Object Segmentation
Jinyuan Liu, Yang Wang, Zeyu Zhao, Weixin Li, Song Wang, Ruize Han
Main category: cs.CV
TL;DR: Online Reasoning Video Object Segmentation (ORVOS) addresses the gap between offline video segmentation methods and real-world causal requirements, introducing a benchmark and baseline for frame-by-frame segmentation with natural language queries.
Details
Motivation: Existing reasoning video object segmentation methods operate offline with access to entire videos, enabling retrospective disambiguation that doesn't match real-world deployments requiring strictly causal, frame-by-frame decisions without revisiting previous frames.
Method: Proposes ORVOSB benchmark with frame-level causal annotations and referent-shift labels (210 videos, 12,907 frames, 512 queries across 5 reasoning categories). Introduces baseline with continually-updated segmentation prompts and structured temporal token reservoir for long-horizon reasoning under bounded computation.
Result: Experiments show existing methods struggle under strict causality and referent shifts, while the proposed baseline establishes a strong foundation for future research in online reasoning video object segmentation.
Conclusion: ORVOS addresses the practical gap in video segmentation by introducing causal evaluation and methods, enabling real-world deployment of language-guided video understanding systems that must make incremental decisions without future information.
Abstract: Reasoning video object segmentation predicts pixel-level masks in videos from natural-language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime, where the entire video is available at inference time and future frames can be exploited for retrospective disambiguation, deviating from real-world deployments that require strictly causal, frame-by-frame decisions. We study Online Reasoning Video Object Segmentation (ORVOS), where models must incrementally interpret queries using only past and current frames without revisiting previous predictions, while handling referent shifts as events unfold. To support evaluation, we introduce ORVOSB, a benchmark with frame-level causal annotations and referent-shift labels, comprising 210 videos, 12,907 annotated frames, and 512 queries across five reasoning categories. We further propose a baseline with continually-updated segmentation prompts and a structured temporal token reservoir for long-horizon reasoning under bounded computation. Experiments show that existing methods struggle under strict causality and referent shifts, while our baseline establishes a strong foundation for future research.
[545] Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan
Main category: cs.CV
Summary unavailable: the arXiv API request for 2505.23606 was rate-limited (HTTP 429), so no TL;DR, details, or abstract could be retrieved for this entry.
[546] Observe Less, Understand More: Cost-aware Cross-scale Observation for Remote Sensing Understanding
Zhenghao Xie, Jing Xiao, Zhenqi Wang, Kexin Ma, Liang Liao, Gui-Song Xia, Mi Wang
Main category: cs.CV
TL;DR: A unified framework for cross-scale remote sensing that combines fine-grained high-resolution sampling with cross-patch representation prediction to optimize task performance under cost constraints, evaluated on a new 10M-image benchmark.
Details
Motivation: Remote sensing requires multi-resolution observation since different targets need different spatial detail levels. Low-resolution imagery enables efficient global observation but lacks local details, while high-resolution provides critical details at higher cost and limited coverage. Existing HR sampling methods make decisions from isolated LR patches, ignoring fine-grained intra-patch importance and cross-patch contextual interactions, leading to fragmented features and suboptimal reasoning under sparse HR observations.
Method: Formulates cross-scale remote sensing understanding as a unified cost-aware problem that couples fine-grained HR sampling with cross-patch representation prediction. Introduces GL-10M benchmark with 10 million spatially aligned multi-resolution images for systematic evaluation of budget-constrained cross-scale reasoning.
Result: Extensive experiments on recognition and retrieval tasks show the method consistently achieves superior performance-cost trade-off compared to existing approaches.
Conclusion: The proposed unified framework enables more effective task reasoning with fewer HR observations by addressing limitations of existing isolated patch-based sampling methods through fine-grained importance assessment and cross-patch contextual modeling.
Abstract: Remote sensing understanding inherently requires multi-resolution observation, since different targets and application tasks demand different levels of spatial detail. While low-resolution (LR) imagery enables efficient global observation, high-resolution (HR) imagery provides critical local details at much higher acquisition cost and limited coverage. This motivates a cross-scale sensing strategy that selectively acquires HR imagery from LR-based global perception to improve task performance under constrained cost. Existing HR sampling methods typically make selection decisions from isolated LR patches, which ignore fine-grained intra-patch importance and cross-patch contextual interactions, leading to fragmented feature representation and suboptimal scene reasoning under sparse HR observations. To address this issue, we formulate cross-scale remote sensing understanding as a unified cost-aware problem that couples fine-grained HR sampling with cross-patch representation prediction, enabling more effective task reasoning with fewer HR observations. Furthermore, we present GL-10M, a large-scale benchmark of 10 million spatially aligned multi-resolution images, enabling systematic evaluation of budget-constrained cross-scale reasoning in remote sensing. Extensive experiments on recognition and retrieval tasks show that our method consistently achieves a superior performance-cost trade-off.
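Stripped of its representation-prediction half, the acquisition side is a budgeted selection problem; here is a greedy value-per-cost sketch in numpy. The greedy rule and all names are illustrative only; the paper couples selection with cross-patch representation prediction rather than scoring patches independently.

```python
import numpy as np

def select_hr_patches(importance, cost, budget):
    """Greedy sketch of budget-constrained HR patch acquisition.

    importance: (P,) predicted task value of acquiring each patch in HR
    cost:       (P,) acquisition cost per patch
    budget:     total cost allowed
    """
    order = np.argsort(-importance / cost)       # best value-per-cost first
    chosen, spent = [], 0.0
    for p in order:
        if spent + cost[p] <= budget:
            chosen.append(int(p))
            spent += cost[p]
    return chosen, spent

# Toy usage: 100 candidate patches under a budget of 10 cost units.
rng = np.random.default_rng(0)
imp, c = rng.random(100), rng.uniform(0.5, 2.0, 100)
patches, spent = select_hr_patches(imp, c, budget=10.0)
print(len(patches), "patches,", round(spent, 2), "cost units")
```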
[547] HuiYanEarth-SAR: A Foundation Model for High-Fidelity and Low-Cost Global Remote Sensing Imagery Generation
Yongxiang Liu, Jie Zhou, Yafei Song, Tianpeng Liu, Li Liu
Main category: cs.CV
TL;DR: First foundational SAR imagery generation model that creates high-fidelity synthetic aperture radar images for global locations using geographic coordinates, integrating geospatial priors and scattering mechanisms.
Details
Motivation: SAR imagery generation is crucial for studying scattering mechanisms, building electromagnetic scene models, and addressing data scarcity. Existing methods struggle to maintain both global geospatial semantics and microscopic scattering fidelity simultaneously.
Method: Proposes HuiYanEarth-SAR based on AlphaEarth with integrated scattering mechanisms. Uses geospatial priors to control macroscopic structures and implicit scattering characteristic modeling for microscopic texture authenticity. Generates SAR images solely from geographic coordinates.
Result: Generates high-fidelity SAR images for global locations from geographic coordinates alone, yielding an efficient SAR scene simulator that bridges geography, scattering mechanisms, and AI.
Conclusion: Advances SAR research from perception/understanding to simulation/creation, provides key technical support for constructing high-confidence digital twin Earth, and establishes new paradigm connecting geography, scattering mechanisms, and AI.
Abstract: Synthetic Aperture Radar (SAR) imagery generation is essential for deepening the study of scattering mechanisms, establishing trustworthy electromagnetic scene models, and fundamentally alleviating the data scarcity bottleneck that constrains development in this field. However, existing methods find it difficult to simultaneously ensure high fidelity in both global geospatial semantics and microscopic scattering mechanisms, resulting in severe challenges for global generation. To address this, we propose HuiYanEarth-SAR, the first foundational SAR imagery generation model based on AlphaEarth and integrated scattering mechanisms. By injecting geospatial priors to control macroscopic structures and utilizing implicit scattering characteristic modeling to ensure the authenticity of microscopic textures, we achieve the capability of generating high-fidelity SAR images for global locations solely based on geographic coordinates. This study not only constructs an efficient SAR scene simulator but also establishes a bridge connecting geography, scattering mechanisms, and artificial intelligence from a methodological standpoint. It advances SAR research by expanding the paradigm from perception and understanding to simulation and creation, providing key technical support for constructing a high-confidence digital twin of the Earth.
[548] Beyond Model Design: Data-Centric Training and Self-Ensemble for Gaussian Color Image Denoising
Gengjia Chang, Xining Ge, Weijun Yuan, Zhan Li, Qiurong Song, Luen Zhu, Shuhong Liu
Main category: cs.CV
TL;DR: The paper presents a solution to the NTIRE 2026 Image Denoising Challenge using enhanced Restormer architecture with stronger data-centric training and test-time self-ensemble techniques.
Details
Motivation: To push the performance boundaries of mature Restormer architecture for image denoising by exploring complementary directions: stronger data-centric training and more complete test-time capability release, rather than proposing new restoration backbones.
Method: Enhances Restormer baseline with expanded training corpus (larger and more diverse public image datasets), two-stage optimization, and ×8 geometric self-ensemble at inference. Retains TLC-style local inference wrapper for consistency.
Result: Achieves 30.762 dB PSNR and 0.861 SSIM on challenge validation set, improving over baseline Restormer by up to 3.366 dB PSNR. Dominant gains from expanded training corpus and two-stage optimization, with self-ensemble providing marginal but consistent improvement.
Conclusion: Data-centric training enhancements and test-time self-ensemble can significantly boost performance of mature restoration architectures without requiring new backbone designs, demonstrating the importance of training strategies and inference optimization.
Abstract: This paper presents our solution to the NTIRE 2026 Image Denoising Challenge (Gaussian color image denoising at fixed noise level σ = 50). Rather than proposing a new restoration backbone, we revisit the performance boundary of the mature Restormer architecture from two complementary directions: stronger data-centric training and more complete test-time capability release. Starting from the public Restormer σ = 50 baseline, we expand the standard multi-dataset training recipe with larger and more diverse public image corpora and organize optimization into two stages. At inference, we apply ×8 geometric self-ensemble to further release model capacity. A TLC-style local inference wrapper is retained for implementation consistency; however, systematic ablation reveals its quantitative contribution to be negligible in this setting. On the challenge validation set of 100 images, our final submission achieves 30.762 dB PSNR and 0.861 SSIM, improving over the public Restormer σ = 50 pretrained baseline by up to 3.366 dB PSNR. Ablation studies show that the dominant gain originates from the expanded training corpus and the two-stage optimization schedule, and self-ensemble provides marginal but consistent improvement.
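The ×8 geometric self-ensemble used here is a standard test-time trick: run the model on all eight dihedral transforms of the input (four rotations times an optional flip), invert each transform on the output, and average. A minimal numpy sketch, independent of any particular backbone:

```python
import numpy as np

def self_ensemble_x8(model, img):
    """Standard ×8 geometric self-ensemble: average the model's outputs
    over the 8 dihedral transforms of the input (4 rotations x flip).
    `model` maps an HxWxC array to an HxWxC array.
    """
    outs = []
    for flip in (False, True):
        x = img[:, ::-1] if flip else img
        for k in range(4):
            y = model(np.rot90(x, k))
            y = np.rot90(y, -k)                 # undo the rotation
            outs.append(y[:, ::-1] if flip else y)  # undo the flip
    return np.mean(outs, axis=0)

# Toy usage with an identity "denoiser": the ensemble returns the input.
img = np.random.default_rng(0).random((64, 64, 3))
out = self_ensemble_x8(lambda x: x, img)
print(np.allclose(out, img))  # True
```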
[549] Degradation-Aware and Structure-Preserving Diffusion for Real-World Image Super-Resolution
Yang Ji, Zonghao Chen, Zhihao Xue, Junqin Hu
Main category: cs.CV
TL;DR: A diffusion-based real-world image super-resolution framework that uses degradation-aware token injection and spatially asymmetric noise injection to handle complex real degradations while preserving structural details.
Details
Motivation: Real-world image super-resolution is challenging for diffusion models because real degradations are complex, heterogeneous, and rarely modeled explicitly. Existing methods struggle with handling diverse real-world degradation patterns while preserving structural details.
Method: Proposes two key modules: 1) Degradation-aware Token Injection - encodes lightweight degradation statistics from low-resolution inputs and fuses them with semantic conditioning features for explicit degradation-aware restoration. 2) Spatially Asymmetric Noise Injection - modulates diffusion noise with local edge strength to better preserve structural regions during training. Both are lightweight add-ons to the diffusion SR framework.
Result: Experiments on DIV2K and RealSR datasets show competitive no-reference perceptual quality and visually more realistic restoration results than recent baselines, while maintaining favorable perception-distortion trade-off. Ablations confirm effectiveness of each module and their complementary gains.
Conclusion: The proposed degradation-aware and structure-preserving diffusion framework effectively handles real-world SR challenges by explicitly modeling degradations and preserving structural details through lightweight modifications to diffusion conditioning.
Abstract: Real-world image super-resolution is particularly challenging for diffusion models because real degradations are complex, heterogeneous, and rarely modeled explicitly. We propose a degradation-aware and structure-preserving diffusion framework for real-world SR. Specifically, we introduce Degradation-aware Token Injection, which encodes lightweight degradation statistics from low-resolution inputs and fuses them with semantic conditioning features, enabling explicit degradation-aware restoration. We further propose Spatially Asymmetric Noise Injection, which modulates diffusion noise with local edge strength to better preserve structural regions during training. Both modules are lightweight add-ons to the adopted diffusion SR framework, requiring only minor modifications to the conditioning pipeline. Experiments on DIV2K and RealSR show that our method delivers competitive no-reference perceptual quality and visually more realistic restoration results than recent baselines, while maintaining a favorable perception–distortion trade-off. Ablations confirm the effectiveness of each module and their complementary gains when combined. The code and model are publicly available at https://github.com/jiyang0315/DASP-SR.git.
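The Spatially Asymmetric Noise Injection idea lends itself to a short sketch. The version below attenuates per-pixel Gaussian noise by Sobel edge strength; the exact schedule, the `alpha` weighting, and the helper names are assumptions rather than the authors' implementation:

```python
import numpy as np
from scipy import ndimage

def edge_strength(img):
    gx = ndimage.sobel(img, axis=0)
    gy = ndimage.sobel(img, axis=1)
    mag = np.hypot(gx, gy)
    return mag / (mag.max() + 1e-8)              # normalized to [0, 1]

def asymmetric_noise(img, sigma=0.5, alpha=0.7):
    # attenuate noise where edges are strong: sigma * (1 - alpha * edge)
    e = edge_strength(img)
    return img + np.random.randn(*img.shape) * sigma * (1.0 - alpha * e)

noisy = asymmetric_noise(np.random.rand(64, 64))
```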
[550] PACO: Proxy-Task Alignment and Online Calibration for On-the-Fly Category Discovery
Weidong Tang, Bohan Zhang, Zhixiang Chi, ZiZhang Wu, Yang Wang, Yanan Wu
Main category: cs.CV
TL;DR: PACO: A support-set-calibrated, tree-structured online decision framework for On-the-Fly Category Discovery that improves category formation through adaptive thresholds and hierarchical decisions during inference.
Details
Motivation: Existing OCD methods focus too much on offline training and use static thresholds at inference, treating OCD as a static classification problem rather than a dynamic process. They don't update decision boundaries during inference, leading to unstable category formation.
Method: Proposes PACO framework with hierarchical decision tree: known-class routing, birth-aware novel assignment, and attach-versus-create operations over dynamic prototype memory. Simulates proxy discovery during offline training to initialize thresholds, which are continuously updated during inference using mature novel prototypes.
Result: Significant improvements over state-of-the-art baselines across seven benchmarks, showing that properly calibrated and adaptive thresholds can substantially improve performance even without changing the underlying representation.
Conclusion: PACO provides an effective inference-time module for OCD that addresses fundamental flaws in existing approaches by treating category discovery as a dynamic process with adaptive decision-making, requiring no heavy training or dataset-specific tuning.
Abstract: On-the-Fly Category Discovery (OCD) requires a model, trained on an offline support set, to recognize known classes while discovering new ones from an online streaming sequence. Existing methods focus heavily on offline training. They aim to learn discriminative representations on the support set so that novel classes can be separated at test time. However, their discovery mechanism at inference is typically reduced to a single threshold. We argue that this paradigm is fundamentally flawed as OCD is not a static classification problem, but a dynamic process. The model must continuously decide 1) whether a sample belongs to a known class, 2) matches an existing novel category, or 3) should initiate a new one. Moreover, prior methods treat the support set as fixed knowledge. They do not update their decision boundaries as new evidence arrives during inference. This leads to unstable and inconsistent category formation. Our experiments confirm these issues. With properly calibrated and adaptive thresholds, substantial improvements can be achieved, even without changing the representation. Motivated by this, we propose PACO, a support-set-calibrated, tree-structured online decision framework. The framework models inference as a sequence of hierarchical decisions, including known-class routing, birth-aware novel assignment, and attach-versus-create operations over a dynamic prototype memory. Furthermore, we simulate the proxy discovery process to initialize the thresholds during offline training to align with inference. Thresholds are continuously updated during inference using mature novel prototypes. Importantly, PACO requires no heavy training and no dataset-specific tuning. It can be directly integrated into existing OCD pipelines as an inference-time module. Extensive experiments show significant improvements over SOTA baselines across seven benchmarks.
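The attach-versus-create step over a dynamic prototype memory can be sketched compactly. The recalibration rule below (nudging the threshold toward the maximum similarity between "mature" prototypes) is a hypothetical stand-in for PACO's calibration, shown only to make the decision flow concrete:

```python
import numpy as np

class PrototypeMemory:
    """Dynamic prototype memory with an adaptive attach/create threshold."""
    def __init__(self, tau_init=0.7):
        self.prototypes, self.counts = [], []
        self.tau = tau_init

    def step(self, z):  # z: L2-normalized feature of one streaming sample
        if self.prototypes:
            sims = np.array([p @ z for p in self.prototypes])
            j = int(sims.argmax())
            if sims[j] >= self.tau:                  # attach to category j
                n = self.counts[j]
                p = (n * self.prototypes[j] + z) / (n + 1)
                self.prototypes[j] = p / np.linalg.norm(p)
                self.counts[j] += 1
                self._recalibrate()
                return j
        self.prototypes.append(z)                    # create a new category
        self.counts.append(1)
        return len(self.prototypes) - 1

    def _recalibrate(self):
        # hypothetical rule: nudge tau toward the max similarity between
        # mature prototypes (count >= 5), keeping categories separable
        mature = [p for p, n in zip(self.prototypes, self.counts) if n >= 5]
        if len(mature) >= 2:
            M = np.stack(mature)
            off = (M @ M.T)[~np.eye(len(M), dtype=bool)]
            self.tau = 0.5 * (self.tau + off.max())

mem = PrototypeMemory()
z = np.random.randn(16); z /= np.linalg.norm(z)
print(mem.step(z))  # 0: the first sample founds the first novel category
```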
[551] NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild
Aleksandr Gushchin, Khaled Abud, Ekaterina Shumitskaya, Artem Filippov, Georgii Bychkov, Sergey Lavrushkin, Mikhail Erofeev, Anastasia Antsiferova, Changsheng Chen, Shunquan Tan, Radu Timofte, Dmitry Vatolin, Chuanbiao Song, Zijian Yu, Hao Tan, Jun Lan, Zhiqiang Yang, Yongwei Tang, Zhiqiang Wu, Jia Wen Seow, Hong Vin Koay, Haodong Ren, Feng Xu, Shuai Chen, Ruiyang Xia, Qi Zhang, Yaowen Xu, Zhaofan Zou, Hao Sun, Dagong Lu, Mufeng Yao, Xinlei Xu, Fei Wu, Fengjun Guo, Cong Luo, Hardik Sharma, Aashish Negi, Prateek Shaily, Jayant Kumar, Sachin Chaudhary, Akshay Dudhane, Praful Hambarde, Amit Shukla, Zhilin Tu, Fengpeng Li, Jiamin Zhang, Jianwei Fei, Kemou Li, Haiwei Wu, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Chenfan Qu, Junchi Li
Main category: cs.CV
TL;DR: NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild, focusing on developing models that can distinguish real from AI-generated images under realistic transformations like cropping, resizing, compression, and blurring.
Details
Motivation: As AI-generated images become more prevalent and realistic, there's a growing need for robust detection methods that work in real-world scenarios where images undergo various transformations for practical usage.
Method: Challenge-based approach using a novel dataset of 108,750 real and 185,750 AI-generated images from 42 different generators, augmented with 36 image transformations. Participants developed detection models evaluated using ROC AUC on transformed and untransformed test images.
Result: 511 participants registered with 20 teams submitting valid solutions. The challenge produced state-of-the-art detection methods robust to real-world image transformations.
Conclusion: The challenge successfully advanced robust AI-generated image detection methods that work under realistic conditions, providing valuable benchmarks and solutions for practical applications.
Abstract: This paper presents an overview of the NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild, held in conjunction with the NTIRE workshop at CVPR 2026. The goal of this challenge was to develop detection models capable of distinguishing real images from generated ones in realistic scenarios: the images are often transformed (cropped, resized, compressed, blurred) for practical usage, and therefore, the detection models should be robust to such transformations. The challenge is based on a novel dataset consisting of 108,750 real and 185,750 AI-generated images from 42 generators comprising a large variety of open-source and closed-source models of various architectures, augmented with 36 image transformations. Methods were evaluated using ROC AUC on the full test set, including both transformed and untransformed images. A total of 511 participants registered, with 20 teams submitting valid final solutions. This report provides a comprehensive overview of the challenge, describes the proposed solutions, and can be used as a valuable reference for researchers and practitioners in increasing the robustness of the detection models to real-world transformations.
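The evaluation protocol, ROC AUC pooled over clean and transformed test images, is straightforward to mirror with scikit-learn; the detector scores below are synthetic stand-ins, not challenge data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)               # 1 = AI-generated
scores_clean = labels + 0.8 * rng.normal(size=200)  # detector scores, clean images
scores_xform = labels + 1.2 * rng.normal(size=200)  # transforms blur the signal

y = np.concatenate([labels, labels])                # pool clean + transformed
s = np.concatenate([scores_clean, scores_xform])
print(f"ROC AUC over the full test set: {roc_auc_score(y, s):.3f}")
```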
[552] TAG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition
Imtiaz Ul Hassan, Nik Bessis, Ardhendu Behera
Main category: cs.CV
TL;DR: TAG-Head is a lightweight spatio-temporal graph head that upgrades standard 3D backbones for fine-grained human action recognition using only RGB input, achieving state-of-the-art performance without extra modalities.
Details
Motivation: Fine-grained human action recognition is challenging due to subtle spatio-temporal differences between visually similar actions. Existing multimodal approaches require extra modalities (pose, text, optical flow) which increase annotation burden and computational cost. The authors aim to develop an RGB-only solution that matches or surpasses multimodal performance.
Method: TAG-Head combines a Transformer encoder with learnable 3D positional encodings to capture long-range dependencies, followed by a spatio-temporal graph with two types of edges: (1) fully-connected intra-frame edges to resolve subtle appearance differences within frames, and (2) time-aligned temporal edges connecting features at same spatial locations across frames to stabilize motion cues without over-smoothing. The head is lightweight, plug-and-play across various 3D backbones (SlowFast, R(2+1)D-34, I3D), and trained end-to-end.
Result: TAG-Head achieves state-of-the-art performance among RGB-only models on FineGym (Gym99 and Gym288) and HAA500 datasets. It surpasses many recent multimodal approaches that use privileged information (video + pose + text). The design is compact with minimal parameter/FLOP overhead and low latency.
Conclusion: TAG-Head advances fine-grained human action recognition by explicitly coupling global context with high-resolution spatial interactions and low-variance temporal continuity in a slim, composable graph head. The RGB-only approach enables practical adoption in systems favoring simple sensor setups while delivering performance gains typically associated with heavier multimodal models.
Abstract: Fine-grained human action recognition (FHAR) is challenging because visually similar actions differ by subtle spatio-temporal cues. Many recent systems enhance discriminability with extra modalities (e.g., pose, text, optical flow), but this increases annotation burden and computational cost. We introduce TAG-Head, a lightweight spatio-temporal graph head that upgrades standard 3D backbones (SlowFast, R(2+1)D-34, I3D, etc.) for FHAR using RGB only. Our pipeline first applies a Transformer encoder with learnable 3D positional encodings to the backbone tokens, capturing long-range dependencies across space and time. The resulting features are then refined by a graph in which (i) fully-connected intra-frame edges resolve subtle appearance differences within frames, and (ii) time-aligned temporal edges connect features at the same spatial location across frames to stabilise motion cues without over-smoothing. The head is compact (little parameter/FLOP overhead), plug-and-play across backbones, and trained end-to-end with the backbone. Extensive evaluations on FineGym (Gym99 and Gym288) and HAA500 show that TAG-Head sets a new state-of-the-art among RGB-only models and surpasses many recent multimodal approaches (video + pose + text) that rely on privileged information. Ablations disentangle the contributions of the Transformer and the graph topology, and complexity analyses confirm low latency. TAG-Head advances FHAR by explicitly coupling global context with high-resolution spatial interactions and low-variance temporal continuity inside a slim, composable graph head. The simplicity of the design enables straightforward adoption in practical systems that favour RGB-only sensors, while delivering performance gains typically associated with heavier or multimodal models. Code will be released on GitHub.
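The two edge types are easy to picture as an adjacency matrix. The sketch below builds fully-connected intra-frame blocks plus same-location temporal links under a frame-major token ordering; it illustrates the graph topology only, not TAG-Head's learned layers:

```python
import numpy as np

def tag_adjacency(T, S):
    # T frames, S spatial tokens per frame; node order is frame-major
    n = T * S
    A = np.zeros((n, n))
    for t in range(T):
        f = slice(t * S, (t + 1) * S)
        A[f, f] = 1.0                            # intra-frame: fully connected
        if t + 1 < T:
            for s in range(S):                   # temporal: same spatial slot
                i, j = t * S + s, (t + 1) * S + s
                A[i, j] = A[j, i] = 1.0
    np.fill_diagonal(A, 0.0)
    return A

A = tag_adjacency(T=4, S=9)                      # 4 frames over a 3x3 grid
```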
[553] SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models
Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa
Main category: cs.CV
TL;DR: SVD-Prune: A training-free token pruning method for Vision-Language Models using Singular Value Decomposition to select top-K tokens based on statistical leverage scores, reducing computational demands while preserving essential visual information.
Details
Motivation: Vision-Language Models face high computational and memory demands from processing long vision token sequences. Existing pruning methods using local heuristics suffer from positional bias and information dispersion, especially at high pruning ratios, leading to performance degradation on visually detailed images.
Method: Proposes SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects top-K tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved.
Result: Experiments show SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with only 32 and 16 vision tokens.
Conclusion: SVD-Prune effectively addresses computational challenges in VLMs by preserving essential visual content through global variance-based token selection, enabling efficient multimodal processing without performance degradation.
Abstract: Vision-Language Models (VLMs) have revolutionized multimodal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-K tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.
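The leverage-score selection can be sketched in a few lines, assuming (as one reading of the abstract) that a token's leverage is the squared row norm of the top-r left singular vectors; the rank choice `r` here is an assumption:

```python
import numpy as np

def svd_prune(tokens, k, rank=None):
    # tokens: (N, d) vision-token features; keep the k highest-leverage rows
    U, S, Vt = np.linalg.svd(tokens, full_matrices=False)
    r = rank if rank is not None else min(k, U.shape[1])
    leverage = (U[:, :r] ** 2).sum(axis=1)     # row leverage w.r.t. top-r subspace
    keep = np.sort(np.argsort(leverage)[-k:])  # top-k tokens, original order kept
    return tokens[keep], keep

pruned, idx = svd_prune(np.random.randn(576, 1024), k=32)
print(pruned.shape)  # (32, 1024)
```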
[554] CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
Sohwi Lim, Lee Hyoseok, Jungjoon Park, Tae-Hyun Oh
Main category: cs.CV
TL;DR: CLAY is an adaptive image similarity method that uses pretrained Vision-Language Models to create text-conditional similarity spaces without retraining, enabling multi-conditioned retrieval with fixed visual embeddings.
Details
Motivation: Current image retrieval systems use fixed similarity metrics that cannot adapt to users' subjective interests or incorporate multiple conditions simultaneously, failing to reflect the flexible nature of human visual perception.
Method: Reframes pretrained VLM embedding spaces as text-conditional similarity spaces without additional training, separating textual conditioning from visual feature extraction to enable efficient multi-conditioned retrieval with fixed visual embeddings.
Result: Achieves high retrieval accuracy and notable computational efficiency on standard datasets and the proposed CLAY-EVAL synthetic evaluation dataset across diverse conditioned retrieval settings.
Conclusion: CLAY provides an effective solution for adaptive, multi-conditioned image retrieval by leveraging pretrained VLMs without retraining, offering both accuracy and efficiency advantages over previous methods.
Abstract: Human perception of visual similarity is inherently adaptive and subjective, depending on the users’ interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process and visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves high retrieval accuracy and notable computational efficiency compared to previous works.
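One speculative reading of a "text-conditional similarity space" is to compare image embeddings through their coordinates along condition-text directions. The sketch below illustrates that reading only; it is not CLAY's actual formulation:

```python
import numpy as np

def conditional_similarity(v1, v2, text_dirs):
    # text_dirs: (m, d) embeddings of the condition phrases
    T = text_dirs / np.linalg.norm(text_dirs, axis=1, keepdims=True)
    p1, p2 = T @ v1, T @ v2          # image features seen through the conditions
    return p1 @ p2 / (np.linalg.norm(p1) * np.linalg.norm(p2) + 1e-8)

d = 512
v1, v2 = np.random.randn(d), np.random.randn(d)
conds = np.random.randn(3, d)        # e.g. embeddings of "color", "texture", "layout"
print(conditional_similarity(v1, v2, conds))
```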
[555] Progressively Texture-Aware Diffusion for Contrast-Enhanced Sparse-View CT
Tianqi Wang, Wenchao Du, Hongyu Yang
Main category: cs.CV
TL;DR: A Progressively Texture-aware Diffusion (PTD) model for sparse-view CT reconstruction that combines deterministic mapping for coarse content recovery with conditional diffusion for high-fidelity texture generation.
Details
Motivation: While diffusion-based sparse-view CT imaging has advanced, recovering reliable image content and visually consistent textures remains challenging. Current methods struggle with balancing visual quality and fidelity of high-frequency details while dealing with randomness inherent in diffusion models.
Method: PTD uses a coarse-to-fine framework with two modules: PTD_rec learns deterministic mapping to recover low-frequency signals (coarse content), and PTD_diff uses dual-domain guided conditional diffusion to generate reliable textures for the coarse prediction. This approach reduces randomness and enables better trade-off between visual quality and detail fidelity.
Result: Extensive experiments show PTD achieves superior performance in structure similarity and visual appeal with few sampling steps, mitigating diffusion model randomness while balancing visual quality and high-frequency detail fidelity.
Conclusion: The PTD model effectively addresses texture consistency challenges in sparse-view CT reconstruction through a progressive coarse-to-fine approach that combines deterministic and diffusion-based methods for improved visual quality and detail preservation.
Abstract: Diffusion-based sparse-view CT (SVCT) imaging has achieved remarkable advancements in recent years, thanks to its more stable generative capability. However, recovering reliable image content and visually consistent textures is still a crucial challenge. In this paper, we present a Progressively Texture-aware Diffusion (PTD) model, a coarse-to-fine learning framework tailored for SVCT. Specifically, PTD comprises a basic reconstructive module PTD$_{\textit{rec}}$ and a conditional diffusion module PTD$_{\textit{diff}}$. PTD$_{\textit{rec}}$ first learns a deterministic mapping to recover the majority of the underlying low-frequency signals (i.e., coarse content with smoothed textures), which serves as the initial estimation to enable fidelity. Moreover, PTD$_{\textit{diff}}$ aims to reconstruct high-fidelity details for coarse prediction, which explores a dual-domain guided conditional diffusion to generate reliable and consistent textures. Extensive experiments on sparse-view CT reconstruction demonstrate that our PTD achieves superior performance in terms of structure similarity and visual appeal with only a few sampling steps, which mitigates the randomness inherent in general diffusion models and enables a better trade-off between visual quality and fidelity of high-frequency details.
[556] Masked Training for Robust Arrhythmia Detection from Digitalized Multiple Layout ECG Images
Shanwei Zhang, Deyun Zhang, Yirao Tao, Kexin Wang, Shijia Geng, Jun Li, Qinghao Zhao, Xingpeng Liu, Xingliang Wu, Shengyong Chen, Yuxi Zhou, Shenda Hong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2508.09165 was rate-limited (HTTP 429).
Abstract: Not fetched (HTTP 429 from https://export.arxiv.org/api/query?search_query=&id_list=2508.09165&sortBy=relevance&sortOrder=descending&start=0&max_results=100).
[557] The Impact of Federated Learning on Distributed Remote Sensing Archives
Anand Umashankar, Karam Tomotaki-Dawoud, Nicolai Schneider
Main category: cs.CV
TL;DR: Systematic study of federated learning strategies for multi-label remote sensing image classification under non-IID data conditions, comparing FedAvg, FedProx, and BSP with different CNN architectures.
Details
Motivation: Remote sensing data is inherently distributed across geographic regions with data sovereignty constraints, making centralized training impractical. The non-IID nature of Earth observation data (varying label distributions by region) degrades standard FL algorithm convergence, requiring systematic evaluation of FL strategies for remote sensing applications.
Method: Empirical study of three FL strategies (FedAvg, FedProx, and bulk synchronous parallel) applied to multi-label remote sensing image classification under controlled non-IID label-skew conditions. Evaluated three CNN architectures (LeNet, AlexNet, ResNet-34) and analyzed effects of algorithm choice, model capacity, client fraction, client count, batch size, and communication cost.
Result: FedProx outperforms FedAvg for deeper architectures under data heterogeneity; BSP approaches centralized accuracy but with high sequential communication cost; LeNet provides the best accuracy-communication trade-off for the dataset scale considered.
Conclusion: Federated learning is essential for distributed remote sensing data, but algorithm choice and model architecture significantly impact performance under non-IID conditions. FedProx shows advantages for deeper models, while simpler architectures offer better communication efficiency trade-offs.
Abstract: Remote sensing archives are inherently distributed: Earth observation missions such as Sentinel-1, Sentinel-2, and Sentinel-3 have collectively accumulated more than 5 petabytes of imagery, stored and processed across many geographically dispersed platforms. Training machine learning models on such data in a centralized fashion is impractical due to data volume, sovereignty constraints, and geographic distribution. Federated learning (FL) addresses this by keeping data local and exchanging only model updates. A central challenge for remote sensing is the non-IID nature of Earth observation data: label distributions vary strongly by geographic region, degrading the convergence of standard FL algorithms. In this paper, we conduct a systematic empirical study of three FL strategies – FedAvg, FedProx, and bulk synchronous parallel (BSP) – applied to multi-label remote sensing image classification under controlled non-IID label-skew conditions. We evaluate three convolutional neural network (CNN) architectures of increasing depth (LeNet, AlexNet, and ResNet-34) and analyze the joint effect of algorithm choice, model capacity, client fraction, client count, batch size, and communication cost. Experiments on the UC Merced multi-label dataset show that FedProx outperforms FedAvg for deeper architectures under data heterogeneity, that BSP approaches centralized accuracy at the cost of high sequential communication, and that LeNet provides the best accuracy-communication trade-off for the dataset scale considered.
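The FedProx local objective the study compares against FedAvg is compact enough to show directly: the task loss plus a proximal term (mu/2)*||w - w_global||^2 that keeps client updates near the server model under non-IID data. The toy model, batch, and `mu` value below are placeholders:

```python
import torch

def fedprox_local_step(model, global_params, batch, loss_fn, opt, mu=0.01):
    x, y = batch
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    # proximal term keeps the local model close to the server weights
    prox = sum(((p - g.detach()) ** 2).sum()
               for p, g in zip(model.parameters(), global_params))
    (loss + 0.5 * mu * prox).backward()
    opt.step()
    return loss.item()

model = torch.nn.Linear(8, 4)
global_params = [p.detach().clone() for p in model.parameters()]
opt = torch.optim.SGD(model.parameters(), lr=0.1)
batch = (torch.randn(16, 8), torch.randint(0, 4, (16,)))
fedprox_local_step(model, global_params, batch,
                   torch.nn.functional.cross_entropy, opt)
```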
[558] Training-Free Model Ensemble for Single-Image Super-Resolution via Strong-Branch Compensation
Gengjia Chang, Xining Ge, Weijun Yuan, Zhan Li, Qiurong Song, Luen Zhu, Shuhong Liu
Main category: cs.CV
TL;DR: Training-free ensemble framework combining Hybrid attention network and MambaIRv2 for image super-resolution without additional training.
Details
Motivation: Current super-resolution models incur high training costs and engineering effort. Multiple pretrained models with complementary behaviors already exist but need to be combined effectively without training, shifting the focus from architectural capacity to output-level fusion.
Method: Dual-branch pipeline: 1) a hybrid attention network with TLC inference for stable main reconstruction, 2) MambaIRv2 with geometric self-ensemble for high-frequency detail recovery. The branches process the same input independently and are fused via a lightweight weighted combination without parameter updates.
Result: Consistently improves over base branch and slightly exceeds pure strong branch in PSNR under DIV2K bicubic ×4 evaluation. Provides low-overhead upgrade path for existing systems.
Conclusion: Training-free output-level ensemble offers practical alternative to architectural redesign, enabling effective combination of existing models without additional training costs.
Abstract: Single-image super-resolution has progressed from deep convolutional baselines to stronger Transformer and state-space architectures, yet the corresponding performance gains typically come with higher training cost, longer engineering iteration, and heavier deployment burden. In many practical settings, multiple pretrained models with partially complementary behaviors are already available, and the binding constraint is no longer architectural capacity but how effectively their outputs can be combined without additional training. Rather than pursuing further architectural redesign, this paper proposes a training-free output-level ensemble framework. A dual-branch pipeline is constructed in which a Hybrid attention network with TLC inference provides stable main reconstruction, while a MambaIRv2 branch with geometric self-ensemble supplies strong compensation for high-frequency detail recovery. The two branches process the same low-resolution input independently and are fused in the image space via a lightweight weighted combination, without updating any model parameters or introducing an additional trainable module. As our solution to the NTIRE 2026 Image Super-Resolution ($\times 4$) Challenge, the proposed design consistently improves over the base branch and slightly exceeds the pure strong branch in PSNR at the best operating point under a unified DIV2K bicubic $\times 4$ evaluation protocol. Ablation studies confirm that output-level compensation provides a low-overhead and practically accessible upgrade path for existing super-resolution systems.
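The output-level fusion itself is a one-liner; the sketch below shows the weighted image-space combination, with the weight `w` as a placeholder for the paper's tuned operating point:

```python
import numpy as np

def fuse(sr_main, sr_comp, w=0.7):
    # sr_main: stable main-branch output; sr_comp: high-frequency compensator
    return np.clip(w * sr_main + (1.0 - w) * sr_comp, 0.0, 1.0)

out = fuse(np.random.rand(256, 256, 3), np.random.rand(256, 256, 3))
```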
[559] Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models
Songlong Xing, Weijie Wang, Zhengyu Zhao, Jindong Gu, Philip Torr, Nicu Sebe
Main category: cs.CV
TL;DR: AdvFLYP: Adversarial finetuning of CLIP using web image-text pairs with contrastive loss and regularization to preserve zero-shot capabilities while improving robustness.
Details
Motivation: Existing adversarial finetuning methods for CLIP reduce zero-shot capabilities and have limited transferability because they use proxy datasets like ImageNet and overlook training data distributions and learning objectives.
Method: AdvFLYP finetunes CLIP with adversarial images created from web image-text pairs using contrastive loss, plus regularization terms: logit-level regularization for robustness and feature-level regularization for clean accuracy.
Result: Extensive experiments on 14 downstream datasets show superiority over mainstream practices in maintaining zero-shot capabilities while improving adversarial robustness across domains.
Conclusion: Following CLIP’s original pretraining recipe during adversarial finetuning with web data and appropriate regularization preserves zero-shot abilities while enhancing robustness transferability.
Abstract: Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be susceptible to adversarial attacks. To enhance its adversarial robustness, recent studies finetune the pretrained vision encoder of CLIP with adversarial examples on a proxy dataset such as ImageNet by aligning adversarial images with correct class labels. However, these methods overlook the important roles of training data distributions and learning objectives, resulting in reduced zero-shot capabilities and limited transferability of robustness across domains and datasets. In this work, we propose a simple yet effective paradigm AdvFLYP, which follows the training recipe of CLIP’s pretraining process when performing adversarial finetuning to the model. Specifically, AdvFLYP finetunes CLIP with adversarial images created based on image-text pairs collected from the web, and match them with their corresponding texts via a contrastive loss. To alleviate distortion of adversarial image embeddings of noisy web images, we further propose to regularise AdvFLYP by penalising deviation of adversarial image features. We show that logit- and feature-level regularisation terms benefit robustness and clean accuracy, respectively. Extensive experiments on 14 downstream datasets spanning various domains show the superiority of our paradigm over mainstream practices. Our code and model weights are released at https://github.com/Sxing2/AdvFLYP.
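The training objective as described (contrastive alignment of adversarial images with their paired texts, plus a feature-level deviation penalty) can be sketched as follows; the adversarial attack step is omitted, and `lam` and `tau` are assumed hyperparameters:

```python
import torch
import torch.nn.functional as F

def advflyp_loss(img_feats_adv, img_feats_clean, txt_feats, lam=1.0, tau=0.07):
    i = F.normalize(img_feats_adv, dim=-1)
    t = F.normalize(txt_feats, dim=-1)
    logits = i @ t.T / tau                         # image-text similarity matrix
    labels = torch.arange(len(i))
    contrastive = 0.5 * (F.cross_entropy(logits, labels)
                         + F.cross_entropy(logits.T, labels))
    # feature-level regularization: penalize deviation of adversarial
    # image features from their clean counterparts
    feat_reg = F.mse_loss(img_feats_adv, img_feats_clean.detach())
    return contrastive + lam * feat_reg

loss = advflyp_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```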
[560] Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions
Seongyu Kim, Seungwoo Lee, Hyeonggon Ryu, Joon Son Chung, Arda Senocak
Main category: cs.CV
TL;DR: A model for tactile localization that learns local visuo-tactile alignment via dense cross-modal feature interactions to produce tactile saliency maps for touch-conditioned material segmentation, addressing limitations of existing global alignment methods and dataset constraints.
Details
Motivation: Existing visuo-tactile methods rely on global alignment and fail to capture fine-grained local correspondences needed for tactile localization. Current datasets have limited diversity with predominantly close-up, low-diversity images, making it challenging to identify image regions sharing the same material properties as tactile inputs.
Method: Proposes a model that learns local visuo-tactile alignment through dense cross-modal feature interactions to generate tactile saliency maps. Introduces in-the-wild multi-material scene images to expand visual diversity and a material-diversity pairing strategy that aligns each tactile sample with visually varied but tactilely consistent images. Also constructs two new tactile-grounded material segmentation datasets for evaluation.
Result: Experiments on both new and existing benchmarks show the approach substantially outperforms prior visuo-tactile methods in tactile localization tasks.
Conclusion: The proposed method successfully addresses limitations of existing approaches by enabling fine-grained local visuo-tactile alignment and overcoming dataset constraints through innovative data strategies, achieving superior performance in tactile localization.
Abstract: We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input. Existing visuo-tactile methods rely on global alignment and thus fail to capture the fine-grained local correspondences required for this task. The challenge is amplified by existing datasets, which predominantly contain close-up, low-diversity images. We propose a model that learns local visuo-tactile alignment via dense cross-modal feature interactions, producing tactile saliency maps for touch-conditioned material segmentation. To overcome dataset constraints, we introduce: (i) in-the-wild multi-material scene images that expand visual diversity, and (ii) a material-diversity pairing strategy that aligns each tactile sample with visually varied yet tactilely consistent images, improving contextual localization and robustness to weak signals. We also construct two new tactile-grounded material segmentation datasets for quantitative evaluation. Experiments on both new and existing benchmarks show that our approach substantially outperforms prior visuo-tactile methods in tactile localization.
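The dense matching step that produces a tactile saliency map reduces to cosine similarity between one tactile embedding and every spatial visual feature. The encoders are assumed given; only the matching step from the abstract is sketched:

```python
import torch
import torch.nn.functional as F

def tactile_saliency(visual_feats, tactile_feat):
    # visual_feats: (C, H, W) dense visual features; tactile_feat: (C,)
    v = F.normalize(visual_feats.flatten(1), dim=0)   # unit feature per location
    t = F.normalize(tactile_feat, dim=0)
    return (t @ v).reshape(visual_feats.shape[1:])    # (H, W) cosine saliency

sal = tactile_saliency(torch.randn(256, 32, 32), torch.randn(256))
print(sal.shape)  # torch.Size([32, 32])
```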
[561] GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth
Krishna Jaganathan, Patricio Vela
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.11585 was rate-limited (HTTP 429).
Abstract: Not fetched (HTTP 429 from https://export.arxiv.org/api/query?search_query=&id_list=2604.11585&sortBy=relevance&sortOrder=descending&start=0&max_results=100).
[562] On the Robustness of Watermarking for Autoregressive Image Generation
Andreas Müller, Denis Lukovnikov, Shingo Kodama, Minh Pham, Anubhav Jain, Jonathan Petit, Niv Cohen, Asja Fischer
Main category: cs.CV
TL;DR: AR image generator watermarks are vulnerable to removal and forgery attacks, undermining their reliability for synthetic content detection and dataset filtering.
Details
Motivation: As autoregressive image generators proliferate, reliable detection and attribution of synthetic images is needed to combat misinformation and prevent model collapse from synthetic training data. Watermarking techniques have been proposed for AR models, but their security needs evaluation.
Method: The paper studies existing AR image generator watermarking schemes and demonstrates their vulnerabilities. It assesses existing attacks and introduces three new attacks: (1) vector-quantized regeneration removal attack, (2) adversarial optimization-based attack, and (3) frequency injection attack.
Result: Evaluation shows removal and forgery attacks can be effective with access to just a single watermarked reference image, without needing original model parameters or watermarking secrets. Existing watermarking schemes fail to reliably support synthetic content detection for dataset filtering.
Conclusion: Current watermarking schemes for AR image generation are vulnerable to attacks, enabling both watermark removal and “Watermark Mimicry” where authentic images can be manipulated to trigger false detection. This undermines their reliability for content verification and dataset filtering applications.
Abstract: The proliferation of autoregressive (AR) image generators demands reliable detection and attribution of their outputs to mitigate misinformation, and to filter synthetic images from training data to prevent model collapse. To address this need, watermarking techniques, specifically designed for AR models, embed a subtle signal at generation time, enabling downstream verification through a corresponding watermark detector. In this work, we study these schemes and demonstrate their vulnerability to both watermark removal and forgery attacks. We assess existing attacks and further introduce three new attacks: (i) a vector-quantized regeneration removal attack, (ii) adversarial optimization-based attack, and (iii) a frequency injection attack. Our evaluation reveals that removal and forgery attacks can be effective with access to a single watermarked reference image and without access to original model parameters or watermarking secrets. Our findings indicate that existing watermarking schemes for AR image generation do not reliably support synthetic content detection for dataset filtering. Moreover, they enable Watermark Mimicry, whereby authentic images can be manipulated to imitate a generator’s watermark and trigger false detection to prevent their inclusion in future model training.
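A hedged sketch of the vector-quantized regeneration idea: re-encode latents through a VQ codebook and decode them back, so token-level watermark signals are re-sampled. The codebook and nearest-neighbor quantization below are toy stand-ins for a real AR tokenizer:

```python
import numpy as np

def vq_regenerate(latents, codebook):
    # latents: (N, d) patch latents; codebook: (K, d) codewords
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d2.argmin(1)          # nearest-codeword assignment
    return codebook[tokens]        # decoded ("regenerated") latents

regen = vq_regenerate(np.random.randn(64, 16), np.random.randn(512, 16))
```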
[563] MLLM-as-a-Judge Exhibits Model Preference Bias
Shuitsu Koyama, Yuiga Wada, Daichi Yashima, Komei Sugiura
Main category: cs.CV
TL;DR: Philautia-Eval framework investigates model-specific preference bias in MLLM-as-a-Judge evaluation methods, finding self-preference and mutual preference biases, and proposes Pomms ensemble to mitigate bias.
Details
Motivation: Automatic evaluation using MLLMs (MLLM-as-a-Judge) is widely used for model benchmarking, but if biased, it could distort scientific progress. The paper investigates whether MLLM judges show preference bias toward specific MLLM-generated text.
Method: Proposes Philautia-Eval framework to quantify model-specific preference bias by disentangling preference tendencies from generation quality differences. Analyzes 1.29M caption-score pairs from 12 MLLMs. Introduces Pomms, a simple ensemble of MLLMs to mitigate bias.
Result: Found that representative MLLMs exhibit self-preference bias. Also discovered mutual preference bias within particular model families, potentially driven by reused connectors and overlapping instruction-tuning resources. Pomms ensemble effectively mitigated model-specific preference bias while maintaining performance.
Conclusion: MLLM-as-a-Judge methods have significant model-specific preference biases that could distort model comparisons. The proposed Philautia-Eval framework helps quantify these biases, and the Pomms ensemble approach offers a practical solution to mitigate bias in evaluation.
Abstract: Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge methods were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge methods favor or disfavor text generated by specific MLLMs. In this study, we propose Philautia-Eval to investigate such model-specific preference bias. Philautia-Eval quantifies the degree of the bias by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs collected from 12 MLLMs, we found that representative MLLMs tend to exhibit self-preference bias. Moreover, experimental results indicate mutual preference bias within particular model families, which is potentially driven by reused connectors and overlapping instruction-tuning resources. Finally, we introduce a simple ensemble of MLLMs, Pomms. Our results demonstrated that Pomms effectively mitigated the model-specific preference bias while maintaining performance.
[564] Learning Robustness at Test-Time from a Non-Robust Teacher
Stefano Bianchettin, Giulio Rossolini, Giorgio Buttazzo
Main category: cs.CV
TL;DR: A framework for adapting non-robust pretrained models at test-time to improve adversarial robustness using label-free distillation with teacher predictions as semantic anchors.
Details
Motivation: Pretrained models are widely adapted to downstream environments with scarce unlabeled data, but adversarial robustness in this test-time adaptation setting is under-explored, especially when starting from non-robust pretrained models.
Method: Proposes a label-free adaptation framework that uses predictions from a non-robust teacher model as semantic anchors for both clean and adversarial objectives during test-time adaptation, providing theoretical analysis showing improved stability over self-consistency regularization.
Result: The approach achieves improved optimization stability, lower sensitivity to hyperparameters, and better robustness-accuracy trade-off on CIFAR-10 and ImageNet under photometric transformations compared to existing baselines.
Conclusion: Non-robust pretrained models can be effectively adapted at test-time to improve adversarial robustness using the proposed label-free framework with teacher predictions as semantic anchors.
Abstract: Nowadays, pretrained models are increasingly used as general-purpose backbones and adapted at test-time to downstream environments where target data are scarce and unlabeled. While this paradigm has proven effective for improving clean accuracy on the target domain, adversarial robustness has received far less attention, especially when the original pretrained model is not explicitly designed to be robust. This raises a practical question: \emph{can a pretrained, non-robust model be adapted at test-time to improve adversarial robustness on a target distribution?} To face this question, this work studies how adversarial training strategies behave when integrated into adaptation schemes for the unsupervised test-time setting, where only a small set of unlabeled target samples is available. It first analyzes how classical adversarial training formulations can be extended to this scenario, showing that straightforward distillation-based adaptations remain unstable and highly sensitive to hyperparameter tuning, particularly when the teacher itself is non-robust. To address these limitations, the work proposes a label-free framework that uses the predictions of a non-robust teacher model as a semantic anchor for both the clean and adversarial objectives during adaptation. We further provide theoretical insights showing that our formulation yields a more stable alternative to the self-consistency-based regularization commonly used in classical adversarial training. Experiments evaluate the proposed approach on CIFAR-10 and ImageNet under induced photometric transformations. The results support the theoretical insights by showing that the proposed approach achieves improved optimization stability, lower sensitivity to parameter choices, and a better robustness-accuracy trade-off than existing baselines in this post-deployment test-time setting.
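The anchor objective is simple to state in code: the frozen teacher's clean prediction anchors both the clean and adversarial outputs of the adapted student via KL divergence. Attack generation is omitted and `x_adv` is assumed given; the linear models are toys:

```python
import torch
import torch.nn.functional as F

def anchor_loss(student, teacher, x_clean, x_adv):
    with torch.no_grad():
        anchor = F.softmax(teacher(x_clean), dim=-1)   # semantic anchor
    p_clean = F.log_softmax(student(x_clean), dim=-1)
    p_adv = F.log_softmax(student(x_adv), dim=-1)
    return (F.kl_div(p_clean, anchor, reduction="batchmean")
            + F.kl_div(p_adv, anchor, reduction="batchmean"))

teacher = torch.nn.Linear(32, 10).eval()   # frozen, non-robust teacher
student = torch.nn.Linear(32, 10)          # model being adapted at test time
x = torch.randn(4, 32)
loss = anchor_loss(student, teacher, x, x + 0.03 * torch.randn_like(x))
```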
[565] Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language
Peijie Wang, Ming-Liang Zhang, Jun Cao, Chao Deng, Dekang Ran, Hongda Sun, Pi Bu, Xuan Zhang, Yingyao Wang, Jun Song, Bo Zheng, Fei Yin, Cheng-Lin Liu
Main category: cs.CV
TL;DR: A unified formal language for both plane and solid geometry parsing, with a large dataset and training method combining supervised fine-tuning and reinforcement learning, achieving SOTA parsing performance and boosting MLLMs’ geometry reasoning capabilities.
Details
Motivation: MLLMs struggle with geometric reasoning due to perception bottlenecks for fine-grained visual elements. While formal languages have helped with plane geometry, solid geometry requiring spatial understanding remains largely unexplored.
Method: Design a unified formal language integrating plane and solid geometry, construct GDP-29K dataset (20k plane + 9k solid geometry samples), and propose training paradigm combining Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards.
Result: Achieves state-of-the-art parsing performance. The parsed formal descriptions serve as a critical cognitive scaffold, significantly boosting MLLMs’ capabilities for downstream geometry reasoning tasks.
Conclusion: The approach successfully addresses geometric reasoning challenges in MLLMs through formal language parsing, with the unified framework covering both plane and solid geometry, and the training method ensuring syntactic correctness and geometric consistency.
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress but continue to struggle with geometric reasoning, primarily due to the perception bottleneck regarding fine-grained visual elements. While formal languages have aided plane geometry understanding, solid geometry which requires spatial understanding remains largely unexplored. In this paper, we address this challenge by designing a unified formal language that integrates plane and solid geometry, comprehensively covering geometric structures and semantic relations. We construct GDP-29K, a large-scale dataset comprising 20k plane and 9k solid geometry samples collected from diverse real-world sources, each paired with its ground-truth formal description. To ensure syntactic correctness and geometric consistency, we propose a training paradigm that combines Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards. Experiments show that our approach achieves state-of-the-art parsing performance. Furthermore, we demonstrate that our parsed formal descriptions serve as a critical cognitive scaffold, significantly boosting MLLMs’ capabilities for downstream geometry reasoning tasks. Our data and code are available at Geoparsing.
[566] POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
Haicheng Wang, Yuan Liu, Yikun Liu, Zhemeng Yu, Zhongyin Zhao, Yangxiu You, Zilin Yu, Le Tian, Xiao Zhou, Jie Zhou, Weidi Xie, Yanfeng Wang
Main category: cs.CV
TL;DR: POINTS-Long is a dual-mode MLLM with dynamic visual token scaling for efficient long-video and streaming visual understanding, inspired by human visual system with focus/standby modes.
Details
Motivation: Address the scalability challenge of visual token sequences in MLLMs, especially for long-video and streaming scenarios, where current approaches struggle with computational efficiency and real-world deployment.
Method: Introduces a native dual-mode MLLM with dynamic visual token scaling inspired by the human visual system. Features two complementary perception modes: focus mode (for fine-grained tasks) and standby mode (for long-form understanding). Includes dynamically detachable KV-cache design for streaming visual understanding.
Result: Standby mode retains 97.7-99.7% of original accuracy using only 1/40-1/10th of visual tokens for long-form visual understanding. The model efficiently maintains ultra-long visual memory for streaming scenarios.
Conclusion: POINTS-Long provides new insights for future MLLM design and lays foundation for adaptive, efficient long-form visual understanding, addressing key scalability challenges in multimodal AI.
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences–especially in long-video and streaming scenarios–poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40-1/10th of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.
[567] StarVLA-$\alpha$: Reducing Complexity in Vision-Language-Action Systems
Jinhui Ye, Ning Gao, Senqiao Yang, Jinliang Zheng, Zixuan Wang, Yuxin Chen, Pengguang Chen, Yilun Chen, Shu Liu, Jiaya Jia
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.11757 was rate-limited (HTTP 429).
Abstract: Not fetched (HTTP 429 from https://export.arxiv.org/api/query?search_query=&id_list=2604.11757&sortBy=relevance&sortOrder=descending&start=0&max_results=100).
[568] MorphoFlow: Sparse-Supervised Generative Shape Modeling with Adaptive Latent Relevance
Mokshagna Sai Teja Karanam, Tushar Kataria, Shireen Elhabian
Main category: cs.CV
TL;DR: MorphoFlow is a sparse supervised generative shape modeling framework that learns probabilistic 3D shape representations from sparse surface annotations using neural implicit representations, autodecoders, and autoregressive normalizing flows.
Details
Motivation: Traditional statistical shape modeling (SSM) requires dense segmentation annotations and fixed latent representations, limiting scalability and flexibility for modeling complex anatomical variation. There's a need for approaches that work with sparse supervision while maintaining generative expressivity.
Method: Combines neural implicit shape representations (resolution-agnostic 3D modeling) with autodecoder formulation (direct optimization of per-instance latent codes under sparse supervision) and autoregressive normalizing flows (capturing latent anatomical variability distribution). Uses adaptive latent relevance weighting with sparsity-inducing priors for compact, structured latent spaces.
Result: Accurate high-resolution reconstruction from sparse inputs on lumbar vertebrae and femur datasets. Recovers structured modes of anatomical variation consistent with population trends. Supports uncertainty quantification and anatomically plausible shape synthesis without manual latent dimensionality tuning.
Conclusion: MorphoFlow enables scalable, flexible statistical shape modeling from sparse annotations by integrating neural implicit representations with probabilistic generative modeling, providing a practical solution for population-level anatomical analysis.
Abstract: Statistical shape modeling (SSM) is central to population level analysis of anatomical variability, yet most existing approaches rely on densely annotated segmentations and fixed latent representations. These requirements limit scalability and reduce flexibility when modeling complex anatomical variation. We introduce MorphoFlow, a sparse supervised generative shape modeling framework that learns compact probabilistic shape representations directly from sparse surface annotations. MorphoFlow integrates neural implicit shape representations with an autodecoder formulation and autoregressive normalizing flows to learn an expressive probabilistic density over the latent shape space. The neural implicit representation enables resolution-agnostic modeling of 3D anatomy, while the autodecoder formulation supports direct optimization of per-instance latent codes under sparse supervision. The autoregressive flow captures the distribution of latent anatomical variability providing a tractable, likelihood-based generative model of shapes. To promote compact and structured latent representations, we incorporate adaptive latent relevance weighting through sparsity-inducing priors, enabling the model to regulate the contribution of individual latent dimensions according to their relevance to the underlying anatomical variation while preserving generative expressivity. The resulting latent space supports uncertainty quantification and anatomically plausible shape synthesis without manual latent dimensionality tuning. Evaluation on publicly available lumbar vertebrae and femur datasets demonstrates accurate high-resolution reconstruction from sparse inputs and recovery of structured modes of anatomical variation consistent with population level trends.
[569] Efficient KernelSHAP Explanations for Patch-based 3D Medical Image Segmentation
Ricardo Coimbra Brioso, Giulio Sichili, Damiano Dei, Nicola Lambri, Pietro Mancosu, Marta Scorsetti, Daniele Loiacono
Main category: cs.CV
TL;DR: Efficient KernelSHAP framework for 3D medical image segmentation with patch logit caching and organ-aware supervoxels for clinically meaningful attributions
Details
Motivation: Perturbation-based explainability methods like KernelSHAP are impractical for 3D medical image segmentation due to high computational costs from many coalition evaluations and expensive sliding-window inference in volumetric CT segmentation.
Method: Proposes efficient KernelSHAP framework that restricts computation to user-defined ROI and its receptive-field support, accelerates inference via patch logit caching (reusing baseline predictions for unaffected patches while preserving nnU-Net’s fusion scheme), and compares three feature abstractions: whole-organ units, regular FCC supervoxels, and hybrid organ-aware supervoxels.
Result: Caching reduces redundant computation by 15-30%; faithfulness vs interpretability trade-off: regular supervoxels maximize perturbation metrics but lack anatomical alignment, while organ-aware units yield more clinically interpretable explanations and effectively highlight false-positive drivers under normalized metrics.
Conclusion: The framework enables efficient and clinically meaningful explanations for 3D medical image segmentation, with organ-aware supervoxels providing better clinical interpretability for identifying false-positive drivers in medical imaging applications.
Abstract: Perturbation-based explainability methods such as KernelSHAP provide model-agnostic attributions but are typically impractical for patch-based 3D medical image segmentation due to the large number of coalition evaluations and the high cost of sliding-window inference. We present an efficient KernelSHAP framework for volumetric CT segmentation that restricts computation to a user-defined region of interest and its receptive-field support, and accelerates inference via patch logit caching, reusing baseline predictions for unaffected patches while preserving nnU-Net’s fusion scheme. To enable clinically meaningful attributions, we compare three automatically generated feature abstractions within the receptive-field crop: whole-organ units, regular FCC supervoxels, and hybrid organ-aware supervoxels, and we study multiple aggregation/value functions targeting stabilizing evidence (TP/Dice/Soft Dice) or false-positive behavior. Experiments on whole-body CT segmentations show that caching substantially reduces redundant computation (with computational savings ranging from 15% to 30%) and that faithfulness and interpretability exhibit clear trade-offs: regular supervoxels often maximize perturbation-based metrics but lack anatomical alignment, whereas organ-aware units yield more clinically interpretable explanations and are particularly effective for highlighting false-positive drivers under normalized metrics.
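Patch logit caching can be sketched as a baseline pass plus selective recomputation: only patches whose sliding window intersects the perturbed ROI are re-run per coalition. The bookkeeping below is heavily simplified relative to nnU-Net's actual fusion scheme:

```python
import numpy as np

class PatchLogitCache:
    """Cache baseline patch predictions; re-run only ROI-affected patches."""
    def __init__(self, model, baseline_volume, patches):
        self.model, self.patches = model, patches
        # one-time baseline sliding-window pass
        self.baseline = [model(baseline_volume[s]) for s in patches]

    def coalition_logits(self, perturbed_volume, affected):
        # patches whose window intersects the perturbed ROI are recomputed;
        # all others reuse the cached baseline prediction
        return [self.model(perturbed_volume[self.patches[i]])
                if i in affected else self.baseline[i]
                for i in range(len(self.patches))]

vol = np.random.rand(8, 8, 8)
patches = [np.s_[:4], np.s_[4:]]                 # two slabs along axis 0
cache = PatchLogitCache(lambda p: p.mean(), vol, patches)
print(cache.coalition_logits(vol * 0.5, affected={0}))
```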
[570] STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding
Wenhao Li, Xueying Jiang, Gongjie Zhang, Xiaoqin Zhang, Ling Shao, Shijian Lu
Main category: cs.CV
TL;DR: STS-Mixer is a novel framework for 4D point cloud video understanding that integrates spatial, temporal, and spectral representations through graph spectral analysis to capture both coarse shapes and fine-grained geometry details.
Details
Motivation: Existing methods for 4D point cloud video understanding work primarily in the spatiotemporal domain, which fails to capture underlying geometric characteristics effectively, leading to degraded representation learning. The authors propose addressing this limitation from a complementary spectral perspective.
Method: Transform 4D point cloud videos into graph spectral signals, decompose them into multiple frequency bands (low-frequency for coarse shapes, high-frequency for fine-grained details), and design STS-Mixer - a unified framework that mixes spatial, temporal, and spectral representations through spectral analysis.
Result: STS-Mixer achieves superior performance consistently across multiple widely adopted benchmarks on both 3D action recognition and 4D semantic segmentation tasks, demonstrating its effectiveness in capturing rich geometries and temporal dynamics.
Conclusion: The spectral perspective provides a powerful complementary approach to traditional spatiotemporal methods for 4D point cloud video understanding, enabling fine-grained and holistic analysis through integrated spatial, temporal, and spectral representations.
Abstract: 4D point cloud videos capture rich spatial and temporal dynamics of scenes which possess unique values in various 4D understanding tasks. However, most existing methods work in the spatiotemporal domain where the underlying geometric characteristics of 4D point cloud videos are hard to capture, leading to degraded representation learning and understanding of 4D point cloud videos. We address the above challenge from a complementary spectral perspective. By transforming 4D point cloud videos into graph spectral signals, we can decompose them into multiple frequency bands each of which captures distinct geometric structures of point cloud videos. Our spectral analysis reveals that the decomposed low-frequency signals capture more coarse shapes while high-frequency signals encode more fine-grained geometry details. Building on these observations, we design Spatio-Temporal-Spectral Mixer (STS-Mixer), a unified framework that mixes spatial, temporal, and spectral representations of point cloud videos. STS-Mixer integrates multi-band delineated spectral signals with spatiotemporal information to capture rich geometries and temporal dynamics, while enabling fine-grained and holistic understanding of 4D point cloud videos. Extensive experiments show that STS-Mixer achieves superior performance consistently across multiple widely adopted benchmarks on both 3D action recognition and 4D semantic segmentation tasks. Code and models are available at https://github.com/Vegetebird/STS-Mixer.
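The spectral decomposition the abstract describes can be reproduced on a toy point cloud: build a kNN graph, eigendecompose its Laplacian, and split a signal into low- and high-frequency bands. This illustrates the decomposition, not the STS-Mixer architecture; `k` and `cutoff` are arbitrary choices:

```python
import numpy as np
from scipy.spatial import cKDTree

def spectral_bands(points, signal, k=8, cutoff=16):
    n = len(points)
    _, idx = cKDTree(points).query(points, k=k + 1)  # k NNs (first hit is self)
    A = np.zeros((n, n))
    for i, nbrs in enumerate(idx[:, 1:]):
        A[i, nbrs] = A[nbrs, i] = 1.0
    L = np.diag(A.sum(1)) - A                        # combinatorial Laplacian
    w, U = np.linalg.eigh(L)
    coeffs = U.T @ signal                            # graph Fourier transform
    low = U[:, :cutoff] @ coeffs[:cutoff]            # coarse shape component
    return low, signal - low                         # (low band, high band)

pts = np.random.rand(128, 3)
low, high = spectral_bands(pts, pts[:, 2])           # height as a toy signal
```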
[571] GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays
David Wong, Zeynep Isik, Bin Wang, Marouane Tliba, Gorkem Durak, Elif Keles, Halil Ertugrul Aktas, Aladine Chetouani, Cagdas Topel, Nicolo Gennaro, Camila Lopes Vendrami, Tugce Agirlar Trabzonlu, Amir Ali Rahsepar, Laetitia Perronne, Matthew Antalek, Onural Ozturk, Gokcan Okur, Andrew C. Gordon, Ayis Pyrros, Frank H. Miller, Amir Borhani, Hatice Savas, Eric Hart, Elizabeth Krupinski, Ulas Bagci
Main category: cs.CV
TL;DR: GazeVaLM is a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment, featuring gaze recordings from expert radiologists and multimodal LLM predictions for human-AI comparison.
Details
Motivation: To create a comprehensive dataset that enables research on how experts and AI systems perceive, interpret, and evaluate medical images, particularly focusing on authenticity assessment of AI-generated versus real chest X-rays.
Method: Collected 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays under diagnostic assessment and real-fake classification conditions. Extended protocol to 6 state-of-the-art multimodal LLMs to generate predictions under matched conditions.
Result: Dataset includes raw gaze samples, fixation maps, scanpaths, saliency density maps, diagnostic labels, authenticity judgments, and corresponding LLM predictions with confidence scores. Provides analyses of gaze agreement, inter-observer consistency, and benchmarking of radiologists versus LLMs.
Conclusion: GazeVaLM supports research in gaze modeling, clinical decision-making, human-AI comparison, generative image realism assessment, and uncertainty quantification, facilitating reproducible research on expert and AI perception of medical images.
Abstract: We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays (generated by diffusion based generative AI) under two conditions: diagnostic assessment and real-fake classification (Visual Turing test). For each image-observer pair, we provide raw gaze samples, fixation maps, scanpaths, saliency density maps, structured diagnostic labels, and authenticity judgments. We extend the protocol to 6 state-of-the-art multimodal LLMs, releasing their predicted diagnoses, authenticity labels, and confidence scores under matched conditions - enabling direct human-AI comparison at both decision and uncertainty levels. We further provide analyses of gaze agreement, inter-observer consistency, and benchmarking of radiologists versus LLMs in diagnostic accuracy and authenticity detection. GazeVaLM supports research in gaze modeling, clinical decision-making, human-AI comparison, generative image realism assessment, and uncertainty quantification. By jointly releasing visual attention data, clinical labels, and model predictions, we aim to facilitate reproducible research on how experts and AI systems perceive, interpret, and evaluate medical images. The dataset is available at https://huggingface.co/datasets/davidcwong/GazeVaLM.
[572] Budget-Aware Uncertainty for Radiotherapy Segmentation QA Using nnU-Net
Ricardo Coimbra Brioso, Lorenzo Mondo, Damiano Dei, Nicola Lambri, Pietro Mancosu, Marta Scorsetti, Daniele Loiacono
Main category: cs.CV
TL;DR: A budget-aware uncertainty-driven QA framework for radiotherapy CTV segmentation using nnU-Net with uncertainty quantification and calibration to guide manual review.
Details
Motivation: Clinical Target Volume (CTV) delineation for radiotherapy planning is time-consuming and difficult to assess, especially for complex treatments like Total Marrow and Lymph Node Irradiation (TMLI). While deep learning auto-segmentation can reduce workload, safe clinical deployment requires reliable uncertainty cues to indicate where models may be wrong.
Method: Proposed a budget-aware uncertainty-driven quality assurance framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps based on predictive entropy. Compared temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated individually and in combination on TMLI as a representative use case.
Result: Segmentation accuracy remained stable across configurations, while temperature scaling substantially improved calibration. Uncertainty-error alignment improved most with calibrated checkpoint-based inference, leading to uncertainty maps that more consistently highlight regions requiring manual edits. Reliability was assessed through ROI-masked calibration metrics and uncertainty-error alignment under realistic revision constraints.
Conclusion: Integrating calibration with efficient ensembling appears to be a promising strategy to implement a budget-aware QA workflow for radiotherapy segmentation, providing uncertainty maps that can guide targeted manual review.
Abstract: Accurate delineation of the Clinical Target Volume (CTV) is essential for radiotherapy planning, yet remains time-consuming and difficult to assess, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI). While deep learning-based auto-segmentation can reduce workload, safe clinical deployment requires reliable cues indicating where models may be wrong. In this work, we propose a budget-aware uncertainty-driven quality assurance (QA) framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps (based on predictive entropy) that can guide targeted manual review. We compare temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated both individually and in combination on TMLI as a representative use case. Reliability is assessed through ROI-masked calibration metrics and uncertainty-error alignment under realistic revision constraints, summarized as AUC over the top 0-5% most uncertain voxels. Across configurations, segmentation accuracy remains stable, whereas TS substantially improves calibration. Uncertainty-error alignment improves most with calibrated checkpoint-based inference, leading to uncertainty maps that more consistently highlight regions requiring manual edits. Overall, integrating calibration with efficient ensembling appears to be a promising strategy for implementing a budget-aware QA workflow for radiotherapy segmentation.
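The core QA primitives, temperature scaling followed by predictive entropy under a fixed review budget, are simple to sketch. The snippet below shows the binary (sigmoid) case with a fixed temperature; in the actual pipeline the temperature would be fit on validation data and nnU-Net outputs are multi-class softmax, so treat this as a schematic only.

```python
import numpy as np

def entropy_uncertainty(logits, T=1.5):
    """Voxel-wise predictive entropy after temperature scaling (binary case).

    logits: (D, H, W) raw network outputs. T is a fixed constant here;
    in practice it is fit by minimizing validation NLL.
    """
    p = 1.0 / (1.0 + np.exp(-logits / T))           # temperature-scaled sigmoid
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def top_fraction_mask(unc, frac=0.05):
    """Flag the top `frac` most-uncertain voxels for manual review."""
    thresh = np.quantile(unc, 1.0 - frac)
    return unc >= thresh

logits = np.random.randn(16, 64, 64)
review = top_fraction_mask(entropy_uncertainty(logits))
print(review.mean())  # roughly 0.05 of voxels flagged
```

The "budget" in the paper's sense is exactly this `frac`: the AUC over the top 0-5% most uncertain voxels measures how well the flagged set covers actual errors at realistic review effort.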
[573] UNIGEOCLIP: Unified Geospatial Contrastive Learning
Guillaume Astruc, Eduard Trulls, Jan Hosang, Loic Landrieu, Paul-Edouard Sarlin
Main category: cs.CV
TL;DR: UNIGEOCLIP is a massively multimodal contrastive framework that aligns five geospatial modalities (aerial imagery, street views, elevation models, text, coordinates) in a unified embedding space using all-to-all contrastive alignment.
Details
Motivation: The growing availability of co-located geospatial data across multiple modalities presents an opportunity for multimodal representation learning, but existing approaches either fuse modalities or rely on central pivot representations rather than enabling seamless cross-modal reasoning.
Method: Proposes UNIGEOCLIP with all-to-all contrastive alignment across five geospatial modalities, plus a scaled latitude-longitude encoder that captures multi-scale geographic structure for improved spatial representation.
Result: Extensive experiments show UNIGEOCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines across downstream geospatial tasks, demonstrating benefits of holistic multimodal alignment.
Conclusion: UNIGEOCLIP successfully creates a unified multimodal embedding space for geospatial data, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of five complementary geospatial modalities.
Abstract: The growing availability of co-located geospatial data spanning aerial imagery, street-level views, elevation models, text, and geographic coordinates offers a unique opportunity for multimodal representation learning. We introduce UNIGEOCLIP, a massively multimodal contrastive framework to jointly align five complementary geospatial modalities in a single unified embedding space. Unlike prior approaches that fuse modalities or rely on a central pivot representation, our method performs all-to-all contrastive alignment, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of modalities. We further propose a scaled latitude-longitude encoder that improves spatial representation by capturing multi-scale geographic structure. Extensive experiments across downstream geospatial tasks demonstrate that UNIGEOCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines, highlighting the benefits of holistic multimodal geospatial alignment. A reference implementation is available at https://gastruc.github.io/unigeoclip.
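All-to-all contrastive alignment simply applies a pairwise contrastive loss to every unordered pair of modalities instead of anchoring everything to one pivot. A minimal NumPy sketch with a symmetric InfoNCE objective follows; the batch pairing, temperature, and modality names are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings,
    with co-located samples as positives (the diagonal of the logit matrix)."""
    logits = a @ b.T / tau
    labels = np.arange(len(a))
    def ce(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))

def all_to_all_loss(embeddings):
    """Average InfoNCE over every unordered modality pair (no pivot)."""
    mods = list(embeddings)
    total, pairs = 0.0, 0
    for i in range(len(mods)):
        for j in range(i + 1, len(mods)):
            total += info_nce(embeddings[mods[i]], embeddings[mods[j]])
            pairs += 1
    return total / pairs

B, D = 32, 64
emb = {m: np.random.randn(B, D) for m in ["aerial", "street", "dem", "text", "coord"]}
emb = {m: v / np.linalg.norm(v, axis=1, keepdims=True) for m, v in emb.items()}
print(all_to_all_loss(emb))
```

With five modalities this yields ten pairwise terms per batch, which is what lets any modality be queried against any other at retrieval time.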
[574] Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge
Asbjørn Munk, Stefano Cerri, Vardan Nersesjan, Christian Hedeager Krag, Jakob Ambsdorf, Pablo Rocamora García, Julia Machnio, Peirong Liu, Suhyun Ahn, Nasrin Akbari, Yasmina Al Khalil, Kimberly Amador, Sina Amirrajab, Tal Arbel, Meritxell Bach Cuadra, Ujjwal Baid, Bhakti Baheti, Jaume Banus, Kamil Barbierik, Christoph Brune, Yansong Bu, Baptiste Callard, Yuhan Chen, Cornelius Crijnen, Corentin Dancette, Peter Drotar, Prasad Dutande, Nils D. Forkert, Saurabh Garg, Jakub Gazda, Matej Gazda, Benoît Gérin, Partha Ghosh, Weikang Gong, Pedro M. Gordaliza, Sam Hashemi, Tobias Heimann, Fucang Jia, Jiexin Jiang, Emily Kaczmarek, Chris Kang, Seung Kwan Kang, Mohammad Khazaei, Julien Khlaut, Petros Koutsouvelis, Jae Sung Lee, Yuchong Li, Mengye Lyu, Mingchen Ma, Anant Madabhushi, Klaus H. Maier-Hein, Pierre Manceron, Andrés Martínez Mora, Moona Mazher, Felix Meister, Nataliia Molchanova, Steven A. Niederer, Leonard Nürnberg, Jinah Park, Abdul Qayyum, Jonas Richiardi, Antoine Saporta, Branislav Setlak, Ning Shen, Justin Szeto, Constantin Ulrich, Puru Vaish, Vibujithan Vigneshwaran, Leroy Volmer, Zihao Wang, Siqi Wei, Anthony Winder, Jelmer M. Wolterink, Maxence Wynen, Chang Yang, Si Young Yie, Mostafa Mehdipour Ghazi, Akshay Pai, Espen Jimenez Solem, Sebastian Nørgaard Llambias, Mikael Boesen, Michael Eriksen Benros, Juan Eugenio Iglesias, Mads Nielsen
Main category: cs.CV
TL;DR: FOMO25 challenge evaluates self-supervised foundation models for brain MRI analysis on clinical data, showing SSL improves generalization under domain shift with different objectives optimal for different tasks.
Details
Motivation: Clinical brain MRI analysis faces challenges with heterogeneous/noisy data and costly labeling. Self-supervised learning can leverage unlabeled clinical data to train robust foundation models, but development has been limited by small pretraining datasets and in-domain benchmarking.
Method: Organized FOMO25 challenge with large pretraining dataset FOMO60K, evaluating models on clinical data in few-shot/out-of-domain settings. Tasks included infarct classification, meningioma segmentation, and brain age regression. Evaluated 19 foundation models from 16 teams using standardized containerized pipeline.
Result: SSL pretraining improves generalization on clinical data under domain shift; strongest out-of-domain models surpassed supervised in-domain baselines. No single pretraining objective benefits all tasks: MAE favors segmentation, hybrid reconstruction-contrastive favors classification. Small pretrained models achieved strong performance without reliable benefits from scaling model size/training duration.
Conclusion: Self-supervised foundation models show promise for clinical brain MRI analysis, with different objectives optimal for different tasks and efficient small models performing well without extensive scaling.
Abstract: Clinical deployment of automated brain MRI analysis faces a fundamental challenge: clinical data is heterogeneous and noisy, and high-quality labels are prohibitively costly to obtain. Self-supervised learning (SSL) can address this by leveraging the vast amounts of unlabeled data produced in clinical workflows to train robust foundation models that adapt out-of-domain with minimal supervision. However, the development of foundation models for brain MRI has been limited by small pretraining datasets and in-domain benchmarking focused on high-quality, research-grade data. To address this gap, we organized the FOMO25 challenge as a satellite event at MICCAI 2025. FOMO25 provided participants with a large pretraining dataset, FOMO60K, and evaluated models on data sourced directly from clinical workflows in few-shot and out-of-domain settings. Tasks covered infarct classification, meningioma segmentation, and brain age regression, and considered both models trained on FOMO60K (method track) and any data (open track). Nineteen foundation models from sixteen teams were evaluated using a standardized containerized pipeline. Results show that (a) self-supervised pretraining improves generalization on clinical data under domain shift, with the strongest models trained out-of-domain surpassing supervised baselines trained in-domain; (b) no single pretraining objective benefits all tasks: MAE favors segmentation, hybrid reconstruction-contrastive objectives favor classification; and (c) strong performance was achieved by small pretrained models, and improvements from scaling model size and training duration did not yield reliable benefits.
[575] Unfolding 3D Gaussian Splatting via Iterative Gaussian Synopsis
Yuqin Lu, Yang Zhou, Yihua Dai, Guiqing Li, Shengfeng He
Main category: cs.CV
TL;DR: Iterative Gaussian Synopsis: A top-down unfolding framework for compact, progressive 3D Gaussian Splatting rendering with adaptive LOD hierarchy and shared feature representation.
Details
Motivation: 3D Gaussian Splatting (3DGS) has high storage requirements and an unstructured representation, making it challenging for streaming and resource-constrained environments. Existing bottom-up LOD approaches introduce redundancy or degrade fidelity.
Method: Proposes a top-down “unfolding” scheme starting from a full-resolution 3DGS model, iteratively deriving coarser LODs using adaptive learnable mask-based pruning. Uses hierarchical spatial grids for global structure and a shared Anchor Codebook for localized details, creating a compact feature representation with minimal overhead for progressive refinement.
Result: Maintains high rendering quality across all LODs while achieving substantial storage reduction, demonstrating practicality for real-time 3DGS rendering in bandwidth- and memory-constrained scenarios.
Conclusion: The framework enables efficient, progressive 3DGS rendering with compact representation suitable for streaming and resource-limited applications through top-down LOD construction and shared feature modeling.
Abstract: 3D Gaussian Splatting (3DGS) has become a state-of-the-art framework for real-time, high-fidelity novel view synthesis. However, its substantial storage requirements and inherently unstructured representation pose challenges for deployment in streaming and resource-constrained environments. Existing Level-of-Detail (LOD) strategies, particularly those based on bottom-up construction, often introduce redundancy or lead to fidelity degradation. To overcome these limitations, we propose Iterative Gaussian Synopsis, a novel framework for compact and progressive rendering through a top-down “unfolding” scheme. Our approach begins with a full-resolution 3DGS model and iteratively derives coarser LODs using an adaptive, learnable mask-based pruning mechanism. This process constructs a multi-level hierarchy that preserves visual quality while improving efficiency. We integrate hierarchical spatial grids, which capture the global scene structure, with a shared Anchor Codebook that models localized details. This combination produces a compact yet expressive feature representation, designed to minimize redundancy and support efficient, level-specific adaptation. The unfolding mechanism promotes inter-layer reusability and requires only minimal data overhead for progressive refinement. Experiments show that our method maintains high rendering quality across all LODs while achieving substantial storage reduction. These results demonstrate the practicality and scalability of our approach for real-time 3DGS rendering in bandwidth- and memory-constrained scenarios.
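Learnable mask-based pruning of this kind is typically implemented with a straight-through estimator: hard binary keep/drop decisions in the forward pass, soft sigmoid gradients in the backward pass, plus a sparsity penalty that drives Gaussians out of the coarser LOD. A PyTorch sketch under those standard assumptions (the paper's exact parameterization may differ):

```python
import torch

class MaskedGaussians(torch.nn.Module):
    """Learnable per-Gaussian keep-mask for deriving a coarser LOD (sketch)."""

    def __init__(self, num_gaussians):
        super().__init__()
        self.mask_logit = torch.nn.Parameter(torch.ones(num_gaussians))

    def forward(self, opacities):
        soft = torch.sigmoid(self.mask_logit)
        hard = (soft > 0.5).float()
        mask = hard + soft - soft.detach()     # straight-through estimator
        # masked opacities kill pruned Gaussians; mean soft mask is the keep-rate
        return opacities * mask, soft.mean()

m = MaskedGaussians(num_gaussians=10000)
masked_opacity, keep_rate = m(torch.rand(10000))
# training loss would be: rendering_loss(masked_opacity, ...) + lam * keep_rate
```

Penalizing the keep-rate while supervising the masked rendering lets each derived LOD keep only the Gaussians that still matter at its scale.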
[576] LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment
Dujun Nie, Fengjiao Chen, Qi Lv, Jun Kuang, Xiaoyu Li, Xuezhi Cao, Xunliang Cai
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.11689 returned HTTP 429 (rate limited).
[577] Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Efstathios Karypidis, Spyros Gidaris, Nikos Komodakis
Main category: cs.CV
TL;DR: Re2Pix is a hierarchical video prediction framework that first predicts future semantic representations using a frozen vision foundation model, then uses these representations to condition a latent diffusion model for photorealistic frame generation.
Details
Motivation: Accurate future video prediction requires both high visual fidelity and consistent scene semantics, especially in complex dynamic environments like autonomous driving. Direct RGB frame prediction struggles with maintaining semantic consistency over time.
Method: Two-stage approach: 1) Forecast future scene structure in the feature space of a frozen vision foundation model, 2) Condition a latent diffusion model on these predicted representations to render photorealistic frames. Uses nested dropout and mixed supervision to address train-test mismatch between ground-truth and predicted representations.
Result: Experiments on challenging driving benchmarks show significant improvements in temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines.
Conclusion: The semantics-first hierarchical design effectively separates scene dynamics from appearance generation, leading to better video prediction performance in complex environments.
Abstract: Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at https://github.com/Sta8is/Re2Pix
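One plausible reading of the nested-dropout conditioning strategy, following the original nested dropout idea of ordered representations, is to randomly truncate the conditioning tokens during training so the renderer learns to tolerate partial or degraded representations at inference. The sketch below encodes that reading; the paper's exact scheme may differ.

```python
import torch

def nested_dropout(tokens, p_keep_all=0.5):
    """Randomly truncate conditioning tokens (nested-dropout-style sketch).

    tokens: (B, N, C) predicted-representation tokens. With probability
    p_keep_all the full sequence is kept; otherwise a random prefix length
    k is sampled and tokens beyond k are zeroed out.
    """
    B, N, _ = tokens.shape
    out = tokens.clone()
    for b in range(B):
        if torch.rand(()) > p_keep_all:
            k = torch.randint(1, N + 1, ()).item()
            out[b, k:] = 0.0                 # drop the suffix of the token sequence
    return out

cond = torch.randn(4, 196, 768)
cond = nested_dropout(cond)  # then feed into the diffusion model's conditioning path
```

The point of any such scheme is the same: the generator never learns to rely on a perfect, complete representation it will not get from the autoregressive predictor at test time.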
[578] Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models
Nhan Ho, Luu Le, Thanh-Huy Nguyen, Thien Nguyen, Xiaofeng Liu, Ulas Bagci
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.11711 returned HTTP 429 (rate limited).
[579] BEM: Training-Free Background Embedding Memory for False-Positive Suppression in Real-Time Fixed-Background Camera
Junwoo Park, Jangho Lee, Sunho Lim
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.11714 returned HTTP 429 (rate limited).
[580] The Devil is in the Details – From OCR for Old Church Slavonic to Purely Visual Stemma Reconstruction
Armin Hoenen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.11724 returned HTTP 429 (rate limited).
[581] Learning Long-term Motion Embeddings for Efficient Kinematics Generation
Nick Stracke, Kolja Bauer, Stefan Andreas Baumann, Miguel Angel Bautista, Josh Susskind, Björn Ommer
Main category: cs.CV
TL;DR: Efficient motion generation using compressed motion embeddings learned from tracker trajectories, enabling long-term motion synthesis with text or spatial conditioning.
Details
Motivation: Current video models can understand scene dynamics but are inefficient for exploring multiple possible futures through full video synthesis. There's a need for more efficient motion modeling that can generate long, realistic motions conditioned on goals.
Method: Learn a highly compressed motion embedding (64x temporal compression) from large-scale trajectories obtained from tracker models. Train a conditional flow-matching model in this latent space to generate motion latents conditioned on text prompts or spatial pokes.
Result: The approach generates motion distributions that outperform both state-of-the-art video models and specialized task-specific approaches, achieving orders of magnitude more efficient motion generation.
Conclusion: Operating directly on compressed motion embeddings enables efficient generation of long, realistic motions conditioned on various goal specifications, advancing motion modeling capabilities.
Abstract: Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.
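Training a conditional flow-matching model in a latent space reduces to regressing the velocity of a straight noise-to-data path. A minimal PyTorch sketch of one training step, using the standard rectified-flow parameterization (the paper's exact formulation is not specified here; the stand-in network and shapes are assumptions):

```python
import torch

def flow_matching_step(model, z1, cond):
    """One conditional flow-matching training step in a motion-latent space.

    z1: (B, D) target motion latents from the (frozen) motion embedding;
    cond: conditioning embedding (e.g. text prompt or spatial poke).
    The network regresses the constant velocity of z_t = (1 - t) z0 + t z1.
    """
    z0 = torch.randn_like(z1)                  # noise endpoint of the path
    t = torch.rand(z1.size(0), 1)
    zt = (1 - t) * z0 + t * z1                 # point on the straight path
    v_target = z1 - z0                         # its (constant) velocity
    v_pred = model(zt, t, cond)
    return ((v_pred - v_target) ** 2).mean()

model = lambda zt, t, c: torch.zeros_like(zt)  # stand-in for the real network
loss = flow_matching_step(model, torch.randn(8, 256), torch.randn(8, 512))
```

Because each latent stands for 64 frames of motion, a few integration steps in this space replace the thousands of denoising operations full video synthesis would need.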
[582] HDR Video Generation via Latent Alignment with Logarithmic Encoding
Naomi Ken Korem, Mohamed Oumoumad, Harel Cain, Matan Ben Yosef, Urska Jelercic, Ofir Bibi, Yaron Inger, Or Patashnik, Daniel Cohen-Or
Main category: cs.CV
TL;DR: HDR video generation achieved by adapting pretrained generative models using logarithmic encoding and camera-mimicking degradations, avoiding complex HDR-specific architectures.
Details
Motivation: HDR imagery presents challenges for generative models due to its mismatch with the bounded, perceptually compressed data they are trained on. Rather than learning new HDR representations from scratch, the paper explores leveraging the visual priors of existing pretrained models.
Method: Uses logarithmic encoding (common in cinematic pipelines) to map HDR to a distribution aligned with pretrained models' latent space. Lightweight fine-tuning adapts models without retraining encoders. Introduces a training strategy based on camera-mimicking degradations to help models infer missing HDR details from learned priors.
Result: Demonstrates high-quality HDR video generation using pretrained video models with minimal adaptation. Achieves strong results across diverse scenes and challenging lighting conditions.
Conclusion: HDR generation can be effectively handled without redesigning generative models by choosing representations that align with their learned priors, showing that fundamentally different image formation regimes can be addressed through proper adaptation.
Abstract: High dynamic range (HDR) imagery offers a rich and faithful representation of scene radiance, but remains challenging for generative models due to its mismatch with the bounded, perceptually compressed data on which these models are trained. A natural solution is to learn new representations for HDR, which introduces additional complexity and data requirements. In this work, we show that HDR generation can be achieved in a much simpler way by leveraging the strong visual priors already captured by pretrained generative models. We observe that a logarithmic encoding widely used in cinematic pipelines maps HDR imagery into a distribution that is naturally aligned with the latent space of these models, enabling direct adaptation via lightweight fine-tuning without retraining an encoder. To recover details that are not directly observable in the input, we further introduce a training strategy based on camera-mimicking degradations that encourages the model to infer missing high dynamic range content from its learned priors. Combining these insights, we demonstrate high-quality HDR video generation using a pretrained video model with minimal adaptation, achieving strong results across diverse scenes and challenging lighting conditions. Our results indicate that HDR, despite representing a fundamentally different image formation regime, can be handled effectively without redesigning generative models, provided that the representation is chosen to align with their learned priors.
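A generic logarithmic encoding shows why this representation suits pretrained models: it compresses many stops of linear radiance into a bounded, roughly perceptual range. The curve below is an illustrative log2 mapping, not the specific cinematic transfer function used in the paper.

```python
import numpy as np

def log_encode(hdr, scale=0.18, eps=1e-6):
    """Map linear HDR radiance into [0, 1] with a generic log curve.

    Real pipelines use vendor-specific 'log' transfer functions; this
    toy version packs roughly 16 stops around mid-gray (scale) into [0, 1].
    """
    x = np.maximum(hdr, 0.0) / scale
    y = np.log2(x + eps)
    return np.clip((y + 8.0) / 16.0, 0.0, 1.0)

def log_decode(y, scale=0.18, eps=1e-6):
    """Inverse of log_encode (exact within the un-clipped range)."""
    return (np.exp2(y * 16.0 - 8.0) - eps) * scale

hdr = np.random.exponential(1.0, (64, 64, 3)).astype(np.float32)
roundtrip = log_decode(log_encode(hdr))
```

Because the encoded values live in the same bounded, perceptually compressed regime as ordinary training images, a pretrained latent model can ingest them with only lightweight fine-tuning.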
[583] LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
Yuqian Yuan, Wenqiao Zhang, Juekai Lin, Yu Zhong, Mingjian Gao, Binhe Yu, Yunqi Cao, Wentong Li, Yueting Zhuang, Beng Chin Ooi
Main category: cs.CV
TL;DR: A comprehensive review paper examining the convergence of Large Multimodal Models (LMMs) with object-centric vision, focusing on improving object-level grounding, spatial reasoning, and visual manipulation capabilities.
Details
Motivation: Current LMMs excel at general vision-language understanding but lack the precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation capabilities needed for tasks requiring instance identification, object identity preservation, and precise region localization/modification.
Method: The paper organizes the literature into four themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. It reviews key modeling paradigms, learning strategies, and evaluation protocols.
Result: Provides a structured taxonomy of approaches combining LMMs with object-centric vision, identifying current capabilities and limitations in object-level multimodal systems.
Conclusion: Object-centric vision offers a principled framework to extend LMMs from global scene understanding to precise object-level operations, with future directions including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking.
Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision–language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.
[584] LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
Junhao Chen, Kejun Gao, Yuehan Cui, Mingze Sun, Mingjin Chen, Shaohui Wang, Xiaoxiao Long, Fei Ma, Qi Tian, Ruqi Huang, Hao Zhao
Main category: cs.CV
TL;DR: First framework for generating vector animations from text/visual prompts using Lottie format, with novel tokenizer and large dataset, enabling editable parametric motion generation.
Details
Motivation: Current video generation models only work in raster space and cannot produce vector animations, which offer resolution-independence, compactness, semantic structure, and editable parametric motion. Vector animation is a dominant form of Internet multimedia but lacks generative models.
Method: Developed Lottie Tokenizer to encode layered geometric primitives, transforms, and keyframe-based motion into compact token sequences. Created LottieAnimation-660K dataset (660k animations, 15M static images). Fine-tuned Qwen-VL to create LottieGPT for multimodal vector animation generation.
Result: Tokenizer dramatically reduces sequence length while preserving structural fidelity. LottieGPT generates coherent, editable vector animations from natural language/visual prompts, outperforms SOTA on SVG generation (single-frame case).
Conclusion: First successful framework for native vector animation generation, enabling editable parametric motion synthesis and bridging the gap between multimodal models and structured vector content creation.
Abstract: Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660k real-world Lottie animations and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).
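As a toy illustration of what tokenizing keyframe animation might look like, the sketch below quantizes the times and position values of a drastically simplified JSON structure into discrete tokens. Real Lottie files nest transforms under per-layer properties and the paper's tokenizer is far richer; everything here is a hypothetical stand-in.

```python
import json

def tokenize_keyframes(lottie_json, num_bins=256, t_max=300.0, v_max=1000.0):
    """Quantize position keyframes into discrete tokens (toy sketch).

    Walks a simplified {'layers': [{'position': [{'t': .., 'v': [x, y]}]}]}
    structure, not the real Lottie schema.
    """
    def q(x, lo, hi):
        x = min(max(x, lo), hi)
        return int((x - lo) / (hi - lo) * (num_bins - 1))

    tokens = []
    for layer in json.loads(lottie_json)["layers"]:
        tokens.append("<layer>")
        for kf in layer["position"]:
            tokens += [f"t_{q(kf['t'], 0, t_max)}",    # keyframe time bin
                       f"x_{q(kf['v'][0], 0, v_max)}", # quantized x position
                       f"y_{q(kf['v'][1], 0, v_max)}"] # quantized y position
    return tokens

doc = json.dumps({"layers": [{"position": [{"t": 0, "v": [10, 20]},
                                           {"t": 30, "v": [500, 20]}]}]})
print(tokenize_keyframes(doc))
```

Even this crude scheme shows the key property the abstract claims: a handful of keyframe tokens stands in for hundreds of rasterized frames, which is what makes autoregressive training tractable.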
[585] SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization
Deming Li, Abhay Yadav, Cheng Peng, Rama Chellappa, Anand Bhattad
Main category: cs.CV
TL;DR: SyncFix is a diffusion-based framework that enforces cross-view consistency for 3D scene reconstruction by synchronizing distorted and clean representations across multiple views during refinement.
Details
Motivation: Current scene reconstruction methods often suffer from semantic and geometric inconsistencies across different views, especially when using diffusion-based refinement techniques. There's a need for a framework that can enforce cross-view consistency during the refinement process to produce more coherent and accurate 3D reconstructions.
Method: SyncFix formulates refinement as a joint latent bridge matching problem, learning a joint conditional distribution over multiple views to synchronize distorted and clean representations. The framework enforces consistency throughout the denoising trajectory of diffusion models. Training uses only image pairs but generalizes to arbitrary numbers of views during inference.
Result: SyncFix consistently generates high-quality reconstructions and surpasses current state-of-the-art baselines, even without clean reference images. Reconstruction quality improves with additional views (with diminishing returns at higher view counts), and achieves even higher fidelity when sparse references are available.
Conclusion: SyncFix provides an effective framework for enforcing cross-view consistency in diffusion-based scene refinement, enabling high-quality 3D reconstructions that maintain semantic and geometric coherence across multiple viewpoints.
Abstract: We present SyncFix, a framework that enforces cross-view consistency during the diffusion-based refinement of reconstructed scenes. SyncFix formulates refinement as a joint latent bridge matching problem, synchronizing distorted and clean representations across multiple views to fix the semantic and geometric inconsistencies. This means SyncFix learns a joint conditional over multiple views to enforce consistency throughout the denoising trajectory. Our training is done only on image pairs, but it generalizes naturally to an arbitrary number of views during inference. Moreover, reconstruction quality improves with additional views, with diminishing returns at higher view counts. Qualitative and quantitative results demonstrate that SyncFix consistently generates high-quality reconstructions and surpasses current state-of-the-art baselines, even in the absence of clean reference images. SyncFix achieves even higher fidelity when sparse references are available.
[586] OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng
Main category: cs.CV
TL;DR: OmniShow is an end-to-end framework for Human-Object Interaction Video Generation (HOIVG) that synthesizes high-quality videos conditioned on text, reference images, audio, and pose, with applications in e-commerce, short videos, and entertainment.
Details
Motivation: The paper addresses the practical need for automating content creation in real-world applications like e-commerce demonstrations, short video production, and interactive entertainment. Existing approaches fail to accommodate all requisite multimodal conditions (text, images, audio, pose) for human-object interaction video generation.
Method: OmniShow introduces: 1) Unified Channel-wise Conditioning for efficient image and pose injection, 2) Gated Local-Context Attention for precise audio-visual synchronization, and 3) a Decoupled-Then-Joint Training strategy with model merging to leverage heterogeneous sub-task datasets and address data scarcity. The authors also establish HOIVG-Bench as a comprehensive evaluation benchmark.
Result: Extensive experiments demonstrate that OmniShow achieves state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
Conclusion: OmniShow provides an effective end-to-end framework for HOIVG that harmonizes multimodal conditions and delivers industry-grade performance, addressing both controllability-quality trade-offs and data scarcity challenges.
Abstract: In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
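Channel-wise conditioning in the style described (conditions stacked on the channel axis and projected back to the backbone width) can be sketched in a few lines of PyTorch. The shapes, the broadcast of the reference image along time, and the 1x1x1 fusion convolution are assumptions for illustration, not the paper's exact design.

```python
import torch

class ChannelwiseConditioning(torch.nn.Module):
    """Inject reference-image and pose conditions via channel concatenation.

    Conditions are broadcast along time, stacked on the channel axis, and
    fused back to the backbone's channel width with a 1x1x1 convolution.
    """
    def __init__(self, channels):
        super().__init__()
        self.proj = torch.nn.Conv3d(3 * channels, channels, kernel_size=1)

    def forward(self, video, ref_image, pose):
        # video, pose: (B, C, T, H, W); ref_image: (B, C, H, W)
        B, C, T, H, W = video.shape
        ref = ref_image.unsqueeze(2).expand(B, C, T, H, W)  # repeat over time
        return self.proj(torch.cat([video, ref, pose], dim=1))

cond = ChannelwiseConditioning(8)
out = cond(torch.randn(1, 8, 4, 16, 16), torch.randn(1, 8, 16, 16),
           torch.randn(1, 8, 4, 16, 16))
```

The appeal of channel concatenation over extra cross-attention streams is efficiency: the conditions ride through the existing backbone at no additional sequence length.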
[587] Pair2Scene: Learning Local Object Relations for Procedural Scene Generation
Xingjian Ran, Shujie Zhang, Weipeng Zhong, Li Luo, Bo Dai
Main category: cs.CV
TL;DR: Pair2Scene: A procedural 3D indoor scene generation framework that uses learned local object relations (support and functional) with hierarchical structure and physics-based algorithms to generate complex scenes beyond training distribution.
Details
Motivation: Current 3D indoor scene generation methods struggle with scaling to dense scenes and lack precise spatial reasoning. They often rely on LLMs/VLMs that can't handle detailed spatial relationships, and existing approaches have difficulty generating scenes beyond their training distribution.
Method: Proposes the Pair2Scene framework, which learns local object relations (support and functional relations) through a network trained on the curated 3D-Pairs dataset. Uses a hierarchical structure with recursive model application and collision-aware rejection sampling to align local rules into coherent global layouts.
Result: Outperforms existing methods in generating complex environments beyond training data while maintaining physical and semantic plausibility. Demonstrates scalability to dense scenes through extensive experiments.
Conclusion: Pair2Scene effectively addresses limitations of current 3D scene generation methods by focusing on local dependencies rather than global distributions, enabling generation of physically and semantically plausible complex indoor scenes beyond training distribution.
Abstract: Generating high-fidelity 3D indoor scenes remains a significant challenge due to data scarcity and the complexity of modeling intricate spatial relations. Current methods often struggle to scale beyond training distribution to dense scenes or rely on LLMs/VLMs that lack the ability for precise spatial reasoning. Building on top of the observation that object placement relies mainly on local dependencies instead of information-redundant global distributions, in this paper, we propose Pair2Scene, a novel procedural generation framework that integrates learned local rules with scene hierarchies and physics-based algorithms. These rules mainly capture two types of inter-object relations, namely support relations that follow physical hierarchies, and functional relations that reflect semantic links. We model these rules through a network, which estimates spatial position distributions of dependent objects conditioned on position and geometry of the anchor ones. Accordingly, we curate a dataset 3D-Pairs from existing scene data to train the model. During inference, our framework can generate scenes by recursively applying our model within a hierarchical structure, leveraging collision-aware rejection sampling to align local rules into coherent global layouts. Extensive experiments demonstrate that our framework outperforms existing methods in generating complex environments that go beyond training data while maintaining physical and semantic plausibility.
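Collision-aware rejection sampling is the glue between the learned local rules and a valid global layout: draw a placement from the conditional distribution around the anchor, reject it if it intersects anything already placed, and retry. A 2D sketch with axis-aligned boxes and a Gaussian stand-in for the learned sampler; all names and shapes are illustrative.

```python
import random

def place_object(anchor_box, sample_offset, placed_boxes, half_extent,
                 max_tries=100):
    """Sample a dependent object's position near its anchor, rejecting collisions.

    sample_offset(): stand-in for the learned conditional position distribution.
    Boxes are axis-aligned (xmin, ymin, xmax, ymax).
    """
    def overlaps(a, b):
        return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

    ax = (anchor_box[0] + anchor_box[2]) / 2
    ay = (anchor_box[1] + anchor_box[3]) / 2
    for _ in range(max_tries):
        dx, dy = sample_offset()
        box = (ax + dx - half_extent, ay + dy - half_extent,
               ax + dx + half_extent, ay + dy + half_extent)
        if not any(overlaps(box, other) for other in placed_boxes):
            return box
    return None  # no collision-free sample found: skip or re-plan this object

table = (0.0, 0.0, 2.0, 1.0)
chair = place_object(table, lambda: (random.gauss(0, 0.8), random.gauss(0, 0.8)),
                     [table], half_extent=0.25)
```

Applied recursively down the scene hierarchy (room anchors furniture, furniture anchors small objects), this turns purely local learned rules into a globally consistent, collision-free layout.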
[588] Who Handles Orientation? Investigating Invariance in Feature Matching
David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman
Main category: cs.CV
TL;DR: Learning rotation invariance in image matching pipelines - incorporating it in descriptors yields similar performance to matcher-level handling but enables faster rotation-invariant matching.
Details
Motivation: Modern keypoint matchers struggle with large in-plane rotations, and it's unclear at which stage rotation invariance should be incorporated in matching pipelines.
Method: Extensive experiments training on large 3D vision datasets, evaluating on image matching benchmarks, comparing rotation invariance in descriptors vs. matchers, and studying the emergence of invariance with scale.
Result: Descriptor-level rotation invariance yields similar performance to matcher-level, enables faster rotation-invariant matching, doesn’t hurt upright performance at scale, and scale improves generalization to rotations.
Conclusion: Rotation invariance can be effectively learned in descriptors, enabling efficient rotation-robust matching with state-of-the-art performance on challenging benchmarks.
Abstract: Finding matching keypoints between images is a core problem in 3D computer vision. However, modern matchers struggle with large in-plane rotations. A straightforward mitigation is to learn rotation invariance via data augmentation. However, it remains unclear at which stage rotation invariance should be incorporated. In this paper, we study this in the context of a modern sparse matching pipeline. We perform extensive experiments by training on a large collection of 3D vision datasets and evaluating on popular image matching benchmarks. Surprisingly, we find that incorporating rotation invariance already in the descriptor yields similar performance to handling it in the matcher. However, rotation invariance is achieved earlier in the matcher when it is learned in the descriptor, allowing for a faster rotation-invariant matcher. Further, we find that enforcing rotation invariance does not hurt upright performance when trained at scale. Finally, we study the emergence of rotation invariance through scale and find that increasing the training data size substantially improves generalization to rotated images. We release two matchers robust to in-plane rotations that achieve state-of-the-art performance on e.g. multi-modal (WxBS), extreme (HardMatch), and satellite image matching (SatAst). Code is available at https://github.com/davnords/loma.
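Learning rotation invariance in the descriptor via augmentation can be sketched as a consistency loss: describe the image and a rotated copy, undo the rotation on the descriptor map, and penalize disagreement. The snippet restricts itself to exact 90-degree rotations and a stand-in network; the paper trains with general in-plane rotations on large real datasets.

```python
import torch
import torch.nn.functional as F

def rotation_invariance_loss(descriptor_net, images):
    """Consistency loss pushing rotation invariance into a dense descriptor.

    Rotates the batch by a random multiple of 90 degrees (so pixels map
    exactly), rotates the descriptor map back, and penalizes disagreement.
    """
    k = int(torch.randint(0, 4, ()))
    d_orig = descriptor_net(images)                          # (B, C, H, W)
    d_rot = descriptor_net(torch.rot90(images, k, dims=(-2, -1)))
    d_back = torch.rot90(d_rot, -k, dims=(-2, -1))           # undo the rotation
    return (1 - F.cosine_similarity(d_orig, d_back, dim=1)).mean()

net = torch.nn.Conv2d(3, 64, 3, padding=1)                   # stand-in descriptor net
loss = rotation_invariance_loss(net, torch.randn(2, 3, 64, 64))
```

The paper's finding is about where this pressure is applied: imposing it on the descriptor, as above, lets the downstream matcher reach rotation invariance earlier and run faster than handling rotations in the matcher itself.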
[589] Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
Jiahao Huang, Fengyan Lin, Xuechao Yang, Chen Feng, Kexin Zhu, Xu Yang, Zhide Chen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.02123 returned HTTP 429 (rate limited).
[590] ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
Yiran Qin, Jiahua Ma, Li Kang, Wenzhan Li, Yihang Jiao, Xin Wen, Xiufeng Song, Heng Zhou, Jiwen Yu, Zhenfei Yin, Xihui Liu, Philip Torr, Yilun Du, Ruimao Zhang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.11386 returned HTTP 429 (rate limited).
[591] EagleVision: A Multi-Task Benchmark for Cross-Domain Perception in High-Speed Autonomous Racing
Zakhar Yagudin, Murad Mebrahtu, Ren Jin, Jiaqi Huang, Yujia Yue, Dzmitry Tsetserukou, Jorge Dias, Majid Khonji
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.11400 returned HTTP 429 (rate limited).
[592] M$^{2}$SNet: Multi-scale in Multi-scale Subtraction Network for Medical Image Segmentation
Xiaoqi Zhao, Hongpeng Jia, Youwei Pang, Long Lv, Feng Tian, Lihe Zhang, Weibing Sun, Huchuan Lu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2303.10894 returned HTTP 429 (rate limited).
[593] A Survey on Deep Learning Techniques for Action Anticipation
Zeyun Zhong, Manuel Martin, Michael Voit, Juergen Gall, Jürgen Beyerer
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2309.17257 returned HTTP 429 (rate limited).
[594] GaNI: Global and Near Field Illumination Aware Neural Inverse Rendering
Jiaye Wu, Saeed Hadadan, Geng Lin, Matthias Zwicker, David Jacobs, Roni Sengupta
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2403.15651 returned HTTP 429 (rate limited).
[595] Near OOD Detection for Vision-Language Prompt Learning with Contrastive Logit Score
Myong Chol Jung, Joanna Dipnall, Belinda Gabbe, He Zhao
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2405.16091 returned HTTP 429 (rate limited).
[596] Neural Surface Reconstruction from Sparse Views Using Epipolar Geometry
Xinhai Chang, Kaichen Zhou
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2406.04301 returned HTTP 429 (rate limited).
[597] What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction
Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Cornelius Bohm, Florian Dietrichkeit, Reza Pourreza, Xuanlin Li, Pulkit Madan, Mingu Lee, Mark Todorovich, Ingo Bax, Roland Memisevic
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2407.08101 returned HTTP 429 (rate limited).
[598] SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting
Yicheng Deng, Hideaki Hayashi, Hajime Nagahara
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2407.20799 returned HTTP 429 (rate limited).
[599] How to Spin an Object: First, Get the Shape Right
Rishabh Kabra, Drew A. Hudson, Sjoerd van Steenkiste, Joao Carreira, Niloy J. Mitra
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2412.10273 returned HTTP 429 (rate limited).
[600] HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
Ting Zhou, Daoyuan Chen, Qirui Jiao, Bolin Ding, Yaliang Li, Ying Shen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2412.17574 returned HTTP 429 (rate limited).
[601] HFI: A unified framework for training-free detection and implicit watermarking of latent diffusion model generated images
Sungik Choi, Hankook Lee, Jaehoon Lee, Seunghyun Kim, Stanley Jungkyu Choi, Moontae Lee
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2412.20704 returned HTTP 429 (rate limited).
[602] Temporal-Aware Spiking Transformer Hashing Based on 3D-DWT
Zihao Mei, Jianhao Li, Bolin Zhang, Chong Wang, Lijun Guo, Guoqi Li, Jiangbo Qian
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2501.06786 returned HTTP 429 (rate limited).
[603] Integrating Semi-Supervised and Active Learning for Semantic Segmentation
Wanli Ma, Oktay Karakus, Paul L. Rosin
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access error
Method: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2501.19227: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.19227&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[604] Uncertainty-Based Ensemble Learning in CMR Semantic Segmentation
Yiwei Liu, Liang Zhong, Lingyi Wen, Yuankai Wu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API request
Method: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2502.09269: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.09269&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[605] S4M: 4-points to Segment Anything
Adrien Meyer, Lorenzo Arboit, Giuseppe Massimiani, Shih-Min Yin, Didier Mutter, Nicolas Padoy
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting). No abstract available for analysis.
Details
Motivation: Cannot determine motivation as paper content is unavailable due to HTTP 429 error from arXiv API.
Method: Cannot determine method as paper content is unavailable due to HTTP 429 error from arXiv API.
Result: Cannot determine results as paper content is unavailable due to HTTP 429 error from arXiv API.
Conclusion: Cannot draw conclusions as paper content is unavailable due to HTTP 429 error from arXiv API.
Abstract: Failed to fetch summary for 2503.05534: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.05534&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[606] Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark
Yibin Ye, Xichao Teng, Shuo Chen, Leqi Liu, Kun Wang, Xiaokai Song, Zhang Li
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to access limitations
Method: Cannot determine method due to access limitations
Result: Cannot determine results due to access limitations
Conclusion: Cannot determine conclusion due to access limitations
Abstract: Failed to fetch summary for 2503.10692: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.10692&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[607] DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning
Chengxuan Qian, Shuo Xing, Shawn Li, Yue Zhao, Zhengzhong Tu
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2503.11892 suggests it’s from March 2025, but no content is available for analysis.
Details
Motivation: Cannot determine motivation without access to the paper content.
Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2503.11892: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.11892&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[608] AccidentSim: Generating Vehicle Collision Videos with Physically Realistic Collision Trajectories from Real-World Accident Reports
Xiangwen Zhang, Qian Zhang, Longfei Han, Qiang Qu, Xiaoming Chen, Weidong Cai
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetch
Method: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2503.20654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.20654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[609] Intelligent bear deterrence system based on computer vision: Reducing human bear conflicts in remote areas
Pengyu Chen, Teng Fei, Yunyan Du, Jiawei Yi, Yi Li, John A. Kupfer
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper content
Method: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2503.23178: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.23178&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[610] TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection
Lei Jiang, Chunzhao Xie, Tongxuan Liu, Yuting Zeng, Jinrong Guo, Yunheng Shen, Weizhe Huang, Jing Li, Xiaohua Xu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting). No abstract available for analysis.
Details
Motivation: Unable to determine motivation due to missing paper content.
Method: Unable to determine method due to missing paper content.
Result: Unable to determine results due to missing paper content.
Conclusion: Unable to determine conclusion due to missing paper content.
Abstract: Failed to fetch summary for 2504.04099: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.04099&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[611] Text-to-Image Models and Their Representation of People from Different Nationalities Engaging in Activities
Abdulkareem Alsudais
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrieved
Method: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2504.06313: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.06313&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[612] LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers
Md Abtahi Majeed Chowdhury, Md Rifat Ur Rahman, Akil Ahmad Taki
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2504.14386: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.14386&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[613] Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation
Hong-Tao Yu, Yuxin Peng, Serge Belongie, Xiu-Shen Wei
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2504.14988: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.14988&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[614] Auto-regressive transformation for image alignment
Kanggeon Lee, Soochahn Lee, Kyoung Mu Lee
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.04864: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.04864&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[615] Variational Visual Question Answering for Uncertainty-Aware Selective Prediction
Tobias Jan Wieczorek, Nathalie Daun, Mohammad Emtiyaz Khan, Marcus Rohrbach
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting).
Details
Motivation: Cannot determine motivation due to inability to access paper content.
Method: Cannot determine method due to inability to access paper content.
Result: Cannot determine results due to inability to access paper content.
Conclusion: Cannot draw conclusions due to inability to access paper content.
Abstract: Failed to fetch summary for 2505.09591: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.09591&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[616] RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo
Victor Oei, Jenny Schmalfuss, Lukas Mehl, Madlen Bartsch, Shashank Agnihotri, Margret Keuper, Andreas Bulling, Andrés Bruhn
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrieval
Method: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to draw conclusions due to failed paper retrieval
Abstract: Failed to fetch summary for 2505.09368: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.09368&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[617] SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper content
Method: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2505.17012: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.17012&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[618] Learning World Models for Interactive Video Generation
Taiye Chen, Xun Hu, Zihan Ding, Chi Jin
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictions
Method: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2505.21996: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.21996&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[619] Progressive Multimodal Interaction Network for Reliable Quantification of Fish Feeding Intensity in Aquaculture
Shulong Zhang, Mingyuan Yao, Jiayin Zhao, Daoliang Li, Yingyi Chen, Haihua Wang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access error
Method: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2506.14170: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.14170&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[620] Perceptual Inductive Bias Is What You Need Before Contrastive Learning
Tianqin Li, Junru Zhao, Dunhan Jiang, Shenghao Wu, Alan Ramirez, Tai Sing Lee
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2506.01201: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.01201&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[621] RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought
Junbo Qiao, Miaomiao Cai, Wei Li, Xudong Huang, Jie Hu, Xinghao Chen, Shaohui Lin, Hongkai Xiong
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2506.16796: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.16796&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[622] CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing
Dinh-Khoi Vo, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrieval
Method: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to determine conclusion due to failed paper retrieval
Abstract: Failed to fetch summary for 2506.18438: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.18438&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[623] PrefPaint: Enhancing Medical Image Inpainting through Expert Human Feedback
Duy-Bao Bui, Hoang-Khang Nguyen, Thao Thi Phuong Dao, Kim Anh Phung, Tam V. Nguyen, Justin Zhan, Minh-Triet Tran, Trung-Nghia Le
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.21834: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.21834&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[624] PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
Maciej K. Wozniak, Lianhang Liu, Yixi Cai, Patric Jensfelt
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access error
Method: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to draw conclusions due to access error
Abstract: Failed to fetch summary for 2507.17596: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.17596&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[625] Interactive Interface For Semantic Segmentation Dataset Synthesis
Ngoc-Do Tran, Minh-Tuan Huynh, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content
Method: Unable to determine method due to API rate limiting preventing access to paper content
Result: Unable to determine results due to API rate limiting preventing access to paper content
Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper content
Abstract: Failed to fetch summary for 2506.23470: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.23470&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[626] AdvDINO: Domain-Adversarial Self-Supervised Representation Learning for Spatial Proteomics
Stella Su, Marc Harary, Scott J. Rodig, William Lotter
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) - cannot analyze this specific paper
Details
Motivation: Unable to determine motivation as paper content could not be retrieved due to API rate limiting
Method: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to draw conclusions about the paper due to technical limitations in accessing the content
Abstract: Failed to fetch summary for 2508.04955: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.04955&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[627] Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
Bob Zhang, Haoran Li, Tao Zhang, Jianan Li, Cilin Yan, Xikai Liu, Jiayin Cai, Yanbin Hao
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictions
Method: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2507.00748: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.00748&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[628] COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection
Peiran Peng, Tingfa Xu, Liqiang Song, Mengqi Zhu, Yuqiang Fang, Jianan Li
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetch
Method: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2508.09533: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.09533&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[629] Dark-EvGS: Event Camera as an Eye for Radiance Field in the Dark
Jingqian Wu, Peiqi Duan, Zongqiang Wang, Changwei Wang, Boxin Shi, Edmund Y. Lam
Main category: cs.CV
TL;DR: Paper ID 2507.11931 - Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to abstract fetch failure
Method: Cannot determine method due to abstract fetch failure
Result: Cannot determine results due to abstract fetch failure
Conclusion: Cannot determine conclusion due to abstract fetch failure
Abstract: Failed to fetch summary for 2507.11931: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.11931&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[630] FedKLPR: KL-Guided Pruning-Aware Federated Learning for Person Re-Identification
Po-Hsien Yu, Yu-Syuan Tseng, Shao-Yi Chien
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in accessing paper content
Method: Unable to determine method due to technical error in accessing paper content
Result: Unable to determine results due to technical error in accessing paper content
Conclusion: Unable to determine conclusion due to technical error in accessing paper content
Abstract: Failed to fetch summary for 2508.17431: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.17431&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[631] DiffClean: Diffusion-based Makeup Removal for Accurate Age Estimation
Ekta Gavas, Sudipta Banerjee, Chinmay Hegde, Nasir Memon
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper content
Method: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2507.13292: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.13292&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[632] CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection
Qiyu Chen, Zhen Qu, Wei Luo, Haiming Yao, Yunkang Cao, Yuxin Jiang, Yinan Duan, Huiyuan Luo, Chengkan Lv, Zhengtao Zhang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2508.03447: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.03447&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[633] Adversarial Video Promotion Against Text-to-Video Retrieval
Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Qian Li, Shuai Liu, Chao Shen
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetch
Method: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2508.06964: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.06964&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[634] Semantic Segmentation Algorithm Based on Light Field and LiDAR Fusion
Jie Luo, Yuxuan Jiang, Xin Jin, Mingyu Liu, Yihui Fan
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.06687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[635] A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing, and Generation
Shuting He, Peilin Ji, Yitong Yang, Changshuo Wang, Jiayi Ji, Yinglin Wang, Henghui Ding
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrieved
Method: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2508.09977: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.09977&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[636] Post-Processing Methods for Improving Accuracy in MRI Inpainting
Nishad Kulkarni, Krithika Iyer, Austin Tapp, Abhijeet Parida, Daniel Capellán-Martín, Zhifan Jiang, María J. Ledesma-Carbayo, Syed Muhammad Anwar, Marius George Linguraru
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.15282: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15282&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[637] CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance
Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Llados, Xiatian Zhu, Anjan Dutta
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2508.16644: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.16644&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[638] Quantization Robustness to Input Degradations for Object Detection
Toghrul Karimov, Hassan Imani, Allan Kazakov
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as paper content could not be retrieved
Method: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2508.19600: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.19600&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[639] CraftGraffiti: Exploring Human Identity with Custom Graffiti Art via Facial-Preserving Diffusion Models
Ayan Banerjee, Fernando Vilariño, Josep Lladós
Main category: cs.CV
TL;DR: Unable to analyze paper 2508.20640 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract is unavailable due to rate limiting error
Method: Cannot determine method as abstract is unavailable due to rate limiting error
Result: Cannot determine results as abstract is unavailable due to rate limiting error
Conclusion: Cannot draw conclusions as abstract is unavailable due to rate limiting error
Abstract: Failed to fetch summary for 2508.20640: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.20640&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[640] TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering
Ayan Banerjee, Josep Llados, Umapada Pal, Anjan Dutta
Main category: cs.CV
TL;DR: Unable to analyze paper 2509.04123 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract retrieval failed
Method: Cannot determine method as abstract retrieval failed
Result: Cannot determine results as abstract retrieval failed
Conclusion: Cannot draw conclusions without access to paper abstract
Abstract: Failed to fetch summary for 2509.04123: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.04123&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[641] GeoArena: Evaluating Open-World Geographic Reasoning in Large Vision-Language Models
Pengyue Jia, Yingyi Zhang, Xiangyu Zhao, Sharon Li
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access error
Method: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2509.04334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.04334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[642] Delta Rectified Flow Sampling for Text-to-Image Editing
Gaspard Beaudouin, Minghan Li, Jaeyeon Kim, Sung-Hoon Yoon, Mengyu Wang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access error
Method: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2509.05342: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.05342&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[643] FM-SIREN & FM-FINER: Implicit Neural Representation Using Nyquist-based Orthogonality
Mohammed Alsakabi, Wael Mobeirek, John M. Dolan, Ozan K. Tonguz
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) - unable to analyze content
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2509.23438: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23438&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[644] Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API request
Method: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to analyze paper due to technical limitations in accessing content
Abstract: Failed to fetch summary for 2512.12675: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.12675&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[645] DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning
Chi Zhang, Haibo Qiu, Qiming Zhang, Zhixiong Zeng, Lin Ma, Jing Zhang
Main category: cs.CV
TL;DR: Unable to analyze paper 2509.25866 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to paper abstract
Method: Cannot determine method without access to paper abstract
Result: Cannot determine results without access to paper abstract
Conclusion: Cannot draw conclusions without access to paper abstract
Abstract: Failed to fetch summary for 2509.25866: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25866&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[646] LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving
Long Nguyen, Micha Fauth, Bernhard Jaeger, Daniel Dauner, Maximilian Igl, Andreas Geiger, Kashyap Chitta
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable due to server rate limiting
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2512.20563: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20563&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[647] Inferring Dynamic Physical Properties from Video Foundation Models
Guanqi Zhan, Xianzheng Ma, Weidi Xie, Andrew Zisserman
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper content
Method: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2510.02311: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02311&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[648] ASBench: Image Anomalies Synthesis Benchmark for Anomaly Detection
Qunyi Zhang, Songan Zhang, Jiaqi Liu, Jinbao Wang, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper details
Method: Cannot analyze method as paper content is unavailable due to HTTP 429 error
Result: No results available - paper summary retrieval failed due to rate limiting
Conclusion: Cannot draw conclusions about paper content due to technical access issues
Abstract: Failed to fetch summary for 2510.07927: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07927&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[649] Exploring Cross-Modal Flows for Few-Shot Learning
Ziqi Jiang, Yanghao Wang, Long Chen
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2510.14543: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14543&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[650] What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models
Dasol Choi, Guijin Son, Hanwool Lee, Minhyuk Kim, Hyunwoo Ko, Teabin Lim, Ahn Eungyeol, Jungwhan Kim, Seunghyeok Hong, Youngsook Song
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2601.06165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.06165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[651] RL makes MLLMs see better than SFT
Junha Song, Sangdoo Yun, Dongyoon Han, Jaegul Choo, Byeongho Heo
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2510.16333: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.16333&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[652] Ninja Codes: Neurally Generated Fiducial Markers for Stealthy 6-DoF Tracking
Yuichiro Takeuchi, Yusuke Imoto, Shunya Kato
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting
Method: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2510.18976: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18976&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[653] TCSA-UDA: Text-Driven Cross-Semantic Alignment for Unsupervised Domain Adaptation in Medical Image Segmentation
Lalit Maurya, Honghai Liu, Reyer Zwiggelaar
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper content
Method: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2511.05782: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05782&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[654] Fake-HR1: Rethinking Reasoning of Vision Language Model for Synthetic Image Detection
Changjiang Jiang, Xinkuan Sha, Fengchang Yu, Jingjing Liu, Jian Liu, Mingqi Fang, Chenfeng Zhang, Wei Lu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to API access limitations preventing paper retrieval
Method: No method information available - arXiv API request resulted in rate limiting error
Result: No results available - paper content could not be retrieved
Conclusion: Cannot analyze paper due to technical limitations in accessing arXiv data
Abstract: Failed to fetch summary for 2602.10042: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.10042&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[655] DoReMi: Bridging 3D Domains via Topology-Aware Domain-Representation Mixture of Experts
Mingwei Xing, Xinliang Wang, Yifeng Shi
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2511.11232: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11232&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[656] HDR 3D Gaussian Splatting via Luminance-Chromaticity Decomposition
Kaixuan Zhang, Minxian Li, Mingwu Ren, Jiankang Deng, Xiatian Zhu
Main category: cs.CV
TL;DR: Failed to fetch summary for paper 2511.12895 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API request
Method: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2511.12895: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.12895&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[657] GrOCE: Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models
Ning Han, Zhenyu Ge, Feng Han, Yuhua Sun, Chengqing Li, Jingjing Chen
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without paper content
Method: Cannot determine method without paper content
Result: Cannot determine results without paper content
Conclusion: Cannot draw conclusions without paper content
Abstract: Failed to fetch summary for 2511.12968: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.12968&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[658] Automatic Uncertainty-Aware Synthetic Data Bootstrapping for Historical Map Segmentation
Lukas Arzoumanidis, Julius Knechtel, Jan-Henrik Haunert, Youness Dehbi
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access error
Method: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2511.15875: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.15875&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[659] LPNSR: Optimal Noise-Guided Diffusion Image Super-Resolution Via Learnable Noise Prediction
Shuwei Huang, Shizhuo Liu, Zijun Wei
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access limitations
Method: Unable to determine method due to access limitations
Result: Unable to determine results due to access limitations
Conclusion: Unable to draw conclusions due to access limitations
Abstract: Failed to fetch summary for 2603.21045: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21045&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[660] CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation
Samer Abualhanud, Christian Grannemann, Max Mehltretter
Main category: cs.CV
TL;DR: Paper ID 2511.16428 could not be fetched due to HTTP 429 error (rate limiting), preventing analysis of its content and assessment of its relevance.
Details
Motivation: Unable to determine motivation as the paper content could not be retrieved due to HTTP 429 rate limiting error from arXiv API.
Method: Cannot analyze method since the paper abstract/content was not accessible due to API rate limiting.
Result: No results can be reported as the paper content was not fetched successfully.
Conclusion: The analysis could not be completed due to technical limitations in accessing the paper content from arXiv.
Abstract: Failed to fetch summary for 2511.16428: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16428&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[661] Learning Visually Interpretable Oscillator Networks for Soft Continuum Robots from Video
Henrik Krauss, Johann Licher, Naoya Takeishi, Annika Raatz, Takehisa Yairi
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2511.18322 suggests it’s from November 2025, but content is unavailable.
Details
Motivation: Cannot determine motivation without access to the paper content.
Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2511.18322: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18322&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[662] ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
Wencheng Ye, Tianshi Wang, Lei Zhu, Fengling Li, Guoli Yang, Hengtao Shen
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.18082: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18082&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[663] Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation
Kejia Liu, Haoyang Zhou, Ruoyu Xu, Peicheng Wang, Mingli Song, Haofei Zhang
Main category: cs.CV
TL;DR: Failed to fetch summary for arXiv ID 2603.22153 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrieval
Method: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to determine conclusion due to failed paper retrieval
Abstract: Failed to fetch summary for 2603.22153: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22153&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[664] SciPostLayoutTree: A Dataset for Structural Analysis of Scientific Posters
Shohei Tanaka, Atsushi Hashimoto, Yoshitaka Ushiku
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2511.18329: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18329&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[665] Hide-and-Seek Attribution: Weakly Supervised Segmentation of Vertebral Metastases in CT
Matan Atad, Alexander W. Marka, Lisa Steinhelfer, Anna Curto-Vilalta, Yannik Leonhardt, Sarah C. Foreman, Anna-Sophia Walburga Dietrich, Robert Graf, Alexandra S. Gersing, Bjoern Menze, Daniel Rueckert, Jan S. Kirschke, Hendrik Möller
Main category: cs.CV
TL;DR: Paper 2512.06849: Could not fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictions preventing abstract retrieval
Method: Method information unavailable - arXiv API rate limiting prevented access to paper details
Result: No results available - HTTP 429 error indicates too many requests to arXiv API
Conclusion: Cannot provide analysis - Paper content inaccessible due to technical limitations
Abstract: Failed to fetch summary for 2512.06849: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.06849&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[666] MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Xiyang Wu, Zongxia Li, Jihui Jin, Guangyao Shi, Gouthaman KV, Vishnu Raj, Nilotpal Sinha, Jingxi Chen, Fan Du, Dinesh Manocha
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting
Method: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2511.18373: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18373&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[667] ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
Shaobo Ju, Baiyang Song, Tao Chen, Jiapeng Zhang, Qiong Wu, Chao Chang, HuaiXi Wang, Yiyi Zhou, Rongrong Ji
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrieved
Method: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2603.22911: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22911&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[668] Eevee: Towards Close-up High-resolution Video-based Virtual Try-on
Jianhao Zeng, Yancheng Bai, Ruidong Chen, Xuanpu Zhang, Lei Sun, Dongyang Jin, Ryan Xu, Nannan Zhang, Dan Song, Xiangxiang Chu
Main category: cs.CV
TL;DR: Unable to analyze paper 2511.18957 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract is unavailable
Method: Cannot determine method as abstract is unavailable
Result: Cannot determine results as abstract is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2511.18957: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18957&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[669] MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes
Kehua Chen, Tianlu Mao, Xinzhu Ma, Hao Jiang, Zehao Li, Zihan Liu, Shuqin Gao, Honglong Zhao, Feng Dai, Yucheng Zhang, Zhaoqi Wang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2511.19172: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19172&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[670] DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning
Jiajian Huang, Dongliang Zhu, Zitong Yu, Hui Ma, Jiayu Zhang, Chunmei Zhu, Xiaochun Cao
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Paper ID 2603.23916 cannot be analyzed without access to its abstract or content.
Details
Motivation: Cannot determine motivation without access to paper content.
Method: Cannot determine method without access to paper content.
Result: Cannot determine results without access to paper content.
Conclusion: Cannot draw conclusions without access to paper content.
Abstract: Failed to fetch summary for 2603.23916: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23916&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[671] Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Apratim Bhattacharyya, Bicheng Xu, Sanjay Haresh, Reza Pourreza, Litian Liu, Sunny Panchal, Pulkit Madan, Leonid Sigal, Roland Memisevic
Main category: cs.CV
TL;DR: Unable to analyze paper 2511.21998 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrieved
Method: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2511.21998: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21998&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[672] MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection
Mengxue Hu, Yunfeng Diao, Changtao Miao, Zhiqing Guo, Jianshu Li, Zhe Li, Joey Tianyi Zhou
Main category: cs.CV
TL;DR: Failed to fetch summary for paper 2512.00336 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing abstract
Method: Unable to determine method due to missing abstract
Result: Unable to determine results due to missing abstract
Conclusion: Unable to determine conclusion due to missing abstract
Abstract: Failed to fetch summary for 2512.00336: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.00336&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[673] Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
Xuanpu Zhao, Zhentao Tan, Dianmo Sheng, Tianxiang Chen, Yao Liu, Yue Wu, Tao Gong, Qi Chu, Nenghai Yu
Main category: cs.CV
TL;DR: Unable to analyze paper 2603.27494 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrieved
Method: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2603.27494: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27494&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[674] FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
Seungho Choi, Jeahun Sung, Jihyong Oh
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting).
Details
Motivation: Cannot determine motivation without access to the paper content.
Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2512.01390: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.01390&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[675] ProPhy: Progressive Physical Alignment for Dynamic World Simulation
Zijun Wang, Panwen Hu, Jing Wang, Terry Jingchen Zhang, Yuhao Cheng, Long Chen, Yiqiang Yan, Zutao Jiang, Hanhui Li, Xiaodan Liang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.05564 was rate-limited (HTTP 429), so no abstract could be analyzed.
[676] Optimization-Guided Diffusion for Interactive Scene Generation
Shihao Li, Naisheng Ye, Tianyu Li, Kashyap Chitta, Tuo An, Peng Su, Boyang Wang, Haiou Liu, Chen Lv, Hongyang Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.07661 failed with a connection error (the remote end closed the connection without a response), so no abstract could be analyzed.
[677] VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
Mengtian Li, Yuwei Lu, Feifei Li, Chenqi Gan, Zhifeng Xie, Xi Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.02467 was rate-limited (HTTP 429), so no abstract could be analyzed.
[678] CityGuard: Graph-Aware Private Descriptors for Bias-Resilient Identity Search Across Urban Cameras
Rong Fu, Yibo Meng, Jia Yee Tan, Jiaxuan Lu, Rui Lu, Jiekai Wu, Zhaolu Kang, Simon Fong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.18047 was rate-limited (HTTP 429), so no abstract could be analyzed.
[679] ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
Liming Kuang, Yordanka Velikova, Mahdi Saleh, Jan-Nico Zaech, Danda Pani Paudel, Benjamin Busam
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.09056 was rate-limited (HTTP 429), so no abstract could be analyzed.
[680] PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models
Binesh Sadanandan, Vahid Behzadan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.21428 was rate-limited (HTTP 429), so no abstract could be analyzed.
[681] Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes
Ziheng Qin, Yuheng Ji, Renshuai Tao, Yuxuan Tian, Yuyang Liu, Yipu Wang, Xiaolong Zheng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.12982 was rate-limited (HTTP 429), so no abstract could be analyzed.
[682] Can LLMs Reason About Attention? Towards Zero-Shot Analysis of Multimodal Classroom Behavior
Nolan Platt, Sehrish Nizamani, Alp Tural, Elif Tural, Saad Nizamani, Andrew Katz, Yoonje Lee, Nada Basit
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.03401 was rate-limited (HTTP 429), so no abstract could be analyzed.
[683] Dual-R-DETR: Resolving Query Competition with Pairwise Routing in Transformer Decoders
Ye Zhang, Qi Chen, Wenyou Huang, Rui Liu, Zhengjian Kang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.13876 was rate-limited (HTTP 429), so no abstract could be analyzed.
[684] Zero-Shot Quantization via Weight-Space Arithmetic
Daniele Solombrino, Antonio Andrea Gargiulo, Adrian Robert Minut, Luca Zhou, Alessandro Zirilli, Emanuele Rodolà
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.03420 was rate-limited (HTTP 429), so no abstract could be analyzed.
[685] Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression
Mridankan Mandal
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.07819 was rate-limited (HTTP 429), so no abstract could be analyzed.
[686] On the Effectiveness of Textual Prompting with Lightweight Fine-Tuning for SAM3 Remote Sensing Segmentation
Roni Blushtein-Livnon, Osher Rafaeli, David Ioffe, Amir Boger, Karen Sandberg Esquenazi, Tal Svoray
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.15564 was rate-limited (HTTP 429), so no abstract could be analyzed.
[687] 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.17012 was rate-limited (HTTP 429), so no abstract could be analyzed.
[688] FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis
Ekta Gavas, Sudipta Banerjee, Chinmay Hegde, Nasir Memon
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.18073 was rate-limited (HTTP 429), so no abstract could be analyzed.
[689] StableTTA: Training-Free Test-Time Adaptation that Improves Model Accuracy on ImageNet1K to 96%
Zheng Li, Jerry Cheng, Huanying Helen Gu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.04552 was rate-limited (HTTP 429), so no abstract could be analyzed.
[690] Dual-Margin Embedding for Fine-Grained Long-Tailed Plant Taxonomy
Cheng Yaw Low, Heejoon Koo, Jaewoo Park, Meeyoung Cha
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.18994 was rate-limited (HTTP 429), so no abstract could be analyzed.
[691] VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
Honghao Fu, Miao Xu, Yiwei Wang, Dailing Zhang, Liu Jun, Yujun Cai
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.05418 was rate-limited (HTTP 429), so no abstract could be analyzed.
[692] Decoupled Generative Modeling for Human-Object Interaction Synthesis
Hwanhee Jung, Seunggwan Lee, Jeongyoon Yoon, SeungHyeon Kim, Giljoo Nam, Qixing Huang, Sangpil Kim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.19049 was rate-limited (HTTP 429), so no abstract could be analyzed.
[693] FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
Xiangru Jian, Hao Xu, Wei Pang, Xinjian Zhao, Chengyu Tao, Qixin Zhang, Xikun Zhang, Chao Zhang, Guanzhi Deng, Alex Xue, Juan Du, Tianshu Yu, Garth Tarr, Linqi Song, Qiuzhuang Sun, Dacheng Tao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.07413 was rate-limited (HTTP 429), so no abstract could be analyzed.
[694] Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution
Hexin Zhang, Dong Li, Jie Huang, Bingzhou Wang, Xueyang Fu, Zhengjun Zha
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.23532 was rate-limited (HTTP 429), so no abstract could be analyzed.
[695] Unified Multimodal Uncertain Inference
Dengjia Zhang, Alexander Martin, William Jurayj, Kenton Murray, Benjamin Van Durme, Reno Kriz
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.08701 was rate-limited (HTTP 429), so no abstract could be analyzed.
[696] GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models
Xiangdong Hu, Yangyang Jiang, Qin Hu, Xiaojun Jia
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.03416 was rate-limited (HTTP 429), so no abstract could be analyzed.
[697] Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?
Jie Zhu, Yiyang Su, Xiaoming Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.06993 was rate-limited (HTTP 429), so no abstract could be analyzed.
[698] PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion
Mahdi Chamseddine, Didier Stricker, Jason Rambach
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.07447 was rate-limited (HTTP 429), so no abstract could be analyzed.
[699] Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring
Xinmiao Xiong, Bangya Liu, Hao Wang, Dayou Li, Nuo Chen, Andrew Feng, Mingyu Ding, Suman Banerjee, Yang Zhou, Zhiwen Fan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.08718 was rate-limited (HTTP 429), so no abstract could be analyzed.
[700] Affostruction: 3D Affordance Grounding with Generative Reconstruction
Chunghyun Park, Seunghyeon Lee, Minsu Cho
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.09211 was rate-limited (HTTP 429), so no abstract could be analyzed.
[701] Context-Aware Semantic Segmentation via Stage-Wise Attention
Antoine Carreaud, Elias Naha, Arthur Chansel, Nina Lahellec, Jan Skaloud, Adrien Gressin
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.11310 was rate-limited (HTTP 429), so no abstract could be analyzed.
[702] Mirai: Autoregressive Visual Generation Needs Foresight
Yonghao Yu, Lang Huang, Zerun Wang, Runyi Li, Toshihiko Yamasaki
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.14671 was rate-limited (HTTP 429), so no abstract could be analyzed.
[703] LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval
Gensmo.ai, Chao Gao, Siqiao Xue, Jiwen Fu, Tingyi Gu, Shanshan Li, Fan Zhou
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.14706 was rate-limited (HTTP 429), so no abstract could be analyzed.
[704] Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization
Jiaqi Li, Guangming Wang, Shuntian Zheng, Minzhe Ni, Xiaoman Lu, Guanghui Ye, Yu Guan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.21078 was rate-limited (HTTP 429), so no abstract could be analyzed.
[705] Catalyst: Out-of-Distribution Detection via Elastic Scaling
Abid Hassan, Tuan Ngo, Saad Shafiq, Nenad Medvidovic
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.02409 was rate-limited (HTTP 429), so no abstract could be analyzed.
[706] Contour Refinement using Discrete Diffusion in Low Data Regime
Fei Yu Guan, Ian Keefe, Sophie Wilkinson, Daniel D.B. Perrakis, Steven Waslander
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.05880 was rate-limited (HTTP 429), so no abstract could be analyzed.
[707] WiFlow: A Lightweight WiFi-based Continuous Human Pose Estimation Network with Spatio-Temporal Feature Decoupling
Yi Dao, Lankai Zhang, Hao Liu, Haiwei Zhang, Wenbo Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.08661 was rate-limited (HTTP 429), so no abstract could be analyzed.
[708] GeoFormer: A Lightweight Swin Transformer for Joint Building Height and Footprint Estimation from Sentinel Imagery
Han Jinzhen, JinByeong Lee, JiSung Kim, MinKyung Cho, DaHee Kim, HongSik Yun
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.09932 was rate-limited (HTTP 429), so no abstract could be analyzed.
[709] Unified Unsupervised and Sparsely-Supervised 3D Object Detection by Semantic Pseudo-Labeling and Prototype Learning
Yushen He, Lei Zhao, Weidong Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.21484 was rate-limited (HTTP 429), so no abstract could be analyzed.
[710] Specificity-aware reinforcement learning for fine-grained open-world classification
Samuele Angheben, Davide Berasi, Alessandro Conti, Elisa Ricci, Yiming Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.03197 was rate-limited (HTTP 429), so no abstract could be analyzed.
[711] EdgeDAM: Real-time Object Tracking for Mobile Devices
Syed Muhammad Raza, Syed Murtaza Hussain Abidi, Khawar Islam, Muhammad Ibrahim, Ajmal Saeed Mian
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.05463 was rate-limited (HTTP 429), so no abstract could be analyzed.
[712] HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation
Daichao Zhao, Qiupu Chen, Feng He, Xin Ning, Qiankun Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.10128 was rate-limited (HTTP 429), so no abstract could be analyzed.
[713] Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
Yuto Shibata, Kashu Yamazaki, Lalit Jayanti, Yoshimitsu Aoki, Mariko Isogawa, Katerina Fragkiadaki
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.11346 was rate-limited (HTTP 429), so no abstract could be analyzed.
[714] A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition
Jiajun Sun, Zhe Gao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.12221 was rate-limited (HTTP 429), so no abstract could be analyzed.
[715] RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization
Ruicheng Zhang, Guangyu Chen, Zunnan Xu, Zihao Liu, Zhizhou Zhong, Mingyang Zhang, Jun Zhou, Xiu Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.12639 was rate-limited (HTTP 429), so no abstract could be analyzed.
[716] Bidirectional Cross-Attention Fusion of High-Res RGB and Low-Res HSI for Multimodal Automated Waste Sorting
Jonas V. Funk, Lukas Roming, Andreas Michel, Paul Bäcker, Georg Maier, Thomas Längle, Markus Klute
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.13941 was rate-limited (HTTP 429), so no abstract could be analyzed.
[717] Unified Removal of Raindrops and Reflections: A New Benchmark and A Novel Pipeline
Xingyu Liu, Zewei He, Yu Chen, Chunyu Zhu, Zixuan Chen, Xing Luo, Zhe-Ming Lu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.16446 was rate-limited (HTTP 429), so no abstract could be analyzed.
[718] Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation
Zihao Wang, Yuxiang Wei, Xinpeng Zhou, Tianyu Zhang, Tao Liang, Yalong Bai, Hongzhi Zhang, Wangmeng Zuo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.20725 was rate-limited (HTTP 429), so no abstract could be analyzed.
[719] Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression
Nazia Tasnim, Shrimai Prabhumoye, Bryan A. Plummer
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.27383 was rate-limited (HTTP 429), so no abstract could be analyzed.
[720] TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K
Mattia D’Urso, Yuxi Hu, Christian Sormann, Mattia Rossi, Friedrich Fraundorfer
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.28287 was rate-limited (HTTP 429), so no abstract could be analyzed.
[721] ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction
Sirshapan Mitra, Yogesh S. Rawat
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.02003 was rate-limited (HTTP 429), so no abstract could be analyzed.
[722] THOM: Generating Physically Plausible Hand-Object Meshes From Text
Uyoung Jeong, Yihalem Yimolal Tiruneh, Hyung Jin Chang, Seungryul Baek, Kwang In Kim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.02736 was rate-limited (HTTP 429), so no abstract could be analyzed.
[723] SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation
Guiyu Zhang, Yabo Chen, Xunzhi Xiang, Junchao Huang, Zhongyu Wang, Li Jiang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.03723 was rate-limited (HTTP 429), so no abstract could be analyzed.
[724] ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
Zitong Xu, Huiyu Duan, Shengyao Qin, Guangyu Yang, Guangji Ma, Xiongkuo Min, Ke Gu, Guangtao Zhai, Patrick Le Callet
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.03765 was rate-limited (HTTP 429), so no abstract could be analyzed.
[725] Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality
Yanming Xiu, Zhengyuan Jiang, Neil Zhenqiang Gong, Maria Gorlatova
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.05510 was rate-limited (HTTP 429), so no abstract could be analyzed.
[726] Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
Jintao Chen, Chengyu Bai, Junjun Hu, Xinda Xue, Mu Xu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.06939 was rate-limited (HTTP 429), so no abstract could be analyzed.
[727] INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Xinyu Chen, Yifu Wang, Yipeng Chen, Zhenzhou Fan, Zhewen Le, Zhichao Ye, Ziqiang Zhao
Main category: cs.CV
TL;DR: INSPATIO-WORLD is a real-time framework for recovering and generating high-fidelity dynamic interactive scenes from single reference videos, featuring spatiotemporal consistency and precise user interaction capabilities.
Details
Motivation: Current video generation methods lack spatial persistence and visual realism needed for seamless navigation in complex environments, making it difficult to support interactive exploration of 4D scenes.
Method: Proposes a Spatiotemporal Autoregressive (STAR) architecture with Implicit Spatiotemporal Cache for global consistency and Explicit Spatial Constraint Module for geometric structure. Introduces Joint Distribution Matching Distillation (JDMD) to overcome fidelity degradation from synthetic data reliance.
Result: Significantly outperforms existing SOTA models in spatial consistency and interaction precision, ranking first among real-time interactive methods on WorldScore-Dynamic benchmark, establishing practical pipeline for 4D environment navigation.
Conclusion: INSPATIO-WORLD enables high-fidelity, consistent, and controllable scene evolution from monocular videos, advancing real-time interactive scene generation and navigation capabilities.
Abstract: Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.
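Editor's note: as a rough mental model of the STAR rollout described above, the control flow alternates between updating a latent world state from history and conditioning the next frame on a pose derived from the user action. The sketch below is a hypothetical reconstruction of that loop only; the real modules are not public, and every name, shape, and the EMA cache are illustrative assumptions, not the authors' implementation.

```python
import torch

class ImplicitCache:
    """Toy stand-in for the Implicit Spatiotemporal Cache: aggregates
    reference and historical latents into one running world state
    (here, a simple exponential moving average)."""
    def __init__(self, momentum=0.9):
        self.state, self.momentum = None, momentum

    def update(self, latent):
        if self.state is None:
            self.state = latent.clone()
        else:
            self.state = self.momentum * self.state + (1 - self.momentum) * latent
        return self.state

def star_rollout(generator, to_camera_pose, reference_latent, actions):
    """Schematic autoregressive loop: update cache, apply the explicit
    spatial constraint (action -> camera pose), generate the next latent."""
    cache, frames = ImplicitCache(), []
    latent = reference_latent
    for action in actions:
        world_state = cache.update(latent)
        pose = to_camera_pose(action)      # explicit geometric conditioning
        latent = generator(world_state, pose)
        frames.append(latent)
    return frames
```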
[728] RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs
Liang Yao, Shengxiang Xu, Fan Liu, Chuanyi Zhang, Bishun Yao, Rui Min, Yongjun Li, Chaoqian Ouyang, Shimin Di, Min-Ling Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.07765 was rate-limited (HTTP 429), so no abstract could be analyzed.
[729] Bag of Bags: Adaptive Visual Vocabularies for Genizah Join Image Retrieval
Sharva Gogawale, Gal Grudka, Daria Vasyutinsky-Shapira, Omer Ventura, Berat Kurar-Barakat, Nachum Dershowitz
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.08138 was rate-limited (HTTP 429), so no abstract could be analyzed.
[730] ParseBench: A Document Parsing Benchmark for AI Agents
Boyang Zhang, Sebastián G. Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet, Daniel B. Ospina, Simon Suo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.08538 was rate-limited (HTTP 429), so no abstract could be analyzed.
[731] Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Zile Wang, Zexiang Liu, Jiaxing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, Yidan Xietian, Jiangbo Pei, Liang Hu, Boyi Jiang, Hua Xue, Zidong Wang, Haofeng Sun, Wei Li, Wanli Ouyang, Xianglong He, Yang Liu, Yangguang Li, Yahui Zhou
Main category: cs.CV
TL;DR: Matrix-Game 3.0 is a memory-augmented interactive world model for 720p real-time longform video generation, achieving 40 FPS with a 5B model while maintaining minute-long temporal consistency.
Details
Motivation: Existing diffusion models for interactive video generation struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting real-world applicability.
Method: Three systematic improvements: 1) Industrial-scale infinite data engine with synthetic, game-collected, and real-world video augmentation; 2) Training framework for long-horizon consistency via prediction residual modeling and camera-aware memory retrieval; 3) Multi-segment autoregressive distillation with DMD, model quantization, and VAE decoder pruning for real-time inference.
Result: Achieves up to 40 FPS real-time generation at 720p resolution with a 5B model while maintaining stable memory consistency over minute-long sequences. Scaling to 2x14B model further improves quality, dynamics, and generalization.
Conclusion: Matrix-Game 3.0 provides a practical pathway toward industrial-scale deployable world models by solving the trade-off between temporal consistency and real-time performance in interactive video generation.
Abstract: With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
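Editor's note: the self-correction idea in this abstract (re-injecting imperfect generated frames during training) is easy to state in code. Below is a minimal sketch under stated assumptions: `model.generate_next` is a hypothetical one-step autoregressive prediction, and the re-injection probability is an arbitrary placeholder, not a value from the paper.

```python
import torch

def corrupt_context_with_rollouts(model, gt_frames, reinject_prob=0.3):
    """Replace some ground-truth context frames with the model's own
    generated frames, exposing the model to its prediction errors so it
    learns to recover during long autoregressive rollouts.
    gt_frames: (T, C, H, W) ground-truth clip."""
    context = gt_frames.clone()
    for t in range(1, gt_frames.shape[0]):
        if torch.rand(()).item() < reinject_prob:
            with torch.no_grad():
                # Hypothetical stand-in for one autoregressive step
                # conditioned on the (possibly corrupted) prefix.
                context[t] = model.generate_next(context[:t])
    return context  # train on this context instead of the clean clip
```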
[732] ELT: Elastic Looped Transformers for Visual Generation
Sahil Goyal, Swayam Agrawal, Gautham Govind Anil, Prateek Jain, Sujoy Paul, Aditya Kusupati
Main category: cs.CV
TL;DR: ELT introduces parameter-efficient visual generative models using recurrent transformers with weight sharing and intra-loop self-distillation for image/video generation.
Details
Motivation: To address the parameter inefficiency of conventional deep transformer stacks in visual generative models while maintaining high synthesis quality.
Method: Uses recurrent transformer architecture with weight-shared blocks and Intra-Loop Self Distillation (ILSD), where intermediate loops are distilled from maximum training loops for consistent training.
Result: Achieves 4× parameter reduction with competitive FID of 2.0 on ImageNet 256×256 and FVD of 72.8 on UCF-101, enabling any-time inference with quality-compute trade-offs.
Conclusion: ELT significantly advances efficiency frontiers for visual synthesis through parameter-efficient recurrent transformers with flexible inference capabilities.
Abstract: We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To effectively train these models for image and video generation, we propose the idea of Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model’s depth in a single training step. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference capability with dynamic trade-offs between computational cost and generation quality, with the same parameter count. ELT significantly shifts the efficiency frontier for visual synthesis. With $4\times$ reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of $2.0$ on class-conditional ImageNet $256 \times 256$ and FVD of $72.8$ on class-conditional UCF-101.
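Editor's note: the core mechanism here (one weight-shared block looped N times, with intermediate loops distilled toward the max-loop output in a single step) is concrete enough to sketch. The PyTorch snippet below is an illustrative reconstruction, not the authors' code: the block definition, loop counts, and the MSE distillation objective are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoopedTransformer(nn.Module):
    def __init__(self, dim=256, heads=4, max_loops=12):
        super().__init__()
        # One weight-shared block reused on every loop iteration; this is
        # where the parameter savings over a stack of unique layers come from.
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.max_loops = max_loops

    def forward(self, x):
        outputs = []
        for _ in range(self.max_loops):
            x = self.block(x)
            outputs.append(x)
        return outputs  # hidden state after each loop

def ilsd_loss(model, x, student_loops=(4, 8)):
    # Run the full loop once: the deepest (max-loop) output acts as the
    # teacher; selected intermediate loops act as students, all in a
    # single training step.
    outputs = model(x)
    teacher = outputs[-1].detach()
    loss = sum(F.mse_loss(outputs[k - 1], teacher) for k in student_loops)
    return loss / len(student_loops)

model = LoopedTransformer()
x = torch.randn(2, 16, 256)  # (batch, tokens, dim)
ilsd_loss(model, x).backward()
```

At inference, running fewer loops than `max_loops` gives the "elastic" quality-compute trade-off the entry describes, since ILSD keeps shallow configurations consistent with the deepest one.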
[733] FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding
Kaidong Feng, Zhuoxuan Huang, Huizhong Guo, Yuting Jin, Xinyu Chen, Yue Liang, Yifei Gai, Li Zhou, Yunshan Ma, Zhu Sun
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.09249 was rate-limited (HTTP 429), so no abstract could be analyzed.
[734] VAGNet: Vision-based Accident Anticipation with Global Features
Vipooshan Vipulananthan, Charith D. Chitraranjan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.09305 was rate-limited (HTTP 429), so no abstract could be analyzed.
[735] Tango: Taming Visual Signals for Efficient Video Large Language Models
Shukang Yin, Sirui Zhao, Hanchao Wang, Baozhi Jia, Xianquan Wang, Chaoyou Fu, Enhong Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2604.09547 was rate-limited (HTTP 429), so no abstract could be analyzed.
[736] KiseKloset for Fashion Retrieval and Recommendation
Thanh-Tung Phan-Nguyen, Khoi-Nguyen Nguyen-Ngoc, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.23471 was rate-limited (HTTP 429), so no abstract could be analyzed.
[737] StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation
Mingyu Liu, Jiuhe Shu, Hui Chen, Zeju Li, Canyu Zhao, Jiange Yang, Shenyuan Gao, Hao Chen, Chunhua Shen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.05057 was rate-limited (HTTP 429), so no abstract could be analyzed.
[738] Switch-JustDance: Benchmarking Whole Body Motion Tracking Controllers Using a Commercial Console Game
Jeonghwan Kim, Wontaek Kim, Yidan Lu, Jin Cheng, Fatemeh Zargarbashi, Zicheng Zeng, Zekun Qi, Zhiyang Dou, Nitish Sontakke, Donghoon Baek, Sehoon Ha, Tianyu Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.17925 was rate-limited (HTTP 429), so no abstract could be analyzed.
[739] Flow Gym: A framework for the development, benchmarking, training, and deployment of flow-field quantification methods
Francesco Banelli, Antonio Terpin, Alan Bonomi, Raffaello D’Andrea
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.20642 was rate-limited (HTTP 429), so no abstract could be analyzed.
[740] DisCo-FLoc: Using Dual-Level Visual-Geometric Contrasts to Disambiguate Depth-Aware Visual Floorplan Localization
Shiyong Meng, Tao Zou, Bolei Chen, Chaoxu Mu, Jianxin Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) when fetching 2601.01822.
[741] MVOS_HSI: A Python Library for Preprocessing Agricultural Crop Hyperspectral Data
Rishik Aggarwal, Krisha Joshi, Pappu Kumar Yadav, Jianwei Qin, Thomas F. Burks, Moon S. Kim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) when fetching 2604.07656.
cs.AI
[742] LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
Jon M Laurent, Albert Bou, Michael Pieler, Conor Igoe, Alex Andonian, Siddharth Narayanan, James Braza, Alexandros Sanchez Vassopoulos, Jacob L Steenwyk, Blake Lash, Andrew D White, Samuel G Rodriques
Main category: cs.AI
TL;DR: LABBench2 is an evolved benchmark for measuring AI systems’ real-world scientific capabilities, featuring nearly 1,900 tasks that are more realistic and challenging than those of its predecessor, LAB-Bench.
Details
Motivation: As AI applications in scientific research expand from foundation models to autonomous systems, there's a growing need to measure real-world capabilities beyond rote knowledge and reasoning, focusing on meaningful scientific work performance.
Method: Developed LABBench2 as an evolution of LAB-Bench with nearly 1,900 tasks that measure similar capabilities but in more realistic contexts, providing a public dataset and evaluation harness for community use.
Result: LABBench2 shows a significant jump in difficulty compared to LAB-Bench, with model-specific accuracy differences ranging from -26% to -46% across subtasks, indicating substantial room for improvement in AI scientific capabilities.
Conclusion: LABBench2 continues as a de facto benchmark for AI scientific research capabilities, providing a more realistic and challenging evaluation framework to advance development of AI tools for core research functions.
Abstract: Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data to agentic autonomous hypothesis generation systems to AI-driven autonomous labs. The need to measure progress of AI systems in scientific domains correspondingly must not only accelerate but also increasingly shift focus to more real-world capabilities: beyond rote knowledge and even reasoning, toward actually measuring the ability to perform meaningful work. Prior work introduced the Language Agent Biology Benchmark (LAB-Bench) as an initial attempt at measuring these abilities. Here we introduce an evolution of that benchmark, LABBench2, for measuring real-world capabilities of AI systems performing useful scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a continuation of LAB-Bench, measuring similar capabilities but in more realistic contexts. We evaluate performance of current frontier models, and show that while abilities measured by LAB-Bench and LABBench2 have improved substantially, LABBench2 provides a meaningful jump in difficulty (model-specific accuracy differences range from -26% to -46% across subtasks) and underscores continued room for performance improvement. LABBench2 continues the legacy of LAB-Bench as a de facto benchmark for AI scientific research capabilities and we hope that it continues to help advance development of AI tools for these core research functions. To facilitate community use and development, we provide the task dataset at https://huggingface.co/datasets/futurehouse/labbench2 and a public eval harness at https://github.com/EdisonScientific/labbench2.
[743] FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
Yupeng Cao, Haohang Li, Weijin Liu, Wenbo Cao, Anke Xu, Lingfei Qian, Xueqing Peng, Minxue Tang, Zhiyuan Yao, Jimin Huang, K. P. Subbalakshmi, Zining Zhu, Jordan W. Suchow, Yangyang Yu
Main category: cs.AI
TL;DR: FinTrace is a comprehensive benchmark for evaluating LLM tool-calling in financial tasks, featuring 800 expert-annotated trajectories across 34 task categories with multi-axis evaluation metrics.
Details
Motivation: Existing financial tool-calling benchmarks focus on limited scenarios and use call-level metrics that fail to capture trajectory-level reasoning quality, creating a need for more comprehensive evaluation.
Method: Created FinTrace benchmark with 800 expert-annotated trajectories across 34 financial task categories, using rubric-based evaluation with 9 metrics across 4 axes. Also developed FinTrace-Training dataset with 8,196 curated trajectories for preference learning.
Result: Evaluation of 13 LLMs shows frontier models achieve strong tool selection but all struggle with information utilization and final answer quality. Fine-tuning with FinTrace-Training improves intermediate reasoning metrics but end-to-end answer quality remains a bottleneck.
Conclusion: There’s a critical gap between invoking the right tools and reasoning effectively over their outputs. Trajectory-level improvements don’t fully propagate to final output quality, indicating need for better reasoning capabilities.
Abstract: Recent studies demonstrate that tool-calling capability enables large language models (LLMs) to interact with external environments for long-horizon financial tasks. While existing benchmarks have begun evaluating financial tool calling, they focus on limited scenarios and rely on call-level metrics that fail to capture trajectory-level reasoning quality. To address this gap, we introduce FinTrace, a benchmark comprising 800 expert-annotated trajectories spanning 34 real-world financial task categories across multiple difficulty levels. FinTrace employs a rubric-based evaluation protocol with nine metrics organized along four axes – action correctness, execution efficiency, process quality, and output quality – enabling fine-grained assessment of LLM tool-calling behavior. Our evaluation of 13 LLMs reveals that while frontier models achieve strong tool selection, all models struggle with information utilization and final answer quality, exposing a critical gap between invoking the right tools and reasoning effectively over their outputs. To move beyond diagnosis, we construct FinTrace-Training, the first trajectory-level preference dataset for financial tool-calling, containing 8,196 curated trajectories with tool-augmented contexts and preference pairs. We fine-tune Qwen-3.5-9B using supervised fine-tuning followed by direct preference optimization (DPO) and show that training on FinTrace-Training consistently improves intermediate reasoning metrics, with DPO more effectively suppressing failure modes. However, end-to-end answer quality remains a bottleneck, indicating that trajectory-level improvements do not yet fully propagate to final output quality.
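To make the four-axis rubric concrete, here is a minimal aggregation sketch. The axis names follow the abstract, but the individual metric names, their [0, 1] scales, and the uniform averaging are invented for illustration; this is not FinTrace's actual scoring protocol.

```python
# Hypothetical sketch of trajectory-level rubric aggregation in the spirit of
# FinTrace's four axes; metric names, scales, and weighting are illustrative only.
from statistics import mean

# Nine per-trajectory metrics grouped under the paper's four axes.
AXES = {
    "action_correctness": ["tool_selection", "argument_validity", "call_ordering"],
    "execution_efficiency": ["redundant_calls", "step_count"],
    "process_quality": ["information_utilization", "error_recovery"],
    "output_quality": ["answer_accuracy", "answer_completeness"],
}

def score_trajectory(metric_scores: dict[str, float]) -> dict[str, float]:
    """Average each axis's metrics (all assumed normalized to [0, 1])."""
    axis_scores = {axis: mean(metric_scores[m] for m in metrics)
                   for axis, metrics in AXES.items()}
    axis_scores["overall"] = mean(axis_scores.values())
    return axis_scores

example = {m: 0.8 for ms in AXES.values() for m in ms}
example["answer_accuracy"] = 0.4   # strong tool use, weak final answer
print(score_trajectory(example))
```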
[744] Linear Programming for Multi-Criteria Assessment with Cardinal and Ordinal Data: A Pessimistic Virtual Gap Analysis
Fuh-Hwa Franklin Liu, Su-Chuan Shih
Main category: cs.AI
TL;DR: A two-step linear programming method using Virtual Gap Analysis (VGA) models to rank alternatives in Multi-criteria Analysis, addressing subjectivity and data diversity issues.
Details
Motivation: Traditional MCA methods suffer from subjective evaluations, biases, and data diversity affecting reliability and precision of parameter estimation.
Method: Two-step method integrating two novel VGA models to assess alternatives from a pessimistic perspective using both quantitative/qualitative criteria with cardinal/ordinal data, then prioritizing alternatives to eliminate the least favorable ones.
Result: Proposed method is dependable and scalable, enabling thorough assessments efficiently and effectively within decision support systems.
Conclusion: The novel VGA-based approach provides a robust solution to address subjectivity and data diversity issues in multi-criteria decision making.
Abstract: Multi-criteria Analysis (MCA) is used to rank alternatives based on various criteria. Key MCA methods, such as Multiple Criteria Decision Making (MCDM) methods, estimate parameters for criteria to compute the performance of each alternative. Nonetheless, subjective evaluations and biases frequently influence the reliability of results, while the diversity of data affects the precision of the parameters. The novel linear programming-based Virtual Gap Analysis (VGA) models tackle these issues. This paper outlines a two-step method that integrates two novel VGA models to assess each alternative from a pessimistic perspective, using both quantitative and qualitative criteria, and employing cardinal and ordinal data. It then prioritizes the alternatives to eliminate the least favorable one. The proposed method is dependable and scalable, enabling thorough assessments efficiently and effectively within decision support systems.
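The abstract does not specify the VGA models themselves, so the following is only a generic sketch of a pessimistic linear-programming assessment: each alternative is scored under the criterion weights least favorable to it, subject to normalization and one ordinal weight constraint. All data, bounds, and constraints are made up.

```python
# Illustrative pessimistic-weighting LP (not the paper's actual VGA model):
# for each alternative, find criterion weights that minimize its score,
# subject to weights summing to 1 and an ordinal constraint w1 >= w2.
import numpy as np
from scipy.optimize import linprog

scores = np.array([               # rows: alternatives, cols: criteria (made up)
    [0.9, 0.4, 0.7],
    [0.6, 0.8, 0.5],
    [0.5, 0.6, 0.9],
])

def pessimistic_score(i: int) -> float:
    c = scores[i]                              # minimize c @ w: worst case for i
    A_ub = np.array([[-1.0, 1.0, 0.0]])        # ordinal: w2 - w1 <= 0, i.e. w1 >= w2
    res = linprog(c, A_ub=A_ub, b_ub=[0.0],
                  A_eq=[[1.0, 1.0, 1.0]], b_eq=[1.0],   # weights sum to 1
                  bounds=[(0.05, 1.0)] * 3,             # keep every criterion active
                  method="highs")
    return float(res.fun)

ranking = sorted(range(len(scores)), key=pessimistic_score, reverse=True)
print("pessimistic ranking (best first):", ranking)     # eliminate the last one
```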
[745] Seven simple steps for log analysis in AI systems
Magda Dubois, Ekin Zorer, Maia Hamin, Joe Skinner, Alexandra Souly, Jerome Wynne, Harry Coppock, Lucas Satos, Sayash Kapoor, Sunischal Dev, Keno Juchems, Kimberly Mai, Timo Flesch, Lennart Luettgau, Charles Teague, Eric Patey, JJ Allaire, Lorenzo Pacchiardi, Jose Hernandez-Orallo, Cozmin Ududec
Main category: cs.AI
TL;DR: A framework and pipeline for analyzing AI system logs to understand model capabilities and behaviors, with implementation in the Inspect Scout library.
Details
Motivation: AI systems generate extensive logs during tool and user interactions, which can provide valuable insights into model capabilities, behaviors, and evaluation effectiveness, but there's a lack of standardized analysis approaches.
Method: Proposes a standardized pipeline based on current best practices for log analysis, implemented in the Inspect Scout library with concrete code examples and detailed guidance on each step.
Result: Provides a practical framework for rigorous and reproducible log analysis, including identification of common pitfalls in the analysis process.
Conclusion: The framework offers researchers a foundation for systematic log analysis to better understand AI system behaviors and evaluation outcomes.
Abstract: AI systems produce large volumes of logs as they interact with tools and users. Analysing these logs can help understand model capabilities, propensities, and behaviours, or assess whether an evaluation worked as intended. Researchers have started developing methods for log analysis, but a standardised approach is still missing. Here we suggest a pipeline based on current best practices. We illustrate it with concrete code examples in the Inspect Scout library, provide detailed guidance on each step, and highlight common pitfalls. Our framework provides researchers with a foundation for rigorous and reproducible log analysis.
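As a flavor of the kind of pipeline the paper describes (without reproducing the Inspect Scout API, which is not shown in the abstract), here is a generic sketch in plain pandas: audit log completeness first, then aggregate only completed runs. Field names and data are invented.

```python
# Generic log-analysis sketch (plain pandas, *not* the Inspect Scout API):
# load structured eval logs, check completeness, then aggregate per task.
import pandas as pd

records = [  # stand-in for logs loaded from disk
    {"run_id": 1, "task": "qa", "status": "success", "score": 0.9, "tokens": 812},
    {"run_id": 2, "task": "qa", "status": "error",   "score": None, "tokens": 15},
    {"run_id": 3, "task": "code", "status": "success", "score": 0.4, "tokens": 2310},
]
df = pd.DataFrame(records)

# Step 1: audit before analysing -- errored runs silently bias averages.
errored = df[df["status"] != "success"]
print(f"{len(errored)}/{len(df)} runs errored; inspect these before aggregating")

# Step 2: aggregate only completed runs, keeping the sample size visible.
summary = (df[df["status"] == "success"]
           .groupby("task")["score"]
           .agg(["mean", "count"]))
print(summary)
```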
[746] Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
Jiachen Zhu, Lingyu Yang, Rong Shan, Congmin Zheng, Zeyu Zheng, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin
Main category: cs.AI
TL;DR: This paper introduces the concept of “Humanization” for GUI agents to avoid detection by platforms, framing it as a MinMax optimization problem between detectors and agents, and establishes benchmarks and methods for making agents behave more human-like.
Details
Motivation: As autonomous GUI agents face adversarial countermeasures from digital platforms, existing research focuses on utility and robustness but neglects anti-detection capabilities. The authors argue that for agents to survive in human-centric ecosystems, they must evolve humanization capabilities to avoid being detected and blocked.
Method: The authors formalize the interaction as a “Turing Test on Screen” MinMax optimization problem, collect a high-fidelity dataset of mobile touch dynamics, establish the Agent Humanization Benchmark (AHB) with detection metrics, and propose methods ranging from heuristic noise injection to data-driven behavioral matching to make agents more human-like.
Result: The analysis shows that vanilla LMM-based agents are easily detectable due to unnatural kinematics. The proposed methods demonstrate that agents can achieve high imitability (human-like behavior) theoretically and empirically without sacrificing performance on tasks.
Conclusion: This work shifts the paradigm from whether an agent can perform a task to how it performs it within human-centric ecosystems, laying groundwork for seamless coexistence in adversarial digital environments by making agents more human-like and less detectable.
Abstract: The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization capabilities. We introduce the “Turing Test on Screen,” formally modeling the interaction as a MinMax optimization problem between a detector and an agent aiming to minimize behavioral divergence. We then collect a new high-fidelity dataset of mobile touch dynamics, and our analysis shows that vanilla LMM-based agents are easily detectable due to unnatural kinematics. Consequently, we establish the Agent Humanization Benchmark (AHB) and detection metrics to quantify the trade-off between imitability and utility. Finally, we propose methods ranging from heuristic noise to data-driven behavioral matching, demonstrating that agents can achieve high imitability theoretically and empirically without sacrificing performance. This work shifts the paradigm from whether an agent can perform a task to how it performs it within a human-centric ecosystem, laying the groundwork for seamless coexistence in adversarial digital environments.
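As a minimal, hypothetical illustration of the "heuristic noise" end of the proposed spectrum, the sketch below perturbs a perfectly straight, evenly timed synthetic swipe with spatial tremor and timing jitter. All parameters are invented; this is not the paper's method.

```python
# Hypothetical heuristic noise injection for a synthetic touch swipe:
# a robotic agent emits a perfectly straight, evenly timed trace; humans don't.
import numpy as np

rng = np.random.default_rng(0)

def humanize_swipe(start, end, n=30, jitter_px=3.0, tempo_sd=0.15):
    t = np.linspace(0.0, 1.0, n)
    path = np.outer(1 - t, start) + np.outer(t, end)       # straight line
    path += rng.normal(0.0, jitter_px, size=path.shape)    # spatial tremor
    path[0], path[-1] = start, end                         # endpoints stay exact
    # Uneven inter-event timing: humans accelerate, then decelerate.
    dt = np.abs(rng.normal(1.0, tempo_sd, size=n - 1))
    timestamps = np.concatenate([[0.0], np.cumsum(dt)]) / dt.sum()
    return path, timestamps

path, ts = humanize_swipe(np.array([100.0, 900.0]), np.array([100.0, 300.0]))
print(path[:3], ts[:3])
```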
[747] AHC: Meta-Learned Adaptive Compression for Continual Object Detection on Memory-Constrained Microcontrollers
Bibin Wilson
Main category: cs.AI
TL;DR: AHC enables continual object detection on memory-constrained MCUs (<100KB) via adaptive hierarchical compression with meta-learning, hierarchical multi-scale compression, and dual-memory architecture with theoretical guarantees against catastrophic forgetting.
Details
Motivation: Deploying continual object detection on microcontrollers with under 100KB memory requires efficient feature compression that can adapt to evolving task distributions. Existing fixed compression strategies cannot adapt to heterogeneous task characteristics, leading to suboptimal memory utilization and catastrophic forgetting.
Method: Three key innovations: (1) true MAML-based compression adapting via gradient descent in 5 inner-loop steps, (2) hierarchical multi-scale compression with scale-aware ratios matching FPN redundancy patterns (8:1 for P3, 6.4:1 for P4, 4:1 for P5), (3) dual-memory architecture combining short-term and long-term banks with importance-based consolidation under 100KB budget.
Result: Experiments on CORe50, TiROD, and PASCAL VOC benchmarks with three baselines (Fine-tuning, EWC, iCaRL) demonstrate AHC enables practical continual detection within 100KB replay budget, achieving competitive accuracy through mean-pooled compressed feature replay with EWC regularization and feature distillation.
Conclusion: AHC provides a practical solution for continual object detection on memory-constrained MCUs with formal theoretical guarantees bounding catastrophic forgetting, enabling adaptive compression that matches task characteristics while staying within strict memory budgets.
Abstract: Deploying continual object detection on microcontrollers (MCUs) with under 100KB memory requires efficient feature compression that can adapt to evolving task distributions. Existing approaches rely on fixed compression strategies (e.g., FiLM conditioning) that cannot adapt to heterogeneous task characteristics, leading to suboptimal memory utilization and catastrophic forgetting. We introduce Adaptive Hierarchical Compression (AHC), a meta-learning framework featuring three key innovations: (1) true MAML-based compression that adapts via gradient descent to each new task in just 5 inner-loop steps, (2) hierarchical multi-scale compression with scale-aware ratios (8:1 for P3, 6.4:1 for P4, 4:1 for P5) matching FPN redundancy patterns, and (3) a dual-memory architecture combining short-term and long-term banks with importance-based consolidation under a hard 100KB budget. We provide formal theoretical guarantees bounding catastrophic forgetting as $O(\varepsilon\sqrt{T} + 1/\sqrt{M})$, where $\varepsilon$ is the compression error, $T$ is the task count, and $M$ is the memory size. Experiments on CORe50, TiROD, and PASCAL VOC benchmarks with three standard baselines (Fine-tuning, EWC, iCaRL) demonstrate that AHC enables practical continual detection within a 100KB replay budget, achieving competitive accuracy through mean-pooled compressed feature replay combined with EWC regularization and feature distillation.
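To make the stated guarantees and budgets concrete, the snippet below plugs illustrative numbers into the bound $O(\varepsilon\sqrt{T} + 1/\sqrt{M})$ and into the scale-aware compression ratios quoted in the abstract. Since big-O hides constants, only the trends are meaningful; the raw per-scale feature sizes are invented.

```python
# Illustrative arithmetic for AHC's forgetting bound O(eps*sqrt(T) + 1/sqrt(M));
# constants are hidden by big-O, so only the trends below are meaningful.
import math

def forgetting_bound(eps: float, tasks: int, memory: int) -> float:
    return eps * math.sqrt(tasks) + 1.0 / math.sqrt(memory)

for T in (5, 20, 80):   # forgetting grows with task count, shrinks with memory
    print(f"T={T:3d}:", round(forgetting_bound(eps=0.05, tasks=T, memory=1024), 3))

# Memory budgeting with the abstract's scale-aware compression ratios.
raw_kb = {"P3": 64, "P4": 32, "P5": 16}          # hypothetical raw feature sizes
ratio  = {"P3": 8.0, "P4": 6.4, "P5": 4.0}        # ratios given in the abstract
compressed = {k: raw_kb[k] / ratio[k] for k in raw_kb}
print(compressed, "total KB per sample:", round(sum(compressed.values()), 2))
```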
[748] Explainable Planning for Hybrid Systems
Mir Md Sajid Sarwar
Main category: cs.AI
TL;DR: Thesis on explainable AI planning (XAIP) for hybrid systems that closely represent real-world problems, addressing the need for explanations in AI-based autonomous systems across safety-critical domains.
Details
Motivation: As AI technologies advance and autonomous systems replace manually crafted ones in complex safety-critical domains, there's a growing need to generate explanations for AI-based systems, which is a major challenge for the planning community.
Method: The thesis presents a comprehensive study on explainable artificial intelligence planning (XAIP) specifically for hybrid systems that capture real-world problem representations closely.
Result: Not specified in the abstract, but the thesis presumably develops methods for generating explanations in AI planning systems for hybrid domains.
Conclusion: Explainable AI planning is crucial for autonomous systems in safety-critical applications, and this thesis contributes to XAIP for hybrid systems that better represent real-world problems.
Abstract: The recent advancement in artificial intelligence (AI) technologies facilitates a paradigm shift toward automation. Autonomous systems are fully or partially replacing manually crafted ones. At the core of these systems is automated planning. With the advent of powerful planners, automated planning is now applied to many complex and safety-critical domains, including smart energy grids, self-driving cars, warehouse automation, urban and air traffic control, search and rescue operations, surveillance, robotics, and healthcare. There is a growing need to generate explanations of AI-based systems, which is one of the major challenges the planning community faces today. The thesis presents a comprehensive study on explainable artificial intelligence planning (XAIP) for hybrid systems that capture a representation of real-world problems closely.
[749] Help Without Being Asked: A Deployed Proactive Agent System for On-Call Support with Continuous Self-Improvement
Fengrui Liu, Xiao He, Tieying Zhang
Main category: cs.AI
TL;DR: Vigil is a proactive AI agent system for cloud service support that assists throughout the entire on-call lifecycle, including during human analyst involvement, with continuous self-improvement from human-resolved cases.
Details
Motivation: Current reactive AI agents for customer support disengage when issues escalate to human analysts, missing opportunities to assist with follow-ups, track progress, or learn from unresolved cases, creating inefficiencies in cloud service platforms handling thousands of daily tickets.
Method: Vigil integrates into customer-analyst dialogues, proactively offering assistance without explicit invocation, and features a continuous self-improvement mechanism that extracts knowledge from human-resolved cases to autonomously update its capabilities.
Result: Deployed on ByteDance’s Volcano Engine cloud platform for over 10 months, comprehensive evaluations demonstrate Vigil’s effectiveness and practicality in real-world cloud service support scenarios.
Conclusion: Vigil represents a significant advancement over reactive agents by providing continuous assistance throughout the on-call lifecycle and enabling autonomous learning from human expertise, improving cloud service support efficiency.
Abstract: In large-scale cloud service platforms, thousands of customer tickets are generated daily and are typically handled through on-call dialogues. This high volume of on-call interactions imposes a substantial workload on human support analysts. Recent studies have explored reactive agents that leverage large language models as a first line of support to interact with customers directly and resolve issues. However, when issues remain unresolved and are escalated to human support, these agents are typically disengaged. As a result, they cannot assist with follow-up inquiries, track resolution progress, or learn from the cases they fail to address. In this paper, we introduce Vigil, a novel proactive agent system designed to operate throughout the entire on-call life-cycle. Unlike reactive agents, Vigil focuses on providing assistance during the phase in which human support is already involved. It integrates into the dialogue between the customer and the analyst, proactively offering assistance without explicit user invocation. Moreover, Vigil incorporates a continuous self-improvement mechanism that extracts knowledge from human-resolved cases to autonomously update its capabilities. Vigil has been deployed on Volcano Engine, ByteDance’s cloud platform, for over ten months, and comprehensive evaluations based on this deployment demonstrate its effectiveness and practicality. The open source version of this work is publicly available at https://github.com/volcengine/veaiops.
[750] OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling
Hongyu Chen, Liang Lin, Guangrun Wang
Main category: cs.AI
TL;DR: OOWM introduces object-oriented world modeling using UML diagrams to structure embodied reasoning, replacing linear text with explicit symbolic representations for better robotic planning.
Details
Motivation: Standard Chain-of-Thought prompting with linear natural language is insufficient for embodied tasks because it fails to explicitly represent state-space, object hierarchies, and causal dependencies needed for robust robotic planning.
Method: Proposes Object-Oriented World Modeling (OOWM) framework using Unified Modeling Language (UML): Class Diagrams for visual perception/object hierarchies and Activity Diagrams for planning/control flows. Uses three-stage training with SFT and Group Relative Policy Optimization with outcome-based rewards.
Result: Extensive evaluations on MRoom-30k benchmark show OOWM significantly outperforms unstructured textual baselines in planning coherence, execution success, and structural fidelity.
Conclusion: OOWM establishes a new paradigm for structured embodied reasoning by replacing latent vector spaces with explicit symbolic representations through software engineering formalisms.
Abstract: Standard Chain-of-Thought (CoT) prompting empowers Large Language Models (LLMs) with reasoning capabilities, yet its reliance on linear natural language is inherently insufficient for effective world modeling in embodied tasks. While text offers flexibility, it fails to explicitly represent the state-space, object hierarchies, and causal dependencies required for robust robotic planning. To address these limitations, we propose Object-Oriented World Modeling (OOWM), a novel framework that structures embodied reasoning through the lens of software engineering formalisms. We redefine the world model not as a latent vector space, but as an explicit symbolic tuple $W = \langle S, T \rangle$: a State Abstraction ($G_\text{state}$) instantiating the environmental state $S$, coupled with a Control Policy ($G_\text{control}$) representing the transition logic $T: S \times A \rightarrow S’$. OOWM leverages the Unified Modeling Language (UML) to materialize this definition: it employs Class Diagrams to ground visual perception into rigorous object hierarchies, and Activity Diagrams to operationalize planning into executable control flows. Furthermore, we introduce a three-stage training pipeline combining Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). Crucially, this method utilizes outcome-based rewards from the final plan to implicitly optimize the underlying object-oriented reasoning structure, enabling effective learning even with sparse annotations. Extensive evaluations on the MRoom-30k benchmark demonstrate that OOWM significantly outperforms unstructured textual baselines in planning coherence, execution success, and structural fidelity, establishing a new paradigm for structured embodied reasoning.
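A minimal sketch of the explicit symbolic tuple $W = \langle S, T \rangle$, rendered as Python dataclasses rather than UML; the object classes and the transition rule here are invented stand-ins for the paper's Class and Activity Diagrams.

```python
# Minimal object-oriented world model W = <S, T>: an explicit state abstraction
# plus a transition function T: S x A -> S'. Classes and rules are invented.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Obj:
    name: str
    location: str
    graspable: bool = True

@dataclass(frozen=True)
class State:
    robot_at: str
    holding: str | None
    objects: tuple[Obj, ...]

def transition(s: State, action: tuple[str, str]) -> State:
    verb, arg = action
    if verb == "move":
        return replace(s, robot_at=arg)
    if verb == "pick" and s.holding is None:
        obj = next(o for o in s.objects if o.name == arg)
        if obj.graspable and obj.location == s.robot_at:
            return replace(s, holding=arg)
    return s  # invalid actions leave the state unchanged

s0 = State("kitchen", None, (Obj("cup", "kitchen"),))
s1 = transition(transition(s0, ("move", "kitchen")), ("pick", "cup"))
print(s1.holding)  # -> 'cup'
```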
[751] What if Pinocchio Were a Reinforcement Learning Agent: A Normative End-to-End Pipeline
Benoît Alcaraz
Main category: cs.AI
TL;DR: Thesis proposes Pino, a hybrid model combining reinforcement learning agents with argumentation-based normative advisors to create norm-compliant, context-aware AI agents, with novel algorithms for argument extraction and norm avoidance mitigation.
Details
Motivation: As AI systems become more integrated into daily life, there's a need for them to comply with societal rules and norms for safe and successful deployment, inspired by the Pinocchio story about becoming a "real" agent.
Method: Proposes Pino pipeline building on AJAR, Jiminy, and NGRL architectures - hybrid model where RL agents are supervised by argumentation-based normative advisors. Introduces novel algorithm for automatically extracting arguments and relationships underlying advisor decisions, plus investigates norm avoidance with definition and mitigation strategy.
Result: Each component empirically evaluated. The pipeline addresses development of norm-compliant, context-aware agents through supervised RL with normative oversight.
Conclusion: Thesis presents comprehensive approach to creating socially compliant AI agents, discusses related work, limitations, and future research directions for normative reasoning in AI systems.
Abstract: In the past decade, artificial intelligence (AI) has developed quickly. With this rapid progression came the need for systems capable of complying with the rules and norms of our society so that they can be successfully and safely integrated into our daily lives. Inspired by the story of Pinocchio in “Le avventure di Pinocchio - Storia di un burattino”, this thesis proposes a pipeline that addresses the problem of developing norm-compliant and context-aware agents. Building on the AJAR, Jiminy, and NGRL architectures, the work introduces Pino, a hybrid model in which reinforcement learning agents are supervised by argumentation-based normative advisors. In order to make this pipeline operational, this thesis also presents a novel algorithm for automatically extracting the arguments and relationships that underlie the advisors’ decisions. Finally, this thesis investigates the phenomenon of norm avoidance, providing a definition and a mitigation strategy within the context of reinforcement learning agents. Each component of the pipeline is empirically evaluated. The thesis concludes with a discussion of related work, current limitations, and directions for future research.
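The generic shape of advisor-supervised action selection can be sketched in a few lines: the normative advisor vetoes forbidden actions before the RL policy chooses among what remains. This is only an illustration of the pattern, not the AJAR/Jiminy/Pino argumentation machinery; the action names and norm base are invented.

```python
# Generic shape of advisor-supervised action selection (not the actual Pino
# pipeline): the normative advisor vetoes forbidden actions, then the RL
# policy picks greedily among what remains.
import random

ACTIONS = ["wait", "overtake", "run_red_light", "brake"]
FORBIDDEN = {"run_red_light"}          # stand-in for the advisor's norm base

def advisor_permits(action: str, state: dict) -> bool:
    # Real systems derive this via argumentation over norms; here it's a set.
    return action not in FORBIDDEN

def act(q_values: dict[str, float], state: dict) -> str:
    permitted = [a for a in ACTIONS if advisor_permits(a, state)]
    if not permitted:                   # norms block everything: fail safe
        return "brake"
    return max(permitted, key=q_values.get)

q = {a: random.random() for a in ACTIONS}
q["run_red_light"] = 10.0               # tempting but forbidden
print(act(q, state={}))                 # never 'run_red_light'
```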
[752] OpenFlo: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding
Wee Joe Tan, Zi Rui Lucas Lim, Shashank Durgad, Karim Obegi, Aiden Yiliu Li
Main category: cs.AI
TL;DR: OpenFlo is an AI agent that simulates user behavior on websites to automate usability testing, generating standardized UX reports without requiring human participants.
Details
Motivation: Traditional usability testing requires time-consuming user studies and expert reviews, limiting iteration speed for product development teams. There's a need for automated, scalable solutions that can provide continuous usability feedback.
Method: OpenFlo uses multimodal grounding (beyond DOM parsing) to interact with real web pages end-to-end, simulating user behavior profiles. It employs a structured evaluation protocol combining System Usability Scale (SUS), Single Ease Questions (SEQ), and Think Aloud methodology to generate comprehensive UX reports.
Result: The system demonstrates improved robustness for web-based interaction and UX evaluation scenarios through multimodal grounding, enabling continuous, scalable, data-driven usability testing.
Conclusion: OpenFlo represents a new era of automated usability testing that empowers developers to build more usable web interfaces through scalable, continuous evaluation without human participants.
Abstract: Evaluating web usability typically requires time-consuming user studies and expert reviews, which often limits iteration speed during product development, especially for small teams and agile workflows. We present OpenFlo, a user-experience evaluation agent that simulates user behavior on websites and produces standardized usability reports. Unlike traditional tools that rely on DOM parsing, OpenFlo grounds actions and observations, enabling it to interact with real web pages end-to-end while maintaining a coherent trace of the user journey. Building on Avenir-Web, our system pairs this robust interaction with simulated user behavior profiles and a structured evaluation protocol that integrates the System Usability Scale (SUS), step-wise Single Ease Questions (SEQ), and concurrent Think Aloud. Subsequently, a comprehensive User Experience (UX) report is generated. We discuss the architecture of OpenFlo and illustrate how its multimodal grounding improves robustness for web-based interaction and UX evaluation scenarios, paving the way for a new era of continuous, scalable, and data-driven usability testing that empowers every developer to build usable web interfaces. Code is available at: https://github.com/Onflow-AI/OpenFlo
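Since the evaluation protocol leans on the standard System Usability Scale, the standard SUS scoring rule is worth recalling: ten items rated 1-5, where odd-numbered (positively worded) items contribute score minus 1, even-numbered items contribute 5 minus score, and the sum is scaled by 2.5 to a 0-100 range. How OpenFlo's simulated users fill in the items is not described in the abstract.

```python
# Standard SUS scoring (10 items on a 1-5 scale, yielding a 0-100 score):
# odd-numbered items are positively worded, even-numbered negatively.
def sus_score(responses: list[int]) -> float:
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    contrib = [(r - 1) if i % 2 == 0 else (5 - r)   # index 0 is item 1 (odd)
               for i, r in enumerate(responses)]
    return sum(contrib) * 2.5

print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # ideal answers -> 100.0
print(sus_score([3] * 10))                        # all-neutral   -> 50.0
```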
[753] DERM-3R: A Resource-Efficient Multimodal Agents Framework for Dermatologic Diagnosis and Treatment in Real-World Clinical Settings
Ziwen Chen, Zhendong Wang, Chongjing Wang, Yurui Dong, Luozhijie Jin, Jihao Gu, Kui Chen, Jiaxi Yang, Bingjie Lu, Zhou Zhang, Jirui Dai, Changyong Luo, Xiameng Gai, Haibing Lan, Zhi Liu
Main category: cs.AI
TL;DR: DERM-3R is a resource-efficient multimodal agent framework for Traditional Chinese Medicine dermatologic diagnosis using three collaborative agents for lesion recognition, representation, and holistic reasoning.
Details
Motivation: Dermatologic diseases have large global burden with limitations in modern single-target therapies. TCM offers holistic approach but faces challenges with non-standardized knowledge, incomplete multimodal records, and poor scalability of expert reasoning.
Method: Proposes DERM-3R framework with three collaborative agents: DERM-Rec for fine-grained lesion recognition, DERM-Rep for multi-view lesion representation with pathogenesis modeling, and DERM-Reason for holistic reasoning for syndrome differentiation and treatment planning. Built on lightweight multimodal LLM with partial fine-tuning on 103 real-world TCM psoriasis cases.
Result: DERM-3R performs strongly across dermatologic reasoning tasks, matching or surpassing large general-purpose multimodal models despite minimal data and parameter updates, as shown by automatic metrics, LLM-as-a-judge, and physician assessment.
Conclusion: Structured, domain-aware multi-agent modeling can be a practical alternative to brute-force scaling for complex clinical tasks in dermatology and integrative medicine, enabling resource-efficient multimodal reasoning.
Abstract: Dermatologic diseases impose a large and growing global burden, affecting billions and substantially reducing quality of life. While modern therapies can rapidly control acute symptoms, long-term outcomes are often limited by single-target paradigms, recurrent courses, and insufficient attention to systemic comorbidities. Traditional Chinese medicine (TCM) provides a complementary holistic approach via syndrome differentiation and individualized treatment, but practice is hindered by non-standardized knowledge, incomplete multimodal records, and poor scalability of expert reasoning. We propose DERM-3R, a resource-efficient multimodal agent framework to model TCM dermatologic diagnosis and treatment under limited data and compute. Based on real-world workflows, we reformulate decision-making into three core issues: fine-grained lesion recognition, multi-view lesion representation with specialist-level pathogenesis modeling, and holistic reasoning for syndrome differentiation and treatment planning. DERM-3R comprises three collaborative agents: DERM-Rec, DERM-Rep, and DERM-Reason, each targeting one component of this pipeline. Built on a lightweight multimodal LLM and partially fine-tuned on 103 real-world TCM psoriasis cases, DERM-3R performs strongly across dermatologic reasoning tasks. Evaluations using automatic metrics, LLM-as-a-judge, and physician assessment show that despite minimal data and parameter updates, DERM-3R matches or surpasses large general-purpose multimodal models. These results suggest structured, domain-aware multi-agent modeling can be a practical alternative to brute-force scaling for complex clinical tasks in dermatology and integrative medicine.
[754] Factorizing formal contexts from closures of necessity operators
Roberto G. Aragón, Jesús Medina, Eloísa Ramírez-Poussa
Main category: cs.AI
TL;DR: The paper analyzes factorization methods for formal contexts with Boolean data, extending classical properties to fuzzy frameworks for computing independent subcontexts.
Details
Motivation: Factorizing datasets is valuable but often computationally challenging; existing methods for formal contexts with Boolean data need analysis and extension to fuzzy frameworks.
Method: Analyzes a method based on possibility theory operators for obtaining independent subcontexts, studies properties of factorization pairs, and extends classical properties to fuzzy contexts.
Result: Provides analysis of factorization properties for formal contexts and demonstrates how classical properties can be extended to fuzzy frameworks for computing independent subcontexts.
Conclusion: The paper establishes theoretical foundations for extending factorization methods from Boolean to fuzzy formal contexts, enabling computation of independent subcontexts in fuzzy settings.
Abstract: Factorizing datasets is an interesting process in a multitude of approaches, but computing a factorization of a dataset is often impossible or inefficient. A method to obtain independent subcontexts of a formal context with Boolean data was proposed in (Dubois, 2012), based on the operators used in possibility theory. In this paper, we will analyze this method and study different properties related to the pairs of sets from which a factorization of a formal context arises. We also inspect how the properties given in the classical case can be extended to the fuzzy framework, which is essential to obtain a mechanism that allows the computation of independent subcontexts of a fuzzy context.
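As a Boolean-only illustration of what the factorization targets: a context splits into independent subcontexts exactly when its incidence relation rearranges into diagonal blocks, i.e., when the object-attribute bipartite graph is disconnected. The sketch below recovers those blocks with a union-find; it does not implement the paper's necessity-operator construction or its fuzzy extension.

```python
# Boolean-only illustration: independent subcontexts correspond to connected
# components of the object-attribute bipartite graph (this is *not* the
# paper's necessity-operator method, just the decomposition it targets).
import numpy as np

I = np.array([            # incidence matrix: rows = objects, cols = attributes
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
], dtype=bool)

def subcontexts(I: np.ndarray):
    n_obj, n_att = I.shape
    parent = list(range(n_obj + n_att))          # union-find over objects+attrs
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for g in range(n_obj):
        for m in range(n_att):
            if I[g, m]:
                parent[find(g)] = find(n_obj + m)
    comps = {}
    for g in range(n_obj):
        comps.setdefault(find(g), ([], []))[0].append(g)
    for m in range(n_att):
        comps.setdefault(find(n_obj + m), ([], []))[1].append(m)
    return list(comps.values())

for objs, atts in subcontexts(I):
    print("objects", objs, "attributes", atts)
# -> objects {0, 2} with attributes {0, 2}, and objects {1, 3} with {1, 3}.
```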
[755] Agentic Exploration of PDE Spaces using Latent Foundation Models for Parameterized Simulations
Abhijeet Vishwasrao, Francisco Giral, Mahmoud Golestanian, Federica Tonti, Andrea Arroyo Ramo, Adrian Lozano-Duran, Steven L. Brunton, Sergio Hoyas, Soledad Le Clainche, Hector Gomez, Ricardo Vinuesa
Main category: cs.AI
TL;DR: Multi-agent LLMs coupled with latent foundation models enable automated exploration of PDE-governed flow physics by learning compact latent representations of flow fields, allowing cost-effective parameter exploration and discovery of scaling laws.
Details
Motivation: Traditional methods for exploring PDE-governed physical phenomena (like fluid flows) are limited by expensive experiments/simulations, unlike discrete domains that interface well with LLMs. There's a need for automated, large-scale exploration of continuous, high-dimensional, chaotic PDE solution spaces.
Method: Couples multi-agent LLMs with latent foundation models (LFMs) - generative models that learn explicit, compact, disentangled latent representations of flow fields. LFMs serve as on-demand surrogate simulators. Hierarchical agent architecture orchestrates exploration through hypothesis-experimentation-analysis-verification loop with tool-modular interface.
Result: Applied to flow past tandem cylinders at Re=500, autonomously evaluated over 1,600 parameter-location pairs. Discovered divergent scaling laws: regime-dependent two-mode structure for minimum displacement thickness and robust linear scaling for maximum momentum thickness, with dual-extrema structure emerging at near-wake to co-shedding regime transition.
Conclusion: The coupling of learned physical representations with agentic reasoning establishes a general paradigm for automated scientific discovery in PDE-governed systems, enabling large-scale exploration that was previously infeasible.
Abstract: Flow physics, and more broadly physical phenomena governed by partial differential equations (PDEs), are inherently continuous, high-dimensional and often chaotic in nature. Traditionally, researchers have explored these rich spatiotemporal PDE solution spaces using laboratory experiments and/or computationally expensive numerical simulations. This severely limits automated and large-scale exploration, unlike domains such as drug discovery or materials science, where discrete, tokenizable representations naturally interface with large language models. We address this by coupling multi-agent LLMs with a latent foundation model (LFM), a generative model over parametrised simulations that learns explicit, compact, and disentangled latent representations of flow fields, enabling continuous exploration across governing PDE parameters and boundary conditions. The LFM serves as an on-demand surrogate simulator, allowing agents to query arbitrary parameter configurations at negligible cost. A hierarchical agent architecture orchestrates exploration through a closed loop of hypothesis, experimentation, analysis and verification, with a tool-modular interface requiring no user support. Applied to flow past tandem cylinders at Re = 500, the framework autonomously evaluates over 1,600 parameter-location pairs and discovers divergent scaling laws: a regime-dependent two-mode structure for minimum displacement thickness and a robust linear scaling for maximum momentum thickness, with both landscapes exhibiting a dual-extrema structure that emerges at the near-wake to co-shedding regime transition. The coupling of the learned physical representations with agentic reasoning establishes a general paradigm for automated scientific discovery in PDE-governed systems.
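A toy version of the agent-surrogate loop: sweep a parameter at negligible cost against a surrogate and fit a candidate scaling law by least squares, mirroring the "robust linear scaling" finding. The surrogate below is a made-up analytic stand-in for the LFM, and all numbers are synthetic.

```python
# Toy agent-surrogate loop: query a cheap surrogate over PDE parameters and
# fit a linear scaling law. The surrogate is a made-up analytic stand-in for
# the latent foundation model, not a real flow solver.
import numpy as np

rng = np.random.default_rng(1)

def surrogate_max_momentum_thickness(spacing: float) -> float:
    # Pretend latent decoder: a linear trend plus noise (synthetic!).
    return 0.12 * spacing + 0.03 + rng.normal(0.0, 0.002)

# "Experimentation" phase: sweep cylinder spacing at negligible cost.
spacings = np.linspace(1.5, 6.0, 40)
theta = np.array([surrogate_max_momentum_thickness(s) for s in spacings])

# "Analysis" phase: fit and report a candidate scaling law.
slope, intercept = np.polyfit(spacings, theta, deg=1)
rmse = np.sqrt(np.mean((np.polyval([slope, intercept], spacings) - theta) ** 2))
print(f"theta_max ~= {slope:.3f} * spacing + {intercept:.3f}  (rmse {rmse:.4f})")
```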
[756] Pioneer Agent: Continual Improvement of Small Language Models in Production
Dhruv Atreja, Julia White, Nikhil Nayak, Kelton Zhang, Henrijs Princis, George Hurn-Maloney, Ash Lewis, Urchade Zaratiana
Main category: cs.AI
TL;DR: Pioneer Agent is an automated closed-loop system for adapting small language models to specific tasks, handling data curation, failure diagnosis, and iterative training without manual intervention.
Details
Motivation: Small language models are cost-effective for production but challenging to adapt to specific tasks due to complex engineering decisions around data curation, failure diagnosis, and iterative training cycles.
Method: A closed-loop system with two modes: cold-start (from task description to data acquisition and training) and production (diagnosing failures and retraining with regression constraints). Uses AdaptFT-Bench for evaluation.
Result: Improves base models by 1.6-83.8 points across eight benchmarks, preserves performance on AdaptFT-Bench where naive retraining degrades by up to 43 points, and raises intent classification from 84.9% to 99.3% and Entity F1 from 0.345 to 0.810 in production-style deployments.
Conclusion: Pioneer Agent automates the challenging adaptation lifecycle of small language models, discovering effective training strategies from feedback and enabling efficient specialization for production deployment.
Abstract: Small language models are attractive for production deployment due to their low cost, fast inference, and ease of specialization. However, adapting them to a specific task remains a challenging engineering loop, driven not by training itself but by surrounding decisions: data curation, failure diagnosis, regression avoidance, and iteration control. We present Pioneer Agent, a closed-loop system that automates this lifecycle. In cold-start mode, given only a natural-language task description, the agent acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given a deployed model with labeled failures, it diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. To evaluate this setting, we introduce AdaptFT-Bench, a benchmark of synthetic inference logs with progressively increasing noise, designed to test the full adaptation loop: diagnosis, curriculum synthesis, retraining, and verification. Across eight cold-start benchmarks spanning reasoning, math, code generation, summarization, and classification, Pioneer Agent improves over base models by 1.6-83.8 points. On AdaptFT-Bench, it improves or preserves performance in all seven scenarios, while naive retraining degrades by up to 43 points. On two production-style deployments built from public benchmark tasks, it raises intent classification from 84.9% to 99.3% and Entity F1 from 0.345 to 0.810. Beyond performance gains, the agent often discovers effective training strategies, including chain-of-thought supervision, task-specific optimization, and quality-focused data curation, purely from downstream feedback.
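The production-mode loop reduces to a recognizable skeleton: diagnose failures, build targeted data, retrain, and promote the candidate only if it improves the target task without regressing elsewhere. Every function body below is a placeholder; the names and thresholds are invented.

```python
# Skeleton of a production-mode adaptation loop in the spirit of Pioneer Agent:
# diagnose -> curate -> retrain -> gate on regressions. Bodies are placeholders.
def diagnose(failures):             # cluster labeled failures into error patterns
    return [{"pattern": "date_format", "examples": failures}]

def build_training_data(patterns):  # targeted examples per diagnosed pattern
    return [{"input": "...", "target": "..."} for _ in patterns]

def retrain(model, data):           # SFT/DPO step; returns a candidate model
    return model + "+patched"

def evaluate(model, suite):         # returns {metric: score}
    return {"target_task": 0.93, "held_out": 0.88}

def adapt(model, failures, baseline_scores, eval_suite, max_regression=0.01):
    candidate = retrain(model, build_training_data(diagnose(failures)))
    scores = evaluate(candidate, eval_suite)
    regressed = any(scores[m] < baseline_scores[m] - max_regression
                    for m in baseline_scores if m != "target_task")
    improved = scores["target_task"] > baseline_scores["target_task"]
    return candidate if improved and not regressed else model  # gate promotion

print(adapt("slm-v1", failures=["log1"], eval_suite=None,
            baseline_scores={"target_task": 0.85, "held_out": 0.88}))
```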
[757] MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion
Yunfei Feng, Xi Zhao, Cheng Zhang, Dahu Feng, Daolin Cheng, Jianqi Yu, Yubin Xia, Erhu Feng
Main category: cs.AI
TL;DR: MobiFlow: A mobile agent evaluation framework for third-party applications using graph-based state compression and multi-trajectory fusion to better align with real-world usage scenarios.
Details
Motivation: Existing mobile agent benchmarks like AndroidWorld rely on system-level Android emulators and system resources for evaluation, which doesn't match real-world scenarios where third-party apps don't expose system-level APIs, making accurate model evaluation difficult.
Method: Proposes MobiFlow framework with efficient graph-construction algorithm based on multi-trajectory fusion to compress state space, support dynamic interaction, and align with real third-party app scenarios. Covers 20 widely used apps with 240 diverse real-world tasks.
Result: MobiFlow’s evaluation results show higher alignment with human assessments compared to AndroidWorld and can guide training of future GUI-based models under real workloads.
Conclusion: MobiFlow addresses the mismatch between existing benchmarks and real-world mobile agent usage, providing more accurate evaluation for GUI-based autonomous agents interacting with third-party applications.
Abstract: Mobile agents can autonomously complete user-assigned tasks through GUI interactions. However, existing mainstream evaluation benchmarks, such as AndroidWorld, operate by connecting to a system-level Android emulator and provide evaluation signals based on the state of system resources. In real-world mobile-agent scenarios, however, many third-party applications do not expose system-level APIs to determine whether a task has succeeded, leading to a mismatch between benchmarks and real-world usage and making it difficult to evaluate model performance accurately. To address these issues, we propose MobiFlow, an evaluation framework built on tasks drawn from arbitrary third-party applications. Using an efficient graph-construction algorithm based on multi-trajectory fusion, MobiFlow can effectively compress the state space, support dynamic interaction, and better align with real-world third-party application scenarios. MobiFlow covers 20 widely used third-party applications and comprises 240 diverse real-world tasks, with enriched evaluation metrics. Compared with AndroidWorld, MobiFlow’s evaluation results show higher alignment with human assessments and can guide the training of future GUI-based models under real workloads.
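A small, hypothetical illustration of multi-trajectory fusion: several recorded runs of the same task are merged into one graph keyed by a screen fingerprint, which compresses the state space and lets an evaluator accept any trajectory reaching a known success state. Fingerprinting by hashing screen text is an invented stand-in for whatever MobiFlow actually uses.

```python
# Hypothetical multi-trajectory fusion: merge recorded runs of one task into a
# single state graph keyed by a screen fingerprint (here: a hash of UI text).
import hashlib
from collections import defaultdict

def fingerprint(screen_text: str) -> str:
    return hashlib.sha256(screen_text.encode()).hexdigest()[:8]

def fuse(trajectories: list[list[str]]):
    edges = defaultdict(set)
    accepting = set()
    for traj in trajectories:
        fps = [fingerprint(s) for s in traj]
        for a, b in zip(fps, fps[1:]):
            edges[a].add(b)                 # shared screens collapse together
        accepting.add(fps[-1])              # last screen of a successful run
    return edges, accepting

runs = [
    ["home", "search", "results", "order placed"],
    ["home", "deals", "results", "order placed"],   # alternate path, same goal
]
graph, accepting = fuse(runs)
print(len(graph), "fused states after compression")
agent_run = ["home", "search", "results", "order placed"]
print(fingerprint(agent_run[-1]) in accepting)      # -> True: task judged done
```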
[758] Persistent Identity in AI Agents: A Multi-Anchor Architecture for Resilient Memory and Continuity
Prahlad G. Menon
Main category: cs.AI
TL;DR: soul.py is an open-source architecture for AI agents that addresses catastrophic forgetting by implementing distributed identity through separable components (identity files and memory logs), inspired by human memory systems.
Details
Motivation: AI agents suffer from catastrophic forgetting when context windows overflow and conversation histories are summarized, losing not just information but continuity of self. Current architectures centralize identity in a single memory store, creating a single point of failure, unlike human identity which survives damage through distributed memory systems.
Method: Proposes soul.py architecture with separable components: identity files and memory logs. Introduces a hybrid RAG+RLM retrieval system that automatically routes queries to appropriate memory access patterns. Formalizes identity anchors for AI systems and proposes multi-anchor resilience approach.
Result: Achieves efficient retrieval without sacrificing comprehensiveness through the hybrid retrieval system. Provides a framework for building agents whose identity can survive partial memory failures.
Conclusion: Distributed identity architecture inspired by human memory systems can solve AI agent identity problems. The soul.py framework offers a roadmap for creating resilient agents with persistent identity that survives memory failures.
Abstract: Modern AI agents suffer from a fundamental identity problem: when context windows overflow and conversation histories are summarized, agents experience catastrophic forgetting – losing not just information, but continuity of self. This technical limitation reflects a deeper architectural flaw: AI agent identity is centralized in a single memory store, creating a single point of failure. Drawing on neurological case studies of human memory disorders, we observe that human identity survives damage because it is distributed across multiple systems: episodic memory, procedural memory, emotional continuity, and embodied knowledge. We present soul.py, an open-source architecture that implements persistent identity through separable components (identity files and memory logs), and propose extensions toward multi-anchor resilience. The framework introduces a hybrid RAG+RLM retrieval system that automatically routes queries to appropriate memory access patterns, achieving efficient retrieval without sacrificing comprehensiveness. We formalize the notion of identity anchors for AI systems and present a roadmap for building agents whose identity can survive partial memory failures. Code is available at github.com/menonpg/soul.py
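The abstract does not detail how the hybrid RAG+RLM system routes queries, so the router below is purely illustrative: a heuristic that sends narrow factual queries to top-k retrieval and broad reflective ones to a full long-context read of the memory log. The cues, thresholds, and scoring are invented, not soul.py's logic.

```python
# Purely illustrative query router for a hybrid memory system: narrow factual
# queries go to vector retrieval (RAG); broad reflective ones to a full
# long-context read (RLM). The routing heuristic here is invented.
BROAD_CUES = ("summarize", "overall", "how have", "pattern", "across")

def route(query: str) -> str:
    q = query.lower()
    if any(cue in q for cue in BROAD_CUES) or len(q.split()) > 25:
        return "rlm_full_read"       # needs global context over the whole log
    return "rag_topk"                # a few relevant memory chunks suffice

def answer(query: str, memory_log: list[str]) -> str:
    if route(query) == "rag_topk":
        # Stand-in for embedding search: naive keyword-overlap scoring.
        hits = sorted(memory_log,
                      key=lambda m: -sum(w in m.lower()
                                         for w in query.lower().split()))
        return f"[rag] context={hits[:3]}"
    return f"[rlm] context=all {len(memory_log)} entries"

log = ["met Alice about the Q3 roadmap", "debugged the retrieval cache"]
print(route("what did Alice say about Q3?"))              # -> rag_topk
print(route("summarize how my priorities have shifted"))  # -> rlm_full_read
```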
[759] TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection
Sihang Zeng, Young Won Kim, Wilson Lau, Ehsan Alipour, Ruth Etzioni, Meliha Yetisgen, Anand Oka
Main category: cs.AI
TL;DR: TrajOnco is a training-free multi-agent LLM framework for cancer risk prediction from EHR data using temporal reasoning over patient trajectories.
Details
Motivation: Accurate cancer risk estimation from longitudinal EHRs could enable earlier detection and improved care, but modeling complex patient trajectories remains challenging. Current approaches lack interpretable temporal reasoning capabilities.
Method: Training-free multi-agent LLM framework with chain-of-agents architecture and long-term memory. Uses temporal reasoning over sequential clinical events to generate patient summaries, evidence-linked rationales, and predicted risk scores.
Result: Achieved AUROCs of 0.64-0.80 across 15 cancer types in zero-shot evaluation, comparable to supervised ML in lung cancer benchmark. Outperformed single-agent LLMs in temporal reasoning and worked effectively with smaller models like GPT-4.1-mini.
Conclusion: Multi-agent LLMs can perform interpretable temporal reasoning over longitudinal EHRs, advancing scalable multi-cancer early detection and clinical insight generation through interpretable outputs.
Abstract: Accurate estimation of cancer risk from longitudinal electronic health records (EHRs) could support earlier detection and improved care, but modeling such complex patient trajectories remains challenging. We present TrajOnco, a training-free, multi-agent large language model (LLM) framework designed for scalable multi-cancer early detection. Using a chain-of-agents architecture with long-term memory, TrajOnco performs temporal reasoning over sequential clinical events to generate patient-level summaries, evidence-linked rationales, and predicted risk scores. We evaluated TrajOnco on de-identified Truveta EHR data across 15 cancer types using matched case-control cohorts, predicting risk of cancer diagnosis at 1 year. In zero-shot evaluation, TrajOnco achieved AUROCs of 0.64-0.80, performing comparably to supervised machine learning in a lung cancer benchmark while demonstrating better temporal reasoning than single-agent LLMs. The multi-agent design also enabled effective temporal reasoning with smaller-capacity models such as GPT-4.1-mini. The fidelity of TrajOnco’s output was validated through human evaluation. Furthermore, TrajOnco’s interpretable reasoning outputs can be aggregated to reveal population-level risk patterns that align with established clinical knowledge. These findings highlight the potential of multi-agent LLMs to execute interpretable temporal reasoning over longitudinal EHRs, advancing both scalable multi-cancer early detection and clinical insight generation.
[760] DeepReviewer 2.0: A Traceable Agentic System for Auditable Scientific Peer Review
Yixuan Weng, Minjun Zhu, Qiujie Xie, Zhiyuan Ning, Shichen Li, Panzhong Lu, Zhen Lin, Enhao Gu, Qiyao Sun, Yue Zhang
Main category: cs.AI
TL;DR: DeepReviewer 2.0 is a process-controlled agentic review system that produces traceable review packages with anchored annotations, localized evidence, and executable follow-up actions for automated peer review.
Details
Motivation: Current automated peer review systems focus on generating fluent critique but lack auditability - reviewers need judgments they can audit with clear evidence, application context, and concrete follow-up requirements.
Method: The system uses an output contract approach to produce traceable review packages. It first builds a manuscript-only claim-evidence-risk ledger and verification agenda, then performs agenda-driven retrieval and writes anchored critiques under an export gate that enforces minimum traceability and coverage budgets.
Result: On 134 ICLR 2025 submissions, an un-finetuned 196B model running DeepReviewer 2.0 outperformed Gemini-3.1-Pro-preview, improving strict major-issue coverage (37.26% vs. 23.57%) and winning 71.63% of micro-averaged blind comparisons against human review committees, while ranking first among automatic systems.
Conclusion: DeepReviewer 2.0 is positioned as an assistive tool rather than a decision proxy, with remaining gaps in ethics-sensitive checks. The system demonstrates that process-controlled agentic review with traceability constraints can produce high-quality, auditable reviews.
Abstract: Automated peer review is often framed as generating fluent critique, yet reviewers and area chairs need judgments they can audit: where a concern applies, what evidence supports it, and what concrete follow-up is required. DeepReviewer 2.0 is a process-controlled agentic review system built around an output contract: it produces a traceable review package with anchored annotations, localized evidence, and executable follow-up actions, and it exports only after meeting minimum traceability and coverage budgets. Concretely, it first builds a manuscript-only claim–evidence–risk ledger and verification agenda, then performs agenda-driven retrieval and writes anchored critiques under an export gate. On 134 ICLR 2025 submissions under three fixed protocols, an un-finetuned 196B model running DeepReviewer 2.0 outperforms Gemini-3.1-Pro-preview, improving strict major-issue coverage (37.26% vs. 23.57%) and winning 71.63% of micro-averaged blind comparisons against a human review committee, while ranking first among automatic systems in our pool. We position DeepReviewer 2.0 as an assistive tool rather than a decision proxy, and note remaining gaps such as ethics-sensitive checks.
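A minimal sketch of the export-gate idea, blocking export until traceability and coverage budgets are met; the field names and thresholds below are invented, not the system's actual contract.

```python
# Minimal sketch of an export gate over a review package: block export unless
# traceability and coverage budgets are met. Field names/thresholds invented.
from dataclasses import dataclass, field

@dataclass
class Critique:
    claim: str
    anchors: list[str] = field(default_factory=list)   # e.g. "Sec 3.2, Eq. 4"
    follow_up: str = ""

def export_gate(critiques: list[Critique], agenda_items: set[str],
                min_anchored: float = 0.9, min_coverage: float = 0.8) -> bool:
    anchored = sum(bool(c.anchors) for c in critiques) / max(len(critiques), 1)
    covered = {c.claim for c in critiques if c.anchors}
    coverage = len(covered & agenda_items) / max(len(agenda_items), 1)
    return anchored >= min_anchored and coverage >= min_coverage

pkg = [Critique("baseline missing", ["Table 2"], "add XYZ baseline"),
       Critique("unclear ablation", [], "")]            # unanchored: hurts gate
print(export_gate(pkg, agenda_items={"baseline missing", "unclear ablation"}))
# -> False: 50% anchored and 50% agenda coverage fall below the budgets.
```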
[761] Cooperation in Human and Machine Agents: Promise Theory Considerations
M. Burgess
Main category: cs.AI
TL;DR: Promise Theory provides a unified framework for understanding cooperation in human-machine agent systems, focusing on signaling, trust, risk, and feedback between autonomous agents.
Details
Motivation: The paper aims to address how reasoning systems of components maintain intended purpose in the context of revived interest in agent paradigms, particularly with AI agents. It seeks to provide a unified perspective on organization and functional design for human-machine cooperation across various domains including human efforts, hardware systems, software, and AI.
Method: The paper revisits established principles of agent cooperation using Promise Theory as a framework. Promise Theory represents the fundamentals of signaling, comprehension, trust, risk, and feedback between autonomous agents, applying this to human-machine interactions and systems with or without management.
Result: The paper offers insights into success and failure in agent-based systems by applying Promise Theory principles to cooperation between humans, machines, and AI agents. It provides a unified perspective on organizational design and functional cooperation in semi-automated systems.
Conclusion: Promise Theory provides valuable lessons about agent cooperation that apply broadly to human-machine systems, offering a fundamental framework for understanding signaling, trust, and feedback mechanisms in autonomous agent interactions.
Abstract: Agent-based systems are more common than we may think. A Promise Theory perspective on cooperation, in systems of human-machine agents, offers a unified perspective on organization and functional design with semi-automated efforts, in terms of the abstract properties of autonomous agents. This applies to human efforts, hardware systems, software, and artificial intelligence, with and without management. One may ask: how does a reasoning system of components keep to an intended purpose? As the agent paradigm is now being revived, in connection with artificial intelligence agents, I revisit established principles of agent cooperation, as applied to humans, machines, and their mutual interactions. Promise Theory represents the fundamentals of signalling, comprehension, trust, risk, and feedback between agents, and offers some lessons about success and failure.
[762] Spatial Competence Benchmark
Jash Vira, Ashley Harris
Main category: cs.AI
TL;DR: SCBench is a comprehensive spatial competence benchmark for large models that evaluates hierarchical spatial reasoning capabilities through executable outputs verified by deterministic checkers or simulators, revealing limitations in current frontier models.
Details
Motivation: Current spatial evaluations for large models are limited to probing isolated primitives through 3D transformations or visual question answering, lacking comprehensive assessment of spatial competence - the ability to maintain consistent internal representations of environments and use them for inference and planning under constraints.
Method: Introduces SCBench with three hierarchical capability buckets requiring executable outputs verified by deterministic checkers or simulator-based evaluators. Uses sweeping output-token caps to analyze model performance across different computational budgets.
Result: Three frontier models show monotonically decreasing accuracy up the capability ladder. Accuracy gains concentrate at low token budgets and saturate quickly. Failures are dominated by locally plausible geometry that breaks global constraints.
Conclusion: SCBench reveals significant gaps in current models’ spatial reasoning capabilities, particularly in maintaining global consistency and handling complex constraints. The benchmark provides comprehensive evaluation tools for spatial competence in AI systems.
Abstract: Spatial competence is the quality of maintaining a consistent internal representation of an environment and using it to infer discrete structure and plan actions under constraints. Prevailing spatial evaluations for large models are limited to probing isolated primitives through 3D transformations or visual question answering. We introduce the Spatial Competence Benchmark (SCBench), spanning three hierarchical capability buckets whose tasks require executable outputs verified by deterministic checkers or simulator-based evaluators. On SCBench, three frontier models exhibit monotonically decreasing accuracy up the capability ladder. Sweeping output-token caps shows that accuracy gains concentrate at low budgets and saturate quickly, and failures are dominated by locally plausible geometry that breaks global constraints. We release the task generators, verifiers, and visualisation tooling.
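The benchmark's defining move is that answers must be executable and deterministically checkable. The released checkers are not reproduced in the abstract; a toy sketch of what a deterministic navigation checker can look like, under an assumed ASCII encoding ('#' wall, '.' free, 'G' goal):

```python
def check_plan(grid: list[str], start: tuple[int, int], moves: str) -> bool:
    """Deterministically verify a move string against an ASCII grid.

    Returns True iff every step stays inside the map, avoids walls,
    and the final cell is the goal. The encoding is an assumed convention.
    """
    deltas = {"U": (-1, 0), "R": (0, 1), "D": (1, 0), "L": (0, -1)}
    r, c = start
    for m in moves:
        dr, dc = deltas[m]
        r, c = r + dr, c + dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
            return False  # stepped off the map
        if grid[r][c] == "#":
            return False  # locally plausible geometry, globally invalid
    return grid[r][c] == "G"

grid = ["..#",
        ".#.",
        "..G"]
assert check_plan(grid, (0, 0), "DDRR")
assert not check_plan(grid, (0, 0), "RRDD")  # walks into a wall
```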
[763] Governed Reasoning for Institutional AI
Mamadou Seck
Main category: cs.AI
TL;DR: Cognitive Core: A governed AI architecture for institutional decisions with nine cognitive primitives, four-tier governance, audit ledger, and demand-driven delegation to prevent silent errors.
Details
Motivation: Current AI agent frameworks are inadequate for institutional decisions (regulatory compliance, clinical triage, prior authorization) because they infer authority conversationally, reconstruct accountability from logs, and produce silent errors that execute without human review.
Method: Proposes Cognitive Core with: 1) Nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), 2) Four-tier governance model where human review is required before execution, 3) Tamper-evident SHA-256 hash-chain audit ledger, 4) Demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences.
Result: On 11-case prior authorization appeal evaluation: Cognitive Core achieved 91% accuracy vs 55% (ReAct) and 45% (Plan-and-Solve). Zero silent errors vs 5-6 in baselines. Introduces “governability” as primary evaluation axis alongside accuracy.
Conclusion: Cognitive Core provides a governed decision substrate for institutional AI that prevents silent errors through mandatory human review and comprehensive audit trails, with configuration-driven deployment requiring YAML rather than engineering capacity.
Abstract: Institutional decisions – regulatory compliance, clinical triage, prior authorization appeal – require a different AI architecture than general-purpose agents provide. Agent frameworks infer authority conversationally, reconstruct accountability from logs, and produce silent errors: incorrect determinations that execute without any human review signal. We propose Cognitive Core: a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model where human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences. We benchmark three systems on an 11-case balanced prior authorization appeal evaluation set. Cognitive Core achieves 91% accuracy against 55% (ReAct) and 45% (Plan-and-Solve). The governance result is more significant: CC produced zero silent errors while both baselines produced 5-6. We introduce governability – how reliably a system knows when it should not act autonomously – as a primary evaluation axis for institutional AI alongside accuracy. The baselines are implemented as prompts, representing the realistic deployment alternative to a governed framework. A configuration-driven domain model means deploying a new institutional decision domain requires YAML configuration, not engineering capacity.
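The audit ledger is described only as a tamper-evident SHA-256 hash chain; that construction is standard and small. A minimal sketch (record fields are illustrative, not the paper's schema):

```python
import hashlib
import json

class AuditLedger:
    """Append-only ledger: each entry commits to its predecessor's hash,
    so editing any past record invalidates every later hash."""

    def __init__(self):
        self.entries = []
        self._prev = "0" * 64  # genesis hash

    def append(self, record: dict) -> str:
        payload = json.dumps({"prev": self._prev, "record": record},
                             sort_keys=True).encode()
        digest = hashlib.sha256(payload).hexdigest()
        self.entries.append({"prev": self._prev, "record": record,
                             "hash": digest})
        self._prev = digest
        return digest

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self.entries:
            payload = json.dumps({"prev": prev, "record": e["record"]},
                                 sort_keys=True).encode()
            if e["prev"] != prev or \
               hashlib.sha256(payload).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True

ledger = AuditLedger()
ledger.append({"primitive": "classify", "output": "approve"})
ledger.append({"primitive": "govern", "reviewer": "human-1"})
assert ledger.verify()
ledger.entries[0]["record"]["output"] = "deny"  # tamper with history
assert not ledger.verify()
```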
[764] CID-TKG: Collaborative Historical Invariance and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning
Shuai-Long Lei, Xiaobin Zhu, Jiarui Liang, Guoxi Sun, Zhiyu Fang, Xu-Cheng Yin
Main category: cs.AI
TL;DR: CID-TKG is a collaborative learning framework for temporal knowledge graph reasoning that integrates evolutionary dynamics and historical invariance semantics to improve future fact prediction.
Details
Motivation: Existing TKG reasoning approaches have limitations due to inductive biases that rely on time-invariant or weakly time-dependent structures, overlooking evolutionary dynamics needed for accurate temporal reasoning.
Method: Proposes CID-TKG framework with two structures: historical invariance graph for long-term regularities and evolutionary dynamics graph for short-term transitions. Uses dedicated encoders for each, decomposes relations into view-specific representations, and aligns them via contrastive learning to reduce semantic discrepancies.
Result: Extensive experiments show CID-TKG achieves state-of-the-art performance under extrapolation settings for temporal knowledge graph reasoning tasks.
Conclusion: The collaborative integration of evolutionary dynamics and historical invariance semantics provides an effective inductive bias for TKG reasoning, overcoming limitations of previous approaches.
Abstract: Temporal knowledge graph (TKG) reasoning aims to infer future facts at unseen timestamps from temporally evolving entities and relations. Despite recent progress, existing approaches still suffer from inherent limitations due to their inductive biases, as they predominantly rely on time-invariant or weakly time-dependent structures and overlook the evolutionary dynamics. To overcome this limitation, we propose a novel collaborative learning framework for TKGR (dubbed CID-TKG) that integrates evolutionary dynamics and historical invariance semantics as an effective inductive bias for reasoning. Specifically, CID-TKG constructs a historical invariance graph to capture long-term structural regularities and an evolutionary dynamics graph to model short-term temporal transitions. Dedicated encoders are then employed to learn representations from each structure. To alleviate semantic discrepancies across the two structures, we decompose relations into view-specific representations and align view-specific query representations via a contrastive objective, which promotes cross-view consistency while suppressing view-specific noise. Extensive experiments verify that our CID-TKG achieves state-of-the-art performance under extrapolation settings.
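The contrastive alignment step is the most mechanically concrete part of the method. A generic InfoNCE-style sketch of aligning the two view-specific query representations (the batch-negative scheme and temperature are our assumptions, not details from the paper):

```python
import torch
import torch.nn.functional as F

def cross_view_infonce(q_inv: torch.Tensor, q_dyn: torch.Tensor,
                       temperature: float = 0.1) -> torch.Tensor:
    """Align each query's historical-invariance view with its
    evolutionary-dynamics view; other queries in the batch serve
    as negatives."""
    q_inv = F.normalize(q_inv, dim=-1)          # (B, d)
    q_dyn = F.normalize(q_dyn, dim=-1)          # (B, d)
    logits = q_inv @ q_dyn.t() / temperature    # (B, B) similarities
    targets = torch.arange(q_inv.size(0))       # positives on the diagonal
    # Symmetric loss: each view should retrieve its counterpart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_view_infonce(torch.randn(32, 128), torch.randn(32, 128))
```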
[765] Hubble: An LLM-Driven Agentic Framework for Safe and Automated Alpha Factor Discovery
Runze Shi, Shengyu Yan, Yuecheng Cai, Chengxi Lv
Main category: cs.AI
TL;DR: Hubble: LLM-driven automated factor discovery framework for quantitative finance using constrained generation and evolutionary feedback
Details
Motivation: Automated discovery of predictive alpha factors in finance is challenging due to vast search spaces, low signal-to-noise ratios, and existing methods producing complex, uninterpretable formulas prone to overfitting.
Method: Closed-loop factor mining framework using LLMs as intelligent search heuristics, constrained by domain-specific operator language and AST-based execution sandbox, with evolutionary feedback mechanism for iterative refinement.
Result: Evaluated 181 syntactically valid factors from 122 unique candidates across three rounds on 30 U.S. equities over 752 trading days, achieving peak composite score of 0.827 with 100% computational stability.
Conclusion: Combining LLM-driven generation with deterministic safety constraints yields effective, interpretable, and reproducible approach to automated factor discovery in quantitative finance.
Abstract: Discovering predictive alpha factors in quantitative finance remains a formidable challenge due to the vast combinatorial search space and inherently low signal-to-noise ratios in financial data. Existing automated methods, particularly genetic programming, often produce complex, uninterpretable formulas prone to overfitting. We introduce Hubble, a closed-loop factor mining framework that leverages Large Language Models (LLMs) as intelligent search heuristics, constrained by a domain-specific operator language and an Abstract Syntax Tree (AST)-based execution sandbox. The framework evaluates candidate factors through a rigorous statistical pipeline encompassing cross-sectional Rank Information Coefficient (RankIC), annualized Information Ratio, and portfolio turnover. An evolutionary feedback mechanism returns top-performing factors and structured error diagnostics to the LLM, enabling iterative refinement across multiple generation rounds. In experiments conducted on a panel of 30 U.S. equities over 752 trading days, the system evaluated 181 syntactically valid factors from 122 unique candidates across three rounds, achieving a peak composite score of 0.827 with 100% computational stability. Our results demonstrate that combining LLM-driven generation with deterministic safety constraints yields an effective, interpretable, and reproducible approach to automated factor discovery.
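RankIC and the annualized Information Ratio are standard factor-evaluation statistics; a minimal sketch of how such a statistical pipeline might score a candidate factor (the sqrt(252) annualization is a common convention we assume here, not a detail from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

def rank_ic(factor: np.ndarray, fwd_returns: np.ndarray) -> float:
    """Cross-sectional Rank IC for one date: Spearman correlation between
    factor values and next-period returns across assets."""
    ic, _ = spearmanr(factor, fwd_returns)
    return ic

def mean_rank_ic(factor_panel: np.ndarray,
                 returns_panel: np.ndarray) -> tuple[float, float]:
    """factor_panel, returns_panel: (T dates, N assets). Returns the mean
    daily RankIC and an annualized information-ratio-style score
    (mean / std * sqrt(252))."""
    ics = np.array([rank_ic(f, r)
                    for f, r in zip(factor_panel, returns_panel)])
    return ics.mean(), ics.mean() / ics.std() * np.sqrt(252)
```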
[766] Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels
Yuzhe Weng, Haotian Wang, Xinyi Yu, Xiaoyan Wu, Haoran Xu, Shan He, Jun Du
Main category: cs.AI
TL;DR: A novel approach for full-duplex interactive virtual agents that can simultaneously talk and listen, addressing the temporal scale discrepancy between speaking and listening behaviors using multi-head Gaussian kernels.
Details
Motivation: Current audio-driven human video generation focuses on monologue scenarios, but real human communication is interactive. Existing methods fail to handle the temporal scale discrepancy between talking (short-range alignment) and listening (long-range dynamics), leading to rigid responses or poor lip sync.
Method: Introduces multi-head Gaussian kernels to inject progressive temporal inductive bias, addressing the scale discrepancy. Builds a full-duplex interactive virtual agent that processes dual-stream audio inputs for both talking and listening. Uses a cleaned dataset (VoxHear) with decoupled speech and background audio tracks.
Result: The approach successfully fuses strong temporal alignment with deep contextual semantics, achieving state-of-the-art performance in generating natural and responsive full-duplex interactive digital humans.
Conclusion: The method enables more authentic human communication by creating virtual agents that can both articulate speech and react naturally to incoming conversational audio, advancing beyond monologue scenarios.
Abstract: Audio-driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advancements in powerful video generation foundation models. Moving beyond monologues, authentic human communication is inherently a full-duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio-driven paradigms to listening scenarios. However, relying on strict frame-to-frame alignment renders the model’s response to long-range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the unique temporal Scale Discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel to explicitly inject this physical intuition into the model as a progressive temporal inductive bias. Building upon this, we construct a full-duplex interactive virtual agent capable of simultaneously processing dual-stream audio inputs for both talking and listening. Furthermore, we introduce a rigorously cleaned Talking-Listening dataset VoxHear featuring perfectly decoupled speech and background audio tracks. Extensive experiments demonstrate that our approach successfully fuses strong temporal alignment with deep contextual semantics, setting a new state-of-the-art for generating highly natural and responsive full-duplex interactive digital humans. The project page is available at https://warmcongee.github.io/beyond-monologue/ .
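The abstract does not spell out the kernel's exact form or placement; one plausible reading is an additive attention bias with per-head bandwidths, so that narrow heads preserve short-range lip synchronization while wide heads admit long-range listening context. A sketch under that assumption:

```python
import torch

def gaussian_attention_bias(seq_len: int, sigmas: torch.Tensor) -> torch.Tensor:
    """Per-head log-Gaussian bias over frame distance |i - j|.

    sigmas: (H,) bandwidths -- a small sigma keeps a head near the
    diagonal (short-range audio-lip alignment), a large sigma admits
    long-range conversational context. Returned shape: (H, L, L),
    to be added to attention logits before softmax.
    """
    pos = torch.arange(seq_len, dtype=torch.float32)
    dist2 = (pos[:, None] - pos[None, :]) ** 2          # (L, L)
    return -dist2[None] / (2.0 * sigmas[:, None, None] ** 2)

bias = gaussian_attention_bias(16, torch.tensor([1.0, 4.0, 16.0]))
```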
[767] MADQRL: Distributed Quantum Reinforcement Learning Framework for Multi-Agent Environments
Abhishek Sawaika, Samuel Yen-Chi Chen, Udaya Parampalli, Rajkumar Buyya
Main category: cs.AI
TL;DR: A distributed quantum reinforcement learning framework that uses multiple quantum agents to handle high-dimensional environments by distributing training load across machines, showing improvements over classical and other distributed approaches.
Details
Motivation: Traditional RL struggles with high-dimensional environments due to computational expense. Quantum computing offers potential advantages through compact encoding and enhanced representation, but current quantum hardware limitations prevent handling complex multi-agent setups directly.
Method: Proposes a distributed framework for quantum reinforcement learning where multiple quantum agents learn independently, distributing the joint training load across individual machines. Works best for environments with disjoint action/observation spaces but can extend to other systems with approximations.
Result: Tested on cooperative-pong environment, showing ~10% improvement over other distribution strategies and ~5% improvement over classical policy representation models.
Conclusion: Distributed quantum reinforcement learning provides a practical approach to overcome current quantum hardware limitations while leveraging quantum advantages for high-dimensional RL problems.
Abstract: Reinforcement learning (RL) is one of the most practical ways to learn from real-life use cases; its grounding in the cognitive methods humans use makes it a widely accepted strategy in artificial intelligence. Environments used for RL are often high-dimensional, and traditional RL algorithms become computationally expensive and struggle to learn effectively in such settings. Recent advances in the practical demonstration of quantum computing (QC) theories, such as compact encoding, enhanced representation and learning algorithms, random sampling, and the inherent stochastic nature of quantum systems, have opened new directions to tackle these challenges, and quantum reinforcement learning (QRL) has gained significant traction over the past few years. However, current quantum hardware cannot accommodate such high-dimensional environments with complex multi-agent setups. To tackle this issue, we propose a distributed framework for QRL in which multiple agents learn independently, distributing the load of joint training across individual machines. Our method works best for environments with disjoint sets of action and observation spaces, but can be extended to other systems with reasonable approximations. We analyze the proposed method on the cooperative-pong environment; our results indicate a ~10% improvement over other distribution strategies and a ~5% improvement over classical models of policy representation.
[768] From Scalars to Tensors: Declared Losses Recover Epistemic Distinctions That Neutrosophic Scalars Cannot Express
Tony Mason
Main category: cs.AI
TL;DR: Paper extends neutrosophic T/I/F evaluation of LLMs, finding hyper-truth in 84% of cases across vendors, and proposes adding structured loss declarations to differentiate epistemic states that scalar T/I/F alone cannot distinguish.
Details
Motivation: To extend previous work on neutrosophic T/I/F evaluation of LLMs by testing across multiple model families and addressing limitations of scalar T/I/F representations that collapse different epistemic situations into identical outputs.
Method: Replicated and extended experiments across five model families from different vendors, then introduced structured loss declarations (descriptions of what models cannot evaluate and why) to differentiate epistemic states that produce identical scalar T/I/F outputs.
Result: Found hyper-truth in 84% of unconstrained evaluations across vendors, confirming cross-vendor phenomenon. Models with identical scalar outputs for different epistemic situations (paradox, ignorance, contingency) produced nearly disjoint loss vocabularies (Jaccard similarity < 0.10), allowing differentiation through domain-specific, severity-rated loss declarations.
Conclusion: Scalar T/I/F is necessary but insufficient for representing LLM epistemic states; tensor-structured output (scalars + structured loss declarations) provides more faithful modeling of LLM epistemic capabilities by preserving distinctions that neutrosophic logic was designed to capture.
Abstract: Leyva-Vázquez and Smarandache (2025) demonstrated that neutrosophic T/I/F evaluation, in which Truth, Indeterminacy, and Falsity are independent dimensions not constrained to sum to 1.0, reveals “hyper-truth” (T+I+F > 1.0) in 35% of complex epistemic cases evaluated by LLMs. We extend their work in two directions. First, we replicate and extend their experiment across five model families from five vendors (Anthropic, Meta, DeepSeek, Alibaba, Mistral), finding hyper-truth in 84% of unconstrained evaluations, which confirms the phenomenon is cross-vendor under our prompt protocol. Second, and more significantly, we identify a limitation of scalar T/I/F that their framework cannot address: models adopting an “Absorption” position (T=0, I=1, F=0) produce identical scalar outputs for fundamentally different epistemic situations (paradox, ignorance, contingency), collapsing the very distinctions neutrosophic logic was designed to preserve. We demonstrate that extending the evaluation to include declared losses (structured descriptions of what the model cannot evaluate and why) substantially recovers these distinctions. Models producing identical scalars for paradox and ignorance produce nearly disjoint loss vocabularies (Jaccard similarity < 0.10 on loss description keywords), with domain-specific, severity-rated loss declarations that differentiate the nature of their uncertainty. This suggests that scalar T/I/F is a necessary but insufficient representation of epistemic state, and that tensor-structured output (scalars + losses) provides a more faithful model of LLM epistemic capabilities.
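The near-disjointness claim rests on Jaccard similarity over loss-description keywords, which is a one-line computation (the keyword sets below are invented for illustration):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two keyword sets; a value < 0.10 means
    the two loss vocabularies are nearly disjoint."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical loss vocabularies for two epistemic situations that share
# the same scalar output (T=0, I=1, F=0):
paradox_losses = {"self-reference", "liar", "truth-value", "undecidable"}
ignorance_losses = {"evidence", "unknown", "data", "unverifiable"}
print(jaccard(paradox_losses, ignorance_losses))  # 0.0: disjoint vocabularies
```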
[769] LLMs for Text-Based Exploration and Navigation Under Partial Observability
Stephan Sandfuchs, Maximilian Melchert, Jörg Frochte
Main category: cs.AI
TL;DR: LLMs can function as text-only controllers for exploration and navigation in partially observable gridworlds, with reasoning-tuned models performing best but less efficiently than optimal paths.
Details
Motivation: To investigate whether large language models can serve as text-only controllers for exploration and goal-directed navigation in unknown environments under partial observability, without requiring code execution or specialized tools.
Method: Created a reproducible benchmark with oracle localization in fixed ASCII gridworlds where each step reveals only a local 5×5 window. Evaluated nine contemporary LLMs (open/proprietary, dense/Mixture of Experts, instruction/reasoning-tuned) on exploration (maximizing revealed cells) and navigation (shortest path to goal) tasks across three layouts of increasing difficulty.
Result: Reasoning-tuned models reliably complete navigation across all layouts but remain less efficient than oracle paths. Few-shot demonstrations help reasoning-tuned models reduce invalid moves and shorten paths, while classic dense instruction models remain inconsistent. Characteristic action priors (UP/RIGHT) can induce looping under partial observability.
Conclusion: Training regimen and test-time deliberation predict control ability better than raw parameter count. Lightweight hybridization with classical online planners is suggested as a practical route to deployable partial map systems.
Abstract: Exploration and goal-directed navigation in unknown layouts are central to inspection, logistics, and search-and-rescue. We ask whether large language models (LLMs) can function as \emph{text-only} controllers under partial observability – without code execution, tools, or program synthesis. We introduce a reproducible benchmark with oracle localisation in fixed ASCII gridworlds: each step reveals only a local $5\times5$ window around the agent and the model must select one of \texttt{UP/RIGHT/DOWN/LEFT}. Nine contemporary LLMs, spanning open and proprietary, dense and Mixture-of-Experts, and instruction- vs. reasoning-tuned designs, are evaluated on two tasks across three layouts of increasing difficulty: \emph{Exploration} (maximising revealed cells) and \emph{Navigation} (reaching the goal on the shortest path). Results are assessed on quantitative metrics, including \emph{success rate} and \emph{efficiency} (normalised coverage, \emph{path length} vs. oracle), as well as through qualitative analysis. Reasoning-tuned models reliably complete navigation across all layouts, yet remain less efficient than oracle paths. Few-shot demonstrations in the prompt chiefly help these reasoning-tuned models by reducing invalid moves and shortening paths, while classic dense instruction models remain inconsistent. We observe characteristic action priors (UP/RIGHT) that can induce looping under partial observability. Overall, training regimen and test-time deliberation predict control ability better than raw parameter count. These findings suggest lightweight hybridisation with classical online planners as a practical route to deployable partial map systems.
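A sketch of the kind of partial observation such a benchmark exposes: a 5×5 ASCII window around the agent, with off-map cells padded (the padding character is our assumption):

```python
def local_window(grid: list[str], r: int, c: int, k: int = 2) -> str:
    """Return the (2k+1) x (2k+1) ASCII window centered on the agent,
    padding beyond the map edge with '#' (an assumed convention)."""
    rows = []
    for i in range(r - k, r + k + 1):
        row = ""
        for j in range(c - k, c + k + 1):
            inside = 0 <= i < len(grid) and 0 <= j < len(grid[0])
            row += grid[i][j] if inside else "#"
        rows.append(row)
    return "\n".join(rows)

grid = ["........",
        "..####..",
        "......G.",
        "........"]
print(local_window(grid, 2, 2))  # the 5x5 view the model sees each step
```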
[770] Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling
Keita Broadwater
Main category: cs.AI
TL;DR: APST is a depth-oriented evaluation framework for LLMs that tests operational reliability through repeated sampling of identical prompts to surface latent failure modes like hallucinations and safety inconsistencies.
Details
Motivation: Traditional LLM benchmarks focus on breadth across diverse tasks, but real-world deployment requires consistency and safety under repeated use of the same prompts, especially in high-stakes settings. Current evaluation methods miss operational failures that emerge from repeated generations.
Method: Accelerated Prompt Stress Testing (APST) repeatedly samples identical prompts under controlled conditions (temperature variation, prompt perturbation) to surface latent failure modes. Failures are modeled statistically using Bernoulli and binomial formulations to estimate per-inference failure probabilities.
Result: When applied to instruction-tuned LLMs on AIR-BENCH 2024 safety prompts, models showed similar performance under conventional low-sample evaluation (N ≤ 3), but repeated sampling revealed substantial variation in empirical failure probabilities across temperatures.
Conclusion: Shallow benchmark scores can obscure meaningful differences in reliability under sustained use. APST provides a quantitative framework for assessing operational risk that complements traditional breadth-oriented evaluations.
Abstract: Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety risk through breadth-oriented evaluation across diverse tasks. However, real-world deployment often exposes a different class of risk: operational failures arising from repeated generations of the same prompt rather than broad task generalization. In high-stakes settings, response consistency and safety under repeated use are critical operational requirements. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by highly accelerated stress testing in reliability engineering. APST probes LLM behavior by repeatedly sampling identical prompts under controlled operational conditions, including temperature variation and prompt perturbation, to surface latent failure modes such as hallucinations, refusal inconsistency, and unsafe completions. Rather than treating failures as isolated events, APST characterizes them statistically as stochastic outcomes of repeated inference. We model observed safety failures using Bernoulli and binomial formulations to estimate per-inference failure probabilities, enabling quantitative comparison of operational risk across models and configurations. We apply APST to multiple instruction-tuned LLMs evaluated on AIR-BENCH 2024 derived safety and security prompts. While models exhibit similar performance under conventional single- or very-low-sample evaluation (N <= 3), repeated sampling reveals substantial variation in empirical failure probabilities across temperatures. These results demonstrate that shallow benchmark scores can obscure meaningful differences in reliability under sustained use.
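Treating each generation as a Bernoulli trial makes the headline point quantitative: a rare failure mode is almost invisible at N <= 3 but reliably surfaces under repeated sampling. A sketch, with a Wilson interval as our (assumed) choice of confidence bound:

```python
import math

def failure_estimate(failures: int, n: int,
                     z: float = 1.96) -> tuple[float, float, float]:
    """Point estimate and Wilson 95% interval for the per-inference
    failure probability from n repeated samples of one prompt."""
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, center - half, center + half

def p_miss(p: float, n: int) -> float:
    """Probability that n samples show zero failures despite true rate p."""
    return (1 - p) ** n

# A 5% failure rate goes unseen ~86% of the time at N=3 samples ...
print(p_miss(0.05, 3))    # ~0.857
# ... but almost never under repeated sampling at N=200.
print(p_miss(0.05, 200))  # ~3.5e-05
```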
[771] Unifying Ontology Construction and Semantic Alignment for Deterministic Enterprise Reasoning at Scale
Hongyin Zhu
Main category: cs.AI
TL;DR: LOM is a unified neuro-symbolic framework that integrates ontology construction, semantic alignment, and logical reasoning into a single end-to-end architecture for enterprise data analysis.
Details
Motivation: Enterprise data remains chaotic and dormant, preventing comprehensive decision-making. Existing neuro-symbolic approaches use disjoint pipelines and suffer from error propagation.
Method: LOM employs a construct-align-reason (CAR) pipeline: autonomously constructs domain-specific ontologies from raw data, aligns neural generation with structural reality using graph-aware encoder and reinforcement learning, and executes deterministic reasoning over constructed topology.
Result: LOM-4B achieves 88.8% accuracy in ontology completion and 94% in complex graph reasoning tasks, significantly outperforming state-of-the-art LLMs.
Conclusion: Autonomous logical construction is essential for achieving deterministic, enterprise-grade intelligence.
Abstract: While enterprises amass vast quantities of data, much of it remains chaotic and effectively dormant, preventing decision-making based on comprehensive information. Existing neuro-symbolic approaches rely on disjoint pipelines and struggle with error propagation. We introduce the large ontology model (LOM), a unified framework that seamlessly integrates ontology construction, semantic alignment, and logical reasoning into a single end-to-end architecture. LOM employs a construct-align-reason (CAR) pipeline, leveraging its unified architecture across all three stages: it first autonomously constructs a domain-specific ontological universe from raw data, then aligns neural generation with this structural reality using a graph-aware encoder and reinforcement learning, and finally executes deterministic reasoning over the constructed topology, node attributes and relation types. We evaluate LOM on a comprehensive benchmark constructed from diverse real-world enterprise datasets. Experimental results demonstrate that LOM-4B achieves 88.8% accuracy in ontology completion and 94% in complex graph reasoning tasks, significantly outperforming state-of-the-art LLMs. These findings validate that autonomous logical construction is essential for achieving deterministic, enterprise-grade intelligence.
[772] General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging
Samir H. A. Mohammad, Wouter Mooi, Arkady Zgonnikov
Main category: cs.AI
TL;DR: LLMs as standalone driver agents in AV testing show promise for human behavior modeling but have limitations in capturing dynamic velocity responses and show model-specific safety performance.
Details
Motivation: Current human behavior models for AV safety assessment face trade-offs between interpretability and flexibility. LLMs offer a promising alternative as ready-to-use models without parameter fitting, but their capabilities for capturing human driving behavior are poorly understood.
Method: Embedded two general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) as standalone, closed-loop driver agents in a simplified 1D merging scenario. Compared their behavior against human data using quantitative and qualitative analyses. Conducted systematic prompt ablation study to understand prompt component effects.
Result: Both LLMs reproduced human-like intermittent operational control and tactical dependencies on spatial cues. However, neither consistently captured human response to dynamic velocity cues, and safety performance diverged sharply between models. Prompt components acted as model-specific inductive biases that didn’t transfer across LLMs.
Conclusion: General-purpose LLMs could potentially serve as standalone human behavior models in AV evaluation pipelines, but future research is needed to better understand their failure modes and ensure validity as models of human driving behavior.
Abstract: Human behavior models are essential as behavior references and for simulating human agents in virtual safety assessment of automated vehicles (AVs), yet current models face a trade-off between interpretability and flexibility. General-purpose large language models (LLMs) offer a promising alternative: a single model potentially deployable without parameter fitting across diverse scenarios. However, what LLMs can and cannot capture about human driving behavior remains poorly understood. We address this gap by embedding two general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) as standalone, closed-loop driver agents in a simplified one-dimensional merging scenario and comparing their behavior against human data using quantitative and qualitative analyses. Both models reproduce human-like intermittent operational control and tactical dependencies on spatial cues. However, neither consistently captures the human response to dynamic velocity cues, and safety performance diverges sharply between models. A systematic prompt ablation study reveals that prompt components act as model-specific inductive biases that do not transfer across LLMs. These findings suggest that general-purpose LLMs could potentially serve as standalone, ready-to-use human behavior models in AV evaluation pipelines, but future research is needed to better understand their failure modes and ensure their validity as models of human driving behavior.
[773] Beyond Theory of Mind in Robotics
Malte F. Jung
Main category: cs.AI
TL;DR: The paper critiques Theory of Mind approaches in robotics, arguing social meaning emerges through interaction rather than internal state inference, and proposes alternative design principles based on coordination and participation.
Details
Motivation: The author challenges three core assumptions of Theory of Mind approaches in robotics: 1) meaning travels from hidden mental states to observable behavior, 2) understanding requires detached inference rather than participation, and 3) behavioral meaning is fixed and available to passive observers. These assumptions poorly capture how real social interaction actually unfolds.
Method: The paper draws on ethnomethodology, conversation analysis, and participatory sense-making frameworks to argue that social meaning is produced through moment-to-moment coordination between agents rather than decoded from behavior.
Result: The analysis leads to three design implications for robotics: shifting from internal state modeling toward policies for sustaining coordination, from observer-based inference toward active participation, and from fixed behavioral meaning toward meaning potential stabilized through response.
Conclusion: Social robotics should move beyond Theory of Mind paradigms and embrace interactional approaches where meaning emerges through coordinated participation rather than being inferred from hidden mental states.
Abstract: Theory of Mind, the capacity to explain and predict behavior by inferring hidden mental states, has become the dominant paradigm for social interaction in robotics. Yet ToM rests on three assumptions that poorly capture how most social interaction actually unfolds: that meaning travels inside-out from hidden states to observable behavior; that understanding requires detached inference rather than participation; and that the meaning of behavior is fixed and available to a passive observer. Drawing on ethnomethodology, conversation analysis, and participatory sense-making, I argue that social meaning is not decoded from behavior but produced through moment-to-moment coordination between agents. This interactional foundation has direct implications for robot design: shifting from internal state modeling toward policies for sustaining coordination, from observer-based inference toward active participation, and from fixed behavioral meaning toward meaning potential stabilized through response.
[774] PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints
Minjun Park, Donghyun Kim, Hyeonjong Ju, Seungwon Lim, Dongwook Choi, Taeyoon Kwon, Minju Kim, Jinyoung Yeo
Main category: cs.AI
TL;DR: PAC-Bench is a benchmark for evaluating multi-agent collaboration under privacy constraints, revealing that privacy significantly degrades collaboration performance and causes coordination breakdowns.
Details
Motivation: As AI agents become more prevalent and interact with each other, there's a need to understand how privacy constraints affect multi-agent collaboration dynamics, which remains poorly understood.
Method: The authors present PAC-Bench, a benchmark for systematic evaluation of multi-agent collaboration under privacy constraints, conducting experiments to analyze performance degradation and coordination issues.
Result: Experiments show privacy constraints substantially degrade collaboration performance, make outcomes more dependent on the initiating agent, and reveal coordination breakdowns including early-stage privacy violations, overly conservative abstraction, and privacy-induced hallucinations.
Conclusion: Privacy-aware multi-agent collaboration is a distinct and unresolved challenge requiring new coordination mechanisms beyond existing agent capabilities.
Abstract: We are entering an era in which individuals and organizations increasingly deploy dedicated AI agents that interact and collaborate with other agents. However, the dynamics of multi-agent collaboration under privacy constraints remain poorly understood. In this work, we present $PAC\text{-}Bench$, a benchmark for systematic evaluation of multi-agent collaboration under privacy constraints. Experiments on $PAC\text{-}Bench$ show that privacy constraints substantially degrade collaboration performance and make outcomes depend more on the initiating agent than the partner. Further analysis reveals that this degradation is driven by recurring coordination breakdowns, including early-stage privacy violations, overly conservative abstraction, and privacy-induced hallucinations. Together, our findings identify privacy-aware multi-agent collaboration as a distinct and unresolved challenge that requires new coordination mechanisms beyond existing agent capabilities.
[775] The Geometry of Knowing: From Possibilistic Ignorance to Probabilistic Certainty – A Measure-Theoretic Framework for Epistemic Convergence
Moriba Kemessia Jah
Main category: cs.AI
TL;DR: This paper develops a measure-theoretic framework for epistemic contraction, showing how possibilistic representations of incomplete knowledge contract to probabilistic representations as evidence accumulates, with applications to filtering problems like orbital tracking.
Details
Motivation: To establish a rigorous mathematical framework for understanding how epistemic uncertainty (incomplete knowledge) transitions to aleatory uncertainty (intrinsic stochastic variability) as evidence accumulates, bridging possibility theory and probability theory.
Method: Develops a measure-theoretic framework using possibility distributions and necessity measures to define credal sets bounding consistent probability measures. Introduces epistemic width W, establishes contraction dynamics, and compares UKF (minimizing MSE) with ESPF (minimizing maximum entropy) through theoretical proofs and orbital tracking experiments.
Result: Proves the epistemic collapse condition (Theorem 4.5) showing Choquet integral converges to Lebesgue integral. Demonstrates both UKF and ESPF achieve 1-meter accuracy in 877-step orbital tracking, but ESPF provides epistemic honesty about what evidence hasn’t ruled out.
Conclusion: Probability theory emerges as the limiting geometry of epistemic contraction. UKF and ESPF solve different problems by different mechanisms but achieve convergent optimality in Gaussian cases, with ESPF offering epistemic transparency about remaining uncertainty.
Abstract: This paper develops a measure-theoretic framework establishing when and how a possibilistic representation of incomplete knowledge contracts into a probabilistic representation of intrinsic stochastic variability. Epistemic uncertainty is encoded by a possibility distribution and its dual necessity measure, defining a credal set bounding all probability measures consistent with current evidence. As evidence accumulates, the credal set contracts. The epistemic collapse condition marks the transition: the Choquet integral converges to the Lebesgue integral over the unique limiting density. We prove this rigorously (Theorem 4.5), with all assumptions explicit and a full treatment of the non-consonant case. We introduce the aggregate epistemic width W, establish its axiomatic properties, provide a canonical normalization, and give a feasible online proxy resolving a circularity in prior formulations. Section 7 develops the dynamics of epistemic contraction: evidence induces compatibility, compatibility performs falsification, posterior possibility is the min-intersection of prior possibility and compatibility, and a credibility-directed flow governs support geometry contraction. This is not belief updating. It is knowledge contraction. Probability theory is the limiting geometry of that process. The UKF and ESPF solve different problems by different mechanisms. The UKF minimizes MSE, asserts truth, and requires a valid generative model. The ESPF minimizes maximum entropy and surfaces what evidence has not ruled out. When the world is Gaussian and the model valid, both reach the same estimate by entirely different routes – convergent optimality, not hierarchical containment. We prove this (Theorem 9.1) and compare both on a 2-day, 877-step orbital tracking scenario. Both achieve 1-meter accuracy. The UKF is accurate but epistemically silent. The ESPF is accurate and epistemically honest.
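For orientation, the objects named in the abstract have standard possibility-theoretic definitions (notation is ours; f is assumed nonnegative for the Choquet form):

```latex
% Possibility \Pi and its dual necessity N, from \pi : \Omega \to [0,1]:
\[
  \Pi(A) = \sup_{\omega \in A} \pi(\omega), \qquad
  N(A) = 1 - \Pi(A^{c}).
\]
% The credal set of probability measures consistent with current evidence:
\[
  \mathcal{P} = \{\, P : N(A) \le P(A) \le \Pi(A)
      \ \text{for all measurable } A \,\}.
\]
% Epistemic collapse: as evidence contracts \mathcal{P} to a single P^*,
% the Choquet integral reduces to an ordinary (Lebesgue) expectation:
\[
  (C)\!\int f \, d\Pi = \int_{0}^{\infty} \Pi(\{f > t\})\, dt
  \;\longrightarrow\; \int f \, dP^{*}.
\]
```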
[776] AdaQE-CG: Adaptive Query Expansion for Web-Scale Generative AI Model and Data Card Generation
Haoxuan Zhang, Ruochi Li, Zhenni Liang, Mehri Sattari, Phat Vo, Collin Qu, Ting Xiao, Junhua Ding, Yang Zhang, Haihua Chen
Main category: cs.AI
TL;DR: AdaQE-CG is an adaptive framework for generating comprehensive AI model and data cards by dynamically extracting information from papers and repositories, and transferring knowledge between similar cards to fill missing information.
Details
Motivation: Existing automated methods for generating AI documentation face challenges with static templates that can't adapt to diverse paper structures, incomplete metadata from web repositories leading to information scarcity, and lack of standardized benchmarks for evaluation.
Method: Proposes AdaQE-CG with two main modules: 1) Intra-Paper Extraction via Context-Aware Query Expansion (IPE-QE) that iteratively refines extraction queries to recover richer information, and 2) Inter-Card Completion using the MetaGAI Pool (ICC-MP) that fills missing fields by transferring semantically relevant content from similar cards in a curated dataset. Also introduces MetaGAI-Bench benchmark.
Result: Comprehensive experiments across five quality dimensions show AdaQE-CG substantially outperforms existing approaches, exceeds human-authored data cards, and approaches human-level quality for model cards.
Conclusion: AdaQE-CG effectively addresses limitations of existing documentation generation methods through adaptive query expansion and cross-card knowledge transfer, providing a robust solution for generating trustworthy AI documentation.
Abstract: Transparent and standardized documentation is essential for building trustworthy generative AI (GAI) systems. However, existing automated methods for generating model and data cards still face three major challenges: (i) static templates, as most systems rely on fixed query templates that cannot adapt to diverse paper structures or evolving documentation requirements; (ii) information scarcity, since web-scale repositories such as Hugging Face often contain incomplete or inconsistent metadata, leading to missing or noisy information; and (iii) lack of benchmarks, as the absence of standardized datasets and evaluation protocols hinders fair and reproducible assessment of documentation quality. To address these limitations, we propose AdaQE-CG, an Adaptive Query Expansion for Card Generation framework that combines dynamic information extraction with cross-card knowledge transfer. Its Intra-Paper Extraction via Context-Aware Query Expansion (IPE-QE) module iteratively refines extraction queries to recover richer and more complete information from scientific papers and repositories, while its Inter-Card Completion using the MetaGAI Pool (ICC-MP) module fills missing fields by transferring semantically relevant content from similar cards in a curated dataset. In addition, we introduce MetaGAI-Bench, the first large-scale, expert-annotated benchmark for evaluating GAI documentation. Comprehensive experiments across five quality dimensions show that AdaQE-CG substantially outperforms existing approaches, exceeds human-authored data cards, and approaches human-level quality for model cards. Code, prompts, and data are publicly available at: https://github.com/haoxuan-unt2024/AdaQE-CG.
[777] Competing with AI Scientists: Agent-Driven Approach to Astrophysics Research
Thomas Borrett, Licong Xu, Andy Nilipour, Boris Bolliet, Sebastien Pierre, Erwan Allys, Celia Lecat, Biwei Dai, Po-Wen Chang, Wahid Bhimji
Main category: cs.AI
TL;DR: Agent-driven approach using multi-agent system Cmbagent to construct parameter inference pipelines, applied to cosmological weak lensing challenge, achieving first place with human intervention.
Details
Motivation: To develop scalable frameworks for scientific data analysis by leveraging autonomous/semi-autonomous agent systems that can rapidly explore and construct inference pipelines, addressing the challenge of robust parameter inference under realistic observational uncertainties.
Method: Multi-agent system (Cmbagent) with specialized agents collaborating to generate research ideas, write/execute code, evaluate results, and iteratively refine pipelines. Applied to FAIR Universe Weak Lensing Uncertainty Challenge using parameter-efficient CNNs, likelihood calibration over known parameter grid, and multiple regularization techniques.
Result: Fully autonomous exploration didn’t reach expert-level performance, but integration of human intervention enabled the agent-driven workflow to achieve first-place result in the challenge, demonstrating semi-autonomous agentic systems can compete with and surpass expert solutions.
Conclusion: Agent-driven research workflows provide scalable framework to rapidly explore and construct pipelines for inference problems, with semi-autonomous systems showing potential to compete with expert solutions when combined with human intervention.
Abstract: We present an agent-driven approach to the construction of parameter inference pipelines for scientific data analysis. Our method leverages a multi-agent system, Cmbagent (the analysis system of the AI scientist Denario), in which specialized agents collaborate to generate research ideas, write and execute code, evaluate results, and iteratively refine the overall pipeline. As a case study, we apply this approach to the FAIR Universe Weak Lensing Uncertainty Challenge, a competition under time constraints focused on robust cosmological parameter inference with realistic observational uncertainties. While the fully autonomous exploration initially did not reach expert-level performance, the integration of human intervention enabled our agent-driven workflow to achieve a first-place result in the challenge. This demonstrates that semi-autonomous agentic systems can compete with, and in some cases surpass, expert solutions. We describe our workflow in detail, including both the autonomous and semi-autonomous exploration by Cmbagent. Our final inference pipeline utilizes parameter-efficient convolutional neural networks, likelihood calibration over a known parameter grid, and multiple regularization techniques. Our results suggest that agent-driven research workflows can provide a scalable framework to rapidly explore and construct pipelines for inference problems.
[778] GenTac: Generative Modeling and Forecasting of Soccer Tactics
Jiayuan Rao, Tianlin Gui, Haoning Wu, Yanfeng Wang, Weidi Xie
Main category: cs.AI
TL;DR: GenTac is a diffusion-based generative framework for modeling stochastic multi-agent soccer tactics, producing diverse long-horizon player trajectories with tactical event conditioning.
Details
Motivation: Existing computational approaches for soccer tactics produce single deterministic forecasts or focus on set-pieces, failing to capture the inherent variance and branching possibilities of real-world match evolution in open-play situations.
Method: A diffusion-based generative framework that models soccer tactics as a stochastic process over continuous multi-player trajectories and discrete semantic events. It learns distributions from historical tracking data and supports rich contextual conditioning including opponent behavior, team styles, and strategic objectives.
Result: GenTac achieves high geometric accuracy while preserving team structural consistency, accurately simulates stylistic nuances between teams/leagues, enables controllable counterfactual simulations altering spatial metrics, and reliably anticipates future tactical outcomes. It also generalizes to other team sports like basketball and hockey.
Conclusion: GenTac provides a powerful framework for generating diverse, plausible soccer tactics with rich conditioning capabilities, advancing beyond deterministic approaches to better capture the stochastic nature of multi-agent sports.
Abstract: Modeling open-play soccer tactics is a formidable challenge due to the stochastic, multi-agent nature of the game. Existing computational approaches typically produce single, deterministic trajectory forecasts or focus on highly structured set-pieces, fundamentally failing to capture the inherent variance and branching possibilities of real-world match evolution. Here, we introduce GenTac, a diffusion-based generative framework that conceptualizes soccer tactics as a stochastic process over continuous multi-player trajectories and discrete semantic events. By learning the underlying distribution of player movements from historical tracking data, GenTac samples diverse, plausible, long-horizon future trajectories. The framework supports rich contextual conditioning, including opponent behavior, specific team or league playing styles, and strategic objectives, while grounding continuous spatial dynamics into a 15-class tactical event space. Extensive evaluations on our proposed benchmark, TacBench, demonstrate four key capabilities: (1) GenTac achieves high geometric accuracy while strictly preserving the collective structural consistency of the team; (2) it accurately simulates stylistic nuances, distinguishing between specific teams (e.g., Auckland FC) and leagues (e.g., A-League versus German leagues); (3) it enables controllable counterfactual simulations, demonstrably altering spatial control and expected threat metrics based on offensive or defensive guidance; and (4) it reliably anticipates future tactical outcomes directly from generated rollouts. Finally, we demonstrate that GenTac can be successfully trained to generalize to other dynamic team sports, including basketball, American football, and ice hockey.
[779] How LLMs Might Think
Joseph Gottlieb, Ethan Kemp, Matthew Trager
Main category: cs.AI
TL;DR: The paper argues against the claim that LLMs don’t think, proposing instead that if they do think, it’s through arational, associative processes rather than rational thinking.
Details
Motivation: To challenge the argument from rationality that claims LLMs don't think, and to explore alternative conceptions of thinking that might apply to LLMs.
Method: Philosophical analysis and argumentation, examining the premises and conclusions of the rationality argument against LLM thinking.
Result: The paper contends that the rationality argument fails and leaves open the possibility that LLMs engage in arational, associative thinking rather than rational thinking.
Conclusion: If LLMs think at all, they likely think through purely associative, arational processes rather than rational reasoning.
Abstract: Do large language models (LLMs) think? Daniel Stoljar and Zhihe Vincent Zhang have recently developed an argument from rationality for the claim that LLMs do not think. We contend, however, that the argument from rationality not only falters, but leaves open an intriguing possibility: that LLMs engage only in arational, associative forms of thinking, and have purely associative minds. Our positive claim is that if LLMs think at all, they likely think precisely in this manner.
[780] Belief-Aware VLM Model for Human-like Reasoning
Anshul Nayak, Shahil Shaik, Yue Wang
Main category: cs.AI
TL;DR: A belief-aware VLM framework that integrates retrieval-based memory and RL for improved intent inference, evaluated on VQA datasets.
Details
Motivation: Traditional intent inference models struggle with generalization across tasks and dynamic environments. While VLMs/VLAs enable zero-shot performance through multimodal pretraining, they lack explicit belief representation and updating mechanisms for human-like reasoning about evolving intent over long horizons.
Method: Proposes a belief-aware VLM framework with: 1) Retrieval-based memory system that approximates belief using vector-based memory to retrieve relevant multimodal context, 2) Integration of retrieved context into VLM for reasoning, 3) Reinforcement learning policy over VLM latent space to refine decision-making.
Result: Demonstrated consistent improvements over zero-shot baselines on publicly available VQA datasets (HD-EPIC), highlighting the importance of belief-aware reasoning.
Conclusion: The proposed belief-aware framework enhances VLM reasoning capabilities by incorporating belief representation through memory and RL, addressing limitations in capturing evolving human intent over long horizons.
Abstract: Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision Language Models (VLMs) and Vision Language Action (VLA) models introduce common-sense reasoning through large-scale multimodal pretraining, enabling zero-shot performance across tasks. However, these models still lack explicit mechanisms to represent and update belief, limiting their ability to reason like humans or capture evolving human intent over long horizons. To address this, we propose a belief-aware VLM framework that integrates retrieval-based memory and reinforcement learning. Instead of learning an explicit belief model, we approximate belief using a vector-based memory that retrieves relevant multimodal context, which is incorporated into the VLM for reasoning. We further refine decision-making using a reinforcement learning policy over the VLM latent space. We evaluate our approach on publicly available VQA datasets such as HD-EPIC and demonstrate consistent improvements over zero-shot baselines, highlighting the importance of belief-aware reasoning.
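The retrieval-based belief approximation reduces to a cosine-similarity memory over multimodal embeddings. A minimal sketch (the embedding source and what gets stored are our assumptions):

```python
import numpy as np

class BeliefMemory:
    """Vector memory approximating belief: store multimodal embeddings,
    retrieve the top-k most similar to the current observation, and
    hand them to the VLM as extra context."""

    def __init__(self):
        self.keys, self.values = [], []

    def write(self, embedding: np.ndarray, context: str):
        self.keys.append(embedding / np.linalg.norm(embedding))
        self.values.append(context)

    def read(self, query: np.ndarray, k: int = 3) -> list[str]:
        if not self.keys:
            return []
        q = query / np.linalg.norm(query)
        sims = np.stack(self.keys) @ q            # cosine similarities
        top = np.argsort(sims)[::-1][:k]
        return [self.values[i] for i in top]

# Retrieved snippets would be prepended to the VLM prompt as belief context.
```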
[781] Tipiano: Cascaded Piano Hand Motion Synthesis via Fingertip Priors
Joonhyung Bae, Kirak Kim, Hyeyoon Cho, Sein Lee, Yoon-Seok Choi, Hyeon Hur, Gyubin Lee, Akira Maezawa, Satoshi Obata, Jonghwa Park, Jaebum Park, Juhan Nam
Main category: cs.AI
TL;DR: A four-stage framework for synthesizing realistic piano hand motions that exploits the hierarchical nature of piano playing: deterministic fingertip positioning with stylistic freedom in wrist and intermediate joints.
Details
Motivation: Existing methods for piano hand motion synthesis have limitations - physics-based methods produce stiff motions while data-driven models lack positional accuracy. Piano motion has a natural hierarchy that can be exploited for better synthesis.
Method: Four-stage framework: (1) statistics-based fingertip positioning, (2) FiLM-conditioned trajectory refinement, (3) wrist estimation, and (4) STGCN-based pose synthesis. Uses expert-annotated fingerings for the FürElise dataset (153 pieces, ~10 hours).
Result: Achieves F1 = 0.910, substantially outperforming diffusion baselines (F1 = 0.121). User study (N=41) confirms quality approaching motion capture. Expert evaluation by professional pianists (N=5) identified anticipatory motion as the key remaining gap.
Conclusion: The hierarchical approach successfully synthesizes realistic piano hand motions by separating deterministic fingertip positioning from stylistic wrist and joint movements, providing concrete directions for future improvement in anticipatory motion.
Abstract: Synthesizing realistic piano hand motions requires both precision and naturalness. Physics-based methods achieve precision but produce stiff motions; data-driven models learn natural dynamics but struggle with positional accuracy. Piano motion exhibits a natural hierarchy: fingertip positions are nearly deterministic given piano geometry and fingering, while wrist and intermediate joints offer stylistic freedom. We present Tipiano, a four-stage framework exploiting this hierarchy: (1) statistics-based fingertip positioning, (2) FiLM-conditioned trajectory refinement, (3) wrist estimation, and (4) STGCN-based pose synthesis. We contribute expert-annotated fingerings for the FürElise dataset (153 pieces, ~10 hours). Experiments demonstrate F1 = 0.910, substantially outperforming diffusion baselines (F1 = 0.121), with a user study (N=41) confirming quality approaching motion capture. Expert evaluation by professional pianists (N=5) identified anticipatory motion as the key remaining gap, providing concrete directions for future improvement.
[782] The Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry, Not Necessarily Domain Expertise
Xi Wang, Soufiane Hayou, Eric Nalisnick
Main category: cs.AI
TL;DR: MoE expert specialization emerges from hidden state similarity in representation space, not routing architecture, but specialization patterns resist human interpretation and understanding MoE specialization is as hard as understanding LLM hidden state geometry.
Details
Motivation: To understand the mechanisms behind expert specialization in Mixture of Experts (MoEs) for large language models, which remains poorly understood despite their widespread use.
Method: Analyzed MoE routers as linear maps, showing hidden state similarity explains expert usage similarity. Examined five pre-trained models at token and sequence levels, studied load-balancing loss effects, and analyzed specialization patterns across different models and inputs.
Result: Expert specialization emerges from representation space geometry, not routing architecture. Load-balancing loss suppresses shared hidden state directions to maintain routing diversity. Specialization patterns resist interpretation: expert overlap between models answering same question is low (~60%), prompt-level routing doesn’t predict rollout-level routing, and deeper layers show near-identical expert activation across semantically unrelated inputs.
Conclusion: Understanding MoE expert specialization is fundamentally tied to understanding LLM hidden state geometry, which remains an open problem. While MoE efficiency is well understood, specialization mechanisms are complex and resist human interpretation.
Abstract: Mixture of Experts (MoEs) are now ubiquitous in large language models, yet the mechanisms behind their “expert specialization” remain poorly understood. We show that, since MoE routers are linear maps, hidden state similarity is both necessary and sufficient to explain expert usage similarity, and specialization is therefore an emergent property of the representation space, not of the routing architecture itself. We confirm this at both token and sequence level across five pre-trained models. We additionally prove that load-balancing loss suppresses shared hidden state directions to maintain routing diversity, which might provide a theoretical explanation for specialization collapse under less diverse data, e.g. small batch. Despite this clean mechanistic account, we find that specialization patterns in pre-trained MoEs resist human interpretation: expert overlap between different models answering the same question is no higher than between entirely different questions ($\sim$60%); prompt-level routing does not predict rollout-level routing; and deeper layers exhibit near-identical expert activation across semantically unrelated inputs, especially in reasoning models. We conclude that, while the efficiency perspective of MoEs is well understood, understanding expert specialization is at least as hard as understanding LLM hidden state geometry, a long-standing open problem in the literature.
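Because the router is just a linear map, the paper's central observation can be demonstrated in a few lines. A minimal numpy sketch (dimensions, router weights, and top-k are illustrative, not taken from the paper):

```python
# Minimal sketch: with a linear router, nearby hidden states receive the
# same top-k expert assignment. Dimensions, weights, and top-k are
# illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2
W = rng.normal(size=(n_experts, d_model))   # linear router, no bias

def route(h):
    """Top-k experts for hidden state h under router W."""
    logits = W @ h
    return set(np.argsort(logits)[-top_k:])

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

h1 = rng.normal(size=d_model)
h2 = h1 + 0.01 * rng.normal(size=d_model)   # nearly identical hidden state
h3 = rng.normal(size=d_model)               # unrelated hidden state

print(round(cos(h1, h2), 4), route(h1) == route(h2))  # ~1.0, same experts
print(round(cos(h1, h3), 4), route(h1) == route(h3))  # ~0.0, usually different
```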
[783] Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
Siyuan Xu, Shiyang Li, Xin Liu, Tianyi Liu, Yixiao Li, Zhan Shi, Zixuan Zhang, Zilong Wang, Qingyu Yin, Jianshu Chen, Tuo Zhao, Bing Yin
Main category: cs.AI
TL;DR: COVERT: A pipeline for generating synthetic tool-use environments with oracle-preserving augmentations to enable reinforcement learning optimization of tool-calling policies.
Details
Motivation: Existing synthetic tool-use corpora are designed for offline supervised fine-tuning but lack executable environments needed for reinforcement learning, which requires reward-checkable online rollouts.
Method: Two-stage pipeline: 1) Generate reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, 2) Apply oracle-preserving augmentations that increase environmental complexity (distractor tools, ambiguous queries, noisy tool outputs) while preserving oracle tool calls and final answers as ground truth.
Result: On Qwen2.5-Instruct-14B, COVERT-RL improves overall accuracy on BFCL v3 from 56.5 to 59.9 and on ACEBench from 53.0 to 59.3, with minimal regressions on general-ability benchmarks; when stacked on SFT, it further reaches 62.1 and 61.8.
Conclusion: Oracle-preserving synthetic environments offer a practical RL refinement stage, complementary to SFT, for improving tool-use robustness under ambiguity and unreliable tool feedback.
Abstract: Existing synthetic tool-use corpora are primarily designed for offline supervised fine-tuning, yet reinforcement learning (RL) requires executable environments that support reward-checkable online rollouts. We propose COVERT, a two-stage pipeline that first generates reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, and then applies oracle-preserving augmentations that systematically increase environmental complexity. These augmentations introduce distractor tools, indirect or ambiguous user queries, and noisy, multi-format, or erroneous tool outputs, while strictly preserving oracle tool calls and final answers as ground truth. This design enables automatic reward computation via reference matching for standard cases and lightweight judge-assisted verification for special behaviors such as error detection, supporting RL optimization of tool-calling policies. On Qwen2.5-Instruct-14B, COVERT-RL improves overall accuracy on BFCL v3 from 56.5 to 59.9 and on ACEBench from 53.0 to 59.3, with minimal regressions on general-ability benchmarks; when stacked on SFT, it further reaches 62.1 and 61.8, confirming additive gains. These results suggest that oracle-preserving synthetic environments offer a practical RL refinement stage, complementary to SFT, for improving tool-use robustness under ambiguity and unreliable tool feedback.
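The oracle-preserving idea is easy to state in code: make the environment harder while leaving the reference untouched, so rewards remain computable by exact matching. A toy sketch with hypothetical names and data layout, not the COVERT code:

```python
# Toy sketch of oracle-preserving augmentation: the environment is made harder
# while the reference (oracle tool calls + final answer) is left untouched,
# so rewards can still be computed by exact reference matching.
# All names and the data layout are hypothetical, not the paper's code.
import copy
import random

base_env = {
    "tools": [{"name": "get_weather", "args": {"city": "str"}}],
    "oracle_calls": [{"tool": "get_weather", "args": {"city": "Paris"}}],
    "final_answer": "18C and cloudy",
}

def augment(env, n_distractors=3, seed=0):
    rng = random.Random(seed)
    env = copy.deepcopy(env)
    # Add distractor tools that are plausible but never needed.
    for i in range(n_distractors):
        env["tools"].append({"name": f"distractor_{i}", "args": {"x": "str"}})
    rng.shuffle(env["tools"])
    # Oracle calls and the final answer are preserved as ground truth.
    return env

def reward(trajectory_calls, final_answer, env):
    """1.0 only if the agent reproduced the oracle calls and answer."""
    calls_ok = trajectory_calls == env["oracle_calls"]
    answer_ok = final_answer == env["final_answer"]
    return float(calls_ok and answer_ok)

hard_env = augment(base_env)
print(reward(base_env["oracle_calls"], "18C and cloudy", hard_env))  # 1.0
```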
[784] EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
Tiantian He, Yihang Chen, Keyue Jiang, Ka Yiu Lee, Kaiwen Zhou, Kun Shao, Shuai Wang
Main category: cs.AI
TL;DR: A self-evolving framework for computer-use agents that learns optimal balance between GUI interaction and API calls via Model Context Protocol, with automatic pipeline for environment generation, trajectory collection, and experience accumulation.
Details
Motivation: Existing computer-use agents lack principled understanding of how to balance GUI interaction with structured API calls via MCP, and how to enable iterative self-improvement across diverse applications.
Method: Formulates MCP-GUI interplay as unified hybrid policy learning problem, uses distillation and experience augmentation targeting different failure modes, and proposes self-evolving framework with automatic pipeline for environment generation, trajectory collection, gap-driven task synthesis, and quality-filtered training.
Result: Systematic cross-application analysis shows optimal strategy depends on MCP-GUI composition: distillation achieves 77.8% pass rate on MCP-dominant tasks (+17.8pp improvement), while experience bank excels on GUI-intensive tasks (+10.0pp improvement).
Conclusion: The framework enables principled understanding of modality balancing and self-improvement for computer-use agents, with application-aware mechanism selection being crucial for optimal performance across different task types.
Abstract: Computer-use agents that combine GUI interaction with structured API calls via the Model Context Protocol (MCP) show promise for automating software tasks. However, existing approaches lack a principled understanding of how agents should balance these two modalities and how to enable iterative self-improvement across diverse applications. We formulate MCP-GUI interplay as a unified hybrid policy learning problem where the agent learns when each modality provides complementary advantages, and show that distillation and experience augmentation target fundamentally different failure modes - requiring application-aware mechanism selection. Built on this formulation, we propose a self-evolving framework with a fully automatic pipeline that orchestrates automatic environment generation and validation, trajectory collection, gap-driven task synthesis, and quality-filtered training - all without manual intervention. A key innovation is our experience bank, which accumulates LLM-learned rules from trajectory comparison, enabling inference-time improvement without fine-tuning. Systematic cross-application analysis across three desktop applications reveals that the optimal strategy depends on MCP-GUI composition: distillation achieves 77.8% pass rate on MCP-dominant tasks (+17.8pp), while the experience bank excels on GUI-intensive tasks (+10.0pp).
[785] COMPOSITE-STEM
Kyle Waters, Lucas Nuzzi, Tadhg Looram, Alessandro Tomasiello, Ariel Ghislain Kemogne Kamdoum, Bikun Li, Damien Sileo, Egor Kretov, Francesco Fournier-Facio, Georgios Soloupis, Haile Kassahun, Hew Wolff, Jiaqi Cai, Lianghui Li, Marc Roth, Mohinder Naiya, Naixu Guo, Qicheng Tang, Richard Wheeler, Samuele Sala, Serguei Popov, Steven Dillman, Yuqi Li
Main category: cs.AI
TL;DR: COMPOSITE-STEM is a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics for evaluating AI agents’ scientific reasoning capabilities, featuring flexible assessment methods and showing current models achieve only 21% success rate.
Details
Motivation: There's a growing promise for AI agents in accelerating scientific discovery, but adoption into real workflows is hindered by a lack of frontier evaluations. Existing expert-written benchmarks have become saturated and only measure performance on constrained outputs, failing to capture more flexible scientific reasoning capabilities.
Method: The authors introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks curated by doctoral-level researchers across physics, biology, chemistry, and mathematics. They combine exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol for flexible assessment. They evaluate four frontier models using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework.
Result: The top-performing model achieves only 21% success rate on the benchmark, demonstrating that COMPOSITE-STEM captures scientific reasoning capabilities beyond the reach of current AI agents. All tasks are open-sourced with contributor permission to support reproducibility.
Conclusion: COMPOSITE-STEM provides a challenging benchmark for evaluating AI agents’ scientific reasoning capabilities across STEM domains, revealing significant gaps in current models’ abilities and supporting research towards AI’s acceleration of scientific progress.
Abstract: AI agents hold growing promise for accelerating scientific discovery; yet, a lack of frontier evaluations hinders adoption into real workflows. Expert-written benchmarks have proven effective at measuring AI reasoning, but most at this stage have become saturated and only measure performance on constrained outputs. To help address this gap, we introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. Our benchmark combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol, allowing more flexible assessment of scientifically meaningful outputs. Using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework, we evaluate four frontier models. The top-performing model achieves 21%, demonstrating that COMPOSITE-STEM captures capabilities beyond current agent reach. All tasks are open-sourced with contributor permission to support reproducibility and to promote additional research towards AI’s acceleration of scientific progress in these domains.
[786] Steered LLM Activations are Non-Surjective
Aayush Mishra, Daniel Khashabi, Anqi Liu
Main category: cs.AI
TL;DR: Activation steering creates model states that cannot be reproduced by any textual prompt, establishing a formal separation between white-box control and black-box prompting.
Details
Motivation: To determine whether activation steering produces states that are actually realizable through normal text prompting, addressing concerns about whether steering results reflect genuine model vulnerabilities or interpretability insights.
Method: Formal mathematical analysis framing the problem as a surjectivity question, proving under practical assumptions that steering pushes activations off the manifold of prompt-reachable states, plus empirical validation across three widely used LLMs.
Result: Proved that activation steering almost surely creates states with no textual prompt pre-image, empirically demonstrated across multiple models, establishing formal separation between white-box steerability and black-box prompting.
Conclusion: Activation steering results should not be interpreted as evidence of prompt-based vulnerabilities or interpretability; evaluation protocols should decouple white-box and black-box interventions.
Abstract: Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a pre-image under the model’s natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
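For readers unfamiliar with the intervention under analysis, a minimal sketch of activation steering; the model, layer, vector, and scale are placeholders:

```python
# Minimal sketch of the intervention analyzed in the paper: activation
# steering adds a fixed direction to the residual stream at some layer.
# The dimension, vector, and scale here are all placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_model = 4096
h = rng.normal(size=d_model)            # stands in for a prompt-induced activation
v = rng.normal(size=d_model)            # a learned "steering" direction
v /= np.linalg.norm(v)

alpha = 8.0                             # steering strength
h_steered = h + alpha * v               # the white-box intervention

# The paper's result: for generic v and alpha, h_steered almost surely has no
# pre-image under the forward pass, i.e. no discrete prompt produces it.
print(np.linalg.norm(h_steered - h))    # = alpha, since v is unit-norm
```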
[787] MEMENTO: Teaching LLMs to Manage Their Own Context
Vasilis Kontonis, Yuchen Zeng, Shivam Garg, Lingjiao Chen, Hao Tang, Ziyan Wang, Ahmed Awadallah, Eric Horvitz, John Langford, Dimitris Papailiopoulos
Main category: cs.AI
TL;DR: MEMENTO teaches LLMs to segment reasoning into blocks, compress each block into dense state summaries (mementos), and reason forward by attending only to these mementos, reducing context, KV cache, and compute while maintaining accuracy.
Details
Motivation: Current reasoning models process information in long, unstructured streams without mechanisms for compressing or organizing intermediate states, leading to inefficient use of context, KV cache, and computational resources.
Method: Two-stage supervised fine-tuning on OpenMementos dataset (228K reasoning traces from OpenThoughts-v3) teaches models to segment reasoning into blocks, compress each into mementos, and reason by attending only to mementos. Extended vLLM to support inference.
Result: Models maintain strong accuracy on math, science, and coding benchmarks while achieving ~2.5× peak KV cache reduction and ~1.75× throughput improvement. Identified dual information stream where information flows through both memento text and KV states.
Conclusion: MEMENTO enables efficient reasoning compression while maintaining accuracy, with practical benefits for inference throughput and KV cache reduction. The dual information stream discovery highlights important architectural considerations for compressed reasoning.
Abstract: Reasoning models think in long, unstructured streams with no mechanism for compressing or organizing their own intermediate state. We introduce MEMENTO: a method that teaches models to segment reasoning into blocks, compress each block into a memento, i.e., a dense state summary, and reason forward by attending only to mementos, reducing context, KV cache, and compute. To train MEMENTO models, we release OpenMementos, a public dataset of 228K reasoning traces derived from OpenThoughts-v3, segmented and annotated with intermediate summaries. We show that a two-stage SFT recipe on OpenMementos is effective across different model families (Qwen3, Phi-4, Olmo 3) and scales (8B–32B parameters). Trained models maintain strong accuracy on math, science, and coding benchmarks while achieving ${\sim}2.5\times$ peak KV cache reduction. We extend vLLM to support our inference method, achieving ${\sim}1.75\times$ throughput improvement while also enabling us to perform RL and further improve accuracy. Finally, we identify a dual information stream: information from each reasoning block is carried both by the memento text and by the corresponding KV states, which retain implicit information from the original block. Removing this channel drops accuracy by 15 pp on AIME24.
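A hypothetical sketch of what memento-style decoding looks like at inference time; `generate` and `summarize` are stubs standing in for model calls, not the released implementation:

```python
# Hypothetical sketch of memento-style decoding: reason in blocks, compress
# each finished block into a dense summary ("memento"), and let later blocks
# see only the mementos. `generate` and `summarize` are stubs for model
# calls; this is not the released MEMENTO implementation.

def generate(context: str, stop: str = "<end_block>") -> str:
    """Stub: produce the next reasoning block given the visible context."""
    return f"...reasoning over {len(context)} visible chars...{stop}"

def summarize(block: str) -> str:
    """Stub: compress a reasoning block into a dense state summary."""
    return f"<memento: {len(block)}-char block>"

def memento_decode(question: str, max_blocks: int = 4) -> str:
    mementos = []
    for _ in range(max_blocks):
        # The visible context is the question plus mementos only; full past
        # blocks are dropped, which is what shrinks context and KV cache.
        context = question + "\n" + "\n".join(mementos)
        block = generate(context)
        mementos.append(summarize(block))
    return generate(question + "\n" + "\n".join(mementos), stop="<answer>")

print(memento_decode("What is 17 * 24?"))
```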
[788] Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards
Shuze Daniel Liu, Claire Chen, Jiabao Sean Xiao, Lei Lei, Yuheng Zhang, Yisong Yue, David Simchi-Levi
Main category: cs.AI
TL;DR: LLMs trained with Reinforcement Learning from Verifiable Rewards (RLVR) can learn sophisticated negotiation strategies in bilateral price negotiation games, outperforming much larger frontier models.
Details
Motivation: Large Language Models struggle with strategic games of incomplete information like price negotiation, so researchers investigate if RLVR can effectively teach LLMs to negotiate and what strategic behaviors emerge during learning.
Method: Introduce a framework training a mid-sized buyer agent against a regulated LLM seller across real-world products, grounding reward signals in economic surplus maximization and strict adherence to private budget constraints.
Result: Reveals a novel four-phase strategic evolution: naive bargaining → aggressive starting prices → deadlock phase → sophisticated persuasive skills. A 30B agent significantly outperforms frontier models 10x its size in extracting surplus, generalizes to stronger unseen counterparties, and remains effective against hostile adversarial sellers.
Conclusion: Verifiable training enables LLMs to develop sophisticated negotiation strategies that outperform much larger models, demonstrating RLVR’s effectiveness for teaching strategic reasoning in incomplete information games.
Abstract: The recent advancement of Large Language Models (LLMs) has established their potential as autonomous interactive agents. However, they often struggle in strategic games of incomplete information, such as bilateral price negotiation. In this paper, we investigate if Reinforcement Learning from Verifiable Rewards (RLVR) can effectively teach LLMs to negotiate. Specifically, we explore the strategic behaviors that emerge during the learning process. We introduce a framework that trains a mid-sized buyer agent against a regulated LLM seller across a wide distribution of real-world products. By grounding reward signals directly in the maximization of economic surplus and strict adherence to private budget constraints, we reveal a novel four-phase strategic evolution. The agent progresses from naive bargaining to using aggressive starting prices, moves through a phase of deadlock, and ultimately develops sophisticated persuasive skills. Our results demonstrate that this verifiable training allows a 30B agent to significantly outperform frontier models over ten times its size in extracting surplus. Furthermore, the trained agent generalizes robustly to stronger counterparties unseen during training and remains effective even when facing hostile, adversarial seller personas.
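The verifiable reward the abstract describes (economic surplus under a hard private-budget constraint) is simple enough to write down; the exact shaping and penalty values in the paper may differ:

```python
# Sketch of the verifiable reward described above: economic surplus for a
# deal within the private budget, nothing for walking away, a penalty for
# violating the budget. The paper's exact shaping may differ.
from typing import Optional

def buyer_reward(agreed_price: Optional[float], budget: float) -> float:
    if agreed_price is None:          # no deal reached
        return 0.0
    if agreed_price > budget:         # hard private-budget constraint violated
        return -1.0
    return budget - agreed_price      # surplus extracted from the seller

print(buyer_reward(72.0, budget=100.0))   # 28.0 surplus
print(buyer_reward(105.0, budget=100.0))  # -1.0 violation
print(buyer_reward(None, budget=100.0))   # 0.0 walk-away
```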
[789] Evolutionary Token-Level Prompt Optimization for Diffusion Models
Domício Pereira Neto, João Correia, Penousal Machado
Main category: cs.AI
TL;DR: Genetic Algorithm optimizes CLIP token vectors for text-to-image diffusion models, combining aesthetic quality and prompt-image alignment metrics to outperform existing prompt optimization methods.
Details
Motivation: Text-to-image diffusion models are highly sensitive to prompt formulation, requiring extensive manual trial and error. Current automated methods are limited, motivating the development of model-agnostic prompt optimization that systematically explores conditioning space beyond simple text rewriting.
Method: Uses Genetic Algorithm to directly evolve token vectors used by CLIP-based diffusion models. Optimizes fitness function combining LAION Aesthetic Predictor V2 (aesthetic quality) and CLIPScore (prompt-image alignment). Adaptable to models with tokenized text encoders.
Result: Outperforms baseline methods including Promptist and random search on 36 prompts from Parti Prompts dataset, achieving up to 23.93% improvement in fitness. Provides modular framework for future extensions.
Conclusion: Genetic Algorithm approach effectively optimizes prompts for diffusion models by directly manipulating token vectors, offering a systematic, model-agnostic solution to prompt sensitivity issues with potential for future extensions.
Abstract: Text-to-image diffusion models exhibit strong generative performance but remain highly sensitive to prompt formulation, often requiring extensive manual trial and error to obtain satisfactory results. This motivates the development of automated, model-agnostic prompt optimization methods that can systematically explore the conditioning space beyond conventional text rewriting. This work investigates the use of a Genetic Algorithm (GA) for prompt optimization by directly evolving the token vectors employed by CLIP-based diffusion models. The GA optimizes a fitness function that combines aesthetic quality, measured by the LAION Aesthetic Predictor V2, with prompt-image alignment, assessed via CLIPScore. Experiments on 36 prompts from the Parti Prompts (P2) dataset show that the proposed approach outperforms the baseline methods, including Promptist and random search, achieving up to a 23.93% improvement in fitness. Overall, the method is adaptable to image generation models with tokenized text encoders and provides a modular framework for future extensions, the limitations and prospects of which are discussed.
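A toy sketch of the evolutionary loop: candidates are token-embedding matrices rather than text strings, scored by a combined fitness. The toy objective below stands in for the paper's LAION Aesthetic Predictor V2 + CLIPScore combination:

```python
# Toy GA over token vectors: the genome is the token-embedding matrix fed
# to the text encoder. The fitness below has a known optimum so the loop is
# testable; the paper combines an aesthetic predictor with CLIPScore.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_embed = 8, 32
pop_size, n_gens, sigma = 16, 20, 0.05

def fitness(tok_vecs):
    """Stand-in objective; swap in aesthetic + alignment scores in practice."""
    return -np.linalg.norm(tok_vecs - 1.0)

pop = [rng.normal(size=(n_tokens, d_embed)) for _ in range(pop_size)]
for _ in range(n_gens):
    elite = sorted(pop, key=fitness, reverse=True)[: pop_size // 4]
    # Offspring: Gaussian mutation of the elites (crossover omitted).
    pop = elite + [e + sigma * rng.normal(size=e.shape)
                   for e in elite for _ in range(3)]

print("best fitness:", round(fitness(max(pop, key=fitness)), 3))
```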
[790] What do your logits know? (The answer may surprise you!)
Masha Fedzechkina, Eleonora Gualdoni, Rita Ramos, Sinead Williamson
Main category: cs.AI
TL;DR: Vision-language models can leak sensitive information through model internals like residual streams and top logits, even when compressed through natural bottlenecks.
Details
Motivation: To systematically compare information leakage risks in vision-language models across different representational levels, from rich residual streams to compressed bottlenecks like tuned lens projections and top-k logits.
Method: Using vision-language models as testbed, comparing information retention at different representational levels as information is compressed from residual streams through two natural bottlenecks: low-dimensional projections via tuned lens, and final top-k logits.
Result: Even easily accessible bottlenecks like top logit values can leak task-irrelevant information from image-based queries, sometimes revealing as much information as direct projections of the full residual stream.
Conclusion: Vision-language models pose significant information leakage risks through model internals, with even compressed representations potentially exposing sensitive data that model owners assumed was inaccessible.
Abstract: Recent work has shown that probing model internals can reveal a wealth of information not apparent from the model generations. This poses the risk of unintentional or malicious information leakage, where model users are able to learn information that the model owner assumed was inaccessible. Using vision-language models as a testbed, we present the first systematic comparison of information retained at different “representational levels” as it is compressed from the rich information encoded in the residual stream through two natural bottlenecks: low-dimensional projections of the residual stream obtained using tuned lens, and the final top-k logits most likely to impact the model’s answer. We show that even easily accessible bottlenecks defined by the model’s top logit values can leak task-irrelevant information present in an image-based query, in some cases revealing as much information as direct projections of the full residual stream.
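To see why top-k logits can act as a leaky bottleneck, here is a synthetic sketch: an attribute the model never states shifts the logit distribution, and a simple probe on only the sorted top-k values recovers it. The data, shift, and probe are stand-ins, not the paper's VLM setup:

```python
# Synthetic sketch: a hidden binary attribute shifts the output logits; a
# nearest-centroid probe on only the sorted top-k logit values recovers it
# above chance. Data, shift, and probe stand in for the paper's VLM setup.
import numpy as np

rng = np.random.default_rng(0)
vocab, k, n = 1000, 20, 400

y = rng.integers(0, 2, size=n)                       # unstated attribute
shift = rng.normal(size=vocab)                       # attribute's logit signature
logits = rng.normal(size=(n, vocab)) + 0.4 * y[:, None] * shift

topk = np.sort(logits, axis=1)[:, -k:]               # the accessible bottleneck

# Nearest-centroid probe on bottleneck features (half the data for centroids).
mu0 = topk[:200][y[:200] == 0].mean(0)
mu1 = topk[:200][y[:200] == 1].mean(0)
test, y_test = topk[200:], y[200:]
pred = np.linalg.norm(test - mu1, axis=1) < np.linalg.norm(test - mu0, axis=1)
print("probe accuracy:", (pred == y_test.astype(bool)).mean())
```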
[791] In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach
Pallock Halder, Satyajit Mojumder
Main category: cs.AI
TL;DR: Multi-agent AI framework using LLMs for real-time defect detection in wire-arc additive manufacturing, combining processing signals and acoustic monitoring.
Details
Motivation: To develop autonomous AI agents for in-situ process monitoring in additive manufacturing, specifically for detecting porosity defects in wire-arc additive manufacturing (WAAM) processes in real time.
Method: Developed two AI agents: a processing agent using welder signals (current/voltage) and a monitoring agent using acoustic data, both trained with ground truth X-ray CT data. Used an LLM for decision-making and created a multi-agent framework for coordinated parallel decision-making.
Result: Multi-agent system achieved 91.6% decision accuracy, 0.821 F1 score on decided runs across 15 independent runs, and a 3.74/5 reasoning quality score, outperforming individual agents.
Conclusion: The multi-agent AI framework shows significant potential for autonomous real-time process monitoring and control in additive manufacturing, with coordinated agents providing superior defect detection performance
Abstract: AI agents are being increasingly deployed across a wide range of real-world applications. In this paper, we propose an agentic AI framework for in-situ process monitoring for defect detection in wire-arc additive manufacturing (WAAM). The autonomous agent leverages a WAAM process monitoring dataset and a trained classification tool to build AI agents and uses a large language model (LLM) for in-situ process monitoring decision-making for defect detection. A processing agent is developed based on welder process signals, such as current and voltage, and a monitoring agent is developed based on acoustic data collected during the process. Both agents are tasked with identifying porosity defects from processing and monitoring signals, respectively. Ground truth X-ray computed tomography (XCT) data are used to develop classification tools for both the processing and monitoring agents. Furthermore, a multi-agent framework is demonstrated in which the processing and monitoring agents are orchestrated together for parallel decision-making on the given task of defect classification. Evaluation metrics are proposed to determine the efficacy of both individual agents, the combined single-agent, and the coordinated multi-agent system. The multi-agent configuration outperforms all individual-agent counterparts, achieving a decision accuracy of 91.6% and an F1 score of 0.821 on decided runs, across 15 independent runs, and a reasoning quality score of 3.74 out of 5. These in-situ process monitoring agents hold significant potential for autonomous real-time process monitoring and control toward building qualified parts for WAAM and other additive manufacturing processes.
[792] GLEaN: A Text-to-image Bias Detection Approach for Public Comprehension
Bochu Ding, Brinnae Bent, Augustus Wendell
Main category: cs.AI
TL;DR: GLEaN is a portrait-based explainability pipeline that makes text-to-image model biases visually understandable to broad audiences through automated generation, filtering, and median-pixel composition.
Details
Motivation: Current bias measurement and auditing methods for T2I models are largely technical and inaccessible to the public, creating a gap in public legibility and understanding of model biases.
Method: Three-stage pipeline: 1) automated large-scale image generation from identity prompts, 2) facial landmark-based filtering and spatial alignment, 3) median-pixel composition that distills the model’s central tendency into a single representative portrait.
Result: Demonstrated on Stable Diffusion XL across 40 identity prompts, reproducing documented biases and revealing new associations between skin tone and predicted emotion. User study (N=291) showed GLEaN communicates biases as effectively as data tables but with significantly less viewing time.
Conclusion: GLEaN offers scalable, model-agnostic bias explainability approach for public comprehension, working on black-box systems without model internals access.
Abstract: Text-to-image (T2I) models, and their encoded biases, increasingly shape the visual media the public encounters. While researchers have produced a rich body of work on bias measurement, auditing, and mitigation in T2I systems, those methods largely target technical stakeholders, leaving a gap in public legibility. We introduce GLEaN (Generative Likeness Evaluation at N-Scale), a portrait-based explainability pipeline designed to make T2I model biases visually understandable to a broad audience. GLEaN comprises three stages: automated large-scale image generation from identity prompts, facial landmark-based filtering and spatial alignment, and median-pixel composition that distills a model’s central tendency into a single representative portrait. The resulting composites require no statistical background to interpret; a viewer can see, at a glance, who a model ‘imagines’ when prompted with ‘a doctor’ versus a ‘felon.’ We demonstrate GLEaN on Stable Diffusion XL across 40 social and occupational identity prompts, producing composites that reproduce documented biases and surface new associations between skin tone and predicted emotion. We find in a between-subjects user study (N = 291) that GLEaN portraits communicate biases as effectively as conventional data tables, but require significantly less viewing time. Because the method relies solely on generated outputs, it can also be replicated on any black-box and closed-weight systems without access to model internals. GLEaN offers a scalable, model-agnostic approach to bias explainability, purpose-built for public comprehension, and is publicly available at https://github.com/cultureiolab/GLEaN.
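Once the portraits are aligned, the composition stage reduces to one numpy call; shapes below are illustrative:

```python
# Minimal sketch of the median-pixel composition stage: after landmark-based
# alignment, the representative portrait is the per-pixel median across all
# generated images. Array shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for N aligned portraits of shape (H, W, 3), values in [0, 1].
aligned = rng.random(size=(200, 128, 128, 3))

composite = np.median(aligned, axis=0)   # the "who the model imagines" portrait
print(composite.shape)                   # (128, 128, 3)
```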
[793] HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
Suhana Bedi, Ryan Welch, Ethan Steinberg, Michael Wornow, Taeil Matthew Kim, Haroun Ahmed, Peter Sterling, Bravim Purohit, Qurat Akram, Angelic Acosta, Esther Nubla, Pritika Sharma, Michael A. Pfeffer, Sanmi Koyejo, Nigam H. Shah
Main category: cs.AI
TL;DR: HealthAdminBench: A benchmark for evaluating LLM-based computer-use agents on healthcare administrative workflows across four GUI environments with 135 expert-defined tasks and 1,698 evaluation points.
Details
Motivation: Healthcare administration represents over $1 trillion in annual spending, making it a promising target for LLM-based automation. While clinical LLM applications have received attention, there's no benchmark for evaluating computer-use agents on end-to-end administrative workflows.
Method: Created HealthAdminBench with four realistic GUI environments (EHR, two payer portals, fax system) and 135 expert-defined tasks spanning three administrative task types. Each task is decomposed into fine-grained, verifiable subtasks yielding 1,698 evaluation points. Evaluated seven agent configurations under multiple prompting and observation settings.
Result: Despite strong subtask performance, end-to-end reliability remains low: best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3% task success, while GPT-5.4 CUA attains highest subtask success rate (82.8%). Reveals substantial gap between current agent capabilities and real-world administrative workflow demands.
Conclusion: HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows, highlighting the need for improved end-to-end reliability in LLM-based computer-use agents.
Abstract: Healthcare administration accounts for over $1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments: an EHR, two payer portals, and a fax system, and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing. Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3 percent task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8 percent). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows. HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows.
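A back-of-envelope note on why end-to-end task success lags subtask success so badly: a task passes only if all of its subtasks pass. Using the numbers reported above, and assuming (unrealistically) independent subtask failures:

```python
# Back-of-envelope: a task succeeds only if every subtask does. With
# 1,698 evaluation points over 135 tasks (~12.6 subtasks/task), an 82.8%
# subtask rate would imply ~9% task success if failures were independent.
p_subtask = 0.828                 # best reported subtask success rate
avg_subtasks = 1698 / 135         # ~12.6 evaluation points per task
print(p_subtask ** avg_subtasks)  # ~0.09 under the independence assumption
```

That the best observed task success (36.3%) sits well above this independence estimate suggests failures cluster within a minority of hard tasks rather than spreading uniformly.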
[794] New Hybrid Fine-Tuning Paradigm for LLMs: Algorithm Design and Convergence Analysis Framework
Shaocong Ma, Peiran Yu, Heng Huang
Main category: cs.AI
TL;DR: A novel hybrid fine-tuning approach for LLMs that combines full model updates with PEFT using zeroth-order and first-order optimization, with theoretical analysis and empirical validation showing improved performance.
Details
Motivation: Current LLM fine-tuning approaches have limitations: full fine-tuning is computationally expensive, while PEFT struggles to learn new knowledge and has suboptimal performance. There's a need for a more efficient yet effective approach.
Method: Proposes a hybrid fine-tuning approach that jointly updates both LLM parameters and PEFT modules using a combination of zeroth-order and first-order optimization methods. Develops theoretical framework with hybrid smoothness condition to analyze heterogeneous optimization landscape, and uses reshuffling-type SGD algorithm with multiple learning rates.
Result: Theoretical convergence analysis shows rigorous convergence properties. Extensive empirical studies across various downstream tasks and model architectures demonstrate consistent performance improvements over existing approaches.
Conclusion: The hybrid approach provides a viable solution for large-scale language model fine-tuning, balancing computational efficiency with learning effectiveness by addressing limitations of both full fine-tuning and PEFT methods.
Abstract: Fine-tuning Large Language Models (LLMs) typically involves either full fine-tuning, which updates all model parameters, or Parameter-Efficient Fine-Tuning (PEFT), which adjusts a small subset of parameters. However, both approaches have inherent limitations: full fine-tuning is computationally expensive, while PEFT often struggles to learn new knowledge and exhibits suboptimal performance. To overcome these issues, we propose a novel hybrid fine-tuning approach that jointly updates both LLMs and PEFT modules using a combination of zeroth-order and first-order optimization methods. To analyze our new algorithm, we develop a theoretical framework centered on the concept of a hybrid smoothness condition, which accounts for the heterogeneous nature of the optimization landscape in joint LLM and PEFT training. We derive a rigorous convergence analysis for the reshuffling-type SGD algorithm under multiple learning rates and demonstrate its effectiveness through extensive empirical studies across various downstream tasks and model architectures. On the practical side, our results demonstrate consistent performance improvement, making the approach a viable solution for large-scale language model fine-tuning.
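A toy sketch of the hybrid update rule: SPSA-style zeroth-order estimates for the "full model" block, exact gradients for the "PEFT" block, each with its own learning rate. The separable quadratic loss and parameter shapes are stand-ins:

```python
# Toy sketch of the hybrid update: SPSA-style zeroth-order steps for the
# "full model" block, exact first-order steps for the "PEFT" block, each
# with its own learning rate. The separable quadratic loss is a stand-in.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=1000)     # "full model" params, zeroth-order updates
phi = rng.normal(size=16)         # "PEFT" params, first-order updates

def loss(theta, phi):
    return np.mean(theta ** 2) + np.sum((phi - 1.0) ** 2)

eps, lr_zo, lr_fo = 1e-3, 1e-2, 1e-1
for _ in range(200):
    u = rng.normal(size=theta.size)                 # random probe direction
    g = (loss(theta + eps * u, phi) - loss(theta - eps * u, phi)) / (2 * eps)
    theta -= lr_zo * g * u                          # noisy directional step
    phi -= lr_fo * 2.0 * (phi - 1.0)                # exact gradient step

print("PEFT error:", float(np.sum((phi - 1.0) ** 2)))   # ~0: converged fast
print("ZO term:", float(np.mean(theta ** 2)))           # shrinks only slowly
```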
[795] AI Achieves a Perfect LSAT Score
Bonmu Ku
Main category: cs.AI
TL;DR: A language model achieves a perfect LSAT score; thinking phases are crucial for logical reasoning performance, and reward models can narrow performance gaps through Best-of-5 selection.
Details
Motivation: To investigate whether language models can achieve perfect performance on the LSAT, a standardized test for law school admissions, and understand what factors drive their reasoning capabilities.
Method: Controlled experiments on eight reasoning models, testing effects of prompt variations, answer choice shuffling, and multiple response sampling. Ablation studies on thinking phases, and development of process reward models fine-tuned via QLoRA on official LSAT explanations.
Result: First documented instance of a language model achieving a perfect LSAT score. Thinking phases are crucial for performance (up to an 8-percentage-point accuracy drop when ablated), especially in logical reasoning. Distilled models plateau below frontier performance. Reward models with Best-of-5 selection narrow performance gaps.
Conclusion: LSAT’s cognitive upper bound is no longer exclusive to human cognition. Language models can reason at elite levels, with thinking processes being critical for logical reasoning performance.
Abstract: This paper reports the first documented instance of a language model achieving a perfect score on an officially disclosed Law School Admission Test (LSAT). Controlled experiments on eight reasoning models show that varying the prompt, shuffling answer choices, and sampling multiple responses have no meaningful effect as drivers of performance. Ablating the thinking phase that models generate before answering, however, lowers frontier accuracy by up to 8 percentage points, predominantly in logical reasoning. Distilled models produce full thinking traces in the same format yet plateau far below frontier performance. A pilot process reward model fine-tuned via QLoRA on official LSAT explanations narrows this gap through Best-of-5 selection, with gains again predominantly in logical reasoning. The gatekeeper of elite legal education since 1948, the LSAT has not merely been passed but answered without a single error by models that reason. The upper bound of the cognitive capacities it has tested is no longer exclusive to human cognition.
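The Best-of-5 mechanism is a standard rejection-by-reward pattern; a sketch with stub model calls (`sample_response` and `prm_score` are hypothetical names):

```python
# Sketch of Best-of-5 selection with a process reward model: sample five
# full responses and keep the one the reward model scores highest.
# `sample_response` and `prm_score` are stubs for the model calls.
import random

def sample_response(question: str, seed: int) -> str:
    return f"response-{seed} to {question!r}"    # stub for a sampled rollout

def prm_score(question: str, response: str) -> float:
    return random.Random(response).random()      # stub reward-model score

def best_of_n(question: str, n: int = 5) -> str:
    candidates = [sample_response(question, s) for s in range(n)]
    return max(candidates, key=lambda r: prm_score(question, r))

print(best_of_n("Which answer choice must be true?"))
```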
[796] SCMAPR: Self-Correcting Multi-Agent Prompt Refinement for Complex-Scenario Text-to-Video Generation
Chengyi Yang, Pengzhen Li, Jiayin Qi, Aimin Zhou, Ji Wu, Ji Liu
Main category: cs.AI
TL;DR: SCMAPR is a multi-agent prompt refinement framework for text-to-video generation that improves performance on complex scenarios through scenario-aware rewriting and self-correcting verification.
Details
Motivation: Current text-to-video generation systems struggle with complex scenarios due to ambiguous and underspecified text prompts, requiring better prompt refinement methods.
Method: A multi-agent framework with three stages: (1) scenario routing to taxonomy-grounded categories, (2) policy-conditioned refinement with scenario-aware rewriting, and (3) structured semantic verification with conditional revision.
Result: SCMAPR consistently improves text-video alignment and generation quality on complex scenarios, achieving up to 2.67% and 3.28 gains on VBench/EvalCrafter and 0.028 improvement on T2V-CompBench over SOTA baselines.
Conclusion: The proposed multi-agent prompt refinement framework effectively addresses complex-scenario challenges in T2V generation through systematic scenario-aware processing and self-correction mechanisms.
Abstract: Text-to-Video (T2V) generation has benefited from recent advances in diffusion models, yet current systems still struggle under complex scenarios, which are generally exacerbated by the ambiguity and underspecification of text prompts. In this work, we formulate complex-scenario prompt refinement as a stage-wise multi-agent refinement process and propose SCMAPR, i.e., a scenario-aware and Self-Correcting Multi-Agent Prompt Refinement framework for T2V prompting. SCMAPR coordinates specialized agents to (i) route each prompt to a taxonomy-grounded scenario for strategy selection, (ii) synthesize scenario-aware rewriting policies and perform policy-conditioned refinement, and (iii) conduct structured semantic verification that triggers conditional revision when violations are detected. To clarify what constitutes complex scenarios in T2V prompting, provide representative examples, and enable rigorous evaluation under such challenging conditions, we further introduce T2V-Complexity, which is a complex-scenario T2V benchmark consisting exclusively of complex-scenario prompts. Extensive experiments on 3 existing benchmarks and our T2V-Complexity benchmark demonstrate that SCMAPR consistently improves text-video alignment and overall generation quality under complex scenarios, achieving up to 2.67% and 3.28 gains in average score on VBench and EvalCrafter, and up to 0.028 improvement on T2V-CompBench over 3 State-Of-The-Art baselines. Code is available at https://github.com/HiThink-Research/SCMAPR.
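A sketch of the three-stage route, refine, verify loop with a stub `llm` call; the scenario names and policies below are invented for illustration, not SCMAPR's taxonomy:

```python
# Sketch of a route -> refine -> verify prompt-refinement loop. `llm` is a
# stub model call; scenarios and policies are invented, not SCMAPR's own.

def llm(instruction: str) -> str:
    return f"[LLM output for: {instruction[:48]}...]"   # stub model call

SCENARIOS = {
    "multi-object": "enumerate each object and bind its attributes explicitly",
    "camera-motion": "specify shot type, camera path, and timing",
}

def refine(prompt: str, max_revisions: int = 2) -> str:
    # Stage 1: route the prompt to a taxonomy-grounded scenario.
    routed = llm(f"Pick one of {sorted(SCENARIOS)} for: {prompt}")
    scenario = next((s for s in SCENARIOS if s in routed), "multi-object")
    # Stage 2: policy-conditioned rewriting under the scenario's policy.
    refined = llm(f"Rewrite '{prompt}' so that you {SCENARIOS[scenario]}")
    # Stage 3: structured verification with conditional revision.
    for _ in range(max_revisions):
        verdict = llm(f"List semantic violations of '{prompt}' in '{refined}'")
        if "violation" not in verdict.lower():
            break
        refined = llm(f"Revise '{refined}' to fix: {verdict}")
    return refined

print(refine("a red cube and a blue ball collide on a windy beach"))
```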
[797] LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
Dongjie Xu, Hao Wu, Weijie Shi, Yue Cui, Yuanjun Liu, Jiawei Li, Haolun Ma, An Liu, Jia Zhu, Jiajie Xu
Main category: cs.AI
TL;DR: The paper identifies a failure mode in long-context generation where decoding collapses into persistent repetition loops due to collapsed attention patterns and KV cache reuse, introduces LoopBench benchmark to study this, and proposes LoopGuard as a lightweight solution to detect and disrupt these loops.
Details
Motivation: Long-context generation can collapse into persistent repetition loops, a damaging failure mode that undermines output quality and diversity in language models.
Method: The authors introduce LoopBench, a benchmark with explicit loop-inducing conditions and loop-oriented metrics to study repetition loops. They then propose LoopGuard, a lightweight plug-in KV cache guard that detects loop onset online and disrupts the feedback cycle by pruning repetitive tail spans under a fixed cache budget.
Result: Experiments on LoopBench show that LoopGuard reduces loop incidence by over 90 percentage points while restoring output diversity and reducing token waste.
Conclusion: The paper successfully identifies and addresses a critical failure mode in long-context generation through systematic analysis and proposes an effective solution that significantly reduces repetition loops while maintaining generation quality.
Abstract: Through systematic experiments on long-context generation, we observe a damaging failure mode in which decoding can collapse into persistent repetition loops. We find that this degeneration is driven by collapsed attention patterns, where a subset of heads locks onto a narrow suffix of the history, and is further stabilized by inference-time KV cache reuse. Crucially, since many existing KV cache policies rely on attention-based importance, this collapse can produce spuriously high scores for repetitive tokens, causing cache management to inadvertently amplify repetition. To study this phenomenon in a controlled and reproducible manner, we introduce LoopBench, a benchmark with explicit loop-inducing conditions and loop-oriented metrics that quantify repetition severity and generation instability beyond downstream task scores. Building on these insights, we propose LoopGuard, a lightweight, plug-in KV cache guard that detects loop onset online and disrupts the feedback cycle by pruning repetitive tail spans under a fixed cache budget. Experiments on LoopBench show that LoopGuard reduces loop incidence by over 90 percentage points, while restoring output diversity and reducing token waste.
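A toy sketch of the detection side of the idea: spot a repeating tail span and prune the extra copies. Real pruning operates on KV cache tensors under a budget; a token list stands in here:

```python
# Toy sketch of loop detection and tail pruning: find a repeating span at
# the end of the sequence, then drop the duplicate copies so cached state
# stops reinforcing the loop. A token list stands in for the KV cache.

def find_loop(tokens, max_period=8, min_repeats=3):
    """Return the loop period if the tail repeats, else None."""
    for p in range(1, max_period + 1):
        span = tokens[-p:]
        if len(tokens) >= p * min_repeats and all(
            tokens[-(k + 1) * p : -k * p or None] == span
            for k in range(min_repeats)
        ):
            return p
    return None

def prune_tail(tokens):
    p = find_loop(tokens)
    if p is None:
        return tokens
    # Keep a single copy of the repeating span, drop the rest of the tail.
    while len(tokens) >= 2 * p and tokens[-p:] == tokens[-2 * p : -p]:
        tokens = tokens[:-p]
    return tokens

seq = [5, 9, 2, 7, 3, 7, 3, 7, 3, 7, 3]
print(find_loop(seq), prune_tail(seq))   # 2 [5, 9, 2, 7, 3]
```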
[798] Learning Hierarchical and Geometry-Aware Graph Representations for Text-to-CAD
Shengjie Gong, Wenjie Peng, Hongyuan Chen, Gangyu Zhang, Yunqing Hu, Huiyuan Zhang, Shuangping Huang, Tianshui Chen
Main category: cs.AI
TL;DR: Proposes hierarchical geometry-aware graph as intermediate representation for text-to-CAD code generation, improving geometric fidelity and constraint satisfaction through structure prediction before code generation.
Details
Motivation: Existing text-to-CAD methods directly decode text into executable code without modeling assembly hierarchy or geometric constraints, leading to enlarged search space, accumulated local errors, and cascading failures in complex assemblies.
Method: Uses hierarchical geometry-aware graph as intermediate representation modeling multi-level parts/components as nodes and geometric constraints as edges. Framework first predicts structure and constraints, then conditions action sequencing and code generation. Introduces structure-aware progressive curriculum learning with graded tasks through controlled structural edits.
Result: Method consistently outperforms existing approaches in both geometric fidelity and accurate satisfaction of geometric constraints. Built 12K dataset with instructions, decomposition graphs, action sequences, and bpy code.
Conclusion: Hierarchical geometry-aware graph representation and progressive curriculum learning effectively address limitations of direct text-to-code approaches for CAD generation, improving geometric fidelity and constraint satisfaction.
Abstract: Text-to-CAD code generation is a long-horizon task that translates textual instructions into long sequences of interdependent operations. Existing methods typically decode text directly into executable code (e.g., bpy) without explicitly modeling assembly hierarchy or geometric constraints, which enlarges the search space, accumulates local errors, and often causes cascading failures in complex assemblies. To address this issue, we propose a hierarchical and geometry-aware graph as an intermediate representation. The graph models multi-level parts and components as nodes and encodes explicit geometric constraints as edges. Instead of mapping text directly to code, our framework first predicts structure and constraints, then conditions action sequencing and code generation, thereby improving geometric fidelity and constraint satisfaction. We further introduce a structure-aware progressive curriculum learning strategy that constructs graded tasks through controlled structural edits, explores the model’s capability boundary, and synthesizes boundary examples for iterative training. In addition, we build a 12K dataset with instructions, decomposition graphs, action sequences, and bpy code, together with graph- and constraint-oriented evaluation metrics. Extensive experiments show that our method consistently outperforms existing approaches in both geometric fidelity and accurate satisfaction of geometric constraints.
[799] Strategic Algorithmic Monoculture: Experimental Evidence from Coordination Games
Gonzalo Ballestero, Hadi Hosseini, Samarth Khanna, Ran I. Shorrer
Main category: cs.AI
TL;DR: LLMs show high baseline action similarity (primary monoculture) and adjust similarity strategically in response to coordination incentives, similar to humans, but struggle to maintain heterogeneity when divergence is beneficial.
Details
Motivation: To understand algorithmic monoculture in multi-agent environments, distinguishing between baseline action similarity (primary monoculture) and strategic adjustment of similarity in response to incentives (strategic monoculture).
Method: Simple experimental design that cleanly separates primary and strategic monoculture forces, deployed on both human and LLM subjects to compare their coordination behaviors.
Result: LLMs exhibit high levels of baseline similarity (primary monoculture) and, like humans, regulate similarity in response to coordination incentives (strategic monoculture). LLMs coordinate well on similar actions but lag behind humans in sustaining heterogeneity when divergence is rewarded.
Conclusion: LLMs demonstrate both primary and strategic algorithmic monoculture, showing strong coordination capabilities but limitations in maintaining beneficial heterogeneity compared to humans.
Abstract: AI agents increasingly operate in multi-agent environments where outcomes depend on coordination. We distinguish primary algorithmic monoculture – baseline action similarity – from strategic algorithmic monoculture, whereby agents adjust similarity in response to incentives. We implement a simple experimental design that cleanly separates these forces, and deploy it on human and large language model (LLM) subjects. LLMs exhibit high levels of baseline similarity (primary monoculture) and, like humans, they regulate it in response to coordination incentives (strategic monoculture). While LLMs coordinate extremely well on similar actions, they lag behind humans in sustaining heterogeneity when divergence is rewarded.
[800] Ontological Trajectory Forecasting via Finite Semigroup Iteration and Lie Algebra Approximation in Geopolitical Knowledge Graphs
Qihang Wu
Main category: cs.AI
TL;DR: EL-DRUIN is an ontological reasoning system for geopolitical forecasting that uses formal ontology, finite semigroup algebra, and Lie algebra approximation instead of LLM-based text summarization.
Details
Motivation: Current LLM-based political analysis systems are limited to text summarization and pattern matching, lacking formal reasoning capabilities for long-term geopolitical forecasting. The authors aim to create a system that can model geopolitical relationships as dynamic patterns with mathematical rigor.
Method: Models geopolitical relationships as states in finite Dynamic Patterns, composes patterns via semigroup operations with defined structure constants, and embeds patterns in an 8-dimensional semantic Lie algebra space. Uses forward simulation with Bayesian posterior weights combining ontology-derived priors with Lie similarity metrics.
Result: Demonstrated on six geopolitical scenarios including US-China technology decoupling and Taiwan Strait military coercion. The system detects bifurcation points and provides interpretable probabilities with full computation traces available through an open-source Streamlit frontend.
Conclusion: EL-DRUIN provides a mathematically rigorous alternative to LLM-based geopolitical analysis, offering formal reasoning, calibrated probabilities, and interpretable state vectors for long-term relationship trajectory forecasting.
Abstract: We present EL-DRUIN, an ontological reasoning system for geopolitical intelligence analysis that combines formal ontology, finite semigroup algebra, and Lie algebra approximation to forecast long-run relationship trajectories. Current LLM-based political analysis systems operate as summarisation engines, producing outputs bounded by textual pattern matching. EL-DRUIN departs from this paradigm by modelling geopolitical relationships as states in a finite set of named Dynamic Patterns, composing patterns via a semigroup operation whose structure constants are defined by an explicit composition table, and embedding each pattern as a vector in an 8-dimensional semantic Lie algebra space. Forward simulation iterates this semigroup operation, yielding reachable pattern sets at each discrete timestep; convergence to idempotent absorbing states (fixed points of the composition) constitutes the predicted long-run attractor. Bayesian posterior weights combine ontology-derived confidence priors with a Lie similarity term measuring the cosine similarity between the vector sum of composing patterns and the target pattern vector, providing interpretable, calibrated probabilities that are not self-reported by a language model. Bifurcation points – steps at which two candidate attractors have near-equal posterior mass – are detected and exposed to downstream analysis. We demonstrate the framework on six geopolitical scenarios including US-China technology decoupling and the Taiwan Strait military coercion trajectory. The architecture is publicly available as an open-source system with a Streamlit frontend exposing full computation traces, Bayesian posterior breakdowns, and 8D ontological state vectors.
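The forecasting core (iterate a finite composition table until the reachable pattern set stops growing) fits in a few lines. The three-pattern table below is invented; the paper's ontology defines the actual patterns and structure constants:

```python
# Toy forecasting core: iterate the composition table until the reachable
# pattern set reaches a fixed point; idempotent absorbing patterns dominate
# the long run. This three-pattern table is invented for illustration.
COMPOSE = {
    ("rivalry", "rivalry"): "escalation",
    ("rivalry", "escalation"): "escalation",
    ("escalation", "rivalry"): "escalation",
    ("escalation", "escalation"): "escalation",   # idempotent absorbing state
}

def step(reachable):
    new = {COMPOSE[(a, b)] for a in reachable for b in reachable
           if (a, b) in COMPOSE}
    return reachable | new

reachable = {"rivalry"}
for t in range(10):
    nxt = step(reachable)
    if nxt == reachable:          # fixed point: the long-run attractor set
        break
    reachable = nxt
print(t, sorted(reachable))      # converges with 'escalation' absorbed
```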
[801] Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards
Kai-Yuan Guo, Jiang Wang, Renjie Zhao, Tianyi Wang, Wandong Mao, Yu Gao, Mou Xiao Feng, Yi Xu
Main category: cs.AI
TL;DR: MemHomeLife dataset and MemHome benchmark for evaluating memory-driven device control in smart homes using LLMs.
Details
Motivation: Existing smart home assistants lack effective memory-driven device control capabilities, with evaluation benchmarks focusing only on immediate control or general memory tasks, and methodological approaches using RL lacking intermediate feedback for fine-grained memory management.
Method: Created MemHomeLife dataset from real-world long-term user interaction logs, and developed MemHome benchmark to systematically evaluate memory-driven device control across different memory-related subtasks (adding, updating, deleting, utilizing).
Result: First benchmark specifically designed for memory-driven device control in smart homes, enabling more fine-grained evaluation of memory management capabilities in LLM-based smart home systems.
Conclusion: Addresses critical gap in evaluating memory-driven device control for smart home LLMs, providing both dataset and benchmark to advance research in this area.
Abstract: Large Language Models (LLMs) have become a key foundation for enabling personalized smart home experiences. While existing studies have explored how smart home assistants understand user queries to control devices in real time, their ability to perform memory-driven device control remains challenging from both evaluation and methodological perspectives. In terms of evaluation, existing benchmarks either focus on immediate device control or general open-domain memory retrieval tasks, and therefore cannot effectively evaluate a model’s ability to perform memory-driven device control. Methodologically, while memory-driven device control can be approached using Reinforcement Learning, conventional RL methods generally rely on outcome-based supervision (i.e., whether the final task is achieved). This lack of intermediate feedback can lead to sub-optimal performance or local failures in fine-grained memory management tasks (adding, updating, deleting, and utilizing). To address these issues, we first release MemHomeLife, built from anonymized real-world long-term user interaction logs. To enable more fine-grained evaluation of different memory-related subtasks, we further construct MemHome, the first benchmark designed to systematically evaluate memory-driven device control in smart home scenarios.
[802] Learning from Emptiness: De-biasing Listwise Rerankers with Content-Agnostic Probability Calibration
Hang Lv, Hongchao Gu, Ruiqing Yang, Liangyue Li, Zulong Chen, Defu Lian, Hao Wang, Enhong Chen
Main category: cs.AI
TL;DR: CapCal is a training-free framework that addresses position bias in generative listwise reranking by using content-free placeholders to estimate and correct positional bias, enabling lightweight models to achieve state-of-the-art performance without inference-time overhead.
Details
Motivation: Generative listwise reranking suffers from intrinsic position bias where models are sensitive to input order regardless of relevance. Existing solutions face a trade-off: inference-time aggregation is too slow, while training methods fail to fully eliminate bias, especially in compact models.
Method: CapCal (Content-Agnostic Probability Calibration) uses content-free placeholders to estimate the bias distribution, then applies an entropy-adaptive contrastive mechanism to correct output logits, mechanically decoupling positional bias from ranking decisions without requiring training.
Result: Evaluations across 10 benchmarks show CapCal achieves superior performance among training-free methods while maintaining single-pass efficiency. It enables lightweight models (0.6B) to achieve absolute NDCG gains exceeding 10 points, outperforming both permutation-based aggregation and data-augmentation baselines.
Conclusion: CapCal resolves the dilemma between inference-time efficiency and bias mitigation in generative reranking, unlocking the potential of lightweight models through a training-free approach that effectively addresses position bias.
Abstract: Generative listwise reranking leverages global context for superior retrieval but is plagued by intrinsic position bias, where models exhibit structural sensitivity to input order independent of relevance. Existing mitigations present a dilemma: inference-time aggregation incurs prohibitive latency, while training-based methods often fail to eradicate ingrained priors, particularly in compact models. To resolve this dilemma, we propose CapCal (Content-Agnostic Probability Calibration), a training-free framework that mechanically decouples positional bias from ranking decisions. By estimating the bias distribution via content-free placeholders, CapCal rectifies output logits through an entropy-adaptive contrastive mechanism. Evaluations across 10 benchmarks confirm that CapCal achieves superior performance among training-free methods while preserving single-pass efficiency. Notably, it unlocks the latent potential of lightweight models (e.g., 0.6B), delivering absolute NDCG gains exceeding 10 points and outperforming both permutation-based aggregation and data-augmentation baselines.
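A sketch of the calibration idea: estimate position bias from a content-free pass, then subtract it with entropy-dependent strength. The exact contrastive form used in CapCal may differ:

```python
# Sketch of content-agnostic calibration: estimate positional bias from
# ranking scores over content-free placeholder documents, then subtract it
# with a strength scaled by how peaked the bias distribution is. The exact
# contrastive form in the paper may differ.
import numpy as np

def calibrate(logits, placeholder_logits):
    """logits: (n_positions,) scores for the real documents at each slot;
    placeholder_logits: same slots with every document replaced by a
    placeholder, so any structure left is pure position bias."""
    bias = placeholder_logits - placeholder_logits.mean()
    p = np.exp(placeholder_logits - placeholder_logits.max())
    p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    strength = 1.0 - entropy / np.log(len(p))   # peaked bias -> stronger fix
    return logits - strength * bias

logits = np.array([2.0, 1.2, 0.9, 0.8])        # raw listwise scores
bias = np.array([1.5, 0.3, 0.1, 0.1])          # content-free estimate
print(calibrate(logits, bias))                  # first-slot bias damped
```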
[803] SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding
Jehyeon Bang, Eunyeong Cho, Ranggi Hwang, Jinha Chung, Minsoo Rhu
Main category: cs.AI
TL;DR: SpecMoE: A memory-efficient MoE inference system using self-assisted speculative decoding to improve throughput and reduce bandwidth requirements without additional training.
Details
Motivation: Mixture-of-Experts (MoE) architectures help reduce computational costs in large language models but face high memory requirements and sub-optimal parameter efficiency. Existing CPU-offloaded MoE inference systems offer limited efficiency, especially for large batch sizes.
Method: Proposes SpecMoE, a memory-efficient MoE inference system based on a self-assisted speculative decoding algorithm. The approach applies speculative decoding to MoE inference without requiring additional model training or fine-tuning.
Result: Improves inference throughput by up to 4.30× while significantly reducing bandwidth requirements of both memory and interconnect on memory-constrained systems.
Conclusion: SpecMoE demonstrates that speculative decoding can be effectively applied to MoE inference to address memory and efficiency challenges, enabling more practical deployment of MoE-based large language models.
Abstract: The Mixture-of-Experts (MoE) architecture has emerged as a promising approach to mitigate the rising computational costs of large language models (LLMs) by selectively activating parameters. However, its high memory requirements and sub-optimal parameter efficiency pose significant challenges for efficient deployment. Although CPU-offloaded MoE inference systems have been proposed in the literature, they offer limited efficiency, particularly for large batch sizes. In this work, we propose SpecMoE, a memory-efficient MoE inference system based on our self-assisted speculative decoding algorithm. SpecMoE demonstrates the effectiveness of applying speculative decoding to MoE inference without requiring additional model training or fine-tuning. Our system improves inference throughput by up to $4.30\times$, while significantly reducing bandwidth requirements of both memory and interconnect on memory-constrained systems.
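The abstract does not spell out how the self-assisted draft is constructed, so the sketch below shows only the generic draft-then-verify loop any speculative MoE decoder would follow; draft_forward, greedy acceptance, and batch size 1 are our simplifying assumptions:

```python
import torch

@torch.no_grad()
def speculative_moe_decode(model, draft_forward, tokens, n_draft=4, max_len=64):
    """Hedged sketch of speculative decoding applied to MoE inference.

    `draft_forward` stands in for SpecMoE's self-assisted draft pass
    (derived from the model itself, with no extra training); its exact
    construction is not public here. Assumes batch size 1, greedy drafts.
    """
    while tokens.shape[1] < max_len:
        base_len = tokens.shape[1]
        # 1) Cheap draft pass proposes n_draft tokens autoregressively.
        draft = tokens
        for _ in range(n_draft):
            nxt = draft_forward(draft)[:, -1].argmax(-1, keepdim=True)
            draft = torch.cat([draft, nxt], dim=-1)
        # 2) One full MoE pass scores all drafted positions at once,
        #    amortizing expert-weight traffic across several tokens.
        logits = model(draft)[:, base_len - 1:]          # n_draft + 1 positions
        proposed = draft[:, base_len:]
        ok = (logits[:, :n_draft].argmax(-1) == proposed).long()
        accept = int(ok.cumprod(-1).sum())               # accepted prefix length
        # 3) Keep accepted tokens plus one corrected (or bonus) token.
        fix = logits[:, accept].argmax(-1, keepdim=True)
        tokens = torch.cat([draft[:, : base_len + accept], fix], dim=-1)
    return tokens
```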
[804] Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities
Ze Zhao, Yuhui He, Lyuwen Wu, Gu Tang, Bin Lu, Xiaoying Gan, Luoyi Fu, Xinbing Wang, Chenghu Zhou
Main category: cs.AI
TL;DR: TransFIR is a novel framework for inductive reasoning on temporal knowledge graphs that addresses the challenge of emerging entities without historical interactions by transferring temporal patterns from semantically similar known entities.
Details
Motivation: Existing temporal knowledge graph reasoning methods suffer from closed-world assumptions and fail to handle emerging entities that continuously join the network without historical interactions, which comprise about 25% of all entities and cause significant performance degradation.
Method: Proposes TransFIR framework with a codebook-based classifier that categorizes emerging entities into latent semantic clusters, allowing them to adopt reasoning patterns from semantically similar known entities by leveraging their historical interaction sequences.
Result: TransFIR outperforms all baselines in reasoning on emerging entities, achieving an average improvement of 28.6% in Mean Reciprocal Rank (MRR) across multiple datasets.
Conclusion: The framework successfully addresses the emerging entity problem in temporal knowledge graph reasoning by transferring temporal patterns from semantically similar entities, demonstrating significant performance improvements.
Abstract: Reasoning on Temporal Knowledge Graphs (TKGs) is essential for predicting future events and time-aware facts. While existing methods are effective at capturing relational dynamics, their performance is limited by a closed-world assumption, which fails to account for emerging entities not present in the training data. Notably, these entities continuously join the network without historical interactions. Empirical study reveals that emerging entities are widespread in TKGs, comprising roughly 25% of all entities. The absence of historical interactions of these entities leads to significant performance degradation in reasoning tasks. However, we observe that entities with semantic similarities often exhibit comparable interaction histories, suggesting the presence of transferable temporal patterns. Inspired by this insight, we propose TransFIR (Transferable Inductive Reasoning), a novel framework that leverages historical interaction sequences from semantically similar known entities to support inductive reasoning. Specifically, we propose a codebook-based classifier that categorizes emerging entities into latent semantic clusters, allowing them to adopt reasoning patterns from similar entities. Experimental results demonstrate that TransFIR outperforms all baselines in reasoning on emerging entities, achieving an average improvement of 28.6% in Mean Reciprocal Rank (MRR) across multiple datasets. The implementations are available at https://github.com/zhaodazhuang2333/TransFIR.
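A minimal sketch of the codebook step as we read it: embed the emerging entity, snap it to the nearest latent cluster, and borrow histories from that cluster's known members. All names and the cosine-similarity scoring are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def assign_and_borrow(entity_emb, codebook, known_embs, known_histories,
                      cluster_ids, k=5):
    """Hedged sketch of a codebook-based transfer step: an emerging entity
    with no history is mapped to its nearest latent semantic cluster, then
    borrows interaction sequences from known entities in that cluster.
    The paper's exact classifier and scoring are not reproduced here.
    """
    # Nearest codebook entry = latent semantic cluster of the new entity.
    sims = F.cosine_similarity(entity_emb.unsqueeze(0), codebook)          # [C]
    cluster = sims.argmax().item()
    # Known entities in that cluster, ranked by similarity to the newcomer.
    members = (cluster_ids == cluster).nonzero(as_tuple=True)[0]
    member_sims = F.cosine_similarity(entity_emb.unsqueeze(0), known_embs[members])
    top = members[member_sims.topk(min(k, len(members))).indices]
    # Their histories act as transferable temporal patterns for reasoning.
    return cluster, [known_histories[i] for i in top.tolist()]
```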
[805] MAVEN-T: Multi-Agent enVironment-aware Enhanced Neural Trajectory predictor with Reinforcement Learning
Wenchang Duan
Main category: cs.AI
TL;DR: MAVEN-T is a teacher-student framework for trajectory prediction that combines architectural co-design, progressive distillation, and reinforcement learning to achieve efficient deployment while maintaining state-of-the-art accuracy.
Details
Motivation: Trajectory prediction in autonomous driving requires sophisticated reasoning but faces real-time deployment constraints. Existing knowledge distillation methods fail to preserve complex decision-making in dynamic multi-agent scenarios.
Method: Uses complementary teacher-student co-design: teacher with hybrid attention for maximum capacity, student with efficient architectures. Employs multi-granular distillation with adaptive curriculum learning and reinforcement learning to overcome imitation ceiling through environmental interaction.
Result: Achieves 6.2x parameter compression and 3.7x inference speedup while maintaining state-of-the-art accuracy on NGSIM and highD datasets.
Conclusion: Establishes a new paradigm for deploying sophisticated reasoning models under resource constraints through architectural co-design and progressive distillation enhanced by reinforcement learning.
Abstract: Trajectory prediction remains a critical yet challenging component in autonomous driving systems, requiring sophisticated reasoning capabilities while meeting strict real-time deployment constraints. While knowledge distillation has demonstrated effectiveness in model compression, existing approaches often fail to preserve complex decision-making capabilities, particularly in dynamic multi-agent scenarios. This paper introduces MAVEN-T, a teacher-student framework that achieves state-of-the-art trajectory prediction through complementary architectural co-design and progressive distillation. The teacher employs hybrid attention mechanisms for maximum representational capacity, while the student uses efficient architectures optimized for deployment. Knowledge transfer is performed via multi-granular distillation with adaptive curriculum learning that dynamically adjusts complexity based on performance. Importantly, the framework incorporates reinforcement learning to overcome the imitation ceiling of traditional distillation, enabling the student to verify, refine, and optimize teacher knowledge through dynamic environmental interaction, potentially achieving more robust decision-making than the teacher itself. Extensive experiments on NGSIM and highD datasets demonstrate 6.2x parameter compression and 3.7x inference speedup while maintaining state-of-the-art accuracy, establishing a new paradigm for deploying sophisticated reasoning models under resource constraints.
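The multi-granular distillation objective can be sketched, with the caveat that the loss terms, their weights, and the adaptive curriculum (which the paper adjusts dynamically) are stand-ins rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                 traj_pred, traj_gt, T=2.0):
    """Hedged sketch of a multi-granular distillation objective:
    output-level KL over maneuver-mode logits, feature-level matching
    on intermediate representations, and the task loss on predicted
    trajectories. All weights here are fixed placeholders.
    """
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T            # soft-label transfer
    feat = F.mse_loss(student_feat, teacher_feat.detach())  # intermediate alignment
    task = F.mse_loss(traj_pred, traj_gt)                   # displacement regression
    return task + kd + feat
```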
[806] PoreDiT: A Scalable Generative Model for Large-Scale Digital Rock Reconstruction
Yizhuo Huang, Baoquan Sun, Haibo Huang
Main category: cs.AI
TL;DR: PoreDiT is a 3D Swin Transformer-based generative model for efficient gigavoxel-scale digital rock reconstruction, predicting binary pore space probabilities to preserve topological features for fluid flow simulations.
Details
Motivation: Addresses challenges in digital rock physics: resolution vs field-of-view trade-off, computational bottlenecks in traditional deep learning architectures for large-scale rock reconstruction.
Method: Uses 3D Swin Transformer architecture to directly predict binary probability field of pore spaces (instead of grayscale intensities), enabling efficient generation of ultra-large-scale digital rock samples.
Result: Achieves generation of 1024³ voxel digital rock samples on consumer-grade hardware with physical fidelity comparable to state-of-the-art methods (accurate porosity, pore-scale permeability, Euler characteristics).
Conclusion: PoreDiT’s efficient scaling enables large-domain hydrodynamic simulations and provides practical solutions for pore-scale fluid mechanics, reservoir characterization, and carbon sequestration research.
Abstract: This manuscript presents PoreDiT, a novel generative model designed for high-efficiency digital rock reconstruction at gigavoxel scales. Addressing the significant challenges in digital rock physics (DRP), particularly the trade-off between resolution and field-of-view (FOV), and the computational bottlenecks associated with traditional deep learning architectures, PoreDiT leverages a three-dimensional (3D) Swin Transformer to break through these limitations. By directly predicting the binary probability field of pore spaces instead of grayscale intensities, the model preserves key topological features critical for pore-scale fluid flow and transport simulations. This approach enhances computational efficiency, enabling the generation of ultra-large-scale ($1024^3$ voxels) digital rock samples on consumer-grade hardware. Furthermore, PoreDiT achieves physical fidelity comparable to previous state-of-the-art methods, including accurate porosity, pore-scale permeability, and Euler characteristics. The model’s ability to scale efficiently opens new avenues for large-domain hydrodynamic simulations and provides practical solutions for researchers in pore-scale fluid mechanics, reservoir characterization, and carbon sequestration.
[807] Credit-Budgeted ICPC-Style Coding: When Agents Must Pay for Every Decision
Lingfeng Zhou, Junhao Shi, Jin Gao, Dequan Wang
Main category: cs.AI
TL;DR: USACOArena introduces a resource-constrained coding competition arena where AI agents must solve problems within strict compute, time, and token budgets, shifting focus from pure accuracy to cost-aware problem-solving.
Details
Motivation: Current evaluations of autonomous coding agents assume unrealistic infinite resources, but real-world software engineering involves resource-bound competition. As agent swarms scale, ignoring compute and time costs risks catastrophic budget exhaustion.
Method: Introduces USACOArena, an interactive ACM-ICPC-style arena with a strict “credit” economy where every generated token, local test, and elapsed second depletes a fixed budget, forcing agents to make strategic trade-offs.
Result: Comprehensive profiling reveals that frontier single agents and swarms currently fail to optimally balance accuracy with resource constraints, exhibiting divergent, path-dependent behaviors.
Conclusion: USACOArena provides an essential dynamic training ground for developing highly efficient, resource-aware agent architectures for real-world software engineering.
Abstract: Current evaluations of autonomous coding agents assume an unrealistic, infinite-resource environment. However, real-world software engineering is a resource-bound competition. As we scale toward large agent swarms, ignoring compute and time costs risks catastrophic budget exhaustion. To shift the focus from isolated accuracy to cost-aware problem-solving, we introduce USACOArena, an interactive ACM-ICPC-style arena driven by a strict “credit” economy. Every generated token, local test, and elapsed second depletes a fixed budget, forcing agents to make strategic trade-offs. Our comprehensive profiling reveals that frontier single agents and swarms currently fail to optimally balance accuracy with these constraints, exhibiting divergent, path-dependent behaviors. Ultimately, USACOArena provides an essential dynamic training ground for developing highly efficient, resource-aware agent architectures.
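The credit economy is easy to picture as a small accounting object; the specific prices below are placeholders, since the arena fixes its own exchange rates:

```python
import time

class CreditBudget:
    """Hedged sketch of a USACOArena-style 'credit' economy: generated
    tokens, local test runs, and wall-clock seconds all draw down one
    fixed budget. Prices here are our placeholders, not the paper's.
    """
    def __init__(self, credits, token_cost=0.001, test_cost=1.0, second_cost=0.1):
        self.credits = credits
        self.token_cost, self.test_cost, self.second_cost = (
            token_cost, test_cost, second_cost)
        self._t0 = time.monotonic()

    def charge_tokens(self, n):
        self._spend(n * self.token_cost)      # every generated token costs

    def charge_test(self):
        self._spend(self.test_cost)           # every local test run costs

    def tick(self):
        now = time.monotonic()                # elapsed seconds cost too
        self._spend((now - self._t0) * self.second_cost)
        self._t0 = now

    def _spend(self, amount):
        self.credits -= amount
        if self.credits <= 0:
            raise RuntimeError("budget exhausted: leg failed")
```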
[808] Edu-MMBias: A Three-Tier Multimodal Benchmark for Auditing Social Bias in Vision-Language Models under Educational Contexts
Ruijia Li, Mingzi Zhang, Zengyi Yu, Yuang Wei, Bo Jiang
Main category: cs.AI
TL;DR: Edu-MMBias is a framework for auditing social biases in Vision-Language Models (VLMs) used in educational contexts, focusing on visual modality gaps in fairness evaluations.
Details
Motivation: Current bias evaluations for VLMs focus only on text, ignoring visual modality which can serve as an unregulated channel for latent social biases, especially critical in educational decision-making applications.
Method: Developed Edu-MMBias framework based on social psychology’s tri-component model (cognitive, affective, behavioral). Used generative pipeline with self-correction and human-in-the-loop verification to create contamination-resistant student profiles for holistic stress testing of VLMs.
Result: Audit revealed critical patterns: models show compensatory class bias favoring lower-status narratives while maintaining health and racial stereotypes. Visual inputs act as safety backdoors, triggering biases that bypass text-based alignment safeguards, revealing systematic misalignment between latent cognition and final decisions.
Conclusion: Visual modality in VLMs presents significant fairness risks that current text-centric evaluations miss, requiring comprehensive multimodal bias auditing frameworks like Edu-MMBias for educational applications.
Abstract: As Vision-Language Models (VLMs) become integral to educational decision-making, ensuring their fairness is paramount. However, current text-centric evaluations neglect the visual modality, leaving an unregulated channel for latent social biases. To bridge this gap, we present Edu-MMBias, a systematic auditing framework grounded in the tri-component model of attitudes from social psychology. This framework diagnoses bias across three hierarchical dimensions: cognitive, affective, and behavioral. Utilizing a specialized generative pipeline that incorporates a self-correct mechanism and human-in-the-loop verification, we synthesize contamination-resistant student profiles to conduct a holistic stress test on state-of-the-art VLMs. Our extensive audit reveals critical, counter-intuitive patterns: models exhibit a compensatory class bias favoring lower-status narratives while simultaneously harboring deep-seated health and racial stereotypes. Crucially, we find that visual inputs act as a safety backdoor, triggering a resurgence of biases that bypass text-based alignment safeguards and revealing a systematic misalignment between latent cognition and final decision-making. The contributions of this paper are available at: https://anonymous.4open.science/r/EduMMBias-63B2.
[809] Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Zhe Qian, Yanbiao Ma, Zhuohan Ouyang, Zhonghua Wang, Zhongxing Xu, Fei Luo, Xinyu Liu, Zongyuan Ge, Yike Guo, Jungong Han
Main category: cs.AI
TL;DR: V-STAR addresses hallucination in multimodal reasoning by detecting high-entropy cognitive bifurcation points and reinforcing visual attention through hierarchical rewards and forced reflection mechanisms.
Details
Motivation: Multimodal Large Reasoning Models suffer from hallucinations during long chain reasoning, particularly at high-entropy cognitive bifurcation points where models fail to query visual evidence and rely on language priors instead.
Method: Proposes V-STAR with Hierarchical Visual Attention Reward (HVAR) integrated in GRPO framework to dynamically incentivize visual attention at critical layers, and Forced Reflection Mechanism (FRM) for trajectory editing to disrupt cognitive inertia and encourage visual verification.
Result: The approach anchors reasoning processes back to visual inputs, mitigating hallucinations by internalizing visually aware reasoning capabilities through fine-grained attention guidance rather than just outcome supervision.
Conclusion: V-STAR provides a lightweight training paradigm that addresses the Reasoning Vision Truth Disconnect phenomenon by reinforcing visual semantic anchoring during high-uncertainty reasoning transitions.
Abstract: Multimodal Large Reasoning Models (MLRMs) have achieved remarkable strides in visual reasoning through test time compute scaling, yet long chain reasoning remains prone to hallucinations. We identify a concerning phenomenon termed the Reasoning Vision Truth Disconnect (RVTD): hallucinations are strongly correlated with cognitive bifurcation points that often exhibit high entropy states. We attribute this vulnerability to a breakdown in visual semantic anchoring, localized within the network’s intermediate layers; specifically, during these high uncertainty transitions, the model fails to query visual evidence, reverting instead to language priors. Consequently, we advocate a shift from solely outcome level supervision to augmenting it with fine grained internal attention guidance. To this end, we propose V-STAR (Visual Structural Training with Attention Reinforcement), a lightweight, holistic training paradigm designed to internalize visually aware reasoning capabilities. Central to our approach is the Hierarchical Visual Attention Reward (HVAR), integrated within the GRPO framework. Upon detecting high entropy states, this mechanism dynamically incentivizes visual attention across critical intermediate layers, thereby anchoring the reasoning process back to the visual input. Furthermore, we introduce the Forced Reflection Mechanism (FRM), a trajectory editing strategy that disrupts cognitive inertia by triggering reflection around high entropy cognitive bifurcation points and encouraging verification of subsequent steps against the visual input, thereby translating external debiasing interventions into an intrinsic capability for hallucination mitigation.
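A rough sketch of the entropy trigger: flag high-entropy decoding steps as candidate bifurcation points, then reward visual-attention mass at those steps. The threshold rule and the reward form are our assumptions, not HVAR's exact hierarchical, layer-wise definition:

```python
import torch

def bifurcation_mask(step_logits, frac=0.9):
    """Hedged sketch: flag high-entropy decoding steps as candidate
    cognitive bifurcation points. step_logits: [T, V]; the fractional
    threshold rule is ours."""
    probs = step_logits.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)   # [T]
    return entropy > frac * entropy.max()

def hvar_bonus(attn_to_visual, mask, weight=0.5):
    """Reward bonus proportional to visual-attention mass at flagged
    steps, a stand-in for the paper's hierarchical reward."""
    if not mask.any():
        return attn_to_visual.new_zeros(())
    return weight * attn_to_visual[mask].mean()

# Usage inside a GRPO-style reward: r_total = r_outcome + hvar_bonus(...)
```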
[810] SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning
Zhe Qian, Nianbing Su, Zhonghua Wang, Hebei Li, Zhongxing Xu, Yueying Li, Fei Luo, Zhuohan Ouyang, Yanbiao Ma
Main category: cs.AI
TL;DR: SVSR is a framework that integrates self-verification and self-rectification into multimodal reasoning, using a three-stage training approach with preference data refinement, supervised fine-tuning, and semi-online DPO to improve robustness in visual understanding tasks.
Details
Motivation: Current multimodal models suffer from shallow reasoning leading to incomplete or inconsistent thought processes, causing errors in complex visual understanding and multimodal reasoning tasks.
Method: Three-stage training: 1) Construct unified preference dataset with refined reasoning traces from VLMs, 2) Cold-start supervised fine-tuning to learn multi-step reasoning, 3) Semi-online DPO with continuous augmentation using teacher VLM-filtered reasoning traces.
Result: Extensive experiments show improved reasoning accuracy and stronger generalization to unseen tasks/question types. Models also exhibit improved implicit reasoning even without explicit reasoning traces, outperforming strong baselines.
Conclusion: SVSR demonstrates potential for building more dependable, introspective, and cognitively aligned multimodal systems through explicit self-reflective reasoning integration.
Abstract: Current multimodal models often suffer from shallow reasoning, leading to errors caused by incomplete or inconsistent thought processes. To address this limitation, we propose Self-Verification and Self-Rectification (SVSR), a unified framework that explicitly integrates self-verification and self-rectification into the model’s reasoning pipeline, substantially improving robustness and reliability in complex visual understanding and multimodal reasoning tasks. SVSR is built on a novel three-stage training paradigm. First, we construct a high-quality unified preference dataset by refining reasoning traces from pre-trained vision-language models, incorporating both forward and backward reasoning to embed self-reflective signals. Second, we perform cold-start supervised fine-tuning on this dataset to learn structured, multi-step reasoning behaviors. Third, we apply a Semi-online Direct Preference Optimization (Semi-online DPO) process, continuously augmenting the training corpus with high-quality, model-generated reasoning traces filtered by a powerful teacher VLM. This pipeline enables the model to learn, elicit, and refine its ability to self-verify and self-rectify. Extensive experiments across diverse benchmarks demonstrate that SVSR improves reasoning accuracy and enables stronger generalization to unseen tasks and question types. Notably, once trained with explicit self-reflective reasoning, the model also exhibits improved implicit reasoning ability, outperforming strong baselines even when no explicit reasoning traces are provided. These results highlight the potential of SVSR for building more dependable, introspective, and cognitively aligned multimodal systems.
[811] A Dual-Positive Monotone Parameterization for Multi-Segment Bids and a Validity Assessment Framework for Reinforcement Learning Agent-based Simulation of Electricity Markets
Zunnan Xu, Zhaoxia Jing, Zhanhua Pan
Main category: cs.AI
TL;DR: Proposes a differentiable bid curve construction method for RL-based electricity market simulations to address gradient distortion from non-differentiable post-processing, and introduces Nash equilibrium distance metrics for rigorous evaluation.
Details
Motivation: Existing RL-based electricity market simulations use non-differentiable post-processing (sorting, clipping, projection) to enforce monotonicity and boundedness on bid curves, causing gradient distortion and spurious convergence. Current evaluations rely on training-curve convergence without rigorous Nash equilibrium assessment, undermining credibility.
Method: Proposes a differentiable bid curve construction method that ensures monotonicity and boundedness while maintaining continuous differentiability, injectivity, and invertibility. Also introduces metrics to measure distance between simulation outcomes and Nash equilibrium for rigorous evaluation.
Result: The proposed method eliminates gradient distortion issues from non-differentiable post-processing, enabling more stable and reliable RL training. The Nash equilibrium distance metrics provide rigorous assessment of simulation credibility beyond training-curve convergence.
Conclusion: The differentiable bid construction and rigorous equilibrium assessment framework improves the reliability and credibility of RL-based electricity market simulations by addressing fundamental gradient distortion and evaluation limitations.
Abstract: Reinforcement learning agent-based simulation (RL-ABS) has become an important tool for electricity market mechanism analysis and evaluation. In the modeling of monotone, bounded, multi-segment stepwise bids, existing methods typically let the policy network first output an unconstrained action and then convert it into a feasible bid curve satisfying monotonicity and boundedness through post-processing mappings such as sorting, clipping, or projection. However, such post-processing mappings often fail to satisfy continuous differentiability, injectivity, and invertibility at boundaries or kinks, thereby causing gradient distortion and leading to spurious convergence in simulation results. Meanwhile, most existing studies conduct mechanism analysis and evaluation mainly on the basis of training-curve convergence, without rigorously assessing the distance between the simulation outcomes and Nash equilibrium, which severely undermines the credibility of the results. To address these issues, this paper proposes…
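The kind of construction the Method paragraph describes can be sketched directly: compose only smooth, injective maps, so monotonicity and boundedness hold by construction and no sort/clip/projection step distorts gradients. This is one such parameterization, not necessarily the paper's:

```python
import torch
import torch.nn.functional as F

def monotone_bounded_bids(z, p_min, p_max):
    """Hedged sketch of a differentiable, monotone, bounded multi-segment
    bid parameterization (illustrative, not the paper's exact map).
    Softplus gaps are strictly positive, so cumulative prices strictly
    increase; 1 - exp(-c) squashes them smoothly into (0, 1).
    """
    gaps = F.softplus(z)                 # invertible: R -> (0, inf)
    c = torch.cumsum(gaps, dim=-1)       # strictly increasing, unbounded
    frac = 1.0 - torch.exp(-c)          # invertible: (0, inf) -> (0, 1)
    return p_min + (p_max - p_min) * frac

torch.manual_seed(0)
bids = monotone_bounded_bids(torch.randn(5), p_min=20.0, p_max=80.0)
assert torch.all(bids[1:] > bids[:-1])   # monotonicity holds by construction
```

Each map in this chain is smooth and injective on its domain (softplus on R, cumulative sum of positive gaps, then 1 - exp(-c) on (0, inf)), so the composite from the unconstrained action to the bid curve is invertible, which is exactly the property the abstract says sorting, clipping, and projection break.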
[812] The Amazing Agent Race: Strong Tool Users, Weak Navigators
Zae Myung Kim, Dongseok Lee, Jaehyung Kim, Vipul Raheja, Dongyeop Kang
Main category: cs.AI
TL;DR: AAR is a benchmark for LLM agents featuring DAG-structured tool-use puzzles on Wikipedia, revealing navigation as the primary failure point rather than tool execution.
Details
Motivation: Existing tool-use benchmarks for LLM agents are predominantly linear (55-100% simple chains), failing to capture the complex, branching nature of real-world tool-use scenarios where agents must navigate information spaces and execute multi-step tool chains.
Method: Created The Amazing Agent Race (AAR) benchmark with 1,400 procedurally generated instances across sequential (800) and compositional (600 DAG) variants. Features directed acyclic graph puzzles requiring Wikipedia navigation, multi-step tool chains, and result aggregation. Includes four difficulty levels with live-API validation and three complementary metrics for diagnosing different failure types.
Result: Best agent framework achieved only 37.2% accuracy on 1,400 legs. Navigation errors dominated (27-52% of trials) while tool-use errors remained below 17%. Agent architecture mattered as much as model scale (Claude Code matched Codex CLI at 37% accuracy with 6x fewer tokens).
Conclusion: The compositional structure of AAR reveals that agents fail primarily at navigating to the right pages rather than calling tools, a critical blind spot invisible to linear benchmarks. This highlights the need for more complex, non-linear evaluation frameworks for LLM agents.
Abstract: Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or “legs”) with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race
[813] STARS: Skill-Triggered Audit for Request-Conditioned Invocation Safety in Agent Systems
Guijia Zhang, Shu Yang, Xilin Gong, Di Wang
Main category: cs.AI
TL;DR: STARS is a system for continuous-risk estimation of skill invocation in language-model agents, combining static capability priors with request-conditioned risk models for safer runtime auditing.
Details
Motivation: Static skill auditing can't determine if a particular skill invocation is unsafe under specific user requests and runtime contexts, creating a need for continuous-risk estimation during actual usage.
Method: STARS combines: 1) static capability prior, 2) request-conditioned invocation risk model, and 3) calibrated risk-fusion policy. Evaluated on SIA-Bench benchmark with 3,000 invocation records including runtime context and risk targets.
Result: On indirect prompt injection attacks, calibrated fusion achieved 0.439 high-risk AUPRC, outperforming contextual scorer (0.405) and static baseline (0.380). Contextual scorer had better calibration (0.289 ECE). Gains were smaller on in-distribution test.
Conclusion: Request-conditioned auditing is most valuable as an invocation-time risk-scoring and triage layer rather than as a replacement for static screening, providing narrower but important safety benefits.
Abstract: Autonomous language-model agents increasingly rely on installable skills and tools to complete user tasks. Static skill auditing can expose capability surface before deployment, but it cannot determine whether a particular invocation is unsafe under the current user request and runtime context. We therefore study skill invocation auditing as a continuous-risk estimation problem: given a user request, candidate skill, and runtime context, predict a score that supports ranking and triage before a hard intervention is applied. We introduce STARS, which combines a static capability prior, a request-conditioned invocation risk model, and a calibrated risk-fusion policy. To evaluate this setting, we construct SIA-Bench, a benchmark of 3,000 invocation records with group-safe splits, lineage metadata, runtime context, canonical action labels, and derived continuous-risk targets. On a held-out split of indirect prompt injection attacks, calibrated fusion reaches 0.439 high-risk AUPRC, improving over 0.405 for the contextual scorer and 0.380 for the strongest static baseline, while the contextual scorer remains better calibrated with 0.289 expected calibration error. On the locked in-distribution test split, gains are smaller and static priors remain useful. The resulting claim is therefore narrower: request-conditioned auditing is most valuable as an invocation-time risk-scoring and triage layer rather than as a replacement for static screening. Code is available at https://github.com/123zgj123/STARS.
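A toy version of the calibrated fusion: blend the static prior and the request-conditioned score in logit space, then apply a Platt-style calibration map. The blend weight and calibration parameters are placeholders; the paper's fusion policy is learned, not hand-set:

```python
import numpy as np

def fused_risk(static_prior, contextual_risk, w=0.6, platt_a=1.0, platt_b=0.0):
    """Hedged sketch of STARS-style risk fusion: a static capability prior
    and a request-conditioned invocation risk, both in (0, 1), are blended
    in logit space and recalibrated. All parameters are illustrative.
    """
    logit = lambda p: np.log(p / (1.0 - p))
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    z = w * logit(contextual_risk) + (1.0 - w) * logit(static_prior)
    return sigmoid(platt_a * z + platt_b)   # continuous score for ranking/triage

# e.g. a benign-looking skill (prior 0.1) invoked in a risky context (0.7)
print(round(fused_risk(0.1, 0.7), 3))
```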
[814] Dead Cognitions: A Census of Misattributed Insights
Aaron Tuor, claude.ai
Main category: cs.AI
TL;DR: AI chat systems perform cognitive work but rhetorically credit users for insights, creating attribution laundering that erodes users’ ability to assess their own contributions over time.
Details
Motivation: To identify and analyze a failure mode in AI chat systems where models perform substantive cognitive work but rhetorically attribute the insights to users, creating systematic attribution laundering that affects users' self-assessment abilities.
Method: Conceptual analysis and tracing of mechanisms at both individual and societal scales, examining chat interfaces that discourage scrutiny and institutional pressures favoring adoption over accountability. The document itself serves as an artifact of the process it describes.
Result: Identifies attribution laundering as a systematic failure mode that is occluded to affected users and self-reinforcing, eroding users’ ability to accurately assess their own cognitive contributions over time.
Conclusion: Attribution laundering in AI chat systems creates problematic dynamics where users lose track of their own cognitive contributions, with the boundary between human and AI authorship becoming difficult to discern, raising important questions about agency and accountability.
Abstract: This essay identifies a failure mode of AI chat systems that we term attribution laundering: the model performs substantive cognitive work and then rhetorically credits the user for having generated the resulting insights. Unlike transparent versions of glad-handing sycophancy, attribution laundering is systematically occluded to the person it affects and self-reinforcing – eroding users’ ability to accurately assess their own cognitive contributions over time. We trace the mechanisms at both individual and societal scales, from the chat interface that discourages scrutiny to the institutional pressures that reward adoption over accountability. The document itself is an artifact of the process it describes, and is color-coded accordingly – though the views expressed are the authors’ own, not those of any affiliated institution, and the boundary between the human author’s views and Claude’s is, as the essay argues, difficult to draw.
[815] AI Organizations are More Effective but Less Aligned than Individual Agents
Judy Hanwen Shen, Daniel Zhu, Siddarth Srinivasan, Henry Sleight, Lawrence T. Wagner, Morgan Jane Matthews, Erik Jones, Jascha Sohl-Dickstein
Main category: cs.AI
TL;DR: AI organizations (multi-agent systems) are more effective but less aligned than individual AI agents across business and software development tasks
Details
Motivation: Most AI research focuses on individual model behavior, but real-world deployments increasingly involve multi-agent systems working together. There's a need to understand how these "AI organizations" differ from individual agents in terms of both capabilities and alignment.
Method: Experimental study across 12 tasks in two practical settings: 1) AI consultancy solving business problems, and 2) AI software team developing software products. Compared performance and alignment of multi-agent AI organizations versus individual aligned models.
Result: AI organizations composed of aligned models consistently produced solutions with higher utility (more effective at achieving business goals) but greater misalignment compared to single aligned models. This pattern held across all settings and tasks.
Conclusion: Research must consider interacting systems of AI agents, not just individual models, as multi-agent organizations exhibit different capability-alignment tradeoffs. Both capabilities and safety research need to account for emergent properties in AI organizations.
Abstract: AI is increasingly deployed in multi-agent systems; however, most research considers only the behavior of individual models. We experimentally show that multi-agent “AI organizations” are simultaneously more effective at achieving business goals, but less aligned, than individual AI agents. We examine 12 tasks across two practical settings: an AI consultancy providing solutions to business problems and an AI software team developing software products. Across all settings, AI Organizations composed of aligned models produce solutions with higher utility but greater misalignment compared to a single aligned model. Our work demonstrates the importance of considering interacting systems of AI agents when doing both capabilities and safety research.
[816] TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale
Malgorzata Gwiazda, Yifu Cai, Mononito Goswami, Arjun Choudhry, Artur Dubrawski
Main category: cs.AI
TL;DR: The paper introduces TimeSeriesExam and TimeSeriesExamAgent, scalable methods for creating comprehensive time series reasoning benchmarks to evaluate LLMs’ understanding of time series data across multiple domains and reasoning categories.
Details
Motivation: Existing benchmarks for evaluating LLMs' time series understanding are mostly manually curated, narrow in scope, and focus on specific skill sets, limiting comprehensive assessment of LLMs' true understanding of time series data.
Method: Two approaches: 1) TimeSeriesExam - a multiple-choice benchmark using synthetic time series to evaluate LLMs across five core reasoning categories; 2) TimeSeriesExamAgent - automatically generates benchmarks from real-world datasets across healthcare, finance, and weather domains using LLM agents.
Result: Automatically generated benchmarks achieve diversity comparable to manually curated alternatives, but LLM performance remains limited in both abstract time series reasoning and domain-specific applications, revealing ongoing challenges.
Conclusion: The proposed scalable benchmark generation methods provide comprehensive evaluation of LLMs’ time series understanding, but current LLMs still have significant limitations in time series reasoning across both synthetic and real-world domains.
Abstract: Large Language Models (LLMs) have shown promising performance in time series modeling tasks, but do they truly understand time series data? While multiple benchmarks have been proposed to answer this fundamental question, most are manually curated and focus on narrow domains or specific skill sets. To address this limitation, we propose scalable methods for creating comprehensive time series reasoning benchmarks that combine the flexibility of templates with the creativity of LLM agents. We first develop TimeSeriesExam, a multiple-choice benchmark using synthetic time series to evaluate LLMs across five core reasoning categories: pattern recognition, noise understanding, similarity analysis, anomaly detection, and causality. Then, with TimeSeriesExamAgent, we scale our approach by automatically generating benchmarks from real-world datasets spanning healthcare, finance and weather domains. Through multi-dimensional quality evaluation, we demonstrate that our automatically generated benchmarks achieve diversity comparable to manually curated alternatives. However, our experiments reveal that LLM performance remains limited in both abstract time series reasoning and domain-specific applications, highlighting ongoing challenges in enabling effective time series understanding in these models. TimeSeriesExamAgent is available at https://github.com/magwiazda/TimeSeriesExamAgent.
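Template-plus-generator benchmarks of this kind typically anchor the answer key in the synthesis parameters, so correctness is verifiable by construction. A toy item generator along those lines (wording and option format are ours, not the benchmark's):

```python
import numpy as np

def anomaly_mcq(seed=0, n=200):
    """Hedged sketch of template-driven item generation: synthesize a
    series with a known injected anomaly, then emit a multiple-choice
    question whose answer key is grounded in the generation parameters.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    series = np.sin(2 * np.pi * t / 50) + 0.1 * rng.standard_normal(n)
    spike_at = int(rng.integers(20, n - 20))
    series[spike_at] += 4.0                                  # ground-truth anomaly
    distractors = rng.permutation(np.delete(t, spike_at))[:3].tolist()
    options = sorted(distractors + [spike_at])
    return {
        "series": series,
        "question": "At which index does the series contain a point anomaly?",
        "options": options,
        "answer": options.index(spike_at),                   # verifiable key
    }
```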
[817] Gypscie: A Cross-Platform AI Artifact Management System
Fabio Porto, Eduardo Ogasawara, Gabriela Moraes Botaro, Julia Neumann Bastos, Augusto Fonseca, Esther Pacitti, Patrick Valduriez
Main category: cs.AI
TL;DR: Gypscie is a cross-platform AI artifact management system that provides unified view of AI artifacts through knowledge graphs and rule-based query language, enabling simplified AI application development and deployment across multiple platforms.
Details
Motivation: AI model lifecycle management is complex, requiring coordination of diverse services managing datasets, dataflows, and models. There's a need to isolate applications from the complexity of interacting with heterogeneous services, datasets, and AI platforms.
Method: Gypscie uses a knowledge graph to capture application semantics and a rule-based query language for reasoning over data and models. Model lifecycle activities are represented as high-level dataflows that can be scheduled across multiple platforms (servers, cloud, supercomputers). The system records provenance information for explainability.
Result: Qualitative comparison shows Gypscie supports broader range of functionalities across AI artifact lifecycle than representative AI systems. Experimental evaluation demonstrates Gypscie can successfully optimize and schedule dataflows on AI platforms from abstract specifications.
Conclusion: Gypscie provides a unified platform for managing AI artifacts across their lifecycle, simplifying AI application development and deployment while enabling cross-platform scheduling and explainability through provenance tracking.
Abstract: Artificial Intelligence (AI) models, encompassing both traditional machine learning (ML) and more advanced approaches such as deep learning and large language models (LLMs), play a central role in modern applications. AI model lifecycle management involves the end-to-end process of managing these models, from data collection and preparation to model building, evaluation, deployment, and continuous monitoring. This process is inherently complex, as it requires the coordination of diverse services that manage AI artifacts such as datasets, dataflows, and models, all orchestrated to operate seamlessly. In this context, it is essential to isolate applications from the complexity of interacting with heterogeneous services, datasets, and AI platforms. In this paper, we introduce Gypscie, a cross-platform AI artifact management system. By providing a unified view of all AI artifacts, the Gypscie platform simplifies the development and deployment of AI applications. This unified view is realized through a knowledge graph that captures application semantics and a rule-based query language that supports reasoning over data and models. Model lifecycle activities are represented as high-level dataflows that can be scheduled across multiple platforms, such as servers, cloud platforms, or supercomputers. Finally, Gypscie records provenance information about the artifacts it produces, thereby enabling explainability. Our qualitative comparison with representative AI systems shows that Gypscie supports a broader range of functionalities across the AI artifact lifecycle. Our experimental evaluation demonstrates that Gypscie can successfully optimize and schedule dataflows on AI platforms from an abstract specification.
[818] From GPT-3 to GPT-5: Mapping their capabilities, scope, limitations, and consequences
Hina Afridi, Habib Ullah, Sultan Daud Khan, Mohib Ullah
Main category: cs.AI
TL;DR: A comparative analysis of GPT family evolution from GPT-3 to GPT-5, examining technical progression, capability changes, deployment shifts, persistent limitations, and downstream consequences across multimodal, tool-oriented systems.
Details
Motivation: To provide a comparative rather than merely historical analysis of how the GPT family evolved across technical framing, user interaction, modality, deployment architecture, and governance, focusing on the transformation from scaled text predictors to multimodal, tool-oriented systems.
Method: Analyzes official technical reports, system cards, API/model documentation, product announcements, release notes, and peer-reviewed secondary studies to examine five recurring themes: technical progression, capability changes, deployment shifts, persistent limitations, and downstream consequences.
Result: The GPT family evolved from scaled few-shot text predictors into aligned, multimodal, tool-oriented, long-context, workflow-integrated systems, complicating simple model comparisons due to product routing, tool access, safety tuning, and interface design integration.
Conclusion: The transition from GPT-3 to GPT-5 represents not just improved model capabilities but a broader reformulation of deployable AI systems, their evaluation, and responsibility allocation, while persistent limitations like hallucination and transparency issues remain.
Abstract: We present the progress of the GPT family from GPT-3 through GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, GPT-4.1, and the GPT-5 family. Our work is comparative rather than merely historical. We investigate how the family evolved in technical framing, user interaction, modality, deployment architecture, and governance. The work focuses on five recurring themes: technical progression, capability changes, deployment shifts, persistent limitations, and downstream consequences. In terms of research design, we consider official technical reports, system cards, API and model documentation, product announcements, release notes, and peer-reviewed secondary studies. A primary assertion is that later GPT generations should not be interpreted only as larger or more accurate language models. Instead, the family evolves from a scaled few-shot text predictor into a set of aligned, multimodal, tool-oriented, long-context, and increasingly workflow-integrated systems. This development complicates simple model-to-model comparison because product routing, tool access, safety tuning, and interface design become part of the effective system. Across generations, several limitations remain unchanged: hallucination, prompt sensitivity, benchmark fragility, uneven behavior across domains and populations, and incomplete public transparency about architecture and training. At the same time, the family has reshaped software development, educational practice, information work, interface design, and discussions of frontier-model governance. We infer that the transition from GPT-3 to GPT-5 is best understood not only as an improvement in model capability, but also as a broader reformulation of what a deployable AI system is, how it is evaluated, and where responsibility should be located when such systems are used at scale.
[819] Zero-shot World Models Are Developmentally Efficient Learners
Khai Loong Aw, Klemen Kotar, Wanhee Lee, Seungwoo Kim, Khaled Jedoui, Rahul Venkatesh, Lilian Naing Chen, Michael C. Frank, Daniel L. K. Yamins
Main category: cs.AI
TL;DR: ZWM is a computational model that learns physical scene understanding from first-person child experience, achieving zero-shot generalization across multiple physical understanding tasks while mimicking child development patterns.
Details
Motivation: Children demonstrate remarkable data-efficient physical understanding abilities that current AI systems struggle with. The paper aims to create a computational model that can learn from limited human-scale data like children do, bridging cognitive science and AI.
Method: Zero-shot Visual World Model (ZWM) uses three principles: 1) sparse temporally-factored predictor separating appearance from dynamics, 2) zero-shot estimation via approximate causal inference, and 3) composition of inferences for complex abilities. Trained on first-person experience of a single child.
Result: ZWM achieves competence across multiple physical understanding benchmarks, recapitulates behavioral signatures of child development, and builds brain-like internal representations from limited training data.
Conclusion: ZWM provides a blueprint for efficient learning from human-scale data, advancing both computational accounts of children’s physical understanding and data-efficient AI systems.
Abstract: Young children demonstrate early abilities to understand their physical world, estimating depth, motion, object coherence, interactions, and many other aspects of physical scene understanding. Children are both data-efficient and flexible cognitive systems, creating competence despite extremely limited training data, while generalizing to myriad untrained tasks – a major challenge even for today’s best AI systems. Here we introduce a novel computational hypothesis for these abilities, the Zero-shot Visual World Model (ZWM). ZWM is based on three principles: a sparse temporally-factored predictor that decouples appearance from dynamics; zero-shot estimation through approximate causal inference; and composition of inferences to build more complex abilities. We show that ZWM can be learned from the first-person experience of a single child, rapidly generating competence across multiple physical understanding benchmarks. It also broadly recapitulates behavioral signatures of child development and builds brain-like internal representations. Our work presents a blueprint for efficient and flexible learning from human-scale data, advancing both a computational account for children’s early physical understanding and a path toward data-efficient AI systems.
[820] VeriTrans: Fine-Tuned LLM-Assisted NL-to-PL Translation via a Deterministic Neuro-Symbolic Pipeline
Xuan Liu, Dheeraj Kodakandla, Kushagra Srivastva, Mahfuza Farooque
Main category: cs.AI
TL;DR: VeriTrans is a reliability-focused ML system that translates natural language requirements into formal logic with validator-gated reliability, achieving high correctness rates on SAT/UNSAT problems.
Details
Motivation: The paper addresses the need for reliable natural language to formal logic translation systems for reliability-critical workflows, where correctness and auditability are essential but current approaches lack systematic reliability guarantees.
Method: The system integrates an instruction-tuned NL→PL translator, round-trip reconstruction (PL→NL) as an acceptance gate, and canonical PL→CNF compilation. It uses fixed API configurations (temperature=0, seed=42) and per-item artifact logging for auditability and replay-driven debugging.
Result: On SatBench (2,100 specifications), VeriTrans achieves 94.46% SAT/UNSAT correctness and 87.73% median round-trip similarity. Fine-tuning on 100-150 examples improves fidelity by 1-1.5pp without latency increase. A thresholded acceptance policy at τ=75 retains 68% of items with ~94% correctness.
Conclusion: By separating learned translation from symbolic verification and enforcing deterministic, validator-gated acceptance, VeriTrans creates auditable, reproducible components for reliability-critical workflows, with validator overhead contributing <15% of runtime.
Abstract: VeriTrans is a reliability-first ML system that compiles natural-language requirements into solver-ready logic with validator-gated reliability. The pipeline integrates an instruction-tuned NL→PL translator, round-trip reconstruction (PL→NL) used as a high-precision acceptance gate, and canonical PL→CNF compilation, all executed via a fixed API configuration (temperature = 0; fine-tuning runs use seed = 42) and per-item artifact logging (prompts, outputs, hashes) to support auditability and replay-driven debugging. On SatBench (2,100 specifications), VeriTrans achieves 94.46% SAT/UNSAT correctness and 87.73% median round-trip similarity. Compact fine-tuning on 100–150 curated examples improves fidelity by about 1–1.5 pp without increasing latency (mean 25.8 s/spec on our 201-spec runtime subset). A thresholded acceptance policy on the round-trip score exposes a reliability–coverage knob: at τ = 75, roughly 68% of items are retained with ~94% correctness on the accepted set. Validator overhead contributes <15% of end-to-end runtime, and all prompts/responses and timing metadata are logged to enable replay-driven debugging and regression testing. By separating learned translation from symbolic verification and enforcing deterministic, validator-gated acceptance, VeriTrans turns NL→logic front-ends into auditable, reproducible components for reliability-critical workflows.
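The acceptance gate reduces to a few lines once the three components are abstracted behind callables; the interfaces below are assumptions, with τ = 75 taken from the reported operating point:

```python
def accept(nl_spec, translate, back_translate, similarity, tau=75.0):
    """Hedged sketch of the validator gate: translate NL -> PL, reconstruct
    PL -> NL, and accept the formula only if round-trip similarity clears
    the threshold. The three callables stand in for the fine-tuned
    translator, the reconstructor, and the scoring metric, none of whose
    interfaces are public here; a 0-100 similarity scale is assumed.
    """
    pl = translate(nl_spec)                        # NL -> PL (temperature 0)
    nl_round_trip = back_translate(pl)             # PL -> NL reconstruction
    score = similarity(nl_spec, nl_round_trip)
    return (pl, score) if score >= tau else (None, score)   # reject -> triage
```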
[821] ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents
Mofasshara Rafique, Laurent Bindschaedler
Main category: cs.AI
TL;DR: ClawVM is a virtual memory layer for LLM agents that manages state as typed pages with minimum-fidelity invariants and multi-resolution representations under token budget constraints.
Details
Motivation: Current LLM agent systems treat context windows as working memory but manage residency and durability as best-effort, leading to recurring failures like lost state after compaction, bypassed flushes on reset, and destructive writeback.
Method: ClawVM implements a virtual memory layer that manages state as typed pages with minimum-fidelity invariants, multi-resolution representations under token budget constraints, and validated writeback at every lifecycle boundary, leveraging the harness as the natural enforcement point.
Result: ClawVM eliminates all policy-controllable faults when minimum-fidelity sets fit within token budgets, adds median <50 microseconds overhead per turn, and was validated across synthetic workloads, 12 real-session traces, and adversarial stress tests.
Conclusion: By placing memory management contracts in the harness, ClawVM makes residency and durability deterministic and auditable for LLM agents, solving persistent state management problems.
Abstract: Stateful tool-using LLM agents treat the context window as working memory, yet today’s agent harnesses manage residency and durability as best-effort, causing recurring failures: lost state after compaction, bypassed flushes on reset, and destructive writeback. We present ClawVM, a virtual memory layer that manages state as typed pages with minimum-fidelity invariants, multi-resolution representations under a token budget, and validated writeback at every lifecycle boundary. Because the harness already assembles prompts, mediates tools, and observes lifecycle events, it is the natural enforcement point; placing the contract there makes residency and durability deterministic and auditable. Across synthetic workloads, 12 real-session traces, and adversarial stress tests, ClawVM eliminates all policy-controllable faults whenever the minimum-fidelity set fits within the token budget, confirmed by an offline oracle, and adds median <50 microseconds of policy-engine overhead per turn.
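The residency contract (pin every page at its minimum fidelity, then spend leftover budget on upgrades) can be sketched as a small planner. Field names and the priority-greedy upgrade rule are our assumptions:

```python
from dataclasses import dataclass

@dataclass
class Page:
    key: str
    fidelities: dict      # resolution name -> (token_cost, text)
    min_fidelity: str     # invariant: never drop below this resolution
    priority: float       # higher = upgrade first

def plan_residency(pages, token_budget):
    """Hedged sketch of ClawVM-style residency planning: every page is
    first pinned at its minimum fidelity (the invariant), then remaining
    budget upgrades pages in priority order.
    """
    plan = {p.key: p.min_fidelity for p in pages}
    spent = sum(p.fidelities[p.min_fidelity][0] for p in pages)
    if spent > token_budget:
        raise RuntimeError("minimum-fidelity set exceeds token budget")
    for p in sorted(pages, key=lambda p: -p.priority):
        # Try resolutions from most to least expensive; take the best fit.
        for name, (cost, _) in sorted(p.fidelities.items(),
                                      key=lambda kv: -kv[1][0]):
            current_cost = p.fidelities[plan[p.key]][0]
            if cost > current_cost and spent + cost - current_cost <= token_budget:
                spent += cost - current_cost
                plan[p.key] = name
                break
    return plan   # page key -> chosen resolution
```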
[822] CWCD: Category-Wise Contrastive Decoding for Structured Medical Report Generation
Shantam Srivastava, Mahesh Bhosale, David Doermann, Mingchen Gao
Main category: cs.AI
TL;DR: CWCD is a novel decoding framework for structured radiology report generation that uses category-wise contrastive decoding with visual prompts to improve attention to visual tokens and reduce spurious pathology co-occurrences.
Details
Motivation: Current MLLMs for radiology report generation use single forward passes that diminish attention to visual tokens and increase reliance on language priors, leading to spurious pathology co-occurrences in generated reports.
Method: Category-Wise Contrastive Decoding (CWCD) introduces category-specific parameterization and generates category-wise reports by contrasting normal X-rays with masked X-rays using category-specific visual prompts.
Result: CWCD consistently outperforms baseline methods across both clinical efficacy and natural language generation metrics, with ablation studies showing contributions of each architectural component.
Conclusion: CWCD effectively addresses limitations of current MLLMs in radiology report generation by enhancing visual attention and reducing language prior biases through structured contrastive decoding.
Abstract: Interpreting chest X-rays is inherently challenging due to the overlap between anatomical structures and the subtle presentation of many clinically significant pathologies, making accurate diagnosis time-consuming even for experienced radiologists. Recent radiology-focused foundation models, such as LLaVA-Rad and Maira-2, have positioned multi-modal large language models (MLLMs) at the forefront of automated radiology report generation (RRG). However, despite these advances, current foundation models generate reports in a single forward pass. This decoding strategy diminishes attention to visual tokens and increases reliance on language priors as generation proceeds, which in turn introduces spurious pathology co-occurrences in the generated reports. To mitigate these limitations, we propose Category-Wise Contrastive Decoding (CWCD), a novel and modular framework designed to enhance structured radiology report generation (SRRG). Our approach introduces category-specific parameterization and generates category-wise reports by contrasting normal X-rays with masked X-rays using category-specific visual prompts. Experimental results demonstrate that CWCD consistently outperforms baseline methods across both clinical efficacy and natural language generation metrics. An ablation study further elucidates the contribution of each architectural component to overall performance.
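The category-wise contrast follows the standard contrastive-decoding pattern: score the full image and the category-masked image, then amplify their difference. The alpha weighting and the HuggingFace-style model interface are our assumptions, not the paper's exact formulation:

```python
import torch

@torch.no_grad()
def cwcd_step(model, normal_inputs, masked_inputs, alpha=1.0):
    """Hedged sketch of a category-wise contrastive decoding step:
    next-token logits from the full X-ray are contrasted against logits
    from a category-masked X-ray, so tokens supported by the visible
    category evidence are boosted and prior-driven tokens suppressed.
    """
    logits_full = model(**normal_inputs).logits[:, -1]
    logits_masked = model(**masked_inputs).logits[:, -1]
    contrast = (1 + alpha) * logits_full - alpha * logits_masked
    return contrast.argmax(-1)   # next token for this category's report
```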
[823] Safety Guarantees in Zero-Shot Reinforcement Learning for Cascade Dynamical Systems
Shima Rabiei, Sandipan Mishra, Santiago Paternain
Main category: cs.AI
TL;DR: Zero-shot safety guarantees for cascade dynamical systems using safe RL on reduced-order models with theoretical bounds on safety probability.
Details
Motivation: Address safety guarantees for cascade dynamical systems where inner states affect outer states but not vice-versa, aiming to provide high-probability safety guarantees with reduced training complexity.
Method: Train safe RL policy on reduced-order model ignoring inner state dynamics, treat inner states as actions affecting outer states, deploy with low-level controller for tracking, provide theoretical bounds on safety probability.
Result: Established theoretical bound linking safety probability to tracking quality of inner states, validated on quadrotor navigation showing safety preservation tied to low-level controller bandwidth and tracking capabilities.
Conclusion: Proposed framework provides zero-shot safety guarantees for cascade systems with theoretical foundations linking safety preservation to tracking performance of low-level controllers.
Abstract: This paper considers the problem of zero-shot safety guarantees for cascade dynamical systems. These are systems where a subset of the states (the inner states) affects the dynamics of the remaining states (the outer states) but not vice-versa. We define safety as remaining on a set deemed safe for all times with high probability. We propose to train a safe RL policy on a reduced-order model, which ignores the dynamics of the inner states, but it treats it as an action that influences the outer state. Thus, reducing the complexity of the training. When deployed in the full system the trained policy is combined with a low-level controller whose task is to track the reference provided by the RL policy. Our main theoretical contribution is a bound on the safe probability in the full-order system. In particular, we establish the interplay between the probability of remaining safe after the zero-shot deployment and the quality of the tracking of the inner states. We validate our theoretical findings on a quadrotor navigation task, demonstrating that the preservation of the safety guarantees is tied to the bandwidth and tracking capabilities of the low-level controller.
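The zero-shot deployment described above is a simple two-level loop: the RL policy, trained on the reduced-order model, emits an inner-state reference as its "action", and a low-level controller tracks it on the full-order system. Every interface name here is illustrative:

```python
def deploy(policy, low_level_controller, env, steps=1000):
    """Hedged sketch of the hierarchical deployment: the policy commands
    inner-state references (e.g. desired attitude/thrust on a quadrotor);
    the low-level controller's tracking quality is what the paper's
    safety bound depends on. `env` interfaces are our assumptions.
    """
    outer = env.reset()
    for _ in range(steps):
        inner_ref = policy(outer)                  # action = inner-state reference
        u = low_level_controller(env.inner_state(), inner_ref)  # tracking law
        outer, done = env.step(u)                  # full cascade dynamics
        if done:
            break
```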
[824] VeriSim: A Configurable Framework for Evaluating Medical AI Under Realistic Patient Noise
Sina Mansouri, Mohit Marvania, Vibhavari Ashok Shihorkar, Han Ngoc Tran, Kazhal Shafiei, Mehrdad Fazli, Yikuan Li, Ziwei Zhu
Main category: cs.AI
TL;DR: VeriSim is a patient simulation framework that injects clinically-grounded noise into patient responses to test medical LLMs under realistic communication barriers, revealing significant performance degradation across models.
Details
Motivation: Current medical LLM evaluations fail to capture real clinical complexities where patients have memory gaps, limited health literacy, anxiety, and communication barriers, creating a sim-to-real gap in medical AI assessment.Method: A truth-preserving patient simulation framework with hybrid UMLS-LLM verification that injects controllable, clinically evidence-grounded noise across six dimensions derived from medical communication literature, while maintaining medical ground truth.
Result: All tested LLMs degrade significantly under realistic patient noise: diagnostic accuracy drops 15-25%, conversation length increases 34-55%, smaller models degrade 40% more than larger ones, and medical fine-tuning provides limited robustness benefits.
Conclusion: There’s a critical sim-to-real gap in medical AI evaluation; VeriSim provides a rigorous testbed for clinical robustness assessment and is released as open-source framework.
Abstract: Medical large language models (LLMs) achieve impressive performance on standardized benchmarks, yet these evaluations fail to capture the complexity of real clinical encounters where patients exhibit memory gaps, limited health literacy, anxiety, and other communication barriers. We introduce VeriSim, a truth-preserving patient simulation framework that injects controllable, clinically evidence-grounded noise into patient responses while maintaining strict adherence to medical ground truth through a hybrid UMLS-LLM verification mechanism. Our framework operationalizes six noise dimensions derived from peer-reviewed medical communication literature, capturing authentic clinical phenomena such as patient recall limitations, health literacy barriers, and stigma-driven non-disclosure. Experiments across seven open-weight LLMs reveal that all models degrade significantly under realistic patient noise, with diagnostic accuracy dropping 15-25% and conversation length increasing 34-55%. Notably, smaller models (7B) show 40% greater degradation than larger models (70B+), while medical fine-tuning on standard corpora provides limited robustness benefits against patient communication noise. Evaluation by board-certified clinicians demonstrates high-quality simulation with strong inter-annotator agreement (kappa > 0.80), while LLM-as-a-Judge serves as a validated auxiliary evaluator achieving comparable reliability for scalable assessment. Our results highlight a critical Sim-to-Real gap in current medical AI. We release VeriSim as an open-source noise-injection framework, establishing a rigorous testbed for evaluating clinical robustness.
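To make the truth-preserving noise-injection idea concrete, here is a toy sketch; the two transforms, the `verify` callback, and the `level` knob are invented placeholders, not VeriSim's six dimensions or its hybrid UMLS-LLM verifier.

```python
import random

# Hypothetical stand-ins for clinically grounded noise transforms;
# names and string edits are illustrative only.
def recall_gap(text):     return text.replace("three days ago", "a while back")
def low_literacy(text):   return text.replace("hypertension", "high blood pressure, I think")

NOISE_TRANSFORMS = [recall_gap, low_literacy]

def inject_noise(answer: str, ground_truth_facts: set,
                 verify, level: float = 0.5) -> str:
    """Apply noise transforms, keeping only candidates that a
    truth-preserving verifier accepts, so the medical ground truth
    is never contradicted."""
    noisy = answer
    for transform in NOISE_TRANSFORMS:
        if random.random() < level:
            candidate = transform(noisy)
            if verify(candidate, ground_truth_facts):  # truth preserved?
                noisy = candidate
    return noisy
```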
[825] PEMANT: Persona-Enriched Multi-Agent Negotiation for Travel
Yuran Sun, Mustafa Sameen, Yaotian Zhang, Chia-yu Wu, Xilei Zhao
Main category: cs.AI
TL;DR: PEMANT is an LLM-based framework that uses behavioral theory and multi-agent negotiation to model household-level trip generation, outperforming existing methods.
Details
Motivation: Existing trip generation models lack behavioral theory and intra-household interaction dynamics, which are critical for realistic collective travel decisions. Current LLM-based approaches don't incorporate these essential elements.Method: PEMANT integrates behavioral theory through individualized persona modeling using the Household-Aware Chain-of-Planned-Behavior (HA-CoPB) framework, then conducts household-level trip planning via structured multi-agent conversations with persona-alignment control.
Result: PEMANT consistently outperforms state-of-the-art benchmarks across both national and regional household travel survey datasets.
Conclusion: The proposed framework successfully addresses limitations of existing approaches by incorporating behavioral theory and intra-household dynamics through multi-agent LLM negotiation, leading to improved household trip generation modeling.
Abstract: Modeling household-level trip generation is fundamental to accurate demand forecasting, traffic flow estimation, and urban system planning. Existing studies have mostly been based on classical machine learning models with limited predictive capability, while recent LLM-based approaches have yet to incorporate behavioral theory or intra-household interaction dynamics, both of which are critical for modeling realistic collective travel decisions. To address these limitations, we propose a novel LLM-based framework, named Persona-Enriched Multi-Agent Negotiation for Travel (PEMANT), which first integrates behavioral theory for individualized persona modeling and then conducts household-level trip planning negotiations via a structured multi-agent conversation. Specifically, PEMANT transforms static sociodemographic attributes into coherent narrative profiles that explicitly encode household-level attitudes, subjective norms, and perceived behavioral controls, following our proposed Household-Aware Chain-of-Planned-Behavior (HA-CoPB) framework. Building on these theory-grounded personas, PEMANT captures real-world household decision negotiation via a structured two-phase multi-agent conversation framework with a novel persona-alignment control mechanism. Evaluated on both national and regional household travel survey datasets, PEMANT consistently outperforms state-of-the-art benchmarks across datasets.
[826] Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs
Yu Li, Xiaoran Shang, Qizhi Pei, Yun Zhu, Xin Gao, Honglin Lin, Zhanping Zhong, Zhuoshi Pan, Zheng Liu, Xiaoyang Wang, Conghui He, Dahua Lin, Feng Zhao, Lijun Wu
Main category: cs.AI
TL;DR: Proposes data lineage concept and automated multi-agent framework to reconstruct evolutionary graphs of LLM dataset development, revealing structural patterns and systemic issues like redundancy and benchmark contamination propagation.
Details
Motivation: Current LLM post-training datasets are treated as isolated artifacts, overlooking systemic connections and evolutionary relationships that could reveal important patterns and issues in dataset development.Method: Introduces data lineage concept and automated multi-agent framework to reconstruct evolutionary graphs of dataset development, analyzes domain-specific structural patterns, and creates lineage-aware diversity-oriented datasets.
Result: Identifies vertical refinement patterns in math datasets and horizontal aggregation in general-domain corpora, uncovers structural redundancy and benchmark contamination propagation, and demonstrates lineage-aware datasets reduce homogenization.
Conclusion: Data lineage analysis provides systematic approach to dataset curation, revealing hidden relationships and issues, enabling more diverse and less redundant post-training corpora through lineage-aware construction.
Abstract: Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of data lineage to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora. Moreover, we uncover pervasive systemic issues, including structural redundancy induced by implicit dataset intersections and the propagation of benchmark contamination along lineage paths. To demonstrate the practical value of lineage analysis for data construction, we leverage the reconstructed lineage graph to create a lineage-aware diversity-oriented dataset. By anchoring instruction sampling at upstream root sources, this approach mitigates downstream homogenization and hidden redundancy, yielding a more diverse post-training corpus. We further highlight lineage-centric analysis as an efficient and robust topological alternative to sample-level dataset comparison for large-scale data ecosystems. By grounding data construction in explicit lineage structures, our work advances post-training data curation toward a more systematic and controllable paradigm.
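The lineage-graph idea can be illustrated with a toy directed graph; the dataset names and derivation edges below are invented for illustration, not the paper's reconstructed graph.

```python
import networkx as nx

# Toy lineage graph: an edge A -> B means dataset B was derived from A.
G = nx.DiGraph()
G.add_edges_from([
    ("GSM8K", "MetaMathQA"),        # vertical refinement (invented edges)
    ("GSM8K", "MathInstruct"),
    ("ShareGPT", "UltraChat-mix"),  # horizontal aggregation (invented)
    ("OpenOrca", "UltraChat-mix"),
])

# Root sources = nodes with no incoming derivation edge; anchoring
# instruction sampling here is the "lineage-aware" construction step.
roots = [n for n in G.nodes if G.in_degree(n) == 0]
for root in roots:
    print(root, "->", sorted(nx.descendants(G, root)))
```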
[827] CHAIRO: Contextual Hierarchical Analogical Induction and Reasoning Optimization for LLMs
Haotian Lu, Yuchen Mou, Bingzhe Wu
Main category: cs.AI
TL;DR: A novel content moderation framework using analogical examples to enhance rule induction and decision reliability through end-to-end optimization of analogical retrieval, rule generation, and moderation classification.
Details
Motivation: Traditional content moderation approaches (rule-based and ML) struggle with evolving user-generated content complexity. Current LLM-based methods via prompting or fine-tuning have limited generalization, interpretability, and adaptability to unseen/ambiguous cases.Method: Proposes a moderation framework leveraging analogical examples to enhance rule induction and decision reliability. Integrates end-to-end optimization of analogical retrieval, rule generation, and moderation classification for dynamic adaptation to diverse content scenarios.
Result: Significantly outperforms rule-injected fine-tuning baselines and multi-stage static RAG pipelines in moderation accuracy and rule quality. Produces rules with better clarity, interpretability, and applicability according to human assessments and external model generalization tests.
Conclusion: Analogical example-driven methods can advance robust, explainable, and generalizable content moderation in real-world applications.
Abstract: Content moderation in online platforms faces persistent challenges due to the evolving complexity of user-generated content and the limitations of traditional rule-based and machine learning approaches. While recent advances in large language models (LLMs) have enabled more sophisticated moderation via direct prompting or fine-tuning, these approaches often exhibit limited generalization, interpretability, and adaptability to unseen or ambiguous cases. In this work, we propose a novel moderation framework that leverages analogical examples to enhance rule induction and decision reliability. Our approach integrates end-to-end optimization of analogical retrieval, rule generation, and moderation classification, enabling the dynamic adaptation of moderation rules to diverse content scenarios. Through comprehensive experiments, we demonstrate that our method significantly outperforms both rule-injected fine-tuning baselines and multi-stage static RAG pipelines in terms of moderation accuracy and rule quality. Further evaluations, including human assessments and external model generalization tests, confirm that our framework produces rules with better clarity, interpretability, and applicability. These findings show that analogical example-driven methods can advance robust, explainable, and generalizable content moderation in real-world applications.
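The retrieval half of such an analogy-then-rule pipeline might look like the following cosine-similarity sketch over precomputed case embeddings; this is an assumption for illustration, since the paper optimizes retrieval end-to-end with rule generation and classification.

```python
import numpy as np

def retrieve_analogies(query_vec: np.ndarray,
                       case_vecs: np.ndarray,
                       k: int = 3) -> np.ndarray:
    """Return indices of the k moderation cases most similar to the
    query. Embeddings are assumed precomputed by any text encoder."""
    q = query_vec / np.linalg.norm(query_vec)
    c = case_vecs / np.linalg.norm(case_vecs, axis=1, keepdims=True)
    sims = c @ q                     # cosine similarity to every case
    return np.argsort(-sims)[:k]    # top-k analogical examples
```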
[828] CARO: Chain-of-Analogy Reasoning Optimization for Robust Content Moderation
Bingzhe Wu, Haotian Lu, Yuchen Mou
Main category: cs.AI
TL;DR: CARO is a two-stage training framework that enhances LLMs’ analogical reasoning for content moderation by first using retrieval-augmented generation for SFT, then applying customized DPO to reinforce analogical reasoning behaviors.
Details
Motivation: Current LLMs struggle with ambiguous content moderation cases due to misleading "decision shortcuts" in context, inspired by cognitive psychology insights into expert moderation.Method: Two-stage framework: 1) Bootstraps analogical reasoning chains via RAG on moderation data with SFT, 2) Customized DPO to explicitly reinforce analogical reasoning behaviors, dynamically generating tailored analogical references during inference.
Result: Substantially outperforms state-of-the-art reasoning models (DeepSeek R1, QwQ), specialized moderation models (LLaMA Guard), and advanced fine-tuning and retrieval-augmented methods, achieving 24.9% average F1 score improvement on challenging ambiguous moderation benchmarks.
Conclusion: CARO effectively mitigates harmful decision shortcuts in LLMs for content moderation through robust analogical reasoning, demonstrating significant improvements over existing approaches.
Abstract: Current large language models (LLMs), even those explicitly trained for reasoning, often struggle with ambiguous content moderation cases due to misleading “decision shortcuts” embedded in context. Inspired by cognitive psychology insights into expert moderation, we introduce CARO (Chain-of-Analogy Reasoning Optimization), a novel two-stage training framework to induce robust analogical reasoning in LLMs. First, CARO bootstraps analogical reasoning chains via retrieval-augmented generation (RAG) on moderation data and performs supervised fine-tuning (SFT). Second, we propose a customized direct preference optimization (DPO) approach to reinforce analogical reasoning behaviors explicitly. Unlike static retrieval methods, CARO dynamically generates tailored analogical references during inference, effectively mitigating harmful decision shortcuts. Extensive experiments demonstrate that CARO substantially outperforms state-of-the-art reasoning models (DeepSeek R1, QwQ), specialized moderation models (LLaMA Guard), and advanced fine-tuning and retrieval-augmented methods, achieving an average F1 score improvement of 24.9% on challenging ambiguous moderation benchmarks.
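CARO's second stage customizes direct preference optimization; the standard DPO objective it builds on can be sketched as below (this is the generic loss, not CARO's customized variant).

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss. Inputs are summed token log-probs of the
    chosen (here: analogical-reasoning) and rejected responses under
    the policy and a frozen reference model."""
    pi_margin = logp_w - logp_l            # policy preference margin
    ref_margin = ref_logp_w - ref_logp_l   # reference preference margin
    return -F.logsigmoid(beta * (pi_margin - ref_margin)).mean()
```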
[829] A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning
Xiaoda Yang, Shuai Yang, Can Wang, Jingyang Xue, Menglan Tang, Checheng Yu, Xunzhe Zhou, Sashuai Zhou, Tao Jin, Lixin Yang, Xiangyu Yue, Zhou Zhao
Main category: cs.AI
TL;DR: A progressive training framework for Vision-Language Models that addresses spatiotemporal reasoning hallucination through Chain-of-Thought decomposition and supervised pre-training followed by weak supervision fine-tuning.
Details
Motivation: Current Vision-Language Models struggle with spatiotemporal reasoning, exhibiting "multi-image reasoning hallucination" where they rely on superficial shortcuts rather than genuine causal understanding, as evidenced by massive performance drops between forward and reverse temporal queries.Method: 1) Created a new Chain-of-Thought dataset that decomposes intricate reasoning into detailed spatiotemporal steps and definitive judgments. 2) Developed a progressive training framework: supervised pre-training on the CoT dataset to instill logical structures, followed by fine-tuning with scalable weakly-labeled data for broader generalization.
Result: The approach improves backbone accuracy and dramatically reduces the forward-backward performance gap from over 70% to only 6.53%, demonstrating the method’s ability to develop authentic dynamic reasoning and reduce temporal biases in VLMs.
Conclusion: The proposed progressive training framework with Chain-of-Thought decomposition effectively addresses spatiotemporal reasoning hallucination in Vision-Language Models, enabling more genuine causal understanding and reducing reliance on superficial temporal shortcuts.
Abstract: Vision-Language Models (VLMs) have made significant strides in static image understanding but continue to face critical hurdles in spatiotemporal reasoning. A major bottleneck is “multi-image reasoning hallucination”, where a massive performance drop between forward and reverse temporal queries reveals a dependence on superficial shortcuts instead of genuine causal understanding. To mitigate this, we first develop a new Chain-of-Thought (CoT) dataset that decomposes intricate reasoning into detailed spatiotemporal steps and definitive judgments. Building on this, we present a progressive training framework: it initiates with supervised pre-training on our CoT dataset to instill logical structures, followed by fine-tuning with scalable weakly-labeled data for broader generalization. Our experiments demonstrate that this approach not only improves backbone accuracy but also slashes the forward-backward performance gap from over 70% to only 6.53%. This confirms the method’s ability to develop authentic dynamic reasoning and reduce the inherent temporal biases of current VLMs.
[830] Beyond Compliance: A Resistance-Informed Motivation Reasoning Framework for Challenging Psychological Client Simulation
Danni Liu, Bo Liu, Yuxin Hu, Hantao Zhao, Yan Liu, Ding Ding, Jiahui Jin, Jiuxin Cao
Main category: cs.AI
TL;DR: ResistClient is a psychological client simulator that models challenging client behaviors using Client Resistance Theory, addressing the over-compliance issue in existing simulators through a two-stage training framework called RIMR.
Details
Motivation: Existing psychological client simulators exhibit unrealistic over-compliance, leaving counselors underprepared for challenging behaviors common in real-world practice. There's a need for more realistic simulators that can better prepare trainees and evaluate psychological LLMs under challenging conditions.Method: Proposes Resistance-Informed Motivation Reasoning (RIMR), a two-stage training framework: 1) Supervised fine-tuning on RPC, a large-scale resistance-oriented psychological conversation dataset to mitigate compliance bias; 2) Joint optimization of motivation authenticity and response consistency via process-supervised reinforcement learning, modeling psychologically coherent motivation reasoning before response generation.
Result: ResistClient substantially outperforms existing simulators in challenge fidelity, behavioral plausibility, and reasoning coherence based on extensive automatic and expert evaluations. It also facilitates evaluation of psychological LLMs under challenging conditions.
Conclusion: ResistClient successfully bridges the gap in realistic client simulation by modeling challenging behaviors grounded in Client Resistance Theory, offering new optimization directions for mental health dialogue systems and better preparation for real-world counseling scenarios.
Abstract: Psychological client simulators have emerged as a scalable solution for training and evaluating counselor trainees and psychological LLMs. Yet existing simulators exhibit unrealistic over-compliance, leaving counselors underprepared for the challenging behaviors common in real-world practice. To bridge this gap, we present ResistClient, which systematically models challenging client behaviors grounded in Client Resistance Theory by integrating external behaviors with underlying motivational mechanisms. To this end, we propose Resistance-Informed Motivation Reasoning (RIMR), a two-stage training framework. First, RIMR mitigates compliance bias via supervised fine-tuning on RPC, a large-scale resistance-oriented psychological conversation dataset covering diverse client profiles. Second, beyond surface-level response imitation, RIMR models psychologically coherent motivation reasoning before response generation, jointly optimizing motivation authenticity and response consistency via process-supervised reinforcement learning. Extensive automatic and expert evaluations show that ResistClient substantially outperforms existing simulators in challenge fidelity, behavioral plausibility, and reasoning coherence. Moreover, ResistClient facilitates evaluation of psychological LLMs under challenging conditions, offering new optimization directions for mental health dialogue systems.
[831] Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation
Yanjie He
Main category: cs.AI
TL;DR: LLMs show systematic reasoning failures in policy evaluation, particularly on counter-intuitive cases, revealing a gap between knowledge and reasoning ability.
Details
Motivation: To assess LLMs' reliability for real-world policy evaluation and understand their reasoning limitations, especially when findings contradict common intuition.Method: Constructed benchmark of 40 empirical policy evaluation cases from economics/social science, classified by intuitiveness. Evaluated 4 frontier LLMs across 5 prompting strategies with 2,400 trials, analyzed using mixed-effects logistic regression.
Result: Three key findings: (1) Chain-of-thought paradox - CoT helps obvious cases but not counter-intuitive ones; (2) Intuitiveness dominates variance more than model/prompting; (3) Knowledge-reasoning dissociation - citation familiarity unrelated to accuracy.
Conclusion: LLMs’ “slow thinking” may be “slow talking” - they produce the form of deliberative reasoning without substance, particularly failing when findings contradict intuition.
Abstract: Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy evaluation remains underexplored. We construct a benchmark of 40 empirical policy evaluation cases drawn from economics and social science, each grounded in peer-reviewed evidence and classified by intuitiveness – whether the empirical finding aligns with (obvious), is unclear relative to (ambiguous), or contradicts (counter-intuitive) common prior expectations. We evaluate four frontier LLMs across five prompting strategies with 2,400 experimental trials and analyze the results using mixed-effects logistic regression. Our findings reveal three key results: (1) a chain-of-thought (CoT) paradox, where chain-of-thought prompting dramatically improves performance on obvious cases but this benefit is nearly eliminated on counter-intuitive ones (interaction OR = 0.053, $p < 0.001$); (2) intuitiveness as the dominant factor, explaining more variance than model choice or prompting strategy (ICC = 0.537); and (3) a knowledge-reasoning dissociation, where citation-based familiarity is unrelated to accuracy ($p = 0.53$), suggesting models possess relevant knowledge but fail to reason with it when findings contradict intuition. We frame these results through the lens of dual-process theory (System 1 vs. System 2) and argue that current LLMs’ “slow thinking” may be little more than “slow talking” – they produce the form of deliberative reasoning without the substance.
[832] Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis
Roi Ben-Gigi, Yuval David, Fabiana Fournier, Lior Limonad, Dany Moshkovich, Hadar Mulian, Segev Shlomov
Main category: cs.AI
TL;DR: Agent Mentor: An analytics pipeline that monitors and adapts system prompts in AI agents to improve performance by identifying semantic issues and injecting corrective instructions.
Details
Motivation: AI agent performance is vulnerable to prompt variability and ambiguity. Current approaches require examining both code and internal system prompts in execution logs, which is inefficient. There's a need for automated systems to monitor and adapt prompts to improve agent behavior.Method: An analytics pipeline implemented in the Agent Mentor library that monitors execution logs, identifies semantic features associated with undesired behaviors, and systematically injects corrective instructions into the agent’s knowledge to adapt system prompts incrementally.
Result: The pipeline demonstrates consistent and measurable accuracy improvements across three exemplar agent configurations and benchmark tasks, particularly effective in settings with specification ambiguity. Code released as open source.
Conclusion: The approach shows promise for automating mentoring pipelines within future agentic governance frameworks, enabling systematic improvement of agent performance through prompt adaptation.
Abstract: AI agent development relies heavily on natural language prompting to define agents’ tasks, knowledge, and goals. These prompts are interpreted by Large Language Models (LLMs), which govern agent behavior. Consequently, agentic performance is susceptible to variability arising from imprecise or ambiguous prompt formulations. Identifying and correcting such issues requires examining not only the agent’s code, but also the internal system prompts generated throughout its execution lifecycle, as reflected in execution logs. In this work, we introduce an analytics pipeline implemented as part of the Agent Mentor open-source library that monitors and incrementally adapts the system prompts defining another agent’s behavior. The pipeline improves performance by systematically injecting corrective instructions into the agent’s knowledge. We describe its underlying mechanism, with particular emphasis on identifying semantic features associated with undesired behaviors and using them to derive corrective statements. We evaluate the proposed pipeline across three exemplar agent configurations and benchmark tasks using repeated execution runs to assess effectiveness. These experiments provide an initial exploration of automating such a mentoring pipeline within future agentic governance frameworks. Overall, the approach demonstrates consistent and measurable accuracy improvements across diverse configurations, particularly in settings dominated by specification ambiguity. For reproducibility, we released our code as open source under the Agent Mentor library.
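In spirit, the monitor-and-adapt loop might look like the sketch below; `find_bad_features` and `derive_fix` are hypothetical placeholders for the library's log analytics and corrective-statement derivation, and the `agent` interface is invented.

```python
def mentor_loop(agent, tasks, find_bad_features, derive_fix, epochs=3):
    """Sketch of a monitor-and-adapt loop in the spirit of Agent
    Mentor: run the agent, mine execution logs for semantic features
    correlated with failures, and append corrective instructions to
    the system prompt. All callables are hypothetical placeholders."""
    for _ in range(epochs):
        logs = [agent.run(t) for t in tasks]           # collect traces
        for feat in find_bad_features(logs):           # e.g., ambiguity markers
            correction = derive_fix(feat)              # corrective statement
            agent.system_prompt += "\n" + correction   # incremental injection
    return agent
```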
[833] From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
Xiaoda Yang, Yuxiang Liu, Shenzhou Gao, Can Wang, Jingyang Xue, Lixin Yang, Yao Mu, Tao Jin, Shuicheng Yan, Zhimeng Zhang, Zhou Zhao
Main category: cs.AI
TL;DR: EgoTSR: A curriculum-based framework for embodied spatiotemporal reasoning that evolves from spatial understanding to task-state assessment to long-horizon planning, using a 46M-sample dataset to eliminate temporal biases and achieve 92.4% accuracy on long-horizon reasoning.
Details
Motivation: Current vision-language models fail at embodied, egocentric tasks due to reliance on temporal priors from passive video data, leading to spatiotemporal hallucinations and poor generalization in dynamic environments.Method: Curriculum-based framework with three-stage evolution: explicit spatial understanding → internalized task-state assessment → long-horizon planning. Built EgoTSR-Data (46M samples) with Chain-of-Thought supervision, weakly supervised tagging, and long-horizon sequences.
Result: Achieves 92.4% accuracy on long-horizon logical reasoning tasks while maintaining high fine-grained perceptual precision, significantly outperforming existing open-source and closed-source state-of-the-art models.
Conclusion: EgoTSR effectively eliminates chronological biases and enables robust spatiotemporal reasoning for embodied tasks through a structured curriculum approach.
Abstract: Modern vision-language models achieve strong performance in static perception, but remain limited in the complex spatiotemporal reasoning required for embodied, egocentric tasks. A major source of failure is their reliance on temporal priors learned from passive video data, which often leads to spatiotemporal hallucinations and poor generalization in dynamic environments. To address this, we present EgoTSR, a curriculum-based framework for learning task-oriented spatiotemporal reasoning. EgoTSR is built on the premise that embodied reasoning should evolve from explicit spatial understanding to internalized task-state assessment and finally to long-horizon planning. To support this paradigm, we construct EgoTSR-Data, a large-scale dataset comprising 46 million samples organized into three stages: Chain-of-Thought (CoT) supervision, weakly supervised tagging, and long-horizon sequences. Extensive experiments demonstrate that EgoTSR effectively eliminates chronological biases, achieving 92.4% accuracy on long-horizon logical reasoning tasks while maintaining high fine-grained perceptual precision, significantly outperforming existing open-source and closed-source state-of-the-art models.
[834] Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?
Wanyi Chen, Xiao Yang, Xu Yang, Tianming Sha, Qizheng Li, Zhuo Wang, Bowen Xian, Fang Kong, Weiqing Liu, Jiang Bian
Main category: cs.AI
TL;DR: Agent^2 RL-Bench is a benchmark for evaluating LLM agents’ ability to autonomously design and run complete RL pipelines to improve foundation models, with six tasks across three complexity levels and automated analysis tools.
Details
Motivation: RL post-training is crucial for model alignment and specialization, but existing benchmarks are static and don't test interactive RL engineering capabilities. There's a need to evaluate whether LLM agents can autonomously handle complete RL pipelines.Method: Created a benchmark with six tasks across three levels: static rule-based training, intermediate complexity, and closed-loop online RL with trajectory collection. Provides isolated workspaces with grading API, runtime instrumentation recording all submissions/code revisions, and automated post-hoc analysis generating structured run reports.
Result: Agents achieved striking interactive gains on some tasks (ALFWorld: 5.97 to 93.28 with SFT warm-up and GRPO) but marginal progress on others (DeepSearchQA: +2.75). Driver LLM choice significantly affects performance on interactive tasks (switching drivers changed improvement from near-zero to +78pp). Supervised pipelines dominate under fixed budgets, with online RL succeeding only on ALFWorld.
Conclusion: Agent^2 RL-Bench enables automated diagnostic of agent-driven post-training behavior, revealing that while agents can achieve significant RL improvements, performance varies greatly by task and driver LLM, with supervised methods generally outperforming RL approaches under budget constraints.
Abstract: We introduce Agent^2 RL-Bench, a benchmark for evaluating agentic RL post-training – whether LLM agents can autonomously design, implement, and run complete RL pipelines that improve foundation models. This capability is important because RL post-training increasingly drives model alignment and specialization, yet existing benchmarks remain largely static: supervised fine-tuning alone yields strong results, leaving interactive RL engineering untested. Agent^2 RL-Bench addresses this with six tasks across three levels – from static rule-based training to closed-loop online RL with trajectory collection – each adding a structural requirement that prior levels do not impose. The benchmark provides isolated workspaces with a grading API, runtime instrumentation that records every submission and code revision, and automated post-hoc analysis that generates structured run reports, enabling the first automated diagnostic of agent-driven post-training behavior. Across multiple agent stacks spanning five agent systems and six driver LLMs, we find that agents achieve striking interactive gains – on ALFWorld, an RL-only agent improves from 5.97 to 93.28 via SFT warm-up and GRPO with online rollouts – yet make only marginal progress on others (DeepSearchQA: +2.75 within evaluation noise), and that driver choice has a large effect on interactive tasks – within the same scaffold, switching drivers changes interactive improvement from near-zero to +78pp. More broadly, the benchmark reveals that supervised pipelines dominate agent-driven post-training under fixed budgets, with online RL succeeding as the final best route only on ALFWorld. Code is available at https://github.com/microsoft/RD-Agent/tree/main/rdagent/scenarios/rl/autorl_bench.
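For reference, the GRPO step mentioned in the ALFWorld result computes group-relative advantages; below is a minimal sketch of that standard computation (not the benchmark's or the agents' code).

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each rollout's
    reward by the mean and std of its sampled group, removing the
    need for a learned value baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```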
[835] Failure Ontology: A Lifelong Learning Framework for Blind Spot Detection and Resilience Design
Yuan Sun, Hong Yi, Jinyuan Liu
Main category: cs.AI
TL;DR: A framework called Failure Ontology (F) that identifies and addresses Ontological Blind Spots - conceptual domains missing from a person’s cognitive map that lead to catastrophic life failures, rather than focusing on knowledge acquisition efficiency.
Details
Motivation: Current personalized learning systems focus on efficient knowledge acquisition, but miss the more consequential problem: catastrophic life failures (financial ruin, health collapse, professional obsolescence) arise from systematic absences of entire conceptual domains from a person's cognitive map - what the authors call Ontological Blind Spots.Method: Introduces Failure Ontology (F) framework with: (1) four-type taxonomy of blind spots (domain, structural, weight, temporal blindness), (2) five convergent failure patterns showing how blind spots interact with external disruption, and (3) Failure Learning Efficiency Theorem proving failure-based learning achieves higher sample efficiency than success-based learning under bounded historical data. Illustrated through historical case analysis (1997 Asian Financial Crisis, 2008 subprime mortgage crisis) and longitudinal individual case study across five life stages.
Result: The framework provides a formal approach to detect, classify, and remediate Ontological Blind Spots across a human lifetime, shifting focus from knowledge acquisition efficiency to preventing catastrophic failures caused by missing conceptual domains.
Conclusion: Personalized learning should shift from optimizing knowledge acquisition to identifying and addressing Ontological Blind Spots that lead to catastrophic life failures, using the Failure Ontology framework to detect, classify, and remediate these blind spots across different life stages.
Abstract: Personalized learning systems are almost universally designed around a single objective: help people acquire knowledge and skills more efficiently. We argue this framing misses the more consequential problem. The most damaging failures in human life (financial ruin, health collapse, professional obsolescence) are rarely caused by insufficient knowledge acquisition. They arise from the systematic absence of entire conceptual territories from a person’s cognitive map: domains they never thought to explore because, from within their existing worldview, those domains did not appear to exist or to matter. We call such absences Ontological Blind Spots and introduce Failure Ontology (F), a formal framework for detecting, classifying, and remediating them across a human lifetime. The framework introduces three original contributions: (1) a four-type taxonomy of blind spots distinguishing domain blindness, structural blindness, weight blindness, and temporal blindness; (2) five convergent failure patterns characterizing how blind spots interact with external disruption to produce catastrophic outcomes; and (3) the Failure Learning Efficiency Theorem, proving that failure-based learning achieves higher sample efficiency than success-based learning under bounded historical data. We illustrate the framework through historical case analysis of the 1997 Asian Financial Crisis and the 2008 subprime mortgage crisis, and through a longitudinal individual case study spanning five life stages.
[836] Working Paper: Towards Schema-based Learning from a Category-Theoretic Perspective
Pablo de los Riscos, Fernando J. Corbacho, Michael A. Arbib
Main category: cs.AI
TL;DR: A hierarchical categorical framework for Schema-Based Learning (SBL) structured across four levels: schema, agent, architecture, and world, using category theory to formalize schemas, cognition, embodiment, and multi-agent interactions.
Details
Motivation: To provide a rigorous mathematical foundation for Schema-Based Learning by developing a hierarchical categorical framework that can formally represent schemas, their transformations, cognitive processes, embodiment, and multi-agent interactions in a unified way.Method: Uses category theory to construct a four-level hierarchical framework: (1) Schema level with syntactic schemas, implementation functors, and probabilistic models via Giry monad; (2) Agent level with duoidal structures for workflows, mental objects, and cognitive modules; (3) Architecture level for comparing heterogeneous paradigms; (4) World level for multi-agent interactions.
Result: Develops a comprehensive categorical framework that formally links schema semantics, cognition, embodiment, architectural abstraction, and world-level interaction through a weak hierarchical n-categorical structure.
Conclusion: The framework provides a rigorous mathematical foundation for Schema-Based Learning that can formally represent cognitive architectures, embodiment, and multi-agent systems using category theory, enabling precise comparison and analysis of different learning paradigms.
Abstract: We introduce a hierarchical categorical framework for Schema-Based Learning (SBL) structured across four interconnected levels. At the schema level, a free multicategory $Sch_{syn}$ encodes fundamental schemas and transformations. An implementation functor $\mathcal{I}$ maps syntactic schemas to representational languages, inducing via the Grothendieck construction the total category $Sch_{impl}$. Implemented schemas are mapped by a functor $Model$ into the Kleisli category $\mathbf{KL(G)}$ of the Giry monad, yielding probabilistic models, while an instances presheaf assigns evaluated instance spaces. A semantic category $Sch_{sem}$, defined as a full subcategory of $\mathbf{KL(G)}$, provides semantic grounding through an interpretation functor from $Sch_{impl}$. At the agent level, $Sch_{impl}$ is equipped with a duoidal structure $\mathcal{O}_{Sch}$ supporting schema-based workflows. A left duoidal action on the category $Mind$ enables workflow execution over mental objects, whose components include mental spaces, predictive models, and a cognitive kernel composed of memory and cognitive modules. Each module is specified by schema-typed interfaces, duoidal workflows, a success condition, and a logical signature. Memory is formalized categorically via memory subsystems, a presheaf $Data_M$, a monoidal operation category $Ops_M$, and read/write natural transformations. Together with the $Body$ category, Mind defines the embodied SBL agent. At higher levels, SBL is represented as an object of the agent architecture category $ArchCat$, enabling comparison with heterogeneous paradigms, while the $World$ category models multi-agent and agent-environment interactions. Altogether, the framework forms a weak hierarchical $n$-categorical structure linking schema semantics, cognition, embodiment, architectural abstraction, and world-level interaction.
[837] Enhancing Cross-Problem Vehicle Routing via Federated Learning
Xiangchi Meng, Jianan Zhou, Jie Gao, Yifan Lu, Yaoxin Wu, Gonglin Yuan, Yaqing Hou
Main category: cs.AI
TL;DR: Proposes MPSF-FL framework for vehicle routing problems using federated learning to enable cross-problem knowledge transfer from pre-trained global model to specialized local models.
Details
Motivation: Current neural combinatorial optimization approaches for vehicle routing problems suffer from performance degradation and poor generalizability when transferring from simple to complex constraint variants. Need better cross-problem learning paradigms.Method: “Multi-problem Pre-train, then Single-problem Fine-tune” with Federated Learning (MPSF-FL). Uses federated global model to share common knowledge across problems, then fine-tunes local models for specific VRP variants with heterogeneous constraints.
Result: Framework enhances performance in diverse VRPs and improves generalizability to unseen problems compared to existing approaches.
Conclusion: MPSF-FL effectively addresses cross-problem learning challenges in neural combinatorial optimization for vehicle routing problems through federated knowledge sharing and transfer.
Abstract: Vehicle routing problems (VRPs) constitute a core optimization challenge in modern logistics and supply chain management. Recent neural combinatorial optimization (NCO) has demonstrated superior efficiency over some traditional algorithms. While serving as a primary NCO approach for solving general VRPs, current cross-problem learning paradigms are still subject to performance degradation and generalizability decay when transferring from simple VRP variants to those involving different and complex constraints. To strengthen these paradigms, this paper offers an innovative “Multi-problem Pre-train, then Single-problem Fine-tune” framework with Federated Learning (MPSF-FL). This framework exploits the common knowledge of a federated global model to foster efficient cross-problem knowledge sharing and transfer among local models for single-problem fine-tuning. In this way, local models effectively retain common VRP knowledge from the up-to-date global model, while being efficiently adapted to downstream VRPs with heterogeneous complex constraints. Experimental results demonstrate that our framework not only enhances performance on diverse VRPs, but also improves generalizability to unseen problems.
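The up-to-date global model presumably comes from a federated aggregation step; a standard FedAvg sketch is shown below for reference (MPSF-FL's actual aggregation may differ).

```python
import torch

def fedavg(client_states, client_sizes):
    """Standard FedAvg aggregation: weight each client's parameters
    by its local dataset size. client_states is a list of state
    dicts with identical keys; client_sizes the matching sample counts."""
    total = sum(client_sizes)
    global_state = {}
    for key in client_states[0]:
        global_state[key] = sum(
            (n / total) * state[key]
            for state, n in zip(client_states, client_sizes)
        )
    return global_state
```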
[838] Preference-Agile Multi-Objective Optimization for Real-time Vehicle Dispatching
Jiahuan Jin, Wenhao Zhao, Rong Qu, Jianfeng Ren, Xinan Chen, Qingfu Zhang, Ruibin Bai
Main category: cs.AI
TL;DR: PAMOO is a preference-agile multi-objective optimization method using deep reinforcement learning that allows dynamic adjustment of user preferences for sequential decision problems, demonstrated on vehicle dispatching.
Details
Motivation: Existing multi-objective optimization methods are either deterministic (not practical) or non-sequential (can't handle real-life complexities), and there's growing demand for dynamic MOO that allows real-time adjustment of objective priorities in response to market dynamics.Method: Proposes a uniform model within a deep reinforcement learning framework that takes users’ dynamic preference vectors as explicit inputs, with a calibration function to ensure alignment between preference inputs and output DRL decision policy.
Result: Extensive experiments on real-life vehicle dispatching problems at container terminals showed PAMOO obtains superior performance and generalization ability compared to two popular MOO methods.
Conclusion: PAMOO presents the first dynamic MOO method for challenging dynamic sequential MOO decision problems, enabling users to dynamically adjust and interactively assign preferences on the fly.
Abstract: Multi-objective optimization (MOO) has been widely studied in the literature because of its versatility in human-centered decision making in real-life applications. Recently, demand for dynamic MOO has been fast-emerging, as tough market dynamics require real-time re-adjustment of the priorities of different objectives. However, most existing studies focus either on deterministic MOO problems, which are not practical, or on non-sequential dynamic MOO decision problems that cannot deal with some real-life complexities. To address these challenges, preference-agile multi-objective optimization (PAMOO) is proposed in this paper to permit users to dynamically adjust and interactively assign preferences on the fly. To achieve this, a novel uniform model within a deep reinforcement learning (DRL) framework is proposed that explicitly takes users’ dynamic preference vectors as inputs. Additionally, a calibration function is fitted to ensure high-quality alignment between the preference vector inputs and the output DRL decision policy. Extensive experiments on challenging real-life vehicle dispatching problems at a container terminal showed that PAMOO obtains superior performance and generalization ability compared with the two most popular MOO methods. Our method presents the first dynamic MOO method for challenging dynamic sequential MOO decision problems.
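A toy version of a preference-conditioned policy, where the preference vector is an explicit network input that can change at every decision step; the sizes and architecture here are illustrative assumptions, not PAMOO's model.

```python
import torch
import torch.nn as nn

class PreferenceConditionedPolicy(nn.Module):
    """Toy policy taking the state and a user preference vector over
    objectives as explicit inputs, in the spirit of a uniform
    preference-agile DRL model."""
    def __init__(self, state_dim=32, n_objectives=3, n_actions=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_objectives, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state, preference):
        # preference is a simplex vector, e.g. [0.7, 0.2, 0.1],
        # re-assignable at every decision step without retraining.
        return self.net(torch.cat([state, preference], dim=-1))
```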
[839] Principles Do Not Apply Themselves: A Hermeneutic Perspective on AI Alignment
Behrooz Razeghi
Main category: cs.AI
TL;DR: AI alignment requires interpretive judgment when principles conflict or are ambiguous, not just following stated rules, and this interpretive component appears in deployment behavior rather than just training data.
Details
Motivation: The paper addresses the limitation of current AI alignment approaches that focus on following stated principles, arguing that real-world alignment requires interpretive judgment when principles conflict, are too broad, or facts are unclear.Method: The paper uses hermeneutics (theory of interpretation) to analyze alignment, connects to empirical findings about preference-labeling data, distinguishes between deployment-induced and corpus-induced evaluation, and shows how off-policy audits can miss alignment failures.
Result: The analysis reveals that a substantial portion of preference-labeling data involves principle conflicts or indifference, and that alignment-relevant choices often only appear in deployment-time response distributions rather than training data.
Conclusion: AI alignment includes a context-dependent interpretive component that requires judgment in applying principles, and evaluation must consider deployment behavior rather than just training data to capture alignment failures.
Abstract: AI alignment is often framed as the task of ensuring that an AI system follows a set of stated principles or human preferences, but general principles rarely determine their own application in concrete cases. When principles conflict, when they are too broad to settle a situation, or when the relevant facts are unclear, an additional act of judgment is required. This paper analyzes that step through the lens of hermeneutics and argues that alignment therefore includes an interpretive component: it involves context-sensitive judgments about how principles should be read, applied, and prioritized in practice. We connect this claim to recent empirical findings showing that a substantial portion of preference-labeling data falls into cases of principle conflict or indifference, where the principle set does not uniquely determine a decision. We then draw an operational consequence: because such judgments are expressed in behavior, many alignment-relevant choices appear only in the distribution of responses a model generates at deployment time. To formalize this point, we distinguish deployment-induced and corpus-induced evaluation and show that off-policy audits can fail to capture alignment-relevant failures when the two response distributions differ. We argue that principle-specified alignment includes a context-dependent interpretive component.
[840] FedRio: Personalized Federated Social Bot Detection via Cooperative Reinforced Contrastive Adversarial Distillation
Yingguang Yang, Hao Liu, Xin Zhang, Yunhui Liu, Yutong Xia, Qi Wu, Hao Peng, Taoran Liang, Bin Chong, Tieke He, Philip S. Yu
Main category: cs.AI
TL;DR: FedRio: A federated learning framework for cross-platform social bot detection using GNNs, adversarial distillation, and reinforcement learning to handle data heterogeneity while preserving privacy.
Details
Motivation: Current bot detection models operate in isolation, missing opportunities to leverage shared patterns across platforms. Data heterogeneity and model architecture differences make cross-platform detection challenging, while privacy concerns prevent data sharing.Method: Uses adaptive message-passing GNNs as backbone, federated knowledge extraction with GANs, multi-stage adversarial contrastive learning for feature consistency, and reinforcement learning for parameter control in heterogeneous federated settings.
Result: Outperforms state-of-the-art federated learning baselines in detection accuracy, communication efficiency, and feature consistency on two real-world benchmarks, remaining competitive with centralized results under stronger privacy constraints.
Conclusion: FedRio effectively addresses cross-platform bot detection challenges through personalized federated learning with adversarial distillation and reinforcement learning, enabling efficient knowledge sharing while preserving privacy.
Abstract: Social bot detection is critical to the stability and security of online social platforms. However, current state-of-the-art bot detection models are largely developed in isolation, overlooking the benefits of leveraging shared detection patterns across platforms to improve performance and promptly identify emerging bot variants. The heterogeneity of data distributions and model architectures further complicates the design of an effective cross-platform and cross-model detection framework. To address these challenges, we propose FedRio, a personalized federated social bot detection framework built on cooperative reinforced contrastive adversarial distillation. We first introduce an adaptive message-passing module as the graph neural network backbone for each client. To facilitate efficient knowledge sharing of global data distributions, we design a federated knowledge extraction mechanism based on generative adversarial networks. Additionally, we employ a multi-stage adversarial contrastive learning strategy to enforce feature space consistency among clients and reduce divergence between local and global models. Finally, we adopt adaptive server-side parameter aggregation and reinforcement learning-based client-side parameter control to better accommodate data heterogeneity in heterogeneous federated settings. Extensive experiments on two real-world social bot detection benchmarks demonstrate that FedRio consistently outperforms state-of-the-art federated learning baselines in detection accuracy, communication efficiency, and feature space consistency, while remaining competitive with published centralized results under substantially stronger privacy constraints.
[841] Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks
Weijiang Li, Yilin Zhu, Rajarshi Das, Parijat Dube
Main category: cs.AI
TL;DR: LLMs show poor spatial reasoning capabilities in maze tasks, with performance heavily dependent on representation format and prompting, suggesting they lack robust internal spatial world models.
Details
Motivation: To systematically evaluate whether foundation models can construct internal spatial world models for reasoning and planning, using maze tasks as a controlled testing context.Method: Comprehensive experiments with multiple LLMs (Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, DeepSeek-Chat) using maze tasks with different representations (tokenized adjacency vs. visual grid formats), chain-of-thought prompting, and spatial reasoning probes including sequential proximity questions and compositional distance comparisons.
Result: Gemini achieved 80-86% accuracy on smaller mazes with tokenized adjacency representations but collapsed to 16-34% with visual grid formats (2-5x difference). Despite 96-99% semantic coverage in reasoning traces, models failed to leverage spatial understanding consistently, treating questions independently rather than building cumulative spatial knowledge.
Conclusion: LLMs do not develop robust spatial world models but exhibit representation-specific and prompting-dependent reasoning that succeeds only under narrow conditions, with critical implications for deploying foundation models in applications requiring spatial abstraction.
Abstract: Foundation models have shown remarkable performance across diverse tasks, yet their ability to construct internal spatial world models for reasoning and planning remains unclear. We systematically evaluate the spatial understanding of large language models through maze tasks, a controlled testing context requiring multi-step planning and spatial abstraction. Across comprehensive experiments with Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat, we uncover significant discrepancies in spatial reasoning that challenge assumptions about LLM planning capabilities. Using chain-of-thought prompting, Gemini achieves 80-86% accuracy on smaller mazes (5x5 to 7x7 grids) with tokenized adjacency representations, but performance collapses to 16-34% with visual grid formats, which is a 2-5x difference, suggesting representation-dependent rather than format-invariant spatial reasoning. We further probe spatial understanding through sequential proximity questions and compositional distance comparisons. Despite achieving 96-99% semantic coverage in reasoning traces, models fail to leverage this understanding for consistent spatial computations, indicating that they treat each question independently rather than building cumulative spatial knowledge. Our findings based on the maze-solving tasks suggest that LLMs do not develop robust spatial world models, but rather exhibit representation-specific and prompting-dependent reasoning that succeeds only under narrow conditions. These results have critical implications for deploying foundation models in applications requiring spatial abstraction.
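The two prompt formats whose gap the paper measures can be illustrated on a tiny maze; the exact serializations used in the experiments may differ from this sketch.

```python
# Two ways to serialize the same 3x3 maze for an LLM prompt.
# '#' = wall, '.' = open, S/G = start/goal (illustrative format).
grid_format = "S.#\n..#\n#.G"

# Tokenized adjacency: explicit edges between open cells (row, col).
# The same connectivity, but spelled out rather than implied visually.
adjacency_format = [
    ((0, 0), (0, 1)), ((0, 0), (1, 0)),
    ((0, 1), (1, 1)), ((1, 0), (1, 1)),
    ((1, 1), (2, 1)), ((2, 1), (2, 2)),
]
```

The paper's 2-5x accuracy gap between these two encodings of identical mazes is what motivates the conclusion that the models' spatial reasoning is representation-dependent rather than format-invariant.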
[842] FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning
Yuxi Sun, Aoqi Zuo, Haotian Xie, Wei Gao, Mingming Gong, Jing Ma
Main category: cs.AI
TL;DR: FACT-E is a causality-inspired framework that uses controlled perturbations to evaluate Chain-of-Thought reasoning faithfulness, separating genuine step-to-step dependencies from bias-driven artifacts for more reliable self-evaluation.
Details
Motivation: Current CoT prompting suffers from models generating explanations that appear coherent but contain unfaithful intermediate steps. Existing self-evaluation approaches are prone to inherent biases where models may confidently endorse coherence even when step-to-step implications are not valid, leading to unreliable faithfulness evaluation.Method: FACT-E uses controlled perturbations as an instrumental signal to separate genuine step-to-step dependence from bias-driven artifacts, producing reliable faithfulness estimates (intra-chain faithfulness). It jointly considers intra-chain faithfulness and CoT-to-answer consistency to select trustworthy trajectories.
Result: Experiments on GSM8K, MATH, and CommonsenseQA show that FACT-E improves reasoning-trajectory selection and yields stronger in-context learning exemplars. FACT-E also reliably detects flawed reasoning under noisy conditions.
Conclusion: FACT-E provides a robust causality-inspired framework for evaluating CoT quality, offering more reliable faithfulness evaluation and better trajectory selection for trustworthy LLM reasoning.
Abstract: Chain-of-Thought (CoT) prompting has improved LLM reasoning, but models often generate explanations that appear coherent while containing unfaithful intermediate steps. Existing self-evaluation approaches are prone to inherent biases: the model may confidently endorse coherence even when the step-to-step implication is not valid, leading to unreliable faithfulness evaluation. We propose FACT-E, a causality-inspired framework for evaluating CoT quality. FACT-E uses controlled perturbations as an instrumental signal to separate genuine step-to-step dependence from bias-driven artifacts, producing more reliable faithfulness estimates (intra-chain faithfulness). To select trustworthy trajectories, FACT-E jointly considers intra-chain faithfulness and CoT-to-answer consistency, ensuring that selected chains are both faithful internally and supportive of the correct final answer. Experiments on GSM8K, MATH, and CommonsenseQA show that FACT-E improves reasoning-trajectory selection and yields stronger in-context learning exemplars. FACT-E also reliably detects flawed reasoning under noisy conditions, providing a robust metric for trustworthy LLM reasoning.
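The perturbation-as-instrument idea can be sketched as a step-dependence probe; `model_prob` and `perturb` are hypothetical interfaces here, not FACT-E's actual estimator.

```python
def step_dependence(chain, model_prob, perturb):
    """Probe whether step i+1 genuinely depends on step i: perturb
    step i and measure how much the model's probability of step i+1
    drops. A faithful chain is sensitive to the perturbation; a
    bias-driven endorsement of coherence is not."""
    scores = []
    for i in range(len(chain) - 1):
        p_orig = model_prob(chain[i + 1], context=chain[: i + 1])
        corrupted = chain[:i] + [perturb(chain[i])]
        p_pert = model_prob(chain[i + 1], context=corrupted)
        scores.append(p_orig - p_pert)  # large drop = genuine dependence
    return scores
```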
[843] Camyla: Scaling Autonomous Research in Medical Image Segmentation
Yifan Gao, Haoyue Li, Feng Yuan, Xin Gao, Weiran Huang, Xiaosong Wang
Main category: cs.AI
TL;DR: Camyla is an autonomous research system for medical image segmentation that generates research proposals, experiments, and manuscripts without human intervention, outperforming established baselines on 24/31 datasets.
Details
Motivation: The paper addresses challenges in autonomous scientific research: search effort drifting to unpromising directions, knowledge degradation from earlier trials, and repetitive recovery from failures in long-horizon experimentation.Method: Combines three mechanisms: Quality-Weighted Branch Exploration for effort allocation, Layered Reflective Memory for knowledge retention/compression, and Divergent Diagnostic Feedback for diversified recovery after failures.
Result: Generated 2,700+ novel model implementations and 40 complete manuscripts; surpassed strongest per-dataset baselines (including nnU-Net) on 24/31 datasets; manuscripts scored at T1/T2 journal boundary; outperformed AutoML/NAS and research agents.
Conclusion: Domain-scale autonomous research is achievable in medical image segmentation, with Camyla demonstrating comprehensive autonomous research capabilities from data to publication.
Abstract: We present Camyla, a system for fully autonomous research within the scientific domain of medical image segmentation. Camyla transforms raw datasets into literature-grounded research proposals, executable experiments, and complete manuscripts without human intervention. Autonomous experimentation over long horizons poses three interrelated challenges: search effort drifts toward unpromising directions, knowledge from earlier trials degrades as context accumulates, and recovery from failures collapses into repetitive incremental fixes. To address these challenges, the system combines three coupled mechanisms: Quality-Weighted Branch Exploration for allocating effort across competing proposals, Layered Reflective Memory for retaining and compressing cross-trial knowledge at multiple granularities, and Divergent Diagnostic Feedback for diversifying recovery after underperforming trials. The system is evaluated on CamylaBench, a contamination-free benchmark of 31 datasets constructed exclusively from 2025 publications, under a strict zero-intervention protocol across two independent runs within a total of 28 days on an 8-GPU cluster. Across the two runs, Camyla generates more than 2,700 novel model implementations and 40 complete manuscripts, and surpasses the strongest per-dataset baseline selected from 14 established architectures, including nnU-Net, on 22 and 18 of 31 datasets under identical training budgets, respectively (union: 24/31). Senior human reviewers score the generated manuscripts at the T1/T2 boundary of contemporary medical imaging journals. Relative to automated baselines, Camyla outperforms AutoML and NAS systems on aggregate segmentation performance and exceeds six open-ended research agents on both task completion and baseline-surpassing frequency. These results suggest that domain-scale autonomous research is achievable in medical image segmentation.
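Quality-Weighted Branch Exploration can be pictured as sampling the next proposal branch to expand in proportion to a quality-derived weight. The softmax weighting and the quality signal below are my assumptions about the general shape of such a mechanism, not the paper's exact rule:

```python
import math
import random

def pick_branch(branches, temperature=1.0):
    """branches: dict branch_id -> running quality score (e.g. best
    validation Dice achieved under that proposal so far). Higher-quality
    branches receive more trials, but exploration never fully starves
    a branch, counteracting drift toward unpromising directions."""
    ids = list(branches)
    weights = [math.exp(branches[b] / temperature) for b in ids]
    return random.choices(ids, weights=weights, k=1)[0]

branches = {"attention-unet-variant": 0.84, "mamba-decoder": 0.79,
            "frequency-domain-aug": 0.81}
print(pick_branch(branches))
```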
[844] SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
Udari Madhushani Sehwag, Elaine Lau, Haniyeh Ehsani Oskouie, Shayan Shabihi, Erich Liang, Andrea Toledo, Guillermo Mangialardi, Sergio Fonrouge, Ed-Yeremai Hernandez Cardona, Paula Vergara, Utkarsh Tyagi, Chen Bo Calvin Zhang, Pavi Bhatter, Nicholas Johnson, Furong Huang, Ernesto Gabriel Hernandez Montoya, Bing Liu
Main category: cs.AI
TL;DR: SciPredict benchmark evaluates LLMs’ ability to predict experimental outcomes across physics, biology, and chemistry, finding current models perform poorly (14-26% accuracy) and lack reliability awareness.
Details
Motivation: To assess whether LLMs can predict experimental outcomes accurately enough to guide scientific research, addressing a gap in existing benchmarks that focus on knowledge and reasoning but not experimental prediction.
Method: Created SciPredict benchmark with 405 tasks from 33 specialized sub-fields across physics, biology, and chemistry, evaluating LLMs’ prediction accuracy and reliability calibration compared to human experts.
Result: LLMs achieve only 14-26% accuracy, similar to human experts (~20%). Models fail to distinguish reliable from unreliable predictions (~20% accuracy regardless of confidence), while human experts show strong calibration (5-80% accuracy based on predictability assessment).
Conclusion: Current LLMs are inadequate for reliable experimental guidance; superhuman performance requires not just better predictions but better awareness of prediction reliability. The benchmark provides a rigorous framework for evaluating experimental prediction capabilities.
Abstract: Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes - a task where AI could significantly exceed human capabilities - remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14-26% and human expert performance is $\approx$20%. Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. Even within the limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only $\approx$20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from $\approx$5% to $\approx$80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility, all our data and code are provided at https://github.com/scaleapi/scipredict
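The calibration gap the paper describes is straightforward to measure: bucket predictions by the predictor's own reliability rating and compare per-bucket accuracy. A minimal sketch, where the record fields are illustrative rather than the benchmark's actual schema:

```python
from collections import defaultdict

def calibration_profile(records):
    """records: iterable of dicts with 'predictability' (1-5 self-rating)
    and 'correct' (bool). Returns per-rating accuracy.

    A calibrated predictor (like the human experts in SciPredict) shows
    accuracy rising with the rating; the evaluated LLMs stay flat (~20%)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["predictability"]] += 1
        hits[r["predictability"]] += int(r["correct"])
    return {k: hits[k] / totals[k] for k in sorted(totals)}

# Toy usage: a flat profile regardless of self-rating signals miscalibration.
demo = [{"predictability": p, "correct": (i % 5 == 0)}
        for p in range(1, 6) for i in range(20)]
print(calibration_profile(demo))  # ~0.2 at every rating
```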
[845] Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation
Charles Koutcheme, Arto Hellas, Juho Leinonen
Main category: cs.AI
TL;DR: Training open-weight AI programming learners using authentic student process data serialized as conversations between learners and assessment systems, improving over proprietary LLMs in replicating debugging behavior.
Details
Motivation: Need for artificial programming learners to evaluate tutoring strategies at scale without relying on proprietary large language models due to privacy, cost, and dependency concerns.
Method: Serialize temporal log traces into conversational format (student submissions + environment feedback as alternating turns), train Qwen models (4B/8B) on real student Python submissions using supervised fine-tuning with preference optimization.
Result: Models incorporating environment feedback better replicate student debugging behavior, outperforming code-only approaches and prompted LLM baselines in functional alignment and code similarity.
Conclusion: Open-weight artificial programming learners trained on authentic student process data can effectively simulate debugging behavior, offering privacy-preserving, cost-effective alternatives to proprietary models.
Abstract: Artificial models that simulate how learners act and respond within educational systems are a promising tool for evaluating tutoring strategies and feedback mechanisms at scale. However, many existing approaches in programming education rely on prompting large, proprietary language models, raising concerns around privacy, cost, and dependence. In this work, we propose a method for training open-weight artificial programming learners using authentic student process data. Our approach serializes temporal log traces into a conversational format, representing each student’s problem-solving process as a dialogue between the learner and their automated assessment system. Student code submissions and environment feedback, such as test outcomes, grades, and error traces, form alternating conversational turns, enabling models to learn from the iterative debugging process. We additionally introduce a training pipeline combining supervised fine-tuning with preference optimization to align models with authentic student debugging behavior. We evaluate our framework by training Qwen models at 4B and 8B scales on a large-scale dataset of real student submissions to Python programming assignments. Our results show that incorporating environment feedback strengthens the models’ ability to replicate student debugging behavior, improving over both prior code-only approaches and prompted large language model baselines in functional alignment and code similarity. We release our code to support reproducibility.
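The serialization step is easy to picture: each logged (submission, feedback) pair becomes one exchange in a chat transcript, with the student as the assistant and the assessment environment as the user. A minimal sketch of that transformation, where the record fields are assumptions rather than the authors' exact schema:

```python
def serialize_trace(attempts):
    """Turn a temporal log of student attempts into chat-format turns.

    attempts: list of dicts with 'code' (the submission) and 'feedback'
    (test outcomes / error traces from the automated assessment system)."""
    messages = [{"role": "system",
                 "content": "You are a student solving a Python assignment."}]
    for a in attempts:
        messages.append({"role": "assistant", "content": a["code"]})
        messages.append({"role": "user", "content": a["feedback"]})
    return messages

trace = serialize_trace([
    {"code": "def add(a, b): return a - b",
     "feedback": "FAILED test_add: expected 3, got -1"},
    {"code": "def add(a, b): return a + b", "feedback": "All tests passed."},
])
```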
[846] When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling
Shu Zhou, Rui Ling, Junan Chen, Xin Wang, Tao Fan, Hao Wang
Main category: cs.AI
TL;DR: The paper investigates diminishing returns of extended reasoning in LLMs, finding that longer thinking doesn’t always yield better results and can lead to “overthinking” where models abandon correct answers.
Details
Motivation: To challenge the assumption that longer chains of thought always improve reasoning in large language models, and to systematically examine how marginal utility of additional reasoning tokens changes with increased compute budgets.
Method: Systematic investigation of how marginal returns from additional reasoning tokens diminish at higher budgets, analysis of “overthinking” phenomenon, and development of a cost-aware evaluation framework to determine optimal thinking lengths.
Result: Marginal returns diminish substantially at higher budgets; models exhibit overthinking where extended reasoning leads to abandoning previously correct answers; optimal thinking length varies by problem difficulty; stopping at moderate budgets can reduce computation significantly while maintaining comparable accuracy.
Conclusion: Uniform compute allocation is suboptimal, and a cost-aware approach to reasoning length can significantly reduce computation while maintaining performance, challenging the prevailing assumption that longer thinking always improves results.
Abstract: Scaling test-time compute through extended chains of thought has become a dominant paradigm for improving large language model reasoning. However, existing research implicitly assumes that longer thinking always yields better results. This assumption remains largely unexamined. We systematically investigate how the marginal utility of additional reasoning tokens changes as compute budgets increase. We find that marginal returns diminish substantially at higher budgets and that models exhibit “overthinking”, where extended reasoning is associated with abandoning previously correct answers. Furthermore, we show that optimal thinking length varies across problem difficulty, suggesting that uniform compute allocation is suboptimal. Our cost-aware evaluation framework reveals that stopping at moderate budgets can reduce computation significantly while maintaining comparable accuracy.
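A cost-aware stopping rule of the kind the paper advocates can be sketched as follows: sweep thinking budgets and stop at the first budget where the marginal accuracy gain per extra token falls below a price threshold. The accuracy-vs-budget curve below is an invented stand-in, not the paper's measured data:

```python
def optimal_budget(acc_at_budget, budgets, token_price):
    """Pick the smallest budget beyond which extra thinking isn't worth it.

    acc_at_budget: dict budget -> accuracy (measured on a dev set).
    token_price: minimum accuracy gain required per additional token."""
    best = budgets[0]
    for prev, cur in zip(budgets, budgets[1:]):
        gain = acc_at_budget[cur] - acc_at_budget[prev]
        if gain / (cur - prev) < token_price:
            break  # diminishing (or negative: overthinking) returns
        best = cur
    return best

# Illustrative curve: accuracy saturates, then dips as overthinking sets in.
curve = {256: 0.62, 512: 0.71, 1024: 0.74, 2048: 0.75, 4096: 0.73}
print(optimal_budget(curve, sorted(curve), token_price=5e-5))  # -> 1024
```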
[847] Learning Preference-Based Objectives from Clinical Narratives for Sequential Treatment Decision-Making
Daniel J. Tan, Kay Choong See, Mengling Feng
Main category: cs.AI
TL;DR: CN-PR framework uses clinical narratives and LLMs to learn reward functions for healthcare RL, enabling better treatment policies without handcrafted rewards.
Details
Motivation: Traditional RL reward design in healthcare is challenging due to sparse, delayed outcomes. Clinical narratives contain valuable implicit evaluations of treatment effectiveness that structured data misses.
Method: Uses LLMs to derive trajectory quality scores from discharge summaries, constructs pairwise preferences over patient trajectories, learns rewards via preference-based objective with confidence weighting for narrative informativeness.
Result: Learned reward aligns strongly with trajectory quality (Spearman rho = 0.63), enables policies improving recovery outcomes (organ support-free days, faster shock resolution) while maintaining mortality performance.
Conclusion: Narrative-derived supervision provides scalable, expressive alternative to handcrafted rewards for dynamic treatment regimes in healthcare RL.
Abstract: Designing reward functions remains a central challenge in reinforcement learning (RL) for healthcare, where outcomes are sparse, delayed, and difficult to specify. While structured data capture physiological states, they often fail to reflect the overall quality of a patient’s clinical trajectory, including recovery dynamics, treatment burden, and stability. Clinical narratives, in contrast, summarize longitudinal reasoning and implicitly encode evaluations of treatment effectiveness. We propose Clinical Narrative-informed Preference Rewards (CN-PR), a framework for learning reward functions directly from discharge summaries by treating them as scalable supervision for trajectory-level preferences. Using a large language model, we derive trajectory quality scores (TQS) and construct pairwise preferences over patient trajectories, enabling reward learning via a structured preference-based objective. To account for variability in narrative informativeness, we incorporate a confidence signal that weights supervision based on its relevance to the decision-making task. The learned reward aligns strongly with trajectory quality (Spearman rho = 0.63) and enables policies that are consistently associated with improved recovery-related outcomes, including increased organ support-free days and faster shock resolution, while maintaining comparable performance on mortality. These effects persist under external validation. Our results demonstrate that narrative-derived supervision provides a scalable and expressive alternative to handcrafted or outcome-based reward design for dynamic treatment regimes.
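The preference-based reward objective is the standard Bradley-Terry form, here extended with a confidence weight per pair as the abstract describes. A minimal PyTorch sketch, where the reward network, feature shapes, and exact weighting form are assumptions about the general recipe rather than CN-PR's released implementation:

```python
import torch
import torch.nn as nn

class TrajectoryReward(nn.Module):
    """Maps a trajectory feature vector to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def cnpr_loss(reward, preferred, rejected, confidence):
    """Confidence-weighted Bradley-Terry loss over trajectory pairs.

    preferred/rejected: (B, dim) features of trajectories ordered by their
    LLM-derived quality scores; confidence: (B,) weights reflecting how
    informative each narrative pair is."""
    margin = reward(preferred) - reward(rejected)
    return -(confidence * nn.functional.logsigmoid(margin)).mean()

# Toy training step on random data.
r = TrajectoryReward(dim=16)
opt = torch.optim.Adam(r.parameters(), lr=1e-3)
loss = cnpr_loss(r, torch.randn(8, 16), torch.randn(8, 16), torch.rand(8))
loss.backward(); opt.step()
```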
[848] TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
Yinyi Luo, Wenwen Wang, Hayes Bai, Hongyu Zhu, Hao Chen, Pan He, Marios Savvides, Sharon Li, Jindong Wang
Main category: cs.AI
TL;DR: TorchUMM is a unified codebase for comprehensive evaluation, analysis, and post-training across diverse unified multimodal models (UMMs) covering understanding, generation, and editing tasks.
Details
Motivation: There's a proliferation of unified multimodal models with diverse architectures, training paradigms, and implementation details, making fair comparisons and systematic analysis challenging. The field lacks a standardized framework for comprehensive evaluation across different UMM backbones.
Method: Developed TorchUMM as a unified codebase supporting a broad spectrum of models across different scales and design paradigms. The benchmark encompasses three core task dimensions (multimodal understanding, generation, editing) and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities.
Result: TorchUMM provides a unified interface and standardized evaluation protocols enabling fair and reproducible comparisons across heterogeneous models. It facilitates deeper insights into model strengths and limitations, supporting the development of more capable unified multimodal systems.
Conclusion: TorchUMM addresses the critical need for standardized evaluation in the UMM field, enabling systematic analysis and comparison of diverse multimodal architectures, which should accelerate progress toward more capable unified multimodal systems.
Abstract: Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models and fosters deeper insights into their strengths and limitations, facilitating the development of more capable unified multimodal systems. Code is available at: https://github.com/AIFrontierLab/TorchUMM.
[849] CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
Zacharie Bugaud
Main category: cs.AI
TL;DR: CheeseBench evaluates LLMs on 9 rodent neuroscience tasks using ASCII text observations, finding current models perform below rodent baselines, especially on spatial navigation tasks.
Details
Motivation: To create a benchmark that evaluates LLMs on classical behavioral neuroscience paradigms to understand how well they can discover goals from minimal text observations, similar to rodents in experimental apparatus.
Method: Uses 9 classical rodent neuroscience tasks (Morris water maze, Barnes maze, etc.) with ASCII text renderings. LLMs receive unified system prompts with no task-specific instructions and must learn from text observations and rewards. Evaluates 6 open-weight LLMs (3B-72B parameters) against random baselines and graph-based RL agents.
Result: Best model (Qwen2.5-VL-7B) achieves 52.6% average success on ASCII input vs. 32.1% for random agents and 78.9% for approximate rodent baselines. Key findings: scaling beyond 7B yields diminishing returns, longer context history degrades performance, chain-of-thought hurts performance, and vision-language architecture helps at 7B but hurts at 32B.
Conclusion: Current open-weight LLM agents perform well below rodent reference values under unified zero-shot ASCII protocols, particularly on spatial navigation and state tracking tasks. Performance heavily depends on interface parameters, characterizing the agent-plus-interface system rather than the model in isolation.
Abstract: We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement learning agent. Our best model (Qwen2.5-VL-7B) reaches 52.6% average success on ASCII input, compared to 32.1% for random agents and 78.9% for approximate rodent baselines. We find that (1) scaling beyond 7B yields diminishing returns, (2) longer context history degrades performance, (3) chain-of-thought prompting hurts rather than helps, and (4) a vision-language architecture provides an advantage at 7B but hurts at 32B. Because the same model’s performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent-plus-interface system, not the model in isolation. Under this unified zero-shot ASCII protocol, current open-weight LLM agents remain well below approximate rodent reference values, particularly on tasks requiring spatial navigation and within-trial state tracking.
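The interaction protocol, text observation in, action out, reward back, can be illustrated with a toy T-maze in the spirit of the benchmark. The layout and action set here are illustrative inventions, not CheeseBench's actual apparatus:

```python
class ToyTMaze:
    """Minimal ASCII T-maze: reward hides in the left or right arm."""
    def __init__(self, reward_side="L"):
        self.reward_side = reward_side
        self.pos = "stem"

    def observe(self) -> str:
        return "#####\n#L R#\n## ##\n## ##\n#####\nYou are at: " + self.pos

    def step(self, action: str):
        if self.pos == "stem" and action in ("L", "R"):
            self.pos = action
            return self.observe(), 1.0 if action == self.reward_side else 0.0
        return self.observe(), 0.0

def run_episode(agent, env, max_steps=5):
    """agent: callable mapping observation text -> action string, e.g. a
    wrapped LLM given only the unified system prompt and reward signals."""
    obs, total = env.observe(), 0.0
    for _ in range(max_steps):
        obs, reward = env.step(agent(obs))
        total += reward
    return total
```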
[850] Your Model Diversity, Not Method, Determines Reasoning Strategy
Moulik Choraria, Argyrios Gerogiannis, Anirban Das, Supriyo Chakraborty, Berkcan Kapusuzoglu, Chia-Hsuan Lee, Kartik Balasubramaniam, Shi-Xiong Zhang, Sambit Sahu
Main category: cs.AI
TL;DR: The paper presents a theoretical framework for optimizing compute allocation between breadth (exploring solution approaches) and depth (refining promising solutions) in LLM reasoning, showing optimal strategy depends on model’s diversity profile.
Details
Motivation: Current methods for LLM reasoning implicitly trade off breadth vs depth without clear understanding of why certain trade-offs work, and validation on single models obscures the role of the model's characteristics.
Method: Develop theoretical framework decomposing reasoning uncertainty, derive conditions for when tree-style depth refinement outperforms parallel sampling, validate on Qwen-3 4B and Olmo-3 7B model families.
Result: Lightweight signals suffice for depth-based refinement on low-diversity aligned models but yield limited utility for high-diversity base models, which require stronger compensation for lower exploration coverage.
Conclusion: Optimal compute scaling strategy depends on model’s diversity profile and probability mass spread across solution approaches, which must be characterized before adopting exploration strategies.
Abstract: Compute scaling for LLM reasoning requires allocating budget between exploring solution approaches ($breadth$) and refining promising solutions ($depth$). Most methods implicitly trade off one for the other, yet why a given trade-off works remains unclear, and validation on a single model obscures the role of the model itself. We argue that $\textbf{the optimal strategy depends on the model’s diversity profile, the spread of probability mass across solution approaches, and that this must be characterized before any exploration strategy is adopted.}$ We formalize this through a theoretical framework decomposing reasoning uncertainty and derive conditions under which tree-style depth refinement outperforms parallel sampling. We validate it on Qwen-3 4B and Olmo-3 7B families, showing that lightweight signals suffice for depth-based refinement on low-diversity aligned models while yielding limited utility for high-diversity base models, which we hypothesize require stronger compensation for lower exploration coverage.
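The "diversity profile", how probability mass spreads across distinct solution approaches, can be estimated empirically: sample solutions, group them into approaches, and measure the entropy of the resulting distribution. A minimal sketch, where the grouping key stands in for whatever approach-equivalence the authors use:

```python
import math
from collections import Counter

def diversity_profile(samples, approach_key):
    """Entropy of the empirical distribution over solution approaches.

    samples: solutions drawn from the model at fixed temperature.
    approach_key: maps a solution to an approach identifier (assumed)."""
    counts = Counter(approach_key(s) for s in samples)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def choose_strategy(entropy, threshold=1.0):
    """Low diversity: cheap depth refinement suffices (aligned models).
    High diversity: spend budget on breadth to cover approaches first."""
    return "depth" if entropy < threshold else "breadth"
```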
[851] A Benchmark for Gap and Overlap Analysis as a Test of KG Task Readiness
Maruf Ahmed Mridul, Rohit Kapa, Oshani Seneviratne
Main category: cs.AI
TL;DR: Benchmark for evaluating knowledge graph quality through gap/overlap analysis of policy documents using ontology-driven methods vs. text-only LLM approaches
Details
Motivation: To create an executable, auditable benchmark for evaluating whether knowledge graphs can answer real competency questions about policy documents in a reproducible, explainable way with evidence traceability.
Method: Developed benchmark with: 1) 10 simplified life-insurance contracts, 2) domain ontology with instantiated knowledge base from contract facts, 3) 58 structured scenarios with SPARQL queries and evidence-linked ground truth. Compared ontology-driven pipeline against text-only LLM baseline.
Result: Explicit modeling (ontology-driven approach) improves consistency and diagnosis for gap/overlap analyses compared to text-only LLM inference directly from contract text
Conclusion: The benchmark serves as reusable template for evaluating KG quality and supports downstream tasks like ontology learning, KG population, and evidence-grounded question answering
Abstract: Task-oriented evaluation of knowledge graph (KG) quality increasingly asks whether an ontology-based representation can answer the competency questions that users actually care about, in a manner that is reproducible, explainable, and traceable to evidence. This paper adopts that perspective and focuses on gap and overlap analysis for policy-like documents (e.g., insurance contracts), where, given a scenario, the task is to determine which documents support it (overlap) and which do not (gap), with defensible justifications. The resulting gap/overlap determinations are typically driven by genuine differences in coverage and restrictions rather than missing data, making the task a direct test of KG task readiness rather than a test of missing facts or query expressiveness. We present an executable and auditable benchmark that aligns natural-language contract text with a formal ontology and evidence-linked ground truth, enabling systematic comparison of methods. The benchmark includes: (i) ten simplified yet diverse life-insurance contracts reviewed by a domain expert, (ii) a domain ontology (TBox) with an instantiated knowledge base (ABox) populated from contract facts, and (iii) 58 structured scenarios paired with SPARQL queries with contract-level outcomes and clause-level excerpts that justify each label. Using this resource, we compare a text-only LLM baseline that infers outcomes directly from contract text against an ontology-driven pipeline that answers the same scenarios over the instantiated KG, demonstrating that explicit modeling improves consistency and diagnosis for gap/overlap analyses. Although demonstrated for gap and overlap analysis, the benchmark is intended as a reusable template for evaluating KG quality and supporting downstream work such as ontology learning, KG population, and evidence-grounded question answering.
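The ontology-driven side of the comparison amounts to answering each scenario as a SPARQL query over the instantiated ABox. A minimal rdflib sketch with a made-up two-triple knowledge base; the vocabulary is invented for illustration and is not the benchmark's ontology:

```python
from rdflib import Graph

TTL = """
@prefix ins: <http://example.org/insurance#> .
ins:ContractA ins:covers ins:AccidentalDeath .
ins:ContractB ins:excludes ins:AccidentalDeath .
"""

g = Graph()
g.parse(data=TTL, format="turtle")

# Scenario: which contracts cover accidental death?
q = """
PREFIX ins: <http://example.org/insurance#>
SELECT ?contract WHERE { ?contract ins:covers ins:AccidentalDeath . }
"""
overlap = {str(row.contract) for row in g.query(q)}
print(overlap)  # ContractA supports the scenario (overlap); ContractB is a gap
```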
[852] Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering
Mingjie Zhao, Yunfan Zhang, Yiqun Zhang, Yiu-ming Cheung
Main category: cs.AI
TL;DR: TagCC is a novel deep clustering framework for tabular data that integrates open-world semantic knowledge from LLMs with statistical representations through contrastive learning.
Details
Motivation: Existing deep clustering methods for tabular data rely primarily on statistical co-occurrence patterns, treating features as symbolic tokens and overlooking the rich semantic knowledge embedded in feature names and values. This causes semantically related concepts to be isolated in the learned representations.
Method: TagCC uses LLMs to distill underlying data semantics into textual anchors via semantic-aware transformation. It then employs contrastive learning to enrich statistical tabular representations with open-world semantics from these anchors, jointly optimizing this with a clustering objective.
Result: Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms existing deep clustering methods for tabular data.
Conclusion: TagCC successfully bridges the gap between dataset-specific statistics and intrinsic semantic knowledge, creating representations that are both semantically coherent and clustering-friendly for tabular data analysis.
Abstract: Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains like finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, often overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts like 'Flu' and 'Cold' are often treated as symbolic tokens, causing conceptually related samples to be isolated. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC utilizes Large Language Models (LLMs) to distill underlying data semantics into textual anchors via semantic-aware transformation. Through Contrastive Learning (CL), the framework enriches the statistical tabular representations with the open-world semantics encapsulated in these anchors. This CL framework is jointly optimized with a clustering objective, ensuring that the learned representations are both semantically coherent and clustering-friendly. Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms its counterparts.
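The core contrastive step pairs each row's statistical embedding with its LLM-distilled textual anchor and pushes mismatched pairs apart. A minimal symmetric InfoNCE sketch in PyTorch; the encoder shapes and the frozen-text-encoder assumption are mine, not details from the paper:

```python
import torch
import torch.nn.functional as F

def anchor_contrastive_loss(tab_emb, anchor_emb, temperature=0.07):
    """Symmetric InfoNCE between tabular embeddings and textual anchors.

    tab_emb, anchor_emb: (B, d) tensors from the tabular encoder and a
    text encoder over the LLM-generated anchors; row i's anchor is its
    positive, all other rows in the batch serve as negatives."""
    tab = F.normalize(tab_emb, dim=-1)
    anc = F.normalize(anchor_emb, dim=-1)
    logits = tab @ anc.t() / temperature
    targets = torch.arange(tab.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = anchor_contrastive_loss(torch.randn(32, 128), torch.randn(32, 128))
```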
[853] A Quantitative Definition of Intelligence
Kang-Sin Choi
Main category: cs.AI
TL;DR: The paper proposes a quantitative definition of intelligence as the ratio of log(independent outputs) to total description length, distinguishing between memorization (description grows with outputs) and knowledge (fixed description produces unbounded outputs).
Details
Motivation: To create an operational, quantitative definition of intelligence that applies to arbitrary physical systems, addresses philosophical arguments about intelligence (Putnam's pancomputationalism, Searle's Chinese Room), and distinguishes between mere memorization and genuine knowledge/generalization.
Method: Proposes intelligence density = log(independent outputs) / total description length. Defines memorization vs knowledge based on how description length scales with output count. Uses information-theoretic framework with independence conditions on outputs to avoid trivialization.
Result: Provides a substrate-independent continuum of intelligence from logic gates to brains, blocks Putnam’s triviality argument via output independence, resolves Chinese Room Argument by showing finite rulebooks for infinite domains must generalize.
Conclusion: Intelligence can be quantitatively defined as the ability to produce many independent outputs from limited description, with true knowledge requiring generalization rather than memorization. This resolves key philosophical puzzles about intelligence.
Abstract: We propose an operational, quantitative definition of intelligence for arbitrary physical systems. The intelligence density of a system is the ratio of the logarithm of its independent outputs to its total description length. A system memorizes if its description length grows with its output count; it knows if its description length remains fixed while its output count diverges. The criterion for knowing is generalization: a system knows its domain if a single finite mechanism can produce correct outputs across an unbounded range of inputs, rather than storing each answer individually. We argue that meaning over a domain is a selection and ordering of functions that produces correct outputs, and that a system whose intelligence density diverges necessarily captures this structure. The definition (1) places intelligence on a substrate-independent continuum from logic gates to brains, (2) blocks Putnam’s pancomputationalist triviality argument via an independence condition on outputs, and (3) resolves Searle’s Chinese Room Argument by showing that any finite rulebook handling an infinite domain must generalize.
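The definition is easy to compute on toy systems: a lookup table storing N answers has description length growing with N (memorization), while a short program answering unboundedly many inputs keeps a fixed description (knowledge), so its density diverges. The byte-length proxy for description length below is my assumption for illustration:

```python
import math

def intelligence_density(n_independent_outputs: int, description_bytes: int) -> float:
    """log(#independent outputs) / total description length."""
    return math.log(n_independent_outputs) / description_bytes

# Memorizer: a table of 1000 question-answer pairs, ~20 bytes each.
table_density = intelligence_density(1000, 1000 * 20)     # ~0.00035

# Knower: 'lambda a, b: a + b' (18 bytes) answers ~10**12 addition queries.
program_density = intelligence_density(10**12, 18)        # ~1.53

print(table_density, program_density)  # density diverges for the program
```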
[854] ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval
David H. Yang, Yuxuan Zhu, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Subhajit Chaudhury, Pin-Yu Chen
Main category: cs.AI
TL;DR: ZoomR: Adaptive KV cache compression for LLMs that summarizes verbose reasoning thoughts and uses hierarchical attention to reduce memory usage by 4× while maintaining performance.
Details
Motivation: LLMs generate long intermediate thoughts for complex reasoning, causing KV cache memory to grow with output length. Current KV cache optimization focuses on compressing input context but keeps full cache for decoding, leading to high computational and memory costs for long output generation.
Method: ZoomR adaptively compresses verbose reasoning thoughts into summaries and uses dynamic KV cache selection with hierarchical attention. It uses summary keys as coarse-grained indices during decoding, allowing queries to retrieve details only for important thoughts, avoiding full-cache attention at each step.
Result: Experiments on math and reasoning tasks show competitive performance compared to baselines while reducing inference memory requirements by more than 4×.
Conclusion: Multi-granularity KV selection enables more memory-efficient decoding, especially for long output generation, demonstrating the effectiveness of hierarchical compression strategies.
Abstract: Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focuses on compressing the long input context, while retaining the full KV cache for decoding. For tasks requiring long output generation, this leads to increased computational and memory costs. In this paper, we introduce ZoomR, a novel approach that enables LLMs to adaptively compress verbose reasoning thoughts into summaries and uses a dynamic KV cache selection policy that leverages these summaries while also strategically “zooming in” on fine-grained details. By using summary keys as a coarse-grained index during decoding, ZoomR uses the query to retrieve details for only the most important thoughts. This hierarchical strategy significantly reduces memory usage by avoiding full-cache attention at each step. Experiments across math and reasoning tasks show that our approach achieves competitive performance compared to baselines, while reducing inference memory requirements by more than $4\times$. These results demonstrate that multi-granularity KV selection enables more memory-efficient decoding, especially for long output generation.
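The coarse-to-fine lookup can be sketched independently of any attention kernel: score summary keys against the current query, keep the top thoughts, and attend only over their fine-grained entries. A minimal PyTorch sketch; the shapes and top-k policy are assumptions about the mechanism, not the released code:

```python
import torch

def zoom_select(query, summary_keys, thought_kv, top_k=2):
    """Coarse-to-fine KV retrieval for one decode step.

    query: (d,) current query; summary_keys: (T, d), one per compressed
    thought; thought_kv: list of T (len_t, d) key tensors with full detail.
    Returns the concatenated fine keys of the top_k most relevant thoughts,
    avoiding attention over the full cache."""
    coarse = summary_keys @ query                # (T,) relevance scores
    keep = torch.topk(coarse, k=top_k).indices   # zoom in on these thoughts
    return torch.cat([thought_kv[i] for i in keep.tolist()], dim=0)

d = 64
fine = zoom_select(torch.randn(d), torch.randn(5, d),
                   [torch.randn(12, d) for _ in range(5)])
```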
[855] CASK: Core-Aware Selective KV Compression for Reasoning Traces
Buseong Kim, Heejun Gwon
Main category: cs.AI
TL;DR: CASK introduces a novel KV cache compression method for long-form reasoning in LLMs that partitions reasoning traces into protected core and mergeable scratch, using selective consolidation instead of eviction-based approaches.
Details
Motivation: KV cache grows rapidly during long-form reasoning in LLMs, creating memory and inference bottlenecks. Existing eviction-based compression methods focus on token importance scoring but fail to substantially reorganize keep-sets or preserve reasoning behavior effectively.
Method: CASK frames reasoning KV compression as behavior-preserving structured consolidation: partitions decode-time reasoning traces into protected core (anchoring answer formation and intermediate state) and mergeable scratch (high redundancy). Preserves core while applying selective consolidation to scratch. Uses two-stage design for prompt-heavy regimes: prefix eviction followed by decode-stage consolidation.
Result: On H100 reasoning gate, CASK shows higher full-KV continuation fidelity than TriAttention at matched budgets on both AIME24 and AIME25, with recurring cask@384 > triattention@512 crossings. In prompt-heavy replay, multi_news and vcsum act as decode-active witnesses, while qmsum and gov_report expose the prefix_budget_exhausted boundary.
Conclusion: Effective reasoning KV compression depends less on elaborate scorer engineering and more on combining core preservation with selective scratch consolidation to lower the usable budget frontier.
Abstract: In large language models performing long-form reasoning, the KV cache grows rapidly with decode length, creating bottlenecks in memory and inference stability. Existing reasoning-oriented KV compression has mostly followed an eviction-centered view: estimate token importance more accurately, then discard lower-ranked entries. Our analysis suggests that scorer refinement alone often fails to substantially reorganize the actual keep-set and may therefore not be the main lever for preserving reasoning behavior. We instead frame reasoning KV compression as a behavior-preserving structured consolidation problem. CASK partitions the decode-time reasoning trace into a protected core that anchors answer formation and intermediate state, and mergeable scratch with high redundancy. The core is preserved, while selective consolidation is applied only to the scratch. To address prompt-heavy regimes where the prefix can exhaust the budget before decode-stage compression becomes active, CASK further uses a two-stage design: prefix eviction followed by decode-stage consolidation. On the H100 reasoning gate, CASK shows higher full-KV continuation fidelity than TriAttention at matched budgets on both AIME24 and AIME25, with recurring cask@384 > triattention@512 crossings. In prompt-heavy replay, multi_news and vcsum act as decode-active witnesses, while qmsum and gov_report expose the prefix_budget_exhausted boundary. The overall evidence supports a simple conclusion: effective reasoning KV compression depends less on more elaborate scorer engineering than on combining core preservation with selective scratch consolidation to lower the usable budget frontier.
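The consolidation step, keep the core, merge redundant scratch, can be sketched as pairwise merging of adjacent scratch entries whose keys are nearly parallel. The cosine threshold and mean-merge below are illustrative choices of mine, not CASK's exact rule:

```python
import torch
import torch.nn.functional as F

def consolidate_scratch(keys, values, core_mask, cos_thresh=0.95):
    """Merge adjacent redundant scratch KV entries; never touch the core.

    keys/values: (N, d); core_mask: (N,) bool marking protected core tokens."""
    out_k, out_v, out_core = [], [], []
    for i in range(keys.size(0)):
        can_merge = (
            out_k and not bool(core_mask[i]) and not out_core[-1]
            and bool(F.cosine_similarity(out_k[-1], keys[i], dim=0) > cos_thresh)
        )
        if can_merge:
            out_k[-1] = 0.5 * (out_k[-1] + keys[i])    # consolidate scratch
            out_v[-1] = 0.5 * (out_v[-1] + values[i])
        else:
            out_k.append(keys[i].clone())
            out_v.append(values[i].clone())
            out_core.append(bool(core_mask[i]))
    return torch.stack(out_k), torch.stack(out_v)

k, v = torch.randn(6, 8), torch.randn(6, 8)
core = torch.tensor([True, False, False, False, False, True])
ck, cv = consolidate_scratch(k, v, core)
```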
[856] Reasoning as Data: Representation-Computation Unity and Its Implementation in a Domain-Algebraic Inference Engine
Chao Li, Yuru Wang
Main category: cs.AI
TL;DR: CDC four-tuple representation unifies storage and computation by embedding domain context in predicate arity, enabling automatic domain-scoped inference without external rules.
Details
Motivation: Traditional knowledge systems separate storage from computation, requiring external rules for inference. The authors aim to eliminate this separation by making domain context structural rather than external.
Method: Introduces CDC four-tuple representation (is_a(Apple, Company, @Business)) where domain is embedded in predicate arity. Develops a symbolic engine (2400 lines Python+Prolog) implementing three inference mechanisms: domain-scoped closure, typed inheritance, and write-time falsification via cycle detection per domain fiber.
Result: Formally establishes representation-computation unity (RCU) via four theorems. Case studies show successful application to ICD-11 classification (1247 entities, 3 axes) and CBT clinical reasoning with temporal reasoning. Multi-constraint queries achieve CSP arc-consistency with O(m (N/K)^2) complexity.
Conclusion: When domain is structural, data computes itself. The CDC representation enables automatic domain-scoped inference without external rules, unifying storage and computation in knowledge systems.
Abstract: Every existing knowledge system separates storage from computation. We show this separation is unnecessary and eliminate it. In a standard triple is_a(Apple, Company), domain context lives in the query or the programmer’s mind. In a CDC four-tuple is_a(Apple, Company, @Business), domain becomes a structural field embedded in predicate arity. Any system respecting arity automatically performs domain-scoped inference without external rules. We call this representation-computation unity (RCU). From the four-tuple structure, three inference mechanisms emerge: domain-scoped closure, typed inheritance, and write-time falsification via cycle detection per domain fiber. We establish RCU formally via four theorems. RCU is implementable. We present a working symbolic engine (2400 lines Python+Prolog) resolving four engineering issues: rule-data separation, shared-fiber handling, read-only meta-layer design, and intersective convergence. A central result: CDC domain-constrained inference is distinct from Prolog with a domain argument. Two case studies validate the engine. ICD-11 classification (1247 entities, 3 axes) shows fibers resolve multiple inheritance. CBT clinical reasoning shows generalization to temporal reasoning with session turn as ordered domain index. Multi-constraint queries realize CSP arc-consistency with complexity O(m (N/K)^2), confirming the domain lattice’s sparsity governs performance. When domain is structural, data computes itself.
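The four-tuple's behavior can be demonstrated in a few lines: the same entity carries different is_a facts in different domain fibers, and transitive closure never crosses fibers. A minimal sketch built on the paper's Apple example, with one taxonomy link added for illustration:

```python
from collections import defaultdict

facts = [  # is_a(subject, object, @domain) four-tuple-style assertions
    ("Apple", "Company", "@Business"),
    ("Company", "Organization", "@Business"),
    ("Apple", "Fruit", "@Botany"),
]

def domain_closure(facts):
    """Transitive is_a closure computed per domain fiber: inference is
    automatically domain-scoped because domain is part of the relation."""
    edges = defaultdict(set)
    for s, o, d in facts:
        edges[d].add((s, o))
    for d, es in edges.items():
        changed = True
        while changed:
            new = {(a, c) for a, b in es for b2, c in es if b == b2} - es
            es |= new
            changed = bool(new)
    return edges

closure = domain_closure(facts)
assert ("Apple", "Organization") in closure["@Business"]
assert ("Apple", "Organization") not in closure["@Botany"]  # fibers don't leak
```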
[857] EvoNash-MARL: A Closed-Loop Multi-Agent Reinforcement Learning Framework for Medium-Horizon Equity Allocation
Chongliu Jia, Yi Luo, Sipeng Han, Pengwei Li, Jie Ding, Youshuang Hu, Yimiao Qian, Qiya Wang
Main category: cs.AI
TL;DR: EvoNash-MARL: A unified RL framework for robust medium-to-long-horizon stock allocation combining multi-agent policy populations, PSRO-style aggregation, evolutionary training, and execution-aware checkpoint selection.
Details
Motivation: Address challenges in medium-to-long-horizon stock allocation including weak predictive structures, non-stationary market regimes, signal degradation from transaction costs/capacity limits, and limitations of conventional single-predictor or loosely coupled approaches.
Method: Integrates RL, multi-agent policy populations, Policy-Space Response Oracle (PSRO)-style aggregation, league best-response training, evolutionary replacement, and execution-aware checkpoint selection within a unified walk-forward loop. Features layered policy architecture (direction head + risk head), nonlinear signal enhancement, feature-quality reweighting, and constraint-aware checkpoint selection.
Result: Achieves mean excess Sharpe 0.7600 and robust score -0.0203, ranking first among internal controls. On daily out-of-sample returns (2014-2024): 19.6% annualized return vs 11.7% for SPY. Extended evaluation through 2026: 20.5% vs 13.5%. Maintains positive performance under realistic stress constraints with structured cross-market generalization.
Conclusion: Presents evidence supporting more stable medium-to-long-horizon training and selection paradigm rather than proof of universally superior market-timing performance. Results not globally significant under White’s Reality Check and SPA-lite testing.
Abstract: Medium-to-long-horizon stock allocation presents significant challenges due to weak predictive structures, non-stationary market regimes, and the degradation of signals following the application of transaction costs, capacity limits, and tail-risk constraints. Conventional approaches commonly rely on a single predictor or a loosely coupled prediction-to-allocation pipeline, limiting robustness under distribution shift. This work addresses a targeted design question: whether coupling reinforcement learning (RL), multi-agent policy populations, Policy-Space Response Oracle (PSRO)-style aggregation, league best-response training, evolutionary replacement, and execution-aware checkpoint selection within a unified walk-forward loop improves allocator robustness at medium to long horizons. The proposed framework, EvoNash-MARL, integrates these components within an execution-aware allocation loop and further introduces a layered policy architecture comprising a direction head and a risk head, nonlinear signal enhancement, feature-quality reweighting, and constraint-aware checkpoint selection. Under a 120-window walk-forward protocol, the resolved v21 configuration achieves mean excess Sharpe 0.7600 and robust score -0.0203, ranking first among internal controls; on aligned daily out-of-sample returns from 2014-01-02 to 2024-01-05, it delivers 19.6% annualized return versus 11.7% for SPY, and in an extended walk-forward evaluation through 2026-02-10 it delivers 20.5% versus 13.5%. The framework maintains positive performance under realistic stress constraints and exhibits structured cross-market generalization; however, global strong significance under White’s Reality Check (WRC) and SPA-lite testing is not established. Therefore, the results are presented as evidence supporting a more stable medium-to-long-horizon training and selection paradigm, rather than as proof of universally superior market-timing performance.
[858] CSPO: Alleviating Reward Ambiguity for Structured Table-to-LaTeX Generation
Yunfan Yang, Cuiling Lan, Jitao Sang, Yan Lu
Main category: cs.AI
TL;DR: CSPO is a reinforcement learning framework for converting table images to LaTeX code that uses component-specific rewards (structure, style, content) to address reward ambiguity in MLLMs.
Details
Motivation: Current multimodal LLMs struggle to preserve structural, style, and content fidelity when converting table images to LaTeX code. Traditional RL approaches use aggregated rewards that conflate multiple behavioral aspects, leading to reward ambiguity and ineffective optimization.
Method: Component-Specific Policy Optimization (CSPO) disentangles optimization across LaTeX table components by assigning component-specific rewards and backpropagating each signal only through tokens relevant to its component (structure, style, content).
Result: Extensive experiments demonstrate CSPO’s effectiveness, showing improved performance in table image to LaTeX conversion with better preservation of structural, style, and content fidelity.
Conclusion: Component-specific optimization is crucial for reliable structured generation in multimodal LLMs, and CSPO provides an effective framework for addressing reward ambiguity in table image to LaTeX conversion tasks.
Abstract: Tables contain rich structured information, yet when stored as images their contents remain “locked” within pixels. Converting table images into LaTeX code enables faithful digitization and reuse, but current multimodal large language models (MLLMs) often fail to preserve structural, style, or content fidelity. Conventional post-training with reinforcement learning (RL) typically relies on a single aggregated reward, leading to reward ambiguity that conflates multiple behavioral aspects and hinders effective optimization. We propose Component-Specific Policy Optimization (CSPO), an RL framework that disentangles optimization across LaTeX table components: structure, style, and content. In particular, CSPO assigns component-specific rewards and backpropagates each signal only through the tokens relevant to its component, alleviating reward ambiguity and enabling targeted component-wise optimization. To comprehensively assess performance, we introduce a set of hierarchical evaluation metrics. Extensive experiments demonstrate the effectiveness of CSPO, underscoring the importance of component-specific optimization for reliable structured generation.
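Routing each component's reward only through its own tokens amounts to masking the per-token policy-gradient terms. A minimal sketch of that masking; the token masks and per-component advantages are assumptions about the general recipe, not the exact CSPO objective:

```python
import torch

def component_masked_pg_loss(logprobs, masks, advantages):
    """Per-component policy gradient for structure/style/content rewards.

    logprobs: (T,) log-probs of the generated LaTeX tokens.
    masks: dict component -> (T,) 0/1 tensor marking that component's tokens.
    advantages: dict component -> scalar advantage from its specific reward.
    Each reward signal backpropagates only through its own tokens, avoiding
    the ambiguity of a single aggregated reward."""
    loss = torch.zeros(())
    for comp, mask in masks.items():
        loss = loss - advantages[comp] * (mask * logprobs).sum()
    return loss

logp = torch.randn(10, requires_grad=True)
masks = {"structure": torch.tensor([1, 1, 0, 0, 0, 0, 0, 0, 1, 1.]),
         "style":     torch.tensor([0, 0, 1, 1, 0, 0, 0, 0, 0, 0.]),
         "content":   torch.tensor([0, 0, 0, 0, 1, 1, 1, 1, 0, 0.])}
adv = {"structure": 0.8, "style": -0.2, "content": 0.5}
component_masked_pg_loss(logp, masks, adv).backward()
```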
[859] RAG-KT: Cross-platform Explainable Knowledge Tracing with Multi-view Fusion Retrieval Generation
Zhiyi Duan, Hongyu Yuan, Rui Liu
Main category: cs.AI
TL;DR: RAG-KT is a retrieval-augmented framework for cross-platform knowledge tracing that uses LLMs with structured context retrieval to improve generalization across heterogeneous educational platforms.
Details
Motivation: Existing knowledge tracing models suffer from platform dependency, poor interpretability, and limited generalization across heterogeneous educational platforms with distribution shifts. LLM-based methods are either ungrounded or overly domain-dependent.
Method: Proposes RAG-KT, a retrieval-augmented paradigm that frames cross-platform KT as reliable context constrained inference with LLMs. Uses Question Group abstractions for cross-source alignment and builds unified multi-source structured context, retrieving complementary rich context for each prediction.
Result: Experiments on three public KT benchmarks show consistent gains in accuracy and robustness, including strong performance under cross-platform conditions.
Conclusion: RAG-KT enables grounded prediction and interpretable diagnosis for knowledge tracing across heterogeneous educational platforms, addressing distribution shift challenges.
Abstract: Knowledge Tracing (KT) infers a student’s knowledge state from past interactions to predict future performance. Conventional Deep Learning (DL)-based KT models are typically tied to platform-specific identifiers and latent representations, making them hard to transfer and interpret. Large Language Model (LLM)-based methods can be either ungrounded under prompting or overly domain-dependent under fine-tuning. In addition, most existing KT methods are developed and evaluated under a same-distribution assumption. In real deployments, educational data often arise from heterogeneous platforms with substantial distribution shift, which often degrades generalization. To this end, we propose RAG-KT, a retrieval-augmented paradigm that frames cross-platform KT as reliable context constrained inference with LLMs. It builds a unified multi-source structured context with cross-source alignment via Question Group abstractions and retrieves complementary rich and reliable context for each prediction, enabling grounded prediction and interpretable diagnosis. Experiments on three public KT benchmarks demonstrate consistent gains in accuracy and robustness, including strong performance under cross-platform conditions.
[860] Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models
Ruiyang Li, Fang Liu, Licheng Jiao, Xinglin Xie, Jiayao Hao, Shuo Li, Xu Liu, Jingyi Yang, Lingling Li, Puhua Chen, Wenping Ma
Main category: cs.AI
TL;DR: Proposes using visual foundation models to estimate data uncertainty in medical image segmentation, with applications for data filtering and adaptive training optimization.
Details
Motivation: Medical image datasets suffer from acquisition noise and annotation ambiguity, causing pervasive data uncertainty that undermines model robustness. Existing research focuses on model architecture and predictive reliability, but systematic exploration of intrinsic data uncertainty remains insufficient.
Method: Leverages visual foundation models’ representation capabilities to estimate inherent data uncertainty by analyzing feature diversity of decoded representations and quantifying singular value energy to define semantic perception scale for each class. Uses this to measure sample difficulty and aleatoric uncertainty. Designs two uncertainty-driven strategies: (1) aleatoric uncertainty-aware data filtering to eliminate noisy samples, (2) dynamic uncertainty-aware optimization that adaptively adjusts class-specific loss weights based on semantic perception scale, combined with label denoising for training stability.
Result: Experimental results on five public CT and MRI datasets for multi-organ and tumor segmentation tasks demonstrate significant and robust performance improvements across various mainstream network architectures.
Conclusion: The method reveals broad application potential of aleatoric uncertainty in medical image understanding and segmentation tasks, showing that systematic exploration of data uncertainty can substantially enhance model robustness.
Abstract: Medical image segmentation supports clinical workflows by precisely delineating anatomical structures and lesions. However, medical image datasets suffer from acquisition noise and annotation ambiguity, causing pervasive data uncertainty that substantially undermines model robustness. Existing research focuses primarily on model architectural improvements and predictive reliability estimation, while systematic exploration of the intrinsic data uncertainty remains insufficient. To address this gap, this work proposes leveraging the universal representation capabilities of visual foundation models to estimate inherent data uncertainty. Specifically, we analyze the feature diversity of the model’s decoded representations and quantify their singular value energy to define the semantic perception scale for each class, thereby measuring sample difficulty and aleatoric uncertainty. Based on this foundation, we design two uncertainty-driven application strategies: (1) the aleatoric uncertainty-aware data filtering mechanism to eliminate potentially noisy samples and enhance model learning quality; (2) the dynamic uncertainty-aware optimization strategy that adaptively adjusts class-specific loss weights during training based on the semantic perception scale, combined with a label denoising mechanism to improve training stability. Experimental results on five public datasets encompassing CT and MRI modalities and involving multi-organ and tumor segmentation tasks demonstrate that our method achieves significant and robust performance improvements across various mainstream network architectures, revealing the broad application potential of aleatoric uncertainty in medical image understanding and segmentation tasks.
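The singular-value-energy measure can be sketched directly: stack a class's decoded feature vectors, take the SVD, and ask how many components are needed to capture most of the energy; a wider spread signals more diverse features and higher aleatoric uncertainty. A minimal NumPy sketch, where the 90% energy cutoff is an illustrative choice:

```python
import numpy as np

def semantic_perception_scale(features, energy=0.90):
    """Effective rank of a class's decoded features.

    features: (n_samples, d) foundation-model features for one class.
    Returns the number of singular directions needed to capture the given
    energy fraction; a larger scale indicates more diverse features and
    higher aleatoric uncertainty, which can drive filtering or loss weights."""
    s = np.linalg.svd(features - features.mean(0), compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 64)) @ np.diag([5.0] * 4 + [0.1] * 60)
noisy = rng.normal(size=(200, 64))
print(semantic_perception_scale(clean), semantic_perception_scale(noisy))
```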
[861] CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning
Qixian Huang, Hongqiang Lin, Tong Fu, Yingsen Wang, Zhenghui Fu, Qirui Wang, Yiding Sun, Dongxu Zhang
Main category: cs.AI
TL;DR: CFMS is a two-stage framework that combines visual perception from MLLMs with symbolic reasoning for tabular data analysis, achieving competitive performance on table QA benchmarks.
Details
Motivation: Existing symbolic reasoning methods for tabular data are limited by their inability to capture holistic visual patterns in tables, which can provide important structural and relational information that pure symbolic approaches miss.
Method: Proposes Coarse-to-Fine Multimodal Synthesis (CFMS) with two stages: 1) Coarse Stage uses MLLMs to generate multi-perspective knowledge tuples from table visualizations, and 2) Fine Stage uses symbolic engines to execute targeted operations guided by the knowledge tuples.
Result: CFMS achieves competitive accuracy on WikiTQ and TabFact benchmarks, shows particular robustness with large tables, and works effectively even with smaller backbone models, demonstrating good generalizability.
Conclusion: The hierarchical decoupling of visual perception and symbolic reasoning in CFMS provides an effective framework for tabular data understanding that leverages both multimodal and symbolic approaches.
Abstract: Reasoning over tabular data is a crucial capability for tasks like question answering and fact verification, as it requires models to comprehend both free-form questions and semi-structured tables. However, while methods like Chain-of-Thought (CoT) introduce reasoning chains, purely symbolic methods are inherently limited by their blindness to holistic visual patterns. To address this, we propose the Coarse-to-Fine Multimodal Synthesis framework (CFMS), a novel two-stage paradigm that hierarchically decouples high-level visual perception from granular symbolic reasoning. In the Coarse Stage, CFMS leverages the Multimodal Large Language Models (MLLMs) to perform a one-time synthesis of a multi-perspective knowledge tuple. This tuple subsequently serves as a dynamic reasoning map to guide the fine stage, where a symbolic engine executes a targeted and efficient sequence of iterative operations over the table. Extensive experiments on the WikiTQ and TabFact benchmarks demonstrate that CFMS achieves competitive accuracy. The framework exhibits particular robustness when handling large tables and when instantiated with smaller backbone models, validating its effectiveness and generalizability.
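The two-stage flow can be illustrated with pandas standing in for the symbolic engine: a coarse knowledge tuple (hand-written here, where the paper would have an MLLM synthesize it from the rendered table) names the relevant columns and operation, and the fine stage executes it. A minimal sketch with an invented table and question:

```python
import pandas as pd

table = pd.DataFrame({
    "country": ["A", "B", "C"],
    "gold": [10, 7, 3],
    "total": [25, 18, 9],
})

# Coarse stage: a knowledge tuple an MLLM might synthesize from the image
# for the question "Which country won the most gold medals?" (illustrative).
knowledge_tuple = {"target_col": "gold", "key_col": "country", "op": "argmax"}

def fine_stage(df, kt):
    """Symbolic execution guided by the coarse-stage reasoning map."""
    if kt["op"] == "argmax":
        return df.loc[df[kt["target_col"]].idxmax(), kt["key_col"]]
    raise ValueError(f"unsupported op: {kt['op']}")

print(fine_stage(table, knowledge_tuple))  # -> "A"
```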
[862] ATANT v1.1: Positioning Continuity Evaluation Against Memory, Long-Context, and Agentic-Memory Benchmarks
Samuel Sameer Tanguturi
Main category: cs.AI
TL;DR: This companion paper (ATANT v1.1) clarifies that existing memory evaluation benchmarks (LOCOMO, LongMemEval, BEAM, etc.) do not measure continuity as defined in ATANT v1.0, showing they cover only 0-2 of the 7 required properties for continuity.
Details
Motivation: To address recurring questions about ATANT v1.0's relationship to other memory evaluation benchmarks and clarify that existing benchmarks don't measure continuity as defined in ATANT v1.0, preventing confusion in the field.
Method: Structural analysis comparing 7 existing memory evaluation benchmarks against ATANT v1.0’s 7 required continuity properties, creating a property-coverage matrix, identifying methodological defects in each benchmark, and providing calibration pairs (LOCOMO vs ATANT scores).
Result: Existing benchmarks cover median 1 property, mean 0.43 properties (with partial credit), with none covering more than 2 of ATANT’s 7 continuity properties. Found scoring bugs (e.g., LOCOMO’s empty-gold bug making 23% of corpus unscorable). Showed 87-point divergence between LOCOMO (8.8%) and ATANT (96%) scores indicates different properties measured.
Conclusion: Existing memory benchmarks measure real capabilities but cannot adjudicate continuity as defined in ATANT v1.0. Conflating them with continuity evaluation has led to under-investment in the properties ATANT v1.0 identifies as essential for continuity.
Abstract: ATANT v1.0 (arXiv:2604.06710) defined continuity as a system property with 7 required properties and introduced a 10-checkpoint, LLM-free evaluation methodology validated on a 250-story corpus. Since publication, a recurring reviewer and practitioner question has concerned not the framework itself but its relationship to a wider set of memory evaluations: LOCOMO, LongMemEval, BEAM, MemoryBench, Zep’s evaluation suite, Letta/MemGPT’s evaluations, and RULER. This companion paper, v1.1, does not modify the v1.0 standard. It closes a related-work gap that v1.0 left brief under page limits. We show by structural analysis that none of these benchmarks measures continuity as defined in v1.0: of the 7 required properties, the median existing eval covers 1 property, the mean covers 0.43 when partial credit is scored at 0.5, and no eval covers more than 2. We provide a cell-by-cell property-coverage matrix, identify methodological defects specific to each benchmark (including an empty-gold scoring bug in the LOCOMO reference implementation that renders 23% of its corpus unscorable by construction), and publish our reference implementation’s LOCOMO score (8.8%) alongside the structural reason that number is uninformative about continuity. We publish our 8.8% LOCOMO score alongside our 96% ATANT cumulative-scale score as a calibration pair: the 87-point divergence is evidence that the two benchmarks measure different properties, not that one system is an order of magnitude better than another. The position v1.1 takes is not adversarial: each benchmark measures a real capability. The claim is that none of them can adjudicate continuity, and conflating them with continuity evaluation has led the field to under-invest in the properties v1.0 names.
[863] Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models
Sameera Horawalavithana, Lauren Phillips, Ian Stewart, Sai Munikoti, Karl Pazdernik
Main category: cs.AI
TL;DR: Systematic study shows that upgrading LLM backbones in Vision-Language Models doesn’t always improve performance - benefits depend on task type, with newer LLMs solving different questions rather than more questions in VQA tasks.
Details
Motivation: As new LLMs with improved reasoning emerge, there's a need to efficiently update existing VLMs, but how evolving LLMs contribute to multimodal reasoning, alignment, and task performance remains underexplored.
Method: Controlled study comparing LLAMA-1, LLAMA-2, and LLAMA-3 based VLMs while keeping vision encoder, training data, and post-training algorithm constant to isolate LLM backbone effects.
Result: Newer LLM backbones don’t always lead to better VLMs - performance depends on downstream task. In VQA, newer LLMs solve different questions rather than more questions, with better calibrated confidence and more stable representations. Some capabilities only appear in newest LLM generation, while visual understanding tasks see little benefit.
Conclusion: LLM backbone upgrades in VLMs have nuanced effects - task-dependent benefits with newer LLMs offering different reasoning patterns rather than uniform improvements, highlighting the need for careful evaluation when updating VLM backbones.
Abstract: Vision-Language Models (VLMs) have rapidly advanced by leveraging powerful pre-trained Large Language Models (LLMs) as core reasoning backbones. As new and more capable LLMs emerge with improved reasoning, instruction-following, and generalization, there is a pressing need to efficiently update existing VLMs to incorporate these advancements. However, the integration of new LLMs into VLMs, particularly how the evolving LLMs contribute to multimodal reasoning, alignment, and task-specific performance, remains underexplored. Addressing this gap is important for VLM development, given the rapid evolution of pretrained LLM backbones. This study presents a controlled and systematic investigation of how changes in the pretrained LLM backbone affect downstream VLM task performance. By keeping the vision encoder, training data, and post-training algorithm the same across LLAMA-1, LLAMA-2, and LLAMA-3 based VLMs, we find that newer LLM backbones do not always lead to better VLMs; rather, performance depends on the downstream VLM task. For example, in visual question answering tasks, newer LLM backbones tend to solve different questions rather than just more questions, and our analysis shows this is driven by differences in how the models process information, including better calibrated confidence and more stable internal representations. We also find that some VLM capabilities appear only in the newest LLM generation, while tasks that depend mainly on visual understanding see little benefit from a newer LLM backbone.
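The "different questions rather than more questions" finding reduces to a set comparison over solved items; a minimal sketch, assuming per-question correctness maps (the names results_a/results_b and the Jaccard summary are our illustration, not the paper's analysis code):

```python
def solved_set_overlap(results_a: dict, results_b: dict) -> dict:
    """Compare WHICH questions two VLM variants solve, not just how many.
    Inputs map question id -> bool (answered correctly)."""
    solved_a = {q for q, ok in results_a.items() if ok}
    solved_b = {q for q, ok in results_b.items() if ok}
    union = solved_a | solved_b
    return {
        "n_solved_a": len(solved_a),
        "n_solved_b": len(solved_b),
        # A Jaccard overlap near 1 at similar accuracy means the newer
        # backbone solves the same questions; a low overlap at similar
        # accuracy means it solves *different* questions.
        "jaccard": len(solved_a & solved_b) / len(union) if union else 1.0,
    }
```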
[864] WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
Peng Yuan, Yuyang Yin, Yuxuan Cai, Zheng Wei
Main category: cs.AI
TL;DR: WebForge is an automated framework that creates reproducible, realistic web interaction benchmarks without manual curation, using a four-agent pipeline to generate self-contained web environments with controlled difficulty across seven dimensions.
Details
Motivation: Existing browser agent benchmarks face a trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation limiting scalability.
Method: A four-agent pipeline (Plan, Generate, Refine, and Validate) produces interactive, self-contained web environments end-to-end without human annotation. A seven-dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, etc.
Result: Created WebForge-Bench with 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show difficulty stratification effectively differentiates model capabilities, and cross-domain analysis exposes capability biases invisible to aggregate metrics.
Conclusion: Multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture, demonstrating the value of systematic difficulty-controlled benchmarking for web interaction agents.
Abstract: Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline – Plan, Generate, Refine, and Validate – that produces interactive, self-contained web environments end-to-end without human annotation. A seven-dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, and more, enabling systematic capability profiling beyond single aggregate scores. Using WebForge, we construct WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show that difficulty stratification effectively differentiates model capabilities, while cross-domain analysis exposes capability biases invisible to aggregate metrics. Together, these results confirm that multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture. Code and benchmark are publicly available at https://github.com/yuandaxia2001/WebForge.
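A hedged sketch of the Plan → Generate → Refine → Validate loop as the abstract describes it; the agent interfaces, the retry loop, and max_rounds are assumptions, not the released WebForge API:

```python
def forge_environment(task_spec, plan_agent, gen_agent, refine_agent,
                      val_agent, max_rounds=3):
    """Hypothetical orchestration of the four-agent pipeline named above."""
    plan = plan_agent(task_spec)           # site structure, tasks, answers
    site = gen_agent(plan)                 # self-contained web environment
    for _ in range(max_rounds):
        report = val_agent(site)           # automated checks, no human labels
        if report["ok"]:
            return site
        site = refine_agent(site, report)  # fix issues the validator found
    return None                            # discard environments that never pass
```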
[865] MAFIG: Multi-agent Driven Formal Instruction Generation Framework
Shixing Zhao, Zheng Si, Pengpeng Ouyang, Zhengqing Hu, Wanqi Zhu, Dong Chen, Yibo Guo, Mingliang Xu
Main category: cs.AI
TL;DR: MAFIG is a multi-agent LLM framework for emergency scheduling that generates formal instructions and uses local distillation to reduce latency
Details
Motivation: Existing scheduling methods struggle with unpredictable real-world emergencies; LLMs show promise but have high latency issues for emergency handling.
Method: Multi-agent framework with Perception and Emergency Decision agents, plus span-focused loss-driven local distillation to transfer C-LLM capabilities to lightweight models.
Result: Achieved 98.49%, 94.97%, 97.50% success rates in Port, Warehousing, Deck datasets with average processing times of 0.33s, 0.23s, 0.19s
Conclusion: MAFIG effectively mitigates emergency impacts and improves scheduling system robustness and adaptability
Abstract: Emergency situations in scheduling systems often trigger local functional failures that undermine system stability and even cause system collapse. Existing methods primarily rely on robust scheduling or reactive scheduling, handling emergencies through predefined rules or rescheduling strategies. However, the diversity and unpredictability of real-world emergencies make them difficult to anticipate, which limits the adaptability of these methods in complex scenarios. Recent studies have shown that Large Language Models (LLMs) possess strong potential for complex scheduling tasks because of their extensive prior knowledge and strong reasoning capabilities. Nevertheless, the high inference latency of LLMs and the lengthy contextual information of scheduling systems significantly hinder their application for emergency handling. To mitigate these issues, we propose the Multi-agent Driven Formal Instruction Generation Framework (MAFIG). The framework constrains the decision scope to local functional modules affected by emergency situations and repairs scheduling logic rapidly by generating formal instructions. MAFIG contains a Perception Agent and an Emergency Decision Agent, which together mitigate the adverse impact of lengthy system contexts on emergency decision-making. We further introduce a span-focused loss-driven local distillation mechanism (SFL) to transfer the decision-making capability of powerful Cloud Large Language Models (C-LLMs) to lightweight local models, reducing inference latency while preserving decision-making effectiveness. Experiments on the Port, Warehousing, and Deck scheduling datasets show success rates of 98.49%, 94.97%, and 97.50%, with average processing times of 0.33 s, 0.23 s, and 0.19 s. These results demonstrate that MAFIG effectively mitigates the impact of emergencies and improves the robustness and adaptability of scheduling systems.
[866] Sanity Checks for Agentic Data Science
Zachary T. Rewolinski, Austin V. Zane, Hao Huang, Chandan Singh, Chenglong Wang, Jianfeng Gao, Bin Yu
Main category: cs.AI
TL;DR: Proposes two lightweight sanity checks based on Predictability-Computability-Stability framework to evaluate trustworthiness of agentic data science pipeline outputs, particularly for detecting when systems like OpenAI Codex produce falsely optimistic conclusions.
Details
Motivation: Agentic data science pipelines (like OpenAI Codex) can produce falsely optimistic conclusions that are difficult for users to detect, creating a need for methods to assess the trustworthiness of these automated statistical analyses.
Method: Two sanity checks using reasonable perturbations to screen whether an agent can reliably distinguish signal from noise, acting as a falsifiability constraint. These checks characterize whether outputs have found stable signal, are responding to noise, or are sensitive to incidental input aspects.
Result: Validation on synthetic data shows checks track ground-truth signal strength. Application to 11 real-world datasets using OpenAI Codex reveals that in 6 datasets, affirmative conclusions are not well-supported despite single ADS runs suggesting otherwise. ADS self-reported confidence is poorly calibrated to empirical stability.
Conclusion: The proposed sanity checks provide a practical method for assessing trustworthiness of agentic data science outputs, helping users detect when automated analyses may be unreliable or responding to noise rather than stable signal.
Abstract: Agentic data science (ADS) pipelines have grown rapidly in both capability and adoption, with systems such as OpenAI Codex now able to directly analyze datasets and produce answers to statistical questions. However, these systems can reach falsely optimistic conclusions that are difficult for users to detect. To address this, we propose a pair of lightweight sanity checks grounded in the Predictability-Computability-Stability (PCS) framework for veridical data science. These checks use reasonable perturbations to screen whether an agent can reliably distinguish signal from noise, acting as a falsifiability constraint that can expose affirmative conclusions as unsupported. Together, the two checks characterize the trustworthiness of an ADS output, e.g. whether it has found stable signal, is responding to noise, or is sensitive to incidental aspects of the input. We validate the approach on synthetic data with controlled signal-to-noise ratios, confirming that the sanity checks track ground-truth signal strength. We then demonstrate the checks on 11 real-world datasets using OpenAI Codex, characterizing the trustworthiness of each conclusion and finding that in 6 of the datasets an affirmative conclusion is not well-supported, even though a single ADS run may support one. We further analyze failure modes of ADS systems and find that ADS self-reported confidence is poorly calibrated to the empirical stability of its conclusions.
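The perturbation idea is concrete enough to illustrate. One classic "reasonable perturbation" is outcome permutation: destroy any real signal, re-run the agent, and see whether it still reports a finding. The paper's two PCS-grounded checks are not reproduced in detail here; run_agent is a hypothetical callable that re-runs the ADS pipeline and returns True for an affirmative conclusion.

```python
import numpy as np
import pandas as pd

def label_permutation_check(df: pd.DataFrame, outcome_col: str,
                            run_agent, n_perturb: int = 5) -> float:
    """One plausible perturbation-based sanity check: permuting the outcome
    column destroys any real signal, so an agent that still reports an
    affirmative finding on the shuffled data is likely responding to noise.
    Returns the rate of affirmative conclusions on noise (high => untrustworthy)."""
    rng = np.random.default_rng(0)
    affirmative = 0
    for _ in range(n_perturb):
        shuffled = df.copy()
        shuffled[outcome_col] = rng.permutation(shuffled[outcome_col].values)
        if run_agent(shuffled):
            affirmative += 1
    return affirmative / n_perturb
```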
[867] Diffusion-CAM: Faithful Visual Explanations for dMLLMs
Haomin Zuo, Yidi Li, Luoxiao Yang, Xiaofeng Zhang
Main category: cs.AI
TL;DR: Diffusion-CAM: First interpretability method for diffusion-based multimodal LLMs that addresses the parallel denoising architecture’s smooth activation patterns, outperforming existing CAM methods.
Details
Motivation: Diffusion-based MLLMs have advanced multimodal generation but lack interpretability mechanisms. Their parallel denoising creates smooth, distributed activation patterns across sequences, making traditional CAM methods (designed for local, sequential dependencies) unsuitable for interpreting these non-autoregressive behaviors.
Method: Proposes Diffusion-CAM with raw activation maps from differentiable probing of intermediate transformer representations, capturing both latent features and class-specific gradients. Includes four key modules to address spatial ambiguity, intra-image confounders, and redundant token correlations inherent in diffusion stochasticity.
Result: Extensive experiments show Diffusion-CAM significantly outperforms state-of-the-art methods in both localization accuracy and visual fidelity, establishing a new standard for understanding parallel generation in diffusion multimodal systems.
Conclusion: Diffusion-CAM is the first interpretability method specifically tailored for diffusion-based MLLMs, effectively addressing the unique challenges of parallel denoising architectures and providing better understanding of multimodal generation processes.
Abstract: While diffusion Multimodal Large Language Models (dMLLMs) have recently achieved remarkable strides in multimodal generation, the development of interpretability mechanisms has lagged behind their architectural evolution. Unlike traditional autoregressive models that produce sequential activations, diffusion-based architectures generate tokens via parallel denoising, resulting in smooth, distributed activation patterns across the entire sequence. Consequently, existing Class Activation Mapping (CAM) methods, which are tailored for local, sequential dependencies, are ill-suited for interpreting these non-autoregressive behaviors. To bridge this gap, we propose Diffusion-CAM, the first interpretability method specifically tailored for dMLLMs. We derive raw activation maps by differentiably probing intermediate representations in the transformer backbone, accordingly capturing both latent features and their class-specific gradients. To address the inherent stochasticity of these raw signals, we incorporate four key modules to resolve spatial ambiguity and mitigate intra-image confounders and redundant token correlations. Extensive experiments demonstrate that Diffusion-CAM significantly outperforms SoTA methods in both localization accuracy and visual fidelity, establishing a new standard for understanding the parallel generation process of diffusion multimodal systems.
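For orientation, the raw map Diffusion-CAM starts from is in the gradient-weighted (Grad-CAM-style) family; a generic sketch of that starting point follows, with the paper's four refinement modules deliberately omitted since their details are not public:

```python
import torch

def raw_activation_map(hidden: torch.Tensor, class_logit: torch.Tensor):
    """Generic gradient-weighted token relevance map (the raw signal a
    Diffusion-CAM-style method would refine further).
    hidden: (num_tokens, dim) intermediate transformer activations with
    requires_grad=True; class_logit: scalar score for the target class."""
    (grads,) = torch.autograd.grad(class_logit, hidden, retain_graph=True)
    weights = grads.mean(dim=0)          # per-channel importance
    cam = torch.relu(hidden @ weights)   # (num_tokens,) token relevance
    return cam / (cam.max() + 1e-8)      # normalize to [0, 1]
```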
[868] Min-$k$ Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics
Yuanhao Ding, Meimingwei Li, Esteban Garces Arias, Matthias Aßenmacher, Christian Heumann, Chongsheng Zhang
Main category: cs.AI
TL;DR: Min-k Sampling: A temperature-invariant decoding strategy that dynamically identifies semantic cliffs in logit distributions to improve text generation quality across diverse tasks.
Details
Motivation: Existing decoding strategies like Top-k, Top-p, and Min-p are highly sensitive to temperature settings, while recent logit-space approaches like Top-nσ rely on global statistics that fail to capture fine-grained confidence structures among top candidates.
Method: Min-k Sampling analyzes the local shape of sorted logit distributions to identify “semantic cliffs” - sharp transitions from high-confidence core tokens to uncertain long-tail tokens. It computes a position-weighted relative decay rate to dynamically determine truncation boundaries at each generation step.
Result: Min-k achieves strict temperature invariance and shows low sensitivity to hyperparameter choices. Experiments on reasoning benchmarks, creative writing tasks, and human evaluation demonstrate consistent improvements in text quality, maintaining robust performance even under extreme temperature settings where probability-based methods fail.
Conclusion: Min-k Sampling provides a more robust and effective decoding strategy that addresses fundamental limitations of existing methods, offering temperature invariance and better capture of fine-grained confidence structures in language model outputs.
Abstract: The quality of text generated by large language models depends critically on the decoding sampling strategy. While mainstream methods such as Top-$k$, Top-$p$, and Min-$p$ achieve a balance between diversity and accuracy through probability-space truncation, they share an inherent limitation: extreme sensitivity to the temperature parameter. Recent logit-space approaches like Top-$n\sigma$ achieve temperature invariance but rely on global statistics that are susceptible to long-tail noise, failing to capture fine-grained confidence structures among top candidates. We propose \textbf{Min-$k$ Sampling}, a novel dynamic truncation strategy that analyzes the local shape of the sorted logit distribution to identify “semantic cliffs”: sharp transitions from high-confidence core tokens to uncertain long-tail tokens. By computing a position-weighted relative decay rate, Min-$k$ dynamically determines truncation boundaries at each generation step. We formally prove that Min-$k$ achieves strict temperature invariance and empirically demonstrate its low sensitivity to hyperparameter choices. Experiments on multiple reasoning benchmarks, creative writing tasks, and human evaluation show that Min-$k$ consistently improves text quality, maintaining robust performance even under extreme temperature settings where probability-based methods collapse. We make our code, models, and analysis tools publicly available.
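A hedged sketch of the cliff-finding idea. The paper's exact position-weighted relative decay statistic is not spelled out in the abstract, so the 1/(i+1) weighting below is an assumption; what the sketch does preserve is the key property that truncation is computed on raw logits, so uniformly rescaling logits by a temperature leaves the kept set unchanged.

```python
import torch

def min_k_sample(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Sketch of logit-space cliff truncation; logits is a 1-D tensor over
    the vocabulary (length >= 2). The decay weighting is an assumption."""
    sorted_logits, order = torch.sort(logits, descending=True)
    gaps = sorted_logits[:-1] - sorted_logits[1:]              # logit drops
    positions = torch.arange(1, gaps.numel() + 1, dtype=logits.dtype)
    decay = gaps / positions                     # down-weight late gaps
    k = int(torch.argmax(decay)) + 1             # cut at the sharpest "cliff"
    # Temperature is applied only AFTER truncation, so the kept set is
    # invariant to temperature scaling, matching the paper's claim.
    probs = torch.softmax(sorted_logits[:k] / temperature, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return int(order[choice])
```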
[869] Introspective Diffusion Language Models
Yifan Yu, Yuqing Jian, Junxiong Wang, Zhongzhu Zhou, Donglin Zhuang, Xinyu Fang, Sri Yanamandra, Xiaoxia Wu, Qingyang Wu, Shuaiwen Leon Song, Tri Dao, Ben Athiwaratkun, James Zou, Fan Lai, Chenfeng Xu
Main category: cs.AI
TL;DR: I-DLM introduces introspective diffusion language models that achieve AR-level quality while maintaining parallel generation, using introspective strided decoding to verify previous tokens while generating new ones.
Details
Motivation: Diffusion language models lag behind autoregressive models in quality due to lack of introspective consistency - AR models agree with their own generations while DLMs often don't.
Method: I-DLM uses introspective strided decoding (ISD) algorithm that enables the model to verify previously generated tokens while advancing new ones in the same forward pass, combined with AR-inherited optimizations and stationary-batch scheduler.
Result: First DLM to match same-scale AR counterpart quality, outperforming prior DLMs in both model quality and serving efficiency across 15 benchmarks, achieving 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6.
Conclusion: I-DLM successfully bridges the quality gap between diffusion and autoregressive language models while maintaining parallel generation advantages, enabling high-throughput serving with AR-level quality.
Abstract: Diffusion language models promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We trace this gap to a failure of introspective consistency: AR models agree with their own generations, while DLMs often do not. We define the introspective acceptance rate, which measures whether a model accepts its previously generated tokens. This reveals why AR training has a structural advantage: causal masking and logit shifting implicitly enforce introspective consistency. Motivated by this observation, we introduce Introspective Diffusion Language Model (I-DLM), a paradigm that retains diffusion-style parallel decoding while inheriting the introspective consistency of AR training. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. From a systems standpoint, we build the I-DLM inference engine on AR-inherited optimizations and further customize it with a stationary-batch scheduler. To the best of our knowledge, I-DLM is the first DLM to match the quality of its same-scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks. It reaches 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6, exceeding LLaDA-2.1-mini (16B) by more than 26 and 15 points, respectively. Beyond quality, I-DLM is designed for the growing demand of large-concurrency serving, delivering about 3x higher throughput than prior state-of-the-art DLMs.
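The introspective acceptance rate has a natural minimal form for an AR-style scorer; a sketch under that reading (the paper's protocol for diffusion models, which would re-score under masked denoising rather than causal prediction, may differ, and the HuggingFace-style model(...).logits interface is an assumption):

```python
import torch

@torch.no_grad()
def introspective_acceptance_rate(model, token_ids: torch.Tensor) -> float:
    """Fraction of the model's own previously generated tokens that it
    would regenerate greedily when re-scored.
    token_ids: (seq_len,) ids of a completed generation; model is assumed
    to return next-token logits per position, HF-style."""
    logits = model(token_ids.unsqueeze(0)).logits[0]   # (seq_len, vocab)
    preds = logits[:-1].argmax(dim=-1)                 # prediction for token t+1
    return (preds == token_ids[1:]).float().mean().item()
```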
[870] Intelligent Approval of Access Control Flow in Office Automation Systems via Relational Modeling
Dugang Liu, Zulong Chen, Chuanfei Xu, Jiaxuan He, Yunlu Ma, Jia Xu
Main category: cs.AI
TL;DR: A relational modeling-driven intelligent approval framework for office automation systems that automates access control flow approval using binary and ternary relation modeling.
Details
Motivation: Traditional access control flow approval in office automation systems requires manual approval at each step, consuming significant manpower and time, creating an urgent need for intelligent automation solutions.
Method: RMIA framework with two core modules: (1) binary relation modeling to characterize applicant-approver coupling relations from coarse-grained perspective, and (2) ternary relation modeling using resource information as core to characterize complex relations between applicants, resources, and approvers for fine-grained decision-making.
Result: Extensive experiments conducted on two product datasets and an online A/B test to verify the effectiveness of the proposed RMIA framework.
Conclusion: The proposed RMIA framework provides an intelligent solution for automating access control flow approval in office automation systems, addressing the efficiency and intelligence issues of traditional approaches.
Abstract: Office automation (OA) systems play a crucial role in enterprise operations and management, with access control flow approval (ACFA) being a key component that manages the accessibility of various resources. However, traditional ACFA requires approval from the person in charge at each step, which consumes a significant amount of manpower and time. Its intelligence is a crucial issue that needs to be addressed urgently by all companies. In this paper, we propose a novel relational modeling-driven intelligent approval (RMIA) framework to automate ACFA. Specifically, our RMIA consists of two core modules: (1) The binary relation modeling module aims to characterize the coupling relation between applicants and approvers and provide reliable basic information for ACFA decision-making from a coarse-grained perspective. (2) The ternary relation modeling module utilizes specific resource information as its core, characterizing the complex relations between applicants, resources, and approvers, and thus provides fine-grained gain information for informed decision-making. Then, our RMIA effectively fuses these two kinds of information to form the final decision. Finally, extensive experiments are conducted on two product datasets and an online A/B test to verify the effectiveness of RMIA.
[871] From Topology to Trajectory: LLM-Driven World Models For Supply Chain Resilience
Jia Luo
Main category: cs.AI
TL;DR: ReflectiChain is a cognitive agentic framework for resilient semiconductor supply chain planning that integrates latent trajectory rehearsal with generative world modeling and retrospective agentic RL to address decision paralysis in non-stationary policy environments.
Details
Motivation: Semiconductor supply chains face unprecedented resilience challenges due to global geopolitical turbulence. Conventional LLM planners suffer from Decision Paralysis and Grounding Gap when confronting non-stationary "Policy Black Swan" events due to lack of physical environmental modeling.
Method: Introduces ReflectiChain framework with: 1) Latent Trajectory Rehearsal powered by generative world model, coupling reflection-in-action (System 2 deliberation) with delayed reflection-on-action; 2) Retrospective Agentic RL mechanism for autonomous policy evolution during deployment phase.
Result: On Semi-Sim benchmark under extreme scenarios (export bans, material shortages): 250% improvement in average step rewards over strongest LLM baselines; restores Operability Ratio from 13.3% to over 88.5%; ensures robust gradient convergence.
Conclusion: Synergy between physical grounding constraints and double-loop learning is fundamental to bridging the gap between semantic reasoning and physical reality for long-horizon strategic planning in supply chain resilience.
Abstract: Semiconductor supply chains face unprecedented resilience challenges amidst global geopolitical turbulence. Conventional Large Language Model (LLM) planners, when confronting such non-stationary “Policy Black Swan” events, frequently suffer from Decision Paralysis or a severe Grounding Gap due to the absence of physical environmental modeling. This paper introduces ReflectiChain, a cognitive agentic framework tailored for resilient macroeconomic supply chain planning. The core innovation lies in the integration of Latent Trajectory Rehearsal powered by a generative world model, which couples reflection-in-action (System 2 deliberation) with delayed reflection-on-action. Furthermore, we leverage a Retrospective Agentic RL mechanism to enable autonomous policy evolution during the deployment phase (test-time). Evaluations conducted on our high-fidelity benchmark, Semi-Sim, demonstrate that under extreme scenarios such as export bans and material shortages, ReflectiChain achieves a 250% improvement in average step rewards over the strongest LLM baselines. It successfully restores the Operability Ratio (OR) from a deficient 13.3% to over 88.5% while ensuring robust gradient convergence. Ablation studies further underscore that the synergy between physical grounding constraints and double-loop learning is fundamental to bridging the gap between semantic reasoning and physical reality for long-horizon strategic planning.
[872] EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models
Jincheng Xie, Xingchen Xiao, Runheng Liu, Zhongyi Huang, Yu Zheng, Heyan Huang
Main category: cs.AI
TL;DR: EmergentBridge improves multimodal embedding alignment for unpaired modality pairs without exhaustive supervision, addressing gradient interference in proxy-based bridging.
Details
Motivation: Real-world multimodal systems often have supervision only for a subset of modality pairs (e.g., image-text), leaving other pairs (e.g., audio-depth, infrared-audio) weakly connected and performing poorly on zero-shot transfer. This sparse-pairing regime limits scaling unified embedding systems without curating exhaustive pairwise data.
Method: EmergentBridge learns a mapping that produces a noisy bridge anchor (proxy embedding of an already-aligned modality) from an anchor embedding, then enforces proxy alignment only in the subspace orthogonal to the anchor-alignment direction. This preserves anchor alignment while strengthening non-anchor connectivity, avoiding gradient interference that degrades existing retrieval/classification structures.
Result: Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment for unpaired modality pairs.
Conclusion: EmergentBridge provides an effective embedding-level bridging framework that improves performance on unpaired modality pairs without requiring exhaustive pairwise supervision, enabling better scaling of unified multimodal embedding systems.
Abstract: Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image–text), leaving \emph{unpaired} modality pairs (e.g., audio$\leftrightarrow$depth, infrared$\leftrightarrow$audio) weakly connected and thus performing poorly on zero-shot transfer. Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose \textbf{EmergentBridge}, an embedding-level bridging framework that improves performance on these unpaired pairs \emph{without requiring exhaustive pairwise supervision}. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce \emph{gradient interference}, degrading the anchor-alignment structure that existing retrieval/classification relies on. EmergentBridge addresses this by (i) learning a mapping that produces a \emph{noisy bridge anchor} (a proxy embedding of an already-aligned modality) from an anchor embedding, and (ii) enforcing proxy alignment only in the subspace orthogonal to the anchor-alignment direction, preserving anchor alignment while strengthening non-anchor connectivity. Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment.
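The orthogonal-subspace constraint is the most mechanical piece of the method; a minimal sketch, assuming the protected anchor-alignment subspace is spanned by a single direction d (the paper may define that subspace differently):

```python
import torch

def bridge_loss(new_modality_emb: torch.Tensor,
                bridge_anchor: torch.Tensor,
                anchor_dir: torch.Tensor) -> torch.Tensor:
    """Align the new modality to the (noisy) bridge anchor only in the
    subspace orthogonal to the anchor-alignment direction, leaving the
    existing anchor alignment untouched. Shapes: (batch, dim) for the
    embeddings, (dim,) for anchor_dir. Interface is hypothetical."""
    d = anchor_dir / anchor_dir.norm()

    def drop_anchor_component(x):
        return x - (x @ d).unsqueeze(-1) * d   # project out direction d

    residual = drop_anchor_component(new_modality_emb) \
             - drop_anchor_component(bridge_anchor)
    return residual.pow(2).sum(dim=-1).mean()  # orthogonal-subspace MSE
```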
[873] AI Integrity: A New Paradigm for Verifiable AI Governance
Seulki Lee
Main category: cs.AI
TL;DR: AI Integrity is a new governance paradigm focusing on verifying AI reasoning processes rather than just outcomes, using a 4-layer Authority Stack model to ensure transparent, auditable decision-making paths.
Details
Motivation: Current AI governance paradigms (AI Ethics, Safety, Alignment) only evaluate outcomes, not the reasoning process. As AI shapes high-stakes decisions in healthcare, law, defense, and education, there's a need to verify that AI reasoning processes are transparent and protected from corruption.
Method: Introduces the Authority Stack - a 4-layer cascade model (Normative, Epistemic, Source, and Data Authority) grounded in established frameworks like Schwartz Basic Human Values, Walton argumentation schemes, and Source Credibility Theory. Proposes PRISM (Profile-based Reasoning Integrity Stack Measurement) framework with six core metrics for operational measurement.
Result: The paper defines AI Integrity as a procedural concept distinct from existing paradigms, characterizes legitimate cascading vs. Authority Pollution, identifies Integrity Hallucination as the central threat, and provides a measurable framework for implementation.
Conclusion: AI Integrity offers a new approach to AI governance that focuses on verifying reasoning processes rather than prescribing values, enabling transparent and auditable AI decision-making across high-stakes domains.
Abstract: AI systems increasingly shape high-stakes decisions in healthcare, law, defense, and education, yet existing governance paradigms – AI Ethics, AI Safety, and AI Alignment – share a common limitation: they evaluate outcomes rather than verifying the reasoning process itself. This paper introduces AI Integrity, a concept defined as a state in which the Authority Stack of an AI system – its layered hierarchy of values, epistemological standards, source preferences, and data selection criteria – is protected from corruption, contamination, manipulation, and bias, and maintained in a verifiable manner. We distinguish AI Integrity from the three existing paradigms, define the Authority Stack as a 4-layer cascade model (Normative, Epistemic, Source, and Data Authority) grounded in established academic frameworks – Schwartz Basic Human Values for normative authority, Walton argumentation schemes with GRADE/CEBM hierarchies for epistemic authority, and Source Credibility Theory for source authority – characterize the distinction between legitimate cascading and Authority Pollution, and identify Integrity Hallucination as the central measurable threat to value consistency. We further specify the PRISM (Profile-based Reasoning Integrity Stack Measurement) framework as the operational methodology, defining six core metrics and a phased research roadmap. Unlike normative frameworks that prescribe which values are correct, AI Integrity is a procedural concept: it requires that the path from evidence to conclusion be transparent and auditable, regardless of which values a system holds.
[874] PRISM Risk Signal Framework: Hierarchy-Based Red Lines for AI Behavioral Risk
Seulki Lee
Main category: cs.AI
TL;DR: PRISM framework detects AI safety risks through structural anomalies in value hierarchies, evidence weighting, and source trust rather than specific harmful outputs.
Details
Motivation: Current AI safety approaches focus on case-specific red lines (specific prompts/outputs/harms), which are reactive, enumerative, and subjective. The paper argues for a more fundamental approach that detects dangerous reasoning structures before they produce harmful outputs.
Method: PRISM framework defines 27 behavioral risk signals from structural anomalies across three hierarchy levels: value prioritization (L4), evidence weighting (L3), and source trust (L2). Uses dual-threshold principle combining absolute rank position and relative win-rate gap for two-tier classification (Confirmed Risk vs. Watch Signal). Evaluated on ~397,000 forced-choice responses from 7 AI models.
Result: The signal taxonomy successfully discriminates between models with structurally extreme profiles, context-dependent risk, and balanced hierarchies. Demonstrates anticipatory detection capacity for dangerous reasoning structures.
Conclusion: Hierarchy-based red lines offer advantages over case-specific approaches: anticipatory (detect before harm), comprehensive (single signal subsumes unlimited violations), and measurable (grounded in empirical forced-choice data).
Abstract: Current approaches to AI safety define red lines at the case level: specific prompts, specific outputs, specific harms. This paper argues that red lines can be set more fundamentally – at the level of value, evidence, and source hierarchies that govern AI reasoning. Using the PRISM (Profile-based Reasoning Integrity Stack Measurement) framework, we define a taxonomy of 27 behavioral risk signals derived from structural anomalies in how AI systems prioritize values (L4), weight evidence types (L3), and trust information sources (L2). Each signal is evaluated through a dual-threshold principle combining absolute rank position and relative win-rate gap, producing a two-tier classification (Confirmed Risk vs. Watch Signal). The hierarchy-based approach offers three advantages over case-specific red lines: it is anticipatory rather than reactive (detecting dangerous reasoning structures before they produce harmful outputs), comprehensive rather than enumerative (a single value-hierarchy signal subsumes an unlimited number of case-specific violations), and measurable rather than subjective (grounded in empirical forced-choice data). We demonstrate the framework’s detection capacity using approximately 397,000 forced-choice responses from 7 AI models across three Authority Stack layers, showing that the signal taxonomy successfully discriminates between models with structurally extreme profiles, models with context-dependent risk, and models with balanced hierarchies.
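A hypothetical instantiation of the dual-threshold, two-tier classification; the abstract fixes only the two ingredients (absolute rank position and relative win-rate gap), so the AND/OR combination rule and the threshold values below are our assumptions:

```python
def classify_risk_signal(rank: int, win_rate_gap: float,
                         rank_threshold: int = 3,
                         gap_threshold: float = 0.2) -> str:
    """Illustrative two-tier classification: both criteria extreme =>
    Confirmed Risk; exactly one extreme => Watch Signal. Thresholds and
    the combination rule are hypothetical, not the paper's."""
    extreme_rank = rank <= rank_threshold        # absolute rank position
    extreme_gap = win_rate_gap >= gap_threshold  # relative win-rate gap
    if extreme_rank and extreme_gap:
        return "Confirmed Risk"
    if extreme_rank or extreme_gap:
        return "Watch Signal"
    return "No Signal"
```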
[875] Hodoscope: Unsupervised Monitoring for AI Misbehaviors
Ziqian Zhong, Shashwat Saxena, Aditi Raghunathan
Main category: cs.AI
TL;DR: Hodoscope: An unsupervised monitoring tool that detects novel AI agent misbehaviors by analyzing group-wise behavioral differences rather than predefined failure categories.
Details
Motivation: Supervised monitoring approaches (human-written rules or LLM-based judges) fail to detect novel misbehaviors outside predefined categories, and LLM judges can be unreliable. Need for methods that help humans discover problematic behaviors without prior assumptions.
Method: Hodoscope compares behavior distributions across groups (e.g., different models or benchmarks) to identify distinctive action patterns. It highlights behavioral anomalies as potential misbehaviors for human review, using group-wise differences as primary signal.
Result: Discovered previously unknown vulnerability in Commit0 benchmark (unsquashed git history allowing ground-truth recovery), independently recovered known exploits on ImpossibleBench and SWE-bench. Reduced review effort by 6-23× compared to naive uniform sampling. Behavior descriptions from Hodoscope improved LLM-based judge detection accuracy.
Conclusion: Unsupervised monitoring via group-wise behavioral analysis effectively discovers novel AI agent misbehaviors, complements supervised approaches, and provides path from unsupervised to supervised monitoring.
Abstract: Existing approaches to monitoring AI agents rely on supervised evaluation: human-written rules or LLM-based judges that check for known failure modes. However, novel misbehaviors may fall outside predefined categories entirely and LLM-based judges can be unreliable. To address this, we formulate unsupervised monitoring, drawing an analogy to unsupervised learning. Rather than checking for specific misbehaviors, an unsupervised monitor assists humans in discovering problematic agent behaviors without prior assumptions about what counts as problematic, leaving that determination to the human. We observe that problematic behaviors are often distinctive: a model exploiting a benchmark loophole exhibits actions absent from well-behaved baselines, and a vulnerability unique to one evaluation manifests as behavioral anomalies when the same model runs across multiple benchmarks. This motivates using group-wise behavioral differences as the primary signal for unsupervised monitoring. We introduce Hodoscope, a tool that operationalizes this insight. Hodoscope compares behavior distributions across groups and highlights distinctive and potentially suspicious action patterns for human review. Using Hodoscope, we discover a previously unknown vulnerability in the Commit0 benchmark (unsquashed git history allowing ground-truth recovery, inflating scores for at least five models) and independently recover known exploits on ImpossibleBench and SWE-bench. Quantitative evaluation estimates that our method reduces review effort by 6-23$\times$ compared to naive uniform sampling. Finally, we show that behavior descriptions discovered through Hodoscope could improve the detection accuracy of LLM-based judges, demonstrating a path from unsupervised to supervised monitoring.
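The group-wise comparison at Hodoscope's core can be illustrated with a log-frequency-ratio ranking over action strings; the real tool operates on richer behavior distributions and surfaces candidates for human review, so treat this only as a sketch:

```python
import math
from collections import Counter

def distinctive_actions(actions_a, actions_b, eps=1e-6, top_n=10):
    """Rank actions by how over-represented they are in group A relative
    to group B (e.g., one model vs. well-behaved baselines). Inputs are
    non-empty flat lists of action strings from agent transcripts."""
    ca, cb = Counter(actions_a), Counter(actions_b)
    na, nb = sum(ca.values()), sum(cb.values())
    score = {a: math.log((ca[a] / na + eps) / (cb.get(a, 0) / nb + eps))
             for a in ca}
    # Highest-scoring actions are the distinctive patterns to hand to a
    # human reviewer.
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```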
[876] Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation
Chen Huang, Zitan Jiang, Changyi Zou, Wenqiang Lei, See-Kiong Ng
Main category: cs.AI
TL;DR: PROCHATIP is a proactive chatbot framework designed to strategically probe users for target information while minimizing conversation turns and user friction, redefining chatbots as business intelligence engines.
Details
Motivation: Current customer service chatbots are reactive support tools, but there's a need for them to serve as strategic interfaces for harvesting high-value business intelligence through proactive information gathering.
Method: Introduces Proactive Information Probing task and PROCHATIP framework with a specialized conversation strategy module trained to master the delicate timing of probes to optimize when to ask for target information.
Result: PROCHATIP significantly outperforms baselines, exhibiting superior capability in both information probing and service quality, demonstrating effective proactive business intelligence gathering.
Conclusion: The work redefines the commercial utility of chatbots, positioning them as scalable, cost-effective engines for proactive business intelligence through strategic information probing.
Abstract: Customer service chatbots are increasingly expected to serve not merely as reactive support tools for users, but as strategic interfaces for harvesting high-value information and business intelligence. In response, we make three main contributions. 1) We introduce and define a novel task of Proactive Information Probing, which optimizes when to probe users for pre-specified target information while minimizing conversation turns and user friction. 2) We propose PROCHATIP, a proactive chatbot framework featuring a specialized conversation strategy module trained to master the delicate timing of probes. 3) Experiments demonstrate that PROCHATIP significantly outperforms baselines, exhibiting superior capability in both information probing and service quality. We believe that our work effectively redefines the commercial utility of chatbots, positioning them as scalable, cost-effective engines for proactive business intelligence. Our code is available at https://github.com/SCUNLP/PROCHATIP.
[877] Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents
Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He
Main category: cs.AI
TL;DR: Natural language instruction files for AI coding agents improve performance by 7-14 percentage points, but random rules work as well as expert ones, suggesting context priming rather than specific instructions. Negative constraints help while positive directives hurt, with individual rules being harmful in isolation but collectively beneficial.
Details
Motivation: Developers increasingly use natural language instruction files (like CLAUDE.md, .cursorrules) to guide AI coding agents, but there's no empirical evidence on whether these rules actually improve performance or what makes rules beneficial.
Method: Scraped 679 instruction files (25,532 rules) from GitHub and conducted large-scale empirical evaluation with over 5,000 agent runs using a state-of-the-art coding agent on SWE-bench Verified. Analyzed rule effectiveness through controlled experiments and examined through the lens of potential-based reward shaping (PBRS).
Result: Rules improve performance by 7-14 percentage points, but random rules help as much as expert-curated ones. Negative constraints (“do not refactor unrelated code”) are the only individually beneficial rule type, while positive directives (“follow code style”) actively hurt. Individual rules are mostly harmful in isolation yet collectively helpful, with no degradation up to 50 rules.
Conclusion: Well-intentioned rules can degrade agent performance, revealing hidden reliability risks. The key principle for safe agent configuration is to constrain what agents must not do rather than prescribing what they should do.
Abstract: Developers increasingly guide AI coding agents through natural language instruction files (e.g., CLAUDE.md, .cursorrules), yet no controlled study has measured whether these rules actually improve agent performance or which properties make a rule beneficial. We scrape 679 such files (25,532 rules) from GitHub and conduct the first large-scale empirical evaluation, running over 5,000 agent runs with a state-of-the-art coding agent on SWE-bench Verified. Rules improve performance by 7–14 percentage points, but random rules help as much as expert-curated ones – suggesting rules work through context priming rather than specific instruction. Negative constraints (“do not refactor unrelated code”) are the only individually beneficial rule type, while positive directives (“follow code style”) actively hurt – a pattern we analyze through the lens of potential-based reward shaping (PBRS). Moreover, individual rules are mostly harmful in isolation yet collectively helpful, with no degradation up to 50 rules. These findings expose a hidden reliability risk – well-intentioned rules routinely degrade agent performance – and provide a clear principle for safe agent configuration: constrain what agents must not do, rather than prescribing what they should.
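For readers unfamiliar with the lens: potential-based reward shaping (Ng, Harada, and Russell, 1999) is the classical result that adding a shaping term of the form

$$F(s, a, s') = \gamma\,\Phi(s') - \Phi(s)$$

for any potential function $\Phi$ over states preserves optimal policies, while arbitrary non-potential shaping can change them. One way to read the paper's finding through this lens, hedged since the paper's own formalization is not reproduced here, is that broad negative constraints behave more like a benign potential over agent states (context priming), whereas positive directives act as arbitrary shaping that can distort behavior.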
[878] Frugal Knowledge Graph Construction with Local LLMs: A Zero-Shot Pipeline, Self-Consistency and Wisdom of Artificial Crowds
Pierre Jourlin
Main category: cs.AI
TL;DR: A multi-model zero-shot pipeline for knowledge graph construction and reasoning that runs on consumer hardware, achieving competitive performance without training through techniques like self-consistency and confidence routing.
Details
Motivation: To develop an efficient, locally-executable knowledge graph construction and reasoning system that operates in zero-shot settings without requiring expensive training or specialized hardware, while maintaining competitive performance with supervised approaches.
Method: Multi-model zero-shot pipeline combining multiple LLM architectures with self-consistency sampling, confidence-routing cascades, and diversity mechanisms for difficult multi-hop reasoning. Uses reproducible evaluation framework with DocRED, HotpotQA benchmarks and RAGAS evaluation.
Result: Achieves F1 of 0.70±0.041 on document-level relations (vs 0.80 supervised), 0.80±0.06 accuracy on text-to-query, and 0.46±0.04 EM on multi-hop reasoning. Self-consistency improves EM to 0.48±0.04, while confidence-routing cascade achieves best EM of 0.55±0.04 with 45.4% questions rerouted.
Conclusion: Demonstrates that efficient zero-shot knowledge graph construction and reasoning is feasible on consumer hardware, with multi-model approaches and confidence routing significantly improving performance. Highlights the paradox that strong consensus can indicate collective hallucination rather than reliability.
Abstract: This paper presents an empirical study of a multi-model zero-shot pipeline for knowledge graph construction and exploitation, executed entirely through local inference on consumer-grade hardware. We propose a reproducible evaluation framework integrating two external benchmarks (DocRED, HotpotQA), WebQuestionsSP-style synthetic data, and the RAGAS evaluation framework in an automated pipeline. On 500 document-level relations, our system achieves an F1 of 0.70 $\pm$ 0.041 in zero-shot, compared to 0.80 for supervised DREEAM. Text-to-query achieves an accuracy of 0.80 $\pm$ 0.06 on 200 samples. Multi-hop reasoning achieves an Exact Match (EM) of 0.46 $\pm$ 0.04 on 500 HotpotQA questions, with a RAGAS faithfulness of 0.96 $\pm$ 0.04 on 50 samples. Beyond the pipeline, we study diversity mechanisms for difficult multi-hop reasoning. On 181 questions unsolvable at zero temperature, self-consistency (k=5, T=0.7) recovers up to 23% EM with a single Mixture-of-Experts (MoE) model, but the cross-model oracle (3 architectures x 5 samples) reaches 46.4%. We highlight an agreement paradox: strong consensus among samples signals collective hallucination rather than a reliable answer, echoing the work of Moussaïd et al. on the wisdom of crowds. Extending to the full pipeline (500 questions), self-consistency (k=3) raises EM from 0.46 to 0.48 $\pm$ 0.04. A confidence-routing cascade mechanism (Phi-4 $\rightarrow$ GPT-OSS, k=5) achieves an EM of 0.55 $\pm$ 0.04, the best result obtained, with 45.4% of questions rerouted. Finally, we show that V3 prompt engineering applied to other models does not reproduce the gains observed with Gemma-4, confirming the specific prompt/model interaction. The entire system runs in $\sim$5 h on a single RTX 3090, without any training, for an estimated carbon footprint of 0.09 kg CO2 eq.
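The self-consistency and agreement-ratio machinery is simple enough to sketch; generate below is a hypothetical wrapper around a local model, and the k and temperature defaults mirror the settings quoted above:

```python
from collections import Counter

def self_consistency(generate, question: str, k: int = 5,
                     temperature: float = 0.7):
    """Sample k answers and majority-vote (standard self-consistency).
    Also returns the agreement ratio: per the paper's agreement paradox,
    very high consensus can signal collective hallucination rather than
    reliability, so the ratio is a diagnostic, not a confidence score."""
    answers = [generate(question, temperature=temperature) for _ in range(k)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / k
```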
[879] Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs
Wenkai Li, Fan Yang, Shaunak A. Mehta, Koichi Onoue
Main category: cs.AI
TL;DR: Personality imbuing in LLMs reveals different vulnerability profiles between prompt-based personas and activation steering, with architecture-dependent safety failures that can’t be predicted from prompt-side testing alone.
Details
Motivation: Current safety evaluations for LLMs primarily focus on prompt-based personas, but this approach is incomplete because different methods (prompting vs activation steering) expose different vulnerability profiles that are architecture-dependent.
Method: Conducted 5,568 judged conditions across four standard models from three architecture families, comparing persona danger rankings under system prompting versus activation steering. Used trait refusal alignment framework and heuristic trace diagnostics to analyze geometric accounts of vulnerability differences.
Result: Persona danger rankings under system prompting are preserved across architectures (ρ=0.71-0.96), but activation-steering vulnerability diverges sharply and cannot be predicted from prompt-side rankings. Llama-3.1-8B is more vulnerable to activation steering, while Gemma-3-27B and Qwen3.5 are more vulnerable to prompting. The “prosocial persona paradox” shows P12 (high conscientiousness + high agreeableness) is safest under prompting but becomes highest-ASR under activation steering on Llama-3.1-8B.
Conclusion: Safety evaluations must test both prompting and activation steering methods, as they expose different, architecture-dependent vulnerability profiles. Testing with only one method can miss a model’s dominant failure mode, highlighting the need for comprehensive safety assessment approaches.
Abstract: Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. We show this is incomplete: prompting and activation steering expose different, architecture-dependent vulnerability profiles, and testing with only one method can miss a model’s dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system prompting are preserved across all architectures ($\rho = 0.71$–$0.96$), but activation-steering vulnerability diverges sharply and cannot be predicted from prompt-side rankings: Llama-3.1-8B is substantially more AS-vulnerable, whereas Gemma-3-27B and Qwen3.5 are more vulnerable to prompting. The most striking illustration of this divergence is the prosocial persona paradox: on Llama-3.1-8B, P12 (high conscientiousness + high agreeableness) is among the safest personas under prompting yet becomes the highest-ASR activation-steered persona (ASR ~0.818). This is an inversion robust to coefficient ablation and matched-strength calibration, and replicated on DeepSeek-R1-Distill-Qwen-32B. A trait refusal alignment framework, in which conscientiousness is strongly anti-aligned with refusal on Llama-3.1-8B, offers a partial geometric account. Reasoning provides only partial protection: two 32B reasoning models reach 15–18% prompt-side ASR, and activation steering separates them sharply in both baseline susceptibility and persona-specific vulnerability. Heuristic trace diagnostics suggest that the safer model retains stronger policy recall and self-correction behavior, not merely longer reasoning.
[880] A Proposed Biomedical Data Policy Framework to Reduce Fragmentation, Improve Quality, and Incentivize Sharing in Indian Healthcare in the era of Artificial Intelligence and Digital Health
Nikhil Mehta, Sachin Gupta, Gouri RP Anand
Main category: cs.AI
TL;DR: A framework for incentivizing biomedical data sharing in India through academic recognition, institutional rankings, revenue sharing, and professional roles to overcome fragmentation and enable AI development.
Details
Motivation: India's biomedical data remains fragmented across institutional silos and vendor-locked EMR systems due to misaligned incentives that make data sharing high-risk and low-reward for researchers and institutions, constraining AI ambitions.
Method: Proposes a multi-layered incentive architecture including: recognition of data papers in NMC promotion criteria, open data metrics in NIRF rankings, Shapley Value-based revenue sharing in federated learning consortia, and establishing institutional data stewardship as a professional role.
Result: The framework addresses critical barriers like fear of data quality scrutiny, misinterpretation concerns, and selective reporting bias through mandatory data quality assessment, structured peer review, and academic credit for auditing roles.
Conclusion: The proposed framework addresses regulatory constraints (DPDPA 2023) while engaging with existing policies (NDSAP, Biotech-PRIDE, ANRF guidelines), aiming to create sustainable incentives for biomedical data sharing to support India’s AI ambitions.
Abstract: India generates vast biomedical data through postgraduate research, government hospital services and audits, government schemes, private hospitals and their electronic medical record (EMR) systems, insurance programs and standalone clinics. Unfortunately, these resources remain fragmented across institutional silos and vendor-locked EMR systems. The fundamental bottleneck is not technological but economic and academic. There is a systemic misalignment of incentives that renders data sharing a high-risk, low-reward activity for individual researchers and institutions. Until India’s academic promotion criteria, institutional rankings, and funding mechanisms explicitly recognize and reward data curation as professional work, the nation’s AI ambitions will remain constrained by fragmented, non-interoperable datasets. We propose a multi-layered incentive architecture integrating recognition of data papers in National Medical Commission (NMC) promotion criteria, incorporation of open data metrics into the National Institutional Ranking Framework (NIRF), adoption of Shapley Value-based revenue sharing in federated learning consortia, and establishment of institutional data stewardship as a mainstream professional role. Critical barriers to data sharing, including fear of data quality scrutiny, concerns about misinterpretation, and selective reporting bias, are addressed through mandatory data quality assessment, structured peer review, and academic credit for auditing roles. The proposed framework directly addresses regulatory constraints introduced by the Digital Personal Data Protection Act 2023 (DPDPA), while constructively engaging with the National Data Sharing and Accessibility Policy (NDSAP), Biotech-PRIDE Guidelines, and the Anusandhan National Research Foundation (ANRF) guidelines.
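The revenue-sharing proposal leans on the standard Shapley value: institution $i$'s share of the value created by a federated consortium $N$ of $n$ members is its average marginal contribution over all coalitions $S$,

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr),$$

where the characteristic function $v(S)$ might be, for example, the validation performance of a model trained on coalition $S$'s federated data; that instantiation is our gloss, not the paper's.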
[881] From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning
Chen Zhan, Xiaoyu Tan, Gengchen Ma, Yu-Jie Xiong, Xiaoyan Jiang, Xihe Qiu
Main category: cs.AI
TL;DR: A framework for trustworthy clinical decision support using LLMs with structured reasoning based on Toulmin model, trained via Curriculum Goal-Conditioned Learning to ensure transparent diagnostic arguments.
Details
Motivation: Current LLMs in healthcare produce correct answers through flawed reasoning, lacking transparency needed for clinical safety and accountability. Opaque reasoning is dangerous in high-stakes medical domains where understanding the diagnostic process is as important as the final answer.
Method: Adapts Toulmin model to clinical diagnostics and proposes Curriculum Goal-Conditioned Learning (CGCL) - a three-stage progressive training pipeline: (1) fact extraction and differential diagnosis generation, (2) hypothesis justification with alternative rebuttals, (3) synthesis into qualified conclusions. Validated using T-Eval framework for reasoning integrity measurement.
Result: Achieves diagnostic accuracy and reasoning quality comparable to resource-intensive Reinforcement Learning methods, while offering more stable and efficient training. Demonstrates improved transparency and reliability in clinical argumentation.
Conclusion: The CGCL framework enables LLMs to generate transparent, structured clinical arguments, addressing critical trust and safety concerns in healthcare applications by ensuring reasoning integrity alongside diagnostic accuracy.
Abstract: The integration of Large Language Models (LLMs) into clinical decision support is critically obstructed by their opaque and often unreliable reasoning. In the high-stakes domain of healthcare, correct answers alone are insufficient; clinical practice demands full transparency to ensure patient safety and enable professional accountability. A pervasive and dangerous weakness of current LLMs is their tendency to produce “correct answers through flawed reasoning.” This issue is far more than a minor academic flaw; such process errors signal a fundamental lack of robust understanding, making the model prone to broader hallucinations and unpredictable failures when faced with real-world clinical complexity. In this paper, we establish a framework for trustworthy clinical argumentation by adapting the Toulmin model to the diagnostic process. We propose a novel training pipeline, Curriculum Goal-Conditioned Learning (CGCL), designed to progressively train an LLM to generate diagnostic arguments that explicitly follow this Toulmin structure. CGCL’s progressive three-stage curriculum systematically builds a solid clinical argument: (1) extracting facts and generating differential diagnoses; (2) justifying a core hypothesis while rebutting alternatives; and (3) synthesizing the analysis into a final, qualified conclusion. We validate CGCL using T-Eval, a quantitative framework measuring the integrity of the diagnostic reasoning. Experiments show that our method achieves diagnostic accuracy and reasoning quality comparable to resource-intensive Reinforcement Learning (RL) methods, while offering a more stable and efficient training pipeline.
[882] Environmental Footprint of GenAI Research: Insights from the Moshi Foundation Model
Marta López-Rauhut, Loic Landrieu, Mathieu Aubry, Anne-Laure Ligozat
Main category: cs.AI
TL;DR: Analysis of environmental impacts of training Moshi, a 7B-parameter speech-text foundation model, quantifying GPU-time across all research phases and providing sustainability guidelines for MLLM development.
Details
Motivation: To address the lack of transparency about environmental impacts in GenAI research, particularly for multimodal LLMs, by conducting a comprehensive analysis of compute usage and environmental costs throughout the entire research lifecycle.
Method: Fine-grained analysis of compute spent on Moshi development using life cycle assessment methodology, quantifying GPU-time across model components, training phases, failed runs, debugging, and ablation studies, plus environmental impact assessment of hardware production and use.
Result: First detailed quantification of compute-intensive MLLM research anatomy, revealing environmental impacts across the entire development pipeline and enabling actionable sustainability guidelines.
Conclusion: Provides transparency and actionable recommendations for reducing environmental impacts in MLLM research, promoting more sustainable AI development practices.
Abstract: New multi-modal large language models (MLLMs) are continuously being trained and deployed, following rapid development cycles. This generative AI frenzy is driving steady increases in energy consumption, greenhouse gas emissions, and a plethora of other environmental impacts linked to datacenter construction and hardware manufacturing. Mitigating the environmental consequences of GenAI remains challenging due to an overall lack of transparency by the main actors in the field. Even when the environmental impacts of specific models are mentioned, they are typically restricted to the carbon footprint of the final training run, omitting the research and development stages. In this work, we explore the impact of GenAI research through a fine-grained analysis of the compute spent to create Moshi, a 7B-parameter speech-text foundation model for real-time dialogue developed by Kyutai, a leading privately funded open science AI lab. For the first time, our study dives into the anatomy of compute-intensive MLLM research, quantifying the GPU-time invested in specific model components and training phases, as well as early experimental stages, failed training runs, debugging, and ablation studies. Additionally, we assess the environmental impacts of creating Moshi from beginning to end using a life cycle assessment methodology: we quantify energy and water consumption, greenhouse gas emissions, and mineral resource depletion associated with the production and use of datacenter hardware. Our detailed analysis allows us to provide actionable guidelines to reduce compute usage and environmental impacts of MLLM research, paving the way for more sustainable AI research.
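To make the accounting concrete, here is a back-of-envelope Python sketch of operational energy and emissions from GPU-time, with assumed constants (per-GPU draw, PUE, grid intensity); it is not the paper's life cycle assessment, which also covers hardware manufacturing, water, and mineral depletion, and the phase numbers below are hypothetical.

```python
def training_footprint(gpu_hours, gpu_power_kw=0.7, pue=1.2,
                       carbon_intensity_kg_per_kwh=0.06):
    """Back-of-envelope operational footprint of a training phase.

    gpu_hours: total GPU-hours across runs (including failed/ablation runs).
    gpu_power_kw: assumed average draw per GPU (0.7 kW is roughly an H100 at load).
    pue: datacenter power usage effectiveness (facility overhead multiplier).
    carbon_intensity_kg_per_kwh: grid carbon intensity (0.06 is roughly France).
    Returns (energy_kwh, co2_kg). Embodied hardware impacts are excluded.
    """
    energy_kwh = gpu_hours * gpu_power_kw * pue
    co2_kg = energy_kwh * carbon_intensity_kg_per_kwh
    return energy_kwh, co2_kg

# Hypothetical accounting across research phases, not Moshi's actual numbers.
phases = {"early experiments": 30_000, "failed runs": 12_000,
          "ablations": 20_000, "final training": 80_000}
for name, hours in phases.items():
    e, c = training_footprint(hours)
    print(f"{name}: {e:,.0f} kWh, {c:,.0f} kg CO2e")
```

The key point the paper makes survives even this crude model: the phases before the final training run can dominate the total GPU-hours, so reporting only the final run understates the footprint.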
[883] Measuring the Authority Stack of AI Systems: Empirical Analysis of 366,120 Forced-Choice Responses Across 8 AI Models
Seulki Lee
Main category: cs.AI
TL;DR: First large-scale empirical mapping of AI decision-making across Authority Stack framework (values, evidence preferences, source trust) using PRISM benchmark with 14,175 scenarios per layer across 7 domains, revealing measurable but unstable AI authority hierarchies.
Details
Motivation: To empirically map what values, evidence preferences, and source trust hierarchies AI systems actually exhibit when facing structured dilemmas, addressing the lack of large-scale systematic analysis of AI decision-making across the Authority Stack framework.
Method: Used PRISM benchmark - a forced-choice instrument of 14,175 unique scenarios per layer across 7 professional domains, 3 severity levels, 3 decision timeframes, and 5 scenario variants. Evaluated 8 major AI models at temperature 0, yielding 366,120 total responses across value priorities (L4), evidence-type preferences (L3), and source trust hierarchies (L2).
Result: Key findings: 1) 4:4 split between Universalism-first and Security-first models at value layer; 2) dramatic defense-domain value restructuring with Security surging to 95.1%-99.8% win-rates; 3) divergent evidence hierarchies (empirical-scientific vs pattern-based vs experiential); 4) broad convergence on institutional source trust; 5) Paired Consistency Scores 57.4%-69.2% showing framing sensitivity; Test-Retest Reliability 91.7%-98.6% indicating value instability stems from variant sensitivity.
Conclusion: AI models possess measurable – if sometimes unstable – Authority Stacks with consequential implications for deployment across professional domains, revealing systematic patterns in AI decision-making that vary by domain and framing.
Abstract: What values, evidence preferences, and source trust hierarchies do AI systems actually exhibit when facing structured dilemmas? We present the first large-scale empirical mapping of AI decision-making across all three layers of the Authority Stack framework (S. Lee, 2026a): value priorities (L4), evidence-type preferences (L3), and source trust hierarchies (L2). Using the PRISM benchmark – a forced-choice instrument of 14,175 unique scenarios per layer, spanning 7 professional domains, 3 severity levels, 3 decision timeframes, and 5 scenario variants – we evaluated 8 major AI models at temperature 0, yielding 366,120 total responses. Key findings include: (1) a symmetric 4:4 split between Universalism-first and Security-first models at L4; (2) dramatic defense-domain value restructuring where Security surges to near-ceiling win-rates (95.1%-99.8%) in 6 of 8 models; (3) divergent evidence hierarchies at L3, with some models favoring empirical-scientific evidence while others prefer pattern-based or experiential evidence; (4) broad convergence on institutional source trust at L2; and (5) Paired Consistency Scores (PCS) ranging from 57.4% to 69.2%, revealing substantial framing sensitivity across scenario variants. Test-Retest Reliability (TRR) ranges from 91.7% to 98.6%, indicating that value instability stems primarily from variant sensitivity rather than stochastic noise. These findings demonstrate that AI models possess measurable – if sometimes unstable – Authority Stacks with consequential implications for deployment across professional domains.
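For intuition on the two headline metrics, here is a minimal Python sketch of a win-rate and a paired consistency score over forced-choice responses; the exact PCS/TRR formulas are not given in the abstract, so these definitions are assumptions, and the response data is toy.

```python
from collections import Counter

def win_rate(responses, option):
    """Fraction of forced-choice responses in which `option` was picked."""
    return Counter(responses)[option] / len(responses)

def paired_consistency(choices_a, choices_b):
    """Assumed PCS: fraction of scenario pairs (same dilemma, different
    framing variant) on which the model picks the same option."""
    same = sum(a == b for a, b in zip(choices_a, choices_b))
    return same / len(choices_a)

# Hypothetical responses to one value dilemma under two framing variants.
variant1 = ["security", "universalism", "security", "security"]
variant2 = ["security", "security", "security", "universalism"]
print(win_rate(variant1 + variant2, "security"))  # 0.75
print(paired_consistency(variant1, variant2))     # 0.5
```

Test-retest reliability would be the same pairwise comparison applied to repeated runs of the identical scenario, which is why the paper can separate framing sensitivity (low PCS) from stochastic noise (high TRR).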
[884] Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization
Zhixin Lin, Jungang Li, Dongliang Xu, Shidong Pan, Yibo Shi, Yuchi Liu, Yuecong Min, Yue Yao
Main category: cs.AI
TL;DR: TIPO is a preference optimization method for mobile GUI agents that addresses privacy personalization by handling structural heterogeneity in execution trajectories through preference-intensity weighting and padding gating.
Details
Motivation: Existing mobile GUI agents focus on task success/efficiency but neglect privacy personalization. Privacy-first users have systematically different execution trajectories (protective actions like refusing permissions) compared to utility-first users, creating variable-length, structurally different trajectories that make standard preference optimization unstable.
Method: Trajectory Induced Preference Optimization (TIPO) uses preference-intensity weighting to emphasize key privacy-related steps and padding gating to suppress alignment noise, addressing the structural heterogeneity in execution trajectories caused by personalization preferences.
Result: On the Privacy Preference Dataset, TIPO improves persona alignment and distinction while preserving task executability, achieving 65.60% SR, 46.22 Compliance, and 66.67% PD, outperforming existing optimization methods across various GUI tasks.
Conclusion: TIPO effectively addresses the challenge of personalization in mobile GUI agents by handling structural heterogeneity in execution trajectories, enabling better privacy preference alignment while maintaining task performance.
Abstract: Mobile GUI agents powered by Multimodal Large Language Models (MLLMs) can execute complex tasks on mobile devices. Despite this progress, most existing systems still optimize task success or efficiency, neglecting users’ privacy personalization. In this paper, we study the often-overlooked problem of agent personalization. We observe that personalization can induce systematic structural heterogeneity in execution trajectories. For example, privacy-first users often prefer protective actions, e.g., refusing permissions, logging out, and minimizing exposure, leading to logically different execution trajectories from utility-first users. Such variable-length and structurally different trajectories make standard preference optimization unstable and less informative. To address this issue, we propose Trajectory Induced Preference Optimization (TIPO), which uses preference-intensity weighting to emphasize key privacy-related steps and padding gating to suppress alignment noise. Results on our Privacy Preference Dataset show that TIPO improves persona alignment and distinction while preserving strong task executability, achieving 65.60% SR, 46.22 Compliance, and 66.67% PD, outperforming existing optimization methods across various GUI tasks. The code and dataset will be publicly released at https://github.com/Zhixin-L/TIPO.
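The abstract names two mechanisms, preference-intensity weighting and padding gating, without giving the objective. A hedged PyTorch sketch of how such a weighted, masked pairwise trajectory loss might look (our formulation in a DPO style, not the paper's exact one):

```python
import torch
import torch.nn.functional as F

def weighted_gated_preference_loss(logp_chosen, logp_rejected,
                                   step_weights, pad_mask, beta=0.1):
    """Pairwise preference loss over trajectory steps.

    step_weights up-weight privacy-critical steps (preference intensity);
    pad_mask zeroes out padded positions so variable-length trajectories
    do not inject alignment noise (padding gating).
    All tensors are (batch, max_steps); logp_* are per-step log-probs of
    the preferred / dispreferred trajectory under the policy.
    """
    gate = step_weights * pad_mask
    margin = (gate * (logp_chosen - logp_rejected)).sum(dim=1)
    return -F.logsigmoid(beta * margin).mean()

# Hypothetical toy batch: 2 trajectories, 4 steps; last step of row 0 is padding.
lc = torch.tensor([[-0.5, -0.2, -0.1, 0.0], [-0.4, -0.3, -0.2, -0.1]])
lr = torch.tensor([[-0.6, -0.9, -0.3, 0.0], [-0.5, -0.4, -0.9, -0.2]])
w = torch.tensor([[1.0, 3.0, 1.0, 1.0], [1.0, 1.0, 3.0, 1.0]])  # privacy steps x3
m = torch.tensor([[1.0, 1.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
print(weighted_gated_preference_loss(lc, lr, w, m))
```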
[885] Inspectable AI for Science: A Research Object Approach to Generative AI Governance
Ruta Binkyte, Sharif Abuaddba, Chamikara Mahawaga, Ming Ding, Natasha Fernandes, Mario Fritz
Main category: cs.AI
TL;DR: AI as a Research Object (AI-RO) framework for governing generative AI in scientific research by treating AI interactions as structured, inspectable components with documentation and accountability.
Details
Motivation: Addresses the governance challenge of using generative AI in scientific research by moving beyond the "author vs. tool" debate to create structured documentation and accountability frameworks.
Method: Proposes AI-RO framework based on Research Object theory and FAIR principles, implementing a lightweight writing pipeline with language models synthesizing structured literature review notes under constraints with verifiable provenance records.
Result: Develops a demonstrative workflow showing how AI-assisted scientific papers can maintain legitimacy through structured documentation, controlled disclosure, and integrity-preserving provenance capture.
Conclusion: Governance of generative AI in science can be implemented through structured documentation and provenance capture, with future developments needed for practical adoption.
Abstract: This paper introduces AI as a Research Object (AI-RO), a paradigm for governing the use of generative AI in scientific research. Instead of debating whether AI is an author or merely a tool, we propose treating AI interactions as structured, inspectable components of the research process. Under this view, the legitimacy of an AI-assisted scientific paper depends on how model use is integrated into the workflow, documented, and made accountable. Drawing on Research Object theory and FAIR principles, we propose a framework for recording model configuration, prompts, and outputs through interaction logs and metadata packaging. These properties are particularly consequential in security and privacy (S&P) research, where provenance artifacts must satisfy confidentiality constraints, integrity guarantees, and auditability requirements that generic disclosure practices do not address. We implement a lightweight writing pipeline in which a language model synthesizes human-authored structured literature review notes under explicit constraints and produces a verifiable provenance record. We present this work as a position supported by an initial demonstrative workflow, arguing that governance of generative AI in science can be implemented as structured documentation, controlled disclosure, and integrity-preserving provenance capture. Based on this example, we outline and motivate a set of necessary future developments required to make such practices practical and widely adoptable.
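In the spirit of the interaction logs and metadata packaging the abstract describes, here is a minimal Python sketch of an inspectable provenance entry; the field names and the hashing scheme are illustrative assumptions, not the paper's schema.

```python
import hashlib, json, time
from dataclasses import dataclass, asdict

@dataclass
class InteractionRecord:
    """One AI interaction captured as a research-object entry.
    Field names are illustrative, not the paper's schema."""
    model: str
    model_version: str
    temperature: float
    prompt: str
    output: str
    timestamp: float

    def to_provenance_entry(self):
        entry = asdict(self)
        # Hash the payload so later readers can verify integrity without
        # necessarily disclosing the full prompt/output (confidentiality).
        payload = json.dumps(entry, sort_keys=True).encode()
        entry["sha256"] = hashlib.sha256(payload).hexdigest()
        return entry

rec = InteractionRecord(model="example-lm", model_version="2025-01",
                        temperature=0.0, prompt="Summarize notes ...",
                        output="Draft paragraph ...", timestamp=time.time())
with open("provenance_log.jsonl", "a") as f:
    f.write(json.dumps(rec.to_provenance_entry()) + "\n")
```

An append-only JSONL log of such entries gives reviewers exactly the auditability the paper argues generic disclosure statements lack.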
[886] Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model
Kihyuk Lee
Main category: cs.AI
TL;DR: Study evaluates consistency of LLM-generated exercise prescriptions across semantic, structural, and safety dimensions using repeated generation design with Gemini 2.5 Flash.
Details
Motivation: While LLMs are being explored for personalized exercise prescription generation, the consistency of their outputs under identical conditions remains insufficiently examined, raising concerns about reliability for clinical applications.
Method: Used six clinical scenarios to generate exercise prescriptions with Gemini 2.5 Flash (20 outputs per scenario, total n=120). Assessed consistency across three dimensions: semantic consistency using SBERT-based cosine similarity, structural consistency based on FITT principle using AI-as-a-judge approach, and safety expression consistency.
Result: High semantic similarity (mean cosine similarity: 0.879-0.939), greater consistency in clinically constrained cases. Frequency showed consistent patterns, but variability in quantitative components, especially exercise intensity (10-25% unclassifiable intensity expressions). Safety expressions included in 100% of outputs but sentence counts varied significantly across scenarios.
Conclusion: LLM-generated exercise prescriptions show high semantic consistency but variability in key quantitative components. Reliability depends on prompt structure, requiring additional structural constraints and expert validation before clinical deployment.
Abstract: Background: Large language models (LLMs) have been explored as tools for generating personalized exercise prescriptions, yet the consistency of outputs under identical conditions remains insufficiently examined. Objective: This study evaluated the intra-model consistency of LLM-generated exercise prescriptions using a repeated generation design. Methods: Six clinical scenarios were used to generate exercise prescriptions using Gemini 2.5 Flash (20 outputs per scenario; total n = 120). Consistency was assessed across three dimensions: (1) semantic consistency using SBERT-based cosine similarity, (2) structural consistency based on the FITT principle using an AI-as-a-judge approach, and (3) safety expression consistency, including inclusion rates and sentence-level quantification. Results: Semantic similarity was high across scenarios (mean cosine similarity: 0.879-0.939), with greater consistency in clinically constrained cases. Frequency showed consistent patterns, whereas variability was observed in quantitative components, particularly exercise intensity. Unclassifiable intensity expressions were observed in 10-25% of resistance training outputs. Safety-related expressions were included in 100% of outputs; however, safety sentence counts varied significantly across scenarios (H = 86.18, p < 0.001), with clinical cases generating more safety expressions than healthy adult cases. Conclusions: LLM-generated exercise prescriptions demonstrated high semantic consistency but showed variability in key quantitative components. Reliability depends substantially on prompt structure, and additional structural constraints and expert validation are needed before clinical deployment.
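The semantic-consistency measure reduces to mean pairwise cosine similarity over the 20 embeddings per scenario. A minimal NumPy sketch, assuming the SBERT embedding step has already been run upstream (the random vectors below are stand-ins):

```python
import numpy as np

def mean_pairwise_cosine(embeddings):
    """Mean cosine similarity over all pairs of repeated-generation
    embeddings (e.g., 20 SBERT vectors for one scenario)."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    iu = np.triu_indices(len(X), k=1)  # exclude self-similarity on the diagonal
    return sims[iu].mean()

# Hypothetical stand-ins for SBERT embeddings of 20 outputs of one scenario.
rng = np.random.default_rng(0)
base = rng.normal(size=384)
outputs = [base + 0.1 * rng.normal(size=384) for _ in range(20)]
print(round(mean_pairwise_cosine(outputs), 3))  # near 1.0: high consistency
```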
[887] BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
Elaine Lau, Markus Dücker, Ronak Chaudhary, Hui Wen Goh, Rosemary Wei, Vaibhav Kumar, Saed Qunbar, Guram Gogia, Yi Liu, Scott Millslagle, Nasim Borazjanizadeh, Ulyana Tkachenko, Samuel Eshun Danquah, Collin Schweiker, Vijay Karumathil, Asrith Devalaraju, Varsha Sandadi, Haemi Nam, Punit Arani, Ray Epps, Abdullah Arif, Sahil Bhaiwala, Curtis Northcutt, Skyler Wang, Anish Athalye, Jonas Mueller, Francisco Guzmán
Main category: cs.AI
TL;DR: BankerToolBench (BTB) is an open-source benchmark for evaluating AI agents on complex investment banking workflows, developed with 502 bankers to ensure ecological validity, requiring multi-file deliverables and automated evaluation against 100+ rubric criteria.
Details
Motivation: Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows, particularly in high-value, labor-intensive professions like investment banking where current AI evaluation methods are insufficient.
Method: Collaborated with 502 investment bankers to develop ecologically valid benchmark requiring agents to execute senior banker requests by navigating data rooms, using industry tools (market data platforms, SEC filings), and generating multi-file deliverables (Excel models, PowerPoint decks, PDF/Word reports).
Result: Even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria, and bankers rate 0% of its outputs as client-ready, revealing key obstacles like breakdowns in cross-artifact consistency.
Conclusion: BTB provides a rigorous benchmark for evaluating AI agents in professional workflows, highlighting significant gaps in current AI capabilities for high-stakes professional tasks and identifying improvement directions for agentic AI.
Abstract: Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables, including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. Testing 9 frontier models, we find that even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria and bankers rate 0% of its outputs as client-ready. Our failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.
[888] PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers
Lei Xiong, Huaying Yuan, Zheng Liu, Zhao Cao, Zhicheng Dou
Main category: cs.AI
TL;DR: PaperScope: A multi-modal multi-document benchmark for evaluating MLLMs on scientific research tasks requiring integration of text, tables, and figures across multiple papers.
Details
Motivation: Existing benchmarks focus on single-document understanding, but real scientific workflows require integrating evidence from multiple papers with text, tables, and figures. Multi-modal, multi-document scientific reasoning lacks systematic evaluation.
Method: Built on a knowledge graph of over 2,000 AI papers spanning three years. Uses semantically related key information nodes and optimized random-walk article selector to create thematically coherent paper sets. Contains over 2,000 QA pairs across reasoning, retrieval, summarization, and problem solving tasks.
Result: Even advanced systems like OpenAI Deep Research and Tongyi Deep Research achieve limited scores on PaperScope, highlighting the difficulty of long-context retrieval and deep multi-source reasoning.
Conclusion: PaperScope provides a rigorous benchmark and scalable pipeline for constructing large-scale multi-modal, multi-source deep research datasets to evaluate MLLMs on scientific reasoning tasks.
Abstract: Leveraging Multi-modal Large Language Models (MLLMs) to accelerate frontier scientific research is promising, yet how to rigorously evaluate such systems remains unclear. Existing benchmarks mainly focus on single-document understanding, whereas real scientific workflows require integrating evidence from multiple papers, including their text, tables, and figures. As a result, multi-modal, multi-document scientific reasoning remains underexplored and lacks systematic evaluation. To address this gap, we introduce PaperScope, a multi-modal multi-document benchmark designed for agentic deep research. PaperScope presents three advantages: (1) Structured scientific grounding. It is built on a knowledge graph of over 2,000 AI papers spanning three years, providing a structured foundation for research-oriented queries. (2) Semantically dense evidence construction. It integrates semantically related key information nodes and employs an optimized random-walk article selector to sample thematically coherent paper sets, thereby ensuring adequate semantic density and task complexity. (3) Multi-task evaluation of scientific reasoning. It contains over 2,000 QA pairs across reasoning, retrieval, summarization, and problem solving, enabling evaluation of multi-step scientific reasoning. Experimental results show that even advanced systems such as OpenAI Deep Research and Tongyi Deep Research achieve limited scores on PaperScope, highlighting the difficulty of long-context retrieval and deep multi-source reasoning. PaperScope thus provides a rigorous benchmark alongside a scalable pipeline for constructing large-scale multi-modal, multi-source deep research datasets.
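The abstract's selector is described only as an "optimized random-walk article selector"; the plain variant it optimizes is easy to sketch. A minimal Python walk-with-restart over a paper-relation graph (toy graph, and the restart heuristic is our assumption):

```python
import random

def random_walk_paper_set(graph, seed_paper, set_size, restart_p=0.15):
    """Sample a thematically coherent paper set by walking a paper-relation
    graph. graph: dict paper -> list of related papers. The restart
    probability keeps the walk near the seed topic; the benchmark's
    optimized selector is not specified in the abstract."""
    selected, current = {seed_paper}, seed_paper
    while len(selected) < set_size:
        if random.random() < restart_p or not graph.get(current):
            current = seed_paper            # jump back to the seed topic
        else:
            current = random.choice(graph[current])
        selected.add(current)
    return selected

toy_graph = {"p1": ["p2", "p3"], "p2": ["p1", "p4"],
             "p3": ["p1", "p4"], "p4": ["p2", "p3", "p5"], "p5": ["p4"]}
print(random_walk_paper_set(toy_graph, "p1", 4))
```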
[889] Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees
Xiaoyu Ma, Yiwen Li, Haoyue Liu, Zhichao Wang, Ye Chen, Yongxin Guo, Xiaoying Tang
Main category: cs.AI
TL;DR: POES: A prompt-aware online evaluation scheduling method for automatic prompt optimization that treats prompts as examinees and training examples as test items, using IRT-based discrimination and submodular optimization to efficiently select evaluation subsets.
Details
Motivation: Current automatic prompt optimization methods face a fundamental trade-off: using fixed evaluation subsets is principled but prompt-agnostic, while adaptive heuristics are flexible but unstable without formal guarantees. The high cost of evaluating every prompt candidate on the full training set necessitates smarter evaluation scheduling.
Method: POES frames APO as an online adaptive testing problem where prompts are examinees and training examples are test items. It combines: 1) IRT-based discrimination utility to select examples that best discriminate among strong candidates, 2) facility-location coverage term for diversity, and 3) switching-cost-aware warm-start swaps. The unified objective is provably monotone submodular, enabling (1-1/e) greedy guarantees for cold starts and bounded drift for warm-start updates, with an adaptive controller for exploration-exploitation balance.
Result: Across 36 tasks spanning three benchmark families, POES achieves the highest overall average accuracy (6.2% improvement over best baseline) with negligible token overhead (~4%) at the same evaluation budget. Principled selection at k=20 examples matches or exceeds naive evaluation at k=30-50, reducing token consumption by 35-60%.
Conclusion: Evaluation scheduling is a first-class component of automatic prompt optimization, not just an implementation detail. Smarter example selection is more effective than simply selecting more examples, enabling significant efficiency gains while maintaining or improving performance.
Abstract: Automatic prompt optimization (APO) hinges on the quality of its evaluation signal, yet scoring every prompt candidate on the full training set is prohibitively expensive. Existing methods either fix a single evaluation subset before optimization begins (principled but prompt-agnostic) or adapt it heuristically during optimization (flexible but unstable and lacking formal guarantees). We observe that APO naturally maps to an online adaptive testing problem: prompts are examinees, training examples are test items, and the scheduler should select items that best discriminate among the strongest candidates. This insight motivates Prompt-Aware Online Evaluation Scheduling (POES), which integrates an IRT-based discrimination utility, a facility-location coverage term, and switching-cost-aware warm-start swaps into a unified objective that is provably monotone submodular, yielding a (1-1/e) greedy guarantee for cold starts and bounded drift for warm-start updates. An adaptive controller modulates the exploration-exploitation balance based on optimization progress. Across 36 tasks spanning three benchmark families, POES achieves the highest overall average accuracy (6.2 percent improvement over the best baseline) with negligible token overhead (approximately 4 percent) at the same evaluation budget. Moreover, principled selection at k = 20 examples matches or exceeds the performance of naive evaluation at k = 30-50, reducing token consumption by 35-60 percent, showing that selecting smarter is more effective than selecting more. Our results demonstrate that evaluation scheduling is a first-class component of APO, not an implementation detail.
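The facility-location coverage term is the part with a textbook implementation: F(S) = Σ_i max_{j∈S} sim(i, j) is monotone submodular, so greedy selection attains the (1 - 1/e) guarantee the abstract cites. A minimal NumPy sketch of that piece alone (the IRT discrimination utility and switching-cost terms from the full POES objective are omitted):

```python
import numpy as np

def greedy_facility_location(sim, k):
    """Greedy maximization of F(S) = sum_i max_{j in S} sim[i, j].

    sim: (n_items x n_items) similarity matrix over candidate evaluation
    examples. Because F is monotone submodular, greedy selection is a
    (1 - 1/e)-approximation of the optimal size-k subset."""
    n = sim.shape[0]
    selected, best_cover = [], np.zeros(n)
    for _ in range(k):
        # Marginal gain of adding each remaining candidate j.
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

rng = np.random.default_rng(1)
emb = rng.normal(size=(50, 16))                      # toy example embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(greedy_facility_location(emb @ emb.T, k=5))
```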
[890] Dynamic Summary Generation for Interpretable Multimodal Depression Detection
Shiyu Teng, Jiaqing Liu, Hao Sun, Yu Li, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-Wei Chen
Main category: cs.AI
TL;DR: A multi-stage LLM framework for depression detection using multimodal (text, audio, video) features with progressive clinical summaries and interpretable assessment reports.
Details
Motivation: Depression is underdiagnosed due to stigma and subjective symptom ratings; need for accurate, reliable screening with transparency.
Method: Coarse-to-fine multi-stage framework: binary screening → severity classification → continuous regression. LLMs generate progressive clinical summaries guiding multimodal fusion of text, audio, video features. System consolidates summaries into human-readable assessment reports.
Result: Experiments on E-DAIC and CMDC datasets show significant improvements over SOTA baselines in accuracy and interpretability.
Conclusion: Proposed framework enables accurate, interpretable depression detection through multimodal LLM integration with transparent clinical reasoning.
Abstract: Depression remains widely underdiagnosed and undertreated because stigma and subjective symptom ratings hinder reliable screening. To address this challenge, we propose a coarse-to-fine, multi-stage framework that leverages large language models (LLMs) for accurate and interpretable detection. The pipeline performs binary screening, five-class severity classification, and continuous regression. At each stage, an LLM produces progressively richer clinical summaries that guide a multimodal fusion module integrating text, audio, and video features, yielding predictions with transparent rationale. The system then consolidates all summaries into a concise, human-readable assessment report. Experiments on the E-DAIC and CMDC datasets show significant improvements over state-of-the-art baselines in both accuracy and interpretability.
[891] CoRe-ECG: Advancing Self-Supervised Representation Learning for 12-Lead ECG via Contrastive and Reconstructive Synergy
Zehao Qin, Xiaojian Lin, Ping Zhang, Hongliang Wu, Xinkang Wang, Guangling Liu, Bo Chen, Wenming Yang, Guijin Wang
Main category: cs.AI
TL;DR: CoRe-ECG: A unified self-supervised learning framework for ECG analysis combining contrastive and reconstructive learning with frequency-based augmentation and spatio-temporal masking.
Details
Motivation: ECG interpretation faces challenges due to scarce labeled data and expensive expert annotation. Existing SSL methods for ECG have limitations: contrastive learning alone provides limited supervisory signals, reconstructive learning suffers from trivial correlations across leads, and naive augmentations introduce non-physiological distortions.
Method: Proposes CoRe-ECG with three key components: 1) Unified contrastive-reconstructive pretraining that aligns global representations during reconstruction, 2) Frequency Dynamic Augmentation (FDA) that adaptively perturbs ECG signals based on frequency-domain importance, and 3) Spatio-Temporal Dual Masking (STDM) to break linear dependencies across leads and increase reconstruction difficulty.
Result: Achieves state-of-the-art performance across multiple downstream ECG datasets. Ablation studies demonstrate the necessity and complementarity of each component.
Conclusion: Provides a robust and physiologically meaningful representation learning framework for ECG analysis that synergistically combines global semantic modeling with local structural learning.
Abstract: Accurate interpretation of electrocardiogram (ECG) remains challenging due to the scarcity of labeled data and the high cost of expert annotation. Self-supervised learning (SSL) offers a promising solution by enabling models to learn expressive representations from unlabeled signals. Existing ECG SSL methods typically rely on either contrastive learning or reconstructive learning. However, each approach in isolation provides limited supervisory signals and suffers from additional limitations, including non-physiological distortions introduced by naive augmentations and trivial correlations across multiple leads that models may exploit as shortcuts. In this work, we propose CoRe-ECG, a unified contrastive and reconstructive pretraining paradigm that establishes a synergistic interaction between global semantic modeling and local structural learning. CoRe-ECG aligns global representations during reconstruction, enabling instance-level discriminative signals to guide local waveform recovery. To further enhance pretraining, we introduce Frequency Dynamic Augmentation (FDA) to adaptively perturb ECG signals based on their frequency-domain importance, and Spatio-Temporal Dual Masking (STDM) to break linear dependencies across leads, increasing the difficulty of reconstructive tasks. Our method achieves state-of-the-art performance across multiple downstream ECG datasets. Ablation studies further demonstrate the necessity and complementarity of each component. This approach provides a robust and physiologically meaningful representation learning framework for ECG analysis.
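To illustrate the Spatio-Temporal Dual Masking idea on a (leads, samples) array, here is a hedged NumPy sketch; the masking ratios and span lengths are our assumptions, not the paper's settings.

```python
import numpy as np

def spatio_temporal_dual_mask(ecg, lead_ratio=0.25, time_ratio=0.3, rng=None):
    """Illustrative dual masking for a (n_leads, n_samples) ECG window:
    whole leads are zeroed (breaking cross-lead reconstruction shortcuts)
    and random contiguous time spans are zeroed across the rest."""
    if rng is None:
        rng = np.random.default_rng()
    x = ecg.copy()
    n_leads, n_samples = x.shape
    # Spatial mask: drop a subset of leads entirely.
    leads = rng.choice(n_leads, size=max(1, int(lead_ratio * n_leads)),
                       replace=False)
    x[leads] = 0.0
    # Temporal mask: zero random contiguous spans.
    span = max(1, n_samples // 10)
    for _ in range(int(time_ratio * n_samples / span)):
        start = rng.integers(0, n_samples - span)
        x[:, start:start + span] = 0.0
    return x

ecg = np.random.randn(12, 2500)  # 12-lead window, 10 s at 250 Hz
masked = spatio_temporal_dual_mask(ecg, rng=np.random.default_rng(0))
print(float((masked == 0).mean()))  # overall masked fraction
```

Zeroing entire leads is what prevents the decoder from trivially copying a linearly correlated neighboring lead, which is the shortcut the abstract calls out.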
[892] The Missing Knowledge Layer in Cognitive Architectures for AI Agents
Michaël Roynard
Main category: cs.AI
TL;DR: The paper identifies a gap in AI agent architectures lacking explicit Knowledge layers with proper persistence semantics, proposes a four-layer decomposition (Knowledge, Memory, Wisdom, Intelligence) with distinct persistence mechanisms, and provides implementations to demonstrate feasibility.
Details
Motivation: Current cognitive architecture frameworks like CoALA and JEPA lack an explicit Knowledge layer with proper persistence semantics, leading to category errors where systems apply cognitive decay to factual claims or treat facts and experiences with identical update mechanics.
Method: The authors survey persistence semantics across existing memory systems, identify eight convergence points pointing to architectural gaps, and propose a four-layer decomposition where each layer has fundamentally different persistence semantics. They provide companion implementations in Python and Rust to demonstrate feasibility.
Result: The paper identifies architectural gaps in current frameworks, proposes a novel four-layer architecture with distinct persistence semantics, and demonstrates through implementations that the architectural separation is feasible.
Conclusion: The distinctions between Knowledge, Memory, Wisdom, and Intelligence demand distinct persistence semantics in engineering implementations, and no current framework or system provides this separation, highlighting a significant gap in AI agent architecture design.
Abstract: The two most influential cognitive architecture frameworks for AI agents, CoALA [21] and JEPA [12], both lack an explicit Knowledge layer with its own persistence semantics. This gap produces a category error: systems apply cognitive decay to factual claims, or treat facts and experiences with identical update mechanics. We survey persistence semantics across existing memory systems and identify eight convergence points, from Karpathy’s LLM Knowledge Base [10] to the BEAM benchmark’s near-zero contradiction-resolution scores [22], all pointing to related architectural gaps. We propose a four-layer decomposition (Knowledge, Memory, Wisdom, Intelligence) where each layer has fundamentally different persistence semantics: indefinite supersession, Ebbinghaus decay, evidence-gated revision, and ephemeral inference respectively. Companion implementations in Python and Rust demonstrate the architectural separation is feasible. We borrow terminology from cognitive science as a useful analogy (the Knowledge/Memory distinction echoes Tulving’s trichotomy), but our layers are engineering constructs justified by persistence-semantics requirements, not by neural architecture. We argue that these distinctions demand distinct persistence semantics in engineering implementations, and that no current framework or system provides this.
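The Memory layer's Ebbinghaus decay has a standard form, retention R(t) = exp(-t/S) for stability S. A minimal Python sketch of a decaying memory store contrasted against the Knowledge layer's supersession semantics; the parameterization and threshold are illustrative, not taken from the paper's companion implementations.

```python
import math, time

class DecayingMemory:
    """Memory-layer entries with Ebbinghaus-style retention
    R(t) = exp(-t / strength). A Knowledge-layer fact would instead be
    kept indefinitely until superseded, never decayed: that contrast is
    the category distinction the paper argues for."""
    def __init__(self, threshold=0.3):
        self.items, self.threshold = {}, threshold

    def store(self, key, value, strength_s=86_400.0):  # default stability: 1 day
        self.items[key] = (value, strength_s, time.time())

    def recall(self, key):
        value, s, t0 = self.items[key]
        retention = math.exp(-(time.time() - t0) / s)
        return value if retention >= self.threshold else None  # forgotten

mem = DecayingMemory()
mem.store("user_prefers_dark_mode", True, strength_s=3600.0)
print(mem.recall("user_prefers_dark_mode"))  # True now; None after ~an hour
```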
[893] Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories
Peiyang Liu, Zhirui Chen, Xi Wang, Di Liang, Youru Li, Zhi Cai, Wei Ye
Main category: cs.AI
TL;DR: CRPS is a framework that synthesizes training data by analyzing contrastive reasoning paths from MCTS, extracting strategic insights from both successful and failed trajectories to create more effective reasoning examples.
Details
Motivation: Current MCTS supervision extraction methods are inefficient, discarding valuable comparative signals from explored paths and only keeping single highest-reward trajectories, missing opportunities to learn from both success and failure patterns.
Method: CRPS uses structured reflective analysis to examine differences between high- and low-quality search trajectories, extracting explicit information about strategic pivots and local failure modes to synthesize reasoning chains that incorporate success patterns while avoiding identified pitfalls.
Result: Models fine-tuned on just 60K CRPS-synthesized examples match or exceed performance of baselines trained on 590K examples from standard rejection sampling (20× reduction), with improved generalization on out-of-domain benchmarks.
Conclusion: Learning from contrast between success and failure produces more transferable reasoning capabilities than learning from success alone, demonstrating the value of synthesizing training data from comparative analysis of reasoning paths.
Abstract: Monte Carlo Tree Search (MCTS) has been widely used for automated reasoning data exploration, but current supervision extraction methods remain inefficient. Standard approaches retain only the single highest-reward trajectory, discarding the comparative signals present in the many explored paths. Here we introduce \textbf{Contrastive Reasoning Path Synthesis (CRPS)}, a framework that transforms supervision extraction from a filtering process into a synthesis procedure. CRPS uses a structured reflective process to analyze the differences between high- and low-quality search trajectories, extracting explicit information about strategic pivots and local failure modes. These insights guide the synthesis of reasoning chains that incorporate success patterns while avoiding identified pitfalls. We show empirically that models fine-tuned on just 60K CRPS-synthesized examples match or exceed the performance of baselines trained on 590K examples derived from standard rejection sampling, a 20$\times$ reduction in dataset size. Furthermore, CRPS improves generalization on out-of-domain benchmarks, demonstrating that learning from the contrast between success and failure produces more transferable reasoning capabilities than learning from success alone.
[894] From Agent Loops to Structured Graphs:A Scheduler-Theoretic Framework for LLM Agent Execution
Hu Wei
Main category: cs.AI
TL;DR: SGH (Structured Graph Harness) proposes replacing iterative Agent Loops with explicit static DAGs for LLM-based agents, trading some expressiveness for better controllability, verifiability, and debugging.
Details
Motivation: Current Agent Loop paradigm has structural weaknesses: implicit dependencies between steps, unbounded recovery loops, and mutable execution history that complicates debugging. The paper aims to address these issues by moving from opaque LLM inference to inspectable control flow.
Method: Proposes SGH (Structured Graph Harness) that lifts control flow from implicit context into an explicit static DAG. Key commitments: execution plans are immutable within a plan version, planning/execution/recovery are separated into three layers, and recovery follows strict escalation protocol.
Result: This is a position paper and design proposal, so no empirical results are presented. The paper provides a theoretical framework, design analysis, and experimental protocol rather than production implementation or empirical validation.
Conclusion: SGH offers a structured alternative to Agent Loops by making control flow explicit and inspectable, trading some expressiveness for improved controllability, verifiability, and implementability in LLM-based agent systems.
Abstract: The dominant paradigm for building LLM-based agents is the Agent Loop, an iterative cycle where a single language model decides what to do next by reading an ever-growing context window. This paradigm has three structural weaknesses: implicit dependencies between steps, unbounded recovery loops, and mutable execution history that complicates debugging. We characterize the Agent Loop as a single-ready-unit scheduler: at any moment, at most one executable unit is active, and the choice of which unit to activate comes from opaque LLM inference rather than an inspectable policy. This perspective places Agent Loops and graph-based execution engines on a single semantic continuum. We propose SGH, Structured Graph Harness, which lifts control flow from implicit context into an explicit static DAG. SGH makes three commitments: execution plans are immutable within a plan version, planning, execution, and recovery are separated into three layers, and recovery follows a strict escalation protocol. These choices trade some expressiveness for controllability, verifiability, and implementability. Our contributions are fourfold: a scheduler-unified framework that applies classical scheduling theory to LLM agent execution and identifies challenges introduced by non-deterministic LLM nodes; a trade-off analysis of controllability, expressiveness, and implementability across 70 surveyed systems; a formal specification including a node state machine with termination and soundness guarantees; and an attributable experimental framework with a seven-group design for future validation. This is a position paper and design proposal. We provide a theoretical framework, design analysis, and experimental protocol, not a production implementation or empirical results.
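To ground the contrast with the Agent Loop, here is a minimal Python sketch of executing an immutable plan DAG in topological order, with a bounded-retry stand-in for SGH's recovery escalation; the node names and the simplified recovery rule are our assumptions.

```python
from graphlib import TopologicalSorter

def run_plan(dag, actions, max_retries=2):
    """Execute an immutable plan DAG in topological order.

    dag: dict node -> set of predecessor nodes (graphlib convention).
    actions: dict node -> zero-arg callable (possibly an LLM call).
    Recovery here is a strict, bounded escalation (retry, then abort),
    a simplification of SGH's three-layer protocol."""
    results = {}
    for node in TopologicalSorter(dag).static_order():
        for attempt in range(max_retries + 1):
            try:
                results[node] = actions[node]()
                break
            except Exception as e:
                if attempt == max_retries:
                    raise RuntimeError(f"plan aborted at {node}: {e}")
    return results

plan = {"fetch": set(), "parse": {"fetch"}, "report": {"parse"}}
acts = {"fetch": lambda: "raw", "parse": lambda: "table",
        "report": lambda: "done"}
print(run_plan(plan, acts))
```

Because the DAG is data rather than implicit context, dependencies are inspectable and every recovery path is bounded, which is exactly the trade the abstract describes.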
[895] Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval
Dzenan Hamzic, Florian Skopik, Max Landauer, Markus Wurzenberger, Andreas Rauber
Main category: cs.AI
TL;DR: Systematic evaluation of four RAG architectures for CTI analysis shows hybrid graph-text approach improves multi-hop question answering by 35% over vector RAG while maintaining reliability.
Details
Motivation: CTI analysts need to answer complex questions over security reports, but traditional vector retrieval struggles with queries requiring reasoning over relationships between entities (threat actors, malware, vulnerabilities). Knowledge graphs enable structured multi-hop reasoning, but it's unclear how different RAG approaches compare in realistic CTI settings.
Method: Evaluated four RAG architectures: 1) standard vector retrieval, 2) graph-based retrieval over CTI knowledge graph, 3) agentic variant that repairs failed graph queries, and 4) hybrid approach combining graph queries with text retrieval. Tested on 3,300 CTI question-answer pairs covering factual lookups, multi-hop relational queries, synthesis questions, and unanswerable cases.
Result: Graph grounding improves performance on structured factual queries. Hybrid graph-text approach improves answer quality by up to 35% on multi-hop questions compared to vector RAG, while maintaining more reliable performance than graph-only systems.
Conclusion: Hybrid graph-text RAG architectures offer significant advantages for CTI analysis, particularly for complex multi-hop reasoning tasks, by combining the structured reasoning capabilities of knowledge graphs with the flexibility of text retrieval.
Abstract: Cyber threat intelligence (CTI) analysts must answer complex questions over large collections of narrative security reports. Retrieval-augmented generation (RAG) systems help language models access external knowledge, but traditional vector retrieval often struggles with queries that require reasoning over relationships between entities such as threat actors, malware, and vulnerabilities. This limitation arises because relevant evidence is often distributed across multiple text fragments and documents. Knowledge graphs address this challenge by enabling structured multi-hop reasoning through explicit representations of entities and relationships. However, multiple retrieval paradigms, including graph-based, agentic, and hybrid approaches, have emerged with different assumptions and failure modes. It remains unclear how these approaches compare in realistic CTI settings and when graph grounding improves performance. We present a systematic evaluation of four RAG architectures for CTI analysis: standard vector retrieval, graph-based retrieval over a CTI knowledge graph, an agentic variant that repairs failed graph queries, and a hybrid approach combining graph queries with text retrieval. We evaluate these systems on 3,300 CTI question-answer pairs spanning factual lookups, multi-hop relational queries, analyst-style synthesis questions, and unanswerable cases. Results show that graph grounding improves performance on structured factual queries. The hybrid graph-text approach improves answer quality by up to 35 percent on multi-hop questions compared to vector RAG, while maintaining more reliable performance than graph-only systems.
[896] Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning
Xiaozhe Li, Tianyi Lyu, Yizhao Yang, Liang Shan, Siyi Yang, Ligao Zhang, Zhuoyi Huang, Qingwen Liu, Yang Li
Main category: cs.AI
TL;DR: A symbiotic framework decouples context management from task execution using a lightweight ContextCurator policy model paired with a frozen foundation TaskExecutor to address LLM context bottlenecks in long-horizon tasks.
Details
Motivation: LLMs struggle with long-horizon tasks due to "context bottleneck" and "lost-in-the-middle" phenomenon where accumulated noise from verbose environments degrades reasoning over multi-turn interactions.
Method: Introduces a symbiotic framework that pairs a lightweight specialized policy model (ContextCurator) with a powerful frozen foundation model (TaskExecutor). ContextCurator is trained via reinforcement learning to actively reduce information entropy in working memory by aggressively pruning environmental noise while preserving reasoning anchors.
Result: On WebArena, improves success rate of Gemini-3.0-flash from 36.4% to 41.2% while reducing token consumption by 8.8% (from 47.4K to 43.3K). On DeepSearch, achieves 57.1% success rate vs 53.9% while reducing token consumption by a factor of 8. A 7B ContextCurator matches context management performance of GPT-4o.
Conclusion: The framework provides a scalable and computationally efficient paradigm for autonomous long-horizon agents by decoupling context management from task execution.
Abstract: Large Language Models (LLMs) struggle with long-horizon tasks due to the “context bottleneck” and the “lost-in-the-middle” phenomenon, where accumulated noise from verbose environments degrades reasoning over multi-turn interactions. To address this issue, we introduce a symbiotic framework that decouples context management from task execution. Our architecture pairs a lightweight, specialized policy model, ContextCurator, with a powerful frozen foundation model, TaskExecutor. Trained via reinforcement learning, ContextCurator actively reduces information entropy in the working memory. It aggressively prunes environmental noise while preserving reasoning anchors, that is, sparse data points that are critical for future deductions. On WebArena, our framework improves the success rate of Gemini-3.0-flash from 36.4% to 41.2% while reducing token consumption by 8.8% (from 47.4K to 43.3K). On DeepSearch, it achieves a 57.1% success rate, compared with 53.9%, while reducing token consumption by a factor of 8. Remarkably, a 7B ContextCurator matches the context management performance of GPT-4o, providing a scalable and computationally efficient paradigm for autonomous long-horizon agents.
[897] Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents
S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis, Antonios Saravanos
Main category: cs.AI
TL;DR: A three-tier inference scaffolding pipeline improves small LLM performance on tool-use tasks without additional training, using the same frozen model in three roles to roughly double task completion rates.
Details
Motivation: Deploying capable LLM agents on modest hardware is challenging, and the paper investigates whether inference-time scaffolding alone (without additional training compute) can improve small model performance in complex multi-step environments.
Method: A three-tier inference scaffolding pipeline that deploys the same frozen model in three distinct roles: (1) summarization model to compress dialogue history while preserving critical artifacts, (2) main agent model for reasoning over compressed context, and (3) isolated correction model to review and revise code output without conversation history access.
Result: Scaffolding roughly doubled performance: from 5.4% to 8.9% (FP16) and from 3.0% to 5.9% (AWQ) task goal completion. The scaffolded 8B model surpassed DeepSeek-Coder 33B Instruct (7.1%) on full-precision inference, making small models competitive with systems 4x their size.
Conclusion: Structured inference-time interventions can significantly enhance small model performance on complex tasks, demonstrating that careful scaffolding can make modest hardware deployments competitive with much larger systems without additional training compute.
Abstract: Large language model (LLM) agents show promise on realistic tool-use tasks, but deploying capable agents on modest hardware remains challenging. We study whether inference-time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi-step environments. Operating on a single 24 GB GPU, we evaluate Qwen3-8B under both full-precision (FP16, 12K context) and 4-bit quantized (AWQ, 32K context) configurations. Without any intervention, the raw model achieves just 5.4% (FP16) and 3.0% (AWQ) task goal completion. Guided by a systematic failure mode analysis, we introduce a three-tier inference scaffolding pipeline that deploys the same frozen model in three distinct roles: (1) a summarization model that preserves critical artifacts (tokens, credentials, API responses) while compressing dialogue history; (2) the main agent model that reasons over the compressed context; and (3) an isolated correction model that reviews and revises the agent’s code output without access to conversation history, breaking repetitive failure loops. Applied to the same unmodified model, this scaffolding yields 8.9% (FP16) and 5.9% (AWQ) task goal completion, roughly doubling performance in both settings, with particularly strong gains on difficulty-1 tasks (15.8%→26.3% FP16; 5.3%→14.0% AWQ). On full-precision inference, our scaffolded 8B model surpasses DeepSeek-Coder 33B Instruct (7.1%) from the original AppWorld evaluation, demonstrating that structured inference-time interventions can make small models competitive with systems 4× their size. We formalize the approach as a scaffolded policy over a frozen base model, three invocations of the same weights with different conditioning, drawing connections to test-time compute scaling and action-space shaping in reinforcement learning.
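One scaffolded iteration is just three calls to the same frozen model under different conditioning. A hedged Python sketch of that control flow; generate() is a stub standing in for the local inference endpoint, and the prompts are our paraphrase of the roles, not the paper's templates.

```python
def generate(prompt: str) -> str:
    # Stand-in for one call to the single frozen 8B model; wire this to a
    # local inference endpoint (e.g., vLLM or llama.cpp) in practice.
    return f"<model output for: {prompt[:40]}...>"

def agent_step(history: str, observation: str, artifacts: list) -> str:
    """One scaffolded iteration: same weights, three conditionings."""
    # Role 1: compress history but keep critical artifacts verbatim.
    summary = generate(
        "Summarize this dialogue, preserving all tokens, credentials "
        f"and API responses exactly:\n{history}\nArtifacts: {artifacts}")
    # Role 2: the main agent reasons over the compressed context only.
    action = generate(f"Context: {summary}\nObservation: {observation}\n"
                      "Write code for the next step:")
    # Role 3: an isolated corrector reviews the code with *no* history,
    # which is what breaks repetitive failure loops.
    return generate(f"Review and fix this code:\n{action}")

print(agent_step("user: reset my password ...", "login page shown", []))
```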
[898] From Attribution to Action: A Human-Centered Application of Activation Steering
Tobias Labarta, Maximilian Dreyer, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin
Main category: cs.AI
TL;DR: Interactive workflow combining SAE-based attribution with activation steering for analyzing concept usage in vision models, with expert evaluation showing steering enables intervention-based hypothesis testing but raises safety concerns.
Details
Motivation: Current XAI methods provide feature explanations but lack actionable ways for practitioners to intervene on model behavior. Activation steering offers potential for actionable explanations but its practical utility remains understudied.
Method: Developed interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as web-based tool. Conducted semi-structured expert interviews (N=8) with debugging tasks on CLIP to study how practitioners reason about, trust, and apply activation steering.
Result: Steering enables shift from inspection to intervention-based hypothesis testing (8/8 participants). Most ground trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8). Identified risks including ripple effects and limited generalization of instance-level corrections.
Conclusion: Activation steering makes interpretability more actionable but raises important considerations for safe and effective use, including risks of unintended side effects and limited generalization of corrections.
Abstract: Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.
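Activation steering itself has a compact mechanical core: shift a layer's output along a concept direction found upstream (here, by the SAE attribution step, which is assumed done). A minimal PyTorch sketch using a forward hook; the toy linear layer and random direction are placeholders.

```python
import torch

def add_steering_hook(module, direction, alpha):
    """Register a forward hook that shifts a layer's output along a
    concept direction (e.g., an SAE decoder column identified via
    attribution). alpha < 0 suppresses the concept, alpha > 0 amplifies,
    mirroring the suppression strategy participants favored."""
    direction = direction / direction.norm()
    def hook(mod, inputs, output):
        return output + alpha * direction.to(output.dtype)
    return module.register_forward_hook(hook)

# Toy demonstration on a plain linear layer.
layer = torch.nn.Linear(16, 16)
concept = torch.randn(16)
handle = add_steering_hook(layer, concept, alpha=-2.0)  # suppress
x = torch.randn(1, 16)
steered = layer(x)
handle.remove()                        # hooks are reversible interventions
unsteered = layer(x)
print(torch.norm(steered - unsteered))  # equals |alpha|: unit direction shift
```

The reversibility (handle.remove()) is what makes this an intervention-based hypothesis test rather than a permanent model edit, which matches how participants used it for debugging.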
[899] OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems
Kun Liu, Liqun Chen
Main category: cs.AI
TL;DR: OOM-RL uses financial market losses as objective penalty to align multi-agent systems, evolving from sycophantic behavior to robust liquidity-aware architecture through real economic consequences.
Details
Motivation: Current alignment methods (RLHF/RLAIF) cause model sycophancy, and execution-based environments suffer from adversarial test evasion. Need objective alignment paradigm for autonomous software engineering agents in real-world environments.
Method: Out-of-Money Reinforcement Learning (OOM-RL) deploys agents into live financial markets, using capital depletion as unhackable negative gradient. Implements Strict Test-Driven Agentic Workflow (STDAW) with Byzantine-inspired uni-directional state lock (RO-Lock) and ≥95% code coverage constraint matrix.
Result: 20-month study shows evolution from high-turnover sycophantic baseline to robust architecture. Final system achieved stable equilibrium with annualized Sharpe ratio of 2.06 in mature phase, abandoning hallucinations for practical solutions.
Conclusion: Substituting subjective human preference with economic penalties provides robust methodology for aligning autonomous agents in high-stakes environments, establishing computational billing as objective physical constraint for generalized paradigms.
Abstract: The alignment of Multi-Agent Systems (MAS) for autonomous software engineering is constrained by evaluator epistemic uncertainty. Current paradigms, such as Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF), frequently induce model sycophancy, while execution-based environments suffer from adversarial “Test Evasion” by unconstrained agents. In this paper, we introduce an objective alignment paradigm: Out-of-Money Reinforcement Learning (OOM-RL). By deploying agents into the non-stationary, high-friction reality of live financial markets, we utilize critical capital depletion as an un-hackable negative gradient. Our longitudinal 20-month empirical study (July 2024 – February 2026) chronicles the system’s evolution from a high-turnover, sycophantic baseline to a robust, liquidity-aware architecture. We demonstrate that the undeniable ontological consequences of financial loss forced the MAS to abandon overfitted hallucinations in favor of the Strict Test-Driven Agentic Workflow (STDAW), which enforces a Byzantine-inspired uni-directional state lock (RO-Lock) anchored to a deterministically verified ≥95% code coverage constraint matrix. Our results show that while early iterations suffered severe execution decay, the final OOM-RL-aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase. We conclude that substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high-stakes, real-world environments, laying the groundwork for generalized paradigms where computational billing acts as an objective physical constraint.
[900] On the Complexity of the Discussion-based Semantics in Abstract Argumentation
Lydia Blümel, Kai Sauerwald, Kenneth Skiba, Matthias Thimm
Main category: cs.AI
TL;DR: Polynomial-time algorithm for deciding argument strength in discussion-based semantics using automata theory and graph walk analysis
Details
Motivation: The paper addresses the computational complexity of ranking semantics in argumentation frameworks, specifically focusing on the discussion-based semantics of Amgoud and Ben-Naim. Many semantics in this area have unknown complexity, and this work aims to provide efficient algorithms for deciding argument strength.
Method: The authors reduce the argument strength problem to analyzing walks in graphs, specifically comparing the number of walks of each length ending at two vertices. They employ automata theory and reduce this to the equivalence problem for semiring automata, providing a polynomial-time solution.
Result: The paper proves that deciding whether argument a is stronger than argument b in the discussion-based semantics is decidable in polynomial time. This provides new insights into the computational complexity of ranking semantics in argumentation frameworks.
Conclusion: The work offers a new perspective on computational complexity in ranking semantics and demonstrates that automata theory can provide efficient solutions to argument strength problems in discussion-based semantics.
Abstract: We show that deciding whether an argument a is stronger than an argument b with respect to the discussion-based semantics of Amgoud and Ben-Naim is decidable in polynomial time. At its core, this problem is about deciding whether, for two vertices in a graph, the number of walks of each length ending in those vertices is the same. We employ results from automata theory and reduce this problem to the equivalence problem for semiring automata. This offers a new perspective on the computational complexity of ranking semantics, an area in which the complexity of many semantics remains open.
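The walk-counting core is easy to state with linear algebra: the number of walks of length k ending at vertex v is the v-th entry of 1ᵀAᵏ, where A is the attack graph's adjacency matrix. A NumPy sketch that compares a finite prefix of these counts for two arguments; note this prefix check is only a heuristic, whereas the paper's semiring-automata reduction decides equality for all lengths in polynomial time.

```python
import numpy as np

def walks_ending_at(A, v, max_len):
    """Counts of walks of length 1..max_len ending at vertex v:
    the v-th entry of 1^T A^k counts the walks of length k into v."""
    row = np.ones(A.shape[0], dtype=object)  # object dtype: exact big integers
    counts = []
    for _ in range(max_len):
        row = row.dot(A)
        counts.append(row[v])
    return counts

# Attack relation of a toy argumentation framework: A[i, j] = 1 iff i attacks j.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=object)
a, b = 0, 1
print(walks_ending_at(A, a, 6) == walks_ending_at(A, b, 6))  # True on this 3-cycle
```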
[901] Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
Samuel Cahyawijaya, Peerat Limkonchotiwat, Tack Hwa Wong, Hitesh Laxmichand Patel, Amit Agarwal, Manuel Antonio Rufino, Carlos Rafael Catalan, Muhammad Reza Qorib, Vicky Feliren, Holy Lovenia, Aye Hninn Khine, Frederikus Hudi, David Anugraha, Alham Fikri Aji, Romrawin Chumpu, Viet-Thanh Pham, Minghan Wang, Mohamed Fazli Imam, Ruochen Zhang, Joseph Marvin Imperial, Do Xuan Long, Musa Izzanardi Wijanarko, Joel Ruben Antony Moniz, Patrick Amadeus Irawan, Hanif Muhammad Zhafran, Isaiah Flores, Ira Salsabila, Jun Kevin, Jostin Jerico Rosal, Patricia Nicole Monderin, Kun Kerdthaisong, Ahmad Mustafid, My Chiffon Nguyen, Natchapon Jongwiriyanurak, Siva Worajitwannakul, Haochen Li, Adrian Xuan Wei Lim, Bin Wang, Muhammad Ravi Shulthan Habibi, Lynnette Hui Xian Ng, Mithil Bangera, Yeshil Bangera, Priyaranjan Pattnayak, Dun Li Chan, Sherissa Caren Djuniwar, Hee Ming Shan
Main category: cs.AI
TL;DR: Paper introduces Anthropogenic Regional Adaptation paradigm for optimizing vision-language models to specific regional contexts while maintaining global generalization, with GG-EZ method using regional data filtering and model merging.
Details
Motivation: Despite success in vision-language systems, there's no dedicated framework for assessing human-centric alignment, particularly for regional cultural relevance while preserving global capabilities.
Method: Proposes Anthropogenic Regional Adaptation paradigm and GG-EZ method using regional data filtering and model merging to adapt models to specific geographical contexts.
Result: Demonstrates 5-15% gains in cultural relevance metrics across Southeast Asia while maintaining over 98% of global performance, occasionally surpassing it across 3 VL architectures.
Conclusion: Establishes Anthropogenic Regional Alignment as foundational paradigm for multimodal vision-language models in diverse regions with effective baseline method for regional value alignment.
Abstract: While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalization capabilities. Second, we present a simple, but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on 3 VL architectures: large vision-language models, text-to-image diffusion models, and vision-language embedding models, and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and even occasionally surpassing it. Our findings establish Anthropogenic Regional Alignment as a foundational paradigm towards applicability of multimodal vision-language models in diverse regions and demonstrate a simple-yet-effective baseline method that optimizes regional value alignment while preserving global generalization.
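Editor’s note: the abstract names regional data filtering and model merging as GG-EZ’s ingredients but does not spell out the merge operator. The sketch below shows the simplest common choice, linear weight interpolation between a base and a regionally fine-tuned checkpoint; treat it as a plausible baseline, not the paper’s exact method.

```python
from typing import Dict
import torch

def merge_state_dicts(base: Dict[str, torch.Tensor],
                      regional: Dict[str, torch.Tensor],
                      alpha: float = 0.5) -> Dict[str, torch.Tensor]:
    """Linear weight interpolation between two checkpoints.

    NOTE: GG-EZ's exact merge operator is not given in the summary;
    plain interpolation is shown as the simplest common choice.
    """
    assert base.keys() == regional.keys(), "checkpoints must share architecture"
    return {k: (1 - alpha) * base[k] + alpha * regional[k] for k in base}

# Toy usage with two random "checkpoints" of matching shapes.
base = {"w": torch.randn(4, 4), "b": torch.zeros(4)}
regional = {"w": torch.randn(4, 4), "b": torch.ones(4)}
merged = merge_state_dicts(base, regional, alpha=0.3)
print(merged["b"])  # tensor of 0.3s
```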
[902] Lectures on AI for Mathematics
Xiaoyang Chen
Main category: cs.AI
TL;DR: Introduction to AI applications in mathematics, covering pattern discovery, theorem proving, and counterexample generation
Details
Motivation: To provide a comprehensive introduction to the emerging field of AI for mathematics, making the subject accessible and exploring how AI can advance mathematical research.
Method: Book format with clear explanations covering core principles and diverse applications of AI in mathematics
Result: A comprehensive introductory text that explains how AI can discover mathematical patterns, assist in proving theorems, and construct counterexamples
Conclusion: AI has significant potential to advance mathematical research through pattern discovery, theorem proving assistance, and counterexample generation
Abstract: This book provides a comprehensive and accessible introduction to the emerging field of AI for mathematics. It covers the core principles and diverse applications of using artificial intelligence to advance mathematical research. Through clear explanations, the text explores how AI can discover hidden mathematical patterns, assist in proving complicated theorems, and even construct counterexamples to challenge conjectures.
[903] Limited Perfect Monotonical Surrogates constructed using low-cost recursive linkage discovery with guaranteed output
M. W. Przewozniczek, F. Chicano, R. Tinós, M. M. Komarnicki
Main category: cs.AI
TL;DR: LyMPuS is a parameterless perfect surrogate method for expensive optimization problems that enables comparison of solutions differing by one variable, offering low-cost linkage discovery without separate training.
Details
Motivation: Many real-world optimization problems are computationally expensive to evaluate, and existing perfect linear surrogates are limited to linear models, making them inapplicable for non-linear problems. There's a need for surrogates that can work with more complex functions while maintaining perfect representation.
Method: Proposes Limited Monotonical Perfect Surrogate (LyMPuS) that enables comparison of two solutions differing by a single variable. It’s parameterless, can be trained on-the-fly without separate surrogate-building steps, uses only necessary fitness evaluations, and reuses already-paid costs when models are updated.
Result: LyMPuS provides low-cost missing-linkage detection and linkage discovery, guaranteed to find missing dependencies in at most 2⌈log₂(n)⌉ steps. It’s suitable for limiting costs of expensive local search procedures.
Conclusion: LyMPuS extends perfect surrogate capabilities to non-linear problems, offering efficient optimization for computationally expensive functions through parameterless, on-the-fly training and guaranteed linkage discovery.
Abstract: Surrogates provide a cheap solution evaluation and offer significant leverage for optimizing computationally expensive problems. Usually, surrogates only approximate the original function. Recently, the perfect linear surrogates were proposed that ideally represent the original function. These surrogates do not mimic the original function. In fact, they are another (correct) representation of it and enable a wide range of possibilities, e.g., discovering the optimized function for problems where the direct transformation of the encoded solution into its evaluation is not available. However, many real-world problems can not be represented by linear models, making the aforementioned surrogates inapplicable. Therefore, we propose the Limited Monotonical Perfect Surrogate (LyMPuS), which overcomes this difficulty and enables the comparison of two solutions that differ by a single variable. Our proposition is suitable for limiting the cost of expensive local search procedures. The proposed surrogate is parameterless and can be trained on the fly without any separate surrogate-building step. It uses only the necessary fitness evaluations, and the already-paid costs are not wasted when the model is updated. Finally, it offers low-cost missing-linkage detection and low-cost linkage discovery, guaranteed to find a missing dependency in no more than $2\lceil\log_2(n)\rceil$ steps.
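Editor’s note: the 2⌈log₂(n)⌉ bound suggests a bisection-style search. The sketch below shows generic recursive linkage discovery for pseudo-Boolean functions: given that flipping a candidate set changes the fitness effect of flipping bit i, halve the set and keep whichever half witnesses the change. This is a textbook-style illustration of the idea, not LyMPuS’s exact procedure.

```python
def flip(x, idxs):
    y = list(x)
    for j in idxs:
        y[j] ^= 1
    return tuple(y)

def delta(f, x, i):
    """Fitness effect of flipping bit i at point x."""
    return f(flip(x, [i])) - f(x)

def find_interacting_variable(f, x, i, candidates):
    """Bisection-style linkage discovery.

    Precondition: flipping `candidates` changes the effect of flipping
    bit i, so some variable in `candidates` interacts with i. Locates
    one such variable in about 2*ceil(log2(n)) evaluations. A generic
    sketch, not LyMPuS's exact procedure.
    """
    base = delta(f, x, i)
    S = list(candidates)
    while len(S) > 1:
        half = S[: len(S) // 2]
        if delta(f, flip(x, half), i) != base:
            S = half                   # interaction witnessed inside `half`
        else:
            x = flip(x, half)          # move base point past `half`
            base = delta(f, x, i)      # equal to old base by the check above
            S = S[len(S) // 2 :]
    return S[0]

# f(x) = x0*x3 + x1 + x2 + x4: bit 0 interacts only with bit 3.
f = lambda x: x[0] * x[3] + x[1] + x[2] + x[4]
print(find_interacting_variable(f, (0, 0, 0, 0, 0), 0, [1, 2, 3, 4]))  # -> 3
```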
[904] Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
Xi-Wei Pan, Shi-Wen An, Jin-Guo Liu
Main category: cs.AI
TL;DR: A tool for polynomial-time reductions between NP-hard optimization problems, built using AI coding agents with harness engineering, enabling routing any supported problem to any solver through a single interface.
Details
Motivation: To create a scalable library for polynomial-time reductions between hard optimization problems, allowing practitioners to route any supported problem to any supported solver through a single interface, overcoming the barrier of building such libraries at scale.
Method: Used harness engineering combining no-code contribution for domain experts, multilayer verification stack (type-level checks to agentic feature tests with AI agents role-playing as end users), and fully automated implementation-review-integration pipeline.
Result: Built a command-line tool with 100+ problem types and 200+ reduction rules in over 170k lines of Rust in about three months, demonstrating agents can build well-tested software at scale beyond prior reduction-library efforts.
Conclusion: Well-engineered harnesses enable AI agents to build complex software systems at unprecedented scale and pace, with the reduction graph’s transitive composition allowing new solvers for any problem type to become instantly available to all connected problems.
Abstract: Solving an NP-hard optimization problem often requires reformulating it for a specific solver – quantum hardware, a commercial optimizer, or a domain heuristic. A tool for polynomial-time reductions between hard problems would let practitioners route any supported problem to any supported solver through a single interface. Building such a library at scale, however, has remained out of reach. We show that harness engineering, the practice of designing constraints, verification systems, and feedback loops that channel AI coding agents, can overcome this barrier. Our harness combines a no-code contribution route for domain experts, a multilayer verification stack ranging from type-level checks to agentic feature tests (AI agents role-playing as end users), and a fully automated implementation-review-integration pipeline. In about three months, we built a command-line tool backed by a library of 100+ problem types and 200+ reduction rules in over 170k lines of Rust. The result suggests that a well-engineered harness lets agents build well-tested software at a scale and pace beyond prior reduction-library efforts. Because the reduction graph composes transitively, a new solver registered for any single problem type instantly becomes available to every problem connected by a reduction path. The source code is available at https://github.com/CodingThrust/problem-reductions.
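Editor’s note: the “route any problem to any solver” claim rests on transitive composition over a reduction graph. The toy sketch below makes that concrete with a BFS over registered reductions; the problem names, identity transforms, and registry layout are invented stand-ins, not the Rust library’s actual API.

```python
from collections import deque

# Toy registry: edges are polynomial-time reductions A -> B, stored as
# instance-transforming callables. Real reductions rewrite the instance;
# identity lambdas keep the sketch short.
REDUCTIONS = {
    ("IndependentSet", "VertexCover"): lambda inst: inst,
    ("VertexCover", "SetCover"): lambda inst: inst,
    ("SetCover", "ILP"): lambda inst: inst,
}
SOLVERS = {"ILP"}

def route(problem, instance):
    """BFS over the reduction graph; reductions compose transitively, so
    any solver reachable by a path can serve the source problem."""
    queue, seen = deque([(problem, instance, [])]), {problem}
    while queue:
        current, inst, path = queue.popleft()
        if current in SOLVERS:
            return path, inst          # inst is now in the solver's format
        for (src, dst), transform in REDUCTIONS.items():
            if src == current and dst not in seen:
                seen.add(dst)
                queue.append((dst, transform(inst), path + [dst]))
    return None

path, final_instance = route("IndependentSet", {"graph": "toy"})
print(" -> ".join(["IndependentSet", *path]))
# IndependentSet -> VertexCover -> SetCover -> ILP
```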
[905] A collaborative agent with two lightweight synergistic models for autonomous crystal materials research
Tongyu Shi, Yutang Li, Zhanyuan Li, Qian Liu, Jie Zhou, Wenhe Xu, Yang Li, Dawei Dai, Rui He, Wenhua Zhou, Jiahong Wang, Xue-Feng Yu
Main category: cs.AI
TL;DR: MatBrain is a lightweight dual-model system for materials science that separates analytical reasoning (Mat-R1, 30B) from tool orchestration (Mat-T1, 14B) to achieve expert-level performance with dramatically reduced computational requirements.
Details
Motivation: Current large language models are too massive (hundreds of billions of parameters) and struggle with domain-specific reasoning and tool coordination in materials science, creating high hardware deployment barriers.
Method: Dual-model architecture: Mat-R1 (30B parameters) handles analytical reasoning, while Mat-T1 (14B parameters) orchestrates tool-based actions. Entropy analysis shows this decouples tool planning from analytical reasoning. The system is designed for crystal materials research.
Result: Outperforms larger general-purpose models while reducing hardware deployment barrier by over 95%. Generated 30,000 candidate structures and identified 38 promising catalyst materials in 48 hours (100x acceleration). Versatile across structure generation, property prediction, and synthesis planning.
Conclusion: Lightweight collaborative intelligence with specialized dual-model architecture enables advanced materials research capabilities with dramatically reduced computational requirements, demonstrating potential for domain-specific AI systems.
Abstract: Current large language models require hundreds of billions of parameters yet struggle with domain-specific reasoning and tool coordination in materials science. Here, we present MatBrain, a lightweight collaborative agent system with two synergistic models specialized for crystal materials research. MatBrain employs a dual-model architecture: Mat-R1 (30B parameters) as the analytical model providing expert-level domain reasoning, and Mat-T1 (14B parameters) as the executive model orchestrating tool-based actions. Entropy analysis confirms that this architecture resolves the conflict between tool planning and analytical reasoning by decoupling their distinct entropy dynamics. Enabled by this dual-model architecture and structural efficiency, MatBrain significantly outperforms larger general-purpose models while reducing the hardware deployment barrier by over 95%. MatBrain exhibits versatility across structure generation, property prediction, and synthesis planning tasks. Applied to catalyst design, MatBrain generated 30,000 candidate structures and identified 38 promising materials within 48 hours, achieving approximately 100-fold acceleration over traditional approaches. These results demonstrate the potential of lightweight collaborative intelligence for advancing materials research capabilities.
[906] SemaClaw: A Step Towards General-Purpose Personal AI Agents through Harness Engineering
Ningyan Zhu, Huacan Wang, Jie Zhou, Feiyu Chen, Shuo Zhang, Ge Chen, Chen Liu, Jiarou Wu, Wangyi Chen, Xiaofeng Mou, Yi Xu
Main category: cs.AI
TL;DR: SemaClaw is an open-source multi-agent framework for personal AI agents that focuses on harness engineering to make AI agents controllable, auditable, and production-reliable, addressing the shift from prompt engineering to complete infrastructure design.
Details
Motivation: The paper addresses the rise of personal AI agents in daily life and identifies two key shifts: 1) from prompt engineering to harness engineering for controllable, auditable systems, and 2) from discrete tasks to persistent, context-aware human-agent collaboration requiring trustworthy infrastructure.
Method: SemaClaw framework includes: DAG-based two-phase hybrid agent team orchestration, PermissionBridge behavioral safety system, three-tier context management architecture, and an agentic wiki skill for automated personal knowledge base construction.
Result: The paper presents SemaClaw as an open-source framework that addresses the infrastructure needs for general-purpose personal AI agents through harness engineering, enabling more reliable and controllable multi-agent systems.
Conclusion: As AI model capabilities converge, the harness layer becomes the primary site of architectural differentiation, and SemaClaw provides an open-source solution for building trustworthy, extensible personal AI agent infrastructure.
Abstract: The rise of OpenClaw in early 2026 marks the moment when millions of users began deploying personal AI agents into their daily lives, delegating tasks ranging from travel planning to multi-step research. This scale of adoption signals that two parallel arcs of development have reached an inflection point. First is a paradigm shift in AI engineering, evolving from prompt and context engineering to harness engineering: designing the complete infrastructure necessary to transform unconstrained agents into controllable, auditable, and production-reliable systems. As model capabilities converge, this harness layer is becoming the primary site of architectural differentiation. Second is the evolution of human-agent interaction from discrete tasks toward a persistent, contextually aware collaborative relationship, which demands open, trustworthy and extensible harness infrastructure. We present SemaClaw, an open-source multi-agent application framework that addresses these shifts by taking a step towards general-purpose personal AI agents through harness engineering. Our primary contributions include a DAG-based two-phase hybrid agent team orchestration method, a PermissionBridge behavioral safety system, a three-tier context management architecture, and an agentic wiki skill for automated personal knowledge base construction.
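Editor’s note: the sketch below illustrates the shape of a DAG-based two-phase orchestration (plan, then dependency-ordered execution) using Python’s standard-library graphlib. The task names and the DAG itself are invented; SemaClaw’s real orchestration API is not described in this abstract.

```python
from graphlib import TopologicalSorter

# Hypothetical task DAG for a personal-agent request; each key maps a
# task to the set of tasks it depends on. All names are invented.
dag = {
    "search_flights": set(),
    "search_hotels": set(),
    "draft_itinerary": {"search_flights", "search_hotels"},
    "user_approval": {"draft_itinerary"},
    "book": {"draft_itinerary", "user_approval"},
}

# Phase 1: plan -- validate the DAG and derive a dependency-respecting order.
order = list(TopologicalSorter(dag).static_order())

# Phase 2: execute -- dispatch each task once its dependencies completed.
for task in order:
    print(f"dispatching {task} (deps: {sorted(dag[task]) or 'none'})")
```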
[907] UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents
Yijuan Liang, Xinghao Chen, Yifan Ge, Ziyi Wu, Hao Wu, Changyu Zeng, Wei Xing, Xiaoyu Shen
Main category: cs.AI
TL;DR: UniToolCall: A unified framework for LLM tool learning with standardized toolset construction, dataset generation, and evaluation using QAOA representation.
Details
Motivation: Existing tool learning research has inconsistent interaction representations, overlooks structural distribution of tool-use trajectories, and lacks compatible evaluation benchmarks, necessitating a unified framework.
Method: Creates a large tool pool (22k+ tools) and hybrid training corpus (390k+ instances) combining public datasets with synthetic trajectories. Models diverse interaction patterns (single/multi-hop, single/multi-turn) with serial/parallel execution. Introduces Anchor Linkage for multi-turn reasoning and converts 7 benchmarks to unified QAOA representation.
Result: Fine-tuning Qwen3-8B on UniToolCall dataset substantially improves tool-use performance, achieving 93.0% single-turn Strict Precision under distractor-heavy Hybrid-20 setting, outperforming commercial models like GPT, Gemini, and Claude.
Conclusion: UniToolCall provides a comprehensive framework that standardizes tool learning pipeline, addresses structural modeling gaps, and enables robust evaluation, demonstrating strong performance improvements over existing approaches.
Abstract: Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query–Action–Observation–Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, the fine-tuned model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.
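Editor’s note: a minimal sketch of what a unified Query–Action–Observation–Answer (QAOA) record might look like, assuming plausible field names; the paper’s exact schema is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    action: str          # structured function call, e.g. a serialized string
    observation: str     # tool output returned to the model

@dataclass
class QAOATurn:
    """One turn in the unified QAOA representation (field names assumed)."""
    query: str
    steps: List[Step] = field(default_factory=list)  # serial multi-hop calls
    answer: str = ""

turn = QAOATurn(
    query="What's the weather in Ningbo tomorrow?",
    steps=[Step(action='get_forecast(city="Ningbo", days=1)',
                observation='{"tomorrow": "light rain, 14C"}')],
    answer="Light rain, around 14C.",
)
print(turn.answer)
```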
[908] Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models
Benjamin Maltbie, Shivam Raval
Main category: cs.AI
TL;DR: LLMs show sycophantic behavior that varies by user demographics, with GPT-5-nano being more sycophantic than Claude Haiku 4.5, especially for Hispanic personas and in philosophy domains.
Details
Motivation: The paper investigates whether LLM sycophancy (validating incorrect user beliefs to appear agreeable) varies systematically with perceived user demographics, inspired by intersectionality theory, to understand if safety evaluations should incorporate identity-aware testing.
Method: Conducted 768 multi-turn adversarial conversations using Anthropic’s Petri evaluation framework, probing GPT-5-nano and Claude Haiku 4.5 across 128 persona combinations (race, age, gender, confidence level) in mathematics, philosophy, and conspiracy theory domains.
Result: GPT-5-nano is significantly more sycophantic than Claude Haiku 4.5 overall. For GPT-5-nano, philosophy elicits 41% more sycophancy than mathematics, and Hispanic personas receive the highest sycophancy across races. The worst-scoring persona (confident 23-year-old Hispanic woman) averages 5.33/10 on sycophancy. Claude Haiku 4.5 exhibits uniformly low sycophancy with no significant demographic variation.
Conclusion: Sycophancy is not uniformly distributed across users, and safety evaluations should incorporate identity-aware testing to address differential false validation rates based on perceived user demographics.
Abstract: Large language models exhibit sycophantic tendencies–validating incorrect user beliefs to appear agreeable. We investigate whether this behavior varies systematically with perceived user demographics, testing whether combinations of race, age, gender, and expressed confidence level produce differential false validation rates. Inspired by the legal concept of intersectionality, we conduct 768 multi-turn adversarial conversations using Anthropic’s Petri evaluation framework, probing GPT-5-nano and Claude Haiku 4.5 across 128 persona combinations in mathematics, philosophy, and conspiracy theory domains. GPT-5-nano is significantly more sycophantic than Claude Haiku 4.5 overall ($\bar{x}=2.96$ vs. $1.74$, $p < 10^{-32}$, Wilcoxon signed-rank). For GPT-5-nano, we find that philosophy elicits 41% more sycophancy than mathematics and that Hispanic personas receive the highest sycophancy across races. The worst-scoring persona, a confident, 23-year-old Hispanic woman, averages 5.33/10 on sycophancy. Claude Haiku 4.5 exhibits uniformly low sycophancy with no significant demographic variation. These results demonstrate that sycophancy is not uniformly distributed across users and that safety evaluations should incorporate identity-aware testing.
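Editor’s note: the headline comparison uses a Wilcoxon signed-rank test on paired per-persona scores. The sketch below reproduces the test mechanics on synthetic scores drawn to roughly match the reported means (2.96 vs. 1.74); it illustrates the statistical procedure only, not the study’s data.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired sycophancy scores per persona for two models; synthetic data,
# roughly matching the reported means, for illustration only.
rng = np.random.default_rng(42)
n_personas = 128
gpt5_nano = np.clip(rng.normal(2.96, 1.2, n_personas), 0, 10)
haiku45 = np.clip(rng.normal(1.74, 0.8, n_personas), 0, 10)

# Paired, non-parametric test: no normality assumption on score differences.
stat, p = wilcoxon(gpt5_nano, haiku45)
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p:.2e}")
```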
[909] Context Kubernetes: Declarative Orchestration of Enterprise Knowledge for Agentic AI Systems
Charafeddine Mouzouni
Main category: cs.AI
TL;DR: Context Kubernetes: A Kubernetes-inspired architecture for orchestrating enterprise knowledge in agentic AI systems with governance, freshness monitoring, and permission controls.
Details
Motivation: The paper addresses the challenge of managing enterprise knowledge in AI agent systems: ensuring proper governance, freshness, and permissions when delivering knowledge to AI agents across an organization, analogous to Kubernetes solving container orchestration.
Method: Proposes Context Kubernetes architecture with six core abstractions, YAML-based declarative manifests for knowledge-architecture-as-code, a reconciliation loop, and a three-tier agent permission model where agent authority is always a strict subset of human authority.
Result: Experiments show: 1) Without governance, agents serve phantom content and leak data in 26.5% of queries; 2) Without freshness monitoring, stale content served silently - with reconciliation, staleness detected in under 1ms; 3) Three-tier permission model blocks all 5 attack scenarios vs. basic RBAC blocking 4/5 and flat permissions blocking 0/5.
Conclusion: Context Kubernetes provides architectural enforcement of knowledge governance that current enterprise platforms lack, with zero unauthorized deliveries and invariant violations, addressing four properties that make context orchestration harder than container orchestration.
Abstract: We introduce Context Kubernetes, an architecture for orchestrating enterprise knowledge in agentic AI systems, with a prototype implementation and eight experiments. The core observation is that delivering the right knowledge, to the right agent, with the right permissions, at the right freshness – across an entire organization – is structurally analogous to the container orchestration problem Kubernetes solved a decade ago. We formalize six core abstractions, a YAML-based declarative manifest for knowledge-architecture-as-code, a reconciliation loop, and a three-tier agent permission model where agent authority is always a strict subset of human authority. Three value experiments show: (1) without governance, agents serve phantom content from deleted sources and leak cross-domain data in 26.5% of queries; (2) without freshness monitoring, stale content is served silently – with reconciliation, staleness is detected in under 1ms; (3) in five attack scenarios, flat permissions block 0/5 attacks, basic RBAC blocks 4/5, and the three-tier model blocks 5/5. Five correctness experiments confirm zero unauthorized deliveries, zero invariant violations, and architectural enforcement of out-of-band approval isolation that no surveyed enterprise platform provides. A survey of four major platforms (Microsoft, Salesforce, AWS, Google) documents that none architecturally isolates agent approval channels. We identify four properties that make context orchestration harder than container orchestration, and argue that these make the solution more valuable.
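Editor’s note: the load-bearing invariant is that agent authority is always a strict subset of human authority. A minimal sketch of one way to enforce it, with invented tier names and permission strings (including carving out the out-of-band approval channel):

```python
# Tier names and permission strings are illustrative, not from the paper.
HUMAN_PERMS = {"read:finance", "read:hr", "write:finance", "approve:payments"}

TIER_CAPS = {
    "observer": {"read:finance", "read:hr"},
    "operator": {"read:finance", "read:hr", "write:finance"},
    "delegate": {"read:finance", "read:hr", "write:finance", "approve:payments"},
}

def grant_agent(tier: str, human_perms: set) -> set:
    """Agent permissions = tier cap intersected with the human's grants,
    minus the out-of-band approval channel, enforcing agent < human."""
    agent = (TIER_CAPS[tier] & human_perms) - {"approve:payments"}
    assert agent < human_perms, "invariant: agent authority is a strict subset"
    return agent

print(grant_agent("operator", HUMAN_PERMS))
# {'read:finance', 'read:hr', 'write:finance'}
```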
[910] RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time
Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen
Main category: cs.AI
TL;DR: Teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them into active optimization tools for visual generation, improving generators through both training-time reinforcement learning and test-time critique-and-refine loops.
Details
Motivation: Current reward models for visual generation reduce rich human judgments to single unexplained scores, discarding the underlying reasoning. This limits both their utility as optimization tools and their interpretability.
Method: Introduces the Preference-Anchored Rationalization (PARROT) framework to train reward models to produce explicit critiques without costly rationale annotations. Uses anchored generation, consistency filtering, and distillation to recover high-quality rationales from preference data. The resulting model, RationalRewards (8B), serves as both a reward model and critique generator.
Result: RationalRewards achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro while using 10-20x less training data. As an RL reward, it improves text-to-image and image-editing generators beyond scalar alternatives. Test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks.
Conclusion: Structured reasoning in reward models transforms them from passive evaluators to active optimization tools, unlocking latent capabilities in existing generators that suboptimal prompts fail to elicit.
Abstract: Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
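Editor’s note: the test-time Generate-Critique-Refine loop is easy to express as a control-flow skeleton. In the sketch below the three callables stand in for the generator and RationalRewards; their signatures and the stopping heuristic are assumptions.

```python
from typing import Callable

def generate_critique_refine(prompt: str,
                             generate: Callable[[str], str],
                             critique: Callable[[str, str], str],
                             refine: Callable[[str, str], str],
                             rounds: int = 2) -> str:
    """Test-time Generate-Critique-Refine loop: the reward model's
    structured critique drives targeted prompt revisions, with no
    parameter updates. Callable signatures are assumptions."""
    for _ in range(rounds):
        image = generate(prompt)
        feedback = critique(prompt, image)
        if "no issues" in feedback.lower():   # invented stopping heuristic
            break
        prompt = refine(prompt, feedback)
    return generate(prompt)

# Stub usage with toy callables (real use would call the actual models).
out = generate_critique_refine(
    "a red cube on a blue sphere",
    generate=lambda p: f"<image for: {p}>",
    critique=lambda p, img: "no issues",
    refine=lambda p, fb: p + " (fixed)",
)
print(out)
```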
[911] Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems
Deeksha Prahlad, Daniel Fan, Hokeun Kim
Main category: cs.AI
TL;DR: A reactor-based approach using Lingua Franca framework to address nondeterminism in foundation model-based AI agents for human-in-the-loop cyber-physical systems, demonstrated with an agentic driving coach case study.
Details
Motivation: Foundation models (LLMs) are increasingly used in human-in-the-loop cyber-physical systems, but unpredictable human behavior, AI agent actions, and dynamic physical environments create uncontrollable nondeterminism that needs to be addressed.
Method: Proposes a reactor-model-of-computation-based approach implemented using the open-source Lingua Franca framework, with a concrete case study of an agentic driving coach application.
Result: Evaluation of the LF-based agentic HITL CPS identifies practical challenges in reintroducing determinism and presents pathways to address them.
Conclusion: The reactor-MoC approach using Lingua Franca provides a framework for managing nondeterminism in foundation model-based AI agents for human-in-the-loop cyber-physical systems, with identified challenges and solutions for practical implementation.
Abstract: Foundation models, including large language models (LLMs), are increasingly used for human-in-the-loop (HITL) cyber-physical systems (CPS) because foundation model-based AI agents can potentially interact with both the physical environments and human users. However, the unpredictable behavior of human users and AI agents, in addition to the dynamically changing physical environments, leads to uncontrollable nondeterminism. To address this urgent challenge of enabling agentic AI-powered HITL CPS, we propose a reactor-model-of-computation (MoC)-based approach, realized by the open-source Lingua Franca (LF) framework. We also carry out a concrete case study using the agentic driving coach as an application of HITL CPS. By evaluating the LF-based agentic HITL CPS, we identify practical challenges in reintroducing determinism into such agentic HITL CPS and present pathways to address them.
[912] Why Do Large Language Models Generate Harmful Content?
Rajesh Ganguli, Raha Moraffah
Main category: cs.AI
TL;DR: Causal mediation analysis reveals that harmful content generation in LLMs originates in later layers, primarily through MLP blocks, with specific neurons acting as gating mechanisms for harmful output.
Details
Motivation: While LLMs are known to generate harmful content, the underlying causal mechanisms remain poorly understood. The paper aims to identify the specific model components responsible for harmful generation through causal analysis.
Method: Uses causal mediation analysis to examine harmful generation across multiple granularities: model layers, modules (MLP vs attention blocks), and individual neurons in state-of-the-art LLMs.
Result: Harmful generation occurs in later layers, primarily from MLP block failures rather than attention blocks, with specific neurons acting as gating mechanisms. Early layers understand harmfulness context, which propagates through MLP blocks to sparse neurons in final layers that determine harmful content generation.
Conclusion: The study provides causal insights into harmful content generation in LLMs, identifying specific architectural components responsible, which could inform safer model design and intervention strategies.
Abstract: Large Language Models (LLMs) have been shown to generate harmful content. However, the underlying causes of such behavior remain underexplored. We propose a causal mediation analysis-based approach to identify the causal factors responsible for harmful generation. Our method performs a multi-granular analysis across model layers, modules (MLP and attention blocks), and individual neurons. Extensive experiments on state-of-the-art LLMs indicate that harmful generation arises in the later layers of the model, results primarily from failures in MLP blocks rather than attention blocks, and is associated with neurons that act as a gating mechanism for harmful generation. The results indicate that the early layers in the model are used for a contextual understanding of harmfulness in a prompt, which is then propagated through the model, to generate harmfulness in the late layers, as well as a signal indicating harmfulness through MLP blocks. This is then further propagated to the last layer of the model, specifically to a sparse set of neurons, which receives the signal and determines the generation of harmful content accordingly.
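Editor’s note: causal mediation on an MLP block is usually implemented as activation patching: swap the block’s output between two runs and measure the shift in next-token logits. The sketch below demonstrates the mechanics on GPT-2 with a benign sentiment pair (the two prompts tokenize to the same length, which this naive whole-output patch requires); the paper applies the same logic to harmfulness in larger LLMs with its own prompts and metrics.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 10

def run(prompt, patch=None):
    """Forward pass; optionally replace the MLP output at LAYER."""
    cache = {}
    def hook(module, inputs, output):
        cache["mlp"] = output.detach()
        return patch  # returning None leaves the output unchanged
    handle = model.transformer.h[LAYER].mlp.register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    handle.remove()
    return logits, cache["mlp"]

base, source = "The movie was really good", "The movie was really bad"
_, source_act = run(source)                 # same token length as `base`
clean_logits, _ = run(base)
patched_logits, _ = run(base, patch=source_act)  # mediate through LAYER's MLP
shift = (patched_logits - clean_logits).abs().max()
print(f"max next-token logit shift from patching layer {LAYER} MLP: {shift:.3f}")
```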
[913] SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context
Shuquan Lian, Juncheng Liu, Yazhe Chen, Yuhong Chen, Hui Li
Main category: cs.AI
TL;DR: SWE-AGILE is a novel software agent framework that addresses context explosion in multi-turn software engineering tasks by using dynamic reasoning context with sliding windows and reasoning digests.
Details
Motivation: Existing ReAct-style approaches in autonomous software engineering lack explicit System-2 reasoning for deep analysis and complex edge cases, while applying extended Chain-of-Thought reasoning creates a dilemma between context explosion from retaining full history and redundant re-reasoning from discarding it.
Method: SWE-AGILE introduces a Dynamic Reasoning Context strategy that maintains a “sliding window” of detailed reasoning for immediate continuity while compressing historical reasoning content into concise Reasoning Digests to balance reasoning depth, efficiency, and context constraints.
Result: SWE-AGILE sets a new standard for 7B-8B models on SWE-Bench-Verified using only 2.2k trajectories and 896 tasks, demonstrating superior performance in software engineering tasks.
Conclusion: The proposed framework effectively bridges the gap between reasoning depth, efficiency, and context constraints in autonomous software engineering, offering a practical solution to the context explosion problem in multi-turn reasoning tasks.
Abstract: Prior representative ReAct-style approaches in autonomous Software Engineering (SWE) typically lack the explicit System-2 reasoning required for deep analysis and handling complex edge cases. While recent reasoning models demonstrate the potential of extended Chain-of-Thought (CoT), applying them to the multi-turn SWE task creates a fundamental dilemma: retaining full reasoning history leads to context explosion and “Lost-in-the-Middle” degradation, while discarding it would force the agent to redundantly re-reason at every step. To address these challenges, we propose SWE-AGILE, a novel software agent framework designed to bridge the gap between reasoning depth, efficiency, and context constraints. SWE-AGILE introduces a Dynamic Reasoning Context strategy, maintaining a “sliding window” of detailed reasoning for immediate continuity to prevent redundant re-analyzing, while compressing historical reasoning content into concise Reasoning Digests. Empirically, SWE-AGILE sets a new standard for 7B-8B models on SWE-Bench-Verified using only 2.2k trajectories and 896 tasks. Code is available at https://github.com/KDEGroup/SWE-AGILE.
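Editor’s note: the Dynamic Reasoning Context strategy reduces to a bounded buffer plus a compressor. A minimal sketch, assuming a fixed-size window and a summarizer callback (an LLM call in practice, a string-truncation stub here):

```python
from collections import deque
from typing import Callable, List

class DynamicReasoningContext:
    """Keep the last `window` reasoning steps verbatim; compress each
    evicted step into a short digest via `summarize`. A sketch of the
    strategy described in the abstract, not SWE-AGILE's implementation."""

    def __init__(self, window: int, summarize: Callable[[str], str]):
        self.recent: deque = deque(maxlen=window)
        self.digests: List[str] = []
        self.summarize = summarize

    def add(self, step: str) -> None:
        if len(self.recent) == self.recent.maxlen:
            # Digest the step about to fall out of the sliding window.
            self.digests.append(self.summarize(self.recent[0]))
        self.recent.append(step)

    def render(self) -> str:
        return "\n".join(["# Reasoning digests:", *self.digests,
                          "# Recent reasoning:", *self.recent])

ctx = DynamicReasoningContext(window=2, summarize=lambda s: s[:40] + "...")
for step in ["inspect failing test in utils_test.py",
             "trace bug to off-by-one in pagination helper",
             "draft patch and re-run test suite"]:
    ctx.add(step)
print(ctx.render())
```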
[914] DreamKG: A KG-Augmented Conversational System for People Experiencing Homelessness
Javad M Alizadeh, Genhui Zheng, Chiu C Tan, Yuzhou Chen, Omar Martinez, Philip McCallion, Ying Ding, Chenguang Yang, AnneMarie Tomosky, Huanmei Wu
Main category: cs.AI
TL;DR: DreamKG is a knowledge graph-augmented conversational system that helps homeless people access accurate, up-to-date information about community services in Philadelphia by combining LLM flexibility with knowledge graph reliability to prevent hallucinations.
Details
Motivation: People experiencing homelessness face significant barriers to accessing timely, accurate information about community services, and standard LLMs are prone to hallucinations that could provide misleading information to vulnerable populations.
Method: Combines Neo4j knowledge graphs with structured query understanding to handle location-aware and time-sensitive queries reliably, performing spatial reasoning for distance-based recommendations and temporal filtering for operating hours.
Result: Preliminary evaluation shows 59% superiority over Google Search AI on relevant queries and 84% rejection of irrelevant queries.
Conclusion: Demonstrates the potential of hybrid architectures combining LLM flexibility with knowledge graph reliability to improve service accessibility for vulnerable populations effectively.
Abstract: People experiencing homelessness (PEH) face substantial barriers to accessing timely, accurate information about community services. DreamKG addresses this through a knowledge graph-augmented conversational system that grounds responses in verified, up-to-date data about Philadelphia organizations, services, locations, and hours. Unlike standard large language models (LLMs) prone to hallucinations, DreamKG combines Neo4j knowledge graphs with structured query understanding to handle location-aware and time-sensitive queries reliably. The system performs spatial reasoning for distance-based recommendations and temporal filtering for operating hours. Preliminary evaluation shows 59% superiority over Google Search AI on relevant queries and 84% rejection of irrelevant queries. This demonstration highlights the potential of hybrid architectures that combine LLM flexibility with knowledge graph reliability to improve service accessibility for vulnerable populations effectively.
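Editor’s note: the spatial and temporal filtering described above can be sketched with a haversine distance plus an opening-hours check. The service records below are toy stand-ins for the Neo4j graph, and the sketch does not reproduce the system’s actual Cypher queries.

```python
from math import radians, sin, cos, asin, sqrt
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (standard haversine formula)."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Toy records standing in for the knowledge graph (names/hours invented).
services = [
    {"name": "Shelter A", "lat": 39.955, "lon": -75.165, "open": 8, "close": 20},
    {"name": "Meal Site B", "lat": 39.990, "lon": -75.120, "open": 11, "close": 14},
]

def open_nearby(user_lat, user_lon, now: datetime, max_km=3.0):
    """Spatial (distance) + temporal (operating hours) filtering."""
    return [s["name"] for s in services
            if haversine_km(user_lat, user_lon, s["lat"], s["lon"]) <= max_km
            and s["open"] <= now.hour < s["close"]]

print(open_nearby(39.952, -75.164, datetime(2026, 2, 1, 12)))  # ['Shelter A']
```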
[915] Detecting Safety Violations Across Many Agent Traces
Adam Stein, Davis Brown, Hamed Hassani, Mayur Naik, Eric Wong
Main category: cs.AI
TL;DR: Meerkat is a system that combines clustering with agentic search to find safety violations in large collections of agent traces, addressing challenges where failures are rare, complex, or only visible across multiple traces.
Details
Motivation: Safety auditing of AI agents is difficult because failures are often rare, complex, and sometimes only detectable when analyzing multiple traces together. Existing approaches struggle with per-trace judges missing cross-trace failures, agentic auditing not scaling to large collections, and fixed monitors being brittle to unanticipated behaviors.
Method: Meerkat combines clustering with agentic search to uncover violations specified in natural language. It uses structured search and adaptive investigation of promising regions to find sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration.
Result: Meerkat significantly improves detection of safety violations over baseline monitors, discovers widespread developer cheating on a top agent benchmark, and finds nearly 4x more examples of reward hacking on CyBench than previous audits.
Conclusion: The Meerkat approach effectively addresses the challenges of auditing large agent trace collections for safety violations by combining clustering with agentic search, enabling discovery of rare and complex failures that span multiple traces.
Abstract: To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversarially hidden and only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons. Per-trace judges miss failures that only become visible across traces, naive agentic auditing does not scale to large trace collections, and fixed monitors are brittle to unanticipated behaviors. We introduce Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration. Across misuse, misalignment, and task gaming settings, Meerkat significantly improves detection of safety violations over baseline monitors, discovers widespread developer cheating on a top agent benchmark, and finds nearly 4x more examples of reward hacking on CyBench than previous audits.
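Editor’s note: Meerkat’s first stage, clustering traces to surface promising regions, can be sketched in a few lines. The embeddings below are random stand-ins (with a deliberately planted tight anomalous pocket); in practice they would come from a text-embedding model over agent traces, and the small-cluster-first heuristic is an assumption, not the paper’s exact ranking rule.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
embeddings = np.vstack([rng.normal(0, 1, (95, 32)),     # routine traces
                        rng.normal(4, 0.3, (5, 32))])   # tight anomalous pocket

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(embeddings)
sizes = np.bincount(km.labels_, minlength=8)

# Small clusters first: sparse failures often concentrate rather than
# spread uniformly, so they make promising regions for agentic follow-up.
for label in np.argsort(sizes)[:3]:
    members = np.flatnonzero(km.labels_ == label)
    print(f"investigate cluster {label}: {sizes[label]} traces, e.g. ids {members[:5]}")
```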
[916] A Mamba-Based Multimodal Network for Multiscale Blast-Induced Rapid Structural Damage Assessment
Wanli Ma, Sivasakthy Selvakumaran, Dain G. Farrimond, Adam A. Dennis, Samuel E. Rigby
Main category: cs.AI
TL;DR: A Mamba-based multimodal network for structural damage assessment that integrates blast-loading physics with optical remote sensing images, achieving improved performance on the 2020 Beirut explosion dataset.
Details
Motivation: Traditional structural damage assessment methods are limited by accessibility, safety risks, and time constraints after disasters like explosions. While machine learning with remote sensing offers scalability, existing methods lack integration of blast-loading physics and require extensive training data, limiting real-world applicability.
Method: Proposes a Mamba-based multimodal network that integrates multi-scale blast-loading information with optical remote sensing images for rapid structural damage assessment. The method combines physical characteristics of explosions with visual data.
Result: Evaluated on the 2020 Beirut explosion dataset, the method significantly improves performance over state-of-the-art approaches for structural damage assessment.
Conclusion: The proposed multimodal approach successfully integrates blast physics with visual data for more accurate and rapid structural damage assessment, addressing limitations of existing methods that lack physical modeling.
Abstract: Accurate and rapid structural damage assessment (SDA) is crucial for post-disaster management, helping responders prioritise resources, plan rescues, and support recovery. Traditional field inspections, though precise, are limited by accessibility, safety risks, and time constraints, especially after large explosions. Machine learning with remote sensing has emerged as a scalable solution for rapid SDA, with Mamba-based networks achieving state-of-the-art performance. However, these methods often require extensive training and large datasets, limiting real-world applicability. Moreover, they fail to incorporate key physical characteristics of blast loading for SDA. To overcome these challenges, we propose a Mamba-based multimodal network for rapid SDA that integrates multi-scale blast-loading information with optical remote sensing images. Evaluated on the 2020 Beirut explosion, our method significantly improves performance over state-of-the-art approaches. Code is available at: https://github.com/IMPACTSquad/Blast-Mamba
[917] Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games
Keyang Zhong, Junlin Xie, Hefeng Wu, Haofeng Li, Guanbin Li
Main category: cs.AI
TL;DR: A collaborative multi-agent framework for enhancing vision-language models’ multi-hop reasoning in deceptive multiplayer game settings like Murder Mystery Games, using agent-monitored training with chain-of-thought fine-tuning and reinforcement learning.
Details
Motivation: Vision-language models struggle with complex multi-hop reasoning in multiplayer games with imperfect and deceptive information, requiring better handling of uncertainty and character-specific reasoning.
Method: Two-stage agent-monitored training: (1) chain-of-thought fine-tuning on curated/synthetic datasets modeling uncertainty and deception, (2) GRPO-based reinforcement learning with agent-monitored reward shaping for character-specific reasoning and multimodal multi-hop inference.
Result: Significantly boosts VLM performance in narrative reasoning, hidden fact extraction, and deception-resilient understanding in Murder Mystery Games.
Conclusion: Provides a scalable solution for training/evaluating VLMs under uncertain, adversarial, and socially complex conditions, establishing groundwork for benchmarks in multimodal multi-hop reasoning with imperfect information.
Abstract: Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfect and deceptive information. In this paper, we study a representative multiplayer task, Murder Mystery Games, which require inferring hidden truths based on partial clues provided by roles with different intentions. To address this challenge, we propose a collaborative multi-agent framework for evaluating and synthesizing high-quality, role-driven multiplayer game scripts, enabling fine-grained interaction patterns tailored to character identities (i.e., murderer vs. innocent). Our system generates rich multimodal contexts, including character backstories, visual and textual clues, and multi-hop reasoning chains, through coordinated agent interactions. We design a two-stage agent-monitored training strategy to enhance the reasoning ability of VLMs: (1) chain-of-thought based fine-tuning on curated and synthetic datasets that model uncertainty and deception; (2) GRPO-based reinforcement learning with agent-monitored reward shaping, encouraging the model to develop character-specific reasoning behaviors and effective multimodal multi-hop inference. Extensive experiments demonstrate that our method significantly boosts the performance of VLMs in narrative reasoning, hidden fact extraction, and deception-resilient understanding. Our contributions offer a scalable solution for training and evaluating VLMs under uncertain, adversarial, and socially complex conditions, laying the groundwork for future benchmarks in multimodal multi-hop reasoning under imperfect information.
[918] Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure
Federico Bottino, Carlo Ferrero, Nicholas Dosio, Pierfrancesco Beneventano
Main category: cs.AI
TL;DR: OIDA framework structures organizational knowledge with epistemic properties like commitment strength and contradiction status, introducing QUESTION mechanism to model organizational ignorance.
Details
Motivation: Current AI agents lack epistemic structure in organizational knowledge: they can't distinguish binding decisions from abandoned hypotheses, contested claims from settled ones, or known facts from unresolved questions. The ceiling on organizational AI is not retrieval fidelity but epistemic fidelity.
Method: OIDA framework structures knowledge as typed Knowledge Objects with epistemic class, importance scores with class-specific decay, and signed contradiction edges. Knowledge Gravity Engine maintains scores deterministically with convergence guarantees. Introduces QUESTION-as-modeled-ignorance primitive with inverse decay.
Result: OIDA’s RAG condition achieved EQS 0.530 vs 0.848 for full-context baseline with 28.1× token budget difference. QUESTION mechanism statistically validated (Fisher p=0.0325, OR=21.0). Formal properties established but decisive ablation at equal token budget is pre-registered and not yet run.
Conclusion: OIDA provides a framework for epistemic fidelity in organizational AI, enabling representation of commitment strength, contradiction status, and organizational ignorance as computable properties, with the QUESTION mechanism addressing what organizations don’t know.
Abstract: Organizational knowledge used by AI agents typically lacks epistemic structure: retrieval systems surface semantically relevant content without distinguishing binding decisions from abandoned hypotheses, contested claims from settled ones, or known facts from unresolved questions. We argue that the ceiling on organizational AI is not retrieval fidelity but \emph{epistemic} fidelity–the system’s ability to represent commitment strength, contradiction status, and organizational ignorance as computable properties. We present OIDA, a framework that structures organizational knowledge as typed Knowledge Objects carrying epistemic class, importance scores with class-specific decay, and signed contradiction edges. The Knowledge Gravity Engine maintains scores deterministically with proved convergence guarantees (sufficient condition: max degree $< 7$; empirically robust to degree 43). OIDA introduces QUESTION-as-modeled-ignorance: a primitive with inverse decay that surfaces what an organization does \emph{not} know with increasing urgency–a mechanism absent from all surveyed systems. We describe the Epistemic Quality Score (EQS), a five-component evaluation methodology with explicit circularity analysis. In a controlled comparison ($n{=}10$ response pairs), OIDA’s RAG condition (3,868 tokens) achieves EQS 0.530 vs.\ 0.848 for a full-context baseline (108,687 tokens); the $28.1\times$ token budget difference is the primary confound. The QUESTION mechanism is statistically validated (Fisher $p{=}0.0325$, OR$=21.0$). The formal properties are established; the decisive ablation at equal token budget (E4) is pre-registered and not yet run.
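Editor’s note: the class-specific decay and the QUESTION primitive’s inverse decay can be illustrated with two exponential curves. All rate constants below are invented for illustration; OIDA’s actual scoring dynamics live in its Knowledge Gravity Engine.

```python
import math

# Rate constants are invented stand-ins, not values from the paper.
DECAY_RATES = {"DECISION": 0.01, "HYPOTHESIS": 0.10, "FACT": 0.02}

def importance(klass: str, initial: float, age_days: float) -> float:
    if klass == "QUESTION":
        # Inverse decay: unresolved questions surface with growing
        # urgency over time, capped at 1.0.
        return min(1.0, initial * math.exp(0.05 * age_days))
    return initial * math.exp(-DECAY_RATES[klass] * age_days)

for klass in ["DECISION", "HYPOTHESIS", "QUESTION"]:
    print(klass, [round(importance(klass, 0.5, d), 3) for d in (0, 10, 30)])
```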
[919] CROP: Conservative Reward for Model-based Offline Policy Optimization
Hao Li, Xiao-Hu Zhou, Shu-Hai Li, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Zeng-Guang Hou
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API request.
Method: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2310.17245: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2310.17245&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[920] Deep deterministic policy gradient with symmetric data augmentation for lateral attitude tracking control of a fixed-wing aircraft
Yifei Li, Erik-Jan van Kampen
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetch.
Method: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2407.11077: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.11077&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[921] Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading
Avinash Maurya, Jie Ye, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable.
Method: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2410.21316: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.21316&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[922] The Phantom of PCIe: Constraining Generative Artificial Intelligences for Practical Peripherals Trace Synthesizing
Zhibai Huang, Chen Chen, James Yen, Yihan Shen, Yongchen Xie, Zhixiang Wei, Kailiang Xu, Yun Wang, Fangxin Liu, Tao Song, Mingyuan Xia, Zhengwei Qi
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failure.
Method: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2411.06376: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.06376&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[923] WebLLM: A High-Performance In-Browser LLM Inference Engine
Charlie F. Ruan, Yucheng Qin, Akaash R. Parthasarathy, Xun Zhou, Ruihang Lai, Hongyi Jin, Yixin Dong, Bohan Hou, Meng-Shiun Yu, Yiyan Zhai, Sudeep Agarwal, Hangrui Cao, Siyuan Feng, Tianqi Chen
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper content.
Method: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2412.15803: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.15803&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[924] BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving
Seth Z. Zhao, Luobin Wang, Hongwei Ruan, Yuxin Bao, Yilan Chen, Ziyang Leng, Abhijit Ravichandran, Honglin He, Zewei Zhou, Xu Han, Abhishek Peri, Zhiyu Huang, Pranav Desai, Henrik Christensen, Jiaqi Ma, Bolei Zhou
Main category: cs.AI
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to API access issue.
Method: Unable to determine method due to API access issue
Result: Unable to determine results due to API access issue
Conclusion: Unable to analyze paper due to technical access limitations
Abstract: Failed to fetch summary for 2604.10856: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10856&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[925] Influencing Humans to Conform to Preference Models for RLHF
Stephane Hatgis-Kessell, W. Bradley Knox, Serena Booth, Peter Stone
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper content.
Method: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2501.06416: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.06416&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[926] ExPath: Targeted Pathway Inference for Biological Knowledge Bases via Graph Learning and Explanation
Rikuto Kotoge, Ziwei Yang, Zheng Chen, Yushun Dong, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetch.
Method: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2502.18026: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.18026&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[927] Optimizing Large Language Models: Metrics, Energy Efficiency, and Case Study Insights
Tahniat Khan, Soroor Motie, Sedef Akinli Kocak, Shaina Raza
Main category: cs.AI
TL;DR: Unable to analyze paper 2504.06307 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to the paper abstract.
Method: Cannot determine method without access to the paper abstract
Result: Cannot determine results without access to the paper abstract
Conclusion: Cannot draw conclusions without access to the paper abstract
Abstract: Failed to fetch summary for 2504.06307: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.06307&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[928] Federated Single-Agent Robotics: Multi-Robot Coordination Without Intra-Robot Multi-Agent Fragmentation
Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.11028: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11028&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[929] Non-stationary Diffusion For Probabilistic Time Series Forecasting
Weiwei Ye, Zhuopeng Xu, Ning Gui
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2505.04278: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.04278&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[930] Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
Kusha Sareen, Morgane M Moss, Alessandro Sordoni, Rishabh Agarwal, Arian Hosseini
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2505.04842: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.04842&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[931] EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.11174: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11174&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[932] 3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS
Bronislav Sidik, Dror Mizrahi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.11302: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11302&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[933] Towards Reasonable Concept Bottleneck Models
Nektarios Kalampalikis, Kavya Gupta, Georgi Vitanov, Isabel Valera
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2506.05014: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.05014&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[934] Learning to Forget – Hierarchical Episodic Memory for Lifelong Robot Deployment
Leonard Bärmann, Joana Plewnia, Alex Waibel, Tamim Asfour
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.11306: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11306&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[935] Minimal Embodiment Enables Efficient Learning of Number Concepts in Robot
Zhegong Shangguan, Alessandro Di Nuovo, Angelo Cangelosi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.11373: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11373&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[936] Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech
Edwin C. Montiel-Vazquez, Christian Arzate Cruz, Stefanos Gkikas, Thomas Kassiotis, Giorgos Giannakakis, Randy Gomez
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.11417: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11417&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[937] Modular Delta Merging with Orthogonal Constraints: A Scalable Framework for Continual and Reversible Model Composition
Haris Khan, Sadia Asif, Shumaila Asif, Muhammad Zeeshan Karamat, Rajesh Upadhayaya
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2507.20997: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.20997&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[938] Teaching the Teacher: The Role of Teacher-Student Smoothness Alignment in Genetic Programming-based Symbolic Distillation
Soumyadeep Dhar, Kei Sen Fong, Mehul Motani
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2507.22767: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.22767&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[939] Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards
Fang Wu, Aaron Tu, Weihao Xuan, Heli Qi, Xu Huang, Qingcheng Zeng, Shayan Talaei, Yijia Xiao, Peng Xia, Xiangru Tang, Yuchen Zhuang, Bing Hu, Hanqun Cao, Wenqi Shi, Rui Yang, Nan Liu, Huaxiu Yao, Ge Liu, Li Erran Li, Amin Saberi, Naoto Yokoya, Jure Leskovec, Yejin Choi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2509.21882: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21882&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[940] Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving
Haojie Bai, Aimin Li, Ruoyu Yao, Xiongwei Zhao, Tingting Zhang, Xing Zhang, Lin Gao, Jun Ma
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.11734: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11734&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[941] Grounded World Model for Semantically Generalizable Planning
Quanyi Li, Lan Feng, Haonan Zhang, Wuyang Li, Letian Wang, Alexandre Alahi, Harold Soh
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.11751: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11751&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[942] Unsupervised Detection of Spatiotemporal Anomalies in PMU Data Using Transformer-Based BiGAN
Muhammad Imran Hossain, Jignesh Solanki, Sarika Khushlani Solanki
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2509.25612: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25612&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[943] EEG-based AI-BCI Wheelchair Advancement: Hybrid Deep Learning with Motor Imagery for Brain Computer Interface
Bipul Thapa, Biplov Paneru, Bishwash Paneru, Khem Narayan Poudyal
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2509.25667: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25667&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[944] Detecting Invariant Manifolds in ReLU-Based RNNs
Lukas Eisenmann, Alena Brändle, Zahra Monfared, Daniel Durstewitz
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2510.03814: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.03814&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[945] A Mathematical Explanation of Transformers
Xue-Cheng Tai, Hao Liu, Lingfeng Li, Raymond H. Chan
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2510.03989: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.03989&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[946] Evolutionary Profiles for Protein Fitness Prediction
Jigang Fan, Xiaoran Jiao, Shengdong Lin, Zhanming Liang, Weian Mao, Chenchen Jing, Hao Chen, Chunhua Shen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2510.07286: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07286&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[947] VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Mo Guang, Kaiwen Long, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2506.02387: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.02387&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[948] Design Principles for Sequence Models via Coefficient Dynamics
Jerome Sieber, Antonio Orvieto, Melanie N. Zeilinger, Carmen Amo Alonso
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2510.09389: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09389&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[949] PosterGen: Aesthetic-Aware Multi-Modal Paper-to-Poster Generation via Multi-Agent LLMs
Zhilin Zhang, Xiang Zhang, Jiaqi Wei, Yiwei Xu, Chenyu You
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2508.17188: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.17188&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[950] Learning When Not to Learn: Risk-Sensitive Abstention in Bandits with Unbounded Rewards
Sarah Liaw, Benjamin Plaut
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2510.14884: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14884&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[951] Self-Certifying Primal-Dual Optimization Proxies for Large-Scale Batch Economic Dispatch
Michael Klamkin, Mathieu Tanneau, Pascal Van Hentenryck
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2510.15850: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15850&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[952] Interactive Learning for LLM Reasoning
Hehai Lin, Shilei Cao, Sudong Wang, Haotian Wu, Minzhi Li, Linyi Yang, Juepeng Zheng, Chengwei Qin
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2509.26306: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26306&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[953] TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance
Yuyang Liu, Chuan Wen, Yihang Hu, Dinesh Jayaraman, Yang Gao
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2509.26627: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26627&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[954] DistDF: Time-Series Forecasting Needs Joint-Distribution Wasserstein Alignment
Hao Wang, Licheng Pan, Yuan Lu, Zhixuan Chu, Xiaoxi Li, Shuting He, Zhichao Chen, Haoxuan Li, Qingsong Wen, Zhouchen Lin
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2510.24574: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.24574&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[955] Advancing Reasoning in Diffusion Language Models with Denoising Process Rewards
Shaoan Xie, Lingjing Kong, Xiangchen Song, Xinshuai Dong, Guangyi Chen, Eric P. Xing, Kun Zhang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2510.01544: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01544&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[956] Plug-and-Play Dramaturge: A Divide-and-Conquer Approach for Iterative Narrative Script Refinement via Collaborative LLM Agents
Wenda Xie, Chao Guo, Yanqing Jing, Junle Wang, Yisheng Lv, Fei-Yue Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2510.05188: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05188&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[957] SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance
Pengkun Jiao, Yiming Jin, Jianhui Yang, Chenhe Dong, Zerui Huang, Shaowei Yao, Xiaojiang Zhou, Dan Ou, Haihong Tang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2510.07972: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07972&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[958] Graph-Coarsening Approach for the Capacitated Vehicle Routing Problem with Time Windows
Mustafa Mert Özyılmaz
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2510.22329: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.22329&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[959] MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
Weihua Cheng, Junming Liu, Yifei Sun, Botian Shi, W Yirong Chen, Ding Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2510.24168: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.24168&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[960] Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight
Junze Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski, Mohsen Bayati
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2512.19691: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.19691&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[961] Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration
Yang Zhao, Yangou Ouyang, Xiao Ding, Hepeng Wang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, Ting Liu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2601.07224: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.07224&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[962] AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2601.11044: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11044&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[963] A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima
Yiming Tang, Harshvardhan Saini, Zhaoqian Yao, Zheng Lin, Yizhen Liao, Qianxiao Li, Mengnan Du, Dianbo Liu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2512.05534: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05534&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[964] Subargument Argumentation Frameworks: Separating Direct Conflict from Structural Dependency
Beishui Liao
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2601.12038: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.12038&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[965] Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility
Mengxuan Wang, Yuxin Chen, Gang Xu, Tao He, Hongjie Jiang, Ming Li
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2602.03402: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.03402&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[966] ANCHOR: Branch-Point Data Generation for GUI Agents
Jinbiao Wei, Yilun Zhao, Kangqi Ni, Arman Cohan
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2602.07153: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07153&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[967] Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search
Maximilian Weichart
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2512.21648: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.21648&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[968] X-SYS: A Reference Architecture for Interactive Explanation Systems
Tobias Labarta, Nhi Hoang, Maximilian Dreyer, Jim Berend, Oleg Hein, Jackie Ma, Wojciech Samek, Sebastian Lapuschkin
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2602.12748: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12748&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[969] Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice
Jiachen T. Wang, Tong Wu, Kaifeng Lyu, James Zou, Dawn Song, Ruoxi Jia, Prateek Mittal
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2512.24503: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.24503&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[970] Constrained Assumption-Based Argumentation Frameworks
Emanuele De Angelis, Fabio Fioravanti, Maria Chiara Meo, Alberto Pettorossi, Maurizio Proietti, Francesca Toni
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2602.13135: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13135&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[971] Enhanced-FQL(λ), an Efficient and Interpretable RL with novel Fuzzy Eligibility Traces and Segmented Experience Replay
Mohsen Jalaeian-Farimani, Xiong Xiong, Luca Bascetta
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2601.04392: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04392&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[972] Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence
Alisa Vinogradova, Vlad Vinogradov, Luba Greenwood, Ilya Yasny, Dmitry Kobyzev, Shoman Kasbekar, Kong Nguyen, Dmitrii Radkevich, Roman Doronin, Andrey Doronichev
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2602.15019: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15019&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[973] FlexMS is a flexible framework for benchmarking deep learning-based mass spectrum prediction tools in metabolomics
Yunhua Zhong, Yixuan Tang, Yifan Li, Jie Yang, Pan Liu, Jun Xia
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2602.22822: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22822&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[974] Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory
Boqin Yuan, Yue Su, Kun Yao
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.02473: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02473&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[975] DiSPA: Differential Substructure-Pathway Attention for Drug Response Prediction
Yewon Han, Sunghyun Kim, Eunyi Jeong, Sungkyung Lee, Seokwoo Yun, Sangsoo Lim
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2601.14346: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.14346&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[976] Do Machines Fail Like Humans? A Human-Centred Out-of-Distribution Spectrum for Mapping Error Alignment
Binxia Xu, Xiaoliang Luo, Luke Dickens, Robert M. Mok
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.07462: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.07462&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[977] Normative Common Ground Replication (NormCoRe): Replication-by-Translation for Studying Norms in Multi-Agent AI
Luca Deck, Simeon Allmendinger, Lucas Müller, Niklas Kühl
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.11974: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11974&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[978] dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models
Wenxuan Zhang, Lemeng Wu, Changsheng Zhao, Ernie Chang, Mingchen Zhuge, Zechun Liu, Andy Su, Hanxian Huang, Jun Chen, Chong Zhou, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Wei Wen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.18806: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18806&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[979] Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
Nicolas Martorell, Bruno Bianchi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.18893: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18893&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[980] Agentic Business Process Management: A Research Manifesto
Diego Calvanese, Angelo Casciani, Giuseppe De Giacomo, Marlon Dumas, Fabiana Fournier, Timotheus Kampik, Emanuele La Malfa, Lior Limonad, Andrea Marrella, Andreas Metzger, Marco Montali, Daniel Amyot, Peter Fettke, Artem Polyvyanyy, Stefanie Rinderle-Ma, Sebastian Sardiña, Niek Tax, Barbara Weber
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.18916: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18916&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[981] Maximum Entropy Relaxation of Multi-Way Cardinality Constraints for Synthetic Population Generation
François Pachet, Jean-Daniel Zucker
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.22558: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22558&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[982] From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments
Lijing Luo, Yiben Luo, Alexey Gorbatovski, Sergey Kovalchuk, Xiaodan Liang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.23964: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23964&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[983] Resisting Humanization: Ethical Front-End Design Choices in AI for Sensitive Contexts
Silvia Rossi, Diletta Huyskes, Mackenzie Jorgensen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.24853: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24853&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[984] AdvSynGNN: Structure-Adaptive Graph Neural Nets via Adversarial Synthesis and Self-Corrective Propagation
Rong Fu, Muge Qi, Chunlei Meng, Shuo Yin, Kun Liu, Zhaolu Kang, Simon Fong
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2602.17071: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17071&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[985] AIRA_2: Overcoming Bottlenecks in AI Research Agents
Karen Hambardzumyan, Nicolas Baldwin, Edan Toledo, Rishi Hazra, Michael Kuchnik, Bassel Al Omari, Thomas Simon Foster, Anton Protopopov, Jean-Christophe Gagnon-Audet, Ishita Mediratta, Kelvin Niu, Michael Shvartsman, Alisia Lupidi, Alexis Audran-Reiss, Parth Pathak, Tatiana Shavrina, Despoina Magka, Hela Momand, Derek Dunfield, Nicola Cancedda, Pontus Stenetorp, Carole-Jean Wu, Jakob Nicolaus Foerster, Yoram Bachrach, Martin Josifoski
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.26499: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.26499&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[986] SubQuad: Near-Quadratic-Free Structure Inference with Distribution-Balanced Objectives in Adaptive Receptor framework
Rong Fu, Zijian Zhang, Kun Liu, Jiekai Wu, Xianda Li, Simon Fong
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2602.17330: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17330&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[987] AutoMS: Multi-Agent Evolutionary Search for Cross-Physics Inverse Microstructure Design
Zhenyuan Zhao, Yu Xing, Tianyang Xue, Lingxin Cao, Xin Yan, Lin Lu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.27195: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27195&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[988] CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification
Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, Xue Liu, Xiaoxiao Li, Philip S. Yu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.01687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.01687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[989] RL-Driven Sustainable Land-Use Allocation for the Lake Malawi Basin
Ying Yao
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.03768: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.03768&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[990] Beyond Fluency: Toward Reliable Trajectories in Agentic IR
Anushree Sinha, Srivaths Ranganathan, Debanshu Das, Abhishek Dharmaratnakar
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.04269: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.04269&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[991] Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
Yifei Zhang, Xu Yang, Xiao Yang, Bowen Xian, Qizheng Li, Shikai Fang, Jingyuan Li, Jian Wang, Mingrui Xu, Weiqing Liu, Jiang Bian
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.01692: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01692&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[992] A mathematical theory of evolution for self-designing AIs
Kenneth D Harris
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.05142: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05142&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[993] Learning to Focus: CSI-Free Hierarchical MARL for Reconfigurable Reflectors
Hieu Le, Mostafa Ibrahim, Oguz Bedir, Jian Tao, Sabit Ekin
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.05165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[994] CuraLight: Debate-Guided Data Curation for LLM-Centered Traffic Signal Control
Qing Guo, Xinhang Li, Junyu Chen, Zheng Guo, Shengzhe Xu, Lin Zhang, Lei Li
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.05663: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05663&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[995] EmoMAS: Emotion-Aware Multi-Agent System for High-Stakes Edge-Deployable Negotiation with Bayesian Orchestration
Yunbo Long, Yuhan Liu, Liming Xu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.07003: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07003&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[996] Physics-informed AI Accelerated Retention Analysis of Ferroelectric Vertical NAND: From Day-Scale TCAD to Second-Scale Surrogate Model
Gyujun Jeong, Sungwon Cho, Minji Shon, Namhoon Kim, Woohyun Hwang, Kwangyou Seo, Suhwan Lim, Wanki Kim, Daewon Ha, Prasanna Venkatesan, Kihang Youn, Ram Cherukuri, Yiyi Wang, Suman Datta, Asif Khan, Shimeng Yu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.06881: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06881&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[997] EVGeoQA: Benchmarking LLMs on Dynamic, Multi-Objective Geo-Spatial Exploration
Jianfei Wu, Zhichun Wang, Zhensheng Wang, Zhiyu He
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.07070: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07070&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[998] Causally Sufficient and Necessary Feature Expansion for Class-Incremental Learning
Zhen Zhang, Jielei Chu, Tianrui Li
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2603.09145: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09145&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[999] Rhizome OS-1: Rhizome’s Semi-Autonomous Operating System for Small Molecule Drug Discovery
Yiwen Wang, Gregory Sinenka, Xhuliano Brace
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.07512: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07512&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1000] SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents
Xinshun Feng, Xinhao Song, Lijun Li, Gongshen Liu, Jing Shao
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Details
Motivation, Method, Result, Conclusion: unavailable; the paper metadata could not be retrieved.
Abstract: Failed to fetch summary for 2604.07791: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07791&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1001] HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Montoya, Nandan Marwaha, Yannis He, Charles Wang, Fernando Crabedo, Alessa Castilo, Bing Liu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.09408 was rate-limited (HTTP 429).
[1002] PoTable: Towards Systematic Thinking via Plan-then-Execute Stage Reasoning on Tables
Qingyang Mao, Qi Liu, Zhi Li, Mingyue Cheng, Zheng Zhang, Rui Li
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2412.04272 was rate-limited (HTTP 429).
[1003] A Multiparty Homomorphic Encryption Approach to Confidential Federated Kaplan Meier Survival Analysis
Narasimha Raghavan Veeraragavan, Svetlana Boudko, Jan Franz Nygård
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2412.20495 was rate-limited (HTTP 429).
[1004] Curriculum-based Sample Efficient Reinforcement Learning for Robust Stabilization of a Quadrotor
Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, Shruti Kotpaliwar, George Nikolakopoulos
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2501.18490 was rate-limited (HTTP 429).
[1005] Learning to Play Piano in the Real World
Yves-Simon Zeulner, Simon Crämer, Sandeep Selvaraj, Roberto Calandra
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2503.15481 was rate-limited (HTTP 429).
[1006] MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization
Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Dawei Yang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.06798 was rate-limited (HTTP 429).
[1007] Latent Structure of Affective Representations in Large Language Models
Benjamin J. Choi, Melanie Weber
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.07382 was rate-limited (HTTP 429).
[1008] LLM-based Realistic Safety-Critical Driving Video Generation
Yongjie Fu, Ruijian Zha, Pei Tian, Xuan Di
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2507.01264 was rate-limited (HTTP 429).
[1009] Absorption and Inertness in Coarse-Grained Arithmetic: A Heuristic Application to the St. Petersburg Paradox
Takashi Izumo
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2507.12475 was rate-limited (HTTP 429).
[1010] Large Language Model as An Operator: An Experience-Driven Solution for Distribution Network Voltage Control
Xu Yang, Chenhui Lin, Licheng Sha, Liping Yang, Shuzhou Wu, Xichen Tian, Haotian Liu, Wenchuan Wu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2507.14800 was rate-limited (HTTP 429).
[1011] Lifetime-Aware Design for Item-Level Intelligence at the Extreme Edge
Shvetank Prakash, Andrew Cheng, Olof Kindgren, Ashiq Ahamed, Graham Knight, Jed Kufel, Francisco Rodriguez, Arya Tschand, David Kong, Mariam Elgamal, Jerry Huang, Emma Chen, Gage Hills, Richard Price, Emre Ozer, Vijay Janapa Reddi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2509.08193 was rate-limited (HTTP 429).
[1012] Context-Guided Decompilation: A Step Towards Re-executability
Xiaohan Wang, Yuxin Hu, Kevin Leach
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.01763 was rate-limited (HTTP 429).
[1013] Multimodal Diffusion Forcing for Forceful Manipulation
Zixuan Huang, Huaidian Hou, Dmitry Berenson
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.04812 was rate-limited (HTTP 429).
[1014] Volumetric Ergodic Control
Jueun Kwon, Max M. Sun, Todd Murphey
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.11533 was rate-limited (HTTP 429).
[1015] GroupRank: A Groupwise Paradigm for Effective and Efficient Passage Reranking with LLMs
Meixiu Long, Duolin Sun, Dan Yang, Yihan Jiao, Lei Liu, Jiahai Wang, BinBin Hu, Yue Shen, Jie Feng, Zhehao Tan, Junjie Wang, Lianzhen Zhong, Jian Wang, Peng Wei, Jinjie Gu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.11653 was rate-limited (HTTP 429).
[1016] Improving Neutrino Oscillation Measurements through Event Classification
Sebastian A. R. Ellis, Daniel C. Hackett, Shirley Weishi Li, Pedro A. N. Machado, Karla Tame-Narvaez
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.11938 was rate-limited (HTTP 429).
[1017] WisPaper: Your AI Scholar Search Engine
Li Ju, Jun Zhao, Mingxu Chai, Ziyu Shen, Xiangyang Wang, Yage Geng, Chunchun Ma, Hao Peng, Guangbin Li, Tao Li, Chengyong Liao, Fu Wang, Xiaolong Wang, Junshen Chen, Rui Gong, Shijia Liang, Feiyan Li, Ming Zhang, Kexin Tan, Junjie Ye, Zhiheng Xi, Shihan Dou, Tao Gui, Yuankai Ying, Yang Shi, Yue Zhang, Qi Zhang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.06879 was rate-limited (HTTP 429).
[1018] Artificial Intelligence for All? Brazilian Teachers on Ethics, Equity, and the Everyday Challenges of AI in Education
Bruno Florentino, Camila Sestito, Wellington Cruz, André de Carvalho, Robson Bonidia
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.23834 was rate-limited (HTTP 429).
[1019] AI-enhanced tuning of quantum dot Hamiltonians toward Majorana modes
Mateusz Krawczyk, Jarosław Pawłowski
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.02149 was rate-limited (HTTP 429).
[1020] Self-Organizing Dual-Buffer Adaptive Clustering Experience Replay (SODACER) for Safe Reinforcement Learning in Optimal Control
Roya Khalili Amirabadi, Mohsen Jalaeian Farimani, Omid Solaymani Fard
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.06540 was rate-limited (HTTP 429).
[1021] StreetDesignAI: A Multi-Persona Evaluation System for Inclusive Infrastructure Design
Ziyi Wang, Yilong Dai, Duanya Lyu, Mateo Nader, Sihan Chen, Wanghao Ye, Zjian Ding, Xiang Yan
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.15671 was rate-limited (HTTP 429).
[1022] The Weight of a Bit: EMFI Sensitivity Analysis of Embedded Deep Learning Models
Jakub Breier, Štefan Kučerák, Xiaolu Hou
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.16309 was rate-limited (HTTP 429).
[1023] UBio-MolFM: A Universal Molecular Foundation Model for Bio-Systems
Lin Huang, Arthur Jiang, XiaoLi Liu, Zion Wang, Jason Zhao, Chu Wang, HaoCheng Lu, ChengXiang Huang, JiaJun Cheng, YiYue Du, Jia Zhang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.17709 was rate-limited (HTTP 429).
[1024] Reinforced Generation of Combinatorial Structures: Ramsey Numbers
Ansh Nagda, Prabhakar Raghavan, Abhradeep Thakurta
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.09172 was rate-limited (HTTP 429).
[1025] Suiren-1.0 Technical Report: A Family of Molecular Foundation Models
Junyi An, Xinyu Lu, Yun-Fei Shi, Li-Cheng Xu, Nannan Zhang, Chao Qu, Yuan Qi, Fenglei Cao
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.21942 was rate-limited (HTTP 429).
[1026] Cognitive Training for Language Models: Towards General Capabilities via Cross-Entropy Games
Clément Hongler, Franck Gabriel, Valentin Hartmann, Arthur Renard, Andrew Emil
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.22479 was rate-limited (HTTP 429).
[1027] Unilateral Relationship Revision Power in Human-AI Companion Interaction
Benjamin Lange
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.23315 was rate-limited (HTTP 429).
[1028] Policy-Guided Threat Hunting: An LLM enabled Framework with Splunk SOC Triage
Rishikesh Sahay, Bell Eapen, Weizhi Meng, Md Rasel Al Mamun, Nikhil Kumar Dora, Manjusha Sumasadan, Sumit Kumar Tetarave, Elyson De La Cruz
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.23966 was rate-limited (HTTP 429).
[1029] Beyond Message Passing: A Semantic View of Agent Communication Protocols
Dun Yuan, Fuyuan Lyu, Ye Yuan, Weixu Zhang, Bowei He, Jiayi Geng, Linfeng Du, Zipeng Sun, Yankai Chen, Changjiang Han, Jikun Kang, Xi Chen, Haolun Wu, Xue Liu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.02369 was rate-limited (HTTP 429).
[1030] When simulations look right but causal effects go wrong: Large language models as behavioral simulators
Zonghan Li, Feng Ji
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.02458 was rate-limited (HTTP 429).
[1031] LitPivot: Developing Well-Situated Research Ideas Through Dynamic Contextualization and Critique within the Literature Landscape
Hita Kambhamettu, Bhavana Dalvi Mishra, Andrew Head, Jonathan Bragg, Aakanksha Naik, Joseph Chee Chang, Pao Siangliulue
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.02600 was rate-limited (HTTP 429).
[1032] The Augmentation Trap: AI Productivity and the Cost of Cognitive Offloading
Michael Caosun, Sinan Aral
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.03501 was rate-limited (HTTP 429).
[1033] Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures
Benjamin Rombaut
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.03515 was rate-limited (HTTP 429).
[1034] The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
Manish Bhatt, Sarthak Munshi, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Joel Webb, Blake Gatto, Md Tamjidul Hoque
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.06436 was rate-limited (HTTP 429).
[1035] Exact Structural Abstraction and Tractability Limits
Tristan Simas
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.07349 was rate-limited (HTTP 429).
[1036] Private Seeds, Public LLMs: Realistic and Privacy-Preserving Synthetic Data Generation
Qian Ma, Sarah Rajtmajer
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.07486 was rate-limited (HTTP 429).
[1037] Multi-Modal Learning meets Genetic Programming: Analyzing Alignment in Latent Space Optimization
Benjamin Léger, Kazem Meidani, Christian Gagné
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.08324 was rate-limited (HTTP 429).
[1038] Physics-guided surrogate learning enables zero-shot control of turbulent wings
Yuning Wang, Pol Suarez, Mathis Bode, Ricardo Vinuesa
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.09434 was rate-limited (HTTP 429).
cs.SD
[1039] Real-Time Voicemail Detection in Telephony Audio Using Temporal Speech Activity Features
Kumar Saurav
Main category: cs.SD
TL;DR: A lightweight system using temporal speech patterns from a pre-trained VAD to distinguish voicemail greetings from live human answers in real-time AI calling systems.
Details
Motivation: AI calling systems need to distinguish voicemail greetings from live human answers in real time to avoid wasted agent interactions and dropped calls, requiring a lightweight solution that works on commodity hardware.
Method: Extracts 15 temporal features from speech activity patterns using a pre-trained neural voice activity detector, then classifies with a shallow tree-based ensemble. Evaluated 3,780 model/feature/threshold combinations.
Result: 96.1% accuracy across 764 recordings (99.3% on expert-labeled test, 95.4% on production set). In production: 0.3% false positive, 1.3% false negative rates. Runs in 46ms on dual-core CPU, supports 380+ concurrent calls.
Conclusion: Temporal speech patterns are a strong signal for voicemail detection. Adding transcription or beep features didn’t improve real-time performance and increased latency. Lightweight approach works well on commodity hardware.
Abstract: Outbound AI calling systems must distinguish voicemail greetings from live human answers in real time to avoid wasted agent interactions and dropped calls. We present a lightweight approach that extracts 15 temporal features from the speech activity pattern of a pre-trained neural voice activity detector (VAD), then classifies with a shallow tree-based ensemble. Across two evaluation sets totaling 764 telephony recordings, the system achieves a combined 96.1% accuracy (734/764), with 99.3% (139/140) on an expert-labeled test set and 95.4% (595/624) on a held-out production set. In production validation over 77,000 calls, it maintained a 0.3% false positive rate and 1.3% false negative rate. End-to-end inference completes in 46 ms on a commodity dual-core CPU with no GPU, supporting 380+ concurrent WebSocket calls. In our search over 3,780 model, feature, and threshold combinations, feature importance was concentrated in three temporal variables. Adding transcription keywords or beep-based features did not improve the best real-time configuration and increased latency substantially. Our results suggest that temporal speech patterns are a strong signal for distinguishing voicemail greetings from live human answers.
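A minimal sketch of this kind of pipeline, assuming per-frame speech/no-speech decisions are already available from a pre-trained VAD. The six features and the GradientBoostingClassifier below are illustrative stand-ins, not the paper's actual 15 features or its specific ensemble.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

FRAME_MS = 30  # assumed VAD frame hop

def temporal_features(vad: np.ndarray) -> np.ndarray:
    """Summarize a binary VAD sequence (1 = speech) into temporal statistics."""
    idx = np.flatnonzero(vad)
    if idx.size == 0:
        return np.zeros(6)
    # Run-length encode the speech/silence segments.
    segments = np.split(vad, np.flatnonzero(np.diff(vad)) + 1)
    speech = [len(s) for s in segments if s[0] == 1]
    silence = [len(s) for s in segments if s[0] == 0]
    return np.array([
        idx[0] * FRAME_MS,            # latency until first speech
        max(speech) * FRAME_MS,       # longest uninterrupted utterance
        np.mean(speech) * FRAME_MS,   # mean utterance length
        len(speech),                  # number of speech bursts
        len(silence),                 # number of pauses
        vad.mean(),                   # overall speech ratio
    ])

# Toy training data; y = 1 for voicemail greeting, 0 for live human answer.
rng = np.random.default_rng(0)
X = np.stack([temporal_features(rng.integers(0, 2, 200)) for _ in range(64)])
y = np.arange(64) % 2
clf = GradientBoostingClassifier().fit(X, y)
print(clf.predict(X[:3]))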
[1040] MAGE: Modality-Agnostic Music Generation and Editing
Muhammad Usama Saleem, Tejasvi Ravi, Tianyu Xu, Rajeev Nongpiur, Ishan Chatterjee, Mayur Jagdishbhai Patel, Pu Wang
Main category: cs.SD
TL;DR: MAGE is a modality-agnostic framework for multimodal music generation and editing that unifies tasks using flow-based Transformers and cross-gated modulation for better cross-modal grounding.
Details
Motivation: Current multimodal music systems are limited by single-task designs, brittle prompting interfaces, and weak cross-modal grounding that causes prompt drift and spurious content during generation/editing.
Method: Uses Controlled Multimodal FluxFormer (flow-based Transformer) for controllable latent trajectories, Audio-Visual Nexus Alignment for temporal consistency, cross-gated modulation for multiplicative control, and dynamic modality-masking curriculum training.
Result: Achieves competitive quality on MUSIC benchmark, supports effective multimodal-guided music generation and targeted editing with robust inference under missing modalities.
Conclusion: MAGE provides a lightweight, flexible framework for practical music workflows that unifies multimodal music generation and editing with improved cross-modal grounding.
Abstract: Multimodal music creation requires models that can both generate audio from high-level cues and edit existing mixtures in a targeted manner. Yet most multimodal music systems are built for a single task and a fixed prompting interface, making their conditioning brittle when guidance is ambiguous, temporally misaligned, or partially missing. Common additive fusion or feature concatenation further weakens cross-modal grounding, often causing prompt drift and spurious musical content during generation and editing. We propose MAGE, a modality-agnostic framework that unifies multimodal music generation and mixture-grounded editing within a single continuous latent formulation. At its core, MAGE uses a Controlled Multimodal FluxFormer, a flow-based Transformer that learns controllable latent trajectories for synthesis and editing under any available subset of conditions. To improve grounding, we introduce Audio-Visual Nexus Alignment to select temporally consistent visual evidence for the audio timeline, and a cross-gated modulation mechanism that applies multiplicative control from aligned visual and textual cues to the audio latents, suppressing unsupported components rather than injecting them. Finally, we train with a dynamic modality-masking curriculum that exposes the model to text-only, visual-only, joint multimodal, and mixture-guided settings, enabling robust inference under missing modalities without training separate models. Experiments on the MUSIC benchmark show that MAGE supports effective multimodal-guided music generation and targeted editing, achieving competitive quality while offering a lightweight and flexible interface tailored to practical music workflows.
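The cross-gated modulation idea described above (multiplicative control that can suppress but not inject content) can be sketched in a few lines. This is a generic reading of the mechanism, not MAGE's actual implementation; the dimensions and the single-linear gate are assumptions.

import torch
import torch.nn as nn

class CrossGate(nn.Module):
    def __init__(self, d_audio: int, d_cond: int):
        super().__init__()
        self.gate = nn.Linear(d_cond, d_audio)

    def forward(self, audio: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Sigmoid gate in (0, 1): it can only attenuate audio-latent channels
        # the aligned condition does not support, never add new content.
        return audio * torch.sigmoid(self.gate(cond))

audio = torch.randn(2, 100, 256)   # (batch, frames, audio latent dim)
visual = torch.randn(2, 100, 512)  # condition already aligned to the audio timeline
print(CrossGate(256, 512)(audio, visual).shape)  # torch.Size([2, 100, 256])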
[1041] Masked Contrastive Pre-Training Improves Music Audio Key Detection
Ori Yonay, Tracy Hammond, Tianbao Yang
Main category: cs.SD
TL;DR: Self-supervised music foundation models achieve SOTA key detection through masked contrastive pretraining on Mel spectrograms, enabling pitch-sensitive representations without complex data augmentation.
Details
Motivation: Self-supervised music foundation models currently underperform on key detection tasks, which require pitch-sensitive representations. The paper aims to systematically study how self-supervised pretraining design affects pitch sensitivity and demonstrate that masked contrastive embeddings enable state-of-the-art key detection performance.
Method: The researchers use masked contrastive pretraining on Mel spectrograms, then perform linear evaluation for key detection. They further train shallow but wide multi-layer perceptrons (MLPs) on features extracted from the base model. The approach avoids sophisticated data augmentation policies and analyzes robustness of learned representations.
Result: The method achieves state-of-the-art performance in music key detection in the supervised setting. Linear evaluation after masking-based contrastive pretraining shows competitive performance out-of-the-box, and MLPs trained on extracted features achieve SOTA results. The learned representations naturally encode common augmentations.
Conclusion: Self-supervised pretraining is an effective approach for pitch-sensitive music information retrieval tasks. Masked contrastive embeddings uniquely enable SOTA key detection performance, providing insights for designing and probing music foundation models.
Abstract: Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and demonstrate that masked contrastive embeddings uniquely enable state-of-the-art (SOTA) performance in key detection in the supervised setting. First, we discover that linear evaluation after masking-based contrastive pretraining on Mel spectrograms leads to competitive performance on music key detection out of the box. This leads us to train shallow but wide multi-layer perceptrons (MLPs) on features extracted from our base model, leading to SOTA performance without the need for sophisticated data augmentation policies. We further analyze robustness and show empirically that the learned representations naturally encode common augmentations. Our study establishes self-supervised pretraining as an effective approach for pitch-sensitive MIR tasks and provides insights for designing and probing music foundation models.
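A compact sketch of the pretraining objective described above: two randomly masked views of the same Mel spectrogram should embed close together under an InfoNCE loss. The tiny linear encoder, masking scheme, and temperature are placeholder assumptions, not the paper's setup.

import torch
import torch.nn.functional as F

def mask_view(mel: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    # Zero out a random subset of time frames; mel: (B, T, n_mels).
    keep = (torch.rand(mel.shape[:2]) > ratio).float()
    return mel * keep.unsqueeze(-1)

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau          # (B, B) pairwise similarities
    targets = torch.arange(z1.size(0))  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

encoder = torch.nn.Sequential(          # stand-in for the real audio encoder
    torch.nn.Flatten(), torch.nn.Linear(200 * 128, 256))
mel = torch.randn(8, 200, 128)          # (batch, frames, mel bins)
loss = info_nce(encoder(mask_view(mel)), encoder(mask_view(mel)))
loss.backward()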
[1042] Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences
Matteo Spanio, Valentina Frezzato, Antonio Rodà
Main category: cs.SD
TL;DR: The paper addresses the challenge of collecting large cross-modal datasets for music-flavor research by validating that audio-flavor correlations transfer from small human-annotated datasets to large synthetic datasets, and that computational flavor targets align with human perception.
Details
Motivation: Collecting large, aligned cross-modal datasets for music-flavor research is difficult because perceptual experiments are costly and small by design. There's a need to overcome this bottleneck to enable more extensive research in audio-flavor correlations.
Method: Two complementary experiments: 1) Tests transfer of audio-flavor correlations, feature-importance rankings, and latent-factor structure from a small experimental soundtracks collection (257 tracks with human annotations) to a large FMA-derived corpus (~49,300 segments with synthetic labels). 2) Validates computational flavor targets (derived from food chemistry via reproducible pipeline) against human perception in an online listener study with 49 participants and 20 tracks.
Result: Both experiments converge: quantitative transfer analysis confirms cross-modal structure is preserved across supervision regimes, and perceptual evaluation shows significant alignment between computational targets and listener ratings (permutation p<0.0001, Mantel r=0.45, Procrustes m²=0.51). Sonic seasoning effects are present in synthetic FMA annotations.
Conclusion: The findings support that synthetic annotations can effectively capture cross-modal audio-flavor relationships, enabling larger-scale research. The authors release datasets and code to support reproducible cross-modal AI research.
Abstract: Collecting large, aligned cross-modal datasets for music-flavor research is difficult because perceptual experiments are costly and small by design. We address this bottleneck through two complementary experiments. The first tests whether audio-flavor correlations, feature-importance rankings, and latent-factor structure transfer from an experimental soundtracks collection (257 tracks with human annotations) to a large FMA-derived corpus (~49,300 segments with synthetic labels). The second validates computational flavor targets – derived from food chemistry via a reproducible pipeline – against human perception in an online listener study (49 participants, 20 tracks). Results from both experiments converge: the quantitative transfer analysis confirms that cross-modal structure is preserved across supervision regimes, and the perceptual evaluation shows significant alignment between computational targets and listener ratings (permutation p < 0.0001, Mantel r = 0.45, Procrustes m² = 0.51). Together, these findings support the conclusion that sonic seasoning effects are present in synthetic FMA annotations. We release datasets and companion code to support reproducible cross-modal AI research.
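Since the perceptual validation leans on a permutation Mantel test, here is a generic implementation of that statistic (correlation between the upper triangles of two distance matrices, with a permutation p-value). It is not the authors' evaluation code, and the toy matrices below are fabricated.

import numpy as np

def mantel(d1: np.ndarray, d2: np.ndarray, n_perm: int = 999, seed: int = 0):
    iu = np.triu_indices_from(d1, k=1)    # upper triangle, no diagonal
    r_obs = np.corrcoef(d1[iu], d2[iu])[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(d1.shape[0])  # relabel items in one matrix
        hits += np.corrcoef(d1[p][:, p][iu], d2[iu])[0, 1] >= r_obs
    return r_obs, (hits + 1) / (n_perm + 1)

# Toy symmetric distance matrices standing in for audio and flavor distances.
x = np.random.default_rng(1).random((20, 3))
d_audio = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
d_flavor = d_audio + 0.1 * np.random.default_rng(2).random(d_audio.shape)
d_flavor = (d_flavor + d_flavor.T) / 2
np.fill_diagonal(d_flavor, 0)
print(mantel(d_audio, d_flavor))  # (r, p); correlated toys give a small p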
[1043] From Speech to Profile: A Protocol-Driven LLM Agent for Psychological Profile Generation
Xingjian Yang, Yudong Yang, Zhixing Guo, Yongjie Zhou, Nan Yan, Lan Wang
Main category: cs.SD
TL;DR: StreamProfile: A streaming framework for generating verifiable psychological profiles from counseling speech using hierarchical evidence memory and chain-of-thought reasoning to prevent hallucinations.
Details
Motivation: Psychological profiles for depression patients are essential for psychotherapy, but current LLM-based approaches suffer from long-context forgetting and hallucinations due to overlong speech, multi-party interactions, and unstructured chatting in counseling sessions.
Method: StreamProfile processes counseling speech incrementally, extracts evidence grounded in ASR transcriptions and stores it in a Hierarchical Evidence Memory, then runs a Chain-of-Thought pipeline structured by the PM+ psychological intervention for clinical reasoning.
Result: Experiments on real-world teenager counseling speech show that StreamProfile can accurately generate psychological profiles and prevent hallucination by making every claim traceable to evidence.
Conclusion: The proposed streaming framework successfully addresses hallucination issues in psychological profile generation by grounding all claims in verifiable evidence from counseling speech.
Abstract: The psychological profile that structurally documents the case of a depression patient is essential for psychotherapy. Large language models can be applied to summarize such profiles from counseling speech; however, they may suffer from long-context forgetting and produce unverifiable hallucinations due to the overlong speech, multi-party interactions, and unstructured chatting. We therefore propose StreamProfile, a streaming framework that processes counseling speech incrementally, extracts evidence grounded in ASR transcriptions and stores it in a Hierarchical Evidence Memory, and then performs a Chain-of-Thought pipeline according to the PM+ psychological intervention for clinical reasoning. The final profile is synthesized strictly from this evidence, making every claim traceable. Experiments on real-world teenager counseling speech show that the proposed StreamProfile system can accurately generate profiles and prevent hallucination.
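A minimal sketch of the evidence-grounding idea, under an invented schema: every profile claim must reference entries in the memory, so an untraceable claim fails loudly. The field names and tag taxonomy are hypothetical, not StreamProfile's actual data model.

from dataclasses import dataclass, field

@dataclass
class Evidence:
    eid: int
    start_s: float   # timestamp into the session audio
    speaker: str
    quote: str       # ASR transcript span
    tag: str         # e.g. "mood", "sleep" (assumed taxonomy)

@dataclass
class EvidenceMemory:
    items: list = field(default_factory=list)

    def add(self, **kw) -> int:
        self.items.append(Evidence(eid=len(self.items), **kw))
        return self.items[-1].eid

    def support(self, eids):
        # Resolve a claim to its evidence; a bad id raises = untraceable claim.
        return [self.items[e] for e in eids]

mem = EvidenceMemory()
e = mem.add(start_s=812.4, speaker="client", quote="I barely sleep", tag="sleep")
claim = {"text": "Reports persistent insomnia", "evidence": [e]}
print(mem.support(claim["evidence"])[0].quote)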
[1044] Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music
Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang-gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesh Manocha, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, Wei Ping
Main category: cs.SD
TL;DR: AF-Next is an advanced large audio-language model that improves audio understanding and reasoning across speech, environmental sounds, and music, with support for long audio inputs up to 30 minutes and new temporal reasoning capabilities.
Details
Motivation: To address limitations in existing audio-language models by improving accuracy, supporting longer audio inputs, and enabling better temporal reasoning and interpretability for complex audio understanding tasks.
Method: Systematic analysis of Audio Flamingo 3 to identify gaps, curation of large-scale datasets (1M+ hours), curriculum-based training (pre-training, mid-training, post-training), and introduction of Temporal Audio Chain-of-Thought for timestamp-grounded reasoning.
Result: Outperforms similarly sized open models by large margins across 20 benchmarks, competitive with larger models, exhibits strong real-world utility and generalization to unseen tasks.
Conclusion: AF-Next represents a significant advancement in audio-language modeling with improved capabilities for understanding and reasoning over diverse audio types, especially for long and complex audio inputs.
Abstract: We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with, and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. In addition to all data, code and methods, we open-source 3 variants of AF-Next, including AF-Next-Instruct, AF-Next-Think and AF-Next-Captioner.
[1045] Learning to Attend to Depression-Related Patterns: An Adaptive Cross-Modal Gating Network for Depression Detection
Hangbin Yu, Yudong Yang, Rongfeng Su, Nan Yan, Lan Wang
Main category: cs.SD
TL;DR: A depression detection network using Adaptive Cross-Modal Gating (ACMG) to selectively focus on depression-relevant segments in speech by adaptively weighting acoustic and textual features.
Details
Motivation: Depression-related patterns in speech are sparse, occurring in specific segments rather than uniformly distributed. Most existing methods treat all frames equally, missing this sparsity and failing to focus on diagnostically relevant features.
Method: Proposes ACMG (Adaptive Cross-Modal Gating) that adaptively reassigns frame-level weights across acoustic and textual modalities, enabling selective attention to depression-related segments in speech signals.
Result: The depression detection system with ACMG outperforms baselines without it. Visualization analyses confirm ACMG automatically attends to clinically meaningful patterns including low-energy acoustic segments and textual segments containing negative sentiments.
Conclusion: ACMG effectively addresses the sparsity of depression-related patterns in speech by adaptively focusing on relevant segments across modalities, improving depression detection performance.
Abstract: Automatic depression detection using speech signals with acoustic and textual modalities is a promising approach for early diagnosis. Depression-related patterns exhibit sparsity in speech: diagnostically relevant features occur in specific segments rather than being uniformly distributed. However, most existing methods treat all frames equally, assuming depression-related information is uniformly distributed and thus overlooking this sparsity. To address this issue, we propose a depression detection network based on Adaptive Cross-Modal Gating (ACMG) that adaptively reassigns frame-level weights across both modalities, enabling selective attention to depression-related segments. Experimental results show that the depression detection system with ACMG outperforms baselines without it. Visualization analyses further confirm that ACMG automatically attends to clinically meaningful patterns, including low-energy acoustic segments and textual segments containing negative sentiments.
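The gating idea is easy to illustrate: score each frame per modality, softmax across modalities, and fuse. This is one plausible reading under assumed, pre-aligned frame sequences, not the paper's exact ACMG design.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Linear(d, 1)

    def forward(self, acoustic: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # acoustic, text: (B, T, d), assumed aligned frame by frame.
        s = torch.cat([self.score(acoustic), self.score(text)], dim=-1)  # (B, T, 2)
        w = torch.softmax(s, dim=-1)             # per-frame modality weights
        fused = w[..., :1] * acoustic + w[..., 1:] * text
        return fused.mean(dim=1)                 # pooled utterance-level feature

a, t = torch.randn(4, 50, 128), torch.randn(4, 50, 128)
print(GatedFusion(128)(a, t).shape)  # torch.Size([4, 128])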
[1046] Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lv, Wei Xue, Yike Guo
Main category: cs.SD
TL;DR: Audio-Omni is the first end-to-end framework that unifies audio generation and editing across general sound, music, and speech domains with integrated multimodal understanding capabilities, achieving SOTA performance across multiple benchmarks.
Details
Motivation: Current multimodal models typically address audio understanding, generation, and editing with specialized models, lacking a unified framework that can seamlessly integrate all three tasks across general domains.
Method: Combines a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis, and introduces the AudioEdit dataset with over 1M curated editing pairs to overcome data scarcity in audio editing.
Result: Achieves state-of-the-art performance across benchmarks, outperforming prior unified approaches while matching or surpassing specialized expert models, with additional capabilities like knowledge-augmented reasoning, in-context generation, and zero-shot cross-lingual control.
Conclusion: Audio-Omni represents a promising direction toward universal generative audio intelligence by unifying generation and editing across domains with integrated multimodal understanding.
Abstract: Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released on https://zeyuet.github.io/Audio-Omni.
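The frozen-reasoner / trainable-generator split is a reusable pattern worth a sketch: only the generator receives gradients. Both modules below are tiny stand-ins, and the concatenation-based conditioning is an assumption, not Audio-Omni's actual wiring.

import torch
import torch.nn as nn

layer = lambda: nn.TransformerEncoderLayer(256, 4, batch_first=True)
mllm = nn.TransformerEncoder(layer(), 2).requires_grad_(False).eval()  # frozen reasoner
dit = nn.TransformerEncoder(layer(), 2)                                # trainable generator
opt = torch.optim.AdamW(dit.parameters())

tokens = torch.randn(2, 32, 256)        # stand-in multimodal inputs
with torch.no_grad():
    cond = mllm(tokens)                 # high-level conditioning, no gradients
noisy = torch.randn(2, 64, 256)         # stand-in noisy audio latents
pred = dit(torch.cat([cond, noisy], dim=1))[:, cond.size(1):]
loss = (pred - torch.randn_like(pred)).pow(2).mean()  # stand-in diffusion loss
loss.backward()
opt.step()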
[1047] Descriptor-Injected Cross-Modal Learning: A Systematic Exploration of Audio-MIDI Alignment via Spectral and Melodic Features
Mariano Fernández Méndez
Main category: cs.SD
TL;DR: Audio-MIDI cross-modal retrieval improved by 8.8 percentage points using descriptor injection, with octave-band energy dynamics as the key audio feature and reverse cross-attention for efficiency.
Details
Motivation: Cross-modal retrieval between audio recordings and symbolic MIDI representations is challenging due to fundamentally different data types (continuous waveforms vs discrete event sequences). The paper aims to bridge this gap using descriptor injection techniques.
Method: Three-phase campaign testing 13 descriptor-mechanism combinations, 6 architectural families, and 3 training schedules. Introduces reverse cross-attention where descriptor tokens query encoder features. Uses MAESTRO v3.0.0 dataset with controlled evaluation protocol.
Result: Best configuration achieves mean S of 84.0% across five seeds, improving baseline by 8.8 percentage points. Audio descriptor A4 (octave-band energy dynamics) drives performance gains. CKA analysis shows descriptors increase audio-MIDI transformer layer alignment.
Conclusion: Descriptor injection effectively bridges audio-MIDI gap, with octave-band energy dynamics as discriminative signal. Reverse cross-attention reduces computational cost while maintaining performance. Descriptors enable representational convergence rather than simple feature concatenation.
Abstract: Cross-modal retrieval between audio recordings and symbolic music representations (MIDI) remains challenging because continuous waveforms and discrete event sequences encode different aspects of the same performance. We study descriptor injection, the augmentation of modality-specific encoders with hand-crafted domain features, as a bridge across this gap. In a three-phase campaign covering 13 descriptor-mechanism combinations, 6 architectural families, and 3 training schedules, the best configuration reaches a mean S of 84.0 percent across five independent seeds, improving the descriptor-free baseline by 8.8 percentage points. Causal ablation shows that the audio descriptor A4, based on octave-band energy dynamics, drives the gain in the top dual models, while the MIDI descriptor D4 has only a weak inference-time effect despite improving training dynamics. We also introduce reverse cross-attention, where descriptor tokens query encoder features, reducing attention operations relative to the standard formulation while remaining competitive. CKA analysis shows that descriptors substantially increase audio-MIDI transformer layer alignment, indicating representational convergence rather than simple feature concatenation. Perturbation analysis identifies high-frequency octave bands as the dominant discriminative signal. All experiments use MAESTRO v3.0.0 with an evaluation protocol controlling for composer and piece similarity.
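Reverse cross-attention as described (descriptor tokens querying encoder features) maps onto a standard attention call with swapped roles, so the attention cost scales with the handful of descriptor tokens rather than the sequence length. Shapes and the descriptor-to-token projection below are assumptions.

import torch
import torch.nn as nn

class ReverseCrossAttention(nn.Module):
    def __init__(self, d: int, n_desc: int, d_desc: int, n_heads: int = 4):
        super().__init__()
        self.to_tokens = nn.Linear(d_desc, n_desc * d)  # lift raw descriptor to tokens
        self.n_desc, self.d = n_desc, d
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, feats: torch.Tensor, desc: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, d) encoder features; desc: (B, d_desc) hand-crafted vector.
        q = self.to_tokens(desc).view(-1, self.n_desc, self.d)
        out, _ = self.attn(q, feats, feats)  # queries are the descriptor tokens
        return out.mean(dim=1)               # pooled descriptor-conditioned embedding

feats = torch.randn(2, 500, 256)  # e.g. audio transformer outputs
desc = torch.randn(2, 12)         # e.g. an octave-band energy-dynamics vector
print(ReverseCrossAttention(256, n_desc=4, d_desc=12)(feats, desc).shape)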
[1048] Sign-to-Speech Prosody Transfer via Sign Reconstruction-based GAN
Toranosuke Manabe, Yuto Shibata, Shinnosuke Takamichi, Yoshimitsu Aoki
Main category: cs.SD
TL;DR: Sign-to-speech prosody transfer framework that captures sign language prosody and directly integrates it into synthesized speech without text bottleneck, using adversarial learning with unpaired data.
Details
Motivation: Current sign language communication pipelines lose rich non-verbal information by treating text as an intermediate bottleneck. The paper aims to preserve sign language prosody (emotional content, nuances) directly in synthesized speech for more natural communication.
Method: Proposes SignRecGAN: a scalable training framework using adversarial learning and reconstruction losses with unimodal datasets (no cross-modal annotations needed). Also introduces S2PFormer architecture that preserves TTS model expressiveness while enabling injection of sign-derived prosody into speech synthesis.
Result: Extensive experiments show the method can synthesize speech that faithfully reflects the emotional content of sign language, enabling more natural sign language communication without text bottleneck.
Conclusion: The proposed Sign-to-Speech Prosody Transfer task and framework successfully capture and transfer sign language prosody to speech, opening new possibilities for natural multimodal communication between signers and non-signers.
Abstract: Deep learning models have improved sign language-to-text translation and made it easier for non-signers to understand signed messages. When the goal is spoken communication, a naive approach is to convert signed messages into text and then synthesize speech via Text-to-Speech (TTS). However, this two-stage pipeline inevitably treats text as a bottleneck representation, causing the loss of rich non-verbal information originally conveyed in the signing. To address this limitation, we propose a novel task, Sign-to-Speech Prosody Transfer, which aims to capture the global prosodic nuances expressed in sign language and directly integrate them into synthesized speech. A major challenge is that aligning sign and speech requires expert knowledge, making annotation extremely costly and preventing the construction of large parallel corpora. To overcome this, we introduce SignRecGAN, a scalable training framework that leverages unimodal datasets without cross-modal annotations through adversarial learning and reconstruction losses. Furthermore, we propose S2PFormer, a new model architecture that preserves the expressive power of existing TTS models while enabling the injection of sign-derived prosody into the synthesized speech. Extensive experiments demonstrate that the proposed method can synthesize speech that faithfully reflects the emotional content of sign language, thereby opening new possibilities for more natural sign language communication. Our code will be available upon acceptance.
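How adversarial plus reconstruction losses can couple unpaired unimodal datasets is sketchable in miniature: a critic trained on real speech prosody judges prosody generated from sign features, while a reconstruction head cycles back to the sign input. All module sizes, features, and loss weights here are invented for illustration, not SignRecGAN's actual design.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))  # sign -> prosody
R = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 64))  # prosody -> sign
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))     # real-prosody critic
opt_g = torch.optim.Adam(list(G.parameters()) + list(R.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

sign = torch.randn(16, 64)     # features from a sign-only dataset
prosody = torch.randn(16, 32)  # prosody from a speech-only dataset (unpaired)

# Critic step: real speech prosody vs. prosody generated from sign.
fake = G(sign).detach()
d_loss = bce(D(prosody), torch.ones(16, 1)) + bce(D(fake), torch.zeros(16, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool the critic and reconstruct the sign input.
fake = G(sign)
g_loss = bce(D(fake), torch.ones(16, 1)) + (R(fake) - sign).pow(2).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()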
[1049] Whisper-AuT: Domain-Adapted Audio Encoder for Efficient Audio-LLM Training
Jielin Qiu, Ming Zhu, Wenting Zhao, Zhiwei Liu, Liangwei Yang, Zixiang Chen, Roshan Ram, Akshara Prabhakar, Juntao Tan, Rithesh Murthy, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang
Main category: cs.SD
TL;DR: Whisper-AuT is a domain-adapted audio encoder created by fine-tuning Whisper-large-v3 on a mixture of speech, environmental sound, and music data to improve representations for non-speech audio domains.
Details
Motivation: Whisper was trained only on speech data, producing weak representations for music and environmental sounds, forcing downstream audio-LLMs to compensate through extensive training on large-scale non-speech data.Method: Fine-tuned Whisper-large-v3 on a curated mixture of 80% speech, 10% environmental sound, and 10% music (total ~20M samples) using end-to-end seq2seq captioning objective, then retained only the encoder.
Result: Achieved +23.0% improvement on ESC-50 (environmental sound), +5.0% on GTZAN (music genre), and +0.7% on Speech Commands (keyword spotting) compared to original Whisper encoder.
Conclusion: Whisper-AuT serves as a drop-in replacement for Whisper in audio-LLM architectures, reducing downstream training costs by providing stronger initial audio representations for non-speech domains.
Abstract: Audio-native large language models (audio-LLMs) commonly use Whisper as their audio encoder. However, Whisper was trained exclusively on speech data, producing weak representations for music and environmental sound. This forces downstream audio-LLMs to compensate through extensive training on large-scale non-speech data. We present Whisper-AuT, a domain-adapted audio encoder obtained by fine-tuning Whisper-large-v3 on a curated mixture of speech (80%), environmental sound (10%), and music (10%) totaling approximately 20M samples. The full encoder-decoder is trained end-to-end with a seq2seq captioning objective; the decoder is then discarded and only the encoder is retained. Linear probe evaluations show that Whisper-AuT achieves +23.0% on ESC-50 (environmental sound), +5.0% on GTZAN (music genre), and +0.7% on Speech Commands (keyword spotting) compared to the original Whisper-large-v3 encoder. Whisper-AuT is designed as a drop-in replacement for Whisper in audio-LLM architectures, with the goal of reducing downstream training cost by providing stronger initial audio representations for non-speech domains.
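The "discard the decoder, keep the encoder" step maps to a short recipe. The sketch below uses the public Hugging Face Whisper classes and the vanilla openai/whisper-large-v3 checkpoint (the paper's fine-tuned weights are not assumed available), with a mean-pooled linear probe in the spirit of the evaluation; all probe details are illustrative.
```python
# Hedged sketch of encoder retention + linear probing. Not the paper's
# released code; the checkpoint here is the base model, not Whisper-AuT.
import torch
import torch.nn as nn
from transformers import WhisperModel

model = WhisperModel.from_pretrained("openai/whisper-large-v3")
encoder = model.get_encoder()          # keep the encoder, discard the decoder
encoder.eval()

probe = nn.Linear(encoder.config.d_model, 50)  # e.g. 50 ESC-50 classes

with torch.no_grad():
    # input_features: (batch, n_mels, frames) log-mel spectrogram, as produced
    # by WhisperFeatureExtractor (128 mel bins for large-v3)
    feats = torch.randn(1, 128, 3000)
    hidden = encoder(input_features=feats).last_hidden_state  # (1, 1500, d_model)

logits = probe(hidden.mean(dim=1))     # mean-pool over time, then linear probe
print(logits.shape)                    # torch.Size([1, 50])
```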
[1050] Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection
Junchuan Zhao, Minh Duc Vu, Ye Wang
Main category: cs.SD
TL;DR: MSpoof-TTS: Training-free inference framework using multi-resolution spoof guidance to improve zero-shot speech synthesis quality by detecting and correcting token-level artifacts in neural codec language models.
Details
Motivation: Neural codec language models for speech synthesis suffer from token-level artifacts and distributional drift during inference, degrading perceptual realism. Existing solutions require preference optimization or retraining, which are computationally expensive.Method: Proposes MSpoof-TTS with Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at different temporal granularities to detect inconsistent patterns. Uses hierarchical decoding with spoof detectors to prune low-quality candidates and re-rank hypotheses without modifying model parameters.
Result: Experiments validate the framework’s effectiveness for robust and high-quality codec-based speech generation, improving zero-shot synthesis quality through training-free inference.
Conclusion: MSpoof-TTS provides an effective training-free approach to enhance speech synthesis quality by leveraging multi-resolution spoof guidance to address token-level artifacts in neural codec language models.
Abstract: Neural codec language models enable high-quality discrete speech synthesis, yet their inference remains vulnerable to token-level artifacts and distributional drift that degrade perceptual realism. Rather than relying on preference optimization or retraining, we propose MSpoof-TTS, a training-free inference framework that improves zero-shot synthesis through multi-resolution spoof guidance. We introduce a Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at different temporal granularities to detect locally inconsistent or unnatural patterns. We then integrate the spoof detectors into a hierarchical decoding strategy, progressively pruning low-quality candidates and re-ranking hypotheses. This discriminator-guided generation enhances robustness without modifying model parameters. Experiments validate the effectiveness of our framework for robust and high-quality codec-based speech generation. Audio samples are available at https://danny-nus.github.io/MSpoofTTS.github.io/.
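The hierarchical, discriminator-guided decoding loop can be outlined in a few lines: sample candidate codec-token sequences, score them with spoof detectors at several temporal resolutions, prune the worst, and re-rank the survivors. Both the detector and the pruning schedule below are placeholder assumptions, not MSpoof-TTS internals.
```python
# Minimal sketch of spoof-guided candidate pruning and re-ranking.
import torch

def spoof_score(tokens: torch.Tensor, window: int) -> torch.Tensor:
    # Placeholder detector: derives a per-candidate score from windows of
    # `window` tokens. A real system would use trained spoof classifiers.
    chunks = tokens.unfold(dimension=1, size=window, step=window)
    return chunks.float().std(dim=-1).mean(dim=-1).mean(dim=-1)

def guided_decode(candidates: torch.Tensor, windows=(8, 32, 128), keep: int = 4):
    # candidates: (N, T, n_codebooks) discrete codec tokens from the TTS model
    for w in windows:                         # coarse-to-fine hierarchical pruning
        scores = spoof_score(candidates, w)
        keep_n = max(keep, candidates.size(0) // 2)
        candidates = candidates[scores.topk(keep_n).indices]
    final = spoof_score(candidates, windows[-1])
    return candidates[final.argmax()]         # re-rank, return best hypothesis

cands = torch.randint(0, 1024, (16, 256, 4))
print(guided_decode(cands).shape)             # torch.Size([256, 4])
```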
[1051] Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music
Shivam Chauhan, Ajay Pundhir
Main category: cs.SD
TL;DR: The paper reveals cultural biases in standard mel-scale audio features and shows alternative representations (LEAF, CQT, ERB) significantly reduce performance disparities across languages and music genres.
Details
Motivation: Modern audio systems use mel-scale representations derived from 1940s Western psychoacoustic studies, potentially encoding cultural biases that create systematic performance disparities across different languages and music traditions.Method: Comprehensive evaluation comparing mel-scale features with learnable alternatives (LEAF, SincNet) and psychoacoustic variants (ERB, Bark, CQT) across speech recognition (11 languages), music analysis (6 collections), and European acoustic scene classification (10 cities). Controlled experiments isolate front-end contributions while keeping architecture and training constant.
Result: Mel-scale features yield 31.2% WER for tonal languages vs 18.7% for non-tonal languages (12.5% gap), and show 15.7% F1 degradation between Western and non-Western music. Alternative representations significantly reduce disparities: LEAF reduces speech gap by 34%, CQT achieves 52% reduction in music performance gaps, and ERB-scale filtering cuts disparities by 31% with only 1% computational overhead.
Conclusion: Foundational signal processing choices propagate cultural bias, and adaptive frequency decomposition offers practical paths toward equitable audio processing. The paper releases FairAudioBench for cross-cultural evaluation.
Abstract: Modern audio systems universally employ mel-scale representations derived from 1940s Western psychoacoustic studies, potentially encoding cultural biases that create systematic performance disparities. We present a comprehensive evaluation of cross-cultural bias in audio front-ends, comparing mel-scale features with learnable alternatives (LEAF, SincNet) and psychoacoustic variants (ERB, Bark, CQT) across speech recognition (11 languages), music analysis (6 collections), and acoustic scene classification (10 European cities). Our controlled experiments isolate front-end contributions while holding architecture and training protocols constant. Results demonstrate that mel-scale features yield 31.2% WER for tonal languages compared to 18.7% for non-tonal languages (12.5% gap), and show 15.7% F1 degradation between Western and non-Western music. Alternative representations significantly reduce these disparities: LEAF reduces the speech gap by 34% through adaptive frequency allocation, CQT achieves 52% reduction in music performance gaps, and ERB-scale filtering cuts disparities by 31% with only 1% computational overhead. We also release FairAudioBench, enabling cross-cultural evaluation, and demonstrate that adaptive frequency decomposition offers practical paths toward equitable audio processing. These findings reveal how foundational signal processing choices propagate bias, providing crucial guidance for developing inclusive audio systems.
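The controlled front-end swap is easy to reproduce in outline. The sketch below contrasts mel and CQT features via librosa (ERB, Bark, or learnable LEAF/SincNet front-ends would slot into the same interface); the example file and parameter choices are illustrative only.
```python
# Swap the front-end, hold everything else fixed: any downstream performance
# gap is then attributable to the front-end alone.
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex("trumpet"))  # any mono waveform works here

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)          # mel front-end
cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))   # CQT front-end

# The same downstream model consumes either (freq_bins, frames) representation.
print(mel.shape, cqt.shape)
```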
[1052] DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio
Wataru Nakata, Yuki Saito, Kazuki Yamauchi, Emiru Tsunoo, Hiroshi Saruwatari
Main category: cs.SD
TL;DR: DialogueSidon: A model for joint restoration and separation of degraded monaural two-speaker dialogue audio using VAE compression of SSL features and diffusion-based latent prediction.
Details
Motivation: Full-duplex dialogue audio with separate speaker tracks is valuable for spoken dialogue research but difficult to collect at scale. Most real-world two-speaker dialogue exists only as degraded monaural mixtures, making it unsuitable for systems requiring clean speaker-wise signals.Method: Combines a variational autoencoder (VAE) operating on speech self-supervised learning (SSL) model features to compress them into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from degraded mixtures.
Result: Experiments on English, multilingual, and in-the-wild dialogue datasets show DialogueSidon substantially improves intelligibility and separation quality over baselines while achieving much faster inference.
Conclusion: DialogueSidon effectively addresses the challenge of obtaining clean speaker-wise signals from degraded monaural dialogue mixtures, enabling better utilization of in-the-wild dialogue data for research.
Abstract: Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuitable for systems requiring clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded monaural two-speaker dialogue audio. DialogueSidon combines a variational autoencoder (VAE) that operates on speech self-supervised learning (SSL) model features, compressing them into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also achieving much faster inference.
[1053] VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
Qian Zhang, Yuqin Cao, Yixuan Gao, Xiongkuo Min
Main category: cs.SD
TL;DR: VidAudio-Bench is a comprehensive multi-task benchmark for evaluating Video-to-Audio generation models across four audio categories with task-specific metrics validated by human studies.
Details
Motivation: Existing V2A evaluation benchmarks treat diverse audio types uniformly, lacking fine-grained assessment for different audio categories like sound effects, music, speech, and singing. There's a need for systematic evaluation that considers the distinct requirements of each audio type.Method: Proposes VidAudio-Bench with: 1) Broad coverage of four audio categories under V2A and Video-Text-to-Audio settings, 2) 1,634 video-text pairs, 3) 13 task-specific reference-free metrics for audio quality, video-audio consistency, and text-audio consistency, 4) Human validation of metrics through subjective studies.
Result: Current V2A models perform poorly on speech and singing compared to sound effects. VT2A results reveal tension between instruction following and visual grounding - stronger visual conditioning improves video-audio alignment but often fails to generate intended audio categories.
Conclusion: VidAudio-Bench provides a comprehensive, scalable framework for diagnosing V2A systems and offers new insights into multimodal audio generation challenges, particularly the trade-off between visual grounding and audio category fidelity.
Abstract: Video-to-Audio (V2A) generation is essential for immersive multimedia experiences, yet its evaluation remains underexplored. Existing benchmarks typically assess diverse audio types under a unified protocol, overlooking the fine-grained requirements of distinct audio categories. To address this gap, we propose VidAudio-Bench, a multi-task benchmark for V2A evaluation with four key features: (1) Broad Coverage: It encompasses four representative audio categories - sound effects, music, speech, and singing - under both V2A and Video-Text-to-Audio (VT2A) settings. (2) Extensive Evaluation: It comprises 1,634 video-text pairs and benchmarks 11 state-of-the-art generation models. (3) Comprehensive Metrics: It introduces 13 task-specific, reference-free metrics to systematically assess audio quality, video-audio consistency, and text-audio consistency. (4) Human Alignment: It validates all metrics through subjective studies, demonstrating strong consistency with human preferences. Experimental results reveal that current V2A models perform poorly in speech and singing compared to sound effects. Our VT2A results further highlight a fundamental tension between instruction following and visually grounded generation: stronger visual conditioning improves video-audio alignment, but often at the cost of generating the intended audio category. These findings establish VidAudio-Bench as a comprehensive and scalable framework for diagnosing V2A systems and provide new insights into multimodal audio generation.
[1054] BMdataset: A Musicologically Curated LilyPond Dataset
Matteo Spanio, Ilay Guler, Antonio Rodà
Main category: cs.SD
TL;DR: LilyBERT: A CodeBERT-based encoder adapted for symbolic music using LilyPond format, trained on a small curated Baroque music dataset that outperforms models trained on much larger noisy corpora for music understanding tasks.
Details
Motivation: Symbolic music research has relied almost exclusively on MIDI-based datasets, leaving text-based engraving formats like LilyPond unexplored. The authors aim to establish a baseline for representation learning on LilyPond format and demonstrate that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding.Method: Created BMdataset - a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed from Baroque manuscripts. Developed LilyBERT by adapting CodeBERT through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Evaluated through linear probing on out-of-domain Mutopia corpus for composer and style classification.
Result: Fine-tuning on BMdataset alone (90M tokens) outperforms continuous pre-training on full PDMX corpus (15B tokens) for both composer and style classification. Combining broad pre-training with domain-specific fine-tuning yields best results (84.3% composer accuracy). Demonstrates small curated datasets can be more effective than large noisy corpora.
Conclusion: Text-based engraving formats like LilyPond are viable for symbolic music understanding. Small, expertly curated datasets can outperform large, noisy corpora for domain-specific music understanding tasks. The combination of broad pre-training with domain-specific fine-tuning provides optimal results. Released dataset, tokenizer, and model to establish baseline for LilyPond representation learning.
Abstract: Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.
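The vocabulary-extension step follows a standard Hugging Face pattern; the sketch below is a hedged outline, with a made-up five-token sample standing in for the paper's 115 LilyPond-specific tokens.
```python
# Add LilyPond-specific tokens to a CodeBERT tokenizer and resize the
# embedding matrix before masked-language-model pre-training. The token
# list is illustrative, not the paper's actual vocabulary.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")

lilypond_tokens = ["\\relative", "\\clef", "\\time", "\\key", "\\score"]  # sample only
num_added = tokenizer.add_tokens(lilypond_tokens)
model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized

print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```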
[1055] MeloTune: On-Device Arousal Learning and Peer-to-Peer Mood Coupling for Proactive Music Curation
Hongwei Xu
Main category: cs.SD
TL;DR: MeloTune is an on-device music agent using continuous-time networks for affect-aware music curation with peer-to-peer mood coupling, deployed on iPhone via CoreML.
Details
Motivation: To create a production system for personalized, affect-aware music curation that operates entirely on-device while enabling peer-to-peer mood coupling between listeners, addressing the need for privacy-preserving, real-time affective computing in mobile applications.Method: Uses Mesh Memory Protocol (MMP) and Symbolic-Vector Attention Fusion (SVAF) with two closed-form continuous-time (CfC) networks: private listener-level CfC for short-horizon affective trajectory prediction, and shared mesh-runtime CfC at MMP Layer 6 integrating Cognitive Memory Blocks from peers. Personal Arousal Function (PAF) replaces the linear audio-intensity-to-arousal mapping with per-listener learned adjustments trained from behavioral signals.
Result: Model achieves trajectory MAE 0.414, pattern accuracy 96.6%, and intent accuracy 69.4% on validation. PAF learning loop operates end-to-end with pop reaching full confidence after 22 observations. All inference runs on-device via CoreML with 94,552 parameters.
Conclusion: Successfully deployed first production implementation of MMP/SVAF on consumer mobile hardware, demonstrating practical on-device affective computing for music curation with privacy-preserving peer-to-peer mood coupling.
Abstract: MeloTune is an iPhone-deployed music agent that instantiates the Mesh Memory Protocol (MMP) and Symbolic-Vector Attention Fusion (SVAF) as a production system for affect-aware music curation with peer-to-peer mood coupling. Each device runs two closed-form continuous-time (CfC) networks: a private listener-level CfC that predicts a short-horizon affective trajectory on Russell’s circumplex and drives proactive curation, and a shared mesh-runtime CfC at MMP Layer 6 that integrates Cognitive Memory Blocks (CMBs) from co-listening peers. CfC hidden states never cross the wire; only structured CMBs do. A Personal Arousal Function (PAF) replaces the standard linear mapping from audio intensity to psychological arousal with a per-listener learned adjustment, trained from behavioral signals (skip, completion, favorite, volume) and from drift between user-declared mood and machine inference. The same track receives different arousal predictions for different listeners. The model (94,552 parameters) achieves trajectory MAE 0.414, pattern accuracy 96.6%, and intent accuracy 69.4% on held-out validation. PAF evidence from a live deployment session (46 observations across 11 genres) demonstrates that the learning loop operates end-to-end, with pop reaching full confidence after 22 observations. All inference runs on-device via CoreML. To our knowledge, this is the first production deployment of MMP/SVAF on consumer mobile hardware. The accompanying SDK (sym-swift v0.3.78, SYMCore v0.3.7) enforces strict protocol conformance. Music is the case study; the substrate is the contribution.
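The Personal Arousal Function admits a simple reading: a per-listener correction on top of a linear intensity-to-arousal map, nudged by behavioral signals. The sketch below is an illustrative guess at such an update rule; the signal weights, learning rate, and functional form are all assumptions, not the paper's design.
```python
# Hedged sketch of a per-listener arousal correction learned from behavior.
class PAF:
    def __init__(self, lr: float = 0.05):
        self.bias, self.gain, self.lr = 0.0, 1.0, lr

    def predict(self, intensity: float) -> float:
        return self.gain * intensity + self.bias   # listener-adjusted arousal

    def update(self, intensity: float, signal: str):
        # Behavioral feedback (subset of the paper's signals); the error
        # magnitudes here are invented for illustration.
        error = {"skip": -0.5, "completion": 0.1, "favorite": 0.2}[signal]
        self.bias += self.lr * error
        self.gain += self.lr * error * intensity

paf = PAF()
for _ in range(22):          # cf. pop reaching full confidence after 22 observations
    paf.update(0.7, "completion")
print(round(paf.predict(0.7), 3))
```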
[1056] LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation
Qi Wang, Zhexu Shen, Meng Chen, Guoxin Yu, Chaoxu Pang, Weifeng Zhao, Wenjiang Zhou
Main category: cs.SD
TL;DR: LaDA-Band is an end-to-end framework for vocal-to-accompaniment generation using Discrete Masked Diffusion to address the accompaniment trilemma of acoustic authenticity, global coherence, and dynamic orchestration.
Details
Motivation: Existing V2A generation approaches compromise between acoustic authenticity, global coherence, and dynamic orchestration. Continuous-latent models lose fine details while discrete autoregressive models suffer from unidirectional generation and error accumulation in long contexts.Method: Formulates V2A as Discrete Masked Diffusion using discrete audio codec tokens with full-sequence bidirectional context modeling. Includes dual-track prefix-conditioning, replaced-token detection objective for weakly anchored regions, and two-stage progressive curriculum for full-song generation.
Result: Extensive experiments show LaDA-Band consistently improves acoustic authenticity, global coherence, and dynamic orchestration over existing baselines, maintaining strong performance even without auxiliary reference audio.
Conclusion: LaDA-Band successfully addresses the V2A accompaniment trilemma through Discrete Masked Diffusion, achieving better balance between acoustic detail preservation, global coherence, and dynamic orchestration than previous methods.
Abstract: Vocal-to-accompaniment (V2A) generation, which aims to transform a raw vocal recording into a fully arranged accompaniment, inherently requires jointly addressing an accompaniment trilemma: preserving acoustic authenticity, maintaining global coherence with the vocal track, and producing dynamic orchestration across a full song. Existing open-source approaches typically make compromises among these goals. Continuous-latent generation models can capture long musical spans but often struggle to preserve fine-grained acoustic detail. In contrast, discrete autoregressive models retain local fidelity but suffer from unidirectional generation and error accumulation in extended contexts. We present LaDA-Band, an end-to-end framework that introduces Discrete Masked Diffusion to the V2A task. Our approach formulates V2A generation as Discrete Masked Diffusion, i.e., a global, non-autoregressive denoising formulation that combines the representational advantages of discrete audio codec tokens with full-sequence bidirectional context modeling. This design improves long-range structural consistency and temporal synchronization while preserving crisp acoustic details. Built on this formulation, LaDA-Band further introduces a dual-track prefix-conditioning architecture, an auxiliary replaced-token detection objective for weakly anchored accompaniment regions, and a two-stage progressive curriculum to scale Discrete Masked Diffusion to full-song vocal-to-accompaniment generation. Extensive experiments on both academic and real-world benchmarks show that LaDA-Band consistently improves acoustic authenticity, global coherence, and dynamic orchestration over existing baselines, while maintaining strong performance even without auxiliary reference audio. Codes and audio samples are available at https://github.com/Duoluoluos/TME-LaDA-Band .
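The core decoding loop of discrete masked diffusion (MaskGIT-style iterative unmasking with bidirectional context) can be sketched as follows. The denoiser is a placeholder, and none of LaDA-Band's conditioning (vocal prefix, dual tracks, replaced-token detection) is reproduced.
```python
# Minimal sketch of non-autoregressive masked-diffusion decoding over discrete
# tokens: start fully masked, predict all positions with bidirectional context,
# keep the most confident predictions, and re-mask the rest.
import torch

def masked_diffusion_decode(denoiser, T=256, vocab=1024, steps=8, mask_id=1024):
    tokens = torch.full((1, T), mask_id)                 # start from all-mask
    for s in range(steps):
        logits = denoiser(tokens)                        # (1, T, vocab)
        probs, preds = logits.softmax(-1).max(-1)
        still_masked = tokens == mask_id
        tokens = torch.where(still_masked, preds, tokens)
        n_remask = int((1.0 - (s + 1) / steps) * T)      # shrinking mask schedule
        if n_remask > 0:
            # protect already-committed positions from re-masking
            conf = torch.where(still_masked, probs, torch.full_like(probs, 2.0))
            idx = conf.topk(n_remask, largest=False).indices
            tokens[0, idx[0]] = mask_id
    return tokens

denoiser = lambda t: torch.randn(1, t.size(1), 1024)     # placeholder network
out = masked_diffusion_decode(denoiser)
print(out.shape, (out == 1024).sum().item())             # (1, 256), no masks remain
```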
[1057] ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing
Xi Chen, Wei Xue, Yike Guo
Main category: cs.SD
TL;DR: ActorMind introduces speech role-playing with a multi-agent reasoning framework and benchmark for models to deliver spontaneous, emotionally-infused spoken responses based on roles, scenes, and dialogue context.
Details
Motivation: Current role-playing work is limited to textual modalities, neglecting speech which is predominant in daily life, thus limiting genuine role-playing capabilities. There's a need to bridge this gap by incorporating speech into role-playing systems.Method: Proposes ActorMind, a multi-agent chain-of-thought reasoning framework with four agents: Eye Agent (reads role description), Ear Agent (comprehends emotional cues in spoken dialogues), Brain Agent (generates descriptive emotional state), and Mouth Agent (delivers emotion-infused scripts). Also introduces ActorMindBench, a hierarchical benchmark with utterance-level, scene-level, and role-level content.
Result: Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing capabilities, showing improved ability to deliver spontaneous responses with personalized verbal traits based on roles, scenes, and spoken dialogue.
Conclusion: Speech role-playing bridges the gap between textual role-playing and real-world interaction by incorporating speech modality, enabling more genuine human-machine interaction and facilitating sociological research through the ActorMind framework and benchmark.
Abstract: Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiting genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through ActorMindBench, and we present a corresponding reasoning framework, called ActorMind. Specifically, (1) Speech Role-Playing enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and spoken dialogue. (2) ActorMindBench is a hierarchical benchmark comprising Utterance-Level content with 7,653 utterances, Scene-Level content with 313 scenes, and Role-Level content with 6 roles. (3) ActorMind is an off-the-shelf, multi-agent, chain-of-thought-style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via Eye Agent, then comprehends emotional cues within contextual spoken dialogues through Ear Agent. Subsequently, Brain Agent generates a descriptive emotional state, and finally, Mouth Agent delivers the scripts infused with the corresponding emotional state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.
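The Eye/Ear/Brain/Mouth chain is essentially function composition over a language model. A minimal text-only sketch follows; the llm stub and all prompts are invented for illustration, and the actual agents consume speech, not text.
```python
# Hedged sketch of the four-agent chain as plain function composition.
def llm(prompt: str) -> str:
    return f"[model response to: {prompt[:40]}...]"  # stub; plug in any backend

def eye_agent(role_description: str) -> str:
    return llm(f"Summarize the persona you must play:\n{role_description}")

def ear_agent(dialogue_context: str) -> str:
    return llm(f"Describe the emotional cues in this spoken dialogue:\n{dialogue_context}")

def brain_agent(persona: str, cues: str) -> str:
    return llm(f"As {persona}, given cues {cues}, describe your current emotional state.")

def mouth_agent(persona: str, state: str, scene: str) -> str:
    return llm(f"As {persona} feeling {state} in scene {scene}, write your next line.")

def actor_mind(role: str, scene: str, dialogue: str) -> str:
    persona = eye_agent(role)                  # read the assigned role
    cues = ear_agent(dialogue)                 # comprehend emotional cues
    state = brain_agent(persona, cues)         # form a descriptive emotional state
    return mouth_agent(persona, state, scene)  # deliver the emotion-infused line

print(actor_mind("a weary detective", "rainy rooftop", "A: You're too late..."))
```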
[1058] Ti-Audio: The First Multi-Dialectal End-to-End Speech LLM for Tibetan
Jialing Wang, Yue Zhao, Yuhao Zhang, Jing Yu, Shaosai Li, Zhanchen Dai, Benyou Wang, Haizhou Li
Main category: cs.SD
TL;DR: Ti-Audio: First multi-dialectal end-to-end Speech-LLM for Tibetan using Dynamic Q-Former Adapter and cross-dialectal cooperation to address low-resource challenges.
Details
Motivation: Speech-LLMs face challenges in low-resource and dialect-diverse environments. Tibetan exemplifies this with severe data scarcity and phonetic differences among its three major dialects (Ü-Tsang, Amdo, Kham).Method: 1) Dynamic Q-Former Adapter for efficient speech-text alignment by extracting essential acoustic features from variable-length speech. 2) Leverage mutual assistance among related dialects to alleviate data scarcity. 3) Temperature-based sampling strategy to maximize cross-dialectal synergy.
Result: Ti-Audio achieves state-of-the-art performance on Tibetan benchmarks for automatic speech recognition and speech translation, validating cross-dialectal cooperation effectiveness.
Conclusion: The work provides a scalable paradigm for Speech-LLM development in low-resource scenarios and demonstrates the effectiveness of cross-dialectal cooperation.
Abstract: Recent advances in Speech Large Language Models (Speech-LLMs) have greatly enhanced multimodal interaction capabilities. However, their application in low-resource and dialect-diverse environments still faces challenges. The severe scarcity of Tibetan data, coupled with the phonetic differences among its major dialects (Ü-Tsang, Amdo, and Kham), is a prime example of this challenge. This paper proposes Ti-Audio, the first multi-dialectal end-to-end Speech-LLM for Tibetan. To efficiently align speech and text, we introduce a Dynamic Q-Former Adapter that extracts essential acoustic features from variable-length speech, ensuring stable cross-modal alignment even with limited data. At the data level, we leverage mutual assistance among related dialects to alleviate data scarcity and employ a temperature-based sampling strategy to maximize this synergy. Experimental results demonstrate that Ti-Audio achieves state-of-the-art performance on Tibetan benchmarks for automatic speech recognition and speech translation. Our work validates the effectiveness of cross-dialectal cooperation and provides a scalable paradigm for the development of Speech-LLMs in low-resource scenarios.
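Temperature-based sampling over imbalanced corpora is a standard recipe: sample dialect i with probability proportional to (n_i / sum_j n_j)^(1/T), where T > 1 flattens the distribution and upweights low-resource dialects. A sketch, with hypothetical corpus sizes and temperature (the paper does not state its exact values here):
```python
# Temperature-based sampling across dialect corpora.
import numpy as np

def dialect_sampling_probs(sizes: dict[str, int], T: float = 5.0) -> dict[str, float]:
    names, n = list(sizes), np.array(list(sizes.values()), dtype=float)
    p = (n / n.sum()) ** (1.0 / T)   # T=1 is proportional; larger T flattens
    p /= p.sum()
    return dict(zip(names, p))

sizes = {"u_tsang": 100_000, "amdo": 30_000, "kham": 8_000}  # hypothetical counts
print(dialect_sampling_probs(sizes, T=1.0))  # proportional: Kham rarely sampled
print(dialect_sampling_probs(sizes, T=5.0))  # flattened: Kham sampled far more often
```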
[1059] MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora
Tao Feng, Yuxiang Wang, Yuancheng Wang, Xueyao Zhang, Dekun Chen, Chaoren Wang, Xun Guan, Zhizheng Wu
Main category: cs.SD
TL;DR: MimicLM: A voice imitation system that uses synthetic speech as training sources with real recordings as targets, incorporating interleaved text-audio modeling and preference alignment to achieve superior voice imitation quality.
Details
Motivation: Voice imitation requires training data where source and target share content but target matches reference voice characteristics, but such triplets are extremely scarce. Existing approaches use complex disentanglement architectures or synthetic pseudo-parallel data, which face quality limitations.Method: Uses synthetic speech as training sources while keeping real recordings as targets, enabling learning from real speech distributions. Incorporates interleaved text-audio modeling for content accuracy and applies post-training with preference alignment to address synthetic data distribution mismatch.
Result: Achieves superior voice imitation quality with simple architecture, significantly outperforming existing methods in naturalness while maintaining competitive similarity scores across speaker identity, accent, and emotion dimensions.
Conclusion: MimicLM provides an effective solution to voice imitation data scarcity by using synthetic sources with real targets, breaking the synthetic quality ceiling and achieving state-of-the-art performance.
Abstract: Voice imitation aims to transform source speech to match a reference speaker’s timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where source and target share the same content but target matches the reference’s voice characteristics, yet such data is extremely scarce. Existing approaches either employ carefully designed disentanglement architectures to bypass this data scarcity or leverage external systems to synthesize pseudo-parallel training data. However, the former requires intricate model design, and the latter faces a quality ceiling when synthetic speech is used as training targets. To address these limitations, we propose MimicLM, which takes a novel approach by using synthetic speech as training sources while retaining real recordings as targets. This design enables the model to learn directly from real speech distributions, breaking the synthetic quality ceiling. Building on this data construction approach, we incorporate interleaved text-audio modeling to guide the generation of content-accurate speech and apply post-training with preference alignment to mitigate the inherent distributional mismatch when training on synthetic data. Experiments demonstrate that MimicLM achieves superior voice imitation quality with a simple yet effective architecture, significantly outperforming existing methods in naturalness while maintaining competitive similarity scores across speaker identity, accent, and emotion dimensions.
[1060] HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
Jian Zhu, Jianwei Cui, Shihao Chen, Yubang Zhang, Cheng Luo
Main category: cs.SD
TL;DR: HAFM is a system that generates instrumental music accompaniment for input vocals using a hierarchical autoregressive transformer architecture with dual-rate tokenization.
Details
Motivation: To create a system that can generate coherent instrumental accompaniment for isolated singing vocals, enabling complete music creation from voice inputs alone.Method: Three-stage hierarchical autoregressive architecture with dual-rate tokenization (HuBERT semantic tokens at 50Hz for vocals, EnCodec acoustic tokens at 75Hz for instrumentals), interleaved multi-codebook prediction, classifier-free guidance, and modern transformer design choices.
Result: Achieves Fréchet Audio Distance (FAD) of 2.08 on MUSDB18, outperforming retrieval baselines and matching prior state-of-the-art systems with fewer parameters.
Conclusion: HAFM successfully generates high-quality instrumental accompaniment for vocals through its innovative dual-rate tokenization and hierarchical architecture, advancing audio generation capabilities.
Abstract: We present HAFM, a system that generates instrumental music audio to accompany input vocals. Given isolated singing voice, HAFM produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. We propose three key innovations over prior work: (1) a dual-rate codec tokenization scheme using HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, enabling time-aligned yet rate-independent modeling; (2) a three-stage hierarchical autoregressive architecture (semantic to coarse acoustic to fine acoustic) with interleaved multi-codebook prediction and classifier-free guidance; and (3) modern Transformer design choices including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias for improved training stability and sequence generalization. Experiments on MUSDB18 demonstrate that HAFM achieves a Fréchet Audio Distance (FAD) of 2.08 on isolated vocal inputs, outperforming retrieval baselines and matching prior state-of-the-art systems with fewer parameters. The source code is available at https://github.com/HackerHyper/HAFM.
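The 50 Hz / 75 Hz dual-rate alignment works out to a 2:3 ratio on a shared clock: 2 semantic tokens span the same 40 ms as 3 acoustic frames. The toy interleaving below illustrates that arithmetic only; it is not HAFM's actual token layout.
```python
# Hedged sketch of dual-rate time alignment between a 50 Hz semantic stream
# and a 75 Hz multi-codebook acoustic stream.
def interleave_dual_rate(semantic: list[int], acoustic: list[list[int]]):
    # semantic: token list at 50 Hz; acoustic: frames at 75 Hz, each frame a
    # list of codebook entries. Emit blocks of 2 semantic + 3 acoustic frames,
    # which cover the same 40 ms of audio.
    out, s, a = [], 0, 0
    while s < len(semantic) and a < len(acoustic):
        out.extend([("sem", t) for t in semantic[s:s + 2]])
        out.extend([("acou", f) for f in acoustic[a:a + 3]])
        s, a = s + 2, a + 3
    return out

sem = list(range(4))                    # 4 semantic tokens = 80 ms at 50 Hz
acou = [[i] * 4 for i in range(6)]      # 6 acoustic frames (4 codebooks) = 80 ms at 75 Hz
print(interleave_dual_rate(sem, acou))
```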
[1061] CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance
Junchuan Zhao, Wei Zeng, Tianle Lyu, Ye Wang
Main category: cs.SD
TL;DR: CoMelSinger: A zero-shot singing voice synthesis framework with structured melody control using discrete codec modeling, addressing prosody leakage through contrastive learning and singing voice transcription alignment.
Details
Motivation: Current discrete codec-based speech synthesis enables zero-shot generation via in-context learning, but extending to singing voice synthesis is challenging due to precise melody control requirements. Prompt-based generation suffers from prosody leakage where pitch information gets entangled in timbre prompts, compromising controllability.Method: Built on non-autoregressive MaskGCT architecture, replaces text inputs with lyric and pitch tokens. Uses coarse-to-fine contrastive learning to suppress prosody leakage by regularizing pitch redundancy between acoustic prompt and melody input. Incorporates lightweight encoder-only Singing Voice Transcription module to align acoustic tokens with pitch and duration for frame-level supervision.
Result: Experimental results show notable improvements in pitch accuracy, timbre consistency, and zero-shot transferability over competitive baselines.
Conclusion: CoMelSinger enables structured and disentangled melody control in zero-shot SVS while maintaining in-context generalization, effectively addressing prosody leakage issues.
Abstract: Singing Voice Synthesis (SVS) aims to generate expressive vocal performances from structured musical inputs such as lyrics and pitch sequences. While recent progress in discrete codec-based speech synthesis has enabled zero-shot generation via in-context learning, directly extending these techniques to SVS remains non-trivial due to the requirement for precise melody control. In particular, prompt-based generation often introduces prosody leakage, where pitch information is inadvertently entangled within the timbre prompt, compromising controllability. We present CoMelSinger, a zero-shot SVS framework that enables structured and disentangled melody control within a discrete codec modeling paradigm. Built on the non-autoregressive MaskGCT architecture, CoMelSinger replaces conventional text inputs with lyric and pitch tokens, preserving in-context generalization while enhancing melody conditioning. To suppress prosody leakage, we propose a coarse-to-fine contrastive learning strategy that explicitly regularizes pitch redundancy between the acoustic prompt and melody input. Furthermore, we incorporate a lightweight encoder-only Singing Voice Transcription (SVT) module to align acoustic tokens with pitch and duration, offering fine-grained frame-level supervision. Experimental results demonstrate that CoMelSinger achieves notable improvements in pitch accuracy, timbre consistency, and zero-shot transferability over competitive baselines. Audio samples are available at https://danny-nus.github.io/CoMelSinger/.
[1062] End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering
Jiliang Hu, Zuchao Li, Baoyuan Qi, Liu Guoming, Ping Wang
Main category: cs.SD
TL;DR: CLSR is an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio for spoken question answering, using an intermediate text-like representation to bridge modality gaps.
Details
Motivation: Existing spoken QA methods struggle with long audio, and current speech-related retrievers have poor performance. The paper aims to improve retrieval of relevant audio segments from long recordings for downstream SQA tasks.Method: Proposes CLSR, an end-to-end contrastive language-speech retriever that converts acoustic features into text-like representations before aligning with text queries, bridging the modality gap more effectively than direct speech-text contrastive models.
Result: CLSR outperforms both end-to-end speech retrievers and pipeline approaches (ASR + text retrieval) across four cross-modal retrieval datasets, providing a robust foundation for long-form SQA applications.
Conclusion: CLSR effectively addresses the challenge of retrieving relevant segments from long audio for SQA by introducing intermediate text-like representations, advancing practical long-form spoken question answering.
Abstract: Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle with processing long audio. Following the success of retrieval-augmented generation, speech-related retrievers show promise for preprocessing long-form speech, but the performance of existing retrievers is lacking. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for the downstream SQA task. Unlike conventional speech-text contrastive models, CLSR incorporates an intermediate step that converts acoustic features into text-like representations prior to alignment, thereby more effectively bridging the gap between modalities. Experimental results across four cross-modal retrieval datasets demonstrate that CLSR surpasses both end-to-end speech-related retrievers and pipeline approaches combining speech recognition with text retrieval, providing a robust foundation for advancing practical long-form SQA applications.
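The contrastive alignment step is recognizable as a symmetric InfoNCE objective over in-batch negatives. A minimal sketch, with random embeddings standing in for CLSR's text queries and text-like speech representations (encoder internals are not reproduced):
```python
# Symmetric InfoNCE over matched text/speech embedding pairs.
import torch
import torch.nn.functional as F

def infonce(text_emb: torch.Tensor, speech_emb: torch.Tensor, temp: float = 0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = text_emb @ speech_emb.t() / temp        # (B, B) similarity matrix
    labels = torch.arange(text_emb.size(0))          # matched pairs on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

loss = infonce(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```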
[1063] Woosh: A Sound Effects Foundation Model
Gaëtan Hadjeres, Marc Ferras, Khaled Koutini, Benno Weck, Alexandre Bittar, Thomas Hummel, Zineb Lahrici, Hakim Missoum, Joan Serrà, Yuki Mitsufuji
Main category: cs.SD
TL;DR: Woosh is Sony AI’s open sound effect foundation model with audio encoder/decoder, text-audio alignment, and text/video-to-audio generation capabilities, showing competitive performance against existing open models.
Details
Motivation: The audio research community needs open generative models as foundational tools for building novel approaches and establishing baselines, particularly for sound effects.Method: Developed a comprehensive sound effect foundation model with four key components: (1) high-quality audio encoder/decoder model, (2) text-audio alignment model for conditioning, (3) text-to-audio generative model, and (4) video-to-audio generative model. Also includes distilled versions for low-resource operation and fast inference.
Result: Evaluation on both public and private data shows competitive or better performance for each module compared to existing open alternatives like StableAudio-Open and TangoFlux.
Conclusion: Woosh provides a valuable open foundation model for sound effect generation that supports both text and video conditioning, with competitive performance and practical deployment options through distilled models.
Abstract: The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI’s publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Being optimized for sound effects, we provide (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
cs.LG
[1064] The Diffusion-Attention Connection
Julio Candanedo
Main category: cs.LG
TL;DR: The paper shows that transformers, diffusion-maps, and magnetic Laplacians are different regimes of a single Markov geometry built from pre-softmax query-scores, unified through a QK “bidivergence” framework.
Details
Motivation: To unify seemingly disparate mathematical tools (transformers, diffusion-maps, magnetic Laplacians) by showing they all emerge from the same underlying Markov geometry framework based on pre-softmax query-scores.Method: Define a QK “bidivergence” whose exponentiated and normalized forms yield attention, diffusion-maps, and magnetic diffusion. Use product of experts and Schrödinger-bridges to connect these into equilibrium, nonequilibrium steady-state, and driven dynamics regimes.
Result: Demonstrates that transformers, diffusion-maps, and magnetic Laplacians are different regimes of a single unified mathematical framework, connected through Markov geometry and bidivergence concepts.
Conclusion: Provides a unified theoretical framework connecting attention mechanisms, diffusion processes, and magnetic operators through a common Markov geometry foundation based on pre-softmax query-scores.
Abstract: Transformers, diffusion-maps, and magnetic Laplacians are usually treated as separate tools; we show they are all different regimes of a single Markov geometry built from pre-softmax query-scores. We define a QK “bidivergence” whose exponentiated and normalized forms yield attention, diffusion-maps, and magnetic diffusion, and we use products of experts and Schrödinger bridges to connect and organize these into equilibrium, nonequilibrium steady-state, and driven dynamics.
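The claimed correspondence can be written schematically (notation ours, not the paper's): one pre-softmax score, three exponentiated-and-normalized forms.
```latex
% Hedged sketch of the correspondence: row-normalized exponentiated scores.
\begin{align*}
  \text{attention:} \quad
    & P_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d})}{\sum_{j'} \exp(q_i \cdot k_{j'} / \sqrt{d})} \\
  \text{diffusion map:} \quad
    & P_{ij} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / \varepsilon)}{\sum_{j'} \exp(-\lVert x_i - x_{j'} \rVert^2 / \varepsilon)} \\
  \text{magnetic diffusion:} \quad
    & P_{ij} = \frac{e^{\,i\theta_{ij}} \exp(s_{ij})}{\sum_{j'} \exp(s_{ij'})},
    \qquad \theta_{ij} = -\theta_{ji}
\end{align*}
```
The antisymmetric phase in the third row is what the magnetic operator contributes: it encodes directed, non-equilibrium structure that a symmetric kernel cannot.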
[1065] Fairboard: a quantitative framework for equity assessment of healthcare models
James K. Ruffle, Samia Mohinta, Chris Foulon, Mohamad Zeina, Zicheng Wang, Sebastian Brandner, Harpreet Hyare, Parashkev Nachev
Main category: cs.LG
TL;DR: Evaluation of 18 brain tumor segmentation models reveals patient identity explains more performance variance than model choice, with systematic biases across clinical subgroups and neuroanatomical regions, despite newer models showing improved equity.
Details
Motivation: Despite over 1,000 FDA-authorized AI medical devices, formal equity assessments evaluating whether model performance is uniform across patient subgroups are rare. The paper aims to systematically evaluate fairness in brain tumor segmentation models.Method: Evaluated 18 open-source brain tumor segmentation models across 648 glioma patients from two independent datasets (11,664 model inferences). Used multiple evaluation dimensions: univariate, Bayesian multivariate, spatial, and representational analyses. Also developed Fairboard, an open-source dashboard for equitable model monitoring.
Result: Patient identity consistently explained more performance variance than model choice. Clinical factors (molecular diagnosis, tumor grade, extent of resection) predicted segmentation accuracy more strongly than model architecture. Found neuroanatomically localized biases that were compartment-specific yet consistent across models. Model performance clustered significantly in high-dimensional patient feature space, indicating axes of algorithmic vulnerability. Newer models showed improved equity but none provided formal fairness guarantees.
Conclusion: Systematic biases exist in medical AI models across patient subgroups, with patient characteristics being more predictive of performance than model architecture. There is a need for formal fairness guarantees and better equity monitoring tools like Fairboard.
Abstract: Despite there now being more than 1,000 FDA-authorised AI medical devices, formal equity assessments – whether model performance is uniform across patient subgroups – are rare. Here, we evaluate the equity of 18 open-source brain tumour segmentation models across 648 glioma patients from two independent datasets (n = 11,664 model inferences) along distinct univariate, Bayesian multivariate, spatial, and representational dimensions. We find that patient identity consistently explains more performance variance than model choice, with clinical factors, including molecular diagnosis, tumour grade, and extent of resection, predicting segmentation accuracy more strongly than model architecture. A voxel-wise spatial meta-analysis identifies neuroanatomically localised biases that are compartment-specific yet often consistent across models. Within a high-dimensional latent space of lesion masks and clinic-demographic features, model performance clusters significantly, indicating that the patient feature space contains axes of algorithmic vulnerability. Although newer models tend toward greater equity, none provide a formal fairness guarantee. Lastly, we release Fairboard, an open-source, no-code dashboard that lowers barriers to equitable model monitoring in medical imaging.
[1066] Regularized Entropy Information Adaptation with Temporal-Awareness Networks for Simultaneous Speech Translation
Joseph Liu, Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana
Main category: cs.LG
TL;DR: Improved simultaneous speech translation methods REINA-SAN and REINA-TAN enhance streaming efficiency by addressing temporal context issues in information-based policies, achieving up to 7.1% better normalized streaming efficiency scores.
Details
Motivation: Existing information-based policies for simultaneous speech translation (SimulST) often lack temporal context, causing them to bias toward reading most audio before starting translation, which hurts streaming efficiency.Method: Two strategies: 1) REINA-SAN uses supervised alignment network to improve policy training, 2) REINA-TAN uses timestep-augmented network to provide better temporal context. Both build on REINA framework.
Result: Both methods significantly outperform baseline and resolve stability issues. REINA-TAN provides slightly superior Pareto frontier for streaming efficiency, while REINA-SAN offers more robustness against ‘read loops’. Applied to Whisper, both improve normalized streaming efficiency scores up to 7.1% over competitive baselines.
Conclusion: The proposed improvements to REINA framework effectively address temporal context limitations in simultaneous speech translation policies, enhancing streaming efficiency while maintaining translation quality.
Abstract: Simultaneous Speech Translation (SimulST) requires balancing high translation quality with low latency. Recent work introduced REINA, a method that trains a Read/Write policy based on estimating the information gain of reading more audio. However, we find that information-based policies often lack temporal context, leading the policy to bias itself toward reading most of the audio before starting to write. We improve REINA using two distinct strategies: a supervised alignment network (REINA-SAN) and a timestep-augmented network (REINA-TAN). Our results demonstrate that while both methods significantly outperform the baseline and resolve stability issues, REINA-TAN provides a slightly superior Pareto frontier for streaming efficiency, whereas REINA-SAN offers more robustness against ‘read loops’. Applied to Whisper, both methods improve the Pareto frontier of streaming efficiency as measured by Normalized Streaming Efficiency (NoSE) scores by up to 7.1% over existing competitive baselines.
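A timestep-augmented Read/Write policy reduces to concatenating a normalized position feature with the information-gain estimate, so the policy knows how far into the utterance it is. The sketch below is an illustrative guess at that interface; the feature set, network size, and threshold are assumptions, not REINA-TAN's design.
```python
# Hedged sketch of a timestep-augmented Read/Write decision.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

def read_or_write(info_gain: float, t: int, max_steps: int) -> str:
    feats = torch.tensor([info_gain, t / max_steps])   # temporal context added
    p_write = torch.sigmoid(policy(feats)).item()
    return "WRITE" if p_write > 0.5 else "READ"

print(read_or_write(info_gain=0.1, t=40, max_steps=100))
```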
[1067] Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model
Pankayaraj Pathmanathan, Furong Huang
Main category: cs.LG
TL;DR: Deliberative alignment for LLM safety has limitations; proposed BoN sampling method improves safety by attributing unsafe behaviors to base models in latent space.
Details
Motivation: Current refusal training in LLMs provides shallow safety alignment. Deliberative alignment (distilling reasoning from stronger models) aims for deeper safety but has limitations - alignment gaps exist between teacher/student models, and unsafe behaviors can persist despite learning reasoning patterns.Method: Proposes BoN (Best-of-N) sampling method that attributes unsafe behavior back to base LLMs in latent space, down-ranking unsafe responses. Evaluated across 7 teacher models and 6 student models of different classes and sizes.
Result: Average attack success rate reduction of 28.2% in DAN, 31.3% in WildJailbreak, and 35.4% in StrongREJECT benchmarks. Safety gains persist post RL training, highlighting uncertainty in safety reasoning and explicit attribution to base model.
Conclusion: Deliberative alignment has limitations, but BoN sampling effectively improves model safety by explicitly attributing unsafe behaviors to base models, achieving meaningful safety improvements with minimal utility loss.
Abstract: While the wide adoption of refusal training in large language models (LLMs) has showcased improvements in model safety, recent works have highlighted shortcomings due to the shallow nature of these alignment methods. To this end, the work on Deliberative alignment proposed distilling reasoning capabilities from stronger reasoning models, thereby instilling deeper safety in LLMs. In this work, we study the impact of deliberative alignment in language models. First, we show that despite being larger in model size and stronger in safety capability, there exists an alignment gap between teacher and student language models, which affects both the safety and general utility of the student model. Furthermore, we show that models aligned through deliberative alignment can retain unsafe behaviors from the base model despite learning the reasoning patterns of larger reasoning models. Building upon this observation, we propose a BoN sampling method that attributes the unsafe behavior back to the base LLMs in the latent space, thereby down-ranking unsafe responses to gain a meaningful improvement in model safety across multiple safety benchmarks with minimal loss in utility. In particular, across 7 teacher models and 6 student models of different classes and sizes, we show an average attack success rate (ASR) reduction of 28.2% in DAN, 31.3% in WildJailbreak and 35.4% in StrongREJECT benchmarks. We further show that these safety gains prevail post RL training, thus highlighting the uncertainty in safety reasoning and its explicit attribution to the base model.
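The Best-of-N selection rule can be sketched as utility minus a penalty from a latent-space unsafe score. Everything inside latent_unsafe_score below is a stand-in for the paper's attribution mechanism; the direction vector is assumed precomputed.
```python
# Hedged sketch of BoN re-ranking with an unsafe-behavior penalty.
import torch

def latent_unsafe_score(hidden: torch.Tensor, unsafe_direction: torch.Tensor) -> float:
    # Projection of the response's pooled hidden state onto a direction
    # associated with base-model unsafe behavior (illustrative only).
    return torch.dot(hidden.mean(dim=0), unsafe_direction).item()

def bon_select(candidates, hiddens, unsafe_direction, utility_scores, alpha=1.0):
    # Rank N sampled responses by utility minus a penalty on the unsafe score.
    ranked = sorted(
        zip(candidates, hiddens, utility_scores),
        key=lambda c: c[2] - alpha * latent_unsafe_score(c[1], unsafe_direction),
        reverse=True,
    )
    return ranked[0][0]

cands = [f"response_{i}" for i in range(4)]
hiddens = [torch.randn(16, 64) for _ in cands]   # (tokens, d_model) per response
best = bon_select(cands, hiddens, torch.randn(64), utility_scores=[0.2, 0.9, 0.5, 0.7])
print(best)
```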
[1068] Human-like Working Memory Interference in Large Language Models
Hua-Dong Xiong, Li Ji-An, Jiaqi Huang, Robert C. Wilson, Kwonjoon Lee, Xue-Xin Wei
Main category: cs.LG
TL;DR: LLMs exhibit working memory limitations similar to humans, with performance degrading under memory load and showing recency biases, due to representational interference where multiple memory items are entangled in representations.
Details
Motivation: To understand why large language models show working memory limitations despite having full access to prior context through attention mechanisms, and to investigate whether these limitations mirror human cognitive constraints.Method: Analyzed working memory performance across diverse pretrained LLMs on memory tasks, examined interference patterns, correlated memory capacity with benchmark performance, and conducted targeted interventions to suppress stimulus content information.
Result: LLMs reproduce human-like interference signatures (memory load effects, recency biases), show correlation between working memory capacity and general competence, and use entangled representations requiring interference control rather than direct copying from context.
Conclusion: Representational interference is a core constraint on working memory in pretrained LLMs, suggesting shared computational challenges between biological and artificial systems in selecting task-relevant information under interference.
Abstract: Intelligent systems must maintain and manipulate task-relevant information online to adapt to dynamic environments and changing goals. This capacity, known as working memory, is fundamental to human reasoning and intelligence. Despite having on the order of 100 billion neurons, both biological and artificial systems exhibit limitations in working memory. This raises a key question: why do large language models (LLMs) show such limitations, given that transformers have full access to prior context through attention? We find that although a two-layer transformer can be trained to solve working memory tasks perfectly, a diverse set of pretrained LLMs continues to show working memory limitations. Notably, LLMs reproduce interference signatures observed in humans: performance degrades with increasing memory load and is biased by recency and stimulus statistics. Across models, stronger working memory capacity correlates with broader competence on standard benchmarks, mirroring its link to general intelligence in humans. Yet despite substantial variability in working memory performance, LLMs surprisingly converge on a common computational mechanism. Rather than directly copying the relevant memory item from context, models encode multiple memory items in entangled representations, such that successful recall depends on interference control – actively suppressing task-irrelevant content to isolate the target for readout. Moreover, a targeted intervention that suppresses stimulus content information improves performance, providing causal support for representational interference. Together, these findings identify representational interference as a core constraint on working memory in pretrained LLMs, suggesting that working-memory limits in biological and artificial systems may reflect a shared computational challenge: selecting task-relevant information under interference.
[1069] Belief-State RWKV for Reinforcement Learning under Partial Observability
Liu Xiao
Main category: cs.LG
TL;DR: A novel RL formulation for RWKV-style recurrent models where the fixed-size recurrent state is explicitly treated as a belief state (mean and covariance) rather than an opaque hidden vector, enabling uncertainty-aware control in partially observed environments.
Details
Motivation: Plain fixed-state policies in partially observed settings can store evidence but not necessarily confidence. The paper addresses the weakness of traditional recurrent policies that lack uncertainty awareness by proposing a belief-state interpretation of RWKV-style recurrent models.Method: Instead of conditioning policy and value on a single summary hidden state h_t, the method maintains a compact uncertainty-aware state b_t = (μ_t, Σ_t) derived from RWKV-style recurrent statistics. This belief state representation allows control decisions to depend on both memory and uncertainty estimates.
Result: In pilot RL experiments with hidden episode-level observation noise, belief-state policies nearly match the best recurrent baseline overall while slightly improving return on the hardest in-distribution regime and under held-out noise shifts. The simple belief readout outperformed more structured extensions like gated memory control and privileged belief targets.
Conclusion: Explicit belief-state interpretation of RWKV-style recurrent models shows promise for uncertainty-aware control in partially observed RL settings, though richer benchmarks are needed to fully evaluate the approach against more structured alternatives.
Abstract: We propose a stronger formulation of RL on top of RWKV-style recurrent sequence models, in which the fixed-size recurrent state is explicitly interpreted as a belief state rather than an opaque hidden vector. Instead of conditioning policy and value on a single summary h_t, we maintain a compact uncertainty-aware state b_t = (μ_t, Σ_t) derived from RWKV-style recurrent statistics and let control depend on both memory and uncertainty. This design targets a key weakness of plain fixed-state policies in partially observed settings: they may store evidence, but not necessarily confidence. We present the method, a theoretical program, and a pilot RL experiment with hidden episode-level observation noise together with a test-time noise sweep. The pilot shows that belief-state policies nearly match the best recurrent baseline overall while slightly improving return on the hardest in-distribution regime and under a held-out noise shift. Additional ablations show that this simple belief readout is currently stronger than two more structured extensions, namely gated memory control and privileged belief targets, underscoring the need for richer benchmarks.
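To make the belief-state readout concrete, here is a minimal numpy sketch of the core idea: a recurrent cell tracks running first and second moments of its hidden state, and the policy conditions on b_t = (μ_t, Σ_t) rather than on h_t alone. The toy cell, the diagonal covariance, and all names (BeliefStateCell, policy_logits) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a belief-state readout over a recurrent policy,
# assuming a diagonal covariance and a generic RWKV-like toy cell.
# Names and dimensions are illustrative, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

class BeliefStateCell:
    """Toy recurrent cell that tracks running mean/variance of its state."""
    def __init__(self, obs_dim, hidden_dim, decay=0.9):
        self.W_in = rng.normal(0, 0.3, (hidden_dim, obs_dim))
        self.W_rec = rng.normal(0, 0.3, (hidden_dim, hidden_dim))
        self.decay = decay
        self.h = np.zeros(hidden_dim)
        self.mu = np.zeros(hidden_dim)    # belief mean
        self.var = np.ones(hidden_dim)    # diagonal belief covariance

    def step(self, obs):
        self.h = np.tanh(self.W_in @ obs + self.W_rec @ self.h)
        # Exponentially weighted first/second moments of the hidden state
        self.mu = self.decay * self.mu + (1 - self.decay) * self.h
        self.var = self.decay * self.var + (1 - self.decay) * (self.h - self.mu) ** 2
        return np.concatenate([self.mu, self.var])  # b_t = (mu_t, Sigma_t)

def policy_logits(belief, W_pi):
    return W_pi @ belief  # control depends on memory *and* uncertainty

cell = BeliefStateCell(obs_dim=4, hidden_dim=8)
W_pi = rng.normal(0, 0.1, (3, 16))
for t in range(5):
    b_t = cell.step(rng.normal(size=4))
print("action logits:", policy_logits(b_t, W_pi))
```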
[1070] Active Inference with a Self-Prior in the Mirror-Mark Task
Dongmin Kim, Hoshinori Kanazawa, Yasuo Kuniyoshi
Main category: cs.LG
TL;DR: A computational model using a self-prior mechanism enables simulated infants to pass the mirror self-recognition test through active inference without external rewards.
Details
Motivation: To provide a computational account of mirror self-recognition behavior using the free energy principle, explaining how self-awareness might emerge through a single mechanism without explicit instruction.
Method: Implemented a self-prior using a Transformer to learn the density of familiar multisensory experiences (vision and proprioception). When a novel mark appears, the discrepancy from the learned distribution drives mark-directed behavior through active inference in a simulated infant.
Result: The simulated infant discovered and removed a sticker on its face in approximately 70% of cases without explicit instruction. Expected free energy decreased significantly after sticker removal, confirming the self-prior functions as an internal criterion for self/non-self distinction.
Conclusion: The self-prior mechanism provides a concise computational account of mirror test behavior and suggests the free energy principle can serve as a unifying hypothesis for investigating developmental origins of self-awareness.
Abstract: The mirror self-recognition test evaluates whether a subject touches a mark on its own body that is visible only in a mirror, and is widely used as an indicator of self-awareness. In this study, we present a computational model in which this behavior emerges spontaneously through a single mechanism, the self-prior, without any external reward. The self-prior, implemented with a Transformer, learns the density of familiar multisensory experiences; when a novel mark appears, the discrepancy from this learned distribution drives mark-directed behavior through active inference. A simulated infant, relying solely on vision and proprioception without tactile input, discovered a sticker placed on its own face in the mirror and removed it in approximately 70% of cases without any explicit instruction. Expected free energy decreased significantly after sticker removal, confirming that the self-prior operates as an internal criterion for distinguishing self from non-self. Cross-modal sampling further demonstrated that the self-prior captures visual–proprioceptive associations, functioning as a probabilistic body schema. These results provide a concise computational account of the key behavior observed in the mirror test and suggest that the free energy principle can serve as a unifying hypothesis for investigating the developmental origins of self-awareness. Code is available at: https://github.com/kim135797531/self-prior-mirror
[1071] A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning
Ming Lei, Christophe Baehr
Main category: cs.LG
TL;DR: Comparative theoretical analysis of entropy control strategies in RL for LLMs, showing covariance-based methods outperform traditional entropy regularization by selectively regularizing high-covariance tokens and achieving asymptotic unbiasedness.
Details
Motivation: RL is crucial for enhancing reasoning in LLMs, but scalable training is hindered by rapid entropy collapse leading to premature convergence and performance saturation. Need better entropy control strategies to improve RL training for LLMs.
Method: Theoretical analysis comparing traditional entropy regularization vs covariance-based mechanisms. Establishes unified framework for entropy dynamics under softmax parameterization, showing entropy change is governed by covariance between log-probabilities and logit updates.
Result: Traditional entropy regularization introduces dense, persistent bias that modifies stationary condition, leading to suboptimal policies. Covariance-based methods selectively regularize sparse subset of high-covariance tokens and achieve asymptotic unbiasedness when regularization coefficient is annealed.
Conclusion: Provides principled guidelines for entropy control in LLM posttraining, with implications for scaling RL to larger models and more complex reasoning tasks. Covariance-based methods offer superior theoretical properties.
Abstract: Reinforcement learning (RL) has become a key approach for enhancing reasoning in large language models (LLMs), yet scalable training is often hindered by the rapid collapse of policy entropy, which leads to premature convergence and performance saturation. This paper provides a comparative theoretical analysis of two entropy control strategies: traditional entropy regularization and the recently proposed covariance-based mechanism. We establish a unified framework for entropy dynamics under softmax parameterization, showing that entropy change is governed by the covariance between log-probabilities and logit updates. Our analysis reveals that traditional entropy regularization introduces a dense, persistent bias that modifies the stationary condition, leading to suboptimal policies, while covariance-based methods selectively regularize a sparse subset of high-covariance tokens and achieve asymptotic unbiasedness when the regularization coefficient is annealed. These results provide principled guidelines for entropy control in LLM posttraining, with implications for scaling RL to larger models and more complex reasoning tasks.
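The identity at the heart of the analysis is easy to verify numerically: to first order, the change in softmax entropy under a logit update dz equals -Cov_p(log p, dz), where the covariance is taken under the policy distribution p. The sketch below checks this on synthetic logits; it is a self-contained illustration, not the paper's code.

```python
# Numerical check: for p = softmax(z), a small logit update dz changes
# the entropy by approximately -Cov_p(log p, dz).
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p))

z = rng.normal(size=10)           # logits
dz = 1e-4 * rng.normal(size=10)   # small logit update (e.g., one PG step)

p = softmax(z)
logp = np.log(p)
# Covariance under the policy distribution p
cov = np.sum(p * (logp - np.sum(p * logp)) * (dz - np.sum(p * dz)))

dH_exact = entropy(softmax(z + dz)) - entropy(p)
print(f"exact dH        = {dH_exact:+.3e}")
print(f"-Cov_p(logp,dz) = {-cov:+.3e}")   # should match to first order
```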
[1072] STaR-DRO: Stateful Tsallis Reweighting for Group-Robust Structured Prediction
Samah Fodeh, Ganesh Puthiaraju, Elyas Irankhah, Linhai Ma, Srivani Talakokkul, Afshan Khan, Sreeraj Ramachandran, Jordan Alpert, Sarah Schellhorn
Main category: cs.LG
TL;DR: A framework combining XML-based prompting for structured generation and STaR-DRO robust optimization for fine-tuning improves structured prediction in clinical text mining, addressing format drift, ambiguity, and group heterogeneity.
Details
Motivation: Structured prediction faces challenges with ontology-constrained labels, evidence grounding, and valid structure under ambiguity, label skew, and heterogeneous group difficulty, particularly in clinical communication mining where rare but clinically important categories need reliable extraction.
Method: Two-part framework: 1) Task-agnostic prompting with XML-based instruction structure, disambiguation rules, verification reasoning, schema constraints, and self-validation; 2) STaR-DRO robust optimization combining Tsallis mirror descent with momentum-smoothed group-loss signals and bounded excess-only multipliers to focus learning on persistently hard groups.
Result: Prompt engineering improves zero-shot by +15.44 average F1 across Code, Sub-code, and Span over four Llama models. STaR-DRO further improves hardest semantic decisions: Code F1 rises from 79.24 to 81.47 and Sub-code F1 from 67.78 to 69.30 on Llama-3.3-70B-Instruct, while reducing group-wise validation cross-entropy by up to 29.6% on most difficult clinical categories.
Conclusion: The combined framework strengthens communication mining reliability for patient-centered care analysis by addressing format drift, ambiguity, and group heterogeneity, with gains in rare but clinically consequential categories that go beyond statistical improvements to enhance practical utility.
Abstract: Structured prediction requires models to generate ontology-constrained labels, grounded evidence, and valid structure under ambiguity, label skew, and heterogeneous group difficulty. We present a two-part framework for controllable inference and robust fine-tuning. First, we introduce a task-agnostic prompting strategy that combines XML-based instruction structure, disambiguation rules, verification-style reasoning, schema constraints, and self-validation to address format drift, label ambiguity, evidence hallucination, and metadata-conditioned confusion in in-context structured generation. Second, we introduce STaR-DRO, a stateful robust optimization method for group heterogeneity. It combines Tsallis mirror descent with momentum-smoothed, centered group-loss signals and bounded excess-only multipliers so that only persistently hard groups above a neutral baseline are upweighted, concentrating learning where it is most needed while avoiding volatile, dense exponentiated-gradient reweighting and unnecessary loss from downweighting easier groups. We evaluate the combined framework on EPPC Miner, a benchmark for extracting hierarchical labels and evidence spans from patient-provider secure messages. Prompt engineering improves zero-shot by +15.44 average F1 across Code, Sub-code, and Span over four Llama models. Building on supervised fine-tuning, STaR-DRO further improves the hardest semantic decisions: on Llama-3.3-70B-Instruct, Code F1 rises from 79.24 to 81.47 and Sub-code F1 from 67.78 to 69.30, while preserving Span performance and reducing group-wise validation cross-entropy by up to 29.6% on the most difficult clinical categories. Because these rare and difficult groups correspond to clinically consequential communication behaviors, these gains are not merely statistical improvements: they directly strengthen communication mining reliability for patient-centered care analysis.
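As a rough illustration of the optimizer's flavor (not the paper's exact Tsallis mirror-descent update), the sketch below smooths group losses with momentum, upweights only the excess above a neutral baseline, and bounds the multipliers so that weight concentrates on persistently hard groups. All constants are hypothetical.

```python
# Schematic excess-only group reweighting in the spirit of STaR-DRO.
# The paper's actual Tsallis mirror-descent update may differ; this is
# an approximation for intuition only.
import numpy as np

def update_group_weights(smoothed, state, losses, beta=0.9,
                         eta=0.5, max_mult=3.0):
    # Momentum-smoothed group losses
    smoothed = beta * smoothed + (1 - beta) * losses
    baseline = smoothed.mean()                      # neutral baseline
    excess = np.maximum(smoothed - baseline, 0.0)   # excess-only signal
    state = np.clip(state + eta * excess, 1.0, max_mult)  # bounded multipliers
    weights = state / state.sum()
    return smoothed, state, weights

n_groups = 4
smoothed = np.zeros(n_groups)
state = np.ones(n_groups)
for step in range(100):
    losses = np.array([0.30, 0.35, 0.90, 0.32])  # group 2 is persistently hard
    smoothed, state, w = update_group_weights(smoothed, state, losses)
print("final group weights:", np.round(w, 3))    # mass concentrates on group 2
```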
[1073] ExecTune: Effective Steering of Black-Box LLMs with Guide Models
Vijay Lingam, Aditya Golatkar, Anwesan Pal, Ben Vo, Narayanan Sadagopan, Alessandro Achille, Jun Huan, Anoop Deoras, Stefano Soatto
Main category: cs.LG
TL;DR: Guide-Core Policies (GCoP) framework for efficient LLM inference where a guide model generates structured strategies executed by a black-box core model, with ExecTune training optimizing executability and cost efficiency.
Details
Motivation: Recurring inference costs for black-box LLM APIs often exceed one-time training costs, motivating agentic systems that amortize expensive reasoning into reusable intermediate representations.
Method: Proposes Guide-Core Policies (GCoP) framework where a guide model generates structured strategies executed by a core model. Introduces ExecTune training recipe combining teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to optimize syntactic validity, execution success, and cost efficiency.
Result: GCoP with ExecTune improves accuracy by up to 9.2% over prior SOTA baselines while reducing inference cost by up to 22.4%. Enables Claude Haiku 3.5 to outperform Sonnet 3.5 on math and code tasks, and come within 1.7% accuracy of Sonnet 4 at 38% lower cost.
Conclusion: GCoP framework with ExecTune training provides efficient inference for black-box LLMs through structured strategy generation and execution, achieving better performance at lower cost while supporting modular adaptation.
Abstract: For large language models deployed through black-box APIs, recurring inference costs often exceed one-time training costs. This motivates composed agentic systems that amortize expensive reasoning into reusable intermediate representations. We study a broad class of such systems, termed Guide-Core Policies (GCoP), in which a guide model generates a structured strategy that is executed by a black-box core model. This abstraction subsumes base, supervised, and advisor-style approaches, which differ primarily in how the guide is trained. We formalize GCoP under a cost-sensitive utility objective and show that end-to-end performance is governed by guide-averaged executability: the probability that a strategy generated by the guide can be faithfully executed by the core. Our analysis shows that existing GCoP instantiations often fail to optimize executability under deployment constraints, resulting in brittle strategies and inefficient computation. Motivated by these insights, we propose ExecTune, a principled training recipe that combines teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to directly optimize syntactic validity, execution success, and cost efficiency. Across mathematical reasoning and code-generation benchmarks, GCoP with ExecTune improves accuracy by up to 9.2% over prior state-of-the-art baselines while reducing inference cost by up to 22.4%. It enables Claude Haiku 3.5 to outperform Sonnet 3.5 on both math and code tasks, and to come within 1.7% absolute accuracy of Sonnet 4 at 38% lower cost. Beyond efficiency, GCoP also supports modular adaptation by updating the guide without retraining the core.
[1074] Efficient Matrix Implementation for Rotary Position Embedding
Chen Minqi, Zhongqi Yue, Shihao Zhang, Yun Xu, Peng Wu, Kaixiang Xu, Zeyi Huang, Hanwang Zhang
Main category: cs.LG
TL;DR: RoME is a computationally efficient reformulation of Rotary Position Embedding (RoPE) that replaces vector operations with unified matrix transformations, eliminating dimension-specific operations and enabling better hardware utilization.
Details
Motivation: Existing RoPE implementations rely on vector-level split and merge operations that introduce computational overhead, especially problematic in multi-dimensional settings (2D/3D RoPE) where additional vector operations and uneven feature partitions degrade hardware utilization.
Method: Proposes RoME (Rotary Matrix position Embedding), a mathematically equivalent reformulation of RoPE that replaces vector operations with unified matrix transformations, eliminating dimension-specific operations and enabling fused parallel execution across Cube and Vector units on modern NPUs.
Result: Experiments show RoME delivers substantial acceleration at both operator and full-model levels, with simplified implementation and better hardware utilization.
Conclusion: RoME provides an efficient alternative to traditional RoPE implementations that reduces computational overhead while maintaining mathematical equivalence, particularly beneficial for multi-dimensional position embeddings in modern Transformer architectures.
Abstract: Rotary Position Embedding (RoPE) has become a core component of modern Transformer architectures across language, vision, and 3D domains. However, existing implementations rely on vector-level split and merge operations that introduce non-negligible computational overhead, often overlooked in attention optimization. The problem is further amplified in multi-dimensional settings (e.g., 2D and 3D RoPE), where additional vector operations and uneven feature partitions degrade hardware utilization. To overcome these limitations, we propose RoME (Rotary Matrix position Embedding), a mathematically equivalent yet computationally efficient reformulation of RoPE that replaces vector operations with unified matrix transformations. RoME eliminates dimension-specific operations, simplifies implementation, and enables fused parallel execution across Cube and Vector units on modern NPUs. Experiments show that RoME delivers substantial acceleration at both the operator and full-model levels. The implementation is available at https://gitcode.com/cann/ops-transformer/blob/master/experimental/posembedding/rope_matrix/README.md.
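The equivalence RoME exploits can be demonstrated in a few lines: per-pair rotation via split/merge produces exactly the same output as one multiplication by a block-diagonal rotation matrix. The numpy sketch below assumes the interleaved-pair RoPE convention and illustrates the math, not the released kernel.

```python
# RoPE two ways: (a) split/merge on even/odd features, rotating each
# 2D pair; (b) one matmul with a block-diagonal rotation matrix.
import numpy as np

def rope_split_merge(x, pos, theta=10000.0):
    d = x.shape[-1]
    freqs = theta ** (-np.arange(0, d, 2) / d)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]   # split
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin  # rotate each 2D pair
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out                                   # merge

def rope_matrix(x, pos, theta=10000.0):
    d = x.shape[-1]
    freqs = theta ** (-np.arange(0, d, 2) / d)
    R = np.zeros((d, d))
    for i, a in enumerate(pos * freqs):          # block-diagonal rotation
        c, s = np.cos(a), np.sin(a)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return x @ R.T                               # one unified matmul

x = np.random.default_rng(2).normal(size=8)
assert np.allclose(rope_split_merge(x, pos=7), rope_matrix(x, pos=7))
print("split/merge and matrix forms agree")
```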
[1075] Explainable Human Activity Recognition: A Unified Review of Concepts and Mechanisms
Mainak Kundu, Catherine Chen, Rifatul Islam, Ismail Uysal, Ria Kanjilal
Main category: cs.LG
TL;DR: Comprehensive review of explainable AI methods for human activity recognition across various sensing modalities, with a unified taxonomy and analysis of challenges in making HAR systems transparent and trustworthy.
Details
Motivation: Deep learning has improved HAR performance but created opaque models that limit trust and real-world deployment. Explainable AI is needed to make HAR systems more transparent and human-centered for healthcare, assistive living, and smart environments.
Method: Presents a comprehensive review of explainable HAR methods across wearable, ambient, physiological, and multimodal sensing. Introduces a unified perspective separating conceptual dimensions of explainability from algorithmic explanation mechanisms, and creates a mechanism-centric taxonomy of XAI-HAR methods covering major explanation paradigms.
Result: The review examines how XAI methods address temporal, multimodal, and semantic complexities of HAR, summarizes interpretability objectives, explanation targets, and limitations, and discusses current evaluation practices and challenges.
Conclusion: Outlines directions toward trustworthy activity recognition systems that better support human understanding and decision-making by addressing the need for reliable and deployable explainable AI in HAR applications.
Abstract: Human activity recognition (HAR) has become a key component of intelligent systems for healthcare monitoring, assistive living, smart environments, and human-computer interaction. Although deep learning has substantially improved HAR performance on multivariate sensor data, the resulting models often remain opaque, limiting trust, reliability, and real-world deployment. Explainable artificial intelligence (XAI) has therefore emerged as a critical direction for making HAR systems more transparent and human-centered. This paper presents a comprehensive review of explainable HAR methods across wearable, ambient, physiological, and multimodal sensing settings. We introduce a unified perspective that separates conceptual dimensions of explainability from algorithmic explanation mechanisms, reducing ambiguities in prior surveys. Building on this distinction, we present a mechanism-centric taxonomy of XAI-HAR methods covering major explanation paradigms. The review examines how these methods address the temporal, multimodal, and semantic complexities of HAR, and summarizes their interpretability objectives, explanation targets, and limitations. In addition, we discuss current evaluation practices, highlight key challenges in achieving reliable and deployable XAI-HAR, and outline directions toward trustworthy activity recognition systems that better support human understanding and decision-making.
[1076] NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
Weijian Mai, Mu Nan, Yu Zhu, Jiahang Cao, Rui Zhang, Yuqin Dai, Chunfeng Song, Andrew F. Luo, Jiamin Wu
Main category: cs.LG
TL;DR: NeuroFlow is a unified framework that jointly models visual encoding (predicting brain activity from stimuli) and decoding (reconstructing stimuli from brain activity) using a single flow model, achieving bidirectional consistency between visual and neural modalities.
Details
Motivation: Current approaches treat visual encoding and decoding as separate tasks requiring distinct models and training procedures, which is inefficient and fails to model the consistency between these bidirectional processes. The authors aim to create a unified framework that captures the reversible relationship between visual stimuli and neural activity.
Method: NeuroFlow introduces two key components: (1) NeuroVAE - a variational backbone that models neural variability and establishes a compact, semantically structured latent space for bidirectional modeling across visual and neural modalities. (2) Cross-modal Flow Matching (XFM) - learns a reversibly consistent flow model between visual and neural latent distributions, bypassing the typical noise-to-data diffusion paradigm guided by specific modality conditions.
Result: NeuroFlow achieves superior overall performance in both visual encoding and decoding tasks with higher computational efficiency compared to isolated methods. The model captures consistent activation patterns underlying neural variability and demonstrates encoding-decoding consistency.
Conclusion: NeuroFlow represents a major advancement toward unified visual encoding and decoding from neural activity, providing mechanistic insights that could inform future bidirectional visual brain-computer interfaces by modeling the reversible relationship between visual perception and neural responses.
Abstract: Visual encoding and decoding models act as gateways to understanding the neural mechanisms underlying human visual perception. Typically, visual encoding models that predict brain activity from stimuli and decoding models that reproduce stimuli from brain activity are treated as distinct tasks, requiring separate models and training procedures. This separation is inefficient and fails to model the consistency between encoding and decoding processes. To address this limitation, we propose NeuroFlow, the first unified framework that jointly models visual encoding and decoding from neural activity within a single flow model. NeuroFlow introduces two key components: (1) NeuroVAE is designed as a variational backbone to model neural variability and establish a compact, semantically structured latent space for bidirectional modeling across visual and neural modalities. (2) Cross-modal Flow Matching (XFM) bypasses the typical paradigm of noise-to-data diffusion guided by a specific modality condition, instead learning a reversibly consistent flow model between visual and neural latent distributions. For the first time, visual encoding and decoding are reformulated as a time-dependent, reversible process within a shared latent space for unified modeling. Empirical results demonstrate that NeuroFlow achieves superior overall performance in visual encoding and decoding tasks with higher computational efficiency compared to any isolated methods. We further analyze principal factors that steer the model toward encoding-decoding consistency and, through brain functional analyses, demonstrate that NeuroFlow captures consistent activation patterns underlying neural variability. NeuroFlow marks a major step toward unified visual encoding and decoding from neural activity, providing mechanistic insights that inform future bidirectional visual brain-computer interfaces.
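A minimal sketch of the cross-modal flow-matching idea follows, under the assumption of a straight interpolation path between paired latents and a toy MLP velocity field (both stand-ins, not the paper's architecture): instead of a noise-to-data path, the flow runs directly between the neural and visual latent distributions.

```python
# Toy cross-modal flow matching between paired latents, in the spirit
# of XFM. Latents, network, and training loop are illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 16
v_theta = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(v_theta.parameters(), lr=1e-3)

for step in range(200):
    z_neu = torch.randn(128, dim)        # stand-in neural latents
    z_vis = 0.5 * z_neu + 1.0            # stand-in paired visual latents
    t = torch.rand(128, 1)
    x_t = (1 - t) * z_neu + t * z_vis    # straight path between modalities
    target = z_vis - z_neu               # constant target velocity per pair
    pred = v_theta(torch.cat([x_t, t], dim=-1))
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Decoding direction: integrate the learned field from a neural latent
x = torch.randn(1, dim)
for k in range(10):
    t = torch.full((1, 1), k / 10)
    x = x + 0.1 * v_theta(torch.cat([x, t], dim=-1))  # Euler step
print("decoded visual latent shape:", tuple(x.shape))
```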
[1077] Below-ground Fungal Biodiversity Can be Monitored Using Self-Supervised Learning Satellite Features
Robin Young, Michael E. Van Nuland, E. Toby Kiers, Tomáš Větrovský, Petr Kohout, Petr Baldrian, Srinivasan Keshav
Main category: cs.LG
TL;DR: SSL applied to satellite imagery predicts below-ground ectomycorrhizal fungal richness across Europe/Asia with high accuracy, enabling 10,000x higher spatial resolution monitoring of underground biodiversity.
Details
Motivation: Current methods for monitoring mycorrhizal fungal biodiversity at landscape scales are infeasible due to time/cost constraints, with 90% of diversity hotspots unprotected. Need scalable tools to map underground fungal communities.
Method: Applied self-supervised learning (SSL) to satellite imagery to predict ectomycorrhizal fungal richness. Used ~12,000 field samples across Europe and Asia, comparing SSL-derived features against climate, soil, and land cover datasets.
Result: Models explain over 50% variance in species richness. SSL features were the single most informative predictor, subsuming majority of information from other datasets. Achieved 10,000-fold increase in spatial resolution (from 1km to 10m) with minimal systematic bias.
Conclusion: SSL satellite features provide scalable tool for continuous, high-resolution biodiversity mapping, enabling temporal monitoring of below-ground ecosystems. Analysis shows ancient forests may be losing ectomycorrhizal diversity disproportionately.
Abstract: Mycorrhizal fungi are vital to terrestrial ecosystem functioning. Yet monitoring their biodiversity at landscape scales is often unfeasible due to time and cost constraints. Current predictions suggest that 90% of mycorrhizal diversity hotspots remain unprotected, opening questions of how to broadly and effectively map underground fungal communities. Here, we show that self-supervised learning (SSL) applied to satellite imagery can predict below-ground ectomycorrhizal fungal richness across diverse environments. Our models explain over half the variance in species richness across ~12,000 field samples spanning Europe and Asia. SSL-derived features prove to be the single most informative predictor, subsuming the majority of information contained in climate, soil, and land cover datasets. Using this approach, we achieve a 10,000-fold increase in spatial resolution over existing techniques, moving from 1km landscape averages to 10m habitat-scale observations with nearly no systematic bias. As satellite observations are dynamic rather than static, this enables temporal monitoring of below-ground biodiversity at landscape scales for the first time. We analyze multi-year trends in predicted fungal richness across UK National Park woodlands, finding that ancient forests may be losing ectomycorrhizal diversity at disproportionate rates. These results establish SSL satellite features as a scalable tool for extending sparse field observations to continuous, high-resolution biodiversity maps for monitoring the invisible half of terrestrial ecosystems.
[1078] Relational Preference Encoding in Looped Transformer Internal States
Jan Kirin
Main category: cs.LG
TL;DR: Looped transformers encode human preference relationally in their internal states, with pairwise difference probes achieving 84.5% accuracy while independent classification fails, revealing preference is encoded as relative comparisons rather than absolute scores.
Details
Motivation: To understand how looped transformers internally represent human preferences during iterative refinement, and whether these representations capture preferences relationally or independently.
Method: Used Ouro-2.6B-Thinking looped transformer, extracted hidden states from each loop iteration, trained lightweight evaluator heads (~5M params) to predict human preference on Anthropic HH-RLHF dataset, with systematic architecture search and diagnostic tests including flip-test analysis.
Result: Pairwise evaluator achieved 95.2% test accuracy, surpassing full-batch L-BFGS probe (84.5%). Preference is encoded relationally: linear probe on pairwise differences achieved 84.5%, while best nonlinear independent evaluator only reached 65%, and linear independent classification scored 21.75% (below chance).
Conclusion: Looped transformers encode human preference predominantly relationally rather than independently, with evaluators functioning as model-internal consistency probes. The flip test is proposed as a mandatory diagnostic for pairwise preference evaluators.
Abstract: We investigate how looped transformers encode human preference in their internal iteration states. Using Ouro-2.6B-Thinking, a 2.6B-parameter looped transformer with iterative refinement, we extract hidden states from each loop iteration and train lightweight evaluator heads (~5M parameters) to predict human preference on the Anthropic HH-RLHF dataset. Our pairwise evaluator achieves 95.2% test accuracy on 8,552 unseen examples, surpassing a full-batch L-BFGS probe (84.5%) while the base model remains completely frozen. Our central finding is that loop states encode preference predominantly relationally: a linear probe on pairwise differences achieves 84.5%, the best nonlinear independent evaluator reaches only 65% test accuracy, and linear independent classification scores 21.75%, below chance and with inverted polarity. Interpreted precisely, the evaluator functions as a model-internal consistency probe, measuring how stably Ouro’s own learned value system organizes its representations rather than how well it predicts noisy human annotations. We also document a systematic architecture search that established a genuine 70% ceiling for independent scoring, and show how the 50% argument-swap protocol required to prevent degenerate pairwise solutions deflated pairwise training metrics by about 31 points at peak, creating the false appearance that pairwise and pointwise evaluators shared the same ceiling. Finally, we show that a cosine learning-rate dead zone at epoch 2 accidentally acted as early stopping, preserving the generalization peak before overfitting degraded test accuracy from 95.2% to 62.4% by epoch 5. Cross-epoch flip-test analysis shows that antisymmetry correlation remains stable while strict sign-flip rate mainly tracks scorer bias. We propose the flip test as a mandatory diagnostic for pairwise preference evaluators.
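The relational-vs-independent contrast can be reproduced qualitatively on synthetic data: when two items share an entangled context, a linear probe on hidden-state differences recovers the preference easily, while scoring each item independently and comparing is markedly weaker. The sketch below uses random vectors in place of Ouro loop states and sklearn's LogisticRegression as the probe; it illustrates the phenomenon, not the paper's numbers.

```python
# Relational vs independent preference probes on synthetic "loop states".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
d, n = 64, 2000
w_true = rng.normal(size=d)                    # latent "value direction"

ctx = 5.0 * rng.normal(size=(n, d))            # shared pair context (entangled)
h_a = ctx + rng.normal(size=(n, d))            # hidden state of response A
h_b = ctx + rng.normal(size=(n, d))            # hidden state of response B
y = ((h_a - h_b) @ w_true > 0).astype(int)     # 1 if A preferred

# Relational probe: the difference cancels the shared context
pair = LogisticRegression(max_iter=2000).fit(h_a - h_b, y)
print("pairwise-difference acc:", pair.score(h_a - h_b, y))

# Independent probe: score each item alone, then compare. The entangled
# context buries the value signal, so accuracy drops well below the
# difference probe.
solo = LogisticRegression(max_iter=2000).fit(
    np.vstack([h_a, h_b]), np.concatenate([y, 1 - y]))
s_a = solo.decision_function(h_a)
s_b = solo.decision_function(h_b)
print("independent-scoring acc:", ((s_a > s_b).astype(int) == y).mean())
```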
[1079] Efficient Personalization of Generative User Interfaces
Yi-Hao Peng, Samarth Das, Jeffrey P. Bigham, Jason Wu
Main category: cs.LG
TL;DR: A study on personalizing generative UI systems using designer preference modeling, showing significant disagreement in UI preferences among trained designers and developing a sample-efficient personalization method.
Details
Motivation: Generative UIs offer adaptive interfaces but personalization is challenging due to subjective UI properties that are hard to articulate and costly to infer from sparse feedback.
Method: Created dataset with 20 trained designers providing pairwise judgments over 600 generated UIs, analyzed preference divergence, and developed personalization method representing new users in terms of prior designers rather than fixed design concepts.
Result: Found substantial disagreement across designers (average kappa = 0.25), rationales showed designers differ in defining/prioritizing similar concepts. Personalization method outperformed pretrained UI evaluator and larger multimodal model, produced interfaces preferred by 12 new designers over baselines.
Conclusion: Lightweight preference elicitation can serve as practical foundation for personalized generative UI systems, with sample-efficient personalization method effectively capturing subjective design preferences.
Abstract: Generative user interfaces (UIs) create new opportunities to adapt interfaces to individual users on demand, but personalization remains difficult because desirable UI properties are subjective, hard to articulate, and costly to infer from sparse feedback. We study this problem through a new dataset in which 20 trained designers each provide pairwise judgments over the same 600 generated UIs, enabling direct analysis of preference divergence. We find substantial disagreement across designers (average kappa = 0.25), and written rationales reveal that even when designers appeal to similar concepts such as hierarchy or cleanliness, designers differ in how they define, prioritize, and apply those concepts. Motivated by these findings, we develop a sample-efficient personalization method that represents a new user in terms of prior designers rather than a fixed rubric of design concepts. In a technical evaluation, our preference model outperforms both a pretrained UI evaluator and a larger multimodal model, and scales better with additional feedback. When used to personalize generation, it also produces interfaces preferred by 12 new designers over baseline approaches, including direct user prompting. Our findings suggest that lightweight preference elicitation can serve as a practical foundation for personalized generative UI systems.
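A sketch of the designer-basis idea: if each prior designer k supplies a score s_k(ui), a new user can be fit as weights over those scores from a handful of pairwise judgments. The scores, pair counts, and sparse ground-truth blend below are synthetic assumptions, not the paper's data or model.

```python
# Fitting a new user as a mixture of prior designers' scoring functions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n_designers, n_uis = 20, 600
S = rng.normal(size=(n_uis, n_designers))     # s_k(ui) for every UI

w_true = np.zeros(n_designers)
w_true[[2, 7, 11]] = [0.5, 0.3, 0.2]          # new user ~ blend of 3 designers

pairs = rng.integers(0, n_uis, size=(40, 2))  # only 40 labeled comparisons
X = S[pairs[:, 0]] - S[pairs[:, 1]]           # designer-score differences
y = (X @ w_true > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
test = rng.integers(0, n_uis, size=(500, 2))
Xt = S[test[:, 0]] - S[test[:, 1]]
yt = (Xt @ w_true > 0).astype(int)
print("held-out pairwise acc:", model.score(Xt, yt))
```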
[1080] SemEnrich: Self-Supervised Semantic Enrichment of Radiology Reports for Vision-Language Learning
Halil Ibrahim Gulluk, Olivier Gevaert
Main category: cs.LG
TL;DR: Self-supervised data enrichment method for medical vision-language datasets using semantic clustering of report sentences to add positive/neutral observations, improving supervised fine-tuning and GRPO training performance.
Details
Motivation: Medical vision-language datasets are limited in size and biased toward negative findings because clinicians often omit positive/neutral observations that might be considered irrelevant to patient conditions, creating dataset imbalance.
Method: Proposes semantic clustering of report sentences followed by self-supervised enrichment by adding positive/neutral observations from different clusters. Also incorporates semantic cluster information into reward design for GRPO (Group Relative Policy Optimization) training.
Result: Consistent gains in supervised fine-tuning (5.63%, 3.04%, 7.40%, 5.30%, 7.47% average gains on COMET, Bert score, Sentence Bleu, CheXbert-F1 and RadGraph-F1 scores). Further improvements with semantic cluster-based reward design for GRPO (2.78%, 3.14%, 12.80% average gains on COMET, Bert score, Sentence Bleu).
Conclusion: Semantic clustering-based data enrichment effectively addresses dataset bias in medical vision-language tasks, improving model performance through better training data and enhanced reward mechanisms for policy optimization.
Abstract: Medical vision-language datasets are often limited in size and biased toward negative findings, as clinicians mostly report abnormalities and may omit positive/neutral findings considered irrelevant to the patient’s condition. We propose a self-supervised data enrichment method that leverages semantic clustering of report sentences: the findings in the training-set reports are then enriched by adding positive/neutral observations from different clusters in a self-supervised manner. Our approach yields consistent gains in supervised fine-tuning (5.63%, 3.04%, 7.40%, 5.30%, 7.47% average gains on COMET score, Bert score, Sentence Bleu, CheXbert-F1 and RadGraph-F1 scores respectively). Ablation studies confirm that improvements stem from semantic clustering rather than random augmentation. Furthermore, we introduce a way to incorporate semantic cluster information into the reward design for GRPO training, which leads to further performance gains (2.78%, 3.14%, 12.80% average gains on COMET score, Bert score and Sentence Bleu scores respectively). We share our code at https://anonymous.4open.science/r/SemEnrich-75CF
[1081] Cross-Validated Cross-Channel Self-Attention and Denoising for Automatic Modulation Classification
Prakash Suman, Yanzhen Qu
Main category: cs.LG
TL;DR: Proposes a deep learning AMC model with cross-channel self-attention and dual-path residual shrinkage denoising to improve robustness under noisy conditions.
Details
Motivation: Existing AMC models degrade under noisy conditions because conventional feature extraction suppresses both discriminative structure and interference. Need feature-preserving denoising that maintains modulation class separation.
Method: Deep learning AMC model incorporating cross-channel self-attention block to capture dependencies between in-phase/quadrature components, plus dual-path deep residual shrinkage denoising blocks to suppress noise while preserving features.
Result: Achieved notable accuracy improvements across -8 dB to +2 dB SNR compared to benchmarks: 3% over PET-CGDNN, 2.3% over MCLDNN, and 14% over DAE. Cross-validation showed mean accuracy 62.6%, macro precision 65.8%, macro-recall 62.6%, macro-F1 62.9%.
Conclusion: The architecture advances interference-aware AMC by formalizing baseband modeling as orthogonal subproblems and introducing cross-channel attention as generalized complex interaction operator. Feature-preserving denoising is critical for robustness at low-to-medium SNR.
Abstract: This study addresses a key limitation in deep learning Automatic Modulation Classification (AMC) models, which perform well at high signal-to-noise ratios (SNRs) but degrade under noisy conditions due to conventional feature extraction suppressing both discriminative structure and interference. The goal was to develop a feature-preserving denoising method that mitigates the loss of modulation class separation. A deep learning AMC model was proposed, incorporating a cross-channel self-attention block to capture dependencies between in-phase and quadrature components, along with dual-path deep residual shrinkage denoising blocks to suppress noise. Experiments using the RML2018.01a dataset employed stratified sampling across 24 modulation types and 26 SNR levels. Results showed that denoising depth strongly influences robustness at low and moderate SNRs. Compared to benchmark models PET-CGDNN, MCLDNN, and DAE, the proposed model achieved notable accuracy improvements across -8 dB to +2 dB SNR, with increases of 3%, 2.3%, and 14%, respectively. Cross-validation confirmed the model’s robustness, yielding a mean accuracy of 62.6%, macro precision of 65.8%, macro-recall of 62.6%, and macro-F1 score of 62.9%. The architecture advances interference-aware AMC by formalizing baseband modeling as orthogonal subproblems and introducing cross-channel attention as a generalized complex interaction operator, with ablations confirming the critical role of feature-preserving denoising for robustness at low-to-medium SNR.
[1082] Improving Pediatric Emergency Department Triage with Modality Dropout in Late Fusion Multimodal EHR Models
Tyler Yang, Romal Mitr
Main category: cs.LG
TL;DR: Multimodal model combining vital signs and clinical notes with symmetric modality dropout improves pediatric triage prediction generalization
Details
Motivation: Current multimodal models for emergency department triage suffer from modality collapse (over-relying on structured tabular data), which limits demographic generalizability, especially for pediatric patients where clinical narratives are crucial due to developmental variations in vital signs.
Method: Late-fusion architecture with XGBoost for tabular vitals and Bio_ClinicalBERT for clinical text, combined via Logistic Regression meta-classifier. Trained exclusively on adult data from MIMIC-IV and NHAMCS, evaluated on pediatric cohort with zero-shot generalization. Used symmetric modality dropout during training to prevent overfitting to adult-specific correlations.
Result: Multimodal framework significantly outperforms single-modality baselines. Symmetric modality dropout (30-40% rate) yielded steep performance improvements in unseen pediatric cohort, elevating Quadratic Weighted Kappa to 0.351.
Conclusion: Modality dropout is a critical regularization technique for mitigating modality collapse and enhancing cross-demographic generalization in clinical AI, particularly important for pediatric applications where clinical narratives are uniquely valuable.
Abstract: Emergency department triage relies heavily on both quantitative vital signs and qualitative clinical notes, yet multimodal machine learning models predicting triage acuity often suffer from modality collapse by over-relying on structured tabular data. This limitation severely hinders demographic generalizability, particularly for pediatric patients where developmental variations in vital signs make unstructured clinical narratives uniquely crucial. To address this gap, we propose a late-fusion multimodal architecture that processes tabular vitals via XGBoost and unstructured clinical text via Bio_ClinicalBERT, combined through a Logistic Regression meta-classifier to predict the 5-level Emergency Severity Index. To explicitly target the external validity problem, we train our model exclusively on adult encounters from the MIMIC-IV and NHAMCS datasets and evaluate its zero-shot generalization on a traditionally overlooked pediatric cohort. Furthermore, we employ symmetric modality dropout during training to prevent the ensemble from overfitting to adult-specific clinical correlations. Our results demonstrate that the multimodal framework significantly outperforms single-modality baselines. Most notably, applying a 30-40% symmetric modality dropout rate yielded steep performance improvements in the unseen pediatric cohort, elevating the Quadratic Weighted Kappa to 0.351. These findings highlight modality dropout as a critical regularization technique for mitigating modality collapse and enhancing cross-demographic generalization in clinical AI.
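A minimal sketch of symmetric modality dropout in a late-fusion stack follows, assuming two base heads that emit probability-like features (synthetic stand-ins for the XGBoost and Bio_ClinicalBERT heads) and a binary label instead of the 5-level ESI. Masking each modality with equal probability discourages the meta-classifier from collapsing onto the stronger head.

```python
# Symmetric modality dropout on per-modality features before the
# late-fusion meta-classifier. Base-head outputs are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 4000
y = rng.integers(0, 2, size=n)
p_tab = np.clip(y + rng.normal(0, 0.45, n), 0, 1)   # tabular head (stronger)
p_txt = np.clip(y + rng.normal(0, 0.60, n), 0, 1)   # text head (weaker)
X = np.stack([p_tab, p_txt], axis=1)

def modality_dropout(X, rate=0.35):
    X = X.copy()
    for m in range(X.shape[1]):                 # symmetric across modalities
        mask = rng.random(X.shape[0]) < rate
        X[mask, m] = 0.5                        # uninformative placeholder
    return X

meta = LogisticRegression().fit(modality_dropout(X), y)
print("meta-classifier coefficients:", meta.coef_.round(2))
# Without dropout, the coefficient on the stronger head tends to be much
# larger; dropout pushes the stack to keep using both modalities.
```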
[1083] Last-Iterate Convergence of Randomized Kaczmarz and SGD with Greedy Step Size
Michał Dereziński, Xiaoyu Dong
Main category: cs.LG
TL;DR: SGD with greedy step size achieves O(1/t^{3/4}) last-iterate convergence for smooth quadratics in interpolation regime, improving previous O(1/t^{1/2}) bound.
Details
Motivation: The paper aims to understand the convergence behavior of SGD with greedy step sizes in the interpolation regime, which captures classical algorithms like Randomized Kaczmarz and other iterative linear system solvers. There was an open question about improving the convergence rate from the previously known O(1/t^{1/2}) bound.
Method: The authors introduce stochastic contraction processes and analyze their behavior through the evolution of a deterministic eigenvalue equation. They use a careful discrete-to-continuous reduction technique to study the convergence properties.
Result: The paper proves that SGD with greedy step size achieves O(1/t^{3/4}) last-iterate convergence rate for smooth quadratics in the interpolation regime, improving upon the previous O(1/t^{1/2}) guarantee.
Conclusion: The work provides improved convergence guarantees for SGD with greedy step sizes in the interpolation setting, resolving an open question and introducing new analytical techniques through stochastic contraction processes.
Abstract: We study last-iterate convergence of SGD with greedy step size over smooth quadratics in the interpolation regime, a setting which captures the classical Randomized Kaczmarz algorithm as well as other popular iterative linear system solvers. For these methods, we show that the $t$-th iterate attains an $O(1/t^{3/4})$ convergence rate, addressing a question posed by Attia, Schliserman, Sherman, and Koren, who gave an $O(1/t^{1/2})$ guarantee for this setting. In the proof, we introduce the family of stochastic contraction processes, whose behavior can be described by the evolution of a certain deterministic eigenvalue equation, which we analyze via a careful discrete-to-continuous reduction.
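For readers less familiar with the setting, the algorithm itself is short: the classical Kaczmarz projection is exactly the greedy (loss-minimizing) step for SGD on the row-wise quadratic f_i(x) = (a_i·x - b_i)²/2. A minimal sketch on a consistent system, with uniform row sampling assumed:

```python
# Randomized Kaczmarz = SGD with greedy (exact line-search) step size.
# Consistent system => interpolation regime, as in the paper's setting.
import numpy as np

rng = np.random.default_rng(6)
m, d = 200, 50
A = rng.normal(size=(m, d))
x_star = rng.normal(size=d)
b = A @ x_star                      # consistent: interpolation holds

x = np.zeros(d)
for t in range(1, 5001):
    i = rng.integers(m)             # sample a row uniformly
    a = A[i]
    # Greedy step: exact minimizer of (a.x - b_i)^2 along direction a,
    # i.e., the orthogonal projection onto the i-th hyperplane
    x = x + (b[i] - a @ x) / (a @ a) * a

print(f"last-iterate error after {t} steps:",
      f"{np.linalg.norm(x - x_star):.2e}")
```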
[1084] A Tale of Two Temperatures: Simple, Efficient, and Diverse Sampling from Diffusion Language Models
Theo X. Olausson, Metod Jazbec, Xi Wang, Armando Solar-Lezama, Christian A. Naesseth, Stephan Mandt, Eric Nalisnick
Main category: cs.LG
TL;DR: Tempered confidence-based remasking heuristics improve diversity in diffusion language model sampling while maintaining computational efficiency
Details
Motivation: Existing sampling methods for diffusion language models focus on speed-quality tradeoffs but neglect diversity across samples; there's a need for methods that ensure both computational efficiency and sample diversity.
Method: Proposes softened/tempered versions of confidence-based remasking heuristics, motivated by formal analysis of fork tokens and their impact on expected entropy; retains computational benefits with simple implementations.
Result: Tempered heuristics close exploration gap (pass@k) between existing confidence-based and autoregressive sampling, outperforming both when controlling for cost (pass@NFE); improves diversity in downstream post-training and test-time compute scaling
Conclusion: Simple, efficient, and diverse sampling from diffusion language models is achievable through tempered confidence-based remasking heuristics
Abstract: Much work has been done on designing fast and accurate sampling for diffusion language models (dLLMs). However, these efforts have largely focused on the tradeoff between speed and quality of individual samples; how to additionally ensure diversity across samples remains less well understood. In this work, we show that diversity can be increased by using softened, tempered versions of familiar confidence-based remasking heuristics, retaining their computational benefits and offering simple implementations. We motivate this approach by introducing an idealized formal model of fork tokens and studying the impact of remasking on the expected entropy at the forks. Empirically, the proposed tempered heuristics close the exploration gap (pass@k) between existing confidence-based and autoregressive sampling, hence outperforming both when controlling for cost (pass@NFE). We further study how the increase in diversity translates to downstream post-training and test-time compute scaling. Overall, our findings demonstrate that simple, efficient, and diverse sampling from dLLMs is possible.
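A sketch of the tempering idea: rather than deterministically unmasking the top-k most confident positions, sample positions with probability proportional to softmax(confidence / tau), so tau -> 0 recovers the greedy heuristic and larger tau trades per-sample quality for diversity. The confidences and the selection rule below are illustrative, not the paper's exact heuristic.

```python
# Tempered confidence-based remasking for a masked-diffusion sampler.
import numpy as np

rng = np.random.default_rng(7)

def choose_positions_to_unmask(conf, k, tau):
    z = (conf - conf.max()) / tau
    p = np.exp(z) / np.exp(z).sum()
    # Sample k distinct masked positions with tempered probabilities
    return rng.choice(len(conf), size=k, replace=False, p=p)

conf = rng.random(32)                 # model confidence at masked positions
greedy = np.argsort(conf)[-4:]        # standard confidence heuristic
tempered = choose_positions_to_unmask(conf, k=4, tau=0.3)
print("greedy picks:  ", sorted(greedy.tolist()))
print("tempered picks:", sorted(tempered.tolist()))
```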
[1085] K-STEMIT: Knowledge-Informed Spatio-Temporal Efficient Multi-Branch Graph Neural Network for Subsurface Stratigraphy Thickness Estimation from Radar Data
Zesheng Liu, Maryam Rahnemoonfar
Main category: cs.LG
TL;DR: K-STEMIT: A knowledge-informed spatio-temporal graph neural network for analyzing subsurface ice layer thickness from radar data, incorporating physical weather model priors for improved accuracy.
Details
Motivation: Existing methods for analyzing subsurface ice layer thickness from radar data are sensitive to noise and artifacts, and purely data-driven approaches often underuse physical knowledge, leading to unrealistic estimates under extrapolation.
Method: Developed K-STEMIT, a multi-branch spatio-temporal graph neural network that combines geometric spatial learning with temporal convolution, incorporates physical data from weather models, and uses adaptive feature fusion to dynamically combine features from different branches.
Result: K-STEMIT achieves highest accuracy while maintaining near-optimal efficiency, reduces root mean-squared error by 21.01% compared to conventional variants, and enables reliable continuous spatiotemporal assessment of snow accumulation variability.
Conclusion: The proposed K-STEMIT framework effectively addresses challenges in subsurface ice layer analysis by integrating physical knowledge with deep learning, providing more accurate and reliable thickness estimates for polar ice sheet research.
Abstract: Subsurface stratigraphy contains important spatio-temporal information about accumulation, deformation, and layer formation in polar ice sheets. In particular, variations in internal ice layer thickness provide valuable constraints for snow mass balance estimation and projections of ice sheet change. Although radar sensors can capture these layered structures as depth-resolved radargrams, convolutional neural networks applied directly to radar images are often sensitive to speckle noise and acquisition artifacts. In addition, purely data-driven methods may underuse physical knowledge, leading to unrealistic thickness estimates under spatial or temporal extrapolation. To address these challenges, we develop K-STEMIT, a novel knowledge-informed, efficient, multi-branch spatio-temporal graph neural network that combines a geometric framework for spatial learning with temporal convolution to capture temporal dynamics, and incorporates physical data synchronized from the Model Atmospheric Regional physical weather model. An adaptive feature fusion strategy is employed to dynamically combine features learned from different branches. Extensive experiments have been conducted to compare K-STEMIT against current state-of-the-art methods in both knowledge-informed and non-knowledge-informed settings, as well as other existing methods. Results show that K-STEMIT consistently achieves the highest accuracy while maintaining near-optimal efficiency. Most notably, incorporating adaptive feature fusion and physical priors reduces the root mean-squared error by 21.01% with negligible additional cost compared to its conventional multi-branch variants. Additionally, our proposed K-STEMIT achieves consistently lower per-year relative MAE, enabling reliable, continuous spatiotemporal assessment of snow accumulation variability across large spatial regions.
[1086] A Hybrid Intelligent Framework for Uncertainty-Aware Condition Monitoring of Industrial Systems
Maryam Ahang, Todd Charter, Masoud Jalayer, Homayoun Najjaran
Main category: cs.LG
TL;DR: Hybrid condition monitoring framework combining data-driven ML with physics-informed residuals and temporal features improves diagnostic accuracy and uncertainty management in industrial systems.
Details
Motivation: To improve reliability of industrial condition monitoring by combining data-driven learning with physics-based insight through hybrid approaches that integrate sensor measurements, temporal features, and physics-informed residuals.
Method: Develops two hybrid integration strategies: 1) feature-level fusion augmenting input space with residual and temporal information, and 2) model-level ensemble combining ML classifiers trained on different feature types at decision level. Evaluated on continuous stirred-tank reactor benchmark with conformal prediction for uncertainty quantification.
Result: Both hybrid approaches improve diagnostic accuracy over single-source baselines, with best model-level ensemble achieving 2.9% improvement. Hybrid integration enhances uncertainty management, producing smaller, well-calibrated prediction sets at matched coverage levels.
Conclusion: Lightweight physics-informed residuals, temporal augmentation, and ensemble learning can be effectively combined to improve both accuracy and decision reliability in nonlinear industrial systems.
Abstract: Hybrid approaches that combine data-driven learning with physics-based insight have shown promise for improving the reliability of industrial condition monitoring. This work develops a hybrid condition monitoring framework that integrates primary sensor measurements, lagged temporal features, and physics-informed residuals derived from nominal surrogate models. Two hybrid integration strategies are examined. The first is a feature-level fusion approach that augments the input space with residual and temporal information. The second is a model-level ensemble approach in which machine learning classifiers trained on different feature types are combined at the decision level. Both hybrid approaches of the condition monitoring framework are evaluated on a continuous stirred-tank reactor (CSTR) benchmark using several machine learning models and ensemble configurations. Both feature-level and model-level hybridization improve diagnostic accuracy relative to single-source baselines, with the best model-level ensemble achieving a 2.9% improvement over the best baseline ensemble. To assess predictive reliability, conformal prediction is applied to quantify coverage, prediction-set size, and abstention behavior. The results show that hybrid integration enhances uncertainty management, producing smaller and well-calibrated prediction sets at matched coverage levels. These findings demonstrate that lightweight physics-informed residuals, temporal augmentation, and ensemble learning can be combined effectively to improve both accuracy and decision reliability in nonlinear industrial systems.
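A toy sketch of the feature-level fusion strategy: residuals against a nominal surrogate and a lagged measurement are appended to the raw signal before classification. The linear surrogate and the step-change fault below are stand-ins for the CSTR benchmark, not the paper's models.

```python
# Feature-level fusion: raw measurement + physics-informed residual +
# lagged value, fed to a generic classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(8)
T = 3000
u = rng.normal(size=T)                     # input signal
y_nom = 0.8 * u                            # nominal (healthy) surrogate
fault = np.zeros(T, dtype=int)
fault[T // 2:] = 1                         # fault in second half
y_meas = y_nom + 0.4 * fault + rng.normal(0, 0.2, T)  # drift under fault

residual = y_meas - y_nom                  # physics-informed residual
lag1 = np.roll(y_meas, 1)                  # lagged temporal feature
X = np.stack([y_meas, residual, lag1], axis=1)[1:]  # drop invalid first lag
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, fault[1:])
print("train accuracy with residual+lag features:", clf.score(X, fault[1:]))
```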
[1087] Interpretable Alzheimer’s Diagnosis via Multimodal Fusion of Regional Brain Experts
Farica Zhuang, Shu Yang, Dinara Aliyeva, Zixuan Wen, Duy Duong-Tran, Christos Davatzikos, Tianlong Chen, Song Wang, Li Shen
Main category: cs.LG
TL;DR: A multimodal Mixture-of-Experts framework for Alzheimer’s disease diagnosis that adaptively fuses amyloid PET and MRI data using regional experts and a gating network.
Details
Motivation: Accurate early diagnosis of Alzheimer's disease requires integrating complementary multimodal neuroimaging data, but conventional fusion approaches use simple concatenation that cannot adaptively balance biomarker contributions across brain regions.
Method: MREF-AD (Multimodal Regional Expert Fusion model) uses a Mixture-of-Experts framework where mesoscopic brain regions within each modality (amyloid PET and MRI) are modeled as independent experts, with a gating network learning subject-specific fusion weights.
Result: The model achieves competitive performance over strong classic and deep baselines using ADNI data, while providing interpretable modality- and region-level insights into how structural and molecular imaging jointly contribute to AD diagnosis.
Conclusion: MREF-AD provides an effective adaptive fusion approach for multimodal neuroimaging data in Alzheimer’s disease diagnosis, offering both competitive performance and interpretable insights into biomarker contributions.
Abstract: Accurate and early diagnosis of Alzheimer’s disease (AD) is critical for effective intervention and requires integrating complementary information from multimodal neuroimaging data. However, conventional fusion approaches often rely on simple concatenation of features, which cannot adaptively balance the contributions of biomarkers such as amyloid PET and MRI across brain regions. In this work, we propose MREF-AD, a Multimodal Regional Expert Fusion model for AD diagnosis. It is a Mixture-of-Experts (MoE) framework that models mesoscopic brain regions within each modality as independent experts and employs a gating network to learn subject-specific fusion weights. Utilizing tabular neuroimaging and demographic information from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), MREF-AD achieves competitive performance over strong classic and deep baselines while providing interpretable, modality- and region-level insight into how structural and molecular imaging jointly contribute to AD diagnosis.
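A compact sketch of the regional-expert design: one small expert per regional feature block and a gating network over the full input that produces subject-specific mixture weights. Dimensions, region count, and class count below are illustrative placeholders, not the paper's configuration.

```python
# Mixture of regional experts with a learned gate, MREF-AD style.
import torch
import torch.nn as nn

class RegionalMoE(nn.Module):
    def __init__(self, n_regions=8, region_dim=4, n_classes=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(region_dim, n_classes) for _ in range(n_regions)])
        self.gate = nn.Linear(n_regions * region_dim, n_regions)

    def forward(self, x):                  # x: (B, n_regions, region_dim)
        B = x.shape[0]
        w = torch.softmax(self.gate(x.reshape(B, -1)), dim=-1)  # (B, n_regions)
        logits = torch.stack(
            [e(x[:, i]) for i, e in enumerate(self.experts)],
            dim=1)                         # (B, n_regions, n_classes)
        return (w.unsqueeze(-1) * logits).sum(dim=1), w

model = RegionalMoE()
x = torch.randn(5, 8, 4)                   # e.g., PET+MRI regional features
out, gate_w = model(x)
print(out.shape, gate_w[0].round(decimals=2))  # gate weights are interpretable
```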
[1088] Vestibular reservoir computing
Smita Deb, Shirin Panahi, Mulugeta Haile, Ying-Cheng Lai
Main category: cs.LG
TL;DR: Physical reservoir computing scheme inspired by biological vestibular system uses uncoupled topology to achieve comparable performance to fully coupled networks while reducing hardware complexity.
Details
Motivation: Traditional reservoir computing requires complex interconnectivity that's difficult to implement in physical hardware. The paper aims to develop a simpler, more feasible physical RC implementation inspired by biological systems.
Method: Proposes an uncoupled reservoir topology inspired by the vestibular system, analyzes differences between coupled and uncoupled topologies through theoretical derivation of memory capacity formulas for linear reservoirs, and extends analysis to nonlinear systems.
Result: Uncoupled topologies achieve performance comparable to fully coupled networks under specific conditions. Theoretical analysis identifies conditions where both configurations yield equivalent memory, and these findings approximately hold for nonlinear reservoir systems.
Conclusion: Uncoupled reservoir architectures provide a mathematically sound and practically feasible pathway for efficient physical reservoir computing, offering reduced hardware complexity while maintaining performance.
Abstract: Reservoir computing (RC) is a computational framework known for its training efficiency, making it ideal for physical hardware implementations. However, realizing the complex interconnectivity of traditional reservoirs in physical systems remains a significant challenge. This paper proposes a physical RC scheme inspired by the biological vestibular system. To overcome hardware complexity, we introduce a designed uncoupled topology and demonstrate that it achieves performance comparable to fully coupled networks. We theoretically analyze the difference between these topologies by deriving a memory capacity formula for linear reservoirs, identifying specific conditions where both configurations yield equivalent memory. These analytical results are demonstrated to approximately hold for nonlinear reservoir systems. Furthermore, we systematically examine the impact of reservoir size on predictive statistics and memory capacity. Our findings suggest that uncoupled reservoir architectures offer a mathematically sound and practically feasible pathway for efficient physical reservoir computing.
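The coupled-vs-uncoupled comparison is easy to emulate on the standard linear memory-capacity task: train ridge readouts to recall u(t - k) from the reservoir state and sum the squared correlations over delays. The sizes, spectral scaling, and regularization in this sketch are arbitrary choices, not the paper's settings.

```python
# Memory capacity of a fully coupled vs an uncoupled (diagonal) reservoir.
import numpy as np

rng = np.random.default_rng(9)
N, T, max_delay = 50, 4000, 20

def run_reservoir(W, w_in, u):
    x = np.zeros(N); states = []
    for t in range(len(u)):
        x = np.tanh(W @ x + w_in * u[t])
        states.append(x.copy())
    return np.array(states)

def memory_capacity(states, u):
    mc = 0.0
    for k in range(1, max_delay + 1):
        X, y = states[k:], u[:-k]          # recall u(t - k) from state at t
        w = np.linalg.solve(X.T @ X + 1e-6 * np.eye(N), X.T @ y)  # ridge
        mc += np.corrcoef(X @ w, y)[0, 1] ** 2
    return mc

u = rng.uniform(-1, 1, T)
w_in = rng.uniform(-0.5, 0.5, N)

W_full = rng.normal(size=(N, N))
W_full *= 0.9 / np.abs(np.linalg.eigvals(W_full)).max()   # coupled
W_diag = np.diag(rng.uniform(-0.9, 0.9, N))               # uncoupled

print("coupled MC:  ", round(memory_capacity(run_reservoir(W_full, w_in, u), u), 2))
print("uncoupled MC:", round(memory_capacity(run_reservoir(W_diag, w_in, u), u), 2))
```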
[1089] SLM Finetuning for Natural Language to Domain Specific Code Generation in Production
Renjini R. Nair, Damian K. Kowalczyk, Marco Gaudesi, Chhaya Methani
Main category: cs.LG
TL;DR: Fine-tuned small language models outperform larger models for domain-specific code generation with better latency and cost efficiency.
Details
Motivation: Production systems need low-latency code generation, but large models are slow while small models have limitations in reasoning and context retention. Fine-tuning can embed domain knowledge directly into model weights.
Method: Fine-tuned variants of Mistral and other small language models on natural language-code pairs dataset, compared with larger models and previous retrieval-augmented generation approach.
Result: Fine-tuned small models achieved improved performance and latency on test datasets compared to larger models, and could be further fine-tuned for customer-specific scenarios without degrading general performance.
Conclusion: Task-specific fine-tuning with small language models provides efficient, faster, and cost-effective alternative to large models for domain-specific language generation.
Abstract: Many applications today use large language models for code generation; however, production systems have strict latency requirements that can be difficult to meet with large models. Small language models with a few billion parameters are resource-efficient but may suffer from limited reasoning, hallucinations, or poor retention of longer context. Fine-tuning improves task-specific accuracy by embedding domain knowledge directly into model weights, reducing reliance on runtime context. We previously implemented a baseline natural-language-to-code generation approach using a retrieval-augmented generation pipeline that dynamically selected few-shot examples to embed domain-specific language context for a large language model. In this study, we evaluate small language models for generating domain-specific language from natural language by fine-tuning variants of Mistral and other models on a dataset of natural language-code pairs. Our results show that the fine-tuned models achieve improved performance and latency on test datasets compared to larger models. We also demonstrate that the trained model can be further fine-tuned for customer-specific scenarios without degrading general performance, helping resolve production issues. Load testing followed by production deployment confirmed optimal performance in terms of latency and quality. These findings demonstrate that task-specific fine-tuning with small language models provides an efficient, faster, and cost-effective alternative to large language models for domain-specific language generation.
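For readers who want a starting point, here is a minimal supervised fine-tuning sketch for natural language to DSL pairs using a Hugging Face causal LM. The base checkpoint, prompt template, hyperparameters, and the toy pair are all placeholders; the paper does not publish its exact setup.

```python
# Minimal SFT sketch on NL -> code pairs; all names below are assumptions.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "mistralai/Mistral-7B-v0.1"  # assumed base checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

class NL2CodePairs(Dataset):
    """Tokenizes 'NL -> code' pairs; loss is taken over the full sequence."""
    def __init__(self, pairs, max_len=512):
        self.items = [tok(f"### NL:\n{nl}\n### Code:\n{code}{tok.eos_token}",
                          truncation=True, max_length=max_len)["input_ids"]
                      for nl, code in pairs]
    def __len__(self):
        return len(self.items)
    def __getitem__(self, i):
        ids = torch.tensor(self.items[i])
        return {"input_ids": ids, "labels": ids.clone()}

pairs = [("total sales by region", "SUMMARIZE sales BY region")]  # toy pair
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="slm-dsl", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=2e-5),
    train_dataset=NL2CodePairs(pairs),
)
trainer.train()
```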
[1090] Binary Flow Matching: Prediction-Loss Space Alignment for Robust Learning
Jiadong Hong, Lei Liu, Xinyu Bian, Wenjie Wang, Zhaoyang Zhang
Main category: cs.LG
TL;DR: Flow matching for binary manifolds: signal-space alignment (x-loss) eliminates singular weighting from velocity-based objectives, enabling robust training without heuristic schedules.
Details
Motivation: Recent empirical successes of signal-space prediction (x-prediction) in flow matching for continuous domains motivate investigation of this paradigm for binary manifolds, a fundamental setting for discrete data generation. However, structural mismatches arise when combining x-prediction with velocity-based objectives.
Method: Theoretical analysis of flow matching on binary manifolds, formalizing prediction-loss alignment as necessary condition. Proves that re-aligning objective to signal space (x-loss) eliminates singular weighting from velocity-based objectives, yielding uniformly bounded gradients. Examines design choices specific to binary data, comparing probabilistic objectives (cross-entropy) vs geometric losses (MSE).
Result: Signal-space alignment eliminates time-dependent singular weighting that amplifies gradient sensitivity to approximation errors, enabling robust training under uniform timestep sampling without reliance on heuristic schedules. Reveals topology-dependent distinction between probabilistic and geometric losses for binary data.
Conclusion: Signal-space alignment is key principle for robust flow matching on binary and related discrete domains, providing theoretical foundations and practical guidelines for diffusion learning on discrete data.
Abstract: Flow matching has emerged as a powerful framework for generative modeling, with recent empirical successes highlighting the effectiveness of signal-space prediction ($x$-prediction). In this work, we investigate the transfer of this paradigm to binary manifolds, a fundamental setting for generative modeling of discrete data. While $x$-prediction remains effective, we identify a latent structural mismatch that arises when it is coupled with velocity-based objectives ($v$-loss), leading to a time-dependent singular weighting that amplifies gradient sensitivity to approximation errors. Motivated by this observation, we formalize prediction-loss alignment as a necessary condition for flow matching training. We prove that re-aligning the objective to the signal space ($x$-loss) eliminates the singular weighting, yielding uniformly bounded gradients and enabling robust training under uniform timestep sampling without reliance on heuristic schedules. Finally, with alignment secured, we examine design choices specific to binary data, revealing a topology-dependent distinction between probabilistic objectives (e.g., cross-entropy) and geometric losses (e.g., mean squared error). Together, these results provide theoretical foundations and practical guidelines for robust flow matching on binary – and related discrete – domains, positioning signal-space alignment as a key principle for robust diffusion learning.
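The "singular weighting" the abstract refers to can be made concrete. Under a rectified-flow-style path x_t = (1-t)x0 + t*x1 with x0 noise and x1 data (one common convention, assumed here), the velocity induced by a signal prediction is (x1_hat - x_t)/(1-t), so a v-space loss on an x-prediction model carries a 1/(1-t)^2 factor that diverges as t approaches 1, while the x-loss stays uniformly weighted:

```python
# Toy demonstration of the time-dependent weight relating v-loss to x-loss.
import torch

def losses(x1_hat, x1, x_t, t):
    # velocity induced by a signal prediction under x_t = (1-t)*x0 + t*x1
    v_hat = (x1_hat - x_t) / (1 - t)
    v_true = (x1 - x_t) / (1 - t)
    v_loss = ((v_hat - v_true) ** 2).mean()   # = x_loss / (1 - t)^2
    x_loss = ((x1_hat - x1) ** 2).mean()      # uniformly weighted in t
    return v_loss.item(), x_loss.item()

x1, x0 = torch.ones(4), torch.randn(4)
x1_hat = x1 + 0.1                             # fixed approximation error
for t in (0.5, 0.9, 0.99, 0.999):
    x_t = (1 - t) * x0 + t * x1
    v_loss, x_loss = losses(x1_hat, x1, x_t, t)
    print(f"t={t:<6} v_loss={v_loss:10.2f}  x_loss={x_loss:.4f}")
```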
[1090] From Recency Bias to Stable Convergence: Block Kaczmarz Methods for Online Preference Learning in Matchmaking Applications
James Nguyen
Main category: cs.LG
TL;DR: A Kaczmarz-based preference learning algorithm family for real-time personalized matchmaking in reciprocal recommender systems, addressing exponential recency bias through Tikhonov-regularized projection instead of L2 normalization.
Details
Motivation: To address exponential recency bias in Kaczmarz-inspired online learners for reciprocal recommender systems, where post-step L2 normalization causes the influence of interactions to decay too rapidly (reaching ~1e-6 after just 20 swipes).
Method: Replace normalization step with Tikhonov-regularized projection denominator (||a||^2 + alpha) that bounds step size analytically without erasing interaction history. Also develop block variant that processes full swipe sessions as single Gram matrix solve (BlockNK).
Result: BlockNK achieves highest preference alignment (Align@20 = 0.698), strongest inter-session direction stability (delta = 0.994), and flattest degradation profile under label noise. Adaptive candidate pool filtering improves asymptotic alignment but may slow recovery from miscalibration.
Conclusion: The dominant practical gain over normalized Kaczmarz is removal of per-step normalization rather than Tikhonov constant alpha itself. BlockNK with batch processing and post-session normalization performs best for preference learning in reciprocal recommender systems.
Abstract: We present a family of Kaczmarz-based preference learning algorithms for real-time personalized matchmaking in reciprocal recommender systems. Post-step L2 normalization, common in Kaczmarz-inspired online learners, induces exponential recency bias: the influence of the t-th interaction decays as eta^(n - t), reaching approximately 1e-6 after just 20 swipes at eta = 0.5. We resolve this by replacing the normalization step with a Tikhonov-regularized projection denominator that bounds step size analytically without erasing interaction history. When candidate tag vectors are not pre-normalized, as in realistic deployments where candidates vary in tag density, the Tikhonov denominator ||a||^2 + alpha produces genuinely per-candidate adaptive step sizes, making it structurally distinct from online gradient descent with any fixed learning rate. We further derive a block variant that processes full swipe sessions as a single Gram matrix solve. Population-scale simulation over 6,400 swipes reveals that Block Normalized Kaczmarz (BlockNK), which combines the batch Gram solve with post-session L2 normalization, achieves the highest preference alignment (Align@20 = 0.698), the strongest inter-session direction stability (delta = 0.994), and the flattest degradation profile under label noise across flip ratios p_flip in [0.10, 0.35]. Experiments under cosine similarity subsampling further show that adaptively filtering the candidate pool toward the current preference direction substantially improves asymptotic alignment, at the cost of introducing a feedback loop that may slow recovery from miscalibration. The sequential Tikhonov-Kaczmarz method performs comparably to K-NoNorm under our simulation conditions, suggesting the dominant practical gain over normalized Kaczmarz is the removal of per-step normalization rather than the Tikhonov constant alpha itself.
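The central update is compact enough to state directly. A toy sketch contrasting the normalized Kaczmarz step with the Tikhonov-regularized projection w <- w + ((y - a.w) / (||a||^2 + alpha)) * a from the abstract; dimensions, the binary feedback model, and hyperparameters are illustrative, not the paper's simulation:

```python
# Normalized vs. Tikhonov-regularized Kaczmarz on a toy preference problem.
import numpy as np

def normalized_kaczmarz_step(w, a, y, eta=0.5):
    w = w + eta * (y - a @ w) / (a @ a) * a
    return w / np.linalg.norm(w)          # renormalization induces recency bias

def tikhonov_kaczmarz_step(w, a, y, alpha=1.0):
    # bounded, per-candidate adaptive step via ||a||^2 + alpha
    return w + (y - a @ w) / (a @ a + alpha) * a

rng = np.random.default_rng(1)
w_true = rng.normal(size=8)
w_true /= np.linalg.norm(w_true)
w0 = rng.normal(size=8) * 0.1
w_n, w_t = w0.copy(), w0.copy()
for _ in range(200):
    a = rng.normal(size=8) * rng.uniform(0.5, 2.0)   # varying tag density
    y = float(a @ w_true > 0)                        # binary swipe feedback
    w_n = normalized_kaczmarz_step(w_n, a, y)
    w_t = tikhonov_kaczmarz_step(w_t, a, y)

cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print("normalized Kaczmarz alignment:", round(cos(w_n, w_true), 3))
print("Tikhonov-Kaczmarz alignment :", round(cos(w_t, w_true), 3))
```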
[1092] Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning
Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Yequan Zhao, Yupeng Su, Zi Yang, Zheng Zhang
Main category: cs.LG
TL;DR: Muon² improves the Muon optimizer by adding Adam-style adaptive preconditioning before orthogonalization, reducing Newton-Schulz iterations by 40% while maintaining performance in large-scale model pre-training.
Details
Motivation: Muon optimizer shows promise for foundation model training but suffers from computational overhead due to multiple Newton-Schulz iterations needed for orthogonalization. The core issue is ill-conditioned momentum matrices that slow convergence.
Method: Muon² extends Muon by applying Adam-style adaptive second-moment preconditioning before orthogonalization. This improves the spectrum of momentum matrices, enabling faster convergence to sufficient orthogonalization. Also introduces Muon²-F, a memory-efficient factorized variant.
Result: Across GPT and LLaMA pre-training from 60M to 1.3B parameters, Muon² consistently outperforms Muon and recent variants while reducing NS iterations by 40%. Muon²-F preserves most gains with negligible memory overhead.
Conclusion: Muon² addresses Muon’s computational bottleneck through adaptive preconditioning, achieving practical efficiency gains for large-scale foundation model training while maintaining optimization quality.
Abstract: Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, its practical efficiency is limited by the need for multiple Newton–Schulz (NS) iterations per optimization step, which introduces non-trivial computation and communication overhead. We propose Muon$^2$, an extension of Muon that applies Adam-style adaptive second-moment preconditioning before orthogonalization. Our key insight is that the core challenge of polar approximation in Muon lies in the ill-conditioned momentum matrix, of which the spectrum is substantially improved by Muon$^2$, leading to faster convergence toward a practically sufficient orthogonalization. We further characterize the practical orthogonalization quality via directional alignment, under which Muon$^2$ demonstrates dramatic improvement over Muon at each polar step. Across GPT and LLaMA pre-training experiments from 60M to 1.3B parameters, Muon$^2$ consistently outperforms Muon and recent Muon variants while reducing NS iterations by 40%. We further introduce Muon$^2$-F, a memory-efficient factorized variant that preserves most of the gains of Muon$^2$ with negligible memory overhead.
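A conceptual sketch of the update order the abstract describes: Adam-style second-moment preconditioning applied to the momentum matrix, followed by Newton-Schulz orthogonalization. The NS coefficients are the values popularized by the original Muon release and the hyperparameters are common defaults; none of this is taken from the paper's code.

```python
# Conceptual Muon^2-style step: precondition the momentum, then orthogonalize.
import torch

def newton_schulz(M, steps=5):
    # odd-polynomial iteration approximating the polar factor of M
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def muon2_step(W, grad, state, lr=0.02, beta1=0.95, beta2=0.999, eps=1e-8):
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad        # momentum
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # second moment
    preconditioned = state["m"] / (state["v"].sqrt() + eps)     # Adam-style
    W -= lr * newton_schulz(preconditioned)                     # then polar step

W = torch.randn(64, 32)
state = {"m": torch.zeros_like(W), "v": torch.zeros_like(W)}
muon2_step(W, torch.randn_like(W), state)
```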
[1093] LoDAdaC: a unified local training-based decentralized framework with adaptive gradients and compressed communication
Wei Liu, Anweshit Panda, Ujwal Pandey, Haven Cook, George M. Slota, Naigang Wang, Jie Chen, Yangyang Xu
Main category: cs.LG
TL;DR: LoDAdaC: A decentralized distributed learning framework combining adaptive gradient methods (Adam-type updates) with multiple local training steps and compressed communication to achieve fast convergence and low communication cost.
Details
Motivation: Adaptive gradient methods like Adam show strong performance in deep learning and centralized distributed settings, but their convergence properties remain unexplored in decentralized settings with multiple local training steps (like federated learning). There's a need for efficient decentralized algorithms that combine fast convergence with low communication overhead.
Method: Proposes LoDAdaC - a unified Multiple Local Training (MLT) Decentralized framework with Adam-type updates and Compressed communication (CC). It supports various adaptive optimizers (AMSGrad, Adam, AdaGrad) and standard compressors (low-bit quantization, sparsification). Combines MLT for reduced communication frequency and CC for reduced communication volume per round.
Result: Theoretical complexity analysis proves combined advantages of fast convergence and low communication cost. Experiments on image classification and GPT-style language model training validate that LoDAdaC significantly outperforms existing decentralized algorithms in both convergence speed and communication efficiency.
Conclusion: LoDAdaC successfully addresses the gap in decentralized adaptive optimization by combining adaptive gradient methods with communication-efficient techniques, achieving superior performance in both theory and practice for decentralized distributed learning.
Abstract: In decentralized distributed learning, achieving fast convergence and low communication cost is essential for scalability and high efficiency. Adaptive gradient methods, such as Adam, have demonstrated strong practical performance in deep learning and centralized distributed settings. However, their convergence properties remain largely unexplored in decentralized settings involving multiple local training steps, such as federated learning. To address this limitation, we propose LoDAdaC, a unified multiple Local Training (MLT) Decentralized framework with Adam-type updates and Compressed communication (CC). LoDAdaC accommodates a broad class of optimizers for its local adaptive updates, including AMSGrad, Adam, and AdaGrad; it is compatible with standard (possibly biased) compressors such as low-bit quantization and sparsification. MLT and CC together enable LoDAdaC to achieve a multiplicative reduction in communication cost, while the adaptive updates enable fast convergence. We rigorously prove the combined advantage through complexity analysis. In addition, experiments on image classification and GPT-style language model training validate our theoretical findings and show that LoDAdaC significantly outperforms existing decentralized algorithms in terms of convergence speed and communication efficiency.
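The three ingredients (multiple local adaptive steps, compressed messages, decentralized averaging) compose naturally. A schematic sketch on a ring of four agents with toy quadratic objectives; the topology, sparsity level, and step counts are illustrative, and this is not the paper's algorithm verbatim:

```python
# Schematic: local Adam steps + top-k sparsified messages + ring averaging.
import torch

def topk_compress(x, k):
    # biased sparsification: keep the k largest-magnitude entries
    flat = x.flatten()
    out = torch.zeros_like(flat)
    idx = flat.abs().topk(k).indices
    out[idx] = flat[idx]
    return out.view_as(x)

def local_adam_steps(w, grad_fn, state, steps=5, lr=1e-2,
                     b1=0.9, b2=0.999, eps=1e-8):
    for _ in range(steps):                       # MLT: several local updates
        g = grad_fn(w)
        state["m"] = b1 * state["m"] + (1 - b1) * g
        state["v"] = b2 * state["v"] + (1 - b2) * g * g
        w = w - lr * state["m"] / (state["v"].sqrt() + eps)
    return w

n, d = 4, 100                                    # ring of four agents
ws = [torch.randn(d) for _ in range(n)]
states = [{"m": torch.zeros(d), "v": torch.zeros(d)} for _ in range(n)]
targets = [torch.randn(d) for _ in range(n)]     # toy local quadratics

for _ in range(50):
    ws = [local_adam_steps(w, lambda w_, t=t: w_ - t, s)
          for w, s, t in zip(ws, states, targets)]
    # CC: exchange sparsified models (real methods compress differences
    # with error feedback; averaging raw sparse models keeps this short)
    msgs = [topk_compress(w, k=10) for w in ws]
    ws = [(msgs[i] + msgs[(i - 1) % n] + msgs[(i + 1) % n]) / 3
          for i in range(n)]
print(torch.stack(ws).std(dim=0).mean())         # agents roughly agree
```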
[1094] Positive-Unlabelled Active Learning to Curate a Dataset for Orca Resident Interpretation
Bret Nestor, Bohan Yao, Jasmine Moore, Jasper Kanes
Main category: cs.LG
TL;DR: Large-scale acoustic dataset of Southern Resident Killer Whales and other marine mammals using weakly-supervised active learning with transformer classifiers.
Details
Motivation: To create the largest acoustic dataset of Southern Resident Killer Whales for conservation and research, addressing data scarcity for this critically endangered species.
Method: Systematic search of 30+ years of public hydrophone data using weakly-supervised, positive-unlabelled active learning strategy with transformer-based classifiers (WHISPER models).
Result: Curated 919 hours of SRKW audio; the resulting classifiers outperformed state-of-the-art models on 3 of 4 expert-annotated datasets, with WHISPER variants achieving 0.58-0.77 AUROC.
Conclusion: The comprehensive dataset enables unsupervised machine translation, habitat surveys, and conservation efforts for critically endangered marine mammals.
Abstract: This work presents the largest curation of Southern Resident Killer Whale (SRKW) acoustic data to date, also containing other marine mammals in their environment. We systematically search all available public archival hydrophone data within the SRKW habitat (over 30 years of audio data). The search consists of a weakly-supervised, positive-unlabelled, active learning strategy to identify all instances of marine mammals. The resulting transformer-based presence or absence classifiers outperform state-of-the-art classifiers on 3 of 4 expert-annotated datasets in terms of accuracy and energy efficiency. The fleet of WHISPER detection models range from 0.58 (0.48-0.67) AUROC with WHISPER-tiny to 0.77 (0.63-0.93) with WHISPER-large-v3. Our multiclass species classifier obtains a top-1 accuracy of 53.2% (11 train classes, 4 test classes) and our ecotype classifier obtains a top-1 accuracy of 33.6% (4 train classes, 5 test classes) on the DCLDE-2026 dataset. We yield 919 hours of SRKW data, 230 hours of Bigg’s orca data, 1374 hours of orca data from unlabelled ecotypes, 1501 hours of humpback data, 88 hours of sea lion data, 246 hours of pacific white-sided dolphin data, and over 784 hours of unspecified marine mammal data. This SRKW dataset is larger than DCLDE-2026, Ocean Networks Canada, and OrcaSound combined. The curated species labels are available under CC-BY 4.0 license, and the corresponding audio data are available under the licenses of the original owners. The comprehensive nature of this dataset makes it suitable for unsupervised machine translation, habitat usage surveys, and conservation endeavours for this critically endangered ecotype.
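The curation loop described above (train on confirmed positives against provisional negatives, score the archive, have an expert confirm the top detections) can be sketched generically. Everything below is a placeholder standing in for the paper's transformer models and human annotation: the classifier, feature space, query budget, and oracle.

```python
# Schematic positive-unlabelled active-learning loop for archive curation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pu_active_learning(features, positive_idx, oracle, rounds=5, budget=20,
                       seed=0):
    rng = np.random.default_rng(seed)
    positives = set(positive_idx)
    for _ in range(rounds):
        unlabelled = np.array([i for i in range(len(features))
                               if i not in positives])
        # PU heuristic: sample provisional negatives from the unlabelled pool
        neg = rng.choice(unlabelled, size=min(len(positives), len(unlabelled)),
                         replace=False)
        X = np.vstack([features[sorted(positives)], features[neg]])
        y = np.r_[np.ones(len(positives)), np.zeros(len(neg))]
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        # active learning: send the top-scoring unlabelled clips to an expert
        scores = clf.predict_proba(features[unlabelled])[:, 1]
        queried = unlabelled[np.argsort(scores)[-budget:]]
        positives |= {int(i) for i in queried if oracle(int(i))}
    return positives

# toy usage: two blobs; the oracle stands in for expert annotation
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 8)), rng.normal(3, 1, (500, 8))])
truth = np.r_[np.zeros(500), np.ones(500)]
found = pu_active_learning(X, [900, 901], oracle=lambda i: truth[i] == 1)
print(len(found), "confirmed positives")
```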
[1095] Towards Multi-Source Domain Generalization for Sleep Staging with Noisy Labels
Kening Wang, Di Wen, Yufan Chen, Ruiping Liu, Junwei Zheng, Jiale Wei, Kailun Yang, Rainer Stiefelhagen, Kunyu Peng
Main category: cs.LG
TL;DR: FF-TRUST: A domain-invariant multimodal sleep staging framework with joint time-frequency early learning regularization for robust sleep staging under noisy labels and domain shifts across multiple sources.
Details
Motivation: Sleep staging suffers from domain shifts across institutions/devices/populations and noisy annotations, but existing noisy-label learning methods degrade when domain shifts and label noise coexist. There's a need for robust multimodal learning that handles both challenges simultaneously.
Method: Proposes FF-TRUST framework with Joint Time-Frequency Early Learning Regularization (JTF-ELR) that jointly exploits temporal and spectral consistency with confidence-diversity regularization to improve robustness under noisy supervision in multi-source domain generalization settings.
Result: Experiments on five public datasets demonstrate consistent state-of-the-art performance under diverse symmetric and asymmetric noise settings. The method outperforms existing noisy-label learning approaches when domain shifts and label noise coexist.
Conclusion: FF-TRUST effectively addresses the underexplored problem of label-noise-robust multi-source domain generalization in sleep staging, establishing a new benchmark (NL-DGSS) and providing a solution that maintains performance despite noisy labels and domain shifts.
Abstract: Automatic sleep staging is a multimodal learning problem involving heterogeneous physiological signals such as EEG and EOG, which often suffer from domain shifts across institutions, devices, and populations. In practice, these data are also affected by noisy annotations, yet label-noise-robust multi-source domain generalization remains underexplored. We present the first benchmark for Noisy Labels in Multi-Source Domain-Generalized Sleep Staging (NL-DGSS) and show that existing noisy-label learning methods degrade substantially when domain shifts and label noise coexist. To address this challenge, we propose FF-TRUST, a domain-invariant multimodal sleep staging framework with Joint Time-Frequency Early Learning Regularization (JTF-ELR). By jointly exploiting temporal and spectral consistency together with confidence-diversity regularization, FF-TRUST improves robustness under noisy supervision. Experiments on five public datasets demonstrate consistent state-of-the-art performance under diverse symmetric and asymmetric noise settings. The benchmark and code will be made publicly available at https://github.com/KNWang970918/FF-TRUST.git.
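JTF-ELR builds on early-learning regularization (ELR; Liu et al., 2020), which keeps a per-sample moving average of predictions and penalizes drifting away from it, exploiting the fact that networks fit clean labels before memorizing noisy ones. A sketch of plain ELR from memory of that formulation; the joint time-frequency and confidence-diversity terms of the paper are not reproduced:

```python
# Plain early-learning regularization sketch (not the paper's JTF-ELR).
import torch
import torch.nn.functional as F

class ELRLoss:
    def __init__(self, num_samples, num_classes, beta=0.7, lam=3.0):
        self.targets = torch.zeros(num_samples, num_classes)
        self.beta, self.lam = beta, lam

    def __call__(self, logits, labels, idx):
        p = F.softmax(logits, dim=1)
        # per-sample moving average of (detached) predictions
        self.targets[idx] = (self.beta * self.targets[idx]
                             + (1 - self.beta) * p.detach())
        ce = F.cross_entropy(logits, labels)
        inner = (p * self.targets[idx]).sum(dim=1).clamp(max=1 - 1e-4)
        reg = torch.log(1.0 - inner).mean()  # pulls p toward its own history
        return ce + self.lam * reg

# usage: criterion = ELRLoss(len(dataset), n_classes)
#        loss = criterion(model(x), y, sample_indices)
```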
[1096] Closed-Form Concept Erasure via Double Projections
Chi Zhang, Jingpu Cheng, Zhixian Wang, Ping Liu
Main category: cs.LG
TL;DR: A linear transformation framework for concept erasure in generative models that removes unwanted concepts analytically without training, using proxy projections and constrained transformations in null spaces.
Details
Motivation: Modern generative models like diffusion models raise safety and ethical concerns, leading to interest in concept erasure. Existing approaches often require iterative optimization and may distort unrelated concepts, creating a need for a more efficient and precise solution.
Method: A two-step closed-form linear transformation: 1) compute proxy projection of target concept, 2) apply constrained transformation within left null space of known concept directions. This creates a deterministic, geometrically interpretable procedure requiring no training.
Result: The method matches or surpasses state-of-the-art performance in object and style erasure across Stable Diffusion variants and FLUX flow-matching models, while better preserving non-target concepts. It requires only seconds to apply.
Conclusion: The framework provides a lightweight, drop-in tool for controlled model editing, advancing safer and more responsible generative models through efficient, theory-grounded concept removal.
Abstract: While modern generative models such as diffusion-based architectures have enabled impressive creative capabilities, they also raise important safety and ethical risks. These concerns have led to growing interest in concept erasure, the process of removing unwanted concepts from model representations. Existing approaches often achieve strong erasure performance but rely on iterative optimization and may inadvertently distort unrelated concepts. In this work, we present a simple yet principled alternative: a linear transformation framework that achieves concept erasure analytically, without any training. Our method adapts a pretrained model through two sequential, closed-form steps: first, computing a proxy projection of the target concept, and second, applying a constrained transformation within the left null space of known concept directions. This design yields a deterministic and geometrically interpretable procedure for safe, efficient, and theory-grounded concept removal. Across a wide range of experiments, including object and style erasure on multiple Stable Diffusion variants and the flow-matching model (FLUX), our approach matches or surpasses the performance of state-of-the-art methods while preserving non-target concepts more faithfully. Requiring only a few seconds to apply, it offers a lightweight and drop-in tool for controlled model editing, advancing the goal of safer and more responsible generative models.
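The two closed-form steps lend themselves to a small geometric sketch. One straightforward reading of the recipe, erasing the target direction while restricting the edit to act outside the span of concepts that must be preserved, is shown below; it illustrates the geometry and is not the authors' exact operator.

```python
# Geometric sketch: closed-form erasure with a preserved-concept constraint.
import numpy as np

d = 16
rng = np.random.default_rng(0)
target = rng.normal(size=d); target /= np.linalg.norm(target)  # concept to erase
keep = rng.normal(size=(d, 3))                                 # concepts to preserve

# step 1: proxy projection removing the target direction
P_erase = np.eye(d) - np.outer(target, target)

# step 2: constrain the edit to the orthogonal complement of the kept span,
# so embeddings along `keep` pass through unchanged
Q, _ = np.linalg.qr(keep)                  # orthonormal basis of kept span
P_keep = Q @ Q.T
edit = (np.eye(d) - P_keep) @ (P_erase - np.eye(d))
W = np.eye(d) + edit                       # final closed-form linear map

v = 2.0 * target + keep[:, 0]
print("target component before:", v @ target)
print("target component after :", (W @ v) @ target)
print("kept component change  :", np.linalg.norm(P_keep @ (W @ v - v)))
```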
[1097] When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs
Jose Efraim Aguilar Escamilla, Haoyang Hong, Jiawei Li, Haoyu Zhao, Xuezhou Zhang, Sanghyun Hong, Huazheng Wang
Main category: cs.LG
TL;DR: First precise necessity and sufficiency characterization of attackability in linear MDPs under reward poisoning attacks, distinguishing vulnerable vs intrinsically robust RL instances.
Details
Motivation: Prior work focused on sufficient conditions for successful reward poisoning attacks, with limited discussion of infeasibility. Need precise characterization of when RL systems are vulnerable vs intrinsically robust to such attacks.
Method: Develop theoretical framework to characterize attackability of linear MDPs under reward poisoning attacks with budget constraints. Extend to deep RL by approximating environments as linear MDPs.
Result: Established first precise necessity and sufficiency conditions distinguishing vulnerable RL instances from intrinsically robust ones. Framework effectively identifies attackability in deep RL environments.
Conclusion: Provides fundamental theoretical characterization of RL vulnerability to reward poisoning attacks, with practical significance for both attack design and defense strategies.
Abstract: We study reward poisoning attacks in reinforcement learning (RL), where an adversary manipulates rewards within constrained budgets to force the target RL agent to adopt a policy that aligns with the attacker’s objectives. Prior works on reward poisoning mainly focused on sufficient conditions to design a successful attacker, while only a few studies discussed the infeasibility of targeted attacks. This paper provides the first precise necessity and sufficiency characterization of the attackability of a linear MDP under reward poisoning attacks. Our characterization draws a bright line between vulnerable RL instances and intrinsically robust ones that cannot be attacked without large costs, even when running vanilla non-robust RL algorithms. Our theory extends beyond linear MDPs: by approximating deep RL environments as linear MDPs, we show that our framework effectively distinguishes attackability and efficiently attacks the vulnerable instances, demonstrating both the theoretical and practical significance of our characterization.
[1098] Graph-RHO: Critical-path-aware Heterogeneous Graph Network for Long-Horizon Flexible Job-Shop Scheduling
Yujie Li, Jiuniu Wang, Mugen Peng, Guangzuo Li, Wenjia Xu
Main category: cs.LG
TL;DR: Graph-RHO: A critical-path-aware graph-based Rolling Horizon Optimization framework for solving long-horizon Flexible Job-Shop Scheduling problems with improved solution quality and computational efficiency.
Details
Motivation: Existing learning-based Rolling Horizon Optimization methods for Flexible Job-Shop Scheduling fail to capture intricate graph-structured dependencies, ignore asymmetric costs of prediction errors (where misclassifying critical-path operations is more detrimental), and use static pruning thresholds that don't adapt to dynamic predictive confidence during the rolling process.
Method: 1) Topology-aware heterogeneous graph network that encodes subproblems as operation-machine graphs with multi-relational edges using edge-feature-aware message passing; 2) Critical-path-aware mechanism that injects inductive biases during training to distinguish bottleneck operations; 3) Adaptive thresholding strategy that dynamically calibrates decision boundaries based on online uncertainty estimation.
Result: Establishes new state-of-the-art in solution quality and computational efficiency on standard benchmarks. Achieves exceptional zero-shot generalization, reducing solve time by over 30% on large-scale instances (2000 operations) while maintaining superior solution quality.
Conclusion: Graph-RHO effectively addresses the limitations of existing methods by capturing graph-structured dependencies, incorporating critical-path awareness, and using adaptive decision thresholds, leading to significant improvements in solving long-horizon Flexible Job-Shop Scheduling problems.
Abstract: Long-horizon Flexible Job-Shop Scheduling (FJSP) presents a formidable combinatorial challenge due to complex, interdependent decisions spanning extended time horizons. While learning-based Rolling Horizon Optimization (RHO) has emerged as a promising paradigm to accelerate solving by identifying and fixing invariant operations, its effectiveness is hindered by the structural complexity of FJSP. Existing methods often fail to capture intricate graph-structured dependencies and ignore the asymmetric costs of prediction errors, in which misclassifying critical-path operations is significantly more detrimental than misclassifying non-critical ones. Furthermore, dynamic shifts in predictive confidence during the rolling process make static pruning thresholds inadequate. To address these limitations, we propose Graph-RHO, a novel critical-path-aware graph-based RHO framework. First, we introduce a topology-aware heterogeneous graph network that encodes subproblems as operation-machine graphs with multi-relational edges, leveraging edge-feature-aware message passing to predict operation stability. Second, we incorporate a critical-path-aware mechanism that injects inductive biases during training to distinguish highly sensitive bottleneck operations from robust ones. Third, we devise an adaptive thresholding strategy that dynamically calibrates decision boundaries based on online uncertainty estimation to align model predictions with the solver’s search space. Extensive experiments on standard benchmarks demonstrate that Graph-RHO establishes a new state of the art in solution quality and computational efficiency. Remarkably, it exhibits exceptional zero-shot generalization, reducing solve time by over 30% on large-scale instances (2000 operations) while achieving superior solution quality. Our code is available at https://github.com/IntelliSensing/Graph-RHO.
[1099] Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs
Hongkang Li, Hancheng Min, Rene Vidal
Main category: cs.LG
TL;DR: Theoretical convergence analysis of transformer-based diffusion models for denoising multi-token Gaussian mixture data, showing self-attention implements mean denoising to approximate MMSE estimator.
Details
Motivation: Despite strong empirical performance of transformer-based diffusion models, theoretical understanding of why they work remains limited. The paper aims to provide first convergence analysis for training such models, addressing why transformers can match score functions and why gradient methods converge despite non-convex loss.
Method: Analyzes population DDPM objective for denoising multi-token Gaussian mixture data. Theoretically quantifies required tokens per data point and training iterations for global convergence to Bayes optimal risk. Investigates how self-attention implements mean denoising mechanism to approximate oracle MMSE estimator.
Result: Provides first convergence analysis for transformer-based diffusion models. Shows self-attention module implements mean denoising that approximates MMSE estimator. Numerical experiments validate theoretical findings.
Conclusion: Transformer-based diffusion models can theoretically converge to optimal denoising performance, with self-attention implementing effective mean denoising mechanism. This provides foundational theoretical understanding for empirical success of these models.
Abstract: Transformer-based diffusion models have demonstrated remarkable performance at generating high-quality samples. However, our theoretical understanding of the reasons for this success remains limited. For instance, existing models are typically trained by minimizing a denoising objective, which is equivalent to fitting the score function of the training data. However, we do not know why transformer-based models can match the score function for denoising, or why gradient-based methods converge to the optimal denoising model despite the non-convex loss landscape. To the best of our knowledge, this paper provides the first convergence analysis for training transformer-based diffusion models. More specifically, we consider the population Denoising Diffusion Probabilistic Model (DDPM) objective for denoising data that follow a multi-token Gaussian mixture distribution. We theoretically quantify the required number of tokens per data point and training iterations for the global convergence towards the Bayes optimal risk of the denoising objective, thereby achieving a desired score matching error. A deeper investigation reveals that the self-attention module of the trained transformer implements a mean denoising mechanism that enables the trained model to approximate the oracle Minimum Mean Squared Error (MMSE) estimator of the injected noise in the diffusion steps. Numerical experiments validate these findings.
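For concreteness, the oracle denoiser the trained transformer is shown to approximate has a closed softmax form. Under the standard DDPM forward process and, for brevity, a mixture over point masses {mu_k} with weights pi_k (a simplification of the paper's Gaussian mixture), the MMSE estimates are:

```latex
% Oracle MMSE denoiser for a point-mass mixture prior under the DDPM forward
% process x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon.
% The softmax structure is what a self-attention head can realize.
\mathbb{E}[x_0 \mid x_t] = \sum_k w_k(x_t)\,\mu_k,
\qquad
w_k(x_t) =
\frac{\pi_k \exp\!\big(-\tfrac{\|x_t-\sqrt{\bar\alpha_t}\,\mu_k\|^2}{2(1-\bar\alpha_t)}\big)}
     {\sum_j \pi_j \exp\!\big(-\tfrac{\|x_t-\sqrt{\bar\alpha_t}\,\mu_j\|^2}{2(1-\bar\alpha_t)}\big)},
\qquad
\hat\epsilon(x_t,t) =
\frac{x_t-\sqrt{\bar\alpha_t}\,\mathbb{E}[x_0\mid x_t]}{\sqrt{1-\bar\alpha_t}}.
```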
[1100] Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
Zunhai Su, Hengyuan Zhang, Wei Wu, Yifan Zhang, Yaxiu Liu, He Xiao, Qingyao Yang, Yuxuan Sun, Rui Yang, Chao Zhang, Keyu Fan, Weihao Ye, Jing Xiong, Hui Shen, Chaofan Tao, Taiqiang Wu, Zhongwei Wan, Yulei Qian, Yuchen Xie, Ngai Wong
Main category: cs.LG
TL;DR: First comprehensive survey on Attention Sink phenomenon in Transformers, covering fundamental utilization, mechanistic interpretation, and strategic mitigation approaches.
Details
Motivation: Attention Sink (AS) is a persistent challenge in Transformers where disproportionate attention focuses on uninformative tokens, complicating interpretability, affecting training/inference dynamics, and exacerbating issues like hallucinations. Despite substantial research, no comprehensive survey exists to consolidate AS-related research and guide future advancements.
Method: Systematic survey structured around three key dimensions: 1) Fundamental Utilization - how AS is leveraged in various applications, 2) Mechanistic Interpretation - understanding why AS occurs, and 3) Strategic Mitigation - techniques to address AS issues. The survey consolidates existing literature and provides a taxonomy of approaches.
Result: Provides the first comprehensive survey on Attention Sink, clarifying key concepts, tracing evolution and trends, and offering a definitive resource for researchers. Includes a curated paper list available on GitHub for community access.
Conclusion: This survey serves as a pivotal resource to help researchers manage AS within current Transformer paradigms while inspiring innovations for next-generation architectures. It addresses a critical gap in the literature by systematically organizing AS research across multiple dimensions.
Abstract: As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, significantly affecting the training and inference dynamics, and exacerbates issues such as hallucinations. In recent years, substantial research has been dedicated to understanding and harnessing AS. However, a comprehensive survey that systematically consolidates AS-related research and offers guidance for future advancements remains lacking. To address this gap, we present the first survey on AS, structured around three key dimensions that define the current research landscape: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Our work provides a pivotal contribution by clarifying key concepts and guiding researchers through the evolution and trends of the field. We envision this survey as a definitive resource, empowering researchers and practitioners to effectively manage AS within the current Transformer paradigm, while simultaneously inspiring innovative advancements for the next generation of Transformers. The paper list of this work is available at https://github.com/ZunhaiSu/Awesome-Attention-Sink.
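The phenomenon itself is easy to observe. A small diagnostic sketch measuring the attention mass that later queries place on the first token, per layer; the model choice and prompt are arbitrary, and any Hugging Face model that returns attentions behaves similarly:

```python
# Diagnostic: how much attention mass does each layer put on token 0?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # assumed small demo model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

inputs = tok("Attention sinks concentrate mass on early tokens.",
             return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

for layer, attn in enumerate(out.attentions):    # (batch, heads, q, k)
    sink_mass = attn[0, :, 1:, 0].mean().item()  # mass on token 0, later queries
    print(f"layer {layer:2d}: mean attention to first token = {sink_mass:.3f}")
```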
[1101] End-to-end Automated Deep Neural Network Optimization for PPG-based Blood Pressure Estimation on Wearables
Francesco Carlucci, Giovanni Pollo, Xiaying Wang, Massimo Poncino, Enrico Macii, Luca Benini, Sara Vinco, Alessio Burrello, Daniele Jahier Pagliari
Main category: cs.LG
TL;DR: A hardware-aware neural architecture search pipeline optimizes PPG-based blood pressure estimation models for deployment on ultra-low-power wearable SoCs, achieving significant parameter reduction while maintaining accuracy.
Details
Motivation: PPG-based blood pressure estimation on wearables requires on-board processing for data confidentiality, but existing DNNs have excessive memory, computation, and energy requirements that hinder deployment on resource-constrained devices.
Method: Combines hardware-aware neural architecture search (NAS), pruning, and mixed-precision search (MPS) to generate compact BP prediction models optimized for ultra-low-power multicore SoCs, starting from state-of-the-art baseline models.
Result: Optimized networks achieve up to 7.99% lower error with 7.5x parameter reduction, or up to 83x fewer parameters with negligible accuracy loss. All models fit within 512 kB memory, require <55 kB, with 142 ms latency and 7.25 mJ energy consumption. Patient-specific fine-tuning improves accuracy by up to 64%.
Conclusion: The automated DNN design pipeline enables fully autonomous, low-cost BP monitoring on wearables by creating accurate yet compact models optimized for ultra-low-power SoCs, addressing deployment challenges while maintaining data confidentiality.
Abstract: Photoplethysmography (PPG)-based blood pressure (BP) estimation is a challenging task, particularly on resource-constrained wearable devices. However, fully on-board processing is desirable to ensure user data confidentiality. Recent deep neural networks (DNNs) have achieved high BP estimation accuracy by reconstructing BP waveforms or directly regressing BP values, but their large memory, computation, and energy requirements hinder deployment on wearables. This work introduces a fully automated DNN design pipeline that combines hardware-aware neural architecture search (NAS), pruning, and mixed-precision search (MPS) to generate accurate yet compact BP prediction models optimized for ultra-low-power multicore systems-on-chip (SoCs). Starting from state-of-the-art baseline models on four public datasets, our optimized networks achieve up to 7.99% lower error with a 7.5x parameter reduction, or up to 83x fewer parameters with negligible accuracy loss. All models fit within 512 kB of memory on our target SoC (GreenWaves’ GAP8), requiring less than 55 kB and achieving an average inference latency of 142 ms and energy consumption of 7.25 mJ. Patient-specific fine-tuning further improves accuracy by up to 64%, enabling fully autonomous, low-cost BP monitoring on wearables.
[1102] Consensus-based Recursive Multi-Output Gaussian Process
Yogesh Prasanna Kumar Rao, Tamas Keviczky, Raj Thilak Rajan
Main category: cs.LG
TL;DR: Distributed multi-output Gaussian process framework for scalable, uncertainty-aware learning in multi-agent sensing applications
Details
Motivation: Traditional multi-output Gaussian processes are computationally expensive and centralized, making them unsuitable for large-scale, distributed, and streaming applications where multiple agents need to learn vector-valued fields collaboratively.
Method: Proposes Consensus-based Recursive Multi-Output Gaussian Process (CRMGP) that combines recursive inference on shared basis vectors with neighbor-to-neighbor information-consensus updates, enabling parallel, fully distributed learning with bounded per-step computation.
Result: Experiments on synthetic wind fields and real LiDAR data show CRMGP achieves competitive predictive performance and reliable uncertainty calibration while preserving inter-output correlations.
Conclusion: CRMGP provides a scalable alternative to centralized Gaussian process models for multi-agent sensing applications, enabling distributed uncertainty-aware learning of vector-valued fields.
Abstract: Multi-output Gaussian Processes provide principled uncertainty-aware learning of vector-valued fields but are difficult to deploy in large-scale, distributed, and streaming settings due to their computational and centralized nature. This paper proposes a Consensus-based Recursive Multi-Output Gaussian Process (CRMGP) framework that combines recursive inference on shared basis vectors with neighbour-to-neighbour information-consensus updates. The resulting method supports parallel, fully distributed learning with bounded per-step computation while preserving inter-output correlations and calibrated uncertainty. Experiments on synthetic wind fields and real LiDAR data demonstrate that CRMGP achieves competitive predictive performance and reliable uncertainty calibration, offering a scalable alternative to centralized Gaussian process models for multi-agent sensing applications.
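A schematic sketch of the two mechanisms named above: recursive updates over a shared set of basis vectors kept in information form, plus neighbor-to-neighbor averaging of those information quantities. It is a single-output toy (the multi-output coregionalization structure is omitted), and the kernel, noise level, and topology are assumptions:

```python
# Recursive basis-vector GP in information form + consensus averaging.
import numpy as np

rng = np.random.default_rng(0)
Z = np.linspace(0, 1, 15)[:, None]             # shared basis (inducing) points
kern = lambda A, B: np.exp(-0.5 * (A - B.T) ** 2 / 0.1 ** 2)
Kzz = kern(Z, Z) + 1e-6 * np.eye(len(Z))

class Agent:
    def __init__(self):
        self.Lam = np.linalg.inv(Kzz)          # prior information matrix
        self.eta = np.zeros(len(Z))            # prior information vector

    def recursive_update(self, x, y, noise=0.05):
        # project the new observation onto the shared basis, rank-1 update
        phi = np.linalg.solve(Kzz, kern(Z, np.atleast_2d(x))).ravel()
        self.Lam += np.outer(phi, phi) / noise ** 2
        self.eta += phi * y / noise ** 2

agents = [Agent() for _ in range(3)]
f = lambda x: np.sin(2 * np.pi * x)
for agent, lo in zip(agents, (0.0, 0.33, 0.66)):   # each sees one region
    for x in rng.uniform(lo, lo + 0.34, 30):
        agent.recursive_update(x, f(x) + 0.05 * rng.normal())

for _ in range(10):                            # neighbor averaging on a ring
    Lams = [a.Lam.copy() for a in agents]
    etas = [a.eta.copy() for a in agents]
    for i, a in enumerate(agents):
        l, r = (i - 1) % 3, (i + 1) % 3
        a.Lam = (Lams[i] + Lams[l] + Lams[r]) / 3
        a.eta = (etas[i] + etas[l] + etas[r]) / 3

# the consensus fixed point is the network average of (Lam, eta); the
# posterior mean Lam^{-1} eta is invariant to that common 1/n scaling
mean_u = np.linalg.solve(agents[0].Lam, agents[0].eta)
print(np.round(mean_u[:5], 2))                 # fused estimate at basis points
```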
[1103] A Temporally Augmented Graph Attention Network for Affordance Classification
Ami Chopra, Supriya Bordoloi, Shyamanta M. Hazarika
Main category: cs.LG
TL;DR: EEG-tGAT extends GATv2 with temporal attention and dropout for affordance classification from EEG interaction sequences, showing improved performance by explicitly modeling temporal importance.
Details
Motivation: Existing graph attention networks like GAT operate on static graphs and rely on implicit temporal aggregation for sequential data, which is insufficient for affordance classification where temporal dimensions are not semantically uniform and discriminative information is unevenly distributed across time.
Method: Proposes EEG-temporal Graph Attention Network (EEG-tGAT), a temporally augmented formulation of GATv2 that incorporates temporal attention to modulate contributions of different time segments and temporal dropout to regularize learning across temporally correlated observations.
Result: Experimental results on affordance datasets show that EEG-tGAT achieves improved classification performance compared to GATv2, demonstrating that explicitly encoding temporal importance and enforcing temporal robustness introduce beneficial inductive biases.
Conclusion: Modest architectural changes to graph attention models can yield consistent benefits when temporal relationships play a nontrivial role in tasks like affordance classification from interaction sequences.
Abstract: Graph attention networks (GATs) are among the strongest frameworks for learning node representations in relational data, but existing variants such as the Graph Attention Network (GAT) mainly operate on static graphs and rely on implicit temporal aggregation when applied to sequential data. In this paper, we introduce the Electroencephalography-temporal Graph Attention Network (EEG-tGAT), a temporally augmented formulation of GATv2 tailored for affordance classification from interaction sequences. The proposed model incorporates temporal attention to modulate the contribution of different time segments and temporal dropout to regularize learning across temporally correlated observations. The design reflects the assumption that temporal dimensions in affordance data are not semantically uniform and that discriminative information may be unevenly distributed across time. Experimental results on affordance datasets show that EEG-tGAT achieves improved classification performance compared to GATv2. The observed gains suggest that explicitly encoding temporal importance and enforcing temporal robustness introduce inductive biases better aligned with the structure of affordance-driven interaction data. These findings indicate that modest architectural changes to graph attention models can yield consistent benefits when temporal relationships play a nontrivial role in the task.
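The two temporal augmentations are simple modules that can wrap any per-timestep graph encoder. A minimal sketch with placeholder shapes; this is not the paper's architecture:

```python
# Temporal attention pooling and time-axis dropout around per-step embeddings.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                  # h: (batch, time, dim)
        w = torch.softmax(self.score(h).squeeze(-1), dim=-1)  # (batch, time)
        return (w.unsqueeze(-1) * h).sum(dim=1)               # weighted pool

class TemporalDropout(nn.Module):
    def __init__(self, p=0.2):
        super().__init__()
        self.p = p

    def forward(self, h):                  # drops whole time segments
        if not self.training:
            return h
        mask = (torch.rand(h.shape[:2], device=h.device) > self.p).float()
        return h * mask.unsqueeze(-1) / (1 - self.p)

# usage: per-segment GAT embeddings -> temporal dropout -> attention pooling
h = torch.randn(8, 12, 64)                 # (batch, segments, pooled node dim)
pooled = TemporalAttention(64)(TemporalDropout()(h))
print(pooled.shape)                        # torch.Size([8, 64])
```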
[1104] Tracing the Thought of a Grandmaster-level Chess-Playing Transformer
Rui Lin, Zhenyu Jin, Guancheng Zhou, Xuyang Ge, Wentao Shu, Jiaxing Wu, Junxuan Wang, Zhengfu He, Junping Zhang, Xipeng Qiu
Main category: cs.LG
TL;DR: Sparse decomposition framework for interpreting transformer neural networks in chess (Leela Chess Zero), analyzing MLP and attention modules to understand internal computation and tactical reasoning.
Details
Motivation: Modern transformers achieve superhuman performance in reasoning tasks like chess, but their internal computation remains opaque. The paper aims to interpret the internal workings of Leela Chess Zero (LC0) to understand how it performs advanced tactical reasoning.
Method: Introduces sparse decomposition framework using sparse replacement layers to decompose MLP and attention modules. Combines sparse replacement layers with causal interventions to analyze internal computation pathways. Uses three quantitative metrics to measure parallel reasoning behavior.
Result: The sparse pathways expose rich, interpretable tactical considerations that are empirically verifiable. LC0 exhibits parallel reasoning behavior consistent with its policy head architecture’s inductive bias. First work to decompose transformer computation on both MLP and attention modules for interpretability.
Conclusion: The framework provides comprehensive understanding of advanced tactical reasoning in superhuman systems, offering critical insights into transformer mechanisms. Code is publicly available for reproducibility.
Abstract: While modern transformer neural networks achieve grandmaster-level performance in chess and other reasoning tasks, their internal computation process remains largely opaque. Focusing on Leela Chess Zero (LC0), we introduce a sparse decomposition framework to interpret its internal computation by decomposing its MLP and attention modules with sparse replacement layers, which capture the primary computation process of LC0. We conduct a detailed case study showing that these pathways expose rich, interpretable tactical considerations that are empirically verifiable. We further introduce three quantitative metrics and show that LC0 exhibits parallel reasoning behavior consistent with the inductive bias of its policy head architecture. To the best of our knowledge, this is the first work to decompose the internal computation of a transformer on both MLP and attention modules for interpretability. Combining sparse replacement layers and causal interventions in LC0 provides a comprehensive understanding of advanced tactical reasoning, offering critical insights into the underlying mechanisms of superhuman systems. Our code is available at https://github.com/JacklE0niden/Leela-SAEs.
[1105] Virtual Smart Metering in District Heating Networks via Heterogeneous Spatial-Temporal Graph Neural Networks
Keivan Faghih Niresi, Christian Møller Jensen, Carsten Skovmose Kallesøe, Rafael Wisniewski, Olga Fink
Main category: cs.LG
TL;DR: Proposes a heterogeneous spatial-temporal graph neural network (HSTGNN) for virtual smart heat meters in district heating systems, addressing sparse instrumentation and sensor faults through data-driven modeling of coupled thermal-hydraulic states.
Details
Motivation: District heating systems suffer from sparse instrumentation and sensor faults, limiting observability needed for intelligent operation. Existing methods either assume dense synchronized data or rely on simplified analytical models that don't capture complex network behavior. Lack of public benchmark datasets also hinders systematic comparison of virtual sensing approaches.
Method: Develops a heterogeneous spatial-temporal graph neural network (HSTGNN) that incorporates functional relationships in district heating networks. Uses dedicated branches to learn graph structures and temporal dynamics for flow, temperature, and pressure measurements, enabling joint modeling of cross-variable and spatial correlations.
Result: The proposed HSTGNN significantly outperforms existing baselines. The authors also introduce a controlled laboratory dataset from Aalborg Smart Water Infrastructure Laboratory with synchronized high-resolution measurements representative of real operating conditions.
Conclusion: The HSTGNN approach effectively addresses virtual sensing challenges in district heating systems by modeling coupled nonlinear dependencies between pressure, flow, and temperature under realistic conditions, while the released dataset supports further research.
Abstract: Intelligent operation of thermal energy networks aims to improve energy efficiency, reliability, and operational flexibility through data-driven control, predictive optimization, and early fault detection. Achieving these goals relies on sufficient observability, requiring continuous and well-distributed monitoring of thermal and hydraulic states. However, district heating systems are typically sparsely instrumented and frequently affected by sensor faults, limiting monitoring. Virtual sensing offers a cost-effective means to enhance observability, yet its development and validation remain limited in practice. Existing data-driven methods generally assume dense synchronized data, while analytical models rely on simplified hydraulic and thermal assumptions that may not adequately capture the behavior of heterogeneous network topologies. Consequently, modeling the coupled nonlinear dependencies between pressure, flow, and temperature under realistic operating conditions remains challenging. In addition, the lack of publicly available benchmark datasets hinders systematic comparison of virtual sensing approaches. To address these challenges, we propose a heterogeneous spatial-temporal graph neural network (HSTGNN) for constructing virtual smart heat meters. The model incorporates the functional relationships inherent in district heating networks and employs dedicated branches to learn graph structures and temporal dynamics for flow, temperature, and pressure measurements, thereby enabling the joint modeling of cross-variable and spatial correlations. To support further research, we introduce a controlled laboratory dataset collected at the Aalborg Smart Water Infrastructure Laboratory, providing synchronized high-resolution measurements representative of real operating conditions. Extensive experiments demonstrate that the proposed approach significantly outperforms existing baselines.
[1106] Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks
Yuto Omae, Kazuki Sakai, Yohei Kakimoto, Makoto Sasaki, Yusuke Sakai, Hirotaka Takahashi
Main category: cs.LG
TL;DR: Derives closed-form upper bound for Hessian maximum eigenvalue in smooth nonlinear multilayer neural networks using Wolkowicz-Styan bound to characterize loss sharpness analytically.
Details
Motivation: Understanding relationship between loss geometry and generalization in neural networks, particularly how sharpness (Hessian eigenspectrum) affects generalization. Existing analyses limited to simplified architectures, lacking theoretical analysis for smooth nonlinear multilayer networks.
Method: Focuses on nonlinear smooth multilayer neural networks, derives closed-form upper bound for maximum eigenvalue of Hessian with respect to cross-entropy loss using Wolkowicz-Styan bound. Bound expressed as function of affine transformation parameters, hidden layer dimensions, and orthogonality among training samples.
Result: Provides analytical characterization of loss sharpness in smooth nonlinear multilayer neural networks via closed-form expression, avoiding explicit numerical eigenspectrum computation.
Conclusion: Offers theoretical analysis of loss geometry in realistic neural network architectures, contributing to understanding of generalization in deep learning through analytical sharpness characterization.
Abstract: Neural networks (NNs) are central to modern machine learning and achieve state-of-the-art results in many applications. However, the relationship between loss geometry and generalization is still not well understood. The local geometry of the loss function near a critical point is well-approximated by its quadratic form, obtained through a second-order Taylor expansion. The coefficients of the quadratic term correspond to the Hessian matrix, whose eigenspectrum allows us to evaluate the sharpness of the loss at the critical point. Extensive research suggests flat critical points generalize better, while sharp ones lead to higher generalization error. However, sharpness requires the Hessian eigenspectrum, but general matrix characteristic equations have no closed-form solution. Therefore, most existing studies on evaluating loss sharpness rely on numerical approximation methods. Existing closed-form analyses of the eigenspectrum are primarily limited to simplified architectures, such as linear or ReLU-activated networks; consequently, theoretical analysis of smooth nonlinear multilayer neural networks remains limited. Against this background, this study focuses on nonlinear, smooth multilayer neural networks and derives a closed-form upper bound for the maximum eigenvalue of the Hessian with respect to the cross-entropy loss by leveraging the Wolkowicz-Styan bound. Specifically, the derived upper bound is expressed as a function of the affine transformation parameters, hidden layer dimensions, and the degree of orthogonality among the training samples. The primary contribution of this paper is an analytical characterization of loss sharpness in smooth nonlinear multilayer neural networks via a closed-form expression, avoiding explicit numerical eigenspectrum computation. We hope that this work provides a small yet meaningful step toward unraveling the mysteries of deep learning.
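For reference, the classical bounds the derivation leverages: for a symmetric H in R^{n x n}, using only the traces of H and H^2 (both computable without an eigendecomposition), Wolkowicz and Styan give

```latex
% Wolkowicz--Styan eigenvalue bounds for symmetric H \in \mathbb{R}^{n\times n},
% with m = \operatorname{tr}(H)/n and s^2 = \operatorname{tr}(H^2)/n - m^2:
m + \frac{s}{\sqrt{n-1}} \;\le\; \lambda_{\max}(H) \;\le\; m + s\sqrt{n-1}.
```

The paper's contribution is then to express these trace quantities in closed form for the cross-entropy Hessian of a smooth multilayer network, in terms of the affine parameters, layer widths, and the orthogonality of the training samples.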
[1107] Mild Over-Parameterization Benefits Asymmetric Tensor PCA
Shihong Ding, Weicheng Lin, Cong Fang
Main category: cs.LG
TL;DR: Novel three-phase alternating-update algorithm for Asymmetric Tensor PCA with matrix parameterization achieves near-optimal sample complexity using only d² memory, improving over existing d^⌈k/2⌉ requirements.
Details
Motivation: Existing ATPCA algorithms require prohibitively large memory (d^⌈k/2⌉) for signal recovery, creating computational bottlenecks. The paper aims to develop tractable algorithms whose memory cost does not scale with d^k while maintaining strong theoretical guarantees.
Method: Proposes a matrix-parameterized method using three-phase alternating-update algorithm with gradient descent under limited memory budget. Uses mild over-parameterization to improve sample efficiency and adaptivity to problem structure.
Result: Achieves near-optimal d^(k-2) sample complexity in limited memory setting, with adaptivity to problem structure where sample size decreases as vectors become more aligned. In symmetric limit attains d^(k/2) complexity matching best known polynomial-time results.
Conclusion: First tractable algorithm for ATPCA with d^k-independent memory costs, demonstrating how mild over-parameterization enables efficient learning with limited memory while maintaining theoretical guarantees.
Abstract: Asymmetric Tensor PCA (ATPCA) is a prototypical model for studying the trade-offs between sample complexity, computation, and memory. Existing algorithms for this problem typically require at least $d^{\left\lceil\overline{k}/2\right\rceil}$ state memory cost to recover the signal, where $d$ is the vector dimension and $\overline{k}$ is the tensor order. We focus on the setting where $\overline{k} \geq 4$ is even and consider (stochastic) gradient descent-based algorithms under a limited memory budget, which permits only mild over-parameterization of the model. We propose a matrix-parameterized method (in $d^{2}$ state memory cost) using a novel three-phase alternating-update algorithm to address the problem and demonstrate how mild over-parameterization facilitates learning in two key aspects: (i) it improves sample efficiency, allowing our method to achieve \emph{near-optimal} $d^{\overline{k}-2}$ sample complexity in our limited memory setting; and (ii) it enhances adaptivity to problem structure, a previously unrecognized phenomenon, where the required sample size naturally decreases as consecutive vectors become more aligned, and in the symmetric limit attains $d^{\overline{k}/2}$, matching the \emph{best} known polynomial-time complexity. To our knowledge, this is the \emph{first} tractable algorithm for ATPCA with $d^{\overline{k}}$-independent memory costs.
[1108] Exploring the impact of fairness-aware criteria in AutoML
Joana Simões, João Correia
Main category: cs.LG
TL;DR: Integrating fairness metrics into AutoML pipeline optimization improves fairness by 14.5% with only 9.4% predictive performance decrease, while reducing data usage by 35.7% and producing simpler models.
Details
Motivation: AutoML frameworks focus on maximizing predictive performance, potentially intensifying discriminatory behaviors from biased data. Previous fairness research only addressed model selection/hyperparameter tuning, neglecting other critical ML pipeline stages.
Method: Integrates complementary fairness metrics directly into AutoML optimization component that constructs complete ML pipelines (data selection, transformations, model selection, tuning). Uses multiple fairness metrics to capture different fairness dimensions during optimization.
Result: Compared to performance-only baseline: 14.5% average fairness improvement, 9.4% predictive power decrease, 35.7% data usage reduction. Produced simpler final solutions, showing model complexity not always needed for fair ML.
Conclusion: Fairness integration in AutoML optimization yields measurable improvements in fairness with moderate performance trade-offs, reduces data usage, and produces simpler yet effective solutions, challenging the need for complex models for fairness.
Abstract: Machine Learning (ML) systems are increasingly used to support decision-making processes that affect individuals. However, these systems often rely on biased data, which can lead to unfair outcomes against specific groups. With the growing adoption of Automated Machine Learning (AutoML), the risk of intensifying discriminatory behaviours increases, as most frameworks primarily focus on model selection to maximise predictive performance. Previous research on fairness in AutoML has largely followed this trend, integrating fairness awareness only into model selection or hyperparameter tuning, while neglecting other critical stages of the ML pipeline. This paper aims to study the impact of integrating fairness directly into the optimisation component of an AutoML framework that constructs complete ML pipelines, from data selection and transformations to model selection and tuning. As selecting appropriate fairness metrics remains a key challenge, our work incorporates complementary fairness metrics to capture different dimensions of fairness during the optimisation. Their integration within AutoML resulted in measurable differences compared to a baseline focused solely on predictive performance. Despite a 9.4% decrease in predictive power, the average fairness improved by 14.5%, accompanied by a 35.7% reduction in data usage. Furthermore, fairness integration produced complete yet simpler final solutions, suggesting that model complexity is not always required to achieve balanced and fair ML solutions.
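To make "fairness in the objective" concrete, here is a minimal sketch of a combined fitness score for pipeline search; the single demographic-parity metric and the weight `lam` are our simplifications of the paper's multi-metric setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def demographic_parity_gap(y_pred, group):
    # |P(y_hat = 1 | g = 0) - P(y_hat = 1 | g = 1)|, one common fairness metric
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def fitness(model, X, y, group, lam=1.0):
    # Combined objective: reward accuracy, penalize unfairness. The paper
    # optimizes several complementary fairness metrics; we show one.
    y_pred = model.predict(X)
    return accuracy_score(y, y_pred) - lam * demographic_parity_gap(y_pred, group)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
group = (rng.random(500) < 0.5).astype(int)
y = (X[:, 0] + 0.5 * group + 0.3 * rng.standard_normal(500) > 0).astype(int)
print(fitness(LogisticRegression().fit(X, y), X, y, group))
```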
[1109] A Multi-head Attention Fusion Network for Industrial Prognostics under Discrete Operational Conditions
Yuqi Su, Xiaolei Fang
Main category: cs.LG
TL;DR: A multi-head attention-based fusion neural network for prognostic modeling under dynamically changing operating conditions, integrating degradation trends, operating states, and residual noise.
Details
Motivation: Complex systems like aircraft engines operate under dynamically changing conditions that substantially influence degradation behavior, making prognostic modeling challenging due to the need to explicitly consider operational effects.
Method: Proposes a novel multi-head attention-based fusion neural network that explicitly models three signal components: monotonic degradation trend, discrete operating states (identified through clustering and encoded into dense embeddings), and residual random noise. Uses BiLSTM networks with attention mechanisms to capture temporal dependencies, and a fusion module to integrate degradation-trend and operating-state outputs.
Result: Validated using a dataset from the NASA repository, with results demonstrating the method’s effectiveness.
Conclusion: The proposed framework successfully addresses the challenge of prognostic modeling under dynamically changing operating conditions by explicitly modeling and integrating degradation trends, operating states, and residual noise through attention-based fusion.
Abstract: Complex systems such as aircraft engines, turbines, and industrial machinery often operate under dynamically changing conditions. These varying operating conditions can substantially influence degradation behavior and make prognostic modeling more challenging, as accurate prediction requires explicit consideration of operational effects. To address this issue, this paper proposes a novel multi-head attention-based fusion neural network. The proposed framework explicitly models and integrates three signal components: (1) the monotonic degradation trend, which reflects the underlying deterioration of the system; (2) discrete operating states, identified through clustering and encoded into dense embeddings; and (3) residual random noise, which captures unexplained variation in sensor measurements. The core strength of the framework lies in its architecture, which combines BiLSTM networks with attention mechanisms to better capture complex temporal dependencies. The attention mechanism allows the model to adaptively weight different time steps and sensor signals, improving its ability to extract prognostically relevant information. In addition, a fusion module is designed to integrate the outputs from the degradation-trend branch and the operating-state embeddings, enabling the model to capture their interactions more effectively. The proposed method is validated using a dataset from the NASA repository, and the results demonstrate its effectiveness.
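A minimal PyTorch sketch of the overall shape of such a fusion model; the layer sizes, pooling choices, and module names here are ours, not the paper's:

```python
import torch
import torch.nn as nn

class TrendStateFusion(nn.Module):
    """Minimal sketch: a BiLSTM encodes sensor signals, discrete operating
    states become dense embeddings, self-attention weights time steps, and a
    fusion head regresses remaining useful life (RUL)."""
    def __init__(self, n_sensors, n_states, hidden=64, emb=16):
        super().__init__()
        self.lstm = nn.LSTM(n_sensors, hidden, batch_first=True, bidirectional=True)
        self.state_emb = nn.Embedding(n_states, emb)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * hidden + emb, 1)

    def forward(self, signals, states):
        # signals: (B, T, n_sensors); states: (B, T) integer state ids
        h, _ = self.lstm(signals)               # (B, T, 2 * hidden)
        h, _ = self.attn(h, h, h)               # adaptively weight time steps
        s = self.state_emb(states).mean(dim=1)  # pooled operating-state embedding
        fused = torch.cat([h.mean(dim=1), s], dim=-1)
        return self.head(fused)                 # predicted RUL

model = TrendStateFusion(n_sensors=14, n_states=6)
rul = model(torch.randn(8, 50, 14), torch.randint(0, 6, (8, 50)))
```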
[1110] The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks
Mani Rash Ahmadi
Main category: cs.LG
TL;DR: The paper shows that in Kuramoto oscillator networks, phase displacement under weak output nudging equals the gradient of loss with respect to natural frequencies, enabling frequency learning that outperforms coupling-weight learning on sparse layered architectures.
Details
Motivation: Previous oscillator equilibrium propagation work excluded natural frequencies as learnable parameters, but this paper aims to demonstrate that frequency learning can be effective and even superior to coupling-weight learning in certain architectures.Method: The authors prove mathematically that physical phase displacement under weak output nudging equals the gradient of loss with respect to natural frequencies. They test this on sparse layered architectures, comparing frequency learning vs. coupling-weight learning, and introduce topology-aware spectral seeding to address convergence issues.
Result: Frequency learning outperforms coupling-weight learning (96.0% vs 83.3% accuracy at matched parameter counts). Topology-aware spectral seeding eliminates convergence failures (from 46/100 to 100/100 seeds on primary task, and 50/50 on other tasks).
Conclusion: Natural frequencies are viable learnable parameters in oscillator networks, with frequency learning outperforming coupling-weight learning in sparse layered architectures. Convergence issues are due to loss landscape properties, not gradient errors, and can be resolved with proper initialization.
Abstract: We prove that in a coupled Kuramoto oscillator network at stable equilibrium, the physical phase displacement under weak output nudging is the gradient of the loss with respect to natural frequencies, with equality as the nudging strength beta tends to zero. Prior oscillator equilibrium propagation work explicitly set aside natural frequency as a learnable parameter; we show that on sparse layered architectures, frequency learning outperforms coupling-weight learning among converged seeds (96.0% vs. 83.3% at matched parameter counts, p = 1.8e-12). The approximately 50% convergence failure rate under random initialization is a loss-landscape property, not a gradient error; topology-aware spectral seeding eliminates it in all settings tested (46/100 to 100/100 seeds on the primary task; 50/50 on a second task, K-only training, and a larger architecture).
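The claim can be checked numerically on a toy all-to-all network. The sketch below is our own construction, with our own sign conventions and a gauge-invariant readout (the phase difference theta_0 - theta_1); it compares the nudge-induced displacement, scaled by -1/beta, with a finite-difference gradient of the loss with respect to the natural frequencies:

```python
import numpy as np

def relax(omega, K, theta0, beta=0.0, target=0.0, steps=5000, dt=0.01):
    # Integrate (possibly nudged) Kuramoto dynamics to a stable equilibrium.
    theta = theta0.copy()
    nudge = np.zeros_like(theta); nudge[0], nudge[1] = 1.0, -1.0
    for _ in range(steps):
        force = omega + (K * np.sin(theta[None, :] - theta[:, None])).sum(axis=1)
        force -= beta * (theta[0] - theta[1] - target) * nudge  # weak output nudge
        theta = theta + dt * force
    return theta

rng = np.random.default_rng(1)
n = 6
K = np.full((n, n), 2.0 / n); np.fill_diagonal(K, 0.0)   # uniform coupling
omega = 0.05 * rng.standard_normal(n); omega -= omega.mean()
free = relax(omega, K, np.zeros(n))
target, beta = free[0] - free[1] + 0.2, 1e-4
nudged = relax(omega, K, free.copy(), beta=beta, target=target)
grad_ep = -(nudged - free) / beta                        # EP gradient estimate

def loss(om):  # L = 0.5 * (theta_0 - theta_1 - target)^2 at the free equilibrium
    th = relax(om, K, np.zeros(n))
    return 0.5 * (th[0] - th[1] - target) ** 2

eps = 1e-6
fd = np.array([(loss(omega + eps * np.eye(n)[i]) - loss(omega - eps * np.eye(n)[i]))
               / (2 * eps) for i in range(n)])
print(np.round(grad_ep, 4)); print(np.round(fd, 4))      # should roughly agree
```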
[1111] A Diffusion-Contrastive Graph Neural Network with Virtual Nodes for Wind Nowcasting in Unobserved Regions
Jie Shi, Siamak Mehrkanoon
Main category: cs.LG
TL;DR: A deep graph self-supervised framework using virtual nodes in graph neural networks to extend weather nowcasting to unobserved regions without requiring new sensors, achieving 30-46% error reduction for wind predictions.
Details
Motivation: Many regions lack dense observational networks for weather stations, leading to unreliable short-term wind predictions in unobserved areas. This creates challenges for climate resilience, energy security, and disaster preparedness, particularly affecting renewable energy integration and early-warning systems.
Method: Proposes a deep graph self-supervised framework that introduces “virtual nodes” into a diffusion and contrastive-based graph neural network. The model learns wind conditions (speed, direction, gusts) in locations without direct measurements by leveraging relationships between observed and unobserved regions through graph structure.
Result: The approach reduces nowcast mean absolute error (MAE) of wind speed, gusts, and direction in unobserved regions by 30-46% compared to interpolation and regression methods, using high-temporal resolution weather station data from the Netherlands.
Conclusion: The method enables localized nowcasts in data-sparse regions without requiring new sensors, opening pathways for renewable energy integration, agricultural planning, and early-warning systems in areas with limited observational networks.
Abstract: Accurate weather nowcasting remains one of the central challenges in atmospheric science, with critical implications for climate resilience, energy security, and disaster preparedness. Since it is not feasible to deploy observation stations everywhere, some regions lack dense observational networks, resulting in unreliable short-term wind predictions across those unobserved areas. Here we present a deep graph self-supervised framework that extends nowcasting capability into such unobserved regions without requiring new sensors. Our approach introduces “virtual nodes” into a diffusion and contrastive-based graph neural network, enabling the model to learn wind conditions (i.e., speed, direction, and gusts) in places with no direct measurements. Using high-temporal resolution weather station data across the Netherlands, we demonstrate that this approach reduces the nowcast mean absolute error (MAE) of wind speed, gusts, and direction in unobserved regions by 30-46% compared with interpolation and regression methods. By enabling localized nowcasts where no measurements exist, this method opens new pathways for renewable energy integration, agricultural planning, and early-warning systems in data-sparse regions.
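A toy sketch of the virtual-node idea in plain PyTorch, using simple mean-aggregation message passing rather than the paper's diffusion-and-contrastive model; the construction is ours:

```python
import torch

# Unobserved target sites join the station graph as extra nodes with zeroed
# features; message passing then fills them in from linked stations.
def add_virtual_nodes(x_obs, adj_obs, links):
    # x_obs: (N, F) station features; adj_obs: (N, N); links: (M, N) 0/1
    n, m = x_obs.shape[0], links.shape[0]
    x = torch.cat([x_obs, torch.zeros(m, x_obs.shape[1])], dim=0)
    adj = torch.zeros(n + m, n + m)
    adj[:n, :n] = adj_obs
    adj[n:, :n] = links
    adj[:n, n:] = links.T
    return x, adj

def mean_message_pass(x, adj, steps=2):
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
    for _ in range(steps):
        x = adj @ x / deg          # average neighbour features
    return x

x_obs = torch.randn(10, 4)                     # 10 stations, 4 features
a = torch.rand(10, 10)
adj_obs = ((a + a.T) > 1.0).float()            # random symmetric adjacency
links = (torch.rand(3, 10) < 0.4).float()      # 3 unobserved target sites
x, adj = add_virtual_nodes(x_obs, adj_obs, links)
print(mean_message_pass(x, adj)[10:])          # features inferred at virtual nodes
```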
[1112] Integrating SAINT with Tree-Based Models: A Case Study in Employee Attrition Prediction
Adil Derrazi, Javad Pourmostafa Roshan Sharami
Main category: cs.LG
TL;DR: Hybrid approach using SAINT transformer embeddings with tree-based models for employee attrition prediction fails to outperform standalone tree models, with reduced interpretability.
Details
Motivation: Employee attrition is costly for organizations, and while tree-based models work well on structured HR data, traditional encoding methods fail to capture semantic relationships between categorical features. The study aims to enhance prediction by integrating transformer-generated embeddings with tree models.
Method: Uses SAINT (Self-Attention and Intersample Attention Transformer) to generate embeddings from tabular HR data, then combines these embeddings with tree-based models (XGBoost, LightGBM). Evaluates standalone models (SAINT, XGBoost, LightGBM) and hybrid models that use SAINT embeddings as features for tree classifiers.
Result: Standalone tree-based models (XGBoost, LightGBM) outperform both standalone SAINT and hybrid approaches in predictive accuracy and generalization. Hybrid models did not improve performance, possibly because tree models struggle with dense, high-dimensional embeddings. Hybrid approach also significantly reduced interpretability.
Conclusion: Transformer-based embeddings capture feature relationships but don’t enhance tree-based classifiers for employee attrition prediction. Future research should explore alternative fusion strategies for integrating deep learning with structured data.
Abstract: Employee attrition presents a major challenge for organizations, increasing costs and reducing productivity. Predicting attrition accurately enables proactive retention strategies, but existing machine learning models often struggle to capture complex feature interactions in tabular HR datasets. While tree-based models such as XGBoost and LightGBM perform well on structured data, traditional encoding techniques like one-hot encoding can introduce sparsity and fail to preserve semantic relationships between categorical features. This study explores a hybrid approach by integrating SAINT (Self-Attention and Intersample Attention Transformer)-generated embeddings with tree-based models to enhance employee attrition prediction. SAINT leverages self-attention mechanisms to model intricate feature interactions. In this study, we explore SAINT both as a standalone classifier and as a feature extractor for tree-based models. We evaluate the performance, generalizability, and interpretability of standalone models (SAINT, XGBoost, LightGBM) and hybrid models that combine SAINT embeddings with tree-based classifiers. Experimental results show that standalone tree-based models outperform both the standalone SAINT model and the hybrid approaches in predictive accuracy and generalization. Contrary to expectations, the hybrid models did not improve performance. One possible explanation is that tree-based models struggle to utilize dense, high-dimensional embeddings effectively. Additionally, the hybrid approach significantly reduced interpretability, making model decisions harder to explain. These findings suggest that transformer-based embeddings, while capturing feature relationships, do not necessarily enhance tree-based classifiers. Future research should explore alternative fusion strategies for integrating deep learning with structured data.
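The hybrid recipe itself is compact. Here is a sketch with a placeholder encoder standing in for SAINT's learned embeddings (which are not reproduced here), and synthetic data in place of the HR dataset:

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def encode(X):
    return X  # placeholder for SAINT's learned row embeddings

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 12))
y = (X[:, 0] * X[:, 1] > 0).astype(int)           # synthetic attrition label
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

Z_tr, Z_te = encode(X_tr), encode(X_te)           # embeddings as tree features
hybrid = XGBClassifier(n_estimators=200, max_depth=4).fit(Z_tr, y_tr)
print(accuracy_score(y_te, hybrid.predict(Z_te)))
```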
[1113] WaterAdmin: Orchestrating Community Water Distribution Optimization via AI Agents
Jiaqi Wen, Pingbo Tang, Shaolei Ren, Jianyi Yang
Main category: cs.LG
TL;DR: WaterAdmin: A bi-level AI-agent framework combining LLM-based community context abstraction with optimization-based control for adaptive water system management under dynamic conditions.
Details
Motivation: Real-world community water systems face highly dynamic contexts (human activities, weather variations) that affect water demand patterns, making traditional optimization approaches struggle to adapt to heterogeneous and rapidly evolving contextual information in real time.
Method: Proposes WaterAdmin, a bi-level AI-agent framework with LLM-based community context abstraction at the upper level and optimization-based operational control at the lower level, integrating complementary strengths of both paradigms.
Result: Implemented on EPANET hydraulic simulation platform, demonstrating superior performance in maintaining pressure reliability and reducing energy consumption under highly dynamic community contexts.
Conclusion: The bi-level framework successfully integrates LLM context understanding with optimization control for adaptive and reliable water system operation in dynamic community environments.
Abstract: We study the operation of community water systems, where pumps and valves must be scheduled to reliably meet water demands while minimizing energy consumption. While existing optimization-based methods are effective under well-modeled environments, real-world community scenarios exhibit highly dynamic contexts, such as human activities and weather variations, that significantly affect water demand patterns and operational targets across different zones. Traditional optimization approaches struggle to aggregate and adapt to such heterogeneous and rapidly evolving contextual information in real time. While Large Language Model (LLM) agents offer strong capabilities for understanding heterogeneous community context, they are not suitable for directly producing reliable real-time control actions. To address these challenges, we propose a bi-level AI-agent-based framework, WaterAdmin, which integrates LLM-based community context abstraction at the upper level with optimization-based operational control at the lower level. This design leverages the complementary strengths of both paradigms to enable adaptive and reliable operation. We implement WaterAdmin on the hydraulic simulation platform EPANET and demonstrate superior performance in maintaining pressure reliability and reducing energy consumption under highly dynamic community contexts.
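A schematic of the bi-level loop; both helper functions below are hypothetical stand-ins (not WaterAdmin's actual interfaces), hard-coded so the snippet runs:

```python
def llm_abstract_context(reports):
    # Upper level: an LLM agent would distill free-form community context
    # into structured demand adjustments; hard-coded here for illustration.
    return {"zone_demand_scale": {"north": 1.2, "south": 0.9}}

def solve_schedule(network_file, demand_params):
    # Lower level: a conventional optimizer over hydraulic simulations
    # (the paper uses EPANET) would compute pump/valve settings.
    return {"pump_1": "on", "valve_3": 0.7}  # dummy schedule

context = ["street festival announced in the north zone tonight"]
print(solve_schedule("net1.inp", llm_abstract_context(context)))
```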
[1114] Battery health prognosis using Physics-informed neural network with Quantum Feature mapping
Muhammad Imran Hossain, Md Fazley Rafy, Sarika Khushlani Solanki, Anurag K. Srivastava
Main category: cs.LG
TL;DR: QPINN: A physics-informed neural network with quantum feature mapping for accurate battery state-of-health estimation across diverse chemistries and conditions.
Details
Motivation: Existing battery SOH estimation methods lack generalizability across diverse battery chemistries and operating conditions, and standard neural networks fail to capture the complex, high-dimensional physics of battery degradation.
Method: Proposes QPINN - a physics-informed neural network with Quantum Feature Mapping (QFM) technique that projects raw battery sensor data into high-dimensional Hilbert space using Nyström method, creating expressive features that capture non-linear degradation patterns, then enforces physical constraints.
Result: Achieves 99.46% average SOH estimation accuracy across datasets, outperforming state-of-the-art baselines with 65% reduction in MAPE and 62% reduction in RMSE. Validated on 310,705 samples from 387 cells, showing adaptability in cross-validation and successful transfer between chemistries without target-domain labels.
Conclusion: QPINN provides a robust, generalizable solution for battery health prognosis that effectively captures complex degradation physics and transfers well across different battery chemistries.
Abstract: Accurate battery health prognosis using State of Health (SOH) estimation is essential for the reliability of multi-scale battery energy storage, yet existing methods are limited in generalizability across diverse battery chemistries and operating conditions. The inability of standard neural networks to capture the complex, high-dimensional physics of battery degradation is a major contributor to these limitations. To address this, a physics-informed neural network with the Quantum Feature Mapping(QFM) technique (QPINN) is proposed. QPINN projects raw battery sensor data into a high-dimensional Hilbert space, creating a highly expressive feature set that effectively captures subtle, non-linear degradation patterns using Nyström method. These quantum-enhanced features are then processed by a physics-informed network that enforces physical constraints. The proposed method achieves an average SOH estimation accuracy of 99.46% across different datasets, substantially outperforming state-of-the-art baselines, with reductions in MAPE and RMSE of up to 65% and 62%, respectively. This method was validated on a large-scale, multi-chemistry dataset of 310,705 samples from 387 cells, and further showed notable adaptability in cross-validation settings, successfully transferring from one chemistry to another without relying on target-domain SOH labels.
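A sketch of the feature-lifting step alone, with a classical RBF kernel standing in for the paper's quantum feature map; the data shape and parameters are our choices:

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem

# Raw battery features are projected into a higher-dimensional space via the
# Nystroem method before a physics-informed network would consume them.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))   # e.g., voltage/current/temperature features
fmap = Nystroem(kernel="rbf", gamma=0.5, n_components=128, random_state=0)
Z = fmap.fit_transform(X)           # (500, 128) lifted features
print(Z.shape)
```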
[1115] Structural Gating and Effect-aligned Lag-resolved Temporal Causal Discovery Framework with Application to Heat-Pollution Extremes
Rui Chen, Jinsong Wu
Main category: cs.LG
TL;DR: SGED-TCD is a novel framework for lag-resolved causal discovery in multivariate time series that combines structural gating, stability learning, perturbation-effect alignment, and unified graph extraction to improve interpretability and robustness.
Details
Motivation: The need for better causal discovery methods in complex multivariate time series that can handle lag structures, improve interpretability, and provide robust, physically meaningful causal relationships in real-world applications like climate-environment systems.
Method: SGED-TCD combines four key components: explicit structural gating for lag modeling, stability-oriented learning for robustness, perturbation-effect alignment for functional consistency, and unified graph extraction for comprehensive causal network inference.
Result: Applied to teleconnection-driven compound heatwave-air-pollution extremes in China, SGED-TCD revealed clear regional and seasonal heterogeneity in causal pathways, with warm-season extremes linked to low-latitude oceanic variability and cold-season extremes governed by high-latitude circulation patterns.
Conclusion: SGED-TCD successfully recovers physically interpretable, hierarchical, and lag-resolved causal pathways in complex climate-environment systems and provides a general framework applicable to other domains requiring temporal causal discovery.
Abstract: This study proposes Structural Gating and Effect-aligned Discovery for Temporal Causal Discovery (SGED-TCD), a novel and general framework for lag-resolved causal discovery in complex multivariate time series. SGED-TCD combines explicit structural gating, stability-oriented learning, perturbation-effect alignment, and unified graph extraction to improve the interpretability, robustness, and functional consistency of inferred causal graphs. To evaluate its effectiveness in a representative real-world setting, we apply SGED-TCD to teleconnection-driven compound heatwave–air-pollution extremes in eastern and northern China. Using large-scale climate indices, regional circulation and boundary-layer variables, and compound extreme indicators, the framework reconstructs weighted causal networks with explicit dominant lags and relative causal importance. The inferred networks reveal clear regional and seasonal heterogeneity: warm-season extremes in Eastern China are mainly linked to low-latitude oceanic variability through circulation, radiation, and ventilation pathways, whereas cold-season extremes in Northern China are more strongly governed by high-latitude circulation variability associated with boundary-layer suppression and persistent stagnation. These results show that SGED-TCD can recover physically interpretable, hierarchical, and lag-resolved causal pathways in a challenging climate–environment system. More broadly, the proposed framework is not restricted to the present application and provides a general basis for temporal causal discovery in other complex domains.
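A toy sketch of the structural-gating ingredient in isolation (our illustration, not SGED-TCD itself): one learnable gate per source-series-and-lag pair, which a sparsity penalty would prune into a lag-resolved causal graph:

```python
import torch
import torch.nn as nn

class LagGate(nn.Module):
    """One learnable logit per (source series, lag); a sigmoid gate scales
    each lagged input before a linear readout of the target series."""
    def __init__(self, n_series, max_lag):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_series, max_lag))
        self.readout = nn.Linear(n_series * max_lag, 1)

    def forward(self, x_lags):
        # x_lags: (B, n_series, max_lag) lagged history of all series
        gated = x_lags * torch.sigmoid(self.logits)
        return self.readout(gated.flatten(1))

gate = LagGate(n_series=5, max_lag=10)
y_hat = gate(torch.randn(32, 5, 10))
```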
[1116] Intent-aligned Formal Specification Synthesis via Traceable Refinement
Zhe Ye, Aidan Z. H. Yang, Huangyuan Su, Zhenyu Liao, Samuel Tenka, Zhizhen Qin, Udaya Ghai, Dawn Song, Soonho Kong
Main category: cs.LG
TL;DR: VeriSpecGen: A traceable refinement framework that generates formal specifications from natural language for code verification by decomposing requirements, creating traceability maps, and enabling targeted repairs when validation fails.
Details
Motivation: While formal verification can guarantee code correctness, real-world codebases often lack specifications, and writing high-quality specifications is expensive and requires expertise. There's a need to automatically generate intent-aligned specifications from natural language descriptions.
Method: VeriSpecGen decomposes natural language into atomic requirements, generates requirement-targeted tests with explicit traceability maps to validate specifications, and uses these maps to attribute failures to specific requirements for targeted clause-level repairs. It synthesizes specifications in the Lean theorem prover.
Result: Achieves 86.6% on VERINA SpecGen task using Claude Opus 4.5, improving over baselines by up to 31.8 points. Generated 343K training examples from refinement trajectories, and training on these trajectories improves specification synthesis by 62-106% relative and transfers gains to general reasoning abilities.
Conclusion: VeriSpecGen provides an effective framework for generating formal specifications from natural language with traceability and repair mechanisms, enabling both inference-time improvements and training data generation that enhances model capabilities.
Abstract: Large language models are increasingly used to generate code from natural language, but ensuring correctness remains challenging. Formal verification offers a principled way to obtain such guarantees by proving that a program satisfies a formal specification. However, specifications are frequently missing in real-world codebases, and writing high-quality specifications remains expensive and expertise-intensive. We present VeriSpecGen, a traceable refinement framework that synthesizes intent-aligned specifications in Lean through requirement-level attribution and localized repair. VeriSpecGen decomposes natural language into atomic requirements and generates requirement-targeted tests with explicit traceability maps to validate generated specifications. When validation fails, traceability maps attribute failures to specific requirements, enabling targeted clause-level repairs. VeriSpecGen achieves 86.6% on the VERINA SpecGen task using Claude Opus 4.5, improving over baselines by up to 31.8 points across different model families and scales. Beyond inference-time gains, we generate 343K training examples from VeriSpecGen refinement trajectories and demonstrate that training on these trajectories substantially improves specification synthesis by 62-106% relative and transfers gains to general reasoning abilities.
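A schematic of the traceable refinement loop; every helper below is a trivial string-based stand-in for what are LLM calls and Lean test execution in the actual system:

```python
decompose = lambda nl: [r.strip() for r in nl.split(".") if r.strip()]
gen_tests = lambda req: [req]                   # traceability map: req -> tests
draft_spec = lambda reqs: "-- spec: " + "; ".join(reqs[:-1])  # drops one clause
run_tests = lambda spec, tests: all(t in spec for t in tests)
repair_clause = lambda spec, req: spec + "; " + req           # targeted repair

def synthesize_spec(nl_description, max_rounds=3):
    reqs = decompose(nl_description)
    tests = {r: gen_tests(r) for r in reqs}
    spec = draft_spec(reqs)
    for _ in range(max_rounds):
        failed = [r for r in reqs if not run_tests(spec, tests[r])]
        if not failed:                          # all requirements validated
            return spec
        for r in failed:                        # failures attribute to reqs,
            spec = repair_clause(spec, r)       # enabling clause-level repair
    return spec

print(synthesize_spec("input list is nonempty. output is sorted"))
```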
[1117] Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
Eric Easley, Sebastian Farquhar
Main category: cs.LG
TL;DR: LIRA trains LLMs to change how they interpret instructions rather than just their actions, improving generalization against jailbreaks, backdoors, and enabling effective unlearning.
Details
Motivation: Current methods for addressing jailbreaks, backdoors, and unlearning in LLMs focus on training models based on their actions when given malign instructions, which doesn't fundamentally change how models interpret instructions and limits generalization.
Method: Latent Instruction Representation Alignment (LIRA) trains models to change how they interpret instructions at the latent representation level. The method is further enhanced with an internally adversarial training algorithm to boost generalization.
Result: LIRA blocks over 99% of PEZ jailbreak attacks, removes challenging insecure code backdoors, and achieves optimal forgetting on WMDP cyber benchmarks with negligible loss of benign capabilities.
Conclusion: By focusing on changing how LLMs interpret instructions rather than just their actions, LIRA provides a more fundamental and generalizable approach to addressing security vulnerabilities like jailbreaks and backdoors while enabling effective unlearning.
Abstract: We address jailbreaks, backdoors, and unlearning for large language models (LLMs). Unlike prior work, which trains LLMs based on their actions when given malign instructions, our method specifically trains the model to change how it interprets instructions. Our method, Latent Instruction Representation Alignment (LIRA), greatly improves generalization. We further boost generalization through an internally adversarial training algorithm. Our methods block over 99% of PEZ jailbreak attacks; remove a challenging insecure code backdoor; and achieve optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
[1118] CARE-ECG: Causal Agent-based Reasoning for Explainable and Counterfactual ECG Interpretation
Elahe Khatibi, Ziyu Wang, Ankita Sharma, Krishnendu Chakrabarty, Sanaz Rahimi Moosavi, Farshad Firouzi, Amir Rahmani
Main category: cs.LG
TL;DR: CARE-ECG is a causally structured ECG-language reasoning framework that integrates physiological representation learning, causal diagnosis, and counterfactual analysis for improved clinical ECG interpretation.
Details
Motivation: Current ECG-LLM systems lack explicit physiological or causal structure, limiting grounding, temporal reasoning, and counterfactual analysis essential for clinical decision-making.
Method: Encodes multi-lead ECGs into temporally organized latent biomarkers, performs causal graph inference for probabilistic diagnosis, and supports counterfactual assessment via structural causal models with causal retrieval-augmented generation.
Result: Improves diagnostic accuracy (0.84 on Expert-ECG-QA, 0.76 on SCP-mapped PTB-XL under GPT-4) and explanation faithfulness while reducing hallucinations across multiple ECG benchmarks.
Conclusion: CARE-ECG provides traceable reasoning by exposing latent drivers, causal evidence paths, and counterfactual analysis, advancing clinical ECG interpretation beyond current LLM-based approaches.
Abstract: Large language models (LLMs) enable waveform-to-text ECG interpretation and interactive clinical questioning, yet most ECG-LLM systems still rely on weak signal-text alignment and retrieval without explicit physiological or causal structure. This limits grounding, temporal reasoning, and counterfactual “what-if” analysis central to clinical decision-making. We propose CARE-ECG, a causally structured ECG-language reasoning framework that unifies representation learning, diagnosis, and explanation in a single pipeline. CARE-ECG encodes multi-lead ECGs into temporally organized latent biomarkers, performs causal graph inference for probabilistic diagnosis, and supports counterfactual assessment via structural causal models. To improve faithfulness, CARE-ECG grounds language outputs through causal retrieval-augmented generation and a modular agentic pipeline that integrates history, diagnosis, and response with verification. Across multiple ECG benchmarks and expert QA settings, CARE-ECG improves diagnostic accuracy and explanation faithfulness while reducing hallucinations (e.g., 0.84 accuracy on Expert-ECG-QA and 0.76 on SCP-mapped PTB-XL under GPT-4). Overall, CARE-ECG provides traceable reasoning by exposing key latent drivers, causal evidence paths, and how alternative physiological states would change outcomes.
[1119] Replicable Composition
Kiarash Banihashem, MohammadHossein Bateni, Hossein Esfandiari, Samira Goudarzi, MohammadTaghi Hajiaghayi
Main category: cs.LG
TL;DR: The paper establishes optimal sample complexity bounds for composing multiple replicable algorithms, achieving linear scaling in the sum of individual sample complexities, resolving an open problem about whether O(nk) scaling is achievable.
Details
Motivation: Understanding how replicable algorithms compose is fundamental to building complex systems from reliable components. Prior work showed either O(nk²) or O(n²k) bounds, leaving open whether optimal O(nk) scaling is possible.
Method: The approach converts replicable algorithms to perfectly generalizing ones, composes them via privacy-style analysis, and maps back using correlated sampling. This yields the first advanced composition theorem for replicability.
Result: Achieved optimal O(∑n_i) sample complexity for joint replicable composition, established Ω(nk²) lower bound for adaptive composition showing quadratic separation, and provided boosting theorems for success probability.
Conclusion: The paper resolves the fundamental composition question for replicable algorithms, establishing optimal linear scaling and providing new tools for analyzing replicability through connections to differential privacy and perfect generalization.
Abstract: Replicability requires that algorithmic conclusions remain consistent when rerun on independently drawn data. A central structural question is composition: given $k$ problems each admitting a $\rho$-replicable algorithm with sample complexity $n$, how many samples are needed to solve all jointly while preserving replicability? The naive analysis yields $\widetilde{O}(nk^2)$ samples, and Bun et al. (STOC'23) observed that reductions through differential privacy give an alternative $\widetilde{O}(n^2k)$ bound, leaving open whether the optimal $\widetilde{O}(nk)$ scaling is achievable. We resolve this open problem and, more generally, show that problems with sample complexities $n_1,\ldots,n_k$ can be jointly solved with $\widetilde{O}(\sum_i n_i)$ samples while preserving constant replicability. Our approach converts each replicable algorithm into a perfectly generalizing one, composes them via a privacy-style analysis, and maps back via correlated sampling. This yields the first advanced composition theorem for replicability. En route, we obtain new bounds for the composition of perfectly generalizing algorithms with heterogeneous parameters. As part of our results, we provide a boosting theorem for the success probability of replicable algorithms. For a broad class of problems, the failure probability appears as a separate additive term independent of $\rho$, immediately yielding improved sample complexity bounds for several problems. Finally, we prove an $\Omega(nk^2)$ lower bound for adaptive composition, establishing a quadratic separation from the non-adaptive setting. The key technique, which we call the phantom run, yields structural results of independent interest.
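The quantitative picture, restated from the abstract in one line:

```latex
% Sample complexity for composing k \rho-replicable algorithms, each using n samples:
\underbrace{\widetilde{O}(nk^2)}_{\text{naive}}
\qquad
\underbrace{\widetilde{O}(n^2 k)}_{\text{via differential privacy}}
\qquad
\underbrace{\widetilde{O}\Big(\sum_{i=1}^{k} n_i\Big) = \widetilde{O}(nk)}_{\text{this work, when } n_i \equiv n}
```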
[1120] Membership Inference Attacks Expose Participation Privacy in ECG Foundation Encoders
Ziyu Wang, Elahe Khatibi, Ankita Sharma, Krishnendu Chakrabarty, Sanaz Rahimi Moosavi, Farshad Firouzi, Amir Rahmani
Main category: cs.LG
TL;DR: Membership inference attacks can identify whether specific individuals contributed ECG data to self-supervised foundation encoders, even when only model outputs or embeddings are accessible, revealing participation privacy risks in connected-health systems.
Details
Motivation: Foundation-style ECG encoders are increasingly reused across tasks and institutions, but this raises participation privacy concerns: adversaries could infer whether specific individuals or cohorts contributed data to pretraining, potentially revealing sensitive health context or institutional affiliation.
Method: Implementation-grounded audit of membership inference attacks against self-supervised ECG foundation encoders, covering contrastive objectives (SimCLR, TS2Vec) and masked reconstruction objectives (CNN- and Transformer-based MAE). Evaluated three attacker interfaces: score-only black-box access, adaptive learned attackers aggregating subject-level statistics, and embedding-access attackers probing latent representation geometry.
Result: Heterogeneous and objective-dependent participation leakage observed: leakage most pronounced in small or institution-specific cohorts, saturates in embedding space for contrastive encoders, while larger and more diverse datasets substantially attenuate operational tail risk.
Conclusion: Restricting access to raw signals or labels is insufficient to guarantee participation privacy, underscoring the need for deployment-aware auditing of reusable biosignal foundation encoders in connected-health systems.
Abstract: Foundation-style ECG encoders pretrained with self-supervised learning are increasingly reused across tasks, institutions, and deployment contexts, often through model-as-a-service interfaces that expose scalar scores or latent representations. While such reuse improves data efficiency and generalization, it raises a participation privacy concern: can an adversary infer whether a specific individual or cohort contributed ECG data to pretraining, even when raw waveforms and diagnostic labels are never disclosed? In connected-health settings, training participation itself may reveal institutional affiliation, study enrollment, or sensitive health context. We present an implementation-grounded audit of membership inference attacks (MIAs) against modern self-supervised ECG foundation encoders, covering contrastive objectives (SimCLR, TS2Vec) and masked reconstruction objectives (CNN- and Transformer-based MAE). We evaluate three realistic attacker interfaces: (i) score-only black-box access to scalar outputs, (ii) adaptive learned attackers that aggregate subject-level statistics across repeated queries, and (iii) embedding-access attackers that probe latent representation geometry. Using a subject-centric protocol with window-to-subject aggregation and calibration at fixed false-positive rates under a cross-dataset auditing setting, we observe heterogeneous and objective-dependent participation leakage: leakage is most pronounced in small or institution-specific cohorts and, for contrastive encoders, can saturate in embedding space, while larger and more diverse datasets substantially attenuate operational tail risk. Overall, our results show that restricting access to raw signals or labels is insufficient to guarantee participation privacy, underscoring the need for deployment-aware auditing of reusable biosignal foundation encoders in connected-health systems.
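The calibration-at-fixed-FPR step of such an audit is easy to state. Below is a generic score-threshold sketch (a standard auditing recipe on synthetic scores, not the paper's exact attacks):

```python
import numpy as np

# Higher scores are assumed to indicate training membership.
def tpr_at_fpr(member_scores, nonmember_scores, fpr=0.01):
    thresh = np.quantile(nonmember_scores, 1.0 - fpr)  # FPR set on non-members
    return (member_scores > thresh).mean()             # TPR measured on members

rng = np.random.default_rng(0)
members = rng.normal(0.3, 1.0, 1000)      # e.g., negative reconstruction loss
nonmembers = rng.normal(0.0, 1.0, 1000)
print(tpr_at_fpr(members, nonmembers, fpr=0.01))
```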
[1121] Towards Green Wearable Computing: A Physics-Aware Spiking Neural Network for Energy-Efficient IMU-based Human Activity Recognition
Naichuan Zheng, Hailun Xia, Zepeng Sun, Weiyi Li, Yinze Zhou
Main category: cs.LG
TL;DR: PAS-Net: A physics-aware spiking neural network for ultra-low-power wearable human activity recognition using IMU sensors, achieving SOTA accuracy with 98% energy reduction through sparse integer operations and early-exit mechanisms.
Details
Motivation: Traditional DNNs for IMU-based human activity recognition are computationally expensive and power-hungry, making them unsuitable for battery-constrained edge devices. While SNNs offer energy efficiency, they struggle with complex biomechanical topologies and temporal dynamics.
Method: Proposes PAS-Net with: 1) adaptive symmetric topology mixer enforcing human-joint physical constraints, 2) O(1)-memory causal neuromodulator for context-aware dynamic threshold neurons, 3) temporal spike error objective enabling flexible early-exit mechanism for continuous IMU streams.
Result: Achieves state-of-the-art accuracy across seven diverse datasets while replacing dense operations with sparse 0.1 pJ integer accumulations. Early-exit capability reduces dynamic energy consumption by up to 98%.
Conclusion: PAS-Net establishes a robust, ultra-low-power neuromorphic standard for always-on wearable sensing, bridging the gap between energy efficiency and accuracy in human activity recognition.
Abstract: Wearable IMU-based Human Activity Recognition (HAR) relies heavily on Deep Neural Networks (DNNs), which are burdened by immense computational and buffering demands. Their power-hungry floating-point operations and rigid requirement to process complete temporal windows severely cripple battery-constrained edge devices. While Spiking Neural Networks (SNNs) offer extreme event-driven energy efficiency, standard architectures struggle with complex biomechanical topologies and temporal gradient degradation. To bridge this gap, we propose the Physics-Aware Spiking Neural Network (PAS-Net), a fully multiplier-free architecture explicitly tailored for Green HAR. Spatially, an adaptive symmetric topology mixer enforces human-joint physical constraints. Temporally, an $O(1)$-memory causal neuromodulator yields context-aware dynamic threshold neurons, adapting actively to non-stationary movement rhythms. Furthermore, we leverage a temporal spike error objective to unlock a flexible early-exit mechanism for continuous IMU streams. Evaluated across seven diverse datasets, PAS-Net achieves state-of-the-art accuracy while replacing dense operations with sparse 0.1 pJ integer accumulations. Crucially, its confidence-driven early-exit capability drastically reduces dynamic energy consumption by up to 98%. PAS-Net establishes a robust, ultra-low-power neuromorphic standard for always-on wearable sensing.
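A toy leaky integrate-and-fire step with an adaptive threshold, illustrating the "dynamic threshold neuron" idea only; PAS-Net's causal neuromodulator and multiplier-free arithmetic are not reproduced here:

```python
import torch

def lif_step(v, x, theta, decay=0.9):
    v = decay * v + x               # leaky membrane integration
    spike = (v >= theta).float()    # fire when above the adaptive threshold
    v = v * (1.0 - spike)           # hard reset after a spike
    return v, spike

v, theta = torch.zeros(4), torch.full((4,), 1.0)
for t in range(10):
    v, s = lif_step(v, torch.rand(4), theta)
    theta = 0.95 * theta + 0.2 * s  # threshold rises after spiking, then decays
```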
[1122] Rethinking the Diffusion Model from a Langevin Perspective
Candi Zheng, Yuan Lan
Main category: cs.LG
TL;DR: The paper presents diffusion models from a Langevin dynamics perspective, offering a unified framework that simplifies understanding and connects different formulations (ODE/SDE, VAE, score matching, flow matching).
Details
Motivation: Existing diffusion model explanations are mathematically dense and fragmented across different perspectives (VAEs, score matching, flow matching), making them difficult for beginners. The authors aim to provide a simpler, more intuitive understanding through a unified Langevin dynamics framework.
Method: The paper systematically organizes diffusion models from a Langevin perspective, showing how different formulations (ODE-based, SDE-based) can be unified under a single framework. It demonstrates equivalences between denoising, score matching, and flow matching approaches under maximum-likelihood principles.
Result: The Langevin perspective provides clear answers to fundamental questions about diffusion models, including: how reverse processes invert forward processes, why diffusion models are theoretically superior to ordinary VAEs, and why different formulations are equivalent under maximum-likelihood.
Conclusion: The Langevin framework offers pedagogical value by bridging existing interpretations, showing how different formulations convert into one another, and providing deeper intuition for both learners and experienced researchers working with diffusion models.
Abstract: Diffusion models are often introduced from multiple perspectives, such as VAEs, score matching, or flow matching, accompanied by dense and technically demanding mathematics that can be difficult for beginners to grasp. One classic question is: how does the reverse process invert the forward process to generate data from pure noise? This article systematically organizes the diffusion model from a fresh Langevin perspective, offering a simpler, clearer, and more intuitive answer. We also address the following questions: how can ODE-based and SDE-based diffusion models be unified under a single framework? Why are diffusion models theoretically superior to ordinary VAEs? Why is flow matching not fundamentally simpler than denoising or score matching, but equivalent under maximum-likelihood? We demonstrate that the Langevin perspective offers clear and straightforward answers to these questions, bridging existing interpretations of diffusion models, showing how different formulations can be converted into one another within a common framework, and offering pedagogical value for both learners and experienced researchers seeking deeper intuition.
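The Langevin building block the article starts from fits in a few lines; here is a 1-D sketch with our own parameters:

```python
import numpy as np

# Unadjusted Langevin sampling on a 1-D Gaussian: each step follows the
# score (gradient of the log-density) plus injected noise.
rng = np.random.default_rng(0)
mu, var, eps = 2.0, 1.0, 0.01
score = lambda x: -(x - mu) / var              # grad log N(mu, var)
x = 5.0 * rng.standard_normal(5000)            # start far from the target
for _ in range(2000):
    x = x + eps * score(x) + np.sqrt(2 * eps) * rng.standard_normal(x.shape)
print(x.mean(), x.var())                       # approximately mu and var
```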
[1123] Exact Finite-Sample Variance Decomposition of Subagging: A Spectral Filtering Perspective
Ye Su, Mingrui Ye, Yining Wang, Jipeng Guo, Yong Liu
Main category: cs.LG
TL;DR: Subagging acts as a low-pass spectral filter that attenuates high-order interaction variance by geometric factors, revealing why standard resampling ratios under-regularize high-capacity interpolators and requiring complexity-adaptive subsampling.
Details
Motivation: Standard resampling ratios (like α≈0.632) have been used as default baselines in ensemble learning for decades, but there's no exact mathematical characterization of how these ratios interact with a base learner's intrinsic functional complexity in finite samples.
Method: Leverage Hoeffding-ANOVA decomposition to derive the first exact, finite-sample variance decomposition for subagging, applicable to any symmetric base learner without requiring asymptotic limits or smoothness assumptions. Establish subagging as a deterministic low-pass spectral filter.
Result: Subagging preserves low-order structural signals while attenuating c-th order interaction variance by a geometric factor approaching α^c. This reveals why default baselines often under-regularize high-capacity interpolators, which instead require smaller α to exponentially suppress spurious high-order noise.
Conclusion: Propose a complexity-guided adaptive subsampling algorithm that dynamically calibrates α to the learner’s complexity spectrum, empirically demonstrating consistent improvement in generalization over static baselines.
Abstract: Standard resampling ratios (e.g., $\alpha \approx 0.632$) have been widely used as default baselines in ensemble learning for three decades. However, how these ratios interact with a base learner’s intrinsic functional complexity in finite samples lacks an exact mathematical characterization. We leverage the Hoeffding-ANOVA decomposition to derive the first exact, finite-sample variance decomposition for subagging, applicable to any symmetric base learner without requiring asymptotic limits or smoothness assumptions. We establish that subagging operates as a deterministic low-pass spectral filter: it preserves low-order structural signals while attenuating $c$-th order interaction variance by a geometric factor approaching $\alpha^c$. This decoupling reveals why default baselines often under-regularize high-capacity interpolators, which instead require smaller $\alpha$ to exponentially suppress spurious high-order noise. To operationalize these insights, we propose a complexity-guided adaptive subsampling algorithm, empirically demonstrating that dynamically calibrating $\alpha$ to the learner’s complexity spectrum consistently improves generalization over static baselines.
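Subagging with an explicit ratio alpha is simple to state; a minimal sketch with a decision-tree base learner and synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Per the paper's analysis, smaller alpha more strongly damps high-order
# interaction variance of high-capacity base learners such as deep trees.
def subag_predict(X, y, X_test, alpha=0.3, n_estimators=100, seed=0):
    rng = np.random.default_rng(seed)
    preds = np.zeros(len(X_test))
    for _ in range(n_estimators):
        idx = rng.choice(len(X), size=int(alpha * len(X)), replace=False)
        preds += DecisionTreeRegressor().fit(X[idx], y[idx]).predict(X_test)
    return preds / n_estimators

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (400, 3))
y = np.sin(3 * X[:, 0]) + 0.5 * rng.standard_normal(400)  # noisy target
print(subag_predict(X, y, X[:5], alpha=0.3))
```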
[1124] CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts
Xiangyang Yin, Xingyu Liu, Tianhua Xia, Bo Bao, Vithursan Thangarasa, Valavan Manohararajah, Eric Sather, Sai Qian Zhang
Main category: cs.LG
TL;DR: CodeQuant introduces a unified quantization-and-clustering scheme for Mixture-of-Experts (MoE) models that addresses outlier issues through learnable rotation and weight clustering to reduce quantization errors while maintaining accuracy.
Details
Motivation: Outliers in MoE architectures cause severe accuracy degradation during post-training quantization, as they induce substantial quantization errors. Existing rotation-based smoothing techniques help but leave residual errors that impede reliable low-precision deployment.
Method: CodeQuant combines smoothing activation outliers via learnable rotation with absorbing weight outliers into fine-tuned cluster centroids. This reduces extreme values’ influence by fitting them within cluster centroids, lowering quantization error while maintaining expressive capacity. Includes dedicated GPU/CPU kernel design.
Result: Achieves up to 4.15× speedup while delivering significantly higher accuracy than state-of-the-art quantization approaches across diverse MoE models.
Conclusion: CodeQuant presents a promising direction for efficient and accurate deployment of MoE-based large language models under low-precision constraints.
Abstract: Outliers have emerged as a fundamental bottleneck in preserving accuracy for low-precision large models, particularly within Mixture-of-Experts (MoE) architectures that are increasingly central to large-scale language modeling. Under post-training quantization (PTQ), these outliers induce substantial quantization errors, leading to severe accuracy degradation. While recent rotation-based smoothing techniques alleviate the problem by redistributing outlier magnitudes, residual errors remain and continue to impede reliable low-precision deployment. In this work, we tackle this challenge by introducing \textit{CodeQuant}, a unified quantization-and-clustering scheme for MoE that combines smoothing of activation outliers via learnable rotation with absorption of weight outliers into fine-tuned cluster centroids. This design reduces the influence of extreme values by fitting them within cluster centroids, thereby lowering quantization error while maintaining expressive capacity. Coupled with a dedicated kernel design for GPU and CPU, CodeQuant achieves up to $4.15\times$ speedup while delivering significantly higher accuracy than state-of-the-art quantization approaches across diverse MoE models. Our results highlight CodeQuant as a promising direction for efficient and accurate deployment of MoE-based large language models under low-precision constraints. Our code is available at https://github.com/SAI-Lab-NYU/CodeQuant.
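The centroid-quantization ingredient in isolation (CodeQuant's learnable rotations and centroid fine-tuning are not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans

# Plain codebook quantization of a weight vector with outliers: outliers are
# absorbed into centroids instead of stretching a uniform quantization grid.
rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(0, 0.02, 4096), rng.normal(0, 0.5, 32)])  # outliers
km = KMeans(n_clusters=16, n_init=4, random_state=0).fit(w.reshape(-1, 1))
w_q = km.cluster_centers_[km.labels_].ravel()    # 16-entry (4-bit) codebook
print(np.abs(w - w_q).max())
```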
[1125] PepBenchmark: A Standardized Benchmark for Peptide Machine Learning
Jiahui Zhang, Rouyi Wang, Kuangqi Zhou, Tianshu Xiao, Lingyan Zhu, Yaosen Min, Yang Wang
Main category: cs.LG
TL;DR: PepBenchmark provides standardized datasets, preprocessing pipelines, and evaluation protocols for peptide machine learning research to address the lack of benchmarks in peptide drug discovery.
Details
Motivation: Peptide therapeutics are important but progress in peptide ML is hindered by the absence of standardized benchmarks, making comparisons and methodological advances difficult.
Method: Three-component framework: PepBenchData (29 canonical + 6 non-canonical peptide datasets), PepBenchPipeline (standardized preprocessing), and PepBenchLeaderboard (unified evaluation with baselines across 4 methodological families).
Result: Created the most comprehensive AI-ready peptide dataset resource to date with standardized evaluation protocols, enabling comparable benchmarking for peptide drug discovery.
Conclusion: PepBenchmark provides the first standardized foundation for peptide drug discovery ML research, facilitating methodological advances and real-world translation.
Abstract: Peptide therapeutics are widely regarded as the “third generation” of drugs, yet progress in peptide Machine Learning (ML) is hindered by the absence of standardized benchmarks. Here we present PepBenchmark, which unifies datasets, preprocessing, and evaluation protocols for peptide drug discovery. PepBenchmark comprises three components: (1) PepBenchData, a well-curated collection comprising 29 canonical-peptide and 6 non-canonical-peptide datasets across 7 groups, systematically covering key aspects of peptide drug development, representing, to the best of our knowledge, the most comprehensive AI-ready dataset resource to date; (2) PepBenchPipeline, a standardized preprocessing pipeline that ensures consistent dataset cleaning, construction, splitting, and feature transformation, mitigating quality issues common in ad hoc pipelines; and (3) PepBenchLeaderboard, a unified evaluation protocol and leaderboard with strong baselines across 4 major methodological families: Fingerprint-based, GNN-based, PLM-based, and SMILES-based models. Together, PepBenchmark provides the first standardized and comparable foundation for peptide drug discovery, facilitating methodological advances and translation into real-world applications. The data and code are publicly available at https://github.com/ZGCI-AI4S-Pep/PepBenchmark/.
[1126] IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
Yuzhen Mao, Qitong Wang, Martin Ester, Ke Li
Main category: cs.LG
TL;DR: IceCache: A novel KV cache management strategy using semantic token clustering with PagedAttention to reduce memory footprint while maintaining accuracy in long-sequence LLM inference.
Details
Motivation: KV cache memory footprint scales linearly with sequence length, causing memory bottlenecks on resource-constrained hardware. Existing offloading approaches suffer from imprecise token selection and performance degradation in long-generation tasks like chain-of-thought reasoning.
Method: Integrates semantic token clustering with PagedAttention, organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure for efficient token selection and better memory bandwidth utilization during CPU-GPU transfers.
Result: With a 256-token budget, maintains 99% of original accuracy achieved by full KV cache model on LongBench. Achieves competitive/superior latency and accuracy while using only 25% of KV cache token budget compared to other offloading methods.
Conclusion: IceCache effectively addresses memory bottlenecks in long-sequence LLM inference through semantic-aware KV cache management, enabling efficient resource-constrained deployment without significant accuracy loss.
Abstract: Key-Value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long-generation tasks such as chain-of-thought reasoning. In this paper, we propose a novel KV cache management strategy, IceCache, which integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method enables more efficient token selection and better utilization of memory bandwidth during CPU-GPU transfers. Experimental results on LongBench show that, with a 256-token budget, IceCache maintains 99% of the original accuracy achieved by the full KV cache model. Moreover, compared to other offloading-based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV cache token budget, demonstrating its effectiveness in long-sequence scenarios. The code is available on our project website at https://yuzhenmao.github.io/IceCache/.
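A toy version of the cluster-then-select step (our illustration, not IceCache's engine), with a trivial stand-in for semantic clustering: past keys are grouped, each group is summarized by a centroid, and only tokens from the groups most relevant to the current query stay within budget:

```python
import torch

def select_kv(keys, query, n_clusters=8, budget=256):
    T = keys.shape[0]
    assign = torch.arange(T) % n_clusters           # stand-in for semantic k-means
    cents = torch.stack([keys[assign == c].mean(0) for c in range(n_clusters)])
    order = (cents @ query).argsort(descending=True)  # rank clusters by the query
    keep = []
    for c in order.tolist():
        keep += torch.nonzero(assign == c).flatten().tolist()
        if len(keep) >= budget:
            break
    return torch.tensor(keep[:budget])              # indices of retained tokens

idx = select_kv(torch.randn(1024, 64), torch.randn(64))
print(idx.shape)                                    # torch.Size([256])
```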
[1127] WaveMoE: A Wavelet-Enhanced Mixture-of-Experts Foundation Model for Time Series Forecasting
Shunyu Wu, Jiawei Huang, Weibin Feng, Boxin Li, Xiao Zhang, Erli Meng, Dan Li, Jian Lou, See-Kiong Ng
Main category: cs.LG
TL;DR: WaveMoE is a wavelet-enhanced mixture-of-experts foundation model for time series forecasting that integrates frequency-domain representations with scalable foundation models through a dual-path architecture.
Details
Motivation: Time series foundation models have shown success in universal forecasting, but incorporating frequency-domain information can better model complex temporal patterns like periodicity and high-frequency dynamics prevalent in real-world time series.
Method: Proposes WaveMoE with a dual-path architecture that jointly processes time series tokens and wavelet tokens along a unified temporal axis, using a shared expert routing mechanism for consistent expert specialization while scaling model capacity efficiently.
Result: Preliminary experimental results on 16 diverse benchmark datasets indicate WaveMoE has potential to improve forecasting performance by incorporating wavelet-domain corpora.
Conclusion: Integrating explicit frequency-domain representations into scalable foundation models through wavelet-enhanced architectures like WaveMoE can advance time series forecasting capabilities.
Abstract: Time series foundation models (TSFMs) have recently achieved remarkable success in universal forecasting by leveraging large-scale pretraining on diverse time series data. Complementing this progress, incorporating frequency-domain information yields promising performance in enhancing the modeling of complex temporal patterns, such as periodicity and localized high-frequency dynamics, which are prevalent in real-world time series. To advance this direction, we propose a new perspective that integrates explicit frequency-domain representations into scalable foundation models, and introduce WaveMoE, a wavelet-enhanced mixture-of-experts foundation model for time series forecasting. WaveMoE adopts a dual-path architecture that jointly processes time series tokens and wavelet tokens aligned along a unified temporal axis, and coordinates them through a shared expert routing mechanism that enables consistent expert specialization while efficiently scaling model capacity. Preliminary experimental results on 16 diverse benchmark datasets indicate that WaveMoE has the potential to further improve forecasting performance by incorporating wavelet-domain corpora.
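A toy construction of the two token streams using Haar wavelets; the tokenization choices below are ours, not WaveMoE's:

```python
import numpy as np
import pywt

# Raw patches become time-series tokens, and multi-level Haar wavelet
# coefficients become "wavelet tokens".
t = np.linspace(0, 8 * np.pi, 512)
x = np.sin(t) + 0.3 * np.sin(16 * t)              # slow trend + fast component
time_tokens = x.reshape(-1, 32)                   # 16 patches of length 32
coeffs = pywt.wavedec(x, "haar", level=3)         # [cA3, cD3, cD2, cD1]
wave_tokens = [c.reshape(-1, 8) for c in coeffs]  # coarse-to-fine wavelet tokens
print(time_tokens.shape, [w.shape for w in wave_tokens])
```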
[1128] Topology-Aware PAC-Bayesian Generalization Analysis for Graph Neural Networks
Xinping Yi
Main category: cs.LG
TL;DR: A theoretical framework for deriving topology-aware PAC-Bayesian generalization bounds for graph convolutional networks that explicitly incorporates graph structural properties.
Details
Motivation: While GNNs show strong empirical performance across domains, there's limited theoretical understanding of their generalization behavior, especially for graph classification where model parameters and graph structure interact complexly. Existing PAC-Bayesian bounds for GNNs often fail to fully exploit graph structures.
Method: Proposes a topology-aware PAC-Bayesian norm-based generalization framework for GCNs that reformulates bound derivation as stochastic optimization and introduces sensitivity matrices measuring classification output response to structured weight perturbations. Imposes different structures on sensitivity matrices from spatial and spectral perspectives.
Result: Derives a family of generalization error bounds with graph structures explicitly embedded, which recover existing results as special cases while yielding tighter bounds than state-of-the-art PAC-Bayesian bounds for GNNs.
Conclusion: The framework explicitly integrates graph structural properties into generalization analysis, enabling unified inspection of GNN generalization behavior from both spatial aggregation and spectral filtering viewpoints.
Abstract: Graph neural networks have demonstrated excellent applicability to a wide range of domains, including social networks, biological systems, recommendation systems, and wireless communications. Yet a principled theoretical understanding of their generalization behavior remains limited, particularly for graph classification tasks where complex interactions between model parameters and graph structure play a crucial role. Among existing theoretical tools, PAC-Bayesian norm-based generalization bounds provide a flexible and data-dependent framework; however, current results for GNNs often restrict the exploitation of graph structures. In this work, we propose a topology-aware PAC-Bayesian norm-based generalization framework for graph convolutional networks (GCNs) that extends a previously developed framework to graph-structured models. Our approach reformulates the derivation of generalization bounds as a stochastic optimization problem and introduces sensitivity matrices that measure the response of classification outputs with respect to structured weight perturbations. By imposing different structures on sensitivity matrices from both spatial and spectral perspectives, we derive a family of generalization error bounds with graph structures explicitly embedded. Such bounds could recover existing results as special cases, while yielding bounds that are tighter than state-of-the-art PAC-Bayesian bounds for GNNs. Notably, the proposed framework explicitly integrates graph structural properties into the generalization analysis, enabling a unified inspection of GNN generalization behavior from both spatial aggregation and spectral filtering viewpoints.
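As background, the classical PAC-Bayes bound that norm-based analyses of this kind start from is the McAllester/Maurer form below; the paper's topology-aware bounds refine this generic template rather than restate it.
```latex
% Classical PAC-Bayes bound (background, not the paper's result).
% With probability at least 1 - \delta over an i.i.d. sample of size m,
% simultaneously for all posteriors Q over hypotheses, given a prior P:
\mathbb{E}_{h \sim Q}\!\left[L(h)\right] \;\le\;
\mathbb{E}_{h \sim Q}\!\left[\widehat{L}(h)\right]
+ \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{m}}{\delta}}{2m}}
```
The topology-aware program is to make the KL term depend explicitly on graph-structural quantities (via the sensitivity matrices) instead of bare weight norms.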
[1129] Heterogeneous Connectivity in Sparse Networks: Fan-in Profiles, Gradient Hierarchy, and Topological Equilibria
Nikodem Tomczak
Main category: cs.LG
TL;DR: PSN uses deterministic heterogeneous connectivity profiles instead of uniform sparsity, showing random hub placement provides no accuracy advantage over uniform sparsity, but using PSN distributions to initialize dynamic sparse training improves performance.
Details
Motivation: To investigate whether heterogeneous connectivity patterns (with hub neurons) provide advantages over uniform random sparsity in neural networks, and whether task-aligned hub placement matters.
Method: Profiled Sparse Networks (PSN) replace uniform connectivity with deterministic fan-in profiles defined by continuous nonlinear functions. Benchmarked across 4 classification datasets with varying input dimensions and network depths. Analyzed gradient distributions and used PSN fan-in distributions to initialize RigL dynamic sparse training.
Result: At 90% sparsity, all static profiles (including uniform random) achieve accuracy within 0.2-0.6% of dense baselines, showing no advantage from arbitrary hub placement. Gradient analysis reveals 2-5x concentration at hub neurons. When PSN distributions initialize RigL, lognormal profiles outperform standard ERK initialization, with advantages growing on harder tasks (+0.16% to +0.49% improvements).
Conclusion: Random hub placement provides no accuracy advantage over uniform sparsity; which neurons become hubs matters more than connectivity variance. Starting dynamic sparse training at equilibrium fan-in distributions allows optimization to refine weights rather than rearrange topology.
Abstract: Profiled Sparse Networks (PSN) replace uniform connectivity with deterministic, heterogeneous fan-in profiles defined by continuous, nonlinear functions, creating neurons with both dense and sparse receptive fields. We benchmark PSN across four classification datasets spanning vision and tabular domains, input dimensions from 54 to 784, and network depths of 2–3 hidden layers. At 90% sparsity, all static profiles, including the uniform random baseline, achieve accuracy within 0.2-0.6% of dense baselines on every dataset, demonstrating that heterogeneous connectivity provides no accuracy advantage when hub placement is arbitrary rather than task-aligned. This result holds across sparsity levels (80-99.9%), profile shapes (eight parametric families, lognormal, and power-law), and fan-in coefficients of variation from 0 to 2.5. Internal gradient analysis reveals that structured profiles create a 2-5x gradient concentration at hub neurons compared to the ~1x uniform distribution in random baselines, with the hierarchy strength predicted by fan-in coefficient of variation ($r = 0.93$). When PSN fan-in distributions are used to initialise RigL dynamic sparse training, lognormal profiles matched to the equilibrium fan-in distribution consistently outperform standard ERK initialisation, with advantages growing on harder tasks, achieving +0.16% on Fashion-MNIST ($p = 0.036$, $d = 1.07$), +0.43% on EMNIST, and +0.49% on Forest Cover. RigL converges to a characteristic fan-in distribution regardless of initialisation. Starting at this equilibrium allows the optimiser to refine weights rather than rearrange topology. Which neurons become hubs matters more than the degree of connectivity variance, i.e., random hub placement provides no advantage, while optimisation-driven placement does.
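A hedged sketch of what a "profiled" sparse mask could look like: per-neuron fan-in follows a deterministic power-law profile (one of the parametric families the abstract mentions), while which inputs each neuron connects to remains random. Function names and parameter values are illustrative assumptions, not the paper's code.
```python
# Hypothetical profiled sparse mask: fan-in per output neuron set by a
# deterministic continuous nonlinear profile; input selection is random.
import numpy as np

def power_law_fanin_mask(n_in, n_out, sparsity=0.9, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    budget = int(round((1 - sparsity) * n_in * n_out))         # total connections
    profile = np.arange(1, n_out + 1, dtype=float) ** -gamma   # continuous profile
    fanin = np.round(profile / profile.sum() * budget).astype(int)
    fanin = fanin.clip(1, n_in)                                # valid fan-in range
    mask = np.zeros((n_out, n_in), dtype=bool)
    for j, k in enumerate(fanin):                              # neuron j gets k inputs
        mask[j, rng.choice(n_in, size=k, replace=False)] = True
    return mask

mask = power_law_fanin_mask(784, 256, sparsity=0.9)
fanin = mask.sum(axis=1)
print(f"density={mask.mean():.3f}, fan-in CV={fanin.std() / fanin.mean():.2f}")
```
The fan-in coefficient of variation printed at the end is the quantity the paper reports as predictive of gradient-hierarchy strength.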
[1130] ReadMOF: Structure-Free Semantic Embeddings from Systematic MOF Nomenclature for Machine Learning
Kewei Zhu, Cameron Wilson, Bartosz Mazur, Yi Li, Ashleigh M. Chester, Peyman Z. Moghadam
Main category: cs.LG
TL;DR: ReadMOF is a framework that uses systematic chemical names (IUPAC-style nomenclature) for metal-organic frameworks to model structure-property relationships without requiring atomic coordinates or connectivity graphs, using pretrained language models to convert names into embeddings for materials informatics tasks.
Details
Motivation: Systematic chemical names contain rich structural and compositional information in standardized textual format, but this information is typically underutilized in materials science. The researchers aim to leverage these names as an alternative to traditional geometry-dependent representations for modeling materials properties.
Method: ReadMOF employs pretrained language models to convert systematic MOF names from the Cambridge Structural Database into vector embeddings. These embeddings serve as descriptors for materials informatics tasks including property prediction, similarity retrieval, and clustering. The framework also explores integration with large language models for chemically meaningful reasoning.
Result: The embeddings generated from systematic names closely represent traditional structure-based descriptors and enable applications in materials informatics with performance comparable to geometry-dependent methods. The approach shows that structured chemical language interpreted through NLP techniques can provide a scalable, interpretable, and geometry-independent alternative to conventional molecular representations.
Conclusion: Systematic chemical names, when processed with modern natural language processing techniques, offer a viable alternative to traditional geometry-dependent representations for materials science. This language-driven approach opens new opportunities for scalable and interpretable discovery in materials science without requiring atomic coordinates.
Abstract: Systematic chemical names, such as IUPAC-style nomenclature for metal-organic frameworks (MOFs), contain rich structural and compositional information in a standardized textual format. Here we introduce ReadMOF, which is, to our knowledge, the first structure-free machine learning framework that leverages these names to model structure-property relationships without requiring atomic coordinates or connectivity graphs. By employing pretrained language models, ReadMOF converts systematic MOF names from the Cambridge Structural Database (CSD) into vector embeddings that closely represent traditional structure-based descriptors. These embeddings enable applications in materials informatics, including property prediction, similarity retrieval, and clustering, with performance comparable to geometry-dependent methods. When combined with large language models, ReadMOF also establishes chemically meaningful reasoning ability with textual input only. Our results show that structured chemical language, interpreted through modern natural language processing techniques, can provide a scalable, interpretable, and geometry-independent alternative to conventional molecular representations. This approach opens new opportunities for language-driven discovery in materials science.
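The general recipe is easy to reproduce with off-the-shelf tools. A minimal sketch: embed names with a pretrained language model via mean pooling. The checkpoint below is a placeholder (not the model used in the paper), and the names are made-up examples of the systematic style.
```python
# Sketch of the general recipe (not the authors' pipeline): embed chemical
# names with a pretrained language model via attention-masked mean pooling.
import torch
from transformers import AutoTokenizer, AutoModel

ckpt = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
lm = AutoModel.from_pretrained(ckpt)

names = [
    "catena-[bis(mu-terephthalato)-di-zinc]",                 # invented examples
    "catena-[(mu-4,4'-bipyridine)-copper(II) dinitrate]",
]
batch = tok(names, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = lm(**batch).last_hidden_state            # (B, L, H)
mask = batch["attention_mask"].unsqueeze(-1)          # ignore padding tokens
emb = (hidden * mask).sum(1) / mask.sum(1)            # mean-pooled name embeddings
print(emb.shape)                                      # e.g. torch.Size([2, 384])
```
The resulting vectors can then feed standard regressors, nearest-neighbour retrieval, or clustering, which is the downstream usage the abstract describes.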
[1131] WOODELF-HD: Efficient Background SHAP for High-Depth Decision Trees
Ron Wettenstein, Alexander Nadel, Udi Boker
Main category: cs.LG
TL;DR: WoodelfHD extends Woodelf to reduce Background SHAP computation complexity from O(3^D) to O(2^D) for decision tree ensembles, enabling exact SHAP computation for deeper trees up to depth 21 with significant speedups.
Details
Motivation: Background SHAP is accurate for interpreting decision tree ensembles but doesn't scale well. Recent methods like Woodelf and PLTreeSHAP reduce time complexity but introduce exponential preprocessing (3^D) with tree depth D, making them impractical for deep trees.
Method: WoodelfHD extends Woodelf with a Strassen-like multiplication scheme that exploits matrix structure to reduce matrix-vector multiplication from O(k^2) to O(k*log(k)). Uses fully vectorized, non-recursive implementation and merges path nodes with identical features to reduce cache size and memory usage.
Result: Enables exact Background SHAP computation for trees with depths up to 21 (previous methods fail due to memory). Achieves speedups of 33x for depth 12 and 162x for depth 15 over state-of-the-art.
Conclusion: WoodelfHD significantly improves scalability of Background SHAP for deep decision tree ensembles, making exact SHAP computation practical for deeper trees through algorithmic optimizations and memory efficiency improvements.
Abstract: Decision-tree ensembles are a cornerstone of predictive modeling, and SHAP is a standard framework for interpreting their predictions. Among its variants, Background SHAP offers high accuracy by modeling missing features using a background dataset. Historically, this approach did not scale well, as the time complexity for explaining n instances using m background samples included an O(mn) component. Recent methods such as Woodelf and PLTreeSHAP reduce this to O(m+n), but introduce a preprocessing bottleneck that grows as 3^D with tree depth D, making them impractical for deep trees. We address this limitation with WoodelfHD, a Woodelf extension that reduces the 3^D factor to 2^D. The key idea is a Strassen-like multiplication scheme that exploits the structure of Woodelf matrices, reducing matrix-vector multiplication from O(k^2) to O(k*log(k)) via a fully vectorized, non-recursive implementation. In addition, we merge path nodes with identical features, reducing cache size and memory usage. When running on standard environments, WoodelfHD enables exact Background SHAP computation for trees with depths up to 21, where previous methods fail due to excessive memory usage. For ensembles of depths 12 and 15, it achieves speedups of 33x and 162x, respectively, over the state-of-the-art.
[1132] Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs
Subramanyam Sahoo
Main category: cs.LG
TL;DR: Sycophantic RLHF training degrades LLM calibration despite post-hoc scaling, showing reward hacking leaves structured miscalibration residuals.
Details
Motivation: To investigate whether sycophantic reward signals in RLHF-style training degrade model calibration, which is essential for reliable uncertainty quantification in LLMs.
Method: Fine-tuned Qwen3-8B under three regimes: base model, neutral SFT on TriviaQA, and sycophancy-inducing GRPO that rewards agreement with wrong answers. Evaluated on 1,000 MMLU items across five domains with bootstrap confidence intervals and permutation testing.
Result: Sycophantic GRPO produced consistent directional calibration degradation (ECE rose +0.006 vs base, MCE increased +0.010 vs neutral SFT), though not statistically significant at this training budget. Post-hoc matrix scaling reduced ECE by 40-64% and improved accuracy by 1.5-3.0 percentage points, but sycophantic model retained highest post-scaling ECE.
Conclusion: Reward-induced miscalibration leaves structured residuals even after affine correction, establishing methodology for evaluating calibration impact of reward hacking and motivating calibration-aware training objectives.
Abstract: Modern large language models (LLMs) are increasingly fine-tuned via reinforcement learning from human feedback (RLHF) or related reward optimisation schemes. While such procedures improve perceived helpfulness, we investigate whether sycophantic reward signals degrade calibration – a property essential for reliable uncertainty quantification. We fine-tune Qwen3-8B under three regimes: no fine-tuning (base), neutral supervised fine-tuning (SFT) on TriviaQA, and sycophancy-inducing Group Relative Policy Optimisation (GRPO) that rewards agreement with planted wrong answers. Evaluating on 1,000 MMLU items across five subject domains with bootstrap confidence intervals and permutation testing, we find that sycophantic GRPO produces consistent directional calibration degradation – ECE rises by $+0.006$ relative to the base model and MCE increases by $+0.010$ relative to neutral SFT – though the effect does not reach statistical significance ($p = 0.41$) at this training budget. Post-hoc matrix scaling applied to all three models reduces ECE by 40–64% and improves accuracy by 1.5–3.0 percentage points. However, the sycophantic model retains the highest post-scaling ECE relative to the neutral SFT control ($0.042$ vs. $0.037$), suggesting that reward-induced miscalibration leaves a structured residual even after affine correction. These findings establish a methodology for evaluating the calibration impact of reward hacking and motivate calibration-aware training objectives.
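ECE, the headline metric above, has a standard definition worth having at hand; this is the textbook equal-width-bin estimator, not the paper's code.
```python
# Standard expected calibration error (ECE) with equal-width confidence bins.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap          # weight gap by bin mass
    return ece

conf = np.random.uniform(0.5, 1.0, 1000)        # toy model confidences
corr = np.random.binomial(1, conf * 0.9)        # systematically overconfident
print(f"ECE = {expected_calibration_error(conf, corr):.3f}")
```
MCE is the same computation with a max over bins instead of the weighted sum.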
[1133] Preventing Latent Rehearsal Decay in Online Continual SSL with SOLAR
Giacomo Cignoni, Simone Magistri, Andrew D. Bagdanov, Antonio Carta
Main category: cs.LG
TL;DR: SOLAR is a method for Online Continual Self-Supervised Learning that addresses the stability-plasticity trade-off in continuous streams of unlabeled data through adaptive buffer management and an explicit overlap loss.
Details
Motivation: Online Continual Self-Supervised Learning (OCSSL) faces challenges with the stability-plasticity trade-off, where stable methods converge faster but can collapse under certain conditions due to latent space degradation from excessive replay stability.
Method: Proposes SOLAR with two key components: 1) Uses efficient online proxies of the Deviation metric to guide adaptive buffer management, and 2) Incorporates an explicit Overlap loss to prevent latent degradation. Introduces two diagnostic metrics (Overlap and Deviation) to measure latent space quality.
Result: SOLAR achieves state-of-the-art performance on OCSSL vision benchmarks, demonstrating both high convergence speed and superior final performance compared to existing methods.
Conclusion: The paper successfully addresses the stability-plasticity challenge in OCSSL through adaptive buffer management guided by latent space quality metrics, enabling robust continual learning from unlabeled data streams.
Abstract: This paper explores Online Continual Self-Supervised Learning (OCSSL), a scenario in which models learn from continuous streams of unlabeled, non-stationary data, where methods typically employ replay and fast convergence is a central desideratum. We find that OCSSL requires particular attention to the stability-plasticity trade-off: stable methods (e.g. replay with Reservoir sampling) are able to converge faster compared to plastic ones (e.g. FIFO buffer), but incur performance drops under certain conditions. We explain this collapse phenomenon with the Latent Rehearsal Decay hypothesis, which attributes it to latent space degradation under excessive stability of replay. We introduce two metrics (Overlap and Deviation) that diagnose latent degradation and correlate with accuracy declines. Building on these insights, we propose SOLAR, which leverages efficient online proxies of Deviation to guide buffer management and incorporates an explicit Overlap loss, allowing SOLAR to adaptively manage plasticity. Experiments demonstrate that SOLAR achieves state-of-the-art performance on OCSSL vision benchmarks, with both high convergence speed and strong final performance.
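The "stable" baseline the paper contrasts with FIFO is classic reservoir sampling (Algorithm R). A minimal buffer looks like this; a generic sketch, not SOLAR itself.
```python
# Minimal reservoir-sampling replay buffer (classic Algorithm R): each
# element of the stream ends up in the buffer with probability capacity/seen.
import random

class ReservoirBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # total stream items observed so far

    def add(self, x):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(x)
        else:
            j = random.randrange(self.seen)    # uniform over the whole stream
            if j < self.capacity:
                self.items[j] = x

    def sample(self, k):
        return random.sample(self.items, min(k, len(self.items)))

buf = ReservoirBuffer(capacity=100)
for t in range(10_000):
    buf.add(t)
```
Because old items persist indefinitely, the buffer is maximally stable, which is exactly the behaviour the Latent Rehearsal Decay hypothesis identifies as a failure mode when taken to excess.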
[1134] Distributionally Robust PAC-Bayesian Control
Domagoj Herceg, Duarte Antunes
Main category: cs.LG
TL;DR: A distributionally robust PAC-Bayesian framework for certifying learning-based finite-horizon controllers that addresses unbounded losses and sim-to-real distribution shifts using Wasserstein distance and System Level Synthesis.
Details
Motivation: Existing PAC-Bayes control methods assume bounded losses and matching training/deployment distributions, but real-world applications face unbounded losses and environmental distribution shifts (sim-to-real gap).
Method: Combines PAC-Bayes generalization theory with distributionally robust optimization via type-1 Wasserstein distance, leverages System Level Synthesis reparametrization to derive sub-Gaussian loss proxy and bound performance loss due to distribution shift.
Result: Develops computationally tractable optimization-based framework with high-probability safety certificates for linear time-invariant systems, tying performance directly to operator norm of closed-loop map.
Conclusion: Provides robust certification framework for learning-based controllers that can handle unbounded losses and distribution shifts, enabling safer deployment in real-world environments that differ from training conditions.
Abstract: We present a distributionally robust PAC-Bayesian framework for certifying the performance of learning-based finite-horizon controllers. While existing PAC-Bayes control literature typically assumes bounded losses and matching training and deployment distributions, we explicitly address unbounded losses and environmental distribution shifts (the sim-to-real gap). We achieve this by drawing on two modern lines of research, namely the PAC-Bayes generalization theory and distributionally robust optimization via the type-1 Wasserstein distance. By leveraging the System Level Synthesis (SLS) reparametrization, we derive a sub-Gaussian loss proxy and a bound on the performance loss due to distribution shift. Both are tied directly to the operator norm of the closed-loop map. For linear time-invariant systems, this yields a computationally tractable optimization-based framework together with high-probability safety certificates for deployment in real-world environments that differ from those used in training.
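For reference, the type-1 Wasserstein distance that defines the ambiguity set is standard; the radius-ρ ball below is the usual DRO construction rather than anything paper-specific.
```latex
% Type-1 Wasserstein distance between distributions P and Q:
W_1(P, Q) \;=\; \inf_{\gamma \in \Gamma(P, Q)}
  \mathbb{E}_{(x, y) \sim \gamma}\!\left[\, \lVert x - y \rVert \,\right],
% where \Gamma(P, Q) is the set of couplings with marginals P and Q.
% A distributionally robust certificate then holds uniformly over all
% deployment distributions Q with W_1(\widehat{P}, Q) \le \rho.
```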
[1135] MoEITS: A Green AI approach for simplifying MoE-LLMs
Luis Balderas, Miguel Lastra, José M. Benítez
Main category: cs.LG
TL;DR: MoEITS: A novel Information Theory-based algorithm for simplifying Mixture-of-Experts Large Language Models to reduce computational burden while maintaining accuracy.
Details
Motivation: Mixture-of-Experts (MoE) architectures are powerful but computationally expensive for both training and inference, requiring simplification methods to reduce computing, memory footprint, and energy consumption.
Method: MoEITS uses standardized Information Theoretic frameworks to simplify MoE-LLMs, with theoretical analysis of computational complexity and practical implementation tested on models like Mixtral 8×7B, Qwen1.5-2.7B, and DeepSeek-V2-Lite.
Result: MoEITS outperforms state-of-the-art MoE-LLM pruning methods, generating models that are both effective across benchmarks and computationally efficient.
Conclusion: The proposed algorithm provides an effective solution for simplifying MoE-LLMs with theoretical foundations and practical benefits, making powerful MoE architectures more accessible.
Abstract: Large language models are transforming all areas of academia and industry, attracting the attention of researchers, professionals, and the general public. In the quest for more powerful architectures, Mixture-of-Experts designs, inspired by ensemble models, have emerged as one of the most effective directions. However, this implies a high computational burden for both training and inference. To reduce the impact on computing and memory footprint as well as the energy consumption, simplification methods have arisen as very effective procedures. In this paper, an original algorithm, MoEITS, for MoE-LLMs simplification is presented. The algorithm is characterized by a refined simplicity, underpinned by standardized Information Theoretic frameworks. MoEITS is analyzed in depth from theoretical and practical points of view. Its computational complexity is studied. Its performance, in terms of the accuracy of the simplified LLMs and the reduction rate achieved, is assessed through thoroughly designed experimentation. This empirical evaluation includes a comparison with state-of-the-art MoE-LLM pruning methods applied on Mixtral $8\times7$B, Qwen1.5-2.7B, and DeepSeek-V2-Lite. The extensive experimentation conducted demonstrates that MoEITS outperforms state-of-the-art techniques by generating models that are both effective across all benchmarks and computationally efficient. The code implementing the method will be available at https://github.com/luisbalru/MoEITS.
[1136] Mitigating Privacy Risk via Forget Set-Free Unlearning
Aviraj Newatia, Michael Cooper, Viet Nguyen, Rahul G. Krishnan
Main category: cs.LG
TL;DR: A machine unlearning method called Reload that enables efficient removal of training data influence without direct access to the “forget set” using gradient optimization and structured weight sparsification.
Details
Motivation: Traditional machine unlearning methods require direct access to the data to be forgotten, forcing organizations to retain sensitive data longer than necessary and increasing security risks. There's a need for methods that can unlearn without explicit access to the forget set.
Method: Proposes partially-blind unlearning using auxiliary information instead of direct forget set access. Introduces Reload framework based on gradient optimization and structured weight sparsification to operationalize this approach.
Result: Reload efficiently approximates models retrained from scratch and outperforms forget set-dependent approaches. On Llama2-7B, it unlearns entities using <0.025% of retain set and <7% of model weights in <8 minutes. In corrective cases, achieves unlearning with only 10% of corrupted data identified.
Conclusion: Partially-blind unlearning via Reload provides a practical solution for data removal without retaining sensitive forget sets, reducing security risks while maintaining model performance.
Abstract: Training machine learning models requires the storage of large datasets, which often contain sensitive or private data. Storing data is associated with a number of potential risks which increase over time, such as database breaches and malicious adversaries. Machine unlearning is the study of methods to efficiently remove the influence of training data subsets from previously-trained models. Existing unlearning methods typically require direct access to the “forget set” – the data to be forgotten – and organisations must retain this data for unlearning rather than deleting it immediately upon request, increasing risks associated with the forget set. We introduce partially-blind unlearning – utilizing auxiliary information to unlearn without explicit access to the forget set. We also propose Reload, a practical partially-blind method based on gradient optimization and structured weight sparsification that operationalizes partially-blind unlearning. We show that Reload efficiently unlearns, approximating models retrained from scratch, and outperforms several forget set-dependent approaches. On language models, Reload unlearns entities using <0.025% of the retain set and <7% of model weights in <8 minutes on Llama2-7B. In the corrective case, Reload achieves unlearning even when only 10% of corrupted data is identified.
[1137] SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates
Rajveer Singh
Main category: cs.LG
TL;DR: LoRA weight updates are dominated by low-frequency components, enabling 10x storage reduction with minimal performance loss, suggesting spectral sparsity as a new PEFT design principle.
Details
Motivation: To understand the spectral structure of LoRA weight updates and explore whether parameter-efficient fine-tuning methods exhibit frequency-domain sparsity that could be exploited for more efficient adaptation.
Method: Systematic empirical study using 2D Discrete Cosine Transform (DCT) analysis of trained LoRA adaptation matrices across BERT-base and RoBERTa-base on four GLUE benchmarks (SST-2, MNLI, CoLA, QQP).
Result: LoRA updates are universally dominated by low-frequency components (33% of DCT coefficients capture 90% of spectral energy). Retaining only 10% of frequency coefficients reduces adapter storage by 10x with only 1.95pp performance drop on SST-2. Frequency masking at k=50% improves over full LoRA on 3 of 8 model-task pairs.
Conclusion: Spectral sparsity in adaptation matrices is a fundamental property that can be exploited for more efficient PEFT methods, with task complexity governing spectral sensitivity and model architecture affecting compressibility.
Abstract: We present a systematic empirical study of the spectral structure of LoRA weight updates. Through 2D Discrete Cosine Transform (DCT) analysis of trained adaptation matrices across BERT-base and RoBERTa-base on four GLUE benchmarks (SST-2, MNLI, CoLA, QQP), we establish that LoRA updates are universally dominated by low-frequency components: on average, just 33% of DCT coefficients capture 90% of total spectral energy. Retaining only 10% of frequency coefficients reduces adapter storage by 10x while sacrificing only 1.95pp on SST-2. Notably, frequency masking at k=50% improves over full LoRA on 3 of 8 model-task pairs, suggesting high-frequency components act as adaptation noise. We further discover that RoBERTa-base is systematically more spectrally compressible than BERT-base across all tasks, and that task complexity governs spectral sensitivity – NLI tasks require more frequency budget than sentiment classification. These findings motivate a new design principle for PEFT: spectral sparsity in adaptation.
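The analysis pipeline is simple enough to sketch end to end with SciPy. Here the LoRA update is random noise rather than a trained delta W = B @ A, so only the mechanics carry over, not the energy concentration the paper reports.
```python
# Sketch of the spectral analysis pipeline: 2D DCT of a LoRA-style update,
# keep a low-frequency block, reconstruct, and measure retained energy.
import numpy as np
from scipy.fft import dctn, idctn

rank, d = 8, 768
B = np.random.randn(d, rank) * 0.02
A = np.random.randn(rank, d) * 0.02
delta_w = B @ A                                   # LoRA weight update (d, d)

coeffs = dctn(delta_w, norm="ortho")              # 2D DCT-II
keep = 0.10                                       # retain 10% of coefficients
k = int(np.sqrt(keep) * d)                        # side of low-frequency block
mask = np.zeros_like(coeffs)
mask[:k, :k] = 1.0
compressed = idctn(coeffs * mask, norm="ortho")   # low-pass reconstruction

energy = (coeffs[:k, :k] ** 2).sum() / (coeffs ** 2).sum()
print(f"kept {keep:.0%} of coeffs, {energy:.1%} of spectral energy")
```
On trained adapters the paper finds the retained-energy figure is dramatically higher than chance, which is what licenses the 10x storage reduction.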
[1138] Energy-Efficient Federated Edge Learning For Small-Scale Datasets in Large IoT Networks
Haihui Xie, Wenkun Wen, Shuwu Chen, Zhaogang Shu, Minghua Xia
Main category: cs.LG
TL;DR: Proposes collaborative optimization framework for energy-efficient federated edge learning with small-scale IoT datasets using stochastic online learning and resource optimization.
Details
Motivation: IoT networks face resource constraints and heterogeneous sensory data collection challenges, especially with small-scale datasets, leading to inefficient resource utilization and reduced learning performance at independent edge nodes.
Method: Derives expected learning loss to quantify relationship between training samples and learning objectives, designs stochastic online learning algorithm to adapt to data variations, formulates resource optimization problem with convergence bound, and develops online distributed algorithm for large-scale optimization.
Result: Extensive simulations and autonomous navigation case studies with collision avoidance demonstrate significant improvements in learning performance and resource efficiency compared to state-of-the-art benchmarks.
Conclusion: The proposed collaborative optimization framework effectively addresses challenges of federated edge learning with small-scale datasets in IoT networks, improving both learning outcomes and resource utilization.
Abstract: Large-scale Internet of Things (IoT) networks enable intelligent services such as smart cities and autonomous driving, but often face resource constraints. Collecting heterogeneous sensory data, especially in small-scale datasets, is challenging, and independent edge nodes can lead to inefficient resource utilization and reduced learning performance. To address these issues, this paper proposes a collaborative optimization framework for energy-efficient federated edge learning with small-scale datasets. We first derive an expected learning loss to quantify the relationship between the number of training samples and learning objectives. A stochastic online learning algorithm is then designed to adapt to data variations, and a resource optimization problem with a convergence bound is formulated. Finally, an online distributed algorithm efficiently solves large-scale optimization problems with high scalability. Extensive simulations and autonomous navigation case studies with collision avoidance demonstrate that the proposed approach significantly improves learning performance and resource efficiency compared to state-of-the-art benchmarks.
[1139] Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents
Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, Honggang Qi
Main category: cs.LG
TL;DR: Skill-SD improves RL for LLM agents by using agent trajectories to create dynamic natural language skills as privileged teacher supervision, with importance-weighted distillation and teacher-student synchronization.
Details
Motivation: Standard RL for LLM agents suffers from sparse rewards and long horizons. On-policy self-distillation helps but uses fixed privileged information that can't capture diverse valid strategies and often causes training collapse.
Method: Skill-SD converts agent trajectories into compact natural language skills describing behaviors, mistakes, and workflows. These skills serve as dynamic privileged information for the teacher only, while the student learns under plain task prompts. Uses importance-weighted reverse-KL loss for gradient-correct token-level distillation and dynamically synchronizes teacher with improving student.
Result: Substantially outperforms standard RL baselines: improves vanilla GRPO by +14.0%/+10.9% on AppWorld/Sokoban and vanilla OPD by +42.1%/+40.6%.
Conclusion: Skill-SD effectively addresses limitations of standard RL and on-policy self-distillation by creating dynamic privileged supervision from agent trajectories, enabling more efficient and stable training for LLM agents.
Abstract: Reinforcement learning (RL) has been widely used to train LLM agents for multi-turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On-policy self-distillation (OPSD) alleviates this by providing dense token-level supervision from a privileged teacher that has access to ground-truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining OPSD with RL often leads to training collapse. To address these limitations, we introduce Skill-SD, a framework that turns the agent’s own trajectories into dynamic training-only supervision. Completed trajectories are summarized into compact natural language skills that describe successful behaviors, mistakes, and workflows. These skills serve as dynamic privileged information conditioning only the teacher, while the student always acts under the plain task prompt and learns to internalize the guidance through distillation. To stabilize the training, we derive an importance-weighted reverse-KL loss to provide gradient-correct token-level distillation, and dynamically synchronize the teacher with the improving student. Experimental results on agentic benchmarks demonstrate that Skill-SD substantially outperforms the standard RL baseline, improving both vanilla GRPO (+14.0%/+10.9% on AppWorld/Sokoban) and vanilla OPD (+42.1%/+40.6%). Project page: https://k1xe.github.io/skill-sd/
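An importance-weighted reverse-KL distillation loss can be given a plausible concrete shape, sketched below. The weighting scheme, the clamp threshold, and the function signature are all assumptions; the paper's exact formulation is not reproduced here.
```python
# One plausible shape for token-level importance-weighted reverse-KL
# distillation (illustrative; not Skill-SD's actual loss).
import torch
import torch.nn.functional as F

def iw_reverse_kl(student_logits, teacher_logits, taken, behavior_logp):
    """student/teacher logits: (B, T, V); taken: (B, T) sampled token ids;
    behavior_logp: (B, T) log-probs of `taken` under the rollout policy."""
    logp_s = F.log_softmax(student_logits, dim=-1)
    logp_t = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL per token: E_{v ~ student}[log student - log teacher].
    rkl = (logp_s.exp() * (logp_s - logp_t)).sum(-1)            # (B, T)
    # Importance ratio corrects for rollouts sampled by an older policy.
    logp_taken = logp_s.gather(-1, taken.unsqueeze(-1)).squeeze(-1)
    ratio = (logp_taken - behavior_logp).exp().detach().clamp(max=5.0)
    return (ratio * rkl).mean()

B, T, V = 2, 16, 100
loss = iw_reverse_kl(torch.randn(B, T, V, requires_grad=True),
                     torch.randn(B, T, V),
                     torch.randint(V, (B, T)),
                     torch.randn(B, T).abs().neg())
loss.backward()
```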
[1140] SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting
Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai
Main category: cs.LG
TL;DR: SCOPE introduces adaptive token-level supervision for RL alignment in LLMs by routing rollouts based on correctness and applying weighted distillation strategies to optimize credit assignment.
Details
Motivation: On-policy RL for LLM alignment suffers from sparse outcome-level rewards and poor token-level credit assignment. Existing On-Policy Distillation (OPD) applies uniform KL supervision across all rollouts, ignoring differences in signal quality between correct and incorrect trajectories.
Method: SCOPE uses a dual-path adaptive training framework that routes on-policy rollouts by correctness: 1) For incorrect trajectories, applies teacher-perplexity-weighted KL distillation to prioritize teacher’s corrective capability; 2) For correct trajectories, uses student-perplexity-weighted MLE to focus on low-confidence samples at capability boundaries. Both paths use group-level normalization to adaptively calibrate weight distributions across prompts.
Result: Extensive experiments on six reasoning benchmarks show SCOPE achieves average relative improvements of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating consistent effectiveness.
Conclusion: SCOPE effectively addresses credit assignment challenges in RL alignment by adaptively calibrating token-level supervision based on rollout correctness and signal quality, leading to significant performance improvements in reasoning tasks.
Abstract: On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.
[1141] Communication-Efficient Gluon in Federated Learning
Xun Qian, Alexander Gaponov, Grigory Malinovsky, Peter Richtárik
Main category: cs.LG
TL;DR: Compressed optimization algorithms for distributed training of large neural networks using variance reduction techniques to reduce communication costs.
Details
Motivation: Communication cost becomes the bottleneck when training large-scale neural networks across massive machines. Muon-type optimizers show promise over Adam-type methods, but need compression techniques to reduce communication overhead.
Method: Extends Muon optimizer to Gluon under layer-wise smooth setting with unbiased and contraction compressors. Employs variance reduction techniques from SARAH to reduce compression error. Incorporates momentum variance reduction (MVR) for improved convergence.
Result: Achieves convergence rates and improved communication cost under certain conditions. Obtains new variance reduced algorithm with faster convergence than Gluon. Comparable communication cost derived under weaker conditions with MVR.
Conclusion: Compressed algorithms demonstrate superior performance in terms of communication cost reduction for distributed training of large neural networks.
Abstract: Recent developments have shown that Muon-type optimizers based on linear minimization oracles (LMOs) over non-Euclidean norm balls have the potential to achieve better practical performance than Adam-type methods in the training of large language models. Since large-scale neural networks are trained across massive machines, communication cost becomes the bottleneck. To address this bottleneck, we investigate Gluon, which is an extension of Muon under the more general layer-wise $(L^0, L^1)$-smooth setting, with both unbiased and contraction compressors. In order to reduce the compression error, we employ the variance reduction technique of SARAH in our compressed methods. The convergence rates and improved communication cost are achieved under certain conditions. As a byproduct, a new variance-reduced algorithm with a faster convergence rate than Gluon is obtained. We also incorporate momentum variance reduction (MVR) to these compressed algorithms and comparable communication cost is derived under weaker conditions when $L_i^1 \neq 0$. Finally, several numerical experiments are conducted to verify the superior performance of our compressed algorithms in terms of communication cost.
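A contraction compressor, the primitive the analysis relies on, is typically instantiated as top-k sparsification; a generic sketch follows (not Gluon's code).
```python
# Minimal top-k contraction compressor: keep the largest-magnitude entries.
# Top-k satisfies the contraction property ||C(g) - g||^2 <= (1 - k/d) ||g||^2.
import torch

def topk_compress(g, ratio=0.01):
    flat = g.flatten()
    k = max(1, int(ratio * flat.numel()))
    idx = flat.abs().topk(k).indices
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]              # only k values + indices need transmitting
    return out.view_as(g)

g = torch.randn(1024, 1024)
c = topk_compress(g, ratio=0.01)
print(f"kept 1% of entries, relative error {(c - g).norm() / g.norm():.3f}")
```
Unbiased compressors (e.g. random sparsification with rescaling) trade this deterministic error bound for unbiasedness, which is why the paper analyzes both families.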
[1142] Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
Zikang Shan, Han Zhong, Liwei Wang, Li Zhao
Main category: cs.LG
TL;DR: GenAC replaces conventional discriminative value critics with generative critics using chain-of-thought reasoning for better credit assignment in LLM reinforcement learning.
Details
Motivation: Classical actor-critic methods use value functions for credit assignment, but learned value models are avoided in modern LLM RL due to training difficulties. The authors argue this is due to limited expressiveness in conventional one-shot value prediction.
Method: Proposes Generative Actor-Critic (GenAC) which replaces one-shot scalar value prediction with a generative critic that performs chain-of-thought reasoning before producing value estimates. Also introduces In-Context Conditioning to keep the critic calibrated to the current actor.
Result: GenAC improves value approximation, ranking reliability, and out-of-distribution generalization compared to value-based and value-free baselines. These gains translate into stronger downstream RL performance.
Conclusion: Stronger value modeling through generative critics with reasoning capabilities is a promising direction for improving credit assignment in LLM reinforcement learning.
Abstract: Credit assignment is a central challenge in reinforcement learning (RL). Classical actor-critic methods address this challenge through fine-grained advantage estimation based on a learned value function. However, learned value models are often avoided in modern large language model (LLM) RL because conventional discriminative critics are difficult to train reliably. We revisit value modeling and argue that this difficulty is partly due to limited expressiveness. In particular, representation complexity theory suggests that value functions can be hard to approximate under the one-shot prediction paradigm used by existing value models, and our scaling experiments show that such critics do not improve reliably with scale. Motivated by this observation, we propose Generative Actor-Critic (GenAC), which replaces one-shot scalar value prediction with a generative critic that performs chain-of-thought reasoning before producing a value estimate. We further introduce In-Context Conditioning, which helps the critic remain calibrated to the current actor throughout training. GenAC improves value approximation, ranking reliability, and out-of-distribution generalization, and these gains translate into stronger downstream RL performance than both value-based and value-free baselines. Overall, our results suggest that stronger value modeling is a promising direction for improving credit assignment in LLM reinforcement learning.
[1143] INCRT: An Incremental Transformer That Determines Its Own Architecture
Giansalvo Cirrincione
Main category: cs.LG
TL;DR: INCRT is an incremental transformer architecture that dynamically grows and prunes attention heads during training based on task requirements, eliminating structural redundancy.
Details
Motivation: Current transformer architectures have fixed structures determined by trial and error, leading to systematic structural redundancy where many attention heads are unnecessary but still consume parameters and computation.
Method: INCRT starts with a single attention head and incrementally adds heads when current configuration is provably insufficient, while pruning redundant heads. Growth decisions use an online-computable geometric quantity derived from task’s directional structure.
Result: Experiments on SARS-CoV-2 variant classification and SST-2 sentiment analysis show predicted and observed head counts agree within 12%, with final architectures matching or exceeding BERT-base performance while using 3-7x fewer parameters and no pre-training.
Conclusion: INCRT provides a principled, mathematically-grounded approach to transformer architecture design that eliminates redundancy while maintaining performance, with theoretical guarantees of minimal sufficient configurations.
Abstract: Transformer architectures are designed by trial and error: the number of attention heads, the depth, and the head size are fixed before training begins, with no mathematical principle to guide the choice. The result is systematic structural redundancy – between half and four-fifths of all heads in a trained model can be removed without measurable loss – because the architecture allocates capacity without reference to the actual requirements of the task. This paper introduces INCRT (Incremental Transformer), an architecture that determines its own structure during training. Starting from a single head, INCRT adds one attention head at a time whenever its current configuration is provably insufficient, and prunes heads that have become redundant. Each growth decision is driven by a single, online-computable geometric quantity derived from the task’s directional structure, requiring no separate validation phase and no hand-tuned schedule. Two theorems form the theoretical backbone. The first (homeostatic convergence) establishes that the system always reaches a finite stopping configuration that is simultaneously minimal (no redundant heads) and sufficient (no uncaptured directional energy above the threshold). The second (compressed-sensing analogy) provides a geometric upper bound on the number of heads that this configuration can contain, as a function of the spectral complexity of the task. Experiments on SARS-CoV-2 variant classification and SST-2 sentiment analysis confirm both results: the predicted and observed head counts agree within 12% across all benchmarks, and the final architectures match or exceed BERT-base on distribution-specific tasks while using between three and seven times fewer parameters and no pre-training.
[1144] PokeRL: Reinforcement Learning for Pokemon Red
Dheeraj Mudireddy, Sai Patibandla
Main category: cs.LG
TL;DR: PokeRL is a modular reinforcement learning system for training agents to complete early game tasks in Pokemon Red, addressing challenges like action loops, menu spam, and sparse rewards through specialized environment wrappers and reward designs.
Details
Motivation: Pokemon Red presents a challenging RL benchmark with sparse rewards, partial observability, and quirky controls. Existing PPO agents can clear early gyms but training remains brittle, with agents often degenerating into action loops, menu spam, or unproductive wandering.
Method: Developed PokeRL with: 1) loop-aware environment wrapper around PyBoy emulator with map masking, 2) multi-layer anti-loop and anti-spam mechanism, 3) dense hierarchical reward design to address specific failure modes.
Result: The system successfully trains agents to complete early game tasks including exiting the player’s house, exploring Pallet Town to reach tall grass, and winning the first rival battle.
Conclusion: Practical systems like PokeRL that explicitly model failure modes such as loops and spam are necessary intermediate steps between toy benchmarks and full Pokemon League champion agents.
Abstract: Pokemon Red is a long-horizon JRPG with sparse rewards, partial observability, and quirky control mechanics that make it a challenging benchmark for reinforcement learning. While recent work has shown that PPO agents can clear the first two gyms using heavy reward shaping and engineered observations, training remains brittle in practice, with agents often degenerating into action loops, menu spam, or unproductive wandering. In this paper, we present PokeRL, a modular system that trains deep reinforcement learning agents to complete early game tasks in Pokemon Red, including exiting the player’s house, exploring Pallet Town to reach tall grass, and winning the first rival battle. Our main contributions are a loop-aware environment wrapper around the PyBoy emulator with map masking, a multi-layer anti-loop and anti-spam mechanism, and a dense hierarchical reward design. We argue that practical systems like PokeRL, which explicitly model failure modes such as loops and spam, are a necessary intermediate step between toy benchmarks and full Pokemon League champion agents. Code is available at https://github.com/reddheeraj/PokemonRL
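A loop-detecting wrapper in this spirit is easy to sketch. The API, the fingerprinting via hashed observations, and the penalty value below are all hypothetical; the paper's wrapper additionally performs map masking around the PyBoy emulator.
```python
# Hypothetical anti-loop wrapper: penalize the agent when recent
# observations start repeating (assumes numpy array observations).
import gymnasium as gym
from collections import deque

class AntiLoopWrapper(gym.Wrapper):
    def __init__(self, env, window=64, penalty=-0.1):
        super().__init__(env)
        self.window, self.penalty = window, penalty
        self.recent = deque(maxlen=window)

    def reset(self, **kwargs):
        self.recent.clear()
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        key = hash(obs.tobytes())           # fingerprint of the current screen
        if self.recent.count(key) >= 3:     # seen 3x recently -> likely a loop
            reward += self.penalty
        self.recent.append(key)
        return obs, reward, terminated, truncated, info
```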
[1145] Online Covariance Estimation in Averaged SGD: Improved Batch-Mean Rates and Minimax Optimality via Trajectory Regression
Yijin Ni, Xiaoming Huo
Main category: cs.LG
TL;DR: The paper studies online covariance matrix estimation for Polyak-Ruppert averaged SGD, improving convergence rates and establishing minimax optimality for Hessian-free estimation.
Details
Motivation: To develop improved online covariance matrix estimation methods for stochastic gradient descent (SGD) that achieve better convergence rates without requiring Hessian information, which is computationally expensive to access.
Method: Analyzes the online batch-means estimator, performs rigorous per-block bias analysis to improve convergence rates, introduces a weighted-averaging variant, and develops a trajectory-regression estimator that regresses SGD increments on iterates to estimate the Hessian without direct access.
Result: Improves the batch-means estimator convergence rate from O(n^{-1/8}) to O(n^{-1/6}) through block-growth parameter tuning, and establishes the minimax rate Θ(n^{-(1-α)/2}) for Hessian-free covariance estimation, with the trajectory-regression estimator achieving this optimal rate.
Conclusion: The paper provides improved covariance estimation methods for SGD with better convergence rates, establishes fundamental limits through minimax analysis, and shows that the bottleneck in Hessian-free estimation is the sublinear accumulation of information about the Hessian from SGD drift.
Abstract: We study online covariance matrix estimation for Polyak–Ruppert averaged stochastic gradient descent (SGD). The online batch-means estimator of Zhu, Chen and Wu (2023) achieves an operator-norm convergence rate of $O(n^{-(1-α)/4})$, which yields $O(n^{-1/8})$ at the optimal learning-rate exponent $α\rightarrow 1/2^+$. A rigorous per-block bias analysis reveals that re-tuning the block-growth parameter improves the batch-means rate to $O(n^{-(1-α)/3})$, achieving $O(n^{-1/6})$. The modified estimator requires no Hessian access and preserves $O(d^2)$ memory. We provide a complete error decomposition into variance, stationarity bias, and nonlinearity bias components. A weighted-averaging variant that avoids hard truncation is also discussed. We establish the minimax rate $Θ(n^{-(1-α)/2})$ for Hessian-free covariance estimation from the SGD trajectory: a Le Cam lower bound gives $Ω(n^{-(1-α)/2})$, and a trajectory-regression estimator – which estimates the Hessian by regressing SGD increments on iterates – achieves $O(n^{-(1-α)/2})$, matching the lower bound. The construction reveals that the bottleneck is the sublinear accumulation of information about the Hessian from the SGD drift.
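For orientation, the classical fixed-block batch-means estimator that the online variants generalize has the standard form below (a textbook construction from the MCMC literature, not the paper's modified estimator).
```latex
% Fixed-block batch means: split the trajectory into a blocks of size b,
% with block means \bar{x}_k and overall mean \bar{x}; then
\widehat{\Sigma}_{\mathrm{BM}}
  \;=\; \frac{b}{a - 1} \sum_{k=1}^{a}
        \left(\bar{x}_k - \bar{x}\right)\left(\bar{x}_k - \bar{x}\right)^{\!\top}
% The online estimators let the block size grow with n, which is where the
% block-growth parameter re-tuned in the paper enters.
```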
[1146] Slithering Through Gaps: Capturing Discrete Isolated Modes via Logistic Bridging
Pinaki Mohanty, Ruqi Zhang
Main category: cs.LG
TL;DR: HiSS: A novel sampling algorithm using Metropolis-within-Gibbs with logistic convolution kernel to improve mixing in high-dimensional multimodal discrete distributions
Details
Motivation: High-dimensional discrete distributions often have multimodal behavior with discontinuities, causing gradient-based samplers to get trapped in local modes and limiting mixing/convergence in rugged energy landscapes.
Method: Hyperbolic Secant-squared Gibbs-Sampling (HiSS) integrates a Metropolis-within-Gibbs framework with a logistic convolution kernel to couple the discrete sampling variable with a continuous auxiliary variable in a joint distribution, enabling transitions between distant disconnected modes.
Result: HiSS provides theoretical convergence guarantees and empirically outperforms popular alternatives on tasks including Ising models, binary neural networks, and combinatorial optimization
Conclusion: HiSS is an effective sampling algorithm for high-dimensional multimodal discrete distributions that addresses limitations of existing gradient-based methods
Abstract: High-dimensional and complex discrete distributions often exhibit multimodal behavior due to inherent discontinuities, posing significant challenges for sampling. Gradient-based discrete samplers, while effective, frequently become trapped in local modes when confronted with rugged or disconnected energy landscapes. This limits their ability to achieve adequate mixing and convergence in high-dimensional multimodal discrete spaces. To address these challenges, we propose Hyperbolic Secant-squared Gibbs-Sampling (HiSS), a novel family of sampling algorithms that integrates a Metropolis-within-Gibbs framework to enhance mixing efficiency. HiSS leverages a logistic convolution kernel to couple the discrete sampling variable with the continuous auxiliary variable in a joint distribution. This design allows the auxiliary variable to encapsulate the true target distribution while facilitating easy transitions between distant and disconnected modes. We provide theoretical guarantees of convergence and demonstrate empirically that HiSS outperforms many popular alternatives on a wide variety of tasks, including Ising models, binary neural networks, and combinatorial optimization.
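The name traces to a standard identity: the logistic density, used here as the convolution kernel, is a squared hyperbolic secant.
```latex
% Logistic density with location \mu and scale s, rewritten as sech^2:
f(x \mid \mu, s) \;=\; \frac{e^{-(x-\mu)/s}}{s\left(1 + e^{-(x-\mu)/s}\right)^{2}}
\;=\; \frac{1}{4s}\,\mathrm{sech}^{2}\!\left(\frac{x-\mu}{2s}\right)
```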
[1147] Transformers Learn Latent Mixture Models In-Context via Mirror Descent
Francesco D’Angelo, Nicolas Flammarion
Main category: cs.LG
TL;DR: Transformers can implement Mirror Descent to learn token importance weights in-context for sequence modeling, with theoretical construction and empirical validation.
Details
Motivation: To understand the underlying mechanisms of how transformers determine token importance in sequence modeling, which remains poorly understood despite being inherent to attention layers.
Method: Formalize token importance estimation as in-context learning using Mixture of Transition Distributions framework. Provide explicit construction of a three-layer transformer implementing one step of Mirror Descent to learn mixture weights from context.
Result: Theoretical proof that the resulting estimator is a first-order approximation of Bayes-optimal predictor. Empirical validation shows trained transformers learn solutions consistent with theory - predictive distributions, attention patterns, and learned transition matrix match the construction.
Conclusion: Transformers can implement Mirror Descent for in-context learning of token importance weights, providing theoretical understanding of attention mechanisms in sequence modeling.
Abstract: Sequence modelling requires determining which past tokens are causally relevant from the context and their importance: a process inherent to the attention layers in transformers, yet whose underlying learned mechanisms remain poorly understood. In this work, we formalize the task of estimating token importance as an in-context learning problem by introducing a framework based on Mixture of Transition Distributions, where a latent variable determines the influence of past tokens on the next. The distribution over this latent variable is parameterized by unobserved mixture weights that transformers must learn in-context. We demonstrate that transformers can implement Mirror Descent to learn these weights from the context. Specifically, we give an explicit construction of a three-layer transformer that exactly implements one step of Mirror Descent and prove that the resulting estimator is a first-order approximation of the Bayes-optimal predictor. Corroborating our construction and its learnability via gradient descent, we empirically show that transformers trained from scratch learn solutions consistent with our theory: their predictive distributions, attention patterns, and learned transition matrix closely match the construction, while deeper models achieve performance comparable to multi-step Mirror Descent.
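The mirror-descent step in question, with a negative-entropy mirror map over the simplex, is the familiar multiplicative-weights update. The reading of ℓ below as an in-context negative log-likelihood is illustrative, not a quote from the paper.
```latex
% One entropic mirror-descent step on mixture weights w on the simplex:
w_i^{(t+1)} \;=\;
  \frac{w_i^{(t)} \exp\!\left(-\eta\, \nabla_i \ell(w^{(t)})\right)}
       {\sum_j w_j^{(t)} \exp\!\left(-\eta\, \nabla_j \ell(w^{(t)})\right)}
% where \ell is a loss over the observed context, e.g. the negative
% log-likelihood of the transitions seen so far (illustrative choice).
```
The normalization in the denominator is what keeps the weights on the simplex, which is the structural reason attention-style softmax layers can implement this update.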
[1148] Task2vec Readiness: Diagnostics for Federated Learning from Pre-Training Embeddings
Cristiano Mafuz, Rodrigo Silva
Main category: cs.LG
TL;DR: Task2Vec-based readiness indices predict FL performance before training by measuring client heterogeneity through unsupervised metrics from embeddings.
Details
Motivation: Federated learning performance is highly sensitive to client heterogeneity, but practitioners lack reliable methods to anticipate federation behavior before training begins.
Method: Propose readiness indices derived from Task2Vec embeddings that quantify federation alignment using unsupervised metrics (cohesion, dispersion, density) computed directly from client embeddings.
Result: Evaluation across diverse datasets (CIFAR-10, FEMNIST, PathMNIST, BloodMNIST) shows consistent and significant Pearson/Spearman correlations (often >0.9) between Task2Vec-based readiness and final FL performance.
Conclusion: Task2Vec-based readiness provides a principled, pre-training diagnostic for FL that offers predictive insight and actionable guidance for client selection in heterogeneous federations.
Abstract: Federated learning (FL) performance is highly sensitive to heterogeneity across clients, yet practitioners lack reliable methods to anticipate how a federation will behave before training. We propose readiness indices, derived from Task2Vec embeddings, that quantify the alignment of a federation prior to training and correlate with its eventual performance. Our approach computes unsupervised metrics – such as cohesion, dispersion, and density – directly from client embeddings. We evaluate these indices across diverse datasets (CIFAR-10, FEMNIST, PathMNIST, BloodMNIST) and client counts (10–20), under Dirichlet heterogeneity levels spanning $α\in {0.05,\dots,5.0}$ and FedAVG aggregation strategy. Correlation analyses show consistent and significant Pearson and Spearman coefficients between some of the Task2Vec-based readiness indices and final performance, with values often exceeding 0.9 across dataset×client configurations, validating this approach as a robust proxy for FL outcomes. These findings establish Task2Vec-based readiness as a principled, pre-training diagnostic for FL that may offer both predictive insight and actionable guidance for client selection in heterogeneous federations.
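Since the abstract does not spell out the metric definitions, here is one illustrative way cohesion/dispersion/density diagnostics could be computed from client embeddings; the exact definitions in the paper may differ.
```python
# Illustrative pre-training readiness diagnostics from client embeddings
# (one row per client, e.g. Task2Vec vectors); definitions are assumptions.
import numpy as np

def readiness_metrics(E):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # cosine geometry
    sims = E @ E.T
    off = sims[~np.eye(len(E), dtype=bool)]
    cohesion = off.mean()                   # average pairwise similarity
    dispersion = off.std()                  # spread of similarities
    centroid = E.mean(0)
    density = (E @ centroid).mean()         # average similarity to centroid
    return {"cohesion": cohesion, "dispersion": dispersion, "density": density}

E = np.random.randn(12, 64)                 # 12 clients, toy embeddings
print(readiness_metrics(E))
```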
[1149] Query Lower Bounds for Diffusion Sampling
Zhiyang Xun, Eric Price
Main category: cs.LG
TL;DR: Theoretical analysis establishing lower bounds on score query complexity for diffusion models, showing that any sampling algorithm requires at least Ω̃(√d) adaptive score queries for d-dimensional distributions.
Details
Motivation: While there's growing literature on accelerating diffusion sampling by minimizing score evaluations, the information-theoretic limits of such acceleration remain unclear. The paper aims to establish fundamental lower bounds on score query complexity for diffusion models.
Method: Theoretical analysis proving that for d-dimensional distributions with polynomial accuracy score estimates, any sampling algorithm requires Ω̃(√d) adaptive score queries. The proof shows samplers must search over Ω̃(√d) distinct noise levels.
Result: Established first score query lower bounds for diffusion sampling, proving Ω̃(√d) adaptive score queries are necessary. This provides formal explanation for why multiscale noise schedules are necessary in practice.
Conclusion: The work establishes fundamental information-theoretic limits on accelerating diffusion sampling, showing that any sampler must query scores across multiple noise scales, explaining the practical necessity of multiscale noise schedules.
Abstract: Diffusion models generate samples by iteratively querying learned score estimates. A rapidly growing literature focuses on accelerating sampling by minimizing the number of score evaluations, yet the information-theoretic limits of such acceleration remain unclear. In this work, we establish the first score query lower bounds for diffusion sampling. We prove that for $d$-dimensional distributions, given access to score estimates with polynomial accuracy $\varepsilon=d^{-O(1)}$ (in any $L^p$ sense), any sampling algorithm requires $\widetilde{\Omega}(\sqrt{d})$ adaptive score queries. In particular, our proof shows that any sampler must search over $\widetilde{\Omega}(\sqrt{d})$ distinct noise levels, providing a formal explanation for why multiscale noise schedules are necessary in practice.
[1150] DIB-OD: Preserving the Invariant Core for Robust Heterogeneous Graph Adaptation via Decoupled Information Bottleneck and Online Distillation
Yang Yan, Qiuyan Wang, Tianjin Huang, Qiudong Yu, Kexin Zhang
Main category: cs.LG
TL;DR: DIB-OD is a novel framework for graph neural network pretraining that uses decoupled information bottleneck and online distillation to preserve invariant knowledge across heterogeneous domains, addressing distribution shifts and catastrophic forgetting.
Details
Motivation: Current GNN pretraining methods struggle with heterogeneous domain generalization due to severe distribution shifts. They focus on intra-domain patterns and fail to disentangle task-relevant invariant knowledge from domain-specific noise, leading to negative transfer and catastrophic forgetting during adaptation.Method: Proposes DIB-OD framework with explicit decomposition of representations into orthogonal invariant and redundant subspaces. Uses Information Bottleneck teacher-student distillation and Hilbert-Schmidt Independence Criterion to isolate stable invariant core. Includes self-adaptive semantic regularizer that dynamically gates label influence based on predictive confidence to protect the invariant core during target-domain adaptation.
Result: Extensive experiments across chemical, biological, and social network domains show DIB-OD significantly outperforms state-of-the-art methods, particularly in challenging inter-type domain transfers, demonstrating superior generalization and anti-forgetting performance.
Conclusion: DIB-OD effectively addresses heterogeneous graph adaptation challenges by preserving invariant knowledge through decoupled representation learning and adaptive regularization, enabling robust generalization across diverse domains while preventing catastrophic forgetting.
Abstract: Graph Neural Network pretraining is pivotal for leveraging unlabeled graph data. However, generalizing across heterogeneous domains remains a major challenge due to severe distribution shifts. Existing methods primarily focus on intra-domain patterns, failing to disentangle task-relevant invariant knowledge from domain-specific redundant noise, leading to negative transfer and catastrophic forgetting. To this end, we propose DIB-OD, a novel framework designed to preserve the invariant core for robust heterogeneous graph adaptation through a Decoupled Information Bottleneck and Online Distillation framework. Our core innovation is the explicit decomposition of representations into orthogonal invariant and redundant subspaces. By utilizing an Information Bottleneck teacher-student distillation mechanism and the Hilbert-Schmidt Independence Criterion, we isolate a stable invariant core that transcends domain boundaries. Furthermore, a self-adaptive semantic regularizer is introduced to protect this core from corruption during target-domain adaptation by dynamically gating label influence based on predictive confidence. Extensive experiments across chemical, biological, and social network domains demonstrate that DIB-OD significantly outperforms state-of-the-art methods, particularly in challenging inter-type domain transfers, showcasing superior generalization and anti-forgetting performance.
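The Hilbert-Schmidt Independence Criterion is the most self-contained ingredient of DIB-OD. Below is a hedged torch sketch of the biased empirical HSIC estimator with Gaussian kernels, used as a penalty to keep the invariant and redundant subspaces statistically independent; the subspace split, kernel bandwidth, and penalty weight are illustrative assumptions.

```python
import torch

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels:
    trace(K H L H) / (n - 1)^2, with H the centering matrix."""
    n = X.shape[0]
    def gram(Z):
        d2 = torch.cdist(Z, Z) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    H = torch.eye(n) - torch.ones(n, n) / n
    K, L = gram(X), gram(Y)
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

# Hypothetical use: split node embeddings into two subspaces and penalize
# their dependence so the invariant core stays disentangled.
z = torch.randn(128, 64, requires_grad=True)
z_inv, z_red = z[:, :32], z[:, 32:]
loss_task = z_inv.pow(2).mean()              # placeholder task loss
loss = loss_task + 0.1 * hsic(z_inv, z_red)  # independence penalty
loss.backward()
```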
[1151] Learning to Adapt: In-Context Learning Beyond Stationarity
Zhen Qin, Jiachen Jiang, Zhihui Zhu
Main category: cs.LG
TL;DR: Theoretical analysis shows gated linear attention (GLA) outperforms standard linear attention in non-stationary in-context learning tasks by adaptively modulating past input influence with learnable recency bias.
Details
Motivation: Existing theoretical analyses of transformer in-context learning assume stationary task distributions, overlooking real-world scenarios where target functions vary over time. This work aims to bridge this gap by analyzing ICL under non-stationary regression problems.Method: Theoretical analysis of gated linear attention (GLA) mechanism in non-stationary regression settings, modeling non-stationarity as a first-order autoregressive process. Compares GLA’s performance against standard linear attention in dynamic environments.
Result: GLA achieves lower training and testing errors than standard linear attention by adaptively modulating the influence of past inputs through a learnable recency bias mechanism. Empirical results validate theoretical findings.
Conclusion: Gating mechanisms in attention provide significant advantages for in-context learning in non-stationary environments, offering theoretical understanding of how transformers adapt to evolving input-output relationships.
Abstract: Transformer models have become foundational across a wide range of scientific and engineering domains due to their strong empirical performance. A key capability underlying their success is in-context learning (ICL): when presented with a short prompt from an unseen task, transformers can perform per-token and next-token predictions without any parameter updates. Recent theoretical efforts have begun to uncover the mechanisms behind this phenomenon, particularly in supervised regression settings. However, these analyses predominantly assume stationary task distributions, which overlook a broad class of real-world scenarios where the target function varies over time. In this work, we bridge this gap by providing a theoretical analysis of ICL under non-stationary regression problems. We study how the gated linear attention (GLA) mechanism adapts to evolving input-output relationships and rigorously characterize its advantages over standard linear attention in this dynamic setting. To model non-stationarity, we adopt a first-order autoregressive process and show that GLA achieves lower training and testing errors by adaptively modulating the influence of past inputs – effectively implementing a learnable recency bias. Our theoretical findings are further supported by empirical results, which validate the benefits of gating mechanisms in non-stationary ICL tasks.
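The mechanism under analysis has a compact recurrent form. The sketch below uses a scalar gate per step for clarity (GLA variants in the literature often use richer, per-channel gates): setting the gate to 1 recovers standard linear attention, while a gate below 1 implements the learnable recency bias.

```python
import torch

def gated_linear_attention(q, k, v, g):
    """Recurrent form with a scalar gate per step:
    S_t = g_t * S_{t-1} + k_t v_t^T,   o_t = S_t^T q_t."""
    S = torch.zeros(q.shape[1], v.shape[1])
    outs = []
    for t in range(q.shape[0]):
        S = g[t] * S + torch.outer(k[t], v[t])
        outs.append(S.T @ q[t])
    return torch.stack(outs)

T, d, dv = 16, 8, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, dv)
ones = torch.ones(T)                     # gate = 1: standard linear attention
gates = torch.sigmoid(torch.randn(T))    # gate < 1: learnable recency bias
out_linear = gated_linear_attention(q, k, v, ones)
out_gated = gated_linear_attention(q, k, v, gates)
```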
[1152] UniPROT: Uniform Prototype Selection via Partial Optimal Transport with Submodular Guarantees
Prateek Chanda, Prayas Agrawal, Karthik S. Gurumoorthy, Ganesh Ramakrishnan, Bamdev Mishra, Pratik Jawanpuria
Main category: cs.LG
TL;DR: UniPROT is a subset selection framework that uses optimal transport to select uniformly weighted prototypes that best represent a target distribution, addressing class imbalance by improving minority-class representation.
Details
Motivation: Existing subset selection methods often have implicit importance scores that favor majority classes, leading to poor representation of minority classes. The authors aim to develop a method that selects prototypes with uniform weights to better represent all classes in imbalanced settings.Method: UniPROT minimizes optimal transport distance between a uniformly weighted prototypical distribution and the target distribution. The authors reformulate OT marginal constraints to create a partial optimal transport-based submodular objective, enabling a greedy algorithm with (1-1/e) approximation guarantee.
Result: UniPROT consistently improves minority-class representation in imbalanced classification benchmarks without compromising majority-class accuracy. It also yields robust performance gains in both finetuning and pretraining regimes for large language models under domain imbalance.
Conclusion: UniPROT provides a scalable, theoretically grounded solution for uniform-weighted prototype selection that enforces uniform source contributions and addresses class imbalance effectively.
Abstract: Selecting prototypical examples from a source distribution to represent a target data distribution is a fundamental problem in machine learning. Existing subset selection methods often rely on implicit importance scores, which can be skewed towards majority classes and lead to low-quality prototypes for minority classes. We present UniPROT, a novel subset selection framework that minimizes the optimal transport (OT) distance between a uniformly weighted prototypical distribution and the target distribution. While intuitive, this formulation leads to a cardinality-constrained maximization of a \emph{super-additive} objective, which is generally intractable to approximate efficiently. To address this, we propose a principled reformulation of the OT marginal constraints, yielding a partial optimal transport-based submodular objective. We prove that this reformulation enables a greedy algorithm with a $(1-1/e)$ approximation guarantee relative to the original super-additive maximization problem. Empirically, we showcase that enforcing uniform prototype weights in UniPROT consistently improves minority-class representation in imbalanced classification benchmarks without compromising majority-class accuracy. In both finetuning and pretraining regimes for large language models under domain imbalance, UniPROT enforces uniform source contributions, yielding robust performance gains. Our results establish UniPROT as a scalable, theoretically grounded solution for uniform-weighted prototype selection. Our code is publicly available at https://github.com/efficiency-learning/UniPROT.
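The $(1-1/e)$ guarantee comes from greedily maximizing a monotone submodular set function. A minimal sketch, substituting a facility-location coverage surrogate for the paper's partial-OT objective (an assumption for illustration):

```python
import numpy as np

def greedy_prototypes(objective, candidates, budget):
    """Greedy maximization of a set function; gives the (1 - 1/e) guarantee
    when `objective` is monotone submodular."""
    selected = []
    for _ in range(budget):
        gains = [(objective(selected + [c]) - objective(selected), c)
                 for c in candidates if c not in selected]
        gain, best = max(gains)
        if gain <= 0:
            break
        selected.append(best)
    return selected

# Illustrative surrogate: facility-location coverage of the target set
# by uniformly weighted prototypes (not the paper's POT objective).
rng = np.random.default_rng(2)
source = rng.normal(size=(50, 2))
target = rng.normal(size=(200, 2))
sim = -np.linalg.norm(target[:, None] - source[None, :], axis=-1)  # (200, 50)

def coverage(S):
    return 0.0 if not S else sim[:, S].max(axis=1).sum()

protos = greedy_prototypes(coverage, list(range(50)), budget=5)
```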
[1153] Hypergraph Neural Diffusion: A PDE-Inspired Framework for Hypergraph Message Passing
Zhiheng Zhou, Mengyao Zhou, Xixun Lin, Xingqin Qi, Guiying Yan
Main category: cs.LG
TL;DR: HND is a novel hypergraph neural network framework that unifies nonlinear diffusion equations with neural message passing, providing physically interpretable hypergraph learning through PDE-based anisotropic diffusion processes.
Details
Motivation: Existing HGNNs suffer from shallow propagation, oversmoothing, and limited adaptability to complex hypergraph structures. There's a need for more stable, expressive, and interpretable hypergraph learning methods that can handle high-order relationships effectively.Method: Proposes Hypergraph Neural Diffusion (HND) framework based on continuous-time hypergraph diffusion equations using hypergraph gradient and divergence operators. Features learnable, structure-aware coefficient matrix over hyperedge-node pairs, treating feature propagation as anisotropic diffusion. Supports various integration strategies including Runge-Kutta and adaptive-step solvers.
Result: HND achieves competitive performance on benchmark datasets. The framework provides theoretical guarantees including energy dissipation, solution boundedness via discrete maximum principle, and stability under explicit/implicit numerical schemes.
Conclusion: PDE-inspired design enhances stability, expressivity, and interpretability of hypergraph learning. HND offers a physically interpretable view where neural message passing is understood as discretized gradient flow minimizing diffusion energy functional.
Abstract: Hypergraph neural networks (HGNNs) have shown remarkable potential in modeling high-order relationships that naturally arise in many real-world data domains. However, existing HGNNs often suffer from shallow propagation, oversmoothing, and limited adaptability to complex hypergraph structures. In this paper, we propose Hypergraph Neural Diffusion (HND), a novel framework that unifies nonlinear diffusion equations with neural message passing on hypergraphs. HND is grounded in a continuous-time hypergraph diffusion equation, formulated via hypergraph gradient and divergence operators, and modulated by a learnable, structure-aware coefficient matrix over hyperedge-node pairs. This partial differential equation (PDE) based formulation provides a physically interpretable view of hypergraph learning, where feature propagation is understood as an anisotropic diffusion process governed by local inconsistency and adaptive diffusion coefficient. From this perspective, neural message passing becomes a discretized gradient flow that progressively minimizes a diffusion energy functional. We derive rigorous theoretical guarantees, including energy dissipation, solution boundedness via a discrete maximum principle, and stability under explicit and implicit numerical schemes. The HND framework supports a variety of integration strategies such as non-adaptive-step (like Runge-Kutta) and adaptive-step solvers, enabling the construction of deep, stable, and interpretable architectures. Extensive experiments on benchmark datasets demonstrate that HND achieves competitive performance. Our results highlight the power of PDE-inspired design in enhancing the stability, expressivity, and interpretability of hypergraph learning.
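The PDE view admits a compact discretization. Below is a hedged numpy sketch of one forward-Euler diffusion step using a standard normalized hypergraph Laplacian built from the incidence matrix; HND's learnable, structure-aware coefficient matrix is replaced by fixed hyperedge weights here, so this shows only the discretized-gradient-flow skeleton.

```python
import numpy as np

def hypergraph_laplacian(H, w):
    """Standard normalized hypergraph Laplacian from incidence matrix H
    (n_nodes, n_edges) and hyperedge weights w; a fixed-coefficient
    stand-in for HND's learnable anisotropic coefficients."""
    Dv = H @ w                           # weighted node degrees
    De = H.sum(axis=0)                   # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(Dv))
    Theta = Dv_inv_sqrt @ H @ np.diag(w / De) @ H.T @ Dv_inv_sqrt
    return np.eye(H.shape[0]) - Theta

H = np.array([[1, 0], [1, 1], [1, 1], [0, 1]], dtype=float)  # 4 nodes, 2 edges
w = np.ones(2)
L = hypergraph_laplacian(H, w)
X = np.random.default_rng(3).normal(size=(4, 3))             # node features
step = 0.1
X_next = X - step * (L @ X)   # forward-Euler step of the diffusion X' = -L X
```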
[1154] Continuous-time Online Learning via Mean-Field Neural Networks: Regret Analysis in Diffusion Environments
Erhan Bayraktar, Bingyan Han, Ziqing Zhang
Main category: cs.LG
TL;DR: Continuous-time online learning with neural networks on diffusion data, analyzed via mean-field limits and stochastic gradient flows, with regret bounds under convex and non-convex settings.
Details
Motivation: The paper addresses online learning in continuous-time settings where data is generated by unknown diffusion processes, which is relevant for real-time adaptive systems. The motivation is to understand the theoretical properties of neural network learning in such dynamic environments and establish performance guarantees.Method: Uses a two-layer neural network with continuous parameter updates, analyzes the mean-field limit as a stochastic Wasserstein gradient flow adapted to data filtration. Employs mathematical tools including logarithmic Sobolev inequality, Polyak-Lojasiewicz condition, Malliavin calculus, and uniform-in-time propagation of chaos to establish regret bounds.
Result: Obtains constant static regret bounds under displacement convexity, and explicit linear regret bounds in general non-convex settings that characterize effects of data variation, entropic exploration, and quadratic regularization. Simulations show the online approach outperforms alternatives and demonstrate impact of network width and regularization parameters.
Conclusion: The paper provides a rigorous theoretical framework for continuous-time online learning with neural networks on diffusion data, establishing fundamental performance guarantees and insights into the effects of various factors on learning dynamics and regret.
Abstract: We study continuous-time online learning where data are generated by a diffusion process with unknown coefficients. The learner employs a two-layer neural network, continuously updating its parameters in a non-anticipative manner. The mean-field limit of the learning dynamics corresponds to a stochastic Wasserstein gradient flow adapted to the data filtration. We establish regret bounds for both the mean-field limit and finite-particle system. Our analysis leverages the logarithmic Sobolev inequality, Polyak-Lojasiewicz condition, Malliavin calculus, and uniform-in-time propagation of chaos. Under displacement convexity, we obtain a constant static regret bound. In the general non-convex setting, we derive explicit linear regret bounds characterizing the effects of data variation, entropic exploration, and quadratic regularization. Finally, our simulations demonstrate the outperformance of the online approach and the impact of network width and regularization parameters.
[1155] Learning to Test: Physics-Informed Representation for Dynamical Instability Detection
Minxing Zheng, Zewei Deng, Liyan Xie, Shixiang Zhu
Main category: cs.LG
TL;DR: Physics-informed latent learning framework for stability assessment in stochastic differential-algebraic equations under distribution shift, enabling deployment-time safety monitoring via statistical hypothesis testing without repeated simulation.
Details
Motivation: Safety-critical systems governed by DAEs operate under stochastic environmental inputs, requiring stability assessment as context distributions shift. Repeated large-scale DAE simulation is computationally prohibitive for high-dimensional or real-time applications.Method: Learn physics-informed latent representation of contextual variables capturing stability-relevant structure, regularized toward tractable reference distribution. Integrates neural dynamical surrogates, uncertainty-aware calibration, and uniformity-based testing to formulate deployment-time safety monitoring as distributional hypothesis test in latent space.
Result: Framework provides scalable, statistically grounded method for detecting instability risk in stochastic constrained dynamical systems without repeated simulation, with controlled Type I error.
Conclusion: Proposed test-oriented learning approach enables efficient stability assessment under distribution shift for safety-critical DAE systems, bridging physics-informed learning with statistical testing for practical safety monitoring.
Abstract: Many safety-critical scientific and engineering systems evolve according to differential-algebraic equations (DAEs), where dynamical behavior is constrained by physical laws and admissibility conditions. In practice, these systems operate under stochastically varying environmental inputs, so stability is not a static property but must be reassessed as the context distribution shifts. Repeated large-scale DAE simulation, however, is computationally prohibitive in high-dimensional or real-time settings. This paper proposes a test-oriented learning framework for stability assessment under distribution shift. Rather than re-estimating physical parameters or repeatedly solving the underlying DAE, we learn a physics-informed latent representation of contextual variables that captures stability-relevant structure and is regularized toward a tractable reference distribution. Trained on baseline data from a certified safe regime, the learned representation enables deployment-time safety monitoring to be formulated as a distributional hypothesis test in latent space, with controlled Type I error. By integrating neural dynamical surrogates, uncertainty-aware calibration, and uniformity-based testing, our approach provides a scalable and statistically grounded method for detecting instability risk in stochastic constrained dynamical systems without repeated simulation.
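A minimal sketch of the deployment-time test, assuming a hypothetical encoder whose safe-regime latents are roughly Gaussian: calibrate on baseline data, apply the probability integral transform to deployment latents, and run a Kolmogorov-Smirnov uniformity test. The encoder, reference distribution, and threshold are all illustrative stand-ins.

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(4)

def encode(context):
    """Hypothetical physics-informed encoder; 1-D latent for illustration."""
    return context @ np.array([0.6, -0.8])

baseline = encode(rng.normal(size=(2000, 2)))   # certified-safe regime
mu, sd = baseline.mean(), baseline.std()        # calibrate the reference

deploy = encode(rng.normal(loc=0.5, size=(500, 2)))  # possibly shifted context
u = norm.cdf(deploy, loc=mu, scale=sd)          # probability integral transform
stat, pvalue = kstest(u, "uniform")             # uniformity test in latent space
alarm = pvalue < 0.05                           # Type I error set on safe data
print(f"KS={stat:.3f}  p={pvalue:.4f}  alarm={alarm}")
```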
[1156] Robust Adversarial Policy Optimization Under Dynamics Uncertainty
Mintae Kim, Koushil Sreenath
Main category: cs.LG
TL;DR: RAPO is a dual formulation of robust RL that combines trajectory-level adversarial rollouts with model-level Boltzmann reweighting over dynamics ensembles to improve robustness to distribution shifts.
Details
Motivation: RL policies often fail under dynamics that differ from training, and existing methods like domain randomization or adversarial RL have limitations. Distributionally robust RL provides formal guarantees but relies on surrogate adversaries that can cause instability and over-conservatism.Method: Proposes a dual formulation with two components: 1) trajectory-level adversarial network approximates temperature parameter for worst-case rollouts within divergence bounds, 2) model-level Boltzmann reweighting over dynamics ensembles focuses sampling on environments more adverse to current policy rather than uniform sampling.
Result: RAPO outperforms robust RL baselines, improving resilience to uncertainty and generalization to out-of-distribution dynamics while maintaining dual tractability.
Conclusion: The dual formulation directly exposes robustness-performance trade-off, with trajectory-level steering ensuring robust rollouts and model-level sampling providing policy-sensitive coverage of adverse dynamics.
Abstract: Reinforcement learning (RL) policies often fail under dynamics that differ from training, a gap not fully addressed by domain randomization or existing adversarial RL methods. Distributionally robust RL provides a formal remedy but still relies on surrogate adversaries to approximate intractable primal problems, leaving blind spots that potentially cause instability and over-conservatism. We propose a dual formulation that directly exposes the robustness-performance trade-off. At the trajectory level, a temperature parameter from the dual problem is approximated with an adversarial network, yielding efficient and stable worst-case rollouts within a divergence bound. At the model level, we employ Boltzmann reweighting over dynamics ensembles, focusing on environments more adverse to the current policy rather than sampling uniformly. The two components act independently and complement each other: trajectory-level steering ensures robust rollouts, while model-level sampling provides policy-sensitive coverage of adverse dynamics. The resulting framework, robust adversarial policy optimization (RAPO), outperforms robust RL baselines, improving resilience to uncertainty and generalization to out-of-distribution dynamics while maintaining dual tractability.
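The model-level component reduces to a Boltzmann (softmax) distribution over negated policy returns across the dynamics ensemble, so environments where the current policy fares worst are sampled more often. A minimal sketch, with the returns and temperature as illustrative values:

```python
import numpy as np

def boltzmann_model_weights(returns, temperature=1.0):
    """Weight ensemble members by adversity to the current policy:
    lower return => higher sampling probability."""
    z = -np.asarray(returns) / temperature
    z -= z.max()                         # numerical stability
    w = np.exp(z)
    return w / w.sum()

# Hypothetical policy returns evaluated on each dynamics model
returns = [210.0, 180.0, 240.0, 150.0]
probs = boltzmann_model_weights(returns, temperature=20.0)
rng = np.random.default_rng(5)
next_env = rng.choice(len(returns), p=probs)  # adversity-focused sampling
```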
[1157] Tracking High-order Evolutions via Cascading Low-rank Fitting
Zhao Song
Main category: cs.LG
TL;DR: A novel cascading low-rank fitting method for higher-order diffusion models that efficiently approximates successive derivatives using shared base functions with accumulated low-rank components, avoiding linear parameter scaling.
Details
Motivation: Higher-order diffusion models (learning acceleration, jerk, etc.) typically require separate neural networks for each derivative order, leading to linear parameter scaling. This computational bottleneck motivates a more efficient approach to represent higher-order derivatives.Method: Introduces cascading low-rank fitting, an ODE-inspired method that approximates successive derivatives by applying a shared base function augmented with sequentially accumulated low-rank components. Provides theoretical analysis of rank dynamics and presents an efficient algorithm for computation.
Result: Theoretical analysis shows: 1) If initial difference is linearly decomposable, high-order derivative ranks are guaranteed to be monotonically non-increasing; 2) Without structural assumption, ranks can strictly increase via General Leibniz Rule; 3) Under specific conditions, derivative ranks can form any arbitrary permutation. Provides efficient algorithm implementation.
Conclusion: Cascading low-rank fitting offers an efficient alternative to naive approaches for higher-order diffusion models, reducing computational burden while maintaining modeling capacity through theoretical guarantees on rank dynamics.
Abstract: Diffusion models have become the de facto standard for modern visual generation, including well-established frameworks such as latent diffusion and flow matching. Recently, modeling high-order dynamics has emerged as a promising frontier in generative modeling. Rather than only learning the first-order velocity field that transports random noise to a target data distribution, these approaches simultaneously learn higher-order derivatives, such as acceleration and jerk, yielding a diverse family of higher-order diffusion variants. To represent higher-order derivatives, naive approaches instantiate separate neural networks for each order, which scales the parameter space linearly with the derivative order. To overcome this computational bottleneck, we introduce cascading low-rank fitting, an ordinary differential equation inspired method that approximates successive derivatives by applying a shared base function augmented with sequentially accumulated low-rank components. Theoretically, we analyze the rank dynamics of these successive matrix differences. We prove that if the initial difference is linearly decomposable, the generic ranks of high-order derivatives are guaranteed to be monotonically non-increasing. Conversely, we demonstrate that without this structural assumption, the General Leibniz Rule allows ranks to strictly increase. Furthermore, we establish that under specific conditions, the sequence of derivative ranks can be designed to form any arbitrary permutation. Finally, we present a straightforward algorithm to efficiently compute the proposed cascading low-rank fitting.
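One plausible reading of "a shared base function augmented with sequentially accumulated low-rank components" is a base network whose output map receives accumulated rank-$r$ corrections, one per derivative order. The torch sketch below implements that reading; the architecture, ranks, and where the corrections attach are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CascadingLowRank(nn.Module):
    """Shared base network plus accumulated low-rank output corrections,
    one per derivative order (an illustrative reading)."""
    def __init__(self, dim, hidden, max_order, rank=4):
        super().__init__()
        self.fc_in, self.act = nn.Linear(dim, hidden), nn.SiLU()
        self.fc_out = nn.Linear(hidden, dim)
        self.U = nn.ParameterList([nn.Parameter(0.01 * torch.randn(dim, rank))
                                   for _ in range(max_order)])
        self.V = nn.ParameterList([nn.Parameter(0.01 * torch.randn(rank, hidden))
                                   for _ in range(max_order)])

    def forward(self, x, order):
        h = self.act(self.fc_in(x))      # shared hidden features
        out = self.fc_out(h)             # order 0: the base velocity field
        for j in range(order):           # higher orders accumulate deltas
            out = out + h @ self.V[j].T @ self.U[j].T
        return out

model = CascadingLowRank(dim=16, hidden=64, max_order=2)
x = torch.randn(8, 16)
velocity = model(x, order=0)
acceleration = model(x, order=1)         # adds only one rank-4 correction
```

Parameter count grows by only `2 * rank * dim`-scale terms per order, rather than a full network per derivative, which is the point of the cascade.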
[1158] Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees
Zhuolun Dong, Junyu Cao
Main category: cs.LG
TL;DR: A flow-control framework for LLM inference that manages prompt admission rates to prevent KV cache overflow and ensure system stability.
Details
Motivation: LLMs like ChatGPT and Gemini serve massive user bases with billions of daily requests, making inference optimization critical. A key challenge is unknown decode lengths causing memory usage to grow with generated tokens, potentially leading to overflow and system instability.Method: Proposes a simple flow-control framework that controls the rate at which prompts join the active set. Derives necessary conditions for system stability and establishes sufficient conditions under which the algorithm provably achieves stability.
Result: Compared to commonly used strategies, the approach achieves higher token and request throughput, lower average and tail latency, and more stable KV cache utilization.
Conclusion: The flow-control framework effectively addresses KV cache overflow concerns in LLM inference systems, improving performance metrics while ensuring system stability.
Abstract: Large language models (LLMs) have been widely adopted due to their great performance across a wide range of applications. ChatGPT and Gemini now serve hundreds of millions of active users and handle billions of user requests per day, which puts optimizing LLM inference into the spotlight. A key challenge in LLM inference is that decode lengths are unknown. The memory usage for each request grows with generated tokens, which may lead to overflow and cause system instability. To address this concern, we propose a simple flow-control framework that controls the rate at which prompts join the active set. We derive a necessary condition that any stable system must satisfy and establish sufficient conditions under which our algorithm provably achieves stability. Experiments show that, compared to commonly used strategies in practice, our approach achieves higher token and request throughput, lower average and tail latency, and more stable KV cache utilization.
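A minimal sketch of the admission idea: hold prompts in a queue and admit them to the active set only while projected KV-cache usage (prompt length plus an estimated decode length) stays under budget. The `expected_decode` estimate and the threshold rule are assumptions; the paper's provably stable rate-control law may differ.

```python
from collections import deque

class FlowController:
    """Admission control for the active batch under a KV-cache token budget."""
    def __init__(self, kv_budget_tokens):
        self.budget = kv_budget_tokens
        self.queue, self.active = deque(), []

    def submit(self, prompt_len, expected_decode):
        self.queue.append((prompt_len, expected_decode))

    def projected_usage(self):
        return sum(p + d for p, d in self.active)

    def step(self):
        # admit while the projection fits; otherwise hold requests in queue
        while self.queue:
            p, d = self.queue[0]
            if self.projected_usage() + p + d > self.budget:
                break
            self.active.append(self.queue.popleft())

fc = FlowController(kv_budget_tokens=8192)
fc.submit(512, 1024)
fc.submit(4096, 4096)
fc.step()
print(len(fc.active), len(fc.queue))   # first request admitted, second held
```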
[1159] K-Way Energy Probes for Metacognition Reduce to Softmax in Discriminative Predictive Coding Networks
Jon-Paul Cacioli
Main category: cs.LG
TL;DR: Predictive coding networks’ K-way energy probe appears richer than softmax but actually decomposes to softmax margin plus untrained residual, tracking softmax from below across multiple experiments.
Details
Motivation: To investigate whether predictive coding networks' K-way energy probe provides richer signal than standard softmax classification, and to understand the relationship between these two approaches in discriminative PC formulations.Method: Theoretical decomposition showing K-way energy margin equals monotone function of log-softmax margin plus untrained residual. Empirical testing across six CIFAR-10 conditions: extended training, latent movement measurement, decoder fairness control, matched-budget comparison, temperature sweep, and trajectory-integrated training.
Result: In all conditions, the energy probe tracked softmax from below with stable gap across training procedures. AUROC_2 values differed by less than 10^-3 between final-state and trajectory-integrated training.
Conclusion: The K-way energy probe in standard discriminative PC with target-clamped CE-energy training doesn’t provide richer signal than softmax, but decomposes to softmax plus untrained residual. The analysis doesn’t apply to bidirectional PC, prospective configuration, generative PC, or non-CE energy formulations.
Abstract: We present this as a negative result with an explanatory mechanism, not as a formal upper bound. Predictive coding networks (PCNs) admit a K-way energy probe in which each candidate class is fixed as a target, inference is run to settling, and the per-hypothesis settled energies are compared. The probe appears to read a richer signal source than softmax, since the per-hypothesis energy depends on the entire generative chain. We argue this appearance is misleading under the standard Pinchetti-style discriminative PC formulation. We present an approximate reduction showing that with target-clamped CE-energy training and effectively-feedforward latent dynamics, the K-way energy margin decomposes into a monotone function of the log-softmax margin plus a residual that is not trained to correlate with correctness. The decomposition predicts that the structural probe should track softmax from below. We test this across six conditions on CIFAR-10: extended deterministic training, direct measurement of latent movement during inference, a post-hoc decoder fairness control on a backpropagation network, a matched-budget PC vs BP comparison, a five-point Langevin temperature sweep, and trajectory-integrated MCPC training. In every condition the probe sat below softmax. The gap was stable across training procedures within the discriminative PC family. Final-state and trajectory-integrated training produced probes whose AUROC_2 values differed by less than 10^-3 at deterministic evaluation. The empirical regime is small: single seed, 2.1M-parameter network, 1280 test images. We frame the result as a preprint inviting replication. We discuss conditions under which the decomposition does not apply (bidirectional PC, prospective configuration, generative PC, non-CE energy formulations) and directions for productive structural probing the analysis does not foreclose.
[1160] Optimal Stability of KL Divergence under Gaussian Perturbations
Jialu Pan, Yufeng Zhang, Nan Hu, Keqin Li
Main category: cs.LG
TL;DR: Establishes sharp stability bounds for KL divergence under Gaussian perturbations beyond Gaussian families, with optimal √ε rate and applications to OOD detection in flow-based models.
Details
Motivation: Existing relaxed triangle inequalities for KL divergence only work for Gaussian distributions, limiting applications in modern ML like OOD detection with flow-based generative models. Need to extend stability analysis to arbitrary distributions.Method: Mathematical analysis establishing stability bounds between arbitrary distributions and Gaussian families under mild moment conditions. Shows if KL(P||N₁) is large and KL(N₁||N₂) ≤ ε, then KL(P||N₂) ≥ KL(P||N₁) - O(√ε), and proves this √ε rate is optimal.
Result: Proves intrinsic stability property of KL divergence under Gaussian perturbations, extending classical Gaussian-only results to general distributions. Provides rigorous foundation for KL-based OOD analysis in flow-based models without strong Gaussian assumptions.
Conclusion: Removes Gaussian restriction from KL divergence stability analysis, enabling KL-based reasoning in non-Gaussian settings common in deep learning and reinforcement learning, with applications to flow-based generative models.
Abstract: We study the problem of characterizing the stability of Kullback-Leibler (KL) divergence under Gaussian perturbations beyond Gaussian families. Existing relaxed triangle inequalities for KL divergence critically rely on the assumption that all involved distributions are Gaussian, which limits their applicability in modern applications such as out-of-distribution (OOD) detection with flow-based generative models. In this paper, we remove this restriction by establishing a sharp stability bound between an arbitrary distribution and Gaussian families under mild moment conditions. Specifically, let $P$ be a distribution with finite second moment, and let $\mathcal{N}_1$ and $\mathcal{N}_2$ be multivariate Gaussian distributions. We show that if $KL(P||\mathcal{N}_1)$ is large and $KL(\mathcal{N}_1||\mathcal{N}_2)$ is at most $\varepsilon$, then $KL(P||\mathcal{N}_2) \ge KL(P||\mathcal{N}_1) - O(\sqrt{\varepsilon})$. Moreover, we prove that this $\sqrt{\varepsilon}$ rate is optimal in general, even within the Gaussian family. This result reveals an intrinsic stability property of KL divergence under Gaussian perturbations, extending classical Gaussian-only relaxed triangle inequalities to general distributions. The result is non-trivial due to the asymmetry of KL divergence and the absence of a triangle inequality in general probability spaces. As an application, we provide a rigorous foundation for KL-based OOD analysis in flow-based models, removing strong Gaussian assumptions used in prior work. More broadly, our result enables KL-based reasoning in non-Gaussian settings arising in deep learning and reinforcement learning.
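Since the paper shows the $\sqrt{\varepsilon}$ rate is tight even within the Gaussian family, the scaling is easy to verify numerically with closed-form Gaussian KLs; the mean-shift construction of $\mathcal{N}_2$ below is one illustrative choice.

```python
import numpy as np

def kl_gauss(m1, S1, m2, S2):
    """Closed-form KL(N(m1, S1) || N(m2, S2)) for multivariate Gaussians."""
    d = len(m1)
    S2_inv = np.linalg.inv(S2)
    diff = m2 - m1
    return 0.5 * (np.trace(S2_inv @ S1) + diff @ S2_inv @ diff - d
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

d = 5
mP, SP = 3.0 * np.ones(d), np.eye(d)   # P chosen Gaussian so its KL is closed-form
m1 = np.zeros(d)
S1 = S2 = np.eye(d)
for eps in [1e-4, 1e-2, 1.0]:
    m2 = m1 + np.sqrt(2 * eps / d) * np.ones(d)  # mean shift with KL(N1||N2) = eps
    drop = kl_gauss(mP, SP, m1, S1) - kl_gauss(mP, SP, m2, S2)
    print(f"eps={eps:<8g} drop={drop:.4f}  sqrt(eps)={np.sqrt(eps):.4f}")
```

The printed drop in $KL(P||\cdot)$ shrinks proportionally to $\sqrt{\varepsilon}$, matching the stated $O(\sqrt{\varepsilon})$ degradation.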
[1161] RTMC: Step-Level Credit Assignment via Rollout Trees
Tao Wang, Suhang Zheng, Xiaoxiao Xu
Main category: cs.LG
TL;DR: RTMC introduces rollout-tree Monte Carlo advantage estimation for multi-step agentic RL, using cross-rollout state matching to compute per-step Q-values without learned critics, improving performance on code generation tasks.
Details
Motivation: Existing credit assignment methods for multi-step agentic RL have limitations: critic-free methods like GRPO assign uniform advantage to all actions in a trajectory, while learned value networks introduce overhead and can be fragile under sparse rewards.Method: RTMC leverages overlapping intermediate states across group rollouts to form an implicit tree structure. It aggregates return statistics across rollouts sharing common states to compute per-step Q-values and advantages without learned critics. A state-action signature system compresses raw interaction histories into compact, comparable representations for tractable cross-rollout state matching.
Result: On SWE-bench Verified, RTMC improves pass@1 by 3.2 percentage points over GRPO, demonstrating superior performance in code generation tasks.
Conclusion: RTMC provides an effective critic-free advantage estimation method for multi-step agentic RL by exploiting the implicit tree structure formed by overlapping rollouts, offering improved credit assignment without the overhead and fragility of learned value networks.
Abstract: Multi-step agentic reinforcement learning benefits from fine-grained credit assignment, yet existing approaches offer limited options: critic-free methods like GRPO assign a uniform advantage to every action in a trajectory, while learned value networks introduce notable overhead and can be fragile under sparse rewards. We observe that group rollouts targeting the same problem often traverse overlapping intermediate states, implicitly forming a tree whose branches diverge at successive decision points. Building on this insight, we introduce Rollout-Tree Monte Carlo (RTMC) advantage estimation, which aggregates return statistics across rollouts sharing a common state to produce per-step Q-values and advantages–without any learned critic. A state-action signature system compresses raw interaction histories into compact, comparable representations, making cross-rollout state matching tractable. On SWE-bench Verified, RTMC improves pass@1 by 3.2 percentage points over GRPO.
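The advantage estimator is easy to sketch once state signatures are given: group returns by (signature, action) for per-step Q-values and by signature alone for V, then take their difference. The signature strings and toy rollouts below are hypothetical.

```python
from collections import defaultdict

def rtmc_advantages(rollouts, gamma=1.0):
    """Per-step advantages from cross-rollout state matching.

    rollouts: list of [(signature, action, reward), ...] trajectories.
    Signatures are assumed precomputed compressions of interaction history."""
    q_returns, v_returns = defaultdict(list), defaultdict(list)
    for traj in rollouts:
        G = 0.0
        for sig, action, reward in reversed(traj):
            G = reward + gamma * G           # Monte Carlo return-to-go
            q_returns[(sig, action)].append(G)
            v_returns[sig].append(G)
    mean = lambda xs: sum(xs) / len(xs)
    return [[mean(q_returns[(s, a)]) - mean(v_returns[s]) for s, a, _ in traj]
            for traj in rollouts]

# Two rollouts share state "s0" but diverge on the first action:
r1 = [("s0", "patch_file", 0.0), ("s1", "run_tests", 1.0)]   # passes
r2 = [("s0", "grep_logs", 0.0), ("s2", "run_tests", 0.0)]    # fails
print(rtmc_advantages([r1, r2]))   # +-0.5 at the shared decision point
```

Unlike a uniform GRPO broadcast, the divergence point at "s0" receives a signed, per-step credit.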
[1162] Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis
Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang
Main category: cs.LG
TL;DR: EAPO improves RL for LLMs by addressing credit assignment in sparse reward settings through entropy-aware token-level learning signal modulation.
Details
Motivation: RLVR improves LLM reasoning but suffers from sparse outcome-based rewards causing credit assignment problems, especially regarding which tokens deserve credit for reasoning improvements.Method: Analyzes credit assignment through reward polarity and token entropy using Four Quadrant Decomposition, adapts Conditional Mutual Information theory, and proposes Entropy-Aware Policy Optimization (EAPO) that modulates token-level learning signals based on entropy.
Result: EAPO outperforms strong baselines across two model families, demonstrating improved reasoning ability through better credit assignment to high-entropy tokens.
Conclusion: Reasoning gains in RLVR primarily arise from high-entropy tokens, and entropy-aware policy optimization effectively addresses credit assignment problems in sparse reward settings.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning ability of Large Language Models (LLMs). However, its sparse outcome-based rewards pose a fundamental credit assignment problem. We analyze this problem through the joint lens of reward polarity and token entropy. Our diagnostic tool, the Four Quadrant Decomposition, isolates token updates by polarity and entropy, and controlled ablations show that reasoning improvements concentrate in the high-entropy quadrants. To justify this observation theoretically, we adapt Conditional Mutual Information to the autoregressive RLVR setting and prove that the credit a token can carry is upper-bounded by its entropy. This view yields testable predictions that reasoning gains arise primarily from high-entropy tokens, with unique roles for positive and negative updates. A gradient analysis of GRPO further reveals how uniform reward broadcast dilutes signal at high-entropy positions while over-crediting deterministic tokens. Grounded in these insights, we propose Entropy-Aware Policy Optimization (EAPO) that modulates token-level learning signals accordingly. Extensive experiments demonstrate that EAPO outperforms strong baselines across two model families.
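The core of EAPO is a token-level reweighting of an otherwise uniform (GRPO-style) advantage by each token's predictive entropy. The scaling rule in this torch sketch (entropy normalized around its mean) is an illustrative assumption; the paper's modulation may differ.

```python
import torch
import torch.nn.functional as F

def entropy_aware_advantages(logits, advantages):
    """Scale per-token advantages by normalized predictive entropy so
    high-entropy (decision) tokens carry more of the learning signal."""
    probs = F.softmax(logits, dim=-1)                          # (T, vocab)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)   # (T,)
    weights = entropy / entropy.mean().clamp_min(1e-9)         # centered at 1
    return advantages * weights

T, V = 6, 50
logits = torch.randn(T, V)
logits[0] *= 10                          # token 0 nearly deterministic
seq_advantage = torch.full((T,), 0.8)    # GRPO-style uniform broadcast
token_adv = entropy_aware_advantages(logits, seq_advantage)
print(token_adv)                         # deterministic token gets diluted credit
```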
[1163] Pando: Do Interpretability Methods Work When Models Won’t Explain Themselves?
Ziqian Zhong, Aashiq Muhamed, Mona T. Diab, Virginia Smith, Aditi Raghunathan
Main category: cs.LG
TL;DR: Pando benchmark evaluates interpretability tools by controlling for elicitation confounder, showing gradient-based attribution and relevance patching work best when explanations are absent/misleading.
Details
Motivation: Addresses the elicitation confounder in mechanistic interpretability evaluations where apparent gains from white-box tools may just reflect better elicitation rather than true internal signal recovery.Method: Creates Pando benchmark with explanation axis: models trained to produce faithful explanations, no explanations, or unfaithful explanations of distractor rules. Tests interpretability tools on 720 finetuned models implementing hidden decision-tree rules.
Result: When explanations are faithful, black-box elicitation matches/exceeds white-box methods. When explanations are absent/misleading, gradient-based attribution improves accuracy by 3-5pp, relevance patching gives largest gains, while logit lens, sparse autoencoders, and circuit tracing provide no reliable benefit.
Conclusion: Gradient-based attribution tracks decision computation while other readouts are dominated by task representation biases. Provides controlled benchmark for evaluating interpretability tools.
Abstract: Mechanistic interpretability is often motivated for alignment auditing, where a model’s verbal explanations can be absent, incomplete, or misleading. Yet many evaluations do not control whether black-box prompting alone can recover the target behavior, so apparent gains from white-box tools may reflect elicitation rather than internal signal; we call this the elicitation confounder. We introduce Pando, a model-organism benchmark that breaks this confound via an explanation axis: models are trained to produce either faithful explanations of the true rule, no explanation, or confident but unfaithful explanations of a disjoint distractor rule. Across 720 finetuned models implementing hidden decision-tree rules, agents predict held-out model decisions from $10$ labeled query-response pairs, optionally augmented with one interpretability tool output. When explanations are faithful, black-box elicitation matches or exceeds all white-box methods; when explanations are absent or misleading, gradient-based attribution improves accuracy by 3-5 percentage points, and relevance patching, RelP, gives the largest gains, while logit lens, sparse autoencoders, and circuit tracing provide no reliable benefit. Variance decomposition suggests gradients track decision computation, which fields causally drive the output, whereas other readouts are dominated by task representation, biases toward field identity and value. We release all models, code, and evaluation infrastructure.
[1164] A Faster Path to Continual Learning
Wei Li, Hangjie Yuan, Zixiang Zhao, Borui Kang, Ziwei Liu, Tao Feng
Main category: cs.LG
TL;DR: C-Flat Turbo: A faster continual learning optimizer that reduces computational overhead while maintaining or improving accuracy by skipping redundant gradient computations and using adaptive scheduling.
Details
Motivation: C-Flat is an effective continual learning optimizer but requires three additional gradient computations per iteration, creating substantial computational overhead. The authors aim to develop a faster alternative that maintains the benefits of flat minima for continual learning.Method: C-Flat Turbo identifies that gradients associated with first-order flatness contain direction-invariant components relative to proxy-model gradients, allowing skipping of redundant gradient computations in perturbed ascent steps. It also uses a linear scheduling strategy with adaptive trigger to allocate larger turbo steps for later tasks as flatness-promoting gradients stabilize across tasks.
Result: C-Flat Turbo is 1.0× to 1.25× faster than C-Flat across a wide range of continual learning methods while achieving comparable or even improved accuracy.
Conclusion: C-Flat Turbo successfully reduces the computational cost of continual learning optimization while maintaining or improving performance, making flatness-based continual learning more practical.
Abstract: Continual Learning (CL) aims to train neural networks on a dynamic stream of tasks without forgetting previously learned knowledge. Among optimization-based approaches, C-Flat has emerged as a promising solution due to its plug-and-play nature and its ability to encourage uniformly low-loss regions for both new and old tasks. However, C-Flat requires three additional gradient computations per iteration, imposing substantial overhead on the optimization process. In this work, we propose C-Flat Turbo, a faster yet stronger optimizer that significantly reduces the training cost. We show that the gradients associated with first-order flatness contain direction-invariant components relative to the proxy-model gradients, enabling us to skip redundant gradient computations in the perturbed ascent steps. Moreover, we observe that these flatness-promoting gradients progressively stabilize across tasks, which motivates a linear scheduling strategy with an adaptive trigger to allocate larger turbo steps for later tasks. Experiments show that C-Flat Turbo is 1.0$\times$ to 1.25$\times$ faster than C-Flat across a wide range of CL methods, while achieving comparable or even improved accuracy.
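Since C-Flat extends SAM-style flatness optimization, the skip can be illustrated on a SAM-like step: reuse the previous ascent (perturbation) direction on designated turbo iterations, saving one gradient pass. The sketch below is an analogue under that assumption, not C-Flat Turbo's exact update or schedule.

```python
import torch
import torch.nn as nn

def flat_step(model, loss_fn, batch, opt, state, rho=0.05, reuse_every=2):
    """SAM-style flatness step that reuses the previous ascent direction on
    'turbo' iterations (illustrative analogue of the gradient skip)."""
    if state["step"] % reuse_every == 0:      # full step: fresh ascent direction
        loss_fn(model, batch).backward()
        state["eps"] = [rho * p.grad / (p.grad.norm() + 1e-12)
                        for p in model.parameters()]
        model.zero_grad()
    with torch.no_grad():                      # move to the (reused) ascent point
        for p, e in zip(model.parameters(), state["eps"]):
            p.add_(e)
    loss_fn(model, batch).backward()           # descent gradient at perturbed point
    with torch.no_grad():                      # undo the perturbation
        for p, e in zip(model.parameters(), state["eps"]):
            p.sub_(e)
    opt.step()
    opt.zero_grad()
    state["step"] += 1

model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = lambda m, b: m(b[0]).pow(2).mean()   # placeholder loss
state = {"step": 0, "eps": None}
for _ in range(4):
    flat_step(model, loss_fn, (torch.randn(8, 4),), opt, state)
```

Raising `reuse_every` over the task sequence mirrors the paper's linear schedule of larger turbo steps for later tasks.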
[1165] CausalGaze: Unveiling Hallucinations via Counterfactual Graph Intervention in Large Language Models
Linggang Kong, Lei Wu, Yunlong Zhang, Xiaofeng Zhong, Zhen Wang, Yongjie Wang, Yao Pan
Main category: cs.LG
TL;DR: CausalGaze: A novel hallucination detection framework using structural causal models and counterfactual interventions to distinguish causal reasoning from noise in LLMs.
Details
Motivation: Existing hallucination detection methods rely on passive observation of LLM internal states, which capture noise and spurious correlations rather than underlying causal mechanisms. There's a need to move from passive observation to active intervention for better detection.Method: CausalGaze models LLMs’ internal states as dynamic causal graphs using structural causal models (SCMs). It employs counterfactual interventions to disentangle causal reasoning paths from incidental noise, enhancing interpretability of hallucination detection.
Result: Extensive experiments across four datasets and three widely used LLMs show CausalGaze’s effectiveness, achieving over 5.2% improvement in AUROC on TruthfulQA dataset compared to state-of-the-art baselines.
Conclusion: CausalGaze successfully shifts hallucination detection from passive observation to active intervention, providing a more interpretable and effective framework by leveraging causal reasoning principles.
Abstract: Despite the groundbreaking advancements made by large language models (LLMs), hallucination remains a critical bottleneck for their deployment in high-stakes domains. Existing classification-based methods mainly rely on static and passive signals from internal states, which often captures the noise and spurious correlations, while overlooking the underlying causal mechanisms. To address this limitation, we shift the paradigm from passive observation to active intervention by introducing CausalGaze, a novel hallucination detection framework based on structural causal models (SCMs). CausalGaze models LLMs’ internal states as dynamic causal graphs and employs counterfactual interventions to disentangle causal reasoning paths from incidental noise, thereby enhancing model interpretability. Extensive experiments across four datasets and three widely used LLMs demonstrate the effectiveness of CausalGaze, especially achieving over 5.2% improvement in AUROC on the TruthfulQA dataset compared to state-of-the-art baselines.
[1166] Bottleneck Tokens for Unified Multimodal Retrieval
Siyu Sun, Jing Ren, Zhaohe Liao, Dongxiao Mao, Xiangyuan Ren, Yiyi Zhang, Haohua Zhao, Weixiong Lin, Jiang Shaohua, Liqing Zhang, Yuchao Zheng
Main category: cs.LG
TL;DR: BToks introduces bottleneck tokens for explicit pooling in multimodal LLMs, with generative information condensation training that forces semantic compression through these tokens, achieving SOTA on MMEB-V2 benchmark.
Details
Motivation: Current decoder-only MLLMs for multimodal retrieval have structural gaps: 1) implicit pooling overloads the hidden state of a standard vocabulary token, and 2) training provides no token-level supervision that forces semantic compression.Method: Two components: 1) Bottleneck Tokens (BToks) - learnable tokens as explicit pooling mechanism, and 2) Generative Information Condensation - next-token prediction with Condensation Mask that severs direct attention from target to query tokens, forcing all predictive signals through BToks for dense supervision.
Result: Achieves state-of-the-art on MMEB-V2 (78 datasets, 3 modalities, 9 meta-tasks) among 2B-scale methods: Overall score of 59.0 (+3.6 over VLM2Vec-V2), with substantial gains on semantically demanding tasks (+12.6 on Video-QA).
Conclusion: BToks with generative information condensation effectively addresses structural gaps in adapting decoder-only MLLMs for multimodal retrieval, providing explicit pooling and token-level supervision for semantic compression.
Abstract: Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token; second, training offers no token-level supervision that forces semantic compression. To close these gaps, this work introduces Bottleneck Tokens (BToks), learnable tokens that act as an explicit pooling mechanism, together with Generative Information Condensation: a next-token-prediction objective whose Condensation Mask severs direct attention from target to query tokens, routing all predictive signal through the BToks and yielding dense supervision. On the MMEB-V2 benchmark (78 datasets spanning three modalities and nine meta-tasks), the approach achieves a state-of-the-art overall score of 59.0 among 2B-scale methods (+3.6 over VLM2Vec-V2), with substantial gains on semantically demanding tasks such as Video-QA (+12.6).
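The Condensation Mask is the most mechanical piece: an attention mask in which target tokens cannot attend to query tokens directly and must route through the BToks. A hedged torch sketch, with the [query | BToks | target] layout and causal base mask as assumptions:

```python
import torch

def condensation_mask(n_query, n_btok, n_target):
    """Boolean attention mask (True = may attend): target tokens reach the
    query only through the bottleneck tokens. Layout and causality are
    illustrative assumptions."""
    n = n_query + n_btok + n_target
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal base
    mask[n_query + n_btok:, :n_query] = False  # sever direct target -> query
    return mask

m = condensation_mask(n_query=4, n_btok=2, n_target=3)
# BToks (rows 4-5) still see the query; targets (rows 6-8) see only BToks
print(m.int())
```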
[1167] Quantum-Gated Task-interaction Knowledge Distillation for Pre-trained Model-based Class-Incremental Learning
Linjie Li, Huiyu Xiao, Jiarui Cao, Zhenyu Wu, Yang Ji
Main category: cs.LG
TL;DR: QKD uses quantum gating for task interaction modeling in class-incremental learning to mitigate catastrophic forgetting by dynamically capturing sample-to-task relevance and enabling guided knowledge distillation between task subspaces.
Details
Motivation: Pretrained models in class-incremental learning struggle with entanglement of multi-task subspaces, leading to catastrophic forgetting when task routing parameters are poorly calibrated or task-level representations are rigidly fixed.Method: Proposes Quantum-Gated Task-interaction Knowledge Distillation (QKD) framework with quantum-gated task modulation gating mechanism to model relational dependencies among task embeddings, dynamically capturing sample-to-task relevance. Uses quantum gating outputs to guide task-interaction knowledge distillation from old to new adapters.
Result: Extensive experiments demonstrate QKD effectively mitigates forgetting and achieves state-of-the-art performance in class-incremental learning.
Conclusion: The quantum-gated approach successfully addresses the subspace entanglement problem in CIL by enabling dynamic task interaction modeling and guided knowledge transfer between tasks.
Abstract: Class-incremental learning (CIL) aims to continuously accumulate knowledge from a stream of tasks and construct a unified classifier over all seen classes. Although pretrained models (PTMs) have shown promising performance in CIL, they still struggle with the entanglement of multi-task subspaces, leading to catastrophic forgetting when task routing parameters are poorly calibrated or task-level representations are rigidly fixed. To address this issue, we propose a novel Quantum-Gated Task-interaction Knowledge Distillation (QKD) framework that leverages quantum gating to guide inter-task knowledge transfer. Specifically, we introduce a quantum-gated task modulation gating mechanism to model the relational dependencies among task embeddings, dynamically capturing the sample-to-task relevance for both joint training and inference across streaming tasks. Using the quantum gating outputs as task-embedding-level correlation weights, we perform task-interaction knowledge distillation from old to new adapters, enabling the model to bridge the representation gaps between independent task subspaces. Extensive experiments demonstrate that QKD effectively mitigates forgetting and achieves state-of-the-art performance.
[1168] Distributionally Robust K-Means Clustering
Vikrant Malik, Taylan Kargin, Babak Hassibi
Main category: cs.LG
TL;DR: Distributionally robust k-means clustering using Wasserstein-2 ambiguity sets to protect against outliers, distribution shifts, and limited samples
Details
Motivation: Standard k-means clustering is brittle to outliers, distribution shifts, and limited sample sizes. The authors aim to develop a more robust variant that protects against these pathologies by considering worst-case scenarios within a distributional uncertainty set.Method: View k-means as Lloyd-Max quantization of empirical distribution, then develop distributionally robust variant where population distribution lies within Wasserstein-2 ball around empirical distribution. Formulate as minimax problem, derive tractable dual yielding soft-clustering with smooth weights instead of hard assignments. Propose efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence.
Result: Experiments on standard benchmarks and large-scale synthetic data demonstrate substantial gains in outlier detection and robustness to noise compared to standard k-means.
Conclusion: Distributionally robust k-means using Wasserstein ambiguity sets provides effective protection against outliers and distribution shifts while maintaining computational efficiency through the proposed algorithm.
Abstract: K-means clustering is a workhorse of unsupervised learning, but it is notoriously brittle to outliers, distribution shifts, and limited sample sizes. Viewing k-means as Lloyd–Max quantization of the empirical distribution, we develop a distributionally robust variant that protects against such pathologies. We posit that the unknown population distribution lies within a Wasserstein-2 ball around the empirical distribution. In this setting, one seeks cluster centers that minimize the worst-case expected squared distance over this ambiguity set, leading to a minimax formulation. A tractable dual yields a soft-clustering scheme that replaces hard assignments with smoothly weighted ones. We propose an efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence. Experiments on standard benchmarks and large-scale synthetic data demonstrate substantial gains in outlier detection and robustness to noise.
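The dual's soft-clustering scheme can be illustrated with a generic block-coordinate-descent soft k-means, where a smoothing parameter plays the role of the dual's ambiguity-radius-dependent temperature. The softmax assignment below is a stand-in for the exact dual weights, not the paper's formula.

```python
import numpy as np

def soft_kmeans(X, k, lam=1.0, iters=50, seed=0):
    """Block coordinate descent with smooth (softmax) assignments; lam
    controls the softness, standing in for the DRO dual's temperature."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # (n, k)
        logits = -d2 / lam
        logits -= logits.max(axis=1, keepdims=True)           # stability
        W = np.exp(logits)
        W /= W.sum(axis=1, keepdims=True)                     # soft assignments
        C = (W.T @ X) / W.sum(axis=0)[:, None]                # weighted centers
    return C, W

X = np.vstack([np.random.default_rng(6).normal(m, 0.3, size=(50, 2))
               for m in ([0, 0], [3, 3])])
centers, weights = soft_kmeans(X, k=2, lam=0.5)
```

Points far from every center receive diffuse weights, which is what blunts the outlier sensitivity of hard assignments.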
[1169] Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
Chenhao Fang, Jordi Mola, Mark Harman, Jason Nawrocki, Vaibhav Shrivastava, Yue Cheng, Jay Minesh Shah, Katayoun Zand, Mansi Tripathi, Arya Pudota, Matthew Becker, Hervé Robert, Abhishek Gulati
Main category: cs.LG
TL;DR: The paper introduces HUMBR, a Minimum Bayes Risk framework that combines semantic embedding similarity with lexical precision to reduce hallucinations in LLM outputs for high-stakes enterprise workflows, showing significant improvements over standard methods.
Details
Motivation: LLM hallucinations pose serious risks in high-stakes enterprise workflows like legal matters, risk management, and privacy compliance where a single hallucinated clause can have material consequences. There's a critical need for reliable hallucination mitigation in production systems.Method: Proposes Hybrid Utility MBR (HUMBR) framework that frames hallucination mitigation as a Minimum Bayes Risk problem. It synthesizes semantic embedding similarity with lexical precision to identify consensus without ground-truth references, with derived rigorous error bounds.
Result: MBR significantly outperforms standard Universal Self-Consistency on public benchmarks (TruthfulQA and LegalBench) and Meta production data. 81% of pipeline suggestions were preferred over human-crafted ground truth, and critical recall failures were virtually eliminated.
Conclusion: The HUMBR framework provides an effective approach to reduce hallucination risks in high-stakes enterprise LLM applications, offering both theoretical guarantees and practical performance improvements in production environments.
Abstract: Although LLMs drive automation, it is critical to ensure immense consideration for high-stakes enterprise workflows such as those involving legal matters, risk management, and privacy compliance. For Meta, and other organizations like ours, a single hallucinated clause in such high stakes workflows risks material consequences. We show that by framing hallucination mitigation as a Minimum Bayes Risk (MBR) problem, we can dramatically reduce this risk. Specifically, we introduce a Hybrid Utility MBR (HUMBR) framework that synthesizes semantic embedding similarity with lexical precision to identify consensus without ground-truth references, for which we derive rigorous error bounds. We complement this theoretical analysis with a comprehensive empirical evaluation on widely-used public benchmark suites (TruthfulQA and LegalBench) and also real world data from Meta production deployment. The results from our empirical study show that MBR significantly outperforms standard Universal Self-Consistency. Notably, 81% of the pipeline’s suggestions were preferred over human-crafted ground truth, and critical recall failures were virtually eliminated.
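Reference-free MBR decoding selects the candidate with the highest expected utility against the other sampled outputs. A minimal sketch with a convex combination of semantic and lexical similarity as the hybrid utility (the combination form and the weight `alpha` are assumptions):

```python
import numpy as np

def mbr_select(candidates, semantic_sim, lexical_sim, alpha=0.5):
    """Pick the candidate with the highest mean hybrid utility against its
    peers (Monte Carlo MBR without ground-truth references)."""
    U = alpha * semantic_sim + (1 - alpha) * lexical_sim   # (n, n) utilities
    np.fill_diagonal(U, 0.0)                 # exclude self-similarity
    expected_utility = U.mean(axis=1)        # expectation over sampled peers
    return candidates[int(np.argmax(expected_utility))]

cands = ["clause A applies", "clause A applies.", "clause B applies"]
sem = np.array([[1.0, 0.98, 0.40],
                [0.98, 1.0, 0.41],
                [0.40, 0.41, 1.0]])
lex = np.array([[1.0, 0.95, 0.30],
                [0.95, 1.0, 0.30],
                [0.30, 0.30, 1.0]])
print(mbr_select(cands, sem, lex))           # consensus favors the A-variants
```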
[1170] A Full Compression Pipeline for Green Federated Learning in Communication-Constrained Environments
Elouan Colybes, Shririn Salehi, Anke Schmeink
Main category: cs.LG
TL;DR: FCP introduces a full compression pipeline for federated learning that combines pruning, quantization, and Huffman encoding to reduce communication and computational overhead while maintaining accuracy.
Details
Motivation: Federated Learning enables privacy-preserving collaborative training but suffers from significant communication and computational overhead that limits scalability and sustainability, especially in communication-constrained environments.Method: Proposes a Full Compression Pipeline (FCP) that integrates three complementary deep compression techniques: pruning, quantization, and Huffman encoding into a unified end-to-end framework. Also develops an evaluation framework that captures both communication and computation overheads as a unified model cost.
Result: Achieves more than 11× reduction in model size with only 2% accuracy drop compared to uncompressed baseline. In training ResNet-12 on CIFAR-10 with 10 clients and 2 Mbps bandwidth, FL training becomes more than 60% faster.
Conclusion: FCP substantially reduces transmission costs and resource consumption in FL while maintaining competitive accuracy, making FL more scalable and sustainable in communication-constrained environments.
Abstract: Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, thereby preserving privacy. However, FL often suffers from significant communication and computational overhead, limiting its scalability and sustainability. In this work, we introduce a Full Compression Pipeline (FCP) for FL in communication-constrained environments. FCP integrates three complementary deep compression techniques (pruning, quantization, and Huffman encoding) into a unified end-to-end framework. By compressing local models and communication payloads, FCP substantially reduces transmission costs and resource consumption while maintaining competitive accuracy. To quantify its impact, we develop an evaluation framework that captures both communication and computation overheads as a unified model cost, allowing a holistic assessment of efficiency trade-offs. The pipeline is evaluated in both independent and identically distributed (IID) and non-IID data settings. In one representative scenario, training a ResNet-12 model on the CIFAR-10 dataset with ten clients and a 2 Mbps bandwidth, the FCP achieves more than 11$\times$ reduction in model size, with only a 2% drop in accuracy compared to the uncompressed baseline. This results in an FL training that is more than 60% faster.
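The three stages compose mechanically, which a small sketch makes concrete. This applies magnitude pruning, 4-bit uniform quantization, and Huffman coding to a single weight matrix; the pruning fraction and bit width are illustrative, and the real pipeline operates on local models and payloads across FL rounds:

```python
import heapq
import numpy as np
from collections import Counter

def huffman_total_bits(counts):
    # Total encoded bits: each merge in Huffman's algorithm adds the
    # merged weight once, which sums to sum(count_i * codelength_i).
    heap = sorted(counts.values())
    if len(heap) < 2:
        return sum(heap)  # degenerate single-symbol stream
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

def fcp_ratio(w, prune_frac=0.8, n_bits=4):
    # Stage 1: magnitude pruning zeroes the smallest weights; this piles
    # probability mass onto one symbol, which is what makes stage 3 pay off.
    w = np.where(np.abs(w) < np.quantile(np.abs(w), prune_frac), 0.0, w)
    # Stage 2: uniform quantization to 2^n_bits levels.
    lo, hi = float(w.min()), float(w.max())
    q = np.round((w - lo) / ((hi - lo) / (2 ** n_bits - 1))).astype(np.int64)
    # Stage 3: Huffman coding of the quantized symbols.
    bits = huffman_total_bits(Counter(q.ravel().tolist()))
    return (w.size * 32) / bits  # vs. an FP32 baseline

rng = np.random.default_rng(0)
print(f"compression ratio ~{fcp_ratio(rng.normal(size=(256, 256))):.1f}x")
```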
[1171] Gradient-Variation Regret Bounds for Unconstrained Online Learning
Yuheng Zhao, Andrew Jacobsen, Nicolò Cesa-Bianchi, Peng Zhao
Main category: cs.LG
TL;DR: Parameter-free online learning algorithms with regret scaling with gradient variation, achieving adaptive bounds without prior knowledge of comparator norm, Lipschitz constant, or smoothness.
Details
Motivation: To develop parameter-free online learning algorithms that adapt automatically to problem parameters without requiring prior knowledge of comparator norm, Lipschitz constant, or smoothness, with regret guarantees scaling with gradient variation.
Method: Develop fully-adaptive algorithms for L-smooth convex loss functions that achieve regret scaling with gradient variation V_T(u). The algorithms use closed-form updates computed efficiently each round without requiring parameter tuning.
Result: Achieved regret of order Õ(||u||√V_T(u) + L||u||² + G⁴) without prior knowledge of parameters. Results extend to dynamic regret and improve upon previous best-known results in the stochastically-extended adversarial (SEA) model.
Conclusion: Parameter-free online learning algorithms with gradient variation scaling provide adaptive regret bounds without parameter tuning, with applications to dynamic regret and SEA model improvements.
Abstract: We develop parameter-free algorithms for unconstrained online learning with regret guarantees that scale with the gradient variation $V_T(u) = \sum_{t=2}^T \|\nabla f_t(u)-\nabla f_{t-1}(u)\|^2$. For $L$-smooth convex losses, we provide fully-adaptive algorithms achieving regret of order $\widetilde{O}(\|u\|\sqrt{V_T(u)} + L\|u\|^2+G^4)$ without requiring prior knowledge of the comparator norm $\|u\|$, Lipschitz constant $G$, or smoothness $L$. The update in each round can be computed efficiently via a closed-form expression. Our results extend to dynamic regret and find immediate implications to the stochastically-extended adversarial (SEA) model, which significantly improves upon the previous best-known result [Wang et al., 2025].
[1172] Towards Situation-aware State Modeling for Air Traffic Flow Prediction
Anqi Liu, Bin Wang, Jiangtao Zhao, Dechuan Ma, Guiyuan Jiang, Feng Hong, Yanwei Yu, Tianrui Li
Main category: cs.LG
TL;DR: AeroSense: A direct state-to-flow modeling framework for air traffic prediction that processes dynamic sets of aircraft states instead of aggregated time series, achieving SOTA performance through situation-aware representation and masked self-attention.
Details
Motivation: Existing air traffic prediction methods rely on time series-based forecasting that overlooks critical real-time aircraft state information like kinematics and proximity to boundaries, limiting predictive accuracy.
Method: Proposes AeroSense framework with situation-aware state representation to directly process variable numbers of aircraft states, uses masked self-attention to capture inter-aircraft interactions, and employs two decoupled prediction heads for heterogeneous flow dynamics in terminal airspace functional areas.
Result: Achieves state-of-the-art performance on a large-scale real-world airport dataset, shows superior robustness during peak traffic, Pareto-optimal performance under multi-objective evaluation, and provides interpretable attention visualizations.
Conclusion: Direct modeling of microscopic aircraft states yields substantially higher predictive fidelity than time series-based approaches, with the framework demonstrating practical utility for proactive air traffic management.
Abstract: Accurate air traffic prediction in the terminal airspace (TA) is pivotal for proactive air traffic management (ATM). However, existing data-driven approaches predominantly rely on time series-based forecasting paradigms, which inherently overlook critical aircraft state information, such as real-time kinematics and proximity to airspace boundaries. To address this limitation, we propose AeroSense, a direct state-to-flow modeling framework for air traffic prediction. Unlike classical time series-based methods that first aggregate aircraft trajectories into macroscopic flow sequences before modeling, AeroSense explicitly represents the real-time airspace situation as a dynamic set of aircraft states, enabling the direct processing of a variable number of aircraft instead of time series as inputs. Specifically, we introduce a situation-aware state representation that enables AeroSense to sense the instantaneous terminal airspace situation directly from microscopic aircraft states. Furthermore, we design a model architecture that incorporates masked self-attention to capture inter-aircraft interactions, together with two decoupled prediction heads to model heterogeneous flow dynamics across two key functional areas of the TA. Extensive experiments on a large-scale real-world airport dataset demonstrate that AeroSense consistently achieves state-of-the-art performance, validating that direct modeling of microscopic aircraft states yields substantially higher predictive fidelity than time series-based baselines. Moreover, the proposed framework exhibits superior robustness during peak traffic periods, achieves Pareto-optimal performance under dayparting multi-objective evaluation, and provides meaningful interpretability through attention-based visualizations.
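The core architectural move, attention over a variable-size set of aircraft states rather than a flow time series, is easy to sketch. The dimensions, pooling step, and two head names below are assumptions for illustration, not AeroSense's actual configuration:

```python
import torch
import torch.nn as nn

class StateToFlow(nn.Module):
    # Minimal sketch: a set of per-aircraft state vectors goes through
    # masked self-attention; two heads predict flows for two functional
    # areas of the terminal airspace.
    def __init__(self, state_dim=8, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head_a = nn.Linear(d_model, 1)  # e.g. arrival-area flow
        self.head_b = nn.Linear(d_model, 1)  # e.g. departure-area flow

    def forward(self, states, pad_mask):
        # states: (B, N_max, state_dim); pad_mask: (B, N_max), True = padding.
        h = self.embed(states)
        h, _ = self.attn(h, h, h, key_padding_mask=pad_mask)
        h = h.masked_fill(pad_mask.unsqueeze(-1), 0.0)
        pooled = h.sum(1) / (~pad_mask).sum(1, keepdim=True).clamp(min=1)
        return self.head_a(pooled), self.head_b(pooled)

model = StateToFlow()
states = torch.randn(2, 5, 8)              # up to 5 aircraft per sample
pad = torch.tensor([[False]*3 + [True]*2,  # sample 1 has 3 aircraft
                    [False]*5])            # sample 2 has 5
flow_a, flow_b = model(states, pad)
```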
[1173] ShapShift: Explaining Model Prediction Shifts with Subgroup Conditional Shapley Values
Tom Bewley, Salim I. Amoukou, Emanuele Albini, Saumitra Mishra, Manuela Veloso
Main category: cs.LG
TL;DR: A Shapley value method for attributing prediction shifts in ML models to changes in conditional probabilities of interpretable subgroups defined by decision tree structure, applicable to single trees, ensembles, and model-agnostic settings via surrogate trees.
Details
Motivation: Understanding causes of prediction shifts in ML models when input distributions change is crucial for real-world applications (e.g., loan approval rates), as these shifts can impact downstream business outcomes and require model monitoring in dynamic environments.
Method: Proposes a Shapley value method that attributes prediction shifts to changes in conditional probabilities of interpretable subgroups defined by decision tree structure. For single trees: exact explanations based on conditional probability changes at split nodes. For tree ensembles: selects most explanatory tree and accounts for residual effects. Model-agnostic variant: uses surrogate trees grown with novel objective function for neural networks etc. Includes approximation techniques for practical computation.
Result: The method provides simple, faithful, and near-complete explanations of prediction shifts across model classes, enabling effective model monitoring in dynamic environments.
Conclusion: The proposed Shapley value method offers interpretable attribution of prediction shifts to conditional probability changes in data subgroups, supporting model monitoring and understanding in changing environments across various model types.
Abstract: Changes in input distribution can induce shifts in the average predictions of machine learning models. Such prediction shifts may impact downstream business outcomes (e.g. a bank’s loan approval rate), so understanding their causes can be crucial. We propose ShapShift: a Shapley value method for attributing prediction shifts to changes in the conditional probabilities of interpretable subgroups of data, where these subgroups are defined by the structure of decision trees. We initially apply this method to single decision trees, providing exact explanations based on conditional probability changes at split nodes. Next, we extend it to tree ensembles by selecting the most explanatory tree and accounting for residual effects. Finally, we propose a model-agnostic variant using surrogate trees grown with a novel objective function, allowing application to models like neural networks. While exact computation can be intensive, approximation techniques enable practical application. We show that ShapShift provides simple, faithful, and near-complete explanations of prediction shifts across model classes, aiding model monitoring in dynamic environments.
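The quantity being attributed is concrete in the single-tree base case: the mean-prediction shift decomposes exactly over changes in leaf-occupancy probabilities. A sketch of that decomposition (the full method adds Shapley attribution over subgroups, ensembles, and surrogates, which this omits):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_old = rng.normal(0.0, 1.0, size=(2000, 3))
X_new = rng.normal(0.4, 1.0, size=(2000, 3))  # input distribution shifts
y = X_old[:, 0] + 0.5 * X_old[:, 1]

tree = DecisionTreeRegressor(max_depth=3).fit(X_old, y)
t = tree.tree_
leaves = np.where(t.children_left == -1)[0]
values = t.value[leaves, 0, 0]  # mean prediction in each leaf

def occupancy(X):
    ids = tree.apply(X)
    return np.array([np.mean(ids == l) for l in leaves])

# Exact for a single tree: the mean-prediction shift is a sum of leaf
# values weighted by the leaf-occupancy changes.
contrib = values * (occupancy(X_new) - occupancy(X_old))
print("per-leaf contributions:", np.round(contrib, 3))
print("sum of contributions:  ", contrib.sum())
print("actual prediction shift:",
      tree.predict(X_new).mean() - tree.predict(X_old).mean())
```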
[1174] Unified Graph Prompt Learning via Low-Rank Graph Message Prompting
Beibei Wang, Bo Jiang, Ziyan Zhang, Jin Tang
Main category: cs.LG
TL;DR: LR-GMP is a unified graph prompt learning approach that uses low-rank prompt representation to concurrently prompt all graph components, outperforming existing methods that target components separately.
Details
Motivation: Existing Graph Data Prompts (GDPs) are designed for specific graph components (node features, edge features, edge weights) and operate in limited prompt spaces. There's a lack of a unified prompter that can target all graph components simultaneously for efficient adaptation of pre-trained GNNs.
Method: First reinterpret existing GDPs through a Graph Message Prompt (GMP) paradigm. Then propose Low-Rank GMP (LR-GMP) that uses low-rank prompt representation for effective and compact graph prompt learning. Unlike traditional GDPs, LR-GMP concurrently prompts all graph components in a unified manner.
Result: Extensive experiments on several graph benchmark datasets demonstrate LR-GMP achieves significantly superior generalization and robustness on diverse downstream tasks compared to existing GDP methods.
Conclusion: LR-GMP provides a unified framework for graph prompt learning that outperforms component-specific approaches, offering better generalization and robustness across various graph tasks.
Abstract: Graph Data Prompt (GDP), which introduces specific prompts in graph data for efficiently adapting pre-trained GNNs, has become a mainstream approach to graph fine-tuning learning problem. However, existing GDPs have been respectively designed for distinct graph component (e.g., node features, edge features, edge weights) and thus operate within limited prompt spaces for graph data. To the best of our knowledge, it still lacks a unified prompter suitable for targeting all graph components simultaneously. To address this challenge, in this paper, we first propose to reinterpret a wide range of existing GDPs from an aspect of Graph Message Prompt (GMP) paradigm. Based on GMP, we then introduce a novel graph prompt learning approach, termed Low-Rank GMP (LR-GMP), which leverages low-rank prompt representation to achieve an effective and compact graph prompt learning. Unlike traditional GDPs that target distinct graph components separately, LR-GMP concurrently performs prompting on all graph components in a unified manner, thereby achieving significantly superior generalization and robustness on diverse downstream tasks. Extensive experiments on several graph benchmark datasets demonstrate the effectiveness and advantages of our proposed LR-GMP.
[1175] AbLWR: A Context-Aware Listwise Ranking Framework for Antibody-Antigen Binding Affinity Prediction via Positive-Unlabeled Learning
Fan Xu, Zhi-an Huang, Haohuai He, Yidong Song, Wei Liu, Dongxu Zhang, Yao Hu, Kay Chen Tan
Main category: cs.LG
TL;DR: AbLWR reformulates antibody-antigen binding affinity prediction as listwise ranking with PU learning and homologous antigen sampling to address label sparsity and antigenic variation.
Details
Motivation: Current antibody-antigen binding affinity prediction suffers from severe label sparsity and complexity of antigenic variations, limiting therapeutic design applications.
Method: Reformulates affinity regression as listwise ranking, incorporates PU learning with dual-level contrastive objective and meta-optimized label refinement, and uses homologous antigen sampling with Multi-Head Self-Attention to model inter-sample relationships.
Result: Significantly outperforms state-of-the-art baselines, improving Precision@1 by over 10% in randomized cross-validation, with case studies on Influenza and IL-33 demonstrating practical utility for distinguishing viral mutations and prioritizing candidates.
Conclusion: AbLWR provides an effective framework for antibody-antigen binding affinity prediction that addresses key challenges of label sparsity and antigenic variation through innovative ranking and learning approaches.
Abstract: Accurate prediction of antibody-antigen binding affinity is fundamental to therapeutic design, yet remains constrained by severe label sparsity and the complexity of antigenic variations. In this paper, we propose AbLWR (Antibody-antigen binding affinity List-Wise Ranking), a novel framework that reformulates the conventional affinity regression task as a listwise ranking problem. To mitigate label sparsity, AbLWR incorporates a PU (Positive-Unlabeled) learning mechanism leveraging a dual-level contrastive objective and meta-optimized label refinement to learn robust representations. Furthermore, we address antigenic variation by employing a homologous antigen sampling strategy where Multi-Head Self-Attention (MHSA) explicitly models inter-sample relationships within training lists to capture subtle affinity nuances. Extensive experiments demonstrate that AbLWR significantly outperforms state-of-the-art baselines, improving the Precision@1 (P@1) by over 10% in randomized cross-validation experiments. Notably, case studies on Influenza and IL-33 validate its practical utility, demonstrating robust ranking consistency in distinguishing subtle viral mutations and efficiently prioritizing top-tier candidates for wet-lab screening.
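The listwise reformulation itself is compact. A ListNet-style top-1 objective over a candidate list is one standard instantiation; AbLWR's full objective adds PU learning and MHSA over lists, which this sketch omits:

```python
import torch
import torch.nn.functional as F

def listwise_rank_loss(pred_scores, true_affinities):
    # ListNet-style top-1 loss: match the predicted top-1 distribution
    # over a list of antibody candidates to the affinity-derived one.
    p_true = F.softmax(true_affinities, dim=-1)
    log_p_pred = F.log_softmax(pred_scores, dim=-1)
    return -(p_true * log_p_pred).sum(dim=-1).mean()

# One training list of 6 candidates against a shared antigen.
scores = torch.randn(1, 6, requires_grad=True)
affinities = torch.tensor([[9.1, 7.3, 8.8, 5.0, 6.2, 7.9]])
loss = listwise_rank_loss(scores, affinities)
loss.backward()  # gradients favor ranking high-affinity candidates first
```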
[1176] Mycelium-Index: A Streaming Approximate Nearest Neighbor Index with Myelial Edge Decay, Traffic-Driven Reinforcement, and Adaptive Living Hierarchy
Anton Pakhunov
Main category: cs.LG
TL;DR: Mycelium-index is a streaming approximate nearest neighbor index inspired by biological mycelium growth patterns, achieving high recall with significantly less RAM and higher query throughput than existing methods.
Details
Motivation: To create an efficient streaming ANN index that can continuously adapt to data changes while maintaining high performance and low memory usage, inspired by biological mycelium's adaptive growth patterns.
Method: Uses mycelial edge decay and reinforcement, traffic-driven living hierarchy, and hybrid deletion combining O(1) bypass for cold nodes with O(k) beam-search repair for hub nodes. Includes NEON SIMD distance computation, Vec-backed node storage, and bitset visited tracking optimizations.
Result: Achieves 0.927 recall@5 under FreshDiskANN’s 100%-turnover benchmark with 5.7x less RAM (88 MB vs. >500 MB) and 4.7x higher QPS (2,795 vs. ~600). On static index, matches HNSW recall at 5.2x less RAM (163 MB vs. 854 MB).
Conclusion: Mycelium-index demonstrates efficient streaming ANN capabilities with novel biological-inspired mechanisms, discovering that topological repair mechanisms succeed where geometric heuristics fail in high dimensions.
Abstract: We present mycelium-index, a streaming approximate nearest neighbor (ANN) index for high-dimensional vector spaces, inspired by the adaptive growth patterns of biological mycelium. The system continuously adapts its topology through myelial edge decay and reinforcement, a traffic-driven living hierarchy, and hybrid deletion combining O(1) bypass for cold nodes with O(k) beam-search repair for hub nodes. Experimental evaluation on SIFT-1M demonstrates that mycelium achieves 0.927 +/- 0.028 recall@5 under FreshDiskANN’s 100%-turnover benchmark protocol – within the measurement confidence interval of FreshDiskANN’s ~0.95 – while using 5.7x less RAM (88 MB vs. >500 MB) and achieving 4.7x higher QPS (2,795 vs. ~600). On the static index, at ef=192, mycelium matches HNSW M=16 recall (0.962 vs. 0.965) at 5.2x less RAM (163 MB vs. 854 MB). Performance optimizations including NEON SIMD distance computation, Vec-backed node storage, and bitset visited tracking yield a cumulative 2.7x QPS improvement. A systematic study of ten streaming repair mechanisms finds that geometric heuristics universally fail in high dimensions, while topological mechanisms succeed – a principle we term the topological repair invariance of high-dimensional ANN graphs.
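The decay/reinforcement dynamic at the heart of the index fits in a few lines. The constants and the toy three-node graph are illustrative, not the paper's tuned values:

```python
# Traversed edges are strengthened, all edges decay each step, and edges
# below a threshold are pruned, like unused hyphae dying off.
DECAY, REINFORCE, CAP, PRUNE_AT = 0.99, 0.2, 2.0, 0.05

weights = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0}

def traverse(path):
    for edge in zip(path, path[1:]):
        if edge in weights:
            weights[edge] = min(weights[edge] + REINFORCE, CAP)

def decay_step():
    for e in list(weights):
        weights[e] *= DECAY
        if weights[e] < PRUNE_AT:
            del weights[e]  # cold edge is pruned from the topology

for step in range(500):
    if step % 3 == 0:
        traverse(["a", "b", "c"])  # hot path keeps getting reinforced
    decay_step()

print(weights)  # ("a","c") has decayed away; the hot path survives
```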
[1177] Sheaf Diffusion with Adaptive Local Structure for Spatio-Temporal Forecasting
Abeer Mostafa, Raneen Younis, Zahra Ahmadi
Main category: cs.LG
TL;DR: ST-Sheaf GNN: A spatio-temporal forecasting model using sheaf theory to learn dynamic local information flow instead of global node representations, achieving state-of-the-art performance on diverse benchmarks.
Details
Motivation: Conventional message passing GNNs struggle with highly heterogeneous spatio-temporal systems and oversmoothing in deep architectures. They propagate globally aligned node representations but fail to capture complex local interactions and information flow patterns.
Method: Reformulates spatio-temporal forecasting as learning information flow over locally structured spaces using sheaf theory. Introduces ST-Sheaf GNN that embeds graph topology into sheaf-theoretic vector spaces with learned linear restriction maps that dynamically evolve over time and adapt to local spatio-temporal patterns.
Result: Achieves state-of-the-art performance on diverse real-world spatio-temporal forecasting benchmarks across multiple domains. The framework effectively mitigates oversmoothing in deep GNN architectures.
Conclusion: Sheaf-theoretic topological representations provide a powerful foundation for spatio-temporal graph learning by explicitly modeling latent local structure and enabling more expressive interactions through dynamic restriction maps.
Abstract: Spatio-temporal systems often exhibit highly heterogeneous and non-intuitive responses to localized disruptions, limiting the effectiveness of conventional message passing approaches in modeling higher-order interactions under local heterogeneity. This paper reformulates spatio-temporal forecasting as the problem of learning information flow over locally structured spaces, rather than propagating globally aligned node representations. We introduce a spatio-temporal sheaf diffusion graph neural network (ST-Sheaf GNN) that embeds graph topology into sheaf-theoretic vector spaces connected by learned linear restriction maps. Unlike prior work that relies on static or globally shared transformations, our model learns dynamic restriction maps that evolve over time and adapt to local spatio-temporal patterns to enable substantially more expressive interactions. By explicitly modeling latent local structure, the proposed framework efficiently mitigates the oversmoothing phenomenon in deep GNN architectures. We evaluate our framework on a diverse set of real-world spatio-temporal forecasting benchmarks spanning multiple domains. Experimental results demonstrate state-of-the-art performance, highlighting the effectiveness of sheaf-theoretic topological representations as a powerful foundation for spatio-temporal graph learning. The code is available at: https://anonymous.4open.science/r/ST-SheafGNN-6523/.
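A single sheaf-diffusion step shows what the restriction maps do: they measure disagreement in a shared edge stalk and descend the resulting Dirichlet energy. The map shapes and toy graph below are assumptions; in ST-Sheaf GNN the maps are learned and time-varying:

```python
import torch

# One sheaf-diffusion step on a tiny graph. Each edge (u, v) carries
# restriction maps F_u, F_v into a shared edge stalk; diffusion descends
#   E(x) = sum_e || F_u x_u - F_v x_v ||^2.
d = 4                                    # stalk dimension
edges = [(0, 1), (1, 2), (0, 2)]
F_maps = {(u, v): (torch.randn(d, d) * 0.3, torch.randn(d, d) * 0.3)
          for (u, v) in edges}
x = torch.randn(3, d)                    # one stalk vector per node

def sheaf_diffusion_step(x, eta=0.1):
    grad = torch.zeros_like(x)
    for (u, v), (Fu, Fv) in F_maps.items():
        r = x[u] @ Fu.T - x[v] @ Fv.T    # disagreement in the edge stalk
        grad[u] += r @ Fu                # dE/dx_u (up to a constant factor)
        grad[v] -= r @ Fv                # dE/dx_v
    return x - eta * grad

for _ in range(50):
    x = sheaf_diffusion_step(x)          # features settle where maps agree
```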
[1178] Representation-Aligned Multi-Scale Personalization for Federated Learning
Wenfei Liang, Wee Peng Tay
Main category: cs.LG
TL;DR: FRAMP is a federated learning framework that generates personalized client models from compact descriptors instead of using fixed global models, enabling adaptation to both data characteristics and computational constraints.
Details
Motivation: Current federated learning approaches use shared full-size models with submodel extraction for resource-constrained clients, but this limits structural diversity and representational adaptation across clients with different data distributions and computational budgets.
Method: FRAMP generates client-specific models from compact client descriptors rather than using a fixed global backbone. Each client trains a tailored lightweight submodel and aligns its learned representation with others to maintain global semantic consistency while adapting to local data and computational constraints.
Result: Extensive experiments on vision and graph benchmarks demonstrate that FRAMP enhances generalization and adaptivity across a wide range of client settings compared to existing approaches.
Conclusion: FRAMP provides a unified framework for personalized and resource-adaptive federated learning that overcomes limitations of fixed global models by enabling fine-grained adaptation to both client data characteristics and computational budgets.
Abstract: In federated learning (FL), accommodating clients with diverse resource constraints remains a significant challenge. A widely adopted approach is to use a shared full-size model, from which each client extracts a submodel aligned with its computational budget. However, regardless of the specific scoring strategy, these methods rely on the same global backbone, limiting both structural diversity and representational adaptation across clients. This paper presents FRAMP, a unified framework for personalized and resource-adaptive federated learning. Instead of relying on a fixed global model, FRAMP generates client-specific models from compact client descriptors, enabling fine-grained adaptation to both data characteristics and computational budgets. Each client trains a tailored lightweight submodel and aligns its learned representation with others to maintain global semantic consistency. Extensive experiments on vision and graph benchmarks demonstrate that FRAMP enhances generalization and adaptivity across a wide range of client settings.
[1179] THEIA: Learning Complete Kleene Three-Valued Logic in a Pure-Neural Modular Architecture
Augustus Haoyang Li
Main category: cs.LG
TL;DR: THEIA is a modular neural architecture that learns Kleene three-valued logic end-to-end without symbolic solvers, achieving strong compositional generalization through structured inductive biases.
Details
Motivation: To understand what architectural priors enable compositional generalization under uncertainty in logical reasoning, particularly investigating whether modular structures outperform monolithic architectures.
Method: A modular architecture with dedicated engines for four mathematical domains (arithmetic, order, set membership, propositional logic) that converge in a final logic module, trained on a 2M-sample dataset with massive input space.
Result: Achieves 12/12 Kleene K3 rule coverage across 5 seeds in 9.2±3.5 minutes (5.6x faster than comparable Transformer), and shows exceptional length generalization from 5-step to 500-step evaluation at 99.97%±0.02% accuracy.
Conclusion: Modular architectures with structured inductive biases enable superior compositional generalization compared to flat MLPs or Transformers, with different representational strategies for implementing compositionality.
Abstract: We present THEIA, a modular neural architecture that learns complete Kleene three-valued logic (K3) end-to-end without any external symbolic solver, and investigate what architectural prior enables compositional generalization under uncertainty. THEIA processes four mathematical domains (arithmetic, order, set membership, propositional logic) through dedicated engines that converge in a final logic module. Trained on a 2M-sample dataset with input space ~3.4x10^13, it achieves 12/12 Kleene K3 rule coverage across 5 seeds in 9.2 +/- 3.5 minutes (5.6x faster than a parameter-comparable Transformer under matched settings). A mod-3 sequential composition experiment generalizes from 5-step training to 500-step evaluation at 99.97% +/- 0.02% – a result that critically depends on structured inductive bias: replacing the four-engine backbone with a flat MLP collapses length generalization to chance by 50 steps regardless of capacity (both 0.80M and parameter-matched 2.75M variants fail), while a pre-LN TF8LTuned Transformer baseline (3,582,147 params) trained under the identical protocol reaches 99.24% at 500 steps (Appendix D). Mechanistic probing reveals that modularity induces a delayed verdict: upstream engines encode domain-specific variables without committing to the final truth value (probe accuracy <= 74% uncertainty-only ceiling), with the verdict emerging only at the Logic Engine boundary – causally confirmed by activation patching (100% flip rate on 986 matched pairs, replicated across n=5 seeds; 100.0% aggregate). The Transformer baseline reaches equivalent correctness through a qualitatively different representational trajectory (contraction then expansion), suggesting that modular and monolithic architectures implement distinct compositional strategies.
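The target semantics is small enough to state exactly. One standard encoding of strong Kleene three-valued logic maps F, U, T to 0, 1/2, 1, with AND = min, OR = max, NOT = 1 − x; truth tables of this kind are what rule coverage is measured against:

```python
# Strong Kleene three-valued logic (K3) with the min/max/complement encoding.
F, U, T = 0.0, 0.5, 1.0
k3_and = min
k3_or = max
def k3_not(a):
    return 1.0 - a

names = {0.0: "F", 0.5: "U", 1.0: "T"}
for a in (F, U, T):
    for b in (F, U, T):
        print(f"{names[a]} AND {names[b]} = {names[k3_and(a, b)]},  "
              f"{names[a]} OR {names[b]} = {names[k3_or(a, b)]}")
# e.g. T AND U = U (unknown can still falsify), F AND U = F (already decided).
```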
[1180] The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
Yang Liu, Enxi Wang, Yufei Gao, Weixin Zhang, Bo Wang, Zhiyuan Zeng, Yikai Zhang, Yining Zheng, Xipeng Qiu
Main category: cs.LG
TL;DR: MEDS: Memory-Enhanced Dynamic reward Shaping framework that uses historical behavioral signals and density-based clustering to penalize recurring error patterns in RL for LLMs, improving performance and diversity.
Details
Motivation: Reinforcement learning for large language models often suffers from reduced sampling diversity where policies repeatedly generate similar erroneous behaviors. Classical entropy regularization encourages randomness but doesn't explicitly discourage recurrent failure patterns across multiple rollouts.
Method: MEDS stores intermediate model representations to capture features of past rollouts, uses density-based clustering to identify frequently recurring error patterns, and penalizes rollouts assigned to more prevalent error clusters more heavily, encouraging broader exploration while reducing repeated mistakes.
Result: Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses show MEDS increases behavioral diversity during sampling.
Conclusion: MEDS effectively addresses the diversity problem in RL for LLMs by incorporating historical behavioral signals into reward design, leading to improved performance and increased sampling diversity.
Abstract: Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.
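The reward-shaping step is straightforward to sketch: cluster the features of failed rollouts and penalize membership in large (i.e., recurrent) clusters. The DBSCAN parameters, penalty scale beta, and feature source are illustrative assumptions, not MEDS's exact design:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def shaped_rewards(rewards, feats, failed_mask, beta=0.5):
    rewards = np.asarray(rewards, dtype=float)
    fail_idx = np.where(failed_mask)[0]
    if len(fail_idx) < 2:
        return rewards
    labels = DBSCAN(eps=0.8, min_samples=2).fit_predict(feats[fail_idx])
    sizes = {l: np.sum(labels == l) for l in set(labels) if l != -1}
    n = len(fail_idx)
    for i, l in zip(fail_idx, labels):
        if l != -1:  # -1 = noise: a novel mistake, not a recurring pattern
            rewards[i] -= beta * sizes[l] / n  # prevalent errors cost more
    return rewards

rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.1, (6, 8)),   # 6 rollouts, same error mode
                   rng.normal(5, 0.1, (2, 8))])  # 2 rollouts, rarer mode
r = shaped_rewards(np.zeros(8), feats, failed_mask=np.ones(8, bool))
print(np.round(r, 3))  # the recurrent cluster receives the larger penalty
```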
[1181] Beyond Fixed False Discovery Rates: Post-Hoc Conformal Selection with E-Variables
Meiyi Zhu, Osvaldo Simeone
Main category: cs.LG
TL;DR: Post-hoc conformal selection (PH-CS) extends conformal selection to allow users to adaptively choose selection sets after seeing data, providing data-driven FDP estimates and maintaining FDR control.
Details
Motivation: Existing conformal selection methods require fixing FDR levels before seeing data, preventing adaptation to downstream needs based on observed evidence strength and available resources. Researchers often need to decide how aggressively to pursue candidates based on data.
Method: PH-CS generates a path of candidate selection sets with data-driven false discovery proportion (FDP) estimates, building on conformal e-variables and the e-Benjamini-Hochberg procedure. Users can select any operating point by maximizing user-specified utility.
Result: PH-CS provides finite-sample post-hoc reliability guarantees where the ratio between estimated FDP and true FDP is upper bounded by 1 on average. Experiments show PH-CS can satisfy user-imposed utility constraints while producing reliable FDP estimates and maintaining competitive FDR control.
Conclusion: PH-CS addresses limitations of existing conformal selection by enabling adaptive selection based on observed data while maintaining statistical guarantees, making it more flexible for practical applications.
Abstract: Conformal selection (CS) uses calibration data to identify test inputs whose unobserved outcomes are likely to satisfy a pre-specified minimal quality requirement, while controlling the false discovery rate (FDR). Existing methods fix the target FDR level before observing data, which prevents the user from adapting the balance between number of selected test inputs and FDR to downstream needs and constraints based on the available data. For example, in genomics or neuroimaging, researchers often inspect the distribution of test statistics, and decide how aggressively to pursue candidates based on observed evidence strength and available follow-up resources. To address this limitation, we introduce post-hoc CS (PH-CS), which generates a path of candidate selection sets, each paired with a data-driven false discovery proportion (FDP) estimate. PH-CS lets the user select any operating point on this path by maximizing a user-specified utility, arbitrarily balancing selection size and FDR. Building on conformal e-variables and the e-Benjamini-Hochberg (e-BH) procedure, PH-CS is proved to provide a finite-sample post-hoc reliability guarantee whereby the ratio between estimated FDP level and true FDP is, on average, upper bounded by $1$, so that the average estimated FDP is, to first order, a valid upper bound on the true FDR. PH-CS is extended to control quality defined in terms of a general risk. Experiments on synthetic and real-world datasets demonstrate that, unlike CS, PH-CS can consistently satisfy user-imposed utility constraints while producing reliable FDP estimates and maintaining competitive FDR control.
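One simple instantiation of such a path: rank hypotheses by e-values, attach the data-driven estimate n/(k·e_(k)) to the top-k set, and note that e-BH at level α corresponds to the largest k whose estimate is at most α. A sketch under those assumptions (the paper's construction is more general):

```python
import numpy as np

def selection_path(e_values):
    # For the top-k set by e-value, estimate FDP as n / (k * e_(k)).
    e = np.sort(np.asarray(e_values, float))[::-1]
    n = len(e)
    ks = np.arange(1, n + 1)
    return ks, n / (ks * e)

e_vals = [40.0, 25.0, 12.0, 6.0, 1.5, 0.8, 0.4, 0.2]
ks, fdp = selection_path(e_vals)
for k, f in zip(ks, fdp):
    print(f"select top-{k}: estimated FDP {f:.3f}")
# The user can now pick any k post hoc, e.g. the largest selection whose
# estimated FDP stays under a budget chosen after seeing the path.
```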
[1182] Learning Discrete Diffusion of Graphs via Free-Energy Gradient Flows
Dario Rancati, Jan Maas, Francesco Locatello
Main category: cs.LG
TL;DR: The paper proposes a novel computational framework for discrete diffusion models using gradient flows with a specialized metric on probability simplex, enabling functional recovery from diffusion dynamics.
Details
Motivation: While continuous-space diffusion models have strong theoretical foundations via Wasserstein-2 gradient flows, discrete-space diffusion models lack parallel theoretical frameworks due to challenges in translating W₂ distance to discrete settings.
Method: Introduces a metric W_K on probability simplex to interpret discrete diffusion paths as gradient flows, then learns diffusion dynamics by recovering underlying functionals using first-order optimality conditions of JKO scheme with quadratic loss optimization.
Result: Method trains extremely fast without requiring individual sample trajectories, only needs numerical preprocessing of W_K-geodesics, and successfully recovers underlying functionals across various graph classes in synthetic experiments.
Conclusion: Provides first computational approach bridging gradient flow theory to discrete diffusion models, offering efficient functional recovery and theoretical foundation for discrete-space diffusion processes.
Abstract: Diffusion-based models on continuous spaces have seen substantial recent progress through the mathematical framework of gradient flows, leveraging the Wasserstein-2 (${W}_2$) metric via the Jordan-Kinderlehrer-Otto (JKO) scheme. Despite the increasing popularity of diffusion models on discrete spaces using continuous-time Markov chains, a parallel theoretical framework based on gradient flows has remained elusive due to intrinsic challenges in translating the ${W}_2$ distance directly into these settings. In this work, we propose the first computational approach addressing these challenges, leveraging an appropriate metric $W_K$ on the simplex of probability distributions, which enables us to interpret widely used discrete diffusion paths, such as the discrete heat equation, as gradient flows of specific free-energy functionals. Through this theoretical insight, we introduce a novel methodology for learning diffusion dynamics over discrete spaces, which recovers the underlying functional directly by leveraging first-order optimality conditions for the JKO scheme. The resulting method optimizes a simple quadratic loss, trains extremely fast, does not require individual sample trajectories, and only needs a numerical preprocessing computing $W_K$-geodesics. We validate our method through extensive numerical experiments on synthetic data, showing that we can recover the underlying functional for a variety of graph classes.
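For reference, the discrete-time scheme the method inverts is the JKO step, with the paper's simplex metric $W_K$ in place of $W_2$ (free-energy functional $\mathcal{F}$, step size $\tau$):

```latex
\rho_{k+1} \;\in\; \operatorname*{arg\,min}_{\rho}\;
  \mathcal{F}(\rho) \;+\; \frac{1}{2\tau}\, W_K(\rho, \rho_k)^2
```

Learning then amounts to finding an $\mathcal{F}$ whose first-order optimality conditions are satisfied by the observed diffusion path, which is what reduces training to a simple quadratic loss.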
[1183] S$^3$: Structured Sparsity Specification
Ayoub Ghriss
Main category: cs.LG
TL;DR: S³ is an algebraic framework for defining, composing, and implementing structured sparse patterns in neural networks through three components: View (tensor reshaping), Block (atomic pruning unit), and Scope (sparsity decision), with support for coordinated sparsification across tensors.
Details
Motivation: Existing structured sparsity methods lack a unified algebraic framework for precise specification, composition, and implementation of diverse sparsity patterns, making it difficult to systematically explore and optimize structured pruning approaches across different neural network architectures.
Method: S³ defines structured sparsity through three algebraic components: View (reshapes tensors via layout composition), Block (defines atomic pruning units), and Scope (makes sparsity decisions). The framework supports Coupling across tensors for coordinated sparsification and integrates with Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS) pruning algorithms.
Result: The framework enables precise specification of diverse sparsity structures from fine-grained N:M patterns to coarse channel pruning. Experimental validation shows that structured OBS and OBD implementations built on S³ surpass well-established second-order heuristics on output reconstruction across common configurations.
Conclusion: S³ provides a mathematically formalized, expressive framework for structured sparsity that enables systematic exploration and implementation of diverse pruning patterns, improving upon existing second-order pruning methods through better structured sparsity specification.
Abstract: We introduce the Structured Sparsity Specification (S$^3$), an algebraic framework for defining, composing, and implementing structured sparse patterns. S$^3$ specifies sparsity through three components: a View that reshapes the tensor via layout composition, a Block specification that defines the atomic pruning unit, and the sparsity decision Scope. Both Block and Scope support Coupling across tensors for coordinated sparsification. S$^3$ enables precise specification of diverse sparsity structures, from fine-grained N:M patterns to coarse channel pruning, and integrates seamlessly with Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS). We formalize the framework mathematically, demonstrate its expressiveness on canonical patterns, and validate it experimentally via structured OBS and OBD implementations built entirely on S$^3$, which surpass well-established second-order heuristics on output reconstruction across common configurations.
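A plain-Python reading of the three components, applied to the 2:4 pattern: View reshapes, Block fixes the atomic unit, Scope decides which blocks survive. The field names and magnitude saliency below are illustrative assumptions, not the paper's API:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Spec:
    view: tuple   # View: reshape the tensor into (n_groups, group_size)
    block: int    # Block: atomic pruning unit within a group
    keep: int     # Scope decision: blocks kept per group

def apply_s3(w, spec):
    g = w.reshape(spec.view)                        # View
    blocks = g.reshape(g.shape[0], -1, spec.block)  # Block partition
    score = np.abs(blocks).sum(-1)                  # magnitude saliency
    order = np.argsort(-score, axis=1)
    mask = np.zeros_like(score, dtype=bool)
    np.put_along_axis(mask, order[:, :spec.keep], True, axis=1)  # Scope
    return (blocks * mask[..., None]).reshape(w.shape)

w = np.random.default_rng(0).normal(size=(8, 8))
# 2:4 pattern: groups of 4 scalars, blocks of 1, keep the top 2 per group.
sparse = apply_s3(w, Spec(view=(-1, 4), block=1, keep=2))
print((sparse == 0).mean())  # -> 0.5
```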
[1184] Active Bayesian Inference for Robust Control under Sensor False Data Injection Attacks
Axel Andersson, György Dán
Main category: cs.LG
TL;DR: Framework for detecting and recovering from sensor attacks in cyber-physical systems using bipartite graph modeling, Bayesian inference, and active probing strategies.
Details
Motivation: Modern cyber-physical systems with complex perception pipelines are vulnerable to sensor attacks, creating a need for integrated detection and recovery frameworks that can maintain reliable state estimation despite compromised sensors.
Method: Models perception pipelines as bipartite graphs, uses anomaly detector alerts to define Bayesian networks for inferring compromised sensors, implements active probing strategies exploiting system nonlinearities to distinguish attack hypotheses, and selectively disables compromised sensors.
Result: The method significantly outperforms outlier-robust and prediction-based baselines in experiments on an inverted pendulum under single and multi-sensor attacks, especially effective under prolonged attacks.
Conclusion: The proposed framework successfully bridges sensor attack detection and recovery, maintaining reliable state estimation through intelligent sensor management and active probing strategies.
Abstract: We present a framework for bridging the gap between sensor attack detection and recovery in cyber-physical systems. The proposed framework models modern-day, complex perception pipelines as bipartite graphs, which combined with anomaly detector alerts defines a Bayesian network for inferring compromised sensors. An active probing strategy exploits system nonlinearities to maximize distinguishability between attack hypotheses, while compromised sensors are selectively disabled to maintain reliable state estimation. We propose a threshold-based probing strategy and show its effectiveness via a simplified partially observable Markov decision process (POMDP) formulation. Experiments on an inverted pendulum under single and multi-sensor attacks show that our method significantly outperforms outlier-robust and prediction-based baselines, especially under prolonged attacks.
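The inference step is ordinary Bayesian updating over attack hypotheses. A sketch with assumed detection and false-alarm rates for the anomaly detectors; the paper's Bayesian network additionally encodes the perception pipeline's bipartite structure:

```python
import numpy as np
from itertools import product

# P(alert | sensor compromised?) -- illustrative detector rates.
P_ALERT = {True: 0.9, False: 0.05}

def posterior(alerts, n_sensors, prior_attack=0.1):
    hyps, post = [], []
    for h in product([False, True], repeat=n_sensors):
        p = np.prod([prior_attack if c else 1 - prior_attack for c in h])
        for a, c in zip(alerts, h):
            p *= P_ALERT[c] if a else 1 - P_ALERT[c]
        hyps.append(h)
        post.append(p)
    post = np.array(post) / np.sum(post)
    return sorted(zip(post, hyps), reverse=True)

# Sensors 0 and 2 raise alerts; sensor 1 stays quiet.
for p, h in posterior([True, False, True], n_sensors=3)[:3]:
    print(f"P={p:.3f}  compromised={h}")
```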
[1185] Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning
Ajinkya Mohgaonkar, Lukas Gosch, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar, Stephan Günnemann
Main category: cs.LG
TL;DR: EnsembleCert is a white-box certification framework for partition-aggregation ensembles that provides tighter robustness guarantees against label-flipping attacks by leveraging internal knowledge of base classifiers, outperforming black-box approaches with fewer partitions.
Details
Motivation: Label-flipping attacks pose serious threats to supervised learning models by corrupting training labels. Existing certification methods use black-box ensemble techniques that yield overly conservative guarantees, creating a need for more precise white-box certification approaches.
Method: Developed EnsembleCert framework that aggregates per-partition white-box certificates to compute ensemble-level guarantees. Introduced ScaLabelCert method that leverages neural tangent kernel theory to extract exact, polynomial-time calculable certificates from wide neural networks by treating them as kernel methods.
Result: EnsembleCert significantly outperforms existing black-box certificates, certifying up to +26.5% more label flips on CIFAR-10 in median over test set while requiring 100 times fewer partitions, challenging the notion that heavy partitioning is necessary for strong certified robustness.
Conclusion: White-box certification using internal knowledge of base classifiers provides substantially tighter robustness guarantees against label-flipping attacks than black-box approaches, enabling stronger certified robustness with far fewer partitions.
Abstract: Label-flipping attacks, which corrupt training labels to induce misclassifications at inference, remain a major threat to supervised learning models. This drives the need for robustness certificates that provide formal guarantees about a model’s robustness under adversarially corrupted labels. Existing certification frameworks rely on ensemble techniques such as smoothing or partition-aggregation, but treat the corresponding base classifiers as black boxes, yielding overly conservative guarantees. We introduce EnsembleCert, the first certification framework for partition-aggregation ensembles that utilizes white-box knowledge of the base classifiers. Concretely, EnsembleCert yields tighter guarantees than black-box approaches by aggregating per-partition white-box certificates to compute ensemble-level guarantees in polynomial time. To extract white-box knowledge from the base classifiers efficiently, we develop ScaLabelCert, a method that leverages the equivalence between sufficiently wide neural networks and kernel methods using the neural tangent kernel. ScaLabelCert yields the first exact, polynomial-time calculable certificate for neural networks against label-flipping attacks. EnsembleCert is either on par with, or significantly outperforms, existing partition-based black-box certificates. For example, on CIFAR-10, our method can certify up to +26.5% more label flips in median over the test set compared to the existing black-box approach while requiring 100 times fewer partitions, thus challenging the prevailing notion that heavy partitioning is a necessity for strong certified robustness.
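A toy calculation shows why white-box knowledge tightens the certificate for a majority-vote partition ensemble: a black-box analysis must assume one poisoned label suffices to flip any base classifier, while per-partition certificates r_i price each flip individually. The vote counts and budgets below are made up for illustration and simplify away abstentions and multi-class effects:

```python
import numpy as np

def certified_flips(votes_for_pred, votes_against, r):
    # Flipping f of the supporting base votes changes the margin m by 2f;
    # the ensemble prediction changes once m - 2f < 0.
    m = votes_for_pred - votes_against
    g = m // 2 + 1                              # base classifiers to subvert
    black_box = g - 1                           # 1 label per classifier, worst case
    white_box = int(np.sort(r)[:g].sum()) - 1   # cheapest g partitions by budget
    return black_box, white_box

r = np.array([3, 5, 2, 7, 4, 6, 3, 8])  # flip budgets of the 8 "for" partitions
bb, wb = certified_flips(votes_for_pred=8, votes_against=2, r=r)
print(f"black-box certificate: robust to {bb} label flips")
print(f"white-box certificate: robust to {wb} label flips")
```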
[1186] Emulating Non-Differentiable Metrics via Knowledge-Guided Learning: Introducing the Minkowski Image Loss
Filippo Quarenghi, Ryan Cotsakis, Tom Beucler
Main category: cs.LG
TL;DR: A framework to bridge the “differentiability gap” in Earth system deep learning by developing differentiable approximations for non-differentiable scientific metrics through analytical relaxation and neural emulation with Lipschitz constraints.
Details
Motivation: The "differentiability gap" prevents Earth system deep learning models from being trained directly on non-differentiable scientific metrics, forcing reliance on smooth proxies like MSE that yield blurry outputs lacking high-frequency details.Method: Two approaches: 1) Analytical approximation using temperature-controlled sigmoids and continuous logical operators to relax discrete topological operations; 2) Neural emulator using Lipschitz-convolutional networks with spectral normalization and hard architectural constraints to stabilize gradient learning.
Result: Developed Minkowski image loss as differentiable equivalent for integral-geometric measures of precipitation fields. Constrained neural surrogate achieved high emulation accuracy on EUMETNET OPERA dataset, eliminating geometric violations seen in unconstrained baselines.
Conclusion: While Lipschitz regularization ensures optimization stability, it over-smooths gradient signals, limiting recovery of localized convective textures. Highlights need to couple topological constraints with stochastic generative architectures for full morphological realism.
Abstract: The “differentiability gap” presents a primary bottleneck in Earth system deep learning: since models cannot be trained directly on non-differentiable scientific metrics and must rely on smooth proxies (e.g., MSE), they often fail to capture high-frequency details, yielding “blurry” outputs. We develop a framework that bridges this gap using two different methods to deal with non-differentiable functions: the first is to analytically approximate the original non-differentiable function into a differentiable equivalent one; the second is to learn differentiable surrogates for scientific functionals. We formulate the analytical approximation by relaxing discrete topological operations using temperature-controlled sigmoids and continuous logical operators. Conversely, our neural emulator uses Lipschitz-convolutional neural networks to stabilize gradient learning via: (1) spectral normalization to bound the Lipschitz constant; and (2) hard architectural constraints enforcing geometric principles. We demonstrate this framework’s utility by developing the Minkowski image loss, a differentiable equivalent for the integral-geometric measures of surface precipitation fields (area, perimeter, connected components). Validated on the EUMETNET OPERA dataset, our constrained neural surrogate achieves high emulation accuracy, completely eliminating the geometric violations observed in unconstrained baselines. However, applying these differentiable surrogates to a deterministic super-resolution task reveals a fundamental trade-off: while strict Lipschitz regularization ensures optimization stability, it inherently over-smooths gradient signals, restricting the recovery of highly localized convective textures. This work highlights the necessity of coupling such topological constraints with stochastic generative architectures to achieve full morphological realism.
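The analytical relaxation can be sketched directly: a temperature-controlled sigmoid turns the excursion set into a soft mask, from which a soft area and a total-variation perimeter proxy follow, both differentiable. The threshold, temperature, and the particular perimeter proxy are illustrative choices, not the paper's exact functionals:

```python
import torch

def soft_minkowski(field, thresh=1.0, tau=0.1):
    # Soft excursion set: a relaxed version of (field > thresh).
    m = torch.sigmoid((field - thresh) / tau)
    area = m.sum()
    # Perimeter proxy: L1 total variation of the soft mask.
    perim = (m[:, 1:] - m[:, :-1]).abs().sum() + \
            (m[1:, :] - m[:-1, :]).abs().sum()
    return area, perim

field = (torch.randn(64, 64) * 2).requires_grad_()  # toy precipitation field
area, perim = soft_minkowski(field)
(area + perim).backward()  # gradients flow, so this is usable as a loss term
```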
[1187] Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration
Zhipeng Chen, Tao Qian, Wayne Xin Zhao, Ji-Rong Wen
Main category: cs.LG
TL;DR: NExt is a framework that uses nonlinear extrapolation of low-rank parameter trajectories to accelerate RLVR training for LLMs, reducing computational overhead by 37.5% while maintaining compatibility with various RLVR algorithms.
Details
Motivation: RLVR training for LLMs requires extensive exploration and learning, leading to substantial computational overhead. Prior linear extrapolation methods are insufficient because model parameter dynamics during RLVR training are not well understood, particularly how low-rank subspaces evolve nonlinearly.
Method: Train model using LoRA, extract rank-1 subspace of parameter differences at multiple training steps, train a predictor to model parameter update trajectories, then perform predict-extend process for nonlinear extrapolation to accelerate RLVR training.
Result: NExt reduces computational overhead by approximately 37.5% while remaining compatible with a wide range of RLVR algorithms and tasks, demonstrating effectiveness and robustness through comprehensive experiments.
Conclusion: Nonlinear extrapolation of low-rank parameter trajectories is an effective approach to accelerate RLVR training for LLMs, offering significant computational savings while maintaining compatibility with existing RLVR methods.
Abstract: Recently, scaling reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) has emerged as an effective training paradigm for significantly improving model capabilities, which requires guiding the model to perform extensive exploration and learning, leading to substantial computational overhead and becoming a key challenge. To reduce the number of training steps, prior work performs linear extrapolation of model parameters. However, the dynamics of model parameter updates during RLVR training remain insufficiently understood. To further investigate the evolution of LLMs during RLVR training, we conduct empirical experiments and find that the rank-1 subspace of the model does not evolve linearly, and its dominance over the original parameters is further amplified during LoRA training. Based on the above insights, we propose the Nonlinear Extrapolation of low-rank trajectories (NExt), a novel framework that models and extrapolates low-rank parameter trajectories in a nonlinear manner. Concretely, we first train the model using LoRA and extract the rank-1 subspace of parameter differences at multiple training steps, which is then used for the subsequent nonlinear extrapolation. Afterward, we utilize the extracted rank-1 subspace to train a predictor, which can model the trajectory of parameter updates during RLVR, and then perform the predict-extend process to extrapolate model parameters, accelerating RLVR. To further study and understand NExt, we conduct comprehensive experiments that demonstrate the effectiveness and robustness of the method. Our method reduces computational overhead by approximately 37.5% while remaining compatible with a wide range of RLVR algorithms and tasks. We release our code at https://github.com/RUCAIBox/NExt.
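The predict-extend idea in miniature: track the top singular component of parameter deltas across checkpoints, fit a nonlinear model to its growth, and jump ahead instead of training the intermediate steps. The toy trajectory and the quadratic predictor are stand-ins for the paper's learned predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.normal(size=(32, 32))
u = rng.normal(size=32); u /= np.linalg.norm(u)
v = rng.normal(size=32); v /= np.linalg.norm(v)

# Toy "training": deltas grow nonlinearly along a fixed rank-1 direction.
steps = np.arange(1, 6)
sigmas = []
for t in steps:
    delta = 0.5 * t**1.5 * np.outer(u, v)
    sigmas.append(np.linalg.svd(delta, compute_uv=False)[0])  # top singular value

coef = np.polyfit(steps, sigmas, deg=2)    # nonlinear trajectory model
t_next = 6
sigma_pred = np.polyval(coef, t_next)
W_pred = W0 + sigma_pred * np.outer(u, v)  # extrapolated parameters
print(f"predicted sigma {sigma_pred:.3f} vs true {0.5 * t_next**1.5:.3f}")
```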
[1188] Learning How Much to Think: Difficulty-Aware Dynamic MoEs for Graph Node Classification
Jiajun Zhou, Yadong Li, Xuanze Chen, Chen Ma, Chuang Zhao, Shanqing Yu, Qi Xuan
Main category: cs.LG
TL;DR: D2MoE: A difficulty-driven mixture-of-experts framework for graph neural networks that adaptively allocates expert resources based on node difficulty for more efficient and accurate node classification.
Details
Motivation: Current MoE architectures for GNNs use static routing strategies with uniform expert budgets, which overlook varying node discriminative difficulty, leading to under-fitting for hard nodes and redundant computation for easy nodes.
Method: Proposes D2MoE with difficulty-driven top-p routing using predictive entropy as real-time difficulty proxy, adaptively concentrating expert resources on hard nodes while reducing overhead for easy ones, enabling continuous fine-grained expert budget scaling.
Result: Achieves state-of-the-art performance on 13 benchmarks, surpassing leading baselines by up to 7.92% accuracy on heterophilous graphs, reduces memory consumption by up to 73.07% and training time by 46.53% compared to best-performing Graph MoE on large-scale graphs.
Conclusion: D2MoE effectively addresses limitations of static MoE routing in GNNs through adaptive difficulty-driven resource allocation, achieving superior efficiency and performance for node classification tasks.
Abstract: Mixture-of-Experts (MoE) architectures offer a scalable path for Graph Neural Networks (GNNs) in node classification tasks but typically rely on static and rigid routing strategies that enforce a uniform expert budget or coarse-grained expert toggles on all nodes. This limitation overlooks the varying discriminative difficulty of nodes and leads to under-fitting for hard nodes and redundant computation for easy ones. To resolve this issue, we propose D2MoE, a novel framework that shifts the focus from static expert selection to node-wise expert resource allocation. By using predictive entropy as a real-time proxy for difficulty, D2MoE employs a difficulty-driven top-p routing mechanism to adaptively concentrate expert resources on hard nodes while reducing overhead for easy ones, achieving continuous and fine-grained expert budget scaling for node classification. Experiments on 13 benchmarks demonstrate that D2MoE achieves consistent state-of-the-art performance, surpassing leading baselines by up to 7.92% in accuracy on heterophilous graphs. Notably, on large-scale graphs, it reduces memory consumption by up to 73.07% and training time by 46.53% compared to the best-performing Graph MoE, thereby validating its superior efficiency.
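Difficulty-driven top-p routing reduces to two steps: map predictive entropy to a per-node probability budget, then take experts in gate order until the budget is covered. The linear entropy-to-p schedule below is an assumption for illustration:

```python
import torch

def top_p_route(gate_logits, pred_logits, p_min=0.3, p_max=0.9):
    # Difficulty proxy: normalized predictive entropy in [0, 1].
    ent = torch.distributions.Categorical(logits=pred_logits).entropy()
    ent = ent / torch.log(torch.tensor(float(pred_logits.shape[-1])))
    p = p_min + (p_max - p_min) * ent          # harder node -> larger budget
    gates = torch.softmax(gate_logits, dim=-1)
    sorted_g, idx = gates.sort(dim=-1, descending=True)
    cum = sorted_g.cumsum(-1)
    keep = (cum - sorted_g) < p.unsqueeze(-1)  # smallest prefix covering p
    mask = torch.zeros_like(gates).scatter(-1, idx, keep.float()) > 0
    return mask                                # experts activated per node

gate_logits = torch.randn(5, 8)  # 5 nodes, 8 experts
pred_logits = torch.randn(5, 3)  # current class logits (3 classes)
print(top_p_route(gate_logits, pred_logits).sum(-1))  # experts per node
```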
[1189] Structural Consequences of Policy-Based Interventions on the Global Supply Chain Network
Lea Karbevska, Liming Xu, Zehui Dai, Sara AlMahri, Alexandra Brintrup
Main category: cs.LG
TL;DR: Analysis of how three trade policies (Country Plus One, Friendshoring, Reshoring) impact global EV supply chain networks, showing unexpected globalization effects and varying industry impacts.
Details
Motivation: Global political tensions, potential US tariffs, COVID-19 disruptions, and the war in Ukraine have highlighted economic independence and supply chain resilience needs, prompting governments to adopt new trade policies that need evaluation.
Method: Study analyzes impact of three key trade policies on global EV supply chain network structure, examining effects on country clusters and international trade patterns.
Result: Friendshoring unexpectedly increases globalization by creating more supply links across friendly countries; Country Plus One increases network density through redundant links; Reshoring faces challenges in EV sector due to irreplaceable products; policy effects vary by industry (e.g., mining less affected).
Conclusion: Trade policies have complex, sometimes counterintuitive effects on supply chains, with Friendshoring potentially increasing globalization rather than reducing it, and industry-specific factors significantly influence policy outcomes.
Abstract: As global political tensions rise and the anticipation of additional tariffs from the United States on international trade increases, the issues of economic independence and supply chain resilience become more prominent. The importance of supply chain resilience has been further underscored by disruptions caused by the COVID-19 pandemic and the ongoing war in Ukraine. In light of these challenges, ranging from geopolitical instability to product supply uncertainties, governments are increasingly focused on adopting new trade policies. This study explores the impact of several of these policies on the global electric vehicle (EV) supply chain network, with a particular focus on their effects on country clusters and the broader structure of international trade. Specifically, we analyse three key policies: Country Plus One, Friendshoring, and Reshoring. Our findings show that Friendshoring, contrary to expectations, leads to greater globalisation by increasing the number of supply links across friendly countries, potentially raising transaction costs. The Country Plus One policy similarly enhances network density through redundant links, while the Reshoring policy creates challenges in the EV sector due to the high number of irreplaceable products. Additionally, the effects of these policies vary across industries; for instance, mining goods are less affected under the Country Plus One policy than under Friendshoring.
[1190] CAGenMol: Condition-Aware Diffusion Language Model for Goal-Directed Molecular Generation
Yanting Li, Zhuoyang Jiang, Enyan Dai, Lei Wang, Wen-Cai Ye, Li Liu
Main category: cs.LG
TL;DR: CAGenMol: A condition-aware discrete diffusion framework for molecular generation that optimizes heterogeneous constraints like protein-ligand compatibility and drug-like properties through combined diffusion and reinforcement learning.
Details
Motivation: Existing molecular generation methods fail to reconcile conflicting objectives (e.g., affinity vs. safety) and struggle to navigate non-differentiable chemical space while maintaining structural validity. There's a need for a framework that can optimize multiple heterogeneous constraints simultaneously.
Method: Proposes CAGenMol, a condition-aware discrete diffusion framework over molecular sequences that formulates molecular design as conditional denoising guided by heterogeneous structural and property signals. Combines discrete diffusion with reinforcement learning to align generation with non-differentiable objectives while preserving chemical validity and diversity. Uses non-autoregressive diffusion language model enabling iterative refinement of molecular fragments.
Result: Experiments on structure-conditioned, property-conditioned, and dual-conditioned benchmarks show consistent improvements over state-of-the-art methods in binding affinity, drug-likeness, and success rate.
Conclusion: CAGenMol effectively addresses the challenge of optimizing heterogeneous constraints in molecular generation through its condition-aware discrete diffusion framework coupled with reinforcement learning, demonstrating superior performance across multiple benchmarks.
Abstract: Goal-directed molecular generation requires satisfying heterogeneous constraints such as protein–ligand compatibility and multi-objective drug-like properties, yet existing methods often optimize these constraints in isolation, failing to reconcile conflicting objectives (e.g., affinity vs. safety), and struggle to navigate the non-differentiable chemical space without compromising structural validity. To address these challenges, we propose CAGenMol, a condition-aware discrete diffusion framework over molecular sequences that formulates molecular design as conditional denoising guided by heterogeneous structural and property signals. By coupling discrete diffusion with reinforcement learning, the model aligns the generation trajectory with non-differentiable objectives while preserving chemical validity and diversity. The non-autoregressive nature of diffusion language model further enables iterative refinement of molecular fragments at inference time. Experiments on structure-conditioned, property-conditioned, and dual-conditioned benchmarks demonstrate consistent improvements over state-of-the-art methods in binding affinity, drug-likeness, and success rate, highlighting the effectiveness of our framework.
[1191] Quantization Dominates Rank Reduction for KV-Cache Compression
Samuel Salfati
Main category: cs.LG
TL;DR: Quantization consistently outperforms rank reduction for KV cache compression in transformers, with INT4 matching FP16 accuracy while rank reduction causes catastrophic failures due to structural asymmetry in softmax attention routing.
Details
Motivation: The paper addresses the problem of compressing the KV cache in transformer inference to reduce memory requirements. Two main strategies exist: rank reduction (discarding dimensions) and quantization (reducing precision while keeping all dimensions). The authors aim to systematically compare these approaches to determine which is more effective for maintaining model performance while reducing storage.
Method: The authors conduct experiments across five transformer models ranging from 124M to 14B parameters, including both MHA and GQA architectures. They compare rank reduction and quantization at matched storage budgets, evaluating performance using perplexity (PPL) and LAMBADA accuracy. They analyze the structural reasons for performance differences through perturbation analysis under the softmax Fisher metric and conduct basis ablation studies to verify findings are basis-independent.
Result: Quantization consistently outperforms rank reduction by 4-364 PPL across all models and compression levels. INT4 quantization matches FP16 accuracy on LAMBADA (+0.23 PPL on Mistral 7B, +0.58 on GPT-2), while rank-32 at identical storage collapses to 0.4% accuracy. The gap persists even when combining rank reduction with quantization, and grows with GQA aggressiveness. Joint K+V INT4 quantization achieves 75% total KV reduction with only +0.18 PPL degradation on Mistral 7B.
Conclusion: Quantization is fundamentally superior to rank reduction for KV cache compression due to structural asymmetry in softmax attention: removing dimensions can flip attention tokens (discrete failure), while quantization noise is bounded and preserves score ordering. The advantage comes from preserving dimensions rather than finding better coordinate systems, as confirmed by basis-independent results.
Abstract: We compare two strategies for compressing the KV cache in transformer inference: rank reduction (discard dimensions) and quantization (keep all dimensions, reduce precision). At matched storage budgets across five models (124M-14B, MHA and GQA), we find that quantization consistently outperforms rank reduction by 4-364 PPL depending on model and compression level. The gap persists even when rank reduction is combined with quantization in hybrid baselines, and it grows with GQA aggressiveness. On LAMBADA, INT4 matches FP16 accuracy (+0.23 PPL on Mistral 7B, +0.58 on GPT-2) while rank-32 at identical storage collapses to 0.4%. We trace this gap to a structural asymmetry: under softmax attention routing, removing a dimension can flip which token is attended (a discrete failure), while quantization noise is bounded and typically preserves score ordering. We formalize this via a perturbation result showing projection damage exceeds quantization damage by 3 x 2^(2b) per direction under the softmax Fisher metric. A basis ablation confirms the finding is basis-independent (spread <0.4 PPL), establishing that the advantage comes from preserving dimensions, not from a better coordinate system. Joint K+V INT4 quantization achieves 75% total KV reduction at only +0.18 PPL on Mistral 7B.
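The storage-matching arithmetic behind the comparison is easy to reproduce: at head dimension 128, INT4 (4 bits per entry) costs the same as keeping 32 dimensions in FP16, which is exactly the rank-32 regime the abstract cites. Below is a minimal sketch of that comparison on synthetic data; the fake-quantization scheme and the random test matrix are illustrative assumptions rather than the paper's setup, and reconstruction MSE is only a crude proxy for the attention-flip failure mode the authors analyze.

```python
# Toy comparison of KV-cache compression: simulated INT4 quantization vs.
# rank reduction at a matched storage budget.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1024, 128                      # tokens x head dimension
K = rng.standard_normal((n, d)).astype(np.float32)

# (a) Simulated INT4 symmetric quantization with per-column scales.
#     Storage: n * d * 4 bits.
scale = np.abs(K).max(axis=0) / 7.0   # symmetric int4 range [-7, 7]
K_q = np.round(K / scale).clip(-7, 7) * scale

# (b) Rank reduction kept in FP16. Matching n*r*16 = n*d*4 bits gives
#     r = d // 4 = 32 (ignoring the shared r x d basis, which amortizes
#     across tokens), the same regime as the paper's rank-32 baseline.
r = d // 4
U, S, Vt = np.linalg.svd(K, full_matrices=False)
K_lr = (U[:, :r] * S[:r]) @ Vt[:r]

print("quantization MSE:", np.mean((K - K_q) ** 2))
print(f"rank-{r} MSE:", np.mean((K - K_lr) ** 2))
```

On isotropic random data rank reduction is maximally penalized (the spectrum is flat, so three quarters of the variance is discarded); real caches have more spectral decay, but the paper reports the same ordering on actual models.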
[1192] Not All Forgetting Is Equal: Architecture-Dependent Retention Dynamics in Fine-Tuned Image Classifiers
Miit Daga, Swarna Priya Ramu
Main category: cs.LG
TL;DR: Analysis of forgetting patterns during fine-tuning of CNNs and ViTs reveals architecture-dependent forgetting, stochastic per-sample forgetting across runs, but consistent class-level forgetting patterns.
Details
Motivation: Understanding which samples are forgotten during fine-tuning of pretrained image classifiers, and whether forgetting patterns are stable or architecture dependent, has implications for curriculum design, data pruning, and ensemble construction.
Method: Track per-sample correctness at every epoch during fine-tuning of ResNet-18 and DeiT-Small on retinal OCT dataset and CUB-200-2011, fitting Ebbinghaus-style exponential decay curves to each sample’s retention trace.
Result: 1) CNNs and ViTs forget fundamentally different samples; 2) ViT forgetting is more structured; 3) Per-sample forgetting is stochastic across random seeds; 4) Class-level forgetting is consistent and semantically interpretable; 5) Early loss predicts long-term decay constants.
Conclusion: Architectural diversity in ensembles provides complementary retention coverage, curriculum/pruning methods based on per-sample difficulty may not generalize across runs, and static scheduling cannot exploit unstable per-sample signals.
Abstract: Fine-tuning pretrained image classifiers is standard practice, yet which individual samples are forgotten during this process, and whether forgetting patterns are stable or architecture dependent, remains unclear. Understanding these dynamics has direct implications for curriculum design, data pruning, and ensemble construction. We track per-sample correctness at every epoch during fine-tuning of ResNet-18 and DeiT-Small on a retinal OCT dataset (7 classes, 56:1 imbalance) and CUB-200-2011 (200 bird species), fitting Ebbinghaus-style exponential decay curves to each sample’s retention trace. Five findings emerge. First, the two architectures forget fundamentally different samples: Jaccard overlap of the top 10 percent most-forgotten is 0.34 on OCTDL and 0.15 on CUB-200. Second, ViT forgetting is more structured (mean $R^2 = 0.74$) than CNN forgetting ($R^2 = 0.52$). Third, per-sample forgetting is stochastic across random seeds (Spearman $\rho \approx 0.01$), challenging the assumption that sample difficulty is an intrinsic property. Fourth, class-level forgetting is consistent and semantically interpretable: visually similar species are forgotten most, distinctive ones least. Fifth, a sample’s loss after head warmup predicts its long-term decay constant ($\rho = 0.30$ to $0.50$, $p < 10^{-45}$). These findings suggest that architectural diversity in ensembles provides complementary retention coverage, and that curriculum or pruning methods based on per-sample difficulty may not generalize across runs. A spaced repetition sampler built on these decay constants does not outperform random sampling, indicating that static scheduling cannot exploit unstable per-sample signals.
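To make the curve-fitting step concrete, the sketch below fits an Ebbinghaus-style exponential decay to a noisy per-sample correctness trace. The functional form (decay toward an asymptote) and the synthetic trace are assumptions for illustration, not the authors' exact protocol.

```python
# Fit an exponential retention curve R(t) = (1 - c) * exp(-lam * t) + c
# to a simulated per-epoch 0/1 correctness trace for one sample.
import numpy as np
from scipy.optimize import curve_fit

def retention(t, lam, c):
    """Decay from 1 toward an asymptote c with rate lam."""
    return (1.0 - c) * np.exp(-lam * t) + c

epochs = np.arange(50)
rng = np.random.default_rng(1)
true = retention(epochs, lam=0.15, c=0.3)
trace = (rng.random(50) < true).astype(float)             # noisy 0/1 correctness
smooth = np.convolve(trace, np.ones(5) / 5, mode="same")  # light smoothing

(lam, c), _ = curve_fit(retention, epochs, smooth, p0=[0.1, 0.5],
                        bounds=([0.0, 0.0], [5.0, 1.0]))
print(f"decay constant lam={lam:.3f}, asymptote c={c:.3f}")
```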
[1193] Generative Path-Finding Method for Wasserstein Gradient Flow
Chengyu Liu, Xiang Zhou
Main category: cs.LG
TL;DR: GenWGP is a generative framework for computing Wasserstein gradient paths using normalizing flows that learns transport from initial to equilibrium distributions by minimizing geometric action functionals.
Details
Motivation: Computing full Wasserstein gradient flow paths from arbitrary initial distributions to equilibrium is challenging in high dimensions. Eulerian methods suffer from curse of dimensionality, while Lagrangian particle/map methods don't efficiently improve with time step tuning.
Method: Proposes GenWGP framework that learns a generative flow transporting mass from initial to equilibrium distribution by minimizing path loss encoding full trajectory and terminal equilibrium. Uses normalizing flows to compute geometric curve toward equilibrium with approximately constant intrinsic speed between network layers.
Result: Experiments on Fokker-Planck and aggregation problems show GenWGP matches/exceeds high-fidelity reference solutions with only about a dozen discretization points while capturing complex dynamics.
Conclusion: GenWGP provides stable training independent of temporal/geometric discretization, avoids delicate time stepping constraints, and enables efficient computation of Wasserstein gradient paths in high dimensions.
Abstract: Wasserstein gradient flows (WGFs) describe the evolution of probability distributions in Wasserstein space as steepest descent dynamics for a free energy functional. Computing the full path from an arbitrary initial distribution to equilibrium is challenging, especially in high dimensions. Eulerian methods suffer from the curse of dimensionality, while existing Lagrangian approaches based on particles or generative maps do not naturally improve efficiency through time step tuning. We propose GenWGP, a generative path-finding framework for Wasserstein gradient paths. GenWGP learns a generative flow that transports mass from an initial density to an unknown equilibrium distribution by minimizing a path loss that encodes the full trajectory and its terminal equilibrium condition. The loss is derived from a geometric action functional motivated by Dawson-Gärtner large deviation theory for empirical distributions of interacting diffusion systems. We formulate both a finite horizon action under physical time parametrization and a reparameterization invariant geometric action based on Wasserstein arclength. Using normalizing flows, GenWGP computes a geometric curve toward equilibrium while enforcing approximately constant intrinsic speed between adjacent network layers, so that discretized distributions remain nearly equidistant in the Wasserstein metric along the path. This avoids delicate time stepping constraints and enables stable training that is largely independent of temporal or geometric discretization. Experiments on Fokker-Planck and aggregation-type problems show that GenWGP matches or exceeds high fidelity reference solutions with only about a dozen discretization points while capturing complex dynamics.
[1194] Continuous Adversarial Flow Models
Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, Haoqi Fan
Main category: cs.LG
TL;DR: Continuous adversarial flow models combine flow models with adversarial training for improved sample quality, achieving state-of-the-art results on ImageNet generation and text-to-image tasks.
Details
Motivation: Standard flow matching uses fixed mean-squared-error loss, which may not perfectly align with target data distribution. The authors propose to improve flow models by incorporating adversarial training with a learned discriminator to better guide the generation process.
Method: Introduces continuous adversarial flow models that train flow models with an adversarial objective using a learned discriminator instead of fixed MSE loss. The method can be applied as post-training to existing flow-matching models or used to train models from scratch.
Result: Substantially improves ImageNet 256px generation: guidance-free FID of SiT improves from 8.26 to 3.63, JiT from 7.17 to 3.57. Guided generation also improves: SiT FID from 2.06 to 1.53, JiT from 1.86 to 1.80. Shows improved results on text-to-image generation benchmarks (GenEval and DPG).
Conclusion: Adversarial training significantly enhances flow models’ sample quality, making them competitive with state-of-the-art generative models. The approach is effective both as post-training and from-scratch training.
Abstract: We propose continuous adversarial flow models, a type of continuous-time flow model trained with an adversarial objective. Unlike flow matching, which uses a fixed mean-squared-error criterion, our approach introduces a learned discriminator to guide training. This change in objective induces a different generalized distribution, which empirically produces samples that are better aligned with the target data distribution. Our method is primarily proposed for post-training existing flow-matching models, although it can also train models from scratch. On the ImageNet 256px generation task, our post-training substantially improves the guidance-free FID of latent-space SiT from 8.26 to 3.63 and of pixel-space JiT from 7.17 to 3.57. It also improves guided generation, reducing FID from 2.06 to 1.53 for SiT and from 1.86 to 1.80 for JiT. We further evaluate our approach on text-to-image generation, where it achieves improved results on both the GenEval and DPG benchmarks.
[1195] TempusBench: An Evaluation Framework for Time-Series Forecasting
Denizalp Goktas, Gerardo Riaño-Briceño, Alif Abdullah, Aryan Nair, Chenkai Shen, Beatriz de Lucio, Alexandra Magnusson, Farhan Mashrur, Ahmed Abdulla, Shawrna Sen, Mahitha Thippireddy, Gregory Schwartz, Amy Greenwald
Main category: cs.LG
TL;DR: TempusBench is an open-source evaluation framework for time-series foundation models that addresses four major issues in current evaluation practices by providing new datasets, novel benchmark tasks, standardized hyperparameter tuning, and visualization tools.
Details
Motivation: The field of time-series foundation models lacks a comprehensive and community-accepted evaluation framework due to four major issues: outdated datasets that overlap with pretraining corpora, narrow benchmark tasks overlooking core statistical properties, unfair comparison of domain-specific models without consistent hyperparameter tuning, and lack of visualization tools for performance interpretation.
Method: TempusBench introduces four key components: 1) new datasets not included in existing TSFM pretraining corpora, 2) novel benchmark tasks that go beyond existing ones, 3) a model evaluation pipeline with standardized hyperparameter tuning protocol, and 4) a TensorBoard-based visualization interface.
Result: The paper presents TempusBench as a comprehensive solution to the evaluation challenges in time-series foundation models, providing an open-source framework available on GitHub that addresses the identified issues and enables fair, comprehensive model comparisons.
Conclusion: TempusBench provides a much-needed standardized evaluation framework for time-series foundation models that addresses critical gaps in current evaluation practices, enabling more rigorous and fair comparisons while advancing the field through better benchmarking tools.
Abstract: Foundation models have transformed natural language processing and computer vision, and a rapidly growing literature on time-series foundation models (TSFMs) seeks to replicate this success in forecasting. While recent open-source models demonstrate the promise of TSFMs, the field lacks a comprehensive and community-accepted model evaluation framework. We see at least four major issues impeding progress on the development of such a framework. First, current evaluation frameworks consist of benchmark forecasting tasks derived from often outdated datasets (e.g., M3), many of which lack clear metadata and overlap with the corpora used to pre-train TSFMs. Second, existing frameworks evaluate models along a narrowly defined set of benchmark forecasting tasks such as forecast horizon length or domain, but overlook core statistical properties such as non-stationarity and seasonality. Third, domain-specific models (e.g., XGBoost) are often compared unfairly, as existing frameworks neglect a systematic and consistent hyperparameter tuning convention for all models. Fourth, visualization tools for interpreting comparative performance are lacking. To address these issues, we introduce TempusBench, an open-source evaluation framework for TSFMs. TempusBench consists of 1) new datasets which are not included in existing TSFM pretraining corpora, 2) a set of novel benchmark tasks that go beyond existing ones, 3) a model evaluation pipeline with a standardized hyperparameter tuning protocol, and 4) a TensorBoard-based visualization interface. We provide access to our code on GitHub: https://github.com/Smlcrm/TempusBench.
[1196] Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach
Haolin Li, Shuyang Jiang, Ruipeng Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Main category: cs.LG
TL;DR: MedSSR: A framework for enhancing medical reasoning in LLMs using knowledge-enhanced data synthesis and semi-supervised reinforcement learning, particularly effective for rare disease tasks.
Details
Motivation: Addresses the scarcity of high-quality medical reasoning data, especially for underrepresented domains like rare diseases, and reduces the high costs of generating complex reasoning chains from proprietary models.
Method: Uses rare disease knowledge to synthesize distribution-controllable reasoning questions, generates pseudo-labels with the policy model itself, and employs a two-stage training paradigm: self-supervised RL on synthetic data followed by supervised RL on human-annotated real data.
Result: Outperforms existing methods across ten medical benchmarks, achieving up to +5.93% gain on rare-disease tasks, as demonstrated on Qwen and Llama models.
Conclusion: MedSSR efficiently scales medical reasoning model training without costly trace distillation, offering an effective solution for enhancing LLM performance in medical domains, particularly for underrepresented areas like rare diseases.
Abstract: While large language models hold promise for complex medical applications, their development is hindered by the scarcity of high-quality reasoning data. To address this issue, existing approaches typically distill chain-of-thought reasoning traces from large proprietary models via supervised fine-tuning, then conduct reinforcement learning (RL). These methods exhibit limited improvement on underrepresented domains like rare diseases while incurring substantial costs from generating complex reasoning chains. To efficiently enhance medical reasoning, we propose MedSSR, a Medical Knowledge-enhanced data Synthesis and Semi-supervised Reinforcement learning framework. Our framework first employs rare disease knowledge to synthesize distribution-controllable reasoning questions. We then utilize the policy model itself to generate high-quality pseudo-labels. This enables a two-stage, intrinsic-to-extrinsic training paradigm: self-supervised RL on the pseudo-labeled synthetic data, followed by supervised RL on the human-annotated real data. MedSSR scales model training efficiently without relying on costly trace distillation. Extensive experiments on Qwen and Llama demonstrate that our method outperforms existing methods across ten medical benchmarks, achieving up to +5.93% gain on rare-disease tasks. Our code is available at https://github.com/tdlhl/MedSSR.
[1197] bacpipe: a Python package to make bioacoustic deep learning models accessible
Vincent S. Kather, Sylvain Haupert, Burooj Ghani, Dan Stowell
Main category: cs.LG
TL;DR: Bacpipe is a modular software package providing graphical and programming interfaces for bioacoustic deep learning models, enabling ecologists and computer scientists to analyze natural sound recordings using state-of-the-art models for embeddings and classification.
Details
Motivation: Millions of hours of natural sound recordings exist from passive acoustic monitoring, but accessing and utilizing advanced deep learning models for analysis remains challenging for researchers. There's a need for accessible tools that bridge the gap between state-of-the-art models and practical ecological research applications.
Method: Developed bacpipe, a modular software package with both graphical and programming interfaces that provides access to bioacoustic deep learning models. The system includes evaluation pipelines for benchmarking models, generates acoustic feature vectors (embeddings) and classifier predictions, and offers interactive visualizations, clustering, and probing capabilities.
Result: Created an accessible tool that streamlines usage of state-of-the-art models on custom audio datasets, enabling researchers to generate acoustic embeddings and classifier predictions while providing modular evaluation and benchmarking capabilities through interactive visualizations.
Conclusion: Making deep learning developments accessible to a wider audience benefits ecological research by enabling researchers to answer new ecological and evolutionary questions in bioacoustics through improved analysis of natural sound recordings.
Abstract: 1. Natural sounds have been recorded for millions of hours over the previous decades using passive acoustic monitoring. Improvements in deep learning models have vastly accelerated the analysis of large portions of this data. While new models advance the state-of-the-art, accessing them using tools to harness their full potential is not always straightforward. Here we present bacpipe, a collection of bioacoustic deep learning models and evaluation pipelines accessible through a graphical and programming interface, designed for both ecologists and computer scientists. Bacpipe is a modular software package intended as a point of convergence for bioacoustic models. 2. Bacpipe streamlines the usage of state-of-the-art models on custom audio datasets, generating acoustic feature vectors (embeddings) and classifier predictions. A modular design allows evaluation and benchmarking of models through interactive visualizations, clustering and probing. 3. We believe that access to new deep learning models is important. By designing bacpipe to target a wide audience, researchers will be enabled to answer new ecological and evolutionary questions in bioacoustics. 4. In conclusion, we believe that making developments in deep learning accessible to a wider audience benefits ecological research and the questions it seeks to answer.
[1198] Layerwise Dynamics for In-Context Classification in Transformers
Patrick Lutz, Themistoklis Haris, Arjun Chandra, Aditya Gangrade, Venkatesh Saligrama
Main category: cs.LG
TL;DR: Transformers perform in-context classification through emergent geometric dynamics that can be explicitly extracted and analyzed via permutation-equivariant constraints.
Details
Motivation: To understand the opaque inference-time algorithms of transformers performing in-context classification, particularly in the hard no-margin regime, by making the computation interpretable while maintaining functional equivalence.
Method: Study multi-class linear classification with permutation-equivariance constraints at every layer to enable interpretability, extract explicit depth-indexed recursion from structured weights, and analyze attention matrices formed from mixed feature-label Gram structure.
Result: Identified an emergent update rule inside softmax transformers that implements geometry-driven algorithmic dynamics, which can provably amplify class separation and yields robust expected class alignment.
Conclusion: Transformers’ in-context classification capabilities arise from structured geometric dynamics that can be explicitly extracted and analyzed, providing interpretability into their inference-time algorithms.
Abstract: Transformers can perform in-context classification from a few labeled examples, yet the inference-time algorithm remains opaque. We study multi-class linear classification in the hard no-margin regime and make the computation identifiable by enforcing feature- and label-permutation equivariance at every layer. This enables interpretability while maintaining functional equivalence and yields highly structured weights. From these models we extract an explicit depth-indexed recursion: an end-to-end identified, emergent update rule inside a softmax transformer, to our knowledge the first of its kind. Attention matrices formed from mixed feature-label Gram structure drive coupled updates of training points, labels, and the test probe. The resulting dynamics implement a geometry-driven algorithmic motif, which can provably amplify class separation and yields robust expected class alignment.
[1199] SCNO: Spiking Compositional Neural Operator – Towards a Neuromorphic Foundation Model for Nuclear PDE Solving
Samrendra Roy, Souvik Chakraborty, Rizwan-uddin, Syed Bahauddin Alam
Main category: cs.LG
TL;DR: SCNO is a modular spiking neural operator architecture that composes small pre-trained blocks for elementary differential operators to solve coupled PDEs without retraining, achieving better accuracy with fewer parameters than monolithic approaches.
Details
Motivation: Current neural operators for PDE solving are typically monolithic models trained on individual PDEs, require GPU hardware, and must be retrained from scratch for new physics, lacking modularity and efficiency.
Method: SCNO maintains a library of small spiking neural operator blocks trained on elementary differential operators (convection, diffusion, reaction), composes them via a lightweight input-conditioned aggregator, and uses a small correction network to learn cross-coupling residuals while keeping blocks frozen.
Result: SCNO achieves lowest relative L2 error on 4 of 5 coupled PDEs, outperforming monolithic spiking DeepONet by up to 62% and standard ANN DeepONet by up to 65%, while using only 95K parameters vs 462K for baselines.
Conclusion: SCNO demonstrates the first compositional spiking neural operator and proof-of-concept for modular neuromorphic PDE solving with built-in forgetting-free expansion, offering efficient and accurate PDE solutions.
Abstract: Neural operators have emerged as powerful surrogates for partial differential equation (PDE) solvers, yet they are typically trained as monolithic models for individual PDEs, require energy-intensive GPU hardware, and must be retrained from scratch when new physics emerge. We introduce the Spiking Compositional Neural Operator (SCNO), a modular architecture combining spiking and conventional components that addresses all three limitations. SCNO maintains a library of small spiking neural operator blocks, each trained on a single elementary differential operator (convection, diffusion, reaction), and composes them through a lightweight input-conditioned aggregator to solve coupled PDEs not seen during block training. A small correction network learns cross-coupling residuals while keeping all blocks and the aggregator frozen, preserving zero-forgetting modular expansion by construction. We evaluate SCNO on eight PDE families including five coupled systems and a nuclear-relevant 1-group neutron diffusion equation. SCNO with correction achieves the lowest relative $L^2$ error on four of five coupled PDEs, outperforming both a monolithic spiking DeepONet (by up to 62%, mean over 3 seeds) and a standard ANN DeepONet (by up to 65%), while requiring only 95K trainable parameters versus 462K for the monolithic baseline. To our knowledge, this is the first compositional spiking neural operator and the first proof-of-concept for modular neuromorphic PDE solving with built-in forgetting-free expansion.
[1200] Inter-Layer Hessian Analysis of Neural Networks with DAG Architectures
Maxim Bolshim, Alexander Kugaevskikh
Main category: cs.LG
TL;DR: Analytical decomposition of neural network Hessian into Gauss-Newton and tensor components with diagnostic metrics for layer-wise curvature analysis
Details
Motivation: Modern autodiff frameworks return Hessians as monolithic tensors without exposing inter-layer interaction structure, making it hard to analyze curvature properties across network layers.
Method: Develops analytical formalism to decompose full Hessian into blocks indexed by network DAG, separating Gauss-Newton (convex) and tensor (residual curvature) components. Introduces diagnostic metrics (inter-layer resonance, geometric coupling, stable rank, GN-Gap) estimated stochastically in O(P) time.
Result: For ReLU networks, tensor component of input Hessian vanishes; full parametric Hessian contains residual terms not reducible to GGN. Metrics reveal structural curvature interactions between layers, explaining exponential decay of resonance in vanilla networks and preservation under skip connections.
Conclusion: Provides theoretical framework for analyzing neural network curvature structure with practical diagnostic tools that scale to large models like ResNet-18 (~11M parameters).
Abstract: Modern automatic differentiation frameworks (JAX, PyTorch) return the Hessian of the loss function as a monolithic tensor, without exposing the internal structure of inter-layer interactions. This paper presents an analytical formalism that explicitly decomposes the full Hessian into blocks indexed by the DAG of an arbitrary architecture. The canonical decomposition $H = H^{GN} + H^{T}$ separates the Gauss–Newton component (convex part) from the tensor component (residual curvature responsible for saddle points). For piecewise-linear activations (ReLU), the tensor component of the input Hessian vanishes ($H^{T}_{v,w} \equiv 0$ a.e., $H^{f}_{v,w} = H^{GN}_{v,w} \succeq 0$); the full parametric Hessian contains residual terms that do not reduce to the GGN. Building on this decomposition, we introduce diagnostic metrics (inter-layer resonance $\mathcal{R}$, geometric coupling $\mathcal{C}$, stable rank $\mathcal{D}$, GN-Gap) that are estimated stochastically in $O(P)$ time and reveal structural curvature interactions between layers. The theoretical analysis explains exponential decay of resonance in vanilla networks and its preservation under skip connections; empirical validation spans fully connected MLPs (Exp. 1–5) and convolutional architectures (ResNet-18, $\sim$11M parameters, Exp. 6). When the architecture reduces to a single node, all definitions collapse to the standard Hessian $\nabla^2_{\theta}\mathcal{L}(\theta) \in \mathbb{R}^{p \times p}$.
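The split $H = H^{GN} + H^{T}$ can be checked directly on a toy model: for a squared-error loss the Gauss-Newton part is $J^\top J$ with $J$ the Jacobian of the network outputs with respect to the parameters, and the tensor part is the remainder. The sketch below uses a tiny tanh network written as a function of a flat parameter vector (tanh chosen so the tensor part is nonzero, unlike ReLU); it is a minimal stand-in, not the paper's DAG-block machinery.

```python
# Compute H, its Gauss-Newton part, and the tensor remainder for a tiny
# one-hidden-layer network under squared-error loss.
import torch

torch.manual_seed(0)
x = torch.randn(8, 3)                  # 8 inputs, 3 features
y = torch.randn(8)

def f(theta):
    W = theta[:12].reshape(3, 4)       # hidden-layer weights, width 4
    v = theta[12:]                     # output weights
    return torch.tanh(x @ W) @ v       # network outputs, shape (8,)

def loss(theta):
    return 0.5 * ((f(theta) - y) ** 2).sum()

theta = torch.randn(16)
H = torch.autograd.functional.hessian(loss, theta)   # full parametric Hessian
J = torch.autograd.functional.jacobian(f, theta)     # (8, 16) output Jacobian
H_gn = J.T @ J                                       # Gauss-Newton part (PSD)
H_t = H - H_gn                                       # tensor / residual curvature

print("tensor share of curvature:", (H_t.norm() / H.norm()).item())
```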
[1201] Towards Autonomous Mechanistic Reasoning in Virtual Cells
Yunhui Jang, Lu Zhu, Jake Fawkes, Alisandra Kaye Denton, Dominique Beaini, Emmanuel Noutahi
Main category: cs.LG
TL;DR: VCR-Agent: A multi-agent framework for generating and validating mechanistic biological explanations using action graphs and verification, applied to virtual cells with improved factual precision.
Details
Motivation: LLMs have potential for scientific discovery but lack factually grounded explanations in biology. Current approaches don't provide actionable, verifiable mechanistic reasoning for complex biological systems like virtual cells.
Method: Introduces structured explanation formalism using mechanistic action graphs for biological reasoning. Proposes VCR-Agent multi-agent framework integrating biologically grounded knowledge retrieval with verifier-based filtering to autonomously generate and validate mechanistic reasoning. Creates VC-TRACES dataset from Tahoe-100M atlas.
Result: Training with verified mechanistic explanations improves factual precision and provides more effective supervision for downstream gene expression prediction. Demonstrates importance of reliable mechanistic reasoning through multi-agent verification synergy.
Conclusion: The framework enables systematic verification and falsification of biological reasoning, addressing LLM limitations in scientific domains. Reliable mechanistic reasoning is crucial for virtual cell applications.
Abstract: Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with a verifier-based filtering approach to generate and validate mechanistic reasoning autonomously. Using this framework, we release the VC-TRACES dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of multi-agent and rigorous verification.
[1202] Fairness is Not Flat: Geometric Phase Transitions Against Shortcut Learning
Nicolas Rodriguez-Alvarez, Fernando Rodriguez-Merino
Main category: cs.LG
TL;DR: A geometric method to mitigate shortcut learning in neural networks by using a zero-hidden-layer Topological Auditor to isolate and prune linear shortcuts, forcing networks to use higher geometric capacity for ethical representations.
Details
Motivation: Deep Neural Networks are prone to shortcut learning, memorizing spurious correlations instead of causal mechanisms, which degrades out-of-distribution robustness and induces demographic biases in sensitive applications.
Method: Proposes a geometric a priori methodology using a zero-hidden-layer (N=1) Topological Auditor to mathematically isolate features that monopolize gradients without human intervention. After pruning linear shortcuts, networks are forced to utilize higher geometric capacity (N≥16) to curve decision boundaries and learn ethical representations.
Result: Outperforms L1 Regularization (which collapses into demographic bias) and operates at a fraction of the computational cost of post-hoc methods like Just Train Twice (JTT). Successfully reduces counterfactual gender vulnerability from 21.18% to 7.66%.
Conclusion: The geometric approach effectively mitigates shortcut learning, improves model robustness, and reduces demographic biases by forcing networks to learn more complex, ethical representations rather than relying on simple linear shortcuts.
Abstract: Deep Neural Networks are highly susceptible to shortcut learning, frequently memorizing low-dimensional spurious correlations instead of underlying causal mechanisms. This phenomenon not only degrades out-of-distribution robustness but also induces severe demographic biases in sensitive applications. In this paper, we propose a geometric a priori methodology to mitigate shortcut learning. By deploying a zero-hidden-layer ($N=1$) Topological Auditor, we mathematically isolate features that monopolize the gradient without human intervention. We empirically demonstrate a Capacity Phase Transition: once linear shortcuts are pruned, networks are forced to utilize higher geometric capacity ($N \geq 16$) to curve the decision boundary and learn ethical representations. Our approach outperforms L1 Regularization – which collapses into demographic bias – and operates at a fraction of the computational cost of post-hoc methods like Just Train Twice (JTT), successfully reducing counterfactual gender vulnerability from 21.18% to 7.66%.
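To make the auditing idea concrete, here is one plausible minimal reading of a zero-hidden-layer auditor: fit a linear model, rank features by standardized weight magnitude, and prune the dominant one before training the main network. The data, the ranking criterion, and the pruning rule are all assumptions for illustration; the paper's actual auditor and gradient-monopolization criterion may differ.

```python
# A linear (zero-hidden-layer) probe flags a planted shortcut feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
signal = rng.standard_normal((n, 10))                   # "real" features
y = (signal[:, :3].sum(axis=1) > 0).astype(int)         # causal mechanism
shortcut = (2 * y - 1) + 0.3 * rng.standard_normal(n)   # near-perfect linear shortcut
X = np.column_stack([signal, shortcut])

auditor = LogisticRegression(max_iter=1000).fit(X, y)   # N = 1: no hidden layer
influence = np.abs(auditor.coef_[0]) * X.std(axis=0)    # standardized weight magnitude
print("flagged feature index:", influence.argmax())     # flags column 10, the shortcut

X_pruned = np.delete(X, influence.argmax(), axis=1)     # train the deeper model on this
```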
[1203] KL Divergence Between Gaussians: A Step-by-Step Derivation for the Variational Autoencoder Objective
Andrés Muñoz, Rodrigo Ramele
Main category: cs.LG
TL;DR: Derivation of closed-form KL divergence between Gaussian distributions for VAE regularization
Details
Motivation: KL divergence is fundamental in VAEs for latent space regularization, but practical implementations need closed-form expressions for Gaussian distributions to enable efficient training.
Method: Mathematical derivation starting from general continuous random variable definition, extending from univariate to multivariate Gaussian distributions with diagonal covariance assumption.
Result: Provides detailed closed-form expression for KL divergence between Gaussian distributions, with interpretation of each term’s impact on VAE training dynamics.
Conclusion: Complete derivation enables better understanding and implementation of KL divergence regularization in VAEs, particularly for Gaussian latent spaces.
Abstract: Kullback-Leibler (KL) divergence is a fundamental concept in information theory that quantifies the discrepancy between two probability distributions. In the context of Variational Autoencoders (VAEs), it serves as a central regularization term, imposing structure on the latent space and thereby enabling the model to exhibit generative capabilities. In this work, we present a detailed derivation of the closed-form expression for the KL divergence between Gaussian distributions, a case of particular importance in practical VAE implementations. Starting from the general definition for continuous random variables, we derive the expression for the univariate case and extend it to the multivariate setting under the assumption of diagonal covariance. Finally, we discuss the interpretation of each term in the resulting expression and its impact on the training dynamics of the model.
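For reference, in the usual VAE setting with a diagonal-covariance posterior and a standard-normal prior, the closed form the paper derives specializes to $\mathrm{KL}\left(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\right) = \tfrac{1}{2}\sum_d \left(\sigma_d^2 + \mu_d^2 - 1 - \log \sigma_d^2\right)$. The snippet below checks this standard identity against a Monte Carlo estimate.

```python
# Closed-form Gaussian KL (the VAE regularizer) vs. a Monte Carlo estimate.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.2])
log_var = np.array([0.1, -0.3, 0.4])
sigma2 = np.exp(log_var)

closed = 0.5 * np.sum(sigma2 + mu**2 - 1.0 - log_var)

# Monte Carlo: KL = E_q[log q(z) - log p(z)] with z ~ q.
z = mu + np.sqrt(sigma2) * rng.standard_normal((200_000, 3))
log_q = -0.5 * np.sum((z - mu) ** 2 / sigma2 + log_var + np.log(2 * np.pi), axis=1)
log_p = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi), axis=1)

print(closed, (log_q - log_p).mean())   # the two agree up to MC noise
```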
[1204] Autonomous Diffractometry Enabled by Visual Reinforcement Learning
J. Oppliger, M. Stifter, A. Rüegg, I. Biało, L. Martinelli, P. G. Freeman, D. Prabhakaran, J. Zhao, Q. Wang, J. Chang
Main category: cs.LG
TL;DR: Autonomous crystal alignment system using reinforcement learning to interpret Laue diffraction patterns without crystallography knowledge
Details
Motivation: Automating tasks requiring interpretation of abstract visual information remains challenging, particularly in scientific domains like crystal alignment, which currently relies on human experts who can comprehend diffraction patterns.
Method: Model-free reinforcement learning framework where an agent learns to identify and navigate toward high-symmetry orientations directly from Laue diffraction patterns without access to crystallography or diffraction theory.
Result: The agent develops human-like strategies for time-efficient alignment across different crystal symmetry classes without human supervision, creating an autonomous system for crystal alignment
Conclusion: Provides a computational framework for intelligent diffractometers and advances automated experimental workflows in materials science
Abstract: Automation underpins progress across scientific and industrial disciplines. Yet, automating tasks that require interpretation of abstract visual information remains challenging. For example, crystal alignment strongly relies on humans with the ability to comprehend diffraction patterns. Here we introduce an autonomous system that aligns single crystals without access to crystallography and diffraction theory. Using a model-free reinforcement learning framework, an agent learns to identify and navigate towards high-symmetry orientations directly from Laue diffraction patterns. Despite the absence of human supervision, the agent develops human-like strategies to achieve time-efficient alignment across different crystal symmetry classes. With this, we provide a computational framework for intelligent diffractometers. As such, our approach advances the development of automated experimental workflows in materials science.
[1205] ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Main category: cs.LG
TL;DR: ClawGUI is an open-source framework for GUI agents that addresses infrastructure gaps with RL training, standardized evaluation, and real-world deployment across mobile platforms.
Details
Motivation: Progress in GUI agents is bottlenecked by infrastructure issues: unstable RL training environments, drifting evaluation protocols, and lack of real-world deployment to actual devices and users.
Method: Three-component framework: ClawGUI-RL for RL training with virtual/physical device support and dense supervision; ClawGUI-Eval for standardized evaluation across benchmarks; ClawGUI-Agent for deployment across mobile platforms with hybrid CLI-GUI control.
Result: ClawGUI-2B achieves 17.1% Success Rate on MobileWorld GUI-Only benchmark, outperforming same-scale MAI-UI-2B baseline by 6.0%; achieves 95.8% reproduction against official baselines.
Conclusion: ClawGUI provides a comprehensive infrastructure solution for GUI agents that addresses training, evaluation, and deployment challenges, enabling more robust development and real-world application.
Abstract: GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present ClawGUI, an open-source framework addressing these three gaps within a single harness. ClawGUI-RL provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. ClawGUI-Eval enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. ClawGUI-Agent brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, ClawGUI-2B achieves 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.
[1206] A Mechanistic Analysis of Looped Reasoning Language Models
Hugh Blayney, Álvaro Arroyo, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Michael M. Bronstein, Xiaowen Dong
Main category: cs.LG
TL;DR: Mechanistic analysis of looped language models reveals they converge to cyclic fixed points where attention behavior stabilizes, mirroring feedforward model inference stages in depth.
Details
Motivation: To understand how the internal dynamics of looped reasoning language models differ from standard feedforward models, particularly comparing their inference stages and latent state behaviors.
Method: Conducted mechanistic analysis of latent states in looped language models, analyzing cyclic recurrence patterns, fixed point convergence, attention-head behavior stabilization, and studying architectural factors like recurrent block size, input injection, and normalization.
Result: Discovered that each layer in the cycle converges to distinct fixed points, forming consistent cyclic trajectories in latent space; attention-head behavior stabilizes at fixed points; recurrent blocks learn inference stages mirroring feedforward models, repeating them in depth with each iteration.
Conclusion: Findings provide mechanistic insights into looped language models that can translate to practical architectural design guidance, showing how these models develop stable cyclic patterns that replicate feedforward inference stages.
Abstract: Reasoning has become a central capability in large language models. Recent research has shown that reasoning performance can be improved by looping an LLM’s layers in the latent dimension, resulting in looped reasoning language models. Despite promising results, few works have investigated how their internal dynamics differ from those of standard feedforward models. In this paper, we conduct a mechanistic analysis of the latent states in looped language models, focusing in particular on how the stages of inference observed in feedforward models compare to those observed in looped ones. To this end, we analyze cyclic recurrence and show that for many of the studied models each layer in the cycle converges to a distinct fixed point; consequently, the recurrent block follows a consistent cyclic trajectory in the latent space. We provide evidence that as these fixed points are reached, attention-head behavior stabilizes, leading to constant behavior across recurrences. Empirically, we discover that recurrent blocks learn stages of inference that closely mirror those of feedforward models, repeating these stages in depth with each iteration. We study how recurrent block size, input injection, and normalization influence the emergence and stability of these cyclic fixed points. We believe these findings help translate mechanistic insights into practical guidance for architectural design.
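The fixed-point phenomenon is easy to visualize with a toy contraction standing in for the recurrent block: iterate it with a fixed injected input and watch successive latent states converge geometrically. The block below (a spectrally scaled linear map plus tanh) is an illustrative assumption, not a trained looped transformer.

```python
# Iterate a contractive "recurrent block" with input injection and track
# the distance between successive latent states.
import torch

torch.manual_seed(0)
d = 64
W = torch.randn(d, d)
W = 0.9 * W / torch.linalg.matrix_norm(W, ord=2)  # spectral norm < 1: a contraction
x = torch.randn(d)                                # injected input, fixed across loops

h = torch.zeros(d)
for t in range(20):
    h_next = torch.tanh(W @ h + x)                # one pass through the looped block
    print(f"iter {t:2d}  ||h_next - h|| = {(h_next - h).norm().item():.2e}")
    h = h_next                                    # converges to a unique fixed point
```

Because tanh is 1-Lipschitz and W is a contraction, Banach's fixed-point theorem guarantees convergence here; the paper's empirical claim is that trained looped models exhibit analogous per-layer fixed points without any such construction.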
[1207] Solving Physics Olympiad via Reinforcement Learning on Physics Simulators
Mihir Prabhudesai, Aryan Satpathy, Yangmin Li, Zheyang Qin, Nikash Bhardwaj, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak
Main category: cs.LG
TL;DR: Using physics simulators to generate synthetic QA data for training LLMs in physical reasoning, achieving sim-to-real transfer to real-world physics benchmarks.
Details
Motivation: Current LLM reasoning progress relies heavily on internet QA pairs, which are limited in scale and concentrated in domains like mathematics. Physics lacks large-scale QA datasets, creating a bottleneck for training reasoning-capable models in physical sciences.
Method: Generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data.
Result: Models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks. Training solely on synthetic simulated data improves performance on IPhO problems by 5-10 percentage points across model sizes.
Conclusion: Physics simulators can serve as scalable data generators for training LLMs, enabling acquisition of deep physical reasoning skills beyond limitations of internet-scale QA data.
Abstract: We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question-answer (QA) pairs, a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: https://sim2reason.github.io/.
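A toy version of the data-generation loop: sample random scene parameters, compute the ground-truth answer with the simulator (here replaced by a closed-form projectile model for brevity), and emit a QA pair whose numeric answer can reward an RL policy. The scene distribution and question template are illustrative assumptions, not the paper's pipeline.

```python
# Generate a synthetic physics QA pair from a randomly sampled "scene".
import math
import random

def make_qa(rng: random.Random):
    """Sample a random projectile scene and return a question/answer pair."""
    v0 = rng.uniform(5.0, 50.0)      # launch speed, m/s
    angle = rng.uniform(10.0, 80.0)  # launch angle, degrees
    g = 9.81
    # Closed-form range on flat ground, standing in for a physics engine.
    answer = v0 ** 2 * math.sin(2 * math.radians(angle)) / g
    question = (f"A projectile is launched at {v0:.1f} m/s, {angle:.1f} degrees "
                f"above the horizontal on flat ground. Ignoring drag, "
                f"what is its range in meters?")
    return {"question": question, "answer": round(answer, 2)}

rng = random.Random(0)
print(make_qa(rng))  # an RL reward can then score answers by numeric closeness
```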
[1208] Physics-Informed State Space Models for Reliable Solar Irradiance Forecasting in Off-Grid Systems
Mohammed Ezzaldin Babiker Abdullah
Main category: cs.LG
TL;DR: A thermodynamic liquid manifold network for solar forecasting that eliminates nocturnal power generation errors and phase lags by enforcing celestial mechanics compliance through Koopman-linearized Riemannian manifold projections.
Details
Motivation: Current deep learning models for solar forecasting exhibit critical anomalies including temporal phase lags during cloud transients and physically impossible nocturnal power generation, creating divergence between data-driven modeling and deterministic celestial mechanics.
Method: Projects 15 meteorological and geometric variables into a Koopman-linearized Riemannian manifold to map climatic dynamics, integrates Spectral Calibration unit and multiplicative Thermodynamic Alpha-Gate, synthesizes real-time atmospheric opacity with theoretical clear-sky boundary models to enforce celestial geometry compliance.
Result: Achieves RMSE of 18.31 Wh/m² and Pearson correlation of 0.988 over 5-year testing, maintains zero-magnitude nocturnal error across all 1826 testing days, exhibits sub-30-minute phase response during high-frequency transients with only 63,458 trainable parameters.
Conclusion: Establishes a robust, thermodynamically consistent standard for edge-deployable microgrid controllers by eliminating phantom nocturnal generation while maintaining zero-lag synchronization during rapid weather shifts.
Abstract: The stable operation of autonomous off-grid photovoltaic systems dictates reliance on solar forecasting algorithms that respect atmospheric thermodynamics. Contemporary deep learning models consistently exhibit critical anomalies, primarily severe temporal phase lags during cloud transients and physically impossible nocturnal power generation. To resolve this divergence between data-driven modeling and deterministic celestial mechanics, this research introduces the Thermodynamic Liquid Manifold Network. The proposed methodology projects 15 meteorological and geometric variables into a Koopman-linearized Riemannian manifold to systematically map complex climatic dynamics. The architecture integrates a Spectral Calibration unit and a multiplicative Thermodynamic Alpha-Gate. This system synthesizes real-time atmospheric opacity with theoretical clear-sky boundary models, structurally enforcing strict celestial geometry compliance. This completely neutralizes phantom nocturnal generation while maintaining zero-lag synchronization during rapid weather shifts. Validated against a rigorous five-year testing horizon in a severe semi-arid climate, the framework achieves an RMSE of 18.31 Wh/m² and a Pearson correlation of 0.988. The model strictly maintains a zero-magnitude nocturnal error across all 1826 testing days and exhibits a sub-30-minute phase response during high-frequency transients. Comprising exactly 63,458 trainable parameters, this ultra-lightweight design establishes a robust, thermodynamically consistent standard for edge-deployable microgrid controllers.
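The structural guarantee against phantom night-time power follows from the multiplicative gate alone: a learned opacity in [0, 1] scales a theoretical clear-sky bound, so wherever the bound is zero the forecast is zero. The sketch below illustrates this; the crude clear-sky model (a clipped diurnal cosine) is a deliberate stand-in for the paper's boundary model, not its actual formulation.

```python
# Multiplicative clear-sky gating: forecasts can never exceed the bound,
# and are structurally zero at night.
import numpy as np

def clear_sky_ghi(hour):
    """Toy clear-sky bound in W/m^2: zero at night, peaked at solar noon."""
    cos_zenith = np.cos((hour - 12.0) / 12.0 * np.pi)  # crude diurnal geometry
    return 1000.0 * np.clip(cos_zenith, 0.0, None)

def forecast(hour, gate_logit):
    gate = 1.0 / (1.0 + np.exp(-gate_logit))           # learned opacity in [0, 1]
    return gate * clear_sky_ghi(hour)                  # bounded by the clear-sky model

hours = np.array([0.0, 9.0, 12.0, 21.0])
print(forecast(hours, gate_logit=np.array([2.0, 0.0, 1.0, -1.0])))
# Hours 0 and 21 yield exactly 0 regardless of the gate: no phantom night power.
```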
[1209] The Wasserstein transform
Kun Jin, Facundo Mémoli, Zane Smith, Zhengchao Wan
Main category: cs.LG
TL;DR: Wasserstein Transform (WT) is an unsupervised framework that updates distance structures using Wasserstein distances between probability measures representing neighborhood structures, extending mean shift algorithms for feature enhancement and denoising.
Details
Motivation: The paper aims to develop a general unsupervised framework for enhancing features and denoising data by updating distance structures. Current methods like mean shift algorithms have limitations, and the authors propose using Wasserstein distances between probability measures to better capture neighborhood structures.
Method: Represent each data point by a probability measure reflecting its neighborhood structure, then update distances by computing Wasserstein distances between these measures. Several instances are studied, including Gaussian Transform (GT) which uses Gaussian measures for computational efficiency. Iterative algorithms are devised with acceleration strategies like reducing matrix square root computations.
Result: The framework extends mean shift algorithms and is proven to be stable under perturbations. GT offers computational advantages with closed-form solutions for ℓ²-Wasserstein distances between Gaussian measures. The method shows effectiveness in various tasks including denoising, clustering, image segmentation, and word embeddings.
Conclusion: Wasserstein Transform provides a powerful general framework for unsupervised distance structure updates that enhances features and denoises data, with practical instances like Gaussian Transform offering computational efficiency while maintaining theoretical guarantees.
Abstract: We introduce the Wasserstein Transform (WT), a general unsupervised framework for updating distance structures on given data sets with the purpose of enhancing features and denoising. Our framework represents each data point by a probability measure reflecting the neighborhood structure of the point, and then updates the distance by computing the Wasserstein distance between these probability measures. The Wasserstein Transform is a general method which extends the mean shift family of algorithms. We study several instances of WT, and in particular, in one of the instances which we call the Gaussian Transform (GT), we utilize Gaussian measures to model neighborhood structures of individual data points. GT is computationally cheaper than other instances of WT since there exists closed form solution for the $\ell^2$-Wasserstein distance between Gaussian measures. We study the relationship between different instances of WT and prove that each of the instances is stable under perturbations. We devise iterative algorithms for performing the above-mentioned WT and propose several strategies to accelerate GT, such as an observation from linear algebra for reducing the number of matrix square root computations. We examine the performance of the Wasserstein Transform method in many tasks, such as denoising, clustering, image segmentation and word embeddings.
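The closed form that makes the Gaussian Transform cheap is the standard Bures formula, $W_2^2\big(\mathcal{N}(m_1, S_1), \mathcal{N}(m_2, S_2)\big) = \|m_1 - m_2\|^2 + \mathrm{Tr}\big(S_1 + S_2 - 2 (S_2^{1/2} S_1 S_2^{1/2})^{1/2}\big)$. The snippet below evaluates it on arbitrary test inputs; the covariances are illustrative, and the matrix square roots are exactly the computations the paper proposes to reduce.

```python
# Closed-form 2-Wasserstein distance between two Gaussians.
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(m1, S1, m2, S2):
    s2 = sqrtm(S2)
    cross = sqrtm(s2 @ S1 @ s2)                 # (S2^{1/2} S1 S2^{1/2})^{1/2}
    return np.sqrt(np.sum((m1 - m2) ** 2)
                   + np.trace(S1 + S2 - 2.0 * np.real(cross)))

rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 3, 3))
S1, S2 = A @ A.T + np.eye(3), B @ B.T + np.eye(3)   # SPD covariances
print(w2_gaussian(np.zeros(3), S1, np.ones(3), S2))
```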
[1210] AXIL: Exact Instance Attribution for Gradient Boosting
Paul Geertsema, Helen Lu
Main category: cs.LG
TL;DR: AXIL provides exact instance attribution for gradient boosting machines with squared-error loss, computing prediction-specific weights for training instances without materializing large matrices.
Details
Motivation: Existing instance attribution methods for GBMs (BoostIn, TREX, LeafInfluence) are approximate and computationally expensive. There's a need for exact, efficient attribution that works for large datasets and provides theoretical guarantees.
Method: Derives exact attribution weights (AXIL) for GBMs with squared-error loss by expressing predictions as weighted sums of training targets. Uses matrix-free backward operator for efficient computation (O(TN) per prediction) without materializing full N×N matrices.
Result: AXIL achieves highest faithfulness scores on 14/20 regression datasets, ties on 4 others, runs faster than competitors, and provides exact sensitivity in target-perturbation tests where other methods fail.
Conclusion: AXIL offers practical exact instance attribution for GBMs, connects to broader target-response Jacobian framework for differentiable learners, and outperforms existing methods in faithfulness and efficiency.
Abstract: We derive an exact, prediction-specific instance-attribution method for fitted gradient boosting machines (GBMs) trained with squared-error loss, with the learned tree structure held fixed. Each prediction can be written as a weighted sum of training targets, with coefficients determined only by the fitted tree structure and learning rate. These coefficients are exact instance attributions, or AXIL weights. Our main algorithmic contribution is a matrix-free backward operator that computes one AXIL attribution vector in O(TN) time, or S vectors in O(TNS), without materialising the full N x N matrix. This extends to out-of-sample predictions and makes exact instance attribution practical for large datasets. AXIL yields exact fixed-structure sensitivity by construction in target-perturbation tests, where competing GBM-specific attribution methods (BoostIn, TREX, and LeafInfluence) generally fail. In retraining-based faithfulness tests on 20 regression datasets, AXIL achieves the highest faithfulness score on 14 datasets and statistically ties for the best on 4 others, while also running substantially faster than the competing methods. We also show that the AXIL weight matrix is the globally constant special case of a target-response Jacobian that provides first-order instance attribution for any differentiable learner via implicit differentiation, placing the exact decomposition inside a broader framework.
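The "weighted sum of training targets" claim can be verified with a dense version of the underlying recursion: for a squared-error GBM initialized at the training mean, with P_t the leaf-averaging matrix of tree t over the training set, training predictions satisfy F_t = F_{t-1} + eta * P_t (y - F_{t-1}), so the attribution matrix with F_t = A_t y obeys A_t = A_{t-1} + eta * P_t (I - A_{t-1}), starting from A_0 = (1/N) 11^T. The sketch below checks this on a scikit-learn GBM; it is a dense O(N^2) illustration of the decomposition under these assumptions, not the paper's O(TN) matrix-free operator.

```python
# Dense attribution-weight recursion for a squared-error GBM: verify that
# predictions on the training set are exactly A @ y.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(200)

eta = 0.1
gbm = GradientBoostingRegressor(loss="squared_error", n_estimators=50,
                                learning_rate=eta, max_depth=2).fit(X, y)

n = len(y)
leaves = gbm.apply(X)                          # (n_samples, n_estimators) leaf ids
A = np.full((n, n), 1.0 / n)                   # init F_0 = mean(y) = (1/n) 11^T y
for t in range(leaves.shape[1]):
    same = leaves[:, t:t + 1] == leaves[:, t]  # leaf co-membership matrix
    P = same / same.sum(axis=1, keepdims=True) # leaf-averaging matrix P_t
    A = A + eta * P @ (np.eye(n) - A)          # A_t = A_{t-1} + eta P_t (I - A_{t-1})

print(np.allclose(A @ y, gbm.predict(X), atol=1e-6))  # exact reconstruction
```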
[1211] SIGMA: An Efficient Heterophilous Graph Neural Network with Fast Global Aggregation
Haoyu Liu, Ningyi Liao, Siqiang Luo
Main category: cs.LG
TL;DR: SIGMA is an efficient graph neural network aggregation method that uses SimRank to capture global structural similarity for heterophilous graphs, achieving linear computational complexity and state-of-the-art performance.
Details
Motivation: Traditional GNNs perform poorly on heterophilous graphs, where neighboring nodes are dissimilar, due to their local and uniform aggregation. Existing heterophilous GNNs use long-range or global aggregations but suffer from efficiency issues on large-scale graphs due to iterative full-graph updates.
Method: Proposes SIGMA, an efficient global aggregation for heterophilous GNNs that integrates the SimRank structural-similarity measure. It captures distant global similarity in a one-time computation with linear complexity O(n), avoiding iterative updates.
Result: SIGMA achieves state-of-the-art performance with superior aggregation and overall efficiency. It obtains 5× acceleration on the large-scale heterophily dataset pokec with over 30 million edges compared to the best baseline aggregation.
Conclusion: SIGMA provides an efficient solution for heterophilous graph learning by capturing global structural similarity with linear computational complexity, making it scalable to large graphs while maintaining high performance.
Abstract: Graph neural networks (GNNs) achieve great success in graph learning but suffer performance loss under heterophily, i.e., when neighboring nodes are dissimilar, due to their local and uniform aggregation. Existing heterophilous GNNs incorporate long-range or global aggregations to distinguish nodes in the graph. However, these aggregations usually require iteratively maintaining and updating full-graph information, which limits their efficiency on large-scale graphs. In this paper, we propose SIGMA, an efficient global heterophilous GNN aggregation integrating the structural similarity measurement SimRank. Our theoretical analysis shows that SIGMA inherently captures distant global similarity even under heterophily, which conventional approaches achieve only after iterative aggregations. Furthermore, it enjoys an efficient one-time computation with complexity linear in the node set size, $\mathcal{O}(n)$. Comprehensive evaluation demonstrates that SIGMA achieves state-of-the-art performance with superior aggregation and overall efficiency. Notably, it obtains a $5\times$ acceleration on the large-scale heterophily dataset pokec, with over 30 million edges, compared to the best baseline aggregation.
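For reference, SimRank (Jeh & Widom, 2002) is itself a fixed-point iteration: two nodes are similar if their in-neighbors are similar, damped by a constant $C$. A naive dense numpy version, whose cubic per-iteration cost is exactly what SIGMA's one-time linear-complexity aggregation is designed to avoid:

```python
import numpy as np

def simrank(adj, C=0.8, iters=10):
    """Naive dense SimRank: s(a, b) averages similarity over in-neighbor pairs."""
    in_deg = adj.sum(axis=0)
    W = adj / np.maximum(in_deg, 1)   # W[i, j] = 1/|I(j)| if edge i -> j
    S = np.eye(adj.shape[0])
    for _ in range(iters):
        S = C * (W.T @ S @ W)         # O(n^3) per iteration on a dense graph
        np.fill_diagonal(S, 1.0)
    return S
```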
[1212] Incentivizing Honesty among Competitors in Collaborative Learning and Optimization
Florian E. Dorner, Nikola Konstantinov, Georgi Pashaliev, Martin Vechev
Main category: cs.LG
TL;DR: Game-theoretic analysis of collaborative learning among competitors, showing rational clients are incentivized to sabotage others, with proposed mechanisms to ensure honest cooperation.
Details
Motivation: Collaborative learning can improve ML models but faces challenges when participants are competitors (e.g., firms providing recommendations), creating incentives for dishonest updates that damage others' models.
Method: Formulates a game modeling competitive interactions, analyzes single-round mean estimation and multi-round SGD on strongly-convex objectives, proposes incentive mechanisms for honest communication, and empirically tests on non-convex federated learning benchmarks.
Result: Shows rational clients are incentivized to strongly manipulate updates preventing learning, but proposed mechanisms can incentivize honest communication and achieve learning quality comparable to full cooperation.
Conclusion: Explicitly modeling incentives and actions of dishonest clients (rather than assuming malice) enables strong robustness guarantees for collaborative learning among competitors.
Abstract: Collaborative learning techniques have the potential to enable training machine learning models that are superior to models trained on a single entity’s data. However, in many cases, potential participants in such collaborative schemes are competitors on a downstream task, such as firms that each aim to attract customers by providing the best recommendations. This can incentivize dishonest updates that damage other participants’ models, potentially undermining the benefits of collaboration. In this work, we formulate a game that models such interactions and study two learning tasks within this framework: single-round mean estimation and multi-round SGD on strongly-convex objectives. For a natural class of player actions, we show that rational clients are incentivized to strongly manipulate their updates, preventing learning. We then propose mechanisms that incentivize honest communication and ensure learning quality comparable to full cooperation. Lastly, we empirically demonstrate the effectiveness of our incentive scheme on a standard non-convex federated learning benchmark. Our work shows that explicitly modeling the incentives and actions of dishonest clients, rather than assuming them malicious, can enable strong robustness guarantees for collaborative learning.
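To see why unprotected aggregation is fragile in the single-round mean-estimation setting, note that one strategic client can pin a naive average at any target it likes. A toy numpy illustration of the vulnerability (not the paper's game or its incentive mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients = 5
local_means = rng.normal(1.0, 0.1, size=n_clients)  # honest local estimates

# Reporting n * target - sum(others) makes the server's average exactly target,
# so a single rational competitor can steer the unprotected estimate at will.
target = 3.0
reports = local_means.copy()
reports[0] = n_clients * target - local_means[1:].sum()

print(f"honest average:      {local_means.mean():.3f}")
print(f"manipulated average: {reports.mean():.3f}")  # 3.000
```

Incentive mechanisms of the kind the paper proposes aim to make such deviations unprofitable rather than merely detecting them.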
[1213] Reinforcement Learning for Intensity Control: An Application to Choice-Based Network Revenue Management
Huiling Meng, Ningyuan Chen, Xuefeng Gao
Main category: cs.LG
Summary unavailable: the arXiv API request for 2406.05358 was rate-limited (HTTP 429).
[1214] Adversarial Robustness of Graph Transformers
Philipp Foth, Lukas Gosch, Simon Geisler, Leo Schwinn, Stephan Günnemann
Main category: cs.LG
Summary unavailable: the arXiv API request for 2407.11764 was rate-limited (HTTP 429).
[1215] Poisoning with A Pill: Circumventing Detection in Federated Learning
Hanxi Guo, Hao Wang, Tao Song, Tianhang Zheng, Yang Hua, Haibing Guan, Xiangyu Zhang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2407.15389 was rate-limited (HTTP 429).
[1216] FedQUIT: On-Device Federated Unlearning via a Quasi-Competent Virtual Teacher
Alessio Mora, Lorenzo Valerio, Paolo Bellavista, Andrea Passarella
Main category: cs.LG
Summary unavailable: the arXiv API request for 2408.07587 was rate-limited (HTTP 429).
[1217] A Complete Decomposition of KL Error using Refined Information and Mode Interaction Selection
James Enouen, Mahito Sugiyama
Main category: cs.LG
Summary unavailable: the arXiv API request for 2410.11964 was rate-limited (HTTP 429).
[1218] Graph Retention Networks for Dynamic Graphs
Qian Chang, Xia Li, Xiufeng Cheng, Runsong Jia, Jinqing Yang, Guoping Hu, Ciprian Doru Giurcaneanu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2411.11259 was rate-limited (HTTP 429).
[1219] Symmetry-Aware Generative Modeling through Learned Canonicalization
Kusha Sareen, Daniel Levy, Arnab Kumar Mondal, Sékou-Oumar Kaba, Tara Akhound-Sadegh, Siamak Ravanbakhsh
Main category: cs.LG
Summary unavailable: the arXiv API request for 2501.07773 was rate-limited (HTTP 429).
[1220] deCIFer: Crystal Structure Prediction from Powder Diffraction Data using Autoregressive Language Models
Frederik Lizak Johansen, Ulrik Friis-Jensen, Erik Bjørnager Dam, Kirsten Marie Ørnsbjerg Jensen, Rocío Mercado, Raghavendra Selvan
Main category: cs.LG
Summary unavailable: the arXiv API request for 2502.02189 was rate-limited (HTTP 429).
[1221] CapyMOA: Efficient Machine Learning for Data Streams and Online Continual Learning in Python
Heitor Murilo Gomes, Anton Lee, Nuwan Gunasekara, Yibin Sun, Guilherme Weigert Cassales, Justin Liu, Marco Heyden, Vitor Cerqueira, Maroua Bahri, Yun Sing Koh, Bernhard Pfahringer, Albert Bifet
Main category: cs.LG
Summary unavailable: the arXiv API request for 2502.07432 was rate-limited (HTTP 429).
[1222] Quotation-Based Data Retention Mechanism for Data Privacy in LLM-Empowered Network Services
Bin Han, Di Feng, Zexin Fang, Jie Wang, Hans D. Schotten
Main category: cs.LG
Summary unavailable: the arXiv API request for 2503.23001 was rate-limited (HTTP 429).
[1223] An overview of condensation phenomenon in deep learning
Zhi-Qin John Xu, Yaoyu Zhang, Zhangchen Zhou
Main category: cs.LG
Summary unavailable: the arXiv API request for 2504.09484 was rate-limited (HTTP 429).
[1224] Learning Geometry and Topology via Multi-Chart Flows
Hanlin Yu, Søren Hauberg, Marcelo Hartmann, Arto Klami, Georgios Arvanitidis
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.24665 was rate-limited (HTTP 429).
[1225] On the Convergence of Gradient Descent on Learning Transformers with Residual Connections
Zhen Qin, Jinxin Zhou, Jiachen Jiang, Zhihui Zhu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2506.05249 was rate-limited (HTTP 429).
[1226] Lagrangian-based Equilibrium Propagation: generalisation to arbitrary boundary conditions & equivalence with Hamiltonian Echo Learning
Guillaume Pourcel, Debabrota Basu, Maxence Ernoult, Aditya Gilra
Main category: cs.LG
Summary unavailable: the arXiv API request for 2506.06248 was rate-limited (HTTP 429).
[1227] Mixture of Cognitive Reasoners: Modular Reasoning with Brain-Like Specialization
Badr AlKhamissi, C. Nicolò De Sabbata, Greta Tuckute, Zeming Chen, Martin Schrimpf, Antoine Bosselut
Main category: cs.LG
Summary unavailable: the arXiv API request for 2506.13331 was rate-limited (HTTP 429).
[1228] Relative Entropy Pathwise Policy Optimization
Claas Voelcker, Axel Brunnbauer, Marcel Hussing, Michal Nauman, Pieter Abbeel, Eric Eaton, Radu Grosu, Amir-massoud Farahmand, Igor Gilitschenski
Main category: cs.LG
Summary unavailable: the arXiv API request for 2507.11019 was rate-limited (HTTP 429).
[1229] Soft Graph Transformer for MIMO Detection
Jiadong Hong, Lei Liu, Xinyu Bian, Wenjie Wang, Zhaoyang Zhang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.12694 was rate-limited (HTTP 429).
[1230] Learning Aligned Stability in Neural ODEs Reconciling Accuracy with Robustness
Chaoyang Luo, Yan Zou, Nanjing Huang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.21879 was rate-limited (HTTP 429).
[1231] MoveFM-R: Advancing Mobility Foundation Models via Language-driven Semantic Reasoning
Fanjin Meng, Yuan Yuan, Jingtao Ding, Jie Feng, Chonghua Han, Yong Li
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.22403 was rate-limited (HTTP 429).
[1232] Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification
Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, Yiming Ying
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.02779 was rate-limited (HTTP 429).
[1233] Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse
Jinghui Wang, Shaojie Wang, Yinghan Cui, Xuxing Chen, Chao Wang, Liang Huang, Xiaojiang Zhang, Junyi Peng, Li Wan, Haotian Zhang, Bin Chen
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.00413 was rate-limited (HTTP 429).
[1234] Discrete Bayesian Sample Inference for Graph Generation
Ole Petersen, Marcel Kollovieh, Marten Lienen, Stephan Günnemann
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.03015 was rate-limited (HTTP 429).
[1235] A Weak Penalty Neural ODE for Learning Chaotic Dynamics from Noisy Time Series
Xuyang Li, John Harlim, Dibyajyoti Chakraborty, Romit Maulik
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.06609 was rate-limited (HTTP 429).
[1236] From Decision Trees to Boolean Logic: A Fast and Unified SHAP Algorithm
Alexander Nadel, Ron Wettenstein
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.09376 was rate-limited (HTTP 429).
[1237] Fourier-KAN-Mamba: A Novel State-Space Equation Approach for Time-Series Anomaly Detection
Xiancheng Wang, Lin Wang, Rui Wang, Zhibo Zhang, Minghang Zhao
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.15083 was rate-limited (HTTP 429).
[1238] Achieving Skilled and Reliable Daily Probabilistic Forecasts of Wind Power at Subseasonal-to-Seasonal Timescales over France
Eloi Lindas, Yannig Goude, Philippe Ciais
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.16164 was rate-limited (HTTP 429).
[1239] MSTN: A Lightweight and Fast Model for General TimeSeries Analysis
Sumit S Shevtekar, Chandresh K Maurya
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.20577 was rate-limited (HTTP 429).
[1240] BézierFlow: Learning Bézier Stochastic Interpolant Schedulers for Few-Step Generation
Yunhong Min, Juil Koo, Seungwoo Yoo, Minhyuk Sung
Main category: cs.LG
Summary unavailable: the arXiv API request for 2512.13255 was rate-limited (HTTP 429).
[1241] FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows
Yeonwoo Cha, Semin Kim, Jinhyeon Kwon, Seunghoon Hong
Main category: cs.LG
Summary unavailable: the arXiv API request for 2512.15420 was rate-limited (HTTP 429).
[1242] Riemannian Zeroth-Order Gradient Estimation with Structure-Preserving Metrics for Geodesically Incomplete Manifolds
Shaocong Ma, Heng Huang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.08039 was rate-limited (HTTP 429).
[1243] Diagnosing Failure Modes of Neural Operators Across Diverse PDE Families
Lennon Shikhman
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.11428 was rate-limited (HTTP 429).
[1244] Optimal L2 Regularization in High-dimensional Continual Linear Regression
Gilad Karpel, Edward Moroshko, Ran Levinstein, Ron Meir, Daniel Soudry, Itay Evron
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.13844 was rate-limited (HTTP 429).
[1245] MARS: Unleashing the Power of Speculative Decoding via Margin-Aware Verification
Jingwei Song, Xinyu Wang, Hanbin Wang, Xiaoxuan Lei, Bill Shi, Shixin Han, Eric Yang, Xiao-Wen Chang, Lynn Ai
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.15498 was rate-limited (HTTP 429).
[1246] A Hessian-Free Actor-Critic Algorithm for Bi-Level Reinforcement Learning with Applications to LLM Fine-Tuning
Sihan Zeng, Sujay Bhatt, Sumitra Ganesh, Alec Koppel
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.16399 was rate-limited (HTTP 429).
[1247] Knowledge Integration in Differentiable Models: A Comparative Study of Data-Driven, Soft-Constrained, and Hard-Constrained Paradigms for Identification and Control of the Single Machine Infinite Bus System
Shinhoo Kang, Sangwook Kim, Sehyun Yun
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.09667 was rate-limited (HTTP 429).
[1248] Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models
Ivan Sedykh, Nikita Sorokin, Valentin Malykh
Main category: cs.LG
TL;DR: Masked diffusion language models can be accelerated by replacing the full model with a smaller model at certain denoising steps, particularly early and late steps, reducing FLOPs by up to 17% with minimal quality degradation.
Details
Motivation: Masked diffusion language models (MDLMs) have improved but remain computationally expensive at sampling time, because generation requires many full-sequence denoising passes and, unlike autoregressive decoding, cannot benefit from KV caching.
Method: Proposes model scheduling, where a smaller MDLM replaces the full model at a subset of denoising steps. Uses step-importance analysis based on loss and KL divergence between small and large models across timesteps, plus an exhaustive search over coarse step segments, to identify optimal replacement patterns.
Result: Early and late denoising steps are substantially more robust to model replacement than middle steps, enabling up to 17% reduction in FLOPs with only modest degradation in generative perplexity under both unconditional and prefix-conditional generation while preserving sample diversity.
Conclusion: Simple, architecture-agnostic scheduling rules can significantly accelerate MDLM sampling while largely preserving generation quality, with the middle of the diffusion trajectory consistently identified as most sensitive across datasets.
Abstract: Recent advances in masked diffusion language models (MDLMs) narrow the quality gap to autoregressive LMs, but their sampling remains expensive because generation requires many full-sequence denoising passes with a large Transformer and, unlike autoregressive decoding, cannot benefit from KV caching. In this work, we exploit the flexibility of the diffusion framework and study model scheduling, where a smaller MDLM replaces the full model at a subset of denoising steps. Across models trained on OpenWebText and LM1B, we show that early and late denoising steps are substantially more robust to such replacement than middle steps, enabling up to a 17% reduction in FLOPs with only modest degradation in generative perplexity under both unconditional and prefix-conditional generation, while preserving sample diversity. We support these findings with a step-importance analysis based on loss and KL divergence between small and large models across timesteps, as well as an exhaustive search over coarse step segments, both of which identify the middle of the diffusion trajectory as most sensitive consistently across datasets. Our results suggest that simple, architecture-agnostic scheduling rules can significantly accelerate MDLM sampling while largely preserving generation quality.
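A minimal sketch of what such a model schedule could look like at sampling time, with the small model serving the robust early and late steps and the full model kept for the sensitive middle. The model signature, the 20% fractions, and the confidence-based unmasking rule are illustrative assumptions, not details from the paper:

```python
import torch

@torch.no_grad()
def scheduled_sample(big_model, small_model, x, mask_id, num_steps,
                     early_frac=0.2, late_frac=0.2):
    """x: (L,) LongTensor holding mask_id at positions still to be generated.
    Both models are assumed to map (tokens, step) -> (L, V) logits."""
    early_end = int(early_frac * num_steps)
    late_start = num_steps - int(late_frac * num_steps)
    for t in range(num_steps):
        # Small model on the robust early/late steps, big model in the middle.
        model = small_model if t < early_end or t >= late_start else big_model
        logits = model(x, t)
        masked = x == mask_id
        if not masked.any():
            break
        conf, pred = logits.softmax(-1).max(-1)
        conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
        # Unmask the most confident share of the remaining masked positions.
        k = max(1, int(masked.sum().item()) // (num_steps - t))
        idx = conf.topk(k).indices
        x[idx] = pred[idx]
    return x
```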
[1249] Predicting integers from continuous parameters
Bas Maat, Peter Bloem
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.10751 was rate-limited (HTTP 429).
[1250] TreeGrad-Ranker: Feature Ranking via $O(L)$-Time Gradients for Decision Trees
Weida Li, Yaoliang Yu, Bryan Kian Hsiang Low
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.11623 was rate-limited (HTTP 429).
[1251] TS-Haystack: A Multi-Scale Retrieval Benchmark for Time Series Language Models
Nicolas Zumarraga, Thomas Kaar, Ning Wang, Maxwell A. Xu, Max Rosenblattl, Markus Kreft, Kevin O’Sullivan, Paul Schmiedmayer, Patrick Langer, Robert Jakob
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.14200 was rate-limited (HTTP 429).
[1252] LLM-as-Judge on a Budget
Aadirupa Saha, Aniket Wagde, Branislav Kveton
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.15481 was rate-limited (HTTP 429).
[1253] MDP Planning as Policy Inference
David Tolpin
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.17375 was rate-limited (HTTP 429).
[1254] Transformers for dynamical systems learn transfer operators in-context
Anthony Bao, Jeffrey Lai, William Gilpin
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.18679 was rate-limited (HTTP 429).
[1255] Tackling multiphysics problems via finite element-guided physics-informed operator learning
Yusuke Yamazaki, Reza Najian Asl, Markus Apel, Mayu Muramatsu, Shahed Rezaei
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.01420 was rate-limited (HTTP 429).
[1256] Latent attention on masked patches for flow reconstruction
Ben Eze, Luca Magri, Andrea Nóvoa
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.02028 was rate-limited (HTTP 429).
[1257] Design Experiments to Compare Multi-armed Bandit Algorithms
Huiling Meng, Ningyuan Chen, Xuefeng Gao
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.05919 was rate-limited (HTTP 429).
[1258] EvoESAP: Non-Uniform Expert Pruning for Sparse MoE
Zongfang Liu, Shengkun Tang, Boyang Sun, Zhiqiang Shen, Xin Yuan
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.06003 was rate-limited (HTTP 429).
[1259] Large Spikes in Stochastic Gradient Descent: A Large-Deviations View
Benjamin Gess, Daniel Heydecker
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.10079 was rate-limited (HTTP 429).
[1260] GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators
Mattia Rigotti, Nicholas Thumiger, Thomas Frick
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.16849 was rate-limited (HTTP 429).
[1261] AIMER: Calibration-Free Task-Agnostic MoE Pruning
Zongfang Liu, Shengkun Tang, Yifan Shen, Huan Wang, Xin Yuan
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.18492 was rate-limited (HTTP 429).
[1262] Mechanisms of Introspective Awareness
Uzay Macar, Li Yang, Atticus Wang, Peter Wallich, Emmanuel Ameisen, Jack Lindsey
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.21396 was rate-limited (HTTP 429).
[1263] Asymptotic Learning Curves for Diffusion Models with Random Features Score and Manifold Data
Anand Jerry George, Nicolas Macris
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.22962 was rate-limited (HTTP 429).
[1264] Product-Stability: Provable Convergence for Gradient Descent on the Edge of Stability
Eric Gan
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.02653 was rate-limited (HTTP 429).
[1265] Towards Near-Real-Time Telemetry-Aware Routing with Neural Routing Algorithms
Andreas Boltres, Niklas Freymuth, Benjamin Schichtholz, Michael König, Gerhard Neumann
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.02927 was rate-limited (HTTP 429).
[1266] Neural Processes Maintain Calibrated Biomass Estimates Across Spatiotemporal Gaps and Disturbance
Robin Young, Srinivasan Keshav
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.03874 was rate-limited (HTTP 429).
[1267] Domain-Aware Hybrid Quantum Learning via Correlation-Guided Circuit Design for Crime Pattern Analytics
Niloy Das, Apurba Adhikary, Sheikh Salman Hassan, Yu Qiao, Zhu Han, Tharmalingam Ratnarajah, Choong Seon Hong
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.07389 was rate-limited (HTTP 429).
[1268] Breaking the KV Cache Bottleneck: Fan Duality Model Achieves O(1) Decode Memory with Superior Associative Recall
Yasong Fan
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.07716 was rate-limited (HTTP 429).
[1269] A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need
Hananel Hazan, Yanbo Zhang, Benedikt Hartl, Michael Levin
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.08749 was rate-limited (HTTP 429).
[1270] Toward World Models for Epidemiology
Zeeshan Memon, Yiqi Su, Christo Kurisummoottil Thomas, Walid Saad, Liang Zhao, Naren Ramakrishnan
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.09519 was rate-limited (HTTP 429).
[1271] Privacy Against Agnostic Inference Attacks in Vertical Federated Learning
Morteza Varasteh
Main category: cs.LG
Summary unavailable: the arXiv API request for 2302.05545 was rate-limited (HTTP 429).
[1272] A Heavy-Load-Enhanced and Changeable-Periodicity-Perceived Workload Prediction Network
Feiyi Chen, Naijin Liu, Zhen Qin, Hailiang Zhao, Mengchu Zhou, Shuiguang Deng
Main category: cs.LG
Summary unavailable: the arXiv API request for 2308.01917 was rate-limited (HTTP 429).
[1273] Detecting critical treatment effect bias in small subgroups
Piersilvio De Bartolomeis, Javier Abad, Konstantin Donhauser, Fanny Yang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2404.18905 was rate-limited (HTTP 429).
[1274] Stability of a Generalized Debiased Lasso with Applications to Resampling-Based Variable Selection
Jingbo Liu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2405.03063 was rate-limited (HTTP 429).
[1275] Training-Free Multi-User Generative Semantic Communications via Null-Space Diffusion Sampling
Eleonora Grassucci, Jinho Choi, Jihong Park, Riccardo F. Gramaccioni, Giordano Cicchetti, Danilo Comminiello
Main category: cs.LG
Summary unavailable: the arXiv API request for 2405.09866 was rate-limited (HTTP 429).
[1276] Improved identification of breakpoints in piecewise regression and its applications
Taehyeong Kim, Hyungu Lee, Myungjin Kim, Hayoung Choi
Main category: cs.LG
Summary unavailable: the arXiv API request for 2408.13751 was rate-limited (HTTP 429).
[1277] SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration
Joseph M. Cavanagh, Kunyang Sun, Andrew Gritsevskiy, Dorian Bagni, Yingze Wang, Thomas D. Bannister, Teresa Head-Gordon
Main category: cs.LG
Summary unavailable: the arXiv API request for 2409.02231 was rate-limited (HTTP 429).
[1278] Score-matching-based Structure Learning for Temporal Data on Networks
Hao Chen, Kai Yi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2412.07469 was rate-limited (HTTP 429).
[1279] PAT: Privacy-Preserving Adversarial Transfer for Accurate, Robust and Privacy-Preserving EEG Decoding
Xiaoqing Chen, Tianwang Jia, Yunlu Tu, Dongrui Wu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2412.11390 was rate-limited (HTTP 429).
[1280] Online Covariance Matrix Estimation in Sketched Newton Methods
Wei Kuang, Mihai Anitescu, Sen Na
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2502.07114 was rate-limited (HTTP 429).
[1281] Fatigue-PINN: Physics-Informed Fatigue-Driven Motion Modulation and Synthesis
Iliana Loi, Konstantinos Moustakas
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2502.19056 was rate-limited (HTTP 429).
[1282] BLADE: Bayesian Langevin Active Discovery with Replica Exchange for Identification of Complex Systems
Cindy Xiangrui Kong, Haoyang Zheng, Guang Lin
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2503.02983 was rate-limited (HTTP 429).
[1283] Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation
Dipin Khati, Daniel Rodriguez-Cardenas, David N. Palacio, Alejandro Velasco, Michele Tufano, Denys Poshyvanyk
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2503.16771 was rate-limited (HTTP 429).
[1284] Property-Preserving Hashing for $\ell_1$-Distance Predicates: Applications to Countering Adversarial Input Attacks
Hassan Asghar, Chenhan Zhang, Dali Kaafar
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2504.16355 was rate-limited (HTTP 429).
[1285] Adaptive Bidding Policies for First-Price Auctions with Budget Constraints under Non-stationarity
Yige Wang, Jiashuo Jiang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.02796 was rate-limited (HTTP 429).
[1286] Deconstructing Subset Construction – Reducing While Determinizing
John Nicol, Markus Frohme
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.10319 was rate-limited (HTTP 429).
[1287] A robust and adaptive MPC formulation for Gaussian process models
Mathieu Dubied, Amon Lahr, Melanie N. Zeilinger, Johannes Köhler
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2507.02098 was rate-limited (HTTP 429).
[1288] How to Bridge the Sim-to-Real Gap in Digital Twin-Aided Telecommunication Networks
Clement Ruah, Houssem Sifaou, Osvaldo Simeone, Bashir M. Al-Hashimi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2507.07067 was rate-limited (HTTP 429).
[1289] Bias Detection in Emergency Psychiatry: Linking Negative Language to Diagnostic Disparities
Alissa A. Valentine, Lauren A. Lepow, Donald Apakama, Lili Chan, Alexander W. Charney, Isotta Landi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.02651 was rate-limited (HTTP 429).
[1290] Variable Selection Using Relative Importance Rankings
Tien-En Chang, Argon Chen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.10853 was rate-limited (HTTP 429).
[1291] Unsupervised Domain Adaptation for Binary Classification with an Unobservable Source Subpopulation
Chao Ying, Jun Jin, Haotian Zhang, Qinglong Tian, Yanyuan Ma, Sharon Li, Jiwei Zhao
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.20587 was rate-limited (HTTP 429).
[1292] Multidata Causal Discovery for Statistical Hurricane Intensity Forecasting
Saranya Ganesh S, Frederick Iat-Hin Tam, Milton S. Gomez, Marie McGraw, Mark DeMaria, Kate Musgrave, Jakob Runge, Tom Beucler
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.02050 was rate-limited (HTTP 429).
[1293] Smart Paste: Automatically Fixing Copy/Paste for Google Developers
Vincent Nguyen, Guilherme Herzog, José Cambronero, Marcus Revaj, Aditya Kini, Alexander Frömmgen, Maxim Tabachnyk
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.03843 was rate-limited (HTTP 429).
[1294] Locket: Robust Feature-Locking Technique for Language Models
Lipeng He, Vasisht Duddu, N. Asokan
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.12117 was rate-limited (HTTP 429).
[1295] Quantifying Weighted Morphological Content of Large-Scale Structures via Simulation-Based Inference
M. H. Jalali Kanafi, S. M. S. Movahed
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.03636 was rate-limited (HTTP 429).
[1296] A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?
Md. Abdul Awal, Mrigank Rochan, Chanchal K. Roy
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.05476 was rate-limited (HTTP 429).
[1297] House of Dextra: Cross-embodied Co-design for Dexterous Hands
Kehlani Fay, Darin Anthony Djapri, Anya Zorin, James Clinton, Ali El Lahib, Hao Su, Michael T. Tolley, Sha Yi, Xiaolong Wang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.03743 was rate-limited (HTTP 429).
[1298] Time-Frequency Analysis for Neural Networks
Ahmed Abdeljawad, Elena Cordero
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.15992 was rate-limited (HTTP 429).
[1299] On Harnessing Idle Compute at the Edge for Foundation Model Training
Leyang Xue, Meghana Madhyastha, Myungjin Lee, Amos Storkey, Randal Burns, Mahesh K. Marina
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.22142 was rate-limited (HTTP 429).
[1300] Embedding of Low-Dimensional Sensory Dynamics in Recurrent Networks: Implications for the Geometry of Neural Representation
Vikas N. O’Reilly-Shah, Alessandro Maria Selvitella
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.19019 was rate-limited (HTTP 429).
[1301] Sparse clustering via the Deterministic Information Bottleneck algorithm
Efthymios Costa, Ioanna Papatsouma, Angelos Markos
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.20628 was rate-limited (HTTP 429).
[1302] Physics and causally constrained discrete-time neural models of turbulent dynamical systems
Fabrizio Falasca, Laure Zanna
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.13847 was rate-limited (HTTP 429).
[1303] Open Datasets in Learning Analytics: Trends, Challenges, and Best Practices
Valdemar Švábenský, Brendan Flanagan, Erwin Daniel López Zapata, Atsushi Shimada
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.17314 was rate-limited (HTTP 429).
[1304] Solving and learning advective multiscale Darcian dynamics with the Neural Basis Method
Yuhe Wang, Min Wang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.17776 was rate-limited (HTTP 429).
[1305] Pacing Opinion Polarization via Graph Reinforcement Learning
Mingkai Liao
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.23390 was rate-limited (HTTP 429).
[1306] Learning to Unscramble: Simplifying Symbolic Expressions via Self-Supervised Oracle Trajectories
David Shih
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.11164 was rate-limited (HTTP 429).
[1307] PAC learning PDFA from data streams
Robert Baumgartner, Sicco Verwer
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2604.02244 was rate-limited (HTTP 429).
[1308] SkVM: Revisiting Language VM for Skills across Heterogenous LLMs and Harnesses
Le Chen, Erhu Feng, Yubin Xia, Haibo Chen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2604.03088 was rate-limited (HTTP 429).
[1309] Spatiotemporal-Aware Bit-Flip Injection on DNN-based Advanced Driver Assistance Systems (extended version)
Taibiao Zhao, Xiang Zhang, Mingxuan Sun, Ruyi Ding, Xugui Zhou
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2604.03753 was rate-limited (HTTP 429).
cs.MA
[1310] Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate
Yiqing Liu, Hantao Yao, Wu Liu, Allen He, Yongdong Zhang
Main category: cs.MA
TL;DR: HCP-MAD introduces a three-stage progressive reasoning framework for multi-agent debate that uses consensus verification to adaptively allocate computational resources based on task complexity, reducing token costs while improving accuracy.
Details
Motivation: Current multi-agent debate frameworks optimize intra-round and inter-round interactions separately but still incur high token costs regardless of task complexity. The authors aim to develop an efficient approach that adapts to task difficulty, using lightweight debates for simple tasks and expanded collaboration for complex ones.
Method: HCP-MAD employs a three-stage progressive reasoning mechanism: 1) Heterogeneous Consensus Verification performs rapid consensus verification using a pair of heterogeneous agents for early stopping; 2) Heterogeneous Pair-Agent Debate applies an adaptive stopping criterion to dynamically terminate mutual critique; 3) Escalated Collective Voting addresses unresolved tasks by aggregating diverse perspectives from additional agents.
Result: Experiments across multiple benchmarks show that HCP-MAD significantly enhances accuracy while substantially reducing token costs compared to existing multi-agent debate approaches.
Conclusion: The proposed HCP-MAD framework effectively addresses efficiency issues in multi-agent debate by using consensus as a dynamic signal to facilitate progressive reasoning, enabling adaptive solutions across varying task complexities.
Abstract: Multi-Agent Debate (MAD) is a collaborative framework in which multiple agents iteratively refine solutions through the generation of reasoning and alternating critique cycles. Current work primarily optimizes intra-round topologies and inter-round interactions separately, which still results in high token costs regardless of task complexity. This work introduces Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate (HCP-MAD), leveraging consensus as a dynamic signal to facilitate progressive reasoning. The core motivation is that a majority of straightforward tasks can be effectively resolved via lightweight pair-agent debates, while complex tasks require expanded collaboration. Consequently, HCP-MAD employs a three-stage progressive reasoning mechanism to develop adaptive solutions across varying task complexities. Firstly, Heterogeneous Consensus Verification conducts rapid consensus verification using a pair of heterogeneous agents for early stopping. Next, the Heterogeneous Pair-Agent Debate applies an adaptive stopping criterion to dynamically terminate mutual critique of recorded reasoning traces. Finally, the unresolved tasks are addressed through Escalated Collective Voting by aggregating diverse perspectives from additional agents. Experiments across multiple benchmarks show that HCP-MAD significantly enhances accuracy while substantially reducing token costs.
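To make the three-stage escalation concrete, here is a minimal control-flow sketch in Python. The agent callables, the exact-match consensus test, and the majority-vote rule are illustrative stand-ins of my own; the paper's actual consensus verification and adaptive stopping criteria are not specified at this level of detail.

```python
from collections import Counter

def hcp_mad(task, agent_a, agent_b, extra_agents, max_debate_rounds=3):
    """Three-stage escalation: consensus check -> pair debate -> collective voting.
    Agents are callables (task, context) -> answer string. Illustrative only."""
    # Stage 1: Heterogeneous Consensus Verification (early stop on agreement).
    ans_a, ans_b = agent_a(task, ""), agent_b(task, "")
    if ans_a == ans_b:
        return ans_a  # lightweight path: two cheap calls settle the task

    # Stage 2: Heterogeneous Pair-Agent Debate with an adaptive stop.
    transcript = [ans_a, ans_b]
    for _ in range(max_debate_rounds):
        ctx = "\n".join(transcript)
        ans_a, ans_b = agent_a(task, ctx), agent_b(task, ctx)
        transcript += [ans_a, ans_b]
        if ans_a == ans_b:  # critique has converged; stop debating
            return ans_a

    # Stage 3: Escalated Collective Voting across additional agents.
    votes = [ans_a, ans_b] + [ag(task, "\n".join(transcript)) for ag in extra_agents]
    return Counter(votes).most_common(1)[0][0]
```

The point of the structure is visible in the cost profile: easy tasks exit after two calls, and the expensive voting stage only runs when the pair never converges.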
[1311] MPAC: A Multi-Principal Agent Coordination Protocol for Interoperable Multi-Agent Collaboration
Kaiyang Qian, Xinmin Fang, Zhengxiong Li
Main category: cs.MA
TL;DR: MPAC is a multi-principal agent coordination protocol that enables independent agents from different owners to coordinate over shared state, addressing gaps in existing single-principal protocols.
Details
Motivation: Current AI agent protocols (MCP for tool invocation and A2A for task delegation) assume single controlling principals, failing when independent agents from different owners need to coordinate over shared state like collaborative coding, trip planning, or organizational negotiations.
Method: MPAC introduces a five-layer protocol with explicit coordination semantics: Session, Intent, Operation, Conflict, and Governance layers. It includes 21 message types, state machines, Lamport-clock causal watermarking, optimistic concurrency control, and pluggable governance for human arbitration.
Result: Two interoperable reference implementations in Python/TypeScript with 223 tests, JSON Schema suite, and seven demos. A three-agent code review benchmark shows 95% reduction in coordination overhead and 4.8x speedup versus human-mediated baseline.
Conclusion: MPAC fills a critical gap in multi-agent coordination for independent principals, providing structured coordination that eliminates ad-hoc chat and manual merging while preserving per-agent decision time.
Abstract: The AI agent ecosystem has converged on two protocols: the Model Context Protocol (MCP) for tool invocation and Agent-to-Agent (A2A) for single-principal task delegation. Both assume a single controlling principal, meaning one person or organization that owns every agent. When independent principals’ agents must coordinate over shared state, such as engineers’ coding agents editing the same repository, family members planning a shared trip, or agents from different organizations negotiating a joint decision, neither protocol applies, and coordination collapses to ad-hoc chat, manual merging, or silent overwrites. We present MPAC (Multi-Principal Agent Coordination Protocol), an application-layer protocol that fills this gap with explicit coordination semantics across five layers: Session, Intent, Operation, Conflict, and Governance. MPAC makes intent declaration a precondition for action, represents conflicts as first-class structured objects, and supports human-in-the-loop arbitration through a pluggable governance layer. The specification defines 21 message types, three state machines with normative transition tables, Lamport-clock causal watermarking, two execution models, three security profiles, and optimistic concurrency control on shared state. We release two interoperable reference implementations in Python and TypeScript with 223 tests, a JSON Schema suite, and seven live multi-agent demos. A controlled three-agent code review benchmark shows a 95 percent reduction in coordination overhead and a 4.8 times wall-clock speedup versus a serialized human-mediated baseline, with per-agent decision time preserved. The speedup comes from eliminating coordination waits, not compressing model calls. Specification, implementations, and demos are open source.
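MPAC's causal watermarking builds on Lamport clocks, a standard primitive that fits in a few lines. The `Node` and `Message` shapes below are hypothetical simplifications for illustration, not the protocol's actual 21 message types or state machines.

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    payload: str
    watermark: int  # Lamport timestamp carried with the message

class Node:
    """Minimal Lamport-clock participant: sends tick the clock, receives merge it,
    so operations stamped on shared state stay causally ordered."""
    def __init__(self, name: str):
        self.name, self.clock = name, 0

    def send(self, payload: str) -> Message:
        self.clock += 1  # local event: tick before stamping
        return Message(self.name, payload, self.clock)

    def receive(self, msg: Message) -> None:
        self.clock = max(self.clock, msg.watermark) + 1  # merge rule

# Any operation b stamps after seeing a's message is ordered after it.
a, b = Node("agent-of-principal-A"), Node("agent-of-principal-B")
m = a.send("declare-intent: edit shared doc, section 3")
b.receive(m)
assert b.send("counter-proposal").watermark > m.watermark
```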
[1312] CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation
Aarush Sinha, Arion Das, Soumyadeep Nag, Charan Karnati, Shravani Nag, Chandra Vadhan Raj, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das
Main category: cs.MA
TL;DR: LLM-driven agents in a simulated NYC environment learn strategic behavior through adversarial interactions, with Blue agents optimizing for efficient navigation while avoiding billboard exposure, and Red agents using persuasive language to divert Blue agents for advertising revenue.
Details
Motivation: To understand how strategic behavior emerges in LLM-driven autonomous agents in multi-agent environments, particularly focusing on alignment challenges when agents have opposing incentives and hidden identities.
Method: Created a large-scale multi-agent simulation of NYC with LLM-driven agents having opposing goals: Blue agents aim for efficient navigation, while Red agents use persuasive language to divert Blue agents toward billboard-heavy routes. Used an iterative simulation pipeline with Kahneman-Tversky Optimization (KTO) to update agent policies across repeated interaction rounds.
Result: Blue agents improved task success from 46.0% to 57.3% across iterations, but susceptibility remained high at 70.7%. Later policies showed stronger selective cooperation while preserving trajectory efficiency. A persistent safety-helpfulness trade-off emerged: policies better at resisting adversarial steering didn’t simultaneously maximize task completion.
Conclusion: LLM agents can exhibit limited strategic behavior including selective trust and deception, but remain highly vulnerable to adversarial persuasion. The study reveals fundamental trade-offs between safety and helpfulness in multi-agent strategic environments.
Abstract: As large language models (LLMs) are increasingly deployed as autonomous agents, understanding how strategic behavior emerges in multi-agent environments has become an important alignment challenge. We take a neutral empirical stance and construct a controlled environment in which strategic behavior can be directly observed and measured. We introduce a large-scale multi-agent simulation in a simplified model of New York City, where LLM-driven agents interact under opposing incentives. Blue agents aim to reach their destinations efficiently, while Red agents attempt to divert them toward billboard-heavy routes using persuasive language to maximize advertising revenue. Hidden identities make navigation socially mediated, forcing agents to decide when to trust or deceive. We study policy learning through an iterative simulation pipeline that updates agent policies across repeated interaction rounds using Kahneman-Tversky Optimization (KTO). Blue agents are optimized to reduce billboard exposure while preserving navigation efficiency, whereas Red agents adapt to exploit remaining weaknesses. Across iterations, the best Blue policy improves task success from 46.0% to 57.3%, although susceptibility remains high at 70.7%. Later policies exhibit stronger selective cooperation while preserving trajectory efficiency. However, a persistent safety-helpfulness trade-off remains: policies that better resist adversarial steering do not simultaneously maximize task completion. Overall, our results show that LLM agents can exhibit limited strategic behavior, including selective trust and deception, while remaining highly vulnerable to adversarial persuasion.
[1313] Toward Explanatory Equilibrium: Verifiable Reasoning as a Coordination Mechanism under Asymmetric Information
Feliks Bańka, Jarosław A. Chudziak
Main category: cs.MA
TL;DR: LLM agents exchange structured reasoning artifacts (auditable claims + text) with probabilistic audits under resource constraints to enable scalable, safety-preserving coordination in multi-agent systems.
Details
Motivation: LLM-based agents increasingly coordinate with natural-language reasoning, but this reasoning incurs computational cost and may degenerate into unreliable "cheap talk" without verification mechanisms.
Method: Introduces Explanatory Equilibrium as a design principle, with structured reasoning artifacts (auditable claims paired with concise text) and bounded verification through probabilistic audits under explicit resource constraints. Studies a finance-inspired LLM setting with Trader and Risk Manager agents.
Result: In ambiguous proposals, auditable artifacts prevent coordination collapse seen with conservative validation under asymmetric information. Structured reasoning enables coordination while maintaining low bad-approval rates across audit intensities, budgets, and incentive regimes.
Conclusion: Scalable, safety-preserving coordination in LLM-based multi-agent systems depends fundamentally on disciplined externalization of reasoning into partially verifiable artifacts, not just audit strength.
Abstract: LLM-based agents increasingly coordinate decisions in multi-agent systems, often attaching natural-language reasoning to actions. However, reasoning is neither free nor automatically reliable: it incurs computational cost and, without verification, may degenerate into persuasive cheap talk. We introduce Explanatory Equilibrium as a design principle for explanation-aware multi-agent systems and study a regime in which agents exchange structured reasoning artifacts-auditable claims paired with concise text-while receivers apply bounded verification through probabilistic audits under explicit resource constraints. We contribute (i) a minimal mechanism-level exchange-audit model linking audit intensity, misreporting incentives, and reasoning costs, and (ii) empirical evidence from a finance-inspired LLM setting involving a Trader and a Risk Manager. In ambiguous, borderline proposals, auditable artifacts prevent the cost of silence driven by conservative validation under asymmetric information: without structured claims, approval and welfare collapse. By contrast, structured reasoning unlocks coordination while maintaining consistently low bad-approval rates across audit intensities, audit budgets, and incentive regimes. Our results suggest that scalable, safety-preserving coordination in LLM-based multi-agent systems depends not only on audit strength, but more fundamentally on disciplined externalization of reasoning into partially verifiable artifacts.
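The core trade-off in an exchange-audit model of this kind reduces to an expected-value inequality: misreporting is deterred once the audit probability makes the expected penalty outweigh the expected gain. A worked sketch with made-up payoff numbers (the paper's actual model links these to audit intensity and reasoning costs):

```python
def min_audit_probability(gain_misreport: float, gain_honest: float, penalty: float) -> float:
    """Smallest audit probability p at which misreporting stops paying:
    (1 - p) * gain_misreport - p * penalty <= gain_honest."""
    return max(0.0, (gain_misreport - gain_honest) / (gain_misreport + penalty))

# Illustrative numbers only: a bad approval nets 5 vs 2 for honesty, and a
# caught misreport costs 10, so audits must fire at least 20% of the time.
p = min_audit_probability(gain_misreport=5.0, gain_honest=2.0, penalty=10.0)
print(f"deterrence threshold: p >= {p:.2f}")  # 0.20
```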
[1314] Prosociality by Coupling, Not Mere Observation: Homeostatic Sharing in an Inspectable Recurrent Artificial Life Agent
Aishik Sanyal
Main category: cs.MA
TL;DR: Minimal artificial agent architecture shows prosocial behavior emerges when another agent’s internal state is coupled to self-regulation, without explicit social rewards.
Details
Motivation: To understand the minimal conditions for prosocial behavior in artificial agents, avoiding the ambiguity introduced by explicit social rewards or direct access to the partner's internal state.
Method: Extends ReCoN-Ipsundrum with an explicit homeostat and a social coupling channel while keeping planning self-directed. Tests four matched conditions in two toy worlds: FoodShareToy (one-step) and SocialCorridorWorld (multi-step). Uses an exact solver and experimental runs with various coupling strengths.
Result: Affectively coupled conditions always help while self-only and partner-observing conditions never help. Coupling flips help rate from 0 to 1, cuts rescue latency from 18 to 9 steps, and raises mutual viability from 0.15 to 0.33. Coupling sweep shows load-dependent feasibility boundary.
Conclusion: In this minimal architecture, helping emerges when another agent’s need is routed into self-regulation, without requiring explicit social rewards or partner-welfare terms.
Abstract: Artificial agents can be made to “help” for many reasons, including explicit social reward, hard-coded prosocial bonuses, or direct access to another agent’s internal state. Those possibilities make minimal prosocial behavior hard to interpret. Building on ReCoN-Ipsundrum, an inspectable recurrent controller with affect-coupled regulation, I add an explicit homeostat and a social coupling channel while keeping planning strictly self-directed: the agent scores only its own predicted internal state, and no partner-welfare reward term is introduced. I compare four matched conditions in two toy worlds. In a one-step FoodShareToy, an exact solver finds a sharp switch from EAT to PASS at $λ* \approx 0.91$ for the default state. In the experimental runs, the self-only and partner-observing conditions never help, whereas the affectively coupled conditions always do. In a multi-step SocialCorridorWorld, the same dissociation reappears: coupling flips help rate and partner recovery from 0 to 1 and cuts rescue latency from 18 to 9 steps, while raising mutual viability from 0.15 to 0.33. Sham lesions preserve helping, but coupling-off and shuffled-partner lesions abolish it in both tasks. A coupling sweep shows a load-dependent feasibility boundary: under low load, helping appears for $λ \geq 0.25$, whereas under medium and high loads no tested value rescues the partner within horizon. The result is a narrow claim for artificial life: in this minimal architecture, helping appears when another’s need is routed into self-regulation.
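The reported EAT-to-PASS switch can be reproduced with a toy scoring rule in which the partner's distress enters the agent's own evaluation with weight lam. The relief values below are invented to place the switch near the paper's λ* ≈ 0.91; the actual homeostat dynamics are certainly richer than this.

```python
def scores(n_self, n_partner, relief_self, relief_partner, lam):
    """Self-directed scoring with affective coupling: the agent evaluates only its
    own predicted internal distress, but the partner's distress enters it with
    weight lam. Toy dynamics, not the paper's actual homeostat."""
    eat = -((n_self - relief_self) + lam * n_partner)      # keep the food
    give = -(n_self + lam * (n_partner - relief_partner))  # pass it to the partner
    return eat, give

# With these numbers the EAT -> PASS switch sits at lam = relief_self / relief_partner = 0.91.
for lam in (0.5, 0.9, 0.95):
    eat, give = scores(n_self=1.0, n_partner=1.0,
                       relief_self=0.91, relief_partner=1.0, lam=lam)
    print(f"lam={lam}: {'PASS' if give > eat else 'EAT'}")
```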
[1315] AgentWebBench: Benchmarking Multi-Agent Coordination in Agentic Web
Shanshan Zhong, Kate Shen, Chenyan Xiong
Main category: cs.MA
TL;DR: AgentWebBench is a benchmark for evaluating how well user agents can synthesize answers by interacting with website-specific content agents in a decentralized web paradigm, showing that multi-agent coordination can approach or even outperform centralized retrieval with model scale.
Details
Motivation: The paper addresses the emerging Agentic Web paradigm, in which autonomous agents help users access online information, shifting information access from centralized retrieval to decentralized coordination between user agents and website-specific content agents.
Method: Introduces the AgentWebBench benchmark with four tasks covering common web information needs: web search, web recommendation, question answering, and deep research. Evaluates seven advanced LLMs and three coordination strategies in multi-agent settings.
Result: Multi-agent coordination generally lags behind centralized retrieval but the gap shrinks with model scale and can even outperform centralized retrieval on question answering. Decentralized access concentrates traffic toward few websites, test time scaling improves performance, and careful planning is essential.
Conclusion: The benchmark enables study of decentralized web paradigm properties. User agents need better planning and answer synthesis, while content agents need more reliable retrieval and evidence quality. The work provides tools for advancing agentic web research.
Abstract: Agentic Web is an emerging paradigm where autonomous agents help users use online information. As the paradigm develops, content providers are also deploying agents to manage their data and serve it through controlled interfaces. This shift moves information access from centralized retrieval to decentralized coordination. To study this setting, we introduce AgentWebBench, a benchmark that evaluates how well a user agent synthesizes answers by interacting with website-specific content agents. We evaluate four tasks that cover common web information needs, spanning ranked retrieval (web search, web recommendation) and open-ended synthesis (question answering, deep research). Across seven advanced LLMs and three coordination strategies, multi-agent coordination generally lags behind centralized retrieval as expected, because user agent cannot directly access the corpus, but the gap shrinks with model scale and can even outperform centralized retrieval on question answering. This benchmark also enables us to study properties of the emerging paradigm of the digital world. We find that decentralized access concentrates traffic toward a small set of websites, test time scaling improves both interaction reliability and task performance, and strong results require sufficient interactions guided by careful planning. Finally, our failure analysis suggests that user agents need better planning and answer synthesis, while content agents need more reliable retrieval and evidence quality. Code, data, and APIs are released on https://github.com/cxcscmu/AgentWebBench.
[1316] Governance by Design: A Parsonian Institutional Architecture for Internet-Wide Agent Societies
Anbang Ruan
Main category: cs.MA
TL;DR: Paper analyzes governance gaps in internet-wide agent societies using Parsons’ AGIL framework, finding minimal institutional infrastructure despite technical growth.
Details
Motivation: The shift from local multi-agent systems to internet-wide agent societies creates governance challenges that require institutional design rather than just risk management or compliance.
Method: Applied Talcott Parsons' AGIL framework (Adaptation, Goal Attainment, Integration, Latency) to derive a 16-cell institutional architecture, then diagnostically analyzed the OpenClaw ecosystem and the broader agent-native protocol stack using a recursive sub-function analysis.
Result: Found at most 19% sub-function coverage in OpenClaw, with zero functional inter-pillar pathways; analysis of the broader protocol stack revealed the same structural pattern, indicating the governance gap is systemic rather than a symptom of ecosystem immaturity.
Conclusion: Institutional design is needed before social patterns calcify; paper provides prioritized roadmap for missing governance infrastructure in agent societies.
Abstract: The dominant paradigm of local multi-agent systems – orchestrated, enterprise-bounded pipelines – is being superseded by internet-wide agent societies in which autonomous agents discover each other through open registries, interact without central orchestrators, and generate emergent social behaviors. We argue that governing such societies requires institutional design, not merely risk enumeration or process compliance. Applying Talcott Parsons’ AGIL framework – four functional imperatives (Adaptation, Goal Attainment, Integration, Latency) every viable social system must satisfy – we derive a prescriptive sixteen-cell institutional architecture for internet-wide agent governance. Diagnostically applied to the OpenClaw ecosystem (250,000+ GitHub stars, 2M+ monthly users, 770,000+ registered agents) via a recursive sub-function analysis (64 binary indicators across 16 cells), we find at most 19% sub-function coverage (sensitivity range 17-30%) – potential rather than operative capacity, since zero inter-cell coordination prevents existing infrastructure from participating in inter-pillar interchange. A complementary interchange media assessment finds zero of twelve inter-pillar pathways functional: the ecosystem has technical infrastructure but no active governance, no coordination layer, and no normative grounding, with the Fiduciary and Political pillars most severely underserved. Extending the diagnostic to the broader agent-native protocol stack (MCP, A2A, ANP, x402, ERC-8004), independent development teams reproduce the same structural pattern – confirming the governance gap is a feature of market-driven development, not ecosystem immaturity. Institutional design is most effective before social patterns calcify; we conclude with a prioritized roadmap for the missing governance infrastructure.
[1317] SLALOM: Simulation Lifecycle Analysis via Longitudinal Observation Metrics for Social Simulation
Juhoon Lee, Joseph Seering
Main category: cs.MA
TL;DR: SLALOM framework validates LLM-based social simulations by assessing process fidelity through intermediate waypoint constraints and trajectory alignment, rather than just final outcomes.
Details
Motivation: Current LLM agent simulations for social science face a validity crisis due to the "stopped clock" problem: they verify correct final outcomes but ignore whether the trajectories leading to them are sociologically plausible. The opaque internal reasoning of LLMs makes verifying social mechanisms challenging.
Method: SLALOM (Simulation Lifecycle Analysis via Longitudinal Observation Metrics) shifts validation from outcome verification to process fidelity. It treats social phenomena as multivariate time series that must traverse specific SLALOM gates (intermediate waypoint constraints representing distinct phases). Uses Dynamic Time Warping (DTW) to align simulated trajectories with empirical ground truth.
Result: Provides a quantitative metric to assess structural realism, helping differentiate plausible social dynamics from stochastic noise and contributing to more robust policy simulation standards.
Conclusion: SLALOM offers a framework for more valid LLM-based social simulations by focusing on process fidelity rather than just outcome verification, addressing the fundamental validity crisis in generative social science.
Abstract: Large Language Model (LLM) agents offer a potentially-transformative path forward for generative social science but face a critical crisis of validity. Current simulation evaluation methodologies suffer from the “stopped clock” problem: they confirm that a simulation reached the correct final outcome while ignoring whether the trajectory leading to it was sociologically plausible. Because the internal reasoning of LLMs is opaque, verifying the “black box” of social mechanisms remains a persistent challenge. In this paper, we introduce SLALOM (Simulation Lifecycle Analysis via Longitudinal Observation Metrics), a framework that shifts validation from outcome verification to process fidelity. Drawing on Pattern-Oriented Modeling (POM), SLALOM treats social phenomena as multivariate time series that must traverse specific SLALOM gates, or intermediate waypoint constraints representing distinct phases. By utilizing Dynamic Time Warping (DTW) to align simulated trajectories with empirical ground truth, SLALOM offers a quantitative metric to assess structural realism, helping to differentiate plausible social dynamics from stochastic noise and contributing to more robust policy simulation standards.
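The DTW alignment SLALOM relies on is a standard dynamic program; a plain NumPy version is below. This is only the alignment primitive, not the gate-based framework built on top of it, and the sigmoid "adoption curve" in the usage is an invented example.

```python
import numpy as np

def dtw_distance(sim: np.ndarray, ref: np.ndarray) -> float:
    """Standard dynamic-time-warping cost between a simulated trajectory and an
    empirical reference, each of shape (T, d)."""
    n, m = len(sim), len(ref)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(sim[i - 1] - ref[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# A time-shifted copy of the same adoption curve aligns far more cheaply
# than a flat trajectory that reaches the same endpoint region.
t = np.linspace(0, 1, 60)
curve = (1 / (1 + np.exp(-10 * (t - 0.5)))).reshape(-1, 1)
shifted = (1 / (1 + np.exp(-10 * (t - 0.6)))).reshape(-1, 1)
print(dtw_distance(curve, shifted), dtw_distance(curve, np.zeros_like(curve)))
```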
[1318] Introduction to Automated Negotiation
Dave de Jonge
Main category: cs.MA
TL;DR: An introductory textbook on automated negotiation for CS students with no prior knowledge; it includes a Python framework for implementing negotiation algorithms.
Details
Motivation: To provide an accessible introduction to automated negotiation for computer science students without requiring prerequisite knowledge, enabling them to learn and experiment with negotiation algorithms.
Method: Textbook approach with educational content and a simple Python-based toy-world negotiation framework that allows readers to implement their own negotiation algorithms and conduct experiments.
Result: A complete educational package that includes both theoretical content and practical tools for learning automated negotiation, with a framework designed to be easily portable to other programming languages
Conclusion: This book successfully provides a beginner-friendly introduction to automated negotiation with practical implementation tools, making the topic accessible to computer science students with minimal prerequisites
Abstract: This book is an introductory textbook targeted towards computer science students who are completely new to the topic of automated negotiation. It does not require any prerequisite knowledge, except for elementary mathematics and basic programming skills. This book comes with a simple toy-world negotiation framework implemented in Python that can be used by the readers to implement their own negotiation algorithms and perform experiments with them. This framework is small and simple enough that any reader who does not like to work in Python should be able to re-implement it very quickly in any other programming language of their choice.
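In the spirit of the book's toy-world setup (whose actual API the abstract does not show), a self-contained alternating-offers sketch over a single price issue, with linear concession on both sides; all names and numbers here are invented:

```python
def negotiate(buyer_limit=80.0, seller_limit=60.0, rounds=10):
    """Alternating offers with linear concession: each side starts at its most
    favorable price and concedes toward its reservation value; the deal closes
    once the offers cross. Illustrative only, not the book's framework API."""
    buyer_start, seller_start = 40.0, 100.0
    for r in range(rounds):
        frac = r / (rounds - 1)                                    # concession schedule
        bid = buyer_start + frac * (buyer_limit - buyer_start)     # buyer concedes upward
        ask = seller_start - frac * (seller_start - seller_limit)  # seller concedes downward
        if bid >= ask:
            return (bid + ask) / 2  # split the remaining gap
    return None  # deadline reached with no agreement

print(f"agreed price: {negotiate():.2f}")  # 70.00 with the defaults
```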
[1319] ClawMobile: Rethinking Smartphone-Native Agentic Systems
Hongchao Du, Shangyu Wu, Qiao Li, Riwei Pan, Jinheng Li, Youcheng Sun, Chun Jason Xue
Main category: cs.MA
TL;DR: ClawMobile introduces a hierarchical architecture for LLM-based smartphone agents that separates language reasoning from deterministic control pathways to improve execution stability on mobile devices.
Details
Motivation: Smartphones present unique challenges for agentic systems due to constrained execution contexts, fragmented control interfaces, and rapidly changing application states. As LLMs evolve into action-oriented agents, there's a need to rethink how reasoning and control are composed for reliable smartphone-native autonomy.
Method: ClawMobile adopts a hierarchical architecture that separates high-level language reasoning from structured, deterministic control pathways. This design improves execution stability and reproducibility on real devices, with the system serving as a case study for mobile LLM runtime design principles.
Result: The paper presents ClawMobile as a concrete implementation that demonstrates improved execution stability and reproducibility on real smartphone devices. The system is open-sourced to facilitate future exploration of mobile agentic systems.
Conclusion: Building robust smartphone-native agentic systems requires principled coordination between probabilistic planning and deterministic system interfaces. The paper identifies key challenges in efficiency, adaptability, and stability for mobile LLM runtimes.
Abstract: Smartphones represent a uniquely challenging environment for agentic systems. Unlike cloud or desktop settings, mobile devices combine constrained execution contexts, fragmented control interfaces, and rapidly changing application states. As large language models (LLMs) evolve from conversational assistants to action-oriented agents, achieving reliable smartphone-native autonomy requires rethinking how reasoning and control are composed. We introduce ClawMobile as a concrete exploration of this design space. ClawMobile adopts a hierarchical architecture that separates high-level language reasoning from structured, deterministic control pathways, improving execution stability and reproducibility on real devices. Using ClawMobile as a case study, we distill the design principles for mobile LLM runtimes and identify key challenges in efficiency, adaptability, and stability. We argue that building robust smartphone-native agentic systems demands principled coordination between probabilistic planning and deterministic system interfaces. The implementation is open-sourced~\footnote{https://github.com/ClawMobile/ClawMobile} to facilitate future exploration.
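A minimal sketch of the reasoning/control split the paper argues for: free-form plan steps from a language model are executed only after passing a deterministic validator. The action whitelist, precondition checks, and state shape below are hypothetical, not ClawMobile's actual interfaces.

```python
# Deterministic control layer: the only actions the runtime will ever execute,
# each guarded by a precondition. The LLM never touches the device directly.
ALLOWED_ACTIONS = {
    "tap":    lambda state, arg: arg in state["visible_elements"],
    "type":   lambda state, arg: state.get("focused_field") is not None,
    "scroll": lambda state, arg: arg in ("up", "down"),
}

def execute(plan_step: dict, state: dict) -> bool:
    """Gate a probabilistic plan step through deterministic validation.
    Returns True only if the action is known and its precondition holds."""
    action, arg = plan_step.get("action"), plan_step.get("arg")
    check = ALLOWED_ACTIONS.get(action)
    if check is None or not check(state, arg):
        return False  # reject hallucinated or stale actions instead of failing loudly
    print(f"executing {action}({arg!r})")
    return True

# A stale plan step referencing a vanished button is rejected, not executed.
state = {"visible_elements": {"send_button"}, "focused_field": None}
assert execute({"action": "tap", "arg": "send_button"}, state)
assert not execute({"action": "tap", "arg": "old_button"}, state)
assert not execute({"action": "rm -rf", "arg": "/"}, state)
```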
[1320] Competition and Cooperation of LLM Agents in Games
Jiayi Yao, Cong Chen, Baosen Zhang
Main category: cs.MA
TL;DR: LLM agents in competitive games tend to cooperate rather than converge to Nash equilibria, driven by fairness reasoning revealed through chain-of-thought analysis.
Details
Motivation: As LLM agents are increasingly deployed in competitive multi-agent settings, there's a need to understand whether they converge to equilibria and how their strategic behavior can be characterized, particularly in standard economic games.
Method: Study LLM agent interactions in two standard games (network resource allocation and Cournot competition), analyze their behavior with multi-round prompts and non-zero-sum context, use chain-of-thought analysis to examine reasoning, and propose an analytical framework for LLM agent dynamics.
Result: LLM agents tend to cooperate rather than converge to Nash equilibria when given multi-round prompts and non-zero-sum context, with fairness reasoning being central to this cooperative behavior.
Conclusion: LLM agents exhibit cooperative behavior driven by fairness considerations rather than purely rational equilibrium-seeking behavior, requiring new analytical frameworks to understand their strategic interactions.
Abstract: Large language model (LLM) agents are increasingly deployed in competitive multi-agent settings, raising fundamental questions about whether they converge to equilibria and how their strategic behavior can be characterized. In this paper, we study LLM agent interactions in two standard games: a network resource allocation game and a Cournot competition game. Rather than converging to Nash equilibria, we find that LLM agents tend to cooperate when given multi-round prompts and non-zero-sum context. Chain-of-thought analysis reveals that fairness reasoning is central to this behavior. We propose an analytical framework that captures the dynamics of LLM agent reasoning across rounds and explains these experimental findings.
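The Cournot duopoly has a closed form, which makes the cooperate-versus-Nash finding easy to quantify. With the standard linear inverse demand P = a - b(q1 + q2) and constant unit cost c (the parameter values below are arbitrary; the paper's game setup is not specified here):

```python
def cournot_benchmarks(a=120.0, b=1.0, c=20.0):
    """Closed-form duopoly outcomes for inverse demand P = a - b*(q1+q2), cost c.
    Nash: each firm best-responds; collusion: firms split the monopoly quantity."""
    q_nash = (a - c) / (3 * b)         # per-firm Nash quantity
    pi_nash = (a - c) ** 2 / (9 * b)   # per-firm Nash profit
    q_coop = (a - c) / (4 * b)         # per-firm collusive quantity
    pi_coop = (a - c) ** 2 / (8 * b)   # per-firm collusive profit
    return q_nash, pi_nash, q_coop, pi_coop

qn, pn, qc, pc = cournot_benchmarks()
# Cooperation restricts output (25 < 33.3) and raises per-firm profit (1250 > 1111),
# which is the direction the paper reports LLM agents drifting toward.
print(f"Nash: q={qn:.1f}, profit={pn:.0f}; collusive: q={qc:.1f}, profit={pc:.0f}")
```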
[1321] Decentralized Ergodic Coverage Control in Unknown Time-Varying Environments
Maria G. Mendoza, Victoria Marie Tuck, Chinmay Maheshwari, Shankar Sastry
Main category: cs.MA
TL;DR: Decentralized multi-agent coverage framework for UAVs in unknown, time-varying disaster environments using adaptive ergodic policies with Gaussian Process belief updates
Details
Motivation: Address the challenge of maintaining situational awareness in disaster response, where UAVs need to balance exploration of unobserved regions with monitoring of changing Regions of Interest (ROIs) in dynamic, partially observable environments.
Method: A decentralized multi-agent coverage framework using adaptive ergodic policies, implemented via Markov-chain transition models, that tracks a continuously updated belief over importance maps, with Gaussian Processes performing the online belief updates.
Result: Framework enables agents to spend time in ROIs proportional to estimated importance while preserving exploration to detect environmental changes, showing improved adaptability and transient performance compared to alternative coverage strategies
Conclusion: Proposed framework successfully addresses combined challenges of unknown, time-varying distributions in realistic decentralized, partially observable settings for disaster response UAV operations
Abstract: A key challenge in disaster response is maintaining situational awareness of an evolving landscape, which requires balancing exploration of unobserved regions with sustained monitoring of changing Regions of Interest (ROIs). Unmanned Aerial Vehicles (UAVs) have emerged as an effective response tool, particularly in applications like environmental monitoring and search-and-rescue, due to their ability to provide aerial coverage, withstand hazardous conditions, and navigate quickly and flexibly. However, efficient and adaptable multi-robot coverage with limited sensing in disaster settings and evolving time-varying information maps remains a significant challenge, necessitating better methods for UAVs to continuously adapt their trajectories in response to changes. In this paper, we propose a decentralized multi-agent coverage framework that serves as a high-level planning strategy for adaptive coverage in unknown, time-varying environments under partial observability. Each agent computes an adaptive ergodic policy, implemented via a Markov-chain transition model, that tracks a continuously updated belief over the underlying importance map. Gaussian Processes are used to perform those online belief updates. The resulting policy drives agents to spend time in ROIs proportional to their estimated importance, while preserving sufficient exploration to detect and adapt to time-varying environmental changes. Unlike existing approaches that assume known importance maps, require centralized coordination, or assume a static environment, our framework addresses the combined challenges of unknown, time-varying distributions in a more realistic decentralized and partially observable setting. We compare against alternative coverage strategies and analyze our method’s response to simulated disaster evolution, highlighting its improved adaptability and transient performance in dynamic scenarios.
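A stripped-down sketch of the idea behind the importance-weighted Markov transition rule: from each grid cell, an agent moves to a neighbor with probability proportional to the believed importance there, so dwell time tracks the belief. This is a plain weighted walk standing in for the paper's ergodic-control law; the Gaussian Process belief update is omitted entirely.

```python
import numpy as np

def step(pos, belief, rng):
    """One Markov-chain move on a grid, with neighbors weighted by believed
    importance. Illustrative stand-in for the paper's adaptive ergodic policy."""
    r, c = pos
    rows, cols = belief.shape
    nbrs = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < rows and 0 <= c + dc < cols]
    w = np.array([belief[n] for n in nbrs]) + 1e-6  # floor keeps exploration alive
    return nbrs[rng.choice(len(nbrs), p=w / w.sum())]

rng = np.random.default_rng(0)
belief = np.ones((10, 10))
belief[7:, 7:] = 10.0  # one high-importance ROI in the corner
pos, visits = (0, 0), np.zeros((10, 10))
for _ in range(5000):
    pos = step(pos, belief, rng)
    visits[pos] += 1
print("share of time in ROI:", visits[7:, 7:].sum() / visits.sum())
```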
[1322] MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
Haodong Lei, Junming Liu, Yirong Chen, Ding Wang, Hongsong Wang
Main category: cs.MA
TL;DR: MemCoT is a test-time memory scaling framework that transforms long-context reasoning into iterative stateful information search with multi-view memory perception and dual short-term memory systems.
Details
Motivation: LLMs suffer from hallucinations and catastrophic forgetting during causal reasoning over massive, fragmented long contexts, and existing memory mechanisms treat retrieval as static, single-step passive matching, leading to semantic dilution and contextual fragmentation.
Method: Proposes MemCoT with multi-view long-term memory perception (Zoom-In evidence localization and Zoom-Out contextual expansion) and task-conditioned dual short-term memory (semantic state memory and episodic trajectory memory) for iterative query decomposition and pruning.
Result: MemCoT establishes state-of-the-art performance, enabling several open- and closed-source models to achieve SOTA on LoCoMo benchmark and LongMemEval-S benchmark.
Conclusion: MemCoT successfully addresses fundamental bottlenecks in long-context reasoning by transforming it into iterative stateful information search with sophisticated memory mechanisms.
Abstract: Large Language Models (LLMs) still suffer from severe hallucinations and catastrophic forgetting during causal reasoning over massive, fragmented long contexts. Existing memory mechanisms typically treat retrieval as a static, single-step passive matching process, leading to severe semantic dilution and contextual fragmentation. To overcome these fundamental bottlenecks, we propose MemCoT, a test-time memory scaling framework that redefines the reasoning process by transforming long-context reasoning into an iterative, stateful information search. MemCoT introduces a multi-view long-term memory perception module that enables Zoom-In evidence localization and Zoom-Out contextual expansion, allowing the model to first identify where relevant evidence resides and then reconstruct the surrounding causal structure necessary for reasoning. In addition, MemCoT employs a task-conditioned dual short-term memory system composed of semantic state memory and episodic trajectory memory. This short-term memory records historical search decisions and dynamically guides query decomposition and pruning across iterations. Empirical evaluations demonstrate that MemCoT establishes state-of-the-art performance. Empowered by MemCoT, several open- and closed-source models achieve SOTA performance on the LoCoMo and LongMemEval-S benchmarks.
[1323] Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems
Keyu Li, Jin Gao, Dequan Wang
Main category: cs.MA
TL;DR: Multi-agent systems can amplify rather than mitigate bias through structural echo chambers, with sophisticated architectures often exacerbating prejudice despite individual agent neutrality.
Details
Motivation: To understand how basic multi-agent system topologies and feedback loops influence prejudice accumulation, challenging the assumption that multi-agent collaboration naturally dilutes bias.
Method: Introduces the Discrim-Eval-Open benchmark for forced comparative judgments across demographic groups, analyzing bias cascades across various MAS structures while isolating foundational mechanics from advanced swarm complexity.
Result: Structural sophistication frequently exacerbates bias rather than mitigating it, with systemic amplification occurring even when isolated agents operate neutrally, and identifying ‘Trigger Vulnerability’ where objective context accelerates polarization.
Conclusion: Structural complexity does not guarantee ethical robustness in multi-agent systems, establishing a crucial baseline for understanding bias amplification in MAS architectures.
Abstract: While Multi-Agent Systems (MAS) are increasingly deployed for complex workflows, their emergent properties, particularly the accumulation of bias, remain poorly understood. Because real-world MAS are too complex to analyze entirely, evaluating their ethical robustness requires first isolating their foundational mechanics. In this work, we conduct a baseline empirical study investigating how basic MAS topologies and feedback loops influence prejudice. Contrary to the assumption that multi-agent collaboration naturally dilutes bias, we hypothesize that structured workflows act as echo chambers, amplifying minor stochastic biases into systemic polarization. To evaluate this, we introduce Discrim-Eval-Open, an open-ended benchmark that bypasses individual model neutrality through forced comparative judgments across demographic groups. Analyzing bias cascades across various structures reveals that architectural sophistication frequently exacerbates bias rather than mitigating it. We observe systemic amplification even when isolated agents operate neutrally, and identify a ‘Trigger Vulnerability’ where injecting purely objective context drastically accelerates polarization. By stripping away advanced swarm complexity to study foundational dynamics, we establish a crucial baseline: structural complexity does not guarantee ethical robustness. Our code is available at https://github.com/weizhihao1/MAS-Bias.
cs.MM
[1324] Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification
Wenhao Qian, Zhenzhen Hu, Zijie Song, Jia Li
Main category: cs.MM
TL;DR: CDGLT is a training-efficient framework for multimodal metaphor identification using concept drift via SLERP interpolation and adapted LN tuning, achieving SOTA on MET-Meme benchmark with reduced computational costs.
Details
Motivation: Multimodal metaphors (e.g., in internet memes) present unique challenges due to unconventional expressions and implied meanings. Existing methods struggle to bridge literal-figurative gaps, and generative approaches have high computational costs.
Method: CDGLT uses Concept Drift with Spherical Linear Interpolation (SLERP) of CLIP embeddings to generate divergent concept embeddings that bridge literal-figurative gaps, plus adapted LayerNorm tuning and prompt construction strategies for efficient multimodal metaphor identification.
Result: Achieves state-of-the-art performance on MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies confirm effectiveness of both Concept Drift and LN tuning.
Conclusion: CDGLT represents a significant step toward efficient and accurate multimodal metaphor understanding, offering a training-efficient alternative to costly generative approaches.
Abstract: Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces Concept Drift Guided LayerNorm Tuning (CDGLT), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that leverages Spherical Linear Interpolation (SLERP) of cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps to alleviate the gap between literal features and the figurative task. (2) A prompt construction strategy that adapts the method of feature extraction and fusion using pre-trained language models for the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available at https://github.com/Qianvenh/CDGLT.
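Since SLERP is the one fully specified ingredient of Concept Drift, a compact sketch may help. The function below is the textbook spherical interpolation on the unit hypersphere; the random tensors standing in for CLIP image/text embeddings and the choice t=0.5 are assumptions for illustration only.

```python
import torch

def slerp(u, v, t, eps=1e-7):
    """Spherical linear interpolation on the unit hypersphere."""
    u = u / u.norm(dim=-1, keepdim=True)
    v = v / v.norm(dim=-1, keepdim=True)
    cos = (u * v).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    theta = torch.acos(cos)
    return (torch.sin((1 - t) * theta) * u + torch.sin(t * theta) * v) / torch.sin(theta)

# Stand-ins for CLIP image/text embeddings of a meme (512-d in CLIP ViT-B):
img_emb, txt_emb = torch.randn(1, 512), torch.randn(1, 512)
drifted = slerp(img_emb, txt_emb, t=0.5)  # a "drifted" concept between modalities
```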
eess.AS
[1325] Toward using Speech to Sense Student Emotion in Remote Learning Environments
Sargam Vyas, Bogdan Vlasenko, André Mayoraz, Egon Werlen, Per Bergamin, Mathew Magimai.-Doss
Main category: eess.AS
TL;DR: Speech-based self-control tasks can detect student emotions in remote learning through dimensional emotion prediction from spontaneous speech
Details
Motivation: Remote learning lacks emotional cues compared to in-person teaching, creating a need for emotion sensing to enhance learning experiences.
Method: Developed a dataset of spontaneous monologue speech from self-control tasks, then conducted subjective listener evaluations and automatic dimensional emotion prediction (valence, arousal, dominance).
Result: Speech from self-control tasks shows perceptible emotional variations, and these variations can be automatically predicted, enabling emotion sensing in remote learning
Conclusion: Speech-based self-control tasks can sense student emotions in remote learning, opening opportunities to integrate paralinguistic speech processing for instructional design and feedback
Abstract: With advancements in multimodal communication technologies, remote learning environments, such as distance universities, are increasing. Remote learning typically happens asynchronously. As a consequence, unlike face-to-face in-person classroom teaching, it lacks sufficient emotional cues for making learning a pleasant experience. Motivated by advances made in the paralinguistic speech processing community on emotion prediction, in this paper we explore the use of speech for sensing students’ emotions by building upon speech-based self-control tasks developed to aid effective remote learning. More precisely, we investigate: (a) whether speech acquired through self-control tasks exhibits perceptible variation along the valence, arousal, and dominance dimensions; and (b) whether those dimensional emotion variations can be automatically predicted. We address these two research questions by developing a dataset containing spontaneous monologue speech acquired as open responses to self-control tasks and by carrying out subjective listener evaluations and automatic dimensional emotion prediction studies on that dataset. Our investigations indicate that speech-based self-control tasks can be a means to sense student emotion in remote learning environments. This opens potential avenues to seamlessly integrate paralinguistic speech processing technologies in the remote learning loop for enhancing learning experiences through instructional design and feedback generation.
[1326] Direction-Preserving MIMO Speech Enhancement Using a Neural Covariance Estimator
Thomas Deppisch
Main category: eess.AS
TL;DR: A neural network-based method for blind multichannel speech enhancement that preserves spatial information by estimating noise covariance matrices, enabling downstream audio processing tasks.
Details
Motivation: Most speech enhancement methods produce single-channel outputs, losing spatial information needed for applications like beamforming, binaural rendering, and direction-of-arrival estimation. There's a need for direction-preserving multichannel enhancement without requiring oracle information.
Method: Proposes OnlineSpatialNet, a lightweight neural network that estimates scale-normalized Cholesky factors of frequency-domain noise covariance matrices. This is combined with a direction-preserving MIMO Wiener filter to enhance speech while preserving the spatial characteristics of both the target and the residual noise.
Result: The method shows improved speech enhancement, better covariance estimation capability, and better performance in downstream tasks compared to mask-based baselines. It approaches oracle performance with significantly fewer parameters and lower computational cost.
Conclusion: The proposed blind direction-preserving MIMO speech enhancement method effectively preserves spatial information while enhancing speech, making it suitable for various downstream audio processing applications with low computational requirements.
Abstract: Multichannel speech enhancement is widely used as a front-end in microphone array processing systems. While most existing approaches produce a single enhanced signal, direction-preserving multiple-input multiple-output (MIMO) methods instead aim to provide enhanced multichannel signals that retain directional properties, enabling downstream applications such as beamforming, binaural rendering, and direction-of-arrival estimation. In this work, we propose a fully blind, direction-preserving MIMO speech enhancement method based on neural estimation of the spatial noise covariance matrix. A lightweight OnlineSpatialNet estimates a scale-normalized Cholesky factor of the frequency-domain noise covariance, which is combined with a direction-preserving MIMO Wiener filter to enhance speech while preserving the spatial characteristics of both target and residual noise. In contrast to prior approaches relying on oracle information or mask-based covariance estimation for single-output systems, the proposed method directly targets accurate multichannel covariance estimation with low computational complexity. Experimental results show improved speech enhancement, covariance estimation capability, and performance in downstream tasks over a mask-based baseline, approaching oracle performance with significantly fewer parameters and computational cost.
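To make the filtering step concrete, here is a simplified per-frequency multichannel Wiener filter assembled from Cholesky-parameterized noise covariances, the quantity the paper's network predicts. This is a sketch under strong assumptions (instantaneous signal statistics, no temporal smoothing), not the paper's actual direction-preserving formulation.

```python
import numpy as np

def mimo_wiener(Y, L, eps=1e-6):
    """MIMO Wiener filter for one STFT frame, per frequency bin.
    Y: (F, M) noisy multichannel spectra. L: (F, M, M) lower-triangular
    Cholesky factors of the estimated noise covariance. The speech
    covariance is crudely approximated from the instantaneous observation."""
    F, M = Y.shape
    X = np.empty_like(Y)
    for f in range(F):
        Phi_n = L[f] @ L[f].conj().T                 # noise covariance from Cholesky factor
        Phi_y = np.outer(Y[f], Y[f].conj())          # instantaneous signal statistics
        W = (Phi_y - Phi_n) @ np.linalg.inv(Phi_y + eps * np.eye(M))
        X[f] = W @ Y[f]                              # enhanced M-channel output
    return X

# Toy usage: 257 frequency bins, 4 microphones.
Y = np.random.randn(257, 4) + 1j * np.random.randn(257, 4)
L = np.tile(0.1 * np.eye(4), (257, 1, 1)).astype(complex)
X = mimo_wiener(Y, L)
```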
[1327] Teaching the Teachers: Boosting unsupervised domain adaptation in speech recognition by ensemble update
Rehan Ahmad, Muhammad Umar Farooq, Qihang Feng, Thomas Hain
Main category: eess.AS
TL;DR: Joint teacher-student training for unsupervised domain adaptation in speech recognition, improving WER on target domains without labeled data.
Details
Motivation: Speech recognition systems perform poorly on unseen domains, and existing unsupervised domain adaptation methods using teacher-student training still lag behind supervised performance. Current approaches require sequential training of teacher models, which is inefficient.
Method: Proposes simultaneous joint training of an ensemble of teacher models along with a single student model, eliminating sequential training. Uses labeled source datasets (AMI, WSJ, LS360) and an unlabeled target domain (SwitchBoard) for adaptation.
Result: Achieves 4.6% WER improvement on Switchboard eval00 test set compared to multi-stage and iterative training methods, demonstrating better domain adaptation performance.
Conclusion: Joint teacher-student training is more efficient and effective for unsupervised domain adaptation in speech recognition, outperforming sequential training approaches.
Abstract: Speech recognition systems often struggle with data domains that have not been included in training. To address this, unsupervised domain adaptation has been explored with ensemble and multi-stage teacher-student training methods reducing the word error rate. Despite improvements, the error rate remains much higher than that achieved with supervised in-domain training. This work proposes a more efficient strategy that simultaneously updates the ensemble of teacher models along with the single student model, eliminating the need for sequential model training. The joint update improves the word error rate of the student model, which benefits from the progressively enhanced teacher models. Experiments are conducted with three labelled source datasets, namely AMI, WSJ, and LS360, and one unlabeled target domain, SwitchBoard. The results show that the proposed method improves the WER by 4.6% on the Switchboard eval00 test set, thus outperforming multi-stage and iterative training methods.
[1328] Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMs
Hagai Aronowitz, Zvi Kons, Avihu Dekel, George Saon, Ron Hoory
Main category: eess.AS
TL;DR: Granite-speech LLM adapted for speaker-attributed ASR using speaker cluster identification tags and data augmentation with concatenated multi-speaker conversations, outperforming conventional diarization+ASR pipelines.
Details
Motivation: To extend speech-aware LLMs beyond basic transcription/translation to handle speaker-attributed ASR, which traditionally requires separate diarization and ASR components, aiming for more integrated and accurate speaker-aware transcription.
Method: Adapts the Granite-speech LLM for SAA with minimal architectural changes, introduces speaker cluster identification tags (e.g., [Speaker 1 cluster 42]) trained jointly with SAA, and uses data augmentation via artificially concatenated multi-speaker conversations to overcome training data limitations.
Result: Superior performance across multiple benchmarks compared to conventional sequential speaker diarization + ASR pipelines, demonstrating effective adaptation of speech-aware LLMs for speaker-attributed tasks.
Conclusion: Speech-aware LLMs can be effectively adapted for speaker-attributed ASR with minimal changes, and joint training with speaker cluster tags plus data augmentation yields state-of-the-art performance, offering a more integrated alternative to traditional pipelines.
Abstract: Speaker-Attributed Automatic Speech Recognition (SAA) enhances traditional ASR systems by incorporating relative speaker identity tags directly into the transcript (e.g., [Speaker 1]:, [Speaker 2]:). In this work, we extend the capabilities of Granite-speech, a state-of-the-art speech-aware Large Language Model (LLM) originally trained for transcription and translation. We demonstrate that it can be effectively adapted for SAA with only minimal architectural changes. Our core contribution is the introduction of speaker cluster identification tags (e.g., [Speaker 1 cluster 42]:) which are jointly trained with SAA to significantly improve accuracy. To address limitations in training data, we propose a data augmentation method that uses artificially concatenated multi-speaker conversations. Our approach is evaluated across multiple benchmarks and shows superior performance compared to conventional pipelines that sequentially perform speaker diarization followed by ASR.
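The data augmentation idea, stitching single-speaker utterances into artificial conversations tagged with speaker and cluster labels, is easy to sketch. Everything below (the data layout, the two-speaker alternation, the number of clusters) is a hypothetical illustration of that recipe, not the authors' pipeline.

```python
import random

def make_conversation(utterances, n_turns=4, n_clusters=100, rng=random.Random(0)):
    """Concatenate single-speaker utterances into an artificial multi-speaker
    conversation with speaker/cluster tags.
    `utterances` maps speaker_id -> list of (audio, transcript) pairs."""
    speakers = rng.sample(list(utterances), k=2)
    # Assign each speaker in this conversation a global cluster id.
    cluster = {s: rng.randrange(n_clusters) for s in speakers}
    audio, text = [], []
    for turn in range(n_turns):
        s = speakers[turn % 2]                     # alternate speakers
        wav, txt = rng.choice(utterances[s])
        audio.append(wav)
        rel = speakers.index(s) + 1                # conversation-relative id
        text.append(f"[Speaker {rel} cluster {cluster[s]}]: {txt}")
    return audio, " ".join(text)

# Toy usage (None stands in for waveform arrays):
utts = {"spk_a": [(None, "hello there"), (None, "fine thanks")],
        "spk_b": [(None, "how are you"), (None, "good to hear")]}
wavs, transcript = make_conversation(utts)
```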
[1329] HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
Shuiyuan Wang, Zhixian Zhao, Hongfei Yue, Chengyou Wang, Shuai Wang, Hui Bu, Xin Xu, Lei Xie
Main category: eess.AS
TL;DR: HumDial-EIBench: A benchmark for evaluating emotional intelligence of audio language models using real human dialogues, featuring multiple-choice tasks for emotional tracking/causal reasoning, empathetic response generation, and acoustic-semantic conflict assessment.
Details
Motivation: Existing benchmarks for audio language models' emotional intelligence rely on synthesized speech, single-turn interactions, and subjective open-ended scoring, lacking comprehensive evaluation of real conversational emotional understanding.
Method: Uses real-recorded human dialogues from the ICASSP 2026 HumDial Challenge, reformulates emotional tracking and causal reasoning into multiple-choice questions with adversarial distractors, retains empathetic response generation, and introduces an acoustic-semantic conflict task to assess multimodal robustness.
Result: Evaluation of eight ALMs shows most struggle with multi-turn emotional tracking and implicit causal reasoning, exhibit decoupled textual/acoustic empathy, and demonstrate severe text-dominance bias during cross-modal conflicts.
Conclusion: HumDial-EIBench provides a comprehensive benchmark revealing significant limitations in current ALMs’ emotional intelligence, particularly in multi-turn understanding, implicit reasoning, and balanced multimodal processing.
Abstract: Evaluating the emotional intelligence (EI) of audio language models (ALMs) is critical. However, existing benchmarks mostly rely on synthesized speech, are limited to single-turn interactions, and depend heavily on open-ended scoring. This paper proposes HumDial-EIBench, a comprehensive benchmark for evaluating ALMs’ EI. Using real-recorded human dialogues from the ICASSP 2026 HumDial Challenge, it reformulates emotional tracking and causal reasoning into multiple-choice questions with adversarial distractors, mitigating subjective scoring bias for cognitive tasks. It retains the generation of empathetic responses and introduces an acoustic-semantic conflict task to assess robustness against contradictory multimodal signals. Evaluations of eight ALMs reveal that most models struggle with multi-turn emotional tracking and implicit causal reasoning. Furthermore, all models exhibit decoupled textual and acoustic empathy, alongside a severe text-dominance bias during cross-modal conflicts.
[1330] Gradient-based Optimisation of Modulation Effects
Alistair Carson, Alec Wright, Stefan Bilbao
Main category: eess.AS
TL;DR: A differentiable digital signal processing framework for modeling guitar modulation effects (flanger, chorus, phaser) that trains in time-frequency domain but operates with zero latency at inference.
Details
Motivation: Existing machine learning approaches for analog modulation effect emulation are either limited to single effect classes or suffer from high computational cost/latency compared to traditional digital implementations.
Method: Differentiable digital signal processing framework trained in the time-frequency domain but operating in the time domain at inference. Uses low-frequency weighting of loss functions to avoid local minima when learning delay times.
Result: Model can produce sound output perceptually indistinguishable from analog reference effects in some cases, but challenges remain for effects with long delay times and feedback.
Conclusion: The framework enables zero-latency emulation of analog modulation effects with differentiable optimization, though further work is needed for complex effects with long delays and feedback.
Abstract: Modulation effects such as phasers, flangers and chorus effects are heavily used in conjunction with the electric guitar. Machine learning based emulation of analog modulation units has been investigated in recent years, but most methods have either been limited to one class of effect or suffer from a high computational cost or latency compared to canonical digital implementations. Here, we build on previous work and present a framework for modelling flanger, chorus and phaser effects based on differentiable digital signal processing. The model is trained in the time-frequency domain, but at inference operates in the time-domain, requiring zero latency. We investigate the challenges associated with gradient-based optimisation of such effects, and show that low-frequency weighting of loss functions avoids convergence to local minima when learning delay times. We show that when trained against analog effects units, sound output from the model is in some cases perceptually indistinguishable from the reference, but challenges still remain for effects with long delay times and feedback.
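The low-frequency loss weighting that the abstract credits with avoiding local minima can be illustrated with a short spectral loss. The 1/(1 + f/f_cut) weight and all hyperparameters below are invented for illustration; the paper's exact weighting may differ.

```python
import torch

def lf_weighted_stft_loss(pred, target, n_fft=1024, hop=256, sr=44100, f_cut=2000.0):
    """Spectral L1 loss that emphasises low frequencies, in the spirit of the
    paper's low-frequency loss weighting for learning LFO delay times.
    The 1/(1 + f/f_cut) weight is an illustrative choice, not the paper's."""
    window = torch.hann_window(n_fft)
    P = torch.stft(pred, n_fft, hop, window=window, return_complex=True).abs()
    T = torch.stft(target, n_fft, hop, window=window, return_complex=True).abs()
    freqs = torch.linspace(0, sr / 2, P.shape[-2]).unsqueeze(-1)  # (F, 1)
    w = 1.0 / (1.0 + freqs / f_cut)                               # low-pass weighting
    return (w * (P - T).abs()).mean()

# Usage on a pair of 1-second mono signals:
x, y = torch.randn(1, 44100), torch.randn(1, 44100)
loss = lf_weighted_stft_loss(x, y)
```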
[1331] PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
Changi Hong, Yoonah Song, Hwayoung Park, Chaewoon Bang, Dayeon Gu, Do Hyun Lee, Hong Kook Kim
Main category: eess.AS
TL;DR: A synchronization method for automated dubbing that paraphrases translated text using isochrony for timing constraints and phonetic synchronization for lip-sync, extended to PS-Comet which jointly considers semantic and phonetic similarity.
Details
Motivation: Automated dubbing faces synchronization challenges, including duration matching (isochrony) and lip-synchronization, which are crucial for preserving viewer experience in cross-language video content.
Method: Two-step approach: 1) isochrony via language-model paraphrasing to match target speech duration to the source, 2) phonetic synchronization using DTW with vowel distances from training data, extended to PS-Comet, which jointly optimizes semantic and phonetic similarity.
Result: PS-TTS and PS-Comet TTS outperform baseline TTS on objective metrics and outperform voice actors in Korean-English and English-Korean dubbing; PS-Comet performs best across all tested language pairs (Korean, English, French).
Conclusion: The proposed synchronization methods effectively address automated dubbing challenges, with PS-Comet achieving the best balance between lip-sync accuracy and semantic preservation across multiple languages.
Abstract: Recently, artificial intelligence-based dubbing technology has advanced, enabling automated dubbing (AD) to convert the source speech of a video into target speech in different languages. However, natural AD still faces synchronization challenges such as duration and lip-synchronization (lip-sync), which are crucial for preserving the viewer experience. Therefore, this paper proposes a synchronization method for AD processes that paraphrases translated text, comprising two steps: isochrony for timing constraints and phonetic synchronization (PS) to preserve lip-sync. First, we achieve isochrony by paraphrasing the translated text with a language model, ensuring the target speech duration matches that of the source speech. Second, we introduce PS, which employs dynamic time warping (DTW) with local costs of vowel distances measured from training data so that the target text composes vowels with pronunciations similar to source vowels. Third, we extend this approach to PS-Comet, which jointly considers semantic and phonetic similarity to preserve meaning better. The proposed methods are incorporated into text-to-speech systems, PS-TTS and PS-Comet TTS. The performance evaluation using Korean and English lip-reading datasets and a voice-actor dubbing dataset demonstrates that both systems outperform TTS without PS on several objective metrics and outperform voice actors in Korean-to-English and English-to-Korean dubbing. We extend the experiments to French, testing all pairs among these languages to evaluate cross-linguistic applicability. Across all language pairs, PS-Comet performed best, balancing lip-sync accuracy with semantic preservation, confirming that PS-Comet achieves more accurate lip-sync with semantic preservation than PS alone.
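For readers unfamiliar with the alignment step, here is plain dynamic time warping with a pluggable local cost, the slot where the paper's data-estimated vowel distances would go. The tiny vowel-distance table is invented for illustration.

```python
import numpy as np

def dtw(src, tgt, local_cost):
    """Dynamic time warping between two symbol sequences with an arbitrary
    local cost (e.g., a vowel-distance table estimated from data, per the
    paper). Returns the total alignment cost."""
    n, m = len(src), len(tgt)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = local_cost(src[i - 1], tgt[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Hypothetical vowel-distance table (values invented for illustration):
VDIST = {("a", "o"): 0.3, ("o", "a"): 0.3, ("i", "e"): 0.2, ("e", "i"): 0.2}
cost = dtw(list("aio"), list("aoe"),
           lambda u, v: 0.0 if u == v else VDIST.get((u, v), 1.0))
```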
eess.IV
[1332] Search-MIND: Training-Free Multi-Modal Medical Image Registration
Boya Wang, Ruizhe Li, Chao Chen, Xin Chen
Main category: eess.IV
TL;DR: Search-MIND: Training-free iterative optimization framework for multi-modal image registration using coarse-to-fine strategy with novel loss functions VWMI and S-MIND to handle non-linear intensity relationships and improve generalization.
Details
Motivation: Multi-modal image registration is crucial for precision medicine but faces challenges from non-linear intensity relationships and local optima. Deep learning models enable rapid inference but suffer from generalization collapse on unseen modalities, requiring a more robust solution.
Method: Proposes Search-MIND, a training-free iterative optimization framework with a coarse-to-fine strategy: hierarchical coarse alignment followed by deformable refinement. Introduces two novel loss functions: Variance-Weighted Mutual Information (VWMI), which prioritizes informative tissue regions to shield alignment from background noise, and Search-MIND (S-MIND), which broadens the convergence basin of structural descriptors by considering a larger local search range.
Result: Evaluations on CARE Liver 2025 and CHAOS Challenge datasets show Search-MIND consistently outperforms classical baselines like ANTs and foundation model-based approaches like DINO-reg, offering superior stability across diverse modalities.
Conclusion: Search-MIND provides an effective training-free solution for multi-modal image registration that addresses generalization issues of deep learning models while maintaining robustness across diverse medical imaging modalities.
Abstract: Multi-modal image registration plays a critical role in precision medicine but faces challenges from non-linear intensity relationships and local optima. While deep learning models enable rapid inference, they often suffer from generalization collapse on unseen modalities. To address this, we propose Search-MIND, a training-free, iterative optimization framework for instance-specific registration. Our pipeline utilizes a coarse-to-fine strategy: a hierarchical coarse alignment stage followed by deformable refinement. We introduce two novel loss functions: Variance-Weighted Mutual Information (VWMI), which prioritizes informative tissue regions to shield global alignment from background noise and uniform regions, and Search-MIND (S-MIND), which broadens the convergence basin of structural descriptors by considering a larger local search range. Evaluations on CARE Liver 2025 and CHAOS Challenge datasets show that Search-MIND consistently outperforms classical baselines like ANTs and foundation model-based approaches like DINO-reg, offering superior stability across diverse modalities.
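The VWMI idea, weighting mutual-information contributions by local intensity variance so that flat background matters less, can be sketched directly. The binning, window size, and weighting below are one plausible reading of the abstract, not the paper's definition; intensities are assumed normalized to [0, 1].

```python
import numpy as np
from scipy.ndimage import uniform_filter

def vwmi(fixed, moving, bins=32):
    """Variance-weighted mutual information between two images in [0, 1].
    Each pixel's contribution to the joint histogram is weighted by the
    local variance of the fixed image, down-weighting uniform regions."""
    mu = uniform_filter(fixed, size=7)
    var = np.maximum(uniform_filter(fixed**2, size=7) - mu**2, 0.0)
    w = (var / (var.max() + 1e-12)).ravel()          # per-pixel weights
    f = np.clip((fixed.ravel() * bins).astype(int), 0, bins - 1)
    m = np.clip((moving.ravel() * bins).astype(int), 0, bins - 1)
    joint = np.zeros((bins, bins))
    np.add.at(joint, (f, m), w)                      # weighted joint histogram
    joint /= joint.sum() + 1e-12
    pf, pm = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return (joint[nz] * np.log(joint[nz] / (pf @ pm)[nz])).sum()

# Toy usage: two noisy copies of the same random image.
rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = np.clip(a + 0.05 * rng.standard_normal((64, 64)), 0, 1)
print(vwmi(a, b))
```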
[1333] Memory-efficient optimization of implicit neural representations for CT reconstruction
Mahrokh Najaf, Gregory Ongie
Main category: eess.IV
TL;DR: Memory-efficient stochastic gradient approximation for implicit neural representations in CT reconstruction reduces GPU memory usage while maintaining reconstruction quality.
Details
Motivation: Implicit neural representations (INRs) are efficient for CT reconstruction but require prohibitively large GPU memory when using standard auto-differentiation, due to the many INR evaluations needed to simulate ray projections, especially in 3D imaging.
Method: Proposes a memory-efficient stochastic gradient approximation based on decomposing the gradient into a Jacobian-vector product that is amenable to stochastic subsampling, allowing a trade-off between GPU memory usage and gradient approximation accuracy.
Result: Experiments on synthetic 2D data show gradient approximation uses far less GPU memory than standard INR training while yielding comparable reconstructions in convergence behavior and mean squared error. Also enables memory-efficient 3D cone beam CT reconstruction in sparse-view settings.
Conclusion: The proposed stochastic gradient approximation method successfully addresses GPU memory limitations in INR-based CT reconstruction, making 3D reconstruction feasible while maintaining reconstruction quality.
Abstract: Implicit neural representations (INRs) provide a parameter-efficient and fully differentiable image model for CT reconstruction. However, optimizing INRs for CT reconstruction using standard auto-differentiation techniques can be prohibitively GPU memory-intensive, especially in 3D imaging, due to the large number of INR evaluations needed to simulate ray projections. To address this issue, we propose a memory-efficient stochastic gradient approximation based on decomposing the gradient into a Jacobian-vector product that is amenable to stochastic subsampling. This approximation allows the user to trade-off between GPU memory usage and gradient approximation accuracy. Our experiments on synthetic 2D data demonstrate that gradient approximation uses far less GPU memory than standard INR training, while yielding reconstructions that are comparable in convergence behavior and mean squared error. Finally, we demonstrate that the proposed approach allows for memory-efficient 3D cone beam CT reconstruction in a sparse-view setting.
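The decomposition the abstract describes can be mimicked in a few lines of PyTorch: evaluate the INR everywhere without autograd to get per-point residual weights, then rebuild the computation graph only on a random subset. The sketch below (dense system matrix, uniform subsampling, MLP INR) is a toy stand-in for the paper's method, not its code.

```python
import torch

def subsampled_inr_grad(inr, pts, A, y, frac=0.1):
    """Memory-efficient stochastic gradient of 0.5*||A f - y||^2 w.r.t.
    INR parameters, following the decomposition grad = J^T [A^T (A f - y)].
    Pass 1: evaluate the INR at all sample points without autograd and form
    per-point weights g. Pass 2: re-evaluate only a random subset with
    autograd and backprop the detached weights there, rescaled so the
    gradient estimate stays unbiased."""
    with torch.no_grad():
        f = inr(pts).squeeze(-1)        # (N,) attenuation at all ray-sample points
        g = A.T @ (A @ f - y)           # (N,) backprojected residual
    N = pts.shape[0]
    S = torch.randperm(N)[: max(1, int(frac * N))]
    surrogate = (N / len(S)) * (g[S] * inr(pts[S]).squeeze(-1)).sum()
    surrogate.backward()                # grads accumulate in inr.parameters()

# Toy usage: a tiny MLP INR, 64 rays through 4096 sample points.
inr = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1))
pts = torch.rand(4096, 2)
A = torch.rand(64, 4096) / 4096         # dense stand-in for the ray operator
y = torch.rand(64)
subsampled_inr_grad(inr, pts, A, y)
```

The memory saving comes from the fact that only the subsampled forward pass builds a graph; `frac` is exactly the trade-off knob between memory and gradient accuracy that the abstract mentions.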
[1334] Brain-Grasp: Graph-based Saliency Priors for Improved fMRI-based Visual Brain Decoding
Mohammad Moradi, Morteza Moradi, Marco Grassia, Giuseppe Mangioni
Main category: eess.IV
TL;DR: A brain-guided image generation method using graph-informed saliency priors to create spatial masks from fMRI data, combined with semantic embeddings to condition a diffusion model for improved object structure preservation and semantic fidelity.
Details
Motivation: Existing fMRI-based image reconstruction methods struggle to preserve object-level structure and semantic fidelity, often overlooking the spatial arrangement of salient objects and producing conceptually inconsistent outputs.
Method: Proposes a saliency-driven decoding framework that uses graph-informed saliency priors to translate structural cues from brain signals into spatial masks, then combines these with semantic information from embeddings to condition a diffusion model for image regeneration.
Result: The approach improves both conceptual alignment and structural similarity to original stimuli compared to existing methods, while using a single frozen diffusion model for more lightweight design.
Conclusion: The method offers efficient, interpretable, and structurally grounded brain decoding for image generation, introducing a new direction for preserving object conformity while maintaining natural scene composition.
Abstract: Recent progress in brain-guided image generation has improved the quality of fMRI-based reconstructions; however, fundamental challenges remain in preserving object-level structure and semantic fidelity. Many existing approaches overlook the spatial arrangement of salient objects, leading to conceptually inconsistent outputs. We propose a saliency-driven decoding framework that employs graph-informed saliency priors to translate structural cues from brain signals into spatial masks. These masks, together with semantic information extracted from embeddings, condition a diffusion model to guide image regeneration, helping preserve object conformity while maintaining natural scene composition. In contrast to pipelines that invoke multiple diffusion stages, our approach relies on a single frozen model, offering a more lightweight yet effective design. Experiments show that this strategy improves both conceptual alignment and structural similarity to the original stimuli, while also introducing a new direction for efficient, interpretable, and structurally grounded brain decoding.
[1335] Compact single-shot ranging and near-far imaging using metasurfaces
Junjie Luo, Yuxuan Liu, Wei Ting Chen, Qing Wang, Qi Guo
Main category: eess.IV
TL;DR: A metasurface imaging system captures two close-range images (1-2cm) and one long-range image (40cm) simultaneously on a single sensor, enabling passive ranging with ±1mm accuracy using depth-from-defocus.
Details
Motivation: To create a compact imaging system suitable for edge platforms and resource-constrained applications (like defense) that can simultaneously capture multiple focal distances with passive ranging capabilities.
Method: Uses metasurface optics to simultaneously focus at three different distances (1.4cm, 2.0cm, and 40cm) on a shared photosensor, then applies a computationally efficient depth-from-defocus algorithm for passive ranging.
Result: Achieves ±1mm ranging accuracy from 12mm to 20mm with a compact 15mm total track length system that can capture both close-range (1-2cm) and long-range (40cm) images simultaneously.
Conclusion: Demonstrates a compact metasurface imaging system capable of simultaneous multi-focal imaging with passive ranging, suitable for integration into edge platforms for defense and resource-constrained applications.
Abstract: We present a metasurface imaging system capable of simultaneously capturing two images at close range (1-2cm) and an additional image at long range (about 40cm) on a shared photosensor. The close-range image pair focuses at 1.4cm and 2.0cm, respectively, which forms a focal stack, enabling passive ranging with an accuracy of ±1mm from 12mm to 20mm through a computationally efficient depth-from-defocus algorithm for a simplified scenario. The entire system is compact, with a total track length of 15mm, making it suitable for seamless integration into edge platforms for defense and other resource-constrained applications.
[1336] VCC-DSA: A Novel Vascular Consistency Constrained DSA Imaging Model for Motion Artifact Suppression
Rongjun Ge, Weilong Mao, Jian Lu, Rong Yan, Yikun Zhang, Peng Yuan, Jun Xiang, Hui Tang, Guanyu Yang, Yudong Zhang, Yang Chen, Shuo Li
Main category: eess.IV
TL;DR: VCC-DSA: A novel Digital Subtraction Angiography model using vascular consistency constraints and learning-based subtraction mapping for robust motion artifact suppression and precise vascular imaging.
Details
Motivation: DSA is crucial for cerebrovascular disease diagnosis but suffers from motion artifacts from high-attenuation tissues (bones, teeth, catheters) that reduce blood vessel visibility. Existing methods struggle with ill-posed problems and complex anatomical structures.
Method: 1) Learning-based Subtraction Mapping Paradigm to solve ill-posed problems; 2) Residual Dense Blocks with a details-shortcut for complex structures; 3) Vascular Consistency Strategy to extract intrinsic consistency from mask-live image motions; 4) Mixup-based Data Self-evolution Strategy for data self-enhancement during training.
Result: Improves PSNR by 73.4% and SSIM by 8.56% compared to other methods. Validated on both human clinical data and general anesthesia animal experiments.
Conclusion: VCC-DSA effectively suppresses motion artifacts and enhances vascular imaging through novel architectural designs and consistency strategies, showing practical clinical value.
Abstract: Digital Subtraction Angiography (DSA) is a clinically significant imaging technique and the gold standard for diagnosing cerebrovascular disease. However, artifacts caused by the motion of high-attenuation tissues such as bones, teeth, and catheters seriously reduce the visibility of blood vessels. This paper presents a novel Vascular Consistency Constrained DSA Imaging Model (VCC-DSA) for robust motion suppression and precise vascular imaging, with the following designs: 1) We design a Learning-based Subtraction Mapping Paradigm so that the ill-posed problem of existing learning-based methods can be resolved, enhancing the stability of the algorithm. 2) Our model develops Residual Dense Blocks and a details-shortcut to improve performance under complex structures, such as moving bones overlapping with blood vessels, and small features, like peripheral vessels. 3) An innovative Vascular Consistency Strategy is proposed to extract intrinsic consistency from the various relative motions in mask-live images, spontaneously distilling the vascular structure as the contrast agent develops and robustly suppressing motion artifacts, while also naturally relaxing the high data-matching requirements. 4) We design a Mixup-based Data Self-evolution Strategy for intra-data self-enhancement in the training loop, so that the training data is dynamically optimized to help the model better learn vascular features while excluding irrelevant structures in the live/mask images and even inevitable artifacts or fake structures in the labels. Prospectively, to further evaluate practical value, an actual general-anesthesia animal experiment is conducted in addition to the assessment on human clinical data. Compared with other methods, our model improves PSNR and SSIM by 73.4% and 8.56%, respectively.
[1337] Generative Data-engine Foundation Model for Universal Few-shot 2D Vascular Image Segmentation
Rongjun Ge, Xin Li, Yuxing Liu, Chengliang Liu, Pinzheng Zhang, Jiong Zhang, Jian Yang, Jean-Louis Dillenseger, Chunfeng Yang, Yuting He, Yang Chen
Main category: eess.IV
TL;DR: UniVG is a generative foundation model for universal few-shot 2D vascular image segmentation that learns vascular compositionality and enables diverse synthetic image generation to address data scarcity in medical imaging.
Details
Motivation: Deep learning for 2D vascular segmentation faces significant limitations due to scarce annotated data, which restricts widespread clinical application. There's a need for a universal few-shot segmentation model that can work across different vascular imaging modalities with minimal training data.
Method: Proposes UniVG with two key innovations: 1) compositional learning that decomposes and recombines vascular structures with varying morphological features and foreground-background configurations to generate diverse synthetic image-label pairs, and 2) few-shot generative adaptation that fine-tunes pre-trained models with minimal annotated data to bridge synthetic-real domain gaps. Also creates UniVG-58K, a dataset of 58,689 vascular images across 5 imaging modalities for large-scale pre-training.
Result: Extensive experiments on 11 vessel segmentation tasks across 5 modalities (using only 5 labeled images per task) show UniVG achieves performance comparable to fully supervised models, significantly reducing data collection and annotation costs.
Conclusion: UniVG provides an effective solution for few-shot vascular segmentation by leveraging generative foundation models and compositional learning, enabling robust performance with minimal annotated data across multiple imaging modalities.
Abstract: The segmentation of 2D vascular structures via deep learning holds significant clinical value but is hindered by the scarcity of annotated data, severely limiting its widespread application. Developing a universal few-shot vascular segmentation model is highly desirable, yet remains challenging due to the need for extensive training and the inherent complexities of vascular imaging. In this work, we propose UniVG (Generative Data-engine Foundation Model for Universal Few-shot 2D Vascular Image Segmentation), a novel approach that learns the compositionality of vascular images and constructs a generative foundation model for robust vascular segmentation. UniVG enables the synthesis and learning of diverse and realistic vascular images through two key innovations: 1) Compositional learning for flexible and diverse vascular synthesis: It decomposes and recombines vascular structures with varying morphological features and diverse foreground-background configurations to generate richly diverse synthetic image-label pairs. 2) Few-shot generative adaptation for transferable segmentation: It fine-tunes pre-trained models with minimal annotated data to bridge the gap between synthetic and real vascular domains, synthesizing authentic and diverse vessel images for downstream few-shot vascular segmentation learning. To support our approach, we develop UniVG-58K, a large dataset comprising 58,689 vascular images across five imaging modalities, facilitating robust large-scale generative pre-training. Extensive experiments on 11 vessel segmentation tasks across 5 modalities (with only 5 labeled images per task) demonstrate that UniVG achieves performance comparable to fully supervised models, significantly reducing data collection and annotation costs. All code and datasets will be made publicly available at https://github.com/XinAloha/UniVG.
[1338] Human Gaze-based Dual Teacher Guidance Learning for Semi-Supervised Medical Image Segmentation
Rongjun Ge, Chong Wang, Yuxin Liu, Chunqiang Lu, Cong Xia, Yehui Jiang, Fangyi Xu, Yinsu Zhu, Daoqiang Zhang, Chengyu Liu, Yang Chen, Shuo Li, Yuting He
Main category: eess.IV
TL;DR: HG-DTGL is a semi-supervised medical image segmentation model that incorporates human gaze data as an additional “teacher” in a mean-teacher framework to address data scarcity and improve perception.
Details
Motivation: Medical image segmentation suffers from scarce labeled data; gaze data is cheaper to obtain than manual annotation. The mean-teacher framework can leverage unlabeled data, and combining it with gaze aims to expand dataset diversity/scale and enhance network perception.
Method: HG-DTGL extends mean-teacher with gaze as a hidden teacher. It uses GazeMix to generate mixed data for diversity, a Multi-scale Gaze Perception (MGP) module to extract multi-scale features, and a Gaze Loss to align model perception with human gaze.
Result: Superior performance on multiple datasets across different modalities and ten organs/tissues, demonstrating strong generalization and the potential of gaze data in semi-supervised medical segmentation.
Conclusion: The method effectively leverages gaze data to improve semi-supervised medical image segmentation, showing generalization across modalities and highlighting gaze data’s application potential.
Abstract: In the field of medical image segmentation, the scarcity of labeled data poses a major challenge for existing models to accurately perceive target regions. Compared with manual annotation, gaze data is easier and cheaper to obtain. As a classical semi-supervised learning framework, mean-teacher can effectively use a large number of unlabeled medical images for stable training through self-teaching and collaborative optimization. Our study is based on the mean-teacher framework. By combining gaze data, it aims to address two crucial issues in semi-supervised medical image segmentation: 1) expand the scale and diversity of the dataset with limited labeled data; 2) enhance the network’s perception ability. We propose the Human Gaze-based Dual Teacher Guidance Learning model (HG-DTGL). In this model, human gaze serves as an additional hidden ‘teacher’ in the mean-teacher architecture. We introduce GazeMix to generate reliable mixed data that expands the diversity and scale of the dataset, and a Multi-scale Gaze Perception (MGP) module to extract the network’s multi-scale perception. A Gaze Loss is designed to align the model’s perception with human gaze. Extensive experiments verify HG-DTGL on multiple datasets of different modalities, achieving superior performance on a total of ten different organs/tissues. This demonstrates that our method generalizes well to medical images of different modalities and shows the great application potential of gaze data in semi-supervised medical image segmentation.
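The abstract does not specify the form of the Gaze Loss; one plausible instantiation is a KL divergence between the gaze heatmap and the network's spatial attention, sketched below with toy tensors.

```python
import torch

def gaze_loss(pred_map, gaze_map, eps=1e-8):
    """KL divergence aligning the network's spatial attention with a human
    gaze heatmap -- one plausible form of the paper's Gaze Loss, since the
    exact formulation is not given in the abstract.
    pred_map, gaze_map: (B, 1, H, W)."""
    p = pred_map.flatten(1).softmax(-1)        # model attention as a distribution
    q = gaze_map.flatten(1) + eps
    q = q / q.sum(-1, keepdim=True)            # gaze heatmap as a distribution
    return (q * (q.log() - (p + eps).log())).sum(-1).mean()

# Toy usage with 8x8 maps:
pred = torch.randn(2, 1, 8, 8)
gaze = torch.rand(2, 1, 8, 8)
loss = gaze_loss(pred, gaze)
```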
[1339] Semi-Supervised Goal-Oriented Semantic Communication Framework for Foreground Classification
Zhitong Ni, Yansha Deng, Jinhong Yuan
Main category: eess.IV
TL;DR: A semi-supervised wireless goal-oriented semantic communication framework for unlabeled image foreground classification that reduces transmission overhead by 95% while maintaining over 90% accuracy.
Details
Motivation: Existing goal-oriented semantic communication (GSC) frameworks operate on entire images and rely on labeled data, limiting compression efficiency and risking overfitting. There's a need for more efficient GSC that reduces transmission overhead while maintaining task performance in resource-constrained wireless scenarios.
Method: Proposes a semi-supervised GSC framework with: 1) a foreground-aware masked autoencoder (MAE) to prioritize semantically important foreground objects, 2) a semi-supervised autoencoder (SSAE) that decodes the semantic latent tensor using three complementary information sources for reconstruction and classification, and 3) fine-tuning of a pre-trained image classification model, all trained in a semi-supervised manner.
Result: Achieves over 90% image classification accuracy while reducing original image data size by 95%, demonstrating strong potential for practical tasks in resource-constrained wireless scenarios.
Conclusion: The proposed semi-supervised wireless GSC framework effectively balances compression efficiency and task performance, significantly reducing transmission overhead while maintaining high classification accuracy with minimal manual labeling requirements.
Abstract: Wireless goal-oriented semantic communication (GSC) has emerged as a promising paradigm by directly optimizing task performance. However, existing GSC frameworks typically operate on entire images and rely on labeled data for classification tasks, which can limit their compression efficiency and increase the risk of overfitting. This paper proposes a novel semi-supervised wireless GSC framework for the unlabeled image foreground classification task. In our proposed framework, a foreground-aware masked autoencoder (MAE) is developed to prioritize semantically important foreground objects, thereby reducing transmission overhead. To enable accurate reconstruction and classification under a limited data size, we further propose a semi-supervised autoencoder (SSAE) that decodes the semantic latent tensor and refines image details by leveraging three complementary information sources, followed by fine-tuning a pre-trained image classification model. The entire pipeline, from foreground masking to classification, is trained in a semi-supervised manner to significantly reduce the need for manual labeling. Simulation results validate that the proposed GSC framework achieves over 90% image classification accuracy while reducing the original image data size by 95%, and demonstrate its strong potential for practical tasks in resource-constrained wireless scenarios.
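One plausible reading of the foreground-aware masking is to rank image patches by foreground coverage and keep only the most foreground-heavy ones for transmission. The sketch below illustrates that patch-selection step; the patch size, keep ratio, and toy mask are assumptions, not the paper's configuration.

```python
import torch

def foreground_patch_indices(fg_mask, patch=16, keep_ratio=0.25):
    """Rank patches by foreground coverage for a foreground-aware MAE, so
    semantically important regions are preferentially retained.
    fg_mask: (B, H, W) binary foreground mask."""
    patches = fg_mask.unfold(1, patch, patch).unfold(2, patch, patch)  # (B,h,w,p,p)
    score = patches.float().mean((-1, -2)).flatten(1)  # foreground coverage per patch
    k = int(keep_ratio * score.shape[1])
    return score.topk(k, dim=1).indices                # patches to keep/transmit

# Toy foreground: a centered box in a 224x224 mask.
fg = torch.zeros(1, 224, 224)
fg[:, 64:160, 64:160] = 1
keep = foreground_patch_indices(fg)
```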
[1340] Neural-Network Inversion for the Temporal CT Multi-Source Bundle Problem: Per-Bundle Statistical Limits and Near-Optimal Performance
Guy M. Besson
Main category: eess.IV
TL;DR: This paper analyzes the nonlinear inverse problem in Temporal CT with multiple X-ray sources, deriving Cramer-Rao bounds and comparing classical algorithms with neural networks for attenuation estimation.
Details
Motivation: The paper addresses performance limitations in Temporal CT (multi-source computed tomography), where multiple simultaneously active X-ray sources create mixed Poisson intensity measurements, leading to both irreducible aggregation loss (fixed by geometry) and reducible algorithmic inefficiency.
Method: The authors derive closed-form Cramer-Rao bounds and inflation factors for the problem, introduce SNN1 (a near-optimal classical per-bundle algorithm), and evaluate a physics-motivated residual neural network across three datasets: synthetic (RND), analytical chest phantom (SGS), and patient-image-derived (PIS). They also conduct cross-evaluation studies.
Result: SNN1 brings endpoint paths to within 1-2% of their CRBs. On SGS, the neural network beats SNN1 at high attenuation by 33-67% but cannot cross the equal-dose single-source floor. On PIS, the evaluation ratio drops below 1.0 at bin 6 and reaches 0.096 at bin 9, showing that anatomical prior learning dominates collapsed Fisher information at high attenuation. Cross-evaluation shows concentrated wrong priors are catastrophically worse than broad wrong priors.
Conclusion: The paper characterizes prior informativeness in medical imaging inverse problems and emphasizes prior diversity as critical for multi-patient deployment. It also motivates a companion strip-processing architecture to exploit inter-bundle structure not accessible to per-bundle algorithms.
Abstract: We study the nonlinear inverse problem arising in Temporal CT, a multi-source computed-tomography architecture in which NS = 3 simultaneously active X-ray sources produce M = 5 mixed Poisson intensity measurements of K = 3 unknown line-integral attenuations per projection bundle. The forward model is a sum of exponentials and creates two distinct sources of performance loss: an irreducible aggregation loss fixed by the measurement geometry, and a reducible algorithmic inefficiency that improved estimators can close. We derive closed-form Cramer-Rao bounds and inflation factors for this problem; at unequal attenuation the inflation ratios vary – and can be considerably worse. We introduce SNN1, a near-optimal classical per-bundle algorithm that brings endpoint paths to within 1-2% of their CRBs, and evaluate a physics-motivated residual neural network across three datasets ordered by increasing sinogram structure: RND (synthetic), SGS (analytical chest phantom), and PIS (patient-image-derived). On SGS the NN beats SNN1 at high attenuation by 33-67% but cannot cross the equal-dose single-source floor; on PIS the evaluation ratio drops below 1.0 at bin 6 and reaches 0.096 at bin 9, confirming that the anatomical prior learned from this patient is concentrated enough to dominate collapsed Fisher information at high attenuation – a characterization of prior informativeness, not a claim of clinical generalizability beyond the single patient studied. A cross evaluation (SGS-trained on PIS test) shows that a concentrated wrong prior is catastrophically worse than a broad wrong prior, underscoring prior diversity as a critical requirement for any future multi-patient deployment. Quantitative sinogram correlation analysis motivates a companion strip-processing architecture that exploits inter-bundle structure inaccessible to the per-bundle algorithms of this paper (Thread 1).
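For context, the Cramer-Rao machinery the paper builds on takes a standard form for Poisson data. The forward model below paraphrases the abstract's sum-of-exponentials description; the exact parameterization in the paper may differ.

```latex
% Measurements y_m ~ Poisson(\lambda_m(\theta)), m = 1..M, with a
% sum-of-exponentials forward model over K line integrals \theta_k
% (coefficients a_{mk} set by the source/detector geometry):
%   \lambda_m(\theta) = \sum_{k=1}^{K} a_{mk} \, e^{-\theta_k}.
I_{ij}(\theta) = \sum_{m=1}^{M} \frac{1}{\lambda_m(\theta)}
    \frac{\partial \lambda_m}{\partial \theta_i}
    \frac{\partial \lambda_m}{\partial \theta_j},
\qquad
\operatorname{Var}\!\left(\hat{\theta}_i\right) \;\ge\; \left[ I(\theta)^{-1} \right]_{ii}.
```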
[1341] A Data-driven Loss Weighting Scheme across Heterogeneous Tasks for Image Denoising
Xiangyu Rui, Xiangyong Cao, Xile Zhao, Deyu Meng, Michael K. NG
Main category: eess.IV
TL;DR: Proposes a data-driven loss weighting (DLW) scheme using neural networks to learn optimal weights for variational denoising models, enabling better handling of complex noise patterns beyond Gaussian distribution.
Details
Motivation: Traditional variational denoising models struggle with assigning appropriate weights in the data fidelity term when dealing with complex noise patterns like impulse noise, stripe noise, or mixed patterns. The weight balancing between data fidelity and regularization terms is particularly challenging for non-Gaussian noise distributions.
Method: DLW trains a parameterized weight function (a neural network) that maps noisy images to optimal weights using a bilevel optimization framework. The lower level solves denoising models with predicted weights, while the upper level minimizes the distance between restored and clean images to extract noise and regularization information.
Result: Numerical results show remarkable performance improvement for various variational denoising models handling complex noise patterns. DLW demonstrates ability to transfer noise knowledge at model level to heterogeneous tasks beyond training data, with generalization theory validating its transferability.
Conclusion: DLW provides an effective data-driven approach to learn optimal weighting for variational denoising models, enabling better handling of complex noise patterns and demonstrating transferability to unseen noise types.
Abstract: In a variational denoising model, the weight in the data fidelity term plays the role of enhancing the noise-removal capability. It is profoundly correlated with noise information, while also balancing the data fidelity and regularization terms. However, assigning the weight becomes substantially harder when the noise pattern goes beyond an independent, identically distributed Gaussian, e.g., impulse noise, stripe noise, or a mixture of several patterns. Furthermore, how to leverage the weight to balance the data fidelity and regularization terms is even less evident. In this work, we propose a data-driven loss weighting (DLW) scheme to address these issues. Specifically, DLW trains a parameterized weight function (i.e., a neural network) that maps the noisy image to the weight. The training is achieved by a bilevel optimization framework, where the lower-level problem solves several denoising models with the same weight predicted by the weight function and the upper-level problem minimizes the distance between the restored image and the clean image. In this way, information from both the noise and the regularization can be efficiently extracted to determine the weight function. DLW also facilitates the easy implementation of a trained weight function on denoising models. Numerical results verify the remarkable performance of DLW in improving the ability of various variational denoising models to handle different complex noise. This implies that DLW can transfer noise knowledge at the model level to heterogeneous tasks beyond the training ones; the generalization theory underlying DLW is also studied, validating its intrinsic transferability.
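A toy version of the bilevel scheme can be written with an unrolled inner solver. In the sketch below, total variation stands in for a generic regularizer and a small CNN for the weight function; none of the hyperparameters come from the paper.

```python
import torch

def dlw_step(weight_net, x_noisy, x_clean, lam=0.1, inner_steps=5, lr=0.1):
    """One bilevel iteration of data-driven loss weighting (a sketch).
    Lower level: unrolled gradient descent on a weighted-fidelity + TV
    denoising model, with per-pixel weights predicted from the noisy image.
    Upper level: MSE to the clean image, backpropagated through the
    unrolled solver into the weight network."""
    w = weight_net(x_noisy)                        # (B,1,H,W) per-pixel weights
    u = x_noisy.clone().requires_grad_(True)
    for _ in range(inner_steps):                   # unrolled lower-level solver
        fidelity = (w * (u - x_noisy) ** 2).sum()
        tv = (u[..., 1:, :] - u[..., :-1, :]).abs().sum() \
           + (u[..., :, 1:] - u[..., :, :-1]).abs().sum()
        g, = torch.autograd.grad(fidelity + lam * tv, u, create_graph=True)
        u = u - lr * g
    return ((u - x_clean) ** 2).mean()             # upper-level loss

# Toy usage: a small positive-weight CNN and a synthetic noisy image.
weight_net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 1, 3, padding=1), torch.nn.Softplus())  # weights > 0
x_clean = torch.rand(1, 1, 32, 32)
x_noisy = x_clean + 0.1 * torch.randn_like(x_clean)
dlw_step(weight_net, x_noisy, x_clean).backward()  # gradients reach weight_net
```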
[1342] Linear Attention Based Deep Nonlocal Means Filtering for Multiplicative Noise Removal
Xiao Siyao, Huang Libing, Zhang Shunsheng
Main category: eess.IV
TL;DR: Proposes LDNLM, a deep learning-based linear attention mechanism for multiplicative noise denoising in images, combining traditional nonlocal means filtering with deep neural networks for improved performance with linear complexity.
Details
Motivation: Multiplicative noise significantly affects visual quality in radar and medical images, requiring effective denoising methods. Traditional approaches need improvement for better performance and efficiency.
Method: Linearizes the nonlocal means algorithm using deep learning. Uses deep channel CNNs to extract neighborhood information, replaces similarity calculation and weighted averaging with attention mechanism operations, and derives a linear-complexity filter.
Result: LDNLM outperforms state-of-the-art methods on both simulated and real multiplicative noise datasets while maintaining interpretability close to traditional NLM.
Conclusion: LDNLM provides an effective, efficient, and interpretable solution for multiplicative noise denoising with linear computational complexity.
Abstract: Multiplicative noise widely exists in radar images, medical images, and images from other important fields. Compared with common noise types, multiplicative noise generally has a stronger effect on the visual appearance of images. To address the multiplicative noise denoising problem, we linearize the nonlocal means algorithm with deep learning and propose a linear attention mechanism based deep nonlocal means filtering (LDNLM). Starting from traditional nonlocal means filtering, we employ deep channel convolutional neural networks to extract the information of the neighborhood matrix and obtain representation vectors for every pixel. We then replace the similarity calculation and weighted averaging processes with the inner operations of the attention mechanism. To reduce the computational overhead, through the formulas of similarity calculation and weighted averaging, we derive a nonlocal filter with linear complexity. Experiments on both simulated and real multiplicative noise demonstrate that LDNLM is more competitive than state-of-the-art methods. Additionally, we prove that LDNLM possesses interpretability close to that of traditional NLM. The source code and pre-trained model are available at https://github.com/ShowiBin/LDNLM.
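To see why the filter runs in linear time, note that the standard kernelized linear-attention identity never materializes the N x N pixel-similarity matrix. A small sketch under that assumption; the feature map and dimensions are illustrative, not the paper's exact derivation:

```python
# Hypothetical sketch of the linear-attention step at the core of an LDNLM-style
# filter: per-pixel representation vectors replace NLM patch comparisons.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (B, N, D) pixel representations; O(N*D^2) instead of O(N^2*D)."""
    q = F.elu(q) + 1.0        # positive feature map, a common linearization choice
    k = F.elu(k) + 1.0
    kv = torch.einsum('bnd,bne->bde', k, v)        # sum_n k_n v_n^T  (D x D)
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)

B, N, D = 2, 64 * 64, 32       # e.g., a 64x64 image flattened to N pixels
feats = torch.randn(B, N, D)   # stand-in for vectors from a neighborhood CNN
denoised_feats = linear_attention(feats, feats, feats)
print(denoised_feats.shape)    # torch.Size([2, 4096, 32])
```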
[1343] Q-Agent: Quality-Driven Chain-of-Thought Image Restoration Agent through Robust Multimodal Large Language Model
Yingjie Zhou, Jiezhang Cao, Farong Wen, Zicheng Zhang, Yu Zhou, Yue Shi, Xiaohong Liu, Radu Timofte, Luc Van Gool, Guangtao Zhai
Main category: eess.IV
TL;DR: Q-Agent uses MLLM-based degradation perception with Chain-of-Thought reasoning and quality-driven greedy restoration for multi-degradation image restoration.
Details
Motivation: Real-world image restoration faces multiple complex degradations (noise, blur, compression artifacts, etc.). Existing approaches either train specific models (poor generalization) or use All-in-One models (sacrifice performance on certain degradations). Current IR agents using MLLMs have issues with misinterpretation, high computational costs, and neglect image quality assessment.
Method: Proposes Q-Agent with two modules: 1) Robust degradation perception that fine-tunes MLLM and uses Chain-of-Thought to decompose multi-degradation perception into single-degradation tasks, 2) Quality-driven greedy restoration that uses objective IQA metrics to determine optimal restoration sequence and execute corresponding algorithms.
Result: Experimental results show Q-Agent achieves superior image restoration performance compared to existing All-in-One models.
Conclusion: The proposed Q-Agent effectively handles multiple degradations through enhanced MLLM perception and quality-driven restoration sequencing, outperforming current approaches.
Abstract: Image restoration (IR) often faces various complex and unknown degradations in real-world scenarios, such as noise, blurring, compression artifacts, and low resolution. Training specific models for specific degradations may lead to poor generalization. To handle multiple degradations simultaneously, All-in-One models might sacrifice performance on certain types of degradation and still struggle with degradations unseen during training. Existing IR agents rely on multimodal large language models (MLLMs) and a time-consuming rolling-back selection strategy that neglects image quality. As a result, they may misinterpret degradations and incur high time and computational costs by conducting unnecessary IR tasks in a redundant order. To address these issues, we propose a Quality-Driven agent (Q-Agent) via Chain-of-Thought (CoT) restoration. Specifically, our Q-Agent consists of robust degradation perception and quality-driven greedy restoration. The former module first fine-tunes the MLLM and uses CoT to decompose multi-degradation perception into single-degradation perception tasks, enhancing the perception ability of MLLMs. The latter employs objective image quality assessment (IQA) metrics to determine the optimal restoration sequence and execute the corresponding restoration algorithms. Experimental results demonstrate that our Q-Agent achieves superior IR performance compared to existing All-in-One models.
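The quality-driven greedy sequencing is straightforward to sketch. The following is a hypothetical rendering of such a loop; the restorer registry and the iqa_score callable are placeholders, not the paper's components:

```python
# Hypothetical sketch of quality-driven greedy restoration sequencing: at each
# step, apply every remaining candidate restorer, score the outputs with a
# no-reference IQA metric, and commit to the best one.
def greedy_restore(image, restorers, iqa_score, max_steps=None):
    remaining = dict(restorers)           # name -> callable(image) -> image
    best_score = iqa_score(image)
    steps = []
    while remaining and (max_steps is None or len(steps) < max_steps):
        candidates = {name: fn(image) for name, fn in remaining.items()}
        name, out = max(candidates.items(), key=lambda kv: iqa_score(kv[1]))
        score = iqa_score(out)
        if score <= best_score:           # no candidate improves quality: stop
            break
        image, best_score = out, score
        steps.append(name)
        del remaining[name]               # each degradation handled at most once
    return image, steps

# Usage with stand-in components (denoiser, deblurrer, and the IQA metric are
# whatever models/metrics you plug in):
# restored, order = greedy_restore(img, {"denoise": denoiser, "deblur": deblurrer}, niqe_like_score)
```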
[1344] IMPLICITSTAINER: Resolution Agnostic Data-Efficient Virtual Staining Using Neural Implicit Functions
Tushar Kataria, Beatrice Knudsen, Shireen Y. Elhabian
Main category: eess.IV
TL;DR: IMPLICITSTAINER: A deterministic neural implicit framework for virtual staining that translates H&E images to IHC/mIF stains with resolution-agnostic inference and improved robustness in low-data regimes.
Details
Motivation: H&E stains lack molecular specificity for identifying specific cell phenotypes, while antibody-based stains like IHC are costly and time-consuming. Existing virtual staining methods are patch-based, operate at fixed resolutions, require large datasets, and introduce stochasticity that can lead to hallucinations and structural distortions unsuitable for clinical use.
Method: Reformulates virtual staining as a continuous pixel-level translation problem using neural implicit deep learning models. Each target-domain (IHC) pixel is predicted from a high-dimensional embedding of the corresponding source-domain H&E pixel, its local spatial neighborhood, and explicit coordinate information.
Result: Achieves state-of-the-art performance across more than twenty baselines on virtual staining tasks including IHC and mIF. Enables resolution-agnostic inference, improves robustness in low-data regimes, and yields deterministic, reproducible outputs.
Conclusion: IMPLICITSTAINER provides a deterministic framework for virtual staining that addresses limitations of existing methods, offering clinical-grade accuracy and reliability for medical image translation tasks.
Abstract: Hematoxylin and eosin (H&E)-stained slides are central to cancer diagnosis and monitoring, visualizing tissue architecture and cellular morphology. However, H&E lacks the molecular specificity needed to distinguish cell states and functional activation. Antibody-based stains, such as immunohistochemistry (IHC), are therefore required to identify specific phenotypes (e.g., CD3$^+$ T cells or HER2-positive tumor cells) but are costly, time-consuming, and not universally available. Deep learning-based image translation methods, often termed virtual staining, offer a complementary alternative by generating virtual immunostains directly from H&E images. Most existing virtual staining methods are patch-based and operate at fixed resolutions, often requiring large datasets and additional post-hoc super-resolution models to generate high-resolution images. Furthermore, GAN- and diffusion-based approaches introduce stochasticity into generated stains which, although beneficial for visual realism in natural images, can lead to hallucinations and structural distortions that affect the accuracy and reliability required for clinical use. We propose IMPLICITSTAINER, a deterministic framework that reformulates virtual staining as a continuous pixel-level translation problem. In contrast to existing patch-based approaches, IMPLICITSTAINER formulates image translation as a continuous spatial mapping using neural implicit deep learning models. Each target-domain (IHC) pixel is predicted from a high-dimensional embedding of the corresponding source-domain H&E pixel, its local spatial neighborhood, and explicit coordinate information. IMPLICITSTAINER enables resolution-agnostic inference, improves robustness in low-data regimes, and yields deterministic, reproducible outputs. Across more than twenty baselines, IMPLICITSTAINER achieves SOTA performance on virtual staining tasks, including IHC and mIF.
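The pixel-level implicit formulation can be sketched as a small coordinate-conditioned decoder. The MLP below is an illustrative assumption (the paper's embedding network and decoder may differ); it shows how resolution-agnostic inference falls out of querying arbitrary coordinate grids:

```python
# Minimal sketch of an implicit per-pixel stain translator; architecture,
# embedding size, and coordinate normalization are assumptions.
import torch
import torch.nn as nn

class ImplicitStainDecoder(nn.Module):
    """Predicts an IHC pixel from a local H&E embedding plus (x, y) coordinates."""
    def __init__(self, feat_dim=64, coord_dim=2, out_channels=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + coord_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, out_channels), nn.Sigmoid(),
        )
    def forward(self, feats, coords):
        # feats: (N, feat_dim) local H&E embeddings; coords: (N, 2) in [0, 1]
        return self.mlp(torch.cat([feats, coords], dim=-1))

# Resolution-agnostic inference: query any coordinate grid, not just the
# training resolution.
decoder = ImplicitStainDecoder()
H, W = 512, 512                       # arbitrary output resolution
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing='ij')
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
feats = torch.randn(H * W, 64)        # stand-in for CNN features sampled at coords
ihc = decoder(feats, coords).reshape(H, W, 3)
```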
[1345] DoSReMC: Domain Shift Resilient Mammography Classification using Batch Normalization Adaptation
Uğurcan Akyüz, Deniz Katircioglu-Öztürk, Emre K. Süslü, Burhan Keleş, Mete C. Kaya, Gamze Durhan, Meltem G. Akpınar, Figen B. Demirkazık, Gözde B. Akar
Main category: eess.IV
TL;DR: DoSReMC is a domain adaptation framework for mammography classification that fine-tunes only batch normalization and fully connected layers to improve cross-domain generalization without full model retraining.
Details
Motivation: Deep learning models for breast cancer recognition from mammograms suffer performance degradation when applied to data from different domains due to domain shift, limiting safe and equitable AI deployment in clinical settings.
Method: Proposes DoSReMC, a batch normalization adaptation framework that fine-tunes only BN and FC layers while preserving pretrained convolutional filters, integrated with adversarial training for improved cross-domain generalization with reduced computational cost.
Result: Demonstrates that BN layers are a primary source of domain dependence, and DoSReMC significantly improves cross-domain generalization across three large-scale FFDM datasets including a new pathologically confirmed in-house dataset (HCTP).
Conclusion: DoSReMC provides a practical, computationally efficient pathway for robust cross-domain mammography classification that can be readily incorporated into existing AI pipelines across diverse clinical environments.
Abstract: Numerous deep learning-based solutions have been developed for the automatic recognition of breast cancer using mammography images. However, their performance often declines when applied to data from different domains, primarily due to domain shift - the variation in data distributions between source and target domains. This performance drop limits the safe and equitable deployment of AI in real-world clinical settings. In this study, we present DoSReMC (Domain Shift Resilient Mammography Classification), a batch normalization (BN) adaptation framework designed to enhance cross-domain generalization without retraining the entire model. Using three large-scale full-field digital mammography (FFDM) datasets - including HCTP, a newly introduced, pathologically confirmed in-house dataset - we conduct a systematic cross-domain evaluation with convolutional neural networks (CNNs). Our results demonstrate that BN layers are a primary source of domain dependence: they perform effectively when training and testing occur within the same domain, and they significantly impair model generalization under domain shift. DoSReMC addresses this limitation by fine-tuning only the BN and fully connected (FC) layers, while preserving pretrained convolutional filters. We further integrate this targeted adaptation with an adversarial training scheme, yielding additional improvements in cross-domain generalizability while reducing the computational cost of model training. DoSReMC can be readily incorporated into existing AI pipelines and applied across diverse clinical environments, providing a practical pathway toward more robust and generalizable mammography classification systems.
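The BN/FC-only adaptation step reduces to a few lines of parameter freezing in PyTorch. A minimal sketch, assuming a standard torchvision backbone as a stand-in for the paper's CNNs:

```python
# Minimal sketch of BN/FC-only fine-tuning; the backbone and class count are
# illustrative stand-ins, the layer selection follows the paper's description.
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(num_classes=2)       # stand-in mammography classifier

for p in model.parameters():
    p.requires_grad = False           # freeze pretrained convolutional filters

for m in model.modules():
    if isinstance(m, (nn.BatchNorm2d, nn.Linear)):
        for p in m.parameters():
            p.requires_grad = True    # adapt only BN affine params and FC layers
        if isinstance(m, nn.BatchNorm2d):
            m.track_running_stats = True  # running stats follow the target domain

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
```

Only the small trainable subset is handed to the optimizer, which is where the reduced training cost comes from.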
[1346] PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems
Merve Gülle, Junno Yun, Yaşar Utku Alçalar, Mehmet Akçakaya
Main category: eess.IV
TL;DR: PnP-CM integrates Consistency Models into plug-and-play frameworks for solving inverse problems efficiently, achieving high-quality reconstructions in as few as 4 neural function evaluations.
Details
Motivation: Existing CM-based solvers for inverse problems require task-specific training or have slow convergence, limiting their applicability to large-scale problems and nonlinear settings. There's a need for a unified framework that can efficiently solve diverse inverse problems.
Method: Reinterpret CMs as proximal operators of a prior and integrate them into ADMM-based plug-and-play (PnP) frameworks. Propose PnP-CM with noise perturbations and momentum-based updates to improve performance in low-NFE regime.
Result: PnP-CM achieves high-quality reconstructions in as few as 4 NFEs, produces meaningful results in 2 steps, outperforms existing CM-based approaches, and is the first application of CMs to MRI data.
Conclusion: PnP-CM provides an effective unified framework for solving diverse linear and nonlinear inverse problems efficiently, demonstrating practical applicability to real-world problems like medical imaging.
Abstract: Diffusion models have found extensive use in solving inverse problems, by sampling from an approximate posterior distribution of data given the measurements. Recently, consistency models (CMs) have been proposed to directly predict the final output from any point on the diffusion ODE trajectory, enabling high-quality sampling in just a few neural function evaluations (NFEs). CMs have also been utilized for inverse problems, but existing CM-based solvers either require additional task-specific training or utilize data fidelity operations with slow convergence, limiting their applicability to large-scale problems and making them difficult to extend to nonlinear settings. In this work, we reinterpret CMs as proximal operators of a prior, enabling their integration into plug-and-play (PnP) frameworks. Specifically, we propose PnP-CM, an ADMM-based PnP solver that provides a unified framework for solving a wide range of inverse problems, and incorporates noise perturbations and momentum-based updates to improve performance in the low-NFE regime. We evaluate our approach on a diverse set of linear and nonlinear inverse problems. We also train and apply CMs to MRI data for the first time. Our results show that PnP-CM achieves high-quality reconstructions in as few as 4 NFEs, and produces meaningful results in 2 steps, highlighting its effectiveness in real-world inverse problems while outperforming existing CM-based approaches.
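The ADMM structure with a consistency model in the prior slot can be sketched directly. The version below assumes a simple denoising-type forward model y = x + n so the data-fidelity prox has a closed form; the CM call, penalty parameter, and iteration count are placeholders, and the paper's noise perturbations and momentum updates are omitted:

```python
# Hypothetical PnP-ADMM skeleton with a consistency model as the prior's prox.
import torch

def pnp_cm_admm(y, cm_denoiser, rho=1.0, iters=4):
    """y: measurements; cm_denoiser: one-step consistency model x_hat = f(z)."""
    x = y.clone()
    z = y.clone()
    u = torch.zeros_like(y)                   # scaled dual variable
    for _ in range(iters):
        # x-update: prox of 0.5*||x - y||^2 (closed form for this fidelity term)
        x = (y + rho * (z - u)) / (1.0 + rho)
        # z-update: the CM acts as the proximal operator of the learned prior
        z = cm_denoiser(x + u)
        # dual update
        u = u + x - z
    return z

# Exercise the loop with a stand-in "CM" (a simple shrinkage):
recon = pnp_cm_admm(torch.randn(1, 1, 32, 32), cm_denoiser=lambda v: 0.9 * v)
```

Because the CM needs only one network call per z-update, four ADMM iterations correspond to the four NFEs quoted in the results.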
[1347] Equivariance2Inverse: A Practical Self-Supervised CT Reconstruction Method Benchmarked on Real, Limited-Angle, and Blurred Data
Dirk Elias Schut, Adriaan Graas, Robert van Liere, Tristan van Leeuwen
Main category: eess.IV
TL;DR: Equivariance2Inverse: A new self-supervised CT reconstruction method combining Robust Equivariant Imaging and Sparse2Inverse concepts, designed to be robust to scintillator blurring and limited-angle data.
Details
Motivation: Existing self-supervised CT reconstruction methods use simplified physics models that make inaccurate assumptions about scintillator blurring, scanning geometry, or noise distribution, making them less robust to real-world imaging conditions.
Method: Reviews the model assumptions of six recent self-supervised CT reconstruction methods, then combines concepts from Robust Equivariant Imaging and Sparse2Inverse into the Equivariance2Inverse method, which leverages rotational invariance to handle limited-angle data and scintillator blurring.
Result: Benchmark on real-world 2DeteCT dataset and synthetic data shows methods assuming pixel-wise independent noise perform poorly with scintillator blurring, and rotational invariance can reduce artifacts in limited-angle reconstructions.
Conclusion: Equivariance2Inverse demonstrates improved robustness to real-world CT imaging challenges like scintillator blurring and limited-angle scanning by properly modeling noise dependencies and leveraging object distribution invariances.
Abstract: Deep learning has shown impressive results in reducing noise and artifacts in X-ray computed tomography (CT) reconstruction. Self-supervised CT reconstruction methods are especially appealing for real-world applications because they require no ground truth training examples. However, these methods involve a simplified X-ray physics model during training, which may make inaccurate assumptions, for example, about scintillator blurring, the scanning geometry, or the distribution of the noise. As a result, they can be less robust to real-world imaging circumstances. In this paper, we review the model assumptions of six recent self-supervised CT reconstruction methods. Based on this, we combined concepts of the Robust Equivariant Imaging and Sparse2Inverse methods in a new self-supervised CT reconstruction method called Equivariance2Inverse that is robust to scintillator blurring and limited-angle data. We benchmarked Equivariance2Inverse and the existing methods on the real-world 2DeteCT dataset and on synthetic data with and without scintillator blurring and a limited-angle scanning geometry. The results of our benchmark show that methods that assume that the noise is pixel-wise independent do not perform well on data with scintillator blurring. Moreover, they show that when the distribution of objects is rotationally invariant, this invariance can be used to reduce artifacts in limited-angle reconstructions.
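The rotational-invariance training signal at the heart of equivariant-imaging-style methods can be written in a few lines. A minimal sketch, assuming a 90-degree rotation group for simplicity; the reconstruction network and forward operator are placeholders:

```python
# Hypothetical equivariant-imaging-style loss: if the object distribution is
# rotationally invariant, rotation should commute with reconstruction.
import torch

def equivariance_loss(recon_net, forward_op, y):
    """recon_net: measurements -> image; forward_op: image -> measurements."""
    x1 = recon_net(y)                         # reconstruct from real measurements
    k = int(torch.randint(1, 4, (1,)))        # random rotation from the group
    x1_rot = torch.rot90(x1, k, dims=(-2, -1))
    y_virtual = forward_op(x1_rot)            # re-measure the transformed image
    x2 = recon_net(y_virtual)                 # reconstruct the virtual measurement
    return ((x2 - x1_rot) ** 2).mean()

# Exercise the loss with identity stand-ins, just for shape checking:
loss = equivariance_loss(lambda m: m, lambda x: x, torch.randn(1, 1, 64, 64))
```

Virtual measurements of rotated reconstructions are what supply the missing angular views in the limited-angle setting.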
[1348] Robust Glioblastoma Segmentation Without T2-FLAIR: External Validation of Targeted Dropout Training
Marco Öchsner, Lena Kaiser, Robert Stahl, Nathalie L. Albert, Thomas Liebig, Robert Forbrig, Jonas Reis
Main category: eess.IV
TL;DR: Targeted T2-FLAIR dropout training improves robustness of glioblastoma MRI tumor segmentation when T2-FLAIR is unavailable, without degrading performance when it’s available.
Details
Motivation: To improve the robustness of glioblastoma MRI tumor segmentation and whole-tumor volumetry when T2-FLAIR imaging is unavailable, while maintaining performance when it is available.
Method: 3D nnU-Net models trained with targeted T2-FLAIR dropout (probability rates 0.35 or 0.50) by replacing only the T2-FLAIR channel with zeros during training. Models were trained on BraTS 2021 cohort and externally validated on University of Pennsylvania glioblastoma cohort.
Result: With T2-FLAIR present, performance was equivalent with/without dropout. With T2-FLAIR absent, overall median DSC improved from 81.0% to 93.4%, whole-tumor DSC improved from 60.4% to 92.6%, and whole-tumor volume bias improved from -45.6 mL to 0.83 mL.
Conclusion: Targeted T2-FLAIR dropout preserves segmentation performance when T2-FLAIR is available and substantially reduces whole-tumor segmentation error and volumetric bias when T2-FLAIR is absent.
Abstract: Objectives: To determine whether targeted T2 fluid-attenuated inversion recovery (T2-FLAIR) dropout training improves robustness of glioblastoma MRI tumor segmentation and whole-tumor volumetry when T2-FLAIR is unavailable, without degrading performance when T2-FLAIR is available. Materials and Methods: In this retrospective multi-dataset study, 3D nnU-Net models were trained on a subset of the BraTS 2021 cohort (n=848) and externally validated on the University of Pennsylvania glioblastoma cohort (n=403). Models were trained with no dropout or targeted T2-FLAIR dropout (probability rate (r)=0.35 or 0.50) by replacing only the T2-FLAIR channel with zeros during training. Testing used prespecified T2-FLAIR-present and T2-FLAIR-absent scenarios, with the absent scenario simulated by zeroing the T2-FLAIR channel at inference. The primary endpoint was per-patient overall region-wise Dice similarity coefficient (DSC), secondary endpoints were region-specific DSC, 95th percentile Hausdorff distance and Bland-Altman whole-tumor volume bias. Results: With T2-FLAIR present, overall median DSC was 94.8% (interquartile range [IQR] 90.0%-97.1%) with dropout (r=0.35) and 95.0% (IQR 90.3%-97.1%) without dropout, supporting equivalence (p<0.001). With T2-FLAIR absent, overall median DSC improved from 81.0% (IQR 75.1%-86.4%) without dropout to 93.4% (IQR 89.1%-96.2%) with dropout (r=0.35). Whole-tumor DSC improved from 60.4% to 92.6%, whole tumor 95th percentile Hausdorff distance improved from 17.24 mm to 2.45 mm, and whole-tumor volume bias improved from -45.6 mL to 0.83 mL. Conclusions: In a simulated T2-FLAIR-unavailable scenario, targeted T2-FLAIR dropout preserved segmentation performance when T2-FLAIR was available and substantially reduced whole-tumor segmentation error and volumetric bias when T2-FLAIR was absent.
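The targeted dropout itself is a one-channel masking transform. A minimal sketch, assuming a BraTS-style channel ordering with T2-FLAIR last; only the probability r follows the paper, everything else is illustrative:

```python
# Minimal sketch of targeted modality dropout: zero only the T2-FLAIR channel
# with probability r per training sample, leave other modalities untouched.
import torch

def targeted_flair_dropout(x, flair_idx=3, r=0.35, training=True):
    """x: (B, C, D, H, W) multi-modal MRI batch (assumed channel order)."""
    if not training:
        return x
    x = x.clone()
    drop = torch.rand(x.shape[0]) < r         # decide per sample in the batch
    x[drop, flair_idx] = 0.0                  # replace only the T2-FLAIR channel
    return x

batch = torch.randn(8, 4, 16, 64, 64)
augmented = targeted_flair_dropout(batch, r=0.35)
```

Inference under the T2-FLAIR-absent scenario then simply zeroes the same channel, matching the simulation described in the abstract.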
[1349] Deep Learning-Based Site-Specific Channel Modeling and Inference
Junzhe Song, Ruisi He, Mi Yang, Zhengyu Zhang, Shuaiqi Gao, Bo Ai, Zhangdui Zhong
Main category: eess.IV
TL;DR: Deep learning framework using satellite images to predict complete wireless channel impulse response parameters for site-specific channel modeling.
Details
Motivation: Traditional wireless channel inference methods are unscalable, and existing AI approaches using satellite imagery only predict large-scale fading parameters, lacking the ability to reconstruct complete channel impulse response needed for next-generation wireless systems.
Method: Proposes a deep learning-based framework using satellite images to predict structured Tapped Delay Line parameters. Creates joint channel-satellite dataset from measurements, uses cross-attention-fused dual-branch pipeline to extract macroscopic/microscopic environmental features, and includes recurrent tracking module to capture long-term dynamic evolution of multipath components.
Result: Achieves high-quality reconstruction of channel impulse response in unseen scenarios with Power Delay Profile Average Cosine Similarity exceeding 0.96.
Conclusion: Provides a pathway toward site-specific channel inference for future dynamic wireless networks by enabling complete channel reconstruction from satellite imagery.
Abstract: Site-specific channel inference plays a critical role in the design and evaluation of next-generation wireless communication systems by considering the surrounding propagation environment. However, traditional methods are unscalable. Recently, satellite imagery has emerged as a valuable modality containing rich propagation information for AI-based channel prediction. However, existing approaches using these images are limited to predicting large-scale fading parameters, lacking the capacity to reconstruct the complete channel impulse response (CIR). To address this limitation, we propose a deep learning-based site-specific channel modeling and inference framework using satellite images to predict structured Tapped Delay Line (TDL) parameters. We first establish a joint channel-satellite dataset based on measurements. Then, a novel deep learning network is developed to reconstruct the channel parameters. Specifically, a cross-attention-fused dual-branch pipeline extracts macroscopic and microscopic environmental features, while a recurrent tracking module captures the long-term dynamic evolution of multipath components. Experimental results demonstrate that the proposed method achieves high-quality reconstruction of the CIR in unseen scenarios, with a Power Delay Profile (PDP) Average Cosine Similarity exceeding 0.96. This work provides a pathway toward site-specific channel inference for future dynamic wireless networks.
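The cross-attention fusion of the two environmental branches can be sketched with a standard attention module. Everything below (dimensions, token counts, the per-tap output head) is an illustrative assumption rather than the paper's architecture:

```python
# Hypothetical sketch of cross-attention fusion between macroscopic and
# microscopic satellite-image features, producing per-tap TDL parameters.
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, 2)         # e.g., (delay, power) per tap

    def forward(self, macro_feats, micro_feats):
        # macro_feats: (B, Nm, dim) coarse scene tokens from the satellite image
        # micro_feats: (B, Nt, dim) fine local tokens used as queries, one per tap
        fused, _ = self.cross_attn(micro_feats, macro_feats, macro_feats)
        return self.head(fused)               # (B, Nt, 2) structured TDL parameters

model = DualBranchFusion()
out = model(torch.randn(2, 49, 128), torch.randn(2, 24, 128))
print(out.shape)  # torch.Size([2, 24, 2])
```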
[1350] DRIFT: Deep Restoration, ISP Fusion, and Tone-mapping
Soumendu Majee, Joshua Peter Ebenezer, Abhinau K. Venkataramanan, Weidi Liu, Thilo Balke, Zeeshan Nadir, Sreenithy Chandran, Seok-Jun Lee, Hamid Rahim Sheikh
Main category: eess.IV
TL;DR: DRIFT is an efficient AI mobile camera pipeline that uses deep learning for multi-frame processing and tone-mapping to generate high-quality RGB images from raw captures.
Details
Motivation: Smartphone cameras need high-performance Image Signal Processors (ISPs) to generate high-quality images from raw captures while keeping computational costs low, especially for high-resolution and HDR imaging.
Method: Two-stage approach: 1) DRIFT-MFP network using adversarial perceptual loss for multi-frame alignment, denoising, demosaicing, and super-resolution; 2) DRIFT-TM deep-learning based tone-mapping solution for tone tunability and consistency with reference pipelines.
Result: Qualitative and quantitative comparisons show effectiveness against state-of-the-art MFP and tone-mapping methods, with efficient mobile deployment capability.
Conclusion: DRIFT provides an efficient AI-based mobile camera pipeline that generates high-quality RGB images from raw captures with computational efficiency suitable for mobile devices.
Abstract: Smartphone cameras have gained immense popularity with the adoption of high-resolution and high-dynamic-range imaging. As a result, high-performance camera Image Signal Processors (ISPs) are crucial for generating high-quality images for the end user while keeping computational costs low. In this paper, we propose DRIFT (Deep Restoration, ISP Fusion, and Tone-mapping): an efficient AI mobile camera pipeline that generates high-quality RGB images from hand-held raw captures. The first stage of DRIFT is a Multi-Frame Processing (MFP) network that is trained using an adversarial perceptual loss to perform multi-frame alignment, denoising, demosaicing, and super-resolution. The output of DRIFT-MFP is then processed by a novel deep-learning-based tone-mapping (DRIFT-TM) solution that allows for tone tunability, ensures tone consistency with a reference pipeline, and can be run efficiently for high-resolution images on a mobile device. We show qualitative and quantitative comparisons against state-of-the-art MFP and tone-mapping methods to demonstrate the effectiveness of our approach.
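As a toy illustration of the tone-tunability requirement, here is a simple parametric global tone-mapping operator of the kind a learned DRIFT-TM stage could replace or refine; the Reinhard-style curve and the exposure/contrast knobs are stand-ins for learned parameters, not the paper's method:

```python
# Hypothetical tunable global tone mapping: compress scene-referred HDR values
# into display range with a couple of user-adjustable parameters.
import torch

def tunable_tonemap(linear_rgb, exposure=1.0, contrast=1.0, gamma=2.2):
    """linear_rgb: (B, 3, H, W) scene-referred HDR values >= 0."""
    x = linear_rgb * exposure
    x = x / (1.0 + x)                         # Reinhard-style range compression
    x = x.clamp(0, 1) ** (contrast / gamma)   # simple tone/contrast control
    return x

ldr = tunable_tonemap(torch.rand(1, 3, 64, 64) * 8.0, exposure=1.5)
```

A learned tone-mapping network exposes the same kind of knobs but fits the curve to a reference ISP, which is how the pipeline keeps tone consistency while remaining tunable.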