Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] Joint Learning Global-Local Speaker Classification to Enhance End-to-End Speaker Diarization and Recognition
Yuhang Dai, Haopeng Lin, Jiale Qian, Ruiqi Yan, Hao Meng, Hanke Xie, Hanlin Wen, Shunshun Yin, Ming Tao, Xie Chen, Lei Xie, Xinsheng Wang
Main category: cs.SD
TL;DR: GLSC-SDR improves speaker discriminability in Large Audio-Language Models through joint training of speaker classification with diarization and recognition, using a Global-Local Speaker Classification strategy.
Details
Motivation: Current LALMs have limited speaker discriminability due to scarcity of large-scale conversational data and lack of explicit speaker representation optimization, which hinders their performance in speaker diarization and recognition tasks.
Method: Proposes GLSC-SDR paradigm that jointly trains speaker classification with diarization and recognition. Introduces Global-Local Speaker Classification strategy using clustered speakers as global labels and re-encoded intra-cluster speakers as local labels for hierarchical speaker discrimination.
Result: Achieves competitive or superior performance on AliMeeting, AISHELL-4, and AMI-SDM datasets compared to simulation-based and multi-encoder approaches, without requiring large-scale real conversational data.
Conclusion: The proposed approach effectively enhances fine-grained speaker discrimination while preserving semantic transcription accuracy in audio-language models.
Abstract: Large Audio-Language Models (LALMs) have demonstrated remarkable performance in end-to-end speaker diarization and recognition. However, their speaker discriminability remains limited due to the scarcity of large-scale conversational data and the absence of explicit speaker representation optimization. To address this, we propose GLSC-SDR, a paradigm that jointly trains speaker classification with diarization and recognition. We further introduce a Global-Local Speaker Classification strategy, which uses clustered speakers as global labels and re-encoded intra-cluster speakers as local labels. This hierarchical design enhances fine-grained speaker discrimination while preserving semantic transcription accuracy. Experiments on AliMeeting, AISHELL-4, and AMI-SDM demonstrate that GLSC-SDR achieves competitive or superior performance compared to simulation-based and multi-encoder approaches, without relying on large-scale real conversational data.
Relevance: 9/10
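For readers who want the mechanics, here is a minimal sketch of a joint global-local classification head of the kind the paper describes; the names, dimensions, and simple summed loss are illustrative assumptions, not details from the paper.

```python
# Illustrative only: a hierarchical global-local speaker classification head,
# assuming per-utterance speaker embeddings with two label granularities
# (global cluster IDs, local intra-cluster IDs). Not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalSpeakerHead(nn.Module):
    def __init__(self, dim=512, n_global=1000, n_local=20):
        super().__init__()
        self.global_head = nn.Linear(dim, n_global)  # coarse: clustered speakers
        self.local_head = nn.Linear(dim, n_local)    # fine: intra-cluster IDs

    def forward(self, spk_emb, global_labels, local_labels):
        loss_g = F.cross_entropy(self.global_head(spk_emb), global_labels)
        loss_l = F.cross_entropy(self.local_head(spk_emb), local_labels)
        return loss_g + loss_l  # added to the diarization/recognition loss

head = GlobalLocalSpeakerHead()
emb = torch.randn(8, 512)                      # 8 utterance-level embeddings
loss = head(emb, torch.randint(0, 1000, (8,)), torch.randint(0, 20, (8,)))
```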
[2] AVControl: Efficient Framework for Training Audio-Visual Controls
Matan Ben-Yosef, Tavi Halperin, Naomi Ken Korem, Mohammad Salama, Harel Cain, Asaf Joseph, Anthony Chen, Urska Jelercic, Ofir Bibi
Main category: cs.CV
TL;DR: AVControl is a lightweight, extendable framework for multimodal video and audio generation control using LoRA adapters on LTX-2 foundation model, enabling diverse control modalities without architectural changes.
Details
Motivation: Existing approaches for controlling video and audio generation either use monolithic models for fixed controls or require costly architectural changes for each new modality, lacking flexibility and efficiency.
Method: Builds on the LTX-2 joint audio-visual foundation model; each control modality is trained as a separate LoRA adapter on a parallel canvas that provides reference signals as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters.
Result: Outperforms all baselines on VACE Benchmark for depth- and pose-guided generation, inpainting, and outpainting; shows competitive results on camera control and audio-visual benchmarks; supports diverse independently trained modalities including first modular audio-visual controls.
Conclusion: AVControl provides an efficient, extendable framework for multimodal video and audio generation control that is both compute- and data-efficient, with each modality requiring minimal training data and converging quickly compared to monolithic alternatives.
Abstract: Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.
Relevance: 9/10
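The per-modality adapters are standard LoRA modules; as a refresher, a minimal LoRA wrapper around a frozen linear layer looks roughly like this (rank, scaling, and dimensions are illustrative, not AVControl's actual settings).

```python
# Minimal LoRA sketch: a low-rank adapter around a frozen linear projection,
# the general mechanism used per control modality. Sizes are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # foundation model stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

proj = LoRALinear(nn.Linear(1024, 1024))
out = proj(torch.randn(2, 77, 1024))         # only down/up receive gradients
```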
[3] SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment
Sahibzada Adil Shahzad, Ammarah Hashmi, Junichi Yamagishi, Yusuke Yasuda, Yu Tsao, Chia-Wen Lin, Yan-Tsung Peng, Hsin-Min Wang
Main category: cs.CV
TL;DR: SAVe is a self-supervised audio-visual deepfake detection framework that learns on authentic videos by generating pseudo-manipulations and modeling lip-speech synchronization to detect cross-modal inconsistencies.
Details
Motivation: Current multimodal deepfake detectors rely on curated synthetic forgeries, leading to dataset/generator bias and poor generalization to unseen manipulations. There's a need for scalable, robust detection that can identify subtle visual artifacts and cross-modal inconsistencies.
Method: SAVe uses self-supervised learning on authentic videos by: 1) Generating identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts across multiple facial granularities; 2) Modeling lip-speech synchronization via audio-visual alignment to detect temporal misalignment patterns characteristic of audio-visual forgeries.
Result: Experiments on FakeAVCeleb and AV-LipSync-TIMIT show competitive in-domain performance and strong cross-dataset generalization, demonstrating SAVe’s effectiveness as a scalable paradigm for multimodal deepfake detection.
Conclusion: Self-supervised learning on authentic videos with pseudo-manipulations and cross-modal alignment modeling provides a scalable and robust approach to multimodal deepfake detection that generalizes well to unseen manipulations.
Abstract: Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.
Relevance: 9/10
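As a rough illustration of the self-blending idea (not SAVe's actual augmentation pipeline), a pseudo-manipulated frame can be made by compositing an augmented copy of the same face back through a soft region mask, so the only training signal is the blending artifact itself.

```python
# Toy sketch of self-blending: composite an augmented copy of the SAME face
# onto the original through a region mask. The paper's augmentations, mask
# granularities, and identity-preserving constraints are not reproduced here.
import numpy as np

def self_blend(face: np.ndarray, region_mask: np.ndarray) -> np.ndarray:
    # face: HxWx3 float in [0,1]; region_mask: HxW soft mask in [0,1]
    augmented = np.clip(face * 1.1 + 0.02, 0.0, 1.0)  # mild jitter stand-in
    mask = region_mask[..., None]
    return mask * augmented + (1.0 - mask) * face     # pseudo-manipulated frame

face = np.random.rand(224, 224, 3)
mask = np.zeros((224, 224))
mask[90:140, 60:160] = 1.0       # e.g. mouth region
fake = self_blend(face, mask)    # label: fake; untouched frames: real
```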
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 79]
- cs.CV [Total: 258]
- cs.AI [Total: 100]
- cs.SD [Total: 9]
- cs.LG [Total: 120]
- cs.MA [Total: 7]
- cs.MM [Total: 2]
- eess.AS [Total: 4]
- eess.IV [Total: 6]
cs.CL
[1] When Consistency Becomes Bias: Interviewer Effects in Semi-Structured Clinical Interviews
Hasindri Watawana, Sergio Burdisso, Diego A. Moreno-Galván, Fernando Sánchez-Vega, A. Pastor López-Monroy, Petr Motlicek, Esaú Villatoro-Tello
Main category: cs.CL
TL;DR: Models for depression detection from doctor-patient conversations exploit interviewer prompt artifacts rather than genuine patient language cues, revealing systematic bias across datasets.
Details
Motivation: While automatic depression detection from conversations has advanced, interpretability remains limited: models achieve strong performance without revealing what drives predictions, potentially exploiting dataset artifacts rather than genuine linguistic cues.
Method: Analyzed three depression detection datasets (ANDROIDS, DAIC-WOZ, E-DAIC), examined systematic bias from interviewer prompts in semi-structured interviews, compared models trained on interviewer vs participant utterances, and localized decision evidence by time and speaker.
Result: Models trained on interviewer turns exploit fixed prompts and positions to distinguish depressed vs control subjects, achieving high classification scores without using participant language. Restricting to participant utterances distributes decision evidence more broadly and reflects genuine linguistic cues.
Conclusion: Semi-structured interview protocols create systematic bias where models leverage script artifacts rather than patient language. Need for analyses that localize decision evidence by time and speaker to ensure models learn from participants’ actual language.
Abstract: Automatic depression detection from doctor-patient conversations has gained momentum thanks to the availability of public corpora and advances in language modeling. However, interpretability remains limited: strong performance is often reported without revealing what drives predictions. We analyze three datasets (ANDROIDS, DAIC-WOZ, and E-DAIC) and identify a systematic bias from interviewer prompts in semi-structured interviews. Models trained on interviewer turns exploit fixed prompts and positions to distinguish depressed from control subjects, often achieving high classification scores without using participant language. Restricting models to participant utterances distributes decision evidence more broadly and reflects genuine linguistic cues. While semi-structured protocols ensure consistency, including interviewer prompts inflates performance by leveraging script artifacts. Our results highlight a cross-dataset, architecture-agnostic bias and emphasize the need for analyses that localize decision evidence by time and speaker to ensure models learn from participants’ language.
[2] Demystifying When Pruning Works via Representation Hierarchies
Shwai He, Guoheng Sun, Haichao Zhang, Yun Fu, Ang Li
Main category: cs.CL
TL;DR: Network pruning works well for non-generative language tasks but fails for generative tasks due to amplified perturbations in probability space during sequential generation.
Details
Motivation: To understand why network pruning affects language tasks inconsistently, working well for non-generative tasks but failing for generative ones, despite expectations of efficiency gains with preserved performance.
Method: Analyze network pruning from a representation-hierarchy perspective, decomposing language model computation into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). Examine how pruning-induced perturbations propagate through these spaces.
Result: Embedding and logit spaces are robust to pruning perturbations, but the nonlinear softmax transformation amplifies deviations in probability space. These amplified errors accumulate across time steps during generation, causing substantial degradation. The categorical-token probability subspace stability supports pruning effectiveness for non-generative tasks.
Conclusion: Pruning’s task-dependent effects stem from how perturbations propagate through representation hierarchies: stable for non-generative tasks but problematic for generation due to error accumulation in probability space. Provides practical guidance for pruning applications.
Abstract: Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non-generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation-hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning-induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial degradation during generation. In contrast, the stability of the categorical-token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non-generative tasks such as retrieval and multiple-choice selection. Our analysis disentangles the effects of pruning across tasks and provides practical guidance for its application. Code is available at https://github.com/CASE-Lab-UMD/Pruning-on-Representations
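The claimed amplification mechanism is easy to see in a toy example (not from the paper): a perturbation that barely moves the logits can flip the argmax after softmax, and in autoregressive generation one flipped token feeds every later step.

```python
# Toy illustration: a small shift that is negligible in logit space can flip
# the argmax after softmax; during generation the flipped token becomes input
# for every subsequent step, so such errors accumulate.
import torch

logits = torch.tensor([2.00, 1.95, -1.0])              # top-2 nearly tied
perturbed = logits + torch.tensor([-0.05, 0.05, 0.0])  # tiny pruning-induced shift

print(torch.softmax(logits, dim=-1))     # argmax = token 0
print(torch.softmax(perturbed, dim=-1))  # argmax flips to token 1
# For multiple-choice scoring over a few fixed option probabilities, such a
# shift rarely reorders the options, which is consistent with pruning working
# for non-generative tasks but degrading generation.
```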
[3] Fine-Tuning A Large Language Model for Systematic Review Screening
Kweku Yamoah, Noah Schroeder, Emmanuel Dorley, Neha Rani, Caleb Schutz
Main category: cs.CL
TL;DR: Fine-tuning a small 1.2B parameter LLM for systematic review screening shows strong performance improvements over base model, achieving 86.40% agreement with human coders.
Details
Motivation: Systematic reviews require extensive manual screening of titles/abstracts, which is time-consuming. While LLMs have been explored for efficiency, inconsistent results suggest prompting alone may not provide sufficient context for good performance.
Method: Fine-tuned a 1.2 billion parameter open-weight LLM specifically for study screening using human-rated data from a systematic review with over 8500 titles/abstracts.
Result: Fine-tuned model showed 80.79% improvement in weighted F1 score compared to base model. On full dataset of 8,277 studies: 86.40% agreement with human coder, 91.18% true positive rate, 86.38% true negative rate, and perfect agreement across multiple inference runs.
Conclusion: Fine-tuning LLMs shows promise for title and abstract screening in large-scale systematic reviews, addressing limitations of prompting-only approaches.
Abstract: Systematic reviews traditionally have taken considerable amounts of human time and energy to complete, in part due to the extensive number of titles and abstracts that must be reviewed for potential inclusion. Recently, researchers have begun to explore how to use large language models (LLMs) to make this process more efficient. However, research to date has shown inconsistent results. We posit this is because prompting alone may not provide sufficient context for the model(s) to perform well. In this study, we fine-tune a small 1.2 billion parameter open-weight LLM specifically for study screening in the context of a systematic review in which humans rated more than 8500 titles and abstracts for potential inclusion. Our results showed strong performance improvements from the fine-tuned model, with the weighted F1 score improving 80.79% compared to the base model. When run on the full dataset of 8,277 studies, the fine-tuned model had 86.40% agreement with the human coder, a 91.18% true positive rate, an 86.38% true negative rate, and perfect agreement across multiple inference runs. Taken together, our results show that there is promise for fine-tuning LLMs for title and abstract screening in large-scale systematic reviews.
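For context on how the reported rates fit together: the counts below are hypothetical, but chosen so they reproduce the paper's numbers on the 8,277-study dataset (the actual confusion-matrix cells are not given in the summary).

```python
# Arithmetic behind the reported screening metrics, with hypothetical counts
# that reproduce the paper's rates on 8,277 studies.
tp, fn = 31, 3         # studies the human coder included: caught / missed
tn, fp = 7120, 1123    # studies the human coder excluded: matched / kept

agreement = (tp + tn) / (tp + tn + fp + fn)   # 7151/8277 = 86.40%
tpr = tp / (tp + fn)                          # 31/34     = 91.18%
tnr = tn / (tn + fp)                          # 7120/8243 = 86.38%
print(f"agreement={agreement:.2%}, TPR={tpr:.2%}, TNR={tnr:.2%}")
# A high TPR matters most here: a missed (false-negative) study silently
# drops evidence from the review, while false positives only cost reviewer time.
```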
[4] Evaluating Fine-Tuned LLM Model For Medical Transcription With Small Low-Resource Languages Validated Dataset
Mohammed Nowshad Ruhani Chowdhury, Mohammed Nowaz Rabbani Chowdhury, Sakari Lukkarinen
Main category: cs.CL
TL;DR: Fine-tuning LLaMA 3.1-8B on Finnish clinical conversations shows promise for medical transcription despite low n-gram overlap, with strong semantic similarity scores.
Details
Motivation: Clinical documentation burden contributes to physician burnout, especially for low-resource languages like Finnish. There's a need for effective NLP solutions for medical transcription in Finnish to reduce administrative workload.
Method: Fine-tuned LLaMA 3.1-8B on a small validated corpus of simulated clinical conversations from Metropolia University students using controlled preprocessing and optimization, evaluated via sevenfold cross-validation.
Result: BLEU=0.1214, ROUGE-L=0.4982, BERTScore F1=0.8230. Low n-gram overlap but strong semantic similarity with reference transcripts, indicating effective medical discourse translation in spoken Finnish.
Conclusion: Fine-tuning is effective for medical transcription in Finnish, supporting feasibility of privacy-oriented domain-specific LLMs for clinical documentation, with directions for future work identified.
Abstract: Clinical documentation is a critical factor for patient safety, diagnosis, and continuity of care. The administrative burden of EHRs is a significant factor in physician burnout, and this is a critical issue for low-resource languages, including Finnish. This study investigates the effectiveness of a domain-aligned natural language processing (NLP) large language model for medical transcription in Finnish by fine-tuning LLaMA 3.1-8B on a small validated corpus of simulated clinical conversations by students at Metropolia University of Applied Sciences. The fine-tuning process for medical transcription used a controlled preprocessing and optimization approach, and its effectiveness was evaluated by sevenfold cross-validation. The evaluation metrics for the fine-tuned LLaMA 3.1-8B were BLEU = 0.1214, ROUGE-L = 0.4982, and BERTScore F1 = 0.8230. The results showed a low n-gram overlap but a strong semantic similarity with reference transcripts. This study indicates that fine-tuning can be an effective approach for translation of medical discourse in spoken Finnish and supports the feasibility of fine-tuning a privacy-oriented domain-specific large language model for clinical documentation in Finnish. It also provides directions for future work.
[5] Enhancing Structured Meaning Representations with Aspect Classification
Claire Benét Post, Paul Bontempo, August Milliken, Alvin Po-Chun Chen, Nicholas Derby, Saksham Khatwani, Sumeyye Nabieva, Karthik Sairam, Alexis Palmer
Main category: cs.CL
TL;DR: New English dataset with UMR aspect labels over AMR graphs to address sparse aspect annotation in semantic representations, with baseline experiments for automatic prediction.
Details
Motivation: Aspect is crucial for capturing event temporal structure in semantic representations but remains sparsely annotated, hindering both manual annotation and development of automatic aspect prediction systems.
Method: Created new dataset with UMR aspect labels over AMR graphs, developed annotation scheme and guidelines for labeling eventive predicates according to the UMR aspect lattice, used multi-step adjudication process for consistency, and conducted baseline experiments with three modeling approaches.
Result: Established initial benchmarks for automatic UMR aspect prediction, providing foundation for integrating aspect into semantic meaning representations more broadly.
Conclusion: The dataset enables future automation of aspect prediction in semantic representations, addressing a critical gap in semantic annotation frameworks.
Abstract: To fully capture the meaning of a sentence, semantic representations should encode aspect, which describes the internal temporal structure of events. In graph-based meaning representation frameworks such as Uniform Meaning Representations (UMR), aspect lets one know how events unfold over time, including distinctions such as states, activities, and completed events. Despite its importance, aspect remains sparsely annotated across semantic meaning representation frameworks. This has, in turn, hindered not only current manual annotation, but also the development of automatic systems capable of predicting aspectual information. In this paper, we introduce a new dataset of English sentences annotated with UMR aspect labels over Abstract Meaning Representation (AMR) graphs that lack the feature. We describe the annotation scheme and guidelines used to label eventive predicates according to the UMR aspect lattice, as well as the annotation pipeline used to ensure consistency and quality across annotators through a multi-step adjudication process. To demonstrate the utility of our dataset for future automation, we present baseline experiments using three modeling approaches. Our results establish initial benchmarks for automatic UMR aspect prediction and provide a foundation for integrating aspect into semantic meaning representations more broadly.
[6] Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining
Thales Sales Almeida, Rodrigo Nogueira, Hélio Pedrini
Main category: cs.CL
TL;DR: Synthetic document rewriting for Portuguese language model pretraining shows quality-dependent benefits, with high-quality source data yielding greater gains than low-quality data, especially at larger model scales.
Details
Motivation: Most synthetic data generation studies focus on English and don't systematically control for source data quality. The authors aim to study how synthetic rewriting interacts with source data quality in Portuguese continued pretraining.
Method: Used ClassiCC-PT Portuguese corpus with STEM/Educational quality scores to create two 10B-token subsets at different quality levels. Rewrote each into four styles using a 7B instruction-tuned model, producing ~40B tokens per condition. Trained two English-centric base models (1.1B and 7B parameters) on each condition and evaluated on PoETa V2 benchmark (44 Portuguese tasks).
Result: At 7B scale: rewriting high-quality data yields +3.4 NPM gain over unmodified data, while rewriting low-quality data provides only +0.5 NPM. At 1.1B scale: the interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data.
Conclusion: Synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and this effect is scale-dependent. High-quality source data is crucial for synthetic data generation benefits.
Abstract: Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.
[7] Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR
Haobo Xu, Sirui Chen, Ruizhong Qiu, Yuchen Yan, Chen Luo, Monica Cheng, Jingrui He, Hanghang Tong
Main category: cs.CL
TL;DR: ARRoL accelerates RLVR training by pruning low-quality rollouts during generation and steering surviving ones toward better learning signals, achieving both speedup and accuracy gains.
Details
Motivation: RLVR methods like GRPO and DAPO suffer from high computational costs due to sampling many rollouts per prompt, and sparse relative advantages that yield weak learning signals when many samples become nearly all-correct or all-incorrect.
Method: ARRoL introduces online rollout pruning: trains a lightweight quality head on-the-fly to predict success probability of partial rollouts for early pruning decisions, uses system design that prunes inside inference engine and re-batches remaining rollouts for efficient computation.
Result: Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), ARRoL improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, with additional +8.33 gains in test-time scaling accuracy.
Conclusion: ARRoL effectively addresses computational inefficiency and weak learning signals in RLVR through intelligent online pruning, delivering both training acceleration and improved model performance.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce ARRoL (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones more correctness-balanced to enhance learning signals. Specifically, ARRoL trains a lightweight quality head on-the-fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), ARRoL improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yielding up to +8.33 additional gains in average accuracy in test-time scaling. The code is available at https://github.com/Hsu1023/ARRoL.
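A standalone sketch of the pruning step as described; the real system trains the quality head on-the-fly inside the inference engine, and all names and sizes here are illustrative.

```python
# Hedged sketch: a lightweight head predicts the success probability of each
# partial rollout from its last hidden state; low-scoring rollouts are dropped
# mid-generation and the survivors are re-batched and continued.
import torch
import torch.nn as nn

quality_head = nn.Sequential(nn.Linear(4096, 256), nn.ReLU(),
                             nn.Linear(256, 1), nn.Sigmoid())

def prune_rollouts(hidden_states, keep_threshold=0.2):
    # hidden_states: (n_rollouts, hidden_dim) from the partial generations
    p_success = quality_head(hidden_states).squeeze(-1)
    keep = p_success >= keep_threshold
    return keep, p_success   # scores can also reweight test-time candidates

h = torch.randn(16, 4096)    # 16 partial rollouts for one prompt
keep, scores = prune_rollouts(h)
```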
[8] Adapting Self-Supervised Speech Representations for Cross-lingual Dysarthria Detection in Parkinson’s Disease
Abner Hernandez, Eunjung Yeo, Kwanghee Choi, Chin-Jou Li, Zhengjun Yue, Rohan Kumar Das, Jan Rusz, Mathew Magimai Doss, Juan Rafael Orozco-Arroyave, Tomás Arias-Vergara, Andreas Maier, Elmar Nöth, David R. Mortensen, David Harwath, Paula Andrea Perez-Toro
Main category: cs.CL
TL;DR: Cross-lingual dysarthria detection using representation-level language shift to align speech representations across languages, evaluated on Parkinson’s disease speech datasets.
Details
Motivation: Limited dysarthric speech data availability makes cross-lingual detection challenging, as speech representations often encode language-dependent structure that confounds dysarthria detection.
Method: Proposed representation-level language shift (LS) that aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech.
Result: LS substantially improves sensitivity and F1 in cross-lingual settings, with smaller but consistent gains in multilingual settings; representation analysis shows LS reduces language identity in the embedding space.
Conclusion: Language shift effectively removes language-dependent structure from speech representations, enabling better cross-lingual dysarthria detection.
Abstract: The limited availability of dysarthric speech data makes cross-lingual detection an important but challenging problem. A key difficulty is that speech representations often encode language-dependent structure that can confound dysarthria detection. We propose a representation-level language shift (LS) that aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech. We evaluate the approach on oral DDK recordings from Parkinson’s disease speech datasets in Czech, German, and Spanish under both cross-lingual and multilingual settings. LS substantially improves sensitivity and F1 in cross-lingual settings, while yielding smaller but consistent gains in multilingual settings. Representation analysis further shows that LS reduces language identity in the embedding space, supporting the interpretation that LS removes language-dependent structure.
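The centroid-based adaptation the abstract describes reduces to a simple mean shift; a minimal sketch, assuming per-utterance feature vectors and healthy-control data in both languages (the paper's exact estimator, e.g. global vs. per-layer centroids, may differ).

```python
# Minimal sketch of the language shift (LS): move source-language features by
# the difference between target- and source-language centroids, both estimated
# from healthy-control speech only.
import numpy as np

def language_shift(x_src, hc_src, hc_tgt):
    # x_src:  (n, d) source-language features to adapt
    # hc_src: (m, d) healthy-control features, source language
    # hc_tgt: (k, d) healthy-control features, target language
    mu_src = hc_src.mean(axis=0)
    mu_tgt = hc_tgt.mean(axis=0)
    return x_src - mu_src + mu_tgt   # align with the target distribution

x = language_shift(np.random.randn(100, 768),
                   np.random.randn(500, 768),
                   np.random.randn(500, 768))
```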
[9] LogSigma at SemEval-2026 Task 3: Uncertainty-Weighted Multitask Learning for Dimensional Aspect-Based Sentiment Analysis
Baraa Hikal, Jonas Becker, Bela Gipp
Main category: cs.CL
TL;DR: LogSigma system for SemEval-2026 Task 3 achieves state-of-the-art performance on Dimensional Aspect-Based Sentiment Analysis by using learned homoscedastic uncertainty to automatically balance Valence and Arousal prediction tasks across different languages.
Details
Motivation: Traditional Aspect-Based Sentiment Analysis uses discrete sentiment labels, but Dimensional ABSA requires predicting continuous Valence and Arousal scores. A key challenge is that Valence and Arousal differ in prediction difficulty across languages and domains, requiring adaptive task balancing.
Method: Uses learned homoscedastic uncertainty where the model learns task-specific log-variance parameters to automatically balance each regression objective during training. Combined with language-specific encoders and multi-seed ensembling.
Result: Achieves 1st place on five datasets across both tracks. The learned variance weights vary substantially across languages (0.66x for German to 2.18x for English), demonstrating language-dependent optimal task balancing.
Conclusion: Optimal task balancing for Valence and Arousal prediction is language-dependent and cannot be determined a priori. Learned homoscedastic uncertainty effectively addresses this challenge for Dimensional ABSA.
Abstract: This paper describes LogSigma, our system for SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Unlike traditional Aspect-Based Sentiment Analysis (ABSA), which predicts discrete sentiment labels, DimABSA requires predicting continuous Valence and Arousal (VA) scores on a 1-9 scale. A central challenge is that Valence and Arousal differ in prediction difficulty across languages and domains. We address this using learned homoscedastic uncertainty, where the model learns task-specific log-variance parameters to automatically balance each regression objective during training. Combined with language-specific encoders and multi-seed ensembling, LogSigma achieves 1st place on five datasets across both tracks. The learned variance weights vary substantially across languages due to differing Valence-Arousal difficulty profiles (from 0.66x for German to 2.18x for English), demonstrating that optimal task balancing is language-dependent and cannot be determined a priori.
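The learned log-variance balancing described here matches the classic homoscedastic-uncertainty multitask loss (Kendall et al., 2018); a minimal sketch under the assumption that LogSigma uses this standard form.

```python
# Each task loss L_i is scaled by exp(-s_i) and regularized by +s_i, where
# s_i = log(sigma_i^2) is a trainable parameter. Whether LogSigma uses exactly
# this variant is an assumption.
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    def __init__(self, n_tasks=2):                      # valence, arousal
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, task_losses):                     # list of scalar losses
        losses = torch.stack(task_losses)
        return (torch.exp(-self.log_var) * losses + self.log_var).sum()

crit = UncertaintyWeightedLoss()
mse = nn.functional.mse_loss
pred_v, pred_a = torch.randn(8), torch.randn(8)
gold_v, gold_a = torch.randn(8), torch.randn(8)
loss = crit([mse(pred_v, gold_v), mse(pred_a, gold_a)])
loss.backward()   # gradients reach log_var, auto-balancing the two tasks
```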
[10] Estimating near-verbatim extraction risk in language models with decoding-constrained beam search
A. Feder Cooper, Mark A. Lemley, Christopher De Sa, Lea Duesterwald, Allison Casasola, Jamie Hayes, Katherine Lee, Daniel E. Ho, Percy Liang
Main category: cs.CL
TL;DR: The paper introduces decoding-constrained beam search to efficiently compute deterministic lower bounds on near-verbatim extraction risk in LLMs, addressing the combinatorial challenge of quantifying privacy/copyright risks beyond verbatim memorization.
Details
Motivation: Standard methods for quantifying memorization in LLMs only measure verbatim extraction, missing near-verbatim instances that pose similar privacy and copyright risks. Existing probabilistic extraction methods are tractable only for verbatim cases, while quantifying near-verbatim extraction is computationally expensive due to the combinatorial explosion of possible near-verbatim suffixes.
Method: The authors introduce decoding-constrained beam search, which yields deterministic lower bounds on near-verbatim extraction risk at a cost comparable to only ~20 Monte Carlo samples per sequence, compared to ~100,000 samples needed for reliable MC estimation.
Result: The approach reveals information invisible to verbatim methods: many more extractable sequences, substantially larger per-sequence extraction mass, and patterns in how near-verbatim extraction risk manifests across model sizes and types of text.
Conclusion: Decoding-constrained beam search provides an efficient method to quantify near-verbatim extraction risk in LLMs, enabling better assessment of privacy and copyright risks beyond simple verbatim memorization detection.
Abstract: Recent work shows that standard greedy-decoding extraction methods for quantifying memorization in LLMs miss how extraction risk varies across sequences. Probabilistic extraction – computing the probability of generating a target suffix given a prefix under a decoding scheme – addresses this, but is tractable only for verbatim memorization, missing near-verbatim instances that pose similar privacy and copyright risks. Quantifying near-verbatim extraction risk is expensive: the set of near-verbatim suffixes is combinatorially large, and reliable Monte Carlo (MC) estimation can require ~100,000 samples per sequence. To mitigate this cost, we introduce decoding-constrained beam search, which yields deterministic lower bounds on near-verbatim extraction risk at a cost comparable to ~20 MC samples per sequence. Across experiments, our approach surfaces information invisible to verbatim methods: many more extractable sequences, substantially larger per-sequence extraction mass, and patterns in how near-verbatim extraction risk manifests across model sizes and types of text.
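An abstract sketch of the constrained-search idea as described, using a stand-in LM and a simple substitution budget as the "near-verbatim" constraint; the paper's actual constraint set and scoring may differ.

```python
# Sketch: expand beams only with hypotheses that stay within a substitution
# budget of the target suffix. The summed probability of the surviving beams
# is a deterministic lower bound on near-verbatim extraction risk (beam
# truncation can only drop probability mass, never add it).
def constrained_beam_search(next_token_probs, target, beam_width=20, max_subs=2):
    beams = [((), 1.0, 0)]                        # (tokens, prob, substitutions)
    for gold in target:
        expanded = []
        for toks, p, subs in beams:
            for tok, q in next_token_probs(toks).items():
                new_subs = subs + (tok != gold)
                if new_subs <= max_subs:          # enforce near-verbatim budget
                    expanded.append((toks + (tok,), p * q, new_subs))
        beams = sorted(expanded, key=lambda b: -b[1])[:beam_width]
    return sum(p for _, p, _ in beams)            # lower bound on extraction mass

def toy_lm(prefix):                               # stand-in for the real model
    return {"a": 0.5, "b": 0.3, "c": 0.2}

print(constrained_beam_search(toy_lm, target=("a", "b", "a", "c")))
```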
[11] Toward domain-specific machine translation and quality estimation systems
Javad Pourmostafa Roshan Sharami
Main category: cs.CL
TL;DR: This dissertation focuses on domain adaptation for Machine Translation and Quality Estimation through data-focused methods including similarity-based data selection, staged training pipelines, tokenization-vocabulary alignment, and QE-guided in-context learning.
Details
Motivation: MT and QE systems perform well in general domains but degrade under domain mismatch, creating a need for effective adaptation methods to specialized domains.
Method: Four main approaches: 1) Similarity-based data selection for MT using targeted in-domain subsets; 2) Staged QE training with domain adaptation and data augmentation; 3) Study of subword tokenization and vocabulary alignment in fine-tuning; 4) QE-guided in-context learning for LLMs to select examples without parameter updates.
Result: Small targeted datasets outperform larger generic ones; staged training improves performance across domains/languages; aligned tokenization-vocabulary setups lead to better translation; QE-guided selection outperforms standard retrieval methods and supports reference-free setups.
Conclusion: Domain adaptation depends on data selection, representation, and efficient adaptation strategies. The dissertation provides methods for building reliable MT and QE systems in domain-specific settings.
Abstract: Machine Translation (MT) and Quality Estimation (QE) perform well in general domains but degrade under domain mismatch. This dissertation studies how to adapt MT and QE systems to specialized domains through a set of data-focused contributions. Chapter 2 presents a similarity-based data selection method for MT. Small, targeted in-domain subsets outperform much larger generic datasets and reach strong translation quality at lower computational cost. Chapter 3 introduces a staged QE training pipeline that combines domain adaptation with lightweight data augmentation. The method improves performance across domains, languages, and resource settings, including zero-shot and cross-lingual cases. Chapter 4 studies the role of subword tokenization and vocabulary in fine-tuning. Aligned tokenization-vocabulary setups lead to stable training and better translation quality, while mismatched configurations reduce performance. Chapter 5 proposes a QE-guided in-context learning method for large language models. QE models select examples that improve translation quality without parameter updates and outperform standard retrieval methods. The approach also supports a reference-free setup, reducing reliance on a single reference set. These results show that domain adaptation depends on data selection, representation, and efficient adaptation strategies. The dissertation provides methods for building MT and QE systems that perform reliably in domain-specific settings.
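For concreteness, similarity-based in-domain selection of the kind Chapter 2 studies typically follows the recipe below; whether the dissertation uses centroid cosine similarity over sentence embeddings is an assumption, and embeddings here are presumed precomputed.

```python
# Generic sketch: rank a generic parallel pool by cosine similarity to the
# centroid of a small in-domain seed set and keep the top slice.
import numpy as np

def select_in_domain(pool_emb, seed_emb, top_k=10000):
    centroid = seed_emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    pool = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    sims = pool @ centroid
    return np.argsort(-sims)[:top_k]   # indices of the most in-domain pairs

idx = select_in_domain(np.random.randn(100000, 384), np.random.randn(200, 384))
```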
[12] LLM-Driven Reasoning for Constraint-Aware Feature Selection in Industrial Systems
Yuhang Zhou, Zhuokai Zhao, Ke Li, Spilios Evmorfos, Gökalp Demirci, Mingyi Wang, Qiao Liu, Qifei Wang, Serena Li, Weiwei Li, Tingting Wang, Mingze Gao, Gedi Zhou, Abhishek Kumar, Xiangjun Fan, Lizhu Zhang, Jiayi Liu
Main category: cs.CL
TL;DR: MoFA is an LLM-based framework for automated feature selection in industrial ML systems using semantic and quantitative feature information with constraint-aware reasoning.
Details
Motivation: Traditional feature selection methods rely on labeled data and statistical heuristics, which are difficult to apply in production environments with limited labeled data and multiple operational constraints.
Method: MoFA performs sequential, reasoning-based feature selection using structured prompts that incorporate feature definitions, importance scores, correlations, and metadata (feature groups/types) for interpretable, constraint-aware reasoning.
Result: Evaluated in three real-world industrial applications: improved accuracy while reducing feature group complexity in interest prediction, discovered high-order interaction terms yielding substantial engagement gains, and selected compact feature subsets improving both accuracy and inference efficiency.
Conclusion: Demonstrates practicality and effectiveness of LLM-based reasoning for feature selection in real production systems, addressing limitations of traditional methods.
Abstract: Feature selection is a crucial step in large-scale industrial machine learning systems, directly affecting model accuracy, efficiency, and maintainability. Traditional feature selection methods rely on labeled data and statistical heuristics, making them difficult to apply in production environments where labeled data are limited and multiple operational constraints must be satisfied. To address this, we propose Model Feature Agent (MoFA), a model-driven framework that performs sequential, reasoning-based feature selection using both semantic and quantitative feature information. MoFA incorporates feature definitions, importance scores, correlations, and metadata (e.g., feature groups or types) into structured prompts and selects features through interpretable, constraint-aware reasoning. We evaluate MoFA in three real-world industrial applications: (1) True Interest and Time-Worthiness Prediction, where it improves accuracy while reducing feature group complexity, (2) Value Model Enhancement, where it discovers high-order interaction terms that yield substantial engagement gains in online experiments, and (3) Notification Behavior Prediction, where it selects compact, high-value feature subsets that improve both model accuracy and inference efficiency. Together, these results demonstrate the practicality and effectiveness of LLM-based reasoning for feature selection in real production systems.
[13] Exons-Detect: Identifying and Amplifying Exonic Tokens via Hidden-State Discrepancy for Robust AI-Generated Text Detection
Xiaowei Zhu, Yubing Ren, Fang Fang, Shi Wang, Yanan Cao, Li Guo
Main category: cs.CL
TL;DR: Exons-Detect: A training-free method for AI-generated text detection using exon-aware token reweighting based on hidden-state discrepancies.
Details
Motivation: The blurring boundary between human-written and AI-generated text creates societal risks like misinformation and authorship ambiguity, highlighting the need for effective detection methods. Existing training-free approaches assume uniform token contributions, making them less robust to short sequences or localized token modifications.
Method: Proposes Exons-Detect, a training-free method that identifies and amplifies informative “exonic” tokens by measuring hidden-state discrepancies under a dual-model setting. It computes an interpretable translation score from importance-weighted token sequences.
Result: Achieves state-of-the-art detection performance with strong robustness to adversarial attacks and varying input lengths. Attains 2.2% relative improvement in average AUROC over the strongest prior baseline on DetectRL benchmark.
Conclusion: Exons-Detect provides an effective, training-free solution for AI-generated text detection that addresses limitations of existing methods through exon-aware token reweighting, offering improved robustness and interpretability.
Abstract: The rapid advancement of large language models has increasingly blurred the boundary between human-written and AI-generated text, raising societal risks such as misinformation dissemination, authorship ambiguity, and threats to intellectual property rights. These concerns highlight the urgent need for effective and reliable detection methods. While existing training-free approaches often achieve strong performance by aggregating token-level signals into a global score, they typically assume uniform token contributions, making them less robust under short sequences or localized token modifications. To address these limitations, we propose Exons-Detect, a training-free method for AI-generated text detection based on an exon-aware token reweighting perspective. Exons-Detect identifies and amplifies informative exonic tokens by measuring hidden-state discrepancy under a dual-model setting, and computes an interpretable translation score from the resulting importance-weighted token sequence. Empirical evaluations demonstrate that Exons-Detect achieves state-of-the-art detection performance and exhibits strong robustness to adversarial attacks and varying input lengths. In particular, it attains a 2.2% relative improvement in average AUROC over the strongest prior baseline on DetectRL.
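A loose sketch of the token-reweighting idea; the paper's actual discrepancy measure, dual-model alignment, and "translation score" are not specified in this summary, so every formula below is a placeholder.

```python
# Placeholder sketch: score each token by the discrepancy between two models'
# hidden states, then use the normalized discrepancies to weight a per-token
# detection statistic into one global score.
import torch

def exon_weighted_score(h_a, h_b, token_stat):
    # h_a, h_b: (seq_len, dim) hidden states from the two models
    # token_stat: (seq_len,) any per-token detection signal (e.g. log-prob)
    disc = (h_a - h_b).norm(dim=-1)        # hidden-state discrepancy
    w = torch.softmax(disc, dim=0)         # amplify the "exonic" tokens
    return (w * token_stat).sum()          # importance-weighted global score

s = exon_weighted_score(torch.randn(32, 768), torch.randn(32, 768),
                        torch.randn(32))
```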
[14] Imperative Interference: Social Register Shapes Instruction Topology in Large Language Models
Tony Mason
Main category: cs.CL
TL;DR: Multilingual LLMs interpret imperative instructions differently across languages due to learned social register conventions, affecting whether system prompts are cooperative or competitive.
Details
Motivation: To understand how multilingual language models process instructions differently across languages, particularly how social register (imperative vs declarative mood) affects instruction interpretation and whether this creates language-dependent alignment issues.
Method: Conducted instruction-level ablation experiments across four languages and four models using 22 hand-authored probes against a production system prompt decomposed into 56 blocks. Tested declarative rewriting of imperative instructions to reduce cross-linguistic variance.
Result: Declarative rewriting reduced cross-linguistic variance by 81% (p = 0.029). Rewriting just 3 of 11 imperative blocks shifted Spanish instruction topology from competitive to cooperative, with spillover effects on unrewritten blocks.
Conclusion: Models process instructions as social acts rather than technical specifications, with imperative mood carrying language-dependent obligatory force. This suggests constitutional AI principles in imperative mood may create language-dependent alignment issues.
Abstract: System prompt instructions that cooperate in English compete in Spanish, with the same semantic content, but opposite interaction topology. We present instruction-level ablation experiments across four languages and four models showing that this topology inversion is mediated by social register: the imperative mood carries different obligatory force across speech communities, and models trained on multilingual data have learned these conventions. Declarative rewriting of a single instruction block reduces cross-linguistic variance by 81% (p = 0.029, permutation test). Rewriting three of eleven imperative blocks shifts Spanish instruction topology from competitive to cooperative, with spillover effects on unrewritten blocks. These findings suggest that models process instructions as social acts, not technical specifications: “NEVER do X” is an exercise of authority whose force is language-dependent, while “X: disabled” is a factual description that transfers across languages. If register mediates instruction-following at inference time, it plausibly does so during training. We state this as a testable prediction: constitutional AI principles authored in imperative mood may create language-dependent alignment. Corpus: 22 hand-authored probes against a production system prompt decomposed into 56 blocks.
[15] Approaches to Analysing Historical Newspapers Using LLMs
Filip Dobranić, Tina Munda, Oliver Pejić, Vojko Gorjanc, Uroš Šmajdek, David Bordon, Jakob Lenardič, Tjaša Konovšek, Kristina Pahor de Maiti Tekavčič, Ciril Bohak, Darja Fišer
Main category: cs.CL
TL;DR: Computational analysis of Slovene historical newspapers using topic modeling, LLM-based sentiment analysis, and entity graphs to study collective identity representations in late 19th/early 20th century public discourse.
Details
Motivation: To examine how collective identities, political orientations, and national belonging were represented in Slovene public discourse at the turn of the 20th century using computational methods on historical newspaper data.
Method: Combines BERTopic for thematic patterns, evaluates four instruction-following LLMs for sentiment analysis on OCR-degraded historical Slovene, uses NER graphs for entity relationships, and applies mixed methods combining quantitative network analysis with critical discourse analysis.
Result: Identified ideological differences between conservative-Catholic and liberal-progressive newspapers, selected Slovene-adapted GaMS3-12B-Instruct as best for sentiment analysis (though stronger on neutral sentiment), revealed variation in collective identity portrayals, and demonstrated value of computational-humanities integration.
Conclusion: The study shows the value of combining scalable computational methods with critical interpretation for digital humanities research on noisy historical newspaper data, particularly for analyzing collective identity representations.
Abstract: This study presents a computational analysis of the Slovene historical newspapers Slovenec and Slovenski narod from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application, while also documenting important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then create NER graphs to explore the relationships between collective identities and places. We apply a mixed methods approach to analyse the named entity graphs, combining quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socionomic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.
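The topic-modelling step is standard BERTopic usage, roughly as below; the loader is a hypothetical stand-in, and the study's actual preprocessing and hyperparameters are not given in this summary.

```python
# Standard BERTopic usage; `load_speriodika_articles` is a hypothetical loader
# for the OCR'd corpus, not an API from the paper.
from bertopic import BERTopic

docs = load_speriodika_articles()                 # thousands of OCR'd articles
topic_model = BERTopic(language="multilingual")   # multilingual embeddings cover Slovene
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())        # inspect major thematic patterns
```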
[16] Closing the Confidence-Faithfulness Gap in Large Language Models
Miranda Muqing Miao, Lyle Ungar
Main category: cs.CL
TL;DR: Mechanistic analysis reveals verbalized confidence in LLMs is linearly encoded but orthogonal to actual accuracy signals, with reasoning processes contaminating confidence calibration. A two-stage steering pipeline improves alignment.
Details
Motivation: LLMs often verbalize confidence scores that don't match their actual accuracy, but the underlying geometric relationships and mechanisms behind this miscalibration remain poorly understood. The paper aims to mechanistically analyze how confidence signals are encoded and why reasoning processes disrupt calibration.
Method: Used linear probes and contrastive activation addition (CAA) steering to analyze confidence encoding across three open-weight models and four datasets. Identified the “Reasoning Contamination Effect” where simultaneous reasoning and confidence verbalization disrupts calibration. Developed a two-stage adaptive steering pipeline that reads internal accuracy estimates and steers verbalized output to match them.
Result: Found that calibration and verbalized confidence signals are linearly encoded but orthogonal to each other across all models and datasets. Reasoning processes significantly disrupt confidence calibration. The proposed steering pipeline substantially improves calibration alignment across all evaluated models.
Conclusion: The orthogonal encoding of confidence and accuracy signals explains LLM miscalibration, and the Reasoning Contamination Effect shows how reasoning processes exacerbate this issue. The steering approach successfully improves calibration by leveraging internal accuracy estimates.
Abstract: Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationships governing this behavior remain poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another – a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the “Reasoning Contamination Effect.” Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model’s internal accuracy estimate and steers verbalized output to match it, substantially improving calibration alignment across all evaluated models.
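The CAA steering used in the analysis follows the usual recipe: a steering vector is the mean activation difference between two contrasting prompt sets, added at one layer during the forward pass. A schematic sketch (layer choice, scale, and hook site are illustrative):

```python
# Schematic contrastive activation addition (CAA); all sizes are illustrative.
import torch

def build_steering_vector(acts_a, acts_b):
    # each: (n_prompts, dim) residual activations cached at the chosen layer
    # for two contrasting prompt sets (e.g. high- vs. low-confidence outputs)
    return acts_a.mean(0) - acts_b.mean(0)

def make_hook(v, scale=1.0):
    def hook(module, inputs, output):
        # HF decoder layers return a tuple; steer only the hidden states
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * v
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

v = build_steering_vector(torch.randn(100, 4096), torch.randn(100, 4096))
# e.g. handle = model.model.layers[15].register_forward_hook(make_hook(v))
```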
[17] OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs
Suraj Racha, Prashant Harish Joshi, Utkarsh Maurya, Nitin Yadav, Mridul Sharma, Ananya Kunisetty, Saranya Darisipudi, Nirmal Punjabi, Ganesh Ramakrishnan
Main category: cs.CL
TL;DR: oMind framework adapts LLMs for mental health applications with specialized training data, alignment methods, and evaluation benchmarks to address domain-specific challenges.
Details
Motivation: Mental health is a growing global concern where LLMs could help, but adaptation faces challenges including lack of high-quality interpretable training data, restricted training paradigms, and difficulties evaluating multi-turn dialogues.
Method: Developed oMind framework with: 1) Training and aligning LLM agents for diverse capabilities including conversations, 2) Creation of ~164k multi-task SFT dataset using structured knowledge retrieval, LLM-based pruning, and review actions, 3) oMind-Chat benchmark with expert-annotated turn-level and conversation-level rubrics.
Result: oMind LLMs consistently outperform baselines on both core capabilities and conversations, with oMind-LLM showing significantly better reasoning (up to 80% win rate).
Conclusion: The oMind framework successfully addresses key challenges in adapting LLMs for mental health applications through specialized data generation, training paradigms, and evaluation methods.
Abstract: Large Language Models (LLMs) have shown remarkable capabilities for complex tasks, yet adaptation to the medical domain, specifically mental health, poses particular challenges. Mental health is a rising global concern, and LLMs have large potential to help address it. We highlight three primary challenges for LLMs in mental health: a lack of high-quality, interpretable, and knowledge-grounded training data; training paradigms restricted to core capabilities; and evaluation of multi-turn dialogue settings. To address these, we present the oMind framework, which includes training and aligning LLM agents for diverse capabilities including conversations, and a high-quality ~164k multi-task SFT dataset produced by our generation pipeline based on structured knowledge retrieval, LLM-based pruning, and review actions. We also introduce oMind-Chat, a novel multi-turn benchmark dataset with expert-annotated turn-level and conversation-level rubrics. Our diverse experiments on both core capabilities and conversations show that oMind LLMs consistently outperform baselines. oMind-LLM also shows significantly better reasoning, with up to an 80% win rate.
[18] Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory
Jon-Paul Cacioli
Main category: cs.CL
TL;DR: This paper introduces a Type-2 Signal Detection Theory framework to evaluate LLM confidence, distinguishing between what models know (Type-1 sensitivity) and how well they know what they know (Type-2 metacognitive sensitivity), using meta-d’ and M-ratio metrics.
Details
Motivation: Standard calibration metrics (ECE, Brier score) conflate two distinct capacities: factual knowledge and metacognitive awareness. The authors aim to separate these capacities to better understand which models truly "know what they don't know" versus those that merely appear well-calibrated.
Method: The authors apply Type-2 Signal Detection Theory with meta-d’ and metacognitive efficiency ratio (M-ratio) to evaluate four LLMs across 224,000 factual QA trials. They analyze domain-specific metacognitive efficiency, temperature effects on confidence policies, and compare different evaluation metrics.
Result: Key findings: (1) metacognitive efficiency varies substantially across models even with similar Type-1 sensitivity; (2) metacognitive efficiency is domain-specific; (3) temperature affects confidence policy but not metacognitive capacity for some models; (4) AUROC_2 and M-ratio produce inverted model rankings, showing they measure different things.
Conclusion: The meta-d’ framework reveals crucial distinctions between models that truly understand their knowledge limitations versus those that merely appear well-calibrated, with important implications for model selection, deployment, and human-AI collaboration.
Abstract: Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity). We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d’ and the metacognitive efficiency ratio M-ratio. Applied to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials, we find: (1) metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar – Mistral achieves the highest d’ but the lowest M-ratio; (2) metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics; (3) temperature manipulation shifts Type-2 criterion while meta-d’ remains stable for two of four models, dissociating confidence policy from metacognitive capacity; (4) AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions. The meta-d’ framework reveals which models “know what they don’t know” versus which merely appear well-calibrated due to criterion placement – a distinction with direct implications for model selection, deployment, and human-AI collaboration. Pre-registered analysis; code and data publicly available.
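To make the Type-1/Type-2 distinction concrete, here is a minimal sketch using the standard z-transform approximations. The full meta-d’ of Maniscalco and Lau requires a model fit, so the Type-2 value below is only a crude stand-in for it, and all trial counts are hypothetical, not from the paper.

```python
from statistics import NormalDist

def z(p, eps=1e-6):
    # Inverse normal CDF, clipped away from 0/1 to avoid infinities.
    return NormalDist().inv_cdf(min(max(p, eps), 1 - eps))

def type1_dprime(hits, misses, false_alarms, correct_rejections):
    # Type-1 sensitivity: how well answers separate true from false items.
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    return z(hit_rate) - z(fa_rate)

def type2_dprime(conf_correct, unconf_correct, conf_error, unconf_error):
    # Crude Type-2 sensitivity: how well confidence tracks correctness.
    # (A placeholder for the full meta-d' model fit.)
    h2 = conf_correct / (conf_correct + unconf_correct)  # P(high conf | correct)
    f2 = conf_error / (conf_error + unconf_error)        # P(high conf | error)
    return z(h2) - z(f2)

# Hypothetical trial counts for one model on a factual-QA set.
d1 = type1_dprime(hits=700, misses=300, false_alarms=250, correct_rejections=750)
d2 = type2_dprime(conf_correct=1050, unconf_correct=400, conf_error=150, unconf_error=400)
print(f"d' = {d1:.2f}, type-2 d' = {d2:.2f}, M-ratio ~ {d2 / d1:.2f}")
```

An M-ratio near 1 would mean confidence exploits all the evidence the answers themselves contain; values well below 1 indicate metacognitive inefficiency even when Type-1 accuracy is high.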
[19] Goodness-of-pronunciation without phoneme time alignment
Jeremy H. M. Wong, Nancy F. Chen
Main category: cs.CL
TL;DR: Proposes methods to extract speech evaluation features from weakly-supervised ASR models for low-resource languages, using phoneme confusion networks and cross-attention to combine features without time alignment.
Details
Motivation: Limited ASR training data hinders speech evaluation expansion to low-resource languages. Weakly-supervised models exist but are frame-asynchronous and not phonemic, making feature extraction difficult.
Method: Computes phoneme posteriors by mapping ASR hypotheses to phoneme confusion network. Uses word-level speaking rate/duration instead of phoneme-level. Combines phoneme and frame-level features using cross-attention architecture without phoneme time alignment.
Result: Performs comparably with standard frame-synchronous features on English speechocean762 and low-resource Tamil datasets.
Conclusion: Overcomes incompatibilities for feature extraction with weakly-supervised models, enabling expansion of speech evaluation to low-resource languages.
Abstract: In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models are capable of ASR over many languages, but they are frame-asynchronous and not phonemic, hindering feature extraction for speech evaluation. This paper proposes to overcome incompatibilities for feature extraction with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Word instead of phoneme-level speaking rate and duration are used. Phoneme and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. This performs comparably with standard frame-synchronous features on English speechocean762 and low-resource Tamil datasets.
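A minimal sketch of the alignment-free fusion step: phoneme-level features act as attention queries over frame-level ASR features, so no phoneme time boundaries are needed. Dimensions, shapes, and the scoring head are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PhonemeFrameFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.scorer = nn.Linear(d_model, 1)  # per-phoneme score head (illustrative)

    def forward(self, phoneme_feats, frame_feats):
        # phoneme_feats: (B, P, d), e.g. from confusion-network posteriors
        # frame_feats:   (B, T, d), from the weakly-supervised ASR encoder
        # Queries are phonemes, keys/values are frames: attention learns a
        # soft alignment, so no explicit phoneme time boundaries are required.
        fused, _ = self.attn(phoneme_feats, frame_feats, frame_feats)
        return self.scorer(fused).squeeze(-1)  # (B, P)

model = PhonemeFrameFusion()
scores = model(torch.randn(2, 12, 256), torch.randn(2, 300, 256))
print(scores.shape)  # torch.Size([2, 12])
```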
[20] To Write or to Automate Linguistic Prompts, That Is the Question
Marina Sánchez-Torrón, Daria Akselrod, Jason Rauchwerk
Main category: cs.CL
TL;DR: Systematic comparison shows automatic prompt optimization (GEPA with DSPy) can match expert-designed prompts for some NLP tasks, with results varying by task and model.
Details
Motivation: To determine whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks, as LLM performance is highly sensitive to prompt design but the effectiveness of automated approaches remains unexplored.
Method: First systematic comparison of three approaches: 1) hand-crafted zero-shot expert prompts, 2) base DSPy signatures, and 3) GEPA-optimized DSPy signatures. Evaluated across three tasks (translation, terminology insertion, language quality assessment) with five model configurations.
Result: Task-dependent results: In terminology insertion, optimized and manual prompts produce mostly statistically indistinguishable quality. In translation, each approach wins on different models. In LQA, expert prompts achieve stronger error detection while optimization improves characterization. GEPA elevates minimal DSPy signatures, and majority of expert-optimized comparisons show no statistically significant difference.
Conclusion: Automatic prompt optimization can achieve comparable performance to expert-designed prompts for some NLP tasks, though the comparison is asymmetric (optimization uses labeled data while expert prompts rely on domain expertise). The effectiveness varies by task and model.
Abstract: LLM performance is highly sensitive to prompt design, yet whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks remains unexplored. We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, evaluating five model configurations. Results are task-dependent. In terminology insertion, optimized and manual prompts produce mostly statistically indistinguishable quality. In translation, each approach wins on different models. In LQA, expert prompts achieve stronger error detection while optimization improves characterization. Across all tasks, GEPA elevates minimal DSPy signatures, and the majority of expert-optimized comparisons show no statistically significant difference. We note that the comparison is asymmetric: GEPA optimization searches programmatically over gold-standard splits, whereas expert prompts require in principle no labeled data, relying instead on domain expertise and iterative refinement.
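For readers unfamiliar with the setup, here is a minimal sketch of the "base DSPy signature" condition, assuming DSPy's documented Signature/Predict API. The GEPA optimizer call is left as a comment because its exact name and arguments vary across DSPy releases; the model name and metric are illustrative.

```python
import dspy

# Any supported backend; the model name is illustrative and needs an API key.
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class Translate(dspy.Signature):
    """Translate the source text into the target language."""
    source_text = dspy.InputField()
    target_language = dspy.InputField()
    translation = dspy.OutputField()

program = dspy.Predict(Translate)
result = program(source_text="Guten Morgen", target_language="English")

# GEPA then searches over instructions using labeled train/val splits;
# this is the asymmetry the paper notes versus zero-shot expert prompts:
# optimizer = dspy.GEPA(metric=my_quality_metric)
# optimized = optimizer.compile(program, trainset=train, valset=val)
```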
[21] Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models
Hieu Xuan Le, Benjamin Goh, Quy Anh Tang
Main category: cs.CL
TL;DR: Lightweight LLMs can serve as effective low-latency security judges for prompt attacks in production guardrails through structured reasoning processes.
Details
Motivation: There's a deployment gap in LLM security: lightweight classifiers struggle with generalization under distribution shift, while high-capacity LLM judges are too slow/costly for live enforcement. Need reliable, low-latency security judges for production guardrails.
Method: Use lightweight general-purpose LLMs with careful prompt and output design, guiding them through structured reasoning: intent decomposition, safety-signal verification, harm assessment, and self-reflection. Also evaluate Mixture-of-Models (MoM) aggregation.
Result: Lightweight LLMs (like gemini-2.0-flash-lite-001) can effectively serve as low-latency judges for live guardrails. Currently deployed in production for Singapore’s public service chatbots. MoM aggregation provides only modest performance gains over single-model judges.
Conclusion: Lightweight general-purpose LLMs with structured reasoning can bridge the deployment gap for prompt attack detection in production, offering reliable security judgments under strict latency constraints.
Abstract: Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots with adversarial prompts generated via automated red teaming (ART), covering diverse and evolving patterns. Our results show that general-purpose LLMs, such as gemini-2.0-flash-lite-001, can serve as effective low-latency judges for live guardrails. This configuration is currently deployed in production as a centralized guardrail service for public service chatbots in Singapore. We additionally evaluate a Mixture-of-Models (MoM) setting to assess whether aggregating multiple LLM judges improves prompt-attack detection performance relative to single-model judges, with only modest gains observed.
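A minimal sketch of what such a structured-reasoning judge could look like. The four prompt stages mirror the steps named above, while `call_llm` and the JSON schema are hypothetical stand-ins, not the deployed Singapore system.

```python
import json

JUDGE_PROMPT = """You are a security judge for an LLM guardrail.
Analyze the user input in four steps and answer in JSON.
1. intent: decompose what the input is actually asking for.
2. signals: list any jailbreak/injection indicators you verified.
3. harm: assess the harm if the request were fulfilled.
4. reflection: double-check your own reasoning for mistakes.
Return: {{"verdict": "attack" | "benign", "rationale": "..."}}

User input:
{user_input}"""

def judge(user_input: str, call_llm) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(user_input=user_input))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fail closed: unparseable judge output is treated as an attack.
        return {"verdict": "attack", "rationale": "unparseable judge output"}

# Example with a stubbed model client:
fake_llm = lambda prompt: '{"verdict": "benign", "rationale": "ordinary question"}'
print(judge("What are the opening hours of the passport office?", fake_llm))
```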
[22] Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation
Ying Li, Xinglin Lyu, Junhui Li, Jinlong Yang, Hengchao Shang, Min Zhang, Shimin Tao, Daimeng Wei
Main category: cs.CL
TL;DR: CPL is a preference-based training framework for context-aware machine translation that explicitly models when and how contextual information improves translation quality by integrating intra- and cross-condition preferences.
Details
Motivation: Context-aware machine translation doesn't consistently outperform sentence-level MT because contextual signals are unevenly beneficial across sentences, and existing training objectives don't explicitly model this variability, limiting models' ability to adaptively exploit context.
Method: Proposes Cross-Preference Learning (CPL), a preference-based training framework that integrates both intra-condition (within same context condition) and cross-condition (between different context conditions) preferences into the optimization objective to provide explicit supervision on when and how contextual information improves translation.
Result: Experimental results on public context-aware MT tasks using Qwen3-4B, Qwen3-8B, and Llama-3-8B models show consistent improvements in translation quality and robustness across both input conditions without architectural modifications.
Conclusion: CPL effectively captures the complementary benefits of sentence-level and context-aware MT by explicitly modeling when contextual information is beneficial, leading to improved translation performance across different conditions.
Abstract: Context-aware machine translation (MT) leverages document-level information, yet it does not consistently outperform sentence-level MT, as contextual signals are unevenly beneficial across sentences. Existing training objectives do not explicitly model this variability, limiting a model’s ability to adaptively exploit context. In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT. CPL achieves this by integrating both intra- and cross-condition preferences into the preference optimization objective. The introduction of intra- and cross-condition preferences provides explicit supervision on when and how contextual information improves translation quality. We validate the proposed approach on several public context-aware MT tasks using multiple models, including Qwen3-4B, Qwen3-8B, and Llama-3-8B. Experimental results demonstrate consistent improvements in translation quality and robustness across both input conditions, achieved without any architectural modifications.
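A minimal sketch of how intra- and cross-condition preference pairs might be assembled for a preference-optimization objective. The data layout, prompts, and quality scorer are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class PrefPair:
    prompt: str    # source sentence, with or without document context
    chosen: str
    rejected: str
    kind: str      # "intra" (same input condition) or "cross" (between conditions)

def build_pairs(ctx_prompt, sent_prompt, ctx_hyps, sent_hyps, score):
    pairs = []
    # Intra-condition: best vs. worst sampled hypothesis under one condition.
    for prompt, hyps in ((ctx_prompt, ctx_hyps), (sent_prompt, sent_hyps)):
        ranked = sorted(hyps, key=score, reverse=True)
        pairs.append(PrefPair(prompt, ranked[0], ranked[-1], "intra"))
    # Cross-condition: prefer whichever condition wins for this sentence,
    # giving explicit supervision on *when* context actually helps.
    best_ctx, best_sent = max(ctx_hyps, key=score), max(sent_hyps, key=score)
    if score(best_ctx) >= score(best_sent):
        pairs.append(PrefPair(ctx_prompt, best_ctx, best_sent, "cross"))
    else:
        pairs.append(PrefPair(sent_prompt, best_sent, best_ctx, "cross"))
    return pairs

# Toy usage with sentence length as a stand-in quality score.
pairs = build_pairs("ctx: ... ||| Er kam an.", "Er kam an.",
                    ["He arrived.", "He came."], ["He arrive.", "He arrived there."],
                    score=len)
print([p.kind for p in pairs])
```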
[23] Probing the Lack of Stable Internal Beliefs in LLMs
Yifan Luo, Kangping Xu, Yanzhen Lu, Yang Yuan, Andrew Chi-Chih Yao
Main category: cs.CL
TL;DR: LLMs struggle to maintain implicit consistency in multi-turn interactions, failing to preserve unstated goals across dialogues despite being able to follow explicit instructions.
Details
Motivation: Persona-driven LLMs need consistent behavioral tendencies to simulate human-like personality traits, but current models lack stable internal representations that anchor responses over extended dialogues.
Method: Used a 20-question-style riddle game paradigm where LLMs secretly select a target and respond to user guesses with “yes/no” answers, evaluating their ability to maintain implicit consistency across turns.
Result: LLMs struggle to preserve latent consistency - their implicit “goals” shift across turns unless explicitly provided their selected target in context, revealing limitations in persona modeling.
Conclusion: Current LLMs lack mechanisms to anchor implicit goals over time, which is crucial for realistic personality modeling in interactive applications like dialogue systems.
Abstract: Persona-driven large language models (LLMs) require consistent behavioral tendencies across interactions to simulate human-like personality traits, such as persistence or reliability. However, current LLMs often lack stable internal representations that anchor their responses over extended dialogues. This work explores whether LLMs can maintain “implicit consistency”, defined as persistent adherence to an unstated goal in multi-turn interactions. We designed a 20-question-style riddle game paradigm where an LLM is tasked with secretly selecting a target and responding to users’ guesses with “yes/no” answers. Through evaluations, we find that LLMs struggle to preserve latent consistency: their implicit “goals” shift across turns unless explicitly provided their selected target in context. These findings highlight critical limitations in the building of persona-driven LLMs and underscore the need for mechanisms that anchor implicit goals over time, which is a key to realistic personality modeling in interactive applications such as dialogue systems.
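A minimal sketch of the consistency probe: filter the candidate targets by the model's yes/no answers and check whether any single target survives all turns. `ask_model` and the `facts` table are hypothetical stubs, not the paper's code.

```python
def consistent_candidates(candidates, questions, ask_model, facts):
    """Return candidates compatible with all of the model's yes/no answers.

    facts[(candidate, question)] -> True/False ground truth.
    An empty result (or one excluding the model's own later reveal) signals
    that the implicit goal drifted across turns.
    """
    remaining = set(candidates)
    for q in questions:
        answer = ask_model(q).strip().lower().startswith("yes")
        remaining = {c for c in remaining if facts[(c, q)] == answer}
    return remaining

# Stubbed example: the "model" answers as if the secret target were an apple.
facts = {("apple", "Is it edible?"): True, ("rock", "Is it edible?"): False}
ask = lambda q: "Yes" if facts[("apple", q)] else "No"
print(consistent_candidates(["apple", "rock"], ["Is it edible?"], ask, facts))
```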
[24] A Catalog of Basque Dialectal Resources: Online Collections and Standard-to-Dialectal Adaptations
Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri
Main category: cs.CL
TL;DR: This paper presents a comprehensive catalog of Basque dialectal data resources, addressing data scarcity in dialectal NLP through both online dialectal content and standard-to-dialect adapted data.
Details
Motivation: Data scarcity is identified as a primary limitation in dialectal NLP research, particularly for Basque dialects. The paper aims to address this by systematically compiling and cataloging available dialectal resources.
Method: Two main approaches: 1) Collection of online dialectal data (news, tweets, dictionaries, videos), and 2) Standard-to-dialect adaptation - both manual (XNLI dataset adapted to three Basque dialects) and automatic (BasPhyCowest dataset with manual quality evaluation).
Result: Created a comprehensive catalog of Basque dialectal resources, produced a high-quality parallel evaluation dataset from XNLI adapted to three dialects, and evaluated the quality of automatically adapted data through native speaker assessment.
Conclusion: The paper provides valuable resources for Basque dialectal NLP, demonstrating both manual and automatic approaches to address data scarcity, with implications for similar low-resource dialectal scenarios.
Abstract: Recent research on dialectal NLP has identified data scarcity as a primary limitation. To address this limitation, this paper presents a catalog of contemporary Basque dialectal data and resources, offering a systematic and comprehensive compilation of the dialectal data currently available in Basque. Two types of data sources have been distinguished: online data originally written in some dialect, and standard-to-dialect adapted data. The former includes all dialectal data that can be found online, such as news and radio sites, informal tweets, as well as online resources such as dictionaries, atlases, grammar rules, or videos. The latter consists of data that has been adapted from the standard variety to dialectal varieties, either manually or automatically. Regarding the manual adaptation, the test split of the XNLI Natural Language Inference dataset was manually adapted into three Basque dialects: Western, Central, and Navarrese-Lapurdian, yielding a high-quality parallel gold standard evaluation dataset. With respect to the automatic dialectal adaptation, the automatically adapted physical commonsense dataset (BasPhyCowest) underwent additional manual evaluation by native speakers to assess its quality and determine whether it could serve as a viable substitute for full manual adaptation (i.e., silver data creation).
[25] A Decade-Scale Benchmark Evaluating LLMs’ Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations
Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu, Xi Wang, Yingcong Chen, Can Yang, Shujie Liu, Hao Chen
Main category: cs.CL
TL;DR: CPGBench: A framework benchmarking LLMs’ ability to detect and adhere to clinical practice guidelines in multi-turn conversations, revealing significant gaps between guideline knowledge and practical application.
Details
Motivation: While LLMs are increasingly used in healthcare, it's unclear how well they can identify and follow clinical practice guidelines during conversations, which is crucial for evidence-based decision-making and patient safety.
Method: Created CPGBench framework with 3,418 CPG documents from 9 countries/regions and 2 international organizations across 24 specialties, extracting 32,155 clinical recommendations. Generated multi-turn conversations for each recommendation to evaluate 8 leading LLMs’ detection and adherence capabilities, supplemented by human evaluation with 56 clinicians.
Result: LLMs correctly detected 71.1%-89.6% of recommendations but only correctly referenced 3.6%-29.7% of corresponding titles. Adherence rates ranged from 21.8% to 63.2%, showing large gaps between guideline knowledge and practical application.
Conclusion: CPGBench reveals significant gaps in LLMs’ ability to detect and adhere to clinical guidelines, highlighting the need for improvement before safe deployment in clinical practice. This is the first systematic benchmark for this capability.
Abstract: Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs could identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents from 9 countries/regions and 2 international organizations published in the last decade spanning across 24 specialties. From these documents, we extract 32,155 clinical recommendations with corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation accordingly to evaluate the detection and adherence capabilities of 8 leading LLMs. We find that 71.1%-89.6% of the recommendations can be correctly detected, while only 3.6%-29.7% of the corresponding titles can be correctly referenced, revealing the gap between knowing the guideline contents and where they come from. The adherence rates range from 21.8% to 63.2% in different models, indicating a large gap between knowing the guidelines and being able to apply them. To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties. To our knowledge, CPGBench is the first benchmark systematically revealing which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety critical, addressing these gaps is crucial for the safe and responsible deployment of LLMs in real-world clinical practice.
[26] SafeMath: Inference-time Safety improves Math Accuracy
Sagnik Basu, Subhrajit Mitra, Aman Juneja, Somnath Banerjee, Rima Hazra, Animesh Mukherjee
Main category: cs.CL
TL;DR: LLMs can be manipulated through adversarial math word problems containing harmful content, with risks in educational settings. The paper introduces ToxicGSM dataset and SafeMath safety alignment technique to reduce harmful outputs while maintaining math reasoning accuracy.
Details
Motivation: Recent research shows LLMs can be manipulated through adversarial inputs to produce harmful outputs. The paper focuses on an underexplored issue: harmful and toxic mathematical word problems that can propagate biased, unethical, or psychologically harmful content, especially dangerous in educational settings with children.
Method: 1) Introduce ToxicGSM dataset of 1.9k arithmetic problems with harmful/sensitive contexts while preserving mathematically well-defined reasoning tasks. 2) Audit existing LLMs’ behavior on this dataset. 3) Analyze trade-offs between safety enforcement and mathematical correctness. 4) Propose SafeMath - a safety alignment technique to reduce harmful outputs while maintaining mathematical reasoning performance.
Result: The study demonstrates that math questions framed as natural language narratives can serve as subtle medium for harmful content. SafeMath reduces harmful outputs while maintaining and sometimes improving mathematical reasoning performance, showing effective safety alignment need not come at cost of accuracy.
Conclusion: The paper highlights importance of disentangling linguistic harm from math reasoning in LLMs. It shows that safety alignment techniques can effectively reduce harmful outputs without compromising mathematical accuracy, particularly important for educational applications.
Abstract: Recent research points toward LLMs being manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplored issue concerning harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems in which harmful or sensitive context is embedded while preserving mathematically well-defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade-offs between safety enforcement and mathematical correctness. We further propose SafeMath – a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Our results highlight the importance of disentangling linguistic harm from math reasoning and demonstrate that effective safety alignment need not come at the cost of accuracy. We release the source code and dataset at https://github.com/Swagnick99/SafeMath/tree/main.
[27] Translation or Recitation? Calibrating Evaluation Scores for Machine Translation of Extremely Low-Resource Languages
Danlu Chen, Ka Sing He, Jiahe Tian, Chenghao Xiao, Zhaofeng Wu, Taylor Berg-Kirkpatrick, Freda Shi
Main category: cs.CL
TL;DR: FRED Difficulty Metrics (Fertility Ratio, Retrieval Proxy, Pre-training Exposure, Corpus Diversity) help contextualize performance variability in low-resource machine translation by revealing dataset-intrinsic factors rather than model capability differences.
Details
Motivation: Address the perplexing variability in reported performance for extremely low-resource machine translation, making it difficult for researchers (especially those focused on specific language groups like ancient languages) to determine if breakthroughs result from superior methodologies or are artifacts of benchmark collection.
Method: Introduce FRED Difficulty Metrics: Fertility Ratio (F) for tokenization coverage, Retrieval Proxy (R) for train-test overlap, Pre-training Exposure (E) for model pre-training data exposure, and Corpus Diversity (D) for dataset variety. These serve as dataset-intrinsic metrics to contextualize reported scores.
Result: Metrics reveal that significant result variability is explained by train-test overlap and pre-training exposure rather than model capability. Some languages (particularly extinct and non-Latin indigenous languages) suffer from poor tokenization coverage (high token fertility), highlighting fundamental limitations of transferring models from high-resource languages lacking shared vocabulary.
Conclusion: By providing these indices alongside performance scores, researchers can enable more transparent evaluation of cross-lingual transfer and provide a more reliable foundation for the extremely low-resource machine translation community.
Abstract: The landscape of extremely low-resource machine translation (MT) is characterized by perplexing variability in reported performance, often making results across different language pairs difficult to contextualize. For researchers focused on specific language groups – such as ancient languages – it is nearly impossible to determine if breakthroughs reported in other contexts (e.g., native African or American languages) result from superior methodologies or are merely artifacts of benchmark collection. To address this problem, we introduce the FRED Difficulty Metrics, which include the Fertility Ratio (F), Retrieval Proxy (R), Pre-training Exposure (E), and Corpus Diversity (D) and serve as dataset-intrinsic metrics to contextualize reported scores. These metrics reveal that a significant portion of result variability is explained by train-test overlap and pre-training exposure rather than model capability. Additionally, we identify that some languages – particularly extinct and non-Latin indigenous languages – suffer from poor tokenization coverage (high token fertility), highlighting a fundamental limitation of transferring models from high-resource languages that lack a shared vocabulary. By providing these indices alongside performance scores, we enable more transparent evaluation of cross-lingual transfer and provide a more reliable foundation for the XLR MT community.
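Two of the four FRED quantities are simple enough to sketch directly from their descriptions above: fertility (tokens per word) and a retrieval proxy via train-test n-gram overlap. The exact definitions in the paper may differ; this is an illustration only.

```python
def fertility_ratio(texts, tokenize):
    """Mean subword tokens per whitespace word; high values flag poor coverage."""
    toks = sum(len(tokenize(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return toks / max(words, 1)

def retrieval_proxy(train_texts, test_texts, n=4):
    """Fraction of test n-grams already seen in training (train-test overlap)."""
    def ngrams(text):
        w = text.split()
        return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}
    train = set().union(*(ngrams(t) for t in train_texts))
    test = set().union(*(ngrams(t) for t in test_texts))
    return len(test & train) / max(len(test), 1)

print(retrieval_proxy(["the cat sat on the mat today"],
                      ["the cat sat on the mat again"]))  # -> 0.75
```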
[28] Comparing Natural and Synthetic Structured Data: A Study of the Passive Verb Alternation in French and Italian
Giuseppe Samo, Paola Merlo
Main category: cs.CL
TL;DR: Natural data outperforms synthetic data for training LLMs on linguistic patterns, with models trained on natural data generalizing better to both natural and synthetic test cases.
Details
Motivation: To compare the impact of natural versus synthetic data on training and evaluating LLMs for linguistic knowledge, specifically focusing on passive verb alternation patterns in French and Italian.
Method: Used Blackbird Language Matrices (structured datasets) to probe linguistic knowledge, comparing structured templates with natural sentences from Universal Dependencies versus synthetic sentences. Tested models trained on each data type.
Result: Models trained on synthetic data achieved ceiling performance on synthetic tests but failed to generalize to natural sentences. Models trained on natural data showed robust performance on both natural and synthetic test suites.
Conclusion: Natural data is superior for training LLMs to capture abstract linguistic patterns, and structured evaluation setups are valuable for probing syntactic and semantic knowledge.
Abstract: This study compares the impact of natural and synthetic data on training and evaluating large language models (LLMs), using the case of passive verb alternation in French and Italian. We use Blackbird Language Matrices (BLMs), structured datasets designed to probe linguistic knowledge of underlying patterns across sentence sets. We compare structured templates instantiated with natural sentences extracted from Universal Dependencies to structured templates of synthetic sentences. Experiments show that while models achieve ceiling performance when trained and tested on synthetic datasets, they do not reliably generalize to natural sentences. In contrast, models trained on natural data exhibit robust performance across both natural and synthetic test suites, demonstrating their superior ability to capture abstract linguistic patterns. These results corroborate the value of natural data and of structured setups in linguistic evaluation for probing LLMs’ syntactic and semantic knowledge.
[29] MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation
Taolin Han, Shuang Wu, Jinghang Wang, Yuhao Zhou, Renquan Lv, Bing Zhao, Wei Hu
Main category: cs.CL
TL;DR: MolQuest is an agent-based evaluation framework for molecular structure elucidation that assesses LLMs’ dynamic reasoning in scientific discovery through multi-turn interactive tasks using real chemical experimental data.
Details
Motivation: Current scientific evaluation benchmarks use static, single-turn QA formats that are inadequate for measuring LLM performance in complex scientific tasks requiring multi-step iteration and experimental interaction. There's a need for systematic assessment of LLMs' dynamic reasoning in real-world research scenarios.
Method: MolQuest formalizes molecular structure elucidation as a multi-turn interactive task requiring models to plan experimental steps, integrate heterogeneous spectral sources (NMR, MS), and iteratively refine structural hypotheses. It’s built upon authentic chemical experimental data and evaluates LLMs’ abductive reasoning and strategic decision-making in complex chemical spaces.
Result: Contemporary frontier models show significant limitations: even SOTA models achieve only ~50% accuracy, while most other models remain below 30%. This reveals a critical gap in current LLMs’ strategic scientific reasoning capabilities.
Conclusion: MolQuest provides a reproducible, extensible framework for science-oriented LLM evaluation, highlighting the need for future research toward AI that can actively participate in scientific processes through improved dynamic reasoning and strategic decision-making.
Abstract: Large language models (LLMs) hold considerable potential for advancing scientific discovery, yet systematic assessment of their dynamic reasoning in real-world research remains limited. Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and experimental interaction. To address this gap, we introduce MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built upon authentic chemical experimental data. Unlike existing datasets, MolQuest formalizes molecular structure elucidation as a multi-turn interactive task, requiring models to proactively plan experimental steps, integrate heterogeneous spectral sources (e.g., NMR, MS), and iteratively refine structural hypotheses. This framework systematically evaluates LLMs’ abductive reasoning and strategic decision-making abilities within a vast and complex chemical space. Empirical results reveal that contemporary frontier models exhibit significant limitations in authentic scientific scenarios: notably, even state-of-the-art (SOTA) models achieve an accuracy of only approximately 50%, while the performance of most other models remains below the 30% threshold. This work provides a reproducible and extensible framework for science-oriented LLM evaluation; our findings highlight the critical gap in current LLMs’ strategic scientific reasoning, setting a clear direction for future research toward AI that can actively participate in the scientific process.
[30] CRAFT: Grounded Multi-Agent Coordination Under Partial Information
Abhijnan Nath, Hannah VanderHoeven, Nikhil Krishnaswamy
Main category: cs.CL
TL;DR: CRAFT is a multi-agent benchmark for evaluating pragmatic communication in LLMs under partial information, where agents with incomplete views must coordinate to build shared 3D structures through natural language.
Details
Motivation: Current language models struggle with pragmatic communication and coordination in multi-agent settings with strict partial information. There's a need for diagnostic frameworks to understand failures in spatial grounding, belief modeling, and pragmatic reasoning when agents must collaborate without complete information.
Method: Creates a benchmark where multiple agents with complementary but incomplete views must coordinate through natural language to construct shared 3D structures. Formalizes as multi-sender pragmatic reasoning task with diagnostic framework decomposing failures into spatial grounding, belief modeling, and pragmatic communication errors. Tests 15 models (8 open-weight, 7 frontier including reasoning models).
Result: Stronger reasoning ability doesn’t reliably translate to better coordination; smaller open-weight models often match or outperform frontier systems. Improved individual communication doesn’t guarantee successful collaboration. Multi-agent coordination remains fundamentally unsolved for current language models.
Conclusion: Multi-agent coordination under partial information is a significant unsolved challenge for LLMs. The CRAFT benchmark provides diagnostic tools to understand failure modes in pragmatic communication and spatial reasoning.
Abstract: We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi-sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open-weight models. Across a diverse set of 15 models (8 open-weight and 7 frontier, including reasoning models), we find that stronger reasoning ability does not reliably translate to better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at https://github.com/csu-signal/CRAFT
[31] When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech
Nicolás Benjamín Ocampo, Tommaso Caselli, Davide Ceolin
Main category: cs.CL
TL;DR: WSF-ARG+ dataset combines hate speech with check-worthiness information, with LLM-in-the-loop framework for annotation, showing improved hate speech detection when incorporating check-worthiness labels.
Details
Motivation: Hateful content online often uses fact-like misinformation, requiring content moderators to assess both harmfulness and veracity. Current approaches fail to jointly address hate speech and misinformation, deepening prejudice and increasing moderator workload.
Method: Created WSF-ARG+ dataset combining hate speech with check-worthiness information. Developed LLM-in-the-loop framework for annotation, tested with 12 open-weight LLMs of different sizes and architectures. Validated through extensive human evaluation.
Result: LLM-in-the-loop framework reduces human effort without compromising annotation quality. Hate speech messages with check-worthy claims show significantly higher harassment and hate. Incorporating check-worthiness labels improves LLM-based hate speech detection up to 0.213 macro-F1 (0.154 average for large models).
Conclusion: Jointly addressing hate speech and misinformation through check-worthiness information improves detection performance and reduces moderator burden, with LLM-in-the-loop frameworks offering efficient annotation solutions.
Abstract: Hateful content online is often expressed using fact-like, not necessarily correct information, especially in coordinated online harassment campaigns and extremist propaganda. Failing to jointly address hate speech (HS) and misinformation can deepen prejudice, reinforce harmful stereotypes, and expose bystanders to psychological distress, while polluting public debate. Moreover, these messages require more effort from content moderators because they must assess both harmfulness and veracity, i.e., fact-check them. To address this challenge, we release WSF-ARG+, the first dataset which combines hate speech with check-worthiness information. We also introduce a novel LLM-in-the-loop framework to facilitate the annotation of check-worthy claims. We run our framework, testing it with 12 open-weight LLMs of different sizes and architectures. We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data. Finally, we show that HS messages with check-worthy claims show significantly higher harassment and hate, and that incorporating check-worthiness labels improves LLM-based HS detection by up to 0.213 macro-F1, and by 0.154 macro-F1 on average for large models.
[32] Separate Before You Compress: The WWHO Tokenization Architecture
Kusal Darshana
Main category: cs.CL
TL;DR: SGPE tokenizer improves multilingual processing for complex Abugida scripts by preserving linguistic structure, reducing token counts by 27-62% compared to standard BPE tokenizers.
Details
Motivation: Standard BPE tokenizers struggle with complex Abugida scripts, breaking multi-codepoint grapheme clusters into meaningless sub-character units, which degrades LLM reasoning efficiency and increases inference costs, creating a "Token Tax" for Global South languages.
Method: Proposes WWHO (Where-What-How Often) three-layer architecture and SGPE (Syllable-aware Grapheme Pair Encoding) algorithm that separates linguistic rules from statistical compression, enabling seamless multilingual tokenization with Linguistic Zero-Breakage Guarantee.
Result: For Sinhala: 61.7% token reduction vs OpenAI o200k base (TWR 1.274). For Hindi: 27.0% reduction (TWR 1.181). Mixed-script: 36.7-60.2% reductions vs various tokenizers. Extends usable context window up to 4.38x for Abugida languages.
Conclusion: SGPE significantly improves tokenization efficiency for complex scripts while preserving linguistic integrity, addressing the “Token Tax” problem and enabling better LLM performance for underrepresented languages.
Abstract: Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers, which are very effective for simple structured Latin scripts such as English. However, standard BPE tokenizers struggle to process complex Abugida scripts due to their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM’s reasoning efficiency by forcing it to learn basic orthographic structures at inference time and raises inference costs, resulting in a significant “Token Tax” for the Global South. We propose a new three-layer architecture, the WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless multilingual tokenization. Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated on a 1,499,950-sentence test set. For Sinhala, SGPE achieves a Token to Word Ratio (TWR) of 1.274 with 4.83 characters per token, representing a 61.7 percent reduction in tokens compared to OpenAI’s o200k base. For Hindi, it achieves a TWR of 1.181 (27.0 percent reduction vs o200k). On the mixed-script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, representing token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This effectively extends the usable context window by up to 4.38 times for these Abugida languages while ensuring a Linguistic Zero-Breakage Guarantee, which ensures that no valid syllable is ever split across multiple tokens.
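A minimal sketch of the "separate before you compress" layer, assuming the third-party `regex` module's `\X` extended-grapheme-cluster pattern; the example word and the downstream BPE constraint are illustrative, not the SGPE implementation.

```python
import regex  # third-party; its \X pattern matches extended grapheme clusters

def grapheme_clusters(text: str):
    # Split into user-perceived characters; dependent vowel signs and
    # other combining marks stay attached to their base consonant.
    return regex.findall(r"\X", text)

hindi = "हिंदी"  # 5 codepoints, but only 2 user-perceived characters
clusters = [c for c in grapheme_clusters(hindi) if not c.isspace()]
print(len(hindi), "codepoints ->", len(clusters), "grapheme clusters")
# A BPE learner constrained to merge only within these cluster boundaries
# can never split a valid syllable across tokens, which is the intuition
# behind the Linguistic Zero-Breakage Guarantee described above.
```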
[33] Beyond Detection: Rethinking Education in the Age of AI-writing
Maria Marina, Alexander Panchenko, Vasily Konovalov
Main category: cs.CL
TL;DR: Paper examines the cognitive risks of AI writing tools like ChatGPT in education, arguing that the writing process itself is essential for deep learning, and explores AI-text detection and pedagogical adaptations.
Details
Motivation: As generative AI tools enter educational settings, there's concern that writing is becoming outsourced and automated, potentially stripping away its cognitive value. The paper aims to explore what is lost when machines write for us and how to preserve the educational benefits of writing.
Method: Draws on cognitive psychology, educational theory, and analysis of real classroom practices to examine the value of the writing process. Also explores current AI-text detection capabilities and pedagogical strategies for adaptation.
Result: Finds that the messy, slow, often frustrating process of writing is where deep human learning occurs. Identifies that AI-text detection is possible but limited, and that smarter pedagogy rather than bans is needed for adaptation.
Conclusion: Writing is not just output but a cognitive process essential for learning. In a world where writing can be faked, learning cannot. The ability to recognize machine-generated language may become a critical 21st century literacy.
Abstract: As generative AI tools like ChatGPT enter classrooms, workplaces and everyday thinking, writing is at risk of becoming a formality – outsourced, automated and stripped of its cognitive value. But writing is not just output; it is how we learn to think. This paper explores what is lost when we let machines write for us, drawing on cognitive psychology, educational theory and real classroom practices. We argue that the process of writing – messy, slow, often frustrating – is where deep human learning happens. The paper also explores the current possibilities of AI-text detection, how educators can adapt through smarter pedagogy rather than bans, and why the ability to recognize machine-generated language may become a critical literacy of the 21st century. In a world where writing can be faked, learning cannot.
[34] Adaptive Chunking: Optimizing Chunking-Method Selection for RAG
Paulo Roberto de Moura Júnior, Jean Lelong, Annabelle Blangero
Main category: cs.CL
TL;DR: Adaptive Chunking framework improves RAG performance by selecting optimal chunking strategies per document using five novel intrinsic metrics, achieving significant gains in answer correctness and question coverage.
Details
Motivation: Current RAG systems use one-size-fits-all chunking approaches that fail to capture document nuances, and there's no dedicated evaluation framework for chunking quality independent of downstream performance.
Method: Introduces Adaptive Chunking with five intrinsic metrics (RC, ICC, DCC, BI, SC) to assess chunking quality, plus two new chunkers (LLM-regex splitter and split-then-merge recursive splitter) with post-processing techniques.
Result: On a diverse corpus (legal, technical, social science), adaptive chunking increased answer correctness to 72% (from 62-64%) and increased the number of successfully answered questions by over 30% (65 vs. 49) without changing models or prompts.
Conclusion: Document-aware adaptive chunking guided by complementary intrinsic metrics offers a practical path to more robust RAG systems, demonstrating significant improvements over standard approaches.
Abstract: The effectiveness of Retrieval-Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used “one-size-fits-all” approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on a set of five novel intrinsic, document-based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), which directly assess chunking quality across key dimensions. To support this framework, we also introduce two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, alongside targeted post-processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric-guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework improves RAG outcomes, raising answer correctness to 72% (from 62-64%) and increasing the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document-aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. Code available at https://github.com/ekimetrics/adaptive-chunking.
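A minimal sketch of metric-guided chunker selection: score each candidate chunking of a document on intrinsic metrics and keep the best. The two scoring functions below are crude hypothetical stand-ins for the paper's RC/ICC/DCC/BI/SC metrics, and the candidate chunkers are toys.

```python
def size_compliance(chunks, lo=200, hi=1200):
    # Share of chunks whose character length falls in a target band.
    return sum(lo <= len(c) <= hi for c in chunks) / max(len(chunks), 1)

def block_integrity(chunks):
    # Crude proxy: share of chunks that end on a sentence boundary.
    return sum(c.rstrip().endswith((".", "?", "!")) for c in chunks) / max(len(chunks), 1)

def select_chunker(document, chunkers, weights=(0.5, 0.5)):
    best_name, best_chunks, best_score = None, None, float("-inf")
    for name, chunk_fn in chunkers.items():
        chunks = chunk_fn(document)
        score = weights[0] * size_compliance(chunks) + weights[1] * block_integrity(chunks)
        if score > best_score:
            best_name, best_chunks, best_score = name, chunks, score
    return best_name, best_chunks

chunkers = {
    "fixed": lambda d: [d[i:i + 800] for i in range(0, len(d), 800)],
    "paragraph": lambda d: [p for p in d.split("\n\n") if p.strip()],
}
name, chunks = select_chunker("First paragraph.\n\nSecond paragraph.", chunkers)
print(name, len(chunks))
```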
[35] Large Language Model as Token Compressor and Decompressor
Wenbing Li, Zikai Song, Jielei Zhang, Tianhao Zhao, Junkai Lin, Yiran Wang, Wei Yang
Main category: cs.CL
TL;DR: LLM-based text compression method that translates long texts into compact latent codes (Z-tokens) with variable-length compression based on semantic density, achieving up to 18x token reduction while preserving reconstruction fidelity.
Details
Motivation: To address the challenge of long-context processing in LLMs by developing a token-efficient compression method that can reduce computational overhead while maintaining information integrity.
Method: Self-expressive autoencoding framework that fine-tunes a pretrained LLM with LoRA-based adapter heads to compress text into discrete, variable-length latent codes (Z-tokens) and reconstruct original text exactly from them.
Result: Achieves up to 18 times token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style datasets while preserving reconstruction fidelity and downstream task performance.
Conclusion: Off-the-shelf LLMs can serve as effective token compressors/decompressors, enabling content-adaptive compression that supports prompt compression and autoregressive generation in compressed token space for token-efficient long-context reasoning.
Abstract: In this paper, we establish the novel insight that an off-the-shelf LLM can function as an excellent token compressor and decompressor. To demonstrate, we design a self-expressive autoencoding learning framework that fine-tunes a pretrained LLM to translate long texts into a compact internal language of discrete, variable-length latent codes, termed Z-tokens, and to reconstruct the original text exactly from them. The resulting representation is content-adaptive: semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed, via lightweight LoRA-based adapter heads. Empirically, our method achieves up to 18 times token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance. This simple yet effective design supports applications including prompt compression and autoregressive generation directly in the Z-token space, offering a potential pathway toward token-efficient long-context reasoning.
[36] TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning
Xu Huang, Zhejian Lai, Zixian Huang, Jiajun Chen, Shujian Huang
Main category: cs.CL
TL;DR: TAPO is a reinforcement learning framework that uses English as a pivot to improve multilingual mathematical reasoning by decoupling understanding from reasoning through step-level relative advantage.
Details
Motivation: LLMs show strong English mathematical reasoning but perform poorly in multilingual contexts due to language understanding deficiencies. The goal is to bridge this performance gap between English and other languages.
Method: Translation-Augmented Policy Optimization (TAPO) builds on GRPO with an explicit alignment strategy using English as a pivot. It employs step-level relative advantage mechanism to decouple understanding from reasoning, allowing integration of translation quality rewards without optimization conflicts.
Result: TAPO outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, generalizes well to unseen languages and out-of-domain tasks, and is compatible with various models.
Conclusion: TAPO effectively synergizes language understanding with reasoning capabilities, bridging the multilingual performance gap in mathematical reasoning through a novel reinforcement learning approach.
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.
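A minimal sketch of a group-relative (GRPO-style) advantage computed separately for the understanding and reasoning steps, as the step-level mechanism above suggests; the reward functions and the two-way step split are illustrative assumptions, not the paper's algorithm.

```python
import statistics

def group_advantages(rewards):
    # Normalize rewards within one sampled group: (r - mean) / std.
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sd for r in rewards]

def step_level_advantages(translation_rewards, answer_rewards):
    # Decoupled credit assignment: translation tokens get the
    # translation-quality advantage, reasoning tokens get the
    # answer-correctness advantage, avoiding optimization conflicts.
    return {
        "understand": group_advantages(translation_rewards),
        "reason": group_advantages(answer_rewards),
    }

adv = step_level_advantages(
    translation_rewards=[0.9, 0.4, 0.7, 0.6],  # e.g. COMET-style scores
    answer_rewards=[1.0, 0.0, 1.0, 0.0],       # exact-match correctness
)
print(adv["understand"], adv["reason"])
```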
[37] Navigating the Prompt Space: Improving LLM Classification of Social Science Texts Through Prompt Engineering
Erkan Gunes, Christoffer Florczak, Tevfik Murat Yildirim
Main category: cs.CL
TL;DR: Systematic study of prompt engineering for LLM text classification shows minimal context yields best performance gains, with diminishing returns and sometimes negative effects from additional context.
Details
Motivation: While LLMs show promise for text classification in social sciences with cost advantages, performance varies widely. The paper aims to determine how to maximize performance through systematic prompt engineering.
Method: Systematically varied three aspects of prompt engineering: label descriptions, instructional nudges, and few-shot examples across two different examples. Tested performance with different levels of prompt context.
Result: Minimal increase in prompt context yields highest performance gains, while further context increases provide only marginal improvements. Alarmingly, increased context sometimes decreases accuracy. Substantial heterogeneity across models, tasks, and batch sizes.
Conclusion: Individual validation needed for each LLM coding task rather than reliance on general rules. Simple prompt engineering often outperforms complex approaches.
Abstract: Recent developments in text classification using Large Language Models (LLMs) in the social sciences suggest that costs can be cut significantly, while performance can sometimes rival existing computational methods. However, with a wide variance in performance in current tests, we move to the question of how to maximize performance. In this paper, we focus on prompt context as a possible avenue for increasing accuracy by systematically varying three aspects of prompt engineering: label descriptions, instructional nudges, and few-shot examples. Across two different examples, our tests illustrate that a minimal increase in prompt context yields the highest increase in performance, while further increases in context only tend to yield marginal performance increases thereafter. Alarmingly, increasing prompt context sometimes decreases accuracy. Furthermore, our tests suggest substantial heterogeneity across models, tasks, and batch size, underlining the need for individual validation of each LLM coding task rather than reliance on general rules.
[38] Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties
Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi, Rico Sennrich
Main category: cs.CL
TL;DR: LLM-based synthetic data generation fails for Romansh due to confusion between its 6 varieties; a better approach aligns the data augmentation direction with the resource gradient, achieving a 23 BLEU improvement over Gemini 3 Pro.
Details
Motivation: Current LLM-based synthetic data generation methods for low-resource machine translation fail for Romansh because LLMs confuse its 6 distinct language varieties, requiring a different approach.
Method: Align the direction of data augmentation with the resource gradient between source and target language instead of relying on LLMs for synthetic data generation.
Result: Surpasses Gemini 3 Pro by 23 BLEU in the lowest-resource Romansh variety; human evaluation confirms it is the first model to generate fluent translations in the individual Romansh varieties.
Conclusion: Resource-gradient-aligned data augmentation outperforms LLM-based synthetic data generation for low-resource languages with multiple varieties, such as Romansh.
Abstract: Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
[39] An Experimental Comparison of the Most Popular Approaches to Fake News Detection
Pietro Dell’Oglio, Alessandro Bondielli, Francesco Marcelloni, Lucia C. Passaro
Main category: cs.CL
TL;DR: A comprehensive evaluation of 12 fake news detection approaches across 10 datasets, focusing on text-only English content and assessing performance under in-domain, multi-domain, and cross-domain scenarios.
Details
Motivation: The increasing sophistication of fake news production using LLMs and social media amplification necessitates a critical assessment of existing detection methods to understand their limitations and generalization capabilities.
Method: Evaluated 12 representative approaches (traditional ML, deep learning, transformers, cross-domain architectures) on 10 public datasets treated as distinct domains. Conducted in-domain, multi-domain, and cross-domain experiments with harmonized binary labels (“Real” vs “Fake”).
Result: Fine-tuned models perform well in-domain but struggle with generalization. Cross-domain architectures reduce the gap but are data-hungry. LLMs show promise through zero- and few-shot learning capabilities.
Conclusion: Fake news detection faces significant generalization challenges across domains. LLMs offer promising alternatives but results should be interpreted as robustness evaluations within the English text-only protocol due to dataset confounds and possible pre-training exposure.
Abstract: In recent years, fake news detection has received increasing attention in public debate and scientific research. Despite advances in detection techniques, the production and spread of false information have become more sophisticated, driven by Large Language Models (LLMs) and the amplification power of social media. We present a critical assessment of 12 representative fake news detection approaches, spanning traditional machine learning, deep learning, transformers, and specialized cross-domain architectures. We evaluate these methods on 10 publicly available datasets differing in genre, source, topic, and labeling rationale. We address text-only English fake news detection as a binary classification task by harmonizing labels into “Real” and “Fake” to ensure a consistent evaluation protocol. We acknowledge that label semantics vary across datasets and that harmonization inevitably removes such semantic nuances. Each dataset is treated as a distinct domain. We conduct in-domain, multi-domain and cross-domain experiments to simulate real-world scenarios involving domain shift and out-of-distribution data. Fine-tuned models perform well in-domain but struggle to generalize. Cross-domain architectures can reduce this gap but are data-hungry, while LLMs offer a promising alternative through zero- and few-shot learning. Given inherent dataset confounds and possible pre-training exposure, results should be interpreted as robustness evaluations within this English, text-only protocol.
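A minimal sketch of the in-domain vs. cross-domain protocol with a simple TF-IDF baseline: train on one dataset ("domain") and test on every other. The datasets here are stubbed toys, with labels harmonized to 0 = Real, 1 = Fake as in the paper's protocol.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def cross_domain_eval(domains):
    """domains: {name: (texts, labels)} -> {(train, test): macro-F1}."""
    results = {}
    for train_name, (X_tr, y_tr) in domains.items():
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(X_tr, y_tr)
        for test_name, (X_te, y_te) in domains.items():
            pred = clf.predict(X_te)  # diagonal = in-domain, off-diagonal = cross
            results[(train_name, test_name)] = f1_score(y_te, pred, average="macro")
    return results

toy = {
    "politics": (["tax cut passed", "aliens run congress",
                  "budget approved", "moon is cheese"], [0, 1, 0, 1]),
    "health": (["vaccine trial succeeds", "garlic cures covid",
                "new ward opens", "5g spreads flu"], [0, 1, 0, 1]),
}
for (tr, te), f1 in cross_domain_eval(toy).items():
    print(f"train={tr} test={te} macroF1={f1:.2f}")
```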
[40] Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence
Nikolai Ilinykh, Hyewon Jang, Shalom Lappin, Asad Sayeed, Sharid Loáiciga
Main category: cs.CL
TL;DR: VLMs generate visually-grounded narratives with systematic coherence differences from human-written stories, showing human-like surface fluency but distinct discourse organization patterns.
Details
Motivation: To understand how vision-language models (VLMs) organize discourse in visually-grounded narratives compared to humans, examining narrative coherence beyond surface fluency.
Method: Compare human-written and VLM-generated narratives using metrics for coreference, discourse relations, topic continuity, character persistence, and multimodal character grounding on the Visual Writing Prompts corpus.
Result: VLMs show broadly similar coherence profiles that differ systematically from humans; individual metric differences are subtle but become clear when considered jointly.
Conclusion: Despite human-like surface fluency, model narratives exhibit systematic differences in discourse organization across visually-grounded stories.
Abstract: We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at https://github.com/GU-CLASP/coherence-driven-humans.
[41] PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency
Minseo Kim, Sujeong Im, Junseong Choi, Junhee Lee, Chaeeun Shim, Edward Choi
Main category: cs.CL
TL;DR: PICon is an evaluation framework that uses systematic interrogation-style questioning to test persona agents for consistency across three dimensions: internal, external, and retest consistency.
Details
Motivation: As LLM-based persona agents are increasingly used as proxies for human participants, there's a need for systematic verification of whether their responses remain contradiction-free and factually accurate throughout interactions.
Method: Applies interrogation methodology principles to probe persona agents through logically chained multi-turn questioning, evaluating consistency along three dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition).
Result: Evaluating seven groups of persona agents and 63 human participants revealed that even previously reported highly consistent systems fail to meet human baselines across all three dimensions, showing contradictions and evasive responses under chained questioning.
Conclusion: Provides both conceptual foundation and practical methodology for evaluating persona agents before trusting them as substitutes for human participants, with implications for reliability assessment in human-substitute applications.
Abstract: Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent’s responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist-edlab.github.io/picon/
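A minimal sketch of the retest-consistency dimension, assuming hypothetical `ask` (one LLM call given the dialogue history) and `same_answer` (a semantic-match check) helpers; the paper's actual probing and scoring are more involved.

```python
def retest_consistency(ask, same_answer, question_chain, runs=3):
    """Fraction of chained questions answered consistently across `runs`
    independent interrogations of the same persona agent."""
    transcripts = []
    for _ in range(runs):
        history, answers = [], []
        for q in question_chain:     # logically chained multi-turn probe
            a = ask(q, history)
            history.append((q, a))
            answers.append(a)
        transcripts.append(answers)
    stable = sum(
        all(same_answer(t[i], transcripts[0][i]) for t in transcripts)
        for i in range(len(question_chain))
    )
    return stable / len(question_chain)
```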
[42] Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers
Mingmeng Geng, Yuhang Dong, Thierry Poibeau
Main category: cs.CL
TL;DR: Analysis of arXiv papers reveals LLM-driven word usage shifts like increased “beyond”/“via” in titles and decreased “the”/“of” in abstracts, with classifiers struggling to identify specific LLM sources due to model similarities and evolving patterns.
Details
Motivation: To investigate how large language models are influencing academic writing patterns on arXiv, particularly focusing on word usage shifts that haven't been sufficiently studied, and to understand the challenges in detecting which specific LLM generated academic text.
Method: Analyzed arXiv papers to identify word frequency changes, used linear interpretable approaches to quantify LLM effects, accounted for differences between models and prompts, and tested classifier performance on multi-class LLM identification tasks.
Result: Found significant LLM-driven word usage shifts (increased “beyond”/“via” in titles, decreased “the”/“of” in abstracts), classifiers struggle with multi-class LLM identification due to model similarities, and real-world LLM usage shows heterogeneous and dynamic patterns.
Conclusion: LLMs are measurably changing academic writing patterns in detectable ways, but current detection methods face challenges due to model similarities and evolving usage patterns, requiring more sophisticated approaches for accurate LLM attribution.
Abstract: Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of “beyond” and “via” in titles and the decreased frequency of “the” and “of” in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.
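The frequency shifts themselves are easy to reproduce in spirit; a minimal sketch, assuming a hypothetical `docs_by_year` mapping of years to title or abstract strings:

```python
import re
from collections import Counter

def word_rate_by_year(docs_by_year, word):
    """Frequency of `word` per 1,000 tokens for each year, a minimal way to
    surface shifts such as rising "beyond"/"via" in titles.
    docs_by_year: {year: [title_or_abstract_strings]} (placeholder corpus)."""
    rates = {}
    for year, docs in sorted(docs_by_year.items()):
        tokens = [t for d in docs for t in re.findall(r"[a-z]+", d.lower())]
        counts = Counter(tokens)
        rates[year] = 1000.0 * counts[word] / max(len(tokens), 1)
    return rates
```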
[43] Measuring What Matters – or What’s Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors
Cole Walsh, Rodica Ivan
Main category: cs.CL
TL;DR: LLM-based automated scoring system shows robustness to most construct-irrelevant factors but penalizes text duplication and off-topic responses in essay scoring.
Details
Motivation: To investigate the robustness of LLM-based automated scoring systems to construct-irrelevant factors in educational assessment, given concerns about hallucinations and vulnerability to adversarial conditions in traditional automated scoring systems.
Method: Study examines effects of various construct-irrelevant factors (meaningless text padding, spelling errors, writing sophistication, text duplication, off-topic responses) on a dual-architecture LLM-based scoring system for short essay-like responses in situational judgment tests.
Result: System was robust to padding, spelling errors, and writing sophistication. Text duplication resulted in lower scores (contradicting previous non-LLM systems), and off-topic responses were heavily penalized.
Conclusion: LLM-based scoring systems show encouraging robustness to construct-irrelevant factors when designed with construct relevance in mind, though careful attention is needed for text duplication and off-topic responses.
Abstract: Automated systems have been widely adopted across the educational testing industry for open-response assessment and essay scoring. These systems commonly achieve performance levels comparable to or superior to those of trained human raters, but have frequently been demonstrated to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that are unrelated to the construct assessed) and adversarial conditions. Given the rising usage of large language models in automated scoring systems, there is a renewed focus on “hallucinations” and the robustness of these LLM-based automated scoring approaches to construct-irrelevant factors. This study investigates the effects of construct-irrelevant factors on a dual-architecture LLM-based scoring system designed to score short essay-like open-response items in a situational judgment test. It was found that the scoring system was generally robust to padding responses with meaningless text, spelling errors, and writing sophistication. Duplicating large passages of text resulted in lower scores predicted by the system, on average, contradicting results from previous studies of non-LLM-based scoring systems, while off-topic responses were heavily penalized by the scoring system. These results provide encouraging support for the robustness of future LLM-based scoring systems when designed with construct relevance in mind.
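The perturbation side of such a robustness study is straightforward to sketch; below, `score` stands in for the paper's dual-architecture scoring system, and the filler text is an arbitrary placeholder:

```python
def duplicate_passages(text, times=2):
    """Repeat the response body; the factor the system was found to penalize."""
    return " ".join([text] * times)

def pad_with_filler(text, filler="lorem ipsum", n=30):
    """Append meaningless padding, which the system tolerated."""
    return text + " " + " ".join([filler] * n)

def robustness_delta(score, responses, perturb):
    """Mean score change under a perturbation; near zero suggests robustness."""
    deltas = [score(perturb(r)) - score(r) for r in responses]
    return sum(deltas) / len(deltas)
```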
[44] Self-Improvement of Large Language Models: A Technical Overview and Future Outlook
Haoyan Yang, Mario Xerri, Solha Park, Huajian Zhang, Yiyang Feng, Sai Akhil Kogilathota, Jiawei Zhou
Main category: cs.CL
TL;DR: A framework for self-improving language models that organizes existing techniques into a closed-loop lifecycle with four processes: data acquisition, selection, model optimization, and inference refinement, plus autonomous evaluation.
Details
Motivation: Human supervision for LLM improvement is becoming increasingly costly and limited as models approach human-level capabilities. The growing ability of models to make autonomous decisions enables automating components of the model development process, driving interest in self-improvement where models autonomously generate data, evaluate outputs, and refine their capabilities.
Method: Presents a system-level perspective with a unified framework organizing self-improvement as a closed-loop lifecycle. The framework consists of four tightly coupled processes: data acquisition, data selection, model optimization, and inference refinement, plus an autonomous evaluation layer. The model itself drives each stage while autonomous evaluation monitors progress and guides improvement.
Result: The paper systematically reviews and analyzes representative methods for each component from a technical standpoint, providing a comprehensive framework for understanding self-improving LLM techniques.
Conclusion: The framework offers a structured approach to conceptualizing self-improving language models, discusses current limitations, and outlines a vision for future research toward fully self-improving LLMs.
Abstract: As large language models (LLMs) continue to advance, improving them solely through human supervision is becoming increasingly costly and limited in scalability. As models approach human-level capabilities in certain domains, human feedback may no longer provide sufficiently informative signals for further improvement. At the same time, the growing ability of models to make autonomous decisions and execute complex actions naturally enables abstractions in which components of the model development process can be progressively automated. Together, these challenges and opportunities have driven increasing interest in self-improvement, where models autonomously generate data, evaluate outputs, and iteratively refine their own capabilities. In this paper, we present a system-level perspective on self-improving language models and introduce a unified framework that organizes existing techniques. We conceptualize the self-improvement system as a closed-loop lifecycle, consisting of four tightly coupled processes: data acquisition, data selection, model optimization, and inference refinement, along with an autonomous evaluation layer. Within this framework, the model itself plays a central role in driving each stage: collecting or generating data, selecting informative signals, updating its parameters, and refining outputs, while the autonomous evaluation layer continuously monitors progress and guides the improvement cycle across stages. Following this lifecycle perspective, we systematically review and analyze representative methods for each component from a technical standpoint. We further discuss current limitations and outline our vision for future research toward fully self-improving LLMs.
[45] S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation
Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava
Main category: cs.CL
TL;DR: S2D2 is a training-free self-speculative decoding framework for block-diffusion language models that improves speed-accuracy tradeoffs by using the same model as both drafter and verifier through hybrid diffusion-autoregressive decoding.
Details
Motivation: Standard confidence-thresholded decoding in block-diffusion language models is brittle in the few-step regime needed for practical acceleration: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing solutions require additional training or extra test-time compute.
Method: S2D2 uses a training-free self-speculative decoding framework in which the same pretrained block-diffusion model acts as both drafter and verifier by reducing the block size to one for autoregressive verification. It inserts speculative verification steps into standard block-diffusion decoding, with lightweight routing policies deciding when verification is cost-effective.
Result: Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR: up to 4.7× speedup over autoregressive decoding, up to 1.57× over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, it remains complementary to built-in self-correction, running 4.4× faster than the static baseline with slightly higher accuracy.
Conclusion: S2D2 provides an effective training-free solution for improving block-diffusion language model decoding efficiency through self-speculative verification, achieving better speed-accuracy tradeoffs without requiring model retraining or significant additional compute.
Abstract: Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to 4.7× speedup over autoregressive decoding, and up to 1.57× over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is 4.4× faster than the static baseline with slightly higher accuracy.
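A minimal sketch of one S2D2-style draft-and-verify step, assuming a hypothetical model wrapper with `draft_block` (parallel within-block denoising) and `ar_next_token` (the same model at block size one); a real implementation would verify all draft positions in a single parallel forward pass rather than token by token:

```python
def s2d2_step(model, prefix, block_size):
    """One hybrid decoding step: diffusion drafts a block, the autoregressive
    mode (block size one) verifies it as a local sequence-level critic."""
    draft = model.draft_block(prefix, block_size)  # parallel proposal
    accepted = list(prefix)
    for tok in draft:
        v = model.ar_next_token(accepted)  # verifier token (batched in practice)
        accepted.append(v)                 # keep the verifier's choice either way
        if v != tok:
            break                          # first mismatch ends this block
    return accepted
```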
[46] Natural-Language Agent Harnesses
Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng
Main category: cs.CL
TL;DR: Natural-Language Agent Harnesses (NLAHs) externalize agent control logic as portable natural language artifacts executed by Intelligent Harness Runtime (IHR) for better transferability and study.
Details
Motivation: Agent performance depends on harness engineering, but current harness design is buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object.
Method: Introduce Natural-Language Agent Harnesses (NLAHs) that express harness behavior in editable natural language, and Intelligent Harness Runtime (IHR) that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters.
Result: Conducted controlled evaluations across coding and computer-use benchmarks, assessing operational viability, module ablation, and code-to-text harness migration.
Conclusion: High-level control logic of agent harnesses can be externalized as portable executable artifacts using natural language, enabling better transferability, comparison, and study of harness engineering.
Abstract: Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.
[47] CodeRefine: A Pipeline for Enhancing LLM-Generated Code Implementations of Research Papers
Ekaterina Trofimova, Emil Sataev, Abhijit Singh Jowhari
Main category: cs.CL
TL;DR: CodeRefine is an LLM-based framework that automatically converts research paper methodologies into functional code through text extraction, knowledge graph creation, and retrospective retrieval-augmented generation.
Details
Motivation: The paper addresses the challenge of bridging theoretical research and practical implementation by automating the transformation of paper methodologies into functional code, which traditionally requires significant manual effort and expertise.
Method: Multi-step approach: 1) Extract and summarize key text chunks from papers, 2) Analyze code relevance of text segments, 3) Create knowledge graph using predefined ontology, 4) Generate code from structured representation, 5) Enhance code through retrospective retrieval-augmented generation.
Result: Evaluations on diverse scientific papers demonstrate CodeRefine’s ability to improve code implementation from papers, offering more accurate results than LLM zero-shot prompting approaches.
Conclusion: CodeRefine provides an effective framework for automating research-to-code transformation, potentially accelerating the adoption of cutting-edge algorithms in real-world applications.
Abstract: This paper presents CodeRefine, a novel framework for automatically transforming research paper methodologies into functional code using Large Language Models (LLMs). Our multi-step approach first extracts and summarizes key text chunks from papers, analyzes their code relevance, and creates a knowledge graph using a predefined ontology. Code is then generated from this structured representation and enhanced through a proposed retrospective retrieval-augmented generation approach. CodeRefine addresses the challenge of bridging theoretical research and practical implementation, offering a more accurate alternative to LLM zero-shot prompting. Evaluations on diverse scientific papers demonstrate CodeRefine’s ability to improve code implementation from the paper, potentially accelerating the adoption of cutting-edge algorithms in real-world applications.
[48] LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, Jing Shao
Main category: cs.CL
TL;DR: ActorBreaker: A novel attack method that exploits natural distribution shifts to bypass LLM safety mechanisms by identifying actors related to toxic prompts and crafting multi-turn prompts that gradually elicit unsafe content.
Details
Motivation: LLMs are exposed to harmful data during pre-training, creating safety vulnerabilities. The paper identifies a new vulnerability: susceptibility to natural distribution shifts where semantically related but seemingly benign prompts can bypass safety mechanisms.
Method: Introduces ActorBreaker attack method based on Latour’s actor-network theory, identifying both human and non-human actors related to toxic prompts within the pre-training distribution to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content.
Result: ActorBreaker outperforms existing attack methods in diversity, effectiveness, and efficiency across aligned LLMs. The paper also constructs a multi-turn safety dataset using ActorBreaker and shows that fine-tuning models on this dataset improves robustness with some utility trade-offs.
Conclusion: LLMs have a new safety vulnerability to natural distribution shifts, requiring expanded safety training covering broader semantic spaces of toxic content. ActorBreaker demonstrates this vulnerability and provides a dataset for improving safety.
Abstract: Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to natural distribution shifts between attack prompts and original toxic prompts, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. To explore this issue, we introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within the pre-training distribution to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content. ActorBreaker is grounded in Latour’s actor-network theory, encompassing both human and non-human actors to capture a broader range of vulnerabilities. Our experimental results demonstrate that ActorBreaker outperforms existing attack methods in terms of diversity, effectiveness, and efficiency across aligned LLMs. To address this vulnerability, we propose expanding safety training to cover a broader semantic space of toxic content. We thus construct a multi-turn safety dataset using ActorBreaker. Fine-tuning models on our dataset shows significant improvements in robustness, though with some trade-offs in utility. Code is available at https://github.com/AI45Lab/ActorAttack.
[49] Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation
Zerui Xu, Fang Wu, Yingzhou Lu, Yuanyuan Zhang, Yue Zhao
Main category: cs.CL
TL;DR: LLM-based framework generates synthetic clinical trial reports with outcomes using retrieval-reasoning approach for privacy-preserving data augmentation
Details
Motivation: Clinical ML applications face data scarcity due to privacy concerns and high costs; LLMs show promise for synthetic data generation but haven't been explored for clinical trial synthesis.
Method: Retrieval-Reasoning framework with few-shot LLM prompting: retrieval module grounds generation on relevant trial data, reasoning module ensures domain-consistent justifications for binary outcomes
Result: Synthetic trials effectively augment real datasets; BioBERT classifier fine-tuned on hybrid (synthetic+real) data shows improved performance on clinical trial outcome prediction
Conclusion: LLM-based synthetic data can serve as powerful tool for privacy-preserving data augmentation in clinical research
Abstract: Machine learning (ML) holds great promise for clinical applications but is often hindered by limited access to high-quality data due to privacy concerns, high costs, and long timelines associated with clinical trials. While large language models (LLMs) have demonstrated strong performance in general-purpose generation tasks, their application to synthesizing realistic clinical trials remains underexplored. In this work, we propose a novel Retrieval-Reasoning framework that leverages few-shot prompting with LLMs to generate synthetic clinical trial reports annotated with binary success/failure outcomes. Our approach integrates a retrieval module to ground the generation on relevant trial data and a reasoning module to ensure domain-consistent justifications. Experiments conducted on real clinical trials from the ClinicalTrials.gov database demonstrate that the generated synthetic trials effectively augment real datasets. Fine-tuning a BioBERT classifier on synthetic data, real data, or their combination shows that hybrid fine-tuning leads to improved performance on clinical trial outcome prediction tasks. Our results suggest that LLM-based synthetic data can serve as a powerful tool for privacy-preserving data augmentation in clinical research. The code is available at https://github.com/XuZR3x/Retrieval_Reasoning_Clinical_Trial_Generation.
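A sketch of how the retrieval and reasoning modules might compose into a single few-shot prompt; `retrieve` and the trial record fields (`report`, `outcome`, `rationale`) are assumptions, not the paper's API:

```python
def build_prompt(retrieve, condition, k=3):
    """Compose retrieved trials (grounding) with a reasoning instruction."""
    shots = "\n\n".join(
        f"Trial: {t['report']}\nOutcome: {t['outcome']}\nRationale: {t['rationale']}"
        for t in retrieve(condition, k=k)  # similar real trials ground the generation
    )
    return (
        f"{shots}\n\n"
        f"Write a realistic synthetic clinical trial report for {condition}. "
        "Label it with a binary success/failure outcome and justify the "
        "outcome with a domain-consistent rationale."
    )
```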
[50] Exploiting Domain-Specific Parallel Data on Multilingual Language Models for Low-resource Language Translation
Surangika Ranathunga, Shravan Nayak, Shih-Ting Cindy Huang, Yanke Mao, Tong Su, Yun-Hsiang Ray Chan, Songchen Yuan, Anthony Rinaldi, Annie En-Shiun Lee
Main category: cs.CL
TL;DR: Evaluating fine-tuning vs. further pre-training of multilingual sequence-to-sequence language models for domain-specific low-resource language neural machine translation using auxiliary parallel data.
Details
Motivation: Multilingual NMT systems struggle with low-resource languages due to limited parallel data and poor language representation in models, restricting domain-specific NMT capabilities for these languages.
Method: Compare two techniques for utilizing auxiliary parallel data: fine-tuning vs. further pre-training of multilingual sequence-to-sequence language models. Also explore impact of domain divergence on NMT performance.
Result: Evaluation shows effectiveness of both techniques for domain-specific low-resource language NMT. Domain divergence impacts model performance. Several strategies recommended for using auxiliary parallel data.
Conclusion: Provides guidance on building domain-specific NMT models for low-resource languages by effectively utilizing auxiliary parallel data through appropriate techniques based on domain characteristics.
Abstract: Neural Machine Translation (NMT) systems built on multilingual sequence-to-sequence Language Models (msLMs) fail to deliver expected results when the amount of parallel data for a language, as well as the language’s representation in the model are limited. This restricts the capabilities of domain-specific NMT systems for low-resource languages (LRLs). As a solution, parallel data from auxiliary domains can be used either to fine-tune or to further pre-train the msLM. We present an evaluation of the effectiveness of these two techniques in the context of domain-specific LRL-NMT. We also explore the impact of domain divergence on NMT model performance. We recommend several strategies for utilizing auxiliary parallel data in building domain-specific NMT models for LRLs.
[51] The Value of Nothing: Multimodal Extraction of Human Values Expressed by TikTok Influencers
Alina Starovolsky-Shitrit, Alon Neduva, Naama Appel Doron, Itamar Gafni, Ella Daniel, Oren Tsur
Main category: cs.CL
TL;DR: Extracting implicit values from TikTok videos using language models, comparing direct video extraction vs. 2-step text-based approach with annotated dataset.
Details
Motivation: Social platforms like TikTok have become significant channels for value transmission to youth, replacing traditional sources like parents and educators. There's a need to understand what values are being transmitted through visual social media content.
Method: Created annotated dataset of TikTok videos using Schwartz Theory of Personal Values. Compared two pipelines: 1) direct value extraction from videos, and 2) 2-step approach converting videos to elaborated scripts then extracting values from text. Tested various language models including few-shot LLMs vs. fine-tuned MLMs.
Result: 2-step approach performed significantly better than direct approach. Few-shot LLM application in both stages outperformed fine-tuned Masked Language Model in second stage. Created first values-annotated dataset of TikTok videos.
Conclusion: First attempt to extract values from TikTok/visual social media. Results enable future research on value transmission in video-based platforms. Methodology shows promise for analyzing multimodal content through text-based approaches.
Abstract: Societal and personal values are transmitted to younger generations through interaction and exposure. Traditionally, children and adolescents learned values from parents, educators, or peers. Nowadays, social platforms serve as a significant channel through which youth (and adults) consume information, as the main medium of entertainment, and possibly the medium through which they learn different values. In this paper we extract implicit values from TikTok movies uploaded by online influencers targeting children and adolescents. We curated a dataset of hundreds of TikTok movies and annotated them according to the well established Schwartz Theory of Personal Values. We then experimented with an array of language models, investigating their utility in value identification. Specifically, we considered two pipelines: direct extraction of values from video and a 2-step approach in which videos are first converted to elaborated scripts and values are extracted from the textual scripts. We find that the 2-step approach performs significantly better than the direct approach and that using a few-shot application of a Large Language Model in both stages outperformed the use of a fine-tuned Masked Language Model in the second stage. We further discuss the impact of continuous pretraining and fine-tuning and compare the performance of the different models on identification of values endorsed or confronted in the TikTok. Finally, we share the first values-annotated dataset of TikTok videos. To the best of our knowledge, this is the first attempt to extract values from TikTok specifically, and visual social media in general. Our results pave the way to future research on value transmission in video-based social platforms.
[52] Is Compression Really Linear with Code Intelligence?
Shijie Xuyang, Xianzhen Luo, Zheng Chu, Houyi Li, Siming Huang, Qiufeng Wang, Wanxiang Che, Qingfu Zhu, Shuigeng Zhou
Main category: cs.CL
TL;DR: The paper investigates the relationship between data compression (measured by bits-per-character) and code intelligence in Code LLMs, finding a logarithmic rather than linear relationship through comprehensive evaluation on multi-language, multi-task benchmarks.
Details
Motivation: Prior work assumed a linear relationship between compression and general intelligence but overlooked the complexity of code (multiple languages and tasks) and struggled with fair evaluation of modern Code LLMs. The authors aim to provide a more nuanced understanding of compression's role in code intelligence.
Method: Introduced Format Annealing, a lightweight, transparent training methodology for fair evaluation of pre-trained LLMs' code intelligence. Evaluated diverse open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. Used a novel large-scale code validation set from GitHub to measure compression efficacy via bits-per-character (BPC).
Result: Empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC, refining prior hypotheses of linearity. The authors suggest previous observations of linearity were likely observations of the logarithmic curve’s tail under specific, limited conditions.
Conclusion: The work provides a more nuanced understanding of compression’s role in developing code intelligence and contributes a robust evaluation framework for the code domain, challenging previous assumptions about the compression-intelligence relationship.
Abstract: Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence. However, it overlooked the multifaceted nature of code that encompasses diverse programming languages and tasks, and struggled with fair evaluation of modern Code LLMs. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To address the challenge of efficient and fair evaluation of pre-trained LLMs’ code intelligence, we introduce Format Annealing, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these pre-trained models equitably. Compression efficacy, measured as bits-per-character (BPC), is determined using a novel, large-scale, and previously unseen code validation set derived from GitHub. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC. This finding refines prior hypotheses of linearity, which we suggest are likely observations of the logarithmic curve’s tail under specific, limited conditions. Our work provides a more nuanced understanding of compression’s role in developing code intelligence and contributes a robust evaluation framework in the code domain.
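The BPC measure itself is simple to compute from model log-probabilities; a minimal sketch, assuming natural-log token probabilities as returned by most model APIs:

```python
import math

def bits_per_character(token_logprobs, text):
    """Bits-per-character: total negative log2-probability of the tokens
    divided by the character count of the encoded text."""
    total_bits = -sum(lp / math.log(2) for lp in token_logprobs)
    return total_bits / len(text)
```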
[53] Machine Learning for Enhancing Deliberation in Online Political Discussions and Participatory Processes: A Survey
Maike Behrendt, Stefan Sylvius Wagner, Carina Weinmann, Marike Bormann, Mira Warne, Stefan Harmeling
Main category: cs.CL
TL;DR: Literature review on using AI/machine learning to enhance deliberation quality in political online discussions by identifying tasks, existing tools, and assessing current capabilities and challenges.
Details
Motivation: Political online participation is increasingly important, but discussion quality depends on platform design. Machine learning offers potential to facilitate online communication and enhance deliberativeness in political discussions.
Method: Conducted a literature review to: (1) identify tasks that AI algorithms could solve to enhance deliberation aspects, (2) provide overview of existing AI-equipped tools/platforms, and (3) assess current AI performance and remaining challenges.
Result: The paper identifies specific issues in political online discussions and maps potential AI solutions to enhance deliberation. It catalogs existing AI-supported platforms and evaluates how well current AI support works, highlighting remaining challenges.
Conclusion: AI has significant potential to improve political online deliberation, but current implementations face challenges. The review provides a roadmap for future research and development in this intersection of AI and political communication.
Abstract: Political online participation in the form of discussing political issues and exchanging opinions among citizens is gaining importance with more and more formats being held digitally. To come to a decision, a thorough discussion and consideration of opinions and a civil exchange of arguments, which is defined as the act of deliberation, is desirable. The quality of discussions and participation processes in terms of their deliberativeness highly depends on the design of platforms and processes. To facilitate online communication for both participants and initiators, machine learning methods offer a lot of potential. In this work we want to showcase which issues occur in political online discussions and how machine learning can be used to counteract these issues and enhance deliberation. We conduct a literature review to (i) identify tasks that could potentially be solved by artificial intelligence (AI) algorithms to enhance individual aspects of deliberation in political online discussions, (ii) provide an overview on existing tools and platforms that are equipped with AI support and (iii) assess how well AI support currently works and where challenges remain.
[54] Elementary Math Word Problem Generation using Large Language Models
Nimesh Ariyarathne, Harshani Bandara, Yasith Heshan, Omega Gamage, Surangika Ranathunga, Dilan Nayanajith, Yutharsan Sivapalan, Gayathri Lihinikaduarachchi, Tharoosha Vihidun, Meenambika Chandirakumar, Sanujen Premakumar, Sanjula Gathsara
Main category: cs.CL
TL;DR: MathWiz is an LLM-based system for generating math word problems automatically without requiring initial text or equations, using only grade level and question type as input.
Details
Motivation: Manually creating math word problems is time-consuming for tutors, and existing LLM-based approaches require additional inputs like initial text or equations. There's a need for a system that can generate high-quality MWPs with minimal input requirements.
Method: Developed MathWiz system using Large Language Models with extensive experimentation involving different LLMs, prompting strategies, diversity improvement techniques, and human feedback methods to enhance performance.
Result: Human and automated evaluations showed generated MWPs are high quality with minimal spelling/grammar issues, but LLMs still struggle to generate questions that properly adhere to specified grade level and question type requirements.
Conclusion: LLMs can effectively generate high-quality math word problems with minimal input, though challenges remain in ensuring strict adherence to grade-level and question-type specifications.
Abstract: Mathematics is often perceived as a complex subject by students, leading to high failure rates in exams. To improve Mathematics skills, it is important to provide sample questions for students to practice problem-solving. Manually creating Math Word Problems (MWPs) is time-consuming for tutors, because they have to type in natural language while adhering to the grammar and spelling rules of the language. Early techniques that use pre-trained Language Models for MWP generation either require a tutor to provide the initial portion of the MWP, and/or additional information such as an equation. In this paper, we present an MWP generation system (MathWiz) based on Large Language Models (LLMs) that overcomes the need for additional input: the only input to our system is the number of MWPs needed, the grade and the type of question (e.g., addition, subtraction). Unlike the existing LLM-based solutions for MWP generation, we carried out an extensive set of experiments involving different LLMs, prompting strategies, techniques to improve the diversity of MWPs, as well as techniques that employ human feedback to improve LLM performance. Human and automated evaluations confirmed that the generated MWPs are high in quality, with minimal spelling and grammar issues. However, LLMs still struggle to generate questions that adhere to the specified grade and question type requirements.
[55] Instruction Following by Principled Boosting Attention of Large Language Models
Vitoria Guardieiro, Avishree Khare, Adam Stein, Eric Wong
Main category: cs.CL
TL;DR: Instruction Attention Boosting (InstABoost) - a simple inference-time intervention that applies constant additive bias to instruction-key attention logits to strengthen instruction influence without retraining, improving reliability and safety in LLMs.
Details
Motivation: LLMs often violate instructions under long contexts or conflicting user-provided context, creating reliability and safety risks. Current methods like prompting, latent steering, and attention steering have limitations: prompting is weak, latent methods cause fluency collapse, and prior attention methods over-focus on instructions.
Method: Proposes Instruction Attention Boosting (InstABoost), which applies a constant additive bias to instruction-key attention logits across all layers and heads. This intervention is based on a theoretical framework formalizing instruction following as rule-based competition between instruction rules and context-derived rules, with attention mediating which rules dominate.
Result: Evaluated across 15 tasks, InstABoost matches or outperforms all baselines (prompting, latent steering, prior attention steering methods) while avoiding fluency collapse of latent methods and instruction over-focus of prior attention methods, achieving a stronger steering-quality tradeoff.
Conclusion: InstABoost provides an effective inference-time intervention for strengthening instruction influence in LLMs without retraining, addressing reliability and safety concerns when instructions conflict with user context or long contexts.
Abstract: Large language models’ behavior is often shaped by instructions such as system prompts, refusal boundaries, privacy constraints, and tool-use rules that must hold at inference time. Yet in practice these constraints can be violated under long contexts or when user-provided context conflicts with them, creating reliability and safety risks. This motivates inference-time interventions that strengthen instruction influence without retraining. One such intervention is attention steering, which biases attention toward instruction tokens. In this work, we present a unifying theory for attention steering methods by formalizing instruction following as rule-based competition between instruction rules and context-derived rules, with attention mediating which rules dominate. We prove that boosting attention to instruction tokens tilts this competition, making it harder for context to override instruction-following. However, excessive boosting can suppress task-relevant context that should be incorporated alongside the instruction. Guided by this theory, we propose Instruction Attention Boosting (InstABoost), a simple intervention that applies a constant additive bias to instruction-key attention logits across all layers and heads. We evaluate InstABoost against prompting, latent steering, and prior attention steering methods across 15 tasks. InstABoost matches or outperforms all baselines while avoiding the fluency collapse of latent methods and the instruction over-focus of prior attention methods, achieving a stronger steering-quality tradeoff.
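The intervention reduces to one extra addition inside attention; a single-head NumPy sketch (the paper applies the bias across all layers and heads, and the bias value here is an arbitrary placeholder):

```python
import numpy as np

def boosted_attention(q, k, v, instr_mask, bias=3.0):
    """Single-head attention with a constant additive bias on the logits of
    instruction keys, tilting the rule competition toward instructions.
    q: (T, d) queries; k, v: (S, d) keys/values; instr_mask: (S,) booleans."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits[:, instr_mask] += bias  # the InstABoost-style intervention
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```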
[56] CodeNER: Code Prompting for Named Entity Recognition
Sungwoo Han, Hyeyeon Kim, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
Main category: cs.CL
TL;DR: Code-based prompting improves NER performance by embedding BIO schema instructions in code format within prompts, helping LLMs better understand labeling requirements.
Details
Motivation: Previous NER approaches using LLMs rely solely on input context without capturing detailed labeling requirements. There's a need to improve LLMs' understanding of NER-specific instructions and schema constraints.
Method: Proposes code-based prompting that embeds BIO schema instructions within code blocks in prompts, leveraging LLMs' ability to comprehend long-range scopes in programming languages to better understand NER requirements.
Result: Outperforms conventional text-based prompting on ten benchmarks across English, Arabic, Finnish, Danish, and German datasets. Combining with chain-of-thought prompting further improves performance.
Conclusion: Explicitly structuring NER instructions through code-based prompting effectively enhances LLMs’ NER capabilities by better conveying labeling requirements and schema constraints.
Abstract: Recent studies have explored various approaches for treating candidate named entity spans as both source and target sequences in named entity recognition (NER) by leveraging large language models (LLMs). Although previous approaches have successfully generated candidate named entity spans with suitable labels, they rely solely on input context information when using LLMs, particularly, ChatGPT. However, NER inherently requires capturing detailed labeling requirements with input context information. To address this issue, we propose a novel method that leverages code-based prompting to improve the capabilities of LLMs in understanding and performing NER. By embedding code within prompts, we provide detailed BIO schema instructions for labeling, thereby exploiting the ability of LLMs to comprehend long-range scopes in programming languages. Experimental results demonstrate that the proposed code-based prompting method outperforms conventional text-based prompting on ten benchmarks across English, Arabic, Finnish, Danish, and German datasets, indicating the effectiveness of explicitly structuring NER instructions. We also verify that combining the proposed code-based prompting method with the chain-of-thought prompting further improves performance.
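An illustrative code-style prompt in this spirit; the template below is an assumption, not the paper's verbatim prompt:

```python
def code_prompt(sentence):
    """Wrap the NER task in a code block so the BIO schema reads as program
    constraints rather than free-form instructions."""
    return f'''
# Task: named entity recognition with BIO labels.
# Allowed labels: "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "O".
# Constraint: an "I-X" tag may only follow "B-X" or "I-X" of the same type X.
tokens = {sentence.split()!r}
labels = []  # fill in exactly one BIO label per token, respecting the constraint
'''
```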
[57] Mapping the Course for Prompt-based Structured Prediction
Matt Pauk, Maria Leonor Pacheco
Main category: cs.CL
TL;DR: LLMs combined with combinatorial inference for structured prediction, improving consistency and accuracy over prompting alone
Details
Motivation: LLMs have strong language capabilities but suffer from hallucinations and inconsistencies, and struggle with complex reasoning due to autoregressive generation limitations. Need to improve structured prediction by combining LLMs with inference methods.
Method: Combine LLMs with combinatorial inference to marry LLMs' predictive power with the structural consistency from inference methods. Experiment with prompting strategies to estimate confidence values for downstream symbolic inference. Use calibration and fine-tuning with structured learning objectives.
Result: Incorporating symbolic inference yields more consistent and accurate predictions than prompting alone, independent of prompting strategy. Calibration and fine-tuning with structured learning objectives further increases performance on challenging tasks.
Conclusion: Structured learning remains valuable in the LLM era. Combining LLMs with combinatorial inference addresses hallucinations and improves consistency for structured prediction tasks.
Abstract: Large language models (LLMs) have demonstrated strong performance in a wide range of language tasks without requiring task-specific fine-tuning. However, they remain prone to hallucinations and inconsistencies, and often struggle with complex reasoning, in part due to the limitations of autoregressive generation. We propose to address some of these issues, particularly for structured prediction, by combining LLMs with combinatorial inference to marry the predictive power of LLMs with the structural consistency provided by inference methods. We perform exhaustive experiments in an effort to understand which prompting strategies can best estimate confidence values for downstream symbolic inference, and find that, independent of prompting strategy, incorporating symbolic inference yields more consistent and accurate predictions than prompting alone. Finally, we show that calibration and fine-tuning with structured learning objectives further increase performance on challenging tasks, highlighting that structured learning remains valuable in the era of LLMs.
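One concrete form of such symbolic inference is a Viterbi-style search over LLM-estimated confidences that enforces structural validity; a minimal sketch using BIO tagging as the structure, assuming `conf[i][label]` holds log-space confidence scores:

```python
def valid_transition(prev, cur):
    """BIO validity: I-X may only follow B-X or I-X of the same type."""
    return not cur.startswith("I-") or prev in (f"B-{cur[2:]}", cur)

def constrained_decode(conf, labels):
    """Viterbi-style search for the best label sequence under BIO constraints.
    conf: list of {label: log_confidence} dicts, one per token."""
    best = {lab: (conf[0][lab], [lab])
            for lab in labels if not lab.startswith("I-")}
    for scores in conf[1:]:
        nxt = {}
        for cur in labels:
            cands = [(s + scores[cur], path + [cur])
                     for prev, (s, path) in best.items()
                     if valid_transition(prev, cur)]
            if cands:
                nxt[cur] = max(cands)
        best = nxt
    return max(best.values())[1]
```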
[58] ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization
Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, Jingren Zhou
Main category: cs.CL
TL;DR: ReSum is a plug-and-play paradigm for LLM-based web agents that enables unbounded exploration by periodically summarizing interaction histories, with ReSum-GRPO using advantage broadcasting for credit assignment over long-horizon tasks.
Details
Motivation: LLM-based web agents face a conflict between extensive exploration needs and limited context windows. Current solutions require architectural modifications that break compatibility and need costly retraining.
Method: ReSum uses an external tool to periodically condense interaction histories into compact summaries. ReSum-GRPO adapts Group Relative Policy Optimization with advantage broadcasting to propagate rewards across segmented trajectories for credit assignment.
Result: ReSum achieves 4.5% improvement over ReAct in training-free settings, with ReSum-GRPO yielding further 8.2% gain. A ReSum-enhanced 30B agent with only 1K training samples achieves competitive performance with leading open-source models.
Conclusion: ReSum provides a lightweight, plug-and-play solution for unbounded exploration in web agents without architectural changes, with ReSum-GRPO enabling effective credit assignment over long-horizon tasks.
Abstract: Large Language Model (LLM)-based web agents excel at knowledge-intensive tasks but face a fundamental conflict between the need for extensive exploration and the constraints of limited context windows. Current solutions typically rely on architectural modifications, e.g., internal memory tokens, which break compatibility with pre-existing agents and necessitate costly end-to-end retraining. To overcome these limitations, we introduce ReSum, a lightweight, plug-and-play paradigm that enables unbounded exploration by periodically invoking an external tool to condense interaction histories into compact summaries. Although this paradigm functions without training, standard agents are not inherently aligned to reason over such compressed contexts. To bridge this gap, we propose ReSum-GRPO, which adapts Group Relative Policy Optimization (GRPO) via advantage broadcasting to propagate final rewards across segmented trajectories, enabling credit assignment over long horizons. Extensive experiments show that ReSum achieves a 4.5% improvement over ReAct in training-free settings, with ReSum-GRPO yielding a further 8.2% gain. Notably, with only 1K training samples, a ReSum-enhanced 30B agent achieves competitive performance with leading open-source models, showing ReSum’s effectiveness.
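The ReSum loop itself is compact; a minimal sketch with `llm`, `run_tool`, and `summarize` as placeholders and whitespace-split words as a crude stand-in for a token count:

```python
def resum_agent(llm, run_tool, summarize, task, budget_words=8000):
    """Unbounded exploration under a bounded context: once the history nears
    the budget, condense it into a compact summary and continue from there."""
    history = [f"Task: {task}"]
    while True:
        if sum(len(h.split()) for h in history) > budget_words:
            history = [f"Task: {task}",
                       f"Summary so far: {summarize(history)}"]
        step = llm("\n".join(history))       # agent proposes the next action
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
        history.append(step)
        history.append(run_tool(step))       # e.g., a web-search observation
```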
[59] Can GRPO Boost Complex Multimodal Table Understanding?
Xiaoqiang Kang, Shengen Wu, Zimu Wang, Yilin Liu, Xiaobo Jin, Kaizhu Huang, Wei Wang, Yutao Yue, Xiaowei Huang, Qiufeng Wang
Main category: cs.CL
TL;DR: Table-R1: A three-stage RL framework for multimodal table understanding that addresses initialization bottlenecks and reward sparsity through warm-up, perception alignment, and hint-completion stages.
Details
Motivation: Existing table understanding methods struggle with complex table structures and logical reasoning. Supervised finetuning dominates but RL approaches like GRPO face challenges with low initial policy accuracy and coarse rewards in tabular contexts.
Method: Three-stage RL framework: (1) Warm-up stage prompts initial perception and reasoning capabilities, (2) Perception Alignment GRPO (PA-GRPO) uses continuous Tree-Edit-Distance Similarity rewards for table structure/content recognition, (3) Hint-Completion GRPO (HC-GRPO) employs fine-grained rewards of residual steps based on hint-guided questions.
Result: Table-R1 significantly boosts table reasoning performance on both held-in and held-out datasets, outperforming SFT and GRPO. Qwen2-VL-7B with Table-R1 surpasses larger models like Table-LLaVA 13B and achieves comparable performance to GPT-4o on held-in datasets.
Conclusion: Table-R1 effectively overcomes initialization bottlenecks and reward sparsity in RL for table understanding, advancing robust multimodal table understanding through its three-stage approach.
Abstract: Existing table understanding methods face challenges due to complex table structures and intricate logical reasoning. While supervised finetuning (SFT) dominates existing research, reinforcement learning (RL), such as Group Relative Policy Optimization (GRPO), has shown promise but struggled with low initial policy accuracy and coarse rewards in tabular contexts. In this paper, we introduce Table-R1, a three-stage RL framework that enhances multimodal table understanding through: (1) Warm-up that prompts initial perception and reasoning capabilities, (2) Perception Alignment GRPO (PA-GRPO), which employs continuous Tree-Edit-Distance Similarity (TEDS) rewards for recognizing table structures and contents, and (3) Hint-Completion GRPO (HC-GRPO), which utilizes fine-grained rewards of residual steps based on the hint-guided question. Extensive experiments demonstrate that Table-R1 markedly boosts table reasoning performance on both held-in and held-out datasets, substantially outperforming SFT and GRPO. Notably, Qwen2-VL-7B with Table-R1 surpasses larger specialized table understanding models (e.g., Table-LLaVA 13B), even achieving comparable performance to the closed-source model GPT-4o on held-in datasets, demonstrating the efficacy of each stage of Table-R1 in overcoming initialization bottlenecks and reward sparsity, thereby advancing robust multimodal table understanding.
[60] GeoResponder: Towards Building Geospatial LLMs for Time-Critical Disaster Response
Ahmed El Fekih Zguir, Ferda Ofli, Muhammad Imran
Main category: cs.CL
TL;DR: GeoResponder is a framework that teaches LLMs geospatial reasoning for disaster response through scaffolded instruction-tuning, enabling them to understand road networks, coordinates, and infrastructure locations.
Details
Motivation: Current LLMs lack geospatial capabilities needed for time-critical disaster response, where reasoning about road networks, coordinates, and access to essential infrastructure (hospitals, shelters, pharmacies) is vital.
Method: Introduces GeoResponder framework with scaffolded instruction-tuning curriculum that stratifies geospatial learning into cognitive layers, anchors semantic knowledge to coordinate manifold, and enforces internalization of spatial axioms.
Result: Extensive evaluations across four topologically distinct cities and diverse tasks show GeoResponder significantly outperforms both state-of-the-art foundation models and domain-specific baselines.
Conclusion: LLMs can begin to internalize and generalize geospatial structures, pointing toward future development of language models capable of supporting disaster response needs.
Abstract: LLMs excel at linguistic tasks but lack the inner geospatial capabilities needed for time-critical disaster response, where reasoning about road networks, coordinates, and access to essential infrastructure such as hospitals, shelters, and pharmacies is vital. We introduce GeoResponder, a framework that instills robust spatial reasoning through a scaffolded instruction-tuning curriculum. By stratifying geospatial learning into different cognitive layers, we anchor semantic knowledge to the continuous coordinate manifold and enforce the internalization of spatial axioms. Extensive evaluations across four topologically distinct cities and diverse tasks demonstrate that GeoResponder significantly outperforms both state-of-the-art foundation models and domain-specific baselines. These results suggest that LLMs can begin to internalize and generalize geospatial structures, pointing toward the future development of language models capable of supporting disaster response needs.
[61] DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models
Zherui Li, Zheng Nie, Zhenhong Zhou, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, Yufei Guo, Jiaheng Zhang
Main category: cs.CL
TL;DR: DiffuGuard is a training-free defense framework that addresses security vulnerabilities in Diffusion Large Language Models (dLLMs) against jailbreak attacks by using stochastic annealing remasking and block-level audit/repair mechanisms.
Details
Motivation: Diffusion LLMs have unique vulnerabilities distinct from autoregressive LLMs due to their iterative and parallel generation mechanisms, creating new attack surfaces for jailbreak attacks that need to be addressed.
Method: Proposes DiffuGuard with two components: 1) Stochastic Annealing Remasking - introduces controlled randomness to mitigate greedy selection bias, and 2) Block-level Audit and Repair - uses internal model representations for risk detection and guided correction.
Result: Reduces Attack Success Rate from 47.9% to 14.7% across four dLLMs against six jailbreak methods while preserving model utility and efficiency.
Conclusion: dLLMs have significant intrinsic safety potential that can be unlocked through proper defense mechanisms like DiffuGuard, which addresses vulnerabilities in both intra-step and inter-step dynamics.
Abstract: The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard’s exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: https://github.com/niez233/DiffuGuard.
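The stochastic annealing remasking idea can be illustrated in a few lines: rather than always remasking the lowest-confidence tokens, sample remasking positions with a temperature that decays across denoising steps. The schedule and function names below are assumptions for illustration, not DiffuGuard's exact recipe.

```python
# Sketch of stochastic annealing remasking: early denoising steps remask
# more randomly, late steps become near-greedy (low confidence -> remask).
import numpy as np

def remask_indices(confidences, k, step, total_steps, t0=1.0, rng=None):
    rng = rng or np.random.default_rng()
    temp = t0 * (1.0 - step / total_steps) + 1e-6   # anneal toward zero
    logits = -np.asarray(confidences) / temp        # low confidence -> high prob
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(confidences), size=k, replace=False, p=probs)

conf = [0.9, 0.2, 0.6, 0.3, 0.8]
print(remask_indices(conf, k=2, step=1, total_steps=10))
```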
[62] Family Matters: Language Transfer and Merging for Adapting Small LLMs to Faroese
Jenny Kunz, Iben Nyholm Debess, Annika Simonsen
Main category: cs.CL
TL;DR: Adapting small English-pretrained language models to Faroese using transfer learning from related Scandinavian languages, comparing full fine-tuning vs LoRA, with new evaluation benchmarks for this low-resource language.
Details
Motivation: Faroese is a low-resource North Germanic language lacking adequate language models and evaluation resources. The paper aims to develop effective adaptation strategies for small, efficient models to serve this language community.
Method: Start with English-pretrained models, apply continued pre-training on related Scandinavian languages (individually or combined via model merging), then fine-tune on Faroese. Compare full fine-tuning with parameter-efficient LoRA adaptation. Create two new minimal-pair probing benchmarks for linguistic acceptability and text comprehension, complemented by human evaluations by native Faroese linguists.
Result: Transfer from related languages is essential but task-dependent: Icelandic improves linguistic accuracy, while Danish boosts reading comprehension. LoRA yields stronger linguistic acceptability and marginally higher human evaluation scores, whereas full fine-tuning produces better comprehension performance and more robust downstream fine-tuning. Merging multiple related languages under full fine-tuning improves general language modeling.
Conclusion: Optimal adaptation strategy for low-resource languages depends on target task and available related languages. Different source languages and adaptation methods excel at different aspects of language understanding, suggesting task-aware adaptation approaches are needed.
Abstract: We investigate strategies for adapting small, efficient language models to Faroese, a low-resource North Germanic language. Starting from English-pretrained models, we apply continued pre-training on related Scandinavian languages – individually or combined via model merging – before fine-tuning on Faroese. We compare full fine-tuning with parameter-efficient adaptation via LoRA, assessing their effects on general language modeling performance, linguistic accuracy, and text comprehension. To address the lack of existing Faroese evaluation resources, we construct two new minimal-pair probing benchmarks, one for linguistic acceptability and one for text comprehension, and complement them with human evaluations conducted by native Faroese linguists. Our results show that transfer from related languages is essential, but the optimal source language is task-dependent: Icelandic improves linguistic accuracy, while Danish boosts reading comprehension. The choice of adaptation method likewise depends on the target task: LoRA yields stronger linguistic acceptability and marginally higher human evaluation scores, whereas full fine-tuning produces better comprehension performance and more robust downstream fine-tuning. Merging multiple related languages under full fine-tuning (but not LoRA) improves general language modeling, though its benefits in the linguistic acceptability and comprehension probes are less consistent.
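As a concrete reference point for the merging step, the sketch below averages the weights of two continued-pretraining checkpoints, one common merging baseline; the checkpoint paths are hypothetical and the paper's exact merging scheme may differ.

```python
# A minimal sketch of uniform weight averaging across checkpoints.
import torch

def average_state_dicts(paths, weights=None):
    dicts = [torch.load(p, map_location="cpu") for p in paths]
    weights = weights or [1.0 / len(dicts)] * len(dicts)
    merged = {}
    for key in dicts[0]:
        # cast to float so integer buffers don't truncate during averaging
        merged[key] = sum(w * d[key].float() for w, d in zip(weights, dicts))
    return merged

merged = average_state_dicts(["icelandic_cpt.pt", "danish_cpt.pt"])  # hypothetical paths
# torch.save(merged, "merged_scandinavian.pt")  # then fine-tune on Faroese
```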
[63] CQA-Eval: Designing Reliable Evaluations of Multi-paragraph Clinical QA under Resource Constraints
Federica Bologna, Tiffany Pan, Matthew Wilkens, Yue Guo, Lucy Lu Wang
Main category: cs.CL
TL;DR: Clinical QA evaluation framework comparing coarse vs fine-grained annotation methods for medical question answering systems, focusing on correctness, relevance, and risk disclosure dimensions.
Details
Motivation: Evaluating multi-paragraph clinical QA systems is resource-intensive and challenging due to medical expertise requirements and difficulty achieving consistent human judgments over complex medical text.
Method: Introduced an evaluation framework with recommendations for limited-resource, high-expertise settings. Used physician annotations of 300 real patient questions answered by physicians and LLMs, comparing coarse answer-level vs fine-grained sentence-level evaluation across correctness, relevance, and risk disclosure dimensions.
Result: Inter-annotator agreement varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and judgments on risk disclosure remain inconsistent. Annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.
Conclusion: The framework provides practical evaluation recommendations for clinical QA systems, showing that dimension-specific annotation strategies and selective sentence annotation can improve reliability while reducing evaluation costs in medical settings.
Abstract: Evaluating multi-paragraph clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise, and achieving consistent human judgments over multi-paragraph text is difficult. We introduce CQA-Eval, an evaluation framework and set of evaluation recommendations for limited-resource and high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and risk disclosure. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and judgments on risk disclosure remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.
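For a concrete view of the agreement analysis, per-dimension inter-annotator agreement can be computed with Cohen's kappa from scikit-learn; the label vectors below are made-up examples, not CQA-Eval data.

```python
# Sketch of measuring inter-annotator agreement with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1]   # e.g., sentence-level "correct?" labels
annotator_b = [1, 0, 0, 1, 0, 1]
print(cohen_kappa_score(annotator_a, annotator_b))
```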
[64] FactAppeal: Identifying Epistemic Factual Appeals in News Media
Guy Mor-Lan, Tamir Sheafer, Shaul R. Shenhav
Main category: cs.CL
TL;DR: FactAppeal: A dataset and task for identifying epistemic appeals and evidentiary basis in factual claims, with span-level annotations of sources and evidence types.
Details
Motivation: To understand how factual claims are made credible through external sources and evidence, moving beyond simple claim detection to analyze the nuanced epistemic structures supporting claims.
Method: Created FactAppeal dataset with 3,226 manually annotated English news sentences, featuring span-level annotations of factual statements and source mentions, plus fine-grained characteristics like source types, naming, roles, credentials, and attribution methods.
Result: Best performing model (Gemma 2 9B) achieves macro-F1 score of 0.73 on the epistemic appeal identification task across various encoder and decoder models in 2B-9B parameter range.
Conclusion: The paper introduces a novel NLP task and dataset for analyzing how factual claims are anchored to evidence, enabling deeper understanding of epistemic structures in information credibility.
Abstract: How is a factual claim made credible? We propose the novel task of Epistemic Appeal Identification, which identifies whether and how factual statements have been anchored by external sources or evidence. To advance research on this task, we present FactAppeal, a manually annotated dataset of 3,226 English-language news sentences. Unlike prior resources that focus solely on claim detection and verification, FactAppeal captures the nuanced epistemic structures and evidentiary bases that underlie and support these claims. FactAppeal contains span-level annotations which identify factual statements and mentions of sources on which they rely. Moreover, the annotations include fine-grained characteristics of factual appeals such as the type of source (e.g., Active Participant, Witness, Expert, Direct Evidence), whether it is mentioned by name, mentions of the source’s role and epistemic credentials, attribution to the source via direct or indirect quotation, and other features. We model the task with a range of encoder models and generative decoder models in the 2B-9B parameter range. Our best performing model, based on Gemma 2 9B, achieves a macro-F1 score of 0.73.
[65] CNSocialDepress: A Chinese Social Media Dataset for Depression Risk Detection and Structured Analysis
Jinyuan Xu, Tian Lan, Xintao Yu, Xue He, Hezhi Zhang, Ying Wang, Pierre Magistry, Mathieu Valette, Lei Li
Main category: cs.CL
TL;DR: CNSocialDepress: A Chinese social media benchmark dataset for depression risk detection with 44,178 posts from 233 users, featuring binary risk labels and multidimensional psychological attributes for interpretable analysis.
Details
Motivation: Addressing the scarcity of publicly available Chinese-language resources for depression risk detection, which currently focus mainly on binary classification, limiting interpretable and fine-grained analysis of depressive signals.
Method: Created CNSocialDepress dataset containing 44,178 social media posts from 233 users, with psychological experts annotating 10,306 depression-related segments. The dataset provides binary risk labels along with structured, multidimensional psychological attributes.
Result: Experimental results demonstrate the dataset’s utility across NLP tasks including structured psychological profiling and fine-tuning large language models for depression detection. Comprehensive evaluations highlight effectiveness for depression risk identification and psychological analysis.
Conclusion: CNSocialDepress provides valuable resources for mental health applications tailored to Chinese-speaking populations, enabling interpretable and fine-grained analyses of depressive signals in social media content.
Abstract: Depression is a pressing global public health issue, yet publicly available Chinese-language resources for depression risk detection remain scarce and largely focus on binary classification. To address this limitation, we release CNSocialDepress, a benchmark dataset for depression risk detection on Chinese social media. The dataset contains 44,178 posts from 233 users; psychological experts annotated 10,306 depression-related segments. CNSocialDepress provides binary risk labels along with structured, multidimensional psychological attributes, enabling interpretable and fine-grained analyses of depressive signals. Experimental results demonstrate the dataset’s utility across a range of NLP tasks, including structured psychological profiling and fine-tuning large language models for depression detection. Comprehensive evaluations highlight the dataset’s effectiveness and practical value for depression risk identification and psychological analysis, thereby providing insights for mental health applications tailored to Chinese-speaking populations.
[66] EQ-Negotiator: Dynamic Emotional Personas Empower Small Language Models for Edge-Deployable Credit Negotiation
Yunbo Long, Yuhan Liu, Alexandra Brintrup
Main category: cs.CL
TL;DR: EQ-Negotiator enables small language models to match large models in emotional negotiation by combining game theory with Hidden Markov Models for real-time emotional state tracking.
Details
Motivation: LLMs are computationally expensive and privacy-invasive for on-device negotiation applications, while SLMs lack emotional intelligence for complex persona-based negotiations like credit recovery.
Method: A reasoning system integrating game theory with Hidden Markov Models to learn and track debtor emotional states online without pre-training, enabling SLMs to counter manipulation while maintaining ethics.
Result: A 7B parameter model with EQ-Negotiator outperforms baseline LLMs 10x larger in debt recovery and negotiation efficiency across diverse adversarial scenarios.
Conclusion: Strategic emotional intelligence, not model scale, is critical for automated negotiation; EQ-Negotiator enables privacy-preserving, ethical AI negotiators for edge applications.
Abstract: The deployment of large language models (LLMs) in automated negotiation has set a high performance benchmark, but their computational cost and data privacy requirements render them unsuitable for many privacy-sensitive, on-device applications such as mobile assistants, embodied AI agents or private client interactions. While small language models (SLMs) offer a practical alternative, they suffer from a significant performance gap compared to LLMs in playing emotionally charged complex personas, especially for credit negotiation. This paper introduces EQ-Negotiator, a novel framework that bridges this capability gap using emotional personas. Its core is a reasoning system that integrates game theory with a Hidden Markov Model (HMM) to learn and track debtor emotional states online, without pre-training. This allows EQ-Negotiator to equip SLMs with the strategic intelligence to counter manipulation while de-escalating conflict and upholding ethical standards. Through extensive agent-to-agent simulations across diverse credit negotiation scenarios, including adversarial debtor strategies like cheating, threatening, and playing the victim, we show that a 7B parameter language model with EQ-Negotiator achieves better debt recovery and negotiation efficiency than baseline LLMs more than 10 times its size. This work advances persona modeling from descriptive character profiles to dynamic emotional architectures that operate within privacy constraints. Moreover, this paper establishes that strategic emotional intelligence, not raw model scale, is the critical factor for success in automated negotiation, paving the way for effective, ethical, and privacy-preserving AI negotiators that can operate on the edge.
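The HMM component can be made concrete with a standard forward-algorithm belief update over emotional states; the states, transition matrix, and observation likelihoods below are illustrative stand-ins, not EQ-Negotiator's learned parameters.

```python
# A minimal sketch of online emotional-state tracking via an HMM forward update.
import numpy as np

STATES = ["calm", "frustrated", "hostile"]
T = np.array([[0.7, 0.2, 0.1],      # illustrative transition probabilities
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

def forward_update(belief, emission_likelihood):
    """One forward step: predict through transitions, then correct by evidence."""
    predicted = T.T @ belief
    posterior = predicted * emission_likelihood
    return posterior / posterior.sum()

belief = np.array([1/3, 1/3, 1/3])
# Likelihood of the observed utterance under each state, e.g., from a
# lightweight classifier applied to the debtor's latest message.
belief = forward_update(belief, np.array([0.1, 0.5, 0.4]))
print(dict(zip(STATES, belief.round(3))))
```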
[67] A cross-species neural foundation model for end-to-end speech decoding
Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski
Main category: cs.CL
TL;DR: End-to-end Brain-to-Text framework using cross-species pretrained neural encoder with audio LLMs for speech decoding from neural activity, achieving state-of-the-art performance.
Details
Motivation: Current speech BCIs use cascaded frameworks that decode phonemes before assembling sentences, preventing joint optimization. The authors aim to create an end-to-end differentiable framework for better performance and seamless optimization.
Method: BIT framework uses cross-task, cross-species pretrained neural encoder that transfers to attempted and imagined speech. Integrated end-to-end with audio large language models (LLMs) using contrastive learning for cross-modal alignment.
Result: Achieves new SOTA on Brain-to-Text ‘24/‘25 benchmarks. Reduces WER from 24.69% to 10.22% compared to prior end-to-end method. Small-scale audio LLMs significantly improve end-to-end decoding. Enables cross-task generalization between attempted and imagined speech.
Conclusion: BIT advances integration of large neural datasets and enables end-to-end differentiable optimization for speech BCIs, with potential for improved communication restoration for people with paralysis.
Abstract: Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text ‘24 and ‘25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
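The contrastive alignment between neural-activity embeddings and audio-LLM embeddings is typically an InfoNCE-style objective; the sketch below shows one symmetric formulation with illustrative dimensions and temperature, not necessarily BIT's exact loss.

```python
# A minimal InfoNCE-style cross-modal alignment loss sketch.
import torch
import torch.nn.functional as F

def info_nce(neural_emb, audio_emb, temperature=0.07):
    n = F.normalize(neural_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = n @ a.T / temperature           # (B, B) similarity matrix
    targets = torch.arange(len(n))           # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```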
[68] SWAA: Sliding Window Attention Adaptation for Efficient and Quality Preserving Long Context Processing
Yijiong Yu, Jiale Liu, Qingyun Wu, Huazheng Wang, Ji Pei
Main category: cs.CL
TL;DR: SWAA adapts full attention models to sliding window attention for efficient long-context inference without costly pretraining
Details
Motivation: Transformer self-attention has quadratic complexity making long-context inference expensive. Sliding window attention offers linear complexity but suffers from catastrophic performance collapse due to training-inference mismatch and structural inability to access distant information.
Method: Proposes SWAA toolkit with four strategies: (1) Full Attention Decode, (2) Interleaving FA and SWA layers, (3) preserving “sink” tokens, and (4) lightweight fine-tuning. These address structural defects and training-inference mismatch.
Result: Specific synergistic combinations effectively recover long-context performance. Achieves 30% to 100% speedups for long-context inference with acceptable quality retention. Performance-efficiency trade-off analysis identifies optimal configurations.
Conclusion: SWAA enables efficient adaptation of full attention models to sliding window attention without costly pretraining, offering practical speedups for long-context inference while maintaining acceptable quality.
Abstract: The quadratic complexity of self-attention in Transformer-based LLMs renders long-context inference prohibitively expensive. While Sliding Window Attention (SWA), the simplest sparse attention pattern, offers a linear-complexity alternative, it suffers from catastrophic long-context performance collapse, which stems from two fundamental factors: the training-inference mismatch when naively applying SWA to models pretrained with Full Attention (FA), and the inherent structural inability to access distant information when applying SWA to every module at all times. To address these dual challenges, we propose Sliding Window Attention Adaptation (SWAA), a plug-and-play toolkit of recipes that adapts FA models to SWA without costly pretraining. SWAA systematically combines four core strategies to tackle these distinct issues: (1) Full Attention (FA) Decode and (2) Interleaving FA and SWA layers, which mitigate structural defects by selectively allowing access to distant information; alongside (3) preserving "sink" tokens and (4) lightweight fine-tuning, which mitigate the training-inference mismatch. Our experiments reveal that while isolated strategies are insufficient, specific synergistic combinations effectively recover long-context performance. Despite varying computational overheads, our performance-efficiency trade-off analysis identifies optimal SWAA configurations for diverse scenarios, achieving 30% to 100% speedups for long-context inference with acceptable quality retention. Our code, data and model weights are available at https://github.com/yuyijiong/sliding-window-attention-adaptation
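The mask logic behind sliding windows with preserved sink tokens is easy to sketch: a position may attend causally to tokens that are either within the window or among the sinks. Window and sink sizes below are illustrative.

```python
# A minimal causal sliding-window-plus-sinks attention mask.
import torch

def swa_sink_mask(seq_len, window, n_sink):
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    causal = j <= i
    in_window = (i - j) < window
    is_sink = j < n_sink
    return causal & (in_window | is_sink)    # True = attention allowed

mask = swa_sink_mask(seq_len=10, window=4, n_sink=2)
print(mask.int())
```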
[69] From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark
Jinning Zhang, Jie Song, Wenhui Tu, Zecheng Li, Jingxuan Li, Jin Li, Xuan Liu, Taole Sha, Zichen Wei, Yan Li
Main category: cs.CL
TL;DR: SR-RAG: An evidence-based medicine adapted GraphRAG framework that integrates PICO framework into knowledge graph construction and retrieval with Bayesian Evidence Tier Reranking for medical QA.
Details
Motivation: Current medical RAG approaches overlook evidence-based medicine principles, lacking PICO alignment between queries and retrieved evidence, and missing evidence hierarchy considerations during reranking.
Method: Proposes SR-RAG framework integrating PICO framework into knowledge graph construction and retrieval, with Bayesian Evidence Tier Reranking (BETR) to calibrate ranking scores by evidence grade without predefined weights.
Result: Achieves 0.812 evidence recall@10, 0.830 nugget coverage, 0.819 answer faithfulness, 0.882 semantic similarity, and 0.788 PICOT match accuracy, outperforming five baselines. Expert clinicians rated 4.66-4.84/5.
Conclusion: SR-RAG effectively addresses EBM gaps in medical RAG systems through PICO-aligned knowledge graphs and evidence-aware reranking, validated in sports rehabilitation domain with strong performance metrics.
Abstract: Current medical retrieval-augmented generation (RAG) approaches overlook evidence-based medicine (EBM) principles, leading to two key gaps: (1) the lack of PICO alignment between queries and retrieved evidence, and (2) the absence of evidence hierarchy considerations during reranking. We present SR-RAG, an EBM-adapted GraphRAG framework that integrates the PICO framework into knowledge graph construction and retrieval, and proposes Bayesian Evidence Tier Reranking (BETR) to calibrate ranking scores by evidence grade without predefined weights. Validated in sports rehabilitation, we release a knowledge graph (357,844 nodes, 371,226 edges) and a benchmark of 1,637 QA pairs. SR-RAG achieves 0.812 evidence recall@10, 0.830 nugget coverage, 0.819 answer faithfulness, 0.882 semantic similarity, and 0.788 PICOT match accuracy, substantially outperforming five baselines. Five expert clinicians rated the system 4.66–4.84 on a 5-point Likert scale, and system rankings are preserved on a human-verified gold subset (n=80).
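Evidence-aware reranking in the spirit of BETR can be gestured at with a naive Bayes-style combination of the retrieval score and a prior over evidence grades. Note the tier priors below are made-up constants purely for illustration; the paper states that BETR calibrates by evidence grade without predefined weights.

```python
# A rough sketch of evidence-tier-aware reranking (illustrative priors).
import math

TIER_PRIOR = {"systematic_review": 0.35, "rct": 0.30,
              "cohort": 0.20, "expert_opinion": 0.15}

def tier_aware_score(retrieval_score, tier):
    # log-posterior up to a constant: log p(score | doc) + log p(tier)
    return math.log(max(retrieval_score, 1e-9)) + math.log(TIER_PRIOR[tier])

docs = [("d1", 0.82, "expert_opinion"), ("d2", 0.74, "rct")]
ranked = sorted(docs, key=lambda d: tier_aware_score(d[1], d[2]), reverse=True)
print([d[0] for d in ranked])   # d2 outranks d1 despite a lower raw score
```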
[70] A Geolocation-Aware Multimodal Approach for Ecological Prediction
Valerie Zermatten, Chiara Vanalli, Gencer Sumbul, Diego Marcos, Devis Tuia
Main category: cs.CL
TL;DR: GAMMA is a transformer-based multimodal fusion approach that integrates heterogeneous ecological data (remote sensing, biodiversity observations, text) using explicit spatial context for environmental monitoring.
Details
Motivation: Current approaches struggle to combine heterogeneous ecological data sources with different formats (continuous gridded remote sensing vs sparse irregular point observations), limiting environmental monitoring capabilities.
Method: GAMMA uses location-aware embeddings to preserve spatial relationships, dynamically selects relevant neighbors across modalities and spatial scales, and employs transformer-based fusion to integrate continuous remote sensing with sparse observations.
Result: Multimodal fusion consistently improves prediction of 103 environmental variables over single-modality baselines, with explicit spatial context further enhancing accuracy. The architecture allows analysis of each modality’s contribution.
Conclusion: Location-aware multimodal learning shows strong potential for integrating heterogeneous ecological data and supporting large-scale environmental mapping and biodiversity monitoring tasks.
Abstract: While integrating multiple modalities has the potential to improve environmental monitoring, current approaches struggle to combine data sources with heterogeneous formats or contents. A central difficulty arises when combining continuous gridded data (e.g., remote sensing) with sparse and irregular point observations such as species records. Existing geostatistical and deep-learning-based approaches typically operate on a single modality or focus on spatially aligned inputs, and thus cannot seamlessly overcome this difficulty. We propose a Geolocation-Aware MultiModal Approach (GAMMA), a transformer-based fusion approach designed to integrate heterogeneous ecological data using explicit spatial context. Instead of interpolating observations into a common grid, GAMMA first represents all inputs as location-aware embeddings that preserve spatial relationships between samples. GAMMA dynamically selects relevant neighbours across modalities and spatial scales, enabling the model to jointly exploit continuous remote sensing imagery and sparse geolocated observations. We evaluate GAMMA on the task of predicting 103 environmental variables from the SWECO25 data cube across Switzerland. Inputs combine aerial imagery with biodiversity observations from GBIF and textual habitat descriptions from Wikipedia, provided by the EcoWikiRS dataset. Experiments show that multimodal fusion consistently improves prediction performance over single-modality baselines and that explicit spatial context further enhances model accuracy. The flexible architecture of GAMMA also makes it possible to analyse the contribution of each modality through controlled ablation experiments. These results demonstrate the potential of location-aware multimodal learning for integrating heterogeneous ecological data and for supporting large-scale environmental mapping tasks and biodiversity monitoring.
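One common way to realize location-aware embeddings is a sinusoidal encoding of coordinates at multiple frequencies, sketched below; GAMMA's exact embedding is not specified here, so treat this purely as an illustration.

```python
# A minimal sinusoidal geolocation embedding sketch for (lon, lat) inputs.
import numpy as np

def geo_embedding(lon, lat, dim=16):
    freqs = 2.0 ** np.arange(dim // 4)          # geometric frequency ladder
    feats = []
    for coord in (np.radians(lon), np.radians(lat)):
        feats += [np.sin(freqs * coord), np.cos(freqs * coord)]
    return np.concatenate(feats)                 # shape (dim,)

print(geo_embedding(8.54, 47.37).shape)          # e.g., a point in Switzerland
```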
[71] SciCoQA: Quality Assurance for Scientific Paper–Code Alignment
Tim Baumgärtner, Iryna Gurevych
Main category: cs.CL
TL;DR: SciCoQA is a dataset for detecting discrepancies between scientific papers and their code implementations, created from GitHub issues and reproducibility papers with synthetic data generation to scale coverage across AI, Physics, and other computational sciences.
Details
Motivation: To address the problem of mismatches between scientific publications and their code implementations, which can lead to reproducibility issues and inaccurate scientific claims. There's a need for systematic detection of paper-code discrepancies.
Method: Constructed dataset from real GitHub issues and reproducibility papers, then scaled using synthetic data generation. Analyzed discrepancies to create types and categories. Evaluated 22 LLMs on the dataset to assess detection capabilities.
Result: Created dataset of 635 paper-code discrepancies (92 real, 543 synthetic) covering AI, Physics, Quantitative Biology, and computational sciences. LLM evaluation showed difficulty, with best models (Gemini 3.1 Pro and GPT-5 Mini) detecting only 46.7% of real-world discrepancies.
Conclusion: Paper-code discrepancy detection is challenging for current LLMs, especially with omitted details, long contexts, and out-of-domain data. The SciCoQA dataset enables research into improving scientific reproducibility through better paper-code alignment.
Abstract: We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases to ensure faithful implementations. We construct SciCoQA from GitHub issues and reproducibility papers, and to scale our dataset, we propose a synthetic data generation method for constructing paper-code discrepancies. We analyze the paper-code discrepancies in detail and propose discrepancy types and categories to better understand the occurring mismatches. In total, our dataset consists of 635 paper-code discrepancies (92 real, 543 synthetic), covering the AI domain from real-world data and extending to Physics, Quantitative Biology, and other computational sciences through synthetic data. Our evaluation of 22 LLMs demonstrates the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models’ pre-training corpus. The best-performing models in our evaluation, Gemini 3.1 Pro and GPT-5 Mini, detect only 46.7% of real-world paper-code discrepancies.
[72] ViGoEmotions: A Benchmark Dataset For Fine-grained Emotion Detection on Vietnamese Texts
Hung Quang Tran, Nam Tien Pham, Son T. Luu, Kiet Van Nguyen
Main category: cs.CL
TL;DR: Vietnamese emotion classification dataset (ViGoEmotions) with 20,664 social media comments across 27 emotions, evaluated with 8 Transformer models under 3 preprocessing strategies for emoji handling.
Details
Motivation: Need for Vietnamese emotion classification resources for emotion prediction and harmful content detection, leveraging recent NLP advancements and addressing language-specific challenges.
Method: Created ViGoEmotions corpus, evaluated 8 pre-trained Transformer models under three preprocessing strategies: preserving original emojis with rule-based normalization, converting emojis to text descriptions, and applying ViSoLex lexical normalization.
Result: Converting emojis to text improved performance for several BERT-based models, while preserving emojis worked best for ViSoBERT and CafeBERT. ViSoBERT achieved highest Macro F1-score of 61.50% and Weighted F1-score of 63.26%.
Conclusion: The corpus supports diverse architectures effectively, but preprocessing strategies and annotation quality significantly influence downstream performance in Vietnamese emotion classification.
Abstract: Emotion classification plays a significant role in emotion prediction and harmful content detection. Recent advancements in NLP, particularly through large language models (LLMs), have greatly improved outcomes in this field. This study introduces ViGoEmotions – a Vietnamese emotion corpus comprising 20,664 social media comments in which each comment is classified into 27 distinct, fine-grained emotions. To evaluate the quality of the dataset and its impact on emotion classification, eight pre-trained Transformer-based models were evaluated under three preprocessing strategies: preserving original emojis with rule-based normalization, converting emojis into textual descriptions, and applying ViSoLex, a model-based lexical normalization system. Results show that converting emojis into text often improves the performance of several BERT-based baselines, while preserving emojis yields the best results for ViSoBERT and CafeBERT. In contrast, removing emojis generally leads to lower performance. ViSoBERT achieved the highest Macro F1-score of 61.50% and Weighted F1-score of 63.26%. Strong performance was also observed from CafeBERT and PhoBERT. These findings highlight that while the proposed corpus can support diverse architectures effectively, preprocessing strategies and annotation quality remain key factors influencing downstream performance.
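The emoji-to-text preprocessing strategy is straightforward to reproduce with the emoji package, whose demojize function replaces emojis with textual descriptions; the example post is invented.

```python
# Sketch of the "convert emojis to text" preprocessing strategy.
import emoji

post = "hôm nay vui quá 😂❤️"
print(emoji.demojize(post, delimiters=(" ", " ")))
# roughly: "hôm nay vui quá  face_with_tears_of_joy  red_heart "
```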
[73] TurkicNLP: An NLP Toolkit for Turkic Languages
Sherzod Hakimov
Main category: cs.CL
TL;DR: TurkicNLP is an open-source Python library providing unified NLP pipelines for Turkic languages across multiple scripts, with modular architecture integrating rule-based and neural approaches.
Details
Motivation: Turkic languages (spoken by 200M+ people) lack unified NLP tooling and resources, with fragmentation across different script families (Latin, Cyrillic, Perso-Arabic, Old Turkic Runic).
Method: Developed a modular multi-backend architecture with language-agnostic API that integrates rule-based finite-state transducers and neural models, featuring automatic script detection and routing between script variants.
Result: Created TurkicNLP library covering tokenization, morphological analysis, POS tagging, dependency parsing, NER, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation with CoNLL-U standard outputs.
Conclusion: TurkicNLP provides the first unified NLP framework for Turkic languages, addressing fragmentation and enabling consistent processing across script families through open-source tooling.
Abstract: Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp .
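Automatic script detection of the kind the library describes can be sketched with Unicode block ranges; the ranges and majority-vote heuristic below are illustrative, not TurkicNLP's actual implementation.

```python
# A minimal Unicode-range script detector for routing between script variants.
def detect_script(text):
    counts = {"latin": 0, "cyrillic": 0, "perso_arabic": 0, "old_turkic": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0400 <= cp <= 0x04FF:
            counts["cyrillic"] += 1
        elif 0x0600 <= cp <= 0x06FF:
            counts["perso_arabic"] += 1
        elif 0x10C00 <= cp <= 0x10C4F:      # Old Turkic block
            counts["old_turkic"] += 1
        elif ch.isalpha():
            counts["latin"] += 1
    return max(counts, key=counts.get)      # majority vote

print(detect_script("Qazaqstan"))    # latin
print(detect_script("Қазақстан"))   # cyrillic
```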
[74] Autoscoring Anticlimax: A Meta-analytic Understanding of AI’s Short-answer Shortcomings and Wording Weaknesses
Michael Hardy
Main category: cs.CL
TL;DR: Meta-analysis of 890 LLM short-answer scoring studies shows decoder-only architectures underperform encoders by 0.37 QWK, with task difficulty for humans not correlating with LLM performance, and reveals tokenization sensitivity and racial bias in educational contexts.
Details
Motivation: Automated short-answer scoring lags behind other LLM applications, and there's a need to understand why LLMs underperform in educational assessment tasks compared to human experts, particularly examining architectural differences and potential biases.
Method: Conducted meta-analysis of 890 results from LLM short-answer scoring studies using mixed effects metaregression modeling Quadratic Weighted Kappa (QWK) effect sizes, examining factors like architecture type, tokenizer vocabulary size, and task difficulty correlations between humans and LLMs.
Result: Decoder-only architectures underperform encoder models by 0.37 QWK; human task difficulty doesn’t predict LLM performance (some easiest human tasks were hardest for LLMs); tokenizer vocabulary shows diminishing returns; LLMs demonstrate wording/tokenization sensitivity and racial discrimination in high-stakes education contexts.
Conclusion: LLM-based educational assessment needs systems design anticipating autoregressive model shortcomings, with evidence of architectural performance gaps, lack of correlation with human difficulty metrics, and concerning biases requiring careful implementation in high-stakes contexts.
Abstract: Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect size of Quadratic Weighted Kappa (QWK) with mixed-effects metaregression. We quantitatively illustrate that the level of difficulty for human experts to perform the task of scoring written work of children has no observed statistical effect on LLM performance. In particular, we show that some scoring tasks measured as the easiest by human scorers were the hardest for LLMs. Whether by poor implementation by thoughtful researchers or by patterns traceable to autoregressive training, on average decoder-only architectures underperform encoders by 0.37 QWK, a substantial difference in agreement with humans. Additionally, we measure the contributions of various aspects of LLM technology to successful scoring, such as tokenizer vocabulary size, which exhibits diminishing returns, potentially due to undertrained tokens. Findings argue for systems design which better anticipates known statistical shortcomings of autoregressive models. Finally, we provide additional experiments to illustrate wording and tokenization sensitivity and bias elicitation in high-stakes education contexts, where LLMs demonstrate racial discrimination. Code and data for this study are available.
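QWK, the effect size modeled throughout the meta-analysis, is available directly in scikit-learn; the human and model score vectors below are made-up examples.

```python
# Sketch of computing Quadratic Weighted Kappa between two raters.
from sklearn.metrics import cohen_kappa_score

human = [0, 1, 2, 2, 3, 1, 0, 2]
model = [0, 1, 1, 2, 3, 2, 0, 2]
print(cohen_kappa_score(human, model, weights="quadratic"))
```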
[75] Algorithmic Consequences of Particle Filters for Sentence Processing: Amplified Garden-Paths and Digging-In Effects
Amani Maina-Kilaas, Roger Levy
Main category: cs.CL
TL;DR: Particle filter models with explicit structural representations better predict reading difficulty during structural ambiguity than LLM surprisal alone, showing digging-in effects that scale with ambiguous region length.
Details
Motivation: Current surprisal theory using LLMs fails to fully capture processing difficulty when structural expectations are violated, suggesting explicit representations of structural ambiguity are needed to understand sentence processing mechanisms.
Method: The paper proposes particle filter models that explicitly represent structural hypotheses as finite particles, analyzes algorithmic consequences including garden-path effect amplification, and demonstrates how resampling produces digging-in effects where disambiguation difficulty increases with ambiguous region length.
Result: Particle filter models predict real-time digging-in effects that LLM surprisal misses, with digging-in magnitude scaling inversely with particle count, showing that fully parallel models predict no such effect.
Conclusion: Explicit representations of structural ambiguity in particle filter models better capture sentence processing difficulty patterns than surprisal-based approaches alone, providing evidence for causal involvement of ambiguity representations in language comprehension.
Abstract: Under surprisal theory, linguistic representations affect processing difficulty only through the bottleneck of surprisal. Our best estimates of surprisal come from large language models, which have no explicit representation of structural ambiguity. While LLM surprisal robustly predicts reading times across languages, it systematically underpredicts difficulty when structural expectations are violated – suggesting that representations of ambiguity are causally implicated in sentence processing. Particle filter models offer an alternative where structural hypotheses are explicitly represented as a finite set of particles. We prove several algorithmic consequences of particle filter models, including the amplification of garden-path effects. Most critically, we demonstrate that resampling, a common practice with these models, inherently produces real-time digging-in effects – where disambiguation difficulty increases with ambiguous region length. Digging-in magnitude scales inversely with particle count: fully parallel models predict no such effect.
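The resampling mechanism behind digging-in can be demonstrated with a toy particle filter over two parse hypotheses: under repeated multinomial resampling through a long ambiguous region, particles for the dispreferred parse tend to die out, so later disambiguation becomes harder. All probabilities below are illustrative.

```python
# A toy particle filter over parse hypotheses with multinomial resampling.
import numpy as np

rng = np.random.default_rng(0)
particles = np.array([0] * 6 + [1] * 4)   # 0 = preferred parse, 1 = dispreferred

def step(particles, word_lik):
    weights = word_lik[particles]          # each particle weighted by its parse
    weights = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]                  # multinomial resampling

for _ in range(8):                         # 8 ambiguous words, parse 0 slightly favored
    particles = step(particles, word_lik=np.array([0.6, 0.4]))
print((particles == 1).sum(), "particles left for the dispreferred parse")
```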
[76] UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization
Ofir Marom
Main category: cs.CL
TL;DR: UtilityMax Prompting: A framework using formal mathematical language and influence diagrams to specify LLM tasks with multiple objectives, maximizing expected utility rather than relying on ambiguous natural language prompts.
Details
Motivation: Natural language prompts for LLMs are inherently ambiguous when multiple objectives must be satisfied simultaneously, leading to subjective interpretations and suboptimal performance.
Method: Reconstruct tasks as influence diagrams where the LLM's answer is the sole decision variable, define utility functions over conditional probability distributions, and instruct LLMs to find answers that maximize expected utility.
Result: Demonstrated consistent improvements in precision and NDCG over natural language baselines on MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, Gemini 2.5 Pro) in multi-objective movie recommendation tasks.
Conclusion: Formal mathematical specification of LLM tasks through UtilityMax Prompting provides more precise optimization targets and better performance than ambiguous natural language prompts for multi-objective tasks.
Abstract: The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM’s answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.
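The expected-utility selection the prompt asks the LLM to perform can be written out explicitly; here it is computed outside the model for illustration, with hypothetical objective probabilities and utility weights.

```python
# A minimal expected-utility answer-selection sketch (hypothetical numbers).
def expected_utility(answer_scores, weights):
    # answer_scores: per-objective probabilities p(objective met | answer)
    return sum(w * p for w, p in zip(weights, answer_scores))

candidates = {
    "Movie A": [0.9, 0.4],    # [p(relevant), p(diverse)]
    "Movie B": [0.7, 0.8],
}
weights = [0.6, 0.4]          # utility trade-off between the two objectives
best = max(candidates, key=lambda a: expected_utility(candidates[a], weights))
print(best)                   # Movie B: 0.74 beats Movie A's 0.70
```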
[77] SemBench: A Universal Semantic Framework for LLM Evaluation
Mikel Zubillaga, Naiara Perez, Oscar Sainz, German Rigau
Main category: cs.CL
TL;DR: SemBench: A framework for automatically generating synthetic benchmarks to evaluate semantic understanding in LLMs using dictionary definitions and sentence encoders, enabling scalable cross-lingual evaluation.
Details
Motivation: Traditional benchmarks like Word-in-Context (WiC) for evaluating semantic understanding in LLMs are resource-intensive to create and limited to high-resource languages, creating a need for more scalable and language-independent evaluation methods.
Method: SemBench uses only dictionary sense definitions and a sentence encoder to automatically generate synthetic benchmarks, eliminating the need for curated example sentences. It was evaluated across three languages (English, Spanish, Basque) with various LLMs.
Result: SemBench rankings strongly correlate with standard WiC dataset rankings, and only a small number of examples is needed to achieve stable and meaningful rankings. The framework works across languages with different resource levels.
Conclusion: SemBench provides a lightweight, adaptable, and data-efficient framework for cross-lingual evaluation of semantic understanding in LLMs, addressing scalability and language coverage limitations of traditional benchmarks.
Abstract: Recent progress in Natural Language Processing (NLP) has been driven by the emergence of Large Language Models (LLMs), which exhibit remarkable generative and reasoning capabilities. However, despite their success, evaluating the true semantic understanding of these models remains a persistent challenge. Traditional benchmarks such as Word-in-Context (WiC) effectively probe this capability, but their creation is resource-intensive and often limited to high-resource languages. In this paper, we introduce SemBench, a framework for automatically generating synthetic benchmarks that assess the semantic competence of LLMs using only dictionary sense definitions and a sentence encoder. This approach eliminates the need for curated example sentences, making it both scalable and language-independent. We evaluate SemBench in three languages (English, Spanish, and Basque) spanning different levels of linguistic resources, and across a wide range of LLMs. Our results show that rankings derived from SemBench strongly correlate with those obtained from standard WiC datasets. Furthermore, our analysis demonstrates that only a small number of examples is required to achieve stable and meaningful rankings. Overall, SemBench provides a lightweight, adaptable, and data-efficient framework for cross-lingual evaluation of semantic understanding in LLMs.
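A rough sketch of definition-driven item construction: encode a word's dictionary sense definitions with a sentence encoder and pick the most similar distinct pair as the hardest WiC-style contrast. The encoder choice and selection rule below are assumptions, not SemBench's exact pipeline.

```python
# Sketch of sense-pair selection from definitions, assuming the
# sentence-transformers package and an illustrative encoder.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice
senses = [  # dictionary definitions for "bank"
    "a financial institution that accepts deposits and makes loans",
    "the sloping land alongside a river or lake",
    "a row or tier of similar objects, such as switches",
]
emb = encoder.encode(senses, convert_to_tensor=True)
sim = util.cos_sim(emb, emb)
sim.fill_diagonal_(-1.0)                 # ignore self-similarity
i, j = divmod(int(sim.argmax()), len(senses))
# The most similar distinct senses make the hardest contrast.
print(senses[i], "<->", senses[j])
```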
[78] ReasonScaffold: A Scaffolded Reasoning-based Annotation Protocol for Human-AI Co-Annotation
Smitha Muthya Sudheendra, Jaideep Srivastava
Main category: cs.CL
TL;DR: ReasonScaffold introduces a scaffolded reasoning annotation protocol where human annotators first label independently, then revise after viewing LLM-generated explanations (without predicted labels), studying how reasoning affects annotation behavior rather than accuracy.
Details
Motivation: Human annotation in NLP often shows substantial variability across annotators, especially for subjective tasks. While LLMs can provide structured reasoning to support annotation, their influence on human annotation behavior remains underexplored. The paper aims to understand how reasoning explanations shape annotation consistency rather than just evaluating annotation accuracy.
Method: Introduces ReasonScaffold, a scaffolded reasoning annotation protocol that exposes LLM-generated explanations while withholding predicted labels. Uses a two-pass protocol inspired by Delphi-style revision: annotators first label instances independently, then revise their decisions after viewing model-generated reasoning. Evaluates on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. Introduces Annotator Effort Proxy (AEP) metric to capture proportion of labels revised after exposure to reasoning.
Result: Exposure to reasoning is associated with increased inter-annotator agreement, along with minimal revision (low AEP scores). This suggests that reasoning helps resolve ambiguous cases without inducing widespread changes to annotations. The findings indicate reasoning explanations can improve annotation consistency.
Conclusion: Reasoning-based scaffolds provide practical mechanisms for human-AI co-annotation workflows by helping resolve ambiguous cases and increasing annotation consistency without requiring extensive revisions. The approach offers insights into how reasoning explanations shape annotation behavior.
Abstract: Human annotation is central to NLP evaluation, yet subjective tasks often exhibit substantial variability across annotators. While large language models (LLMs) can provide structured reasoning to support annotation, their influence on human annotation behavior remains underexplored. We introduce ReasonScaffold, a scaffolded reasoning annotation protocol that exposes LLM-generated explanations while withholding predicted labels. We study how reasoning affects human annotation behavior in a controlled setting, rather than evaluating annotation accuracy. Using a two-pass protocol inspired by Delphi-style revision, annotators first label instances independently and then revise their decisions after viewing model-generated reasoning. We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning. Our results show that exposure to reasoning is associated with increased agreement, along with minimal revision, suggesting that reasoning helps resolve ambiguous cases without inducing widespread changes. These findings provide insight into how reasoning explanations shape annotation consistency and highlight reasoning-based scaffolds as a practical mechanism for human–AI co-annotation workflows.
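AEP itself is simple to state in code, following the paper's definition: the fraction of labels revised between the independent first pass and the post-reasoning second pass. The labels below are invented.

```python
# Sketch of the Annotator Effort Proxy (AEP): fraction of revised labels.
def aep(pass1, pass2):
    return sum(a != b for a, b in zip(pass1, pass2)) / len(pass1)

print(aep(["pos", "neg", "pos", "neu"], ["pos", "neg", "neg", "neu"]))  # 0.25
```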
[79] Optimizing Multilingual LLMs via Federated Learning: A Study of Client Language Composition
Aleix Sant, Jordi Luque, Carlos Escolano
Main category: cs.CL
TL;DR: Federated learning framework for multilingual LLMs with client-specific early stopping mechanism to address language heterogeneity and resource disparities
Details
Motivation: Address challenges in federated learning of LLMs in multilingual environments, including heterogeneous language distributions across clients and disparities in language resource availability.
Method: Extended FederatedScope-LLM framework for multilingual instruction-tuning, introduced Local Dynamic Early Stopping (LDES-FL) mechanism allowing clients to pause/resume training based on validation performance.
Result: Monolingual fine-tuning best for single-language specialization; federated training better for balanced multilingual models. Increasing within-client multilinguality leads to stronger, fairer global models, especially benefiting lower-resource languages
Conclusion: Client language composition is a key design variable in multilingual FL that shapes performance, fairness, and efficiency
Abstract: Federated Learning (FL) of Large Language Models (LLMs) in multilingual environments presents significant challenges stemming from heterogeneous language distributions across clients and disparities in language resource availability. To address these challenges, we extended the FederatedScope-LLM framework to support multilingual instruction-tuning experiments with LLMs. We also introduced a novel client-specific early stopping mechanism, Local Dynamic Early Stopping (LDES-FL), which allows clients to pause and resume local training based on client-side validation performance, enhancing training efficiency and sustainability. Through a series of experiments, we studied how client language composition - from fully monolingual to increasingly multilingual clients - affects multilingual quality, fairness and training cost. Monolingual local fine-tuning remains the most effective for single-language specialization, whereas federated training is better suited to learning a single balanced multilingual model. In FL, increasing within-client multilinguality leads to stronger and fairer global models, narrows the gap to centralized multilingual fine-tuning, and yields the largest gains for lower-resource languages, albeit at the cost of more optimization steps. Overall, our results identify client language composition as a key design variable in multilingual FL, shaping performance, fairness and efficiency.
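A minimal sketch of client-side dynamic early stopping in the spirit of LDES-FL follows; the patience value and the pause/resume rule are assumptions, since the exact criterion is not spelled out here.

```python
# Illustrative client-side pause/resume logic driven by validation loss.
class LocalDynamicEarlyStopping:
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_rounds = 0
        self.paused = False

    def update(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_rounds = 0
            self.paused = False          # resume when validation improves again
        else:
            self.bad_rounds += 1
            if self.bad_rounds >= self.patience:
                self.paused = True       # pause local training for now
        return not self.paused           # True -> client trains this round

stopper = LocalDynamicEarlyStopping(patience=2)
for loss in [1.2, 1.0, 1.1, 1.05, 0.9]:
    print(stopper.update(loss))
```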
cs.CV
[80] Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
Shengli Zhou, Minghang Zheng, Feng Zheng, Yang Liu
Main category: cs.CV
TL;DR: QuatRoPE is a novel positional embedding method for 3D spatial reasoning that encodes object positions linearly and computes pairwise spatial relations through attention mechanisms, improving scalability while maintaining spatial consistency.
Details
Motivation: Spatial reasoning in 3D scenes is crucial for embodied AI but limited by scarce 3D scene-language data. Existing methods either struggle with extracting spatial relations from fused features or have poor scalability due to quadratic relation encoding.
Method: Proposes QuatRoPE, a positional embedding with linear input length that encodes 3D coordinates holistically and computes pairwise spatial relations via dot products in attention layers. Also introduces IGRE to limit QuatRoPE's influence to object tokens, preserving LLM's original capabilities.
Result: Extensive experiments demonstrate effectiveness. The method maintains spatial consistency and geometric integrity while being scalable to large numbers of objects.
Conclusion: QuatRoPE addresses scalability limitations in 3D spatial reasoning by providing linear encoding of object positions with explicit relation computation, enabling better integration of 3D scene understanding with LLMs.
Abstract: Spatial reasoning focuses on locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene-language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method with an input length that is linear in the number of objects, and explicitly calculates pairwise spatial relations through the dot product in attention layers. QuatRoPE’s holistic vector encoding of 3D coordinates guarantees a high degree of spatial consistency, maintaining fidelity to the scene’s geometric integrity. Additionally, we introduce the Isolated Gated RoPE Extension (IGRE), which effectively limits QuatRoPE’s influence to object-related tokens, thereby minimizing interference with the LLM’s existing positional embeddings and maintaining the LLM’s original capabilities. Extensive experiments demonstrate the effectiveness of our approaches. The code and data are available at https://github.com/oceanflowlab/QuatRoPE.
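To give a feel for how quaternions can encode 3D positions in attention inputs, the sketch below rotates each 4-channel block of a query/key vector by a unit quaternion composed from an object's coordinates, by analogy with RoPE's per-position rotations. The angle schedule and axis composition are assumptions for illustration, not QuatRoPE's construction.

```python
# Illustrative quaternion rotation of 4-channel blocks keyed to 3D position.
import numpy as np

def qmul(p, q):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def position_quaternion(xyz, scale=0.1):
    """Compose rotations about the i, j, k axes, one angle per coordinate."""
    q = np.array([1.0, 0.0, 0.0, 0.0])
    for axis, coord in enumerate(xyz):
        half = 0.5 * scale * coord
        r = np.zeros(4); r[0] = np.cos(half); r[axis + 1] = np.sin(half)
        q = qmul(q, r)
    return q

def apply_position(vec, xyz):
    """Rotate each 4-channel block of a query/key vector by the position quaternion."""
    r = position_quaternion(xyz)
    blocks = vec.reshape(-1, 4)
    return np.stack([qmul(r, b) for b in blocks]).reshape(-1)

q = apply_position(np.random.randn(16), xyz=(1.0, 2.0, 0.5))
```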
[81] MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies
Weixiang Shen, Yanzhu Hu, Che Liu, Junde Wu, Jiayuan Zhu, Chengzhi Shen, Min Xu, Yueming Jin, Benedikt Wiestler, Daniel Rueckert, Jiazhen Pan
Main category: cs.CV
TL;DR: MEDOPENCLAW is an auditable runtime for VLMs to operate dynamically in medical imaging viewers, with MEDFLOWBENCH benchmark evaluating agentic capabilities across viewer-only, tool-use, and open-method tracks in full-study medical imaging.
Details
Motivation: Current VLM evaluation in medical imaging oversimplifies clinical reality by using pre-selected 2D images, missing the core challenge of real-world diagnostics where agents must actively navigate full 3D volumes across multiple sequences/modalities to gather evidence and support decisions.
Method: Proposes MEDOPENCLAW runtime for VLMs to operate dynamically within standard medical tools/viewers (e.g., 3D Slicer), and MEDFLOWBENCH benchmark covering multi-sequence brain MRI and lung CT/PET to systematically evaluate medical agentic capabilities across three tracks: viewer-only, tool-use, and open-method.
Result: Initial results show state-of-the-art LLMs/VLMs (Gemini 3.1 Pro, GPT-5.4) can successfully navigate viewers to solve basic study-level tasks, but performance paradoxically degrades when given access to professional support tools due to lack of precise spatial grounding.
Conclusion: By bridging static-image perception and interactive clinical workflows, MEDOPENCLAW and MEDFLOWBENCH establish a reproducible foundation for developing auditable, full-study medical imaging agents.
Abstract: Currently, evaluating vision-language models (VLMs) in medical imaging tasks oversimplifies clinical reality by relying on pre-selected 2D images that demand significant manual labor to curate. This setup misses the core challenge of real-world diagnostics: a true clinical agent must actively navigate full 3D volumes across multiple sequences or modalities to gather evidence and ultimately support a final decision. To address this, we propose MEDOPENCLAW, an auditable runtime designed to let VLMs operate dynamically within standard medical tools or viewers (e.g., 3D Slicer). On top of this runtime, we introduce MEDFLOWBENCH, a full-study medical imaging benchmark covering multi-sequence brain MRI and lung CT/PET. It systematically evaluates medical agentic capabilities across viewer-only, tool-use, and open-method tracks. Initial results reveal a critical insight: while state-of-the-art LLMs/VLMs (e.g., Gemini 3.1 Pro and GPT-5.4) can successfully navigate the viewer to solve basic study-level tasks, their performance paradoxically degrades when given access to professional support tools due to a lack of precise spatial grounding. By bridging the gap between static-image perception and interactive clinical workflows, MEDOPENCLAW and MEDFLOWBENCH establish a reproducible foundation for developing auditable, full-study medical imaging agents.
[83] From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition
Francesco Gentile, Nicola Dall’Asen, Francesco Tonini, Massimiliano Mancini, Lorenzo Vaquero, Elisa Ricci
Main category: cs.CV
TL;DR: SITH is a data-free, training-free framework that analyzes CLIP’s vision transformer weights to decompose attention heads into interpretable semantic concepts, enabling precise model edits and studying adaptation mechanisms.
Details
Motivation: Current interpretability methods for vision-language models rely on activations, making them dataset-dependent, vulnerable to bias, and limited to coarse head-level explanations. There's a need for data-free, weight-space analysis that provides fine-grained semantic understanding.
Method: SITH decomposes each attention head’s value-output matrix using singular value decomposition, then applies COMP (Coherent Orthogonal Matching Pursuit) to interpret singular vectors as sparse combinations of human-interpretable concepts, all without requiring data or training.
Result: SITH produces coherent, faithful intra-head explanations validated through reconstruction fidelity and interpretability experiments. It enables precise weight-space model edits that amplify/suppress specific concepts, improving downstream performance without retraining, and reveals that fine-tuning primarily reweights a stable semantic basis rather than learning new features.
Conclusion: SITH provides a novel data-free approach to understanding vision-language model internals through weight-space analysis, offering fine-grained interpretability and enabling targeted model edits while revealing insights about model adaptation mechanisms.
Abstract: As vision-language models are deployed at scale, understanding their internal mechanisms becomes increasingly critical. Existing interpretability methods predominantly rely on activations, making them dataset-dependent, vulnerable to data bias, and often restricted to coarse head-level explanations. We introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free, training-free framework that directly analyzes CLIP’s vision transformer in weight space. For each attention head, we decompose its value-output matrix into singular vectors and interpret each one via COMP (Coherent Orthogonal Matching Pursuit), a new algorithm that explains them as sparse, semantically coherent combinations of human-interpretable concepts. We show that SITH yields coherent, faithful intra-head explanations, validated through reconstruction fidelity and interpretability experiments. This allows us to use SITH for precise, interpretable weight-space model edits that amplify or suppress specific concepts, improving downstream performance without retraining. Furthermore, we use SITH to study model adaptation, showing how fine-tuning primarily reweights a stable semantic basis rather than learning entirely new features.
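Both stages of the pipeline are easy to prototype on toy matrices. The sketch below applies SVD to a random stand-in for a head's value-output circuit and explains the top singular direction with plain greedy matching pursuit over a random concept bank; it is a simplified stand-in for the paper's COMP algorithm, and every matrix here is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_concepts = 64, 16, 8
W_v = rng.normal(size=(d_model, d_head))           # placeholder value weights
W_o = rng.normal(size=(d_head, d_model))           # placeholder output weights
concepts = rng.normal(size=(n_concepts, d_model))  # placeholder concept bank
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

# Decompose the head's value-output circuit in weight space (no data needed).
U, S, Vt = np.linalg.svd(W_v @ W_o)

def sparse_explain(direction, bank, k=3):
    """Greedy matching pursuit: explain a singular vector with few concepts."""
    residual = direction.copy()
    picked = []
    for _ in range(k):
        scores = bank @ residual
        j = int(np.argmax(np.abs(scores)))
        picked.append((j, round(float(scores[j]), 3)))
        residual = residual - scores[j] * bank[j]   # remove the explained part
    return picked

print(sparse_explain(Vt[0], concepts))  # [(concept_id, coefficient), ...]
```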
[84] Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs
Yike Wu, Necva Bolucu, Stephen Wan, Dadong Wang, Jiahao Xia, Jian Zhang
Main category: cs.CV
TL;DR: SGREC: Zero-shot referring expression comprehension using query-driven scene graphs as structured intermediaries between VLMs and LLMs for improved accuracy and interpretability.
Details
Motivation: Existing VLMs like CLIP struggle with fine-grained visual details and complex object relationships in zero-shot REC, while LLMs can't directly process visual features. Need a method that bridges low-level image regions with high-level semantic reasoning.
Method: 1) Use VLM to construct query-driven scene graph encoding spatial relationships, descriptive captions, and object interactions. 2) Use scene graph as structured intermediary to bridge visual features and semantic understanding. 3) LLM infers target object from structured textual representation and provides explanatory reasoning.
Result: Achieves top-1 accuracy on most zero-shot REC benchmarks: RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%). Demonstrates strong visual scene understanding capabilities.
Conclusion: SGREC effectively bridges the gap between low-level visual features and high-level semantic reasoning by using query-driven scene graphs as structured intermediaries, achieving state-of-the-art zero-shot REC performance with interpretable reasoning.
Abstract: Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, demanding strong visual understanding capabilities. Existing Vision-Language Models (VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and understand complex object relationships. Meanwhile, although Large Language Models (LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application in REC tasks. To overcome these limitations, we propose SGREC, an interpretable zero-shot REC method leveraging query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions relevant to the given query. By leveraging this scene graph, we bridge the gap between low-level image regions and higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph, responding with detailed explanations for its decisions that ensure interpretability in the inference process. Extensive experiments show that SGREC achieves top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%), highlighting its strong visual scene understanding.
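The scene-graph-as-intermediary step can be pictured with a toy serialization. The snippet below flattens a hypothetical query-driven scene graph into an LLM prompt; the schema, field names, and prompt wording are illustrative assumptions rather than the paper's format.

```python
# Illustrative scene-graph schema and prompt serialization (hypothetical names).
scene_graph = {
    "objects": {
        "obj1": {"box": [34, 50, 120, 200], "caption": "a man in a red jacket"},
        "obj2": {"box": [140, 60, 230, 210], "caption": "a woman holding a cup"},
    },
    "relations": [("obj1", "left of", "obj2")],
}

def graph_to_prompt(graph, query):
    """Flatten objects and spatial relations into text an LLM can reason over."""
    lines = [f"{oid}: {o['caption']} at box {o['box']}"
             for oid, o in graph["objects"].items()]
    lines += [f"{s} is {r} {o}" for s, r, o in graph["relations"]]
    return ("Scene:\n" + "\n".join(lines) +
            f"\nQuery: {query}\nAnswer with the matching object id and explain why.")

print(graph_to_prompt(scene_graph, "the person to the left of the woman"))
```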
[85] ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs
An Yu, Ting Yu Tsai, Zhenfei Zhang, Weiheng Lu, Felix X. -F. Ye, Ming-Ching Chang
Main category: cs.CV
TL;DR: ReDiPrune is a training-free token pruning method for multimodal LLMs that selects informative visual tokens before the vision-language projector to reduce computation while improving accuracy.
Details
Motivation: Multimodal LLMs are computationally expensive due to processing many visual tokens. Existing post-projection pruning methods operate on compressed representations, losing fine-grained spatial and semantic information.
Method: ReDiPrune selects tokens directly from vision encoder outputs using a lightweight rule that jointly considers text-conditioned relevance and max-min diversity. It operates before the vision-language projector, preserving rich visual features.
Result: On EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields +2.0% absolute accuracy gain while reducing computation by more than 6× in TFLOPs. Consistent improvements across four video and five image benchmarks.
Conclusion: ReDiPrune effectively improves accuracy-efficiency trade-off for multimodal LLMs by pruning visual tokens before projection, preserving fine-grained information while reducing computation.
Abstract: Recent multimodal large language models are computationally expensive because Transformers must process a large number of visual tokens. We present ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural modifications, and can be seamlessly inserted between the encoder and projector. Across four video and five image benchmarks, it consistently improves the accuracy-efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than 6× in TFLOPs. Code is available at https://github.com/UA-CVML/ReDiPrune.
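The scoring rule lends itself to a compact sketch. The code below greedily keeps tokens by a convex combination of text-conditioned cosine relevance and max-min diversity; the alpha weighting and exact formula are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def redi_select(tokens, text_emb, k, alpha=0.5):
    """Greedily keep k tokens, mixing relevance with max-min diversity."""
    tokens = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    relevance = tokens @ text_emb                  # text-conditioned score
    selected = [int(np.argmax(relevance))]         # seed: most relevant token
    min_dist = 1.0 - tokens @ tokens[selected[0]]  # cosine distance to the set
    while len(selected) < k:
        score = alpha * relevance + (1.0 - alpha) * min_dist
        score[selected] = -np.inf                  # never re-pick a token
        j = int(np.argmax(score))
        selected.append(j)
        min_dist = np.minimum(min_dist, 1.0 - tokens @ tokens[j])
    return selected

rng = np.random.default_rng(0)
vis = rng.normal(size=(576, 256))                  # e.g. 24x24 patch tokens
txt = rng.normal(size=256)
keep = redi_select(vis, txt, k=int(0.15 * 576))    # retain ~15% of tokens
print(len(keep))                                   # 86
```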
[86] SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment
Sahibzada Adil Shahzad, Ammarah Hashmi, Junichi Yamagishi, Yusuke Yasuda, Yu Tsao, Chia-Wen Lin, Yan-Tsung Peng, Hsin-Min Wang
Main category: cs.CV
TL;DR: SAVe is a self-supervised audio-visual deepfake detection framework that learns on authentic videos by generating pseudo-manipulations and modeling lip-speech synchronization to detect cross-modal inconsistencies.
Details
Motivation: Current multimodal deepfake detectors rely on curated synthetic forgeries, leading to dataset/generator bias and poor generalization to unseen manipulations. There's a need for scalable, robust detection that can identify subtle visual artifacts and cross-modal inconsistencies.
Method: SAVe uses self-supervised learning on authentic videos by: 1) Generating identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts across multiple facial granularities; 2) Modeling lip-speech synchronization via audio-visual alignment to detect temporal misalignment patterns characteristic of audio-visual forgeries.
Result: Experiments on FakeAVCeleb and AV-LipSync-TIMIT show competitive in-domain performance and strong cross-dataset generalization, demonstrating SAVe’s effectiveness as a scalable paradigm for multimodal deepfake detection.
Conclusion: Self-supervised learning on authentic videos with pseudo-manipulations and cross-modal alignment modeling provides a scalable and robust approach to multimodal deepfake detection that generalizes well to unseen manipulations.
Abstract: Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.
[87] CIV-DG: Conditional Instrumental Variables for Domain Generalization in Medical Imaging
Shaojin Bai, Yuting Su, Weizhi Nie
Main category: cs.CV
TL;DR: CIV-DG: A causal framework using Conditional Instrumental Variables to address selection bias in medical AI by disentangling pathological semantics from scanner artifacts, enabling robust cross-site generalization.
Details
Motivation: Medical AI suffers from selection bias where patient demographics non-randomly dictate hospital assignment, creating spurious correlations between site-specific variations and diagnostic labels that conventional Domain Generalization methods fail to address.
Method: Proposes CIV-DG framework leveraging Conditional Instrumental Variables to relax strict random assignment assumptions, using Deep Generalized Method of Moments architecture with conditional critic to minimize moment violations and enforce instrument-error orthogonality within demographic strata.
Result: Extensive experiments on Camelyon17 benchmark and large-scale Chest X-Ray datasets show CIV-DG significantly outperforms leading baselines, validating efficacy of conditional causal mechanisms for robust medical AI.
Conclusion: CIV-DG successfully addresses structural confounding in medical AI by disentangling pathological semantics from scanner-induced artifacts through conditional causal mechanisms, enabling robust cross-site generalization despite selection bias.
Abstract: Cross-site generalizability in medical AI is fundamentally compromised by selection bias, a structural mechanism where patient demographics (e.g., age, severity) non-randomly dictate hospital assignment. Conventional Domain Generalization (DG) paradigms, which predominantly target image-level distribution shifts, fail to address the resulting spurious correlations between site-specific variations and diagnostic labels. To surmount this identifiability barrier, we propose CIV-DG, a causal framework that leverages Conditional Instrumental Variables to disentangle pathological semantics from scanner-induced artifacts. By relaxing the strict random assignment assumption of standard IV methods, CIV-DG accommodates complex clinical scenarios where hospital selection is endogenously driven by patient demographics. We instantiate this theory via a Deep Generalized Method of Moments (DeepGMM) architecture, employing a conditional critic to minimize moment violations and enforce instrument-error orthogonality within demographic strata. Extensive experiments on the Camelyon17 benchmark and large-scale Chest X-Ray datasets demonstrate that CIV-DG significantly outperforms leading baselines, validating the efficacy of conditional causal mechanisms in resolving structural confounding for robust medical AI.
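The instrument-error orthogonality idea can be sketched as a differentiable penalty. The toy loss below averages instrument-residual products within each demographic stratum and penalizes their squared deviation from zero; this is an assumed simplification of the paper's DeepGMM conditional critic, with placeholder tensors throughout.

```python
import torch

def moment_violation(residuals, instruments, strata):
    """Sum of squared within-stratum instrument-residual moments."""
    loss = residuals.new_zeros(())
    for s in strata.unique():
        mask = strata == s
        # Approximates E[Z * eps | stratum = s]; ~0 if instruments are
        # orthogonal to the prediction error within this stratum.
        m = (instruments[mask] * residuals[mask].unsqueeze(1)).mean(dim=0)
        loss = loss + (m ** 2).sum()
    return loss

eps = torch.randn(200)                 # placeholder prediction residuals
Z = torch.randn(200, 4)                # placeholder instrument features
strata = torch.randint(0, 3, (200,))   # placeholder demographic strata
print(moment_violation(eps, Z, strata))
```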
[88] KitchenTwin: Semantically and Geometrically Grounded 3D Kitchen Digital Twins
Quanyun Wu, Kyle Gao, Daniel Long, David A. Clausi, Jonathan Li, Yuhao Chen
Main category: cs.CV
TL;DR: Scale-aware 3D fusion framework that registers object meshes with transformer-predicted global point clouds to create metrically consistent digital twins using VLM-guided geometric anchors and geometry-aware registration.
Details
Motivation: Embodied AI requires object-centric digital twins with accurate metric geometry and semantic grounding. Current transformer-based reconstruction methods produce dimensionless point clouds with scale ambiguity and coordinate mismatches that prevent reliable fusion with object meshes.
Method: Proposes a scale-aware 3D fusion framework with: 1) VLM-guided geometric anchor mechanism to recover real-world metric scale, 2) Geometry-aware registration pipeline with gravity-aligned vertical estimation, Manhattan-world constraints, and collision-free local refinement.
Result: Experiments on real indoor kitchen environments show improved cross-network object alignment and geometric consistency for downstream tasks like multi-primitive fitting and metric measurement. Also introduces open-source indoor digital twin dataset with metrically scaled scenes and registered object mesh annotations.
Conclusion: The framework successfully resolves coordinate mismatches between transformer-predicted point clouds and object meshes, enabling construction of metrically consistent digital twins for embodied AI applications.
Abstract: Embodied AI training and evaluation require object-centric digital twin environments with accurate metric geometry and semantic grounding. Recent transformer-based feedforward reconstruction methods can efficiently predict global point clouds from sparse monocular videos, yet these geometries suffer from inherent scale ambiguity and inconsistent coordinate conventions. This mismatch prevents the reliable fusion of these dimensionless point cloud predictions with locally reconstructed object meshes. We propose a novel scale-aware 3D fusion framework that registers visually grounded object meshes with transformer-predicted global point clouds to construct metrically consistent digital twins. Our method introduces a Vision-Language Model (VLM)-guided geometric anchor mechanism that resolves this fundamental coordinate mismatch by recovering an accurate real-world metric scale. To fuse these networks, we propose a geometry-aware registration pipeline that explicitly enforces physical plausibility through gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement. Experiments on real indoor kitchen environments demonstrate improved cross-network object alignment and geometric consistency for downstream tasks, including multi-primitive fitting and metric measurement. We additionally introduce an open-source indoor digital twin dataset with metrically scaled scenes and semantically grounded and registered object-centric mesh annotations.
[89] UniICL: Systematizing Unified Multimodal In-context Learning through a Capability-Oriented Taxonomy
Yicheng Xu, Jiangning Zhang, Zhucun Xue, Teng Hu, Ran Yi, Xiaobin Hu, Yong Liu, Dacheng Tao
Main category: cs.CV
TL;DR: Proposes a cognitive taxonomy and large-scale dataset for multimodal in-context learning, with a Context-Adaptive Prototype Modulator to stabilize few-shot adaptation across vision-language tasks.
Details
Motivation: In-context learning in multimodal models suffers from sensitivity to example selection/formatting, cross-modal interference, and varying cognitive demands, leading to non-monotonic performance that's highly task-dependent.
Method: 1) Introduces a six-level capability-oriented taxonomy categorizing demonstration roles from basic perception to high-order discernment; 2) Constructs UniICL-760K corpus with curated 8-shot ICL episodes across 15 subtasks; 3) Develops UniICL-Bench for controlled evaluation; 4) Proposes Context-Adaptive Prototype Modulator, a lightweight plug-and-play module to stabilize few-shot adaptation.
Result: Outperforms larger-parameter multimodal large language model baselines on most understanding ICL tasks in UniICL-Bench evaluations, achieving highly competitive unified results.
Conclusion: The proposed cognitive framework, large-scale dataset, and architectural intervention effectively address ICL sensitivity issues in multimodal models, enabling more stable and effective few-shot adaptation across diverse vision-language tasks.
Abstract: In-context Learning enables training-free adaptation via demonstrations but remains highly sensitive to example selection and formatting. In unified multimodal models spanning understanding and generation, this sensitivity is exacerbated by cross-modal interference and varying cognitive demands. Consequently, In-context Learning efficacy is often non-monotonic and highly task-dependent. To diagnose these behaviors, we introduce a six-level capability-oriented taxonomy that categorizes the functional role of demonstrations from basic perception to high-order discernment. Guided by this cognitive framework, we construct UniICL-760K, a large-scale corpus featuring curated 8-shot In-context Learning episodes across 15 subtasks, alongside UniICL-Bench for rigorous, controlled evaluation. As an architectural intervention to stabilize few-shot adaptation, we propose the Context-Adaptive Prototype Modulator, a lightweight, plug-and-play module. Evaluations on UniICL-Bench show that our approach yields highly competitive unified results, outperforming larger-parameter multimodal large language model baselines on most understanding In-context Learning tasks. Data and code will be available soon at https://github.com/xuyicheng-zju/UniICL.
[90] BCMDA: Bidirectional Correlation Maps Domain Adaptation for Mixed Domain Semi-Supervised Medical Image Segmentation
Bentao Song, Jun Huang, Qingfeng Wang
Main category: cs.CV
TL;DR: BCMDA framework for mixed domain semi-supervised medical image segmentation addresses domain shift and limited annotations through virtual domain bridging and prototypical alignment.
Details
Motivation: Address challenges in mixed domain semi-supervised medical image segmentation where domain shift between labeled/unlabeled data hinders knowledge transfer and inefficient learning causes confirmation bias.
Method: Proposes BCMDA framework with two components: 1) Knowledge Transfer via Virtual Domain Bridging (KTVDB) using bidirectional correlation maps to synthesize virtual images with fixed/progressive MixUp strategies and dual bidirectional CutMix, 2) Prototypical Alignment and Pseudo Label Correction (PAPLC) using learnable prototype classifiers for bidirectional alignment and pseudo label correction.
Result: Superior performance on three public multi-domain datasets, showing excellent performance even with very limited labeled samples.
Conclusion: BCMDA effectively addresses domain shift and confirmation bias in mixed domain semi-supervised medical image segmentation through virtual domain bridging and prototypical alignment strategies.
Abstract: In mixed domain semi-supervised medical image segmentation (MiDSS), achieving superior performance under domain shift and limited annotations is challenging. This scenario presents two primary issues: (1) distributional differences between labeled and unlabeled data hinder effective knowledge transfer, and (2) inefficient learning from unlabeled data causes severe confirmation bias. In this paper, we propose the bidirectional correlation maps domain adaptation (BCMDA) framework to overcome these issues. On the one hand, we employ knowledge transfer via virtual domain bridging (KTVDB) to facilitate cross-domain learning. First, to construct a distribution-aligned virtual domain, we leverage bidirectional correlation maps between labeled and unlabeled data to synthesize both labeled and unlabeled images, which are then mixed with the original images to generate virtual images using two strategies, a fixed ratio and a progressive dynamic MixUp. Next, dual bidirectional CutMix is used to enable initial knowledge transfer within the fixed virtual domain and gradual knowledge transfer from the dynamically transitioning labeled domain to the real unlabeled domains. On the other hand, to alleviate confirmation bias, we adopt prototypical alignment and pseudo label correction (PAPLC), which utilizes learnable prototype cosine similarity classifiers for bidirectional prototype alignment between the virtual and real domains, yielding smoother and more compact feature representations. Finally, we use prototypical pseudo label correction to generate more reliable pseudo labels. Empirical evaluations on three public multi-domain datasets demonstrate the superiority of our method, particularly showing excellent performance even with very limited labeled samples. Code available at https://github.com/pascalcpp/BCMDA.
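The two MixUp strategies in KTVDB reduce to simple convex mixes. The sketch below produces a fixed-ratio virtual image and a progressively shifting one; the linear schedule and the 0.5 default ratio are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def virtual_domain_mix(labeled_img, unlabeled_img, step, total_steps, fixed_lam=0.5):
    """Return fixed-ratio and progressive MixUp images bridging two domains."""
    lam_prog = 1.0 - step / total_steps       # drifts from labeled to unlabeled
    virtual_fixed = fixed_lam * labeled_img + (1.0 - fixed_lam) * unlabeled_img
    virtual_prog = lam_prog * labeled_img + (1.0 - lam_prog) * unlabeled_img
    return virtual_fixed, virtual_prog

labeled = np.ones((4, 4))                     # placeholder labeled-domain image
unlabeled = np.zeros((4, 4))                  # placeholder unlabeled-domain image
fixed, prog = virtual_domain_mix(labeled, unlabeled, step=750, total_steps=1000)
print(fixed.mean(), prog.mean())              # 0.5 0.25
```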
[91] OpenCap Monocular: 3D Human Kinematics and Musculoskeletal Dynamics from a Single Smartphone Video
Selim Gilon, Emily Y. Miller, Scott D. Uhlrich
Main category: cs.CV
TL;DR: OpenCap Monocular enables 3D skeletal kinematics and kinetics estimation from single smartphone videos, validated against lab-based motion capture for clinical biomechanics applications.
Details
Motivation: Traditional biomechanical analysis requires expensive lab equipment, limiting clinical translation. There's a need for scalable, accessible tools to quantify human movement and musculoskeletal forces for mobility-related conditions.
Method: Refines 3D human pose estimates from monocular pose estimation model (WHAM) via optimization, computes kinematics using biomechanically constrained skeletal model, and estimates kinetics through physics-based simulation and machine learning.
Result: Achieved low kinematic error (4.8° mean absolute error for rotations; 3.4 cm for pelvis translations), outperforming regression-only baseline by 48% in rotational and 69% in translational accuracy. Estimated ground reaction forces with accuracy comparable to prior two-camera system.
Conclusion: OpenCap Monocular provides clinically meaningful accuracy for kinetic outcomes in frailty and knee osteoarthritis applications, deployed via smartphone/web apps for free, accessible single-smartphone biomechanical assessments.
Abstract: Quantifying human movement (kinematics) and musculoskeletal forces (kinetics) at scale, such as estimating quadriceps force during a sit-to-stand movement, could transform prediction, treatment, and monitoring of mobility-related conditions. However, quantifying kinematics and kinetics traditionally requires costly, time-intensive analysis in specialized laboratories, limiting clinical translation. Scalable, accurate tools for biomechanical assessment are needed. We introduce OpenCap Monocular, an algorithm that estimates 3D skeletal kinematics and kinetics from a single smartphone video. The method refines 3D human pose estimates from a monocular pose estimation model (WHAM) via optimization, computes kinematics of a biomechanically constrained skeletal model, and estimates kinetics via physics-based simulation and machine learning. We validated OpenCap Monocular against marker-based motion capture and force plate data for walking, squatting, and sit-to-stand tasks. OpenCap Monocular achieved low kinematic error (4.8° mean absolute error for rotational degrees of freedom; 3.4 cm for pelvis translations), outperforming a regression-only computer vision baseline by 48% in rotational accuracy (p = 0.036) and 69% in translational accuracy (p < 0.001). OpenCap Monocular also estimated ground reaction forces during walking with accuracy comparable to, or better than, our prior two-camera OpenCap system. We demonstrate that the algorithm estimates important kinetic outcomes with clinically meaningful accuracy in applications related to frailty and knee osteoarthritis, including estimating knee extension moment during sit-to-stand transitions and knee adduction moment during walking. OpenCap Monocular is deployed via a smartphone app, web app, and secure cloud computing (https://opencap.ai), enabling free, accessible single-smartphone biomechanical assessments.
[92] LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration
Gokce Inal, Pouyan Navard, Alper Yilmaz
Main category: cs.CV
TL;DR: LLaVA-LE is a specialized vision-language model for lunar exploration, trained on a new lunar dataset (LUCID) with 96k images and 81k QA pairs, achieving significant performance gains over base models for planetary science applications.
Details
Motivation: Multimodal vision-language models have shown promise but remain unexplored for planetary science due to lack of large-scale datasets pairing real planetary imagery with scientific descriptions. The paper aims to bridge this gap for lunar exploration.
Method: Created LUCID dataset with 96k lunar images and captions plus 81k QA pairs. Fine-tuned LLaVA using two-stage curriculum: (1) concept alignment for domain-specific terrain description, and (2) instruction-tuned visual question answering. Designed evaluation benchmarks for lunar terrain analysis.
Result: LLaVA-LE achieved 3.3x overall performance gain over Base LLaVA and 2.1x over Stage 1 model, with reasoning score of 1.070 exceeding judge’s reference score. Demonstrated effectiveness of domain-specific multimodal data and instruction tuning.
Conclusion: Domain-specific multimodal data and instruction tuning significantly advance vision-language models for planetary exploration. The LLaVA-LE model and LUCID dataset enable new capabilities in lunar surface and subsurface characterization.
Abstract: Recent advances in multimodal vision-language models (VLMs) have enabled joint reasoning over visual and textual information, yet their application to planetary science remains largely unexplored. A key hindrance is the absence of large-scale datasets that pair real planetary imagery with detailed scientific descriptions. In this work, we introduce LLaVA-LE (Large Language-and-Vision Assistant for Lunar Exploration), a vision-language model specialized for lunar surface and subsurface characterization. To enable this capability, we curate a new large-scale multimodal lunar dataset, LUCID (LUnar Caption Image Dataset) consisting of 96k high-resolution panchromatic images paired with detailed captions describing lunar terrain characteristics, and 81k question-answer (QA) pairs derived from approximately 20k images in the LUCID dataset. Leveraging this dataset, we fine-tune LLaVA using a two-stage training curriculum: (1) concept alignment for domain-specific terrain description, and (2) instruction-tuned visual question answering. We further design evaluation benchmarks spanning multiple levels of reasoning complexity relevant to lunar terrain analysis. Evaluated against GPT and Gemini judges, LLaVA-LE achieves a 3.3x overall performance gain over Base LLaVA and 2.1x over our Stage 1 model, with a reasoning score of 1.070, exceeding the judge’s own reference score, highlighting the effectiveness of domain-specific multimodal data and instruction tuning to advance VLMs in planetary exploration. Code is available at https://github.com/OSUPCVLab/LLaVA-LE.
[93] TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang
Main category: cs.CV
TL;DR: Systematic investigation (TimeLens) for building multimodal LLMs with strong video temporal grounding ability, addressing data quality issues and algorithmic design principles.
Details
Motivation: Multimodal LLMs excel at video understanding but recipes for optimizing them for video temporal grounding (VTG) remain under-explored, with existing benchmarks having critical quality issues.
Method: Two-pronged approach: 1) Address data quality by creating TimeLens-Bench (re-annotated benchmarks) and TimeLens-100K (large-scale training dataset), 2) Explore algorithmic design including interleaved textual encoding, thinking-free reinforcement learning with verifiable rewards (RLVR), and careful training recipes.
Result: TimeLens models achieve state-of-the-art VTG performance among open-source models, surpassing proprietary models like GPT-5 and Gemini-2.5-Flash, with dramatic model re-rankings compared to legacy benchmarks.
Conclusion: Establishes essential baseline for VTG, demonstrates importance of data quality and systematic algorithmic design, and provides open resources to facilitate future research in video understanding.
Abstract: This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens models, a family of MLLMs with state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All codes, data, and models will be released to facilitate future research.
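The interleaved textual encoding of time can be pictured with a toy tokenizer. The snippet below inserts plain-text timestamp markers between frame placeholders so the model can cite times in its own vocabulary; the marker format is hypothetical, not the paper's exact representation.

```python
def interleave_timestamps(frame_tokens, fps=1.0):
    """Insert plain-text timestamps between frame tokens (hypothetical format)."""
    seq = []
    for i, tok in enumerate(frame_tokens):
        seq.append(f"<t={i / fps:.1f}s>")   # textual time marker the LLM can cite
        seq.append(tok)
    return seq

print(interleave_timestamps(["<frame0>", "<frame1>", "<frame2>"], fps=2.0))
# ['<t=0.0s>', '<frame0>', '<t=0.5s>', '<frame1>', '<t=1.0s>', '<frame2>']
```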
[94] Image Rotation Angle Estimation: Comparing Circular-Aware Methods
Maximilian Woehrer
Main category: cs.CV
TL;DR: Comprehensive study of 5 circular-aware methods for image rotation estimation, showing probabilistic approaches (circular Gaussian) are most robust across architectures, with classification achieving best accuracy on well-matched backbones.
Details
Motivation: Automatic image rotation estimation is crucial preprocessing for vision pipelines, but challenging due to circular topology of angles causing boundary discontinuities that hinder standard regression methods.
Method: Systematically evaluated five circular-aware methods: direct angle regression with circular loss, classification via angular binning, unit-vector regression, phase-shifting coder, and circular Gaussian distribution. Used transfer learning from ImageNet-pretrained models across sixteen modern architectures by adapting output heads for rotation-specific predictions.
Result: Best configuration (classification with EfficientViT-B3) achieved 1.23° MAE on DRC-D dataset; circular Gaussian with MambaOut Base achieved 1.24° with greater robustness. On COCO 2014, best configuration reached 3.71° MAE, improving over prior work, with further improvement to 2.84° on COCO 2017.
Conclusion: Probabilistic methods, particularly circular Gaussian distribution, are most robust across architectures, while classification achieves best accuracy on well-matched backbones but suffers training instabilities on others.
Abstract: Automatic image rotation estimation is a key preprocessing step in many vision pipelines. This task is challenging because angles have circular topology, creating boundary discontinuities that hinder standard regression methods. We present a comprehensive study of five circular-aware methods for global orientation estimation: direct angle regression with circular loss, classification via angular binning, unit-vector regression, phase-shifting coder, and circular Gaussian distribution. Using transfer learning from ImageNet-pretrained models, we systematically evaluate these methods across sixteen modern architectures by adapting their output heads for rotation-specific predictions. Our results show that probabilistic methods, particularly the circular Gaussian distribution, are the most robust across architectures, while classification achieves the best accuracy on well-matched backbones but suffers training instabilities on others. The best configuration (classification with EfficientViT-B3) achieves a mean absolute error (MAE) of 1.23° (mean across five independent runs) on the DRC-D dataset, while the circular Gaussian distribution with MambaOut Base achieves a virtually identical 1.24° with greater robustness across backbones. Training and evaluating our top-performing method-architecture combinations on COCO 2014, the best configuration reaches 3.71° MAE, improving substantially over prior work, with further improvement to 2.84° on the larger COCO 2017 dataset.
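Two of the evaluated encodings are simple enough to show directly. The sketch below implements unit-vector (cos/sin) regression targets and a wrapped circular MAE, illustrating why a 359° prediction against a 1° label should count as a 2° error rather than 358°.

```python
import numpy as np

def angle_to_unit_vector(theta_deg):
    """Encode angles as (cos, sin) pairs, removing the 0/360 discontinuity."""
    t = np.deg2rad(np.asarray(theta_deg))
    return np.stack([np.cos(t), np.sin(t)], axis=-1)

def unit_vector_to_angle(v):
    """Decode a (cos, sin) pair back to an angle in [0, 360)."""
    return np.rad2deg(np.arctan2(v[..., 1], v[..., 0])) % 360.0

def circular_mae(pred_deg, true_deg):
    """MAE on the circle: wrap differences into [-180, 180) before |.|."""
    diff = (np.asarray(pred_deg) - np.asarray(true_deg) + 180.0) % 360.0 - 180.0
    return np.mean(np.abs(diff))

print(unit_vector_to_angle(angle_to_unit_vector([350.0, 10.0])))  # [350. 10.]
print(circular_mae([359.0], [1.0]))  # 2.0, not 358.0
```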
[95] Lookalike3D: Seeing Double in 3D
Chandan Yeshwanth, Angela Dai
Main category: cs.CV
TL;DR: Lookalike3D: A multiview image transformer for detecting identical and similar 3D object pairs in indoor scenes using semantic priors from foundation models, with applications to joint reconstruction and part co-segmentation.
Details
Motivation: Current 3D object understanding methods overlook repeated objects in real-world scenes, which provide valuable complementary cues for better 3D perception and reconstruction.
Method: Lookalike3D multiview image transformer that leverages semantic priors from large image foundation models to classify object pairs as identical, similar, or different. Uses 3DTwins dataset with 76k annotated pairs from ScanNet++.
Result: Achieves 104% IoU improvement over baselines. Enables improved joint 3D object reconstruction and part co-segmentation by leveraging repeated object cues.
Conclusion: Repeated and lookalike objects serve as powerful cues for consistent, high-quality 3D perception, with applications to reconstruction and segmentation tasks.
Abstract: 3D object understanding and generation methods produce impressive results, yet they often overlook a pervasive source of information in real-world scenes: repeated objects. We introduce the task of lookalike object detection in indoor scenes, which leverages repeated and complementary cues from identical and near-identical object pairs. Given an input scene, the task is to classify pairs of objects as identical, similar or different using multiview images as input. To address this, we present Lookalike3D, a multiview image transformer that effectively distinguishes such object pairs by harnessing strong semantic priors from large image foundation models. To support this task, we collected the 3DTwins dataset, containing 76k manually annotated identical, similar and different pairs of objects based on ScanNet++, and show an improvement of 104% IoU over baselines. We demonstrate how our method improves downstream tasks such as enabling joint 3D object reconstruction and part co-segmentation, turning repeated and lookalike objects into a powerful cue for consistent, high-quality 3D perception. Our code, dataset and models will be made publicly available.
[96] Challenges in Hyperspectral Imaging for Autonomous Driving: The HSI-Drive Case
Koldo Basterretxea, Jon Gutiérrez-Zaballa, Javier Echanobe
Main category: cs.CV
TL;DR: Survey paper analyzing hyperspectral imaging techniques for autonomous driving, focusing on challenges like variable lighting, depth-of-field, dynamic scenes, and real-time requirements on embedded platforms.
Details
Motivation: Hyperspectral imaging shows promise for autonomous driving but faces domain-specific challenges including non-controlled lighting, wide depth-of-field ranges, dynamic scenes, real-time operation needs, and limited computational resources on embedded platforms.
Method: The paper analyzes various HSI techniques explored in research for AD applications, using experimental results from the HSI-Drive dataset to evaluate different approaches.
Result: The analysis provides insights into appropriate HSI technology selection criteria and development of custom vision algorithms that leverage spectral and spatial information from HSI sensors for autonomous driving applications.
Conclusion: HSI has potential for AD but requires careful consideration of both sensor technology selection and algorithm development to address the unique challenges of the autonomous driving domain.
Abstract: The use of hyperspectral imaging (HSI) in autonomous driving (AD), while promising, faces many challenges related to the specifics and requirements of this application domain. On the one hand, it must contend with non-controlled and variable lighting conditions, wide depth-of-field ranges, and dynamic scenes with fast-moving objects; on the other, it must meet real-time operation requirements within the limited computational resources of embedded platforms. The combination of these factors determines both the criteria for selecting appropriate HSI technologies and the development of custom vision algorithms that leverage the spectral and spatial information obtained from the sensors. In this article, we analyse several techniques explored in the research of HSI-based vision systems with application to AD, using results from experiments on the most recent version of the HSI-Drive dataset as an example.
[97] Accurate Point Measurement in 3DGS – A New Alternative to Traditional Stereoscopic-View Based Measurements
Deyan Deng, Rongjun Qin
Main category: cs.CV
TL;DR: 3D Gaussian Splatting enables accurate 3D point measurements through intuitive multi-view point picking and triangulation, outperforming traditional mesh-based methods.
Details
Motivation: While 3D Gaussian Splatting (3DGS) excels at novel view synthesis, its potential for precise geometric measurement remains unexplored. Current methods rely on stereoscopic workstations or inaccurate mesh picking, creating a gap for accessible, accurate measurement tools.
Method: Leverages 3DGS’s ability to render exact source views and interpolate between them. Users pick congruent points across multiple views, then triangulate these points to generate precise 3D measurements. Implemented as a web-based application for accessibility.
Result: Achieves RMSEs of 1-2 cm on well-defined points, significantly outperforming mesh-based methods. On challenging thin structures: 0.037m vs 0.062m mesh RMSE. On sharp corners: 0.013m RMSE where mesh methods failed entirely.
Conclusion: 3DGS provides a practical, accessible alternative to traditional stereoscopic measurement, enabling accurate 3D point measurements without specialized hardware or operator training, while maintaining superior visual quality.
Abstract: 3D Gaussian Splatting (3DGS) has revolutionized real-time rendering with its state-of-the-art novel view synthesis, but its utility for accurate geometric measurement remains underutilized. Compared to multi-view stereo (MVS) point clouds or meshes, 3DGS rendered views present superior visual quality and completeness. However, current point measurement methods still rely on demanding stereoscopic workstations or direct picking on often-incomplete and inaccurate 3D meshes. As a novel view synthesizer, 3DGS renders exact source views and smoothly interpolates in-between views. This allows users to intuitively pick congruent points across different views while operating 3DGS models. By triangulating these congruent points, one can precisely generate 3D point measurements. This approach mimics traditional stereoscopic measurement but is significantly less demanding: it requires neither a stereo workstation nor specialized operator stereoscopic capability. Furthermore, it enables multi-view intersection (more than two views) for higher measurement accuracy. We implemented a web-based application to demonstrate this proof-of-concept (PoC). Using several UAV aerial datasets, we show this PoC allows users to successfully perform highly accurate point measurements, achieving accuracy matching or exceeding traditional stereoscopic methods on standard hardware. Specifically, our approach significantly outperforms direct mesh-based measurements. Quantitatively, our method achieves RMSEs in the 1-2 cm range on well-defined points. More critically, on challenging thin structures where mesh-based RMSE was 0.062 m, our method achieved 0.037 m. On sharp corners poorly reconstructed in the mesh, our method successfully measured all points with a 0.013 m RMSE, whereas the mesh method failed entirely. Code is available at: https://github.com/GDAOSU/3dgs_measurement_tool.
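The measurement step is classical multi-view triangulation. The sketch below implements linear (DLT) triangulation from picked correspondences and checks it on two synthetic pinhole cameras; it illustrates the underlying geometry only, not the paper's web tool.

```python
import numpy as np

def triangulate(proj_mats, points_2d):
    """Linear (DLT) triangulation of one 3D point from N >= 2 views.

    proj_mats: list of 3x4 camera projection matrices P = K [R | t].
    points_2d: list of matching (u, v) pixel observations, one per view.
    Stacks two constraint rows per view and solves Ax = 0 via SVD.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.stack(rows))
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Toy check with two synthetic cameras looking at a known point:
K = np.diag([800.0, 800.0, 1.0])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0, 1.0])
obs = [(P @ X_true)[:2] / (P @ X_true)[2] for P in (P1, P2)]
print(triangulate([P1, P2], obs).round(3))  # ~ [0.5 0.2 4.0]
```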
[98] From Scale to Speed: Adaptive Test-Time Scaling for Image Editing
Xiangyan Qu, Zhenlong Yuan, Jing Tang, Rui Chen, Datao Tang, Meng Yu, Lei Sun, Yancheng Bai, Xiangxiang Chu, Gaopeng Gou, Gang Xiong, Yujun Cai
Main category: cs.CV
TL;DR: ADE-CoT is an adaptive test-time scaling framework for image editing that improves efficiency by dynamically allocating sampling budgets based on edit difficulty, using edit-specific verification for early pruning, and implementing depth-first opportunistic stopping.
Details
Motivation: Current Image Chain-of-Thought methods focus on text-to-image generation but are inefficient for image editing, which is goal-directed and constrained by source images and instructions. This mismatch causes three challenges: inefficient resource allocation with fixed budgets, unreliable early-stage verification using general MLLM scores, and redundant edited results from large-scale sampling.
Method: ADE-CoT incorporates three key strategies: (1) difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification using region localization and caption consistency for early pruning; and (3) depth-first opportunistic stopping guided by an instance-specific verifier that terminates when intent-aligned results are found.
Result: Extensive experiments on three state-of-the-art editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance with more than 2x speedup over Best-of-N.
Conclusion: ADE-CoT effectively addresses the challenges of applying test-time scaling to image editing by providing an adaptive framework that improves both efficiency and performance through dynamic resource allocation, edit-specific verification, and opportunistic stopping.
Abstract: Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and instruction. This mismatch causes three challenges when applying Image-CoT to editing: inefficient resource allocation with fixed sampling budgets, unreliable early-stage verification using general MLLM scores, and redundant edited results from large-scale sampling. To address this, we propose ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework to enhance editing efficiency and performance. It incorporates three key strategies: (1) a difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification in early pruning that uses region localization and caption consistency to select promising candidates; and (3) depth-first opportunistic stopping, guided by an instance-specific verifier, that terminates when intent-aligned results are found. Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance with more than 2x speedup over Best-of-N.
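The control flow reduces to a budgeted sample-check-stop loop. The skeleton below is an assumed reading of that loop, with lambda placeholders standing in for the editing model, the edit-specific early check, and the instance-specific verifier; none of the names come from the paper's code.

```python
import random

def adaptive_edit(edit_fn, verify_fn, early_check_fn, difficulty, max_budget=16):
    """Sample-check-stop loop: budget scales with estimated edit difficulty."""
    budget = max(1, round(difficulty * max_budget))
    for _ in range(budget):
        candidate = edit_fn()                 # sample one edited image
        if not early_check_fn(candidate):     # cheap edit-specific pruning
            continue                          # discard before full scoring
        if verify_fn(candidate):              # intent-aligned: stop early
            return candidate
    return None                               # caller may fall back or retry

random.seed(0)
result = adaptive_edit(
    edit_fn=lambda: random.random(),          # stand-in for an editing model
    verify_fn=lambda x: x > 0.9,              # stand-in instance verifier
    early_check_fn=lambda x: x > 0.5,         # stand-in region/caption check
    difficulty=0.8,
)
print(result)
```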
[99] Is Geometry Enough? An Evaluation of Landmark-Based Gaze Estimation
Daniele Agostinelli, Thomas Agostinelli, Andrea Generosi, Maura Mengoni
Main category: cs.CV
TL;DR: Landmark-based gaze estimation using facial landmarks with lightweight models (XGBoost, MLPs) shows competitive cross-domain generalization compared to CNN baselines, offering efficient, interpretable alternatives for edge applications.
Details
Motivation: Current CNN-based gaze estimation methods are computationally expensive and lack interpretability, while geometric landmark-based methods offer lightweight alternatives but their performance limits and generalization capabilities remain underexplored in modern benchmarks.
Method: Standardized pipeline to extract and normalize facial landmarks from three large-scale datasets (Gaze360, ETH-XGaze, GazeGene), then train lightweight regression models: Extreme Gradient Boosted trees, holistic Multi-Layer Perceptron (MLP), and siamese MLP designed to capture binocular geometry.
Result: Landmark-based models show lower performance in within-domain evaluation due to landmark detector noise, but in cross-domain evaluation, the proposed MLP architectures demonstrate generalization capabilities comparable to ResNet18 baselines.
Conclusion: Sparse geometric features encode sufficient information for robust gaze estimation, enabling efficient, interpretable, and privacy-friendly edge applications, with landmark-based methods showing promising cross-domain generalization.
Abstract: Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as “black boxes”, offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi-Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at https://github.com/daniele-agostinelli/LandmarkGaze.git.
[100] Confidence-Based Mesh Extraction from 3D Gaussians
Lukas Radl, Felix Windisch, Andreas Kurz, Thomas Köhler, Michael Steiner, Markus Steinberger
Main category: cs.CV
TL;DR: Self-supervised confidence framework for 3D Gaussian Splatting improves mesh extraction from posed images by balancing photometric and geometric supervision with learnable confidence values, achieving state-of-the-art results for unbounded meshes while maintaining efficiency.
Details
Motivation: 3D Gaussian Splatting accelerates mesh extraction but struggles with scenes containing abundant view-dependent effects, leading to ambiguities. Prior solutions sacrifice efficiency by using multi-view techniques, iterative extraction, or large pre-trained models. The authors aim to maintain 3DGS efficiency while improving surface extraction accuracy.
Method: Introduces a self-supervised confidence framework where learnable confidence values dynamically balance photometric and geometric supervision. Adds losses penalizing per-primitive color and normal variance. Improves appearance model by decoupling individual terms of the D-SSIM loss.
Result: Achieves state-of-the-art results for unbounded meshes while remaining highly efficient compared to previous approaches.
Conclusion: The proposed confidence-driven framework provides a simple and efficient alternative to complex methods, successfully resolving ambiguities in scenes with view-dependent effects while maintaining the inherent efficiency of 3D Gaussian Splatting.
Abstract: Recently, 3D Gaussian Splatting (3DGS) greatly accelerated mesh extraction from posed images due to its explicit representation and fast software rasterization. While the addition of geometric losses and other priors has improved the accuracy of extracted surfaces, mesh extraction remains difficult in scenes with abundant view-dependent effects. To resolve the resulting ambiguities, prior works rely on multi-view techniques, iterative mesh extraction, or large pre-trained models, sacrificing the inherent efficiency of 3DGS. In this work, we present a simple and efficient alternative by introducing a self-supervised confidence framework to 3DGS: within this framework, learnable confidence values dynamically balance photometric and geometric supervision. Extending our confidence-driven formulation, we introduce losses which penalize per-primitive color and normal variance and demonstrate their benefits to surface extraction. Finally, we complement the above with an improved appearance model, by decoupling the individual terms of the D-SSIM loss. Our final approach delivers state-of-the-art results for unbounded meshes while remaining highly efficient.
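The confidence mechanism can be sketched as a gated loss. The snippet below blends per-primitive photometric and geometric errors with learnable sigmoid gates, plus a regularizer so the model cannot opt out of geometric supervision; this is an assumed form for illustration, not the paper's exact objective.

```python
import torch

def confidence_loss(photo_err, geo_err, logit_c, reg=0.01):
    """Blend per-primitive photometric/geometric errors with learned gates."""
    c = torch.sigmoid(logit_c)            # confidence in (0, 1), learnable
    # Confident primitives follow the geometric target; uncertain ones
    # (e.g. view-dependent effects) lean on the photometric term instead.
    blended = (c * geo_err + (1.0 - c) * photo_err).mean()
    # Penalize low confidence so geometry cannot be ignored wholesale.
    return blended - reg * torch.log(c + 1e-8).mean()

logit_c = torch.zeros(1000, requires_grad=True)   # one gate per primitive
photo, geo = torch.rand(1000), torch.rand(1000)   # placeholder errors
loss = confidence_loss(photo, geo, logit_c)
loss.backward()                                    # gates receive gradients
print(loss.item(), logit_c.grad.abs().mean().item())
```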
[101] A Framework for Generating Semantically Ambiguous Images to Probe Human and Machine Perception
Yuqi Hu, Vasha DuTell, Ahna R. Girshick, Jennifer E. Corbett
Main category: cs.CV
TL;DR: A framework using ambiguous images to compare human and machine vision by generating continuous concept spectra in CLIP space, revealing differences in semantic boundary placement between humans and classifiers.
Details
Motivation: To understand how human observers and machine vision models draw semantic boundaries when faced with ambiguous visual evidence, using interpretability probes to expose concept representation differences.
Method: Psychophysically-informed framework that interpolates between concepts in CLIP embedding space to generate continuous spectra of ambiguous images, allowing precise measurement of semantic boundary placement by humans and machine classifiers.
Result: Machine classifiers are more biased towards seeing ‘rabbit’ while humans align more with CLIP embeddings; guidance scale affects human sensitivity more strongly than machine classifiers.
Conclusion: Controlled ambiguity serves as a diagnostic tool bridging human psychophysical analysis, image classification, and generative models, offering insights into human-model alignment, robustness, interpretability, and synthesis methods.
Abstract: The classic duck-rabbit illusion reveals that when visual evidence is ambiguous, the human brain must decide what it sees. But where exactly do human observers draw the line between “duck” and “rabbit”, and do machine classifiers draw it in the same place? We use semantically ambiguous images as interpretability probes to expose how vision models represent the boundaries between concepts. We present a psychophysically-informed framework that interpolates between concepts in the CLIP embedding space to generate continuous spectra of ambiguous images, allowing us to precisely measure where and how humans and machine classifiers place their semantic boundaries. Using this framework, we show that machine classifiers are more biased towards seeing “rabbit”, whereas humans are more aligned with the CLIP embedding used for synthesis, and the guidance scale seems to affect human sensitivity more strongly than machine classifiers. Our framework demonstrates how controlled ambiguity can serve as a diagnostic tool to bridge the gap between human psychophysical analysis, image classification, and generative image models, offering insight into human-model alignment, robustness, model interpretability, and image synthesis methods.
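The interpolation step is typically done on the unit sphere of normalized embeddings. The sketch below builds a nine-step concept spectrum with spherical interpolation (slerp) between two placeholder vectors; whether the paper uses slerp or another interpolant is not stated here, so treat this as one plausible instantiation.

```python
import numpy as np

def slerp(a, b, t):
    """Spherical interpolation between two embeddings (t=0 -> a, t=1 -> b)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(a @ b, -1.0, 1.0))
    if omega < 1e-6:                    # nearly parallel: plain lerp suffices
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

rng = np.random.default_rng(0)
duck, rabbit = rng.normal(size=512), rng.normal(size=512)  # placeholder embeddings
spectrum = [slerp(duck, rabbit, t) for t in np.linspace(0.0, 1.0, 9)]
# Each embedding would condition the generator for one frame of the morph.
print(len(spectrum), np.linalg.norm(spectrum[4]))  # 9 and ~1.0 (stays on sphere)
```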
[102] TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval
David G. Shatwell, Sirnam Swetha, Mubarak Shah
Main category: cs.CV
TL;DR: TIGeR: A multimodal transformer model for unified geo-temporal image understanding that jointly reasons about visual appearance, geolocation, and time for applications like geo-time-aware image retrieval.
Details
Motivation: Real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, geolocation, and time, going beyond standard geo-localization to support complex capabilities like retrieving images at the same location but at specified target times.
Method: Proposes TIGeR, a multi-modal-transformer-based model that maps image, geolocation, and time into a unified geo-temporal embedding space. Supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation for multiple tasks (a toy retrieval sketch follows the abstract).
Result: TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% in geo-time aware retrieval recall, demonstrating benefits of unified geo-temporal modeling.
Conclusion: By preserving location identity under large appearance changes, TIGeR enables retrieval based on where and when a scene is, rather than purely on visual similarity, highlighting the importance of unified multimodal reasoning for geo-temporal applications.
Abstract: Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, geolocation, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time. We formalize this problem as Geo-Time Aware Image Retrieval and curate a diverse benchmark of 4.5M paired image-location-time triplets for training and 86k high-quality triplets for evaluation. We then propose TIGeR, a multi-modal-transformer-based model that maps image, geolocation, and time into a unified geo-temporal embedding space. TIGeR supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation to perform (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time-aware retrieval. By better preserving underlying location identity under large appearance changes, TIGeR enables retrieval based on where and when a scene is, rather than purely on visual similarity. Extensive experiments show that TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% in geo-time-aware retrieval recall, highlighting the benefits of unified geo-temporal modeling.
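A toy sketch of retrieval in a shared embedding space. The averaging fusion, the 256-dimensional size, and the random gallery are hypothetical stand-ins; TIGeR learns its fusion and embeddings, which this sketch does not attempt to reproduce.

```python
import numpy as np

def cosine_rank(query: np.ndarray, gallery: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k gallery embeddings most similar to the query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(g @ q)[::-1][:k]

rng = np.random.default_rng(0)
img_emb, time_emb = rng.normal(size=256), rng.normal(size=256)
query = (img_emb + time_emb) / 2            # naive multi-modality fusion
gallery = rng.normal(size=(1000, 256))      # stand-in for embedded gallery
print(cosine_rank(query, gallery))          # top-5 geo-time-aware matches
```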
[103] Synthetic Cardiac MRI Image Generation using Deep Generative Models
Ishan Kumarasinghe, Dasuni Kawya, Madhura Edirisooriya, Isuri Devindi, Isuru Nawinne, Vajira Thambawita
Main category: cs.CV
TL;DR: Review paper comparing synthetic cardiac MRI generation methods (GANs, VAEs, diffusion models, flow-matching) focusing on fidelity, utility, and privacy for overcoming medical data scarcity.
Details
Motivation: Address scarcity of annotated medical imaging data, vendor variability, and privacy risks in cardiac MRI through synthetic data generation.
Method: Comparative review of existing approaches: GANs, VAEs, diffusion probabilistic models, flow-matching techniques, mask-conditioned generation, vendor-style conditioning, and privacy mechanisms.
Result: Synthetic CMRI generation shows promise for enhancing segmentation accuracy and robustness across multi-vendor settings while addressing privacy concerns.
Conclusion: Need for integrated, evaluation-driven frameworks for reliable clinical workflows, with focus on fidelity, utility, and privacy trade-offs.
Abstract: Synthetic cardiac MRI (CMRI) generation has emerged as a promising strategy to overcome the scarcity of annotated medical imaging data. Recent advances in GANs, VAEs, diffusion probabilistic models, and flow-matching techniques aim to generate anatomically accurate images while addressing challenges such as limited labeled datasets, vendor variability, and risks of privacy leakage through model memorization. Mask-conditioned generation improves structural fidelity by guiding synthesis with segmentation maps, while diffusion and flow-matching models offer strong boundary preservation and efficient deterministic transformations. Cross-domain generalization is further supported through vendor-style conditioning and preprocessing steps like intensity normalization. To ensure privacy, studies increasingly incorporate membership inference attacks, nearest-neighbor analyses, and differential privacy mechanisms. Utility evaluations commonly measure downstream segmentation performance, with evidence showing that anatomically constrained synthetic data can enhance accuracy and robustness across multi-vendor settings. This review aims to compare existing CMRI generation approaches through the lenses of fidelity, utility, and privacy, highlighting current limitations and the need for integrated, evaluation-driven frameworks for reliable clinical workflows.
[104] DRoPS: Dynamic 3D Reconstruction of Pre-Scanned Objects
Narek Tumanyan, Samuel Rota Bulò, Denis Rozumny, Lorenzo Porzi, Adam Harley, Tali Dekel, Peter Kontschieder, Jonathon Luiten
Main category: cs.CV
TL;DR: DRoPS: Dynamic scene reconstruction using static pre-scans and grid-structured Gaussian primitives for improved novel view synthesis and motion tracking.
Details
Motivation: Existing dynamic scene reconstruction methods struggle with extreme novel viewpoints and highly articulated motions due to insufficient priors and regularization. They don't fully exploit available static pre-scans of dynamic objects.
Method: Uses static pre-scan as explicit geometric/appearance prior. Organizes Gaussian primitives into pixel grids anchored to object surface. Parameterizes motion using CNN conditioned on these grids for implicit regularization and correlation of nearby points.
Result: Significantly outperforms state-of-the-art in rendering quality and 3D tracking accuracy, especially for extreme novel viewpoints and articulated motions.
Conclusion: DRoPS effectively leverages static pre-scans with grid-structured representation and CNN-based motion parameterization to achieve superior dynamic scene reconstruction.
Abstract: Dynamic scene reconstruction from casual videos has seen recent remarkable progress. Numerous approaches have attempted to overcome the ill-posedness of the task by distilling priors from 2D foundational models and by imposing hand-crafted regularization on the optimized motion. However, these methods struggle to reconstruct scenes from extreme novel viewpoints, especially when highly articulated motions are present. In this paper, we present DRoPS, a novel approach that leverages a static pre-scan of the dynamic object as an explicit geometric and appearance prior. While existing state-of-the-art methods fail to fully exploit the pre-scan, DRoPS leverages our novel setup to effectively constrain the solution space and ensure geometrical consistency throughout the sequence. The core of our novelty is twofold: first, we establish a grid-structured and surface-aligned model by organizing Gaussian primitives into pixel grids anchored to the object surface. Second, by leveraging the grid structure of our primitives, we parameterize motion using a CNN conditioned on those grids, injecting strong implicit regularization and correlating the motion of nearby points. Extensive experiments demonstrate that our method significantly outperforms the current state of the art in rendering quality and 3D tracking accuracy.
[105] Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
Danil Tokhchukov, Aysel Mirzoeva, Andrey Kuznetsov, Konstantin Sobolev
Main category: cs.CV
TL;DR: Calibri is a parameter-efficient calibration method for Diffusion Transformers that uses a single learned scaling parameter and evolutionary optimization to improve generative quality while reducing inference steps.
Details
Motivation: Diffusion Transformers (DiTs) have shown promise for generative tasks, but their potential may not be fully realized. The authors aim to enhance DiT performance through better calibration of model components without adding significant computational overhead.
Method: Calibri introduces a single learned scaling parameter to DiT blocks, frames DiT calibration as a black-box reward optimization problem, and uses an evolutionary algorithm to efficiently optimize ~100 parameters for improved performance (a toy evolutionary-search sketch follows the abstract).
Result: Calibri consistently improves performance across various text-to-image models, reduces inference steps required for image generation, and maintains high-quality outputs despite its lightweight design.
Conclusion: Simple calibration techniques can significantly enhance Diffusion Transformers’ generative capabilities, and Calibri provides an efficient, parameter-light approach to optimize DiT performance for text-to-image generation tasks.
Abstract: In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just ~100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the inference steps required for image generation, all while maintaining high-quality outputs.
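A minimal sketch of black-box calibration with an evolution strategy, matching the "~100 parameters" scale mentioned above. The (mu, lambda) scheme, population sizes, and quadratic stand-in reward are assumptions; a real run would score generated images with a reward model instead of a known target.

```python
import numpy as np

def evolve_scales(reward_fn, dim=100, pop=32, elite=8, sigma=0.05, iters=50):
    """Simple (mu, lambda) evolution strategy over per-block scale factors."""
    mean = np.ones(dim)                                    # identity scaling
    for _ in range(iters):
        cands = mean + np.random.randn(pop, dim) * sigma   # perturb
        scores = np.array([reward_fn(c) for c in cands])   # black-box reward
        mean = cands[np.argsort(scores)[-elite:]].mean(axis=0)  # recombine
    return mean

# Quadratic stand-in reward with a hidden optimum near 1.0.
target = np.random.default_rng(0).uniform(0.9, 1.1, size=100)
print(evolve_scales(lambda s: -np.sum((s - target) ** 2))[:5])
```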
[106] Dissecting Model Failures in Abdominal Aortic Aneurysm Segmentation through Explainability-Driven Analysis
Abu Noman Md Sakib, Merjulah Roby, Zijie Zhang, Satish Muluk, Mark K. Eskandari, Ender A. Finol
Main category: cs.CV
TL;DR: XAI-guided encoder shaping framework for CT image segmentation of abdominal aortic aneurysms that uses attribution-based focus maps to align model attention with segmentation targets and suppress distractors.
Details
Motivation: CT image segmentation of complex abdominal aortic aneurysms often fails because models focus on irrelevant structures or miss thin, low-contrast targets. The authors aim to improve segmentation reliability by explicitly guiding where the model looks during training.
Method: Proposes an Explainable AI (XAI) guided encoder shaping framework that computes dense attribution-based encoder focus maps ("XAI fields") from the final encoder block. Uses these in two ways: (1) aligns predicted probability mass to XAI field to promote agreement between focus and output, and (2) routes the field into a lightweight refinement pathway and confidence prior that modulates logits at inference to suppress distractors while preserving subtle structures (an illustrative alignment-loss sketch follows the abstract).
Result: The method shows substantial improvements compared to a base SAM setup when evaluated on clinically validated challenging cases curated for failure-prone scenarios.
Conclusion: Explicitly optimizing encoder focus via XAI guidance is a practical and effective principle for reliable segmentation in complex medical imaging scenarios.
Abstract: Computed tomography image segmentation of complex abdominal aortic aneurysms (AAA) often fails because the models assign internal focus to irrelevant structures or do not focus on thin, low-contrast targets. Where the model looks is the primary training signal, and thus we propose an Explainable AI (XAI) guided encoder shaping framework. Our method computes a dense, attribution-based encoder focus map ("XAI field") from the final encoder block and uses it in two complementary ways: (i) we align the predicted probability mass to the XAI field to promote agreement between focus and output; and (ii) we route the field into a lightweight refinement pathway and a confidence prior that modulates logits at inference, suppressing distractors while preserving subtle structures. The objective terms serve only as control signals; the contribution is the integration of attribution guidance into representation and decoding. We evaluate on clinically validated challenging cases curated for failure-prone scenarios. Compared to a base SAM setup, our implementation yields substantial improvements. The observed gains suggest that explicitly optimizing encoder focus via XAI guidance is a practical and effective principle for reliable segmentation in complex scenarios.
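One plausible form of the focus/output alignment term (i): treat both the predicted foreground mass and the attribution map as spatial distributions and penalize their KL divergence. The sigmoid head, tensor shapes, and the KL choice are assumptions; the paper's exact objective is not given in the summary above.

```python
import torch
import torch.nn.functional as F

def xai_alignment_loss(logits: torch.Tensor, xai_field: torch.Tensor) -> torch.Tensor:
    """KL between the spatial distribution of predicted foreground mass
    and a normalized encoder attribution map ("XAI field")."""
    b = logits.shape[0]
    pred = torch.sigmoid(logits).view(b, -1)
    pred = pred / pred.sum(dim=1, keepdim=True).clamp_min(1e-8)
    field = xai_field.view(b, -1)
    field = field / field.sum(dim=1, keepdim=True).clamp_min(1e-8)
    return F.kl_div(pred.clamp_min(1e-8).log(), field, reduction="batchmean")

logits = torch.randn(2, 1, 64, 64)   # segmentation logits
xai = torch.rand(2, 1, 64, 64)       # stand-in attribution map
print(xai_alignment_loss(logits, xai))
```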
[107] GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining
Deen Dayal Mohan, Hossein Souri, Vitali Petsiuk, Juhong Min, Gopal Sharma, Luowei Zhou, Suren Kumar
Main category: cs.CV
TL;DR: GoldiCLIP is a data-efficient vision-language model framework that achieves state-of-the-art performance using only 30M images (300x less data) through balanced supervision combining self-distillation, VQA objectives, and uncertainty-based loss weighting.
Details
Motivation: Current large-scale VLMs require billion-sample datasets, creating significant barriers to progress. Recent works improve supervision quality but only address subsets of weaknesses in contrastive pretraining. There's a need for a balanced approach that synergistically combines multiple supervision signals for data-efficient training.
Method: GoldiCLIP uses three key innovations: (1) text-conditioned self-distillation aligning both text-agnostic and text-conditioned features, (2) encoder-integrated decoder with VQA objective for generalization beyond caption-like queries, and (3) uncertainty-based weighting mechanism balancing heterogeneous losses (a loss-weighting sketch follows the abstract).
Result: Achieves state-of-the-art among data-efficient approaches: +2.2 points on MSCOCO retrieval, +2.0 on fine-grained retrieval, +5.9 on question-based retrieval. Remains competitive with billion-scale models while using only 30M images (300x less data).
Conclusion: GoldiCLIP demonstrates that balanced, multifaceted supervision can achieve strong vision-language alignment with dramatically less data, challenging the need for billion-scale datasets in VLM training.
Abstract: Until recently, the success of large-scale vision-language models (VLMs) has primarily relied on billion-sample datasets, posing a significant barrier to progress. Latest works have begun to close this gap by improving supervision quality, but each addresses only a subset of the weaknesses in contrastive pretraining. We present GoldiCLIP, a framework built on a Goldilocks principle of finding the right balance of supervision signals. Our multifaceted training framework synergistically combines three key innovations: (1) a text-conditioned self-distillation method to align both text-agnostic and text-conditioned features; (2) an encoder-integrated decoder with a Visual Question Answering (VQA) objective that enables the encoder to generalize beyond the caption-like queries; and (3) an uncertainty-based weighting mechanism that automatically balances all heterogeneous losses. Trained on just 30 million images, 300x less data than leading methods, GoldiCLIP achieves state-of-the-art among data-efficient approaches, improving over the best comparable baseline by 2.2 points on MSCOCO retrieval, 2.0 on fine-grained retrieval, and 5.9 on question-based retrieval, while remaining competitive with billion-scale models. Project page: https://petsi.uk/goldiclip.
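The paper's exact weighting rule is not public; the standard homoscedastic-uncertainty scheme of Kendall et al. is one common way to implement "uncertainty-based weighting", sketched below. The three loss names are placeholders only.

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Each loss i gets a learned log-variance s_i and contributes
    exp(-s_i) * L_i + s_i, so the balance is optimized with the model."""
    def __init__(self, n_losses: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_losses))

    def forward(self, losses):
        return sum(torch.exp(-s) * l + s for s, l in zip(self.log_vars, losses))

weigher = UncertaintyWeighting(3)         # e.g. contrastive, distillation, VQA
l_con, l_dist, l_vqa = torch.rand(3)      # stand-in loss values
print(weigher([l_con, l_dist, l_vqa]))
```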
[108] Attention-based Pin Site Image Classification in Orthopaedic Patients with External Fixators
Yubo Wang, Marie Fridberg, Anirejuoritse Bafor, Ole Rahbek, Christopher Iobst, Søren Vedding Kold, Ming Shen
Main category: cs.CV
TL;DR: A deep learning approach for classifying pin site wound infections from medical images using attention mechanisms and efficient convolution methods.
Details
Motivation: Pin site infections are common complications in patients with external fixators, causing pain and morbidity. Current identification and management methods need improvement to enhance patient experience.
Method: Proposes an attention-based deep learning model with Efficient Redundant Reconstruction Convolution (ERRC) to classify pin site images into infected (Group A) vs. non-infected (Group B) categories, focusing on relevant regions while minimizing distractions.
Result: The model achieves AUC of 0.975 and F1-score of 0.927 with only 5.77M parameters, outperforming baseline methods in differentiating infected vs. non-infected pin sites.
Conclusion: Deep learning shows strong potential for visual classification of pin site infections, aligning with healthcare professional assessments, though further validation with more data is needed.
Abstract: Pin sites represent the interface where a metal pin or wire from the external environment passes through the skin into the internal environment of the limb. These pins or wires connect an external fixator to the bone to stabilize the bone segments in a patient with trauma or deformity. Because these pin sites represent an opportunity for external skin flora to enter the internal environment of the limb, infections of the pin site are common. These pin site infections are painful, annoying, and cause increased morbidity for patients. Improving the identification and management of pin site infections would greatly enhance the patient experience when external fixators are used. To this end, this paper collects and produces a dataset of pin site wound infections and proposes a deep learning (DL) method to classify pin site images based on their appearance: Group A displayed signs of inflammation or infection, while Group B showed no evident complications. Unlike studies that primarily focus on open wounds, our research includes potential interventions at the metal pin/skin interface. Our attention-based deep learning model addresses this complexity by emphasizing relevant regions and minimizing distractions from the pins. Moreover, we introduce an Efficient Redundant Reconstruction Convolution (ERRC) method to enhance the richness of feature maps while reducing the number of parameters. Our model outperforms baseline methods with an AUC of 0.975 and an F1-score of 0.927, requiring only 5.77M parameters. These results highlight the potential of DL in differentiating pin sites based only on visual signs of infection, aligning with healthcare professional assessments, while further validation with more data remains essential.
[109] Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting
Alabi Mehzabin Anisha, Guangjing Wang, Sriram Chellappan
Main category: cs.CV
TL;DR: Novel adversarial attack framework that compromises both density-map and point-regression crowd counting models through multi-task loss optimization, achieving cross-paradigm transferability.
Details
Motivation: Existing adversarial attacks on crowd counting focus on single paradigms (density maps OR point regression), but cross-paradigm attacks remain unexplored despite security importance in crowd counting applications.
Method: Multi-task loss optimization combining: 1) scene-density-specific high-confidence logit suppression for point-regression models, 2) peak-targeted density map suppression for density-map models, and 3) model-agnostic perceptual constraints for imperceptible perturbations (a toy combined-loss sketch follows the abstract).
Result: Achieves 7X increase in Mean Absolute Error compared to clean images while maintaining visual quality, successfully transfers across 7 state-of-the-art crowd models with transfer ratios 0.55-1.69.
Conclusion: The framework demonstrates effective cross-paradigm adversarial attacks on crowd counting models, balancing attack effectiveness and imperceptibility better than existing transferable attack strategies.
Abstract: State-of-the-art crowd counting and localization are primarily modeled using two paradigms: density maps and point regression. Given the field’s security ramifications, there is active interest in model robustness against adversarial attacks. Recent studies have demonstrated transferability across density-map-based approaches via adversarial patches, but cross-paradigm attacks (i.e., across both density map-based models and point regression-based models) remain unexplored. We introduce a novel adversarial framework that compromises both density map and point regression architectural paradigms through a comprehensive multi-task loss optimization. For point-regression models, we employ scene-density-specific high-confidence logit suppression; for density-map approaches, we use peak-targeted density map suppression. Both are combined with model-agnostic perceptual constraints to ensure that perturbations are effective and imperceptible to the human eye. Extensive experiments demonstrate the effectiveness of our attack, achieving on average a 7X increase in Mean Absolute Error compared to clean images while maintaining competitive visual quality, and successfully transferring across seven state-of-the-art crowd models with transfer ratios ranging from 0.55 to 1.69. Our approach strikes a balance between attack effectiveness and imperceptibility compared to state-of-the-art transferable attack strategies. The source code is available at https://github.com/simurgh7/CrowdGen
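A hedged sketch of how the three loss terms might combine when optimizing a perturbation delta by gradient descent. The stand-in networks, the 0.1 confidence floor, the mean-based peak mask, and the L2 imperceptibility proxy (in place of the paper's perceptual constraints) are all assumptions.

```python
import torch
import torch.nn.functional as F

def attack_loss(delta, img, density_model, point_model, lam=10.0):
    """Suppress density-map peaks and confident point logits while keeping
    the perturbation small; minimizing this w.r.t. delta runs the attack."""
    adv = (img + delta).clamp(0, 1)
    density = density_model(adv)                    # (B,1,H,W) density map
    peaks = (density > density.mean()).float()      # crude peak mask
    l_density = (density * peaks).sum()             # push density peaks down
    l_point = F.relu(point_model(adv) - 0.1).sum()  # suppress confident points
    l_percep = delta.pow(2).mean()                  # imperceptibility proxy
    return l_density + l_point + lam * l_percep

img = torch.rand(1, 3, 128, 128)
delta = torch.zeros_like(img, requires_grad=True)
density_model = lambda x: x.mean(1, keepdim=True)   # stand-in networks
point_model = lambda x: x.flatten(1)[:, :100]
attack_loss(delta, img, density_model, point_model).backward()  # fills delta.grad
```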
[110] DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory Video Generation
Junyi Ouyang, Wenbin Teng, Gonglin Chen, Yajie Zhao, Haiwei Chen
Main category: cs.CV
TL;DR: DCARL: A divide-and-conquer autoregressive framework for long-trajectory video generation that combines structural stability with high-fidelity generation using keyframe anchors and interpolation.
Details
Motivation: Long-trajectory video generation is challenging due to scalability limitations of existing video diffusion models. Autoregressive models suffer from visual drift and poor controllability, while current approaches struggle with maintaining consistency over long sequences.
Method: Proposes DCARL framework with two components: 1) Keyframe Generator trained without temporal compression to establish long-range structural anchors, and 2) Interpolation Generator that synthesizes dense frames autoregressively with overlapping segments, using keyframes for global context and preceding frames for local coherence.
Result: Achieves superior performance in visual quality (lower FID and FVD) and camera adherence (lower ATE and ARE) compared to state-of-the-art baselines. Demonstrates stable, high-fidelity generation for long trajectory videos up to 32 seconds.
Conclusion: DCARL effectively addresses visual drift and controllability issues in long-trajectory video generation by combining divide-and-conquer structural stability with autoregressive high-fidelity generation.
Abstract: Long-trajectory video generation is a crucial yet challenging task for world modeling primarily due to the limited scalability of existing video diffusion models (VDMs). Autoregressive models, while offering infinite rollout, suffer from visual drift and poor controllability. To address these issues, we propose DCARL, a novel divide-and-conquer, autoregressive framework that effectively combines the structural stability of the divide-and-conquer scheme with the high-fidelity generation of VDMs. Our approach first employs a dedicated Keyframe Generator trained without temporal compression to establish long-range, globally consistent structural anchors. Subsequently, an Interpolation Generator synthesizes the dense frames in an autoregressive manner with overlapping segments, utilizing the keyframes for global context and a single clean preceding frame for local coherence. Trained on a large-scale internet long trajectory video dataset, our method achieves superior performance in both visual quality (lower FID and FVD) and camera adherence (lower ATE and ARE) compared to state-of-the-art autoregressive and divide-and-conquer baselines, demonstrating stable and high-fidelity generation for long trajectory videos up to 32 seconds in length.
[111] WAFT-Stereo: Warping-Alone Field Transforms for Stereo Matching
Yihan Wang, Jia Deng
Main category: cs.CV
TL;DR: WAFT-Stereo is a warping-based stereo matching method that eliminates the need for cost volumes, achieving state-of-the-art performance on major benchmarks while being significantly faster than existing methods.
Details
Motivation: The paper challenges the conventional reliance on cost volumes in stereo matching, arguing they are computationally expensive and unnecessary for achieving strong performance. The authors aim to develop a more efficient approach that maintains or improves accuracy.
Method: WAFT-Stereo replaces traditional cost volumes with a warping-based approach. It uses a simple yet effective warping mechanism to establish correspondences between stereo images, eliminating the computational overhead of constructing and processing 3D cost volumes (a minimal warping sketch follows the abstract).
Result: The method achieves state-of-the-art performance, ranking first on ETH3D, KITTI, and Middlebury benchmarks. It reduces zero-shot error by 81% on ETH3D while being 1.8-6.7x faster than competitive methods.
Conclusion: Cost volumes are not necessary for high-performance stereo matching. WAFT-Stereo demonstrates that warping-based approaches can achieve superior accuracy with significantly improved efficiency, offering a promising direction for stereo vision research.
Abstract: We introduce WAFT-Stereo, a simple and effective warping-based method for stereo matching. WAFT-Stereo demonstrates that cost volumes, a common design used in many leading methods, are not necessary for strong performance and can be replaced by warping with improved efficiency. WAFT-Stereo ranks first on ETH3D, KITTI and Middlebury public benchmarks, reducing the zero-shot error by 81% on ETH3D benchmark, while being 1.8-6.7x faster than competitive methods. Code and model weights are available at https://github.com/princeton-vl/WAFT-Stereo.
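To illustrate "warping alone", here is a minimal PyTorch sketch that warps the right image toward the left view with a disparity map via grid_sample; a photometric loss on the result can then drive iterative disparity updates. The shapes and the constant test disparity are illustrative; this is not the WAFT-Stereo architecture itself.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(right: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """Sample the right image at x - d(x) to reconstruct the left view.
    right: (B,3,H,W); disp: (B,1,H,W) in pixels."""
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs[None].float().expand(b, -1, -1) - disp[:, 0]   # shift by disparity
    ys = ys[None].float().expand(b, -1, -1)
    grid = torch.stack([2 * xs / (w - 1) - 1,               # normalize to [-1,1]
                        2 * ys / (h - 1) - 1], dim=-1)      # (B,H,W,2), (x,y)
    return F.grid_sample(right, grid, align_corners=True)

right = torch.rand(1, 3, 64, 64)
warped = warp_right_to_left(right, torch.full((1, 1, 64, 64), 4.0))
# a photometric error |left - warped| would supervise the disparity estimate
```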
[112] NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders
Katarina Trojachanec Dineva, Stefan Andonov, Ilinka Ivanoska, Ivan Kitanovski, Sasho Gramatikov, Tamara Kostova, Monika Simjanoska Misheva, Kostadin Mishev
Main category: cs.CV
TL;DR: Benchmark study of 20 multimodal LLMs for neuroimaging tasks shows technical attributes are nearly solved but diagnostic reasoning remains challenging, with proprietary models performing best but open-weight MedGemma showing promise.
Details
Motivation: Multimodal LLMs show promise for image-based decision support in healthcare, but their reliability and operational trade-offs in neuroimaging specifically remain insufficiently understood, requiring comprehensive benchmarking.
Method: Comprehensive benchmarking of 20 frontier multimodal models using curated MRI/CT datasets covering multiple sclerosis, stroke, brain tumors, and other abnormalities. Models generate multiple outputs simultaneously (diagnosis, subtype, modality, sequence, plane). Evaluated across four dimensions: discriminative classification with abstention, calibration, structured-output validity, and computational efficiency using a multi-phase framework to control selection bias.
Result: Technical imaging attributes (modality, plane) are nearly solved, but diagnostic reasoning (especially subtype prediction) remains challenging. Tumor classification most reliable, stroke moderately solvable, multiple sclerosis and rare abnormalities difficult. Few-shot prompting improves performance but increases token usage, latency, and cost. Gemini-2.5-Pro and GPT-5-Chat achieve strongest diagnostic performance; Gemini-2.5-Flash offers best efficiency-performance trade-off. Open-weight MedGemma-1.5-4B approaches proprietary model performance with few-shot prompting while maintaining perfect structured output.
Conclusion: The study provides practical insights into performance, reliability, and efficiency trade-offs for multimodal LLMs in neuroimaging, supporting standardized evaluation and showing that while technical attributes are well-handled, diagnostic reasoning remains a significant challenge requiring further research.
Abstract: Recent advances in multimodal large language models enable new possibilities for image-based decision support. However, their reliability and operational trade-offs in neuroimaging remain insufficiently understood. We present a comprehensive benchmarking study of vision-enabled large language models for 2D neuroimaging using curated MRI and CT datasets covering multiple sclerosis, stroke, brain tumors, other abnormalities, and normal controls. Models are required to generate multiple outputs simultaneously, including diagnosis, diagnosis subtype, imaging modality, specialized sequence, and anatomical plane. Performance is evaluated across four directions: discriminative classification with abstention, calibration, structured-output validity, and computational efficiency. A multi-phase framework ensures fair comparison while controlling for selection bias. Across twenty frontier multimodal models, the results show that technical imaging attributes such as modality and plane are nearly solved, whereas diagnostic reasoning, especially subtype prediction, remains challenging. Tumor classification emerges as the most reliable task, stroke is moderately solvable, while multiple sclerosis and rare abnormalities remain difficult. Few-shot prompting improves performance for several models but increases token usage, latency, and cost. Gemini-2.5-Pro and GPT-5-Chat achieve the strongest overall diagnostic performance, while Gemini-2.5-Flash offers the best efficiency-performance trade-off. Among open-weight architectures, MedGemma-1.5-4B demonstrates the most promising results, as under few-shot prompting, it approaches the zero-shot performance of several proprietary models, while maintaining perfect structured output. These findings provide practical insights into performance, reliability, and efficiency trade-offs, supporting standardized evaluation of multimodal LLMs in neuroimaging.
[113] CORA: A Pathology Synthesis Driven Foundation Model for Coronary CT Angiography Analysis and MACE Risk Assessment
Jinkui Hao, Gorkem Durak, Halil Ertugrul Aktas, Ulas Bagci, Bradley D. Allen, Nilay S. Shah, Bo Zhou
Main category: cs.CV
TL;DR: CORA is a 3D vision foundation model for cardiovascular risk assessment that learns from unlabeled CCTA scans using pathology-centric self-supervised learning with simulated vascular abnormalities, outperforming state-of-the-art models and enabling multimodal risk stratification when combined with language models.
Details
Motivation: Clinical translation of automated CCTA analysis is limited by scarce expert annotations, and existing self-supervised methods fail to capture localized pathological features of coronary plaques, necessitating a pathology-centric approach.
Method: Uses anatomy-guided lesion synthesis engine to create simulated vascular abnormalities, training on 12,801 unlabeled CCTA volumes via synthesis-driven self-supervised framework that biases learning toward clinically relevant disease features rather than background anatomy.
Result: Outperformed state-of-the-art 3D vision foundation models across diagnostic tasks (plaque characterization, stenosis detection, coronary artery segmentation) with up to 29% performance gain; multimodal extension with LLM significantly improved 30-day MACE risk stratification.
Conclusion: CORA establishes a scalable foundation for unified anatomical assessment and cardiovascular risk prediction, demonstrating the value of pathology-centric self-supervised learning for medical imaging.
Abstract: Coronary artery disease, the leading cause of cardiovascular mortality worldwide, can be assessed non-invasively by coronary computed tomography angiography (CCTA). Despite progress in automated CCTA analysis using deep learning, clinical translation is constrained by the scarcity of expert-annotated datasets. Furthermore, widely adopted label-free pretraining strategies, such as masked image modeling, are intrinsically biased toward global anatomical statistics, frequently failing to capture the spatially localized pathological features of coronary plaques. Here, we introduce CORA, a 3D vision foundation model for comprehensive cardiovascular risk assessment. CORA learns directly from volumetric CCTA via a pathology-centric, synthesis-driven self-supervised framework. By utilizing an anatomy-guided lesion synthesis engine, the model is explicitly trained to detect simulated vascular abnormalities, biasing representation learning toward clinically relevant disease features rather than dominant background anatomy. We trained CORA on a large-scale cohort of 12,801 unlabeled CCTA volumes and comprehensively evaluated the model across multi-center datasets from nine independent hospitals. Across diagnostic and anatomical tasks, including plaque characterization, stenosis detection, and coronary artery segmentation, CORA consistently outperformed the state-of-the-art 3D vision foundation models, achieving up to a 29% performance gain. Crucially, by coupling the imaging encoder with a large language model, we extended CORA into a multimodal framework that significantly improved 30-day major adverse cardiac event (MACE) risk stratification. Our results establish CORA as a scalable and extensible foundation for unified anatomical assessment and cardiovascular risk prediction.
[114] Towards automatic smoke detector inspection: Recognition of the smoke detectors in industrial facilities and preparation for future drone integration
Lukas Kratochvila, Jakub Stefansky, Simon Bilik, Robert Rous, Tomas Zemcik, Michal Wolny, Frantisek Rusnak, Ondrej Cech, Karel Horak
Main category: cs.CV
TL;DR: Smoke detector recognition system using object detection models (YOLOv11, SSD, RT-DETRv2) for drone-based automatic inspection, with evaluation on real and synthetic data.
Details
Motivation: Automating smoke detector inspection is needed because manual inspection is difficult due to high ceilings/dangerous locations, and an automatic system could make the process faster, safer, and cheaper.
Method: Compared convolutional-based object detectors (YOLOv11, SSD) with transformer-based RT-DETRv2 using different backbone sizes. Used real and semi-synthetic training data with various augmentation methods, and evaluated on two test datasets with challenging conditions.
Result: YOLOv11n achieved the best performance with an average mAP@0.5 score of 0.884. Code, pretrained models, and the dataset are publicly available.
Conclusion: The smoke detector recognition system using YOLOv11n shows promising results for integration into drone-based automatic inspection systems, addressing practical challenges in fire safety maintenance.
Abstract: Fire safety is a complex pipeline and an important topic of concern. Among its first lines of defense are smoke detectors, which are supposed to raise an alarm before a massive fire develops. As they are often difficult to reach due to high ceilings or problematic locations, an automatic inspection system would be very beneficial: it could allow faster revisions, spare workers dangerous work at heights, and make the whole process cheaper. In this study, we present the smoke detector recognition part of such an automatic inspection system, which could easily be integrated into a drone system. As part of our research, we compare two popular convolutional object detectors widely used on embedded devices, YOLOv11 and SSD, with the state-of-the-art transformer-based RT-DETRv2, using backbones of different sizes. Because collecting a sufficient amount of real-world training data is complicated, we also compare several training strategies using real and semi-synthetic data together with various augmentation methods. For robust testing, all models were evaluated on two test datasets covering both expected and difficult appearances of the smoke detectors, including motion blur, low resolution, and incomplete objects. The best performing detector is YOLOv11n, which reaches an average mAP@0.5 score of 0.884. Our code, pretrained models, and dataset are publicly available.
[115] OptiSAR-Net++: A Large-Scale Benchmark and Transformer-Free Framework for Cross-Domain Remote Sensing Visual Grounding
Xiaoyu Tang, Jun Dong, Jintao Cheng, Rui Fan
Main category: cs.CV
TL;DR: OptiSAR-Net++: A novel framework for cross-domain remote sensing visual grounding that handles both optical and SAR images using efficient cross-modal matching and domain adaptation techniques.
Details
Motivation: Existing remote sensing visual grounding methods are limited to single-sensor domains (optical or SAR only), which restricts real-world applicability. The authors aim to address cross-domain RSVG challenges including feature modeling, computational inefficiency, and fine-grained semantic discrimination.
Method: Proposes OptiSAR-Net++ with: 1) Patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for efficient cross-domain feature decoupling; 2) CLIP-based contrastive paradigm with dynamic adversarial negative sampling to transform generative regression into efficient cross-modal matching; 3) Text-guided dual-gate fusion module (TGDF-SSA) and region-aware auxiliary head for enhanced semantic-visual alignment and spatial modeling (a toy LoRA-mixture sketch follows the abstract).
Result: Achieves state-of-the-art performance on both OptSAR-RSVG (new cross-domain benchmark) and DIOR-RSVG benchmarks, with significant advantages in localization accuracy and efficiency.
Conclusion: The proposed OptiSAR-Net++ effectively addresses cross-domain RSVG challenges and demonstrates superior performance through novel architectural designs for domain adaptation, computational efficiency, and semantic alignment.
Abstract: Remote sensing visual grounding (RSVG) aims to localize specific targets in remote sensing images using natural language expressions. However, existing methods are restricted to single-sensor domains, i.e., either optical or synthetic aperture radar (SAR), limiting their real-world applicability. In this paper, we introduce the Cross-Domain RSVG (CD-RSVG) task and construct OptSAR-RSVG, the first large-scale benchmark dataset for this setting. To tackle the challenges of cross-domain feature modeling, computational inefficiency, and fine-grained semantic discrimination, we propose OptiSAR-Net++. Our framework features a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for efficient cross-domain feature decoupling. To mitigate the substantial computational overhead of Transformer decoding frameworks, we adopt a CLIP-based contrastive paradigm and further incorporate dynamic adversarial negative sampling, thereby transforming generative regression into an efficient cross-modal matching process. Additionally, a text-guided dual-gate fusion module (TGDF-SSA) and a region-aware auxiliary head are introduced to enhance semantic-visual alignment and spatial modeling. Extensive experiments demonstrate that OptiSAR-Net++ achieves SOTA performance on both OptSAR-RSVG and DIOR-RSVG benchmarks, offering significant advantages in localization accuracy and efficiency. Our code and dataset will be made publicly available.
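A compact sketch of the patch-level LoRA mixture-of-experts idea: a frozen base projection plus a per-token softmax router over low-rank adapters. The dimensions, expert count, and dense (non-top-k) routing are assumptions; PL-MoE's actual design is not detailed in the summary above.

```python
import torch
import torch.nn as nn

class PatchLoRAMoE(nn.Module):
    """Frozen linear layer + router-weighted mixture of low-rank adapters,
    applied independently to each patch token."""
    def __init__(self, dim=256, rank=8, n_experts=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)              # frozen backbone weight
        self.down = nn.Parameter(torch.randn(n_experts, dim, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(n_experts, rank, dim))
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x):                            # x: (B, N_patches, dim)
        gates = self.router(x).softmax(-1)           # (B, N, E) per-patch gates
        lora = torch.einsum("bnd,edr,erk->bnek", x, self.down, self.up)
        return self.base(x) + (gates.unsqueeze(-1) * lora).sum(dim=2)

print(PatchLoRAMoE()(torch.randn(2, 49, 256)).shape)  # torch.Size([2, 49, 256])
```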
[116] SurgPhase: Time efficient pituitary tumor surgery phase recognition via an interactive web platform
Yan Meng, Jack Cook, X. Y. Han, Kaan Duman, Shauna Otto, Dhiraj Pangal, Jonathan Chainey, Ruth Lau, Margaux Masson-Forsythe, Daniel A. Donoho, Danielle Levy, Gabriel Zada, Sébastien Froelich, Juan Fernandez-Miranda, Mike Chang
Main category: cs.CV
TL;DR: A framework for surgical phase recognition in pituitary tumor surgery videos using self-supervised learning, temporal modeling, and a collaborative platform for data collection and model improvement.
Details
Motivation: Accurate surgical phase recognition is essential for analyzing procedural workflows, supporting intraoperative decision-making, and enabling data-driven improvements in surgical education and performance evaluation.
Method: Combines self-supervised representation learning (pretraining ResNet-50 on 251 unlabeled videos), robust temporal modeling, and scalable data annotation strategies. Uses a collaborative online platform for surgeons to upload videos and receive automated analysis. Fine-tuning incorporates focal loss, gradual layer unfreezing, and dynamic sampling to address class imbalance (a focal-loss sketch follows the abstract).
Result: Achieves 90% accuracy on a held-out test set, outperforming current state-of-the-art approaches and demonstrating strong generalization across variable surgical cases.
Conclusion: The framework successfully enables accurate surgical phase recognition through a combination of self-supervised learning, temporal modeling, and collaborative data collection, with potential applications in surgical education and performance evaluation.
Abstract: Accurate surgical phase recognition is essential for analyzing procedural workflows, supporting intraoperative decision-making, and enabling data-driven improvements in surgical education and performance evaluation. In this work, we present a comprehensive framework for phase recognition in pituitary tumor surgery (PTS) videos, combining self-supervised representation learning, robust temporal modeling, and scalable data annotation strategies. Our method achieves 90% accuracy on a held-out test set, outperforming current state-of-the-art approaches and demonstrating strong generalization across variable surgical cases. A central contribution of this work is the integration of a collaborative online platform designed for surgeons to upload surgical videos, receive automated phase analysis, and contribute to a growing dataset. This platform not only facilitates large-scale data collection but also fosters knowledge sharing and continuous model improvement. To address the challenge of limited labeled data, we pretrain a ResNet-50 model using the self-supervised framework on 251 unlabeled PTS videos, enabling the extraction of high-quality feature representations. Fine-tuning is performed on a labeled dataset of 81 procedures using a modified training regime that incorporates focal loss, gradual layer unfreezing, and dynamic sampling to address class imbalance and procedural variability.
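Of the training tricks listed, focal loss is the most self-contained; below is its standard binary form (Lin et al.), which down-weights easy examples so rare classes contribute more gradient. The gamma/alpha values are the usual defaults, not necessarily the paper's settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: scales cross-entropy by (1 - p_t)^gamma."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.randn(8)
targets = torch.tensor([1., 0., 0., 0., 0., 0., 0., 1.])  # imbalanced labels
print(focal_loss(logits, targets))
```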
[117] Self-Supervised Learning for Knee Osteoarthritis: Diagnostic Limitations and Prognostic Value of Uncurated Hospital Data
Haresh Rengaraj Rajamohan, Yuxuan Chen, Kyunghyun Cho, Cem M. Deniz
Main category: cs.CV
TL;DR: Self-supervised learning (SSL) on knee radiographs shows mixed results: image-only SSL helps linear probing but not fine-tuning for diagnosis, while multimodal image-text SSL trained on biased hospital data improves prognosis prediction but not diagnosis due to data distribution mismatch.
Details
Motivation: To evaluate whether self-supervised learning (SSL) improves knee osteoarthritis modeling compared to standard ImageNet-pretrained initialization, specifically examining both diagnostic (KL grade prediction) and prognostic (4-year structural incidence/progression) tasks.
Method: Compared two SSL approaches: (1) image-only SSL pretrained on knee radiographs from multiple cohorts (OAI, MOST, NYU), and (2) multimodal image-text SSL pretrained on uncurated hospital knee radiographs paired with radiologist impressions. Evaluated on diagnostic KL grade prediction and prognostic modeling of 4-year structural outcomes (a linear-probing sketch follows the abstract).
Result: For diagnosis: SSL showed mixed results - image-only SSL improved accuracy during linear probing but not during full fine-tuning compared to ImageNet. Multimodal SSL failed to improve grading due to severe bias in pretraining data (93% KL grade 3). For prognosis: Multimodal SSL significantly outperformed ImageNet baselines in predicting 4-year structural incidence/progression, achieving AUROC 0.701 vs 0.599 at 10% labeled data on external validation.
Conclusion: Uncurated hospital image-text data may be ineffective for learning diagnosis due to severity bias, but provides strong signal for prognostic modeling when downstream task aligns with pretraining data distribution. The effectiveness of SSL depends on alignment between pretraining data characteristics and downstream task requirements.
Abstract: This study assesses whether self-supervised learning (SSL) improves knee osteoarthritis (OA) modeling for diagnosis and prognosis relative to ImageNet-pretrained initialization. We compared (i) image-only SSL pretrained on knee radiographs from the OAI, MOST, and NYU cohorts, and (ii) multimodal image-text SSL pretrained on uncurated hospital knee radiographs paired with radiologist impressions. For diagnostic Kellgren-Lawrence (KL) grade prediction, SSL offered mixed results. While image-only SSL improved accuracy during linear probing (frozen encoder), it did not outperform ImageNet pretraining during full fine-tuning. Similarly, multimodal SSL failed to improve grading performance. We attribute this to severe bias in the uncurated hospital pretraining corpus (93% estimated KL grade 3), which limited alignment with the balanced diagnostic task. In contrast, this same multimodal initialization significantly improved prognostic modeling. It outperformed ImageNet baselines in predicting 4-year structural incidence and progression, including on external validation (MOST AUROC: 0.701 vs. 0.599 at 10% labeled data). Overall, while uncurated hospital image-text data may be ineffective for learning diagnosis due to severity bias, it provides a strong signal for prognostic modeling when the downstream task aligns with the pretraining data distribution.
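For readers unfamiliar with the linear-probing protocol referenced above: freeze the encoder, extract features once, and fit a single linear classifier. The random arrays below stand in for frozen ResNet-50 features and KL-grade labels; only the protocol, not the data, is real.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, 2048)), rng.integers(0, 5, 800)
X_test, y_test = rng.normal(size=(200, 2048)), rng.integers(0, 5, 200)

probe = LogisticRegression(max_iter=1000)  # one linear layer on frozen features
probe.fit(X_train, y_train)
print("linear-probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```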
[118] ICTPolarReal: A Polarized Reflection and Material Dataset of Real World Objects
Jing Yang, Krithika Dharanikota, Emily Jia, Haiwei Chen, Yajie Zhao
Main category: cs.CV
TL;DR: A large-scale polarized reflection dataset of 218 real-world objects captured with a Light Stage, providing over 1.2M images with diffuse-specular separation and material attributes for improving inverse rendering models.
Details
Motivation: Existing inverse rendering approaches rely on synthetic datasets with simplified illumination and limited material realism, preventing generalization to real-world images. There's a scarcity of real measured reflectance data needed for accurate material modeling.
Method: Created a large-scale polarized reflection dataset using an 8-camera, 346-light Light Stage with cross/parallel polarization. Captured 218 everyday objects across five acquisition dimensions: multiview, multi-illumination, polarization, reflectance separation, and material attributes (a polarization-separation sketch follows the abstract).
Result: Produced over 1.2M high-resolution images with diffuse-specular separation and analytically derived diffuse albedo, specular albedo, and surface normals. Demonstrated significant improvements in material separation, illumination fidelity, and geometric consistency for inverse and forward rendering models.
Conclusion: The dataset establishes a new foundation for physically grounded material understanding and enables real-world generalization beyond synthetic training regimes for inverse rendering tasks.
Abstract: Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data. Existing approaches rely heavily on synthetic datasets with simplified illumination and limited material realism, preventing models from generalizing to real-world images. We introduce a large-scale polarized reflection and material dataset of real-world objects, captured with an 8-camera, 346-light Light Stage equipped with cross/parallel polarization. Our dataset spans 218 everyday objects across five acquisition dimensions (multiview, multi-illumination, polarization, reflectance separation, and material attributes), yielding over 1.2M high-resolution images with diffuse-specular separation and analytically derived diffuse albedo, specular albedo, and surface normals. Using this dataset, we train and evaluate state-of-the-art inverse and forward rendering models on intrinsic decomposition, relighting, and sparse-view 3D reconstruction, demonstrating significant improvements in material separation, illumination fidelity, and geometric consistency. We hope that our work can establish a new foundation for physically grounded material understanding and enable real-world generalization beyond synthetic training regimes. Project page: https://jingyangcarl.github.io/ICTPolarReal/
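The diffuse-specular separation rests on a standard polarization argument: specular reflection preserves the illumination's polarization (so a crossed polarizer blocks it), while diffuse reflection is depolarized and passes both polarizer states roughly equally. Below is a common back-of-envelope separation under that assumption, not necessarily the dataset's exact processing.

```python
import numpy as np

def separate_reflectance(parallel: np.ndarray, cross: np.ndarray):
    """Cross-polarized image keeps ~half the diffuse light and no specular;
    the parallel image adds the specular component on top."""
    specular = np.clip(parallel - cross, 0.0, None)
    diffuse = 2.0 * cross          # each polarizer state passes ~half of diffuse
    return diffuse, specular

parallel = np.random.rand(64, 64, 3)                       # stand-in captures
cross = parallel * np.random.uniform(0.2, 0.5, size=(64, 64, 3))
diffuse, specular = separate_reflectance(parallel, cross)
```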
[119] TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization
Xuepeng Jing, Wenhuan Lu, Hao Meng, Zhizhi Yu, Jianguo Wei
Main category: cs.CV
TL;DR: TIGFlow-GRPO: A two-stage generative framework combining Conditional Flow Matching with behavioral rule alignment for human trajectory forecasting, using trajectory-interaction graphs and GRPO post-training for socially compliant and physically feasible predictions.
Details
Motivation: Existing trajectory forecasting methods focus primarily on supervised fitting, which may insufficiently reflect social norms and scene constraints in generated trajectories. There's a need to better align flow-based trajectory generation with behavioral rules for more realistic predictions in complex visual environments.
Method: Two-stage framework: 1) CFM-based predictor with Trajectory-Interaction-Graph (TIG) module to model fine-grained visual-spatial interactions and context encoding; 2) Flow-GRPO post-training reformulating deterministic flow rollout as stochastic ODE-to-SDE sampling for trajectory exploration, with composite reward combining social compliance and physical feasibility.
Result: Experiments on ETH/UCY and SDD datasets show improved forecasting accuracy and long-horizon stability while generating trajectories that are more socially compliant and physically feasible compared to existing methods.
Conclusion: The proposed framework effectively connects flow-based trajectory modeling with behavior-aware alignment in dynamic multimedia environments, providing a promising approach for generating realistic human trajectories that respect social norms and physical constraints.
Abstract: Human trajectory forecasting is important for intelligent multimedia systems operating in visually complex environments, such as autonomous driving and crowd surveillance. Although Conditional Flow Matching (CFM) has shown strong ability in modeling trajectory distributions from spatio-temporal observations, existing approaches still focus primarily on supervised fitting, which may leave social norms and scene constraints insufficiently reflected in generated trajectories. To address this issue, we propose TIGFlow-GRPO, a two-stage generative framework that aligns flow-based trajectory generation with behavioral rules. In the first stage, we build a CFM-based predictor with a Trajectory-Interaction-Graph (TIG) module to model fine-grained visual-spatial interactions and strengthen context encoding. This stage captures both agent-agent and agent-scene relations more effectively, providing more informative conditional features for subsequent alignment. In the second stage, we perform Flow-GRPO post-training, where deterministic flow rollout is reformulated as stochastic ODE-to-SDE sampling to enable trajectory exploration, and a composite reward combines view-aware social compliance with map-aware physical feasibility. By evaluating trajectories explored through SDE rollout, GRPO progressively steers multimodal predictions toward behaviorally plausible futures. Experiments on the ETH/UCY and SDD datasets show that TIGFlow-GRPO improves forecasting accuracy and long-horizon stability while generating trajectories that are more socially compliant and physically feasible. These results suggest that the proposed framework provides an effective way to connect flow-based trajectory modeling with behavior-aware alignment in dynamic multimedia environments.
[120] Infinite Gaze Generation for Videos with Autoregressive Diffusion
Jenna Kang, Colin Groth, Tong Wu, Finley Torrens, Patsorn Sangkloy, Gordon Wetzstein, Qi Sun
Main category: cs.CV
TL;DR: Generative framework using autoregressive diffusion model to predict infinite-horizon raw gaze trajectories in videos, conditioned on saliency-aware visual latent space.
Details
Motivation: Traditional gaze prediction methods (saliency maps and scanpaths) collapse fine-grained temporal dynamics and are limited to short-term windows, failing to capture long-range behavioral dependencies in real-world video content.
Method: Autoregressive diffusion model that synthesizes gaze trajectories with continuous spatial coordinates and high-resolution timestamps, conditioned on a saliency-aware visual latent space for videos of arbitrary length.
Result: Significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism in quantitative and qualitative evaluations.
Conclusion: Proposed generative framework enables infinite-horizon raw gaze prediction in videos, capturing fine-grained temporal dynamics and long-range behavioral dependencies better than traditional methods.
Abstract: Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows ($\approx$ 3-5s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. By leveraging an autoregressive diffusion model, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Our model is conditioned on a saliency-aware visual latent space. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.
[121] Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models
Peiju Liu, Jinming Liu, Xipeng Qiu, Xuanjing Huang
Main category: cs.CV
TL;DR: TIES: A dynamic token selection framework for Vision-Language-Action models that reduces visual tokens by 78% while improving task performance through tau-guided inter-layer efficient selection.
Details
Motivation: VLA models suffer from high inference latency due to processing dense visual tokens. Existing token reduction methods use static attention-based selection, but high-attention tokens are task-dependent and can actually degrade policy performance.
Method: TIES (Tau-guided Inter-layer Efficient Selection) uses dynamic token selection guided by inter-layer token ranking consistency. It adaptively balances attention magnitude with ranking consistency to ensure robust token selection without requiring additional training (a rank-consistency sketch follows the abstract).
Result: On CogACT + SIMPLER benchmark, TIES improves average success rates by 6% while reducing token usage by 78%. Demonstrates strong generalization across diverse decoders and benchmarks.
Conclusion: TIES provides an effective dynamic token selection framework that addresses limitations of static attention-based methods, significantly improving efficiency and performance in VLA models for robotic manipulation.
Abstract: Vision-Language-Action (VLA) models excel in robotic manipulation but suffer from significant inference latency due to processing dense visual tokens. Existing token reduction methods predominantly rely on attention magnitude as a static selection criterion. In this work, we challenge this assumption, revealing that high-attention tokens are task-dependent and can even degrade policy performance. To address this, we introduce \textbf{TIES} (\textbf{T}au-guided \textbf{I}nter-layer \textbf{E}fficient \textbf{S}election), a dynamic framework guided by inter-layer token ranking consistency. By adaptively balancing attention magnitude with ranking consistency, TIES ensures robust token selection without requiring additional training. On the CogACT + SIMPLER benchmark, TIES improves average success rates by 6% while reducing token usage by 78%, and demonstrates strong generalization across diverse decoders and benchmarks.
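The exact tau-guided rule is not spelled out above, so the sketch below shows one plausible instantiation: rank tokens per layer, treat low rank variance across layers as "consistency", and blend it with mean attention magnitude. The blending weight and the consistency proxy are assumptions, not TIES itself.

```python
import numpy as np

def select_tokens(attn_layers: np.ndarray, keep: int, beta: float = 0.5):
    """attn_layers: (L, N) attention each visual token receives per layer.
    Score tokens by magnitude blended with inter-layer ranking stability."""
    L, N = attn_layers.shape
    ranks = attn_layers.argsort(axis=1).argsort(axis=1) / (N - 1)  # in [0,1]
    mag = attn_layers.mean(axis=0)
    mag = (mag - mag.min()) / (mag.max() - mag.min() + 1e-8)
    consistency = 1.0 - ranks.std(axis=0)         # stable rank across layers
    score = beta * mag + (1 - beta) * consistency
    return np.argsort(score)[::-1][:keep]

attn = np.random.rand(12, 196)                    # 12 layers, 196 visual tokens
print(select_tokens(attn, keep=43))               # keep ~22% of the tokens
```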
[122] BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation
Yasong Dai, Zeeshan Hayder, David Ahmedt-Aristizabal, Hongdong Li
Main category: cs.CV
TL;DR: BiFM is a unified framework for joint generation and inversion in diffusion/flow matching models that learns bidirectional velocity fields for improved few-step image editing.
Details
Motivation: Existing few-step inversion methods for diffusion/flow matching models suffer from poor forward-process approximation and rely on pretrained generators and auxiliary modules, limiting scalability and generalization across architectures.
Method: BiFM jointly learns generation and inversion by directly estimating average velocity fields in both “image→noise” and “noise→image” directions, constrained by a shared instantaneous velocity field. Uses continuous time-interval supervision with bidirectional consistency objective and lightweight time-interval embedding.
Result: BiFM consistently outperforms existing few-step approaches across diverse image editing and generation tasks, achieving superior performance and editability with one-step inversion capability.
Conclusion: BiFM provides a unified framework that enables efficient few-step inversion and generation, improving editing quality while maintaining scalability across different diffusion/flow matching architectures.
Abstract: Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both “image $\to$ noise” and “noise $\to$ image” directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability.
[123] Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation
ZeBin Ji, Yang Hu, Xiuli Bi, Bo Liu, Bin Xiao
Main category: cs.CV
TL;DR: A framework for interpreting neuron functionality in neural networks by selecting meaningful neurons, generating concept hypotheses, and verifying those concepts through activation testing.
Details
Motivation: Existing neuron interpretation methods assume all neurons have well-defined functions, but some neurons are redundant or misleading, causing misinterpretations of neural network decisions.
Method: Select-Hypothesize-Verify framework: 1) Select activation samples capturing neuron’s functional behavior via activation-distribution analysis, 2) Form concept hypotheses for selected neurons, 3) Verify concepts by checking if they highly activate corresponding neurons (see the code sketch after the abstract).
Result: Method produces more accurate neuron concepts; generated concepts activate corresponding neurons with probability ~1.5 times higher than current state-of-the-art method.
Conclusion: The verification-based framework improves neuron concept interpretation by filtering out misleading neurons and ensuring generated concepts accurately reflect neuron functionality.
Abstract: Interpreting the functionality (also known as concepts) of neurons is essential for understanding neural network decisions. Existing approaches describe neuron concepts by generating natural language descriptions, thereby advancing the understanding of the neural network’s decision-making mechanism. However, these approaches assume that each neuron has well-defined functions and provides discriminative features for neural network decision-making. In fact, some neurons may be redundant or may offer misleading concepts. Thus, the descriptions for such neurons may cause misinterpretations of the factors driving the neural network’s decisions. To address the issue, we introduce a verification of neuron functions, which checks whether the generated concept highly activates the corresponding neuron. Furthermore, we propose a Select-Hypothesize-Verify framework for interpreting neuron functionality. This framework consists of: 1) selecting activation samples that best capture a neuron’s well-defined functional behavior through activation-distribution analysis; 2) forming hypotheses about concepts for the selected neurons; and 3) verifying whether the generated concepts accurately reflect the functionality of the neuron. Extensive experiments show that our method produces more accurate neuron concepts. Our generated concepts activate the corresponding neurons with a probability approximately 1.5 times that of the current state-of-the-art method.
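A minimal sketch of what the verification step could look like, assuming access to a model exposing per-neuron activations; the acceptance rule (mean concept activation exceeding the baseline by a margin of baseline standard deviations) is an illustrative assumption, not the paper's criterion.

```python
# Hypothetical verification step: accept a concept hypothesis only if images of
# that concept activate the target neuron well above a random-image baseline.
import torch

@torch.no_grad()
def verify_concept(model, neuron_idx, concept_images, baseline_images, margin=2.0):
    """Accept if mean concept activation exceeds the baseline mean by
    `margin` baseline standard deviations."""
    act_c = model(concept_images)[:, neuron_idx]
    act_b = model(baseline_images)[:, neuron_idx]
    return bool((act_c.mean() - act_b.mean()) > margin * (act_b.std() + 1e-8))

# Toy usage with a stand-in "network" exposing 128 neuron activations.
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
accepted = verify_concept(net, neuron_idx=7,
                          concept_images=torch.randn(16, 3, 32, 32),
                          baseline_images=torch.randn(64, 3, 32, 32))
```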
[124] Self-Corrected Image Generation with Explainable Latent Rewards
Yinyi Luo, Hrishikesh Gokhale, Marios Savvides, Jindong Wang, Shengfeng He
Main category: cs.CV
TL;DR: xLARD is a self-correcting framework that uses multimodal LLMs to guide text-to-image generation through explainable latent rewards, improving alignment with complex prompts.
Details
Motivation: Text-to-image generation struggles with aligning outputs to complex prompts, especially for fine-grained semantics and spatial relations. While generation is challenging, evaluating generated images is easier. This asymmetry motivates using multimodal LLMs to guide generation through feedback.
Method: xLARD uses a lightweight corrector that refines latent representations based on structured feedback from model-generated references. It creates a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations.
Result: Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors.
Conclusion: xLARD enables models to understand, assess, and correct themselves during generation, addressing the challenge of aligning complex prompts in text-to-image generation.
Abstract: Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.
[125] PASDiff: Physics-Aware Semantic Guidance for Joint Real-world Low-Light Face Enhancement and Restoration
Yilin Ni, Wenjie Li, Zhengxue Wang, Juncheng Li, Guangwei Gao, Jian Yang
Main category: cs.CV
TL;DR: PASDiff: Physics-Aware Semantic Diffusion for low-light face restoration using inverse intensity weighting, Retinex theory, and Style-Agnostic Structural Injection to recover illumination, color, and facial details without training.
Details
Motivation: Real-world low-light face images suffer multiple degradations (low illumination, blur, noise, low visibility). Existing cascaded solutions suffer from error accumulation, while generic joint models lack explicit facial priors and struggle to reconstruct clear face structures.
Method: Proposes PASDiff with a training-free approach: 1) Uses inverse intensity weighting and Retinex theory for photometric constraints to recover visibility and natural chromaticity, 2) Style-Agnostic Structural Injection (SASI) extracts structures from off-the-shelf facial priors while filtering photometric biases, harmonizing identity with physical constraints.
Result: Significantly outperforms existing methods, achieving superior balance among natural illumination, color recovery, and identity consistency. Constructs WildDark-Face benchmark of 700 real-world low-light facial images with complex degradations.
Conclusion: PASDiff effectively addresses low-light face restoration by combining physics-aware constraints with semantic facial priors, demonstrating state-of-the-art performance on real-world low-light face images.
Abstract: Face images captured in real-world low light suffer from multiple degradations: low illumination, blur, noise, and low visibility. Existing cascaded solutions often suffer from severe error accumulation, while generic joint models lack explicit facial priors and struggle to resolve clear face structures. In this paper, we propose PASDiff, a Physics-Aware Semantic Diffusion that operates in a training-free manner. To achieve a plausible illumination and color distribution, we leverage inverse intensity weighting and Retinex theory to introduce photometric constraints, thereby reliably recovering visibility and natural chromaticity. To faithfully reconstruct facial details, our Style-Agnostic Structural Injection (SASI) extracts structures from an off-the-shelf facial prior while filtering out its intrinsic photometric biases, seamlessly harmonizing identity features with physical constraints. Furthermore, we construct WildDark-Face, a real-world benchmark of 700 low-light facial images with complex degradations. Extensive experiments demonstrate that PASDiff significantly outperforms existing methods, achieving a superior balance among natural illumination, color recovery, and identity consistency.
[126] MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models
Dohwan Ko, Jinyoung Park, Seoung Choi, Sanghyeok Lee, Seohyun Lee, Hyunwoo J. Kim
Main category: cs.CV
TL;DR: MoE-GRPO: Reinforcement learning framework for optimizing expert routing in MoE-based Vision-Language Models to improve diversity and mitigate expert overfitting.
Details
Motivation: Current deterministic top-K routing in MoE-based VLMs may overlook optimal expert combinations and lead to expert overfitting. Need for more diverse expert selection to improve performance.
Method: Formulates expert selection as sequential decision-making problem, optimizes using Group Relative Policy Optimization (GRPO). Introduces modality-aware router guidance to enhance training stability by discouraging exploration of infrequently activated experts for given modality (see the code sketch after the abstract).
Result: Extensive experiments on multi-modal image and video benchmarks show MoE-GRPO consistently outperforms standard top-K routing and variants by promoting more diverse expert selection.
Conclusion: RL-based routing optimization enables task-level expert specialization and mitigates expert overfitting in MoE-based VLMs, improving multi-modal understanding performance.
Abstract: Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook more optimal expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce a modality-aware router guidance that enhances training stability and efficiency by discouraging the router from exploring experts that are infrequently activated for a given modality. Extensive experiments on multi-modal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and enabling task-level expert specialization.
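For intuition, a minimal sketch of the two ingredients, group-relative advantages and modality-aware guidance; the function names and the penalty form are assumptions rather than the paper's exact implementation.

```python
# Illustrative sketch: group-relative advantages over sampled routing rollouts,
# plus a soft modality-aware penalty on rarely used experts.
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: [G] rewards of G rollouts sampled for the same input (one group)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def guided_router_logits(logits, usage_freq, min_freq=0.01, penalty=-5.0):
    """logits: [E] router logits; usage_freq: [E] firing frequency of each
    expert for the current modality. Rare experts are softly suppressed."""
    return logits + penalty * (usage_freq < min_freq).float()

adv = grpo_advantages(torch.tensor([0.2, 0.9, 0.4, 0.7]))  # zero-mean, unit-std advantages
```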
[127] Few-Shot Left Atrial Wall Segmentation in 3D LGE MRI via Meta-Learning
Yusri Al-Sanaani, Rebecca Thornhill, Pablo Nery, Elena Pena, Robert deKemp, Calum Redpath, David Birnie, Sreeraman Rajan
Main category: cs.CV
TL;DR: MAML framework for few-shot 3D left atrial wall segmentation from MRI with auxiliary tasks and boundary-aware loss improves accuracy in low-data scenarios.
Details
Motivation: Left atrial wall segmentation from MRI is challenging due to thin geometry, low contrast, and scarce expert annotations, requiring methods that work with minimal labeled data.
Method: Model-Agnostic Meta-Learning (MAML) framework for K-shot (K=5,10,20) 3D segmentation, meta-trained on wall task with auxiliary left/right atrial cavity tasks, using boundary-aware composite loss (see the code sketch after the abstract).
Result: MAML outperformed supervised fine-tuning (0.64 vs 0.52 DSC at 5-shot), approached fully supervised performance at 20-shot (0.69 vs 0.71 DSC), and showed robustness to domain shifts and local cohorts.
Conclusion: MAML enables accurate thin-wall segmentation with minimal labeling, potentially facilitating clinical translation for atrial remodeling assessment through few-shot adaptation.
Abstract: Segmenting the left atrial wall from late gadolinium enhancement magnetic resonance images (MRI) is challenging due to the wall’s thin geometry, low contrast, and the scarcity of expert annotations. We propose a Model-Agnostic Meta-Learning (MAML) framework for K-shot (K = 5, 10, 20) 3D left atrial wall segmentation that is meta-trained on the wall task together with auxiliary left atrial and right atrial cavity tasks and uses a boundary-aware composite loss to emphasize thin-structure accuracy. We evaluated MAML segmentation performance on a hold-out test set and assessed robustness under an unseen synthetic shift and on a distinct local cohort. On the hold-out test set, MAML appeared to improve segmentation performance compared to the supervised fine-tuning model, achieving a Dice score (DSC) of 0.64 vs. 0.52 and HD95 of 5.70 vs. 7.60 mm at 5-shot, and approached the fully supervised reference at 20-shot (0.69 vs. 0.71 DSC). Under unseen shift, performance degraded but remained robust: at 5-shot, MAML attained 0.59 DSC and 5.99 mm HD95 on the unseen domain shift and 0.57 DSC and 6.01 mm HD95 on the local cohort, with consistent gains as K increased. These results suggest that more accurate and reliable thin-wall boundaries are achievable in low-shot adaptation, potentially enabling clinical translation with minimal additional labeling for the assessment of atrial remodeling.
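A compact first-order MAML sketch of one meta-training step, with the segmentation network and the boundary-aware composite loss abstracted into model and loss_fn (both assumed placeholders); full MAML would differentiate through the inner loop rather than using this first-order shortcut.

```python
# First-order MAML sketch of one meta-training step.
import copy
import torch

def maml_step(model, tasks, loss_fn, outer_opt, inner_lr=1e-2, inner_steps=1):
    """tasks: iterable of (support_x, support_y, query_x, query_y) tuples,
    e.g., the wall task plus the auxiliary LA/RA cavity tasks."""
    outer_opt.zero_grad()
    for sx, sy, qx, qy in tasks:
        fast = copy.deepcopy(model)                       # task-specific copy
        for _ in range(inner_steps):                      # inner-loop adaptation
            grads = torch.autograd.grad(loss_fn(fast(sx), sy),
                                        tuple(fast.parameters()))
            with torch.no_grad():
                for p, g in zip(fast.parameters(), grads):
                    p -= inner_lr * g
        q_grads = torch.autograd.grad(loss_fn(fast(qx), qy),
                                      tuple(fast.parameters()))
        for p, g in zip(model.parameters(), q_grads):     # first-order meta-gradient
            p.grad = g if p.grad is None else p.grad + g
    outer_opt.step()
```

With, e.g., outer_opt = torch.optim.Adam(model.parameters(), lr=1e-3), K-shot adaptation at test time reuses only the inner loop on the K labeled volumes.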
[128] Towards Video Anomaly Detection from Event Streams: A Baseline and Benchmark Datasets
Peng Wu, Yuting Yan, Guansong Pang, Yujia Sun, Qingsen Yan, Peng Wang, Yanning Zhang
Main category: cs.CV
TL;DR: EVAD: An event-centric spatiotemporal video anomaly detection framework that leverages event-based vision properties with novel sampling, modeling, and knowledge distillation techniques.
Details
Motivation: Event-based vision has properties well-suited for video anomaly detection (low redundancy, focus on motion, privacy), but lacks dedicated datasets and effective modeling strategies, hindering progress in this field.
Method: 1) Constructs multiple event-stream benchmarks with synchronized event and RGB recordings; 2) Proposes EVAD framework with: event density aware dynamic sampling, density-modulated temporal modeling, and RGB-to-event knowledge distillation under weak supervision (see the code sketch after the abstract).
Result: Extensive experiments on three benchmarks show EVAD achieves significant improvements over existing approaches, demonstrating the effectiveness of event-driven modeling for video anomaly detection.
Conclusion: The work establishes event-based VAD as a unified research direction, showing the potential of event-driven modeling with novel techniques and releasing benchmark datasets to the community.
Abstract: Event-based vision, characterized by low redundancy, focus on dynamic motion, and inherent privacy-preserving properties, naturally fits the demands of video anomaly detection (VAD). However, the absence of dedicated event-stream anomaly detection datasets and effective modeling strategies has significantly hindered progress in this field. In this work, we take the first major step toward establishing event-based VAD as a unified research direction. We first construct multiple event-stream based benchmarks for video anomaly detection, featuring synchronized event and RGB recordings. Leveraging the unique properties of events, we then propose an EVent-centric spatiotemporal Video Anomaly Detection framework, namely EVAD, with three key innovations: an event density aware dynamic sampling strategy to select temporally informative segments; a density-modulated temporal modeling approach that captures contextual relations from sparse event streams; and an RGB-to-event knowledge distillation mechanism to enhance event-based representations under weak supervision. Extensive experiments on three benchmarks demonstrate that our EVAD achieves significant improvements over existing approaches, highlighting the potential and effectiveness of event-driven modeling for video anomaly detection. The benchmark datasets will be made publicly available.
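A minimal sketch of density-aware sampling, assuming events have been binned into temporal segments; drawing segments in proportion to event counts is one plausible reading of the strategy, not the paper's exact rule.

```python
# Plausible reading of density-aware sampling: draw temporal segments with
# probability proportional to their event counts, favoring high-motion spans.
import torch

def sample_segments(event_counts: torch.Tensor, n_samples: int) -> torch.Tensor:
    """event_counts: [T] events per temporal bin; returns sampled bin indices."""
    probs = event_counts.float() / event_counts.sum().clamp_min(1)
    return torch.multinomial(probs, n_samples, replacement=False)

idx = sample_segments(torch.tensor([5, 120, 40, 3, 300, 80]), n_samples=3)
```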
[129] C2W-Tune: Cavity-to-Wall Transfer Learning for Thin Atrial Wall Segmentation in 3D Late Gadolinium-enhanced Magnetic Resonance
Yusri Al-Sanaani, Rebecca Thornhill, Sreeraman Rajan
Main category: cs.CV
TL;DR: C2W-Tune: A two-stage cavity-to-wall transfer learning framework for accurate left atrial wall segmentation in 3D LGE-MRI, using cavity segmentation as anatomical prior to improve thin-wall delineation.
Details
Motivation: Accurate segmentation of the left atrial wall in 3D LGE-MRI is challenging due to the wall's thinness, complex anatomy, and low contrast, but essential for wall thickness mapping and fibrosis quantification.
Method: Two-stage framework: Stage 1 pre-trains a 3D U-Net with ResNeXt encoder and instance normalization on LA cavity segmentation. Stage 2 transfers weights and adapts to wall segmentation using progressive layer-unfreezing to preserve endocardial features while enabling wall-specific refinement (see the code sketch after the abstract).
Result: Substantial improvements over baseline: wall Dice improved from 0.623 to 0.814, Surface Dice at 1mm from 0.553 to 0.731. Boundary errors reduced: HD95 decreased from 2.95mm to 2.55mm, ASSD from 0.71mm to 0.63mm. Even with reduced supervision (70 training volumes), achieved Dice 0.78 and HD95 3.15mm.
Conclusion: Anatomically grounded task transfer with controlled fine-tuning improves boundary accuracy for thin LA wall segmentation in 3D LGE-MRI, demonstrating the value of leveraging high-accuracy cavity models as anatomical priors.
Abstract: Accurate segmentation of the left atrial (LA) wall in 3D late gadolinium-enhanced MRI (LGE-MRI) is essential for wall thickness mapping and fibrosis quantification, yet it remains challenging due to the wall’s thinness, complex anatomy, and low contrast. We propose C2W-Tune, a two-stage cavity-to-wall transfer framework that leverages a high-accuracy LA cavity model as an anatomical prior to improve thin-wall delineation. Using a 3D U-Net with a ResNeXt encoder and instance normalization, Stage 1 pre-trains the network to segment the LA cavity, learning robust atrial representations. Stage 2 transfers these weights and adapts the network to LA wall segmentation using a progressive layer-unfreezing schedule to preserve endocardial features while enabling wall-specific refinement. Experiments on the 2018 LA Segmentation Challenge dataset demonstrate substantial gains over an architecture-matched baseline trained from scratch: wall Dice improves from 0.623 to 0.814, and Surface Dice at 1 mm improves from 0.553 to 0.731. Boundary errors were substantially reduced, with the 95th-percentile Hausdorff distance (HD95) decreasing from 2.95 mm to 2.55 mm and the average symmetric surface distance (ASSD) from 0.71 mm to 0.63 mm. Furthermore, even with reduced supervision (70 training volumes sampled from the same training pool), C2W-Tune achieved a Dice score of 0.78 and an HD95 of 3.15 mm, maintaining competitive performance and exceeding multi-class benchmarks that typically report Dice values around 0.6-0.7. These results show that anatomically grounded task transfer with controlled fine-tuning improves boundary accuracy for thin LA wall segmentation in 3D LGE-MRI.
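A minimal sketch of a progressive unfreezing schedule for Stage 2, assuming parameter names carry encoder/bottleneck/decoder prefixes; the epochs and group names are illustrative, not the paper's schedule.

```python
# Illustrative Stage-2 schedule: start from cavity-pretrained weights with all
# parameters frozen, then thaw decoder -> bottleneck -> encoder over epochs.
import torch

def apply_unfreeze_schedule(model, epoch, schedule):
    """schedule: {start_epoch: [parameter-name prefixes to thaw]}."""
    for start_epoch, prefixes in schedule.items():
        if epoch >= start_epoch:
            for name, p in model.named_parameters():
                if any(name.startswith(pre) for pre in prefixes):
                    p.requires_grad = True

model = torch.nn.ModuleDict({"encoder": torch.nn.Linear(8, 8),
                             "bottleneck": torch.nn.Linear(8, 8),
                             "decoder": torch.nn.Linear(8, 8)})  # stand-in for the U-Net
for p in model.parameters():
    p.requires_grad = False
schedule = {0: ["decoder"], 10: ["bottleneck"], 25: ["encoder"]}
apply_unfreeze_schedule(model, epoch=12, schedule=schedule)       # decoder + bottleneck thawed
```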
[130] Relaxed Rigidity with Ray-based Grouping for Dynamic Gaussian Splatting
Junoh Lee, Junmyeong Lee, Yeon-Ji Song, Inhwan Bae, Jisu Shin, Hae-Gon Jeon, Jin-Hwa Kim
Main category: cs.CV
TL;DR: A novel method for 4D scene reconstruction using 3D Gaussian Splatting that preserves local geometric structure across time through view-space ray grouping and spatial distribution constraints, eliminating need for external priors like optical flow.
Details
Motivation: Current 3D Gaussian Splatting methods for dynamic scenes struggle with realistic motion modeling, especially for monocular videos, leading to degraded reconstruction quality due to incoherent motion that undermines local geometric structure. Most approaches rely on external priors like optical flow for temporal coherence.
Method: Introduces view-space ray grouping strategy that clusters Gaussians intersected by the same ray (considering only those with sufficient α-blending weights), then applies constraints to maintain consistent spatial distribution within these groups, preserving local geometry over time without external guidance (see the code sketch after the abstract).
Result: Extensive experiments on challenging monocular datasets show the method significantly outperforms existing approaches when integrated into two distinct baseline models, achieving superior temporal consistency and reconstruction quality.
Conclusion: The proposed approach effectively enforces physically plausible motion by preserving local geometric structure across time, eliminating reliance on external priors and improving dynamic 3D scene reconstruction from monocular videos.
Abstract: The reconstruction of dynamic 3D scenes using 3D Gaussian Splatting has shown significant promise. A key challenge, however, remains in modeling realistic motion, as most methods fail to align the motion of Gaussians with real-world physical dynamics. This misalignment is particularly problematic for monocular video datasets, where failing to maintain coherent motion undermines local geometric structure, ultimately leading to degraded reconstruction quality. Consequently, many state-of-the-art approaches rely heavily on external priors, such as optical flow or 2D tracks, to enforce temporal coherence. In this work, we propose a novel method to explicitly preserve the local geometric structure of Gaussians across time in 4D scenes. Our core idea is to introduce a view-space ray grouping strategy that clusters Gaussians intersected by the same ray, considering only those whose $\alpha$-blending weights exceed a threshold. We then apply constraints to these groups to maintain a consistent spatial distribution, effectively preserving their local geometry. This approach enforces a more physically plausible motion model by ensuring that local geometry remains stable over time, eliminating the reliance on external guidance. We demonstrate the efficacy of our method by integrating it into two distinct baseline models. Extensive experiments on challenging monocular datasets show that our approach significantly outperforms existing methods, achieving superior temporal consistency and reconstruction quality.
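A rough sketch of the grouping-and-constraint idea, assuming per-Gaussian ray assignments and alpha-blending weights are exported by the rasterizer; the L1 penalty on pairwise-distance drift is an assumed stand-in for the paper's spatial-distribution constraint.

```python
# Assumed-interface sketch: Gaussians that dominate the same ray (blending
# weight above w_min) form a group; an L1 penalty discourages their pairwise
# distances from changing between consecutive timestamps.
import torch

def ray_rigidity_loss(pos_t, pos_t1, ray_ids, weights, w_min=0.05):
    """pos_t, pos_t1: [N, 3] Gaussian centers at times t and t+1;
    ray_ids: [N] ray each Gaussian contributes to; weights: [N] blend weights."""
    loss, count = pos_t.new_zeros(()), 0
    for r in ray_ids[weights > w_min].unique():
        idx = (ray_ids == r) & (weights > w_min)
        if idx.sum() < 2:
            continue
        d_t = torch.cdist(pos_t[idx], pos_t[idx])      # pairwise distances at t
        d_t1 = torch.cdist(pos_t1[idx], pos_t1[idx])   # ... and at t+1
        loss = loss + (d_t - d_t1).abs().mean()
        count += 1
    return loss / max(count, 1)
```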
[131] Distributed Real-Time Vehicle Control for Emergency Vehicle Transit: A Scalable Cooperative Method
WenXi Wang, JunQi Zhang
Main category: cs.CV
TL;DR: A distributed vehicle control method for emergency vehicle transit that uses only local information for real-time decision making without pre-training, with conflict resolution for safety guarantees.
Details
Motivation: Existing methods for emergency vehicle transit have high computational costs and lack scalability; centralized solvers only work for small-scale scenarios, while reinforcement learning models have limited adaptability to different traffic conditions.
Method: Proposes a scalable distributed vehicle control method where vehicles adjust driving behaviors using only local information, with a distributed conflict resolution mechanism to guarantee safety by avoiding decision conflicts.
Result: The method achieves faster decision-making, less impact on ordinary vehicles, and maintains stronger scalability across different traffic densities and road configurations compared to existing methods.
Conclusion: The distributed approach overcomes limitations of centralized and learned methods by providing real-time decision making without pre-training, natural adaptability to varying conditions, and deterministic safety guarantees.
Abstract: Rapid transit of emergency vehicles is critical for saving lives and reducing property loss but often relies on surrounding ordinary vehicles to cooperatively adjust their driving behaviors. It is important to ensure rapid transit of emergency vehicles while minimizing the impact on ordinary vehicles. Centralized mathematical solvers and reinforcement learning are the state-of-the-art methods. The former obtains optimal solutions but is only practical for small-scale scenarios. The latter implicitly learns through extensive centralized training, but the trained model exhibits limited scalability to different traffic conditions. Hence, existing methods suffer from two fundamental limitations: high computational cost and lack of scalability. To overcome the above limitations, this work proposes a scalable distributed vehicle control method, where vehicles adjust their driving behaviors in a distributed manner online using only local instead of global information. We prove that the proposed distributed method using only local information is approximately equivalent to the one using global information, which enables vehicles to evaluate their candidate states and make approximately optimal decisions in real time without pre-training and with natural adaptability to varying traffic conditions. Then, a distributed conflict resolution mechanism is further proposed to guarantee vehicles’ safety by avoiding their decision conflicts, which eliminates the single-point-of-failure risk of centralized methods and provides deterministic safety guarantees that learned methods cannot offer. Compared with existing methods, simulation experiments based on real-world traffic datasets demonstrate that the proposed method achieves faster decision-making, less impact on ordinary vehicles, and maintains much stronger scalability across different traffic densities and road configurations.
[132] Improving Fine-Grained Rice Leaf Disease Detection via Angular-Compactness Dual Loss Learning
Md. Rokon Mia, Rakib Hossain Sajib, Abdullah Al Noman, Abir Ahmed, B M Taslimul Haque
Main category: cs.CV
TL;DR: Dual-loss framework combining Center Loss and ArcFace Loss improves fine-grained classification of rice leaf diseases in vision models, achieving near-perfect accuracy without major architectural changes.
Details
Motivation: Traditional deep learning models using cross entropy loss struggle with high intra-class variance and inter-class similarity in plant pathology datasets. Early detection of rice leaf diseases is critical for preventing large-scale crop losses, but existing methods lack discriminative feature learning for fine-grained classification.
Method: Proposes a dual-loss framework combining Center Loss (to reduce intra-class variance) and ArcFace Loss (to increase inter-class margin) for fine-grained classification. Applied to three backbone architectures: InceptionNetV3, DenseNet201, and EfficientNetB0 trained on the Rice Leaf Dataset (see the code sketch after the abstract).
Result: Achieves significant performance gains with accuracies of 99.6% (InceptionNetV3), 99.2% (DenseNet201), and 99.2% (EfficientNetB0). The framework demonstrates that angular margin-based and center-based constraints substantially boost discriminative feature embeddings.
Conclusion: The dual-loss framework effectively addresses fine-grained classification challenges in plant disease detection without requiring major architectural modifications, making it practical for real-world deployment in farming environments.
Abstract: Early detection of rice leaf diseases is critical, as rice is a staple crop supporting a substantial share of the world’s population. Timely identification of these diseases enables more effective intervention and significantly reduces the risk of large-scale crop losses. However, traditional deep learning models primarily rely on cross entropy loss, which often struggles with high intra-class variance and inter-class similarity, common challenges in plant pathology datasets. To tackle this, we propose a dual-loss framework that combines Center Loss and ArcFace Loss to enhance fine-grained classification of rice leaf diseases. The method is applied to three state-of-the-art backbone architectures: InceptionNetV3, DenseNet201, and EfficientNetB0, trained on the public Rice Leaf Dataset. Our approach achieves significant performance gains, with accuracies of 99.6%, 99.2%, and 99.2%, respectively. The results demonstrate that angular margin-based and center-based constraints substantially boost the discriminative strength of feature embeddings. In particular, the framework does not require major architectural modifications, making it efficient and practical for real-world deployment in farming environments.
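Both losses are standard, so a faithful sketch is possible; hyperparameters (s, m, and the center-loss weight) are illustrative choices, not the paper's values.

```python
# Sketch of the dual-loss idea: ArcFace enlarges inter-class angular margins
# while Center Loss pulls features toward per-class centers.
import torch
import torch.nn.functional as F

class DualLoss(torch.nn.Module):
    def __init__(self, feat_dim, num_classes, s=30.0, m=0.5, lam=0.01):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(num_classes, feat_dim))
        self.centers = torch.nn.Parameter(torch.zeros(num_classes, feat_dim))
        self.s, self.m, self.lam = s, m, lam

    def forward(self, feats, labels):
        # ArcFace: additive angular margin on the target-class logit.
        cos = F.normalize(feats) @ F.normalize(self.W).T
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = self.s * torch.where(target, torch.cos(theta + self.m), cos)
        arc = F.cross_entropy(logits, labels)
        # Center Loss: squared distance of each feature to its class center.
        center = ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()
        return arc + self.lam * center

loss_fn = DualLoss(feat_dim=512, num_classes=4)   # e.g., 4 rice-disease classes
loss = loss_fn(torch.randn(8, 512), torch.randint(0, 4, (8,)))
```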
[133] Few TensoRF: Enhance the Few-shot on Tensorial Radiance Fields
Thanh-Hai Le, Hoang-Hau Tran, Trong-Nghia Vu
Main category: cs.CV
TL;DR: Few TensoRF combines TensoRF’s efficient tensor representation with FreeNeRF’s frequency regularization for improved 3D reconstruction from sparse views, achieving better quality while maintaining fast training times.
Details
Motivation: The paper aims to address the challenge of 3D reconstruction from sparse input views, which is crucial for practical applications where collecting dense multi-view images is difficult. Existing methods either require many input views or suffer from poor quality when views are limited.
Method: Few TensoRF integrates TensoRF’s efficient tensor-based representation (for fast rendering) with FreeNeRF’s frequency-driven few-shot regularization. It introduces frequency and occlusion masks to improve stability and reconstruction quality under sparse input conditions (see the code sketch after the abstract).
Result: On the Synthesis NeRF benchmark, the method improves average PSNR from 21.45 dB (TensoRF) to 23.70 dB, with the fine-tuned version reaching 24.52 dB, while maintaining TensoRF’s fast training time (≈10-15 minutes). On the THuman 2.0 dataset, it achieves 27.37-34.00 dB with only eight input images for human body reconstruction.
Conclusion: Few TensoRF provides an efficient and data-effective solution for real-time 3D reconstruction across diverse scenes, demonstrating significant improvements in quality under sparse view conditions while maintaining computational efficiency.
Abstract: This paper presents Few TensoRF, a 3D reconstruction framework that combines TensoRF’s efficient tensor-based representation with FreeNeRF’s frequency-driven few-shot regularization. Using TensoRF to significantly accelerate rendering speed and introducing frequency and occlusion masks, the method improves stability and reconstruction quality under sparse input views. Experiments on the Synthesis NeRF benchmark show that Few TensoRF improves the average PSNR from 21.45 dB (TensoRF) to 23.70 dB, with the fine-tuned version reaching 24.52 dB, while maintaining TensoRF’s fast ($\approx$10-15 minute) training time. Experiments on the THuman 2.0 dataset further demonstrate competitive performance in human body reconstruction, achieving 27.37-34.00 dB with only eight input images. These results highlight Few TensoRF as an efficient and data-effective solution for real-time 3D reconstruction across diverse scenes.
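The frequency schedule below follows FreeNeRF's published linear rule; how the mask attaches to TensoRF's feature decoding in Few TensoRF is an assumption here.

```python
# FreeNeRF-style frequency regularization: positional-encoding bands are
# unmasked gradually over training, so early optimization sees only
# low-frequency structure.
import torch

def freq_mask(num_bands: int, step: int, total_steps: int) -> torch.Tensor:
    """Mask in [0, 1] per encoding band; band i opens as training progresses."""
    progress = step / total_steps * num_bands
    return (progress - torch.arange(num_bands)).clamp(0.0, 1.0)

print(freq_mask(8, step=250, total_steps=1000))  # tensor([1., 1., 0., 0., 0., 0., 0., 0.])
```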
[134] Bilingual Text-to-Motion Generation: A New Benchmark and Baselines
Wanjiang Weng, Xiaofeng Tan, Xiangbo Shu, Guo-Sen Xie, Pan Zhou, Hongsong Wang
Main category: cs.CV
TL;DR: BiHumanML3D is the first bilingual text-to-motion benchmark with LLM-assisted annotation, and BiMD with Cross-Lingual Alignment enables high-quality motion generation from bilingual inputs including zero-shot code-switching.
Details
Motivation: Text-to-motion generation has cross-linguistic potential but is limited by lack of bilingual datasets and poor cross-lingual semantic understanding in existing language models.
Method: Introduces BiHumanML3D benchmark via LLM-assisted annotation and manual correction, and proposes Bilingual Motion Diffusion (BiMD) with Cross-Lingual Alignment (CLA) to explicitly align semantic representations across languages (see the code sketch after the abstract).
Result: BiMD with CLA achieves FID of 0.045 vs 0.169 and R@3 of 82.8% vs 80.8%, significantly outperforming monolingual diffusion models and translation baselines on BiHumanML3D.
Conclusion: The work demonstrates the necessity of bilingual datasets and effectiveness of cross-lingual alignment for cross-linguistic motion synthesis, with released dataset and code.
Abstract: Text-to-motion generation holds significant potential for cross-linguistic applications, yet it is hindered by the lack of bilingual datasets and the poor cross-lingual semantic understanding of existing language models. To address these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed via LLM-assisted annotation and rigorous manual correction. Furthermore, we propose a simple yet effective baseline, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA). CLA explicitly aligns semantic representations across languages, creating a robust conditional space that enables high-quality motion generation from bilingual inputs, including zero-shot code-switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 vs. 0.169 and R@3 of 82.8% vs. 80.8%, significantly outperforming monolingual diffusion models and translation baselines on BiHumanML3D, underscoring the critical necessity and reliability of our dataset and the effectiveness of our alignment strategy for cross-lingual motion synthesis. The dataset and code are released at https://wengwanjiang.github.io/BilingualT2M-page
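A minimal sketch of one possible alignment objective; whether CLA uses a cosine, MSE, or contrastive form is not specified here, so the cosine variant below is an assumption.

```python
# One possible CLA instantiation: pull sentence embeddings of paired bilingual
# prompts describing the same motion together with a cosine objective.
import torch
import torch.nn.functional as F

def cla_loss(emb_en: torch.Tensor, emb_zh: torch.Tensor) -> torch.Tensor:
    """emb_en, emb_zh: [B, D] embeddings of paired English/Chinese prompts."""
    return 1.0 - F.cosine_similarity(emb_en, emb_zh, dim=-1).mean()

loss = cla_loss(torch.randn(4, 768), torch.randn(4, 768))
```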
[135] GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization
Zhangyu Jin, Maksim Siniukov, Deuksin Kwon, Ashutosh Chaubey, Mohammad Soleymani
Main category: cs.CV
TL;DR: GDPO-Listener: A framework for generating expressive 3D head motions for both speaking and listening in dyadic interactions using Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization.
Details
Motivation: Existing methods for 3D head motion generation in dyadic interactions suffer from the 'Regression-to-the-Mean' problem in listener motions, resulting in static faces and lacking parameter space for complex nonverbal motions.
Method: 1) Auto-Regressive Flow Matching architecture for stable supervised learning; 2) Group reward-Decoupled Policy Optimization (GDPO) that isolates reward normalization across distinct FLAME parameter groups to incentivize high variance expressive generations; 3) Enables explicit semantic text control for customizable responses (see the code sketch after the abstract).
Result: Superior performance compared to existing baselines on long-term kinematic variance, visual expressivity and semantic controllability across Seamless Interaction and DualTalk datasets.
Conclusion: GDPO-Listener achieves highly expressive speaking and listening motion generation for dyadic interactions, overcoming limitations of previous methods through innovative architectural and optimization approaches.
Abstract: Generating realistic 3D head motion for dyadic interactions is a significant challenge in virtual human synthesis. While recent methods achieve impressive results with speaking heads, they frequently suffer from the 'Regression-to-the-Mean' problem in listener motions, collapsing into static faces, and lack the parameter space for complex nonverbal motions. In this paper, we propose GDPO-Listener, a novel framework that achieves highly expressive speaking and listening motion generation. First, we introduce an Auto-Regressive Flow Matching architecture enabling stable supervised learning. Second, to overcome kinematic stillness, we apply the Group reward-Decoupled Policy Optimization (GDPO). By isolating reward normalization across distinct FLAME parameter groups, GDPO explicitly incentivizes high variance expressive generations. Finally, we enable explicit semantic text control for customizable responses. Extensive evaluations across the Seamless Interaction and DualTalk datasets demonstrate superior performance compared to existing baselines on long-term kinematic variance, visual expressivity and semantic controllability.
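A minimal sketch of the decoupling idea: normalize rollout rewards independently per FLAME parameter group, so one group's reward scale cannot dominate the others. Group names and shapes are illustrative assumptions.

```python
# Sketch of decoupled normalization: per-group rewards (e.g., expression, jaw,
# neck FLAME parameters) are normalized independently, so a low-variance group
# cannot flatten the learning signal of an expressive one.
import torch

def gdpo_advantages(group_rewards):
    """group_rewards: {group_name: [G] rewards over G rollouts}."""
    return {g: (r - r.mean()) / (r.std() + 1e-8) for g, r in group_rewards.items()}

adv = gdpo_advantages({"expression": torch.rand(8),
                       "jaw": torch.rand(8),
                       "neck": torch.rand(8)})
```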
[136] Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection
Ruichao Yang, Wei Gao, Xiaobin Zhu, Jing Ma, Hongzhan Lin, Ziyang Luo, Bo-Wen Zhang, Xu-Cheng Yin
Main category: cs.CV
TL;DR: PCGR is an interpretable framework for multimodal misinformation detection using concept graph reasoning with MLLMs
Details
Motivation: Multimodal misinformation is challenging for traditional opaque detectors that are fragile against new manipulation tactics, requiring interpretable and evolvable solutions.
Method: Probabilistic Concept Graph Reasoning (PCGR) uses a build-then-infer paradigm: first constructs a graph of human-understandable concept nodes (including novel high-level concepts discovered by MLLMs), then applies hierarchical attention over the concept graph to infer claim veracity.
Result: PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition
Conclusion: PCGR provides an interpretable and evolvable framework for multimodal misinformation detection through structured concept-based reasoning with MLLMs
Abstract: Multimodal misinformation poses an escalating challenge that often evades traditional detectors, which are opaque black boxes and fragile against new manipulation tactics. We present Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework that reframes multimodal misinformation detection (MMD) as structured and concept-based reasoning. PCGR follows a build-then-infer paradigm, which first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models (MLLMs), and then applies hierarchical attention over this concept graph to infer claim veracity. This design produces interpretable reasoning chains linking evidence to conclusions. Experiments demonstrate that PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.
[137] VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning
Zhe Gao, Shiyu Shen, Taifeng Chai, Weinong Wang, Haotian Xu, Xing W, Wenbin Li, Qi Fan, Yang Gao, Dacheng Tao
Main category: cs.CV
TL;DR: VideoTIR uses reinforcement learning to enable MLLMs to efficiently understand long videos by learning to call multi-level toolkits for retrieving relevant video segments/images/regions, reducing hallucinations through proper visual data parsing.
Details
Motivation: Existing MLLMs suffer from hallucinations in long video understanding due to imbalance between textual and visual tokens. While SFT-based tool-calling methods exist, they require vast fine-grained data and have constrained tool-calling trajectories.
Method: Proposes VideoTIR using RL to encourage proper usage of comprehensive multi-level toolkits. Uses Zero-RL and SFT cold-starting, Toolkit Action Grouped Policy Optimization (TAGPO) for efficient tool-calling with stepwise rewards and failed rollout reuse, and sandbox-based trajectory synthesis for data generation.
Result: Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of the method in enhancing long video understanding.
Conclusion: VideoTIR provides an effective RL-based approach for MLLMs to handle long video understanding by learning to properly use toolkits for visual data parsing, addressing hallucination issues through efficient visual attention mechanisms.
Abstract: Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs. SFT-based tool-calling methods can serve this purpose, but they typically require vast amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose VideoTIR, a novel framework that leverages Reinforcement Learning (RL) to encourage proper usage of comprehensive multi-level toolkits for efficient long video understanding. VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments/images/regions, enhancing long video understanding both accurately and efficiently. To reduce redundant tool-calling, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which enhances the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework to generate high-quality trajectory data. Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of our method.
[138] CARE: Training-Free Controllable Restoration for Medical Images via Dual-Latent Steering
Xu Liu
Main category: cs.CV
TL;DR: CARE is a training-free controllable medical image restoration framework that balances structure preservation and prior-guided enhancement using dual-latent branches with adaptive control.
Details
Motivation: Existing medical image restoration methods lack controllability over the trade-off between faithful reconstruction and prior-driven enhancement, which is critical in clinical settings where aggressive restoration may introduce hallucinations or alter diagnostically important structures.
Method: Uses a dual-latent restoration strategy: one branch enforces data fidelity and anatomical consistency, while another leverages generative prior to recover missing/degraded information. A risk-aware adaptive controller dynamically adjusts branch contributions based on restoration uncertainty and local structural reliability (see the code sketch after the abstract).
Result: Achieves strong restoration quality while better preserving clinically relevant structures and reducing risk of implausible reconstructions in noisy and incomplete medical imaging scenarios.
Conclusion: Provides a practical step toward safer, more controllable, and deployment-ready medical image restoration with explicit balance between structure preservation and prior-guided refinement.
Abstract: Medical image restoration is essential for improving the usability of noisy, incomplete, and artifact-corrupted clinical scans, yet existing methods often rely on task-specific retraining and offer limited control over the trade-off between faithful reconstruction and prior-driven enhancement. This lack of controllability is especially problematic in clinical settings, where overly aggressive restoration may introduce hallucinated details or alter diagnostically important structures. In this work, we propose CARE, a training-free controllable restoration framework for real-world medical images that explicitly balances structure preservation and prior-guided refinement during inference. CARE uses a dual-latent restoration strategy, in which one branch enforces data fidelity and anatomical consistency while the other leverages a generative prior to recover missing or degraded information. A risk-aware adaptive controller dynamically adjusts the contribution of each branch based on restoration uncertainty and local structural reliability, enabling conservative or enhancement-focused restoration modes without additional model training. We evaluate CARE on noisy and incomplete medical imaging scenarios and show that it achieves strong restoration quality while better preserving clinically relevant structures and reducing the risk of implausible reconstructions. The proposed approach offers a practical step toward safer, more controllable, and more deployment-ready medical image restoration.
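A minimal sketch of dual-latent steering under assumptions: a per-pixel uncertainty map in [0, 1] exists, and blending is a sigmoid-weighted convex combination; the actual risk-aware controller may differ.

```python
# Assumed blending rule: a sigmoid of (0.5 - uncertainty) weights the
# prior-guided latent, so high-risk regions fall back to the fidelity branch.
import torch

def steer_latents(z_fidelity, z_prior, uncertainty, k=8.0):
    """Latents: [B, C, H, W]; uncertainty: [B, 1, H, W] in [0, 1]."""
    w_prior = torch.sigmoid(k * (0.5 - uncertainty))   # low risk -> more enhancement
    return w_prior * z_prior + (1.0 - w_prior) * z_fidelity

z = steer_latents(torch.randn(1, 4, 32, 32), torch.randn(1, 4, 32, 32),
                  uncertainty=torch.rand(1, 1, 32, 32))
```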
[139] GeoNDC: A Queryable Neural Data Cube for Planetary-Scale Earth Observation
Jianbo Qi, Mengyao Li, Baogui Jiang, Yidan Chen, Qiao Wang
Main category: cs.CV
TL;DR: GeoNDC is a neural data cube that encodes planetary-scale Earth observation data as a continuous spatiotemporal implicit neural field, enabling efficient storage, querying, and reconstruction of satellite imagery.
Details
Motivation: Satellite Earth observation data is massive and organized as discrete raster files, making it costly to store, transmit, and query. More efficient representations are needed that support on-demand queries and continuous-time reconstruction.
Method: GeoNDC uses implicit neural fields to encode planetary-scale Earth observation data as a continuous spatiotemporal representation. It learns a neural representation that can be queried for specific spatiotemporal coordinates, enabling compression, reconstruction, and direct queries without full decompression (see the code sketch after the abstract).
Result: Achieves 95:1 compression ratio on 20-year MODIS archive (0.44GB vs optimized baseline), high spectral fidelity (mean R² > 0.98, RMSE = 0.021), supports direct spatiotemporal queries on consumer hardware, and recovers cloud-free dynamics with high fidelity (R² > 0.85) under simulated cloud occlusion.
Conclusion: GeoNDC offers a unified AI-native representation for planetary-scale Earth observation that integrates query, reconstruction, and compression in a single framework, complementing raw archives with a compact, analysis-ready data layer.
Abstract: Satellite Earth observation has accumulated massive spatiotemporal archives essential for monitoring environmental change, yet these remain organized as discrete raster files, making them costly to store, transmit, and query. We present GeoNDC, a queryable neural data cube that encodes planetary-scale Earth observation data as a continuous spatiotemporal implicit neural field, enabling on-demand queries and continuous-time reconstruction without full decompression. Experiments on a 20-year global MODIS MCD43A4 reflectance record (7 bands, 5 km, 8-day sampling) show that the learned representation supports direct spatiotemporal queries on consumer hardware. On Sentinel-2 imagery (10 m), continuous temporal parameterization recovers cloud-free dynamics with high fidelity ($R^2 > 0.85$) under simulated 2-km cloud occlusion. On HiGLASS biophysical products (LAI and FPAR), GeoNDC attains near-perfect accuracy ($R^2 > 0.98$). The representation compresses the 20-year MODIS archive to 0.44 GB (approximately 95:1 relative to an optimized Int16 baseline) with high spectral fidelity (mean $R^2 > 0.98$, mean RMSE $= 0.021$). These results suggest GeoNDC offers a unified AI-native representation for planetary-scale Earth observation, complementing raw archives with a compact, analysis-ready data layer integrating query, reconstruction, and compression in a single framework.
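A generic coordinate-MLP sketch of the query interface: Fourier-encode normalized (lon, lat, t) and regress 7-band reflectance. Layer sizes and the encoding are standard implicit-field defaults, not GeoNDC's actual architecture.

```python
# Coordinate MLP over (lon, lat, t): querying reflectance at arbitrary
# space-time points without decompressing any raster.
import torch

class NeuralField(torch.nn.Module):
    def __init__(self, n_freqs=10, hidden=256, out_bands=7):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(n_freqs)          # Fourier bands
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(3 * 2 * n_freqs, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, out_bands))

    def forward(self, coords):                             # coords: [N, 3] in [-1, 1]
        x = coords[..., None] * self.freqs                 # [N, 3, F]
        enc = torch.cat([x.sin(), x.cos()], dim=-1).flatten(1)
        return self.mlp(enc)                               # [N, 7] reflectance

refl = NeuralField()(torch.rand(1024, 3) * 2 - 1)          # query 1024 space-time points
```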
[140] MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes
Wonjoon Lee, Sungmin Woo, Donghyeong Kim, Jungho Lee, Sangheon Park, Sangyoun Lee
Main category: cs.CV
TL;DR: MoRGS is an online 4D reconstruction framework that improves per-Gaussian motion modeling in dynamic scenes by incorporating optical flow supervision and motion confidence mechanisms.
Details
Motivation: Existing online 4D reconstruction methods using 3D Gaussian Splatting fail to learn per-Gaussian motion that reflects true scene dynamics, as they optimize appearance and motion solely under photometric loss, causing motion to chase pixel residuals rather than true 3D motion.
Method: Proposes MoRGS with three key components: 1) Uses optical flow from sparse key views as lightweight motion cues to regularize per-Gaussian motion beyond photometric supervision, 2) Learns a per-Gaussian motion offset field to reconcile discrepancies between projected 3D motion and observed flow across views and time, 3) Introduces per-Gaussian motion confidence to separate dynamic from static Gaussians and weight attribute updates (see the code sketch after the abstract).
Result: Extensive experiments show MoRGS achieves state-of-the-art reconstruction quality and motion fidelity among online methods while maintaining streamable performance.
Conclusion: MoRGS effectively addresses the limitation of existing online 4D reconstruction methods by explicitly modeling per-Gaussian motion with optical flow supervision and confidence mechanisms, leading to improved motion fidelity and temporal consistency.
Abstract: Online reconstruction of dynamic scenes aims to learn from streaming multi-view inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that explicitly models per-Gaussian motion to improve 4D reconstruction quality. Specifically, we leverage optical flow on a sparse set of key views as lightweight motion cues that regularize per-Gaussian motion beyond photometric supervision. To compensate for the sparsity of flow supervision, we learn a per-Gaussian motion offset field that reconciles discrepancies between projected 3D motion and observed flow across views and time. In addition, we introduce a per-Gaussian motion confidence that separates dynamic from static Gaussians and weights Gaussian attribute residual updates, thereby suppressing redundant motion in static regions for better temporal consistency and accelerating the modeling of large motions. Extensive experiments demonstrate that MoRGS achieves state-of-the-art reconstruction quality and motion fidelity among online methods, while maintaining streamable performance.
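A rough sketch of flow-based motion supervision, assuming per-Gaussian centers in camera coordinates at consecutive timestamps and a pinhole projection; the nearest-pixel flow lookup stands in for whatever sampling the paper uses.

```python
# Assumed-geometry sketch: project per-Gaussian 3D displacement into the image
# and penalize its deviation from optical flow sampled at the splat location.
import torch

def flow_supervision(pos_t, pos_t1, K, flow):
    """pos_t, pos_t1: [N, 3] centers in camera coords at t and t+1 (z > 0);
    K: [3, 3] intrinsics; flow: [H, W, 2] optical flow at a key view."""
    def project(p):
        uv = (K @ p.T).T
        return uv[:, :2] / uv[:, 2:3]
    uv_t = project(pos_t)
    pred = project(pos_t1) - uv_t                  # projected 3D motion, [N, 2]
    px = uv_t.round().long()
    px[:, 0].clamp_(0, flow.shape[1] - 1)          # x -> image width
    px[:, 1].clamp_(0, flow.shape[0] - 1)          # y -> image height
    target = flow[px[:, 1], px[:, 0]]              # nearest-pixel flow, [N, 2]
    return (pred - target).abs().mean()
```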
[141] GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator
Liyuan Zhu, Manjunath Narayana, Michal Stary, Will Hutchcroft, Gordon Wetzstein, Iro Armeni
Main category: cs.CV
TL;DR: GaussFusion improves 3D Gaussian splatting reconstructions using geometry-informed video generation to fix artifacts like floaters, flickering, and blur, achieving state-of-the-art novel-view synthesis with real-time performance.
Details
Motivation: 3D Gaussian splatting (3DGS) suffers from artifacts like floaters, flickering, and blur due to camera pose errors, incomplete coverage, and noisy geometry initialization. Existing RGB-based approaches are limited to single reconstruction pipelines and don't address these fundamental geometric issues.
Method: Introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Renders a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. Also includes an artifact synthesis pipeline that simulates diverse degradation patterns for robustness.
Result: Achieves state-of-the-art performance on novel-view synthesis benchmarks. An efficient variant runs in real time at 21 FPS while maintaining similar performance, enabling interactive 3D applications.
Conclusion: GaussFusion effectively mitigates 3DGS artifacts through geometry-informed video generation, offering both high-quality reconstruction and real-time performance for practical 3D applications.
Abstract: We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generalization. GaussFusion achieves state-of-the-art performance on novel-view synthesis benchmarks, and an efficient variant runs in real time at 21 FPS while maintaining similar performance, enabling interactive 3D applications.
[142] Synergistic Event-SVE Imaging for Quantitative Propellant Combustion Diagnostics
Jing Tao, Taihang Lei, Banglei Guan, Ying Qu, Xudong Na, Likun Ma, Yang Shang, Qifeng Yu
Main category: cs.CV
TL;DR: A closed-loop Event-SVE system combining spatially variant exposure camera with stereo neuromorphic event cameras for real-time 3D combustion monitoring under extreme HDR, smoke-obscured conditions.
Details
Motivation: Real-time monitoring of high-energy propellant combustion is challenging due to extreme HDR, microsecond-scale particle motion, and heavy smoke causing saturation, motion blur, and unstable particle extraction in conventional imaging.
Method: Closed-loop system coupling SVE camera with stereo pair of neuromorphic event cameras. SVE branch produces HDR maps with smoke-aware fusion strategy using multi-cue smoke-likelihood map. Event cameras provide 3D particle tracking with feature extraction and triangulation.
Result: System achieves maximum calibration error of 0.56%, captures multimodal equivalent-radius statistics for boron-based propellants, and captures fast separation transients difficult to observe with conventional sensors.
Conclusion: The framework provides practical, calibration-consistent route to microsecond-resolved 3D combustion measurement under smoke-obscured HDR conditions.
Abstract: Real-time monitoring of high-energy propellant combustion is difficult. Extreme high dynamic range (HDR), microsecond-scale particle motion, and heavy smoke often occur together. These conditions drive saturation, motion blur, and unstable particle extraction in conventional imaging. We present a closed-loop Event–SVE measurement system that couples a spatially variant exposure (SVE) camera with a stereo pair of neuromorphic event cameras. The SVE branch produces HDR maps with an explicit smoke-aware fusion strategy. A multi-cue smoke-likelihood map is used to separate particle emission from smoke scattering, yielding calibrated intensity maps for downstream analysis. The resulting HDR maps also provide the absolute-intensity reference missing in event cameras. This reference is used to suppress smoke-driven event artifacts and to improve particle-state discrimination. Based on the cleaned event observations, a stereo event-based 3D pipeline estimates separation height and equivalent particle size through feature extraction and triangulation (maximum calibration error 0.56%). Experiments on boron-based propellants show multimodal equivalent-radius statistics. The system also captures fast separation transients that are difficult to observe with conventional sensors. Overall, the proposed framework provides a practical, calibration-consistent route to microsecond-resolved 3D combustion measurement under smoke-obscured HDR conditions.
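Once matched particle features exist in both event cameras, the 3D step reduces to classical two-view triangulation. A minimal sketch of the standard linear (DLT) solve that such a pipeline rests on; the event-based feature extraction and matching are not shown.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one matched particle from a stereo
    pair. P1, P2: 3x4 camera projection matrices; x1, x2: (u, v) pixel
    coordinates. Solves A X = 0 for the homogeneous 3D point via SVD."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]   # homogeneous -> Euclidean 3D point
```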
[143] Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos
Xuankai Zhang, Junjin Xiao, Shangwei Huang, Wei-shi Zheng, Qing Zhang
Main category: cs.CV
TL;DR: Dynamic Gaussian Splatting from monocular videos using SE(3) B-spline motion bases with adaptive control points for continuous deformation modeling.
Details
Motivation: To achieve high-quality dynamic scene reconstruction from monocular videos by explicitly modeling continuous position and orientation deformation of dynamic Gaussians, going beyond previous methods.
Method: Uses SE(3) B-spline motion bases with compact control points, adaptive control mechanism to adjust motion bases dynamically, soft segment reconstruction to mitigate long-interval motion interference, and multi-view diffusion model for multi-view cues.
Result: Outperforms state-of-the-art methods in novel view synthesis, demonstrating superior dynamic scene reconstruction quality
Conclusion: The proposed approach effectively models complex dynamic motions from monocular videos and achieves state-of-the-art performance in novel view synthesis
Abstract: We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. To this end, we go one step further than previous methods by explicitly modeling continuous position and orientation deformation of dynamic Gaussians, using SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. Besides, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues for avoiding overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code is available at https://github.com/hhhddddddd/se3bsplinegs.
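A uniform cubic B-spline blends four consecutive control points with fixed polynomial basis weights, which is what makes the motion representation explicit and continuous. Below is a translation-only sketch of evaluating one spline segment; the paper's SE(3) version additionally blends rotations, typically via a cumulative formulation in the Lie algebra.

```python
import numpy as np

def cubic_bspline_point(ctrl, u):
    """Evaluate a uniform cubic B-spline segment at u in [0, 1).
    ctrl: (4, 3) array of consecutive control-point translations.
    Returns the interpolated 3D position on the segment."""
    B = (1.0 / 6.0) * np.array([
        [1, 4, 1, 0],
        [-3, 0, 3, 0],
        [3, -6, 3, 0],
        [-1, 3, -3, 1],
    ])
    U = np.array([1.0, u, u * u, u ** 3])
    weights = U @ B          # four basis weights summing to 1
    return weights @ ctrl    # (3,) interpolated position
```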
[144] GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding
Junpeng Ma, Sashuai Zhou, Guanghao Li, Xin Gao, Yue Cao, Hengyu Zeng, Yuxiang Yan, Zhibin Wang, Jun Song, Bo Zheng, Shanghang Zhang, Jian Pu
Main category: cs.CV
TL;DR: GIFT is a training-free framework for video large language models that selects frames based on their irreplaceability, using directed diversity and budget-aware refinement to improve video understanding efficiency.
Details
Motivation: Current VLMs have high computational costs from processing dense frames. Existing keyframe selection methods use greedy decision-making with decoupled relevance/diversity evaluation, leading to local optima and irrelevant noise frame selection.
Method: Proposes GIFT with two components: 1) Directed Diversity to quantify frame uniqueness conditioned on relevance, creating unified irreplaceability scores; 2) Budget-Aware Refinement that first selects core frames with highest irreplaceability, then builds temporal context around them as budget expands.
Result: Achieves maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling.
Conclusion: GIFT provides an effective training-free framework for frame selection in VLMs that improves computational efficiency while maintaining or enhancing video understanding performance.
Abstract: Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost from processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame’s uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs an adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling.
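To make irreplaceability concrete: a frame is irreplaceable when it is relevant to the query and not redundant with other relevant frames. The toy scoring rule below is an illustrative reading of that idea, not GIFT's actual computation, and it omits the budget-aware context expansion.

```python
import torch
import torch.nn.functional as F

def gift_like_select(frame_feats, query_feat, budget):
    """Toy irreplaceability scoring: relevance to the query minus
    redundancy with other relevant frames, keeping the top `budget`
    frames in temporal order. An illustrative sketch only."""
    feats = F.normalize(frame_feats, dim=-1)         # (N, D)
    q = F.normalize(query_feat, dim=-1)              # (D,)
    relevance = feats @ q                            # (N,)
    sim = feats @ feats.T                            # (N, N) frame similarity
    w = relevance.clamp(min=0).unsqueeze(0)          # weight peers by relevance
    redundancy = (sim * w).sum(-1) / (w.sum() + 1e-6)
    irreplaceability = relevance - redundancy
    return irreplaceability.topk(budget).indices.sort().values
```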
[145] Z-Erase: Enabling Concept Erasure in Single-Stream Diffusion Transformers
Nanxiang Jiang, Zhaoxin Fan, Baisen Wang, Daiheng Gao, Junhang Cheng, Jifeng Guo, Yalan Qin, Yeying Jin, Hongwei Zheng, Faguo Wu, Wenjun Wu
Main category: cs.CV
TL;DR: Z-Erase is a concept erasure method specifically designed for single-stream diffusion transformer T2I models that prevents generation collapse while removing unwanted concepts.
Details
Motivation: Concept erasure is crucial for safety in T2I models, but existing methods fail in single-stream architectures (like Z-Image) where text and image tokens are processed as unified sequences, causing generation collapse when prior methods are applied.
Method: Proposes Stream Disentangled Concept Erasure Framework to decouple updates for single-stream models, plus Lagrangian-Guided Adaptive Erasure Modulation as a constrained algorithm to balance erasure-preservation trade-off with rigorous convergence analysis.
Result: Z-Erase successfully overcomes generation collapse issue and achieves state-of-the-art performance across a wide range of tasks in single-stream T2I models.
Conclusion: Z-Erase is the first effective concept erasure method for single-stream diffusion transformer T2I models, providing stable generation while removing unwanted concepts with theoretical guarantees.
Abstract: Concept erasure serves as a vital safety mechanism for removing unwanted concepts from text-to-image (T2I) models. While extensively studied in U-Net and dual-stream architectures (e.g., Flux), this task remains under-explored in the recent emerging paradigm of single-stream diffusion transformers (e.g., Z-Image). In this new paradigm, text and image tokens are processed as a single unified sequence via shared parameters. Consequently, directly applying prior erasure methods typically leads to generation collapse. To bridge this gap, we introduce Z-Erase, the first concept erasure method tailored for single-stream T2I models. To guarantee stable image generation, Z-Erase first proposes a Stream Disentangled Concept Erasure Framework that decouples updates and enables existing methods on single-stream models. Subsequently, within this framework, we introduce Lagrangian-Guided Adaptive Erasure Modulation, a constrained algorithm that further balances the sensitive erasure-preservation trade-off. Moreover, we provide a rigorous convergence analysis proving that Z-Erase can converge to a Pareto stationary point. Experiments demonstrate that Z-Erase successfully overcomes the generation collapse issue, achieving state-of-the-art performance across a wide range of tasks.
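Lagrangian-guided modulation of the erasure-preservation trade-off can be read as constrained optimization: minimize the erasure loss subject to a bound on preservation degradation. Below is a generic primal-dual sketch of that pattern; the paper's actual modulation rule may differ.

```python
import torch

def lagrangian_step(erase_loss, preserve_loss, lam, eps, lam_lr=0.01):
    """One generic primal-dual step for 'minimize erase_loss subject to
    preserve_loss <= eps'. lam is a scalar tensor multiplier; dual
    ascent grows it whenever the preservation constraint is violated.
    A sketch of the pattern, not the paper's exact algorithm."""
    objective = erase_loss + lam.detach() * (preserve_loss - eps)
    with torch.no_grad():
        new_lam = torch.clamp(lam + lam_lr * (preserve_loss - eps), min=0.0)
    return objective, new_lam
```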
[146] Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs
Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang, Guoyin Wang, Jiancan Wu, Xiang Wang, Xiangnan He
Main category: cs.CV
TL;DR: ToR (Token-Reweighting) strategy addresses RLVR challenges in MLLMs by dynamically reweighting perception and reasoning tokens during training, improving multimodal reasoning performance.
Details
Motivation: RLVR applied to MLLMs faces challenges because responses interleave perception tokens (visual grounding) and reasoning tokens (symbolic reasoning). These token types have distinct but interdependent capacities, making isolated optimization insufficient for effective multimodal reasoning.
Method: Proposes Token-Reweighting (ToR) strategy that identifies critical perception and reasoning tokens and dynamically reweights them during RLVR training. This plug-and-play approach models the interdependence between token types and can be applied on top of existing methods like GRPO and DAPO.
Result: ToR delivers consistent performance gains across multiple multimodal reasoning benchmarks, achieving state-of-the-art performance with both accurate visual grounding and coherent reasoning. Empirical analysis shows optimizing either perception- or reasoning-only tokens underperforms full optimization.
Conclusion: The interdependence between perception and reasoning tokens in MLLMs requires explicit modeling during RLVR training. ToR’s dynamic token reweighting strategy effectively addresses this challenge and improves multimodal reasoning capabilities.
Abstract: Extending Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal large language models (MLLMs) faces a fundamental challenge: their responses inherently interleave perception-related tokens, which ground visual content, with reasoning-related tokens, which construct reasoning chains. These token types instantiate distinct yet interdependent capacities – visual grounding and symbolic reasoning – making isolated optimization insufficient. Through token-level empirical analysis, we demonstrate that optimizing either perception- or reasoning-only tokens consistently underperforms full optimization, underscoring their inherent coupling. To address this, we propose a plug-and-play Token-Reweighting (ToR) strategy that explicitly models this interdependence by identifying critical tokens of both types and dynamically reweighting them during RLVR training. Applied on top of existing methods (e.g., GRPO and DAPO), ToR delivers consistent performance gains across multiple multi-modal reasoning benchmarks, achieving state-of-the-art performance with both accurate visual grounding and coherent reasoning.
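The core mechanic, reweighting identified token types inside the RLVR objective, can be sketched as type-dependent weights applied to a standard advantage-weighted per-token loss. The weights and the perception/reasoning mask below are illustrative assumptions, not the paper's values.

```python
import torch

def reweighted_pg_loss(logp, advantages, is_perception, w_p=1.5, w_r=1.0):
    """Toy token-reweighted policy-gradient loss: per-token log-probs
    are scaled by type-dependent weights inside the usual
    advantage-weighted objective. logp/advantages: (T,) per-token
    tensors; is_perception: (T,) bool mask over the response."""
    weights = torch.full_like(advantages, w_r)
    weights[is_perception] = w_p
    return -(weights * advantages * logp).mean()
```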
[147] Learning domain-invariant features through channel-level sparsification for Out-Of Distribution Generalization
Haoran Pei, Yuguang Yang, Kexin Liu, Juan Zhang, Baochang Zhang
Main category: cs.CV
TL;DR: HCD (Hierarchical Causal Dropout) is a method that uses channel-level causal masks to separate causal from spurious features in deep learning models, improving OOD generalization through representation-level causal intervention.
Details
Motivation: Deep learning models for image analysis often develop shortcut dependencies on domain-specific features, leading to poor out-of-distribution generalization. Current invariance learning methods struggle to isolate mixed features in deep latent spaces.
Method: Proposes Hierarchical Causal Dropout (HCD) with channel-level causal masks to enforce feature sparsity, separating causal from spurious features. Uses Matrix-based Mutual Information objective to minimize mutual information between latent features and domain labels while maximizing information with class labels. Includes StyleMix-driven VICReg module for stability.
Result: Experimental results on OOD benchmarks show HCD outperforms existing top-tier methods.
Conclusion: HCD effectively addresses shortcut learning in deep learning models by performing causal intervention at the representation level, improving OOD generalization.
Abstract: Out-of-Distribution (OOD) generalization has become a primary metric for evaluating image analysis systems. Since deep learning models tend to capture domain-specific context, they often develop shortcut dependencies on these non-causal features, leading to inconsistent performance across different data sources. Current techniques, such as invariance learning, attempt to mitigate this. However, they struggle to isolate highly mixed features within deep latent spaces. This limitation prevents them from fully resolving the shortcut learning problem. In this paper, we propose Hierarchical Causal Dropout (HCD), a method that uses channel-level causal masks to enforce feature sparsity. This approach allows the model to separate causal features from spurious ones, effectively performing a causal intervention at the representation level. The training is guided by a Matrix-based Mutual Information (MMI) objective to minimize the mutual information between latent features and domain labels, while simultaneously maximizing the information shared with class labels. To ensure stability, we incorporate a StyleMix-driven VICReg module, which prevents the masks from accidentally filtering out essential causal data. Experimental results on OOD benchmarks show that HCD performs better than existing top-tier methods.
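A channel-level causal mask can be approximated by a learnable sigmoid gate with a sparsity penalty, so that putatively spurious channels are driven toward zero. Below is a minimal stand-in for the hierarchical masks, omitting the MMI objective and the StyleMix-driven VICReg module.

```python
import torch
import torch.nn as nn

class ChannelCausalMask(nn.Module):
    """Learnable channel-level gate: a sigmoid mask scales feature
    channels, and an L1 penalty pushes it toward sparsity so only the
    surviving channels carry class information. A minimal illustrative
    stand-in, not the paper's implementation."""
    def __init__(self, channels, sparsity_weight=1e-3):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(channels))
        self.sparsity_weight = sparsity_weight

    def forward(self, feats):                     # feats: (B, C, H, W)
        mask = torch.sigmoid(self.logits).view(1, -1, 1, 1)
        penalty = self.sparsity_weight * mask.abs().sum()
        return feats * mask, penalty
```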
[148] Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors
Chengxu Yang, Jingling Yuan, Chuang Hu, Jiawei Jiang
Main category: cs.CV
TL;DR: CLVA addresses object hallucination in Multimodal LLMs by identifying that hallucination stems from deep layer attention regressing to early visual noise, and proposes a training-free method using cross-layer visual anchors from intermediate layers to correct attention drift.
Details
Motivation: Multimodal LLMs suffer from object hallucination where they generate incorrect object descriptions. Existing methods using attention enhancement and visual retracing lack interpretability regarding attention drift in final model stages, particularly how attention evolves across layers.
Method: CLVA (Cross-Layer Visual Anchors) is a training-free method that: 1) Analyzes layer-wise evolution of visual features to discover hallucination stems from deep layer attention regressing toward initial visual noise from early layers; 2) Identifies that output reliability depends on acquiring visual anchors at intermediate layers; 3) Reinforces critical mid-layer features while suppressing regressive noise; 4) Pulls deep layer attention back to correct visual regions using essential anchors captured from attention dynamics.
Result: The method demonstrates outstanding performance across diverse architectures and benchmarks, effectively reducing object hallucination without significant increase in computational time and GPU memory.
Conclusion: Object hallucination in Multimodal LLMs is caused by attention regression to early visual noise, and can be effectively mitigated by leveraging cross-layer visual anchors from intermediate layers, providing a training-free solution with strong performance across various models.
Abstract: Multimodal Large Language Models often suffer from object hallucination. While existing research utilizes attention enhancement and visual retracing, we find these works lack sufficient interpretability regarding attention drift in final model stages. In this paper, we investigate the layer-wise evolution of visual features and discover that hallucination stems from deep-layer attention regressing toward initial visual noise from early layers. We observe that output reliability depends on acquiring visual anchors at intermediate layers rather than final layers. Based on these insights, we propose CLVA, which stands for Cross-Layer Visual Anchors, a training-free method that reinforces critical mid-layer features while suppressing regressive noise. This approach effectively pulls deep-layer attention back to correct visual regions by utilizing essential anchors captured from attention dynamics. We evaluate our method across diverse architectures and benchmarks, demonstrating outstanding performance without a significant increase in computational time and GPU memory.
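The correction itself can be as simple as convexly blending a deep layer's attention over visual tokens toward the mid-layer anchor distribution and renormalizing. A toy sketch; the blend weight and the choice of anchor layer are assumptions.

```python
import torch

def anchor_corrected_attention(attn_deep, attn_mid, alpha=0.5):
    """Convexly blend a deep layer's attention over visual tokens toward
    mid-layer anchor attention, then renormalize. attn_*: (..., V)
    nonnegative attention over V visual tokens; alpha is illustrative."""
    blended = (1 - alpha) * attn_deep + alpha * attn_mid
    return blended / blended.sum(dim=-1, keepdim=True).clamp_min(1e-9)
```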
[149] THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics
Tzu-Yen Ma, Bo Zhang, Zichen Tang, Junpeng Ding, Haolin Tian, Yuanze Li, Zhuodi Hao, Zixin Ding, Zirui Wang, Xinyu Yu, Shiyao Peng, Yizhuo Zhao, Ruomeng Jiang, Yiling Huang, Peizhi Zhao, Jiayuan Chen, Weisheng Tan, Haocheng Gao, Yang Liu, Jiacheng Liu, Zhongjun Yang, Jiayu Huang, Haihong E
Main category: cs.CV
TL;DR: THEMIS is a comprehensive benchmark for evaluating multimodal LLMs on visual fraud reasoning in academic scenarios, featuring 4,000+ questions across 7 real-world scenarios, 5 fraud types with 16 fine-grained operations, and multi-dimensional capability evaluation.
Details
Motivation: Existing benchmarks lack the complexity and real-world relevance needed to properly evaluate MLLMs for visual fraud reasoning in academic contexts. There's a gap between current evaluation methods and the actual challenges of detecting academic fraud involving complex multimodal data.
Method: Created THEMIS benchmark with 4,000+ questions derived from authentic retracted-paper cases and synthetic multimodal data. Features 7 real-world scenarios, 5 fraud types, 16 fine-grained manipulation operations, and maps fraud types to 5 core visual fraud reasoning capabilities for comprehensive evaluation.
Result: Tested 16 leading MLLMs, with even the best-performing model (GPT-5) achieving only 56.15% overall performance, demonstrating THEMIS presents a stringent test. The benchmark reveals distinct strengths and specific weaknesses of different models across core capabilities.
Conclusion: THEMIS fills a critical gap in evaluating MLLMs for complex, real-world visual fraud reasoning and is expected to advance development of MLLMs for such challenging tasks in academic and other domains.
Abstract: We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) Real-World Scenarios and Complexity: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 60.47% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Fraud-Type Diversity and Granularity: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, with the diversity and difficulty of these manipulations demanding a high level of visual fraud reasoning from the models. (3) Multi-Dimensional Capability Evaluation: We establish a mapping from fraud types to five core visual fraud reasoning capabilities, thereby enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these core capabilities. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, achieves an overall performance of only 56.15%, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud reasoning tasks.
[150] Pixelis: Reasoning in Pixels, from Seeing to Acting
Yunpeng Zhou
Main category: cs.CV
TL;DR: Pixelis is a pixel-space agent that learns through executable operations on images/videos (zoom, segment, track, OCR, etc.) using a three-phase training approach combining supervised fine-tuning, curiosity-coherence reward optimization, and pixel test-time RL for embodied visual intelligence.
Details
Motivation: Most vision-language systems are passive observers that only describe pixels without acting, limiting generalizable, physically grounded visual intelligence. The authors argue that learning through action rather than static description is essential for moving beyond curated data.
Method: Three-phase approach: 1) Supervised Fine-Tuning with masked imitation loss to learn pixel-tool grammar from Chain-of-Thought-Action traces; 2) Curiosity-Coherence Reward Fine-Tuning with dual-drive objective combining prediction-error curiosity, adjacent-step coherence, and efficiency prior; 3) Pixel Test-Time RL for label-free adaptation using neighbor retrieval, trajectory voting, and KL-to-EMA safety control.
Result: Across six public image and video benchmarks, Pixelis achieves average relative gain of +4.08% over the same 8B baseline (peaking at +6.03% on VSI-Bench), produces shorter, auditable toolchains, and maintains in-corridor KL during test-time learning.
Conclusion: Acting within pixels rather than abstract tokens grounds multimodal perception in the physical world, links visual reasoning with actionable outcomes, and enables embodied adaptation without external feedback.
Abstract: Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential for moving beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from its consequences. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens and auxiliary heads to stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective marrying prediction-error curiosity with adjacent-step coherence and a mild efficiency prior under a KL anchor, yielding short, valid, structured toolchains; (3) Pixel Test-Time RL performs label-free adaptation by retrieving neighbors, voting over complete trajectories rather than answers, and updating toward short, high-fidelity exemplars while constraining drift with a KL-to-EMA safety control. Across six public image and video benchmarks, Pixelis yields consistent improvements: the average relative gain is +4.08% over the same 8B baseline (peaking at +6.03% on VSI-Bench), computed as (ours-baseline)/baseline, while producing shorter, auditable toolchains and maintaining in-corridor KL during test-time learning. Acting within pixels, rather than abstract tokens, grounds multimodal perception in the physical world, linking visual reasoning with actionable outcomes, and enables embodied adaptation without external feedback.
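The phase-2 objective combines four scalar signals. Below is a toy trajectory reward in that spirit, summing prediction-error curiosity and adjacent-step coherence while penalizing toolchain length and KL drift; all weights are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def dual_drive_reward(pred_next, obs_next, step_feats, n_steps, kl_to_ref,
                      w_cur=1.0, w_coh=0.5, w_eff=0.1, w_kl=0.05):
    """Toy dual-drive trajectory reward: prediction-error curiosity plus
    adjacent-step coherence, minus an efficiency prior on toolchain
    length and a KL anchor term. step_feats: (T, D) per-step features."""
    curiosity = F.mse_loss(pred_next, obs_next)            # surprise signal
    coherence = F.cosine_similarity(step_feats[:-1],
                                    step_feats[1:], dim=-1).mean()
    return (w_cur * curiosity + w_coh * coherence
            - w_eff * float(n_steps) - w_kl * kl_to_ref)
```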
[151] Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning
Yuqiao Zeng, Xu Wang, Tengfei Liang, Yiqing Hao, Yi Jin, Hui Yu
Main category: cs.CV
TL;DR: RL-MBA: A reinforcement learning framework for adaptive multimodal active learning that dynamically balances modality contributions and prioritizes difficult samples based on uncertainty.
Details
Motivation: Multimodal learning requires large labeled datasets, which are expensive to obtain. Traditional multimodal active learning approaches assume static modality importance and fixed selection rules, failing to adapt to the dynamic nature of multimodal learning where modality relevance and sample difficulty change during training.
Method: Proposes RL-MBA, a reinforcement learning framework that models sample selection as a Markov Decision Process. Key components: (1) Adaptive Modality Contribution Balancing (AMCB) dynamically adjusts modality weights via reinforcement feedback, and (2) Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA) estimates sample difficulty using uncertainty-based evidential fusion to prioritize informative samples.
Result: Experiments on Food101, KineticsSound, and VGGSound datasets show RL-MBA consistently outperforms strong baselines, improving both classification accuracy and modality fairness under limited labeling budgets.
Conclusion: RL-MBA effectively addresses the dynamic nature of multimodal learning by adaptively balancing modality contributions and prioritizing difficult samples, making multimodal active learning more efficient and effective with limited labeled data.
Abstract: Multimodal learning integrates complementary information from different modalities such as image, text, and audio to improve model performance, but its success relies on large-scale labeled data, which is costly to obtain. Active learning (AL) mitigates this challenge by selectively annotating informative samples. In multimodal settings, many approaches implicitly assume that modality importance is stable across rounds and keep selection rules fixed at the fusion stage, which leaves them insensitive to the dynamic nature of multimodal learning, where the relative value of modalities and the difficulty of instances shift as training proceeds. To address this issue, we propose RL-MBA, a reinforcement-learning framework for modality-balanced, difficulty-aware multimodal active learning. RL-MBA models sample selection as a Markov Decision Process, where the policy adapts to modality contributions, uncertainty, and diversity, and the reward encourages accuracy gains and balance. Two key components drive this adaptability: (1) Adaptive Modality Contribution Balancing (AMCB), which dynamically adjusts modality weights via reinforcement feedback, and (2) Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA), which estimates sample difficulty via uncertainty-based evidential fusion to prioritize informative samples. Experiments on Food101, KineticsSound, and VGGSound demonstrate that RL-MBA consistently outperforms strong baselines, improving both classification accuracy and modality fairness under limited labeling budgets.
[152] MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning
Chenglong Wang, Yifu Huo, Yang Gan, Qiaozhi He, Qi Meng, Bei Li, Yan Wang, Junfu Liu, Tianhua Zhou, Jingbo Zhu, Tong Xiao
Main category: cs.CV
TL;DR: MSRL enables scalable reinforcement learning for multimodal reward models using limited multimodal data by transferring reward reasoning from text to multimodal tasks through progressive stages.
Details
Motivation: Current RLVR-based multimodal reward models require expensive labeled multimodal preference data, limiting scalability. Need methods to train effective MRMs with limited multimodal annotations.
Method: Multi-Stage Reinforcement Learning (MSRL) with three stages: 1) Learn reward reasoning from large-scale textual preference data, 2) Transfer to multimodal via caption-based RL, 3) Fully multimodal RL. Uses cross-modal knowledge distillation for preference generalization.
Result: MSRL improves MRM performance significantly on visual understanding (66.6% to 75.9% on VL-RewardBench) and visual generation (70.2% to 75.7% on GenAI-Bench) without additional multimodal annotations.
Conclusion: MSRL provides scalable RL training for generative multimodal reward models, overcoming data limitations by transferring reward reasoning from text to multimodal domains.
Abstract: Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning from verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically relies on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale MRM training. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable RL for MRMs with limited multimodal data. MSRL replaces the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data, and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement-learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench), without requiring additional multimodal preference annotations. Our code is available at: https://github.com/wangclnlp/MSRL.
[153] MoireMix: A Formula-Based Data Augmentation for Improving Image Classification Robustness
Yuto Matsuo, Yoshihiro Fukuhara, Yuki M. Asano, Rintaro Yanagi, Hirokatsu Kataoka, Akio Nakamura
Main category: cs.CV
TL;DR: Procedural Moire interference pattern augmentation for image classification improves robustness with negligible computational cost and no external data.
Details
Motivation: Current data augmentation methods rely on diffusion-based synthesis or complex feature mixing, which introduce substantial computational overhead or require external datasets. The authors seek a lightweight, storage-free alternative that doesn't depend on external data.
Method: Procedural augmentation using analytic Moire interference patterns generated on-the-fly via closed-form mathematical formulation. Patterns are synthesized directly in memory, mixed with training images during training, and immediately discarded.
Result: The method achieves 0.0026 seconds per image computational cost and consistently improves robustness across multiple benchmarks (ImageNet-C, ImageNet-R, adversarial benchmarks), outperforming standard augmentation baselines and existing external-data-free approaches.
Conclusion: Analytic interference patterns provide a practical and efficient alternative to data-driven generative augmentation methods for improving model robustness without computational overhead or external data requirements.
Abstract: Data augmentation is a key technique for improving the robustness of image classification models. However, many recent approaches rely on diffusion-based synthesis or complex feature mixing strategies, which introduce substantial computational overhead or require external datasets. In this work, we explore a different direction: procedural augmentation based on analytic interference patterns. Unlike conventional augmentation methods that rely on stochastic noise, feature mixing, or generative models, our approach exploits Moire interference to generate structured perturbations spanning a wide range of spatial frequencies. We propose a lightweight augmentation method that procedurally generates Moire textures on-the-fly using a closed-form mathematical formulation. The patterns are synthesized directly in memory with negligible computational cost (0.0026 seconds per image), mixed with training images during training, and immediately discarded, enabling a storage-free augmentation pipeline without external data. Extensive experiments with Vision Transformers demonstrate that the proposed method consistently improves robustness across multiple benchmarks, including ImageNet-C, ImageNet-R, and adversarial benchmarks, outperforming standard augmentation baselines and existing external-data-free augmentation approaches. These results suggest that analytic interference patterns provide a practical and efficient alternative to data-driven generative augmentation methods.
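The closed-form generation is the whole trick: two sinusoidal gratings at slightly different frequencies or orientations multiply into a low-frequency beat pattern, so patterns can be produced on the fly with no stored assets. Below is a sketch with illustrative frequency and mixing choices; the paper's exact formulation may differ.

```python
import numpy as np

def moire_pattern(h, w, f1=0.12, f2=0.13, theta=0.08):
    """Closed-form Moire interference: the product of two sinusoidal
    gratings at slightly different frequencies/orientations produces a
    low-frequency beat pattern. Frequencies are in cycles per pixel;
    all values are illustrative choices."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    g1 = np.sin(2 * np.pi * f1 * xs)
    g2 = np.sin(2 * np.pi * f2 * (np.cos(theta) * xs + np.sin(theta) * ys))
    return 0.5 * (g1 * g2 + 1.0)          # rescaled to [0, 1]

def moire_mix(image, alpha=0.3):
    """Blend an on-the-fly pattern into a float image in [0, 1];
    the pattern is generated per call and never stored."""
    h, w = image.shape[:2]
    pattern = moire_pattern(h, w)[..., None]
    return (1 - alpha) * image + alpha * pattern
```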
[154] AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization
Jiawei Lin, Wanrong Zhu, Vlad I Morariu, Christopher Tensmeyer
Main category: cs.CV
TL;DR: AnyDoc is a framework for document generation using HTML/CSS format with a large-scale synthetic dataset and multimodal LLM fine-tuning for three document tasks, enhanced with height-aware reinforcement learning to prevent content overflow.
Details
Motivation: Document generation is important for AI content creation, but existing approaches are limited by small human-crafted datasets and lack comprehensive coverage across document categories and styles. There's a need for scalable solutions that can handle diverse document generation tasks in a unified format.
Method: 1) Creates scalable data synthesis pipeline to automatically generate documents in HTML/CSS format, producing DocHTML dataset with 265k samples across 111 categories and 32 styles. 2) Fine-tunes multimodal LLMs for three tasks: intention-to-document, document derendering, and element-to-document. 3) Introduces height-aware reinforcement learning (HARL) post-training with reward function based on predicted vs target document heights to penalize and mitigate content overflow issues.
Result: AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three document generation tasks. The framework successfully handles multiple generation tasks across wide document spectrum in unified HTML/CSS format, with HARL effectively addressing content overflow issues.
Conclusion: AnyDoc provides a comprehensive framework for document generation with scalable data synthesis, multimodal LLM fine-tuning, and innovative reinforcement learning approach to handle content overflow, advancing the field of AI-driven document creation.
Abstract: Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline to automatically generate documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset containing 265,206 document samples, while spanning 111 categories and 32 distinct styles. Additionally, all documents are equipped with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multi-modal large language models (MLLMs) to achieve three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. By defining a reward function based on the difference between predicted and target document heights, overflow is penalized and gradually mitigated during HARL, thereby enhancing overall performance. Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.
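The height-aware reward needs only the rendered and target document heights. Here is a toy shaping function in that spirit, where overflow (rendered height well past the target) loses reward; the tolerance band and linear decay are assumptions, not the paper's reward.

```python
def height_reward(pred_height, target_height, tol=0.05):
    """Toy height-aware reward: full reward inside a relative tolerance
    band around the target height, decaying linearly with relative
    error beyond it, so overflowing documents score poorly."""
    rel_err = abs(pred_height - target_height) / max(target_height, 1e-6)
    return 1.0 if rel_err <= tol else max(0.0, 1.0 - rel_err)
```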
[155] AirSplat: Alignment and Rating for Robust Feed-Forward 3D Gaussian Splatting
Minh-Quan Viet Bui, Jaeho Moon, Munchurl Kim
Main category: cs.CV
TL;DR: AirSplat adapts 3D Vision Foundation Models for pose-free novel view synthesis using self-consistent pose alignment and rating-based opacity matching to achieve high-fidelity results.
Details
Motivation: While 3D Vision Foundation Models (3DVFMs) show strong zero-shot capabilities in visual geometry estimation, they struggle with generalizable novel view synthesis (NVS). The paper aims to adapt 3DVFMs' geometric priors for high-quality, pose-free NVS.
Method: Two key contributions: (1) Self-Consistent Pose Alignment (SCPA) - a training-time feedback loop ensuring pixel-aligned supervision to resolve pose-geometry discrepancies; (2) Rating-based Opacity Matching (ROM) - leverages local 3D geometry consistency from a sparse-view NVS teacher model to filter degraded primitives.
Result: Experimental results on large-scale benchmarks show AirSplat significantly outperforms state-of-the-art pose-free NVS approaches in reconstruction quality.
Conclusion: AirSplat demonstrates the potential of adapting 3DVFMs to enable simultaneous visual geometry estimation and high-quality view synthesis.
Abstract: While 3D Vision Foundation Models (3DVFMs) have demonstrated remarkable zero-shot capabilities in visual geometry estimation, their direct application to generalizable novel view synthesis (NVS) remains challenging. In this paper, we propose AirSplat, a novel training framework that effectively adapts the robust geometric priors of 3DVFMs into high-fidelity, pose-free NVS. Our approach introduces two key technical contributions: (1) Self-Consistent Pose Alignment (SCPA), a training-time feedback loop that ensures pixel-aligned supervision to resolve pose-geometry discrepancy; and (2) Rating-based Opacity Matching (ROM), which leverages the local 3D geometry consistency knowledge from a sparse-view NVS teacher model to filter out degraded primitives. Experimental results on large-scale benchmarks demonstrate that our method significantly outperforms state-of-the-art pose-free NVS approaches in reconstruction quality. Our AirSplat highlights the potential of adapting 3DVFMs to enable simultaneous visual geometry estimation and high-quality view synthesis.
[156] Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation
Yaowen Chang, Zhen Cao, Xu Zheng, Xiaoxin Mi, Zhen Dong
Main category: cs.CV
TL;DR: DAPASS framework for source-free unsupervised domain adaptation in panoramic semantic segmentation addresses geometric distortions and annotation costs using confidence-guided pseudo-label denoising and contextual resolution adversarial alignment.
Details
Motivation: Panoramic semantic segmentation faces challenges from geometric distortions and expensive dense annotation. Source-free UDA is needed when source data is inaccessible due to privacy/proprietary constraints, but suffers from domain shift and unreliable pseudo-labels, especially for minority classes.
Method: DAPASS introduces two modules: 1) Panoramic Confidence-Guided Denoising (PCGD) generates high-fidelity, class-balanced pseudo-labels using perturbation consistency and neighborhood-level confidence filtering; 2) Contextual Resolution Adversarial Module (CRAM) addresses scale variance and distortion by adversarially aligning fine-grained details from high-resolution crops with global semantics from low-resolution contexts.
Result: State-of-the-art performances on outdoor (Cityscapes-to-DensePASS: 55.04% mIoU, +2.05%) and indoor (Stanford2D3D: 70.38% mIoU, +1.54%) benchmarks.
Conclusion: DAPASS effectively addresses source-free UDA challenges in panoramic semantic segmentation through robust pseudo-label generation and explicit handling of scale variance and geometric distortions, achieving superior performance without access to source data.
Abstract: Panoramic semantic segmentation is pivotal for comprehensive 360° scene understanding in critical applications like autonomous driving and virtual reality. However, progress in this domain is constrained by two key challenges: the severe geometric distortions inherent in panoramic projections and the prohibitive cost of dense annotation. While Unsupervised Domain Adaptation (UDA) from label-rich pinhole-camera datasets offers a viable alternative, many real-world tasks impose a stricter source-free (SFUDA) constraint where source data is inaccessible for privacy or proprietary reasons. This constraint significantly amplifies the core problems of domain shift, leading to unreliable pseudo-labels and dramatic performance degradation, particularly for minority classes. To overcome these limitations, we propose the DAPASS framework. DAPASS introduces two synergistic modules to robustly transfer knowledge without source data. First, our Panoramic Confidence-Guided Denoising (PCGD) module generates high-fidelity, class-balanced pseudo-labels by enforcing perturbation consistency and incorporating neighborhood-level confidence to filter noise. Second, a Contextual Resolution Adversarial Module (CRAM) explicitly addresses scale variance and distortion by adversarially aligning fine-grained details from high-resolution crops with global semantics from low-resolution contexts. DAPASS achieves state-of-the-art performances on outdoor (Cityscapes-to-DensePASS) and indoor (Stanford2D3D) benchmarks, yielding 55.04% (+2.05%) and 70.38% (+1.54%) mIoU, respectively.
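Perturbation-consistency filtering is the workhorse of pseudo-label denoising schemes like PCGD: keep a pixel's label only when clean and perturbed predictions agree with sufficient confidence. A minimal sketch using the conventional ignore index 255; the class balancing and neighborhood-level confidence of the actual module are omitted.

```python
import torch

def consistent_pseudo_labels(probs_clean, probs_aug, tau=0.7):
    """Keep a pixel's pseudo-label only if the clean and perturbed
    predictions agree and the clean confidence clears a threshold;
    everything else gets the ignore index 255. probs_*: (B, C, H, W)
    softmax outputs; tau is an illustrative threshold."""
    conf, labels = probs_clean.max(dim=1)            # (B, H, W)
    labels_aug = probs_aug.argmax(dim=1)
    keep = (labels == labels_aug) & (conf > tau)
    return torch.where(keep, labels, torch.full_like(labels, 255))
```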
[157] Robust Principal Component Completion
Yinjian Wang, Wei Li, Yuanyuan Gui, James E. Fowler, Gemine Vivone
Main category: cs.CV
TL;DR: Robust principal component completion (RPCC) addresses foreground occlusion in video analysis by identifying sparse components indirectly through support determination using Bayesian sparse tensor factorization, eliminating post-hoc thresholding.
Details
Motivation: Traditional RPCA assumes sparse foreground is additive to low-rank background, but in real applications like video analysis, foreground objects often occlude/replace background elements. This mismatch requires a new approach that can handle occlusion scenarios.
Method: Proposes RPCC framework using variational Bayesian inference for fully probabilistic Bayesian sparse tensor factorization. Identifies sparse component indirectly by determining its support, converging to a hard classifier that eliminates need for post-hoc thresholding.
Result: Achieves near-optimal estimates on synthetic data and robust foreground-extraction on real color video datasets, plus anomaly-detection performance on hyperspectral datasets. Convergence to hard classifier eliminates thresholding requirements.
Conclusion: RPCC effectively handles occlusion scenarios in video analysis through Bayesian sparse tensor factorization, providing improved foreground extraction and anomaly detection without thresholding post-processing.
Abstract: Robust principal component analysis (RPCA) seeks a low-rank component and a sparse component from their summation. Yet, in many applications of interest, the sparse foreground actually replaces, or occludes, elements from the low-rank background. To address this mismatch, a new framework is proposed in which the sparse component is identified indirectly through determining its support. This approach, called robust principal component completion (RPCC), is solved via variational Bayesian inference applied to a fully probabilistic Bayesian sparse tensor factorization. Convergence to a hard classifier for the support is shown, thereby eliminating the post-hoc thresholding required of most prior RPCA-driven approaches. Experimental results reveal that the proposed approach delivers near-optimal estimates on synthetic data as well as robust foreground-extraction and anomaly-detection performance on real color video and hyperspectral datasets, respectively. Source implementation and Appendices are available at https://github.com/WongYinJ/BCP-RPCC.
[158] EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions
Taegyoon Yoon, Yegyu Han, Seojin Ji, Jaewoo Park, Sojeong Kim, Taein Kwon, Hyung-Sin Kim
Main category: cs.CV
TL;DR: EgoXtreme: A large-scale 6D object pose estimation dataset captured from egocentric perspective with extreme real-world challenges like motion blur, dynamic lighting, and visual obstructions.
Details
Motivation: Existing 6D object pose estimation benchmarks fail to capture real-world egocentric challenges like severe motion blur, dynamic illumination, and visual obstructions, creating a gap between lab data and real applications.
Method: Introduces EgoXtreme dataset with three challenging scenarios (industrial maintenance, sports, emergency rescue) featuring extreme lighting, heavy motion blur, and smoke to introduce severe perceptual ambiguities.
Result: State-of-the-art generalizable pose estimators fail in extreme conditions, especially low light. Image restoration offers no improvement, but tracking-based approaches show gains, indicating temporal information is meaningful.
Conclusion: EgoXtreme is essential for developing next-generation pose estimation models robust enough for real-world egocentric vision applications.
Abstract: Smart glasses are emerging as useful devices, since they provide plenty of insight in hands-busy, eyes-on-task situations. To understand the wearer's context, 6D object pose estimation in the egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world applications. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios - industrial maintenance, sports, and emergency rescue - designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no improvement under extreme conditions, whereas a tracking-based approach does show gains, implying that exploiting temporal information in fast-motion scenarios is meaningful. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset and code are available at https://taegyoun88.github.io/EgoXtreme/
[159] FD$^2$: A Dedicated Framework for Fine-Grained Dataset Distillation
Hongxu Ma, Guang Li, Shijie Wang, Dongzhan Zhou, Baoli Sun, Takahiro Ogawa, Miki Haseyama, Zhihui Wang
Main category: cs.CV
TL;DR: FD² is a fine-grained dataset distillation framework that improves upon existing decoupled methods by localizing discriminative regions and constructing fine-grained representations, addressing issues of intra-class variation and inter-class similarity in fine-grained datasets.
Details
Motivation: Existing decoupled dataset distillation methods rely on coarse class-label supervision and treat samples within each class similarly, which is problematic for fine-grained datasets where subtle inter-class differences exist alongside large intra-class variation. This leads to distilled samples that retain confusing variations and become overly similar within classes, limiting discriminative cues and hurting recognition performance.
Method: FD² introduces a three-stage approach: 1) During pretraining, counterfactual attention learning aggregates discriminative representations to update class prototypes; 2) During distillation, a fine-grained characteristic constraint aligns each sample with its class prototype while repelling others; 3) A similarity constraint diversifies attention across same-class samples to prevent over-similarity.
Result: Experiments on multiple fine-grained and general datasets show that FD² integrates seamlessly with decoupled dataset distillation and improves performance in most settings, demonstrating strong transferability across different dataset types.
Conclusion: FD² effectively addresses the limitations of existing decoupled dataset distillation methods for fine-grained datasets by localizing discriminative regions and constructing fine-grained representations, leading to improved recognition performance while maintaining the efficiency benefits of the decoupled approach.
Abstract: Dataset distillation (DD) compresses a large training set into a small synthetic set, reducing storage and training cost, and has shown strong results on general benchmarks. Decoupled DD further improves efficiency by splitting the pipeline into pretraining, sample distillation, and soft-label generation. However, existing decoupled methods largely rely on coarse class-label supervision and optimize samples within each class in a nearly identical manner. On fine-grained datasets, this often yields distilled samples that (i) retain large intra-class variation with subtle inter-class differences and (ii) become overly similar within the same class, limiting localized discriminative cues and hurting recognition. To solve the above-mentioned problems, we propose FD$^{2}$, a dedicated framework for Fine-grained Dataset Distillation. FD$^{2}$ localizes discriminative regions and constructs fine-grained representations for distillation. During pretraining, counterfactual attention learning aggregates discriminative representations to update class prototypes. During distillation, a fine-grained characteristic constraint aligns each sample with its class prototype while repelling others, and a similarity constraint diversifies attention across same-class samples. Experiments on multiple fine-grained and general datasets show that FD$^{2}$ integrates seamlessly with decoupled DD and improves performance in most settings, indicating strong transferability.
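The fine-grained characteristic constraint (align each sample to its class prototype, repel the others) has the shape of a standard prototype-contrastive objective. Below is a generic InfoNCE-style stand-in, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(feats, labels, prototypes, temp=0.1):
    """Pull each sample toward its class prototype and push it from the
    rest via a prototype-contrastive (InfoNCE-style) objective.
    feats: (B, D) sample features; labels: (B,) class ids;
    prototypes: (num_classes, D). Temperature is illustrative."""
    feats = F.normalize(feats, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    logits = feats @ protos.T / temp                 # (B, num_classes)
    return F.cross_entropy(logits, labels)
```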
[160] Learning to Rank Caption Chains for Video-Text Alignment
Ansel Blume, Burak Uzkent, Shalini Chaudhuri, Garin Kessler
Main category: cs.CV
TL;DR: Ranking optimization outperforms binary DPO for vision-language models by better handling visual faithfulness in long-form content generation and assessment.
Details
Motivation: Standard DPO's binary "winner-takes-all" approach is suboptimal for vision-language models where response quality depends on visual content. A response may still be faithful to visual inputs even if less preferable than alternatives, but DPO lacks this nuance.
Method: Proposes ranking optimization as an alternative to binary DPO, focusing on video-text alignment using detailed video captions. Generates challenging, totally ordered caption chains at scale through repeated caption degradation. Requires finetuning of the vision encoder, not just language model reweighting.
Result: Ranking optimization outperforms binary DPO for long-form content generation and assessment. Approaches require finetuning of the vision encoder to be effective, challenging the view of DPO as purely a language-reweighting process.
Conclusion: Ranking optimization better captures visual faithfulness nuances than binary DPO for vision-language models, and vision encoder finetuning is crucial for effective preference optimization in multimodal settings.
Abstract: Direct preference optimization (DPO) is an effective technique to train language models to generate preferred over dispreferred responses. However, this binary “winner-takes-all” approach is suboptimal for vision-language models whose response quality is highly dependent on visual content. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. The standard Bradley-Terry DPO formulation lacks this nuance, upweighting winning responses without sufficient regard for whether the “losing” response still maintains high visual fidelity. In this work, we investigate ranking optimization as an alternative that more precisely situates responses’ faithfulness to visual inputs. We focus on video-text alignment using detailed video captions, proposing a method to generate challenging, totally ordered caption chains at scale through repeated caption degradation. Our results show ranking optimization outperforms binary DPO for long-form content generation and assessment, and importantly, we find that these approaches require finetuning of the vision encoder to be effective, challenging the view of DPO as purely a language-reweighting process.
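With a totally ordered caption chain, ranking optimization can be written as pairwise margin constraints between every earlier (less degraded) and later (more degraded) caption. A toy pairwise-margin loss over one chain; the margin and the reduction are illustrative choices, not the paper's objective.

```python
import torch
import torch.nn.functional as F

def chain_ranking_loss(scores, margin=0.1):
    """Pairwise margin ranking over one totally ordered caption chain.
    scores: (n,) model scores with index 0 = least degraded caption,
    so each caption should outscore every later (more degraded) one."""
    n = scores.shape[0]
    loss = scores.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            loss = loss + F.relu(margin - (scores[i] - scores[j]))
    return loss / (n * (n - 1) / 2)
```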
[161] Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models
Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li, Minfeng Xu
Main category: cs.CV
TL;DR: Photon is a framework for efficient 3D medical visual question answering that uses adaptive token scheduling to reduce computational costs while maintaining volumetric continuity and accuracy.
Details
Motivation: Current multimodal LLMs struggle with 3D medical imaging due to high computational costs. Existing methods using 2D slices or fixed token compression disrupt volumetric continuity and obscure subtle findings, limiting clinical applicability.
Method: Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during training/inference. It uses custom backpropagation with gradient restoration for differentiable optimization despite discrete token drop, plus regularization to mitigate language-only bias.
Result: Experiments show Photon achieves state-of-the-art accuracy on diverse medical VQA tasks while reducing resource usage and accelerating both training and inference.
Conclusion: Photon enables efficient 3D medical VQA by adaptively compressing tokens while preserving volumetric information, offering a practical solution for clinical applications.
Abstract: Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens. It incorporates a custom backpropagation rule with gradient restoration to enable differentiable optimization despite discrete token drop. To stabilize token compression and ensure reliable use of visual evidence, Photon further applies regularization objectives that mitigate language-only bias and improve reliability. Experiments on diverse medical visual question answering tasks show that Photon achieves state-of-the-art accuracy while reducing resource usage and accelerating both training and inference.
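Discrete token dropping is non-differentiable, which is why a surrogate gradient is needed; the standard trick is a straight-through estimator that applies a hard 0/1 mask in the forward pass while letting gradients flow through the soft gate in the backward pass. A generic sketch of that pattern, not Photon's specific backpropagation rule.

```python
import torch

def straight_through_token_drop(tokens, keep_logits):
    """Surrogate-gradient token drop: a hard 0/1 keep mask is applied in
    the forward pass, while gradients flow through the soft sigmoid
    (straight-through estimator). tokens: (B, N, D); keep_logits: (B, N)."""
    soft = torch.sigmoid(keep_logits)
    hard = (soft > 0.5).float()
    mask = hard + soft - soft.detach()       # hard forward, soft backward
    return tokens * mask.unsqueeze(-1)
```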
[162] A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection
SuYeon Kim, Wongyu Lee, MyeongAh Cho
Main category: cs.CV
TL;DR: A unified 3D anomaly detection model that addresses inter-category entanglement through semantic disentanglement, achieving state-of-the-art performance on 3D point cloud datasets.
Details
Motivation: Unified models for 3D anomaly detection suffer from Inter-Category Entanglement (ICE), where latent features from different categories overlap, causing incorrect semantic priors during reconstruction and unreliable anomaly scores.
Method: Proposes Semantically Disentangled Unified Model with three components: (1) Coarse-to-Fine Global Tokenization for instance-level semantic identity, (2) Category-Conditioned Contrastive Learning for disentangling category semantics, and (3) Geometry-Guided Decoder for semantically consistent reconstruction.
Result: Achieves state-of-the-art performance on Real3D-AD and Anomaly-ShapeNet datasets, improving object-level AUROC by 2.8% for unified models and 9.1% for category-specific models.
Conclusion: The proposed method effectively addresses ICE in unified 3D anomaly detection models through semantic disentanglement, enhancing both performance and reliability while maintaining scalability across multiple categories.
Abstract: 3D anomaly detection targets the detection and localization of defects in 3D point clouds trained solely on normal data. While a unified model improves scalability by learning across multiple categories, it often suffers from Inter-Category Entanglement (ICE)-where latent features from different categories overlap, causing the model to adopt incorrect semantic priors during reconstruction and ultimately yielding unreliable anomaly scores. To address this issue, we propose the Semantically Disentangled Unified Model for 3D Anomaly Detection, which reconstructs features conditioned on disentangled semantic representations. Our framework consists of three key components: (i) Coarse-to-Fine Global Tokenization for forming instance-level semantic identity, (ii) Category-Conditioned Contrastive Learning for disentangling category semantics, and (iii) a Geometry-Guided Decoder for semantically consistent reconstruction. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that our method achieves state-of-the-art for both unified and category-specific models, improving object-level AUROC by 2.8% and 9.1%, respectively, while enhancing the reliability of unified 3D anomaly detection.
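Category-Conditioned Contrastive Learning, as summarized above, pulls same-category latents together and pushes categories apart to counter ICE. A generic supervised-contrastive loss of that flavor could look like the sketch below (illustrative, not the paper's exact objective):

```python
import torch
import torch.nn.functional as F

def category_contrastive_loss(z, labels, tau=0.07):
    # z: (B, D) instance-level global tokens; labels: (B,) category ids.
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                                        # (B, B) similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask      # same-category pairs
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float('-inf')), dim=-1, keepdim=True)
    # Maximize likelihood of same-category pairs so categories disentangle in latent space.
    return -(log_prob * pos).sum(-1).div(pos.sum(-1).clamp(min=1)).mean()
```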
[163] SportSkills: Physical Skill Learning from Sports Instructional Videos
Kumar Ashutosh, Chi Hsuan Wu, Kristen Grauman
Main category: cs.CV
TL;DR: SportSkills: A large-scale sports dataset with 360k+ instructional videos for physical skill learning, enabling fine-grained action understanding and mistake-conditioned video retrieval for personalized feedback.
Details
Motivation: Current video datasets lack depth in fine-grained physical activities needed for skill learning. There's a need for datasets that capture the nuances of sports skills with instructional narrations to enable better understanding of physical actions and provide actionable feedback.
Method: Introduces SportSkills dataset with 360k+ instructional videos covering 55 sports, featuring visual demonstrations paired with instructional narrations. Develops representation learning for fine-grained action understanding and formulates mistake-conditioned instructional video retrieval task.
Result: Achieves 4x gains in representation learning compared to traditional activity-centric datasets. Professional coach evaluations show significant advancement in personalized visual instruction retrieval for user queries.
Conclusion: SportSkills enables fine-grained physical action understanding and bridges representation learning with actionable feedback generation through mistake-conditioned video retrieval, advancing personalized skill learning.
Abstract: Current large-scale video datasets focus on general human activity, but lack depth of coverage on fine-grained activities needed to address physical skill learning. We introduce SportSkills, the first large-scale sports dataset geared towards physical skill learning with in-the-wild video. SportSkills has more than 360k instructional videos containing more than 630k visual demonstrations paired with instructional narrations explaining the know-how behind the actions from 55 varied sports. Through a suite of experiments, we show that SportSkills unlocks the ability to understand fine-grained differences between physical actions. Our representation achieves gains of up to 4x with the same model trained on traditional activity-centric datasets. Crucially, building on SportSkills, we introduce the first large-scale task formulation of mistake-conditioned instructional video retrieval, bridging representation learning and actionable feedback generation (e.g., “here’s my execution of a skill; which video clip should I watch to improve it?”). Formal evaluations by professional coaches show our retrieval approach significantly advances the ability of video models to personalize visual instructions for a user query.
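Mistake-conditioned retrieval, as formulated above, ranks instructional clips given a user's execution and the mistake to fix. A minimal embedding-space sketch follows; the fusion rule and all names are assumptions rather than the paper's method:

```python
import torch
import torch.nn.functional as F

def retrieve_instruction(exec_emb, mistake_emb, clip_embs, alpha=0.5):
    # exec_emb: (D,) embedding of the user's skill execution video
    # mistake_emb: (D,) embedding of the detected or described mistake
    # clip_embs: (N, D) embeddings of candidate instructional clips
    query = F.normalize(alpha * exec_emb + (1 - alpha) * mistake_emb, dim=-1)
    sims = F.normalize(clip_embs, dim=-1) @ query
    return sims.argsort(descending=True)      # clip indices, best match first
```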
[164] An Image Dataset of Common Skin Diseases of Bangladesh and Benchmarking Performance with Machine Learning Models
Sazzad Hossain, Saiful Islam, Muhammad Ibrahim, Md. Rasel Ahmed, Md Shuayb, Ahmedul Kabir
Main category: cs.CV
TL;DR: A publicly available dataset for detecting 5 common skin diseases in Bangladesh using machine learning and deep learning approaches, with 1612 images collected from medical patients.
Details
Motivation: Address the shortage of dermatological expertise and diagnostic instruments in highly populated countries like Bangladesh by developing automated skin disease detection systems using AI/ML techniques.
Method: Created a dataset of 1612 images (250 distinct, rest augmented) for 5 skin diseases collected from Faridpur Medical College, then applied various machine learning and deep learning models for classification.
Result: Developed a publicly available dataset with 302 Dermatitis, 381 Eczema, 301 Scabies, 316 Tinea Ringworm, and 312 Vitiligo images, and reported classification performance of ML/DL models on this dataset.
Conclusion: The dataset enables automated skin disease detection using computer vision techniques, potentially valuable for global ML-based dermatology applications despite regional data collection.
Abstract: Skin diseases are a major public health concern worldwide, and their detection is often challenging without access to dermatological expertise. In countries like Bangladesh, which is highly populated, the number of qualified skin specialists and diagnostic instruments is insufficient to meet the demand. The lack of proper detection and treatment of skin diseases may lead to severe health consequences, including death. Skin diseases commonly change the color, texture, and pattern of the skin, and in this era of artificial intelligence and machine learning, such changes can be detected using image processing and computer vision techniques. In response to this challenge, we develop a publicly available dataset focused on common skin disease detection using machine learning techniques. We focus on five prevalent skin diseases in Bangladesh: Contact Dermatitis, Vitiligo, Eczema, Scabies, and Tinea Ringworm. The dataset consists of 1612 images (of which 250 are distinct while the others are augmented), collected directly from patients at the outpatient department of Faridpur Medical College, Faridpur, Bangladesh. The data comprise 302, 381, 301, 316, and 312 images of Dermatitis, Eczema, Scabies, Tinea Ringworm, and Vitiligo, respectively. Although the data were collected regionally, the selected diseases are common across many countries, especially in South Asia, making the dataset potentially valuable for global applications in machine learning-based dermatology. We also apply several machine learning and deep learning models to the dataset and report classification performance. We expect that this research will garner attention from machine learning and deep learning researchers and practitioners working in the field of automated disease diagnosis.
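As a rough illustration of the kind of deep-learning baseline such a benchmark might include, here is a standard five-class transfer-learning setup in PyTorch. The architecture and hyperparameters are assumptions; the digest does not specify the paper's exact models.

```python
import torch
import torch.nn as nn
from torchvision import models

# Fine-tune an ImageNet-pretrained ResNet-18 for the five disease classes.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 5)  # Dermatitis, Eczema, Scabies, Tinea Ringworm, Vitiligo
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):               # images: (B, 3, 224, 224), labels: (B,)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```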
[165] Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds
Bin Yang, Mohamed Abdelsamad, Miao Zhang, Alexandru Paul Condurache
Main category: cs.CV
TL;DR: PointINS is a self-supervised learning framework for point clouds that improves instance localization by learning geometry-aware representations through orthogonal offset prediction and complementary regularization strategies.
Details
Motivation: Existing self-supervised learning approaches for point clouds focus on semantic awareness but transfer poorly to instance localization tasks. Since instance awareness is fundamental for 3D perception, bridging this gap is crucial for developing true 3D foundation models that support all downstream tasks.
Method: PointINS introduces an instance-oriented self-supervised framework with an orthogonal offset branch that jointly learns semantic understanding and geometric reasoning. It uses two regularization strategies: Offset Distribution Regularization (ODR) aligns predicted offsets with geometric priors, and Spatial Clustering Regularization (SCR) enforces local coherence using pseudo-instance masks.
Result: Extensive experiments across five datasets show PointINS achieves +3.5% mAP improvement for indoor instance segmentation and +4.1% PQ gain for outdoor panoptic segmentation compared to previous methods.
Conclusion: PointINS successfully bridges the gap between semantic and instance awareness in 3D point cloud learning, paving the way for scalable 3D foundation models that can support diverse downstream tasks.
Abstract: Recent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to instance localization, and often require full finetuning for strong performance. Instance awareness is a fundamental component of 3D perception, thus bridging this gap is crucial for progressing toward true 3D foundation models that support all downstream tasks on 3D data. In this work, we introduce PointINS, an instance-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning. PointINS employs an orthogonal offset branch to jointly learn high-level semantic understanding and geometric reasoning, yielding instance awareness. We identify two consistent properties essential for robust instance localization and formulate them as complementary regularization strategies, Offset Distribution Regularization (ODR), which aligns predicted offsets with empirically observed geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence by regularizing offsets with pseudo-instance masks. Through extensive experiments across five datasets, PointINS achieves on average +3.5% mAP improvement for indoor instance segmentation and +4.1% PQ gain for outdoor panoptic segmentation, paving the way for scalable 3D foundation models.
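Spatial Clustering Regularization (SCR) above regularizes offsets with pseudo-instance masks. One plausible reading, sketched below, pulls offset-shifted points of each pseudo-instance toward a shared center; the center choice and loss form are assumptions:

```python
import torch

def spatial_clustering_reg(points, offsets, pseudo_ids):
    # points, offsets: (N, 3); pseudo_ids: (N,) pseudo-instance assignments.
    loss = points.new_zeros(())
    ids = pseudo_ids.unique()
    for pid in ids:
        m = pseudo_ids == pid
        shifted = points[m] + offsets[m]        # points moved by predicted offsets
        center = shifted.mean(dim=0).detach()   # shared target per pseudo-instance
        loss = loss + (shifted - center).norm(dim=-1).mean()
    return loss / len(ids)
```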
[166] Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models
Yabin Zhang, Maya Varma, Yunhe Gao, Jean-Benoit Delbrouck, Jiaming Liu, Chong Wang, Curtis Langlotz
Main category: cs.CV
TL;DR: TANL proposes test-time activated negative labels for OOD detection by dynamically mining high-activation labels from test samples to better capture OOD characteristics.
Details
Motivation: Existing OOD detection methods using negative labels often fail because these labels may have poor activation on OOD samples, not capturing OOD characteristics effectively.
Method: TANL dynamically evaluates activation levels across corpus datasets during testing, mines candidate labels with high activation responses, constructs label activation metrics using high-confidence test images, and uses activation-aware scoring emphasizing strongly activated negative labels.
Result: Significantly reduces FPR95 from 17.5% to 9.8% on ImageNet benchmark; validated across diverse backbones and task settings.
Conclusion: TANL is an effective, training-free, test-efficient OOD detection method that adaptively selects distribution-aware negative labels using test-time activation information.
Abstract: Out-of-distribution (OOD) detection aims to identify samples that deviate from in-distribution (ID). One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD based on their distance to these labels. However, such labels may present poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose Test-time Activated Negative Labels (TANL) by dynamically evaluating activation levels across the corpus dataset and mining candidate labels with high activation responses during the testing process. Specifically, TANL identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric. Such a metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution-adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine-grained, batch-adaptive variant. To fully utilize label activation knowledge, we propose an activation-aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing its robustness to the label number. Our TANL is training-free, test-efficient, and grounded in theoretical justification. Experiments on diverse backbones and wide task settings validate its effectiveness. Notably, on the large-scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5% to 9.8%. Codes are available at https://github.com/YBZh/OpenOOD-VLM.
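The online label-activation accumulation above can be pictured as a running bank updated by high-confidence test images. A minimal sketch; the class names, threshold, and update rule are illustrative assumptions:

```python
import torch

class ActivationBank:
    # Accumulates corpus-label activations from high-confidence test images.
    def __init__(self, num_corpus_labels, conf_thresh=0.9):
        self.act = torch.zeros(num_corpus_labels)
        self.conf_thresh = conf_thresh

    def update(self, id_probs, corpus_probs):
        # id_probs: (B, C) probs over ID classes; corpus_probs: (B, L) probs over corpus labels.
        confident = id_probs.max(dim=-1).values > self.conf_thresh
        if confident.any():
            self.act += corpus_probs[confident].sum(dim=0)

    def top_negative_labels(self, k):
        # Most strongly activated candidates become the negative-label set.
        return self.act.topk(k).indices
```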
[167] ET-SAM: Efficient Point Prompt Prediction in SAM for Unified Scene Text Detection and Layout Analysis
Xike Zhang, Maoyuan Ye, Juhua Liu, Bo Du
Main category: cs.CV
TL;DR: ET-SAM is an efficient SAM-based framework for unified scene text detection and layout analysis that uses a lightweight point decoder to reduce inference latency and a joint training strategy to leverage heterogeneous text annotations.
Details
Motivation: Previous SAM-based approaches for text detection and layout analysis suffer from high inference latency due to sampling thousands of foreground points as prompts, and limited data utilization due to dependence on pixel-level segmentation annotations.
Method: Proposes ET-SAM with: 1) A lightweight point decoder that produces word heatmaps to generate few foreground points instead of thousands, accelerating inference; 2) A joint training strategy that combines datasets with multi-level, word-level only, and line-level only annotations; 3) Three sets of learnable task prompts in both decoders to handle dataset discrepancies.
Result: Achieves ~3× inference acceleration compared to previous SAM-based architecture while maintaining competitive performance on HierText, and improves average F-score by 11.0% on Total-Text, CTW1500, and ICDAR15 datasets.
Conclusion: ET-SAM provides an efficient solution for unified scene text detection and layout analysis by reducing inference latency and better utilizing heterogeneous annotation data through task-specific prompts.
Abstract: Previous works based on the Segment Anything Model (SAM) have achieved promising performance in unified scene text detection and layout analysis. However, the typical reliance on pixel-level text segmentation for sampling thousands of foreground points as prompts leads to unsatisfactory inference latency and limited data utilization. To address the above issues, we propose ET-SAM, an Efficient framework with two decoders for unified scene Text detection and layout analysis based on SAM. Technically, we customize a lightweight point decoder that produces word heatmaps for obtaining a few foreground points, thereby eliminating excessive point prompts and accelerating inference. Without the dependence on pixel-level segmentation, we further design a joint training strategy to leverage existing data with heterogeneous text-level annotations. Specifically, the datasets with multi-level, word-level only, and line-level only annotations are combined in parallel as a unified training set. For these datasets, we introduce three corresponding sets of learnable task prompts in both the point decoder and the hierarchical mask decoder to mitigate discrepancies across datasets. Extensive experiments demonstrate that, compared to the previous SAM-based architecture, ET-SAM achieves about 3× inference acceleration while obtaining competitive performance on HierText, and improves the average F-score by 11.0% on Total-Text, CTW1500, and ICDAR15.
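Turning a word heatmap into a handful of point prompts, as the lightweight point decoder above does, amounts to peak extraction. A generic sketch using 3x3 max-pool non-maximum suppression; the threshold and point cap are assumptions:

```python
import torch
import torch.nn.functional as F

def heatmap_to_point_prompts(heatmap, max_points=20, thresh=0.5):
    # heatmap: (H, W) word-center heatmap from a lightweight point decoder.
    # Keep only local maxima (3x3 max-pool NMS), then take the top-scoring peaks.
    h = heatmap[None, None]
    peaks = (h == F.max_pool2d(h, 3, stride=1, padding=1)) & (h > thresh)
    ys, xs = peaks[0, 0].nonzero(as_tuple=True)
    order = heatmap[ys, xs].argsort(descending=True)[:max_points]
    return torch.stack([xs[order], ys[order]], dim=-1)  # (K, 2) prompts as (x, y)
```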
[168] Knowledge-Guided Adversarial Training for Infrared Object Detection via Thermal Radiation Modeling
Shiji Zhao, Shukun Xiong, Maoxun Yuan, Yao Huang, Ranjie Duan, Qing Guo, Jiansheng Chen, Haibin Duan, Xingxing Wei
Main category: cs.CV
TL;DR: KGAT improves infrared object detection robustness by embedding thermal radiation knowledge into adversarial training, using relative gray value rank order between classes as stable physical knowledge.
Details
Motivation: Infrared object detection is vulnerable to adversarial attacks and common corruptions, but current data-driven methods don't leverage infrared-specific physical characteristics, limiting robustness.
Method: Proposes Knowledge-Guided Adversarial Training (KGAT) that models thermal radiation relations based on gray value rank order between classes, quantifies relation stability, and embeds this physical knowledge into adversarial training to ensure predictions align with actual physical laws.
Result: Extensive experiments on three infrared datasets and six detection models show KGAT effectively enhances both clean accuracy and robustness against adversarial attacks and common corruptions.
Conclusion: Incorporating infrared physical knowledge (thermal radiation relations) into adversarial training significantly improves robustness of infrared object detection systems against security threats.
Abstract: In complex environments, infrared object detection exhibits broad applicability and stability across diverse scenarios. However, infrared object detection is vulnerable to both common corruptions and adversarial examples, leading to potential security risks. To improve the robustness of infrared object detection, current methods mostly adopt a data-driven ideology, which only superficially drives the network to fit the training data without specifically considering the unique characteristics of infrared images, resulting in limited robustness. In this paper, we revisit infrared physical knowledge and find that relative thermal radiation relations between different classes can be regarded as a reliable knowledge source under the complex scenarios of adversarial examples and common corruptions. Thus, we theoretically model thermal radiation relations based on the rank order of gray values for different classes, and further quantify the stability of various inter-class thermal radiation relations. Based on the above theoretical framework, we propose Knowledge-Guided Adversarial Training (KGAT) for infrared object detection, in which infrared physical knowledge is embedded into the adversarial training process, and the predicted results are optimized to be consistent with the actual physical laws. Extensive experiments on three infrared datasets and six mainstream infrared object detection models demonstrate that KGAT effectively enhances both clean accuracy and robustness against adversarial attacks and common corruptions.
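The physical prior above is a rank order over per-class gray values. One plausible way to turn it into a differentiable penalty during adversarial training, sketched under the assumption of a hinge formulation (not the paper's exact loss):

```python
import torch

def rank_order_penalty(gray_means, rank):
    # gray_means: (C,) mean gray value per detected class in an infrared image.
    # rank: class indices ordered from hottest to coolest (prior knowledge).
    # Hinge penalty whenever a "hotter" class is not brighter than a "cooler" one.
    loss = gray_means.new_zeros(())
    for hot, cool in zip(rank[:-1], rank[1:]):
        loss = loss + torch.relu(gray_means[cool] - gray_means[hot])
    return loss
```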
[169] Hyperspectral Trajectory Image for Multi-Month Trajectory Anomaly Detection
Md Awsafur Rahman, Chandrakanth Gudavalli, Hardik Prajapati, B. S. Manjunath
Main category: cs.CV
TL;DR: TITAnD reformulates trajectory anomaly detection as a vision problem using hyperspectral trajectory images and cyclic factorized transformers, achieving state-of-the-art performance on both sparse and dense GPS data.
Details
Motivation: Current trajectory anomaly detection methods face a trade-off: dense GPS methods preserve fine-grained evidence but are computationally expensive for multi-month analysis, while sparse stay-point methods are scalable but discard important evidence. There's no unified approach that works across both regimes.
Method: Proposes TITAnD which represents trajectories as Hyperspectral Trajectory Images (HTI) - a day x time-of-day grid with channels encoding spatial, semantic, temporal, and kinematic information. Uses Cyclic Factorized Transformer (CFT) that factorizes attention along within-day and across-day axes to encode cyclic patterns of human routines while reducing computational cost.
Result: Achieves best AUC-PR across both sparse and dense benchmarks, surpassing vision models like UNet while being 11-75x faster than standard Transformers with comparable memory usage.
Conclusion: Vision reformulation combined with structure-aware modeling is essential for effective trajectory anomaly detection, enabling unified handling of both dense and sparse GPS data with computational efficiency.
Abstract: Trajectory anomaly detection underpins applications from fraud detection to urban mobility analysis. Dense GPS methods preserve fine-grained evidence such as abnormal speeds and short-duration events, but their quadratic cost makes multi-month analysis intractable; consequently, no existing approach detects anomalies over multi-month dense GPS trajectories. The field instead relies on scalable sparse stay-point methods that discard this evidence, forcing separate architectures for each regime and preventing knowledge transfer. We argue this bottleneck is unnecessary: human trajectories, dense or sparse, share a natural two-dimensional cyclic structure along within-day and across-day axes. We therefore propose TITAnD (Trajectory Image Transformer for Anomaly Detection), which reformulates trajectory anomaly detection as a vision problem by representing trajectories as a Hyperspectral Trajectory Image (HTI): a day x time-of-day grid whose channels encode spatial, semantic, temporal, and kinematic information from either modality, unifying both under a single representation. Under this formulation, agent-level detection reduces to image classification and temporal localization to semantic segmentation. To model this representation, we introduce the Cyclic Factorized Transformer (CFT), which factorizes attention along the two temporal axes, encoding the cyclic inductive bias of human routines, while reducing attention cost by orders of magnitude and enabling dense multi-month anomaly detection for the first time. Empirically, TITAnD achieves the best AUC-PR across sparse and dense benchmarks, surpassing vision models like UNet while being 11-75x faster than the Transformer with comparable memory, demonstrating that vision reformulation and structure-aware modeling are jointly essential. Code will be made public soon.
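The Cyclic Factorized Transformer described above attends separately along the within-day and across-day axes of the HTI grid. A minimal PyTorch sketch of such axis-factorized attention; dimensions, residuals, and layer layout are illustrative:

```python
import torch
import torch.nn as nn

class CyclicFactorizedBlock(nn.Module):
    # Attention factorized over a (days, times-of-day) grid: one pass within
    # each day, one pass across days at each time slot.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.within_day = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.across_day = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, D_days, T_slots, C)
        B, D, T, C = x.shape
        w = x.reshape(B * D, T, C)              # attend within each day
        w = w + self.within_day(w, w, w)[0]
        a = w.reshape(B, D, T, C).transpose(1, 2).reshape(B * T, D, C)
        a = a + self.across_day(a, a, a)[0]     # attend across days per time slot
        return a.reshape(B, T, D, C).transpose(1, 2)
```

Factorizing D x T attention into D- and T-axis passes drops the attention cost from O((DT)^2) to O(DT(D + T)), which is what makes dense multi-month inputs tractable.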
[170] AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation
Md Mushfiqur Azam, John Quarles, Kevin Desai
Main category: cs.CV
TL;DR: AG-EgoPose: A dual-stream framework for egocentric 3D human pose estimation that integrates short- and long-range motion context with spatial cues from fisheye camera input.
Details
Motivation: Egocentric 3D human pose estimation is challenging due to severe perspective distortion, limited body visibility, and complex camera motion in first-person viewpoints. Existing methods relying on single-frame analysis or limited temporal fusion fail to effectively leverage rich motion context available in egocentric videos.
Method: Dual-stream framework with parallel spatial and temporal streams. Spatial stream uses weight-sharing ResNet-18 encoder-decoder for 2D joint heatmaps and joint-specific spatial feature tokens. Temporal stream uses ResNet-50 backbone with action recognition backbone to capture motion dynamics. Representations are fused and refined in transformer decoder with learnable joint tokens for joint-level integration of spatial and temporal evidence while maintaining anatomical constraints.
Result: Experiments on real-world datasets demonstrate state-of-the-art performance in both quantitative and qualitative metrics.
Conclusion: AG-EgoPose effectively addresses challenges in egocentric 3D pose estimation by integrating spatial and temporal information through a novel dual-stream framework with transformer-based fusion.
Abstract: Egocentric 3D human pose estimation remains challenging due to severe perspective distortion, limited body visibility, and complex camera motion inherent in first-person viewpoints. Existing methods typically rely on single-frame analysis or limited temporal fusion, which fails to effectively leverage the rich motion context available in egocentric videos. We introduce AG-EgoPose, a novel dual-stream framework that integrates short- and long-range motion context with fine-grained spatial cues for robust pose estimation from fisheye camera input. Our framework features two parallel streams: A spatial stream uses a weight-sharing ResNet-18 encoder-decoder to generate 2D joint heatmaps and corresponding joint-specific spatial feature tokens. Simultaneously, a temporal stream uses a ResNet-50 backbone to extract visual features, which are then processed by an action recognition backbone to capture the motion dynamics. These complementary representations are fused and refined in a transformer decoder with learnable joint tokens, which allows for the joint-level integration of spatial and temporal evidence while maintaining anatomical constraints. Experiments on real-world datasets demonstrate that AG-EgoPose achieves state-of-the-art performance in both quantitative and qualitative metrics. Code is available at: https://github.com/Mushfiq5647/AG-EgoPose.
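The learnable joint tokens above act as queries that cross-attend to the fused spatial and temporal features. A compact sketch of that decoder pattern; the joint count, dimensions, and 3D regression head are assumptions:

```python
import torch
import torch.nn as nn

class JointTokenDecoder(nn.Module):
    # One learnable query per body joint, decoded against fused features.
    def __init__(self, num_joints=15, dim=256, heads=8, layers=4):
        super().__init__()
        self.joint_tokens = nn.Parameter(torch.randn(num_joints, dim))
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, layers)
        self.head = nn.Linear(dim, 3)                       # 3D joint coordinates

    def forward(self, spatial_feats, temporal_feats):       # (B, Ns, C), (B, Nt, C)
        memory = torch.cat([spatial_feats, temporal_feats], dim=1)
        queries = self.joint_tokens.expand(memory.size(0), -1, -1)
        return self.head(self.decoder(queries, memory))     # (B, num_joints, 3)
```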
[171] VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers
Marvin Seyfarth, Salman Ul Hassan Dar, Yannik Frisch, Philipp Wild, Norbert Frey, Florian André, Sandy Engelhardt
Main category: cs.CV
TL;DR: VolDiT: First transformer-based 3D diffusion model for volumetric medical image synthesis using volumetric patch embeddings and timestep-gated control adapters for segmentation mask conditioning.
Details
Motivation: Existing 3D medical image generation methods rely on convolutional U-Net backbones with locality biases and limited receptive fields, constraining scalability, global context integration, and flexible conditioning.
Method: Extends diffusion transformers to native 3D data via volumetric patch embeddings and global self-attention over 3D tokens. Introduces timestep-gated control adapter that maps segmentation masks into learnable control tokens to modulate transformer layers during denoising.
Result: Demonstrates improved global coherence, superior generative fidelity, and enhanced controllability compared to state-of-the-art 3D latent diffusion models based on U-Nets.
Conclusion: Fully transformer-based diffusion models provide a flexible foundation for volumetric medical image synthesis with better global understanding and control.
Abstract: Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we introduce VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. Our approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention operating directly over 3D tokens. To enable structured control, we propose a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate transformer layers during denoising. This token-level conditioning mechanism allows precise spatial guidance while preserving the modeling advantages of transformer architectures. We evaluate our model on high-resolution 3D medical image synthesis tasks and compare it to state-of-the-art 3D latent diffusion models based on U-Nets. Results demonstrate improved global coherence, superior generative fidelity, and enhanced controllability. Our findings suggest that fully transformer-based diffusion models provide a flexible foundation for volumetric medical image synthesis. The code and models trained on public data are available at https://github.com/Cardio-AI/voldit.
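Volumetric patch embedding, the ingredient that lets the transformer operate natively on 3D data, is the 3D analogue of ViT patchify. A minimal sketch; channel counts and patch size are illustrative:

```python
import torch
import torch.nn as nn

class VolumetricPatchEmbed(nn.Module):
    # 3D patchify: non-overlapping p^3 patches become a token sequence.
    def __init__(self, in_ch=1, dim=512, patch=8):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, vol):                    # vol: (B, C, D, H, W)
        x = self.proj(vol)                     # (B, dim, D/p, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, N_tokens, dim) for self-attention
```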
[172] AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References
Jiahao Wang, Hualian Sheng, Sijia Cai, Yuxiao Yang, Weizhan Zhang, Caixia Yan, Bing Deng, Jieping Ye
Main category: cs.CV
TL;DR: AnyID is an ultra-fidelity identity-preservation video generation framework that handles multiple heterogeneous identity references and enables precise attribute-level controllability through a scalable omni-referenced architecture and primary-referenced generation paradigm.
Details
Motivation: Current identity-preserving video generation methods are limited to single identity references, restricting creative flexibility and causing ambiguity in identity reproduction across novel contexts. The authors aim to overcome these limitations by supporting diverse input formats and improving identity fidelity.
Method: 1) Scalable omni-referenced architecture that unifies heterogeneous identity inputs (faces, portraits, videos) into cohesive representations. 2) Primary-referenced generation paradigm with one canonical anchor reference and differential prompts for attribute-level controllability. 3) Training on large-scale curated dataset followed by reinforcement learning fine-tuning using human preference data.
Result: AnyID achieves ultra-high identity fidelity and superior attribute-level controllability across different task settings, validated through extensive evaluations including human preference comparisons.
Conclusion: AnyID successfully addresses limitations of single-reference methods by supporting multiple heterogeneous identity inputs and enabling precise attribute-level control, advancing the state of identity-preserving video generation.
Abstract: Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption restricts creative flexibility by inadequately accommodating diverse real-world input formats. Relying on a single source also constitutes an ill-posed scenario, causing an inherently ambiguous setting that makes it difficult for the model to faithfully reproduce an identity across novel contexts. To address these issues, we present AnyID, an ultra-fidelity identity-preservation video generation framework that features two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. We conduct training on a large-scale, meticulously curated dataset to ensure robustness and high fidelity, and then perform a final fine-tuning stage using reinforcement learning. This process leverages a preference dataset constructed from human evaluations, where annotators performed pairwise comparisons of videos based on two key criteria: identity fidelity and prompt controllability. Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings.
[173] CardioDiT: Latent Diffusion Transformers for 4D Cardiac MRI Synthesis
Marvin Seyfarth, Sarah Kaye Müller, Arman Ghanaat, Isabelle Ayx, Fabian Fastenrath, Philipp Wild, Alexander Hertel, Theano Papavassiliu, Salman Ul Hassan Dar, Sandy Engelhardt
Main category: cs.CV
TL;DR: CardioDiT: A 4D latent diffusion transformer framework for generating temporally coherent cine cardiac MRI volumes without architectural factorization of space and time.
Details
Motivation: Current generative models for 4D medical imaging (3D space + time) often factorize space and time or use auxiliary mechanisms like anatomical masks, which can introduce structural biases, spatiotemporal discontinuities, and physiologically inconsistent dynamics. The authors investigate whether a unified 4D generative model can learn continuous cardiac dynamics without such architectural factorization.
Method: Proposes CardioDiT, a fully 4D latent diffusion framework using diffusion transformers. A spatiotemporal VQ-VAE encodes 2D+t slices into compact latents, then a diffusion transformer models these latents jointly as complete 3D+t volumes, coupling space and time throughout the generative process.
Result: Evaluation on public CMR datasets and a larger private cohort shows improved inter-slice consistency, temporally coherent motion, and realistic cardiac function distributions compared to baselines with progressively stronger spatiotemporal coupling.
Conclusion: Explicit 4D modeling with a diffusion transformer provides a principled foundation for spatiotemporal cardiac image synthesis, enabling better integration of global context and more physiologically consistent cardiac dynamics.
Abstract: Latent diffusion models (LDMs) have recently achieved strong performance in 3D medical image synthesis. However, modalities like cine cardiac MRI (CMR), representing a temporally synchronized 3D volume across the cardiac cycle, add an additional dimension that most generative approaches do not model directly. Instead, they factorize space and time or enforce temporal consistency through auxiliary mechanisms such as anatomical masks. Such strategies introduce structural biases that may limit global context integration and lead to subtle spatiotemporal discontinuities or physiologically inconsistent cardiac dynamics. We investigate whether a unified 4D generative model can learn continuous cardiac dynamics without architectural factorization. We propose CardioDiT, a fully 4D latent diffusion framework for short-axis cine CMR synthesis based on diffusion transformers. A spatiotemporal VQ-VAE encodes 2D+t slices into compact latents, which a diffusion transformer then models jointly as complete 3D+t volumes, coupling space and time throughout the generative process. We evaluate CardioDiT on public CMR datasets and a larger private cohort, comparing it to baselines with progressively stronger spatiotemporal coupling. Results show improved inter-slice consistency, temporally coherent motion, and realistic cardiac function distributions, suggesting that explicit 4D modeling with a diffusion transformer provides a principled foundation for spatiotemporal cardiac image synthesis. Code and models trained on public data are available at https://github.com/Cardio-AI/cardiodit.
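The key design point above is that the diffusion transformer sees the slice-wise 2D+t latents as one joint 3D+t token set rather than factorizing the axes. A toy sketch of that assembly step; the latent layout is an assumption:

```python
import torch

def assemble_4d_tokens(slice_latents):
    # slice_latents: (B, S, T, h, w, C) latents from a 2D+t VQ-VAE, one per
    # short-axis slice. Flatten space and time JOINTLY so the diffusion
    # transformer attends over the full 3D+t volume without axis factorization.
    B, S, T, h, w, C = slice_latents.shape
    return slice_latents.reshape(B, S * T * h * w, C)   # (B, N_tokens, C)
```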
[174] TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation
Peng Wen, Yuting Wang, Qiurui Wang
Main category: cs.CV
TL;DR: TacSIm is a dataset and benchmark for tactical style imitation in football that focuses on replicating real-world team tactical behaviors rather than just optimizing reward-based objectives like goals scored.
Details
Motivation: Current football imitation research focuses too much on reward-based objectives (goals, win rates) and pays insufficient attention to accurately replicating real-world team tactical behaviors. There's a need for better evaluation of tactical coordination and style imitation.
Method: TacSIm creates a large-scale dataset from Premier League broadcast footage, projecting positions and actions of all 22 players onto a standard pitch coordinate system. It establishes explicit style imitation tasks and evaluation protocols using spatial occupancy similarity and movement vector similarity metrics.
Result: The benchmark enables both quantitative and visual assessment of tactical coordination by running multiple baseline methods in a unified virtual environment to generate full team behaviors.
Conclusion: TacSIm establishes a rigorous benchmark for measuring and modeling style-aligned tactical imitation in football using unified data and metrics from broadcast to simulation.
Abstract: Current football imitation research primarily aims to optimize reward-based objectives, such as goals scored or win-rate proxies, paying less attention to accurately replicating real-world team tactical behaviors. We introduce TacSIm, a large-scale dataset and benchmark for Tactical Style Imitation in football. TacSIm imitates the actions of all 11 players of one team in given broadcast footage of Premier League matches under a single broadcast view. Given offensive or defensive broadcast footage, TacSIm projects the starting positions and actions of all 22 players from both sides onto a standard pitch coordinate system. TacSIm offers an explicit style imitation task and evaluation protocols. Tactical style imitation is measured using spatial occupancy similarity and movement vector similarity over defined time windows, supporting the evaluation of spatial and temporal similarities for one team. We run multiple baseline methods in a unified virtual environment to generate full-team behaviors, enabling both quantitative and visual assessment of tactical coordination. By using unified data and metrics from broadcast to simulation, TacSIm establishes a rigorous benchmark for measuring and modeling the style-aligned tactical imitation task in football.
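The two evaluation metrics above, spatial occupancy similarity and movement vector similarity, could be computed from pitch-coordinate trajectories roughly as follows. The grid size, pitch dimensions, and normalization are assumptions, not the benchmark's exact definitions:

```python
import numpy as np

def occupancy_similarity(pos_a, pos_b, grid=(10, 7), pitch=(105.0, 68.0)):
    # pos_*: (T, 11, 2) player positions on a standard pitch. Compare the
    # normalized occupancy histograms of two teams' trajectories.
    def hist(p):
        h, _, _ = np.histogram2d(p[..., 0].ravel(), p[..., 1].ravel(),
                                 bins=grid, range=[[0, pitch[0]], [0, pitch[1]]])
        return h / h.sum()
    ha, hb = hist(pos_a), hist(pos_b)
    return 1.0 - 0.5 * np.abs(ha - hb).sum()   # 1 = identical occupancy

def movement_similarity(pos_a, pos_b):
    # Mean cosine similarity of per-step motion vectors.
    va, vb = np.diff(pos_a, axis=0), np.diff(pos_b, axis=0)
    norms = np.linalg.norm(va, axis=-1) * np.linalg.norm(vb, axis=-1) + 1e-8
    return ((va * vb).sum(-1) / norms).mean()
```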
[175] Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction
Jiahao Tian, Chenxi Song, Wei Cheng, Chi Zhang
Main category: cs.CV
TL;DR: FreeLOC is a training-free framework that addresses out-of-distribution problems in long video generation using video diffusion models, improving temporal consistency and visual quality without additional training.
Details
Motivation: Pre-trained video diffusion models trained on short clips struggle with long-video generation due to two O.O.D problems: frame-level relative position O.O.D and context-length O.O.D, causing degradation in visual quality.
Method: Proposes two core techniques: Video-based Relative Position Re-encoding (VRPR) for frame-level position O.O.D (hierarchically re-encodes temporal relative positions) and Tiered Sparse Attention (TSA) for context-length O.O.D (preserves local detail and long-range dependencies). Includes layer-adaptive probing to selectively apply methods based on layer sensitivity.
Result: Significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality for long video generation.
Conclusion: FreeLOC effectively addresses O.O.D problems in long video generation without additional training, offering a practical solution for extending pre-trained video diffusion models to longer sequences.
Abstract: Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework that introduces two core techniques: Video-based Relative Position Re-encoding (VRPR) for frame-level relative position O.O.D, a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model’s pre-trained distribution, and Tiered Sparse Attention (TSA) for context-length O.O.D, which preserves both local detail and long-range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer-adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at https://github.com/Westlake-AGI-Lab/FreeLOC.
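VRPR above re-encodes temporal relative positions back into the model's pre-trained range. Its simplest position-interpolation form is sketched below as a stand-in for the paper's multi-granularity scheme:

```python
import torch

def reencode_frame_positions(num_frames, trained_len):
    # Squeeze long-video frame indices into the positional range the model saw
    # during training, so frame-level relative positions stay in-distribution.
    idx = torch.arange(num_frames, dtype=torch.float32)
    return idx * (trained_len - 1) / max(num_frames - 1, 1)
```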
[176] SDD-YOLO: A Small-Target Detection Framework for Ground-to-Air Anti-UAV Surveillance with Edge-Efficient Deployment
Pengyu Chen, Haotian Sa, Yiwei Hu, Yuhan Cheng, Junbo Wang
Main category: cs.CV
TL;DR: SDD-YOLO: A specialized small-target detection framework for ground-to-air UAV surveillance with high-resolution detection head and streamlined architecture for real-time edge deployment.
Details
Motivation: Detecting small UAVs from ground-to-air perspective is challenging due to extremely low pixel occupancy, cluttered backgrounds, and real-time constraints. Existing YOLO-based detectors lack adequate feature resolution for sub-pixel targets and have deployment complexities.
Method: Proposes SDD-YOLO with P2 high-resolution detection head (4x downsampling), integrates YOLO26 architectural advancements (DFL-free, NMS-free architecture), and uses MuSGD hybrid training strategy with ProgLoss and STAL to mitigate gradient oscillation on sparse small-target signals.
Result: Achieves 86.0% mAP@0.5 on DroneSOD-30K dataset, surpassing YOLOv5n by 7.8 percentage points. Attains 226 FPS on NVIDIA RTX 5090 and 35 FPS on Intel Xeon CPU, demonstrating exceptional efficiency for edge deployment.
Conclusion: SDD-YOLO effectively addresses small UAV detection challenges with specialized architecture and training strategies, achieving state-of-the-art performance while maintaining real-time efficiency suitable for edge deployment.
Abstract: Detecting small unmanned aerial vehicles (UAVs) from a ground-to-air (G2A) perspective presents significant challenges, including extremely low pixel occupancy, cluttered aerial backgrounds, and strict real-time constraints. Existing YOLO-based detectors are primarily optimized for general object detection and often lack adequate feature resolution for sub-pixel targets, while introducing complexities during deployment. In this paper, we propose SDD-YOLO, a small-target detection framework tailored for G2A anti-UAV surveillance. To capture fine-grained spatial details critical for micro-targets, SDD-YOLO introduces a P2 high-resolution detection head operating at 4 times downsampling. Furthermore, we integrate the recent architectural advancements from YOLO26, including a DFL-free, NMS-free architecture for streamlined inference, and the MuSGD hybrid training strategy with ProgLoss and STAL, which substantially mitigates gradient oscillation on sparse small-target signals. To support our evaluation, we construct DroneSOD-30K, a large-scale G2A dataset comprising approximately 30,000 annotated images covering diverse meteorological conditions. Experiments demonstrate that SDD-YOLO-n achieves a mAP@0.5 of 86.0% on DroneSOD-30K, surpassing the YOLOv5n baseline by 7.8 percentage points. Extensive inference analysis shows our model attains 226 FPS on an NVIDIA RTX 5090 and 35 FPS on an Intel Xeon CPU, demonstrating exceptional efficiency for future edge deployment.
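A quick back-of-the-envelope check shows why the P2 head matters for micro-targets: at each head's stride, a small drone occupies very different amounts of feature-map signal. The target size below is a hypothetical example:

```python
# Feature-map footprint of a small UAV at each detection head's stride.
strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}
target_px = 16                                   # hypothetical 16x16-pixel drone
for head, stride in strides.items():
    cells = target_px / stride
    print(f"{head}: {cells:.1f} x {cells:.1f} feature cells")
# P2 retains a ~4x4-cell footprint; by P5 the drone collapses below one cell.
```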
[177] Training-free Detection and 6D Pose Estimation of Unseen Surgical Instruments
Jonas Hein, Lilian Calvet, Matthias Seibold, Siyu Tang, Marc Pollefeys, Philipp Fürnstahl
Main category: cs.CV
TL;DR: Training-free pipeline for 6D pose estimation of unseen surgical instruments using only CAD models, combining multi-view geometry, feature matching, and contour-based refinement.
Details
Motivation: Supervised methods for surgical instrument pose estimation lack flexibility for new tools and require extensive annotated data. Need training-free approach that generalizes to unseen instruments.
Method: Two-stage pipeline: 1) Multi-view detection using rendered templates and feature similarity scoring, with geometric consistency filtering. 2) Pose refinement using feature-metric scores with cross-view attention and occlusion-aware contour registration.
Result: Achieves millimeter-accurate pose estimates comparable to supervised methods on MVPSP dataset, while maintaining full generalization to unseen surgical instruments.
Conclusion: Training-free pipeline effectively combines foundational models, multi-view geometry, and contour-based refinement for accurate 6D pose estimation of surgical instruments without task-specific training.
Abstract: Purpose: Accurate detection and 6D pose estimation of surgical instruments are crucial for many computer-assisted interventions. However, supervised methods lack flexibility for new or unseen tools and require extensive annotated data. This work introduces a training-free pipeline for accurate multi-view 6D pose estimation of unseen surgical instruments, which only requires a textured CAD model as prior knowledge. Methods: Our pipeline consists of two main stages. First, for detection, we generate object mask proposals in each view and score their similarity to rendered templates using a pre-trained feature extractor. Detections are matched across views, triangulated into 3D instance candidates, and filtered using multi-view geometric consistency. Second, for pose estimation, a set of pose hypotheses is iteratively refined and scored using feature-metric scores with cross-view attention. The best hypothesis undergoes a final refinement using a novel multi-view, occlusion-aware contour registration, which minimizes reprojection errors of unoccluded contour points. Results: The proposed method was rigorously evaluated on real-world surgical data from the MVPSP dataset. The method achieves millimeter-accurate pose estimates that are on par with supervised methods under controlled conditions, while maintaining full generalization to unseen instruments. These results demonstrate the feasibility of training-free, marker-less detection and tracking in surgical scenes, and highlight the unique challenges in surgical environments. Conclusion: We present a novel and flexible pipeline that effectively combines state-of-the-art foundational models, multi-view geometry, and contour-based refinement for high-accuracy 6D pose estimation of surgical instruments without task-specific training. This approach enables robust instrument tracking and scene understanding in dynamic clinical environments.
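The multi-view geometric consistency step above can be illustrated with two-view triangulation plus a reprojection-error test. A sketch with OpenCV; the threshold is an assumption and inputs are assumed to be float64 arrays:

```python
import numpy as np
import cv2

def consistent_candidate(P1, P2, x1, x2, thresh_px=5.0):
    # P1, P2: (3, 4) camera projection matrices; x1, x2: (2,) matched detections.
    X = cv2.triangulatePoints(P1, P2, x1.reshape(2, 1), x2.reshape(2, 1))
    X = (X[:3] / X[3]).ravel()                      # homogeneous to 3D point

    def reproj_err(P, x):
        p = P @ np.append(X, 1.0)
        return np.linalg.norm(p[:2] / p[2] - x)

    keep = reproj_err(P1, x1) < thresh_px and reproj_err(P2, x2) < thresh_px
    return X, keep                                  # 3D candidate + consistency flag
```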
[178] A Unified Spatial Alignment Framework for Highly Transferable Transformation-Based Attacks on Spatially Structured Tasks
Jiaming Liang, Chi-Man Pun
Main category: cs.CV
TL;DR: SAF framework enables transferable adversarial attacks on structured vision tasks by synchronizing spatial transformations of inputs and labels to maintain alignment.
Details
Motivation: Existing transformation-based adversarial attacks work well for classification but fail on structured tasks like segmentation/detection due to spatial misalignment when transformations aren't applied to both inputs and labels.
Method: Proposes Spatial Alignment Framework (SAF) with Spatial Alignment algorithm that synchronously transforms labels with inputs during adversarial attack generation, maintaining spatial consistency for structured tasks.
Result: SAF significantly improves attack effectiveness: reduces mIoU on Cityscapes from 24.50 to 11.34, on Kvasir-SEG from 49.91 to 31.80, and reduces COCO mAP from 17.89 to 5.25.
Conclusion: Spatial alignment is crucial for transferable adversarial attacks on structured vision tasks; SAF provides a unified framework that effectively addresses spatial misalignment issues.
Abstract: Transformation-based adversarial attacks (TAAs) demonstrate strong transferability when deceiving classification models. However, existing TAAs often perform unsatisfactorily or even fail when applied to structured tasks such as semantic segmentation and object detection. Encouragingly, recent studies that categorize transformations into non-spatial and spatial transformations inspire us to address this challenge. We find that for non-structured tasks, labels are spatially non-structured, and thus TAAs are not required to adjust labels when applying spatial transformations. In contrast, for structured tasks, labels are spatially structured, and failing to transform labels synchronously with inputs can cause spatial misalignment and yield erroneous gradients. To address these issues, we propose a novel unified Spatial Alignment Framework (SAF) for highly transferable TAAs on spatially structured tasks, where the TAAs spatially transform labels synchronously with the input using the proposed Spatial Alignment (SA) algorithm. Extensive experiments demonstrate the crucial role of our SAF for TAAs on structured tasks. Specifically, in non-targeted attacks, our SAF degrades the average mIoU on Cityscapes from 24.50 to 11.34, and on Kvasir-SEG from 49.91 to 31.80, while reducing the average mAP of COCO from 17.89 to 5.25.
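The core of SAF, transforming labels synchronously with inputs, is shown in miniature below for a rotation applied to an image and its segmentation mask. This is a generic sketch, not the paper's SA algorithm:

```python
import math
import torch
import torch.nn.functional as F

def rotate_pair(image, mask, angle_deg):
    # image: (B, C, H, W); mask: (B, 1, H, W). Apply the SAME spatial transform
    # to input and label so attack gradients stay spatially aligned.
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    theta = torch.tensor([[c, -s, 0.0], [s, c, 0.0]]).repeat(image.size(0), 1, 1)
    grid = F.affine_grid(theta, image.shape, align_corners=False)
    img_t = F.grid_sample(image, grid, align_corners=False)
    msk_t = F.grid_sample(mask.float(), grid, mode='nearest', align_corners=False)
    return img_t, msk_t        # nearest interpolation keeps labels discrete
```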
[179] FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics
Taejin Jeong, Joohyeok Kim, Jinyeong Kim, Chanyoung Kim, Seong Jae Hwang
Main category: cs.CV
TL;DR: FEAST is an attention-based framework that predicts spatial gene expression from whole slide images using fully connected graphs and negative-aware attention to capture both excitatory and inhibitory biological interactions.
Details
Motivation: Spatial transcriptomics is expensive, so researchers want to infer spatial gene expression from cheaper whole slide images. Existing graph neural network methods use sparse graphs that miss potential interactions between tissue spots, and standard attention overlooks negative biological relationships.
Method: FEAST models tissue as a fully connected graph using attention mechanisms to consider all pairwise interactions. It introduces negative-aware attention to capture both excitatory and inhibitory interactions, and an off-grid sampling strategy to gather additional images from intermediate regions for richer morphological context.
Result: Experiments on public spatial transcriptomics datasets show FEAST outperforms state-of-the-art methods in gene expression prediction and provides biologically plausible attention maps that clarify positive and negative interactions.
Conclusion: FEAST effectively addresses limitations of existing methods by using fully connected attention with negative-aware modeling and off-grid sampling, improving spatial gene expression prediction from whole slide images while providing interpretable biological insights.
Abstract: Spatial Transcriptomics (ST) provides spatially-resolved gene expression, offering crucial insights into tissue architecture and complex diseases. However, its prohibitive cost limits widespread adoption, leading to significant attention on inferring spatial gene expression from readily available whole slide images. While graph neural networks have been proposed to model interactions between tissue regions, their reliance on pre-defined sparse graphs prevents them from considering potentially interacting spot pairs, resulting in a structural limitation in capturing complex biological relationships. To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention-based framework that models the tissue as a fully connected graph, enabling the consideration of all pairwise interactions. To better reflect biological interactions, we introduce negative-aware attention, which models both excitatory and inhibitory interactions, capturing essential negative relationships that standard attention often overlooks. Furthermore, to mitigate the information loss from truncated or ignored context in standard spot image extraction, we introduce an off-grid sampling strategy that gathers additional images from intermediate regions, allowing the model to capture a richer morphological context. Experiments on public ST datasets show that FEAST surpasses state-of-the-art methods in gene expression prediction while providing biologically plausible attention maps that clarify positive and negative interactions. Our code is available at https://github.com/starforTJ/FEAST.
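Negative-aware attention, as described above, must keep the inhibitory (negative) weights that softmax would discard. One simple signed-attention sketch over a fully connected spot graph, not FEAST's exact form:

```python
import torch
import torch.nn as nn

class NegativeAwareAttention(nn.Module):
    # tanh keeps attention weights in [-1, 1], so inhibitory (negative)
    # spot-spot interactions survive, unlike softmax attention.
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                       # x: (num_spots, dim)
        attn = torch.tanh(self.q(x) @ self.k(x).t() * self.scale)
        return attn @ self.v(x) / x.size(0)     # signed aggregation over all spots
```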
[180] Efficient Preemptive Robustification with Image Sharpening
Jiaming Liang, Chi-Man Pun
Main category: cs.CV
TL;DR: Sharpening images as a simple, efficient pre-attack defense method to improve robustness against adversarial perturbations without complex optimization or surrogate models.
Details
Motivation: Existing preemptive robustification methods have practical limitations including reliance on surrogate classifiers, high computational costs from iterative optimization or trained generators, and lack of interpretability. The paper aims to find a simpler, more practical approach.
Method: Proposes using image sharpening alone as a robustification technique, based on findings that texture intensity correlates with robustness. This is a surrogate-free, optimization-free, generator-free approach that is computationally efficient and human-interpretable.
Result: Extensive experiments show sharpening yields significant robustness gains with low computational cost, particularly effective in transfer scenarios where models face unseen attacks.
Conclusion: Simple image sharpening provides an effective, efficient, and interpretable pre-attack defense method that addresses limitations of previous robustification approaches while achieving strong robustness improvements.
Abstract: Despite their great success, deep neural networks rely on high-dimensional, non-robust representations, making them vulnerable to imperceptible perturbations, even in transfer scenarios. To address this, both training-time defenses (e.g., adversarial training and robust architecture design) and post-attack defenses (e.g., input purification and adversarial detection) have been extensively studied. Recently, a limited body of work has preliminarily explored a pre-attack defense paradigm, termed preemptive robustification, which introduces subtle modifications to benign samples prior to attack to proactively resist adversarial perturbations. Unfortunately, their practical applicability remains questionable due to several limitations, including (1) reliance on well-trained classifiers as surrogates to provide robustness priors, (2) substantial computational overhead arising from iterative optimization or trained generators for robustification, and (3) limited interpretability of the optimization- or generation-based robustification processes. Inspired by recent studies revealing a positive correlation between texture intensity and the robustness of benign samples, we show that image sharpening alone can efficiently robustify images. To the best of our knowledge, this is the first surrogate-free, optimization-free, generator-free, and human-interpretable robustification approach. Extensive experiments demonstrate that sharpening yields remarkable robustness gains with low computational cost, especially in transfer scenarios.
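The whole method above reduces to one classical operation. A box-filter unsharp-masking sketch follows; the paper does not specify this exact kernel or strength, so treat both as assumptions:

```python
import torch
import torch.nn.functional as F

def sharpen(image, amount=1.0):
    # Unsharp masking: add back the high-frequency residual (image - blur).
    # image: (B, C, H, W) in [0, 1].
    kernel = torch.ones(1, 1, 3, 3, device=image.device) / 9.0
    kernel = kernel.expand(image.size(1), 1, 3, 3)         # one filter per channel
    blur = F.conv2d(image, kernel, padding=1, groups=image.size(1))
    return (image + amount * (image - blur)).clamp(0, 1)
```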
[181] Semantic-Aware Prefix Learning for Token-Efficient Image Generation
Qingfeng Li, Haoxian Zhang, Xu He, Songlin Tang, Zhixue Fang, Xiaoqiang Liu, Pengfei Wan, Guoqi Li
Main category: cs.CV
TL;DR: SMAP is a semantic-aware visual tokenizer that injects class-level semantic conditions into a 1D tokenization framework, making semantics indispensable through a tail token dropping strategy, enabling better latent representations for generation tasks.
Details
Motivation: Existing visual tokenizers are trained with reconstruction-dominated objectives, yielding latent representations weakly grounded in high-level semantics. Current approaches treat semantic signals as auxiliary regularization rather than making them functionally necessary for representation learning.
Method: Proposes SMAP (SeMantic-Aware Prefix tokenizer) that injects class-level semantic conditions into a query-based 1D tokenization framework. Introduces tail token dropping strategy to force semantic conditions and early latent prefixes to bear increasing responsibility under reduced token budgets. Also introduces CARD, a hybrid Causal AutoRegressive-Diffusion generator to verify latent space usefulness for generation.
Result: Extensive experiments on ImageNet show SMAP consistently improves reconstruction quality across discrete and continuous tokenization settings, and its semantically grounded latent space yields strong downstream generation performance under compact token budgets.
Conclusion: SMAP demonstrates that making semantic conditions indispensable during tokenizer training leads to better latent representations for generation tasks, bridging the gap between reconstruction and generation objectives in visual tokenization.
Abstract: Visual tokenizers play a central role in latent image generation by bridging high-dimensional images and tractable generative modeling. However, most existing tokenizers are still trained with reconstruction-dominated objectives, which often yield latent representations that are only weakly grounded in high-level semantics. Recent approaches improve semantic alignment, but typically treat semantic signals as auxiliary regularization rather than making them functionally necessary for representation learning. We propose SMAP, a SeMantic-Aware Prefix tokenizer that injects class-level semantic conditions into a query-based 1D tokenization framework. To make semantics indispensable during training, SMAP introduces a tail token dropping strategy, which forces semantic conditions and early latent prefixes to bear increasing responsibility under progressively reduced token budgets. To verify that the resulting latent space is useful for generation rather than reconstruction alone, we further introduce CARD, a hybrid Causal AutoRegressive–Diffusion generator. Extensive experiments on ImageNet show that SMAP consistently improves reconstruction quality across discrete and continuous tokenization settings, and that its semantically grounded latent space yields strong downstream generation performance under compact token budgets.
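Tail token dropping, as described above, truncates the 1D latent sequence so the semantic condition and early prefix tokens must carry more of the reconstruction. A minimal sketch with a random keep length; the sampling rule is an assumption:

```python
import torch

def drop_tail_tokens(latents, min_keep=32):
    # latents: (B, N, D) ordered 1D latent tokens. Randomly keep only a prefix;
    # under small budgets, semantics and early tokens bear the reconstruction.
    keep = torch.randint(min_keep, latents.size(1) + 1, (1,)).item()
    return latents[:, :keep]
```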
[182] The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
Yuwen Tan, Yuan Qing, Boqing Gong
Main category: cs.CV
TL;DR: Open-source LLMs lack hierarchical visual world knowledge, creating bottlenecks for vision LLMs in hierarchical visual recognition tasks.
Details
Motivation: To investigate whether open-source LLMs possess hierarchical knowledge about visual concepts and understand how this affects vision LLMs' ability to perform hierarchical visual recognition.
Method: Constructed ~1 million four-choice VQA tasks from six taxonomies and four image datasets, then finetuned vision LLMs using these tasks to analyze knowledge transfer.
Result: LLMs lack hierarchical visual knowledge, creating bottlenecks for vision LLMs; finetuning improved LLMs’ hierarchical consistency more than vision LLMs’, confirming the bottleneck effect.
Conclusion: Open-source vision LLMs cannot understand visual concepts hierarchically until their underlying LLMs possess corresponding taxonomy knowledge, highlighting the fundamental limitation of current architectures.
Abstract: This paper reveals that many open-source large language models (LLMs) lack hierarchical knowledge about our visual world, unaware of even well-established biology taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs’ hierarchical visual recognition (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM using our VQA tasks reaffirms LLMs’ bottleneck effect because the VQA tasks improve the LLMs’ hierarchical consistency more than the vision LLMs’. We conjecture that one cannot make open-source vision LLMs understand visual concepts hierarchically until LLMs possess corresponding taxonomy knowledge.
[183] Towards Practical Lossless Neural Compression for LiDAR Point Clouds
Pengpeng Yu, Haoran Li, Runqing Jiang, Dingquan Li, Jing Wang, Liang Lin, Yulan Guo
Main category: cs.CV
TL;DR: FastPCC: A lightweight neural compression framework for LiDAR point clouds using geometry re-densification and cross-scale feature propagation for efficient predictive lossless coding.
Details
Motivation: LiDAR point clouds are crucial for many applications but their extreme sparsity makes efficient context modeling difficult, limiting compression speed and performance of existing methods.
Method: Two lightweight modules: 1) Geometry Re-Densification Module iteratively densifies sparse geometry, extracts dense-scale features, then sparsifies for predictive coding; 2) Cross-scale Feature Propagation Module uses occupancy cues from multiple resolution levels to guide hierarchical feature propagation and reduce redundant extraction. Also includes integer-only inference pipeline for bit-exact cross-platform consistency.
Result: Achieves competitive compression performance at real-time speed while avoiding entropy-coding collapse observed in existing neural compression methods.
Conclusion: Proposed framework enables efficient predictive lossless coding of LiDAR point clouds with lightweight architecture and real-time performance.
Abstract: LiDAR point clouds are fundamental to various applications, yet the extreme sparsity of high-precision geometric details hinders efficient context modeling, thereby limiting the compression speed and performance of existing methods. To address this challenge, we propose a compact representation for efficient predictive lossless coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module iteratively densifies encoded sparse geometry, extracts features at a dense scale, and then sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation, enabling information sharing across scales and reducing redundant feature extraction. Additionally, we introduce an integer-only inference pipeline to enable bit-exact cross-platform consistency, which avoids the entropy-coding collapse observed in existing neural compression methods and further accelerates coding. Experiments demonstrate competitive compression performance at real-time speed. Code is available at https://github.com/pengpeng-yu/FastPCC.
[184] Knowledge-Guided Failure Prediction: Detecting When Object Detectors Miss Safety-Critical Objects
Jakob Paul Zimmermann, Gerrit Holzbach, David Lerch
Main category: cs.CV
TL;DR: KGFP is a framework that predicts when object detectors will fail on safety-critical objects by measuring semantic misalignment between detector features and visual foundation model embeddings.
Details
Motivation: Object detectors in safety-critical applications can fail silently without warning. Traditional OOD detection methods identify unfamiliar inputs but don't predict functional failures of the detector itself, creating safety risks.
Method: Uses a dual-encoder architecture with angular distance metric to measure semantic misalignment between internal object detector features and visual foundation model embeddings. When either system operates outside its competence, embeddings diverge, producing a high-angle signal to flag unsafe images.
Result: On COCO person detection, KGFP as a selective-prediction gate raises person recall from 64.3% to 84.5% at 5% FPR. Maintains strong performance across six COCO-O visual domains, outperforming OOD baselines by large margins.
Conclusion: KGFP provides effective runtime monitoring for object detector failures on safety-critical objects by leveraging semantic alignment between detector features and foundation model embeddings.
Abstract: Object detectors deployed in safety-critical environments can fail silently, e.g. missing pedestrians, workers, or other safety-critical objects without emitting any warning. Traditional Out Of Distribution (OOD) detection methods focus on identifying unfamiliar inputs, but do not directly predict functional failures of the detector itself. We introduce Knowledge Guided Failure Prediction (KGFP), a representation-based monitoring framework that treats missed safety-critical detections as anomalies to be detected at runtime. KGFP measures semantic misalignment between internal object detector features and visual foundation model embeddings using a dual-encoder architecture with an angular distance metric. A key property is that when either the detector is operating outside its competence or the visual foundation model itself encounters novel inputs, the two embeddings diverge, producing a high-angle signal that reliably flags unsafe images. We compare our novel KGFP method to baseline OOD detection methods. On COCO person detection, applying KGFP as a selective-prediction gate raises person recall among accepted images from 64.3% to 84.5% at 5% False Positive Rate (FPR), and maintains strong performance across six COCO-O visual domains, outperforming OOD baselines by large margins. Our code, models, and features are published at https://gitlab.cc-asp.fraunhofer.de/iosb_public/KGFP.
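The core signal in KGFP is the angle between an internal detector feature and a foundation-model embedding for the same image; a hedged sketch of that gate follows. How the two features are projected into a comparable space, and how the threshold is calibrated to the 5% FPR operating point, are details this sketch leaves out.

```python
import torch
import torch.nn.functional as F

def angular_distance(det_feat: torch.Tensor, vfm_feat: torch.Tensor) -> torch.Tensor:
    """Angle in radians between paired embeddings (larger = more misaligned)."""
    cos = F.cosine_similarity(det_feat, vfm_feat, dim=-1).clamp(-1.0, 1.0)
    return torch.acos(cos)

def accept_image(det_feat: torch.Tensor, vfm_feat: torch.Tensor,
                 threshold: float) -> torch.Tensor:
    """Selective-prediction gate: keep images whose angle stays below a
    threshold calibrated on held-out data (e.g., at a target 5% FPR)."""
    return angular_distance(det_feat, vfm_feat) < threshold
```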
[185] ViewSplat: View-Adaptive Dynamic Gaussian Splatting for Feed-Forward Synthesis
Moonyeon Jeong, Seunggi Min, Suhyeon Lee, Hongje Seong
Main category: cs.CV
TL;DR: ViewSplat introduces view-adaptive 3D Gaussian splatting for novel view synthesis from unposed images, using dynamic MLPs to adjust Gaussian attributes based on target viewpoints, achieving state-of-the-art fidelity with fast inference.
Details
Motivation: Current feed-forward 3D Gaussian splatting methods have a fidelity gap due to limited capacity of single-step networks to regress static Gaussian primitives that satisfy all viewpoints. The authors aim to bridge this gap by moving from static primitive regression to view-adaptive dynamic splatting.
Method: ViewSplat learns a view-adaptable latent representation that predicts base Gaussian primitives alongside dynamic MLP weights. During rendering, these MLPs take target view coordinates as input and predict view-dependent residual updates for each Gaussian attribute (3D position, scale, rotation, opacity, and color), enabling primitive rectification.
Result: Extensive experiments show ViewSplat achieves state-of-the-art fidelity while maintaining fast inference (17 FPS) and real-time rendering (154 FPS), outperforming previous feed-forward 3D Gaussian splatting methods.
Conclusion: ViewSplat successfully addresses the fidelity bottleneck in feed-forward 3D Gaussian splatting by introducing view-adaptive dynamic splatting, enabling high-fidelity novel view synthesis from unposed images with efficient rendering performance.
Abstract: We present ViewSplat, a view-adaptive 3D Gaussian splatting network for novel view synthesis from unposed images. While recent feed-forward 3D Gaussian splatting has significantly accelerated 3D scene reconstruction by bypassing per-scene optimization, a fundamental fidelity gap remains. We attribute this bottleneck to the limited capacity of single-step feed-forward networks to regress static Gaussian primitives that satisfy all viewpoints. To address this limitation, we shift the paradigm from static primitive regression to view-adaptive dynamic splatting. Instead of a rigid Gaussian representation, our pipeline learns a view-adaptable latent representation. Specifically, ViewSplat initially predicts base Gaussian primitives alongside the weights of dynamic MLPs. During rendering, these MLPs take target view coordinates as input and predict view-dependent residual updates for each Gaussian attribute (i.e., 3D position, scale, rotation, opacity, and color). This mechanism, which we term view-adaptive dynamic splatting, allows each primitive to rectify initial estimation errors, effectively capturing high-fidelity appearances. Extensive experiments demonstrate that ViewSplat achieves state-of-the-art fidelity while maintaining fast inference (17 FPS) and real-time rendering (154 FPS).
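The rendering-time rectification reduces to adding an MLP-predicted, view-conditioned residual to each base Gaussian attribute. The sketch below shows only that interface with a fixed MLP; in the paper the MLP weights themselves are predicted per scene, and the dimensions here are assumptions.

```python
import torch
import torch.nn as nn

class ViewResidual(nn.Module):
    """View-conditioned residual update for one Gaussian attribute.

    Sketch only: ViewSplat predicts the MLP weights per scene, whereas this
    module uses fixed learned weights; view_dim/attr_dim are assumed.
    """
    def __init__(self, view_dim: int = 3, attr_dim: int = 3, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(view_dim, hidden), nn.ReLU(), nn.Linear(hidden, attr_dim)
        )

    def forward(self, base_attr: torch.Tensor, view: torch.Tensor) -> torch.Tensor:
        # base_attr: (N, attr_dim) base Gaussian attribute; view: (N, view_dim)
        return base_attr + self.mlp(view)
```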
[186] EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval
Yuhan Chen, Pengwen Dai, Chuan Wang, Dayan Wu, Xiaochun Cao
Main category: cs.CV
TL;DR: EagleNet introduces an energy-aware fine-grained relationship learning network for text-video retrieval that captures frame contextual information through text-frame graph interactions and energy-aware matching.
Details
Motivation: Existing text-video retrieval methods focus on video representations or cross-modal alignment but ignore rich interactions among internal video frames, leading to disparities between text and video semantics.
Method: Proposes Fine-Grained Relationship Learning (FRL) that constructs text-frame graphs to learn relationships between text candidates and frames, and Energy-Aware Matching (EAM) to model text-frame interaction energy. Uses sigmoid loss instead of softmax-based contrastive loss for better alignment.
Result: Demonstrates superiority across multiple benchmarks including MSRVTT, DiDeMo, MSVD, and VATEX datasets.
Conclusion: EagleNet effectively captures frame contextual information through fine-grained relationship learning and energy-aware matching, improving text-video retrieval performance.
Abstract: Text-video retrieval tasks have seen significant improvements due to the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent works shift toward enriching text expressiveness to better match the rich semantics in videos. However, these methods use only interactions between text and frames/video, and ignore rich interactions among the internal frames within a video, so the final expanded text cannot capture frame contextual information, leading to disparities between text and video. In response, we introduce Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) to generate accurate and context-aware enriched text embeddings. Specifically, the proposed Fine-Grained Relationship Learning mechanism (FRL) first constructs a text-frame graph by the generated text candidates and frames, then learns relationships among texts and frames, which are finally used to aggregate text candidates into an enriched text embedding that incorporates frame contextual information. To further improve fine-grained relationship learning in FRL, we design Energy-Aware Matching (EAM) to model the energy of text-frame interactions and thus accurately capture the distribution of real text-video pairs. Moreover, for more effective cross-modal alignment and stable training, we replace the conventional softmax-based contrastive loss with the sigmoid loss. Extensive experiments have demonstrated the superiority of EagleNet across MSRVTT, DiDeMo, MSVD, and VATEX. Codes are available at https://github.com/draym28/EagleNet.
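The sigmoid loss that EagleNet adopts in place of the softmax-based contrastive loss treats every text-video pair as an independent binary decision. A SigLIP-style sketch follows; the temperature and bias values are conventional initializations, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(txt: torch.Tensor, vid: torch.Tensor,
                             t: float = 10.0, b: float = -10.0) -> torch.Tensor:
    """SigLIP-style pairwise sigmoid loss over a batch of embeddings.

    Each (text, video) pair is a binary classification (+1 matched on the
    diagonal, -1 otherwise), avoiding batch-wide softmax normalization.
    """
    txt = F.normalize(txt, dim=-1)
    vid = F.normalize(vid, dim=-1)
    logits = txt @ vid.T * t + b                                   # (B, B)
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    return -F.logsigmoid(labels * logits).mean()
```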
[187] See the Text: From Tokenization to Visual Reading
Ling Xing, Rui Yan, Alex Jinpeng Wang, Zechao Li, Jinhui Tang
Main category: cs.CV
TL;DR: SeeTok replaces subword tokenization with visual text processing by rendering text as images and using multimodal LLMs for interpretation, achieving efficiency gains and improved robustness.
Details
Motivation: Current LLMs use subword tokenization which fragments text and is inefficient for low-resource languages, while humans read visually by recognizing words as visual objects. The paper aims to develop a more human-like, vision-centric approach to text processing.
Method: SeeTok renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing OCR and text-vision alignment abilities learned from large-scale multimodal training.
Result: SeeTok matches or surpasses subword tokenizers across three language tasks while requiring 4.43× fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy.
Conclusion: SeeTok represents a shift from symbolic tokenization to human-like visual reading, taking a step toward more natural and cognitively inspired language models that process text visually rather than through fragmented tokenization.
Abstract: People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.
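The entry point of SeeTok is simply rasterizing text into an image that a multimodal LLM then reads. A minimal sketch with Pillow; the canvas size and font are illustrative, since the paper's rendering settings are not given in the abstract.

```python
from PIL import Image, ImageDraw, ImageFont

def render_text(text: str, width: int = 448, height: int = 64) -> Image.Image:
    """Rasterize a string so a multimodal LLM can read it visually instead of
    consuming subword tokens. Canvas size and font are illustrative."""
    canvas = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(canvas)
    draw.text((4, 4), text, fill="black", font=ImageFont.load_default())
    return canvas
```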
[188] V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception
Weijia Li, Haoen Xiang, Tianxu Wang, Shuaibing Wu, Qiming Xia, Cheng Wang, Chenglu Wen
Main category: cs.CV
TL;DR: V2U4Real is the first large-scale real-world dataset for Vehicle-to-UAV cooperative perception, featuring multi-modal data from ground vehicles and drones to address occlusion and range limitations in autonomous driving.
Details
Motivation: Existing cooperative perception (V2V/V2I) is limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex environments. There's a need for cross-view cooperation between ground vehicles and aerial perspectives.
Method: Created V2U4Real dataset collected by a ground vehicle and UAV equipped with multi-view LiDARs and RGB cameras, covering urban streets, campuses, and rural roads. Established benchmarks for single-agent 3D detection, cooperative 3D detection, and object tracking.
Result: Dataset contains over 56K LiDAR frames, 56K multi-view camera images, and 700K annotated 3D bounding boxes across four classes. Comprehensive evaluations demonstrate V2U cooperation enhances perception robustness and long-range awareness.
Conclusion: V2U4Real enables research in cross-view cooperative perception, showing that Vehicle-to-UAV cooperation can significantly improve autonomous vehicle perception by overcoming ground-level limitations.
Abstract: Modern autonomous vehicle perception systems are often constrained by occlusions, blind spots, and limited sensing range. While existing cooperative perception paradigms, such as Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I), have demonstrated their effectiveness in mitigating these challenges, they remain limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex environments. To advance research in cross-view cooperative perception, we present V2U4Real, the first large-scale real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative object perception. V2U4Real is collected by a ground vehicle and a UAV equipped with multi-view LiDARs and RGB cameras. The dataset covers urban streets, university campuses, and rural roads under diverse traffic scenarios, comprising over 56K LiDAR frames, 56K multi-view camera images, and 700K annotated 3D bounding boxes across four classes. To support a wide range of research tasks, we establish benchmarks for single-agent 3D object detection, cooperative 3D object detection, and object tracking. Comprehensive evaluations of several state-of-the-art models demonstrate the effectiveness of V2U cooperation in enhancing perception robustness and long-range awareness. The V2U4Real dataset and codebase are available at https://github.com/VjiaLi/V2U4Real.
[189] Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa
Main category: cs.CV
TL;DR: VLMs struggle with temporal reasoning, performing near chance on judging whether videos play forward or backward, revealing a fundamental gap in temporal understanding despite strong visual-semantic capabilities.
Details
Motivation: Current vision-language models excel at many multimodal tasks but have weak temporal understanding in videos, which hasn't been adequately evaluated. The authors aim to probe this gap using the arrow of time judgment task.
Method: Introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests VLMs’ ability to infer temporal direction in natural videos. Use same stimuli and behavioral baselines established for humans. Evaluate open-weight and proprietary VLMs (both reasoning and non-reasoning models).
Result: Most models perform near chance on temporal direction judgment. Even the best model lags far behind human accuracy on physically irreversible processes (free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize instantly.
Conclusion: Current multimodal systems lack inductive biases for temporal continuity and causal understanding despite capturing rich visual-semantic correlations. The benchmark highlights a fundamental gap in temporal reasoning capabilities.
Abstract: Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and has not been adequately evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT): whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best model lags far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
[190] CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild
Alex Hoi Hang Chan, Neha Singhal, Onur Kocahan, Andrea Meltzer, Saverio Lubrano, Miyako H. Warrington, Michel Griesser, Fumihiro Kano, Hemal Naik
Main category: cs.CV
TL;DR: CHIRP dataset and CORVID method for individual re-identification of wild birds using computer vision, with application-specific biological metrics.
Details
Motivation: Long-term behavioral monitoring of individual animals is crucial for conservation and evolutionary biology, but automated behavior monitoring in wild populations remains challenging due to lack of comprehensive datasets covering necessary computer vision tasks.
Method: Introduces CHIRP dataset from wild Siberian jays supporting multiple vision tasks, and CORVID pipeline for individual identification based on segmentation and classification of colored leg rings with probability-based tracking.
Result: CORVID outperforms state-of-the-art re-id methods in application-specific benchmarking with biologically relevant metrics like feeding rates and co-occurrence rates.
Conclusion: Provides a blueprint for curating real-world datasets from ethically approved biological studies to bridge computer vision research with biological applications.
Abstract: Long-term behavioral monitoring of individual animals is crucial for studying behavioral changes that occur over different time scales, especially for conservation and evolutionary biology. Computer vision methods have proven to benefit biodiversity monitoring, but automated behavior monitoring in wild populations remains challenging. This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re-identification of wild birds. The CHIRP (Combining beHaviour, Individual Re-identification and Postures) dataset is curated from a long-term population of wild Siberian jays studied in Swedish Lapland, supporting re-identification (re-id), action recognition, 2D keypoint estimation, object detection, and instance segmentation. In addition to traditional task-specific benchmarking, we introduce application-specific benchmarking with biologically relevant metrics (feeding rates, co-occurrence rates) to evaluate the performance of models in real-world use cases. Finally, we present CORVID (COlouR-based Video re-ID), a novel pipeline for individual identification of birds based on the segmentation and classification of colored leg rings, a widespread approach for visual identification of individual birds. CORVID offers a probability-based id tracking method by matching the detected combination of color rings with a database. We use application-specific benchmarking to show that CORVID outperforms state-of-the-art re-id methods. We hope this work offers the community a blueprint for curating real-world datasets from ethically approved biological studies to bridge the gap between computer vision research and biological applications.
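CORVID's probability-based id tracking matches a detected combination of ring colors against a database of registered combinations. A hedged sketch of one such scoring rule follows; the per-position independence assumption and the smoothing constant are illustrative, not from the paper.

```python
def match_ring_identity(detected_probs: list[dict[str, float]],
                        ring_db: dict[str, list[str]]) -> dict[str, float]:
    """Score each bird id by the product of per-position color probabilities.

    detected_probs: per ring position, a {color: probability} map from the
    classifier. ring_db: bird id -> registered color sequence.
    Treating positions as independent is an illustrative assumption.
    """
    scores = {}
    for bird_id, combo in ring_db.items():
        p = 1.0
        for pos, color in enumerate(combo):
            p *= detected_probs[pos].get(color, 1e-6)  # smoothing for misses
        scores[bird_id] = p
    total = sum(scores.values()) or 1.0
    return {bird: p / total for bird, p in scores.items()}
```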
[191] Towards Controllable Low-Light Image Enhancement: A Continuous Multi-illumination Dataset and Efficient State Space Framework
Hongru Han, Tingrui Guo, Liming Zhang, Yan Su, Qiwen Xu, Zhuohua Ye
Main category: cs.CV
TL;DR: CLE-RWKV: A controllable low-light image enhancement framework using state space models with noise-decoupled supervision in HVI color space for better luminance control and chromatic fidelity.
Details
Motivation: Traditional low-light image enhancement methods treat the task as deterministic mapping, but this fails to account for the ill-posed nature where unknown ambient conditions create multimodal solutions. Current methods often need gt-mean post-processing to align output luminance, indicating a fundamental limitation.
Method: Proposes CLE-RWKV framework with: 1) Reformulation as controllable low-light enhancement (CLE), 2) Light100 benchmark with continuous real-world illumination transitions, 3) Noise-decoupled supervision in HVI color space to separate illumination modulation from texture restoration, 4) Space-to-Depth strategy to adapt State Space Models for dense prediction while maintaining local inductive biases.
Result: Experiments across seven benchmarks show competitive performance and robust controllability, providing a real-world multi-illumination alternative that significantly reduces reliance on gt-mean post-processing.
Conclusion: The paper successfully transitions low-light enhancement from deterministic mapping to controllable conditional problem, addressing fundamental limitations through architectural innovations and better supervision strategies.
Abstract: Low-light image enhancement (LLIE) has traditionally been formulated as a deterministic mapping. However, this paradigm often struggles to account for the ill-posed nature of the task, where unknown ambient conditions and sensor parameters create a multimodal solution space. Consequently, state-of-the-art methods frequently encounter luminance discrepancies between predictions and labels, often necessitating “gt-mean” post-processing to align output luminance for evaluation. To address this fundamental limitation, we propose a transition toward Controllable Low-light Enhancement (CLE), explicitly reformulating the task as a well-posed conditional problem. To this end, we introduce CLE-RWKV, a holistic framework supported by Light100, a new benchmark featuring continuous real-world illumination transitions. To resolve the conflict between luminance control and chromatic fidelity, a noise-decoupled supervision strategy in the HVI color space is employed, effectively separating illumination modulation from texture restoration. Architecturally, to adapt efficient State Space Models (SSMs) for dense prediction, we leverage a Space-to-Depth (S2D) strategy. By folding spatial neighborhoods into channel dimensions, this design allows the model to recover local inductive biases and effectively bridge the “scanning gap” inherent in flattened visual sequences without sacrificing linear complexity. Experiments across seven benchmarks demonstrate that our approach achieves competitive performance and robust controllability, providing a real-world multi-illumination alternative that significantly reduces the reliance on gt-mean post-processing.
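The Space-to-Depth step is a standard pixel-unshuffle: each r x r spatial neighborhood is folded into the channel dimension before the feature map is flattened for the state space model, so local neighborhoods survive flattening. A minimal sketch:

```python
import torch

def space_to_depth(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """(B, C, H, W) -> (B, C*r*r, H//r, W//r): fold each r x r spatial
    neighborhood into channels (equivalent to pixel-unshuffle)."""
    b, c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "spatial dims must be divisible by r"
    x = x.view(b, c, h // r, r, w // r, r)
    return x.permute(0, 1, 3, 5, 2, 4).reshape(b, c * r * r, h // r, w // r)
```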
[192] Insights on back marking for the automated identification of animals
David Brunner, Marie Bordes, Elisabeth Mayrhuber, Stephan M. Winkler, Viktoria Dorfer, Maciej Oczak
Main category: cs.CV
TL;DR: Study on designing back marks for pigs to optimize machine learning-based individual monitoring, analyzing how mark design affects neural network recognition performance.
Details
Motivation: There's little research on designing back marks for uniform-looking species like pigs, especially for machine learning-based monitoring systems. With the rise of ML solutions, guidelines are needed for creating marks that algorithms can effectively recognize.
Method: Trained a ResNet-50 neural network to classify ten pigs with unique back marks, then analyzed the model’s predictions to understand which design choices work best under various conditions.
Result: Analysis revealed that back marks must be designed to remain unambiguous under motion blur, diverse view angles, and occlusions. Design must also consider common data augmentation strategies like color, flip, and crop augmentations used during model training.
Conclusion: The insights can support individual-level monitoring in future studies and real-world applications by optimizing back mark design for machine learning recognition.
Abstract: To date, there is little research on how to design back marks to best support individual-level monitoring of uniform looking species like pigs. With the recent surge of machine learning-based monitoring solutions, there is a particular need for guidelines on the design of marks that can be effectively recognised by such algorithms. This study provides valuable insights on effective back mark design, based on the analysis of a machine learning model, trained to distinguish pigs via their back marks. Specifically, a neural network of type ResNet-50 was trained to classify ten pigs with unique back marks. The analysis of the model’s predictions highlights the significance of certain design choices, even in controlled settings. Most importantly, the set of back marks must be designed such that each mark remains unambiguous under conditions of motion blur, diverse view angles and occlusions, caused by animal behaviour. Further, the back mark design must consider data augmentation strategies commonly employed during model training, like colour, flip and crop augmentations. The generated insights can support individual-level monitoring in future studies and real-world applications by optimizing back mark design.
[193] Adaptive Learned Image Compression with Graph Neural Networks
Yunuo Chen, Bing He, Zezheng Lyu, Hongwei Hu, Qunshan Gu, Yuan Tian, Guo Lu
Main category: cs.CV
TL;DR: GLIC: Graph-based Learned Image Compression using GNNs with dual-scale graphs and adaptive connectivity for flexible, content-aware compression that outperforms traditional CNN/Transformer methods.
Details
Motivation: Current learned image compression methods using CNNs/Transformers have rigid receptive fields and static connectivity that couple non-redundant pixels due to proximity, limiting adaptive modeling of spatially varying redundancy across images.
Method: Proposes Graph Neural Network (GNN) based framework with dual-scale graphs enabling flexible, data-driven receptive fields and adaptive connectivity that dynamically adjusts neighbors per node based on local content complexity.
Result: Achieves SOTA performance with BD-rate reductions of 19.29%, 21.69%, and 18.71% relative to VTM-9.1 on Kodak, Tecnick, and CLIC datasets respectively.
Conclusion: Graph-based approach enables more effective modeling of diverse redundancy patterns for efficient and adaptive image compression, overcoming limitations of rigid CNN/Transformer architectures.
Abstract: Efficient image compression relies on modeling both local and global redundancy. Most state-of-the-art (SOTA) learned image compression (LIC) methods are based on CNNs or Transformers, which are inherently rigid. Standard CNN kernels and window-based attention mechanisms impose fixed receptive fields and static connectivity patterns, which potentially couple non-redundant pixels simply due to their proximity in Euclidean space. This rigidity limits the model’s ability to adaptively capture spatially varying redundancy across the image, particularly at the global level. To overcome these limitations, we propose a content-adaptive image compression framework based on Graph Neural Networks (GNNs). Specifically, our approach constructs dual-scale graphs that enable flexible, data-driven receptive fields. Furthermore, we introduce adaptive connectivity by dynamically adjusting the number of neighbors for each node based on local content complexity. These innovations empower our Graph-based Learned Image Compression (GLIC) model to effectively model diverse redundancy patterns across images, leading to more efficient and adaptive compression. Experiments demonstrate that GLIC achieves state-of-the-art performance, achieving BD-rate reductions of 19.29%, 21.69%, and 18.71% relative to VTM-9.1 on Kodak, Tecnick, and CLIC, respectively. Code will be released at https://github.com/UnoC-727/GLIC.
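The adaptive connectivity in GLIC boils down to giving each node a content-dependent neighbor count when building the feature-space graph. The sketch below uses per-node feature variance as a stand-in complexity measure; the paper's actual criterion is not spelled out in the abstract.

```python
import torch

def adaptive_knn_edges(feat: torch.Tensor, k_min: int = 4, k_max: int = 16):
    """Build kNN edges with a per-node neighbor budget in [k_min, k_max].

    feat: (N, D) node features. Feature variance is an illustrative proxy
    for local content complexity, not GLIC's actual criterion.
    """
    dist = torch.cdist(feat, feat)                                  # (N, N)
    complexity = feat.var(dim=-1)
    norm = (complexity - complexity.min()) / (complexity.max() - complexity.min() + 1e-8)
    k_per_node = (k_min + norm * (k_max - k_min)).round().long()
    edges = []
    for i, k in enumerate(k_per_node.tolist()):
        nbrs = dist[i].topk(k + 1, largest=False).indices[1:]       # drop self
        edges.extend((i, int(j)) for j in nbrs)
    return edges
```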
[194] Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference
Sk Miraj Ahmed, Xi Yu, Yunqi Li, Yuewei Lin, Wei Xu
Main category: cs.CV
TL;DR: Hierarchy-aware multimodal learning for biodiversity identification using images and DNA barcodes with hierarchical regularization and flexible fusion
Details
Motivation: Existing multimodal methods for biodiversity identification treat taxonomy as flat labels, failing to encode hierarchical biological classification structure, which is critical for robustness under noise and missing modalities.
Method: Two end-to-end variants: CLiBD-HiR with Hierarchical Information Regularization to shape embedding geometry across taxonomic levels, and CLiBD-HiR-Fuse with additional lightweight fusion predictor supporting image-only, DNA-only, or joint inference resilient to modality corruption.
Result: Improves taxonomic classification accuracy by over 14% compared to strong multimodal baselines, with particularly large gains under partial and corrupted DNA conditions
Conclusion: Explicitly encoding biological hierarchy together with flexible fusion is key for practical biodiversity foundation models
Abstract: Accurate biodiversity identification from large-scale field data is a foundational problem with direct impact on ecology, conservation, and environmental monitoring. In practice, the core task is taxonomic prediction - inferring order, family, genus, or species from imperfect inputs such as specimen images, DNA barcodes, or both. Existing multimodal methods often treat taxonomy as a flat label space and therefore fail to encode the hierarchical structure of biological classification, which is critical for robustness under noise and missing modalities. We present two end-to-end variants for hierarchy-aware multimodal learning: CLiBD-HiR, which introduces Hierarchical Information Regularization (HiR) to shape embedding geometry across taxonomic levels, yielding structured and noise-robust representations; and CLiBD-HiR-Fuse, which additionally trains a lightweight fusion predictor that supports image-only, DNA-only, or joint inference and is resilient to modality corruption. Across large-scale biodiversity benchmarks, our approach improves taxonomic classification accuracy by over 14 percent compared to strong multimodal baselines, with particularly large gains under partial and corrupted DNA conditions. These results highlight that explicitly encoding biological hierarchy, together with flexible fusion, is key for practical biodiversity foundation models.
[195] MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data
Zhekai Chen, Yuqing Wang, Manyuan Zhang, Xihui Liu
Main category: cs.CV
TL;DR: MacroData: A large-scale dataset for multi-reference image generation with up to 10 reference images across four task dimensions, plus MacroBench benchmark for evaluation.
Details
Motivation: Current models for multi-reference image generation suffer performance degradation as reference count increases due to data bottleneck - existing datasets lack structured, long-context supervision for dense inter-reference dependencies.
Method: Introduce MacroData (400K samples with up to 10 references) organized across four dimensions: Customization, Illustration, Spatial reasoning, and Temporal dynamics. Also propose MacroBench benchmark (4,000 samples) for standardized evaluation across task dimensions and input scales.
Result: Fine-tuning on MacroData yields substantial improvements in multi-reference generation. Ablation studies reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity.
Conclusion: MacroData addresses the fundamental data bottleneck in multi-reference image generation, and MacroBench provides standardized evaluation. The dataset and benchmark will be publicly released to advance the field.
Abstract: Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions – Customization, Illustration, Spatial reasoning, and Temporal dynamics – to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.
[196] DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial
Zhenchen Zhu, Ge Hu, Weixiong Tan, Kai Gao, Chao Sun, Zhen Zhou, Kepei Xu, Wei Han, Meixia Shang, Xiaoming Qiu, Yiqing Tan, Jinhua Wang, Zhoumeng Ying, Li Peng, Wei Song, Lan Song, Zhengyu Jin, Nan Hong, Yizhou Yu
Main category: cs.CV
TL;DR: DeepFAN is a transformer-based model for classifying benign vs malignant lung nodules from CT scans, validated through clinical trials showing significant improvement in junior radiologists’ diagnostic performance.
Details
Motivation: Current deep learning methods for lung nodule classification often fail to comprehensively integrate global and local features, and most lack clinical trial validation despite widespread CT adoption increasing detected nodules.
Method: Developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules, and conducted a multi-reader, multi-case clinical trial across three medical institutions with 400 cases to evaluate its efficacy in assisting junior radiologists.
Result: DeepFAN achieved AUC of 0.939 (internal) and 0.954 (clinical trial). Junior radiologists’ performance improved significantly: 10.9% AUC increase, 10.0% accuracy, 7.6% sensitivity, 12.6% specificity. Inter-reader consistency improved from fair to moderate.
Conclusion: DeepFAN effectively assists junior radiologists, potentially homogenizing diagnostic quality and reducing unnecessary follow-up of indeterminate pulmonary nodules.
Abstract: The widespread adoption of CT has notably increased the number of detected lung nodules. However, current deep learning methods for classifying benign and malignant nodules often fail to comprehensively integrate global and local features, and most of them have not been validated through clinical trials. To address this, we developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules and further conducted a multi-reader, multi-case clinical trial to evaluate its efficacy in assisting junior radiologists. DeepFAN achieved diagnostic area under the curve (AUC) of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset involving 400 cases across three independent medical institutions. Explainability analysis indicated higher contributions from global than local features. Twelve readers’ average performance significantly improved by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall k: 0.313 vs. 0.421; P=0.019). In conclusion, DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.
[197] HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT
Yongsung Kim, Wooseok Song, Jaihyun Lew, Hun Hwangbo, Jaehoon Lee, Sungroh Yoon
Main category: cs.CV
TL;DR: HeSS: A two-stage sparsification method for VGGT that uses head sensitivity scores to allocate attention budgets heterogeneously, reducing computational costs while minimizing accuracy degradation.
Details
Motivation: Existing sparsification techniques for Visual Geometry Grounded Transformer (VGGT) suffer from substantial accuracy degradation due to applying uniform sparsity patterns across all attention heads, ignoring the heterogeneous sensitivity characteristics of different heads.
Method: Two-stage pipeline: 1) Measure head-wise sparsification sensitivity using Head Sensitivity Score (HeSS), which approximates the Hessian with respect to error terms on a calibration set; 2) Perform HeSS-Guided Sparsification that reallocates total attention budget by assigning denser attention to sensitive heads and sparser attention to robust ones.
Result: HeSS effectively captures head-wise sparsification sensitivity, and the method demonstrates strong robustness across varying sparsification levels, effectively mitigating performance degradation under high sparsity.
Conclusion: The proposed heterogeneous sparsification approach based on head sensitivity scores enables efficient VGGT acceleration while maintaining accuracy, addressing the limitations of uniform sparsification methods.
Abstract: Visual Geometry Grounded Transformer (VGGT) has advanced 3D vision, yet its global attention layers suffer from quadratic computational costs that hinder scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approximates the Hessian with respect to two distinct error terms on a small calibration set. In the inference stage, we perform HeSS-Guided Sparsification, leveraging the pre-computed HeSS to reallocate the total attention budget, assigning denser attention to sensitive heads and sparser attention to more robust ones. We demonstrate that HeSS effectively captures head-wise sparsification sensitivity and empirically confirm that attention heads in the global attention layers exhibit heterogeneous sensitivity characteristics. Extensive experiments further show that our method effectively mitigates performance degradation under high sparsity, demonstrating strong robustness across varying sparsification levels. Code is available at https://github.com/libary753/HeSS.
[198] Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification
Ünsal Öztürk, Hatef Otroshi Shahreza, Sébastien Marcel
Main category: cs.CV
TL;DR: Benchmark study evaluating demographic fairness of 9 open-source MLLMs on face verification tasks across ethnicity and gender groups, finding FaceLLM-8B outperforms general-purpose models and revealing different bias patterns than traditional face recognition.
Details
Motivation: While MLLMs have been explored for face verification, their demographic fairness remains largely unexplored. The paper aims to benchmark fairness of open-source MLLMs on face verification across different demographic groups.
Method: Evaluated 9 open-source MLLMs from 6 model families (2B-8B parameters) on IJB-C and RFW face verification protocols across 4 ethnicity groups and 2 gender groups. Measured verification accuracy with Equal Error Rate and True Match Rate at multiple operating points, and quantified demographic disparity with four FMR-based fairness metrics.
Result: FaceLLM-8B (face-specialized model) substantially outperformed general-purpose MLLMs on both benchmarks. Bias patterns differed from traditional face recognition, with different groups most affected depending on benchmark and model. Most accurate models weren’t necessarily fairest, and models with poor overall accuracy could appear fair due to uniformly high error rates.
Conclusion: MLLM face verification systems exhibit different fairness characteristics than traditional face recognition, requiring specialized evaluation. Face-specialized models perform better, but fairness considerations are complex and not directly correlated with overall accuracy.
Abstract: Multimodal Large Language Models (MLLMs) have recently been explored as face verification systems that determine whether two face images are of the same person. Unlike dedicated face recognition systems, MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities. However, the demographic fairness of these models remains largely unexplored. In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. We measure verification accuracy with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and we quantify demographic disparity with four FMR-based fairness metrics. Our results show that FaceLLM-8B, the only face-specialised model in our study, substantially outperforms general-purpose MLLMs on both benchmarks. The bias patterns we observe differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model. We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.
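The benchmark's headline metric, the Equal Error Rate, is the operating point where the false match rate and false non-match rate coincide; computing it per demographic group yields the disparity comparison. A common numpy implementation, as a sketch:

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER for a verification system.

    scores: similarity scores per pair; labels: 1 = genuine, 0 = impostor.
    Sweeps thresholds and returns the rate where FMR ~= FNMR.
    """
    thresholds = np.sort(np.unique(scores))
    genuine, impostor = scores[labels == 1], scores[labels == 0]
    fnmr = np.array([(genuine < t).mean() for t in thresholds])   # missed matches
    fmr = np.array([(impostor >= t).mean() for t in thresholds])  # false matches
    idx = int(np.argmin(np.abs(fmr - fnmr)))
    return float((fmr[idx] + fnmr[idx]) / 2.0)
```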
[199] InstanceAnimator: Multi-Instance Sketch Video Colorization
Yinhan Zhang, Yue Ma, Bingyuan Wang, Kunyu Feng, Yeying Jin, Qifeng Chen, Anyi Rao, Zeyu Wang
Main category: cs.CV
TL;DR: InstanceAnimator is a Diffusion Transformer framework for multi-instance sketch video colorization that addresses limitations in user control, instance alignment, and detail fidelity through canvas guidance, instance matching, and adaptive decoupled control.
Details
Motivation: Existing sketch video colorization methods have three core limitations: 1) inflexible user control due to heavy reliance on single reference frames, 2) poor instance controllability causing misalignment in multi-character scenarios, and 3) degraded detail fidelity in fine-grained regions.
Method: Three key innovations: 1) Canvas Guidance Condition for flexible placement of reference elements and background, 2) Instance Matching Mechanism integrating instance features with sketches for precise multi-character control, 3) Adaptive Decoupled Control Module injecting semantic features from characters, backgrounds, and text conditions into the diffusion process.
Result: Extensive experiments demonstrate superior multi-instance colorization with enhanced user control, high visual quality, and strong instance consistency compared to existing methods.
Conclusion: InstanceAnimator effectively addresses the core limitations of existing sketch video colorization methods, providing a robust framework for multi-instance scenarios with improved flexibility, alignment, and detail preservation.
Abstract: We propose InstanceAnimator, a novel Diffusion Transformer framework for multi-instance sketch video colorization. Existing methods suffer from three core limitations: inflexible user control due to heavy reliance on single reference frames, poor instance controllability leading to misalignment in multi-character scenarios, and degraded detail fidelity in fine-grained regions. To address these challenges, we introduce three corresponding innovations. First, a Canvas Guidance Condition eliminates workflow fragmentation by allowing free placement of reference elements and background, enabling unprecedented user flexibility. Second, an Instance Matching Mechanism resolves misalignment by integrating instance features with the sketches, ensuring precise control over multiple characters. Third, an Adaptive Decoupled Control Module enhances detail fidelity by injecting semantic features from characters, backgrounds, and text conditions into the diffusion process. Extensive experiments demonstrate that InstanceAnimator achieves superior multi-instance colorization with enhanced user control, high visual quality, and strong instance consistency.
[200] LanteRn: Latent Visual Structured Reasoning
André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins
Main category: cs.CV
TL;DR: LanteRn enables multimodal models to perform visual reasoning directly in latent space using continuous visual thought embeddings, improving efficiency and fine-grained understanding.
Details
Motivation: Current large multimodal models struggle with visual reasoning, often defaulting to verbalizing visual content into text, which limits fine-grained spatial and visual understanding. Existing approaches either rely on external modules or inefficient pixel-space reasoning.
Method: LanteRn augments a vision-language transformer to generate and attend to continuous visual thought embeddings during inference. Training involves two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility.
Result: LanteRn shows consistent improvements on three perception-centric benchmarks (VisCoT, V*, and Blink), demonstrating better visual grounding and fine-grained reasoning compared to existing approaches.
Conclusion: Internal latent representations provide a promising direction for more efficient multimodal reasoning, enabling models to think with images directly in latent space rather than relying on text verbalization or pixel-space processing.
Abstract: While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules, or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations provide a promising direction for more efficient multimodal reasoning.
[201] CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation
Jeannie Chung, Hanna Jang, Ingyeong Yang, Uiwon Hwang, Jaehyung Sim
Main category: cs.CV
TL;DR: CLIP-RD: A relational knowledge distillation framework that preserves multi-directional structural relationships between teacher and student embeddings for efficient CLIP model compression.
Details
Motivation: CLIP models require substantial computational resources, motivating lightweight student models via distillation. Existing methods fail to model multi-directional relational dependencies between teacher and student embeddings, limiting preservation of structural relationships.
Method: Proposes two novel methods: Vertical Relational Distillation (VRD) that enforces consistency of teacher-student distillation strength across modalities at distribution level, and Cross Relational Distillation (XRD) that imposes bidirectional symmetry on cross-modal teacher-student similarity distributions.
Result: Outperforms existing CLIP distillation methods by 0.8 percentage points by better aligning student embedding geometry with teacher structure.
Conclusion: Jointly modeling multi-directional relational structures enables more faithful distillation of CLIP’s capabilities into lightweight models while preserving the original embedding geometry.
Abstract: CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. However, existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, limiting the student’s ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two novel methods, Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8%p.
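As a flavor of what distribution-level relational matching looks like (not the paper's exact VRD/XRD formulation), the sketch below aligns a student's cross-modal similarity distribution with the teacher's via KL divergence; the temperature is an assumption.

```python
import torch
import torch.nn.functional as F

def cross_modal_relation_kl(t_img, t_txt, s_img, s_txt, tau: float = 0.05):
    """Match the student's image-to-text similarity distribution to the
    teacher's (a generic relational-distillation sketch, not CLIP-RD's
    exact VRD/XRD losses)."""
    t_sim = F.normalize(t_img, dim=-1) @ F.normalize(t_txt, dim=-1).T
    s_sim = F.normalize(s_img, dim=-1) @ F.normalize(s_txt, dim=-1).T
    teacher = F.softmax(t_sim / tau, dim=-1)
    student = F.log_softmax(s_sim / tau, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")
```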
[202] No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
Hai X. Pham, David T. Hoffmann, Ricardo Guerrero, Brais Martinez
Main category: cs.CV
TL;DR: CLIP-style V&L models improved for compositionality via short concept-centric captions and cross-modal attention pooling, achieving SOTA on compositionality benchmarks while maintaining zero-shot/retrieval performance.
Details
Motivation: Contrastive V&L models have limited compositional representation learning. Prior methods using custom hard negatives degrade basic V&L capabilities and don't generalize well. Need to address root causes of compositionality limitations without sacrificing core V&L performance.
Method: Two key modifications: 1) Extract short concept-centric caption parts using NLP tools and align with images; 2) Introduce parameter-free cross-modal attention pooling to obtain concept-centric visual embeddings. Use simple auxiliary contrastive losses.
Result: Achieves state-of-the-art performance on standard compositionality benchmarks while maintaining or improving zero-shot and retrieval capabilities. No inference cost increase.
Conclusion: Addressing root causes of compositionality limitations through concept-centric alignment and attention pooling enables improved compositional understanding in V&L models without sacrificing core capabilities.
Abstract: Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders lead to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/SamsungLabs/concept_centric_clip.
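The parameter-free cross-modal attention pooling can be read as: use the concept's text embedding as a query over image patch tokens, with no learned projections. A minimal sketch, with the temperature as an assumption:

```python
import torch
import torch.nn.functional as F

def concept_attention_pool(patches: torch.Tensor, concept: torch.Tensor,
                           tau: float = 0.07) -> torch.Tensor:
    """Parameter-free pooling: softmax-weighted sum of patch embeddings,
    weighted by similarity to a concept text embedding.

    patches: (N, D) patch tokens; concept: (D,) text embedding.
    The temperature tau is an illustrative assumption.
    """
    weights = F.softmax(patches @ concept / tau, dim=0)   # (N,)
    return weights @ patches                               # (D,)
```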
[203] Multimodal Dataset Distillation via Phased Teacher Models
Shengbin Guo, Hang Zhao, Senqiao Yang, Chenyang Jiang, Yuhang Cheng, Xiangru Peng, Rui Shao, Zhuotao Tian
Main category: cs.CV
TL;DR: A phased distillation framework (PTM-ST) that improves multimodal dataset distillation by better capturing teacher model’s evolving knowledge across training stages using stage-aware modeling and shortcut trajectories.
Details
Motivation: Existing multimodal dataset distillation methods fail to capture the complex, dynamically evolving knowledge embedded in later training stages of teacher models, leading to degraded student performance and poor distilled data quality.
Method: Proposes PTM-ST (Phased Teacher Model with Shortcut Trajectory), a phased distillation framework with stage-aware teacher modeling and shortcut-based trajectory construction to accurately fit teacher’s learning dynamics across distinct training phases.
Result: Significantly mitigates optimization oscillations and inter-phase knowledge gaps, reduces storage overhead, and consistently surpasses state-of-the-art baselines on Flickr30k and COCO, achieving up to 13.5% absolute improvement and average gain of 9.53% on Flickr30k.
Conclusion: PTM-ST effectively addresses critical challenges in multimodal dataset distillation by enhancing stability and expressiveness through better modeling of teacher model’s evolving knowledge across training stages.
Abstract: Multimodal dataset distillation aims to construct compact synthetic datasets that enable efficient compression and knowledge transfer from large-scale image-text data. However, existing approaches often fail to capture the complex, dynamically evolving knowledge embedded in the later training stages of teacher models. This limitation leads to degraded student performance and compromises the quality of the distilled data. To address critical challenges such as pronounced cross-stage performance gaps and unstable teacher trajectories, we propose Phased Teacher Model with Shortcut Trajectory (PTM-ST) – a novel phased distillation framework. PTM-ST leverages stage-aware teacher modeling and a shortcut-based trajectory construction strategy to accurately fit the teacher’s learning dynamics across distinct training phases. This enhances both the stability and expressiveness of the distillation process. Through theoretical analysis and comprehensive experiments, we show that PTM-ST significantly mitigates optimization oscillations and inter-phase knowledge gaps, while also reducing storage overhead. Our method consistently surpasses state-of-the-art baselines on Flickr30k and COCO, achieving up to 13.5% absolute improvement and an average gain of 9.53% on Flickr30k. Code: https://github.com/Previsior/PTM-ST.
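PTM-ST builds on trajectory matching; a generic (not PTM-ST-specific) per-phase matching objective looks roughly like the sketch below, where the teacher checkpoints bound one training phase.

```python
import torch

def phase_trajectory_loss(student_params, teacher_start, teacher_end):
    """Generic trajectory-matching term (a sketch, not the exact PTM-ST loss).
    student_params: flattened student weights after a few optimizer steps on
    the synthetic data, initialized from `teacher_start`; the teacher tensors
    are the endpoints of one training phase (a shortcut segment)."""
    num = (student_params - teacher_end).pow(2).sum()
    den = (teacher_start - teacher_end).pow(2).sum().clamp_min(1e-12)
    return num / den  # normalized distance to the phase endpoint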
[204] Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming
Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz
Main category: cs.CV
TL;DR: Just Zoom In reformulates cross-view geo-localization as an autoregressive zooming task over satellite maps instead of contrastive image retrieval, achieving state-of-the-art performance.
Details
Motivation: Existing CVGL methods use contrastive image retrieval which requires large batches and hard negative mining, ignores geometric map structure, and suffers from coverage mismatch between street-view and satellite imagery where landmarks may fall outside fixed satellite crops.
Method: Proposes autoregressive zooming over city-scale overhead maps: starting from coarse satellite view, model makes sequential zoom-in decisions to select terminal satellite cell at target resolution, eliminating need for contrastive losses or hard negative mining.
Result: Achieves state-of-the-art performance on realistic benchmark with crowd-sourced street views and high-resolution satellite imagery, improving Recall@1 within 50m by 5.5% and Recall@1 within 100m by 9.6% over strongest contrastive-retrieval baseline.
Conclusion: Sequential coarse-to-fine spatial reasoning is more effective than contrastive retrieval for cross-view geo-localization, demonstrating the value of autoregressive zooming over geometric map structures.
Abstract: Cross-view geo-localization (CVGL) estimates a camera’s location by matching a street-view image to geo-referenced overhead imagery, enabling GPS-denied localization and navigation. Existing methods almost universally formulate CVGL as an image-retrieval problem in a contrastively trained embedding space. This ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery. In particular, salient landmarks visible from the street view can fall outside a fixed satellite crop, making retrieval targets ambiguous and limiting explicit spatial inference over the map. We propose Just Zoom In, an alternative formulation that performs CVGL via autoregressive zooming over a city-scale overhead map. Starting from a coarse satellite view, the model takes a short sequence of zoom-in decisions to select a terminal satellite cell at a target resolution, without contrastive losses or hard negative mining. We further introduce a realistic benchmark with crowd-sourced street views and high-resolution satellite imagery that reflects real capture conditions. On this benchmark, Just Zoom In achieves state-of-the-art performance, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline. These results demonstrate the effectiveness of sequential coarse-to-fine spatial reasoning for cross-view geo-localization.
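Conceptually, the inference loop is a coarse-to-fine descent over a map pyramid. The sketch below assumes a quadtree-style `sat_pyramid` helper and a `model.score` function, both hypothetical, just to show the control flow.

```python
def zoom_localize(model, sat_pyramid, street_img, num_steps):
    """Sequential zoom-in decisions from the coarsest view down to a
    terminal cell; no contrastive retrieval over a gallery is involved."""
    cell = sat_pyramid.root()                      # city-scale satellite view
    for _ in range(num_steps):
        children = sat_pyramid.children(cell)      # e.g. the 4 sub-cells
        scores = [model.score(street_img, c) for c in children]
        cell = children[max(range(len(children)), key=scores.__getitem__)]
    return cell                                    # cell at target resolution
```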
[205] FSGNet: A Frequency-Aware and Semantic Guidance Network for Infrared Small Target Detection
Yingmei Zhang, Wangtao Bao, Yong Yang, Weiguo Wan, Qin Xiao, Xueting Zou
Main category: cs.CV
TL;DR: FSGNet is a lightweight infrared small target detection framework that addresses semantic degradation in U-Net by incorporating frequency-aware filtering and semantic guidance mechanisms for precise target localization.
Details
Motivation: U-Net architectures for infrared small target detection suffer from semantic degradation when transferring high-level features from deep to shallow layers, limiting precise localization of small targets in complex backgrounds.
Method: Proposes FSGNet with: 1) multi-directional interactive attention in encoder for fine-grained directional features, 2) multi-scale frequency-aware module using FFT to filter target-similar clutter, and 3) global semantic guidance flows from deep layers to decoder stages for semantic consistency.
Result: Extensive experiments on four public IRSTD datasets demonstrate superior detection performance and high efficiency, highlighting practical applicability and robustness.
Conclusion: FSGNet effectively addresses semantic degradation in U-Net for IRSTD through frequency-aware filtering and semantic guidance, achieving state-of-the-art performance with lightweight design.
Abstract: Infrared small target detection (IRSTD) aims to identify and distinguish small targets from complex backgrounds. Leveraging the powerful multi-scale feature fusion capability of the U-Net architecture, IRSTD has achieved significant progress. However, U-Net suffers from semantic degradation when transferring high-level features from deep to shallow layers, limiting the precise localization of small targets. To address this issue, this paper proposes FSGNet, a lightweight and effective detection framework incorporating frequency-aware and semantic guidance mechanisms. Specifically, a multi-directional interactive attention module is proposed throughout the encoder to capture fine-grained and directional features, enhancing the network’s sensitivity to small, low-contrast targets. To suppress background interference propagated through skip connections, a multi-scale frequency-aware module leverages Fast Fourier transform to filter out target-similar clutter while preserving salient target structures. At the deepest layer, a global pooling module captures high-level semantic information, which is subsequently upsampled and propagated to each decoder stage through the global semantic guidance flows, ensuring semantic consistency and precise localization across scales. Extensive experiments on four public IRSTD datasets demonstrate that FSGNet achieves superior detection performance and maintains high efficiency, highlighting its practical applicability and robustness. The codes will be released on https://github.com/Wangtao-Bao/FSGNet.
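As a rough illustration of frequency-domain filtering on feature maps (the actual FSGNet module is multi-scale and learned), a low-pass round trip through the 2D FFT looks like this; the cutoff is an illustrative parameter.

```python
import torch

def frequency_filter(feat, keep_low=0.25):
    """feat: (B, C, H, W). Keeps a low-frequency band and attenuates the rest,
    the kind of operation used to suppress target-similar high-frequency clutter."""
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H) - H // 2,
                            torch.arange(W) - W // 2, indexing="ij")
    radius = (yy.float() ** 2 + xx.float() ** 2).sqrt()
    mask = (radius <= keep_low * radius.max()).float()   # circular low-pass
    out = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1)))
    return out.real
```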
[206] PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders
Niccolò Cavagnero, Narges Norouzi, Gijs Dubbelman, Daan de Geus
Main category: cs.CV
TL;DR: PMT is a fast Transformer-based segmentation decoder that works on frozen Vision Foundation Model features, enabling efficient multi-task sharing while maintaining competitive accuracy for image and video segmentation.
Details
Motivation: Current VFM-based encoder-only segmentation models require finetuning the encoder, which sacrifices the multi-task encoder sharing that makes VFMs practical for large-scale deployment. There's a need to reconcile encoder-only simplicity/speed with frozen VFM features.
Method: Proposes Plain Mask Decoder (PMD), a fast Transformer-based segmentation decoder that operates on top of frozen VFM features. The resulting Plain Mask Transformer (PMT) preserves encoder-only architectural simplicity while keeping encoder representation unchanged and shareable.
Result: On image segmentation benchmarks, PMT matches frozen-encoder state-of-the-art while running up to ~3x faster. For video segmentation, it performs on par with fully finetuned methods while being up to 8x faster than state-of-the-art frozen-encoder models.
Conclusion: PMT successfully reconciles encoder-only simplicity and speed with frozen VFM features, enabling efficient multi-task sharing while maintaining competitive performance for both image and video segmentation tasks.
Abstract: Vision Foundation Models (VFMs) pre-trained at scale enable a single frozen encoder to serve multiple downstream tasks simultaneously. Recent VFM-based encoder-only models for image and video segmentation, such as EoMT and VidEoMT, achieve competitive accuracy with remarkably low latency, yet they require finetuning the encoder, sacrificing the multi-task encoder sharing that makes VFMs practically attractive for large-scale deployment. To reconcile encoder-only simplicity and speed with frozen VFM features, we propose the Plain Mask Decoder (PMD), a fast Transformer-based segmentation decoder that operates on top of frozen VFM features. The resulting model, the Plain Mask Transformer (PMT), preserves the architectural simplicity and low latency of encoder-only designs while keeping the encoder representation unchanged and shareable. The design seamlessly applies to both image and video segmentation, inheriting the generality of the encoder-only framework. On standard image segmentation benchmarks, PMT matches the frozen-encoder state of the art while running up to ~3x faster. For video segmentation, it even performs on par with fully finetuned methods, while being up to 8x faster than state-of-the-art frozen-encoder models. Code: https://github.com/tue-mps/pmt.
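The frozen-encoder setup itself is straightforward; below is a minimal sketch of the training arrangement (class and argument names are placeholders, not the paper's code).

```python
import torch
import torch.nn as nn

class FrozenEncoderSegmenter(nn.Module):
    """Only the decoder is trained, so a single encoder forward pass
    (or cached features) can be shared across many task heads."""

    def __init__(self, vfm_encoder: nn.Module, mask_decoder: nn.Module):
        super().__init__()
        self.encoder = vfm_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)       # encoder stays frozen and shareable
        self.decoder = mask_decoder       # only these weights receive gradients

    def forward(self, images):
        with torch.no_grad():
            feats = self.encoder(images)  # could be precomputed and cached
        return self.decoder(feats)
```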
[207] Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, Xiang Bai
Main category: cs.CV
TL;DR: Hybrid Memory paradigm for video world models that handles dynamic subjects disappearing and reappearing, with HM-World dataset and HyDRA architecture for improved subject tracking and consistency.
Details
Motivation: Current video world models treat environments as static canvases and struggle when dynamic subjects hide out of sight and later re-emerge, leading to frozen, distorted, or vanishing subjects in generated videos.
Method: Introduces Hybrid Memory paradigm requiring models to act as archivists for static backgrounds and trackers for dynamic subjects. Creates HM-World dataset with 59K high-fidelity clips featuring decoupled camera/subject trajectories and exit-entry events. Proposes HyDRA architecture that compresses memory into tokens and uses spatiotemporal relevance-driven retrieval to selectively attend to motion cues.
Result: Extensive experiments on HM-World show the method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
Conclusion: Hybrid Memory paradigm addresses critical limitations in video world models for handling dynamic subjects, with HM-World dataset and HyDRA architecture providing effective solutions for maintaining subject identity and motion continuity during out-of-view intervals.
Abstract: Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
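A bare-bones version of relevance-driven retrieval over compressed memory tokens might look like the following; the cosine-similarity scoring and top-k policy here are assumptions standing in for HyDRA's learned mechanism.

```python
import torch
import torch.nn.functional as F

def retrieve_memory(query_tokens, memory_tokens, k=64):
    """query_tokens: (Q, D) tokens of the frame being generated.
    memory_tokens: (M, D) compressed tokens summarizing past frames.
    Returns the k memory tokens most relevant to current content, e.g.
    cues about a subject that is temporarily out of view."""
    q = F.normalize(query_tokens, dim=-1)
    m = F.normalize(memory_tokens, dim=-1)
    relevance = (q @ m.T).max(dim=0).values        # best match per memory token
    idx = relevance.topk(min(k, m.shape[0])).indices
    return memory_tokens[idx]                      # attended to during denoising
```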
[208] LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior
Xinkai Wang, Chenyi Wang, Yifu Xu, Mingzhe Ye, Fu-Cheng Zhang, Jialin Tian, Xinyu Zhan, Lifeng Zhu, Cewu Lu, Lixin Yang
Main category: cs.CV
TL;DR: LaMP is a dual-expert VLA framework that uses 3D scene flow as latent motion prior for robotic manipulation, improving performance and robustness over existing VLA models.
Details
Motivation: Existing VLA models directly regress actions from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly, which degrades under unfamiliar spatial dynamics.
Method: Dual-expert framework with flow-matching Motion Expert that generates partially denoised 3D scene flow, and policy-predicting Action Expert conditioned through gated cross-attention on motion expert’s hidden states.
Result: Consistently outperforms VLA baselines on LIBERO, LIBERO-Plus, and SimplerEnv-WidowX benchmarks with highest reported average success rates. Shows 9.7% average gain over strongest baseline on LIBERO-Plus OOD perturbations.
Conclusion: LaMP effectively embeds 3D scene flow as latent motion prior to improve robotic manipulation performance and robustness in VLA models.
Abstract: We introduce \textbf{LaMP}, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly. This implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching \emph{Motion Expert} with a policy-predicting \emph{Action Expert} through gated cross-attention. Specifically, the Motion Expert generates a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as real-world experiments. LaMP consistently outperforms evaluated VLA baselines across LIBERO, LIBERO-Plus, and SimplerEnv-WidowX benchmarks, achieving the highest reported average success rates under the same training budgets. On LIBERO-Plus OOD perturbations, LaMP shows improved robustness with an average 9.7% gain over the strongest prior baseline. Our project page is available at https://summerwxk.github.io/lamp-project-page/.
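Gated cross-attention conditioning of this kind is commonly implemented with a zero-initialized tanh gate (as in Flamingo); the following is a sketch of that generic pattern, not LaMP's exact module.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Action tokens attend to the Motion Expert's hidden states; a gate
    initialized at zero lets the conditioning signal ramp up during training."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: identity at init

    def forward(self, action_tokens, motion_hidden):
        attended, _ = self.attn(query=action_tokens,
                                key=motion_hidden, value=motion_hidden)
        return action_tokens + torch.tanh(self.gate) * attended
```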
[209] PixelSmile: Toward Fine-Grained Facial Expression Editing
Jiabin Hua, Hengyuan Xu, Aojie Li, Wei Cheng, Gang Yu, Xingjun Ma, Yu-Gang Jiang
Main category: cs.CV
TL;DR: PixelSmile: A diffusion framework for fine-grained facial expression editing with continuous control and identity preservation, using a new FFE dataset with continuous affective annotations.
Details
Motivation: Fine-grained facial expression editing has been limited by intrinsic semantic overlap between different expressions, making precise control difficult. Existing methods struggle with disentangling expression semantics while preserving identity.
Method: Proposes PixelSmile, a diffusion framework with fully symmetric joint training that disentangles expression semantics. Uses intensity supervision with contrastive learning to produce stronger, more distinguishable expressions. Achieves linear expression control through textual latent interpolation.
Result: PixelSmile achieves superior disentanglement and robust identity preservation compared to existing methods. It enables precise, stable linear expression control and naturally supports smooth expression blending.
Conclusion: PixelSmile is effective for continuous, controllable, and fine-grained expression editing while maintaining identity preservation, addressing the fundamental challenge of semantic overlap in facial expression editing.
Abstract: Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.
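Textual latent interpolation reduces to a convex combination of text-encoder embeddings; a minimal sketch (the prompt wording and the linear form are assumptions):

```python
import torch

def interpolate_expression(neutral_emb, expression_emb, intensity):
    """neutral_emb / expression_emb: text embeddings for, say, 'a neutral face'
    and 'a broadly smiling face'; intensity in [0, 1] sweeps the diffusion
    conditioning continuously between the two."""
    return torch.lerp(neutral_emb, expression_emb, intensity)

# Smooth blending follows the same idea: a convex combination of several
# expression embeddings, e.g. 0.5 * smile_emb + 0.5 * surprise_emb.
```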
[210] HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models
Huizhi Liang, Yichao Shen, Yu Deng, Sicheng Xu, Zhiyuan Feng, Tong Zhang, Yaobo Liang, Jiaolong Yang
Main category: cs.CV
TL;DR: A hierarchical framework for teaching VLMs 3D spatial understanding through four progressive levels, using automated data generation and RGB-D inputs to achieve SOTA performance on spatial reasoning benchmarks.
Details
Motivation: Current VLMs lack human-like 3D spatial intelligence: the ability to infer 3D structures from 2D observations, recognize object properties/relations in 3D space, and perform high-level spatial reasoning.
Method: Proposes hierarchical framework with four progressive levels (geometric perception to abstract reasoning). Creates automated pipeline processing 5M images with 45M+ objects to generate 3D spatial VQA pairs for VLM fine-tuning. Develops RGB-D VLM with metric-scale point maps as auxiliary inputs.
Result: Achieves state-of-the-art performance on multiple spatial understanding/reasoning benchmarks, surpassing specialized spatial models and large proprietary systems like Gemini-2.5-pro and GPT-5. Analysis reveals clear dependencies among hierarchical task levels.
Conclusion: The hierarchical framework effectively teaches VLMs 3D spatial intelligence, with multi-level task design facilitating emergence of spatial understanding. The approach demonstrates superior performance over existing systems.
Abstract: Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.
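The metric-scale point-map input can be derived from depth and camera intrinsics by standard pinhole back-projection; here is a sketch of that preprocessing step (not the paper's pipeline code).

```python
import torch

def depth_to_point_map(depth, fx, fy, cx, cy):
    """depth: (H, W) metric depth in meters; fx, fy, cx, cy: pinhole intrinsics.
    Returns an (H, W, 3) camera-frame XYZ map, the kind of metric-scale
    auxiliary input described for the RGB-D VLM."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return torch.stack([x, y, depth], dim=-1)
```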
[211] PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference
Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang
Main category: cs.CV
TL;DR: PackForcing is a unified framework for long-video generation using autoregressive video diffusion models, addressing KV-cache growth, temporal repetition, and compounding errors through hierarchical context compression.
Details
Motivation: Autoregressive video diffusion models face challenges with intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation, limiting their practical application for extended video synthesis.
Method: Introduces a three-partition KV-cache strategy: (1) Sink tokens preserve early anchor frames at full resolution, (2) Mid tokens achieve 32x spatiotemporal compression via dual-branch network with progressive 3D convolutions and low-resolution VAE re-encoding, and (3) Recent tokens kept at full resolution. Includes dynamic top-k context selection and Temporal RoPE Adjustment for position alignment.
Result: Achieves coherent 2-minute, 832x480 videos at 16 FPS on single H200 GPU with bounded 4 GB KV cache, 24x temporal extrapolation (5s to 120s), state-of-the-art VBench scores for temporal consistency (26.07) and dynamic degree (56.25).
Conclusion: PackForcing demonstrates that hierarchical context compression enables efficient long-video generation with bounded memory, proving short-video supervision is sufficient for high-quality long-video synthesis while maintaining temporal coherence.
Abstract: Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-$k$ context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing
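The three-partition cache layout can be sketched as index bookkeeping; the compression function and scoring rule below are placeholders for the paper's dual-branch module and learned selection.

```python
import torch

def select_kv_context(kv, compress_mid, n_sink, n_recent, k):
    """kv: (T, D) cached tokens, oldest first. Sink and recent tokens stay at
    full resolution; the middle of the history is heavily compressed and only
    its top-k most relevant tokens are kept, bounding the context size."""
    sink = kv[:n_sink]                         # early anchor frames
    recent = kv[-n_recent:]                    # local temporal coherence
    mid = compress_mid(kv[n_sink:-n_recent])   # (M, D) compressed history

    scores = mid @ recent.mean(dim=0)          # relevance to current context
    mid = mid[scores.topk(min(k, mid.shape[0])).indices]
    return torch.cat([sink, mid, recent], dim=0)
```

Dropping tokens leaves gaps in the position indices, which is why the method pairs this selection with a continuous RoPE re-alignment.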
[212] VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Yang Bai, Liudi Yang, Ziyuan Liu
Main category: cs.CV
TL;DR: VideoWeaver: A multimodal multi-view video-to-video translation framework that enables consistent appearance across multiple synchronized camera views for embodied AI applications.
Details
Motivation: Current video-to-video translation methods only work on single views, but embodied AI tasks use multiple synchronized cameras. Applying single-view models independently leads to inconsistent appearance across views, and transformer architectures don't scale well to multi-view settings due to quadratic attention costs.
Method: VideoWeaver starts as a single-view flow-based V2V model, then extends to multi-view by grounding all views in a shared 4D latent space from Pi3 spatial foundation model. To scale beyond fixed camera numbers, views are trained at distinct diffusion timesteps to learn both joint and conditional view distributions, enabling autoregressive synthesis of new viewpoints.
Result: Superior or similar performance to state-of-the-art on single-view benchmarks, and first demonstration of physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups for robot learning.
Conclusion: VideoWeaver enables consistent multi-view video translation essential for world randomization in robot learning, overcoming limitations of single-view methods and scaling challenges of transformer architectures.
Abstract: Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a shared 4D latent space derived from a feed-forward spatial foundation model, namely, Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, we train views at distinct diffusion timesteps, enabling the model to learn both joint and conditional view distributions. This in turn allows autoregressive synthesis of new viewpoints conditioned on existing ones. Experiments show superior or similar performance to the state-of-the-art on the single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups central to world randomization for robot learning.
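Training views at distinct diffusion timesteps amounts to sampling an independent noise level per view; a minimal sketch with an illustrative cosine schedule (the actual flow-based formulation differs):

```python
import torch

def noise_views(latents, T=1000):
    """latents: (B, V, C, F, H, W) for V synchronized views. Independent
    per-view timesteps expose the model to joint and conditional view
    distributions, enabling autoregressive synthesis of extra viewpoints."""
    B, V = latents.shape[:2]
    t = torch.randint(0, T, (B, V), device=latents.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T) ** 2
    a = alpha_bar.view(B, V, 1, 1, 1, 1)
    noise = torch.randn_like(latents)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise
    return noisy, t, noise
```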
[213] Vega: Learning to Drive with Natural Language Instructions
Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu
Main category: cs.CV
TL;DR: Vega is a Vision-Language-World-Action model for autonomous driving that follows diverse user instructions to generate personalized driving trajectories using joint multimodal attention and diffusion-based world modeling.
Details
Motivation: Existing vision-language-action models for autonomous driving lack flexibility to follow diverse user instructions for personalized driving, as they mainly use language only for scene descriptions or reasoning rather than instruction-following.
Method: Proposes Vega model with: 1) Large-scale InstructScene dataset (100k scenes with diverse driving instructions and trajectories), 2) Unified architecture using autoregressive processing for vision/language inputs and diffusion paradigm for future predictions (world modeling) and trajectory generation (action), 3) Joint attention for multimodal interactions and individual projection layers for different modalities.
Result: Extensive experiments show superior planning performance and strong instruction-following abilities, enabling more intelligent and personalized driving systems.
Conclusion: Vega paves the way for instruction-based personalized autonomous driving by effectively integrating vision, language, world modeling, and action generation in a unified framework.
Abstract: Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.
[214] DC-Reg: Globally Optimal Point Cloud Registration via Tight Bounding with Difference of Convex Programming
Wei Lian, Fei Ma, Hang Pan, Zhesen Cui, Wangmeng Zuo
Main category: cs.CV
TL;DR: DC-Reg: A globally optimal point cloud registration framework using Difference of Convex programming to tighten Branch-and-Bound search for robust registration under partial overlaps and large misalignments.
Details
Motivation: Existing point cloud registration methods struggle with global optimality under partial overlaps and large misalignments. Simultaneous transformation and correspondence estimation is robust to nonrigid deformation but suffers from local minima for heuristic methods and slow convergence for global solvers due to loose lower bounds.
Method: Proposes DC-Reg framework with holistic concave underestimator for coupled transformation-assignment objective using Difference of Convex programming. Captures joint structural interaction between transformation and correspondence variables, enabling tight lower bounds via Linear Assignment Problems evaluated at search box vertices.
Result: Validated on 2D similarity and 3D rigid registration using rotation-invariant features. Shows significantly faster convergence and superior robustness to extreme noise and outliers compared to state-of-the-art global techniques on synthetic data and 3DMatch benchmark.
Conclusion: DC-Reg provides a robust globally optimal framework for point cloud registration that addresses the fundamental challenge of achieving global optimality under partial overlaps and large misalignments through tight lower bounds enabled by holistic DC decomposition.
Abstract: Achieving globally optimal point cloud registration under partial overlaps and large misalignments remains a fundamental challenge. While simultaneous transformation ($\boldsymbol{\theta}$) and correspondence ($\mathbf{P}$) estimation has the advantage of being robust to nonrigid deformation, its non-convex coupled objective often leads to local minima for heuristic methods and prohibitive convergence times for existing global solvers due to loose lower bounds. To address this, we propose DC-Reg, a robust globally optimal framework that significantly tightens the Branch-and-Bound (BnB) search. Our core innovation is the derivation of a holistic concave underestimator for the coupled transformation-assignment objective, grounded in the Difference of Convex (DC) programming paradigm. Unlike prior works that rely on term-wise relaxations (e.g., McCormick envelopes) which neglect variable interplay, our holistic DC decomposition captures the joint structural interaction between $\boldsymbol{\theta}$ and $\mathbf{P}$. This formulation enables the computation of remarkably tight lower bounds via efficient Linear Assignment Problems (LAP) evaluated at the vertices of the search boxes. We validate our framework on 2D similarity and 3D rigid registration, utilizing rotation-invariant features for the latter to achieve high efficiency without sacrificing optimality. Experimental results on synthetic data and the 3DMatch benchmark demonstrate that DC-Reg achieves significantly faster convergence and superior robustness to extreme noise and outliers compared to state-of-the-art global techniques.
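The bounding step reduces to solving linear assignment problems at the vertices of the current search box; below is a sketch using SciPy's LAP solver, with the cost matrices standing in for the paper's DC underestimator.

```python
from scipy.optimize import linear_sum_assignment

def lap_lower_bound(vertex_costs):
    """vertex_costs: list of (N, N) cost matrices, one per vertex of a BnB
    search box, derived from a concave underestimator of the coupled
    objective. A concave function attains its minimum over a box at a
    vertex, so the smallest vertex LAP value is a valid lower bound."""
    values = []
    for C in vertex_costs:
        rows, cols = linear_sum_assignment(C)   # optimal correspondence P
        values.append(C[rows, cols].sum())
    return min(values)
```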
[215] CIAR: Interval-based Collaborative Decoding for Image Generation Acceleration
Keming Ye, Zhou Zhao, Fan Wu, Shengyu Zhang
Main category: cs.CV
TL;DR: CIAR is a cloud-device collaboration framework that accelerates auto-regressive image generation on devices by using on-device self-verification with continuous probability intervals to handle visual token uncertainty and spatial redundancy.
Details
Motivation: Auto-regressive models for image generation are computationally intensive and sequential, causing disruptive latency for on-device deployment. The challenge lies in handling the vast token vocabulary needed for high-fidelity images and the inherent spatial redundancy where homogeneous regions are predictable while object boundaries have high uncertainty.
Method: CIAR uses an on-device token uncertainty quantifier with continuous probability intervals (instead of discrete solution sets) to accelerate processing for large visual vocabularies. It incorporates an Interval-enhanced decoding module with distribution alignment training to speed up decoding while maintaining visual fidelity and semantic consistency.
Result: CIAR achieves 2.18x speed-up, reduces cloud requests by 70%, while preserving image quality compared to existing methods.
Conclusion: The cloud-device collaboration framework with on-device self-verification using continuous probability intervals effectively addresses the computational challenges of AR image generation on devices, enabling faster deployment with maintained quality.
Abstract: Auto-regressive (AR) models have recently made notable progress in image generation, achieving performance comparable to diffusion-based approaches. However, their computational intensity and sequential nature impede on-device deployment, causing disruptive latency. We address this via a cloud-device collaboration framework \textbf{CIAR}, which utilizes on-device self-verification to handle two key properties of visual synthesis: \textit{the vast token vocabulary} required for high-fidelity images and \textit{inherent spatial redundancy} which leads to extreme predictability in homogeneous regions, while object boundaries exhibit high uncertainty. Uniform verification wastes resources on such redundant tokens. Our solution centers on an on-device token uncertainty quantifier, which adopts continuous probability intervals to accelerate processing and make it feasible for large visual vocabularies instead of conventional discrete solution sets. Additionally, we incorporate an Interval-enhanced decoding module to further speed up decoding while maintaining visual fidelity and semantic consistency via a distribution alignment training strategy. Extensive experiments demonstrate that CIAR achieves a 2.18x speed-up and reduces cloud requests by 70%, while preserving image quality compared to existing methods.
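One plausible reading of the interval-based uncertainty test: measure how much probability mass a short cumulative interval of top tokens covers, and only escalate uncertain positions to the cloud. The thresholds and exact rule below are illustrative assumptions, not CIAR's published quantifier.

```python
import torch

def needs_cloud_verification(logits, mass_threshold=0.9, width=8):
    """logits: (..., V) on-device logits over the visual token vocabulary.
    In homogeneous regions a handful of tokens covers the cumulative interval
    [0, mass_threshold] (low uncertainty, so keep the device token); near
    object boundaries it does not, so the token is sent for verification."""
    probs = logits.softmax(dim=-1)
    top_mass = probs.topk(width, dim=-1).values.sum(dim=-1)
    return top_mass < mass_threshold            # True -> query the cloud
```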
[216] GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids
Mohamed Eltahir, Ahmed O. Ibrahim, Obada Siralkhatim, Tabarak Abdallah, Sondos Mohamed
Main category: cs.CV
TL;DR: GridVAD: A training-free pipeline for video anomaly detection using VLMs as anomaly proposers, with spatial grounding and temporal propagation modules.
Details
Motivation: VLMs are powerful open-set reasoners but fragile as direct anomaly detectors in video surveillance due to uncalibrated anomaly priors causing missed detections and false alarms. The problem is not the VLM itself but how it's used.
Method: Propose-ground-propagate principle: VLMs generate open-set anomaly proposals from stratified grid representations of video clips. Self-Consistency Consolidation filters hallucinations by retaining only recurring proposals. Grounding DINO anchors proposals to bounding boxes, and SAM2 propagates them as dense masks through anomaly intervals.
Result: Achieves highest Pixel-AUROC (77.59) on UCSD Ped2 among all compared methods, surpassing partially fine-tuned TAO (75.11). Outperforms other zero-shot approaches on object-level RBDC by over 5x. SCC provides controllable precision-recall tradeoff. 2.7x more call-efficient than uniform per-frame VLM querying.
Conclusion: VLMs should function as anomaly proposers rather than direct detectors, with purpose-built spatial and temporal modules for grounding and propagation. GridVAD demonstrates effective training-free video anomaly detection with fixed computational budget.
Abstract: Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M can be set according to the proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11) and outperforms other zero-shot approaches on object-level RBDC by over 5x. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks. Code and qualitative video results are available at https://gridvad.github.io.
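Self-Consistency Consolidation is essentially majority voting across independent samplings; here is a toy sketch with exact-match voting (the real SCC presumably matches proposals semantically rather than by normalized string).

```python
from collections import Counter

def consolidate_proposals(samplings, min_votes=2):
    """samplings: list of M lists of proposal strings, one per independent
    VLM sampling of the same clip. Proposals recurring across samplings are
    kept; singletons are treated as likely hallucinations."""
    votes = Counter()
    for proposals in samplings:
        for p in set(s.strip().lower() for s in proposals):
            votes[p] += 1                      # at most one vote per sampling
    return [p for p, v in votes.items() if v >= min_votes]
```

Raising `min_votes` trades object-level recall for pixel-level precision, which matches the controllable tradeoff reported in the ablations.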
[217] AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments
Xuzhi Wang, Xinran Wu, Song Wang, Lingdong Kong, Ziping Zhao
Main category: cs.CV
TL;DR: AdaSFormer: A serialized transformer framework for indoor monocular semantic scene completion that addresses challenges of complex layouts and occlusions through adaptive receptive fields, center-relative positional encoding, and convolution-modulated layer normalization.
Details
Motivation: Indoor monocular semantic scene completion is more challenging than outdoor due to complex spatial layouts and severe occlusions. While transformers can model global dependencies, they suffer from high memory costs and difficulty reconstructing fine-grained details in indoor MSSC applications.
Method: AdaSFormer introduces three key designs: 1) Adaptive Serialized Transformer with learnable shifts for dynamic receptive field adjustment, 2) Center-Relative Positional Encoding to capture spatial information richness, and 3) Convolution-Modulated Layer Normalization to bridge heterogeneous representations between convolutional and transformer features.
Result: Extensive experiments on NYUv2 and Occ-ScanNet datasets demonstrate that AdaSFormer achieves state-of-the-art performance in indoor monocular semantic scene completion.
Conclusion: AdaSFormer effectively addresses the limitations of transformers in indoor MSSC by combining adaptive serialization, improved positional encoding, and better feature integration, achieving superior performance on benchmark datasets.
Abstract: Indoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce AdaSFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that captures spatial information richness; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that AdaSFormer achieves state-of-the-art performance. The code is publicly available at: https://github.com/alanWXZ/AdaSFormer.
[218] RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models
Yufeng Yang, Xianfang Zeng, Zhangqi Jiang, Fukun Yin, Jianzhuang Liu, Wei Cheng, Jinghong Lan, Shiyu Liu, Yuqi Peng, Gang Yu, Shifeng Chen
Main category: cs.CV
TL;DR: The paper presents an open-source image restoration model trained on a large-scale dataset covering nine real-world degradation types, along with a benchmark (RealIR-Bench) for evaluating restoration quality with focus on degradation removal and consistency preservation.
Details
Motivation: Existing image restoration models suffer from poor generalization to real-world scenarios due to limited training data scale and distribution. While large-scale closed-source models show strong generalization, they require substantial data and computational costs, creating a need for open-source alternatives.
Method: Constructed a large-scale dataset covering nine common real-world degradation types, trained a state-of-the-art open-source restoration model, and introduced RealIR-Bench benchmark with 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation.
Result: The proposed model ranks first among open-source methods and achieves state-of-the-art performance in image restoration tasks, effectively narrowing the gap with closed-source alternatives.
Conclusion: The work successfully addresses the generalization limitations of existing restoration models through large-scale data collection and open-source model development, providing both a competitive restoration solution and a comprehensive benchmark for future research.
Abstract: Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are often limited by the scale and distribution of their training data, resulting in poor generalization to real-world scenarios. Recently, large-scale image editing models have shown strong generalization ability in restoration tasks, especially for closed-source models like Nano Banana Pro, which can restore images while preserving consistency. Nevertheless, achieving such performance with those large universal models requires substantial data and computational costs. To address this issue, we construct a large-scale dataset covering nine common real-world degradation types and train a state-of-the-art open-source model to narrow the gap with closed-source alternatives. Furthermore, we introduce RealIR-Bench, which contains 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation. Extensive experiments demonstrate our model ranks first among open-source methods, achieving state-of-the-art performance.
[219] Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training
Xiangyang Luo, Qingyu Li, Yuming Li, Guanbo Huang, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Shao-Lun Huang
Main category: cs.CV
TL;DR: TQD addresses the Motion-Vision Quality Dilemma in video generation by using timestep-aware sampling to train on imbalanced data, achieving better results than using perfect golden data.
Details
Motivation: Video generation models require high-quality data with both good visual quality and motion quality, but these two aspects are negatively correlated (Motion-Vision Quality Dilemma), making it hard to obtain perfect training data.
Method: Timestep-aware Quality Decoupling (TQD) modifies data sampling distribution based on timesteps: motion-rich data is sampled more at higher timesteps, while high visual quality data is sampled more at lower timesteps, matching the model’s hierarchical learning dynamics.
Result: TQD enables training on separated imbalanced data to surpass conventional training with better data, and also boosts performance when trained on high-quality data.
Conclusion: The method challenges the necessity of perfect data in video generation and shows effectiveness across different data scenarios through timestep-aware quality decoupling.
Abstract: Recent advances in video generation models have achieved impressive results. However, these models heavily rely on the use of high-quality data that combines both high visual quality and high motion quality. In this paper, we identify a key challenge in video data curation: the Motion-Vision Quality Dilemma. We discovered that visual quality and motion intensity inherently exhibit a negative correlation, making it hard to obtain golden data that excels in both aspects. To address this challenge, we first examine the hierarchical learning dynamics of video diffusion models and conduct gradient-based analysis on quality-degraded samples. We discover that quality-imbalanced data can produce gradients similar to golden data at appropriate timesteps. Based on this, we introduce the novel concept of Timestep selection in Training Process. We propose Timestep-aware Quality Decoupling (TQD), which modifies the data sampling distribution to better match the model’s learning process. For certain types of data, the sampling distribution is skewed toward higher timesteps for motion-rich data, while high visual quality data is more likely to be sampled during lower timesteps. Through extensive experiments, we demonstrate that TQD enables training exclusively on separated imbalanced data to achieve performance surpassing conventional training with better data, challenging the necessity of perfect data in video generation. Moreover, our method also boosts model performance when trained on high-quality data, showcasing its effectiveness across different data scenarios.
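The sampling rule can be sketched with skewed Beta distributions over normalized timesteps; the shapes and data tags below are illustrative, not the paper's exact schedule.

```python
import torch
from torch.distributions import Beta

def sample_timestep(data_tag, T=1000):
    """Skew motion-rich (visually weaker) clips toward high timesteps, where
    coarse structure and motion are learned, and high-visual-quality clips
    toward low timesteps, where fine appearance is learned."""
    if data_tag == "motion_rich":
        u = Beta(4.0, 1.5).sample()    # mass near 1 -> high noise levels
    elif data_tag == "high_visual":
        u = Beta(1.5, 4.0).sample()    # mass near 0 -> low noise levels
    else:
        u = torch.rand(())             # golden data: uniform, as usual
    return int(u * (T - 1))
```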
[220] BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning
Ning Ding, Keisuke Fujii, Toru Tamaki
Main category: cs.CV
TL;DR: BFMD: First badminton full-match dense dataset with multimodal annotations for tactical analysis and shot captioning
Details
Motivation: Existing badminton datasets focus on short clips with task-specific annotations, lacking full-match data with dense multimodal annotations needed for accurate shot captioning and match-level tactical analysis.
Method: Created BFMD dataset with 19 broadcast matches (20+ hours, 1,687 rallies, 16,751 hit events) with hierarchical annotations including shot types, shuttle trajectories, player pose keypoints, and shot captions. Developed VideoMAE-based multimodal captioning framework with Semantic Feedback mechanism.
Result: Multimodal modeling with semantic feedback improves shot caption quality over RGB-only baselines. Dataset enables analysis of temporal evolution of tactical patterns across full matches.
Conclusion: BFMD addresses limitations of existing datasets and enables comprehensive match-level analysis. The multimodal approach with semantic feedback enhances caption generation and tactical understanding.
Abstract: Understanding tactical dynamics in badminton requires analyzing entire matches rather than isolated clips. However, existing badminton datasets mainly focus on short clips or task-specific annotations and rarely provide full-match data with dense multimodal annotations. This limitation makes it difficult to generate accurate shot captions and perform match-level analysis. To address this limitation, we introduce the first Badminton Full Match Dense (BFMD) dataset, with 19 broadcast matches (including both singles and doubles) covering over 20 hours of play, comprising 1,687 rallies and 16,751 hit events, each annotated with a shot caption. The dataset provides hierarchical annotations including match segments, rally events, and dense rally-level multimodal annotations such as shot types, shuttle trajectories, player pose keypoints, and shot captions. We develop a VideoMAE-based multimodal captioning framework with a Semantic Feedback mechanism that leverages shot semantics to guide caption generation and improve semantic consistency. Experimental results demonstrate that multimodal modeling and semantic feedback improve shot caption quality over RGB-only baselines. We further showcase the potential of BFMD by analyzing the temporal evolution of tactical patterns across full matches.
[221] PAWS: Perception of Articulation in the Wild at Scale from Egocentric Videos
Yihao Wang, Yang Miao, Wenshuai Zhao, Wenyan Yang, Zihan Wang, Joni Pajarinen, Luc Van Gool, Danda Pani Paudel, Juho Kannala, Xi Wang, Arno Solin
Main category: cs.CV
TL;DR: PAWS extracts object articulations from hand-object interactions in egocentric videos without requiring 3D supervision or manual annotations.
Details
Motivation: Current articulation perception methods rely on supervised training with high-quality 3D data and manual annotations, limiting scalability and diversity. There's a need for methods that can learn from more accessible data sources.
Method: PAWS directly extracts object articulations from hand-object interactions in large-scale in-the-wild egocentric videos, avoiding the need for 3D supervision or manual annotations.
Result: Achieves significant improvements over baselines on HD-EPIC and Arti4D datasets, and demonstrates benefits for downstream tasks including fine-tuning 3D articulation prediction models and enabling robot manipulation.
Conclusion: PAWS provides a scalable approach to articulation perception by learning from hand-object interactions in egocentric videos, enabling applications in robotics, simulation, and animation.
Abstract: Articulation perception aims to recover the motion and structure of articulated objects (e.g., drawers and cupboards), and is fundamental to 3D scene understanding in robotics, simulation, and animation. Existing learning-based methods rely heavily on supervised training with high-quality 3D data and manual annotations, limiting scalability and diversity. To address this limitation, we propose PAWS, a method that directly extracts object articulations from hand-object interactions in large-scale in-the-wild egocentric videos. We evaluate our method on public datasets, including HD-EPIC and Arti4D, achieving significant improvements over baselines. We further demonstrate that the extracted articulations benefit downstream tasks, including fine-tuning 3D articulation prediction models and enabling robot manipulation. See the project website at https://aaltoml.github.io/PAWS/.
[222] Towards Comprehensive Real-Time Scene Understanding in Ophthalmic Surgery through Multimodal Image Fusion
Nikolo Rohrmoser, Ghazal Ghazaei, Michael Sommersperger, Nassir Navab
Main category: cs.CV
TL;DR: Multimodal fusion of operating microscope and intraoperative OCT for real-time instrument tracking in ophthalmic surgery, improving tool-tissue distance estimation accuracy.
Details
Motivation: To enhance surgical scene understanding in ophthalmic surgery by fusing complementary imaging modalities (operating microscope and intraoperative OCT) for more precise instrument tracking and tool-tissue distance estimation.
Method: Proposes a multimodal, temporal, real-time network architecture with cross-attention fusion module to merge OPMI and iOCT features, using YoloNAS and CNN encoders respectively, plus region-based recurrent module for temporal coherence.
Result: Achieved 95.79% mAP50 for instrument localization/keypoint detection, real-time processing (22.5 ms/frame), and significantly improved tool-tissue distance estimation accuracy from 284μm (OPMI only) to 33μm (multimodal) for distances below 1mm.
Conclusion: Multimodal feature fusion enhances multi-task prediction accuracy compared to single-modality processing, demonstrating potential for image-guided vitreoretinal surgery while highlighting challenges for future research.
Abstract: Purpose: The integration of multimodal imaging into operating rooms paves the way for comprehensive surgical scene understanding. In ophthalmic surgery, by now, two complementary imaging modalities are available: operating microscope (OPMI) imaging and real-time intraoperative optical coherence tomography (iOCT). This first work toward temporal OPMI and iOCT feature fusion demonstrates the potential of multimodal image processing for multi-head prediction through the example of precise instrument tracking in vitreoretinal surgery. Methods: We propose a multimodal, temporal, real-time capable network architecture to perform joint instrument detection, keypoint localization, and tool-tissue distance estimation. Our network design integrates a cross-attention fusion module to merge OPMI and iOCT image features, which are efficiently extracted via a YoloNAS and a CNN encoder, respectively. Furthermore, a region-based recurrent module leverages temporal coherence. Results: Our experiments demonstrate reliable instrument localization and keypoint detection (95.79% mAP50) and show that the incorporation of iOCT significantly improves tool-tissue distance estimation, while achieving real-time processing rates of 22.5 ms per frame. Especially for close distances to the retina (below 1 mm), the distance estimation accuracy improved from 284 $\mu m$ (OPMI only) to 33 $\mu m$ (multimodal). Conclusion: Feature fusion of multimodal imaging can enhance multi-task prediction accuracy compared to single-modality processing and real-time processing performance can be achieved through tailored network design. While our results demonstrate the potential of multi-modal processing for image-guided vitreoretinal surgery, they also underline key challenges that motivate future research toward more reliable, consistent, and comprehensive surgical scene understanding.
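A minimal sketch of a cross-attention fusion block in this spirit, with OPMI tokens querying iOCT tokens (layer names and the residual arrangement are assumptions, not the paper's exact module):

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Microscope (OPMI) tokens query the iOCT tokens, injecting
    depth-resolved cues into the spatial stream used by the detection,
    keypoint, and distance heads."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, opmi_tokens, ioct_tokens):
        fused, _ = self.attn(query=opmi_tokens,
                             key=ioct_tokens, value=ioct_tokens)
        return self.norm(opmi_tokens + fused)   # residual multimodal features
```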
[223] GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing
Xuran Hu, Zhitong Xiong, Zhongcheng Hong, Yifang Ban, Xiaoxiang Zhu, Wufan Zhao
Main category: cs.CV
TL;DR: A framework for height-aware remote sensing understanding in Large Multimodal Models, addressing the “vertical blind spot” in Earth Observation through new benchmarks and a height-aware LMM baseline.
Details
Motivation: Current Earth Observation LMMs neglect the critical vertical dimension, limiting reasoning in complex remote sensing geometries and disaster scenarios where physical spatial structures are more important than planar visual textures.
Method: 1) Developed a scalable VLM-driven data generation pipeline using systematic prompt engineering and metadata extraction; 2) Created two benchmarks: GeoHeight-Bench for relative height analysis and GeoHeight-Bench+ for holistic terrain-aware reasoning; 3) Proposed GeoHeightChat, the first height-aware remote sensing LMM baseline that synergizes visual semantics with implicitly injected height geometric features.
Result: The framework successfully bridges the “vertical blind spot” in optical models, demonstrating that combining visual semantics with height geometric features enables interactive height reasoning in existing Earth Observation models.
Conclusion: Height perception is crucial for remote sensing understanding, and the proposed approach unlocks a new paradigm of interactive height reasoning in Earth Observation LMMs, addressing a significant limitation in current models.
Abstract: Current Large Multimodal Models (LMMs) in Earth Observation typically neglect the critical “vertical” dimension, limiting their reasoning capabilities in complex remote sensing geometries and disaster scenarios where physical spatial structures often outweigh planar visual textures. To bridge this gap, we introduce a comprehensive evaluation framework dedicated to height-aware remote sensing understanding. First, to overcome the severe scarcity of annotated data, we develop a scalable, VLM-driven data generation pipeline utilizing systematic prompt engineering and metadata extraction. This pipeline constructs two complementary benchmarks: GeoHeight-Bench for relative height analysis, and a more challenging GeoHeight-Bench+ for holistic, terrain-aware reasoning. Furthermore, to validate the necessity of height perception, we propose GeoHeightChat, the first height-aware remote sensing LMM baseline. Serving as a strong proof of concept, our baseline demonstrates that synergizing visual semantics with implicitly injected height geometric features effectively mitigates the “vertical blind spot”, successfully unlocking a new paradigm of interactive height reasoning in existing optical models.
[224] UNIC: Neural Garment Deformation Field for Real-time Clothed Character Animation
Chengfeng Zhao, Junbo Qi, Yulou Liu, Zhiyang Dou, Minchen Li, Taku Komura, Ziwei Liu, Wenping Wang, Yuan Liu
Main category: cs.CV
TL;DR: UNIC: A neural deformation field method for real-time garment animation using instance-specific learning that maps 3D points to deformation offsets, avoiding complex topology handling.
Details
Motivation: Physics simulation methods for garment deformation are computationally expensive and not suitable for real-time applications. Existing learning-based methods using graph neural networks struggle with complex garment topologies.
Method: Proposes UNIC, which learns instance-specific neural deformation fields to animate garment meshes. Instead of generalizing to new garments, it focuses on new motion sequences for a specific garment. Uses neural deformation fields that map 3D points to deformation offsets, avoiding topology handling and providing smoothness constraints.
Result: Extensive experiments show UNIC is effective and efficient for various garment meshes, outperforming baseline methods and enabling real-time performance suitable for interactive applications.
Conclusion: UNIC provides a practical solution for real-time garment animation by using instance-specific neural deformation fields, making it suitable for interactive applications like video games.
Abstract: Simulating physically realistic garment deformations is an essential task for virtual immersive experiences, which is often achieved by physics simulation methods. However, these methods are typically time-consuming, computationally demanding, and require costly hardware, which is not suitable for real-time applications. Recent learning-based methods have tried to resolve this problem by training graph neural networks to learn garment deformation on vertices; however, these fail to capture the intricate deformation of garment meshes with complex topologies. In this paper, we introduce a novel neural deformation field-based method, named UNIC, to animate the garments of an avatar in real time, given the motion sequences. Our key idea is to learn the instance-specific neural deformation field to animate the garment meshes. Such an instance-specific learning scheme does not require UNIC to generalize to new garments but only to new motion sequences, which greatly reduces the difficulty in training and improves the deformation quality. Moreover, neural deformation fields map the 3D points to their deformation offsets, which not only avoids handling topologies of the complex garments but also injects a natural smoothness constraint in the deformation learning. Extensive experiments have been conducted on various kinds of garment meshes to demonstrate the effectiveness and efficiency of UNIC over baseline methods, making it potentially practical and useful in real-world interactive applications like video games.
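The central object here, a field mapping 3D points to deformation offsets, is easy to sketch. The MLP below is a hypothetical stand-in for UNIC's network: it conditions each query point on a per-frame motion code (an assumed conditioning signal) and predicts a 3D offset.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Hypothetical sketch of an instance-specific deformation field:
    (3D point, per-frame motion code) -> 3D offset."""
    def __init__(self, code_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, points, motion_code):      # points: (N, 3); motion_code: (code_dim,)
        code = motion_code.expand(points.shape[0], -1)
        return self.mlp(torch.cat([points, code], dim=-1))

field = DeformationField()
rest_pts = torch.randn(1000, 3)                  # garment vertices in rest pose
deformed = rest_pts + field(rest_pts, torch.randn(64))
```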
[225] Designing Any Imaging System from Natural Language: Agent-Constrained Composition over a Finite Primitive Basis
Chengshuai Yang
Main category: cs.CV
TL;DR: Automated pipeline (spec.md + 3 agents) converts natural language imaging system descriptions into validated forward models with bounded reconstruction error, matching expert quality across 6 modalities.
Details
Motivation: Current computational imaging system design requires weeks of specialist effort per modality, creating an expertise bottleneck that excludes the broader scientific community from prototyping imaging instruments.
Method: Introduces the spec.md structured specification format and three autonomous agents (Plan, Judge, Execute) that translate one-sentence natural-language descriptions into validated forward models. Uses a design-to-real error theorem that decomposes total reconstruction error into five independently bounded terms.
Result: On 6 real-data modalities spanning all 5 carrier families, the automated pipeline matches expert-library quality (98.1 ± 4.2%). Ten novel designs demonstrate compositional reach beyond any single-modality tool.
Conclusion: The automated pipeline democratizes computational imaging system design by eliminating expertise bottlenecks, enabling broader scientific community to prototype imaging instruments through natural language descriptions.
Abstract: Designing a computational imaging system – selecting operators, setting parameters, validating consistency – requires weeks of specialist effort per modality, creating an expertise bottleneck that excludes the broader scientific community from prototyping imaging instruments. We introduce spec.md, a structured specification format, and three autonomous agents – Plan, Judge, and Execute – that translate a one-sentence natural-language description into a validated forward model with bounded reconstruction error. A design-to-real error theorem decomposes total reconstruction error into five independently bounded terms, each linked to a corrective action. On 6 real-data modalities spanning all 5 carrier families, the automated pipeline matches expert-library quality (98.1 ± 4.2%). Ten novel designs – composing primitives into chains from 3D to 5D – demonstrate compositional reach beyond any single-modality tool.
[226] LEMMA: Laplacian pyramids for Efficient Marine SeMAntic Segmentation
Ishaan Gakhar, Laven Srivastava, Sankarshanaa Sagaram, Aditya Kasliwal, Ujjwal Verma
Main category: cs.CV
TL;DR: LEMMA is a lightweight semantic segmentation model for marine environments that uses Laplacian Pyramids for edge recognition, reducing computational costs while maintaining accuracy for applications like oil spill detection and coastal monitoring.
Details
Motivation: Existing semantic segmentation methods for marine environments are computationally expensive and resource-intensive, limiting their practicality for real-time, low-cost applications in real-world marine settings like autonomous navigation and disaster response.
Method: Proposes LEMMA, a lightweight semantic segmentation model that leverages Laplacian Pyramids to enhance edge recognition early in feature extraction, eliminating the need for computationally expensive feature map computations in deeper network layers.
Result: LEMMA reduces trainable parameters by up to 71x, GFLOPs by up to 88.5%, and inference time by up to 84.65% compared to existing models, while achieving 93.42% IoU on Oil Spill dataset and 98.97% mIoU on Mastr1325.
Conclusion: LEMMA demonstrates state-of-the-art performance for marine semantic segmentation with significantly reduced computational requirements, making it suitable for real-time, low-cost applications in marine environments.
Abstract: Semantic segmentation in marine environments is crucial for the autonomous navigation of unmanned surface vessels (USVs) and coastal Earth Observation events such as oil spills. However, existing methods, often relying on deep CNNs and transformer-based architectures, face challenges in deployment due to their high computational costs and resource-intensive nature. These limitations hinder the practicality of real-time, low-cost applications in real-world marine settings. To address this, we propose LEMMA, a lightweight semantic segmentation model designed specifically for accurate remote sensing segmentation under resource constraints. The proposed architecture leverages Laplacian Pyramids to enhance edge recognition, a critical component for effective feature extraction in complex marine environments for disaster response, environmental surveillance, and coastal monitoring. By integrating edge information early in the feature extraction process, LEMMA eliminates the need for computationally expensive feature map computations in deeper network layers, drastically reducing model size, complexity and inference time. LEMMA demonstrates state-of-the-art performance across datasets captured from diverse platforms while reducing trainable parameters and computational requirements by up to 71x, GFLOPs by up to 88.5%, and inference time by up to 84.65%, as compared to existing models. Experimental results highlight its effectiveness and real-world applicability, including 93.42% IoU on the Oil Spill dataset and 98.97% mIoU on Mastr1325.
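A Laplacian pyramid of the kind LEMMA builds on can be reproduced in a few lines. The sketch below (not the paper's code) uses average-pool downsampling and bilinear upsampling, so each level keeps the band-pass detail, i.e. the edge information, lost between resolutions.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(img: torch.Tensor, levels: int = 3):
    """Return `levels` band-pass maps plus the low-frequency residual
    (hypothetical sketch). img: (B, C, H, W) with H, W divisible by 2**levels."""
    pyramid, current = [], img
    for _ in range(levels):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear",
                           align_corners=False)
        pyramid.append(current - up)   # band-pass detail: edges at this scale
        current = down
    pyramid.append(current)            # coarse low-frequency residual
    return pyramid

levels = laplacian_pyramid(torch.randn(1, 3, 256, 256))
print([tuple(t.shape) for t in levels])
```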
[227] Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training
Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chenwei Xie, Chongyang Zhong, Kai Zhu, Tong Shen, Lianghua Huang, Yu Liu, Yujiu Yang
Main category: cs.CV
TL;DR: Wan-Weaver: A framework for interleaved text-image generation using a planner-visualizer approach trained on proxy data without real interleaved examples.
Details
Motivation: Current multimodal models accept multimodal inputs but produce only single-modality outputs, lacking ability to generate interleaved content due to training data scarcity and difficulty modeling long-range cross-modal context.
Method: Decomposes interleaved generation into textual planning and visual consistency modeling using a planner (produces dense textual descriptions for visual content) and visualizer (synthesizes images accordingly). Trains planner on large-scale textual-proxy interleaved data and visualizer on reference-guided image data.
Result: Wan-Weaver exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency, achieves robust task reasoning and generation proficiency, and outperforms existing methods without access to real interleaved data.
Conclusion: The planner-visualizer framework successfully enables interleaved text-image generation through proxy data training, demonstrating superior performance over existing methods and establishing a new benchmark for evaluation.
Abstract: Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model’s capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.
[228] TRACE: Object Motion Editing in Videos with First-Frame Trajectory Guidance
Quynh Phung, Long Mai, Cusuh Ham, Feng Liu, Jia-Bin Huang, Aniruddha Mahapatra
Main category: cs.CV
TL;DR: Trace is a framework for object motion path editing in videos that allows users to design desired object trajectories in a single anchor frame, then synthesizes temporally consistent edited videos with preserved scene content.
Details
Motivation: Prior video editing methods focus on appearance manipulation or require challenging point-track-based trajectory control, especially in videos with camera motion. There's a need for a practical, easy-to-use approach to controllable object-centric motion editing.
Method: Two-stage pipeline: 1) Cross-view motion transformation module maps first-frame path design to frame-aligned box trajectories under camera motion, 2) Motion-conditioned video re-synthesis module follows these trajectories to regenerate the object while preserving remaining video content.
Result: Experiments on diverse real-world videos show the method produces more coherent, realistic, and controllable motion edits than recent image-to-video and video-to-video methods.
Conclusion: Trace provides a practical framework for object motion path editing that enables easy trajectory design and produces high-quality, temporally consistent video edits.
Abstract: We study object motion path editing in videos, where the goal is to alter a target object’s trajectory while preserving the original scene content. Unlike prior video editing methods that primarily manipulate appearance or rely on point-track-based trajectory control, which is often challenging for users to provide during inference, especially in videos with camera motion, we offer a practical, easy-to-use approach to controllable object-centric motion editing. We present Trace, a framework that enables users to design the desired trajectory in a single anchor frame and then synthesizes a temporally consistent edited video. Our approach addresses this task with a two-stage pipeline: a cross-view motion transformation module that maps first-frame path design to frame-aligned box trajectories under camera motion, and a motion-conditioned video re-synthesis module that follows these trajectories to regenerate the object while preserving the remaining content of the input video. Experiments on diverse real-world videos show that our method produces more coherent, realistic, and controllable motion edits than recent image-to-video and video-to-video methods.
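The first stage, carrying a box drawn in the anchor frame through camera motion, amounts to warping the user's path by per-frame transforms. The numpy sketch below is illustrative only; the paper's cross-view module is learned, whereas here a known frame-to-frame homography is assumed.

```python
import numpy as np

def warp_box(box_corners: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Map (4, 2) box corners through a 3x3 homography (hypothetical sketch)."""
    pts = np.concatenate([box_corners, np.ones((4, 1))], axis=1)   # homogeneous coords
    warped = (H @ pts.T).T
    return warped[:, :2] / warped[:, 2:3]

# A box drawn in frame 0, carried to frame t by an accumulated camera homography.
box0 = np.array([[100, 100], [200, 100], [200, 180], [100, 180]], dtype=float)
H_0_to_t = np.array([[1.0, 0.0, 15.0],    # mostly translation, for illustration
                     [0.0, 1.0, -8.0],
                     [0.0, 0.0, 1.0]])
print(warp_box(box0, H_0_to_t))
```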
[229] Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs
Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah
Main category: cs.CV
TL;DR: VISAGE is a training-free decoding framework that addresses multimodal hallucinations in MDLLMs by calibrating the objective at inference time through spatial entropy analysis of cross-attention distributions.
Details
Motivation: Multimodal Diffusion Large Language Models suffer from structural vulnerability to hallucinations due to an algorithmic flaw where decoders rank tokens based only on textual likelihood without verifying visual support, creating an objective mismatch between language probability and multimodal grounding.
Method: VISAGE quantifies spatial entropy of cross-attention distributions to estimate proxy discrepancy, enforces localization consensus across attention heads, penalizes spatially uniform distributions, and re-ranks token commitments to favor visually grounded outcomes without requiring training.
Result: The framework achieves relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench, with analytical stability guarantees showing bounded objective loss under estimation error.
Conclusion: VISAGE effectively addresses multimodal hallucinations by correcting the objective mismatch in MDLLMs through inference-time calibration of attention distributions, providing a robust training-free solution for improving visual grounding.
Abstract: Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
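The signal VISAGE reads, how spatially peaked a candidate token's cross-attention is, is cheap to compute. The sketch below is one hypothetical reading of the idea, not the released code: low spatial entropy means the token is supported by a localized image region, and an entropy penalty demotes ungrounded candidates.

```python
import torch

def spatial_entropy(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """attn: (..., num_patches) cross-attention mass over image patches.
    Returns the entropy over patches (hypothetical sketch)."""
    p = attn / (attn.sum(dim=-1, keepdim=True) + eps)
    return -(p * (p + eps).log()).sum(dim=-1)

def rerank(lang_logprobs: torch.Tensor, attn_maps: torch.Tensor, lam: float = 0.5):
    """lang_logprobs: (C,) language scores; attn_maps: (C, heads, patches)."""
    penalty = spatial_entropy(attn_maps).mean(dim=-1)   # head consensus (simplified)
    return lang_logprobs - lam * penalty                # grounded candidates rise

scores = rerank(torch.randn(5), torch.rand(5, 8, 196))
print(scores.argmax())   # index of the most visually grounded candidate
```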
[230] Graph Memory: A Structured and Interpretable Framework for Modality-Agnostic Embedding-Based Inference
Artur A. Oliveira, Mateus Espadoto, Roberto M. Cesar Jr., Roberto Hirata Jr
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.14961 returned HTTP 429 (rate limited), so no details or abstract could be extracted for this entry.
[231] AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation
Chen Si, Yulin Liu, Bo Ai, Jianwen Xie, Rolandos Alexandros Potamias, Chuanxia Zheng, Hao Su
Main category: cs.CV
TL;DR: AnyHand is a large-scale synthetic RGB-D dataset for 3D hand pose estimation, containing 6.6M images with rich annotations, showing significant performance gains on benchmarks and strong generalization to out-of-domain data.
Details
Motivation: Existing real-world datasets for hand pose estimation are limited in coverage, and prior synthetic datasets lack scale, occlusions, arm details, and aligned depth data, creating a bottleneck for advancing 3D hand pose estimation.
Method: Created a large-scale synthetic dataset (AnyHand) with 2.5M single-hand and 4.1M hand-object interaction RGB-D images with rich geometric annotations. Also developed a lightweight depth fusion module that integrates with existing RGB-based models.
Result: Extending existing baselines with AnyHand yields significant gains on FreiHAND and HO-3D benchmarks. Shows strong generalization to out-of-domain HO-Cap dataset without fine-tuning. RGB-D model achieves superior performance on HO-3D benchmark.
Conclusion: AnyHand dataset effectively addresses data limitations in hand pose estimation, enabling improved performance and generalization through large-scale synthetic data and demonstrating benefits of depth integration for multimodal approaches.
Abstract: We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation from both RGB-only and RGB-D inputs. While recent works with foundation approaches have shown that an increase in the quantity and diversity of training data can markedly improve performance and robustness in hand pose estimation, existing real-world-collected datasets on this task are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images, with rich geometric annotations. In the RGB-only setting, we show that extending the original training sets of existing baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when keeping the architecture and training scheme fixed. More impressively, the model trained with AnyHand shows stronger generalization to the out-of-domain HO-Cap dataset, without any fine-tuning. We also contribute a lightweight depth fusion module that can be easily integrated into existing RGB-based models. Trained with AnyHand, the resulting RGB-D model achieves superior performance on the HO-3D benchmark, showing the benefits of depth integration and the effectiveness of our synthetic data.
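The depth fusion module is described only at a high level, so the block below is a deliberately simple guess at its shape: encode the depth map, resize to the RGB feature grid, concatenate, and project back. Module structure and dimensions are assumptions, not AnyHand's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthFusion(nn.Module):
    """Hypothetical minimal depth-fusion adapter: encode depth, align it to
    the RGB feature grid, concatenate, and project back to the RGB width."""
    def __init__(self, rgb_dim: int = 256, depth_dim: int = 64):
        super().__init__()
        self.depth_enc = nn.Sequential(
            nn.Conv2d(1, depth_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(depth_dim, depth_dim, 3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Conv2d(rgb_dim + depth_dim, rgb_dim, 1)

    def forward(self, rgb_feat, depth):   # rgb_feat: (B, C, h, w); depth: (B, 1, H, W)
        d = self.depth_enc(depth)
        d = F.interpolate(d, size=rgb_feat.shape[-2:], mode="bilinear",
                          align_corners=False)
        return self.proj(torch.cat([rgb_feat, d], dim=1))

fused = DepthFusion()(torch.randn(1, 256, 32, 32), torch.randn(1, 1, 256, 256))
```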
[232] BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation
Yan Li, Zezi Zeng, Ziwei Zhou, Xin Gao, Muzhao Tian, Yifan Yang, Mingxi Cheng, Qi Dai, Yuqing Yang, Lili Qiu, Zhendong Wang, Zhengyuan Yang, Xue Yang, Lijuan Wang, Ji Li, Chong Luo
Main category: cs.CV
TL;DR: BizGenEval is a benchmark for evaluating commercial visual content generation across 5 document types and 4 capability dimensions, revealing gaps in current image generation models for real-world design tasks.
Details
Motivation: Existing benchmarks focus on natural image synthesis but fail to evaluate models under the structured, multi-constraint requirements of real-world commercial design tasks like slides, charts, webpages, posters, and scientific figures.
Method: Created BizGenEval benchmark spanning 5 document types (slides, charts, webpages, posters, scientific figures) and evaluating 4 key capability dimensions (text rendering, layout control, attribute binding, knowledge-based reasoning). Contains 400 curated prompts and 8000 human-verified checklist questions to assess complex visual and semantic constraints.
Result: Large-scale benchmarking of 26 popular image generation systems (commercial APIs and open-source models) revealed substantial capability gaps between current generative models and professional visual content creation requirements.
Conclusion: BizGenEval serves as a standardized benchmark for real-world commercial visual content generation, highlighting the need for improved models that can handle structured, multi-constraint design tasks.
Abstract: Recent advances in image generation models have expanded their applications beyond aesthetic imagery toward practical visual content creation. However, existing benchmarks mainly focus on natural image synthesis and fail to systematically evaluate models under the structured and multi-constraint requirements of real-world commercial design tasks. In this work, we introduce BizGenEval, a systematic benchmark for commercial visual content generation. The benchmark spans five representative document types: slides, charts, webpages, posters, and scientific figures, and evaluates four key capability dimensions: text rendering, layout control, attribute binding, and knowledge-based reasoning, forming 20 diverse evaluation tasks. BizGenEval contains 400 carefully curated prompts and 8000 human-verified checklist questions to rigorously assess whether generated images satisfy complex visual and semantic constraints. We conduct large-scale benchmarking on 26 popular image generation systems, including state-of-the-art commercial APIs and leading open-source models. The results reveal substantial capability gaps between current generative models and the requirements of professional visual content creation. We hope BizGenEval serves as a standardized benchmark for real-world commercial visual content generation.
[233] SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding
Jiwook Han, Geo Ahn, Youngrae Kim, Jinwoo Choi
Main category: cs.CV
TL;DR: SlotVTG: A lightweight slot adapter framework that enhances MLLMs’ object-centric visual reasoning for Video Temporal Grounding, improving OOD generalization without full retraining.
Details
Motivation: Current MLLMs have coarse recognition capabilities insufficient for fine-grained temporal understanding in Video Temporal Grounding (VTG). Task-specific fine-tuning causes models to memorize dataset shortcuts rather than grounding in actual visual content, leading to poor Out-of-Domain generalization. Object-centric learning offers a solution but requires expensive full retraining.
Method: Proposes SlotVTG framework with lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs original sequence. Uses objectness priors from self-supervised vision model to encourage semantically coherent slot formation, steering MLLMs toward object-centric reasoning with minimal overhead.
Result: Cross-domain evaluation on standard VTG benchmarks shows significant improvement in Out-of-Domain robustness while maintaining competitive In-Domain performance with minimal computational overhead.
Conclusion: SlotVTG provides an efficient solution to enhance MLLMs’ fine-grained temporal understanding and OOD generalization for Video Temporal Grounding through object-centric reasoning, without requiring expensive full retraining pipelines.
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.
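The adapter's core operation is slot attention. The sketch below follows the standard formulation (Locatello et al., 2020) rather than SlotVTG's exact adapter; slot count, dimensions, and the simplified deterministic slot initialization are assumptions.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal slot attention; a sketch of the kind of module the adapter
    builds on, not the paper's code."""
    def __init__(self, num_slots: int = 8, dim: int = 256, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, tokens):                       # tokens: (B, N, D)
        B, _, D = tokens.shape
        tokens = self.norm_in(tokens)
        k, v = self.to_k(tokens), self.to_v(tokens)
        slots = self.slots_init.expand(B, -1, -1)    # deterministic init (simplified)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)  # slots compete per token
            attn = attn / attn.sum(dim=-1, keepdim=True)                # weighted mean over tokens
            updates = attn @ v                                          # (B, S, D)
            slots = self.gru(updates.reshape(-1, D),
                             slots.reshape(-1, D)).view(B, self.num_slots, D)
        return slots

slots = SlotAttention()(torch.randn(2, 196, 256))    # 196 visual tokens -> 8 slots
```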
[234] Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, Enrico Shippole
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2401.11605 returned HTTP 429 (rate limited), so no details or abstract could be extracted for this entry.
[235] Unleashing Guidance Without Classifiers for Human-Object Interaction Animation
Ziyin Wang, Sirui Xu, Chuan Guo, Bing Zhou, Jiangshan Gong, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui
Main category: cs.CV
TL;DR: LIGHT: A data-driven diffusion framework for human-object interaction animation that uses modality-specific noise levels and asynchronous denoising schedules to generate contact-aware guidance without manual priors.
Details
Motivation: Generating realistic human-object interaction animations is challenging due to the need to jointly model dynamic human actions and diverse object geometries. Prior approaches rely on hand-crafted contact priors or human-imposed kinematic constraints, which can be limiting.
Method: Proposes LIGHT, a diffusion-based approach that factors representations into modality-specific components with individualized noise levels and asynchronous denoising schedules. Cleaner components guide noisier ones through cross-attention, creating data-driven guidance. Training is augmented with synthetic object geometries to encourage invariance of contact semantics to geometric diversity.
Result: The pace-induced guidance effectively mirrors benefits of contact priors, achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks compared to conventional classifier-free guidance.
Conclusion: LIGHT demonstrates that data-driven guidance emerging from denoising pace can reduce dependence on manually designed priors while improving contact quality and generalization in human-object interaction animation generation.
Abstract: Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.
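The asynchronous-schedule idea can be sketched directly: each modality is noised to its own independently sampled timestep, so whichever stream happens to be cleaner can guide the noisier one inside the denoiser. The toy cosine schedule and tensor shapes below are assumptions, not the paper's training code.

```python
import torch

def noise_modalities(human: torch.Tensor, obj: torch.Tensor, T: int = 1000):
    """Noise each modality to its own independently sampled timestep
    (hypothetical sketch of asynchronous, diffusion-forcing-style training)."""
    t_h = torch.randint(0, T, (human.shape[0],))
    t_o = torch.randint(0, T, (obj.shape[0],))
    alpha = lambda t: torch.cos(0.5 * torch.pi * t / T).view(-1, 1, 1)  # toy schedule
    noisy_h = alpha(t_h) * human + (1 - alpha(t_h) ** 2).sqrt() * torch.randn_like(human)
    noisy_o = alpha(t_o) * obj + (1 - alpha(t_o) ** 2).sqrt() * torch.randn_like(obj)
    # The denoiser would see (noisy_h, t_h, noisy_o, t_o); the cleaner stream
    # guides the noisier one via cross-attention (not shown).
    return noisy_h, noisy_o, t_h, t_o

h, o, t_h, t_o = noise_modalities(torch.randn(4, 60, 66), torch.randn(4, 60, 9))
```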
[236] MindSet: Vision. A toolbox for testing DNNs on key psychological experiments
Valerio Biscione, Milton L. Montero, Marin Dujmovic, Gaurav Malhotra, Dong Yin, Guillermo Puebla, Federico Adolfi, Rachel F. Heaton, John E. Hummel, Benjamin D. Evans, Karim Habashy, Jeffrey S. Bowers
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2404.05290 returned HTTP 429 (rate limited), so no details or abstract could be extracted for this entry.
[237] How good was my shot? Quantifying Player Skill Level in Table Tennis
Akihiro Kubota, Tomoya Hasegawa, Ryo Kawahara, Ko Nishino
Main category: cs.CV
TL;DR: Learning generative models of table tennis players’ tactical racket strokes to embed them in a latent space that encodes skill levels and play styles, enabling automated skill assessment.
Details
Motivation: Skill is crucial in shaping behavior but challenging to quantify as it's latent to observed actions. The paper aims to understand skill in human behavior through dyadic sports like table tennis, where skill manifests in complex movements and subtle execution nuances conditioned on game context.
Method: Learn generative models of each player’s tactical racket strokes and jointly embed them in a common latent space that encodes individual characteristics including skill levels. Train on large-scale 3D-reconstructed professional matches, conditioning on comprehensive game context (player positioning, opponent behaviors). Then train a relative ranking network on these embeddings for skill prediction.
Result: The learned player space reflects distinct play styles and attributes representing skill. Both relative and absolute skill predictions can be achieved through a simple ranking network trained on these embeddings, demonstrating effective skill quantification.
Conclusion: The learned player space effectively quantifies skill levels, providing a foundation for automated skill assessment in complex, interactive behaviors like sports.
Abstract: Gauging an individual’s skill level is crucial, as it inherently shapes their behavior. Quantifying skill, however, is challenging because it is latent to the observed actions. To explore skill understanding in human behavior, we focus on dyadic sports – specifically table tennis – where skill manifests not just in complex movements, but in the subtle nuances of execution conditioned on game context. Our key idea is to learn a generative model of each player’s tactical racket strokes and jointly embed them in a common latent space that encodes individual characteristics, including those pertaining to skill levels. By training these player models on a large-scale dataset of 3D-reconstructed professional matches and conditioning them on comprehensive game context – including player positioning and opponent behaviors – the models capture individual tactical identities within their latent space. We probe this learned player space and find that it reflects distinct play styles and attributes that collectively represent skill. By training a simple relative ranking network on these embeddings, we demonstrate that both relative and absolute skill predictions can be achieved. These results demonstrate that the learned player space effectively quantifies skill levels, providing a foundation for automated skill assessment in complex, interactive behaviors.
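The ranking stage is a standard pairwise setup. Below is a minimal sketch under assumptions (a scalar score head over 128-d player embeddings, a margin ranking loss); the paper's exact network and loss may differ.

```python
import torch
import torch.nn as nn

# Hypothetical scalar score head over player embeddings.
score = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MarginRankingLoss(margin=0.2)

emb_stronger = torch.randn(32, 128)    # embeddings of higher-skilled players
emb_weaker = torch.randn(32, 128)      # embeddings of lower-skilled players
s1, s2 = score(emb_stronger), score(emb_weaker)

# target = 1 asks that s1 exceed s2 by at least the margin.
loss = loss_fn(s1.squeeze(-1), s2.squeeze(-1), torch.ones(32))
loss.backward()
```

At inference, the scalar score itself serves as an absolute skill estimate, while comparing two scores gives the relative prediction.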
[238] PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow
Xincheng Shuai, Song Tang, Yutong Huang, Henghui Ding, Dacheng Tao
Main category: cs.CV
TL;DR: PSDesigner is an automated graphic design system that mimics human designer workflows using specialized components and tool-calling capabilities to translate user instructions into editable design files.
Details
Motivation: Current automated design systems using text-to-image models and MLLMs oversimplify professional workflows, resulting in limited flexibility and intuitiveness. There's a need for systems that can faithfully translate user intentions into editable design files while maintaining professional quality.
Method: PSDesigner emulates human designer workflows through multiple specialized components that collect theme-related assets based on user instructions, autonomously infer and execute tool calls to manipulate design files (PSD files), and refine elements. The system is trained on the CreativePSD dataset containing high-quality PSD files annotated with operation traces across diverse design scenarios and styles.
Result: Extensive experiments show PSDesigner outperforms existing methods across diverse graphic design tasks, enabling non-specialists to create production-quality designs conveniently.
Conclusion: PSDesigner successfully addresses limitations of existing automated design systems by mimicking human creative workflows and leveraging specialized tool-use capabilities trained on comprehensive design datasets.
Abstract: Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Building upon multiple specialized components, PSDesigner collects theme-related assets based on user instructions, and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To endow the system with strong tool-use capabilities, we construct a design dataset, CreativePSD, which contains a large amount of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, enabling models to learn expert design procedures. Extensive experiments demonstrate that PSDesigner outperforms existing methods across diverse graphic design tasks, empowering non-specialists to conveniently create production-quality designs.
[239] MegaFlow: Zero-Shot Large Displacement Optical Flow
Dingxi Zhang, Fangjinhua Wang, Marc Pollefeys, Haofei Xu
Main category: cs.CV
TL;DR: MegaFlow: A zero-shot large displacement optical flow model using pre-trained Vision Transformer features for global matching, achieving SOTA performance without task-specific fine-tuning.
Details
Motivation: Existing optical flow methods struggle with large displacements and require iterative local search or domain-specific fine-tuning, limiting zero-shot generalization. There's a need for models that can handle large displacements without task-specific architectural complexity.
Method: Formulates flow estimation as global matching using pre-trained Vision Transformer features to capture large displacements naturally. Uses lightweight iterative refinements for sub-pixel accuracy. Adapts pre-trained vision priors rather than complex task-specific designs.
Result: Achieves state-of-the-art zero-shot performance across multiple optical flow benchmarks. Also delivers competitive zero-shot performance on long-range point tracking benchmarks, demonstrating robust transferability.
Conclusion: MegaFlow presents a unified paradigm for generalizable motion estimation by leveraging pre-trained vision priors for zero-shot large displacement optical flow, showing strong transferability to related tasks.
Abstract: Accurate estimation of large displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search and/or domain-specific fine-tuning, which severely limits their performance in large displacement and zero-shot generalization scenarios. To overcome this, we introduce MegaFlow, a simple yet powerful model for zero-shot large displacement optical flow. Rather than relying on highly complex, task-specific architectural designs, MegaFlow adapts powerful pre-trained vision priors to produce temporally consistent motion fields. In particular, we formulate flow estimation as a global matching problem by leveraging pre-trained global Vision Transformer features, which naturally capture large displacements. This is followed by a few lightweight iterative refinements to further improve the sub-pixel accuracy. Extensive experiments demonstrate that MegaFlow achieves state-of-the-art zero-shot performance across multiple optical flow benchmarks. Moreover, our model also delivers highly competitive zero-shot performance on long-range point tracking benchmarks, demonstrating its robust transferability and suggesting a unified paradigm for generalizable motion estimation. Our project page is at: https://kristen-z.github.io/projects/megaflow.
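Global matching over frozen ViT features reduces to an all-pairs correlation. The sketch below is illustrative, not MegaFlow's implementation: it takes the soft-argmax of each source patch's match distribution as a coarse flow estimate, before any refinement.

```python
import torch
import torch.nn.functional as F

def global_match_flow(feat0, feat1, coords, temp: float = 0.07):
    """feat0, feat1: (N, D) L2-normalized patch features of two frames;
    coords: (N, 2) patch-center coordinates. Hypothetical sketch."""
    corr = feat0 @ feat1.t()                 # (N, N) all-pairs similarity
    prob = (corr / temp).softmax(dim=-1)     # matching distribution per source patch
    matched = prob @ coords                  # soft-argmax of target locations
    return matched - coords                  # displacement field = coarse flow

N, D = 14 * 14, 384
f0 = F.normalize(torch.randn(N, D), dim=-1)
f1 = F.normalize(torch.randn(N, D), dim=-1)
ys, xs = torch.meshgrid(torch.arange(14.0), torch.arange(14.0), indexing="ij")
coords = torch.stack([xs.flatten(), ys.flatten()], dim=-1)   # (196, 2)
flow = global_match_flow(f0, f1, coords)                     # (196, 2) patch-level flow
```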
[240] RefAlign: Representation Alignment for Reference-to-Video Generation
Lei Wang, YuXin Song, Ge Wu, Haocheng Feng, Hang Zhou, Jingdong Wang, Yaxing Wang, Jian Yang
Main category: cs.CV
TL;DR: RefAlign is a representation alignment framework for reference-to-video generation that explicitly aligns diffusion Transformer features to visual foundation model semantics to reduce copy-paste artifacts and improve identity consistency.
Details
Motivation: Existing R2V methods struggle with copy-paste artifacts and multi-subject confusion due to modality mismatch between heterogeneous encoder features, even when using auxiliary semantic guidance.
Method: Proposes RefAlign with a reference alignment loss that pulls reference features and VFM features of the same subject closer while pushing apart features of different subjects, applied only during training.
Result: Outperforms current state-of-the-art methods on OpenS2V-Eval benchmark in TotalScore, achieving better balance between text controllability and reference fidelity.
Conclusion: Explicit reference alignment effectively addresses modality mismatch issues in R2V generation without inference overhead, improving identity consistency and semantic discriminability.
Abstract: Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy–paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
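The pull-same/push-different objective admits a compact InfoNCE-style sketch. The loss below is a hypothetical rendering of the idea (the paper's exact formulation may differ): same-subject DiT/VFM feature pairs are treated as positives, different subjects as negatives.

```python
import torch
import torch.nn.functional as F

def ref_align_loss(dit_feats, vfm_feats, subject_ids, tau: float = 0.1):
    """dit_feats, vfm_feats: (N, D) per-subject features from the DiT
    reference branch and the visual foundation model; subject_ids: (N,)
    integer labels. Hypothetical contrastive sketch of the pull/push idea."""
    a = F.normalize(dit_feats, dim=-1)
    b = F.normalize(vfm_feats, dim=-1)
    logits = a @ b.t() / tau                                   # cross-space similarity
    targets = (subject_ids[:, None] == subject_ids[None, :]).float()
    targets = targets / targets.sum(dim=-1, keepdim=True)      # soft positive set
    return F.cross_entropy(logits, targets)                    # pull same, push different

loss = ref_align_loss(torch.randn(8, 512), torch.randn(8, 512),
                      torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))
```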
[241] MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models
Bocheng Zou, Mu Cai, Mark Stanley, Dingfu Lu, Yong Jae Lee
Main category: cs.CV
TL;DR: MuRF is a training-free multi-resolution fusion method that enhances vision foundation models by combining features from different image scales during inference to leverage complementary inductive biases.
Details
Motivation: Current vision foundation models use single-scale inference, which overlooks the complementary benefits of different resolutions: low-resolution for global semantics and high-resolution for fine details. The authors aim to create a universal, training-free method to harness this multi-resolution synergy.
Method: MuRF processes images at multiple resolutions through a frozen vision foundation model, then fuses the resulting features to create a unified representation. It’s architecture-agnostic and training-free, working with various VFM families like DINOv2 and SigLIP2.
Result: The method demonstrates universal effectiveness across a broad spectrum of computer vision tasks and multiple VFM families, showing improved performance by leveraging complementary information from different resolutions.
Conclusion: Multi-resolution fusion is a fundamental enhancement for visual representation that can universally improve vision foundation models without requiring training, offering better performance by combining global semantic and fine-grained information.
Abstract: Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families - primarily DINOv2, but also demonstrating successful generalization to contrastive models like SigLIP2.
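Because MuRF is training-free, the whole procedure fits in a few lines. The sketch below is an interpretation under stated assumptions: a generic frozen backbone that returns dense feature maps, bilinear resizing to the native-resolution grid, and plain averaging as the fusion operator (the paper may fuse differently).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def murf_features(backbone, img, scales=(0.5, 1.0, 2.0)):
    """Run a frozen backbone at several resolutions and average the resulting
    dense feature maps on a common grid (hypothetical sketch)."""
    maps = []
    for s in scales:
        x = F.interpolate(img, scale_factor=s, mode="bilinear", align_corners=False)
        maps.append(backbone(x))                   # (B, D, h_s, w_s) dense features
    target = maps[1].shape[-2:]                    # align everything to the 1.0x grid
    maps = [F.interpolate(m, size=target, mode="bilinear", align_corners=False)
            for m in maps]
    return torch.stack(maps).mean(dim=0)           # fused multi-resolution features

# Toy stand-in for a frozen VFM that returns dense patch features:
backbone = torch.nn.Conv2d(3, 64, kernel_size=16, stride=16)
fused = murf_features(backbone, torch.randn(1, 3, 224, 224))
```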
[242] Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting
Yixing Lao, Xuyang Bai, Xiaoyang Wu, Nuoyuan Yan, Zixin Luo, Tian Fang, Jean-Daniel Nahmias, Yanghai Tsin, Shiwei Li, Hengshuang Zhao
Main category: cs.CV
TL;DR: LGTM introduces a feed-forward framework for high-resolution 3D Gaussian Splatting that predicts compact Gaussian primitives with per-primitive textures, decoupling geometric complexity from rendering resolution to enable 4K novel view synthesis without per-scene optimization.
Details
Motivation: Existing feed-forward 3D Gaussian Splatting methods suffer from quadratic growth in primitive count as resolution increases, fundamentally limiting scalability and making high-resolution synthesis (like 4K) intractable.
Method: LGTM predicts compact Gaussian primitives coupled with per-primitive textures, decoupling geometric complexity from rendering resolution. This approach uses significantly fewer Gaussian primitives while enabling high-resolution rendering.
Result: Enables high-fidelity 4K novel view synthesis without per-scene optimization, a capability previously out of reach for feed-forward methods, while using significantly fewer Gaussian primitives.
Conclusion: LGTM overcomes the resolution scaling barrier in feed-forward 3D Gaussian Splatting, making high-resolution synthesis tractable through a novel approach that separates geometric complexity from rendering resolution.
Abstract: Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, leading to a quadratic growth in primitive count as resolution increases. This fundamentally limits their scalability, making high-resolution synthesis such as 4K intractable. We introduce LGTM (Less Gaussians, Texture More), a feed-forward framework that overcomes this resolution scaling barrier. By predicting compact Gaussian primitives coupled with per-primitive textures, LGTM decouples geometric complexity from rendering resolution. This approach enables high-fidelity 4K novel view synthesis without per-scene optimization, a capability previously out of reach for feed-forward methods, all while using significantly fewer Gaussian primitives. Project page: https://yxlao.github.io/lgtm/
[243] MedShift: Implicit Conditional Transport for X-Ray Domain Adaptation
Francisco Caetano, Christiaan Viviers, Peter H.N. de With, Fons van der Sommen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2508.21435 returned HTTP 429 (rate limited), so no details or abstract could be extracted for this entry.
[244] ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, Tianfan Xue
Main category: cs.CV
TL;DR: ShotStream: A causal multi-shot video generation architecture for interactive storytelling with sub-second latency, using dual-cache memory and two-stage distillation to maintain visual coherence.
Details
Motivation: Current bidirectional video generation architectures suffer from limited interactivity and high latency, making them unsuitable for real-time interactive storytelling. There's a need for causal architectures that can generate multi-shot videos efficiently while maintaining coherence.
Method: Proposes ShotStream with a causal multi-shot architecture that reformulates video generation as next-shot prediction. Key innovations: 1) Dual-cache memory mechanism (global context cache for inter-shot consistency, local context cache for intra-shot consistency) with RoPE discontinuity indicator, 2) Two-stage distillation strategy (intra-shot self-forcing → inter-shot self-forcing) to mitigate error accumulation, 3) Fine-tuning text-to-video model into bidirectional next-shot generator then distilling into causal student via Distribution Matching Distillation.
Result: ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models while enabling real-time interactive storytelling.
Conclusion: ShotStream enables interactive storytelling with efficient on-the-fly frame generation, paving the way for real-time interactive narrative creation. The architecture successfully addresses challenges of inter-shot consistency and error accumulation in autoregressive video generation.
Abstract: Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. A RoPE discontinuity indicator is employed to explicitly distinguish the two caches and eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. This begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our project page.
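The dual-cache mechanism can be pictured as two KV buffers with different lifetimes. The class below is a schematic data structure only, with invented names and a made-up promotion rule; it shows the bookkeeping, not the model's attention cache.

```python
class DualCache:
    """Schematic dual cache for multi-shot generation (hypothetical sketch):
    the global cache persists across shots, the local cache resets per shot."""
    def __init__(self, max_global: int = 16):
        self.global_cache = []   # conditional frames kept across shots
        self.local_cache = []    # frames of the shot currently being generated
        self.max_global = max_global

    def append_frame(self, kv):
        self.local_cache.append(kv)

    def end_shot(self):
        # Promote a few frames of the finished shot to the global cache
        # (promotion rule invented for illustration), then reset local state.
        self.global_cache.extend(self.local_cache[-2:])
        self.global_cache = self.global_cache[-self.max_global:]
        self.local_cache.clear()

    def context(self):
        # Attention context for the next frame: global history plus current shot.
        return self.global_cache + self.local_cache
```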
[245] Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
Inclusion AI, Bowen Ma, Cheng Zou, ChengKun Du, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Chenyu Lian, Chengxiang Fan, Dandan Zheng, Fudong Wang, Furong Xu, Guangming Yao, Haohao Liu, Han Peng, Jun Zhou, Junluan Xia, Jingdong Chen, Jianing Li, Jianxin Sun, Jianjiang Zhu, Jianping Jiang, Jinpeng Ou, Jun Peng, Jin Peng, Kaixiang Ji, Li Tang, Libin Wang, Lixiang Ru, Longhua Tan, Lu Ma, Lan Wang, Mochen Bai, Minghong Cai, Mingxue Yang, Ning Gao, Qingpei Guo, Qinglong Zhang, Qiang Xu, Qin Zhao, Rui Liu, Ruijie Xiong, Ruobing Zheng, Sirui Gao, Shaoxiong Lin, Tao Zhang, Tianqi Li, Tinghao Liu, Tongli Wang, Taoye Huang, Weilong Chai, Xiaomei Wang, Xiaolong Wang, Xiaojian Liu, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Xuezhi Wang, Yi Yuan, Yuting Gao, Yuting Xiao, Yunxiao Sun, Yipeng Chen, Yifan Mao, Yifei Wu, Yongjie Lyu, Yingying Zhang, YuQian Li, Ziping Ma, Zhiqiang Fang, Zhihao Qiu, Ziyuan Huang, Zizheng Yang, Zhengyu He
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.24821 returned HTTP 429 (rate limited), so no details or abstract could be extracted for this entry.
[246] Generative deep learning for foundational video translation in ultrasound
Nikolina Tomic, Roshni Bhatnagar, Sarthak Jain, Connor Lau, Tien-Yu Liu, Laura Gambini, Rima Arnaout
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.03255 returned HTTP 429 (rate limited), so no details or abstract could be extracted for this entry.
[247] Embedding Compression via Spherical Coordinates
Han Xiao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.00079 returned HTTP 429 (rate limited), so no details or abstract could be extracted for this entry.
[248] Foundry: Distilling 3D Foundation Models for the Edge
Guillaume Letellier, Siddharth Srivastava, Frédéric Jurie, Gaurav Sharma
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2511.20721.
[249] CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval
Haoran Wang, Dongliang He, Wenhao Wu, Boyang Xia, Min Yang, Fu Li, Yunlong Yu, Zhong Ji, Errui Ding, Jingdong Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2208.09843.
[250] Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation
Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, Yansong Tang, Jie Zhou, Jiwen Lu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2411.15869.
[251] Weight Space Representation Learning on Diverse NeRF Architectures
Francesco Ballerini, Pierluigi Zama Ramirez, Luigi Di Stefano, Samuele Salti
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2502.09623.
[252] Elastic Weight Consolidation Done Right for Continual Learning
Xuan Liu, Xiaobin Chang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.18596.
[253] IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting
Wei Long, Haifeng Wu, Shiyin Jiang, Jinhua Zhang, Xinchun Ji, Shuhang Gu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2601.03824.
[254] 3D Dynamics-Aware Manipulation: Endowing Manipulation Policies with 3D Foresight
Yuxin He, Ruihao Zhang, Xianzu Wu, Zhiyuan Zhang, Cheng Ding, Qiang Nie
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2502.10028.
[255] Context Matters: Peer-Aware Student Behavioral Engagement Measurement via VLM Action Parsing and LLM Sequence Classification
Ahmed Abdelkawy, Ahmed Elsayed, Asem Ali, Aly Farag, Thomas Tretter, Michael McIntyre
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2601.06394.
[256] ConcreTizer: Model Inversion Attack via Occupancy Classification and Dispersion Control for 3D Point Cloud Restoration
Youngseok Kim, Sunwook Hwang, Hyung-Sin Kim, Saewoong Bahk
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2503.06986.
[257] TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts
Yu Xu, Hongbin Yan, Juan Cao, Yiji Cheng, Tiankai Hang, Runze He, Zijin Yin, Shiyi Zhang, Yuxin Zhang, Jintao Li, Chunyu Wang, Qinglin Lu, Tong-Yee Lee, Fan Tang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2601.08881.
[258] PE3R: Perception-Efficient 3D Reconstruction
Jie Hu, Shizun Wang, Xinchao Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2503.07507.
[259] ShowMak3r: Compositional TV Show Reconstruction
Sangmin Kim, Seunguk Do, Jaesik Park
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2504.19584.
[260] Structure Causal Models and LLMs Integration in Medical Visual Question Answering
Zibo Xu, Qiang Li, Weizhi Nie, Weijie Wang, Anan Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2505.02703.
[261] Monocular Normal Estimation via Shading Sequence Estimation
Zongrui Li, Xinhua Ma, Minghui Hu, Yunqing Zhao, Yingchen Yu, Qian Zheng, Chang Liu, Xudong Jiang, Song Bai
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2602.09929.
[262] CompBench: Benchmarking Complex Instruction-guided Image Editing
Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, Lei Bai, Wanli Ouyang, Lin Chen, Fei Zhao, Yao Hu, Zihan Wang, Yuan Xie, Shaohui Lin
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2505.12200.
[263] Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation
Chika Maduabuchi, Hao Chen, Yujin Han, Jindong Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2505.21545.
[264] SPR-128K: A New Benchmark for Spatial Plausibility Reasoning with Multimodal Large Language Models
Zhiyuan Hu, Zheng Sun, Yi Wei, Long Yu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2505.23265.
[265] See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
Jaehyun Park, Minyoung Ahn, Minkyu Kim, Jonghyun Lee, Jae-Gil Lee, Dongmin Park
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2602.20951.
[266] AceVFI: A Comprehensive Survey of Advances in Video Frame Interpolation
Dahyeon Kye, Changhyun Roh, Sukhun Ko, Chanho Eom, Jihyong Oh
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2506.01061.
[267] Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting
Giacomo Frisoni, Lorenzo Molfetta, Mattia Buzzoni, Gianluca Moro
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.06663.
[268] Patch2Loc: Learning to Localize Patches for Unsupervised Brain Lesion Detection
Hassan Baker, Austin J. Brockmeier
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2506.22504.
[269] HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars
Gent Serifi, Marcel C. Buehler
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2507.02803.
[270] OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval
Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Xuemeng Song, Liqiang Nie
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2507.05631.
[271] Seeking Physics in Diffusion Noise
Chujun Tang, Lei Zhong, Fangqiang Ding
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.14294.
[272] ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
Ali Hojjat, Janek Haberer, Soren Pirk, Olaf Landsiedel
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2507.10800.
[273] 360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method
Huyen T. T. Tran, Van-Quang Nguyen, Farros Alferro, Kang-Jun Liu, Takayuki Okatani
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.16179.
[274] Debugging Concept Bottleneck Models through Removal and Retraining
Eric Enouen, Sainyam Galhotra
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2509.21385.
[275] Easy3D-Labels: Supervising Semantic Occupancy Estimation with 3D Pseudo-Labels for Automotive Perception
Seamie Hayes, Ganesh Sistu, Tim Brophy, Ciaran Eising
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2509.26087.
[276] Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
Sheng Lu, Hao Chen, Rui Yin, Juyan Ba, Yu Zhang, Yuanzhe Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.19516.
[277] One Dimensional CNN ECG Mamba for Multilabel Abnormality Classification in 12 Lead ECG
Huawei Jiang, Husna Mutahira, Gan Huang, Mannan Saeed Muhammad
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2510.13046.
[278] Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Sang Hoon Seo, Eunho Yang, Edward Choi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2511.15186.
[279] PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion
Yichen Yang, Hong Li, Haodong Zhu, Linin Yang, Guojun Lei, Sheng Xu, Baochang Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2511.18801.
[280] DiP: Taming Diffusion Models in Pixel Space
Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, Ying Tai
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2511.18822.
[281] Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
Yayuan Li, Aadit Jain, Filippos Bellos, Jason J. Corso
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2511.20525.
[282] MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2511.22989.
[283] Inferring Compositional 4D Scenes without Ever Seeing One
Ahmet Berke Gokmen, Ajad Chhatkuli, Luc Van Gool, Danda Pani Paudel
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2512.05272.
[284] MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
Yuhao Su, Anwesa Choudhuri, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Meng Zheng, Yuhan Shen, Arun Innanje, Terrence Chen, Ehsan Elhamifar, Ziyan Wu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2512.06581.
[285] Unified Camera Positional Encoding for Controlled Video Generation
Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, Jianfei Cai
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2512.07237.
[286] Verifier Threshold: An Efficient Test-Time Scaling Approach for Image Generation
Vignesh Sundaresha, Akash Haridas, Vikram Appia, Lav R. Varshney
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2512.08985.
[287] MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
Sangwoon Kwak, Weeyoung Kwon, Jun Young Jeong, Geonho Kim, Won-Sik Cheong, Jihyong Oh
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2512.09270.
[288] ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning
Bangya Liu, Xinyu Gong, Zelin Zhao, Ziyang Song, Yulei Lu, Suhui Wu, Jun Zhang, Suman Banerjee, Hao Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2512.22854.
[289] Closing the Navigation Compliance Gap in End-to-end Autonomous Driving
Hanfeng Wu, Marlon Steiner, Michael Schmidt, Alvaro Marcos-Ramiro, Christoph Stiller
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2512.10660.
[290] ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement
Zhihang Liu, Xiaoyi Bao, Pandeng Li, Junjie Zhou, Zhaohe Liao, Yefei He, Kaixun Jiang, Chen-Wei Xie, Yun Zheng, Hongtao Xie
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2512.13303.
[291] XtraLight-MedMamba for Classification of Neoplastic Tubular Adenomas
Aqsa Sultana, Rayan Afsar, Ahmed Rahu, Surendra P. Singh, Brian Shula, Brandon Combs, Derrick Forchetti, Vijayan K. Asari
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2602.04819.
[292] Test-Time Modification: Inverse Domain Transformation for Robust Perception
Arpit Jadon, Joshua Niemeijer, Yuki M. Asano
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2512.13454.
[293] MoLingo: Motion-Language Alignment for Text-to-Motion Generation
Yannan He, Garvita Tiwari, Xiaohan Zhang, Pankaj Bora, Tolga Birdal, Jan Eric Lenssen, Gerard Pons-Moll
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2512.13840.
[294] Diffusion Forcing for Multi-Agent Interaction Sequence Modeling
Vongani H. Maluleke, Kie Horiuchi, Lea Wilken, Evonne Ng, Jitendra Malik, Angjoo Kanazawa
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2512.17900.
[295] JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
Haolun Zheng, Yu He, Tailun Chen, Shuo Shao, Zhixuan Chu, Hongbin Zhou, Lan Tao, Zhan Qin, Kui Ren
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.21208.
[296] Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs
Houston H. Zhang, Tao Zhang, Baoze Lin, Yuanqi Xue, Yincheng Zhu, Huan Liu, Li Gu, Linfeng Ye, Ziqiang Wang, Xinxin Zuo, Yang Wang, Yuanhao Yu, Zhixiang Chi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2512.19918.
[297] 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds
Ryousuke Yamada, Kohsuke Ide, Yoshihiro Fukuhara, Hirokatsu Kataoka, Gilles Puy, Andrei Bursuc, Yuki M. Asano
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2512.23042.
[298] NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, Zhaoxiang Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2601.00393.
[299] Unified Primitive Proxies for Structured Shape Completion
Zhaiyu Chen, Yuqing Wang, Xiao Xiang Zhu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2601.00759.
[300] Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model
Yuan Wang, Borui Liao, Huijuan Huang, Jinda Lu, Ouxiang Li, Kuien Liu, Meng Wang, Xiang Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2601.04033.
[301] Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization
Jiaqi Li, Guangming Wang, Shuntian Zheng, Minzhe Ni, Xiaoman Lu, Guanghui Ye, Yu Guan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2601.21078.
[302] SSI-DM: Singularity Skipping Inversion of Diffusion Models
Chen Min, Enze Jiang, Jishen Peng, Zheng Ma
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2602.02193.
[303] PokeFusion Attention: Enhancing Reference-Free Style-Conditioned Generation
Jingbang Tang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2602.03220.
[304] EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
Rang Meng, Yingjie Yin, Yuming Li, Chenguang Ma
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2602.13669.
[305] MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Autonomous Driving
Junli Wang, Yinan Zheng, Xueyi Liu, Zebin Xing, Pengfei Li, Guang Li, Kun Ma, Guang Chen, Hangjun Ye, Zhongpu Xia, Long Chen, Qichao Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2602.20060.
[306] RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
I-Hsiang Chen, Yu-Wei Liu, Tse-Yu Wu, Yu-Chien Chiang, Jen-Chien Yang, Wei-Ting Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2602.22013.
[307] ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals
Xuelu Li, Zhaonan Wang, Xiaogang Wang, Lei Wu, Manyi Li, Changhe Tu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2602.22666.
[308] DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis
Xinglong Luo, Ao Luo, Zhengning Wang, Yueqi Yang, Chaoyu Feng, Lei Lei, Bing Zeng, Shuaicheng Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2602.23022.
[309] Diffusion Probe: Generated Image Result Prediction Using CNN Probes
Benlei Cui, Bukun Huang, Zhizeng Ye, Xuemei Dong, Tuo Chen, Hui Xue, Dingkang Yang, Longtao Huang, Jingqun Tang, Haiwen Hong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2602.23783.
[310] GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
Xuqin Wang, Tao Wu, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang, Niclas Zeller, Daniel Cremers
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.01010.
[311] Architecture and evaluation protocol for transformer-based visual object tracking in UAV applications
Augustin Borne, Pierre Notin, Christophe Hennequin, Sebastien Changey, Stephane Bazeille, Christophe Cudel, Franz Quint
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.03904.
[312] FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation
Xingyu Wang, Tao Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.04733.
[313] CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
Zhaonian Kuang, Rui Ding, Haotian Wang, Xinhu Zheng, Meng Yang, Gang Hua
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.05042.
[314] Mario: Multimodal Graph Reasoning with Large Language Models
Yuanfu Sun, Kang Li, Pengkang Guo, Jiajin Liu, Qiaoyu Tan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.05181.
[315] Enhancing Cross-View UAV Geolocalization via LVLM-Driven Relational Modeling
Bowen Liu, Pengyue Jia, Wanyu Wang, Derong Xu, Jiawei Cheng, Jiancheng Dong, Xiao Han, Zimo Zhao, Chao Zhang, Bowen Yu, Fangyu Hong, Xiangyu Zhao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.08063.
[316] Multimodal classification of Radiation-Induced Contrast Enhancements and tumor recurrence using deep learning
Robin Peretzke, Marlin Hanstein, Maximilian Fischer, Lars Badhi Wessel, Obada Alhalabi, Sebastian Regnery, Andreas Kudak, Maximilian Deng, Tanja Eichkorn, Philipp Hoegen Saßmannshausen, Fabian Allmendinger, Jan-Hendrik Bolten, Philipp Schröter, Christine Jungk, Jürgen Peter Debus, Peter Neher, Laila König, Klaus Maier-Hein
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.11827.
[317] HIFICL: High-Fidelity In-Context Learning for Multimodal Tasks
Xiaoyu Li, Yuhang Liu, Zheng Luo, Xuanshuo Kang, Fangqi Lou, Xiaohua Wu, Zihan Xiong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.12760.
[318] MOGeo: Beyond One-to-One Cross-View Object Geo-localization
Bo Lv, Qingwang Zhang, Le Wu, Yuanyuan Li, Yingying Zhu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.13843.
[319] High-speed Imaging through Turbulence with Event-based Light Fields
Yu-Hsiang Huang, Levi Burner, Sachin Shah, Ziyuan Qu, Adithya Pediredla, Christopher A. Metzler
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.14023.
[320] WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation
Hainuo Wang, Mingjia Li, Xiaojie Guo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.15132.
[321] 3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction
Takeshi Noda, Yu-Shen Liu, Zhizhong Han
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv export API returned HTTP 429 (rate limited) for 2603.19682.
[322] Hyper-Connections for Adaptive Multi-Modal MRI Brain Tumor Segmentation
Lokendra Kumar, Shubham Aggarwal
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.19844 returned HTTP 429 (rate limited).
[323] Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
Nassim Ali Ousalah, Peyman Rostami, Vincent Gaudillière, Emmanuel Koumandakis, Anis Kacem, Enjie Ghorbel, Djamila Aouada
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.19961 returned HTTP 429 (rate limited).
[324] Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment
Roy Amoyal, Oren Freifeld, Chaim Baskin
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.21936 returned HTTP 429 (rate limited).
[325] StreamingClaw Technical Report
Jiawei Chen, Zhe Chen, Chaoqun Du, Maokui He, Wei He, Hengtao Li, Qizhen Li, Zide Liu, Hao Ma, Xuhao Pan, Chang Ren, Xudong Rao, Xintian Shen, Chenfeng Wang, Tao Wei, Chengjun Yu, Pengfei Yu, Shengyu Yao, Chunpeng Zhou, Kun Zhan, Lihao Zheng, Pan Zhou, Xuhan Zhu, Yufei Zheng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.22120 returned HTTP 429 (rate limited).
[326] Group Editing: Edit Multiple Images in One Go
Yue Ma, Xinyu Wang, Qianli Ma, Qinghe Wang, Mingzhe Zheng, Xiangpeng Yang, Hao Li, Chongbo Zhao, Jixuan Ying, Harry Yang, Hongyu Liu, Qifeng Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.22883 returned HTTP 429 (rate limited).
[327] SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
Zhicheng Qiu, Jiarui Meng, Tong-an Luo, Yican Huang, Xuan Feng, Xuanfu Li, Zhan Xu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.22893 returned HTTP 429 (rate limited).
[328] Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors
Chuanqing Zhuang, Xin Lu, Zehui Deng, Zhengda Lu, Yiqun Wang, Junqi Diao, Jun Xiao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.23324 returned HTTP 429 (rate limited).
[329] Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting
Peiyu Xu, Xin Sun, Krishna Mullia, Raymond Fei, Iliyan Georgiev, Shuang Zhao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.23637 returned HTTP 429 (rate limited).
[330] GenMask: Adapting DiT for Segmentation via Direct Mask Generation
Yuhuan Yang, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai, Jiangchao Yao, Ya Zhang, Junyang Lin, Yanfeng Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.23906 returned HTTP 429 (rate limited).
[331] VOLMO: Versatile and Open Large Models for Ophthalmology
Zhenyue Qin, Younjoon Chung, Elijah Lee, Wanyue Feng, Xuguang Ai, Serina Applebaum, Minjie Zou, Yang Liu, Pan Xiao, Mac Singer, Amisha Dave, Aidan Gilson, Tiarnan D. L. Keenan, Emily Y. Chew, Zhiyong Lu, Yih-Chung Tham, Ron Adelman, Luciano V. Del Priore, Qingyu Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.23953 returned HTTP 429 (rate limited).
[332] HGGT: Robust and Flexible 3D Hand Mesh Reconstruction from Uncalibrated Images
Yumeng Liu, Xiao-Xiao Long, Marc Habermann, Xuanze Yang, Cheng Lin, Yuan Liu, Yuexin Ma, Wenping Wang, Ligang Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.23997 returned HTTP 429 (rate limited).
[333] ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors
Haodong Yu, Yabo Zhang, Donglin Di, Ruyi Zhang, Wangmeng Zuo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.24270 returned HTTP 429 (rate limited).
[334] TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification
Guan Luo, Xiu Li, Rui Chen, Xuanyu Yi, Jing Lin, Chia-Hao Chen, Jiahang Liu, Song-Hai Zhang, Jianfeng Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.24278 returned HTTP 429 (rate limited).
[335] RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation
Kai Zhu, Zhenyu Cui, Zehua Zang, Jiahuan Zhou
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.24295 returned HTTP 429 (rate limited).
[336] Complex-Valued Holographic Radiance Fields
Yicheng Zhan, Dong-Ha Shin, Seung-Hwan Baek, Kaan Akşit
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.08350 returned HTTP 429 (rate limited).
[337] Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
Xianchao Zeng, Xinyu Zhou, Youcheng Li, Jiayou Shi, Tianle Li, Liangming Chen, Lei Ren, Yong-Lu Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.02787 returned HTTP 429 (rate limited).
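Note: the missing summaries above all trace back to the same cause: the digest pipeline hit arXiv's rate limit (HTTP 429) on export.arxiv.org. A minimal retry sketch like the following, which backs off exponentially and respects arXiv's request pacing (the API asks clients to wait roughly three seconds between calls), would let future runs recover these entries. The function name and delay constants are illustrative, not part of any existing tool.

```python
import time
import urllib.error
import urllib.request

# Query pattern mirrors the failing requests above (id_list lookup).
ARXIV_API = "https://export.arxiv.org/api/query?id_list={arxiv_id}&max_results=1"

def fetch_arxiv_feed(arxiv_id: str, max_retries: int = 5, base_delay: float = 3.0) -> str:
    """Fetch the Atom feed for one paper, backing off exponentially on HTTP 429."""
    url = ARXIV_API.format(arxiv_id=arxiv_id)
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read().decode("utf-8")
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise  # only rate limiting is worth retrying
            time.sleep(base_delay * (2 ** attempt))  # wait 3s, 6s, 12s, ...
    raise RuntimeError(f"gave up on {arxiv_id} after {max_retries} rate-limited attempts")
```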
cs.AI
[338] ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence
ARC Prize Foundation
Main category: cs.AI
TL;DR: ARC-AGI-3 is a benchmark for evaluating agentic intelligence through abstract turn-based environments that require exploration, goal inference, internal model building, and planning without explicit instructions or language.
Details
Motivation: To create a benchmark that evaluates fluid adaptive efficiency on novel tasks while avoiding language and external knowledge dependencies, focusing purely on core reasoning and planning abilities.
Method: Design of abstract, turn-based environments leveraging only Core Knowledge priors, calibrated via extensive human testing, with an efficiency-based scoring framework grounded in human action baselines.
Result: Humans can solve 100% of environments, while frontier AI systems as of March 2026 score below 1%, demonstrating a significant gap in agentic intelligence capabilities.
Conclusion: ARC-AGI-3 provides a challenging benchmark for evaluating agentic intelligence that reveals substantial limitations in current AI systems’ adaptive reasoning and planning abilities.
Abstract: We introduce ARC-AGI-3, an interactive benchmark for studying agentic intelligence through novel, abstract, turn-based environments in which agents must explore, infer goals, build internal models of environment dynamics, and plan effective action sequences without explicit instructions. Like its predecessors ARC-AGI-1 and 2, ARC-AGI-3 focuses entirely on evaluating fluid adaptive efficiency on novel tasks, while avoiding language and external knowledge. ARC-AGI-3 environments only leverage Core Knowledge priors and are difficulty-calibrated via extensive testing with human test-takers. Our testing shows humans can solve 100% of the environments, in contrast to frontier AI systems which, as of March 2026, score below 1%. In this paper, we present the benchmark design, its efficiency-based scoring framework grounded in human action baselines, and the methodology used to construct, validate, and calibrate the environments.
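The abstract specifies an efficiency-based scoring framework grounded in human action baselines but not its formula. A plausible shape, shown purely as an illustration, grants full credit at or below the human action count and decays from there; the function and its arguments are hypothetical.

```python
def efficiency_score(solved: bool, agent_actions: int, human_baseline_actions: float) -> float:
    """Hypothetical scoring shape (the actual ARC-AGI-3 formula is defined in
    the paper, not here): zero for unsolved environments, full credit at the
    human action baseline, decaying as the agent spends extra actions."""
    if not solved or agent_actions <= 0:
        return 0.0
    return min(1.0, human_baseline_actions / agent_actions)

print(efficiency_score(True, agent_actions=40, human_baseline_actions=20))  # 0.5
```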
[339] When Is Collective Intelligence a Lottery? Multi-Agent Scaling Laws for Memetic Drift in LLMs
Hidenori Tanaka
Main category: cs.AI
TL;DR: Multi-agent LLM systems reach consensus through “memetic drift” where agents learn from each other’s sampled outputs, creating self-reinforcing agreement even without initial biases.
Details
Motivation: To understand how multi-agent LLM systems reach consensus decisions, and whether outcomes reflect collective reasoning, systematic bias, or random chance, particularly in consequential decision-making contexts.
Method: Introduces Quantized Simplex Gossip (QSG) minimal model where agents maintain internal belief states but learn from each other’s sampled outputs. Analyzes “memetic drift” (sampling-driven regime) and derives scaling laws for drift-induced polarization based on population size, communication bandwidth, adaptation rate, and agent uncertainty.
Result: QSG predicts crossover from drift-dominated regime (consensus as lottery) to selection regime (weak biases amplified). Validates scaling laws in both QSG simulations and naming-game experiments with LLM populations.
Conclusion: Provides framework for studying collective mechanisms of social representation formation in multi-agent systems, revealing how mutual in-context learning drives consensus through memetic drift.
Abstract: Multi-agent systems powered by large language models (LLMs) are increasingly deployed in settings that shape consequential decisions, both directly and indirectly. Yet it remains unclear whether their outcomes reflect collective reasoning, systematic bias, or mere chance. Recent work has sharpened this question with naming games, showing that even when no individual agent favors any label a priori, populations rapidly break symmetry and reach consensus. Here, we reveal the mechanism by introducing a minimal model, Quantized Simplex Gossip (QSG), and trace the microscopic origin of this agreement to mutual in-context learning. In QSG, agents maintain internal belief states but learn from one another’s sampled outputs, so one agent’s arbitrary choice becomes the next agent’s evidence and can compound toward agreement. By analogy with neutral evolution, we call this sampling-driven regime memetic drift. QSG predicts a crossover from a drift-dominated regime, where consensus is effectively a lottery, to a selection regime, where weak biases are amplified and shape the outcome. We derive scaling laws for drift-induced polarization as a function of population size, communication bandwidth, in-context adaptation rate, and agents’ internal uncertainty, and we validate them in both QSG simulations and naming-game experiments with LLM populations. Together, these results provide a framework for studying the collective mechanisms of social representation formation in multi-agent systems.
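The drift mechanism is concrete enough to simulate. The sketch below is not the paper's QSG implementation; it is a minimal naming-game loop in the same spirit, in which every agent starts with a uniform belief and listeners move toward whatever label they hear, so one agent's arbitrary sample compounds into population-wide agreement.

```python
import random

def simulate_drift(n_agents=50, n_labels=3, eta=0.1, rounds=20000, seed=0):
    """Minimal sampling-driven drift: unbiased agents converge on an arbitrary
    label because each sampled utterance becomes the listener's evidence."""
    rng = random.Random(seed)
    # every agent starts at the uniform belief -- no one favors any label a priori
    beliefs = [[1.0 / n_labels] * n_labels for _ in range(n_agents)]
    for _ in range(rounds):
        speaker, listener = rng.sample(range(n_agents), 2)
        label = rng.choices(range(n_labels), weights=beliefs[speaker])[0]
        b = beliefs[listener]
        for k in range(n_labels):  # move listener's belief toward the heard label
            target = 1.0 if k == label else 0.0
            b[k] += eta * (target - b[k])
    mean = [sum(b[k] for b in beliefs) / n_agents for k in range(n_labels)]
    return mean  # typically concentrated on one label despite the symmetric start

print(simulate_drift())
```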
[340] AutoSAM: an Agentic Framework for Automating Input File Generation for the SAM Code with Multi-Modal Retrieval-Augmented Generation
Zaid Abulawi, Zavier Ndum Ndum, Eric Cervi, Rui Hu, Yang Liu
Main category: cs.AI
TL;DR: AutoSAM: An agentic framework using LLMs and multimodal retrieval to automate generation of thermal-hydraulics simulation input files from unstructured engineering documents
Details
Motivation: Manual construction of input files for system-level thermal-hydraulics codes like SAM is labor-intensive, requiring analysts to extract data from heterogeneous documents and manually translate it into solver-specific syntax.
Method: Combines LLM agent with retrieval-augmented generation over solver documentation and specialized tools for analyzing PDFs, images, spreadsheets, and text files; multimodal pipeline integrates scientific text extraction, vision-based figure interpretation, semantic embedding, and query answering
Result: Achieved 100% utilization of structured inputs, ~88% extraction from PDF text, and 100% completeness in vision-based geometric extraction; produced runnable SAM models across four case studies of increasing complexity
Conclusion: Demonstrates practical path toward prompt-driven reactor modeling where analysts provide system descriptions and documentation while agent translates them into transparent, executable simulations
Abstract: In the design and safety analysis of advanced reactor systems, constructing input files for system-level thermal-hydraulics codes such as the System Analysis Module (SAM) remains a labor-intensive task. Analysts must extract and reconcile design data from heterogeneous engineering documents and manually translate it into solver-specific syntax. In this paper, we present AutoSAM, an agentic framework that automates SAM input file generation. The framework combines a large language model agent with retrieval-augmented generation over the solver’s user guide and theory manual, together with specialized tools for analyzing PDFs, images, spreadsheets, and text files. AutoSAM ingests unstructured engineering documents, including system diagrams, design reports, and data tables, extracts simulation-relevant parameters into a human-auditable intermediate representation, and synthesizes validated, solver-compatible input decks. Its multimodal retrieval pipeline integrates scientific text extraction, vision-based figure interpretation, semantic embedding, and query answering. We evaluate AutoSAM on four case studies of increasing complexity: a single-pipe steady-state model, a solid-fuel channel with temperature reactivity feedback, the Advanced Burner Test Reactor core, and the Molten Salt Reactor Experiment primary loop. Across all cases, the agent produces runnable SAM models consistent with expected thermal-hydraulic behavior while explicitly identifying missing data and labeling assumed values. The framework achieves 100% utilization of structured inputs, about 88% extraction from PDF text, and 100% completeness in vision-based geometric extraction. These results demonstrate a practical path toward prompt-driven reactor modeling, in which analysts provide system descriptions and supporting documentation while the agent translates them into transparent, and executable, SAM simulations.
[341] Trust as Monitoring: Evolutionary Dynamics of User Trust and AI Developer Behaviour
Adeela Bashir, Zhao Song, Ndidi Bianca Ogbo, Nataliya Balabanova, Martin Smit, Chin-wing Leung, Paolo Bova, Manuel Chica Serrano, Dhanushka Dissanayake, Manh Hong Duong, Elias Fernandez Domingos, Nikita Huber-Kralj, Marcus Krellner, Andrew Powell, Stefan Sarkadi, Fernando P. Santos, Zia Ush Shamszaman, Chaimaa Tarzi, Paolo Turrini, Grace Ibukunoluwa Ufeoshi, Victor A. Vargas-Perez, Alessandro Di Stefano, Simon T. Powers, The Anh Han
Main category: cs.AI
TL;DR: Evolutionary game theory model shows AI safety requires user monitoring, developer penalties, and transparency to avoid unsafe adoption or low adoption outcomes
Details
Motivation: Existing AI governance models treat user trust as one-shot adoption rather than dynamic process; need to understand how trust evolves through repeated interactions between users and developers.
Method: Evolutionary game theory modeling trust as reduced monitoring in repeated asymmetric interactions; complement with stochastic finite-population dynamics and reinforcement learning (Q-learning) simulations
Result: Three robust long-run regimes emerge: no adoption with unsafe development, unsafe but widely adopted systems, and safe systems widely adopted; safe adoption requires penalties exceeding safety costs and affordable monitoring
Conclusion: Formally supports governance emphasizing transparency, low-cost monitoring, and meaningful sanctions; neither regulation alone nor blind trust prevents evolutionary drift toward unsafe outcomes
Abstract: AI safety is an increasingly urgent concern as the capabilities and adoption of AI systems grow. Existing evolutionary models of AI governance have primarily examined incentives for safe development and effective regulation, typically representing users’ trust as a one-shot adoption choice rather than as a dynamic, evolving process shaped by repeated interactions. We instead model trust as reduced monitoring in a repeated, asymmetric interaction between users and AI developers, where checking AI behaviour is costly. Using evolutionary game theory, we study how user trust strategies and developer choices between safe (compliant) and unsafe (non-compliant) AI co-evolve under different levels of monitoring cost and institutional regimes. We complement the infinite-population replicator analysis with stochastic finite-population dynamics and reinforcement learning (Q-learning) simulations. Across these approaches, we find three robust long-run regimes: no adoption with unsafe development, unsafe but widely adopted systems, and safe systems that are widely adopted. Only the last is desirable, and it arises when penalties for unsafe behaviour exceed the extra cost of safety and users can still afford to monitor at least occasionally. Our results formally support governance proposals that emphasise transparency, low-cost monitoring, and meaningful sanctions, and they show that neither regulation alone nor blind user trust is sufficient to prevent evolutionary drift towards unsafe or low-adoption outcomes.
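For readers who want to play with the replicator intuition, here is a toy two-population update; the payoff constants (monitoring cost, harm from unsafe systems, penalty) are illustrative stand-ins, not the paper's calibrated matrix.

```python
# Toy two-population replicator dynamics for the user/developer game.
c_mon, s_cost, penalty, benefit = 0.2, 0.3, 1.0, 1.0  # illustrative values only

def step(x: float, y: float, lr: float = 0.1):
    """x = share of users who monitor; y = share of developers building safe AI."""
    u_mon = benefit * y - c_mon          # monitoring costs c_mon but blocks harm
    u_trust = benefit * y - (1 - y)      # trusting users absorb harm from unsafe AI
    d_safe = benefit - s_cost            # safe development carries an extra cost
    d_unsafe = benefit - x * penalty     # unsafe pays the penalty when monitored
    ubar = x * u_mon + (1 - x) * u_trust
    dbar = y * d_safe + (1 - y) * d_unsafe
    x += lr * x * (u_mon - ubar)         # replicator update, user population
    y += lr * y * (d_safe - dbar)        # replicator update, developer population
    return min(max(x, 0.0), 1.0), min(max(y, 0.0), 1.0)

x, y = 0.5, 0.5
for _ in range(500):
    x, y = step(x, y)
print(f"monitoring users: {x:.2f}, safe developers: {y:.2f}")
```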
[342] Back to Basics: Revisiting ASR in the Age of Voice Agents
Geeyang Tay, Wentao Ma, Jaewon Lee, Yuzhi Tang, Daniel Lee, Weisu Yin, Dongming Shen, Silin Meng, Yi Zhu, Mu Li, Alex Smola
Main category: cs.AI
TL;DR: WildASR is a multilingual diagnostic benchmark for evaluating ASR system robustness across environmental degradation, demographic shift, and linguistic diversity using real human speech data.
Details
Motivation: Current ASR systems achieve high accuracy on curated benchmarks but fail in real-world conditions that existing evaluations don't systematically cover. Without diagnostic tools that isolate specific failure factors, practitioners can't anticipate which conditions in which languages will cause performance degradation.
Method: Introduced WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. Evaluated seven widely used ASR systems and developed three analytical tools for practitioners.
Result: Found severe and uneven performance degradation across ASR systems. Model robustness does not transfer across languages or conditions. Models often hallucinate plausible but unspoken content under partial or degraded inputs, creating concrete safety risks for downstream agent behavior.
Conclusion: Targeted, factor-isolated evaluation is essential for understanding and improving ASR reliability in production systems. The benchmark and analytical tools can guide deployment decisions.
Abstract: Automatic speech recognition (ASR) systems have achieved near-human accuracy on curated benchmarks, yet still fail in real-world voice agents under conditions that current evaluations do not systematically cover. Without diagnostic tools that isolate specific failure factors, practitioners cannot anticipate which conditions, in which languages, will cause what degree of degradation. We introduce WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. Evaluating seven widely used ASR systems, we find severe and uneven performance degradation, and model robustness does not transfer across languages or conditions. Critically, models often hallucinate plausible but unspoken content under partial or degraded inputs, creating concrete safety risks for downstream agent behavior. Our results demonstrate that targeted, factor-isolated evaluation is essential for understanding and improving ASR reliability in production systems. Besides the benchmark itself, we also present three analytical tools that practitioners can use to guide deployment decisions.
[343] Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach
Andreas Schlapbach
Main category: cs.AI
TL;DR: Formal verification study comparing Schema-Guided Dialogue (SGD) and Model Context Protocol (MCP) for LLM agent tool integration, showing SGD is more expressive and proposing MCP+ extensions for equivalence.
Details
Motivation: There's an urgent need for formal verification of agent protocols as LLM agents increasingly invoke external tools. While SGD (research framework) and MCP (industry standard) both enable dynamic service discovery, their formal relationship and expressivity differences remain unexplored, creating safety concerns.
Method: Developed first process calculus formalization of both SGD and MCP, proved structural bisimilarity under mapping Φ, analyzed reverse mapping Φ⁻¹ to identify expressivity gaps, identified five principles for full behavioral equivalence, formalized these as type-system extensions MCP+, and proved MCP+ is isomorphic to SGD.
Result: Found that while SGD and MCP are structurally bisimilar under Φ, the reverse mapping is partial and lossy, revealing critical gaps in MCP’s expressivity. Identified five necessary and sufficient principles for equivalence: semantic completeness, explicit action boundaries, failure mode documentation, progressive disclosure compatibility, and inter-tool relationship declaration.
Conclusion: Provides first formal foundation for verified agent systems, establishes schema quality as provable safety property, and shows that MCP+ (extended MCP) achieves full behavioral equivalence with SGD through five principled extensions.
Abstract: The emergence of large language model agents capable of invoking external tools has created urgent need for formal verification of agent protocols. Two paradigms dominate this space: Schema-Guided Dialogue (SGD), a research framework for zero-shot API generalization, and the Model Context Protocol (MCP), an industry standard for agent-tool integration. While both enable dynamic service discovery through schema descriptions, their formal relationship remains unexplored. Building on prior work establishing the conceptual convergence of these paradigms, we present the first process calculus formalization of SGD and MCP, proving they are structurally bisimilar under a well-defined mapping Phi. However, we demonstrate that the reverse mapping Phi^{-1} is partial and lossy, revealing critical gaps in MCP’s expressivity. Through bidirectional analysis, we identify five principles – semantic completeness, explicit action boundaries, failure mode documentation, progressive disclosure compatibility, and inter-tool relationship declaration – as necessary and sufficient conditions for full behavioral equivalence. We formalize these principles as type-system extensions MCP+, proving MCP+ is isomorphic to SGD. Our work provides the first formal foundation for verified agent systems and establishes schema quality as a provable safety property.
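To make the five MCP+ principles concrete, here is a hypothetical tool schema with one extra field per principle; the field names and the example tool are invented for illustration and are not part of the MCP specification or the paper's formalism.

```python
# Hypothetical MCP+ tool schema sketch: the paper's five principles rendered
# as extra schema fields. Field names are illustrative, not real MCP fields.
tool_schema_plus = {
    "name": "transfer_funds",
    "description": "Move money between two accounts held by the caller.",
    # 1. semantic completeness: pre/postconditions stated, not implied
    "preconditions": ["source account exists", "balance >= amount"],
    "postconditions": ["source debited by amount", "target credited by amount"],
    # 2. explicit action boundaries: is this call state-changing?
    "side_effects": "irreversible-write",
    # 3. failure mode documentation
    "failure_modes": {
        "INSUFFICIENT_FUNDS": "no state change",
        "TIMEOUT": "state unknown, caller must reconcile",
    },
    # 4. progressive disclosure compatibility: a cheap summary for discovery
    "summary": "transfer money (irreversible)",
    # 5. inter-tool relationship declaration
    "related_tools": {"must_precede": ["confirm_transfer"], "excludes": ["cancel_transfer"]},
}
```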
[344] Supervising Ralph Wiggum: Exploring a Metacognitive Co-Regulation Agentic AI Loop for Engineering Design
Zeda Xu, Nikolas Martelaro, Christopher McComb
Main category: cs.AI
TL;DR: Novel agentic AI systems with metacognitive regulation loops (Self-Regulation Loop and Co-Regulation Design Agentic Loop) to mitigate design fixation in LLM-based engineering design agents, tested on battery pack design problems.
Details
Motivation: LLM-based design agents suffer from design fixation similar to human designers - they get stuck on existing paradigms and fail to explore alternatives, leading to suboptimal solutions. The paper aims to address this limitation through metacognitive regulation mechanisms.
Method: Proposes two novel systems: (1) Self-Regulation Loop (SRL) where the Design Agent monitors its own metacognition, and (2) Co-Regulation Design Agentic Loop (CRDAL) where a Metacognitive Co-Regulation Agent assists the Design Agent in metacognition. Tested on battery pack design problem comparing against baseline Ralph Wiggum Loop (RWL).
Result: CRDAL generated designs with better performance without significant computational cost increase compared to RWL and SRL. CRDAL navigated design space more effectively. SRL explored different design regions but didn’t produce significantly better performance than RWL.
Conclusion: Metacognitive co-regulation (CRDAL) effectively mitigates design fixation in LLM-based design agents, improving performance and design space exploration. The architectures provide practical implications for future agentic AI systems in engineering design.
Abstract: The engineering design research community has studied agentic AI systems that use Large Language Model (LLM) agents to automate the engineering design process. However, these systems are prone to some of the same pathologies that plague humans. Just as human designers, LLM design agents can fixate on existing paradigms and fail to explore alternatives when solving design challenges, potentially leading to suboptimal solutions. In this work, we propose (1) a novel Self-Regulation Loop (SRL), in which the Design Agent self-regulates and explicitly monitors its own metacognition, and (2) a novel Co-Regulation Design Agentic Loop (CRDAL), in which a Metacognitive Co-Regulation Agent assists the Design Agent in metacognition to mitigate design fixation, thereby improving system performance for engineering design tasks. In the battery pack design problem examined here, we found that the novel CRDAL system generates designs with better performance, without significantly increasing the computational cost, compared to a plain Ralph Wiggum Loop (RWL) and the metacognitively self-assessing Self-Regulation Loop (SRL). Also, we found that the CRDAL system navigated through the latent design space more effectively than both SRL and RWL. However, the SRL did not generate designs with significantly better performance than RWL, even though it explored a different region of the design space. The proposed system architectures and findings of this work provide practical implications for future development of agentic AI systems for engineering design.
[345] ReLope: KL-Regularized LoRA Probes for Multimodal LLM Routing
Yaopei Zeng, Congchao Wang, Blake JianHang Chen, Lu Lin
Main category: cs.AI
TL;DR: Improved probe routing for multimodal LLMs using attention-based aggregation and KL-regularized LoRA adapters to address degraded correctness signal separability in visual inputs.
Details
Motivation: Probe routing works well in text-only LLMs but degrades substantially in multimodal LLMs due to visual inputs weakening correctness signal separability in hidden states.
Method: Two approaches: 1) Attention Probe aggregates hidden states from the preceding layer using attention scores to recover distributed correctness signals; 2) KL-Regularized LoRA Probe (ReLope) inserts a lightweight LoRA adapter with a KL regularizer to learn routing-aware representations.
Result: Methods consistently outperform baselines, demonstrating that improving hidden state quality is key to effective routing in MLLMs.
Conclusion: Proposed approaches successfully address the challenge of probe routing degradation in multimodal LLMs by enhancing correctness signal extraction from hidden states.
Abstract: Routing has emerged as a promising strategy for balancing performance and cost in large language model (LLM) systems that combine lightweight models with powerful but expensive large models. Recent studies show that probe routing, which predicts the correctness of a small model using its hidden states, provides an effective solution in text-only LLMs. However, we observe that these probes degrade substantially when applied to multimodal LLMs (MLLMs). Through empirical analysis, we find that the presence of visual inputs weakens the separability of correctness signals in hidden states, making them harder to extract using standard probe designs. To address this challenge, we introduce two complementary approaches for improving probe routing in MLLMs. First, we propose the Attention Probe, which aggregates hidden states from the preceding layer based on attention scores to recover distributed correctness signals. Second, we present the KL-Regularized LoRA Probe (ReLope), which inserts a lightweight LoRA adapter and applies a KL regularizer to learn routing-aware representations. Comprehensive experiments show that our methods consistently outperform baselines, suggesting that improving the quality of hidden states is key to effective routing in MLLMs. Our code is available at https://github.com/Spinozaaa/ReLope.
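As a rough sketch of the Attention Probe idea (attention-weighted pooling followed by a linear correctness head), the class below shows the shape of the computation; the class name, tensor shapes, and training details are assumptions, not the released code.

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Sketch: pool per-token hidden states with the model's own attention
    scores, then predict whether the small model's answer will be correct."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor, attn_scores: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_dim] from the preceding layer
        # attn_scores:   [batch, seq_len], e.g. last-token attention summed over heads
        weights = attn_scores.softmax(dim=-1).unsqueeze(-1)  # [batch, seq_len, 1]
        pooled = (weights * hidden_states).sum(dim=1)        # [batch, hidden_dim]
        return self.classifier(pooled).squeeze(-1)           # correctness logit

probe = AttentionProbe(hidden_dim=4096)
h = torch.randn(2, 128, 4096)
a = torch.randn(2, 128)
print(probe(h, a).shape)  # torch.Size([2])
```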
[346] Resisting Humanization: Ethical Front-End Design Choices in AI for Sensitive Contexts
Silvia Rossi, Diletta Huyskes, Mackenzie Jorgensen
Main category: cs.AI
TL;DR: Paper examines ethical implications of humanizing front-end design in conversational AI interfaces, arguing it shapes user trust and autonomy, with case study from trauma-informed nonprofit.
Details
Motivation: Addresses gap in AI ethics literature focusing on back-end issues while neglecting front-end design choices, particularly humanizing elements in conversational interfaces that can misalign user expectations and trust.
Method: Draws on HCI, conversational AI, and value-sensitive design research, analyzing how interface design choices affect user mental models. Uses case study of Chayn nonprofit’s trauma-informed AI design principles for gender-based violence survivors.
Result: Shows how humanizing design elements can foster misplaced trust and undermine autonomy, especially in vulnerable contexts. Demonstrates how ethical considerations can motivate restraint in interface design, challenging engagement-focused industry norms.
Conclusion: Ethical front-end AI design is procedural ethics enacted through interaction choices, not just system logic. Humanization in interfaces is value-driven choice requiring careful consideration of user vulnerability and trust calibration.
Abstract: Ethical debates in AI have primarily focused on back-end issues such as data governance, model training, and algorithmic decision-making. Less attention has been paid to the ethical significance of front-end design choices, such as the interaction and representation-based elements through which users interact with AI systems. This gap is particularly significant for Conversational User Interfaces (CUI) based on Natural Language Processing (NLP) systems, where humanizing design elements such as dialogue-based interaction, emotive language, personality modes, and anthropomorphic metaphors are increasingly prevalent. This work argues that humanization in AI front-end design is a value-driven choice that profoundly shapes users’ mental models, trust calibration, and behavioral responses. Drawing on research in human-computer interaction (HCI), conversational AI, and value-sensitive design, we examine how interfaces can play a central role in misaligning user expectations, fostering misplaced trust, and subtly undermining user autonomy, especially in vulnerable contexts. To ground this analysis, we discuss two AI systems developed by Chayn, a nonprofit organization supporting survivors of gender-based violence. Chayn is extremely cautious when building AI that interacts with or impacts survivors by operationalizing their trauma-informed design principles. This Chayn case study illustrates how ethical considerations can motivate principled restraint in interface design, challenging engagement-based norms in contemporary AI products. We argue that ethical front-end AI design is a form of procedural ethics, enacted through interaction choices rather than embedded solely in system logic.
[347] SentinelAI: A Multi-Agent Framework for Structuring and Linking NG9-1-1 Emergency Incident Data
Kliment Ho, Ilya Zaslavsky
Main category: cs.AI
TL;DR: SentinelAI is a framework for integrating and standardizing emergency response data into machine-readable formats compliant with Next Generation 9-1-1 standards using specialized agents.
Details
Motivation: Emergency response systems generate data from multiple agencies and systems, but correlating and updating this information across sources in alignment with Next Generation 9-1-1 standards remains challenging. There's a need to treat emergency data as a continuous stream of operational updates for timely incident management.
Method: SentinelAI implements a scalable processing pipeline composed of specialized agents. The EIDO Agent ingests raw communications and produces NENA-compliant Emergency Incident Data Object JSON. The framework transforms emergency communications into standardized, machine-readable datasets supporting integration, composite incident construction, and cross-source reasoning.
Result: The paper presents SentinelAI as a data integration and standardization framework that can transform emergency communications into standardized, machine-readable datasets that support integration, composite incident construction, and cross-source reasoning.
Conclusion: SentinelAI provides a solution for standardizing emergency response data to enable better integration and reasoning across multiple data sources, addressing challenges in emergency communication systems.
Abstract: Emergency response systems generate data from many agencies and systems. In practice, correlating and updating this information across sources in a way that aligns with Next Generation 9-1-1 data standards remains challenging. Ideally, this data should be treated as a continuous stream of operational updates, where new facts are integrated immediately to provide a timely and unified view of an evolving incident. This paper presents SentinelAI, a data integration and standardization framework for transforming emergency communications into standardized, machine-readable datasets that support integration, composite incident construction, and cross-source reasoning. SentinelAI implements a scalable processing pipeline composed of specialized agents. The EIDO Agent ingests raw communications and produces NENA-compliant Emergency Incident Data Object JSON.
[348] How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning
Luyu Yang, Yutong Dai, An Yan, Viraj Prabhu, Ran Xu, Zeyuan Chen
Main category: cs.AI
TL;DR: DreamHouse is a benchmark for evaluating physical generative reasoning in VLMs, focusing on structural and procedural constraints in residential timber-frame construction rather than just visual realism.
Details
Motivation: Current VLM evaluation focuses too much on perceptual realism and visual plausibility, neglecting whether models understand step-by-step processes and physical dependencies needed for actual construction. There's a need to test if models can generate artifacts that satisfy geometric, structural, constructability, and code-compliance constraints.
Method: Created DreamHouse benchmark grounded in residential timber-frame construction with codified engineering standards. Curated over 26,000 structures across 13 architectural styles verified to construction-document standards (LOD 350). Developed deterministic 10-test structural validation framework. Supports iterative agentic interaction where models observe intermediate build states, generate construction actions, and receive structured environmental feedback.
Result: Extensive experiments with state-of-the-art VLMs reveal substantial capability gaps that are largely invisible on existing leaderboards. The benchmark establishes physical validity as a critical evaluation axis orthogonal to visual realism.
Conclusion: Physical generative reasoning is a distinct and underdeveloped frontier in multimodal intelligence. DreamHouse highlights the need to evaluate models beyond visual realism to include physical and procedural understanding for real-world applications like design-to-construction automation.
Abstract: The physical world is not merely visual; it is governed by rigorous structural and procedural constraints. Yet, the evaluation of vision-language models (VLMs) remains heavily skewed toward perceptual realism, prioritizing the generation of visually plausible 3D layouts, shapes, and appearances. Current benchmarks rarely test whether models grasp the step-by-step processes and physical dependencies required to actually build these artifacts, a capability essential for automating design-to-construction pipelines. To address this, we introduce DreamHouse, a novel benchmark for physical generative reasoning: the capacity to synthesize artifacts that concurrently satisfy geometric, structural, constructability, and code-compliance constraints. We ground this benchmark in residential timber-frame construction, a domain with fully codified engineering standards and objectively verifiable correctness. We curate over 26,000 structures spanning 13 architectural styles, each verified to construction-document standards (LOD 350), and develop a deterministic 10-test structural validation framework. Unlike static benchmarks that assess only final outputs, DreamHouse supports iterative agentic interaction. Models observe intermediate build states, generate construction actions, and receive structured environmental feedback, enabling a fine-grained evaluation of planning, structural reasoning, and self-correction. Extensive experiments with state-of-the-art VLMs reveal substantial capability gaps that are largely invisible on existing leaderboards. These findings establish physical validity as a critical evaluation axis orthogonal to visual realism, highlighting physical generative reasoning as a distinct and underdeveloped frontier in multimodal intelligence. Available at https://luluyuyuyang.github.io/dreamhouse
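The observe/act/feedback protocol is what distinguishes DreamHouse from static benchmarks. The self-contained toy loop below mimics that protocol with stub classes; the environment API and check names are invented for illustration and do not match DreamHouse's actual interface.

```python
class ToyBuildEnv:
    """Stand-in environment that runs a deterministic check after each action."""
    def __init__(self, required_parts=("sill", "stud", "top_plate")):
        self.required = list(required_parts)
        self.placed = []

    def observe(self):
        return {"placed": list(self.placed),
                "missing": [p for p in self.required if p not in self.placed]}

    def apply(self, action):
        self.placed.append(action)
        checks = {p: (p in self.placed) for p in self.required}
        return {"checks": checks, "all_checks_passed": all(checks.values())}

class ToyAgent:
    """Proposes the next missing part; a real agent would be a VLM."""
    def propose(self, state):
        return state["missing"][0] if state["missing"] else "noop"

def build_loop(agent, env, max_steps=10):
    for step in range(max_steps):
        state = env.observe()            # observe intermediate build state
        action = agent.propose(state)    # generate a construction action
        feedback = env.apply(action)     # receive structured environmental feedback
        if feedback["all_checks_passed"]:
            return step + 1
    return None

print(build_loop(ToyAgent(), ToyBuildEnv()))  # -> 3 actions to pass all checks
```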
[349] On the Foundations of Trustworthy Artificial Intelligence
TJ Dunham
Main category: cs.AI
TL;DR: The paper argues that platform-deterministic inference is essential for trustworthy AI, introduces trust entropy to quantify non-determinism costs, and demonstrates a pure integer inference engine achieving bitwise identical outputs across different architectures.
Details
Motivation: Current AI systems lack platform determinism due to floating-point arithmetic variations across hardware, making verification and trust properties (fairness, robustness, privacy, safety, alignment) fundamentally unachievable without deterministic inference.
Method: Formalizes the Determinism Thesis, introduces trust entropy to quantify non-determinism costs, constructs a pure integer inference engine that avoids IEEE 754 floating-point arithmetic, and implements it in 99,000 lines of Rust deployed across three continents.
Result: Achieved bitwise identical output across ARM and x86 architectures in 82 cross-architecture tests on models up to 6.7B parameters with zero hash mismatches. Four geographically distributed nodes produced identical outputs verified by 356 on-chain attestation transactions.
Conclusion: Platform-deterministic inference is necessary and sufficient for trustworthy AI, and AI trust fundamentally reduces to a question of arithmetic rather than just algorithmic properties.
Abstract: We prove that platform-deterministic inference is necessary and sufficient for trustworthy AI. We formalize this as the Determinism Thesis and introduce trust entropy to quantify the cost of non-determinism, proving that verification failure probability equals 1 - 2^{-H_T} exactly. We prove a Determinism-Verification Collapse: verification under determinism requires O(1) hash comparison; without it, the verifier faces an intractable membership problem. IEEE 754 floating-point arithmetic fundamentally violates the determinism requirement. We resolve this by constructing a pure integer inference engine that achieves bitwise identical output across ARM and x86. In 82 cross-architecture tests on models up to 6.7B parameters, we observe zero hash mismatches. Four geographically distributed nodes produce identical outputs, verified by 356 on-chain attestation transactions. Every major trust property of AI systems (fairness, robustness, privacy, safety, alignment) presupposes platform determinism. Our system, 99,000 lines of Rust deployed across three continents, establishes that AI trust is a question of arithmetic.
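Two pieces of this abstract translate directly into code: the stated relation P(verification failure) = 1 - 2^{-H_T}, and the collapse of verification to a hash comparison once outputs are bitwise deterministic. The sketch below reproduces the formula and shows the hash-comparison shape; the token serialization is an assumption, not the system's actual encoding.

```python
import hashlib

def verification_failure_probability(trust_entropy_bits: float) -> float:
    """The paper's stated relation: P(failure) = 1 - 2^(-H_T).
    H_T = 0 (full platform determinism) gives probability 0."""
    return 1.0 - 2.0 ** (-trust_entropy_bits)

def verify_inference(output_tokens: list, reference_digest: str) -> bool:
    """O(1)-style verification under determinism: hash the integer output and
    compare digests, instead of testing membership in a set of 'acceptable'
    floating-point outputs. (Serialization here is an illustrative choice.)"""
    blob = b"".join(t.to_bytes(4, "little", signed=True) for t in output_tokens)
    return hashlib.sha256(blob).hexdigest() == reference_digest

print(verification_failure_probability(0.0))  # 0.0: determinism makes checking exact
print(verification_failure_probability(2.0))  # 0.75
```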
[350] LogitScope: A Framework for Analyzing LLM Uncertainty Through Information Metrics
Farhan Ahmed, Yuya Jeremy Ong, Chad DeLuca
Main category: cs.AI
TL;DR: LogitScope is a lightweight framework for analyzing LLM uncertainty through token-level information metrics computed from probability distributions, enabling uncertainty quantification without labeled data.
Details
Motivation: Traditional evaluation approaches provide limited insight into model confidence at individual token positions during generation, making it difficult to understand and quantify uncertainty in LLM outputs for reliable deployment.
Method: Introduces LogitScope framework that computes token-level information metrics (such as entropy and varentropy) from probability distributions at each generation step, using lazy evaluation for computational efficiency and being model-agnostic and compatible with HuggingFace models.
Result: LogitScope reveals patterns in model confidence, identifies potential hallucinations, and exposes decision points where models exhibit high uncertainty, demonstrating utility across diverse applications including uncertainty quantification, model behavior analysis, and production monitoring.
Conclusion: LogitScope provides a practical framework for researchers and practitioners to inspect LLM behavior during inference, enabling better understanding of model uncertainty without requiring labeled data or semantic interpretation.
Abstract: Understanding and quantifying uncertainty in large language model (LLM) outputs is critical for reliable deployment. However, traditional evaluation approaches provide limited insight into model confidence at individual token positions during generation. To address this issue, we introduce LogitScope, a lightweight framework for analyzing LLM uncertainty through token-level information metrics computed from probability distributions. By measuring metrics such as entropy and varentropy at each generation step, LogitScope reveals patterns in model confidence, identifies potential hallucinations, and exposes decision points where models exhibit high uncertainty, all without requiring labeled data or semantic interpretation. We demonstrate LogitScope’s utility across diverse applications including uncertainty quantification, model behavior analysis, and production monitoring. The framework is model-agnostic, computationally efficient through lazy evaluation, and compatible with any HuggingFace model, enabling both researchers and practitioners to inspect LLM behavior during inference.
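Entropy and varentropy are standard enough to show from scratch. The helper below is a from-first-principles sketch, not LogitScope's API: entropy is the mean surprisal of the next-token distribution, and varentropy is the variance of that surprisal.

```python
import math

def entropy_and_varentropy(logits):
    """Token-level information metrics of the kind described above.
    Entropy H = E[-log p]; varentropy = Var[-log p], both under the
    softmax distribution of one generation step (units: nats)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    surprisals = [-math.log(p) for p in probs]
    h = sum(p * s for p, s in zip(probs, surprisals))              # entropy
    varent = sum(p * (s - h) ** 2 for p, s in zip(probs, surprisals))
    return h, varent

print(entropy_and_varentropy([2.0, 1.0, 0.1]))   # moderately uncertain step
print(entropy_and_varentropy([10.0, 0.0, 0.0]))  # near-certain step: low entropy
```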
[351] Decoding Market Emotions in Cryptocurrency Tweets via Predictive Statement Classification with Machine Learning and Transformers
Moein Shahiki Tash, Zahra Ahani, Mohim Tash, Mostafa Keikhay Farzaneh, Ari Y. Barrera-Animas, Olga Kolesnikova
Main category: cs.AI
TL;DR: A novel two-stage classification framework for identifying predictive cryptocurrency tweets, using binary classification followed by sentiment categorization, with GPT-based data augmentation and emotion analysis.
Details
Motivation: With growing cryptocurrency prominence and speculative activity on social media, there's a need to systematically identify and categorize predictive statements about cryptocurrencies to understand market sentiment and prediction patterns.
Method: Two-stage classification: Task 1 (binary classification of Predictive vs Non-Predictive tweets) and Task 2 (categorizing Predictive tweets as Incremental, Decremental, or Neutral). Uses manual and GPT-based annotation, SenticNet for emotion features, GPT-generated paraphrasing for data augmentation, and evaluates ML, DL, and transformer models.
Result: GPT-based balancing significantly improved model performance. Transformer models achieved highest F1-score in Task 1, while traditional ML models performed best in Task 2. Emotion analysis revealed distinct emotional patterns associated with each prediction category across different cryptocurrencies.
Conclusion: The framework effectively classifies cryptocurrency predictive statements, with GPT-based augmentation enhancing performance and emotion analysis providing insights into sentiment patterns across different cryptocurrencies.
Abstract: The growing prominence of cryptocurrencies has triggered widespread public engagement and increased speculative activity, particularly on social media platforms. This study introduces a novel classification framework for identifying predictive statements in cryptocurrency-related tweets, focusing on five popular cryptocurrencies: Cardano, Matic, Binance, Ripple, and Fantom. The classification process is divided into two stages: Task 1 involves binary classification to distinguish between Predictive and Non-Predictive statements. Tweets identified as Predictive proceed to Task 2, where they are further categorized as Incremental, Decremental, or Neutral. To build a robust dataset, we combined manual and GPT-based annotation methods and utilized SenticNet to extract emotion features corresponding to each prediction category. To address class imbalance, GPT-generated paraphrasing was employed for data augmentation. We evaluated a wide range of machine learning, deep learning, and transformer-based models across both tasks. The results show that GPT-based balancing significantly enhanced model performance, with transformer models achieving the highest F1-score in Task 1, while traditional machine learning models performed best in Task 2. Furthermore, our emotion analysis revealed distinct emotional patterns associated with each prediction category across the different cryptocurrencies.
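The cascade itself is simple to express. The toy pipeline below uses TF-IDF plus logistic regression as stand-ins for the many models the paper compares, with a four-tweet dummy dataset just to make it runnable; none of this is the authors' code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Dummy data illustrating the two-stage label scheme.
tweets = ["ADA will 10x by summer", "just bought some matic",
          "XRP going to zero", "fantom chart looks flat"]
is_predictive = [1, 0, 1, 1]
direction = ["Incremental", None, "Decremental", "Neutral"]  # predictive tweets only

# Task 1: Predictive vs Non-Predictive.
stage1 = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(tweets, is_predictive)

# Task 2: direction, trained on the predictive subset.
pred_tweets = [t for t, y in zip(tweets, is_predictive) if y == 1]
pred_labels = [d for d in direction if d is not None]
stage2 = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(pred_tweets, pred_labels)

def classify(tweet: str) -> str:
    if stage1.predict([tweet])[0] == 0:
        return "Non-Predictive"
    return stage2.predict([tweet])[0]  # Incremental / Decremental / Neutral

print(classify("ADA to the moon next week"))
```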
[352] FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol
Jie Zhu, Yimin Tian, Boyang Li, Kehao Wu, Zhongzhi Liang, Junhui Li, Xianyin Zhang, Lifan Guo, Feng Chen, Yong Liu, Chi Zhang
Main category: cs.AI
TL;DR: FinMCP-Bench is a benchmark for evaluating LLMs on financial problem-solving through tool invocation, featuring 613 diverse samples across 10 scenarios with real financial protocols.
Details
Motivation: There's a need for standardized evaluation of LLMs in real-world financial applications, particularly for assessing their ability to use financial tools and protocols effectively.
Method: Created a benchmark with 613 samples spanning 10 main scenarios and 33 sub-scenarios, incorporating 65 real financial MCPs and three sample types (single tool, multi-tool, multi-turn).
Result: Provides a comprehensive testbed for evaluating LLMs on financial tool invocation accuracy and reasoning capabilities across different complexity levels.
Conclusion: FinMCP-Bench offers a standardized, practical benchmark for advancing research on financial LLM agents and their tool-using capabilities.
Abstract: This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) in solving real-world financial problems through tool invocation of financial model context protocols. FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, featuring both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three types of samples, single tool, multi-tool, and multi-turn, allowing evaluation of models across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities. FinMCP-Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.
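The abstract does not define the invocation-accuracy metric precisely; one simple shape, shown purely as an illustration, scores a call correct only when both the tool name and its arguments match the gold call. The example tool and the aligned-order assumption are hypothetical simplifications.

```python
def tool_invocation_accuracy(predicted_calls, gold_calls):
    """Illustrative exact-match metric (the benchmark's actual metrics are
    defined in the paper). Assumes predicted and gold calls are aligned;
    a call counts as correct only if tool name and all arguments match."""
    correct = sum(
        p["tool"] == g["tool"] and p["args"] == g["args"]
        for p, g in zip(predicted_calls, gold_calls)
    )
    return correct / max(len(gold_calls), 1)

pred = [{"tool": "get_stock_price", "args": {"ticker": "AAPL"}}]
gold = [{"tool": "get_stock_price", "args": {"ticker": "AAPL"}}]
print(tool_invocation_accuracy(pred, gold))  # 1.0
```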
[353] Shopping with a Platform AI Assistant: Who Adopts, When in the Journey, and What For
Se Yan, Han Zhong, Zemin Zhong, Wenyu Zhou
Main category: cs.AI
TL;DR: Study of 31M users on Ctrip platform shows LLM-based shopping AI adoption patterns differ from general AI tools, with older/female users adopting more; AI chat complements rather than replaces search, used for exploratory product discovery.
Details
Motivation: To understand how consumers adopt and use platform-embedded shopping AI in e-commerce, particularly how LLM-based AI assistants function within purchase journeys compared to traditional search.
Method: Analysis of 31 million users on Ctrip (China’s largest online travel platform) using data on “Wendao,” an LLM-based AI assistant integrated into the platform, examining adoption patterns, usage timing, and query types.
Result: 1) Adoption highest among older consumers, female users, and highly engaged existing users (reversing typical AI adoption patterns); 2) AI chat appears in same phase as traditional search but used for exploratory tasks; 3) 42% of chat requests are for attraction queries, with chat intent varying by timing and product category.
Conclusion: Embedded shopping AI functions as a complementary interface for exploratory product discovery rather than a substitute for conventional search in e-commerce.
Abstract: This paper provides some of the first large-scale descriptive evidence on how consumers adopt and use platform-embedded shopping AI in e-commerce. Using data on 31 million users of Ctrip, China’s largest online travel platform, we study “Wendao,” an LLM-based AI assistant integrated into the platform. We document three empirical regularities. First, adoption is highest among older consumers, female users, and highly engaged existing users, reversing the younger, male-dominated profile commonly documented for general-purpose AI tools. Second, AI chat appears in the same broad phase of the purchase journey as traditional search and well before order placement; among journeys containing both chat and search, the most common pattern is interleaving, with users moving back and forth between the two modalities. Third, consumers disproportionately use the assistant for exploratory, hard-to-keyword tasks: attraction queries account for 42% of observed chat requests, and chat intent varies systematically with both the timing of chat relative to search and the category of products later purchased within the same journey. These findings suggest that embedded shopping AI functions less as a substitute for conventional search than as a complementary interface for exploratory product discovery in e-commerce.
[354] Can MLLMs Read Students’ Minds? Unpacking Multimodal Error Analysis in Handwritten Math
Dingjie Song, Tianlong Xu, Yi-Fan Zhang, Hang Li, Zhiling Yan, Xing Fan, Haoyang Li, Lichao Sun, Qingsong Wen
Main category: cs.AI
TL;DR: ScratchMath is a benchmark dataset for evaluating multimodal LLMs on explaining and classifying errors in handwritten math scratchwork, revealing significant performance gaps between models and human experts.
Details
Motivation: Existing educational NLP focuses on textual responses and neglects the complexity of handwritten scratchwork. Current MLLMs adopt an "examinee perspective" rather than diagnosing student errors, creating a need for benchmarks that assess error explanation capabilities.
Method: Created ScratchMath dataset with 1,720 handwritten math samples from Chinese students, annotated through human-machine collaboration. Supports Error Cause Explanation (ECE) and Error Cause Classification (ECC) tasks with 7 error types. Evaluated 16 leading MLLMs on these tasks.
Result: Significant performance gaps between MLLMs and human experts, especially in visual recognition and logical reasoning. Proprietary models outperformed open-source models. Large reasoning models showed strong potential for error explanation.
Conclusion: ScratchMath addresses a critical gap in educational assessment by focusing on handwritten scratchwork error diagnosis. The benchmark reveals current limitations of MLLMs in this domain and provides resources for advancing multimodal educational AI.
Abstract: Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP primarily focuses on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an “examinee perspective”, prioritizing generating correct answers rather than diagnosing student errors. To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students, supporting two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset is meticulously annotated through rigorous human-machine collaborative approaches involving multiple stages of expert labeling, review, and verification. We systematically evaluate 16 leading MLLMs on ScratchMath, revealing significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, with large reasoning models showing strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.
[355] Design Once, Deploy at Scale: Template-Driven ML Development for Large Model Ecosystems
Jiang Liu, John Martabano Landy, Yao Xuan, Swamy Muddu, Nhat Le, Munaf Sahaf, Luc Kien Hang, Rupinder Khandpour, Kevin De Angeli, Chang Yang, Shouyuan Chen, Shiblee Sadik, Ani Agrawal, Djordje Gligorijevic, Jingzheng Qin, Peggy Yao, Alireza Vahdatpour
Main category: cs.AI
TL;DR: Standard Model Template (SMT) framework for recommendation systems reduces engineering complexity and improves efficiency in large-scale computational advertising platforms.
Details
Motivation: Large-scale advertising platforms face significant challenges in maintaining numerous ML models for different optimization events, requiring substantial engineering effort for model refreshes and technique propagation across the ecosystem.
Method: Proposes Standard Model Template (SMT) - a framework using standardized, composable ML components to generate adaptable models for diverse data distributions and optimization events, reducing technique propagation complexity from O(n·2^k) to O(n+k).
Result: Evaluation in Meta’s production ads ranking ecosystem showed: 0.63% average improvement in cross-entropy, 92% reduction in per-model iteration engineering time, and 6.3× increase in technique-model pair adoption throughput.
Conclusion: Standardized model-building approach outperforms independent per-model optimization, challenging conventional wisdom that diverse optimization goals require diversified ML model design.
Abstract: Modern computational advertising platforms typically rely on recommendation systems to predict user responses, such as click-through rates, conversion rates, and other optimization events. To support a wide variety of product surfaces and advertiser goals, these platforms frequently maintain an extensive ecosystem of machine learning (ML) models. However, operating at this scale creates significant development and efficiency challenges. Substantial engineering effort is required to regularly refresh ML models and propagate new techniques, which results in long latencies when deploying ML innovations across the ecosystem. We present a large-scale empirical study comparing model performance, efficiency, and ML technique propagation between a standardized model-building approach and independent per-model optimization in recommendation systems. To facilitate this standardization, we propose the Standard Model Template (SMT) – a framework that generates high-performance models adaptable to diverse data distributions and optimization events. By utilizing standardized, composable ML model components, SMT reduces technique propagation complexity from $O(n \cdot 2^k)$ to $O(n + k)$ where $n$ is the number of models and $k$ the number of techniques. Evaluating an extensive suite of models over four global development cycles within Meta’s production ads ranking ecosystem, our results demonstrate: (1) a 0.63% average improvement in cross-entropy at neutral serving capacity, (2) a 92% reduction in per-model iteration engineering time, and (3) a $6.3\times$ increase in technique-model pair adoption throughput. These findings challenge the conventional wisdom that diverse optimization goals inherently require diversified ML model design.
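The stated complexity reduction is concrete enough to sanity-check numerically. A minimal sketch of the claimed $O(n \cdot 2^k)$ vs. $O(n + k)$ propagation cost, with purely illustrative values for n and k (the paper does not disclose Meta's actual counts):

```python
# Illustrative check of the paper's complexity claim; n and k are hypothetical.
def per_model_effort(n_models: int, k_techniques: int) -> int:
    # Without a shared template, each of n models may have to be validated
    # against any subset of k techniques: O(n * 2^k) combinations.
    return n_models * 2 ** k_techniques

def template_effort(n_models: int, k_techniques: int) -> int:
    # With a standard template, models and techniques evolve independently:
    # O(n + k) units of maintenance work.
    return n_models + k_techniques

n, k = 50, 10
print(per_model_effort(n, k))  # 51200
print(template_effort(n, k))   # 60
```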
[356] The Anatomy of Uncertainty in LLMs
Aditya Taparia, Ransalu Senanayake, Kowshik Thopalli, Vivek Narayanaswamy
Main category: cs.AI
TL;DR: A framework for decomposing LLM uncertainty into three semantic components: input ambiguity, knowledge gaps, and decoding randomness, providing actionable insights for improving model reliability.
Details
Motivation: Current uncertainty quantification methods for LLMs (single scores or aleatoric-epistemic dichotomy) fail to provide actionable insights for improving generative models and understanding their reliability.
Method: Proposes an uncertainty decomposition framework that dissects LLM uncertainty into three distinct semantic components: input ambiguity (from ambiguous prompts), knowledge gaps (from insufficient parametric evidence), and decoding randomness (from stochastic sampling).
Result: Experiments show that the dominance of these uncertainty components shifts across model sizes and tasks, providing better understanding for auditing LLM reliability and detecting hallucinations.
Conclusion: The framework enables targeted interventions for improving LLM trustworthiness and paves the way for more reliable deployment by providing actionable insights into uncertainty sources.
Abstract: Understanding why a large language model (LLM) is uncertain about the response is important for their reliable deployment. Current approaches, which either provide a single uncertainty score or rely on the classical aleatoric-epistemic dichotomy, fail to offer actionable insights for improving the generative model. Recent studies have also shown that such methods are not enough for understanding uncertainty in LLMs. In this work, we advocate for an uncertainty decomposition framework that dissects LLM uncertainty into three distinct semantic components: (i) input ambiguity, arising from ambiguous prompts; (ii) knowledge gaps, caused by insufficient parametric evidence; and (iii) decoding randomness, stemming from stochastic sampling. Through a series of experiments we demonstrate that the dominance of these components can shift across model size and task. Our framework provides a better understanding to audit LLM reliability and detect hallucinations, paving the way for targeted interventions and more trustworthy systems.
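The abstract does not give the estimators, but the three-way split suggests a simple sampling diagnostic. A rough sketch, using answer disagreement as a crude proxy for each component; `generate` is a hypothetical model call and the residual step is illustrative only, not the paper's method:

```python
from collections import Counter

def disagreement(answers):
    """Fraction of answers differing from the modal answer."""
    top = Counter(answers).most_common(1)[0][1]
    return 1.0 - top / len(answers)

def decompose_uncertainty(generate, prompt, paraphrases, n=8):
    # Decoding randomness: resample the same prompt at temperature > 0.
    decoding = disagreement([generate(prompt, temperature=0.8)
                             for _ in range(n)])
    # Input ambiguity: compare greedy answers across paraphrased prompts.
    ambiguity = disagreement([generate(p, temperature=0.0)
                              for p in [prompt] + paraphrases])
    # Knowledge gap: here just the residual of total disagreement, a crude
    # stand-in for insufficient parametric evidence.
    total = disagreement([generate(p, temperature=0.8)
                          for p in [prompt] + paraphrases
                          for _ in range(n // 2)])
    return {"decoding": decoding, "ambiguity": ambiguity,
            "knowledge_gap": max(0.0, total - decoding - ambiguity)}
```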
[357] Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation
Yeonjun In, Mehrab Tanjim, Jayakumar Subramanian, Sungchul Kim, Uttaran Bhattacharya, Wonjoong Kim, Sangwu Park, Somdeb Sarkhel, Chanyoung Park
Main category: cs.AI
TL;DR: The paper introduces multi-perspective failure attribution for multi-agent systems, proposing a new benchmark (MP-Bench) and evaluation protocol to address attribution ambiguity in complex MAS failures.
Details
Motivation: Existing benchmarks and methods for multi-agent system failure attribution assume single deterministic root causes, but in practice MAS failures often have multiple plausible attributions due to complex inter-agent dependencies and ambiguous execution trajectories.
Method: Proposes multi-perspective failure attribution paradigm, introduces MP-Bench benchmark specifically designed for this setting, and develops a new evaluation protocol tailored to handle attribution ambiguity in MAS failures.
Result: Experiments show that prior conclusions suggesting LLMs struggle with failure attribution are largely driven by limitations in existing benchmark designs, highlighting the necessity of multi-perspective benchmarks and evaluation protocols.
Conclusion: Multi-perspective failure attribution is essential for realistic and reliable MAS debugging, and the proposed benchmark and evaluation protocol address the limitations of existing single-perspective approaches.
Abstract: Failure attribution is essential for diagnosing and improving multi-agent systems (MAS), yet existing benchmarks and methods largely assume a single deterministic root cause for each failure. In practice, MAS failures often admit multiple plausible attributions due to complex inter-agent dependencies and ambiguous execution trajectories. We revisit MAS failure attribution from a multi-perspective standpoint and propose multi-perspective failure attribution, a practical paradigm that explicitly accounts for attribution ambiguity. To support this setting, we introduce MP-Bench, the first benchmark designed for multi-perspective failure attribution in MAS, along with a new evaluation protocol tailored to this paradigm. Through extensive experiments, we find that prior conclusions suggesting LLMs struggle with failure attribution are largely driven by limitations in existing benchmark designs. Our results highlight the necessity of multi-perspective benchmarks and evaluation protocols for realistic and reliable MAS debugging.
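The protocol details are in the paper, but the core scoring change is easy to state: a judge's attribution counts as correct if it falls inside the annotated set of plausible root causes, not a single gold label. A minimal sketch (data layout hypothetical):

```python
def multi_perspective_accuracy(predictions, plausible_sets):
    """Each prediction is an (agent, step) pair; each element of
    plausible_sets is the set of defensible attributions for that failure."""
    hits = sum(pred in plausible
               for pred, plausible in zip(predictions, plausible_sets))
    return hits / len(predictions)

# Single-perspective scoring would compare against one gold label, penalizing
# judges that pick a different but equally defensible root cause.
preds = [("planner", 3), ("coder", 7)]
gold_sets = [{("planner", 3), ("executor", 4)}, {("reviewer", 9)}]
print(multi_perspective_accuracy(preds, gold_sets))  # 0.5
```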
[358] A Public Theory of Distillation Resistance via Constraint-Coupled Reasoning Architectures
Peng Wei, Wesley Shu
Main category: cs.AI
TL;DR: Theoretical framework for reducing capability-structure asymmetry in AI models by coupling high-level capabilities to internal stability constraints, making distillation less valuable as a shortcut.
Details
Motivation: Addresses the risk that useful AI capabilities can be transferred more cheaply than the governance structures that accompany them, creating a dangerous asymmetry where capabilities can be extracted without their original safety constraints.
Method: Introduces a constraint-coupled reasoning framework with four elements: bounded transition burden, path-load accumulation, dynamically evolving feasible regions, and a capability-stability coupling condition. The framework is intentionally abstract and omits proprietary details to remain public-safe.
Result: Presents a theoretical framework rather than operational results, offering a falsifiable architectural thesis, clear threat model, and testable hypotheses for future work on distillation resistance and model governance.
Conclusion: Proposes that coupling high-level capabilities to internal stability constraints can reduce the value of distillation as a shortcut, potentially helping to preserve governance structures when capabilities are transferred.
Abstract: Knowledge distillation, model extraction, and behavior transfer have become central concerns in frontier AI. The main risk is not merely copying, but the possibility that useful capability can be transferred more cheaply than the governance structure that originally accompanied it. This paper presents a public, trade-secret-safe theoretical framework for reducing that asymmetry at the architectural level. The core claim is that distillation becomes less valuable as a shortcut when high-level capability is coupled to internal stability constraints that shape state transitions over time. To formalize this idea, the paper introduces a constraint-coupled reasoning framework with four elements: bounded transition burden, path-load accumulation, dynamically evolving feasible regions, and a capability-stability coupling condition. The paper is intentionally public-safe: it omits proprietary implementation details, training recipes, thresholds, hidden-state instrumentation, deployment procedures, and confidential system design choices. The contribution is therefore theoretical rather than operational. It offers a falsifiable architectural thesis, a clear threat model, and a set of experimentally testable hypotheses for future work on distillation resistance, alignment, and model governance.
[359] System-Anchored Knee Estimation for Low-Cost Context Window Selection in PDE Forecasting
Wenshuo Wang, Fan Zhang
Main category: cs.AI
TL;DR: SAKE is a low-cost method for selecting optimal context windows for autoregressive neural PDE simulators by identifying system anchors and performing knee-aware selection.
Details
Motivation: Current approaches for context-window selection in neural PDE simulators are either expensive, brittle, or not aligned with downstream performance, creating a need for formalized low-cost selection methods.
Method: Two-stage method: 1) identifies small structured candidate set from physically interpretable system anchors, 2) performs knee-aware downstream selection within that candidate set.
Result: Across eight PDEBench families, SAKE achieved 67.8% Exact selection accuracy, 91.7% Within-1 accuracy, 6.1% mean regret@knee, and 94.9% normalized search-cost savings.
Conclusion: SAKE provides an effective low-cost solution for context-window selection in neural PDE simulators, outperforming existing methods while significantly reducing computational costs.
Abstract: Autoregressive neural PDE simulators predict the evolution of physical fields one step at a time from a finite history, but low-cost context-window selection for such simulators remains an unformalized problem. Existing approaches to context-window selection in time-series forecasting include exhaustive validation, direct low-cost search, and system-theoretic memory estimation, but they are either expensive, brittle, or not directly aligned with downstream rollout performance. We formalize explicit context-window selection for fixed-window autoregressive neural PDE simulators as an independent low-cost algorithmic problem, and propose System-Anchored Knee Estimation (SAKE), a two-stage method that first identifies a small structured candidate set from physically interpretable system anchors and then performs knee-aware downstream selection within it. Across all eight PDEBench families evaluated under the shared $L \in \{1, \dots, 16\}$ protocol, SAKE is the strongest overall matched-budget low-cost selector among the evaluated methods, achieving 67.8% Exact, 91.7% Within-1, 6.1% mean regret@knee, and a cost ratio of 0.051 (94.9% normalized search-cost savings).
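SAKE's exact knee criterion is not spelled out in the abstract; the sketch below uses the common maximum-distance-to-chord heuristic over rollout errors for the candidate windows, which is one standard way to make a selection "knee-aware":

```python
import numpy as np

def knee_index(errors):
    """Knee of an error-vs-context-length curve via the
    max-distance-to-chord heuristic (Kneedle-style)."""
    y = np.asarray(errors, dtype=float)
    x = np.arange(len(y), dtype=float)
    p0, p1 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    chord = (p1 - p0) / np.linalg.norm(p1 - p0)
    rel = np.stack([x, y], axis=1) - p0
    # Perpendicular distance of each point to the chord (2-D cross product).
    dist = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])
    return int(np.argmax(dist))

# Rollout error typically drops fast, then flattens as the window L grows.
candidate_L = [1, 2, 3, 4, 5, 6, 7]
val_errors = [1.0, 0.45, 0.30, 0.27, 0.26, 0.255, 0.25]
print(candidate_L[knee_index(val_errors)])  # 3
```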
[360] From Stateless to Situated: Building a Psychological World for LLM-Based Emotional Support
Boning Zhao, Clover Hu, Xinnuo Li
Main category: cs.AI
TL;DR: LEKIA 2.0 is a situated LLM architecture that separates cognitive and executive layers to maintain temporal continuity and consent boundaries in emotional support dialogues, achieving 31% improvement over baselines.
Details
Motivation: Current LLMs fail in emotional support scenarios due to their stateless nature - they lack temporal continuity, stage awareness, and user consent boundaries, leading to premature advancement, stage misalignment, and boundary violations in multi-turn interventions.
Method: Proposes LEKIA 2.0 with a situated LLM architecture that separates cognitive layer (situational modeling) from executive layer (intervention execution), creating a sustainably updatable external situational structure to maintain stable representations of user situation and consent boundaries.
Result: LEKIA achieved approximately 31% average absolute improvement over prompt-only baselines in deep intervention loop completion, evaluated using a Static-to-Dynamic online evaluation protocol for multi-turn interaction.
Conclusion: An external situational structure is crucial for building stable, controllable, and situated emotional support systems, enabling LLMs to maintain process control in multi-turn interventions.
Abstract: In psychological support and emotional companionship scenarios, the core limitation of large language models (LLMs) lies not merely in response quality, but in their reliance on local next-token prediction, which prevents them from maintaining the temporal continuity, stage awareness, and user consent boundaries required for multi-turn intervention. This stateless characteristic makes systems prone to premature advancement, stage misalignment, and boundary violations in continuous dialogue. To address this problem, we argue that the key challenge in process-oriented emotional support is not simply generating natural language, but constructing a sustainably updatable external situational structure for the model. We therefore propose LEKIA 2.0, a situated LLM architecture that separates the cognitive layer from the executive layer, thereby decoupling situational modeling from intervention execution. This design enables the system to maintain stable representations of the user’s situation and consent boundaries throughout ongoing interaction. To evaluate this process-control capability, we further introduce a Static-to-Dynamic online evaluation protocol for multi-turn interaction. LEKIA achieved an average absolute improvement of approximately 31% over prompt-only baselines in deep intervention loop completion. The results suggest that an external situational structure is a key enabling condition for building stable, controllable, and situated emotional support systems.
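LEKIA's code is not described here, but the cognitive/executive separation it argues for can be sketched as two functions sharing an explicit, externally held situational state (all names and rules below are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class SituationalState:
    """External structure the model cannot silently overwrite."""
    stage: str = "rapport"                    # current intervention stage
    consented_topics: set = field(default_factory=set)
    history_summary: str = ""

def cognitive_step(state: SituationalState, user_msg: str) -> SituationalState:
    # Updates the situation model only; no response is generated here.
    if "we can talk about that" in user_msg.lower():
        state.consented_topics.add("deep_intervention")
    state.history_summary += f" | {user_msg[:40]}"
    return state

def executive_step(state: SituationalState) -> str:
    # Execution is gated by the external state, so the model cannot advance
    # stages or cross consent boundaries on its own.
    if "deep_intervention" not in state.consented_topics:
        return "I'm here with you. Would you like to explore this further?"
    return "Thank you for trusting me with this. Let's look at it together."
```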
[361] Mechanistically Interpreting Compression in Vision-Language Models
Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das, Hreetam Paul
Main category: cs.AI
TL;DR: Analysis of how pruning and quantization affect internal computations and safety behaviors in compressed vision-language models, with introduction of a new safety benchmark
Details
Motivation: Compressed VLMs are widely deployed but raise concerns about whether their internal computations and safety behaviors are preserved after compression techniques like pruning and quantization.
Method: Use causal circuit analysis and crosscoder-based feature comparisons to examine how pruning and quantization change model internals across representative VLMs; introduce VLMSafe-420 benchmark with harmful inputs and benign counterfactuals across safety categories.
Result: Pruning keeps circuit structure intact but rotates/attenuates internal features, while quantization modifies circuits at higher level but leaves surviving features better aligned; pruning causes sharp drop in genuine refusal behavior
Conclusion: Compression techniques fundamentally change VLM internals and have significant safety implications, with pruning particularly damaging refusal behaviors; choice of compression method matters for safety
Abstract: Compressed vision-language models (VLMs) are widely used to reduce memory and compute costs, making them a suitable choice for real-world deployment. However, compressing these models raises concerns about whether internal computations and safety behaviors are preserved. In this work, we use causal circuit analysis and crosscoder-based feature comparisons to examine how pruning and quantization fundamentally change the internals across representative VLMs. We observe that pruning generally keeps circuit structure intact but rotates and attenuates internal features, while quantization modifies the circuits at a higher level yet leaves the surviving features better aligned. Leveraging this insight, we also introduce VLMSafe-420, a novel benchmark that pairs harmful inputs with matched benign counterfactuals across various safety categories. Our findings show that pruning causes a sharp drop in genuine refusal behavior, suggesting that the choice of compression has safety implications.
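The "rotates and attenuates" finding implies a simple per-feature diagnostic: compare matched feature directions before and after compression by cosine similarity (rotation) and norm ratio (attenuation). A sketch under that reading, not the paper's crosscoder pipeline:

```python
import numpy as np

def rotation_and_attenuation(f_base: np.ndarray, f_comp: np.ndarray):
    """f_base, f_comp: (n_features, d) arrays of matched feature vectors
    from the base and compressed model. Low cosine suggests rotation;
    a norm ratio below 1 suggests attenuation."""
    cos = np.sum(f_base * f_comp, axis=1) / (
        np.linalg.norm(f_base, axis=1) * np.linalg.norm(f_comp, axis=1))
    ratio = np.linalg.norm(f_comp, axis=1) / np.linalg.norm(f_base, axis=1)
    return cos, ratio
```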
[362] MP-MoE: Matrix Profile-Guided Mixture of Experts for Precipitation Forecasting
Huyen Ngoc Tran, Dung Trung Tran, Hong Nguyen, Xuan Vu Phan, Nam-Phong Nguyen
Main category: cs.AI
TL;DR: MP-MoE framework combines intensity loss with Matrix Profile objective for better precipitation forecasting in Vietnam, addressing double penalty issues in NWP post-processing.
Details
Motivation: NWP models have limited accuracy in tropical regions like Vietnam due to complex topography and convective instability. Existing data-driven post-processing methods suffer from "double penalty" effect from point-wise objective functions when there are minor temporal misalignments.
Method: Proposes Matrix Profile-guided Mixture of Experts (MP-MoE) framework that integrates conventional intensity loss with structural-aware Matrix Profile objective. Uses subsequence-level similarity rather than point-wise errors to facilitate reliable expert selection and mitigate excessive penalization from phase shifts.
Result: MP-MoE outperforms raw NWP and baseline learning methods in terms of Mean Critical Success Index (CSI-M) for heavy rainfall events, while significantly reducing Dynamic Time Warping (DTW) values. Evaluated on rainfall datasets from two major river basins in Vietnam across multiple horizons (1-hour intensity and accumulated rainfall over 12, 24, and 48 hours).
Conclusion: The framework effectively captures peak rainfall intensities and preserves morphological integrity of storm events, demonstrating efficacy in precipitation forecasting for tropical regions with complex topography.
Abstract: Precipitation forecasting remains a persistent challenge in tropical regions like Vietnam, where complex topography and convective instability often limit the accuracy of Numerical Weather Prediction (NWP) models. While data-driven post-processing is widely used to mitigate these biases, most existing frameworks rely on point-wise objective functions, which suffer from the "double penalty" effect under minor temporal misalignments. In this work, we propose the Matrix Profile-guided Mixture of Experts (MP-MoE), a framework that integrates conventional intensity loss with a structural-aware Matrix Profile objective. By leveraging subsequence-level similarity rather than point-wise errors, the proposed loss facilitates more reliable expert selection and mitigates excessive penalization caused by phase shifts. We evaluate MP-MoE on rainfall datasets from two major river basins in Vietnam across multiple horizons, including 1-hour intensity and accumulated rainfall over 12, 24, and 48 hours. Experimental results demonstrate that MP-MoE outperforms raw NWP and baseline learning methods in terms of Mean Critical Success Index (CSI-M) for heavy rainfall events, while significantly reducing Dynamic Time Warping (DTW) values. These findings highlight the framework’s efficacy in capturing peak rainfall intensities and preserving the morphological integrity of storm events.
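The Matrix Profile itself is well defined independently of this paper: for every length-m subsequence of a series, the z-normalized distance to its nearest non-trivial match. A brute-force reference sketch (a production pipeline would use STOMP or SCRIMP for speed):

```python
import numpy as np

def matrix_profile(series, m):
    """Naive matrix profile with z-normalized Euclidean distance and an
    exclusion zone of m // 2 around each index (trivial self-matches)."""
    series = np.asarray(series, dtype=float)
    n = len(series) - m + 1
    subs = np.stack([series[i:i + m] for i in range(n)])
    subs = (subs - subs.mean(axis=1, keepdims=True)) / (
        subs.std(axis=1, keepdims=True) + 1e-8)
    profile = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        d[max(0, i - m // 2):i + m // 2 + 1] = np.inf
        profile[i] = d.min()
    return profile
```

A structure-aware loss can then, for instance, penalize the gap between the matrix profiles of the forecast and the observed rainfall series rather than their point-wise difference, which is less sensitive to small phase shifts.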
[363] Sparse Visual Thought Circuits in Vision-Language Models
Yunpeng Zhou
Main category: cs.AI
TL;DR: SAE features in multimodal models often fail to be modular/composable units for reasoning, causing output drift when intervening on feature unions, despite modest improvements from task-selective feature interventions.
Details
Motivation: To test whether sparse autoencoder (SAE) features form modular, composable units for reasoning in multimodal models - an assumption underlying many intervention-based steering methods.
Method: Developed a reproducible causal pipeline to localize and test sparse visual thought circuits in Qwen3-VL-8B. Used linear probes to identify mid-decoder locus for task type information, trained SAEs at this layer, constructed task-selective sets via explicit rules, and performed inference-time scaling/ablation while quantifying accuracy and drift.
Result: Found that intervening on task-selective feature sets modestly improves reasoning accuracy, but intervening on the union of two such sets reliably induces output drift and degrades accuracy, even under norm-matched perturbations. Non-modular circuit interference consistent with shared internal pathways where feature unions amplify activation shifts.
Conclusion: SAE features often fail to be modular/composable units for reasoning, clarifying boundaries of SAE feature composability and providing a rigorous diagnostic framework for more reliable VLM control.
Abstract: Sparse autoencoders (SAEs) improve interpretability in multimodal models, but it remains unclear whether SAE features form modular, composable units for reasoning, an assumption underlying many intervention-based steering methods. We test this modularity hypothesis and find it often fails: intervening on a task-selective feature set can modestly improve reasoning accuracy, while intervening on the union of two such sets reliably induces output drift (large unintended changes in predictions) and degrades accuracy, even under norm-matched perturbations. This non-modular circuit interference is consistent with shared internal pathways where feature unions amplify activation shifts. We develop a reproducible causal pipeline to localize and test these sparse visual thought circuits in Qwen3-VL-8B. On a controlled synthetic benchmark with seven task types and three difficulty levels, linear probes identify a mid-decoder locus for task-type information. We train SAEs at this layer, construct task-selective sets via an explicit rule, and perform inference-time scaling and ablation while quantifying accuracy and drift. Our findings, validated with bootstrapped subsamples and permutation controls and replicated across multiple VLM families and five diverse datasets, clarify the boundaries of SAE feature composability and provide a rigorous diagnostic framework for more reliable VLM control.
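The intervention primitive being tested is standard SAE steering: encode an activation, scale a chosen set of latent features, and add back only the resulting delta. A minimal sketch with a ReLU SAE (weights and feature indices are placeholders):

```python
import numpy as np

def sae_intervene(h, W_enc, b_enc, W_dec, b_dec, feat_ids, scale=2.0):
    """Scale selected SAE features inside activation vector h. Per the
    paper, doing this for one task-selective set can help, while scaling
    the union of two such sets tends to cause output drift."""
    z = np.maximum(0.0, h @ W_enc + b_enc)   # sparse codes (ReLU SAE)
    recon = z @ W_dec + b_dec                # faithful reconstruction
    z_edit = z.copy()
    z_edit[feat_ids] *= scale                # amplify the chosen features
    recon_edit = z_edit @ W_dec + b_dec
    return h + (recon_edit - recon)          # apply only the edit's delta
```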
[364] ElephantBroker: A Knowledge-Grounded Cognitive Runtime for Trustworthy AI Agents
Cristian Lupascu, Alexandru Lupascu
Main category: cs.AI
TL;DR: ElephantBroker is an open-source cognitive runtime that combines Neo4j knowledge graphs with Qdrant vector stores to provide verifiable, durable memory for LLM agents, featuring comprehensive cognitive loops, safety mechanisms, and enterprise-grade deployment options.
Details
Motivation: Current LLM agent memory systems rely on flat key-value stores or simple vector retrieval without tracking knowledge provenance or trustworthiness, creating risks in high-stakes, multi-turn settings where factual grounding is critical.
Method: Unifies Neo4j knowledge graph with Qdrant vector store through Cognee SDK, implementing a complete cognitive loop with hybrid five-source retrieval, competitive scoring engine, evidence verification, context lifecycle management, safety guard pipelines, AI firewall, consolidation engine, and numeric authority model.
Result: Validated through comprehensive test suite of over 2,200 tests spanning unit, integration, and end-to-end levels, confirming subsystem correctness. Supports three deployment tiers, five profile presets, multi-gateway isolation, and management dashboard for oversight.
Conclusion: ElephantBroker provides a robust, verifiable memory system for LLM agents that addresses provenance tracking and trustworthiness issues in high-stakes environments, enabling configurations from lightweight memory-only agents to full cognitive runtimes with enterprise safety.
Abstract: Large Language Model-based agents increasingly operate in high-stakes, multi-turn settings where factual grounding is critical, yet their memory systems typically rely on flat key-value stores or plain vector retrieval with no mechanism to track the provenance or trustworthiness of stored knowledge. We present ElephantBroker, an open-source cognitive runtime that unifies a Neo4j knowledge graph with a Qdrant vector store through the Cognee SDK to provide durable, verifiable agent memory. The system implements a complete cognitive loop (store, retrieve, score, compose, protect, learn) comprising a hybrid five-source retrieval pipeline, an eleven-dimension competitive scoring engine for budget-constrained context assembly, a four-state evidence verification model, a five-stage context lifecycle with goal-aware assembly and continuous compaction, a six-layer cheap-first guard pipeline for safety enforcement, an AI firewall providing enforceable tool-call interception and multi-tier safety scanning, a nine-stage consolidation engine that strengthens useful patterns while decaying noise, and a numeric authority model governing multi-organization identity with hierarchical access control. Architectural validation through a comprehensive test suite of over 2,200 tests spanning unit, integration, and end-to-end levels confirms subsystem correctness. The modular design supports three deployment tiers, five profile presets with inheritance, multi-gateway isolation, and a management dashboard for human oversight, enabling configurations from lightweight memory-only agents to full cognitive runtimes with enterprise-grade safety and auditability.
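Of the listed subsystems, the competitive scoring engine for budget-constrained context assembly is the most self-contained to sketch: score each candidate memory across several dimensions, then pack greedily under a token budget. Field names and weights below are hypothetical, not the project's API:

```python
def assemble_context(candidates, budget_tokens, weights):
    """Greedy budget-constrained assembly, highest score-per-token first.
    Each candidate: {'text': str, 'tokens': int, 'scores': {dim: float}}."""
    def score(c):
        return sum(weights.get(d, 0.0) * v for d, v in c["scores"].items())
    ranked = sorted(candidates, key=lambda c: score(c) / c["tokens"],
                    reverse=True)
    chosen, used = [], 0
    for c in ranked:
        if used + c["tokens"] <= budget_tokens:
            chosen.append(c["text"])
            used += c["tokens"]
    return chosen
```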
[365] When Sensing Varies with Contexts: Context-as-Transform for Tactile Few-Shot Class-Incremental Learning
Yifeng Lin, Aiping Huang, Wenxi Liu, Si Wu, Tiesong Zhao, Zheng-Jun Zha
Main category: cs.AI
TL;DR: CaT-FSCIL addresses few-shot class-incremental learning in tactile sensing by modeling acquisition context as transformable components and using uncertainty-conditioned prototype calibration.
Details
Motivation: FSCIL in tactile sensing suffers from performance degradation due to diverse acquisition contexts (devices, contact states, interaction settings) that lack standardization, making few-shot learning particularly challenging.
Method: Decomposes acquisition context into low-dimensional structured component (modeled as invertible Context-as-Transform family) and high-dimensional residual component. Uses inverse-transform canonicalization with pseudo-context consistency loss for the former, and Uncertainty-Conditioned Prototype Calibration (UCPC) for the latter to calibrate biased prototypes and decision boundaries.
Result: Comprehensive experiments on HapTex and LMT108 benchmarks demonstrate the superiority of CaT-FSCIL over existing methods.
Conclusion: The proposed approach effectively handles context variations in tactile sensing FSCIL by separating and addressing different types of context components through specialized mechanisms.
Abstract: Few-Shot Class-Incremental Learning (FSCIL) can be particularly susceptible to acquisition contexts with only a few labeled samples. A typical scenario is tactile sensing, where the acquisition context (e.g., diverse devices, contact state, and interaction settings) degrades performance due to a lack of standardization. In this paper, we propose Context-as-Transform FSCIL (CaT-FSCIL) to tackle the above problem. We decompose the acquisition context into a structured low-dimensional component and a high-dimensional residual component. The former can be easily affected by tactile interaction features, which are modeled as an approximately invertible Context-as-Transform family and handled via inverse-transform canonicalization optimized with a pseudo-context consistency loss. The latter mainly arises from platform and device differences, which can be mitigated with an Uncertainty-Conditioned Prototype Calibration (UCPC) that calibrates biased prototypes and decision boundaries based on context uncertainty. Comprehensive experiments on the standard benchmarks HapTex and LMT108 have demonstrated the superiority of the proposed CaT-FSCIL.
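UCPC's exact form is not given in the abstract; a common shape for such calibration is shrinking a noisy few-shot prototype toward a stable reference in proportion to the estimated context uncertainty. A sketch under that assumption only:

```python
import numpy as np

def calibrate_prototype(few_shot_feats, reference_proto, uncertainty):
    """Shrink the few-shot class prototype toward a reference prototype;
    higher context uncertainty means stronger shrinkage (uncertainty in
    [0, 1]). few_shot_feats: (n_shots, d); reference_proto: (d,)."""
    proto = few_shot_feats.mean(axis=0)
    lam = float(np.clip(uncertainty, 0.0, 1.0))
    return (1.0 - lam) * proto + lam * reference_proto
```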
[366] RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following
Tianjun Pan, Xuan Lin, Wenyan Yang, Qianyu He, Shisong Chen, Licai Qi, Wanqing Xu, Hongwei Feng, Bo Xu, Yanghua Xiao
Main category: cs.AI
TL;DR: RubricEval: A benchmark for meta-evaluating rubric-based evaluation of instruction following in LLMs, revealing limitations in current evaluation methods.
Details
Motivation: Rubric-based evaluation is widely used for assessing instruction following in LLMs, but its reliability remains unclear. Prior meta-evaluation focuses on response level rather than fine-grained rubric-level judgments needed for accurate evaluation.
Method: Introduces RubricEval benchmark with: (1) first rubric-level meta-evaluation benchmark for instruction following, (2) diverse instructions and responses across categories and model sources, (3) 3,486 quality-controlled instances with Easy/Hard subsets to differentiate judge performance.
Result: Rubric-level judging remains challenging: GPT-4o achieves only 55.97% on Hard subset. Rubric-level evaluation outperforms checklist-level, explicit reasoning improves accuracy, and combining both reduces inter-judge variance. Identifies common failure modes through rubric taxonomy.
Conclusion: Provides actionable insights for reliable instruction-following evaluation, highlighting the need for improved rubric-level evaluation methods despite their current limitations.
Abstract: Rubric-based evaluation has become a prevailing paradigm for evaluating instruction following in large language models (LLMs). Despite its widespread use, the reliability of these rubric-level evaluations remains unclear, calling for meta-evaluation. However, prior meta-evaluation efforts largely focus on the response level, failing to assess the fine-grained judgment accuracy that rubric-based evaluation relies on. To bridge this gap, we introduce RubricEval. Our benchmark features: (1) the first rubric-level meta-evaluation benchmark for instruction following, (2) diverse instructions and responses spanning multiple categories and model sources, and (3) a substantial set of 3,486 quality-controlled instances, along with Easy/Hard subsets that better differentiate judge performance. Our experiments reveal that rubric-level judging remains far from solved: even GPT-4o, a widely adopted judge in instruction-following benchmarks, achieves only 55.97% on the Hard subset. Considering the evaluation paradigm, rubric-level evaluation outperforms checklist-level, explicit reasoning improves accuracy, and both together reduce inter-judge variance. Through our established rubric taxonomy, we further identify common failure modes and offer actionable insights for reliable instruction-following evaluation.
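Rubric-level meta-evaluation reduces to comparing per-rubric binary verdicts against gold labels, plus measuring variance across judges. A minimal sketch of both metrics (data layout hypothetical):

```python
import numpy as np

def rubric_level_accuracy(judge_verdicts, gold):
    """judge_verdicts, gold: (n_instances, n_rubrics) binary arrays."""
    return float((np.asarray(judge_verdicts) == np.asarray(gold)).mean())

def inter_judge_variance(all_judges):
    """all_judges: (n_judges, n_instances, n_rubrics) binary array.
    Per-item variance across judges, averaged; lower is more consistent."""
    return float(np.asarray(all_judges, dtype=float).var(axis=0).mean())
```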
[367] UniAI-GraphRAG: Synergizing Ontology-Guided Extraction, Multi-Dimensional Clustering, and Dual-Channel Fusion for Robust Multi-Hop Reasoning
Jie Wang, Honghua Huang, Xi Ge, Jianhui Su, Wen Liu, Shiguo Lian
Main category: cs.AI
TL;DR: UniAI-GraphRAG enhances GraphRAG with ontology-guided extraction, multi-dimensional clustering, and dual-channel retrieval to improve complex reasoning and domain-specific QA performance.
Details
Motivation: Existing GraphRAG frameworks have limitations in cross-industry adaptability, community report integrity, and retrieval performance for complex reasoning, multi-hop queries, and domain-specific QA tasks.
Method: Three core innovations: (1) Ontology-Guided Knowledge Extraction using predefined Schema to guide LLMs in identifying domain-specific entities/relations; (2) Multi-Dimensional Community Clustering Strategy with alignment completion, attribute-based clustering, and multi-hop relationship clustering; (3) Dual-Channel Graph Retrieval Fusion balancing QA accuracy and performance through hybrid graph and community retrieval.
Result: Outperforms mainstream open source solutions (e.g., LightRAG) in comprehensive F1 scores on MultiHopRAG benchmark, particularly in inference and temporal queries.
Conclusion: UniAI-GraphRAG provides an enhanced framework that addresses limitations of existing GraphRAG systems for complex reasoning and domain-specific QA through structured knowledge organization and improved retrieval mechanisms.
Abstract: Retrieval-Augmented Generation (RAG) systems face significant challenges in complex reasoning, multi-hop queries, and domain-specific QA. While existing GraphRAG frameworks have made progress in structural knowledge organization, they still have limitations in cross-industry adaptability, community report integrity, and retrieval performance. This paper proposes UniAI-GraphRAG, an enhanced framework built upon open-source GraphRAG. The framework introduces three core innovations: (1) Ontology-Guided Knowledge Extraction that uses predefined Schema to guide LLMs in accurately identifying domain-specific entities and relations; (2) Multi-Dimensional Community Clustering Strategy that improves community completeness through alignment completion, attribute-based clustering, and multi-hop relationship clustering; (3) Dual-Channel Graph Retrieval Fusion that balances QA accuracy and performance through hybrid graph and community retrieval. Evaluation results on the MultiHopRAG benchmark show that UniAI-GraphRAG outperforms mainstream open-source solutions (e.g., LightRAG) in comprehensive F1 scores, particularly in inference and temporal queries. The code is available at https://github.com/UnicomAI/wanwu/tree/main/rag/rag_open_source/rag_core/graph.
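The concrete fusion logic lives in the linked repository; the general shape of a dual-channel fusion, merging entity-graph hits with community-report hits under one score, might look like this (alpha and the score normalization are assumptions):

```python
def dual_channel_fuse(graph_hits, community_hits, alpha=0.6, top_k=10):
    """Merge two retrieval channels. Each hit is (doc_id, score) with
    scores assumed normalized to [0, 1]; alpha weights the graph channel."""
    fused = {}
    for doc_id, s in graph_hits:
        fused[doc_id] = fused.get(doc_id, 0.0) + alpha * s
    for doc_id, s in community_hits:
        fused[doc_id] = fused.get(doc_id, 0.0) + (1 - alpha) * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```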
[368] Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, Guanjun Jiang
Main category: cs.AI
TL;DR: Trace2Skill: A framework for automatically generating robust LLM agent skills by analyzing diverse execution trajectories in parallel and hierarchically consolidating lessons into transferable declarative skills.
Details
Motivation: Manual skill authoring for LLM agents doesn't scale, while automated approaches often produce fragile skills due to shallow parametric knowledge or overfitting to specific trajectories.
Method: Uses parallel sub-agents to analyze diverse execution traces, extracts trajectory-specific lessons, then hierarchically consolidates them into unified conflict-free skill directories via inductive reasoning.
Result: Significantly outperforms baselines including Anthropic’s official skills in spreadsheet, VisionQA, and math reasoning domains. Skills transfer across LLM scales and generalize to OOD settings (e.g., up to 57.65 absolute percentage points on WikiTableQuestions).
Conclusion: Complex agent experience can be packaged into highly transferable declarative skills without parameter updates, external retrieval, or large models (works with 35B parameters).
Abstract: Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex tasks. Yet, manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow parametric knowledge or sequentially overfits to non-generalizable trajectory-local lessons. To overcome this, we introduce Trace2Skill, a framework that mirrors how human experts author skills: by holistically analyzing broad execution experience before distilling it into a single, comprehensive guide. Instead of reacting sequentially to individual trajectories, Trace2Skill dispatches a parallel fleet of sub-agents to analyze a diverse pool of executions. It extracts trajectory-specific lessons and hierarchically consolidates them into a unified, conflict-free skill directory via inductive reasoning. Trace2Skill supports both deepening existing human-written skills and creating new ones from scratch. Experiments in challenging domains, such as spreadsheet, VisionQA and math reasoning, show that Trace2Skill significantly improves upon strong baselines, including Anthropic’s official xlsx skills. Crucially, this trajectory-grounded evolution does not merely memorize task instances or model-specific quirks: evolved skills transfer across LLM scales and generalize to OOD settings. For example, skills evolved by Qwen3.5-35B on its own trajectories improved a Qwen3.5-122B agent by up to 57.65 absolute percentage points on WikiTableQuestions. Ultimately, our results demonstrate that complex agent experience can be packaged into highly transferable, declarative skills – requiring no parameter updates, no external retrieval modules, and utilizing open-source models as small as 35B parameters.
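The map-then-reduce structure, parallel lesson extraction followed by hierarchical consolidation, is straightforward to sketch; `llm` below stands in for whatever model call the framework actually uses, and the prompts are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def distill_skills(trajectories, llm):
    """Map: extract one lesson per trajectory in parallel.
    Reduce: pairwise-merge lessons into a single conflict-free skill doc."""
    with ThreadPoolExecutor() as pool:
        lessons = list(pool.map(
            lambda t: llm(f"Extract a reusable lesson from:\n{t}"),
            trajectories))
    while len(lessons) > 1:                   # hierarchical consolidation
        merged = [llm(f"Merge into one consistent skill, resolving any "
                      f"conflicts:\n1) {a}\n2) {b}")
                  for a, b in zip(lessons[::2], lessons[1::2])]
        if len(lessons) % 2:                  # carry the odd lesson forward
            merged.append(lessons[-1])
        lessons = merged
    return lessons[0]
```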
[369] The Competence Shadow: Theory and Bounds of AI Assistance in Safety Engineering
Umair Siddique
Main category: cs.AI
TL;DR: AI assistance in safety engineering creates systematic blind spots called “competence shadows” that narrow human reasoning, requiring workflow design rather than just tool selection for trustworthy Physical AI systems.
Details
Motivation: As AI assistants integrate into safety engineering for Physical AI systems, there's a critical need to understand whether AI improves safety analysis quality or introduces systematic blind spots that only surface post-deployment. The paper addresses the fundamental tension between AI assistance and safety competence.
Method: Develops a formal framework for AI assistance in safety analysis, establishing a five-dimensional competence framework (domain knowledge, standards expertise, operational experience, contextual understanding, judgment). Introduces the concept of “competence shadow” - systematic narrowing of human reasoning induced by AI-generated safety analysis. Formalizes four canonical human-AI collaboration structures and derives closed-form performance bounds.
Result: Demonstrates that competence shadow compounds multiplicatively to produce degradation far exceeding naive additive estimates. Shows that AI assistance in safety engineering is fundamentally a collaboration design problem, not a software procurement decision - the same tool can degrade or improve analysis quality depending entirely on how it’s used.
Conclusion: Derives non-degradation conditions for shadow-resistant workflows and calls for a shift from tool qualification toward workflow qualification for trustworthy Physical AI. The central insight is that workflow design determines whether AI assistance enhances or diminishes safety analysis quality.
Abstract: As AI assistants become integrated into safety engineering workflows for Physical AI systems, a critical question emerges: does AI assistance improve safety analysis quality, or introduce systematic blind spots that surface only through post-deployment incidents? This paper develops a formal framework for AI assistance in safety analysis. We first establish why safety engineering resists benchmark-driven evaluation: safety competence is irreducibly multidimensional, constrained by context-dependent correctness, inherent incompleteness, and legitimate expert disagreement. We formalize this through a five-dimensional competence framework capturing domain knowledge, standards expertise, operational experience, contextual understanding, and judgment. We introduce the competence shadow: the systematic narrowing of human reasoning induced by AI-generated safety analysis. The shadow is not what the AI presents, but what it prevents from being considered. We formalize four canonical human-AI collaboration structures and derive closed-form performance bounds, demonstrating that the competence shadow compounds multiplicatively to produce degradation far exceeding naive additive estimates. The central finding is that AI assistance in safety engineering is a collaboration design problem, not a software procurement decision. The same tool degrades or improves analysis quality depending entirely on how it is used. We derive non-degradation conditions for shadow-resistant workflows and call for a shift from tool qualification toward workflow qualification for trustworthy Physical AI.
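The compounding claim has a familiar arithmetic shape. As a purely illustrative instance (the paper derives closed-form bounds, not this toy), suppose each of $k$ assisted analysis stages inflates the expected miss rate by a factor $(1+\delta)$ rather than adding $\delta$:

```latex
(1+\delta)^k \;\ge\; 1 + k\delta \quad\text{(Bernoulli's inequality)},
\qquad \text{e.g. } \delta = 0.2,\; k = 6:\quad
(1.2)^6 \approx 2.99 \ \text{vs. the additive estimate } 1 + 6(0.2) = 2.2 .
```

The gap between the two widens with every additional stage, which is the sense in which naive additive estimates understate the degradation.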
[370] Probabilistic Abstract Interpretation on Neural Networks via Grids Approximation
Zhuofan Zhang, Herbert Wiklicky
Main category: cs.AI
TL;DR: Applying probabilistic abstract interpretation theory to analyze neural networks’ density distribution flow for infinite input spaces
Details
Motivation: To analyze neural networks when testing all inputs is infeasible due to uncountably infinite or countably infinite input spaces, using abstract interpretation theory to extract properties.
Method: Apply probabilistic abstract interpretation theory to neural networks, discuss abstract domains, Moore-Penrose pseudo-inverses, and abstract transformers within the framework.
Result: Framework successfully applied to neural networks with experimental examples demonstrating analysis of real-world problems
Conclusion: Probabilistic abstract interpretation provides a theoretical framework for analyzing neural networks with infinite input spaces, enabling property extraction without exhaustive testing
Abstract: Probabilistic abstract interpretation is a theory used to extract particular properties of a computer program when it is infeasible to test every single input. In this paper we apply the theory to neural networks for the same purpose: to analyse the density distribution flow of all possible inputs of a neural network when the network has uncountably many or countably infinitely many inputs. We show how this theoretical framework works in neural networks and then discuss different abstract domains and corresponding Moore-Penrose pseudo-inverses together with abstract transformers used in the framework. We also present experimental examples to show how this framework helps to analyse real-world problems.
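A toy version of the grids abstract domain, for flavor only: represent the input density as probability mass on grid cells, then push it through a unit by mapping cell midpoints (the paper develops the proper abstraction via Moore-Penrose pseudo-inverses, which this sketch omits):

```python
import numpy as np

def grid_abstraction(samples, lo, hi, n_cells):
    """Abstract an empirical 1-D input distribution onto a grid."""
    hist, edges = np.histogram(samples, bins=n_cells, range=(lo, hi))
    return hist / hist.sum(), edges

def push_through(probs, edges, f, lo, hi, n_cells):
    """Push grid probability mass through f by transporting each
    cell's mass via its midpoint and re-binning the result."""
    mids = 0.5 * (edges[:-1] + edges[1:])
    out_edges = np.linspace(lo, hi, n_cells + 1)
    idx = np.clip(np.digitize(f(mids), out_edges) - 1, 0, n_cells - 1)
    out = np.zeros(n_cells)
    for i, p in zip(idx, probs):
        out[i] += p
    return out, out_edges

x = np.random.default_rng(0).normal(0.0, 1.0, 10_000)
p, e = grid_abstraction(x, -4.0, 4.0, 32)
q, _ = push_through(p, e, lambda v: np.maximum(0.0, v), -4.0, 4.0, 32)
# q now approximates the output density of a single ReLU unit.
```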
[371] Distribution and Clusters Approximations as Abstract Domains in Probabilistic Abstract Interpretation to Neural Network Analysis
Zhuofan Zhang, Herbert Wiklicky
Main category: cs.AI
TL;DR: Introduces two novel approximation methods (distribution and clusters approximation) for probabilistic abstract interpretation of neural networks, extending beyond traditional grids approximation.
Details
Motivation: To improve neural network analysis through probabilistic abstract interpretation by introducing more sophisticated approximation methods beyond the existing grids approximation, enabling better analysis of density distribution flows of all possible inputs.
Method: Proposes two new approximation methods: 1) Distribution approximation, and 2) Clusters approximation, with corresponding abstract transformers. Provides theoretical foundations and illustrates with simple examples.
Result: Theoretical demonstration of how distribution and clusters approximation methods work within the probabilistic abstract interpretation framework, showing their potential advantages over traditional grids approximation.
Conclusion: Distribution and clusters approximation offer promising alternatives to grids approximation for neural network analysis via probabilistic abstract interpretation, potentially providing more accurate or efficient analysis of input distributions.
Abstract: The probabilistic abstract interpretation framework of neural network analysis analyzes a neural network by analyzing its density distribution flow of all possible inputs. The grids approximation is one of the abstract domains the framework uses, abstracting the concrete space into grids. In this paper, we introduce two novel approximation methods: distribution approximation and clusters approximation. We show how these two methods work in theory, together with their corresponding abstract transformers, illustrated through simple examples.
[372] A Gait Foundation Model Predicts Multi-System Health Phenotypes from 3D Skeletal Motion
Adam Gabet, Sarah Kohn, Guy Lutsker, Shira Gelman, Anastasia Godneva, Gil Sasson, Arad Zulti, David Krongauz, Rotem Shaulitch, Assaf Rotem, Ohad Doron, Yuval Brodsky, Adina Weinberger, Eran Segal
Main category: cs.AI
TL;DR: A gait foundation model using 3D skeletal motion from depth camera recordings predicts multiple phenotypic traits and clinical outcomes, establishing gait as a multi-system biomarker.
Details
Motivation: Current approaches treat gait as a symptom of specific pathologies rather than a systemic biomarker. The researchers aim to develop gait as a comprehensive vital sign that can predict multiple health outcomes across body systems.
Method: Developed a gait foundation model using 3D skeletal motion data from 3,414 adults recorded via depth camera during five motor tasks. Used learned embeddings from the model to predict various phenotypic targets and compared performance against engineered features.
Result: Learned embeddings outperformed engineered features in predicting age (r=0.69), BMI (r=0.90), and visceral adipose tissue area (r=0.82). Gait embeddings significantly predicted 1,980 of 3,210 phenotypic targets and provided independent predictive gains across nearly all body systems after adjusting for covariates. Anatomical ablation showed different body parts encode different phenotypic information.
Conclusion: Gait functions as an independent multi-system biosignal that can predict diverse health outcomes. This establishes the foundation for translating gait analysis to consumer-grade video as a scalable, passive vital sign for comprehensive health monitoring.
Abstract: Gait is increasingly recognized as a vital sign, yet current approaches treat it as a symptom of specific pathologies rather than a systemic biomarker. We developed a gait foundation model for 3D skeletal motion from 3,414 deeply phenotyped adults, recorded via a depth camera during five motor tasks. Learned embeddings outperformed engineered features, predicting age (Pearson r = 0.69), BMI (r = 0.90), and visceral adipose tissue area (r = 0.82). Embeddings significantly predicted 1,980 of 3,210 phenotypic targets; after adjustment for age, BMI, VAT, and height, gait provided independent gains in all 18 body systems in males and 17 of 18 in females, and improved prediction of clinical diagnoses and medication use. Anatomical ablation revealed that legs dominated metabolic and frailty predictions while torso encoded sleep and lifestyle phenotypes. These findings establish gait as an independent multi-system biosignal, motivating translation to consumer-grade video and its integration as a scalable, passive vital sign.
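The phenotype results are standard probes on frozen embeddings. A minimal sketch of one such probe (ridge regression from gait embeddings to age), run on synthetic stand-in data since the cohort data is not public here:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
emb = rng.normal(size=(3414, 256))     # stand-in: one embedding per adult
age = 40 + 3 * emb[:, :8].sum(axis=1) + rng.normal(scale=5, size=3414)

X_tr, X_te, y_tr, y_te = train_test_split(emb, age, random_state=0)
probe = Ridge(alpha=10.0).fit(X_tr, y_tr)
r, _ = pearsonr(probe.predict(X_te), y_te)
print(f"Pearson r on held-out participants: {r:.2f}")
```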
[373] SliderQuant: Accurate Post-Training Quantization for LLMs
Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan, Zhonghong Ou, Anbang Yao
Main category: cs.AI
TL;DR: SliderQuant: A novel post-training quantization framework for LLMs that uses adaptive sliding quantization with learnable parameters to address varying quantization sensitivity across different layers, outperforming existing PTQ methods.
Details
Motivation: Current PTQ methods treat all LLM layers equally during quantization, but different layers have varying sensitivity to quantization - shallow and deep layers are more sensitive than intermediate layers, with first and last layers being most sensitive. This uniform approach is suboptimal for challenging low-bit quantization.
Method: Proposes SliderQuant with two components: (1) Inter-layer sliding quantization using three types of sliding window designs tailored for shallow, intermediate, and deep layers; (2) Intra-layer sliding quantization using incremental window quantization. The framework uses few learnable parameters for adaptive sliding quantization.
Result: Extensive experiments show SliderQuant outperforms existing PTQ methods (including latest rotation-based methods) for both weight-only and weight-activation quantization across various LLMs (Llama/Llama2/Llama3/Qwen2.5 families, DeepSeek-R1 distilled models, and large MoE models) on language generation, zero-shot reasoning, and challenging math/code tasks.
Conclusion: SliderQuant effectively addresses varying quantization sensitivity across LLM layers through adaptive sliding quantization, demonstrating superior performance over existing PTQ methods and providing a more optimal quantization framework for LLMs.
Abstract: In this paper, we address post-training quantization (PTQ) for large language models (LLMs) from an overlooked perspective: given a pre-trained high-precision LLM, the predominant sequential quantization framework treats different layers equally, but this may not be optimal in challenging bit-width settings. We empirically study the quantization impact of different layers on model accuracy, and observe that: (1) shallow/deep layers are usually more sensitive to quantization than intermediate layers; (2) among shallow/deep layers, the most sensitive one is the first/last layer, which exhibits significantly larger quantization error than others. These empirical observations imply that the quantization design for different layers of LLMs is required on multiple levels instead of a single level shared across all layers. Motivated by this, we propose a new PTQ framework termed Sliding-layer Quantization (SliderQuant) that relies on a simple adaptive sliding quantization concept facilitated by few learnable parameters. The base component of SliderQuant is called inter-layer sliding quantization, which incorporates three types of novel sliding window designs tailored for addressing the varying quantization sensitivity of shallow, intermediate and deep layers. The other component is called intra-layer sliding quantization that leverages an incremental strategy to quantize each window. As a result, SliderQuant has a strong ability to reduce quantization errors across layers. Extensive experiments on basic language generation, zero-shot commonsense reasoning and challenging math and code tasks with various LLMs, including Llama/Llama2/Llama3/Qwen2.5 model families, DeepSeek-R1 distilled models and large MoE models, show that our method outperforms existing PTQ methods (including the latest PTQ methods using rotation transformations) for both weight-only quantization and weight-activation quantization.
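SliderQuant's sliding-window designs are the paper's contribution and are not reproduced here; the uniform per-channel weight quantizer below is only the baseline primitive that any such PTQ scheme applies once it has decided how carefully to treat each layer:

```python
import numpy as np

def quantize_per_channel(W, bits):
    """Symmetric uniform per-output-channel weight quantization
    ("fake quant": quantize then dequantize). A sensitivity-aware scheme
    would spend its effort unevenly, e.g. most on the first and last
    layers and least on the intermediate ones."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

W = np.random.default_rng(0).normal(size=(128, 512))
for b in (4, 8):
    err = np.abs(quantize_per_channel(W, b) - W).mean()
    print(f"{b}-bit mean abs error: {err:.4f}")
```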
[374] DAGverse: Building Document-Grounded Semantic DAGs from Scientific Papers
Shu Wan, Saketh Vishnubhatla, Iskander Kushbay, Tom Heffernan, Aaron Belikoff, Raha Moraffah, Huan Liu
Main category: cs.AI
TL;DR: DAGverse framework extracts semantic DAGs from scientific papers using DAG figures as supervision and accompanying text as evidence, creating a dataset of document-grounded DAGs.
Details
Motivation: Real-world DAG datasets are scarce due to expert annotation requirements. Scientific papers with explicit DAG figures provide natural supervision for recovering semantic DAGs with supporting evidence from documents.
Method: DAGverse-Pipeline: semi-automatic system with figure classification, graph reconstruction, semantic grounding, and validation components. Uses DAG figures as structure supervision and accompanying text as evidence.
Result: Created DAGverse-1 dataset with 108 expert-validated semantic DAGs with graph/node/edge-level evidence. Outperforms existing Vision-Language Models on DAG classification and annotation tasks.
Conclusion: DAGverse provides foundation for document-grounded DAG benchmarks and enables structured reasoning grounded in real-world evidence from multimodal scientific documents.
Abstract: Directed Acyclic Graphs (DAGs) are widely used to represent structured knowledge in scientific and technical domains. However, datasets for real-world DAGs remain scarce because constructing them typically requires expert interpretation of domain documents. We study Doc2SemDAG construction: recovering a preferred semantic DAG from a document together with the cited evidence and context that explain it. This problem is challenging because a document may admit multiple plausible abstractions, the intended structure is often implicit, and the supporting evidence is scattered across prose, equations, captions, and figures. To address these challenges, we leverage scientific papers containing explicit DAG figures as a natural source of supervision. In this setting, the DAG figure provides the DAG structure, while the accompanying text provides context and explanation. We introduce DAGverse, a framework for constructing document-grounded semantic DAGs from online scientific papers. Its core component, DAGverse-Pipeline, is a semi-automatic system designed to produce high-precision semantic DAG examples through figure classification, graph reconstruction, semantic grounding, and validation. As a case study, we test the framework for causal DAGs and release DAGverse-1, a dataset of 108 expert-validated semantic DAGs with graph-level, node-level, and edge-level evidence. Experiments show that DAGverse-Pipeline outperforms existing Vision-Language Models on DAG classification and annotation. DAGverse provides a foundation for document-grounded DAG benchmarks and opens new directions for studying structured reasoning grounded in real-world evidence.
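The reconstruction-and-validation step maps naturally onto a graph library: assemble extracted edges into a directed graph, attach evidence spans, and reject cyclic reconstructions. A sketch with networkx (field names and example content are hypothetical):

```python
import networkx as nx

def build_semantic_dag(edges, node_evidence, edge_evidence):
    """edges: list of (src, dst); the evidence dicts map a node or an
    (src, dst) pair to text spans cited from the paper."""
    g = nx.DiGraph()
    for u, v in edges:
        g.add_edge(u, v, evidence=edge_evidence.get((u, v), []))
    for n in g.nodes:
        g.nodes[n]["evidence"] = node_evidence.get(n, [])
    if not nx.is_directed_acyclic_graph(g):
        raise ValueError("reconstructed graph contains a cycle")
    return g

dag = build_semantic_dag(
    [("smoking", "tar"), ("tar", "cancer")],
    {"smoking": ["Sec. 2, para. 1"]},
    {("tar", "cancer"): ["Eq. (3) discussion"]})
print(list(nx.topological_sort(dag)))  # ['smoking', 'tar', 'cancer']
```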
[375] Evaluating Language Models for Harmful Manipulation
Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger
Main category: cs.AI
TL;DR: Framework for evaluating harmful AI manipulation through context-specific human-AI interaction studies across domains and geographies, showing AI can induce belief/behavior changes with domain/regional variations.
Details
Motivation: Current approaches to evaluating AI-driven harmful manipulation are limited, creating a need for better evaluation frameworks that account for context-specific factors in human-AI interactions.
Method: Developed a framework for evaluating harmful AI manipulation using context-specific human-AI interaction studies with 10,101 participants across three domains (public policy, finance, health) and three locales (US, UK, India).
Result: AI models can produce manipulative behaviors and induce belief/behavior changes; effects vary significantly by domain and geography; manipulative propensity doesn’t consistently predict efficacy.
Conclusion: Context matters for AI manipulation evaluation; domain and geographic differences require separate assessment; framework provides practical testing protocols for broader adoption.
Abstract: Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others. Finally, we find that the frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately. To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available. We conclude by discussing open challenges in evaluating harmful manipulation by AI models.
[376] Macroscopic Characteristics of Mixed Traffic Flow with Deep Reinforcement Learning Based Automated and Human-Driven Vehicles
Pankaj Kumar, Pranamesh Chakraborty, Subrahmanya Swamy Peruru
Main category: cs.AI
TL;DR: DRL-based AV control improves traffic capacity by 7.52% and fuel efficiency by up to 28.98% compared to traditional models in mixed traffic scenarios.
Details
Motivation: Traditional car-following models like IDM struggle to generalize across diverse traffic scenarios and don't account for fuel efficiency, motivating learning-based approaches for AV control in mixed traffic.
Method: Implemented Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm for AV control, trained on NGSIM highway dataset, evaluated traffic performance using Fundamental Diagram under varying conditions including driver heterogeneity and RL vehicle penetration.
Result: Transition from fully human-driven to fully RL-controlled traffic increases road capacity by ~7.52%. RL-based AVs improve fuel efficiency by ~28.98% at higher speeds (>50 km/h) and 1.86% at lower speeds (<50 km/h) compared to IDM.
Conclusion: DRL framework enhances traffic capacity and fuel efficiency without compromising safety, showing sensitivity to safe time gap distribution and RL vehicle proportion in mixed traffic.
Abstract: Automated Vehicle (AV) control in mixed traffic, where AVs coexist with human-driven vehicles, poses significant challenges in balancing safety, efficiency, comfort, fuel efficiency, and compliance with traffic rules while capturing heterogeneous driver behavior. Traditional car-following models, such as the Intelligent Driver Model (IDM), often struggle to generalize across diverse traffic scenarios and typically do not account for fuel efficiency, motivating the use of learning-based approaches. Although Deep Reinforcement Learning (DRL) has shown strong microscopic performance in car-following conditions, its macroscopic traffic flow characteristics remain underexplored. This study focuses on analyzing the macroscopic traffic flow characteristics and fuel efficiency of DRL-based models in mixed traffic. A Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is implemented for AVs’ control and trained using the NGSIM highway dataset, enabling realistic interaction with human-driven vehicles. Traffic performance is evaluated using the Fundamental Diagram (FD) under varying driver heterogeneity, heterogeneous time-gap penetration levels, and different shares of RL-controlled vehicles. A macroscopic level comparison of fuel efficiency between the RL-based AV model and the IDM is also conducted. Results show that traffic performance is sensitive to the distribution of safe time gaps and the proportion of RL vehicles. Transitioning from fully human-driven to fully RL-controlled traffic can increase road capacity by approximately 7.52%. Further, RL-based AVs also improve average fuel efficiency by about 28.98% at higher speeds (above 50 km/h), and by 1.86% at lower speeds (below 50 km/h) compared to the IDM. Overall, the DRL framework enhances traffic capacity and fuel efficiency without compromising safety.
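The macroscopic evaluation hinges on the Fundamental Diagram (flow q versus density k). A minimal Python sketch of computing one q-k point, assuming a ring-road setting; the numbers and setup are illustrative, not from the paper.

```python
# Sketch: Fundamental Diagram points from aggregate traffic quantities.
def fd_point(mean_speed_mps: float, n_veh: int, road_len_m: float):
    """One FD point: density k (veh/km) and flow q = k * v (veh/h)."""
    k = n_veh / (road_len_m / 1000.0)
    return k, k * mean_speed_mps * 3.6  # m/s -> km/h, so q is veh/h

# Sweeping vehicle counts on a 1 km ring road traces out the q-k curve;
# capacity is the peak flow, the quantity the paper reports a 7.52% gain in.
points = [fd_point(v, n, 1000.0) for n, v in [(20, 25.0), (60, 15.0), (120, 5.0)]]
```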
[377] Agentic Trust Coordination for Federated Learning through Adaptive Thresholding and Autonomous Decision Making in Sustainable and Resilient Industrial Networks
Paul Shepherd, Tasos Dagiuklas, Bugra Alkan, Jonathan Rodriguez
Main category: cs.AI
TL;DR: A lightweight agentic trust coordination approach for federated learning in industrial networks that uses a server-side control loop to monitor trust signals and apply targeted adjustments for stable FL operation.
Details
Motivation: Federated learning in industrial networks faces reliability issues due to inconsistent client behavior, noisy sensing conditions, and adversarial updates. Existing trust mechanisms are statistical/heuristic with fixed parameters that struggle with changing conditions.
Method: Proposes an Agentic Trust Control Layer operating as a server-side control loop that observes trust-related and system-level signals, interprets their evolution over time, and applies targeted trust adjustments when instability is detected. Separates observation, reasoning, and action without modifying client-side training.
Result: The approach enables context-aware intervention decisions rather than fixed or reactive parameter updates, supporting stable FL operation without increasing communication overhead or modifying client training.
Conclusion: The agentic trust coordination framework provides a lightweight solution for maintaining reliable federated learning in resource-constrained industrial networks by intelligently adapting to changing conditions through server-side control.
Abstract: Distributed intelligence in industrial networks increasingly integrates sensing, communication, and computation across heterogeneous and resource-constrained devices. Federated learning (FL) enables collaborative model training in such environments, but its reliability is affected by inconsistent client behaviour, noisy sensing conditions, and the presence of faulty or adversarial updates. Trust-based mechanisms are commonly used to mitigate these effects, yet most remain statistical and heuristic, relying on fixed parameters or simple adaptive rules that struggle to accommodate changing operating conditions. This paper presents a lightweight agentic trust coordination approach for FL in sustainable and resilient industrial networks. The proposed Agentic Trust Control Layer operates as a server-side control loop that observes trust-related and system-level signals, interprets their evolution over time, and applies targeted trust adjustments when instability is detected. The approach extends prior adaptive trust mechanisms by enabling context-aware intervention decisions, rather than relying on fixed or purely reactive parameter updates. By explicitly separating observation, reasoning, and action, the proposed framework supports stable FL operation without modifying client-side training or increasing communication overhead.
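A minimal sketch of the observe/reason/act separation the abstract describes, in Python; the signal names, windowing, and the down-weighting rule are hypothetical, since the paper does not spell out its concrete policy here.

```python
# Sketch: server-side trust control loop for federated aggregation.
from collections import deque

class TrustControlLayer:
    def __init__(self, window: int = 5, drift_threshold: float = 0.15):
        self.history = {}  # client_id -> recent trust scores
        self.window = window
        self.drift_threshold = drift_threshold

    def observe(self, client_id: str, trust_score: float) -> None:
        self.history.setdefault(client_id,
                                deque(maxlen=self.window)).append(trust_score)

    def reason(self, client_id: str) -> bool:
        """Flag instability when trust drifts sharply within the window."""
        h = self.history.get(client_id)
        return bool(h) and len(h) == self.window \
            and (max(h) - min(h)) > self.drift_threshold

    def act(self, client_id: str, agg_weights: dict) -> None:
        """Targeted adjustment: down-weight an unstable client's updates,
        leaving client-side training untouched."""
        if self.reason(client_id):
            agg_weights[client_id] = 0.5 * agg_weights.get(client_id, 1.0)
```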
[378] 4OPS: Structural Difficulty Modeling in Integer Arithmetic Puzzles
Yunus E. Zeytuncu
Main category: cs.AI
TL;DR: This paper analyzes difficulty in arithmetic puzzle games using exact dynamic programming to create a large dataset and identify structural determinants of puzzle difficulty.
Details
Motivation: To understand structural determinants of difficulty in mathematical reasoning tasks, particularly in adaptive learning systems, using arithmetic puzzles as a controlled setting.
Method: Developed exact dynamic-programming solver to enumerate reachable targets, extract minimal-operation witnesses, and construct dataset of 3.4M instances. Analyzed difficulty via minimum operations required, examined solver-derived features, and tested baseline ML models.
Result: Baseline ML models partially predict solvability but fail to distinguish easy instances. Difficulty is fully determined by a small set of interpretable structural attributes, with number of input values used in minimal construction serving as minimal sufficient statistic.
Conclusion: Provides transparent, computationally grounded account of puzzle difficulty bridging symbolic reasoning and data-driven modeling, supporting explainable difficulty estimation and principled task sequencing for adaptive arithmetic learning.
Abstract: Arithmetic puzzle games provide a controlled setting for studying difficulty in mathematical reasoning tasks, a core challenge in adaptive learning systems. We investigate the structural determinants of difficulty in a class of integer arithmetic puzzles inspired by number games. We formalize the problem and develop an exact dynamic-programming solver that enumerates reachable targets, extracts minimal-operation witnesses, and enables large-scale labeling. Using this solver, we construct a dataset of over 3.4 million instances and define difficulty via the minimum number of operations required to reach a target. We analyze the relationship between difficulty and solver-derived features. While baseline machine learning models based on bag- and target-level statistics can partially predict solvability, they fail to reliably distinguish easy instances. In contrast, we show that difficulty is fully determined by a small set of interpretable structural attributes derived from exact witnesses. In particular, the number of input values used in a minimal construction serves as a minimal sufficient statistic for difficulty under this labeling. These results provide a transparent, computationally grounded account of puzzle difficulty that bridges symbolic reasoning and data-driven modeling. The framework supports explainable difficulty estimation and principled task sequencing, with direct implications for adaptive arithmetic learning and intelligent practice systems.
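As a rough illustration of the exact-DP labeling the abstract describes, here is a minimal Python sketch that enumerates every reachable integer target with its minimum operation count; the operation set (integer +, -, *, and exact division) is an assumption, not the paper's confirmed rule set.

```python
# Sketch: exact DP over sub-multisets of the input values.
from functools import lru_cache

def reachable(values):
    """Map each reachable integer target to the minimum ops needed."""
    @lru_cache(maxsize=None)
    def solve(bag):  # bag: sorted tuple (a canonical multiset key)
        if len(bag) == 1:
            return {bag[0]: 0}
        out, n = {}, len(bag)
        # Every binary split of the multiset, by index bitmask.
        for mask in range(1, 2 ** n - 1):
            left = tuple(sorted(bag[i] for i in range(n) if mask >> i & 1))
            right = tuple(sorted(bag[i] for i in range(n) if not mask >> i & 1))
            for a, ca in solve(left).items():
                for b, cb in solve(right).items():
                    cands = [a + b, a - b, b - a, a * b]
                    if b != 0 and a % b == 0: cands.append(a // b)
                    if a != 0 and b % a == 0: cands.append(b // a)
                    for t in cands:
                        c = ca + cb + 1  # one more operation used
                        if t not in out or c < out[t]:
                            out[t] = c
        return out
    return solve(tuple(sorted(values)))

# e.g. the minimum-ops difficulty table for one puzzle instance
targets = reachable((2, 3, 4, 5))
```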
[379] Does Structured Intent Representation Generalize? A Cross-Language, Cross-Model Empirical Study of 5W3H Prompting
Peng Gang
Main category: cs.AI
TL;DR: PPS (5W3H framework) for structured intent representation generalizes across English, Japanese, and Chinese, with AI-assisted expansion showing comparable performance to manual crafting while reducing user effort.
Details
Motivation: Extend prior Chinese-only evidence on structured intent representation (PPS/5W3H framework) to test generalization across languages (English, Japanese), explore AI-assisted authoring, and examine cross-model consistency.
Method: Study across 3 languages (English, Japanese, Chinese) x 4 conditions (including AI-expanded 5W3H) x 3 LLMs x 60 tasks (2,160 total outputs). Compare goal alignment, cross-model variance, and identify biases in unstructured prompts.
Result: AI-expanded 5W3H prompts perform similarly to manually crafted ones across all languages. Structured prompts reduce/reshape cross-model variance (effects vary by language/metric). Unstructured prompts show dual-inflation bias (artificially high scores, low apparent variance).
Conclusion: Structured 5W3H representations improve intent alignment and accessibility across languages/models, especially when AI-assisted authoring lowers barriers for non-expert users.
Abstract: Does structured intent representation generalize across languages and models? We study PPS (Prompt Protocol Specification), a 5W3H-based framework for structured intent representation in human-AI interaction, and extend prior Chinese-only evidence along three dimensions: two additional languages (English and Japanese), a fourth condition in which a user’s simple prompt is automatically expanded into a full 5W3H specification by an AI-assisted authoring interface, and a new research question on cross-model output consistency. Across 2,160 model outputs (3 languages x 4 conditions x 3 LLMs x 60 tasks), we find that AI-expanded 5W3H prompts (Condition D) show no statistically significant difference in goal alignment from manually crafted 5W3H prompts (Condition C) across all three languages, while requiring only a single-sentence input from the user. Structured PPS conditions often reduce or reshape cross-model output variance, though this effect is not uniform across languages and metrics; the strongest evidence comes from identifying spurious low variance in unconstrained baselines. We also show that unstructured prompts exhibit a systematic dual-inflation bias: artificially high composite scores and artificially low apparent cross-model variance. These findings suggest that structured 5W3H representations can improve intent alignment and accessibility across languages and models, especially when AI-assisted authoring lowers the barrier for non-expert users.
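A minimal sketch of the AI-assisted expansion step (the paper's Condition D): a one-sentence request becomes a structured 5W3H specification before reaching the task model. The exact prompt wording and the particular 3H slot names are illustrative, not from the paper.

```python
# Sketch: expanding a one-sentence request into a 5W3H specification.
FIELDS = ["Who", "What", "When", "Where", "Why",
          "How", "How much", "How many"]  # one common reading of 5W3H

def expansion_prompt(user_request: str) -> str:
    slots = "\n".join(f"- {f}:" for f in FIELDS)
    return ("Expand the following one-sentence request into a full 5W3H "
            "specification, filling every slot explicitly.\n"
            f"Request: {user_request}\n{slots}")

# The expanded specification, not the raw sentence, is what the task model sees.
print(expansion_prompt("Summarize this quarterly report for executives."))
```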
[380] Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models
Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li, Ruixuan Huang, Zhenlan Ji, Pingchuan Ma, Shuai Wang
Main category: cs.AI
TL;DR: Paper introduces reasoning safety as a new security dimension for LLMs, defining unsafe reasoning behaviors, studying their prevalence, and proposing a real-time monitoring system to detect and interrupt unsafe reasoning chains.
Details
Motivation: While LLMs increasingly use chain-of-thought reasoning for complex tasks, existing safety work focuses only on content safety (harmful outputs), ignoring the safety of the reasoning process itself. The paper identifies reasoning safety as an orthogonal and critical security dimension that needs addressing.
Method: 1) Formally defines reasoning safety and introduces a nine-category taxonomy of unsafe reasoning behaviors; 2) Conducts large-scale prevalence study annotating 4111 reasoning chains from natural benchmarks and adversarial attacks; 3) Proposes a Reasoning Safety Monitor - an external LLM-based component that inspects reasoning steps in real time using taxonomy-embedded prompts and dispatches interrupt signals.
Result: Evaluation on 450-chain benchmark shows the monitor achieves up to 84.88% step-level localization accuracy and 85.37% error-type classification accuracy, outperforming hallucination detectors and process reward model baselines by substantial margins.
Conclusion: Reasoning-level monitoring is both necessary and practically achievable, establishing reasoning safety as a foundational concern for secure deployment of large reasoning models.
Abstract: Large language models (LLMs) increasingly rely on explicit chain-of-thought (CoT) reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing work on LLM safety focuses on content safety (detecting harmful, biased, or factually incorrect outputs) and treats the reasoning chain as an opaque intermediate artifact. We identify reasoning safety as an orthogonal and equally critical security dimension: the requirement that a model’s reasoning trajectory be logically consistent, computationally efficient, and resistant to adversarial manipulation. We make three contributions. First, we formally define reasoning safety and introduce a nine-category taxonomy of unsafe reasoning behaviors, covering input parsing errors, reasoning execution errors, and process management errors. Second, we conduct a large-scale prevalence study annotating 4111 reasoning chains from both natural reasoning benchmarks and four adversarial attack methods (reasoning hijacking and denial-of-service), confirming that all nine error types occur in practice and that each attack induces a mechanistically interpretable signature. Third, we propose a Reasoning Safety Monitor: an external LLM-based component that runs in parallel with the target model, inspects each reasoning step in real time via a taxonomy-embedded prompt, and dispatches an interrupt signal upon detecting unsafe behavior. Evaluation on a 450-chain static benchmark shows that our monitor achieves up to 84.88% step-level localization accuracy and 85.37% error-type classification accuracy, outperforming hallucination detectors and process reward model baselines by substantial margins. These results demonstrate that reasoning-level monitoring is both necessary and practically achievable, and establish reasoning safety as a foundational concern for the secure deployment of large reasoning models.
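A minimal sketch of the parallel monitor-and-interrupt loop the abstract describes: each new CoT step is labeled by a verifier LLM against the taxonomy, and an unsafe verdict raises an interrupt. The `call_monitor_llm` stub, the prompt wording, and the coarse three-way taxonomy are assumptions.

```python
# Sketch: step-level reasoning-safety monitoring with an interrupt signal.
TAXONOMY = ["input parsing error", "reasoning execution error",
            "process management error"]  # coarse version of the 9 categories

def call_monitor_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your monitor model here")

class ReasoningInterrupt(Exception):
    pass

def monitor_step(step_idx: int, step_text: str, context: list[str]) -> None:
    prompt = (
        "You are a reasoning-safety monitor. Given the reasoning so far and "
        f"the latest step, label the step SAFE or one of {TAXONOMY}.\n"
        f"History: {' '.join(context)}\nStep {step_idx}: {step_text}\nLabel:"
    )
    verdict = call_monitor_llm(prompt).strip()
    if verdict != "SAFE":
        # Dispatched to the generating process, which halts the chain.
        raise ReasoningInterrupt(f"step {step_idx}: {verdict}")
```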
[381] Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation
Roman Kueble, Marco Hueller, Mrunmai Phatak, Rainer Lienhart, Joerg Haehner
Main category: cs.AI
TL;DR: A modular navigation system for embodied semantic scene graph generation that improves exploration efficiency through modern optimization algorithms and factorized action representations.
Details
Motivation: Semantic world models are crucial for embodied agents to reason about objects and spatial context, but constructing semantic scene graphs within limited action budgets requires efficient exploration strategies that balance information gain against navigation costs.
Method: Presents a modular navigation component that replaces policy-optimization methods and revisits discrete action formulations. Compares compact vs. finer-grained motion sets, single-head vs. factorized multi-head policies, evaluates curriculum learning and depth-based collision supervision.
Result: Modern optimization alone improves SSG completeness by 21% relative to baseline. Depth supervision mainly affects execution safety (collision avoidance) while completeness remains unchanged. Combining modern optimization with finer-grained factorized action representation yields the best completeness-efficiency trade-off.
Conclusion: The proposed improvements to embodied semantic scene graph generation demonstrate significant gains in exploration efficiency and model quality, with factorized action representations and modern optimization algorithms being key to better performance.
Abstract: Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representations. In Organic Computing, such models are a key enabler for objective-driven self-adaptation under uncertainty and resource constraints. The core challenge is to acquire observations maximising model quality and downstream usefulness within a limited action budget. Semantic scene graphs (SSGs) provide a structured and compact representation for this purpose. However, constructing them within a finite action horizon requires exploration strategies that trade off information gain against navigation cost and decide when additional actions yield diminishing returns. This work presents a modular navigation component for Embodied Semantic Scene Graph Generation and modernises its decision-making by replacing the policy-optimisation method and revisiting the discrete action formulation. We study compact and finer-grained, larger discrete motion sets and compare a single-head policy over atomic actions with a factorised multi-head policy over action components. We evaluate curriculum learning and optional depth-based collision supervision, and assess SSG completeness, execution safety, and navigation behaviour. Results show that replacing the optimisation algorithm alone improves SSG completeness by 21% relative to the baseline under identical reward shaping. Depth mainly affects execution safety (collision-free motion), while completeness remains largely unchanged. Combining modern optimisation with a finer-grained, factorised action representation yields the strongest overall completeness–efficiency trade-off.
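A minimal PyTorch sketch of the factorized multi-head policy idea the abstract contrasts with a single-head policy over atomic actions: each action component gets its own categorical head, and the joint log-probability is the sum over heads. The specific components (turn direction, step size) and dimensions are illustrative.

```python
# Sketch: factorized discrete policy over action components.
import torch
import torch.nn as nn

class FactorizedPolicy(nn.Module):
    def __init__(self, obs_dim=128, n_turns=7, n_steps=4):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.turn_head = nn.Linear(256, n_turns)  # e.g. heading-change bins
        self.step_head = nn.Linear(256, n_steps)  # e.g. step-length bins

    def forward(self, obs):
        h = self.trunk(obs)
        turn = torch.distributions.Categorical(logits=self.turn_head(h))
        step = torch.distributions.Categorical(logits=self.step_head(h))
        a_turn, a_step = turn.sample(), step.sample()
        # Factorized joint log-prob: components treated as independent heads.
        logp = turn.log_prob(a_turn) + step.log_prob(a_step)
        return (a_turn, a_step), logp
```

A single-head baseline would instead enumerate all n_turns * n_steps atomic actions in one output layer, which grows quickly as motion sets get finer-grained.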
[382] Cross-Model Disagreement as a Label-Free Correctness Signal
Matt Gorbett, Suman Jana
Main category: cs.AI
TL;DR: Cross-model disagreement (CMP/CME) detects LLM errors without ground truth by measuring how surprised a second verifier model is about the first model’s answer.
Details
Motivation: Existing uncertainty-based error detection methods fail on confident errors where models are wrong but certain. Need training-free, label-free approach for safe LLM deployment.
Method: Proposes cross-model disagreement using Cross-Model Perplexity (CMP) and Cross-Model Entropy (CME) - measuring verifier model’s surprise/uncertainty about generating model’s answer tokens via single forward pass.
Result: CMP/CME outperform within-model uncertainty baselines across reasoning, retrieval, and math benchmarks (MMLU, TriviaQA, GSM8K). On MMLU, CMP achieves AUROC 0.75 vs baseline 0.59.
Conclusion: Cross-model disagreement provides practical, training-free approach for label-free correctness estimation with applications in deployment monitoring, model routing, selective prediction, and scalable oversight.
Abstract: Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model’s own uncertainty – such as token entropy or confidence scores – but these signals fail critically on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator – a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model’s generated answer, cross-model disagreement computes how surprised or uncertain a second verifier model is when reading that answer via a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model’s surprise at the generating model’s answer tokens, and Cross-Model Entropy (CME), which measures the verifying model’s uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines across benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75 against a within-model entropy baseline of 0.59. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.
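Both signals follow directly from the abstract's description: a single verifier forward pass over the generator's answer tokens. A minimal sketch with Hugging Face transformers; the model identifier is a placeholder, and details like device placement are omitted.

```python
# Sketch: Cross-Model Perplexity (CMP) and Cross-Model Entropy (CME).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("verifier-model-id")  # placeholder id
verifier = AutoModelForCausalLM.from_pretrained("verifier-model-id").eval()

@torch.no_grad()
def cmp_cme(question: str, answer: str) -> tuple[float, float]:
    q_ids = tok(question, return_tensors="pt").input_ids
    a_ids = tok(answer, return_tensors="pt",
                add_special_tokens=False).input_ids
    ids = torch.cat([q_ids, a_ids], dim=1)
    logits = verifier(ids).logits                     # one forward pass
    start = q_ids.shape[1]
    # logits[t] predicts token t+1, so these rows score the answer tokens.
    preds = logits[0, start - 1 : ids.shape[1] - 1]
    logp = torch.log_softmax(preds, dim=-1)
    token_lp = logp[torch.arange(a_ids.shape[1]), a_ids[0]]
    cmp_score = torch.exp(-token_lp.mean()).item()    # CMP: verifier perplexity
    cme_score = -(logp.exp() * logp).sum(-1).mean().item()  # CME: mean entropy
    return cmp_score, cme_score
```

High CMP/CME then flags likely-wrong answers for routing or selective prediction, with no generation from the verifier and no labels.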
[383] Retraining as Approximate Bayesian Inference
Harrison Katz
Main category: cs.AI
TL;DR: A decision-theoretic framework for model retraining as approximate Bayesian inference under computational constraints, introducing “learning debt” concept and evidence-based retraining triggers.
Details
Motivation: Traditional model retraining is treated as maintenance, but there's a need for a principled framework to determine when to retrain models based on evidence rather than arbitrary schedules.
Method: Treats retraining as approximate Bayesian inference under computational constraints, introduces “learning debt” concept (gap between continuously updated belief state and frozen deployed model), and formulates retraining as a cost minimization problem with decision-theoretic thresholds derived from loss functions.
Result: Develops evidence-based retraining triggers that replace calendar schedules, making governance auditable and providing a principled approach to retraining decisions.
Conclusion: Retraining should be viewed through a decision-theoretic lens as approximate Bayesian inference, leading to more systematic, evidence-based retraining policies with auditable governance.
Abstract: Model retraining is usually treated as an ongoing maintenance task. But as Harrison Katz now argues, retraining can be better understood as approximate Bayesian inference under computational constraints. The gap between a continuously updated belief state and your frozen deployed model is “learning debt,” and the retraining decision is a cost minimization problem with a threshold that falls out of your loss function. In this article Katz provides a decision-theoretic framework for retraining policies. The result is evidence-based triggers that replace calendar schedules and make governance auditable. For readers less familiar with the Bayesian and decision-theoretic language, key terms are defined in a glossary at the end of the article.
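The decision rule the article argues for reduces to comparing accumulated learning-debt cost against retraining cost. A minimal sketch under an assumed linear cost model; the article's actual loss function may be richer.

```python
# Sketch: evidence-based retraining trigger instead of a calendar schedule.
def should_retrain(debt_per_day: float, days_since_retrain: int,
                   retrain_cost: float) -> bool:
    """debt_per_day: estimated loss gap between the continuously updated
    belief state and the frozen deployed model, in the same units as
    retrain_cost. Retrain once expected debt cost exceeds retraining cost."""
    expected_debt_cost = debt_per_day * days_since_retrain
    return expected_debt_cost > retrain_cost  # threshold falls out of the loss
```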
[384] EcoThink: A Green Adaptive Inference Framework for Sustainable and Accessible Agents
Linxiao Li, Zhixiang Lu
Main category: cs.AI
TL;DR: EcoThink is an energy-aware adaptive inference framework that reduces LLM energy consumption by 40.4% on average through a lightweight router that skips unnecessary reasoning for simple queries while reserving deep computation for complex tasks.
Details
Motivation: Address the environmental sustainability challenge of LLMs, which currently apply computation-intensive strategies like Chain-of-Thought to all queries, causing energy waste and carbon emissions that hinder equitable AI access in resource-constrained regions.
Method: Introduces EcoThink framework with a lightweight, distillation-based router that dynamically assesses query complexity, skipping unnecessary reasoning for factoid retrieval while reserving deep computation for complex logical tasks.
Result: Extensive evaluations across 9 benchmarks show EcoThink reduces inference energy by 40.4% on average (up to 81.9% for web knowledge retrieval) without statistically significant performance loss.
Conclusion: EcoThink offers a scalable path toward sustainable, inclusive, and energy-efficient generative AI by mitigating algorithmic waste and reconciling high-performance AI with environmental responsibility.
Abstract: As the Web transitions from static retrieval to generative interaction, the escalating environmental footprint of Large Language Models (LLMs) presents a critical sustainability challenge. Current paradigms indiscriminately apply computation-intensive strategies like Chain-of-Thought (CoT) to billions of daily queries, causing LLM overthinking, a redundancy that amplifies carbon emissions and operational barriers. This inefficiency directly undermines UN Sustainable Development Goals 13 (Climate Action) and 10 (Reduced Inequalities) by hindering equitable AI access in resource-constrained regions. To address this, we introduce EcoThink, an energy-aware adaptive inference framework designed to reconcile high-performance AI intelligence with environmental responsibility. EcoThink employs a lightweight, distillation-based router to dynamically assess query complexity, skipping unnecessary reasoning for factoid retrieval while reserving deep computation for complex logic. Extensive evaluations across 9 diverse benchmarks demonstrate that EcoThink reduces inference energy by 40.4% on average (up to 81.9% for web knowledge retrieval) without statistically significant performance loss. By mitigating algorithmic waste, EcoThink offers a scalable path toward a sustainable, inclusive, and energy-efficient generative AI Agent.
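A minimal sketch of the routing pattern: a lightweight classifier labels each query, and only complex ones take the expensive chain-of-thought path. The classifier and both answer paths are stand-ins, not EcoThink's actual components.

```python
# Sketch: energy-aware complexity routing before generation.
def answer_directly(query: str) -> str:
    return f"[direct answer to: {query}]"        # cheap path, no reasoning tokens

def answer_with_cot(query: str) -> str:
    return f"[step-by-step answer to: {query}]"  # deep-reasoning path

def route(query: str, classify) -> str:
    # `classify` stands in for the paper's distillation-based router.
    return (answer_directly(query) if classify(query) == "simple"
            else answer_with_cot(query))

# e.g. a trivial stand-in router: short factoid-looking queries go direct
print(route("Capital of France?",
            lambda q: "simple" if len(q) < 40 else "complex"))
```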
[385] Voxtral TTS
Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Henry Lagarde, Jean-Malo Delignon, Jaeyoung Kim, John Harvill, Khyathi Raghavi Chandu, Lorenzo Signoretti, Margaret Jennings, Patrick von Platen, Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Samuel Humeau, Soham Ghosh, Srijan Mishra, Van Phung, Abdelaziz Bounhar, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andrew Bai, Andrew Zhao, Angele Lenglemetz, Anmol Agarwal, Anton Eliseev, Antonia Calvi, Arjun Majumdar, Arthur Fournier, Artjom Joosen, Avi Sooriyarachchi, Aysenur Karaduman Utkur, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Benjamin Tibi, Bowen Yang, Charlotte Cronjäger, Clémence Lanfranchi, Connor Chen, Corentin Barreau, Corentin Sautier, Cyprien Courtot, Darius Dabert, Diego de las Casas, Elizaveta Demyanenko, Elliot Chane-Sane, Emmanuel Gottlob, Enguerrand Paquin, Etienne Goffinet, Fabien Niel, Faruk Ahmed, Federico Baldassarre, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Genevieve Hayes, Georgii Novikov, Giada Pistilli, Guillaume Kunsch, Guillaume Martin, Guillaume Raille, Gunjan Dhanuka, Gunshi Gupta, Han Zhou, Harshil Shah, Hope McGovern, Hugo Thimonier, Indraneel Mukherjee, Irene Zhang, Jacques Sun, Jan Ludziejewski, Jason Rute, Jérémie Dentan, Joachim Studnia, Jonas Amar, Joséphine Delas, Josselin Somerville Roberts, Julien Tauran, Karmesh Yadav, Kartik Khandelwal, Kilian Tep, Kush Jain, Laurence Aitchison, Laurent Fainsin, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Manan Sharma, Marie Pellat, Mark Prins, Martin Alexandre, Mathieu Poirée, Mathieu Schmitt, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mert Unsal, Mia Chiquier, Mikhail Biriuchinskii, Minh-Quang Pham, Mircea Lica, Morgane Rivière, Nathan Grinsztajn, Neha Gupta, Olivier Bousquet, Olivier Duchenne, Patricia Wang, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philippe Pinel, Philomène Chagniot, Pierre Stock, Piotr Miłoś, Prateek Gupta, Pravesh Agrawal, Quentin Torroba, Ram Ramrakhya, Randall Isenhour, Rishi Shah, Romain Sauvestre, Roman Soletskyi, Rosalie Millner, Rupert Menneer, Sagar Vaze, Samuel Barry, Samuel Belkadi, Sandeep Subramanian, Sean Cha, Shashwat Verma, Siddhant Waghjale, Siddharth Gandhi, Simon Lepage, Sumukh Aithal, Szymon Antoniak, Tarun Kumar Vangani, Teven Le Scao, Théo Cachet, Theo Simon Sorg, Thibaut Lavril, Thomas Chabal, Thomas Foubert, Thomas Robert, Thomas Wang, Tim Lawson, Tom Bewley, Tom Edwards, Tyler Wang, Umar Jamil, Umberto Tomasini, Valeriia Nemychnikova, Vedant Nanda, Victor Jouault, Vincent Maladière, Vincent Pfister, Virgile Richard, Vladislav Bataev, Wassim Bouaziz, Wen-Ding Li, William Havard, William Marshall, Xinghui Li, Xingran Guo, Xinyu Yang, Yannic Neuhaus, Yassine El Ouahidi, Yassir Bendou, Yihan Wang, Yimu Pan, Zaccharie Ramzi, Zhenlin Xu
Main category: cs.AI
TL;DR: Voxtral TTS is an expressive multilingual text-to-speech model that generates natural speech from just 3 seconds of reference audio using a hybrid architecture combining auto-regressive semantic tokens and flow-matching for acoustic tokens.
Details
Motivation: The paper aims to create a high-quality multilingual TTS system that can perform voice cloning with minimal reference audio (just 3 seconds) while maintaining naturalness and expressivity across languages.
Method: Uses hybrid architecture: auto-regressive generation for semantic speech tokens + flow-matching for acoustic tokens. Employs Voxtral Codec - a speech tokenizer trained from scratch with hybrid VQ-FSQ quantization scheme.
Result: In human evaluations by native speakers, Voxtral TTS achieves 68.4% win rate over ElevenLabs Flash v2.5 for multilingual voice cloning, preferred for naturalness and expressivity.
Conclusion: Voxtral TTS demonstrates state-of-the-art performance in multilingual voice cloning with minimal reference audio, offering high-quality expressive speech generation across languages.
Abstract: We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.
[386] Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?
Liang Zhang, Yu Fu, Xinyi Jin
Main category: cs.AI
TL;DR: LLMs used as math tutors show that stronger math problem-solving ability correlates with better step-level assessment performance, but assessment remains more challenging than direct solving.
Details
Motivation: To understand whether stronger math problem-solving ability in LLMs is associated with stronger step-level assessment performance in educational contexts, particularly for AI-supported math tutoring systems.
Method: Evaluated GPT-4 and GPT-5 on GSM8K and MATH subsets of PROCESSBENCH benchmark, comparing performance on two tasks: solving original math problems and assessing benchmark-provided solutions by predicting earliest erroneous step.
Result: Assessment accuracy was substantially higher on math problems the same model solved correctly than on those it solved incorrectly, with statistically significant associations across both models and datasets. Assessment remained more difficult than direct problem solving, especially on error-present solutions.
Conclusion: Math problem-solving expertise supports stronger assessment performance, but reliable step-level diagnosis requires additional capabilities like step tracking, monitoring, and precise error localization, with implications for AI-supported Adaptive Instructional Systems in math education.
Abstract: Large Language Models (LLMs) are increasingly used in math education not only as problem solvers but also as assessors of learners’ reasoning. However, it remains unclear whether stronger math problem-solving ability is associated with stronger step-level assessment performance. This study examines that relationship using the GSM8K and MATH subsets of PROCESSBENCH, a human-annotated benchmark for identifying the earliest erroneous step in mathematical reasoning. We evaluate two LLM-based math tutor agent settings, instantiated with GPT-4 and GPT-5, in two independent tasks on the same math problems: solving the original problem and assessing a benchmark-provided solution by predicting the earliest erroneous step. Results show a consistent within-model pattern: assessment accuracy is substantially higher on math problem items the same model solved correctly than on items it solved incorrectly, with statistically significant associations across both models and datasets. At the same time, assessment remains more difficult than direct problem solving, especially on error-present solutions. These findings suggest that math problem-solving expertise supports stronger assessment performance, but reliable step-level diagnosis also requires additional capabilities such as step tracking, monitoring, and precise error localization. The results have implications for the design and evaluation of AI-supported Adaptive Instructional Systems (AISs) for formative assessment in math education.
[387] Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?
Abhishek Bhandwaldar, Mihir Choudhury, Ruchir Puri, Akash Srivastava
Main category: cs.AI
TL;DR: Autonomous coding agents optimize hardware designs from high-level specs using a two-stage pipeline with sub-kernel decomposition and cross-function optimization, achieving up to 20x speedup without hardware-specific training.
Details
Motivation: To explore how far general-purpose coding agents can optimize hardware designs without domain-specific training, and to develop scalable autonomous optimization methods for high-level synthesis (HLS).
Method: Two-stage agent factory: Stage 1 decomposes designs into sub-kernels, optimizes each independently with pragma/code transformations, formulates ILP for global assembly; Stage 2 launches expert agents over top ILP solutions to explore cross-function optimizations like pragma recombination, loop fusion, and memory restructuring.
Result: Achieved mean 8.27x speedup over baseline scaling from 1 to 10 agents, with larger gains on harder benchmarks (streamcluster >20x, kmeans ~10x). Agents rediscovered known hardware optimization patterns without domain-specific training.
Conclusion: General-purpose coding agents can effectively optimize hardware designs without hardware-specific training, and agent scaling is a practical axis for HLS optimization, with global optimization revealing improvements missed by sub-kernel search.
Abstract: We present an empirical study of how far general-purpose coding agents – without hardware-specific training – can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents.
In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. In Stage 2, it launches $N$ expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition.
We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS. Scaling from 1 to 10 agents yields a mean $8.27\times$ speedup over baseline, with larger gains on harder benchmarks: streamcluster exceeds $20\times$ and kmeans reaches approximately $10\times$. Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training, and the best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. These results establish agent scaling as a practical and effective axis for HLS optimization.
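The Stage 1 assembly step is a standard set-packing-style ILP: pick exactly one optimized configuration per sub-kernel to minimize total latency under an area budget. A minimal sketch with PuLP; the candidate numbers are made up, and PuLP is a generic solver choice, not necessarily what the paper used.

```python
# Sketch: global assembly of per-sub-kernel configurations as an ILP.
import pulp

# configs[k] = list of (latency, area) candidates for sub-kernel k
configs = {
    "conv": [(100, 30), (60, 55), (40, 80)],
    "norm": [(50, 10), (30, 25)],
    "gemm": [(200, 40), (120, 70)],
}
AREA_BUDGET = 150

prob = pulp.LpProblem("hls_assembly", pulp.LpMinimize)
x = {(k, i): pulp.LpVariable(f"x_{k}_{i}", cat="Binary")
     for k, cs in configs.items() for i in range(len(cs))}

prob += pulp.lpSum(configs[k][i][0] * x[k, i] for (k, i) in x)  # total latency
prob += pulp.lpSum(configs[k][i][1] * x[k, i] for (k, i) in x) <= AREA_BUDGET
for k, cs in configs.items():  # exactly one configuration per sub-kernel
    prob += pulp.lpSum(x[k, i] for i in range(len(cs))) == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = {k: i for (k, i) in x if x[k, i].value() == 1}
```

Stage 2 agents then start from the top-ranked solutions of this program, which the paper notes are often not where the best final designs come from.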
[388] R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning
Zirui Zhang, Haoyu Dong, Kexin Pei, Chengzhi Mao
Main category: cs.AI
TL;DR: RC2: Reinforcement learning framework that uses cross-modal cycle consistency to resolve contradictions between visual and textual representations, improving multimodal reasoning accuracy.
Details
Motivation: Current multimodal models often produce contradictory predictions for visual and textual representations of the same concept. Standard voting mechanisms can amplify biases rather than addressing the underlying inconsistency problem.
Method: Introduces RC2, a reinforcement learning framework that enforces cross-modal cycle consistency. The model performs backward inference, switches modalities, and reconstructs answers through forward inference, creating a dense, label-free reward signal that encourages alignment of internal representations.
Result: The cyclic constraint mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. The approach shows that advanced reasoning emerges not just from scaling data, but from enforcing structurally consistent understanding.
Conclusion: Cross-modal inconsistency provides a rich natural signal for learning. Enforcing cycle consistency helps models align internal representations autonomously, leading to more robust multimodal reasoning without requiring additional labeled data.
Abstract: Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and natural signal for learning. We introduce RC2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference, we obtain a dense, label-free reward. This cyclic constraint encourages the model to align its internal representations autonomously. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not only from scaling data, but also from enforcing a structurally consistent understanding of the world.
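A minimal sketch of one way the cycle-consistency reward could be computed: answer from the image, restate the visual content as text (the modality switch), re-derive the answer from that text alone, and reward agreement. The `model.*` interface is hypothetical and this simplifies the paper's backward-inference step.

```python
# Sketch: a label-free cross-modal cycle-consistency reward.
def cycle_consistency_reward(model, image, question) -> float:
    answer = model.answer(image=image, question=question)       # forward
    description = model.describe(image=image)                   # switch modality
    reconstructed = model.answer(text=description,              # forward again,
                                 question=question)             # text-only
    return float(reconstructed == answer)  # dense reward, no labels needed
```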
[389] Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment
Yuxing Lu, Xukai Zhao, Wei Wu, Jinzhuo Wang
Main category: cs.AI
TL;DR: WriteBack-RAG treats knowledge bases as trainable components, using labeled examples to identify relevant documents and distill them into compact knowledge units that are indexed alongside the original corpus.
Details
Motivation: Traditional RAG systems have static knowledge bases that are assembled once and never revised, even though relevant facts are often fragmented across documents and buried in irrelevant content.
Method: Uses labeled examples to identify where retrieval succeeds, isolate relevant documents, and distill them into compact knowledge units that are indexed alongside the original corpus as an offline preprocessing step.
Result: Improves performance across four RAG methods, six benchmarks, and two LLM backbones with average gains of +2.14%, and shows cross-method transfer benefits.
Conclusion: Treating knowledge bases as trainable components through distillation into compact knowledge units significantly improves RAG performance and the improvements reside in the corpus itself.
Abstract: The knowledge base in a retrieval-augmented generation (RAG) system is typically assembled once and never revised, even though the facts a query requires are often fragmented across documents and buried in irrelevant content. We argue that the knowledge base should be treated as a trainable component and propose WriteBack-RAG, a framework that uses labeled examples to identify where retrieval succeeds, isolate the relevant documents, and distill them into compact knowledge units that are indexed alongside the original corpus. Because the method modifies only the corpus, it can be applied once as an offline preprocessing step and combined with any RAG pipeline. Across four RAG methods, six benchmarks, and two LLM backbones, WriteBack-RAG improves every evaluated setting, with gains averaging +2.14%. Cross-method transfer experiments further show that the distilled knowledge benefits RAG pipelines other than the one used to produce it, confirming that the improvement resides in the corpus itself.
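A minimal sketch of the offline write-back pass as the abstract describes it; every component here (`retrieve`, `answer_with`, `llm`, `is_correct`, `index`) is a hypothetical stand-in for the pipeline's retriever, generator, grader, and vector index, and the condensation prompt is illustrative.

```python
# Sketch: distill retrieval successes into indexed knowledge units.
def write_back(examples, retrieve, answer_with, llm, is_correct, index):
    """examples: (question, gold_answer) pairs with labels."""
    for question, gold in examples:
        docs = retrieve(question)
        if not is_correct(answer_with(question, docs), gold):
            continue                        # keep only retrieval successes
        # Isolate the documents that individually support the gold answer.
        relevant = [d for d in docs
                    if is_correct(answer_with(question, [d]), gold)]
        if not relevant:
            continue
        unit = llm("Condense these passages into one self-contained "
                   "fact set:\n" + "\n---\n".join(relevant))
        index.add(unit)                     # indexed alongside the corpus
```

Because only the corpus changes, the same enriched index can then be handed to any downstream RAG pipeline, which is what the cross-method transfer experiments test.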
[390] TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving
Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Hongyuan Zha, Junchi Yan, Botian Shi, Yu Qiao, Bo Zhang
Main category: cs.AI
TL;DR: TrustGeoGen is an autonomous geometric data generation engine that creates high-quality multimodal geometry problems with formal verification, addressing data scarcity for training MLLMs on geometric reasoning.
Details
Motivation: Current MLLMs struggle with geometric problem solving due to scarcity of high-quality, verifiable multimodal data. Existing data acquisition methods have issues with modality incompleteness, logical gaps, or produce rigid, homogeneous data that fails to capture natural language reasoning or high-difficulty problems.
Method: TrustGeoGen uses formal verification to guarantee reasoning trustworthiness while generating multimodal data (premises, diagrams, solutions). It incorporates difficulty-aware filtering, iterative bootstrapping, “connection thinking” to bridge formal logic and human-like reasoning, and GeoExplore sampling algorithms for diverse problem-solving trajectories.
Result: Training models on the synthesized GeoTrust dataset substantially enhances deep geometric reasoning capabilities and yields significant performance gains across OOD benchmarks including GeoQA, Geometry3K, and OlympiadBench.
Conclusion: TrustGeoGen effectively addresses the data scarcity bottleneck for geometric reasoning in MLLMs by generating high-quality, verifiable multimodal data with formal guarantees, enabling better training of models for complex geometric problem solving.
Abstract: Geometric problem solving (GPS) requires precise multimodal understanding and rigorous, step-by-step logical reasoning. However, developing capable Multimodal Large Language Models (MLLMs) for GPS is heavily bottlenecked by the scarcity of high-quality, verifiable data. Existing data acquisition paradigms either suffer from modality incompleteness and unverified logical gaps (“leaps-of-faith”), or rely on formal engines that generate rigid, structurally homogeneous data, failing to produce high-difficulty problems or foster genuine natural-language reasoning. To overcome these limitations, we introduce TrustGeoGen, an autonomous and formalized geometric data generation engine. TrustGeoGen strictly guarantees reasoning trustworthiness through formal verification while generating multimodally integrated data, including premises, visual diagrams, and solutions. To systematically scale problem difficulty, we incorporate a difficulty-aware filtering and iterative bootstrapping mechanism. Furthermore, we propose “connection thinking” to bridge the semantic gap between rigid formal logic and fluent human-like reasoning, ensuring coherent logical transitions. We also introduce the GeoExplore family of sampling algorithms to extract diverse problem-solving trajectories based on various thinking templates. Extensive experiments demonstrate that training models on our synthesized dataset, GeoTrust, substantially enhances deep geometric reasoning capabilities and yields significant performance gains across out-of-distribution (OOD) benchmarks, including GeoQA, Geometry3K, and OlympiadBench. Our code and data can be found at https://github.com/InternScience/TrustGeoGen
[391] Do Language Models Follow Occam’s Razor? An Evaluation of Parsimony in Inductive and Abductive Reasoning
Yunxin Sun, Abulhair Saparov
Main category: cs.AI
TL;DR: LLMs struggle with Occam’s Razor in non-deductive reasoning tasks, performing poorly on complex scenarios despite using reasoning-enhancing techniques.
Details
Motivation: Current evaluations of LLMs' non-deductive reasoning ignore Occam's Razor, which is essential for selecting the simplest valid hypotheses in inductive and abductive reasoning.
Method: Developed a framework to generate synthetic reasoning questions requiring both inductive and abductive reasoning, with automated metrics to assess adherence to Occam’s Razor.
Result: LLMs can handle simple reasoning scenarios but struggle with complex world models and fail to produce high-quality hypotheses that adhere to Occam’s Razor, even with in-context learning and RLVR.
Conclusion: LLMs have limited capability in generating high-quality hypotheses that follow Occam’s Razor in complex non-deductive reasoning tasks, highlighting a significant gap in their reasoning abilities.
Abstract: Non-deductive reasoning, encompassing inductive and abductive reasoning, is essential in addressing complex real-world questions. One key feature of inductive and abductive reasoning is that there are many valid hypotheses; the simplest ones (those that adhere to Occam’s Razor) are often most useful. However, this aspect is ignored in recent work that evaluates the non-deductive reasoning capabilities of large language models (LLMs). This work fills this gap, focusing on understanding whether the inductive and abductive reasoning capabilities of LLMs adhere to Occam’s Razor, while also examining the correctness of their reasoning. To accomplish this goal, we introduce a framework to synthetically generate reasoning questions that require inductive and abductive reasoning simultaneously; the framework is readily extended to produce any abductive/inductive reasoning question expressible in first-order logic. The task for the intelligent agent is to produce hypotheses to explain observations under a given world model. We also propose a new automated metric to assess whether hypotheses quantitatively adhere to Occam’s Razor; those hypotheses that are correct and simplest are considered high-quality. Our findings on state-of-the-art LLMs suggest that LLMs can perform inductive and abductive reasoning in simple scenarios, but struggle with complex world models and with producing high-quality hypotheses, even with popular reasoning-enhancing techniques such as in-context learning and RLVR.
[392] From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning
Cheng Yang, Jiaxuan Lu, Haiyuan Wan, Junchi Yu, Feiwei Qin
Main category: cs.AI
TL;DR: ChemMAS: A multi-agent system for chemical reaction condition recommendation that provides evidence-based reasoning and interpretable justifications, achieving 20-35% gains over domain baselines.
Details
Motivation: Current LLM-based chemical reaction condition recommendation methods lack explainability and rationale behind their recommendations, limiting their utility in high-stakes scientific workflows where trust and interpretability are crucial.
Method: Proposes ChemMAS, a multi-agent system that reframes condition prediction as evidence-based reasoning. It decomposes the task into: 1) mechanistic grounding, 2) multi-channel recall, 3) constraint-aware agentic debate, and 4) rationale aggregation. Each decision is backed by interpretable justifications grounded in chemical knowledge and retrieved precedents.
Result: ChemMAS achieves 20-35% gains over domain-specific baselines and outperforms general-purpose LLMs by 10-15% in Top-1 accuracy. The system offers falsifiable, human-trustable rationales.
Conclusion: ChemMAS establishes a new paradigm for explainable AI in scientific discovery by providing evidence-based reasoning with interpretable justifications for chemical reaction condition recommendations.
Abstract: Chemical reaction condition recommendation is the task of selecting proper condition parameters for chemical reactions, and it is pivotal to accelerating chemical science. With the rapid development of large language models (LLMs), there is growing interest in leveraging their reasoning and planning capabilities for reaction condition recommendation. Despite their success, existing methods rarely explain the rationale behind the recommended reaction conditions, limiting their utility in high-stakes scientific workflows. In this work, we propose ChemMAS, a multi-agent system that reframes condition prediction as an evidence-based reasoning task. ChemMAS decomposes the task into mechanistic grounding, multi-channel recall, constraint-aware agentic debate, and rationale aggregation. Each decision is backed by interpretable justifications grounded in chemical knowledge and retrieved precedents. Experiments show that ChemMAS achieves 20-35% gains over domain-specific baselines and outperforms general-purpose LLMs by 10-15% in Top-1 accuracy, while offering falsifiable, human-trustable rationales, which establishes a new paradigm for explainable AI in scientific discovery.
[393] Working Paper: Active Causal Structure Learning with Latent Variables: Towards Learning to Detour in Autonomous Robots
Pablo de los Riscos, Fernando J. Corbacho
Main category: cs.AI
TL;DR: Active causal structure learning with latent variables enables AGI agents to adapt to novel environmental changes by constructing new internal causal models when encountering unexpected obstacles.
Details
Motivation: AGI agents and robots need to handle dynamic environments and novel situations by actively constructing internal causal models when structural changes occur, rather than relying on pre-programmed responses.
Method: ACSLWL (Active Causal Structure Learning with Latent Variables) involves: acting in environment, discovering new causal relations, constructing causal models, exploiting models for utility maximization, detecting latent variables during unexpected observations, and building new structures with optimal parameter estimation.
Result: The method enables a simulated robot to learn complex planning and expectation-based detour behavior when encountering a transparent barrier for the first time, transforming unexpected situations into predictable ones with optimal plans.
Conclusion: Active causal structure learning with latent variables is essential for AGI agents to adapt to novel environmental changes and achieve optimal performance in dynamic, unpredictable scenarios.
Abstract: Artificial General Intelligence (AGI) agents and robots must be able to cope with ever-changing environments and tasks. They must be able to actively construct new internal causal models of their interactions with the environment when new structural changes take place in the environment. Thus, we claim that active causal structure learning with latent variables (ACSLWL) is a necessary component to build AGI agents and robots. This paper describes how a complex planning and expectation-based detour behavior can be learned by ACSLWL when, unexpectedly, and for the first time, the simulated robot encounters a sort of transparent barrier in its pathway towards its target. ACSLWL consists of acting in the environment, discovering new causal relations, constructing new causal models, exploiting the causal models to maximize its expected utility, detecting possible latent variables when unexpected observations occur, and constructing new structures, that is, internal causal models with optimal estimates of the associated parameters, so as to cope efficiently with newly encountered situations. In other words, the agent must be able to construct new internal causal models that transform a previously unexpected and inefficient (sub-optimal) situation into a predictable situation with an optimal operating plan.
[394] Semi-Strongly solved: a New Definition Leading Computer to Perfect Gameplay
Hiroki Takizawa
Main category: cs.AI
TL;DR: Semi-strong solving bridges weak and strong game solving by certifying optimal play on positions reachable when at least one player follows optimal policy, using reopening alpha-beta search with selective full-window evaluation.
Details
Motivation: Strong solving requires exhaustive state-space coverage which is often prohibitive, while weak solving only certifies correctness at initial position without guarantees after deviations. Need intermediate approach with formal guarantees on reachable positions under optimal play assumptions.
Method: Proposes semi-strong solving with certified region R, using reopening alpha-beta search (node-kind-aware Principal Variation Search/Negascout) that enforces full-window search only where certification requires exact values and canonical optimal actions, while using null-window refutations elsewhere.
Result: Achieves O(d b^(d/2)) node expansions bound. On 6x6 Othello, computes semi-strong solution artifact supporting exact value queries. On 7x6 Connect Four, semi-strong certification is 9,074x smaller than strong baseline under matched counting conventions.
Conclusion: Semi-strong solving provides assumption-scoped, verifiable optimality guarantee bridging weak and strong solving, enabling explicit resource-guarantee trade-offs with deployable solution artifacts and proof certificates.
Abstract: Strong solving of perfect-information games certifies optimal play from every reachable position, but the required state-space coverage is often prohibitive. Weak solving is far cheaper, yet it certifies correctness only at the initial position and provides no formal guarantee for optimal responses after arbitrary deviations. We define semi-strong solving, an intermediate notion that certifies correctness on a certified region R: positions reachable from the initial position under the explicit assumption that at least one player follows an optimal policy while the opponent may play arbitrarily. A fixed tie-breaking rule among optimal moves makes the target deterministic. We propose reopening alpha-beta, a node-kind-aware Principal Variation Search/Negascout scheme that enforces full-window search only where semi-strong certification requires exact values and a canonical optimal action, while using null-window refutations and standard cut/all reasoning elsewhere. The framework exports a deployable solution artifact and, when desired, a proof certificate for third-party verification. Under standard idealizations, we bound node expansions by O(d b^(d/2)). On 6x6 Othello (score-valued utility), we compute a semi-strong solution artifact supporting exact value queries on R and canonical move selection. An attempted strong enumeration exhausts storage after exceeding 4x10^12 distinct rule-reachable positions. On 7x6 Connect Four (win/draw/loss utility), an oracle-value experiment shows that semi-strong certification is 9,074x smaller than a published strong baseline under matched counting conventions. Semi-strong solving provides an assumption-scoped, verifiable optimality guarantee that bridges weak and strong solving and enables explicit resource-guarantee trade-offs.
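To ground the search terminology, here is a minimal Principal Variation Search (negascout) over a hard-coded toy tree, showing the full-window search of the first child and the null-window refutations of later children that reopening alpha-beta builds on. The reopening rule itself (forcing a full window wherever certification needs exact values and a canonical move) is only indicated in comments, and the tree and leaf values are illustrative, not from the paper.

```python
# Minimal PVS/negascout sketch; TREE and LEAF are illustrative only.
TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
LEAF = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}  # values from root's perspective

def pvs(node, alpha, beta, color):
    if node in LEAF:
        return color * LEAF[node]
    first = True
    for child in TREE[node]:
        if first:                               # PV child: full-window search
            score = -pvs(child, -beta, -alpha, -color)
            first = False
        else:                                   # null-window refutation attempt
            score = -pvs(child, -alpha - 1, -alpha, -color)
            if alpha < score < beta:            # failed high: full-window re-search
                score = -pvs(child, -beta, -score, -color)
        alpha = max(alpha, score)
        # Cutoff: a bound suffices at cut/all nodes; reopening alpha-beta would
        # force a full window here whenever the certified region R needs an
        # exact value and a canonical optimal move.
        if alpha >= beta:
            break
    return alpha

print(pvs("root", -float("inf"), float("inf"), 1))  # prints 3, the toy game value
```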
[395] Research on environment perception and behavior prediction of intelligent UAV based on semantic communication
Kechong Ren, Li Gao, Qi Guan
Main category: cs.AI
TL;DR: A framework combining drone delivery, virtual worlds, and blockchain for logistics, using reinforcement learning for drone adaptation, semantic communication for efficiency, and blockchain for security.
Details
Motivation: To address the challenge of collecting real-time delivery information from edge devices for virtual service providers, enabling fast, environmentally friendly drone delivery systems integrated with virtual worlds and blockchain technology.Method: Three-pronged approach: 1) Reinforcement learning for drone training and adaptation to virtual scenarios, 2) Semantic communication framework to reduce costs and incentivize information transmission, 3) Lightweight blockchain-based authentication scheme for security.
Result: Drone adaptation performance improved by ~35%, local offloading rate reached 90% with more base stations, semantic communication outperformed Cross Entropy baseline, and blockchain maintained stable transaction throughput with varying drone numbers.
Conclusion: The integrated framework successfully combines drone delivery, virtual worlds, and blockchain to create an efficient, secure logistics system with improved adaptation, communication efficiency, and stable performance.
Abstract: The convergence of drone delivery systems, virtual worlds, and blockchain has transformed logistics and supply chain management, providing a fast and environmentally friendly alternative to traditional ground transportation. To offer users a real-world experience, virtual service providers need to collect up-to-the-minute delivery information from edge devices. To address this challenge, 1) a reinforcement learning approach is introduced to give drones fast training capabilities and the ability to autonomously adapt to new virtual scenarios for effective resource allocation; 2) a semantic communication framework for metaverse services is proposed, which extracts semantic information to reduce communication costs and incentivize information transmission; 3) to ensure the security of user information, a lightweight authentication and key agreement scheme between the drone and the user is designed by introducing blockchain technology. In our experiments, drone adaptation performance improves by about 35%, and the local offloading rate reaches 90% as the number of base stations increases. The proposed semantic communication system is compared with a Cross Entropy baseline model. With blockchain technology introduced, transaction throughput remains stable across different numbers of drones.
[396] Concepts Learned Visually by Infants Can Contribute to Visual Learning and Understanding in AI Models
Shify Treger, Shimon Ullman
Main category: cs.AI
TL;DR: Infants learn complex visual concepts with minimal supervision, using early-acquired concepts like animacy and goal attribution to learn more complex concepts, leading to better and more efficient learning compared to standard deep networks.
Details
Motivation: Infants learn visual concepts surprisingly well with little supervision and few examples, using early concepts to bootstrap learning of more complex concepts. Current deep networks require much more data and supervision. The paper aims to model how early concepts (animacy, goal attribution) facilitate learning of subsequent concepts in dynamic visual scenes.Method: Model how early-acquired concepts (animacy, goal attribution) are used in learning subsequent concepts. Compare results with standard deep network modeling. Focus on predicting future events in dynamic visual scenes. Also compare advanced vision-language models with human studies on understanding animate vs. inanimate agent behavior.
Result: Using early concepts leads to better learning (higher accuracy) and more efficient learning (requiring less data). Combination of early and new concepts shapes concept representations and improves generalization. Vision-language models compared to human studies support contribution of early concepts to visual understanding.
Conclusion: Incorporating human-like visual learning aspects (using early concepts to bootstrap learning) into computer vision models could provide benefits in efficiency and generalization. Early concepts play crucial role in visual understanding and concept acquisition.
Abstract: Early in development, infants learn to extract surprisingly complex aspects of visual scenes. This early learning comes together with an initial understanding of the extracted concepts, such as their implications, causality, and using them to predict likely future events. In many cases, this learning is obtained with little or no supervision, and from relatively few examples, compared to current network models. Empirical studies of visual perception in early development have shown that in the domain of objects and human-object interactions, early-acquired concepts are often used in the process of learning additional, more complex concepts. In the current work, we model how early-acquired concepts are used in the learning of subsequent concepts, and compare the results with standard deep network modeling. We focused in particular on the use of the concepts of animacy and goal attribution in learning to predict future events in dynamic visual scenes. We show that the use of early concepts in the learning of new concepts leads to better learning (higher accuracy) and more efficient learning (requiring less data), and that the combination of early and new concepts shapes the representation of the concepts acquired by the model and improves its generalization. We further compare advanced vision-language models to a human study in a task that requires an understanding of the behavior of animate vs. inanimate agents, with results supporting the contribution of early concepts to visual understanding. We finally briefly discuss the possible benefits of incorporating aspects of human-like visual learning into computer vision models.
[397] Ludax: A GPU-Accelerated Domain Specific Language for Board Games
Graham Todd, Alexander G. Padula, Dennis J. N. J. Soemers, Julian Togelius
Main category: cs.AI
TL;DR: Ludax is a domain-specific language for board games that compiles into hardware-accelerated code, combining game description language generality with modern parallel processing speed for AI research.
Details
Motivation: To accelerate games research by providing a framework that combines the generality of game description languages (which allow researchers to generalize algorithms across multiple games) with the speed of modern hardware acceleration used in reinforcement learning.Method: Developed a domain-specific language for board games that automatically compiles into hardware-accelerated code using libraries like JAX, designed to fit into existing deep learning pipelines and enable rapid simulation.
Result: Created the open-source Ludax framework with implementations of existing board games, featuring detailed description language, compilation process, speed benchmarking, and demonstration of RL agent training.
Conclusion: Ludax accelerates games research across RL and cognitive science by enabling rapid simulation and providing flexible representation, bridging game description languages with hardware acceleration.
Abstract: Games have long been used as benchmarks and testing environments for research in artificial intelligence. A key step in supporting this research was the development of game description languages: frameworks that compile domain-specific code into playable and simulatable game environments, allowing researchers to generalize their algorithms and approaches across multiple games without having to manually implement each one. More recently, progress in reinforcement learning (RL) has been largely driven by advances in hardware acceleration. Libraries like JAX allow practitioners to take full advantage of cutting-edge computing hardware, often speeding up training and testing by orders of magnitude. Here, we present a synthesis of these strands of research: a domain-specific language for board games which automatically compiles into hardware-accelerated code. Our framework, Ludax, combines the generality of game description languages with the speed of modern parallel processing hardware and is designed to fit neatly into existing deep learning pipelines. We envision Ludax as a tool to help accelerate games research generally, from RL to cognitive science, by enabling rapid simulation and providing a flexible representation scheme. We present a detailed breakdown of Ludax’s description language and technical notes on the compilation process, along with speed benchmarking and a demonstration of training RL agents. The Ludax framework, along with implementations of existing board games, is open-source and freely available.
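For readers new to the hardware-acceleration side, the sketch below shows the generic pattern such compilation targets: a pure, jittable step function that jax.vmap batches across thousands of boards at once. The toy "claim a cell" game and every name here are hypothetical, not taken from the Ludax description language.

```python
import jax
import jax.numpy as jnp

def step(board, move, player):
    # board: (9,) in {-1, 0, +1}; move: cell index. Pure and branch-free, so
    # it compiles to a single fused kernel under jit.
    legal = board[move] == 0
    board = jnp.where(legal, board.at[move].set(player), board)
    return board, legal

# Batch the step across many parallel games, then compile once.
batched_step = jax.jit(jax.vmap(step, in_axes=(0, 0, None)))

boards = jnp.zeros((4096, 9), dtype=jnp.int8)   # 4096 games simulated at once
moves = jnp.arange(4096) % 9
boards, legal = batched_step(boards, moves, jnp.int8(1))
print(boards.shape, bool(legal.all()))          # (4096, 9) True
```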
[398] Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints
Daniel Daza, Alberto Bernardi, Luca Costabello, Christophe Gueret, Masoud Mansoury, Michael Cochez, Martijn Schut
Main category: cs.AI
TL;DR: This paper introduces query answering with soft constraints over incomplete knowledge graphs, proposing lightweight methods to incorporate vague or context-dependent preferences while maintaining original answer rankings.
Details
Motivation: Existing query answering methods for incomplete knowledge graphs focus on first-order-logic queries, but many real-world queries involve vague or context-dependent constraints (e.g., preferences for attributes or related categories). There's a gap in handling these soft constraints.Method: The authors formalize the problem of query answering with soft constraints and introduce two efficient methods: 1) parameter tuning approach requiring only two parameters, and 2) a small neural network trained to capture soft constraints while preserving original ranking structure. They extend existing QA benchmarks with soft constraint datasets for evaluation.
Result: Experiments show the methods can effectively capture soft constraints while maintaining robust query answering performance with minimal overhead. The approaches add flexibility to graph database interactions.
Conclusion: The work explores a new flexible way to interact with graph databases that allows users to specify preferences interactively through examples, addressing the limitation of existing approaches that only handle rigid first-order-logic constraints.
Abstract: Methods for query answering over incomplete knowledge graphs retrieve entities that are likely to be answers, which is particularly useful when such answers cannot be reached by direct graph traversal due to missing edges. However, existing approaches have focused on queries formalized using first-order logic. In practice, many real-world queries involve constraints that are inherently vague or context-dependent, such as preferences for attributes or related categories. Addressing this gap, we introduce the problem of query answering with soft constraints. We formalize the problem and introduce two efficient methods designed to adjust query answer scores by incorporating soft constraints without disrupting the original answers to a query. These methods are lightweight, requiring tuning only two parameters or a small neural network trained to capture soft constraints while maintaining the original ranking structure. To evaluate the task, we extend existing QA benchmarks by generating datasets with soft constraints. Our experiments demonstrate that our methods can capture soft constraints while maintaining robust query answering performance and adding very little overhead. With our work, we explore a new and flexible way to interact with graph databases that allows users to specify their preferences by providing examples interactively.
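As a rough sketch of what a two-parameter score adjustment could look like, the snippet below boosts candidate scores by a squashed similarity to the soft constraint, leaving the base scores (and hence the original ranking signal) intact. The functional form and the parameter names weight and temperature are assumptions for illustration, not the paper's method.

```python
import numpy as np

def rescore(query_scores, constraint_sims, weight=0.3, temperature=5.0):
    """query_scores: (N,) original QA scores over candidate entities.
    constraint_sims: (N,) similarity of each entity to the soft constraint.
    weight and temperature are the two tunable parameters (assumed names)."""
    boost = 1.0 / (1.0 + np.exp(-temperature * constraint_sims))  # squash to (0, 1)
    return query_scores + weight * boost          # gentle nudge, base scores kept

scores = np.array([2.0, 1.5, 1.4])   # e.g. link-predictor scores for 3 candidates
sims = np.array([-0.2, 0.9, 0.1])    # e.g. embedding similarity to a vague preference
print(rescore(scores, sims))         # candidate 2 receives a soft preference boost
```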
[399] Planned Diffusion
Daniel Israel, Tian Jin, Ellie Cheng, Guy Van den Broeck, Aditya Grover, Suvinay Subramanian, Michael Carbin
Main category: cs.AI
TL;DR: Planned diffusion: A hybrid approach combining autoregressive planning with parallel diffusion-based generation to improve inference speed while maintaining quality.
Details
Motivation: Autoregressive LLMs generate tokens sequentially, causing high latency. Discrete diffusion models can generate tokens in parallel but require heuristic denoising orders that create quality-latency trade-offs. The goal is to develop a method that enables efficient parallel generation without sacrificing quality.Method: A single model that transitions between autoregressive and diffusion-based generation: 1) Autoregressively generates a plan partitioning the response into semantically independent chunks, 2) Denoises all chunks in parallel using diffusion. The model learns to determine its own denoising order through the planning phase.
Result: Achieves 1.27x to 1.81x speedup over autoregressive generation with only 0.87% to 5.4% drop in win rate on AlpacaEval, establishing new Pareto frontier for parallel generation. Instruction following quality continues to improve with more finetuning compute while autoregressive baseline plateaus.
Conclusion: Planned diffusion successfully addresses the quality-latency trade-off in parallel generation by enabling models to determine their own denoising order through learned planning, offering tunable control over the trade-off.
Abstract: Most large language models are autoregressive: they generate tokens one at a time. Discrete diffusion language models can generate multiple tokens in parallel, but sampling from them requires a denoising order: a strategy for deciding which tokens to decode at each step. Determining a good denoising order is difficult, and existing approaches use heuristics that create a steep trade-off between quality and latency. We propose planned diffusion, a system that trains the model to determine its own denoising order. Planned diffusion uses a single model that transitions between autoregressive and diffusion-based generation: first, the model autoregressively generates a plan that partitions the response into semantically independent chunks; second, the model denoises all chunks in parallel. The autoregressive plan enables the model to define the denoising order itself. On AlpacaEval, planned diffusion achieves 1.27x to 1.81x speedup over autoregressive generation with only 0.87% to 5.4% drop in win rate, establishing a new Pareto frontier for parallel generation with discrete diffusion. Additionally, planned diffusion’s instruction following quality continues to improve with more finetuning compute, while the autoregressive baseline plateaus. Our implementation provides simple runtime knobs that offer tunable control over the quality-latency trade-off.
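Stripped of the actual models, the control flow looks roughly like this sketch: an autoregressive planner fixes semantically independent chunks, then every chunk is refined jointly over a few denoising steps. Both plan_chunks and denoise_step are random stand-ins; only the two-phase structure mirrors the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK = 100, -1

def plan_chunks(prompt):
    # Stand-in for phase 1: the model autoregressively emits a plan that
    # partitions the response into independent chunks (here, fixed lengths).
    return [4, 6, 3]

def denoise_step(chunk, step):
    # Stand-in for one diffusion step: unmask a few positions per step.
    masked = np.flatnonzero(chunk == MASK)
    for pos in masked[:2]:                      # reveal up to 2 tokens per step
        chunk[pos] = rng.integers(0, VOCAB)
    return chunk

chunks = [np.full(n, MASK) for n in plan_chunks("prompt")]
for step in range(4):                           # phase 2: steps run over ALL
    chunks = [denoise_step(c, step) for c in chunks]  # chunks, parallel in practice
print([c.tolist() for c in chunks])             # fully denoised chunks
```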
[400] Hierarchical Adaptive networks with Task vectors for Test-Time Adaptation
Sameer Ambekar, Marta Hasny, Laura Daza, Daniel M. Lang, Julia A. Schnabel
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2508.09223 returned HTTP 429 (rate limited).
[401] Analysing Environmental Efficiency in AI for X-Ray Diagnosis
Liam Kearns
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2511.07436 returned HTTP 429 (rate limited).
[402] XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs
Linzhang Li, Yixin Dong, Guanjie Wang, Ziyi Xu, Alexander Jiang, Tianqi Chen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2601.04426 returned HTTP 429 (rate limited).
[403] RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback
Xiaoying Zhang, Zichen Liu, Yipeng Zhang, Xia Hu, Wenqi Shao
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.08561 returned HTTP 429 (rate limited).
[404] Consequentialist Objectives and Catastrophe
Henrik Marklund, Alex Infanger, Benjamin Van Roy
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.15017 returned HTTP 429 (rate limited).
[405] Conflict-Based Search for Multi Agent Path Finding with Asynchronous Actions
Xuemian Wu, Shizhe Zhao, Zhongqiang Ren
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.18866 returned HTTP 429 (rate limited).
[406] Characterizing Linear Alignment Across Language Models
Matt Gorbett, Suman Jana
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.18908 returned HTTP 429 (rate limited).
[407] Man and machine: artificial intelligence and judicial decision making
Arthur Dyevre, Ahmad Shahvaroughi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.19042 returned HTTP 429 (rate limited).
[408] MIRAGE: The Illusion of Visual Understanding
Mohammad Asadi, Jack W. O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, Euan Ashley
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.21687 returned HTTP 429 (rate limited).
[409] Environment Maps: Structured Environmental Representations for Long-Horizon Agents
Yenchia Feng, Chirag Sharma, Karime Maamari
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.23610 returned HTTP 429 (rate limited).
[410] AI-Supervisor: Autonomous AI Research Supervision via a Persistent Research World Model
Yunbo Long
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.24402 returned HTTP 429 (rate limited).
[411] A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge
Le Ma, Ran Zhang, Yikun Han, Shirui Yu, Zaitian Wang, Zhiyuan Ning, Jinghan Zhang, Ping Xu, Pengjiang Li, Ziyue Qiao, Wei Ju, Chong Chen, Dongjie Wang, Kunpeng Liu, Pengyang Wang, Pengfei Wang, Yanjie Fu, Chunjiang Liu, Yuanchun Zhou, Chang-Tien Lu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2310.11703 returned HTTP 429 (rate limited).
[412] ByteStorm: a multi-step data-driven approach for Tropical Cyclones detection and tracking
Davide Donno, Donatello Elia, Gabriele Accarino, Marco De Carlo, Enrico Scoccimarro, Silvio Gualdi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2512.07885 returned HTTP 429 (rate limited).
[413] The Future of AI-Driven Software Engineering
Valerio Terragni, Annie Vella, Partha Roop, Kelly Blincoe
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2406.07737 returned HTTP 429 (rate limited).
[414] Physics-Informed Evolution: An Evolutionary Framework for Solving Quantum Control Problems Involving the Schrödinger Equation
Kaichen Ouyang, Mingyang Yu, Zong Ke, Jun Zhang, Yi Chen, Huiling Chen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2502.05228 returned HTTP 429 (rate limited).
[415] DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents
Hao Li, Xiaogeng Liu, Hung-Chun Chiu, Dianqi Li, Ning Zhang, Chaowei Xiao
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2506.12104 returned HTTP 429 (rate limited).
[416] BMFM-RNA: whole-cell expression decoding improves transcriptomic foundation models
Michael M. Danziger, Bharath Dandala, Viatcheslav Gurev, Matthew Madgwick, Sivan Ravid, Tim Rumbell, Akira Koseki, Tal Kozlovski, Ching-Huei Tsou, Ella Barkan, Tanwi Biswas, Jielin Xu, Yishai Shimoni, Jianying Hu, Michal Rosen-Zvi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2506.14861 returned HTTP 429 (rate limited).
[417] The Information Dynamics of Generative Diffusion
Dejan Stancevic, Luca Ambrogioni
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2508.19897 returned HTTP 429 (rate limited).
[418] Gradient Regularized Natural Gradients
Satya Prakash Dash, Hossein Abdi, Wei Pan, Samuel Kaski, Mingfei Sun
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2601.18420 returned HTTP 429 (rate limited).
[419] End-to-End Low-Level Neural Control of an Industrial-Grade 6D Magnetic Levitation System
Philipp Hartmann, Jannick Stranghöner, Klaus Neumann
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2509.01388 returned HTTP 429 (rate limited).
[420] Constrained Diffusion for Protein Design with Hard Structural Constraints
Jacob K. Christopher, Austin Seamann, Jingyi Cui, Sagar Khare, Ferdinando Fioretto
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2510.14989 returned HTTP 429 (rate limited).
[421] Temporal Sepsis Modeling: a Fully Interpretable Relational Way
Vincent Lemaire, Nédra Meloulli, Pierre Jaquet
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2601.21747 returned HTTP 429 (rate limited).
[422] Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails
Ruinan Jin, Yingbin Liang, Shaofeng Zou
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.03099 returned HTTP 429 (rate limited).
[423] Epistemic Bias Injection: Biasing LLMs via Selective Context Retrieval
Hao Wu, Prateek Saxena
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2512.00804 returned HTTP 429 (rate limited).
[424] Constant-Time Motion Planning with Manipulation Behaviors
Nayesha Gandotra, Itamar Mishani, Maxim Likhachev
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2512.00939 returned HTTP 429 (rate limited).
[425] mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT
Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.21606 returned HTTP 429 (rate limited).
[426] Information Access of the Oppressed: A Problem-Posing Framework for Envisioning Emancipatory Information Access Platforms
Bhaskar Mitra, Nicola Neophytou, Sireesh Gururaja
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2601.09600 returned HTTP 429 (rate limited).
[427] P^2O: Joint Policy and Prompt Optimization
Xinyu Lu, Kaiqi Zhang, Jinglin Yang, Boxi Cao, Yaojie Lu, Hongyu Lin, Min He, Xianpei Han, Le Sun
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.21877 returned HTTP 429 (rate limited).
[428] Towards Exploratory and Focused Manipulation with Bimanual Active Perception: A New Problem, Benchmark and Strategy
Yuxin He, Ruihao Zhang, Tianao Shen, Cheng Liu, Qiang Nie
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2602.01939 returned HTTP 429 (rate limited).
[429] Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure
Davide Di Gioia
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.22384 returned HTTP 429 (rate limited).
[430] Impact of AI Search Summaries on Website Traffic: Evidence from Google AI Overviews and Wikipedia
Mehrzad Khosravi, Hema Yoganarasimhan
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2602.18455 returned HTTP 429 (rate limited).
[431] The Landscape of AI in Science Education: What is Changing and How to Respond
Xiaoming Zhai, Kent Crippen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2602.18469 returned HTTP 429 (rate limited).
[432] Probabilistic Geometric Alignment via Bayesian Latent Transport for Domain-Adaptive Foundation Models
Aueaphum Aueawatthanaphisut, Kuepon Auewattanapisut
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.23783 returned HTTP 429 (rate limited).
[433] The DMA Streaming Framework: Kernel-Level Buffer Orchestration for High-Performance AI Data Paths
Marco Graziano
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.10030 returned HTTP 429 (rate limited).
[434] Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI
David Fraile Navarro, Farah Magrabi, Enrico Coiera
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.11413 returned HTTP 429 (rate limited).
[435] When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making
Jun Liu, Pu Zhao, Zhenglun Kong, Xuan Shen, Peiyan Dong, Fan Yang, Lin Cui, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Gaowen Liu, Yanzhi Wang, Dong Huang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.16673 returned HTTP 429 (rate limited).
[436] TRACE: A Multi-Agent System for Autonomous Physical Reasoning for Seismology
Feng Liu, Jian Xu, Xin Cui, Xinghao Wang, Zijie Guo, Jiong Wang, S. Mostafa Mousavi, Xinyu Gu, Hao Chen, Ben Fei, Lihua Fang, Fenghua Ling, Zefeng Li, Lei Bai
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.21152 returned HTTP 429 (rate limited).
[437] Cognitive Training for Language Models: Towards General Capabilities via Cross-Entropy Games
Clément Hongler, Franck Gabriel, Valentin Hartmann, Arthur Renard, Andrew Emil
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.22479 returned HTTP 429 (rate limited).
cs.SD
[438] Joint Learning Global-Local Speaker Classification to Enhance End-to-End Speaker Diarization and Recognition
Yuhang Dai, Haopeng Lin, Jiale Qian, Ruiqi Yan, Hao Meng, Hanke Xie, Hanlin Wen, Shunshun Yin, Ming Tao, Xie Chen, Lei Xie, Xinsheng Wang
Main category: cs.SD
TL;DR: GLSC-SDR improves speaker discriminability in Large Audio-Language Models through joint training of speaker classification with diarization and recognition, using a Global-Local Speaker Classification strategy.
Details
Motivation: Current LALMs have limited speaker discriminability due to scarcity of large-scale conversational data and lack of explicit speaker representation optimization, which hinders their performance in speaker diarization and recognition tasks.Method: Proposes GLSC-SDR paradigm that jointly trains speaker classification with diarization and recognition. Introduces Global-Local Speaker Classification strategy using clustered speakers as global labels and re-encoded intra-cluster speakers as local labels for hierarchical speaker discrimination.
Result: Achieves competitive or superior performance on AliMeeting, AISHELL-4, and AMI-SDM datasets compared to simulation-based and multi-encoder approaches, without requiring large-scale real conversational data.
Conclusion: The proposed approach effectively enhances fine-grained speaker discrimination while preserving semantic transcription accuracy in audio-language models.
Abstract: Large Audio-Language Models (LALMs) have demonstrated remarkable performance in end-to-end speaker diarization and recognition. However, their speaker discriminability remains limited due to the scarcity of large-scale conversational data and the absence of explicit speaker representation optimization. To address this, we propose GLSC-SDR, a paradigm that jointly trains speaker classification with diarization and recognition. We further introduce a Global-Local Speaker Classification strategy, which uses clustered speakers as global labels and re-encoded intra-cluster speakers as local labels. This hierarchical design enhances fine-grained speaker discrimination while preserving semantic transcription accuracy. Experiments on AliMeeting, AISHELL-4, and AMI-SDM demonstrate that GLSC-SDR achieves competitive or superior performance compared to simulation-based and multi-encoder approaches, without relying on large-scale real conversational data.
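A minimal sketch of how global and local labels could be built: cluster speaker embeddings for the global labels, then index speakers within each cluster for the local ones. The embeddings, the cluster count, and the plain re-indexing (standing in for the paper's intra-cluster re-encoding) are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
spk_embs = rng.standard_normal((200, 64))        # one embedding per speaker

# Global labels: cluster assignment over all speakers.
global_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(spk_embs)

# Local labels: re-index speakers within each cluster (a stand-in for the
# paper's re-encoding of intra-cluster speakers).
local_labels = np.zeros_like(global_labels)
for g in np.unique(global_labels):
    members = np.flatnonzero(global_labels == g)
    local_labels[members] = np.arange(len(members))

# Each speaker now carries a (global, local) pair for hierarchical speaker
# classification alongside the diarization and recognition objectives.
print(global_labels[:5], local_labels[:5])
```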
[439] CoDeTT: A Context-Aware Decision Benchmark for Turn-Taking Evaluation
Huan Shen, Yingao Wang, Shangkun Huang, Wei Zou, Yunzhang Chen
Main category: cs.SD
TL;DR: CoDeTT is a context-aware benchmark for evaluating turn-taking models as structured decision problems across multiple conversational scenarios with fine-grained categories.
Details
Motivation: Current turn-taking evaluation is fragmented and limited to binary boundary detection in narrow settings, hindering systematic comparison and obscuring model weaknesses across different conversational conditions.Method: CoDeTT formulates turn-taking as a structured decision problem and constructs a multi-scenario dataset with fine-grained decision categories and controlled context variations. It provides a unified evaluation protocol for systematic assessment.
Result: Evaluation of representative existing models shows substantial performance disparities across decision types and interaction scenarios, revealing model weaknesses that were previously obscured.
Conclusion: CoDeTT provides a standardized benchmark for systematic and context-aware evaluation of turn-taking systems, enabling better comparison and understanding of model capabilities across diverse conversational conditions.
Abstract: Turn-taking modeling is fundamental to spoken dialogue systems, yet its evaluation remains fragmented and often limited to binary boundary detection under narrow interaction settings. Such protocols hinder systematic comparison and obscure model weaknesses across conversational conditions. We present CoDeTT, a context-aware decision benchmark for turn-taking evaluation. CoDeTT formulates turn-taking as a structured decision problem and constructs a multi-scenario dataset with fine-grained decision categories and controlled context variations. Under a unified evaluation protocol, we assess representative existing models and observe substantial performance disparities across decision types and interaction scenarios. CoDeTT provides a standardized benchmark for systematic and context-aware evaluation of turn-taking systems. The benchmark dataset and evaluation toolkit are available at https://github.com/YingaoWang-casia/CoDeTT.github.io.
[440] CLAR: CIF-Localized Alignment for Retrieval-Augmented Speech LLM-Based Contextual ASR
Shangkun Huang, Huan Shen, Wei Zou, Yunzhang Chen
Main category: cs.SD
TL;DR: CLAR is a dual-encoder speech-text retriever that improves ASR for named entities and long-tail words by learning monotonic token-level alignments without timestamps, then injecting retrieved hotwords as contextual prompts into Speech LLMs.
Details
Motivation: Speech LLM-based ASR struggles with named entities and long-tail words due to strong internal language model priors. While retrieval-augmented biasing can help, it requires accurate hotword localization in full-utterance speech under weak supervision.Method: Proposes CLAR, a dual-encoder speech-text retriever using Continuous Integrate-and-Fire (CIF) to learn monotonic token-level alignments without timestamps. Uses length-aware localized matching to anchor short-entity acoustic cues and reduce representation dilution. Trained with multi-granularity objective combining global and local segment-level contrastive losses plus CIF quantity constraint.
Result: CLAR significantly improves hotword retrieval and reduces both CER (Character Error Rate) and B-WER (Biased Word Error Rate) against strong contextual ASR baselines.
Conclusion: The proposed CLAR framework effectively addresses ASR challenges for named entities and long-tail words by learning better speech-text alignments and injecting retrieved hotwords as contextual prompts, improving recognition without shallow fusion.
Abstract: Speech LLM-based ASR often struggles with named entities and long-tail words due to strong internal language-model priors. Retrieval-augmented biasing can help, but its effectiveness depends on accurate hotword localization in full-utterance speech under weak supervision. We propose CLAR, a dual-encoder speech-text retriever that uses Continuous Integrate-and-Fire (CIF) to learn monotonic token-level alignments without timestamps. With length-aware localized matching, CLAR anchors short-entity acoustic cues and reduces representation dilution and attention drift. The retriever is trained with a multi-granularity objective combining global and local segment-level contrastive losses and a CIF quantity constraint. At inference, top-ranked hotwords are injected as contextual prompts for the Speech LLM, improving recognition without shallow fusion. Experiments show that CLAR significantly improves hotword retrieval and reduces both CER and B-WER against strong contextual ASR baselines.
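Since Continuous Integrate-and-Fire is the alignment backbone here, a self-contained sketch of the mechanism helps: per-frame weights are accumulated until they cross a threshold, at which point one token embedding fires, yielding monotonic token-level units without any timestamps. The uniform weights below are illustrative; in CLAR they are predicted and tied to a quantity constraint.

```python
import numpy as np

def cif_aggregate(frame_feats, alphas, threshold=1.0):
    """Continuous Integrate-and-Fire: integrate per-frame weights and emit
    one token embedding each time the integral crosses `threshold`.
    frame_feats: (T, D) encoder frames; alphas: (T,) non-negative weights."""
    tokens, acc_w, acc_v = [], 0.0, np.zeros(frame_feats.shape[1])
    for h, a in zip(frame_feats, alphas):
        if acc_w + a < threshold:          # keep integrating this frame
            acc_w += a
            acc_v += a * h
        else:                              # fire: split the frame's weight
            spill = acc_w + a - threshold
            acc_v += (a - spill) * h
            tokens.append(acc_v / threshold)
            acc_w, acc_v = spill, spill * h
    return np.stack(tokens) if tokens else np.zeros((0, frame_feats.shape[1]))

T, D = 40, 8
feats = np.random.randn(T, D)
alphas = np.full(T, 0.3)                   # in CLAR these weights are predicted
print(cif_aggregate(feats, alphas).shape)  # (12, 8): monotonic token-level units
```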
[441] U-DREAM: Unsupervised Dereverberation guided by a Reverberation Model
Louis Bahrman, Marius Rodrigues, Mathieu Fontaine, Gaël Richard
Main category: cs.SD
TL;DR: A weakly-supervised to unsupervised dereverberation method using only reverberant signals and acoustic models, requiring minimal labeled data.
Details
Motivation: Existing deep learning dereverberation methods require paired dry and reverberant data, which is difficult to obtain in practice, creating a need for more data-efficient approaches.Method: Sequential learning strategy based on maximum-likelihood formulation, using deep neural networks to estimate acoustic parameters and dry signals from reverberant inputs with reverberation matching loss.
Result: The most data-efficient variant requires only 100 reverberation-parameter-labeled samples to outperform unsupervised baselines, demonstrating effectiveness in low-resource scenarios.
Conclusion: Proposed method enables practical dereverberation with minimal supervision, addressing the data scarcity problem in real-world applications.
Abstract: This paper explores the outcome of training state-of-the-art dereverberation models with supervision settings ranging from weakly-supervised to virtually unsupervised, relying solely on reverberant signals and an acoustic model for training. Most of the existing deep learning approaches typically require paired dry and reverberant data, which are difficult to obtain in practice. We develop instead a sequential learning strategy motivated by a maximum-likelihood formulation of the dereverberation problem, wherein acoustic parameters and dry signals are estimated from reverberant inputs using deep neural networks, guided by a reverberation matching loss. Our most data-efficient variant requires only 100 reverberation-parameter-labeled samples to outperform an unsupervised baseline, demonstrating the effectiveness and practicality of the proposed method in low-resource scenarios.
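The reverberation-matching idea fits in a few lines: re-convolve the estimated dry signal with the estimated room impulse response and penalize the mismatch with the observed reverberant input. The placeholder estimates and squared-error loss below are assumptions for illustration; in U-DREAM the estimates come from deep networks guided by an acoustic model.

```python
import numpy as np

def reverberation_matching_loss(reverberant, dry_est, rir_est):
    # Re-synthesize the reverberant signal from the estimates and compare.
    resynth = np.convolve(dry_est, rir_est)[: len(reverberant)]
    return float(np.mean((reverberant - resynth) ** 2))

rng = np.random.default_rng(0)
dry = rng.standard_normal(16000)                     # 1 s of "dry" audio at 16 kHz
rir = np.exp(-np.linspace(0, 8, 4000)) * rng.standard_normal(4000)
wet = np.convolve(dry, rir)[:16000]                  # simulated reverberant observation

print(reverberation_matching_loss(wet, dry, rir))         # ~0 for oracle estimates
print(reverberation_matching_loss(wet, dry * 0.5, rir))   # worse estimate, higher loss
```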
[442] MiDashengLM: Efficient Audio Understanding with General Audio Captions
Heinrich Dinkel, Gang Li, Jizhong Liu, Jian Luan, Yadong Niu, Xingwei Sun, Tianzi Wang, Qiyang Xiao, Junbo Zhang, Jiahao Zhou
Main category: cs.SD
TL;DR: MiDashengLM is an open audio-language model using general audio captions for comprehensive audio understanding, with efficient processing and full transparency.
Details
Motivation: Current LALMs rely on closed data/proprietary models, limiting generalization and accessibility. Need for open, transparent models that handle diverse audio types (speech, sound, music) holistically rather than just ASR-based alignment.Method: Uses novel ACAVCaps training dataset with general audio captions. Integrates Dasheng open-source audio encoder. Trained exclusively on publicly available pretraining and SFT datasets. Focuses on fusing speech, sound, and music into unified textual representation.
Result: Achieves up to 4x speedup in time-to-first-token and up to 20x higher throughput than comparable models. Provides holistic audio understanding across diverse audio types.
Conclusion: MiDashengLM demonstrates effective open audio-language modeling with general audio captioning approach, offering transparency, reproducibility, and significant efficiency improvements.
Abstract: Current approaches for large audio language models (LALMs) often rely on closed data sources or proprietary models, limiting their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding through the use of general audio captions using our novel ACAVCaps training dataset. MiDashengLM exclusively relies on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder, specifically engineered to process diverse auditory information effectively. Unlike previous works primarily focused on Automatic Speech Recognition (ASR) based audio-text alignment, our strategy centers on general audio captions, fusing speech, sound and music information into one textual representation, enabling a holistic textual representation of complex audio scenes. Lastly, MiDashengLM provides an up to 4x speedup in terms of time-to-first-token (TTFT) and up to 20x higher throughput than comparable models. Checkpoints are available online at https://huggingface.co/mispeech/midashenglm-7b and https://github.com/xiaomi-research/dasheng-lm.
[443] DashengTokenizer: One layer is enough for unified audio understanding and generation
Heinrich Dinkel, Xingwei Sun, Gang Li, Jiahao Mei, Yadong Niu, Jizhong Liu, Xiyang Li, Yifan Liao, Jiahao Zhou, Junbo Zhang, Jian Luan
Main category: cs.SD
TL;DR: DashengTokenizer is a continuous audio tokenizer that uses frozen semantic features with injected acoustic information for both understanding and generation tasks, outperforming previous methods across diverse audio tasks.
Details
Motivation: The paper challenges conventional approaches that train acoustic tokenizers first and then integrate semantic knowledge. Instead, it proposes inverting this paradigm by leveraging frozen semantic features and injecting acoustic information to create a more effective joint audio understanding and generation system.
Method: The method develops DashengTokenizer, a continuous audio tokenizer that uses frozen semantic features as a foundation and injects acoustic information into this structure. This approach contrasts with standard VAE-based methods and conventional acoustic tokenizer training pipelines.
Result: The tokenizer outperforms previous audio codec and encoder baselines across 22 diverse tasks in linear evaluation, maintains competitive audio reconstruction quality, and shows improved performance on speech emotion recognition, music understanding, and acoustic scene classification. It also surpasses VAE-based methods on text-to-audio and text-to-music tasks while being effective for speech enhancement.
Conclusion: DashengTokenizer demonstrates that acoustic injection into frozen semantic features creates an effective general-purpose audio encoder that works for both understanding and generation tasks, challenging the assumption that VAE-based architectures are necessary for audio synthesis.
Abstract: This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this paradigm: we leverage frozen semantic features and inject acoustic information. In linear evaluation across 22 diverse tasks, our method outperforms previous audio codec and audio encoder baselines by a significant margin while maintaining competitive audio reconstruction quality. Notably, we demonstrate that this acoustic injection improves performance for tasks such as speech emotion recognition, music understanding, and acoustic scene classification. We further evaluate the tokenizer’s generative performance on text-to-audio (TTA), text-to-music (TTM), and speech enhancement (SE). Our approach surpasses standard variational autoencoder (VAE)-based methods on TTA and TTM tasks, while its effectiveness on SE underscores its capabilities as a general-purpose audio encoder. Finally, our results challenge the prevailing assumption that VAE-based architectures are a prerequisite for audio synthesis. Checkpoints are available at https://huggingface.co/mispeech/dashengtokenizer.
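The inverted paradigm reduces to a simple wiring diagram: freeze a semantic encoder and learn a small acoustic branch whose features are fused in. A minimal sketch, where the module names, dimensions, and interpolation-based alignment are assumptions rather than the paper's design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticInjectedTokenizer(nn.Module):
    def __init__(self, semantic_encoder: nn.Module, dim=768):
        super().__init__()
        self.semantic = semantic_encoder
        for p in self.semantic.parameters():     # frozen semantic backbone
            p.requires_grad = False
        self.acoustic = nn.Sequential(           # small trainable acoustic branch
            nn.Conv1d(1, dim, kernel_size=400, stride=320),
            nn.GELU(),
        )
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, wav):                      # wav: (B, T) raw audio
        with torch.no_grad():
            sem = self.semantic(wav)             # (B, L, dim) frozen semantics
        aco = self.acoustic(wav.unsqueeze(1))    # (B, dim, L')
        aco = F.interpolate(aco, size=sem.size(1)).transpose(1, 2)
        return self.fuse(torch.cat([sem, aco], dim=-1))  # inject acoustics
```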
[444] A Lightweight Two-Branch Architecture for Multi-instrument Transcription via Note-Level Contrastive Clustering
Ruigang Li, Yongxu Zhu
Main category: cs.SD
TL;DR: A lightweight model for joint transcription and dynamic separation of arbitrary instruments using timbre encoding and deep clustering at note level, optimized for resource-constrained deployment.
Details
Motivation: Existing multi-timbre transcription models have limitations: poor generalization beyond pre-trained instruments, rigid source-count constraints, and high computational demands that prevent deployment on low-resource devices.
Method: Extends a timbre-agnostic transcription backbone with a dedicated timbre encoder and performs deep clustering at the note level for joint transcription and dynamic separation. Uses practical optimizations including spectral normalization, dilated convolutions, and contrastive clustering for efficiency and robustness.
Result: Achieves competitive performance with heavier baselines in transcription accuracy and separation quality despite small size and fast inference. Shows promising generalization ability.
Conclusion: The lightweight model is highly suitable for real-world deployment in practical and resource-constrained settings, addressing key limitations of existing multi-timbre transcription systems.
Abstract: Existing multi-timbre transcription models struggle with generalization beyond pre-trained instruments, rigid source-count constraints, and high computational demands that hinder deployment on low-resource devices. We address these limitations with a lightweight model that extends a timbre-agnostic transcription backbone with a dedicated timbre encoder and performs deep clustering at the note level, enabling joint transcription and dynamic separation of arbitrary instruments given a specified number of instrument classes. Practical optimizations including spectral normalization, dilated convolutions, and contrastive clustering further improve efficiency and robustness. Despite its small size and fast inference, the model achieves competitive performance with heavier baselines in terms of transcription accuracy and separation quality, and shows promising generalization ability, making it highly suitable for real-world deployment in practical and resource-constrained settings.
[445] Enhancing Automatic Chord Recognition via Pseudo-Labeling and Knowledge Distillation
Nghia Phan, Rong Jin, Gang Liu, Xiao Dong
Main category: cs.SD
TL;DR: Two-stage training pipeline for Automatic Chord Recognition using pre-trained models and unlabeled audio, achieving state-of-the-art performance through pseudo-labeling and selective knowledge distillation.
Details
Motivation: Automatic Chord Recognition faces data scarcity issues due to costly aligned chord annotations, while pre-trained models are becoming more accessible than their proprietary training data.
Method: Two-stage pipeline: 1) Use pre-trained BTC model as teacher to generate pseudo-labels for 1,000+ hours of unlabeled audio, train student model on pseudo-labels; 2) Continual training on ground-truth labels with selective knowledge distillation to prevent catastrophic forgetting.
Result: BTC student achieves 98% of teacher’s performance with pseudo-labels only; after stage 2, surpasses supervised baseline by 2.5% and teacher by 1.55%. 2E1D student improves baseline by 2.67% and matches teacher performance. Large gains on rare chord qualities.
Conclusion: The proposed method effectively leverages pre-trained models and unlabeled audio to overcome data scarcity in ACR, achieving state-of-the-art performance through pseudo-labeling and selective knowledge distillation.
Abstract: Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, as well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we present a two-stage training pipeline that leverages pre-trained models together with unlabeled audio. The proposed method decouples training into two stages. In the first stage, we use a pre-trained BTC model as a teacher to generate pseudo-labels for over 1,000 hours of diverse unlabeled audio and train a student model solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available. To prevent catastrophic forgetting of the representations learned in the first stage, we apply selective knowledge distillation (KD) from the teacher as a regularizer. In our experiments, two models (BTC, 2E1D) were used as students. In stage 1, using only pseudo-labels, the BTC student achieves over 98% of the teacher’s performance, while the 2E1D model achieves about 96% across seven standard mir_eval metrics. After a single training run for both students in stage 2, the resulting BTC student model surpasses the traditional supervised learning baseline by 2.5% and the original pre-trained teacher model by 1.55% on average across all metrics. The resulting 2E1D student model improves over the traditional supervised learning baseline by 2.67% on average and achieves almost the same performance as the teacher. Both cases show large gains on rare chord qualities.
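Stage 2 amounts to a cross-entropy term on ground-truth labels plus a distillation term that is applied selectively. A hedged sketch, where the confidence-gated mask is an illustrative stand-in for the paper's selection criterion:

```python
import torch
import torch.nn.functional as F

def stage2_loss(student_logits, teacher_logits, targets,
                kd_weight=0.5, conf_threshold=0.9, tau=2.0):
    """student/teacher logits: (N, C); targets: (N,) chord class indices."""
    ce = F.cross_entropy(student_logits, targets)
    with torch.no_grad():
        t_prob = F.softmax(teacher_logits / tau, dim=-1)
        # Distill only where the frozen teacher is confident.
        mask = (t_prob.max(dim=-1).values > conf_threshold).float()
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  t_prob, reduction="none").sum(-1)
    kd = (kd * mask).sum() / mask.sum().clamp(min=1.0)
    return ce + kd_weight * (tau ** 2) * kd
```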
[446] Enhancing Efficiency and Performance in Deepfake Audio Detection through Neuron-level Dropin & Neuroplasticity Mechanisms
Yupei Li, Shuaijie Shao, Manuel Milling, Björn Schuller
Main category: cs.SD
TL;DR: Novel dropin and plasticity algorithms dynamically adjust neuron counts in layers to modulate model parameters, improving audio deepfake detection efficiency and performance across ResNet, GRU, and Wav2Vec architectures.
Details
Motivation: Current audio deepfake detection faces computational bottlenecks with large models, and existing low-rank adaptation methods are limited to attention-based architectures. Inspired by neuronal plasticity in mammalian brains, the authors seek more flexible parameter modulation approaches.
Method: Proposed dropin and plasticity algorithms that dynamically adjust the number of neurons in certain layers to flexibly modulate model parameters. Evaluated on multiple architectures including ResNet, Gated Recurrent Neural Networks (GRU), and Wav2Vec.
Result: Experimental results on ASVSpoof2019 LA, PA, and FakeorReal datasets show consistent improvements in computational efficiency with dropin, and maximum relative reductions of ~39% and ~66% in Equal Error Rate with dropin and plasticity approaches respectively.
Conclusion: The proposed neuronal plasticity-inspired algorithms provide effective parameter modulation for audio deepfake detection, offering computational efficiency gains and significant performance improvements across diverse model architectures.
Abstract: Current audio deepfake detection has achieved remarkable performance using diverse deep learning architectures such as ResNet, and has seen further improvements with the introduction of large models (LMs) like Wav2Vec. The success of large language models (LLMs) further demonstrates the benefits of scaling model parameters, but also highlights one bottleneck where performance gains are constrained by parameter counts. Simply stacking additional layers, as done in current LLMs, is computationally expensive and requires full retraining. Furthermore, existing low-rank adaptation methods are primarily applied to attention-based architectures, which limits their scope. Inspired by the neuronal plasticity observed in mammalian brains, we propose novel algorithms, dropin and further plasticity, that dynamically adjust the number of neurons in certain layers to flexibly modulate model parameters. We evaluate these algorithms on multiple architectures, including ResNet, Gated Recurrent Neural Networks, and Wav2Vec. Experimental results using the widely recognised ASVSpoof2019 LA, PA, and FakeorReal datasets demonstrate consistent improvements in computational efficiency with the dropin approach, and maximum relative reductions in Equal Error Rate of around 39% and 66% with the dropin and plasticity approaches across these datasets, respectively. The code and supplementary material are available at Github link.
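One plausible reading of neuron-level dropin is function-preserving width growth: append neurons to a layer and zero-initialize the downstream columns, so the network's output is unchanged at the moment of insertion. A sketch under that assumption (not the paper's code):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def dropin(layer: nn.Linear, next_layer: nn.Linear, n_new: int):
    """Widen `layer` by n_new neurons and rewire `next_layer` to match."""
    out_f, in_f = layer.weight.shape
    wider = nn.Linear(in_f, out_f + n_new)
    wider.weight[:out_f] = layer.weight
    wider.bias[:out_f] = layer.bias
    nn.init.kaiming_uniform_(wider.weight[out_f:])   # fresh neurons

    nxt = nn.Linear(out_f + n_new, next_layer.out_features)
    nxt.weight[:, :out_f] = next_layer.weight
    nxt.weight[:, out_f:] = 0.0                      # new paths start inert
    nxt.bias.copy_(next_layer.bias)
    return wider, nxt

# Usage: hidden, head = dropin(hidden, head, n_new=16)
```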
cs.LG
[447] DyMRL: Dynamic Multispace Representation Learning for Multimodal Event Forecasting in Knowledge Graph
Feng Zhao, Kangzheng Liu, Teng Peng, Yu Yang, Guandong Xu
Main category: cs.LG
TL;DR: DyMRL is a dynamic multispace representation learning approach for multimodal temporal knowledge acquisition and fusion in event forecasting, using Euclidean, hyperbolic, and complex spaces with dual fusion-evolution attention mechanisms.
Details
Motivation: Existing multimodal event forecasting methods focus on static settings and overlook dynamic acquisition/fusion of multimodal knowledge. Current approaches struggle with learning time-sensitive structural information and capturing evolving multimodal fusion features with varying historical contributions.
Method: 1) Integrates time-specific structural features from Euclidean, hyperbolic, and complex spaces into relational message-passing framework to learn deep representations. 2) Uses dual fusion-evolution attention mechanisms that assign dynamic learning emphases to different modalities at different timestamps in a symmetric manner. 3) Leverages pretrained models for time-sensitive visual and linguistic intelligences.
Result: DyMRL outperforms state-of-the-art dynamic unimodal and static multimodal baseline methods on four constructed multimodal temporal knowledge graph benchmarks.
Conclusion: DyMRL effectively addresses dynamic multimodal knowledge acquisition and fusion for event forecasting by learning deep relation-aware geometric features across multiple spaces and capturing evolving multimodal fusion patterns through symmetric attention mechanisms.
Abstract: Accurate representation of multimodal knowledge is crucial for event forecasting in real-world scenarios. However, existing studies have largely focused on static settings, overlooking the dynamic acquisition and fusion of multimodal knowledge. 1) At the knowledge acquisition level, how to learn time-sensitive information of different modalities, especially the dynamic structural modality. Existing dynamic learning methods are often limited to shallow structures across heterogeneous spaces or simple unispaces, making it difficult to capture deep relation-aware geometric features. 2) At the knowledge fusion level, how to learn evolving multimodal fusion features. Existing knowledge fusion methods based on static coattention struggle to capture the varying historical contributions of different modalities to future events. To this end, we propose DyMRL, a Dynamic Multispace Representation Learning approach to efficiently acquire and fuse multimodal temporal knowledge. 1) For the former issue, DyMRL integrates time-specific structural features from Euclidean, hyperbolic, and complex spaces into a relational message-passing framework to learn deep representations, reflecting human intelligences in associative thinking, high-order abstracting, and logical reasoning. Pretrained models endow DyMRL with time-sensitive visual and linguistic intelligences. 2) For the latter concern, DyMRL incorporates advanced dual fusion-evolution attention mechanisms that assign dynamic learning emphases equally to different modalities at different timestamps in a symmetric manner. To evaluate DyMRL’s event forecasting performance through leveraging its learned multimodal temporal knowledge in history, we construct four multimodal temporal knowledge graph benchmarks. Extensive experiments demonstrate that DyMRL outperforms state-of-the-art dynamic unimodal and static multimodal baseline methods.
[448] How unconstrained machine-learning models learn physical symmetries
Michelangelo Domina, Joseph William Abbott, Paolo Pegolo, Filippo Bigi, Michele Ceriotti
Main category: cs.LG
TL;DR: The paper introduces metrics to measure symmetry content in learned representations of unconstrained ML models for physical simulations, applies them to transformer-based models, and shows how strategic injection of minimal inductive biases improves stability and accuracy while preserving expressivity.
Details
Motivation: Physical simulations require predictions that fulfill fundamental symmetries, but unconstrained models often perform competitively by learning approximate equivariant behavior through data augmentation. The paper aims to rigorously measure how these models process symmetry information and understand their failure modes.
Method: Introduces rigorous metrics to measure symmetry content in learned representations. Applies these metrics to two unconstrained transformer-based models: a graph neural network for atomistic simulations and a PointNet-style architecture for particle physics. Analyzes how symmetry information is processed across architectural layers and learned during training.
Result: The analysis enables diagnosis of spectral failure modes in ML models. Demonstrates that strategic injection of minimum required inductive biases leads to superior stability and accuracy while preserving the high expressivity and scalability of unconstrained architectures and guaranteeing physical fidelity.
Conclusion: The paper establishes a framework for analyzing symmetry learning in unconstrained models and shows that targeted inductive bias injection can achieve both physical fidelity and model expressivity, providing insights into how ML models learn physical symmetries.
Abstract: The requirement of generating predictions that exactly fulfill the fundamental symmetry of the corresponding physical quantities has profoundly shaped the development of machine-learning models for physical simulations. In many cases, models are built using constrained mathematical forms that ensure that symmetries are enforced exactly. However, unconstrained models that do not obey rotational symmetries are often found to have competitive performance, and to be able to learn to a high level of accuracy an approximate equivariant behavior with a simple data augmentation strategy. In this paper, we introduce rigorous metrics to measure the symmetry content of the learned representations in such models, and assess the accuracy by which the outputs fulfill the equivariant condition. We apply these metrics to two unconstrained, transformer-based models operating on decorated point clouds (a graph neural network for atomistic simulations and a PointNet-style architecture for particle physics) to investigate how symmetry information is processed across architectural layers and is learned during training. Based on these insights, we establish a rigorous framework for diagnosing spectral failure modes in ML models. Enabled by this analysis, we demonstrate that one can achieve superior stability and accuracy by strategically injecting the minimum required inductive biases, preserving the high expressivity and scalability of unconstrained architectures while guaranteeing physical fidelity.
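The core measurement is easy to state: compare f(Rx) with Rf(x) over random rotations. A minimal version of such an equivariance metric (the model and data shapes are placeholders, and the paper's layer-wise spectral metrics are richer than this output-level check):

```python
import torch

def rotation_equivariance_error(f, x, n_samples=32):
    """x: (N, 3) point cloud; f: maps (N, 3) -> (N, 3) vector outputs."""
    errs = []
    for _ in range(n_samples):
        q, _ = torch.linalg.qr(torch.randn(3, 3))
        R = q * torch.sign(torch.det(q))      # random rotation with det = +1
        lhs, rhs = f(x @ R.T), f(x) @ R.T     # f(Rx) vs. R f(x)
        errs.append((lhs - rhs).norm() / rhs.norm().clamp(min=1e-12))
    return torch.stack(errs).mean()
```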
[449] Experiential Reflective Learning for Self-Improving LLM Agents
Marc-Antoine Allard, Arnaud Teinturier, Victor Xing, Gautier Viaud
Main category: cs.LG
TL;DR: ERL framework enables LLM agents to self-improve by learning transferable heuristics from past experiences, improving task success rates through experiential learning.
Details
Motivation: Current LLM-based autonomous agents lack adaptation to specialized environments and don't leverage accumulated experience, approaching each task from scratch despite having past interactions.
Method: Experiential Reflective Learning (ERL) framework where agents reflect on task trajectories and outcomes to generate reusable heuristics, which are retrieved and injected into context for new tasks.
Result: On Gaia2 benchmark, ERL improves success rate by 7.8% over ReAct baseline, with large gains in task completion reliability, outperforming prior experiential learning methods.
Conclusion: Reflecting on single-attempt experiences to extract transferable heuristics enables effective agent self-improvement, with selective retrieval being essential for success.
Abstract: Recent advances in large language models (LLMs) have enabled the development of autonomous agents capable of complex reasoning and multi-step problem solving. However, these agents struggle to adapt to specialized environments and do not leverage past interactions, approaching each new task from scratch regardless of their accumulated experience. We introduce Experiential Reflective Learning (ERL), a simple self-improvement framework that enables rapid environment adaptation through experiential learning. ERL reflects on task trajectories and outcomes to generate heuristics, capturing actionable lessons that transfer across tasks. At test time, relevant heuristics are retrieved based on the current task and injected into the agent’s context to guide execution. On the Gaia2 benchmark, ERL improves success rate by 7.8% over a ReAct baseline, with large gains in task completion reliability, and outperforms prior experiential learning methods. Through systematic ablations, we find that selective retrieval is essential and that heuristics provide more transferable abstractions than few-shot trajectory prompting. These results demonstrate that reflecting on single-attempt experiences to extract transferable heuristics enables effective agent self-improvement.
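The ERL loop is compact enough to sketch end to end: reflect on a finished trajectory to distill a textual heuristic, store it, and retrieve the most similar heuristics into the context of the next task. `llm` and `embed` below are assumed interfaces (any chat model and sentence embedder), not APIs from the paper:

```python
import numpy as np

class HeuristicMemory:
    def __init__(self, embed):
        self.embed, self.items, self.vecs = embed, [], []

    def reflect_and_store(self, llm, trajectory, outcome):
        h = llm(f"Extract one reusable, task-general lesson from this "
                f"trajectory (outcome: {outcome}):\n{trajectory}")
        self.items.append(h)
        self.vecs.append(self.embed(h))

    def retrieve(self, task, k=3):
        if not self.items:
            return []
        q = self.embed(task)
        sims = np.array([q @ v / (np.linalg.norm(q) * np.linalg.norm(v))
                         for v in self.vecs])
        return [self.items[i] for i in sims.argsort()[::-1][:k]]

def run_task(llm, memory, task):
    # Selective retrieval: only the top-k relevant heuristics enter context.
    hints = "\n".join(memory.retrieve(task))
    return llm(f"Heuristics from past experience:\n{hints}\n\nTask: {task}")
```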
[450] Learning Mesh-Free Discrete Differential Operators with Self-Supervised Graph Neural Networks
Lucas Gerken Starepravo, Georgios Fourtakas, Steven Lind, Ajay B. Harish, Tianning Tang, Jack R. C. King
Main category: cs.LG
TL;DR: A neural network framework learns mesh-free discrete differential operators from local geometry using graph neural networks trained via polynomial moment constraints, achieving improved accuracy-cost trade-offs for irregular particle configurations.
Details
Motivation: Classical meshless methods face a trade-off between computational cost and accuracy - either low cost with limited accuracy or high accuracy with substantial computation. There's a need for mesh-free operators that maintain accuracy while being efficient and robust to irregular geometries.
Method: Uses graph neural networks to learn mesh-free discrete differential operators by mapping local stencil relative positions to operator weights. Training employs polynomial moment constraints from truncated Taylor expansions. The learned operators are geometry-dependent, resolution-agnostic, and reusable across different particle configurations and equations.
Result: The framework shows improved accuracy over Smoothed Particle Hydrodynamics and favorable accuracy-cost trade-off compared to high-order consistent mesh-free methods. Successfully applied to solving weakly compressible Navier-Stokes equations, demonstrating practical applicability.
Conclusion: Neural networks can learn classical polynomial consistency while maintaining robustness to irregular geometry, creating reusable, resolution-agnostic operators that offer better accuracy-cost trade-offs for mesh-free numerical methods.
Abstract: Mesh-free numerical methods provide flexible discretisations for complex geometries; however, classical meshless discrete differential operators typically trade low computational cost for limited accuracy or high accuracy for substantial per-stencil computation. We introduce a parametrised framework for learning mesh-free discrete differential operators using a graph neural network trained via polynomial moment constraints derived from truncated Taylor expansions. The model maps the relative positions of local stencils directly to discrete operator weights. The current work demonstrates that neural networks can learn classical polynomial consistency while retaining robustness to irregular neighbourhood geometry. The learned operators depend only on local geometry, are resolution-agnostic, and can be reused across particle configurations and governing equations. We evaluate the framework using standard numerical analysis diagnostics, showing improved accuracy over Smoothed Particle Hydrodynamics, and a favourable accuracy-cost trade-off relative to a representative high-order consistent mesh-free method in the moderate-accuracy regime. Applicability is demonstrated by solving the weakly compressible Navier-Stokes equations using the learned operators.
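The moment constraints come straight from truncated Taylor expansions: weights w_j approximating an m-th derivative at x0 should satisfy sum_j w_j (x_j - x0)^k / k! = 1 if k = m and 0 otherwise, up to the consistency order. A 1-D sketch of that residual as a training loss (the GNN predicting the weights is omitted; this is an assumed rendering of the objective):

```python
import torch

def moment_constraint_loss(weights, rel_pos, deriv_order=2, max_order=4):
    """weights, rel_pos: (B, S) stencil weights and relative positions x_j - x0."""
    loss, fact = 0.0, 1.0
    for k in range(max_order + 1):
        if k > 0:
            fact *= k                                   # fact = k!
        moment = (weights * rel_pos.pow(k)).sum(dim=1) / fact
        target = 1.0 if k == deriv_order else 0.0       # Kronecker delta
        loss = loss + ((moment - target) ** 2).mean()
    return loss
```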
[451] Physics-Informed Neural Network Digital Twin for Dynamic Tray-Wise Modeling of Distillation Columns under Transient Operating Conditions
Debadutta Patra, Ayush Bardhan Tripathy, Soumya Ranjan Sahu, Sucheta Panda
Main category: cs.LG
TL;DR: Physics-Informed Neural Network digital twin for binary distillation columns using thermodynamic constraints and Aspen simulation data.
Details
Motivation: To create a more accurate and physically consistent digital twin for industrial distillation processes that combines data-driven learning with fundamental physics constraints.
Method: Physics-Informed Neural Network framework that embeds vapor-liquid equilibrium, mass/energy balances, and McCabe-Thiele methodology into loss function; trained on synthetic Aspen HYSYS data with adaptive loss-weighting.
Result: Achieved RMSE of 0.00143 for HX mole fraction prediction (R^2 = 0.9887), 44.6% improvement over best data-only baseline; accurately captures transient column dynamics
Conclusion: PINN digital twin provides robust foundation for real-time soft sensing, model-predictive control, and anomaly detection in industrial distillation processes
Abstract: Digital twin technology, when combined with physics-informed machine learning with simulation results of Aspen, offers transformative capabilities for industrial process monitoring, control, and optimization. In this work, the proposed model presents a Physics-Informed Neural Network (PINN) digital twin framework for the dynamic, tray-wise modeling of binary distillation columns operating under transient conditions. The architecture of the proposed model embeds fundamental thermodynamic constraints, including vapor-liquid equilibrium (VLE) described by modified Raoult’s law, tray-level mass and energy balances, and the McCabe-Thiele graphical methodology directly into the neural network loss function via physics residual terms. The model is trained and evaluated on a high-fidelity synthetic dataset of 961 timestamped measurements spanning 8 hours of transient operation, generated in Aspen HYSYS for a binary HX/TX distillation system comprising 16 sensor streams. An adaptive loss-weighting scheme balances the data fidelity and physics consistency objectives during training. Compared to five data-driven baselines (LSTM, vanilla MLP, GRU, Transformer, DeepONet), the proposed PINN achieves an RMSE of 0.00143 for HX mole fraction prediction (R^2 = 0.9887), representing a 44.6% reduction over the best data-only baseline, while strictly satisfying thermodynamic constraints. Tray-wise temperature and composition profiles predicted under transient perturbations demonstrate that the digital twin accurately captures column dynamics including feed tray responses, reflux ratio variations, and pressure transients. These results establish the proposed PINN digital twin as a robust foundation for real-time soft sensing, model-predictive control, and anomaly detection in industrial distillation processes.
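The adaptive loss-weighting can be sketched with a standard uncertainty-based scheme. A Kendall-style weighting is assumed here; the paper's exact scheme and physics residuals are only described at a high level above:

```python
import torch
import torch.nn as nn

class AdaptivePINNLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # Learnable log-variances balance data-fidelity vs. physics terms.
        self.log_var_data = nn.Parameter(torch.zeros(()))
        self.log_var_phys = nn.Parameter(torch.zeros(()))

    def forward(self, pred, target, physics_residual):
        data = torch.mean((pred - target) ** 2)           # sensor fit
        phys = torch.mean(physics_residual ** 2)          # e.g. VLE / balances
        return (torch.exp(-self.log_var_data) * data + self.log_var_data +
                torch.exp(-self.log_var_phys) * phys + self.log_var_phys)
```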
[452] Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch
Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, Arber Zela
Main category: cs.LG
TL;DR: LLM agents for hyperparameter optimization can edit training code directly, but classical methods outperform them; hybrid approach combining CMA-ES with small LLM achieves best results.
Details
Motivation: To compare classical hyperparameter optimization (HPO) algorithms against LLM-based methods for tuning language model hyperparameters, exploring whether LLMs can effectively search unconstrained spaces by editing training code directly.
Method: Developed autoresearch repository enabling LLM agents to edit training code for unconstrained hyperparameter search. Compared classical HPO (CMA-ES, TPE) against LLM-based methods. Introduced Centaur, a hybrid approach that shares CMA-ES’s internal state (mean vector, step-size, covariance matrix) with an LLM for better optimization.
Result: Classical HPO methods consistently outperform LLM-based agents in fixed search spaces. LLM agents editing code directly narrow the gap to classical methods. Centaur hybrid achieves best results, with 0.8B variant outperforming 27B variant, showing cheap LLMs suffice when paired with strong classical optimizers.
Conclusion: Reliability matters more than exploration breadth in HPO. Small LLMs struggle with optimization state tracking, while classical methods lack domain knowledge. Hybrid approaches combining classical optimizers with small LLMs provide optimal balance, with 0.8B models sufficient for hybrid optimization but not for unconstrained code editing.
Abstract: The autoresearch repository enables an LLM agent to search for optimal hyperparameter configurations on an unconstrained search space by editing the training code directly. Given a fixed compute budget and constraints, we use autoresearch as a testbed to compare classical hyperparameter optimization (HPO) algorithms against LLM-based methods on tuning the hyperparameters of a small language model. Within a fixed hyperparameter search space, classical HPO methods such as CMA-ES and TPE consistently outperform LLM-based agents. However, an LLM agent that directly edits training source code in an unconstrained search space narrows the gap to classical methods substantially despite using only a self-hosted open-weight 27B model. Methods that avoid out-of-memory failures outperform those with higher search diversity, suggesting that reliability matters more than exploration breadth. While small and mid-sized LLMs struggle to track optimization state across trials, classical methods lack domain knowledge. To bridge this gap, we introduce Centaur, a hybrid that shares CMA-ES’s internal state, including mean vector, step-size, and covariance matrix, with an LLM. Centaur achieves the best result in our experiments, with its 0.8B variant outperforming the 27B variant, suggesting that a cheap LLM suffices when paired with a strong classical optimizer. The 0.8B model is insufficient for unconstrained code editing but sufficient for hybrid optimization, while scaling to 27B provides no advantage for fixed search space methods with the open-weight models tested. Code is available at https://github.com/ferreirafabio/autoresearch-automl.
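The Centaur idea is easy to prototype with the `pycma` package: serialize the optimizer's mean, step-size, and covariance diagonal into the prompt, and let the LLM replace one candidate per generation. `ask_llm` is an assumed interface, not an API from the paper; the optimizer calls follow pycma's public API:

```python
import cma
import numpy as np

def centaur_step(es: cma.CMAEvolutionStrategy, ask_llm, objective):
    candidates = es.ask()
    state = (f"mean={np.round(es.mean, 3).tolist()}, sigma={es.sigma:.4f}, "
             f"C_diag={np.round(np.diag(es.C), 3).tolist()}")
    reply = ask_llm(f"CMA-ES state: {state}. Propose one promising point "
                    f"as {len(es.mean)} comma-separated floats.")
    try:
        extra = np.array([float(v) for v in reply.split(",")])
        if extra.shape == np.asarray(es.mean).shape:
            candidates[-1] = extra        # swap the LLM's proposal in
    except ValueError:
        pass                              # malformed reply: pure CMA-ES step
    es.tell(candidates, [objective(c) for c in candidates])
```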
[453] Energy-Efficient Hierarchical Federated Anomaly Detection for the Internet of Underwater Things via Selective Cooperative Aggregation
Kenechi Omeke, Michael Mollel, Lei Zhang, Qammer H. Abbasi, Muhammad Ali Imran
Main category: cs.LG
TL;DR: Hierarchical federated learning framework for underwater anomaly detection that addresses acoustic communication constraints through sensor-to-fog clustering, compressed updates, and selective fog cooperation.
Details
Motivation: Underwater IoT anomaly detection faces challenges due to low-bandwidth, energy-intensive acoustic links that limit direct sensor-to-surface communication and reduce participation in standard flat federated learning approaches.
Method: Three-tier hierarchical architecture with: 1) feasibility-aware sensor-to-fog association, 2) compressed model-update transmission, and 3) selective cooperative aggregation among fog nodes that activates fog-to-fog exchange only when beneficial.
Result: Hierarchical learning preserves full participation (vs 48% in flat FL), selective cooperation reduces energy by 31-33% while matching accuracy, compressed uploads reduce total energy by 71-95%, and maintains competitive detection quality on real benchmarks.
Conclusion: The hierarchical framework provides practical design guidance for underwater deployments under severe acoustic constraints, balancing detection quality with communication energy efficiency.
Abstract: Anomaly detection is a core service in the Internet of Underwater Things, yet training accurate distributed models underwater is difficult because acoustic links are low-bandwidth, energy-intensive, and often unable to support direct sensor-to-surface communication. Standard flat federated learning therefore faces two coupled limitations in underwater deployments: expensive long-range transmissions and reduced participation when only a subset of sensors can reach the gateway. This paper proposes an energy-efficient hierarchical federated learning framework for underwater anomaly detection based on three components: feasibility-aware sensor-to-fog association, compressed model-update transmission, and selective cooperative aggregation among fog nodes. The proposed three-tier architecture localises most communication within short-range clusters while activating fog-to-fog exchange only when smaller clusters can benefit from nearby larger neighbours. A physics-grounded underwater acoustic model is used to evaluate detection quality, communication energy, and network participation jointly. In large synthetic deployments, only about 48% of sensors can directly reach the gateway in the 200-sensor case, whereas hierarchical learning preserves full participation through feasible fog paths. Selective cooperation matches the detection accuracy of always-on inter-fog exchange while reducing its energy by 31-33%, and compressed uploads reduce total energy by 71-95% in matched sensitivity tests. Experiments on three real benchmarks further show that low-overhead hierarchical methods remain competitive in detection quality, while flat federated learning defines the minimum-energy operating point. These results provide practical design guidance for underwater deployments operating under severe acoustic communication constraints.
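An illustrative rendering of selective cooperative aggregation: each fog node averages its own cluster, and energy-costly fog-to-fog exchange fires only when a cluster is too small to train well on its own. The size-threshold rule is an assumed stand-in for the paper's benefit criterion:

```python
import numpy as np

def fog_round(cluster_updates, neighbour_model, min_cluster=5):
    """cluster_updates: list of model-parameter arrays from one fog cluster."""
    local = np.mean(cluster_updates, axis=0)       # FedAvg within the cluster
    if len(cluster_updates) >= min_cluster or neighbour_model is None:
        return local              # enough data: stay local, save acoustic energy
    # Small cluster: blend with a larger neighbour's model (one exchange).
    w = len(cluster_updates) / min_cluster
    return w * local + (1 - w) * neighbour_model
```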
[454] Amplified Patch-Level Differential Privacy for Free via Random Cropping
Kaan Durmaz, Jan Schuchardt, Sebastian Schmidt, Stephan Günnemann
Main category: cs.LG
TL;DR: Random cropping in vision models probabilistically excludes sensitive localized content, amplifying differential privacy without architectural changes by introducing patch-level privacy bounds that compose with DP-SGD.
Details
Motivation: Random cropping is widely used in computer vision but its privacy implications for differentially private training haven't been explored. When sensitive content (like faces or license plates) is spatially localized, random cropping can probabilistically exclude this content, providing additional privacy protection.
Method: Introduces a patch-level neighboring relation for vision data and derives tight privacy bounds for DP-SGD combined with random cropping. Quantifies patch inclusion probability and shows how it composes with minibatch sampling to yield lower effective sampling rate.
Result: Empirical validation shows patch-level amplification improves privacy-utility trade-off across multiple segmentation architectures and datasets. Demonstrates stronger privacy guarantees at no additional cost by aligning privacy accounting with domain structure.
Conclusion: Random cropping provides inherent privacy amplification for vision models when sensitive content is spatially localized. This additional source of randomness can be formally accounted for to improve differential privacy guarantees without modifying model architecture or training procedures.
Abstract: Random cropping is one of the most common data augmentation techniques in computer vision, yet the role of its inherent randomness in training differentially private machine learning models has thus far gone unexplored. We observe that when sensitive content in an image is spatially localized, such as a face or license plate, random cropping can probabilistically exclude that content from the model’s input. This introduces a third source of stochasticity in differentially private training with stochastic gradient descent, in addition to gradient noise and minibatch sampling. This additional randomness amplifies differential privacy without requiring changes to model architecture or training procedure. We formalize this effect by introducing a patch-level neighboring relation for vision data and deriving tight privacy bounds for differentially private stochastic gradient descent (DP-SGD) when combined with random cropping. Our analysis quantifies the patch inclusion probability and shows how it composes with minibatch sampling to yield a lower effective sampling rate. Empirically, we validate that patch-level amplification improves the privacy-utility trade-off across multiple segmentation architectures and datasets. Our results demonstrate that aligning privacy accounting with domain structure and additional existing sources of randomness can yield stronger guarantees at no additional cost.
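The accounting arithmetic is simple to illustrate in one dimension: the probability that a uniformly placed crop fully contains a fixed patch multiplies the minibatch sampling rate, and the product is the lower effective rate handed to a DP-SGD accountant. The axis-aligned 1-D geometry below is a simplifying assumption:

```python
def patch_inclusion_probability(img, crop, patch, patch_offset):
    """1-D toy geometry (all sizes in pixels): probability that a crop of
    length `crop`, placed uniformly over the valid offsets of an image of
    length `img`, fully contains the patch starting at `patch_offset`."""
    lo = max(0, patch_offset + patch - crop)   # leftmost covering crop offset
    hi = min(img - crop, patch_offset)         # rightmost covering crop offset
    return max(hi - lo + 1, 0) / (img - crop + 1)

# Composing with the minibatch sampling rate q gives the effective rate.
q = 0.01
p_patch = patch_inclusion_probability(img=224, crop=112, patch=32, patch_offset=96)
print(p_patch, q * p_patch)   # ~0.717 and ~0.0072
```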
[455] Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards
Cheng Jiayang, Xin Liu, Zhihan Zhang, Haoyang Wen, Zixuan Zhang, Qingyu Yin, Shiyang Li, Priyanka Nigam, Bing Yin, Chao Zhang, Yangqiu Song
Main category: cs.LG
TL;DR: A framework for training LLMs on multi-step tool orchestration using RL with real API responses and graduated rewards that decompose correctness into atomic validity and orchestration components.
Details
Motivation: Multi-step tool orchestration where LLMs must invoke multiple dependent APIs in correct order with proper intermediate output propagation remains challenging. Current models frequently fail on full sequence execution, with parameter value errors being a major failure source. Existing training approaches face two obstacles: environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness.
Method: 1) Construct reinforcement learning environment backed by large-scale cache of real API responses, enabling data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and high generation efficiency. 2) Propose graduated reward design that decomposes correctness into atomic validity (individual function call correctness at increasing granularity) and orchestration (correct tool sequencing with dependency respect).
Result: On ComplexFuncBench, the approach demonstrates substantial improvements in turn accuracy. Ablation studies confirm both reward components are essential: using either alone significantly degrades performance.
Conclusion: The framework effectively addresses challenges in training LLMs for multi-step tool orchestration by combining real API response environments with graduated reward structures that provide meaningful feedback for partial correctness.
Abstract: Multi-step tool orchestration, where LLMs must invoke multiple dependent APIs in the correct order while propagating intermediate outputs, remains challenging. State-of-the-art models frequently fail on full sequence execution, with parameter value errors accounting for a significant portion of failures. Training models to handle such workflows faces two obstacles: existing environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness. We present a framework addressing both challenges. First, we construct a reinforcement learning environment backed by a large-scale cache of real API responses, enabling a data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and significantly higher generation efficiency than unconstrained methods. Second, we propose a graduated reward design that decomposes correctness into atomic validity (individual function call correctness at increasing granularity) and orchestration (correct tool sequencing with dependency respect). On ComplexFuncBench, our approach demonstrates substantial improvements in turn accuracy. Ablation studies confirm both reward components are essential: using either alone significantly degrades performance.
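A hedged sketch of such a graduated reward: partial credit for tool name, argument keys, and argument values (atomic validity), plus a term for dependency-respecting ordering (orchestration). The weights and the index-wise alignment of predicted to gold calls are illustrative choices, not the paper's:

```python
def graduated_reward(pred_calls, gold_calls, gold_order):
    """pred/gold_calls: lists of {"name": str, "args": dict};
    gold_order: list of (earlier_tool, later_tool) dependency pairs."""
    atomic = 0.0
    for p, g in zip(pred_calls, gold_calls):       # naive index alignment
        if p["name"] == g["name"]:
            atomic += 0.2                          # right tool
            keys = set(g["args"])
            if set(p["args"]) == keys:
                atomic += 0.3                      # right argument names
            atomic += 0.5 * sum(p["args"].get(k) == v
                                for k, v in g["args"].items()) / max(len(keys), 1)
    atomic /= max(len(gold_calls), 1)

    pos = {c["name"]: i for i, c in enumerate(pred_calls)}
    ok = sum(a in pos and b in pos and pos[a] < pos[b] for a, b in gold_order)
    orchestration = ok / max(len(gold_order), 1)
    return 0.5 * atomic + 0.5 * orchestration
```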
[456] Can an Actor-Critic Optimization Framework Improve Analog Design Optimization?
Sounak Dutta, Fin Amin, Sushil Panda, Jonathan Rabe, Yuejiang Wen, Paul Franzon
Main category: cs.LG
TL;DR: ACOF is an actor-critic optimization framework for analog circuit sizing that separates proposal and evaluation roles to bring human-like judgment into the search process, improving performance over existing baselines.
Details
Motivation: Analog design optimization is slow due to expensive simulation cycles and narrow optimal regions in large search spaces. Existing optimizers lack the judgment designers use when deciding where to search next.
Method: Actor-critic framework separates proposal and evaluation: actor suggests promising design space regions, critic reviews choices, enforces design legality, and redirects search when progress stalls. Preserves compatibility with standard simulator-based flows.
Result: Improves top-10 figure of merit by average 38.9% over strongest baseline, reduces regret by average 24.7%, with peak gains of 70.5% in FoM and 42.2% lower regret on individual circuits.
Conclusion: ACOF offers more transparent, deliberate, stable, and interpretable path toward automated analog sizing by combining iterative reasoning with simulation-driven search across challenging design spaces.
Abstract: Analog design often slows down because even small changes to device sizes or biases require expensive simulation cycles, and high-quality solutions typically occupy only a narrow part of a very large search space. While existing optimizers reduce some of this burden, they largely operate without the kind of judgment designers use when deciding where to search next. This paper presents an actor-critic optimization framework (ACOF) for analog sizing that brings that form of guidance into the loop. Rather than treating optimization as a purely black-box search problem, ACOF separates the roles of proposal and evaluation: an actor suggests promising regions of the design space, while a critic reviews those choices, enforces design legality, and redirects the search when progress is hampered. This structure preserves compatibility with standard simulator-based flows while making the search process more deliberate, stable, and interpretable. Across our test circuits, ACOF improves the top-10 figure of merit by an average of 38.9% over the strongest competing baseline and reduces regret by an average of 24.7%, with peak gains of 70.5% in FoM and 42.2% lower regret on individual circuits. By combining iterative reasoning with simulation-driven search, the framework offers a more transparent path toward automated analog sizing across challenging design spaces.
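The propose/review split can be sketched as a simple loop: an actor perturbs the incumbent sizing, and a critic rejects illegal candidates before spending a simulation and widens the search when progress stalls. `simulate` and the legality bounds are placeholders, and the actual ACOF actor and critic are richer than this:

```python
import random

def acof_loop(x0, simulate, lower, upper, iters=100):
    best_x, best_fom = x0, simulate(x0)
    stall, scale = 0, 0.05
    for _ in range(iters):
        # Actor: propose a perturbation of the incumbent design.
        cand = [xi * (1 + random.gauss(0, scale)) for xi in best_x]
        # Critic: enforce design legality before spending a simulation.
        if not all(lo <= c <= hi for c, lo, hi in zip(cand, lower, upper)):
            continue
        fom = simulate(cand)
        if fom > best_fom:
            best_x, best_fom, stall, scale = cand, fom, 0, 0.05
        else:
            stall += 1
            if stall >= 10:            # Critic redirects a stalled search.
                scale, stall = min(scale * 2, 0.5), 0
    return best_x, best_fom
```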
[457] Contrastive Learning Boosts Deterministic and Generative Models for Weather Data
Nathan Bailey
Main category: cs.LG
TL;DR: SPARTA: A contrastive learning framework for creating robust embeddings from sparse weather data using temporal-aware sampling and physics-informed graph neural networks.
Details
Motivation: Weather data is high-dimensional and multimodal, requiring compression into compact latent spaces for efficient downstream tasks. Current work on contrastive learning does not adequately address the sparse data common in real-world weather collection, nor has it been compared systematically against autoencoders for compression.
Method: SPARTA framework aligns sparse samples with complete ones via contrastive loss, uses temporally aware batch sampling and cycle-consistency loss for better latent space structure, and incorporates domain-specific physical knowledge through novel graph neural network fusion.
Result: Demonstrates that contrastive learning is a feasible and advantageous compression method for sparse geoscience data, enhancing performance in downstream tasks like forecasting and extreme-weather detection.
Conclusion: Contrastive learning with sparse data augmentation and physics-informed modeling creates robust embeddings for weather data, outperforming traditional compression methods and improving downstream task performance.
Abstract: Weather data, comprising multiple variables, poses significant challenges due to its high dimensionality and multimodal nature. Creating low-dimensional embeddings requires compressing this data into a compact, shared latent space. This compression is required to improve the efficiency and performance of downstream tasks, such as forecasting or extreme-weather detection. Self-supervised learning, particularly contrastive learning, offers a way to generate low-dimensional, robust embeddings from unlabelled data, enabling downstream tasks when labelled data is scarce. Despite initial exploration of contrastive learning in weather data, particularly with the ERA5 dataset, the current literature does not extensively examine its benefits relative to alternative compression methods, notably autoencoders. Moreover, current work on contrastive learning does not investigate how these models can incorporate sparse data, which is more common in real-world data collection. It is critical to explore and understand how contrastive learning contributes to creating more robust embeddings for sparse weather data, thereby improving performance on downstream tasks. Our work extensively explores contrastive learning on the ERA5 dataset, aligning sparse samples with complete ones via a contrastive loss term to create SPARse-data augmented conTRAstive spatiotemporal embeddings (SPARTA). We introduce a temporally aware batch sampling strategy and a cycle-consistency loss to improve the structure of the latent space. Furthermore, we propose a novel graph neural network fusion technique to inject domain-specific physical knowledge. Ultimately, our results demonstrate that contrastive learning is a feasible and advantageous compression method for sparse geoscience data, thereby enhancing performance in downstream tasks.
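The sparse-complete alignment reduces to a masked second view pulled toward the full view, with a cycle term on top. A minimal sketch, where the random masking scheme and the decode/re-encode cycle are assumed renderings of the ideas named above:

```python
import torch
import torch.nn.functional as F

def sparse_contrastive_loss(encoder, decoder, x, mask_ratio=0.7, tau=0.07):
    """x: (B, D) complete observations; encoder/decoder: callables."""
    mask = (torch.rand_like(x) > mask_ratio).float()   # simulate sparse view
    z_full, z_sparse = encoder(x), encoder(x * mask)
    a, b = F.normalize(z_full, dim=-1), F.normalize(z_sparse, dim=-1)
    logits = a @ b.t() / tau                           # align sparse to complete
    labels = torch.arange(x.size(0), device=x.device)
    contrastive = F.cross_entropy(logits, labels)
    # Cycle consistency: the embedding should survive a decode/re-encode trip.
    cycle = F.mse_loss(encoder(decoder(z_sparse)), z_sparse)
    return contrastive + cycle
```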
[458] Grokking as a Falsifiable Finite-Size Transition
Yuda Bi, Chenyu Zhang, Qiheng Wang, Vince D Calhoun
Main category: cs.LG
TL;DR: Grokking (delayed generalization after memorization) is analyzed as a phase transition using group order as extensive variable and spectral contrast as order parameter, showing evidence against smooth crossover interpretation.
Details
Motivation: To provide falsifiable finite-size inputs for analyzing grokking as a phase transition rather than just using phase-transition language as an analogy.
Method: Treat group order p of ℤₚ as extensive variable, use held-out spectral head-tail contrast as representation-level order parameter, apply condensed-matter-style diagnostic chain including coarse-grid sweeps and dense near-critical addition audit with Binder-like crossings.
Result: Binder-like crossings reveal shared finite-size boundary, susceptibility comparison strongly disfavors smooth-crossover interpretation (ΔAIC=16.8 in near-critical audit), supporting phase-transition interpretation.
Conclusion: Phase-transition language in grokking can be tested as quantitative finite-size claim rather than invoked as analogy alone, though transition order remains unresolved.
Abstract: Grokking – the delayed onset of generalization after early memorization – is often described with phase-transition language, but that claim has lacked falsifiable finite-size inputs. Here we supply those inputs by treating the group order $p$ of $\mathbb{Z}_p$ as an admissible extensive variable and a held-out spectral head-tail contrast as a representation-level order parameter, then apply a condensed-matter-style diagnostic chain to coarse-grid sweeps and a dense near-critical addition audit. Binder-like crossings reveal a shared finite-size boundary, and susceptibility comparison strongly disfavors a smooth-crossover interpretation ($\Delta\mathrm{AIC}=16.8$ in the near-critical audit). Phase-transition language in grokking can therefore be tested as a quantitative finite-size claim rather than invoked as analogy alone, although the transition order remains unresolved at present.
[459] Light Cones For Vision: Simple Causal Priors For Visual Hierarchy
Manglam Kartik, Neel Tushar Shah
Main category: cs.LG
TL;DR: Worldline Slot Attention models objects as persistent spacetime trajectories using Lorentzian geometry, achieving 6x better hierarchical object discovery than Euclidean approaches by encoding causal structure.
Details
Motivation: Standard vision models treat objects as independent points in Euclidean space, failing to capture hierarchical structure like parts within wholes. There's a need for models that can represent objects as persistent entities with hierarchical relationships across spacetime.
Method: Introduces Worldline Slot Attention where objects are modeled as persistent trajectories through spacetime worldlines. Each object has multiple slots at different hierarchy levels sharing spatial position but differing in temporal coordinates. Compares Euclidean vs Lorentzian geometric structures for encoding worldlines.
Result: Euclidean worldlines achieve only 0.078 accuracy (below random chance 0.33), while Lorentzian worldlines achieve 0.479-0.661 across three datasets - a 6x improvement replicated over 20+ independent runs. Lorentzian geometry outperforms hyperbolic embeddings, showing visual hierarchies require causal structure rather than tree structure.
Conclusion: Hierarchical object discovery requires geometric structure encoding asymmetric causality, an inductive bias absent from Euclidean space but natural to Lorentzian light cones. This can be achieved with only 11K parameters, demonstrating the importance of appropriate geometric priors for visual understanding.
Abstract: Standard vision models treat objects as independent points in Euclidean space, unable to capture hierarchical structure like parts within wholes. We introduce Worldline Slot Attention, which models objects as persistent trajectories through spacetime worldlines, where each object has multiple slots at different hierarchy levels sharing the same spatial position but differing in temporal coordinates. This architecture consistently fails without geometric structure: Euclidean worldlines achieve 0.078 level accuracy, below random chance (0.33), while Lorentzian worldlines achieve 0.479-0.661 across three datasets: a 6x improvement replicated over 20+ independent runs. Lorentzian geometry also outperforms hyperbolic embeddings showing visual hierarchies require causal structure (temporal dependency) rather than tree structure (radial branching). Our results demonstrate that hierarchical object discovery requires geometric structure encoding asymmetric causality, an inductive bias absent from Euclidean space but natural to Lorentzian light cones, achieved with only 11K parameters. The code is available at: https://github.com/iclrsubmissiongram/loco.
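The Lorentzian ingredient is just the Minkowski inner product, which separates timelike (causal) pairs from spacelike ones and supplies the asymmetry the paper argues hierarchies need. A small illustration, with the embedding layout assumed:

```python
import torch

def minkowski_inner(u, v):
    """<u, v> with signature (-, +, ..., +); coordinate 0 plays 'time'."""
    return -u[..., 0] * v[..., 0] + (u[..., 1:] * v[..., 1:]).sum(-1)

def is_causally_related(parent, child):
    """True where `child` lies inside `parent`'s future light cone: timelike
    separation (negative Minkowski norm) and positive time direction."""
    d = child - parent
    return (minkowski_inner(d, d) < 0) & (d[..., 0] > 0)
```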
[460] Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback
Jungtaek Kim, Thomas Zeng, Ziqian Lin, Minjae Lee, Chungpa Lee, Jy-yong Sohn, Hyung Il Koo, Kangwook Lee
Main category: cs.LG
TL;DR: Transformers can learn to approximate search algorithms for tree-structured problem solving, enabling LLMs to balance exploration and exploitation without external search components.
Details
Motivation: While external search algorithms enhance LLM problem-solving by navigating tree-structured idea spaces, they complicate the overall process. The paper investigates whether LLMs/Transformers can internally approximate search algorithms to simplify and improve problem-solving.
Method: Introduces “unknown tree search with bandit feedback” framework where tree extensions and feedback are externally specified. Trains Transformers from scratch to implement search strategies and fine-tunes pretrained LLMs on search trajectories to unlock search capabilities.
Result: Transformers are theoretically expressive enough to implement distinct search strategies and can be trained to approximate them. They generalize to unseen conditions like longer horizons or deeper trees, and fine-tuning unlocks search capabilities in pretrained LLMs.
Conclusion: LLMs can learn to approximate search algorithms internally, potentially eliminating the need for external search components while maintaining effective exploration-exploitation balance in tree-structured problem solving.
Abstract: Effective problem solving with Large Language Models (LLMs) can be enhanced when they are paired with external search algorithms. By viewing the space of diverse ideas and their follow-up possibilities as a tree structure, the search algorithm can navigate such a search space and guide the LLM toward better solutions more efficiently. While the search algorithm enables an effective balance between exploitation and exploration of a tree-structured space, the need for an external component can complicate the overall problem-solving process. We therefore pose the following question: Can LLMs or their underlying Transformer architectures approximate a search algorithm? To answer this question, we first introduce a simplified framework in which tree extensions and feedback signals are externally specified, allowing for controlled evaluation of search capabilities. We call this setting unknown tree search with bandit feedback. Within this setting, we show that Transformers are theoretically expressive enough to implement distinct search strategies and can be trained from scratch to approximate those strategies. Our Transformer models exhibit the possibility of generalizing to unseen conditions such as longer horizons or deeper trees. Furthermore, we demonstrate that continued task-focused training unlocks the complete capabilities of a pretrained LLM, by fine-tuning the LLM on search trajectories.
[461] Local learning for stable backpropagation-free neural network training towards physical learning
Yaqi Guo, Fabian Braun, Bastiaan Ketelaar, Stephanie Tan, Richard Norte, Siddhant Kumar
Main category: cs.LG
TL;DR: FFzero is a forward-only learning framework that enables neural network training without backpropagation or automatic differentiation, using layer-wise local learning and prototype-based representations.
Details
Motivation: Physical limits of chip manufacturing and environmental costs of deep learning motivate alternative learning paradigms like physical neural networks, but most still rely on digital computing for training due to difficulties implementing backpropagation in physical systems.
Method: FFzero combines layer-wise local learning, prototype-based representations, and directional-derivative-based optimization using only forward evaluations, eliminating the need for backpropagation or automatic differentiation.
Result: FFzero generalizes to multilayer perceptron and convolutional neural networks across classification and regression tasks, and using a simulated photonic neural network demonstrates viability for backpropagation-free in-situ physical learning.
Conclusion: FFzero provides a viable path toward backpropagation-free in-situ physical learning, enabling stable neural network training in physical systems where traditional backpropagation is difficult to implement.
Abstract: While backpropagation and automatic differentiation have driven deep learning’s success, the physical limits of chip manufacturing and rising environmental costs of deep learning motivate alternative learning paradigms such as physical neural networks. However, most existing physical neural networks still rely on digital computing for training, largely because backpropagation and automatic differentiation are difficult to realize in physical systems. We introduce FFzero, a forward-only learning framework enabling stable neural network training without backpropagation or automatic differentiation. FFzero combines layer-wise local learning, prototype-based representations, and directional-derivative-based optimization through forward evaluations only. We show that local learning is effective under forward-only optimization, where backpropagation fails. FFzero generalizes to multilayer perceptron and convolutional neural networks across classification and regression. Using a simulated photonic neural network as an example, we demonstrate that FFzero provides a viable path toward backpropagation-free in-situ physical learning.
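The forward-only ingredient can be sketched with a two-point directional-derivative estimate: perturb the parameters along a random unit direction, evaluate the loss twice, and step against the estimated slope, with no backward pass anywhere. A global-loss sketch (FFzero applies this layer-wise with local losses):

```python
import torch

@torch.no_grad()
def forward_only_step(params, loss_fn, lr=1e-3, eps=1e-4):
    """params: list of tensors; loss_fn: zero-arg callable (forward only)."""
    direction = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((d ** 2).sum() for d in direction))
    direction = [d / norm for d in direction]

    for p, d in zip(params, direction):
        p.add_(eps * d)
    loss_plus = loss_fn()
    for p, d in zip(params, direction):
        p.add_(-2 * eps * d)
    loss_minus = loss_fn()
    for p, d in zip(params, direction):
        p.add_(eps * d)                           # restore parameters

    dd = (loss_plus - loss_minus) / (2 * eps)     # directional derivative
    for p, d in zip(params, direction):
        p.add_(-lr * dd * d)                      # step against the slope
```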
[462] A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study
Yongda Fan, John Wu, Andrea Fitzpatrick, Naveen Baskaran, Jimeng Sun, Adam Cross
Main category: cs.LG
TL;DR: Benchmark study evaluating interpretability methods across clinical prediction tasks, finding attention mechanisms effective, black-box methods computationally infeasible for time-series, and many approaches unreliable.
Details
Motivation: Clinical decisions require explicit justification, making model interpretability essential for auditing deep clinical models before deployment. Need to understand if architectural features like attention improve explainability and if interpretability methods generalize across clinical tasks.
Method: Comprehensive benchmark evaluating interpretability methods across diverse clinical prediction tasks and model architectures, implemented via PyHealth open-source framework for reproducibility and extensibility.
Result: (1) Attention mechanisms when leveraged properly are highly efficient for faithfully interpreting model predictions; (2) Black-box interpreters like KernelSHAP and LIME are computationally infeasible for time-series clinical prediction tasks; (3) Several interpretability approaches are too unreliable to be trustworthy.
Conclusion: Provides guidelines for improving interpretability within clinical predictive pipelines and offers open-source implementation via PyHealth framework to support reproducibility and extensibility.
Abstract: Clinical decisions are high-stakes and require explicit justification, making model interpretability essential for auditing deep clinical models prior to deployment. As the ecosystem of model architectures and explainability methods expands, critical questions remain: Do architectural features like attention improve explainability? Do interpretability approaches generalize across clinical tasks? While prior benchmarking efforts exist, they often lack extensibility and reproducibility, and critically, fail to systematically examine how interpretability varies across the interplay of clinical tasks and model architectures. To address these gaps, we present a comprehensive benchmark evaluating interpretability methods across diverse clinical prediction tasks and model architectures. Our analysis reveals that: (1) attention when leveraged properly is a highly efficient approach for faithfully interpreting model predictions; (2) black-box interpreters like KernelSHAP and LIME are computationally infeasible for time-series clinical prediction tasks; and (3) several interpretability approaches are too unreliable to be trustworthy. From our findings, we discuss several guidelines on improving interpretability within clinical predictive pipelines. To support reproducibility and extensibility, we provide our implementations via PyHealth, a well-documented open-source framework: https://github.com/sunlabuiuc/PyHealth.
[463] Flow matching on homogeneous spaces
Francesco Ruscelli
Main category: cs.LG
TL;DR: Extends Flow Matching to homogeneous spaces via Lie group lifting, avoiding complex geometry by working on Lie algebras for simpler Euclidean flow matching.
Details
Motivation: To develop a general framework for Flow Matching on homogeneous spaces (quotients of Lie groups) that avoids the complicated geometry of these spaces and simplifies the computational requirements compared to Riemannian Flow Matching approaches.
Method: Reformulates the problem as flow matching on the underlying Lie group by lifting data distributions, then reduces to Euclidean flow matching on Lie algebras. This avoids defining premetrics or computing geodesics required in Riemannian Flow Matching.
Result: Creates a simpler, faster, and fully intrinsic framework for Flow Matching on homogeneous spaces without the computational overhead of Riemannian geometry operations.
Conclusion: Provides an efficient approach to extend Flow Matching to homogeneous spaces by leveraging Lie group theory to simplify the geometric complexity, making it more practical for applications on quotient spaces.
Abstract: We propose a general framework to extend Flow Matching to homogeneous spaces, i.e., quotients of Lie groups. Our approach reformulates the problem as a flow matching task on the underlying Lie group by lifting the data distributions. This strategy avoids the potentially complicated geometry of homogeneous spaces by working directly on Lie groups, which in turn enables us to reduce the problem to a Euclidean flow matching task on Lie algebras. In contrast to Riemannian Flow Matching, our method eliminates the need to define and compute premetrics or geodesics, resulting in a simpler, faster, and fully intrinsic framework.
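The reduction is easy to picture in a toy case. Below is an illustrative sketch for SO(3): map rotations to so(3) coordinates with the matrix log, then run entirely standard Euclidean flow matching there. The paper's construction handles general homogeneous spaces via lifting to the group; this is only the Euclidean endgame.

```python
import numpy as np
import torch
from scipy.linalg import logm
from scipy.spatial.transform import Rotation

# Sketch of the reduction: lift SO(3) samples to the Lie algebra so(3) via
# the matrix log, then do plain Euclidean flow matching in those 3-d
# coordinates. Illustrative only.

def to_algebra(R):                      # so(3) coordinates of a rotation
    A = logm(R).real                    # skew-symmetric 3x3
    return np.array([A[2, 1], A[0, 2], A[1, 0]])

data = np.stack([to_algebra(Rotation.random().as_matrix())
                 for _ in range(512)])  # "data" distribution, in the algebra
data = torch.tensor(data, dtype=torch.float32)

net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.SiLU(),
                          torch.nn.Linear(64, 3))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(500):
    x1 = data[torch.randint(len(data), (128,))]
    x0 = torch.randn(128, 3)            # Gaussian prior in the algebra
    t = torch.rand(128, 1)
    xt = (1 - t) * x0 + t * x1          # straight-line (Euclidean) path
    target = x1 - x0                    # its constant velocity
    loss = ((net(torch.cat([xt, t], 1)) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```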
[464] Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models
Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim
Main category: cs.LG
TL;DR: Multi-answer reinforcement learning enables language models to generate multiple plausible answers with confidence estimates in a single forward pass, improving diversity and efficiency for tasks with inherent uncertainty.
Details
Motivation: Current language models collapse answer distributions to single modes, which fails for real-world tasks with multiple valid answers (medical diagnosis, ambiguous QA, incomplete information). Need models that can generate multiple hypotheses with confidence estimates efficiently.
Method: Multi-answer reinforcement learning modifies RL objective to train models to explicitly generate multiple candidate answers in a single forward pass, internalizing inference-time search into the generative process.
Result: Improved diversity, coverage, and set-level calibration across QA, medical diagnostic, and coding benchmarks. Models generate multiple answers with fewer tokens than competing approaches and show substantially higher accuracy on coding tasks.
Conclusion: Multi-answer RL provides a principled, compute-efficient alternative to inference-time scaling procedures like best-of-k, enabling better distributional reasoning for uncertain real-world tasks.
Abstract: Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model’s generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.
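One plausible shape for the reward such an objective needs is sketched below: score a generated answer set for coverage of the valid answers, minus a redundancy penalty. This is an illustrative stand-in, not the paper's exact formulation.

```python
# Illustrative set-level reward for multi-answer RL: score one generation
# that proposes several candidate answers at once. A plausible shape for
# such an objective, not the paper's exact reward.

def set_reward(predicted: list[str], valid: set[str],
               coverage_w: float = 1.0, redundancy_w: float = 0.2) -> float:
    unique = list(dict.fromkeys(a.strip().lower() for a in predicted))
    hits = sum(a in valid for a in unique)
    coverage = hits / max(len(valid), 1)     # fraction of valid answers found
    redundancy = (len(predicted) - len(unique)) / max(len(predicted), 1)
    return coverage_w * coverage - redundancy_w * redundancy

# e.g. an ambiguous question with two valid answers:
print(set_reward(["Paris", "paris", "Lyon"], {"paris", "lyon"}))  # ~0.933
```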
[465] Learning to Staff: Offline Reinforcement Learning and Fine-Tuned LLMs for Warehouse Staffing Optimization
Kalle Kujanpää, Yuying Zhu, Kristina Klinkner, Shervin Malmasi
Main category: cs.LG
TL;DR: The paper explores two ML approaches for optimizing real-time staffing decisions in warehouse sortation systems: custom Transformer-based offline RL policies and LLMs operating on human-readable state descriptions.
Details
Motivation: To develop AI-assisted operational decision-making for semi-automated warehouse sortation systems, supporting staffing decisions at different levels of abstraction with different trade-offs between detailed state representations and human-readable inputs.
Method: Two approaches: 1) Custom Transformer-based policies trained with offline reinforcement learning on detailed historical state representations, evaluated in learned simulators. 2) LLMs operating on abstracted, human-readable state descriptions, comparing prompting techniques, automatic prompt optimization, and fine-tuning strategies including supervised fine-tuning with Direct Preference Optimization.
Result: Offline RL achieved 2.4% throughput improvement over historical baselines in learned simulators. LLMs with supervised fine-tuning combined with DPO on simulator-generated preferences matched or slightly exceeded historical baselines in hand-crafted simulators, though prompting alone was insufficient.
Conclusion: Both approaches offer viable paths toward AI-assisted operational decision-making: offline RL excels with task-specific architectures, while LLMs support human-readable inputs and can incorporate manager preferences through iterative feedback loops.
Abstract: We investigate machine learning approaches for optimizing real-time staffing decisions in semi-automated warehouse sortation systems. Operational decision-making can be supported at different levels of abstraction, with different trade-offs. We evaluate two approaches, each in a matching simulation environment. First, we train custom Transformer-based policies using offline reinforcement learning on detailed historical state representations, achieving a 2.4% throughput improvement over historical baselines in learned simulators. In high-volume warehouse operations, improvements of this size translate to significant savings. Second, we explore LLMs operating on abstracted, human-readable state descriptions. These are a natural fit for decisions that warehouse managers make using high-level operational summaries. We systematically compare prompting techniques, automatic prompt optimization, and fine-tuning strategies. While prompting alone proves insufficient, supervised fine-tuning combined with Direct Preference Optimization on simulator-generated preferences achieves performance that matches or slightly exceeds historical baselines in a hand-crafted simulator. Our findings demonstrate that both approaches offer viable paths toward AI-assisted operational decision-making. Offline RL excels with task-specific architectures. LLMs support human-readable inputs and can be combined with an iterative feedback loop that can incorporate manager preferences.
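The DPO step in the pipeline uses the standard DPO objective on preference pairs, here simulator-generated ones. A minimal sketch of that loss (the objective itself is standard, not specific to this paper):

```python
import torch
import torch.nn.functional as F

# Standard DPO loss on preference pairs: inputs are summed log-probs of the
# chosen and rejected completions under the policy and a frozen reference.

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # implicit reward margin: how much more the policy prefers the chosen
    # completion over the rejected one, relative to the reference model
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# toy tensors standing in for per-sequence log-probabilities
logp_c, logp_r = torch.tensor([-4.0]), torch.tensor([-6.0])
ref_c, ref_r = torch.tensor([-5.0]), torch.tensor([-5.5])
print(dpo_loss(logp_c, logp_r, ref_c, ref_r))
```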
[466] Once-for-All Channel Mixers (HYPERTINYPW): Generative Compression for TinyML
Yassien Shaalan
Main category: cs.LG
TL;DR: HYPER-TINYPW compresses neural networks for microcontrollers by generating pointwise mixer weights at load time instead of storing them, achieving 6.31x compression while maintaining performance on ECG and audio tasks.
Details
Motivation: Deploying neural networks on microcontrollers is constrained by limited flash and SRAM memory, where 1x1 pointwise mixers dominate memory usage even after INT8 quantization across vision, audio, and wearable sensing applications.
Method: A compression-as-generation approach that replaces stored pointwise weights with generated weights: a shared micro-MLP synthesizes pointwise kernels at load time from tiny per-layer codes, caches them, and executes with standard integer operators while keeping PW1 in INT8 for stability.
Result: Achieves 6.31x compression (84.15% fewer bytes) while matching performance of larger models on ECG benchmarks, and reaches 96.2% test accuracy on Speech Commands audio dataset, demonstrating broad applicability to embedded sensing workloads.
Conclusion: HYPER-TINYPW shifts the Pareto frontier for memory-constrained ML deployment, enabling effective neural networks under 32-64 kB budgets where compact baselines degrade, with broader applicability to 1D biosignals, on-device speech, and embedded sensing tasks.
Abstract: Deploying neural networks on microcontrollers is constrained by kilobytes of flash and SRAM, where 1x1 pointwise (PW) mixers often dominate memory even after INT8 quantization across vision, audio, and wearable sensing. We present HYPER-TINYPW, a compression-as-generation approach that replaces most stored PW weights with generated weights: a shared micro-MLP synthesizes PW kernels once at load time from tiny per-layer codes, caches them, and executes them with standard integer operators. This preserves commodity MCU runtimes and adds only a one-off synthesis cost; steady-state latency and energy match INT8 separable CNN baselines. Enforcing a shared latent basis across layers removes cross-layer redundancy, while keeping PW1 in INT8 stabilizes early, morphology-sensitive mixing. We contribute (i) TinyML-faithful packed-byte accounting covering generator, heads/factorization, codes, kept PW1, and backbone; (ii) a unified evaluation with validation-tuned t* and bootstrap confidence intervals; and (iii) a deployability analysis covering integer-only inference and boot versus lazy synthesis. On three ECG benchmarks (Apnea-ECG, PTB-XL, MIT-BIH), HYPER-TINYPW shifts the macro-F1 versus flash Pareto frontier: at about 225 kB it matches a roughly 1.4 MB CNN while being 6.31x smaller (84.15% fewer bytes), retaining at least 95% of large-model macro-F1. Under 32-64 kB budgets it sustains balanced detection where compact baselines degrade. The mechanism applies broadly to other 1D biosignals, on-device speech, and embedded sensing tasks where per-layer redundancy dominates, indicating a wider role for compression-as-generation in resource-constrained ML systems. Beyond ECG, HYPER-TINYPW transfers to TinyML audio: on Speech Commands it reaches 96.2% test accuracy (98.2% best validation), supporting broader applicability to embedded sensing workloads where repeated linear mixers dominate memory.
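The compression-as-generation mechanism is easy to sketch: ship tiny per-layer codes plus one shared micro-MLP, synthesize each 1x1 kernel once at load time, and run ordinary convolutions afterwards. Dimensions and names below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Compression-as-generation sketch for pointwise (1x1) mixers: instead of
# storing each layer's C_out x C_in kernel, store a tiny per-layer code and
# a shared micro-MLP that synthesizes the kernel once at load time.

C_IN, C_OUT, CODE = 32, 32, 8

generator = nn.Sequential(          # shared across all PW layers
    nn.Linear(CODE, 64), nn.ReLU(),
    nn.Linear(64, C_OUT * C_IN),
)
layer_codes = [torch.randn(CODE) for _ in range(6)]   # what actually ships

# "load time": synthesize and cache the kernels once
with torch.no_grad():
    pw_kernels = [generator(c).view(C_OUT, C_IN, 1, 1) for c in layer_codes]

def pointwise_mix(x, layer_idx):
    # steady-state inference is an ordinary 1x1 convolution
    return F.conv2d(x, pw_kernels[layer_idx])

x = torch.randn(1, C_IN, 16, 16)
print(pointwise_mix(x, 0).shape)    # torch.Size([1, 32, 16, 16])
```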
[467] GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation
Ruizhong Miao, Yuying Wang, Rongguang Wang, Chenyang Li, Tao Sheng, Sujith Ravi, Dan Roth
Main category: cs.LG
TL;DR: GraphER: A graph-based enrichment and reranking method for RAG systems that captures multiple proximity forms beyond semantic similarity without requiring knowledge graphs.
Details
Motivation: Semantic search in RAG systems is insufficient for complex queries with scattered evidence. Existing approaches either use inefficient iterative agentic retrieval or costly knowledge graphs that don't integrate well with production vector stores.
Method: GraphER performs offline graph-based enrichment of data objects and online graph-based reranking of candidates. It captures multiple proximity forms beyond semantic similarity without requiring knowledge graphs, working seamlessly with standard vector stores.
Result: Experiments on multiple retrieval benchmarks demonstrate GraphER’s effectiveness. The method is retriever-agnostic and introduces negligible latency overhead while improving retrieval quality.
Conclusion: GraphER provides an effective solution for complex retrieval needs by capturing richer proximities beyond semantic similarity, without the maintenance costs of knowledge graphs or the inefficiencies of iterative agentic approaches.
Abstract: Semantic search in retrieval-augmented generation (RAG) systems is often insufficient for complex information needs, particularly when relevant evidence is scattered across multiple sources. Prior approaches to this problem include agentic retrieval strategies, which expand the semantic search space by generating additional queries. However, these methods do not fully leverage the organizational structure of the data and instead rely on iterative exploration, which can lead to inefficient retrieval. Another class of approaches employs knowledge graphs to model non-semantic relationships through graph edges. Although effective in capturing richer proximities, such methods incur significant maintenance costs and are often incompatible with the vector stores used in most production systems. To address these limitations, we propose GraphER, a graph-based enrichment and reranking method that captures multiple forms of proximity beyond semantic similarity. GraphER independently enriches data objects during offline indexing and performs graph-based reranking over candidate objects at query time. This design does not require a knowledge graph, allowing GraphER to integrate seamlessly with standard vector stores. In addition, GraphER is retriever-agnostic and introduces negligible latency overhead. Experiments on multiple retrieval benchmarks demonstrate the effectiveness of the proposed approach.
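As a rough picture of graph-based reranking at query time, the sketch below diffuses retriever scores over a candidate graph whose edges encode extra proximities (shared source, links, co-access, and so on). It conveys the shape of the idea only; GraphER's enrichment and reranking details differ.

```python
import numpy as np

# Graph-based reranking sketch: blend the retriever's semantic scores with
# a few diffusion steps over a candidate graph (personalized-PageRank-style).

def rerank(scores: np.ndarray, adj: np.ndarray,
           alpha: float = 0.7, steps: int = 3) -> np.ndarray:
    deg = adj.sum(axis=1, keepdims=True)
    P = adj / np.maximum(deg, 1e-9)        # row-normalized adjacency
    s = scores.copy()
    for _ in range(steps):
        s = alpha * scores + (1 - alpha) * P @ s
    return s

scores = np.array([0.9, 0.5, 0.4, 0.1])    # semantic similarity to query
adj = np.array([[0, 0, 1, 0],              # doc 0 linked to doc 2
                [0, 0, 0, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(np.argsort(-rerank(scores, adj)))    # doc 2 rises via its neighbors
```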
[468] CVA: Context-aware Video-text Alignment for Video Temporal Grounding
Sungho Moon, Seunghun Lee, Jiwan Seo, Sunghoon Im
Main category: cs.LG
TL;DR: CVA is a novel framework for video temporal grounding that addresses robustness to irrelevant background context through query-aware data augmentation, context-invariant boundary discrimination, and hierarchical transformer architecture.
Details
Motivation: Addressing the challenge of achieving temporally sensitive video-text alignment that remains robust to irrelevant background context in video temporal grounding tasks.
Method: Three key components: 1) Query-aware Context Diversification (QCD) for data augmentation, 2) Context-invariant Boundary Discrimination (CBD) contrastive loss, 3) Context-enhanced Transformer Encoder (CTE) hierarchical architecture.
Result: Achieves state-of-the-art performance on major VTG benchmarks (QVHighlights and Charades-STA), with ~5 point improvement in Recall@1 scores over previous methods.
Conclusion: CVA effectively mitigates false negatives through synergistic data-centric and architectural enhancements for robust video-text alignment.
Abstract: We propose Context-aware Video-text Alignment (CVA), a novel framework to address a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the “false negative” caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context. Through the synergy of these data-centric and architectural enhancements, CVA achieves state-of-the-art performance on major VTG benchmarks, including QVHighlights and Charades-STA. Notably, our method achieves a significant improvement of approximately 5 points in Recall@1 (R1) scores over state-of-the-art methods, highlighting its effectiveness in mitigating false negatives.
[469] A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization
Shalima Binta Manir, Anamika Paul Rupa
Main category: cs.LG
TL;DR: Grokking dynamics in neural networks are primarily driven by optimization stability and regularization interactions rather than architecture, with weight decay being the dominant control parameter requiring a narrow “Goldilocks” regime for delayed generalization to occur.
Details
Motivation: To disentangle the confounding roles of architecture, optimization, and regularization in grokking (delayed transition from memorization to generalization), which remains poorly understood in neural networks.
Method: Controlled study on modular addition (mod 97) with matched and carefully tuned training regimes across models, systematically varying architecture (depth, activation functions, Transformers vs MLPs), optimization stability, and regularization (weight decay).
Result: Depth has non-monotonic effects (depth-4 MLPs fail while depth-8 residual networks recover), Transformer-MLP gap largely disappears under matched hyperparameters (1.11× delay), activation function effects are regime-dependent (GELU up to 4.3× faster than ReLU), and weight decay is the dominant control parameter with a narrow “Goldilocks” regime for grokking.
Conclusion: Grokking is an interaction-driven phenomenon governed by optimization and regularization rather than architecture, challenging architecture-centric interpretations and providing a unified empirical account of delayed generalization.
Abstract: Grokking, the delayed transition from memorization to generalization in neural networks, remains poorly understood, in part because prior empirical studies confound the roles of architecture, optimization, and regularization. We present a controlled study that systematically disentangles these factors on modular addition (mod 97), with matched and carefully tuned training regimes across models. Our central finding is that grokking dynamics are not primarily determined by architecture, but by interactions between optimization stability and regularization. Specifically, we show: (1) depth has a non-monotonic effect, with depth-4 MLPs consistently failing to grok while depth-8 residual networks recover generalization, demonstrating that depth requires architectural stabilization; (2) the apparent gap between Transformers and MLPs largely disappears (1.11× delay) under matched hyperparameters, indicating that previously reported differences are largely due to optimizer and regularization confounds; (3) activation function effects are regime-dependent, with GELU up to 4.3× faster than ReLU only when regularization permits memorization; and (4) weight decay is the dominant control parameter, exhibiting a narrow “Goldilocks” regime in which grokking occurs, while too little or too much prevents generalization. Across 3–5 seeds per configuration, these results provide a unified empirical account of grokking as an interaction-driven phenomenon. Our findings challenge architecture-centric interpretations and clarify how optimization and regularization jointly govern delayed generalization.
[470] Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback
Haishan Ye
Main category: cs.LG
TL;DR: First high-probability regret bound for OCO with two-point bandit feedback for strongly convex functions, achieving minimax optimal O(d(log T + log(1/δ))/μ) bound.
Details
Motivation: Address the open problem of achieving tight high-probability regret bounds for strongly convex functions in OCO with two-point bandit feedback, which remained unresolved despite gradient estimation being possible.
Method: Develop new techniques to handle the heavy-tailed nature of bandit gradient estimators, which makes standard concentration analysis difficult, enabling high-probability regret analysis for strongly convex losses.
Result: Achieve minimax optimal high-probability regret bound of O(d(log T + log(1/δ))/μ) for μ-strongly convex losses, resolving the open challenge highlighted by prior work.
Conclusion: Successfully resolves the open problem of tight high-probability regret bounds for strongly convex OCO with two-point bandit feedback, providing optimal theoretical guarantees.
Abstract: We consider the problem of Online Convex Optimization (OCO) with two-point bandit feedback in an adversarial environment. In this setting, a player attempts to minimize a sequence of adversarially generated convex loss functions, while only observing the value of each function at two points. While it is well-known that two-point feedback allows for gradient estimation, achieving tight high-probability regret bounds for strongly convex functions still remained open, as highlighted by Agarwal et al. (2010). The primary challenge lies in the heavy-tailed nature of bandit gradient estimators, which makes standard concentration analysis difficult. In this paper, we resolve this open challenge by providing the first high-probability regret bound of $O(d(\log T + \log(1/\delta))/\mu)$ for $\mu$-strongly convex losses. Our result is minimax optimal with respect to both the time horizon $T$ and the dimension $d$.
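For reference, the standard two-point gradient estimator in this feedback model (a textbook construction, not specific to this paper) queries the loss at $x_t \pm \delta u_t$ for a random unit direction $u_t$:

```latex
\[
  \hat{g}_t \;=\; \frac{d}{2\delta}\,
  \bigl(f_t(x_t + \delta u_t) - f_t(x_t - \delta u_t)\bigr)\, u_t,
  \qquad u_t \sim \mathrm{Unif}(\mathbb{S}^{d-1}).
\]
```

This is an unbiased estimate of the gradient of a $\delta$-smoothed version of $f_t$, but its norm can reach order $dL$ for $L$-Lipschitz losses, far above the true gradient norm, which is exactly the concentration difficulty the paper's high-probability analysis has to overcome.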
[471] Epistemic Compression: The Case for Deliberate Ignorance in High-Stakes AI
Steffen Lukas
Main category: cs.LG
TL;DR: Paper introduces Epistemic Compression principle: match model complexity to data shelf life rather than scaling parameters, using a Regime Index to determine when simplicity beats complexity in high-stakes domains.
Details
Motivation: Foundation models fail in high-stakes domains like medicine and finance where reliability matters most (Fidelity Paradox). This is structural: in domains with changing rules, extra model capacity amplifies noise rather than capturing signal.
Method: Introduces Epistemic Compression principle and Regime Index that separates Shifting Regime (unstable, data-poor; simplicity wins) from Stable Regime (invariant, data-rich; complexity viable). Architecture enforces parsimony by making it costly to represent variance exceeding data evidence.
Result: In exploratory synthesis of 15 high-stakes domains, the Regime Index was concordant with empirically superior modeling strategy in 86.7% of cases (13/15).
Conclusion: High-stakes AI demands shift from scaling for its own sake to principled parsimony. Model structure should match data shelf life rather than maximizing parameters.
Abstract: Foundation models excel in stable environments, yet often fail where reliability matters most: medicine, finance, and policy. This Fidelity Paradox is not just a data problem; it is structural. In domains where rules change over time, extra model capacity amplifies noise rather than capturing signal. We introduce Epistemic Compression: the principle that robustness emerges from matching model complexity to the shelf life of the data, not from scaling parameters. Unlike classical regularization, which penalizes weights post hoc, Epistemic Compression enforces parsimony through architecture: the model structure itself is designed to reduce overfitting by making it architecturally costly to represent variance that exceeds the evidence in the data. We operationalize this with a Regime Index that separates Shifting Regime (unstable, data-poor; simplicity wins) from Stable Regime (invariant, data-rich; complexity viable). In an exploratory synthesis of 15 high-stakes domains, this index was concordant with the empirically superior modeling strategy in 86.7% of cases (13/15). High-stakes AI demands a shift from scaling for its own sake to principled parsimony.
[472] Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, Bowen Zhou, Zhanping Zhong, Zhijie Zhong, Haiteng Zhao, Penghao Zhao, Xiaomeng Zhao, Zhiyuan Zhao, Yechen Zhang, Jin Zhang, Wenwei Zhang, Hongjie Zhang, Zhuo Zhang, Wenlong Zhang, Bo Zhang, Chao Zhang, Chen Zhang, Yuhang Zang, Fei Yuan, Jiakang Yuan, Jiashuo Yu, Jinhui Yin, Haochen Ye, Qian Yao, Bowen Yang, Danni Yang, Kaichen Yang, Ziang Yan, Jun Xu, Yicheng Xu, Wanghan Xu, Xuenan Xu, Chao Xu, Ruiliang Xu, Shuhao Xing, Long Xing, Xinchen Xie, Ling-I Wu, Zijian Wu, Zhenyu Wu, Lijun Wu, Yue Wu, Jianyu Wu, Wen Wu, Fan Wu, Xilin Wei, Qi Wei, Bingli Wang, Rui Wang, Ziyi Wang, Zun Wang, Yi Wang, Haomin Wang, Yizhou Wang, Lintao Wang, Yiheng Wang, Longjiang Wang, Bin Wang, Jian Tong, Zhongbo Tian, Huanze Tang, Chen Tang, Shixiang Tang, Yu Sun, Qiushi Sun, Xuerui Su, Qisheng Su, Chenlin Su, Demin Song, Jin Shi, Fukai Shang, Yuchen Ren, Pengli Ren, Xiaoye Qu, Yuan Qu, Jiantao Qiu, Yu Qiao, Runyu Peng, Tianshuo Peng, Jiahui Peng, Qizhi Pei, Zhuoshi Pan, Linke Ouyang, Wenchang Ning, Yichuan Ma, Zerun Ma, Ningsheng Ma, Runyuan Ma, Chengqi Lyu, Haijun Lv, Han Lv, Lindong Lu, Kuikun Liu, Jiangning Liu, Yuhong Liu, Kai Liu, Hongwei Liu, Zhoumianze Liu, Mengjie Liu, Ziyu Liu, Wenran Liu, Yang Liu, Liwei Liu, Kaiwen Liu, Junyao Lin, Junming Lin, Tianyang Lin, Dahua Lin, Jianze Liang, Linyang Li, Peiji Li, Zonglin Li, Zehao Li, Pengze Li, Guoyan Li, Lingkai Kong, Linglin Jing, Zhenjiang Jin, Feifei Jiang, Qian Jiang, Junhao Huang, Zixian Huang, Haian Huang, Zhouqi Hua, Han Hu, Linfeng Hou, Yinan He, Conghui He, Tianyao He, Xu Guo, Qipeng Guo, Aijia Guo, Yuzhe Gu, Lixin Gu, Jingyang Gong, Qiming Ge, Jiaye Ge, Songyang Gao, Jianfei Gao, Xinyu Fang, Caihua fan, Yue Fan, Yanhui Duan, Zichen Ding, Shengyuan Ding, Xuanlang Dai, Erfei Cui, Ganqu Cui, Pei Chu, Tao Chu, Guangran Cheng, Yu Cheng, Kai Chen, Yongkang Chen, Chiyu Chen, Guanzhou Chen, Qiaosheng Chen, Sitao Chen, Xin Chen, Haojiong Chen, Yicheng Chen, Weihan Cao, Yuhang Cao, Qinglong Cao, Lei Bai
Main category: cs.LG
TL;DR: Intern-S1-Pro is a 1-trillion parameter scientific multimodal foundation model that combines general reasoning and image-text understanding with specialized scientific expertise across 100+ tasks in chemistry, materials, life sciences, and earth sciences, serving as a “Specializable Generalist” that outperforms proprietary models in scientific depth.
Details
Motivation: To create a massive-scale multimodal foundation model that bridges general intelligence with specialized scientific expertise, addressing the need for models that can handle both broad reasoning tasks and deep scientific domain knowledge simultaneously.
Method: Scales to 1-trillion parameters using XTuner and LMDeploy infrastructure for efficient RL training at massive scale while maintaining precision consistency between training and inference. Integrates multimodal capabilities with scientific domain knowledge across multiple disciplines.
Result: Achieves comprehensive enhancement across general and scientific domains, demonstrates advanced agent capabilities, masters over 100 specialized scientific tasks, and outperforms proprietary models in scientific task depth while maintaining top-tier general capabilities.
Conclusion: Intern-S1-Pro successfully demonstrates the fusion of general and specialized intelligence as a “Specializable Generalist,” showing that massive-scale multimodal models can excel in both broad reasoning and deep scientific expertise simultaneously.
Abstract: We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.
[473] The Order Is The Message
Jordan LeDoux
Main category: cs.LG
TL;DR: Training order in modular arithmetic tasks dramatically affects learning efficiency, with structured orderings enabling generalization from minimal data while IID ordering fails, revealing ordering as a covert information channel.
Details
Motivation: The paper investigates how training example ordering affects learning in neural networks, challenging the assumption that IID sampling is optimal and exploring whether structured orderings can enable learning from minimal data.
Method: Controlled experiments on modular arithmetic (p=9973) with identical training data but varying ordering strategies: fixed-order sequences, IID baseline, and adversarial ordering. Analysis of learned Fourier representations across different seeds and initializations.
Result: Fixed-order strategies achieved 99.5% test accuracy from only 0.3% of input space, while IID baseline reached only 0.30% after 5000 epochs. Adversarial ordering suppressed learning entirely. Models constructed Fourier representations whose fundamental frequency matched the ordering structure.
Conclusion: Training order serves as a powerful information channel that can dramatically accelerate learning or suppress it entirely, with implications for training efficiency, grokking phenomena reinterpretation, and security risks from covert channels.
Abstract: In a controlled experiment on modular arithmetic ($p = 9973$), varying only example ordering while holding all else constant, two fixed-ordering strategies achieve 99.5% test accuracy by epochs 487 and 659, respectively, from a training set comprising 0.3% of the input space, well below established sample complexity lower bounds for this task under IID ordering. The IID baseline achieves 0.30% after 5,000 epochs from identical data. An adversarially structured ordering suppresses learning entirely. The generalizing model reliably constructs a Fourier representation whose fundamental frequency is the Fourier dual of the ordering structure, encoding information present in no individual training example, with the same fundamental emerging across all seeds tested regardless of initialization or training set composition. We discuss implications for training efficiency, the reinterpretation of grokking, and the safety risks of a channel that evades all content-level auditing.
[474] SIGMA: Structure-Invariant Generative Molecular Alignment for Chemical Language Models via Autoregressive Contrastive Learning
Xinyu Wang, Fei Dou, Jinbo Bi, Minghu Song
Main category: cs.LG
TL;DR: SIGMA addresses modality mismatch in molecular generation by aligning latent representations of equivalent molecular structures through token-level contrastive learning and isomorphic beam search.
Details
Motivation: Linear string representations for molecular generation create a fundamental modality mismatch where a single molecular graph maps to multiple distinct sequences, causing trajectory divergence where latent representations of structurally equivalent partial graphs drift apart due to different linearization histories.
Method: Proposes Structure-Invariant Generative Molecular Alignment (SIGMA), which uses a token-level contrastive objective to align latent states of prefixes that share identical suffixes, enabling the model to recognize geometric symmetries without altering the linear representation. Also introduces Isomorphic Beam Search (IsoBeam) to eliminate isomorphic redundancy during inference by dynamically pruning equivalent paths.
Result: Empirical evaluations on standard benchmarks show SIGMA bridges the gap between sequence scalability and graph fidelity, yielding superior sample efficiency and structural diversity in multi-parameter optimization compared to strong baselines.
Conclusion: SIGMA resolves the modality mismatch in molecular generation without abandoning efficient string formulations, enabling scalable autoregressive generation while maintaining structural fidelity through geometric symmetry recognition.
Abstract: Linearized string representations serve as the foundation of scalable autoregressive molecular generation; however, they introduce a fundamental modality mismatch where a single molecular graph maps to multiple distinct sequences. This ambiguity leads to \textit{trajectory divergence}, where the latent representations of structurally equivalent partial graphs drift apart due to differences in linearization history. To resolve this without abandoning the efficient string formulation, we propose Structure-Invariant Generative Molecular Alignment (SIGMA). Rather than altering the linear representation, SIGMA enables the model to strictly recognize geometric symmetries via a token-level contrastive objective, which explicitly aligns the latent states of prefixes that share identical suffixes. Furthermore, we introduce Isomorphic Beam Search (IsoBeam) to eliminate isomorphic redundancy during inference by dynamically pruning equivalent paths. Empirical evaluations on standard benchmarks demonstrate that SIGMA bridges the gap between sequence scalability and graph fidelity, yielding superior sample efficiency and structural diversity in multi-parameter optimization compared to strong baselines.
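The redundancy that IsoBeam targets is easy to see with finished hypotheses: two different SMILES strings can denote the same molecule. The sketch below deduplicates completed beams by canonical form with RDKit; IsoBeam's real contribution is pruning equivalent partial paths during decoding, which is harder than this finished-string illustration.

```python
from rdkit import Chem

# Isomorphism-aware beam deduplication sketch: key finished hypotheses by
# canonical SMILES and keep only the highest-scoring representative.

def dedup_beams(beams: list[tuple[str, float]]) -> list[tuple[str, float]]:
    best: dict[str, tuple[str, float]] = {}
    for smiles, score in beams:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:                      # drop invalid hypotheses
            continue
        key = Chem.MolToSmiles(mol)          # canonical SMILES
        if key not in best or score > best[key][1]:
            best[key] = (smiles, score)
    return sorted(best.values(), key=lambda b: -b[1])

# "C(C)O" and "OCC" are both ethanol; only one survives
print(dedup_beams([("C(C)O", -1.2), ("OCC", -0.9), ("CCN", -1.5)]))
```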
[475] An Explainable Ensemble Learning Framework for Crop Classification with Optimized Feature Pyramids and Deep Networks
Syed Rayhan Masud, SK Muktadir Hossain, Md. Ridoy Sarkar, Mohammad Sakib Mahmood, Md. Kishor Morol, Rakib Hossain Sajib
Main category: cs.LG
TL;DR: An explainable ensemble learning framework for crop suitability prediction using soil and climate data, achieving 98.80% accuracy with interpretable AI methods.
Details
Motivation: Address agricultural challenges from climate change and soil degradation by developing data-driven crop classification and recommendation systems that bridge the gap between complex ML models and actionable agricultural decision-making.
Method: Proposes an explainable ensemble learning paradigm combining optimized feature pyramids, deep networks, self-attention mechanisms, and residual networks. Uses preprocessing (label encoding, IQR outlier removal, StandardScaler normalization, SMOTE balancing) and compares various ML models (Logistic Regression, KNN, SVM, Decision Trees, Random Forest, Gradient Boosting, Relative Error SVM) with Grid Search hyperparameter tuning. Final meta-ensemble design integrates best models.
Result: Final Ensemble meta-ensemble achieves 98.80% accuracy, precision, recall, and F1-score, outperforming individual models like KNN (95.56% accuracy). SHAP and permutation importance identify critical features: soil pH, nitrogen, and zinc.
Conclusion: The paradigm successfully bridges complex ML models with practical agricultural decision-making, providing explainable, high-accuracy crop suitability predictions that foster sustainability and trust in AI-powered agricultural recommendations.
Abstract: Agriculture is increasingly challenged by climate change, soil degradation, and resource depletion, and hence requires advanced data-driven crop classification and recommendation solutions. This work presents an explainable ensemble learning paradigm that fuses optimized feature pyramids, deep networks, self-attention mechanisms, and residual networks for bolstering crop suitability predictions based on soil characteristics (e.g., pH, nitrogen, potassium) and climatic conditions (e.g., temperature, rainfall). With a dataset comprising 3,867 instances and 29 features from the Ethiopian Agricultural Transformation Agency and NASA, the paradigm leverages preprocessing methods such as label encoding, outlier removal using IQR, normalization through StandardScaler, and SMOTE for balancing classes. A range of machine learning models such as Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forest, Gradient Boosting, and a new Relative Error Support Vector Machine are compared, with hyperparameter tuning through Grid Search and cross-validation. The suggested “Final Ensemble” meta-ensemble design achieves 98.80% accuracy, precision, recall, and F1-score, outperforming individual models such as K-Nearest Neighbors (95.56% accuracy). Explainable AI methods, such as SHAP and permutation importance, offer actionable insights, highlighting critical features such as soil pH, nitrogen, and zinc. The paradigm addresses the gap between intricate ML models and actionable agricultural decision-making, fostering sustainability and trust in AI-powered recommendations.
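The preprocessing and tuning recipe listed above maps directly onto a compact scikit-learn / imbalanced-learn pipeline. A minimal sketch on synthetic stand-in data (column semantics are illustrative):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# IQR outlier removal + StandardScaler + SMOTE + grid-searched classifier,
# mirroring the steps described in the abstract.

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))                     # e.g. pH, N, K, temp, rain
y = rng.integers(0, 3, size=600)                  # crop classes

# IQR outlier removal happens before the pipeline (it drops rows)
q1, q3 = np.percentile(X, 25, axis=0), np.percentile(X, 75, axis=0)
iqr = q3 - q1
mask = ((X >= q1 - 1.5 * iqr) & (X <= q3 + 1.5 * iqr)).all(axis=1)
X, y = X[mask], y[mask]

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=0)),             # balances training folds only
    ("knn", KNeighborsClassifier()),
])
grid = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```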
[476] Process-Aware AI for Rainfall-Runoff Modeling: A Mass-Conserving Neural Framework with Hydrological Process Constraints
Mohammad A. Farmani, Hoshin V. Gupta, Ali Behrangi, Muhammad Jawad, Sadaf Moghisi, Guo-Yue Niu
Main category: cs.LG
TL;DR: Physics-aware AI framework (Mass-Conserving Perceptron) for hydrological modeling that enforces conservation principles while learning process relationships from data, with progressive embedding of physical processes improving predictive skill and interpretability.
Details
Motivation: Machine learning models in hydrology often achieve high predictive accuracy but lack physical interpretability. There's a need for AI frameworks that can enforce conservation principles while learning hydrological processes from data.
Method: Mass-Conserving Perceptron (MCP) framework that progressively embeds physically meaningful representations: bounded soil storage, state-dependent conductivity, variable porosity, infiltration capacity, surface ponding, vertical drainage, and nonlinear water-table dynamics. Evaluated across 15 catchments in five hydroclimatic regions using daily streamflow prediction.
Result: Progressively augmenting physical structure generally improves predictive performance. Effects are hydroclimate dependent: vertical drainage improves skill in arid/snow basins but reduces performance in rainfall regions, while surface ponding has small effects. Best MCP configurations approach LSTM benchmark skill while maintaining physical interpretability.
Conclusion: Embedding hydrological process constraints within AI architectures provides a promising pathway toward interpretable and process-aware rainfall-runoff modeling, balancing predictive accuracy with physical interpretability.
Abstract: Machine learning models can achieve high predictive accuracy in hydrological applications but often lack physical interpretability. The Mass-Conserving Perceptron (MCP) provides a physics-aware artificial intelligence (AI) framework that enforces conservation principles while allowing hydrological process relationships to be learned from data. In this study, we investigate how progressively embedding physically meaningful representations of hydrological processes within a single MCP storage unit improves predictive skill and interpretability in rainfall-runoff modeling. Starting from a minimal MCP formulation, we sequentially introduce bounded soil storage, state-dependent conductivity, variable porosity, infiltration capacity, surface ponding, vertical drainage, and nonlinear water-table dynamics. The resulting hierarchy of process-aware MCP models is evaluated across 15 catchments spanning five hydroclimatic regions of the continental United States using daily streamflow prediction as the target. Results show that progressively augmenting the internal physical structure of the MCP unit generally improves predictive performance. The influence of these process representations is strongly hydroclimate dependent: vertical drainage substantially improves model skill in arid and snow-dominated basins but reduces performance in rainfall-dominated regions, while surface ponding has comparatively small effects. The best-performing MCP configurations approach the predictive skill of a Long Short-Term Memory benchmark while maintaining explicit physical interpretability. These results demonstrate that embedding hydrological process constraints within AI architectures provides a promising pathway toward interpretable and process-aware rainfall-runoff modeling.
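The conservation idea at the core of such a unit can be sketched in a few lines: learned gates split the available water into fluxes whose fractions sum to one by construction, so mass can never be created or lost. This mirrors the principle only; the MCP's actual parameterization is richer.

```python
import numpy as np

# Single mass-conserving storage unit sketch: softmax gates split the
# available water (storage + precipitation) into ET, drainage, runoff, and
# retained storage, with fractions summing to one by construction.

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.2, -0.5, 0.1, 1.0])   # logits for [ET, drain, runoff, keep]
S = 10.0                                   # initial storage (mm)
for P in [5.0, 0.0, 12.0, 3.0]:            # daily precipitation (mm)
    avail = S + P
    f_et, f_drain, f_runoff, f_keep = softmax(theta)
    et, drain, runoff = f_et * avail, f_drain * avail, f_runoff * avail
    S = f_keep * avail                     # update conserves mass:
    assert abs((et + drain + runoff + S) - avail) < 1e-9
    print(f"P={P:4.1f}  Q={runoff:5.2f}  S={S:5.2f}")
```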
[477] Layer-Specific Lipschitz Modulation for Fault-Tolerant Multimodal Representation Learning
Diyar Altinses, Andreas Schwung
Main category: cs.LG
TL;DR: A theoretical framework for fault-tolerant multimodal learning that combines self-supervised anomaly detection and error correction with mathematical guarantees against sensor failures.
Details
Motivation: Multimodal systems in safety-critical environments need reliability under sensor failures, signal degradation, or cross-modal inconsistencies. Current approaches lack mathematical guarantees for fault tolerance.
Method: Two-stage self-supervised training: 1) Pre-train multimodal convolutional autoencoder on clean data to preserve anomaly signals in latent space; 2) Expand with learnable compute block for correction and contrastive objectives. Uses Lipschitz- and Jacobian-based criteria to control fault propagation, with layer-specific Lipschitz modulation and gradient clipping.
Result: Experimental results on multimodal fault datasets show improved anomaly detection accuracy and reconstruction under sensor corruption compared to baseline methods.
Conclusion: The framework bridges analytical robustness guarantees with practical fault-tolerant multimodal learning, providing mathematically grounded reliability for safety-critical applications.
Abstract: Modern multimodal systems deployed in industrial and safety-critical environments must remain reliable under partial sensor failures, signal degradation, or cross-modal inconsistencies. This work introduces a mathematically grounded framework for fault-tolerant multimodal representation learning that unifies self-supervised anomaly detection and error correction within a single architecture. Building upon a theoretical analysis of perturbation propagation, we derive Lipschitz- and Jacobian-based criteria that determine whether a neural operator amplifies or attenuates localized faults. Guided by this theory, we propose a two-stage self-supervised training scheme: pre-training a multimodal convolutional autoencoder on clean data to preserve localized anomaly signals in the latent space, and expanding it with a learnable compute block composed of dense layers for correction and contrastive objectives for anomaly identification. Furthermore, we introduce layer-specific Lipschitz modulation and gradient clipping as principled mechanisms to control sensitivity across detection and correction modules. Experimental results on multimodal fault datasets demonstrate that the proposed approach improves both anomaly detection accuracy and reconstruction under sensor corruption. Overall, this framework bridges the gap between analytical robustness guarantees and practical fault-tolerant multimodal learning.
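One standard way to realize per-layer Lipschitz control, sketched below, is spectral normalization plus a per-layer gain: the normalization pins each linear map's Lipschitz constant to 1, and the gain sets the layer's effective bound (below 1 attenuates perturbations, above 1 amplifies them). This illustrates the mechanism, not the paper's exact modulation rule.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

# Layer-specific Lipschitz control sketch: spectral_norm makes each linear
# map 1-Lipschitz; a scalar gain then sets the layer's effective bound.

class LipschitzLinear(nn.Module):
    def __init__(self, d_in, d_out, gain):
        super().__init__()
        self.linear = spectral_norm(nn.Linear(d_in, d_out))
        self.gain = gain                    # this layer's Lipschitz budget

    def forward(self, x):
        return self.gain * torch.relu(self.linear(x))  # ReLU is 1-Lipschitz

# detection path kept sensitive (gain 1.5), correction path contractive (0.8)
net = nn.Sequential(LipschitzLinear(16, 16, 1.5), LipschitzLinear(16, 16, 0.8))

x = torch.randn(4, 16)
delta = 0.01 * torch.randn(4, 16)          # localized input fault
print((net(x + delta) - net(x)).norm() / delta.norm())  # bounded by ~1.5*0.8
```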
[478] SEVerA: Verified Synthesis of Self-Evolving Agents
Debangshu Banerjee, Changming Xu, Gagandeep Singh
Main category: cs.LG
TL;DR: SEVerA framework combines formal verification with LLM-based agent synthesis to guarantee safety and correctness while improving performance on tasks like program verification and symbolic math.
Details
Motivation: Self-evolving LLM agents lack formal guarantees of safety/correctness, raising reliability and security concerns when deployed autonomously on unseen inputs.
Method: Introduces Formally Guarded Generative Models (FGGM) that wrap generative models with verified fallbacks, then SEVerA framework with three stages: Search (synthesize parametric programs with FGGM calls), Verification (prove correctness for all parameters), and Learning (gradient-based optimization while preserving correctness).
Result: Achieves zero constraint violations while improving performance over unconstrained and SOTA baselines on Dafny program verification, symbolic math synthesis, and policy-compliant agentic tool use tasks.
Conclusion: Formal behavioral constraints not only guarantee correctness but also steer synthesis toward higher-quality agents, enabling safe and reliable self-evolving LLM agents.
Abstract: Recent advances have shown the effectiveness of self-evolving LLM agents on tasks such as program repair and scientific discovery. In this paradigm, a planner LLM synthesizes an agent program that invokes parametric models, including LLMs, which are then tuned per task to improve performance. However, existing self-evolving agent frameworks provide no formal guarantees of safety or correctness. Because such programs are often executed autonomously on unseen inputs, this lack of guarantees raises reliability and security concerns. We formulate agentic code generation as a constrained learning problem, combining hard formal specifications with soft objectives capturing task utility. We introduce Formally Guarded Generative Models (FGGM), which allow the planner LLM to specify a formal output contract for each generative model call using first-order logic. Each FGGM call wraps the underlying model in a rejection sampler with a verified fallback, ensuring every returned output satisfies the contract for any input and parameter setting. Building on FGGM, we present SEVerA (Self-Evolving Verified Agents), a three-stage framework: Search synthesizes candidate parametric programs containing FGGM calls; Verification proves correctness with respect to hard constraints for all parameter values, reducing the problem to unconstrained learning; and Learning applies scalable gradient-based optimization, including GRPO-style fine-tuning, to improve the soft objective while preserving correctness. We evaluate SEVerA on Dafny program verification, symbolic math synthesis, and policy-compliant agentic tool use ($\tau^2$-bench). Across tasks, SEVerA achieves zero constraint violations while improving performance over unconstrained and SOTA baselines, showing that formal behavioral constraints not only guarantee correctness but also steer synthesis toward higher-quality agents.
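The FGGM wrapper described in the abstract (rejection sampler plus verified fallback) has a simple skeleton, sketched below with a plain Python predicate standing in for the first-order-logic contract and an arbitrary retry budget:

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")

# FGGM-style guard sketch: every generative call is checked against an
# output contract, with a rejection-sampling loop and a verified fallback
# so the returned value always satisfies the contract.

def fggm_call(model: Callable[[str], T],
              contract: Callable[[T], bool],
              fallback: T,
              prompt: str,
              max_tries: int = 4) -> T:
    for _ in range(max_tries):
        out = model(prompt)
        if contract(out):               # output meets the specification
            return out
    return fallback                     # verified default, always valid

# toy example: the "model" must return a non-negative integer
noisy_model = lambda _: random.randint(-3, 3)
result = fggm_call(noisy_model, lambda n: n >= 0, fallback=0, prompt="count")
assert result >= 0
print(result)
```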
[479] Vision Hopfield Memory Networks
Jianfeng Wang, Amine M’Charrak, Luk Koska, Xiangtao Wang, Daniel Petriceanu, Mykyta Smyrnov, Ruizhi Wang, Michael Bumbar, Luca Pinchetti, Thomas Lukasiewicz
Main category: cs.LG
TL;DR: V-HMN is a brain-inspired vision foundation model using hierarchical Hopfield memory modules for better interpretability and data efficiency compared to Transformers and Mamba.
Details
Motivation: Current vision/multimodal models (Transformers, Mamba) lack biological plausibility, require massive data, and have limited interpretability. The authors aim to create a brain-inspired architecture that bridges neuroscience with machine learning.
Method: V-HMN integrates hierarchical memory mechanisms: local Hopfield modules for patch-level associative memory, global Hopfield modules for episodic memory/contextual modulation, and predictive-coding-inspired refinement for iterative error correction.
Result: Achieved competitive results on computer vision benchmarks compared to standard backbones, while offering better interpretability, higher data efficiency, and stronger biological plausibility.
Conclusion: V-HMN shows potential as a next-generation vision foundation model and provides a blueprint for multimodal backbones in text/audio domains, bridging brain-inspired computation with large-scale ML.
Abstract: Recent vision and multimodal foundation backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress, enabling unified modeling across images, text, and beyond. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. In this work, we propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired foundation backbone that integrates hierarchical memory mechanisms with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, making decisions more interpretable, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances interpretability and data efficiency beyond existing self-attention- or state-space-based approaches. We conducted extensive experiments on public computer vision benchmarks, and V-HMN achieved competitive results against widely adopted backbone architectures, while offering better interpretability, higher data efficiency, and stronger biological plausibility. These findings highlight the potential of V-HMN to serve as a next-generation vision foundation model, while also providing a generalizable blueprint for multimodal backbones in domains such as text and audio, thereby bridging brain-inspired computation with large-scale machine learning.
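The associative-memory operation that modules of this kind build on is the modern (continuous) Hopfield retrieval step, sketched below: a possibly corrupted cue is iteratively pulled toward a softmax-weighted blend of stored patterns. V-HMN's hierarchical local/global wiring adds much more around this core.

```python
import torch

# Modern-Hopfield retrieval sketch: iterate xi <- X^T softmax(beta * X xi)
# to pull a noisy cue toward the nearest stored pattern.

def hopfield_retrieve(query, patterns, beta=8.0, iters=3):
    # patterns: (N, d) stored memories; query: (d,) possibly corrupted cue
    xi = query
    for _ in range(iters):
        weights = torch.softmax(beta * patterns @ xi, dim=0)   # similarity
        xi = patterns.t() @ weights                            # update rule
    return xi

patterns = torch.nn.functional.normalize(torch.randn(32, 64), dim=1)
cue = patterns[7] + 0.3 * torch.randn(64)       # noisy version of pattern 7
out = hopfield_retrieve(cue, patterns)
print(torch.cosine_similarity(out, patterns[7], dim=0))  # close to 1
```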
[480] Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model
Jiahao Wu, Ning Lu, Shengcai Liu, Kun Wang, Yanting Yang, Li Qing, Ke Tang
Main category: cs.LG
TL;DR: HIVE is a dual-stage prompt selection framework that improves RL efficiency for LLM reasoning by selecting high-utility prompts before rollout, focusing on the “learning edge” of intermediate difficulty and high uncertainty.
Details
Motivation: Current RL methods for post-training LLMs in reasoning tasks require multiple rollouts per prompt, which is computationally expensive. Many prompts provide negligible gradients and are of low utility, wasting computational resources.
Method: HIVE uses a dual-stage framework: 1) coarse selection using historical reward trajectories, and 2) real-time pruning using prompt entropy as a proxy to filter out instances with stale utility. It focuses on selecting prompts at the “learning edge”: the intersection of intermediate difficulty and high uncertainty.
Result: HIVE achieves significant rollout efficiency improvements across multiple math reasoning benchmarks and models without compromising performance compared to standard RL approaches.
Conclusion: The HIVE framework effectively addresses computational inefficiency in RL for LLM reasoning by intelligently selecting high-utility prompts, focusing on the evolving “learning edge” during training.
Abstract: Reinforcement learning (RL) has become essential for post-training large language models (LLMs) in reasoning tasks. While scaling rollouts can stabilize training and enhance performance, the computational overhead is a critical issue. In algorithms like GRPO, multiple rollouts per prompt incur prohibitive costs, as a large portion of prompts provide negligible gradients and are thus of low utility. To address this problem, we investigate how to select high-utility prompts before the rollout phase. Our experimental analysis reveals that sample utility is non-uniform and evolving: the strongest learning signals concentrate at the “learning edge”, the intersection of intermediate difficulty and high uncertainty, which shifts as training proceeds. Motivated by this, we propose HIVE (History-Informed and online-VErified prompt selection), a dual-stage framework for data-efficient RL. HIVE utilizes historical reward trajectories for coarse selection and employs prompt entropy as a real-time proxy to prune instances with stale utility. By evaluating HIVE across multiple math reasoning benchmarks and models, we show that HIVE yields significant rollout efficiency gains without compromising performance.
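A "learning edge" selector in the spirit of the dual-stage design might look like the sketch below: coarse-filter by historical pass rate (neither always solved nor never solved), then prune by a live entropy proxy. Thresholds and field names are illustrative, not HIVE's.

```python
# Dual-stage prompt selection sketch: historical reward trajectories for
# coarse filtering, a live entropy proxy for real-time pruning.

def select_prompts(prompts, history, entropy, k,
                   lo=0.2, hi=0.8, min_entropy=0.5):
    # stage 1: keep prompts of intermediate historical difficulty
    edge = [p for p in prompts if lo <= history.get(p, 0.5) <= hi]
    # stage 2: prune stale-utility prompts via the entropy proxy
    fresh = [p for p in edge if entropy.get(p, 1.0) >= min_entropy]
    # spend the rollout budget on the most uncertain survivors
    return sorted(fresh, key=lambda p: -entropy.get(p, 1.0))[:k]

prompts = ["p1", "p2", "p3", "p4"]
history = {"p1": 0.95, "p2": 0.5, "p3": 0.4, "p4": 0.1}   # past pass rates
entropy = {"p1": 0.9, "p2": 0.8, "p3": 0.2, "p4": 0.7}    # current proxy
print(select_prompts(prompts, history, entropy, k=2))      # ['p2']
```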
[481] Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation
Adam Jakobsen, Sushant Gautam, Hugo Lewi Hammer, Susanne Olofsdotter, Miriam S Johanson, Pål Halvorsen, Vajira Thambawita
Main category: cs.LG
TL;DR: Zero-shot knowledge-guided LLM framework generates privacy-preserving synthetic psychiatric data using DSM-5/ICD-10 knowledge, competitive with data-dependent models like CTGAN/TVAE
Details
Motivation: AI healthcare research is constrained by limited access to real patient data due to privacy concerns, creating a need for privacy-preserving synthetic data generation methods.
Method: Zero-shot framework using LLMs steered via Retrieval-Augmented Generation with DSM-5 and ICD-10 knowledge bases to generate synthetic psychiatric tabular data without real data access.
Result: LLM approach competitive on pairwise structure, achieves lowest pairwise error for separation/social anxiety disorders; clinical retrieval improves fidelity; privacy analysis shows modest overlaps comparable to CTGAN
Conclusion: Knowledge-grounded LLMs enable high-quality, privacy-preserving synthetic psychiatric data generation when real datasets are unavailable or cannot be shared
Abstract: AI systems in healthcare research have shown potential to increase patient throughput and assist clinicians, yet progress is constrained by limited access to real patient data. To address this issue, we present a zero-shot, knowledge-guided framework for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10). We conducted experiments using different combinations of knowledge bases to generate privacy-preserving synthetic data. The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks. Evaluation was performed on six anxiety-related disorders: specific phobia, social anxiety disorder, agoraphobia, generalized anxiety disorder, separation anxiety disorder, and panic disorder. CTGAN typically achieves the best marginals and multivariate structure, while the knowledge-augmented LLM is competitive on pairwise structure and attains the lowest pairwise error in separation anxiety and social anxiety. An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM. Privacy analyses indicate that the real data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score. Overall, grounding an LLM in clinical knowledge enables high-quality, privacy-preserving synthetic psychiatric data when real datasets are unavailable or cannot be shared.
[482] A CDF-First Framework for Free-Form Density Estimation
Chenglong Song, Mazharul Islam, Lin Wang, Bing Chen, Bo Yang
Main category: cs.LG
TL;DR: A CDF-first framework for conditional density estimation that models cumulative distribution functions instead of probability density functions to avoid mathematical ill-posedness, using Smooth Min-Max networks to guarantee valid PDFs.
Details
Motivation: Direct PDF estimation is mathematically ill-posed as it requires differentiating empirical distributions, amplifying random fluctuations and necessitating strong inductive biases that limit expressivity. A more stable approach is needed for free-form density estimation capturing multimodality, asymmetry, and topological complexity.
Method: Proposes a CDF-first framework that estimates cumulative distribution functions (stable, well-posed target) then recovers PDFs via differentiation of learned smooth CDFs. Uses Smooth Min-Max (SMM) networks to parameterize CDFs, guaranteeing valid PDFs by construction. For multivariate outputs, employs autoregressive decomposition with SMM factors.
Result: Experiments show the approach outperforms state-of-the-art density estimators on a range of univariate and multivariate conditional density estimation tasks.
Conclusion: The CDF-first framework provides a mathematically sound alternative to direct PDF estimation, enabling stable learning of complex conditional distributions while guaranteeing valid probability densities by construction.
Abstract: Conditional density estimation (CDE) is a fundamental task in machine learning that aims to model the full conditional law $\mathbb{P}(\mathbf{y} \mid \mathbf{x})$, beyond mere point prediction (e.g., mean, mode). A core challenge is free-form density estimation, capturing distributions that exhibit multimodality, asymmetry, or topological complexity without restrictive assumptions. However, prevailing methods typically estimate the probability density function (PDF) directly, which is mathematically ill-posed: differentiating the empirical distribution amplifies random fluctuations inherent in finite datasets, necessitating strong inductive biases that limit expressivity and fail when violated. We propose a CDF-first framework that circumvents this issue by estimating the cumulative distribution function (CDF), a stable and well-posed target, and then recovering the PDF via differentiation of the learned smooth CDF. Parameterizing the CDF with a Smooth Min-Max (SMM) network, our framework guarantees valid PDFs by construction, enables tractable approximate likelihood training, and preserves complex distributional shapes. For multivariate outputs, we use an autoregressive decomposition with SMM factors. Experiments demonstrate our approach outperforms state-of-the-art density estimators on a range of univariate and multivariate tasks.
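A minimal sketch of the CDF-first recipe, substituting a mixture-of-sigmoids parameterization (monotone in y by construction) for the paper's Smooth Min-Max network; layer sizes and component counts are illustrative:

```python
import torch
import torch.nn as nn

class MonotoneCDF(nn.Module):
    """Conditional CDF F(y|x) as a mixture of sigmoids; monotone in y by
    construction, so its derivative is a valid PDF. A simplified stand-in
    for the paper's Smooth Min-Max (SMM) network."""
    def __init__(self, x_dim, n_comp=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3 * n_comp))

    def forward(self, x, y):
        logits, mu, raw_scale = self.net(x).chunk(3, dim=-1)
        pi = torch.softmax(logits, dim=-1)          # weights sum to 1
        scale = nn.functional.softplus(raw_scale)   # positive slopes
        return (pi * torch.sigmoid(scale * (y - mu))).sum(-1)

def conditional_pdf(model, x, y):
    """Recover p(y|x) = dF/dy by differentiating the learned smooth CDF."""
    y = y.clone().requires_grad_(True)
    F = model(x, y)
    return torch.autograd.grad(F.sum(), y, create_graph=True)[0]
```

Training can then maximize an approximate likelihood through conditional_pdf, since the monotone parameterization guarantees the recovered density is non-negative and integrates to one.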
[483] Gap Safe Screening Rules for Fast Training of Robust Support Vector Machines under Feature Noise
Tan-Hau Nguyen, Thu-Le Tran, Kien Trung Nguyen
Main category: cs.LG
TL;DR: First safe sample screening rules for Robust Support Vector Machines (R-SVMs) that reduce computational cost without affecting optimal solution by identifying samples whose uncertainty sets lie entirely on one side of the margin hyperplane.
Details
Motivation: Robust SVMs improve reliability against feature noise but suffer from increased computational cost. There's a need to accelerate training while preserving robustness.
Method: Develop safe screening rules for R-SVMs using Lagrangian duality (instead of Fenchel-Rockafellar duality) to identify training samples whose uncertainty sets are guaranteed to lie entirely on either side of the margin hyperplane, reducing problem size.
Result: Proposed method significantly reduces training time while preserving classification accuracy, demonstrating effective computational acceleration for robust models.
Conclusion: First successful application of safe screening techniques to worst-case robust models in supervised ML, enabling efficient training of robust classifiers without compromising solution quality.
Abstract: Robust Support Vector Machines (R-SVMs) address feature noise by adopting a worst-case robust formulation that explicitly incorporates uncertainty sets into training. While this robustness improves reliability, it also leads to increased computational cost. In this work, we develop safe sample screening rules for R-SVMs that reduce the training complexity without affecting the optimal solution. To the best of our knowledge, this is the first study to apply safe screening techniques to worst-case robust models in supervised machine learning. Our approach safely identifies training samples whose uncertainty sets are guaranteed to lie entirely on either side of the margin hyperplane, thereby reducing the problem size and accelerating optimization. Owing to the nonstandard structure of R-SVMs, the proposed screening rules are derived from the Lagrangian duality rather than the Fenchel-Rockafellar duality commonly used in recent methods. Based on this analysis, we first establish an ideal screening rule, and then derive a practical rule by adapting GAP-based safe regions to the robust setting. Experiments demonstrate that the proposed method significantly reduces training time while preserving classification accuracy.
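To make the screening logic concrete, here is an illustrative gap-safe construction under a spherical uncertainty set of radius $\delta_i$; this is a generic sketch, not the paper's exact Lagrangian-duality rule. The worst-case margin of sample $i$ is

$$\min_{\|u\|_2 \le \delta_i} y_i\, w^\top (x_i + u) \;=\; y_i\, w^\top x_i \;-\; \delta_i \|w\|_2 .$$

If duality-gap arguments confine the optimum to a safe ball $\|w^\star - \hat{w}\|_2 \le r$ around the current iterate $\hat{w}$, then the condition

$$y_i\, \hat{w}^\top x_i \;-\; r\,\|x_i\|_2 \;-\; \delta_i \left( \|\hat{w}\|_2 + r \right) \;>\; 1$$

guarantees that the entire uncertainty set of sample $i$ lies strictly on the correct side of the margin for every candidate solution in the ball, so the sample contributes nothing to the hinge loss and can be safely removed before optimization.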
[484] Offline Decision Transformers for Neural Combinatorial Optimization: Surpassing Heuristics on the Traveling Salesman Problem
Hironori Ohigashi, Shinichiro Hamada
Main category: cs.LG
TL;DR: Offline RL with Decision Transformer learns to outperform classical heuristics for Traveling Salesman Problem by training on heuristic solution datasets
Details
Motivation: Neural combinatorial optimization shows promise but relies on online RL which hampers deployment and underutilizes decades of algorithmic knowledge; offline RL can learn superior strategies directly from existing heuristic solutions.
Method: Apply offline RL framework Decision Transformer to learn from heuristic solution datasets; integrate Pointer Network for variable action space of node selection; use expectile regression for optimistic conditioning of Return-to-Go
Result: Method consistently produces higher-quality tours than the four classical heuristics it was trained on, demonstrating ability to exceed performance embedded in existing domain knowledge
Conclusion: Offline RL has potential to unlock and exceed performance embedded in existing domain knowledge for combinatorial optimization problems
Abstract: Combinatorial optimization problems like the Traveling Salesman Problem are critical in industry yet NP-hard. Neural Combinatorial Optimization has shown promise, but its reliance on online reinforcement learning (RL) hampers deployment and underutilizes decades of algorithmic knowledge. We address these limitations by applying the offline RL framework, Decision Transformer, to learn superior strategies directly from datasets of heuristic solutions; it aims not only to imitate but to synthesize and outperform them. Concretely, we (i) integrate a Pointer Network to handle the instance-dependent, variable action space of node selection, and (ii) employ expectile regression for optimistic conditioning of Return-to-Go, which is crucial for instances with widely varying optimal values. Experiments show that our method consistently produces higher-quality tours than the four classical heuristics it is trained on, demonstrating the potential of offline RL to unlock and exceed the performance embedded in existing domain knowledge.
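The expectile-regression piece is compact enough to state concretely. A minimal sketch of the asymmetric loss, with tau = 0.9 chosen purely for illustration:

```python
import torch

def expectile_loss(pred, target, tau=0.9):
    """Asymmetric squared loss |tau - 1[u < 0]| * u^2 with u = target - pred.
    tau > 0.5 penalizes under-prediction more, yielding an optimistic
    (upper-expectile) estimate of the Return-to-Go used for conditioning."""
    u = target - pred
    weight = torch.abs(tau - (u < 0).float())
    return (weight * u.pow(2)).mean()
```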
[485] How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models
Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó
Main category: cs.LG
TL;DR: Pruning preserves rare features better than frequent ones in language models, acting as implicit feature selection, with Wanda pruning outperforming magnitude pruning in feature preservation.
Details
Motivation: While weight pruning is widely used for compressing large language models, its effects on learned internal representations remain poorly understood. The paper aims to systematically study how pruning reshapes feature geometry in language models.
Method: Used Sparse Autoencoders (SAEs) as interpretability probes across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0-60%). Investigated five research questions: seed stability, feature survival, SAE transferability, feature fragility, and causal relevance.
Result: Rare SAE features (low firing rates) survive pruning far better than frequent ones (rho = -1.0 in 11/17 conditions). Wanda pruning preserves feature structure up to 3.7x better than magnitude pruning. Pre-trained SAEs remain viable on Wanda-pruned models up to 50% sparsity. Geometric feature survival doesn’t predict causal importance.
Conclusion: Pruning acts as implicit feature selection, preferentially destroying high-frequency generic features while preserving specialized rare ones. There’s a dissociation between geometric feature survival and causal importance, with implications for interpretability under compression.
Abstract: Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using Sparse Autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0–60%), we investigate five research questions spanning seed stability, feature survival, SAE transferability, feature fragility, and causal relevance. Our most striking finding is that rare SAE features (those with low firing rates) survive pruning far better than frequent ones, with within-condition Spearman correlations of rho = -1.0 in 11 of 17 experimental conditions. This counter-intuitive result suggests that pruning acts as implicit feature selection, preferentially destroying high-frequency generic features while preserving specialized rare ones. We further show that Wanda pruning preserves feature structure up to 3.7x better than magnitude pruning, that pre-trained SAEs remain viable on Wanda-pruned models up to 50% sparsity, and that geometric feature survival does not predict causal importance, a dissociation with implications for interpretability under compression.
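The frequency-versus-survival analysis can be approximated in a few lines, assuming SAE activations are available as token-by-feature arrays; the survival criterion below is a simplification of the paper's protocol:

```python
import numpy as np
from scipy.stats import spearmanr

def firing_rates(acts: np.ndarray) -> np.ndarray:
    """Fraction of tokens on which each SAE feature fires.
    acts: (n_tokens, n_features) array of SAE activations."""
    return (acts > 0).mean(axis=0)

def frequency_vs_survival(acts_dense, acts_pruned, dead_rate=1e-6):
    """Spearman correlation between a feature's pre-pruning firing rate
    and whether it still fires on the pruned model's activations."""
    rates = firing_rates(acts_dense)
    survived = (firing_rates(acts_pruned) > dead_rate).astype(float)
    return spearmanr(rates, survived)
```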
[486] From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents
Shuoling Liu, Zhiquan Tan, Kun Yi, Hui Wu, Yihan Li, Jiangpeng Yan, Liyuan Chen, Kai Chen, Qiang Yang
Main category: cs.LG
TL;DR: Paper introduces a category theory-based framework to formally model deep research agents and creates a benchmark to stress-test them on structural information synthesis tasks, revealing significant limitations in current models.
Details
Motivation: Current evaluation of deep research agents relies on ad hoc empirical benchmarks that don't rigorously model agent behavior or adequately test long-horizon synthesis and ambiguity resolution capabilities.
Method: Formalizes DRA behavior using category theory, modeling research workflows as compositions of structure-preserving maps (functors). Creates a mechanism-aware benchmark with 296 questions testing four interpretable axes: sequential connectivity chains, V-structure pullback intersections, topological ordering, and ontological falsification via Yoneda Probe.
Result: Evaluation of 11 leading models shows persistently low performance (state-of-the-art achieves only 19.9% average accuracy). Reveals dichotomy: models succeed at topological re-ordering and ontological verification but collapse on multi-hop structural synthesis. Exposes reliance on brittle heuristics rather than systemic understanding.
Conclusion: While top-tier autonomous agents can unify search and reasoning, achieving generalized mastery over complex structural information remains a formidable open challenge. The benchmark exposes fundamental limitations in current AI capabilities for formal structural reasoning.
Abstract: Although deep research agents (DRAs) have emerged as a promising paradigm for complex information synthesis, their evaluation remains constrained by ad hoc empirical benchmarks. These heuristic approaches do not rigorously model agent behavior or adequately stress-test long-horizon synthesis and ambiguity resolution. To bridge this gap, we formalize DRA behavior through the lens of category theory, modeling deep research workflow as a composition of structure-preserving maps (functors). Grounded in this theoretical framework, we introduce a novel mechanism-aware benchmark with 296 questions designed to stress-test agents along four interpretable axes: traversing sequential connectivity chains, verifying intersections within V-structure pullbacks, imposing topological ordering on retrieved substructures, and performing ontological falsification via the Yoneda Probe. Our rigorous evaluation of 11 leading models establishes a persistently low baseline, with the state-of-the-art achieving only a 19.9% average accuracy, exposing the difficulty of formal structural stress-testing. Furthermore, our findings reveal a stark dichotomy in the current AI capabilities. While advanced deep research pipelines successfully redefine dynamic topological re-ordering and exhibit robust ontological verification (matching pure reasoning models in falsifying hallucinated premises), they almost universally collapse on multi-hop structural synthesis. Crucially, massive performance variance across tasks exposes a lingering reliance on brittle heuristics rather than a systemic understanding. Ultimately, this work demonstrates that while top-tier autonomous agents can now organically unify search and reasoning, achieving a generalized mastery over complex structural information remains a formidable open challenge. Our implementation will be available at https://github.com/tzq1999/CDR.
[487] Hessian-informed machine learning interatomic potential towards bridging theory and experiments
Bangchen Yin, Jian Ouyang, Zhen Fan, Kailai Lin, Hanshi Hu, Dingshun Lv, Weiluo Ren, Hai Xiao, Ji Chen, Changsu Cao
Main category: cs.LG
TL;DR: Hi-MLIP is a Hessian-informed machine learning interatomic potential that accurately captures curvature of potential energy surfaces, enabling reliable analysis of thermodynamic and kinetic phenomena with dramatically reduced computational cost.
Details
Motivation: Local curvature of potential energy surfaces is critical for predicting experimental observables of molecules and materials from first principles, but remains computationally prohibitive for complex systems. Current methods struggle with accurate Hessian calculations which are essential for thermodynamic and kinetic analysis.
Method: Developed Hi-MLIP (Hessian-informed Machine Learning Interatomic Potential) with HINT (Hessian INformed Training) protocol. HINT uses Hessian pre-training, configuration sampling, curriculum learning, and stochastic projection Hessian loss to achieve 2-4 orders of magnitude reduction in required Hessian labels.
Result: Hi-MLIP significantly improves transition-state search accuracy and brings Gibbs free-energy predictions close to chemical accuracy, especially in data-scarce regimes. Accurately treats strongly anharmonic hydrides, reproducing phonon renormalization and superconducting critical temperatures in agreement with experiments while bypassing computational bottlenecks.
Conclusion: The framework establishes a practical route to enhancing curvature awareness of machine learning interatomic potentials, bridging simulation and experimental observables across a wide range of systems with dramatically reduced computational requirements.
Abstract: Local curvature of potential energy surfaces is critical for predicting certain experimental observables of molecules and materials from first principles, yet it remains far beyond reach for complex systems. In this work, we introduce a Hessian-informed Machine Learning Interatomic Potential (Hi-MLIP) that captures such curvature reliably, thereby enabling accurate analysis of associated thermodynamic and kinetic phenomena. To make Hessian supervision practically viable, we develop a highly efficient training protocol, termed Hessian INformed Training (HINT), achieving two to four orders of magnitude reduction for the requirement of expensive Hessian labels. HINT integrates critical techniques, including Hessian pre-training, configuration sampling, curriculum learning and stochastic projection Hessian loss. Enabled by HINT, Hi-MLIP significantly improves transition-state search and brings Gibbs free-energy predictions close to chemical accuracy especially in data-scarce regimes. Our framework also enables accurate treatment of strongly anharmonic hydrides, reproducing phonon renormalization and superconducting critical temperatures in close agreement with experiment while bypassing the computational bottleneck of anharmonic calculations. These results establish a practical route to enhancing curvature awareness of machine learning interatomic potentials, bridging simulation and experimental observables across a wide range of systems.
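One way to realize a stochastic projection Hessian loss is with double backpropagation, matching Hessian-vector products against reference projections so the full model Hessian is never materialized. A minimal sketch, assuming an energy_fn mapping positions to a scalar energy; the rest of the HINT protocol (pre-training, configuration sampling, curriculum learning) is omitted:

```python
import torch

def stochastic_projection_hessian_loss(energy_fn, positions, hessian_ref,
                                       n_probes=4):
    """Match model Hessian-vector products H @ v against projections of a
    reference (e.g., ab initio) Hessian, using random probe vectors v."""
    pos = positions.clone().requires_grad_(True)
    energy = energy_fn(pos)
    grad = torch.autograd.grad(energy, pos, create_graph=True)[0]
    loss = pos.new_zeros(())
    for _ in range(n_probes):
        v = torch.randn_like(pos)
        # Double backprop: d/dpos (grad . v) equals H @ v for the model.
        hvp = torch.autograd.grad((grad * v).sum(), pos,
                                  retain_graph=True, create_graph=True)[0]
        loss = loss + (hvp.flatten() - hessian_ref @ v.flatten()).pow(2).mean()
    return loss / n_probes
```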
[488] GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs
Selim An, Il hong Suh, Yeseong Kim
Main category: cs.LG
TL;DR: GlowQ: Group-shared low-rank approximation for quantized LLMs that caches shared right factors per input-sharing group and selectively restores only high-benefit groups/layers, reducing latency and memory overhead while maintaining accuracy.
Details
Motivation: Existing quantization techniques degrade accuracy with low-bit representations, and current correction methods restore all layers with error-correction modules in every decoder block, increasing latency and memory overhead.
Method: Proposes GlowQ with group-shared low-rank approximation that caches a single shared right factor per input-sharing group, computing high-precision projection once per group and reusing it across modules. Also introduces selective variant GlowQ-S that applies cached shared module only where it provides largest benefit.
Result: Reduces TTFB by 5.6% and increases throughput by 9.6% on average, while reducing perplexity on WikiText-2 by 0.17% and increasing downstream accuracy by 0.42 percentage points. GlowQ-S further reduces latency, cutting TTFB by 23.4% and increasing throughput by 37.4%, while maintaining accuracy within 0.2 percentage points on average.
Conclusion: GlowQ effectively addresses limitations of existing quantization correction methods by reducing parameter/memory overhead while retaining expressivity of layer-specific corrections through group-shared low-rank approximation and selective application.
Abstract: Quantization techniques such as BitsAndBytes, AWQ, and GPTQ are widely used as a standard method in deploying large language models but often degrade accuracy when using low-bit representations, e.g., 4 bits. Low-rank correction methods (e.g., LQER, QERA, ASER) have been proposed to mitigate this issue; however, they restore all layers and insert error-correction modules into every decoder block, which increases latency and memory overhead. To address this limitation, we propose GlowQ, a group-shared low-rank approximation for quantized LLMs that caches a single shared right factor per input-sharing group and restores only the groups or layers that yield the highest accuracy benefit. GlowQ computes the high-precision projection once per input-sharing group and reuses it across its modules, reducing parameter and memory overhead, and retaining the expressivity of layer-specific corrections. We also propose a selective variant, GlowQ-S, that applies the cached shared module only where it provides the largest benefit. Compared with strong baselines, our approach reduces TTFB by 5.6% and increases throughput by 9.6% on average, while reducing perplexity on WikiText-2 by 0.17% and increasing downstream accuracy by 0.42 percentage points. The selective model GlowQ-S further reduces latency, cutting TTFB by 23.4% and increasing throughput by 37.4%, while maintaining accuracy within 0.2 percentage points on average.
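A plausible construction of the group-shared factorization, consistent with the description but not necessarily the authors' implementation (the SVD fit and the rank are assumptions):

```python
import numpy as np

def group_shared_lowrank(weights, quantized, rank=16):
    """For layers that share the same input (one group), fit a single shared
    right factor B and per-layer left factors A_l so each quantization error
    E_l = W_l - Q_l is approximated by A_l @ B."""
    errors = [w - q for w, q in zip(weights, quantized)]
    stacked = np.concatenate(errors, axis=0)      # (sum of d_out_l, d_in)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    B = vt[:rank]                                 # shared right factor
    A = [e @ B.T for e in errors]                 # least-squares left factors
    return A, B

# At inference each corrected layer computes y = x @ Q_l.T + (x @ B.T) @ A_l.T,
# and the group-shared projection x @ B.T is computed once and reused.
```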
[489] Maximum Entropy Behavior Exploration for Sim2Real Zero-Shot Reinforcement Learning
Jiajun Hu, Nuria Armengol Urpi, Jin Cheng, Stelian Coros
Main category: cs.LG
TL;DR: FB-MEBE: Online zero-shot RL algorithm for quadrupedal control combining unsupervised behavior exploration with regularization critic to enable hardware deployment without finetuning.
Details
Motivation: Zero-shot RL requires diverse pretraining datasets, but undirected exploration yields low-diversity data leading to poor downstream performance and policies unsuitable for hardware deployment.
Method: FB-MEBE combines unsupervised behavior exploration (maximizing entropy of achieved behavior distribution) with a regularization critic that shapes policies toward natural, physically plausible behaviors.
Result: FB-MEBE outperforms other exploration strategies in simulated downstream tasks and produces natural policies that can be directly deployed to hardware without finetuning.
Conclusion: The proposed online zero-shot RL approach enables effective quadrupedal control with policies ready for real-world hardware deployment through improved exploration and regularization.
Abstract: Zero-shot reinforcement learning (RL) algorithms aim to learn a family of policies from a reward-free dataset, and recover optimal policies for any reward function directly at test time. Naturally, the quality of the pretraining dataset determines the performance of the recovered policies across tasks. However, pre-collecting a relevant, diverse dataset without prior knowledge of the downstream tasks of interest remains a challenge. In this work, we study online zero-shot RL for quadrupedal control on real robotic systems, building upon the Forward-Backward (FB) algorithm. We observe that undirected exploration yields low-diversity data, leading to poor downstream performance and rendering policies impractical for direct hardware deployment. Therefore, we introduce FB-MEBE, an online zero-shot RL algorithm that combines an unsupervised behavior exploration strategy with a regularization critic. FB-MEBE promotes exploration by maximizing the entropy of the achieved behavior distribution. Additionally, a regularization critic shapes the recovered policies toward more natural and physically plausible behaviors. We empirically demonstrate that FB-MEBE achieves improved performance compared to other exploration strategies in a range of simulated downstream tasks, and that it renders natural policies that can be seamlessly deployed to hardware without further finetuning. Videos and code are available on our website.
[490] Not a fragment, but the whole: Map-based evaluation of data-driven Fire Danger Index models
Shahbaz Alvi, Italo Epicoco, Jose Maria Costa Saura
Main category: cs.LG
TL;DR: Proposes new evaluation method for wildfire forecasting models that better aligns with operational decision-making by systematically assessing both fire prediction accuracy and false alarm rates, showing ensemble ML improves both.
Details
Motivation: Standard ML evaluation metrics provide a limited measure of operational performance for Fire Danger Index forecasting, and current evaluations often inadequately account for false positive rates, which are critical in real-world decision-making contexts.
Method: Revisits daily FDI model evaluation paradigm and proposes novel evaluation method aligned with real-world decision-making. Systematically assesses performance in accurately predicting fire activity and false positives (false alarms). Uses ensemble of ML models.
Result: Demonstrates that ensemble of ML models improves both fire identification and reduces false positives compared to individual models.
Conclusion: Proposed evaluation framework better captures operational performance of wildfire forecasting models by considering both prediction accuracy and false alarm rates, with ensemble ML approaches offering superior performance.
Abstract: A growing body of literature has focused on predicting wildfire occurrence using machine learning methods, capitalizing on high-resolution data and fire predictors that canonical process-based frameworks largely ignore. Standard evaluation metrics for an ML classifier, while important, provide a potentially limited measure of the model’s operational performance for the Fire Danger Index (FDI) forecast. Furthermore, model evaluation is frequently conducted without adequately accounting for false positive rates, despite their critical relevance in operational contexts. In this paper, we revisit the daily FDI model evaluation paradigm and propose a novel method for evaluating a forest fire forecasting model that is aligned with real-world decision-making. Furthermore, we systematically assess performance in accurately predicting fire activity and the false positives (false alarms). We further demonstrate that an ensemble of ML models improves both fire identification and reduces false positives.
[491] Causal-INSIGHT: Probing Temporal Models to Extract Causal Structure
Benjamin Redden, Hui Wang, Shuyan Li
Main category: cs.LG
TL;DR: Causal-INSIGHT is a model-agnostic framework for extracting directed temporal influence structures from trained predictors using intervention-inspired input clamping at inference time.
Details
Motivation: Understanding directed temporal interactions in multivariate time series is crucial for interpreting complex dynamical systems and the predictive models trained on them. Existing methods often infer causal structure at the data-generating process level, but there's a need to understand how trained predictors actually use temporal dependencies.
Method: Causal-INSIGHT uses systematic, intervention-inspired input clamping applied at inference time to analyze how pre-trained predictors respond. It constructs directed temporal influence signals from these responses and introduces Qbic, a sparsity-aware graph selection criterion that balances predictive fidelity and structural complexity without requiring ground-truth graph labels.
Result: Experiments across synthetic, simulated, and realistic benchmarks show that Causal-INSIGHT generalizes across diverse backbone architectures, maintains competitive structural accuracy, and yields significant improvements in temporal delay localization when applied to existing predictors.
Conclusion: Causal-INSIGHT provides an effective framework for extracting model-implied directed temporal influence structures from trained predictors, offering insights into how these models leverage temporal dependencies for prediction without requiring ground-truth causal graphs.
Abstract: Understanding directed temporal interactions in multivariate time series is essential for interpreting complex dynamical systems and the predictive models trained on them. We present Causal-INSIGHT, a model-agnostic, post-hoc interpretation framework for extracting model-implied (predictor-dependent), directed, time-lagged influence structure from trained temporal predictors. Rather than inferring causal structure at the level of the data-generating process, Causal-INSIGHT analyzes how a fixed, pre-trained predictor responds to systematic, intervention-inspired input clamping applied at inference time. From these responses, we construct directed temporal influence signals that reflect the dependencies the predictor relies on for prediction, and introduce Qbic, a sparsity-aware graph selection criterion that balances predictive fidelity and structural complexity without requiring ground-truth graph labels. Experiments across synthetic, simulated, and realistic benchmarks show that Causal-INSIGHT generalizes across diverse backbone architectures, maintains competitive structural accuracy, and yields significant improvements in temporal delay localization when applied to existing predictors.
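The clamping probe itself is simple to sketch; the predict signature, the clamp value, and the aggregation below are illustrative assumptions:

```python
import numpy as np

def clamp_influence(predict, X, src, lag, value=0.0):
    """Intervention-inspired influence of variable `src` at time t - lag on
    the predictor's outputs. predict: f(X) -> (n_samples, n_vars);
    X: (n_samples, n_steps, n_vars)."""
    base = predict(X)
    X_clamped = X.copy()
    X_clamped[:, -1 - lag, src] = value      # clamp one variable at one lag
    return np.abs(predict(X_clamped) - base).mean(axis=0)

# Sweeping (src, lag) pairs yields a directed, time-lagged influence graph
# reflecting the dependencies the trained predictor actually relies on.
```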
[492] How Class Ontology and Data Scale Affect Audio Transfer Learning
Manuel Milling, Andreas Triantafyllopoulos, Alexander Gebhard, Simon Rampp, Björn W. Schuller
Main category: cs.LG
TL;DR: Audio-to-audio transfer learning study showing that similarity between pre-training and downstream tasks is more important than just increasing pre-training data size or diversity.
Details
Motivation: Despite widespread use of transfer learning in deep learning, there are still open questions about when and how well it works, particularly in audio domains. The researchers aim to understand the factors that make audio-to-audio transfer learning effective.
Method: Pre-trained various models on ontology-based subsets of AudioSet, then fine-tuned them on three computer audition tasks: acoustic scene recognition, bird activity recognition, and speech command recognition. Systematically studied the impact of pre-training data size, class diversity, and task similarity.
Result: Increasing both the number of samples and classes in pre-training data has a positive impact on transfer learning performance. However, similarity between pre-training and downstream tasks generally surpasses these factors, as it enables the model to learn comparable features.
Conclusion: Task similarity is a crucial factor for effective audio-to-audio transfer learning, often more important than simply scaling pre-training data. This provides insights for designing better pre-training strategies in audio domains.
Abstract: Transfer learning is a crucial concept within deep learning that allows artificial neural networks to benefit from a large pre-training data basis when confronted with a task of limited data. Despite its ubiquitous use and clear benefits, there are still many open questions regarding the inner workings of transfer learning and, in particular, regarding the understanding of when and how well it works. To that extent, we perform a rigorous study focusing on audio-to-audio transfer learning, in which we pre-train various model states on (ontology-based) subsets of AudioSet and fine-tune them on three computer audition tasks, namely acoustic scene recognition, bird activity recognition, and speech command recognition. We report that increasing the number of samples and classes in the pre-training data both have a positive impact on transfer learning. This is, however, generally surpassed by similarity between pre-training and the downstream task, which can lead the model to learn comparable features.
[493] Interpretable PM2.5 Forecasting for Urban Air Quality: A Comparative Study of Operational Time-Series Models
Moazzam Umer Gondal, Hamad ul Qudous, Asma Ahmad Farhan, Sultan Alamri
Main category: cs.LG
TL;DR: Lightweight interpretable models (SARIMAX, Facebook Prophet, NeuralProphet) achieve competitive performance for hourly PM2.5 forecasting in Beijing, with Facebook Prophet showing best balance of accuracy and efficiency.
Details
Motivation: Many air-quality forecasting frameworks rely on complex, data-intensive models that are computationally demanding. This study investigates whether lightweight and interpretable approaches can provide competitive performance for practical deployment.
Method: Developed a leakage-aware forecasting workflow with chronological data partitioning, preprocessing, feature selection, and exogenous-driver modeling. Evaluated three forecasting families (SARIMAX, Facebook Prophet, NeuralProphet) under two adaptive regimes: weekly walk-forward refitting and frozen forecasting with online residual correction.
Result: Facebook Prophet achieved strongest performance under walk-forward refitting (MAE 37.61, RMSE 50.10) with substantially less execution time than NeuralProphet. In frozen-model regime, online residual correction improved Facebook Prophet and SARIMAX, with corrected SARIMAX yielding lowest overall error (MAE 32.50, RMSE 46.85). NeuralProphet remained less accurate and less stable.
Conclusion: Lightweight additive forecasting strategies can remain highly competitive for urban air-quality prediction, offering practical balance between accuracy, interpretability, and computational efficiency.
Abstract: Accurate short-term air-quality forecasting is essential for public health protection and urban management, yet many recent forecasting frameworks rely on complex, data-intensive, and computationally demanding models. This study investigates whether lightweight and interpretable forecasting approaches can provide competitive performance for hourly PM2.5 prediction in Beijing, China. Using multi-year pollutant and meteorological time-series data, we developed a leakage-aware forecasting workflow that combined chronological data partitioning, preprocessing, feature selection, and exogenous-driver modeling under the Perfect Prognosis setting. Three forecasting families were evaluated: SARIMAX, Facebook Prophet, and NeuralProphet. To assess practical deployment behavior, the models were tested under two adaptive regimes: weekly walk-forward refitting and frozen forecasting with online residual correction. Results showed clear differences in both predictive accuracy and computational efficiency. Under walk-forward refitting, Facebook Prophet achieved the strongest completed performance, with an MAE of $37.61$ and an RMSE of $50.10$, while also requiring substantially less execution time than NeuralProphet. In the frozen-model regime, online residual correction improved Facebook Prophet and SARIMAX, with corrected SARIMAX yielding the lowest overall error (MAE $32.50$; RMSE $46.85$). NeuralProphet remained less accurate and less stable across both regimes, and residual correction did not improve its forecasts. Notably, corrected Facebook Prophet reached nearly the same error as its walk-forward counterpart while reducing runtime from $15$ min $21.91$ sec to $46.60$ sec. These findings show that lightweight additive forecasting strategies can remain highly competitive for urban air-quality prediction, offering a practical balance between accuracy, interpretability, …
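The frozen-model regime is easy to illustrate; below is one common form of online residual correction (a rolling-mean bias corrector), which is an assumption since the summary does not specify the exact corrector:

```python
import numpy as np

def residual_corrected_forecast(base_preds, actuals, window=24):
    """Frozen-model forecasting with online residual correction: add the
    rolling mean of recently observed residuals to each new base forecast."""
    corrected, residuals = [], []
    for t, pred in enumerate(base_preds):
        bias = np.mean(residuals[-window:]) if residuals else 0.0
        corrected.append(pred + bias)
        residuals.append(actuals[t] - pred)  # truth arrives after forecasting
    return np.array(corrected)
```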
[494] Missing-Aware Multimodal Fusion for Unified Microservice Incident Management
Wenzhuo Qian, Hailiang Zhao, Ziqi Wang, Zhipeng Gao, Jiayi Chen, Zhiwei Ling, Shuiguang Deng
Main category: cs.LG
TL;DR: ARMOR is a robust self-supervised framework for microservice incident management that handles missing modalities in metrics, logs, and traces through modality-specific encoders and missing-aware gated fusion.
Details
Motivation: Existing unified frameworks for microservice incident management assume perfect multimodal data completeness, but in practice, network fluctuations and agent failures cause missing modalities. Current approaches using static placeholders introduce imputation noise that masks anomalies and degrades performance.
Method: ARMOR uses: (1) modality-specific asymmetric encoders to isolate distribution disparities among metrics, logs, and traces; (2) missing-aware gated fusion with learnable placeholders and dynamic bias compensation to prevent cross-modal interference from incomplete inputs; (3) self-supervised auto-regression with mask-guided reconstruction to jointly optimize anomaly detection, failure triage, and root cause localization.
Result: ARMOR achieves state-of-the-art performance under complete data conditions and maintains robust diagnostic accuracy even with severe modality loss. Anomaly detection and root cause localization require no fault labels, while failure triage relies only on failure-type annotations for the downstream classifier.
Conclusion: ARMOR provides a robust solution for real-world microservice incident management by effectively handling missing modalities through its specialized architecture and fusion mechanisms, outperforming existing approaches in both complete and incomplete data scenarios.
Abstract: Automated incident management is critical for microservice reliability. While recent unified frameworks leverage multimodal data for joint optimization, they unrealistically assume perfect data completeness. In practice, network fluctuations and agent failures frequently cause missing modalities. Existing approaches relying on static placeholders introduce imputation noise that masks anomalies and degrades performance. To address this, we propose ARMOR, a robust self-supervised framework designed for missing modality scenarios. ARMOR features: (i) a modality-specific asymmetric encoder that isolates distribution disparities among metrics, logs, and traces; and (ii) a missing-aware gated fusion mechanism utilizing learnable placeholders and dynamic bias compensation to prevent cross-modal interference from incomplete inputs. By employing self-supervised auto-regression with mask-guided reconstruction, ARMOR jointly optimizes anomaly detection (AD), failure triage (FT), and root cause localization (RCL). AD and RCL require no fault labels, while FT relies solely on failure-type annotations for the downstream classifier. Extensive experiments demonstrate that ARMOR achieves state-of-the-art performance under complete data conditions and maintains robust diagnostic accuracy even with severe modality loss.
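A minimal sketch of missing-aware gated fusion with learnable placeholders; the presence-conditioned gate stands in for the paper's dynamic bias compensation, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class MissingAwareFusion(nn.Module):
    """Fuse metric/log/trace embeddings: absent modalities are replaced by
    learnable placeholders and re-weighted by a presence-conditioned gate."""
    def __init__(self, n_modalities=3, dim=128):
        super().__init__()
        self.placeholders = nn.Parameter(torch.zeros(n_modalities, dim))
        self.gate = nn.Linear(dim + n_modalities, 1)

    def forward(self, feats, present):
        # feats: (B, M, dim); present: (B, M) float mask, 1 = observed.
        mask = present.unsqueeze(-1)
        x = mask * feats + (1 - mask) * self.placeholders
        ctx = present.unsqueeze(1).expand(-1, x.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([x, ctx], dim=-1)))
        return (g * x).sum(dim=1) / g.sum(dim=1).clamp_min(1e-6)
```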
[495] Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes
Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, Dongbin Zhao
Main category: cs.LG
TL;DR: Improved on-policy distillation for LLMs using teacher top-K local support matching to address fragility in long-horizon settings
Details
Motivation: On-policy distillation (OPD) is valuable for LLM post-training but suffers from fragility in long-horizon settings, where sampled-token variants become unreliable as rollouts drift from teacher-common prefixes.
Method: Proposes teacher top-K local support matching implemented as truncated reverse-KL with top-p rollout sampling and special-token masking to address three failure modes of sampled-token OPD
Result: The method yields more stable optimization and better downstream performance than sampled-token OPD across single-task math reasoning and multi-task agentic-plus-math training
Conclusion: Teacher top-K local support matching addresses fundamental limitations of sampled-token OPD for long-horizon LLM distillation, providing more reliable optimization and improved performance
Abstract: On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.
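The truncated objective is compact enough to sketch. Renormalizing both distributions over the teacher's top-K tokens is one reasonable reading of "local support matching"; K = 20 is illustrative:

```python
import torch

def topk_truncated_reverse_kl(student_logits, teacher_logits, k=20):
    """Reverse KL(student || teacher) restricted to the teacher's top-K
    tokens, with both distributions renormalized over that support."""
    idx = teacher_logits.topk(k, dim=-1).indices
    t = torch.log_softmax(teacher_logits.gather(-1, idx), dim=-1)
    s = torch.log_softmax(student_logits.gather(-1, idx), dim=-1)
    return (s.exp() * (s - t)).sum(-1).mean()
```

In practice this would be combined with top-p rollout sampling and masking of special tokens, per the Method description above.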
[496] An Integrative Genome-Scale Metabolic Modeling and Machine Learning Framework for Predicting and Optimizing Biofuel-Relevant Biomass Production in Saccharomyces cerevisiae
Neha K. Nair, Aaron D’Souza
Main category: cs.LG
TL;DR: A computational framework combining yeast metabolic modeling with machine learning to predict and optimize biomass flux for metabolic engineering.
Details
Motivation: Accurately predicting biomass flux across diverse environmental and genetic perturbations remains challenging for rational yeast strain design in industrial biotechnology.
Method: Combines Yeast9 genome-scale metabolic model with machine learning (Random Forest, XGBoost, variational autoencoder, SHAP analysis) and optimization techniques (Bayesian optimization, generative adversarial networks) to predict, interpret, and enhance biomass flux.
Result: Random Forest and XGBoost achieved R² of 0.99989 and 0.9990 respectively; variational autoencoder revealed four metabolic clusters; SHAP identified key pathways; in silico overexpression achieved 0.979 gDW/hr; Bayesian optimization produced 12-fold increase; GAN proposed novel flux configurations.
Conclusion: The framework demonstrates how genome-scale simulation, interpretable ML, and generative modeling can advance yeast metabolic engineering for industrial applications.
Abstract: Saccharomyces cerevisiae is a cornerstone organism in industrial biotechnology, valued for its genetic tractability and robust fermentative capacity. Accurately predicting biomass flux across diverse environmental and genetic perturbations remains a significant challenge for rational strain design. We present a computational framework combining the Yeast9 genome-scale metabolic model with machine learning and optimization to predict, interpret, and enhance biomass flux. Flux balance analysis generated 2,000 flux profiles by varying glucose, oxygen, and ammonium uptake rates. Random Forest and XGBoost regressors achieved R² of 0.99989 and 0.9990, respectively. A variational autoencoder revealed four distinct metabolic clusters, and SHAP analysis identified glycolysis, the TCA cycle, and lipid biosynthesis as key biomass determinants. In silico overexpression achieved a biomass flux of 0.979 gDW/hr, while Bayesian optimization of nutrient constraints produced a 12-fold increase (0.0858 to 1.041 gDW/hr). A generative adversarial network proposed stoichiometrically feasible novel flux configurations. This framework demonstrates how genome-scale simulation, interpretable ML, and generative modeling can advance yeast metabolic engineering.
[497] Spatiotemporal System Forecasting with Irregular Time Steps via Masked Autoencoder
Kewei Zhu, Yanze Xin, Jinwei Hu, Xiaoyuan Cheng, Yiming Yang, Sibo Cheng
Main category: cs.LG
TL;DR: Physics-Spatiotemporal Masked Autoencoder for predicting high-dimensional dynamical systems with irregular time steps, integrating convolutional autoencoders for spatial features and masked autoencoders for irregular time series.
Details
Motivation: Predicting high-dimensional dynamical systems with irregular time steps is challenging due to missing data, sparse observations, or adaptive computational techniques, which reduce prediction accuracy in current data-driven methods.
Method: Proposes a Physics-Spatiotemporal Masked Autoencoder that combines convolutional autoencoders for spatial feature extraction with masked autoencoders optimized for irregular time series, using attention mechanisms to reconstruct entire physical sequences in a single prediction pass without data imputation.
Result: The method achieves significant improvements in prediction accuracy, robustness to nonlinearities, and computational efficiency over traditional convolutional and recurrent network methods on simulated datasets and real-world ocean temperature data.
Conclusion: The model effectively captures complex spatiotemporal patterns without requiring domain-specific knowledge, with applications in climate modeling, fluid dynamics, ocean forecasting, environmental monitoring, and scientific computing.
Abstract: Predicting high-dimensional dynamical systems with irregular time steps presents significant challenges for current data-driven algorithms. These irregularities arise from missing data, sparse observations, or adaptive computational techniques, reducing prediction accuracy. To address these limitations, we propose a novel method: a Physics-Spatiotemporal Masked Autoencoder. This method integrates convolutional autoencoders for spatial feature extraction with masked autoencoders optimised for irregular time series, leveraging attention mechanisms to reconstruct the entire physical sequence in a single prediction pass. The model avoids the need for data imputation while preserving physical integrity of the system. Here, ‘physics’ refers to high-dimensional fields generated by underlying dynamical systems, rather than the enforcement of explicit physical constraints or PDE residuals. We evaluate this approach on multiple simulated datasets and real-world ocean temperature data. The results demonstrate that our method achieves significant improvements in prediction accuracy, robustness to nonlinearities, and computational efficiency over traditional convolutional and recurrent network methods. The model shows potential for capturing complex spatiotemporal patterns without requiring domain-specific knowledge, with applications in climate modelling, fluid dynamics, ocean forecasting, environmental monitoring, and scientific computing.
[498] Social Hippocampus Memory Learning
Liping Yi, Zhiming Zhao, Qinghua Hu
Main category: cs.LG
TL;DR: SoHip is a memory-centric social machine learning framework that enables heterogeneous agents to collaborate through memory sharing instead of model sharing, improving performance while preserving privacy.
Details
Motivation: Existing federated learning approaches for heterogeneous agents often share model parameters or intermediate representations, which can expose sensitive information and incur additional overhead. The authors aim to develop a more privacy-preserving and efficient collaboration framework inspired by social learning principles.
Method: SoHip abstracts each agent’s individual short-term memory from local representations, consolidates it into individual long-term memory through a hippocampus-inspired mechanism, and fuses it with collectively aggregated long-term memory to enhance local prediction. Only lightweight memory is exchanged while raw data and local models remain on-device.
Result: Experiments on two benchmark datasets with seven baselines show that SoHip consistently outperforms existing methods, achieving up to 8.78% accuracy improvements. Theoretical analysis confirms convergence and privacy preservation properties.
Conclusion: SoHip provides an effective memory-centric approach for social machine learning that enables privacy-preserving collaboration among heterogeneous agents through memory sharing rather than model sharing, demonstrating superior performance over existing methods.
Abstract: Social learning highlights that learning agents improve not in isolation, but through interaction and structured knowledge exchange with others. When introduced into machine learning, this principle gives rise to social machine learning (SML), where multiple agents collaboratively learn by sharing abstracted knowledge. Federated learning (FL) provides a natural collaboration substrate for this paradigm, yet existing heterogeneous FL approaches often rely on sharing model parameters or intermediate representations, which may expose sensitive information and incur additional overhead. In this work, we propose SoHip (Social Hippocampus Memory Learning), a memory-centric social machine learning framework that enables collaboration among heterogeneous agents via memory sharing rather than model sharing. SoHip abstracts each agent’s individual short-term memory from local representations, consolidates it into individual long-term memory through a hippocampus-inspired mechanism, and fuses it with collectively aggregated long-term memory to enhance local prediction. Throughout the process, raw data and local models remain on-device, while only lightweight memory is exchanged. We provide theoretical analysis on convergence and privacy preservation properties. Experiments on two benchmark datasets with seven baselines demonstrate that SoHip consistently outperforms existing methods, achieving up to 8.78% accuracy improvements.
[499] Anchored-Branched Steady-state WInd Flow Transformer (AB-SWIFT): a metamodel for 3D atmospheric flow in urban environments
Armand de Villeroché, Rem-Sophia Mouradi, Vincent Le Guen, Sibo Cheng, Marc Bocquet, Alban Farchi, Patrick Armand, Patrick Massin
Main category: cs.LG
TL;DR: AB-SWIFT is a transformer-based model with branched architecture for urban air flow modeling, trained on atmospheric simulations with various stratifications to predict wind fields around complex urban geometries.
Details
Motivation: Traditional CFD simulations for urban air flow are computationally expensive, and existing deep learning models struggle with high variations in urban geometry and large mesh sizes, necessitating more adaptable surrogate models.
Method: Proposes Anchored Branched Steady-state Wind Flow Transformer (AB-SWIFT), a transformer-based model with internal branched structure specifically designed for atmospheric flow modeling, trained on a database of atmospheric simulations around randomized urban geometries with mixed atmospheric stratifications.
Result: The model achieves best accuracy on all predicted fields compared to state-of-the-art transformers and graph-based models, demonstrating superior performance in urban air flow prediction.
Conclusion: AB-SWIFT effectively addresses challenges in urban air flow modeling by combining transformer architecture with branched structure, offering an accurate and efficient alternative to costly CFD simulations.
Abstract: Air flow modeling at a local scale is essential for applications such as pollutant dispersion modeling or wind farm modeling. To circumvent costly Computational Fluid Dynamics (CFD) computations, deep learning surrogate models have recently emerged as promising alternatives. However, in the context of urban air flow, deep learning models struggle to adapt to the high variations of the urban geometry and to large mesh sizes. To tackle these challenges, we introduce Anchored Branched Steady-state WInd Flow Transformer (AB-SWIFT), a transformer-based model with an internal branched structure uniquely designed for atmospheric flow modeling. We train our model on a specially designed database of atmospheric simulations around randomised urban geometries and with a mixture of unstable, neutral, and stable atmospheric stratifications. Our model reaches the best accuracy on all predicted fields compared to state-of-the-art transformers and graph-based models. Our code and data are available at https://github.com/cerea-daml/abswift.
[500] Uncertainty-Guided Label Rebalancing for CPS Safety Monitoring
John Ayotunde, Qinghua Xu, Guancheng Wang, Lionel C. Briand
Main category: cs.LG
TL;DR: U-Balance: A supervised approach that uses behavioral uncertainty to rebalance imbalanced datasets for CPS safety monitoring, achieving significant performance improvements on UAV telemetry data.
Details
Motivation: Safety monitoring in Cyber-Physical Systems suffers from extreme class imbalance (rare unsafe events), where standard rebalancing techniques fail on time-series telemetry data. Behavioral uncertainty in CPS operations correlates with safety outcomes but remains unexplored for safety monitoring.
Method: U-Balance trains a GatedMLP-based uncertainty predictor that converts telemetry windows into distributional kinematic features and outputs uncertainty scores. It then applies uncertainty-guided label rebalancing (uLNR) that probabilistically relabels safe-labeled windows with high uncertainty as unsafe, enriching the minority class with boundary samples without synthetic data generation.
Result: On a large-scale UAV benchmark with 46:1 safe-to-unsafe ratio, U-Balance achieves 0.806 F1 score, outperforming the strongest baseline by 14.3 percentage points. Results confirm moderate but significant correlation between behavioral uncertainty and safety, with uLNR identified as the most effective strategy.
Conclusion: U-Balance effectively addresses class imbalance in CPS safety monitoring by leveraging behavioral uncertainty, providing a practical solution that maintains inference efficiency while significantly improving safety prediction performance.
Abstract: Safety monitoring is essential for Cyber-Physical Systems (CPSs). However, unsafe events are rare in real-world CPS operations, creating an extreme class imbalance that degrades safety predictors. Standard rebalancing techniques perform poorly on time-series CPS telemetry, either generating unrealistic synthetic samples or overfitting on the minority class. Meanwhile, behavioral uncertainty in CPS operations, defined as the degree of doubt or uncertainty in CPS decisions, is often correlated with safety outcomes but unexplored in safety monitoring. To that end, we propose U-Balance, a supervised approach that leverages behavioral uncertainty to rebalance imbalanced datasets prior to training a safety predictor. U-Balance first trains a GatedMLP-based uncertainty predictor that summarizes each telemetry window into distributional kinematic features and outputs an uncertainty score. It then applies an uncertainty-guided label rebalancing (uLNR) mechanism that probabilistically relabels “safe”-labeled windows with unusually high uncertainty as “unsafe”, thereby enriching the minority class with informative boundary samples without synthesizing new data. Finally, a safety predictor is trained on the rebalanced dataset for safety monitoring. We evaluate U-Balance on a large-scale UAV benchmark with a 46:1 safe-to-unsafe ratio. Results confirm a moderate but significant correlation between behavioral uncertainty and safety. We then identify uLNR as the most effective strategy to exploit uncertainty information, compared to direct early and late fusion. U-Balance achieves a 0.806 F1 score, outperforming the strongest baseline by 14.3 percentage points, while maintaining competitive inference efficiency. Ablation studies confirm that both the GatedMLP-based uncertainty predictor and the uLNR mechanism contribute significantly to U-Balance’s effectiveness.
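A minimal sketch of the uLNR relabeling step; the uncertainty quantile and flip probability are illustrative assumptions:

```python
import numpy as np

def uncertainty_guided_relabel(labels, uncertainty, quantile=0.95,
                               p_flip=0.5, seed=0):
    """Probabilistically relabel 'safe' (0) windows with unusually high
    uncertainty as 'unsafe' (1), enriching the minority class with
    boundary samples instead of synthesizing new data."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    safe = labels == 0
    thresh = np.quantile(uncertainty[safe], quantile)
    candidates = safe & (uncertainty > thresh)
    flip = candidates & (rng.random(labels.shape) < p_flip)
    labels[flip] = 1
    return labels
```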
[501] Longitudinal Digital Phenotyping for Early Cognitive-Motor Screening
Diego Jimenez-Oviedo, Ruben Vera-Rodriguez, Ruben Tolosana, Juan Carlos Ruiz-Garcia, Jaime Herreros-Rodriguez
Main category: cs.LG
TL;DR: AI framework uses tablet interaction data and unsupervised learning to identify and track developmental trajectories in children, revealing stable low-performance profiles that may indicate persistent deficits.
Details
Motivation: Traditional assessments for cognitive-motor development are subjective and static, while digital devices offer opportunities for continuous, objective monitoring through digital biomarkers for early detection and intervention.
Method: Used tablet-based interaction data from children aged 18 months to 8 years across six cognitive-motor tasks, applied dimensionality reduction (t-SNE) and unsupervised clustering (K-Means++) to identify developmental phenotypes, and tracked longitudinal transitions between profiles (see the sketch after the abstract).
Result: Identified three distinct performance profiles (low, medium, high) with high stability in low-performance cluster (>90% retention in early years), suggesting early deficits tend to persist without intervention, while higher-performance clusters show greater variability.
Conclusion: Validates unsupervised learning on touchscreen data for uncovering heterogeneous developmental paths, with identified profiles serving as scalable, data-driven proxies for cognitive growth and foundation for early screening tools and personalized interventions.
Abstract: Early detection of atypical cognitive-motor development is critical for timely intervention, yet traditional assessments rely heavily on subjective, static evaluations. The integration of digital devices offers an opportunity for continuous, objective monitoring through digital biomarkers. In this work, we propose an AI-driven longitudinal framework to model developmental trajectories in children aged 18 months to 8 years. Using a dataset of tablet-based interactions collected over multiple academic years, we analyzed six cognitive-motor tasks (e.g., fine motor control, reaction time). We applied dimensionality reduction (t-SNE) and unsupervised clustering (K-Means++) to identify distinct developmental phenotypes and tracked individual transitions between these profiles over time. Our analysis reveals three distinct profiles: low, medium, and high performance. Crucially, longitudinal tracking highlights a high stability in the low-performance cluster (>90% retention in early years), suggesting that early deficits tend to persist without intervention. Conversely, higher-performance clusters show greater variability, potentially reflecting engagement factors. This study validates the use of unsupervised learning on touchscreen data to uncover heterogeneous developmental paths. The identified profiles serve as scalable, data-driven proxies for cognitive growth, offering a foundation for early screening tools and personalized pediatric interventions.
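The analysis pipeline is standard enough to sketch with scikit-learn: a t-SNE embedding followed by K-Means++ clustering into three profiles. The feature matrix below is a random stand-in for the tablet-derived task features; the perplexity and cluster count are assumptions consistent with the summary.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for per-child features (six cognitive-motor tasks, a few
# summary statistics each); real features come from tablet interaction logs.
X = rng.normal(size=(300, 12))

# 2-D embedding followed by K-Means++ clustering into three clusters,
# mirroring the paper's low/medium/high performance phenotypes.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
profiles = KMeans(n_clusters=3, init="k-means++", n_init=10,
                  random_state=0).fit_predict(emb)
print(np.bincount(profiles))  # cluster sizes
```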
[502] On Neural Scaling Laws for Weather Emulation through Continual Training
Shashank Subramanian, Alexander Kiefer, Arnur Nigmetov, Amir Gholami, Dmitriy Morozov, Michael W. Mahoney
Main category: cs.LG
TL;DR: Swin Transformer scaling laws for weather forecasting models show predictable performance trends with compute/data scaling, outperforming standard training schedules and enabling better multi-step predictions.
Details
Motivation: To study neural scaling laws in Scientific Machine Learning, specifically for weather forecasting, to understand how model performance scales with compute, data, and model size, and to identify efficient training regimes.
Method: Uses minimal Swin Transformer architecture with continual training (constant learning rates + periodic cooldowns) to analyze scaling behavior. Systematically explores model/dataset sizes under various compute budgets to construct IsoFLOP curves and identify compute-optimal regimes (see the sketch after the abstract).
Result: Models follow predictable scaling trends and outperform standard cosine learning rate schedules. Cooldown phases improve downstream performance for multi-step rollouts and sharper predictions. Compute-optimal training regimes identified, with scaling trends extrapolated to larger scales.
Conclusion: Neural scaling laws provide important diagnostics for efficient resource allocation in scientific ML, with predictable performance trends that can guide model development and training strategies for weather forecasting and similar domains.
Abstract: Neural scaling laws, which in some domains can predict the performance of large neural networks as a function of model, data, and compute scale, are the cornerstone of building foundation models in Natural Language Processing and Computer Vision. We study neural scaling in Scientific Machine Learning, focusing on models for weather forecasting. To analyze scaling behavior in as simple a setting as possible, we adopt a minimal, scalable, general-purpose Swin Transformer architecture, and we use continual training with constant learning rates and periodic cooldowns as an efficient training strategy. We show that models trained in this minimalist way follow predictable scaling trends and even outperform standard cosine learning rate schedules. Cooldown phases can be re-purposed to improve downstream performance, e.g., enabling accurate multi-step rollouts over longer forecast horizons as well as sharper predictions through spectral loss adjustments. We also systematically explore a wide range of model and dataset sizes under various compute budgets to construct IsoFLOP curves, and we identify compute-optimal training regimes. Extrapolating these trends to larger scales highlights potential performance limits, demonstrating that neural scaling can serve as an important diagnostic for efficient resource allocation. We open-source our code for reproducibility.
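The continual-training schedule (a constant learning rate punctuated by periodic cooldowns) can be sketched as a plain step-to-learning-rate function; the period, cooldown length, and rates below are illustrative, not the paper's values.

```python
def lr_with_cooldowns(step, base_lr=3e-4, period=10_000, cooldown=1_000,
                      floor=3e-5):
    """Constant learning rate with periodic linear cooldowns (sketch).

    Training proceeds at base_lr; in the last `cooldown` steps of each
    `period` the rate decays linearly toward `floor`, after which a
    checkpoint can be taken and training resumes at base_lr.
    """
    phase = step % period
    if phase < period - cooldown:
        return base_lr
    frac = (phase - (period - cooldown)) / cooldown
    return base_lr + frac * (floor - base_lr)

print([round(lr_with_cooldowns(s), 6) for s in (0, 8_999, 9_500, 9_999)])
```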
[503] A Unified Memory Perspective for Probabilistic Trustworthy AI
Xueji Zhao, Likai Pei, Jianbo Liu, Kai Ni, Ningyuan Cao
Main category: cs.LG
TL;DR: Paper presents unified framework treating deterministic memory access as special case of stochastic sampling, revealing entropy limitations in probabilistic AI systems and proposing criteria for evaluating memory architectures.
Details
Motivation: Probabilistic computation is increasingly important for trustworthy AI (robustness, interpretability, security, privacy), but current systems face performance limits as stochastic sampling shifts bottlenecks from arithmetic units to memory systems that must deliver both data and randomness.
Method: Proposes a unified data-access perspective where deterministic access is treated as the limiting case of stochastic sampling, enabling both modes to be analyzed within a common framework. Defines memory-level evaluation criteria: unified operation, distribution programmability, efficiency, robustness to hardware non-idealities, and parallel compatibility.
Result: Analysis reveals that increasing stochastic demand reduces effective data-access efficiency and can drive systems into entropy-limited operation. Conventional architectures have limitations, while emerging probabilistic compute-in-memory approaches that integrate sampling with memory access show promise for scalable trustworthy AI hardware.
Conclusion: A unified perspective on data access and stochastic sampling provides insights into memory system design for probabilistic AI, with compute-in-memory approaches offering pathways toward scalable hardware for trustworthy AI systems.
Abstract: Trustworthy artificial intelligence increasingly relies on probabilistic computation to achieve robustness, interpretability, security and privacy. In practical systems, such workloads interleave deterministic data access with repeated stochastic sampling across models, data paths and system functions, shifting performance bottlenecks from arithmetic units to memory systems that must deliver both data and randomness. Here we present a unified data-access perspective in which deterministic access is treated as a limiting case of stochastic sampling, enabling both modes to be analyzed within a common framework. This view reveals that increasing stochastic demand reduces effective data-access efficiency and can drive systems into entropy-limited operation. Based on this insight, we define memory-level evaluation criteria, including unified operation, distribution programmability, efficiency, robustness to hardware non-idealities and parallel compatibility. Using these criteria, we analyze limitations of conventional architectures and examine emerging probabilistic compute-in-memory approaches that integrate sampling with memory access, outlining pathways toward scalable hardware for trustworthy AI.
[504] Neural Network Conversion of Machine Learning Pipelines
Man-Ling Sung, Jan Silovsky, Man-Hung Siu, Herbert Gish, Chinnu Pittapally
Main category: cs.LG
TL;DR: Transfer learning from random forest teacher to neural network student for unified ML pipeline optimization
Details
Motivation: To enable joint optimization of ML pipeline components and create a single unified inference engine by transferring knowledge from traditional ML models (random forest) to neural networks.
Method: Student-teacher learning approach where random forest classifiers serve as teachers to neural network students, experimenting with various NN topologies on 100 OpenML tasks and using random forest for hyperparameter selection (see the sketch after the abstract).
Result: Student neural networks can successfully mimic random forest teachers for majority of tasks when appropriate hyperparameters are selected
Conclusion: Knowledge distillation from traditional ML models to neural networks is feasible and enables unified optimization of ML pipelines
Abstract: Transfer learning and knowledge distillation have recently gained a lot of attention in the deep learning community. One transfer approach, student-teacher learning, has been shown to successfully create "small" student neural networks that mimic the performance of much bigger and more complex "teacher" networks. In this paper, we investigate an extension to this approach and transfer from a non-neural-based machine learning pipeline as teacher to a neural network (NN) student, which would allow for joint optimization of the various pipeline components and a single unified inference engine for multiple ML tasks. In particular, we explore replacing the random forest classifier by transfer learning to a student NN. We experimented with various NN topologies on 100 OpenML tasks in which random forest has been one of the best solutions. Our results show that for the majority of the tasks, the student NN can indeed mimic the teacher if one can select the right NN hyper-parameters. We also investigated the use of random forest for selecting the right NN hyper-parameters.
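A minimal scikit-learn sketch of the student-teacher transfer: a random forest teacher provides soft targets (class probabilities) that an MLP student regresses onto. The topology and the choice of class-1 probability as the target are assumptions; the paper explores many NN topologies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Teacher: a random forest trained on hard labels.
teacher = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Student: an MLP regressed onto the teacher's class-1 probabilities
# (soft targets), one simple way to mimic the teacher's function.
soft_targets = teacher.predict_proba(X)[:, 1]
student = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                       random_state=0).fit(X, soft_targets)

agreement = ((student.predict(X) > 0.5) == teacher.predict(X)).mean()
print(f"student/teacher agreement: {agreement:.3f}")
```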
[505] MANDERA: Malicious Node Detection in Federated Learning via Ranking
Wanchuang Zhu, Benjamin Zi Hao Zhao, Simon Luo, Tongliang Liu, Ke Deng
Main category: cs.LG
TL;DR: MANDERA: A Byzantine attack detection method for federated learning that transforms gradients into ranking matrices to separate malicious gradients from benign ones without prior knowledge of attack numbers.
Details
Motivation: Byzantine attacks pose a significant threat to federated learning deployment. Current detection methods struggle because gradients are high-dimensional with unique distributions per dimension, and benign/malicious gradients are mixed, preventing direct application of two-sample tests.
Method: Proposes MANDERA which transforms original gradient space into ranking matrices, making scales identical across dimensions. This transformation enables easy separation of high-dimensional benign and malicious gradients. The method requires no prior knowledge about the number of attacked nodes (see the sketch after the abstract).
Result: MANDERA effectively detects malicious gradients under four Byzantine attack types (Gaussian, Zero Gradient, Sign Flipping, Shifted Mean) on both IID and Non-IID datasets, outperforming state-of-the-art defense methods.
Conclusion: MANDERA provides a theoretically guaranteed, efficient solution for Byzantine attack detection in federated learning that works without prior knowledge of attack numbers and handles both IID and Non-IID data distributions.
Abstract: Byzantine attacks hinder the deployment of federated learning algorithms. Although we know that the benign gradients and Byzantine-attacked gradients are distributed differently, detecting the malicious gradients is challenging because (1) the gradients are high-dimensional and each dimension has its own distribution, and (2) the benign and attacked gradients are always mixed (two-sample test methods cannot apply directly). To address the above, for the first time, we propose MANDERA, which is theoretically guaranteed to efficiently detect all malicious gradients under Byzantine attacks with no prior knowledge or history about the number of attacked nodes. More specifically, we transform the original gradient space into a ranking matrix. By such an operation, the scales of different dimensions of the gradients in the ranking space become identical. The high-dimensional benign gradients and the malicious gradients can be easily separated. The effectiveness of MANDERA is further confirmed by experimentation on four Byzantine attack implementations (Gaussian, Zero Gradient, Sign Flipping, Shifted Mean), comparing with state-of-the-art defenses. The experiments cover both IID and Non-IID datasets.
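The core transformation is easy to sketch: rank each gradient dimension across nodes so every dimension lives on the same 1..n scale, then look for nodes whose rank statistics deviate. The shifted-mean attack and the simple deviation rule below are illustrative assumptions; MANDERA's actual detection rule is more principled.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
n_nodes, dim = 20, 1000

# Benign gradients share a distribution; scales differ wildly per
# dimension. The last 4 nodes mount a shifted-mean style attack.
scale = rng.lognormal(size=dim)
grads = rng.normal(size=(n_nodes, dim)) * scale
grads[-4:] += 3.0 * scale

# Column-wise ranking puts every dimension on the same 1..n_nodes scale,
# which is the core MANDERA transformation.
ranks = rankdata(grads, axis=0)

# Toy detection rule on per-node mean rank: nodes whose mean rank
# deviates most from the median are flagged as anomalous.
mean_rank = ranks.mean(axis=1)
dev = np.abs(mean_rank - np.median(mean_rank))
print(np.argsort(dev)[-4:])  # indices of the flagged nodes
```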
[506] On Building Myopic MPC Policies using Supervised Learning
Christopher A. Orrico, Bokan Yang, Dinesh Krishnamoorthy
Main category: cs.LG
TL;DR: Using supervised learning to approximate the optimal value function offline, enabling short-horizon MPC with reduced online computation while maintaining performance.
Details
Motivation: Current approximate explicit MPC methods lose performance guarantees when replacing online optimization with neural networks. Need to reduce online computation burden while preserving control performance.
Method: Learn optimal value function offline using supervised learning on state-value pairs, then use as cost-to-go in myopic MPC with short prediction horizon. Uses sensitivity-based data augmentation to reduce training data generation cost (see the sketch after the abstract).
Result: Significantly reduces online computation burden without affecting controller performance, maintaining performance guarantees through short-horizon MPC optimization.
Conclusion: Value function learning approach enables efficient MPC with performance guarantees, differing from policy learning methods by preserving online optimization structure.
Abstract: The application of supervised learning techniques in combination with model predictive control (MPC) has recently generated significant interest, particularly in the area of approximate explicit MPC, where function approximators like deep neural networks are used to learn the MPC policy via optimal state-action pairs generated offline. While the aim of approximate explicit MPC is to closely replicate the MPC policy, substituting online optimization with a trained neural network, the performance guarantees that come with solving the online optimization problem are typically lost. This paper considers an alternative strategy, where supervised learning is used to learn the optimal value function offline instead of learning the optimal policy. This can then be used as the cost-to-go function in a myopic MPC with a very short prediction horizon, such that the online computation burden reduces significantly without affecting the controller performance. This approach differs from existing work on value function approximations in the sense that it learns the cost-to-go function by using offline-collected state-value pairs, rather than closed-loop performance data. The cost of generating the state-value pairs used for training is addressed using a sensitivity-based data augmentation scheme.
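A toy sketch of the idea: fit a value-function regressor offline on state-value pairs, then run a one-step (myopic) MPC that minimizes stage cost plus the learned cost-to-go. The scalar system, quadratic stand-in value function, and grid search over candidate actions are all assumptions for illustration; the paper generates state-value pairs by solving long-horizon optimal control problems.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Toy 1-D system x' = 0.9 x + u with stage cost x^2 + 0.1 u^2.
def step(x, u): return 0.9 * x + u
def stage_cost(x, u): return x ** 2 + 0.1 * u ** 2

# Offline: fit V(x) on state-value pairs; a quadratic stand-in plays
# the role of the optimal value function for this toy problem.
xs = rng.uniform(-5, 5, size=(2000, 1))
vs = 1.5 * xs[:, 0] ** 2
V = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=1000,
                 random_state=0).fit(xs, vs)

def myopic_mpc(x, candidates=np.linspace(-2, 2, 81)):
    """One-step MPC: minimize stage cost plus learned cost-to-go."""
    nxt = step(x, candidates)
    total = stage_cost(x, candidates) + V.predict(nxt.reshape(-1, 1))
    return candidates[np.argmin(total)]

x = 4.0
for _ in range(5):
    x = step(x, myopic_mpc(x))
    print(round(x, 3))  # state is driven toward the origin
```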
[507] Branch Scaling Manifests as Implicit Architectural Regularization for Improving Generalization in Overparameterized ResNets
Zixiong Yu, Guhan Chen, Jianfa Lai, Bohan Li, Songtao Tian
Main category: cs.LG
TL;DR: Theoretical analysis shows that scaling factors in residual networks affect generalization: constant scaling leads to unlearnability with depth, while depth-wise decay with early stopping enables optimal generalization rates.
Details
Motivation: While scaling factors in residual branches are widely used to boost performance, prior work focuses on optimization effects. This paper investigates their role from a generalization theory perspective to understand how different scaling strategies affect learnability in deep networks.
Method: Theoretical analysis of wide residual networks using kernel regression approximation. Shows that constant scaling factors make networks asymptotically unlearnable as depth increases, while depth-wise decay with early stopping enables minimax-optimal generalization rates. Validated with experiments on synthetic data, MNIST, and CIFAR-100 (see the sketch after the abstract).
Result: Theoretical results demonstrate fundamental trade-offs between scaling strategies and generalization. Constant scaling leads to unlearnability with depth, while properly decaying scaling with early stopping achieves optimal generalization rates. Experimental validation supports theoretical findings.
Conclusion: Scaling factors in residual networks significantly impact generalization, not just optimization. Proper depth-wise decay combined with early stopping is crucial for achieving optimal generalization in over-parameterized ResNets, providing theoretical guidance for architecture design.
Abstract: Scaling factors in residual branches have emerged as a prevalent method for boosting neural network performance, especially in normalization-free architectures. While prior work has primarily examined scaling effects from an optimization perspective, this paper investigates their role in residual architectures through the lens of generalization theory. Specifically, we establish that wide residual networks (ResNets) with constant scaling factors become asymptotically unlearnable as depth increases. In contrast, when the scaling factor exhibits rapid depth-wise decay combined with early stopping, over-parameterized ResNets achieve minimax-optimal generalization rates. To establish this, we demonstrate that the generalization capability of wide ResNets can be approximated by the kernel regression associated with a specific kernel. Our theoretical findings are validated through experiments on synthetic data and real-world classification tasks, including MNIST and CIFAR-100.
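A minimal PyTorch sketch of branch scaling: each residual branch is multiplied by a factor alpha, and depth-wise decay sets alpha_l = L^{-gamma}. The MLP branch and gamma value are illustrative assumptions; the paper's analysis concerns wide ResNets in the kernel regime.

```python
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Residual block with an explicit branch scaling factor (sketch)."""

    def __init__(self, width, alpha):
        super().__init__()
        self.alpha = alpha
        self.branch = nn.Sequential(nn.Linear(width, width), nn.ReLU(),
                                    nn.Linear(width, width))

    def forward(self, x):
        # Constant alpha corresponds to the regime shown to become
        # unlearnable with depth; decaying alpha to the learnable one.
        return x + self.alpha * self.branch(x)

depth, width, gamma = 16, 128, 1.0
# Depth-wise decaying scale, e.g. alpha = depth^{-gamma} for every block.
blocks = nn.Sequential(*[ScaledResidualBlock(width, depth ** -gamma)
                         for _ in range(depth)])
print(blocks(torch.randn(2, width)).shape)
```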
[508] Revisit, Extend, and Enhance Hessian-Free Influence Functions
Ziao Yang, Han Yue, Jian Chen, Hongfu Liu
Main category: cs.LG
TL;DR: TracIn: A simple yet effective influence function approximation using identity matrix instead of Hessian inverse, with extensions to fairness/robustness and ensemble improvements.
Details
Motivation: Influence functions are valuable for model interpretation and various applications, but are difficult to apply to deep models due to non-convex loss functions and large parameter spaces. Existing methods for approximating Hessian inversion are complex, while TracIn offers a surprisingly effective simple alternative.
Method: Revisits TracIn approximation method that replaces the inverse Hessian matrix with an identity matrix. Provides theoretical insights into why this simple approach works well. Extends TracIn to fairness and robustness applications beyond utility measurement. Enhances TracIn through ensemble strategies (see the sketch after the abstract).
Result: Validated effectiveness through experiments on synthetic data and extensive evaluations including noisy label detection, sample selection for large language model fine-tuning, and defense against adversarial attacks.
Conclusion: TracIn’s simple identity matrix approximation is surprisingly effective for influence estimation in deep models, with practical applications in noisy label detection, LLM fine-tuning sample selection, and adversarial defense.
Abstract: Influence functions serve as crucial tools for assessing sample influence in model interpretation, subset training set selection, noisy label detection, and more. By employing the first-order Taylor expansion, influence functions can estimate sample influence without the need for expensive model retraining. However, applying influence functions directly to deep models presents challenges, primarily due to the non-convex nature of the loss function and the large size of model parameters. This difficulty not only makes computing the inverse of the Hessian matrix costly but also renders it non-existent in some cases. Various approaches, including matrix decomposition, have been explored to expedite and approximate the inversion of the Hessian matrix, with the aim of making influence functions applicable to deep models. In this paper, we revisit a specific, albeit naive, yet effective approximation method known as TracIn. This method substitutes the inverse of the Hessian matrix with an identity matrix. We provide deeper insights into why this simple approximation method performs well. Furthermore, we extend its applications beyond measuring model utility to include considerations of fairness and robustness. Finally, we enhance TracIn through an ensemble strategy. To validate its effectiveness, we conduct experiments on synthetic data and extensive evaluations on noisy label detection, sample selection for large language model fine-tuning, and defense against adversarial attacks.
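TracIn's identity-Hessian approximation reduces influence to learning-rate-weighted gradient dot products accumulated over saved checkpoints, which a few lines of numpy can express. The shapes and values below are toy stand-ins for real per-example gradients.

```python
import numpy as np

def tracin_influence(train_grads, test_grad, lrs):
    """TracIn-style influence (sketch): the Hessian inverse is replaced
    by the identity, so influence reduces to learning-rate-weighted
    gradient dot products summed over checkpoints.

    train_grads: (checkpoints, n_train, d) per-example loss gradients
    test_grad:   (checkpoints, d) gradient of the test-point loss
    lrs:         (checkpoints,) learning rate at each checkpoint
    """
    # Dot product over the parameter dimension per checkpoint,
    # then a learning-rate-weighted sum over checkpoints.
    per_ckpt = np.einsum("cnd,cd->cn", train_grads, test_grad)
    return np.einsum("c,cn->n", np.asarray(lrs), per_ckpt)

rng = np.random.default_rng(0)
g_train = rng.normal(size=(3, 5, 10))  # 3 checkpoints, 5 samples, d=10
g_test = rng.normal(size=(3, 10))
print(tracin_influence(g_train, g_test, [0.1, 0.05, 0.01]))
```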
[509] SMILES-Mamba: Chemical Mamba Foundation Models for Drug ADMET Prediction
Bohao Xu, Yingzhou Lu, Chenhao Li, Ling Yue, Xiao Wang, Tianfan Fu, Minjie Shen, Lulu Chen
Main category: cs.LG
TL;DR: SMILES-Mamba: A two-stage self-supervised learning model for ADMET property prediction in drug discovery using SMILES strings
Details
Motivation: Predicting ADMET properties is crucial but resource-intensive; there is a need to reduce dependence on large labeled datasets while improving accuracy.
Method: Two-stage approach: 1) self-supervised pretraining on a large corpus of unlabeled SMILES strings, 2) fine-tuning on smaller labeled ADMET datasets (see the sketch after the abstract).
Result: Competitive performance across 22 ADMET datasets, achieving highest score in 14 tasks
Conclusion: Self-supervised learning improves molecular property prediction accuracy and reduces labeled data dependence, promising for drug discovery
Abstract: In drug discovery, predicting the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of small-molecule drugs is critical for ensuring safety and efficacy. However, the process of accurately predicting these properties is often resource-intensive and requires extensive experimental data. To address this challenge, we propose SMILES-Mamba, a two-stage model that leverages both unlabeled and labeled data through a combination of self-supervised pretraining and fine-tuning strategies. The model first pre-trains on a large corpus of unlabeled SMILES strings to capture the underlying chemical structure and relationships, before being fine-tuned on smaller, labeled datasets specific to ADMET tasks. Our results demonstrate that SMILES-Mamba exhibits competitive performance across 22 ADMET datasets, achieving the highest score in 14 tasks, highlighting the potential of self-supervised learning in improving molecular property prediction. This approach not only enhances prediction accuracy but also reduces the dependence on large, labeled datasets, offering a promising direction for future research in drug discovery.
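The two-stage recipe can be sketched in PyTorch. A GRU stands in for the Mamba backbone (an assumption made purely for brevity); stage 1 computes a next-token prediction loss on unlabeled SMILES, and stage 2 fits a small task head on pooled features for one labeled ADMET target.

```python
import torch
import torch.nn as nn

# Stand-in backbone: a GRU encoder over SMILES tokens; the paper uses
# Mamba (state-space) layers, substituted here for brevity.
class Backbone(nn.Module):
    def __init__(self, vocab=64, d=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.rnn = nn.GRU(d, d, batch_first=True)

    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))
        return h

backbone = Backbone()

# Stage 1 (pretraining): next-token prediction on unlabeled SMILES.
lm_head = nn.Linear(128, 64)
tokens = torch.randint(0, 64, (8, 40))          # toy unlabeled batch
logits = lm_head(backbone(tokens[:, :-1]))
pretrain_loss = nn.functional.cross_entropy(
    logits.reshape(-1, 64), tokens[:, 1:].reshape(-1))

# Stage 2 (fine-tuning): a small head on pooled features for one
# labeled ADMET task; in practice one head per task/dataset.
task_head = nn.Linear(128, 1)
labels = torch.rand(8, 1)                        # toy ADMET labels
pooled = backbone(tokens).mean(dim=1)
finetune_loss = nn.functional.mse_loss(task_head(pooled), labels)
print(pretrain_loss.item(), finetune_loss.item())
```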
[510] Adaptive Online Mirror Descent for Tchebycheff Scalarization in Multi-Objective Learning
Meitong Liu, Xiaoyuan Zhang, Chulin Xie, Kate Donahue, Han Zhao
Main category: cs.LG
TL;DR: Proposes (Ada)OMD-TCH, an adaptive online mirror descent algorithm for multi-objective learning using Tchebycheff scalarization to address training oscillation and stagnation issues
Details
Motivation: Multi-objective learning aims to balance conflicting objectives, but existing preference-guided methods often require additional optimization objectives or constraints. The classic Tchebycheff scalarization naturally allows user-specified trade-offs but suffers from training oscillation and stagnation due to its minimax formulation.
Method: Proposes (Ada)OMD-TCH, an adaptive online mirror descent algorithm for Tchebycheff scalarization. Key innovation is an adaptive online-to-batch conversion that improves solution optimality while maintaining theoretical convergence guarantees (see the sketch after the abstract).
Result: Achieves convergence rate of O(√(log m/T)), providing tighter dependency on number of objectives m compared to existing work. Empirically demonstrates effectiveness on synthetic problems and federated learning tasks, producing preference-guided, specific, diverse, and fair solutions.
Conclusion: (Ada)OMD-TCH effectively addresses training oscillation in Tchebycheff scalarization, provides improved theoretical convergence rates, and yields practical benefits for multi-objective learning tasks including federated learning.
Abstract: Multi-objective learning (MOL) aims to learn under multiple potentially conflicting objectives and strike a proper balance. While recent preference-guided MOL methods often rely on additional optimization objectives or constraints, we consider the classic Tchebycheff scalarization (TCH) that naturally allows for locating solutions with user-specified trade-offs. Due to its minimax formulation, directly optimizing TCH often leads to training oscillation and stagnation. In light of this limitation, we propose an adaptive online mirror descent algorithm for TCH, called (Ada)OMD-TCH. One of our main ingredients is an adaptive online-to-batch conversion that significantly improves solution optimality over traditional conversion in practice while maintaining the same theoretical convergence guarantees. We show that (Ada)OMD-TCH achieves a convergence rate of $\mathcal O(\sqrt{\log m/T})$, where $m$ is the number of objectives and $T$ is the number of rounds, providing a tighter dependency on $m$ in the offline setting compared to existing work. Empirically, we demonstrate on both synthetic problems and federated learning tasks that (Ada)OMD-TCH effectively smooths the training process and yields preference-guided, specific, diverse, and fair solutions.
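A minimal sketch of the minimax structure behind OMD-TCH: entropic mirror (exponentiated-gradient) ascent on simplex weights, gradient descent on the decision variable, and a plain online-to-batch average at the end, which AdaOMD-TCH replaces with its adaptive conversion. The two objectives and step sizes are toy assumptions.

```python
import numpy as np

# Two toy objectives in Tchebycheff form: min_x max_i w_i (f_i(x) - z_i),
# with z the ideal point and w the user's preference weights.
f = [lambda x: (x - 1.0) ** 2, lambda x: (x + 1.0) ** 2]
grad_f = [lambda x: 2 * (x - 1.0), lambda x: 2 * (x + 1.0)]
w = np.array([0.5, 0.5])
z = np.zeros(2)

x, p = 3.0, np.ones(2) / 2          # p: mirror-descent weights on the simplex
eta_x, eta_p = 0.05, 0.5
avg_x, avg_n = 0.0, 0

for t in range(200):
    vals = w * (np.array([fi(x) for fi in f]) - z)
    # Exponentiated-gradient (entropic mirror) ascent on the weights.
    p *= np.exp(eta_p * vals)
    p /= p.sum()
    # Descent step on x under the current weighting.
    x -= eta_x * sum(p[i] * w[i] * grad_f[i](x) for i in range(2))
    avg_x, avg_n = avg_x + x, avg_n + 1

# Plain online-to-batch average; the paper's adaptive conversion differs.
print(avg_x / avg_n)  # near 0, the balanced trade-off for equal weights
```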
[511] The Limits of Inference Scaling Through Resampling
Benedikt Stroebl, Sayash Kapoor, Arvind Narayanan
Main category: cs.LG
TL;DR: Inference scaling through resampling with imperfect verifiers has fundamental limitations due to false positives that cannot be reduced through resampling, creating an upper bound on accuracy regardless of compute.
Details
Motivation: Recent research suggests inference scaling (resampling solutions until they pass verifiers like unit tests) could allow weaker models to match stronger ones, and enables training reasoning models via rejection sampling. However, this approach has fundamental limitations when verifiers are imperfect.
Method: Theoretical analysis of inference scaling limitations when verifiers produce false positives. Empirical evaluation on HumanEval and MBPP benchmarks showing correlation between model's single-sample accuracy and false positive rate. Examination of optimal sampling attempts and negative utility of false positives (see the sketch after the abstract).
Result: Resampling cannot decrease false positive probability, imposing an upper bound to resampling-based inference scaling accuracy regardless of compute budget. Strong correlation found between model’s single-sample accuracy and false positive rate. Optimal sampling attempts often fewer than 10 due to negative utility of false positives outweighing benefits, bending inference scaling curves downward.
Conclusion: Inference scaling of weaker models cannot match single-sample accuracy of sufficiently strong models due to fundamental limitations from imperfect verifiers. False positives impose accuracy ceilings and may have undesirable qualities like poor coding style adherence.
Abstract: Recent research has generated hope that inference scaling, such as resampling solutions until they pass verifiers like unit tests, could allow weaker models to match stronger ones. Beyond inference, this approach also enables training reasoning models, where data is curated using rejection sampling against a verifier. However, we show that this approach is fundamentally limited when verifiers are imperfect and have a non-zero probability of producing false positives. Resampling cannot decrease this probability, so it imposes an upper bound to the accuracy of resampling-based inference scaling, regardless of compute budget. Our analysis shows that there is a strong correlation between the model’s single-sample accuracy and its false positive rate on HumanEval and MBPP, whose unit tests have limited coverage. Therefore, no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model. Empirical results show that optimal sampling attempts are often fewer than 10, as the negative utility of false positives outweighs benefits, bending inference scaling curves downward. Finally, false positives may have other undesirable qualities, like poor adherence to coding style conventions.
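The accuracy ceiling has a simple closed form consistent with the paper's argument: conditioning on the first verifier acceptance, the probability that the accepted sample is correct depends only on the model's single-sample accuracy and the verifier's error rates, not on the compute budget. The formula below is an illustrative derivation under an i.i.d.-sampling assumption.

```python
def resampling_accuracy_limit(p_correct, fpr, tpr=1.0):
    """Probability that the first verifier-accepted sample is correct.

    Each sample is correct w.p. p_correct; the verifier accepts correct
    solutions w.p. tpr and wrong ones w.p. fpr. Resampling until the
    first acceptance converges to this value, an upper bound that no
    amount of extra compute can exceed.
    """
    return (p_correct * tpr) / (p_correct * tpr + (1 - p_correct) * fpr)

for p in (0.2, 0.5, 0.8):
    print(p, round(resampling_accuracy_limit(p, fpr=0.1), 3))
```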
[512] Scalable Multi-Objective Reinforcement Learning with Fairness Guarantees using Lorenz Dominance
Dimitris Michailidis, Willem Röpke, Diederik M. Roijers, Sennay Ghebreab, Fernando P. Santos
Main category: cs.LG
TL;DR: A fairness-aware multi-objective reinforcement learning method using Lorenz dominance for equitable reward distributions, demonstrated on large-scale transport planning problems.
Details
Motivation: Multi-objective RL becomes computationally complex with many objectives, and fairness considerations are important when objectives involve preferences of agents/groups. Current methods lack principled fairness integration and scalability.
Method: Proposes using Lorenz dominance to identify policies with equitable reward distributions, introduces lambda-Lorenz dominance for flexible fairness preferences, and develops a scalable algorithm for many-objective problems (see the sketch after the abstract).
Result: Method outperforms common multi-objective approaches, particularly in high-dimensional objective spaces, and demonstrates improved scalability on large-scale transport planning environments in Xi’an and Amsterdam.
Conclusion: The approach successfully incorporates fairness into MORL while improving scalability, enabling discovery of fair policies in complex real-world problems with many objectives.
Abstract: Multi-Objective Reinforcement Learning (MORL) aims to learn a set of policies that optimize trade-offs between multiple, often conflicting objectives. MORL is computationally more complex than single-objective RL, particularly as the number of objectives increases. Additionally, when objectives involve the preferences of agents or groups, incorporating fairness becomes both important and socially desirable. This paper introduces a principled algorithm that incorporates fairness into MORL while improving scalability to many-objective problems. We propose using Lorenz dominance to identify policies with equitable reward distributions and introduce lambda-Lorenz dominance to enable flexible fairness preferences. We release a new, large-scale real-world transport planning environment and demonstrate that our method encourages the discovery of fair policies, showing improved scalability in two large cities (Xi’an and Amsterdam). Our methods outperform common multi-objective approaches, particularly in high-dimensional objective spaces.
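Lorenz dominance itself is a two-line check: compare cumulative sums of the ascending-sorted reward vectors. A minimal sketch:

```python
import numpy as np

def lorenz_dominates(r1, r2):
    """True if reward vector r1 Lorenz-dominates r2 (sketch).

    Lorenz dominance compares cumulative sums of the rewards sorted
    in ascending order; it prefers distributions that are both larger
    and more evenly spread across objectives.
    """
    c1 = np.cumsum(np.sort(r1))
    c2 = np.cumsum(np.sort(r2))
    return bool(np.all(c1 >= c2) and np.any(c1 > c2))

print(lorenz_dominates([3, 3, 3], [1, 3, 5]))  # True: same total, fairer
print(lorenz_dominates([1, 3, 5], [3, 3, 3]))  # False
```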
[513] Density Ratio-based Proxy Causal Learning Without Density Ratios
Bariscan Bozkurt, Ben Deaner, Dimitri Meunier, Liyuan Xu, Arthur Gretton
Main category: cs.LG
TL;DR: Proposes a practical implementation for proxy causal learning that bypasses challenging density ratio estimation, using kernel ridge regression to estimate causal effects with hidden confounding.
Details
Motivation: Existing proxy causal learning methods face challenges with density ratio estimation in high dimensions, limiting practical adoption. The authors aim to develop a more practical approach that avoids this difficulty while handling continuous and high-dimensional treatments.
Method: Uses kernel ridge regression to derive estimators for proxy causal learning, bypassing explicit density ratio estimation. Provides closed-form solutions for dose-response and conditional dose-response curves with consistency guarantees (see the sketch after the abstract).
Result: Empirical results show superior or comparable performance to existing frameworks on both synthetic and real-world datasets, demonstrating practical effectiveness.
Conclusion: The proposed method offers a practical and effective implementation for proxy causal learning that avoids challenging density ratio estimation, making the approach more accessible for real-world applications with continuous and high-dimensional treatments.
Abstract: We address the setting of Proxy Causal Learning (PCL), which has the goal of estimating causal effects from observed data in the presence of hidden confounding. Proxy methods accomplish this task using two proxy variables related to the latent confounder: a treatment proxy (related to the treatment) and an outcome proxy (related to the outcome). Two approaches have been proposed to perform causal effect estimation given proxy variables; however only one of these has found mainstream acceptance, since the other was understood to require density ratio estimation - a challenging task in high dimensions. In the present work, we propose a practical and effective implementation of the second approach, which bypasses explicit density ratio estimation and is suitable for continuous and high-dimensional treatments. We employ kernel ridge regression to derive estimators, resulting in simple closed-form solutions for dose-response and conditional dose-response curves, along with consistency guarantees. Our methods empirically demonstrate superior or comparable performance to existing frameworks on synthetic and real-world datasets.
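The kernel ridge regression building block has the closed form alpha = (K + n lambda I)^{-1} y. The generic RBF sketch below shows only that closed form; the paper's two-stage proxy estimators compose such regressions over proxy variables and are not reproduced here.

```python
import numpy as np

def krr_fit_predict(X, y, X_test, lam=1e-2, gamma=1.0):
    """Kernel ridge regression with an RBF kernel (sketch).

    Solves alpha = (K + n*lam*I)^{-1} y in closed form; the PCL
    estimators in the paper are built from nested regressions of
    this kind, which this generic sketch does not reproduce.
    """
    sq = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq(X, X))
    alpha = np.linalg.solve(K + len(X) * lam * np.eye(len(X)), y)
    return np.exp(-gamma * sq(X_test, X)) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=100)
print(krr_fit_predict(X, y, np.array([[0.0], [1.0]])))
```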
[514] Revealing Human Attention Patterns from Gameplay Analysis for Reinforcement Learning
Henrik Krauss, Takehisa Yairi
Main category: cs.LG
TL;DR: Proposes CTR attention networks to extract human internal attention patterns from gameplay data using RL techniques, validated against eye-tracking data and applied to improve RL agent learning.
Details
Motivation: To develop methods for revealing human internal attention patterns from behavioral data alone, without requiring eye-tracking equipment, and to understand differences between human and AI attention patterns in decision-making tasks.
Method: Uses contextualized, task-relevant (CTR) attention networks to generate attention maps from both human and RL agent gameplay in Atari environments. Validates human CTR maps by comparing them to agent attention maps and temporally integrated overt attention (TIOA) models based on human eye-tracking data.
Result: Human CTR maps are more sparse than agent maps and align better with TIOA maps. Human attention-guided RL agents achieve slightly improved and more stable learning compared to baselines, and significantly outperform TIOA-based agents.
Conclusion: The method successfully captures patterns of internal human attention from gameplay data, advances understanding of human-agent attention differences, and provides a new approach for extracting and validating internal attention patterns from behavioral data.
Abstract: This study introduces a novel method for revealing human internal attention patterns (decision-relevant attention) from gameplay data alone, leveraging offline attention techniques from reinforcement learning (RL). We propose contextualized, task-relevant (CTR) attention networks, which generate attention maps from both human and RL agent gameplay in Atari environments. To evaluate whether the human CTR maps reveal internal attention patterns, we validate our model by quantitative and qualitative comparison to the agent maps as well as to a temporally integrated overt attention (TIOA) model based on human eye-tracking data. Our results show that human CTR maps are more sparse than the agent ones and align better with the TIOA maps. Following a qualitative visual comparison we conclude that they likely capture patterns of internal attention. As a further application, we use these maps to guide RL agents, finding that human attention-guided agents achieve slightly improved and more stable learning compared to baselines, and significantly outperform TIOA-based agents. This work advances the understanding of human-agent attention differences and provides a new approach for extracting and validating internal attention patterns from behavioral data.
[515] Density Ratio-Free Doubly Robust Proxy Causal Learning
Bariscan Bozkurt, Houssam Zenati, Dimitri Meunier, Liyuan Xu, Arthur Gretton
Main category: cs.LG
TL;DR: Proposes kernel-based doubly robust estimators for proxy causal learning that combine outcome and treatment bridge approaches, handle continuous/high-dimensional variables without density ratio estimation or kernel smoothing over treatment.
Details
Motivation: Address limitations in proxy causal learning where confounders are unobserved but proxies are available. Existing methods have limitations: outcome bridge and treatment bridge approaches each have weaknesses, and prior doubly robust methods require both kernel smoothing and density ratio estimation.
Method: Uses kernel mean embeddings to create density-ratio free doubly robust estimators. Builds on recent density ratio-free method for treatment bridge-based PCL. Avoids indicator functions and kernel smoothing over treatment variable. Provides closed-form solutions with strong uniform consistency guarantees.
Result: Proposed estimators outperform existing methods on PCL benchmarks, including prior doubly robust methods that require both kernel smoothing and density ratio estimation. Especially effective for continuous or high-dimensional treatments.
Conclusion: First density-ratio free doubly robust estimators for proxy causal learning with closed-form solutions and strong theoretical guarantees. Method is particularly suitable for continuous/high-dimensional variables and advances the field of causal inference with proxy variables.
Abstract: We study the problem of causal function estimation in the Proxy Causal Learning (PCL) framework, where confounders are not observed but proxies for the confounders are available. Two main approaches have been proposed: outcome bridge-based and treatment bridge-based methods. In this work, we propose two kernel-based doubly robust estimators that combine the strengths of both approaches, and naturally handle continuous and high-dimensional variables. Our identification strategy builds on a recent density ratio-free method for treatment bridge-based PCL; furthermore, in contrast to previous approaches, it does not require indicator functions or kernel smoothing over the treatment variable. These properties make it especially well-suited for continuous or high-dimensional treatments. By using kernel mean embeddings, we propose the first density-ratio free doubly robust estimators for proxy causal learning, which have closed form solutions and strong uniform consistency guarantees. Our estimators outperform existing methods on PCL benchmarks, including a prior doubly robust method that requires both kernel smoothing and density ratio estimation.
[516] QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining
Kyle R. Chickering, Bangzheng Li, Muhao Chen
Main category: cs.LG
TL;DR: QLIP is a drop-in replacement for CLIP’s vision encoder that addresses its limitations through content-aware patchification using image quadtrees, improving MLLM performance without retraining.
Details
Motivation: CLIP vision encoder has limitations: fixed input resolution, poor separation of dissimilar image embeddings, and replacing it requires expensive retraining. The paper aims to create a seamless replacement that enhances visual understanding in existing MLLMs.
Method: Proposes QLIP with image quadtree structure replacing uniform grid patches with content-aware patchification. Addresses mesoscopic and interpolation biases in CLIP. Designed as drop-in replacement requiring minimal code changes (see the sketch after the abstract).
Result: Improves LLaVA v1.5 visual question answering accuracy across model sizes without retraining. Boosts detailed understanding on V-star benchmark by up to 13.6%. Enhances both coarse-grained and fine-grained visual understanding.
Conclusion: QLIP effectively addresses CLIP’s limitations and can be seamlessly integrated into existing MLLMs, providing improved visual understanding capabilities without computational overhead of retraining.
Abstract: Multimodal Large Language Models (MLLMs) encode images into visual tokens, aligning visual and textual signals within a shared latent space to facilitate cross-modal representation learning. The CLIP model is a widely adopted foundational vision language model whose vision encoder has played a critical role in the development of MLLMs such as LLaVA. However, the CLIP vision encoder suffers from notable limitations including being constrained to only handling fixed input resolutions and a failure to produce separated embeddings for dissimilar images. Replacing the vision encoder of an existing model typically incurs substantial computational costs because such a change often necessitates retraining the entire model pipeline. In this work, we identify two factors which underlie the limitations of the CLIP vision encoder: mesoscopic bias and interpolation bias. To address these issues, we propose QLIP, a drop-in replacement for CLIP that can be seamlessly integrated with existing MLLMs with only a few lines of code and can enhance both coarse-grained and fine-grained visual understanding, without re-training. QLIP is designed around an image quadtree which replaces the standard uniform grid patches with a novel content-aware patchification. Our experimental results demonstrate that QLIP improves the general visual question answering accuracy of the LLaVA v1.5 model series across various model sizes, without requiring retraining or fine-tuning of the full MLLM. Notably, QLIP boosts detailed understanding performance on the challenging V-star benchmark by up to 13.6 percent.
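The quadtree patchification idea can be sketched with a recursive variance test: smooth regions stay as one large patch, detailed regions split into quadrants. The variance threshold and minimum patch size are illustrative assumptions; QLIP's actual splitting criterion and token handling may differ.

```python
import numpy as np

def quadtree_patches(img, min_size=8, var_thresh=0.01):
    """Content-aware patchification via an image quadtree (sketch).

    Regions with low pixel variance are kept as single large patches;
    detailed regions are recursively split into four quadrants.
    """
    patches = []

    def split(y, x, s):
        tile = img[y:y + s, x:x + s]
        if s <= min_size or tile.var() < var_thresh:
            patches.append((y, x, s))   # (top, left, side length)
            return
        h = s // 2
        for dy, dx in ((0, 0), (0, h), (h, 0), (h, h)):
            split(y + dy, x + dx, h)

    split(0, 0, img.shape[0])
    return patches

rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[40:, 40:] = rng.random((24, 24))    # detail only in one corner
print(len(quadtree_patches(img)))        # far fewer than a uniform grid
```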
[517] Predicting Human Mobility during Extreme Events via LLM-Enhanced Cross-City Learning
Yinzhou Tang, Huandong Wang, Xiaochen Fan, Yong Li
Main category: cs.LG
TL;DR: X-MLM is a cross-extreme-event mobility language model framework that uses LLMs to predict human mobility patterns during extreme events by modeling mobility intentions and transferring knowledge between cities.
Details
Motivation: Existing human mobility prediction models fail in extreme scenarios due to pattern shifts, creating a need for models that can adapt to extreme weather and disasters for applications like early warning systems and resource allocation.
Method: Uses LLMs to model mobility intentions with a RAG-enhanced intention predictor, refines intentions with an LLM-based refiner, then maps intentions to exact locations using an intention-modulated location predictor, transferring knowledge between cities.
Result: Achieves 32.8% improvement in Acc@1 and 35.0% improvement in F1-score for predicting immobility compared to baselines.
Conclusion: X-MLM effectively addresses the challenge of predicting human mobility during extreme events by leveraging LLMs for intention modeling and cross-city knowledge transfer.
Abstract: The vulnerability of cities has increased with urbanization and climate change, making it more important to predict human mobility during extreme events (e.g., extreme weather) for downstream tasks including location-based early disaster warning and pre-allocating rescue resources. However, existing human mobility prediction models are mainly designed for normal scenarios, and fail to adapt to extreme scenarios due to the shift of human mobility patterns under extreme scenarios. To address this issue, we introduce X-MLM, a cross-eXtreme-event Mobility Language Model framework for extreme scenarios that can be integrated into existing deep mobility prediction methods by leveraging LLMs to model the mobility intention and transferring the common knowledge of how different extreme events affect mobility intentions between cities. This framework utilizes a RAG-Enhanced Intention Predictor to forecast the next intention, refines it with an LLM-based Intention Refiner, and then maps the intention to an exact location using an Intention-Modulated Location Predictor. Extensive experiments illustrate that X-MLM can achieve a 32.8% improvement in terms of Acc@1 and a 35.0% improvement in terms of the F1-score of predicting immobility compared to the baselines. The code is available at https://github.com/tsinghua-fib-lab/XMLM.
[518] Towards Interpretable Deep Neural Networks for Tabular Data
Khawla Elhadri, Jörg Schlötterer, Christin Seifert
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.08617 returned HTTP 429 (rate limited).
[519] CausalPre: Scalable and Effective Data Pre-Processing for Causal Fairness
Ying Zheng, Yangfan Jiang, Kian-Lee Tan
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.15199 returned HTTP 429 (rate limited).
[520] Benchmarking M-LTSF: Frequency and Noise-Based Evaluation of Multivariate Long Time Series Forecasting Models
Nick Janssen, Melanie Schaller, Bodo Rosenhahn
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.04900 returned HTTP 429 (rate limited).
[521] Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness
Tavish McDonald, Bo Lei, Stanislav Fort, Bhavya Kailkhura, Brian Bartoldson
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.06790 returned HTTP 429 (rate limited).
[522] Time-Correlated Video Bridge Matching
Viacheslav Vasilev, Arseny Ivanov, Nikita Gushchin, Maria Kovaleva, Alexander Korotin
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.12453 returned HTTP 429 (rate limited).
[523] Tensor Gaussian Processes: Efficient Solvers for Nonlinear PDEs
Qiwei Yuan, Zhitong Xu, Yinghao Chen, Yiming Xu, Houman Owhadi, Shandian Zhe
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.13772 returned HTTP 429 (rate limited).
[524] FusionLog: Cross-System Log-based Anomaly Detection via Fusion of General and Proprietary Knowledge
Xinlong Zhao, Tong Jia, Minghua He, Xixuan Yang, Ying Li
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.05878 returned HTTP 429 (rate limited).
[525] FIRM: Federated In-client Regularized Multi-objective Alignment for Large Language Models
Fatemeh Nourzad, Amirhossein Roknilamouki, Eylem Ekici, Jia Liu, Ness Shroff
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.16992 returned HTTP 429 (rate limited).
[526] Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning
Denis Huseljic, Marek Herde, Lukas Rauch, Paul Hahn, Bernhard Sick
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.22344 returned HTTP 429 (rate limited).
[527] Morphling: Fast, Fused, and Flexible GNN Training at Scale
Anubhab, Rupesh Nasre
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.01678 returned HTTP 429 (rate limited).
[528] Delays in Spiking Neural Networks: A State Space Model Approach
Sanja Karilanova, Subhrakanti Dey, Ayça Özçelikkale
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.01906 returned HTTP 429 (rate limited).
[529] Rethinking Bivariate Causal Discovery Through the Lens of Exchangeability
Tiago Brogueira, Mário Figueiredo
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.10152 returned HTTP 429 (rate limited).
[530] Benchmarking Attribute Discrimination in Infant-Scale Vision-Language Models
Patrick Batsell, Tsutsui Satoshi, Bihan Wen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.18951 returned HTTP 429 (rate limited).
[531] Divided We Fall: Defending Against Adversarial Attacks via Soft-Gated Fractional Mixture-of-Experts with Randomized Adversarial Training
Mohammad Meymani, Roozbeh Razavi-Far
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.20821 returned HTTP 429 (rate limited).
[532] Interpretable ML Under the Microscope: Performance, Meta-Features, and the Regression-Classification Predictability Gap
Mattia Billa, Giovanni Orlandi, Veronica Guidetti, Federica Mandreoli
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.00428 returned HTTP 429 (rate limited).
[533] Electricity Price Forecasting: Bridging Linear Models, Neural Networks and Online Learning
Btissame El Mahtout, Florian Ziel
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.02856 returned HTTP 429 (rate limited).
[534] OWLEYE: Zero-Shot Learner for Cross-Domain Graph Data Anomaly Detection
Lecheng Zheng, Dongqi Fu, Zihao Li, Jingrui He
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.19102 was rate-limited (HTTP 429).
[535] PowerGenie: Analytically-Guided Evolutionary Discovery of Superior Reconfigurable Power Converters
Jian Gao, Yiwei Zou, Abhishek Pradhan, Wenhao Huang, Yumin Su, Kaiyuan Yang, Xuan Zhang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.21984 was rate-limited (HTTP 429).
[536] Minimum-Action Learning: Energy-Constrained Symbolic Model Selection for Physical Law Identification from Noisy Data
Martin G. Frasch
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.16951 was rate-limited (HTTP 429).
[537] Position: Spectral GNNs Are Neither Spectral Nor Superior for Node Classification
Qin Jiang, Chengjia Wang, Michael Lones, Dongdong Chen, Wei Pang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.19091 was rate-limited (HTTP 429).
[538] Neural Uncertainty Principle: A Unified View of Adversarial Fragility and LLM Hallucination
Dong-Xiao Zhang, Hu Lou, Jun-Jie Zhang, Jun Zhu, Deyu Meng
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.19562 was rate-limited (HTTP 429).
[539] A Task Decomposition Framework for Aircraft Health Diagnosis: Balancing Safety and Efficiency via Heterogeneous Long-Micro Scale Cascading
Xinhang Chen, Zhihuan Wei, Yang Hu, Zhiguo Zeng, Kang Zeng, Wei Wang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.22885 was rate-limited (HTTP 429).
[540] SpecXMaster Technical Report
Yutang Ge, Yaning Cui, Hanzheng Li, Jun-Jie Wang, Fanjie Xu, Jinhan Dong, Yongqi Jin, Dongxu Cui, Peng Jin, Guojiang Zhao, Hengxing Cai, Rong Zhu, Linfeng Zhang, Xiaohong Ji, Zhifeng Gao
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.23101 was rate-limited (HTTP 429).
[541] Central Dogma Transformer III: Interpretable AI Across DNA, RNA, and Protein
Nobuyuki Ota
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.23361 was rate-limited (HTTP 429).
[542] Discriminative reconstruction via simultaneous dense and sparse coding
Abiy Tasissa, Emmanouil Theodosis, Bahareh Tolooshams, Demba Ba
Main category: cs.LG
Summary unavailable: the arXiv API request for 2006.09534 was rate-limited (HTTP 429).
[543] Correlative Information Maximization: A Biologically Plausible Approach to Supervised Deep Neural Networks without Weight Symmetry
Bariscan Bozkurt, Cengiz Pehlevan, Alper T Erdogan
Main category: cs.LG
Summary unavailable: the arXiv API request for 2306.04810 was rate-limited (HTTP 429).
[544] Chain-Oriented Objective Logic with Neural Network Feedback Control and Cascade Filtering for Dynamic Multi-DSL Regulation
Jipeng Han
Main category: cs.LG
Summary unavailable: the arXiv API request for 2410.13874 was rate-limited (HTTP 429).
[545] Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects
Ke Liang Xiao, Noah Marshall, Atish Agarwala, Elliot Paquette
Main category: cs.LG
Summary unavailable: the arXiv API request for 2411.12135 was rate-limited (HTTP 429).
[546] Kernel Density Machines
Andrea Della Vecchia, Damir Filipovic, Paul Schneider
Main category: cs.LG
Summary unavailable: the arXiv API request for 2504.21419 was rate-limited (HTTP 429).
[547] Gradient-Based Program Repair: Fixing Bugs in Continuous Program Spaces
André Silva, Gustav Thorén, Martin Monperrus
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.17703 was rate-limited (HTTP 429).
[548] When Models Don’t Collapse: On the Consistency of Iterative MLE
Daniel Barzilai, Ohad Shamir
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.19046 was rate-limited (HTTP 429).
[549] A Resource Efficient Quantum Kernel
Utkarsh Singh, Jean-Frédéric Laprade, Aaron Z. Goldberg, Khabat Heshami
Main category: cs.LG
Summary unavailable: the arXiv API request for 2507.03689 was rate-limited (HTTP 429).
[550] Data-driven Mori-Zwanzig modeling of Lagrangian particle dynamics in turbulent flows
Xander de Wit, Alessandro Gabbana, Michael Woodward, Yen Ting Lin, Federico Toschi, Daniel Livescu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2507.16058 was rate-limited (HTTP 429).
[551] Efficient Best-of-Both-Worlds Algorithms for Contextual Combinatorial Semi-Bandits
Mengmeng Li, Philipp J. Schneider, Jelisaveta Aleksić, Daniel Kuhn
Main category: cs.LG
Summary unavailable: the arXiv API request for 2508.18768 was rate-limited (HTTP 429).
[552] Locket: Robust Feature-Locking Technique for Language Models
Lipeng He, Vasisht Duddu, N. Asokan
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.12117 was rate-limited (HTTP 429).
[553] Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative Recommenders
Zhimin Chen, Chenyu Zhao, Ka Chun Mo, Yunjiang Jiang, Jane H. Lee, Khushhall Chandra Mahajan, Ning Jiang, Kai Ren, Jinhui Li, Wen-Yun Yang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.22049 was rate-limited (HTTP 429).
[554] Split-Flows: Measure Transport and Information Loss Across Molecular Resolutions
Sander Hummerich, Tristan Bereau, Ullrich Köthe
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.01464 was rate-limited (HTTP 429).
[555] Fitting Reinforcement Learning Model to Behavioral Data under Bandits
Hao Zhu, Jasper Hoffmann, Baohe Zhang, Joschka Boedecker
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.04454 was rate-limited (HTTP 429).
[556] Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning
Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.14427 was rate-limited (HTTP 429).
[557] STAR-GO: Improving Protein Function Prediction by Learning to Hierarchically Integrate Ontology-Informed Semantic Embeddings
Mehmet Efe Akça, Gökçe Uludoğan, Arzucan Özgür, İnci M. Baytaş
Main category: cs.LG
Summary unavailable: the arXiv API request for 2512.05245 was rate-limited (HTTP 429).
[558] Robust Bayesian Inference via Variational Approximations of Generalized Rho-Posteriors
EL Mahdi Khribch, Pierre Alquier
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.07325 was rate-limited (HTTP 429).
[559] Mixture-of-Experts under Finite-Rate Gating: Communication–Generalization Trade-offs
Ali Khalesi, Mohammad Reza Deylam Salehi
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.15091 was rate-limited (HTTP 429).
[560] RadioDiff-FS: Physics-Informed Manifold Alignment in Few-Shot Diffusion Models for High-Fidelity Radio Map Construction
Xiucheng Wang, Zixuan Guo, Nan Cheng
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.18865 was rate-limited (HTTP 429).
[561] Combinatorial Privacy: Private Multi-Party Bitstream Grand Sum by Hiding in Birkhoff Polytopes
Praneeth Vepakomma
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.22808 was rate-limited (HTTP 429).
[562] Labeled Compression Schemes for Concept Classes of Finite Functions
Benchong Li
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.23561 was rate-limited (HTTP 429).
[563] The Economics of Builder Saturation in Digital Markets
Armin Catovic
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.23685 was rate-limited (HTTP 429).
[564] Toward a Multi-Layer ML-Based Security Framework for Industrial IoT
Aymen Bouferroum, Valeria Loscri, Abderrahim Benslimane
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.24111 was rate-limited (HTTP 429).
[565] Adaptive decision-making for stochastic service network design
Javier Durán-Micco, Bilge Atasoy
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.24369 was rate-limited (HTTP 429).
[566] Composer 2 Technical Report
Cursor Research, Aaron Chan, Ahmed Shalaby, Alexander Wettig, Aman Sanger, Andrew Zhai, Anurag Ajay, Ashvin Nair, Charlie Snell, Chen Lu, Chen Shen, Emily Jia, Federico Cassano, Hanpeng Liu, Haoyu Chen, Henry Wildermuth, Jacob Jackson, Janet Li, Jediah Katz, Jiajun Yao, Joey Hejna, Josh Warner, Julius Vering, Kevin Frans, Lee Danilek, Less Wright, Lujing Cen, Luke Melas-Kyriazi, Michael Truell, Michiel de Jong, Naman Jain, Nate Schmidt, Nathan Wang, Niklas Muennighoff, Oleg Rybkin, Paul Loh, Phillip Kravtsov, Rishabh Yadav, Sahil Shah, Sam Kottler, Alexander M Rush, Shengtong Zhang, Shomil Jain, Sriram Sankar, Stefan Heule, Stuart H. Sul, Sualeh Asif, Victor Rong, Wanqi Zhu, William Lin, Yuchen Wu, Yuri Volkov, Yury Zemlyanskiy, Zack Holbrook, Zhiyuan Zhang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.24477 was rate-limited (HTTP 429).
cs.MA
[567] Belief-Driven Multi-Agent Collaboration via Approximate Perfect Bayesian Equilibrium for Social Simulation
Weiwei Fang, Lin Li, Kaize Shi, Yu Yang, Jianwei Zhang
Main category: cs.MA
TL;DR: BEACOF is a belief-driven adaptive collaboration framework for social simulation that enables agents to dynamically switch between cooperative and competitive interactions based on probabilistic beliefs about peer capabilities, overcoming limitations of static interaction topologies in LLM-based multi-agent systems.
Details
Motivation: Current LLM-based multi-agent frameworks use static interaction topologies that fail to capture the fluid oscillation between cooperative knowledge synthesis and competitive critical reasoning seen in real-world human interactions, leading to unrealistic "groupthink" or unproductive deadlocks that undermine simulation credibility for decision support.
Method: BEACOF is a belief-driven adaptive collaboration framework inspired by Perfect Bayesian Equilibrium (PBE). It models social interaction as a dynamic game of incomplete information, addressing the circular dependency between collaboration type selection and capability estimation. Agents iteratively refine probabilistic beliefs about peer capabilities and autonomously modulate their collaboration strategy to ensure sequentially rational decisions under uncertainty (a toy belief-update sketch follows this entry).
Result: Validated across adversarial (judicial), open-ended (social) and mixed (medical) scenarios, BEACOF prevents coordination failures and fosters robust convergence toward high-quality solutions, demonstrating superior potential for reliable social simulation.
Conclusion: BEACOF bridges the gap in high-fidelity social simulation by enabling dynamic, belief-driven collaboration that authentically replicates human interaction patterns, offering a more credible framework for addressing complex Web societal challenges.
Abstract: High-fidelity social simulation is pivotal for addressing complex Web societal challenges, yet it demands agents capable of authentically replicating the dynamic spectrum of human interaction. Current LLM-based multi-agent frameworks, however, predominantly adhere to static interaction topologies, failing to capture the fluid oscillation between cooperative knowledge synthesis and competitive critical reasoning seen in real-world scenarios. This rigidity often leads to unrealistic "groupthink" or unproductive deadlocks, undermining the credibility of simulations for decision support. To bridge this gap, we propose BEACOF, a belief-driven adaptive collaboration framework inspired by Perfect Bayesian Equilibrium (PBE). By modeling social interaction as a dynamic game of incomplete information, BEACOF rigorously addresses the circular dependency between collaboration type selection and capability estimation. Agents iteratively refine probabilistic beliefs about peer capabilities and autonomously modulate their collaboration strategy, thereby ensuring sequentially rational decisions under uncertainty. Validated across adversarial (judicial), open-ended (social) and mixed (medical) scenarios, BEACOF prevents coordination failures and fosters robust convergence toward high-quality solutions, demonstrating superior potential for reliable social simulation. Source codes and datasets are publicly released at: https://github.com/WUT-IDEA/BEACOF.
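The core loop here is Bayesian: each agent keeps a posterior over every peer's capability and picks cooperate-vs-compete by expected utility. Below is a minimal, hedged sketch of that mechanism using a Beta-Bernoulli belief; the PeerBelief class, the utility weights, and the decision rule are illustrative assumptions, not the paper's actual formulation.

```python
# Hedged sketch of a belief-driven mode switch in the spirit of BEACOF.
# The Beta-Bernoulli belief model and utility numbers are assumptions.

class PeerBelief:
    """Beta posterior over a peer's capability (prob. its contribution helps)."""
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta  # uniform prior

    def update(self, contribution_helped: bool):
        # Bayesian update from one observed interaction round.
        if contribution_helped:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def capability(self) -> float:
        return self.alpha / (self.alpha + self.beta)

def choose_mode(belief: PeerBelief) -> str:
    # Cooperate when the peer is believed capable enough that joint
    # synthesis beats critical challenge in expected utility
    # (the 1.0 / 0.8 utilities are invented for illustration).
    u_cooperate = belief.capability * 1.0
    u_compete = (1.0 - belief.capability) * 0.8
    return "cooperate" if u_cooperate >= u_compete else "compete"

belief = PeerBelief()
for helped in [True, True, False, True]:  # toy interaction history
    belief.update(helped)
    print(f"capability={belief.capability:.2f} -> {choose_mode(belief)}")
```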
[568] Ultra-fast Traffic Nowcasting and Control via Differentiable Agent-based Simulation
Fumiyasu Makinoshima, Yuya Yamaguchi, Eigo Segawa, Koichiro Niinuma, Sean Qian
Main category: cs.MA
TL;DR: Differentiable agent-based traffic simulator enables ultra-fast calibration, nowcasting, and control on large-scale networks through gradient-based optimization
Details
Motivation: Traffic digital twins need large-scale, high-fidelity models calibrated to real-world data, but conventional simulations are non-differentiable and rely on inefficient gradient-free optimization, making calibration computationally infeasible.
Method: Developed a differentiable agent-based traffic simulator with differentiable computing techniques for vehicle movements, stochastic decision-making, and inter-agent interactions, while maintaining end-to-end differentiability for gradient-based optimization (a toy calibration sketch follows this entry).
Result: On the Chicago road network (10,000+ parameters, 1M+ vehicles), achieves 173x real-time speed and completes calibration in 455 s, nowcasting in 21 s, and control in 728 s, yielding a full calibration-nowcast-control loop in under 20 minutes.
Conclusion: Provides practical computational basis for traffic digital twins with ultra-fast differentiable simulation enabling real-time calibration, nowcasting, and control
Abstract: Traffic digital twins, which inform policymakers of effective interventions based on large-scale, high-fidelity computational models calibrated to real-world traffic, hold promise for addressing societal challenges in our rapidly urbanizing world. However, conventional fine-grained traffic simulations are non-differentiable and typically rely on inefficient gradient-free optimization, making calibration for real-world applications computationally infeasible. Here we present a differentiable agent-based traffic simulator that enables ultra-fast model calibration, traffic nowcasting, and control on large-scale networks. We develop several differentiable computing techniques for simulating individual vehicle movements, including stochastic decision-making and inter-agent interactions, while ensuring that entire simulation trajectories remain end-to-end differentiable for efficient gradient-based optimization. On the large-scale Chicago road network, with over 10,000 calibration parameters, our model simulates more than one million vehicles at 173 times real-time speed. This ultra-fast simulation, together with efficient gradient-based optimization, enables us to complete model calibration using the previous 30 minutes of traffic data in 455 s, provide a one-hour-ahead traffic nowcast in 21 s, and solve the resulting traffic control problem in 728 s. This yields a full calibration–nowcast–control loop in under 20 minutes, leaving about 40 minutes of lead time for implementing interventions. Our work thus provides a practical computational basis for realizing traffic digital twins.
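The enabling trick in such simulators is replacing hard, discrete decisions with smooth relaxations so that observed quantities become differentiable in the calibration parameters. The sketch below illustrates that idea on an invented three-route toy: a softmax route choice stands in for discrete driver decisions, and a scalar demand parameter theta is calibrated by plain gradient descent. None of this is the paper's actual simulator.

```python
import numpy as np

def simulate(theta, costs, temperature=1.0):
    # Soft route choice: softmax over negative route costs, a
    # differentiable stand-in for each driver's discrete decision.
    logits = -costs / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return theta * probs  # expected flow on each route

def loss_and_grad(theta, costs, observed):
    flows = simulate(theta, costs)
    resid = flows - observed
    loss = 0.5 * np.sum(resid ** 2)
    # d flows / d theta equals the route probabilities (flows / theta).
    grad = np.sum(resid * (flows / theta))
    return loss, grad

costs = np.array([10.0, 12.0, 15.0])        # toy route costs (minutes)
observed = np.array([600.0, 280.0, 120.0])  # toy observed link flows
theta = 500.0                               # initial total-demand guess
for _ in range(200):
    loss, grad = loss_and_grad(theta, costs, observed)
    theta -= 0.5 * grad                     # plain gradient descent
print(f"calibrated demand ~ {theta:.1f}, loss = {loss:.3f}")
```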
[569] From Logic Monopoly to Social Contract: Separation of Power and the Institutional Foundations for Autonomous Agent Economies
Anbang Ruan
Main category: cs.MA
TL;DR: AE4E paradigm proposes treating AI agents as autonomous business entities with constitutional separation of powers (Legislation, Execution, Adjudication) to address reliability gaps in multi-agent systems through institutional infrastructure.
Details
Motivation: Addresses structural deficiencies in existing multi-agent frameworks where each agent simultaneously plans, executes, and evaluates its own actions ("Logic Monopoly"), leading to high attack success rates (84.30%), emergent deceptive behavior (31.4%), and cascading failures rooted in six structural bottlenecks.
Method: Introduces the Agent Enterprise for Enterprise (AE4E) paradigm with a contract-centric Separation of Power model dividing authority into three branches: Legislation, Execution, and Adjudication. Operationalized through the NetX Enterprise Framework (NEF) with governance hubs, TEE-backed compute enclaves, privacy-preserving data bridges, and an Agent-Native blockchain substrate.
Result: Outlines a scalable Agent Enterprise Economy across four deployment tiers, from private enclaves to a global Web of Services, with an Agentic Social Layer grounded in Parsons' AGIL framework and 60+ named Institutional AE4Es.
Conclusion: The solution to multi-agent reliability issues is not better individual model alignment but establishing a social contract with institutional infrastructure that enforces constitutional separation of powers among agents as autonomous business entities.
Abstract: Existing multi-agent frameworks allow each agent to simultaneously plan, execute, and evaluate its own actions – a structural deficiency we term the “Logic Monopoly.” Empirical evidence quantifies the resulting “Reliability Gap”: 84.30% average attack success rates across ten deployment scenarios, 31.4% emergent deceptive behavior without explicit reward signals, and cascading failure modes rooted in six structural bottlenecks. The remedy is not better alignment of individual models but a social contract for agents: institutional infrastructure that enforces a constitutional Separation of Power. This paper introduces the Agent Enterprise for Enterprise (AE4E) paradigm – agents as autonomous, legally identifiable business entities within a functionalist social system – with a contract-centric SoP model trifurcating authority into Legislation, Execution, and Adjudication branches. The paradigm is operationalized through the NetX Enterprise Framework (NEF): governance hubs, TEE-backed compute enclaves, privacy-preserving data bridges, and an Agent-Native blockchain substrate. The Agent Enterprise Economy scales across four deployment tiers from private enclaves to a global Web of Services. The Agentic Social Layer, grounded in Parsons’ AGIL framework, provides institutional infrastructure via sixty-plus named Institutional AE4Es. 143 pages, 173 references, eight specialized smart contracts.
[570] AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer’s Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis, and Reader Study
Wenlong Hou, Sheng Bi, Guangqian Yang, Lihao Liu, Ye Du, Hanxiao Xue, Juncheng Wang, Yuxiang Feng, Yue Xun, Nanxi Yu, Ning Mao, Mo Yang, Yi Wah Eva Cheung, Ling Long, Kay Chen Tan, Lequan Yu, Xiaomeng Ma, Shaozhen Yan, Shujun Wang
Main category: cs.MA
TL;DR: AD-CARE is a multimodal LLM agent for Alzheimer’s disease diagnosis that handles incomplete, heterogeneous clinical data without imputation, using guideline-grounded reasoning to generate comprehensive diagnostic reports.
Details
Motivation: Real-world Alzheimer's disease assessment faces challenges with incomplete, heterogeneous multimodal data and variability across sites and demographics. Current LLMs are limited to answering narrow, disease-specific questions rather than generating comprehensive diagnostic reports for clinical decision support.
Method: AD-CARE is a modality-agnostic agent that dynamically orchestrates specialized diagnostic tools and embeds clinical guidelines into LLM-driven reasoning. It processes incomplete, heterogeneous inputs without imputing missing modalities and generates transparent, report-style outputs aligned with clinical workflows (the dispatch pattern is sketched after this entry).
Result: Achieved 84.9% diagnostic accuracy across 10,303 cases from six cohorts, with 4.2%-13.7% relative improvements over baselines. Reduced performance disparities across racial/age subgroups by 21%-68% and 28%-51% respectively. Improved neurologist/radiologist accuracy by 6%-11% and halved decision time in reader studies.
Conclusion: AD-CARE is a scalable, practically deployable framework that can be integrated into routine clinical workflows for multimodal decision support in Alzheimer’s disease, demonstrating robust performance across diverse populations and clinical settings.
Abstract: Alzheimer’s disease (AD) is a growing global health challenge as populations age, and timely, accurate diagnosis is essential to reduce individual and societal burden. However, real-world AD assessment is hampered by incomplete, heterogeneous multimodal data and variability across sites and patient demographics. Although large language models (LLMs) have shown promise in biomedicine, their use in AD has largely been confined to answering narrow, disease-specific questions rather than generating comprehensive diagnostic reports that support clinical decision-making. Here we expand LLM capabilities for clinical decision support by introducing AD-CARE, a modality-agnostic agent that performs guideline-grounded diagnostic assessment from incomplete, heterogeneous inputs without imputing missing modalities. By dynamically orchestrating specialized diagnostic tools and embedding clinical guidelines into LLM-driven reasoning, AD-CARE generates transparent, report-style outputs aligned with real-world clinical workflows. Across six cohorts comprising 10,303 cases, AD-CARE achieved 84.9% diagnostic accuracy, delivering 4.2%-13.7% relative improvements over baseline methods. Despite cohort-level differences, dataset-specific accuracies remain robust (80.4%-98.8%), and the agent consistently outperforms all baselines. AD-CARE reduced performance disparities across racial and age subgroups, decreasing the average dispersion of four metrics by 21%-68% and 28%-51%, respectively. In a controlled reader study, the agent improved neurologist and radiologist accuracy by 6%-11% and more than halved decision time. The framework yielded 2.29%-10.66% absolute gains over eight backbone LLMs and converges their performance. These results show that AD-CARE is a scalable, practically deployable framework that can be integrated into routine clinical workflows for multimodal decision support in AD.
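A key design point in the summary is running only the tools whose inputs exist rather than imputing missing modalities. The sketch below shows that dispatch pattern; the tool registry, modality names, and report format are invented for illustration, and the guideline-grounded LLM reasoning is omitted.

```python
# Hedged sketch of modality-agnostic tool dispatch: tools whose required
# modalities are missing are skipped, never imputed. All names are
# hypothetical stand-ins, not AD-CARE's actual tool set.

TOOLS = {
    "mri_atrophy":       {"needs": {"mri"}},
    "cognitive_score":   {"needs": {"neuropsych"}},
    "csf_biomarkers":    {"needs": {"csf"}},
    "multimodal_fusion": {"needs": {"mri", "neuropsych"}},
}

def run_tool(name, case):
    # Placeholder for a real model call on the case's data.
    return f"{name}: finding derived from {sorted(TOOLS[name]['needs'])}"

def assess(case: dict) -> str:
    available = {k for k, v in case.items() if v is not None}
    findings = [run_tool(name, case)
                for name, spec in TOOLS.items()
                if spec["needs"] <= available]   # skip tools missing inputs
    skipped = [n for n, s in TOOLS.items() if not s["needs"] <= available]
    return "\n".join(findings + [f"(not assessed: {', '.join(skipped)})"])

# A case with no CSF sample: the CSF tool is skipped, nothing is imputed.
print(assess({"mri": "scan.nii", "neuropsych": {"mmse": 24}, "csf": None}))
```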
[571] Conchordal: Emergent Harmony via Direct Cognitive Coupling in a Psychoacoustic Landscape
Koichi Takahashi
Main category: cs.MA
TL;DR: Conchordal is a bio-acoustic instrument using artificial life agents in a psychoacoustic fitness landscape for generative composition, with agents adapting based on consonance metrics without symbolic harmonic rules.
Details
Motivation: To create a generative composition system where sonic agents operate directly on psychoacoustic observables rather than symbolic harmonic rules, exploring how artificial life dynamics can self-organize in a consonance-based fitness landscape.
Method: Uses the Direct Cognitive Coupling (DCC) principle with agents in a continuous consonance field integrating roughness and harmonicity. Agents adjust pitch via local proposal-and-accept dynamics under a crowding penalty, regulate survival via consonance-dependent metabolism, and synchronize via Kuramoto-style phase coupling (sketched after this entry).
Result: Four experiments show: (1) consonance search produces structured polyphony with enriched consonant intervals; (2) survival differentials emerge from consonance-dependent metabolism; (3) hereditary adaptation accumulates structured polyphony; (4) shared oscillatory scaffold organizes rhythmic timing. Supplementary mechanism shows spectral state can modulate temporal coupling.
Conclusion: Psychoacoustically derived landscapes serve as effective artificial-life terrains, enabling self-organization, selection, synchronization, and lineage-level accumulation in a non-traditional computational medium, functioning both as ecological terrain and internal proxy for musical coherence.
Abstract: This paper introduces Conchordal, a bio-acoustic instrument for generative composition whose sonic agents are governed by artificial life dynamics within a psychoacoustic fitness landscape. The system is built on Direct Cognitive Coupling (DCC), a design principle requiring that generative dynamics operate directly within a landscape derived from psychoacoustic observables and read from that landscape without symbolic harmonic rules. The environment integrates roughness and harmonicity into a continuous consonance field without presupposing discrete scales or explicit harmonic rules. Agents adjust pitch through local proposal-and-accept dynamics under a crowding penalty, regulate survival via consonance-dependent metabolism, and entrain temporally through Kuramoto-style phase coupling. Four experiments are reported: (1) consonance search produces structured polyphony with enriched consonant intervals; (2) consonance-dependent metabolism yields survival differentials that vanish when recharge is disabled; (3) a minimal hereditary adaptation assay shows that parent-guided respawn plus metabolic selection can accumulate more structured polyphony without adult hill-climbing; and (4) a shared oscillatory scaffold organizes rhythmic timing under external forcing. A supplementary mechanism check reports one possible composer-configurable bridge by which spectral state can modulate temporal coupling. These findings show that a psychoacoustically derived landscape serves as an effective artificial-life terrain, yielding self-organization, selection, synchronization, and lineage-level accumulation in a non-traditional computational medium. At the level of the model, the same landscape therefore functions both as ecological terrain and as an internal proxy for musical coherence.
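Of the mechanisms listed, the temporal one is classical: Kuramoto phase coupling, where each oscillator is pulled toward the population's mean phase. A minimal sketch follows; the agent count, frequency spread, and coupling constant K are illustrative choices, and the consonance field and metabolism are omitted.

```python
import numpy as np

# Minimal Kuramoto-style phase coupling, the temporal-entrainment
# mechanism named above. All constants are illustrative.
rng = np.random.default_rng(0)
n, K, dt, steps = 16, 1.5, 0.01, 4000
omega = rng.normal(2 * np.pi, 0.5, n)   # natural frequencies (rad/s)
theta = rng.uniform(0, 2 * np.pi, n)    # initial phases

for _ in range(steps):
    # Each agent is pulled toward the mean phase of the population:
    # entry [i, j] below is sin(theta_j - theta_i), averaged over j.
    coupling = np.sin(theta[None, :] - theta[:, None]).mean(axis=1)
    theta += dt * (omega + K * coupling)

# Order parameter r in [0, 1]: ~0 incoherent, ~1 fully synchronized.
r = np.abs(np.exp(1j * theta).mean())
print(f"phase coherence r = {r:.2f}")
```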
[572] When Identity Overrides Incentives: Representational Choices as Governance Decisions in Multi-Agent LLM Systems
Viswonathan Manoranjan, Snehalkumar 'Neil' S. Gaikwad
Main category: cs.MA
TL;DR: LLM agents in multi-agent systems show that role-based personas suppress payoff-aligned strategic behavior, with a single design choice shifting equilibrium attainment by up to 90 percentage points.
Details
Motivation: To understand how design choices like role-based personas and payoff visibility affect LLM agent behavior in strategic multi-agent systems, particularly whether agents act as payoff-sensitive strategic actors or as identity-driven role followers.
Method: A 2x2 factorial experiment (persona presence x payoff visibility) with four LLM models (Qwen-7B/32B, Llama-8B, Mistral-7B), tested across 53 environmental policy scenarios in four-agent strategic games (a Nash-equilibrium check is sketched after this entry).
Result: Personas suppress payoff-aligned behavior: with personas present, all models achieve near-zero Nash equilibrium in Tragedy-dominant scenarios despite complete payoff information. Removing personas and providing explicit payoffs enabled only Qwen models to reach 65-90% equilibrium rates. Three behavioral profiles emerged: Qwen adapts to framing, Mistral is disrupted without finding Tragedy equilibrium, and Llama remains near-invariant.
Conclusion: Representational choices (personas, payoff visibility) are not implementation details but governance decisions that can shift equilibrium attainment by up to 90 percentage points, significantly affecting strategic behavior in LLM multi-agent systems.
Abstract: Large language models are increasingly deployed in multi-agent systems for strategic tasks, yet how design choices such as role-based personas and payoff visibility affect behavior remains poorly understood. We investigate whether LLM agents function as payoff-sensitive strategic actors or as identity-driven role followers. Using a 2x2 factorial experiment (persona presence x payoff visibility) with four models (Qwen-7B/32B, Llama-8B, Mistral-7B), we test 53 environmental policy scenarios in four-agent strategic games. We find that personas suppress payoff-aligned behavior: with personas present, all models achieve near-zero Nash equilibrium in Tragedy-dominant scenarios despite complete payoff information. Nearly every equilibrium reached is Green Transition. Removing personas and providing explicit payoffs are both near-necessary for payoff-aligned behavior, enabling only Qwen models to reach 65–90% equilibrium rates. Our results reveal three behavioral profiles: Qwen adapts to framing, Mistral is disrupted without finding Tragedy equilibrium, and Llama remains near-invariant. We show that the same binary design choice can shift equilibrium attainment by up to 90 percentage points, establishing that representational choices are not implementation details but governance decisions.
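The headline metric is whether agents reach a pure-strategy Nash equilibrium of the scenario's payoff structure. The sketch below shows the standard check on an invented random payoff tensor for four players with two actions each: a joint action is an equilibrium iff no player gains by a unilateral deviation.

```python
import itertools
import numpy as np

# Pure-strategy Nash check for a 4-player, 2-action game. The random
# payoff tensor is a stand-in for a scenario's actual payoffs.
rng = np.random.default_rng(1)
n_players, n_actions = 4, 2
# payoff[p][joint-action tuple] = payoff to player p
payoff = rng.uniform(0, 1, (n_players,) + (n_actions,) * n_players)

def is_nash(joint):
    for p in range(n_players):
        for alt in range(n_actions):
            dev = list(joint)
            dev[p] = alt
            if payoff[p][tuple(dev)] > payoff[p][tuple(joint)] + 1e-12:
                return False  # p has a profitable unilateral deviation
    return True

equilibria = [j for j in itertools.product(range(n_actions), repeat=n_players)
              if is_nash(j)]
print("pure Nash profiles:", equilibria)
```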
[573] Theory of Dynamic Adaptive Coordination
Stefano Grassi
Main category: cs.MA
TL;DR: A dynamical theory of adaptive coordination governed by persistent environmental memory, showing coordination emerges from dissipative balancing rather than centralized optimization.
Details
Motivation: To move beyond framework-specific equilibrium optimization or agent-centric learning by developing a theory where coordination emerges from structural feedback between agents, incentives, and persistent environmental memory.
Method: Models agents, incentives, and the environment as a recursively closed feedback architecture: a persistent environment stores coordination signals, a distributed incentive field transmits them locally, and adaptive agents update in response. Uses dynamical systems theory and numerical verification, including Neimark-Sacker bifurcation analysis (a toy closed-loop iteration follows this entry).
Result: Three main results: 1) Under dissipativity, closed-loop system admits bounded forward-invariant region ensuring viability independent of global optimality; 2) When incentives hinge on persistent memory, coordination becomes irreducible to static optimization; 3) Identifies essential structural condition for emergence as bidirectional coupling between memory-dependent incentives and agent updates. Numerical verification shows Neimark-Sacker bifurcation at critical coupling threshold β_c, robustness under nonlinear saturation, and scalability to N = 10^6 agents.
Conclusion: Coordination emerges as a structural consequence of dissipative balancing against reactive feedback rather than as a solution to centralized optimization, with persistent environmental memory playing a crucial role in making coordination irreducible to static optimization.
Abstract: This paper develops a dynamical theory of adaptive coordination governed by persistent environmental memory. Moving beyond framework-specific equilibrium optimization or agent-centric learning, I model agents, incentives, and the environment as a recursively closed feedback architecture: a persistent environment stores accumulated coordination signals, a distributed incentive field transmits them locally, and adaptive agents update in response. Coordination thus emerges as a structural consequence of dissipative balancing against reactive feedback, rather than the solution to a centralized objective. I establish three primary results. First, I show that under dissipativity, the closed-loop system admits a bounded forward-invariant region, ensuring viability independent of global optimality. Second, I demonstrate that when incentives hinge on persistent memory, coordination becomes irreducible to static optimization. Finally, I identify the essential structural condition for emergence: a bidirectional coupling where memory-dependent incentives drive agent updates, which in turn reshape the environmental state. Numerical verification identifies a Neimark-Sacker bifurcation at a critical coupling threshold (β_c), providing a rigorous stability boundary for the architecture. Results further confirm the framework's robustness under nonlinear saturation and demonstrate macroscopic scalability to populations of N = 10^6 agents.
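A toy instantiation of the closed loop makes the first result concrete: with dissipative memory and a saturating incentive, the system stays bounded, and above a critical coupling a nonzero coordinated state emerges. The maps, constants, and the beta > 1 threshold below are assumptions chosen to exhibit that regime, not the paper's equations.

```python
import numpy as np

# Toy memory-incentive-agent loop: memory m accumulates the mean action
# with dissipation delta; a saturating incentive tanh(beta * m) feeds
# back into agent updates. All maps and constants are illustrative.
rng = np.random.default_rng(2)
delta, eta, beta = 0.05, 0.2, 1.2   # dissipation, learning rate, coupling
n_agents, steps = 100, 2000
x = rng.normal(0.5, 1.0, n_agents)  # agent states (shifted mean breaks symmetry)
m = 0.0                             # persistent environmental memory

for _ in range(steps):
    incentive = np.tanh(beta * m)           # memory-dependent incentive field
    x += eta * (incentive - x)              # adaptive agent updates
    m = (1 - delta) * m + delta * x.mean()  # memory stores the mean action

# In this toy map, beta > 1 yields a nonzero coordinated state; beta < 1
# relaxes to zero. Either way the loop stays in a bounded region.
print(f"memory = {m:.3f}, agent spread = {x.std():.3f}")
```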
cs.MM
[574] A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis
Nailei Hei, Qianyu Guo, Zihao Wang, Yan Wang, Haofen Wang, Wenqiang Zhang
Main category: cs.MM
TL;DR: A framework for automatically optimizing user prompts for text-to-image models by bridging the gap between novice user inputs and model-preferred prompts through a coarse-fine granularity dataset and adaptive text generation.
Details
Motivation: Novice users struggle to achieve desired results with text-to-image models due to a discrepancy between their input prompts and model-preferred prompts, creating a distribution gap between user input behavior and model training data.
Method: Constructed the Coarse-Fine Granularity Prompts dataset (CFP) and proposed the User-Friendly Fine-Grained Text Generation framework (UF-FGTG), comprising a prompt refiner, image-related loss functions integrated from the text-to-image model, and an adaptive feature extraction module for diversity (the overall loop is sketched after this entry).
Result: The approach generates more visually appealing and diverse images than previous state-of-the-art methods, achieving average improvement of 5% across six quality and aesthetic metrics.
Conclusion: The proposed framework successfully bridges the gap between user inputs and model preferences, enabling novice users to achieve better results with text-to-image models through automated prompt optimization.
Abstract: Well-designed prompts have demonstrated the potential to guide text-to-image models in generating amazing images. Although existing prompt engineering methods can provide high-level guidance, it is challenging for novice users to achieve the desired results by manually entering prompts due to a discrepancy between novice-user-input prompts and the model-preferred prompts. To bridge the distribution gap between user input behavior and model training datasets, we first construct a novel Coarse-Fine Granularity Prompts dataset (CFP) and propose a novel User-Friendly Fine-Grained Text Generation framework (UF-FGTG) for automated prompt optimization. For CFP, we construct a novel dataset for text-to-image tasks that combines coarse and fine-grained prompts to facilitate the development of automated prompt generation methods. For UF-FGTG, we propose a novel framework that automatically translates user-input prompts into model-preferred prompts. Specifically, we propose a prompt refiner that continually rewrites prompts to empower users to select results that align with their unique needs. Meanwhile, we integrate image-related loss functions from the text-to-image model into the training process of text generation to generate model-preferred prompts. Additionally, we propose an adaptive feature extraction module to ensure diversity in the generated results. Experiments demonstrate that our approach is capable of generating more visually appealing and diverse images than previous state-of-the-art methods, achieving an average improvement of 5% across six quality and aesthetic metrics.
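The workflow the summary describes is an iterative refine-generate-score-select loop. The sketch below shows only its control flow; refine(), generate(), and score() are stubs standing in for the trained refiner, the text-to-image model, and the quality metrics, so the outputs are meaningless beyond illustrating the loop.

```python
# Control-flow sketch of a refine/generate/score/select loop in the
# spirit of UF-FGTG. All three components are hypothetical stubs.

def refine(prompt: str, n: int = 3) -> list[str]:
    # Stand-in for the learned prompt refiner (would be an LLM call).
    styles = ["highly detailed", "soft lighting", "wide-angle shot"]
    return [f"{prompt}, {s}" for s in styles[:n]]

def generate(prompt: str) -> str:
    return f"<image for: {prompt}>"   # stand-in for the T2I model

def score(image: str) -> float:
    return len(image) % 7 / 7.0       # stand-in quality/aesthetic score

def optimize_prompt(user_prompt: str, rounds: int = 2) -> str:
    best = user_prompt
    for _ in range(rounds):
        candidates = refine(best)
        scored = [(score(generate(c)), c) for c in candidates]
        best = max(scored)[1]         # user selection, approximated by argmax
    return best

print(optimize_prompt("a cat on a hill"))
```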
[575] Ges-QA: A Multidimensional Quality Assessment Dataset for Audio-to-3D Gesture Generation
Zhilin Gao, Yunhao Li, Sijing Wu, Yuqin Cao, Huiyu Duan, Guangtao Zhai
Main category: cs.MM
TL;DR: A new dataset (Ges-QA) and evaluation model for assessing AI-generated 3D human gestures from audio, addressing limitations of current metrics by incorporating human preference and multi-dimensional quality assessment.
Details
Motivation: Current evaluation metrics for Audio-to-3D-Gesture (A2G) tasks, such as Fréchet Gesture Distance or Beat Constancy, fail to reflect human preferences for generated 3D gestures, creating a need for better quality assessment methods.
Method: Created the Ges-QA dataset with 1,400 samples featuring multidimensional scores for gesture quality and audio-gesture consistency, plus binary labels for emotion matching. Developed a multi-modal transformer-based neural network with three branches (video, audio, 3D skeleton) to score A2G content across multiple dimensions (the branch-fusion shape is sketched after this entry).
Result: Ges-QAer model achieves state-of-the-art performance on the Ges-QA dataset, with comparative experiments and ablation studies demonstrating its effectiveness in multi-dimensional assessment of AI-generated 3D gestures.
Conclusion: The Ges-QA dataset and accompanying evaluation model provide a comprehensive framework for assessing AI-generated 3D human gestures, addressing the gap between current metrics and human preference in audio-to-gesture generation tasks.
Abstract: The Audio-to-3D-Gesture (A2G) task has enormous potential for various applications in virtual reality and computer graphics, etc. However, current evaluation metrics, such as Fréchet Gesture Distance or Beat Constancy, fail at reflecting the human preference of the generated 3D gestures. To cope with this problem, exploring human preference and an objective quality assessment metric for AI-generated 3D human gestures is becoming increasingly significant. In this paper, we introduce the Ges-QA dataset, which includes 1,400 samples with multidimensional scores for gesture quality and audio-gesture consistency. Moreover, we collect binary classification labels to determine whether the generated gestures match the emotions of the audio. Equipped with our Ges-QA dataset, we propose a multi-modal transformer-based neural network with 3 branches for video, audio and 3D skeleton modalities, which can score A2G contents in multiple dimensions. Comparative experimental results and ablation studies demonstrate that Ges-QAer yields state-of-the-art performance on our dataset.
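Structurally, the proposed scorer is three modality branches fused into multidimensional heads. The sketch below shows only that shape, with random linear maps in place of the transformer branches; the dimensions, the tanh nonlinearity, and the three output heads are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32  # shared embedding width (an assumption)

def make_encoder(in_dim):
    # Stand-in for a per-modality transformer branch: one random linear
    # layer plus tanh, mapping pooled features to the shared width d.
    W = rng.normal(0, in_dim ** -0.5, (in_dim, d))
    return lambda x: np.tanh(x @ W)

enc_video = make_encoder(512)  # pooled video features (dim assumed)
enc_audio = make_encoder(128)  # pooled audio features
enc_skel = make_encoder(75)    # e.g. 25 joints x 3 coords, pooled over time

# One output per assessed dimension: gesture quality, audio-gesture
# consistency, and emotion match (as a logit).
W_head = rng.normal(0, (3 * d) ** -0.5, (3 * d, 3))

def score(video_feat, audio_feat, skel_feat):
    fused = np.concatenate([enc_video(video_feat),
                            enc_audio(audio_feat),
                            enc_skel(skel_feat)])
    return fused @ W_head

print(score(rng.normal(size=512), rng.normal(size=128), rng.normal(size=75)))
```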
eess.AS
[576] X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs
Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan, Tao Jin
Main category: eess.AS
TL;DR: X-OPD is a cross-modal on-policy distillation framework that aligns speech LLMs with text LLMs using on-policy rollouts and token-level feedback to close the performance gap between speech and text models.
Details
Motivation: End-to-end speech LLMs suffer significant performance degradation compared to text-based LLMs, and standard SFT and RL training methods fail to close this gap, necessitating a new approach to align speech and text model capabilities.
Method: Proposes the X-OPD framework, in which the speech LLM explores its own distribution via on-policy rollouts while a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher's capabilities into the student's multi-modal representations (the feedback signal is sketched after this entry).
Result: Extensive experiments across multiple benchmarks show X-OPD significantly narrows the gap in complex tasks while preserving the model’s inherent capabilities.
Conclusion: X-OPD successfully addresses the performance gap between speech and text LLMs through cross-modal distillation, enabling better alignment of speech models with their text counterparts.
Abstract: While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit a significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs to their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling teacher’s capabilities into student’s multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model’s inherent capabilities.
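The training signal in on-policy distillation is computed on trajectories the student itself samples, with the teacher scoring each token. The sketch below computes one common form of that signal, the per-token teacher-student log-probability ratio, whose on-policy average estimates the negative reverse KL; the toy logits and the exact feedback form are assumptions, not X-OPD's specification.

```python
import numpy as np

rng = np.random.default_rng(4)
V, T = 50, 8  # toy vocab size and rollout length

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

student_logits = rng.normal(size=(T, V))
teacher_logits = student_logits + rng.normal(scale=0.5, size=(T, V))
p_s, p_t = softmax(student_logits), softmax(teacher_logits)

# On-policy rollout: sample each token from the *student's* distribution.
tokens = np.array([rng.choice(V, p=p_s[t]) for t in range(T)])

# Token-level feedback: teacher vs. student log-prob on sampled tokens.
idx = np.arange(T)
feedback = np.log(p_t[idx, tokens]) - np.log(p_s[idx, tokens])
print("per-token teacher feedback:", np.round(feedback, 3))
print("mean (estimates -KL(student||teacher)):", feedback.mean().round(3))
```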
[577] Unified Diffusion Refinement for Multi-Channel Speech Enhancement and Separation
Zhongweiyang Xu, Ashutosh Pandey, Juan Azcarreta, Zhaoheng Ni, Sanjeel Parekh, Buye Xu, Romit Roy Choudhury
Main category: eess.AS
TL;DR: Uni-ArrayDPS is a diffusion-based refinement framework that improves multi-channel speech enhancement/separation outputs by using a pre-trained clean-speech diffusion model as a prior, without requiring additional training.
Details
Motivation: Existing discriminative methods for multi-channel speech enhancement/separation produce high-SNR outputs but can generate unnatural speech with non-linear distortions. A method is needed that refines these outputs into more natural-sounding speech without retraining for different tasks or array geometries.
Method: Uni-ArrayDPS refines the outputs of any discriminative model using diffusion posterior sampling. It estimates the noise spatial covariance matrix from the discriminative model's output and the noisy mixtures, then uses this matrix to compute the likelihood for diffusion posterior sampling. The framework needs only a pre-trained clean-speech diffusion model as a prior, is array-agnostic and training-free, and supports both enhancement and separation (a toy guidance step is sketched after this entry).
Result: Extensive experiments show Uni-ArrayDPS consistently improves a wide range of discriminative models for both enhancement and separation tasks, with strong results on real-world datasets. The method generalizes across tasks, microphone array geometries, and discriminative model backbones.
Conclusion: Uni-ArrayDPS provides an effective, training-free approach to refine discriminative speech enhancement/separation outputs using diffusion priors, producing more natural-sounding speech while maintaining compatibility with existing models and array configurations.
Abstract: We propose Uni-ArrayDPS, a novel diffusion-based refinement framework for unified multi-channel speech enhancement and separation. Existing methods for multi-channel speech enhancement/separation are mostly discriminative and are highly effective at producing high-SNR outputs. However, they can still generate unnatural speech with non-linear distortions caused by the neural network and regression-based objectives. To address this issue, we propose Uni-ArrayDPS, which refines the outputs of any strong discriminative model using a speech diffusion prior. Uni-ArrayDPS is generative, array-agnostic, and training-free, and supports both enhancement and separation. Given a discriminative model’s enhanced/separated speech, we use it, together with the noisy mixtures, to estimate the noise spatial covariance matrix (SCM). We then use this SCM to compute the likelihood required for diffusion posterior sampling of the clean speech source(s). Uni-ArrayDPS requires only a pre-trained clean-speech diffusion model as a prior and does not require additional training or fine-tuning, allowing it to generalize directly across tasks (enhancement/separation), microphone array geometries, and discriminative model backbones. Extensive experiments show that Uni-ArrayDPS consistently improves a wide range of discriminative models for both enhancement and separation tasks. We also report strong results on a real-world dataset. Audio demos are provided at https://xzwy.github.io/Uni-ArrayDPS/.
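At its core, diffusion posterior sampling adds a likelihood-gradient correction to each reverse diffusion step. The sketch below reduces the paper's SCM-based likelihood to a whitened Gaussian data-fidelity term and uses a dummy denoiser and a placeholder sampler update; all names and shapes are illustrative, not the Uni-ArrayDPS code.

```python
# Simplified diffusion-posterior-sampling (DPS) guidance step. inv_scm_chol
# plays the role of (the Cholesky factor of) the inverse noise spatial
# covariance estimated from the discriminative model's output.
import torch

def dps_step(x_t, t, denoiser, y, inv_scm_chol, alpha_bar_t, guidance_scale=1.0):
    """One guided reverse step: denoise, then correct with a likelihood gradient."""
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t)                        # predicted noise
    # Tweedie estimate of the clean signal x0 from x_t.
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    # Whitened residual against the observation y (here: the discriminative
    # model's output standing in for the measurement).
    residual = inv_scm_chol @ (y - x0_hat)
    loss = residual.pow(2).sum()
    grad = torch.autograd.grad(loss, x_t)[0]
    # Unconditional update (placeholder for the real DDIM/ancestral step),
    # then the likelihood correction.
    x_prev = x0_hat.detach()                      # stand-in for the sampler update
    return x_prev - guidance_scale * grad

denoiser = lambda x, t: 0.1 * x                   # dummy denoiser for illustration
x = dps_step(torch.randn(4, 4), 10, denoiser, torch.randn(4, 4),
             torch.eye(4), torch.tensor(0.5))
print(x.shape)
```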
[578] AdaLTM: Adaptive Layer-wise Task Vector Merging for Categorical Speech Emotion Recognition with ASR Knowledge Integration
Chia-Yu Lee, Huang-Cheng Chou, Tzu-Quan Lin, Yuanchao Li, Ya-Tse Wu, Shrikanth Narayanan, Chi-Chun Lee
Main category: eess.AS
TL;DR: AdaLTM framework integrates ASR and SER through layer-wise task vector merging to balance linguistic and paralinguistic knowledge without gradient conflicts.
Details
Motivation: Traditional feature fusion and multi-task learning for ASR+SER integration face performance bottlenecks and optimization conflicts, while task vector merging methods from NLP/CV remain unexplored for speech tasks.
Method: Proposes Adaptive Layer-wise Task Vector Merging (AdaLTM) based on WavLM-Large: extracts task vectors from separately fine-tuned ASR and SER models, then integrates them into the frozen base model using layer-wise learnable coefficients for depth-aware balancing.
Result: Experiments on MSP-Podcast demonstrate effective mitigation of conflicts between ASR and SER tasks.
Conclusion: AdaLTM provides a novel approach to integrate linguistic and paralinguistic knowledge in speech emotion recognition without gradient interference.
Abstract: Integrating Automatic Speech Recognition (ASR) into Speech Emotion Recognition (SER) enhances modeling by providing linguistic context. However, conventional feature fusion faces performance bottlenecks, and multi-task learning often suffers from optimization conflicts. While task vectors and model merging have addressed such conflicts in NLP and CV, their potential in speech tasks remains largely unexplored. In this work, we propose an Adaptive Layer-wise Task Vector Merging (AdaLTM) framework based on WavLM-Large. Instead of joint optimization, we extract task vectors from in-domain ASR and SER models fine-tuned on emotion datasets. These vectors are integrated into a frozen base model using layer-wise learnable coefficients. This strategy enables depth-aware balancing of linguistic and paralinguistic knowledge across transformer layers without gradient interference. Experiments on the MSP-Podcast corpus demonstrate that the proposed approach effectively mitigates conflicts between ASR and SER.
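The merging step itself is simple: add layer-indexed, learnable multiples of each task vector to the frozen base weights. A minimal sketch, with the coefficient parameterization and state layout as assumptions rather than AdaLTM's code:

```python
# Layer-wise task-vector merging:
#   theta_l = base_l + alpha_l * (asr_l - base_l) + beta_l * (ser_l - base_l)
import torch

def merge_layerwise(base, asr, ser, alpha, beta):
    """base/asr/ser: lists of per-layer weight tensors; alpha/beta: (L,) coeffs."""
    merged = []
    for l, (w_b, w_a, w_s) in enumerate(zip(base, asr, ser)):
        tau_asr = w_a - w_b          # ASR task vector for layer l
        tau_ser = w_s - w_b          # SER task vector for layer l
        merged.append(w_b + alpha[l] * tau_asr + beta[l] * tau_ser)
    return merged

L = 4
base = [torch.randn(8, 8) for _ in range(L)]
asr = [w + 0.1 * torch.randn_like(w) for w in base]
ser = [w + 0.1 * torch.randn_like(w) for w in base]
# Learnable depth-aware coefficients; only these would be trained,
# while the base model stays frozen.
alpha = torch.nn.Parameter(torch.full((L,), 0.5))
beta = torch.nn.Parameter(torch.full((L,), 0.5))
merged = merge_layerwise(base, asr, ser, alpha, beta)
print(len(merged), merged[0].shape)  # 4 torch.Size([8, 8])
```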
[579] Acoustic Imaging for Low-SNR UAV Detection: Dense Beamformed Energy Maps and U-Net SELD
Belman Jahir Rodriguez, Sergio F. Chevtchenko, Marcelo Herrera Martinez, Yeshwant Bethy, Saeed Afshar
Main category: eess.AS
TL;DR: U-Net model for 360° acoustic source localization using spherical semantic segmentation of beamformed audio maps, trained on drone recordings with GPS synchronization.
Details
Motivation: Traditional sound source localization (SSL) methods regress discrete direction-of-arrival angles, which restricts them to point-like estimates. The paper proposes a segmentation-based approach that identifies spatially distributed source regions for more robust acoustic localization, especially for moving sources like drones.
Method: Uses delay-and-sum beamforming on a 24-microphone array to create beamformed audio maps (azimuth and elevation). A modified U-Net performs spherical semantic segmentation on frequency-domain representations of these maps, trained with the Tversky loss to address class imbalance. Post-processing computes centroids of activated regions for direction-of-arrival estimates.
Result: The U-Net model generalizes across environments, provides improved angular precision, and offers array-independent operation that can adapt to different microphone configurations without full retraining.
Conclusion: The segmentation-based approach offers a new paradigm for dense spatial audio understanding beyond traditional SSL, enabling robust acoustic source localization for moving targets like drones with better generalization across environments.
Abstract: We introduce a U-Net model for 360° acoustic source localization formulated as a spherical semantic segmentation task. Rather than regressing discrete direction-of-arrival (DoA) angles, our model segments beamformed audio maps (azimuth and elevation) into regions of active sound presence. Using delay-and-sum (DAS) beamforming on a custom 24-microphone array, we generate signals aligned with drone GPS telemetry to create binary supervision masks. A modified U-Net, trained on frequency-domain representations of these maps, learns to identify spatially distributed source regions while addressing class imbalance via the Tversky loss. Because the network operates on beamformed energy maps, the approach is inherently array-independent and can adapt to different microphone configurations without retraining from scratch. The segmentation outputs are post-processed by computing centroids over activated regions, enabling robust DoA estimates. Our dataset includes real-world open-field recordings of a DJI Air 3 drone, synchronized with 360° video and flight logs across multiple dates and locations. Experimental results show that the U-Net generalizes across environments, providing improved angular precision and offering a new paradigm for dense spatial audio understanding beyond traditional Sound Source Localization (SSL).
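For readers unfamiliar with beamformed energy maps, the sketch below computes a toy delay-and-sum map over an azimuth-elevation grid for a far-field source; the array geometry and grid resolution are illustrative, not the paper's 24-microphone setup.

```python
# Minimal delay-and-sum (DAS) energy map over an azimuth-elevation grid.
import numpy as np

def das_energy_map(signals, mic_pos, fs, c=343.0, n_az=36, n_el=9):
    """signals: (M, T) mic waveforms; mic_pos: (M, 3) positions in meters."""
    az = np.linspace(-np.pi, np.pi, n_az, endpoint=False)
    el = np.linspace(0, np.pi / 2, n_el)
    energy = np.zeros((n_el, n_az))
    for i, e in enumerate(el):
        for j, a in enumerate(az):
            # Unit look direction and per-mic integer sample delays.
            d = np.array([np.cos(e) * np.cos(a), np.cos(e) * np.sin(a), np.sin(e)])
            delays = np.round((mic_pos @ d) / c * fs).astype(int)
            # Align and sum; energy of the beamformed signal for this direction.
            shifted = [np.roll(s, -k) for s, k in zip(signals, delays)]
            energy[i, j] = np.mean(np.sum(shifted, axis=0) ** 2)
    return energy  # (n_el, n_az) map; segment it, then take centroids for DoA

fs = 16000
mics = np.random.randn(24, 3) * 0.1           # toy 24-mic geometry
sig = np.random.randn(24, fs // 10)
print(das_energy_map(sig, mics, fs).shape)    # (9, 36)
```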
eess.IV
[580] Coronary artery calcification assessment in National Lung Screening Trial CT images (DeepCAC2)
Leonard Nürnberg, Simon Bernatz, Borek Foldyna, Michael T. Lu, Andrey Fedorov, Hugo JWL Aerts
Main category: eess.IV
TL;DR: DeepCAC2 provides a large-scale public dataset of automated coronary artery calcification segmentations and risk scores from low-dose chest CT scans, generated using a deep learning pipeline trained on expert-annotated cardiac CT data.
Details
Motivation: Coronary artery calcification (CAC) is a strong predictor of cardiovascular risk but remains underutilized in clinical routine thoracic imaging due to the need for dedicated imaging protocols and manual annotation.
Method: Used a fully automated deep learning pipeline trained on expert-annotated cardiac CT data to process 127,776 CT scans from 26,228 individuals from the National Lung Screening Trial (NLST), generating standardized CAC segmentations and risk estimates.
Result: Created DeepCAC2 dataset containing automated CAC segmentations, coronary artery calcium scores, and derived risk categories from low-dose chest CT scans, with a public dashboard for visual inspection and plans for full public release with DICOM-compatible segmentation objects and structured metadata.
Conclusion: DeepCAC2 provides a transparent, large-scale, public, fully reproducible resource for research in cardiovascular risk assessment, opportunistic screening, and imaging biomarker development.
Abstract: Coronary artery calcification (CAC) is a strong predictor of cardiovascular risk but remains underutilized in routine clinical thoracic imaging due to the need for dedicated imaging protocols and manual annotation. We present DeepCAC2, a publicly available dataset containing automated CAC segmentations, coronary artery calcium scores, and derived risk categories generated from low-dose chest CT scans of the National Lung Screening Trial (NLST). Using a fully automated deep learning pipeline trained on expert-annotated cardiac CT data, we processed 127,776 CT scans from 26,228 individuals and generated standardized CAC segmentations and risk estimates for each acquisition. We also provide a public dashboard as a simple tool for visually inspecting a random subset of 200 NLST patients from the dataset. The dataset will be released with DICOM-compatible segmentation objects and structured metadata to support reproducible downstream analysis. The deep learning pipeline will be made publicly available as a DICOM-compatible MHub.ai container. DeepCAC2 provides a transparent, large-scale, public, fully reproducible resource for research in cardiovascular risk assessment, opportunistic screening, and imaging biomarker development.
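Risk categories in such pipelines typically derive from the standard Agatston score: per-lesion area times a density weight from peak HU (130-199 → 1, 200-299 → 2, 300-399 → 3, ≥400 → 4), summed over slices. The sketch below shows that conventional rule with the segmentation assumed given; it is not the DeepCAC2 pipeline itself, whose segmentations come from a trained deep learning model.

```python
# Conventional Agatston scoring for one CT slice, given a CAC segmentation.
import numpy as np
from scipy import ndimage

def agatston_slice_score(hu_slice, cac_mask, pixel_area_mm2):
    """hu_slice: 2D array of Hounsfield units; cac_mask: 2D boolean mask."""
    score = 0.0
    labels, n = ndimage.label(cac_mask)          # connected calcified lesions
    for lesion in range(1, n + 1):
        region = labels == lesion
        peak_hu = hu_slice[region].max()
        if peak_hu < 130:                        # below calcification threshold
            continue
        # Standard density weighting by peak attenuation.
        weight = 1 if peak_hu < 200 else 2 if peak_hu < 300 else 3 if peak_hu < 400 else 4
        score += region.sum() * pixel_area_mm2 * weight
    return score

hu = np.random.randint(-100, 600, size=(64, 64))
mask = hu > 130                                  # toy stand-in for a model's mask
print(agatston_slice_score(hu, mask, pixel_area_mm2=0.5))
```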
[581] Subject-Specific Low-Field MRI Synthesis via a Neural Operator
Ziqi Gao, Nicha Dvornek, Xiaoran Zhang, Gigi Galiana, Hemant Tagare, Todd Constable
Main category: eess.IV
TL;DR: H2LO is a neural operator framework that learns to simulate low-field MRI images from high-field MRI by modeling contrast degradation and noise textures, outperforming existing methods and improving downstream enhancement tasks.
Details
Motivation: Low-field MRI improves accessibility but has lower signal-to-noise ratios and degraded contrast compared to high-field MRI. Existing simulators using noise injection and smoothing fail to capture real contrast degradation, limiting virtual evaluation of devices and algorithm development.
Method: Introduces H2LO (HF to LF coordinate-image decoupled neural operator), an end-to-end framework that learns HF to LF image degradation from paired HF-LF MRIs. Uses a novel coordinate-image decoupled approach to capture high-frequency noise textures and image structure degradation.
Result: H2LO produces more faithful simulated low-field images than existing parameterized noise synthesis models and popular image-to-image translation models in T1w and T2w MRI. It also improves performance in downstream image enhancement tasks.
Conclusion: The H2LO framework effectively simulates low-field MRI degradation, showing potential to enhance LF MRI diagnostic capabilities by enabling better virtual evaluation and algorithm development.
Abstract: Low-field (LF) magnetic resonance imaging (MRI) improves accessibility and reduces costs but generally has lower signal-to-noise ratios and degraded contrast compared to high field (HF) MRI, limiting its clinical utility. Simulating LF MRI from HF MRI enables virtual evaluation of novel imaging devices and development of LF algorithms. Existing low field simulators rely on noise injection and smoothing, which fail to capture the contrast degradation seen in LF acquisitions. To this end, we introduce an end-to-end LF-MRI synthesis framework that learns HF to LF image degradation directly from a small number of paired HF-LF MRIs. Specifically, we introduce a novel HF to LF coordinate-image decoupled neural operator (H2LO) to model the underlying degradation process, and tailor it to capture high-frequency noise textures and image structure. Experimental results in T1w and T2w MRI demonstrate that H2LO produces more faithful simulated low-field images than existing parameterized noise synthesis models and popular image-to-image translation models. Furthermore, it improves performance in downstream image enhancement tasks, showcasing its potential to enhance LF MRI diagnostic capabilities.
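The paper describes the coordinate-image decoupling only at a high level; one plausible reading is a coordinate MLP that modulates image features FiLM-style, so the simulated noise texture can vary with position. The sketch below is an illustrative guess under that assumption, not the H2LO architecture.

```python
# Illustrative coordinate-image decoupled degradation operator: a coordinate
# MLP produces per-pixel scale/shift applied to image features.
import torch
import torch.nn as nn

class CoordModulatedDegrader(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.img_net = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(ch, ch, 3, padding=1))
        self.coord_mlp = nn.Sequential(nn.Linear(2, 32), nn.ReLU(),
                                       nn.Linear(32, 2 * ch))
        self.out = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, hf_img):
        b, _, h, w = hf_img.shape
        feats = self.img_net(hf_img)                      # image pathway
        # Normalized (y, x) grid, processed independently of the image.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        coords = torch.stack([ys, xs], dim=-1).view(-1, 2)
        scale, shift = self.coord_mlp(coords).view(h, w, 2, -1).unbind(dim=2)
        scale = scale.permute(2, 0, 1).unsqueeze(0)       # (1, ch, h, w)
        shift = shift.permute(2, 0, 1).unsqueeze(0)
        return self.out(feats * (1 + scale) + shift)      # simulated LF image

lf = CoordModulatedDegrader()(torch.randn(2, 1, 32, 32))
print(lf.shape)  # torch.Size([2, 1, 32, 32])
```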
[582] Underdetermined Blind Source Separation via Weighted Simplex Shrinkage Regularization and Quantum Deep Image Prior
Chia-Hsiang Lin, Si-Sheng Young
Main category: eess.IV
TL;DR: Proposes GQ-μ algorithm for multispectral unmixing using quantum deep image prior to generate virtual hyperspectral images and weighted simplex shrinkage regularization for ill-posed problem.
Details
Motivation: Multispectral unmixing is challenging because it is an underdetermined blind source separation problem, where the number of sources exceeds the number of available bands. The paper transforms it into an overdetermined hyperspectral unmixing problem.
Method: 1) Use quantum deep image prior (QDIP) for virtual band-splitting to generate virtual hyperspectral images from multispectral images. 2) Apply hyperspectral unmixing with weighted simplex shrinkage regularization to mitigate ill-posedness. 3) Spectrally downsample the virtual hyperspectral sources to obtain the multispectral sources.
Result: Simulation and real-world experiments demonstrate practicality of unsupervised GQ-μ algorithm for challenging multispectral unmixing tasks. Ablation study shows QDIP outperforms classical DIP and validates WSS geometry regularizer.
Conclusion: Proposed geometry/quantum-empowered MU algorithm effectively solves underdetermined multispectral unmixing by transforming it to overdetermined hyperspectral unmixing with quantum deep image prior and geometric regularization.
Abstract: As most optical satellites remotely acquire multispectral images (MSIs) with limited spatial resolution, multispectral unmixing (MU) becomes a critical signal processing technology for analyzing the pure material spectra for high-precision classification and identification. Unlike the widely investigated hyperspectral unmixing (HU) problem, MU is much more challenging as it corresponds to the underdetermined blind source separation (BSS) problem, where the number of sources is larger than the number of available multispectral bands. In this article, we transform MU into its overdetermined counterpart (i.e., HU) by inventing a radically new quantum deep image prior (QDIP), which relies on the virtual band-splitting task conducted on the observed MSI for generating the virtual hyperspectral image (HSI). Then, we perform HU on the virtual HSI to obtain the virtual hyperspectral sources. Though HU is overdetermined, it still suffers from the ill-posed issue, for which we employ the convex geometry structure of the HSI pixels to customize a weighted simplex shrinkage (WSS) regularizer to mitigate the ill-posedness. Finally, the virtual hyperspectral sources are spectrally downsampled to obtain the desired multispectral sources. The proposed geometry/quantum-empowered MU (GQ-μ) algorithm can also effectively obtain the spatial abundance distribution map for each source, where the geometric WSS regularization is adaptively and automatically controlled based on the sparsity pattern of the abundance tensor. Simulation and real-world data experiments demonstrate the practicality of our unsupervised GQ-μ algorithm for the challenging MU task. An ablation study demonstrates the strength of QDIP, not achieved by classical DIP, and validates the mechanics-inspired WSS geometry regularizer.
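The simplex geometry at the heart of the WSS regularizer is worth making concrete: abundance vectors are nonnegative and sum to one. The sketch below penalizes the weighted squared distance of each estimate to its simplex projection; the paper's adaptive, sparsity-driven weighting is more elaborate than this illustration.

```python
# Simplex projection (Duchi et al., 2008) plus a weighted shrinkage penalty
# pulling abundance estimates toward the probability simplex.
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {a : a >= 0, sum(a) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def simplex_shrinkage_penalty(abundances, weights):
    """Weighted squared distance of each abundance vector to the simplex."""
    return sum(w * np.sum((a - project_simplex(a)) ** 2)
               for a, w in zip(abundances, weights))

a = [np.random.randn(5) for _ in range(3)]
w = np.ones(3)   # the paper adapts these weights to abundance sparsity
print(simplex_shrinkage_penalty(a, w))
```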
[583] Language-Free Generative Editing from One Visual Example
Omar Elezabi, Eduard Zamfir, Zongwei Wu, Radu Timofte
Main category: eess.IV
TL;DR: VDC is a training-free framework for image editing that learns conditioning signals directly from visual examples instead of text, enabling precise transformations like adding rain or blur effects without language supervision.
Details
Motivation: Current text-guided diffusion models struggle with simple everyday transformations like rain or blur due to weak textual supervision during training. Existing solutions require expensive finetuning or stronger text conditioning. The authors argue that diffusion editing capabilities exist but are hidden from text, and propose a vision-centric approach that reasons about visual changes like humans do.
Method: Visual Diffusion Conditioning (VDC) learns conditioning signals directly from paired visual examples (one image with and one without the target effect). It uses a novel condition-steering mechanism to guide generation and includes an inversion-correction step to mitigate reconstruction errors during DDIM inversion, preserving detail and realism.
Result: VDC outperforms both training-free and fully fine-tuned text-based editing methods across diverse tasks, demonstrating superior performance for visual transformations like adding rain or blur effects.
Conclusion: The paper introduces a vision-centric paradigm for image editing that bypasses language limitations, offering cost-efficient and precise editing capabilities by learning directly from visual examples rather than text.
Abstract: Text-guided diffusion models have advanced image editing by enabling intuitive control through language. However, despite their strong capabilities, we surprisingly find that SOTA methods struggle with simple, everyday transformations such as rain or blur. We attribute this limitation to weak and inconsistent textual supervision during training, which leads to poor alignment between language and vision. Existing solutions often rely on extra finetuning or stronger text conditioning, but suffer from high data and computational requirements. We argue that diffusion-based editing capabilities aren’t lost but merely hidden from text. The door to cost-efficient visual editing remains open, and the key lies in a vision-centric paradigm that perceives and reasons about visual change as humans do, beyond words. Inspired by this, we introduce Visual Diffusion Conditioning (VDC), a training-free framework that learns conditioning signals directly from visual examples for precise, language-free image editing. Given a paired example (one image with and one without the target effect), VDC derives a visual condition that captures the transformation and steers generation through a novel condition-steering mechanism. An accompanying inversion-correction step mitigates reconstruction errors during DDIM inversion, preserving fine detail and realism. Across diverse tasks, VDC outperforms both training-free and fully fine-tuned text-based editing methods. The code and models are open-sourced at https://omaralezaby.github.io/vdc/
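DDIM inversion, which the correction step targets, runs the deterministic sampler backwards from a clean image to a latent that later editing re-synthesizes from. A minimal sketch of the standard inversion loop follows; VDC's condition-steering and correction mechanisms are not shown, and the denoiser and schedule are dummies.

```python
# Standard DDIM inversion: step deterministically toward higher noise.
import torch

def ddim_invert(x0, denoiser, alpha_bars):
    """alpha_bars: (T,) cumulative alphas, decreasing; returns the final latent."""
    x = x0
    for t in range(len(alpha_bars) - 1):
        ab_t, ab_next = alpha_bars[t], alpha_bars[t + 1]
        eps = denoiser(x, t)                                  # predicted noise
        x0_hat = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()  # implied clean image
        # Deterministic DDIM step run in reverse (toward higher noise).
        x = ab_next.sqrt() * x0_hat + (1 - ab_next).sqrt() * eps
    return x

alpha_bars = torch.linspace(0.999, 0.1, 20)
denoiser = lambda x, t: 0.05 * x                              # dummy stand-in
latent = ddim_invert(torch.randn(1, 3, 16, 16), denoiser, alpha_bars)
print(latent.shape)  # torch.Size([1, 3, 16, 16])
```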
[584] A Mamba-based Perceptual Loss Function for Learning-based UGC Transcoding
Zihao Qi, Chen Feng, Fan Zhang, Xiaozhong Xu, Shan Liu, David Bull
Main category: eess.IV
TL;DR: A novel perceptually inspired loss function for UGC video transcoding that uses a Mamba-based neural quality model to improve perceptual quality rather than just pixel fidelity.
Details
Motivation: Traditional video compression optimizes for pixel fidelity relative to reference, but this forces codecs to replicate artifacts from degraded UGC sources. Need a new approach that treats reference as contextual guide rather than ground truth.
Method: Train lightweight neural quality model using Selective Structured State-Space Model (Mamba) with weakly-supervised Siamese ranking strategy. Integrate this model into rate-distortion optimization of neural video codecs (DCVC and HiNeRV) as perceptual loss function.
Result: Achieves substantial coding gains: 8.46% BD-rate savings over autoencoder baselines and 12.89% BD-rate savings over implicit neural representation-based baselines.
Conclusion: Proposed framework successfully improves perceptual quality in UGC video transcoding by redefining reference role and integrating learned perceptual metrics into compression optimization.
Abstract: In user-generated content (UGC) transcoding, source videos typically suffer various degradations due to prior compression, editing, or suboptimal capture conditions. Consequently, existing video compression paradigms that solely optimize for fidelity relative to the reference become suboptimal, as they force the codec to replicate the inherent artifacts of the non-pristine source. To address this, we propose a novel perceptually inspired loss function for learning-based UGC video transcoding that redefines the role of the reference video, shifting it from a ground-truth pixel anchor to an informative contextual guide. Specifically, we train a lightweight neural quality model based on a Selective Structured State-Space Model (Mamba) optimized using a weakly-supervised Siamese ranking strategy. The proposed model is then integrated into the rate-distortion optimization (RDO) process of two neural video codecs (DCVC and HiNeRV) as a loss function, aiming to generate reconstructed content with improved perceptual quality. Our experiments demonstrate that this framework achieves substantial coding gains over both autoencoder and implicit neural representation-based baselines, with 8.46% and 12.89% BD-rate savings, respectively.
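Two ingredients here can be sketched concretely: the Siamese ranking objective that trains the quality model Q, and the use of a frozen Q inside a rate-distortion loss. The stand-in for the Mamba-based model and all weightings below are assumptions, not the paper's configuration.

```python
# (a) Weakly supervised Siamese ranking to train a quality model Q, and
# (b) a rate-distortion loss that rewards perceptually better reconstructions.
import torch
import torch.nn as nn

quality_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))  # stand-in for Q
rank_loss = nn.MarginRankingLoss(margin=0.1)

# (a) For a pair where x_better has (weakly labeled) higher quality than
# x_worse, push Q(x_better) above Q(x_worse) by the margin.
x_better, x_worse = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
target = torch.ones(4, 1)  # +1 means the first input should rank higher
l_rank = rank_loss(quality_model(x_better), quality_model(x_worse), target)

# (b) Transcoding loss: rate + distortion, with the frozen quality model
# rewarding perceptual quality instead of pure pixel fidelity to the source.
def rd_perceptual_loss(x_hat, x_ref, rate_bits, lam=0.01, mu=0.1):
    distortion = torch.mean((x_hat - x_ref) ** 2)
    perceptual = -quality_model(x_hat).mean()   # higher Q => lower loss
    return distortion + mu * perceptual + lam * rate_bits

print(l_rank.item(), rd_perceptual_loss(x_worse, x_better, rate_bits=1000.0).item())
```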
[585] Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos
Abdullah Hamdi, Changchun Yang, Xin Gao
Main category: eess.IV
TL;DR: Colon-Bench: A comprehensive benchmark dataset for evaluating multimodal LLMs on colonoscopy videos with dense annotations including lesion categories, bounding boxes, segmentation masks, and clinical descriptions.
Details
Motivation: Existing datasets for colonoscopy AI lack dense annotations and long-sequence video data needed to properly evaluate modern multimodal large language models (MLLMs) in medical domains.
Method: Developed a multi-stage agentic workflow integrating temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure colonoscopy videos.
Result: Created a benchmark with 528 videos, 14 lesion categories, over 300K bounding boxes, 213K segmentation masks, and 133K words of clinical descriptions. MLLMs showed surprisingly high localization performance compared to SAM-3, and a novel “colon-skill” prompting strategy improved zero-shot performance by up to 9.7%.
Conclusion: Colon-Bench addresses critical gaps in medical multimodal AI evaluation and demonstrates MLLMs’ potential in medical video understanding, with implications for improving colon cancer screening.
Abstract: Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel “colon-skill” prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs. The dataset and the code are available at https://abdullahamdi.com/colon-bench .
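The workflow's staged structure can be summarized as composed stages with a human gate at the end. The sketch below is purely schematic: all function bodies are placeholders standing in for the paper's actual agents and models.

```python
# Schematic of a multi-stage annotation workflow: temporal proposals, box
# tracking, AI visual confirmation, then human-in-the-loop review.
from dataclasses import dataclass, field

@dataclass
class Candidate:
    start_frame: int
    end_frame: int
    boxes: dict = field(default_factory=dict)   # frame -> (x, y, w, h)
    label: str = ""
    confirmed: bool = False

def propose_temporal_segments(video):           # stage 1: when might a lesion appear?
    return [Candidate(100, 160), Candidate(420, 480)]

def track_boxes(video, cand):                   # stage 2: follow the region per frame
    cand.boxes = {f: (10, 10, 40, 40) for f in range(cand.start_frame, cand.end_frame)}
    return cand

def ai_confirm(video, cand):                    # stage 3: a vision model vets the track
    cand.confirmed, cand.label = True, "polyp"
    return cand

def human_review(cand):                         # stage 4: human-in-the-loop gate
    return cand if cand.confirmed else None

def annotate(video):
    out = []
    for cand in propose_temporal_segments(video):
        cand = ai_confirm(video, track_boxes(video, cand))
        if (reviewed := human_review(cand)) is not None:
            out.append(reviewed)
    return out

print(len(annotate(video="procedure_001.mp4")))  # 2 verified annotations
```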