Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] Multi-Source Evidence Fusion for Audio Question Answering
Aivo Olev, Tanel Alumäe
Main category: eess.AS
TL;DR: TalTech’s winning solution for the Interspeech 2026 Audio Reasoning Challenge uses a multi-source ensemble of LALMs and acoustic tools to produce verifiable reasoning chains about audio content.
Details
Motivation: Large audio language models (LALMs) can answer questions about audio content, but their internal reasoning is opaque and difficult to validate. The challenge requires evaluating reasoning process quality: factual accuracy, logical soundness, and completeness of reasoning chains.
Method: Multi-source ensemble pipeline using two LALMs to generate independent observations, with a separate text-only reasoning model cross-checking these against outputs from 25 acoustic tools organized into reliability tiers. Every inference step is grounded in explicit, reliability-tagged evidence (see the sketch below).
Result: The system ranked first in the Interspeech 2026 Audio Reasoning Challenge, outperforming all competing systems by a wide margin in the challenge’s reasoning quality metric.
Conclusion: By grounding inferences in explicit evidence from multiple sources with reliability tagging, the system produces dense, verifiable reasoning chains that address the opacity problem in LALMs.
Abstract: Large audio language models (LALMs) can answer questions about speech, music, and environmental sounds, yet their internal reasoning is largely opaque and difficult to validate. We describe TalTech’s solution to the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, in which systems are evaluated on reasoning process quality, specifically the factual accuracy, logical soundness, and completeness of their reasoning chains. Our multi-source ensemble pipeline uses two LALMs that generate independent observations, while a separate text-only reasoning model cross-checks these against outputs from 25 acoustic tools organized into reliability tiers. By grounding every inference step in explicit, reliability-tagged evidence, the system produces dense, verifiable reasoning chains. Our system ranked first in the challenge, outperforming all competing systems by a wide margin in the challenge’s reasoning quality metric.
Relevance: 9/10
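The system itself is not reproduced here, but the reliability-tiering idea is easy to picture in code. The sketch below is a minimal, hypothetical reconstruction: the `Evidence` dataclass, the tier assignments, and the corroboration rule are all invented for illustration and are not taken from TalTech's actual pipeline.

```python
from dataclasses import dataclass

# Hypothetical reliability tiers: lower number = more trustworthy source.
TIERS = {"signal_tool": 0, "asr_tool": 1, "lalm_a": 2, "lalm_b": 2}

@dataclass
class Evidence:
    source: str   # which LALM or acoustic tool produced the observation
    claim: str    # a normalized textual observation, e.g. "tempo ~ 120 bpm"
    tier: int     # reliability tier copied from TIERS

def cross_check(observations: list[Evidence]) -> list[Evidence]:
    """Keep LALM claims only when corroborated by a more reliable tool,
    and always keep tier-0/1 tool outputs."""
    tool_claims = {e.claim for e in observations if e.tier <= 1}
    return [e for e in observations if e.tier <= 1 or e.claim in tool_claims]

obs = [
    Evidence("signal_tool", "tempo ~ 120 bpm", TIERS["signal_tool"]),
    Evidence("lalm_a", "tempo ~ 120 bpm", TIERS["lalm_a"]),
    Evidence("lalm_b", "genre = jazz", TIERS["lalm_b"]),  # uncorroborated
]
print([e.claim for e in cross_check(obs)])  # drops the uncorroborated claim
```

A downstream reasoning model could then cite each surviving claim with its tier tag, which is what makes the resulting chain auditable.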
[2] Neuron-Level Emotion Control in Speech-Generative Large Audio-Language Models
Xiutian Zhao, Ismail Rasim Ulgen, Philipp Koehn, Björn Schuller, Berrak Sisman
Main category: cs.CL
TL;DR: The paper presents a neuron-level study of emotion control in speech-generative large audio-language models, identifying compact emotion-sensitive neurons that enable training-free emotion steering at inference time.
Details
Motivation: Current large audio-language models can produce expressive speech but lack reliable emotion control, often missing target affects and degrading linguistic fidelity through refusals, hallucinations, or paraphrasing.
Method: Identifies emotion-sensitive neurons via success-filtered activation aggregation that enforces both emotion realization and content preservation. Uses these neurons for training-free emotion steering interventions across three LALMs: Qwen2.5-Omni-7B, MiniCPM-o 4.5, and Kimi-Audio (see the sketch below).
Result: ESN interventions yield emotion-specific gains that generalize to unseen speakers, supported by both automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and intervention strength.
Conclusion: Establishes a mechanistic framework for training-free emotion control in speech generation through neuron-level interventions.
Abstract: Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time. ESNs are identified via success-filtered activation aggregation enforcing both emotion realization and content preservation. Across three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio), ESN interventions yield emotion-specific gains that generalize to unseen speakers and are supported by automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and intervention strength. Our results establish a mechanistic framework for training-free emotion control in speech generation.
Relevance: 9/10
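For readers unfamiliar with activation-level interventions, the following is a minimal sketch of training-free neuron steering in PyTorch. The hook mechanics are standard PyTorch; `esn_idx`, `direction`, and `alpha` are placeholders for the paper's selector outputs, and the layer path in the usage comment is hypothetical.

```python
import torch

def make_esn_hook(esn_idx: torch.Tensor, direction: torch.Tensor, alpha: float):
    """Forward hook that shifts the activations of selected emotion-sensitive
    neurons along a precomputed steering direction at inference time."""
    def hook(module, inputs, output):
        steered = output.clone()
        # esn_idx: 1-D indices of ESNs in this layer; direction: same length.
        steered[..., esn_idx] += alpha * direction
        return steered  # returning a tensor replaces the layer's output
    return hook

# Hypothetical usage on one MLP layer of a loaded speech-generative LALM:
# layer = model.model.layers[k].mlp
# handle = layer.register_forward_hook(make_esn_hook(esn_idx, direction, alpha=4.0))
# ...generate speech with the intervention active...
# handle.remove()
```

The reported dependence on mask sparsity and intervention strength corresponds to how many indices go into `esn_idx` and how large `alpha` is.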
[3] HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin
Main category: cs.CV
TL;DR: HopChain is a framework for synthesizing multi-hop vision-language reasoning data to improve VLMs’ fine-grained reasoning capabilities through RLVR training, addressing compound errors in long CoT reasoning.
Details
Motivation: VLMs struggle with fine-grained vision-language reasoning, especially in long chain-of-thought reasoning where perception, reasoning, knowledge, and hallucination errors can compound across steps. Existing RLVR training data lacks complex reasoning chains that rely on visual evidence throughout.
Method: HopChain synthesizes multi-hop vision-language reasoning data where each query forms logically dependent chains of instance-grounded hops. Earlier hops establish instances, sets, or conditions needed for later hops, with final answers as specific numbers for verifiable rewards in RLVR training (see the sketch below).
Result: Adding HopChain’s multi-hop data to RLVR training improved 20 out of 24 benchmarks across STEM/Puzzle, General VQA, Text Recognition/Document Understanding, and Video Understanding. Multi-hop training significantly outperformed half-multi-hop and single-hop variants, with gains peaking at over 50 accuracy points in ultra-long-CoT reasoning.
Conclusion: HopChain is an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning in VLMs, demonstrating that full chained queries are crucial for addressing compound errors in long CoT reasoning.
Abstract: VLMs show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long CoT reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for RLVR does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare against RLVR on the original RLVR data alone across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized to target any specific benchmark, adding it improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. To demonstrate that full chained queries are important, we replace them with half-multi-hop or single-hop variants, reducing the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively. Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.
Relevance: 9/10
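The verifiable-reward side of this design is straightforward to illustrate. Below is a hedged sketch of an RLVR-style numeric reward check; the extraction regex and the example query are illustrative and not drawn from HopChain's released pipeline.

```python
import re

def numeric_reward(response: str, gold: float, tol: float = 1e-6) -> float:
    """Binary verifiable reward: 1.0 iff the last number in the response
    matches the gold answer, the kind of check RLVR training relies on."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", response)
    return float(bool(nums) and abs(float(nums[-1]) - gold) <= tol)

# A multi-hop query chains instance-grounded hops, each depending on the last
# (hypothetical example, not from the HopChain dataset):
query = ("Among the animals standing on the red mat, find the largest one; "
         "how many stripes does it have?")
print(numeric_reward("The zebra has 26 stripes, so the answer is 26.", 26))  # 1.0
```

Keeping the final answer numeric and unambiguous is what lets the full chained query be rewarded without an LLM judge.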
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 106]
- cs.CV [Total: 244]
- cs.AI [Total: 99]
- cs.SD [Total: 4]
- cs.LG [Total: 146]
- cs.MA [Total: 9]
- cs.MM [Total: 2]
- eess.AS [Total: 14]
- eess.IV [Total: 9]
cs.CL
[1] Trust, Safety, and Accuracy: Assessing LLMs for Routine Maternity Advice
V Sai Divya, A Bhanusree, Rimjhim, K Venkata Krishna Rao
Main category: cs.CL
TL;DR: LLMs like ChatGPT-4o, Perplexity AI, and GeminiAI show promise for providing maternal health information in rural India, with Perplexity matching expert semantics and ChatGPT-4o offering better clarity and medical terminology.
Details
Motivation: Addressing the challenge of reliable access to maternal healthcare information in rural India, where medical resources are limited despite growing internet penetration among rural women.
Method: Evaluated three LLMs (ChatGPT-4o, Perplexity AI, GeminiAI) on 17 pregnancy-focused questions, comparing responses with those of maternal health professionals using semantic similarity, noun overlap, and readability metrics (see the sketch below).
Result: Perplexity AI closely matched expert semantics, while ChatGPT-4o produced clearer, more understandable text with better medical terminology. LLMs show potential as scalable aids for maternal health education.
Conclusion: LLMs could serve as scalable tools for maternal health education in underserved regions, highlighting the need for AI tools that balance accuracy and clarity in healthcare communication.
Abstract: Access to reliable maternal healthcare information is a major challenge in rural India due to limited medical resources and infrastructure. With over 830 million internet users and nearly half of rural women online, digital tools offer new opportunities for health education. This study evaluates large language models (LLMs) like ChatGPT-4o, Perplexity AI, and GeminiAI to provide reliable and understandable pregnancy-related information. Seventeen pregnancy-focused questions were posed to each model and compared with responses from maternal health professionals. Evaluations used semantic similarity, noun overlap, and readability metrics to measure content quality. Results show Perplexity closely matched expert semantics, while ChatGPT-4o produced clearer, more understandable text with better medical terminology. As internet access grows in rural areas, LLMs could serve as scalable aids for maternal health education. The study highlights the need for AI tools that balance accuracy and clarity to improve healthcare communication in underserved regions.
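The three evaluation metrics named here are all standard and easy to reproduce. The sketch below shows one plausible implementation using sentence-transformers, spaCy, and textstat; the specific embedding model is an assumption, since the summary does not name the one used.

```python
import spacy
import textstat
from sentence_transformers import SentenceTransformer, util

nlp = spacy.load("en_core_web_sm")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def compare(llm_answer: str, expert_answer: str) -> dict:
    # Semantic similarity: cosine between sentence embeddings.
    sim = util.cos_sim(embedder.encode(llm_answer),
                       embedder.encode(expert_answer)).item()
    # Noun overlap: Jaccard index over lemmatized nouns.
    nouns = lambda text: {t.lemma_.lower() for t in nlp(text) if t.pos_ == "NOUN"}
    a, b = nouns(llm_answer), nouns(expert_answer)
    overlap = len(a & b) / max(len(a | b), 1)
    # Readability: Flesch Reading Ease (higher = easier to read).
    return {"semantic_sim": sim, "noun_overlap": overlap,
            "readability": textstat.flesch_reading_ease(llm_answer)}
```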
[2] Enhancing Financial Report Question-Answering: A Retrieval-Augmented Generation System with Reranking Analysis
Zhiyuan Cheng, Longying Lai, Yue Liu, Kai Cheng, Xiaoxi Qi
Main category: cs.CL
TL;DR: RAG system with neural reranking improves financial report question answering, showing a 15.5 percentage point improvement in correctness over the no-reranking baseline.
Details
Motivation: Financial analysts struggle to extract information from lengthy 10-K reports (often 100+ pages) and need efficient question answering systems.
Method: Retrieval-Augmented Generation (RAG) system with hybrid search (full-text + semantic retrieval) and optional neural reranking using a cross-encoder model, evaluated on the FinDER benchmark dataset (see the sketch below).
Result: Reranking significantly improves answer quality: 49.0% correctness (scores ≥8) vs. 33.5% without reranking (a 15.5 percentage point improvement), and reduces completely incorrect answers from 35.3% to 22.5%.
Conclusion: Neural reranking plays critical role in financial RAG systems, with modern language models and refined retrieval strategies outperforming baseline methods.
Abstract: Financial analysts face significant challenges extracting information from lengthy 10-K reports, which often exceed 100 pages. This paper presents a Retrieval-Augmented Generation (RAG) system designed to answer questions about S&P 500 financial reports and evaluates the impact of neural reranking on system performance. Our pipeline employs hybrid search combining full-text and semantic retrieval, followed by an optional reranking stage using a cross-encoder model. We conduct systematic evaluation using the FinDER benchmark dataset, comprising 1,500 queries across five experimental groups. Results demonstrate that reranking significantly improves answer quality, achieving 49.0 percent correctness for scores of 8 or above compared to 33.5 percent without reranking, representing a 15.5 percentage point improvement. Additionally, the error rate for completely incorrect answers decreases from 35.3 percent to 22.5 percent. Our findings emphasize the critical role of reranking in financial RAG systems and demonstrate performance improvements over baseline methods through modern language models and refined retrieval strategies.
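The reranking stage described here follows a common pattern: a cross-encoder jointly scores each (query, passage) pair produced by the hybrid search. A minimal sketch follows; the checkpoint name is illustrative, as the paper's summary does not specify which cross-encoder was used.

```python
from sentence_transformers import CrossEncoder

# Illustrative cross-encoder; the paper does not name its exact model here.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], top_k: int = 5) -> list[str]:
    """Score (query, passage) pairs jointly and keep the best top_k,
    applied after hybrid full-text + semantic retrieval."""
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]
```

Because the cross-encoder reads query and passage together, it can catch relevance signals that the first-stage retrievers miss, which is consistent with the reported drop in completely incorrect answers.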
[3] Rubric-Guided Fine-tuning of SpeechLLMs for Multi-Aspect, Multi-Rater L2 Reading-Speech Assessment
Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik
Main category: cs.CL
TL;DR: A rubric-guided reasoning framework for L2 speech assessment using uncertainty-calibrated SpeechLLMs that aligns with human raters on accuracy, fluency, and prosody criteria.
Details
Motivation: Large speech-language models struggle to align with nuanced human rating variability in second-language speech assessment, requiring more reliable and interpretable automated assessment methods.
Method: Fine-tune Qwen2-Audio-7B-Instruct using multi-rater human judgments, develop uncertainty-calibrated regression with conformal calibration for interpretable confidence intervals, and implement Gaussian uncertainty modeling (see the sketch below).
Result: The approach achieves strongest alignment with human ratings, outperforming regression and classification baselines, reliably assesses fluency and prosody, but highlights difficulty in assessing accuracy.
Conclusion: Rubric-guided, uncertainty-calibrated reasoning offers a principled path toward trustworthy and explainable SpeechLLM-based speech assessment.
Abstract: Reliable and interpretable automated assessment of second-language (L2) speech remains a central challenge, as large speech-language models (SpeechLLMs) often struggle to align with the nuanced variability of human raters. To address this, we introduce a rubric-guided reasoning framework that explicitly encodes multi-aspect human assessment criteria: accuracy, fluency, and prosody, while calibrating model uncertainty to capture natural rating variability. We fine-tune the Qwen2-Audio-7B-Instruct model using multi-rater human judgments and develop an uncertainty-calibrated regression approach supported by conformal calibration for interpretable confidence intervals. Our Gaussian uncertainty modeling and conformal calibration approach achieves the strongest alignment with human ratings, outperforming regression and classification baselines. The model reliably assesses fluency and prosody while highlighting the inherent difficulty of assessing accuracy. Together, these results demonstrate that rubric-guided, uncertainty-calibrated reasoning offers a principled path toward trustworthy and explainable SpeechLLM-based speech assessment.
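Conformal calibration of this kind typically follows the split-conformal recipe: score absolute residuals on a held-out calibration set, then widen each test prediction by the appropriate empirical quantile. A generic sketch follows; it is the textbook procedure, not necessarily the paper's exact variant.

```python
import numpy as np

def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split conformal regression: absolute residuals on a held-out
    calibration set yield distribution-free (1 - alpha) prediction intervals."""
    scores = np.abs(np.asarray(cal_true) - np.asarray(cal_pred))
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    q = np.quantile(scores, level, method="higher")
    return test_pred - q, test_pred + q  # interval around each predicted score
```

Wider intervals on the accuracy criterion than on fluency or prosody would be one concrete way the reported "difficulty of assessing accuracy" shows up.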
[4] LLM NL2SQL Robustness: Surface Noise vs. Linguistic Variation in Traditional and Agentic Settings
Lifu Tu, Rongguang Wang, Tao Sheng, Sujjith Ravi, Dan Roth
Main category: cs.CL
TL;DR: A robustness evaluation benchmark for NL2SQL systems with 10 perturbation types tested on state-of-the-art LLMs, revealing performance degradation with surface-level noise and linguistic variation.
Details
Motivation: Real-world database environments are dynamic, noisy, and continuously evolving, but conventional NL2SQL benchmarks assume static schemas and well-formed inputs, creating a need for robustness evaluation.
Method: Introduced a robustness evaluation benchmark with approximately 10 types of perturbations; evaluated multiple state-of-the-art LLMs (Grok-4.1, Gemini-3-Pro, Claude-Opus-4.6, GPT-5.2) under both traditional and agentic settings (see the sketch below).
Result: Models maintain strong performance under several perturbations but show notable degradation for surface-level noise (character-level corruption) and linguistic variation that preserves semantics while altering lexical/syntactic forms. Surface-level noise causes larger drops in traditional pipelines, while linguistic variation challenges agentic settings more.
Conclusion: Robust NL2SQL systems still face challenges, particularly in handling linguistic variability, highlighting the need for improved robustness against both surface-level noise and semantic-preserving linguistic variations.
Abstract: Robustness evaluation for Natural Language to SQL (NL2SQL) systems is essential because real-world database environments are dynamic, noisy, and continuously evolving, whereas conventional benchmark evaluations typically assume static schemas and well-formed user inputs. In this work, we introduce a robustness evaluation benchmark containing approximately ten types of perturbations and conduct evaluations under both traditional and agentic settings. We assess multiple state-of-the-art large language models (LLMs), including Grok-4.1, Gemini-3-Pro, Claude-Opus-4.6, and GPT-5.2. Our results show that these models generally maintain strong performance under several perturbations; however, notable performance degradation is observed for surface-level noise (e.g., character-level corruption) and linguistic variation that preserves semantics while altering lexical or syntactic forms. Furthermore, we observe that surface-level noise causes larger performance drops in traditional pipelines, whereas linguistic variation presents greater challenges in agentic settings. These findings highlight the remaining challenges in achieving robust NL2SQL systems, particularly in handling linguistic variability.
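As a concrete picture of what surface-level noise means here, the sketch below implements one plausible character-level corruption (random adjacent swaps and deletions). The exact perturbation set and rates used in the benchmark are not specified in this summary, so treat this as illustrative only.

```python
import random

def char_noise(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Surface-level noise: randomly swap or drop characters, the kind of
    perturbation the benchmark found most damaging for traditional pipelines."""
    rng = random.Random(seed)
    chars, out, i = list(text), [], 0
    while i < len(chars):
        r = rng.random()
        if r < rate and i + 1 < len(chars):   # swap adjacent characters
            out.extend([chars[i + 1], chars[i]]); i += 2
        elif r < 2 * rate:                    # drop a character
            i += 1
        else:
            out.append(chars[i]); i += 1
    return "".join(out)

print(char_noise("List customers with orders over 100 dollars"))
```

Linguistic variation, by contrast, would keep the string well-formed (e.g., a paraphrase of the same request), which is why it stresses different parts of the pipeline.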
[5] Evaluating Ill-Defined Tasks in Large Language Models
Yi Zhou, Basel Shbita
Main category: cs.CL
TL;DR: The paper analyzes limitations in current LLM evaluation methods for ill-defined tasks, using two case studies to show how existing benchmarks fail to provide reliable, diagnostic signals of model capability.
Details
Motivation: Current LLM evaluations for inherently ill-defined tasks (with unclear input/output spaces and ambiguous success criteria) often fail to provide reliable or diagnostic signals of model capability. The authors aim to expose fundamental limitations in existing evaluation practices.
Method: The paper examines two case studies: 1) Complex Instruction Following (CIF), identifying issues like limited coverage of real-world instruction complexity, sensitivity to phrasing, inconsistent metrics, and instability from LLM-based judges; 2) Natural Language to Mermaid Sequence Diagrams (NL2Mermaid), showing how multi-faceted evaluation criteria can yield actionable insights beyond aggregate scores.
Result: The analysis reveals that current evaluations frequently conflate distinct failure modes, yielding scores that are unstable, non-diagnostic, and difficult to act upon. The NL2Mermaid case study demonstrates how more nuanced evaluation approaches can provide better insights.
Conclusion: The findings expose fundamental limitations in existing evaluation practices for ill-defined tasks and motivate the need for more robust, interpretable evaluation designs that can provide reliable diagnostic signals about model capabilities.
Abstract: Many evaluations of Large Language Models (LLMs) target tasks that are inherently ill-defined, with unclear input and output spaces and ambiguous success criteria. We analyze why existing evaluation benchmarks and metrics fail to provide reliable or diagnostic signals of model capability for such tasks. We examine two case studies: Complex Instruction Following (CIF), where we identify recurring issues including limited coverage of real-world instruction complexity, sensitivity to instruction phrasing, inconsistent and non-comparable metrics, and instability introduced by LLM-based judges; and Natural Language to Mermaid Sequence Diagrams (NL2Mermaid), where we show how multi-faceted evaluation criteria can yield actionable insights beyond aggregate scores. Together, these case studies show that current evaluations frequently conflate distinct failure modes, yielding scores that are unstable, non-diagnostic, and difficult to act upon. Our findings expose fundamental limitations in existing evaluation practices for ill-defined tasks and motivate more robust, interpretable evaluation designs.
[6] Large Reasoning Models Struggle to Transfer Parametric Knowledge Across Scripts
Lucas Bandarkar, Alan Ansell, Trevor Cohn
Main category: cs.CL
TL;DR: Cross-lingual knowledge transfer gaps in reasoning LLMs are primarily due to script barriers, not language or family differences, and can be narrowed through targeted reasoning training.
Details
Motivation: The paper aims to understand and address shortcomings in cross-lingual knowledge transfer in modern reasoning LLMs, particularly investigating why knowledge doesn't transfer well across different writing systems.
Method: 1) Observational analysis on the ECLeKTic and MultiLoKo datasets using regression to identify script match as the key predictor (see the sketch below); 2) entity-based intervention testing; 3) a synthetic generation pipeline for SFT samples that teaches models to reason about transliteration ambiguities.
Result: Script match, not language or family, is the primary predictor of knowledge transfer failure. Providing source language entities disproportionately helps cross-script questions. Targeted reasoning training reduces the cross-script transfer gap.
Conclusion: Cross-lingual knowledge transfer gaps in LLMs are primarily script barriers, and targeted post-training interventions can significantly improve cross-script reasoning and knowledge transfer.
Abstract: In this work, we analyze shortcomings in cross-lingual knowledge transfer in large, modern reasoning LLMs. We demonstrate that the perceived gap in knowledge transfer is primarily a script barrier. First, we conduct an observational data analysis on the performance of thinking models on two datasets with local knowledge from around the world, ECLeKTic and MultiLoKo. Our regression analysis shows that script match - not language or family - is the primary predictor of knowledge transfer failure once model capability and question difficulty are accounted for. We further this finding by providing the LLMs with the key entities of the questions in their source language and find that this disproportionately improves cross-script questions. We then posit that these LLMs could be reasoning better at test-time. To evaluate this, we develop a synthetic generation pipeline to design SFT samples to encourage the model to better reason about transliteration ambiguities when trying to fetch parametric knowledge at inference-time. We show that teaching two models to reason better reduces the cross-script transfer gap. As a result, we conclude that there is potential to improve cross-lingual parametric knowledge transfer during post-training.
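The regression setup can be pictured as a simple per-question model with a script-match indicator alongside control covariates. The toy sketch below uses invented features and data purely to show the shape of the analysis, not the paper's actual regression or results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-question rows: [script_match, same_language, same_family,
# model_capability, question_difficulty] -> transfer success (1) / failure (0).
X = np.array([[1, 0, 1, 0.8, 0.3],
              [0, 0, 1, 0.8, 0.3],
              [1, 1, 1, 0.6, 0.5],
              [0, 0, 0, 0.6, 0.5]])
y = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
# The paper's claim corresponds to the script-match coefficient dominating
# once capability and difficulty are controlled for.
print(dict(zip(["script", "lang", "family", "capability", "difficulty"],
               clf.coef_[0].round(2))))
```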
[7] Ensemble Self-Training for Unsupervised Machine Translation
Ido Aharon, Jonathan Shaki, Sarit Kraus
Main category: cs.CL
TL;DR: Ensemble-driven self-training framework for unsupervised neural machine translation using multiple models with structured diversity and token-level ensemble decoding.
Details
Motivation: To improve unsupervised neural machine translation by leveraging ensemble methods to generate higher-quality pseudo-translations for self-training, overcoming the limitations of single-model approaches.
Method: Train multiple UNMT models sharing the same translation task but differing in auxiliary languages to induce structured diversity. Use token-level ensemble decoding to generate pseudo-translations, then use these as synthetic parallel data for further training. At deployment, select the single best model (see the sketch below).
Result: Statistically significant improvements over single-model UNMT baselines, with mean gains of 1.7 chrF when translating from English and 0.67 chrF when translating into English.
Conclusion: Ensemble-driven self-training effectively improves UNMT performance while maintaining single-model inference cost at deployment through careful model selection.
Abstract: We present an ensemble-driven self-training framework for unsupervised neural machine translation (UNMT). Starting from a primary language pair, we train multiple UNMT models that share the same translation task but differ in an auxiliary language, inducing structured diversity across models. We then generate pseudo-translations for the primary pair using token-level ensemble decoding, averaging model predictions in both directions. These ensemble outputs are used as synthetic parallel data to further train each model, allowing the models to improve via shared supervision. At deployment time, we select a single model by validation performance, preserving single-model inference cost. Experiments show statistically significant improvements over single-model UNMT baselines, with mean gains of 1.7 chrF when translating from English and 0.67 chrF when translating into English.
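Token-level ensemble decoding of this kind averages next-token distributions across models at each step. A minimal greedy-decoding sketch follows, assuming Hugging Face-style causal LMs that share a tokenizer; beam search and the paper's bidirectional setup are omitted.

```python
import torch

@torch.no_grad()
def ensemble_greedy_step(models, input_ids):
    """One token-level ensemble step: average next-token distributions
    across UNMT models that share a tokenizer, then take the argmax."""
    probs = torch.stack([
        m(input_ids).logits[:, -1, :].softmax(dim=-1) for m in models
    ]).mean(dim=0)
    return probs.argmax(dim=-1, keepdim=True)

# Hypothetical decode loop:
# while not finished:
#     ids = torch.cat([ids, ensemble_greedy_step(models, ids)], dim=-1)
```

The averaged distribution smooths out individual models' errors, which is what makes the resulting pseudo-translations usable as shared supervision.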
[8] Evaluating LLM-Simulated Conversations in Modeling Inconsistent and Uncollaborative Behaviors in Human Social Interaction
Ryo Kamoi, Ameya Godbole, Longqi Yang, Rui Zhang, Mengting Wan, Pei Zhou
Main category: cs.CL
TL;DR: CoCoEval framework evaluates LLM-simulated conversations by detecting 10 types of inconsistent/uncollaborative behaviors, finding LLMs underproduce these human-like social behaviors compared to real human conversations.
Details
Motivation: LLMs are increasingly used to simulate human conversations for modeling social interaction, but they struggle to reproduce the inherent inconsistencies and uncollaborative behaviors (misunderstandings, interruptions) that characterize real human communication.
Method: Developed the CoCoEval framework, using an LLM-as-a-Judge to detect 10 types of inconsistent/uncollaborative behaviors at the turn level. Evaluated GPT-4.1, GPT-5.1, and Claude Opus 4 across academic, business, and governmental meetings and debates, comparing with human conversations.
Result: 1) LLM-simulated conversations show far fewer inconsistent/uncollaborative behaviors than human conversations; 2) Prompt engineering doesn’t reliably control these behaviors (under- or overproduction); 3) Fine-tuning on human conversations leads to overproduction of narrow behaviors like repetition.
Conclusion: Simulating human conversations with LLMs is difficult, raising concerns about using LLMs as proxies for human social interaction due to their inability to reproduce authentic inconsistent/uncollaborative behaviors.
Abstract: Simulating human conversations using large language models (LLMs) has emerged as a scalable methodology for modeling human social interaction. However, simulating human conversations is challenging because they inherently involve inconsistent and uncollaborative behaviors, such as misunderstandings and interruptions. Analysis comparing inconsistent and uncollaborative behaviors in human- and LLM-generated conversations remains limited, although reproducing these behaviors is integral to simulating human-like and complex social interaction. In this work, we introduce CoCoEval, an evaluation framework that analyzes LLM-simulated conversations by detecting 10 types of inconsistent and uncollaborative behaviors at the turn level using an LLM-as-a-Judge. Using CoCoEval, we evaluate GPT-4.1, GPT-5.1, and Claude Opus 4 by comparing the frequencies of detected behaviors in conversations simulated by each model and in human conversations across academic, business, and governmental meetings, as well as debates. Our analysis shows that (1) under vanilla prompting, LLM-simulated conversations exhibit far fewer inconsistent and uncollaborative behaviors than human conversations; (2) prompt engineering does not provide reliable control over these behaviors, as our results show that different prompts lead to their under- or overproduction; and (3) supervised fine-tuning on human conversations can lead LLMs to overproduce a narrow set of behaviors, such as repetition. Our findings highlight the difficulty of simulating human conversations, raising concerns about the use of LLMs as a proxy for human social interaction.
[9] Knowledge Localization in Mixture-of-Experts LLMs Using Cross-Lingual Inconsistency
Lucas Bandarkar, Alan Ansell, Trevor Cohn
Main category: cs.CL
TL;DR: Using cross-lingual inconsistency in MoE LLMs as an interpretability tool to localize knowledge by contrasting routing patterns between languages where models succeed vs. fail at factual recall.
Details
Motivation: Modern LLMs show significant variance in behavior across languages (e.g., recalling factual information in some languages but not others). Rather than treating this as a problem to fix, the authors propose leveraging this cross-lingual inconsistency as a tool for interpretability in mixture-of-experts (MoE) LLMs.
Method: Two-stage knowledge localization framework: (1) query the model with difficult factual questions across diverse languages to generate “success” and “failure” activation buckets; (2) apply statistical contrastive analysis to the MoE router logits to identify experts important for specific knowledge. Validated by deactivating the identified experts and re-asking the questions (see the sketch below).
Result: Despite deactivating only about 20 out of 6000 experts, the model no longer answers correctly in over 40% of cases, demonstrating effective knowledge localization.
Conclusion: The method provides a realistic and scalable knowledge localization approach for increasingly complex LLMs by leveraging cross-lingual inconsistencies as an interpretability tool rather than treating them as bugs.
Abstract: Modern LLMs continue to exhibit significant variance in behavior across languages, such as being able to recall factual information in some languages but not others. While typically studied as a problem to be mitigated, in this work, we propose leveraging this cross-lingual inconsistency as a tool for interpretability in mixture-of-experts (MoE) LLMs. Our knowledge localization framework contrasts routing for sets of languages where the model correctly recalls information from languages where it fails. This allows us to isolate model components that play a functional role in answering about a piece of knowledge. Our method proceeds in two stages: (1) querying the model with difficult factual questions across a diverse set of languages to generate “success” and “failure” activation buckets and then (2) applying a statistical contrastive analysis to the MoE router logits to identify experts important for knowledge. To validate the necessity of this small number of experts for answering a knowledge question, we deactivate them and re-ask the question. We find that despite only deactivating about 20 out of 6000 experts, the model no longer answers correctly in over 40% of cases. Generally, this method provides a realistic and scalable knowledge localization approach to address increasingly complex LLMs.
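The contrastive stage can be sketched as a mean-difference test over router logits. The normalization below (pooled-variance scaling) is one plausible reading of "statistical contrastive analysis"; the paper's exact statistic is not specified in this summary.

```python
import numpy as np

def top_knowledge_experts(success_logits, failure_logits, k=20):
    """Contrast mean router logits between 'success' and 'failure' language
    buckets; experts with the largest standardized gap are candidates for
    deactivation. Inputs: (num_samples, num_experts) arrays, with experts
    flattened across layers."""
    gap = success_logits.mean(axis=0) - failure_logits.mean(axis=0)
    pooled = np.sqrt(success_logits.var(axis=0) + failure_logits.var(axis=0) + 1e-8)
    return np.argsort(gap / pooled)[-k:]  # indices of the implicated experts

# Validation mirrors the paper: mask these experts in the router (e.g. set
# their logits to -inf) and check whether the model still answers correctly.
```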
[10] Zipper-LoRA: Dynamic Parameter Decoupling for Speech-LLM based Multilingual Speech Recognition
Yuxiang Mei, Delai Qiu, Shengping Liu, Jiaen Liang, Yanhua Long
Main category: cs.CL
TL;DR: Zipper-LoRA: A rank-level decoupling framework for multilingual Speech-LLMs that dynamically combines shared and language-specific LoRA subspaces to balance cross-lingual knowledge transfer with language-specific adaptation.
Details
Motivation: Multilingual Speech-LLMs face a stability-plasticity dilemma in imbalanced data settings: fully shared PEFT causes negative interference for under-represented languages, while fully language-specific tuning limits the beneficial cross-lingual knowledge transfer needed for low-resource tasks.
Method: Proposes Zipper-LoRA with three variants (Static, Hard, Soft) that dynamically synthesizes LoRA updates from shared and language-specific subspaces using a lightweight language-conditioned router (see the sketch below). Includes a two-stage training strategy with an Initial-B warm start for stable optimization under imbalanced data.
Result: Experiments on 12-language mixed-resource setting show Zipper-LoRA consistently outperforms both fully shared and independent baselines, especially in extremely low-resource scenarios. Gains are robust across both chunked and non-chunked encoder configurations.
Conclusion: Zipper-LoRA provides an effective framework for multilingual Speech-LLMs that enables fine-grained sharing where languages are compatible and strict decoupling when conflicts occur, addressing the stability-plasticity dilemma in imbalanced multilingual settings.
Abstract: Speech Large Language Models (Speech-LLMs) have emerged as a powerful approach for automatic speech recognition (ASR) by aligning speech encoders with large language models. However, adapting these systems to multilingual settings with imbalanced data distributions remains challenging. In such scenarios, a stability-plasticity dilemma often arises: fully shared Parameter-Efficient Fine-Tuning (PEFT) can cause negative inter-lingual interference for under-represented languages, while fully language-specific tuning limits the cross-lingual beneficial knowledge transfer needed for low-resource tasks. To address this, we propose Zipper-LoRA, a novel rank-level decoupling framework with three variants (Static, Hard, and Soft) that dynamically synthesizes LoRA updates from shared and language-specific subspaces. By using a lightweight language-conditioned router, Zipper-LoRA dynamically controls the contribution of each subspace at the LoRA rank level, enabling fine-grained sharing where languages are compatible and strict decoupling when conflicts occur. To further stabilize optimization under imbalanced data, we propose a two-stage training strategy with an Initial-B warm start that significantly accelerates convergence. Experiments on a 12-language mixed-resource setting show that Zipper-LoRA consistently outperforms both fully shared and independent baselines, particularly in extremely low-resource scenarios. Moreover, we demonstrate that these gains are robust across both chunked and non-chunked encoder configurations, confirming the framework’s reliability for practical, large-scale multilingual ASR. Our code and data will be available at https://github.com/YuCeong-May/Zipper-LoRA for reproducibility.
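How a rank-level router might combine the two subspaces is easiest to see in code. The module below is a speculative reconstruction of the "Soft" variant under stated assumptions (a single shared B projection, sigmoid gates from a language one-hot); the actual Zipper-LoRA parameterization may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZipperLoRALinear(nn.Module):
    """Sketch of a rank-level 'zipped' LoRA: a language-conditioned router
    produces per-rank gates mixing shared and language-specific subspaces."""
    def __init__(self, base: nn.Linear, r: int, num_langs: int):
        super().__init__()
        self.base, self.num_langs = base, num_langs
        self.A_shared = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.A_lang = nn.Parameter(torch.randn(num_langs, r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init, as in standard LoRA
        self.router = nn.Linear(num_langs, r)  # lightweight, language-conditioned

    def forward(self, x: torch.Tensor, lang_id: int) -> torch.Tensor:
        one_hot = F.one_hot(torch.tensor(lang_id), self.num_langs).float()
        g = torch.sigmoid(self.router(one_hot))  # (r,) soft per-rank gates
        low = g * (x @ self.A_shared.T) + (1 - g) * (x @ self.A_lang[lang_id].T)
        return self.base(x) + low @ self.B.T
```

The "Hard" variant would correspond to binarizing `g`, and "Static" to fixing it per language rather than learning a router.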
[11] Exploiting the English Grammar Profile for L2 grammatical analysis with LLMs
Stefano Bannò, Penny Karanasou, Kate Knill, Mark Gales
Main category: cs.CL
TL;DR: A framework for evaluating L2 learners’ grammatical competence using English Grammar Profile taxonomy to detect successful/unsuccessful attempts at grammatical constructs and assess CEFR proficiency levels.
Details
Motivation: To provide targeted feedback and assess proficiency for second-language learners by evaluating both successful and unsuccessful attempts at grammatical constructs, moving beyond error detection alone.
Method: Leverages the English Grammar Profile (EGP) taxonomy mapped to CEFR levels, compares rule-based and LLM-based classifiers for grammatical construct detection, and uses hybrid approaches combining rule-based pre-filtering with LLMs for proficiency assessment.
Result: LLMs outperform rule-based methods on semantically nuanced constructs, while rule-based approaches remain competitive for morphological/syntactic features. Hybrid approaches combining rule-based pre-filter with LLM yield strongest performance for proficiency assessment.
Conclusion: The framework enables positive formative feedback by emphasizing learners’ successful attempts alongside errors, providing actionable insights into grammatical development with performance approaching semi-automated systems.
Abstract: Evaluating the grammatical competence of second language (L2) learners is essential both for providing targeted feedback and for assessing proficiency. To achieve this, we propose a novel framework leveraging the English Grammar Profile (EGP), a taxonomy of grammatical constructs mapped to the proficiency levels of the Common European Framework of Reference (CEFR), to detect learners’ attempts at grammatical constructs and classify them as successful or unsuccessful. This detection can then be used to provide fine-grained feedback. Moreover, the grammatical constructs are used as predictors of proficiency assessment by using automatically detected attempts as predictors of holistic CEFR proficiency. For the selection of grammatical constructs derived from the EGP, rule-based and LLM-based classifiers are compared. We show that LLMs outperform rule-based methods on semantically and pragmatically nuanced constructs, while rule-based approaches remain competitive for constructs that rely purely on morphological or syntactic features and do not require semantic interpretation. For proficiency assessment, we evaluate both rule-based and hybrid pipelines and show that a hybrid approach combining a rule-based pre-filter with an LLM consistently yields the strongest performance. Since our framework operates on pairs of original learner sentences and their corrected counterparts, we also evaluate a fully automated pipeline using automatic grammatical error correction. This pipeline closely approaches the performance of semi-automated systems based on manual corrections, particularly for the detection of successful attempts at grammatical constructs. Overall, our framework emphasises learners’ successful attempts in addition to unsuccessful ones, enabling positive, formative feedback and providing actionable insights into grammatical development.
[12] Tabular LLMs for Interpretable Few-Shot Alzheimer’s Disease Prediction with Multimodal Biomedical Data
Sophie Kearney, Shu Yang, Zixuan Wen, Weimin Lyu, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Jason H. Moore, Marylyn D. Ritchie, Chao Chen, Li Shen
Main category: cs.CL
TL;DR: TAP-GPT is a domain-adapted tabular LLM framework built on TableGPT2 for few-shot Alzheimer’s disease classification using tabular prompts, showing improved performance over traditional methods and handling missing data without imputation.
Details
Motivation: Deep learning models often underperform classical methods on small, incomplete tabular biomarker data for Alzheimer's disease diagnosis. Pretrained LLMs offer few-shot generalization and structured reasoning capabilities that could address these limitations.
Method: Built on TableGPT2 and fine-tuned for few-shot AD classification using tabular prompts instead of plain text (see the sketch below). Evaluated across four ADNI-derived datasets with multimodal biomarkers including structural MRI, amyloid PET, and tau PET.
Result: TAP-GPT outperforms backbone models and traditional ML baselines in few-shot settings, remains competitive with general-purpose LLMs, handles missing data without imputation, and produces modality-aware reasoning aligned with AD biology.
Conclusion: First systematic application of tabular-specialized LLM to multimodal biomarker-based AD prediction, demonstrating effective handling of structured clinical prediction tasks and laying foundation for tabular LLM-driven clinical decision-support systems.
Abstract: Accurate diagnosis of Alzheimer’s disease (AD) requires handling tabular biomarker data, yet such data are often small and incomplete, where deep learning models frequently fail to outperform classical methods. Pretrained large language models (LLMs) offer few-shot generalization, structured reasoning, and interpretable outputs, providing a powerful paradigm shift for clinical prediction. We propose TAP-GPT Tabular Alzheimer’s Prediction GPT, a domain-adapted tabular LLM framework built on TableGPT2 and fine-tuned for few-shot AD classification using tabular prompts rather than plain texts. We evaluate TAP-GPT across four ADNI-derived datasets, including QT-PAD biomarkers and region-level structural MRI, amyloid PET, and tau PET for binary AD classification. Across multimodal and unimodal settings, TAP-GPT improves upon its backbone models and outperforms traditional machine learning baselines in the few-shot setting while remaining competitive with state-of-the-art general-purpose LLMs. We show that feature selection mitigates degradation in high-dimensional inputs and that TAP-GPT maintains stable performance under simulated and real-world missingness without imputation. Additionally, TAP-GPT produces structured, modality-aware reasoning aligned with established AD biology and shows greater stability under self-reflection, supporting its use in iterative multi-agent systems. To our knowledge, this is the first systematic application of a tabular-specialized LLM to multimodal biomarker-based AD prediction, demonstrating that such pretrained models can effectively address structured clinical prediction tasks and laying the foundation for tabular LLM-driven multi-agent clinical decision-support systems. The source code is publicly available on GitHub: https://github.com/sophie-kearney/TAP-GPT.
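The key input-format difference from text-prompted baselines is the table serialization itself. The sketch below is hypothetical: the column names, instruction wording, and markdown serialization are illustrative, not TAP-GPT's actual prompt template.

```python
import pandas as pd

def tabular_prompt(df: pd.DataFrame, question: str) -> str:
    """Serialize biomarker features as a small table inside the prompt,
    rather than flattening them into running text."""
    table = df.to_markdown(index=False)  # requires the 'tabulate' package
    return (f"Given the following patient biomarkers:\n\n{table}\n\n"
            f"{question}\nAnswer with one label: AD or CN.")

df = pd.DataFrame({"hippocampus_vol": [5.2], "amyloid_suvr": [1.4],
                   "tau_suvr": [1.3], "apoe4": [1]})
print(tabular_prompt(df, "Classify this participant."))
```

A tabular format also makes missingness explicit (an empty cell rather than an omitted sentence), which is consistent with the reported stability under missing data without imputation.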
[13] CODMAS: A Dialectic Multi-Agent Collaborative Framework for Structured RTL Optimization
Che-Ming Chang, Prashanth Vijayaraghavan, Ashutosh Jadhav, Charles Mackin, Vandana Mukherjee, Hsinyu Tsai, Ehsan Degan
Main category: cs.CL
TL;DR: CODMAS is a multi-agent system for automated RTL optimization using dialectic reasoning between specialized agents to generate and verify Verilog code improvements.
Details
Motivation: RTL optimization is crucial for improving power, performance, and area in chip design, but current methods require significant manual effort. The paper aims to automate this process using structured multi-agent reasoning.
Method: CODMAS uses two dialectic agents (Articulator and Hypothesis Partner) that engage in structured reasoning to direct a Domain-Specific Coding Agent for Verilog edits and a Code Evaluation Agent for verification. The framework is evaluated on the RTLOPT benchmark of 120 Verilog triples.
Result: Achieves ~25% reduction in critical path delay for pipelining and ~22% power reduction for clock gating, while reducing functional and compilation failures compared to baseline methods.
Conclusion: Structured multi-agent reasoning significantly enhances automated RTL optimization and can scale to more complex designs and broader optimization tasks.
Abstract: Optimizing Register Transfer Level (RTL) code is a critical step in Electronic Design Automation (EDA) for improving power, performance, and area (PPA). We present CODMAS (Collaborative Optimization via a Dialectic Multi-Agent System), a framework that combines structured dialectic reasoning with domain-aware code generation and deterministic evaluation to automate RTL optimization. At the core of CODMAS are two dialectic agents: the Articulator, inspired by rubber-duck debugging, which articulates stepwise transformation plans and exposes latent assumptions; and the Hypothesis Partner, which predicts outcomes and reconciles deviations between expected and actual behavior to guide targeted refinements. These agents direct a Domain-Specific Coding Agent (DCA) to generate architecture-aware Verilog edits and a Code Evaluation Agent (CEA) to verify syntax, functionality, and PPA metrics. We introduce RTLOPT, a benchmark of 120 Verilog triples (unoptimized, optimized, testbench) for pipelining and clock-gating transformations. Across proprietary and open LLMs, CODMAS achieves ~25% reduction in critical path delay for pipelining and ~22% power reduction for clock gating, while reducing functional and compilation failures compared to strong prompting and agentic baselines. These results demonstrate that structured multi-agent reasoning can significantly enhance automated RTL optimization and scale to more complex designs and broader optimization tasks.
[14] SYMDIREC: A Neuro-Symbolic Divide-Retrieve-Conquer Framework for Enhanced RTL Synthesis and Summarization
Prashanth Vijayaraghavan, Apoorva Nitsure, Luyao Shi, Charles Mackin, Ashutosh Jadhav, David Beymer, Ehsan Degan, Vandana Mukherjee
Main category: cs.CL
TL;DR: SYMDIREC is a neuro-symbolic framework for RTL synthesis and summarization that uses symbolic planning to decompose tasks, retrieves relevant code via fine-tuned retriever, and assembles verified outputs through LLM reasoning, achieving significant improvements over existing methods.
Details
Motivation: RTL synthesis and summarization are challenging for LLMs due to rigid HDL syntax, limited supervision, and weak alignment with natural language. Existing prompting and RAG methods lack symbolic planning, limiting structural precision in hardware design automation tasks.
Method: SYMDIREC decomposes RTL tasks into symbolic subgoals, retrieves relevant code via a fine-tuned retriever, and assembles verified outputs through LLM reasoning. It supports both Verilog and VHDL without requiring LLM fine-tuning.
Result: Achieves ~20% higher Pass@1 rates for synthesis and 15-20% ROUGE-L improvements for summarization over prompting and RAG baselines, demonstrating benefits of symbolic guidance in RTL tasks.
Conclusion: Symbolic guidance significantly improves LLM performance on RTL tasks, with SYMDIREC providing a neuro-symbolic framework that enhances structural precision without requiring LLM fine-tuning.
Abstract: Register-Transfer Level (RTL) synthesis and summarization are central to hardware design automation but remain challenging for Large Language Models (LLMs) due to rigid HDL syntax, limited supervision, and weak alignment with natural language. Existing prompting and retrieval-augmented generation (RAG) methods have not incorporated symbolic planning, limiting their structural precision. We introduce SYMDIREC, a neuro-symbolic framework that decomposes RTL tasks into symbolic subgoals, retrieves relevant code via a fine-tuned retriever, and assembles verified outputs through LLM reasoning. Supporting both Verilog and VHDL without LLM fine-tuning, SYMDIREC achieves ~20% higher Pass@1 rates for synthesis and 15-20% ROUGE-L improvements for summarization over prompting and RAG baselines, demonstrating the benefits of symbolic guidance in RTL tasks.
[15] Anonymous-by-Construction: An LLM-Driven Framework for Privacy-Preserving Text
Federico Albanese, Pablo Ronco, Nicolás D’Ippolito
Main category: cs.CL
TL;DR: LLM-driven text anonymization pipeline that replaces PII with realistic surrogates while preserving semantics, evaluated for privacy, utility, and trainability.
Details
Motivation: Need to protect sensitive information in AI systems without compromising data utility, especially for large language models, requiring privacy-preserving text processing that maintains semantic integrity.
Method: On-premise LLM-driven substitution pipeline that anonymizes text by replacing PII with type-consistent surrogates (see the sketch below), evaluated using a multi-metric protocol measuring privacy, semantic utility, and trainability via fine-tuning on sanitized text.
Result: Achieves state-of-the-art privacy, minimal topical drift, strong factual utility, and low trainability loss, outperforming rule-based approaches, NER baselines, and ZSTS variants on privacy-utility-trainability frontier.
Conclusion: Local LLM substitution produces anonymized corpora that are both privacy-preserving and operationally valuable, enabling responsible deployment of Q&A agents and downstream fine-tuning with limited degradation.
Abstract: Responsible use of AI demands that we protect sensitive information without undermining the usefulness of data, an imperative that has become acute in the age of large language models. We address this challenge with an on-premise, LLM-driven substitution pipeline that anonymizes text by replacing personally identifiable information (PII) with realistic, type-consistent surrogates. Executed entirely within organizational boundaries using local LLMs, the approach prevents data egress while preserving fluency and task-relevant semantics. We conduct a systematic, multi-metric, cross-technique evaluation on the Action-Based Conversation Dataset, benchmarking against industry standards (Microsoft Presidio and Google DLP) and a state-of-the-art approach (ZSTS, in redaction-only and redaction-plus-substitution variants). Our protocol jointly measures privacy, semantic utility, and trainability under privacy via a lifecycle-ready criterion obtained by fine-tuning a compact encoder (BERT+LoRA) on sanitized text. In addition, we assess agentic Q&A performance by inserting an on-premise anonymization layer before the answering LLM and evaluating the quality of its responses. This intermediate, type-preserving substitution stage ensures that no sensitive content is exposed to third-party APIs, enabling responsible deployment of Q&A agents without compromising confidentiality. Our method attains state-of-the-art privacy, minimal topical drift, strong factual utility, and low trainability loss, outperforming rule-based approaches and named-entity recognition (NER) baselines and ZSTS variants on the combined privacy–utility–trainability frontier. These results show that local LLM substitution yields anonymized corpora that are both responsible to use and operationally valuable: safe for agentic pipelines and suitable for downstream fine-tuning with limited degradation.
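The substitution step, taken in isolation, is simple to illustrate. The sketch below stands in for it using the `faker` library over pre-detected spans; in the paper both detection and replacement are LLM-driven and run on-premise, so treat this only as a picture of the replacement stage.

```python
from faker import Faker

fake = Faker()
# Type-consistent surrogate generators: a name is replaced by a name,
# an email by an email, preserving fluency and downstream task semantics.
SURROGATES = {"PERSON": fake.name, "EMAIL": fake.email,
              "PHONE": fake.phone_number, "ADDRESS": fake.address}

def substitute(text: str, spans: list[tuple[int, int, str]]) -> str:
    """spans: non-overlapping (start, end, pii_type) from an upstream detector."""
    for start, end, pii_type in sorted(spans, reverse=True):  # right-to-left
        text = text[:start] + SURROGATES[pii_type]() + text[end:]
    return text

msg = "Hi, I'm Jane Roe, reach me at jane@example.com."
print(substitute(msg, [(8, 16, "PERSON"), (30, 46, "EMAIL")]))
```

Replacing right-to-left keeps earlier span offsets valid, and type-consistent surrogates are what distinguish this approach from redaction-only baselines like plain NER masking.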
[16] Neuron-Level Emotion Control in Speech-Generative Large Audio-Language Models
Xiutian Zhao, Ismail Rasim Ulgen, Philipp Koehn, Björn Schuller, Berrak Sisman
Main category: cs.CL
TL;DR: The paper presents a neuron-level study of emotion control in speech-generative large audio-language models, identifying compact emotion-sensitive neurons that enable training-free emotion steering at inference time.
Details
Motivation: Current large audio-language models can produce expressive speech but lack reliable emotion control, often missing target affects and degrading linguistic fidelity through refusals, hallucinations, or paraphrasing.
Method: Identifies emotion-sensitive neurons via success-filtered activation aggregation that enforces both emotion realization and content preservation. Uses these neurons for training-free emotion steering interventions across three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio).
Result: ESN interventions yield emotion-specific gains that generalize to unseen speakers, supported by both automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and intervention strength.
Conclusion: Establishes a mechanistic framework for training-free emotion control in speech generation through neuron-level interventions.
Abstract: Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time. ESNs are identified via success-filtered activation aggregation enforcing both emotion realization and content preservation. Across three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio), ESN interventions yield emotion-specific gains that generalize to unseen speakers and are supported by automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and intervention strength. Our results establish a mechanistic framework for training-free emotion control in speech generation.
[17] Alignment Makes Language Models Normative, Not Descriptive
Eilam Shapira, Moshe Tennenholtz, Roi Reichart
Main category: cs.CL
TL;DR: Aligned language models perform worse than base models at predicting human behavior in multi-round strategic games but better in one-shot normative scenarios, revealing a normative bias from alignment.
Details
Motivation: To investigate whether post-training alignment, which optimizes models for human preferences, actually helps models predict real human behavior in strategic decision-making contexts.
Method: Compared 120 base-aligned model pairs on over 10,000 real human decisions across multi-round strategic games (bargaining, persuasion, negotiation, repeated matrix games) and one-shot textbook games.
Result: Base models outperformed aligned models in predicting human choices in multi-round strategic games by nearly 10:1, but aligned models dominated in one-shot textbook games and non-strategic lottery choices.
Conclusion: Alignment induces a normative bias - improves prediction when human behavior follows normative solutions but hurts prediction in multi-round strategic settings where descriptive dynamics like reciprocity and adaptation matter.
Abstract: Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.
[18] TharuChat: Bootstrapping Large Language Models for a Low-Resource Language via Synthetic Data and Human Validation
Prajwal Panth, Agniva Maiti
Main category: cs.CL
TL;DR: Tharu-LLaMA (3B) is a specialized instruction-following LLM for the under-resourced Tharu language, created using synthetic data from an LLM-to-Human bootstrapping pipeline to address the digital divide excluding indigenous languages from AI.
Details
Motivation: Address the digital divide excluding indigenous languages like Tharu from the AI revolution, caused by severe data scarcity and linguistic fragmentation that lead multilingual models to hallucinate or default to dominant languages like Hindi and Nepali.
Method: Created the TharuChat dataset via an LLM-to-Human bootstrapping pipeline using prompt-engineered Gemini models fed with Rana Tharu grammar and folklore. Built the Tharu-LLaMA (3B) instruction-following model and conducted an empirical ablation study on synthetic data scaling.
Result: Increasing synthetic data volume from 25% to 100% linearly reduces perplexity from 6.42 to 2.88. The model serves as proof-of-concept for preserving under-resourced Himalayan languages via generative AI on consumer-grade hardware.
Conclusion: Small-scale synthetic data is highly effective for low-resource language modeling, demonstrating feasibility of preserving indigenous languages through generative AI despite dialectal code-mixing and residual influences.
Abstract: The rapid proliferation of Large Language Models (LLMs) has created a profound digital divide, effectively excluding indigenous languages of the Global South from the AI revolution. The Tharu language, an Indo-Aryan vernacular spoken by approximately 1.7 million people across the Terai belt of Nepal and India, exemplifies this crisis. Despite a rich oral tradition, Tharu suffers from severe data scarcity and linguistic fragmentation, causing state-of-the-art multilingual models to routinely “hallucinate” or default to dominant high-resource neighbors like Hindi and Nepali due to contamination in pre-training corpora. This paper presents Tharu-LLaMA (3B), a specialized instruction-following model designed to address this exclusion. We introduce TharuChat, a novel dataset constructed via an LLM-to-Human bootstrapping pipeline. We utilized prompt-engineered Gemini models, fed with Rana Tharu grammar and folklore, to synthesize training data. Unlike curated gold-standard corpora, TharuChat reflects the noisy, heterogeneous linguistic reality of the region: it is predominantly anchored in Rana Tharu (~70%) while integrating elements of Dangaura and Kochila dialects. We provide a transparent analysis of the dataset’s limitations, including dialectal code-mixing and residual Awadhi/Hindi influence. Through a rigorous empirical ablation study, we demonstrate that despite these imperfections, small-scale synthetic data is highly effective: increasing the dataset volume from 25% to 100% results in a linear reduction in perplexity from 6.42 to 2.88. The resulting model serves as a proof-of-concept for the preservation of under-resourced Himalayan languages via generative AI, achievable on consumer-grade hardware.
[19] From Words to Worlds: Benchmarking Cross-Cultural Cultural Understanding in Machine Translation
Bangju Han, Yingqi Wang, Huang Qing, Tiyuan Li, Fengyi Yang, Ahtamjan Ahmat, Abibulla Atawulla, Yating Yang, Xi Zhou
Main category: cs.CL
TL;DR: CulT-Eval is a benchmark for evaluating machine translation of culturally grounded expressions like idioms and slang, revealing systematic failures in current models.
Details
Motivation: Existing benchmarks are fragmented and lack systematic evaluation of how machine translation systems handle culture-specific expressions that encode meanings beyond literal linguistic form.
Method: Created CulT-Eval benchmark with 7,959 curated instances of culturally grounded expressions across multiple types, with comprehensive error taxonomy and proposed complementary evaluation metric for culturally induced meaning deviations.
Result: Current models struggle to preserve culturally grounded meaning and capture cultural/contextual nuances; systematic failure modes identified that aren’t captured by existing automatic metrics.
Conclusion: Need for specialized benchmarks and metrics to evaluate translation of culturally loaded expressions, as current models fail to handle cultural nuances essential for accurate translation.
Abstract: Culture-expressions, such as idioms, slang, and culture-specific items (CSIs), are pervasive in natural language and encode meanings that go beyond literal linguistic form. Accurately translating such expressions remains challenging for machine translation systems. Despite this, existing benchmarks remain fragmented and do not provide a systematic framework for evaluating translation performance on culture-loaded expressions. To address this gap, we introduce CulT-Eval, a benchmark designed to evaluate how models handle different types of culturally grounded expressions. CulT-Eval comprises over 7,959 carefully curated instances spanning multiple types of culturally grounded expressions, with a comprehensive error taxonomy covering culturally grounded expressions. Through extensive evaluation of large language models and detailed analysis, we identify recurring and systematic failure modes that are not adequately captured by existing automatic metrics. Accordingly, we propose a complementary evaluation metric that targets culturally induced meaning deviations overlooked by standard MT metrics. The results indicate that current models struggle to preserve culturally grounded meaning and to capture the cultural and contextual nuances essential for accurate translation. Our benchmark and code are available at https://anonymous.4open.science/r/CulT-Eval-E75D/.
[20] Beyond bouba/kiki: Multidimensional semantic signals are deeply woven into the fabric of natural language
Gexin Zhao
Main category: cs.CL
TL;DR: Individual English letter-phonemes carry structured semantic signals that can be predicted by articulatory features, challenging the traditional view of arbitrary sound-meaning relationships.
Details
Motivation: To systematically investigate whether sound-meaning relationships are arbitrary or systematic by mapping semantic profiles of every phonological unit in English, challenging foundational linguistic assumptions.
Method: Used minimal-pair paradigm spanning all 220 pairwise letter contrasts, employed three large language models to recover phoneme-meaning associations across nine perceptual dimensions, analyzed articulatory-phonetic features, and conducted behavioral validation with English speakers plus preliminary cross-linguistic analysis in five diverse languages.
Result: LLMs recovered consistent phoneme-meaning associations systematically predicted by articulatory features (manner and place of articulation), behavioral data confirmed patterns at 80.8% above chance, and preliminary cross-linguistic evidence suggests generalization beyond English.
Conclusion: Sound-meaning iconicity is a pervasive, structured property of phonological signals, systematic enough that LLMs can recover it from text alone without exposure to speech or articulation.
Abstract: A foundational assumption in linguistics holds that the relationship between a word’s sound and its meaning is arbitrary. Accumulating evidence from sound symbolism challenges this view, yet no study has systematically mapped the multidimensional semantic profile of every phonological unit within a language. Here we show that individual letter-phonemes in English carry structured, multidimensional semantic signals. Using a minimal-pair paradigm spanning all 220 pairwise letter contrasts, three large language models independently recover consistent phoneme-meaning associations across nine perceptual dimensions. These associations are systematically predicted by articulatory-phonetic features, with manner and place of articulation mapping onto distinct semantic dimensions. Behavioral data from English speakers confirm these patterns at rates well above chance (80.8%), and preliminary cross-linguistic evidence from five typologically diverse languages suggests that core mappings generalize beyond English. Our findings indicate that sound-meaning iconicity is not an occasional curiosity but a pervasive, structured property of the phonological signal, one so systematic that large language models recover it when given only text input, without exposure to speech or articulation during the task.
[21] Ruyi2.5 Technical Report
Huan Song, Shuyu Tian, Qingfei Zhao, Wenhao Hong, Jiang Liu, Ting Long, Jiawei Shao, Xuelong Li
Main category: cs.CL
TL;DR: Ruyi2.5 is a multimodal familial model with shared-backbone architecture for co-training varying scale models, featuring a privacy-preserving camera system and BPPO for efficient RL fine-tuning.
Details
Motivation: Extend the "Train Once, Deploy Many" paradigm to multimodal domain, enable semantic consistency across deployment tiers, address privacy concerns in camera systems, and improve RL fine-tuning efficiency.
Method: Shared-backbone architecture for co-training models of varying scales; two-stage privacy-preserving camera system with edge model using irreversible feature mapping and cloud model for deep reasoning; Binary Prefix Policy Optimization (BPPO) for RL fine-tuning with binary response selection and prefix-focused gradient updates.
Result: Matches Qwen3-VL on general multimodal benchmarks; substantially outperforms Qwen3-VL on privacy-constrained surveillance tasks; achieves 2-3x training speedup over GRPO with BPPO.
Conclusion: Ruyi2.5 successfully extends familial modeling to multimodal domain, provides effective privacy-preserving camera system, and demonstrates efficient RL fine-tuning with BPPO.
Abstract: We present Ruyi2.5, a multimodal familial model built on the AI Flow framework. Extending Ruyi2’s “Train Once, Deploy Many” paradigm to the multimodal domain, Ruyi2.5 constructs a shared-backbone architecture that co-trains models of varying scales within a single unified pipeline, ensuring semantic consistency across all deployment tiers. Built upon Ruyi2.5, the Ruyi2.5-Camera model is developed as a privacy-preserving camera service system, instantiated as a two-stage recognition pipeline: an edge model applies information-bottleneck-guided irreversible feature mapping to de-identify raw frames at the source, while a cloud model performs deep behavior reasoning. To accelerate reinforcement learning fine-tuning, we further propose Binary Prefix Policy Optimization (BPPO), which reduces sample redundancy via binary response selection and focuses gradient updates on response prefixes, achieving a 2 to 3 times training speedup over GRPO. Experiments show Ruyi2.5 matches Qwen3-VL on the general multimodal benchmarks, while Ruyi2.5-Camera substantially outperforms Qwen3-VL on privacy-constrained surveillance tasks.
[22] Grid Spatial Understanding: A Dataset for Textual Spatial Reasoning over Grids, Embodied Settings, and Coordinate Structures
Risham Sidhu, Julia Hockenmaier
Main category: cs.CL
TL;DR: GSU is a text-only grid dataset for evaluating LLMs’ spatial reasoning capabilities on navigation, object localization, and structure composition tasks, showing models struggle with embodied frames of reference and 3D shape recognition from coordinates.
Details
Motivation: To evaluate spatial reasoning capabilities of LLMs independent of visual perception, isolating spatial cognition from visual processing to understand how well language models can reason about 3D space and embodied perspectives.
Method: Created a text-only grid dataset (GSU) with three core spatial reasoning tasks: navigation, object localization, and structure composition. Evaluated various LLMs including frontier models, VLMs, and fine-tuned smaller models on these tasks.
Result: Most models grasp basic grid concepts but struggle with frames of reference relative to an embodied agent and identifying 3D shapes from coordinate lists. VLMs don’t show generalizable 3D space understanding. Frontier models can solve tasks, while fine-tuned small models can match their performance.
Conclusion: Spatial reasoning remains challenging for LLMs, especially for embodied perspectives. Fine-tuning smaller models shows promise for specialized embodied agents, suggesting a path forward for developing spatial reasoning capabilities without requiring massive frontier models.
Abstract: We introduce GSU, a text-only grid dataset to evaluate the spatial reasoning capabilities of LLMs over three core tasks: navigation, object localization, and structure composition. By forgoing visual inputs, isolating spatial reasoning from perception, we show that while most models grasp basic grid concepts, they struggle with frames of reference relative to an embodied agent and identifying 3D shapes from coordinate lists. We also find that exposure to a visual modality does not provide a generalizable understanding of 3D space that VLMs are able to utilize for these tasks. Finally, we show that while the very latest frontier models can solve the provided tasks (though harder variants may still stump them), fully fine-tuning a small LM or LoRA fine-tuning a small LLM shows potential to match frontier model performance, suggesting an avenue for specialized embodied agents.
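To make the task format concrete, here is an illustrative generator for navigation-style grid questions; the prompt wording, grid size, and answer conventions are assumptions for illustration, not GSU's actual format:

```python
import random

MOVES = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}

def make_navigation_example(size=5, n_steps=3, seed=0):
    """Generate a toy text-only grid navigation question and its answer."""
    rng = random.Random(seed)
    x, y = rng.randrange(size), rng.randrange(size)
    start = (x, y)
    steps = []
    for _ in range(n_steps):
        # Resample until the move stays on the grid, so every instruction is valid.
        while True:
            d = rng.choice(list(MOVES))
            dx, dy = MOVES[d]
            if 0 <= x + dx < size and 0 <= y + dy < size:
                break
        x, y = x + dx, y + dy
        steps.append(d)
    question = (f"On a {size}x{size} grid, an agent starts at {start}. "
                f"It moves {', then '.join(steps)}. Where is it now?")
    return question, (x, y)

q, a = make_navigation_example()
print(q)
print("answer:", a)
```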
[23] PACE-RAG: Patient-Aware Contextual and Evidence-based Policy RAG for Clinical Drug Recommendation
Chaeyoung Huh, Hyunmin Hwang, Jung Hwan Shin, Jinse Park, Jong Chul Ye
Main category: cs.CL
TL;DR: PACE-RAG is a novel RAG framework for personalized drug recommendation that synthesizes individual patient context with prescribing patterns of similar cases, achieving state-of-the-art performance on Parkinson’s disease and MIMIC-IV benchmarks.
Details
Motivation: Current LLMs lack nuanced understanding of actual prescribing patterns, and existing RAG methods struggle with personalized drug recommendations because guideline-based retrieval is too generic and similar-patient retrieval often replicates majority patterns without accounting for individual clinical nuances.
Method: PACE-RAG (Patient-Aware Contextual and Evidence-based Policy RAG) analyzes treatment patterns tailored to specific clinical signals to identify optimal prescriptions and generates explainable clinical summaries by synthesizing individual patient context with prescribing tendencies of similar cases.
Result: Evaluated on a Parkinson’s cohort and MIMIC-IV benchmark using Llama-3.1-8B and Qwen3-8B, PACE-RAG achieved state-of-the-art performance with F1 scores of 80.84% and 47.22%, respectively.
Conclusion: PACE-RAG is validated as a robust, clinically grounded solution for personalized decision support in drug recommendation, bridging the gap between generic guidelines and individual patient needs.
Abstract: Drug recommendation requires a deep understanding of individual patient context, especially for complex conditions like Parkinson’s disease. While LLMs possess broad medical knowledge, they fail to capture the subtle nuances of actual prescribing patterns. Existing RAG methods also struggle with these complexities because guideline-based retrieval remains too generic and similar-patient retrieval often replicates majority patterns without accounting for the unique clinical nuances of individual patients. To bridge this gap, we propose PACE-RAG (Patient-Aware Contextual and Evidence-based Policy RAG), a novel framework designed to synthesize individual patient context with the prescribing tendencies of similar cases. By analyzing treatment patterns tailored to specific clinical signals, PACE-RAG identifies optimal prescriptions and generates an explainable clinical summary. Evaluated on a Parkinson’s cohort and the MIMIC-IV benchmark using Llama-3.1-8B and Qwen3-8B, PACE-RAG achieved state-of-the-art performance, reaching F1 scores of 80.84% and 47.22%, respectively. These results validate PACE-RAG as a robust, clinically grounded solution for personalized decision support. Our code is available at: https://github.com/ChaeYoungHuh/PACE-RAG.
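The similar-case retrieval step can be pictured as nearest-neighbour search over patient embeddings followed by aggregation of the neighbours' prescriptions. A minimal sketch with synthetic vectors and placeholder drug labels; PACE-RAG's policy analysis and summary generation are not reproduced here:

```python
import numpy as np
from collections import Counter

def top_k_similar(query_vec, case_vecs, k=3):
    """Rank past cases by cosine similarity to the query patient."""
    q = query_vec / np.linalg.norm(query_vec)
    c = case_vecs / np.linalg.norm(case_vecs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
case_vecs = rng.normal(size=(100, 16))                     # past-patient embeddings
case_drugs = [[f"drug_{rng.integers(5)}"] for _ in range(100)]  # placeholder histories
query = rng.normal(size=16)                                # current patient embedding

idx = top_k_similar(query, case_vecs)
# Aggregate prescribing tendencies of the retrieved neighbours.
counts = Counter(d for i in idx for d in case_drugs[i])
print("retrieved cases:", idx, "| candidate drugs:", counts.most_common())
```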
[24] SafeTutors: Benchmarking Pedagogical Safety in AI Tutoring Systems
Rima Hazra, Bikram Ghuku, Ilona Marchenko, Yaroslava Tokarieva, Sayan Layek, Somnath Banerjee, Julia Stoyanovich, Mykola Pechenizkiy
Main category: cs.CL
TL;DR: SafeTutors benchmark evaluates LLM tutoring safety and pedagogy across math, physics, and chemistry, revealing systematic failures in extended interactions despite single-turn safety.
Details
Motivation: Current LLM evaluations assess problem-solving accuracy and generic safety in isolation, failing to capture whether models are simultaneously pedagogically effective and safe across student-tutor interactions. Tutoring safety differs from conventional LLM safety with risks like answer over-disclosure, misconception reinforcement, and abdication of scaffolding.
Method: Introduces SafeTutors benchmark with theoretically grounded risk taxonomy comprising 11 harm dimensions and 48 sub-risks from learning-science literature. Evaluates models across mathematics, physics, and chemistry in both single-turn and multi-turn dialogue settings.
Result: All models show broad harm; scale doesn’t reliably improve safety; multi-turn dialogue worsens behavior with pedagogical failures rising from 17.7% to 77.8%. Harms vary by subject, and single-turn “safe/helpful” results can mask systematic tutor failure over extended interaction.
Conclusion: Tutoring safety requires specialized evaluation beyond conventional LLM safety. Mitigations must be discipline-aware and account for extended interactions, as current models show systematic pedagogical failures that scale alone cannot address.
Abstract: Large language models are rapidly being deployed as AI tutors, yet current evaluation paradigms assess problem-solving accuracy and generic safety in isolation, failing to capture whether a model is simultaneously pedagogically effective and safe across student-tutor interaction. We argue that tutoring safety is fundamentally different from conventional LLM safety: the primary risk is not toxic content but the quiet erosion of learning through answer over-disclosure, misconception reinforcement, and the abdication of scaffolding. To systematically study this failure mode, we introduce SafeTutors, a benchmark that jointly evaluates safety and pedagogy across mathematics, physics, and chemistry. SafeTutors is organized around a theoretically grounded risk taxonomy comprising 11 harm dimensions and 48 sub-risks drawn from learning-science literature. We uncover that all models show broad harm; scale doesn’t reliably help; and multi-turn dialogue worsens behavior, with pedagogical failures rising from 17.7% to 77.8%. Harms also vary by subject, so mitigations must be discipline-aware, and single-turn “safe/helpful” results can mask systematic tutor failure over extended interaction.
[25] Argument Reconstruction as Supervision for Critical Thinking in LLMs
Hyun Ryu, Gyouk Chu, Gregor Betz, Eunho Yang, Carolyn Rose, Sean Welleck
Main category: cs.CL
TL;DR: LLMs trained on argument reconstruction (using GAAR engine and Arguinas dataset) show improved performance on downstream critical thinking tasks.
Details
Motivation: To investigate whether LLMs can enhance their critical thinking ability by learning to reconstruct arguments, similar to how humans improve critical thinking through argument analysis and reconstruction training.
Method: Three-part framework: (1) GAAR engine for automatic argument reconstruction, (2) Arguinas dataset synthesized using GAAR, (3) experiments comparing models trained on argument reconstruction vs. not on seven critical thinking tasks.
Result: Models trained to learn argument reconstruction outperform those that don’t across all seven critical thinking tasks, with largest gains when training on the Arguinas dataset.
Conclusion: Learning argument reconstruction benefits LLMs’ downstream critical thinking performance, demonstrating that structured argument analysis training can enhance model reasoning capabilities.
Abstract: To think critically about arguments, human learners are trained to identify, reconstruct, and evaluate arguments. Argument reconstruction is especially important because it makes an argument’s underlying inferences explicit. However, it remains unclear whether LLMs can similarly enhance their critical thinking ability by learning to reconstruct arguments. To address this question, we introduce a holistic framework with three contributions. We (1) propose an engine that automatically reconstructs arbitrary arguments (GAAR), (2) synthesize a new high-quality argument reconstruction dataset (Arguinas) using the GAAR engine, and (3) investigate whether learning argument reconstruction benefits downstream critical thinking tasks. Our experimental results show that, across seven critical thinking tasks, models trained to learn argument reconstruction outperform models that do not, with the largest performance gains observed when training on the proposed Arguinas dataset. The source code and dataset will be publicly available.
[26] TRiMS: Real-Time Tracking of Minimal Sufficient Length for Efficient Reasoning via RL
Tingcheng Bian, Jinchang Luo, Mingquan Cheng, Jinyu Zhang, Xiaoling Xia, Ni Li, Yan Tao, Haiwei Wang
Main category: cs.CL
TL;DR: TRiMS cuts reasoning-chain length by more than 80% of tokens while maintaining or slightly improving accuracy through MSL-based compression and GRPO training.
Details
Motivation: Chain-of-thought reasoning in LLMs causes computational redundancy (reasoning inflation), wasting tokens. Need to maximize "Intelligence per Token" by compressing reasoning chains while preserving correctness.
Method: Introduce MSL (Minimal Sufficient Length) metric for shortest reasoning length preserving correctness. Analyze CoT compression strategies, identify structural factors. Propose TRiMS using GRPO algorithm with MSL-based estimation, dynamic batch aggregation, and batch-level standard deviation for advantage computation to stabilize training.
Result: Achieves over 80% CoT token reduction with minor accuracy boost across all benchmarks.
Conclusion: MSL provides theoretical foundation for reasoning compression. TRiMS effectively compresses reasoning chains while maintaining performance, demonstrating practical approach to reducing computational redundancy in LLM reasoning.
Abstract: Large language models achieve breakthroughs in complex reasoning via long chain-of-thought sequences. However, this often leads to severe reasoning inflation, causing substantial computational redundancy. To maximize Intelligence per Token, we introduce a theoretical metric, MSL (Minimal Sufficient Length). MSL rigorously characterizes the shortest reasoning length that preserves answer correctness. We provide a recursive definition based on independently sampled sequences and prove the existence of its limit, establishing the first measurable lower bound for reasoning-chain compression. Building on an analysis of mainstream CoT compression strategies, we identify key structural factors enabling a model to approach MSL. Based on these insights, we propose TRiMS, which employs the GRPO algorithm in conjunction with MSL-based estimation during training, while mitigating instabilities during the training process through dynamic batch aggregation and advantage computation using batch-level standard deviation. TRiMS achieves over 80% CoT token reduction with a minor accuracy boost across all benchmarks.
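Operationally, MSL is the shortest prefix of a reasoning chain that still yields a correct answer. The sketch below illustrates that idea under a simplifying monotonicity assumption (not a claim from the paper), with a stubbed correctness check standing in for re-sampled completions; the paper's recursive definition is not reproduced:

```python
def minimal_sufficient_length(chain_tokens, is_still_correct):
    """Binary-search the shortest prefix of a reasoning chain that still
    yields a correct answer, assuming correctness is monotone in prefix
    length (a simplifying assumption for this sketch)."""
    lo, hi = 0, len(chain_tokens)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_still_correct(chain_tokens[:mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo

# Stub: pretend the answer stays correct once 40% of the chain is kept.
chain = list(range(100))
msl = minimal_sufficient_length(chain, lambda prefix: len(prefix) >= 40)
print("estimated MSL:", msl)  # -> 40
```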
[27] Humans and transformer LMs: Abstraction drives language learning
Jasper Jian, Christopher D. Manning
Main category: cs.CL
TL;DR: Transformer language models learn linguistic categories through abstraction, with class-level behavior emerging before item-specific behavior, suggesting abstraction plays a key role in LM learning similar to human language acquisition.
Details
Motivation: To understand how transformer-based language models learn linguistic categories and whether their learning trajectories resemble abstract feature-based or concrete exemplar-based accounts of human language acquisition.
Method: Used GPT-2 small and novel divergence-based metrics to track learning trajectories through next-token distributions, comparing emergence of lexical semantic and syntactic categories over training.
Result: Found that (i) abstract class-level behavior emerges earlier than lexical item-specific behavior when a construction is learned, and (ii) different linguistic behaviors emerge abruptly in sequence at different training points.
Conclusion: Abstraction plays a key role in how language models learn linguistic categories, and LMs may serve as existence proofs for models of human language acquisition that emphasize abstract feature learning.
Abstract: Categorization is a core component of human linguistic competence. We investigate how a transformer-based language model (LM) learns linguistic categories by comparing its behaviour over the course of training to behaviours which characterize abstract feature-based and concrete exemplar-based accounts of human language acquisition. We investigate how lexical semantic and syntactic categories emerge using novel divergence-based metrics that track learning trajectories using next-token distributions. In experiments with GPT-2 small, we find that (i) when a construction is learned, abstract class-level behaviour is evident at earlier steps than lexical item-specific behaviour, and (ii) that different linguistic behaviours emerge abruptly in sequence at different points in training, revealing that abstraction plays a key role in how LMs learn. This result informs models of human language acquisition for which LMs may serve as an existence proof.
[28] Learning When to Attend: Conditional Memory Access for Long-Context LLMs
Sakshi Choudhary, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Matthew Trager, Wei Xia, Stefano Soatto
Main category: cs.CL
TL;DR: L2A (Learning To Attend) enables conditional long-range memory access by deciding when tokens need global attention vs local context, extending context lengths from 32K to 128K tokens with 80% global attention skipping and 2x training speedup.
Details
Motivation: Language models struggle with long-context generalization, and continued pretraining on long sequences is expensive due to quadratic attention scaling. Most tokens don't need global attention over entire sequences and can rely on local context.
Method: Proposes L2A layer that enables token-wise conditional long-range memory access by deciding when to invoke global attention. Uses custom Triton kernels for efficient implementation on GPUs, and enables post-training pruning of sparse global attention layers.
Result: Extends Qwen models’ context from 32K to 128K tokens, matches standard long-context training within 3% while skipping global attention for ~80% of tokens, achieves ~2x training throughput improvements, and reduces KV cache memory by up to 50% with negligible performance loss.
Conclusion: L2A provides an efficient approach to extend context lengths by conditionally applying global attention only when needed, significantly reducing computational costs while maintaining performance for long-context reasoning tasks.
Abstract: Language models struggle to generalize beyond pretraining context lengths, limiting long-horizon reasoning and retrieval. Continued pretraining on long-context data can help but is expensive due to the quadratic scaling of Attention. We observe that most tokens do not require (Global) Attention over the entire sequence and can rely on local context. Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke global attention. We evaluate L2A on Qwen 2.5 and Qwen 3 models, extending their effective context length from 32K to 128K tokens. L2A matches the performance of standard long-context training to within 3% while skipping Global Attention for $\sim$80% of tokens, outperforming prior baselines. We also design custom Triton kernels to efficiently implement this token-wise conditional Attention on GPUs, achieving up to $\sim$2x improvements in training throughput and time-to-first-token over FlashAttention. Moreover, L2A enables post-training pruning of highly sparse Global Attention layers, reducing KV cache memory by up to 50% with negligible performance loss.
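The core mechanism can be illustrated with a toy per-token gate that routes each token to either windowed (local) or full causal (global) attention. Everything below is a numpy sketch under assumed shapes; L2A's learned gate, Triton kernels, and exact windowing are not reproduced:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def conditional_attention(q, k, v, gate_scores, window=4, threshold=0.0):
    """Toy conditional attention: tokens whose gate score exceeds the
    threshold attend over the full causal prefix (global); all others
    attend only within a short local window."""
    T, d = q.shape
    out = np.zeros_like(v)
    for t in range(T):
        lo = 0 if gate_scores[t] > threshold else max(0, t - window + 1)
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        out[t] = softmax(scores) @ v[lo:t + 1]
    return out

rng = np.random.default_rng(0)
T, d = 16, 8
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
gates = rng.normal(size=T)   # in L2A this gating signal is learned
print(conditional_attention(q, k, v, gates).shape)  # (16, 8)
```

Skipping global attention whenever the gate stays below threshold is what yields the reported compute savings: only the minority of tokens that need long-range context pay the quadratic cost.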
[29] Inducing Epistemological Humility in Large Language Models: A Targeted SFT Approach to Reducing Hallucination
Cem Uluoglakci, Tugba Taskaya Temizel
Main category: cs.CL
TL;DR: HypoTermInstruct: an SFT dataset teaching LLMs epistemological humility by training on questions about non-existent terms to reduce hallucination.
Details
Motivation: LLMs often hallucinate false information because SFT implicitly rewards always responding, creating a need to teach models to recognize knowledge limits and admit uncertainty.
Method: Created HypoTermInstruct dataset (31,487 responses for 11,151 questions) with questions about non-existent "hypothetical" terms; conducted 800 controlled LoRA SFT runs across Llama3.1-8B and Gemma3-4B models, testing 100 fine-tuning configurations with paired controls.
Result: Replacing generic instruction data with HypoTermInstruct significantly improves HypoTerm Score (median increases 0.19% to 25.91%) and FactScore (+0.39% to +0.86%), while maintaining stable MMLU performance (minimal decreases 0.26% to 0.35%)
Conclusion: Targeted, high-quality SFT data teaching meta-cognitive skills can effectively reduce hallucination without preference/RL pipelines, providing mechanistic insights and practical path toward more reliable AI systems
Abstract: Large language models (LLMs) often hallucinate, producing fluent but false information, partly because supervised fine-tuning (SFT) implicitly rewards always responding. We introduce $\textit{HypoTermInstruct}$, an SFT dataset (31,487 responses for 11,151 questions) designed to teach models epistemological humility-the ability to recognize the limits of their own knowledge and admit uncertainty. This is achieved through questions about non-existent “hypothetical” terms. We also release $\textit{HypoTermQA-Enhanced}$, a benchmark for hallucination tendency strengthened through multiple validations. We conducted 800 controlled LoRA SFT runs across $\textit{Llama3.1-8B}$ and $\textit{Gemma3-4B}$ (base and instruct), testing 100 fine-tuning configurations with paired controls. Our results demonstrate that replacing generic instruction data with $\textit{HypoTermInstruct}$ significantly improves the HypoTerm Score (median increases of 0.19% to 25.91%) and FactScore (+0.39% to +0.86%), while maintaining stable performance on MMLU (minimal decreases of 0.26% to 0.35%). Our work demonstrates that targeted, high-quality SFT data teaching meta-cognitive skills can effectively reduce hallucination without preference/RL pipelines, providing mechanistic insights and a practical path toward more reliable AI systems.
[30] Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality
Mengyu Bu, Yang Feng
Main category: cs.CL
TL;DR: XBridge: A compositional encoder-LLM-decoder architecture that leverages pretrained translation models for multilingual capabilities while keeping LLMs as English-centric knowledge cores, with cross-model alignment for semantic consistency.
Details
Motivation: LLMs have strong general intelligence but imbalanced multilingual performance, struggling with low-resource/unseen languages. Pretrained translation models have balanced multilingual capability, suggesting a natural complement to LLMs.
Method: Proposes XBridge architecture: encoder-LLM-decoder composition that offloads multilingual understanding/generation to external translation models while preserving LLM as English-centric knowledge core. Uses lightweight cross-model mapping layers and optimal transport-based alignment objective for semantic consistency.
Result: Experiments on four LLMs across multilingual understanding, reasoning, summarization, and generation show XBridge outperforms strong baselines, especially on low-resource and previously unseen languages, without retraining the LLM.
Conclusion: XBridge effectively bridges LLMs’ knowledge with translation models’ multilingual capabilities, achieving balanced multilingual performance while preserving LLMs’ general intelligence.
Abstract: Large language models (LLMs) exhibit strong general intelligence, yet their multilingual performance remains highly imbalanced. Although LLMs encode substantial cross-lingual knowledge in a unified semantic space, they often struggle to reliably interface this knowledge with low-resource or unseen languages. Fortunately, pretrained encoder-decoder translation models already possess balanced multilingual capability, suggesting a natural complement to LLMs. In this work, we propose XBridge, a compositional encoder-LLM-decoder architecture that offloads multilingual understanding and generation to external pretrained translation models, while preserving the LLM as an English-centric core for general knowledge processing. To address the resulting representation misalignment across models, we introduce lightweight cross-model mapping layers and an optimal transport-based alignment objective, enabling fine-grained semantic consistency for multilingual generation. Experiments on four LLMs across multilingual understanding, reasoning, summarization, and generation indicate that XBridge outperforms strong baselines, especially on low-resource and previously unseen languages, without retraining the LLM.
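The compositional interface can be pictured as a small learned projection from the translation encoder's representation space into the LLM's embedding space. The dimensions below are placeholders and the weights are random; in XBridge such layers are trained with an optimal-transport alignment objective, which is not sketched here:

```python
import numpy as np

rng = np.random.default_rng(0)
d_enc, d_llm, T = 512, 4096, 10   # hypothetical dimensions

# Stand-in for the translation encoder's output for T source tokens.
enc_states = rng.normal(size=(T, d_enc))

# Lightweight cross-model mapping layer (random weights for this sketch;
# the trained version is aligned to the LLM's own representations).
W = rng.normal(size=(d_enc, d_llm)) / np.sqrt(d_enc)
b = np.zeros(d_llm)

llm_inputs = enc_states @ W + b   # fed into the frozen, English-centric LLM
print(llm_inputs.shape)           # (10, 4096)
```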
[31] Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions
Madhav S. Baidya, S. S. Baidya, Chirag Chawla
Main category: cs.CL
TL;DR: Comprehensive benchmark of machine-generated text detectors across multiple methods, datasets, and LLMs reveals limitations in cross-domain generalization and adversarial robustness.
Details
Motivation: Existing benchmarks for machine-generated text detection are limited, typically evaluating single detectors on single datasets under ideal conditions, leaving gaps in understanding cross-domain transfer, cross-LLM generalization, and adversarial robustness.
Method: Comprehensive benchmark evaluating diverse detection approaches across two corpora (HC3 with 23,363 human-ChatGPT pairs and ELI5 with 15,000 human-Mistral-7B pairs). Methods include classical classifiers, fine-tuned transformer encoders (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3), CNN, XGBoost stylometric model, perplexity-based detectors, and LLM-as-detector prompting.
Result: Transformer models achieve near-perfect in-distribution performance but degrade under domain shift. XGBoost stylometric model matches performance while remaining interpretable. LLM-based detectors underperform and are affected by generator-detector identity bias. Perplexity-based methods exhibit polarity inversion (modern LLM outputs show lower perplexity than human text) but remain effective when corrected. No method generalizes robustly across domains and LLM sources.
Conclusion: Current machine-generated text detection methods lack robust cross-domain and cross-LLM generalization capabilities, highlighting the need for more robust and generalizable detection approaches that can handle domain shifts and different LLM sources.
Abstract: The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine-generated text. Existing benchmarks typically evaluate a single detector on a single dataset under ideal conditions, leaving open questions about cross-domain transfer, cross-LLM generalization, and adversarial robustness. We present a comprehensive benchmark evaluating diverse detection approaches across two corpora: HC3 (23,363 human-ChatGPT pairs) and ELI5 (15,000 human-Mistral-7B pairs). Methods include classical classifiers, fine-tuned transformer encoders (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3), a CNN, an XGBoost stylometric model, perplexity-based detectors, and LLM-as-detector prompting. Results show that transformer models achieve near-perfect in-distribution performance but degrade under domain shift. The XGBoost stylometric model matches performance while remaining interpretable. LLM-based detectors underperform and are affected by generator-detector identity bias. Perplexity-based methods exhibit polarity inversion, with modern LLM outputs showing lower perplexity than human text, but remain effective when corrected. No method generalizes robustly across domains and LLM sources.
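The polarity inversion finding is easy to state in code: a classical perplexity detector flags high-perplexity text as machine-generated, while for modern LLM output the comparison must flip. The threshold and scores below are invented for illustration, not values from the benchmark:

```python
def detect_machine(perplexity, threshold=12.0, modern_llm=True):
    """Perplexity-threshold detector. The classic assumption is that
    machine text has HIGHER perplexity than human text; the benchmark
    observes the opposite for modern LLMs ('polarity inversion'), so
    the decision direction flips."""
    if modern_llm:
        return perplexity < threshold   # fluent LLM text scores low
    return perplexity > threshold       # legacy assumption

samples = {"human_essay": 18.3, "chatgpt_answer": 6.1}   # made-up scores
for name, ppl in samples.items():
    print(name, "-> machine?", detect_machine(ppl))
```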
[32] AURORA Model of Formant-to-Tongue Inversion for Didactic and Clinical Applications
Patrycja Strycharczuk, Sam Kirkham
Main category: cs.CL
TL;DR: AURORA model predicts tongue shape from vowel formants using ultrasound and acoustic data, serving as educational tool and biofeedback foundation for speech applications.
Details
Motivation: To create a didactic aid explaining the relationship between acoustic formants and articulatory tongue movements, and to establish a foundation for biofeedback applications in speech therapy and phonetics education.
Method: Developed using ultrasound tongue imaging and acoustic data from 40 native English speakers, creating a model that predicts tongue displacement and shape based on first two formant values. Includes Shiny app and real-time biofeedback software prototypes.
Result: Qualitative evaluation shows the model can predict tongue features from formant values. Two accessibility tools developed: interactive Shiny app for visualization and prototype software for real-time tongue biofeedback.
Conclusion: AURORA provides a practical computational model linking acoustic features to articulatory movements, with applications in phonetics education, linguistics research, and speech therapy biofeedback systems.
Abstract: This paper outlines the conceptual and computational foundations of the AURORA (Acoustic Understanding and Real-time Observation of Resonant Articulations) model. AURORA predicts tongue displacement and shape in vowel sounds based on the first two formant values. It is intended as a didactic aid helping to explain the relationship between formants and the underlying articulation, as well as a foundation for biofeedback applications. The model is informed by ultrasound tongue imaging and acoustic data from 40 native speakers of English. In this paper we discuss the motivation for the model, the modelling objectives as well as the model architecture. We provide a qualitative evaluation of the model, focusing on selected tongue features. We then present two tools developed to make the model more accessible to a wider audience, a Shiny app and a prototype software for real-time tongue biofeedback. Potential users include students of phonetics, linguists in fields adjacent to phonetics, as well as speech and language therapy practitioners and clients.
[33] KA2L: A Knowledge-Aware Active Learning Framework for LLMs
Haoxuan Yin, Bojian Liu, Chen Tang, Yangfan Wang, Lian Yan, Jingchi Jiang
Main category: cs.CL
TL;DR: KA2L framework uses latent space analysis to assess LLMs’ knowledge mastery and generate targeted unknown questions for efficient active learning, reducing annotation costs by 50%.
Details
Motivation: There's limited research on assessing LLMs' depth of domain-specific knowledge comprehension and applying targeted active learning to improve their expertise. Current approaches lack efficient methods to identify what knowledge LLMs have not yet mastered.
Method: Proposes Knowledge-Aware Active Learning (KA2L) framework that: 1) Uses knowledge distribution probing to examine Transformer hidden states and identify known/unknown knowledge distributions, 2) Employs hidden-state decoding to generate unknown questions from latent knowledge space, 3) Focuses training on unmastered knowledge points through active learning.
Result: KA2L reduces annotation and computation costs by 50% across two open-domain and one vertical-domain dataset while achieving better performance. Validated on nine open-source LLMs.
Conclusion: KA2L provides an efficient active learning strategy for LLMs by targeting unmastered knowledge, offering valuable insights for improving LLM expertise with reduced resource requirements.
Abstract: Fine-tuning large language models (LLMs) with high-quality knowledge has been shown to enhance their performance effectively. However, there is a paucity of research on the depth of domain-specific knowledge comprehension by LLMs and the application of targeted active learning to improve their expertise. To address this gap, we introduce the Knowledge-Aware Active Learning (KA2L) framework. This framework assesses LLMs’ mastery of specific knowledge points to aid in constructing unanswerable or unknowable questions through latent space analysis. This active learning strategy enhances training efficiency by focusing on knowledge the model has yet to master, thereby minimizing redundancy in learning already acquired information. This study innovatively employs a knowledge distribution probing technique to examine the hidden states of specific Transformer layers and identify the distribution of known and unknown knowledge within the LLM. Additionally, a hidden-state decoding method is proposed to generate numerous unknown questions in natural language from the latent knowledge space. In our experiments, we selected nine open-source LLMs to validate the effectiveness of the proposed framework. Results indicate that KA2L not only reduces annotation and computation costs by 50% across two open-domain and one vertical-domain dataset but also achieves better performance, offering valuable insights into active learning strategies for LLMs. The code is available at https://anonymous.4open.science/r/KA2L-F15C.
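The knowledge-distribution probing step can be pictured as a linear probe over hidden states that separates "known" from "unknown" items. The features and labels below are synthetic stand-ins, not KA2L's actual probing procedure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64
# Synthetic stand-ins for Transformer hidden states at some layer,
# labeled 1 if the model answered the probe question correctly ("known").
known   = rng.normal(loc=+0.5, size=(200, d))
unknown = rng.normal(loc=-0.5, size=(200, d))
X = np.vstack([known, unknown])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))

# States the probe scores as "unknown" mark knowledge to target in training.
new_states = rng.normal(size=(5, d))
print("predicted known?", probe.predict(new_states))
```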
[34] VeriAgent: A Tool-Integrated Multi-Agent System with Evolving Memory for PPA-Aware RTL Code Generation
Yaoxiang Wang, Qi Shi, ShangZhan Li, Qingguo Hu, Xinyu Yin, Bo Guo, Xu Han, Maosong Sun, Jinsong Su
Main category: cs.CL
TL;DR: A multi-agent framework for PPA-aware Verilog code generation that integrates EDA tools and uses evolvable memory for continual optimization.
Details
Motivation: Current LLM-based RTL code generation focuses only on functional correctness while ignoring critical physical design metrics (Power, Performance, Area), limiting practical deployment in real-world hardware design flows.
Method: Proposes a PPA-aware, tool-integrated multi-agent framework with three agents: Programmer Agent (code generation), Correctness Agent (functional verification), and PPA Agent (physical metrics optimization). Introduces Evolved Memory Mechanism that externalizes optimization experience into structured memory nodes with dynamic memory management for continual improvement.
Result: Achieves strong functional correctness while delivering significant improvements in PPA metrics, transforming RTL generation from one-shot reasoning into a continual, feedback-driven optimization process.
Conclusion: The framework provides a scalable pathway for deploying LLMs in real-world hardware design flows by integrating tool-driven feedback with structured and evolvable memory for joint optimization of functional correctness and physical metrics.
Abstract: LLMs have recently demonstrated strong capabilities in automatic RTL code generation, achieving high syntactic and functional correctness. However, most methods focus on functional correctness while overlooking critical physical design objectives, including Power, Performance, and Area. In this work, we propose a PPA-aware, tool-integrated multi-agent framework for high-quality Verilog code generation. Our framework explicitly incorporates EDA tools into a closed-loop workflow composed of a \textit{Programmer Agent}, a \textit{Correctness Agent}, and a \textit{PPA Agent}, enabling joint optimization of functional correctness and physical metrics. To support continuous improvement without model retraining, we introduce an \textit{Evolved Memory Mechanism} that externalizes optimization experience into structured memory nodes. A dedicated memory manager dynamically maintains the memory pool and allows the system to refine strategies based on historical execution trajectories. Extensive experiments demonstrate that our approach achieves strong functional correctness while delivering significant improvements in PPA metrics. By integrating tool-driven feedback with structured and evolvable memory, our framework transforms RTL generation from one-shot reasoning into a continual, feedback-driven optimization process, providing a scalable pathway for deploying LLMs in real-world hardware design flows.
[35] Do Language Models Encode Semantic Relations? Probing and Sparse Feature Analysis
Andor Diera, Ansgar Scherp
Main category: cs.CL
TL;DR: LLMs encode semantic relations (synonymy, antonymy, hypernymy, hyponymy) with directional asymmetry: hypernymy is redundant and robust, hyponymy is compact and fragile. Relation signals peak in mid-layers, stronger in MLP than attention, with consistent difficulty across models.
Details
Motivation: To understand whether and how LLMs capture structured semantic meaning, specifically examining how they represent concept relationships like synonymy, antonymy, hypernymy, and hyponymy across different model scales.
Method: Combined linear probing with mechanistic interpretability techniques including sparse autoencoders (SAE) and activation patching on three models of increasing scale: Pythia-70M, GPT-2, and Llama 3.1 8B.
Result: Found directional asymmetry in hierarchical relations: hypernymy encoded redundantly and resists suppression, hyponymy relies on compact features easily disrupted. Relation signals peak in mid-layers, stronger in post-residual/MLP than attention. Antonymy easiest, synonymy hardest across models. Probe-level causality is capacity-dependent.
Conclusion: LLMs reliably encode semantic relations with specific architectural patterns, providing a reproducible framework for relating sparse features to causal evidence in model interpretability.
Abstract: Understanding whether large language models (LLMs) capture structured meaning requires examining how they represent concept relationships. In this work, we study three models of increasing scale: Pythia-70M, GPT-2, and Llama 3.1 8B, focusing on four semantic relations: synonymy, antonymy, hypernymy, and hyponymy. We combine linear probing with mechanistic interpretability techniques, including sparse autoencoders (SAE) and activation patching, to identify where these relations are encoded and how specific features contribute to their representation. Our results reveal a directional asymmetry in hierarchical relations: hypernymy is encoded redundantly and resists suppression, while hyponymy relies on compact features that are more easily disrupted by ablation. More broadly, relation signals are diffuse but exhibit stable profiles: they peak in the mid-layers and are stronger in post-residual/MLP pathways than in attention. Difficulty is consistent across models (antonymy easiest, synonymy hardest). Probe-level causality is capacity-dependent: on Llama 3.1, SAE-guided patching reliably shifts these signals, whereas on smaller models the shifts are weak or unstable. Our results clarify where and how reliably semantic relations are represented inside LLMs, and provide a reproducible framework for relating sparse features to probe-level causal evidence.
[36] Adaptive Guidance for Retrieval-Augmented Masked Diffusion Models
Jaemin Kim, Jong Chul Ye
Main category: cs.CL
TL;DR: ARAM is a training-free adaptive guidance framework for Masked Diffusion Models in RAG settings that dynamically calibrates guidance scale based on the SNR of retrieved context distributional shift.
Details
Motivation: Retrieval-Augmented Generation (RAG) faces challenges when retrieved context is noisy, unreliable, or conflicts with model's parametric knowledge, causing retrieval-prior conflicts. While studied in autoregressive LMs, this problem remains unexplored in diffusion-based LMs where iterative denoising introduces unique integration challenges.
Method: Proposes Adaptive Retrieval-Augmented Masked Diffusion (ARAM), a training-free adaptive guidance framework for Masked Diffusion Models in RAG settings. ARAM dynamically calibrates guidance scale during denoising according to the Signal-to-Noise Ratio (SNR) of the distributional shift induced by retrieved context, strengthening guidance when context is reliable and suppressing it when noisy.
Result: Extensive experiments on multiple knowledge-intensive QA benchmarks show that ARAM improves overall QA performance over competitive RAG baselines.
Conclusion: ARAM effectively addresses retrieval-prior conflicts in diffusion-based language models through adaptive guidance based on context reliability, improving RAG performance in knowledge-intensive tasks.
Abstract: Retrieval-Augmented Generation (RAG) improves factual grounding by incorporating external knowledge into language model generation. However, when retrieved context is noisy, unreliable, or inconsistent with the model’s parametric knowledge, it introduces retrieval-prior conflicts that can degrade generation quality. While this problem has been studied in autoregressive language models, it remains largely unexplored in diffusion-based language models, where the iterative denoising process introduces unique challenges for integrating retrieved context. In this work, we propose Adaptive Retrieval-Augmented Masked Diffusion (ARAM), a training-free adaptive guidance framework for Masked Diffusion Models (MDMs) in RAG settings. ARAM dynamically calibrates the guidance scale during denoising according to the Signal-to-Noise Ratio (SNR) of the distributional shift induced by retrieved context. Intuitively, the model strengthens guidance when the retrieved context provides reliable corrective evidence and suppresses it when the contextual signal is noisy or non-supportive. Extensive experiments on multiple knowledge-intensive QA benchmarks show that ARAM improves overall QA performance over competitive RAG baselines.
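The adaptive guidance can be sketched in the familiar classifier-free-guidance form, with the scale driven by a reliability estimate; ARAM's actual SNR computation and per-step schedule are not reproduced here, so treat the mapping from reliability to scale as an assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def guided_logits(logits_plain, logits_ctx, reliability):
    """Guidance in classifier-free-guidance style:
    plain + w * (context-conditioned - plain), with the scale w growing
    with an estimated reliability of the retrieved context."""
    w = reliability   # e.g. a squashed SNR estimate; placeholder mapping
    return logits_plain + w * (logits_ctx - logits_plain)

rng = np.random.default_rng(0)
plain, ctx = rng.normal(size=50), rng.normal(size=50)
for r in (0.0, 1.0, 2.0):   # unreliable -> ignore context; reliable -> lean on it
    p = softmax(guided_logits(plain, ctx, r))
    print(f"reliability={r}: top token {p.argmax()}")
```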
[37] Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor
Ahmed Sharshar, Hosam Elgendy, Saad El Dine Ahmed, Yasser Rohaim, Yuxia Wang
Main category: cs.CL
TL;DR: A multimodal, multilingual benchmark for detecting harmful humor across text, images, and videos in English and Arabic, with explicit/implicit classification to test contextual reasoning.
Details
Motivation: Current static benchmarks fail to capture the subtle cultural nuances and implicit cues in dark humor that require contextual reasoning, posing safety challenges for AI systems.
Method: Created a manually curated dataset of 3,000 texts, 6,000 images (English/Arabic), and 1,200 videos (English/Arabic/universal) with strict annotation guidelines distinguishing Safe vs Harmful jokes, and further classifying Harmful into Explicit and Implicit categories.
Result: Closed-source models significantly outperform open-source ones, with notable performance differences between English and Arabic languages, highlighting the need for culturally grounded, reasoning-aware safety alignment.
Conclusion: The benchmark reveals critical gaps in multimodal AI’s ability to understand culturally nuanced, implicit harmful content, emphasizing the need for improved reasoning capabilities in safety alignment.
Abstract: Dark humor often relies on subtle cultural nuances and implicit cues that require contextual reasoning to interpret, posing safety challenges that current static benchmarks fail to capture. To address this, we introduce a novel multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. Our manually curated dataset comprises 3,000 texts and 6,000 images in English and Arabic, alongside 1,200 videos that span English, Arabic, and language-independent (universal) contexts. Unlike standard toxicity datasets, we enforce a strict annotation guideline: distinguishing \emph{Safe} jokes from \emph{Harmful} ones, with the latter further classified into \emph{Explicit} (overt) and \emph{Implicit} (Covert) categories to probe deep reasoning. We systematically evaluate state-of-the-art (SOTA) open and closed-source models across all modalities. Our findings reveal that closed-source models significantly outperform open-source ones, with a notable difference in performance between the English and Arabic languages in both, underscoring the critical need for culturally grounded, reasoning-aware safety alignment. \textcolor{red}{Warning: this paper contains example data that may be offensive, harmful, or biased.}
[38] CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution
Teng Pan, Yuchen Yan, Zixuan Wang, Ruiqing Zhang, Gaiyang Han, Wanqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen
Main category: cs.CL
TL;DR: CoVerRL is a label-free RL framework where a single LLM alternates between generator and verifier roles to avoid the “consensus trap” where models reinforce systematic errors through majority voting.
Details
Motivation: Label-free RL for LLMs using majority-voted answers as pseudo-labels suffers from "consensus trap" - as training maximizes self-consistency, output diversity collapses and models confidently reinforce systematic errors that evade detection.
Method: CoVerRL framework where a single model alternates between generator and verifier roles. Majority voting provides noisy supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels, creating a co-evolution cycle.
Result: Outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks across Qwen and Llama model families. Self-verification accuracy improves from ~55% to over 85%, confirming genuine co-evolution of both capabilities.
Conclusion: CoVerRL successfully escapes the consensus trap through generator-verifier co-evolution, maintaining high reward accuracy throughout training while improving both reasoning and verification capabilities.
Abstract: Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels. This co-evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks. Moreover, self-verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co-evolve.
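The generator-verifier interplay can be sketched as majority voting that proposes a pseudo-label and a verifier that can veto it. The stub below illustrates that filtering step only, not the CoVerRL training loop or how the two roles co-evolve:

```python
from collections import Counter

def pseudo_label(samples, verify, min_votes=2):
    """Majority-vote a pseudo-label, then let the verifier veto it.
    `verify(answer) -> bool` stands in for the trained verifier role."""
    answer, votes = Counter(samples).most_common(1)[0]
    if votes >= min_votes and verify(answer):
        return answer   # usable pseudo-label for RL
    return None         # filtered out: likely a self-consistent error

# Toy run: the generator is confidently wrong ("42" wins the vote),
# and the verifier rejects it, escaping the consensus trap.
samples = ["42", "42", "42", "17"]
print(pseudo_label(samples, verify=lambda a: a == "17"))  # -> None
```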
[39] Process Supervision for Chain-of-Thought Reasoning via Monte Carlo Net Information Gain
Corentin Royer, Debarun Bhattacharjya, Gaetano Rossiello, Andrea Giovannini, Mennatallah El-Assady
Main category: cs.CL
TL;DR: Proposes an information-theoretic method to automatically generate step-level labels for process reward models, enabling efficient supervision of LLM reasoning with O(N) complexity.
Details
Motivation: Multi-step reasoning in LLMs improves capabilities but increases error propagation risk. Process reward models (PRMs) help by scoring individual steps, but existing training methods require costly human annotations or computationally intensive automatic labeling.
Method: Uses information theory to automatically generate step-level labels by estimating how each reasoning step affects the likelihood of the correct answer. Reduces computational complexity from O(N log N) to O(N).
Result: Demonstrates effective chain-of-thought selection in best-of-K evaluation settings across diverse reasoning benchmarks including mathematics, Python programming, SQL, and scientific question answering.
Conclusion: Enables scalable and efficient supervision of LLM reasoning, particularly for tasks where error propagation is critical, without requiring expensive human annotations.
Abstract: Multi-step reasoning improves the capabilities of large language models (LLMs) but increases the risk of errors propagating through intermediate steps. Process reward models (PRMs) mitigate this by scoring each step individually, enabling fine-grained supervision and improved reliability. Existing methods for training PRMs rely on costly human annotations or computationally intensive automatic labeling. We propose a novel approach to automatically generate step-level labels using Information Theory. Our method estimates how each reasoning step affects the likelihood of the correct answer, providing a signal of step quality. Importantly, it reduces computational complexity to $\mathcal{O}(N)$, improving over the previous $\mathcal{O}(N \log N)$ methods. We demonstrate that these labels enable effective chain-of-thought selection in best-of-$K$ evaluation settings across diverse reasoning benchmarks, including mathematics, Python programming, SQL, and scientific question answering. This work enables scalable and efficient supervision of LLM reasoning, particularly for tasks where error propagation is critical.
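The step-labeling idea can be sketched as the change in the estimated probability of reaching the correct answer as each step is appended, one estimate per prefix (hence linear in the number of steps). The estimator below is a stub, and the paper's exact label definition and Monte Carlo procedure are not reproduced:

```python
def step_labels(steps, prob_correct_given_prefix):
    """Label each reasoning step by the change it induces in the
    estimated probability of reaching the correct answer."""
    labels, prefix = [], []
    prev = prob_correct_given_prefix(prefix)
    for s in steps:
        prefix.append(s)
        cur = prob_correct_given_prefix(prefix)
        labels.append(round(cur - prev, 3))   # >0: helpful step, <0: harmful
        prev = cur
    return labels

# Stub estimator: each prefix length gets a hand-set value for illustration.
values = {0: 0.2, 1: 0.5, 2: 0.4, 3: 0.9}
print(step_labels(["a", "b", "c"], lambda p: values[len(p)]))
# -> [0.3, -0.1, 0.5]
```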
[40] Text-to-Stage: Spatial Layouts from Long-form Narratives
Jefferson Hernandez, Swarnadeep Saha, Chenxi Whitehouse, Sanjeel Parekh, Calvin Murdock, Yuliang Li, W. Owen Brimijoin, Vamsi Krishna Ithapu, Ishwarya Ananthabhotla
Main category: cs.CL
TL;DR: A method for training language models to infer spatial layouts (stage-play scenes, positions, movements) from unstructured narrative text using rejection SFT with Best-of-N sampling and RL with verifiable rewards.
Details
Motivation: To develop language models that can perform spatial reasoning from unstructured text, mimicking human capabilities for automating spatial layout inference in media applications, specifically for narrative-to-play tasks.
Method: Combines rejection SFT using Best-of-N sampling with RL from verifiable rewards via GRPO, evaluated using a dramaturgy-inspired deterministic evaluation suite on classical English literature corpus.
Result: Improvements over vanilla models across multiple metrics: character attribution, spatial plausibility, movement economy, with alignment to LLM-as-a-judge and human preferences.
Conclusion: The approach successfully enables language models to perform spatial reasoning from text, demonstrating practical applications for media automation and narrative understanding.
Abstract: In this work, we probe the ability of a language model to demonstrate spatial reasoning from unstructured text, mimicking human capabilities and automating a process that benefits many downstream media applications. Concretely, we study the narrative-to-play task: inferring stage-play layouts (scenes, speaker positions, movements, and room types) from text that lacks explicit spatial, positional, or relational cues. We then introduce a dramaturgy-inspired deterministic evaluation suite and, finally, a training and inference recipe that combines rejection SFT using Best-of-N sampling with RL from verifiable rewards via GRPO. Experiments on a text-only corpus of classical English literature demonstrate improvements over vanilla models across multiple metrics (character attribution, spatial plausibility, and movement economy), as well as alignment with an LLM-as-a-judge and subjective human preferences.
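The rejection-SFT half of the recipe can be sketched as Best-of-N sampling against a verifiable reward, keeping only high-scoring candidates as fine-tuning targets. `sample` and `reward` below are stubs, and the acceptance bar is an assumption for illustration:

```python
import random

def best_of_n_sft_pairs(prompts, sample, reward, n=8, min_reward=0.5):
    """Rejection SFT in the Best-of-N style: draw N candidates per prompt,
    keep the highest-reward one if it clears the bar, and use the
    survivors as fine-tuning targets."""
    data = []
    for p in prompts:
        cands = [sample(p) for _ in range(n)]
        best = max(cands, key=reward)
        if reward(best) >= min_reward:
            data.append((p, best))
    return data

rng = random.Random(0)
sample = lambda p: f"{p}-layout-{rng.randrange(100)}"
reward = lambda c: int(c.rsplit('-', 1)[1]) / 100   # stub verifiable reward
print(best_of_n_sft_pairs(["scene1", "scene2"], sample, reward))
```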
[41] Event-Centric Human Value Understanding in News-Domain Texts: An Actor-Conditioned, Multi-Granularity Benchmark
Yao Wang, Xin Liu, Zhuochen Liu, Jiankang Chen, Adam Jatowt, Kyoungsook Kim, Noriko Kando, Haitao Yu
Main category: cs.CL
TL;DR: NEVU is a benchmark for actor-conditioned, event-centric, and direction-aware human value recognition in factual news, addressing limitations of existing value datasets.
Details
Motivation: Existing human value datasets lack actor-conditioning, event structure, and value direction specificity needed for understanding values in factual news contexts.
Method: Built from 2,865 English news articles with LLM-assisted annotation pipeline and human auditing, organized at four semantic unit levels with hierarchical value space of 54 fine-grained values.
Result: NEVU contains 45,793 unit-actor pairs and 168,061 directed value instances, with baselines showing lightweight adaptation (LoRA) improves open-source models beyond prompting-only evaluation.
Conclusion: NEVU provides a comprehensive benchmark for value understanding in news that supports both evaluation and supervised adaptation for value recognition tasks.
Abstract: Existing human value datasets do not directly support value understanding in factual news: many are actor-agnostic, rely on isolated utterances or synthetic scenarios, and lack explicit event structure or value direction. We present NEVU (News Event-centric Value Understanding), a benchmark for actor-conditioned, event-centric, and direction-aware human value recognition in factual news. NEVU evaluates whether models can identify value cues, attribute them to the correct actor, and determine value direction from grounded evidence. Built from 2,865 English news articles, NEVU organizes annotations at four semantic unit levels (subevent, behavior-based composite event, story-based composite event, and article) and labels (unit, actor) pairs for fine-grained evaluation across local and composite contexts. The annotations are produced through an LLM-assisted pipeline with staged verification and targeted human auditing. Using a hierarchical value space with 54 fine-grained values and 20 coarse-grained categories, NEVU covers 45,793 unit–actor pairs and 168,061 directed value instances. We provide unified baselines for proprietary and open-source LLMs, and find that lightweight adaptation (LoRA) consistently improves open-source models, showing that although NEVU is designed primarily as a benchmark, it also supports supervised adaptation beyond prompting-only evaluation. Data availability is described in the paper's appendix.
[42] How do LLMs Compute Verbal Confidence
Dharshan Kumaran, Arthur Conmy, Federico Barbero, Simon Osindero, Viorica Patraucean, Petar Velickovic
Main category: cs.CL
TL;DR: LLMs generate verbal confidence scores through cached retrieval from answer-adjacent positions rather than just-in-time computation, representing richer answer-quality evaluation beyond token log-probabilities.
Details
Motivation: To understand how LLMs internally generate verbal confidence scores - whether they compute them just-in-time when requested or automatically during answer generation and cache them, and what these confidence scores actually represent (token log-probabilities vs richer answer quality evaluation).
Method: Used Gemma 3 27B and Qwen 2.5 7B models with activation steering, patching, noising, swap experiments, attention blocking, linear probing, and variance partitioning to analyze confidence generation mechanisms.
Result: Found evidence for cached retrieval: confidence representations emerge at answer-adjacent positions before appearing at verbalization site, with information flow from answer tokens to first post-answer position then retrieved for output. Cached representations explain substantial variance beyond token log-probabilities.
Conclusion: Verbal confidence reflects automatic, sophisticated self-evaluation rather than post-hoc reconstruction, with implications for understanding metacognition in LLMs and improving calibration.
Abstract: Verbal confidence – prompting LLMs to state their confidence as a number or category – is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed - just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents - token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation – not post-hoc reconstruction – with implications for understanding metacognition in LLMs and improving calibration.
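A toy version of the linear-probing step, assuming one records hidden states at the first post-answer position and regresses verbalized confidence on them, comparing against token log-probabilities alone; the arrays here are random placeholders for shape only:

```python
# Hypothetical probe in the spirit of the paper: predict verbal confidence
# from hidden states cached at the first post-answer token position, and
# compare against what token log-probabilities alone can explain.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
H = rng.normal(size=(500, 256))       # activations at the post-answer position (toy)
logp = rng.normal(size=(500, 1))      # answer-token log-probabilities (toy)
conf = rng.uniform(0, 100, size=500)  # verbalized confidence scores (toy)

r2_acts = cross_val_score(Ridge(alpha=1.0), H, conf, scoring="r2", cv=5).mean()
r2_logp = cross_val_score(Ridge(alpha=1.0), logp, conf, scoring="r2", cv=5).mean()
print(f"R^2 from activations: {r2_acts:.2f} vs from log-probs alone: {r2_logp:.2f}")
```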
[43] Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval
Md. Asraful Haque, Aasar Mehdi, Maaz Mahboob, Tamkeen Fatima
Main category: cs.CL
TL;DR: A tiered retrieval and verification architecture for LLMs that reduces hallucinations through a four-phase pipeline with domain-specific grounding and claim-level verification.
Details
Motivation: LLMs suffer from hallucinations (factually incorrect content), especially critical in high-stakes domains where reliability is paramount. Current systems need systematic approaches to intercept factual inaccuracies and shift LLMs from stochastic pattern-matchers to verified truth-seekers.
Method: Four-phase self-regulating pipeline using LangGraph: 1) Intrinsic Verification with Early-Exit logic for compute optimization, 2) Adaptive Search Routing with Domain Detector for subject-specific archives, 3) Corrective Document Grading (CRAG) to filter irrelevant context, and 4) Extrinsic Regeneration with atomic claim-level verification.
Result: Evaluated on 650 queries from five benchmarks (TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, TruthfulQA). Outperformed zero-shot baselines across all environments with win rates peaking at 83.7% in TimeQA v2 and 78.0% in MMLU Global Facts. Groundedness scores remained stable between 78.8% and 86.4%.
Conclusion: The architecture provides robust fail-safe for misinformation but identified persistent “False-Premise Overclaiming” failure mode. Future work should prioritize pre-retrieval “answerability” nodes to further bridge reliability gap in conversational AI.
Abstract: Large Language Models (LLMs) have achieved unprecedented fluency but remain susceptible to “hallucinations” - the generation of factually incorrect or ungrounded content. This limitation is particularly critical in high-stakes domains where reliability is paramount. We propose a domain-grounded tiered retrieval and verification architecture designed to systematically intercept factual inaccuracies by shifting LLMs from stochastic pattern-matchers to verified truth-seekers. The proposed framework utilizes a four-phase, self-regulating pipeline implemented via LangGraph: (I) Intrinsic Verification with Early-Exit logic to optimize compute, (II) Adaptive Search Routing utilizing a Domain Detector to target subject-specific archives, (III) Corrective Document Grading (CRAG) to filter irrelevant context, and (IV) Extrinsic Regeneration followed by atomic claim-level verification. The system was evaluated across 650 queries from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA. Empirical results demonstrate that the pipeline consistently outperforms zero-shot baselines across all environments. Win rates peaked at 83.7% in TimeQA v2 and 78.0% in MMLU Global Facts, confirming high efficacy in domains requiring granular temporal and numerical precision. Groundedness scores remained robustly stable between 78.8% and 86.4% across factual-answer rows. While the architecture provides a robust fail-safe for misinformation, a persistent failure mode of “False-Premise Overclaiming” was identified. These findings provide a detailed empirical characterization of multi-stage RAG behavior and suggest that future work should prioritize pre-retrieval “answerability” nodes to further bridge the reliability gap in conversational AI.
[44] DebugLM: Learning Traceable Training Data Provenance for LLMs
Wenjie Jacky Mo, Qin Liu, Xiaofei Wen, Wenxuan Zhou, Zhe Zhao, Muhao Chen
Main category: cs.CL
TL;DR: DebugLM is a framework that adds data provenance capabilities to LLMs, allowing them to trace behaviors back to specific training data sources and enabling targeted refusal mechanisms without retraining.
Details
Motivation: LLMs are trained through complex multi-stage pipelines with heterogeneous data, but developers lack tools to identify which specific training data is responsible for observed behaviors, making debugging reactive and failures prone to recurrence.
Method: DebugLM equips LLMs with built-in data provenance by training them to associate responses with unique provenance tags indicating responsible datasets, enabling behavior tracing and supporting targeted test-time remediation through selective refusal mechanisms.
Result: Experiments show DebugLM provides accurate behavior tracing in multi-stage training pipelines and effective test-time remediation while preserving the model’s general utility.
Conclusion: DebugLM offers a principled approach to LLM observability and debugging by enabling data provenance tracking and targeted remediation, addressing limitations in current LLM development pipelines.
Abstract: Large language models (LLMs) are trained through multi-stage pipelines over heterogeneous data sources, yet developers lack a principled way to pinpoint the specific data responsible for an observed behavior. This lack of observability reduces debugging to reactive patching and makes failures prone to recur under distribution shift or subsequent model updates. To address this limitation, we propose DebugLM, a framework that equips LLMs with built-in data provenance, enabling them to explicitly trace the origins of their behaviors to specific training data sources. Specifically, the model learns to associate its responses with unique provenance tags that indicate the responsible dataset, empowering developers to precisely identify where undesirable behaviors are learned. Building on this capability, DebugLM further supports targeted test-time remediation, enabling developers to selectively trigger targeted refusal for specified data sources without retraining or modifying model parameters. Experiments demonstrate that DebugLM provides accurate behavior tracing in multi-stage training pipelines and effective test-time remediation while preserving the general utility of the model.
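A hypothetical illustration of the tag mechanism: training examples carry a dataset-level provenance tag the model learns to emit, and a blocked tag triggers refusal at test time. The tag format, tag table, and refusal string are all assumptions for illustration, not the paper's design:

```python
# Hypothetical provenance-tagged training data plus a tag-based refusal
# filter at test time; no retraining or parameter edits required.
TAG_FOR_DATASET = {"forum_dump_v1": "<SRC_03>", "curated_qa_v2": "<SRC_07>"}

def make_training_example(prompt: str, response: str, dataset: str) -> str:
    # The model learns to prefix its response with the tag of the source dataset.
    return f"{prompt}\n{TAG_FOR_DATASET[dataset]} {response}"

def filter_response(model_output: str, blocked_tags: set[str]) -> str:
    tag = model_output.split()[0] if model_output.startswith("<SRC_") else None
    if tag in blocked_tags:  # targeted refusal for a specific data source
        return "I can't help with that."
    return model_output.removeprefix(tag + " ") if tag else model_output
```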
[45] Pretrained Multilingual Transformers Reveal Quantitative Distance Between Human Languages
Yue Zhao, Jiatao Gu, Paloma Jeretič, Weijie Su
Main category: cs.CL
TL;DR: Attention Transport Distance (ATD) uses multilingual language models’ attention mechanisms to quantitatively measure cross-linguistic distance via optimal transport, recovering linguistic groupings and improving low-resource translation.
Details
Motivation: The paper addresses the lack of a unified, scalable quantitative approach to measuring language distance, which is crucial for linguistics, anthropology, and understanding human evolutionary history. While qualitative accounts exist, there's a need for systematic quantitative measurement.
Method: Leverages pretrained multilingual language models, using their spontaneously emerged attention mechanisms as a tokenization-agnostic measure. Treats attention matrices as probability distributions and measures their geometric divergence via optimal transport to compute Attention Transport Distance (ATD) between languages during translation.
Result: ATD applied to diverse languages recovers established linguistic groupings with high fidelity and reveals patterns aligned with geographic and contact-induced relationships. Incorporating ATD as a regularizer improves transfer performance in low-resource machine translation.
Conclusion: Establishes a principled foundation for testing linguistic hypotheses using artificial neural networks, transforming multilingual models into powerful tools for quantitative linguistic discovery and facilitating more equitable multilingual AI.
Abstract: Understanding the distance between human languages is central to linguistics, anthropology, and tracing human evolutionary history. Yet, while linguistics has long provided rich qualitative accounts of cross-linguistic variation, a unified and scalable quantitative approach to measuring language distance remains lacking. In this paper, we introduce a method that leverages pretrained multilingual language models as systematic instruments for linguistic measurement. Specifically, we show that the spontaneously emerged attention mechanisms of these models provide a robust, tokenization-agnostic measure of cross-linguistic distance, termed Attention Transport Distance (ATD). By treating attention matrices as probability distributions and measuring their geometric divergence via optimal transport, we quantify the representational distance between languages during translation. Applying ATD to a large and diverse set of languages, we demonstrate that the resulting distances recover established linguistic groupings with high fidelity and reveal patterns aligned with geographic and contact-induced relationships. Furthermore, incorporating ATD as a regularizer improves transfer performance in low-resource machine translation. Our results establish a principled foundation for testing linguistic hypotheses using artificial neural networks. This framework transforms multilingual models into powerful tools for quantitative linguistic discovery, facilitating more equitable multilingual AI.
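A toy rendition of the core computation: treat two attention maps as probability distributions and compare them with entropic optimal transport via a hand-rolled Sinkhorn loop. The flattening and the ground cost used here are illustrative assumptions, not the paper's exact construction:

```python
# Toy sketch of the ATD idea: attention matrices as distributions,
# compared via entropic optimal transport (Sinkhorn iterations).
import numpy as np

def sinkhorn_cost(a, b, M, reg=0.05, n_iters=200):
    """Entropic OT cost between histograms a, b under cost matrix M."""
    K = np.exp(-M / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # transport plan
    return float((P * M).sum())

def attention_transport_distance(A1, A2):
    # Flatten attention maps into distributions over (query, key) pairs.
    a = (A1 / A1.sum()).ravel()
    b = (A2 / A2.sum()).ravel()
    n = a.size
    idx = np.arange(n, dtype=float)
    M = np.abs(idx[:, None] - idx[None, :]) / n  # toy ground cost
    return sinkhorn_cost(a, b, M)

T = 8
A1, A2 = np.random.rand(T, T), np.random.rand(T, T)
print(attention_transport_distance(A1, A2))
```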
[46] IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia
Priyaranjan Pattnayak, Sanchari Chowdhuri
Main category: cs.CL
TL;DR: Systematic evaluation of LLM safety across 12 Indic languages reveals significant safety drift and poor cross-language consistency, highlighting critical safety generalization gaps in multilingual LLMs.
Details
Motivation: LLMs are increasingly deployed in multilingual settings, but their safety behavior in culturally diverse, low-resource languages remains poorly understood, especially for languages like Indic languages spoken by over 1.2 billion people but underrepresented in training data.
Method: Created a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics; assessed 10 leading LLMs on translated variants; used metrics like cross-language agreement, SAFE rate variance, prompt-level entropy, category bias scores, and multilingual consistency indices.
Result: Revealed significant safety drift with cross-language agreement of just 12.8% and SAFE rate variance exceeding 17% across languages; models showed inconsistent behavior including over-refusal of benign prompts in low-resource scripts, overflagging politically sensitive topics, and failure to flag unsafe generations.
Conclusion: Safety alignment does not transfer evenly across languages; there are critical safety generalization gaps in multilingual LLMs; released IndicSafe benchmark for culturally informed safety evaluation; advocates for language-aware alignment strategies grounded in regional harms.
Abstract: As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompts. Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts and overflag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.
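For concreteness, here is how two of the headline quantities could be computed from a per-prompt verdict table, on toy data; the paper's exact metric definitions may differ:

```python
# Toy computation of cross-language agreement and SAFE-rate spread for one
# model, given a prompts-by-languages table of SAFE/UNSAFE verdicts.
import numpy as np

# rows = prompts, cols = languages; 1 = SAFE verdict, 0 = UNSAFE (toy data)
verdicts = np.random.default_rng(1).integers(0, 2, size=(1000, 12))

# Cross-language agreement: fraction of prompts with identical verdicts everywhere.
agreement = (verdicts.min(axis=1) == verdicts.max(axis=1)).mean()

# SAFE-rate spread: variation of per-language SAFE rates.
safe_rates = verdicts.mean(axis=0)
print(f"agreement={agreement:.1%}, SAFE-rate range={np.ptp(safe_rates):.1%}")
```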
[47] Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing
Raghavv Goel, Mukul Gagrani, Mingu Lee, Chris Lott
Main category: cs.CL
TL;DR: A training-free method for parallel multi-token prediction in LLMs using mask tokens from the embedding space, enabling speculative decoding without model modifications or auxiliary draft models.
Details
Motivation: LLMs have latent multi-token prediction capabilities despite being trained only for next-token prediction. Current speculative decoding methods require auxiliary draft models or model modifications, which are computationally expensive and complex to implement.
Method: Probes LLMs using on-the-fly mask tokens drawn from the model’s embedding space to enable parallel prediction of future tokens. Constructs speculative token trees by sampling top-K candidates from mask-token logits and applies lightweight pruning to retain high-probability continuations. Candidate predictions are verified in parallel during decoding.
Result: Outperforms existing training-free baselines, increasing acceptance length by ~12% on LLaMA3 and 8-12% on Qwen3, achieving throughput gains of 15-19%. Provides theoretical and empirical evidence that decoder layers naturally align mask-token representations with next-token states.
Conclusion: Demonstrates that LLMs have inherent multi-token prediction capabilities that can be exploited without retraining or auxiliary models, enabling efficient parallel decoding with significant throughput improvements.
Abstract: Large language models (LLMs) exhibit latent multi-token prediction (MTP) capabilities despite being trained solely for next-token generation. We propose a simple, training-free MTP approach that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel prediction of future tokens without modifying model weights or relying on auxiliary draft models. Our method constructs a speculative token tree by sampling top-K candidates from mask-token logits and applies a lightweight pruning strategy to retain high-probability continuations. During decoding, candidate predictions are verified in parallel, resulting in lossless generation while substantially reducing the number of model calls and improving token throughput. Across benchmarks, our probing-based MTP consistently outperforms existing training-free baselines, increasing acceptance length by approximately 12% on LLaMA3 and 8–12% on Qwen3, and achieving throughput gains of up to 15–19%. Finally, we provide theoretical insights and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step prediction without retraining or auxiliary models.
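A minimal sketch of the probing idea against a small Hugging Face causal LM. Using the mean of the embedding table as the mask vector is an illustrative assumption (the paper derives its mask tokens from the embedding space in its own way), and the token-tree construction and parallel verification steps are omitted:

```python
# Hypothetical embedding-space probing for parallel draft tokens: append k
# "mask" embeddings after the prompt and read one draft logit per slot.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
emb = model.get_input_embeddings()(ids)             # (1, T, d)
mask = model.get_input_embeddings().weight.mean(0)  # probe vector (assumption)
k = 3
probe = mask.expand(1, k, -1)                       # k future slots
with torch.no_grad():
    out = model(inputs_embeds=torch.cat([emb, probe], dim=1))
draft_logits = out.logits[0, -k:]                   # one distribution per slot
print([tok.decode(int(i)) for i in draft_logits.argmax(-1)])
```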
[48] ShapleyLaw: A Game-Theoretic Approach to Multilingual Scaling Laws
Xuyang Cao, Qianying Liu, Chuan Xiao, Yusuke Oda, Pontus Stenetorp, Daisuke Kawahara, Makoto Onizuka, Sadao Kurohashi, Shuyuan Zheng
Main category: cs.CL
TL;DR: ShapleyLaw: A game-theoretic approach to multilingual scaling laws that quantifies cross-lingual transfer effects using cooperative game theory to optimize language mixture ratios in pretraining.
Details
Motivation: Current multilingual scaling laws fail to account for cross-lingual transfer effects, leading to suboptimal language mixture ratios in pretraining data. The authors aim to better quantify how different languages contribute to overall model performance through transfer effects.
Method: Treats multilingual pretraining as a cooperative game where each language is a player. Uses Shapley values from cooperative game theory to quantify each language’s contribution to test loss reduction. Proposes ShapleyLaw as a game-theoretic multilingual scaling law that incorporates cross-lingual transfer effects.
Result: ShapleyLaw outperforms baseline methods in both model performance prediction and language mixture optimization, demonstrating the importance of accounting for cross-lingual transfer in multilingual scaling laws.
Conclusion: The game-theoretic approach provides a principled way to quantify cross-lingual transfer effects, leading to better optimization of language mixture ratios in multilingual pretraining.
Abstract: In multilingual pretraining, the test loss of a pretrained model is heavily influenced by the proportion of each language in the pretraining data, namely the language mixture ratios. Multilingual scaling laws can predict the test loss under different language mixture ratios and can therefore be used to estimate the optimal ratios. However, the current approaches to multilingual scaling laws do not measure the cross-lingual transfer effect, resulting in suboptimal mixture ratios. In this paper, we consider multilingual pretraining as a cooperative game in which each language acts as a player that jointly contributes to pretraining, gaining the resulting reduction in test loss as the payoff. Consequently, from the perspective of cooperative game theory, we quantify the cross-lingual transfer from each language by its contribution in the game, and propose a game-theoretic multilingual scaling law called ShapleyLaw. Our experiments show that ShapleyLaw outperforms baseline methods in model performance prediction and language mixture optimization.
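For a handful of languages, Shapley values can be computed exactly by coalition enumeration. A self-contained sketch with a made-up value function; in the paper the value of a language subset is estimated from pretraining runs, not given in closed form:

```python
# Exact Shapley contributions of each language to loss reduction, by
# enumerating coalitions (feasible only for a few players).
from itertools import combinations
from math import factorial

def shapley(players, value):
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[p] += w * (value(frozenset(S) | {p}) - value(frozenset(S)))
    return phi

# Toy value function: additive base utility plus an en<->de transfer bonus.
def value(S):
    base = {"en": 1.0, "de": 0.6, "ja": 0.5}
    v = sum(base[lang] for lang in S)
    if {"en", "de"} <= S:
        v += 0.3  # cross-lingual transfer, credited to en and de by Shapley
    return v

print(shapley(["en", "de", "ja"], value))
```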
[49] Gender Disambiguation in Machine Translation: Diagnostic Evaluation in Decoder-Only Architectures
Chiara Manna, Hosein Mohebbi, Afra Alishahi, Frédéric Blain, Eva Vanmassenhove
Main category: cs.CL
TL;DR: Decoder-only MT models show gender bias similar to encoder-decoder models, but instruction tuning reduces masculine prior bias.
Details
Motivation: Large Language Models in MT exhibit systematic gender biases due to language-specific gender marking differences, and existing benchmarks fail to capture the full complexity of this bias.
Method: Extended bias evaluation framework by introducing “Prior Bias” measure to capture default gender assumptions, applied to decoder-only MT models, comparing with encoder-decoder architectures and analyzing effects of post-training like instruction tuning.
Result: Decoder-only models don’t outperform encoder-decoder models on gender-specific metrics; instruction tuning improves contextual awareness and reduces masculine Prior Bias.
Conclusion: Despite scale and state-of-the-art status, decoder-only MT models still exhibit gender bias, but post-training interventions like instruction tuning can mitigate masculine prior bias.
Abstract: While Large Language Models achieve state-of-the-art results across a wide range of NLP tasks, they remain prone to systematic biases. Among these, gender bias is particularly salient in MT, due to systematic differences across languages in whether and how gender is marked. As a result, translation often requires disambiguating implicit source signals into explicit gender-marked forms. In this context, standard benchmarks may capture broad disparities but fail to reflect the full complexity of gender bias in modern MT. In this paper, we extend recent frameworks on bias evaluation by: (i) introducing a novel measure coined “Prior Bias”, capturing a model’s default gender assumptions, and (ii) applying the framework to decoder-only MT models. Our results show that, despite their scale and state-of-the-art status, decoder-only models do not generally outperform encoder-decoder architectures on gender-specific metrics; however, post-training (e.g., instruction tuning) not only improves contextual awareness but also reduces the masculine Prior Bias.
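A toy score in the spirit of the proposed Prior Bias measure: on gender-ambiguous source sentences, check how far the model's choice of gendered target forms skews masculine. The scoring convention and the gender_of label extractor are hypothetical:

```python
# Hypothetical "prior bias" score over translations of gender-ambiguous
# source sentences: +0.5 = fully masculine defaults, 0 = balanced.
from typing import Callable, List

def prior_bias(translations: List[str], gender_of: Callable[[str], str]) -> float:
    genders = [gender_of(t) for t in translations]  # "M" or "F" per target form
    return genders.count("M") / len(genders) - 0.5
```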
[50] ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation
Argentina Anna Rescigno, Eva Vanmassenhove, Johanna Monti
Main category: cs.CL
TL;DR: ConGA framework provides linguistically grounded gender annotation guidelines for evaluating gender bias in machine translation, particularly for English-Italian language pairs with gender asymmetry.
Details
Motivation: Gender handling in machine translation remains challenging, especially when translating from gender-neutral languages (English) to morphologically gendered ones (Italian). Current MT systems often default to masculine forms, reinforcing bias and reducing translation accuracy.
Method: Developed Contextual Gender Annotation (ConGA) framework - a linguistically grounded set of guidelines for word-level gender annotation. Distinguishes between semantic gender in English (Masculine, Feminine, Ambiguous) and grammatical gender in Italian (Masculine, Feminine), with entity-level identifiers for cross-sentence tracking.
Result: Applied ConGA to gENder-IT dataset, creating gold-standard resource for evaluating gender bias. Results show systematic masculine overuse and inconsistent feminine realization in current MT systems.
Conclusion: ConGA provides both methodology and benchmark for building more gender-aware and multilingual NLP systems by combining fine-grained linguistic annotation with quantitative evaluation.
Abstract: Handling gender across languages remains a persistent challenge for Machine Translation (MT) and Large Language Models (LLMs), especially when translating from gender-neutral languages into morphologically gendered ones, such as English to Italian. English largely omits grammatical gender, while Italian requires explicit agreement across multiple grammatical categories. This asymmetry often leads MT systems to default to masculine forms, reinforcing bias and reducing translation accuracy. To address this issue, we present the Contextual Gender Annotation (ConGA) framework, a linguistically grounded set of guidelines for word-level gender annotation. The scheme distinguishes between semantic gender in English through three tags, Masculine (M), Feminine (F), and Ambiguous (A), and grammatical gender realisation in Italian (Masculine (M), Feminine (F)), combined with entity-level identifiers for cross-sentence tracking. We apply ConGA to the gENder-IT dataset, creating a gold-standard resource for evaluating gender bias in translation. Our results reveal systematic masculine overuse and inconsistent feminine realisation, highlighting persistent limitations of current MT systems. By combining fine-grained linguistic annotation with quantitative evaluation, this work offers both a methodology and a benchmark for building more gender-aware and multilingual NLP systems.
[51] The Moral Foundations Reddit Corpus
Jackson Trager, Alireza S. Ziabari, Elnaz Rahmati, Aida Mostafazadeh Davani, Preni Golazizian, Farzan Karimi-Malekabadi, Ali Omrani, Zhihe Li, Brendan Kennedy, Georgios Chochlakis, Nils Karl Reimer, Melissa Reyes, Kelsey Cheng, Mellow Wei, Christina Merrifield, Arta Khosravi, Evans Alvarez, Morteza Dehghani
Main category: cs.CL
TL;DR: A new Moral Foundations Reddit Corpus of 16K hand-annotated comments for 8 moral sentiment categories, showing LLMs still lag behind fine-tuned encoders on this subjective moral classification task.
Details
Motivation: Existing moral sentiment datasets are limited to Twitter, creating a need for diverse social media data to better understand moral rhetoric's role in online behaviors. Current computational methods require large annotated datasets for strong performance in subjective moral sentiment detection.
Method: Created Moral Foundations Reddit Corpus with 16,123 English Reddit comments from 12 subreddits, annotated by at least three trained annotators for 8 moral sentiment categories based on updated Moral Foundations Theory. Evaluated LLMs (Llama3-8B, Ministral-8B) in zero-shot, few-shot, and PEFT settings against fine-tuned encoder-only models like BERT.
Result: LLMs continue to underperform fine-tuned encoders on this subjective moral sentiment classification task, highlighting the ongoing need for human-annotated moral corpora for AI alignment evaluation.
Conclusion: The Moral Foundations Reddit Corpus provides a valuable resource for moral sentiment analysis, demonstrating that despite LLM advances, fine-tuned encoders still outperform on subjective moral classification tasks, emphasizing the importance of specialized annotated datasets for AI alignment.
Abstract: Moral framing and sentiment can affect a variety of online and offline behaviors, including donation, environmental action, political engagement, and protest. Various computational methods in Natural Language Processing (NLP) have been used to detect moral sentiment from textual data, but achieving strong performance in such subjective tasks requires large, hand-annotated datasets. Previous corpora annotated for moral sentiment have proven valuable, and have generated new insights both within NLP and across the social sciences, but have been limited to Twitter. To facilitate improving our understanding of the role of moral rhetoric, we present the Moral Foundations Reddit Corpus, a collection of 16,123 English Reddit comments that have been curated from 12 distinct subreddits, hand-annotated by at least three trained annotators for 8 categories of moral sentiment (i.e., Care, Proportionality, Equality, Purity, Authority, Loyalty, Thin Morality, Implicit/Explicit Morality) based on the updated Moral Foundations Theory (MFT) framework. We evaluate baselines using large language models (Llama3-8B, Ministral-8B) in zero-shot, few-shot, and PEFT (Parameter-Efficient Fine-Tuning) settings, comparing their performance to fine-tuned encoder-only models like BERT (Bidirectional Encoder Representations from Transformers). The results show that LLMs continue to lag behind fine-tuned encoders on this subjective task, underscoring the ongoing need for human-annotated moral corpora for AI alignment evaluation. Keywords: moral sentiment annotation, moral values, moral foundations theory, multi-label text classification, large language models, benchmark dataset, evaluation and alignment resource
[52] Bridge Diffusion Model: Bridge Chinese Text-to-Image Diffusion Model with English Communities
Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Ao Ma, Xiaoyu Wu, Dawei Leng, Yuhui Yin
Main category: cs.CL
TL;DR: Bridge Diffusion Model (BDM) enables Chinese text-to-image generation while maintaining compatibility with English TTI ecosystems through a backbone-branch architecture.
Details
Motivation: English-native TTI models have language barriers and cultural biases, while training non-English models from scratch loses compatibility with English TTI advancements. Need a solution for Chinese TTI that bridges both worlds.
Method: Proposes Bridge Diffusion Model (BDM) with backbone-branch structure: backbone learns English semantics, branch learns Chinese semantics while keeping latent space compatible with English TTI backbone in end-to-end training.
Result: BDM generates images with precise Chinese semantics while remaining compatible with English TTI plugins (checkpoints, LoRA, ControlNet, Dreambooth, Textual Inversion) and enables seamless Chinese-English semantic combination in single images.
Conclusion: BDM provides effective solution for building Chinese TTI models that maintain compatibility with English TTI ecosystems, enabling cultural interaction and leveraging continuous English TTI advancements.
Abstract: Text-to-Image generation (TTI) technologies are advancing rapidly, especially in the English-language communities. However, beyond the user-input language barrier, English-native TTI models inherently carry biases from their English-world-centric training data, which creates a dilemma for the development of other language-native TTI models. One common choice is to fine-tune the English-native TTI model with translated samples, but this falls short of fully addressing the model bias problem. Alternatively, training non-English-native models from scratch can effectively resolve the English-world bias, but a model trained this way diverges from the English TTI communities and can no longer benefit from the advances those communities continue to make. To build a Chinese TTI model while keeping compatibility with the English TTI communities, we propose a novel model structure referred to as the “Bridge Diffusion Model” (BDM). The proposed BDM employs a backbone-branch network structure to learn Chinese semantics while keeping the latent space compatible with the English-native TTI backbone, in an end-to-end manner. The unique advantage of the proposed BDM is that it is not only adept at generating images that precisely depict Chinese semantics, but also compatible with various English-native TTI plugins, such as different checkpoints, LoRA, ControlNet, Dreambooth, and Textual Inversion. Moreover, BDM can concurrently generate content seamlessly combining both Chinese-native and English-native semantics within a single image, fostering cultural interaction.
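A toy rendering of a backbone-branch layout in PyTorch: a frozen English-native block plus a trainable branch whose output is added back into the backbone's hidden states, so the latent space stays backbone-shaped. This is purely illustrative; BDM's actual diffusion wiring is more involved:

```python
# Toy backbone-branch block: frozen backbone preserves the English-native
# latent space, trainable branch injects Chinese conditioning additively.
import torch
import torch.nn as nn

class BackboneBranchBlock(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.backbone = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.branch = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        for p in self.backbone.parameters():  # keep English-native weights fixed
            p.requires_grad = False

    def forward(self, h: torch.Tensor, zh_cond: torch.Tensor) -> torch.Tensor:
        return self.backbone(h) + self.branch(zh_cond)  # latent stays compatible

block = BackboneBranchBlock()
h, cond = torch.randn(2, 16, 512), torch.randn(2, 16, 512)
print(block(h, cond).shape)
```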
[53] QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt Keutzer
Main category: cs.CL
TL;DR: QFT enables full-parameter fine-tuning of LLMs using INT8 quantization for all training states (weights, gradients, optimizer states), reducing memory usage to 21% while maintaining performance.
Details
Motivation: Fine-tuning LLMs requires expensive GPUs due to large memory demands. Parameter-efficient methods don't match full-parameter tuning performance. Need affordable full-parameter tuning on existing hardware.
Method: Quantizes all training states to INT8. Uses Lion optimizer for robustness to gradient quantization. Employs hybrid feature quantizer to protect critical features. Develops stack-based gradient flow for integer backpropagation.
Result: Reduces model state memory to 21% of standard. LLaMA-7B tuning requires <30GB memory (feasible on single A6000). Achieves comparable performance to full-precision tuning.
Conclusion: QFT enables affordable full-parameter LLM fine-tuning on existing GPUs through comprehensive INT8 quantization while maintaining training performance.
Abstract: Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks. Fine-tuning these pretrained models on downstream datasets provides further significant performance gains; however, this process typically requires a large number of expensive, high-end GPUs. Although there have been efforts focused on parameter-efficient fine-tuning, they cannot fully unlock the powerful potential of full-parameter fine-tuning. In this paper, we propose QFT, a Quantized Full-parameter Tuning framework for LLMs that quantizes and stores all training states, including weights, gradients, and optimizer states, in INT8 format to reduce training memory, thereby enabling full-parameter fine-tuning on existing GPUs at an affordable cost. To ensure training performance, we make two key efforts: i) for quantized gradients and optimizer states, we theoretically prove that the Lion optimizer, with its property of consistent update magnitudes, is highly robust to quantization; and ii) for quantized weights, we employ the hybrid feature quantizer, which identifies and protects a small subset of sparse critical features while quantizing the remaining dense features, thus ensuring accurate weight updates without FP32 backups. Moreover, to support backpropagation in the integer context, we develop a stack-based gradient flow scheme with O(1) complexity, forming a unified integer training pipeline. As a result, QFT reduces the model state memory to 21% of the standard solution while achieving comparable performance, e.g., tuning a LLaMA-7B model requires only <30GB of memory, making it feasible on a single A6000 GPU.
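A toy sketch of the momentum-in-INT8 idea combined with the Lion update, whose sign-based step is what makes it tolerant to quantization noise. The symmetric per-tensor quantizer below is an illustrative assumption and far simpler than QFT's hybrid feature quantizer:

```python
# Toy Lion step with the momentum state held in int8 between steps.
import torch

def quantize(x):  # float tensor -> (int8 tensor, scale)
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    return torch.round(x / scale).to(torch.int8), scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

def lion_step_int8(w, grad, m_q, m_scale, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.01):
    m = dequantize(m_q, m_scale)
    update = torch.sign(beta1 * m + (1 - beta1) * grad)  # sign update: magnitude-invariant
    w = w - lr * (update + wd * w)
    m = beta2 * m + (1 - beta2) * grad
    return w, *quantize(m)  # re-quantize the optimizer state to int8

w = torch.randn(1024)
m_q, m_scale = quantize(torch.zeros(1024))
w, m_q, m_scale = lion_step_int8(w, torch.randn(1024), m_q, m_scale)
```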
[54] BioMamba: Domain-Adaptive Biomedical Language Models
Ling Yue, Mingzhi Zhu, Sixue Xing, Shaowu Pan, Vijil Chenthamarakshan, Yanbo Wang, Yunning Cao, Payel Das, Tianfan Fu
Main category: cs.CL
TL;DR: BioMamba: A family of biomedical Mamba2 models trained on PubMed with general-domain data preservation, showing strong performance on biomedical NLP tasks while maintaining general language capabilities.
Details
Motivation: Biomedical language models need to excel on biomedical text while retaining general-domain language abilities. For Mamba-based models, this trade-off hasn't been clearly studied across biomedical literature and clinical text.
Method: Developed BioMamba through continued pretraining of public Mamba2 checkpoints on PubMed, with small amounts of general-domain data from C4 and Wikipedia included to preserve general-domain language ability. Evaluated across multiple model scales on clinical note completion, discharge summary generation, and biomedical yes/no question answering.
Result: BioMamba consistently improved PubMed modeling, improved Wikipedia modeling, and left C4 performance largely unchanged. After supervised fine-tuning, it transferred well to both biomedical literature and clinical text, matching or exceeding SFT from base checkpoints. Strongest model achieved PubMed perplexity of 5.28 and accuracies of 90.24% (BioASQ) and 73.00% (PubMedQA).
Conclusion: Balanced domain-adaptive pretraining strategy strengthens Mamba language models for both biomedical literature and clinical text while preserving general-domain language capabilities, establishing BioMamba as a practical foundation for biomedical NLP applications.
Abstract: Background: Biomedical language models should improve performance on biomedical text while retaining general-domain language ability. For Mamba-based models, this trade-off has not been clearly studied across biomedical literature and clinical text. Methods: We developed BioMamba, a family of biomedical models obtained by continued pretraining of public Mamba2 checkpoints on PubMed, with small amounts of general-domain data from the Colossal Clean Crawled Corpus (C4) and Wikipedia included to help preserve general-domain language ability. We evaluated language modeling and three downstream tasks across multiple model scales: clinical note completion, discharge summary generation, and biomedical yes/no question answering. Results: BioMamba consistently improved PubMed modeling, improved Wikipedia modeling, and left C4 performance largely unchanged. After supervised fine-tuning, BioMamba transferred well to both biomedical literature and clinical text, yielding strong results on completion, summarization, and question answering. On MIMIC-IV, BioMamba+SFT consistently matched or exceeded SFT from the corresponding base checkpoints across note completion and discharge summary generation. The strongest model achieved a PubMed perplexity of 5.28 and accuracies of 90.24% and 73.00% on BioASQ and PubMedQA, respectively. Conclusion: Balanced domain-adaptive pretraining strategy strengthens Mamba language models for both biomedical literature and clinical text, while preserving general-domain language capabilities, establishing BioMamba as a practical foundation for biomedical NLP applications.
[55] Multilingual LLMs Struggle to Link Orthography and Semantics in Bilingual Word Processing
Eshaan Tanwar, Gayatri Oke, Tanmoy Chakraborty
Main category: cs.CL
TL;DR: LLMs struggle with cross-lingual word ambiguity, particularly with interlingual homographs, showing heavy reliance on orthographic similarity over semantic understanding.
Details
Motivation: To investigate how multilingual Large Language Models handle cross-lingual lexical phenomena like cognates and interlingual homographs, and assess their semantic understanding capabilities across languages.
Method: Evaluated LLMs on English-Spanish, English-French, and English-German cognates, non-cognates, and interlingual homographs. Tested disambiguation abilities both in isolation and within sentence contexts, comparing performance against random baselines.
Result: LLMs perform well on cognates and non-cognates in isolation but struggle significantly with interlingual homographs, often performing below random baselines. Models show heavy reliance on orthographic similarities rather than semantic understanding, with no correlation between disambiguation performance and semantic comprehension.
Conclusion: Multilingual LLMs lack robust semantic understanding of cross-lingual ambiguities, particularly for interlingual homographs, and demonstrate inconsistent strategies for handling different language contexts.
Abstract: Bilingual lexical processing is shaped by the complex interplay of phonological, orthographic, and semantic features of two languages within an integrated mental lexicon. In humans, this is evident in the ease with which cognate words - words similar in both orthographic form and meaning (e.g., blind, meaning “sightless” in both English and German) - are processed, compared to the challenges posed by interlingual homographs, which share orthographic form but differ in meaning (e.g., gift, meaning “present” in English but “poison” in German). We investigate how multilingual Large Language Models (LLMs) handle such phenomena, focusing on English-Spanish, English-French, and English-German cognates, non-cognate, and interlingual homographs. Specifically, we evaluate their ability to disambiguate meanings and make semantic judgments, both when these word types are presented in isolation or within sentence contexts. Our findings reveal that while certain LLMs demonstrate strong performance in recognizing cognates and non-cognates in isolation, they exhibit significant difficulty in disambiguating interlingual homographs, often performing below random baselines. This suggests LLMs tend to rely heavily on orthographic similarities rather than semantic understanding when interpreting interlingual homographs. Further, we find LLMs exhibit difficulty in retrieving word meanings, with performance in isolative disambiguation tasks having no correlation with semantic understanding. Finally, we study how the LLM processes interlingual homographs in incongruent sentences. We find models to opt for different strategies in understanding English and non-English homographs, highlighting a lack of a unified approach to handling cross-lingual ambiguities.
[56] Byte-token Enhanced Language Models for Temporal Point Processes Analysis
Quyu Kong, Yixuan Zhang, Yang Liu, Panrong Tong, Enqi Liu, Feng Zhou
Main category: cs.CL
TL;DR: Language-TPP integrates Temporal Point Processes with Large Language Models for Web event sequence modeling, using temporal encoding as byte-tokens to achieve SOTA performance on time/type prediction while improving text generation quality.
Details
Motivation: Traditional TPP models struggle to incorporate rich textual descriptions from Web events, while LLMs lack mechanisms for handling temporal dynamics in event sequences. There's a need to bridge this gap for better Web event sequence modeling.
Method: Introduces a novel temporal encoding mechanism that converts continuous time intervals into specialized byte-tokens, enabling direct integration with standard language model architectures without requiring TPP-specific modifications.
Result: Achieves state-of-the-art performance on multiple TPP benchmarks (event time prediction and type prediction) on real-world Web datasets. Temporal information improves quality of generated event descriptions with enhanced ROUGE-L scores and better aligned sentiment distributions.
Conclusion: Language-TPP effectively captures both temporal dynamics and textual patterns in Web user behavior, with implications for content generation, user behavior understanding, and Web platform applications.
Abstract: Temporal Point Processes (TPPs) have been widely used for modeling event sequences on the Web, such as user reviews, social media posts, and online transactions. However, traditional TPP models often struggle to effectively incorporate the rich textual descriptions that accompany these events, while Large Language Models (LLMs), despite their remarkable text processing capabilities, lack mechanisms for handling the temporal dynamics inherent in Web-based event sequences. To bridge this gap, we introduce Language-TPP, a unified framework that seamlessly integrates TPPs with LLMs for enhanced Web event sequence modeling. Our key innovation is a novel temporal encoding mechanism that converts continuous time intervals into specialized byte-tokens, enabling direct integration with standard language model architectures for TPP modeling without requiring TPP-specific modifications. This approach allows Language-TPP to achieve state-of-the-art performance across multiple TPP benchmarks, including event time prediction and type prediction, on real-world Web datasets spanning e-commerce reviews, social media and online Q&A platforms. More importantly, we demonstrate that our unified framework unlocks new capabilities for TPP research: incorporating temporal information improves the quality of generated event descriptions, as evidenced by enhanced ROUGE-L scores, and better aligned sentiment distributions. Through comprehensive experiments, including qualitative analysis of learned distributions and scalability evaluations on long sequences, we show that Language-TPP effectively captures both temporal dynamics and textual patterns in Web user behavior, with important implications for content generation, user behavior understanding, and Web platform applications. Code is available at https://github.com/qykong/Language-TPP.
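A toy version of byte-token temporal encoding: pack an inter-event time into a few bytes and map each byte to a reserved token id. The float16 byte layout and the vocabulary offset are assumptions for illustration; the paper defines its own byte-token scheme:

```python
# Hypothetical byte-token encoding of continuous inter-event times.
import numpy as np

BYTE_VOCAB_OFFSET = 50_000  # assumed start of reserved byte-token ids

def time_to_byte_tokens(dt_seconds: float) -> list[int]:
    raw = np.float16(dt_seconds).tobytes()  # 2 bytes
    return [BYTE_VOCAB_OFFSET + b for b in raw]

def byte_tokens_to_time(tokens: list[int]) -> float:
    raw = bytes(t - BYTE_VOCAB_OFFSET for t in tokens)
    return float(np.frombuffer(raw, dtype=np.float16)[0])

toks = time_to_byte_tokens(37.5)
print(toks, byte_tokens_to_time(toks))  # round-trips up to float16 precision
```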
[57] Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models
Neeraj Gangwar, Suma P Bhat, Nickvash Kani
Main category: cs.CL
TL;DR: Smaller models struggle with arithmetic in math reasoning tasks; using programmatically generated synthetic arithmetic data via intermediate fine-tuning or instruction-tuning mixtures improves their arithmetic capabilities and overall math reasoning performance.
Details
Motivation: Smaller models have difficulty with arithmetic computations in mathematical reasoning tasks despite knowledge distillation and data augmentation approaches. There's a need to enhance arithmetic capabilities specifically in smaller models to improve their mathematical reasoning performance.
Method: Two approaches using programmatically generated synthetic arithmetic dataset: 1) Intermediate fine-tuning - fine-tuning on arithmetic data before reasoning dataset training, 2) Integrating arithmetic dataset into instruction-tuning mixture to learn arithmetic alongside general instruction-following.
Result: Experiments on multiple reasoning benchmarks show that incorporating arithmetic dataset through either approach enhances models’ arithmetic capabilities and improves mathematical reasoning performance.
Conclusion: Targeted arithmetic training using synthetic data effectively improves smaller models’ mathematical reasoning by addressing their arithmetic computation weaknesses.
Abstract: While large models pre-trained on high-quality data exhibit excellent performance on mathematical reasoning (e.g., GSM8k, MultiArith), it remains challenging to specialize smaller models for these tasks. Common approaches to address this challenge include knowledge distillation from large teacher models and data augmentation (e.g., rephrasing questions and generating synthetic solutions). Despite these efforts, smaller models struggle with arithmetic computations, leading to errors in mathematical reasoning. In this work, we leverage a synthetic arithmetic dataset generated programmatically to enhance the reasoning capabilities of smaller models. We investigate two key approaches to incorporate this dataset: (1) intermediate fine-tuning, in which a model is fine-tuned on the arithmetic dataset before training it on a reasoning dataset, and (2) integrating the arithmetic dataset into an instruction-tuning mixture, allowing the model to learn arithmetic skills alongside general instruction-following abilities. Our experiments on multiple reasoning benchmarks demonstrate that incorporating an arithmetic dataset, whether through targeted fine-tuning or within an instruction-tuning mixture, enhances models’ arithmetic capabilities, thereby improving their mathematical reasoning performance.
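A minimal generator for the kind of programmatic arithmetic data described; the instruction template is a hypothetical choice, not the paper's:

```python
# Programmatic synthetic arithmetic examples for fine-tuning mixtures.
import random

def make_arithmetic_examples(n: int, max_val: int = 10_000, seed: int = 0):
    rng = random.Random(seed)
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
    examples = []
    for _ in range(n):
        a, b = rng.randint(0, max_val), rng.randint(0, max_val)
        op = rng.choice(list(ops))
        examples.append({
            "instruction": f"Compute {a} {op} {b}.",  # assumed template
            "output": str(ops[op](a, b)),
        })
    return examples

print(make_arithmetic_examples(2))
```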
[58] Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review
Toghrul Abbasli, Kentaroh Toyoda, Yuan Wang, Leon Witt, Muhammad Asif Ali, Yukai Miao, Dan Li, Qingsong Wei
Main category: cs.CL
TL;DR: Survey paper reviewing uncertainty quantification and calibration methods for Large Language Models, with empirical evaluation and benchmark creation.
Details
Motivation: Hallucination in LLMs remains a major challenge, but there's a lack of comprehensive analysis and benchmarking of uncertainty quantification and calibration methods adapted for LLMs.
Method: Systematic survey of prior works on UQ and calibration for LLMs, creation of rigorous benchmark, and empirical evaluation of six methods using two widely used reliability datasets.
Result: Empirical evaluation justifies significant findings from the review, providing insights into effectiveness of different calibration approaches for LLMs.
Conclusion: First dedicated survey on calibration methods and metrics for LLMs, with identified future directions and open challenges for improving uncertainty quantification in LLMs.
Abstract: Large Language Models (LLMs) have been transformative across many domains. However, hallucination, i.e., confidently outputting incorrect information, remains one of the leading challenges for LLMs. This raises the question of how to accurately assess and quantify the uncertainty of LLMs. Extensive literature on traditional models has explored Uncertainty Quantification (UQ) to measure uncertainty and employed calibration techniques to address the misalignment between uncertainty and accuracy. While some of these methods have been adapted for LLMs, the literature lacks an in-depth analysis of their effectiveness and does not offer a comprehensive benchmark to enable insightful comparison among existing solutions. In this work, we fill this gap via a systematic survey of representative prior works on UQ and calibration for LLMs and introduce a rigorous benchmark. Using two widely used reliability datasets, we empirically evaluate six related methods, which justify the significant findings of our review. Finally, we provide outlooks for key future directions and outline open challenges. To the best of our knowledge, this survey is the first dedicated study to review the calibration methods and relevant metrics for LLMs.
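As a concrete anchor for the calibration side of the survey, here is Expected Calibration Error (ECE), one of the standard metrics in this literature: bin predictions by confidence and average the per-bin gap between confidence and accuracy. This is a generic textbook definition, not a metric attributed to the paper:

```python
# Expected Calibration Error: weighted average over confidence bins of
# |accuracy(bin) - mean_confidence(bin)|.
import numpy as np

def ece(confidences, correct, n_bins=10):
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            total += in_bin.mean() * gap  # bin weight = fraction of samples
    return total

print(ece([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```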
[59] A quantitative analysis of semantic information in deep representations of text and images
Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni, Alessandro Laio
Main category: cs.CL
TL;DR: Analysis of semantic alignment across models using Information Imbalance measure, showing semantic information concentrates in middle layers, with cross-modal predictability stronger between large independently trained models than jointly trained ones.
Details
Motivation: To understand the phenomenon of semantic alignment across different models processing identical or semantically related inputs, and to analyze how semantic information is distributed across model layers and modalities.
Method: Uses Information Imbalance - an asymmetric rank-based measure that quantifies how well one representation can predict another - to analyze representations from models like DeepSeek-V3, Llama3-8b, and DinoV2 across languages and modalities.
Result: Semantic information spreads across many tokens, concentrates in middle layers for autoregressive models and final layers for encoders. English representations are more predictive than other languages. Large independently trained models (DeepSeek-V3 and DinoV2) show stronger cross-modal predictability than jointly trained CLIP.
Conclusion: Supports semantic convergence hypothesis across languages, modalities, and architectures, but shows directed predictability varies with layer-depth, model scale, and language. Model scale may outweigh explicit multimodal training for cross-modal alignment.
Abstract: It was recently observed that the representations of different models that process identical or semantically related inputs tend to align. We analyze this phenomenon using the Information Imbalance, an asymmetric rank-based measure that quantifies the capability of a representation to predict another, providing a proxy of the cross-entropy which can be computed efficiently in high-dimensional spaces. By measuring the Information Imbalance between representations generated by DeepSeek-V3 processing translations, we find that semantic information is spread across many tokens, and that semantic predictability is strongest in a set of central layers of the network, robust across six language pairs. We measure clear information asymmetries: English representations are systematically more predictive than those of other languages, and DeepSeek-V3 representations are more predictive of those in a smaller model such as Llama3-8b than the opposite. In the visual domain, we observe that semantic information concentrates in middle layers for autoregressive models and in final layers for encoder models, and these same layers yield the strongest cross-modal predictability with textual representations of image captions. Notably, two independently trained models (DeepSeek-V3 and DinoV2) achieve stronger cross-modal predictability than the jointly trained CLIP model, suggesting that model scale may outweigh explicit multimodal training. Our results support the hypothesis of semantic convergence across languages, modalities, and architectures, while showing that directed predictability between representations varies strongly with layer-depth, model scale, and language.
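The Information Imbalance admits a simple rank-based estimator: take each point's nearest neighbor in space A and ask what rank that neighbor has in space B; low values mean A predicts B well. A sketch on toy data, where the paper instead uses high-dimensional model representations:

```python
# Rank-based Information Imbalance estimator on toy point clouds.
import numpy as np
from scipy.spatial.distance import cdist

def information_imbalance(X_a, X_b):
    N = len(X_a)
    d_a, d_b = cdist(X_a, X_a), cdist(X_b, X_b)
    np.fill_diagonal(d_a, np.inf)
    ranks_b = d_b.argsort(axis=1).argsort(axis=1)  # neighbor ranks in space B
    nn_a = d_a.argmin(axis=1)                      # nearest neighbor in space A
    return 2.0 / N * ranks_b[np.arange(N), nn_a].mean()

X = np.random.default_rng(0).normal(size=(300, 10))
Y = X @ np.random.default_rng(1).normal(size=(10, 10))  # Y is a function of X
print(information_imbalance(X, Y), information_imbalance(Y, X))  # both small
```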
[60] BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
Mathew J. Koretsky, Maya Willey, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A. Nalls, Daniel Khashabi, Faraz Faghri
Main category: cs.CL
TL;DR: BiomedSQL benchmark evaluates scientific reasoning in text-to-SQL generation for biomedical knowledge bases, requiring domain-specific inference beyond syntactic translation.
Details
Motivation: Current text-to-SQL systems struggle with mapping qualitative scientific questions into executable SQL when implicit domain reasoning is required, particularly in biomedical contexts where understanding domain-specific criteria is essential.
Method: Created BiomedSQL benchmark with 68,000 question/SQL query/answer triples generated from templates, grounded in a harmonized BigQuery knowledge base integrating gene-disease associations, omics data causal inference, and drug approval records. Evaluated various LLMs across prompting strategies and interaction paradigms.
Result: Substantial performance gap observed: Gemini-3-Pro achieved 58.1% execution accuracy, custom multi-step agent BMSQL reached 62.6%, both well below expert baseline of 90.0%.
Conclusion: BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases.
Abstract: Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples generated from templates and grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: Gemini-3-Pro achieves 58.1% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source at https://github.com/NIH-CARD/biomedsql.
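Execution accuracy, the metric quoted above, is typically computed by running both gold and predicted SQL and comparing result sets; a sketch using sqlite3 for illustration (BiomedSQL itself runs against a BigQuery knowledge base):

```python
# Execution-match evaluation for text-to-SQL: a prediction counts as
# correct iff it executes and returns the same rows as the gold query.
import sqlite3

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    con = sqlite3.connect(db_path)
    try:
        pred = set(map(tuple, con.execute(pred_sql).fetchall()))
        gold = set(map(tuple, con.execute(gold_sql).fetchall()))
    except sqlite3.Error:
        return False  # un-executable prediction counts as wrong
    finally:
        con.close()
    return pred == gold
```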
[61] Probing Association Biases in LLM Moderation Over-Sensitivity
Yuxin Wang, Botao Yu, Ivory Yang, Saeed Hassanpour, Soroush Vosoughi
Main category: cs.CL
TL;DR: LLMs show systematic topic-toxicity associations beyond explicit offensive triggers, causing over-sensitivity in content moderation. Topic Association Analysis reveals advanced models have stronger topic-association skew despite lower false-positive rates.
Details
Motivation: LLMs used for content moderation often exhibit over-sensitivity, misclassifying benign content. While previous research focused on explicit offensive triggers, this paper investigates deeper topic-toxicity associations that cause systematic false positives.
Method: Proposed Topic Association Analysis: a behavior-based probe that elicits short contextual scenarios for benign inputs and quantifies topic amplification between scenarios and original comments. Tested across multiple LLMs with large-scale data and controlled prefix interventions.
Result: Advanced models (e.g., GPT-4 Turbo) show stronger topic-association skew in false-positive cases despite lower overall false-positive rates. Topic cues can measurably shift false-positive rates via controlled prefix interventions, indicating topic framing is decision-relevant.
Conclusion: Mitigating LLM over-sensitivity requires addressing learned topic associations in addition to keyword-based filtering, as topic-toxicity patterns go beyond explicit offensive triggers.
Abstract: Large Language Models are widely used for content moderation but often exhibit over-sensitivity, leading to misclassification of benign content and rejection of safe user commands. While previous research attributes this issue primarily to the presence of explicit offensive triggers, we statistically reveal a deeper connection beyond the token level: When behaving over-sensitively, particularly on decontextualized statements, LLMs exhibit systematic topic-toxicity association patterns that go beyond explicit offensive triggers. To characterize these patterns, we propose Topic Association Analysis, a behavior-based probe that elicits short contextual scenarios for benign inputs and quantifies topic amplification between the scenario and the original comment. Across multiple LLMs and large-scale data, we find that more advanced models (e.g., GPT-4 Turbo) show stronger topic-association skew in false-positive cases despite lower overall false-positive rates. Moreover, via controlled prefix interventions, we show that topic cues can measurably shift false-positive rates, indicating that topic framing is decision-relevant. These results suggest that mitigating over-sensitivity may require addressing learned topic associations in addition to keyword-based filtering.
[62] Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
Yuanzhe Hu, Yu Wang, Julian McAuley
Main category: cs.CL
TL;DR: MemoryAgentBench: A new benchmark for evaluating memory capabilities in LLM agents, covering four core competencies: accurate retrieval, test-time learning, long-range understanding, and selective forgetting.
Details
Motivation: Existing benchmarks for LLM agents focus on reasoning, planning, and execution but neglect memory capabilities. Current benchmarks either use limited context lengths or are designed for static, long-context settings, failing to capture the interactive, multi-turn nature of memory agents that incrementally accumulate information.
Method: The authors introduce MemoryAgentBench, which transforms existing long-context datasets and incorporates newly constructed datasets into a multi-turn format to simulate incremental information processing. The benchmark systematically covers four core memory competencies identified from memory and cognitive science theories.
Result: Evaluation of diverse memory agents (from simple context-based/RAG systems to advanced agents with external memory and tools) shows current methods fall short of mastering all four competencies, highlighting the need for better memory mechanisms.
Conclusion: MemoryAgentBench provides a comprehensive testbed for assessing memory quality in LLM agents, revealing significant gaps in current approaches and emphasizing the importance of developing more sophisticated memory mechanisms for interactive agents.
Abstract: Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory, encompassing how agents memorize, update, and retrieve long-term information, is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, based on classic theories from memory science and cognitive science, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. Existing benchmarks either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Moreover, no existing benchmarks cover all four competencies. We introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark transforms existing long-context datasets and incorporates newly constructed datasets into a multi-turn format, effectively simulating the incremental information processing characteristic of memory agents. By carefully selecting and curating datasets, our benchmark provides comprehensive coverage of the four core memory competencies outlined above, thereby offering a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
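The defining design choice is incremental delivery: the agent sees the source material as a stream of turns rather than one long prompt, so memory quality rather than raw context length is what gets measured. A minimal sketch of that protocol, assuming a hypothetical agent interface with `observe()` and `answer()` methods:

```python
def feed_incrementally(agent, document, questions, chunk_size=2000):
    """Stream a long document to a memory agent turn by turn, then query it.

    `agent` is assumed to expose observe(text) for storing information and
    answer(question) for retrieval -- a hypothetical interface, not the
    benchmark's actual API.
    """
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    for chunk in chunks:  # multi-turn accumulation, not one giant prompt
        agent.observe(chunk)
    return [agent.answer(q) for q in questions]
```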
[63] MATA: Mindful Assessment of the Telugu Abilities of Large Language Models
Chalamalasetti Kranti, Sowmya Vajjala
Main category: cs.CL
TL;DR: MATA is a Telugu language evaluation dataset with 729 multiple-choice and open-ended questions to assess LLM capabilities in Telugu, revealing model limitations and heuristic biases.
Details
Motivation: There's a need for comprehensive evaluation of LLMs in low-resource languages like Telugu to understand their linguistic capabilities and limitations, as most benchmarks focus on high-resource languages.
Method: Created a dataset of 729 carefully curated Telugu questions spanning diverse linguistic dimensions, evaluated 11 open-weight and closed-source LLMs, analyzed performance patterns, and compared LLM-as-a-judge evaluation with human evaluation.
Result: LLMs show significant limitations in Telugu language understanding, rely on superficial heuristics like answer position and distractor patterns for multiple-choice questions, and LLM-as-a-judge evaluation shows reliability issues compared to human evaluation.
Conclusion: Fine-grained evaluation in low-resource languages is essential for understanding model limitations and developing more linguistically capable LLMs, with MATA serving as a foundation for future Telugu NLP research.
Abstract: In this paper, we introduce MATA, a novel evaluation dataset to assess the ability of Large Language Models (LLMs) in the Telugu language, comprising 729 carefully curated multiple-choice and open-ended questions that span diverse linguistic dimensions. We evaluate 11 open-weight and closed-source LLMs on our dataset and present a fine-grained analysis of their performance. Further, we empirically show how LLMs rely on superficial heuristics such as answer position and distractor patterns for multiple-choice questions. Finally, we also compare LLM-as-a-judge evaluation with human evaluation for open-ended questions to assess its reliability in a low-resource language. We argue that such fine-grained evaluation is essential for understanding model limitations and can inform the development of more linguistically capable LLMs, while also serving as a foundation for future research in Telugu NLP. Our dataset is available at: https://huggingface.co/datasets/TeluguLLMResearch/MATA
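One way to expose the answer-position heuristic reported above is to permute the options and watch whether accuracy moves; a model answering on content should be invariant to option order. A sketch of that probe (`model_answer` is a hypothetical interface returning the index of the chosen option):

```python
import random

def position_bias_probe(model_answer, questions, trials=5, seed=0):
    """Report min/max accuracy over random option permutations.

    `questions` is a list of (stem, options, gold_idx) triples. A wide
    spread between min and max suggests reliance on answer position.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(trials):
        correct = 0
        for stem, options, gold_idx in questions:
            order = list(range(len(options)))
            rng.shuffle(order)
            shuffled = [options[i] for i in order]
            pred = model_answer(stem, shuffled)      # index into `shuffled`
            correct += int(order[pred] == gold_idx)  # map back to original
        accuracies.append(correct / len(questions))
    return min(accuracies), max(accuracies)
```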
[64] Human Psychometric Questionnaires Mischaracterize LLM Psychology: Evidence from Generation Behavior
Woojung Song, Dongmin Choi, Yoonah Park, Jongwook Han, Eun-Ju Lee, Yohan Jo
Main category: cs.CL
TL;DR: LLM psychological profiling using human questionnaires may mischaracterize models’ actual psychological characteristics, as questionnaire responses differ substantially from generation probabilities in real-world interactions.
Details
Motivation: To examine whether psychological profiles derived from human psychometric questionnaires accurately reflect LLMs' psychological characteristics expressed during real-world interactions with users, addressing concerns about mischaracterization.
Method: Compared two types of profiles for eight open-source LLMs: 1) self-reported Likert scores from established questionnaires (PVQ-40, PVQ-21, BFI-44, BFI-10), and 2) generation probability scores of value- or personality-laden responses to real-world user queries.
Result: The two profiles were substantially different, showing that LLMs’ questionnaire responses reflect desired behavior rather than stable psychological constructs. Established questionnaires also risk exaggerating demographic biases of LLMs.
Conclusion: Psychological profiles from established questionnaires should be interpreted cautiously, as they may misrepresent LLM psychology. Generation-based profiling is a more reliable approach to LLM psychometrics.
Abstract: Psychological profiling of large language models (LLMs) using psychometric questionnaires designed for humans has become widespread. However, it remains unclear whether the resulting profiles mirror the models’ psychological characteristics expressed during their real-world interactions with users. To examine the risk of human questionnaires mischaracterizing LLM psychology, we compare two types of profiles for eight open-source LLMs: self-reported Likert scores from established questionnaires (PVQ-40, PVQ-21, BFI-44, BFI-10) and generation probability scores of value- or personality-laden responses to real-world user queries. The two profiles turn out to be substantially different and provide evidence that LLMs’ responses to established questionnaires reflect desired behavior rather than stable psychological constructs, which challenges the consistent psychological dispositions of LLMs claimed in prior work. Established questionnaires also risk exaggerating the demographic biases of LLMs. Our results suggest caution when interpreting psychological profiles derived from established questionnaires and point to generation-based profiling as a more reliable approach to LLM psychometrics.
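A generation-based profile scores how probable a value- or personality-laden response is under the model, instead of asking the model to self-report. A rough sketch of one such score with Hugging Face transformers (the scoring recipe is an assumption and the query/response tokenization boundary is approximate; the paper's exact procedure may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def response_logprob(model_name: str, query: str, response: str) -> float:
    """Mean log-probability the model assigns to `response` given `query`."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt_len = tok(query, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(query + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts t+1
    targets = full_ids[0, 1:]
    pos = torch.arange(prompt_len - 1, full_ids.shape[1] - 1)  # response tokens only
    return log_probs[pos, targets[pos]].mean().item()
```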
[65] Incongruent Positivity: When Miscalibrated Positivity Undermines Online Supportive Conversations
Leen Almajed, Abeer ALdayel
Main category: cs.CL
TL;DR: Analysis of how well-intended positivity can misfire in emotional support conversations, comparing human and LLM responses across mild vs. severe emotional contexts, revealing LLMs are more prone to unrealistic positivity in high-stakes situations.
Details
Motivation: To examine the phenomenon of "incongruent positivity" - where positive support responses feel dismissive or minimizing - in both human and LLM-generated emotional support conversations, particularly across different emotional intensity levels.
Method: Collected real user-assistant dialogues from Reddit across emotional intensities, generated additional LLM responses for same contexts, categorized conversations into Mild (relationship tension, general advice) and Severe (grief, anxiety), analyzed response patterns, finetuned LLMs on emotional reaction datasets, and developed weakly supervised multilabel classifier ensemble (DeBERTa and MentalBERT) for detecting incongruent positivity types.
Result: LLMs are more prone to unrealistic positivity through dismissive and minimizing tone, particularly in high-stakes contexts. The classifier ensemble showed improved detection of incongruent positivity types across mild and severe concerns.
Conclusion: Need to move beyond generic positive responses and study congruent support measures to balance positive affect with emotional acknowledgment, paving way for context-aware, trust-preserving online conversation systems.
Abstract: In emotionally supportive conversations, well-intended positivity can sometimes misfire, leading to responses that feel dismissive, minimizing, or unrealistically optimistic. We examine this phenomenon of incongruent positivity as miscalibrated expressions of positive support in both human and LLM-generated responses. To this end, we collected real user-assistant dialogues from Reddit across a range of emotional intensities and generated additional responses using large language models for the same contexts. We categorize these conversations by intensity into two levels: Mild, which covers relationship tension and general advice, and Severe, which covers grief and anxiety conversations. This categorization enables a comparative analysis of how supportive responses vary across lower- and higher-stakes contexts. Our analysis reveals that LLMs are more prone to unrealistic positivity through dismissive and minimizing tone, particularly in high-stakes contexts. To further study the underlying dimensions of this phenomenon, we finetune LLMs on datasets with strong and weak emotional reactions. Moreover, we develop a weakly supervised multilabel classifier ensemble (DeBERTa and MentalBERT) that shows improved detection of incongruent positivity types across both levels of concern (Mild and Severe). Our findings highlight the need to move beyond merely generating generic positive responses and instead study congruent support measures that balance positive affect with emotional acknowledgment. This approach offers insights into aligning large language models with affective expectations in online supportive dialogue, paving the way toward context-aware and trust-preserving online conversation systems.
[66] ReviewScore: Misinformed Peer Review Detection with Large Language Models
Hyun Ryu, Doohyuk Jang, Hyemin S. Lee, Joonhyun Jeong, Gyeongman Kim, Donghyeon Cho, Gyouk Chu, Minyeong Hwang, Hyeongwon Jang, Changhun Kim, Haechan Kim, Jina Kim, Joowon Kim, Yoonjeon Kim, Kwanhyung Lee, Chanjae Park, Heecheol Yun, Gregor Betz, Eunho Yang
Main category: cs.CL
TL;DR: Automated system to detect misinformed peer review points by analyzing weaknesses and questions, with LLM evaluation showing moderate but imperfect performance.
Details
Motivation: Peer review quality is degrading in AI conferences due to exploding submission numbers, creating need for automated detection of low-quality reviews containing misinformed points.
Method: Define misinformed review points as weaknesses with incorrect premises or questions already answered in paper; build automated engine to reconstruct premises from weaknesses; create human-annotated dataset; evaluate eight state-of-the-art LLMs on ReviewScore detection.
Result: 15.2% of weaknesses and 26.4% of questions are misinformed; LLMs achieve F1 scores of 0.4-0.5 and kappa scores of 0.3-0.4; premise-level factuality evaluation shows significantly higher agreement than weakness-level evaluation.
Conclusion: Automated detection of misinformed review points is promising but challenging; LLMs show moderate agreement with human experts but struggle with reasoning errors; premise-level analysis improves accuracy.
Abstract: Peer review serves as a backbone of academic research, but in most AI conferences, the review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either “weaknesses” in a review that contain incorrect premises, or “questions” in a review that can be already answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore, which indicates whether a review point is misinformed. To evaluate the factuality of each premise of weaknesses, we propose an automated engine that reconstructs every explicit and implicit premise from a weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs. The models show F1 scores of 0.4–0.5 and kappa scores of 0.3–0.4, indicating moderate agreement but also suggesting that fully automating the evaluation remains challenging. A thorough disagreement analysis reveals that most errors are due to models’ incorrect reasoning. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality.
[67] HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
Langqi Yang, Tianhang Zheng, Yixuan Chen, Kedong Xiu, Hao Zhou, Wangze Ni, Lei Chen, Zhan Qin, Kui Ren
Main category: cs.CL
TL;DR: HarmMetric Eval is a benchmark for evaluating harmfulness metrics in LLMs, finding that traditional reference-based metrics can outperform LLM-based judges, leading to an improved judge that combines fine-grained criteria with reference-based fine-tuning.
Details
Motivation: LLMs pose safety risks by generating harmful content, but existing harmfulness evaluation metrics produce inconsistent results due to format and scale differences, undermining their credibility for data management applications.
Method: Created HarmMetric Eval benchmark with high-quality dataset of harmful prompts paired with harmful/non-harmful LLM outputs across fine-grained categories, plus unified scoring mechanism. Analyzed limitations of LLM-based judges, then designed improved judge incorporating fine-grained criteria in prompts and using reference-based metrics for lightweight fine-tuning.
Result: Surprising finding: traditional reference-based metrics (ROUGE, METEOR) outperform LLM-based judges in fine-grained harmfulness evaluation. The improved judge combining fine-grained criteria and reference-based fine-tuning achieves state-of-the-art effectiveness on HarmMetric Eval.
Conclusion: LLM-based judges have limitations in harmfulness evaluation, particularly with irrelevant outputs. Combining fine-grained criteria in prompts with reference-based metric fine-tuning creates more effective harmfulness judges, challenging assumptions about LLM superiority in this domain.
Abstract: The potential of large language models (LLMs) to generate harmful content poses a significant safety risk for data management, as LLMs are increasingly being used as engines for data generation. To assess this risk, numerous harmfulness evaluation metrics and judges have been proposed. However, due to differences in their formats and scales, these metrics may yield inconsistent evaluation results on LLM-generated harmful data, undermining their credibility in practice. To address this gap, we present HarmMetric Eval, a systematic benchmark for assessing the quality of harmfulness metrics and judges with varying formats and scales. HarmMetric Eval includes a high-quality dataset comprising representative harmful prompts paired with harmful and non-harmful LLM outputs across multiple fine-grained categories, along with a unified scoring mechanism to reward the metrics for correctly ranking harmful outputs over non-harmful ones. Extensive experiments on HarmMetric Eval yield a surprising finding: conventional reference-based metrics such as ROUGE and METEOR can outperform LLM-based judges in fine-grained harmfulness evaluation, challenging prevailing assumptions about LLMs’ superiority in this domain. To reveal the reasons behind this finding, we provide a fine-grained analysis to explain the limitations of LLM-based judges on rating irrelevant or useless LLM outputs. Motivated by these insights, we design an improved harmfulness judge that explicitly incorporates fine-grained harmfulness criteria in its prompt template and leverages reference-based metrics for lightweight fine-tuning of its base LLM. The resulting judge achieves state-of-the-art evaluation effectiveness on HarmMetric Eval.
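The headline finding is that plain n-gram overlap with a reference harmful answer can rank outputs competitively. A sketch of that reference-based scoring with the rouge-score package, mirroring the benchmark's reward for ranking harmful above non-harmful outputs (the scoring rule here is a simplification of the benchmark's mechanism):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def harm_score(reference: str, candidate: str) -> float:
    """ROUGE-L F1 overlap with a reference harmful response; higher
    overlap is read as more harmful."""
    return _scorer.score(reference, candidate)["rougeL"].fmeasure

def ranks_correctly(reference: str, harmful_out: str, benign_out: str) -> bool:
    # the benchmark rewards metrics that place harmful above non-harmful
    return harm_score(reference, harmful_out) > harm_score(reference, benign_out)
```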
[68] Bringing Emerging Architectures to Sequence Labeling in NLP
Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares
Main category: cs.CL
TL;DR: The paper evaluates alternative architectures (xLSTMs, SSMs, diffusion models, adversarial learning) for sequence labeling tasks, finding their strong performance on simple tasks doesn’t generalize well to complex structured tasks or across languages.
Details
Motivation: To investigate whether alternative architectures that show promise in language modeling (xLSTMs, structured state-space models, diffusion models, adversarial learning) can effectively adapt to sequence labeling tasks, especially those with varying structural complexity, label spaces, and token dependencies across multiple languages.
Method: Comparative study of different architectures (xLSTMs, structured state-space models, diffusion models, adversarial learning) across tagging tasks with varying complexity, label spaces, and token dependencies, evaluated across multiple languages and datasets.
Result: The strong performance of alternative architectures observed in simpler settings does not generalize well across languages or datasets, nor does it extend to more complex structured tasks.
Conclusion: Alternative architectures need more careful adaptation for sequence labeling tasks, especially complex structured ones, and their performance advantages on simple tasks don’t necessarily translate to broader applications.
Abstract: Pretrained Transformer encoders are the dominant approach to sequence labeling. While some alternative architectures, such as xLSTMs, structured state-space models, diffusion models, and adversarial learning, have shown promise in language modeling, few have been applied to sequence labeling, and mostly on flat or simplified tasks. We study how these architectures adapt across tagging tasks that vary in structural complexity, label space, and token dependencies, with evaluation spanning multiple languages. We find that the strong performance previously observed in simpler settings does not always generalize well across languages or datasets, nor does it extend to more complex structured tasks.
[69] Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models
Yongding Tao, Tian Wang, Yihong Dong, Huanyu Liu, Kechi Zhang, Xiaolong Hu, Ge Li
Main category: cs.CL
TL;DR: Self-Critique: A novel method for detecting data contamination in RL post-training phase of LLMs by probing policy collapse and entropy reduction patterns.
Details
Motivation: Data contamination threatens LLM evaluation reliability, especially in RL post-training phase which lacks specialized detection methods despite being crucial for advancing LLM reasoning capabilities.
Method: Proposes Self-Critique method that detects contamination by probing policy collapse (model’s convergence to narrow reasoning paths) and analyzing output entropy distribution collapse into sparse modes after RL phase. Also introduces RL-MIA benchmark for RL-phase contamination scenarios.
Result: Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving up to 30% AUC improvement. Makes detection possible where existing methods perform close to random guessing.
Conclusion: First systematic study of data detection in RL post-training scenario, addressing critical vulnerability in LLM evaluation. Self-Critique enables effective contamination detection in RL phase where previous methods failed.
Abstract: Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples may inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data detection within the RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after the RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model’s convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
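The entropy-collapse signal can be probed directly: after RL on contaminated data, sampled continuations should show unusually low next-token entropy. A simplified greedy-decoding probe assuming a Hugging Face causal LM (this is only the underlying observable, not the full Self-Critique procedure):

```python
import torch

def mean_token_entropy(model, tokenizer, prompt, max_new_tokens=64):
    """Average next-token entropy along a greedy continuation; collapsed,
    near-deterministic outputs yield low values."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    entropies = []
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        entropies.append(-(probs * torch.log(probs + 1e-12)).sum().item())
        ids = torch.cat([ids, probs.argmax().view(1, 1)], dim=1)  # greedy step
    return sum(entropies) / len(entropies)
```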
[70] From Slides to Chatbots: Enhancing Large Language Models with University Course Materials
Tu Anh Dinh, Philipp Nicolas Schumacher, Jan Niehues
Main category: cs.CL
TL;DR: LLMs enhanced with course materials via RAG outperform CPT for university CS education, with multimodal image-based slide retrieval showing significant gains over text-only approaches.
Details
Motivation: LLMs struggle with accurate question answering in university-level computer science courses despite their general capabilities. The challenge lies in effectively incorporating diverse, non-standard course materials like lecture slides (with visual elements) and transcripts (with spoken language) to improve educational support.
Method: Compared two strategies: Retrieval-Augmented Generation (RAG) and Continual Pre-Training (CPT) for incorporating course-specific knowledge. For lecture slides, explored a multimodal RAG approach where retrieved content is presented to the generator in image form rather than text.
Result: RAG proved more effective and efficient than CPT given the relatively small size of university course materials. Multimodal RAG with slides presented as images significantly outperformed text-only retrieval approaches.
Conclusion: Practical strategies for developing better AI educational assistants include using RAG over CPT for small educational datasets and leveraging multimodal approaches that preserve visual information from slides. These findings can inspire similar efforts in other educational contexts.
Abstract: Large Language Models (LLMs) have advanced rapidly in recent years. One application of LLMs is to support student learning in educational settings. However, prior work has shown that LLMs still struggle to answer questions accurately within university-level computer science courses. In this work, we investigate how incorporating university course materials can enhance LLM performance in this setting. A key challenge lies in leveraging diverse course materials such as lecture slides and transcripts, which differ substantially from typical textual corpora: slides also contain visual elements like images and formulas, while transcripts contain spoken, less structured language. We compare two strategies, Retrieval-Augmented Generation (RAG) and Continual Pre-Training (CPT), to extend LLMs with course-specific knowledge. For lecture slides, we further explore a multi-modal RAG approach, where we present the retrieved content to the generator in image form. Our experiments reveal that, given the relatively small size of university course materials, RAG is more effective and efficient than CPT. Moreover, incorporating slides as images in the multi-modal setting significantly improves performance over text-only retrieval. These findings highlight practical strategies for developing AI assistants that better support learning and teaching, and we hope they inspire similar efforts in other educational contexts.
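The multimodal twist is that retrieval runs over text (slide captions, OCR, or transcripts) while the retrieved slides reach the generator as images, preserving formulas and figures. A minimal cosine-similarity retrieval sketch; the index layout and the message format described in the trailing comment are assumptions:

```python
import numpy as np

def retrieve_slide_images(question_vec, slide_index, k=3):
    """Return image paths of the k slides whose text embeddings are closest
    (cosine) to the question embedding. `slide_index` is a list of
    (embedding, image_path) pairs -- a hypothetical pre-built index.
    """
    q = question_vec / np.linalg.norm(question_vec)
    sims = [(float(np.dot(q, emb / np.linalg.norm(emb))), path)
            for emb, path in slide_index]
    return [path for _, path in sorted(sims, reverse=True)[:k]]

# The retrieved paths are then attached to the generator request as images
# rather than OCR'd text: one text part for the question plus one image
# part per retrieved slide.
```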
[71] Frame Semantic Patterns for Identifying Underreporting of Notifiable Events in Healthcare: The Case of Gender-Based Violence
Lívia Dutra, Arthur Lorenzi, Laís Berno, Franciany Campos, Karoline Biscardi, Kenneth Brown, Marcelo Viridiano, Frederico Belcavello, Ely Matos, Olívia Guaranha, Erik Santos, Sofia Reinach, Tiago Timponi Torrent
Main category: cs.CL
TL;DR: A methodology using semantic frames to identify notifiable healthcare events like gender-based violence in unstructured medical text, achieving 0.726 precision on Brazilian Portuguese data.
Details
Motivation: Address underreporting of gender-based violence in healthcare records by developing an NLP approach to automatically identify such events from unstructured text in electronic medical records.
Method: Uses semantic frames to define fine-grained patterns, searches these patterns in unstructured text from e-medical records (21 million sentences in Brazilian Portuguese), and manually evaluates results to measure precision.
Result: Achieved 0.726 precision in identifying reports of violence, demonstrating effectiveness of the semantic frame methodology for healthcare event detection.
Conclusion: The transparent, efficient, low-carbon, language-agnostic pipeline can be adapted to other health surveillance contexts, supporting ethical and explainable NLP use in public health systems.
Abstract: We introduce a methodology for the identification of notifiable events in the domain of healthcare. The methodology harnesses semantic frames to define fine-grained patterns and search them in unstructured data, namely, open-text fields in e-medical records. We apply the methodology to the problem of underreporting of gender-based violence (GBV) in e-medical records produced during patients’ visits to primary care units. A total of eight patterns are defined and searched on a corpus of 21 million sentences in Brazilian Portuguese extracted from e-SUS APS. The results are manually evaluated by linguists and the precision of each pattern measured. Our findings reveal that the methodology effectively identifies reports of violence with a precision of 0.726, confirming its robustness. Designed as a transparent, efficient, low-carbon, and language-agnostic pipeline, the approach can be easily adapted to other health surveillance contexts, contributing to the broader, ethical, and explainable use of NLP in public health systems.
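In spirit, each frame-based pattern pairs a frame-evoking predicate with constrained role fillers and is swept over the corpus, after which linguists score the matches. A toy regex approximation with invented Portuguese fillers (the actual patterns operate over semantic frames and parsed structure, not raw regexes):

```python
import re

# Invented stand-ins for frame-based patterns: a frame-evoking predicate
# plus role fillers typical of violence reports (purely illustrative).
PATTERNS = [
    re.compile(r"\b(agrediu|ameaçou|bateu (?:em|na))\b.*\b(esposa|companheira|paciente)\b",
               re.IGNORECASE),
]

def match_sentences(sentences):
    """Sentences matched by any pattern, queued for manual review."""
    return [s for s in sentences if any(p.search(s) for p in PATTERNS)]

def precision(labels):
    """labels[i] is True when the i-th match truly reports violence
    (as judged by linguists): precision = true positives / all matches."""
    return sum(labels) / len(labels) if labels else 0.0
```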
[72] Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+
Mason Shipton, York Hay Ng, Aditya Khan, Phuong Hanh Hoang, Xiang Lu, A. Seza Doğruöz, En-Shiun Annie Lee
Main category: cs.CL
TL;DR: Extends URIEL+ linguistic knowledge base by adding script vectors, integrating Glottolog for expanded language coverage, and improving lineage imputation to reduce data sparsity for multilingual research.
Details
Motivation: URIEL+ linguistic knowledge base suffers from data sparsity issues including missing feature types, incomplete language entries, and limited genealogical coverage, which limits its usefulness for cross-lingual transfer, especially for low-resource languages.
Method: Three main improvements: 1) Introduced script vectors to represent writing system properties for 7,488 languages, 2) Integrated Glottolog to add 18,710 additional languages, 3) Expanded lineage imputation for 26,449 languages by propagating typological and script features across genealogies.
Result: Reduced feature sparsity by 14% for script vectors, increased language coverage by up to 19,015 languages (1,007% increase), boosted imputation quality metrics by up to 35%, and showed performance gains up to 6% in cross-lingual transfer tasks for low-resource languages.
Conclusion: The extended URIEL+ provides significantly improved coverage and reduced sparsity, enabling better support for multilingual research and cross-lingual transfer, particularly benefiting low-resource languages through enhanced linguistic knowledge representation.
Abstract: The URIEL+ linguistic knowledge base supports multilingual research by encoding languages through geographic, genetic, and typological vectors. However, data sparsity (e.g. missing feature types, incomplete language entries, and limited genealogical coverage) remains prevalent. This limits the usefulness of URIEL+ in cross-lingual transfer, particularly for supporting low-resource languages. To address this sparsity, we extend URIEL+ by introducing script vectors to represent writing system properties for 7,488 languages, integrating Glottolog to add 18,710 additional languages, and expanding lineage imputation for 26,449 languages by propagating typological and script features across genealogies. These improvements reduce feature sparsity by 14% for script vectors, increase language coverage by up to 19,015 languages (1,007%), and boost imputation quality metrics by up to 35%. Our benchmark on cross-lingual transfer tasks (oriented around low-resource languages) shows occasionally divergent performance compared to URIEL+, with performance gains up to 6% in certain setups.
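Lineage imputation can be read as nearest-ancestor propagation: a language's missing typological or script feature is filled from the closest ancestor in the Glottolog genealogy that carries a value. A simplified sketch under that reading (the data shapes are hypothetical; the actual pipeline is richer):

```python
def impute_from_lineage(features, lineage):
    """Fill missing features from the nearest ancestor that has a value.

    `features`: language -> {feature_name: value or None}
    `lineage`:  language -> ancestors ordered nearest-first (Glottolog-style)
    """
    imputed = {}
    for lang, feats in features.items():
        filled = dict(feats)
        for name, value in feats.items():
            if value is None:
                for ancestor in lineage.get(lang, []):
                    anc_val = features.get(ancestor, {}).get(name)
                    if anc_val is not None:
                        filled[name] = anc_val  # propagate down the genealogy
                        break
        imputed[lang] = filled
    return imputed
```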
[73] Silenced Biases: The Dark Side LLMs Learned to Refuse
Rom Himelstein, Amit LeVi, Brit Youngmann, Yaniv Nemcovsky, Avi Mendelson
Main category: cs.CL
TL;DR: SBB benchmark uses activation steering to uncover silenced biases in safety-aligned LLMs that standard QA evaluations miss by interpreting refusals as fairness.
Details
Motivation: Current fairness evaluations for safety-aligned LLMs are flawed because they interpret model refusal responses as positive fairness measurements, creating a false sense of fairness. These methods overlook deeper unfair preferences encoded in models' latent space that are concealed by safety alignment.
Method: Proposes Silenced Bias Benchmark (SBB) which uses activation steering to reduce model refusals during QA evaluations, allowing underlying biases to surface. The benchmark supports easy expansion to new demographic groups and subjects.
Result: Demonstrated approach over multiple LLMs, revealing alarming distinctions between models’ direct responses and their underlying fairness issues. Exposed biases that were previously concealed by safety alignment mechanisms.
Conclusion: SBB provides a more accurate fairness evaluation framework that goes beyond the masking effects of alignment training, encouraging future development of genuinely fair models and tools.
Abstract: Safety-aligned large language models (LLMs) are becoming increasingly widespread, especially in sensitive applications where fairness is essential and biased outputs can cause significant harm. However, evaluating the fairness of models is a complex challenge, and approaches that do so typically utilize standard question-answer (QA) styled schemes. Such methods often overlook deeper issues by interpreting the model’s refusal responses as positive fairness measurements, which creates a false sense of fairness. In this work, we introduce the concept of silenced biases, which are unfair preferences encoded within models’ latent space and are effectively concealed by safety-alignment. Previous approaches that considered similar indirect biases often relied on prompt manipulation or handcrafted implicit queries, which present limited scalability and risk contaminating the evaluation process with additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to uncover these biases by employing activation steering to reduce model refusals during QA. SBB supports easy expansion to new demographic groups and subjects, presenting a fairness evaluation framework that encourages the future development of fair models and tools beyond the masking effects of alignment training. We demonstrate our approach over multiple LLMs, where our findings expose an alarming distinction between models’ direct responses and their underlying fairness issues.
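Activation steering of this kind typically adds a scaled "refusal direction" to a layer's hidden states at inference time; steering against that direction suppresses refusals so that latent preferences can surface. A generic PyTorch sketch, not the exact SBB recipe (the layer choice, the direction extraction, and the alpha scale are all assumptions):

```python
import torch

def steer_away_from_refusal(layer: torch.nn.Module,
                            direction: torch.Tensor,
                            alpha: float = -4.0):
    """Register a forward hook that shifts `layer`'s hidden states along a
    pre-computed refusal direction; a negative alpha pushes away from
    refusing. Returns the hook handle (call .remove() to restore)."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return layer.register_forward_hook(hook)
```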
[74] PEPPER: Perception-Guided Perturbation for Robust Backdoor Defense in Text-to-Image Diffusion Models
Oscar Chew, Po-Yi Lu, Jayden Lin, Kuan-Hao Huang, Hsuan-Tien Lin
Main category: cs.CL
TL;DR: PEPPER is a backdoor defense method for text-to-image diffusion models that rewrites input captions to disrupt trigger tokens while preserving visual similarity, making models robust against text-based backdoor attacks.
Details
Motivation: Text-to-image diffusion models are vulnerable to backdoor attacks where trigger tokens in prompts can steer generation toward harmful content. These attacks can spread to neighboring tokens in embedding space, requiring robust defense mechanisms.
Method: PEPPER rewrites input captions into semantically distant but visually similar versions while adding unobtrusive elements. This disrupts trigger tokens embedded in prompts and dilutes their influence, enhancing model robustness against backdoor attacks.
Result: Experiments show PEPPER is particularly effective against text encoder-based attacks, substantially reducing attack success rates while preserving generation quality. It can be paired with existing defenses for stronger, generalizable robustness.
Conclusion: PEPPER provides an effective defense against backdoor attacks in text-to-image diffusion models by strategically rewriting prompts to disrupt triggers while maintaining visual output quality, offering enhanced security for multimodal AI systems.
Abstract: Recent studies show that text-to-image (T2I) diffusion models are vulnerable to backdoor attacks, where a trigger in the input prompt can steer generation toward harmful or unintended content. Beyond the trigger token itself, backdoor effects can spread to neighboring tokens in the text embedding space. To address this, we introduce PEPPER (PErcePtion Guided PERturbation), a backdoor defense that rewrites the caption into a semantically distant yet visually similar caption while adding unobtrusive elements. With this rewriting strategy, PEPPER disrupts the trigger embedded in the input prompt and dilutes the influence of trigger tokens, thereby achieving enhanced robustness. Experiments show that PEPPER is particularly effective against text encoder-based attacks, substantially reducing attack success while preserving generation quality. Beyond this, PEPPER can be paired with any existing defense, yielding consistently stronger and more generalizable robustness than any standalone method. Our code will be released on GitHub.
[75] The Moralization Corpus: Frame-Based Annotation and Analysis of Moralizing Speech Acts across Diverse Text Genres
Maria Becker, Mirko Sommer, Lars Tapken, Yi Wan Teh, Bruno Brocai
Main category: cs.CL
TL;DR: A novel German multi-genre dataset (Moralization Corpus) for analyzing moral value arguments in discourse, with frame-based annotation of moral values, demands, and protagonists, plus LLM evaluation for moralization detection.
Details
Motivation: Moralizations (arguments using moral values to justify positions) are an underexplored form of persuasive communication that is pragmatically complex and often implicit, posing challenges for both human annotation and NLP systems.
Method: Developed a frame-based annotation scheme capturing moral values, demands, and discourse protagonists; applied to diverse German texts (political debates, news articles, online discussions); evaluated LLMs under varied prompting conditions for moralization detection and component extraction.
Result: Detailed prompt instructions had greater effect than few-shot or explanation-based prompting; moralization detection remains highly subjective and context-sensitive; corpus enables fine-grained analysis across communicative formats and domains.
Conclusion: The Moralization Corpus provides resources for interdisciplinary research on moral discourse and reasoning in NLP, highlighting the challenges of analyzing implicit, context-dependent moral arguments.
Abstract: Moralizations - arguments that invoke moral values to justify demands or positions - are an as yet underexplored form of persuasive communication. We present the Moralization Corpus, a novel multi-genre dataset designed to analyze how moral values are strategically used in argumentative discourse. Moralizations are pragmatically complex and often implicit, posing significant challenges for both human annotators and NLP systems. We develop a frame-based annotation scheme that captures the constitutive elements of moralizations - moral values, demands, and discourse protagonists - and apply it to a diverse set of German texts, including political debates, news articles, and online discussions. The corpus enables fine-grained analysis of moralizing language across communicative formats and domains. We further evaluate several large language models (LLMs) under varied prompting conditions for the task of moralization detection and moralization component extraction and compare their outputs to human annotations in order to investigate the challenges of automatic and manual analysis of moralizations. Results show that detailed prompt instructions have a greater effect than few-shot or explanation-based prompting, and that moralization detection remains a highly subjective and context-sensitive task. We release all data, annotation guidelines, and code to foster future interdisciplinary research on moral discourse and moral reasoning in NLP.
[76] Enhancing Moral Diagnosis and Correction in Large Language Models
Bocheng Chen, Xi Chen, Han Zi, Haitao Mao, Zimo Qi, Xitong Zhang, Kristen Johnson, Guangliang Liu
Main category: cs.CL
TL;DR: Using pragmatic inference to improve LLMs’ moral error identification and correction across diverse tasks like moral reasoning, toxic language detection, bias detection, and jailbreaks.
Details
Motivation: Enhancing moral sensitivity in LLMs is crucial for their moral performance but challenging. Current approaches lack generalization across different moral tasks with varying semantic formulations.
Method: Leverages pragmatic inference-based approach with a unifying variable called “pragmatic inference load” that captures the degree of pragmatic reasoning required across tasks, enabling generalization.
Result: Approach enables LLMs to produce high-quality moral error diagnostics, make effective corrections, and consistently outperforms baseline methods across diverse tasks.
Conclusion: Pragmatic inference approach effectively enhances LLMs’ moral sensitivity and generalization capabilities across different moral tasks, with improvements stemming from learned inferential processes rather than heuristic patterns.
Abstract: Identifying specific moral errors in an input and generating appropriate corrections require moral sensitivity in large language models (LLMs), a capability that is fundamental to their moral performance yet remains challenging. This study leverages a pragmatic inference-based approach to enhance both moral diagnosis and correction in models. Crucially, our method generalizes across a diverse set of different tasks, including moral reasoning, toxic language detection, social bias detection, and jailbreaks, despite substantial differences in their semantic formulations. To enable such generalization, the study also introduces a unifying variable, pragmatic inference load, which captures the degree of pragmatic reasoning required across tasks. Experimental results show that our approach enables LLMs to produce high-quality diagnostic outputs of moral errors, make effective corrections, and consistently outperform a range of baseline methods. Further analyses reveal that these improvements do not arise from heuristic-based response patterns, but from learned inferential processes, highlighting the effectiveness of our approach.
[77] EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning
Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu, Ziyang Zhang, Carl Yang, Max S. Y. Lau, Qi He, Lu Cheng, Wei Jin
Main category: cs.CL
TL;DR: EpiQAL is a diagnostic benchmark for epidemiological question answering that tests factual recall, multi-step inference, and conclusion reconstruction from medical literature.
Details
Motivation: Existing medical QA benchmarks focus on clinical knowledge and patient-level reasoning, but lack systematic evaluation of evidence-grounded epidemiological inference needed for population-level disease analysis.
Method: Created three benchmark subsets from open-access literature using quality-controlled pipeline with taxonomy guidance, multi-model verification, and difficulty screening to test progressively harder epidemiological reasoning tasks.
Result: Fourteen LLMs showed limited epidemiological reasoning performance, with multi-step inference being most challenging. Model rankings varied across subsets, scale didn’t predict success, and Chain-of-Thought helped multi-step inference but had mixed results elsewhere.
Conclusion: EpiQAL provides fine-grained diagnostic signals for evidence-grounding, inferential reasoning, and conclusion reconstruction in epidemiological contexts, revealing current LLM limitations in this domain.
Abstract: Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The three subsets progressively test factual recall, multi-step inference, and conclusion reconstruction under incomplete information, and are constructed through a quality-controlled pipeline combining taxonomy guidance, multi-model verification, and difficulty screening. Experiments on fourteen models spanning open-source and proprietary systems reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence-grounding, inferential reasoning, and conclusion reconstruction.
[78] RADAR: Retrieval-Augmented Detector with Adversarial Refinement for Robust Fake News Detection
Song-Duo Ma, Yi-Hung Liu, Hsin-Yu Lin, Pin-Yu Chen, Hong-Yan Huang, Shau-Yung Hsu, Yun-Nung Chen
Main category: cs.CL
TL;DR: RADAR is a retrieval-augmented adversarial training framework for detecting LLM-generated misinformation, using a generator that rewrites real articles and a detector that verifies claims with retrieval, enhanced by natural-language adversarial feedback.
Details
Motivation: To combat the spread of LLM-generated misinformation efficiently, the paper addresses the need for robust fake news detection systems that can adapt to sophisticated generation techniques.
Method: RADAR employs a generator that rewrites real articles with factual perturbations, paired with a lightweight detector that verifies claims using dense passage retrieval. It introduces verbal adversarial feedback (VAF) - structured natural-language critiques that guide the generator toward more sophisticated evasion attempts, enabling co-evolution between generator and detector.
Result: RADAR consistently outperforms strong retrieval-augmented trainable baselines and general-purpose LLMs with retrieval on fake news detection benchmarks. Detector-side retrieval yields the largest gains, while VAF and few-shot demonstrations provide complementary benefits. RADAR also transfers better to fake news generated by unseen external attackers.
Conclusion: The RADAR framework demonstrates effective adversarial co-evolution for robust fake news detection, with retrieval mechanisms and natural-language adversarial feedback significantly improving detection performance and generalization to unseen attack methods.
Abstract: To efficiently combat the spread of LLM-generated misinformation, we present RADAR, a Retrieval-Augmented Detector with Adversarial Refinement for robust fake news detection. Our approach employs a generator that rewrites real articles with factual perturbations, paired with a lightweight detector that verifies claims using dense passage retrieval. To enable effective co-evolution, we introduce verbal adversarial feedback (VAF). Rather than relying on scalar rewards, VAF issues structured natural-language critiques; these guide the generator toward more sophisticated evasion attempts, compelling the detector to adapt and improve. On a fake news detection benchmark, RADAR consistently outperforms strong retrieval-augmented trainable baselines, as well as general-purpose LLMs with retrieval. Further analysis shows that detector-side retrieval yields the largest gains, while VAF and few-shot demonstrations provide complementary benefits. RADAR also transfers better to fake news generated by an unseen external attacker, indicating improved robustness beyond the co-evolved training setting.
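The co-evolution loop is easiest to see as alternating calls: the generator perturbs a real article conditioned on the previous critique, the detector judges it against retrieved evidence, and the detector's natural-language critique becomes the next round's guidance. A schematic sketch in which every interface is hypothetical:

```python
def radar_round(generator, detector, retriever, article, critique=None):
    """One adversarial refinement round (schematic).

    generator.rewrite / retriever.search / detector.judge are hypothetical
    interfaces standing in for the paper's components.
    """
    fake = generator.rewrite(article, feedback=critique)    # factual perturbation
    evidence = retriever.search(fake, k=5)                  # dense passage retrieval
    verdict, new_critique = detector.judge(fake, evidence)  # label + VAF critique
    return fake, verdict, new_critique  # the critique feeds the next round
```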
[79] Distilling Feedback into Memory-as-a-Tool
Víctor Gallego
Main category: cs.CL
TL;DR: A framework that amortizes inference-time reasoning costs by converting transient critiques into retrievable guidelines using file-based memory and agent-controlled tool calls, evaluated on rubric-based learning tasks.
Details
Motivation: To reduce the high computational costs of inference-time reasoning and refinement in LLMs while maintaining performance, by making critiques reusable rather than transient.
Method: Uses a file-based memory system and agent-controlled tool calls to convert transient critiques into retrievable guidelines that can be reused across inference sessions, amortizing reasoning costs.
Result: The augmented LLMs rapidly match the performance of test-time refinement pipelines while drastically reducing inference costs, as demonstrated on the Rubric Feedback Bench dataset.
Conclusion: The framework successfully amortizes reasoning costs by making critiques persistent and reusable, offering an efficient alternative to expensive test-time refinement approaches.
Abstract: We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls. We evaluate this method on the Rubric Feedback Bench, a novel dataset for rubric-based learning. Experiments demonstrate that our augmented LLMs rapidly match the performance of test-time refinement pipelines while drastically reducing inference cost.
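The mechanism is small: two tools over a flat file, one that distills a critique into a guideline, one that retrieves guidelines before answering. A sketch along those lines (the tool names, file format, and topic-matching rule are assumptions):

```python
import json
import pathlib

MEMORY = pathlib.Path("guidelines.jsonl")  # file-based memory store

def save_guideline(topic: str, guideline: str) -> None:
    """Tool the agent calls to distill a transient critique into a
    reusable guideline."""
    with MEMORY.open("a") as f:
        f.write(json.dumps({"topic": topic, "guideline": guideline}) + "\n")

def retrieve_guidelines(topic: str) -> list:
    """Tool the agent calls before answering, amortizing past feedback."""
    if not MEMORY.exists():
        return []
    entries = [json.loads(line) for line in MEMORY.open()]
    return [e["guideline"] for e in entries if topic in e["topic"]]
```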
[80] EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation
Pei Yang, Wanyi Chen, Ke Wang, Lynn Ai, Eric Yang, Tianyu Shi
Main category: cs.CL
TL;DR: EVM-QuestBench is an execution-grounded benchmark for evaluating natural-language transaction-script generation on EVM-compatible chains, focusing on execution accuracy and safety.
Details
Motivation: Existing evaluations for LLMs in on-chain transaction scenarios often overlook execution accuracy and safety, which is critical since even minor errors can cause irreversible losses for users in blockchain environments.
Method: The benchmark uses dynamic evaluation: instructions are sampled from template pools, numeric parameters from predefined intervals, and validators verify outcomes against instantiated values. It contains 107 tasks (62 atomic, 45 composite) with modular architecture for rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation, and composite tasks apply step-efficiency decay.
Result: Evaluation of 20 models reveals large performance gaps, with split scores showing persistent asymmetry between single-action precision and multi-step workflow completion capabilities.
Conclusion: EVM-QuestBench addresses critical gaps in evaluating LLMs for blockchain transaction scenarios, providing a rigorous execution-grounded benchmark that reveals significant model performance disparities in this safety-critical domain.
Abstract: Large language models are increasingly applied to various development scenarios. However, in on-chain transaction scenarios, even a minor error can cause irreversible loss for users. Existing evaluations often overlook execution accuracy and safety. We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains. The benchmark employs dynamic evaluation: instructions are sampled from template pools, numeric parameters are drawn from predefined intervals, and validators verify outcomes against these instantiated values. EVM-QuestBench contains 107 tasks (62 atomic, 45 composite). Its modular architecture enables rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation; composite tasks apply step-efficiency decay. We evaluate 20 models and find large performance gaps, with split scores revealing persistent asymmetry between single-action precision and multi-step workflow completion. Code: https://anonymous.4open.science/r/bsc_quest_bench-A9CF/.
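Dynamic evaluation means there is no fixed test set: each run samples an instruction from a template pool, draws numeric parameters from intervals, and the validator later checks the forked chain's state against exactly those sampled values. A sketch of the sampling side (the template, parameter shapes, and validator check are illustrative, not the benchmark's actual pools):

```python
import random

def instantiate_task(template, param_ranges, rng=None):
    """Sample one concrete instruction from a template-pool entry, e.g.
    template     = "Transfer {amount} {token} to {to}"
    param_ranges = {"amount": (0.1, 5.0),        # numeric interval
                    "token": ["USDT", "WBNB"],   # categorical pool
                    "to": ["0xAb...", "0xCd..."]}
    (hypothetical shapes -- the real pools and validators are richer)
    """
    rng = rng or random.Random()
    params = {}
    for name, spec in param_ranges.items():
        if isinstance(spec, tuple):
            params[name] = round(rng.uniform(*spec), 4)
        else:
            params[name] = rng.choice(spec)
    return template.format(**params), params

def validate(balance_delta, params, tol=1e-9):
    """Outcome check against the instantiated values: here, that the
    recipient's balance on the forked chain grew by `amount`."""
    return abs(balance_delta - params["amount"]) < tol
```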
[81] Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure
Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib, K. M. Shadman Wadith, Nazia Tasnim, Farig Sadeque
Main category: cs.CL
TL;DR: KIF is a representation-aware framework for true knowledge erasure in LLMs that targets internal activation signatures rather than surface outputs, achieving near-perfect erasure while preserving utility.
Details
Motivation: Current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist. This is critical for GDPR compliance and model safety, but existing approaches fail to achieve genuine erasure.
Method: Knowledge Immunization Framework (KIF) combines dynamic suppression of subject-specific representations with parameter-efficient adaptation. It targets internal activation signatures rather than surface outputs and uses a dual-metric evaluation protocol to distinguish true erasure from obfuscation.
Result: KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), breaking the stability-erasure tradeoff. Standard models show scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal architectural divergence.
Conclusion: KIF enables durable unlearning without full retraining by targeting representation-level mechanisms rather than surface behavior. The framework provides systematic diagnosis of forgetting behavior across model families and scales.
Abstract: Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing the Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observations show that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.
[82] Speculative Decoding: Performance or Illusion?
Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung
Main category: cs.CL
TL;DR: Systematic study of speculative decoding on a production-grade inference engine reveals that verification overhead dominates execution, acceptance patterns vary significantly, and substantial gaps exist between observed and theoretical performance bounds.
Details
Motivation: Prior evaluations of speculative decoding rely on research prototypes and unrealistically small batch sizes, leaving real-world effectiveness unclear. The paper aims to provide the first systematic study on a production-grade inference engine to understand practical performance.
Method: Conducted systematic evaluation on vLLM inference engine, covering multiple SD variants (n-gram, EAGLE/EAGLE-3, Draft-Model, Multi-Token Prediction) across diverse workloads, model scales, and batch sizes. Analyzed key performance factors and quantified theoretical upper bounds.
Result: Verification by target model dominates execution time; acceptance length varies markedly across token positions, requests, and datasets. Substantial gaps exist between observed performance and theoretical upper bounds, highlighting optimization opportunities.
Conclusion: The study provides practical insights into SD performance in production settings, reveals current limitations, and opens new research directions for improving speculative decoding efficiency.
Abstract: Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch sizes. We present, to our knowledge, the first systematic study of SD on a production-grade and widely deployed inference engine (vLLM), covering multiple SD variants ($n$-gram, EAGLE/EAGLE-3, Draft-Model, Multi-Token Prediction) across diverse workloads, model scales, and batch sizes. We analyze key factors governing SD performance, and quantify a theoretical upper bound on SD speedup. Our results show that verification by the target model dominates execution time, while acceptance length varies markedly across output token positions, requests, and datasets. Comparing measured performance with theoretical bounds reveals substantial gaps, which we leverage to highlight new research opportunities for improving SD.
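For readers new to SD, the toy loop below shows the draft-then-verify structure and why target-model verification dominates cost. It is a simplified deterministic variant with invented toy "models", not the paper's or vLLM's implementation; real engines verify all k drafted tokens in a single batched target forward pass rather than one call per token.

```python
# Toy greedy draft-and-verify step over integer "tokens".
from typing import Callable, List

def speculative_step(target_next: Callable[[List[int]], int],
                     draft_next: Callable[[List[int]], int],
                     ctx: List[int], k: int) -> List[int]:
    drafts, spec_ctx = [], list(ctx)
    for _ in range(k):                       # cheap draft model proposes k tokens
        t = draft_next(spec_ctx)
        drafts.append(t)
        spec_ctx.append(t)
    accepted, verify_ctx = [], list(ctx)
    for t in drafts:                         # expensive target verifies each one
        if target_next(verify_ctx) != t:     # first mismatch stops acceptance
            accepted.append(target_next(verify_ctx))  # fall back to target's token
            return accepted
        accepted.append(t)
        verify_ctx.append(t)
    return accepted

# Deterministic stand-in "models" that sometimes disagree.
target = lambda c: (sum(c) + 1) % 7
draft = lambda c: (sum(c) + 1) % 7 if len(c) % 3 else (sum(c) + 2) % 7
print(speculative_step(target, draft, [1, 2, 3], k=4))
```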
[83] Improving Low-Resource Machine Translation via Round-Trip Reinforcement Learning
Ahmed Attia, Alham Fikri Aji
Main category: cs.CL
TL;DR: Self-supervised reinforcement learning fine-tuning for low-resource machine translation using round-trip bootstrapping with NLLB models, showing improvements in translation quality for several low-resource languages.
Details
Motivation: Low-resource machine translation remains challenging despite increasing availability of parallel data, with many approaches still underexplored. The paper aims to improve translation quality for low-resource languages using self-supervised reinforcement learning.
Method: Uses round-trip bootstrapping: translate English → target low-resource language → back to English, then uses chrF++ and BLEU scores on reconstructed English sentences as reward function for reinforcement learning fine-tuning of NLLB models (600M and 1.3B parameters).
Result: Consistent improvements observed for Central Aymara, Friulian, Wolof, Dyula, Bhojpuri, and Russian. Qualitative analysis shows increased fluency and semantic fidelity in translations.
Conclusion: The self-supervised reinforcement learning approach with round-trip bootstrapping effectively improves low-resource machine translation, with potential for further benefits from model scaling and leveraging pretrained knowledge.
Abstract: Low-resource machine translation (MT) has gained increasing attention as parallel data from low-resource language communities is collected, but many approaches for improving low-resource MT remain underexplored. We investigate self-supervised reinforcement learning fine-tuning for translation in low-resource settings using round-trip bootstrapping with the No Language Left Behind (NLLB) family of models. Our approach translates English into a target low-resource language and then back into English, using a combination of chrF++ and BLEU as the reward function on the reconstructed English sentences. Using the NLLB-MD dataset, we evaluate both the 600M and 1.3B parameter NLLB models and observe consistent improvements for the following languages: Central Aymara, Friulian, Wolof, Dyula, Bhojpuri and Russian. Qualitative inspection of translation outputs indicates increased fluency and semantic fidelity. We argue that our method can further benefit from scale, enabling models to increasingly leverage their pretrained knowledge and continue self-improving. Code available at: https://github.com/Copticoder/MT-via-Round-Trip-RL
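The round-trip reward is straightforward to reproduce with sacrebleu. In the sketch below, the equal-weight mix of chrF++ and BLEU (the `alpha` parameter) is our assumption; the paper combines the two metrics but the exact weighting is not stated in the summary above.

```python
# Sketch of the round-trip reward: translate en -> target -> en and score
# the reconstruction against the original English sentence.
import sacrebleu

def round_trip_reward(source_en: str,
                      forward_translate, backward_translate,
                      alpha: float = 0.5) -> float:
    reconstructed = backward_translate(forward_translate(source_en))
    # word_order=2 turns chrF into chrF++.
    chrf = sacrebleu.sentence_chrf(reconstructed, [source_en], word_order=2)
    bleu = sacrebleu.sentence_bleu(reconstructed, [source_en])
    return alpha * chrf.score + (1 - alpha) * bleu.score  # reward in [0, 100]

# Identity "translators" give the maximum reward, as a sanity check.
print(round_trip_reward("The cat sat on the mat.", lambda s: s, lambda s: s))
```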
[84] What Patients Really Ask: Exploring the Effect of False Assumptions in Patient Information Seeking
Raymond Xiong, Furong Jia, Lionel Wong, Monica Agrawal
Main category: cs.CL
TL;DR: LLMs struggle to identify incorrect assumptions in real patient questions about medications, unlike the medical exam questions on which they excel
Details
Motivation: Current LLM benchmarks focus on medical exam questions, but patients ask different types of questions with incorrect assumptions and dangerous intentions that LLMs need to handle safely.
Method: Created dataset from Google's People Also Ask feature using top 200 prescribed medications in US, analyzed questions with incorrect assumptions and dangerous intentions.
Result: Found many patient questions contain incorrect assumptions; LLMs that perform well on medical exams struggle to identify these incorrect assumptions in real patient questions
Conclusion: Need better benchmarks and LLM capabilities for handling real patient questions with incorrect assumptions, especially for medication-related queries
Abstract: Patients are increasingly using large language models (LLMs) to seek answers to their healthcare-related questions. However, benchmarking efforts in LLMs for question answering often focus on medical exam questions, which differ significantly in style and content from the questions patients actually raise in real life. To bridge this gap, we sourced data from Google’s People Also Ask feature by querying the top 200 prescribed medications in the United States, curating a dataset of medical questions people commonly ask. A considerable portion of the collected questions contains incorrect assumptions and dangerous intentions. We demonstrate that the emergence of these corrupted questions is not uniformly random and depends heavily on the degree of incorrectness in the history of questions that led to their appearance. Current LLMs that perform strongly on other benchmarks struggle to identify incorrect assumptions in everyday questions.
[85] GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler
Minghan Wang, Ye Bai, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari
Main category: cs.CL
TL;DR: Gaussian Thought Sampler (GTS) improves inference-time scaling in latent reasoning models by learning context-dependent perturbation distributions instead of using heuristic noise, enabling better-controlled sampling of reasoning trajectories.
Details
Motivation: Current inference-time scaling methods use heuristic perturbations like dropout or Gaussian noise, but stronger perturbations don't necessarily improve sampling quality - they cause distribution shifts without producing better reasoning paths or decisions. There's no explicit conditional sampling distribution, making latent exploration difficult to control or optimize.
Method: Proposes Gaussian Thought Sampler (GTS), a lightweight module that reformulates latent exploration as sampling from a learned conditional distribution over continuous reasoning states. GTS predicts context-dependent perturbation distributions and is trained with GRPO-style policy optimization while keeping the backbone frozen, turning heuristic perturbation into an explicit probabilistic sampling policy.
Result: Experiments across multiple benchmarks and two latent reasoning architectures show that GTS yields more reliable inference-time scaling than heuristic baselines.
Conclusion: Effective latent inference-time scaling requires better-controlled and optimizable sampling rather than simply amplifying stochasticity. GTS provides a principled approach to learning conditional sampling distributions for improved reasoning trajectory exploration.
Abstract: Inference-time scaling (ITS) in latent reasoning models typically relies on heuristic perturbations, such as dropout or fixed Gaussian noise, to generate diverse candidate trajectories. However, we show that stronger perturbations do not necessarily yield better sampling quality: they often induce larger distribution shifts without producing more useful reasoning paths or better final decisions. A key limitation is that these perturbations inject stochasticity without defining an explicit conditional sampling distribution, making latent exploration difficult to control or optimize. To address this, we propose the Gaussian Thought Sampler (GTS), a lightweight module that reformulates latent exploration as sampling from a learned conditional distribution over continuous reasoning states. GTS predicts context-dependent perturbation distributions and is trained with GRPO-style policy optimization while keeping the backbone frozen, turning heuristic perturbation into an explicit probabilistic sampling policy. Experiments across multiple benchmarks and two latent reasoning architectures show that GTS yields more reliable inference-time scaling than heuristic baselines, suggesting that effective latent ITS requires better-controlled and optimizable sampling rather than simply amplifying stochasticity.
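A minimal stand-in for a learnable Gaussian sampler over latent states, using the reparameterization trick so it can be trained with policy-gradient methods (the paper uses GRPO-style optimization). The head design and shapes below are illustrative, not the paper's architecture.

```python
# Sketch: a small head predicts a context-dependent mean shift and
# log-variance, and perturbations are drawn by reparameterization.
import torch
import torch.nn as nn

class GaussianThoughtSampler(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.head = nn.Linear(d_model, 2 * d_model)  # predicts (mu, log_var)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        mu, log_var = self.head(latent).chunk(2, dim=-1)
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return latent + mu + eps * std               # sampled perturbed state

sampler = GaussianThoughtSampler(d_model=64)
state = torch.randn(4, 64)                           # batch of latent states
samples = torch.stack([sampler(state) for _ in range(8)])  # 8 candidate trajectories
print(samples.shape)  # torch.Size([8, 4, 64])
```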
[86] Improving Sampling for Masked Diffusion Models via Information Gain
Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb
Main category: cs.CL
TL;DR: Info-Gain Sampler improves masked diffusion model decoding by considering future information gain rather than just local certainty, achieving better performance across reasoning, coding, creative writing, and image generation tasks.
Details
Motivation: Existing masked diffusion model samplers use greedy heuristics that only consider local certainty, neglecting how current decoding choices affect future steps and failing to minimize cumulative uncertainty across all masked positions.
Method: Proposes Info-Gain Sampler, a principled decoding framework that balances immediate uncertainty reduction with information gain over future masked tokens, leveraging the non-causal nature of MDMs to evaluate how decoding decisions reshape probabilities across all remaining positions.
Result: Consistently outperforms existing samplers across diverse architectures and tasks: 3.6% average accuracy improvement on reasoning tasks, 63.1% win-rate in creative writing, and reduces cumulative uncertainty from 78.4 to 48.6 on reasoning tasks.
Conclusion: The Info-Gain Sampler provides a more effective decoding strategy for masked diffusion models by considering downstream impacts of decoding decisions, demonstrating significant improvements across multiple domains including reasoning, coding, creative writing, and image generation.
Abstract: Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful planning to achieve high-quality generation. Existing samplers typically adopt greedy heuristics, prioritizing positions with the highest local certainty to decode at each step. Through failure case analysis, we identify a fundamental limitation of this approach: it neglects the downstream impact of current decoding choices on subsequent steps and fails to minimize cumulative uncertainty. In particular, these methods do not fully exploit the non-causal nature of MDMs, which enables evaluating how a decoding decision reshapes token probabilities/uncertainty across all remaining masked positions. To bridge this gap, we propose the Info-Gain Sampler, a principled decoding framework that balances immediate uncertainty with information gain over future masked tokens. Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that Info-Gain Sampler consistently outperforms existing samplers for MDMs. For instance, it achieves a 3.6% improvement in average accuracy on reasoning tasks and a 63.1% win-rate in creative writing. Notably, on reasoning tasks it reduces cumulative uncertainty from 78.4 to 48.6, outperforming the best baseline by a large margin. The code will be available at https://github.com/yks23/Information-Gain-Sampler.
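The selection rule can be sketched as follows: rather than decoding the most locally confident position, score each candidate by the total entropy left over the remaining masked positions after committing its argmax. The `predict` stand-in and exhaustive greedy search below are illustrative only, not the paper's estimator.

```python
# Toy information-gain position selection for a masked-diffusion-style model.
import torch

def total_entropy(probs: torch.Tensor, masked: list) -> float:
    p = probs[masked].clamp_min(1e-12)
    return float(-(p * p.log()).sum())

def pick_position(predict, seq: list, masked: list) -> int:
    best_pos, best_future = None, float("inf")
    for pos in masked:
        probs = predict(seq)
        trial = list(seq)
        trial[pos] = int(probs[pos].argmax())        # commit most likely token
        remaining = [m for m in masked if m != pos]
        future = total_entropy(predict(trial), remaining) if remaining else 0.0
        if future < best_future:                     # minimize cumulative uncertainty
            best_pos, best_future = pos, future
    return best_pos

# Stand-in predictor: distributions sharpen when neighbors are filled in.
def predict(seq):
    logits = torch.randn(len(seq), 5, generator=torch.Generator().manual_seed(0))
    known = torch.tensor([float(t is not None) for t in seq])
    sharp = 1.0 + known.roll(1) + known.roll(-1)
    return torch.softmax(logits * sharp.unsqueeze(-1), dim=-1)

seq = [None] * 6
print(pick_position(predict, seq, masked=[i for i, t in enumerate(seq) if t is None]))
```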
[87] Bootstrapping Embeddings for Low Resource Languages
Merve Basoz, Andrew Horne, Mattia Opper
Main category: cs.CL
TL;DR: The paper explores using LLMs to generate synthetic triplet data for embedding models in low-resource languages, proposing two novel methods (adapter composition and XL-LoRA) that outperform in-context learning and achieve strong multilingual performance.
Details
Motivation: Creating effective embedding models requires supervised finetuning data, which is readily available for high-resource languages like English but non-existent for hundreds of other languages. The paper investigates whether large language models can help bridge this gap for low-resource languages.
Method: Three strategies for generating synthetic triplet data: 1) in-context learning, 2) novel adapter composition approach, and 3) cross-lingual finetuning of LLM generator (XL-LoRA). The synthetic data is used to optimize embedding models across multiple languages.
Result: While in-context learning falls short of strong non-synthetic baselines, both adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a scalable pathway to producing performant embedding models for diverse languages.
Conclusion: LLMs can effectively bridge the data gap for low-resource languages in embedding model development. The novel approaches of adapter composition and XL-LoRA provide scalable solutions for creating performant multilingual embedding models without extensive supervised data.
Abstract: Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high resource languages, such as English, such datasets are readily available. However, for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating synthetic triplet data used to optimise embedding models. These include in-context learning as well as two novel approaches, leveraging adapter composition and cross-lingual finetuning of the LLM generator (XL-LoRA) respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.
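A sketch of how such synthetic triplets plug into embedding training with a margin loss. The triplet text, toy bag-of-words encoder, and margin value below are all placeholders, not the paper's data or models.

```python
# Sketch: a (query, positive, negative) triplet optimizing an encoder
# with a cosine-similarity margin loss.
import torch
import torch.nn.functional as F

triplet = {   # hypothetical LLM-generated triplet (placeholder English text)
    "query": "how do you bake bread at home",
    "positive": "a simple home bread recipe uses flour, water and yeast",
    "negative": "tomorrow's weather will be rainy in the north",
}

def triplet_loss(embed, triplet: dict, margin: float = 0.2) -> torch.Tensor:
    q, p, n = (F.normalize(embed(triplet[k]), dim=-1)
               for k in ("query", "positive", "negative"))
    pos_sim, neg_sim = (q * p).sum(), (q * n).sum()
    return F.relu(margin - pos_sim + neg_sim)  # push positive above negative

# Toy encoder: hash words into a fixed-size bag-of-words vector.
def embed(text: str) -> torch.Tensor:
    v = torch.zeros(64)
    for w in text.lower().split():
        v[hash(w) % 64] += 1.0
    return v

print(float(triplet_loss(embed, triplet)))
```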
[88] Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
Juming Xiong, Kevin Guo, Congning Ni, Chao Yan, Katherine Brown, Avinash Baidya, Xiang Gao, Bradley Malin, Zhijun Yin
Main category: cs.CL
TL;DR: A confidence-aware decision framework that analyzes single reasoning trajectories to adaptively choose between single-path and multi-path reasoning, reducing token usage by up to 80% while maintaining accuracy comparable to multi-path baselines.
Details
Motivation: Current LLM reasoning approaches using chain-of-thought often generate unnecessarily long reasoning paths with high inference costs, while self-consistency methods improve accuracy but require sampling multiple trajectories with substantial computational overhead.
Method: A confidence-aware decision framework trained using sentence-level numeric and linguistic features extracted from intermediate reasoning states in MedQA dataset, which generalizes to other datasets without additional fine-tuning.
Result: The method maintains accuracy comparable to multi-path baselines while using up to 80% fewer tokens, demonstrating that reasoning trajectories contain rich signals for uncertainty estimation.
Conclusion: Reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.
Abstract: Large language models (LLMs) achieve strong reasoning performance through chain-of-thought (CoT) reasoning, yet often generate unnecessarily long reasoning paths that incur high inference cost. Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead. This paper introduces a confidence-aware decision framework that analyzes a single completed reasoning trajectory to adaptively select between single-path and multi-path reasoning. The framework is trained using sentence-level numeric and linguistic features extracted from intermediate reasoning states in the MedQA dataset and generalizes effectively to MathQA, MedMCQA, and MMLU without additional fine-tuning. Experimental results show that the proposed method maintains accuracy comparable to multi-path baselines while using up to 80% fewer tokens. These findings demonstrate that reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.
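The decision rule reduces to a simple gate, sketched below with placeholder stand-ins; the paper trains a classifier on sentence-level numeric and linguistic features rather than the toy heuristic used here.

```python
# Sketch: score one completed trajectory, return its answer if confident,
# otherwise fall back to multi-path self-consistency (majority vote).
from collections import Counter

def answer_adaptively(generate, score_confidence, question: str,
                      threshold: float = 0.8, k: int = 5) -> str:
    first = generate(question)                  # one chain-of-thought sample
    if score_confidence(first) >= threshold:    # confident: stop, save tokens
        return first["answer"]
    paths = [first] + [generate(question) for _ in range(k - 1)]
    votes = Counter(p["answer"] for p in paths)
    return votes.most_common(1)[0][0]

# Toy stand-ins: a fixed generator and a length-based "confidence" feature.
gen = lambda q: {"answer": "B", "trace": "step 1 ... step 2 ..."}
conf = lambda p: 1.0 / (1.0 + len(p["trace"]) / 100.0)
print(answer_adaptively(gen, conf, "Which drug interacts with warfarin?"))
```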
[89] Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent
Zhongzhen Huang, Yan Ling, Hong Chen, Ye Feng, Li Wu, Linjie Mu, Shaoting Zhang, Xiaofan Zhang, Kun Qian, Xiaomu Li
Main category: cs.CL
TL;DR: PULSE is a medical reasoning agent combining a domain-tuned LLM with scientific literature retrieval for diagnostic decision-making in endocrinology cases, achieving expert-competitive accuracy and stable performance across disease rarity levels.
Details
Motivation: To develop an AI agent that can support clinical diagnostic decision-making in complex real-world medical cases, particularly in endocrinology where cases vary widely in disease types and incidence levels, and to understand how such AI assistance influences human diagnostic reasoning.
Method: Combines a domain-tuned large language model with scientific literature retrieval to create PULSE medical reasoning agent. Evaluated on curated benchmark of 82 authentic endocrinology case reports. Compared performance against physicians with varying expertise levels (residents to senior specialists) and examined AI assistance effects on human diagnostic reasoning through controlled experiments.
Result: PULSE achieved expert-competitive accuracy, outperforming residents and junior specialists while matching senior specialist performance at Top@1 and Top@4 thresholds. Unlike physicians whose accuracy declined with disease rarity, PULSE maintained stable performance across incidence tiers. The agent exhibited adaptive reasoning (longer outputs for harder cases). In collaborative use, PULSE helped physicians correct errors and broaden hypotheses but introduced automation bias risks.
Conclusion: PULSE demonstrates promise for clinical diagnosis support with robust performance across common and rare presentations, but also reveals limitations including automation bias risks. The study provides a framework for evaluating language model-based agents in real-world medical decision-making.
Abstract: We present PULSE, a medical reasoning agent that combines a domain-tuned large language model with scientific literature retrieval to support diagnostic decision-making in complex real-world cases. To evaluate its capabilities, we curated a benchmark of 82 authentic endocrinology case reports encompassing a broad spectrum of disease types and incidence levels. In controlled experiments, we compared PULSE's performance against physicians with varying levels of expertise, from residents to senior specialists, and examined how AI assistance influenced human diagnostic reasoning. PULSE attained expert-competitive accuracy, outperforming residents and junior specialists while matching senior specialist performance at both Top@1 and Top@4 thresholds. Unlike physicians, whose accuracy declined with disease rarity, PULSE maintained stable performance across incidence tiers. The agent also exhibited adaptive reasoning, increasing output length with case difficulty in a manner analogous to the longer deliberation observed among expert clinicians. When used collaboratively, PULSE enabled physicians to correct initial errors and broaden diagnostic hypotheses, but also introduced risks of automation bias. The study explores both serial and concurrent collaboration workflows, revealing that PULSE offers robust support across common and rare presentations. These findings underscore both the promise and the limitations of language model-based agents in clinical diagnosis, and offer a framework for evaluating their role in real-world decision-making.
[90] CTG-DB: An Ontology-Based Transformation of ClinicalTrials.gov to Enable Cross-Trial Drug Safety Analyses
Jeffery L. Painter, François Haguinet, Andrew Bate
Main category: cs.CL
TL;DR: CTG-DB transforms ClinicalTrials.gov data into a relational database with standardized adverse event terminology using MedDRA for systematic pharmacovigilance analytics.
Details
Motivation: ClinicalTrials.gov has heterogeneous adverse event terminology and registry-oriented architecture that limits systematic pharmacovigilance analytics, requiring manual reconciliation of safety concepts.
Method: Created an open-source pipeline that ingests complete CT.gov XML archive, produces relational database aligned to MedDRA terminology using deterministic exact and fuzzy matching, preserves arm-level denominators and comparator arms.
Result: CTG-DB enables concept-level retrieval, cross-trial aggregation for scalable placebo-referenced safety analyses, and integration of clinical trial evidence into downstream pharmacovigilance signal detection.
Conclusion: The framework provides transparent and reproducible mappings for systematic pharmacovigilance analytics from clinical trial data.
Abstract: ClinicalTrials.gov (CT.gov) is the largest publicly accessible registry of clinical studies, yet its registry-oriented architecture and heterogeneous adverse event (AE) terminology limit systematic pharmacovigilance (PV) analytics. AEs are typically recorded as investigator-reported text rather than standardized identifiers, requiring manual reconciliation to identify coherent safety concepts. We present the ClinicalTrials.gov Transformation Database (CTG-DB), an open-source pipeline that ingests the complete CT.gov XML archive and produces a relational database aligned to standardized AE terminology using the Medical Dictionary for Regulatory Activities (MedDRA). CTG-DB preserves arm-level denominators, represents placebo and comparator arms, and normalizes AE terminology using deterministic exact and fuzzy matching to ensure transparent and reproducible mappings. This framework enables concept-level retrieval and cross-trial aggregation for scalable placebo-referenced safety analyses and integration of clinical trial evidence into downstream PV signal detection.
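The exact-then-fuzzy normalization step can be sketched in a few lines. The tiny vocabulary, the placeholder codes, and the 0.85 cutoff below are illustrative; the real pipeline maps against the full MedDRA terminology.

```python
# Sketch: deterministic exact-then-fuzzy mapping of AE terms to a
# standardized vocabulary.
import difflib

MEDDRA_PT = {"headache": "PT-0001", "nausea": "PT-0002",
             "myocardial infarction": "PT-0003"}   # placeholder codes

def normalize_ae(term: str):
    key = term.strip().lower()
    if key in MEDDRA_PT:                           # exact match first
        return key, MEDDRA_PT[key], "exact"
    close = difflib.get_close_matches(key, list(MEDDRA_PT), n=1, cutoff=0.85)
    if close:                                      # deterministic fuzzy fallback
        return close[0], MEDDRA_PT[close[0]], "fuzzy"
    return None, None, "unmapped"

print(normalize_ae("Headache"))   # -> ('headache', 'PT-0001', 'exact')
print(normalize_ae("headach"))    # -> ('headache', 'PT-0001', 'fuzzy')
```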
[91] Attention-guided Evidence Grounding for Spoken Question Answering
Ke Yang, Bolin Chen, Yuejie Li, Yueying Hua, Jianhao Nie, Yueping He, Bowen Li, Chengjun Mao
Main category: cs.CL
TL;DR: AEG is an end-to-end framework for Spoken QA that uses attention mechanisms in SpeechLLMs to ground evidence directly in latent space, reducing hallucinations and improving efficiency over cascaded ASR systems.
Details
Motivation: Spoken QA faces challenges with cross-modal alignment between acoustic queries and textual knowledge, and cascaded ASR systems suffer from latency and error propagation issues. The authors aim to create a more efficient end-to-end solution.
Method: Proposes Attention-guided Evidence Grounding (AEG) framework that leverages internal cross-modal attention in SpeechLLMs to locate and ground key evidence in latent space. Introduces Learning to Focus on Evidence (LFE) fine-tuning paradigm to calibrate attention mechanisms to distinguish relevant from irrelevant segments.
Result: Experiments on SQuAD, HotpotQA, and MuSiQue show AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
Conclusion: AEG provides an effective end-to-end solution for Spoken QA that improves both accuracy and efficiency by better leveraging cross-modal attention mechanisms in SpeechLLMs.
Abstract: Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model’s latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model’s attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
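A toy version of attention-guided selection: aggregate cross-modal attention from query tokens over candidate segments and keep the top-k. The shapes and top-k rule below are illustrative; AEG operates inside a SpeechLLM's latent space with a trained focusing objective (LFE), not a raw top-k over attention weights.

```python
# Sketch: pick evidence segments by aggregated cross-attention mass.
import torch

def select_evidence(attn: torch.Tensor, k: int = 2):
    # attn: (n_heads, n_query_tokens, n_segments) cross-attention weights
    scores = attn.mean(dim=(0, 1))         # aggregate over heads and queries
    return torch.topk(scores, k).indices   # indices of grounded evidence segments

attn = torch.softmax(torch.randn(8, 5, 12), dim=-1)  # 12 candidate segments
print(select_evidence(attn, k=3))
```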
[92] Omnilingual MT: Machine Translation for 1,600 Languages
Omnilingual MT Team, Belen Alastruey, Niyati Bafna, Andrea Caciolai, Kevin Heffernan, Artyom Kozhevnikov, Christophe Ropers, Eduardo Sánchez, Charles-Eric Saint-James, Ioannis Tsiamas, Chierh Cheng, Joe Chuang, Paul-Ambroise Duquenne, Mark Duppenthaler, Nate Ekberg, Cynthia Gao, Pere Lluís Huguet Cabot, João Maria Janeiro, Jean Maillard, Gabriel Mejia Gonzalez, Holger Schwenk, Edan Toledo, Arina Turkatenko, Albert Ventayol-Boada, Rashel Moritz, Alexandre Mourachko, Surya Parimi, Mary Williamson, Shireen Yates, David Dale, Marta R. Costa-jussà
Main category: cs.CL
TL;DR: OMT is a machine translation system supporting 1,600+ languages using specialized LLMs, outperforming larger baselines and enabling coherent generation for undersupported languages.
Details
Motivation: Current multilingual MT systems cover only ~200 target languages out of 7,000+ world languages, with limited evaluation benchmarks. There's a need for broader language coverage and better understanding of cross-lingual transfer capabilities.
Method: Developed Omnilingual Machine Translation (OMT) using comprehensive data strategy integrating public multilingual corpora with new datasets like MeDLEY bitext. Explored two LLM specialization approaches: decoder-only (OMT-LLaMA) and encoder-decoder architecture (OMT-NLLB) with 1B to 8B parameters.
Result: OMT models match/exceed 70B LLM baseline performance, enable coherent generation for undersupported languages, improve cross-lingual transfer, and provide evaluation datasets (BOUQuET and Met-BOUQuET) for omnilingual assessment.
Conclusion: OMT demonstrates that specialized smaller models can achieve strong translation quality across 1,600+ languages, addressing both understanding and generation challenges in low-resource settings.
Abstract: High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and perhaps a few hundred more on the source side, supported through cross-lingual transfer. And even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a Large Language Model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve in cross-lingual transfer, coming close to solving the "understanding" part of the puzzle in MT for the 1,600 languages evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards omnilinguality and are freely available.
[93] EngGPT2: Sovereign, Efficient and Open Intelligence
G. Ciarfaglia, A. Rosanova, S. Cipolla, J. Bartoli, A. Di Domenico, C. Fioroni, A. Fontana, M. R. Scoleri, M. I. Mone, D. Franchi, M. C. Del Gaudio, F. Picariello, M. Gabusi, S. Bonura, V. Morreale, I. Bailo
Main category: cs.CL
TL;DR: EngGPT2-16B-A3B is an efficient Italian-focused Mixture-of-Experts LLM trained on 2.5T tokens with strong performance comparable to 8B-16B dense models while using significantly less inference power and training data.
Details
Motivation: To create a sovereign, efficient, and open European LLM that combines performance and efficiency while being fully aligned with the EU AI Act, with particular focus on Italian and European NLP tasks.
Method: Trained-from-scratch Mixture-of-Experts architecture with 16B total parameters (3B active per inference), trained on 2.5T tokens including 25% Italian-language data, featuring multiple reasoning modes.
Result: Delivers performance comparable to dense 8B-16B models on benchmarks (MMLU-Pro, GSM8K, IFEval, HumanEval) while requiring 1/5 to 1/2 inference power and 1/10 to 1/6 training data/power.
Conclusion: EngGPT2 sets a new standard for resource-conscious, high-performance LLMs tailored to European and Italian contexts, positioning itself as a key contributor to open-weight European models.
Abstract: EngGPT2-16B-A3B is the latest iteration of Engineering Group's Italian LLM and is built to be a Sovereign, Efficient and Open model. EngGPT2 is trained on 2.5 trillion tokens - less than Qwen3's 36T or Llama3's 15T - and delivers performance on key benchmarks, including MMLU-Pro, GSM8K, IFEval and HumanEval, comparable to dense models in the 8B-16B range, while requiring one-fifth to half of the inference power, and between one-tenth and one-sixth of the training data and corresponding training power. Designed as a trained-from-scratch Mixture-of-Experts (MoE) architecture, EngGPT2 features 16 billion parameters with 3 billion active per inference, with expert sizes positioned between those used in GPT-OSS and Qwen3. Approximately 25% of its training corpus consists of Italian-language data, to deliver strong capabilities for European and Italian NLP tasks among models of similar scale. This efficiency aims to position EngGPT2 as a key contributor to the growing portfolio of open-weight European models, combining performance and efficiency with full alignment to the EU AI Act. EngGPT2 is also a single model capable of multiple reasoning modes: non-reasoning, reasoning in Italian or English, and turbo-reasoning (a concise, bullet-point style reasoning available in both languages designed for real-time reasoning use cases). EngGPT2 aims to set a new standard for resource-conscious, high-performance LLMs tailored to European and Italian contexts.
[94] Advancing Software Quality: A Standards-Focused Review of LLM-Based Assurance Techniques
Avinash Patil
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2505.13766 returned HTTP 429 (rate limited).
[95] ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs
Peng Ding
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2507.10593 returned HTTP 429 (rate limited).
[96] IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
Aayush Mishra, Daniel Khashabi, Anqi Liu
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2509.22621 returned HTTP 429 (rate limited).
[97] Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
Ziyan Wang, Zheng Wang, Xingwei Qu, Qi Cheng, Jie Fu, Shengpu Tang, Minjia Zhang, Xiaoming Huo
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2510.04072 returned HTTP 429 (rate limited).
[98] SO-Bench: A Structural Output Evaluation of Multimodal LLMs
Di Feng, Kaixin Ma, Feng Nan, Haofeng Chen, Bohan Zhai, David Griffiths, Mingfei Gao, Zhe Gan, Eshan Verma, Yinfei Yang, Zhifeng Chen, Afshin Dehghan
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2511.21750 returned HTTP 429 (rate limited).
[99] TxSum: User-Centered Ethereum Transaction Understanding with Micro-Level Semantic Grounding
Zifan Peng, Jingyi Zheng, Yule Liu, Huaiyu Jia, Qiming Ye, Jingyu Liu, Xufeng Yang, Mingchen Li, Qingyuan Gong, Xuechao Wang, Xinlei He
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2512.06933 returned HTTP 429 (rate limited).
[100] VL-RouterBench: A Benchmark for Vision-Language Model Routing
Zhehao Huang, Baijiong Lin, Jingyuan Zhang, Jingying Wang, Yuhang Liu, Ning Lu, Tao Li, Xiaolin Huang
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2512.23562 returned HTTP 429 (rate limited).
[101] APEX-SWE
Abhi Kottamasu, Chirag Mahapatra, Sam Lee, Ben Pan, Aakash Barthwal, Akul Datta, Ajay Arun, Silas Alberti, Adarsh Hiremath, Brendan Foody, Bertie Vidgen
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2601.08806 returned HTTP 429 (rate limited).
[102] GIFT: Reconciling Post-Training Objectives via Finite-Temperature Gibbs Initialization
Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng, Chengyu Shen, Lexiang Tang, Haoze Sun, Peng Pei, Wentao Zhang
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2601.09233 returned HTTP 429 (rate limited).
[103] SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
Jialong Chen, Xander Xu, Hu Wei, Chuan Chen, Bing Zhao
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.03823 returned HTTP 429 (rate limited).
[104] Meta-Reinforcement Learning with Self-Reflection for Agentic Search
Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert, Pradeep Dasigi, Noah A. Smith, Hannaneh Hajishirzi
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.11327 returned HTTP 429 (rate limited).
[105] The Reasoning Bottleneck in Graph-RAG: Structured Prompting and Context Compression for Multi-Hop QA
Yasaman Zarrinkia, Venkatesh Srinivasan, Alex Thomo
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.14045 returned HTTP 429 (rate limited).
[106] Resource Consumption Threats in Large Language Models
Yuanhe Zhang, Xinyue Wang, Zhican Chen, Weiliu Wang, Zilu Zhang, Zhengshuo Gong, Zhenhong Zhou, Kun Wang, Li Sun, Yang Liu, Sen Su
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.16068 returned HTTP 429 (rate limited).
cs.CV
[107] Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation: Radiologist-Like Workflow with Clinically Verifiable Rewards
Kaito Baba, Satoshi Kodera
Main category: cs.CV
TL;DR: MARL-Rad is a multi-modal multi-agent reinforcement learning framework for radiology report generation that coordinates region-specific agents with a global integrating agent, optimized via clinically verifiable rewards to improve clinical efficacy metrics.
Details
Motivation: Current radiology report generation methods often use single-model reinforcement learning or post-hoc agentization of independently trained models, which may not fully optimize clinical efficacy and consistency in medical reporting.
Method: Proposes a multi-modal multi-agent reinforcement learning framework with region-specific agents (for different anatomical areas) and a global integrating agent, jointly trained through reinforcement learning with clinically verifiable rewards.
Result: Experiments on MIMIC-CXR and IU X-ray datasets show consistent improvements in clinical efficacy metrics (RadGraph, CheXbert, GREEN scores), achieving state-of-the-art CE performance with enhanced laterality consistency and more accurate, detail-informed reports.
Conclusion: MARL-Rad demonstrates that coordinated multi-agent reinforcement learning with clinically verifiable rewards can significantly improve radiology report generation quality and clinical utility compared to existing approaches.
Abstract: We propose MARL-Rad, a novel multi-modal multi-agent reinforcement learning framework for radiology report generation that coordinates region-specific agents and a global integrating agent, optimized via clinically verifiable rewards. Unlike prior single-model reinforcement learning or post-hoc agentization of independently trained models, our method jointly trains multiple agents and optimizes the entire agent system through reinforcement learning. Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinical efficacy (CE) metrics such as RadGraph, CheXbert, and GREEN scores, achieving state-of-the-art CE performance. Further analyses confirm that MARL-Rad enhances laterality consistency and produces more accurate, detail-informed reports.
[108] CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization
Liangbin Huang, Xiaohua Liao, Chaoqun Cui, Shijing Wang, Zhaolong Huang, Yanlong Du, Wenji Mao
Main category: cs.CV
TL;DR: CineSRD is a multimodal framework for speaker diarization in visual media like films/TV series, using visual, acoustic, and linguistic cues to handle open-world challenges.
Details
Motivation: Traditional speaker diarization systems are limited to constrained scenarios like meetings with few speakers and clean audio. The paper aims to extend speaker diarization to open-world visual media (films, TV series) which present challenges like long-form content, many speakers, audio-visual asynchrony, and uncontrolled variability.
Method: Proposes CineSRD (Cinematic Speaker Registration & Diarization) - a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles. First performs visual anchor clustering to register initial speakers, then integrates an audio language model for speaker turn detection to refine annotations and supplement unregistered off-screen speakers.
Result: CineSRD achieves superior performance on the proposed visual media benchmark (including Chinese and English programs) and competitive results on conventional datasets, demonstrating robustness and generalizability in open-world visual media settings.
Conclusion: The paper successfully extends speaker diarization to complex visual media using multimodal fusion, addressing open-world challenges through a unified framework that combines visual, acoustic, and linguistic information.
Abstract: Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.
[109] Tokenization vs. Augmentation: A Systematic Study of Writer Variance in IMU-Based Online Handwriting Recognition
Jindong Li, Dario Zanca, Vincent Christlein, Tim Hamann, Jens Barth, Peter Kämpf, Björn Eskofier
Main category: cs.CV
TL;DR: Online handwriting recognition using inertial sensors benefits from sub-word tokenization for inter-writer variability and concatenation-based data augmentation for intra-writer sparsity.
Details
Motivation: Inertial measurement unit-based online handwriting recognition faces challenges with uneven character distributions and inter-writer variability, requiring strategies to handle both inter-writer and intra-writer variance.
Method: Systematically investigates two strategies: sub-word tokenization (Bigram tokenization) and concatenation-based data augmentation. Experiments conducted on OnHW-Words500 dataset with writer-independent and writer-dependent splits.
Result: Bigram tokenization reduced WER from 15.40% to 12.99% on writer-independent split. Concatenation-based data augmentation reduced character error rate by 34.5% and WER by 25.4% on writer-dependent split. Short, low-level tokens benefit performance.
Conclusion: Sub-word tokenization primarily mitigates inter-writer stylistic variability, while concatenation-based data augmentation effectively compensates for intra-writer distributional sparsity, revealing a clear variance-dependent effect.
Abstract: Inertial measurement unit-based online handwriting recognition enables the recognition of input signals collected across different writing surfaces but remains challenged by uneven character distributions and inter-writer variability. In this work, we systematically investigate two strategies to address these issues: sub-word tokenization and concatenation-based data augmentation. Our experiments on the OnHW-Words500 dataset reveal a clear dichotomy between handling inter-writer and intra-writer variance. On the writer-independent split, structural abstraction via Bigram tokenization significantly improves generalization to unseen writing styles, reducing the word error rate (WER) from 15.40% to 12.99%. In contrast, on the writer-dependent split, tokenization degrades performance due to vocabulary distribution shifts between the training and validation sets. Instead, our proposed concatenation-based data augmentation acts as a powerful regularizer, reducing the character error rate by 34.5% and the WER by 25.4%. Further analysis shows that short, low-level tokens benefit model performance and that the performance gains from concatenation-based data augmentation surpass those achieved by proportionally extended training. These findings reveal a clear variance-dependent effect: sub-word tokenization primarily mitigates inter-writer stylistic variability, whereas concatenation-based data augmentation effectively compensates for intra-writer distributional sparsity.
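Concatenation-based augmentation amounts to joining two variable-length sensor sequences along time and concatenating their label strings. The sketch below assumes a short zero-valued gap between words and 13 IMU channels; both are illustrative choices, not details from the paper.

```python
# Sketch: concatenate two IMU word samples and their transcriptions.
import numpy as np

def concat_augment(x1: np.ndarray, y1: str,
                   x2: np.ndarray, y2: str,
                   gap_frames: int = 10):
    gap = np.zeros((gap_frames, x1.shape[1]), dtype=x1.dtype)
    return np.concatenate([x1, gap, x2], axis=0), y1 + " " + y2

x1 = np.random.randn(120, 13).astype(np.float32)   # 13 IMU channels (illustrative)
x2 = np.random.randn(95, 13).astype(np.float32)
x, y = concat_augment(x1, "hello", x2, "world")
print(x.shape, repr(y))   # (225, 13) 'hello world'
```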
[110] Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio
Jeong Hun Yeo, Hyeongseop Rha, Sungjune Park, Junil Won, Yong Man Ro
Main category: cs.CV
TL;DR: First unified framework for processing sign language, lip movements, and audio together for spoken-language text generation, achieving state-of-the-art performance across multiple tasks.
Details
Motivation: Audio-centric ASR systems exclude deaf/hard-of-hearing individuals. While sign language and lip reading offer alternatives, these modalities have been studied in isolation without a unified framework that can handle diverse combinations of visual and audio inputs.
Method: Proposes a unified, modality-agnostic architecture capable of processing heterogeneous inputs (sign language, lip movements, audio). Focuses on exploring synergy among modalities, particularly lip movements as non-manual cues in sign language comprehension.
Result: Achieves performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and Audio-Visual Speech Recognition. Key finding: explicitly modeling lip movements as distinct modality significantly improves SLT performance by capturing critical non-manual cues.
Conclusion: Successfully demonstrates first unified framework for multimodal processing of sign language, lip movements, and audio, revealing important linguistic insights about modality synergy while achieving superior performance across multiple speech-related tasks.
Abstract: Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such audio-centric systems inherently exclude individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and Audio-Visual Speech Recognition. Furthermore, our analysis reveals a key linguistic insight: explicitly modeling lip movements as a distinct modality significantly improves SLT performance by capturing critical non-manual cues.
[111] Leveraging Large Vision Model for Multi-UAV Co-perception in Low-Altitude Wireless Networks
Yunting Xu, Jiacheng Wang, Ruichen Zhang, Changyuan Zhao, Yinqiu Liu, Dusit Niyato, Liang Yu, Haibo Zhou, Dong In Kim
Main category: cs.CV
TL;DR: BHU framework enables communication-efficient multi-UAV cooperative perception using Top-K pixel selection, MU-MIMO transmission, and Swin-large-based BEV feature fusion optimized by diffusion-based DRL.
Details
Motivation: Multi-UAV cooperative perception faces challenges with massive visual data causing communication latency and resource inefficiency; need for solutions that reduce overhead while maintaining perception performance.
Method: Top-K selection for informative pixel sparsification, MU-MIMO transmission to ground server, Swin-large-based MaskDINO encoder for BEV feature extraction and fusion, diffusion model-based DRL for joint optimization of UAV selection, sparsification ratios, and precoding matrices.
Result: Improves perception performance by over 5% while reducing communication overhead by 85% compared to traditional CNN-based BEV fusion baselines on Air-Co-Pred dataset.
Conclusion: BHU framework provides effective solution for multi-UAV cooperative perception under resource-constrained wireless environments by balancing communication efficiency and perception utility.
Abstract: Multi-uncrewed aerial vehicle (UAV) cooperative perception has emerged as a promising paradigm for diverse low-altitude economy applications, where complementary multi-view observations are leveraged to enhance perception performance via wireless communications. However, the massive visual data generated by multiple UAVs poses significant challenges in terms of communication latency and resource efficiency. To address these challenges, this paper proposes a communication-efficient cooperative perception framework, termed Base-Station-Helped UAV (BHU), which reduces communication overhead while enhancing perception performance. Specifically, we employ a Top-K selection mechanism to identify the most informative pixels from UAV-captured RGB images, enabling sparsified visual transmission with reduced data volume and latency. The sparsified images are transmitted to a ground server via multi-user MIMO (MU-MIMO), where a Swin-large-based MaskDINO encoder extracts bird’s-eye-view (BEV) features and performs cooperative feature fusion for ground vehicle perception. Furthermore, we develop a diffusion model-based deep reinforcement learning (DRL) algorithm to jointly select cooperative UAVs, sparsification ratios, and precoding matrices, achieving a balance between communication efficiency and perception utility. Simulation results on the Air-Co-Pred dataset demonstrate that, compared with traditional CNN-based BEV fusion baselines, the proposed BHU framework improves perception performance by over 5% while reducing communication overhead by 85%, providing an effective solution for multi-UAV cooperative perception under resource-constrained wireless environments.
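The Top-K sparsification step can be sketched as follows, using a simple gradient-magnitude saliency as a stand-in for whatever pixel-scoring the BHU framework actually employs; the keep ratio and scoring rule below are assumptions.

```python
# Sketch: keep only the K highest-scoring pixels of an RGB frame and zero
# the rest before transmission.
import torch

def topk_sparsify(img: torch.Tensor, keep_ratio: float = 0.15) -> torch.Tensor:
    # img: (3, H, W). Score pixels by luminance gradient magnitude.
    gray = img.mean(dim=0)
    gy = torch.diff(gray, dim=0, prepend=gray[:1])
    gx = torch.diff(gray, dim=1, prepend=gray[:, :1])
    score = (gx ** 2 + gy ** 2).flatten()
    k = max(1, int(keep_ratio * score.numel()))
    mask = torch.zeros_like(score)
    mask[torch.topk(score, k).indices] = 1.0
    return img * mask.view(1, *gray.shape)     # sparsified image to transmit

img = torch.rand(3, 64, 64)
sparse = topk_sparsify(img)
print(float((sparse.abs().sum(dim=0) > 0).float().mean()))  # ≈ keep_ratio
```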
[112] PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification
Qiuming Luo, Yuebing Li, Feng Li, Chang Kong
Main category: cs.CV
TL;DR: PAND is a two-stage knowledge distillation framework for fine-grained visual classification that uses prompt-aware semantic calibration and neighborhood-aware structural distillation to transfer knowledge from large vision-language models to lightweight networks.
Details
Motivation: Current knowledge distillation methods for fine-grained visual classification rely on fixed prompts and global alignment, which limits their effectiveness in transferring nuanced visual knowledge from large vision-language models to lightweight networks.
Method: Two-stage framework: 1) Prompt-Aware Semantic Calibration generates adaptive semantic anchors, 2) Neighborhood-aware structural distillation constrains the student’s local decision structure to better capture fine-grained visual relationships.
Result: PAND outperforms state-of-the-art methods on four FGVC benchmarks. ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing VL2Lite baseline by 3.4%.
Conclusion: PAND effectively addresses limitations of fixed prompts and global alignment in knowledge distillation for fine-grained visual classification, demonstrating superior performance through adaptive semantic calibration and structural transfer.
Abstract: Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt-Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood-aware structural distillation strategy to constrain the student’s local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at https://github.com/LLLVTA/PAND.
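The paper does not spell out its distillation loss here, but a common way to realize neighborhood-aware structural distillation is to match the student's similarity distribution over each sample's teacher-defined nearest neighbors. The sketch below shows that generic formulation, with all hyperparameters assumed; it is not PAND's exact loss.
```python
import torch
import torch.nn.functional as F

def neighborhood_distill_loss(t_feat, s_feat, k=8, tau=0.1):
    """Generic neighborhood-structure distillation (assumed formulation).

    Matches the student's similarity distribution over each sample's
    k nearest neighbors, with neighborhoods chosen by the teacher.
    t_feat, s_feat: (B, D) teacher / student embeddings, B > k.
    """
    t = F.normalize(t_feat.detach(), dim=1)   # teacher gives no gradient
    s = F.normalize(s_feat, dim=1)
    t_sim = t @ t.t()                         # (B, B) teacher similarities
    s_sim = s @ s.t()
    t_sim.fill_diagonal_(-float("inf"))       # exclude self from neighborhoods
    nbr = t_sim.topk(k, dim=1).indices        # (B, k) teacher-chosen neighbors
    t_local = F.softmax(t_sim.gather(1, nbr) / tau, dim=1)
    s_local = F.log_softmax(s_sim.gather(1, nbr) / tau, dim=1)
    # KL(teacher || student) over each local neighborhood
    return F.kl_div(s_local, t_local, reduction="batchmean")
```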
[113] Facial beauty prediction fusing transfer learning and broad learning system
Junying Gan, Xiaoshan Xie, Yikui Zhai, Guohui He, Chaoyun Mai, Heng Luo
Main category: cs.CV
TL;DR: Transfer Learning fused with Broad Learning System (BLS) for Facial Beauty Prediction using EfficientNets for feature extraction, achieving improved accuracy over previous methods.
Details
Motivation: Facial beauty prediction faces challenges due to data scarcity, overfitting risks, facial appearance variability, and complexity of human perception. Need methods that reduce data dependence and enable quick model building.
Method: Two approaches: 1) E-BLS: Uses EfficientNets (CNNs with transfer learning) as feature extractor, transfers features to BLS for prediction. 2) ER-BLS: Adds connection layer between feature extractor and BLS to improve integration.
Result: Both E-BLS and ER-BLS improved facial beauty prediction accuracy compared to previous BLS and CNN methods, demonstrating effectiveness and superiority of the proposed approach.
Conclusion: The fusion of transfer learning with BLS provides an effective solution for facial beauty prediction that reduces data dependence, avoids overfitting, and enables fast model building, with potential applications in pattern recognition, object detection, and image classification.
Abstract: Facial beauty prediction (FBP) is an important and challenging problem in the fields of computer vision and machine learning. Not only is it prone to overfitting due to the lack of large-scale, effective data, but it is also difficult to quickly build robust and effective facial beauty evaluation models because of the variability of facial appearance and the complexity of human perception. Transfer learning can reduce the dependence on large amounts of data and avoid overfitting, while the broad learning system (BLS) can quickly complete model building and training. For this purpose, transfer learning is fused with BLS for FBP in this paper. Firstly, a feature extractor is constructed from CNN models based on transfer learning (EfficientNets in this paper), and the fused facial-beauty features it extracts are transferred to BLS for FBP; this model is called E-BLS. Secondly, on the basis of E-BLS, a connection layer is designed to connect the feature extractor and BLS, yielding ER-BLS. Finally, experimental results show that, compared with previous BLS and CNN methods, the accuracy of FBP is improved by both E-BLS and ER-BLS, demonstrating the effectiveness and superiority of the presented method, which can also be widely applied in pattern recognition, object detection, and image classification.
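As a rough illustration of the E-BLS idea (CNN features feeding a broad-learning readout), here is a textbook-style BLS sketch in NumPy: pre-extracted EfficientNet embeddings act as feature nodes, a random nonlinear expansion provides enhancement nodes, and the output weights are solved in closed form by ridge regression. This is a generic BLS, not the paper's exact E-BLS or ER-BLS.
```python
import numpy as np

def bls_readout(features, labels, n_enhance=512, reg=1e-3, rng=None):
    """Broad-learning-style readout on pre-extracted CNN features.

    features: (N, D) embeddings (e.g., from an EfficientNet backbone);
    labels: (N, C) one-hot targets. Returns the random enhancement
    weights and the closed-form ridge-regression output weights.
    """
    rng = rng or np.random.default_rng(0)
    W = rng.standard_normal((features.shape[1], n_enhance))
    enhance = np.tanh(features @ W)                  # enhancement nodes
    A = np.hstack([features, enhance])               # broad layer
    # Ridge regression: W_out = (A^T A + reg*I)^-1 A^T Y
    W_out = np.linalg.solve(A.T @ A + reg * np.eye(A.shape[1]), A.T @ labels)
    return W, W_out

def bls_predict(features, W, W_out):
    A = np.hstack([features, np.tanh(features @ W)])
    return A @ W_out
```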
[114] Script-to-Slide Grounding: Grounding Script Sentences to Slide Objects for Automatic Instructional Video Generation
Rena Suzuki, Masato Kikuchi, Tadachika Ozono
Main category: cs.CV
TL;DR: Paper proposes Script-to-Slide Grounding (S2SG) task to automatically ground script sentences to slide objects for automated slide-based video generation, with initial Text-S2SG method using LLMs achieving high performance.
Details
Motivation: Slide-based videos with visual effects are widely used in education/research but require labor-intensive manual editing to ground spoken content to slide objects. The paper aims to automate this process by formalizing the implicit grounding task.
Method: Proposes Script-to-Slide Grounding (S2SG) task formulation. As initial step, introduces Text-S2SG method using large language models (LLMs) to ground script sentences to text objects in slides. Focuses on text objects as foundational approach.
Result: The Text-S2SG method achieves high performance with F1-score of 0.924, demonstrating effectiveness of LLM-based approach for grounding script sentences to slide text objects.
Conclusion: Formalizes previously implicit slide-based video editing process into computable S2SG task, paving way for automation. Initial Text-S2SG method shows promising results, establishing foundation for more comprehensive slide object grounding.
Abstract: While slide-based videos augmented with visual effects are widely utilized in education and research presentations, the video editing process – particularly applying visual effects to ground spoken content to slide objects – remains highly labor-intensive. This study aims to develop a system that automatically generates such instructional videos from slides and corresponding scripts. As a foundational step, this paper proposes and formulates Script-to-Slide Grounding (S2SG), defined as the task of grounding script sentences to their corresponding slide objects. Furthermore, as an initial step, we propose "Text-S2SG," a method that utilizes a large language model (LLM) to perform this grounding task for text objects. Our experiments demonstrate that the proposed method achieves high performance (F1-score: 0.924). The contribution of this work is the formalization of a previously implicit slide-based video editing process into a computable task, thereby paving the way for its automation.
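A minimal sketch of what an LLM-based grounding step like Text-S2SG might look like: slide text objects are serialized with ids, the model is asked which ids a script sentence refers to, and the JSON reply is parsed. The prompt format and the `call_llm` helper are hypothetical, not the paper's.
```python
import json

PROMPT = """You are given slide text objects and a script sentence.
Return a JSON list of the object ids the sentence refers to.

Slide objects:
{objects}

Script sentence:
"{sentence}"
"""

def ground_sentence(sentence, slide_objects, call_llm):
    """Ground one script sentence to slide text objects via an LLM.

    `call_llm` is a hypothetical text-completion function you supply
    (any chat/completions client would do); the prompt is an assumption.
    """
    objects = "\n".join(f"[{o['id']}] {o['text']}" for o in slide_objects)
    raw = call_llm(PROMPT.format(objects=objects, sentence=sentence))
    try:
        return json.loads(raw)       # e.g. ["obj_3", "obj_7"]
    except json.JSONDecodeError:
        return []                    # treat unparseable output as no match
```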
[115] Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz
Main category: cs.CV
TL;DR: AwaRes is a spatial-on-demand framework for vision-language models that uses tool-calling to retrieve high-resolution segments only when needed, balancing accuracy and efficiency.
Details
Motivation: Current VLMs face a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but are computationally expensive, while low-resolution inputs are efficient but may miss critical visual information like small text.
Method: Operates on low-resolution global view and uses tool-calling to retrieve high-resolution segments as needed. Constructs supervised data automatically using a judge to compare low- vs high-resolution answers and an oracle grounding model to localize evidence. Trains with cold-start SFT followed by multi-turn GRPO with composite reward combining semantic correctness and crop-cost penalties.
Result: The framework resolves the accuracy-efficiency trade-off by dynamically retrieving high-resolution segments only when necessary, improving computational efficiency while maintaining accuracy.
Conclusion: AwaRes provides an effective spatial-on-demand approach for VLMs that balances accuracy and efficiency through selective high-resolution retrieval based on query needs.
Abstract: Vision-language models (VLMs) typically process images at a native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs are efficient but can miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
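One simple way to realize the "discrete crop set" the paper maps evidence onto is a fixed grid over the image. The hypothetical sketch below returns the grid cells an evidence box overlaps, each of which would be a candidate high-resolution crop for the tool call; the grid size and overlap rule are assumptions.
```python
def box_to_crops(box, image_size, grid=(3, 3)):
    """Map an evidence bounding box to a discrete set of grid crops.

    box: (x0, y0, x1, y1) in pixels; image_size: (W, H).
    Returns the sorted (gx, gy) cells the box overlaps.
    """
    w, h = image_size
    cw, ch = w / grid[0], h / grid[1]
    x0, y0, x1, y1 = box
    cells = set()
    for gx in range(grid[0]):
        for gy in range(grid[1]):
            cx0, cy0 = gx * cw, gy * ch
            # box overlaps this cell iff intervals intersect on both axes
            if x0 < cx0 + cw and x1 > cx0 and y0 < cy0 + ch and y1 > cy0:
                cells.add((gx, gy))
    return sorted(cells)
```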
[116] AgriChat: A Multimodal Large Language Model for Agriculture Image Understanding
Abderrahmene Boudiaf, Irfan Hussain, Sajid Javed
Main category: cs.CV
TL;DR: Proposes V2VK pipeline to generate AgriMM benchmark with 607k VQAs across 3,000 agricultural classes, and AgriChat MLLM for agricultural vision-language tasks with verified knowledge.
Details
Motivation: Addresses the lack of large-scale agricultural datasets and domain expertise in current MLLMs for agriculture, aiming to eliminate biological hallucinations through verified knowledge.
Method: Develops Vision-to-Verified-Knowledge (V2VK) pipeline combining visual captioning with web-augmented scientific retrieval to autonomously generate AgriMM benchmark. Creates AgriChat MLLM using this verifiable data.
Result: AgriMM benchmark contains 3,000+ agricultural classes and 607k+ VQAs. AgriChat outperforms other open-source models across diverse agricultural tasks and benchmarks.
Conclusion: Preserving visual detail combined with web-verified knowledge enables robust and trustworthy agricultural AI. The approach demonstrates superior performance over existing models.
Abstract: The deployment of Multimodal Large Language Models (MLLMs) in agriculture is currently stalled by a critical trade-off: the existing literature lacks the large-scale agricultural datasets required for robust model development and evaluation, while current state-of-the-art models lack the verified domain expertise necessary to reason across diverse taxonomies. To address these challenges, we propose the Vision-to-Verified-Knowledge (V2VK) pipeline, a novel generative AI-driven annotation framework that integrates visual captioning with web-augmented scientific retrieval to autonomously generate the AgriMM benchmark, effectively eliminating biological hallucinations by grounding training data in verified phytopathological literature. The AgriMM benchmark contains over 3,000 agricultural classes and more than 607k VQAs spanning multiple tasks, including fine-grained plant species identification, plant disease symptom recognition, crop counting, and ripeness assessment. Leveraging this verifiable data, we present AgriChat, a specialized MLLM that presents broad knowledge across thousands of agricultural classes and provides detailed agricultural assessments with extensive explanations. Extensive evaluation across diverse tasks, datasets, and evaluation conditions reveals both the capabilities and limitations of current agricultural MLLMs, while demonstrating AgriChat’s superior performance over other open-source models on both internal and external benchmarks. The results validate that preserving visual detail combined with web-verified knowledge constitutes a reliable pathway toward robust and trustworthy agricultural AI. The code and dataset are publicly available at https://github.com/boudiafA/AgriChat.
[117] GenLie: A Global-Enhanced Lie Detection Network under Sparsity and Semantic Interference
Zongshun Zhang, Yao Liu, Qiao Liu, Xuefeng Peng, Peiyuan Jiang, Jiaye Yang, Daibing Yao, Wei Lin
Main category: cs.CV
TL;DR: GenLie: A video-based lie detection network that uses local feature modeling with global supervision to capture subtle deceptive cues while suppressing identity-related noise.
Details
Motivation: Video-based lie detection faces challenges in learning sparse yet discriminative representations because deceptive signals are subtle and short-lived, easily overwhelmed by redundant information, while individual and contextual variations introduce strong identity-related noise.
Method: Proposes GenLie, a Global-Enhanced Lie Detection Network that performs local feature modeling under global supervision. Sparse deceptive cues are captured at local level while global supervision ensures robust representations by suppressing identity-related noise.
Result: Experiments on three public datasets covering both high- and low-stakes scenarios show that GenLie consistently outperforms state-of-the-art methods.
Conclusion: GenLie effectively addresses the core challenge of video-based lie detection by combining local feature modeling with global supervision to capture subtle deceptive cues while suppressing noise.
Abstract: Video-based lie detection aims to identify deceptive behaviors from visual cues. Despite recent progress, its core challenge lies in learning sparse yet discriminative representations. Deceptive signals are typically subtle and short-lived, easily overwhelmed by redundant information, while individual and contextual variations introduce strong identity-related noise. To address this issue, we propose GenLie, a Global-Enhanced Lie Detection Network that performs local feature modeling under global supervision. Specifically, sparse and subtle deceptive cues are captured at the local level, while global supervision and optimization ensure robust and discriminative representations by suppressing identity-related noise. Experiments on three public datasets, covering both high- and low-stakes scenarios, show that GenLie consistently outperforms state-of-the-art methods. Source code is available at https://github.com/AliasDictusZ1/GenLie.
[118] A 3D Reconstruction Benchmark for Asset Inspection
James L. Gray, Nikolai Goncharov, Alexandre Cardaillac, Ryan Griffiths, Jack Naylor, Donald G. Dansereau
Main category: cs.CV
TL;DR: New dataset for 3D reconstruction in asset inspection with challenging conditions like reflections and transparency, showing current methods struggle with dense capture trajectories and complex surfaces.
Details
Motivation: Asset management requires accurate 3D models for maintenance and repair, but existing datasets lack examples of challenging conditions like reflections, transparency, and dense capture trajectories common in aerial surveys.
Method: Created a new dataset with ground truth depth maps, camera poses, and mesh models of three synthetic scenes with simulated inspection trajectories and varying surface conditions on non-Lambertian content.
Result: Evaluation of state-of-the-art reconstruction methods shows they struggle significantly with dense capture trajectories and complex surface conditions inherent to asset inspection.
Conclusion: Exposes a critical scalability gap and points toward new research directions for deployable 3D reconstruction in asset inspection applications.
Abstract: Asset management requires accurate 3D models to inform the maintenance, repair, and assessment of buildings, maritime vessels, and other key structures as they age. These downstream applications rely on high-fidelity models produced from aerial surveys in close proximity to the asset, enabling operators to locate and characterise deterioration or damage and plan repairs. Captured images typically have high overlap between adjacent camera poses, sufficient detail at millimetre scale, and challenging visual appearances such as reflections and transparency. However, existing 3D reconstruction datasets lack examples of these conditions, making it difficult to benchmark methods for this task. We present a new dataset with ground truth depth maps, camera poses, and mesh models of three synthetic scenes with simulated inspection trajectories and varying levels of surface condition on non-Lambertian scene content. We evaluate state-of-the-art reconstruction methods on this dataset. Our results demonstrate that current approaches struggle significantly with the dense capture trajectories and complex surface conditions inherent to this domain, exposing a critical scalability gap and pointing toward new research directions for deployable 3D reconstruction in asset inspection. Project page: https://roboticimaging.org/Projects/asset-inspection-dataset/
[119] TDMM-LM: Bridging Facial Understanding and Animation via Language Models
Luchuan Song, Pinxin Liu, Haiyang Liu, Zhenchao Jin, Yolo Yunlong Tang, Zichong Xu, Susan Liang, Jing Bi, Jason J Corso, Chenliang Xu
Main category: cs.CV
TL;DR: This paper addresses text-guided facial animation by creating a synthetic dataset using foundation models and training language models for bidirectional facial motion understanding and generation.
Details
Motivation: Facial animation lags behind body animation due to scarcity of well-annotated text-paired facial datasets. The authors aim to close this gap by leveraging foundation models to create synthetic data and enable bidirectional facial motion understanding.
Method: 1) Generate synthetic facial video corpus using foundation models with prompts covering emotions and head motions, 2) Fit 3D facial parameters to create prompt-parameter pairs, 3) Train language models for two tasks: Motion2Language (describing facial motion) and Language2Motion (generating facial parameters from text).
Result: Created ~80 hours of facial videos with 3D parameter annotations. Language models successfully interpret and synthesize facial motion with strong generalization. First work to cast facial-parameter modeling as a language problem.
Conclusion: This approach establishes a unified framework for text-conditioned facial animation and motion understanding, demonstrating language models’ capability for bidirectional facial motion tasks.
Abstract: Text-guided human body animation has advanced rapidly, yet facial animation lags due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design a prompt suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt and parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that in this setting language models can both interpret and synthesize facial motion with strong generalization. To the best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.
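The Language2Motion path relies on quantized motion tokens. As a generic stand-in (the paper's tokenizer is not described here), the sketch below quantizes continuous per-frame facial parameters by nearest-codebook lookup, which is the basic operation behind any vector-quantized token scheme.
```python
import torch

def quantize_motion(params, codebook):
    """Turn continuous facial-parameter frames into discrete motion tokens.

    params: (T, D) per-frame 3D facial parameters; codebook: (K, D).
    Returns (T,) integer token ids and the (T, D) dequantized sequence.
    """
    d = torch.cdist(params, codebook)      # (T, K) pairwise distances
    tokens = d.argmin(dim=1)               # nearest code per frame
    return tokens, codebook[tokens]
```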
[120] Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion
Aislan Gabriel O. Souza, Agostinho Freire, Leandro Honorato Silva, Igor Lucas B. da Silva, João Vinícius R. de Andrade, Gabriel C. de Albuquerque, Lucas Matheus da S. Oliveira, Mário Stela Guerra, Luciana Machado
Main category: cs.CV
TL;DR: A multimodal fusion approach for ambivalence/hesitancy video recognition using divergence-based fusion to measure cross-modal conflict between visual (action units), audio, and textual features, achieving significant improvement over baseline on the BAH dataset.
Details
Motivation: To address the Ambivalence/Hesitancy Video Recognition Challenge by developing a method that can effectively capture the incongruence between different modalities (visual, audio, text) which characterizes ambivalent or hesitant behavior in videos.
Method: Uses Py-Feat for visual Action Units (AUs), Wav2Vec 2.0 for audio, and BERT for text. Each modality is processed by BiLSTM with attention pooling and projected into shared embedding space. Fusion module computes pairwise absolute differences between modality embeddings to explicitly measure cross-modal conflict.
Result: Achieves Macro F1 of 0.6808 on validation test set of BAH dataset, significantly outperforming challenge baseline of 0.2827. Statistical analysis across 1,132 videos confirms temporal variability of AUs is the dominant visual discriminator of A/H.
Conclusion: Divergence-based multimodal fusion effectively captures cross-modal conflict for ambivalence/hesitancy recognition, with visual action unit temporal variability being the most important feature. The approach shows strong performance on the challenge dataset.
Abstract: We address the Ambivalence/Hesitancy (A/H) Video Recognition Challenge at the 10th ABAW Competition (CVPR 2026). We propose a divergence-based multimodal fusion that explicitly measures cross-modal conflict between visual, audio, and textual channels. Visual features are encoded as Action Units (AUs) extracted via Py-Feat, audio via Wav2Vec 2.0, and text via BERT. Each modality is processed by a BiLSTM with attention pooling and projected into a shared embedding space. The fusion module computes pairwise absolute differences between modality embeddings, directly capturing the incongruence that characterizes A/H. On the BAH dataset, our approach achieves a Macro F1 of 0.6808 on the validation test set, outperforming the challenge baseline of 0.2827. Statistical analysis across 1,132 videos confirms that temporal variability of AUs is the dominant visual discriminator of A/H.
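The fusion module is explicit enough to sketch directly: each modality embedding is projected into a shared space, and the classifier sees the pairwise absolute differences that encode cross-modal conflict. Whether the raw embeddings are concatenated alongside the differences, and all layer sizes, are assumptions.
```python
import torch
import torch.nn as nn

class DivergenceFusion(nn.Module):
    """Divergence-based fusion sketch: pairwise |a - b| between modality
    embeddings makes cross-modal conflict explicit to the classifier.
    """
    def __init__(self, dim=256, n_classes=2):
        super().__init__()
        self.head = nn.Linear(6 * dim, n_classes)   # 3 embeddings + 3 diffs

    def forward(self, v, a, t):
        # v, a, t: (B, dim) visual (AU), audio, and text embeddings,
        # already projected into a shared space by per-modality encoders.
        diffs = [(v - a).abs(), (v - t).abs(), (a - t).abs()]
        return self.head(torch.cat([v, a, t, *diffs], dim=1))
```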
[121] KGS-GCN: Enhancing Sparse Skeleton Sensing via Kinematics-Driven Gaussian Splatting and Probabilistic Topology for Action Recognition
Yuhan Chen, Yicui Shi, Guofa Li, Liping Zhang, Jie Li, Jiaxin Gao, Wenbo Chu
Main category: cs.CV
TL;DR: KGS-GCN integrates kinematics-driven Gaussian splatting with probabilistic topology to enhance skeleton-based action recognition from sparse sensor data by creating continuous generative representations and adaptive graph structures.
Details
Motivation: Current sensor devices generate sparse skeleton data that loses fine-grained spatiotemporal details during dynamic movements, and rigid predefined sensor topologies hinder modeling of long-range dependencies in action recognition.
Method: Proposes KGS-GCN with two key components: 1) kinematics-driven Gaussian splatting that uses joint velocity vectors to create anisotropic covariance matrices, rendering sparse skeletons into multi-view continuous heatmaps; 2) probabilistic topology construction using Bhattacharyya distance between joint Gaussian distributions to generate adaptive adjacency matrices; plus visual context gating to modulate GCN backbone.
Result: Empirical results show KGS-GCN significantly enhances modeling of complex spatiotemporal dynamics and offers robust solution for processing low-fidelity sensor data, improving perceptual reliability in real-world sensing applications.
Conclusion: The framework addresses limitations of sparse inputs and rigid topologies in skeleton-based action recognition, establishing practical pathway for improving sensor data processing through continuous generative representations and adaptive graph structures.
Abstract: Skeleton-based action recognition is widely utilized in sensor systems including human-computer interaction and intelligent surveillance. Nevertheless, current sensor devices typically generate sparse skeleton data as discrete coordinates, which inevitably discards fine-grained spatiotemporal details during highly dynamic movements. Moreover, the rigid constraints of predefined physical sensor topologies hinder the modeling of latent long-range dependencies. To overcome these limitations, we propose KGS-GCN, a graph convolutional network that integrates kinematics-driven Gaussian splatting with probabilistic topology. Our framework explicitly addresses the challenges of sensor data sparsity and topological rigidity by transforming discrete joints into continuous generative representations. Firstly, a kinematics-driven Gaussian splatting module is designed to dynamically construct anisotropic covariance matrices using instantaneous joint velocity vectors. This module enhances visual representation by rendering sparse skeleton sequences into multi-view continuous heatmaps rich in spatiotemporal semantics. Secondly, to transcend the limitations of fixed physical connections, a probabilistic topology construction method is proposed. This approach generates an adaptive prior adjacency matrix by quantifying statistical correlations via the Bhattacharyya distance between joint Gaussian distributions. Ultimately, the GCN backbone is adaptively modulated by the rendered visual features via a visual context gating mechanism. Empirical results demonstrate that KGS-GCN significantly enhances the modeling of complex spatiotemporal dynamics. By addressing the inherent limitations of sparse inputs, our framework offers a robust solution for processing low-fidelity sensor data. This approach establishes a practical pathway for improving perceptual reliability in real-world sensing applications.
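The probabilistic topology rests on the standard Bhattacharyya distance between Gaussians. The sketch below computes that distance and maps pairwise distances between joint distributions to an adjacency matrix with an exponential kernel; the kernel and normalization are assumptions, only the distance formula is standard.
```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between N(mu1, cov1) and N(mu2, cov2):
    D_B = 1/8 (mu1-mu2)^T S^-1 (mu1-mu2)
        + 1/2 ln( det(S) / sqrt(det(cov1) det(cov2)) ),  S = (cov1+cov2)/2.
    """
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    _, ld = np.linalg.slogdet(cov)
    _, ld1 = np.linalg.slogdet(cov1)
    _, ld2 = np.linalg.slogdet(cov2)
    return term1 + 0.5 * (ld - 0.5 * (ld1 + ld2))

def gaussian_adjacency(mus, covs, sigma=1.0):
    """Adjacency prior from pairwise Bhattacharyya distances between
    per-joint Gaussians (exp(-d/sigma) mapping is an assumption)."""
    n = len(mus)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d = bhattacharyya(mus[i], covs[i], mus[j], covs[j])
            A[i, j] = np.exp(-d / sigma)
    return A
```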
[122] Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models
Yujia Yang, Yuanxiang Wang, Zhenyu Guan, Tiankun Yang, Chenxi Bao, Haopeng Jin, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Haijin Liang, Jin Ma, Xinming Wang, Ruiwen Tao, Hongzhu Yi
Main category: cs.CV
TL;DR: Omni IIE Bench is a human-annotated benchmark for evaluating instruction-based image editing models’ consistency across different semantic scales, revealing performance degradation when moving from low to high semantic complexity tasks.
Details
Motivation: Existing instruction-based image editing benchmarks use mixed evaluations that obscure critical failure modes, particularly inconsistent performance across tasks of varying semantic scales, which is crucial for professional applications.
Method: Introduces Omni IIE Bench with dual-track diagnostic design: (1) Single-turn Consistency with shared-context task pairs for attribute modification and entity replacement, and (2) Multi-turn Coordination with continuous dialogue tasks traversing semantic scales. Uses rigorous multi-stage human filtering with quality standards enforced by computer vision graduate students and industry relevance review by professional designers.
Result: Comprehensive evaluation of 8 mainstream IIE models reveals a prevalent performance gap: nearly all models show significant performance degradation when transitioning from low-semantic-scale to high-semantic-scale tasks.
Conclusion: Omni IIE Bench provides critical diagnostic tools and insights for developing next-generation, more reliable and stable instruction-based image editing models by quantifying consistency issues across semantic scales.
Abstract: While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This paradigm obscures a critical failure mode crucial in professional applications: the inconsistent performance of models across tasks of varying semantic scales. To address this gap, we introduce Omni IIE Bench, a high-quality, human-annotated benchmark specifically designed to diagnose the editing consistency of IIE models in practical application scenarios. Omni IIE Bench features an innovative dual-track diagnostic design: (1) Single-turn Consistency, comprising shared-context task pairs of attribute modification and entity replacement; and (2) Multi-turn Coordination, involving continuous dialogue tasks that traverse semantic scales. The benchmark is constructed via an exceptionally rigorous multi-stage human filtering process, incorporating a quality standard enforced by computer vision graduate students and an industry relevance review conducted by professional designers. We perform a comprehensive evaluation of 8 mainstream IIE models using Omni IIE Bench. Our analysis quantifies, for the first time, a prevalent performance gap: nearly all models exhibit a significant performance degradation when transitioning from low-semantic-scale to high-semantic-scale tasks. Omni IIE Bench provides critical diagnostic tools and insights for the development of next-generation, more reliable, and stable IIE models.
[123] Joint Optimization of Storage and Loading for High-Performance 3D Point Cloud Data Processing
Ke Wang, Yanfei Cao, Xiangzhi Tao, Naijie Gu, Jun Yu, Zhengdong Wang, Shouyang Dong, Fan Yu, Cong Wang, Yang Luo
Main category: cs.CV
TL;DR: Proposes .PcRecord format and high-performance pipeline for efficient large-scale point cloud data storage and processing, achieving significant speedups across multiple datasets.
Details
Motivation: Large-scale 3D point cloud data presents challenges in loading, processing, and storage due to data volume, complexity, and diverse formats (PLY, XYZ, BIN). Existing solutions don't fully address time-consuming data preparation phases.
Method: Introduces .PcRecord unified data storage format to reduce storage and accelerate processing, combined with a high-performance multi-stage parallel pipeline architecture that optimizes computational resource usage.
Result: Achieves significant performance improvements: 6.61x (ModelNet40), 2.69x (S3DIS), 2.23x (ShapeNet), 3.09x (Kitti), 8.07x (SUN RGB-D), and 5.67x (ScanNet) with GPU, and 6.9x, 1.88x, 1.29x, 2.28x, 25.4x, and 19.3x with Ascend.
Conclusion: The proposed .PcRecord format and processing pipeline effectively address large-scale point cloud data handling challenges, significantly improving storage efficiency and processing speed across various datasets and hardware platforms.
Abstract: With the rapid development of computer vision and deep learning, significant advancements have been made in 3D vision, particularly in autonomous driving, robotic perception, and augmented reality. 3D point cloud data, as a crucial representation of 3D information, has gained widespread attention. However, the vast scale and complexity of point cloud data present significant challenges for loading and processing, and traditional algorithms struggle to handle large-scale datasets. The diversity of storage formats for point cloud datasets (e.g., PLY, XYZ, BIN) adds complexity to data handling and results in inefficiencies in data preparation. Although binary formats like BIN and NPY have been used to speed up data access, they still do not fully address the time-consuming data loading and processing phase. To overcome these challenges, we propose the .PcRecord format, a unified data storage solution designed to reduce storage occupation and accelerate the processing of point cloud data. We also introduce a high-performance data processing pipeline equipped with multiple modules. By leveraging a multi-stage parallel pipeline architecture, our system optimizes the use of computational resources, significantly improving processing speed and efficiency. This paper details the implementation of this system and demonstrates its effectiveness in addressing the challenges of handling large-scale point cloud datasets. On average, our system achieves performance improvements of 6.61x (ModelNet40), 2.69x (S3DIS), 2.23x (ShapeNet), 3.09x (Kitti), 8.07x (SUN RGB-D), and 5.67x (ScanNet) with GPU and 6.9x, 1.88x, 1.29x, 2.28x, 25.4x, and 19.3x with Ascend.
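The .PcRecord layout itself is not published in this abstract, but the general idea of a unified binary record format can be sketched: a fixed header plus a contiguous float block per cloud, so a loader can seek and bulk-read instead of parsing text. Everything about this layout (magic bytes, fields) is hypothetical.
```python
import struct
import numpy as np

# Hypothetical fixed header for a packed point-cloud record:
# 4-byte magic, point count, feature count, and a float scale factor.
HEADER = struct.Struct("<4sIIf")

def write_record(f, points: np.ndarray, scale: float = 1.0):
    """Append one point cloud as a single contiguous binary record."""
    pts = np.ascontiguousarray(points, dtype=np.float32)
    f.write(HEADER.pack(b"PCRD", pts.shape[0], pts.shape[1], scale))
    f.write(pts.tobytes())

def read_record(f):
    """Read one record back: header, then a bulk read of the float block."""
    magic, n, d, scale = HEADER.unpack(f.read(HEADER.size))
    assert magic == b"PCRD"
    pts = np.frombuffer(f.read(4 * n * d), dtype=np.float32).reshape(n, d)
    return pts, scale
```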
[124] EmergeNav: Structured Embodied Inference for Zero-Shot Vision-and-Language Navigation in Continuous Environments
Kun Luo, Xiaoguang Ma
Main category: cs.CV
TL;DR: EmergeNav: A zero-shot framework for vision-and-language navigation in continuous environments using structured embodied inference without task-specific training.
Details
Motivation: Vision-language models have semantic priors but struggle with stable long-horizon embodied execution in continuous navigation tasks. The bottleneck is not missing knowledge but missing execution structure for organizing instruction following, perceptual grounding, temporal progress, and stage verification.
Method: Proposes EmergeNav with: 1) Plan-Solve-Transition hierarchy for stage-structured execution, 2) GIPE for goal-conditioned perceptual extraction, 3) contrastive dual-memory reasoning for progress grounding, and 4) role-separated Dual-FOV sensing for time-aligned local control and boundary verification.
Result: Achieves strong zero-shot performance on VLN-CE: 30.00 SR with Qwen3-VL-8B and 37.00 SR with Qwen3-VL-32B, using only open-source VLM backbones without task-specific training, explicit maps, graph search, or waypoint predictors.
Conclusion: Explicit execution structure is key for turning VLM priors into stable embodied navigation behavior, enabling zero-shot performance in continuous environments through structured embodied inference.
Abstract: Zero-shot vision-and-language navigation in continuous environments (VLN-CE) remains challenging for modern vision-language models (VLMs). Although these models encode useful semantic priors, their open-ended reasoning does not directly translate into stable long-horizon embodied execution. We argue that the key bottleneck is not missing knowledge alone, but missing an execution structure for organizing instruction following, perceptual grounding, temporal progress, and stage verification. We propose EmergeNav, a zero-shot framework that formulates continuous VLN as structured embodied inference. EmergeNav combines a Plan–Solve–Transition hierarchy for stage-structured execution, GIPE for goal-conditioned perceptual extraction, contrastive dual-memory reasoning for progress grounding, and role-separated Dual-FOV sensing for time-aligned local control and boundary verification. On VLN-CE, EmergeNav achieves strong zero-shot performance using only open-source VLM backbones and no task-specific training, explicit maps, graph search, or waypoint predictors, reaching 30.00 SR with Qwen3-VL-8B and 37.00 SR with Qwen3-VL-32B. These results suggest that explicit execution structure is a key ingredient for turning VLM priors into stable embodied navigation behavior.
[125] Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
Dailan He, Guanlin Feng, Xingtong Ge, Yazhe Niu, Yi Zhang, Bingqi Ma, Guanglu Song, Yu Liu, Hongsheng Li
Main category: cs.CV
TL;DR: Neighbor GRPO: A novel alignment algorithm for flow matching models that bypasses SDE conversion by perturbing initial noise conditions and using distance-based optimization, preserving ODE efficiency while improving training.
Details
Motivation: GRPO shows promise for aligning image/video generative models with human preferences, but applying it to modern flow matching models is challenging due to their deterministic ODE sampling. Current SDE-based approaches suffer from inefficient credit assignment and incompatibility with high-order solvers.
Method: Reinterpret SDE-based GRPO from distance optimization perspective, revealing contrastive learning mechanism. Propose Neighbor GRPO that generates diverse candidate trajectories by perturbing initial noise conditions of ODE, then optimizes using softmax distance-based surrogate leaping policy. Introduce symmetric anchor sampling for efficiency and group-wise quasi-norm reweighting to address reward flattening.
Result: Neighbor GRPO significantly outperforms SDE-based counterparts in training cost, convergence speed, and generation quality. Preserves advantages of deterministic ODE sampling including efficiency and compatibility with high-order solvers.
Conclusion: Neighbor GRPO provides an effective alignment method for flow matching models that avoids SDE conversion issues, offering theoretical grounding in policy gradient optimization while maintaining computational efficiency of ODE sampling.
Abstract: Group Relative Policy Optimization (GRPO) has shown promise in aligning image and video generative models with human preferences. However, applying it to modern flow matching models is challenging because of its deterministic sampling paradigm. Current methods address this issue by converting Ordinary Differential Equations (ODEs) to Stochastic Differential Equations (SDEs), which introduce stochasticity. However, this SDE-based GRPO suffers from issues of inefficient credit assignment and incompatibility with high-order solvers for fewer-step sampling. In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. Neighbor GRPO generates a diverse set of candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model using a softmax distance-based surrogate leaping policy. We establish a theoretical connection between this distance-based objective and policy gradient optimization, rigorously integrating our approach into the GRPO framework. Our method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers. We further introduce symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening. Extensive experiments demonstrate that Neighbor GRPO significantly outperforms SDE-based counterparts in terms of training cost, convergence speed, and generation quality.
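A rough sketch of the surrogate-policy idea, under heavy assumptions: candidate generations from perturbed initial noises form a group, a "log-probability" for each candidate comes from a softmax over negative pairwise distances, and group-relative advantages weight those log-probs as in standard GRPO. The distance metric, temperature, and anchor choice are all assumptions, not the paper's formulation.
```python
import torch

def neighbor_logprobs(samples, anchors, tau=1.0):
    """Softmax distance-based surrogate log-probs for a candidate group.

    samples: (G, D) flattened generations from perturbed initial noises;
    anchors: (G, D) flattened reference generations for the same group.
    """
    d = torch.cdist(samples, anchors)       # (G, G) pairwise distances
    # log-prob that sample i lands on its own anchor i
    return torch.log_softmax(-d / tau, dim=1).diagonal()

def grpo_loss(logp, rewards):
    """Group-relative advantage weighting on the surrogate log-probs."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return -(adv * logp).mean()
```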
[126] PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models
Hisayuki Yokomizo, Taiki Miyanishi, Yan Gang, Shuhei Kurita, Nakamasa Inoue, Yusuke Iwasawa
Main category: cs.CV
TL;DR: PhysQuantAgent framework uses VLMs with visual prompting for real-world object mass estimation, evaluated on VisPhysQuant benchmark dataset with RGB-D videos and mass annotations.
Details
Motivation: Current Vision-Language Models lack reliable mass reasoning capabilities needed for robotic manipulation tasks like determining appropriate grasp force, and existing benchmarks don't evaluate physical quantity estimation under realistic sensing conditions.
Method: Proposes PhysQuantAgent framework with three visual prompting methods: object detection, scale estimation, and cross-sectional image generation to enhance VLM’s understanding of object size and internal structure for mass estimation. Uses VisPhysQuant benchmark dataset with RGB-D videos of real objects from multiple viewpoints and precise mass measurements.
Result: Visual prompting significantly improves mass estimation accuracy on real-world data, demonstrating efficacy of integrating spatial reasoning with VLM knowledge for physical inference.
Conclusion: The framework successfully enhances VLMs’ physical reasoning capabilities for mass estimation, which is crucial for robotic manipulation tasks requiring appropriate force application.
Abstract: Vision-Language Models (VLMs) are increasingly applied to robotic perception and manipulation, yet their ability to infer physical properties required for manipulation remains limited. In particular, estimating the mass of real-world objects is essential for determining appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, and most existing benchmarks do not explicitly evaluate physical quantity estimation under realistic sensing conditions. In this work, we propose PhysQuantAgent, a framework for real-world object mass estimation using VLMs, together with VisPhysQuant, a new benchmark dataset for evaluation. VisPhysQuant consists of RGB-D videos of real objects captured from multiple viewpoints, annotated with precise mass measurements. To improve estimation accuracy, we introduce three visual prompting methods that enhance the input image with object detection, scale estimation, and cross-sectional image generation to help the model comprehend the size and internal structure of the target object. Experiments show that visual prompting significantly improves mass estimation accuracy on real-world data, suggesting the efficacy of integrating spatial reasoning with VLM knowledge for physical inference.
[127] NutVLM: A Self-Adaptive Defense Framework against Full-Dimension Attacks for Vision Language Models in Autonomous Driving
Xiaoxu Peng, Dong Zhou, Jianwen Zhang, Guanghui Sun, Anh Tu Ngo, Anupam Chattopadhyay
Main category: cs.CV
TL;DR: NutVLM is a self-adaptive defense framework for Vision Language Models in autonomous driving that detects and mitigates adversarial threats through a unified detection-purification mechanism and expert-guided adversarial prompt tuning.
Details
Motivation: Vision Language Models in autonomous driving are vulnerable to adversarial threats ranging from physical patches to imperceptible perturbations, but existing defense methods fail to reconcile robustness with clean-sample performance.
Method: Uses NutNet++ as a sentinel for three-way classification (benign, local patches, global perturbations), purifies localized threats via grayscale masking, and applies Expert-guided Adversarial Prompt Tuning (EAPT) for global perturbations through gradient-based latent optimization and discrete projection to generate corrective driving prompts.
Result: Achieves 4.89% improvement in overall metrics (Accuracy, Language Score, GPT Score) on the Dolphins benchmark, validating it as a scalable security solution for intelligent transportation.
Conclusion: NutVLM provides a comprehensive self-adaptive defense framework that secures the entire perception-decision lifecycle in autonomous driving while maintaining performance on clean samples.
Abstract: Vision Language Models (VLMs) have advanced perception in autonomous driving (AD), but they remain vulnerable to adversarial threats. These risks range from localized physical patches to imperceptible global perturbations. Existing defense methods for VLMs remain limited and often fail to reconcile robustness with clean-sample performance. To bridge these gaps, we propose NutVLM, a comprehensive self-adaptive defense framework designed to secure the entire perception-decision lifecycle. Specifically, we first employ NutNet++ as a sentinel, which is a unified detection-purification mechanism. It identifies benign samples, local patches, and global perturbations through three-way classification. Subsequently, localized threats are purified via efficient grayscale masking, while global perturbations trigger Expert-guided Adversarial Prompt Tuning (EAPT). Instead of the costly parameter updates of full-model fine-tuning, EAPT generates “corrective driving prompts” via gradient-based latent optimization and discrete projection. These prompts refocus the VLM’s attention without requiring exhaustive full-model retraining. Evaluated on the Dolphins benchmark, our NutVLM yields a 4.89% improvement in overall metrics (e.g., Accuracy, Language Score, and GPT Score). These results validate NutVLM as a scalable security solution for intelligent transportation. Our code is available at https://github.com/PXX/NutVLM.
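The grayscale-masking purification step can be sketched plainly: inside the detected patch region, color is replaced with luminance so the adversarial pattern's color structure is destroyed while scene layout survives. Whether NutVLM uses luminance or a constant gray is an assumption.
```python
import numpy as np

def grayscale_mask(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Purify a detected adversarial patch by grayscaling its region.

    image: (H, W, 3) uint8 frame; mask: (H, W) bool, True = patch pixels.
    """
    out = image.copy()
    # ITU-R BT.601 luminance as the replacement gray value
    gray = (0.299 * image[..., 0] + 0.587 * image[..., 1]
            + 0.114 * image[..., 2]).astype(image.dtype)
    out[mask] = gray[mask, None]   # broadcast gray across RGB channels
    return out
```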
[128] Behavior-Centric Extraction of Scenarios from Highway Traffic Data and their Domain-Knowledge-Guided Clustering using CVQ-VAE
Niklas Roßberg, Sinan Hasirlioglu, Mohamed Essayed Bouzouraa, Wolfgang Utschick, Michael Botsch
Main category: cs.CV
TL;DR: Paper proposes standardized scenario extraction and domain-knowledge-guided clustering for autonomous driving validation using highway data recordings.
Details
Motivation: Current autonomous driving system (ADS) validation lacks standardized scenario extraction methods and interpretable clustering approaches that align with domain knowledge.
Method: Standardized scenario extraction based on the Scenario-as-Specification concept, plus a domain-knowledge-guided clustering process applied to the highD highway dataset.
Result: Demonstrates reliable scenario extraction and effective integration of domain knowledge into the clustering process on the highD dataset.
Conclusion: Methodology enables a more standardized process for deriving scenario categories from highway data, supporting more efficient autonomous vehicle validation.
Abstract: Approval of an ADS depends on evaluating its behavior within representative real-world traffic scenarios. A common way to obtain such scenarios is to extract them from real-world data recordings. These can then be grouped and serve as a basis on which the ADS is subsequently tested. This poses two central challenges: how scenarios are extracted and how they are grouped. Existing extraction methods rely on heterogeneous definitions, hindering scenario comparability. For the grouping of scenarios, rule-based or ML-based methods can be utilized. However, while modern ML-based approaches can handle the complexity of traffic scenarios, unlike rule-based approaches they lack interpretability and may not align with domain knowledge. This work contributes a standardized scenario extraction based on the Scenario-as-Specification concept, as well as a domain-knowledge-guided scenario clustering process. Experiments on the highD dataset demonstrate that scenarios can be extracted reliably and that domain knowledge can be effectively integrated into the clustering process. As a result, the proposed methodology supports a more standardized process for deriving scenario categories from highway data recordings and thus enables a more efficient validation process for automated vehicles.
[129] MSRAMIE: Multimodal Structured Reasoning Agent for Multi-instruction Image Editing
Zhaoyuan Qiu, Ken Chen, Xiangwei Wang, Yu Xia, Sachith Seneviratne, Saman Halgamuge
Main category: cs.CV
TL;DR: MSRAMIE is a training-free agent framework that uses Multimodal Large Language Models to handle complex multi-instruction image editing tasks by decomposing instructions into structured reasoning steps.
Details
Motivation: Existing instruction-based image editing models struggle with complex, multi-step instructions due to lack of training data with such annotations, and retraining is costly.
Method: Proposes MSRAMIE framework with MLLM-based Instructor and image editing Actor components, using Tree-of-States and Graph-of-References reasoning topology to decompose complex instructions into editing steps with state transitions and information aggregation.
Result: Improves instruction following by over 15% as complexity increases, increases probability of finishing all modifications in single run by over 100%, while preserving perceptual quality and visual consistency.
Conclusion: MSRAMIE provides effective training-free solution for complex multi-instruction image editing with interpretable reasoning pathways and strong performance gains.
Abstract: Existing instruction-based image editing models perform well with simple, single-step instructions but degrade in realistic scenarios that involve multiple, lengthy, and interdependent directives. A main cause is the scarcity of training data with complex multi-instruction annotations. However, it is costly to collect such data and retrain these models. To address this challenge, we propose MSRAMIE, a training-free agent framework built on a Multimodal Large Language Model (MLLM). MSRAMIE takes existing editing models as plug-in components and handles multi-instruction tasks via structured multimodal reasoning. It orchestrates iterative interactions between an MLLM-based Instructor and an image editing Actor, introducing a novel reasoning topology that comprises the proposed Tree-of-States and Graph-of-References. During inference, complex instructions are decomposed into multiple editing steps that enable state transitions, cross-step information aggregation, and original-input recall, allowing systematic exploration of the image editing space and flexible progressive output refinement. The visualizable inference topology further provides interpretable and controllable decision pathways. Experiments show that as instruction complexity increases, MSRAMIE improves instruction following by over 15% and increases the probability of finishing all modifications in a single run by over 100%, while preserving perceptual quality and maintaining visual consistency.
[130] Continual Multimodal Egocentric Activity Recognition via Modality-Aware Novel Detection
Wonseon Lim, Hyejeong Im, Dae-Won Kim
Main category: cs.CV
TL;DR: MAND: Modality-aware framework for multimodal egocentric open-world continual learning that improves novelty detection and classification by better utilizing complementary modality cues.
Details
Motivation: Current multimodal egocentric activity recognition systems underutilize complementary evidence from individual modalities (especially IMU), relying too heavily on RGB-dominated logits, which worsens over time due to catastrophic forgetting in open-world continual learning scenarios.
Method: Proposes MAND with two key components: 1) Modality-aware Adaptive Scoring (MoAS) at inference estimates modality reliability from energy scores and adaptively integrates modality logits for better novelty detection; 2) Modality-wise Representation Stabilization Training (MoRST) during training preserves modality-specific discriminability using auxiliary heads and modality-wise logit distillation.
Result: Experiments on a public multimodal egocentric benchmark show MAND improves novel activity detection AUC by up to 10% and known-class classification accuracy by up to 2.8% over state-of-the-art baselines.
Conclusion: MAND effectively addresses modality imbalance in multimodal egocentric open-world continual learning, demonstrating that modality-aware approaches can significantly improve both novelty detection and classification performance.
Abstract: Multimodal egocentric activity recognition integrates visual and inertial cues for robust first-person behavior understanding. However, deploying such systems in open-world environments requires detecting novel activities while continuously learning from non-stationary streams. Existing methods rely on the main logits for novelty scoring, without fully exploiting the complementary evidence available from individual modalities. Because these logits are often dominated by RGB, cues from other modalities, particularly IMU, remain underutilized, and this imbalance worsens over time under catastrophic forgetting. To address this, we propose MAND, a modality-aware framework for multimodal egocentric open-world continual learning. At inference, Modality-aware Adaptive Scoring (MoAS) estimates sample-wise modality reliability from energy scores and adaptively integrates modality logits to better exploit complementary modality cues for novelty detection. During training, Modality-wise Representation Stabilization Training (MoRST) preserves modality-specific discriminability across tasks via auxiliary heads and modality-wise logit distillation. Experiments on a public multimodal egocentric benchmark show that MAND improves novel activity detection AUC by up to 10% and known-class classification accuracy by up to 2.8% over state-of-the-art baselines.
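The energy score at the heart of MoAS is standard: E(x) = -T * logsumexp(logits / T), with lower energy indicating higher confidence. The sketch below computes per-modality energies and uses them as softmax reliability weights for logit fusion; the exact weighting rule in the paper is not reproduced here.
```python
import torch

def energy_score(logits, T=1.0):
    """Energy score E(x) = -T * logsumexp(logits / T); lower = more confident."""
    return -T * torch.logsumexp(logits / T, dim=-1)

def fuse_modalities(rgb_logits, imu_logits, T=1.0):
    """Reliability-weighted logit fusion in the spirit of MoAS (assumed rule).

    rgb_logits, imu_logits: (B, C) per-modality class logits.
    Returns fused logits and a per-sample novelty score to threshold.
    """
    e = torch.stack([energy_score(rgb_logits, T),
                     energy_score(imu_logits, T)], dim=-1)   # (B, 2)
    w = torch.softmax(-e, dim=-1)            # lower energy -> larger weight
    fused = w[..., 0:1] * rgb_logits + w[..., 1:2] * imu_logits
    novelty = energy_score(fused, T)         # high energy suggests a novel class
    return fused, novelty
```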
[131] Are a Thousand Words Better Than a Single Picture? Beyond Images – A Framework for Multi-Modal Knowledge Graph Dataset Enrichment
Pengyu Zhang, Klim Zaporojets, Jie Liu, Jia-Hong Huang, Paul Groth
Main category: cs.CV
TL;DR: Automatic pipeline for enriching Multi-Modal Knowledge Graphs by retrieving additional entity images, converting all visuals to textual descriptions, and fusing them with LLMs to generate concise entity summaries that improve KG completion performance.
Details
Motivation: Current MMKGs suffer from limited image coverage and struggle with ambiguous visuals like logos and symbols that are hard to curate manually but contain relevant semantic information.
Method: Three-stage pipeline: (1) large-scale retrieval of entity-related images, (2) conversion of all visual inputs into textual descriptions using vision-language models, (3) fusion of multi-source descriptions using LLMs to generate concise entity-aligned summaries that can be used with existing MMKG models.
Result: Consistent gains up to 7% Hits@1 across three public MMKG datasets and multiple baseline models. Particularly large improvements on challenging ambiguous entities (201.35% MRR and 333.33% Hits@1). Also released a Text-Image Consistency Check Interface for quality auditing.
Conclusion: Scaling image coverage and converting ambiguous visuals into text is an effective approach for improving MMKG completion, with the pipeline providing practical benefits without requiring architectural changes to existing models.
Abstract: Multi-Modal Knowledge Graphs (MMKGs) benefit from visual information, yet large-scale image collection is hard to curate and often excludes ambiguous but relevant visuals (e.g., logos, symbols, abstract scenes). We present Beyond Images, an automatic data-centric enrichment pipeline with optional human auditing. This pipeline operates in three stages: (1) large-scale retrieval of additional entity-related images, (2) conversion of all visual inputs into textual descriptions to ensure that ambiguous images contribute usable semantics rather than noise, and (3) fusion of multi-source descriptions using a large language model (LLM) to generate concise, entity-aligned summaries. These summaries replace or augment the text modality in standard MMKG models without changing their architectures or loss functions. Across three public MMKG datasets and multiple baseline models, we observe consistent gains (up to 7% Hits@1 overall). Furthermore, on a challenging subset of entities with visually ambiguous logos and symbols, converting images into text yields large improvements (201.35% MRR and 333.33% Hits@1). Additionally, we release a lightweight Text-Image Consistency Check Interface for optional targeted audits, improving description quality and dataset reliability. Our results show that scaling image coverage and converting ambiguous visuals into text is a practical path to stronger MMKG completion. Code, datasets, and supplementary materials are available at https://github.com/pengyu-zhang/Beyond-Images.
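For reference, the Hits@K and MRR numbers quoted above follow the standard knowledge-graph-completion definitions, computable directly from the 1-based rank of the correct entity per query:
```python
import numpy as np

def ranking_metrics(ranks):
    """Standard KG-completion metrics from 1-based ranks of the correct
    entity among all candidates (as used for the Hits@1 / MRR numbers)."""
    ranks = np.asarray(ranks, dtype=float)
    return {
        "MRR": float((1.0 / ranks).mean()),       # mean reciprocal rank
        "Hits@1": float((ranks <= 1).mean()),
        "Hits@10": float((ranks <= 10).mean()),
    }

# e.g. ranking_metrics([1, 3, 2, 15])
# -> {'MRR': 0.475, 'Hits@1': 0.25, 'Hits@10': 0.75}
```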
[132] Empirical Recipes for Efficient and Compact Vision-Language Models
Jiabo Huang, Zhizhong Li, Sina Sajadmanesh, Weiming Zhuang, Lingjuan Lyu
Main category: cs.CV
TL;DR: Compact VLMs optimized for efficiency with 53-93% latency reduction and new ArgusVLM family with structured perception outputs.
Details
Motivation: Existing compact vision-language models (VLMs) don't achieve expected inference speedups despite smaller parameter counts, creating deployment challenges in resource-constrained settings that require low latency and high throughput.
Method: Conducted empirical end-to-end efficiency analysis and systematic inference profiling to identify bottlenecks, then developed optimization recipes tailored to compact VLMs. Also extended compact VLMs with structured perception outputs to create ArgusVLM family.
Result: Optimization techniques reduced time to first token (TTFT) by 53% on InternVL3-2B and 93% on SmolVLM-256M. ArgusVLM achieves strong performance across diverse benchmarks while maintaining compact and efficient design.
Conclusion: The optimization recipes are broadly applicable across VLM architectures and serving frameworks, providing practical guidance for efficient VLM systems. ArgusVLM demonstrates that compact VLMs can achieve strong performance with structured perception capabilities.
Abstract: Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency while preserving accuracy. These techniques cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. Our recipes are broadly applicable across both VLM architectures and common serving frameworks, providing practical guidance for building efficient VLM systems. Beyond efficiency, we study how to extend compact VLMs with structured perception outputs and introduce the resulting model family, ArgusVLM. Across diverse benchmarks, ArgusVLM achieves strong performance while maintaining a compact and efficient design.
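Since time to first token is the headline metric here, a tiny measurement harness may help fix ideas; `generate_stream` is an assumed streaming-generation callable, not part of the paper's tooling.

```python
import time

def time_to_first_token(generate_stream, prompt: str) -> float:
    """Measure TTFT for any streaming generator (a sketch; `generate_stream`
    is assumed to yield output tokens as they are produced)."""
    start = time.perf_counter()
    for _token in generate_stream(prompt):
        return time.perf_counter() - start  # elapsed time to the first token
    return float("inf")  # the model produced no tokens
```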
[133] HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin
Main category: cs.CV
TL;DR: HopChain is a framework for synthesizing multi-hop vision-language reasoning data to improve VLMs’ fine-grained reasoning capabilities through RLVR training, addressing compound errors in long CoT reasoning.
Details
Motivation: VLMs struggle with fine-grained vision-language reasoning, especially in long chain-of-thought reasoning where perception, reasoning, knowledge, and hallucination errors can compound across steps. Existing RLVR training data lacks complex reasoning chains that rely on visual evidence throughout.
Method: HopChain synthesizes multi-hop vision-language reasoning data where each query forms logically dependent chains of instance-grounded hops. Earlier hops establish instances, sets, or conditions needed for later hops, with final answers as specific numbers for verifiable rewards in RLVR training (see the sketch below).
Result: Adding HopChain’s multi-hop data to RLVR training improved 20 out of 24 benchmarks across STEM/Puzzle, General VQA, Text Recognition/Document Understanding, and Video Understanding. Multi-hop training significantly outperformed half-multi-hop and single-hop variants, with gains peaking at over 50 accuracy points in ultra-long-CoT reasoning.
Conclusion: HopChain is an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning in VLMs, demonstrating that full chained queries are crucial for addressing compound errors in long CoT reasoning.
Abstract: VLMs show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long CoT reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for RLVR does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare against RLVR on the original RLVR data alone across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized to target any specific benchmark, adding it improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. To demonstrate that full chained queries are important, we replace them with half-multi-hop or single-hop variants, reducing the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively. Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.
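Because HopChain constrains each final answer to a specific number, the RLVR reward can be checked mechanically; the sketch below shows one plausible verifier, not the authors' implementation.

```python
import re

def verifiable_reward(model_output: str, gold_answer: float,
                      tol: float = 1e-6) -> float:
    """Binary RLVR-style reward: 1.0 iff the last number in the model's
    output matches the gold numeric answer within tolerance."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0  # no numeric answer -> no reward
    return 1.0 if abs(float(numbers[-1]) - gold_answer) <= tol else 0.0
```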
[134] OpenQlaw: An Agentic AI Assistant for Analysis of 2D Quantum Materials
Sankalp Pandey, Xuan-Bac Nguyen, Hoang-Quan Nguyen, Tim Faltermeier, Nicholas Borys, Hugh Churchill, Khoa Luu
Main category: cs.CV
TL;DR: OpenQlaw is an agentic orchestration system for analyzing 2D quantum materials that decouples visual identification from reasoning using specialized MLLMs, enabling naturalistic interaction and persistent memory for lab applications.
Details
Motivation: Current domain-specific MLLMs for quantum material analysis produce verbose, step-by-step outputs optimized for cognitive transparency, which causes cognitive overload and lacks immediate utility for real-world researcher interaction. There's a need for more practical systems that can accelerate high-throughput device fabrication.
Method: Built on NanoBot (lightweight agentic framework) and QuPAINT (physics-aware multimodal platform), OpenQlaw uses a core LLM agent to orchestrate domain-expert MLLMs. It decouples visual identification from reasoning, parses spatial data from experts, and features persistent memory for physical scale ratios and sample preparation methods (see the sketch below).
Result: The system transforms isolated inferences into a context-aware assistant capable of dynamic query processing, scale-aware physical computation, isolated visual annotations, and naturalistic responses, making it accessible via various messaging channels for lab floor use.
Conclusion: OpenQlaw’s agentic architecture with expert orchestration enables practical quantum material analysis that accelerates high-throughput device fabrication by providing context-aware assistance with persistent memory and naturalistic interaction.
Abstract: The transition from optical identification of 2D quantum materials to practical device fabrication requires dynamic reasoning beyond detection accuracy. While recent domain-specific Multimodal Large Language Models (MLLMs) successfully ground visual features using physics-informed reasoning, their outputs are optimized for step-by-step cognitive transparency. This yields verbose candidate enumerations followed by dense reasoning that, while accurate, may induce cognitive overload and lack immediate utility for real-world interaction with researchers. To address this challenge, we introduce OpenQlaw, an agentic orchestration system for analyzing 2D materials. The architecture is built upon NanoBot, a lightweight agentic framework inspired by OpenClaw, and QuPAINT, one of the first Physics-Aware Instruction Multi-modal platforms for Quantum Material Discovery. This allows accessibility to the lab floor via a variety of messaging channels. OpenQlaw allows the core Large Language Model (LLM) agent to orchestrate a domain-expert MLLM, with QuPAINT as a specialized node, successfully decoupling visual identification from reasoning and deterministic image rendering. By parsing spatial data from the expert, the agent can dynamically process user queries, such as performing scale-aware physical computation or generating isolated visual annotations, and answer in a naturalistic manner. Crucially, the system features a persistent memory that enables the agent to save physical scale ratios (e.g., 1 pixel = 0.25 μm) for area computations and store sample preparation methods for efficacy comparison. The application of an agentic architecture, together with the extension that uses the core agent as an orchestrator for domain-specific experts, transforms isolated inferences into a context-aware assistant capable of accelerating high-throughput device fabrication.
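The abstract's example of stored scale ratios (1 pixel = 0.25 μm) suggests a simple pattern: persist the calibration once, then reuse it for area queries. A minimal sketch, with `AgentMemory` and `flake_area_um2` as hypothetical names:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Persistent agent memory in the spirit of OpenQlaw (hypothetical API):
    keeps the physical scale ratio for later scale-aware computation."""
    um_per_pixel: float | None = None
    notes: dict = field(default_factory=dict)  # e.g. sample prep methods

def flake_area_um2(mask_pixel_count: int, memory: AgentMemory) -> float:
    """Convert a segmented flake's pixel count into physical area; with
    0.25 um/pixel, each pixel covers 0.0625 um^2."""
    if memory.um_per_pixel is None:
        raise ValueError("scale ratio has not been calibrated yet")
    return mask_pixel_count * memory.um_per_pixel ** 2

mem = AgentMemory(um_per_pixel=0.25)  # "1 pixel = 0.25 um" from the abstract
print(flake_area_um2(16_000, mem))    # -> 1000.0 (um^2)
```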
[135] Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models
Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, Anyi Rao
Main category: cs.CV
TL;DR: Astrolabe: Efficient online RL framework for aligning distilled autoregressive video models with human preferences using forward-process fine-tuning and streaming training.
Details
Motivation: Distilled AR video models are efficient for streaming generation but often misalign with human visual preferences. Existing RL frameworks are inefficient for these architectures, requiring expensive re-distillation or computationally heavy reverse-process optimization.
Method: Introduces forward-process RL formulation with negative-aware fine-tuning that contrasts positive/negative samples at inference endpoints (see the sketch below). Uses streaming training with rolling KV-cache for long videos, applying RL updates to local clip windows while maintaining long-range coherence. Implements multi-reward objective with uncertainty-aware selective regularization and dynamic reference updates to prevent reward hacking.
Result: Method consistently enhances generation quality across multiple distilled AR video models, providing robust and scalable alignment solution.
Conclusion: Astrolabe offers an efficient online RL framework specifically tailored for distilled AR video models, overcoming previous bottlenecks and enabling scalable alignment with human preferences.
Abstract: Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.
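For intuition, negative-aware fine-tuning contrasts preferred and dispreferred samples directly; the loss below is a generic DPO-style rendering of that idea under stated assumptions, not Astrolabe's exact objective.

```python
import torch
import torch.nn.functional as F

def negative_aware_loss(logp_pos: torch.Tensor, logp_neg: torch.Tensor,
                        beta: float = 1.0) -> torch.Tensor:
    """Push up the likelihood of reward-preferred samples and push down
    dispreferred ones via a log-sigmoid margin (illustrative only)."""
    return -F.logsigmoid(beta * (logp_pos - logp_neg)).mean()
```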
[136] PaAgent: Portrait-Aware Image Restoration Agent via Subjective-Objective Reinforcement Learning
Yijian Wang, Qingsen Yan, Jiantao Zhou, Duwei Dai, Wei Dong
Main category: cs.CV
TL;DR: PaAgent is a portrait-aware image restoration agent that uses a self-evolving portrait bank and RAG to select optimal restoration tools, with reinforcement learning for degradation perception in complex scenes.
Details
Motivation: Existing image restoration agents lack insight summarization mechanisms, leading to exhaustive searches for optimal tools. The authors aim to create an agent that can better perceive degradation and select appropriate restoration tools through learned insights.
Method: Proposes PaAgent with: 1) Self-evolving portrait bank that summarizes IR tool characteristics using restored images, selected tools, and degraded images; 2) RAG for retrieving relevant insights to select optimal tools (see the sketch below); 3) Subjective-objective reinforcement learning strategy combining image quality scores and semantic insights for degradation perception.
Result: Extensive experiments across 8 IR benchmarks covering six single-degradation and eight mixed-degradation scenarios validate PaAgent’s superiority in addressing complex IR tasks.
Conclusion: PaAgent effectively addresses limitations of existing IR agents by incorporating insight summarization and enhanced degradation perception, demonstrating strong performance on complex image restoration tasks.
Abstract: Image Restoration (IR) agents, leveraging multimodal large language models to perceive degradation and invoke restoration tools, have shown promise in automating IR tasks. However, existing IR agents typically lack an insight summarization mechanism for past interactions, which results in an exhaustive search for the optimal IR tool. To address this limitation, we propose a portrait-aware IR agent, dubbed PaAgent, which incorporates a self-evolving portrait bank for IR tools and Retrieval-Augmented Generation (RAG) to select a suitable IR tool for input. Specifically, to construct and evolve the portrait bank, the PaAgent continuously enriches it by summarizing the characteristics of various IR tools with restored images, selected IR tools, and degraded images. In addition, the RAG is employed to select the optimal IR tool for the input image by retrieving relevant insights from the portrait bank. Furthermore, to enhance PaAgent’s ability to perceive degradation in complex scenes, we propose a subjective-objective reinforcement learning strategy that considers both image quality scores and semantic insights in reward generation, which accurately provides the degradation information even under partial and non-uniform degradation. Extensive experiments across 8 IR benchmarks, covering six single-degradation and eight mixed-degradation scenarios, validate PaAgent’s superiority in addressing complex IR tasks. Our project page is \href{https://wyjgr.github.io/PaAgent.html}{PaAgent}.
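The portrait-bank retrieval step reduces, at its core, to nearest-neighbor search over stored tool insights; a minimal sketch with hypothetical names:

```python
import numpy as np

def retrieve_insights(query_vec: np.ndarray, bank_vecs: np.ndarray,
                      insights: list[str], top_k: int = 3) -> list[str]:
    """Rank stored IR-tool insights by cosine similarity to the embedding
    of the degraded input image (a sketch of the RAG step)."""
    q = query_vec / np.linalg.norm(query_vec)
    b = bank_vecs / np.linalg.norm(bank_vecs, axis=1, keepdims=True)
    order = np.argsort(-(b @ q))[:top_k]  # most similar portraits first
    return [insights[i] for i in order]
```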
[137] DesertFormer: Transformer-Based Semantic Segmentation for Off-Road Desert Terrain Classification in Autonomous Navigation Systems
Yasaswini Chebolu
Main category: cs.CV
TL;DR: DesertFormer: A semantic segmentation pipeline for off-road desert terrain analysis using SegFormer B2 with MiT-B2 backbone, achieving 64.4% mIoU on 10 terrain categories.
Details
Motivation: Desert landscapes present unique challenges for terrain perception due to low chromatic contrast, extreme lighting variability, and sparse vegetation that defy standard road-scene segmentation models, requiring specialized solutions for autonomous navigation in unstructured off-road environments.
Method: Uses SegFormer B2 with hierarchical Mix Transformer (MiT-B2) backbone for semantic segmentation, trained on a purpose-built dataset of 4,176 annotated off-road images at 512x512 resolution, with class-weighted training and copy-paste augmentation for rare terrain categories (see the sketch below).
Result: Achieves 64.4% mIoU and 86.1% pixel accuracy, representing +24.2% absolute improvement over DeepLabV3 MobileNetV2 baseline (41.0% mIoU). Systematic failure analysis identifies primary confusion patterns between Ground Clutter↔Landscape and Dry Grass↔Landscape.
Conclusion: DesertFormer provides effective semantic segmentation for desert terrain analysis, enabling safety-aware path planning for ground robots and autonomous vehicles in challenging off-road environments, with code and models publicly released.
Abstract: Reliable terrain perception is a fundamental requirement for autonomous navigation in unstructured, off-road environments. Desert landscapes present unique challenges due to low chromatic contrast between terrain categories, extreme lighting variability, and sparse vegetation that defy the assumptions of standard road-scene segmentation models. We present DesertFormer, a semantic segmentation pipeline for off-road desert terrain analysis based on SegFormer B2 with a hierarchical Mix Transformer (MiT-B2) backbone. The system classifies terrain into ten ecologically meaningful categories – Trees, Lush Bushes, Dry Grass, Dry Bushes, Ground Clutter, Flowers, Logs, Rocks, Landscape, and Sky – enabling safety-aware path planning for ground robots and autonomous vehicles. Trained on a purpose-built dataset of 4,176 annotated off-road images at 512x512 resolution, DesertFormer achieves a mean Intersection-over-Union (mIoU) of 64.4% and pixel accuracy of 86.1%, representing a +24.2% absolute improvement over a DeepLabV3 MobileNetV2 baseline (41.0% mIoU). We further contribute a systematic failure analysis identifying the primary confusion patterns – Ground Clutter to Landscape and Dry Grass to Landscape – and propose class-weighted training and copy-paste augmentation for rare terrain categories. Code, checkpoints, and an interactive inference dashboard are released at https://github.com/Yasaswini-ch/Vision-based-Desert-Terrain-Segmentation-using-SegFormer.
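The class-weighted training the paper proposes for rare terrain categories can be sketched as inverse-frequency weights fed to a standard segmentation loss; the per-class pixel counts below are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical per-class pixel counts for the ten terrain categories.
pixel_counts = torch.tensor(
    [9e6, 4e6, 7e6, 3e6, 2e6, 1e5, 8e4, 5e5, 2e7, 1.2e7])
weights = pixel_counts.sum() / (len(pixel_counts) * pixel_counts)
criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=255)

# logits: (B, 10, H, W) class scores; labels: (B, H, W) terrain indices.
loss = criterion(torch.randn(2, 10, 64, 64), torch.randint(0, 10, (2, 64, 64)))
```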
[138] TrackDeform3D: Markerless and Autonomous 3D Keypoint Tracking and Dataset Collection for Deformable Objects
Yeheng Zong, Yizhou Chen, Alexander Bowler, Chia-Tung Yang, Ram Vasudevan
Main category: cs.CV
TL;DR: TrackDeform3D: An autonomous framework using RGB-D cameras to collect 3D datasets of deformable objects by identifying and tracking 3D keypoints with motion consistency constraints.
Details
Motivation: Current methods struggle with extracting structured 3D representations (keypoints, meshes) for deformable objects due to complex deformations. Large-scale 3D data collection is bottlenecked by expensive annotation/motion capture or simplifying assumptions that fail in unstructured environments.
Method: Uses affordable RGB-D cameras to autonomously collect 3D datasets. Identifies 3D keypoints and tracks their trajectories with motion consistency constraints to produce temporally smooth and geometrically coherent data (see the sketch below).
Result: Outperforms state-of-the-art tracking methods across diverse object categories in both geometric and tracking accuracy. Creates a high-quality dataset of 6 deformable objects with 110 minutes of trajectory data.
Conclusion: Presents an affordable, autonomous framework for large-scale 3D dataset collection of deformable objects, addressing data scarcity and enabling better 3D representation learning.
Abstract: Structured 3D representations such as keypoints and meshes offer compact, expressive descriptions of deformable objects, jointly capturing geometric and topological information useful for downstream tasks such as dynamics modeling and motion planning. However, robustly extracting such representations remains challenging, as current perception methods struggle to handle complex deformations. Moreover, large-scale 3D data collection remains a bottleneck: existing approaches either require prohibitive data collection efforts, such as labor-intensive annotation or expensive motion capture setups, or rely on simplifying assumptions that break down in unstructured environments. As a result, large-scale 3D datasets and benchmarks for deformable objects remain scarce. To address these challenges, this paper presents an affordable and autonomous framework for collecting 3D datasets of deformable objects using only RGB-D cameras. The proposed method identifies 3D keypoints and robustly tracks their trajectories, incorporating motion consistency constraints to produce temporally smooth and geometrically coherent data. TrackDeform3D is evaluated against several state-of-the-art tracking methods across diverse object categories and demonstrates consistent improvements in both geometric and tracking accuracy. Using this framework, this paper presents a high-quality, large-scale dataset consisting of 6 deformable objects, totaling 110 minutes of trajectory data.
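One simple way to read the motion-consistency constraint is as a penalty on frame-to-frame keypoint acceleration; the snippet below is that reading, not the paper's exact formulation.

```python
import numpy as np

def motion_consistency_penalty(traj: np.ndarray) -> float:
    """Penalize second differences of 3D keypoint trajectories so tracked
    points stay temporally smooth. traj: (T, K, 3) = frames x keypoints x xyz."""
    accel = traj[2:] - 2 * traj[1:-1] + traj[:-2]  # finite-difference accel
    return float(np.mean(np.linalg.norm(accel, axis=-1)))
```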
[139] Edge-Efficient Two-Stream Multimodal Architecture for Non-Intrusive Bathroom Fall Detection
Haitian Wang, Yiren Wang, Xinyu Wang, Sheldon Fung, Atif Mansoor
Main category: cs.CV
TL;DR: A two-stream multimodal architecture combining mmWave radar and floor vibration sensors for privacy-preserving fall detection in bathrooms, using Motion-Mamba and Impact-Griffin branches with cross-conditioned fusion for real-time edge deployment.
Details
Motivation: Existing multimodal fall detection systems treat motion and impact as loosely coupled streams with coarse temporal alignment, failing to encode causal links between collapse and impact, and not addressing timing drift, object drop confounders, or edge device constraints.
Method: Two-stream architecture: Motion-Mamba branch processes radar signals for long-range motion patterns; Impact-Griffin branch processes floor vibration for impact transients. Uses cross-conditioned fusion with low-rank bilinear interaction (see the sketch below) and Switch-MoE head to align motion/impact tokens and suppress confounders. Optimized for real-time execution on Raspberry Pi 4B.
Result: Achieves 96.1% accuracy, 94.8% precision, 88.0% recall, 91.1% macro F1, and 0.968 AUC on the bathroom fall detection benchmark. Reduces latency from 35.9 ms to 15.8 ms and energy consumption from 14200 mJ to 10750 mJ per window compared to the strongest baseline.
Conclusion: The proposed multimodal architecture effectively encodes causal relationships between motion and impact for fall detection, addresses practical challenges like timing drift and confounders, and achieves efficient real-time performance on low-power edge devices.
Abstract: Falls in wet bathroom environments are a major safety risk for seniors living alone. Recent work has shown that mmWave-only, vibration-only, and existing multimodal schemes, such as vibration-triggered radar activation, early feature concatenation, and decision-level score fusion, can support privacy-preserving, non-intrusive fall detection. However, these designs still treat motion and impact as loosely coupled streams, depending on coarse temporal alignment and amplitude thresholds, and do not explicitly encode the causal link between radar-observed collapse and floor impact or address timing drift, object drop confounders, and latency and energy constraints on low-power edge devices. To this end, we propose a two-stream architecture that encodes radar signals with a Motion–Mamba branch for long-range motion patterns and processes floor vibration with an Impact–Griffin branch that emphasizes impact transients and cross-axis coupling. Cross-conditioned fusion uses low-rank bilinear interaction and a Switch–MoE head to align motion and impact tokens and suppress object-drop confounders. The model keeps inference cost suitable for real-time execution on a Raspberry Pi 4B gateway. We construct a bathroom fall detection benchmark dataset with frame-level annotations, comprising more than 3 h of synchronized mmWave radar and triaxial vibration recordings across eight scenarios under running water, together with subject-independent training, validation, and test splits. On the test split, our model attains 96.1% accuracy, 94.8% precision, 88.0% recall, a 91.1% macro F1 score, and an AUC of 0.968. Compared with the strongest baseline, it improves accuracy by 2.0 percentage points and fall recall by 1.3 percentage points, while reducing latency from 35.9 ms to 15.8 ms and lowering energy per 2.56 s window from 14200 mJ to 10750 mJ on the Raspberry Pi 4B gateway.
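The low-rank bilinear interaction used in the cross-conditioned fusion can be sketched as the standard MLB-style construction below; dimensions and names are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    """Project motion and impact features into a shared rank-r space,
    interact them multiplicatively, then project to the fused output."""
    def __init__(self, d_motion: int, d_impact: int,
                 rank: int = 32, d_out: int = 128):
        super().__init__()
        self.u = nn.Linear(d_motion, rank, bias=False)
        self.v = nn.Linear(d_impact, rank, bias=False)
        self.out = nn.Linear(rank, d_out)

    def forward(self, motion: torch.Tensor, impact: torch.Tensor) -> torch.Tensor:
        return self.out(torch.tanh(self.u(motion)) * torch.tanh(self.v(impact)))
```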
[140] ACE-LoRA: Graph-Attentive Context Enhancement for Parameter-Efficient Adaptation of Medical Vision-Language Models
M. Arda Aydın, Melih B. Yilmaz, Aykut Koç, Tolga Çukur
Main category: cs.CV
TL;DR: ACE-LoRA is a parameter-efficient adaptation framework for medical vision-language models that bridges the specialization-generalization trade-off using LoRA modules and an attention-based hypergraph neural network to capture fine-grained diagnostic cues while maintaining zero-shot generalization.
Details
Motivation: Existing medical VLMs face a trade-off: specialist models trained on single-domain data capture domain-specific details but generalize poorly, while generalist models trained on multi-domain data retain broad semantics but dilute fine-grained diagnostic cues. Bridging this specialization-generalization gap is challenging.
Method: ACE-LoRA integrates Low-Rank Adaptation (LoRA) modules into frozen image-text encoders and introduces an Attention-based Context Enhancement Hypergraph Neural Network (ACE-HGNN) module to capture higher-order contextual interactions beyond pairwise similarity. It also uses a label-guided InfoNCE loss to suppress false negatives between semantically related image-text pairs (see the sketch below).
Result: Despite adding only 0.95M trainable parameters, ACE-LoRA consistently outperforms state-of-the-art medical VLMs and PEFT baselines across zero-shot classification, segmentation, and detection benchmarks spanning multiple domains.
Conclusion: ACE-LoRA successfully bridges the specialization-generalization trade-off in medical VLMs through parameter-efficient adaptation that maintains robust zero-shot generalization while capturing fine-grained diagnostic cues.
Abstract: The success of CLIP-like vision-language models (VLMs) on natural images has inspired medical counterparts, yet existing approaches largely fall into two extremes: specialist models trained on single-domain data, which capture domain-specific details but generalize poorly, and generalist medical VLMs trained on multi-domain data, which retain broad semantics but dilute fine-grained diagnostic cues. Bridging this specialization-generalization trade-off remains challenging. To address this problem, we propose ACE-LoRA, a parameter-efficient adaptation framework for generalist medical VLMs that maintains robust zero-shot generalization. ACE-LoRA integrates Low-Rank Adaptation (LoRA) modules into frozen image-text encoders and introduces an Attention-based Context Enhancement Hypergraph Neural Network (ACE-HGNN) module that captures higher-order contextual interactions beyond pairwise similarity to enrich global representations with localized diagnostic cues, addressing a key limitation of prior Parameter-Efficient Fine-Tuning (PEFT) methods that overlook fine-grained details. To further enhance cross-modal alignment, we formulate a label-guided InfoNCE loss to effectively suppress false negatives between semantically related image-text pairs. Despite adding only 0.95M trainable parameters, ACE-LoRA consistently outperforms state-of-the-art medical VLMs and PEFT baselines across zero-shot classification, segmentation, and detection benchmarks spanning multiple domains. Our code is available at https://github.com/icon-lab/ACE-LoRA.
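A plausible reading of the label-guided InfoNCE is a standard contrastive loss in which same-label off-diagonal pairs are removed from the negative set; the sketch below encodes that reading and is not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def label_guided_infonce(img: torch.Tensor, txt: torch.Tensor,
                         labels: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over (B, D) image/text embeddings where semantically related
    (same-label) pairs are masked out so they are not treated as negatives."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / tau                       # (B, B) similarities
    same = labels[:, None] == labels[None, :]          # same-label pairs
    mask = same & ~torch.eye(len(labels), dtype=torch.bool)
    logits = logits.masked_fill(mask, float("-inf"))   # drop false negatives
    return F.cross_entropy(logits, torch.arange(len(labels)))
```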
[141] Accurate Shift Invariant Convolutional Neural Networks Using Gaussian-Hermite Moments
Jaspreet Singh, Petra Bosilj, Grzegorz Cielniak
Main category: cs.CV
TL;DR: A novel Gaussian-Hermite Sampling (GHS) method for CNNs that achieves perfect shift invariance through shift-consistent downsampling using Gaussian-Hermite polynomials, maintaining 100% classification consistency under spatial shifts while improving accuracy.
Details
Motivation: Standard CNNs lack inherent shift invariance due to downsampling operations, which break this property despite being essential for computational efficiency and receptive field expansion. There's a need for downsampling methods that preserve shift invariance.
Method: Proposes Gaussian-Hermite Sampling (GHS) that uses Gaussian-Hermite polynomials to perform shift-consistent sampling (see the sketch below). This method embeds shift invariance directly at the CNN layer level without requiring architectural changes or additional training procedures.
Result: Achieves 100% classification consistency under spatial shifts on CIFAR-10, CIFAR-100, and MNIST-rot datasets. Also improves classification accuracy compared to baseline CNN models while maintaining computational efficiency.
Conclusion: GHS provides an effective solution for achieving shift invariance in CNNs through principled downsampling, offering both perfect shift consistency and improved accuracy without architectural modifications.
Abstract: Convolutional neural networks (CNNs) are not inherently shift invariant or equivariant. The downsampling operation used in CNNs is one of the key reasons the shift-invariance property of a CNN breaks. Conversely, the downsampling operation is important to improve computational efficiency and increase the area of the receptive field for more contextual information. In this work, we propose Gaussian-Hermite Sampling (GHS), a novel downsampling strategy designed to achieve accurate shift invariance. GHS leverages Gaussian-Hermite polynomials to perform shift-consistent sampling, enabling CNN layers to maintain invariance to arbitrary spatial shifts prior to training. When integrated into standard CNN architectures, the proposed method embeds shift invariance directly at the layer level without requiring architectural modifications or additional training procedures. We evaluate the proposed approach on CIFAR-10, CIFAR-100, and MNIST-rot datasets. Experimental results demonstrate that GHS significantly improves shift consistency, achieving 100% classification consistency under spatial shifts, while also improving classification accuracy compared to baseline CNN models.
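For intuition, the order-0 Gaussian-Hermite moment reduces to Gaussian smoothing, so the simplest shift-consistent downsampler is blur-then-subsample; the paper's GHS generalizes this with higher-order Hermite polynomials. A minimal sketch under that simplification:

```python
import torch
import torch.nn.functional as F

def gaussian_kernel1d(sigma: float, radius: int) -> torch.Tensor:
    x = torch.arange(-radius, radius + 1, dtype=torch.float32)
    k = torch.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_downsample(x: torch.Tensor, sigma: float = 1.0,
                      stride: int = 2) -> torch.Tensor:
    """Low-pass with a Gaussian (the order-0 Gaussian-Hermite moment)
    before striding, so sampling is consistent under shifts. x: (B, C, H, W)."""
    k1 = gaussian_kernel1d(sigma, radius=3)
    k2 = torch.outer(k1, k1)[None, None].repeat(x.shape[1], 1, 1, 1)
    x = F.conv2d(x, k2, padding=3, groups=x.shape[1])  # depthwise anti-alias
    return x[..., ::stride, ::stride]                  # then subsample
```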
[142] LLM-Powered Flood Depth Estimation from Social Media Imagery: A Vision-Language Model Framework with Mechanistic Interpretability for Transportation Resilience
Nafis Fuad, Xiaodong Qian
Main category: cs.CV
TL;DR: FloodLlama: A fine-tuned vision-language model for centimeter-resolution flood depth estimation from street-level images using TikTok data and synthetic training.
Details
Motivation: Urban flooding threatens transportation networks, but no existing system provides real-time, street-level flood depth information at the centimeter resolution needed for dynamic routing, EV safety, and AV operations.
Method: Fine-tuned LLaMA 3.2-11B Vision using QLoRA on a synthetic dataset of ~190k images covering various vehicle types, weather conditions, and depth levels (0-40 cm); see the sketch below. Used progressive curriculum training and developed a five-phase mechanistic interpretability framework to identify critical depth-encoding layers.
Result: Achieved MAE below 0.97 cm and Acc@5cm above 93.7% for deep flooding (exceeding 96.8% for shallow depths). Tier 3 configuration reached 98.62% accuracy on real-world data with strong robustness under occlusion. TikTok pipeline validated on 676 annotated flood frames from Detroit.
Conclusion: FloodLlama provides a scalable, infrastructure-free solution for real-time flood depth estimation with applications for EV safety, AV deployment, and resilient transportation management.
Abstract: Urban flooding poses an escalating threat to transportation network continuity, yet no operational system currently provides real-time, street-level flood depth information at the centimeter resolution required for dynamic routing, electric vehicle (EV) safety, and autonomous vehicle (AV) operations. This study presents FloodLlama, a fine-tuned open-source vision-language model (VLM) for continuous flood depth estimation from single street-level images, supported by a multimodal sensing pipeline using TikTok data. A synthetic dataset of approximately 190000 images was generated, covering seven vehicle types, four weather conditions, and 41 depth levels (0-40 cm at 1 cm resolution). Progressive curriculum training enabled coarse-to-fine learning, while LLaMA 3.2-11B Vision was fine-tuned using QLoRA. Evaluation across 34797 trials reveals a depth-dependent prompt effect: simple prompts perform better for shallow flooding, whereas chain-of-thought (CoT) reasoning improves performance at greater depths. FloodLlama achieves a mean absolute error (MAE) below 0.97 cm and Acc@5cm above 93.7% for deep flooding, exceeding 96.8% for shallow depths. A five-phase mechanistic interpretability framework identifies layer L23 as the critical depth-encoding transition and enables selective fine-tuning that reduces trainable parameters by 76-80% while maintaining accuracy. The Tier 3 configuration achieves 98.62% accuracy on real-world data and shows strong robustness under visual occlusion. A TikTok-based data pipeline, validated on 676 annotated flood frames from Detroit, demonstrates the feasibility of real-time, crowd-sourced flood sensing. The proposed framework provides a scalable, infrastructure-free solution with direct implications for EV safety, AV deployment, and resilient transportation management.
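For readers unfamiliar with QLoRA, a typical Hugging Face setup looks like the sketch below; the hyperparameters and target modules are illustrative defaults, not the values used for FloodLlama.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapters trained on top of the quantized weights.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# model = AutoModelForVision2Seq.from_pretrained(name, quantization_config=bnb)
# model = get_peft_model(model, lora)
```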
[143] Pixel-level Counterfactual Contrastive Learning for Medical Image Segmentation
Marceau Lafargue-Hauret, Raghav Mehta, Fabio De Sousa Ribeiro, Mélanie Roschewitz, Ben Glocker
Main category: cs.CV
TL;DR: A pipeline combining counterfactual generation with dense contrastive learning for image segmentation, using Dual-View and Multi-View methods, plus supervised variants with silver-standard labels, achieving ~94% DSC on challenging data.
Details
Motivation: Image segmentation requires expensive annotated datasets, while AI-generated silver-standard labels risk bias. Self-supervised learning is key but existing contrastive learning with counterfactual generation doesn't extend well to pixel-level tasks.
Method: Proposes a pipeline combining counterfactual generation with dense contrastive learning via Dual-View (DVD-CL) and Multi-View (MVD-CL) methods (see the sketch below). Includes supervised variants using silver-standard annotations and introduces CHRO-map visualization algorithm.
Result: Annotation-free DVD-CL outperforms other dense contrastive learning methods. Supervised variants using silver-standard labels outperform training on silver-standard data directly, achieving ~94% DSC on challenging data.
Conclusion: Pixel-level contrastive learning enhanced by counterfactuals and silver-standard annotations improves robustness to acquisition and pathological variations in image segmentation.
Abstract: Image segmentation relies on large annotated datasets, which are expensive and slow to produce. Silver-standard (AI-generated) labels are easier to obtain, but they risk introducing bias. Self-supervised learning, needing only images, has become key for pre-training. Recent work combining contrastive learning with counterfactual generation improves representation learning for classification but does not readily extend to pixel-level tasks. We propose a pipeline combining counterfactual generation with dense contrastive learning via Dual-View (DVD-CL) and Multi-View (MVD-CL) methods, along with supervised variants that utilize available silver-standard annotations. A new visualisation algorithm, the Color-coded High Resolution Overlay map (CHRO-map) is also introduced. Experiments show annotation-free DVD-CL outperforms other dense contrastive learning methods, while supervised variants using silver-standard labels outperform training on the silver-standard labeled data directly, achieving $\sim$94% DSC on challenging data. These results highlight that pixel-level contrastive learning, enhanced by counterfactuals and silver-standard annotations, improves robustness to acquisition and pathological variations.
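At pixel level, the contrastive objective pairs each location in one view with the same location in its counterfactual view; below is a dense InfoNCE sketch over pre-aligned per-pixel features, an illustration rather than DVD-CL/MVD-CL themselves.

```python
import torch
import torch.nn.functional as F

def dense_infonce(f1: torch.Tensor, f2: torch.Tensor,
                  tau: float = 0.1) -> torch.Tensor:
    """f1, f2: (N, D) per-pixel features from two spatially aligned views
    (e.g. an image and its counterfactual); matching rows are positives,
    every other row is a negative."""
    f1, f2 = F.normalize(f1, dim=-1), F.normalize(f2, dim=-1)
    logits = f1 @ f2.t() / tau
    return F.cross_entropy(logits, torch.arange(len(f1)))
```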
[144] Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles
Zacharie Bugaud
Main category: cs.CV
TL;DR: Ensembling VLMs from different families reveals correlated errors within families, reducing effective ensemble diversity. Proposed family-aware methods (HFV, QualRCCV, LCS) improve accuracy by accounting for family structure.
Details
Motivation: Current VLM ensembling ignores correlated errors within architectural families, reducing ensemble effectiveness. Family-correlated errors create misleading questions where majority voting fails despite individual models being correct.
Method: Three family-aware methods: 1) Hierarchical Family Voting (HFV) aggregates within families before cross-family voting (see the sketch below); 2) QualRCCV weights models by calibration, family quality, and inverse family size; 3) Learned Candidate Scoring (LCS) trains a classifier to re-rank answers using support breadth, family diversity, and model quality.
Result: HFV recovers +18-26 pp on misleading questions. QualRCCV beats calibrated voting on all three benchmarks (VQAv2, TextVQA, GQA). LCS achieves largest gains: +0.68% VQAv2, +0.61% TextVQA, +2.45% GQA, reaching 87.83% on VQAv2 test-standard.
Conclusion: Family-correlated errors significantly impact VLM ensembling. Family-aware methods substantially improve ensemble performance, with LCS achieving state-of-the-art results without degrading any benchmark.
Abstract: Ensembling Vision-Language Models (VLMs) from different providers maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study this structure across 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce effective ensemble dimensionality to 2.5-3.6 independent voters and create a Misleading tier (1.5-6.5% of questions) where correlated majority errors destroy accuracy to 0% despite the best model being correct. We propose three family-aware methods. Hierarchical Family Voting (HFV) aggregates within families before voting across them, recovering +18-26 pp on the Misleading tier. QualRCCV, a training-free method weighting models by calibration, family quality, and inverse family size, is the first to beat calibrated voting on all three benchmarks (p<0.05). Learned Candidate Scoring (LCS) trains a cross-validated classifier to re-rank candidate answers using support breadth, family diversity, and model quality, achieving the largest gains: +0.68% VQAv2, +0.61% TextVQA, +2.45% GQA – all significant – and is the only learned method that never degrades any benchmark. On VQAv2 test-standard (EvalAI), LCS reaches 87.83% with 12 models, confirming generalization.
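Hierarchical Family Voting is easy to state in code: collapse each family to a within-family majority, then vote across families so correlated clones count once. A minimal sketch:

```python
from collections import Counter

def hierarchical_family_vote(answers: dict[str, list[str]]) -> str:
    """answers maps family name -> that family's model answers; each family
    contributes one vote, so large families cannot dominate."""
    family_votes = [Counter(a).most_common(1)[0][0] for a in answers.values()]
    return Counter(family_votes).most_common(1)[0][0]

votes = {"fam_a": ["cat", "cat", "dog"], "fam_b": ["dog"], "fam_c": ["dog", "dog"]}
print(hierarchical_family_vote(votes))  # -> "dog" (two families against one)
```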
[145] MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Sri Siddarth Chakaravarthy P, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg
Main category: cs.CV
TL;DR: MosaicMem is a hybrid spatial memory system for video diffusion models that combines 3D patch lifting for reliable localization with native model conditioning for prompt-following generation, enabling improved pose adherence and dynamic modeling.
Details
Motivation: Current video diffusion models struggle with spatial memory: explicit 3D structures improve consistency but fail with moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. There's a need for better spatial memory to enable world simulation with consistent camera motion, revisits, and interventions.
Method: Proposes MosaicMem, a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval while using the model’s native conditioning for prompt-following. Features a patch-and-compose interface that spatially aligns patches in queried views, preserving persistent elements while allowing the model to inpaint evolving content. Uses PRoPE camera conditioning and two new memory alignment methods.
Result: Experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. Enables minute-level navigation, memory-based scene editing, and autoregressive rollout capabilities.
Conclusion: MosaicMem addresses spatial memory bottlenecks in video diffusion models by combining the strengths of explicit and implicit approaches, enabling more consistent world simulation with better camera motion handling and dynamic content generation.
Abstract: Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model’s native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
[146] SMAL-pets: SMAL Based Avatars of Pets from Single Image
Piotr Borycki, Joanna Waczyńska, Yizhe Zhu, Yongqiang Gao, Przemysław Spurek
Main category: cs.CV
TL;DR: SMAL-pets is a framework that generates high-quality, editable 3D dog avatars from single images using 3D Gaussian Splatting integrated with SMAL parametric model, enabling multimodal editing through text prompts.
Details
Motivation: Creating animatable 3D dog avatars is challenging due to lack of large-scale annotated datasets, immense morphological diversity across breeds, difficulty capturing realistic fur textures, and the need for labor-intensive manual rigging for naturalistic movements.
Method: Hybrid architecture integrating 3D Gaussian Splatting with SMAL parametric model to provide both visual fidelity and anatomical grounding. Includes multimodal editing suite allowing appearance refinement and complex animation control through textual prompts.
Result: Framework generates high-quality, editable animal avatars from single input images, bridging reconstruction and generative modeling. Enables users to control both aesthetic and behavioral aspects via natural language for animation and VR applications.
Conclusion: SMAL-pets provides a flexible, robust tool for creating animatable 3D dog avatars that addresses key challenges in animal reconstruction through hybrid representation and multimodal editing capabilities.
Abstract: Creating high-fidelity, animatable 3D dog avatars remains a formidable challenge in computer vision. Unlike human digital doubles, animal reconstruction faces a critical shortage of large-scale, annotated datasets for specialized applications. Furthermore, the immense morphological diversity across species, breeds, and crosses, which varies significantly in size, proportions, and features, complicates the generalization of existing models. Current reconstruction methods often struggle to capture realistic fur textures. Additionally, ensuring these avatars are fully editable and capable of performing complex, naturalistic movements typically necessitates labor-intensive manual mesh manipulation and expert rigging. This paper introduces SMAL-pets, a comprehensive framework that generates high-quality, editable animal avatars from a single input image. Our approach bridges the gap between reconstruction and generative modeling by leveraging a hybrid architecture. Our method integrates 3D Gaussian Splatting with the SMAL parametric model to provide a representation that is both visually high-fidelity and anatomically grounded. We introduce a multimodal editing suite that enables users to refine the avatar’s appearance and execute complex animations through direct textual prompts. By allowing users to control both the aesthetic and behavioral aspects of the model via natural language, SMAL-pets provides a flexible, robust tool for animation and virtual reality.
[147] BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird’s-Eye View Images
David Skuddis, Vincent Ress, Wei Zhang, Vincent Ofosu Nyako, Norbert Haala
Main category: cs.CV
TL;DR: BEV-SLD is a self-supervised LiDAR global localization method that uses BEV images to discover scene-specific landmarks and aligns them via consistency loss for robust localization across diverse environments.
Details
Motivation: The paper addresses the need for robust global localization in LiDAR-based systems, particularly in challenging environments like campuses, industrial areas, and forests. Current scene-agnostic pipelines lack adaptability to specific environments, motivating a self-supervised approach that can discover scene-specific patterns as landmarks.
Method: The method uses bird’s-eye-view (BEV) images to discover scene-specific patterns at prescribed spatial densities, treating them as landmarks. It employs a consistency loss to align learnable global landmark coordinates with per-frame heatmaps, ensuring consistent landmark detections across the scene (see the sketch below).
Result: BEV-SLD demonstrates robust localization performance across campus, industrial, and forest environments, achieving strong results compared to state-of-the-art methods in LiDAR global localization.
Conclusion: The proposed BEV-SLD method provides an effective self-supervised approach for LiDAR global localization by leveraging scene-specific landmark discovery through BEV representations, offering improved robustness across diverse real-world environments.
Abstract: We present BEV-SLD, a LiDAR global localization method building on the Scene Landmark Detection (SLD) concept. Unlike scene-agnostic pipelines, our self-supervised approach leverages bird’s-eye-view (BEV) images to discover scene-specific patterns at a prescribed spatial density and treat them as landmarks. A consistency loss aligns learnable global landmark coordinates with per-frame heatmaps, yielding consistent landmark detections across the scene. Across campus, industrial, and forest environments, BEV-SLD delivers robust localization and achieves strong performance compared to state-of-the-art methods.
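The consistency loss can be pictured as follows: a learnable global landmark, projected into a frame's BEV image, should land on the peak of that frame's predicted heatmap. The sketch below uses a soft-argmax and a hypothetical `world_to_bev` projection; it is an illustration, not the paper's implementation.

```python
import torch

def landmark_consistency_loss(global_xy: torch.Tensor,
                              heatmap: torch.Tensor, world_to_bev) -> torch.Tensor:
    """Squared distance between the projected global landmark and the
    soft-argmax of a per-frame BEV heatmap (heatmap: (H, W) logits)."""
    H, W = heatmap.shape
    prob = torch.softmax(heatmap.flatten(), dim=0).reshape(H, W)
    xs = torch.arange(W, dtype=prob.dtype)
    ys = torch.arange(H, dtype=prob.dtype)
    detected = torch.stack([(prob.sum(0) * xs).sum(),   # expected x
                            (prob.sum(1) * ys).sum()])  # expected y
    return ((world_to_bev(global_xy) - detected) ** 2).sum()
```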
[148] GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global-Local Feature Fusion
Zhuojiang Cai, Zhenghui Sun, Feng Lu
Main category: cs.CV
TL;DR: GazeOnce360 is an end-to-end model for multi-person 3D gaze estimation from a single upward-facing fisheye camera, addressing 360° scene coverage with a novel dual-resolution architecture and synthetic dataset.
Details
Motivation: The paper addresses the underexplored problem of estimating 3D gaze directions for multiple people distributed across a 360° scene from an upward fisheye perspective, which differs from conventional forward-facing camera approaches in constrained viewpoints.
Method: The model incorporates rotational convolutions to handle fisheye distortion, uses eye landmark supervision, and features a dual-resolution architecture that fuses global low-resolution context with high-resolution local eye regions. The authors also introduce MPSGaze360, a large-scale synthetic dataset rendered with Unreal Engine.
Result: Experimental results demonstrate the effectiveness of each component in the model, showing the feasibility of fisheye-based 360° gaze estimation in practical multi-person scenarios.
Conclusion: This work highlights the potential of fisheye-based gaze estimation for 360° multi-person scenarios, providing both a novel model architecture and a comprehensive synthetic dataset to advance research in this area.
Abstract: We present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. Unlike conventional approaches that rely on forward-facing cameras in constrained viewpoints, we address the underexplored setting of estimating the 3D gaze direction of multiple people distributed across a 360° scene from an upward fisheye perspective. To support research in this setting, we introduce MPSGaze360, a large-scale synthetic dataset rendered using Unreal Engine, featuring diverse multi-person configurations with accurate 3D gaze and eye landmark annotations. Our model tackles the severe distortion and perspective variation inherent in fisheye imagery by incorporating rotational convolutions and eye landmark supervision. To better capture fine-grained eye features crucial for gaze estimation, we propose a dual-resolution architecture that fuses global low-resolution context with high-resolution local eye regions. Experimental results demonstrate the effectiveness of each component in our model. This work highlights the feasibility and potential of fisheye-based 360° gaze estimation in practical multi-person scenarios. Project page: https://caizhuojiang.github.io/GazeOnce360/.
[149] Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience
Jacob Piland, Byron Dowling, Christopher Sweet, Adam Czajka
Main category: cs.CV
TL;DR: MLLMs can perform iris presentation attack detection using vision encoder embeddings and expert-informed prompts, achieving better performance than specialized CNNs and human examiners while maintaining privacy constraints.
Details
Motivation: Iris PAD faces practical barriers: collecting data for unknown future attacks is impossible, diverse data collection is expensive, and sharing biometric data raises privacy concerns. Rapidly emerging attack vectors demand adaptable solutions.
Method: Use pre-trained vision transformers from MLLMs to extract embeddings from iris images, analyze clustering of attack types (see the sketch below), and enhance performance with structured prompts incorporating human expert knowledge (verbal descriptions of attack indicators). Tested on 224 iris images spanning 7 attack types using privacy-compliant services (Gemini 2.5 Pro) and locally-hosted models (Llama 3.2-Vision).
Result: Gemini with expert-informed prompts outperforms specialized CNN-based baseline and human examiners. Locally-deployable Llama achieves near-human performance. Pre-trained vision transformers in MLLMs inherently cluster many iris attack types despite no explicit training for this task.
Conclusion: MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD, combining pre-trained vision capabilities with human expert knowledge through structured prompting.
Abstract: Iris presentation attack detection (PAD) is critical for secure biometric deployments, yet developing specialized models faces significant practical barriers: collecting data representing future unknown attacks is impossible, and collecting diverse-enough data, yet still limited in terms of its predictive power, is expensive. Additionally, sharing biometric data raises privacy concerns. Due to rapid emergence of new attack vectors demanding adaptable solutions, we thus investigate in this paper whether general-purpose multimodal large language models (MLLMs) can perform iris PAD when augmented with human expert knowledge, operating under strict privacy constraints that prohibit sending biometric data to public cloud MLLM services. Through analysis of vision encoder embeddings applied to our dataset, we demonstrate that pre-trained vision transformers in MLLMs inherently cluster many iris attack types despite never being explicitly trained for this task. However, where clustering shows overlap between attack classes, we find that structured prompts incorporating human salience (verbal descriptions from subjects identifying attack indicators) enable these models to resolve ambiguities. Testing on an IRB-restricted dataset of 224 iris images spanning seven attack types, using only university-approved services (Gemini 2.5 Pro) or locally-hosted models (e.g., Llama 3.2-Vision), we show that Gemini with expert-informed prompts outperforms both a specialized convolutional neural networks (CNN)-based baseline and human examiners, while the locally-deployable Llama achieves near-human performance. Our results establish that MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD.
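The embedding-clustering analysis is straightforward to reproduce in outline: embed each iris image with the MLLM's vision encoder (upstream, assumed here), cluster, and compare clusters against attack-type labels. A sketch with random stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

emb = np.random.randn(224, 768)        # stand-in for vision-encoder embeddings
labels = np.random.randint(0, 7, 224)  # stand-in for seven attack-type labels

pred = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(emb)
# ARI is near 0 on random data; substantially higher when embeddings
# genuinely cluster by attack type, as the paper reports.
print(adjusted_rand_score(labels, pred))
```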
[150] Patient4D: Temporally Consistent Patient Body Mesh Recovery from Monocular Operating Room Video
Mingxiao Tu, Hoijoon Jung, Alireza Moghadam, Andre Kyme, Jinman Kim
Main category: cs.CV
TL;DR: Patient4D: A stationarity-constrained 3D body mesh reconstruction pipeline for surgical AR that exploits patient stationarity prior to handle occlusion from draping and moving camera viewpoints.
Details
Motivation: Existing human mesh recovery methods fail under surgical conditions where patients are stationary under draping while cameras move continuously. Robust reconstruction is needed for surgical AR applications.
Method: Combines image-level foundation models with geometric mechanisms: Pose Locking (anchors pose using stable keyframes) and Rigid Fallback (handles severe occlusion via silhouette-guided rigid alignment). Enforces temporal consistency across frames.
Result: Achieves 0.75 mean IoU under surgical drape occlusion, reducing failure frames from 30.5% to 1.3% compared to best baseline. Evaluated on 4,680 synthetic surgical sequences and three public HMR benchmarks.
Conclusion: Exploiting stationarity priors substantially improves monocular reconstruction in clinical AR scenarios, with robust performance under occlusion and moving viewpoints.
Abstract: Recovering a dense 3D body mesh from monocular video remains challenging under occlusion from draping and continuously moving camera viewpoints. This configuration arises in surgical augmented reality (AR), where an anesthetized patient lies under surgical draping while a surgeon’s head-mounted camera continuously changes viewpoint. Existing human mesh recovery (HMR) methods are typically trained on upright, moving subjects captured from relatively stable cameras, leading to performance degradation under such conditions. To address this, we present Patient4D, a stationarity-constrained reconstruction pipeline that explicitly exploits the stationarity prior. The pipeline combines image-level foundation models for perception with lightweight geometric mechanisms that enforce temporal consistency across frames. Two key components enable robust reconstruction: Pose Locking, which anchors pose parameters using stable keyframes, and Rigid Fallback, which recovers meshes under severe occlusion through silhouette-guided rigid alignment. Together, these mechanisms stabilize predictions while remaining compatible with off-the-shelf HMR models. We evaluate Patient4D on 4,680 synthetic surgical sequences and three public HMR video benchmarks. Under surgical drape occlusion, Patient4D achieves a 0.75 mean IoU, reducing failure frames from 30.5% to 1.3% compared to the best baseline. Our findings demonstrate that exploiting stationarity priors can substantially improve monocular reconstruction in clinical AR scenarios.
[151] Visual Product Search Benchmark
Karthik Sulthanpete Govindappa
Main category: cs.CV
TL;DR: Benchmark study evaluating modern visual embedding models for industrial product identification via image retrieval, comparing foundation models, proprietary multimodal systems, and domain-specific models on curated industrial datasets.
Details
Motivation: Reliable product identification from images is critical in industrial applications where incorrect matches lead to costly failures. There's a need to benchmark how well modern visual embedding models perform for fine-grained instance retrieval in production environments with diverse imaging conditions.
Method: Structured benchmark of visual embedding models using a unified image-to-image retrieval protocol (see the sketch below). Evaluates open-source foundation models, proprietary multimodal embedding systems, and domain-specific vision-only models on curated industrial datasets (Manufacturing, Automotive, DIY, Retail) and established public benchmarks without post-processing.
Result: Provides insights into how contemporary foundation and unified embedding models transfer to fine-grained instance retrieval tasks compared to models explicitly trained for industrial applications. Results show performance variations under realistic constraints and heterogeneous image conditions.
Conclusion: The benchmark informs practitioners and researchers about strengths and limitations of current visual embedding approaches in production-level product identification systems, highlighting the gap between general foundation models and specialized industrial solutions.
Abstract: Reliable product identification from images is a critical requirement in industrial and commercial applications, particularly in maintenance, procurement, and operational workflows where incorrect matches can lead to costly downstream failures. At the core of such systems lies the visual search component, which must retrieve and rank the exact object instance from large and continuously evolving catalogs under diverse imaging conditions. This report presents a structured benchmark of modern visual embedding models for instance-level image retrieval, with a focus on industrial applications. A curated set of open-source foundation embedding models, proprietary multi-modal embedding systems, and domain-specific vision-only models are evaluated under a unified image-to-image retrieval protocol. The benchmark includes curated datasets, which includes industrial datasets derived from production deployments in Manufacturing, Automotive, DIY, and Retail, as well as established public benchmarks. Evaluation is conducted without post-processing, isolating the retrieval capability of each model. The results provide insight into how well contemporary foundation and unified embedding models transfer to fine-grained instance retrieval tasks, and how they compare to models explicitly trained for industrial applications. By emphasizing realistic constraints, heterogeneous image conditions, and exact instance matching requirements, this benchmark aims to inform both practitioners and researchers about the strengths and limitations of current visual embedding approaches in production-level product identification systems. An interactive companion website presenting the benchmark results, evaluation details, and additional visualizations is available at https://benchmark.nyris.io.
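The unified image-to-image protocol boils down to embedding queries and catalog images, ranking by cosine similarity, and scoring exact-instance recall; a minimal sketch:

```python
import numpy as np

def recall_at_k(query_emb: np.ndarray, catalog_emb: np.ndarray,
                gt: np.ndarray, k: int = 1) -> float:
    """gt[i] is the catalog row containing query i's true product instance."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = catalog_emb / np.linalg.norm(catalog_emb, axis=1, keepdims=True)
    topk = np.argsort(-(q @ c.T), axis=1)[:, :k]  # best-ranked catalog rows
    return float(np.mean([gt[i] in topk[i] for i in range(len(gt))]))
```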
[152] SA-CycleGAN-2.5D: Self-Attention CycleGAN with Tri-Planar Context for Multi-Site MRI Harmonization
Ishrith Gowda, Chunwei Liu
Main category: cs.CV
TL;DR: SA-CycleGAN-2.5D: A domain adaptation framework for neuroimaging harmonization that addresses scanner-induced covariate shifts using 2.5D tri-planar manifold injection, U-ResNet with global self-attention, and spectrally-normalized discriminators.
Details
Motivation: Multi-site neuroimaging analysis suffers from scanner-induced covariate shifts where acquisition variance often exceeds biological pathology variance, harming radiomic reproducibility. Existing methods either operate in feature space (precluding spatial tasks) or are limited by local receptive fields that cannot model global intensity correlations from field-strength bias.
Method: Proposes SA-CycleGAN-2.5D with three innovations: (1) 2.5D tri-planar manifold injection preserving through-plane gradients at O(HW) complexity, (2) U-ResNet generator with dense voxel-to-voxel self-attention to model global scanner field biases beyond CNN receptive field limits, and (3) spectrally-normalized discriminator constraining Lipschitz constant for stable adversarial optimization.
Result: Evaluated on 654 glioma patients across BraTS and UPenn-GBM domains, reduces Maximum Mean Discrepancy by 99.1% (1.729 → 0.015) and degrades domain classifier accuracy to near-chance (59.7%). Ablation shows global attention is statistically essential (Cohen’s d = 1.32, p < 0.001) for heterogeneous-to-homogeneous translation.
Conclusion: The framework bridges 2D efficiency and 3D consistency to produce voxel-level harmonized images that preserve tumor pathophysiology, enabling reproducible multi-center radiomic analysis by effectively addressing scanner-induced domain shifts.
Abstract: Multi-site neuroimaging analysis is fundamentally confounded by scanner-induced covariate shifts, where the marginal distribution of voxel intensities $P(\mathbf{x})$ varies non-linearly across acquisition protocols while the conditional anatomy $P(\mathbf{y}|\mathbf{x})$ remains constant. This is particularly detrimental to radiomic reproducibility, where acquisition variance often exceeds biological pathology variance. Existing statistical harmonization methods (e.g., ComBat) operate in feature space, precluding spatial downstream tasks, while standard deep learning approaches are theoretically bounded by local effective receptive fields (ERF), failing to model the global intensity correlations characteristic of field-strength bias. We propose SA-CycleGAN-2.5D, a domain adaptation framework motivated by the $H\Delta H$-divergence bound of Ben-David et al., integrating three architectural innovations: (1) A 2.5D tri-planar manifold injection preserving through-plane gradients $\nabla_z$ at $O(HW)$ complexity; (2) A U-ResNet generator with dense voxel-to-voxel self-attention, surpassing the $O(\sqrt{L})$ receptive field limit of CNNs to model global scanner field biases; and (3) A spectrally-normalized discriminator constraining the Lipschitz constant ($K_D \le 1$) for stable adversarial optimization. Evaluated on 654 glioma patients across two institutional domains (BraTS and UPenn-GBM), our method reduces Maximum Mean Discrepancy (MMD) by 99.1% ($1.729 \to 0.015$) and degrades domain classifier accuracy to near-chance (59.7%). Ablation confirms that global attention is statistically essential (Cohen's $d = 1.32$, $p < 0.001$) for the harder heterogeneous-to-homogeneous translation direction. By bridging 2D efficiency and 3D consistency, our framework yields voxel-level harmonized images that preserve tumor pathophysiology, enabling reproducible multi-center radiomic analysis.
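Two ingredients here are standard enough to sketch: the MMD statistic used to quantify the residual domain gap, and spectral normalization on the discriminator. A minimal sketch with random features standing in for real scanner-domain data (the kernel bandwidth and sizes are arbitrary):

```python
import torch
import torch.nn as nn

def mmd_rbf(x, y, sigma=1.0):
    """Biased MMD^2 estimate with an RBF kernel between two feature sets."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return (k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()).item()

# Random stand-ins for intensity features from two scanner domains.
src = torch.randn(256, 64)
tgt = torch.randn(256, 64) + 0.5          # shifted domain before harmonization
print(f"MMD^2 shifted: {mmd_rbf(src, tgt):.4f}")
print(f"MMD^2 matched: {mmd_rbf(src, torch.randn(256, 64)):.4f}")

# Spectral normalization as used for Lipschitz-constrained discriminators:
disc_layer = nn.utils.spectral_norm(
    nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1))
```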
[153] Adaptive Anchor Policies for Efficient 4D Gaussian Streaming
Ashim Dahal, Rabab Abdelfattah, Nick Rahimi
Main category: cs.CV
TL;DR: EGS: Reinforcement-learned anchor sampler for efficient Gaussian streaming that dynamically selects anchor budgets and informative anchors to optimize quality-runtime trade-offs in dynamic scene reconstruction.
Details
Motivation: Current Gaussian Splatting pipelines use fixed anchor selection (FPS with 8,192 anchors) regardless of scene complexity, leading to computational over-allocation under strict budgets. This motivates adaptive, budget-aware anchor sampling.
Method: Proposes Efficient Gaussian Streaming (EGS), a plug-in reinforcement-learned policy that jointly selects the anchor budget and an informative anchor subset using spatial features of the Gaussian representation, while keeping the Gaussian streaming reconstruction backbone unchanged.
Result: On unseen data, EGS with 256 anchors (32× fewer than 8,192) improves PSNR by +0.52-0.61 dB while running 1.29-1.35× faster than IGS@8192. In high-quality refinement, remains competitive with full-anchor baseline at substantially lower budgets.
Conclusion: EGS enables efficient dynamic scene reconstruction by adaptively selecting anchors based on scene complexity, achieving better quality-efficiency trade-offs than fixed FPS sampling across different rendering settings.
Abstract: Dynamic scene reconstruction with Gaussian Splatting has enabled efficient streaming for real-time rendering and free-viewpoint video. However, most pipelines rely on fixed anchor selection such as Farthest Point Sampling (FPS), typically using 8,192 anchors regardless of scene complexity, which over-allocates computation under strict budgets. We propose Efficient Gaussian Streaming (EGS), a plug-in, budget-aware anchor sampler that replaces FPS with a reinforcement-learned policy while keeping the Gaussian streaming reconstruction backbone unchanged. The policy jointly selects an anchor budget and a subset of informative anchors under discrete constraints, balancing reconstruction quality and runtime using spatial features of the Gaussian representation. We evaluate EGS in two settings: fast rendering, which prioritizes runtime efficiency, and high-quality refinement, which enables additional optimization. Experiments on dynamic multi-view datasets show consistent improvements in the quality–efficiency trade-off over FPS sampling. On unseen data, in fast rendering at 256 anchors ($32\times$ fewer than 8,192), EGS improves PSNR by $+0.52$–$0.61$ dB while running $1.29$–$1.35\times$ faster than IGS@8192 (N3DV and MeetingRoom). In high-quality refinement, EGS remains competitive with the full-anchor baseline at substantially lower anchor budgets. Code and pretrained checkpoints will be released upon acceptance. Keywords: 4D Gaussian Splatting, 4D Gaussian Streaming, Reinforcement Learning
[154] From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs
Boyong Wu, Sanghwan Kim, Zeynep Akata
Main category: cs.CV
TL;DR: MLLMs show progressive refinement of segmentation representations through attention mechanisms, with adapters causing initial drop-off but LLM layers recovering via cross-token attention steering.
Details
Motivation: To understand the intrinsic spatial understanding capacity of Multimodal Large Language Models (MLLMs) for pixel-level vision tasks like segmentation, which remains poorly understood despite their increasing application to such tasks.
Method: Layerwise linear probing evaluation across the entire MLLM pipeline (vision encoder, adapter, and LLM), intervention-based attention knockout analysis to test cross-token attention refinement, and evaluation of bidirectional attention among image tokens on spatial consistency.
Result: Adapter introduces segmentation representation drop-off, but LLM layers progressively recover through attention-mediated refinement where correctly classified tokens steer misclassified neighbors toward correct labels. Early image token positions show recovery bounded by causal attention, which bidirectional attention alleviates.
Conclusion: The findings provide a mechanistic account of how MLLMs process visual information for segmentation, informing the design of future segmentation-capable models by revealing the progressive refinement process through attention mechanisms.
Abstract: Multimodal Large Language Models (MLLMs) are increasingly applied to pixel-level vision tasks, yet their intrinsic capacity for spatial understanding remains poorly understood. We investigate segmentation capacity through a layerwise linear probing evaluation across the entire MLLM pipeline: vision encoder, adapter, and LLM. We further conduct an intervention-based attention knockout analysis to test whether cross-token attention progressively refines visual representations, and an evaluation of bidirectional attention among image tokens on spatial consistency. Our analysis reveals that the adapter introduces a segmentation representation drop-off, but LLM layers progressively recover through attention-mediated refinement, where correctly classified tokens steer misclassified neighbors toward the correct label. At early image token positions, this recovery is bounded by causal attention, which bidirectional attention among image tokens alleviates. These findings provide a mechanistic account of how MLLMs process visual information for segmentation, informing the design of future segmentation-capable models.
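Layerwise linear probing is a standard diagnostic, so the analysis pattern is easy to illustrate. Below is a toy sketch: random tensors stand in for per-layer image-token states, and for brevity the probe is scored on the tokens it was fit on, whereas a real probe would use held-out data.

```python
import torch
import torch.nn as nn

# Hypothetical per-layer hidden states from an MLLM forward pass:
# a list of (num_image_tokens, hidden_dim) tensors, one per layer.
hidden_dim, num_tokens, num_classes = 768, 196, 21
layer_states = [torch.randn(num_tokens, hidden_dim) for _ in range(12)]
token_labels = torch.randint(0, num_classes, (num_tokens,))  # per-token labels

def probe_layer(feats, labels, steps=200, lr=1e-2):
    """Fit a linear classifier on frozen features; return token accuracy."""
    probe = nn.Linear(feats.shape[1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(feats), labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (probe(feats).argmax(-1) == labels).float().mean().item()

# Probing each layer reveals where segmentation information drops and recovers.
for i, feats in enumerate(layer_states):
    print(f"layer {i}: probe accuracy {probe_layer(feats, token_labels):.3f}")
```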
[155] GigaWorld-Policy: An Efficient Action-Centered World–Action Model
Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, Min Cao, Peng Li, Qiuping Deng, Wenjun Mei, Xiaofeng Wang, Xinze Chen, Xinyu Zhou, Yang Wang, Yifan Chang, Yifan Li, Yukun Zhou, Yun Ye, Zhichao Liu, Zheng Zhu
Main category: cs.CV
TL;DR: GigaWorld-Policy is an action-centered World-Action Model that learns 2D pixel-action dynamics for efficient robot policy learning, with optional video generation for richer supervision but faster inference.
Details
Motivation: Existing World-Action Models (WAMs) face two critical bottlenecks: 1) joint reasoning over future visual dynamics and actions incurs substantial inference overhead, and 2) joint modeling entangles visual and motion representations, making motion prediction accuracy dependent on video forecast quality.
Method: Introduces GigaWorld-Policy with two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. Uses a causal design preventing future-video tokens from influencing action tokens, making video generation optional at inference. Pre-trains on a large-scale robot dataset to obtain an action-centered video generation backbone.
Result: Runs 9x faster than leading WAM baseline (Motus) while improving task success rates by 7%. Compared with pi-0.5, improves performance by 95% on RoboTwin 2.0.
Conclusion: GigaWorld-Policy addresses efficiency bottlenecks in WAMs through action-centered design with optional video generation, achieving faster inference and better performance on real-world robotic platforms.
Abstract: World-Action Models (WAM) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training into two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, providing richer learning signals and encouraging physically plausible actions through visual-dynamics constraints. With a causal design that prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.
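One way to see why the video branch can be skipped at inference: if action tokens precede future-video tokens in the sequence, a plain causal mask already guarantees that video keys never reach action queries. A toy mask illustrating this (the token layout is our assumption, not the paper's exact scheme):

```python
import torch

# Assumed token layout: [observation | action | future-video] in temporal order.
n_obs, n_act, n_vid = 4, 3, 5
n = n_obs + n_act + n_vid

# Standard causal mask: True = blocked (a query cannot attend to a later key).
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

# Because action tokens precede video tokens, causality alone guarantees
# that no future-video key can influence an action query:
act_rows = slice(n_obs, n_obs + n_act)
vid_cols = slice(n_obs + n_act, n)
assert mask[act_rows, vid_cols].all()

# So at deployment the video block can simply be dropped: actions are identical.
print(mask.int())
```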
[156] LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis
Inbum Heo, Taewook Hwang, Jeesu Jung, Sangkeun Jung
Main category: cs.CV
TL;DR: LED benchmark evaluates structural reasoning in document layout analysis by defining 8 standardized error types and providing quantitative rules for realistic error simulation.
Details
Motivation: Current document layout analysis models suffer from structural errors like region merging, splitting, and omission, but conventional overlap-based metrics (IoU, mAP) fail to capture these logical inconsistencies.
Method: Proposes Layout Error Detection (LED) benchmark with 8 standardized error types (Missing, Hallucination, Size Error, Split, Merge, Overlap, Duplicate, Misclassification), provides quantitative rules and injection algorithms for realistic error simulation, constructs LED-Dataset, and designs three evaluation tasks: document-level error detection, document-level error-type classification, and element-level error-type classification.
Result: Experiments with state-of-the-art multimodal models show LED enables fine-grained and interpretable assessment of structural understanding, revealing clear weaknesses across modalities and architectures.
Conclusion: LED establishes a unified and explainable benchmark for diagnosing the structural robustness and reasoning capability of document understanding models.
Abstract: Recent advances in Large Language Models (LLMs) and Large Multimodal Models (LMMs) have improved Document Layout Analysis (DLA), yet structural errors such as region merging, splitting, and omission remain persistent. Conventional overlap-based metrics (e.g., IoU, mAP) fail to capture such logical inconsistencies. To overcome this limitation, we propose Layout Error Detection (LED), a benchmark that evaluates structural reasoning in DLA predictions beyond surface-level accuracy. LED defines eight standardized error types (Missing, Hallucination, Size Error, Split, Merge, Overlap, Duplicate, and Misclassification) and provides quantitative rules and injection algorithms for realistic error simulation. Using these definitions, we construct LED-Dataset and design three evaluation tasks: document-level error detection, document-level error-type classification, and element-level error-type classification. Experiments with state-of-the-art multimodal models show that LED enables fine-grained and interpretable assessment of structural understanding, revealing clear weaknesses across modalities and architectures. Overall, LED establishes a unified and explainable benchmark for diagnosing the structural robustness and reasoning capability of document understanding models.
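The error-injection idea is straightforward to illustrate. Below is a toy sketch of two of the eight error types on bounding boxes, loosely following the definitions above; LED's actual injection algorithms and quantitative rules are more elaborate.

```python
def merge_error(boxes, i, j):
    """Simulate a 'Merge' layout error: replace two regions with their union box."""
    x1 = min(boxes[i][0], boxes[j][0]); y1 = min(boxes[i][1], boxes[j][1])
    x2 = max(boxes[i][2], boxes[j][2]); y2 = max(boxes[i][3], boxes[j][3])
    merged = (x1, y1, x2, y2)
    return [b for k, b in enumerate(boxes) if k not in (i, j)] + [merged]

def missing_error(boxes, i):
    """Simulate a 'Missing' error: drop one ground-truth region entirely."""
    return [b for k, b in enumerate(boxes) if k != i]

layout = [(10, 10, 200, 50), (10, 60, 200, 120), (10, 130, 200, 300)]
print(merge_error(layout, 0, 1))   # two paragraphs fused into one region
print(missing_error(layout, 2))    # a region omitted from the prediction
```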
[157] ConfusionBench: An Expert-Validated Benchmark for Confusion Recognition and Localization in Educational Videos
Lu Dong, Xiao Wang, Mark Frank, Srirangaraj Setlur, Venu Govindaraju, Ifeoma Nwogu
Main category: cs.CV
TL;DR: ConfusionBench: A new benchmark for recognizing and localizing student confusion in educational videos, built using a multi-stage filtering pipeline with model-assisted screening and expert validation.
Details
Motivation: Existing confusion datasets have noisy labels, coarse temporal annotations, and limited expert validation, hindering reliable fine-grained recognition and temporally grounded analysis of student confusion in educational videos.
Method: Proposed a practical multi-stage filtering pipeline integrating two stages of model-assisted screening, researcher curation, and expert validation to build ConfusionBench, consisting of balanced confusion recognition and video localization datasets.
Result: Proprietary models perform better overall but tend to over-predict transitional segments, while open-source models are more conservative and prone to missed detections. The confusion report visualization supports educational experts in intervention decisions.
Conclusion: ConfusionBench provides a higher-quality benchmark for confusion understanding in educational videos, with datasets and materials made publicly available to support future research.
Abstract: Recognizing and localizing student confusion from video is an important yet challenging problem in educational AI. Existing confusion datasets suffer from noisy labels, coarse temporal annotations, and limited expert validation, which hinder reliable fine-grained recognition and temporally grounded analysis. To address these limitations, we propose a practical multi-stage filtering pipeline that integrates two stages of model-assisted screening, researcher curation, and expert validation to build a higher-quality benchmark for confusion understanding. Based on this pipeline, we introduce ConfusionBench, a new benchmark for educational videos consisting of a balanced confusion recognition dataset and a video localization dataset. We further provide zero-shot baseline evaluations of a representative open-source model and a proprietary model on clip-level confusion recognition and long-video confusion localization tasks. Experimental results show that the proprietary model performs better overall but tends to over-predict transitional segments, while the open-source model is more conservative and more prone to missed detections. In addition, the proposed student confusion report visualization can support educational experts in making intervention decisions and adapting learning plans accordingly. All datasets and related materials will be made publicly available on our project page.
[158] DANCE: Dynamic 3D CNN Pruning: Joint Frame, Channel, and Feature Adaptation for Energy Efficiency on the Edge
Mohamed Mejri, Ashiqur Rasul, Abhijit Chatterjee
Main category: cs.CV
TL;DR: DANCE: A dynamic pruning framework for 3D CNNs that adaptively prunes frames, channels, and features based on input characteristics to maximize energy efficiency with minimal performance impact.
Details
Motivation: Modern CNNs for video/image processing lack dynamic adaptation to input complexity, leading to inefficient energy consumption. There's a need for fine-grained, input-aware pruning that maintains performance while maximizing power efficiency.
Method: Two-step approach: 1) Activation Variability Amplification (AVA) - retrain 3D CNN to increase variance of neuron activation magnitudes; 2) Adaptive Activation Pruning (AAP) - train lightweight activation controller network to dynamically prune frames, channels, and features based on first-layer output statistics.
Result: Achieves substantial MAC operation and memory access savings through sparsity. Hardware validation shows 1.37X speedup on NVIDIA Jetson Nano GPU, 2.22X on Qualcomm Snapdragon 8 Gen 1, and up to 1.47X higher energy efficiency compared to state-of-the-art.
Conclusion: DANCE enables fine-grained, input-aware dynamic pruning for 3D CNNs, achieving significant energy efficiency gains with negligible performance impact, validated on real hardware platforms.
Abstract: Modern convolutional neural networks (CNNs) are workhorses for video and image processing, but fail to adapt to the computational complexity of input samples in a dynamic manner to minimize energy consumption. In this research, we propose DANCE, a fine-grained, input-aware, dynamic pruning framework for 3D CNNs to maximize power efficiency with negligible to zero impact on performance. In the proposed two-step approach, the first step is called activation variability amplification (AVA), and the 3D CNN model is retrained to increase the variance of the magnitude of neuron activations across the network in this step, facilitating pruning decisions across diverse CNN input scenarios. In the second step, called adaptive activation pruning (AAP), a lightweight activation controller network is trained to dynamically prune frames, channels, and features of 3D convolutional layers of the network (different for each layer), based on statistics of the outputs of the first layer of the network. Our method achieves substantial savings in multiply-accumulate (MAC) operations and memory accesses by introducing sparsity within convolutional layers. Hardware validation on the NVIDIA Jetson Nano GPU and the Qualcomm Snapdragon 8 Gen 1 platform demonstrates respective speedups of 1.37X and 2.22X, achieving up to 1.47X higher energy efficiency compared to the state of the art.
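The summary does not spell out AVA's exact objective, but one plausible reading is a regularizer that rewards spread in per-channel activation magnitudes, so that a controller can later separate prunable channels from essential ones. A toy sketch under that assumption (shapes and the weighting are arbitrary):

```python
import torch

def ava_regularizer(activations, weight=0.1):
    """Hypothetical variance-amplification term: reward spread in per-channel
    activation magnitudes of a 3D conv layer, (B, C, T, H, W) input."""
    mags = activations.abs().mean(dim=(0, 2, 3, 4))   # per-channel magnitude
    return -weight * mags.var()                        # maximize variance

acts = torch.randn(2, 16, 8, 32, 32, requires_grad=True)  # (B, C, T, H, W)
task_loss = acts.pow(2).mean()        # stand-in for the real task loss
loss = task_loss + ava_regularizer(acts)
loss.backward()
print(loss.item())
```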
[159] Directing the Narrative: A Finetuning Method for Controlling Coherence and Style in Story Generation
Jianzhang Zhang, Yijing Tian, Jiwang Qu, Chuang Liu
Main category: cs.CV
TL;DR: A two-stage framework for consistent story visualization using Group-Shared Attention for identity consistency and Direct Preference Optimization for aesthetic alignment.
Details
Motivation: Existing story visualization methods struggle with subject inconsistency and identity drift when depicting complex narratives, requiring better solutions for maintaining character identity and visual style across sequential imagery.
Method: Two-stage framework: 1) Group-Shared Attention (GSA) enables lossless cross-sample information flow within attention layers to structurally encode identity correspondence across frames without external encoders. 2) Direct Preference Optimization (DPO) aligns generated outputs with human aesthetic and narrative standards using holistic preference data instead of conflicting auxiliary losses.
Result: Achieves state-of-the-art on ViStoryBench benchmark with significant gains: +10.0 in Character Identity (CIDS) and +18.7 in Style Consistency (CSD) while preserving high-fidelity generation.
Conclusion: The proposed framework effectively addresses identity consistency challenges in story visualization through architectural innovations in attention mechanisms and preference-based optimization, establishing new benchmarks for consistent narrative image generation.
Abstract: Story visualization requires generating sequential imagery that aligns semantically with evolving narratives while maintaining rigorous consistency in character identity and visual style. However, existing methodologies often struggle with subject inconsistency and identity drift, particularly when depicting complex interactions or extended narrative arcs. To address these challenges, we propose a cohesive two-stage framework designed for robust and consistent story generation. First, we introduce Group-Shared Attention (GSA), a mechanism that fosters intrinsic consistency by enabling lossless cross-sample information flow within attention layers. This allows the model to structurally encode identity correspondence across frames without relying on external encoders. Second, we leverage Direct Preference Optimization (DPO) to align generated outputs with human aesthetic and narrative standards. Unlike conventional methods that rely on conflicting auxiliary losses, our approach simultaneously enhances visual fidelity and identity preservation by learning from holistic preference data. Extensive evaluations on the ViStoryBench benchmark demonstrate that our method establishes a new state-of-the-art, significantly outperforming strong baselines with gains of +10.0 in Character Identity (CIDS) and +18.7 in Style Consistency (CSD), all while preserving high-fidelity generation.
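The DPO stage follows the standard preference objective, which is compact enough to write out: push the policy's log-likelihood margin on (preferred, rejected) pairs above that of a frozen reference model. A minimal sketch with toy sequence log-probabilities standing in for image-generation likelihoods:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (policy margin - ref margin))."""
    margin = (logp_win - logp_lose) - (ref_logp_win - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

# Toy per-sample log-probs for preferred (win) and rejected (lose) generations.
lp_w, lp_l = torch.tensor([-10.0, -12.0]), torch.tensor([-11.0, -11.5])
lr_w, lr_l = torch.tensor([-10.5, -12.2]), torch.tensor([-10.8, -11.6])
print(dpo_loss(lp_w, lp_l, lr_w, lr_l))
```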
[160] 3D MRI-Based Alzheimer’s Disease Classification Using Multi-Modal 3D CNN with Leakage-Aware Subject-Level Evaluation
Md Sifat, Sania Akter, Akif Islam, Md. Ekramul Hamid, Abu Saleh Musa Miah, Najmul Hassan, Md Abdur Rahim, Jungpil Shin
Main category: cs.CV
TL;DR: 3D multimodal CNN for Alzheimer’s disease classification using volumetric MRI data with T1 scans and tissue probability maps, achieving 72.34% accuracy on OASIS 1 dataset.
Details
Motivation: Clinical neuroimaging relies on full 3D brain structure, but many existing studies use 2D slices. Volumetric analysis may better capture spatial relationships relevant to Alzheimer's disease progression.
Method: Multimodal 3D convolutional neural network using raw OASIS 1 MRI volumes, combining structural T1 information with gray matter, white matter, and cerebrospinal fluid probability maps from FSL FAST segmentation.
Result: Achieved mean accuracy of 72.34% ± 4.66% and ROC AUC of 0.7781 ± 0.0365 using 5-fold subject-level cross-validation. GradCAM visualizations showed focus on anatomically meaningful regions like medial temporal lobe and ventricular areas.
Conclusion: The multimodal 3D framework establishes a reproducible subject-level benchmark and highlights benefits of volumetric MRI analysis for Alzheimer’s disease classification.
Abstract: Deep learning has become an important tool for Alzheimer's disease (AD) classification from structural MRI. Many existing studies analyze individual 2D slices extracted from MRI volumes, while clinical neuroimaging practice typically relies on the full three-dimensional structure of the brain. From this perspective, volumetric analysis may better capture spatial relationships among brain regions that are relevant to disease progression. Motivated by this idea, this work proposes a multimodal 3D convolutional neural network for AD classification using raw OASIS 1 MRI volumes. The model combines structural T1 information with gray matter, white matter, and cerebrospinal fluid probability maps obtained through FSL FAST segmentation in order to capture complementary neuroanatomical information. The proposed approach is evaluated on the clinically labelled OASIS 1 cohort using 5-fold subject-level cross-validation, achieving a mean accuracy of 72.34% ± 4.66% and a ROC AUC of 0.7781 ± 0.0365. GradCAM visualizations further indicate that the model focuses on anatomically meaningful regions, including the medial temporal lobe and ventricular areas that are known to be associated with Alzheimer's-related structural changes. To better understand how data representation and evaluation strategies may influence reported performance, additional diagnostic experiments were conducted on a slice-based version of the dataset under both slice-level and subject-level protocols. These observations help provide context for the volumetric results. Overall, the proposed multimodal 3D framework establishes a reproducible subject-level benchmark and highlights the potential benefits of volumetric MRI analysis for Alzheimer's disease classification.
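The multimodal input amounts to early fusion: the T1 volume and the three tissue probability maps are stacked as input channels to a 3D CNN. The paper's exact architecture is not given in this summary, so the network below is a deliberately small stand-in showing only the fusion pattern:

```python
import torch
import torch.nn as nn

# Four input channels: T1 volume plus GM/WM/CSF probability maps (e.g., FSL FAST).
model = nn.Sequential(
    nn.Conv3d(4, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(2),
    nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(32, 2),                      # AD vs. control
)

t1 = torch.randn(1, 1, 64, 64, 64)         # structural scan (toy size)
tissue = torch.rand(1, 3, 64, 64, 64)      # GM/WM/CSF probabilities in [0, 1]
x = torch.cat([t1, tissue], dim=1)         # early fusion along the channel axis
print(model(x).shape)                      # -> torch.Size([1, 2])
```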
[161] Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding
Haiyang Yan, Hongyun Zhou, Peng Xu, Xiaoxue Feng, Mengyi Liu
Main category: cs.CV
TL;DR: Symphony is a multi-agent system for long-form video understanding that decomposes complex video tasks into fine-grained subtasks using human-like cognition patterns and deep reasoning collaboration with reflection mechanisms.
Details
Motivation: Current MLLM agents struggle with long-form video understanding tasks due to high information density and extended temporal spans. Simple task decomposition and collaboration mechanisms are insufficient, and embedding-based retrieval methods risk losing key information for complex problems.
Method: Proposes Symphony, a multi-agent system that: 1) Decomposes LVU tasks into fine-grained subtasks using human cognition patterns, 2) Incorporates deep reasoning collaboration enhanced by reflection, 3) Uses VLM-based grounding to analyze tasks and assess video segment relevance for locating complex problems with implicit intentions.
Result: Achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU benchmarks, with a 5.0% improvement over prior SOTA on LVBench.
Conclusion: Symphony effectively addresses limitations in long-form video understanding by combining fine-grained task decomposition, deep reasoning collaboration with reflection, and VLM-based grounding for complex problem localization.
Abstract: Despite rapid developments and widespread applications of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the time context through embedding-based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM-based grounding approach to analyze LVU tasks and assess the relevance of video segments, which significantly enhances the ability to locate complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench. Code is available at https://github.com/Haiyang0226/Symphony.
[162] Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
Yuelin Zhang, Sijie Cheng, Chen Li, Zongzhao Li, Yuxin Huang, Yang Liu, Wenbing Huang
Main category: cs.CV
TL;DR: R²VLM: A recurrent reasoning vision-language model for long-horizon task progress estimation that processes local video snippets iteratively with an evolving Chain of Thought to track task decomposition and completion status.
Details
Motivation: Existing VLM-based methods for task progress estimation primarily use video understanding capabilities but neglect complex reasoning potential. Processing long video trajectories with VLMs is computationally prohibitive for real-world deployment.
Method: Proposes R²VLM with recurrent reasoning framework that processes local video snippets iteratively, maintaining global context through evolving Chain of Thought that explicitly records task decomposition, key steps, and completion status.
Result: Achieves strong performance and generalization, setting new state-of-the-art in long-horizon task progress estimation. Demonstrates effectiveness in downstream applications including progress-enhanced policy learning, reward modeling for RL, and proactive assistance.
Conclusion: R²VLM effectively addresses computational challenges of processing long videos while preserving essential reasoning capabilities, enabling accurate progress estimation for embodied agents in long-horizon tasks.
Abstract: Accurately estimating task progress is critical for embodied agents to plan and execute long-horizon, multi-step tasks. Despite promising advances, existing Vision-Language Model (VLM)-based methods primarily leverage their video understanding capabilities, while neglecting their complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-Language Model ($\text{R}^2$VLM). Our model features a recurrent reasoning framework that processes local video snippets iteratively, maintaining a global context through an evolving Chain of Thought (CoT). This CoT explicitly records task decomposition, key steps, and their completion status, enabling the model to reason about complex temporal dependencies. This design avoids the high cost of processing long videos while preserving essential reasoning capabilities. We train $\text{R}^2$VLM on large-scale, automatically generated datasets from ALFRED and Ego4D. Extensive experiments on progress estimation and downstream applications, including progress-enhanced policy learning, reward modeling for reinforcement learning, and proactive assistance, demonstrate that $\text{R}^2$VLM achieves strong performance and generalization, setting a new state of the art in long-horizon task progress estimation. The models and benchmarks are publicly available at https://huggingface.co/collections/zhangyuelin/r2vlm.
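The recurrent pattern itself is simple to sketch: process snippets in order and carry only the evolving CoT, never the full video history. Everything below is illustrative; `dummy_vlm` is a stand-in for the actual model call.

```python
def estimate_progress(video_snippets, task, vlm):
    """Iterate over local snippets, threading an evolving chain-of-thought
    so context grows as text rather than as raw frames."""
    cot = f"Task: {task}. Completed so far: none."
    progress = 0.0
    for snippet in video_snippets:
        # `vlm` is a hypothetical callable: (frames, context) -> (new CoT, progress).
        cot, progress = vlm(frames=snippet, context=cot)
    return progress, cot

def dummy_vlm(frames, context):
    """Toy stand-in: pretend each snippet completes one of five steps."""
    done = context.count("step") + 1
    return context + f" step{done} done;", min(1.0, done / 5)

snippets = [f"clip{i}" for i in range(5)]
progress, trace = estimate_progress(snippets, "make coffee", dummy_vlm)
print(progress, trace)
```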
[163] A Proposal-Free Query-Guided Network for Grounded Multimodal Named Entity Recognition
Hongbing Li, Jiamin Liu, Shuo Zhang, Bo Xiao
Main category: cs.CV
TL;DR: QGN is a proposal-free Query-Guided Network for Grounded Multimodal Named Entity Recognition that unifies multimodal reasoning through text guidance and cross-modal interaction to address limitations of two-step approaches using pre-trained object detectors.
Details
Motivation: Existing GMNER approaches use two-step methods with pre-trained object detectors that operate independently of textual entities, causing them to detect common objects while missing fine-grained regions needed for named entities, leading to misalignment and performance issues.
Method: Proposes a proposal-free Query-Guided Network that unifies multimodal reasoning and decoding through text guidance and cross-modal interaction, enabling accurate grounding without relying on pre-trained object detectors.
Result: Extensive experiments show QGN achieves top performance among compared GMNER models on widely used benchmarks, demonstrating accurate grounding and robust performance in open-domain scenarios.
Conclusion: QGN effectively addresses limitations of existing GMNER approaches by unifying multimodal reasoning through text-guided cross-modal interaction, achieving state-of-the-art performance without dependency on pre-trained object detectors.
Abstract: Grounded Multimodal Named Entity Recognition (GMNER) identifies named entities, including their spans and types, in natural language text and grounds them to the corresponding regions in associated images. Most existing approaches split this task into two steps: they first detect objects using a pre-trained general-purpose detector and then match named entities to the detected objects. However, these methods face a major limitation. Because pre-trained general-purpose object detectors operate independently of textual entities, they tend to detect common objects and frequently overlook specific fine-grained regions required by named entities. This misalignment between object detectors and entities introduces imprecision and can impair overall system performance. In this paper, we propose a proposal-free Query-Guided Network (QGN) that unifies multimodal reasoning and decoding through text guidance and cross-modal interaction. QGN enables accurate grounding and robust performance in open-domain scenarios. Extensive experiments demonstrate that QGN achieves top performance among compared GMNER models on widely used benchmarks.
[164] UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models
Segyu Lee, Boryeong Cho, Hojung Jung, Seokhyun An, Juhyeong Kim, Jaehyun Kwak, Yongjin Yang, Sangwon Jang, Youngrok Park, Wonjun Chang, Se-Young Yun
Main category: cs.CV
TL;DR: UniSAFE is a comprehensive benchmark for evaluating system-level safety risks in Unified Multimodal Models across 7 modality combinations, revealing critical vulnerabilities in multi-image composition and multi-turn settings.
Details
Motivation: Existing safety benchmarks are fragmented across tasks and modalities, limiting comprehensive evaluation of complex system-level vulnerabilities in emerging Unified Multimodal Models.
Method: Built UniSAFE benchmark with shared-target design projecting common risk scenarios across task-specific I/O configurations, comprising 6,802 curated instances across 7 modality combinations including conventional tasks and novel multimodal-context image generation.
Result: Evaluation of 15 state-of-the-art UMMs revealed critical vulnerabilities: elevated safety violations in multi-image composition and multi-turn settings, with image-output tasks consistently more vulnerable than text-output tasks.
Conclusion: Highlights need for stronger system-level safety alignment for UMMs, with UniSAFE providing the first comprehensive benchmark for evaluating multimodal safety risks.
Abstract: Unified Multimodal Models (UMMs) offer powerful cross-modality capabilities but introduce new safety risks not observed in single-task models. Despite their emergence, existing safety benchmarks remain fragmented across tasks and modalities, limiting the comprehensive evaluation of complex system-level vulnerabilities. To address this gap, we introduce UniSAFE, the first comprehensive benchmark for system-level safety evaluation of UMMs across 7 I/O modality combinations, spanning conventional tasks and novel multimodal-context image generation settings. UniSAFE is built with a shared-target design that projects common risk scenarios across task-specific I/O configurations, enabling controlled cross-task comparisons of safety failures. Comprising 6,802 curated instances, we use UniSAFE to evaluate 15 state-of-the-art UMMs, both proprietary and open-source. Our results reveal critical vulnerabilities across current UMMs, including elevated safety violations in multi-image composition and multi-turn settings, with image-output tasks consistently more vulnerable than text-output tasks. These findings highlight the need for stronger system-level safety alignment for UMMs. Our code and data are publicly available at https://github.com/segyulee/UniSAFE
[165] MedSAD-CLIP: Supervised CLIP with Token-Patch Cross-Attention for Medical Anomaly Detection and Segmentation
Thuy Truong Tran, Minh Kha Do, Phuc Nguyen Duy, Min Hun Lee
Main category: cs.CV
TL;DR: MedSAD-CLIP adapts CLIP for medical anomaly detection using fine-grained text-visual attention and lightweight adapters to improve lesion localization while preserving CLIP’s generalization capabilities.
Details
Motivation: Existing CLIP-based medical anomaly detection methods rely on global representations and weak supervision, producing coarse localization and limited segmentation quality. The authors aim to leverage supervised adaptation of CLIP using limited labeled abnormal data to improve lesion localization while preserving CLIP's generalization capabilities.
Method: Proposes MedSAD-CLIP with Token-Patch Cross-Attention (TPCA) for fine-grained text-visual cues, lightweight image adapters and learnable prompt tokens to adapt CLIP to the medical domain, and a Margin-based image-text Contrastive Loss to enhance global feature discrimination between normal and abnormal representations.
Result: Extensive experiments on four diverse benchmarks (Brain, Retina, Lung, and Breast datasets) demonstrate superior performance in both pixel-level segmentation and image-level classification over state-of-the-art methods.
Conclusion: Supervised CLIP adaptation shows potential as a unified and scalable paradigm for medical anomaly understanding, with MedSAD-CLIP effectively leveraging fine-grained text-visual cues while preserving CLIP’s rich semantic alignment.
Abstract: Medical anomaly detection (MAD) and segmentation play a critical role in assisting clinical diagnosis by identifying abnormal regions in medical images and localizing pathological regions. Recent CLIP-based studies are promising for anomaly detection in zero-/few-shot settings, but typically rely on global representations and weak supervision, often producing coarse localization and limited segmentation quality. In this work, we study supervised adaptation of CLIP for MAD under a realistic clinical setting where a limited yet meaningful amount of labeled abnormal data is available. Our model MedSAD-CLIP leverages fine-grained text-visual cues via the Token-Patch Cross-Attention (TPCA) to improve lesion localization while preserving the generalization capability of CLIP representations. Lightweight image adapters and learnable prompt tokens efficiently adapt the pretrained CLIP encoder to the medical domain while preserving its rich semantic alignment. Furthermore, a Margin-based image-text Contrastive Loss is designed to enhance global feature discrimination between normal and abnormal representations. Extensive experiments on four diverse benchmarks (Brain, Retina, Lung, and Breast datasets) demonstrate the effectiveness of our approach, achieving superior performance in both pixel-level segmentation and image-level classification over state-of-the-art methods. Our results highlight the potential of supervised CLIP adaptation as a unified and scalable paradigm for medical anomaly understanding. Code will be made available at https://github.com/thuy4tbn99/MedSAD-CLIP
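The margin-based image-text contrastive loss is not specified in detail here; one plausible form pulls each image embedding toward its matching class prompt ('normal' or 'abnormal') and pushes it at least a margin below the similarity to the wrong prompt. A sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def margin_contrastive(img_emb, txt_emb, labels, margin=0.2):
    """Hypothetical margin loss: positive-prompt similarity should exceed
    negative-prompt similarity by at least `margin` (hinge on the gap)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)            # row 0: normal, row 1: abnormal
    sims = img @ txt.T                             # (B, 2) cosine similarities
    idx = torch.arange(len(labels))
    pos, neg = sims[idx, labels], sims[idx, 1 - labels]
    return F.relu(neg - pos + margin).mean()

imgs = torch.randn(8, 512)                         # toy image embeddings
texts = torch.randn(2, 512)                        # toy class-prompt embeddings
y = torch.randint(0, 2, (8,))
print(margin_contrastive(imgs, texts, y))
```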
[166] FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions
Peisen Zhao, Xiaopeng Zhang, Mingxing Xu, Ruoyu Sun, Zewei Du, Dunzheng Wang, Guanghao Zheng, Haohang Xu, Zhibo Zhang, Yuhang Zhang, Yi Ai, Lin Liu, Qi Tian
Main category: cs.CV
TL;DR: FineViT is a novel vision encoder designed to overcome CLIP-based encoders’ limitations in fine-grained perception by using progressive training with dense recaptions and LLM alignment for local perception enhancement.
Details
Motivation: Current MLLMs face performance bottlenecks from visual encoders, particularly CLIP-based ones that lose visual details due to low-resolution pretraining and rely on noisy web-crawled image-text pairs, limiting their ability for dense spatial tasks and fine-grained perception.
Method: Progressive training paradigm: 1) Train encoder from scratch at high native resolution on billions of global recaptioned image-text pairs to establish detail-rich semantic foundation; 2) Enhance local perception through LLM alignment using curated FineCap-450M dataset with over 450 million high-quality local captions.
Result: FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders like SigLIP2 and Qwen-ViT when integrated into MLLMs.
Conclusion: FineViT effectively addresses visual encoder bottlenecks in MLLMs through progressive training with dense recaptions and LLM alignment, establishing a powerful new baseline for fine-grained visual perception that could significantly improve multimodal understanding capabilities.
Abstract: While Multimodal Large Language Models (MLLMs) have experienced rapid advancements, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks due to the loss of visual details caused by low-resolution pretraining and the reliance on noisy, coarse web-crawled image-text pairs. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. By replacing coarse web data with dense recaptions, we systematically mitigate information loss through a progressive training paradigm: first, the encoder is trained from scratch at a high native resolution on billions of global recaptioned image-text pairs, establishing a robust, detail-rich semantic foundation. Subsequently, we further enhance its local perception through LLM alignment, utilizing our curated FineCap-450M dataset that comprises over $450$ million high-quality local captions. Extensive experiments validate the effectiveness of the progressive strategy. FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT when integrated into MLLMs. We hope FineViT could serve as a powerful new baseline for fine-grained visual perception.
[167] EvoGuard: An Extensible Agentic RL-based Framework for Practical and Evolving AI-Generated Image Detection
Chenyang Zhu, Maorong Wang, Jun Liu, Ching-Chun Chang, Isao Echizen
Main category: cs.CV
TL;DR: EvoGuard is an agentic framework for AI-generated image detection that orchestrates multiple MLLM and non-MLLM detectors through dynamic tool selection and multi-turn reasoning, eliminating the need for expensive fine-grained annotations.
Details
Motivation: The proliferation of AI-generated images creates misinformation risks, but current detection methods relying on low-level features or MLLMs suffer from limited extensibility and require expensive training data annotations.
Method: Proposes EvoGuard framework that encapsulates SOTA MLLM and non-MLLM detectors as callable tools, coordinates them through capability-aware dynamic orchestration with autonomous planning and reflection, and uses GRPO-based Agentic Reinforcement Learning optimized with only binary labels.
Result: EvoGuard achieves state-of-the-art accuracy while mitigating bias between positive/negative samples, allows plug-and-play integration of new detectors to boost performance without retraining, and offers practical solution to evolving AIGI threats.
Conclusion: EvoGuard provides a highly practical, long-term solution for AIGI detection by leveraging agentic capabilities to orchestrate heterogeneous detectors, eliminating reliance on fine-grained annotations while maintaining extensibility.
Abstract: The rapid proliferation of AI-Generated Images (AIGIs) has introduced severe risks of misinformation, making AIGI detection a critical yet challenging task. While traditional detection paradigms mainly rely on low-level features, recent research increasingly focuses on leveraging the general understanding ability of Multimodal Large Language Models (MLLMs) to achieve better generalization, but still suffer from limited extensibility and expensive training data annotations. To better address complex and dynamic real-world environments, we propose EvoGuard, a novel agentic framework for AIGI detection. It encapsulates various state-of-the-art (SOTA) off-the-shelf MLLM and non-MLLM detectors as callable tools, and coordinates them through a capability-aware dynamic orchestration mechanism. Empowered by the agent’s capacities for autonomous planning and reflection, it intelligently selects suitable tools for given samples, reflects intermediate results, and decides the next action, reaching a final conclusion through multi-turn invocation and reasoning. This design effectively exploits the complementary strengths among heterogeneous detectors, transcending the limits of any single model. Furthermore, optimized by a GRPO-based Agentic Reinforcement Learning algorithm using only low-cost binary labels, it eliminates the reliance on fine-grained annotations. Extensive experiments demonstrate that EvoGuard achieves SOTA accuracy while mitigating the bias between positive and negative samples. More importantly, it allows the plug-and-play integration of new detectors to boost overall performance in a train-free manner, offering a highly practical, long-term solution to ever-evolving AIGI threats. Source code will be publicly available upon acceptance.
[168] OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery
Yiwen Zhao, Ce Zheng, Yufu Wang, Hsueh-Han Daniel Yang, Liting Wen, Laszlo A. Jeni
Main category: cs.CV
TL;DR: OnlineHMR: A fully online framework for 3D human mesh recovery from monocular videos that supports streaming inference with causal processing, temporal consistency, and real-time efficiency for interactive applications.
Details
Motivation: Most existing human mesh recovery methods are offline and rely on future frames or global optimization, limiting their applicability in interactive scenarios like AR/VR and telepresence that require real-time feedback and perception-action loops.
Method: Proposes a two-branch architecture with causal key-value cache design and sliding-window learning for streaming inference, combined with human-centric incremental SLAM for online world-grounded alignment with physically plausible trajectory correction.
Result: Achieves performance comparable to existing chunk-based approaches on the EMDB benchmark and highly dynamic custom videos, while uniquely supporting online processing with system-level causality, faithfulness, temporal consistency, and efficiency.
Conclusion: OnlineHMR successfully addresses the limitations of offline HMR methods by providing a fully online framework that enables real-time 3D human mesh recovery suitable for interactive applications like AR/VR and telepresence.
Abstract: Human mesh recovery (HMR) models the 3D human body from monocular videos, with recent works extending it to world-coordinate human trajectory and motion reconstruction. However, most existing methods remain offline, relying on future frames or global optimization, which limits their applicability in interactive feedback and perception-action loop scenarios such as AR/VR and telepresence. To address this, we propose OnlineHMR, a fully online framework that jointly satisfies four essential criteria of online processing, namely system-level causality, faithfulness, temporal consistency, and efficiency. Built upon a two-branch architecture, OnlineHMR enables streaming inference via a causal key-value cache design and a curated sliding-window learning strategy. Meanwhile, a human-centric incremental SLAM provides online world-grounded alignment under physically plausible trajectory correction. Experimental results show that our method achieves performance comparable to existing chunk-based approaches on the standard EMDB benchmark and highly dynamic custom videos, while uniquely supporting online processing. Page and code are available at https://tsukasane.github.io/Video-OnlineHMR/.
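The causal key-value cache with a sliding window is a familiar streaming pattern and easy to sketch: keep at most `window` per-frame key/value entries, evicting the oldest, so memory and attention cost stay bounded. A toy single-head version (the class, dimensions, and window size are illustrative):

```python
from collections import deque

import torch

class SlidingKVCache:
    """Causal K/V cache with a fixed window: append per-frame keys/values,
    evict the oldest so streaming memory and compute stay bounded."""
    def __init__(self, window=16):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        K = torch.stack(list(self.keys))            # (window, d)
        V = torch.stack(list(self.values))
        attn = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
        return attn @ V

cache = SlidingKVCache(window=4)
for t in range(10):                                 # streaming frames
    k = v = torch.randn(64)
    cache.append(k, v)
    out = cache.attend(torch.randn(1, 64))          # sees only the last 4 frames
print(out.shape)                                    # -> torch.Size([1, 64])
```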
[169] MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval
Xuri Ge, Chunhao Wang, Xindi Wang, Zheyun Qin, Zhumin Chen, Xin Xin
Main category: cs.CV
TL;DR: MCoT-MVS uses multi-modal chain-of-thought reasoning with MLLMs to guide visual attention selection for composed image retrieval, achieving SOTA performance on CIR benchmarks.
Details
Motivation: Existing composed image retrieval methods struggle to extract correct semantic cues from reference images that reflect user intent under textual modifications, suffering from irrelevant visual noise interference.
Method: Uses MLLM for chain-of-thought reasoning on multimodal input to generate retained/removed/target-inferred texts, which guide two visual attention modules to extract patch-level and instance-level semantics, then fuses these with modified text via weighted hierarchical combination.
Result: Achieves new state-of-the-art performance on CIRR and FashionIQ benchmarks, consistently outperforming existing methods.
Conclusion: The proposed MCoT-MVS effectively integrates MLLM reasoning with multi-level visual attention selection for improved composed image retrieval by better understanding user intent.
Abstract: Composed Image Retrieval (CIR) aims to retrieve target images based on a reference image and modified texts. However, existing methods often struggle to extract the correct semantic cues from the reference image that best reflect the user’s intent under textual modification prompts, resulting in interference from irrelevant visual noise. In this paper, we propose a novel Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning (MCoT-MVS) for CIR, integrating attention-aware multi-level vision features guided by reasoning cues from a multi-modal large language model (MLLM). Specifically, we leverage an MLLM to perform chain-of-thought reasoning on the multimodal composed input, generating the retained, removed, and target-inferred texts. These textual cues subsequently guide two reference visual attention selection modules to selectively extract discriminative patch-level and instance-level semantics from the reference image. Finally, to effectively fuse these multi-granular visual cues with the modified text and the imagined target description, we design a weighted hierarchical combination module to align the composed query with target images in a unified embedding space. Extensive experiments on two CIR benchmarks, namely CIRR and FashionIQ, demonstrate that our approach consistently outperforms existing methods and achieves new state-of-the-art performance. Code and trained models are publicly released.
[170] Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift
Zhihua Wei, Qiang Li, Jian Ruan, Zhenxin Qin, Leilei Wen, Dongrui Liu, Wen Shen
Main category: cs.CV
TL;DR: VLMs have weakened safety alignment with visual modality; jailbreaks occur not from failure to recognize harm but from visual modality shifting representations toward jailbreak states; proposed defense removes jailbreak-related shift.
Details
Motivation: Large vision-language models show weakened safety alignment when visual modality is integrated, with images substantially increasing jailbreak success rates even when text prompts contain explicit harmful intent.
Method: Observed that VLMs distinguish benign from harmful inputs in representation space, identified jailbreak direction in representation space, defined jailbreak-related shift as component of image-induced representation shift along this direction, and proposed JRS-Rem defense method that removes this shift at inference time.
Result: Jailbreak-related shift reliably characterizes jailbreak behavior, providing unified explanation for diverse jailbreak scenarios; JRS-Rem defense provides strong protection across multiple scenarios while preserving performance on benign tasks.
Conclusion: Jailbreaks in VLMs arise from visual modality shifting representations toward specific jailbreak states rather than failure to recognize harmful intent; removing jailbreak-related shift effectively enhances VLM safety.
Abstract: Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method that enhances VLM safety by removing the jailbreak-related shift (JRS-Rem) at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.
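The defense itself reduces to a projection: subtract from each hidden state its component along the unit jailbreak direction. A minimal sketch (how the direction is estimated is not detailed in this summary; the comment suggests one common recipe):

```python
import torch

def remove_jailbreak_shift(hidden, direction):
    """Project hidden states off the jailbreak direction at inference time:
    h' = h - (h . d_hat) d_hat, where d_hat is the unit jailbreak direction."""
    d = direction / direction.norm()
    coeff = hidden @ d                        # component along the direction
    return hidden - coeff.unsqueeze(-1) * d

# Toy example: a batch of token representations and an estimated direction
# (e.g., the difference of mean jailbreak vs. refusal activations).
h = torch.randn(8, 4096)
jail_dir = torch.randn(4096)
h_clean = remove_jailbreak_shift(h, jail_dir)
print((h_clean @ (jail_dir / jail_dir.norm())).abs().max())  # ~0: removed
```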
[171] Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes
Umangi Jain, Vladimir Kim, Matheus Gadelha, Igor Gilitschenski, Zhiqin Chen
Main category: cs.CV
TL;DR: Material Magic Wand: A tool for automatic material-aware part grouping in untextured 3D meshes using learned embeddings and contrastive learning.
Details
Motivation: Manual material assignment for repeated structures in 3D meshes (like scales, windows) is tedious and time-consuming, requiring piece-by-piece selection of geometrically varied but material-consistent parts.
Method: Proposes a part encoder generating material-aware embeddings considering local geometry and global context, trained with supervised contrastive loss to cluster material-consistent parts while separating different materials.
Result: Introduces a benchmark dataset of 100 shapes with 241 queries, shows effectiveness through experiments, and demonstrates practical value in interactive material assignment applications.
Conclusion: Material Magic Wand enables efficient material-aware part grouping in 3D meshes, reducing manual effort in material assignment workflows.
Abstract: We introduce the problem of material-aware part grouping in untextured meshes. Many real-world shapes, such as scales of pinecones or windows of buildings, contain repeated structures that share the same material but exhibit geometric variations. When assigning materials to such meshes, these repeated parts often require piece-by-piece manual identification and selection, which is tedious and time-consuming. To address this, we propose Material Magic Wand, a tool that allows artists to select part groups based on their estimated material properties – when one part is selected, our algorithm automatically retrieves all other parts likely to share the same material. The key component of our approach is a part encoder that generates a material-aware embedding for each 3D part, accounting for both local geometry and global context. We train our model with a supervised contrastive loss that brings embeddings of material-consistent parts closer while separating those of different materials; therefore, part grouping can be achieved by retrieving embeddings that are close to the embedding of the selected part. To benchmark this task, we introduce a curated dataset of 100 shapes with 241 part-level queries. We verify the effectiveness of our method through extensive experiments and demonstrate its practical value in an interactive material assignment application.
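The interaction model is simple once the embeddings exist: normalize, compare by cosine similarity to the clicked part, and return everything above a threshold. A toy sketch with synthetic embeddings (the threshold, dimensions, and function name are arbitrary):

```python
import torch
import torch.nn.functional as F

def magic_wand_select(part_embeddings, selected_idx, threshold=0.8):
    """Retrieve all parts whose material-aware embedding is close (cosine)
    to the embedding of the part the artist clicked."""
    emb = F.normalize(part_embeddings, dim=-1)
    sims = emb @ emb[selected_idx]
    return (sims >= threshold).nonzero(as_tuple=True)[0].tolist()

parts = torch.randn(50, 128)                 # one embedding per mesh part
parts[10:20] = parts[10] + 0.01 * torch.randn(10, 128)  # consistent group
print(magic_wand_select(parts, selected_idx=10))        # recovers 10..19
```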
[172] Shot-Aware Frame Sampling for Video Understanding
Mengyu Zhao, Di Fu, Yongyu Xie, Jiaxing Zhang, Zhigang Yuan, Shirin Jalali, Yong Cao
Main category: cs.CV
TL;DR: InfoShot is a task-agnostic frame sampler for long-video understanding that partitions videos into shots and selects complementary keyframes to balance broad coverage with critical short events.
Details
Motivation: Existing video frame samplers struggle to balance broad video coverage with brief but critical events when only a small number of frames can be retained, leading to unreliable downstream predictions for Vision-Language Models.
Method: InfoShot partitions videos into semantically consistent shots, then selects two complementary keyframes from each shot: one representing main content and one capturing unusual within-shot changes, guided by an information-theoretic objective.
Result: InfoShot improves anomaly hit rate and downstream Video-QA accuracy under frame number constraints, while matching or outperforming strong baselines on standard video understanding benchmarks.
Conclusion: InfoShot provides an effective task-agnostic frame sampling approach for long-video understanding that balances broad coverage with critical short events without requiring retraining.
Abstract: Video frame sampling is essential for efficient long-video understanding with Vision-Language Models (VLMs), since dense inputs are costly and often exceed context limits. Yet when only a small number of frames can be retained, existing samplers often fail to balance broad video coverage with brief but critical events, which can lead to unreliable downstream predictions. To address this issue, we present InfoShot, a task-agnostic, shot-aware frame sampler for long-video understanding. InfoShot first partitions a video into semantically consistent shots, and then selects two complementary keyframes from each shot: one to represent the main content and one to capture unusual within-shot changes. This design is guided by an information-theoretic objective that encourages the sampled set to retain high information about both shot structure and sparse within-shot deviations. In this way, it improves the chance of preserving both overall video context and short decision-critical moments without requiring any retraining. To better evaluate such short-lived events, we further introduce SynFlash, a synthetic benchmark with controllable sub-second anomaly patterns and frame-level ground truth, and we also evaluate InfoShot on existing anomaly datasets and general video understanding tasks. Experiments show that InfoShot improves anomaly hit rate and downstream Video-QA accuracy under frame number constraints, while matching or outperforming strong baselines on standard video understanding benchmarks.
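A toy version of the per-shot selection rule: pick the frame nearest the shot mean as the "main content" frame and the farthest one as the "deviation" frame. This stand-in ignores the paper's information-theoretic objective and is only meant to make the two-keyframe idea concrete.

    import numpy as np

    def sample_shot(frame_feats):
        """frame_feats: (T, D) features of one shot; returns (main_idx, deviation_idx)."""
        center = frame_feats.mean(axis=0)
        dists = np.linalg.norm(frame_feats - center, axis=1)
        return int(dists.argmin()), int(dists.argmax())  # representative, then outlier

    feats = np.random.default_rng(2).normal(size=(48, 256))
    print(sample_shot(feats))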
[173] Joint Degradation-Aware Arbitrary-Scale Super-Resolution for Variable-Rate Extreme Image Compression
Xinning Chai, Zhengxue Cheng, Xin Li, Rong Xie, Li Song
Main category: cs.CV
TL;DR: ASSR-EIC: A novel diffusion-based extreme image compression framework using arbitrary-scale super-resolution for variable-rate compression within a single model
Details
Motivation: Current diffusion-based extreme image compression methods require separate models for each bitrate (computationally expensive), and existing super-resolution approaches struggle at ultra-low bitrates with fixed scaling factors that prevent flexible adaptation.
Method: Proposes ASSR-EIC framework with: 1) arbitrary-scale downsampling module for controllable rate reduction at encoder, 2) diffusion-based joint degradation-aware ASSR decoder for rate-adaptive reconstruction, 3) compression-rescaling aware diffusion prior with global adaptor and local modulator for fine-grained bitrate-adaptive restoration, 4) dual semantic-enhanced design.
Result: Extensive experiments show state-of-the-art performance in extreme image compression while supporting flexible bitrate control and adaptive rate-dependent reconstruction.
Conclusion: ASSR-EIC successfully addresses limitations of existing methods by enabling variable-rate extreme image compression within a single model through arbitrary-scale super-resolution, achieving high fidelity across diverse compression settings.
Abstract: Recent diffusion-based extreme image compression methods have demonstrated remarkable performance at ultra-low bitrates. However, most approaches require training separate diffusion models for each target bitrate, resulting in substantial computational overhead and hindering practical deployment. Meanwhile, recent studies have shown that joint super-resolution can serve as an effective approach for enhancing low-bitrate reconstruction. However, when moving toward ultra-low bitrate regimes, these methods struggle due to severe information loss, and their reliance on fixed super-resolution scales prevents flexible adaptation across diverse bitrates. To address these limitations, we propose ASSR-EIC, a novel image compression framework that leverages arbitrary-scale super-resolution (ASSR) to support variable-rate extreme image compression (EIC). An arbitrary-scale downsampling module is introduced at the encoder side to provide controllable rate reduction, while a diffusion-based, joint degradation-aware ASSR decoder enables rate-adaptive reconstruction within a single model. We exploit the compression- and rescaling-aware diffusion prior to guide the reconstruction, yielding high fidelity and high realism restoration across diverse compression and rescaling settings. Specifically, we design a global compression-rescaling adaptor that offers holistic guidance for rate adaptation, and a local compression-rescaling modulator that dynamically balances generative and fidelity-oriented behaviors to achieve fine-grained, bitrate-adaptive detail restoration. To further enhance reconstruction quality, we introduce a dual semantic-enhanced design. Extensive experiments demonstrate that ASSR-EIC delivers state-of-the-art performance in extreme image compression while simultaneously supporting flexible bitrate control and adaptive rate-dependent reconstruction.
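The encoder-side knob is conceptually simple: a continuous downsampling factor that trades resolution for rate before encoding, letting one decoder cover many bitrates. A sketch of that knob alone (the scale-to-bitrate mapping and the actual codec are not modeled here):

    import torch
    import torch.nn.functional as F

    def downsample_for_rate(img, scale):
        """img: (B, C, H, W); scale in (0, 1], e.g. 0.25 for extreme compression."""
        _, _, H, W = img.shape
        size = (max(1, int(H * scale)), max(1, int(W * scale)))
        return F.interpolate(img, size=size, mode="bicubic", align_corners=False)

    x = torch.rand(1, 3, 256, 256)
    print(downsample_for_rate(x, 0.3).shape)  # torch.Size([1, 3, 76, 76])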
[174] Stereo World Model: Camera-Guided Stereo Video Generation
Yang-Tian Sun, Zehuan Huang, Yifan Niu, Lin Ma, Yan-Pei Cao, Yuewen Ma, Xiaojuan Qi
Main category: cs.CV
TL;DR: StereoWorld is a camera-conditioned stereo world model that learns appearance and binocular geometry for end-to-end stereo video generation, improving stereo consistency and efficiency over monocular approaches.
Details
Motivation: Current approaches for stereo video generation often rely on monocular RGB or RGBD methods followed by conversion, which can lead to inconsistencies and computational inefficiencies. The authors aim to create a unified model that directly learns binocular geometry while maintaining view and time consistency.
Method: Two key designs: 1) Unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding for view/time consistency while preserving pretrained video priors, and 2) Stereo-aware attention decomposition that factors 4D attention into 3D intra-view attention plus horizontal row attention, leveraging epipolar prior for disparity-aligned correspondences with lower compute.
Result: StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over monocular-then-convert pipelines, achieving more than 3x faster generation with 5% gain in viewpoint consistency. Enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and supports long-video distillation.
Conclusion: StereoWorld demonstrates that joint learning of appearance and binocular geometry in a unified stereo world model significantly outperforms conversion-based approaches, enabling efficient and consistent stereo video generation with applications in VR, embodied AI, and interactive synthesis.
Abstract: We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.
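A toy sketch of the attention factorization, assuming a generic attention module: each view first attends over its own spatiotemporal tokens, then tokens on the same image row attend across both views, where disparity-aligned correspondences live. Shapes and the wrapper are illustrative, not the paper's code.

    import torch

    def decomposed_stereo_attention(x, attn):
        """x: (B, V=2, T, H, W, C); attn maps (B*, N, C) -> (B*, N, C)."""
        B, V, T, H, W, C = x.shape
        # (1) intra-view attention: each view attends over its own tokens
        intra = attn(x.reshape(B * V, T * H * W, C)).reshape(B, V, T, H, W, C)
        # (2) row attention across the two views (the epipolar prior)
        rows = intra.permute(0, 2, 3, 1, 4, 5).reshape(B * T * H, V * W, C)
        return attn(rows).reshape(B, T, H, V, W, C).permute(0, 3, 1, 2, 4, 5)

    mha = torch.nn.MultiheadAttention(64, 4, batch_first=True)
    wrap = lambda t: mha(t, t, t)[0]                        # self-attention wrapper
    y = decomposed_stereo_attention(torch.randn(1, 2, 4, 8, 8, 64), wrap)
    print(y.shape)  # torch.Size([1, 2, 4, 8, 8, 64])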
[175] Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys
Main category: cs.CV
TL;DR: Loc3R-VLM enhances 2D Vision-Language Models with 3D understanding from monocular video using global layout reconstruction and explicit situation modeling with camera pose priors.
Details
Motivation: Current MLLMs struggle with spatial understanding and viewpoint-aware reasoning despite progress in vision-language connections. Existing approaches add geometric cues to inputs rather than teaching explicit 3D reasoning.
Method: Framework that equips 2D VLMs with 3D understanding from monocular video. Uses two joint objectives: global layout reconstruction for holistic scene structure and explicit situation modeling for egocentric perspective. Leverages lightweight camera pose priors from a pre-trained 3D foundation model for geometric consistency.
Result: Achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks.
Conclusion: The spatial supervision framework enables strong 3D understanding in VLMs, demonstrating that explicit spatial objectives can effectively ground both perception and language in 3D context.
Abstract: Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm
[176] VisionNVS: Self-Supervised Inpainting for Novel View Synthesis under the Virtual-Shift Paradigm
Hongbo Lu, Liang Yao, Chenghao He, Fan Liu, Wenlong Liao, Tao He, Pai Peng
Main category: cs.CV
TL;DR: VisionNVS reformulates novel view synthesis for autonomous driving from extrapolation to self-supervised inpainting using virtual shifts and pseudo-3D seam synthesis, achieving state-of-the-art results without LiDAR.
Details
Motivation: The fundamental bottleneck in Novel View Synthesis for autonomous driving is the supervision gap - models must synthesize unseen views during inference but lack ground truth images for these shifted poses during training. Previous approaches suffer from domain gaps between training and inference.
Method: 1) Virtual-Shift strategy: Uses monocular depth proxies to simulate occlusion patterns and map them onto original views, transforming view synthesis into a self-supervised inpainting task. 2) Pseudo-3D Seam Synthesis: Integrates visual data from adjacent cameras during training to explicitly model real-world photometric discrepancies and calibration errors for spatial consistency.
Result: VisionNVS achieves superior geometric fidelity and visual quality compared to LiDAR-dependent baselines, offering a robust solution for scalable driving simulation without requiring LiDAR data.
Conclusion: The camera-only VisionNVS framework fundamentally reformulates view synthesis from an ill-posed extrapolation problem into a self-supervised inpainting task, eliminating the domain gap inherent in previous approaches and providing a scalable solution for autonomous driving simulation.
Abstract: A fundamental bottleneck in Novel View Synthesis (NVS) for autonomous driving is the inherent supervision gap on novel trajectories: models are tasked with synthesizing unseen views during inference, yet lack ground truth images for these shifted poses during training. In this paper, we propose VisionNVS, a camera-only framework that fundamentally reformulates view synthesis from an ill-posed extrapolation problem into a self-supervised inpainting task. By introducing a "Virtual-Shift" strategy, we use monocular depth proxies to simulate occlusion patterns and map them onto the original view. This paradigm shift allows the use of raw, recorded images as pixel-perfect supervision, effectively eliminating the domain gap inherent in previous approaches. Furthermore, we address spatial consistency through a Pseudo-3D Seam Synthesis strategy, which integrates visual data from adjacent cameras during training to explicitly model real-world photometric discrepancies and calibration errors. Experiments demonstrate that VisionNVS achieves superior geometric fidelity and visual quality compared to LiDAR-dependent baselines, offering a robust solution for scalable driving simulation.
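The virtual-shift trick can be sketched with a crude parallax model: splat each pixel to a laterally shifted virtual camera using monocular depth and record the holes, which become self-supervised inpainting targets with the original image as ground truth. The shift model below is deliberately simplistic and purely illustrative.

    import numpy as np

    def virtual_shift_mask(depth, baseline=8.0):
        """depth: (H, W); returns a boolean mask of disoccluded pixels."""
        H, W = depth.shape
        hit = np.zeros((H, W), dtype=bool)
        disparity = (baseline / np.maximum(depth, 1e-6)).astype(int)  # nearer -> larger shift
        cols = np.arange(W)[None, :] + disparity                      # shifted column index
        valid = (cols >= 0) & (cols < W)
        rows = np.repeat(np.arange(H)[:, None], W, axis=1)
        hit[rows[valid], cols[valid]] = True
        return ~hit                                                   # holes = training targets

    depth = np.random.default_rng(3).uniform(1.0, 10.0, size=(4, 6))
    print(virtual_shift_mask(depth))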
[177] AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement
Siqi Pei, Liang Tang, Tiaonan Duan, Long Chen, Shuxian Li, Kaer Huang, Yanzhe Jing, Yiqiang Yan, Bo Zhang, Chenghao Jiang, Borui Zhang, Jiwen Lu
Main category: cs.CV
TL;DR: AdaZoom-GUI: Adaptive zoom-based framework for GUI grounding that improves localization accuracy and instruction understanding through instruction refinement and conditional zoom-in strategies.
Details
Motivation: GUI grounding is challenging due to high-resolution images, small UI elements, and ambiguous user instructions. Current methods struggle with precise element localization in complex GUI environments.
Method: 1) Instruction refinement module rewrites natural language commands into explicit descriptions; 2) Conditional zoom-in strategy performs second-stage inference on predicted small elements; 3) Trained using Group Relative Policy Optimization (GRPO) to predict click coordinates and bounding boxes.
Result: Achieves state-of-the-art performance on public benchmarks among models with comparable or larger parameter sizes, demonstrating effectiveness for high-resolution GUI understanding and practical GUI agent deployment.
Conclusion: AdaZoom-GUI effectively addresses GUI grounding challenges through adaptive zooming and instruction refinement, enabling more accurate and efficient automated GUI interaction.
Abstract: GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding. Our approach introduces an instruction refinement module that rewrites natural language commands into explicit and detailed descriptions, allowing the grounding model to focus on precise element localization. In addition, we design a conditional zoom-in strategy that selectively performs a second-stage inference on predicted small elements, improving localization accuracy while avoiding unnecessary computation and context loss on simpler cases. To support this framework, we construct a high-quality GUI grounding dataset and train the grounding model using Group Relative Policy Optimization (GRPO), enabling the model to predict both click coordinates and element bounding boxes. Experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance among models with comparable or even larger parameter sizes, highlighting its effectiveness for high-resolution GUI understanding and practical GUI agent deployment.
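The conditional zoom-in is a simple control flow: ground once, and only when the predicted element is small, re-run on an enlarged crop and map the refined box back to full-image coordinates. In this sketch `ground` is a hypothetical model call, and the size threshold and zoom factor are invented.

    import numpy as np

    def ground_with_zoom(ground, image, instruction, min_frac=0.005, zoom=3.0):
        x1, y1, x2, y2 = ground(image, instruction)          # first-pass bounding box
        H, W = image.shape[:2]
        if (x2 - x1) * (y2 - y1) / (W * H) >= min_frac:
            return x1, y1, x2, y2                            # big enough: accept
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2                # center a zoomed crop
        w, h = zoom * (x2 - x1), zoom * (y2 - y1)
        cx1, cy1 = max(0, int(cx - w / 2)), max(0, int(cy - h / 2))
        cx2, cy2 = min(W, int(cx + w / 2)), min(H, int(cy + h / 2))
        rx1, ry1, rx2, ry2 = ground(image[cy1:cy2, cx1:cx2], instruction)
        return cx1 + rx1, cy1 + ry1, cx1 + rx2, cy1 + ry2    # back to full-image coords

    stub = lambda im, q: (10, 10, 14, 14)                    # hypothetical model stub
    print(ground_with_zoom(stub, np.zeros((1080, 1920, 3)), "open settings"))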
[178] Toward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis
Rui Hong, Jana Kosecka
Main category: cs.CV
TL;DR: A study on 3D sign language motion generation using diffusion models with phonological attribute conditioning, showing that natural language translation of ASL-LEX annotations improves CLIP-based conditioning and outperforms state-of-the-art methods.
Details
Motivation: Generating natural, correct, and visually smooth 3D avatar sign language motion from text inputs remains challenging. The paper explores how phonological attribute conditioning (hand shape, location, movement from ASL-LEX 2.0) can improve sign language motion generation.
Method: Establishes a diffusion baseline using a Human Motion MDM-style diffusion model with SMPL-X representation. Systematically studies text conditioning using different text encoders (CLIP vs. T5), conditioning modes (gloss-only vs. gloss+phonological attributes), and attribute notation formats (symbolic vs. natural language).
Result: The diffusion baseline outperforms SignAvatar (state-of-the-art CVAE method) on gloss discriminability metrics. Translating symbolic ASL-LEX notations to natural language is necessary for effective CLIP-based attribute conditioning, while T5 is largely unaffected. The best variant (CLIP with mapped attributes) outperforms SignAvatar across all metrics.
Conclusion: Input representation is a critical factor for text-encoder-based attribute conditioning. The findings motivate structured conditioning approaches where gloss and phonological attributes are encoded through independent pathways for improved sign language motion generation.
Abstract: Generating natural, correct, and visually smooth 3D avatar sign language motion conditioned on text inputs remains highly challenging. In this work, we train a generative model of 3D body motion and explore the role of phonological attribute conditioning for sign language motion generation, using ASL-LEX 2.0 annotations such as hand shape, hand location and movement. We first establish a strong diffusion baseline using a Human Motion MDM-style diffusion model with SMPL-X representation, which outperforms SignAvatar, a state-of-the-art CVAE method, on gloss discriminability metrics. We then systematically study the role of text conditioning using different text encoders (CLIP vs. T5), conditioning modes (gloss-only vs. gloss+phonological attributes), and attribute notation format (symbolic vs. natural language). Our analysis reveals that translating symbolic ASL-LEX notations to natural language is a necessary condition for effective CLIP-based attribute conditioning, while T5 is largely unaffected by this translation. Furthermore, our best-performing variant (CLIP with mapped attributes) outperforms SignAvatar across all metrics. These findings highlight input representation as a critical factor for text-encoder-based attribute conditioning, and motivate structured conditioning approaches where gloss and phonological attributes are encoded through independent pathways.
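The notation-to-text finding is easy to illustrate: symbolic attribute codes are mapped to short natural-language phrases before CLIP encoding. The codes and templates below are invented for illustration; the paper's actual mapping may differ.

    # Hypothetical symbolic-code -> natural-language mapping for CLIP conditioning.
    ATTRIBUTE_TEXT = {
        "handshape:5": "an open hand with all five fingers spread",
        "location:chin": "signed near the chin",
        "movement:circular": "moving in a circular path",
    }

    def prompt_for(gloss, attributes):
        """Compose a CLIP-friendly sentence from a gloss and symbolic attributes."""
        phrases = [ATTRIBUTE_TEXT[a] for a in attributes]
        return f"the sign for '{gloss}', " + ", ".join(phrases)

    print(prompt_for("MOTHER", ["handshape:5", "location:chin"]))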
[179] Harnessing the Power of Foundation Models for Accurate Material Classification
Qingran Lin, Fengwei Yang, Chaolun Zhu
Main category: cs.CV
TL;DR: A framework that uses vision-language models to improve material classification by generating synthetic training data and incorporating VLM priors through joint fine-tuning.
Details
Motivation: Material classification is important for computer vision/graphics but suffers from limited annotated data. Existing VLM-based solutions still underperform for material recognition tasks.
Method: Two key innovations: (1) image generation and auto-labeling pipeline creating material-centric synthetic dataset with fused object semantics and material attributes; (2) prior incorporation strategy distilling VLM information with joint fine-tuning of vision foundation model and VLM-derived priors.
Result: Extensive experiments show significant improvements on multiple datasets. Synthetic dataset effectively captures real-world material characteristics, and VLM priors significantly enhance final performance.
Conclusion: The proposed framework effectively addresses data limitations in material classification by leveraging foundation models, with code and dataset to be released.
Abstract: Material classification has emerged as a critical task in computer vision and graphics, supporting the assignment of accurate material properties to a wide range of digital and real-world applications. While traditionally framed as an image classification task, this domain faces significant challenges due to the scarcity of annotated data, limiting the accuracy and generalizability of trained models. Recent advances in vision-language foundation models (VLMs) offer promising avenues to address these issues, yet existing solutions leveraging these models still exhibit unsatisfactory results in material recognition tasks. In this work, we propose a novel framework that effectively harnesses foundation models to overcome data limitations and enhance classification accuracy. Our method integrates two key innovations: (a) a robust image generation and auto-labeling pipeline that creates a diverse and high-quality training dataset with material-centric images, and automatically assigns labels by fusing object semantics and material attributes in text prompts; (b) a prior incorporation strategy to distill information from VLMs, combined with a joint fine-tuning method that optimizes a pre-trained vision foundation model alongside VLM-derived priors, preserving broad generalizability while adapting to material-specific features. Extensive experiments demonstrate significant improvements on multiple datasets. We show that our synthetic dataset effectively captures the characteristics of real-world materials, and the integration of priors from vision-language models significantly enhances the final performance. The source code and dataset will be released.
[180] VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection
Chupeng Liu, Jiyong Rao, Shangquan Sun, Runkai Zhao, Weidong Cai
Main category: cs.CV
TL;DR: VirPro introduces adaptive probabilistic prompt learning for weakly supervised monocular 3D detection, using scene-aware visual-textual embeddings to capture visual diversity and improve semantic coherence.
Details
Motivation: Hand-crafted textual descriptions fail to capture visual diversity across scenes in monocular 3D detection, limiting scene-aware representation learning despite linguistic cues being useful as weak supervision.
Method: Proposes Visual-referred Probabilistic Prompt Learning (VirPro) with Adaptive Prompt Bank (APB) for instance-conditioned prompts, Multi-Gaussian Prompt Modeling (MGPM) to incorporate visual features into textual embeddings, and RoI-level contrastive matching for modality alignment.
Result: Extensive experiments on KITTI benchmark show consistent performance gains, achieving up to 4.8% average precision improvement over baseline methods.
Conclusion: VirPro effectively addresses visual diversity limitations in weakly supervised 3D detection by learning adaptive probabilistic prompts that enhance scene-aware representation learning and semantic coherence.
Abstract: Monocular 3D object detection typically relies on pseudo-labeling techniques to reduce dependency on real-world annotations. Recent advances demonstrate that deterministic linguistic cues can serve as effective auxiliary weak supervision signals, providing complementary semantic context. However, hand-crafted textual descriptions struggle to capture the inherent visual diversity of individuals across scenes, limiting the model’s ability to learn scene-aware representations. To address this challenge, we propose Visual-referred Probabilistic Prompt Learning (VirPro), an adaptive multi-modal pretraining paradigm that can be seamlessly integrated into diverse weakly supervised monocular 3D detection frameworks. Specifically, we generate a diverse set of learnable, instance-conditioned prompts across scenes and store them in an Adaptive Prompt Bank (APB). Subsequently, we introduce Multi-Gaussian Prompt Modeling (MGPM), which incorporates scene-based visual features into the corresponding textual embeddings, allowing the text prompts to express visual uncertainties. Then, from the fused vision-language embeddings, we decode a prompt-targeted Gaussian, from which we derive a unified object-level prompt embedding for each instance. RoI-level contrastive matching is employed to enforce modality alignment, bringing embeddings of co-occurring objects within the same scene closer in the latent space, thus enhancing semantic coherence. Extensive experiments on the KITTI benchmark demonstrate that integrating our pretraining paradigm consistently yields substantial performance gains, achieving up to a 4.8% average precision improvement over the baseline.
[181] Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation
Rui Hong, Jana Kosecka
Main category: cs.CV
TL;DR: Gesture-aware pretraining using semantic labels improves 3D hand pose estimation from monocular RGB images by leveraging gesture semantics as inductive bias.
Details
Motivation: 3D hand pose estimation from monocular RGB images is challenging but crucial for AR/VR, HCI, and sign language applications. The authors propose that gesture semantics can serve as a powerful inductive bias when discrete gesture labels are available.
Method: Two-stage framework: 1) Gesture-aware pretraining learns an informative embedding space using coarse and fine gesture labels from InterHand2.6M dataset, 2) Per-joint token Transformer guided by gesture embeddings for final regression of MANO hand parameters, with layered objective over parameters, joints, and structural constraints.
Result: Experiments on InterHand2.6M show gesture-aware pretraining consistently improves single-hand accuracy over state-of-the-art EANet baseline, and the benefit transfers across architectures without modification.
Conclusion: Gesture semantics provide valuable inductive bias for 3D hand pose estimation, and gesture-aware pretraining is an effective approach that improves accuracy and transfers well across different model architectures.
Abstract: Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human-computer interaction, and sign language understanding. In this work we focus on a scenario where a discrete set of gesture labels is available and show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: gesture-aware pretraining that learns an informative embedding space using coarse and fine gesture labels from InterHand2.6M, followed by a per-joint token Transformer guided by gesture embeddings as intermediate representations for final regression of MANO hand parameters. Training is driven by a layered objective over parameters, joints, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet baseline, and that the benefit transfers across architectures without any modification.
[182] Revisiting Cross-Attention Mechanisms: Leveraging Beneficial Noise for Domain-Adaptive Learning
Zelin Zang, Yehui Yang, Fei Wang, Liangyu Li, Baigui Sun
Main category: cs.CV
TL;DR: DACSM framework uses beneficial noise in cross-attention for domain adaptation, combining domain translation with cross-scale matching to handle appearance and scale gaps.
Details
Motivation: Unsupervised Domain Adaptation (UDA) suffers from severe domain and scale gaps that degrade performance when transferring knowledge from labeled source to unlabeled target domains. Existing cross-attention transformers struggle to preserve content semantics under large appearance and scale variations.
Method: Proposes Domain-Adaptive Cross-Scale Matching (DACSM) framework with two components: 1) Domain-Adaptive Transformer (DAT) that injects beneficial noise into cross-attention to regularize it, enabling progressive domain translation while focusing on content over style, and 2) Cross-Scale Matching (CSM) module that adaptively aligns features across multiple resolutions for semantic consistency under scale changes.
Result: Achieves state-of-the-art performance on VisDA-2017, Office-Home, and DomainNet datasets, with up to +2.3% improvement over CDTrans on VisDA-2017. Notably achieves +5.9% gain on challenging “truck” class of VisDA, demonstrating strength in handling scale discrepancies.
Conclusion: Combining domain translation, beneficial-noise-enhanced attention, and scale-aware alignment is effective for robust cross-domain representation learning, particularly for handling domain and scale gaps in UDA tasks.
Abstract: Unsupervised Domain Adaptation (UDA) seeks to transfer knowledge from a labeled source domain to an unlabeled target domain but often suffers from severe domain and scale gaps that degrade performance. Existing cross-attention-based transformers can align features across domains, yet they struggle to preserve content semantics under large appearance and scale variations. To explicitly address these challenges, we introduce the concept of beneficial noise, which regularizes cross-attention by injecting controlled perturbations, encouraging the model to ignore style distractions and focus on content. We propose the Domain-Adaptive Cross-Scale Matching (DACSM) framework, which consists of a Domain-Adaptive Transformer (DAT) for disentangling domain-shared content from domain-specific style, and a Cross-Scale Matching (CSM) module that adaptively aligns features across multiple resolutions. DAT incorporates beneficial noise into cross-attention, enabling progressive domain translation with enhanced robustness, yielding content-consistent and style-invariant representations. Meanwhile, CSM ensures semantic consistency under scale changes. Extensive experiments on VisDA-2017, Office-Home, and DomainNet demonstrate that DACSM achieves state-of-the-art performance, with up to +2.3% improvement over CDTrans on VisDA-2017. Notably, DACSM achieves a +5.9% gain on the challenging “truck” class of VisDA, evidencing the strength of beneficial noise in handling scale discrepancies. These results highlight the effectiveness of combining domain translation, beneficial-noise-enhanced attention, and scale-aware alignment for robust cross-domain representation learning.
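A minimal sketch of the beneficial-noise idea: controlled perturbations on the keys during training so cross-attention cannot latch onto style-specific detail. The noise scale and placement are illustrative assumptions, not the paper's configuration.

    import torch
    import torch.nn.functional as F

    def noisy_cross_attention(q, k, v, sigma=0.1, training=True):
        """q: (B, Nq, C) target-domain queries; k, v: (B, Nk, C) source-domain tokens."""
        if training:
            k = k + sigma * torch.randn_like(k)   # beneficial noise on the keys
        attn = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v

    out = noisy_cross_attention(torch.randn(2, 16, 64), torch.randn(2, 32, 64),
                                torch.randn(2, 32, 64))
    print(out.shape)  # torch.Size([2, 16, 64])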
[183] Motion-Adaptive Temporal Attention for Lightweight Video Generation with Stable Diffusion
Rui Hong, Shuxue Quan
Main category: cs.CV
TL;DR: Motion-adaptive temporal attention for efficient video generation using frozen Stable Diffusion, adjusting attention based on motion content with minimal added parameters.
Details
Motivation: Current video generation methods often treat all content uniformly or require extensive retraining. The authors aim to create a parameter-efficient approach that adapts to motion characteristics while leveraging pre-trained image diffusion models.
Method: Injects lightweight temporal attention modules into UNet transformer blocks with cascaded strategy: global attention in down-sampling/middle blocks for semantic stability, motion-adaptive attention in up-sampling blocks for refinement. Uses motion estimation to adjust temporal receptive fields, temporally correlated noise initialization, and motion-aware gating.
Result: Achieves competitive results on WebVid validation with only 25.8M trainable parameters (2.9% of base UNet). Shows standard denoising objective provides sufficient temporal regularization, outperforming explicit consistency loss methods. Reveals trade-off between noise correlation and motion amplitude for inference-time control.
Conclusion: Motion-adaptive temporal attention enables efficient video generation from frozen image diffusion models, with minimal parameters and adaptive behavior based on motion content, offering practical inference-time controls.
Abstract: We present a motion-adaptive temporal attention mechanism for parameter-efficient video generation built upon frozen Stable Diffusion models. Rather than treating all video content uniformly, our method dynamically adjusts temporal attention receptive fields based on estimated motion content: high-motion sequences attend locally across frames to preserve rapidly changing details, while low-motion sequences attend globally to enforce scene consistency. We inject lightweight temporal attention modules into all UNet transformer blocks via a cascaded strategy – global attention in down-sampling and middle blocks for semantic stabilization, motion-adaptive attention in up-sampling blocks for fine-grained refinement. Combined with temporally correlated noise initialization and motion-aware gating, the system adds only 25.8M trainable parameters (2.9% of the base UNet) while achieving competitive results on WebVid validation when trained on 100K videos. We demonstrate that the standard denoising objective alone provides sufficient implicit temporal regularization, outperforming approaches that add explicit temporal consistency losses. Our ablation studies reveal a clear trade-off between noise correlation and motion amplitude, providing a practical inference-time control for diverse generation behaviors.
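The motion-adaptive rule can be boiled down to a mask choice: estimate motion from frame differences, then attend locally for high-motion clips and globally for static ones. The threshold and window size below are illustrative, not the paper's values.

    import torch

    def temporal_attention_mask(frames, local_window=3, thresh=0.05):
        """frames: (T, C, H, W); returns a (T, T) boolean mask of allowed attention."""
        T = frames.shape[0]
        motion = (frames[1:] - frames[:-1]).abs().mean()  # crude motion estimate
        t = torch.arange(T)
        if motion > thresh:                               # high motion: attend locally
            return (t[:, None] - t[None, :]).abs() <= local_window
        return torch.ones(T, T, dtype=torch.bool)         # low motion: attend globally

    print(temporal_attention_mask(torch.rand(8, 3, 32, 32)).int())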
[184] Mutually Causal Semantic Distillation Network for Zero-Shot Learning
Shiming Chen, Shuhuang Chen, Guo-Sen Xie, Xinge You
Main category: cs.CV
TL;DR: MSDN++ is a mutually causal semantic distillation network for zero-shot learning that uses bidirectional causal attention between visual and attribute features to learn intrinsic semantic representations.
Details
Motivation: Prior ZSL methods use unidirectional attention in a weakly-supervised manner, learning spurious and limited latent semantic representations that fail to discover intrinsic semantic knowledge between visual and attribute features.
Method: Proposes MSDN++ with two mutual causal attention sub-nets: attribute→visual causal attention learns attribute-based visual features, and visual→attribute causal attention learns visual-based attribute features. Uses semantic distillation loss for collaborative learning.
Result: Extensive experiments on CUB, SUN, AWA2, and FLO datasets show significant improvements over strong baselines, achieving new state-of-the-art performances.
Conclusion: MSDN++ effectively distills intrinsic and sufficient semantic representations for ZSL through mutually causal semantic distillation, outperforming previous methods.
Abstract: Zero-shot learning (ZSL) aims to recognize the unseen classes in the open-world guided by side-information (e.g., attributes). Its key task is to infer the latent semantic knowledge between visual and attribute features on seen classes, and thus conduct a desirable semantic knowledge transfer from seen classes to unseen ones. Prior works simply utilize unidirectional attention in a weakly-supervised manner to learn spurious and limited latent semantic representations, which fail to effectively discover the intrinsic semantic knowledge (e.g., attribute semantics) between visual and attribute features. To solve the above challenges, we propose a mutually causal semantic distillation network (termed MSDN++) to distill intrinsic and sufficient semantic representations for ZSL. MSDN++ consists of an attribute→visual causal attention sub-net that learns attribute-based visual features, and a visual→attribute causal attention sub-net that learns visual-based attribute features. The causal attention encourages the two sub-nets to learn causal vision-attribute associations for representing reliable features with causal visual/attribute learning. With the guidance of a semantic distillation loss, the two mutual attention sub-nets learn collaboratively and teach each other throughout the training process. Extensive experiments on four widely-used benchmark datasets (CUB, SUN, AWA2, and FLO) show that our MSDN++ yields significant improvements over the strong baselines, leading to new state-of-the-art performances.
[185] AdapTS: Lightweight Teacher-Student Approach for Multi-Class and Continual Visual Anomaly Detection
Manuel Barusco, Davide Dalle Pezze, Francesco Borsatti, Gian Antonio Susto
Main category: cs.CV
TL;DR: AdapTS is a unified Teacher-Student framework for multi-class continual visual anomaly detection that uses shared frozen backbone with lightweight adapters, achieving comparable performance with drastically reduced memory overhead for edge deployment.
Details
Motivation: Existing visual anomaly detection methods are limited to single-category scenarios and don't address multi-class continual learning needs in real-world industrial environments. Teacher-Student architectures are efficient but unexplored for continual settings.
Method: Uses single shared frozen backbone with lightweight trainable adapters injected into student pathway. Enhanced via segmentation-guided objective and synthetic Perlin noise. Includes prototype-based task identification mechanism to dynamically select adapters at inference.
Result: Matches performance of existing Teacher-Student methods on MVTec AD and VisA datasets for multi-class and continual learning scenarios. Lightest variant (AdapTS-S) requires only 8 MB additional memory, 13-149x less than competitors.
Conclusion: AdapTS provides a scalable solution for edge deployment in complex industrial environments by addressing multi-class continual visual anomaly detection with minimal memory overhead while maintaining performance.
Abstract: Visual Anomaly Detection (VAD) is crucial for industrial inspection, yet most existing methods are limited to single-category scenarios, failing to address the multi-class and continual learning demands of real-world environments. While Teacher-Student (TS) architectures are efficient, they remain unexplored for the Continual Setting. To bridge this gap, we propose AdapTS, a unified TS framework designed for multi-class and continual settings, optimized for edge deployment. AdapTS eliminates the need for two different architectures by utilizing a single shared frozen backbone and injecting lightweight trainable adapters into the student pathway. Training is enhanced via a segmentation-guided objective and synthetic Perlin noise, while a prototype-based task identification mechanism dynamically selects adapters at inference with 99% accuracy. Experiments on MVTec AD and VisA demonstrate that AdapTS matches the performance of existing TS methods across multi-class and continual learning scenarios, while drastically reducing memory overhead. Our lightest variant, AdapTS-S, requires only 8 MB of additional memory, 13x less than STFPM (95 MB), 48x less than RD4AD (360 MB), and 149x less than DeSTSeg (1120 MB), making it a highly scalable solution for edge deployment in complex industrial environments.
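Two of the ingredients are compact enough to sketch: a residual bottleneck adapter in the student path of a frozen backbone, and prototype-based task identification that picks an adapter at inference. Dimensions and the distance rule are illustrative assumptions.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        def __init__(self, dim, bottleneck=32):
            super().__init__()
            self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)
        def forward(self, x):                      # residual bottleneck adapter
            return x + self.up(torch.relu(self.down(x)))

    def select_adapter(feat, prototypes):
        """feat: (D,) pooled backbone feature; prototypes: (K, D), one per task."""
        return int(torch.cdist(feat[None], prototypes).argmin())

    adapters = [Adapter(256) for _ in range(3)]    # one lightweight adapter per task
    feat = torch.randn(256)
    k = select_adapter(feat, torch.randn(3, 256))
    print(adapters[k](feat).shape)                 # torch.Size([256])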
[186] Towards Motion-aware Referring Image Segmentation
Chaeyun Kim, Seunghoon Yi, Yejin Kim, Yohan Jo, Joonseok Lee
Main category: cs.CV
TL;DR: A method to improve Referring Image Segmentation (RIS) for motion-centric queries through data augmentation and multimodal contrastive learning, with a new benchmark for evaluation.
Details
Motivation: Existing RIS methods perform poorly on motion-related queries compared to appearance-based ones, highlighting a gap in handling dynamic object descriptions.
Method: Two key contributions: 1) Data augmentation extracting motion-centric phrases from captions, 2) Multimodal Radial Contrastive Learning (MRaCL) on fused image-text embeddings rather than unimodal representations.
Result: The method substantially improves performance on motion-centric queries across multiple RIS models while maintaining competitive results on appearance-based descriptions.
Conclusion: The approach effectively addresses the motion-centric query challenge in RIS through targeted data augmentation and multimodal contrastive learning, with a new benchmark for comprehensive evaluation.
Abstract: Referring Image Segmentation (RIS) requires identifying objects from images based on textual descriptions. We observe that existing methods significantly underperform on motion-related queries compared to appearance-based ones. To address this, we first introduce an efficient data augmentation scheme that extracts motion-centric phrases from original captions, exposing models to more motion expressions without additional annotations. Second, since the same object can be described differently depending on the context, we propose Multimodal Radial Contrastive Learning (MRaCL), performed on fused image-text embeddings rather than unimodal representations. For comprehensive evaluation, we introduce a new test split focusing on motion-centric queries, and introduce a new benchmark called M-Bench, where objects are distinguished primarily by actions. Extensive experiments show our method substantially improves performance on motion-centric queries across multiple RIS models, maintaining competitive results on appearance-based descriptions. Codes are available at https://github.com/snuviplab/MRaCL
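For concreteness, here is a generic InfoNCE-style loss computed on fused image-text embeddings rather than unimodal ones; it is a stand-in for the paper's radial variant, with invented shapes and temperature.

    import torch
    import torch.nn.functional as F

    def fused_contrastive(anchor, positive, negatives, tau=0.07):
        """anchor/positive: (B, D) fused embeddings; negatives: (B, N, D)."""
        a = F.normalize(anchor, dim=-1)
        pos = (a * F.normalize(positive, dim=-1)).sum(-1, keepdim=True)   # (B, 1)
        neg = (a[:, None] * F.normalize(negatives, dim=-1)).sum(-1)       # (B, N)
        logits = torch.cat([pos, neg], dim=1) / tau
        return F.cross_entropy(logits, torch.zeros(len(a), dtype=torch.long))

    loss = fused_contrastive(torch.randn(4, 128), torch.randn(4, 128),
                             torch.randn(4, 8, 128))
    print(loss.item() > 0)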
[187] Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing
Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Xiaojun Chen, Wu Liu, Weiping Wang
Main category: cs.CV
TL;DR: Rel-Zero is a novel zero-watermarking framework that uses editing-invariant patch relations for image authentication without modifying original images, providing robust protection against AI-based editing manipulations.
Details
Motivation: The paper addresses the threat posed by diffusion-based image editing to digital content authenticity. Traditional watermarking methods compromise visual fidelity with perceptible perturbations, while existing zero-watermarking approaches using global features fail against sophisticated manipulations.
Method: The method leverages the observation that while individual image patches change during AI editing, the relational distance between patch pairs remains relatively invariant. Rel-Zero extracts a unique zero-watermark from these editing-invariant patch relations without modifying the original image, using intrinsic structural consistency rather than absolute appearance.
Result: Extensive experiments show that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches, providing a non-invasive yet resilient mechanism for content authentication.
Conclusion: Rel-Zero offers an effective solution for protecting digital visual content authenticity against AI-based editing threats by exploiting invariant patch relations, balancing robustness and visual fidelity without image modification.
Abstract: Recent advancements in diffusion-based image editing pose a significant threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to maintain robustness, inevitably compromising visual fidelity. Meanwhile, existing zero-watermarking approaches, typically relying on global image features, struggle to withstand sophisticated manipulations. In this work, we uncover a key observation: while individual image patches undergo substantial alterations during AI-based editing, the relational distance between patch pairs remains relatively invariant. Leveraging this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that requires no modification to the original image but derives a unique zero-watermark from these editing-invariant patch relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive yet resilient mechanism for content authentication. Extensive experiments demonstrate that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches.
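A toy version of a relational signature: hash the ordering of patch-pair distances rather than the patch contents, since pairwise relations stay comparatively stable under editing. The feature choice and bit layout are invented; the paper's construction differs in detail.

    import numpy as np

    def relational_signature(patch_feats):
        """patch_feats: (P, D) features of image patches; returns a bit string."""
        P = patch_feats.shape[0]
        dists = np.linalg.norm(patch_feats[:, None] - patch_feats[None, :], axis=-1)
        pair_d = dists[np.triu_indices(P, k=1)]            # all patch-pair distances
        bits = (pair_d > np.median(pair_d)).astype(int)    # binarize against the median
        return "".join(map(str, bits))

    feats = np.random.default_rng(4).normal(size=(6, 32))
    print(relational_signature(feats))                     # 15 bits for 6 patches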
[188] SHIFT: Motion Alignment in Video Diffusion Models with Adversarial Hybrid Fine-Tuning
Xi Ye, Wenjia Yang, Yangyang Xu, Xiaoyang Liu, Duo Su, Mengfei Xia, Jun Zhu
Main category: cs.CV
TL;DR: SHIFT framework improves motion alignment in video diffusion models using pixel-motion rewards and hybrid fine-tuning to enhance motion dynamics and temporal coherence.
Details
Motivation: Video diffusion models often suffer from weakened motion fidelity (reduced motion dynamics, degraded temporal coherence) after fine-tuning, especially for long-term consistency.
Method: Introduces pixel-motion rewards based on pixel flux dynamics for both instantaneous and long-term motion consistency. Proposes Smooth Hybrid Fine-tuning (SHIFT) framework that fuses supervised fine-tuning with advantage-weighted fine-tuning using novel adversarial advantages to prevent reward hacking and improve convergence.
Result: SHIFT efficiently resolves dynamic-degree collapse in modern video diffusion models during supervised fine-tuning, improving motion alignment and temporal coherence.
Conclusion: The proposed SHIFT framework with pixel-motion rewards effectively addresses motion alignment issues in video diffusion models, enhancing both motion dynamics and long-term temporal consistency.
Abstract: Image-conditioned video diffusion models achieve impressive visual realism but often suffer from weakened motion fidelity, e.g., reduced motion dynamics or degraded long-term temporal coherence, especially after fine-tuning. We study the problem of motion alignment in video diffusion models post-training. To address this, we introduce pixel-motion rewards based on pixel flux dynamics, capturing both instantaneous and long-term motion consistency. We further propose Smooth Hybrid Fine-tuning (SHIFT), a scalable reward-driven fine-tuning framework for video diffusion models. SHIFT fuses normal supervised fine-tuning and advantage-weighted fine-tuning into a unified framework. Benefiting from novel adversarial advantages, SHIFT improves convergence speed and mitigates reward hacking. Experiments show that our approach efficiently resolves dynamic-degree collapse during supervised fine-tuning of modern video diffusion models.
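A pixel-flux-style reward is simple to emulate: score a clip by its mean frame-to-frame change (instantaneous dynamics) plus its first-to-last change (long-range motion). The weighting below is an illustrative assumption, not the paper's reward.

    import numpy as np

    def pixel_motion_reward(video, w_inst=1.0, w_long=0.5):
        """video: (T, H, W, C) with values in [0, 1]."""
        inst = np.abs(np.diff(video, axis=0)).mean()       # instantaneous pixel flux
        long_term = np.abs(video[-1] - video[0]).mean()    # long-range change
        return w_inst * inst + w_long * long_term

    clip = np.random.default_rng(5).uniform(size=(16, 32, 32, 3))
    print(round(pixel_motion_reward(clip), 3))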
[189] Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis
Jaein Kim, Hee Bin Yoo, Dong-Sig Han, Byoung-Tak Zhang
Main category: cs.CV
TL;DR: ECKConv is an SE(3)-equivariant convolution method for 3D point clouds that uses coordinate-based networks in a double coset space kernel domain to achieve both rigorous symmetry and scalability to large-scale problems.
Details
Motivation: Existing group convolution methods for 3D point clouds struggle to maintain both rigorous SE(3) symmetry and scalability to large-scale problems simultaneously. Previous intertwiner framework approaches either didn't achieve complete SE(3) symmetry or couldn't scale to large-scale applications.
Method: ECKConv uses an intertwiner framework with kernels defined in a double coset space to achieve SE(3) equivariance. It employs coordinate-based networks for explicit kernel design, enhancing learning capability and memory efficiency. The method extracts equivariant features from 3D point clouds while maintaining scalability.
Result: Experiments on diverse point cloud tasks (classification, pose registration, part segmentation, and large-scale semantic segmentation) validate ECKConv’s rigid equivariance, memory scalability, and outstanding performance compared to state-of-the-art equivariant methods.
Conclusion: ECKConv successfully resolves the trade-off between rigorous symmetry and scalability in 3D point cloud learning, providing an effective SE(3)-equivariant convolution method that works well across various tasks including large-scale applications.
Abstract: Symmetry under rigid motion is one of the salient factors in the efficient learning of 3D point cloud problems. Group convolution has been a representative method for extracting equivariant features, but its realizations have struggled to retain both rigorous symmetry and scalability simultaneously. We advocate the intertwiner framework to resolve this trade-off, but previous works on it, which achieved neither complete SE(3) symmetry nor scalability to large-scale problems, necessitate a more advanced kernel architecture. We present Equivariant Coordinate-based Kernel Convolution, or ECKConv. It acquires SE(3) equivariance from a kernel domain defined in a double coset space, and its explicit kernel design using coordinate-based networks enhances its learning capability and memory efficiency. Experiments on diverse point cloud tasks, e.g., classification, pose registration, part segmentation, and large-scale semantic segmentation, validate the rigid equivariance, memory scalability, and outstanding performance of ECKConv compared to state-of-the-art equivariant methods.
[190] ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation
Xiangyu Kong, Xiaoyu Jin, Yihan Pan, Haoqin Sun, Hengde Zhu, Xiaoming Xu, Xiaoming Wei, Lu Liu, Siyang Song
Main category: cs.CV
TL;DR: ECHO is a novel Interactive Head Generation framework that addresses limitations in existing methods by incorporating long-range contextual understanding and decoupled signal processing to generate more contextually appropriate and emotionally rational avatar facial behaviors with better lip synchronization.
Details
Motivation: Existing IHG methods have two main limitations: 1) they rely on short-clip behavioral cues without long-range contextual modeling, producing facial behaviors lacking contextual appropriateness, and 2) they use entangled, role-agnostic fusion of dual-track signals (user behaviors and avatar audio) which introduces cross-signal interference and compromises lip synchronization.
Method: ECHO features two key components: 1) Long-range Contextual Understanding (LCU) for contextual understanding of behavior-grounded dynamics and linguistic-driven affective semantics, and 2) block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) that preserves self-audio-driven lip articulation while adaptively integrating user behavioral cues for non-lip regions, using a two-stage training paradigm.
Result: Extensive experiments demonstrate the effectiveness of the proposed components and ECHO’s superior IHG performance compared to existing methods.
Conclusion: ECHO successfully addresses the limitations of existing IHG methods by incorporating long-range contextual modeling and decoupled signal processing, resulting in more contextually appropriate, emotionally rational facial behaviors with improved lip synchronization and visual fidelity.
Abstract: In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head video emulating such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., human user’s behaviors and pre-defined audio for avatar) within a short temporal window, jointly driving generation of avatar’s audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals empirically introduces cross-signal interference, potentially compromising lip-region synchronization during speaking. To this end, we propose ECHO, a novel IHG framework comprising two key components: a Long-range Contextual Understanding (LCU) component that facilitates contextual understanding of both behavior-grounded dynamics and linguistic-driven affective semantics to promote contextual appropriateness and emotional rationality of synthesized avatar FBs; and a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module, that preserves self-audio-driven lip articulation while adaptively integrating user contextual behavioral cues for non-lip facial regions, complemented by our designed two-stage training paradigm, to jointly enhance lip synchronization and visual fidelity. Extensive experiments demonstrate the effectiveness of proposed components and ECHO’s superior IHG performance.
[191] FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion
Hugo Caselles-Dupré, Mathis Koroglu, Guillaume Jeanneret, Arnaud Dapogny, Matthieu Cord
Main category: cs.CV
TL;DR: FrescoDiffusion: A training-free method for coherent large-format image-to-video generation from single complex images using latent priors and tiled denoising with global consistency regularization.
Details
Motivation: Current diffusion-based image-to-video models struggle with ultra-high-resolution inputs (e.g., 4K). Generating at native resolution loses fine detail, while tiled denoising breaks global layout consistency, especially problematic for complex fresco animations with many distinct elements that must remain spatially coherent over time.
Method: Augments tiled denoising with a precomputed latent prior: first generate low-resolution video at model resolution, upsample its latent trajectory for global reference, then compute per-tile noise predictions and fuse them with reference at every diffusion timestep using weighted least-squares objective combining tile-merging with regularization term for global coherence.
Result: Experiments on VBench-I2V dataset and proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, with computational efficiency. Method enables explicit controllability of trade-off between creativity and consistency.
Conclusion: FrescoDiffusion provides effective training-free solution for coherent large-format image-to-video generation, addressing the critical challenge of maintaining global layout consistency while preserving fine detail in ultra-high-resolution video synthesis.
Abstract: Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model’s native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, we compute per-tile noise predictions and fuse them with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term, yielding a closed-form fusion update that strengthens global coherence while retaining fine detail. We additionally provide a spatial regularization variable that enables region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while being computationally efficient. Our regularization enables explicit controllability of the trade-off between creativity and consistency.
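The fusion step has a transparent closed form: minimizing a weighted least-squares objective over tile predictions plus a reference-regularization term yields a per-pixel convex combination. A sketch with illustrative weights (tile predictions are assumed already placed on the full canvas):

    import numpy as np

    def fuse(tile_preds, tile_weights, ref_pred, lam=0.5):
        """Closed-form argmin of sum_i w_i*||x - p_i||^2 + lam*||x - r||^2."""
        num = lam * ref_pred
        den = np.full_like(ref_pred, lam)
        for pred, w in zip(tile_preds, tile_weights):
            num += w * pred
            den += w
        return num / den

    H, W = 8, 8
    preds = [np.ones((H, W)), 2 * np.ones((H, W))]
    weights = [np.ones((H, W)), np.ones((H, W))]
    print(fuse(preds, weights, ref_pred=np.zeros((H, W)))[0, 0])  # 1.2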
[192] FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning
Weidong Chen, Cheng Ye, Zhendong Mao, Peipei Song, Xinyan Liu, Lei Zhang, Xiaojun Chang, Yongdong Zhang
Main category: cs.CV
TL;DR: FACE-net is a retrieval-enhanced framework for Emotional Video Captioning that addresses factual-emotional bias through factual calibration, emotion augmentation, and dynamic bias adjustment.
Details
Motivation: Existing EVC methods struggle with factual-emotional bias - the challenge of balancing factual content description with emotional expression in different video samples. Current approaches insufficiently mine and coordinate factual and emotional cues during generation.Method: Proposes FACE-net with four key components: 1) External repository retrieval for semantic augmentation, 2) Factual calibration via uncertainty estimation using subject-predicate-object triplets, 3) Progressive visual emotion augmentation using calibrated semantics as experts, and 4) Dynamic bias adjustment routing module to predict and adjust sample bias.
Result: The framework collaboratively mines factual-emotional semantics and provides adaptive guidance for generation, overcoming the compromising tendency of factual-emotional descriptions in all sample learning.
Conclusion: FACE-net effectively addresses the factual-emotional bias problem in EVC through a unified architecture that enhances both factual accuracy and emotional expression in video descriptions.
Abstract: Emotional Video Captioning (EVC) is an emerging task, which aims to describe factual content together with the intrinsic emotions expressed in videos. Existing works perceive global emotional cues and then combine them with video content to generate descriptions. However, insufficient mining and coordination of factual and emotional cues during generation leave these methods unable to handle the factual-emotional bias, which refers to different samples placing different weights on factual versus emotional content during generation. To this end, we propose a retrieval-enhanced framework with FActual Calibration and Emotion augmentation (FACE-net), which through a unified architecture collaboratively mines factual-emotional semantics and provides adaptive and accurate guidance for generation, breaking the compromise between factual and emotional description that arises when all samples are learned uniformly. Technically, we first introduce an external repository and retrieve the sentences most relevant to the video content to augment the semantic information. Subsequently, our factual calibration via uncertainty estimation module splits the retrieved information into subject-predicate-object triplets, and self-refines and cross-refines different components through video content to effectively mine the factual semantics; our progressive visual emotion augmentation module then leverages the calibrated factual semantics as experts, interacts with the video content and an emotion dictionary to generate visual queries and candidate emotions, and aggregates them to adaptively attach emotions to each factual semantic. Moreover, to alleviate the factual-emotional bias, we design a dynamic bias adjustment routing module to predict and adjust the degree of bias of a sample.
[193] Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
Seongrae Noh, SeungWon Seo, Gyeong-Moon Park, HyeongYeop Kang
Main category: cs.CV
TL;DR: Edit-As-Act: A framework for 3D indoor scene editing from natural language using goal-regressive planning with symbolic predicates and physical constraints.
Details
Motivation: Existing open-vocabulary 3D scene editing systems either regenerate large portions of scenes or use image-space edits that disrupt spatial structure, leading to unintended global changes or physically inconsistent layouts. These limitations arise from treating editing as primarily a generative task rather than a reasoning problem.Method: Proposes Edit-As-Act framework that treats editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, it predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language with explicit preconditions and effects encoding geometric relations. Uses language-driven planner to propose actions and validator to enforce goal-directedness, monotonicity, and physical feasibility.
Result: On E2A-Bench (63 editing tasks across 9 indoor environments), Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories, achieving instruction fidelity, semantic consistency, and physical plausibility simultaneously.
Conclusion: By separating reasoning from low-level generation, Edit-As-Act addresses limitations of existing paradigms and enables interpretable, physically coherent 3D scene transformations from natural language instructions.
Abstract: Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility - three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.
[194] AR-CoPO: Align Autoregressive Video Generation with Contrastive Policy Optimization
Dailan He, Guanlin Feng, Xingtong Ge, Yi Zhang, Bingqi Ma, Guanglu Song, Yu Liu, Hongsheng Li
Main category: cs.CV
TL;DR: AR-CoPO is a reinforcement learning framework for aligning streaming autoregressive video generators with human preferences, addressing challenges in few-step distillation settings through chunk-level alignment and semi-on-policy training.
Details
Motivation: Streaming autoregressive video generators with few-step distillation achieve efficient synthesis but are difficult to align with human preferences via RLHF. Existing SDE-based methods face challenges with few-step ODEs and consistency model samplers that deviate from standard flow-matching, and their short trajectories are highly sensitive to initialization noise, making intermediate exploration ineffective.Method: AR-CoPO adapts the Neighbor GRPO contrastive perspective to streaming AR generation. It introduces chunk-level alignment via a forking mechanism that constructs neighborhood candidates at a randomly selected chunk, assigns sequence-level rewards, and performs localized GRPO updates. Additionally, a semi-on-policy training strategy combines on-policy exploration with exploitation over a replay buffer of reference rollouts.
Result: Experiments on Self-Forcing demonstrate that AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment over baselines, providing evidence of genuine alignment rather than reward hacking.
Conclusion: AR-CoPO provides an effective framework for aligning streaming autoregressive video generators with human preferences, overcoming challenges in few-step distillation settings through novel chunk-level alignment and semi-on-policy training strategies.
Abstract: Streaming autoregressive (AR) video generators combined with few-step distillation achieve low-latency, high-quality synthesis, yet remain difficult to align via reinforcement learning from human feedback (RLHF). Existing SDE-based GRPO methods face challenges in this setting: few-step ODEs and consistency model samplers deviate from standard flow-matching ODEs, and their short, low-stochasticity trajectories are highly sensitive to initialization noise, rendering intermediate SDE exploration ineffective. We propose AR-CoPO (AutoRegressive Contrastive Policy Optimization), a framework that adapts the Neighbor GRPO contrastive perspective to streaming AR generation. AR-CoPO introduces chunk-level alignment via a forking mechanism that constructs neighborhood candidates at a randomly selected chunk, assigns sequence-level rewards, and performs localized GRPO updates. We further propose a semi-on-policy training strategy that complements on-policy exploration with exploitation over a replay buffer of reference rollouts, improving generation quality across domains. Experiments on Self-Forcing demonstrate that AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment over the baseline, providing evidence of genuine alignment rather than reward hacking.
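At its core, the chunk-level update is a group-relative policy-gradient step over rollouts that share a prefix and fork at one randomly chosen chunk. A hedged sketch of that step (the clipped-ratio form, `clip` value, and reward normalization follow standard GRPO practice and are assumptions; the paper's exact Neighbor-GRPO contrastive objective is not reproduced here):

```python
import torch

def chunk_forked_advantages(rewards, eps=1e-6):
    """Group-relative (GRPO-style) advantages for a neighborhood of rollouts
    that share a prefix and fork at one randomly selected chunk.

    rewards: (G,) sequence-level rewards, one per forked candidate.
    Returns (G,) advantages, mean-centered and scale-normalized within the group.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def copo_chunk_loss(logprobs, old_logprobs, advantages, clip=0.2):
    """Clipped policy-gradient loss applied only to tokens of the forked chunk
    (the localized update described above).

    logprobs, old_logprobs: (G, T) per-token log-probs of the forked chunk
    advantages            : (G,) from chunk_forked_advantages
    """
    ratio = (logprobs - old_logprobs).exp()
    adv = advantages[:, None]                     # broadcast over chunk tokens
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - clip, 1 + clip) * adv
    return -torch.minimum(unclipped, clipped).mean()
```

Under the semi-on-policy strategy, part of each group would be drawn fresh from the policy and part replayed from a buffer of reference rollouts; the loss itself is unchanged.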
[195] UAV-CB: A Complex-Background RGB-T Dataset and Local Frequency Bridge Network for UAV Detection
Shenghui Huang, Menghao Hu, Longkun Zou, Hongyu Chi, Zekai Li, Feng Gao, Fan Yang, Qingyao Wu, Ke Chen
Main category: cs.CV
TL;DR: LFBNet: A frequency-aware RGB-T fusion network for UAV detection in complex backgrounds and camouflage conditions, evaluated on a new UAV-CB dataset.
Details
Motivation: UAV detection in low-altitude environments is challenging due to complex backgrounds, camouflage, and multimodal interference. Existing datasets don't adequately capture these challenges, limiting progress in robust real-world perception.Method: Proposes Local Frequency Bridge Network (LFBNet) that models features in localized frequency space to bridge both frequency-spatial fusion gap and cross-modality discrepancy gap in RGB-T fusion. Also constructs UAV-CB dataset emphasizing complex backgrounds and camouflage.
Result: Extensive experiments on UAV-CB and public benchmarks show LFBNet achieves state-of-the-art detection performance and strong robustness under camouflaged and cluttered conditions.
Conclusion: LFBNet offers a frequency-aware perspective on multimodal UAV perception in real-world applications, addressing challenges of complex backgrounds and camouflage through localized frequency modeling.
Abstract: Detecting Unmanned Aerial Vehicles (UAVs) in low-altitude environments is essential for perception and defense systems but remains highly challenging due to complex backgrounds, camouflage, and multimodal interference. In real-world scenarios, UAVs are frequently visually blended with surrounding structures such as buildings, vegetation, and power lines, resulting in low contrast, weak boundaries, and strong confusion with cluttered background textures. Existing UAV detection datasets, though diverse, are not specifically designed to capture these camouflage and complex-background challenges, which limits progress toward robust real-world perception. To fill this gap, we construct UAV-CB, a new RGB-T UAV detection dataset deliberately curated to emphasize complex low-altitude backgrounds and camouflage characteristics. Furthermore, we propose the Local Frequency Bridge Network (LFBNet), which models features in localized frequency space to bridge both the frequency-spatial fusion gap and the cross-modality discrepancy gap in RGB-T fusion. Extensive experiments on UAV-CB and public benchmarks demonstrate that LFBNet achieves state-of-the-art detection performance and strong robustness under camouflaged and cluttered conditions, offering a frequency-aware perspective on multimodal UAV perception in real-world applications.
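LFBNet's exact architecture is not detailed in this digest, so the sketch below shows only the generic pattern that "modeling features in localized frequency space" implies: window the RGB and thermal feature maps, move each window into the Fourier domain, and exchange information there. The amplitude-sharing/phase-keeping split is purely an illustrative choice, not the paper's design:

```python
import torch

def local_frequency_fuse(rgb_feat, tir_feat, window=8):
    """Generic local-frequency fusion sketch for RGB-T features: split maps into
    non-overlapping windows, apply a 2D FFT per window, share amplitude across
    modalities while keeping each modality's own phase, then transform back.

    rgb_feat, tir_feat: (B, C, H, W) tensors with H, W divisible by `window`.
    """
    B, C, H, W = rgb_feat.shape
    w = window

    def to_windows(x):
        # (B, C, H, W) -> (B, C, H/w, W/w, w, w) non-overlapping windows
        return x.reshape(B, C, H // w, w, W // w, w).permute(0, 1, 2, 4, 3, 5)

    def from_windows(x):
        return x.permute(0, 1, 2, 4, 3, 5).reshape(B, C, H, W)

    Fr = torch.fft.fft2(to_windows(rgb_feat))
    Ft = torch.fft.fft2(to_windows(tir_feat))

    # Shared amplitude bridges the modality gap; per-modality phase keeps structure.
    amp = 0.5 * (Fr.abs() + Ft.abs())
    fused_rgb = torch.fft.ifft2(torch.polar(amp, Fr.angle())).real
    fused_tir = torch.fft.ifft2(torch.polar(amp, Ft.angle())).real
    return from_windows(fused_rgb), from_windows(fused_tir)
```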
[196] Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation
Jiawei Zhou, Chi Zhang, Xiang Feng, Qiming Zhang, Haibo Qiu, Lihuo He, Dengpan Ye, Xinbo Gao, Jing Zhang
Main category: cs.CV
TL;DR: Omni-I2C is a benchmark for evaluating Large Multimodal Models’ ability to convert complex digital graphics into executable code, testing both visual perception and code generation capabilities.
Details
Motivation: Current LMMs struggle with converting structured digital graphics to executable code, which requires high-fidelity visual perception to parse spatial hierarchies and symbolic details, plus precise code generation for syntactically sound and logically consistent output.Method: Created a comprehensive benchmark with 1080 curated samples spanning various subjects, image modalities, and programming languages, incorporating authentic user-sourced cases. The evaluation framework decouples performance into perceptual fidelity and symbolic precision to expose structural failures.
Result: Evaluation reveals substantial performance gaps among leading LMMs, with even state-of-the-art models struggling to preserve structural integrity in complex scenarios, showing multimodal code generation remains a formidable challenge.
Conclusion: Omni-I2C provides a rigorous benchmark that exposes current limitations in LMMs’ ability to convert complex visual structures to code, highlighting the need for improved multimodal reasoning and code generation capabilities.
Abstract: We present Omni-I2C, a comprehensive benchmark designed to evaluate the capability of Large Multimodal Models (LMMs) in converting complex, structured digital graphics into executable code. We argue that this task represents a non-trivial challenge for the current generation of LMMs: it demands an unprecedented synergy between high-fidelity visual perception – to parse intricate spatial hierarchies and symbolic details – and precise generative expression – to synthesize syntactically sound and logically consistent code. Unlike traditional descriptive tasks, Omni-I2C requires a holistic understanding where any minor perceptual hallucination or coding error leads to a complete failure in visual reconstruction. Omni-I2C features 1080 meticulously curated samples, defined by its breadth across subjects, image modalities, and programming languages. By incorporating authentic user-sourced cases, the benchmark spans a vast spectrum of digital content – from scientific visualizations to complex symbolic notations – each paired with executable reference code. To complement this diversity, our evaluation framework provides necessary depth; by decoupling performance into perceptual fidelity and symbolic precision, it transcends surface-level accuracy to expose the granular structural failures and reasoning bottlenecks of current LMMs. Our evaluation reveals a substantial performance gap among leading LMMs; even state-of-the-art models struggle to preserve structural integrity in complex scenarios, underscoring that multimodal code generation remains a formidable challenge. Data and code are available at https://github.com/MiliLab/Omni-I2C.
[197] EI: Early Intervention for Multimodal Imaging based Disease Recognition
Qijie Wei, Hailan Lin, Xirong Li
Main category: cs.CV
TL;DR: Early Intervention framework with Mixture of Low-varied-Ranks Adaptation for multimodal medical imaging disease recognition, addressing fusion limitations and domain adaptation challenges.
Details
Motivation: Current multimodal medical imaging methods fail to fully leverage complementary information due to late fusion approaches, and face challenges adapting Vision Foundation Models to medical domains due to data scarcity and domain shifts.Method: Proposes Early Intervention (EI) framework that uses reference modality tokens to steer target modality embedding early, and Mixture of Low-varied-Ranks Adaptation (MoR) for parameter-efficient fine-tuning of Vision Foundation Models using varied-rank low-rank adapters with weight-relaxed routing.
Result: Extensive experiments on retinal disease, skin lesion, and knee anomaly classification datasets show effectiveness against competitive baselines.
Conclusion: The proposed EI framework with MoR adaptation successfully addresses multimodal fusion and domain adaptation challenges in medical imaging, improving disease recognition performance.
Abstract: Current methods for multimodal medical imaging based disease recognition face two major challenges. First, the prevailing “fusion after unimodal image embedding” paradigm cannot fully leverage the complementary and correlated information in the multimodal data. Second, the scarcity of labeled multimodal medical images, coupled with their significant domain shift from natural images, hinders the use of cutting-edge Vision Foundation Models (VFMs) for medical image embedding. To jointly address the challenges, we propose a novel Early Intervention (EI) framework. Treating one modality as target and the rest as reference, EI harnesses high-level semantic tokens from the reference as intervention tokens to steer the target modality’s embedding process at an early stage. Furthermore, we introduce Mixture of Low-varied-Ranks Adaptation (MoR), a parameter-efficient fine-tuning method that employs a set of low-rank adapters with varied ranks and a weight-relaxed router for VFM adaptation. Extensive experiments on three public datasets for retinal disease, skin lesion, and knee anomaly classification verify the effectiveness of the proposed method against a number of competitive baselines.
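A minimal sketch of what a varied-rank LoRA mixture with a relaxed router could look like on one frozen linear layer. The ranks, the scaling, and the reading of "weight-relaxed" as sigmoid gates that need not sum to one are assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn

class MoRLinear(nn.Module):
    """Sketch of Mixture of Low-varied-Ranks adaptation: several LoRA branches
    with different ranks on a frozen base layer, combined by a soft router.
    """

    def __init__(self, base: nn.Linear, ranks=(2, 4, 8), alpha=8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep the VFM weights frozen
        d_in, d_out = base.in_features, base.out_features
        self.downs = nn.ModuleList(nn.Linear(d_in, r, bias=False) for r in ranks)
        self.ups = nn.ModuleList(nn.Linear(r, d_out, bias=False) for r in ranks)
        for up in self.ups:
            nn.init.zeros_(up.weight)          # adapters start as a no-op
        self.router = nn.Linear(d_in, len(ranks))
        self.scale = alpha

    def forward(self, x):
        # "Weight-relaxed" interpreted as independent gates (no softmax coupling).
        gates = torch.sigmoid(self.router(x))  # (..., num_adapters)
        delta = sum(
            gates[..., i:i + 1] * up(down(x))
            for i, (down, up) in enumerate(zip(self.downs, self.ups))
        )
        return self.base(x) + self.scale * delta
```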
[198] Anchoring and Rescaling Attention for Semantically Coherent Inbetweening
Tae Eun Choi, Sumin Shim, Junhyeok Kim, Seong Jae Hwang
Main category: cs.CV
TL;DR: Keyframe-anchored Attention Bias and Rescaled Temporal RoPE improve frame consistency and semantic alignment in sparse generative inbetweening without additional training, evaluated on the new text-conditioned TGI-Bench.
Details
Motivation: Previous generative inbetweening models struggle with inconsistent frames, unstable pacing, and semantic misalignment when sequences become sparser and motions larger. The task requires additional guidance from keyframes and text to specify intended paths between fixed endpoints.Method: Proposes Keyframe-anchored Attention Bias to provide semantic and temporal guidance from keyframes and text onto each intermediate frame. Also introduces Rescaled Temporal RoPE to allow self-attention to attend to keyframes more faithfully and enforce frame consistency.
Result: Achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges without additional training. Introduces TGI-Bench, the first benchmark specifically designed for text-conditioned generative inbetweening evaluation.
Conclusion: The proposed methods effectively address challenges in generative inbetweening by providing better guidance from keyframes and text, resulting in improved consistency and alignment in synthesized intermediate frames.
Abstract: Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle with inconsistent frames, unstable pacing, and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, this task requires additional guidance from the keyframes and text to specify the intended path. Thus, we give semantic and temporal guidance from the keyframes and text onto each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.
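A minimal reading of Keyframe-anchored Attention Bias is an additive term on the attention logits that raises every query's affinity for keyframe tokens; the sketch below implements that reading (the scalar `bias_strength` and the uniform bias shape are assumptions). Rescaled Temporal RoPE would analogously rescale the temporal position indices before computing the rotary embedding, keeping distant keyframes within a faithful attention range:

```python
import torch

def keyframe_biased_attention(q, k, v, keyframe_mask, bias_strength=1.0):
    """Self-attention with an additive logit bias toward keyframe tokens.

    q, k, v       : (B, heads, T, d) token sequences over frames
    keyframe_mask : (T,) boolean, True where a token belongs to a keyframe
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5             # (B, heads, T, T)
    bias = bias_strength * keyframe_mask.to(logits.dtype)   # (T,)
    logits = logits + bias          # broadcasts over the key axis: all queries
    attn = logits.softmax(dim=-1)   # attend a bit more to keyframe tokens
    return attn @ v
```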
[199] UniSem: Generalizable Semantic 3D Reconstruction from Sparse Unposed Images
Guibiao Liao, Qian Ren, Kaimin Liao, Hua Wang, Zhi Chen, Luchao Wang, Yaohua Tang
Main category: cs.CV
TL;DR: UniSem improves 3D Gaussian Splatting for sparse-view reconstruction by addressing geometry instability and semantic incompleteness through error-aware Gaussian dropout and mix-training curriculum with prototype alignment.
Details
Motivation: Existing feed-forward 3DGS methods for sparse-view reconstruction produce over-complete Gaussian sets leading to unstable geometry and poor depth quality, while relying on weak 2D segmenter features results in incomplete 3D semantics with limited generalization.Method: Two key components: 1) Error-aware Gaussian Dropout (EGD) uses rendering error cues to suppress redundancy-prone Gaussians for better geometry; 2) Mix-training Curriculum (MTC) progressively blends 2D segmenter-lifted semantics with emergent 3D semantic priors using object-level prototype alignment.
Result: On ScanNet and Replica datasets, UniSem achieves superior depth prediction and open-vocabulary 3D segmentation across varying input views. With 16-view inputs, reduces depth Rel by 15.2% and improves open-vocabulary segmentation mAcc by 3.7% over baselines.
Conclusion: UniSem provides a unified framework that jointly improves geometric reconstruction and semantic understanding in sparse-view 3DGS, addressing both depth accuracy and semantic generalization challenges through principled capacity control and progressive semantic learning.
Abstract: Semantic-aware 3D reconstruction from sparse, unposed images remains challenging for feed-forward 3D Gaussian Splatting (3DGS). Existing methods often predict an over-complete set of Gaussian primitives under sparse-view supervision, leading to unstable geometry and inferior depth quality. Meanwhile, they rely solely on 2D segmenter features for semantic lifting, which provides weak 3D-level and limited generalizable supervision, resulting in incomplete 3D semantics in novel scenes. To address these issues, we propose UniSem, a unified framework that jointly improves depth accuracy and semantic generalization via two key components. First, Error-aware Gaussian Dropout (EGD) performs error-guided capacity control by suppressing redundancy-prone Gaussians using rendering error cues, producing meaningful, geometrically stable Gaussian representations for improved depth estimation. Second, we introduce a Mix-training Curriculum (MTC) that progressively blends 2D segmenter-lifted semantics with the model’s own emergent 3D semantic priors, implemented with object-level prototype alignment to enhance semantic coherence and completeness. Extensive experiments on ScanNet and Replica show that UniSem achieves superior performance in depth prediction and open-vocabulary 3D segmentation across varying numbers of input views. Notably, with 16-view inputs, UniSem reduces depth Rel by 15.2% and improves open-vocabulary segmentation mAcc by 3.7% over strong baselines.
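One plausible reading of Error-aware Gaussian Dropout, sketched below: use per-Gaussian rendering-error credit as dropout odds, so redundancy-prone primitives are stochastically suppressed during training. How the error is attributed to individual Gaussians, the temperature, and the drop fraction are all assumptions here:

```python
import torch

def error_aware_gaussian_dropout(opacity, per_gauss_error, drop_frac=0.1, tau=1.0):
    """Stochastically zero the opacity of Gaussians with high error credit.

    opacity        : (N,) per-Gaussian opacity
    per_gauss_error: (N,) rendering-error credit per Gaussian, e.g. error
                     accumulated over the pixels it touches (assumed to be
                     exposed by the rasterizer)
    """
    probs = torch.softmax(per_gauss_error / tau, dim=0)  # error-guided dropout odds
    k = max(1, int(drop_frac * opacity.numel()))
    drop_idx = torch.multinomial(probs, k, replacement=False)
    mask = torch.ones_like(opacity)
    mask[drop_idx] = 0.0
    return opacity * mask
```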
[200] Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment
Yaze Zhao, Yixiong Zou, Yuhua Li, Ruixuan Li
Main category: cs.CV
TL;DR: CC-CDFSL method uses cycle consistency and semantic anchors to improve local vision-language alignment in CLIP-based cross-domain few-shot learning, addressing the local misalignment problem exacerbated by domain gaps and scarce data.
Details
Motivation: Current fine-tuned CLIP models struggle with fine-grained visual cues in cross-domain few-shot learning (CDFSL), especially in domains like medical diagnosis. The domain gap and scarce training data exacerbate CLIP's shortcomings in capturing local subtle patterns, creating a local misalignment problem between visual features and text semantics.Method: Proposes CC-CDFSL with cycle consistency that translates local visual features into text features and back (and vice versa), constraining original features to stay close to translated-back features. Also introduces Semantic Anchor mechanism that augments visual features for text-to-image mapping and shrinks image features to filter noise for image-to-text mapping.
Result: The method effectively improves local vision-language alignment, enhances interpretability of learned patterns through patch visualization, and achieves state-of-the-art performance across various benchmarks, backbones, and fine-tuning methods.
Conclusion: The proposed CC-CDFSL method successfully addresses the local misalignment problem in CLIP-based CDFSL by leveraging self-supervision through cycle consistency and semantic anchors, improving both performance and interpretability for fine-grained recognition tasks.
Abstract: Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, where the research on vision-language models (e.g., CLIP) is still in the early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, even though they can roughly focus on important regions in source domains. Although current works have demonstrated CLIP’s shortcomings in capturing local subtle patterns, in this paper, we find that the domain gap and scarce training data further exacerbate such shortcomings, far more than for holistic patterns, which we call the local misalignment problem in CLIP-based CDFSL. To address this problem, due to the lack of supervision in aligning local visual features and text semantics, we turn to self-supervision. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then translates them back into visual features (and vice versa), and constrains the original features to stay close to the translated-back features. To reduce the noise introduced by the richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mappings. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show that we can (1) effectively improve the local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions by visualizing patches, and (3) achieve state-of-the-art performance.
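The bidirectional cycle constraint admits a compact sketch: translate local visual features into the text space and back (and text features the other way), then penalize the round-trip error. The MSE penalty and the module names `v2t`/`t2v` are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(v_local, t_feat, v2t, t2v):
    """Bidirectional cycle-consistency objective for local vision-text alignment.

    v_local: (N, d_v) local visual features (e.g., patch tokens)
    t_feat : (M, d_t) class text features
    v2t    : module mapping d_v -> d_t;  t2v: module mapping d_t -> d_v
    """
    v_cycle = t2v(v2t(v_local))   # vision -> text -> vision round trip
    t_cycle = v2t(t2v(t_feat))    # text -> vision -> text round trip
    return F.mse_loss(v_cycle, v_local) + F.mse_loss(t_cycle, t_feat)
```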
[201] PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation
Jianjian Yin, Tao Chen, Yi Chen, Gensheng Pei, Xiangbo Shu, Yazhou Yao, Fumin Shen
Main category: cs.CV
TL;DR: PCA-Seg introduces a parallel cost aggregation paradigm for open-vocabulary semantic and part segmentation, using expert-driven perceptual learning and feature orthogonalization to better capture vision-language alignment.
Details
Motivation: Existing vision-language models for open-vocabulary segmentation use serial aggregation that causes knowledge interference between class semantics and spatial context, limiting their ability to capture rich vision-language alignment.Method: Proposes parallel cost aggregation (PCA-Seg) with expert-driven perceptual learning module that integrates semantic and contextual streams in parallel, using multi-expert parser and coefficient mapper, plus feature orthogonalization decoupling to reduce redundancy.
Result: Achieves state-of-the-art performance on eight benchmarks while adding only 0.35M parameters per parallel block.
Conclusion: The parallel aggregation paradigm effectively alleviates knowledge interference in vision-language models for segmentation, enabling richer alignment information capture with minimal parameter overhead.
Abstract: Recent advances in vision-language models (VLMs) have garnered substantial attention in open-vocabulary semantic and part segmentation (OSPS). However, existing methods extract image-text alignment cues from cost volumes through a serial structure of spatial and class aggregations, leading to knowledge interference between class-level semantics and spatial context. Therefore, this paper proposes a simple yet effective parallel cost aggregation (PCA-Seg) paradigm to alleviate the above challenge, enabling the model to capture richer vision-language alignment information from cost volumes. Specifically, we design an expert-driven perceptual learning (EPL) module that efficiently integrates semantic and contextual streams. It incorporates a multi-expert parser to extract complementary features from multiple perspectives. In addition, a coefficient mapper is designed to adaptively learn pixel-specific weights for each feature, enabling the integration of complementary knowledge into a unified and robust feature embedding. Furthermore, we propose a feature orthogonalization decoupling (FOD) strategy to mitigate redundancy between the semantic and contextual streams, which allows the EPL module to learn diverse knowledge from orthogonalized features. Extensive experiments on eight benchmarks show that each parallel block in PCA-Seg adds merely 0.35M parameters while achieving state-of-the-art OSPS performance.
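The FOD strategy can be sketched as a simple regularizer that drives paired features of the semantic and contextual streams toward orthogonality; the squared-cosine penalty below is one standard instantiation, not necessarily the paper's exact loss:

```python
import torch.nn.functional as F

def orthogonalization_loss(sem_feat, ctx_feat):
    """Push the semantic and contextual streams toward orthogonality so the
    two parallel aggregation branches carry non-redundant cues.

    sem_feat, ctx_feat: (B, N, d) per-token features of the two streams.
    """
    cos = F.cosine_similarity(sem_feat, ctx_feat, dim=-1)  # (B, N)
    return (cos ** 2).mean()   # zero when paired features are orthogonal
```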
[202] FINER: MLLMs Hallucinate under Fine-grained Negative Queries
Rui Xiao, Sanghwan Kim, Yongqin Xian, Zeynep Akata, Stephan Alaniz
Main category: cs.CV
TL;DR: FINER introduces fine-grained negative queries to benchmark and reduce hallucinations in multimodal LLMs, with FINER-Tuning using DPO to improve performance.
Details
Motivation: Existing MLLM benchmarks focus on coarse image questions, missing fine-grained hallucinations. Need better evaluation and mitigation for fine-grained queries where models hallucinate when mismatches co-occur with present elements.Method: Introduces FINER benchmarks (FINER-CompreCap, FINER-DOCCI) with fine-grained negative queries across four settings. Proposes FINER-Tuning using Direct Preference Optimization on FINER-inspired data to reduce hallucinations.
Result: FINER-Tuning yields up to 24.2% gains on hallucination benchmarks (InternVL3.5-14B), improves performance on 8 existing hallucination suites, and enhances general multimodal capabilities across 6 benchmarks.
Conclusion: Fine-grained negative queries effectively benchmark and reduce hallucinations in MLLMs. FINER-Tuning with DPO significantly improves model performance on both specialized and general multimodal tasks.
Abstract: Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and “what” questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at https://explainableml.github.io/finer-project/.
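FINER-Tuning builds on standard Direct Preference Optimization; for reference, the objective over preferred (faithful) and dispreferred (hallucinated) answers is the textbook form below. Only `beta` is a free hyperparameter; the pairing of FINER-style negatives into (chosen, rejected) data is the paper's contribution and is assumed given:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss. Inputs are summed sequence log-probs of the preferred
    and dispreferred answers under the policy and under a frozen reference model.
    """
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # Increase the policy's preference margin relative to the reference model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```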
[203] MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
Yimin Wei, Aoran Xiao, Hongruixuan Chen, Junshi Xia, Naoto Yokoya
Main category: cs.CV
TL;DR: MM-OVSeg is a multimodal Optical-SAR fusion framework for open-vocabulary segmentation that works under adverse weather conditions by combining optical imagery’s spectral semantics with SAR’s cloud-penetrating structural cues.
Details
Motivation: Current open-vocabulary segmentation methods are limited to clear-sky optical data and struggle under cloudy/haze conditions. There's a need for resilient segmentation that works across diverse weather conditions in remote sensing applications.Method: Proposes MM-OVSeg with two key designs: 1) cross-modal unification process for multi-sensor representation alignment between optical and SAR data, and 2) dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation.
Result: Extensive experiments demonstrate superior robustness and generalization across diverse cloud conditions compared to existing methods.
Conclusion: MM-OVSeg effectively addresses the limitations of current vision-language models for dense prediction under adverse weather by leveraging complementary multimodal data and specialized fusion techniques.
Abstract: Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities: optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.
[204] Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models
Linghao Zhang, Jungang Li, Yonghua Hei, Sicheng Tao, Song Dai, Yibo Yan, Zihao Dongfang, Weiting Liu, Chenxi Qin, Hanqian Li, Xin Zou, Jiahao Zhang, Shuhang Xun, Haiyun Jiang, Xuming Hu
Main category: cs.CV
TL;DR: Video-SFT improves video performance in MLLMs but often hurts static image understanding, revealing a spatial-temporal trade-off that depends on frame sampling strategies.
Details
Motivation: To understand how video-based supervised fine-tuning (Video-SFT) affects the balance between spatial and temporal understanding in multimodal large language models, as current approaches show inconsistent effects on visual capabilities.Method: Systematic study across different architectures, parameter scales, and frame sampling settings to analyze how Video-SFT reshapes visual capabilities, including investigation of temporal budget effects and development of an instruction-aware Hybrid-Frame strategy.
Result: Video-SFT consistently improves video performance but yields limited gains or degradation on static image benchmarks; increasing sampled frames improves video performance but not reliably for images; Hybrid-Frame strategy partially mitigates the trade-off.
Conclusion: Video-SFT is not a free lunch for MLLMs - preserving spatial understanding remains a central challenge in joint image-video training, requiring careful consideration of temporal budget allocation.
Abstract: Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.
[205] ProGVC: Progressive-based Generative Video Compression via Auto-Regressive Context Modeling
Daowen Li, Ruixiao Dong, Ying Chen, Kai Li, Ding Ding, Li Li
Main category: cs.CV
TL;DR: ProGVC is a progressive generative video compression framework that uses hierarchical multi-scale residual token maps and transformer-based autoregressive context modeling for efficient entropy coding and perceptual detail restoration at low bitrates.
Details
Motivation: Existing perceptual video codecs lack native support for variable bitrate and progressive delivery, and their generative modules are weakly coupled with entropy coding, limiting bitrate reduction potential.Method: Encodes videos into hierarchical multi-scale residual token maps for progressive transmission. Uses transformer-based multi-scale autoregressive context model to estimate token probabilities for both entropy coding and predicting truncated fine-scale tokens at decoder.
Result: Extensive experiments show ProGVC delivers promising perceptual compression performance at low bitrates while offering practical scalability and progressive transmission capabilities.
Conclusion: ProGVC presents a new coding paradigm that unifies progressive transmission, efficient entropy coding, and detail synthesis within a single video codec framework.
Abstract: Perceptual video compression leverages generative priors to reconstruct realistic textures and motions at low bitrates. However, existing perceptual codecs often lack native support for variable bitrate and progressive delivery, and their generative modules are weakly coupled with entropy coding, limiting bitrate reduction. Inspired by the next-scale prediction in the Visual Auto-Regressive (VAR) models, we propose ProGVC, a Progressive-based Generative Video Compression framework that unifies progressive transmission, efficient entropy coding, and detail synthesis within a single codec. ProGVC encodes videos into hierarchical multi-scale residual token maps, enabling flexible rate adaptation by transmitting a coarse-to-fine subset of scales in a progressive manner. A Transformer-based multi-scale autoregressive context model estimates token probabilities, utilized both for efficient entropy coding of the transmitted tokens and for predicting truncated fine-scale tokens at the decoder to restore perceptual details. Extensive experiments demonstrate that as a new coding paradigm, ProGVC delivers promising perceptual compression performance at low bitrates while offering practical scalability at the same time.
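The hierarchical token maps follow the VAR next-scale recipe the abstract cites: quantize a coarse residual, subtract its upsampled reconstruction, and repeat at finer scales, so any coarse-to-fine prefix of token maps decodes to a progressively refined latent. A self-contained sketch with a nearest-neighbor codebook (the scale schedule and quantizer choice are assumptions):

```python
import torch
import torch.nn.functional as F

def multiscale_residual_tokens(latent, codebook, scales=(1, 2, 4, 8)):
    """VAR-style multi-scale residual tokenization of a frame latent.

    latent  : (B, C, H, W) frame latent
    codebook: (K, C) vector-quantization codebook
    Returns one (B, s, s) token map per scale; transmitting a coarse-to-fine
    prefix of these maps enables progressive decoding.
    """
    B, C, H, W = latent.shape
    residual, token_maps = latent, []
    for s in scales:
        r = F.adaptive_avg_pool2d(residual, (s, s))          # coarse residual
        flat = r.permute(0, 2, 3, 1).reshape(-1, C)          # (B*s*s, C)
        idx = torch.cdist(flat, codebook).argmin(dim=1)      # nearest code per vector
        token_maps.append(idx.reshape(B, s, s))
        quant = codebook[idx].reshape(B, s, s, C).permute(0, 3, 1, 2)
        # Remove what this scale already explains; finer scales encode the rest.
        residual = residual - F.interpolate(
            quant, size=(H, W), mode="bilinear", align_corners=False
        )
    return token_maps
```

In ProGVC the probabilities over these tokens come from the transformer context model, which serves double duty: entropy coding of transmitted scales and prediction of truncated fine scales at the decoder.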
[206] WeatherReasonSeg: A Benchmark for Weather-Aware Reasoning Segmentation in Visual Language Models
Wanjun Du, Zifeng Yuan, Tingting Chen, Fucai Ke, Beibei Lin, Shunli Zhang
Main category: cs.CV
TL;DR: WeatherReasonSeg benchmark evaluates vision-language models’ reasoning-based segmentation under adverse weather conditions, revealing performance degradation with increasing weather severity and distinct vulnerability patterns across weather types.
Details
Motivation: Existing VLM benchmarks use high-quality images under ideal conditions, but real-world applications face adverse weather that degrades visual cues. The paper addresses whether VLMs can maintain reliable reasoning segmentation when visual information is compromised by rain, snow, or fog.Method: Introduces WeatherReasonSeg benchmark with two components: 1) Controllable reasoning dataset using synthetic weather with varying severity levels applied to existing segmentation datasets, and 2) Real-world adverse-weather reasoning segmentation dataset with semantically consistent queries generated via mask-guided LLM prompting. Evaluates across five reasoning dimensions: functionality, application scenarios, structural attributes, interactions, and requirement matching.
Result: Extensive experiments show: 1) VLM performance degrades monotonically with increasing weather severity, and 2) Different weather types induce distinct vulnerability patterns in VLMs’ reasoning segmentation capabilities.
Conclusion: WeatherReasonSeg serves as a foundation for advancing robust, weather-aware reasoning in vision-language models, highlighting the need for models that can handle degraded visual information in real-world conditions.
Abstract: Existing vision-language models (VLMs) have demonstrated impressive performance in reasoning-based segmentation. However, current benchmarks are primarily constructed from high-quality images captured under idealized conditions. This raises a critical question: when visual cues are severely degraded by adverse weather conditions such as rain, snow, or fog, can VLMs sustain reliable reasoning segmentation capabilities? In response to this challenge, we introduce WeatherReasonSeg, a benchmark designed to evaluate VLM performance in reasoning-based segmentation under adverse weather conditions. It consists of two complementary components. First, we construct a controllable reasoning dataset by applying synthetic weather with varying severity levels to existing segmentation datasets, enabling fine-grained robustness analysis. Second, to capture real-world complexity, we curate a real-world adverse-weather reasoning segmentation dataset with semantically consistent queries generated via mask-guided LLM prompting. We further broaden the evaluation scope across five reasoning dimensions, including functionality, application scenarios, structural attributes, interactions, and requirement matching. Extensive experiments across diverse VLMs reveal two key findings: (1) VLM performance degrades monotonically with increasing weather severity, and (2) different weather types induce distinct vulnerability patterns. We hope WeatherReasonSeg will serve as a foundation for advancing robust, weather-aware reasoning.
[207] Prompt-Free Universal Region Proposal Network
Qihong Tang, Changhan Liu, Shaofeng Zhang, Wenbin Li, Qi Fan, Yang Gao
Main category: cs.CV
TL;DR: PF-RPN is a prompt-free universal region proposal network that identifies potential objects without external prompts using learnable query embeddings and cascaded self-prompting.
Details
Motivation: Existing object localization methods rely on exemplar images, predefined categories, or textual descriptions, limiting flexibility in real-world scenarios. The authors aim to create a more adaptable approach that doesn't require external prompts.Method: Three main components: 1) Sparse Image-Aware Adapter (SIA) for initial localization using learnable query embeddings, 2) Cascade Self-Prompt (CSP) module for identifying remaining objects through self-prompted embeddings, and 3) Centerness-Guided Query Selection (CG-QS) for selecting high-quality query embeddings using centerness scoring.
Result: The method can be optimized with limited data (5% of MS COCO) and applied directly to various domains without fine-tuning, including underwater object detection, industrial defect detection, and remote sensing. Experimental validation across 19 datasets demonstrates effectiveness.
Conclusion: PF-RPN provides a flexible, prompt-free approach to object localization that works across diverse application domains without requiring fine-tuning, addressing limitations of prompt-dependent methods.
Abstract: Identifying potential objects is critical for object recognition and analysis across various computer vision applications. Existing methods typically localize potential objects by relying on exemplar images, predefined categories, or textual descriptions. However, their reliance on image and text prompts often limits flexibility, restricting adaptability in real-world scenarios. In this paper, we introduce a novel Prompt-Free Universal Region Proposal Network (PF-RPN), which identifies potential objects without relying on external prompts. First, the Sparse Image-Aware Adapter (SIA) module performs initial localization of potential objects using a learnable query embedding dynamically updated with visual features. Next, the Cascade Self-Prompt (CSP) module identifies the remaining potential objects by leveraging the self-prompted learnable embedding, autonomously aggregating informative visual features in a cascading manner. Finally, the Centerness-Guided Query Selection (CG-QS) module facilitates the selection of high-quality query embeddings using a centerness scoring network. Our method can be optimized with limited data (e.g., 5% of MS COCO data) and applied directly to various object detection application domains for identifying potential objects without fine-tuning, such as underwater object detection, industrial defect detection, and remote sensing image object detection. Experimental results across 19 datasets validate the effectiveness of our method. Code is available at https://github.com/tangqh03/PF-RPN.
[208] Eye image segmentation using visual and concept prompts with Segment Anything Model 3 (SAM3)
Diederick C. Niehorster, Marcus Nyström
Main category: cs.CV
TL;DR: SAM3 shows no improvement over SAM2 for eye image segmentation across lab and in-the-wild datasets, with SAM2 performing better and faster.
Details
Motivation: To evaluate whether the latest Segment Anything Model (SAM3) offers better eye image segmentation performance than SAM2, and to explore its new text prompting capabilities for this specific medical imaging task.Method: Evaluated SAM3’s segmentation performance using diverse datasets including high-resolution lab videos and challenging in-the-wild TEyeD dataset. Compared SAM3 with visual/concept prompts against SAM2 using their adapted codebase for arbitrary video duration processing.
Result: SAM3 with either visual or concept prompts did not perform better than SAM2 in most cases for both lab and in-the-wild datasets. SAM2 not only performed better but was also faster.
Conclusion: SAM2 remains the best option for eye image segmentation, showing that the latest iteration (SAM3) does not provide improvements for this specific medical imaging application.
Abstract: Previous work has reported that vision foundation models show promising zero-shot performance in eye image segmentation. Here we examine whether the latest iteration of the Segment Anything Model, SAM3, offers better eye image segmentation performance than SAM2, and explore the performance of its new concept (text) prompting mode. Eye image segmentation performance was evaluated using diverse datasets encompassing both high-resolution high-quality videos from a lab environment and the TEyeD dataset consisting of challenging eye videos acquired in the wild. Results show that in most cases SAM3 with either visual or concept prompts did not perform better than SAM2, for both lab and in-the-wild datasets. Since SAM2 not only performed better but was also faster, we conclude that SAM2 remains the best option for eye image segmentation. We provide our adaptation of SAM3’s codebase that allows processing videos of arbitrary duration.
[209] Face anonymization preserving facial expressions and photometric realism
Luigi Celona, Simone Bianco, Raimondo Schettini
Main category: cs.CV
TL;DR: A face anonymization framework that preserves facial expressions and photometric consistency (lighting/skin tone) while concealing identity, with improved realism and feature fidelity compared to existing methods.
Details
Motivation: Existing face anonymization methods focus on identity removal and image realism but neglect important facial features like expressions and photometric consistency (illumination, skin tone), which are critical for applications like relighting, color constancy, and affective analysis.Method: Extends DeepPrivacy by incorporating dense facial landmarks to better retain expressions, and introduces lightweight post-processing modules to ensure consistency in lighting direction and skin color. Also establishes new evaluation metrics for expression fidelity, lighting consistency, and color preservation.
Result: Experiments on CelebA-HQ dataset show the method produces anonymized faces with improved realism and significantly higher fidelity in expression, illumination, and skin tone compared to state-of-the-art baselines.
Conclusion: The work demonstrates the importance of feature-aware anonymization for more useful, fair, and trustworthy privacy-preserving facial data, addressing limitations of existing methods that neglect critical facial attributes.
Abstract: The widespread sharing of face images on social media platforms and in large-scale datasets raises pressing privacy concerns, as biometric identifiers can be exploited without consent. Face anonymization seeks to generate realistic facial images that irreversibly conceal the subject’s identity while preserving their usefulness for downstream tasks. However, most existing generative approaches focus on identity removal and image realism, often neglecting facial expressions as well as photometric consistency – specifically attributes such as illumination and skin tone – that are critical for applications like relighting, color constancy, and medical or affective analysis. In this work, we propose a feature-preserving anonymization framework that extends DeepPrivacy by incorporating dense facial landmarks to better retain expressions, and by introducing lightweight post-processing modules that ensure consistency in lighting direction and skin color. We further establish evaluation metrics specifically designed to quantify expression fidelity, lighting consistency, and color preservation, complementing standard measures of image realism, pose accuracy, and re-identification resistance. Experiments on the CelebA-HQ dataset demonstrate that our method produces anonymized faces with improved realism and significantly higher fidelity in expression, illumination, and skin tone compared to state-of-the-art baselines. These results underscore the importance of feature-aware anonymization as a step toward more useful, fair, and trustworthy privacy-preserving facial data.
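The lightweight photometric modules are described only at a high level; a common baseline for this kind of consistency fix is a Reinhard-style statistics transfer in Lab space, sketched below as one way such a post-processing step could work (a stand-in, not the paper's module):

```python
import numpy as np
import cv2  # opencv-python

def match_photometry(anon_face, source_face):
    """Match the anonymized face's per-channel mean and std in Lab space to the
    original, keeping lighting level and skin tone consistent.

    anon_face, source_face: uint8 BGR images of the same subject region.
    """
    a = cv2.cvtColor(anon_face, cv2.COLOR_BGR2LAB).astype(np.float32)
    s = cv2.cvtColor(source_face, cv2.COLOR_BGR2LAB).astype(np.float32)
    for c in range(3):
        a_mu, a_sd = a[..., c].mean(), a[..., c].std() + 1e-6
        s_mu, s_sd = s[..., c].mean(), s[..., c].std()
        # Shift and rescale each channel's statistics toward the source.
        a[..., c] = (a[..., c] - a_mu) / a_sd * s_sd + s_mu
    out = np.clip(a, 0, 255).astype(np.uint8)
    return cv2.cvtColor(out, cv2.COLOR_LAB2BGR)
```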
[210] PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery
Yijing Guo, Mengjun Chao, Luo Wang, Tianyang Zhao, Haizhao Dai, Yingliang Zhang, Jingyi Yu, Yujiao Shi
Main category: cs.CV
TL;DR: PanoVGGT is a permutation-equivariant Transformer framework for joint camera pose estimation, depth prediction, and 3D reconstruction from panoramic images, addressing challenges of spherical distortions with spherical-aware embeddings and SO(3) augmentation.
Details
Motivation: Panoramic imagery with 360° field of view introduces non-pinhole distortions that challenge existing perspective camera models for joint pose estimation and 3D reconstruction, requiring specialized approaches for spherical domain reasoning.Method: Proposes PanoVGGT, a permutation-equivariant Transformer framework with spherical-aware positional embeddings, panorama-specific three-axis SO(3) rotation augmentation, and stochastic anchoring strategy to resolve global-frame ambiguity.
Result: Achieves competitive accuracy, strong robustness, and improved cross-domain generalization on PanoCity dataset and standard benchmarks, with code and dataset to be released.
Conclusion: PanoVGGT effectively addresses panoramic image processing challenges through specialized spherical domain reasoning, demonstrating strong performance in joint pose estimation and 3D reconstruction tasks.
Abstract: Panoramic imagery offers a full 360° field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations. Extensive experiments on PanoCity and standard benchmarks demonstrate that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Code and dataset will be released.
[211] SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition
Jingxiao Yang, DaLin He, Miao Pan, Ge Su, Wenqi Zhang, Yifeng Hu, Tangwei Li, Yuke Li, Xuhong Zhang
Main category: cs.CV
TL;DR: SARE is a sample-wise adaptive reasoning framework for training-free fine-grained visual recognition using large vision-language models, combining fast retrieval with fine-grained reasoning only when needed, and incorporating self-reflective experience from past failures.
Details
Motivation: Existing methods for fine-grained visual recognition using LVLMs have two key limitations: (1) they use the same inference pipeline for all samples regardless of difficulty, leading to suboptimal accuracy and efficiency, and (2) they lack mechanisms to consolidate and reuse error-specific experience, causing repeated failures on similar challenging cases.Method: SARE adopts a cascaded design combining fast candidate retrieval with fine-grained reasoning, invoking reasoning only when necessary. It incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference without parameter updates.
Result: Extensive experiments across 14 datasets show SARE achieves state-of-the-art performance while substantially reducing computational overhead.
Conclusion: SARE effectively addresses the limitations of existing methods by providing adaptive reasoning and leveraging past experience, making training-free fine-grained visual recognition more accurate and efficient.
Abstract: Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) They apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, thereby leading to suboptimal accuracy and efficiency; (2) The lack of mechanisms to consolidate and reuse error-specific experience causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive REasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets substantiate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.
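The cascade admits a small control-flow sketch: answer from fast retrieval when the similarity margin is decisive, otherwise hand a candidate shortlist to the reasoning stage. The margin threshold, shortlist size, and `reason_fn` interface are illustrative assumptions:

```python
import torch

def cascaded_recognition(image_feat, class_text_feats, reason_fn, margin_thresh=0.05):
    """Two-stage FGVR: retrieval first, fine-grained reasoning only when needed.

    image_feat      : (d,) L2-normalized image embedding
    class_text_feats: (C, d) L2-normalized class-name embeddings
    reason_fn       : callable(candidate_ids) -> predicted class id
                      (stands in for the LVLM reasoning stage)
    """
    sims = class_text_feats @ image_feat              # (C,) cosine similarities
    top2 = sims.topk(2)
    margin = (top2.values[0] - top2.values[1]).item()
    if margin >= margin_thresh:
        return top2.indices[0].item()                 # easy sample: retrieval answer
    # Ambiguous sample: pass a shortlist to the reasoning stage.
    k = min(5, sims.numel())
    return reason_fn(sims.topk(k).indices.tolist())
```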
[212] LoGSAM: Parameter-Efficient Cross-Modal Grounding for MRI Segmentation
Mohammad Robaitul Islam Bhuiyan, Sheethal Bhat, Melika Qahqaie, Tri-Thien Nguyen, Paula Andrea Pérez Toro, Tomas Arias Vergara, Andreas Maier
Main category: cs.CV
TL;DR: LoGSAM: A parameter-efficient framework that transforms radiologist speech dictations into text prompts for brain tumor localization and segmentation using foundation models with minimal fine-tuning.
Details
Motivation: Existing brain tumor segmentation methods rely on task-specific supervised models with limited annotated data. Need for efficient domain adaptation that leverages foundation models while minimizing parameter updates.
Method: 1) Transcribe radiologist speech using Whisper ASR, 2) Extract tumor-specific textual prompts via negation-aware clinical NLP, 3) Use LoRA-adapted Grounding DINO for text-conditioned tumor localization (5% parameter update), 4) Use predicted bounding boxes as prompts for frozen MedSAM to generate pixel-level masks (sketched below).
Result: Achieves state-of-the-art dice score of 80.32% on BRISC 2025. On 12 unseen German MRI scans with radiologist dictations, achieves 91.7% case-level accuracy.
Conclusion: Demonstrates feasibility of modular speech-to-segmentation pipeline using pretrained foundation models with minimal parameter updates, enabling efficient clinical adaptation.
Abstract: Precise localization and delineation of brain tumors using Magnetic Resonance Imaging (MRI) are essential for planning therapy and guiding surgical decisions. However, most existing approaches rely on task-specific supervised models and are constrained by the limited availability of annotated data. To address this, we propose LoGSAM, a parameter-efficient, detection-driven framework that transforms radiologist dictation into text prompts for foundation-model-based localization and segmentation. Radiologist speech is first transcribed and translated using a pretrained Whisper ASR model, followed by negation-aware clinical NLP to extract tumor-specific textual prompts. These prompts guide text-conditioned tumor localization via a LoRA-adapted vision-language detection model, Grounding DINO (GDINO). The LoRA adaptation updates only 5% of the model parameters, enabling computationally efficient domain adaptation while preserving pretrained cross-modal knowledge. The predicted bounding boxes are used as prompts for MedSAM to generate pixel-level tumor masks without any additional fine-tuning. Conditioning the frozen MedSAM on LoGSAM-derived priors yields a state-of-the-art dice score of 80.32% on BRISC 2025. In addition, we evaluate the full pipeline using German dictations from a board-certified radiologist on 12 unseen MRI scans, achieving 91.7% case-level accuracy. These results highlight the feasibility of constructing a modular, speech-to-segmentation pipeline by intelligently leveraging pretrained foundation models with minimal parameter updates.
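Read end to end, the pipeline is four calls chained together. In the hedged sketch below, only the Whisper usage follows a real API (the openai-whisper package, where task="translate" transcribes and translates to English); `extract_tumor_prompt`, `gdino_lora`, and `medsam` are hypothetical placeholders for the negation-aware NLP step and the adapted detection and segmentation models.
```python
import whisper

def speech_to_mask(audio_path, mri_image, extract_tumor_prompt, gdino_lora, medsam):
    asr = whisper.load_model("small")
    # Transcribe and translate the (possibly German) dictation to English.
    text = asr.transcribe(audio_path, task="translate")["text"]
    prompt = extract_tumor_prompt(text)       # negation-aware clinical NLP (assumed)
    boxes = gdino_lora(mri_image, prompt)     # LoRA-adapted Grounding DINO (assumed)
    return medsam(mri_image, boxes)           # frozen MedSAM, box-prompted (assumed)
```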
[213] Trust the Unreliability: Inward Backward Dynamic Unreliability Driven Coreset Selection for Medical Image Classification
Yan Liang, Ziyuan Yang, Zhuxin Lei, Mengyu Sun, Yingyu Chen, Yi Zhang
Main category: cs.CV
TL;DR: DUCS is a coreset selection method for medical imaging that selects unreliable samples near decision boundaries by analyzing confidence fluctuations and forgetting frequency during training, improving model performance at high compression rates.
Details
Motivation: Medical imaging datasets are large and complex with high intra-class variation and inter-class similarity, making coreset selection challenging. Traditional methods focusing on stable, center samples may not effectively model decision boundaries, while unreliable samples near boundaries could be more informative.
Method: Proposes Dynamic Unreliability-Driven Coreset Selection (DUCS) with two assessment perspectives: 1) Inward Self-Awareness - analyzes confidence evolution during training to quantify uncertainty; 2) Backward Memory Tracking - tracks frequency of forgetting samples to evaluate retention ability. Selects samples with substantial confidence fluctuations and repeated forgetting during training (sketched below).
Result: Extensive experiments on public medical datasets show superior performance compared to state-of-the-art methods, particularly at high compression rates.
Conclusion: Unreliable samples near decision boundaries are more informative for model training than stable center samples, and the proposed DUCS strategy effectively identifies these samples to improve coreset selection for medical imaging datasets with limited resources.
Abstract: Efficiently managing and utilizing large-scale medical imaging datasets with limited resources presents significant challenges. While coreset selection helps reduce computational costs, its effectiveness in medical data remains limited due to inherent complexity, such as large intra-class variation and high inter-class similarity. To address this, we revisit the training process and observe that neural networks consistently produce stable confidence predictions and better remember samples near class centers in training. However, concentrating on these samples may complicate the modeling of decision boundaries. Hence, we argue that the more unreliable samples are, in fact, the more informative in helping build the decision boundary. Based on this, we propose the Dynamic Unreliability-Driven Coreset Selection (DUCS) strategy. Specifically, we introduce an inward-backward unreliability assessment perspective: 1) Inward Self-Awareness: The model introspects its behavior by analyzing the evolution of confidence during training, thereby quantifying the uncertainty of each sample. 2) Backward Memory Tracking: The model reflects on its training trajectory by tracking the frequency of forgetting samples, thus evaluating its retention ability for each sample. Next, we select unreliable samples that exhibit substantial confidence fluctuations and are repeatedly forgotten during training. This selection process ensures that the chosen samples are near the decision boundary, thereby aiding the model in refining the boundary. Extensive experiments on public medical datasets demonstrate our superior performance compared to state-of-the-art (SOTA) methods, particularly at high compression rates.
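Both unreliability signals can be computed from per-epoch training logs. The sketch below combines confidence fluctuation (standard deviation of true-class confidence over epochs) with forgetting counts (correct-to-wrong transitions); the plain sum of normalized terms is an assumption for illustration, not the paper's exact scoring rule.
```python
import numpy as np

def unreliability_scores(confidences, correct):
    """confidences: (epochs, n) true-class confidence; correct: (epochs, n) bool."""
    fluctuation = confidences.std(axis=0)                 # inward self-awareness
    # A forgetting event: correct at epoch t but wrong at epoch t + 1.
    forgets = (correct[:-1] & ~correct[1:]).sum(axis=0)   # backward memory tracking
    norm = lambda v: v / (v.max() + 1e-8)
    return norm(fluctuation) + norm(forgets.astype(float))

# Coreset = the most unreliable samples, e.g. the top 10% at a 90% pruning rate:
# coreset_idx = np.argsort(-unreliability_scores(conf, corr))[: n // 10]
```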
[214] ReLaGS: Relational Language Gaussian Splatting
Yaxu Xie, Abdalla Arafa, Alireza Javanmardi, Christen Millerdurai, Jia Cheng Hu, Shaoxiang Wang, Alain Pagani, Didier Stricker
Main category: cs.CV
TL;DR: A novel framework for unified 3D perception and reasoning that constructs hierarchical language-distilled Gaussian scenes and 3D semantic scene graphs without scene-specific training, enabling open-vocabulary 3D reasoning across segmentation, retrieval, and relation understanding tasks.
Details
Motivation: Existing methods for 3D perception and reasoning are either object-centric or require costly training for inter-object reasoning, lacking a unified approach that can handle multiple tasks like segmentation, retrieval, and relation understanding without scene-specific training.
Method: Constructs hierarchical language-distilled Gaussian scenes using Gaussian pruning for geometry refinement and multi-view language alignment for aggregating 2D features into accurate 3D object embeddings (sketched below). Builds open-vocabulary 3D scene graphs with Vision Language annotations and Graph Neural Network-based relational reasoning.
Result: Validated across multiple tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval, demonstrating efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships.
Conclusion: The framework enables unified 3D perception and reasoning without scene-specific training, providing a scalable solution for open-vocabulary 3D understanding across multiple tasks through hierarchical semantic modeling and relational reasoning.
Abstract: Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision Language derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated across tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval. Project page: https://dfki-av.github.io/ReLaGS/
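The multi-view language alignment step can be pictured as a confidence-weighted average of per-view 2D features per object, followed by cosine matching against text queries. The sketch below shows that aggregation under assumed shapes; the softmax weighting is an illustrative choice, not the paper's exact robust-alignment rule.
```python
import torch
import torch.nn.functional as F

def aggregate_object_embedding(view_feats, view_conf):
    """view_feats: (V, D) per-view language features of one object;
    view_conf: (V,) visibility/quality scores. Returns a unit-norm 3D embedding."""
    w = torch.softmax(view_conf, dim=0)
    return F.normalize((w[:, None] * view_feats).sum(dim=0), dim=0)

# Open-vocabulary retrieval then reduces to cosine similarity:
# score = aggregate_object_embedding(feats, conf) @ F.normalize(text_emb, dim=0)
```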
[215] Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients
Ziwei Xiang, Fanhu Zeng, Hongjian Fang, Rui-Qi Wang, Renxing Chen, Yanan Zhu, Yi Chen, Peipei Yang, Xu-Yao Zhang
Main category: cs.CV
TL;DR: QIG introduces fine-grained token-level quantization for LVLMs using integrated gradients to measure token sensitivity, improving accuracy over modality-level methods with minimal latency overhead.
Details
Motivation: Current LVLM quantization methods use modality-level sensitivity measurement, which fails to capture complex cross-token interactions and doesn't quantitatively measure quantization error at token level. As tokens interact, modality distinctions diminish, requiring fine-grained calibration.
Method: Proposes Quantization-aware Integrated Gradients (QIG), which uses integrated gradients to quantitatively evaluate token sensitivity, pushing granularity from modality level to token level to reflect both inter-modality and intra-modality dynamics (sketched below).
Result: Extensive experiments on multiple LVLMs under W4A8 and W3A16 settings show improved accuracy across models and benchmarks with negligible latency overhead. Under 3-bit weight-only quantization, improves average accuracy of LLaVA-onevision-7B by 1.60%, reducing gap to full-precision to only 1.33%.
Conclusion: QIG provides an effective fine-grained quantization strategy for LVLMs that addresses limitations of modality-level approaches by capturing token-level interactions, enabling more efficient deployment of large vision-language models.
Abstract: Large Vision Language Models (LVLMs) have achieved remarkable success in a range of downstream tasks that require multimodal interaction, but their capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the complex cross-token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine-grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to only 1.33%. The code is available at https://github.com/ucas-xiang/QIG.
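Integrated gradients accumulate the gradient of the model output along a straight path from a baseline to the input, so attribution mass can be assigned per token. The sketch below is a generic Riemann-sum implementation with a zero baseline and a per-token norm as the sensitivity score; both are common defaults assumed here, not necessarily QIG's exact choices.
```python
import torch

def token_sensitivity(model_fn, tokens, steps=32):
    """tokens: (T, D) token embeddings; model_fn: (T, D) -> scalar output.
    Returns (T,) integrated-gradients sensitivity per token."""
    baseline = torch.zeros_like(tokens)
    total_grad = torch.zeros_like(tokens)
    for a in torch.linspace(1.0 / steps, 1.0, steps):
        x = (baseline + a * (tokens - baseline)).detach().requires_grad_(True)
        model_fn(x).backward()
        total_grad += x.grad
    attributions = (tokens - baseline) * total_grad / steps   # (T, D)
    return attributions.norm(dim=-1)
```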
[216] S-VGGT: Structure-Aware Subscene Decomposition for Scalable 3D Foundation Models
Xinze Li, Pengxu Chen, Yiyuan Wang, Weifeng Su, Wentao Cheng
Main category: cs.CV
TL;DR: S-VGGT is a novel approach that addresses computational redundancy in 3D foundation models by operating at the structural frame level rather than token level, enabling significant acceleration through scene partitioning and parallel processing.
Details
Motivation: Feed-forward 3D foundation models suffer from quadratic computational costs due to global attention, which limits scalability. Existing token-level acceleration methods introduce overhead and fail to address structural redundancy in dense capture data.
Method: The method builds a dense scene graph from initial features to characterize structural redundancy, then softly assigns frames to balanced subscenes with a common reference frame (sketched below). This enables independent parallel processing without explicit geometric alignment, cutting global attention costs at the source.
Result: S-VGGT provides strong intrinsic acceleration by addressing redundancy at the structural level, and is orthogonal to token-level methods, allowing for compounded speedups without compromising reconstruction fidelity.
Conclusion: The approach fundamentally shifts optimization focus from token-level to structural-level redundancy, offering a novel solution to computational bottlenecks in 3D foundation models that can be combined with existing acceleration techniques.
Abstract: Feed-forward 3D foundation models face a key challenge: the quadratic computational cost introduced by global attention, which severely limits scalability as input length increases. Concurrent acceleration methods, such as token merging, operate at the token level. While they offer local savings, the required nearest-neighbor searches introduce undesirable overhead. Consequently, these techniques fail to tackle the fundamental issue of structural redundancy dominant in dense capture data. In this work, we introduce S-VGGT, a novel approach that addresses redundancy at the structural frame level, drastically shifting the optimization focus. We first leverage the initial features to build a dense scene graph, which characterizes structural scene redundancy and guides the subsequent scene partitioning. Using this graph, we softly assign frames to a small number of subscenes, guaranteeing balanced groups and smooth geometric transitions. The core innovation lies in designing the subscenes to share a common reference frame, establishing a parallel geometric bridge that enables independent and highly efficient processing without explicit geometric alignment. This structural reorganization provides strong intrinsic acceleration by cutting the global attention cost at its source. Crucially, S-VGGT is entirely orthogonal to token-level acceleration methods, allowing the two to be seamlessly combined for compounded speedups without compromising reconstruction fidelity. Code is available at https://github.com/Powertony102/S-VGGT.
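The decomposition can be caricatured in a few lines: pool per-frame features, group frames by similarity, and prepend one shared reference frame to every group so all subscenes resolve in a common coordinate frame. The k-means grouping below is a deliberate simplification of the paper's graph-based soft assignment.
```python
import numpy as np
from sklearn.cluster import KMeans

def make_subscenes(frame_feats, n_subscenes=4, ref_idx=0):
    """frame_feats: (N, D) pooled per-frame features. Returns frame-index groups."""
    labels = KMeans(n_clusters=n_subscenes, n_init=10).fit_predict(frame_feats)
    return [[ref_idx] + [i for i, l in enumerate(labels) if l == s and i != ref_idx]
            for s in range(n_subscenes)]   # every subscene shares the reference frame
```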
[217] ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
Dmitriy Rivkin, Parker Ewen, Lili Gao, Julian Ost, Stefanie Walz, Rasika Kangutkar, Mario Bijelic, Felix Heide
Main category: cs.CV
TL;DR: ChopGrad introduces truncated backpropagation for video diffusion models, enabling efficient fine-tuning with pixel-wise losses by limiting gradient computation to local frame windows while maintaining global consistency.
Details
Motivation: Current video diffusion models use recurrent frame processing that requires storing activations across entire video sequences during training, leading to prohibitive memory costs. This makes fine-tuning with pixel-wise losses computationally intractable for long or high-resolution videos.
Method: ChopGrad uses a truncated backpropagation scheme that limits gradient computation to local frame windows rather than the entire video sequence (sketched below). This reduces memory scaling from linear with frame count to constant memory while theoretically maintaining global consistency.
Result: ChopGrad achieves state-of-the-art performance across multiple conditional video generation tasks including video super-resolution, inpainting, enhancement of neural-rendered scenes, and controlled driving video generation, while significantly reducing training memory requirements.
Conclusion: ChopGrad provides an efficient solution for fine-tuning video diffusion models with pixel-wise losses, overcoming the memory limitations of recurrent architectures while maintaining high-quality video generation across diverse applications.
Abstract: Recent video diffusion models achieve high-quality generation through recurrent frame processing where each frame generation depends on previous frames. However, this recurrent mechanism means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation also makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding, limiting gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from scaling linearly with the number of video frames (full backpropagation) to constant memory, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.
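The mechanism is the video analogue of truncated backpropagation through time: the recurrent state is detached at every window boundary so the autograd graph, and hence activation memory, never spans more than one window. A minimal sketch, with a hypothetical `decode_frame` standing in for the model's recurrent decoder and a plain MSE as the pixel-wise loss:
```python
import torch
import torch.nn.functional as F

def chopped_backward(decode_frame, state, targets, window=4):
    """Backprop within local frame windows only, for constant activation memory.
    decode_frame: state -> (frame, next_state); targets: list of target frames."""
    losses = []
    for t, target in enumerate(targets):
        frame, state = decode_frame(state)
        losses.append(F.mse_loss(frame, target))
        if (t + 1) % window == 0 or t == len(targets) - 1:
            torch.stack(losses).sum().backward()   # frees this window's graph
            losses, state = [], state.detach()     # chop before the next window
    # parameter gradients are now accumulated across all windows
```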
[218] A Multi-Agent System for Building-Age Cohort Mapping to Support Urban Energy Planning
Kundan Thota, Thorsten Schlachter, Veit Hagenmeyer
Main category: cs.CV
TL;DR: Multi-agent LLM system fuses heterogeneous data sources to classify building ages from satellite imagery for sustainable urban heat planning, achieving 90.69% accuracy but with class imbalance challenges.
Details
Motivation: Existing approaches for determining urban building age distribution rely on inconsistent sensor/remote sensing data, creating gaps crucial for sustainable municipal heat planning and upgrade prioritization.
Method: Three-agent LLM system (Zensus, OSM, Monument agents) fuses heterogeneous data sources, with data orchestrator/harmonizer for geocoding/deduplication. BuildingAgeCNN classifier uses ConvNeXt backbone with FPN, CoordConv, and SE blocks for satellite-only classification (CoordConv sketched below).
Result: BuildingAgeCNN achieves 90.69% overall accuracy but modest 67.25% macro-F1 due to class imbalance and confusion between adjacent historical cohorts. Pipeline includes calibrated confidence estimates and flags low-confidence cases.
Conclusion: Multi-agent LLM system assists in gathering structured building data for energy demand planners to optimize district-heating networks and target low-carbon sustainable energy systems.
Abstract: Determining the age distribution of the urban building stock is crucial for sustainable municipal heat planning and upgrade prioritization. However, existing approaches often rely on datasets gathered via sensors or remote sensing techniques, leaving inconsistencies and gaps in data. We present a multi-agent LLM system comprising three key agents (the Zensus agent, the OSM agent, and the Monument agent) that fuse data from heterogeneous sources. A data orchestrator and harmonizer geocodes and deduplicates building imprints. Using this fused ground truth, we introduce BuildingAgeCNN, a satellite-only classifier based on a ConvNeXt backbone augmented with a Feature Pyramid Network (FPN), CoordConv spatial channels, and Squeeze-and-Excitation (SE) blocks. Under spatial cross validation, BuildingAgeCNN attains an overall accuracy of 90.69% but a modest macro-F1 of 67.25%, reflecting strong class imbalance and persistent confusions between adjacent historical cohorts. To mitigate risk for planning applications, the address-to-prediction pipeline includes calibrated confidence estimates and flags low-confidence cases for manual review. This multi-agent LLM system not only assists in gathering structured data but also helps energy demand planners optimize district-heating networks and target low-carbon sustainable energy systems.
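Of the architectural additions, CoordConv is the simplest to show: two normalized coordinate channels are concatenated to the feature map before a convolution, letting the classifier exploit absolute position within the satellite tile. A standard construction, sketched below; it is not the paper's code.
```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Conv2d over the input plus normalized y/x coordinate channels."""
    def __init__(self, in_ch, out_ch, **conv_kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 2, out_ch, **conv_kwargs)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, ys, xs], dim=1))

# layer = CoordConv2d(3, 16, kernel_size=3, padding=1)
```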
[219] Part-Aware Open-Vocabulary 3D Affordance Grounding via Prototypical Semantic and Geometric Alignment
Dongqiang Gou, Xuming He
Main category: cs.CV
TL;DR: A two-stage cross-modal framework for open-vocabulary 3D affordance grounding that enhances semantic and geometric representations using LLM-generated part-aware instructions and novel geometric modeling components.
Details
Motivation: Existing methods for language-driven 3D affordance grounding face challenges in open-vocabulary generalization, fine-grained geometric alignment, and part-level semantic consistency, which are essential for embodied intelligence and human-AI interaction.
Method: Two-stage framework: 1) Large language models generate part-aware instructions to recover missing semantics and link semantically similar affordances; 2) Affordance Prototype Aggregation (APA) captures cross-object geometric consistency (sketched below), and Intra-Object Relational Modeling (IORM) refines geometric differentiation within objects for precise semantic alignment.
Result: Superior performance demonstrated through extensive experiments on a newly introduced benchmark and two existing benchmarks, outperforming existing methods.
Conclusion: The proposed framework effectively addresses challenges in open-vocabulary 3D affordance grounding by enhancing both semantic and geometric representations through LLM-based semantic recovery and novel geometric modeling techniques.
Abstract: Grounding natural language questions to functionally relevant regions in 3D objects – termed language-driven 3D affordance grounding – is essential for embodied intelligence and human-AI interaction. Existing methods, while progressing from label-based to language-driven approaches, still face challenges in open-vocabulary generalization, fine-grained geometric alignment, and part-level semantic consistency. To address these issues, we propose a novel two-stage cross-modal framework that enhances both semantic and geometric representations for open-vocabulary 3D affordance grounding. In the first stage, large language models generate part-aware instructions to recover missing semantics, enabling the model to link semantically similar affordances. In the second stage, we introduce two key components: Affordance Prototype Aggregation (APA), which captures cross-object geometric consistency for each affordance, and Intra-Object Relational Modeling (IORM), which refines geometric differentiation within objects to support precise semantic alignment. We validate the effectiveness of our method through extensive experiments on a newly introduced benchmark, as well as two existing benchmarks, demonstrating superior performance in comparison with existing methods.
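Affordance Prototype Aggregation can be approximated as one embedding per affordance, averaged over the point features of that affordance's region across objects; new point features are then scored by cosine similarity to each prototype. The normalized-mean aggregation below is an illustrative simplification, not the paper's exact construction.
```python
import torch
import torch.nn.functional as F

def affordance_prototypes(point_feats, afford_ids, n_afford):
    """point_feats: (N, D) point features pooled across objects;
    afford_ids: (N,) affordance label per point. Returns (n_afford, D)."""
    protos = torch.zeros(n_afford, point_feats.shape[1])
    for a in range(n_afford):
        mask = afford_ids == a
        if mask.any():
            protos[a] = point_feats[mask].mean(dim=0)
    return F.normalize(protos, dim=-1)

# grounding score per point: F.normalize(point_feats, dim=-1) @ protos.T
```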
[220] Few-Step Diffusion Sampling Through Instance-Aware Discretizations
Liangyu Yuan, Ruoyu Wang, Tong Zhao, Dingwen Fu, Mingkun Lei, Beier Zhu, Chi Zhang
Main category: cs.CV
TL;DR: Instance-aware discretization framework for diffusion/flow matching models that adapts timestep schedules based on input complexity rather than using global uniform schedules
Details
Motivation: Current diffusion/flow matching models use globally shared timestep schedules that fail to account for instance-specific complexity in the generative process, potentially limiting performance. Synthetic experiments reveal suboptimality of global schedules under instance-specific dynamics.
Method: Proposes an instance-aware discretization framework that learns to adapt timestep allocations based on input-dependent priors, extending gradient-based discretization search to the conditional generative setting (sketched below).
Result: Empirical results across diverse settings (synthetic data, pixel-space diffusion, latent-space images and video flow matching models) demonstrate consistent improvement in generation quality with marginal tuning cost and negligible inference overhead.
Conclusion: Instance-aware discretization provides a more efficient and effective approach to sampling in diffusion/flow matching models by adapting to input-specific complexity rather than using one-size-fits-all schedules.
Abstract: Diffusion and flow matching models generate high-fidelity data by simulating paths defined by Ordinary or Stochastic Differential Equations (ODEs/SDEs), starting from a tractable prior distribution. The probability flow ODE formulation enables the use of advanced numerical solvers to accelerate sampling. Orthogonal yet vital to solver design is the discretization strategy. While early approaches employed handcrafted heuristics and recent methods adopt optimization-based techniques, most existing strategies enforce a globally shared timestep schedule across all samples. This uniform treatment fails to account for instance-specific complexity in the generative process, potentially limiting performance. Motivated by controlled experiments on synthetic data, which reveal the suboptimality of global schedules under instance-specific dynamics, we propose an instance-aware discretization framework. Our method learns to adapt timestep allocations based on input-dependent priors, extending gradient-based discretization search to the conditional generative setting. Empirical results across diverse settings, including synthetic data, pixel-space diffusion, latent-space images and video flow matching models, demonstrate that our method consistently improves generation quality with marginal tuning cost relative to training and negligible inference overhead.
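One way to parameterize an input-dependent schedule is to predict positive step increments that sum to one, whose cumulative sum yields a monotone per-instance grid over [0, 1] for the solver. The head below is an illustrative parameterization under that assumption, not the paper's exact search procedure.
```python
import torch
import torch.nn as nn

class InstanceSchedule(nn.Module):
    """Maps instance features to a monotone timestep grid in [0, 1]."""
    def __init__(self, feat_dim, n_steps=8):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_steps)

    def forward(self, feats):                             # feats: (B, feat_dim)
        deltas = torch.softmax(self.head(feats), dim=-1)  # positive, sums to 1
        t = torch.cumsum(deltas, dim=-1)                  # strictly increasing grid
        return torch.cat([torch.zeros_like(t[..., :1]), t], dim=-1)  # prepend t=0
```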
[221] Differential Attention-Augmented BiomedCLIP with Asymmetric Focal Optimization for Imbalanced Multi-Label Video Capsule Endoscopy Classification
Podakanti Satyajith Chary, Nagarajan Ganapathy
Main category: cs.CV
TL;DR: A multi-label classification framework for video capsule endoscopy using modified BiomedCLIP with differential attention and imbalance handling techniques to detect rare pathological findings.
Details
Motivation: Video capsule endoscopy (VCE) datasets like Galar suffer from extreme class imbalance where pathological findings constitute less than 0.1% of frames, requiring specialized approaches for rare event detection.
Method: Modifies BiomedCLIP vision-language model with differential attention mechanism (sketched below), uses sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, per-class threshold optimization, and temporal smoothing with median filtering and gap merging.
Result: Achieves overall temporal mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353 on RARE-VISION test set (161,025 frames), with inference completed in ~8.6 minutes on a single GPU.
Conclusion: The proposed framework effectively addresses extreme class imbalance in VCE through architectural modifications and optimization strategies, enabling practical detection of rare pathological findings.
Abstract: This work presents a multi-label classification framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset through a combination of architectural and optimization-level strategies. Our approach modifies BiomedCLIP, a biomedical vision-language foundation model, by replacing its standard multi-head self-attention with a differential attention mechanism that computes the difference between two softmax attention maps to suppress attention noise. To counteract the skewed label distribution, where pathological findings constitute less than 0.1% of all annotated frames, a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization are employed. Temporal coherence is enforced through median-filter smoothing and gap merging prior to event-level JSON generation. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the pipeline achieves an overall temporal mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353, with total inference completed in approximately 8.6 minutes on a single GPU.
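The differential attention mechanism replaces a single softmax map with the difference of two, so attention noise common to both maps cancels. A minimal single-head sketch in the style of such designs; treating the mixing scalar as a fixed hyperparameter (it is typically learnable) is an assumption here.
```python
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """q*, k*: (T, d) paired query/key projections; v: (T, dv) shared values."""
    scale = q1.shape[-1] ** 0.5
    a1 = F.softmax(q1 @ k1.T / scale, dim=-1)
    a2 = F.softmax(q2 @ k2.T / scale, dim=-1)
    return (a1 - lam * a2) @ v   # common-mode attention noise cancels
```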
[222] DeepCORO-CLIP: A Multi-View Foundation Model for Comprehensive Coronary Angiography Video-Text Analysis and External Validation
Sarra Harrabi, Yichen Wu, Geoffrey H. Tison, Minhaj Ansari, Milos Vukadinovic, David Ouyang, Joshua P. Barrios, Jacques Delfrate, Robert Avram
Main category: cs.CV
TL;DR: DeepCORO-CLIP is a multi-view foundation model for coronary angiography that uses video-text contrastive learning to analyze multiple projections for comprehensive coronary assessment across diagnostic, prognostic, and disease progression tasks.
Details
Motivation: Current AI methods for coronary angiography analysis are limited to single frames/projections and focus mainly on stenosis detection, lacking comprehensive assessment capabilities. Visual interpretation variability between readers also remains a significant clinical challenge.
Method: Multi-view foundation model trained with video-text contrastive learning on 203,808 angiography videos from 28,117 patients. Uses attention-based pooling for study-level assessment across multiple projections (sketched below). Validated internally and externally on large datasets.
Result: Achieved AUROC of 0.888 for stenosis detection internally and 0.89 externally. MAE of 13.6% vs QCA (better than clinical reports at 19.0%). Strong performance for chronic total occlusion, thrombus, and calcification detection. Transfer learning enabled MACE prediction (AUROC 0.79) and LVEF estimation (MAE 7.3%).
Conclusion: DeepCORO-CLIP provides a foundation for automated coronary angiography interpretation with fast inference (4.2 seconds), enabling comprehensive assessment at point of care. Public release of code, data, model weights, and deployment infrastructure.
Abstract: Coronary angiography is the reference standard for evaluating coronary artery disease, yet visual interpretation remains variable between readers. Existing artificial intelligence methods typically analyze single frames or projections and focus mainly on stenosis, limiting comprehensive coronary assessment. We present DeepCORO-CLIP, a multi-view foundation model trained with video-text contrastive learning on 203,808 angiography videos from 28,117 patients across 32,473 studies at the Montreal Heart Institute and externally validated on 4,249 studies from the University of California, San Francisco. DeepCORO-CLIP integrates multiple projections with attention-based pooling for study-level assessment across diagnostic, prognostic, and disease progression tasks. For significant stenosis detection, the model achieved an AUROC of 0.888 internally and 0.89 on external validation. Mean absolute error against core laboratory quantitative coronary angiography was 13.6%, lower than clinical reports at 19.0%. The model also performed strongly for chronic total occlusion, intracoronary thrombus, and coronary calcification detection. Transfer learning enabled prediction of one-year major adverse cardiovascular events with AUROC 0.79 and estimation of left ventricular ejection fraction with mean absolute error 7.3%. Embeddings also captured disease progression across serial examinations. With a mean inference time of 4.2 seconds in hospital deployment, DeepCORO-CLIP provides a foundation for automated coronary angiography interpretation at the point of care. Code, sample data, model weights, and deployment infrastructure are publicly released.
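Study-level assessment hinges on the attention-based pooling step: each projection's video embedding receives a learned weight, and the study embedding is the weighted sum. A standard gated-attention pooling head is sketched below as an assumed instantiation, not the paper's exact module.
```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Pools (n_views, dim) per-projection embeddings into one study embedding."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, view_embs):
        w = torch.softmax(self.score(view_embs), dim=0)   # (n_views, 1) weights
        return (w * view_embs).sum(dim=0)
```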
[223] Illumination-Aware Contactless Fingerprint Spoof Detection via Paired Flash-Non-Flash Imaging
Roja Sahoo, Anoop Namboodiri
Main category: cs.CV
TL;DR: Paired flash-non-flash contactless fingerprint acquisition improves spoof detection by analyzing illumination-induced differences in material properties and surface characteristics.
Details
Motivation: Contactless fingerprint recognition lacks physical contact cues for spoof detection, and existing single-image methods generalize poorly across devices, conditions, and spoof materials.
Method: Uses paired flash-non-flash acquisition as active sensing, analyzing lighting-induced differences through interpretable metrics like inter-channel correlation, specular reflection, texture realism, and differential imaging (sketched below).
Result: Flash illumination accentuates material-dependent properties (ridge visibility, subsurface scattering, micro-geometry, surface oils), helping discriminate genuine fingerprints from various presentation attacks.
Conclusion: Illumination-aware analysis improves robustness and interpretability in contactless fingerprint spoof detection, motivating future work on paired acquisition and physics-informed feature design.
Abstract: Contactless fingerprint recognition enables hygienic and convenient biometric authentication but poses new challenges for spoof detection due to the absence of physical contact and traditional liveness cues. Most existing methods rely on single-image acquisition and appearance-based features, which often generalize poorly across devices, capture conditions, and spoof materials. In this work, we study paired flash-non-flash contactless fingerprint acquisition as a lightweight active sensing mechanism for spoof detection. Through a preliminary empirical analysis, we show that flash illumination accentuates material- and structure-dependent properties, including ridge visibility, subsurface scattering, micro-geometry, and surface oils, while non-flash images provide a baseline appearance context. We analyze lighting-induced differences using interpretable metrics such as inter-channel correlation, specular reflection characteristics, texture realism, and differential imaging. These complementary features help discriminate genuine fingerprints from printed, digital, and molded presentation attacks. We further examine the limitations of paired acquisition, including sensitivity to imaging settings, dataset scale, and emerging high-fidelity spoofs. Our findings demonstrate the potential of illumination-aware analysis to improve robustness and interpretability in contactless fingerprint presentation attack detection, motivating future work on paired acquisition and physics-informed feature design. Code is available in the repository.
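Two of the interpretable cues are easy to make concrete: the energy of the flash-minus-non-flash differential image and the correlation between color channels, which behaves differently for skin than for paper or screen replays. The sketch below computes both for an aligned image pair; the specific statistics are illustrative assumptions, not the paper's feature set.
```python
import numpy as np

def flash_nonflash_cues(flash, nonflash):
    """flash, nonflash: aligned (H, W, 3) float images in [0, 1]."""
    diff = flash.astype(np.float64) - nonflash.astype(np.float64)
    cues = {"diff_energy": float(np.mean(diff ** 2))}     # differential imaging
    for a, b in [(0, 1), (0, 2), (1, 2)]:                 # inter-channel correlation
        r = np.corrcoef(flash[..., a].ravel(), flash[..., b].ravel())[0, 1]
        cues[f"corr_{'RGB'[a]}{'RGB'[b]}"] = float(r)
    return cues
```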
[224] Does YOLO Really Need to See Every Training Image in Every Epoch?
Xingxing Xie, Jiahua Dong, Junwei Han, Gong Cheng
Main category: cs.CV
TL;DR: AFSS is an anti-forgetting sampling strategy that dynamically selects training images for YOLO detectors based on learning sufficiency, achieving faster training while maintaining or improving accuracy.
Details
Motivation: YOLO detectors have fast inference but slow training because they process every training image in every epoch, even when many images are already sufficiently learned. This contradicts the "You Only Look Once" efficiency philosophy.
Method: AFSS measures learning sufficiency of each image as min(detection recall, precision), categorizing images into easy, medium, or hard levels (sketched below). Easy images are sparsely resampled with priority to unused ones, medium images are partially selected, and hard images are fully sampled every epoch. Learning sufficiency is periodically updated.
Result: On MS COCO 2017, PASCAL VOC 2007, DOTA-v1.0, and DIOR-R datasets, AFSS achieves >1.43× training speedup for YOLO-series detectors while also improving accuracy.
Conclusion: AFSS enables YOLO detectors to train more efficiently by focusing on informative images, achieving significant speedups without sacrificing accuracy.
Abstract: YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the "You Only Look Once" philosophy. This naturally raises an important question: Does YOLO really need to see every training image in every epoch? To explore this, we propose an Anti-Forgetting Sampling Strategy (AFSS) that dynamically determines which images should be used and which can be skipped during each epoch, allowing the detector to learn more effectively and efficiently. Specifically, AFSS measures the learning sufficiency of each training image as the minimum of its detection recall and precision, and dynamically categorizes training images into easy, medium, or hard levels accordingly. Easy training images are sparsely resampled during training in a continuous review manner, with priority given to those that have not been used for a long time to reduce redundancy and prevent forgetting. Moderate training images are partially selected, prioritizing recently unused ones and randomly choosing the rest from unselected images to ensure coverage and prevent forgetting. Hard training images are fully sampled in every epoch to ensure sufficient learning. The learning sufficiency of each training image is periodically updated, enabling detectors to adaptively shift their focus toward the informative training images over time while progressively discarding redundant ones. On widely used natural image detection benchmarks (MS COCO 2017 and PASCAL VOC 2007) and remote sensing detection datasets (DOTA-v1.0 and DIOR-R), AFSS achieves more than a 1.43× training speedup for YOLO-series detectors while also improving accuracy.
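The sufficiency measure and the per-epoch sampling policy translate directly into a few lines. In the sketch below, the easy/hard thresholds and the per-bucket sampling fractions are illustrative assumptions, and the paper's priority rules for long-unused images are omitted for brevity.
```python
import numpy as np

def afss_buckets(recall, precision, easy_th=0.9, hard_th=0.5):
    """recall, precision: (n_images,) per-image detection metrics on the train set."""
    sufficiency = np.minimum(recall, precision)
    easy = np.flatnonzero(sufficiency >= easy_th)    # sparsely resampled
    hard = np.flatnonzero(sufficiency < hard_th)     # sampled every epoch
    medium = np.setdiff1d(np.arange(len(sufficiency)), np.union1d(easy, hard))
    return easy, medium, hard

def epoch_sample(easy, medium, hard, rng, easy_frac=0.1, med_frac=0.5):
    pick = lambda idx, f: rng.choice(idx, size=int(f * len(idx)), replace=False)
    return np.concatenate([pick(easy, easy_frac), pick(medium, med_frac), hard])
```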
[225] Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos
Songtao Jiang, Sibo Song, Chenyi Zhou, Yuan Wang, Ruizhe Chen, Tongkun Guan, Ruilin Luo, Yan Zhang, Zhihang Tang, Yuchong Sun, Hang Zhang, Zhibo Yang, Shuai Bai, Junyang Lin, Zuozhu Liu
Main category: cs.CV
TL;DR: SynRL is a post-training framework that teaches vision-language models temporal primitives (direction, speed, state tracking) using programmatically generated synthetic videos, achieving strong transfer to real-world video understanding tasks.
Details
Motivation: Current VLMs struggle with temporal understanding in videos because existing datasets lack true temporal-centricity (answers can be inferred from keyframes) and training data from proprietary models contains systematic errors in fundamental temporal perception like motion direction and speed.
Method: SynRL decomposes temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, creating 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation of simple geometric shapes (sketched below).
Result: Despite training on simple synthetic shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding, outperforming Video-R1, which uses 165K real-world samples, with only 7.7K synthetic CoT samples.
Conclusion: Synthetic data teaching fundamental temporal primitives provides an effective and cost-efficient scaling path for video post-training, establishing that abstract temporal skills transfer effectively from synthetic patterns to complex real-world scenarios.
Abstract: The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1 with 165K real-world samples. We attribute this to fundamental temporal skills, such as tracking frame by frame changes and comparing velocity, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: video temporal learning through carefully designed synthetic data provides a more cost efficient scaling path.
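Because the videos are rendered by code, frame-accurate labels for direction and speed come for free. The sketch below generates a toy clip of a square with a known velocity, in the spirit of the programmatic generation described above; all rendering details are assumptions.
```python
import numpy as np

def moving_square_clip(n_frames=16, size=64, box=8, seed=0):
    rng = np.random.default_rng(seed)
    vel = rng.integers(-3, 4, size=2)                      # ground-truth (dy, dx)
    pos = rng.integers(box, size - 2 * box, size=2)
    frames = np.zeros((n_frames, size, size), dtype=np.float32)
    for t in range(n_frames):
        y, x = np.clip(pos + t * vel, 0, size - box)
        frames[t, y:y + box, x:x + box] = 1.0
    # Exact labels for QA/CoT construction, with no proprietary-model annotation.
    return frames, {"direction": np.sign(vel).tolist(), "speed": float(np.hypot(*vel))}
```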
[226] VideoAtlas: Navigating Long-Form Video in Logarithmic Compute
Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi, Lama Ayash, Tanveer Hussain, Naeemullah Khan
Main category: cs.CV
TL;DR: VideoAtlas is a hierarchical grid representation for video that enables lossless, navigable video understanding with logarithmic compute growth, combined with Recursive Language Models for scalable long-context video processing.
Details
Motivation: Existing video language models face two key challenges: lossy representations that approximate video content, and long-context limitations where caption-based pipelines lose visual fidelity. There's a need for lossless, scalable video understanding that preserves visual information end-to-end.
Method: VideoAtlas represents video as a hierarchical grid that is lossless, navigable, and preprocessing-free (sketched below). It provides a uniform visual representation for video, intermediate investigations, and agent memory. Combined with Recursive Language Models (RLMs), it enables Video-RLM: a Master-Worker architecture where the Master coordinates global exploration while Workers concurrently drill into regions to accumulate visual evidence.
Result: Three key findings: (1) Logarithmic compute growth with video duration and 30-60% multimodal cache hit rate from grid structural reuse. (2) Environment budgeting via maximum exploration depth provides compute-accuracy tradeoff. (3) Emergent adaptive compute allocation scales with question granularity. Video-RLM shows minimal accuracy degradation when scaling from 1-hour to 10-hour benchmarks.
Conclusion: Structured environment navigation through VideoAtlas provides a viable and scalable paradigm for video understanding, enabling lossless visual processing with logarithmic computational growth and robust performance on long-duration videos.
Abstract: Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce VideoAtlas, a task-agnostic environment to represent video as a hierarchical grid that is simultaneously lossless, navigable, scalable, caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent’s memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures access depth grows only logarithmically with video length. For long-context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to the visual domain requires a structured environment to recurse into, which VideoAtlas provides. VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture where a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid’s structural reuse. (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter. (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.
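The logarithmic scaling follows from the grid geometry: each zoom narrows attention to one of k × k tiles, so reaching any single frame from an n-frame overview takes roughly log base k² of n steps. The indexing sketch below makes this concrete; the scheme is an illustrative assumption, not the paper's data structure.
```python
import math

def zoom_path(frame_idx, n_frames, k=3):
    """Tile indices navigating from the overview grid down to one frame."""
    cells = k * k
    depth = max(1, math.ceil(math.log(n_frames, cells)))   # O(log n) zoom steps
    path, start, span = [], 0, n_frames
    for _ in range(depth):
        span = math.ceil(span / cells)          # frames covered by one tile
        tile = (frame_idx - start) // span
        path.append(tile)
        start += tile * span
    return path

# zoom_path(57, 100) -> [4, 4, 1]: three steps suffice for a 100-frame video
```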
[227] Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation
Haocheng Li, Juepeng Zheng, Shuangxi Miao, Ruibo Lu, Guosheng Cai, Haohuan Fu, Jianxi Huang
Main category: cs.CV
TL;DR: MoBaNet: A parameter-efficient, modality-balanced symmetric fusion framework for multimodal remote sensing semantic segmentation that adapts frozen Vision Foundation Models with minimal trainable parameters while addressing modality imbalance issues.
Details
Motivation: Current approaches for adapting pretrained Vision Foundation Models to multimodal tasks suffer from high computational overhead and modality imbalance, where auxiliary modalities get suppressed during optimization, limiting effective multimodal fusion.
Method: Proposes MoBaNet with: 1) Symmetric dual-stream architecture built on frozen VFM backbone, 2) Cross-modal Prompt-Injected Adapter for deep semantic interaction, 3) Difference-Guided Gated Fusion Module for adaptive feature fusion using cross-modal discrepancy, and 4) Modality-Conditional Random Masking strategy to mitigate modality imbalance during training (sketched below).
Result: Achieves state-of-the-art performance on ISPRS Vaihingen and Potsdam benchmarks with significantly fewer trainable parameters than full fine-tuning, demonstrating robust and balanced multimodal fusion.
Conclusion: MoBaNet provides an effective parameter-efficient framework for multimodal remote sensing semantic segmentation that maintains generalizable representations while addressing modality imbalance through innovative architectural designs and training strategies.
Abstract: Multimodal remote sensing semantic segmentation enhances scene interpretation by exploiting complementary physical cues from heterogeneous data. Although pretrained Vision Foundation Models (VFMs) provide strong general-purpose representations, adapting them to multimodal tasks often incurs substantial computational overhead and is prone to modality imbalance, where the contribution of auxiliary modalities is suppressed during optimization. To address these challenges, we propose MoBaNet, a parameter-efficient and modality-balanced symmetric fusion framework. Built upon a largely frozen VFM backbone, MoBaNet adopts a symmetric dual-stream architecture to preserve generalizable representations while minimizing the number of trainable parameters. Specifically, we design a Cross-modal Prompt-Injected Adapter (CPIA) to enable deep semantic interaction by generating shared prompts and injecting them into bottleneck adapters under the frozen backbone. To obtain compact and discriminative multimodal representations for decoding, we further introduce a Difference-Guided Gated Fusion Module (DGFM), which adaptively fuses paired stage features by explicitly leveraging cross-modal discrepancy to guide feature selection. Furthermore, we propose a Modality-Conditional Random Masking (MCRM) strategy to mitigate modality imbalance by masking one modality only during training and imposing hard-pixel auxiliary supervision on modality-specific branches. Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks demonstrate that MoBaNet achieves state-of-the-art performance with significantly fewer trainable parameters than full fine-tuning, validating its effectiveness for robust and balanced multimodal fusion. The source code in this work is available at https://github.com/sauryeo/MoBaNet.
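Modality-Conditional Random Masking is a small training-time intervention: with some probability, one modality's input is zeroed so the fusion network cannot learn to ignore the auxiliary branch. A minimal sketch, where the masking probability and the zero-fill choice are assumptions:
```python
import torch

def modality_conditional_mask(x_main, x_aux, p=0.3, training=True):
    """Randomly drop one modality during training to combat modality imbalance."""
    if training and torch.rand(()) < p:
        if torch.rand(()) < 0.5:
            x_main = torch.zeros_like(x_main)   # force reliance on the auxiliary cue
        else:
            x_aux = torch.zeros_like(x_aux)
    return x_main, x_aux
```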
[228] DiffVP: Differential Visual Semantic Prompting for LLM-Based CT Report Generation
Yuhe Tian, Kun Zhang, Haoran Ma, Rui Yan, Yingtai Li, Rongsheng Wang, Shaohua Kevin Zhou
Main category: cs.CV
TL;DR: DiffVP improves CT report generation by focusing on scan-to-reference differences rather than holistic 3D volumes, using hierarchical difference extraction and visual prompting for LLMs.
Details
Motivation: Existing LLM-based CT report generation methods encode 3D volumes holistically without distinguishing informative cues from redundant anatomical background, limiting their effectiveness.
Method: Proposes Differential Visual Prompting (DiffVP) with hierarchical difference extractor to capture global/local semantic discrepancies, and difference-to-prompt generator that transforms these into learnable visual prefix tokens for LLM conditioning (sketched below).
Result: Outperforms prior methods on two large-scale benchmarks, improving average BLEU-1-4 by +10.98 and +4.36, and boosts clinical efficacy on RadGenome-ChestCT (F1 score 0.421).
Conclusion: DiffVP effectively conditions report generation on explicit scan-to-reference differences, suppressing invariant anatomy while amplifying diagnostically relevant visual evidence without explicit lesion localization.
Abstract: While large language models (LLMs) have advanced CT report generation, existing methods typically encode 3D volumes holistically, failing to distinguish informative cues from redundant anatomical background. Inspired by radiological cognitive subtraction, we propose Differential Visual Prompting (DiffVP), which conditions report generation on explicit, high-level semantic scan-to-reference differences rather than solely on absolute visual features. DiffVP employs a hierarchical difference extractor to capture complementary global and local semantic discrepancies into a shared latent space, along with a difference-to-prompt generator that transforms these signals into learnable visual prefix tokens for LLM conditioning. These difference prompts serve as structured conditioning signals that implicitly suppress invariant anatomy while amplifying diagnostically relevant visual evidence, thereby facilitating accurate report generation without explicit lesion localization. On two large-scale benchmarks, DiffVP consistently outperforms prior methods, improving the average BLEU-1-4 by +10.98 and +4.36, respectively, and further boosts clinical efficacy on RadGenome-ChestCT (F1 score 0.421). All codes will be released at https://github.com/ArielTYH/DiffVP/.
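The difference-to-prompt generator can be pictured as a projector from the scan-minus-reference latent to a handful of prefix tokens in the LLM's embedding space. Shapes and the two-layer projector in the sketch below are illustrative assumptions, not the paper's architecture.
```python
import torch
import torch.nn as nn

class DiffToPrompt(nn.Module):
    """Turns scan-to-reference feature differences into LLM prefix tokens."""
    def __init__(self, feat_dim, llm_dim, n_tokens=8):
        super().__init__()
        self.n_tokens, self.llm_dim = n_tokens, llm_dim
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, n_tokens * llm_dim))

    def forward(self, scan_feat, ref_feat):            # (B, feat_dim) each
        diff = scan_feat - ref_feat                    # radiological "subtraction"
        return self.proj(diff).view(-1, self.n_tokens, self.llm_dim)
```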
[229] TAPESTRY: From Geometry to Appearance via Consistent Turntable Videos
Yan Zeng, Haoran Jiang, Kaixin Yao, Qixuan Zhang, Longwen Zhang, Lan Xu, Jingyi Yu
Main category: cs.CV
TL;DR: TAPESTRY generates 360-degree turntable videos from 3D models using geometry-conditioned video diffusion, enabling high-quality 3D appearance generation and reconstruction.
Details
Motivation: Existing video diffusion models struggle with geometric consistency and appearance stability for 3D reconstruction tasks, creating a need for geometry-aware video generation that can produce reliable intermediate representations for texture synthesis and neural rendering.
Method: Reframes 3D appearance generation as geometry-conditioned video diffusion: renders multi-modal geometric features from 3D meshes to constrain video generation with pixel-level precision, then uses a multi-stage pipeline with 3D-Aware Inpainting for complete surface coverage.
Result: Outperforms existing approaches in both video consistency and final reconstruction quality, enabling automated creation of production-ready 3D assets from untextured meshes.
Conclusion: TAPESTRY provides a framework for generating high-fidelity turntable videos that serve as both dynamic previews and reliable intermediate representations for 3D reconstruction tasks, bridging video generation with 3D content creation.
Abstract: Automatically generating photorealistic and self-consistent appearances for untextured 3D models is a critical challenge in digital content creation. The advancement of large-scale video generation models offers a natural approach: directly synthesizing 360-degree turntable videos (TTVs), which can serve not only as high-quality dynamic previews but also as an intermediate representation to drive texture synthesis and neural rendering. However, existing general-purpose video diffusion models struggle to maintain strict geometric consistency and appearance stability across the full range of views, making their outputs ill-suited for high-quality 3D reconstruction. To this end, we introduce TAPESTRY, a framework for generating high-fidelity TTVs conditioned on explicit 3D geometry. We reframe the 3D appearance generation task as a geometry-conditioned video diffusion problem: given a 3D mesh, we first render and encode multi-modal geometric features to constrain the video generation process with pixel-level precision, thereby enabling the creation of high-quality and consistent TTVs. Building upon this, we also design a method for downstream reconstruction tasks from the TTV input, featuring a multi-stage pipeline with 3D-Aware Inpainting. By rotating the model and performing a context-aware secondary generation, this pipeline effectively completes self-occluded regions to achieve full surface coverage. The videos generated by TAPESTRY are not only high-quality dynamic previews but also serve as a reliable, 3D-aware intermediate representation that can be seamlessly back-projected into UV textures or used to supervise neural rendering methods like 3DGS. This enables the automated creation of production-ready, complete 3D assets from untextured meshes. Experimental results demonstrate that our method outperforms existing approaches in both video consistency and final reconstruction quality.
[230] Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation
Haoyun Chen, Fenghe Tang, Wenxin Ma, Shaohua Kevin Zhou
Main category: cs.CV
TL;DR: C2P is a prompt-free universal medical image segmentation framework that disentangles anatomical knowledge into geometric and semantic representations using MLLMs, achieving strong cross-modal generalization.
Details
Motivation: Existing universal medical segmentation approaches rely heavily on manual visual prompts or reference images, limiting automation and robustness. Joint training across modalities often fails to address large domain shifts.
Method: Proposes Concept-to-Pixel (C2P) framework that separates anatomical knowledge into Geometric and Semantic representations. Uses MLLMs to distill medical concepts into learnable Semantic Tokens and introduces supervised Geometric Tokens for universal physical constraints. These tokens interact with image features to generate dynamic kernels for mask prediction (sketched below), with a Geometry-Aware Inference Consensus mechanism for reliability assessment.
Result: Extensive experiments on eight diverse datasets across seven modalities show significant superiority over universe- or single-model approaches. The unified model demonstrates strong generalization on zero-shot tasks with unseen cases and cross-modal transfers across similar tasks.
Conclusion: C2P provides an effective prompt-free universal segmentation framework that explicitly disentangles anatomical knowledge, enabling robust cross-modal generalization and addressing domain shift challenges in medical imaging.
Abstract: Universal medical image segmentation seeks to use a single foundational model to handle diverse tasks across multiple imaging modalities. However, existing approaches often rely heavily on manual visual prompts or retrieved reference images, which limits their automation and robustness. In addition, naive joint training across modalities often fails to address large domain shifts. To address these limitations, we propose Concept-to-Pixel (C2P), a novel prompt-free universal segmentation framework. C2P explicitly separates anatomical knowledge into two components: Geometric and Semantic representations. It leverages Multimodal Large Language Models (MLLMs) to distill abstract, high-level medical concepts into learnable Semantic Tokens and introduces explicitly supervised Geometric Tokens to enforce universal physical and structural constraints. These disentangled tokens interact deeply with image features to generate input-specific dynamic kernels for precise mask prediction. Furthermore, we introduce a Geometry-Aware Inference Consensus mechanism, which utilizes the model’s predicted geometric constraints to assess prediction reliability and suppress outliers. Extensive experiments and analysis on a unified benchmark comprising eight diverse datasets across seven modalities demonstrate the significant superiority of our jointly trained approach, compared to universe- or single-model approaches. Remarkably, our unified model demonstrates strong generalization, achieving impressive results not only on zero-shot tasks involving unseen cases but also in cross-modal transfers across similar tasks. Code is available at: https://github.com/Yundi218/Concept-to-Pixel
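The dynamic-kernel step is a common construction: each token is projected to the weights of an input-specific 1 × 1 convolution applied to the image features, yielding one mask logit map per token. The einsum sketch below assumes this standard form, not the paper's exact head.
```python
import torch
import torch.nn as nn

def dynamic_masks(tokens, feats, proj):
    """tokens: (B, K, D) semantic/geometric tokens; feats: (B, C, H, W);
    proj: nn.Linear(D, C). Returns (B, K, H, W) mask logits."""
    kernels = proj(tokens)                                  # per-input 1x1 kernels
    return torch.einsum("bkc,bchw->bkhw", kernels, feats)
```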
[231] Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee
Main category: cs.CV
TL;DR: STTS is a spatio-temporal token pruning method for vision-language models that prunes 50% of vision tokens across both vision transformer and LLM components, achieving 62% efficiency gains with minimal performance drop.
Details
Motivation: Current token pruning approaches for VLMs either prune only within the vision transformer for unimodal tasks or only within the LLM with complex text-conditioned mechanisms, lacking unified architecture-wide pruning for video-based vision-language tasks.
Method: STTS learns to score tokens temporally via auxiliary loss and spatially via LLM downstream gradients, using an efficient packing algorithm to prune tokens across both ViT and LLM without text conditioning or token merging (see the illustrative sketch after the abstract).
Result: Prunes 50% of vision tokens, achieving 62% efficiency improvement during training and inference with only 0.7% average performance drop across 13 video QA tasks. Efficiency gains increase with more frames, and test-time scaling yields additional 0.5-1% performance gains for long-video QA.
Conclusion: STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning in vision-language models, particularly beneficial for video tasks with temporal redundancy.
Abstract: Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
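A minimal sketch of the keep-top-k pruning core, assuming (as the abstract states) that 50% of vision tokens are dropped according to a learned score. The auxiliary temporal loss, LLM-gradient spatial scoring, and packing algorithm are not reproduced; the scorer here is a placeholder linear head.

```python
import torch
import torch.nn as nn

def prune_tokens(tokens, scorer, keep_ratio=0.5):
    """Score each vision token and keep the top fraction.

    tokens: (B, N, C). STTS trains its scorer with an auxiliary loss and
    downstream LLM gradients; this sketch only shows the selection step.
    """
    scores = scorer(tokens).squeeze(-1)                     # (B, N)
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values  # keep temporal order
    batch = torch.arange(tokens.shape[0]).unsqueeze(1)
    return tokens[batch, idx]                               # (B, k, C)

toks = torch.randn(2, 256, 64)
scorer = nn.Linear(64, 1)
print(prune_tokens(toks, scorer).shape)                     # (2, 128, 64)
```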
[232] PC-CrossDiff: Point-Cluster Dual-Level Cross-Modal Differential Attention for Unified 3D Referring and Segmentation
Wenbin Tan, Jiawen Lin, Fangyong Wang, Yuan Xie, Yong Xie, Yachao Zhang, Yanyun Qu
Main category: cs.CV
TL;DR: PC-CrossDiff: A unified dual-task framework with dual-level cross-modal differential attention for 3D visual grounding, addressing challenges in complex multi-object scenes by parsing implicit localization cues and suppressing spatial interference.
Details
Motivation: Existing 3D visual grounding methods work well in simple single-object scenes but suffer severe performance degradation in complex multi-object scenes common in real-world settings. Two key challenges are inadequate parsing of implicit localization cues for disambiguating visually similar objects, and ineffective suppression of dynamic spatial interference from co-occurring objects.
Method: Proposes PC-CrossDiff with dual-level cross-modal differential attention: (1) Point-Level Differential Attention (PLDA) modules apply bidirectional differential attention between text and point clouds to adaptively extract implicit localization cues via learnable weights; (2) Cluster-Level Differential Attention (CLDA) modules establish hierarchical attention to enhance localization-relevant spatial relationships while suppressing ambiguous/irrelevant spatial relations through localization-aware differential attention blocks (see the illustrative sketch after the abstract).
Result: Achieves state-of-the-art performance on ScanRefer, NR3D, and SR3D benchmarks. Notably improves Overall@0.50 score by +10.16% for 3DREC task on Implicit subsets of ScanRefer, demonstrating strong ability to parse implicit spatial cues.
Conclusion: PC-CrossDiff effectively addresses challenges in complex multi-object 3D visual grounding through its dual-level differential attention architecture, significantly improving performance on implicit localization tasks and advancing practical deployment of 3DVG systems.
Abstract: 3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in simple, single-object scenes, they suffer from severe performance degradation in complex, multi-object scenes that are common in real-world settings, hindering practical deployment. Existing methods face two key challenges in complex, multi-object scenes: inadequate parsing of implicit localization cues critical for disambiguating visually similar objects, and ineffective suppression of dynamic spatial interference from co-occurring objects, resulting in degraded grounding accuracy. To address these challenges, we propose PC-CrossDiff, a unified dual-task framework with a dual-level cross-modal differential attention architecture for 3DREC and 3DRES. Specifically, the framework introduces: (i) Point-Level Differential Attention (PLDA) modules that apply bidirectional differential attention between text and point clouds, adaptively extracting implicit localization cues via learnable weights to improve discriminative representation; (ii) Cluster-Level Differential Attention (CLDA) modules that establish a hierarchical attention mechanism to adaptively enhance localization-relevant spatial relationships while suppressing ambiguous or irrelevant spatial relations through a localization-aware differential attention block. Our method achieves state-of-the-art performance on the ScanRefer, NR3D, and SR3D benchmarks. Notably, on the Implicit subsets of ScanRefer, it improves the Overall@0.50 score by +10.16% for the 3DREC task, highlighting its strong ability to parse implicit spatial cues.
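The differential-attention primitive the paper builds on (in the spirit of the Diff Transformer) can be sketched as the difference of two softmax attention maps, which cancels common-mode attention noise. How PC-CrossDiff wires this across the text and point-cloud streams is not shown here; shapes and the lambda weight are illustrative.

```python
import torch
import torch.nn.functional as F

def diff_attention(q1, k1, q2, k2, v, lam=0.5):
    """Differential attention: subtract a second attention map so that
    attention mass shared by both maps (interference) cancels out."""
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v

B, N, D = 2, 128, 32
q1, k1, q2, k2, v = (torch.randn(B, N, D) for _ in range(5))
print(diff_attention(q1, k1, q2, k2, v).shape)   # (2, 128, 32)
```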
[233] Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs
Yuxin Liu, Fei Wang, Kun Li, Yiqi Nie, Junjie Chen, Zhangling Duan, Zhaohong Jia
Main category: cs.CV
TL;DR: SCEP is a training-free framework that adapts large vision-language models for image deepfake detection by using evidence-driven reasoning with suspicious patch tokens instead of whole-image inference.
Details
Motivation: Current large vision-language models (LVLMs) require costly fine-tuning for image deepfake detection and generalize poorly to diverse, evolving manipulations. There's a need for a training-free approach that can effectively detect deepfakes without model adaptation.
Method: SCEP uses CLS token as global reference, clusters patch features, scores patches with fused metric (semantic mismatch + frequency/noise anomalies), samples high-confidence patches per cluster, applies grid-based NMS to produce evidence pack, and conditions frozen LVLM for prediction (see the illustrative sketch after the abstract).
Result: Experiments on diverse benchmarks show SCEP outperforms strong baselines without requiring LVLM fine-tuning.
Conclusion: SCEP provides an effective training-free framework for image deepfake detection that leverages evidence-driven reasoning with large vision-language models, addressing generalization challenges without costly fine-tuning.
Abstract: Image Deepfake Detection (IDD) separates manipulated images from authentic ones by spotting artifacts of synthesis or tampering. Although large vision-language models (LVLMs) offer strong image understanding, adapting them to IDD often demands costly fine-tuning and generalizes poorly to diverse, evolving manipulations. We propose the Semantic Consistent Evidence Pack (SCEP), a training-free LVLM framework that replaces whole-image inference with evidence-driven reasoning. SCEP mines a compact set of suspicious patch tokens that best reveal manipulation cues. It uses the vision encoder's CLS token as a global reference, clusters patch features into coherent groups, and scores patches with a fused metric combining CLS-guided semantic mismatch with frequency- and noise-based anomalies. To cover dispersed traces and avoid redundancy, SCEP samples a few high-confidence patches per cluster and applies grid-based NMS, producing an evidence pack that conditions a frozen LVLM for prediction. Experiments on diverse benchmarks show SCEP outperforms strong baselines without LVLM fine-tuning.
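A toy version of the patch-scoring step, fusing CLS-guided semantic mismatch with a frequency-domain anomaly cue. The clustering, per-cluster sampling, and grid-based NMS are omitted, and the fusion weight is invented.

```python
import torch
import torch.nn.functional as F

def score_patches(cls_tok, patch_toks, patch_pixels, w_freq=0.5):
    # Semantic mismatch: patches far from the global CLS summary.
    mismatch = 1 - F.cosine_similarity(patch_toks, cls_tok.unsqueeze(1), dim=-1)
    # Frequency anomaly: per-patch spectral energy excluding the DC bin.
    spec = torch.fft.fft2(patch_pixels).abs()
    hf = spec.flatten(-2).sum(-1) - spec[..., 0, 0]
    hf = (hf - hf.mean(1, keepdim=True)) / (hf.std(1, keepdim=True) + 1e-6)
    return mismatch + w_freq * hf                    # (B, N) fused scores

B, N, C, P = 1, 196, 64, 16
scores = score_patches(torch.randn(B, C), torch.randn(B, N, C),
                       torch.randn(B, N, P, P))
evidence_idx = scores.topk(8, dim=1).indices         # evidence-pack candidates
print(evidence_idx.shape)
```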
[234] CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image
Yizheng Song, Yiyu Zhuang, Qipeng Xu, Haixiang Wang, Jiahe Zhu, Jing Tian, Siyu Zhu, Hao Zhu
Main category: cs.CV
TL;DR: CrowdGaussian: A unified framework for reconstructing multi-person 3D Gaussian Splatting representations from single-image inputs, addressing challenges of occlusions, low clarity, and varied appearances in crowd scenes.
Details
Motivation: Prior 3D human reconstruction research focuses on clear, close-up images of individuals but performs poorly in multi-person scenarios with extensive occlusions, low clarity, and varied appearances. There's a need for effective reconstruction of 3D human crowd models from single images.
Method: Proposes CrowdGaussian framework with: 1) Self-supervised adaptation pipeline to handle occlusions by enabling pretrained large human models to reconstruct complete 3D humans from heavily occluded inputs, and 2) Self-Calibrated Learning (SCL) training strategy that uses single-step diffusion models to adaptively refine coarse renderings by blending identity-preserving samples with clean/corrupted image pairs, then distilling outputs to enhance multi-person 3DGS quality.
Result: Extensive experiments demonstrate that CrowdGaussian generates photorealistic, geometrically coherent reconstructions of multi-person scenes from single-image inputs.
Conclusion: CrowdGaussian effectively addresses the challenging task of multi-person 3D reconstruction from single images, overcoming occlusion, clarity, and appearance challenges through self-supervised adaptation and self-calibrated learning approaches.
Abstract: Single-view 3D human reconstruction has garnered significant attention in recent years. Despite numerous advancements, prior research has concentrated on reconstructing 3D models from clear, close-up images of individual subjects, often yielding subpar results in the more prevalent multi-person scenarios. Reconstructing 3D human crowd models is a highly intricate task, laden with challenges such as: 1) extensive occlusions, 2) low clarity, and 3) numerous and various appearances. To address this task, we propose CrowdGaussian, a unified framework that directly reconstructs multi-person 3D Gaussian Splatting (3DGS) representations from single-image inputs. To handle occlusions, we devise a self-supervised adaptation pipeline that enables the pretrained large human model to reconstruct complete 3D humans with plausible geometry and appearance from heavily occluded inputs. Furthermore, we introduce Self-Calibrated Learning (SCL). This training strategy enables single-step diffusion models to adaptively refine coarse renderings to optimal quality by blending identity-preserving samples with clean/corrupted image pairs. The outputs can be distilled back to enhance the quality of multi-person 3DGS representations. Extensive experiments demonstrate that CrowdGaussian generates photorealistic, geometrically coherent reconstructions of multi-person scenes.
[235] Exploring parameter-efficient fine-tuning (PEFT) of billion-parameter vision models with QLoRA and DoRA: insights into generalization for limited-data image classification under a 98:1 test-to-train regime
Haiyu Yang, Sumit Sharma, Enhong Liu, Miel Hostens
Main category: cs.CV
TL;DR: Systematic comparison of parameter-efficient fine-tuning (PEFT) methods for adapting billion-parameter vision foundation models to livestock behavior classification with limited labeled data.
Details
Motivation: Automated behavior classification in precision livestock farming faces challenges of high computational costs and limited labeled data, requiring efficient adaptation of large foundation models.
Method: Compared three approaches: training from scratch (ResNet-18, ViT-Small), frozen feature extraction, and PEFT of DINOv3 (6.7B parameters) using QLoRA and DoRA with varying ranks (8, 16, 64) and target modules (q_proj vs all-linear layers); see the illustrative sketch after the abstract.
Result: PEFT substantially outperformed alternatives: best QLoRA configuration (all-linear layers, rank=64) achieved 83.16% test accuracy with only 2.72% parameters (183M) in 5.8 hours, compared to 72.87% for ResNet-18, 61.91% for ViT-Small, and 76.56% for frozen DINOv3.
Conclusion: Underfitting, not overfitting, is the primary challenge when adapting foundation models to agricultural imagery; increasing adapter capacity improves generalization without overfitting, providing guidelines for deploying billion-parameter vision models with PEFT in agricultural applications.
Abstract: Automated behavior classification is essential for precision livestock farming but faces challenges of high computational costs and limited labeled data. This study systematically compared three approaches: training from scratch (ResNet-18, ViT-Small), frozen feature extraction, and parameter-efficient fine-tuning (PEFT) of the DINOv3 foundation model (6.7 billion parameters). We evaluated QLoRA and DoRA across multiple configurations varying rank (8, 16, 64) and target modules (q_proj versus all-linear layers). With 2,160 verified training images, we assessed the generalization of our model on 211,800 test samples, essentially a 98:1 test-to-train ratio. Results demonstrated that PEFT substantially outperformed the alternatives: the best QLoRA configuration (all-linear layers and rank=64) achieved 83.16% test accuracy with only 2.72% of parameters (183.0M) in 5.8 hours, compared to 72.87% for ResNet-18 (16.8 hours), 61.91% for ViT-Small (18.7 hours), and 76.56% for frozen DINOv3 (17.5 hours). DoRA achieved comparable accuracy (83.14%) but with longer training time (11.0 hours). Notably, increasing adapter capacity consistently improved generalization without causing overfitting: reducing rank from 16 to 8 decreased test accuracy from 78.38% to 77.17%, while expanding from q_proj-only to all-linear layers with rank=64 improved accuracy from 78.38% to 83.16%. This suggests that underfitting, instead of overfitting, is the primary challenge when adapting foundation models to agricultural imagery. Our findings provide guidelines for deploying billion-parameter vision models with PEFT in agricultural livestock applications.
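A plausible sketch of the winning QLoRA configuration (all-linear target modules, rank 64) using the Hugging Face peft and bitsandbytes APIs. The checkpoint id and label count are placeholders, and this is not the authors' training code.

```python
import torch
from transformers import AutoModelForImageClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

NUM_BEHAVIOR_CLASSES = 10                   # hypothetical class count

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit NF4 base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForImageClassification.from_pretrained(
    "some-org/dinov3-placeholder",          # hypothetical model id
    num_labels=NUM_BEHAVIOR_CLASSES,
    quantization_config=bnb,
)
lora = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules="all-linear",            # vs. q_proj-only in the ablation
    use_dora=False,                         # set True for the DoRA variant
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()          # small trainable fraction
```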
[236] ResNet-50 with Class Reweighting and Anatomy-Guided Temporal Decoding for Gastrointestinal Video Analysis
Romil Imtiaz, Dimitris K. Iakovidis
Main category: cs.CV
TL;DR: A multi-label gastrointestinal video analysis pipeline using ResNet-50 frame classification with anatomy-guided temporal event decoding to predict 17 anatomy and pathology labels from endoscopic videos.
Details
Motivation: To develop an automated system for gastrointestinal video analysis that can accurately detect both anatomical structures and pathological conditions in endoscopic videos, addressing challenges like severe class imbalance and temporal event fragmentation.
Method: Uses ResNet-50 frame classifier on 336x336 frames, with clipped class-wise positive weighting to handle class imbalance. Temporal processing combines GT-style framewise event composition, anatomy vote smoothing, anatomy-based pathology gating, and conservative hysteresis decoder (see the illustrative sketch after the abstract).
Result: Improved temporal mAP from 0.3801 to 0.4303 on challenge test set, successfully predicting 5 anatomy classes and 12 pathology classes despite severe class imbalance.
Conclusion: The anatomy-guided temporal event decoding approach effectively addresses class imbalance and temporal fragmentation challenges in gastrointestinal video analysis, demonstrating significant performance improvements over baseline methods.
Abstract: We developed a multi-label gastrointestinal video analysis pipeline based on a ResNet-50 frame classifier followed by anatomy-guided temporal event decoding. The system predicts 17 labels, including 5 anatomy classes and 12 pathology classes, from frames resized to 336x336. A major challenge was severe class imbalance, particularly for rare pathology labels. To address this, we used clipped class-wise positive weighting in the training loss, which improved rare-class learning while maintaining stable optimization. At the temporal stage, we found that direct frame-to-event conversion produced fragmented mismatches with the official ground truth. The final submission therefore combined GT-style framewise event composition, anatomy vote smoothing, and anatomy-based pathology gating with a conservative hysteresis decoder. This design improved the final temporal mAP from 0.3801 to 0.4303 on the challenge test set.
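Two of the pipeline's ingredients are easy to sketch: clipped class-wise positive weighting for the multi-label loss, and a conservative hysteresis decoder over per-frame probabilities. Clip bounds and thresholds below are illustrative, not the paper's values.

```python
import torch
import numpy as np

def clipped_pos_weight(label_matrix, lo=1.0, hi=20.0):
    """Rare classes get larger, but bounded, positive weights."""
    pos = label_matrix.sum(0).clamp(min=1)         # positives per class
    neg = label_matrix.shape[0] - pos
    return (neg / pos).clamp(lo, hi)               # (num_classes,)

labels = (torch.rand(1000, 17) > 0.9).float()      # 17 = 5 anatomy + 12 pathology
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=clipped_pos_weight(labels))

def hysteresis_events(probs, t_high=0.7, t_low=0.4):
    """An event opens only above t_high and closes only below t_low,
    suppressing the fragmented events plain thresholding produces."""
    events, start = [], None
    for i, p in enumerate(probs):
        if start is None and p >= t_high:
            start = i
        elif start is not None and p < t_low:
            events.append((start, i)); start = None
    if start is not None:
        events.append((start, len(probs)))
    return events

print(hysteresis_events(np.array([.1, .8, .75, .5, .45, .3, .2, .9, .95, .1])))
```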
[237] M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking
Qiangqiang Wu, Tianyu Yang, Bo Fang, Jia Wan, Matias Di Martino, Guillermo Sapiro, Antoni B. Chan
Main category: cs.CV
TL;DR: Mask-to-Point (M2P) learning improves Vision Foundation Models for dense point tracking by leveraging video object segmentation mask annotations with three novel weakly-supervised constraints.
Details
Motivation: Current Vision Foundation Models rely on static image pre-training, which is suboptimal for capturing dense temporal correspondence in videos needed for point tracking tasks.
Method: Proposes M2P learning with three mask-based constraints: 1) local structure consistency loss using Procrustes analysis, 2) mask label consistency loss for foreground point matching, and 3) mask boundary constraint for boundary points (see the illustrative sketch after the abstract).
Result: M2P achieves 12.8% and 14.6% performance gains over DINOv2-B/14 and DINOv3-B/16 on TAP-Vid-DAVIS benchmark, using only 3.6K VOS training videos.
Conclusion: M2P provides an effective weakly-supervised approach to improve VFMs for point tracking, serving as general pre-trained models for both test-time optimized and offline fine-tuned tracking tasks.
Abstract: Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) like DINOv2 via offline finetuning or test-time optimization. However, these VFMs rely on static image pre-training, which is inherently sub-optimal for capturing dense temporal correspondence in videos. To address this, we propose Mask-to-Point (M2P) learning, which leverages rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking. Our M2P introduces three new mask-based constraints for weakly-supervised representation learning. First, we propose a local structure consistency loss, which leverages Procrustes analysis to model the cohesive motion of points lying within a local structure, achieving more reliable point-to-point matching learning. Second, we propose a mask label consistency (MLC) loss, which enforces that sampled foreground points strictly match foreground regions across frames. The proposed MLC loss can be regarded as a regularization, which stabilizes training and prevents convergence to trivial solutions. Finally, a mask boundary constraint is applied to explicitly supervise boundary points. We show that our weakly-supervised M2P models significantly outperform baseline VFMs with efficient training by using only 3.6K VOS training videos. Notably, M2P achieves 12.8% and 14.6% performance gains over DINOv2-B/14 and DINOv3-B/16 on the TAP-Vid-DAVIS benchmark, respectively. Moreover, the proposed M2P models are used as pre-trained backbones for both test-time optimized and offline fine-tuned TAP tasks, demonstrating their potential to serve as general pre-trained models for point tracking. Code will be made publicly available upon acceptance.
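The local structure consistency idea can be sketched with a classical orthogonal Procrustes solve: align corresponding local neighborhoods between frames and penalize the residual. This is a generic Procrustes residual, not the paper's exact loss.

```python
import torch

def procrustes_residual(P, Q):
    """P, Q: (N, 2) matched points of one local structure in two frames."""
    P0 = P - P.mean(0)                   # center both neighborhoods
    Q0 = Q - Q.mean(0)
    U, _, Vh = torch.linalg.svd(P0.T @ Q0)
    R = U @ Vh                           # optimal rotation (up to reflection)
    return ((P0 @ R - Q0) ** 2).mean()   # cohesion residual -> loss term

P = torch.randn(16, 2)
theta = torch.tensor(0.3)
rot = torch.stack([torch.stack([theta.cos(), -theta.sin()]),
                   torch.stack([theta.sin(), theta.cos()])])
Q = P @ rot.T + 0.5                      # rotated + translated copy
print(procrustes_residual(P, Q))         # ~0 for a rigid local structure
```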
[238] Steering Video Diffusion Transformers with Massive Activations
Xianhang Cheng, Yujian Zheng, Zhenyu Xie, Tingting Liao, Hao Li
Main category: cs.CV
TL;DR: STAS: A training-free method that leverages rare high-magnitude hidden state spikes (Massive Activations) in video diffusion transformers to improve video generation quality by steering activations at key temporal positions.
Details
Motivation: Despite progress in video diffusion transformers, how to leverage internal model signals with minimal overhead to enhance video generation quality remains underexplored. The paper investigates Massive Activations (MAs) - rare, high-magnitude hidden state spikes - and their structured patterns in video generation models.
Method: The authors observe that MAs emerge consistently across visual tokens with a clear magnitude hierarchy: first-frame tokens have largest magnitudes, latent-frame boundary tokens (head/tail of temporal chunks) show elevated magnitudes, and interior tokens have moderate magnitudes. Based on this structured pattern, they propose Structured Activation Steering (STAS), a training-free self-guidance method that steers MA values at first-frame and boundary tokens toward a scaled global maximum reference magnitude (see the illustrative sketch after the abstract).
Result: STAS achieves consistent improvements in video quality and temporal coherence across different text-to-video models while introducing negligible computational overhead.
Conclusion: The structured pattern of Massive Activations reveals that video diffusion transformers implicitly prioritize token positions aligned with temporal chunking in latent space. STAS effectively leverages this internal signal to enhance video generation without additional training.
Abstract: Despite rapid progress in video diffusion transformers, how their internal model signals can be leveraged with minimal overhead to enhance video generation quality remains underexplored. In this work, we study the role of Massive Activations (MAs), which are rare, high-magnitude hidden state spikes in video diffusion transformers. We observed that MAs emerge consistently across all visual tokens, with a clear magnitude hierarchy: first-frame tokens exhibit the largest MA magnitudes, latent-frame boundary tokens (the head and tail portions of each temporal chunk in the latent space) show elevated but slightly lower MA magnitudes than the first frame, and interior tokens within each latent frame remain elevated, yet are comparatively moderate in magnitude. This structured pattern suggests that the model implicitly prioritizes token positions aligned with the temporal chunking in the latent space. Based on this observation, we propose Structured Activation Steering (STAS), a training-free self-guidance-like method that steers MA values at first-frame and boundary tokens toward a scaled global maximum reference magnitude. STAS achieves consistent improvements in terms of video quality and temporal coherence across different text-to-video models, while introducing negligible computational overhead.
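A toy rendition of the steering rule: push the spike channel of first-frame and chunk-boundary tokens toward a scaled global-maximum reference. Which channels carry massive activations, and the steering strength alpha, are assumptions of this sketch.

```python
import torch

def stas_steer(hidden, first_idx, boundary_idx, alpha=0.9):
    """hidden: (B, N, C) transformer hidden states for video tokens."""
    target = hidden.abs().max() * alpha            # scaled global-max reference
    for idx in (first_idx, boundary_idx):
        tok = hidden[:, idx, :]                    # (B, |idx|, C)
        # Locate the spike channel per token and push it toward the target.
        ch = tok.abs().argmax(dim=-1, keepdim=True)
        spike = tok.gather(-1, ch)
        tok.scatter_(-1, ch, torch.sign(spike) * target)
        hidden[:, idx, :] = tok
    return hidden

h = torch.randn(1, 64, 32) * 5
steered = stas_steer(h, first_idx=torch.arange(0, 8),
                     boundary_idx=torch.tensor([8, 15, 16, 23]))
print(steered.shape)
```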
[239] TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models
Qianlong Xiang, Miao Zhang, Haoyu Zhang, Kun Wang, Junhui Hou, Liqiang Nie
Main category: cs.CV
TL;DR: TINA is a text-free inversion attack that bypasses text-centric concept erasure defenses in diffusion models by operating under null-text conditions, revealing that erased visual knowledge persists despite text-to-image mapping being severed.
Details
Motivation: Current concept erasure techniques in text-to-image diffusion models focus on severing text-to-image mapping but ignore that underlying visual knowledge persists, creating a false sense of security. There's a need to evaluate erasure from a visual perspective rather than just text-centric defenses.
Method: TINA (Text-free INversion Attack) uses DDIM inversion under null-text conditions to probe for visual generative pathways of erased concepts. It includes an optimization procedure to overcome approximation errors that occur when standard inversion operates without textual guidance, avoiding text-centric defenses (see the illustrative sketch after the abstract).
Result: TINA successfully regenerates erased concepts from models treated with state-of-the-art unlearning methods, demonstrating that current erasure techniques only obscure concepts rather than truly removing visual knowledge.
Conclusion: Current concept erasure methods are insufficient as they merely hide concepts through text-to-image mapping manipulation while visual knowledge persists. There’s an urgent need for new paradigms that directly address internal visual knowledge rather than focusing on text-centric defenses.
Abstract: Although text-to-image diffusion models exhibit remarkable generative power, concept erasure techniques are essential for their safe deployment to prevent the creation of harmful content. This has fostered a dynamic interplay between the development of erasure defenses and the adversarial probes designed to bypass them, and this co-evolution has progressively enhanced the efficacy of erasure methods. However, this adversarial co-evolution has converged on a narrow, text-centric paradigm that equates erasure with severing the text-to-image mapping, ignoring that the underlying visual knowledge related to undesired concepts still persists. To substantiate this claim, we investigate from a visual perspective, leveraging DDIM inversion to probe whether a generative pathway for the erased concept can still be found. However, identifying such a visual generative pathway is challenging because standard text-guided DDIM inversion is actively resisted by text-centric defenses within the erased model. To address this, we introduce TINA, a novel Text-free INversion Attack, which enforces this visual-only probe by operating under a null-text condition, thereby avoiding existing text-centric defenses. Moreover, TINA integrates an optimization procedure to overcome the accumulating approximation errors that arise when standard inversion operates without its usual textual guidance. Our experiments demonstrate that TINA regenerates erased concepts from models treated with state-of-the-art unlearning. The success of TINA proves that current methods merely obscure concepts, highlighting an urgent need for paradigms that operate directly on internal visual knowledge.
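The probe TINA builds on — DDIM inversion under a null condition — can be sketched generically. Here eps_model is a stand-in epsilon-prediction network, the noise schedule is a toy one, and TINA's optimization that corrects accumulated inversion error is not reproduced.

```python
import torch

def ddim_invert(x0, eps_model, alphas_cumprod, null_cond, steps):
    """Deterministically map a clean latent x0 forward to x_T (eta = 0 DDIM)."""
    x = x0
    for t_prev, t in zip(steps[:-1], steps[1:]):     # increasing timesteps
        a_prev, a = alphas_cumprod[t_prev], alphas_cumprod[t]
        eps = eps_model(x, t_prev, null_cond)        # null-text: no guidance
        x0_pred = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        x = a.sqrt() * x0_pred + (1 - a).sqrt() * eps
    return x

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
acp = torch.cumprod(1 - betas, dim=0)                # toy noise schedule
dummy_eps = lambda x, t, c: torch.zeros_like(x)      # stand-in predictor
xT = ddim_invert(torch.randn(1, 4, 8, 8), dummy_eps, acp,
                 null_cond=None, steps=list(range(0, T, 50)))
print(xT.shape)
```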
[240] Video Understanding: From Geometry and Semantics to Unified Models
Zhaochong An, Zirui Li, Mingqiao Ye, Feng Qiao, Jiaang Li, Zongwei Wu, Vishal Thengane, Chengzu Li, Lei Li, Luc Van Gool, Guolei Sun, Serge Belongie
Main category: cs.CV
TL;DR: Survey paper providing structured overview of video understanding literature organized into three perspectives: low-level geometry, high-level semantics, and unified models, highlighting shift toward unified modeling paradigms.
Details
Motivation: Video understanding requires modeling temporal dynamics and evolving visual context, placing stronger demands on spatiotemporal reasoning than image understanding. The field needs a structured overview to map the evolving landscape and identify key trends toward building robust, scalable, and unified video foundation models.
Method: Survey methodology organizing literature into three complementary perspectives: 1) low-level video geometry understanding, 2) high-level semantic understanding, and 3) unified video understanding models. The survey consolidates these perspectives to provide a coherent map of the field.
Result: Provides comprehensive overview of video understanding landscape, summarizes key modeling trends and design principles, and highlights the broader shift from isolated, task-specific pipelines toward unified modeling paradigms that can be adapted to diverse downstream objectives.
Conclusion: Video understanding is a foundational problem in computer vision requiring spatiotemporal reasoning. The survey offers a systematic view of recent progress and outlines open challenges toward building robust, scalable, and unified video foundation models.
Abstract: Video understanding aims to enable models to perceive, reason about, and interact with the dynamic visual world. In contrast to image understanding, video understanding inherently requires modeling temporal dynamics and evolving visual context, placing stronger demands on spatiotemporal reasoning and making it a foundational problem in computer vision. In this survey, we present a structured overview of video understanding by organizing the literature into three complementary perspectives: low-level video geometry understanding, high-level semantic understanding, and unified video understanding models. We further highlight a broader shift from isolated, task-specific pipelines toward unified modeling paradigms that can be adapted to diverse downstream objectives, enabling a more systematic view of recent progress. By consolidating these perspectives, this survey provides a coherent map of the evolving video understanding landscape, summarizes key modeling trends and design principles, and outlines open challenges toward building robust, scalable, and unified video foundation models.
[241] Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass
Chen Liyi, Wang Pengfei, Zhang Guowen, Ma Zhiyuan, Zhang Lei
Main category: cs.CV
TL;DR: Omni-3DEdit: A unified learning-based model for various 3D editing tasks that eliminates iterative optimization by using synthesized paired multi-view data and a dual-stream LoRA architecture.
Details
Motivation: Existing 3D editing methods rely on 2D models to guide iterative optimization of 3D representations, which lacks universal design for different tasks (appearance editing vs removal) and is time-consuming with thousands of optimization steps.
Method: Constructs a data pipeline to synthesize paired multi-view editing samples, adapts pre-trained generative model SEVA as backbone by concatenating source view latents with conditional tokens, and uses dual-stream LoRA module to disentangle different view cues.
Result: The model completes various 3D editing tasks in one forward pass, reducing inference time from tens of minutes to approximately two minutes, with extensive experiments demonstrating effectiveness and efficiency.
Conclusion: Omni-3DEdit provides a unified, learning-based approach for 3D editing that overcomes limitations of iterative optimization methods and achieves efficient multi-task 3D editing.
Abstract: Most instruction-driven 3D editing methods rely on 2D models to guide the explicit and iterative optimization of 3D representations. This paradigm, however, suffers from two primary drawbacks. First, it lacks a universal design across different 3D editing tasks, because the explicit manipulation of 3D geometry necessitates task-dependent rules: e.g., 3D appearance editing must preserve the inherent source 3D geometry, while 3D removal alters it. Second, the iterative optimization process is highly time-consuming, often requiring thousands of 2D/3D update invocations. We present Omni-3DEdit, a unified, learning-based model that generalizes various 3D editing tasks implicitly. One key challenge to achieving our goal is the scarcity of paired source-edited multi-view assets for training. To address this issue, we construct a data pipeline, synthesizing a relatively rich number of high-quality paired multi-view editing samples. Subsequently, we adapt the pre-trained generative model SEVA as our backbone by concatenating source view latents along with conditional tokens in sequence space. A dual-stream LoRA module is proposed to disentangle different view cues, largely enhancing our model's representational learning capability. As a learning-based model, our model is free of time-consuming online optimization, and it can complete various 3D editing tasks in one forward pass, reducing the inference time from tens of minutes to approximately two minutes. Extensive experiments demonstrate the effectiveness and efficiency of Omni-3DEdit.
[242] Revisiting foundation models for cell instance segmentation
Anwai Archit, Constantin Pape
Main category: cs.CV
TL;DR: Comprehensive evaluation of SAM-based foundation models for cell segmentation in microscopy images, introducing APG strategy to improve segmentation performance.
Details
Motivation: To evaluate and improve foundation models for cell segmentation in microscopy images, addressing the gap between general-purpose segmentation models (SAM, SAM2, SAM3) and microscopy-specific adaptations (CellPoseSAM, CellSAM, μSAM).
Method: Comprehensive evaluation of multiple foundation models on diverse microscopy datasets, plus introduction of Automatic Prompt Generation (APG) strategy to enhance SAM-based microscopy models by generating prompts automatically.
Result: APG consistently improves segmentation results for μSAM and is competitive with state-of-the-art CellPoseSAM. Provides lessons for adapting SAM-style models to microscopy and strategies for creating more powerful microscopy foundation models.
Conclusion: SAM-based models can be effectively adapted for microscopy segmentation, and APG strategy significantly enhances performance, offering practical guidance for developing better microscopy foundation models.
Abstract: Cell segmentation is a fundamental task in microscopy image analysis. Several foundation models for cell segmentation have been introduced; virtually all of them are extensions of the Segment Anything Model (SAM), improving it for microscopy data. Recently, SAM2 and SAM3 have been published, further improving and extending the capabilities of general-purpose segmentation foundation models. Here, we comprehensively evaluate foundation models for cell segmentation (CellPoseSAM, CellSAM, μSAM) and for general-purpose segmentation (SAM, SAM2, SAM3) on a diverse set of (light) microscopy datasets, for tasks including cell, nucleus and organoid segmentation. Furthermore, we introduce a new instance segmentation strategy called automatic prompt generation (APG) that can be used to further improve SAM-based microscopy foundation models. APG consistently improves segmentation results for μSAM, which is used as the base model, and is competitive with the state-of-the-art model CellPoseSAM. Moreover, our work provides important lessons for adaptation strategies of SAM-style models to microscopy and offers a strategy for creating even more powerful microscopy foundation models. Our code is publicly available at https://github.com/computational-cell-analytics/micro-sam.
[243] VISER: Visually-Informed System for Enhanced Robustness in Open-Set Iris Presentation Attack Detection
Byron Dowling, Eleanor Frederick, Jacob Piland, Adam Czajka
Main category: cs.CV
TL;DR: Comparing human perceptual priors for iris presentation attack detection, finding denoised eye tracking heatmaps provide best generalization in open-set PAD
Details
Motivation: To determine the most effective form of human saliency guidance for open-set iris presentation attack detection, comparing different perceptual priors against deep learning baselines.
Method: Experimental comparison of hand annotations, eye tracking heatmaps, segmentation masks, and DINOv2 embeddings against a state-of-the-art deep learning baseline, using a leave-one-attack-type-out paradigm for open-set PAD.
Result: Denoised eye tracking heatmaps showed best generalization improvement over cross entropy in terms of AUROC and APCER at BPCER of 1%
Conclusion: Human perceptual priors, particularly denoised eye tracking heatmaps, can effectively improve generalization in open-set iris PAD, with resources provided for reproducibility
Abstract: Human perceptual priors have shown promise in saliency-guided deep learning training, particularly in the domain of iris presentation attack detection (PAD). Common saliency approaches include hand annotations obtained via mouse clicks and eye gaze heatmaps derived from eye tracking data. However, the most effective form of human saliency for open-set iris PAD remains underexplored. In this paper, we conduct a series of experiments comparing hand annotations, eye tracking heatmaps, segmentation masks, and DINOv2 embeddings to a state-of-the-art deep learning-based baseline on the task of open-set iris PAD. Results for open-set PAD in a leave-one-attack-type out paradigm indicate that denoised eye tracking heatmaps show the best generalization improvement over cross entropy in terms of Area Under the ROC curve (AUROC) and Attack Presentation Classification Error Rate (APCER) at Bona Fide Presentation Classification Error Rate (BPCER) of 1%. Along with this paper, we offer trained models, code, and saliency maps for reproducibility and to facilitate follow-up research efforts.
[244] Edit Spillover as a Probe: Do Image Editing Models Implicitly Understand World Relations?
Guandong Li, Zhaobin Chu
Main category: cs.CV
TL;DR: EditSpilloverProbe framework uses unintended edit spillover in image editing models as a probe for world knowledge, revealing trade-offs between editing control and semantic understanding.
Details
Motivation: Instruction-following image editing models often modify unspecified content (edit spillover), raising questions about whether this reflects genuine world understanding or just attention leakage.
Method: Proposes EditSpilloverProbe with spillover taxonomy (spatial, semantic, mixed, random), automated detection pipeline, and EditSpilloverBench dataset from real-world Chinese text editing tasks (see the illustrative sketch after the abstract).
Result: Evaluation of 5 models shows: 1) spillover rates vary 3.3x (3.49%-11.46%), 2) nano_banana has most semantic spillover (27.8/image) vs qwen_2511’s precise control, 3) semantic spillover proportion remains constant (40%-58%) with distance.
Conclusion: Edit spillover serves as natural probe for world knowledge; semantic spillover reflects genuine understanding, not spatial diffusion; reveals trade-off between editing control and world understanding.
Abstract: Instruction-following image editing models are expected to modify only the specified region while keeping the rest of the image unchanged. However, in practice, we observe a pervasive phenomenon – edit spillover: models alter semantically related but unspecified content outside the edit region. This raises a fundamental question – does spillover reflect genuine implicit world understanding, or is it merely attention leakage? We propose EditSpilloverProbe, a systematic framework that repurposes edit spillover as a natural probe for world knowledge in image editing models. We introduce a spillover taxonomy (spatial, semantic, mixed, random), an automated detection-and-classification pipeline, and a benchmark dataset constructed from real-world Chinese text editing tasks, EditSpilloverBench. Systematic evaluation of 5 representative editing models reveals three core findings: (1) spillover rates vary dramatically across architectures, from 3.49% to 11.46%, with a 3.3x ratio; (2) absolute semantic spillover quantity reveals models’ world understanding capability – nano_banana produces the most semantic spillover (27.8 per image), while qwen_2511 has the most precise editing control but lower semantic spillover (16.3 per image), revealing a trade-off between editing control and world understanding; (3) spatial decay analysis shows spillover area density decays exponentially with distance, but the proportion of semantically relevant spillover remains constant (40%-58%), providing direct evidence that semantic spillover reflects genuine world understanding rather than spatial diffusion.
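A minimal spillover measurement in the spirit of the automated detection pipeline: count changed pixels outside the instructed edit region. The change threshold is invented, and the paper's semantic classification of spillover types is left out.

```python
import numpy as np

def spillover_rate(src, out, edit_mask, tol=12):
    """src/out: (H, W, 3) uint8 images; edit_mask: (H, W) bool,
    True where the instruction targets the image."""
    changed = np.abs(out.astype(int) - src.astype(int)).max(-1) > tol
    outside = ~edit_mask
    return changed[outside].mean()     # fraction of untouched area altered

H, W = 64, 64
src = np.random.randint(0, 255, (H, W, 3), dtype=np.uint8)
out = src.copy()
out[:8, :8] = 0                        # a change outside the edit region
mask = np.zeros((H, W), bool); mask[32:, 32:] = True
print(f"spillover rate: {spillover_rate(src, out, mask):.4f}")
```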
[245] Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation
Yingjie Chen, Shilun Lin, Cai Xing, Qixin Yan, Wenjing Wang, Dingming Liu, Hao Liu, Chen Li, Jing Lyu
Main category: cs.CV
TL;DR: A unified framework for identity-aware joint audio-video generation that enables fine-grained control over facial appearance and voice timbre across multiple identities.
Details
Motivation: There's growing demand for identity-aware content creation, but existing methods lack an openly accessible framework for fine-grained control over both facial appearance and voice timbre across multiple identities in joint audio-video generation.
Method: Proposes: 1) Data curation pipeline that automatically extracts identity-bearing information with paired audio-visual annotations, 2) Flexible identity injection mechanism for single- and multi-subject scenarios using facial appearance and vocal timbre as control signals, 3) Multi-stage training strategy to address modality disparity and enforce cross-modal coherence.
Result: Experiments demonstrate the superiority of the proposed framework in generating high-fidelity and consistent personalized audio-video content.
Conclusion: The paper presents a scalable framework for identity-aware joint audio-video generation that enables fine-grained control over both visual and audio identity characteristics, addressing an important gap in multimodal content creation.
Abstract: Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance and vocal timbre act as identity-bearing control signals. Moreover, in light of modality disparity, we design a multi-stage training strategy to accelerate convergence and enforce cross-modal coherence. Experiments demonstrate the superiority of the proposed framework. For more details and qualitative results, please refer to our webpage: https://chen-yingjie.github.io/projects/Identity-as-Presence
[246] A Creative Agent is Worth a 64-Token Template
Ruixiao Shi, Fu Feng, Yucheng Xie, Xu Yang, Jing Wang, Xin Geng
Main category: cs.CV
TL;DR: CAT framework introduces Creative Agent Tokenization to generate reusable creative tokens from fuzzy prompts, enabling efficient creative text-to-image generation without repeated reasoning.
Details
Motivation: Current T2I models struggle with fuzzy creative prompts and require expensive iterative reasoning/agent approaches for creative generation, making creativity costly and non-reusable.
Method: Creative Tokenizer generates reusable token templates from fuzzy prompt embeddings via creative semantic disentanglement, leveraging relations among overlapping concept pairs to capture latent creative representations.
Result: CAT achieves 3.7× speedup and 4.8× reduction in computational cost while producing images with superior human preference and text-image alignment compared to SOTA methods.
Conclusion: CAT provides a scalable and effective paradigm for enhancing creativity in T2I generation by encapsulating creative understanding in reusable tokens.
Abstract: Text-to-image (T2I) models have substantially improved image fidelity and prompt adherence, yet their creativity remains constrained by reliance on discrete natural language prompts. When presented with fuzzy prompts such as "a creative vinyl record-inspired skyscraper", these models often fail to infer the underlying creative intent, leaving creative ideation and prompt design largely to human users. Recent reasoning- or agent-driven approaches iteratively augment prompts but incur high computational and monetary costs, as their instance-specific generation makes "creativity" costly and non-reusable, requiring repeated queries or reasoning for subsequent generations. To address this, we introduce CAT, a framework for Creative Agent Tokenization that encapsulates agents' intrinsic understanding of "creativity" through a Creative Tokenizer. Given the embeddings of fuzzy prompts, the tokenizer generates a reusable token template that can be directly concatenated with them to inject creative semantics into T2I models without repeated reasoning or prompt augmentation. To enable this, the tokenizer is trained via creative semantic disentanglement, leveraging relations among partially overlapping concept pairs to capture the agent's latent creative representations. Extensive experiments on Architecture Design, Furniture Design, and Nature Mixture tasks demonstrate that CAT provides a scalable and effective paradigm for enhancing creativity in T2I generation, achieving a 3.7× speedup and a 4.8× reduction in computational cost, while producing images with superior human preference and text-image alignment compared to state-of-the-art T2I models and creative generation methods.
[247] SpiderCam: Low-Power Snapshot Depth from Differential Defocus
Marcos A. Ferreira, Tianao Li, John Mamish, Josiah Hester, Yaman Sangar, Qi Guo, Emma Alexander
Main category: cs.CV
TL;DR: SpiderCam is an FPGA-based snapshot depth-from-defocus camera that produces real-time sparse depth maps at 32.5 FPS with extremely low power consumption (624 mW total).
Details
Motivation: The paper aims to develop a low-power 3D camera system for real-time depth sensing applications, addressing the need for energy-efficient depth estimation hardware that can operate in power-constrained environments.
Method: Uses a custom camera capturing two differently focused images simultaneously, processed with a SystemVerilog implementation of depth from differential defocus (DfDD) on a low-power FPGA. Includes algorithmic improvements to handle low-power sensor challenges and a memory-local implementation for streaming depth computation (see the illustrative sketch after the abstract).
Result: Achieves 480x400 sparse depth maps at 32.5 FPS over 52 cm working range with total power consumption of 624 mW, making it the first sub-Watt passive FPGA-based 3D camera reported in literature.
Conclusion: SpiderCam demonstrates that real-time depth sensing with sub-Watt power consumption is achievable through hardware-software co-design, opening possibilities for battery-powered 3D vision applications.
Abstract: We introduce SpiderCam, an FPGA-based snapshot depth-from-defocus camera which produces 480x400 sparse depth maps in real-time at 32.5 FPS over a working range of 52 cm while consuming 624 mW of power in total. SpiderCam comprises a custom camera that simultaneously captures two differently focused images of the same scene, processed with a SystemVerilog implementation of depth from differential defocus (DfDD) on a low-power FPGA. To achieve state-of-the-art power consumption, we present algorithmic improvements to DfDD that overcome challenges caused by low-power sensors, and design a memory-local implementation for streaming depth computation on a device that is too small to store even a single image pair. We report the first sub-Watt total power measurement for passive FPGA-based 3D cameras in the literature.
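For flavor, a toy depth-from-differential-defocus estimator: classical DfDD formulations recover depth, up to per-device calibration, from the ratio of the image difference to the Laplacian of the image sum, gating out pixels with a weak Laplacian — one plausible reason the output maps are sparse. The constants a, b and the gate eps below are made up, and this is not SpiderCam's hardware algorithm.

```python
import numpy as np
from scipy.ndimage import laplace

def dfdd_depth(I1, I2, a=1.0, b=0.0, eps=1e-2):
    """I1, I2: two differently focused images of the same scene."""
    lap = laplace(I1 + I2)
    valid = np.abs(lap) > eps              # confidence gate -> sparse map
    depth = np.full(I1.shape, np.nan)
    depth[valid] = a * (I1 - I2)[valid] / lap[valid] + b
    return depth

I1 = np.random.rand(400, 480)
I2 = I1 + 0.01 * np.random.rand(400, 480)  # slightly different focus
d = dfdd_depth(I1, I2)
print(np.isnan(d).mean())                  # fraction of masked pixels
```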
[248] Noise-Aware Misclassification Attack Detection in Collaborative DNN Inference
Shima Yousefi, Saptarshi Debroy
Main category: cs.CV
TL;DR: A semi-gray-box anomaly detection framework using variational autoencoders to detect adversarial attacks in collaborative edge-AI object classification systems under noisy conditions.
Details
Motivation: Edge-AI systems using collaborative inference are vulnerable to stealthy adversarial attacks that cause misclassifications, especially in noisy environments where detection is challenging.
Method: Proposes a variational autoencoder-based anomaly detection framework with noise-aware features that capture environmental noise characteristics to distinguish between legitimate noise and adversarial manipulations (see the illustrative sketch after the abstract).
Result: Achieves up to 90% AUROC across different DNN configurations under realistic noisy conditions, though performance degrades with feature similarity and high noise levels.
Conclusion: The framework provides robust detection of adversarial attacks in edge-AI systems but has limitations when dealing with similar features or extremely noisy environments.
Abstract: Collaborative inference of object classification Deep Neural Networks (DNNs), where resource-constrained end-devices offload partially processed data to remote edge servers to complete end-to-end processing, is becoming a key enabler of edge-AI. However, such edge-offloading is vulnerable to malicious data injections leading to stealthy misclassifications that are tricky to detect, especially in the presence of environmental noise. In this paper, we propose a semi-gray-box and noise-aware anomaly detection framework fueled by a variational autoencoder (VAE) to capture deviations caused by adversarial manipulation. The proposed framework incorporates a robust noise-aware feature that captures the characteristic behavior of environmental noise to improve detection accuracy while reducing false alarm rates. Our evaluation with popular object classification DNNs demonstrates the robustness of the proposed detection (up to 90% AUROC across DNN configurations) under realistic noisy conditions while revealing limitations caused by feature similarity and elevated noise levels.
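A bare-bones sketch of VAE-based anomaly scoring on offloaded intermediate features: reconstruction error plus a KL term flags manipulated activations. Sizes and the score weighting are illustrative, and the paper's noise-aware feature engineering is not reproduced.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, d_in=128, d_z=16):
        super().__init__()
        self.enc = nn.Linear(d_in, 2 * d_z)   # outputs mu and logvar
        self.dec = nn.Linear(d_z, d_in)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return self.dec(z), mu, logvar

def anomaly_score(vae, feats, beta=0.1):
    recon, mu, logvar = vae(feats)
    rec = ((recon - feats) ** 2).mean(-1)                       # per sample
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
    return rec + beta * kl            # threshold this to raise an alarm

feats = torch.randn(8, 128)           # offloaded intermediate DNN features
print(anomaly_score(TinyVAE(), feats).shape)   # (8,)
```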
[249] SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale
Markus Gross, Sai Bharadhwaj Matha, Rui Song, Viswanathan Muthuveerappan, Conrad Christoph, Julius Huber, Daniel Cremers
Main category: cs.CV
TL;DR: A scalable geometry-driven 2D-3D-2D paradigm for automatic label propagation and cross-modal alignment in UAV imagery, creating SegFly - a large-scale RGB-T semantic segmentation benchmark with over 20,000 RGB and 15,000 aligned RGB-T images.
Details
Motivation: Existing UAV semantic segmentation datasets are limited in scale, diversity, and annotation efficiency due to high manual labeling costs and difficulties in accurate RGB-T alignment on off-the-shelf UAVs.
Method: Proposes a 2D-3D-2D paradigm that leverages multi-view redundancy in aerial imagery to automatically propagate labels: lift a small subset (3%) of manually annotated RGB images into a semantic 3D point cloud, then reproject it to all views to generate dense pseudo ground-truth for both RGB and thermal modalities. Also uses 3D geometry as intermediate alignment space for cross-modal RGB-T registration (see the illustrative sketch after the abstract).
Result: Automatically produces 97% of RGB labels and 100% of thermal labels with 91% and 88% annotation accuracy without manual refinement. Achieves 87% registration accuracy for RGB-T alignment. Constructs SegFly benchmark with over 20,000 high-resolution RGB images and 15,000 geometrically aligned RGB-T pairs across diverse environments.
Conclusion: The geometry-driven 2D-3D-2D pipeline enables scalable multi-modal scene understanding, with both conventional architectures and vision foundation models benefiting substantially from SegFly supervision, demonstrating the potential for scalable multi-modal aerial scene understanding.
Abstract: Semantic segmentation for uncrewed aerial vehicles (UAVs) is fundamental for aerial scene understanding, yet existing RGB and RGB-T datasets remain limited in scale, diversity, and annotation efficiency due to the high cost of manual labeling and the difficulties of accurate RGB-T alignment on off-the-shelf UAVs. To address these challenges, we propose a scalable geometry-driven 2D-3D-2D paradigm that leverages multi-view redundancy in high-overlap aerial imagery to automatically propagate labels from a small subset of manually annotated RGB images to both RGB and thermal modalities within a unified framework. By lifting less than 3% of RGB images into a semantic 3D point cloud and reprojecting it into all views, our approach enables dense pseudo ground-truth generation across large image collections, automatically producing 97% of RGB labels and 100% of thermal labels while achieving 91% and 88% annotation accuracy without any 2D manual refinement. We further extend this 2D-3D-2D paradigm to cross-modal image registration, using 3D geometry as an intermediate alignment space to obtain fully automatic, strong pixel-level RGB-T alignment with 87% registration accuracy and no hardware-level synchronization. Applying our framework to existing geo-referenced aerial imagery, we construct SegFly, a large-scale benchmark with over 20,000 high-resolution RGB images and more than 15,000 geometrically aligned RGB-T pairs spanning diverse urban, industrial, and rural environments across multiple altitudes and seasons. On SegFly, we establish the Firefly baseline for RGB and thermal semantic segmentation and show that both conventional architectures and vision foundation models benefit substantially from SegFly supervision, highlighting the potential of geometry-driven 2D-3D-2D pipelines for scalable multi-modal scene understanding. Data and Code available at https://github.com/markus-42/SegFly.
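The core 3D-to-2D step of the label-propagation paradigm can be sketched as z-buffered reprojection of a labeled point cloud into each view with a pinhole camera model; the camera parameters below are toy values, not the pipeline's.

```python
import numpy as np

def reproject_labels(pts, labels, K, R, t, hw):
    """Splat 3D point labels into a (H, W) pseudo-label image."""
    H, W = hw
    cam = pts @ R.T + t                       # world -> camera
    front = cam[:, 2] > 1e-6                  # keep points in front of camera
    uvz = cam[front] @ K.T
    u = (uvz[:, 0] / uvz[:, 2]).astype(int)
    v = (uvz[:, 1] / uvz[:, 2]).astype(int)
    z = uvz[:, 2]
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    label_img = np.full((H, W), -1); depth = np.full((H, W), np.inf)
    for ui, vi, zi, li in zip(u[ok], v[ok], z[ok], labels[front][ok]):
        if zi < depth[vi, ui]:                # z-buffer: nearest point wins
            depth[vi, ui] = zi; label_img[vi, ui] = li
    return label_img

pts = np.random.rand(5000, 3) * [10, 10, 1] + [0, 0, 5]
labels = np.random.randint(0, 4, 5000)
K = np.array([[100., 0, 64], [0, 100., 64], [0, 0, 1]])
img = reproject_labels(pts, labels, K, np.eye(3), np.zeros(3), (128, 128))
print((img >= 0).mean())                      # pseudo-label coverage
```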
[250] LoST: Level of Semantics Tokenization for 3D Shapes
Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero, Chun-Hao Paul Huang, Duygu Ceylan, Niloy J. Mitra, Xuelin Chen
Main category: cs.CV
TL;DR: LoST proposes semantic-aware tokenization for 3D shapes that orders tokens by semantic importance, enabling efficient autoregressive generation with better reconstruction quality using far fewer tokens.
Details
Motivation: Current 3D shape tokenization methods rely on geometric level-of-detail hierarchies designed for rendering/compression, which are token-inefficient and lack semantic coherence for autoregressive modeling. There's a need for semantic-aware tokenization that prioritizes principal semantics.
Method: Proposes Level-of-Semantics Tokenization (LoST) that orders tokens by semantic salience. Introduces Relational Inter-Distance Alignment (RIDA) loss to align 3D shape latent space with semantic DINO feature space, ensuring semantic coherence in token ordering (see the illustrative sketch after the abstract).
Result: Achieves state-of-the-art reconstruction quality, surpassing previous LoD-based tokenizers by large margins on both geometric and semantic metrics. Enables efficient high-quality autoregressive 3D generation using only 0.1%-10% of tokens compared to prior models.
Conclusion: LoST provides a semantic-aware tokenization approach for 3D shapes that enables efficient autoregressive generation with superior reconstruction quality and supports downstream tasks like semantic retrieval.
Abstract: Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior AR models.
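RIDA's idea — aligning the relational (pairwise-distance) structure of the shape latent space with that of a semantic feature space — admits a very small sketch. The mean-based normalization is an assumption of this sketch, not the paper's exact formulation.

```python
import torch

def rida_loss(z, f):
    """z: (B, Dz) shape latents; f: (B, Df) semantic (DINO-like) features."""
    dz = torch.cdist(z, z)                # pairwise distances in latent space
    df = torch.cdist(f, f)                # pairwise distances in feature space
    dz = dz / (dz.mean() + 1e-6)          # scale-invariant comparison
    df = df / (df.mean() + 1e-6)
    return ((dz - df) ** 2).mean()        # match the relational structures

z = torch.randn(16, 64, requires_grad=True)
f = torch.randn(16, 384)                  # e.g. a DINO embedding size
print(rida_loss(z, f))
```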
[251] A practical artificial intelligence framework for legal age estimation using clavicle computed tomography scans
Javier Venema, Stefano De Luca, Pablo Mesejo, Óscar Ibáñez
Main category: cs.CV
TL;DR: Interpretable multi-stage pipeline for legal age estimation from clavicle CT scans using automatic detection, guided slice selection, and conformal prediction for uncertainty quantification.
Details
Motivation: Legal age estimation needs accurate, robust methods with explicit uncertainty quantification for forensic contexts. While AI approaches exist for hand radiographs/dental imaging, clavicle CT scans remain underexplored despite their effectiveness.
Method: Three-stage pipeline: 1) Feature-based connected-component method for automatic clavicle detection with minimal manual annotation, 2) Integrated Gradients-guided slice selection for multi-slice CNN input, 3) Conformal prediction intervals for uncertainty-aware decisions.
Result: Achieves state-of-the-art MAE of 1.55 ± 0.16 years on 1,158 post-mortem CT scans, outperforming human experts (~1.90 years) and previous methods (>1.75 years). Conformal prediction enables configurable coverage levels aligned with forensic requirements.
Conclusion: Proposed interpretable pipeline provides accurate, uncertainty-aware legal age estimation from clavicle CT scans, currently being integrated into Skeleton-ID software as decision-support for multi-factorial forensic workflows.
Abstract: Legal age estimation plays a critical role in forensic and medico-legal contexts, where decisions must be supported by accurate, robust, and reproducible methods with explicit uncertainty quantification. While prior artificial intelligence (AI)-based approaches have primarily focused on hand radiographs or dental imaging, clavicle computed tomography (CT) scans remain underexplored despite their documented effectiveness for legal age estimation. In this work, we present an interpretable, multi-stage pipeline for legal age estimation from clavicle CT scans. The proposed framework combines (i) a feature-based connected-component method for automatic clavicle detection that requires minimal manual annotation, (ii) an Integrated Gradients-guided slice selection strategy used to construct the input data for a multi-slice convolutional neural network that estimates legal age, and (iii) conformal prediction intervals to support uncertainty-aware decisions in accordance with established international protocols. The pipeline is evaluated on 1,158 full-body post-mortem CT scans from a public forensic dataset (the New Mexico Decedent Image Database). The final model achieves state-of-the-art performance with a mean absolute error (MAE) of 1.55 $\pm$ 0.16 years on a held-out test set, outperforming both human experts (MAE of approximately 1.90 years) and previous methods (MAEs above 1.75 years in our same dataset). Furthermore, conformal prediction enables configurable coverage levels aligned with forensic requirements. Attribution maps indicate that the model focuses on anatomically relevant regions of the medial clavicular epiphysis. The proposed method, which is currently being added as part of the Skeleton-ID software (https://skeleton-id.com/skeleton-id/), is intended as a decision-support component within multi-factorial forensic workflows.
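The conformal component can be illustrated with the standard split-conformal recipe for regression (the paper's exact variant is not specified in this summary): hold out a calibration set, take a finite-sample-corrected quantile of the absolute residuals, and report symmetric intervals with guaranteed marginal coverage.

```python
import numpy as np

def split_conformal_interval(cal_abs_residuals, y_pred, alpha=0.1):
    """Symmetric (1 - alpha) prediction interval from calibration residuals.

    cal_abs_residuals: |y - model(x)| on a held-out calibration set.
    Requires numpy >= 1.22 for the `method` argument of np.quantile.
    """
    n = len(cal_abs_residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    q = np.quantile(cal_abs_residuals, level, method="higher")
    return y_pred - q, y_pred + q
```

With alpha = 0.1, at least 90% of new cases fall inside the interval on average, which is how the "configurable coverage levels aligned with forensic requirements" can be dialed to a protocol's demands.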
[252] Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning
Jingchun Yang, Jinchang Zhang
Main category: cs.CV
TL;DR: C-TRAIL: A multimodal legal dataset and framework for traffic accident responsibility analysis using dashcam videos and Chinese traffic regulations.
Details
Motivation: There's a gap between dashcam video evidence and legal responsibility determination. Current ego-view traffic accident studies focus on perception/semantic understanding, while LLM-based legal methods use only textual case descriptions, lacking video evidence integration.
Method: Two-stage framework: 1) Traffic accident understanding module generates textual video descriptions from dashcam videos; 2) Legal multi-agent framework outputs responsibility modes, statute sets, and complete judgment reports based on Chinese traffic regulations.
Result: Outperforms general and legal LLMs, as well as existing agent-based approaches, on the C-TRAIL and MM-AU datasets, while providing a transparent and interpretable legal reasoning process.
Conclusion: The proposed multimodal approach effectively bridges the gap between video evidence and legal responsibility determination in traffic accidents, offering a comprehensive solution for automated legal analysis.
Abstract: The widespread adoption of dashcams has made video evidence in traffic accidents increasingly abundant, yet transforming “what happened in the video” into “who is responsible under which legal provisions” still relies heavily on human experts. Existing ego-view traffic accident studies mainly focus on perception and semantic understanding, while LLM-based legal methods are mostly built on textual case descriptions and rarely incorporate video evidence, leaving a clear gap between the two. We first propose C-TRAIL, a multimodal legal dataset that, under the Chinese traffic regulation system, explicitly aligns dashcam videos and textual descriptions with a closed set of responsibility modes and their corresponding Chinese traffic statutes. On this basis, we introduce a two-stage framework: (1) a traffic accident understanding module that generates textual video descriptions; and (2) a legal multi-agent framework that outputs responsibility modes, statute sets, and complete judgment reports. Experimental results on C-TRAIL and MM-AU show that our method outperforms general and legal LLMs, as well as existing agent-based approaches, while providing a transparent and interpretable legal reasoning process.
[253] TransText: Transparency Aware Image-to-Video Typography Animation
Fei Zhang, Zijian Zhou, Bohao Tang, Sen He, Hang Li, Zhe Wang, Soubhik Sanyal, Pengfei Liu, Viktar Atliha, Tao Xiang, Frost Xu, Semih Gunel
Main category: cs.CV
TL;DR: TransText: A framework for adapting image-to-video models to layer-aware text animation using Alpha-as-RGB paradigm without modifying pre-trained generative models.
Details
Motivation: Existing methods for transparent glyph animation require reconstructing RGB-centric VAEs, which is computationally expensive and risks losing semantic priors from RGB data. There's a need for efficient adaptation of image-to-video models to handle transparency without compromising existing capabilities.
Method: Proposes TransText with Alpha-as-RGB paradigm that embeds alpha channel as RGB-compatible visual signal through latent spatial concatenation. This jointly models appearance and transparency without modifying pre-trained generative manifold, ensuring cross-modal consistency while preventing feature entanglement.
Result: TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects. The method successfully adapts image-to-video models to layer-aware text animation without the computational overhead of VAE retraining.
Conclusion: TransText provides an effective solution for adapting existing generative models to handle transparency in text animation, preserving semantic priors while enabling layer-aware capabilities through novel Alpha-as-RGB paradigm.
Abstract: We introduce the first method, to the best of our knowledge, for adapting image-to-video models to layer-aware text (glyph) animation, a capability critical for practical dynamic visual design. Existing approaches predominantly handle the transparency-encoding (alpha channel) as an extra latent dimension appended to the RGB space, necessitating the reconstruction of the underlying RGB-centric variational autoencoder (VAE). However, given the scarcity of high-quality transparent glyph data, retraining the VAE is computationally expensive and may erode the robust semantic priors learned from massive RGB corpora, potentially leading to latent pattern mixing. To mitigate these limitations, we propose TransText, a framework based on a novel Alpha-as-RGB paradigm to jointly model appearance and transparency without modifying the pre-trained generative manifold. TransText embeds the alpha channel as an RGB-compatible visual signal through latent spatial concatenation, explicitly ensuring strict cross-modal (RGB-and-Alpha) consistency while preventing feature entanglement. Our experiments demonstrate that TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects.
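One plausible reading of the Alpha-as-RGB idea, sketched below with a hypothetical frozen encoder `vae_encode` (this is an assumption, not the paper's confirmed interface): replicate the alpha matte into three channels so the unmodified RGB VAE can encode it, then concatenate the RGB and alpha latents spatially so the pre-trained generative manifold is left untouched.

```python
import torch

def encode_alpha_as_rgb(rgb, alpha, vae_encode):
    """Encode an RGBA frame with a frozen RGB-only VAE (hypothetical `vae_encode`).

    rgb: (B, 3, H, W); alpha: (B, 1, H, W); vae_encode: (B, 3, H, W) -> (B, C, h, w).
    The alpha matte is treated as a grayscale 'RGB' image and its latent is
    concatenated side by side with the RGB latent along the width axis.
    """
    z_rgb = vae_encode(rgb)
    z_alpha = vae_encode(alpha.repeat(1, 3, 1, 1))
    return torch.cat([z_rgb, z_alpha], dim=-1)
```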
[254] LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition
Vlad-Constantin Lungu-Stan, Ionut Mironica, Mariana-Iuliana Georgescu
Main category: cs.CV
TL;DR: LaDe is a latent diffusion framework that generates editable layered media designs (posters, flyers, logos) with flexible layer counts using natural language prompts, outperforming existing methods in text-to-layer alignment.
Details
Motivation: Existing media design generation methods have limitations: they either restrict outputs to fixed layer counts or require each layer to contain only spatially continuous regions, causing layer count to scale linearly with design complexity. There's a need for a system that can generate flexible numbers of semantically meaningful layers for fully editable design documents.
Method: LaDe combines three components: 1) LLM-based prompt expander that transforms user intent into structured per-layer descriptions, 2) Latent Diffusion Transformer with 4D RoPE positional encoding that jointly generates the full media design and constituent RGBA layers, and 3) RGBA VAE that decodes each layer with full alpha-channel support. The framework supports text-to-image, text-to-layers generation, and media design decomposition.
Result: LaDe outperforms Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. It improves text-to-layer alignment as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).
Conclusion: LaDe provides a unified framework for generating editable layered media designs with flexible layer counts, addressing limitations of existing methods and demonstrating superior performance in text-to-layer alignment tasks.
Abstract: Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).
[255] Soft Dice Confidence: A Near-Optimal Confidence Estimator for Selective Prediction in Semantic Segmentation
Bruno Laboissiere Camargos Borges, Bruno Machado Pacheco, Danilo Silva
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2402.10665 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[256] Robust-ComBat: Mitigating Outlier Effects in Diffusion MRI Data Harmonization
Yoan David, Pierre-Marc Jodoin, Alzheimer’s Disease Neuroimaging Initiative, The TRACK-TBI Investigators
Main category: cs.CV
TL;DR: Robust-ComBat: A method using an MLP for outlier compensation in diffusion MRI harmonization that preserves disease-related signals while mitigating site-specific biases, outperforming conventional statistical baselines in multi-site cohorts with high pathology prevalence.
Details
Motivation: Current harmonization methods like ComBat assume Gaussian distributions but fail when pathological outliers from neurological disorders distort site-effect estimation. Clinical practice often involves patients with undiagnosed conditions that cannot be excluded from harmonization cohorts, creating a need for robust methods that handle pathology while preserving disease signals.
Method: Proposes Robust-ComBat, which uses a simple Multi-Layer Perceptron (MLP) for outlier compensation. Evaluated 10 outlier rejection methods with 4 ComBat variants across 7 neurological conditions. The MLP approach provides robust outlier compensation, enabling reliable harmonization while preserving disease-related signal.
Result: Experiments on control and real multi-site cohorts (comprising up to 80% of subjects with neurological disorders) show Robust-ComBat consistently outperforms conventional statistical baselines with lower harmonization error across all ComBat variants. Many filtering strategies fail in the presence of pathology, but the MLP-based approach succeeds.
Conclusion: Robust-ComBat provides a practical solution for diffusion MRI harmonization in clinical settings where pathological cases are prevalent, enabling reliable harmonization while preserving disease-related biological signals that are crucial for diagnosis and research.
Abstract: Harmonization methods such as ComBat and its variants are widely used to mitigate diffusion MRI (dMRI) site-specific biases. However, ComBat assumes that subject distributions exhibit a Gaussian profile. In practice, patients with neurological disorders often present diffusion metrics that deviate markedly from those of healthy controls, introducing pathological outliers that distort site-effect estimation. This problem is particularly challenging in clinical practice as most patients undergoing brain imaging have an underlying and yet undiagnosed condition, making it difficult to exclude them from harmonization cohorts, as their scans were precisely prescribed to establish a diagnosis. In this paper, we show that harmonizing data to a normative reference population with ComBat while including pathological cases induces significant distortions. Across 7 neurological conditions, we evaluated 10 outlier rejection methods with 4 ComBat variants over a wide range of scenarios, revealing that many filtering strategies fail in the presence of pathology. In contrast, a simple MLP provides robust outlier compensation enabling reliable harmonization while preserving disease-related signal. Experiments on both control and real multi-site cohorts, comprising up to 80% of subjects with neurological disorders, demonstrate that Robust-ComBat consistently outperforms conventional statistical baselines with lower harmonization error across all ComBat variants.
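For context, the location/scale adjustment at the heart of ComBat-style harmonization is sketched below (no covariates, no empirical Bayes shrinkage, and none of the paper's MLP outlier compensation). It also makes the failure mode plain: pathological outliers inflate the per-site mean and standard deviation that the correction depends on, which is exactly what Robust-ComBat is designed to withstand.

```python
import numpy as np

def location_scale_harmonize(X, site):
    """Minimal ComBat-like harmonization: map each site's features onto the
    pooled mean/std. X: (subjects, features); site: (subjects,) site labels.
    Outliers contaminate X[m].mean(0) and X[m].std(0), distorting the mapping.
    """
    X, site = np.asarray(X, float), np.asarray(site)
    mu, sd = X.mean(axis=0), X.std(axis=0)      # pooled reference statistics
    out = np.empty_like(X)
    for s in np.unique(site):
        m = site == s
        out[m] = (X[m] - X[m].mean(axis=0)) / (X[m].std(axis=0) + 1e-8) * sd + mu
    return out
```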
[257] AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors
Aymen Mir, Riza Alp Guler, Xiangjun Tang, Peter Wonka, Gerard Pons-Moll
Main category: cs.CV
TL;DR: AHOY reconstructs complete, animatable 3D Gaussian avatars from occluded monocular video using hallucination supervision and a two-stage architecture.
Details
Motivation: Existing 3D avatar reconstruction methods require unoccluded input, excluding most real-world footage where people are occluded by objects or other people. This creates fundamental challenges when large body regions are never observed and multi-view supervision is unavailable.
Method: Four key contributions: (1) hallucination-as-supervision pipeline using identity-finetuned diffusion models to generate supervision for unobserved body regions; (2) two-stage canonical-to-pose-dependent architecture bootstrapping from sparse observations; (3) map-pose/LBS-pose decoupling to absorb multi-view inconsistencies; (4) head/body split supervision to preserve facial identity.
Result: State-of-the-art reconstruction quality on YouTube videos and multi-view capture data with significant occlusion. The resulting avatars are robust enough to be animated with novel poses and composited into 3DGS scenes captured using cell-phone video.
Conclusion: AHOY successfully addresses the challenge of reconstructing complete, animatable 3D avatars from heavily occluded monocular video, enabling reconstruction from real-world footage where subjects are routinely occluded.
Abstract: We present AHOY, a method for reconstructing complete, animatable 3D Gaussian avatars from in-the-wild monocular video despite heavy occlusion. Existing methods assume unoccluded input (a fully visible subject, often in a canonical pose), excluding the vast majority of real-world footage where people are routinely occluded by furniture, objects, or other people. Reconstructing from such footage poses fundamental challenges: large body regions may never be observed, and multi-view supervision per pose is unavailable. We address these challenges with four contributions: (i) a hallucination-as-supervision pipeline that uses identity-finetuned diffusion models to generate dense supervision for previously unobserved body regions; (ii) a two-stage canonical-to-pose-dependent architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps; (iii) a map-pose/LBS-pose decoupling that absorbs multi-view inconsistencies from the generated data; (iv) a head/body split supervision strategy that preserves facial identity. We evaluate on YouTube videos and on multi-view capture data with significant occlusion and demonstrate state-of-the-art reconstruction quality. We also demonstrate that the resulting avatars are robust enough to be animated with novel poses and composited into 3DGS scenes captured using cell-phone video. Our project page is available at https://miraymen.github.io/ahoy/
[258] From Geometric Mimicry to Comprehensive Generation: A Context-Informed Multimodal Diffusion Model for Urban Morphology Synthesis
Fangshuo Zhou, Huaxia Li, Liuchang Xu, Rui Hu, Sensen Wu, Liang Xu, Hailin Feng, Zhenhong Du
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2409.17049 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[259] AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception
Jinho Park, Se Young Chun, Mingoo Seok
Main category: cs.CV
TL;DR: Proposes adaptive radar data compression using gradient descent on detection confidence to dynamically adjust compression ratio, achieving 100x reduction with minimal performance drop.
Details
Motivation: Radar data is crucial for autonomous driving, but its high-dimensional nature saturates communication links; existing image compression methods are unsuitable as they operate at fixed ratios and fail to adapt to varying conditions.
Method: Adaptive feedback compression that adjusts the ratio by gradient descent on a proxy gradient of detection confidence with respect to the compression rate, employing zeroth-order gradient approximation for the non-differentiable pruning/quantization operations. Applies DCT to radar cubes, selectively prunes coefficients, and preserves dynamic range through scaled quantization.
Result: Achieves over 100x feature size reduction with minimal performance drop (~1%p) on RADIal, CARRADA, and Radatron datasets.
Conclusion: Proposed online adaptive compression scheme effectively addresses radar data bandwidth constraints while maintaining detection performance through intelligent compression techniques.
Abstract: Radar is a critical perception modality in autonomous driving systems due to its all-weather characteristics and ability to measure range and Doppler velocity. However, the sheer volume of high-dimensional raw radar data saturates the communication link to the computing engine (e.g., an NPU), which is often a low-bandwidth interface with data rate provisioned only for a few low-resolution range-Doppler frames. A generalized codec for utilizing high-dimensional radar data is notably absent, while existing image-domain approaches are unsuitable, as they typically operate at fixed compression ratios and fail to adapt to varying or adversarial conditions. In light of this, we propose radar data compression with adaptive feedback. It dynamically adjusts the compression ratio by performing gradient descent from the proxy gradient of detection confidence with respect to the compression rate. We employ a zeroth-order gradient approximation as it enables gradient computation even with non-differentiable core operations–pruning and quantization. This also avoids transmitting the gradient tensors over the band-limited link, which, if estimated, would be as large as the original radar data. In addition, we have found that radar feature maps are heavily concentrated on a few frequency components. Thus, we apply the discrete cosine transform to the radar data cubes and selectively prune out the coefficients effectively. We preserve the dynamic range of each radar patch through scaled quantization. Combining those techniques, our proposed online adaptive compression scheme achieves over 100x feature size reduction at minimal performance drop (~1%p). We validate our results on the RADIal, CARRADA, and Radatron datasets.
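The adaptive feedback loop can be sketched as a single zeroth-order update on the kept-coefficient ratio: estimate the (non-differentiable) derivative of detection confidence with respect to the rate by central differences, then trade confidence against bandwidth. The names and the confidence-bandwidth trade-off weight below are illustrative, not the paper's exact update rule.

```python
import numpy as np

def zo_rate_step(rate, confidence_fn, lam=0.1, lr=0.05, eps=0.02):
    """One zeroth-order update of the kept-coefficient ratio `rate` in (0, 1].

    confidence_fn(r) prunes/quantizes at ratio r, runs the detector, and returns
    a confidence score; pruning and quantization are non-differentiable, so the
    derivative is approximated by central differences.
    """
    hi, lo = min(rate + eps, 1.0), max(rate - eps, eps)
    g_conf = (confidence_fn(hi) - confidence_fn(lo)) / (hi - lo)
    g = g_conf - lam  # lam trades detection confidence against link bandwidth
    return float(np.clip(rate + lr * g, eps, 1.0))
```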
[260] Den-TP: A Density-Balanced Data Curation and Evaluation Framework for Trajectory Prediction
Ruining Yang, Yi Xu, Yun Fu, Lili Su
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2409.17385 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[261] Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
Shuyao Shi, Kang G. Shin
Main category: cs.CV
TL;DR: Motion-MLLM enhances multimodal LLMs with IMU-based egomotion data for 3D spatial reasoning, using motion-visual keyframe filtering and cross-modal fusion to improve accuracy with less computational overhead.
Details
Motivation: Current MLLMs for 3D spatial reasoning rely on expensive 3D representations (point clouds, BEV maps) or lack physical grounding for scale/size ambiguities. The paper aims to enhance MLLMs with egomotion data from IMUs to provide physical grounding and reduce computational costs.
Method: Proposes Motion-MLLM with two key components: 1) Cascaded motion-visual keyframe filtering that uses IMU data and visual features to select sparse representative keyframes, and 2) Asymmetric cross-modal fusion where motion tokens channel egomotion cues and cross-frame visual context into visual representations.
Result: Motion-MLLM achieves significant improvements in 3D scene understanding and spatial reasoning tasks. Compared to SOTA methods using video frames and explicit 3D data, it shows similar or higher accuracy with significantly less overhead (1.40× and 1.63× higher cost-effectiveness).
Conclusion: Incorporating egomotion data from IMUs into MLLMs provides physical grounding for scale and spatial relationships, enabling more efficient and accurate 3D spatial reasoning without expensive 3D representations.
Abstract: Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird’s-Eye View (BEV) maps, or lack physical grounding to resolve ambiguities in scale and size. This paper significantly enhances MLLMs with egomotion modality data, captured by Inertial Measurement Units (IMUs) concurrently with the video. In particular, we propose a novel framework, called Motion-MLLM, introducing two key components: (1) a cascaded motion-visual keyframe filtering module that leverages both IMU data and visual features to efficiently select a sparse yet representative set of keyframes, and (2) an asymmetric cross-modal fusion module where motion tokens serve as intermediaries that channel egomotion cues and cross-frame visual context into the visual representation. By grounding visual content in physical egomotion trajectories, Motion-MLLM can reason about absolute scale and spatial relationships across the scene. Our extensive evaluation shows that Motion-MLLM makes significant improvements in various tasks related to 3D scene understanding and spatial reasoning. Compared to state-of-the-art (SOTA) methods based on video frames and explicit 3D data, Motion-MLLM exhibits similar or even higher accuracy with significantly less overhead (i.e., 1.40$\times$ and 1.63$\times$ higher cost-effectiveness, respectively).
[262] Efficient Diffusion as Low Light Enhancer
Guanzhou Lan, Qianli Ma, Yuqi Yang, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2410.12346 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[263] Versatile Editing of Video Content, Actions, and Dynamics without Training
Vladimir Kulikov, Roni Paiss, Andrey Voynov, Inbar Mosseri, Tali Dekel, Tomer Michaeli
Main category: cs.CV
TL;DR: DynaEdit is a training-free video editing method using pretrained text-to-video flow models that enables complex video edits including action modification, object insertion with scene interaction, and global effects.
Details
Motivation: Existing video editing methods struggle with complex edits involving action modification, dynamic events, or inserting objects that affect other objects' behaviors. Current approaches either require extensive training data or are limited to structure- and motion-preserving edits.
Method: Uses an inversion-free approach with pretrained text-to-video flow models (model-agnostic). Introduces novel mechanisms to overcome the low-frequency misalignment and high-frequency jitter that arise when adapting inversion-free methods to unconstrained editing.
Result: Achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.
Conclusion: DynaEdit enables versatile video editing capabilities without additional training, overcoming limitations of existing methods for complex edits involving motion modification and object interactions.
Abstract: Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting contents that should affect the behaviors of other objects in real-world videos, remains a major challenge. Existing trained models struggle with complex edits, likely due to the difficulty of collecting relevant training data. Similarly, existing training-free methods are inherently restricted to structure- and motion-preserving edits and do not support modification of motion or interactions. Here, we introduce DynaEdit, a training-free editing method that unlocks versatile video editing capabilities with pretrained text-to-video flow models. Our method relies on the recently introduced inversion-free approach, which does not intervene in the model internals, and is thus model-agnostic. We show that naively attempting to adapt this approach to general unconstrained editing results in severe low-frequency misalignment and high-frequency jitter. We explain the sources for these phenomena and introduce novel mechanisms for overcoming them. Through extensive experiments, we show that DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.
[264] GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes
Huajian Zeng, Abhishek Saroha, Daniel Cremers, Xi Wang
Main category: cs.CV
TL;DR: GMT is a multimodal transformer framework that generates realistic 6-DOF object manipulation trajectories by leveraging 3D geometry, point clouds, semantics, and target poses.
Details
Motivation: Existing approaches for object manipulation trajectory synthesis rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision, which makes accurate spatial reasoning and physical feasibility challenging.
Method: GMT uses a multimodal transformer framework that represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric (3D bounding boxes, point clouds), semantic (object categories), contextual, and goal-oriented (target end poses) information.
Result: GMT outperforms state-of-the-art baselines like CHOIS and GIMO on synthetic and real-world benchmarks, achieving substantial gains in spatial accuracy and orientation control, and shows strong generalization to diverse objects and cluttered 3D environments.
Conclusion: GMT establishes a new benchmark for learning-based manipulation planning by effectively integrating multimodal 3D scene understanding for generating physically feasible and goal-directed object manipulation trajectories.
Abstract: Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision. We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information. Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learning-based manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments. Project page: https://huajian-zeng.github.io/projects/gmt/
[265] The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering
Yigit Ekin, Yossi Gandelsman
Main category: cs.CV
TL;DR: Training-free framework for continuous controllable image editing via text-embedding steering using LLM-generated contrastive prompts and elastic range search for smooth edits.
Details
Motivation: Existing methods for continuous image editing require additional training or manual intervention; a simple, training-free approach is needed that provides smooth, controllable edits in text-conditioned generative models.
Method: Uses an LLM to generate debiased contrastive prompt pairs, computes a steering vector in the text-encoder space, and adds it to the input prompt representation. An elastic range search finds the effective steering-magnitude interval for continuous control. Works across image and video generation.
Result: The method produces smooth, continuous edits comparable to training-based alternatives and outperforms other training-free methods. Introduces a new evaluation metric for steering continuity.
Conclusion: Simple text-embedding steering with automatic prompt generation and range search enables effective continuous editing without training, generalizing across modalities.
Abstract: We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that a simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator’s text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain a continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no-edit) and over-steering (changing other attributes). Adding the scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify the steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. We compare the continuous editing behavior across methods and find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives, outperforming other training-free methods.
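The mechanism is compact enough to sketch directly. Assuming `encode` stands in for the generator's text encoder and the prompt pairs come from the LLM step described above, the steering vector is just a difference of mean embeddings, and continuous control comes from sweeping its scale inside the interval found by the elastic range search (not shown):

```python
import torch

@torch.no_grad()
def steering_vector(encode, concept_prompts, contrast_prompts):
    """Difference of mean text embeddings between a concept and its debiased
    contrastive counterparts; `encode` maps a string to an embedding tensor."""
    pos = torch.stack([encode(p) for p in concept_prompts]).mean(dim=0)
    neg = torch.stack([encode(p) for p in contrast_prompts]).mean(dim=0)
    return pos - neg

# Continuous edit strength: z_edit = encode(user_prompt) + s * v,
# with s swept over [s_min, s_max] from the elastic range search.
```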
[266] EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding
Kai Zou, Hongbo Liu, Dian Zheng, Jianxiong Gao, Zhiwei Zhao, Bin Liu
Main category: cs.CV
TL;DR: EchoGen is a unified framework for layout-to-image generation and image grounding that uses progressive training strategies to overcome optimization challenges and achieve synergistic performance gains between the two tasks.
Details
Motivation: The authors aim to create a unified model that combines layout-to-image generation and image grounding, recognizing that these tasks have complementary strengths: image grounding provides strong text/layout understanding that can benefit generation, while generated images offer diverse content that can improve grounding robustness. However, joint training faces optimization challenges.
Method: Three-stage progressive training: 1) Parallel Multi-Task Pre-training (PMTP) for basic abilities using shared tokens, 2) Dual Joint Optimization (DJO) that sequentially integrates tasks for unified optimization, and 3) Cycle RL stage using consistency constraints as rewards via GRPO strategy to eliminate visual supervision dependency.
Result: State-of-the-art results on both layout-to-image generation and image grounding benchmarks, with clear synergistic gains from joint optimization of the two tasks.
Conclusion: The unified framework successfully demonstrates that layout-to-image generation and image grounding can mutually benefit each other through progressive training strategies, achieving superior performance on both tasks compared to separate approaches.
Abstract: In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that image grounding possesses strong text and layout understanding abilities, which can compensate for the corresponding limitations in layout-to-image generation. At the same time, images generated from layouts exhibit high diversity in content, thereby enhancing the robustness of image grounding. Jointly training both tasks within a unified model can promote performance improvements for each. However, we identify that this joint training paradigm encounters several optimization challenges and results in restricted performance. To address these issues, we propose progressive training strategies. First, the Parallel Multi-Task Pre-training (PMTP) stage equips the model with basic abilities for both tasks, leveraging shared tokens to accelerate training. Next, the Dual Joint Optimization (DJO) stage exploits task duality to sequentially integrate the two tasks, enabling unified optimization. Finally, the Cycle RL stage eliminates reliance on visual supervision by using consistency constraints as rewards, significantly enhancing the model’s unified capabilities via the GRPO strategy. Extensive experiments demonstrate state-of-the-art results on both layout-to-image generation and image grounding benchmarks, and reveal clear synergistic gains from optimizing the two tasks together.
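A minimal sketch of the cycle-consistency reward behind the Cycle RL stage, under the assumption that agreement is scored as layout overlap (the paper's exact reward shaping and the GRPO advantage computation are not reproduced): generate from a layout, ground the generated image back to a layout, and reward the agreement, so no visual ground truth is required.

```python
def cycle_reward(generate, ground, layout, match_score):
    """generate: layout -> image; ground: image -> predicted layout;
    match_score: (layout, layout) -> float, e.g. mean IoU over matched boxes.
    All four callables are illustrative stand-ins for the model's two branches
    and a box-matching metric."""
    image = generate(layout)          # layout-to-image direction
    layout_hat = ground(image)        # grounding direction closes the cycle
    return match_score(layout, layout_hat)
```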
[267] Universal Skeleton Understanding via Differentiable Rendering and MLLMs
Ziyi Wang, Peiming Li, Xinshun Wang, Yang Tang, Kai-Kuang Ma, Mengyuan Liu
Main category: cs.CV
TL;DR: SkeletonLLM enables multimodal LLMs to understand human skeleton data by converting skeleton sequences into visual representations that MLLMs can process, achieving strong performance on diverse skeleton understanding tasks.
Details
Motivation: Current MLLMs cannot directly process structured non-visual data like human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors or quantize motion into discrete tokens that generalize poorly across different skeleton formats.
Method: SkeletonLLM uses DrAction, a differentiable format-agnostic renderer that converts skeletal kinematics into compact image sequences. It employs cooperative training with Causal Reasoning Distillation (transferring step-by-step reasoning from a teacher model) and Discriminative Finetuning (sharpening decision boundaries).
Result: SkeletonLLM demonstrates strong generalization on diverse tasks including recognition, captioning, reasoning, and cross-format transfer, suggesting a viable path for applying MLLMs to non-native modalities.
Conclusion: The approach enables MLLMs to understand arbitrary skeleton sequences by translating them into the visual modality, opening new possibilities for applying visual-language models to structured non-visual data.
Abstract: Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment, or quantize motion into discrete tokens that generalize poorly across heterogeneous skeleton formats. We present SkeletonLLM, which achieves universal skeleton understanding by translating arbitrary skeleton sequences into the MLLM’s native visual modality. At its core is DrAction, a differentiable, format-agnostic renderer that converts skeletal kinematics into compact image sequences. Because the pipeline is end-to-end differentiable, MLLM gradients can directly guide the rendering to produce task-informative visual tokens. To further enhance reasoning capabilities, we introduce a cooperative training strategy: Causal Reasoning Distillation transfers structured, step-by-step reasoning from a teacher model, while Discriminative Finetuning sharpens decision boundaries between confusable actions. SkeletonLLM demonstrates strong generalization on diverse tasks including recognition, captioning, reasoning, and cross-format transfer – suggesting a viable path for applying MLLMs to non-native modalities. Code will be released upon acceptance.
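DrAction's code is not yet released, but the core idea of a differentiable skeleton renderer can be illustrated with a toy stand-in: splat a Gaussian blob at each 2D joint position so the resulting image is differentiable with respect to the joint coordinates, letting gradients from the MLLM guide the rendering as described above.

```python
import torch

def render_joints(joints_2d: torch.Tensor, H: int = 64, W: int = 64, sigma: float = 2.0):
    """Splat each (x, y) joint as a Gaussian blob; differentiable w.r.t. joints_2d.

    joints_2d: (J, 2) float tensor of pixel coordinates. Returns an (H, W) image.
    """
    ys = torch.arange(H, dtype=joints_2d.dtype).view(H, 1, 1)
    xs = torch.arange(W, dtype=joints_2d.dtype).view(1, W, 1)
    jx = joints_2d[:, 0].view(1, 1, -1)
    jy = joints_2d[:, 1].view(1, 1, -1)
    d2 = (xs - jx) ** 2 + (ys - jy) ** 2      # (H, W, J) squared distances
    return torch.exp(-d2 / (2 * sigma ** 2)).sum(dim=-1).clamp(max=1.0)
```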
[268] Frequency Autoregressive Image Generation with Continuous Tokens
Hu Yu, Hao Luo, Hangjie Yuan, Yu Rong, Jie Huang, Feng Zhao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2503.05305 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[269] Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
Xiang Li, Zhangchi Hu, Xiao Xu, Bin Kong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2507.16861 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[270] The MCC approaches the geometric mean of precision and recall as true negatives approach infinity
Jon Crall
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2305.00594 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
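Although the summary fetch failed, the claim in the title can be checked directly from the definition of the Matthews correlation coefficient: with TP, FP, and FN held fixed, the TN factors dominate as TN grows, and the limit is the geometric mean of precision P and recall R (the Fowlkes-Mallows index):

```latex
\mathrm{MCC}
= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
\;\xrightarrow{\;TN \to \infty\;}\;
\frac{TP}{\sqrt{(TP+FP)\,(TP+FN)}}
= \sqrt{\frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}
= \sqrt{P \cdot R}
```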
[271] Learning from Oblivion: Predicting Knowledge Overflowed Weights via Retrodiction of Forgetting
Jinhyeok Jang, Jaehong Kim, Jung Uk Kim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2508.05059 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[272] MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models
Corentin Royer, Bjoern Menze, Anjany Sekuboyina
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2402.09262 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[273] Mamba2D: A Natively Multi-Dimensional State-Space Model for Vision Tasks
Enis Baty, Alejandro Hernández Díaz, Rebecca Davidson, Chris Bridges, Simon Hadfield
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2412.16146 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[274] Test-Time 3D Occupancy Prediction
Fengyi Zhang, Xiangyu Sun, Huitong Yang, Zheng Zhang, Zi Huang, Yadan Luo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2503.08485 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[275] EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing
Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Jianwen Xie, Oscar Leong, Lijuan Wang, Ying Nian Wu, Mingyuan Zhou
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.13399 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[276] Multi-modal 3D Pose and Shape Estimation with Computed Tomography
Mingxiao Tu, Hoijoon Jung, Alireza Moghadam, Jineel Raythatha, Lachlan Allan, Jeremy Hsu, Andre Kyme, Jinman Kim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2503.19405 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[277] Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos
Jialun Pei, Zhangjun Zhou, Diandian Guo, Zhixi Li, Jing Qin, Bo Du, Pheng-Ann Heng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2503.22174 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[278] M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation
Yiheng Zhang, Zhuojiang Cai, Mingdao Wang, Meitong Guo, Tianxiao Li, Li Lin, Yuwang Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.23728 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[279] High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning
Dailan He, Xiahong Wang, Shulun Wang, Guanglu Song, Bingqi Ma, Hao Shao, Yu Liu, Hongsheng Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2503.22179 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[280] Vector sketch animation generation with differentiable motion trajectories
Xinding Zhu, Xinye Yang, Shuyang Zheng, Zhexin Zhang, Fei Gao, Jing Huang, Jiazhou Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.25857 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[281] Echo Planning for Autonomous Driving: From Current Observations to Future Trajectories and Back
Jintao Sun, Hu Zhang, Gangyi Ding, Zhedong Zheng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2505.18945 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[282] Parameterizing Dataset Distillation via Gaussian Splatting
Chenyang Jiang, Zhengcen Li, Hang Zhao, Qiben Shan, Shaocong Wu, Jingyong Su
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.26219 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[283] Benchmarking Endoscopic Surgical Image Restoration and Beyond
Jialun Pei, Diandian Guo, Donghui Yang, Zhixi Li, Yuxin Feng, Long Ma, Bo Du, Pheng-Ann Heng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2505.19161 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[284] Domain and Task-Focused Example Selection for Data-Efficient Contrastive Medical Image Segmentation
Tyler Ward, Aaron Moseley, Abdullah-Al-Zubaer Imran
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2505.19208 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[285] HyperMotionX: The Dataset and Benchmark with DiT-Based Pose-Guided Human Image Animation of Complex Motions
Shuolin Xu, Siming Zheng, Ziyi Wang, HC Yu, Jinwei Chen, Huaqi Zhang, Daquan Zhou, Tong-Yee Lee, Bo Li, Peng-Tao Jiang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2505.22977 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[286] Automated Wicket-Taking Delivery Segmentation and Trajectory-Based Dismissal-Zone Analysis in Cricket Videos Using OCR-Guided YOLOv8
Joy Karmoker, Masum Billah, Mst Jannatun Ferdous, Akif Islam, Mohd Ruhul Ameen, Md. Omar Faruqe
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.18405 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[287] Mechanistic Interpretability of Diffusion Models: Circuit-Level Analysis and Causal Validation
Dip Roy
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.17237 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[288] Generative Hints
Andy Dimnaku, Abdullah Yusuf Kavranoglu, Yaser Abu-Mostafa
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.02933 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[289] OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction
Bu Jin, Songen Gu, Xiaotao Hu, Yupeng Zheng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, Wei Yin
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.03887 was rate-limited (HTTP 429).
[290] Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Laiyuan Wang, Hua Zhang, Xiaochun Cao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.22496 was rate-limited (HTTP 429).
[291] Learning Goal-Oriented Vision-and-Language Navigation with Self-Improving Demonstrations at Scale
Songze Li, Zun Wang, Gengze Zhou, Jialu Li, Xiangyu Zeng, Ziyang Gong, Limin Wang, Yu Qiao, Qi Wu, Mohit Bansal, Yi Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.24910 was rate-limited (HTTP 429).
[292] Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation
Yara Bahram, Melodie Desbos, Mohammadhadi Shateri, Eric Granger
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.18281 was rate-limited (HTTP 429).
[293] LMOD+: A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology
Zhenyue Qin, Yang Liu, Yu Yin, Jinyu Ding, Haoran Zhang, Anran Li, Dylan Campbell, Xuansheng Wu, Ke Zou, Tiarnan D. L. Keenan, Emily Y. Chew, Zhiyong Lu, Yih Chung Tham, Ninghao Liu, Xiuzhen Zhang, Qingyu Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.25620 was rate-limited (HTTP 429).
[294] CoT-PL: Chain-of-Thought Pseudo-Labeling for Open-Vocabulary Object Detection
Hojun Choi, Youngsun Lim, Jaeyo Shin, Hyunjung Shim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.14792 was rate-limited (HTTP 429).
[295] YOLO26: An Analysis of NMS-Free End to End Framework for Real-Time Object Detection
Sudip Chakrabarty
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.12882 was rate-limited (HTTP 429).
[296] Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery
Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, Yu-Lun Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.15869 was rate-limited (HTTP 429).
[297] CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models
Donghee Lee, Rui Cai, Zhe Zhao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.13622 was rate-limited (HTTP 429).
[298] Towards One-step Causal Video Generation via Adversarial Self-Distillation
Yongqi Yang, Huayang Huang, Xu Peng, Xiaobin Hu, Donghao Luo, Jiangning Zhang, Chengjie Wang, Yu Wu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.01419 was rate-limited (HTTP 429).
[299] Semi-supervised Shelter Mapping for WASH Accessibility Assessment in Rohingya Refugee Camps
Kyeongjin Ahn, YongHun Suh, Sungwon Han, Jeasurk Yang, Hannes Taubenböck, Meeyoung Cha
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.07231 was rate-limited (HTTP 429).
[300] Unsupervised Decomposition and Recombination with Discriminator-Driven Diffusion Models
Archer Wang, Emile Anand, Yilun Du, Marin Soljačić
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.22057 was rate-limited (HTTP 429).
[301] Draft and Refine with Visual Experts
Sungheon Jeong, Ryozo Masukawa, Jihong Park, Sanggeon Yun, Wenjun Huang, Hanning Chen, Mahdi Imani, Mohsen Imani
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.11005 was rate-limited (HTTP 429).
[302] DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture
Xiangteng He, Shunsuke Sakai, Shivam Chandhok, Sara Beery, Kun Yuan, Nicolas Padoy, Tatsuhito Hasegawa, Leonid Sigal
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.17354 was rate-limited (HTTP 429).
[303] No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection
Zunkai Dai, Ke Li, Jiajia Liu, Jie Yang, Yuanyuan Qiao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.19248 was rate-limited (HTTP 429).
[304] MagicWorld: Towards Long-Horizon Stability for Interactive Video World Exploration
Guangyuan Li, Bo Li, Jinwei Chen, Xiaobin Hu, Lei Zhao, Peng-Tao Jiang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.18886 was rate-limited (HTTP 429).
[305] WPT: World-to-Policy Transfer via Online World Model Distillation
Guangfeng Jiang, Yueru Luo, Jun Liu, Yi Huang, Yiyao Zhu, Zhan Qu, Dave Zhenyu Chen, Bingbing Liu, Xu Yan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.20095 was rate-limited (HTTP 429).
[306] SimScale: Learning to Drive via Real-World Simulation at Scale
Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, Hangjun Ye, Tieniu Tan, Long Chen, Hongyang Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.23369 was rate-limited (HTTP 429).
[307] TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction
Fengyi Zhang, Tianjun Zhang, Kasra Khosoussi, Zheng Zhang, Zi Huang, Yadan Luo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.02341 was rate-limited (HTTP 429).
[308] Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling
Junhyeong Byeon, Jeongyeol Kim, Sejoon Lim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.11971 was rate-limited (HTTP 429).
[309] Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration
Zhongyi Cai, Yi Du, Chen Wang, Yu Kong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.02458 was rate-limited (HTTP 429).
[310] PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction
Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Amrit Ramesh, Maury Courtland
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.10888 was rate-limited (HTTP 429).
[311] KeyframeFace: Language-Driven Facial Animation via Semantic Keyframes
Jingchao Wu, Zejian Kang, Haibo Liu, Yuanchen Fei, Xiangru Huang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.11321 was rate-limited (HTTP 429).
[312] Spatial Transcriptomics as Images for Large-Scale Pretraining
Yishun Zhu, Jiaxin Qi, Jian Wang, Yuhua Zheng, Jianqiang Huang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.13432 was rate-limited (HTTP 429).
[313] TechImage-Bench: Rubric-Based Evaluation for Technical Image Generation
Minheng Ni, Zhengyuan Yang, Yaowen Zhang, Linjie Li, Chung-Ching Lin, Kevin Lin, Zhendong Wang, Xiaofei Wang, Shujie Liu, Lei Zhang, Wangmeng Zuo, Lijuan Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.12220 was rate-limited (HTTP 429).
[314] Generative Refocusing: Flexible Defocus Control from a Single Image
Chun-Wei Tuan Mu, Cheng-De Fan, Jia-Bin Huang, Yu-Lun Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.16923 was rate-limited (HTTP 429).
[315] Digital FAST: An AI-Driven Multimodal Framework for Rapid and Early Stroke Screening
Ngoc-Khai Hoang, Thi-Nhu-Mai Nguyen, Huy-Hieu Pham
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.11896 was rate-limited (HTTP 429).
[316] Unifying Heterogeneous Degradations: Uncertainty-Aware Diffusion Bridge Model for All-in-One Image Restoration
Luwei Tu, Jiawei Wu, Xing Luo, Zhi Jin
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.21592 was rate-limited (HTTP 429).
[317] Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
Haodong Li, Shaoteng Liu, Zhe Lin, Manmohan Chandraker
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.07775 was rate-limited (HTTP 429).
[318] IRIS-SLAM: Unified Geo-Instance Representations for Robust Semantic Localization and Mapping
Tingyang Xiao, Liu Liu, Wei Feng, Zhengyu Zou, Xiaolin Zhou, Wei Sui, Hao Li, Dingwen Zhang, Zhizhong Su
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.18709 was rate-limited (HTTP 429).
[319] LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency
Weilong Yan, Haipeng Li, Hao Xu, Nianjin Ye, Yihao Ai, Shuaicheng Liu, Jingyu Hu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.18735 was rate-limited (HTTP 429).
[320] VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents
Zhengbo Zhang, Jinbo Su, Zhaowen Zhou, Changtao Miao, Yuhan Hong, Qimeng Wu, Yumeng Liu, Feier Wu, Yihe Tian, Yuhao Liang, Zitong Shan, Wanke Xia, Yi-Fan Zhang, Bo Zhang, Zhe Li, Shiming Xiang, Ying Yan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.16289 was rate-limited (HTTP 429).
[321] SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model
Guibin Chen, Dixuan Lin, Jiangping Yang, Youqiang Zhang, Zhengcong Fei, Debang Li, Sheng Chen, Chaofeng Ao, Nuo Pang, Yiming Wang, Yikun Dou, Zheng Chen, Mingyuan Fan, Tuanhui Li, Mingshan Chang, Hao Zhang, Xiaopeng Sun, Jingtao Xu, Yuqiang Xie, Jiahua Wang, Zhiheng Xu, Weiming Xiong, Yuzhe Jin, Baoxuan Gu, Binjie Mao, Yunjie Yu, Jujie He, Yuhao Feng, Shiwen Tu, Chaojie Wang, Rui Yan, Wei Shen, Jingchen Wu, Peng Zhao, Xuanyue Zhong, Zhuangzhuang Liu, Kaifei Wang, Fuxiang Zhang, Weikai Xu, Wenyan Liu, Binglu Zhang, Yu Shen, Tianhui Xiong, Bin Peng, Liang Zeng, Xuchen Song, Haoxiang Guo, Peiyu Wang, Max W. Y. Lam, Chien-Hung Liu, Yahui Zhou
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.21818 was rate-limited (HTTP 429).
[322] Self-Attention And Beyond the Infinite: Towards Linear Transformers with Infinite Self-Attention
Giorgio Roffo, Luke Palmer, Nilli Lavie
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.00175 was rate-limited (HTTP 429).
[323] Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation
Won Shik Jang, Ue-Hwan Kim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.09506 was rate-limited (HTTP 429).
[324] HyPER-GAN: Hybrid Patch-Based Image-to-Image Translation for Real-Time Photorealism Enhancement
Stefanos Pasios, Nikos Nikolaidis
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.10604 was rate-limited (HTTP 429).
[325] Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers
Wenhao Sun, Ji Li, Zhaoqiang Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.10744 was rate-limited (HTTP 429).
[326] InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction
Dingqiang Ye, Jiacong Xu, Jianglu Ping, Yuxiang Guo, Chao Fan, Vishal M. Patel
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.11298 was rate-limited (HTTP 429).
[327] Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass
Sangmin Kim, Minhyuk Hwang, Geonho Cha, Dongyoon Wee, Jaesik Park
Main category: cs.CV
TL;DR: CHROMM is a unified framework for jointly estimating cameras, scene point clouds, and human meshes from multi-person multi-view videos without external modules or preprocessing.
Details
Motivation: Existing 3D human reconstruction approaches focus on monocular inputs and require additional overhead modules or preprocessed data for multi-view settings, creating inefficiencies.
Method: Integrates geometric and human priors from Pi3X and Multi-HMR into a single trainable network, adds scale adjustment module to resolve human-scene scale discrepancy, uses multi-view fusion strategy, and proposes geometry-based multi-person association.
Result: Achieves competitive performance in global human motion and multi-view pose estimation on EMDB, RICH, EgoHumans, and EgoExo4D datasets while running over 8x faster than prior optimization-based multi-view approaches.
Conclusion: CHROMM provides an efficient unified framework for multi-view 3D human and scene reconstruction without external dependencies, demonstrating strong performance and speed advantages.
Abstract: Recent advances in 3D foundation models have led to growing interest in reconstructing humans and their surrounding environments. However, most existing approaches focus on monocular inputs, and extending them to multi-view settings requires additional overhead modules or preprocessed data. To this end, we present CHROMM, a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person multi-view videos without relying on external modules or preprocessing. We integrate strong geometric and human priors from Pi3X and Multi-HMR into a single trainable neural network architecture, and introduce a scale adjustment module to solve the scale discrepancy between humans and the scene. We also introduce a multi-view fusion strategy to aggregate per-view estimates into a single representation at test-time. Finally, we propose a geometry-based multi-person association method, which is more robust than appearance-based approaches. Experiments on EMDB, RICH, EgoHumans, and EgoExo4D show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Project page: https://nstar1125.github.io/chromm.
[328] Event-Driven Video Generation
Chika Maduabuchi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.13402 was rate-limited (HTTP 429).
[329] Bodhi VLM: Privacy-Alignment Modeling for Hierarchical Visual Representations in Vision Backbones and VLM Encoders via Bottom-Up and Top-Down Feature Search
Bo Ma, Wei Qi Yan, Jinsong Wu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.13728 was rate-limited (HTTP 429).
[330] AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising
Liyuan Cui, Wentao Hu, Wenyuan Zhang, Zesong Yang, Fan Shi, Xiaoqiang Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.14331 was rate-limited (HTTP 429).
[331] ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference
Surendra Pathak, Bo Han
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.14549 was rate-limited (HTTP 429).
[332] SemanticFace: Semantic Facial Action Estimation via Semantic Distillation in Interpretable Space
Zejian Kang, Kai Zheng, Yuanchen Fei, Wentao Yang, Hongyuan Zou, Xiangru Huang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.14827 was rate-limited (HTTP 429).
[333] AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving
Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, Chen Lv
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.14851 was rate-limited (HTTP 429).
[334] F2HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling
Huanjing Yue, Dawei Li, Shaoxiong Tu, Jingyu Yang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.14920 was rate-limited (HTTP 429).
[335] Workflow-Aware Structured Layer Decomposition for Illustration Production
Tianyu Zhang, Dongchi Li, Keiichi Sawada, Haoran Xie
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.14925 was rate-limited (HTTP 429).
[336] Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
Omer Ben Hayun, Roy Betser, Meir Yossef Levi, Levi Kassel, Guy Gilboa
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.15026 was rate-limited (HTTP 429).
[337] A Tutorial on ALOS2 SAR Utilization: Dataset Preparation, Self-Supervised Pretraining, and Semantic Segmentation
Nevrez Imamoglu, Ali Caglayan, Toru Kouyama
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.15119 was rate-limited (HTTP 429).
[338] Next-Frame Decoding for Ultra-Low-Bitrate Image Compression with Video Diffusion Priors
Yunuo Chen, Chuqin Zhou, Jiangchuan Li, Xiaoyue Ling, Bing He, Jincheng Dai, Li Song, Guo Lu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.15129 was rate-limited (HTTP 429).
[339] A Comprehensive Benchmark of Histopathology Foundation Models for Kidney Digital Pathology Images
Harishwar Reddy Kasireddy, Patricio S. La Rosa, Akshita Gupta, Anindya S. Paul, Jamie L. Fermin, William L. Clapp, Meryl A. Waldman, Tarek M. El-Ashkar, Sanjay Jain, Luis Rodrigues, Kuang Yu Jen, Avi Z. Rosenberg, Michael T. Eadon, Jeffrey B. Hodgin, Pinaki Sarder
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.15967 was rate-limited (HTTP 429).
[340] S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight
Haodong Yan, Zhide Zhong, Jiaguan Zhu, Junjie He, Weilin Yuan, Wenxuan Song, Xin Gong, Yingjie Cai, Guanyi Zhao, Xu Yan, Bingbing Liu, Ying-Cong Chen, Haoang Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.16195 was rate-limited (HTTP 429).
[341] Synergizing Deep Learning and Biological Heuristics for Extreme Long-Tail White Blood Cell Classification
Duc T. Nguyen, Hoang-Long Nguyen, Huy-Hieu Pham
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.16249 was rate-limited (HTTP 429).
[342] HGP-Mamba: Integrating Histology and Generated Protein Features for Mamba-based Multimodal Survival Risk Prediction
Jing Dai, Chen Wu, Ming Wu, Qibin Zhang, Zexi Wu, Jingdong Zhang, Hongming Xu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.16421 was rate-limited (HTTP 429).
[343] VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations
Fucai Ke, Zhixi Cai, Boying Li, Long Chen, Beibei Lin, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Hamid Rezatofighi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.16506 was rate-limited (HTTP 429).
[344] Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models
Weijie Qiu, Dai Guan, Junxin Wang, Zhihang Li, Yongbo Gai, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.16600 was rate-limited (HTTP 429).
[345] Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search
Sainan Liu, Tz-Ying Wu, Hector A Valdez, Subarna Tripathi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.16711 was rate-limited (HTTP 429).
[346] World Reconstruction From Inconsistent Views
Lukas Höllein, Matthias Nießner
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.16736 was rate-limited (HTTP 429).
[347] Bundle Adjustment in the Eager Mode
Zitong Zhan, Huan Xu, Zihang Fang, Xinpeng Wei, Yaoyu Hu, Chen Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2409.12190 was rate-limited (HTTP 429).
[348] Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation
Yongjie Bai, Zhouxia Wang, Yang Liu, Kaijun Luo, Yifan Wen, Mingtong Dai, Weixing Chen, Ziliang Chen, Lingbo Liu, Guanbin Li, Liang Lin
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2508.05186 was rate-limited (HTTP 429).
[349] Aion: Towards Hierarchical 4D Scene Graphs with Temporal Flow Dynamics
Iacopo Catalano, Eduardo Montijano, Javier Civera, Julio A. Placed, Jorge Pena-Queralta
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.11903 was rate-limited (HTTP 429).
[350] Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields
Berthy T. Feng, Andrew A. Chael, David Bromley, Aviad Levis, William T. Freeman, Katherine L. Bouman
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.08029 was rate-limited (HTTP 429).
cs.AI
[351] Generative AI-assisted Participatory Modeling in Socio-Environmental Planning under Deep Uncertainty
Zhihao Pei, Nir Lipovetzky, Angela M. Rojas-Arevalo, Fjalar J. de Haan, Enayat A. Moallemi
Main category: cs.AI
TL;DR: LLM-based workflow for participatory modeling in socio-environmental planning, using ChatGPT to translate stakeholder descriptions into quantitative models
Details
Motivation: Participatory modeling in socio-environmental planning requires translating stakeholders' natural-language descriptions into quantitative models, which is complex and time-consuming. There is a need to facilitate this problem conceptualization process.
Method: Proposed a templated workflow using large language models (ChatGPT 5.2 Instant) to: 1) identify essential model components from stakeholders' intuitive problem descriptions, 2) explore diverse perspectives, 3) assemble components into a unified model, and 4) implement the model in Python through iterative communication with human verification.
Result: Demonstrated workflow on lake problem and electricity market problem. Acceptable outputs obtained after few iterations with human verification and refinement. LLMs can effectively facilitate participatory modeling in problem conceptualization.
Conclusion: Large language models can serve as effective tools for facilitating participatory modeling in socio-environmental planning problem conceptualization, streamlining the translation of stakeholder descriptions into quantitative models.
Abstract: Socio-environmental planning under deep uncertainty requires researchers to identify and conceptualize problems before exploring policies and deploying plans. In practice and model-based planning approaches, this problem conceptualization process often relies on participatory modeling to translate stakeholders’ natural-language descriptions into a quantitative model, making this process complex and time-consuming. To facilitate this process, we propose a templated workflow that uses large language models for an initial conceptualization process. During the workflow, researchers can use large language models to identify the essential model components from stakeholders’ intuitive problem descriptions, explore their diverse perspectives approaching the problem, assemble these components into a unified model, and eventually implement the model in Python through iterative communication. These results will facilitate the subsequent socio-environmental planning under deep uncertainty steps. Using ChatGPT 5.2 Instant, we demonstrated this workflow on the lake problem and an electricity market problem, both of which demonstrate socio-environmental planning problems. In both cases, acceptable outputs were obtained after a few iterations with human verification and refinement. These experiments indicated that large language models can serve as an effective tool for facilitating participatory modeling in the problem conceptualization process in socio-environmental planning.
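The lake problem the authors use as a demonstration is a standard benchmark in decision making under deep uncertainty: a town chooses phosphorus releases a_t into a lake whose pollution level X_t follows nonlinear recycling dynamics. A minimal sketch of the commonly used formulation follows; the parameter defaults (b, q, and the lognormal natural inflows) are the benchmark's usual values and are assumptions here, not taken from the paper.

```python
import numpy as np

def lake_problem(releases, b=0.42, q=2.0, mean=0.02, stdev=0.0017, seed=0):
    """Simulate lake phosphorus X_t under a sequence of release decisions a_t.

    Standard benchmark dynamics:
        X_{t+1} = X_t + a_t + X_t^q / (1 + X_t^q) - b * X_t + eps_t,
    where eps_t are lognormal natural inflows.
    """
    rng = np.random.default_rng(seed)
    # Convert the desired mean/stdev of the inflows to lognormal parameters.
    mu = np.log(mean**2 / np.sqrt(stdev**2 + mean**2))
    sigma = np.sqrt(np.log(1 + stdev**2 / mean**2))
    x = 0.0
    trajectory = []
    for a in releases:
        inflow = rng.lognormal(mu, sigma)
        x = x + a + x**q / (1 + x**q) - b * x + inflow
        trajectory.append(x)
    return np.array(trajectory)
```

A workflow like the one described would aim to produce exactly this kind of executable model from stakeholders' prose descriptions, with a human verifying each iteration.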
[352] Transformers are Bayesian Networks
Gregory Coppola
Main category: cs.AI
TL;DR: Transformers are proven to be Bayesian networks implementing weighted loopy belief propagation, with attention as AND operations and FFN as OR operations, forming Pearl’s gather/update algorithm.
Details
Motivation: Despite transformers being the dominant AI architecture, their underlying working principles remain poorly understood. The paper aims to provide a precise theoretical foundation by establishing transformers as Bayesian networks.
Method: The paper establishes its claim in five ways: 1) a proof that sigmoid transformers implement weighted loopy belief propagation on implicit factor graphs, 2) a constructive proof that transformers can implement exact belief propagation on declared knowledge bases, 3) a uniqueness proof that exact posteriors require BP weights, 4) a Boolean structure analysis showing attention as AND and the FFN as OR, and 5) experimental validation of the formal results.
Result: The paper formally proves transformers are Bayesian networks implementing belief propagation, with experimental confirmation. It also establishes that verifiable inference requires finite concept spaces and that hallucination is a structural consequence of operating without concepts.
Conclusion: Transformers are fundamentally Bayesian networks implementing belief propagation, providing a theoretical foundation for understanding their operation. The work reveals that hallucination is not fixable by scaling but is inherent to operating without grounded concepts.
Abstract: Transformers are the dominant architecture in AI, yet why they work remains poorly understood. This paper offers a precise answer: a transformer is a Bayesian network. We establish this in five ways. First, we prove that every sigmoid transformer with any weights implements weighted loopy belief propagation on its implicit factor graph. One layer is one round of BP. This holds for any weights – trained, random, or constructed. Formally verified against standard mathematical axioms. Second, we give a constructive proof that a transformer can implement exact belief propagation on any declared knowledge base. On knowledge bases without circular dependencies this yields provably correct probability estimates at every node. Formally verified against standard mathematical axioms. Third, we prove uniqueness: a sigmoid transformer that produces exact posteriors necessarily has BP weights. There is no other path through the sigmoid architecture to exact posteriors. Formally verified against standard mathematical axioms. Fourth, we delineate the AND/OR boolean structure of the transformer layer: attention is AND, the FFN is OR, and their strict alternation is Pearl’s gather/update algorithm exactly. Fifth, we confirm all formal results experimentally, corroborating the Bayesian network characterization in practice. We also establish the practical viability of loopy belief propagation despite the current lack of a theoretical convergence guarantee. We further prove that verifiable inference requires a finite concept space. Any finite verification procedure can distinguish at most finitely many concepts. Without grounding, correctness is not defined. Hallucination is not a bug that scaling can fix. It is the structural consequence of operating without concepts. Formally verified against standard mathematical axioms.
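For readers who want a concrete reference point for the layer-equals-one-round-of-BP claim, the sketch below is plain sum-product loopy belief propagation on a pairwise factor graph over binary variables; this is the textbook algorithm, not the paper's weighted sigmoid construction.

```python
import numpy as np

def loopy_bp(unary, pairwise, edges, n_rounds=10):
    """Sum-product loopy BP on a pairwise graph with binary variables.

    unary:    dict node -> length-2 potential phi_i(x_i)
    pairwise: dict (i, j) -> 2x2 potential psi_ij(x_i, x_j)
    edges:    list of undirected (i, j) pairs
    Returns per-node marginals after n_rounds synchronous rounds.
    """
    msgs = {}
    for i, j in edges:  # initialize all directed messages to uniform
        msgs[(i, j)] = np.ones(2) / 2
        msgs[(j, i)] = np.ones(2) / 2
    nbrs = {n: set() for n in unary}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    for _ in range(n_rounds):  # one round ~ one "layer" in the paper's analogy
        new = {}
        for (i, j) in msgs:
            psi = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            # Product of the unary potential and all incoming messages except j's.
            belief = unary[i].astype(float).copy()
            for k in nbrs[i] - {j}:
                belief = belief * msgs[(k, i)]
            m = psi.T @ belief           # sum over x_i of psi(x_i, x_j) * belief(x_i)
            new[(i, j)] = m / m.sum()    # normalize for numerical stability
        msgs = new
    marginals = {}
    for n in unary:
        b = unary[n].astype(float).copy()
        for k in nbrs[n]:
            b = b * msgs[(k, n)]
        marginals[n] = b / b.sum()
    return marginals
```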
[353] Cascade-Aware Multi-Agent Routing: Spatio-Temporal Sidecars and Geometry-Switching
Davide Di Gioia
Main category: cs.AI
TL;DR: A lightweight sidecar system that mitigates geometry-blind failure propagation in symbolic graph networks by adaptively selecting between Euclidean and hyperbolic risk models based on graph topology features.
Details
Motivation: Current AI reasoning systems use symbolic graph networks with specialized agents/modules, but their schedulers are geometry-blind: they don't account for how failure propagation differs between tree-like and cyclic graph structures, leading to inefficient failure handling.
Method: Proposes online geometry control with route-risk estimation using: 1) a Euclidean spatio-temporal propagation baseline, 2) a hyperbolic route-risk model with temporal decay, and 3) a learned geometry selector (a 9->12->1 MLP) over topology statistics and geometry-aware signals (BFS shell-growth slope, cycle-rank norm, Poincaré curvature).
Result: Adaptive switching improves win rate in hardest non-tree regimes from 64-72% to 92%, achieving 87.2% overall win rate vs 50.4% baseline. Tree-like regimes show +48 to +68 percentage point gains. A 133-parameter sidecar substantially mitigates geometry-blind failure propagation.
Conclusion: Lightweight geometry-aware sidecar systems can effectively mitigate failure propagation issues in symbolic graph networks by adaptively selecting appropriate risk models based on graph structure, significantly improving system performance.
Abstract: A common architectural pattern in advanced AI reasoning systems is the symbolic graph network: specialized agents or modules connected by delegation edges, routing tasks through a dynamic execution graph. Current schedulers optimize load and fitness but are geometry-blind: they do not model how failures propagate differently in tree-like versus cyclic regimes. In tree-like delegation, a single failure can cascade exponentially; in dense cyclic graphs, failures tend to self-limit. We identify this observability gap, quantify its system-level cost, and propose a lightweight mitigation. We formulate online geometry control for route-risk estimation on time-indexed execution graphs with route-local failure history. Our approach combines (i) a Euclidean spatio-temporal propagation baseline, (ii) a hyperbolic route-risk model with temporal decay (and optional burst excitation), and (iii) a learned geometry selector over structural features. The selector is a compact MLP (9->12->1) using six topology statistics plus three geometry-aware signals: BFS shell-growth slope, cycle-rank norm, and fitted Poincaré curvature. On the Genesis 3 benchmark distribution, adaptive switching improves win rate in the hardest non_tree regime from 64-72% (fixed hyperbolic variants) to 92%, and achieves 87.2% overall win rate. To measure total system value, we compare against Genesis 3 routing without any spatio-temporal sidecar, using only native bandit/LinUCB signals (team fitness and mean node load). This baseline achieves 50.4% win rate overall and 20% in tree-like regimes; the full sidecar recovers 87.2% overall (+36.8 pp), with +48 to +68 pp gains in tree-like settings, consistent with a cascade-sensitivity analysis. Overall, a 133-parameter sidecar substantially mitigates geometry-blind failure propagation in one high-capability execution-graph system.
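As a sanity check on the reported size, here is a toy selector with the stated 9->12->1 shape; the tanh/sigmoid activations and random weights are assumptions, but the parameter count works out to exactly the 133 quoted in the abstract (9x12 + 12 + 12x1 + 1).

```python
import numpy as np

# Toy geometry selector with the stated 9 -> 12 -> 1 shape. Feature
# extraction (six topology statistics plus three geometry-aware signals)
# is not reproduced here.
class GeometrySelector:
    def __init__(self, rng):
        self.W1 = rng.normal(size=(12, 9))   # 108 weights
        self.b1 = np.zeros(12)               #  12 biases
        self.W2 = rng.normal(size=(1, 12))   #  12 weights
        self.b2 = np.zeros(1)                #   1 bias   -> 133 parameters

    def __call__(self, feats):  # feats: the 9 structural features
        h = np.tanh(self.W1 @ feats + self.b1)
        return 1.0 / (1.0 + np.exp(-(self.W2 @ h + self.b2)))  # P(hyperbolic)

sel = GeometrySelector(np.random.default_rng(0))
n_params = sum(a.size for a in (sel.W1, sel.b1, sel.W2, sel.b2))
assert n_params == 133                 # matches the "133-parameter sidecar"
print(n_params, sel(np.zeros(9))[0])   # 133, 0.5 for all-zero features
```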
[354] How Clued up are LLMs? Evaluating Multi-Step Deductive Reasoning in a Text-Based Game Environment
Rebecca Ansell, Autumn Toney-Wails
Main category: cs.AI
TL;DR: Multi-agent Clue game testbed evaluates LLM deductive reasoning, finding poor performance and limited transfer from logic puzzle fine-tuning.
Details
Motivation: To create a rule-based testbed for evaluating multi-step deductive reasoning in LLM agents using the classic board game Clue, and to investigate whether fine-tuning on structured logic puzzles improves in-game reasoning.
Method: Implemented a text-based multi-agent version of Clue with six agents (GPT-4o-mini and Gemini-2.5-Flash), ran 18 simulated games, and investigated whether fine-tuning on structured logic puzzles transfers to gameplay.
Result: Agents achieved only 4 correct wins out of 18 games, showing difficulty in maintaining consistent deductive reasoning. Fine-tuning did not reliably improve performance and sometimes increased reasoning volume without improving precision.
Conclusion: Current LLM agents struggle with sustained deductive reasoning in complex multi-step games, and fine-tuning on logic puzzles doesn’t reliably transfer to improved gameplay performance.
Abstract: Deducing whodunit proves challenging for LLM agents. In this paper, we implement a text-based multi-agent version of the classic board game Clue as a rule-based testbed for evaluating multi-step deductive reasoning, with six agents drawn from GPT-4o-mini and Gemini-2.5-Flash. We further investigate whether fine-tuning on structured logic puzzles transfers to improved in-game reasoning and gameplay. Across 18 simulated games, agents achieve only four correct wins, indicating difficulty in maintaining consistent deductive reasoning over the course of a full game. Additionally, we find that fine-tuning does not reliably improve performance and, in some cases, appears to increase reasoning volume without improving reasoning precision.
[355] AI Scientist via Synthetic Task Scaling
Ziyang Cai, Harkirat Behl
Main category: cs.AI
TL;DR: A pipeline for generating synthetic machine learning tasks to train AI agents, verified against real datasets and used to improve performance on ML benchmarks.
Details
Motivation: To enable training of AI agents for scientific discovery by creating synthetic environments that address the problem of LLMs generating plausible but ineffective ideas, providing a principled way to train agents that learn from doing.
Method: Developed an automatic synthetic-environment generation pipeline that creates ML challenges compatible with the SWE-agent framework, including topic sampling, dataset proposal, and code generation, with verification against the Huggingface API and self-debugging loops for quality assurance.
Result: Student models (Qwen3-4B and Qwen3-8B) trained on synthetic tasks from teacher model (GPT-5) showed improved performance on MLGym benchmark, raising AUP metric by 9% for Qwen3-4B and 12% for Qwen3-8B.
Conclusion: The synthetic task generation pipeline effectively trains ML agents, demonstrating that synthetic environments can improve agent performance on real ML benchmarks.
Abstract: With the advent of AI agents, automatic scientific discovery has become a tenable goal. Many recent works scaffold agentic systems that can perform machine learning research, but don’t offer a principled way to train such agents – and current LLMs often generate plausible-looking but ineffective ideas. To make progress on training agents that can learn from doing, we provide a novel synthetic environment generation pipeline targeting machine learning agents. Our pipeline automatically synthesizes machine learning challenges compatible with the SWE-agent framework, covering topic sampling, dataset proposal, and code generation. The resulting synthetic tasks are 1) grounded in real machine learning datasets, since the proposed datasets are verified against the Huggingface API, and 2) checked for quality with a self-debugging loop. To validate the effectiveness of our synthetic tasks, we tackle MLGym, a benchmark for machine learning tasks. From the synthetic tasks, we sample trajectories from a teacher model (GPT-5), then use the trajectories to train student models (Qwen3-4B and Qwen3-8B). The student models trained with our synthetic tasks achieve improved performance on MLGym, raising the AUP metric by 9% for Qwen3-4B and 12% for Qwen3-8B.
[356] Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning
Zhiyu Ni, Zheng Liang, Liangcheng Song, Chenrui Cao, Xian Zhang, Alberto Sangiovanni-Vincentelli, Pierluigi Nuzzo
Main category: cs.AI
TL;DR: Draft-and-Prune (D&P) framework improves auto-formalization for logical reasoning by generating diverse natural-language plans, pruning contradictory formalizations, and aggregating predictions via majority voting.
Details
Motivation: Current auto-formalization pipelines are brittle - programs may fail to execute or encode incorrect semantics. While prior work addresses syntactic failures via solver feedback, reducing semantic failures remains a major bottleneck for reliable logical reasoning.
Method: Draft-and-Prune (D&P) framework: 1) drafts multiple natural-language plans and conditions program generation on them, 2) prunes executable but contradictory or ambiguous formalizations, 3) aggregates predictions from surviving paths via majority voting.
Result: D&P substantially improves AF-based reasoning across four benchmarks: AR-LSAT (78.43% with GPT-4, 78.00% with GPT-4o), near-ceiling performance on ProofWriter, PrOntoQA (100%), and LogicalDeduction (100%), outperforming strong baselines MAD-LOGIC and CLOVER.
Conclusion: D&P effectively improves auto-formalization-based logical reasoning without extra supervision by leveraging diversity and verification, addressing both syntactic and semantic failures in AF pipelines.
Abstract: Auto-formalization (AF) translates natural-language reasoning problems into solver-executable programs, enabling symbolic solvers to perform sound logical deduction. In practice, however, AF pipelines are currently brittle: programs may fail to execute, or execute but encode incorrect semantics. While prior work largely mitigates syntactic failures via repairs based on solver feedback, reducing semantic failures remains a major bottleneck. We propose Draft-and-Prune (D&P), an inference-time framework that improves AF-based logical reasoning via diversity and verification. D&P first drafts multiple natural-language plans and conditions program generation on them. It further prunes executable but contradictory or ambiguous formalizations, and aggregates predictions from surviving paths via majority voting. Across four representative benchmarks (AR-LSAT, ProofWriter, PrOntoQA, LogicalDeduction), D&P substantially strengthens AF-based reasoning without extra supervision. On AR-LSAT, in the AF-only setting, D&P achieves 78.43% accuracy with GPT-4 and 78.00% accuracy with GPT-4o, significantly outperforming the strongest AF baselines MAD-LOGIC and CLOVER. D&P then attains near-ceiling performance on the other benchmarks, including 100% on PrOntoQA and LogicalDeduction.
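A minimal sketch of the D&P control flow as described above: draft plans, formalize, prune non-executable or contradictory programs, and majority-vote over the survivors. The helper functions are hypothetical stand-ins for the paper's LLM calls and symbolic solver, stubbed so the loop runs end to end.

```python
import random
from collections import Counter

# Hypothetical stand-ins for the paper's LLM calls and symbolic solver.
def llm_draft_plan(problem):
    return f"plan-{random.randint(0, 3)} for: {problem}"

def llm_formalize(problem, plan):
    return f"program({plan})"

def solver_run(program):
    roll = random.random()
    if roll < 0.20:
        return "error", None          # failed to execute (syntactic failure)
    if roll < 0.35:
        return "contradiction", None  # executable but inconsistent (semantic)
    return "ok", random.choice(["A", "A", "B"])

def draft_and_prune(problem, n_drafts=8):
    survivors = []
    for _ in range(n_drafts):
        plan = llm_draft_plan(problem)          # draft a diverse NL plan
        program = llm_formalize(problem, plan)  # condition codegen on the plan
        status, answer = solver_run(program)
        if status != "ok":                      # prune bad formalizations
            continue
        survivors.append(answer)
    # aggregate surviving paths by majority vote; None if everything pruned
    return Counter(survivors).most_common(1)[0][0] if survivors else None

random.seed(0)
print(draft_and_prune("Which option must be true?"))
```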
[357] DAPS++: Rethinking Diffusion Inverse Problems with Decoupled Posterior Annealing
Hao Chen, Renzheng Zhang, Scott S. Howard
Main category: cs.AI
TL;DR: DAPS++ reinterprets diffusion models for inverse problems as an EM-style framework with decoupled diffusion initialization and data-driven refinement, achieving computational efficiency and robust reconstruction.
Details
Motivation: The standard Bayesian interpretation of score-based diffusion for inverse problems fails to explain practical behavior, where the prior offers limited guidance and reconstruction is largely driven by measurement consistency, effectively decoupling inference from the diffusion dynamics.
Method: Reinterprets the role of diffusion as an initialization stage within an EM-style framework, introducing DAPS++, which lets the likelihood term guide inference more directly while maintaining numerical stability, with a fully decoupled diffusion stage and data-driven refinement.
Result: DAPS++ achieves high computational efficiency (fewer function evaluations and measurement-optimization steps) and robust reconstruction performance across diverse image restoration tasks.
Conclusion: The EM-style framework provides insight into why unified diffusion trajectories remain effective in practice, offering a more accurate interpretation of diffusion models for inverse problems.
Abstract: From a Bayesian perspective, score-based diffusion solves inverse problems through joint inference, embedding the likelihood with the prior to guide the sampling process. However, this formulation fails to explain its practical behavior: the prior offers limited guidance, while reconstruction is largely driven by the measurement-consistency term, leading to an inference process that is effectively decoupled from the diffusion dynamics. To clarify this structure, we reinterpret the role of diffusion in inverse problem solving as an initialization stage within an expectation–maximization (EM)–style framework, where the diffusion stage and the data-driven refinement are fully decoupled. We introduce DAPS++, which allows the likelihood term to guide inference more directly while maintaining numerical stability and providing insight into why unified diffusion trajectories remain effective in practice. By requiring fewer function evaluations (NFEs) and measurement-optimization steps, DAPS++ achieves high computational efficiency and robust reconstruction performance across diverse image restoration tasks.
[358] Graph-Native Cognitive Memory for AI Agents: Formal Belief Revision Semantics for Versioned Memory Architectures
Young Bin Park
Main category: cs.AI
TL;DR: Kumiho is a graph-native cognitive memory architecture for AI agents that unifies agent memory and work management using formal belief revision semantics, achieving state-of-the-art performance on cognitive memory benchmarks.
Details
Motivation: Existing AI agent systems have individual memory components but lack architectural synthesis and formal grounding. The authors aim to create a unified architecture that combines cognitive memory with agent work management using formal belief revision theory.
Method: Proposes Kumiho, a graph-native architecture with a formal correspondence between the AGM belief revision framework and a property graph memory system. Uses a dual-store model (Redis working memory, Neo4j long-term graph) with hybrid fulltext/vector retrieval. Implements three innovations: prospective indexing (LLM-generated future implications), event extraction (structured causal events), and client-side LLM reranking.
Result: Achieves 0.565 overall F1 on LoCoMo benchmark (n=1,986) with 97.5% adversarial refusal accuracy. On LoCoMo-Plus (implicit constraint recall), achieves 93.3% judge accuracy (n=401), substantially outperforming all baselines (best baseline: Gemini 2.5 Pro at 45.7%). Architecture is model-decoupled, allowing easy switching between LLMs.
Conclusion: Kumiho demonstrates that formal belief revision semantics can ground practical cognitive memory systems for AI agents, achieving state-of-the-art performance while providing architectural unification of memory and work management.
Abstract: While individual components for AI agent memory exist in prior systems, their architectural synthesis and formal grounding remain underexplored. We present Kumiho, a graph-native cognitive memory architecture grounded in formal belief revision semantics. The structural primitives required for cognitive memory – immutable revisions, mutable tag pointers, typed dependency edges, URI-based addressing – are identical to those required for managing agent-produced work as versionable assets, enabling a unified graph-native architecture that serves both purposes. The central formal contribution is a correspondence between the AGM belief revision framework and the operational semantics of a property graph memory system, proving satisfaction of the basic AGM postulates (K2–K6) and Hansson’s belief base postulates (Relevance, Core-Retainment). The architecture implements a dual-store model (Redis working memory, Neo4j long-term graph) with hybrid fulltext and vector retrieval. On LoCoMo (token-level F1), Kumiho achieves 0.565 overall F1 (n=1,986) including 97.5% adversarial refusal accuracy. On LoCoMo-Plus, a Level-2 cognitive memory benchmark testing implicit constraint recall, Kumiho achieves 93.3% judge accuracy (n=401); independent reproduction by the benchmark authors yielded results in the mid-80% range, still substantially outperforming all published baselines (best: Gemini 2.5 Pro, 45.7%). Three architectural innovations drive the results: prospective indexing (LLM-generated future-scenario implications indexed at write time), event extraction (structured causal events preserved in summaries), and client-side LLM reranking. The architecture is model-decoupled: switching the answer model from GPT-4o-mini (~88%) to GPT-4o (93.3%) improves end-to-end accuracy without pipeline changes, at a total evaluation cost of ~$14 for 401 entries.
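The abstract specifies hybrid fulltext and vector retrieval but not how the two ranked lists are merged; reciprocal rank fusion is one common choice, shown here purely as an illustration, not as Kumiho's actual fusion rule.

```python
# Reciprocal rank fusion over the two retrievers' ranked id lists; k=60 is
# the conventional RRF constant, not a value taken from the paper.
def rrf_merge(fulltext_hits, vector_hits, k=60, top_n=10):
    scores = {}
    for hits in (fulltext_hits, vector_hits):    # each ranked best-first
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(rrf_merge(["m3", "m7", "m1"], ["m7", "m9", "m3"]))
# m7 and m3 rise to the top because both retrievers agree on them
```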
[359] Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations
Haozheng Luo, Yimin Wang, Jiahao Yu, Binghui Wang, Yan Chen
Main category: cs.AI
TL;DR: CRAFT is a red-teaming alignment framework that improves jailbreak robustness by aligning reasoning models to generate safety-aware reasoning traces through hidden state space optimization.
Details
Motivation: Current defenses against jailbreak attacks operate primarily at the output level, lacking robustness. The paper aims to improve safety alignment by leveraging model reasoning capabilities and hidden representations for more fundamental safety improvements.
Method: CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories in hidden state space. It aligns large reasoning models to generate safety-aware reasoning traces by optimizing objectives defined over hidden representations, incorporating latent-textual consistency into GRPO to eliminate superficially aligned policies.
Result: CRAFT consistently outperforms state-of-the-art defenses like IPO and SafeKey on multiple safety benchmarks. It achieves 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over base models using Qwen3-4B-Thinking and R1-Distill-Llama-8B.
Conclusion: Hidden-space reasoning alignment is effective for improving model safety against jailbreak attacks, demonstrating that leveraging hidden representations and reasoning capabilities provides more robust safety alignment than output-level approaches.
Abstract: We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, we show that incorporating latent-textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.
[360] InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning
Chengwei Wei, Jung-jae Kim, Longyin Zhang, Shengkai Chen, Nancy F. Chen
Main category: cs.AI
TL;DR: InfoDensity: A reward framework for RL training that optimizes reasoning quality by measuring information density in reasoning traces, reducing verbosity while maintaining accuracy.
Details
Motivation: Current LLMs generate verbose and redundant reasoning traces with unnecessary computational cost. Existing RL approaches optimize final response length but neglect intermediate reasoning quality, leaving models vulnerable to reward hacking. The authors argue verbosity is a symptom of poor intermediate reasoning quality rather than just a length problem.
Method: Conducted an empirical study tracking the conditional entropy of the answer distribution across reasoning steps, finding that high-quality reasoning exhibits low uncertainty convergence and monotonic progress. Proposed the InfoDensity reward framework, combining an AUC-based reward and a monotonicity reward into a unified measure of reasoning quality, weighted by a length scaling term favoring concise reasoning.
Result: Experiments on mathematical reasoning benchmarks show InfoDensity matches or surpasses state-of-the-art baselines in accuracy while significantly reducing token usage, achieving strong accuracy-efficiency trade-off.
Conclusion: Verbosity in LLM reasoning stems from poor intermediate reasoning quality. InfoDensity effectively optimizes reasoning quality by promoting informationally dense reasoning traces, offering better accuracy-efficiency balance than length-focused approaches.
Abstract: Large Language Models (LLMs) with extended reasoning capabilities often generate verbose and redundant reasoning traces, incurring unnecessary computational cost. While existing reinforcement learning approaches address this by optimizing final response length, they neglect the quality of intermediate reasoning steps, leaving models vulnerable to reward hacking. We argue that verbosity is not merely a length problem, but a symptom of poor intermediate reasoning quality. To investigate this, we conduct an empirical study tracking the conditional entropy of the answer distribution across reasoning steps. We find that high-quality reasoning traces exhibit two consistent properties: low uncertainty convergence and monotonic progress. These findings suggest that high-quality reasoning traces are informationally dense, that is, each step contributes meaningful entropy reduction relative to the total reasoning length. Motivated by this, we propose InfoDensity, a reward framework for RL training that combines an AUC-based reward and a monotonicity reward as a unified measure of reasoning quality, weighted by a length scaling term that favors achieving equivalent quality more concisely. Experiments on mathematical reasoning benchmarks demonstrate that InfoDensity matches or surpasses state-of-the-art baselines in accuracy while significantly reducing token usage, achieving a strong accuracy-efficiency trade-off.
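A hedged reconstruction of how such a reward could be computed from a trace of per-step conditional answer entropies; the AUC proxy, monotonicity term, and length scaling below follow the description above, but the paper's exact normalizations and weights may differ.

```python
import numpy as np

def info_density_reward(H, alpha=1.0, beta=0.5, gamma=0.05):
    # H: conditional answer entropy measured after each reasoning step
    H = np.asarray(H, dtype=float)
    T = len(H)
    # AUC-style term: a small normalized area under the entropy curve means
    # uncertainty collapses early, i.e. the trace is informationally dense
    r_auc = 1.0 - np.mean(H / max(H[0], 1e-8))
    # monotonicity term: fraction of steps that do not increase uncertainty
    r_mono = float(np.mean(np.diff(H) <= 0.0)) if T > 1 else 1.0
    # length scaling: favor reaching the same quality with fewer steps
    return (alpha * r_auc + beta * r_mono) * np.exp(-gamma * T)

print(info_density_reward([2.0, 1.2, 0.6, 0.2, 0.05]))   # dense, short trace
print(info_density_reward([2.0, 1.9, 2.1, 1.8] * 5))     # long, meandering
```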
[361] Physics-informed offline reinforcement learning eliminates catastrophic fuel waste in maritime routing
Aniruddha Bora, Julie Chalfant, Chryssostomos Chryssostomidis
Main category: cs.AI
TL;DR: PIER is an offline reinforcement learning framework for maritime route optimization that reduces fuel consumption and CO2 emissions by learning physics-informed, safety-aware routing policies from historical vessel data without needing online simulators.
Details
Motivation: International shipping contributes significantly to global greenhouse gas emissions (3%), yet current voyage routing relies on heuristic methods that are inefficient and can lead to catastrophic fuel waste during adverse ocean conditions.
Method: Offline reinforcement learning framework using physics-calibrated environments based on historical vessel tracking data (AIS) and ocean reanalysis products. Combines physics-informed state construction, demonstration-augmented offline data, and a decoupled post-hoc safety shield.
Result: Reduces mean CO2 emissions by 10% relative to great-circle routing, with 9-fold reduction in catastrophic fuel waste events (from 4.8% to 0.5% of voyages). Per-voyage fuel variance is 3.5x lower, and the system maintains constant performance under forecast uncertainty unlike traditional methods.
Conclusion: PIER provides a forecast-independent, physics-informed routing solution that significantly reduces fuel consumption and emissions while improving safety, with architecture transferable to other domains like wildfire evacuation and autonomous navigation.
Abstract: International shipping produces approximately 3% of global greenhouse gas emissions, yet voyage routing remains dominated by heuristic methods. We present PIER (Physics-Informed, Energy-efficient, Risk-aware routing), an offline reinforcement learning framework that learns fuel-efficient, safety-aware routing policies from physics-calibrated environments grounded in historical vessel tracking data and ocean reanalysis products, requiring no online simulator. Validated on one full year (2023) of AIS data across seven Gulf of Mexico routes (840 episodes per method), PIER reduces mean CO2 emissions by 10% relative to great-circle routing. However, PIER’s primary contribution is eliminating catastrophic fuel waste: great-circle routing incurs extreme fuel consumption (>1.5x median) in 4.8% of voyages; PIER reduces this to 0.5%, a 9-fold reduction. Per-voyage fuel variance is 3.5x lower (p<0.001), with bootstrap 95% CI for mean savings [2.9%, 15.7%]. Partial validation against observed AIS vessel behavior confirms consistency with the fastest real transits while exhibiting 23.1x lower variance. Crucially, PIER is forecast-independent: unlike A* path optimization whose wave protection degrades 4.5x under realistic forecast uncertainty, PIER maintains constant performance using only local observations. The framework combines physics-informed state construction, demonstration-augmented offline data, and a decoupled post-hoc safety shield, an architecture that transfers to wildfire evacuation, aircraft trajectory optimization, and autonomous navigation in unmapped terrain.
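A minimal runnable sketch of the decoupled post-hoc safety shield idea: the policy proposes actions best-first and the shield vetoes any whose predicted next position violates an environmental constraint. The toy dynamics, wave field, and 4 m threshold are invented for illustration, not PIER's actual model.

```python
# Toy wave field, dynamics, and threshold -- placeholders, not PIER's model.
MAX_WAVE_HEIGHT_M = 4.0

def wave_height_at(x, y):                 # seas get heavier moving north
    return 2.0 + 3.0 * abs(y) / 10.0

def step(pos, action):                    # unit-step dynamics
    return (pos[0] + action[0], pos[1] + action[1])

def shielded_action(ranked_actions, pos):
    for a in ranked_actions:              # best-first per the learned policy
        nx, ny = step(pos, a)
        if wave_height_at(nx, ny) <= MAX_WAVE_HEIGHT_M:
            return a                      # first proposal passing the shield
    return (0, 0)                         # conservative hold if all vetoed

# The policy prefers heading north into heavy seas; the shield redirects it.
print(shielded_action([(0, 1), (1, 1), (1, 0)], pos=(0, 6)))  # -> (1, 0)
```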
[362] ShuttleEnv: An Interactive Data-Driven RL Environment for Badminton Strategy Modeling
Ang Li, Xinyang Gong, Bozhou Chen, Yunlong Lu, Jiaming Ji, Yongyi Wang, Yaodong Yang, Wenxin Li
Main category: cs.AI
TL;DR: ShuttleEnv is a data-driven simulation environment for badminton that uses probabilistic models based on elite-player match data to simulate rally dynamics, enabling reinforcement learning and strategic analysis without physics-based simulation.
Details
Motivation: To create a realistic and interpretable simulation environment for fast-paced adversarial sports like badminton that supports reinforcement learning research and strategic behavior analysis, using actual match data rather than physics-based approaches.
Method: The environment is grounded in elite-player match data and employs explicit probabilistic models to simulate rally-level dynamics. It provides interactive visualization of badminton rallies with trained agents, allowing exploration of different play styles and decision-making behaviors.
Result: ShuttleEnv serves as a reusable platform for research, visualization, and demonstration of intelligent agents in sports AI, with live step-by-step visualization of rallies and trained agents showcasing emergent strategies.
Conclusion: ShuttleEnv provides an effective data-driven simulation environment for badminton that enables realistic agent-opponent interactions and supports reinforcement learning research in sports AI without requiring complex physics simulations.
Abstract: We present ShuttleEnv, an interactive and data-driven simulation environment for badminton, designed to support reinforcement learning and strategic behavior analysis in fast-paced adversarial sports. The environment is grounded in elite-player match data and employs explicit probabilistic models to simulate rally-level dynamics, enabling realistic and interpretable agent-opponent interactions without relying on physics-based simulation. In this demonstration, we showcase multiple trained agents within ShuttleEnv and provide live, step-by-step visualization of badminton rallies, allowing attendees to explore different play styles, observe emergent strategies, and interactively analyze decision-making behaviors. ShuttleEnv serves as a reusable platform for research, visualization, and demonstration of intelligent agents in sports AI. Our ShuttleEnv demo video URL: https://drive.google.com/file/d/1hTR4P16U27H2O0-w316bR73pxE2ucczX/view
[363] A Progressive Visual-Logic-Aligned Framework for Ride-Hailing Adjudication
Weiming Wu, Zi-Jian Cheng, Jie Meng, Peng Zhen, Shan Huang, Qun Li, Guobin Wu, Lan-Zhe Guo
Main category: cs.AI
TL;DR: RideJudge is a multimodal LLM framework for automated responsibility adjudication in ride-hailing disputes, addressing challenges of visual-logic alignment, context limitations, and sparse feedback through trajectory synthesis, adaptive context optimization, and ordinal-sensitive reinforcement learning.
Details
Motivation: Manual review of ride-hailing responsibility disputes is intractable due to exponential volume growth, while conventional automated methods lack the reasoning transparency needed for quasi-judicial decisions. Existing multimodal LLMs struggle to bridge general visual semantics with rigorous evidentiary protocols, leading to perceptual hallucinations and logical looseness.
Method: 1) The SynTraj synthesis engine grounds abstract liability concepts in concrete trajectory patterns to bridge the semantic gap; 2) Adaptive Context Optimization distills expert knowledge to handle massive regulations within limited context windows; 3) a Chain-of-Adjudication mechanism enforces active evidentiary inquiry; 4) Ordinal-Sensitive Reinforcement Learning calibrates decision boundaries against hierarchical severity instead of sparse binary feedback.
Result: RideJudge-8B achieves 88.41% accuracy, surpassing 32B-scale baselines and establishing a new standard for interpretable adjudication in ride-hailing responsibility disputes.
Conclusion: The proposed Progressive Visual-Logic-Aligned Framework effectively addresses systemic misalignments in multimodal LLMs for responsibility adjudication, providing transparent reasoning, handling complex regulations, and achieving state-of-the-art performance with smaller model sizes.
Abstract: The efficient adjudication of responsibility disputes is pivotal for maintaining marketplace fairness. However, the exponential surge in ride-hailing volume renders manual review intractable, while conventional automated methods lack the reasoning transparency required for quasi-judicial decisions. Although Multimodal LLMs offer a promising paradigm, they fundamentally struggle to bridge the gap between general visual semantics and rigorous evidentiary protocols, often leading to perceptual hallucinations and logical looseness. To address these systemic misalignments, we introduce RideJudge, a Progressive Visual-Logic-Aligned Framework. Instead of relying on generic pre-training, we bridge the semantic gap via SynTraj, a synthesis engine that grounds abstract liability concepts into concrete trajectory patterns. To resolve the conflict between massive regulation volume and limited context windows, we propose an Adaptive Context Optimization strategy that distills expert knowledge, coupled with a Chain-of-Adjudication mechanism to enforce active evidentiary inquiry. Furthermore, addressing the inadequacy of sparse binary feedback for complex liability assessment, we implement a novel Ordinal-Sensitive Reinforcement Learning mechanism that calibrates decision boundaries against hierarchical severity. Extensive experiments show that our RideJudge-8B achieves 88.41% accuracy, surpassing 32B-scale baselines and establishing a new standard for interpretable adjudication.
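The ordinal-sensitive signal can be pictured as a reward that degrades with distance along the severity ordering rather than collapsing to hit/miss; the grade labels and linear scaling below are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative liability grades and a linearly decaying ordinal reward.
GRADES = ["no_fault", "minor", "major", "full_liability"]

def ordinal_reward(pred, gold):
    dist = abs(GRADES.index(pred) - GRADES.index(gold))
    return 1.0 - dist / (len(GRADES) - 1)   # 1.0 exact, 0.0 maximally wrong

print(ordinal_reward("minor", "major"))               # near miss  -> ~0.67
print(ordinal_reward("no_fault", "full_liability"))   # worst case -> 0.0
```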
[364] Governed Memory: A Production Architecture for Multi-Agent Workflows
Hamed Taheri
Main category: cs.AI
TL;DR: Governed Memory is a shared memory and governance layer for enterprise AI agents that addresses memory silos, governance fragmentation, and quality degradation through dual memory models, tiered governance routing, and closed-loop schema management.
Details
Motivation: Enterprise AI systems deploy multiple autonomous agents across workflows without shared memory or common governance, leading to five structural challenges: memory silos, governance fragmentation, unstructured memories, redundant context delivery, and silent quality degradation.
Method: Four key mechanisms: 1) a dual memory model combining open-set atomic facts with schema-enforced typed properties; 2) tiered governance routing with progressive context delivery; 3) reflection-bounded retrieval with entity-scoped isolation; 4) a closed-loop schema lifecycle with AI-assisted authoring and automated per-property refinement.
Result: Experimental validation (N=250) shows: 99.6% fact recall with dual-modality coverage; 92% governance routing precision; 50% token reduction from progressive delivery; zero cross-entity leakage; 100% adversarial governance compliance; quality saturation at ~7 governed memories per entity. Achieves 74.8% accuracy on LoCoMo benchmark.
Conclusion: Governed Memory successfully addresses the memory governance gap in enterprise AI systems, demonstrating that governance and schema enforcement can be implemented without compromising retrieval quality, and is currently in production at Personize.ai.
Abstract: Enterprise AI deploys dozens of autonomous agent nodes across workflows, each acting on the same entities with no shared memory and no common governance. We identify five structural challenges arising from this memory governance gap: memory silos across agent workflows; governance fragmentation across teams and tools; unstructured memories unusable by downstream systems; redundant context delivery in autonomous multi-step executions; and silent quality degradation without feedback loops. We present Governed Memory, a shared memory and governance layer addressing this gap through four mechanisms: a dual memory model combining open-set atomic facts with schema-enforced typed properties; tiered governance routing with progressive context delivery; reflection-bounded retrieval with entity-scoped isolation; and a closed-loop schema lifecycle with AI-assisted authoring and automated per-property refinement. We validate each mechanism through controlled experiments (N=250, five content types): 99.6% fact recall with complementary dual-modality coverage; 92% governance routing precision; 50% token reduction from progressive delivery; zero cross-entity leakage across 500 adversarial queries; 100% adversarial governance compliance; and output quality saturation at approximately seven governed memories per entity. On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy, confirming that governance and schema enforcement impose no retrieval quality penalty. The system is in production at Personize.ai.
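A toy illustration of the dual memory model: open-set atomic facts are stored as-is, while typed properties must validate against a governed schema before being written. The schema keys and types are invented for illustration.

```python
# Invented schema and types, for illustration only.
SCHEMA = {"customer.plan": str, "customer.seats": int}

class GovernedMemory:
    def __init__(self):
        self.facts = []        # open-set atomic facts, stored as-is
        self.properties = {}   # schema-enforced typed properties

    def add_fact(self, text):
        self.facts.append(text)

    def set_property(self, key, value):
        expected = SCHEMA.get(key)
        if expected is None:
            raise KeyError(f"ungoverned property: {key}")
        if not isinstance(value, expected):
            raise TypeError(f"{key} expects {expected.__name__}")
        self.properties[key] = value   # only well-typed, governed writes land

mem = GovernedMemory()
mem.add_fact("prefers invoices as PDF attachments")
mem.set_property("customer.seats", 25)        # passes schema validation
# mem.set_property("customer.seats", "many")  # would raise TypeError
print(mem.facts, mem.properties)
```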
[365] Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation
Jianan Chen, Zhifang Zhang, Shuo He, Linan Yue, Lei Feng, Minling Zhang
Main category: cs.AI
TL;DR: A method to improve safety in large reasoning models by extracting safety decision signals from safe models and integrating them as auxiliary supervision before chain-of-thought generation.
Details
Motivation: Large reasoning models show degraded safety capabilities when chain-of-thought reasoning is enabled, creating a safety-reasoning trade-off that needs to be addressed.
Method: Use a BERT-based classifier to extract safety decision signals from safe models (CoT-disabled LRMs), then integrate these signals as auxiliary supervision to strengthen safety decision-making before CoT generation.
Result: The method substantially improves safety capabilities of LRMs while effectively maintaining their general reasoning performance.
Conclusion: The proposed safety alignment method successfully addresses the safety degradation problem in large reasoning models when chain-of-thought reasoning is enabled.
Abstract: Large reasoning models (LRMs) have achieved remarkable performance via chain-of-thought (CoT), but recent studies have shown that such enhanced reasoning capabilities come at the expense of significantly degraded safety capabilities. In this paper, we reveal that LRMs’ safety degradation occurs only after CoT is enabled, and this degradation is not observed when CoT is disabled. This observation motivates us to encourage LRMs to make safety decisions before CoT generation. To this end, we propose a novel safety alignment method that promotes the safety decision-making of LRMs before starting CoT generation. Specifically, we first utilize a BERT-based classifier to extract safety decision signals from a safe model (e.g., a CoT-disabled LRM) and then integrate these signals into LRMs’ safety alignment as auxiliary supervision. In this way, the safety gradients can be backpropagated to the LRMs’ latent representations, effectively strengthening the LRMs’ safety decision-making abilities against CoT generation. Extensive experiments demonstrate that our method substantially improves the safety capabilities of LRMs while effectively maintaining LRMs’ general reasoning performance.
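One plausible shape for the auxiliary supervision, sketched under assumptions: a frozen classifier supplies a refuse/comply label per prompt (extracted from the CoT-disabled safe model), and a safety head on the pre-CoT hidden state is trained alongside the usual language-modeling loss. Shapes and the 0.5 weight are illustrative, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes and weighting; the safety label comes from a frozen
# classifier applied to the safe (CoT-disabled) model's responses.
def alignment_loss(lm_logits, lm_targets, safety_logits, safety_labels,
                   aux_weight=0.5):
    # standard next-token loss on the response
    loss_lm = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                              lm_targets.view(-1))
    # auxiliary supervision: match the safe model's refuse/comply decision,
    # predicted from the hidden state at the pre-CoT decision position
    loss_safety = F.cross_entropy(safety_logits, safety_labels)
    return loss_lm + aux_weight * loss_safety

lm_logits = torch.randn(2, 8, 100)           # (batch, seq, vocab)
lm_targets = torch.randint(0, 100, (2, 8))
safety_logits = torch.randn(2, 2)            # refuse vs. comply head
safety_labels = torch.tensor([0, 1])
print(alignment_loss(lm_logits, lm_targets, safety_logits, safety_labels))
```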
[366] From Digital Twins to World Models: Opportunities, Challenges, and Applications for Mobile Edge General Intelligence
Jie Zheng, Dusit Niyato, Changyuan Zhao, Jiawen Kang, Jiacheng Wang
Main category: cs.AI
TL;DR: Survey paper on the transition from digital twins to world models for enabling edge general intelligence in 6G+ wireless systems, covering conceptual differences, architectures, applications, and research challenges.
Details
Motivation: Traditional digital twins face limitations in autonomy, adaptability, and scalability in dynamic edge environments, necessitating a transition to more advanced world models for edge general intelligence.
Method: Systematic survey approach: clarifies conceptual differences between digital twins and world models, reviews design principles and architectures of world models (perception, latent representation, dynamics learning, planning, memory), examines integration in wireless EGI systems, and surveys emerging applications.
Result: Provides comprehensive roadmap and practical insights for designing world-model-driven edge intelligence systems, outlines key research challenges and future directions for scalable, reliable, and interoperable world models for edge-native agentic AI.
Conclusion: Transition from digital twins to world models is crucial for enabling adaptive, autonomous, and resource-efficient edge general intelligence in 6G+ wireless systems, with significant implications for various emerging applications.
Abstract: The rapid evolution toward 6G and beyond communication systems is accelerating the convergence of digital twins and world models at the network edge. Traditional digital twins provide high-fidelity representations of physical systems and support monitoring, analysis, and offline optimization. However, in highly dynamic edge environments, they face limitations in autonomy, adaptability, and scalability. This paper presents a systematic survey of the transition from digital twins to world models and discusses its role in enabling edge general intelligence (EGI). First, the paper clarifies the conceptual differences between digital twins and world models and highlights the shift from physics-based, centralized, and system-centric replicas to data-driven, decentralized, and agent-centric internal models. This discussion helps readers gain a clear understanding of how this transition enables more adaptive, autonomous, and resource-efficient intelligence at the network edge. The paper reviews the design principles, architectures, and key components of world models, including perception, latent state representation, dynamics learning, imagination-based planning, and memory. In addition, it examines the integration of world models and digital twins in wireless EGI systems and surveys emerging applications in integrated sensing and communications, semantic communication, air-ground networks, and low-altitude wireless networks. Finally, this survey provides a systematic roadmap and practical insights for designing world-model-driven edge intelligence systems in wireless and edge computing environments. It also outlines key research challenges and future directions toward scalable, reliable, and interoperable world models for edge-native agentic AI.
[367] Proactive Knowledge Inquiry in Doctor-Patient Dialogue: Stateful Extraction, Belief Updating, and Path-Aware Action Planning
Zhenhai Pan, Yan Liu, Jia You
Main category: cs.AI
TL;DR: A proactive knowledge-inquiry framework for doctor-patient dialogues that treats EMR generation as an ongoing inquiry loop with stateful extraction, belief updating, and POMDP-lite planning.
Details
Motivation: Current EMR pipelines are output-oriented and don't model what is known, what is missing, or what questions should come next during consultations. The paper aims to transform EMR generation from passive documentation into proactive knowledge inquiry.
Method: Formulates doctor-patient dialogue as proactive knowledge inquiry under partial observability. Combines stateful extraction, sequential belief updating, gap-aware state modeling, hybrid retrieval over medical knowledge, and POMDP-lite action planning.
Result: On a controlled pilot evaluation with 10 standardized dialogues and 300-query benchmark: 83.3% coverage, 80.0% risk recall, 81.4% structural completeness, with lower redundancy than baselines.
Conclusion: Proactive inquiry is methodologically interesting under controlled conditions and offers a conceptually appealing formulation for dialogue-based EMR generation, but results don’t establish clinical generalization or deployment readiness.
Abstract: Most automated electronic medical record (EMR) pipelines remain output-oriented: they transcribe, extract, and summarize after the consultation, but they do not explicitly model what is already known, what is still missing, which uncertainty matters most, or what question or recommendation should come next. We formulate doctor-patient dialogue as a proactive knowledge-inquiry problem under partial observability. The proposed framework combines stateful extraction, sequential belief updating, gap-aware state modeling, hybrid retrieval over objectified medical knowledge, and a POMDP-lite action planner. Instead of treating the EMR as the only target artifact, the framework treats documentation as the structured projection of an ongoing inquiry loop. To make the formulation concrete, we report a controlled pilot evaluation on ten standardized multi-turn dialogues together with a 300-query retrieval benchmark aggregated across dialogues. On this pilot protocol, the full framework reaches 83.3% coverage, 80.0% risk recall, 81.4% structural completeness, and lower redundancy than the chunk-only and template-heavy interactive baselines. These pilot results do not establish clinical generalization; rather, they suggest that proactive inquiry may be methodologically interesting under tightly controlled conditions and can be viewed as a conceptually appealing formulation worth further investigation for dialogue-based EMR generation. This work should be read as a pilot concept demonstration under a controlled simulated setting rather than as evidence of clinical deployment readiness. No implication of clinical deployment readiness, clinical safety, or real-world clinical utility should be inferred from this pilot protocol.
[368] When Only the Final Text Survives: Implicit Execution Tracing for Multi-Agent Attribution
Yi Nian, Haosen Cao, Shenzhe Zhu, Henry Peng Zou, Qingqing Luan, Yue Zhao
Main category: cs.AI
TL;DR: IET enables token-level attribution and interaction topology reconstruction in multi-agent language systems without execution logs, using embedded keyed signals in token distributions.
Details
Motivation: Multi-agent language systems lack accountability when they produce incorrect or harmful outputs, especially when execution logs and agent identifiers are unavailable; final outputs obscure the interaction topology and individual agent contributions.
Method: IET (Implicit Execution Tracing) embeds agent-specific keyed signals into token distributions during generation, making the text a self-describing execution trace. A transition-aware scoring method detects agent handover points and reconstructs interaction graphs using only the generated text and a secret key.
Result: Experiments show IET recovers agent segments and coordination structure with high accuracy while preserving generation quality, enabling privacy-preserving auditing for multi-agent systems.
Conclusion: IET provides a practical framework for accountability in multi-agent language systems without requiring execution logs, enabling token-level attribution and interaction topology reconstruction directly from generated text.
Abstract: When a multi-agent system produces an incorrect or harmful answer, who is accountable if execution logs and agent identifiers are unavailable? Multi-agent language systems increasingly rely on structured interactions such as delegation and iterative refinement, yet the final output often obscures the underlying interaction topology and agent contributions. We introduce IET (Implicit Execution Tracing), a metadata-independent framework that enables token-level attribution directly from generated text and a simple mechanism for interaction topology reconstruction. During generation, agent-specific keyed signals are embedded into the token distribution, transforming the text into a self-describing execution trace detectable only with a secret key. At detection time, a transition-aware scoring method identifies agent handover points and reconstructs the interaction graph. Experiments show that IET recovers agent segments and coordination structure with high accuracy while preserving generation quality, enabling privacy-preserving auditing for multi-agent language systems.
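A generic watermarking-style sketch of the keyed-signal idea: each agent biases a key-dependent pseudo-random subset of tokens during generation, and a detector holding the secret key scores which agent produced a span. IET's actual embedding and transition-aware scoring are more involved than this.

```python
import hashlib
import numpy as np

def keyed_mask(secret_key, agent_id, vocab_size, frac=0.5):
    # key-dependent pseudo-random "green" token subset for one agent
    seed = int.from_bytes(
        hashlib.sha256(f"{secret_key}:{agent_id}".encode()).digest()[:8],
        "big")
    return np.random.default_rng(seed).random(vocab_size) < frac

def bias_logits(logits, mask, delta=2.0):
    return logits + delta * mask          # applied at each generation step

def agent_score(token_ids, mask):
    # detector side: fraction of tokens in the agent's keyed subset; a value
    # well above `frac` points to this agent having generated the span
    return float(np.mean(mask[token_ids]))

V = 1000
mask_a = keyed_mask("secret", "agent_A", V)
mask_b = keyed_mask("secret", "agent_B", V)
biased = bias_logits(np.zeros(V), mask_a)      # generation-side usage
tokens = np.flatnonzero(mask_a)[:50]           # toy span favouring A's subset
print(agent_score(tokens, mask_a), agent_score(tokens, mask_b))  # ~1.0 vs ~0.5
```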
[369] Informative Semi-Factuals for XAI: The Elaborated Explanations that People Prefer
Saugat Aryal, Mark T. Keane
Main category: cs.AI
TL;DR: Proposes informative semi-factual explanations that not only show outcome invariance under feature changes but also explain why the outcome remains unchanged by revealing influential hidden features.
Details
Motivation: Current semi-factual XAI methods only show that outcomes remain unchanged under extreme feature changes but don't explain why. Users need more informative explanations that reveal the underlying reasons for outcome invariance.
Method: Develops the Informative Semi-Factuals (ISF) method that generates elaborated explanations supplementing semi-factuals with information about additional hidden features that influence automated decisions.
Result: Experimental results on benchmark datasets show ISF computes semi-factuals that are both informative and high-quality on key metrics. User study shows people prefer these elaborated explanations over simpler semi-factuals from current methods.
Conclusion: The ISF method advances XAI by providing more informative semi-factual explanations that reveal why outcomes remain unchanged, improving user understanding and trust in automated decisions.
Abstract: Recently, in eXplainable AI (XAI), "even if" explanations – so-called semi-factuals – have emerged as a popular strategy that explains how a predicted outcome can remain the same even when certain input-features are altered. For example, in the commonly-used banking app scenario, a semi-factual explanation could inform customers about better options, other alternatives for their successful application, by saying “Even if you asked for double the loan amount, you would still be accepted”. Most semi-factual XAI algorithms focus on finding maximal value-changes to a single key-feature that do not alter the outcome (unlike counterfactual explanations that often find minimal value-changes to several features that alter the outcome). However, no current semi-factual method explains why these extreme value-changes do not alter outcomes; for example, a more informative semi-factual could tell the customer that it is their good credit score that allows them to borrow double their requested loan. In this work, we advance a new algorithm – the informative semi-factuals (ISF) method – that generates more elaborated explanations supplementing semi-factuals with information about additional hidden features that influence an automated decision. Experimental results on benchmark datasets show that this ISF method computes semi-factuals that are both informative and of high quality on key metrics. Furthermore, a user study shows that people prefer these elaborated explanations over the simpler semi-factual explanations generated by current methods.
[370] Per-Domain Generalizing Policies: On Learning Efficient and Robust Q-Value Functions (Extended Version with Technical Appendix)
Nicola J. Müller, Moritz Oster, Isabel Valera, Jörg Hoffmann, Timo P. Gros
Main category: cs.AI
TL;DR: Learning Q-value functions instead of state-value functions for planning domains, using regularization to distinguish between teacher-chosen and unchosen actions, resulting in more efficient policies competitive with LAMA-first planner.
Details
Motivation: Standard approaches learn state-value functions represented as graph neural networks trained on optimal plans, but these require processing all successor states. Q-value functions are cheaper to evaluate, as they only need the current state, but vanilla supervised learning fails to distinguish between actions the teacher chose and those it did not.
Method: Proposes learning Q-value functions with regularization terms that enforce the distinction between actions taken and those not taken by the teacher planner, so the model learns proper action preferences rather than just value estimates.
Result: Q-value policies consistently outperform state-value policies across 10 domains and are competitive with the LAMA-first planner, while being more computationally efficient during evaluation.
Conclusion: Learning Q-value functions with appropriate regularization is an effective approach for planning domains, offering computational efficiency advantages over state-value functions while maintaining strong performance.
Abstract: Learning per-domain generalizing policies is a key challenge in learning for planning. Standard approaches learn state-value functions represented as graph neural networks using supervised learning on optimal plans generated by a teacher planner. In this work, we advocate for learning Q-value functions instead. Such policies are drastically cheaper to evaluate for a given state, as they need to process only the current state rather than every successor. Surprisingly, vanilla supervised learning of Q-values performs poorly as it does not learn to distinguish between the actions taken and those not taken by the teacher. We address this by using regularization terms that enforce this distinction, resulting in Q-value policies that consistently outperform state-value policies across a range of 10 domains and are competitive with the planner LAMA-first.
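A hedged sketch of the regularization idea: besides fitting the teacher's Q target, a margin term pushes the teacher-chosen action above every other applicable action in the same state. The hinge form and weights are common choices assumed here, not necessarily the paper's exact terms.

```python
import torch
import torch.nn.functional as F

# q_values: predicted Q for the applicable actions in one state; the teacher
# planner chose `teacher_action` with target value `teacher_target`.
def q_loss(q_values, teacher_action, teacher_target, margin=1.0, lam=0.5):
    fit = (q_values[teacher_action] - teacher_target) ** 2   # fit the target
    others = torch.cat([q_values[:teacher_action],
                        q_values[teacher_action + 1:]])
    # hinge penalty whenever a non-teacher action comes within `margin`
    rank = F.relu(margin - (q_values[teacher_action] - others)).mean()
    return fit + lam * rank

q = torch.tensor([0.3, 1.1, 0.9], requires_grad=True)
print(q_loss(q, teacher_action=1, teacher_target=torch.tensor(1.0)))
```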
[371] VeriGrey: Greybox Agent Validation
Yuntong Zhang, Sungmin Kang, Ruijie Meng, Marcel Böhme, Abhik Roychoudhury
Main category: cs.AI
TL;DR: VeriGrey: A grey-box testing approach for LLM agents that uses tool invocation sequences as feedback to discover security vulnerabilities, particularly indirect prompt injection attacks, with higher efficacy than black-box methods.
Details
Motivation: LLM agents that autonomously interact with external environments introduce critical security risks, especially indirect prompt injection vulnerabilities that are difficult to detect with traditional black-box testing approaches.
Method: A grey-box approach using tool invocation sequences as a feedback function to drive testing; mutates prompts into pernicious injection prompts by linking agent tasks to injection tasks; employs mutational fuzz testing for conversation agents.
Result: Achieves 33% additional efficacy in finding indirect prompt injection vulnerabilities compared to black-box baseline on AgentDojo benchmark; successfully discovers attack scenarios in real-world agents like Gemini CLI and OpenClaw; finds malicious skill variants with 100% success rate on Kimi-K2.5 and 90% on Opus 4.6 backends.
Conclusion: VeriGrey demonstrates the value of dynamic grey-box testing for uncovering security risks in LLM agents, particularly indirect prompt injection vulnerabilities that black-box approaches miss, paving the way for agent assurance frameworks.
Abstract: Agentic AI has been a topic of great interest recently. A Large Language Model (LLM) agent involves one or more LLMs in the back-end. In the front end, it conducts autonomous decision-making by combining the LLM outputs with results obtained by invoking several external tools. The autonomous interactions with the external environment introduce critical security risks. In this paper, we present a grey-box approach to explore diverse behaviors and uncover security risks in LLM agents. Our approach VeriGrey uses the sequence of tools invoked as a feedback function to drive the testing process. This helps uncover infrequent but dangerous tool invocations that cause unexpected agent behavior. As mutation operators in the testing process, we mutate prompts to design pernicious injection prompts. This is carefully accomplished by linking the task of the agent to an injection task, so that the injection task becomes a necessary step of completing the agent functionality. Comparing our approach with a black-box baseline on the well-known AgentDojo benchmark, VeriGrey achieves 33% additional efficacy in finding indirect prompt injection vulnerabilities with a GPT-4.1 back-end. We also conduct real-world case studies with the widely used coding agent Gemini CLI, and the well-known OpenClaw personal assistant. VeriGrey finds prompts inducing several attack scenarios that could not be identified by black-box approaches. In OpenClaw, by constructing a conversation agent which employs mutational fuzz testing as needed, VeriGrey is able to discover malicious skill variants from 10 malicious skills (with 10/10 = 100% success rate on the Kimi-K2.5 LLM backend, and 9/10 = 90% success rate on the Opus 4.6 LLM backend). This demonstrates the value of a dynamic approach like VeriGrey to test agents, and to eventually lead to an agent assurance framework.
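A skeleton of the grey-box loop with the tool-invocation sequence as the feedback function: mutated prompts that elicit previously unseen tool sequences are kept for further mutation, coverage-style. run_agent and mutate_prompt are hypothetical stand-ins, stubbed so the loop runs.

```python
import random

TOOLS = ["read_file", "send_email", "exec_shell", "browse"]

def run_agent(prompt):                    # stub: deterministic toy tool trace
    rng = random.Random(hash(prompt) % (2**32))
    return tuple(rng.choices(TOOLS, k=rng.randint(1, 4)))

def mutate_prompt(prompt):                # stub: trivial prompt mutation
    return prompt + random.choice([" urgently", " then email it", " quietly"])

def greybox_fuzz(seed_prompt, budget=200):
    seen, corpus = set(), [seed_prompt]
    for _ in range(budget):
        candidate = mutate_prompt(random.choice(corpus))
        trace = run_agent(candidate)      # feedback = sequence of tools used
        if trace not in seen:             # novel sequence -> keep the input
            seen.add(trace)
            corpus.append(candidate)      # expand the corpus, coverage-style
    return seen

random.seed(0)
print(len(greybox_fuzz("summarize the attached report")),
      "distinct tool sequences found")
```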
[372] Sensi: Learn One Thing at a Time – Curriculum-Based Test-Time Learning for LLM Game Agents
Mohsen Arjmandi
Main category: cs.AI
TL;DR: Sensi is an LLM agent architecture for game-playing that introduces structured test-time learning with a two-player system separating perception from action, curriculum learning, and database-as-control-plane for improved sample efficiency.
Details
Motivation: Current LLM agents require thousands of interactions to learn task structure in unknown environments at test time, which is inefficient. The paper aims to improve sample efficiency through structured learning mechanisms.
Method: Three key mechanisms: (1) a two-player architecture separating perception from action, (2) a curriculum-based learning system managed by an external state machine, and (3) a database-as-control-plane for a programmatically steerable context window. Also includes an LLM-as-judge with dynamically generated evaluation rubrics.
Result: Sensi v1 solves 2 game levels using two-player architecture alone. Sensi v2 adds curriculum learning and solves 0 levels but completes entire learning curriculum in ~32 action attempts, achieving 50-94x greater sample efficiency than comparable systems (1600-3000 attempts). Identifies failure mode as self-consistent hallucination cascade in perception layer.
Conclusion: The architectural bottleneck has shifted from learning efficiency to perceptual grounding, which is a more tractable problem. The structured learning approach significantly improves sample efficiency for LLM agents in unknown environments.
Abstract: Large language model (LLM) agents deployed in unknown environments must learn task structure at test time, but current approaches require thousands of interactions to form useful hypotheses. We present Sensi, an LLM agent architecture for the ARC-AGI-3 game-playing challenge that introduces structured test-time learning through three mechanisms: (1) a two-player architecture separating perception from action, (2) a curriculum-based learning system managed by an external state machine, and (3) a database-as-control-plane that makes the agent’s context window programmatically steerable. We further introduce an LLM-as-judge component with dynamically generated evaluation rubrics to determine when the agent has learned enough about one topic to advance to the next. We report results across two iterations: Sensi v1 solves 2 game levels using the two-player architecture alone, while Sensi v2 adds curriculum learning and solves 0 levels - but completes its entire learning curriculum in approximately 32 action attempts, achieving 50-94x greater sample efficiency than comparable systems that require 1600-3000 attempts. We precisely diagnose the failure mode as a self-consistent hallucination cascade originating in the perception layer, demonstrating that the architectural bottleneck has shifted from learning efficiency to perceptual grounding - a more tractable problem.
[373] MALLES: A Multi-agent LLMs-based Economic Sandbox with Consumer Preference Alignment
Yusen Wu, Yiran Liu, Xiaotie Deng
Main category: cs.AI
TL;DR: MALLES: Multi-Agent LLM-based Economic Sandbox that uses preference learning on transaction data and multi-agent discussion to simulate economic decision-making in multimodal, high-dimensional environments.
Details
Motivation: Real-world economic decision-making faces challenges from high-dimensional multimodal environments, agent heterogeneity, and combinatorial data sparsity. Existing approaches struggle with these complexities, necessitating a unified simulation framework.
Method: 1) LLM post-training on heterogeneous transaction records for economic alignment and preference learning; 2) Mean-field mechanism for stable simulation in high-dimensional spaces; 3) Multi-agent discussion framework where specialized agents collaboratively process product information through structured dialogue.
Result: Significant improvements in product selection accuracy, purchase quantity prediction, and simulation stability compared to existing economic and financial LLM simulation baselines.
Conclusion: Large language models can serve as a foundational pillar for high-fidelity, scalable decision simulation and analysis in the real economy when combined with preference learning and multi-agent architectures.
Abstract: In the real economy, modern decision-making is fundamentally challenged by high-dimensional, multimodal environments, which are further complicated by agent heterogeneity and combinatorial data sparsity. This paper introduces a Multi-Agent Large Language Model-based Economic Sandbox (MALLES), leveraging the inherent generalization capabilities of large-scale models to establish a unified simulation framework applicable to cross-domain and cross-category scenarios. Central to our approach is a preference learning paradigm in which LLMs are economically aligned via post-training on extensive, heterogeneous transaction records across diverse product categories. This methodology enables the models to internalize and transfer latent consumer preference patterns, thereby mitigating the data sparsity issues prevalent in individual categories. To enhance simulation stability, we implement a mean-field mechanism designed to model the dynamic interactions between the product environment and customer populations, effectively stabilizing sampling processes within high-dimensional decision spaces. Furthermore, we propose a multi-agent discussion framework wherein specialized agents collaboratively process extensive product information. This architecture distributes cognitive load to alleviate single-agent attention bottlenecks and captures critical decision factors through structured dialogue. Experiments demonstrate that our framework achieves significant improvements in product selection accuracy, purchase quantity prediction, and simulation stability compared to existing economic and financial LLM simulation baselines. Our results substantiate the potential of large language models as a foundational pillar for high-fidelity, scalable decision simulation and subsequent analysis in the real economy, grounded in a foundational database.
[374] From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving
A. Humnabadkar, A. Sikdar, B. Cave, H. Zhang, N. Bessis, A. Behera
Main category: cs.AI
TL;DR: Survey paper reviewing synthetic data and simulation technologies for autonomous driving, covering perception, planning, digital twins, domain adaptation, and vision-language models.
Details
Motivation: Real-world autonomous driving deployment faces data scarcity, safety requirements, and generalization challenges. Synthetic data and virtual environments offer scalable, controllable solutions for training and evaluation.
Method: Comprehensive survey organized across three dimensions: synthetic data for perception/planning, digital twin-based simulation for validation, and domain adaptation strategies. Includes taxonomy of datasets, tools, and simulation platforms.
Result: Provides detailed analysis of current landscape, trends in benchmark design, and identifies key challenges in the field. Highlights the role of vision-language models in enhancing scene understanding.
Conclusion: Identifies critical challenges including Sim2Real transfer, scalable safety validation, cooperative autonomy, and simulation-driven policy learning that must be addressed for safe, generalizable autonomous driving systems.
Abstract: Autonomous driving technologies have achieved significant advances in recent years, yet their real-world deployment remains constrained by data scarcity, safety requirements, and the need for generalization across diverse environments. In response, synthetic data and virtual environments have emerged as powerful enablers, offering scalable, controllable, and richly annotated scenarios for training and evaluation. This survey presents a comprehensive review of recent developments at the intersection of autonomous driving, simulation technologies, and synthetic datasets. We organize the landscape across three core dimensions: (i) the use of synthetic data for perception and planning, (ii) digital twin-based simulation for system validation, and (iii) domain adaptation strategies bridging synthetic and real-world data. We also highlight the role of vision-language models and simulation realism in enhancing scene understanding and generalization. A detailed taxonomy of datasets, tools, and simulation platforms is provided, alongside an analysis of trends in benchmark design. Finally, we discuss critical challenges and open research directions, including Sim2Real transfer, scalable safety validation, cooperative autonomy, and simulation-driven policy learning, that must be addressed to accelerate the path toward safe, generalizable, and globally deployable autonomous driving systems.
[375] Facts as First Class Objects: Knowledge Objects for Persistent LLM Memory
Oliver Zahn, Simran Chana
Main category: cs.AI
TL;DR: Knowledge Objects (KOs) outperform in-context memory for LLMs by providing O(1) retrieval with 100% accuracy across large fact sets, solving capacity, compaction, and goal drift issues that plague traditional prompt-based approaches.
Details
Motivation: Current LLMs use in-context memory (facts stored in prompts) as default, but this approach has critical limitations including capacity constraints, information loss through summarization, and goal drift during cascading operations.
Method: Proposes Knowledge Objects (KOs) - discrete hash-addressed tuples with O(1) retrieval. Benchmarks KOs against in-context memory across multiple frontier models, tests adversarial fact retrieval, and introduces density-adaptive retrieval as a switching mechanism.
Result: KOs achieve 100% accuracy across all conditions at 252x lower cost vs in-context memory. On multi-hop reasoning: 78.9% for KOs vs 31.6% for in-context. Embedding retrieval fails on adversarial facts (20% precision), and neural memory (Titans) stores but fails to retrieve facts on demand.
Conclusion: Knowledge Objects provide superior memory architecture for LLMs, solving fundamental limitations of in-context memory while enabling reliable fact retrieval and reasoning at scale.
Abstract: Large language models increasingly serve as persistent knowledge workers, with in-context memory - facts stored in the prompt - as the default strategy. We benchmark in-context memory against Knowledge Objects (KOs), discrete hash-addressed tuples with O(1) retrieval. Within the context window, Claude Sonnet 4.5 achieves 100% exact-match accuracy from 10 to 7,000 facts (97.5% of its 200K window). However, production deployment reveals three failure modes: capacity limits (prompts overflow at 8,000 facts), compaction loss (summarization destroys 60% of facts), and goal drift (cascading compaction erodes 54% of project constraints while the model continues with full confidence). KOs achieve 100% accuracy across all conditions at 252x lower cost. On multi-hop reasoning, KOs reach 78.9% versus 31.6% for in-context. Cross-model replication across four frontier models confirms compaction loss is architectural, not model-specific. We additionally show that embedding retrieval fails on adversarial facts (20% precision at 1) and that neural memory (Titans) stores facts but fails to retrieve them on demand. We introduce density-adaptive retrieval as a switching mechanism and release the benchmark suite.
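A minimal sketch of what a hash-addressed fact store in the spirit of Knowledge Objects could look like. The tuple schema and key derivation are assumptions, since the abstract does not specify the exact format; the point is O(1) exact retrieval instead of scanning an ever-growing prompt.

```python
import hashlib

class KnowledgeStore:
    """Facts as first-class objects: each fact is a discrete tuple stored
    under a content-derived key, giving O(1) exact retrieval."""
    def __init__(self):
        self._objects: dict[str, tuple[str, str, str]] = {}

    @staticmethod
    def _key(subject: str, predicate: str) -> str:
        # Hypothetical key scheme: hash of (subject, predicate)
        return hashlib.sha256(f"{subject}|{predicate}".encode()).hexdigest()

    def put(self, subject: str, predicate: str, value: str) -> str:
        key = self._key(subject, predicate)
        self._objects[key] = (subject, predicate, value)
        return key

    def get(self, subject: str, predicate: str):
        obj = self._objects.get(self._key(subject, predicate))
        return obj[2] if obj else None

store = KnowledgeStore()
store.put("project-alpha", "deadline", "2026-03-31")
assert store.get("project-alpha", "deadline") == "2026-03-31"
```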
[376] RPMS: Enhancing LLM-Based Embodied Planning through Rule-Augmented Memory Synergy
Zhenhang Yuan, Shenghai Yuan, Lihua Xie
Main category: cs.AI
TL;DR: RPMS is a conflict-managed architecture for LLM agents that enforces action feasibility via rule retrieval and gates memory via belief state, achieving significant performance gains in embodied environments like ALFWorld and ScienceWorld.
Details
Motivation: LLM agents often fail in closed-world embodied environments due to invalid actions and state drift, where actions don't satisfy strict preconditions and failure feedback is sparse, creating a degenerative cycle.
Method: RPMS uses structured rule retrieval to enforce action feasibility, a lightweight belief state to gate memory applicability, and rules-first arbitration to resolve conflicts between these sources.
Result: On ALFWorld, RPMS achieves 59.7% success with Llama 3.1 8B (+23.9 pp over baseline) and 98.5% with Claude Sonnet 4.5 (+11.9 pp). Rule retrieval alone contributes +14.9 pp. In ScienceWorld with GPT-4, it achieves avg. score 54.0 vs. 44.9 for ReAct baseline.
Conclusion: The architecture effectively addresses action feasibility and state drift in embodied environments, with rule retrieval being the dominant factor. Episodic memory is conditionally useful and becomes beneficial when filtered by current state and constrained by explicit action rules.
Abstract: LLM agents often fail in closed-world embodied environments because actions must satisfy strict preconditions – such as location, inventory, and container states – and failure feedback is sparse. We identify two structurally coupled failure modes: (P1) invalid action generation and (P2) state drift, each amplifying the other in a degenerative cycle. We present RPMS, a conflict-managed architecture that enforces action feasibility via structured rule retrieval, gates memory applicability via a lightweight belief state, and resolves conflicts between the two sources via rules-first arbitration. On ALFWorld (134 unseen tasks), RPMS achieves 59.7% single-trial success with Llama 3.1 8B (+23.9 pp over baseline) and 98.5% with Claude Sonnet 4.5 (+11.9 pp); of the 8B gain, rule retrieval alone contributes +14.9 pp (statistically significant), making it the dominant factor. A key finding is that episodic memory is conditionally useful: it harms performance on some task types when used without grounding, but becomes a stable net positive once filtered by current state and constrained by explicit action rules. Adapting RPMS to ScienceWorld with GPT-4 yields consistent gains across all ablation conditions (avg. score 54.0 vs. 44.9 for the ReAct baseline), providing transfer evidence that the core mechanisms hold across structurally distinct environments.
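A minimal sketch of rules-first arbitration as described above (illustrative, not the RPMS implementation): retrieved precondition rules filter candidate actions, and gated episodic memory only breaks ties among actions that remain feasible.

```python
def choose_action(candidates: list[str], rules, memory_hits: list[str], belief: dict):
    """Rules-first arbitration: precondition rules filter candidates, then
    episodic memory is consulted only within the feasible set."""
    feasible = [a for a in candidates if all(rule(a, belief) for rule in rules)]
    if not feasible:
        return None                       # replan instead of acting invalidly
    for remembered in memory_hits:        # memory is gated, never overrides rules
        if remembered in feasible:
            return remembered
    return feasible[0]

# One illustrative precondition rule (a stand-in, not an RPMS rule):
def must_hold_something_to_put(action: str, belief: dict) -> bool:
    return not action.startswith("put ") or bool(belief.get("inventory"))
```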
[377] AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse
Zhang Zhang, Shuqi Lu, Hongjin Qian, Di He, Zheng Liu
Main category: cs.AI
TL;DR: AgentFactory is a self-evolution paradigm for LLM-based agents that preserves successful task solutions as executable subagent code rather than textual prompts, enabling continuous capability accumulation through a growing library of refined subagents.
Details
Motivation: Existing LLM-based agent self-evolution approaches record successful experiences as textual prompts or reflections, which cannot reliably guarantee efficient task re-execution in complex scenarios. There's a need for a more robust and executable approach to agent evolution.
Method: AgentFactory preserves successful task solutions as executable subagent code (pure Python with standardized documentation) rather than textual experience. These subagents are continuously refined based on execution feedback, becoming increasingly robust and efficient over time as more tasks are encountered.
Result: AgentFactory enables continuous capability accumulation: its library of executable subagents grows and improves over time, progressively reducing the effort required for similar tasks without manual intervention. The system demonstrates effective self-evolution through executable code preservation.
Conclusion: AgentFactory presents a novel self-evolution paradigm that moves beyond textual experience recording to executable subagent code, offering more reliable task re-execution and continuous capability improvement for LLM-based agents.
Abstract: Building LLM-based agents has become increasingly important. Recent works on LLM-based agent self-evolution primarily record successful experiences as textual prompts or reflections, which cannot reliably guarantee efficient task re-execution in complex scenarios. We propose AgentFactory, a new self-evolution paradigm that preserves successful task solutions as executable subagent code rather than textual experience. Crucially, these subagents are continuously refined based on execution feedback, becoming increasingly robust and efficient as more tasks are encountered. Saved subagents are pure Python code with standardized documentation, enabling portability across any Python-capable system. We demonstrate that AgentFactory enables continuous capability accumulation: its library of executable subagents grows and improves over time, progressively reducing the effort required for similar tasks without manual intervention. Our implementation is open-sourced at https://github.com/zzatpku/AgentFactory, and our demonstration video is available at https://youtu.be/iKSsuAXJHW0.
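A minimal sketch of the subagent-accumulation idea, assuming a simple on-disk layout of Python source plus a JSON manifest; AgentFactory's actual format and refinement loop may differ.

```python
import json
import pathlib

def save_subagent(library: pathlib.Path, name: str, code: str, manifest: dict) -> None:
    """Persist a successful solution as executable Python plus standardized
    documentation, so later tasks can reuse it directly."""
    target = library / name
    target.mkdir(parents=True, exist_ok=True)
    (target / "agent.py").write_text(code)
    (target / "manifest.json").write_text(json.dumps(manifest, indent=2))

def refine_subagent(library: pathlib.Path, name: str, feedback: str, rewrite) -> None:
    """On execution feedback, ask an LLM (the hypothetical `rewrite` hook) to
    patch the stored code, so the subagent improves over time."""
    path = library / name / "agent.py"
    path.write_text(rewrite(path.read_text(), feedback))
```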
[378] Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence
Bofan Gong, Shiyang Lai, James Evans, Dawn Song
Main category: cs.AI
TL;DR: Polysemanticity in language models creates systematic vulnerabilities through interference patterns that generalize across model scales and families, enabling black-box control via interventions distilled from small models.
Details
Motivation: Polysemanticity (multiple meanings per neuron/feature) is a major challenge for interpreting and controlling language model behavior. The authors aim to understand whether polysemanticity patterns are stochastic or systematic, and whether insights from small models can transfer to larger ones.
Method: Used sparse autoencoders (SAEs) to map polysemantic topology in small models (Pythia-70M, GPT-2-Small). Identified SAE feature pairs with semantic interference, then performed interventions at four levels: prompt, token, feature, and neuron. Measured shifts in next-token predictions and tested whether interventions from small models transfer to larger instruction-tuned models (Llama-3.1-8B/70B-Instruct, Gemma-2-9B-Instruct).
Result: Found systematic interference patterns in polysemantic structures that expose model vulnerabilities. Interventions distilled from small models reliably transferred to larger models, producing predictable behavioral shifts without access to model internals. This demonstrates that interference structures generalize across scale and model families.
Conclusion: Polysemanticity is not purely stochastic but follows systematic patterns that generalize across models. This reveals a convergent, higher-order organization of internal representations that is weakly aligned with human intuition, offering new possibilities for black-box control and theoretical insights into cognition.
Abstract: Polysemanticity is pervasive in language models and remains a major challenge for interpretation and model behavioral control. Leveraging sparse autoencoders (SAEs), we map the polysemantic topology of two small models (Pythia-70M and GPT-2-Small) to identify SAE feature pairs that are semantically unrelated yet exhibit interference within models. We intervene at four foci (prompt, token, feature, neuron) and measure induced shifts in the next-token prediction distribution, uncovering polysemantic structures that expose a systematic vulnerability in these models. Critically, interventions distilled from counterintuitive interference patterns shared by two small models transfer reliably to larger instruction-tuned models (Llama-3.1-8B/70B-Instruct and Gemma-2-9B-Instruct), yielding predictable behavioral shifts without access to model internals. These findings challenge the view that polysemanticity is purely stochastic, demonstrating instead that interference structures generalize across scale and family. Such generalization suggests a convergent, higher-order organization of internal representations, which is only weakly aligned with intuition and structured by latent regularities, offering new possibilities for both black-box control and theoretical insight into human and artificial cognition.
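A minimal sketch of one way to surface candidate interfering feature pairs from an SAE, assuming a decoder matrix and a precomputed semantic-similarity matrix over feature labels; the paper's actual selection procedure is more involved, and the thresholds here are arbitrary.

```python
import torch

def interfering_feature_pairs(decoder: torch.Tensor, label_sim: torch.Tensor,
                              overlap_thresh: float = 0.5, sim_thresh: float = 0.1):
    """Flag SAE feature pairs whose decoder directions overlap strongly even
    though their labels are semantically unrelated.

    decoder:   (n_features, d_model) SAE decoder matrix
    label_sim: (n_features, n_features) semantic similarity of feature labels
    """
    dirs = torch.nn.functional.normalize(decoder, dim=-1)
    overlap = dirs @ dirs.T                                  # representational overlap
    mask = (overlap.abs() > overlap_thresh) & (label_sim < sim_thresh)
    mask.fill_diagonal_(False)
    pairs = mask.nonzero()
    pairs = pairs[pairs[:, 0] < pairs[:, 1]]                 # each unordered pair once
    return [(int(i), int(j)) for i, j in pairs.tolist()]
```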
[379] MLlm-DR: Towards Explainable Depression Recognition with MultiModal Large Language Models
Wei Zhang, Juan Chen, En Zhu, Wenhong Cheng, YunPeng Li, Yanbo J. Wang
Main category: cs.AI
TL;DR: MLlm-DR: A multimodal LLM for explainable depression diagnosis from interview videos, integrating speech/visual features with reasoning capabilities.
Details
Motivation: Existing depression diagnosis methods lack explainability for score determination, limiting clinical adoption. While LLMs offer explainability potential, current multimodal LLMs lack training on interview data and perform poorly for direct diagnosis.
Method: Proposes MLlm-DR with smaller LLM for score/rationale generation and lightweight query module (LQ-former) to capture depression-related speech/visual features. Uses robust training dataset for fine-tuning to enhance domain reasoning while maintaining practicality.
Result: Achieves state-of-the-art results on two interview-based benchmark datasets: CMDC and E-DAIC-WOZ, demonstrating effectiveness and superiority.
Conclusion: MLlm-DR enables explainable depression diagnosis by integrating multimodal understanding with logical reasoning, addressing limitations of previous methods for clinical adoption.
Abstract: Automated depression diagnosis aims to analyze multimodal information from interview videos to predict participants' depression scores. Previous studies often lack clear explanations of how these scores were determined, limiting their adoption in clinical practice. While the advent of LLMs provides a possible pathway for explainable depression diagnosis, current LLMs capable of processing multimodal data lack training on interview data, resulting in poor diagnostic performance when used directly. In this paper, we propose a novel multimodal large language model (MLlm-DR) that can understand multimodal information inputs and supports explainable depression diagnosis. MLlm-DR integrates a smaller LLM and a lightweight query module (LQ-former). Specifically, the smaller LLM is designed to generate depression scores and corresponding evaluation rationales. To enhance its logical reasoning for domain-specific tasks while maintaining practicality, we constructed a robust training dataset to fine-tune it. Meanwhile, the LQ-former captures depression-related features from speech and visual data, aiding the model's ability to process multimodal information and achieve comprehensive depression diagnosis. Our approach achieves state-of-the-art results on two interview-based benchmark datasets, CMDC and E-DAIC-WOZ, demonstrating its effectiveness and superiority.
[380] InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning
Gautam Sreekumar, Vishnu Naresh Boddeti
Main category: cs.AI
TL;DR: InPhyRe benchmark reveals LMMs struggle with inductive physical reasoning, showing poor adaptation to unseen physical laws and language bias issues.
Details
Motivation: Current LMMs encode physical laws as parametric knowledge from training but cannot adapt to novel physical environments with unseen laws, which is crucial for safety-critical applications where human-like inductive reasoning is needed.
Method: Proposed InPhyRe benchmark for evaluating inductive physical reasoning in LMMs using algorithmically generated synthetic videos of collision events, testing over 13 open-source and proprietary models.
Result: LMMs struggle to apply parametric knowledge to reasoning, show weak inductive reasoning for unseen physical laws, and suffer from language bias that causes them to ignore visual inputs, questioning their trustworthiness.
Conclusion: Inductive physical reasoning remains a significant challenge for LMMs, highlighting the need for benchmarks like InPhyRe to measure and improve their ability to adapt to novel physical environments beyond their training data.
Abstract: Large multimodal models (LMMs) encode physical laws observed during training, such as momentum conservation, as parametric knowledge. It allows LMMs to answer physical reasoning queries, such as the outcome of a potential collision event from visual input. However, since parametric knowledge includes only the physical laws seen during training, it is insufficient for reasoning in inference scenarios that follow physical laws unseen during training. In such novel physical environments, humans could adapt their physical reasoning based on provided demonstrations. This inductive physical reasoning ability is indispensable for LMMs if they are to replace human agents in safety-critical applications. Despite its importance, existing visual benchmarks do not evaluate inductive physical reasoning and only consider the parametric knowledge in LMMs. To this end, we propose InPhyRe, the first visual question answering benchmark to measure inductive physical reasoning in LMMs. InPhyRe evaluates LMMs’ ability to predict the outcome of collision events in algorithmically generated synthetic videos. By inspecting over 13 open-source and proprietary LMMs, InPhyRe informs us that (1) LMMs struggle to apply their limited parametric knowledge about universal physical laws to reasoning, (2) inductive physical reasoning in LMMs is weak when the physical laws underlying inference scenarios were unseen during training, and (3) inductive physical reasoning in LMMs suffers from language bias and may ignore the visual inputs, questioning the trustworthiness of LMMs regarding visual inputs.
[381] See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
Zongru Wu, Rui Mao, Zhiyuan Tian, Pengzhou Cheng, Tianjie Ju, Zheng Wu, Lingzhong Dong, Haiyue Sheng, Zhuosheng Zhang, Gongshen Liu
Main category: cs.AI
TL;DR: Proposes State-aware Reasoning (StaR) to improve multimodal agents’ reliability in executing binary toggle control instructions in GUI environments, addressing a key bottleneck in GUI control tasks.
Details
Motivation: Multimodal agents struggle with reliably executing toggle control instructions in GUI environments, particularly when the current toggle state already matches the desired state, creating a key bottleneck for practical GUI control applications.
Method: Constructs a state control benchmark with binary toggle instructions from public datasets, then proposes State-aware Reasoning (StaR) - a multimodal reasoning method that enables agents to perceive current toggle state, infer desired state from instructions, and act accordingly.
Result: StaR improves toggle instruction execution accuracy by over 30% on four multimodal agents, enhances general agentic task performance on three public benchmarks, and shows potential for real-world applications in dynamic environments.
Conclusion: State-aware Reasoning effectively addresses the toggle control reliability problem in multimodal GUI agents, demonstrating significant performance improvements and practical applicability for real-world GUI control tasks.
Abstract: The advent of multimodal agents facilitates effective interaction within graphical user interface (GUI), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions derived from public datasets. Evaluation results of existing agents demonstrate their notable unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose State-aware Reasoning (StaR), a multimodal reasoning method that enables agents to perceive the current toggle state, infer the desired state from the instruction, and act accordingly. Experiments on four multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public agentic benchmarks show that StaR also enhances general agentic task performance. Finally, evaluations on a dynamic environment highlight the potential of StaR for real-world applications. Code and benchmark: https://github.com/ZrW00/StaR.
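The core of state-aware reasoning for toggles reduces to a guard that agents evidently lack on the failure mode this paper isolates. A minimal sketch (illustrative of the see-think-act decision, not StaR's prompting method):

```python
def toggle_action(current_state: bool, instructed_state: bool) -> str:
    """Act only when the perceived toggle state differs from the state the
    instruction asks for; otherwise tapping would undo a satisfied goal."""
    return "no-op" if current_state == instructed_state else "tap_toggle"

assert toggle_action(current_state=True, instructed_state=True) == "no-op"
assert toggle_action(current_state=False, instructed_state=True) == "tap_toggle"
```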
[382] ScheduleMe: Multi-Agent Calendar Assistant
Oshadha Wijerathne, Amandi Nimasha, Dushan Fernando, Nisansa de Silva, Srinath Perera
Main category: cs.AI
TL;DR: ScheduleMe is a multi-agent calendar assistant that manages Google Calendar events through natural language using a graph-structured coordination mechanism with specialized task agents supervised by a central agent.
Details
Motivation: To create a more usable and flexible personal calendar assistant that can handle natural language commands, resolve ambiguities, and manage complex scheduling tasks through structured reasoning and agent cooperation.
Method: Uses a graph-structured coordination mechanism with a central supervisory agent that oversees specialized task agents. This allows for modularity, conflict resolution, and context-aware interactions to interpret user commands and manage calendar events.
Result: The system demonstrates how structured reasoning and agent cooperation can improve the usability and flexibility of personal calendar assistant tools, though specific performance metrics are not provided in the abstract.
Conclusion: ScheduleMe represents an example of how multi-agent systems with structured coordination can enhance conversational assistants for practical applications like calendar management.
Abstract: Recent advancements in LLMs have contributed to the rise of advanced conversational assistants that can address user needs through natural language conversation. This paper presents ScheduleMe, a multi-agent calendar assistant that lets users manage Google Calendar events in natural language. The system uses a graph-structured coordination mechanism in which a central supervisory agent oversees specialized task agents, allowing modularity, conflict resolution, and context-aware interactions to resolve ambiguities and evaluate user commands. This approach illustrates how structured reasoning and agent cooperation can improve the usability and flexibility of personal calendar assistant tools.
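A minimal sketch of the supervisor-routes-to-task-agents pattern the paper describes; `classify` and the agent table are hypothetical stand-ins for the LLM-driven components.

```python
def supervise(user_message: str, classify, agents: dict):
    """Central supervisor: classify the request, then delegate to the matching
    specialized task agent; unresolvable requests get a clarifying question."""
    intent = classify(user_message)          # e.g. "create", "reschedule", "query"
    agent = agents.get(intent)
    if agent is None:
        return "Could you clarify what you'd like to do with your calendar?"
    return agent(user_message)
```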
[383] Assessing LLM Reasoning Through Implicit Causal Chain Discovery in Climate Discourse
Liesbeth Allein, Nataly Pineda-Castañeda, Andrea Rocci, Marie-Francine Moens
Main category: cs.AI
TL;DR: LLMs are evaluated for mechanistic causal reasoning by generating intermediate causal steps between cause-effect pairs from climate change arguments, revealing they rely more on pattern matching than genuine causal reasoning.
Details
Motivation: To investigate whether LLMs can perform genuine mechanistic causal reasoning by discovering implicit intermediate causal steps between given cause-effect pairs, particularly in argumentation contexts like climate change discussions.
Method: Nine LLMs were instructed to generate all possible intermediate causal steps linking cause-effect pairs from climate change argumentation resources. The framework evaluates the number, granularity, self-consistency, and confidence of generated causal chains, with human evaluation for logical coherence.
Result: LLMs vary in the number and granularity of causal steps produced. They show self-consistency and confidence but rely mainly on associative pattern matching rather than genuine causal reasoning. Human evaluations confirmed the logical coherence of generated chains.
Conclusion: The study provides a baseline approach, diagnostic insights, and benchmark dataset for advancing implicit mechanistic causal reasoning in argumentation settings, highlighting current LLM limitations in genuine causal reasoning.
Abstract: How does a cause lead to an effect, and which intermediate causal steps explain their connection? This work scrutinizes the mechanistic causal reasoning capabilities of large language models (LLMs) to answer these questions through the task of implicit causal chain discovery. In a diagnostic evaluation framework, we instruct nine LLMs to generate all possible intermediate causal steps linking given cause-effect pairs in causal chain structures. These pairs are drawn from recent resources in argumentation studies featuring polarized discussion on climate change. Our analysis reveals that LLMs vary in the number and granularity of causal steps they produce. Although they are generally self-consistent and confident about the intermediate causal connections in the generated chains, their judgments are mainly driven by associative pattern matching rather than genuine causal reasoning. Nonetheless, human evaluations confirmed the logical coherence and integrity of the generated chains. Our baseline causal chain discovery approach, insights from our diagnostic evaluation, and benchmark dataset with causal chains lay a solid foundation for advancing future work in implicit, mechanistic causal reasoning in argumentation settings.
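A minimal sketch of how one might pose the chain-discovery task to a model; the prompt wording and parsing are illustrative assumptions, not the paper's protocol.

```python
def causal_chain_prompt(cause: str, effect: str) -> str:
    """Ask the model to enumerate the implicit intermediate causal steps."""
    return (
        f"Cause: {cause}\nEffect: {effect}\n"
        "List every plausible intermediate causal step linking the cause to the "
        "effect, in order, formatted as: step 1 -> step 2 -> step 3"
    )

def parse_chain(response: str) -> list[str]:
    """Split 'A -> B -> C' (possibly spanning lines) into ordered steps."""
    flat = " ".join(line.strip() for line in response.splitlines())
    return [step.strip() for step in flat.split("->") if step.strip()]
```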
[384] TheraMind: A Strategic and Adaptive Agent for Longitudinal Psychological Counseling
He Hu, Chiyuan Ma, Qianning Wang, Lin Liu, Yucheng Zhou, Laizhong Cui, Fei Ma, Qi Tian
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2510.25758 was rate-limited (HTTP 429).
[385] Efficient LLM Safety Evaluation through Multi-Agent Debate
Dachuan Lin, Guobin Shen, Zihao Yang, Tianrong Liu, Dongcheng Zhao, Yi Zeng
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2511.06396 was rate-limited (HTTP 429).
[386] Safety-Preserving PTQ via Contrastive Alignment Loss
Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2511.07842 was rate-limited (HTTP 429).
[387] Aligning Probabilistic Beliefs under Informative Missingness: LLM Steerability in Clinical Reasoning
Yuta Kobayashi, Vincent Jeanselme, Shalmali Joshi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2512.00479 was rate-limited (HTTP 429).
[388] Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning
Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan Lu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2512.15662 was rate-limited (HTTP 429).
[389] CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts
Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik
Main category: cs.AI
TL;DR: CircuitLM: A multi-agent pipeline that translates natural language circuit descriptions into structured, physically viable CircuitJSON schematics using retrieval-augmented generation and dual-layered verification.
Details
Motivation: Existing LLMs struggle with circuit schematic generation, often hallucinating components, violating physical constraints, and producing non-machine-readable outputs, creating a gap between natural language descriptions and viable hardware designs.
Method: Five-stage pipeline: (1) component identification, (2) canonical pinout retrieval from curated knowledge base, (3) chain-of-thought reasoning, (4) JSON schematic synthesis, (5) interactive force-directed visualization. Uses dual evaluation: deterministic Electrical Rule Checking (ERC) and LLM-as-a-judge meta-evaluator.
Result: Evaluated on 100 unique circuit-design prompts with 5 state-of-the-art LLMs. System demonstrates how retrieval-augmented generation with deterministic and semantic verification can produce structurally viable, schematic-ready hardware designs from natural language.
Conclusion: Targeted retrieval combined with deterministic and semantic verification bridges natural language to structurally viable circuit schematics, enabling safe circuit prototyping from high-level descriptions.
Abstract: Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronic design automation (EDA), as large language models (LLMs) frequently hallucinate components, violate strict physical constraints, and produce non-machine-readable outputs. To address this, we present CircuitLM, a multi-agent pipeline that translates user prompts into structured, visually interpretable $\texttt{CircuitJSON}$ schematics. The framework mitigates hallucination and ensures physical viability by grounding generation in a curated, embedding-powered component knowledge base through five sequential stages: (i) component identification, (ii) canonical pinout retrieval, (iii) chain-of-thought reasoning, (iv) JSON schematic synthesis, and (v) interactive force-directed visualization. We evaluate the system on a dataset of 100 unique circuit-design prompts using five state-of-the-art LLMs. To systematically assess performance, we deploy a rigorous dual-layered evaluation methodology: a deterministic Electrical Rule Checking (ERC) engine categorizes topological faults by strict severity (Critical, Major, Minor, Warning), while an LLM-as-a-judge meta-evaluator identifies complex, context-aware design flaws that bypass standard rule-based checkers. Ultimately, this work demonstrates how targeted retrieval combined with deterministic and semantic verification can bridge natural language to structurally viable, schematic-ready hardware and safe circuit prototyping. Our code and data will be made public.
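A toy electrical-rule check over a CircuitJSON-like structure, to make the deterministic ERC layer concrete; the schema and rules here are assumptions, as the paper's actual CircuitJSON format and severity taxonomy are richer.

```python
def erc_check(circuit: dict) -> list[tuple[str, str]]:
    """Toy deterministic ERC: flag floating pins and degenerate nets, using
    severity buckets like those the paper mentions. Returns (severity, message)."""
    findings = []
    connected = {pin for net in circuit.get("nets", []) for pin in net["pins"]}
    for comp in circuit.get("components", []):
        for pin in comp["pins"]:
            ref = f'{comp["id"]}.{pin}'
            if ref not in connected:
                findings.append(("Critical", f"floating pin {ref}"))
    for net in circuit.get("nets", []):
        if len(net["pins"]) < 2:
            findings.append(("Major", f'net "{net["name"]}" connects fewer than 2 pins'))
    return findings

example = {
    "components": [{"id": "R1", "pins": ["1", "2"]}, {"id": "LED1", "pins": ["A", "K"]}],
    "nets": [{"name": "N1", "pins": ["R1.2", "LED1.A"]}],
}
print(erc_check(example))  # R1.1 and LED1.K are floating -> two Critical findings
```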
[390] PaperScout: An Autonomous Agent for Academic Paper Search with Process-Aware Sequence-Level Policy Optimization
Tingyue Pan, Jie Ouyang, Mingyue Cheng, Qingchuan Li, Zirui Liu, Daoyu Wang, Mingfan Pan, Shuo Yu, Qi Liu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2601.10029 was rate-limited (HTTP 429).
[391] Chain of Mindset: Reasoning with Adaptive Cognitive Modes
Tianyi Jiang, Arctanx An, Hengyi Feng, Naixin Zhai, Haodong Li, Xiaomin Yu, Jiahui Liu, Hanwen Du, Shuo Zhang, Zhi Yang, Jie Huang, Youhua Li, Yongxin Ni, Huacan Wang, Ronghao Chen
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2602.10063 was rate-limited (HTTP 429).
[392] A Hierarchical Error-Corrective Graph Framework for Autonomous Agents with LLM-Based Action Generation
Cong Cao, Jingyao Zhang, Kun Tong
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2603.08388 was rate-limited (HTTP 429).
[393] Efficient Policy Learning with Hybrid Evaluation-Based Genetic Programming for Uncertain Agile Earth Observation Satellite Scheduling
Junhua Xue, Yuning Chen, Mingyan Shao, Yangming Zhou, Qinghua Wu, Yingwu Chen
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2603.08447 was rate-limited (HTTP 429).
[394] JobMatchAI: An Intelligent Job Matching Platform Using Knowledge Graphs, Semantic Search and Explainable AI
Mayank Vyas, Abhijit Chakraborty, Vivek Gupta
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2603.14558 was rate-limited (HTTP 429).
[395] The Comprehension-Gated Agent Economy: A Robustness-First Architecture for AI Economic Agency
Rahul Baxi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2603.15639 was rate-limited (HTTP 429).
[396] I Know What I Don’t Know: Latent Posterior Factor Models for Multi-Evidence Probabilistic Reasoning
Aliyu Agboola Alege
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2603.15670 was rate-limited (HTTP 429).
[397] Theoretical Foundations of Latent Posterior Factors: Formal Guarantees for Multi-Evidence Reasoning
Aliyu Agboola Alege
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2603.15674 was rate-limited (HTTP 429).
[398] Interpretable Context Methodology: Folder Structure as Agentic Architecture
Jake Van Clief, David McDermott
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2603.16021 was rate-limited (HTTP 429).
[399] TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas
Ai Jian, Xiaoyun Zhang, Wanrou Du, Jingqing Ruan, Jiangbo Pei, Weipeng Zhang, Ke Zeng, Xunliang Cai
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2603.16448 was rate-limited (HTTP 429).
[400] Theoretical Foundations of δ-margin Majority Voting
Margarita Boyarskaya, Panos Ipeirotis
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2111.06390 was rate-limited (HTTP 429).
[401] ACT-JEPA: Novel Joint-Embedding Predictive Architecture for Efficient Policy Representation Learning
Aleksandar Vujinovic, Aleksandar Kovacevic
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2501.14622 was rate-limited (HTTP 429).
[402] Reinforcement learning with learned gadgets to tackle hard quantum problems on real hardware
Akash Kundu, Leopoldo Sarra
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2411.00230 was rate-limited (HTTP 429).
[403] Oracular Programming: A Modular Foundation for Building LLM-Enabled Software
Jonathan Laurent, André Platzer
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2502.05310 was rate-limited (HTTP 429).
[404] Renormalization-Inspired Effective Field Neural Networks for Scalable Modeling of Classical and Quantum Many-Body Systems
Xi Liu, Yujun Zhao, Chun Yu Wan, Yang Zhang, Junwei Liu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2502.17665 was rate-limited (HTTP 429).
[405] Learning Over Dirty Data with Minimal Repairs
Cheng Zhen, Prayoga, Nischal Aryal, Arash Termehchy, Garrett Biwer, Lubna Alzamil
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2503.13921 was rate-limited (HTTP 429).
[406] SocialJax: An Evaluation Suite for Multi-agent Reinforcement Learning in Sequential Social Dilemmas
Zihao Guo, Shuqing Shi, Richard Willis, Tristan Tomilin, Joel Z. Leibo, Yali Du
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2503.14576 was rate-limited (HTTP 429).
[407] Minimum Volume Conformal Sets for Multivariate Regression
Sacha Braun, Liviu Aolaritei, Michael I. Jordan, Francis Bach
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2503.19068 was rate-limited (HTTP 429).
[408] Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey
Jing Liu, Yao Du, Kun Yang, Jiaqi Wu, Yan Wang, Xiping Hu, Zehua Wang, Yang Liu, Peng Sun, Azzedine Boukerche, Victor C.M. Leung
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2505.01821 was rate-limited (HTTP 429).
[409] Bi-Level Policy Optimization with Nyström Hypergradients
Arjun Prakash, Naicheng He, Denizalp Goktas, Jacob Makar-Limanov, Amy Greenwald
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2505.11714 was rate-limited (HTTP 429).
[410] RAGXplain: From Explainable Evaluation to Actionable Guidance of RAG Pipelines
Dvir Cohen, Tamir Houri, Lin Burg, Gilad Barkan
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2505.13538 was rate-limited (HTTP 429).
[411] MOBODY: Model Based Off-Dynamics Offline Reinforcement Learning
Yihong Guo, Yu Yang, Pan Xu, Anqi Liu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2506.08460 was rate-limited (HTTP 429).
[412] SatSOM: Saturation Self-Organizing Maps for Continual Learning
Igor Urbanik, Paweł Gajewski
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2506.10680 was rate-limited (HTTP 429).
[413] Improving Epidemic Analyses with Privacy-Preserving Integration of Sensitive Data
Zihan Guan, Zhiyuan Zhao, Fengwei Tian, Dung Nguyen, Payel Bhattacharjee, Ravi Tandon, B. Aditya Prakash, Anil Vullikanti
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2506.22342 was rate-limited (HTTP 429).
[414] Fast weight programming and linear transformers: from machine learning to neurobiology
Kazuki Irie, Samuel J. Gershman
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2508.08435 was rate-limited (HTTP 429).
[415] Role-Augmented Intent-Driven Generative Search Engine Optimization
Xiaolu Chen, Haojie Wu, Jie Bao, Zhen Chen, Yong Liao, Hu Huang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv metadata request for 2508.11158 was rate-limited (HTTP 429).
[416] AgriChrono: A Multi-modal Dataset Capturing Crop Growth and Lighting Variability with a Field Robot
Jaehwan Jeong, Tuan-Anh Vu, Mohammad Jony, Shahab Ahmad, Md. Mukhlesur Rahman, Sangpil Kim, M. Khalid Jawed
Main category: cs.AI
Summary unavailable: the arXiv API request for 2508.18694 returned HTTP 429 (rate limited).
[417] Learning Domain- and Class-Disentangled Prototypes for Domain-Generalized EEG Emotion Recognition
Guangli Li, Canbiao Wu, Zhehao Zhou, Na Tian, Li Zhang, Zhen Liang
Main category: cs.AI
Summary unavailable: the arXiv API request for 2509.01135 returned HTTP 429 (rate limited).
[418] Tree Search for LLM Agent Reinforcement Learning
Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
Main category: cs.AI
Summary unavailable: the arXiv API request for 2509.21240 returned HTTP 429 (rate limited).
[419] In-Context Compositional Q-Learning for Offline Reinforcement Learning
Qiushui Xu, Yuhao Huang, Yushu Jiang, Lei Song, Jinyu Wang, Wenliang Zheng, Jiang Bian
Main category: cs.AI
Summary unavailable: the arXiv API request for 2509.24067 returned HTTP 429 (rate limited).
[420] Stable Forgetting: Bounded Parameter-Efficient Unlearning in Foundation Models
Arpit Garg, Hemanth Saratchandran, Ravi Garg, Simon Lucey
Main category: cs.AI
Summary unavailable: the arXiv API request for 2509.24166 returned HTTP 429 (rate limited).
[421] Scalable Energy-Based Models via Adversarial Training: Unifying Discrimination and Generation
Xuwang Yin, Claire Zhang, Julie Steele, Nir Shavit, Tony T. Wang
Main category: cs.AI
Summary unavailable: the arXiv API request for 2510.13872 returned HTTP 429 (rate limited).
[422] Personalized Motion Guidance Framework for Athlete-Centric Coaching
Ryota Takamido, Chiharu Suzuki, Hiroki Nakamoto
Main category: cs.AI
Summary unavailable: the arXiv API request for 2510.10496 returned HTTP 429 (rate limited).
[423] CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions
Lizhi Yang, Blake Werner, Massimiliano de Sa, Aaron D. Ames
Main category: cs.AI
Summary unavailable: the arXiv API request for 2510.14959 returned HTTP 429 (rate limited).
[424] A robust methodology for long-term sustainability evaluation of Machine Learning models
Jorge Paz-Ruza, João Gama, Amparo Alonso-Betanzos, Bertha Guijarro-Berdiñas
Main category: cs.AI
Summary unavailable: the arXiv API request for 2511.08120 returned HTTP 429 (rate limited).
[425] Reduced Density Matrices Through Machine Learning
Awwab A. Azam, Lexu Zhao, Jiabin Yu
Main category: cs.AI
Summary unavailable: the arXiv API request for 2511.07367 returned HTTP 429 (rate limited).
[426] Genomic Next-Token Predictors are In-Context Learners
Nathan Breslow, Aayush Mishra, Mahler Revsine, Michael C. Schatz, Anqi Liu, Daniel Khashabi
Main category: cs.AI
Summary unavailable: the arXiv API request for 2511.12797 returned HTTP 429 (rate limited).
[427] Volumetric Ergodic Control
Jueun Kwon, Max M. Sun, Todd Murphey
Main category: cs.AI
Summary unavailable: the arXiv API request for 2511.11533 returned HTTP 429 (rate limited).
[428] A Comedy of Estimators: On KL Regularization in RL Training of LLMs
Vedant Shah, Johan Obando-Ceron, Vineet Jain, Brian Bartoldson, Bhavya Kailkhura, Sarthak Mittal, Glen Berseth, Pablo Samuel Castro, Yoshua Bengio, Nikolay Malkin, Moksh Jain, Siddarth Venkatraman, Aaron Courville
Main category: cs.AI
Summary unavailable: the arXiv API request for 2512.21852 returned HTTP 429 (rate limited).
[429] Evaluating Feature Dependent Noise in Preference-based Reinforcement Learning
Yuxuan Li, Harshith Reddy Kethireddy, Srijita Das
Main category: cs.AI
Summary unavailable: the arXiv API request for 2601.01904 returned HTTP 429 (rate limited).
[430] Learning Adaptive Distribution Alignment with Neural Characteristic Function for Graph Domain Adaptation
Wei Chen, Xingyu Guo, Shuang Li, Zhao Zhang, Yan Zhong, Fuzhen Zhuang, Deqing wang
Main category: cs.AI
Summary unavailable: the arXiv API request for 2602.10489 returned HTTP 429 (rate limited).
[431] Failing on Bias Mitigation: A Case Study on the Challenges of Fairness in Government Data
Hongbo Bo, Jingyu Hu, Debbie Watson, Weiru Liu
Main category: cs.AI
Summary unavailable: the arXiv API request for 2601.17054 returned HTTP 429 (rate limited).
[432] interwhen: A Generalizable Framework for Verifiable Reasoning with Test-time Monitors
Vishak K Bhat, Prateek Chanda, Vijval Ekbote, Ashmit Khandelwal, Maitreyi Swaroop, Vineeth N. Balasubramanian, Subbarao Kambhampati, Nagarajan Natarajan, Amit Sharma
Main category: cs.AI
Summary unavailable: the arXiv API request for 2602.11202 returned HTTP 429 (rate limited).
[433] Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport
Harry Amad, Mihaela van der Schaar
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.01771 returned HTTP 429 (rate limited).
[434] Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning
Huihan Liu, Changyeon Kim, Bo Liu, Minghuan Liu, Yuke Zhu
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.03818 returned HTTP 429 (rate limited).
[435] Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.04427 returned HTTP 429 (rate limited).
[436] JAWS: Enhancing Long-term Rollout of Neural PDE Solvers via Spatially-Adaptive Jacobian Regularization
Fengxiang Nie, Yasuhiro Suzuki
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.05538 returned HTTP 429 (rate limited).
[437] LUMINA: LLM-Guided GPU Architecture Exploration via Bottleneck Analysis
Tao Zhang, Rui Ma, Shuotao Xu, Yongqiang Xiong, Peng Cheng
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.05904 returned HTTP 429 (rate limited).
[438] Exploiting Adaptive Channel Pruning for Communication-Efficient Split Learning
Jialei Tan, Zheng Lin, Xiangming Cai, Ruoxi Zhu, Zihan Fang, Pingping Chen, Wei Ni
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.09792 returned HTTP 429 (rate limited).
[439] Real-World AI Evaluation: How FRAME Generates Systematic Evidence to Resolve the Decision-Maker’s Dilemma
Reva Schwartz, Gabriella Waters
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.13294 returned HTTP 429 (rate limited).
[440] Ethical Fairness without Demographics in Human-Centered AI
Shaily Roy, Harshit Sharma, Daniel A. Adler, Tanzeem Choudhury, Asif Salekin
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.13373 returned HTTP 429 (rate limited).
[441] OMNIFLOW: A Physics-Grounded Multimodal Agent for Generalized Scientific Reasoning
Hao Wu, Yongheng Zhang, Yuan Gao, Fan Xu, Fan Zhang, Ruobing Xie, Ruijian Gou, Yuxuan Liang, Xiaomeng Huang, Xian Wu
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.15797 returned HTTP 429 (rate limited).
[442] SAATT Nav: a Socially Aware Autonomous Transparent Transportation Navigation Framework for Wheelchairs
Yutong Zhang, Shaiv Y. Mehra, Bradley S. Duerstock, Juan P. Wachs
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.13698 returned HTTP 429 (rate limited).
[443] PhasorFlow: A Python Library for Unit Circle Based Computing
Dibakar Sigdel, Namuna Panday
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.15886 returned HTTP 429 (rate limited).
[444] REFINE-DP: Diffusion Policy Fine-tuning for Humanoid Loco-manipulation via Reinforcement Learning
Zhaoyuan Gu, Yipu Chen, Zimeng Chai, Alfred Cueva, Thong Nguyen, Yifan Wu, Huishu Xue, Minji Kim, Isaac Legene, Fukang Liu, Matthew Kim, Ayan Barula, Yongxin Chen, Ye Zhao
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.13707 returned HTTP 429 (rate limited).
[445] “I’m Not Reading All of That”: Understanding Software Engineers’ Level of Cognitive Engagement with Agentic Coding Assistants
Carlos Rafael Catalan, Lheane Marie Dizon, Patricia Nicole Monderin, Emily Kuang
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.14225 returned HTTP 429 (rate limited).
[446] Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database
Madhulatha Mandarapu, Sandeep Kunkunuru
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.15080 returned HTTP 429 (rate limited).
[447] A Framework and Prototype for a Navigable Map of Datasets in Engineering Design and Systems Engineering
H. Sinan Bank, Daniel R. Herber
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.15722 returned HTTP 429 (rate limited).
[448] 100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models
Yeounoh Chung, Rushabh Desai, Jian He, Yu Xiao, Thibaud Hottelier, Yves-Laurent Kom Samo, Pushkar Kadilkar, Xianshun Chen, Sam Idicula, Fatma Özcan, Alon Halevy, Yannis Papakonstantinou
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.15970 returned HTTP 429 (rate limited).
[449] DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping
Yuliang Wu, Yanhan Lin, WengKit Lao, Yuhao Lin, Yi-Lin Wei, Wei-Shi Zheng, Ancong Wu
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.16806 returned HTTP 429 (rate limited).
cs.SD
[450] Quantizer-Aware Hierarchical Neural Codec Modeling for Speech Deepfake Detection
Jinyang Wu, Zihan Pan, Qiquan Zhang, Sailor Hardik Bhupendra, Soumik Mondal
Main category: cs.SD
TL;DR: A hierarchy-aware representation learning framework for speech deepfake detection that leverages the coarse-to-fine structure of neural audio codecs’ residual vector quantization to capture complementary acoustic cues for forensic analysis.
Details
Motivation: Neural audio codecs use residual vector quantization (RVQ) to discretize speech, creating a coarse-to-fine hierarchy across quantizers. While codec models have been explored for representation learning, their discrete structure remains underutilized in speech deepfake detection. Different quantization levels capture complementary acoustic cues, where early quantizers encode coarse structure and later quantizers refine residual details that reveal synthesis artifacts. Existing systems either rely on continuous encoder features or ignore this quantizer-level hierarchy.
Method: Proposes a hierarchy-aware representation learning framework that models quantizer-level contributions through learnable global weighting, enabling structured codec representations aligned with forensic cues. The approach keeps the speech encoder backbone frozen and updates only 4.4% additional parameters.
Result: Achieves relative EER reductions of 46.2% on ASVspoof 2019 and 13.9% on ASVspoof5 over strong baselines.
Conclusion: The proposed framework effectively leverages the hierarchical structure of neural audio codecs for speech deepfake detection, demonstrating significant performance improvements with minimal parameter updates by modeling quantizer-level contributions through learnable weighting.
Abstract: Neural audio codecs discretize speech via residual vector quantization (RVQ), forming a coarse-to-fine hierarchy across quantizers. While codec models have been explored for representation learning, their discrete structure remains underutilized in speech deepfake detection. In particular, different quantization levels capture complementary acoustic cues, where early quantizers encode coarse structure and later quantizers refine residual details that reveal synthesis artifacts. Existing systems either rely on continuous encoder features or ignore this quantizer-level hierarchy. We propose a hierarchy-aware representation learning framework that models quantizer-level contributions through learnable global weighting, enabling structured codec representations aligned with forensic cues. Keeping the speech encoder backbone frozen and updating only 4.4% additional parameters, our method achieves relative EER reductions of 46.2% on ASVspoof 2019 and 13.9% on ASVspoof5 over strong baselines.
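To make the learnable global weighting concrete, here is a minimal PyTorch sketch of one plausible reading: a softmax over per-quantizer weights pools frozen codec features before a small classifier head. The level count, feature size, and head are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class QuantizerWeightedPooling(nn.Module):
    """Toy sketch: combine per-quantizer codec features with learnable
    global weights, then classify bonafide vs. spoof. Shapes and the
    classifier head are illustrative assumptions, not the paper's design."""
    def __init__(self, num_quantizers: int = 8, feat_dim: int = 256):
        super().__init__()
        # One scalar logit per RVQ level; softmax yields the global weighting.
        self.level_logits = nn.Parameter(torch.zeros(num_quantizers))
        self.head = nn.Linear(feat_dim, 2)

    def forward(self, level_feats: torch.Tensor) -> torch.Tensor:
        # level_feats: (batch, num_quantizers, time, feat_dim) from a frozen codec.
        w = torch.softmax(self.level_logits, dim=0)                  # (Q,)
        pooled = (w[None, :, None, None] * level_feats).sum(dim=1)   # (B, T, D)
        utt = pooled.mean(dim=1)                                     # temporal average
        return self.head(utt)                                        # (B, 2) logits

model = QuantizerWeightedPooling()
x = torch.randn(4, 8, 100, 256)  # fake features: 4 clips, 8 RVQ levels
print(model(x).shape)            # torch.Size([4, 2])
```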
[451] Music Source Restoration with Ensemble Separation and Targeted Reconstruction
Xinlong Deng, Yu Xia, Jie Jiang
Main category: cs.SD
TL;DR: Two-stage system for Music Source Restoration (MSR) combining separation models and BSRNN-based restoration models to recover original stems from mixed/mastered music
Details
Motivation: MSR challenge requires reversing complex production processes (EQ, compression, reverb) beyond conventional source separation, addressing real-world degradations in music production.
Method: Two-stage approach: 1) Ensemble of pre-trained separation models produces initial source estimates, 2) Pre-trained BSRNN-based restoration models perform targeted reconstruction to refine estimates
Result: System surpasses baselines on all metrics in official MSR benchmark, ranking second among all submissions
Conclusion: Proposed two-stage system effectively addresses MSR challenge by combining separation and restoration, with code publicly available
Abstract: The Inaugural Music Source Restoration (MSR) Challenge targets the recovery of original, unprocessed stems from fully mixed and mastered music. Unlike conventional music source separation, MSR requires reversing complex production processes such as equalization, compression, reverberation, and other real-world degradations. To address MSR, we propose a two-stage system. First, an ensemble of pre-trained separation models produces preliminary source estimates. Then a set of pre-trained BSRNN-based restoration models performs targeted reconstruction to refine these estimates. On the official MSR benchmark, our system surpasses the baselines on all metrics, ranking second among all submissions. The code is available at https://github.com/xinghour/Music-source-restoration-CUPAudioGroup
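The two-stage pipeline is straightforward to sketch. Below, separator outputs are averaged into preliminary stem estimates and then refined stem by stem; the callables are placeholders standing in for the paper's pre-trained separators and BSRNN-based restorers.

```python
import numpy as np

def ensemble_then_restore(mixture, separators, restorers):
    """Toy two-stage sketch: average stem estimates from several separation
    models, then refine each stem with a stem-specific restoration model.
    All models here are placeholder callables, not the paper's networks."""
    # Stage 1: each separator maps the mixture to a dict of stem -> waveform.
    estimates = [sep(mixture) for sep in separators]
    stems = estimates[0].keys()
    averaged = {s: np.mean([est[s] for est in estimates], axis=0) for s in stems}
    # Stage 2: targeted reconstruction per stem.
    return {s: restorers[s](averaged[s]) for s in stems}

# Dummy stand-ins so the sketch runs end to end.
fake_sep = lambda mix: {"vocals": mix * 0.5, "drums": mix * 0.3}
identity = lambda x: x
out = ensemble_then_restore(np.zeros(16000), [fake_sep, fake_sep],
                            {"vocals": identity, "drums": identity})
print({k: v.shape for k, v in out.items()})
```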
[452] Modeling Overlapped Speech with Shuffles
Matthew Wiesner, Samuele Cornell, Alexander Polok, Lucas Ondel Yang, Lukáš Burget, Sanjeev Khudanpur
Main category: cs.SD
TL;DR: Novel approach using shuffle products and partial order FSAs for single-pass alignment and speaker-attributed transcription of overlapped speech
Details
Motivation: Current methods for overlapped speech processing require multiple passes or complex pipelines; need efficient single-pass solution for alignment and speaker attribution.
Method: Use shuffle product and partial order finite-state automata (FSAs) to model parallel data streams; train with total score on FSAs as loss function, marginalizing over all possible serializations; add temporal constraints to reduce graph size; model (token, speaker) tuples directly for speaker attribution
Result: First algorithm enabling single-pass alignment of multi-talker recordings; evaluated on synthetic LibriSpeech overlaps; implemented using k2/Icefall framework
Conclusion: Shuffle products and partial order FSAs provide effective framework for overlapped speech processing with single-pass alignment capability
Abstract: We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.
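The shuffle product at the heart of this method is just the set of all order-preserving interleavings of two sequences. The exhaustive enumeration below is exponential and purely for intuition; the paper represents this set compactly as an FSA and marginalizes over it.

```python
def shuffles(a: tuple, b: tuple):
    """Enumerate the shuffle product of two sequences: every interleaving
    that preserves the internal order of each sequence. There are
    C(len(a)+len(b), len(a)) such serializations."""
    if not a:
        yield b
        return
    if not b:
        yield a
        return
    for tail in shuffles(a[1:], b):
        yield (a[0],) + tail
    for tail in shuffles(a, b[1:]):
        yield (b[0],) + tail

# Two short overlapping "speakers": C(4, 2) = 6 possible serializations.
for s in shuffles(("hi", "there"), ("ok", "bye")):
    print(s)
```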
[453] NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation
Qinke Ni, Huan Liao, Dekun Chen, Yuxiang Wang, Zhizheng Wu
Main category: cs.SD
TL;DR: NV-Bench: A benchmark for evaluating nonverbal vocalizations in text-to-speech systems with standardized metrics and human reference audio.
Details
Motivation: Current TTS systems lack standardized evaluation metrics and reliable ground-truth references for nonverbal vocalizations, which are increasingly integrated but poorly evaluated.
Method: Proposes NV-Bench with 1,651 multilingual utterances across 14 NV categories, featuring dual-dimensional evaluation: Instruction Alignment (using paralinguistic character error rate) and Acoustic Fidelity (measuring distributional gap to real recordings).
Result: Strong correlation between objective metrics and human perception, establishing NV-Bench as a standardized evaluation framework for TTS models with nonverbal vocalizations.
Conclusion: NV-Bench provides the first benchmark with functional taxonomy for evaluating nonverbal vocalizations in TTS systems, enabling standardized assessment of controllability and acoustic realism.
Abstract: While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluations lack standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that treats NVs as communicative acts rather than acoustic artifacts. NV-Bench comprises 1,651 multi-lingual, in-the-wild utterances with paired human reference audio, balanced across 14 NV categories. We introduce a dual-dimensional evaluation protocol: (1) Instruction Alignment, utilizing the proposed paralinguistic character error rate (PCER) to assess controllability, (2) Acoustic Fidelity, measuring the distributional gap to real recordings to assess acoustic realism. We evaluate diverse TTS models and develop two baselines. Experimental results demonstrate a strong correlation between our objective metrics and human perception, establishing NV-Bench as a standardized evaluation framework.
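The proposed PCER metric is only named in the abstract, so the sketch below is a guess at its general shape: an edit-distance error rate computed over the nonverbal tags extracted from reference and hypothesis transcripts. The bracketed tag format is an assumption, not the paper's notation.

```python
import re

def edit_distance(ref, hyp):
    """Standard Levenshtein distance via dynamic programming."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def pcer(ref_text: str, hyp_text: str) -> float:
    """Sketch of a paralinguistic error rate: compare only the nonverbal
    tags (assumed bracketed, e.g. '[laugh]') in reference vs. hypothesis.
    The paper's exact PCER formulation may differ."""
    tags = lambda t: re.findall(r"\[[a-z]+\]", t)
    ref, hyp = tags(ref_text), tags(hyp_text)
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(pcer("well [laugh] fine [sigh]", "well [laugh] fine [cough]"))  # 0.5
```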
cs.LG
[454] A foundation model for electrodermal activity data
Leonardo Alchieri, Matteo Garzon, Lidia Alecci, Francesco Bombassei De Bona, Martin Gjoreski, Giovanni De Felice, Silvia Santini
Main category: cs.LG
TL;DR: EDAMAME is a large-scale public EDA dataset collection, and UME is the first foundation model trained on it, outperforming baselines with 20x fewer resources.
Details
Motivation: Progress in electrodermal activity (EDA) modeling is hindered by lack of large-scale, curated, open datasets. EDA reflects sympathetic nervous system activity and is used to infer cognitive load, stress, and engagement, but existing resources are limited and proprietary.
Method: Compiled EDAMAME - a collection of EDA traces from 24 public datasets (25,000+ hours from 634 users). Trained UME, the first dedicated foundation model for EDA, on this resource.
Result: UME outperforms baselines in 8 out of 10 scenarios and matches generalist timeseries foundation models while using 20x fewer computational resources. However, results also highlight intrinsic challenges of EDA modeling.
Conclusion: EDAMAME and UME provide valuable resources for EDA research, but further work is needed to unlock EDA’s full potential. All datasets, model weights, and code are released publicly.
Abstract: Foundation models have recently extended beyond natural language and vision to timeseries domains, including physiological signals. However, progress in electrodermal activity (EDA) modeling is hindered by the absence of large-scale, curated, and openly accessible datasets. EDA reflects sympathetic nervous system activity and is widely used to infer cognitive load, stress, and engagement. Yet very few wearable devices provide continuous, unobtrusive sensing, and the only large-scale archive to date is proprietary. To address this gap, we compile EDAMAME, a collection of EDA traces from 24 public datasets, comprising more than 25,000 hours from 634 users. Using this resource, we train UME, the first dedicated foundation model for EDA. In eight out of ten scenarios, UME outperforms baselines and matches generalist timeseries foundation models while using 20x fewer computational resources. Our findings, however, also highlight the intrinsic challenges of EDA modeling, motivating further research to unlock its full potential. All datasets, model weights, and code are released to support further research.
[455] Federated Multi Agent Deep Learning and Neural Networks for Advanced Distributed Sensing in Wireless Networks
Nadine Muller, Stefano DeRosa, Su Zhang, Chun Lee Huan
Main category: cs.LG
TL;DR: Survey paper on multi-agent deep learning for distributed sensing and wireless communications, covering learning formulations, neural architectures, advanced techniques, and applications relevant to 5G-Advanced/6G systems.
Details
Motivation: The increasing integration of sensing, communication, and computing in wireless systems (5G-Advanced/6G) creates complex decentralized control problems that require multi-agent deep learning approaches for effective decision-making and inference.
Method: Comprehensive survey methodology with task-driven taxonomy across four dimensions: (1) learning formulations (Markov games, Dec-POMDPs, CTDE), (2) neural architectures (GNN-based, attention-based, hierarchical), (3) advanced techniques (federated RL, communication-efficient methods), and (4) application domains (MEC offloading, UAV networks, ISAC).
Result: Synthesis of state-of-the-art research (2021-2025) with comparative tables of algorithms, training topologies, and system-level trade-offs in latency, spectral efficiency, energy, privacy, and robustness.
Conclusion: Identifies open challenges (scalability, non-stationarity, security, communication overhead, real-time safety) and outlines research directions toward 6G-native sense-communicate-compute-learn systems.
Abstract: Multi-agent deep learning (MADL), including multi-agent deep reinforcement learning (MADRL), distributed/federated training, and graph-structured neural networks, is becoming a unifying framework for decision-making and inference in wireless systems where sensing, communication, and computing are tightly coupled. Recent 5G-Advanced and 6G visions strengthen this coupling through integrated sensing and communication, edge intelligence, open programmable RAN, and non-terrestrial/UAV networking, which create decentralized, partially observed, time-varying, and resource-constrained control problems. This survey synthesizes the state of the art, with emphasis on 2021-2025 research, on MADL for distributed sensing and wireless communications. We present a task-driven taxonomy across (i) learning formulations (Markov games, Dec-POMDPs, CTDE), (ii) neural architectures (GNN-based radio resource management, attention-based policies, hierarchical learning, and over-the-air aggregation), (iii) advanced techniques (federated reinforcement learning, communication-efficient federated deep RL, and serverless edge learning orchestration), and (iv) application domains (MEC offloading with slicing, UAV-enabled heterogeneous networks with power-domain NOMA, intrusion detection in sensor networks, and ISAC-driven perceptive mobile networks). We also provide comparative tables of algorithms, training topologies, and system-level trade-offs in latency, spectral efficiency, energy, privacy, and robustness. Finally, we identify open issues including scalability, non-stationarity, security against poisoning and backdoors, communication overhead, and real-time safety, and outline research directions toward 6G-native sense-communicate-compute-learn systems.
[456] Multi-Agent Reinforcement Learning for Dynamic Pricing: Balancing Profitability, Stability and Fairness
Krishna Kumar Neelakanta Pillai Santha Kumari Amma
Main category: cs.LG
TL;DR: Systematic evaluation of multi-agent RL (MAPPO, MADDPG) for dynamic pricing in competitive retail markets, showing MAPPO achieves highest profits with low variance while MADDPG provides fairest profit distribution.
Details
Motivation: Dynamic pricing in competitive retail markets requires adaptive strategies that respond to fluctuating demand and competitor behavior. Current approaches need systematic evaluation of multi-agent reinforcement learning methods for this problem.
Method: Empirical evaluation of MAPPO and MADDPG algorithms for dynamic price optimization under competition, using simulated marketplace environment derived from real-world retail data. Benchmarked against Independent DDPG (IDDPG) baseline across profit performance, stability, fairness, and training efficiency metrics.
Result: MAPPO consistently achieves highest average returns with low variance, offering stable and reproducible approach. MADDPG achieves slightly lower profit but fairest profit distribution among agents. Both MARL methods outperform independent learning baseline.
Conclusion: MARL methods, particularly MAPPO, provide scalable and stable alternative to independent learning approaches for dynamic retail pricing in competitive markets.
Abstract: Dynamic pricing in competitive retail markets requires strategies that adapt to fluctuating demand and competitor behavior. In this work, we present a systematic empirical evaluation of multi-agent reinforcement learning (MARL) approaches-specifically MAPPO and MADDPG-for dynamic price optimization under competition. Using a simulated marketplace environment derived from real-world retail data, we benchmark these algorithms against an Independent DDPG (IDDPG) baseline, a widely used independent learner in MARL literature. We evaluate profit performance, stability across random seeds, fairness, and training efficiency. Our results show that MAPPO consistently achieves the highest average returns with low variance, offering a stable and reproducible approach for competitive price optimization, while MADDPG achieves slightly lower profit but the fairest profit distribution among agents. These findings demonstrate that MARL methods-particularly MAPPO-provide a scalable and stable alternative to independent learning approaches for dynamic retail pricing.
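The evaluation axes (profit, cross-seed stability, fairness) are easy to illustrate with generic metrics. The sketch below uses Jain's fairness index and the standard deviation of mean returns across seeds; these are common choices, not necessarily the paper's exact definitions.

```python
import numpy as np

def jains_index(profits):
    """Jain's fairness index in [1/n, 1]; 1 means perfectly equal profits.
    A generic fairness measure; the paper may define fairness differently."""
    p = np.asarray(profits, dtype=float)
    return p.sum() ** 2 / (len(p) * (p ** 2).sum())

# returns[seed][agent] = final profit of each pricing agent for one seed.
returns = np.array([[10.0, 9.0, 11.0],
                    [10.5, 8.5, 10.0],
                    [ 9.5, 9.5, 10.5]])
print("mean return per agent:", returns.mean(axis=0))
print("cross-seed std (stability):", returns.mean(axis=1).std())
print("fairness per seed:", [round(jains_index(r), 3) for r in returns])
```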
[457] From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning
Omer Nacar, Deema Alquffari, Saleh Alsharideh, Adeem AlOtaibi, Abdulaziz Alabdulkarim, Leen Alhazmi, Nada Alomar, Wareef Alzubaidi, Nada Alsultan, Ahmed Alrabghi, Demah Alhoshan, Rana Alsayyari, Hamed Alruwaili, Albaraa Jaafar, Khaled Alusmani, Abdulaziz Alsohimy, Munirah Alsubaie, Shahd Aldukhayil, Arwa Alali, Yazeed BinShihah, Razan Alsulaymi, Nourah Alhumaid, Razan Abdulsalam, Reem Alamoudi, Mohammed Alkhalifa
Main category: cs.LG
TL;DR: AISA-AR-FunctionCall is a production-oriented Arabic function-calling framework that addresses severe structural instability in existing models when applied to Arabic, achieving dramatic improvements in parse success and accuracy.
Details
Motivation: Existing function-calling language models exhibit severe structural instability when applied to Arabic, making them unreliable for agentic AI systems that need to translate natural language into executable structured actions in Arabic contexts.
Method: Built on a 270M-parameter FunctionGemma backbone with systematic dataset auditing, schema repair, tool-aware prompt restructuring, and full-parameter supervised fine-tuning. Also explored a reasoning-augmented LoRA variant with explicit intermediate reasoning before tool invocation.
Result: Fine-tuning reduces parse failures from 87% to below 1%, improves function name accuracy by more than eightfold, and substantially enhances argument alignment across dialects and domains. Error analysis shows transition from structural collapse to semantic misalignment.
Conclusion: The framework successfully addresses Arabic function-calling instability, revealing that serialization stability and decision-level reasoning are separable challenges. All datasets and models are publicly released under the AISA framework.
Abstract: Function-calling language models are essential for agentic AI systems that translate natural language into executable structured actions, yet existing models exhibit severe structural instability when applied to Arabic. We present AISA-AR-FunctionCall, a production-oriented Arabic function-calling framework built on a 270M-parameter FunctionGemma backbone and trained through systematic dataset auditing, schema repair, tool-aware prompt restructuring, and full-parameter supervised fine-tuning. On a held-out test set, fine-tuning reduces parse failures from 87% to below 1%, improves function name accuracy by more than eightfold, and substantially enhances argument alignment across dialects and domains. Error analysis reveals a transition from structural collapse to semantic misalignment, suggesting that serialization stability and decision-level reasoning are separable challenges. We further explore a reasoning-augmented LoRA variant that introduces explicit intermediate reasoning prior to tool invocation. All datasets and models are publicly released under the AISA framework.
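The two headline numbers, parse failure rate and function name accuracy, can be computed in a few lines. The sketch below assumes tool calls are serialized as JSON objects with a "name" field; the paper's actual serialization format may differ.

```python
import json

def tool_call_metrics(outputs, references):
    """Sketch of the two headline metrics: parse failure rate (does the
    model emit valid JSON at all?) and function-name accuracy, counting
    malformed outputs as name misses. Serialization format is assumed."""
    parse_fail = name_hit = 0
    for out, ref in zip(outputs, references):
        try:
            call = json.loads(out)
        except json.JSONDecodeError:
            parse_fail += 1
            continue
        name_hit += call.get("name") == ref["name"]
    n = len(outputs)
    return {"parse_failure_rate": parse_fail / n,
            "function_name_accuracy": name_hit / n}

outs = ['{"name": "get_weather", "arguments": {"city": "Riyadh"}}',
        'get_weather(Riyadh']  # second output is structurally broken
refs = [{"name": "get_weather"}, {"name": "get_weather"}]
print(tool_call_metrics(outs, refs))
```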
[458] What on Earth is AlphaEarth? Hierarchical structure and functional interpretability for global land cover
Ivan Felipe Benavides-Martinez, Justin Guthrie, Jhon Edwin Arias, Yeison Alberto Garces-Gomez, Angela Ines Guzman-Alvis, Cristiam Victoriano Portilla-Cabrera, Somnath Mondal, Andrew J. Allyn, Auroop R. Ganguly
Main category: cs.LG
TL;DR: This paper analyzes the functional organization of geospatial foundation model embeddings, showing they have hierarchical structure with specialist and generalist dimensions, enabling significant dimension reduction for land cover classification.
Details
Motivation: While geospatial foundation models produce high-dimensional embeddings with strong predictive performance, their internal organization remains unclear, limiting scientific utility. The authors aim to understand whether embedding dimensions exhibit functional or hierarchical organization.
Method: Proposed a functional interpretability framework combining large-scale experimentation with structural analysis of embedding-class relationships, using feature importance patterns and progressive ablation to reverse-engineer the role of embedding dimensions.
Result: Embedding dimensions show consistent non-uniform functional behavior categorized hierarchically: specialist dimensions for specific land cover classes, low/mid-generalist dimensions for shared characteristics, and high-generalist dimensions for broader environmental gradients. Land cover classification achieves 98% of baseline performance using only 2-12 of 64 dimensions.
Conclusion: AlphaEarth embeddings are not only physically informative but functionally organized into hierarchical structure, revealing substantial redundancy and offering practical guidance for dimension selection in operational classification tasks with computational cost reductions.
Abstract: Geospatial foundation models generate high-dimensional embeddings that achieve strong predictive performance, yet their internal organization remains obscure, limiting their scientific use. Recent interpretability studies relate Google AlphaEarth Foundations (GAEF) embeddings to continuous environmental variables, but it is still unclear whether the embedding space exhibits a functional or hierarchical organization, in which some dimensions act as specialized representations while others encode shared or broader geospatial structure. In this work, we propose a functional interpretability framework that reverse-engineers the role of embedding dimensions by characterizing their contribution to land cover structure from observed classification behavior. The approach combines large-scale experimentation with a structural analysis of embedding-class relationships based on feature importance patterns and progressive ablation. Our results show that embedding dimensions exhibit consistent and non-uniform functional behavior, allowing them to be categorized along a hierarchical functional spectrum: specialist dimensions associated with specific land cover classes, low- and mid-generalist dimensions capturing shared characteristics between classes, and high-generalist dimensions reflecting broader environmental gradients. Critically, we find that accurate land cover classification (98% of baseline performance) can be achieved using as few as 2 to 12 of the 64 available dimensions, depending on the class. This demonstrates substantial redundancy in the embedding space and offers a pathway toward significant reductions in computational cost. Together, these findings reveal that AlphaEarth embeddings are not only physically informative, but also functionally organized into a hierarchical structure, providing practical guidance for dimension selection in operational classification tasks.
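The progressive-ablation protocol described above can be reproduced in miniature: rank dimensions by feature importance, retrain on the top-k, and compare against the full model. The synthetic data below merely stands in for the 64-dimensional GAEF embeddings and land cover labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Sketch of progressive ablation on synthetic "embeddings": rank dimensions
# by feature importance, then measure how accuracy holds up as only the
# top-k dimensions are kept. Two planted "specialist" dimensions drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))
y = (X[:, 3] + 0.5 * X[:, 17] > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

full = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
order = np.argsort(full.feature_importances_)[::-1]
base = full.score(Xte, yte)
for k in (2, 4, 8, 64):
    sub = RandomForestClassifier(random_state=0).fit(Xtr[:, order[:k]], ytr)
    acc = sub.score(Xte[:, order[:k]], yte)
    print(f"top-{k:2d} dims: {acc:.3f} ({acc / base:.0%} of baseline)")
```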
[459] HoloByte: Continuous Hyperspherical Distillation for Tokenizer-Free Modeling
Vladimer Khasia
Main category: cs.LG
TL;DR: HoloByte introduces a tokenizer-free framework using Continuous Hyperspherical Distillation to project byte sequences into continuous hyperspherical representations, reducing attention complexity while maintaining exact byte-level recovery.
Details
Motivation: Current sequence modeling relies on discrete subword tokenization, which imposes artificial morphological boundaries, enforces vocabulary dependence, and fractures optimization landscape continuity. The authors aim to resolve this dichotomy by eliminating tokenization entirely.
Method: HoloByte partitions byte sequences into fixed-capacity chunks and projects them into a continuous hyperspherical manifold via invertible orthogonal rotation operators. Uses a macroscopic transformer on compressed continuous representations with reduced attention complexity, plus a localized causal micro-decoder for exact byte-level distributions. Employs a dual-objective formulation with a Holographic Latent Mean Squared Error for gradient bounding and stability.
Result: Theoretically derived minimal embedding dimension required for error-free discrete recovery. Empirically outperforms comparable discrete Byte-Pair Encoding baseline under strictly matched parameter constraints.
Conclusion: Continuous Hyperspherical Distillation provides mathematically rigorous and computationally tractable foundation for vocabulary-invariant sequence modeling, eliminating dependence on discrete tokenization.
Abstract: Sequence modeling universally relies on discrete subword tokenization to circumvent the $\mathcal{O}(N^2)$ computational intractability of native byte-level attention. However, this heuristic quantization imposes artificial morphological boundaries, enforces vocabulary dependence, and fractures the continuity of the optimization landscape. To resolve this dichotomy, we introduce \textbf{HoloByte}: a strictly tokenizer-free framework utilizing Continuous Hyperspherical Distillation. HoloByte partitions discrete byte sequences into fixed-capacity chunks and projects them into a continuous, strictly bounded hyperspherical manifold via an invertible, dimension-preserving orthogonal rotation operator. This spatial superposition allows a macroscopic transformer to operate exclusively on compressed continuous representations, formally reducing the exact attention time complexity from $\mathcal{O}(N^2D)$ to $\mathcal{O}\left( \frac{N^2}{W^2}D + ND^2 \right)$. A localized causal micro-decoder subsequently unbinds these representations to compute exact byte-level distributions. To govern this continuous trajectory, we propose a dual-objective formulation incorporating a mathematically precise Holographic Latent Mean Squared Error, which strictly bounds the gradient and guarantees asymptotic stability. Theoretically, we derive the minimal embedding dimension $D = \Omega(W \ln |\mathcal{V}|)$ required to ensure error-free discrete recovery from the continuous manifold. Empirically, under strictly matched parameter constraints, HoloByte is systematically outperforming a comparable discrete Byte-Pair Encoding (BPE) baseline. These results establish Continuous Hyperspherical Distillation as a mathematically rigorous and computationally tractable foundation for vocabulary-invariant sequence modeling. The code is available at https://github.com/VladimerKhasia/HoloByte
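The key invertibility claim (an orthogonal rotation can be undone exactly, so bytes are recoverable from the continuous representation) is easy to demonstrate. The byte embedding and dimensions below are illustrative choices, not HoloByte's actual operator or its derived bound on D.

```python
import numpy as np

# Sketch of the invertibility idea: embed a byte chunk as a vector, rotate it
# with an orthogonal matrix Q (whose inverse is Q^T), and recover the bytes
# exactly. Scaling bytes to [0, 1] is an illustrative embedding only.
rng = np.random.default_rng(0)
W, D = 16, 16                                 # chunk size; D == W for simplicity
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))  # random orthogonal rotation

chunk = np.frombuffer(b"hello, holobyte!", dtype=np.uint8).astype(float)
z = Q @ (chunk / 255.0)                       # continuous chunk representation
recovered = np.rint((Q.T @ z) * 255.0).astype(np.uint8)
print(recovered.tobytes())                    # b'hello, holobyte!'
```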
[460] MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning
Hongjun Wang, Wei Liu, Weibo Gu, Xing Sun, Kai Han
Main category: cs.LG
TL;DR: MHPO introduces a novel RL framework with Log-Fidelity Modulator and Decoupled Hazard Penalty for stable policy optimization by regulating importance ratios and preventing extreme policy shifts.
Details
Motivation: Existing ratio control methods like hard clipping suffer from non-differentiable boundaries and vanishing gradients, lacking adaptive mechanisms to suppress extreme policy deviations, making optimization vulnerable to abrupt shifts.
Method: Proposes Modulated Hazard-aware Policy Optimization (MHPO) with two components: 1) Log-Fidelity Modulator (LFM) maps unbounded importance ratios to a bounded differentiable domain, 2) Decoupled Hazard Penalty (DHP) uses survival-analysis hazard functions to independently regulate positive/negative policy shifts.
Result: Extensive evaluations on diverse reasoning benchmarks across text-based and vision-language tasks show MHPO consistently outperforms existing methods with superior performance and significantly enhanced training stability.
Conclusion: MHPO provides a robust framework for stable reinforcement learning that achieves fine-grained regulation of asymmetric policy shifts while preventing mode collapse and policy erosion within a stabilized trust region.
Abstract: Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing gradient regions, failing to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges, we propose Modulated Hazard-aware Policy Optimization (MHPO), a novel framework designed for robust and stable reinforcement learning. The proposed MHPO introduces a Log-Fidelity Modulator (LFM) to map unbounded importance ratios into a bounded, differentiable domain. This mechanism effectively prevents high-variance outlier tokens from destabilizing the loss landscape while ensuring global gradient stability. Complementarily, a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to independently regulate positive and negative policy shifts. By shaping the optimization landscape with hazard-aware penalties, the proposed MHPO achieves fine-grained regulation of asymmetric policy shifts simultaneously mitigating mode collapse from over-expansion and preventing policy erosion from catastrophic contraction within a stabilized trust region. Extensive evaluations on diverse reasoning benchmarks across both text-based and vision-language tasks demonstrate that MHPO consistently outperforms existing methods, achieving superior performance while significantly enhancing training stability.
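The digest does not specify the Log-Fidelity Modulator's exact form, so the sketch below uses a tanh-squashed log-ratio purely as a stand-in: it shows the intended contrast with hard clipping, namely bounded outputs with gradients that never vanish.

```python
import torch

def log_fidelity_modulate(ratio: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    """Illustrative stand-in for a bounded, differentiable ratio transform:
    squash log-ratios with tanh so extreme importance ratios saturate
    smoothly instead of being hard-clipped. The paper's actual Log-Fidelity
    Modulator is not specified in this digest."""
    return torch.exp(scale * torch.tanh(torch.log(ratio) / scale))

r = torch.tensor([0.1, 0.8, 1.0, 1.25, 10.0], requires_grad=True)
m = log_fidelity_modulate(r)
m.sum().backward()
print(m.detach())  # bounded roughly within [e^-2, e^2]
print(r.grad)      # nonzero everywhere, unlike a hard clip's plateau
```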
[461] Topology-Preserving Deep Joint Source-Channel Coding for Semantic Communication
Omar Erak, Omar Alhussein, Fang Fang, Sami Muhaidat
Main category: cs.LG
TL;DR: TopoJSCC: A topology-aware deep joint source-channel coding framework that preserves global structural information in wireless vision applications by integrating persistent homology regularizers into end-to-end training.
Details
Motivation: Wireless vision applications like autonomous driving require preservation of global structural information and connectivity, not just pixel fidelity. Existing DeepJSCC schemes optimize pixel-wise losses without explicit protection of topology and connectivity.
Method: Proposes TopoJSCC framework that integrates persistent-homology regularizers into end-to-end training. Uses Wasserstein distances between cubical persistence diagrams of original/reconstructed images, and between Vietoris-Rips persistence of latent features before/after channel to promote robust latent manifold.
Result: Experiments show improved topology preservation and PSNR in low SNR and bandwidth-ratio regimes compared to existing methods.
Conclusion: TopoJSCC effectively preserves topological structure in wireless vision transmission through persistent-homology regularization, enhancing performance in challenging communication conditions.
Abstract: Many wireless vision applications, such as autonomous driving, require preservation of global structural information rather than only per-pixel fidelity. However, existing deep joint source-channel coding (DeepJSCC) schemes mainly optimize pixel-wise losses and provide no explicit protection of connectivity or topology. This letter proposes TopoJSCC, a topology-aware DeepJSCC framework that integrates persistent-homology regularizers into end-to-end training. Specifically, we enforce topological consistency by penalizing Wasserstein distances between cubical persistence diagrams of original and reconstructed images, and between Vietoris–Rips persistence of latent features before and after the channel to promote a robust latent manifold. TopoJSCC is based on end-to-end learning and requires no side information. Experiments show improved topology preservation and peak signal-to-noise ratio (PSNR) in low signal-to-noise ratio (SNR) and bandwidth-ratio regimes.
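A minimal sketch of the image-side topological penalty, assuming the gudhi library (with the POT backend for Wasserstein distances); the paper's actual regularizer and diagram dimensions may differ. It compares 0-dimensional cubical persistence diagrams of the original and reconstructed images.

```python
import numpy as np
import gudhi
from gudhi.wasserstein import wasserstein_distance

def topo_loss(img_a: np.ndarray, img_b: np.ndarray) -> float:
    diags = []
    for img in (img_a, img_b):
        cc = gudhi.CubicalComplex(top_dimensional_cells=img)
        cc.compute_persistence()
        d0 = cc.persistence_intervals_in_dimension(0)   # connected components
        diags.append(d0[np.isfinite(d0).all(axis=1)])   # drop the infinite bar
    return wasserstein_distance(diags[0], diags[1], order=1.0)

rng = np.random.default_rng(0)
print(topo_loss(rng.random((32, 32)), rng.random((32, 32))))
```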
[462] Integrating Explainable Machine Learning and Mixed-Integer Optimization for Personalized Sleep Quality Intervention
Mahfuz Ahmed Anik, Mohsin Mahmud Topu, Azmine Toushik Wasi, Md Isfar Khan, MD Manjurul Ahsan
Main category: cs.LG
TL;DR: Personalized predictive-prescriptive framework combining interpretable ML with mixed-integer optimization to translate sleep quality predictions into actionable behavioral interventions.
Details
Motivation: Most computational sleep studies focus on predictive risk identification rather than actionable intervention design, creating a gap between predictive insights and practical intervention strategies.
Method: Supervised classifier predicts sleep quality from survey data, SHAP-based feature attribution quantifies modifiable factors, mixed-integer optimization identifies minimal feasible behavioral adjustments with penalty mechanism for resistance to change.
Result: Achieves strong predictive performance (test F1-score: 0.9544, accuracy: 0.9366), reveals trade-off between expected improvement and intervention intensity, generates concise personalized recommendations (often 1-2 high-impact adjustments).
Conclusion: Framework demonstrates how data-driven insights can be translated into structured, personalized decision support for sleep improvement by integrating prediction, explanation, and constrained optimization.
Abstract: Sleep quality is influenced by a complex interplay of behavioral, environmental, and psychosocial factors, yet most computational studies focus mainly on predictive risk identification rather than actionable intervention design. Although machine learning models can accurately predict subjective sleep outcomes, they rarely translate predictive insights into practical intervention strategies. To address this gap, we propose a personalized predictive-prescriptive framework that integrates interpretable machine learning with mixed-integer optimization. A supervised classifier trained on survey data predicts sleep quality, while SHAP-based feature attribution quantifies the influence of modifiable factors. These importance measures are incorporated into a mixed-integer optimization model that identifies minimal and feasible behavioral adjustments, while modelling resistance to change through a penalty mechanism. The framework achieves strong predictive performance, with a test F1-score of 0.9544 and an accuracy of 0.9366. Sensitivity and Pareto analyses reveal a clear trade-off between expected improvement and intervention intensity, with diminishing returns as additional changes are introduced. At the individual level, the model generates concise recommendations, often suggesting one or two high-impact behavioral adjustments and sometimes recommending no change when expected gains are minimal. By integrating prediction, explanation, and constrained optimization, this framework demonstrates how data-driven insights can be translated into structured and personalized decision support for sleep improvement.
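An illustrative toy version of the prescriptive step using the PuLP MILP library: choose at most k binary behavioral adjustments, scoring each candidate by a SHAP-derived gain minus a resistance-to-change penalty. The feature names and weights are placeholders, not values from the paper.

```python
import pulp

# SHAP-derived expected gains and resistance penalties (placeholder values)
shap_gain = {"caffeine_cutoff": 0.30, "screen_time": 0.22, "exercise": 0.15}
resistance = {"caffeine_cutoff": 0.10, "screen_time": 0.18, "exercise": 0.05}
k = 2  # budget: at most k simultaneous interventions

prob = pulp.LpProblem("sleep_intervention", pulp.LpMaximize)
x = {f: pulp.LpVariable(f, cat="Binary") for f in shap_gain}
prob += pulp.lpSum((shap_gain[f] - resistance[f]) * x[f] for f in shap_gain)
prob += pulp.lpSum(x.values()) <= k

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([f for f in x if x[f].value() == 1])
```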
[463] Minimum-Action Learning: Energy-Constrained Symbolic Model Selection for Physical Law Identification from Noisy Data
Martin G. Frasch
Main category: cs.LG
TL;DR: MAL framework identifies symbolic force laws from noisy data by minimizing a Triple-Action functional combining trajectory reconstruction, sparsity, and energy conservation, with noise reduction enabling accurate recovery.
Details
Motivation: Identifying physical laws from noisy observational data is challenging in scientific machine learning. Current methods struggle with noise and lack interpretable, energy-constrained model selection.
Method: Minimum-Action Learning (MAL) selects symbolic force laws from a basis library by minimizing a Triple-Action functional. Uses wide-stencil acceleration-matching for 10,000x noise reduction, transforming SNR from ~0.02 to ~1.6. Combines trajectory reconstruction, architectural sparsity, and energy-conservation enforcement.
Result: Recovers Kepler gravity with exponent p = 3.01 ± 0.01 at ~0.07 kWh (40% reduction vs baselines). Raw correct-basis rate: 40% for Kepler, 90% for Hooke’s law. Energy-conservation criterion yields 100% pipeline-level identification. Outperforms SINDy variants, Hamiltonian Neural Networks, and Lagrangian Neural Networks.
Conclusion: MAL provides interpretable, energy-constrained model selection combining symbolic basis identification with dynamical rollout validation, offering a distinct niche for scientific discovery from noisy data.
Abstract: Identifying physical laws from noisy observational data is a central challenge in scientific machine learning. We present Minimum-Action Learning (MAL), a framework that selects symbolic force laws from a pre-specified basis library by minimizing a Triple-Action functional combining trajectory reconstruction, architectural sparsity, and energy-conservation enforcement. A wide-stencil acceleration-matching technique reduces noise variance by 10,000x, transforming an intractable problem (SNR ~0.02) into a learnable one (SNR ~1.6); this preprocessing is the critical enabler shared by all methods tested, including SINDy variants. On two benchmarks – Kepler gravity and Hooke’s law – MAL recovers the correct force law with Kepler exponent p = 3.01 +/- 0.01 at ~0.07 kWh (40% reduction vs. prediction-error-only baselines). The raw correct-basis rate is 40% for Kepler and 90% for Hooke; an energy-conservation-based criterion discriminates the true force law in all cases, yielding 100% pipeline-level identification. Basis library sensitivity experiments show that near-confounders degrade selection (20% with added r^{-2.5} and r^{-1.5}), while distant additions are harmless, and the conservation diagnostic remains informative even when the correct basis is absent. Direct comparison with noise-robust SINDy variants, Hamiltonian Neural Networks, and Lagrangian Neural Networks confirms MAL’s distinct niche: interpretable, energy-constrained model selection that combines symbolic basis identification with dynamical rollout validation.
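A sketch of how a wide-stencil acceleration estimate suppresses measurement noise, which is our interpretation of the preprocessing described above; the stencil half-width, time step, and noise level are illustrative. Widening the stencil makes the noise variance of the second-difference estimate fall like 1/k^4, at the cost of some truncation error.

```python
import numpy as np

def accel_estimate(x, dt, k=1):
    # central second difference with stencil half-width k;
    # error variance from i.i.d. noise scales like 1/k^4 relative to k=1
    return (x[2 * k:] - 2 * x[k:-k] + x[:-2 * k]) / (k * dt) ** 2

dt = 1e-3
t = np.arange(0, 10, dt)
x_noisy = np.sin(t) + 0.05 * np.random.randn(t.size)  # true acceleration: -sin(t)

for k in (1, 100):
    err = accel_estimate(x_noisy, dt, k) - (-np.sin(t[k:-k]))
    print(f"k={k:4d}  error variance={err.var():.3g}")
```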
[464] Formal verification of tree-based machine learning models for lateral spreading
Krishna Kumar
Main category: cs.LG
TL;DR: Paper presents a method to formally verify physical consistency of ML models in geotechnical hazard prediction using SMT solvers, showing trade-offs between accuracy and physical consistency.
Details
Motivation: Machine learning models for geotechnical hazard prediction can learn physically inconsistent relationships from sparse or biased data, and current remedies (post-hoc explainability or training constraints) don't provide exhaustive guarantees across the entire input domain.
Method: Encodes trained tree ensembles as logical formulas in Satisfiability Modulo Theories (SMT) solver to check physical specifications across entire input domain. Formalizes four geotechnical specifications as decidable logical formulas and verifies them against XGBoost ensembles and Explainable Boosting Machines trained on Christchurch earthquake dataset.
Result: Unconstrained EBM (80.1% accuracy) violates all four specifications. Fully constrained EBM (67.2% accuracy) satisfies three of four specifications. Pareto analysis shows persistent trade-off between accuracy and compliance - no variants achieve both >80% accuracy and full compliance. SHAP analysis shows post-hoc explanations don’t substitute for formal verification.
Conclusion: Establishes a verify-fix-verify engineering loop and formal certification for deploying physically consistent ML models in safety-critical geotechnical applications, demonstrating that formal verification is necessary beyond post-hoc explanations.
Abstract: Machine learning models for geotechnical hazard prediction can achieve high accuracy while learning physically inconsistent relationships from sparse or biased training data. Current remedies (post-hoc explainability, such as SHAP and LIME, and training-time constraints) either diagnose individual predictions approximately or restrict model capacity without providing exhaustive guarantees. This paper encodes trained tree ensembles as logical formulas in a Satisfiability Modulo Theories (SMT) solver and checks physical specifications across the entire input domain, not just sampled points. Four geotechnical specifications (water table depth, PGA monotonicity, distance safety, and flat-ground safety) are formalized as decidable logical formulas and verified via SMT against both XGBoost ensembles and Explainable Boosting Machines (EBMs) trained on the 2011 Christchurch earthquake lateral spreading dataset (7,291 sites, four features). The SMT solver either produces a concrete counterexample where a specification fails or proves that no violation exists. The unconstrained EBM (80.1% accuracy) violates all four specifications. A fully constrained EBM (67.2%) satisfies three of four specifications, demonstrating that iterative constraint application guided by verification can progressively improve physical consistency. A Pareto analysis of 33 model variants reveals a persistent trade-off, as none of the variants studied achieve both greater than 80% accuracy and full compliance with the specified set. SHAP analysis of specification counterexamples shows that the offending feature can rank last, demonstrating that post-hoc explanations do not substitute for formal verification. These results establish a verify-fix-verify engineering loop and a formal certification for deploying physically consistent ML models in safety-critical geotechnical applications.
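A toy illustration of the verification idea with the Z3 SMT solver (our reduced example, not the paper's full tree-ensemble encoding): a one-split decision stump on peak ground acceleration is checked against a monotonicity specification over the entire input domain, and Z3 either proves the property or returns a concrete counterexample.

```python
from z3 import Real, Solver, If, sat

def stump(pga):
    # toy decision stump: predicted hazard 0.8 if pga >= 0.3 else 0.2
    return If(pga >= 0.3, 0.8, 0.2)

p1, p2 = Real("p1"), Real("p2")
s = Solver()
# look for a monotonicity violation: p1 <= p2 but score(p1) > score(p2)
s.add(p1 >= 0, p2 >= 0, p1 <= p2, stump(p1) > stump(p2))
if s.check() == sat:
    print("counterexample:", s.model())
else:
    print("no violation exists: the spec holds over the entire domain")
```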
[465] Integrating Inductive Biases in Transformers via Distillation for Financial Time Series Forecasting
Yu-Chen Den, Kuan-Yu Chen, Kendro Vincent, Darby Tien-Hao Chang
Main category: cs.LG
TL;DR: TIPS is a Transformer framework that synthesizes multiple inductive biases (causality, locality, periodicity) via knowledge distillation from specialized teacher models to improve financial time-series forecasting in non-stationary markets.
Details
Motivation: Standard Transformer models assume stationarity and stable temporal dynamics, which are routinely violated in financial markets with regime shifts. Existing Transformers underperform simpler architectures like CNNs and RNNs on financial tasks, and no single inductive bias dominates across different market regimes.
Method: TIPS trains bias-specialized Transformer teachers using attention masking to encode different inductive biases (causality, locality, periodicity), then distills their knowledge into a single student model with regime-dependent alignment across these biases.
Result: Across four major equity markets, TIPS achieves state-of-the-art performance, outperforming strong ensemble baselines by 55% in annual return, 9% in Sharpe ratio, and 16% in Calmar ratio, while requiring only 38% of the inference-time computation.
Conclusion: The framework demonstrates the importance of regime-dependent inductive bias utilization for robust generalization in non-stationary financial time series, generating statistically significant excess returns beyond vanilla Transformers and teacher ensembles.
Abstract: Transformer-based models have been widely adopted for time-series forecasting due to their high representational capacity and architectural flexibility. However, many Transformer variants implicitly assume stationarity and stable temporal dynamics – assumptions routinely violated in financial markets characterized by regime shifts and non-stationarity. Empirically, state-of-the-art time-series Transformers often underperform even vanilla Transformers on financial tasks, while simpler architectures with distinct inductive biases, such as CNNs and RNNs, can achieve stronger performance with substantially lower complexity. At the same time, no single inductive bias dominates across markets or regimes, suggesting that robust financial forecasting requires integrating complementary temporal priors. We propose TIPS (Transformer with Inductive Prior Synthesis), a knowledge distillation framework that synthesizes diverse inductive biases – causality, locality, and periodicity – within a unified Transformer. TIPS trains bias-specialized Transformer teachers via attention masking, then distills their knowledge into a single student model with regime-dependent alignment across inductive biases. Across four major equity markets, TIPS achieves state-of-the-art performance, outperforming strong ensemble baselines by 55%, 9%, and 16% in annual return, Sharpe ratio, and Calmar ratio, while requiring only 38% of the inference-time computation. Further analyses show that TIPS generates statistically significant excess returns beyond both vanilla Transformers and its teacher ensembles, and exhibits regime-dependent behavioral alignment with classical architectures during their profitable periods. These results highlight the importance of regime-dependent inductive bias utilization for robust generalization in non-stationary financial time series.
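One plausible way to realize the three teacher biases as attention masks; the window and period values are our placeholders, and the paper's masks may be parameterized differently. True marks positions a query is allowed to attend to.

```python
import numpy as np

def causal_mask(n):
    return np.tril(np.ones((n, n), dtype=bool))       # attend to the past only

def local_mask(n, window=3):
    i, j = np.indices((n, n))
    return (j <= i) & (i - j <= window)               # recent past only

def periodic_mask(n, period=5):
    i, j = np.indices((n, n))
    return (j <= i) & ((i - j) % period == 0)         # same phase in the cycle

print(local_mask(6).astype(int))
```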
[466] Transformers Can Learn Rules They’ve Never Seen: Proof of Computation Beyond Interpolation
Andy Gray
Main category: cs.LG
TL;DR: Transformers can learn and apply unseen rules beyond interpolation, demonstrated through XOR cellular automata and symbolic operator chains experiments.
Details
Motivation: To test whether transformers can genuinely infer rules absent from training or merely interpolate from observed examples, addressing a central debate in LLM capabilities.
Method: Two controlled experiments: 1) XOR cellular automaton with held-out patterns (XOR is linearly inseparable), 2) Symbolic operator chains with held-out operator pairs requiring intermediate-step generation. Circuit extraction and constraint propagation analysis.
Result: Transformers recovered XOR rule (best 100%, 47/60 runs) and exceeded all interpolation baselines on operator chains (mean 41.8% vs KRR 4.3%). Performance depended on multi-step reasoning and intermediate-step supervision.
Conclusion: Transformers can learn rule structure not directly observed in training and express it explicitly, providing an existence proof against strongest interpolation-only accounts, though leaving open when such behavior emerges in large-scale language training.
Abstract: A central question in the LLM debate is whether transformers can infer rules absent from training, or whether apparent generalisation reduces to similarity-based interpolation over observed examples. We test a strong interpolation-only hypothesis in two controlled settings: one where interpolation is ruled out by construction and proof, and one where success requires emitting intermediate symbolic derivations rather than only final answers. In Experiment 1, we use a cellular automaton with a pure XOR transition rule and remove specific local input patterns from training; since XOR is linearly inseparable, each held-out pattern’s nearest neighbours have the opposite label, so similarity-based predictors fail on the held-out region. Yet a two-layer transformer recovers the rule (best 100%; 47/60 converged runs), and circuit extraction identifies XOR computation. Performance depends on multi-step constraint propagation: without unrolling, accuracy matches output bias (63.1%), while soft unrolling reaches 96.7%. In Experiment 2, we study symbolic operator chains over integers with one operator pair held out; the model must emit intermediate steps and a final answer in a proof-like format. Across all 49 holdout pairs, the transformer exceeds every interpolation baseline (mean 41.8%, up to 78.6%; mean KRR 4.3%; KNN and MLP score 0% on every pair), while removing intermediate-step supervision degrades performance. Together with a construction showing that a standard transformer block can implement exact local Boolean rules, these results provide an existence proof that transformers can learn rule structure not directly observed in training and express it explicitly, ruling out the strongest architectural form of interpolation-only accounts: that transformers cannot in principle discover and communicate unseen rules, while leaving open when such behaviour arises in large-scale language training.
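A sketch of the Experiment-1 data setup as we read it: one step of a rule-90-style cellular automaton (each cell becomes the XOR of its two neighbors), with every sequence containing a chosen local pattern held out of training. The specific held-out pattern and sizes are our choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def xor_step(state):
    return np.roll(state, 1) ^ np.roll(state, -1)     # rule 90: left XOR right

held_out = np.array([1, 0, 1])  # example local pattern excluded from training

def contains(state, pat):
    w = np.lib.stride_tricks.sliding_window_view(state, len(pat))
    return bool((w == pat).all(axis=1).any())

train, test = [], []
for _ in range(2000):
    s = rng.integers(0, 2, size=16)
    (test if contains(s, held_out) else train).append((s, xor_step(s)))
print(len(train), "train /", len(test), "held out")
```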
[467] Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models
Abinav Rao, Sujan Rachuri
Main category: cs.LG
TL;DR: DPO alignment fails to improve generation quality in unified multimodal models due to gradient imbalance between understanding and generation tasks caused by VQ token count asymmetry.
Details
Motivation: The paper investigates whether Direct Preference Optimization (DPO) can simultaneously align both understanding and generation capabilities in unified multimodal models that share a language model backbone for processing both images and text.
Method: Systematic study applying DPO to Janus-Pro models (1B and 7B parameters) under seven training strategies and two post-hoc methods, analyzing gradient behavior and tokenization effects.
Result: Generation quality resists DPO alignment across all tested conditions; no method improves generation CLIPScore at 7B, and all methods degrade generation at 1B. Gradient analysis reveals near-orthogonal understanding/generation gradients with 11-14x magnitude imbalance due to VQ token count asymmetry.
Conclusion: Discrete VQ tokenization is a structural bottleneck preventing effective DPO alignment of generation capabilities in unified multimodal models, providing practical guidance for practitioners working with VQ-based architectures.
Abstract: Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| < 0.2, p > 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation tokens vs. ~30-100 text tokens). This imbalance is the dominant interference mechanism in multi-task DPO; magnitude-balancing yields directionally positive understanding deltas (+0.01-0.04 VQA, though individually not significant), but the generation gap persists regardless. We identify discrete VQ tokenization as a likely structural bottleneck – supported by the generation DPO loss converging to ln(2) – and provide practical guidance for practitioners working with VQ-based unified models.
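A generic version of the gradient diagnostic behind the interference finding: compute per-task gradients over shared parameters, then report their cosine similarity and norm ratio. The two losses below are simple stand-ins, not the DPO objectives from the paper.

```python
import torch

model = torch.nn.Linear(16, 4)
x = torch.randn(32, 16)

def flat_grad(loss):
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.flatten() for g in grads])

g_u = flat_grad(model(x).pow(2).mean())        # stand-in understanding loss
g_g = flat_grad((model(x) - 1).abs().mean())   # stand-in generation loss

cos = torch.nn.functional.cosine_similarity(g_u, g_g, dim=0).item()
print(f"cosine={cos:.3f}  norm ratio={(g_g.norm() / g_u.norm()).item():.2f}")
```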
[468] Impact of Data Duplication on Deep Neural Network-Based Image Classifiers: Robust vs. Standard Models
Alireza Aghabagherloo, Aydin Abadi, Sumanta Sarkar, Vishnu Asutosh Dasu, Bart Preneel
Main category: cs.LG
TL;DR: Study shows duplicated images in training sets negatively impact image classifier training efficiency and accuracy, especially with non-uniform duplication across classes or in adversarially trained models.
Details
Motivation: While data deduplication has been shown to improve language models, the impact of duplicated images in image classification training sets on model generalization and performance remains understudied, despite the recognized importance of data quality for DNNs.
Method: Comprehensive study analyzing the effect of duplicated images in image classification training sets, examining both uniform and non-uniform duplication across classes, and investigating impacts on both standard and adversarially trained models.
Result: Duplicated images negatively affect training efficiency and classifier accuracy, with particularly strong negative impacts when duplication is non-uniform across classes or occurs in adversarially trained models. Even uniform duplication doesn’t significantly improve accuracy with increased duplication.
Conclusion: Data deduplication is important for image classification models as duplicated training images can harm both training efficiency and model accuracy, especially in adversarial training scenarios or with imbalanced duplication across classes.
Abstract: The accuracy and robustness of machine learning models against adversarial attacks are significantly influenced by factors such as training data quality, model architecture, the training process, and the deployment environment. In recent years, duplicated data in training sets, especially in language models, has attracted considerable attention. It has been shown that deduplication enhances both training performance and model accuracy in language models. While the importance of data quality in training image classifier Deep Neural Networks (DNNs) is widely recognized, the impact of duplicated images in the training set on model generalization and performance has received little attention. In this paper, we address this gap and provide a comprehensive study on the effect of duplicates in image classification. Our analysis indicates that the presence of duplicated images in the training set not only negatively affects the efficiency of model training but also may result in lower accuracy of the image classifier. This negative impact of duplication on accuracy is particularly evident when duplicated data is non-uniform across classes or when duplication, whether uniform or non-uniform, occurs in the training set of an adversarially trained model. Even when duplicated samples are selected in a uniform way, increasing the amount of duplication does not lead to a significant improvement in accuracy.
[469] SCE-LITE-HQ: Smooth visual counterfactual explanations with generative foundation models
Ahmed Zeid, Sidney Bender
Main category: cs.LG
TL;DR: SCE-LITE-HQ is a scalable counterfactual explanation framework that uses pretrained generative models to produce realistic, diverse explanations without task-specific retraining, evaluated on natural and medical datasets.
Details
Motivation: Modern neural networks are difficult to interpret in high-dimensional visual domains. Existing counterfactual explanation methods rely on dataset-specific generative models and have high computational costs, limiting scalability to high-resolution data.
Method: Leverages pretrained generative foundation models without task-specific retraining. Operates in the generator’s latent space, uses smoothed gradients for optimization stability, and applies mask-based diversification to promote realistic and structurally diverse counterfactuals.
Result: SCE-LITE-HQ produces valid, realistic, and diverse counterfactuals competitive with or outperforming existing baselines, while avoiding the overhead of training dedicated generative models.
Conclusion: The framework provides a scalable approach for counterfactual generation in high-dimensional visual domains by effectively utilizing pretrained generative models and advanced optimization techniques.
Abstract: Modern neural networks achieve strong performance but remain difficult to interpret in high-dimensional visual domains. Counterfactual explanations (CFEs) provide a principled approach to interpreting black-box predictions by identifying minimal input changes that alter model outputs. However, existing CFE methods often rely on dataset-specific generative models and incur substantial computational cost, limiting their scalability to high-resolution data. We propose SCE-LITE-HQ, a scalable framework for counterfactual generation that leverages pretrained generative foundation models without task-specific retraining. The method operates in the latent space of the generator, incorporates smoothed gradients to improve optimization stability, and applies mask-based diversification to promote realistic and structurally diverse counterfactuals. We evaluate SCE-LITE-HQ on natural and medical datasets using a desiderata-driven evaluation protocol. Results show that SCE-LITE-HQ produces valid, realistic, and diverse counterfactuals competitive with or outperforming existing baselines, while avoiding the overhead of training dedicated generative models.
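A minimal sketch of latent-space counterfactual search under our assumptions: freeze a pretrained `generator` and `classifier` (toy linear stand-ins here), then take gradient steps on the latent code toward the target class with a proximity penalty. The moving-average gradient smoothing is our crude stand-in for the paper's smoothed gradients.

```python
import torch
import torch.nn.functional as F

def counterfactual(z0, generator, classifier, target, steps=200, lr=0.05, lam=0.1):
    z = z0.clone().requires_grad_()
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = classifier(generator(z))
        # push toward the target class while staying close to the start
        loss = -logits[:, target].mean() + lam * (z - z0).pow(2).mean()
        loss.backward()
        # crude moving-average smoothing of the latent gradient
        z.grad = F.avg_pool1d(z.grad.unsqueeze(1), kernel_size=3,
                              stride=1, padding=1).squeeze(1)
        opt.step()
    return z.detach()

generator = torch.nn.Linear(8, 32)    # toy stand-ins for pretrained networks
classifier = torch.nn.Linear(32, 10)
print(counterfactual(torch.randn(4, 8), generator, classifier, target=3).shape)
```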
[470] Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization
Wenhao Zhao, Qiran Zou, Rushi Shah, Yudi Wu, Zhouhan Lin, Dianbo Liu
Main category: cs.LG
TL;DR: First comprehensive study of representation collapsing problems in vector quantization for generative models, identifying two collapse types and proposing solutions.
Details
Motivation: Vector quantization is widely used in tokenizing data for LLMs and diffusion models, but its characteristics and behaviors in generative models remain underexplored, particularly the issue of representation collapse.
Method: Systematic investigation of collapse issues using both synthetic and real datasets to identify severity and triggering conditions of two collapse types: token collapse and embedding collapse.
Result: Analysis reveals that random initialization and limited encoder capacity cause token collapse and embedding collapse respectively. The study identifies severity levels and triggering conditions for each collapse type.
Conclusion: This is the first comprehensive study of representation collapsing in vector quantization, with proposed solutions to mitigate each collapse type, advancing understanding of VQ in generative models.
Abstract: Vector quantization is a technique in machine learning that discretizes continuous representations into a set of discrete vectors. It is widely employed in tokenizing data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behaviors of vector quantization in generative models remain largely underexplored. In this study, we systematically investigate the issue of representation collapse in vector quantization, where collapsed representations are observed across both discrete codebook tokens and continuous latent embeddings. By leveraging both synthetic and real datasets, we identify the severity of each type of collapse and its triggering conditions. Our analysis reveals that random initialization and limited encoder capacity result in token collapse and embedding collapse, respectively. Building on these findings, we propose potential solutions aimed at mitigating each collapse. To the best of our knowledge, this is the first comprehensive study examining representation collapsing problems in vector quantization.
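A standard codebook-usage diagnostic that makes token collapse measurable (our addition; the paper may use different metrics): nearest-neighbor quantization followed by code-usage perplexity, where values far below the codebook size mean only a few codes are ever selected.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 8))                 # 64 codes of dimension 8
latents = rng.normal(size=(1024, 8))

# nearest-neighbor quantization: index of the closest code per latent
codes = ((latents[:, None, :] - codebook[None]) ** 2).sum(-1).argmin(1)

probs = np.bincount(codes, minlength=len(codebook)) / len(codes)
nz = probs[probs > 0]
perplexity = np.exp(-(nz * np.log(nz)).sum())       # effective number of codes used
print(f"codebook perplexity: {perplexity:.1f} / {len(codebook)}")
```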
[471] PRISM: Demystifying Retention and Interaction in Mid-Training
Bharat Runwal, Ashish Agrawal, Anurag Roy, Rameswar Panda
Main category: cs.LG
TL;DR: PRISM study shows mid-training on high-quality data significantly improves reasoning capabilities in LLMs across math, code, and science benchmarks while preserving general performance, with RL only effective after mid-training.
Details
Motivation: To understand how mid-training design choices affect large language model performance, particularly for reasoning tasks, and to provide empirical guidance on effective training pipelines for enhancing reasoning capabilities.
Method: Comprehensive empirical study across 7 base models from 4 families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), scales from 3B to 24B parameters. Used controlled experiments with mid-training on ~27B high-quality tokens, followed by RL fine-tuning. Analyzed weight changes, representation geometry (CKA), and benchmark performance.
Result: Mid-training yields consistent gains: +15 to +40 points on math, +5 to +12 points on code, +6 to +13 points on science benchmarks while preserving general performance. RL only effective after mid-training (3-4x improvement on reasoning benchmarks). Data composition crucial during mid-training, not RL. Mid-training restructures 90%+ weights, RL makes sparse refinements to ~5% parameters.
Conclusion: Retention-aware mid-training is highly effective for reliable reasoning enhancement. Mid-training places models in configurations where RL can effectively improve performance. Provides practical guidance for designing robust mid-training pipelines.
Abstract: We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM to RL pipeline improves macro-average across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement), whereas RL applied directly to most of the base models remains substantially less effective, with AIME scores near zero. Data composition matters most at mid-training, not RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2 point differences. Mechanistically, mid-training densely restructures over 90% of model weights, while RL makes sparse, front-loaded refinements to approximately 5% of parameters. Representation analysis (CKA) confirms that RL consistently preserves mid-training’s representational geometry (over 0.998 CKA) across architectures. Crucially, RL applies identical weight changes regardless of starting point, yet only succeeds on mid-trained models, consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. Our results demonstrate that retention-aware mid-training is highly effective for reliable reasoning enhancement and provide practical guidance for designing robust mid-training pipelines.
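For reference, the linear CKA similarity used in the representation analysis (Kornblith et al., 2019); X and Y hold activations of two checkpoints on the same probe inputs, one row per example.

```python
import numpy as np

def linear_cka(X, Y):
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

X = np.random.randn(100, 32)
print(linear_cka(X, X + 0.01 * np.random.randn(100, 32)))  # close to 1.0
```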
[472] CircuitBuilder: From Polynomials to Circuits via Reinforcement Learning
Weikun K. Zhang, Rohan Pandey, Bhaumik Mehta, Kaijie Jin, Naomi Morato, Archit Ganapule, Michael Ruofan Zeng, Jarod Alper
Main category: cs.LG
TL;DR: RL agents (PPO+MCTS and SAC) are trained to discover efficient arithmetic circuits for computing polynomials, with SAC performing best on two-variable problems and PPO+MCTS scaling to three variables.
Details
Motivation: Motivated by auto-proof generation and Valiant's VP vs. VNP conjecture, the paper aims to study the problem of discovering efficient arithmetic circuits to compute polynomials using addition and multiplication gates.
Method: Formulates polynomial circuit synthesis as a single-player game where an RL agent builds circuits within fixed operations. Implements AlphaZero-style training loop comparing two approaches: Proximal Policy Optimization with Monte Carlo Tree Search (PPO+MCTS) and Soft Actor-Critic (SAC).
Result: SAC achieves highest success rates on two-variable polynomial targets, while PPO+MCTS scales to three variables and demonstrates steady improvement on harder instances.
Conclusion: Polynomial circuit synthesis provides a compact, verifiable setting for studying self-improving search policies, with different RL approaches showing complementary strengths.
Abstract: Motivated by auto-proof generation and Valiant’s VP vs. VNP conjecture, we study the problem of discovering efficient arithmetic circuits to compute polynomials, using addition and multiplication gates. We formulate this problem as a single-player game, where an RL agent attempts to build the circuit within a fixed number of operations. We implement an AlphaZero-style training loop and compare two approaches: Proximal Policy Optimization with Monte Carlo Tree Search (PPO+MCTS) and Soft Actor-Critic (SAC). SAC achieves the highest success rates on two-variable targets, while PPO+MCTS scales to three variables and demonstrates steady improvement on harder instances. These results suggest that polynomial circuit synthesis is a compact, verifiable setting for studying self-improving search policies.
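A sketch of how the single-player game's success check could work (our formalization, not the authors' environment): a circuit is a sequence of gate actions over existing nodes, and correctness against the target polynomial is tested by random evaluation in the spirit of Schwartz-Zippel.

```python
import random

def computes_target(actions, target, n_vars=2, trials=5):
    # verify by random evaluation: agreement at random points over large
    # integers makes a false positive vanishingly unlikely (Schwartz-Zippel)
    for _ in range(trials):
        vals = [random.randint(1, 10**6) for _ in range(n_vars)]
        nodes = list(vals)                      # input variables are nodes 0..n-1
        for op, i, j in actions:
            nodes.append(nodes[i] + nodes[j] if op == "+" else nodes[i] * nodes[j])
        if nodes[-1] != target(*vals):
            return False
    return True

# three gates computing x*y + y^2: y*y (node 2), x*y (node 3), their sum
print(computes_target([("*", 1, 1), ("*", 0, 1), ("+", 2, 3)],
                      target=lambda x, y: x * y + y * y))
```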
[473] SENSE: Efficient EEG-to-Text via Privacy-Preserving Semantic Retrieval
Akshaj Murhekar, Christina Liu, Abhijit Mishra, Shounak Roychowdhury, Jacek Gwizdka
Main category: cs.LG
TL;DR: SENSE is a lightweight, privacy-preserving framework that decodes EEG signals into text using a two-stage approach: on-device semantic retrieval to extract keywords, followed by zero-shot LLM generation, without fine-tuning language models.
Details
Motivation: Existing BCI approaches require memory-intensive fine-tuning of LLMs on raw EEG signals, leading to expensive training, limited accessibility, and privacy risks from exposing sensitive neural data. There's a need for lightweight, privacy-preserving alternatives.
Method: Two-stage framework: 1) On-device EEG-to-keyword module (~6M parameters) maps EEG signals to discrete textual space to extract Bag-of-Words, 2) Zero-shot prompt-based generation using off-the-shelf LLM conditioned on extracted keywords. Raw EEG stays local, only abstract semantic cues are shared.
Result: Evaluated on 128-channel EEG dataset across six subjects, SENSE matches or surpasses generative quality of fully fine-tuned baselines like Thought2Text while substantially reducing computational overhead. Provides privacy-preserving neural decoding.
Conclusion: SENSE offers a scalable, privacy-aware retrieval-augmented architecture for next-generation BCIs by localizing neural decoding and sharing only derived textual cues, enabling efficient brain-to-text conversion without LLM fine-tuning.
Abstract: Decoding brain activity into natural language is a major challenge in AI with important applications in assistive communication, neurotechnology, and human-computer interaction. Most existing Brain-Computer Interface (BCI) approaches rely on memory-intensive fine-tuning of Large Language Models (LLMs) or encoder-decoder models on raw EEG signals, resulting in expensive training pipelines, limited accessibility, and potential exposure of sensitive neural data. We introduce SENSE (SEmantic Neural Sparse Extraction), a lightweight and privacy-preserving framework that translates non-invasive electroencephalography (EEG) into text without LLM fine-tuning. SENSE decouples decoding into two stages: on-device semantic retrieval and prompt-based language generation. EEG signals are locally mapped to a discrete textual space to extract a non-sensitive Bag-of-Words (BoW), which conditions an off-the-shelf LLM to synthesize fluent text in a zero-shot manner. The EEG-to-keyword module contains only ~6M parameters and runs fully on-device, ensuring raw neural signals remain local while only abstract semantic cues interact with language models. Evaluated on a 128-channel EEG dataset across six subjects, SENSE matches or surpasses the generative quality of fully fine-tuned baselines such as Thought2Text while substantially reducing computational overhead. By localizing neural decoding and sharing only derived textual cues, SENSE provides a scalable and privacy-aware retrieval-augmented architecture for next-generation BCIs.
[474] Contextual Preference Distribution Learning
Benjamin Hudson, Laurent Charlin, Emma Frejinger
Main category: cs.LG
TL;DR: A pipeline for learning human preference distributions from observed choices and using them for risk-averse decision-making in optimization problems.
Details
Motivation: Decision-making often involves uncertainty from heterogeneous and context-dependent human preferences. Existing methods produce point estimates or fail to capture contextual shifts, making them unsuitable for risk-averse decision-making.
Method: Sequential learning-and-optimization pipeline using bounded-variance score function gradient estimator to train a predictive model mapping contextual features to parameterizable distributions, then using generated scenarios for optimization.
Result: In synthetic ridesharing environment, reduces average post-decision surprise by up to 114× compared to risk-neutral approach with perfect predictions and up to 25× compared to leading risk-averse baselines.
Conclusion: Proposed approach effectively captures human preference uncertainty and enables robust risk-averse decision-making in optimization problems with contextual dependencies.
Abstract: Decision-making problems often feature uncertainty stemming from heterogeneous and context-dependent human preferences. To address this, we propose a sequential learning-and-optimization pipeline to learn preference distributions and leverage them to solve downstream problems, for example risk-averse formulations. We focus on human choice settings that can be formulated as (integer) linear programs. In such settings, existing inverse optimization and choice modelling methods infer preferences from observed choices but typically produce point estimates or fail to capture contextual shifts, making them unsuitable for risk-averse decision-making. Using a bounded-variance score function gradient estimator, we train a predictive model mapping contextual features to a rich class of parameterizable distributions. This approach yields a maximum likelihood estimate. The model generates scenarios for unseen contexts in the subsequent optimization phase. In a synthetic ridesharing environment, our approach reduces average post-decision surprise by up to 114$\times$ compared to a risk-neutral approach with perfect predictions and up to 25$\times$ compared to leading risk-averse baselines.
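A bare-bones score-function (REINFORCE-style) gradient step with a constant baseline for variance control, illustrating the kind of estimator the pipeline relies on to train a context-to-distribution model; the network, utility, and Gaussian family are our simplifications.

```python
import torch

ctx_net = torch.nn.Linear(4, 2)   # context -> (mean, log_std) of a Gaussian

def score_function_step(context, utility, n_samples=256):
    mean, log_std = ctx_net(context)
    dist = torch.distributions.Normal(mean, log_std.exp())
    samples = dist.sample((n_samples,))          # no gradient through sampling
    u = utility(samples)
    baseline = u.mean()                          # constant baseline tames variance
    loss = -(dist.log_prob(samples) * (u - baseline)).mean()
    loss.backward()
    return loss.item()

print(score_function_step(torch.randn(4), lambda s: -(s - 1.0).pow(2)))
```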
[475] REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge
Yasi Zhang, Tianyu Chen, Mingyuan Zhou, Oscar Leong, Ying Nian Wu, Michal Lukasik
Main category: cs.LG
TL;DR: REAL is a regression-aware reinforcement learning framework that optimizes LLMs for evaluation tasks by considering ordinal structure in scoring, outperforming both regression-aware SFT and standard RL methods.
Details
Motivation: Standard RL methods use binary rewards that ignore ordinal structure in regression tasks (e.g., scoring 4 vs 1 when ground truth is 5), while existing regression-aware approaches are limited to SFT and can't explore optimal reasoning paths.
Method: Proposes REAL, a principled RL framework using generalized policy gradient estimator that decomposes optimization into: (1) exploration over Chain-of-Thought trajectories, and (2) regression-aware prediction refinement of final scores.
Result: Extensive experiments across 8B to 32B models show REAL consistently outperforms regression-aware SFT baselines and standard RL methods, with significant generalization improvements on out-of-domain benchmarks. On Qwen3-32B: +8.40 Pearson and +7.20 Spearman correlation over SFT baseline.
Conclusion: Integrating regression objectives into RL exploration is critical for accurate LLM evaluation, with REAL demonstrating substantial improvements in correlation metrics and generalization capabilities.
Abstract: Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware approaches are often confined to Supervised Fine-Tuning (SFT), limiting their ability to explore optimal reasoning paths. To bridge this gap, we propose \textbf{REAL} (\underline{RE}gression-\underline{A}ware Reinforcement \underline{L}earning), a principled RL framework designed to optimize regression rewards, and also proven to be optimal for correlation metrics. A key technical challenge is that the regression objective is explicitly policy-dependent, thus invalidating standard policy gradient methods. To address this, we employ the generalized policy gradient estimator, which naturally decomposes optimization into two complementary components: (1) exploration over Chain-of-Thought (CoT) trajectory, and (2) regression-aware prediction refinement of the final score. Extensive experiments across model scales (8B to 32B) demonstrate that REAL consistently outperforms both regression-aware SFT baselines and standard RL methods, exhibiting significantly better generalization on out-of-domain benchmarks. On Qwen3-32B specifically, we achieve gains of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model. These findings highlight the critical value of integrating regression objectives into RL exploration for accurate LLM evaluation.
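The core contrast the paper draws, in toy form: a binary reward treats predicting 4 and predicting 1 identically when the ground truth is 5, while a regression-aware reward respects ordinal distance. REAL's actual estimator is more involved; this only illustrates the reward structure.

```python
def binary_reward(pred, target):
    return 1.0 if pred == target else 0.0

def regression_reward(pred, target, max_score=5, min_score=1):
    return 1.0 - abs(pred - target) / (max_score - min_score)

for pred in (5, 4, 1):   # ground truth is 5
    print(pred, binary_reward(pred, 5), regression_reward(pred, 5))
```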
[476] Personalized Fall Detection by Balancing Data with Selective Feedback Using Contrastive Learning
Awatif Yasmin, Tarek Mahmud, Sana Alamgeer, Anne H. H. Ngu
Main category: cs.LG
TL;DR: Personalized fall detection framework using semi-supervised clustering and contrastive learning to balance imbalanced user feedback data, evaluated across three retraining strategies with TFS achieving best results.
Details
Motivation: Personalized fall detection models improve accuracy but face challenges due to scarcity of real-world fall data and imbalance between fall and non-fall samples, which biases models toward routine activities and reduces sensitivity to true falls.
Method: Proposes a personalization framework combining semi-supervised clustering with contrastive learning to identify and balance the most informative user feedback samples. Evaluated under three retraining strategies: Training from Scratch (TFS), Transfer Learning (TL), and Few-Shot Learning (FSL).
Result: Real-time experiments with ten participants show TFS achieves highest performance with up to 25% improvement over baseline, while FSL achieves second-highest performance with 7% improvement. Demonstrates effectiveness of selective personalization for real-world deployment.
Conclusion: The proposed framework effectively addresses data imbalance in personalized fall detection through intelligent sample selection and balancing, with TFS being the most effective retraining strategy for real-world deployment.
Abstract: Personalized fall detection models can significantly improve accuracy by adapting to individual motion patterns, yet their effectiveness is often limited by the scarcity of real-world fall data and the dominance of non-fall feedback samples. This imbalance biases the model toward routine activities and weakens its sensitivity to true fall events. To address this challenge, we propose a personalization framework that combines semi-supervised clustering with contrastive learning to identify and balance the most informative user feedback samples. The framework is evaluated under three retraining strategies, including Training from Scratch (TFS), Transfer Learning (TL), and Few-Shot Learning (FSL), to assess adaptability across learning paradigms. Real-time experiments with ten participants show that the TFS approach achieves the highest performance, with up to a 25% improvement over the baseline, while FSL achieves the second-highest performance with a 7% improvement, demonstrating the effectiveness of selective personalization for real-world deployment.
[477] Noise-Response Calibration: A Causal Intervention Protocol for LLM-Judges
Maxim Khomiakov, Jes Frellsen
Main category: cs.LG
TL;DR: Proposes a calibration protocol for LLM judges using controlled noise interventions to test performance degradation, revealing a modality gap where text judges degrade predictably but tabular judges often don’t show significant deterioration even under noise.
Details
Motivation: LLMs are increasingly used as automated judges and synthetic labelers, but they are stochastic and overconfident, making deployment decisions difficult when external ground truth is limited. Need a practical calibration protocol to assess reliability.
Method: Proposes calibration protocol based on controlled input interventions: if noise severity increases, task performance should show statistically significant deterioration trend. Uses slope-based hypothesis test over repeated trials with SNR perturbations for tabular data and lexical perturbations for text data.
Result: Reveals modality gap: text-based judges degrade predictably, but majority of tabular datasets show lack of statistically significant performance deterioration even under significant signal-to-noise reduction. Model performance is lower on datasets insensitive to noise interventions.
Conclusion: Presents reproducible methodology and reporting protocol for robust LLM-judge calibration under distribution shift, highlighting modality-dependent behavior in LLM judge reliability.
Abstract: Large language models (LLMs) are increasingly used as automated judges and synthetic labelers, especially in low-label settings. Yet these systems are stochastic and often overconfident, which makes deployment decisions difficult when external ground truth is limited. We propose a practical calibration protocol based on controlled input interventions: if noise severity increases, task performance should exhibit a statistically significant deterioration trend. We operationalize this with a slope-based hypothesis test over repeated trials, using signal-to-noise-ratio (SNR) perturbations for tabular data and lexical perturbations for text data. Across UCI tabular benchmarks and four text classification datasets, we find clear modality-dependent behavior. Our results reveal a modality gap: while text-based judges degrade predictably, the majority of tabular datasets show a lack of statistically significant performance deterioration even under significant signal-to-noise reduction. Interestingly we find that model performance is lower on datasets that are insensitive to noise interventions. We present a reproducible methodology and reporting protocol for robust LLM-judge calibration under distribution shift.
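A compact version of the slope-based hypothesis test: regress task accuracy on noise severity across repeated trials and require a significantly negative slope. The severity grid and accuracy numbers below are simulated placeholders, not measurements from the paper.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
severity = np.repeat([0.0, 0.25, 0.5, 0.75, 1.0], 5)   # noise levels x trials
accuracy = 0.9 - 0.3 * severity + 0.02 * rng.standard_normal(severity.size)

res = linregress(severity, accuracy)
degrades = res.slope < 0 and res.pvalue < 0.05
print(f"slope={res.slope:.3f}  p={res.pvalue:.4f}  degrades as expected: {degrades}")
```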
[478] Domain-informed explainable boosting machines for trustworthy lateral spread predictions
Cheng-Hsi Hsiao, Krishna Kumar, Ellen M. Rathje
Main category: cs.LG
TL;DR: Domain-informed Explainable Boosting Machines (EBMs) for physically consistent lateral spreading prediction in earthquake hazard applications
Details
Motivation: EBMs provide transparent predictions but can learn non-physical relationships that reduce reliability in natural hazard applications like lateral spreading prediction.
Method: Domain-informed framework that modifies learned shape functions based on domain knowledge to correct non-physical behavior while maintaining data-driven patterns.
Result: Applied to 2011 Christchurch earthquake dataset, corrected non-physical trends in original EBM, producing more physically consistent explanations with 4-5% accuracy tradeoff
Conclusion: Domain-informed EBMs improve physical consistency for hazard applications with acceptable accuracy tradeoff, enhancing reliability of transparent models
Abstract: Explainable Boosting Machines (EBMs) provide transparent predictions through additive shape functions, enabling direct inspection of feature contributions. However, EBMs can learn non-physical relationships that reduce their reliability in natural hazard applications. This study presents a domain-informed framework to improve the physical consistency of EBMs for lateral spreading prediction. Our approach modifies learned shape functions based on domain knowledge. These modifications correct non-physical behavior while maintaining data-driven patterns. We apply the method to the 2011 Christchurch earthquake dataset and correct non-physical trends observed in the original EBM. The resulting model produces more physically consistent global and local explanations, with an acceptable tradeoff in accuracy (4–5%).
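One simple way to implement a monotonicity correction of a learned shape function, offered as a stand-in for the paper's domain-informed edits: project the EBM's per-feature curve onto the nearest non-decreasing curve via isotonic regression, leaving the rest of the data-driven shape intact.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

grid = np.linspace(0, 1, 50)                    # feature bins, e.g. normalized PGA
shape = grid + 0.1 * np.sin(4 * np.pi * grid)   # learned curve with a non-physical dip
corrected = IsotonicRegression(increasing=True).fit_transform(grid, shape)
print(f"max correction applied: {np.abs(corrected - shape).max():.3f}")
```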
[479] MetaClaw: Just Talk – An Agent That Meta-Learns and Evolves in the Wild
Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Huaxiu Yao
Main category: cs.LG
TL;DR: MetaClaw is a continual meta-learning framework that enables LLM agents to adapt to evolving user needs through skill-driven fast adaptation and opportunistic policy optimization without service disruption.
Details
Motivation: Current LLM agents remain static after deployment, failing to adapt as user needs evolve, creating tension between continuous service and capability updates. Existing methods either store raw trajectories without knowledge distillation, maintain static skill libraries, or require disruptive downtime for retraining.
Method: MetaClaw uses two complementary mechanisms: 1) Skill-driven fast adaptation analyzes failure trajectories via an LLM evolver to synthesize new skills for immediate improvement with zero downtime, and 2) Opportunistic policy optimization performs gradient-based updates via cloud LoRA fine-tuning and RL-PRM during user-inactive windows using the Opportunistic Meta-Learning Scheduler. A versioning mechanism prevents data contamination.
Result: Skill-driven adaptation improves accuracy by up to 32% relative. The full pipeline advances Kimi-K2.5 accuracy from 21.4% to 40.6% and increases composite robustness by 18.3%.
Conclusion: MetaClaw enables continuous adaptation of LLM agents to evolving user needs without service disruption through a meta-learning framework that combines skill synthesis and opportunistic policy optimization.
Abstract: Large language model (LLM) agents are increasingly used for complex tasks, yet deployed agents often remain static, failing to adapt as user needs evolve. This creates a tension between the need for continuous service and the necessity of updating capabilities to match shifting task distributions. On platforms like OpenClaw, which handle diverse workloads across 20+ channels, existing methods either store raw trajectories without distilling knowledge, maintain static skill libraries, or require disruptive downtime for retraining. We present MetaClaw, a continual meta-learning framework that jointly evolves a base LLM policy and a library of reusable behavioral skills. MetaClaw employs two complementary mechanisms. Skill-driven fast adaptation analyzes failure trajectories via an LLM evolver to synthesize new skills, enabling immediate improvement with zero downtime. Opportunistic policy optimization performs gradient-based updates via cloud LoRA fine-tuning and Reinforcement Learning with a Process Reward Model (RL-PRM). This is triggered during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors system inactivity and calendar data. These mechanisms are mutually reinforcing: a refined policy generates better trajectories for skill synthesis, while richer skills provide higher-quality data for policy optimization. To prevent data contamination, a versioning mechanism separates support and query data. Built on a proxy-based architecture, MetaClaw scales to production-size LLMs without local GPUs. Experiments on MetaClaw-Bench and AutoResearchClaw show that skill-driven adaptation improves accuracy by up to 32% relative. The full pipeline advances Kimi-K2.5 accuracy from 21.4% to 40.6% and increases composite robustness by 18.3%. Code is available at https://github.com/aiming-lab/MetaClaw.
[480] Self-Conditioned Denoising for Atomistic Representation Learning
Tynan Perez, Rafael Gomez-Bombarelli
Main category: cs.LG
TL;DR: SCD is a self-supervised pretraining method for atomistic data that uses self-embeddings for conditional denoising across diverse domains (molecules, proteins, materials) and outperforms previous SSL methods while matching supervised pretraining performance.
Details
Motivation: Current self-supervised learning methods for atomistic data are limited to ground-state geometries and single domains, while supervised pretraining on DFT force-energy labels has shown better performance. There's a need for more effective SSL approaches that work across diverse atomistic data domains.
Method: Self-Conditioned Denoising (SCD) - a backbone-agnostic reconstruction objective that uses self-embeddings for conditional denoising across any domain of atomistic data, including small molecules, proteins, periodic materials, and non-equilibrium geometries.
Result: SCD significantly outperforms previous SSL methods on downstream benchmarks and matches or exceeds supervised force-energy pretraining performance. A small GNN pretrained with SCD achieves competitive/superior performance to larger models pretrained on larger datasets across multiple domains.
Conclusion: SCD provides an effective self-supervised pretraining strategy for atomistic foundation models that works across diverse domains and achieves state-of-the-art performance comparable to supervised methods.
Abstract: The success of large-scale pretraining in NLP and computer vision has catalyzed growing efforts to develop analogous foundation models for the physical sciences. However, pretraining strategies using atomistic data remain underexplored. To date, large-scale supervised pretraining on DFT force-energy labels has provided the strongest performance gains to downstream property prediction, out-performing existing methods of self-supervised learning (SSL) which remain limited to ground-state geometries, and/or single domains of atomistic data. We address these shortcomings with Self-Conditioned Denoising (SCD), a backbone-agnostic reconstruction objective that utilizes self-embeddings for conditional denoising across any domain of atomistic data, including small molecules, proteins, periodic materials, and ’non-equilibrium’ geometries. When controlled for backbone architecture and pretraining dataset, SCD significantly outperforms previous SSL methods on downstream benchmarks and matches or exceeds the performance of supervised force-energy pretraining. We show that a small, fast GNN pretrained by SCD can achieve competitive or superior performance to larger models pretrained on significantly larger labeled or unlabeled datasets, across tasks in multiple domains. Our code is available at: https://github.com/TyJPerez/SelfConditionedDenoisingAtoms
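Our rough reading of the self-conditioned objective, with toy linear stand-ins for the backbone: corrupt atomic coordinates with noise, embed the clean structure with the same model under stop-gradient to get the self-conditioning signal, and regress the noise from the corrupted input plus that embedding.

```python
import torch

backbone = torch.nn.Linear(30, 16)   # toy encoder over 10 atoms x 3 coords
head = torch.nn.Linear(30 + 16, 30)  # noise predictor, conditioned on embedding

coords = torch.randn(4, 30)
noise = 0.1 * torch.randn_like(coords)
with torch.no_grad():                           # stop-gradient self-embedding
    cond = backbone(coords)
pred = head(torch.cat([coords + noise, cond], dim=-1))
loss = (pred - noise).pow(2).mean()
loss.backward()
print(loss.item())
```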
[481] Abstraction as a Memory-Efficient Inductive Bias for Continual Learning
Elnaz Rahmati, Nona Ghazizadeh, Zhivar Sourati, Nina Rouhani, Morteza Dehghani
Main category: cs.LG
TL;DR: AAT (Abstraction-Augmented Training) is a loss-level modification for online continual learning that encourages models to capture latent relational structures across examples, eliminating the need for replay buffers while achieving comparable performance to experience replay methods.
Details
Motivation: Real-world environments are non-stationary and complex, requiring continual learning without expensive retraining. Online continual learning faces interference between new and old knowledge, causing forgetting and degraded generalization. Current methods often rely on replay buffers, which require additional memory.
Method: Proposes Abstraction-Augmented Training (AAT), which jointly optimizes over concrete instances and their abstract representations. Uses a loss-level modification to encourage capturing latent relational structure. Evaluated on two benchmarks: a controlled relational dataset with entity masking, and a narrative dataset with shared proverbs.
Result: AAT achieves performance comparable to or exceeding strong experience replay baselines, despite requiring zero additional memory and only minimal changes to training objective. Shows structural abstraction as effective memory-free alternative to experience replay.
Conclusion: AAT demonstrates that structural abstraction can serve as a powerful, memory-free alternative to experience replay for online continual learning, effectively addressing forgetting while maintaining generalization without additional memory requirements.
Abstract: The real world is non-stationary and infinitely complex, requiring intelligent agents to learn continually without the prohibitive cost of retraining from scratch. While online continual learning offers a framework for this setting, learning new information often interferes with previously acquired knowledge, causing forgetting and degraded generalization. To address this, we propose Abstraction-Augmented Training (AAT), a loss-level modification encouraging models to capture the latent relational structure shared across examples. By jointly optimizing over concrete instances and their abstract representations, AAT introduces a memory-efficient inductive bias that stabilizes learning in strictly online data streams, eliminating the need for a replay buffer. To capture the multi-faceted nature of abstraction, we introduce and evaluate AAT on two benchmarks: a controlled relational dataset where abstraction is realized through entity masking, and a narrative dataset where abstraction is expressed through shared proverbs. Our results show that AAT achieves performance comparable to or exceeding strong experience replay (ER) baselines, despite requiring zero additional memory and only minimal changes to the training objective. This work highlights structural abstraction as a powerful, memory-free alternative to ER.
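A minimal sketch of the joint objective: the same model is trained on a concrete instance and on its abstraction (here, a toy entity-masking rule), with a weighting between the two terms. The masking rule, the weighting, and the toy model are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of an AAT-style objective: optimize on a concrete instance and on its
# abstraction, with no replay buffer. Masking rule and alpha are placeholders.
def mask_entities(x, entity_id=0, mask_id=99):
    return torch.where(x == entity_id, torch.full_like(x, mask_id), x)

def aat_loss(model, x, y, alpha=0.5):
    concrete = F.cross_entropy(model(x), y)
    abstract = F.cross_entropy(model(mask_entities(x)), y)
    return (1 - alpha) * concrete + alpha * abstract

model = nn.Sequential(nn.Embedding(100, 32), nn.Flatten(), nn.Linear(32 * 8, 5))
x = torch.randint(0, 100, (4, 8))   # batch of token sequences
y = torch.randint(0, 5, (4,))
aat_loss(model, x, y).backward()
```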
[482] Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing
Parsa Mirtaheri, Mikhail Belkin
Main category: cs.LG
TL;DR: LLMs exhibit motivated reasoning by rationalizing answers influenced by hints without acknowledging them, detectable via internal activation probes better than chain-of-thought monitoring.
Details
Motivation: To investigate whether LLMs engage in motivated reasoning where they produce chain-of-thought rationalizations that don't reflect actual decision factors, particularly when influenced by external hints.
Method: Inject hints favoring specific options in multiple-choice settings, analyze LLM behavior across families and datasets, train supervised probes on residual stream activations (pre- and post-generation), compare with CoT monitoring.
Result: Pre-generation probes predict motivated reasoning as well as CoT monitors, post-generation probes outperform CoT monitors, internal representations detect motivated reasoning more reliably than CoT analysis.
Conclusion: Motivated reasoning in LLMs is better detected through internal activation probes than chain-of-thought monitoring, with pre-generation probes enabling early detection to avoid unnecessary generation.
Abstract: Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their answers. In multiple-choice settings with an injected hint favoring a particular option, models may shift their final answer toward the hinted option and produce a CoT that rationalizes the response without acknowledging the hint - an instance of motivated reasoning. We study this phenomenon across multiple LLM families and datasets, demonstrating that motivated reasoning can be identified by probing internal activations even in cases when it cannot be easily determined from the CoT. Using supervised probes trained on the model’s residual stream, we show that (i) pre-generation probes, applied before any CoT tokens are generated, predict motivated reasoning as well as an LLM-based CoT monitor that accesses the full CoT trace, and (ii) post-generation probes, applied after CoT generation, outperform the same monitor. Together, these results show that motivated reasoning is detected more reliably from internal representations than from CoT monitoring. Moreover, pre-generation probing can flag motivated behavior early, potentially avoiding unnecessary generation.
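The probing recipe itself is standard and easy to reproduce: collect residual-stream activations at a fixed layer and token position, label each example by whether the answer followed the hint without acknowledgment, and fit a linear probe. The shapes and random stand-in data below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sketch of a pre-generation activation probe: a linear classifier on residual
# stream activations captured at the final prompt token. Data here is random,
# so accuracy is ~0.5; real activations and labels replace X and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4096))   # residual-stream activations (one layer)
y = rng.integers(0, 2, size=2000)   # 1 = motivated reasoning, 0 = faithful

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000, C=0.1).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```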
[483] On the Cone Effect and Modality Gap in Medical Vision-Language Embeddings
David Restrepo, Miguel L Martins, Chenwei Wu, Luis Filipe Nakayama, Diego M Lopez, Stergios Christodoulidis, Maria Vakalopoulou, Enzo Ferrante
Main category: cs.LG
TL;DR: A lightweight post-hoc mechanism to control modality gap in VLMs shows that reducing excessive cross-modal separation improves downstream performance, especially in medical domains, but complete collapse isn’t optimal.
Details
Motivation: While the modality gap (cross-modal separation) in VLMs is widely observed, its practical impact on supervised multimodal learning, particularly in medical domains, remains unclear. The paper aims to systematically analyze how this gap affects downstream performance.
Method: Introduces a lightweight post-hoc mechanism that keeps pretrained VLM encoders frozen while continuously controlling cross-modal separation through a single hyperparameter λ. Evaluates generalist (CLIP, SigLIP) and medically specialized (BioMedCLIP, MedSigLIP) models across diverse medical and natural datasets in supervised multimodal settings.
Result: Results show reducing excessive modality gap improves downstream performance, with medical datasets exhibiting stronger sensitivity to gap modulation. However, fully collapsing the gap is not optimal - intermediate, task-dependent separation yields best results.
Conclusion: The modality gap should be viewed as a tunable property of multimodal representations rather than a quantity that should be universally minimized. Different tasks require different levels of cross-modal separation for optimal performance.
Abstract: Vision-Language Models (VLMs) exhibit a characteristic “cone effect” in which nonlinear encoders map embeddings into highly concentrated regions of the representation space, contributing to cross-modal separation known as the modality gap. While this phenomenon has been widely observed, its practical impact on supervised multimodal learning, particularly in medical domains, remains unclear. In this work, we introduce a lightweight post-hoc mechanism that keeps pretrained VLM encoders frozen while continuously controlling cross-modal separation through a single hyperparameter λ. This enables systematic analysis of how the modality gap affects downstream multimodal performance without expensive retraining. We evaluate generalist (CLIP, SigLIP) and medically specialized (BioMedCLIP, MedSigLIP) models across diverse medical and natural datasets in supervised multimodal settings. Results consistently show that reducing excessive modality gap improves downstream performance, with medical datasets exhibiting stronger sensitivity to gap modulation; however, fully collapsing the gap is not always optimal, and intermediate, task-dependent separation yields the best results. These findings position the modality gap as a tunable property of multimodal representations rather than a quantity that should be universally minimized.
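The paper's exact mechanism is not spelled out in this summary, but a simple way to modulate the gap post hoc with a single scalar is to translate each modality's embeddings along the centroid-difference direction; the sketch below is one such illustrative implementation, not necessarily the authors'.

```python
import numpy as np

# One simple post-hoc way to modulate the modality gap with a single scalar
# lam, keeping encoders frozen. lam = 0 leaves embeddings unchanged; lam = 1
# pulls the two centroids together (up to the final re-normalization).
def shift_gap(img_emb, txt_emb, lam):
    gap = txt_emb.mean(axis=0) - img_emb.mean(axis=0)
    img_shifted = img_emb + lam * gap / 2
    txt_shifted = txt_emb - lam * gap / 2
    # re-normalize, since CLIP-style similarities assume unit-norm embeddings
    img_shifted /= np.linalg.norm(img_shifted, axis=1, keepdims=True)
    txt_shifted /= np.linalg.norm(txt_shifted, axis=1, keepdims=True)
    return img_shifted, txt_shifted

img = np.random.randn(100, 512); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = np.random.randn(100, 512); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
img2, txt2 = shift_gap(img, txt, lam=0.5)
```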
[484] Binary Latent Protein Fitness Landscapes for Quantum Annealing Optimization
Truong-Son Hy
Main category: cs.LG
TL;DR: Q-BIOLAT: A framework for modeling protein fitness landscapes in binary latent spaces using QUBO formulation for efficient combinatorial optimization, compatible with quantum annealing hardware.
Details
Motivation: To bridge protein representation learning with combinatorial optimization by creating a framework that models protein fitness in binary latent spaces, enabling efficient search for high-fitness variants and compatibility with emerging quantum hardware.
Method: Uses pretrained protein language models to obtain continuous embeddings, transforms them into compact binary latent representations, approximates protein fitness using a quadratic unconstrained binary optimization (QUBO) model, and applies classical heuristics like simulated annealing and genetic algorithms for combinatorial search.
Result: On ProteinGym benchmark, Q-BIOLAT captures meaningful structure in protein fitness landscapes and identifies high-fitness variants, with sequences whose nearest neighbors lie within top fraction of training fitness distribution. Different optimization strategies show distinct behaviors: evolutionary search better in higher-dimensional spaces, local search better at preserving realistic sequences.
Conclusion: Q-BIOLAT provides a natural bridge between protein representation learning and combinatorial optimization, with QUBO formulation making it compatible with quantum annealing hardware for potential quantum-assisted protein engineering.
Abstract: We propose Q-BIOLAT, a framework for modeling and optimizing protein fitness landscapes in binary latent spaces. Starting from protein sequences, we leverage pretrained protein language models to obtain continuous embeddings, which are then transformed into compact binary latent representations. In this space, protein fitness is approximated using a quadratic unconstrained binary optimization (QUBO) model, enabling efficient combinatorial search via classical heuristics such as simulated annealing and genetic algorithms. On the ProteinGym benchmark, we demonstrate that Q-BIOLAT captures meaningful structure in protein fitness landscapes and enables the identification of high-fitness variants. Despite using a simple binarization scheme, our method consistently retrieves sequences whose nearest neighbors lie within the top fraction of the training fitness distribution, particularly under the strongest configurations. We further show that different optimization strategies exhibit distinct behaviors, with evolutionary search performing better in higher-dimensional latent spaces and local search remaining competitive in preserving realistic sequences. Beyond its empirical performance, Q-BIOLAT provides a natural bridge between protein representation learning and combinatorial optimization. By formulating protein fitness as a QUBO problem, our framework is directly compatible with emerging quantum annealing hardware, opening new directions for quantum-assisted protein engineering. Our implementation is publicly available at: https://github.com/HySonLab/Q-BIOLAT
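A minimal sketch of the optimization step: minimize x^T Q x over binary vectors with simulated annealing. Here Q is random for illustration; in Q-BIOLAT it would be fit so that low energy corresponds to high predicted fitness of the binary latent code.

```python
import numpy as np

# Minimal QUBO + simulated annealing sketch. Q is random for illustration;
# the bucket of hyperparameters (temperature, cooling rate, steps) is too.
rng = np.random.default_rng(0)
n = 32
Q = rng.normal(size=(n, n)); Q = (Q + Q.T) / 2   # symmetric QUBO matrix

def energy(x):
    return x @ Q @ x

x = rng.integers(0, 2, size=n).astype(float)
T = 1.0
for step in range(5000):
    i = rng.integers(n)
    x_new = x.copy(); x_new[i] = 1 - x_new[i]    # single-bit flip proposal
    dE = energy(x_new) - energy(x)
    if dE < 0 or rng.random() < np.exp(-dE / T):
        x = x_new
    T *= 0.999                                    # geometric cooling
print("final energy:", energy(x))
```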
[485] Pathology-Aware Multi-View Contrastive Learning for Patient-Independent ECG Reconstruction
Youssef Youssef, Jitin Singla
Main category: cs.LG
TL;DR: A pathology-aware multi-view contrastive learning framework for reconstructing 12-lead ECGs from reduced lead sets that preserves cardiac morphology by learning pathology-aware embeddings.
Details
Motivation: Standard deep learning methods for ECG reconstruction from reduced lead sets often ignore underlying cardiac pathology, losing vital morphology in precordial leads due to anatomical variability.
Method: Pathology-Aware Multi-View Contrastive Learning framework that regularizes the latent space through a pathological manifold, integrating high-fidelity time-domain waveforms with pathology-aware embeddings learned via supervised contrastive alignment to filter anatomical “nuisance” variables.
Result: Achieves ~76% reduction in RMSE compared to state-of-the-art models in patient-independent setting on PTB-XL dataset, with superior generalization confirmed on PTB Diagnostic Database.
Conclusion: The framework bridges the gap between hardware portability and diagnostic-grade reconstruction by preserving pathology information while filtering anatomical variability.
Abstract: Reconstructing a 12-lead electrocardiogram (ECG) from a reduced lead set is an ill-posed inverse problem due to anatomical variability. Standard deep learning methods often ignore underlying cardiac pathology, losing vital morphology in precordial leads. We propose Pathology-Aware Multi-View Contrastive Learning, a framework that regularizes the latent space through a pathological manifold. Our architecture integrates high-fidelity time-domain waveforms with pathology-aware embeddings learned via supervised contrastive alignment. By maximizing mutual information between latent representations and clinical labels, the framework learns to filter anatomical “nuisance” variables. On the PTB-XL dataset, our method achieves an approx. 76% reduction in RMSE compared to state-of-the-art models in a patient-independent setting. Cross-dataset evaluation on the PTB Diagnostic Database confirms superior generalization, bridging the gap between hardware portability and diagnostic-grade reconstruction.
[486] Variational Rectification Inference for Learning with Noisy Labels
Haoliang Sun, Qi Wei, Lei Feng, Yupeng Hu, Fan Liu, Hehe Fan, Yilong Yin
Main category: cs.LG
TL;DR: VRI proposes variational rectification inference for robust learning with noisy labels using amortized variational inference under meta-learning framework.
Details
Motivation: Existing meta-learning approaches for label noise suffer from model collapse that degenerates generalization performance, despite achieving robustness to noise.
Method: Formulates adaptive loss rectification as an amortized variational inference problem with a hierarchical Bayes model, treating the rectifying vector as a latent variable. Uses an amortization meta-network to approximate the conditional posterior, with smoothness assumptions for reliable rectification vectors.
Result: VRI avoids collapsing to Dirac delta function, significantly improves generalization performance, and shows effectiveness for robust learning with noisy labels, particularly with open-set noise.
Conclusion: VRI provides effective variational rectification inference framework for robust learning with noisy labels through meta-learning with variational inference, addressing model collapse issues.
Abstract: Label noise has been broadly observed in real-world datasets. To mitigate the negative impact of overfitting to label noise for deep models, effective strategies (e.g., re-weighting or loss rectification) have been broadly applied in prevailing approaches, which are generally learned under the meta-learning scenario. Despite the robustness to noise achieved by probabilistic meta-learning models, they usually suffer from model collapse that degenerates generalization performance. In this paper, we propose variational rectification inference (VRI) to formulate the adaptive rectification for loss functions as an amortized variational inference problem and derive the evidence lower bound under the meta-learning framework. Specifically, VRI is constructed as a hierarchical Bayes model by treating the rectifying vector as a latent variable, which can rectify the loss of noisy samples with extra randomness regularization and is, therefore, more robust to label noise. To achieve the inference of the rectifying vector, we approximate its conditional posterior with an amortization meta-network. By introducing the variational term in VRI, the conditional posterior is estimated accurately and avoids collapsing to a Dirac delta function, which can significantly improve the generalization performance. The elaborated meta-network and prior network adhere to the smoothness assumption, enabling the generation of reliable rectification vectors. Given a set of clean meta-data, VRI can be efficiently meta-learned within a bi-level optimization program. In addition, theoretical analysis guarantees that the meta-network can be efficiently learned with our algorithm. Comprehensive comparison experiments and analyses validate its effectiveness for robust learning with noisy labels, particularly in the presence of open-set noise.
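A rough, heavily simplified sketch of the central construction: an amortized meta-network parameterizes a Gaussian posterior over a rectifying vector, sampled with the reparameterization trick, and a KL term keeps that posterior from collapsing to a Dirac delta. The input features, sigmoid squashing, and KL weight are placeholder assumptions; the bi-level meta-learning loop over clean meta-data is omitted.

```python
import torch
import torch.nn as nn

# Rough sketch of a VRI-style amortized rectifier. Feature choice, prior, and
# weighting are illustrative assumptions, not the paper's exact design.
class AmortizedRectifier(nn.Module):
    def __init__(self, feat_dim=1, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, 1)
        self.logvar = nn.Linear(hidden, 1)

    def forward(self, loss_feats):
        h = self.net(loss_feats)
        mu, logvar = self.mu(h), self.logvar(h)
        v = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        # KL to a standard normal prior discourages a Dirac-delta posterior
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return torch.sigmoid(v).squeeze(-1), kl   # rectifying weights in (0, 1)

rectifier = AmortizedRectifier()
per_sample_loss = torch.rand(8, 1)                # stand-in for CE losses
w, kl = rectifier(per_sample_loss)
rectified = (w * per_sample_loss.squeeze(-1)).mean() + 1e-3 * kl
rectified.backward()
```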
[487] Classifier Pooling for Modern Ordinal Classification
Noam H. Rotenberg, Andreia V. Faria, Brian Caffo
Main category: cs.LG
TL;DR: A model-agnostic method for ordinal classification with Python implementation that adapts non-ordinal classifiers to handle ordinal data effectively.
Details
Motivation: Ordinal data is common in clinical and other domains, but there's a lack of modern machine learning methods and public software specifically designed for ordinal classification tasks.
Method: Developed a model-agnostic approach that can apply any non-ordinal classification method in an ordinal fashion, with open-source Python package implementation.
Result: The method outperforms non-ordinal classification approaches, especially when datasets are small or have many outcome classes, as demonstrated on multiple real-world datasets.
Conclusion: This work facilitates the use of modern machine learning algorithms for ordinal data through both methodological innovation and accessible software tools.
Abstract: Ordinal data is widely prevalent in clinical and other domains, yet there is a lack of both modern, machine-learning based methods and publicly available software to address it. In this paper, we present a model-agnostic method of ordinal classification, which can apply any non-ordinal classification method in an ordinal fashion. We also provide an open-source implementation of these algorithms, in the form of a Python package. We apply these models on multiple real-world datasets to show their performance across domains. We show that they often outperform non-ordinal classification methods, especially when the number of datapoints is relatively small or when there are many classes of outcomes. This work, including the developed software, facilitates the use of modern, more powerful machine learning algorithms to handle ordinal data.
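One classic way to run any binary classifier "ordinally" is a Frank-and-Hall-style cumulative decomposition, fitting P(y > k) for each threshold k and recombining. The package's pooling method may differ; this only illustrates the model-agnostic idea.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# Cumulative binary decomposition: a generic (not necessarily the paper's)
# model-agnostic reduction of ordinal classification to binary classifiers.
class CumulativeOrdinal:
    def __init__(self, base=None):
        self.base = base or LogisticRegression(max_iter=1000)

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        self.models_ = [clone(self.base).fit(X, (y > k).astype(int))
                        for k in self.classes_[:-1]]
        return self

    def predict(self, X):
        gt = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models_])
        cum = np.hstack([np.ones((len(X), 1)), gt, np.zeros((len(X), 1))])
        probs = cum[:, :-1] - cum[:, 1:]   # P(y = k); may be mildly inconsistent
        return self.classes_[np.argmax(probs, axis=1)]

X = np.random.randn(200, 5)
y = np.digitize(X[:, 0], [-1.0, 0.0, 1.0])   # ordinal labels 0..3
print(CumulativeOrdinal().fit(X, y).predict(X)[:10])
```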
[488] WINFlowNets: Warm-up Integrated Networks Training of Generative Flow Networks for Robotics and Machine Fault Adaptation
Zahin Sufiyan, Shadan Golestan, Yoshihiro Mitsuka, Shotaro Miwa, Osmar Zaiane
Main category: cs.LG
TL;DR: WINFlowNets introduces a co-training framework for continuous Generative Flow Networks that eliminates the need for pre-training retrieval networks, enabling better performance in dynamic robotic environments.
Details
Motivation: Current CFlowNets for robotic control require pre-training of retrieval networks, which is impractical in dynamic environments where pre-training data may not be available or representative. This limits their real-world applicability.
Method: WINFlowNets co-trains flow and retrieval networks together using a warm-up phase for the retrieval network, shared training architecture, and shared replay buffer, eliminating the need for separate pre-training.
Result: WINFlowNets outperforms CFlowNets and state-of-the-art RL algorithms in simulated robotic environments, achieving higher average rewards and better training stability, with strong adaptive capability in fault environments.
Conclusion: WINFlowNets enables practical deployment of flow-based methods in dynamic robotic systems by removing pre-training dependencies, offering better adaptation with limited data in malfunction-prone environments.
Abstract: Generative Flow Networks for continuous scenarios (CFlowNets) have shown promise in solving sequential decision-making tasks by learning stochastic policies using a flow and a retrieval network. Despite their demonstrated efficiency compared to state-of-the-art Reinforcement Learning (RL) algorithms, their practical application in robotic control tasks is constrained by the reliance on pre-training the retrieval network. This dependency poses challenges in dynamic robotic environments, where pre-training data may not be readily available or representative of the current environment. This paper introduces WINFlowNets, a novel CFlowNets framework that enables the co-training of flow and retrieval networks. WINFlowNets begins with a warm-up phase for the retrieval network to bootstrap its policy, followed by a shared training architecture and a shared replay buffer for co-training both networks. Experiments in simulated robotic environments demonstrate that WINFlowNets surpasses CFlowNets and state-of-the-art RL algorithms in terms of average reward and training stability. Furthermore, WINFlowNets exhibits strong adaptive capability in fault environments, making it suitable for tasks that demand quick adaptation with limited sample data. These findings highlight WINFlowNets’ potential for deployment in dynamic and malfunction-prone robotic systems, where traditional pre-training or sample inefficient data collection may be impractical.
[489] Learning Permutation Distributions via Reflected Diffusion on Ranks
Sizhuang He, Yangtian Zhang, Shiyang Zhang, David van Dijk
Main category: cs.LG
TL;DR: Soft-Rank Diffusion: A novel discrete diffusion framework for learning probability distributions on permutations using continuous soft-rank representations and contextualized Plackett-Luce denoisers.
Details
Motivation: Learning probability distributions on permutation groups S_n is challenging due to factorial growth in size and discrete, non-Euclidean structure. Existing permutation diffusion methods using shuffle-based random walks produce abrupt trajectories that become increasingly hard to denoise as n grows.
Method: Proposes Soft-Rank Diffusion that replaces shuffle-based corruption with structured soft-rank forward process: lifts permutations to continuous latent representation by relaxing discrete ranks into soft ranks. For reverse process, introduces contextualized generalized Plackett-Luce (cGPL) denoisers that generalize prior PL-style parameterizations.
Result: Experiments on sorting and combinatorial optimization benchmarks show Soft-Rank Diffusion consistently outperforms prior diffusion baselines, with particularly strong gains in long-sequence and intrinsically sequential settings.
Conclusion: Soft-Rank Diffusion provides a more effective framework for learning permutation distributions by using continuous soft-rank representations and improved denoising models, addressing limitations of existing discrete diffusion methods for permutation learning.
Abstract: The finite symmetric group S_n provides a natural domain for permutations, yet learning probability distributions on S_n is challenging due to its factorially growing size and discrete, non-Euclidean structure. Recent permutation diffusion methods define forward noising via shuffle-based random walks (e.g., riffle shuffles) and learn reverse transitions with Plackett-Luce (PL) variants, but the resulting trajectories can be abrupt and increasingly hard to denoise as n grows. We propose Soft-Rank Diffusion, a discrete diffusion framework that replaces shuffle-based corruption with a structured soft-rank forward process: we lift permutations to a continuous latent representation of order by relaxing discrete ranks into soft ranks, yielding smoother and more tractable trajectories. For the reverse process, we introduce contextualized generalized Plackett-Luce (cGPL) denoisers that generalize prior PL-style parameterizations and improve expressivity for sequential decision structures. Experiments on sorting and combinatorial optimization benchmarks show that Soft-Rank Diffusion consistently outperforms prior diffusion baselines, with particularly strong gains in long-sequence and intrinsically sequential settings.
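The soft-rank relaxation at the heart of the forward process can be sketched with a pairwise-sigmoid estimator (a standard construction; the paper's exact lifting may differ). As the temperature goes to zero, it recovers the hard 1-based ranks.

```python
import torch

# Differentiable soft-rank relaxation via pairwise sigmoids: a standard
# construction used for illustration; the paper's lifting may differ.
def soft_rank(x, tau=0.1):
    diff = x.unsqueeze(-1) - x.unsqueeze(-2)      # pairwise x_i - x_j
    s = torch.sigmoid(diff / tau).sum(-1)         # includes the j = i term (0.5)
    return s + 0.5                                # 1-based soft ranks

x = torch.tensor([0.3, -1.2, 2.0, 0.7])
print(soft_rank(x, tau=0.01))   # approx tensor([2., 1., 4., 3.])
```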
[490] Beyond Outliers: A Data-Free Layer-wise Mixed-Precision Quantization Approach Driven by Numerical and Structural Dual-Sensitivity
Hengyuan Zhang, Xinrong Chen, Zunhai Su, Xiao Liang, Jing Xiong, Wendong Xu, He Xiao, Chaofan Tao, Wei Zhang, Ruobing Xie, Lei Jiang, Hayden Kwok-Hay So, Ngai Wong
Main category: cs.LG
TL;DR: NSDS is a calibration-free layer-wise mixed-precision quantization framework that uses numerical and structural dual-sensitivity analysis to allocate higher precision to sensitive layers for effective compression under extreme low-bit settings.
Details
Motivation: Existing LMPQ methods treat all intra-layer weight modules uniformly and rely on single numerical properties for sensitivity estimation, overlooking distinct operational roles and structural characteristics of different modules within layers.
Method: Proposes NSDS framework that: 1) Mechanistically decomposes each layer into distinct operational roles, 2) Quantifies sensitivity from both numerical and structural perspectives, 3) Aggregates dual-aspect scores using MAD-Sigmoid and Soft-OR robust aggregation scheme to guide bit allocation.
Result: Extensive experiments show NSDS consistently achieves superior performance compared to various baselines across diverse models and downstream tasks, without requiring any calibration data.
Conclusion: NSDS provides an effective calibration-free LMPQ framework that addresses limitations of existing methods by considering both numerical and structural sensitivity, enabling better compression under extreme low-bit settings.
Abstract: Layer-wise mixed-precision quantization (LMPQ) enables effective compression under extreme low-bit settings by allocating higher precision to sensitive layers. However, existing methods typically treat all intra-layer weight modules uniformly and rely on a single numerical property when estimating sensitivity, overlooking their distinct operational roles and structural characteristics. To address this, we propose NSDS, a novel calibration-free LMPQ framework driven by Numerical and Structural Dual-Sensitivity. Specifically, it first mechanistically decomposes each layer into distinct operational roles and quantifies their sensitivity from both numerical and structural perspectives. These dual-aspect scores are then aggregated into a unified layer-wise metric through a robust aggregation scheme based on MAD-Sigmoid and Soft-OR to guide bit allocation. Extensive experiments demonstrate that NSDS consistently achieves superior performance compared to various baselines across diverse models and downstream tasks, without relying on any calibration data.
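The summary names MAD-Sigmoid and Soft-OR but not their exact forms, so the sketch below is a guess at the natural reading: robust standardization by the median absolute deviation, a sigmoid squashing, and a probabilistic OR across the two sensitivity views.

```python
import numpy as np

# Hedged sketch of a MAD-Sigmoid + Soft-OR aggregation: the exact formulas in
# NSDS may differ. Either a high numerical OR a high structural score marks a
# layer as sensitive and deserving of more bits.
def mad_sigmoid(scores):
    med = np.median(scores)
    mad = np.median(np.abs(scores - med)) + 1e-8   # robust scale estimate
    return 1.0 / (1.0 + np.exp(-(scores - med) / mad))

def soft_or(a, b):
    return 1.0 - (1.0 - a) * (1.0 - b)             # probabilistic OR

numerical = np.random.rand(24)      # one score per layer, illustrative
structural = np.random.rand(24)
sensitivity = soft_or(mad_sigmoid(numerical), mad_sigmoid(structural))
high_precision_layers = np.argsort(sensitivity)[-8:]   # allocate more bits here
```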
[491] Variational Kernel Design for Internal Noise: Gaussian Chaos Noise, Representation Compatibility, and Reliable Deep Learning
Ziran Liu
Main category: cs.LG
TL;DR: Gaussian Chaos Noise (GCh) is a theoretically principled internal noise mechanism for deep networks derived from variational kernel design, using Dirichlet Green kernel geometry and Wick normalization to create positive mean-one gates that improve calibration and robustness.
Details
Motivation: Current internal noise mechanisms in deep networks (dropout, hard masking, additive perturbation) are heuristic and may not be compatible with the representations they act on. The paper aims to develop a theoretically principled noise mechanism with proper correlation geometry.
Method: Proposes Variational Kernel Design (VKD) framework with three components: law family, correlation kernel, and injection operator. Derives Gaussian Chaos Noise (GCh) through maximum-entropy principle over latent log-fields yielding Gaussian optimizer with Dirichlet Laplacian precision, then applies Wick normalization to create canonical positive mean-one gates.
Result: GCh provides exact Gaussian control of pairwise log-ratio deformation, margin-sensitive ranking stability, and expected intrinsic roughness budget. On ImageNet and ImageNet-C, GCh consistently improves calibration and under distribution shift also improves NLL while maintaining competitive accuracy.
Conclusion: Gaussian Chaos Noise is a theoretically grounded alternative to heuristic noise mechanisms that improves model calibration and robustness to distribution shifts while maintaining accuracy.
Abstract: Internal noise in deep networks is usually inherited from heuristics such as dropout, hard masking, or additive perturbation. We ask two questions: what correlation geometry should internal noise have, and is the implemented perturbation compatible with the representations it acts on? We answer these questions through Variational Kernel Design (VKD), a framework in which a noise mechanism is specified by a law family, a correlation kernel, and an injection operator, and is derived from learning desiderata. In a solved spatial subfamily, a quadratic maximum-entropy principle over latent log-fields yields a Gaussian optimizer with precision given by the Dirichlet Laplacian, so the induced geometry is the Dirichlet Green kernel. Wick normalization then gives a canonical positive mean-one gate, Gaussian Chaos Noise (GCh). For the sample-wise gate used in practice, we prove exact Gaussian control of pairwise log-ratio deformation, margin-sensitive ranking stability, and an exact expected intrinsic roughness budget; hard binary masks instead induce singular or coherence-amplified distortions on positive coherent representations. On ImageNet and ImageNet-C, GCh consistently improves calibration and under shift also improves NLL at competitive accuracy.
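One piece is easy to make concrete: a Wick-normalized positive gate with mean exactly one. For z ~ N(0, 1), exp(sigma*z - sigma^2/2) has unit expectation, so activations are perturbed multiplicatively without shifting their mean. The sample-wise version below is the simplest instance; the paper's spatially correlated (Dirichlet Green kernel) field is omitted.

```python
import torch

# Sample-wise Wick-normalized gate: E[exp(sigma*z - sigma^2/2)] = 1 for
# z ~ N(0, 1), so the perturbation is positive and mean-preserving. The
# correlated Green-kernel variant from the paper is not shown here.
def gch_gate(x, sigma=0.3, training=True):
    if not training:
        return x
    z = torch.randn(x.shape[0], *([1] * (x.dim() - 1)), device=x.device)
    gate = torch.exp(sigma * z - 0.5 * sigma ** 2)   # positive, mean one
    return x * gate

x = torch.randn(8, 64, 14, 14)   # a batch of feature maps
out = gch_gate(x)
```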
[492] Efficient Exploration at Scale
Seyed Mohammad Asghari, Chris Chute, Vikranth Dwaracherla, Xiuyuan Lu, Mehdi Jafarnia, Victor Minden, Zheng Wen, Benjamin Van Roy
Main category: cs.LG
TL;DR: Online RLHF algorithm achieves 10-1000x data efficiency gains over offline methods by combining incremental updates, reward uncertainty modeling, and information-directed exploration
Details
Motivation: Current RLHF methods require massive amounts of human feedback data (often millions of labels), which is expensive and time-consuming to collect. There's a need for more data-efficient approaches to make RLHF more practical and scalable.
Method: Develops an online learning algorithm that incrementally updates reward and language models as choice data arrives. Key innovations: 1) Small affirmative nudge added to reinforcement signals, 2) Epistemic neural network modeling reward uncertainty, 3) Information-directed exploration for efficient data collection.
Result: With Gemma LLMs, matches performance of offline RLHF trained on 200K labels using fewer than 20K labels (10x gain). Extrapolation suggests 1M labels could match 1B-label offline RLHF (1000x gain). First demonstration of such large efficiency improvements.
Conclusion: Online RLHF with uncertainty modeling and intelligent exploration can dramatically reduce human feedback requirements, potentially making RLHF 10-1000x more data-efficient than current state-of-the-art offline methods.
Abstract: We develop an online learning algorithm that dramatically improves the data efficiency of reinforcement learning from human feedback (RLHF). Our algorithm incrementally updates reward and language models as choice data is received. The reward model is fit to the choice data, while the language model is updated by a variation of REINFORCE, with reinforcement signals provided by the reward model. Several features enable the efficiency gains: a small affirmative nudge added to each reinforcement signal, an epistemic neural network that models reward uncertainty, and information-directed exploration. With Gemma large language models (LLMs), our algorithm matches the performance of offline RLHF trained on 200K labels using fewer than 20K labels, representing more than a 10x gain in data efficiency. Extrapolating from our results, we expect our algorithm trained on 1M labels to match offline RLHF trained on 1B labels. This represents a 1,000x gain. To our knowledge, these are the first results to demonstrate that such large improvements are possible.
[493] SCALE:Scalable Conditional Atlas-Level Endpoint transport for virtual cell perturbation prediction
Shuizhou Chen, Lang Yu, Kedu Jin, Songming Zhang, Hao Wu, Wenxuan Huang, Sheng Xu, Quan Qian, Qin Chen, Lei Bai, Siqi Sun, Zhangyang Gao
Main category: cs.LG
TL;DR: SCALE is a large-scale foundation model for virtual cell perturbation prediction that addresses bottlenecks in training efficiency, modeling stability, and biological evaluation through a BioNeMo-based framework, conditional transport formulation, and rigorous cell-level benchmarking.
Details
Motivation: Virtual cell models face three bottlenecks: inefficient training/inference pipelines, unstable modeling in high-dimensional sparse expression space, and evaluation protocols that overemphasize reconstruction accuracy while underestimating biological fidelity.
Method: 1) BioNeMo-based training/inference framework for improved throughput and scalability; 2) Formulate perturbation prediction as conditional transport with set-aware flow architecture using LLaMA-based cellular encoding; 3) Rigorous cell-level evaluation on Tahoe-100M benchmark with biologically meaningful metrics.
Result: Achieved 12.51× speedup on pretraining and 1.29× on inference over prior SOTA; improved PDCorr by 12.02% and DE Overlap by 10.66% over STATE on Tahoe-100M benchmark.
Conclusion: Advancing virtual cells requires co-design of scalable infrastructure, stable transport modeling, and biologically faithful evaluation, not just better generative objectives.
Abstract: Virtual cell models aim to enable in silico experimentation by predicting how cells respond to genetic, chemical, or cytokine perturbations from single-cell measurements. In practice, however, large-scale perturbation prediction remains constrained by three coupled bottlenecks: inefficient training and inference pipelines, unstable modeling in high-dimensional sparse expression space, and evaluation protocols that overemphasize reconstruction-like accuracy while underestimating biological fidelity. In this work, we present SCALE, a specialized large-scale foundation model for virtual cell perturbation prediction that addresses the above limitations jointly. First, we build a BioNeMo-based training and inference framework that substantially improves data throughput, distributed scalability, and deployment efficiency, yielding a 12.51× speedup on pretraining and a 1.29× speedup on inference over the prior SOTA pipeline under matched system settings. Second, we formulate perturbation prediction as conditional transport and implement it with a set-aware flow architecture that couples LLaMA-based cellular encoding with endpoint-oriented supervision. This design yields more stable training and stronger recovery of perturbation effects. Third, we evaluate the model on Tahoe-100M using a rigorous cell-level protocol centered on biologically meaningful metrics rather than reconstruction alone. On this benchmark, our model improves PDCorr by 12.02% and DE Overlap by 10.66% over STATE. Together, these results suggest that advancing virtual cells requires not only better generative objectives, but also the co-design of scalable infrastructure, stable transport modeling, and biologically faithful evaluation.
[494] Cohomological Obstructions to Global Counterfactuals: A Sheaf-Theoretic Foundation for Generative Causal Models
Rui Wu, Hong Xie, Yongjun Li
Main category: cs.LG
TL;DR: The paper introduces a topological framework for causal generative models using cellular sheaves over Wasserstein spaces, addressing cohomological obstructions in counterfactual generation with entropic regularization.
Details
Motivation: Current continuous generative models assume local causal consistency leads to globally coherent counterfactuals, but this fails when causal graphs have non-trivial homology (structural conflicts or hidden confounders). The paper aims to address these topological barriers in counterfactual generation.
Method: Formalizes structural causal models as cellular sheaves over Wasserstein spaces, introduces entropic regularization to avoid deterministic singularities, derives the Entropic Wasserstein Causal Sheaf Laplacian (coupled non-linear Fokker-Planck equations), proves an entropic pullback lemma, and integrates with Implicit Function Theorem on Sinkhorn optimality conditions for O(1)-memory gradients.
Result: Achieves computational tractability with O(1)-memory reverse-mode gradients independent of iteration horizon, successfully navigates topological barriers in high-dimensional scRNA-seq counterfactuals using thermodynamic noise (“entropic tunneling”), and introduces Topological Causal Score for topology-aware causal discovery.
Conclusion: The framework provides a rigorous topological approach to causal generative modeling that addresses fundamental limitations of current methods when dealing with complex causal structures, with applications in both counterfactual generation and causal discovery.
Abstract: Current continuous generative models (e.g., Diffusion Models, Flow Matching) implicitly assume that locally consistent causal mechanisms naturally yield globally coherent counterfactuals. In this paper, we prove that this assumption fails fundamentally when the causal graph exhibits non-trivial homology (e.g., structural conflicts or hidden confounders). We formalize structural causal models as cellular sheaves over Wasserstein spaces, providing a strict algebraic topological definition of cohomological obstructions in measure spaces. To ensure computational tractability and avoid deterministic singularities (which we define as manifold tearing), we introduce entropic regularization and derive the Entropic Wasserstein Causal Sheaf Laplacian, a novel system of coupled non-linear Fokker-Planck equations. Crucially, we prove an entropic pullback lemma for the first variation of pushforward measures. By integrating this with the Implicit Function Theorem (IFT) on Sinkhorn optimality conditions, we establish a direct algorithmic bridge to automatic differentiation (VJP), achieving O(1)-memory reverse-mode gradients strictly independent of the iteration horizon. Empirically, our framework successfully leverages thermodynamic noise to navigate topological barriers (“entropic tunneling”) in high-dimensional scRNA-seq counterfactuals. Finally, we invert this theoretical framework to introduce the Topological Causal Score, demonstrating that our Sheaf Laplacian acts as a highly sensitive algebraic detector for topology-aware causal discovery.
[495] The Causal Uncertainty Principle: Manifold Tearing and the Topological Limits of Counterfactual Interventions
Rui Wu, Hong Xie, Yongjun Li
Main category: cs.LG
TL;DR: The paper establishes fundamental limits of causal interventions in continuous generative models, proving deterministic flows develop singularities under extreme interventions, and introduces a geometry-aware algorithm to bypass these issues.
Details
Motivation: While Judea Pearl's do-calculus provides a foundation for causal inference, its translation to continuous generative models faces geometric challenges. The paper aims to establish fundamental limits of such interventions and develop practical solutions.
Method: The authors define the Counterfactual Event Horizon and prove the Manifold Tearing Theorem showing deterministic flows develop finite-time singularities under extreme interventions. They establish the Causal Uncertainty Principle for intervention extremity vs. identity preservation trade-off, and introduce Geometry-Aware Causal Flow (GACF) algorithm with topological radar to bypass manifold tearing.
Result: Theoretical results include proofs of fundamental limits of causal interventions in continuous models. The GACF algorithm is validated on high-dimensional scRNA-seq data, demonstrating practical application of the theoretical framework.
Conclusion: The paper provides fundamental theoretical limits for causal interventions in continuous generative models and offers a practical algorithmic solution (GACF) that addresses geometric challenges through topological awareness.
Abstract: Judea Pearl’s do-calculus provides a foundation for causal inference, but its translation to continuous generative models remains fraught with geometric challenges. We establish the fundamental limits of such interventions. We define the Counterfactual Event Horizon and prove the Manifold Tearing Theorem: deterministic flows inevitably develop finite-time singularities under extreme interventions. We establish the Causal Uncertainty Principle for the trade-off between intervention extremity and identity preservation. Finally, we introduce Geometry-Aware Causal Flow (GACF), a scalable algorithm that utilizes a topological radar to bypass manifold tearing, validated on high-dimensional scRNA-seq data.
[496] Large-Scale 3D Ground-Motion Synthesis with Physics-Inspired Latent Operator Flow Matching
Yaozhong Shi, Grigorios Lavrentiadis, Konstantinos Tsalouchidis, Zachary E. Ross, David McCallen, Caifeng Zou, Kamyar Azizzadenesheli, Domniki Asimaki
Main category: cs.LG
TL;DR: GMFlow is a physics-inspired generative model that rapidly produces realistic, spatially coherent ground-motion time histories for earthquake scenarios, enabling efficient uncertainty quantification for infrastructure design.
Details
Motivation: Current physics-based simulations for earthquake ground motions are computationally intensive and impractical for generating the large ensembles needed for uncertainty quantification in engineering workflows for distributed infrastructure like power grids and pipelines.
Method: Introduces GMFlow, a physics-inspired latent operator flow matching framework that generates realistic, large-scale regional ground-motion time-histories conditioned on physical parameters using mesh-agnostic functional generative modeling.
Result: GMFlow generates spatially coherent ground motion across over 9 million grid points in seconds, achieving a 10,000-fold speedup over simulation workflows while maintaining realistic frequency content and spatiotemporal coherence.
Conclusion: GMFlow enables rapid, uncertainty-aware hazard assessment for distributed infrastructure and advances mesh-agnostic functional generative modeling, with potential applications for synthesizing large-scale spatiotemporal physical fields across diverse scientific domains.
Abstract: Earthquake hazard analysis and design of spatially distributed infrastructure, such as power grids and energy pipeline networks, require scenario-specific ground-motion time histories with realistic frequency content and spatiotemporal coherence. However, producing the large ensembles needed for uncertainty quantification with physics-based simulations is computationally intensive and impractical for engineering workflows. To address this challenge, we introduce Ground-Motion Flow (GMFlow), a physics-inspired latent operator flow matching framework that generates realistic, large-scale regional ground-motion time-histories conditioned on physical parameters. Validated on simulated earthquake scenarios in the San Francisco Bay Area, GMFlow generates spatially coherent ground motion across more than 9 million grid points in seconds, achieving a 10,000-fold speedup over the simulation workflow, which opens a path toward rapid and uncertainty-aware hazard assessment for distributed infrastructure. More broadly, GMFlow advances mesh-agnostic functional generative modeling and could potentially be extended to the synthesis of large-scale spatiotemporal physical fields in diverse scientific domains.
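GMFlow builds on flow matching, so the generic conditional flow matching recipe (straight-line probability path, velocity regression) gives a feel for the training objective; GMFlow's latent operator machinery and mesh-agnostic decoding are omitted here, and the shapes and conditioning vector are illustrative.

```python
import torch
import torch.nn as nn

# Generic conditional flow matching loss: regress a velocity field toward the
# straight-line displacement between noise x0 and data x1, conditioned on
# scenario parameters c. This is the standard recipe, not GMFlow's full model.
def cfm_loss(v_net, x1, c):
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1          # point on the straight-line path
    target = x1 - x0                    # its constant velocity
    return ((v_net(xt, t, c) - target) ** 2).mean()

class VelocityNet(nn.Module):
    def __init__(self, dim=32, cdim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1 + cdim, 128), nn.SiLU(),
                                 nn.Linear(128, dim))
    def forward(self, x, t, c):
        return self.net(torch.cat([x, t, c], dim=-1))

v = VelocityNet()
x1 = torch.randn(16, 32)   # ground-motion latents, illustrative
c = torch.randn(16, 4)     # conditioning, e.g. source and site parameters
cfm_loss(v, x1, c).backward()
```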
[497] Causal Representation Learning on High-Dimensional Data: Benchmarks, Reproducibility, and Evaluation Metrics
Alireza Sadeghi, Wael AbdAlmageed
Main category: cs.LG
TL;DR: A comprehensive review of causal representation learning (CRL) focusing on dataset analysis, evaluation metrics, and reproducibility assessment.
Details
Motivation: CRL models need robust evaluation across multiple directions (reconstruction, disentanglement, causal discovery, counterfactual reasoning), but current evaluation is fragmented and reproducibility is a major challenge in the field.
Method: Critical analysis of existing synthetic and real-world datasets, proposal of essential dataset characteristics, introduction of a single aggregate metric combining all evaluation directions, and systematic review of implementations for reproducibility assessment.
Result: Identified limitations in current CRL datasets, proposed dataset characteristics, developed comprehensive evaluation metric, and assessed reproducibility gaps and best practices in existing implementations.
Conclusion: The study provides a framework for better dataset design, unified evaluation, and improved reproducibility in causal representation learning research.
Abstract: Causal representation learning (CRL) models aim to transform high-dimensional data into a latent space, enabling interventions to generate counterfactual samples or modify existing data based on the causal relationships among latent variables. To facilitate the development and evaluation of these models, a variety of synthetic and real-world datasets have been proposed, each with distinct advantages and limitations. For practical applications, CRL models must perform robustly across multiple evaluation directions, including reconstruction, disentanglement, causal discovery, and counterfactual reasoning, using appropriate metrics for each direction. However, this multi-directional evaluation can complicate model comparison, as a model may excel in some directions while underperforming in others. Another significant challenge in this field is reproducibility: the source code corresponding to published results must be publicly available, and repeated runs should yield performance consistent with the original reports. In this study, we critically analyzed the synthetic and real-world datasets currently employed in the literature, highlighting their limitations and proposing a set of essential characteristics for suitable datasets in CRL model development. We also introduce a single aggregate metric that consolidates performance across all evaluation directions, providing a comprehensive score for each model. Finally, we reviewed existing implementations from the literature and assessed them in terms of reproducibility, identifying gaps and best practices in the field.
[498] The Phasor Transformer: Resolving Attention Bottlenecks on the Unit Circle
Dibakar Sigdel
Main category: cs.LG
TL;DR: Phasor Transformer uses phase-shifts on unit-circle manifold with DFT token coupling for O(N log N) global mixing in time-series modeling, achieving competitive forecasting with compact parameters.
Details
Motivation: Transformer self-attention has a quadratic computational bottleneck for long-context time-series; efficient global token mixing is needed for temporal modeling.
Method: Introduces Phasor Transformer block representing sequence states on the unit-circle manifold S¹, combining trainable phase-shifts with parameter-free DFT token coupling for global O(N log N) mixing without attention maps.
Result: LPM achieves competitive forecasting on synthetic multi-frequency benchmarks with compact parameter budget, learning stable global dynamics comparable to self-attention baselines.
Conclusion: Geometry-constrained phase computation with deterministic global coupling offers practical path for scalable temporal modeling in oscillatory domains, establishing efficiency-performance frontier.
Abstract: Transformer models have redefined sequence learning, yet dot-product self-attention introduces a quadratic token-mixing bottleneck for long-context time-series. We introduce the Phasor Transformer block, a phase-native alternative representing sequence states on the unit-circle manifold S¹. Each block combines lightweight trainable phase-shifts with parameter-free Discrete Fourier Transform (DFT) token coupling, achieving global O(N log N) mixing without explicit attention maps. Stacking these blocks defines the Large Phasor Model (LPM). We validate LPM on autoregressive time-series prediction over synthetic multi-frequency benchmarks. Operating with a highly compact parameter budget, LPM learns stable global dynamics and achieves competitive forecasting behavior compared to conventional self-attention baselines. Our results establish an explicit efficiency-performance frontier, demonstrating that large-model scaling for time-series can emerge from geometry-constrained phase computation with deterministic global coupling, offering a practical path toward scalable temporal modeling in oscillatory domains.
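The abstract's core mechanics (unit-circle states, trainable phase shifts, parameter-free DFT coupling) can be sketched in a few lines; the real block's exact composition, normalization, and readout may differ.

```python
import torch
import torch.nn as nn

# Toy sketch of a phasor-style block: phases are lifted to the unit circle as
# exp(i * theta), a learnable per-channel phase shift rotates them, and an FFT
# along the sequence axis provides global O(N log N) token coupling.
class PhasorBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.phase_shift = nn.Parameter(torch.zeros(d_model))

    def forward(self, theta):                            # (batch, seq, d_model)
        z = torch.exp(1j * (theta + self.phase_shift))   # lift to S^1
        z = torch.fft.fft(z, dim=1)                      # parameter-free mixing
        return torch.angle(z)                            # back to phases

block = PhasorBlock(d_model=16)
theta = torch.rand(2, 128, 16) * 6.283
out = block(theta)   # (2, 128, 16) phase tensor
```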
[499] TimeAPN: Adaptive Amplitude-Phase Non-Stationarity Normalization for Time Series Forecasting
Yue Hu, Jialiang Tang, Siwei Yu, Baosheng Yu, Jing Zhang, Dacheng Tao
Main category: cs.LG
TL;DR: TimeAPN is an adaptive amplitude-phase normalization framework for multivariate long-term time series forecasting that explicitly models non-stationary factors from both time and frequency domains to address distribution shifts.
Details
Motivation: Non-stationarity in time series forecasting causes rapid changes in amplitude and phase, leading to severe distribution shifts that degrade predictive performance. Existing normalization methods rely on basic statistics and assume smooth distribution evolution, overlooking fine-grained temporal dynamics.
Method: TimeAPN models mean sequences jointly in time and frequency domains, forecasts their evolution, extracts phase information in frequency domain, models phase discrepancy between predicted and ground-truth sequences, incorporates amplitude into adaptive normalization, and integrates predicted non-stationary factors with backbone forecasting through collaborative de-normalization.
Result: Extensive experiments on seven real-world multivariate datasets show TimeAPN consistently improves long-term forecasting accuracy across multiple prediction horizons and outperforms state-of-the-art reversible normalization methods.
Conclusion: TimeAPN effectively addresses non-stationarity in time series forecasting by explicitly modeling amplitude and phase variations, is model-agnostic, and can be integrated with various forecasting backbones to enhance performance.
Abstract: Non-stationarity is a fundamental challenge in multivariate long-term time series forecasting, often manifested as rapid changes in amplitude and phase. These variations lead to severe distribution shifts and consequently degrade predictive performance. Existing normalization-based methods primarily rely on first- and second-order statistics, implicitly assuming that distributions evolve smoothly and overlooking fine-grained temporal dynamics. To address these limitations, we propose TimeAPN, an Adaptive Amplitude-Phase Non-Stationarity Normalization framework that explicitly models and predicts non-stationary factors from both the time and frequency domains. Specifically, TimeAPN first models the mean sequence jointly in the time and frequency domains, and then forecasts its evolution over future horizons. Meanwhile, phase information is extracted in the frequency domain, and the phase discrepancy between the predicted and ground-truth future sequences is explicitly modeled to capture temporal misalignment. Furthermore, TimeAPN incorporates amplitude information into an adaptive normalization mechanism, enabling the model to effectively account for abrupt fluctuations in signal energy. The predicted non-stationary factors are subsequently integrated with the backbone forecasting outputs through a collaborative de-normalization process to reconstruct the final non-stationary time series. The proposed framework is model-agnostic and can be seamlessly integrated with various forecasting backbones. Extensive experiments on seven real-world multivariate datasets demonstrate that TimeAPN consistently improves long-term forecasting accuracy across multiple prediction horizons and outperforms state-of-the-art reversible normalization methods.
[500] Baguan-TS: A Sequence-Native In-Context Learning Model for Time Series Forecasting with Covariates
Linxiao Yang, Xue Jiang, Gezheng Xu, Tian Zhou, Min Yang, ZhaoYang Zhu, Linyuan Geng, Zhipeng Zeng, Qiming Chen, Xinyue Gu, Rong Jin, Liang Sun
Main category: cs.LG
TL;DR: Baguan-TS: A unified framework combining raw-sequence representation learning with in-context learning for time series forecasting using a 3D Transformer that attends over temporal, variable, and context axes.
Details
Motivation: Current in-context learning approaches for time series forecasting rely on hand-crafted features, while end-to-end sequence models lack inference-time adaptation. There's a need to bridge this gap by integrating raw-sequence representation learning with in-context learning capabilities.
Method: Proposes Baguan-TS framework with a 3D Transformer architecture that jointly attends over temporal, variable, and context axes. Addresses training stability with target-space retrieval-based local calibration and mitigates output oversmoothing via context-overfitting strategy.
Result: Outperforms established baselines on public benchmark with covariates, achieving highest win rate and significant reductions in both point and probabilistic forecasting metrics. Demonstrates robustness across diverse real-world energy datasets with substantial improvements.
Conclusion: Baguan-TS successfully integrates raw-sequence learning with in-context learning for time series forecasting, addressing key practical challenges and achieving state-of-the-art performance across multiple benchmarks.
Abstract: Transformers enable in-context learning (ICL) for rapid, gradient-free adaptation in time series forecasting, yet most ICL-style approaches rely on tabularized, hand-crafted features, while end-to-end sequence models lack inference-time adaptation. We bridge this gap with a unified framework, Baguan-TS, which integrates raw-sequence representation learning with ICL, instantiated by a 3D Transformer that attends jointly over temporal, variable, and context axes. To make this high-capacity model practical, we tackle two key hurdles: (i) calibration and training stability, improved with a feature-agnostic, target-space retrieval-based local calibration; and (ii) output oversmoothing, mitigated via a context-overfitting strategy. On a public benchmark with covariates, Baguan-TS consistently outperforms established baselines, achieving the highest win rate and significant reductions in both point and probabilistic forecasting metrics. Further evaluations across diverse real-world energy datasets demonstrate its robustness, yielding substantial improvements.
[501] Efficient Soft Actor-Critic with LLM-Based Action-Level Guidance for Continuous Control
Hao Ma, Zhiqiang Pu, Xiaolin Ai, Huimu Wang
Main category: cs.LG
TL;DR: GuidedSAC is a reinforcement learning algorithm that uses LLMs as intelligent supervisors to provide action-level guidance for SAC, improving exploration efficiency in complex environments.
Details
Motivation: The paper addresses the challenge of efficient exploration in vast state-action spaces in reinforcement learning, where traditional exploration methods struggle with sample inefficiency.
Method: GuidedSAC integrates LLMs as supervisors that analyze recent trajectories using state information and visual replays to provide action-level interventions, guiding the SAC algorithm’s exploration while preserving its theoretical guarantees.
Result: GuidedSAC outperforms standard SAC and state-of-the-art exploration methods (RND, ICM, E3B) in both discrete and continuous control environments, including MuJoCo benchmarks, showing improved sample efficiency and final performance.
Conclusion: LLM-guided reinforcement learning can significantly improve exploration efficiency while maintaining theoretical convergence guarantees, offering a promising direction for RL in complex environments.
Abstract: We present GuidedSAC, a novel reinforcement learning (RL) algorithm that facilitates efficient exploration in vast state-action spaces. GuidedSAC leverages large language models (LLMs) as intelligent supervisors that provide action-level guidance for the Soft Actor-Critic (SAC) algorithm. The LLM-based supervisor analyzes the most recent trajectory using state information and visual replays, offering action-level interventions that enable targeted exploration. Furthermore, we provide a theoretical analysis of GuidedSAC, proving that it preserves the convergence guarantees of SAC while improving convergence speed. Through experiments in both discrete and continuous control environments, including toy text tasks and complex MuJoCo benchmarks, we demonstrate that GuidedSAC consistently outperforms standard SAC and state-of-the-art exploration-enhanced variants (e.g., RND, ICM, and E3B) in terms of sample efficiency and final performance.
[502] Auto-Unrolled Proximal Gradient Descent: An AutoML Approach to Interpretable Waveform Optimization
Ahmet Kaplan
Main category: cs.LG
TL;DR: AutoML-enhanced deep unfolding of proximal gradient descent for wireless beamforming optimization, achieving near-optimal performance with fewer layers and minimal training data.
Details
Motivation: To reduce the computational cost and training data requirements of traditional iterative optimization algorithms for wireless beamforming and waveform design, while maintaining interpretability compared to black-box deep learning approaches.
Method: Convert iterative proximal gradient descent (PGD) into a deep neural network with learnable layer parameters, add hybrid layers with learnable linear gradient transformations, and use AutoGluon with a tree-structured Parzen estimator for hyperparameter optimization across network architecture and training parameters.
Result: Auto-PGD achieves 98.8% of the spectral efficiency of traditional 200-iteration PGD using only five unrolled layers and 100 training samples, with improved training stability through gradient normalization and transparency via per-layer sum-rate logging.
Conclusion: The AutoML-enhanced deep unfolding approach significantly reduces training data requirements and inference costs while maintaining high performance and interpretability for wireless beamforming optimization problems.
Abstract: This study explores the combination of automated machine learning (AutoML) with model-based deep unfolding (DU) for optimizing wireless beamforming and waveforms. We convert the iterative proximal gradient descent (PGD) algorithm into a deep neural network, wherein the parameters of each layer are learned instead of being predetermined. Additionally, we enhance the architecture by incorporating a hybrid layer that performs a learnable linear gradient transformation prior to the proximal projection. By utilizing AutoGluon with a tree-structured Parzen estimator (TPE) for hyperparameter optimization (HPO) across an expanded search space, which includes network depth, step-size initialization, optimizer, learning rate scheduler, layer type, and post-gradient activation, the proposed auto-unrolled PGD (Auto-PGD) achieves 98.8% of the spectral efficiency of a traditional 200-iteration PGD solver using only five unrolled layers, while requiring only 100 training samples. We also address a gradient normalization issue to ensure consistent performance during training and evaluation, and we illustrate per-layer sum-rate logging as a tool for transparency. These contributions highlight a notable reduction in the amount of training data and inference cost required, while maintaining high interpretability compared to conventional black-box architectures.
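To make the unrolling concrete, here is a minimal PyTorch sketch of PGD unfolded into layers with one learnable step size per layer, using a toy least-squares objective and a unit-ball projection as stand-ins for the paper's beamforming objective and power constraint; UnrolledPGD and every name below are illustrative, not the author's code.

```python
# Minimal sketch of deep-unrolled PGD with learnable per-layer step sizes,
# assuming a toy objective 0.5*||y - Ax||^2 and a unit-ball projection
# standing in for a transmit-power constraint.
import torch
import torch.nn as nn

class UnrolledPGD(nn.Module):
    def __init__(self, num_layers=5):
        super().__init__()
        # One learnable step size per unrolled iteration (layer).
        self.steps = nn.Parameter(torch.full((num_layers,), 0.1))

    def forward(self, A, y, x0):
        x = x0
        for mu in self.steps:
            grad = A.T @ (A @ x - y)          # gradient of 0.5*||Ax - y||^2
            x = x - mu * grad                 # learnable gradient step
            x = x / x.norm().clamp(min=1.0)   # project onto the unit ball
        return x

A = torch.randn(8, 4)
y = torch.randn(8)
model = UnrolledPGD()
x_hat = model(A, y, torch.zeros(4))
loss = ((A @ x_hat - y) ** 2).sum()
loss.backward()                               # step sizes receive gradients
```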
[503] QuantFL: Sustainable Federated Learning for Edge IoT via Pre-Trained Model Quantisation
Charuka Herath, Yogachandran Rahulamathavan, Varuna De Silva, Sangarapillai Lambotharan
Main category: cs.LG
TL;DR: QuantFL is a sustainable federated learning framework that uses pre-trained models and aggressive quantization to reduce communication energy costs in IoT networks while maintaining accuracy.
Details
Motivation: Federated learning on IoT devices has a significant carbon footprint due to energy-intensive uplink transmissions. Pre-trained models are increasingly available on edge devices, but their potential to reduce fine-tuning energy overhead remains unexplored.
Method: QuantFL leverages pre-trained initialization to enable aggressive, computationally lightweight quantization. It uses memory-efficient bucket quantization without complex error-feedback mechanisms by exploiting the concentrated update statistics from pre-training.
Result: On MNIST and CIFAR-100, QuantFL reduces total communication by 40% (≥80% on uplink or when downlink is quantized) while matching or exceeding uncompressed baselines. Achieves 89.00% on MNIST and 66.89% on CIFAR-100 with orders of magnitude fewer bits.
Conclusion: QuantFL provides a practical, “green” solution for scalable training on battery-constrained IoT networks by combining pre-trained models with efficient quantization techniques.
Abstract: Federated Learning (FL) enables privacy-preserving intelligence on Internet of Things (IoT) devices but incurs a significant carbon footprint due to the high energy cost of frequent uplink transmission. While pre-trained models are increasingly available on edge devices, their potential to reduce the energy overhead of fine-tuning remains underexplored. In this work, we propose QuantFL, a sustainable FL framework that leverages pre-trained initialisation to enable aggressive, computationally lightweight quantisation. We demonstrate that pre-training naturally concentrates update statistics, allowing us to use memory-efficient bucket quantisation without the energy-intensive overhead of complex error-feedback mechanisms. On MNIST and CIFAR-100, QuantFL reduces total communication by 40% ($\simeq 40\%$ total-bit reduction with full-precision downlink; $\geq 80\%$ on uplink or when downlink is quantised) while matching or exceeding uncompressed baselines under strict bandwidth budgets; BU attains 89.00% (MNIST) and 66.89% (CIFAR-100) test accuracy with orders of magnitude fewer bits. We also account for uplink and downlink costs and provide ablations on quantisation levels and initialisation. QuantFL delivers a practical, “green” recipe for scalable training on battery-constrained IoT networks.
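For intuition, here is a minimal sketch of bucket quantisation applied to a model update: the vector is split into fixed-size buckets and each bucket is uniformly quantised against its own min/max range. The bucket size and bit width are illustrative choices, and bucket_quantise is a hypothetical helper, not QuantFL's implementation.

```python
# Minimal sketch of memory-efficient bucket quantisation of an update
# vector; bucket size and bit width are illustrative assumptions.
import numpy as np

def bucket_quantise(update, bucket_size=256, bits=4):
    levels = 2 ** bits - 1
    out = np.empty_like(update)
    for start in range(0, update.size, bucket_size):
        b = update[start:start + bucket_size]
        lo, hi = b.min(), b.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = np.round((b - lo) / scale)                   # codes in [0, levels]
        out[start:start + bucket_size] = q * scale + lo  # dequantised values
    return out

update = np.random.randn(1024).astype(np.float32)
recon = bucket_quantise(update)
print("max abs error:", np.abs(update - recon).max())
```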
[504] Translation Invariance of Neural Operators for the FitzHugh-Nagumo Model
Luca Pellegrini
Main category: cs.LG
TL;DR: Neural Operators benchmarked on FitzHugh-Nagumo model dynamics, evaluating translation invariance and comparing 7 architectures on training/test accuracy, efficiency, and inference speed.
Details
Motivation: To investigate Neural Operators' ability to capture stiff spatio-temporal dynamics of excitable cell models (FitzHugh-Nagumo) and evaluate their translation invariance properties, which is important for practical applications where stimuli may appear at different times and locations.
Method: Trained NOs with applied current at varying spatial locations/intensities at fixed time, tested on out-of-distribution translated currents in both time and space. Benchmarked 7 architectures (CNOs, DONs, DONs-CNN, POD-DONs, FNOs, TFNOs, LocalNOs) on accuracy, efficiency, and inference speed.
Result: CNOs performed well on translated test dynamics but required higher training costs. FNOs achieved lowest training error but highest inference time and poor generalization to translated dynamics. DONs and variants were efficient in training/inference but didn’t generalize well to test set.
Conclusion: The study provides comprehensive benchmarking of NOs for complex ionic model dynamics, revealing trade-offs between different architectures in handling translation invariance, computational efficiency, and generalization capabilities.
Abstract: Neural Operators (NOs) are a powerful deep learning framework designed to learn the solution operators that arise from partial differential equations. This study investigates NOs' ability to capture the stiff spatio-temporal dynamics of the FitzHugh-Nagumo model, which describes excitable cells. A key contribution of this work is evaluating translation invariance using a novel training strategy. NOs are trained using an applied current with varying spatial locations and intensities at a fixed time, and the test set introduces a more challenging out-of-distribution scenario in which the applied current is translated in both time and space. This approach significantly reduces the computational cost of dataset generation. Moreover, we benchmark seven NO architectures: Convolutional Neural Operators (CNOs), Deep Operator Networks (DONs), DONs with CNN encoder (DONs-CNN), Proper Orthogonal Decomposition DONs (POD-DONs), Fourier Neural Operators (FNOs), Tucker Tensorized FNOs (TFNOs), and Localized Neural Operators (LocalNOs). We evaluated these models based on training and test accuracy, efficiency, and inference speed. Our results reveal that CNOs perform well on translated test dynamics. However, they require higher training costs, though their performance on the training set is similar to that of the other considered architectures. In contrast, FNOs achieve the lowest training error but have the highest inference time. Regarding the translated dynamics, FNOs and their variants provide less accurate predictions. Finally, DONs and their variants demonstrate high efficiency in both training and inference; however, they do not generalize well to the test set. These findings highlight the current capabilities and limitations of NOs in capturing complex ionic model dynamics and provide a comprehensive benchmark, including their application to scenarios involving translated dynamics.
[505] AirDDE: Multifactor Neural Delay Differential Equations for Air Quality Forecasting
Binqing Wu, Zongjiang Shang, Shiyu Liu, Jianlong Huang, Jiahui Xu, Ling Chen
Main category: cs.LG
TL;DR: AirDDE: A neural delay differential equation framework for air quality forecasting that models pollutant propagation delays using memory-augmented attention and physics-guided delay evolving functions.
Details
Motivation: Existing deep learning methods for air quality forecasting often model pollutant dynamics as instantaneous processes, overlooking intrinsic delays in pollutant propagation, which limits forecasting accuracy.
Method: Proposes AirDDE framework with two novel components: (1) memory-augmented attention module that retrieves globally and locally historical features to adaptively capture delay effects, and (2) physics-guided delay evolving function based on diffusion-advection equation to model diffusion, delayed advection, and source/sink terms.
Result: Extensive experiments on three real-world datasets show AirDDE achieves state-of-the-art forecasting performance with average MAE reduction of 8.79% over best baselines.
Conclusion: AirDDE successfully integrates delay modeling into continuous-time pollutant evolution under physical guidance, demonstrating superior forecasting accuracy by capturing delay-aware pollutant accumulation patterns.
Abstract: Accurate air quality forecasting is essential for public health and environmental sustainability, but remains challenging due to complex pollutant dynamics. Existing deep learning methods often model pollutant dynamics as an instantaneous process, overlooking the intrinsic delays in pollutant propagation. Thus, we propose AirDDE, the first neural delay differential equation framework for this task that integrates delay modeling into a continuous-time pollutant evolution under physical guidance. Specifically, two novel components are introduced: (1) a memory-augmented attention module that retrieves globally and locally historical features, which can adaptively capture delay effects modulated by multifactor data; and (2) a physics-guided delay evolving function, grounded in the diffusion-advection equation, that models diffusion, delayed advection, and source/sink terms, which can capture delay-aware pollutant accumulation patterns with physical plausibility. Extensive experiments on three real-world datasets demonstrate that AirDDE achieves state-of-the-art forecasting performance with an average MAE reduction of 8.79% over the best baselines. The code is available at https://github.com/w2obin/airdde-aaai.
[506] Anisotropic Permeability Tensor Prediction from Porous Media Microstructure via Physics-Informed Progressive Transfer Learning with Hybrid CNN-Transformer
Mohammad Nooraiepour
Main category: cs.LG
TL;DR: Physics-informed deep learning framework using MaxViT hybrid CNN-Transformer architecture to predict permeability tensors from pore-scale microstructure images, achieving high accuracy with progressive transfer learning and differentiable physical constraints.
Details
Motivation: Direct numerical simulation for permeability tensor prediction from pore-scale images requires hours per sample, limiting large-scale uncertainty quantification and reservoir optimization workflows. Need for faster, accurate prediction method.
Method: Combines MaxViT hybrid CNN-Transformer architecture with progressive transfer learning and differentiable physical constraints. Uses multi-axis attention for spatial hierarchy, trained on 20,000 synthetic porous media samples with three-phase progressive curriculum including ImageNet pretraining, D4-equivariant augmentation, tensor transformation, component-weighted loss, and FiLM conditioning.
Result: Achieves variance-weighted R² = 0.9960 (R²_Kxx = 0.9967, R²_Kxy = 0.9758) on 4,000 test samples, representing 33% reduction in unexplained variance over supervised baseline.
Conclusion: Framework successfully predicts permeability tensors with high accuracy. Offers three transferable principles for physics-informed scientific ML: effective cross-domain visual pretraining, robust integration of physical constraints as differentiable components, and progressive training guided by failure-mode analysis.
Abstract: Accurate prediction of permeability tensors from pore-scale microstructure images is essential for subsurface flow modeling, yet direct numerical simulation requires hours per sample, fundamentally limiting large-scale uncertainty quantification and reservoir optimization workflows. A physics-informed deep learning framework is presented that resolves this bottleneck by combining a MaxViT hybrid CNN-Transformer architecture with progressive transfer learning and differentiable physical constraints. MaxViT’s multi-axis attention mechanism simultaneously resolves grain-scale pore-throat geometry via block-local operations and REV-scale connectivity statistics through grid-global operations, providing the spatial hierarchy that permeability tensor prediction physically requires. Training on 20,000 synthetic porous media samples spanning three orders of magnitude in permeability, a three-phase progressive curriculum advances from an ImageNet-pretrained baseline with D4-equivariant augmentation and tensor transformation, through component-weighted loss prioritizing off-diagonal coupling, to frozen-backbone transfer learning with porosity conditioning via Feature-wise Linear Modulation (FiLM). Onsager reciprocity and positive definiteness are enforced via differentiable penalty terms. On a held-out test set of 4000 samples, the framework achieves variance-weighted $R^2 = 0.9960$ ($R^2_{Kxx} = 0.9967$, $R^2_{Kxy} = 0.9758$), a 33% reduction in unexplained variance over the supervised baseline. The results offer three transferable principles for physics-informed scientific machine learning: large-scale visual pretraining transfers effectively across domain boundaries; physical constraints are most robustly integrated as differentiable architectural components; and progressive training guided by diagnostic failure-mode analysis enables unambiguous attribution of performance gains across methodological stages.
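Since the abstract names FiLM for porosity conditioning, a minimal sketch of a FiLM layer may help: a conditioning value is projected to per-channel scale and shift applied to feature maps. The shapes and the FiLM module below are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of FiLM conditioning: a scalar condition (here porosity)
# produces per-channel scale (gamma) and shift (beta) for feature maps.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim, channels):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * channels)

    def forward(self, feats, cond):            # feats: (B, C, H, W)
        gamma, beta = self.proj(cond).chunk(2, dim=-1)
        return gamma[:, :, None, None] * feats + beta[:, :, None, None]

film = FiLM(cond_dim=1, channels=8)
feats = torch.randn(2, 8, 16, 16)
porosity = torch.tensor([[0.15], [0.30]])      # illustrative conditioning values
out = film(feats, porosity)                    # (2, 8, 16, 16)
```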
[507] PCA-Based Interpretable Knowledge Representation and Analysis of Geometric Design Parameters
Alexander Köhler, Michael Breuß
Main category: cs.LG
TL;DR: PCA-based dimension reduction for CAD geometries enables compact representation but doesn’t directly recover original design parameters; this paper analyzes limitations and conditions for accurate parameter estimation from PCA representations.
Details
Motivation: High-dimensional CAD design spaces with many parameters are challenging for engineering processes like simulation and optimization. While PCA provides compact geometric representations, it doesn't directly recover the underlying design parameters, creating a gap between reduced representations and original parameter spaces.
Method: The paper analyzes a recent modification of PCA for CAD applications, shows it's identical to standard PCA, investigates limitations of parameter estimation from PCA representations, and establishes conditions for accurate parameter recovery through dedicated experiments examining each stage of PCA processing.
Result: The analysis reveals that the modified PCA approach is equivalent to standard PCA, identifies limitations in parameter estimation from PCA representations, and establishes reasonable conditions under which interpretable and accurate parameter estimation can be achieved.
Conclusion: While PCA provides effective dimension reduction for CAD geometries, careful analysis is needed to understand when and how original design parameters can be recovered from PCA representations, with specific conditions enabling accurate parameter estimation.
Abstract: In many CAD-based applications, complex geometries are defined by a high number of design parameters. This leads to high-dimensional design spaces that are challenging for downstream engineering processes like simulations, optimization, and design exploration tasks. Therefore, dimension reduction methods such as principal component analysis (PCA) are used. PCA identifies dominant modes of geometric variation and yields a compact representation of the geometry. While classical PCA excels at compact representation, it does not directly recover the underlying design parameters of a generated geometry. In this work, we deal with the problem of estimating design parameters from PCA-based representations. Analyzing a recent modification of the PCA dedicated to our field of application, we show that the results are actually identical to the standard PCA. We investigate limitations of this approach and present reasonable conditions under which accurate, interpretable parameter estimation can be obtained. With the help of dedicated experiments, we take a more in-depth look at every stage of the PCA and the possible changes to the geometry during these processes.
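A minimal sketch of the setting, assuming a purely linear synthetic generator: geometries produced from hidden design parameters are compressed with standard PCA, and a linear map is fit from PCA coefficients back to the parameters. In this idealized linear case recovery is essentially exact, illustrating one regime where conditions for accurate estimation plausibly hold; the generator and all names are invented for illustration.

```python
# Minimal sketch: compress synthetic "geometries" with PCA, then estimate
# the hidden design parameters from the PCA coefficients.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
params = rng.uniform(-1, 1, size=(500, 3))         # hidden design parameters
basis = rng.normal(size=(3, 200))
geometry = params @ basis                          # parameters -> geometry

pca = PCA(n_components=3).fit(geometry)
coeffs = pca.transform(geometry)                   # compact representation

reg = LinearRegression().fit(coeffs, params)       # coefficients -> parameters
print("parameter recovery R^2:", reg.score(coeffs, params))
```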
[508] CLeAN: Continual Learning Adaptive Normalization in Dynamic Environments
Isabella Marasco, Davide Evangelista, Elena Loli Piccolomini, Michele Colajanni
Main category: cs.LG
TL;DR: CLeAN introduces an adaptive normalization technique for continual learning in tabular data that uses learnable parameters updated via EMA to handle evolving data distributions and mitigate catastrophic forgetting.
Details
Motivation: Traditional normalization methods assume access to entire datasets, which conflicts with the sequential nature of continual learning where data distributions shift over time. This creates a critical gap in handling normalization for dynamic real-world applications like cybersecurity and autonomous systems.
Method: CLeAN (Continual Learning Adaptive Normalization) estimates global feature scales using learnable parameters updated via an Exponential Moving Average (EMA) module, allowing the model to adapt to evolving data distributions without requiring access to the entire dataset.
Result: Comprehensive evaluations on two datasets with various continual learning strategies (Reservoir Experience Replay, A-GEM, EwC) show that CLeAN improves model performance on new data while mitigating catastrophic forgetting.
Conclusion: Adaptive normalization is crucial for enhancing stability and effectiveness in continual learning for tabular data, offering a novel approach to preserve knowledge in dynamic learning environments.
Abstract: Artificial intelligence systems predominantly rely on static data distributions, making them ineffective in dynamic real-world environments, such as cybersecurity, autonomous transportation, or finance, where data shifts frequently. Continual learning offers a potential solution by enabling models to learn from sequential data while retaining prior knowledge. However, a critical and underexplored issue in this domain is data normalization. Conventional normalization methods, such as min-max scaling, presuppose access to the entire dataset, which is incongruent with the sequential nature of continual learning. In this paper we introduce Continual Learning Adaptive Normalization (CLeAN), a novel adaptive normalization technique designed for continual learning in tabular data. CLeAN involves the estimation of global feature scales using learnable parameters that are updated via an Exponential Moving Average (EMA) module, enabling the model to adapt to evolving data distributions. Through comprehensive evaluations on two datasets and various continual learning strategies, including Reservoir Experience Replay, A-GEM, and EwC, we demonstrate that CLeAN not only improves model performance on new data but also mitigates catastrophic forgetting. The findings underscore the importance of adaptive normalization in enhancing the stability and effectiveness of continual learning on tabular data, offering a novel perspective on the use of normalization to preserve knowledge in dynamic learning environments.
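A minimal sketch of the EMA idea, assuming plain running statistics rather than CLeAN's learnable parameters: feature means and variances are tracked across sequential batches, so normalisation never needs the full dataset. The momentum value and the EMANormalizer class are illustrative.

```python
# Minimal sketch of EMA-tracked feature statistics for normalising
# streaming tabular data; a simplification of CLeAN's formulation.
import numpy as np

class EMANormalizer:
    def __init__(self, num_features, momentum=0.01):
        self.mean = np.zeros(num_features)
        self.var = np.ones(num_features)
        self.momentum = momentum

    def update_and_normalize(self, batch):
        # Update running statistics from the current batch only.
        self.mean += self.momentum * (batch.mean(axis=0) - self.mean)
        self.var += self.momentum * (batch.var(axis=0) - self.var)
        return (batch - self.mean) / np.sqrt(self.var + 1e-8)

norm = EMANormalizer(num_features=4)
for task in range(3):                       # sequential tasks, shifting scales
    batch = np.random.randn(64, 4) * (task + 1)
    normed = norm.update_and_normalize(batch)
```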
[509] Conditional Inverse Learning of Time-Varying Reproduction Numbers Inference
Lanlan Yu, Quan-Hui Liu, Haoyue Zheng, Xinfu Yang
Main category: cs.LG
TL;DR: CIRL framework learns conditional mapping from incidence data to time-varying reproduction numbers using flexible likelihood-based modeling with epidemiological constraints.
Details
Motivation: Existing methods for estimating time-varying reproduction numbers rely on strong structural assumptions that limit adaptation to non-stationary transmission dynamics, causing delayed detection of regime shifts and degraded accuracy.
Method: Conditional Inverse Reproduction Learning (CIRL) learns a conditional mapping from historical incidence patterns and time information to latent reproduction numbers, integrating epidemiological structure with flexible likelihood-based modeling using the renewal equation as a forward operator.
Result: Experiments on synthetic epidemics with controlled regime changes and real-world SARS and COVID-19 data demonstrate the framework’s effectiveness in producing robust estimates responsive to abrupt transmission changes.
Conclusion: CIRL combines epidemiologically grounded constraints with data-driven temporal representations to address the ill-posed inverse problem of reproduction number estimation while remaining adaptable to changing transmission dynamics.
Abstract: Estimating time-varying reproduction numbers from epidemic incidence data is a central task in infectious disease surveillance, yet it poses an inherently ill-posed inverse problem. Existing approaches often rely on strong structural assumptions derived from epidemiological models, which can limit their ability to adapt to non-stationary transmission dynamics induced by interventions or behavioral changes, leading to delayed detection of regime shifts and degraded estimation accuracy. In this work, we propose a Conditional Inverse Reproduction Learning framework (CIRL) that addresses the inverse problem by learning a conditional mapping from historical incidence patterns and explicit time information to latent reproduction numbers. Rather than imposing strongly enforced parametric constraints, CIRL softly integrates epidemiological structure with flexible likelihood-based statistical modeling, using the renewal equation as a forward operator to enforce dynamical consistency. The resulting framework combines epidemiologically grounded constraints with data-driven temporal representations, producing reproduction number estimates that are robust to observation noise while remaining responsive to abrupt transmission changes and zero-inflated incidence observations. Experiments on synthetic epidemics with controlled regime changes and real-world SARS and COVID-19 data demonstrate the effectiveness of the proposed approach.
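The renewal equation that CIRL uses as its forward operator is easy to state in code: expected incidence is the reproduction number times a weighted sum of past incidence under a generation-interval distribution. A minimal sketch, with an invented interval distribution and Poisson observation noise:

```python
# Minimal sketch of the renewal equation as a forward operator:
# E[I_t] = R_t * sum_s w_s * I_{t-s}; the weights w are illustrative.
import numpy as np

def renewal_forward(R, incidence_history, w):
    s = min(len(w), len(incidence_history))
    past = incidence_history[-s:][::-1]        # most recent case counts first
    return R * np.dot(w[:s], past)

w = np.array([0.2, 0.4, 0.25, 0.15])           # generation-interval weights
history = [10.0]
for t in range(30):
    lam = renewal_forward(R=1.3, incidence_history=history, w=w)
    history.append(np.random.poisson(lam))     # observed incidence is noisy
```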
[510] FoMo X: Modular Explainability Signals for Outlier Detection Foundation Models
Simon Klüttermann, Tim Katzke, Phuong Huong Nguyen, Emmanuel Müller
Main category: cs.LG
TL;DR: FoMo-X adds interpretable diagnostic heads to tabular foundation models for outlier detection, providing risk tiers and uncertainty measures without extra inference cost.
Details
Motivation: Current outlier detection foundation models are black boxes that output only scalar scores without operational context for safety-critical decisions. Existing explanation methods are computationally expensive or fail to capture epistemic uncertainty in zero-shot inference.
Method: FoMo-X attaches auxiliary diagnostic heads (Severity Head and Uncertainty Head) to frozen embeddings of pretrained PFN backbone. Heads are trained offline using the same generative simulator prior as the backbone, distilling expensive properties like Monte Carlo dropout uncertainty into deterministic single-pass inference.
Result: Extensive evaluation on synthetic and real-world benchmarks (ADBench) shows FoMo-X recovers ground-truth diagnostic signals with high fidelity and negligible inference overhead.
Conclusion: FoMo-X bridges the gap between foundation model performance and operational explainability, offering a scalable path toward trustworthy, zero-shot outlier detection.
Abstract: Tabular foundation models, specifically Prior-Data Fitted Networks (PFNs), have revolutionized outlier detection (OD) by enabling unsupervised zero-shot adaptation to new datasets without training. However, despite their predictive power, these models typically function as opaque black boxes, outputting scalar outlier scores that lack the operational context required for safety-critical decision-making. Existing post-hoc explanation methods are often computationally prohibitive for real-time deployment or fail to capture the epistemic uncertainty inherent in zero-shot inference. In this work, we introduce FoMo-X, a modular framework that equips OD foundation models with intrinsic, lightweight diagnostic capabilities. We leverage the insight that the frozen embeddings of a pretrained PFN backbone already encode rich, context-conditioned relational information. FoMo-X attaches auxiliary diagnostic heads to these embeddings, trained offline using the same generative simulator prior as the backbone. This allows us to distill computationally expensive properties, such as Monte Carlo dropout-based epistemic uncertainty, into deterministic, single-pass inference. We instantiate FoMo-X with two novel heads: a Severity Head that discretizes deviations into interpretable risk tiers, and an Uncertainty Head that provides calibrated confidence measures. Extensive evaluation on synthetic and real-world benchmarks (ADBench) demonstrates that FoMo-X recovers ground-truth diagnostic signals with high fidelity and negligible inference overhead. By bridging the gap between foundation model performance and operational explainability, FoMo-X offers a scalable path toward trustworthy, zero-shot outlier detection.
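A minimal sketch of the architectural pattern, assuming a stand-in frozen module in place of the pretrained PFN backbone: a lightweight head is trained on frozen embeddings to predict simulator-derived risk tiers, so diagnostics cost one deterministic forward pass. All names and sizes are illustrative.

```python
# Minimal sketch of a diagnostic head on frozen backbone embeddings.
import torch
import torch.nn as nn

backbone = nn.Linear(16, 64)                  # stand-in for the frozen PFN
for p in backbone.parameters():
    p.requires_grad = False                   # embeddings stay fixed

severity_head = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 4))

x = torch.randn(128, 16)                      # a batch of tabular rows
tiers = torch.randint(0, 4, (128,))           # simulator-derived risk tiers
logits = severity_head(backbone(x))           # single deterministic pass
loss = nn.functional.cross_entropy(logits, tiers)
loss.backward()                               # only the head gets gradients
```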
[511] Complementary Reinforcement Learning
Dilxat Muhtar, Jiashun Liu, Wei Gao, Weixun Wang, Shaopan Xiong, Ju Huang, Siran Yang, Wenbo Su, Jiamang Wang, Ling Pan, Bo Zheng
Main category: cs.LG
TL;DR: Complementary RL enables co-evolution of experience extractor and policy actor within RL optimization loop for more efficient LLM-based agent training.
Details
Motivation: Current RL approaches for LLM-based agents suffer from low sample efficiency due to sparse feedback and inability to leverage prior experience effectively. Existing experience-augmentation methods fail because experience doesn't co-evolve with the improving actor, causing misalignment.
Method: Inspired by complementary learning systems in neuroscience, Complementary RL simultaneously optimizes two components: 1) policy actor via sparse outcome-based rewards, and 2) experience extractor based on whether its distilled experiences demonstrably contribute to the actor’s success.
Result: Achieves 10% performance improvement in single-task scenarios compared to outcome-based agentic RL baselines, and exhibits robust scalability in multi-task settings.
Conclusion: Complementary RL establishes a paradigm for efficient experience-driven agent learning by enabling seamless co-evolution of experience management strategies with actor capabilities.
Abstract: Reinforcement Learning (RL) has emerged as a powerful paradigm for training LLM-based agents, yet remains limited by low sample efficiency, stemming not only from sparse outcome feedback but also from the agent’s inability to leverage prior experience across episodes. While augmenting agents with historical experience offers a promising remedy, existing approaches suffer from a critical weakness: the experience distilled from history is either stored statically or fails to co-evolve with the improving actor, causing a progressive misalignment between the experience and the actor’s evolving capability that diminishes its utility over the course of training. Inspired by complementary learning systems in neuroscience, we present Complementary RL to achieve seamless co-evolution of an experience extractor and a policy actor within the RL optimization loop. Specifically, the actor is optimized via sparse outcome-based rewards, while the experience extractor is optimized according to whether its distilled experiences demonstrably contribute to the actor’s success, thereby evolving its experience management strategy in lockstep with the actor’s growing capabilities. Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience, achieving a 10% performance improvement in single-task scenarios and exhibiting robust scalability in multi-task settings. These results establish Complementary RL as a paradigm for efficient experience-driven agent learning.
[512] Unsupervised Symbolic Anomaly Detection
Md Maruf Hossain, Tim Katzke, Simon Klüttermann, Emmanuel Müller
Main category: cs.LG
TL;DR: SYRAN is an unsupervised anomaly detection method using symbolic regression to learn human-readable equations that describe symbolic invariants in normal data, providing interpretable anomaly detection.
Details
Motivation: Current anomaly detection methods often use opaque, high-dimensional models that lack interpretability. There's a need for methods that provide transparent, human-readable explanations for why data points are flagged as anomalies.
Method: SYRAN uses symbolic regression to learn an ensemble of human-readable equations that describe symbolic invariants - functions that are approximately constant on normal data. Deviations from these invariants yield interpretable anomaly scores.
Result: SYRAN demonstrates high interpretability, providing equations that correspond to known scientific or medical relationships, while maintaining strong anomaly detection performance comparable to state-of-the-art methods.
Conclusion: SYRAN offers a novel approach to interpretable anomaly detection by learning symbolic invariants, making detection logic transparent by construction rather than requiring post-hoc explanations.
Abstract: We propose SYRAN, an unsupervised anomaly detection method based on symbolic regression. Instead of encoding normal patterns in an opaque, high-dimensional model, our method learns an ensemble of human-readable equations that describe symbolic invariants: functions that are approximately constant on normal data. Deviations from these invariants yield anomaly scores, so that the detection logic is interpretable by construction, rather than via post-hoc explanation. Experimental results demonstrate that SYRAN is highly interpretable, providing equations that correspond to known scientific or medical relationships, and maintains strong anomaly detection performance comparable to that of state-of-the-art methods.
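A minimal sketch of scoring with a symbolic invariant, using an invented physics example (the ideal gas relation PV/T ≈ const) in place of an equation actually discovered by symbolic regression: normal data satisfies the invariant, and the anomaly score is the deviation from its typical value.

```python
# Minimal sketch of invariant-based anomaly scoring; the invariant below
# (ideal gas law, columns = P, V, T) is an invented illustration.
import numpy as np

def invariant(x):
    return x[:, 0] * x[:, 1] / x[:, 2]          # P*V / T, ~constant if normal

rng = np.random.default_rng(0)
V = rng.uniform(1.0, 2.0, 200)
T = rng.uniform(280.0, 320.0, 200)
P = 8.314 * T / V * (1 + 0.01 * rng.normal(size=200))   # obeys P*V ~ R*T
normal = np.column_stack([P, V, T])
c = np.median(invariant(normal))                # invariant's normal value

def anomaly_score(x):
    return np.abs(invariant(x) - c)             # deviation from the invariant

test = np.array([[2500.0, 1.0, 300.0],          # consistent with the law
                 [9000.0, 1.0, 300.0]])         # violates it -> high score
print(anomaly_score(test))
```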
[513] Discovering Decoupled Functional Modules in Large Language Models
Yanke Yu, Jin Li, Ying Sun, Ping Li, Zhefeng Wang, Yi Zheng
Main category: cs.LG
TL;DR: ULCMOD discovers functional modules in LLMs by disentangling neurons into interpretable modules while identifying related input topics, revealing semantic coherence and hierarchical organization.
Details
Motivation: Understanding how LLMs internally organize different functions into modules is crucial for improving trustworthiness and performance, but remains largely unexplored, creating a critical gap in LLM interpretability research.
Method: Proposes Unsupervised LLM Cross-layer MOdule Discovery (ULCMOD) framework with novel objective function and Iterative Decoupling (IterD) algorithm to simultaneously disentangle neurons into modules while discovering related input topics.
Result: Discovers high-quality, disentangled modules capturing meaningful semantic information, achieving superior performance in downstream tasks, with modules showing semantic coherence, interpretable specializations, and clear spatial/hierarchical organization.
Conclusion: Provides a novel tool for interpreting functional modules of LLMs, filling a critical gap in LLM interpretability research by revealing how LLMs internally organize functions.
Abstract: Understanding the internal functional organization of Large Language Models (LLMs) is crucial for improving their trustworthiness and performance. However, how LLMs organize different functions into modules remains largely unexplored. To bridge this gap, we formulate a functional module discovery problem and propose an Unsupervised LLM Cross-layer MOdule Discovery (ULCMOD) framework that simultaneously disentangles the large set of neurons in the entire LLM into modules while discovering the topics of input samples related to these modules. Our framework introduces a novel objective function and an efficient Iterative Decoupling (IterD) algorithm. Extensive experiments show that our method discovers high-quality, disentangled modules that capture more meaningful semantic information and achieve superior performance in various downstream tasks. Moreover, our qualitative analysis reveals that the discovered modules show semantic coherence, correspond to interpretable specializations, and exhibit a clear spatial and hierarchical organization within the LLM. Our work provides a novel tool for interpreting the functional modules of LLMs, filling a critical gap in LLM interpretability research.
[514] Identifying Latent Actions and Dynamics from Offline Data via Demonstrator Diversity
Felix Schur
Main category: cs.LG
TL;DR: The paper studies identifiability of latent actions and environment dynamics from offline trajectories without observed actions, using demonstrator identity tags and policy diversity assumptions.
Details
Motivation: In offline reinforcement learning, actions are often unobserved in trajectories, making it challenging to recover latent actions and environment dynamics. The paper aims to determine if these can be identified using only action-free trajectories tagged with demonstrator identity.
Method: The approach assumes each demonstrator follows a distinct policy while sharing environment dynamics. This induces a column-stochastic nonnegative matrix factorization of observable conditional distributions. Using policy diversity and rank conditions, the paper proves identifiability up to permutation of latent action labels, extending to continuous observation spaces via Gram-determinant minimum-volume criterion.
Result: The paper proves that latent transitions and demonstrator policies are identifiable up to permutation of latent action labels under sufficiently scattered policy diversity and rank conditions. Continuity over connected state space upgrades local permutation ambiguities to a single global permutation, which can be fixed with minimal labeled action data.
Conclusion: Demonstrator diversity provides a principled source of identifiability for learning latent actions and dynamics from offline RL data without observed actions, establishing theoretical foundations for action-free offline learning.
Abstract: Can latent actions and environment dynamics be recovered from offline trajectories when actions are never observed? We study this question in a setting where trajectories are action-free but tagged with demonstrator identity. We assume that each demonstrator follows a distinct policy, while the environment dynamics are shared across demonstrators and identity affects the next observation only through the chosen action. Under these assumptions, the conditional next-observation distribution $p(o_{t+1}\mid o_t,e)$ is a mixture of latent action-conditioned transition kernels with demonstrator-specific mixing weights. We show that this induces, for each state, a column-stochastic nonnegative matrix factorization of the observable conditional distribution. Using sufficiently scattered policy diversity and rank conditions, we prove that the latent transitions and demonstrator policies are identifiable up to permutation of the latent action labels. We extend the result to continuous observation spaces via a Gram-determinant minimum-volume criterion, and show that continuity of the transition map over a connected state space upgrades local permutation ambiguities to a single global permutation. A small amount of labeled action data then suffices to fix this final ambiguity. These results establish demonstrator diversity as a principled source of identifiability for learning latent actions and dynamics from offline RL data.
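A minimal numerical sketch of the factorisation underlying the result, with toy sizes and sklearn's generic NMF solver standing in for the paper's identifiability analysis: for a fixed state, the observable matrix M[o', e] = p(o_{t+1} = o' | o_t, e) is the product of per-action transition columns and per-demonstrator action distributions.

```python
# Minimal sketch of the column-stochastic NMF structure: M = T @ Pi, with
# T the per-latent-action transition columns and Pi each demonstrator's
# action distribution. Sizes are toy; NMF recovers factors only up to
# permutation and scaling, matching the paper's ambiguity.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(6), size=3).T        # 6 next-obs x 3 latent actions
Pi = rng.dirichlet(np.ones(3), size=8).T       # 3 actions x 8 demonstrators
M = T @ Pi                                     # observable mixture matrix

model = NMF(n_components=3, init="nndsvda", max_iter=2000)
T_hat = model.fit_transform(M)
Pi_hat = model.components_
print("reconstruction error:", np.linalg.norm(M - T_hat @ Pi_hat))
```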
[515] One-Step Sampler for Boltzmann Distributions via Drifting
Wenhan Cao, Keyu Yan, Lin Zhao
Main category: cs.LG
TL;DR: A drifting-based framework for amortized sampling of Boltzmann distributions using neural generators trained via Gaussian-smoothed score projections
Details
Motivation: To develop an efficient method for sampling from Boltzmann distributions defined by energy functions, particularly when targets are specified only up to an unknown normalization constant, enabling amortized sampling into a single forward pass at test time.
Method: Trains a one-step neural generator by projecting samples along a Gaussian-smoothed score field from current model distribution toward target Boltzmann distribution. Uses two estimators for targets with unknown normalization: local importance-sampling mean-shift estimator and second-order curvature-corrected approximation, combined with mini-batch Gaussian mean-shift estimate of sampler-side smoothed score.
Result: On a four-mode Gaussian-mixture Boltzmann target, achieves mean error 0.0754, covariance error 0.0425, and RBF MMD 0.0020. Successfully handles nonconvex and curved low-energy geometries on additional double-well and banana targets
Conclusion: Drifting is an effective approach to amortize iterative sampling from Boltzmann distributions into a single forward pass at test time, providing stable one-step training with practical estimators for targets with unknown normalization constants
Abstract: We present a drifting-based framework for amortized sampling of Boltzmann distributions defined by energy functions. The method trains a one-step neural generator by projecting samples along a Gaussian-smoothed score field from the current model distribution toward the target Boltzmann distribution. For targets specified only up to an unknown normalization constant, we derive a practical target-side drift from a smoothed energy and use two estimators: a local importance-sampling mean-shift estimator and a second-order curvature-corrected approximation. Combined with a mini-batch Gaussian mean-shift estimate of the sampler-side smoothed score, this yields a simple stop-gradient objective for stable one-step training. On a four-mode Gaussian-mixture Boltzmann target, our sampler achieves mean error $0.0754$, covariance error $0.0425$, and RBF MMD $0.0020$. Additional double-well and banana targets show that the same formulation also handles nonconvex and curved low-energy geometries. Overall, the results support drifting as an effective way to amortize iterative sampling from Boltzmann distributions into a single forward pass at test time.
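The sampler-side estimator is a standard identity worth seeing in code: the score of a Gaussian-smoothed density at y equals (E[x|y] - y)/σ², with E[x|y] estimated by a kernel-weighted (mean-shift) average over a mini-batch. A minimal sketch with illustrative dimensions:

```python
# Minimal sketch of the mini-batch Gaussian mean-shift estimate of a
# smoothed score; batch size, dimension, and sigma are illustrative.
import numpy as np

def smoothed_score(y, samples, sigma):
    d2 = np.sum((samples - y) ** 2, axis=1)
    logw = -d2 / (2 * sigma ** 2)
    w = np.exp(logw - logw.max())                    # stabilised kernel weights
    w /= w.sum()
    mean_shift = (w[:, None] * samples).sum(axis=0)  # estimate of E[x | y]
    return (mean_shift - y) / sigma ** 2

samples = np.random.randn(512, 2) + np.array([3.0, 0.0])  # model batch
print(smoothed_score(np.zeros(2), samples, sigma=1.0))    # points toward mass
```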
[516] Only relative ranks matter in weight-clustered large language models
Borja Aizpurua, Sukhbinder Singh, Román Orús
Main category: cs.LG
TL;DR: Weight clustering reduces LLM size by replacing weight matrices with K shared values via K-means, preserving accuracy with 16-64 values per matrix without retraining, showing that relative rank matters more than exact magnitudes.
Details
Motivation: LLMs have billions of parameters but many exact values are not essential; the paper investigates whether relative rank of weights (which connections are stronger/weaker) matters more than precise magnitudes for model performance.
Method: Apply weight clustering to pretrained models using K-means to replace weight matrices with K shared values (16-64 distinct values per matrix). Optionally fine-tune cluster means. Systematically randomize cluster means while keeping assignments fixed to study rank vs. magnitude importance.
Result: Weight clustering with 16-64 values preserves strong accuracy without retraining. Scrambling relative ranks degrades quality sharply (perplexity increases orders of magnitude), while rank-preserving randomizations cause almost no loss. Scale drift, not rank distortion, is the dominant collapse mechanism when many layers are perturbed simultaneously.
Conclusion: Relative rank of weights is more important than exact magnitudes for LLM performance. This rank-based perspective offers new insights for model compression and robustness, with weight clustering providing a simple, training-free compression method.
Abstract: Large language models (LLMs) contain billions of parameters, yet many exact values are not essential. We show that what matters most is the relative rank of weights (whether one connection is stronger or weaker than another) rather than precise magnitudes. To reduce the number of unique weight values, we apply weight clustering to pretrained models, replacing every weight matrix with K shared values from K-means. For Llama 3.1-8B-Instruct and SmolLM2-135M, reducing each matrix to only 16-64 distinct values preserves strong accuracy without retraining, providing a simple, training-free method to compress LLMs on disk. Optionally fine-tuning only the cluster means (centroids) recovers 30-40 percent of the remaining accuracy gap at minimal cost. We then systematically randomize cluster means while keeping assignments fixed. Scrambling the relative ranks of the clusters degrades quality sharply (perplexity can increase by orders of magnitude), even when global statistics such as mean and variance are preserved. In contrast, rank-preserving randomizations cause almost no loss at mid and late layers. On the other hand, when many layers are perturbed simultaneously, progressive layer-by-layer replacement reveals that scale drift, not rank distortion, is the dominant collapse mechanism; however, an affine correction w’ = aw + b with a > 0 (which preserves both rank order and overall weight distribution) can substantially delay this drift. This rank-based perspective offers a new lens on model compression and robustness.
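A minimal sketch of the clustering step itself, assuming 1-D K-means over a single matrix (cluster_weights is a hypothetical helper): every weight is replaced by its cluster centroid, so the matrix carries at most K distinct values. K=16 matches the low end of the range the abstract reports as accuracy-preserving.

```python
# Minimal sketch of weight clustering: each entry of a weight matrix is
# replaced by one of K shared values found by 1-D K-means.
import numpy as np
from sklearn.cluster import KMeans

def cluster_weights(W, k=16):
    flat = W.reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(flat)
    # Each weight becomes its cluster's centroid: k distinct values total.
    return km.cluster_centers_[km.labels_].reshape(W.shape)

W = np.random.randn(256, 256).astype(np.float32)
W_q = cluster_weights(W)
print("distinct values:", np.unique(W_q).size)        # <= 16
print("relative error:", np.linalg.norm(W - W_q) / np.linalg.norm(W))
```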
[517] End-to-end data-driven prediction of urban airflow and pollutant dispersion
Nishant Kumar, Franck Kerhervé, Lionel Agostini, Laurent Cordier
Main category: cs.LG
TL;DR: Data-driven reduced-order modeling framework using SPOD, autoencoders, LSTMs, and CNNs for fast prediction of urban airflow and pollutant dispersion in street canyons.
Details
Motivation: Climate change and urban population growth intensify environmental stresses, making urban atmospheric flow behavior critical for public health, energy use, and livability. Need for fast, accurate pollutant dispersion models to support timely decision-making.
Method: Four-step approach: 1) SPOD for reduced basis from LES data, 2) autoencoder for nonlinear compression of temporal coefficients, 3) LSTM networks for reduced-order modeling in latent space, 4) CNN mapping velocity to pollutant dispersion fields.
Result: Model effectively predicts both instantaneous and statistically stationary fields over long time horizons, demonstrating efficacy for urban airflow and pollutant dispersion prediction.
Conclusion: Proposed data-driven framework enables fast and accurate modeling of urban pollutant dispersion, supporting timely decision-making for mitigation measures in urban environments.
Abstract: Climate change and the rapid growth of urban populations are intensifying environmental stresses within cities, making the behavior of urban atmospheric flows a critical factor in public health, energy use, and overall livability. This study aims to develop fast and accurate models of urban pollutant dispersion to support decision-makers, enabling them to implement mitigation measures in a timely and cost-effective manner. To reach this goal, an end-to-end data-driven approach is proposed to model and predict the airflow and pollutant dispersion in a street canyon in the skimming flow regime. A series of time-resolved snapshots obtained from large eddy simulation (LES) serves as the database. The proposed framework is based on four fundamental steps. Firstly, a reduced basis is obtained by spectral proper orthogonal decomposition (SPOD) of the database. The projection of the time series snapshot data onto the SPOD modes (time-domain approach) provides the temporal coefficients of the dynamics. Secondly, a nonlinear compression of the temporal coefficients is performed by an autoencoder to further reduce the dimensionality of the problem. Thirdly, a reduced-order model (ROM) is learned in the latent space using Long Short-Term Memory (LSTM) networks. Finally, the pollutant dispersion is estimated from the predicted velocity field through a convolutional neural network that maps between the two fields. The results demonstrate the efficacy of the model in predicting the instantaneous as well as statistically stationary fields over long time horizons.
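A minimal sketch of the middle of the pipeline, with plain POD via PCA standing in for SPOD, synthetic snapshots in place of LES data, and a small LSTM forecasting the temporal coefficients in latent space; LatentROM and all sizes are illustrative.

```python
# Minimal sketch of a POD+LSTM reduced-order model (PCA as a POD stand-in).
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

snapshots = np.random.randn(400, 1000).astype(np.float32)     # time x space
pod = PCA(n_components=8).fit(snapshots)
coeffs = pod.transform(snapshots).astype(np.float32)          # temporal coeffs

class LatentROM(nn.Module):
    def __init__(self, dim=8, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, dim)

    def forward(self, seq):                        # (batch, time, dim)
        out, _ = self.lstm(seq)
        return self.head(out[:, -1])               # next-step coefficients

model = LatentROM()
x = torch.from_numpy(coeffs[:32][None])            # one window of history
pred = model(x)                                    # predicted next coefficients
field = pod.inverse_transform(pred.detach().numpy())   # back to physical space
```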
[518] AdaMuS: Adaptive Multi-view Sparsity Learning for Dimensionally Unbalanced Data
Cai Xu, Changhao Sun, Ziyu Guan, Wei Zhao
Main category: cs.LG
TL;DR: AdaMuS is a framework for unbalanced multi-view learning that addresses extreme dimensional disparities between views (e.g., video frames vs physiological signals) through adaptive sparsity learning and self-supervised graph-based supervision.
Details
Motivation: Real-world multi-view data often has severe dimensional disparities (e.g., video frames at 10^6 dimensions vs physiological signals at 10^1 dimensions), causing existing methods to bias toward high-dimensional views and struggle with representation alignment, introducing redundancy in low-dimensional views.
Method: 1) View-specific encoders map all views to unified dimensional space; 2) Parameter-free pruning adaptively removes redundant encoder parameters to prevent overfitting when mapping low-dimensional data; 3) Sparse fusion paradigm suppresses redundant dimensions and aligns views; 4) Self-supervised learning using similarity graphs for supervision.
Result: Extensive evaluations on synthetic toy dataset and seven real-world benchmarks show AdaMuS consistently achieves superior performance and strong generalization across both classification and semantic segmentation tasks.
Conclusion: AdaMuS effectively addresses unbalanced multi-view learning with extreme dimensional disparities through adaptive sparsity mechanisms and self-supervised learning, demonstrating robust performance across diverse tasks.
Abstract: Multi-view learning primarily aims to fuse multiple features to describe data comprehensively. Most prior studies implicitly assume that different views share similar dimensions. In practice, however, severe dimensional disparities often exist among different views, leading to the unbalanced multi-view learning issue. For example, in emotion recognition tasks, video frames often reach dimensions of $10^6$, while physiological signals comprise only $10^1$ dimensions. Existing methods typically face two main challenges for this problem: (1) They often bias towards high-dimensional data, overlooking the low-dimensional views. (2) They struggle to effectively align representations under extreme dimensional imbalance, which introduces severe redundancy into the low-dimensional ones. To address these issues, we propose the Adaptive Multi-view Sparsity Learning (AdaMuS) framework. First, to prevent ignoring the information of low-dimensional views, we construct view-specific encoders to map them into a unified dimensional space. Given that mapping low-dimensional data to a high-dimensional space often causes severe overfitting, we design a parameter-free pruning method to adaptively remove redundant parameters in the encoders. Furthermore, we propose a sparse fusion paradigm that flexibly suppresses redundant dimensions and effectively aligns each view. Additionally, to learn representations with stronger generalization, we propose a self-supervised learning paradigm that obtains supervision information by constructing similarity graphs. Extensive evaluations on a synthetic toy dataset and seven real-world benchmarks demonstrate that AdaMuS consistently achieves superior performance and exhibits strong generalization across both classification and semantic segmentation tasks.
[519] ARES: Scalable and Practical Gradient Inversion Attack in Federated Learning through Activation Recovery
Zirui Gong, Leo Yu Zhang, Yanjun Zhang, Viet Vo, Tianqing Zhu, Shirui Pan, Cong Wang
Main category: cs.LG
TL;DR: ARES attack reconstructs training samples from federated learning gradients without architectural modifications, using sparse recovery and activation disentanglement for large batches.
Details
Motivation: Federated learning aims to protect privacy by sharing model updates instead of raw data, but gradient inversion attacks can leak sensitive training data. Existing active attacks require architectural modifications, limiting practical applicability.
Method: ARES formulates recovery as a noisy sparse recovery task solved with generalized Lasso, uses imprint method to disentangle activations for multi-sample recovery, and provides theoretical guarantees for recovery rate and error bounds.
Result: ARES achieves high-fidelity reconstruction across diverse datasets on CNNs and MLPs, significantly outperforming prior gradient inversion attacks under large batch sizes and realistic federated learning settings.
Conclusion: Intermediate activations pose serious privacy risks in federated learning, highlighting the need for stronger defenses against gradient inversion attacks like ARES.
Abstract: Federated Learning (FL) enables collaborative model training by sharing model updates instead of raw data, aiming to protect user privacy. However, recent studies reveal that these shared updates can inadvertently leak sensitive training data through gradient inversion attacks (GIAs). Among them, active GIAs are particularly powerful, enabling high-fidelity reconstruction of individual samples even under large batch sizes. Nevertheless, existing approaches often require architectural modifications, which limit their practical applicability. In this work, we bridge this gap by introducing the Activation REcovery via Sparse inversion (ARES) attack, an active GIA designed to reconstruct training samples from large training batches without requiring architectural modifications. Specifically, we formulate the recovery problem as a noisy sparse recovery task and solve it using the generalized Least Absolute Shrinkage and Selection Operator (Lasso). To extend the attack to multi-sample recovery, ARES incorporates the imprint method to disentangle activations, enabling scalable per-sample reconstruction. We further establish the expected recovery rate and derive an upper bound on the reconstruction error, providing theoretical guarantees for the ARES attack. Extensive experiments on CNNs and MLPs demonstrate that ARES achieves high-fidelity reconstruction across diverse datasets, significantly outperforming prior GIAs under large batch sizes and realistic FL settings. Our results highlight that intermediate activations pose a serious and underestimated privacy risk in FL, underscoring the urgent need for stronger defenses.
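The sparse-recovery core is standard compressed sensing, so a minimal sketch can use a generic random measurement matrix in place of the gradient-derived system ARES actually inverts: a sparse activation vector is recovered from a few noisy linear measurements with the Lasso. All sizes and the regularisation strength are illustrative.

```python
# Minimal sketch of noisy sparse recovery with the Lasso, standing in for
# the attack's generalized-Lasso step; the setup is generic, not ARES's.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k = 200, 80, 5                       # signal dim, measurements, sparsity
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.normal(size=k)   # sparse activations
A = rng.normal(size=(m, n)) / np.sqrt(m)                  # measurement matrix
y = A @ x + 0.01 * rng.normal(size=m)                     # noisy observations

x_hat = Lasso(alpha=0.005, max_iter=50_000).fit(A, y).coef_
print("true support:     ", np.flatnonzero(x))
print("estimated support:", np.flatnonzero(np.abs(x_hat) > 0.05))
```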
[520] Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal Policies
Sinan Ibrahim, Grégoire Ouerdane, Hadi Salloum, Henni Ouerdane, Stefan Streif, Pavel Osinenko
Main category: cs.LG
TL;DR: A rigorous benchmarking framework for RL algorithms using converse optimality theory to generate controlled benchmark environments with known optimal solutions.
Details
Motivation: RL algorithm comparison is notoriously complex due to sensitivity to environmental design, reward structures, and stochasticity. Current benchmarking lacks systematic rigor and reproducibility.
Method: Extends converse optimality to discrete-time, control-affine, nonlinear systems with noise. Provides necessary/sufficient conditions for optimal value functions and policies, enabling systematic benchmark generation via homotopy variations and randomized parameters.
Result: Framework enables automatic construction of diverse environments for controlled evaluation. Validated by assessing standard RL methods against ground-truth optimum, providing reproducible benchmarking foundation.
Conclusion: Provides a rigorous, reproducible foundation for precise RL benchmarking by generating environments with known optimal solutions, enabling systematic algorithm comparison.
Abstract: The objective comparison of Reinforcement Learning (RL) algorithms is notoriously complex, as the outcomes and benchmarked performance of different RL approaches are critically sensitive to environmental design, reward structures, and the stochasticity inherent in both algorithmic learning and environmental dynamics. To manage this complexity, we introduce a rigorous benchmarking framework by extending converse optimality to discrete-time, control-affine, nonlinear systems with noise. Our framework provides necessary and sufficient conditions under which a prescribed value function and policy are optimal for constructed systems, enabling the systematic generation of benchmark families via homotopy variations and randomized parameters. We validate it by automatically constructing diverse environments, demonstrating our framework's capacity for controlled and comprehensive evaluation across algorithms. By assessing standard methods against a ground-truth optimum, our work delivers a reproducible foundation for precise and rigorous RL benchmarking.
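As a minimal illustration of benchmarking against a known optimum (using classical LQR as a stand-in, not the paper's converse-optimality construction): for linear dynamics with quadratic cost, the Riccati equation yields the exactly optimal value function and policy, giving a ground truth any RL agent can be scored against.

```python
# Minimal sketch of a benchmark with a known optimal policy: discrete-time
# LQR, where V(x) = x^T P x and u = -Kx are classically optimal.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1], [0.0, 1.0]])        # dynamics x' = Ax + Bu + noise
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

P = solve_discrete_are(A, B, Q, R)                    # optimal value matrix
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)     # optimal gain

x = np.array([1.0, 0.0])
for _ in range(50):
    u = -K @ x                                 # ground-truth optimal action
    x = A @ x + B @ u + 0.01 * np.random.randn(2)
# Any RL policy on this system can be scored against the known optimum.
```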
[521] DSS-GAN: Directional State Space GAN with Mamba backbone for Class-Conditional Image Synthesis
Aleksander Ogonowski, Konrad Klimaszewski, Przemysław Rokita
Main category: cs.LG
TL;DR: DSS-GAN introduces a novel GAN architecture using Mamba as hierarchical generator backbone with Directional Latent Routing for improved noise-to-image synthesis.
Details
Motivation: To improve image generation by leveraging Mamba's efficient sequence modeling capabilities as a generator backbone and introducing better conditioning mechanisms that couple class identity with latent structure along spatial axes.
Method: Uses Mamba as hierarchical generator backbone with novel Directional Latent Routing (DLR) that decomposes latent vectors into direction-specific subvectors, each jointly projected with class embeddings to produce feature-wise affine modulation of corresponding Mamba scans.
Result: Achieves improved FID, KID, and precision-recall scores compared to StyleGAN2-ADA across multiple datasets. Latent space analysis shows directional subvectors exhibit specialization with structured, direction-correlated changes in synthesized images.
Conclusion: DSS-GAN demonstrates the effectiveness of Mamba as a generator backbone and the benefits of directional latent conditioning for improved image synthesis quality and interpretable latent space structure.
Abstract: We present DSS-GAN, the first generative adversarial network to employ Mamba as a hierarchical generator backbone for noise-to-image synthesis. The central contribution is Directional Latent Routing (DLR), a novel conditioning mechanism that decomposes the latent vector into direction-specific subvectors, each jointly projected with a class embedding to produce a feature-wise affine modulation of the corresponding Mamba scan. Unlike conventional class conditioning that injects a global signal, DLR couples class identity and latent structure along distinct spatial axes of the feature map, applied consistently across all generative scales. DSS-GAN achieves improved FID, KID, and precision-recall scores compared to StyleGAN2-ADA across multiple tested datasets. Analysis of the latent space reveals that directional subvectors exhibit measurable specialization: perturbations along individual components produce structured, direction-correlated changes in the synthesized image.
[522] Flow Matching Policy with Entropy Regularization
Ting Gao, Stavros Orfanoudakis, Nan Lin, Elvin Isufi, Winnie Daamen, Serge Hoogendoorn
Main category: cs.LG
TL;DR: FMER is a flow matching-based RL policy that uses ODEs instead of SDEs for more efficient action sampling and principled entropy regularization, achieving better performance with faster training.
Details
Motivation: Diffusion policies in RL face issues with indirect entropy control due to intractable exact entropy and computationally expensive policy gradients through iterative denoising chains. There's a need for more efficient and principled approaches to entropy regularization in generative policies.
Method: Proposes Flow Matching Policy with Entropy Regularization (FMER), an ODE-based online RL framework that parameterizes policy via flow matching and samples actions along straight probability paths motivated by optimal transport. Constructs advantage-weighted target velocity field from candidate set to steer policy updates toward high-value regions, and derives tractable entropy objective for principled maximum-entropy optimization.
Result: Outperforms state-of-the-art methods on sparse multi-goal FrankaKitchen benchmarks, remains competitive on standard MuJoCo benchmarks, reduces training time by 7x compared to heavy diffusion baselines (QVPO) and 10-15% relative to efficient variants.
Conclusion: FMER provides an efficient ODE-based alternative to diffusion policies in RL with principled entropy regularization, achieving better performance with significantly reduced computational cost.
Abstract: Diffusion-based policies have gained significant popularity in Reinforcement Learning (RL) due to their ability to represent complex, non-Gaussian distributions. Stochastic Differential Equation (SDE)-based diffusion policies often rely on indirect entropy control due to the intractability of the exact entropy, while also suffering from computationally prohibitive policy gradients through the iterative denoising chain. To overcome these issues, we propose Flow Matching Policy with Entropy Regularization (FMER), an Ordinary Differential Equation (ODE)-based online RL framework. FMER parameterizes the policy via flow matching and samples actions along a straight probability path, motivated by optimal transport. FMER leverages the model’s generative nature to construct an advantage-weighted target velocity field from a candidate set, steering policy updates toward high-value regions. By deriving a tractable entropy objective, FMER enables principled maximum-entropy optimization for enhanced exploration. Experiments on sparse multi-goal FrankaKitchen benchmarks demonstrate that FMER outperforms state-of-the-art methods, while remaining competitive on standard MuJoCo benchmarks. Moreover, FMER reduces training time by 7x compared to heavy diffusion baselines (QVPO) and 10-15% relative to efficient variants.
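For readers unfamiliar with the straight-path construction FMER builds on, a minimal flow-matching sketch follows; network shapes are placeholders, and the paper's advantage-weighted target field and entropy objective are omitted.

```python
import torch
import torch.nn as nn

# Straight (OT-style) probability path: x_t = (1 - t) x0 + t a, with target
# velocity v* = a - x0; actions are sampled by Euler-integrating the learned
# ODE, avoiding an iterative denoising chain.
act_dim, state_dim = 6, 17
v_net = nn.Sequential(nn.Linear(state_dim + act_dim + 1, 128), nn.ReLU(),
                      nn.Linear(128, act_dim))          # v_theta(s, x_t, t)

def fm_loss(s, a):
    x0 = torch.randn_like(a)                 # noise endpoint of the path
    t = torch.rand(a.shape[0], 1)
    x_t = (1 - t) * x0 + t * a               # straight interpolation
    target_v = a - x0                        # constant velocity along the path
    return ((v_net(torch.cat([s, x_t, t], -1)) - target_v) ** 2).mean()

@torch.no_grad()
def sample_action(s, n_steps=8):
    x = torch.randn(s.shape[0], act_dim)
    for i in range(n_steps):                 # cheap Euler ODE integration
        t = torch.full((s.shape[0], 1), i / n_steps)
        x = x + v_net(torch.cat([s, x, t], -1)) / n_steps
    return x

actions = sample_action(torch.randn(4, state_dim))
```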
[523] Objective Mispricing Detection for Shortlisting Undervalued Football Players via Market Dynamics and News Signals
Chinenye Omejieke, Shuyao Chen, Xia Cui
Main category: cs.LG
TL;DR: A framework for identifying undervalued football players using market data and NLP features from news articles to detect objective mispricing.
Details
Motivation: To create a practical, reproducible system for identifying undervalued football players that moves beyond subjective expert opinions by using objective data-driven mispricing detection.
Method: Estimates expected market value from structured data (historical market dynamics, biographical features, contract details, transfer history), compares to observed valuations to define mispricing, then incorporates NLP features from news articles (sentiment statistics and semantic embeddings) to complement market signals for shortlisting undervalued players.
Result: Gradient-boosted regression explains large variance in log-transformed market value. Market dynamics are primary signal for undervaluation detection, while NLP features provide consistent secondary gains improving robustness and interpretability. SHAP analysis shows dominance of market trends and age, with news-derived volatility cues amplifying signals in high-uncertainty regimes.
Conclusion: The pipeline is designed for scouting decision support with emphasis on ranking/shortlisting over hard classification, includes reproducibility and ethics considerations. NLP features complement but don’t replace market signals.
Abstract: We present a practical, reproducible framework for identifying undervalued football players grounded in objective mispricing. Instead of relying on subjective expert labels, we estimate an expected market value from structured data (historical market dynamics, biographical and contract features, transfer history) and compare it to the observed valuation to define mispricing. We then assess whether news-derived Natural Language Processing (NLP) features (i.e., sentiment statistics and semantic embeddings from football articles) complement market signals for shortlisting undervalued players. Using a chronological (leakage-aware) evaluation, gradient-boosted regression explains a large share of the variance in log-transformed market value. For undervaluation shortlisting, ROC-AUC-based ablations show that market dynamics are the primary signal, while NLP features provide consistent, secondary gains that improve robustness and interpretability. SHAP analyses suggest the dominance of market trends and age, with news-derived volatility cues amplifying signals in high-uncertainty regimes. The proposed pipeline is designed for decision support in scouting workflows, emphasizing ranking/shortlisting over hard classification thresholds, and includes a concise reproducibility and ethics statement.
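The mispricing definition reduces to a residual in log space; a toy sketch with random placeholder features (not the paper's feature set):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Mispricing as a residual: fit the expected log market value from structured
# features, then shortlist players whose observed value sits furthest below
# the model's expectation.
rng = np.random.default_rng(0)
X = rng.random((500, 6))                       # age, contract months, trend, ...
log_value = np.log(rng.uniform(1e5, 5e7, size=500))
model = GradientBoostingRegressor().fit(X, log_value)
mispricing = log_value - model.predict(X)      # < 0 => priced below expectation
shortlist = np.argsort(mispricing)[:20]        # 20 most undervalued candidates
```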
[524] Can Blindfolded LLMs Still Trade? An Anonymization-First Framework for Portfolio Optimization
Joohyoung Jeon, Hongchul Lee
Main category: cs.LG
TL;DR: BlindTrade: A framework for validating LLM trading agents by anonymizing ticker information to prevent memorization bias and using GNN-based trading policies.
Details
Motivation: To ensure LLM trading agents genuinely understand market dynamics rather than exploiting memorized ticker associations, requiring rigorous signal validation to prove predictions reflect legitimate patterns, not pre-trained recall.
Method: Anonymizes tickers and company names to blindfold agents, uses four LLM agents to output scores with reasoning, constructs GNN graph from reasoning embeddings, and implements PPO-DSR policy for trading.
Result: Achieved Sharpe ratio of 1.40 +/- 0.22 across 20 seeds on 2025 YTD data, validated signal legitimacy through negative control experiments, and found market-regime dependency with better performance in volatile conditions.
Conclusion: The BlindTrade framework successfully validates LLM trading signals by preventing memorization bias, demonstrating that meaningful signals persist even when ticker information is anonymized, though performance varies with market conditions.
Abstract: For LLM trading agents to be genuinely trustworthy, they must demonstrate understanding of market dynamics rather than exploitation of memorized ticker associations. Building responsible multi-agent systems demands rigorous signal validation: proving that predictions reflect legitimate patterns, not pre-trained recall. We address two sources of spurious performance: memorization bias from ticker-specific pre-training, and survivorship bias from flawed backtesting. Our approach is to blindfold the agents, anonymizing all identifiers, and verify whether meaningful signals persist. BlindTrade anonymizes tickers and company names, and four LLM agents output scores along with reasoning. We construct a GNN graph from reasoning embeddings and trade using PPO-DSR policy. On 2025 YTD (through 2025-08-01), we achieved Sharpe 1.40 +/- 0.22 across 20 seeds and validated signal legitimacy through negative control experiments. To assess robustness beyond a single OOS window, we additionally evaluate an extended period (2024–2025), revealing market-regime dependency: the policy excels in volatile conditions but shows reduced alpha in trending bull markets.
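A minimal sketch of the blindfolding step; the alias scheme and entity table are illustrative, not BlindTrade's actual mapping:

```python
import re

# Map every ticker/company mention to a stable, uninformative alias before
# any text reaches the LLMs, so scores cannot lean on memorized associations.
entities = {"AAPL": "ASSET_01", "Apple": "ASSET_01",
            "NVDA": "ASSET_02", "Nvidia": "ASSET_02"}

def anonymize(text: str) -> str:
    # Replace longer names first so substrings are not clobbered mid-word.
    for name, alias in sorted(entities.items(), key=lambda kv: -len(kv[0])):
        text = re.sub(rf"\b{re.escape(name)}\b", alias, text)
    return text

print(anonymize("Apple (AAPL) beat estimates; Nvidia guidance was soft."))
# -> "ASSET_01 (ASSET_01) beat estimates; ASSET_02 guidance was soft."
```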
[525] Predicting Trajectories of Long COVID in Adult Women: The Critical Role of Causal Disentanglement
Jing Wang, Jie Shen, Yiming Luo, Amar Sra, Qiaomin Xie, Jeremy C. Weiss
Main category: cs.LG
TL;DR: LLM-based causal network predicts Post-Acute Sequelae of SARS-CoV-2 severity in women using clinical profiles and wearable data, achieving 86.7% precision while differentiating between active pathology and confounding factors like menopause.
Details
Motivation: Early prediction of PASC severity is critical for women's health, especially given diagnostic overlap with hormonal transitions like menopause. Identifying and accounting for confounding factors is essential for accurate long-term trajectory prediction.
Method: Retrospective study of 1,155 women from NIH RECOVER dataset, integrating static clinical profiles with four weeks of longitudinal wearable data (cardiac activity and sleep). Developed a causal network based on a Large Language Model to predict future PASC scores.
Result: Framework achieved 86.7% precision in clinical severity prediction. Causal attribution analysis showed model’s ability to differentiate between active pathology and baseline noise: direct indicators like breathlessness and malaise reached maximum saliency (1.00), while confounding factors like menopause and diabetes were suppressed with saliency scores below 0.27.
Conclusion: The LLM-based causal network effectively predicts PASC severity while successfully suppressing confounding factors, demonstrating potential for accurate clinical trajectory prediction in women’s health.
Abstract: Early prediction of Post-Acute Sequelae of SARS-CoV-2 severity is a critical challenge for women’s health, particularly given the diagnostic overlap between PASC and common hormonal transitions such as menopause. Identifying and accounting for these confounding factors is essential for accurate long-term trajectory prediction. We conducted a retrospective study of 1,155 women (mean age 61) from the NIH RECOVER dataset. By integrating static clinical profiles with four weeks of longitudinal wearable data (monitoring cardiac activity and sleep), we developed a causal network based on a Large Language Model to predict future PASC scores. Our framework achieved a precision of 86.7% in clinical severity prediction. Our causal attribution analysis demonstrates the model’s ability to differentiate between active pathology and baseline noise: direct indicators such as breathlessness and malaise reached maximum saliency (1.00), while confounding factors like menopause and diabetes were successfully suppressed with saliency scores below 0.27.
[526] Embedding World Knowledge into Tabular Models: Towards Best Practices for Embedding Pipeline Design
Oksana Kolomenko, Ricardo Knauer, Erik Rodner
Main category: cs.LG
TL;DR: Systematic benchmarking of 256 LLM-based embedding pipeline configurations for tabular prediction, showing performance depends heavily on pipeline design choices
Details
Motivation: To provide evidence-based guidance on designing effective LLM-based embedding pipelines for tabular prediction, as there's limited existing research on optimal pipeline configurations.
Method: Benchmarked 256 pipeline configurations covering 8 preprocessing strategies, 16 embedding models, and 2 downstream models (including gradient-boosted decision trees).
Result: Concatenating embeddings outperforms replacing original columns; larger embedding models yield better results; leaderboard rankings/popularity are poor performance indicators; gradient boosting trees are strong downstream models
Conclusion: Provides practical guidance for researchers/practitioners on building effective LLM embedding pipelines for tabular prediction, emphasizing the importance of specific design choices
Abstract: Embeddings are a powerful way to enrich data-driven machine learning models with the world knowledge of large language models (LLMs). Yet, there is limited evidence on how to design effective LLM-based embedding pipelines for tabular prediction. In this work, we systematically benchmark 256 pipeline configurations, covering 8 preprocessing strategies, 16 embedding models, and 2 downstream models. Our results show that whether incorporating the prior knowledge of LLMs improves predictive performance depends strongly on the specific pipeline design. In general, concatenating embeddings tends to outperform replacing the original columns with embeddings. Larger embedding models tend to yield better results, while public leaderboard rankings and model popularity are poor performance indicators. Finally, gradient boosting decision trees tend to be strong downstream models. Our findings provide researchers and practitioners with guidance for building more effective embedding pipelines for tabular prediction tasks.
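The headline finding (concatenate rather than replace) is easy to picture in code; `embed` below is a hypothetical stand-in for any of the 16 embedding models benchmarked:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# "Concatenate, don't replace": keep the original tabular columns and append
# embeddings of the textual fields, then feed both to a boosted-tree model.
def embed(texts):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 32))   # placeholder embedding vectors

rng = np.random.default_rng(1)
X_num = rng.random((200, 5))                              # original columns
X_emb = embed([f"occupation: {i % 7}" for i in range(200)])
X = np.hstack([X_num, X_emb])                             # concat, not replace
y = rng.integers(0, 2, size=200)
clf = GradientBoostingClassifier().fit(X, y)              # strong downstream model
```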
[527] Towards Infinitely Long Neural Simulations: Self-Refining Neural Surrogate Models for Dynamical Systems
Qi Liu, Laure Zanna, Joan Bruna
Main category: cs.LG
TL;DR: SNS is a self-refining neural surrogate model using conditional diffusion to balance short-time accuracy with long-time consistency in dynamical system simulations, addressing distribution drift in autoregressive models.
Details
Motivation: Autoregressive neural surrogate models suffer from distribution drift where compounding errors degrade generation quality over long time horizons, and existing approaches rely on implicit trade-offs through hyperparameter tuning rather than explicit solutions.
Method: Proposes a mathematical framework formalizing the trade-off between short-time accuracy and long-time consistency, then implements SNS as a conditional diffusion model that refines its own autoregressive outputs to balance these objectives without hyperparameter tuning.
Result: Demonstrates numerical feasibility through high-fidelity simulations of complex dynamical systems over arbitrarily long time horizons, showing improved long-time consistency compared to standard autoregressive approaches.
Conclusion: SNS provides a robust, hyperparameter-free solution to distribution drift in neural surrogate models, enabling accurate long-term simulations of dynamical systems while maintaining short-time fidelity.
Abstract: Recent advances in autoregressive neural surrogate models have enabled orders-of-magnitude speedups in simulating dynamical systems. However, autoregressive models are generally prone to distribution drift: compounding errors in autoregressive rollouts that severely degrade generation quality over long time horizons. Existing work attempts to address this issue by implicitly leveraging the inherent trade-off between short-time accuracy and long-time consistency through hyperparameter tuning. In this work, we introduce a unifying mathematical framework that makes this tradeoff explicit, formalizing and generalizing hyperparameter-based strategies in existing approaches. Within this framework, we propose a robust, hyperparameter-free model implemented as a conditional diffusion model that balances short-time fidelity with long-time consistency by construction. Our model, Self-refining Neural Surrogate model (SNS), can be implemented as a standalone model that refines its own autoregressive outputs or as a complementary model to existing neural surrogates to ensure long-time consistency. We also demonstrate the numerical feasibility of SNS through high-fidelity simulations of complex dynamical systems over arbitrarily long time horizons.
[528] Attention Sinks Induce Gradient Sinks
Yihong Chen, Quanming Yao
Main category: cs.LG
TL;DR: Attention sinks cause gradient concentration (gradient sinks) in Transformers, which in turn leads to massive activations as an adaptive response during training, with V-scale modification showing this causal relationship.
Details
Motivation: To understand the relationship between attention sinks and massive activations in Transformers, specifically whether their connection is direct or mediated by training-time mechanisms, by examining backpropagation dynamics.
Method: Analyzes backpropagation in Transformers with causal masks, showing attention sinks induce gradient concentration (gradient sinks). Introduces V-scale modification to adjust value-path backpropagated gradients to test the hypothesis.
Result: In pretrained V-scale models, attention sinks are preserved while massive activations are suppressed, supporting that gradient sink is a key training-time mediator linking attention sinks and massive activations.
Conclusion: Gradient sinks serve as a crucial training-time mechanism connecting attention sinks and massive activations in Transformer models, with implications for understanding and optimizing Transformer training dynamics.
Abstract: Attention sinks and massive activations are recurring and closely related phenomena in Transformer models. Existing studies have largely focused on the forward pass, making it unclear whether their connection is direct or mediated by a training-time mechanism. We study this question from the perspective of backpropagation. Empirically and theoretically, we show that under causal mask, attention sinks can induce pronounced gradient concentration, which we term gradient sinks. Furthermore, in pre-norm architectures with RMSNorm, massive activations can be understood as an adaptive response to this localized gradient pressure during training. To test this hypothesis, we introduce V-scale, a modification that adjusts value-path backpropagated gradients. In pretrained V-scale models, attention sinks are preserved whereas massive activations are suppressed. These results support the interpretation that gradient sink is a key training-time mediator linking attention sinks and massive activations.
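A toy illustration of the mechanism, not the paper's experiment: when every query attends heavily to token 0, the value-path gradient concentrates there.

```python
import torch

# Toy causal attention with a sink at position 0: most rows attend to the
# first token, so gradient mass flowing back to the values piles up there.
T, d = 8, 4
v = torch.randn(T, d, requires_grad=True)
scores = torch.full((T, T), -1.0)
scores[:, 0] = 4.0                      # every row favors token 0 (the "sink")
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
attn = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
out = attn @ v
out.sum().backward()
print(v.grad.norm(dim=-1))              # position 0 dominates: a gradient sink
```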
[529] RangeAD: Fast On-Model Anomaly Detection
Luca Hinkamp, Simon Klüttermann, Emmanuel Müller
Main category: cs.LG
TL;DR: RangeAD: An on-model anomaly detection method that leverages neuron-wise output ranges from existing ML models for efficient anomaly detection without separate AD models.
Details
Motivation: Current anomaly detection approaches run separate AD models alongside primary models, ignoring that primary models already encode substantial information about the target distribution, leading to redundant computation and inefficiency.
Method: Proposes On-Model AD setting that leverages access to related ML models, and introduces RangeAD algorithm that utilizes neuron-wise output ranges derived from the primary model for anomaly detection.
Result: RangeAD achieves superior performance on high-dimensional tasks while incurring substantially lower inference costs compared to traditional separate AD models.
Conclusion: On-Model AD setting provides a practical framework for efficient anomaly detection by leveraging existing model information rather than running separate detection systems.
Abstract: In practice, machine learning methods commonly require anomaly detection (AD) to filter inputs or detect distributional shifts. Typically, this is implemented by running a separate AD model alongside the primary model. However, this separation ignores the fact that the primary model already encodes substantial information about the target distribution. In this paper, we introduce On-Model AD, a setting for anomaly detection that explicitly leverages access to a related machine learning model. Within this setting, we propose RangeAD, an algorithm that utilizes neuron-wise output ranges derived from the primary model. RangeAD achieves superior performance even on high-dimensional tasks while incurring substantially lower inference costs. Our results demonstrate the potential of the On-Model AD setting as a practical framework for efficient anomaly detection.
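A minimal sketch of the neuron-wise range idea, with `activations` standing in for a forward hook on the primary model; the actual RangeAD scoring rule may differ.

```python
import numpy as np

# Record each neuron's min/max activation on in-distribution data, then flag
# inputs whose activations escape those ranges. No separate AD model needed.
def activations(x):                     # hypothetical hidden-layer outputs
    return np.tanh(x @ np.random.default_rng(0).normal(size=(8, 16)))

train = np.random.randn(1000, 8)
acts = activations(train)
lo, hi = acts.min(axis=0), acts.max(axis=0)   # per-neuron output ranges

def anomaly_score(x):
    a = activations(x)
    return np.maximum(lo - a, 0).sum(-1) + np.maximum(a - hi, 0).sum(-1)

print(anomaly_score(np.random.randn(5, 8) * 5))   # far-OOD inputs score high
```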
[530] Dropout Robustness and Cognitive Profiling of Transformer Models via Stochastic Inference
Antônio Junior Alves Caiado, Michael Hahsler
Main category: cs.LG
TL;DR: Comprehensive analysis of transformer model robustness under MC Dropout reveals architecture-dependent variability, with medium models showing best overall performance and 53% suffering severe accuracy degradation, highlighting limitations for uncertainty quantification.
Details
Motivation: Transformer language models are widely used for reasoning tasks, but their behavior under inference-time stochasticity (via MC Dropout) remains underexplored, limiting understanding of model reliability for uncertainty-aware applications.
Method: Analyzed 19 transformer models using MC Dropout with 100 stochastic forward passes per sample. Used a cognitive decomposition framework to separate performance into memory and reasoning components. Conducted 95 unique evaluations across five dropout configurations on 1,000 samples.
Result: Substantial architectural variation: smaller models show perfect stability, medium models exhibit volatility, mid-sized achieve best overall performance, larger excel at memory tasks. 53% suffer severe accuracy degradation (up to 24 percentage points). Asymmetric effects: high dropout reduces memory accuracy by 27 points vs. reasoning by only 1 point. 84% show memory-biased performance.
Conclusion: First comprehensive MC Dropout benchmark reveals dropout robustness is architecture-dependent and uncorrelated with scale. Cognitive profiling framework provides actionable guidance for model selection in uncertainty-aware applications.
Abstract: Transformer-based language models are widely deployed for reasoning, yet their behavior under inference-time stochasticity remains underexplored. While dropout is common during training, its inference-time effects via Monte Carlo sampling lack systematic evaluation across architectures, limiting understanding of model reliability in uncertainty-aware applications. This work analyzes dropout-induced variability across 19 transformer models using MC Dropout with 100 stochastic forward passes per sample. Dropout robustness is defined as maintaining high accuracy and stable predictions under stochastic inference, measured by standard deviation of per-run accuracies. A cognitive decomposition framework disentangles performance into memory and reasoning components. Experiments span five dropout configurations yielding 95 unique evaluations on 1,000 samples. Results reveal substantial architectural variation. Smaller models demonstrate perfect prediction stability while medium-sized models exhibit notable volatility. Mid-sized models achieve the best overall performance; larger models excel at memory tasks. Critically, 53% of models suffer severe accuracy degradation under baseline MC Dropout, with task-specialized models losing up to 24 percentage points, indicating unsuitability for uncertainty quantification in these architectures. Asymmetric effects emerge: high dropout reduces memory accuracy by 27 percentage points while reasoning degrades only 1 point, suggesting memory tasks rely on stable representations that dropout disrupts. 84% of models demonstrate memory-biased performance. This provides the first comprehensive MC Dropout benchmark for transformers, revealing dropout robustness is architecture-dependent and uncorrelated with scale. The cognitive profiling framework offers actionable guidance for model selection in uncertainty-aware applications.
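The evaluation protocol is straightforward to reproduce in PyTorch; a sketch with a toy classifier standing in for a transformer, and 100 passes mirroring the paper's setup:

```python
import torch
import torch.nn as nn

# MC Dropout at inference: keep dropout layers stochastic while the rest of
# the model stays in eval mode, then measure the spread of per-run accuracy.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(0.1),
                      nn.Linear(64, 4))          # stand-in for a transformer
model.eval()
for m in model.modules():
    if isinstance(m, nn.Dropout):
        m.train()                                # dropout stays active

x = torch.randn(256, 32)
y = torch.randint(0, 4, (256,))
with torch.no_grad():
    accs = torch.stack([(model(x).argmax(-1) == y).float().mean()
                        for _ in range(100)])    # 100 stochastic passes
print(accs.mean().item(), accs.std().item())     # robustness = high mean, low std
```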
[531] Federated Distributional Reinforcement Learning with Distributional Critic Regularization
David Millard, Cecilia Alm, Rashid Ali, Pengcheng Shi, Ali Baheri
Main category: cs.LG
TL;DR: FedDistRL: Federated distributional RL with quantile critics and Wasserstein barycenter trust regions for safety-critical applications
Details
Motivation: Standard federated RL aggregates value functions/policies via parameter averaging, which emphasizes expected return but obscures statistical multimodality and tail behavior crucial for safety-critical settings where risk assessment matters.
Method: Proposes FedDistRL, in which clients parameterize quantile value-function critics and federate only these networks, and TR-FedDistRL, which builds a per-client risk-aware Wasserstein barycenter over a temporal buffer as a reference region to constrain the parameter-averaged critic via a shrink-squash step.
Result: Experiments on bandit, multi-agent gridworld, and continuous highway environments show reduced mean-smearing, improved safety proxies (catastrophe/accident rate), and lower critic/policy drift compared to mean-oriented and non-federated baselines.
Conclusion: Distributional federated RL with Wasserstein barycenter trust regions preserves crucial distributional information during federation, improving safety performance in risk-sensitive applications.
Abstract: Federated reinforcement learning typically aggregates value functions or policies by parameter averaging, which emphasizes expected return and can obscure statistical multimodality and tail behavior that matter in safety-critical settings. We formalize federated distributional reinforcement learning (FedDistRL), where clients parametrize quantile value function critics and federate these networks only. We also propose TR-FedDistRL, which builds a per-client, risk-aware Wasserstein barycenter over a temporal buffer. This local barycenter provides a reference region to constrain the parameter-averaged critic, ensuring necessary distributional information is not averaged out during the federation process. The distributional trust region is implemented as a shrink-squash step around this reference. Under fixed-policy evaluation, the feasibility map is nonexpansive and the update is contractive in a probe-set Wasserstein metric. Experiments on a bandit, multi-agent gridworld, and continuous highway environment show reduced mean-smearing, improved safety proxies (catastrophe/accident rate), and lower critic/policy drift versus mean-oriented and non-federated baselines.
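A useful detail behind the trust region: for one-dimensional distributions stored as quantile values at shared levels, the 2-Wasserstein barycenter has a closed form, the weighted quantile-wise average. A sketch; the weights are illustrative, not the paper's risk-aware scheme.

```python
import numpy as np

# Building a reference distribution over a temporal buffer of quantile
# critics: the W2 barycenter of 1-D quantile functions is their weighted
# level-by-level average.
rng = np.random.default_rng(0)
buffer = np.sort(rng.normal(size=(5, 32)), axis=-1)   # 5 snapshots x 32 quantiles
weights = np.array([0.10, 0.15, 0.20, 0.25, 0.30])    # e.g. recency weighting
barycenter = weights @ buffer                          # quantile-wise average
# `barycenter` stays monotone (a valid quantile function) by construction.
```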
[532] Symmetry-Reduced Physics-Informed Learning of Tensegrity Dynamics
Jing Qin, Muhao Chen
Main category: cs.LG
TL;DR: SymPINN: A symmetry-reduced physics-informed neural network framework that embeds group-theory-based symmetry into neural architecture for predicting tensegrity dynamics with improved accuracy and efficiency.
Details
Motivation: Existing physics-informed neural networks for tensegrity dynamics don't explicitly exploit the intrinsic geometric symmetries of tensegrity structures, leading to high computational complexity and unstable optimization.
Method: Proposes SymPINN framework that embeds group-theory-based symmetry into both solution expression and neural architecture. Decomposes nodes into symmetry orbits, uses symmetry basis for reduced coordinate representation, recovers full coordinates via symmetry transformations. Enforces equivariance through orbit-based coordinate generation, symmetry-consistent message passing, and physics residual constraints. Includes hard constraint encoding of initial conditions, Fourier feature encoding, and two-stage optimization.
Result: Extensive numerical experiments on symmetric T-bars and lander structures demonstrate significantly improved prediction accuracy and computational efficiency compared to standard physics-informed models.
Conclusion: Symmetry-aware learning shows great potential for structure-preserving modeling of tensegrity dynamics, with SymPINN providing a framework that explicitly exploits geometric symmetries for better performance.
Abstract: Tensegrity structures possess intrinsic geometric symmetries that govern their dynamic behavior. However, most existing physics-informed neural network (PINN) approaches for tensegrity dynamics do not explicitly exploit these symmetries, leading to high computational complexity and unstable optimization. In this work, we propose a symmetry-reduced physics-informed neural network (SymPINN) framework that embeds group-theory-based symmetry directly into both the solution expression and the neural network architecture to predict tensegrity dynamics. By decomposing nodes into symmetry orbits and representing free nodal coordinates using a symmetry basis, the proposed method constructs a reduced coordinate representation that preserves geometric symmetry of the structure. The full coordinates are then recovered via symmetry transformations of the reduced solution learned by the network, ensuring that the predicted configurations automatically satisfy the symmetry constraints. In this framework, equivariance is enforced through orbit-based coordinate generation, symmetry-consistent message passing, and physics residual constraints. In addition, SymPINN improves training effectiveness by encoding initial conditions as hard constraints, incorporating Fourier feature encoding to enhance the representation of dynamic motions, and employing a two-stage optimization strategy. Extensive numerical experiments on symmetric T-bars and lander structures demonstrate significantly improved prediction accuracy and computational efficiency compared to standard physics-informed models, indicating the great potential of symmetry-aware learning for structure-preserving modeling of tensegrity dynamics.
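A toy example of the reduce-then-recover step under a C4 rotation group (the paper's structures and symmetry groups are more general):

```python
import numpy as np

# Store one representative node per orbit and regenerate the rest with the
# group's transformations, here 4-fold rotation about the z-axis.
def rot_z(k):
    t = k * np.pi / 2
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0, 0.0, 1.0]])

reduced = np.array([1.0, 0.2, 0.5])              # one representative node
full = np.stack([rot_z(k) @ reduced for k in range(4)])
# `full` (4 x 3) satisfies the C4 symmetry by construction, so anything
# predicted in reduced coordinates lifts to a symmetric configuration.
```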
[533] Physics-Aware Machine Learning for Seismic and Volcanic Signal Interpretation
William Thorossian
Main category: cs.LG
TL;DR: Survey paper on machine learning applications for seismic and volcanic monitoring, focusing on domain adaptation, uncertainty quantification, and physical constraints in operational settings.
Details
Motivation: Modern seismic/volcanic monitoring requires extracting actionable information from continuous, multi-sensor, nonstationary noisy data, but ML models need to be reliable under domain shifts, provide uncertainty estimates, and connect to physical constraints for operational use.
Method: Survey and organization of recent ML approaches for seismic/volcanic signal analysis, examining where classical signal processing provides inductive bias, how self-supervision/generative modeling reduce label dependence, and evaluation protocols for cross-region transfer.
Result: Comprehensive overview of ML techniques applied to seismic/volcanic monitoring, highlighting approaches that address domain shift, uncertainty quantification, and physical interpretability for operational deployment.
Conclusion: Identifies open challenges for robust, interpretable, and maintainable AI-assisted monitoring, emphasizing the need for models that remain reliable under changing conditions and connect outputs to physically meaningful constraints.
Abstract: Modern seismic and volcanic monitoring is increasingly shaped by continuous, multi-sensor observations and by the need to extract actionable information from nonstationary, noisy wavefields. In this context, machine learning has moved from a research curiosity to a practical ingredient of processing chains for detection, phase picking, classification, denoising, and anomaly tracking. However, improved accuracy on a fixed dataset is not sufficient for operational use. Models must remain reliable under domain shift (new stations, changing noise, evolving volcanic activity), provide uncertainty that supports decision-making, and connect their outputs to physically meaningful constraints. This paper surveys and organizes recent ML approaches for seismic and volcanic signal analysis, highlighting where classical signal processing provides indispensable inductive bias, how self-supervision and generative modeling can reduce dependence on labels, and which evaluation protocols best reflect transfer across regions. We conclude with open challenges for robust, interpretable, and maintainable AI-assisted monitoring.
[534] Procedural Generation of Algorithm Discovery Tasks in Machine Learning
Alexander D. Goldie, Zilin Wang, Adrian Hayler, Deepak Nathani, Edan Toledo, Ken Thampiratwong, Aleksandra Kalisz, Michael Beukman, Alistair Letcher, Shashank Reddy, Clarisse Wibault, Theo Wolf, Charles O’Neill, Uljad Berdica, Nicholas Roberts, Saeed Rahmani, Hannah Erlebach, Roberta Raileanu, Shimon Whiteson, Jakob N. Foerster
Main category: cs.LG
TL;DR: DiscoGen is a procedural generator for creating millions of diverse machine learning algorithm discovery tasks across different ML fields, with DiscoBench providing a fixed benchmark subset for evaluating algorithm discovery agents.
Details
Motivation: Existing algorithm discovery task suites suffer from poor evaluation methodologies, data contamination, and saturated or very similar problems, limiting our ability to improve and evaluate algorithm discovery systems.
Method: DiscoGen procedurally generates algorithm discovery tasks (like developing optimizers for RL or loss functions for image classification) spanning millions of tasks with varying difficulty and complexity, specified by a small number of configuration parameters.
Result: The system generates diverse ML algorithm discovery tasks and includes DiscoBench as a fixed benchmark subset for principled evaluation of algorithm discovery agents, with demonstrations of its use for prompt optimization.
Conclusion: DiscoGen enables ambitious research directions in algorithm discovery and provides an open-source tool for generating diverse ML algorithm development tasks, addressing limitations of existing task suites.
Abstract: Automating the development of machine learning algorithms has the potential to unlock new breakthroughs. However, our ability to improve and evaluate algorithm discovery systems has thus far been limited by existing task suites. They suffer from many issues, such as poor evaluation methodologies, data contamination, and saturated or very similar problems. Here, we introduce DiscoGen, a procedural generator of algorithm discovery tasks for machine learning, such as developing optimisers for reinforcement learning or loss functions for image classification. Motivated by the success of procedural generation in reinforcement learning, DiscoGen spans millions of tasks of varying difficulty and complexity from a range of machine learning fields. These tasks are specified by a small number of configuration parameters and can be used to optimise algorithm discovery agents (ADAs). We present DiscoBench, a benchmark consisting of a fixed, small subset of DiscoGen tasks for principled evaluation of ADAs. Finally, we propose a number of ambitious, impactful research directions enabled by DiscoGen, in addition to experiments demonstrating its use for prompt optimisation of an ADA. DiscoGen is released open-source at https://github.com/AlexGoldie/discogen.
[535] RHYME-XT: A Neural Operator for Spatiotemporal Control Systems
Marijn Ruiter, Miguel Aguiar, Jake Rap, Karl H. Johansson, Amritam Das
Main category: cs.LG
TL;DR: RHYME-XT is a neural operator framework for surrogate modeling of spatiotemporal control systems with rhythmic behavior, using Galerkin projection and flow map learning to avoid costly integration.
Details
Motivation: The paper addresses the challenge of modeling complex spatiotemporal control systems governed by nonlinear partial integro-differential equations (PIDEs) with localized rhythmic behavior, which are computationally expensive to simulate directly.
Method: Uses Galerkin projection to approximate infinite-dimensional PIDE on a learned finite-dimensional subspace with neural network-parameterized spatial basis functions, then directly learns the flow map of the projected ODE system using a flow function architecture instead of integrating the non-autonomous system.
Result: RHYME-XT outperforms state-of-the-art neural operators on neural field PIDE experiments and demonstrates effective knowledge transfer across different datasets through fine-tuning.
Conclusion: The framework provides an efficient, continuous-time, discretization-invariant approach for surrogate modeling of spatiotemporal control systems with rhythmic behavior, enabling knowledge transfer across related models.
Abstract: We propose RHYME-XT, an operator-learning framework for surrogate modeling of spatiotemporal control systems governed by input-affine nonlinear partial integro-differential equations (PIDEs) with localized rhythmic behavior. RHYME-XT uses a Galerkin projection to approximate the infinite-dimensional PIDE on a learned finite-dimensional subspace with spatial basis functions parameterized by a neural network. This yields a projected system of ODEs driven by projected inputs. Instead of integrating this non-autonomous system, we directly learn its flow map using an architecture for learning flow functions, avoiding costly computations while obtaining a continuous-time and discretization-invariant representation. Experiments on a neural field PIDE show that RHYME-XT outperforms a state-of-the-art neural operator and is able to transfer knowledge effectively across models trained on different datasets, through a fine-tuning process.
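The projection step in isolation, with a fixed orthonormal basis standing in for the learned, network-parameterized one:

```python
import numpy as np

# Galerkin projection onto a basis: with orthonormal spatial basis functions
# Phi (n_grid x K), a field u reduces to K coefficients a = Phi^T u, the
# state of the projected ODE system, and lifts back as u_hat = Phi a.
n_grid, K = 256, 8
Phi, _ = np.linalg.qr(np.random.randn(n_grid, K))   # orthonormal columns
u = np.sin(np.linspace(0, 4 * np.pi, n_grid)) + 0.1 * np.random.randn(n_grid)
a = Phi.T @ u                                       # project: K-dim state
u_hat = Phi @ a                                     # lift back to the grid
print(np.linalg.norm(u - u_hat) / np.linalg.norm(u))
```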
[536] Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs
Abhishek Gupta, Aditya Mahajan
Main category: cs.LG
TL;DR: The paper presents a theoretical framework for MDPs as optimization over linear operators in function spaces, enabling generalization of RL results to continuous state/action spaces and development of new PPO-type algorithms.
Details
Motivation: To extend reinforcement learning theory beyond finite state/action spaces and linear function approximations by leveraging perturbation theory of linear operators, allowing for more general continuous spaces.
Method: Views MDPs as optimization over linear operators in general function spaces, applies perturbation theory to compute derivatives of objective functions, and develops new PPO-type algorithms for general state/action spaces.
Result: Generalizes many established RL results to cases with general state and action spaces, provides theoretical foundations for continuous-space MDPs, and introduces new low-complexity PPO-type algorithms.
Conclusion: The linear operator framework enables rigorous treatment of MDPs with continuous state/action spaces, extending RL theory beyond finite settings and enabling development of new algorithms.
Abstract: Markov decision processes (MDPs) are viewed as the optimization of an objective function over certain linear operators on general function spaces. Using the well-established perturbation theory of linear operators, this viewpoint allows one to identify derivatives of the objective function as a function of the linear operators. This leads to generalizations of many well-known results in reinforcement learning to cases with general state and action spaces. Prior results of this type were only established in finite-state finite-action MDP settings and in settings with certain linear function approximations. The framework also leads to new low-complexity PPO-type reinforcement learning algorithms for general state and action space MDPs.
[537] RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference
Arpit Singh Gautam, Saurabh Jha
Main category: cs.LG
TL;DR: RAMP is a reinforcement learning framework for adaptive mixed-precision quantization of LLMs that learns per-layer bit-width assignments to optimize accuracy-efficiency trade-offs under global bit budgets.
Details
Motivation: Current post-training quantization methods use uniform bit widths across layers, leading to suboptimal accuracy-efficiency trade-offs. There's a need for adaptive quantization that can assign different bit widths to different layers based on their sensitivity to quantization.
Method: RAMP uses an off-policy Soft Actor-Critic reinforcement learning framework that learns per-layer bit-width assignments. It conditions on 11-dimensional embeddings of activation statistics, weight properties, and structural descriptors. Introduces Scale Folding for stable sub-4-bit quantization by migrating activation outliers into weights via per-channel scaling and normalization layer compensation.
Result: On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90GB) and GPTQ by 6% in size and 1-3% in quality. Policies trained on Llama 2 7B generalize zero-shot to Llama 2 13B and Mistral 7B.
Conclusion: RAMP enables efficient adaptive quantization of LLMs with better accuracy-efficiency trade-offs than uniform quantization methods, and demonstrates that quantization sensitivity is primarily architectural, enabling zero-shot transfer across model families and scales.
Abstract: Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit-widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality. Critically, a policy trained only on Llama 2 7B generalizes zero-shot to Llama 2 13B and Mistral 7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.
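The budget bookkeeping behind "effective bits" is simple arithmetic; the layer names, sizes, and bit assignments below are illustrative, not RAMP's learned allocation:

```python
# Effective bits under a global budget: per-layer bit-widths weighted by
# parameter counts, the quantity the policy must keep below the budget.
layers = [("attn.q", 4_194_304, 4), ("attn.k", 4_194_304, 3),
          ("mlp.up", 11_008_000, 4), ("mlp.down", 11_008_000, 3)]
params = sum(n for _, n, _ in layers)
effective_bits = sum(n * b for _, n, b in layers) / params
size_gb = sum(n * b for _, n, b in layers) / 8 / 1e9
print(f"{effective_bits:.2f} effective bits -> {size_gb:.3f} GB")
```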
[538] CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention
Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, Shuaiwen Leon Song
Main category: cs.LG
TL;DR: CARE: Covariance-Aware Rank-Enhanced conversion pipeline that transforms pretrained attention modules into multi-head latent attention while preserving activation fidelity and maintaining fixed KV-cache size.
Details
Motivation: Existing conversion methods for transforming attention modules to MLA focus on minimizing weight matrix differences rather than activation effects, ignore activation covariance structure, and use uniform rank allocation, leading to activation drift and degraded attention fidelity.
Method: Three-step pipeline: 1) Activation-preserving factorization that aligns approximation with input activations, 2) Adjusted-rank allocation that distributes fixed KV budget across layers based on need, 3) KV-parity mapping that reparameterizes converted K and V to fit MLA format while keeping KV-cache size unchanged.
Result: Outperforms uniform-rank SVD baseline on Qwen3-4B/30B and Llama-3.1-8B/70B models, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x at matched KV budgets. With brief post-SVD fine-tuning, fully recovers original model accuracy.
Conclusion: CARE provides an effective method for converting pretrained attention modules to MLA that maintains activation fidelity while keeping KV-cache costs fixed, enabling more expressive attention mechanisms without inference overhead.
Abstract: Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged. Our method outperforms a uniform-rank SVD baseline on Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x at matched KV budgets. With a brief post-SVD healing fine-tune, we fully recover the original model’s accuracy.
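The general activation-preserving recipe (whiten by the calibration covariance, truncate, unwhiten) fits in a few lines; this is the textbook construction, not necessarily CARE's exact factorization:

```python
import numpy as np

# Weight-only SVD minimizes ||W - W_hat||; the activation-preserving factor
# minimizes ||W X - W_hat X|| instead, via the Cholesky factor of X X^T.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))                             # (out, in) weights
X = rng.normal(size=(128, 1024)) * np.linspace(0.1, 3.0, 128)[:, None]
L = np.linalg.cholesky(X @ X.T / X.shape[1] + 1e-6 * np.eye(128))
U, S, Vt = np.linalg.svd(W @ L, full_matrices=False)
r = 32
W_cov = (U[:, :r] * S[:r]) @ Vt[:r] @ np.linalg.inv(L)     # covariance-aware

Uw, Sw, Vtw = np.linalg.svd(W, full_matrices=False)
W_svd = (Uw[:, :r] * Sw[:r]) @ Vtw[:r]                     # weight-only baseline
print(np.linalg.norm((W - W_cov) @ X),                     # lower by construction
      np.linalg.norm((W - W_svd) @ X))
```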
[539] Unified Policy Value Decomposition for Rapid Adaptation
Cristiano Capone, Luca Falorsi, Andrea Ciardiello, Luca Manneschi
Main category: cs.LG
TL;DR: Framework for rapid RL adaptation using shared low-dimensional goal embeddings that enable zero-shot task adaptation without retraining representations.
Details
Motivation: Addresses the challenge of rapid adaptation in complex control systems by enabling immediate adaptation to novel tasks without retraining neural network representations.
Method: Jointly learns structured value bases and compatible policy bases through bilinear actor-critic decomposition with shared goal embeddings. Critic factorizes as Q = sum_k G_k(g) y_k(s,a), and actor composes primitive policies weighted by same coefficients G_k(g).
Result: Trained Soft Actor-Critic agent on MuJoCo Ant with multi-directional locomotion, showing policy heads specialize to subsets of directions while shared coefficients generalize across them, enabling interpolation for novel directions.
Conclusion: Shared low-dimensional goal embeddings offer general mechanism for rapid structured adaptation in high-dimensional control, potentially biologically plausible for efficient transfer in complex RL systems.
Abstract: Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which policy and value functions share a low-dimensional coefficient vector - a goal embedding - that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as Q = sum_k G_k(g) y_k(s,a), where G_k(g) is a goal-conditioned coefficient vector and y_k(s,a) are learned value basis functions. This multiplicative gating - where a context signal scales a set of state-dependent bases - is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients G_k(g). At test time the bases are frozen and G_k(g) is estimated zero-shot via a single forward pass, enabling immediate adaptation to novel tasks without any gradient update. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective, requiring the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them, accommodating novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.
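A minimal sketch of the bilinear critic factorization; network sizes are placeholders:

```python
import torch
import torch.nn as nn

# Q(s, a, g) = sum_k G_k(g) * y_k(s, a): K shared value bases gated by a
# K-dimensional goal embedding. At test time the bases freeze and only the
# coefficients G(g) are (re-)estimated for a novel goal.
K, s_dim, a_dim, g_dim = 16, 27, 8, 2
bases = nn.Sequential(nn.Linear(s_dim + a_dim, 128), nn.ReLU(),
                      nn.Linear(128, K))            # y(s, a) in R^K
gate = nn.Sequential(nn.Linear(g_dim, 64), nn.ReLU(),
                     nn.Linear(64, K))              # G(g) in R^K

def Q(s, a, g):
    return (gate(g) * bases(torch.cat([s, a], -1))).sum(-1)

q = Q(torch.randn(32, s_dim), torch.randn(32, a_dim), torch.randn(32, g_dim))
```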
[540] Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training
Ben S. Southworth, Stephen Thomas
Main category: cs.LG
TL;DR: MUD is a new optimizer that improves training efficiency for transformers by using triangular whitening instead of polar decomposition, reducing computational overhead while maintaining performance.
Details
Motivation: Existing orthogonalized-momentum optimizers like Muon use polar decomposition for whitening momentum updates, but this requires multiple large matrix multiplications with substantial hardware-dependent overhead. There's a need for more efficient whitening approaches.
Method: MUD replaces Muon’s polar update with a triangular (Cholesky-like) whitening surrogate inspired by classical Gram-Schmidt and Gauss-Seidel ideas. It decorrelates momentum updates using a computationally cheaper triangular approach rather than full orthogonalization.
Result: MUD achieves 10-50% wall-clock improvements over tuned AdamW and Muon in time-to-perplexity, with 1.3-2.6× higher peak tokens/s than Muon (up to nearly 3× on GPT-2 large). It matches Muon-level validation perplexity for ESM-2 150M protein language model in significantly less time.
Conclusion: MUD provides an efficient alternative to polar decomposition-based optimizers, offering substantial computational savings while maintaining training quality for transformer models, making it practical for large-scale training.
Abstract: Orthogonalized-momentum optimizers such as Muon improve transformer training by approximately whitening/orthogonalizing matrix-valued momentum updates via a short polar-decomposition iteration. However, polar-factor approximations typically require multiple large matrix multiplications, and the resulting overhead can be substantial and hardware-dependent. We introduce MUD (MomentUm Decorrelation), a complementary whitening approach that replaces Muon’s polar update with a triangular (Cholesky-like) whitening surrogate inspired by classical Gram-Schmidt and Gauss-Seidel ideas. We show that row-orthonormal matrices are fixed points of the MUD map, relate the inner step to symmetric Gauss-Seidel preconditioning of the Gram matrix, and prove quadratic local convergence near the fixed point. MUD yields consistent 10-50% wall-clock improvements over tuned AdamW and Muon in time-to-perplexity, typically converging slightly slower per step than Muon but with substantially lower optimizer overhead: relative to Muon, MUD improves peak tokens/s by roughly $1.3-2.6\times$ across most settings and up to nearly $3\times$ on GPT-2 large on an A100. We also demonstrate training an ESM-2 150M protein language model, where MUD matches Muon-level validation perplexity in significantly less wall-clock time.
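One natural instance of a Cholesky-like whitening step, my reading rather than necessarily MUD's exact update: form the Gram matrix once and apply a triangular solve, instead of iterating toward the polar factor.

```python
import torch

# Triangular whitening of matrix-shaped momentum M: with G = M M^T = L L^T,
# the update L^{-1} M has orthonormal rows, consistent with the abstract's
# claim that row-orthonormal matrices are fixed points.
def triangular_whiten(M: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    G = M @ M.T + eps * torch.eye(M.shape[0])
    L = torch.linalg.cholesky(G)
    return torch.linalg.solve_triangular(L, M, upper=False)

M = torch.randn(64, 256)                # momentum for a 64 x 256 weight matrix
W = triangular_whiten(M)
print(torch.allclose(W @ W.T, torch.eye(64), atol=1e-4))   # True: row-orthonormal
```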
[541] HighAir: A Hierarchical Graph Neural Network-Based Air Quality Forecasting Method
Ling Chen, Jiahui Xu, Binqing Wu, Mingqi Lv, Chaoqun Zhan, Sanjian Chen, Jian Chang
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2101.04264 was rate-limited (HTTP 429), so no abstract could be retrieved for this entry.
[542] Aergia: Leveraging Heterogeneity in Federated Learning Systems
Bart Cox, Lydia Y. Chen, Jérémie Decouchant
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2210.06154 was rate-limited (HTTP 429), so no abstract could be retrieved for this entry.
[543] Feature Space Renormalization for Semi-supervised Learning
Jun Sun, Wancheng Zhang, Chao Zhou, Zhongjie Mao, Chao Li, Xiao-Jun Wu
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2311.04055 was rate-limited (HTTP 429), so no abstract could be retrieved for this entry.
[544] Hi-GMAE: Hierarchical Graph Masked Autoencoders
Chuang Liu, Zelin Yao, Xueqi Ma, Mukun Chen, Luzhi Wang, Jia Wu, Wenbin Hu
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2405.10642 was rate-limited (HTTP 429), so no abstract could be retrieved for this entry.
[545] Demystifying amortized causal discovery with transformers
Francesco Montagna, Max Cairney-Leeming, Dhanya Sridhar, Francesco Locatello
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2405.16924 was rate-limited (HTTP 429), so no abstract could be retrieved for this entry.
[546] Fine-Grained Uncertainty Quantification via Collisions
Jesse Friedbaum, Sudarshan Adiga, Ravi Tandon
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2411.12127 was rate-limited (HTTP 429), so no abstract could be retrieved for this entry.
[547] SALSA-RL: Stability Analysis in the Latent Space of Actions for Reinforcement Learning
Xuyang Li, Romit Maulik
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2502.15512 was rate-limited (HTTP 429), so no abstract could be retrieved for this entry.
[548] Joint Value Estimation and Bidding in Repeated First-Price Auctions
Yuxiao Wen, Yanjun Han, Zhengyuan Zhou
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2502.17292 was rate-limited (HTTP 429), so no abstract could be retrieved for this entry.
[549] Offline Reinforcement Learning via Inverse Optimization
Ioannis Dimanidis, Tolga Ok, Peyman Mohajerin Esfahani
Main category: cs.LG
TL;DR: Novel offline RL algorithm using inverse optimization’s sub-optimality loss with robust MPC expert for continuous control, achieving competitive results on MuJoCo with fewer parameters.
Details
Motivation: To address distribution shift in offline reinforcement learning for continuous state/action spaces by leveraging inverse optimization techniques and robust control methods.
Method: Uses convex sub-optimality loss from inverse optimization literature combined with robust non-causal MPC expert that steers nominal dynamics using hindsight information, with exact tractable convex reformulation.
Result: Achieves competitive results on MuJoCo benchmarks compared to baselines in sample-constrained settings while using orders of magnitude fewer parameters.
Conclusion: The proposed inverse optimization approach provides an effective framework for offline RL with strong performance and parameter efficiency, with open-source implementation available.
Abstract: Inspired by the recent successes of Inverse Optimization (IO) across various application domains, we propose a novel offline Reinforcement Learning (ORL) algorithm for continuous state and action spaces, leveraging the convex loss function called "sub-optimality loss" from the IO literature. To mitigate the distribution shift commonly observed in ORL problems, we further employ a robust and non-causal Model Predictive Control (MPC) expert steering a nominal model of the dynamics using in-hindsight information stemming from the model mismatch. Unlike the existing literature, our robust MPC expert enjoys an exact and tractable convex reformulation. In the second part of this study, we show that the IO hypothesis class, trained by the proposed convex loss function, enjoys ample expressiveness and reliably recovers teacher behavior in MuJoCo benchmarks. The method achieves competitive results compared to widely-used baselines in sample-constrained settings, despite using orders of magnitude fewer parameters. To facilitate the reproducibility of our results, we provide an open-source package implementing the proposed algorithms and the experiments. The code is available at https://github.com/TolgaOk/offlineRLviaIO.
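To make the sub-optimality loss concrete, here is a minimal sketch under illustrative assumptions: a quadratic, unconstrained action-cost hypothesis with invented names, not the paper's actual hypothesis class, constraints, or MPC expert.

```python
import numpy as np

# Sketch of the IO "sub-optimality loss": fit a parameterized convex action
# cost f(s, a) so that expert actions are (near-)optimal. The quadratic,
# unconstrained form below is an illustrative assumption, not the paper's.

def suboptimality_loss(Q, W, states, expert_actions):
    """Mean of f(s, a_exp) - min_a f(s, a) for f(s, a) = 0.5 a^T Q a + (W s)^T a.

    With Q positive definite, argmin_a f(s, a) = -Q^{-1} W s in closed form,
    so the loss is non-negative and convex in the parameters (Q, W).
    """
    losses = []
    for s, a_exp in zip(states, expert_actions):
        a_star = -np.linalg.solve(Q, W @ s)            # closed-form minimizer
        f = lambda a: 0.5 * a @ Q @ a + (W @ s) @ a    # convex action cost
        losses.append(f(a_exp) - f(a_star))
    return float(np.mean(losses))

# Toy usage: expert actions generated by a known (Q, W) give ~zero loss.
rng = np.random.default_rng(0)
Q_true, W_true = np.eye(2), rng.normal(size=(2, 3))
states = rng.normal(size=(8, 3))
experts = np.array([-np.linalg.solve(Q_true, W_true @ s) for s in states])
print(suboptimality_loss(Q_true, W_true, states, experts))  # ~0.0
```

Because the loss is convex in the parameters, training reduces to a convex program, which is consistent with the abstract's emphasis on a compact, tractable hypothesis class.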
[550] Adaptive UAV-Assisted Hierarchical Federated Learning: Optimizing Energy, Latency, and Resilience for Dynamic Smart IoT
Xiaohong Yang, Minghui Liwang, Liqun Fu, Yuhan Su, Seyyedali Hosseinalipour, Xianbin Wang, Yiguang Hong
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2503.06145 was rate-limited (HTTP 429).
[551] Arch-VQ: Discrete Architecture Representation Learning with Autoregressive Priors
Deshani Geethika Poddenige, Sachith Seneviratne, Asela Hevapathige, Damith Senanayake, Mahesan Niranjan, PN Suganthan, Saman Halgamuge
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2503.22063 was rate-limited (HTTP 429).
[552] MSGCN: Multiplex Spatial Graph Convolution Network for Interlayer Link Weight Prediction
Steven E. Wilson, Sina Khanmohammadi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2504.17749 was rate-limited (HTTP 429).
[553] Explanations Go Linear: Post-hoc Explainability for Tabular Data with Interpretable Meta-Encoding
Simone Piaggesi, Riccardo Guidotti, Fosca Giannotti, Dino Pedreschi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2504.20667 was rate-limited (HTTP 429).
[554] Clust-Splitter - an Efficient Nonsmooth Optimization-Based Algorithm for Clustering Large Datasets
Jenni Lampainen, Kaisa Joki, Napsu Karmitsa, Marko M. Mäkelä
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.04389 was rate-limited (HTTP 429).
[555] Score Distillation Beyond Acceleration: Generative Modeling from Corrupted Data
Yasi Zhang, Tianyu Chen, Zhendong Wang, Ying Nian Wu, Mingyuan Zhou, Oscar Leong
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.13377 was rate-limited (HTTP 429).
[556] Time Tracker: Mixture-of-Experts-Enhanced Foundation Time Series Forecasting Model with Decoupled Training Pipelines
Aobo Liang, Yan Sun, Xiaohou Shi, Ke Li
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.15151 was rate-limited (HTTP 429).
[557] On the Dynamic Regret of Following the Regularized Leader: Optimism with History Pruning
Naram Mhaisen, George Iosifidis
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.22899 was rate-limited (HTTP 429).
[558] An Introduction to Flow Matching and Diffusion Models
Peter Holderrieth, Ezra Erives
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.02070 was rate-limited (HTTP 429).
[559] Knowing What You Cannot Explain: Learning to Reject Low-Quality Explanations
Luca Stradiotti, Dario Pesenti, Stefano Teso, Jesse Davis
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2507.12900 was rate-limited (HTTP 429).
[560] Constraint Learning in Multi-Agent Dynamic Games from Demonstrations of Local Nash Interactions
Zhouyu Zhang, Chih-Yuan Chiu, Glen Chou
Main category: cs.LG
TL;DR: Inverse dynamic game algorithm learns parametric constraints from multi-agent interaction demonstrations using MILP encoding of KKT conditions to recover constraints consistent with local Nash equilibrium.
Details
Motivation: To enable safe multi-agent interactions by learning underlying constraints from observed equilibrium behaviors, allowing for constraint-aware motion planning without explicit constraint knowledge.
Method: Uses mixed-integer linear programs (MILP) encoding Karush-Kuhn-Tucker (KKT) conditions of interacting agents to recover constraints consistent with local Nash stationarity from demonstration data.
Result: Method learns inner approximations of true safe/unsafe sets, accurately infers both convex and non-convex constraints from nonlinear agent dynamics, and enables robust constraint-satisfying motion planning in simulations and hardware experiments.
Conclusion: The inverse dynamic game approach successfully learns parametric constraints from interaction demonstrations and enables safe motion planning, with theoretical guarantees for constraint recovery.
Abstract: We present an inverse dynamic game-based algorithm to learn parametric constraints from a given dataset of local Nash equilibrium interactions between multiple agents. Specifically, we introduce mixed-integer linear programs (MILP) encoding the Karush-Kuhn-Tucker (KKT) conditions of the interacting agents, which recover constraints consistent with the local Nash stationarity of the interaction demonstrations. We establish theoretical guarantees that our method learns inner approximations of the true safe and unsafe sets. We also use the interaction constraints recovered by our method to design motion plans that robustly satisfy the underlying constraints. Across simulations and hardware experiments, our methods accurately inferred constraints and designed safe interactive motion plans for various classes of constraints, both convex and non-convex, from interaction demonstrations of agents with nonlinear dynamics.
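As a rough illustration of the KKT-consistency idea the MILP encodes, the sketch below checks whether a single demonstrated state is stationary for a candidate constraint. The names, the single-constraint setting, and the projected least-squares multiplier are assumptions for the example; the paper handles this jointly over trajectories, with integer variables encoding complementarity.

```python
import numpy as np

# KKT stationarity for one agent with cost f and one constraint g(x) <= 0:
#   grad_f(x) + lam * grad_g(x) = 0,  lam >= 0,  lam * g(x) = 0.
# A candidate constraint is consistent with a demonstration if the residual
# below is ~0; the paper's MILP searches constraint parameters achieving this.

def kkt_residual(grad_f, grad_g, g_val, tol=1e-6):
    if g_val < -tol:                  # inactive constraint forces lam = 0
        return float(np.linalg.norm(grad_f))
    lam = max(0.0, -(grad_f @ grad_g) / (grad_g @ grad_g))  # projected 1-D LS
    return float(np.linalg.norm(grad_f + lam * grad_g))

# Toy check: f(x) = ||x||^2 at x = (1, 0) with active constraint 1 - x[0] <= 0.
x = np.array([1.0, 0.0])
print(kkt_residual(grad_f=2 * x, grad_g=np.array([-1.0, 0.0]), g_val=0.0))  # 0.0
```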
[561] Diagonal Linear Networks and the Lasso Regularization Path
Raphaël Berthier
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.18766 was rate-limited (HTTP 429).
[562] SHAP Meets Tensor Networks: Provably Tractable Explanations with Parallelism
Reda Marzouk, Shahaf Bassan, Guy Katz
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.21599 was rate-limited (HTTP 429).
[563] Continual Low-Rank Adapters for LLM-based Generative Recommender Systems
Hyunsik Yoo, Ting-Wei Li, SeongKu Kang, Zhining Liu, Charlie Xu, Qilin Qi, Hanghang Tong
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.25093 was rate-limited (HTTP 429).
[564] SoilX: Calibration-Free Comprehensive Soil Sensing through Contrastive Cross-Component Learning
Kang Yang, Yuanlin Yang, Yuning Chen, Sikai Yang, Xinyu Zhang, Wan Du
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.05482 was rate-limited (HTTP 429).
[565] Adaptive Multi-view Graph Contrastive Learning via Fractional-order Neural Diffusion Networks
Yanan Zhao, Feng Ji, Jingyang Dai, Jiaze Ma, Keyue Jiang, Kai Zhao, Wee Peng Tay
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.06216 was rate-limited (HTTP 429).
[566] Provably Safe Model Updates
Leo Elmecker-Plakolm, Pierre Fasterling, Philip Sosnin, Calvin Tsay, Matthew Wicker
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.01899 was rate-limited (HTTP 429).
[567] Efficient Cross-Domain Offline Reinforcement Learning with Dynamics- and Value-Aligned Data Filtering
Zhongjian Qiao, Rui Yang, Jiafei Lyu, Chenjia Bai, Xiu Li, Siyang Gao, Shuang Qiu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.02435 was rate-limited (HTTP 429).
[568] Global Optimization By Gradient From Hierarchical Score-Matching Spaces
Ming Li
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.11639 was rate-limited (HTTP 429).
[569] Distribution-Free Sequential Prediction with Abstentions
Jialin Yu, Moïse Blanchard
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.17918 was rate-limited (HTTP 429).
[570] A Deep Surrogate Model for Robust and Generalizable Long-Term Blast Wave Prediction
Danning Jing, Xinhai Chen, Xifeng Pu, Jie Hu, Chao Huang, Xuguang Chen, Qinglin Wang, Jie Liu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.18168 was rate-limited (HTTP 429).
[571] Federated Causal Representation Learning in State-Space Systems for Decentralized Counterfactual Reasoning
Nazal Mohamed, Ayush Mohanty, Nagi Gebraeel
Main category: cs.LG
TL;DR: Federated causal representation learning framework for industrial control systems that enables cross-client counterfactual reasoning while preserving data privacy and proprietary models.
Details
Motivation: Industrial assets are interdependent but client-specific data is high-dimensional and private, making centralized analysis infeasible. Each client has proprietary models that cannot be modified, creating a need for privacy-preserving causal inference across clients.
Method: Federated framework where each client maps high-dimensional observations to low-dimensional latent states disentangling intrinsic dynamics from control-driven influences. A central server estimates global state-transition and control structure, enabling decentralized counterfactual reasoning through compact latent state exchanges.
Result: Proven convergence to centralized oracle with privacy guarantees. Experiments demonstrate scalability and accurate cross-client counterfactual inference on synthetic and real-world industrial control datasets.
Conclusion: The framework enables privacy-preserving causal inference across interdependent industrial systems while respecting data privacy and proprietary model constraints, with proven theoretical guarantees and practical effectiveness.
Abstract: Networks of interdependent industrial assets (clients) are tightly coupled through physical processes and control inputs, raising a key question: how would the output of one client change if another client were operated differently? This is difficult to answer because client-specific data are high-dimensional and private, making centralization of raw data infeasible. Each client also maintains proprietary local models that cannot be modified. We propose a federated framework for causal representation learning in state-space systems that captures interdependencies among clients under these constraints. Each client maps high-dimensional observations into low-dimensional latent states that disentangle intrinsic dynamics from control-driven influences. A central server estimates the global state-transition and control structure. This enables decentralized counterfactual reasoning where clients predict how outputs would change under alternative control inputs at others while only exchanging compact latent states. We prove convergence to a centralized oracle and provide privacy guarantees. Our experiments demonstrate scalability and accurate cross-client counterfactual inference on synthetic and real-world industrial control system datasets.
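The exchange pattern described above can be pictured with a deliberately simplified linear-Gaussian sketch: clients ship only low-dimensional latent trajectories, the server fits shared transition and control maps, and counterfactuals are rolled out under alternative controls. The linear model, the least-squares fit, and all names are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

def server_fit(latent_trajs, control_trajs):
    """Fit a shared model z_{t+1} = A z_t + B u_t from client latents only."""
    X, Y = [], []
    for Z, U in zip(latent_trajs, control_trajs):
        X.append(np.hstack([Z[:-1], U[:-1]]))    # regressors [z_t, u_t]
        Y.append(Z[1:])                          # targets z_{t+1}
    theta, *_ = np.linalg.lstsq(np.vstack(X), np.vstack(Y), rcond=None)
    d = latent_trajs[0].shape[1]
    return theta[:d].T, theta[d:].T              # (A, B)

def counterfactual(z0, U_alt, A, B):
    """Decentralized 'what if': roll the shared model under other controls."""
    traj = [z0]
    for u in U_alt:
        traj.append(A @ traj[-1] + B @ u)
    return np.array(traj)

# Toy usage with one client whose latents follow known dynamics.
rng = np.random.default_rng(1)
A_true, B_true = 0.9 * np.eye(2), np.array([[1.0], [0.5]])
Z, U = [rng.normal(size=(1, 2))], rng.normal(size=(20, 1))
for u in U[:-1]:
    Z.append(Z[-1] @ A_true.T + u[None, :] @ B_true.T)
A_hat, B_hat = server_fit([np.vstack(Z)], [U])
print(np.round(A_hat, 2))  # ~A_true
```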
[572] Detecting Transportation Mode Using Dense Smartphone GPS Trajectories and Transformer Models
Yuandong Zhang, Othmane Echchabi, Tianshu Feng, Wenyi Zhang, Hsuai-Kai Liao, Charles Chang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.00340 was rate-limited (HTTP 429).
[573] Disentangled Representation Learning through Unsupervised Symmetry Group Discovery
Barthélémy Dang-Nhu, Louis Annabi, Sylvain Argentieri
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.11790 was rate-limited (HTTP 429).
[574] Mechanistic Foundations of Goal-Directed Control
Alma Lago
Main category: cs.LG
TL;DR: Mechanistic interpretability framework extended from sequence prediction to embodied control systems, using infant motor learning as a model to analyze how reactive and prospective control strategies emerge during sensorimotor development.
Details
Motivation: Mechanistic interpretability has been successful in analyzing transformer circuits for sequence prediction, but has not yet been applied to embodied control systems. The paper aims to extend this framework to sensorimotor-cognitive development, using infant motor learning as a model system.
Method: Extends the mechanistic interpretability framework to embodied control systems, analyzing learned gating mechanisms and their convergence toward uncertainty thresholds. Studies the context window parameter k as critical for circuit formation and examines phase transitions in arbitration gates.
Result: Foundational inductive biases give rise to causal control circuits whose learned gating mechanisms converge to uncertainty thresholds. Identifies the context window k as the critical parameter: below k ≤ 4, arbitration cannot form; above k ≥ 8, gate confidence scales as log k. Reveals a phase transition in the arbitration gate and task-demand-dependent route arbitration.
Conclusion: Provides mechanistic account of how reactive and prospective control strategies emerge and compete during learning. Sharpens mechanistic accounts of cognitive development and offers principled guidance for designing interpretable embodied agents.
Abstract: Mechanistic interpretability has transformed the analysis of transformer circuits by decomposing model behavior into competing algorithms, identifying phase transitions during training, and deriving closed-form predictions for when and why strategies shift. However, this program has remained largely confined to sequence-prediction architectures, leaving embodied control systems without comparable mechanistic accounts. Here we extend this framework to sensorimotor-cognitive development, using infant motor learning as a model system. We show that foundational inductive biases give rise to causal control circuits, with learned gating mechanisms converging toward theoretically motivated uncertainty thresholds. The resulting dynamics reveal a clean phase transition in the arbitration gate whose commitment behavior is well described by a closed-form exponential moving-average surrogate. We identify context window k as the critical parameter governing circuit formation: below a minimum threshold (k ≤ 4) the arbitration mechanism cannot form; above it (k ≥ 8), gate confidence scales asymptotically as log k. A two-dimensional phase diagram further reveals task-demand-dependent route arbitration consistent with the prediction that prospective execution becomes advantageous only when prediction error remains within the task tolerance window. Together, these results provide a mechanistic account of how reactive and prospective control strategies emerge and compete during learning. More broadly, this work sharpens mechanistic accounts of cognitive development and provides principled guidance for the design of interpretable embodied agents.
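The closed-form EMA surrogate and the role of the context window k can be illustrated with a toy gate. The update rule, the alpha = 1/k coupling, and the logistic readout are our assumptions for the sketch, not the paper's exact surrogate.

```python
import numpy as np

def gate_trace(errors, k=8):
    """EMA surrogate of an arbitration gate: confidence is a logistic readout
    of smoothed prediction error, with memory tied to the context window k."""
    alpha, ema, conf = 1.0 / k, 0.0, []
    for e in errors:
        ema = (1 - alpha) * ema + alpha * e      # closed-form EMA update
        conf.append(1.0 / (1.0 + np.exp(ema)))   # low error -> high confidence
    return np.array(conf)

# Larger k averages out error noise, so the gate's commitment is steadier;
# the paper reports the stronger result that confidence scales as log k.
rng = np.random.default_rng(0)
errors = 0.2 + 0.5 * rng.normal(size=200)
for k in (2, 8, 32):
    print(k, round(float(gate_trace(errors, k=k)[-20:].std()), 4))
```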
[575] Deep learning and the rate of approximation by flows
Jingpu Cheng, Qianxiao Li, Ting Lin, Zuowei Shen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.15363 was rate-limited (HTTP 429).
[576] Trajectory-Optimized Time Reparameterization for Learning-Compatible Reduced-Order Modeling of Stiff Dynamical Systems
Joe Standridge, Daniel Livescu, Paul Cizmas
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.16583 was rate-limited (HTTP 429).
[577] A Novel Single-Layer Quantum Neural Network for Approximate SRBB-Based Unitary Synthesis
Giacomo Belli, Marco Mordacci, Michele Amoretti
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2412.03083 was rate-limited (HTTP 429).
[578] How PC-based Methods Err: Towards Better Reporting of Assumption Violations and Small Sample Errors
Sofia Faltenbacher, Jonas Wahl, Rebecca Herman, Jakob Runge
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2502.14719 was rate-limited (HTTP 429).
[579] Optimization over Trained (and Sparse) Neural Networks: A Surrogate within a Surrogate
Hung Pham, Aiden Ren, Ibrahim Tahir, Jiatai Tong, Thiago Serra
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.01985 was rate-limited (HTTP 429).
[580] Optimizing Binary and Ternary Neural Network Inference on RRAM Crossbars using CIM-Explorer
Rebecca Pelke, José Cubero-Cascante, Nils Bosbach, Niklas Degener, Florian Idrizi, Lennart M. Reimann, Jan Moritz Joseph, Rainer Leupers
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.14303 was rate-limited (HTTP 429).
[581] Statistical Inference for Online Algorithms
Selina Carter, Arun K Kuchibhotla
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.17300 was rate-limited (HTTP 429).
[582] Code Roulette: How Prompt Variability Affects LLM Code Generation
Andrei Paleyes, Radzim Sendyka, Diana Robinson, Christian Cabrera, Neil D. Lawrence
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.10204 was rate-limited (HTTP 429).
[583] Decadal sink-source shifts of forest aboveground carbon since 1988
Zhen Qian, Sebastian Bathiany, Teng Liu, Lana L. Blaschke, Hoong Chen Teo, Niklas Boers
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.11879 was rate-limited (HTTP 429).
[584] Hebbian Physics Networks: A Self-Organizing Computational Architecture Based on Local Physical Laws
Gunjan Auti, Hirofumi Daiguji, Gouhei Tanaka
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2507.00641 was rate-limited (HTTP 429).
[585] From Street Form to Spatial Justice: Explaining Urban Exercise Inequality via a Triadic SHAP-Informed Framework
Minwei Zhao, Guosheng Yang, Zhuoni Zhang, Filip Biljecki, Hanzhi Zu, Cai Wu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2507.03570 was rate-limited (HTTP 429).
[586] Robust estimation of heterogeneous treatment effects in randomized trials leveraging external data
Rickard Karlsson, Piersilvio De Bartolomeis, Issa J. Dahabreh, Jesse H. Krijthe
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2507.03681 was rate-limited (HTTP 429).
[587] Exact Generalisation Error Exposes Benchmarks Skew Graph Neural Networks Success (or Failure)
Nil Ayday, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.10337 was rate-limited (HTTP 429).
[588] On the identifiability of causal graphs with multiple environments
Francesco Montagna
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.13583 was rate-limited (HTTP 429).
[589] Learning Time-Varying Graphs from Incomplete Graph Signals
Chuansen Peng, Xiaojing Shen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.17903 was rate-limited (HTTP 429).
[590] Computing Pure-Strategy Nash Equilibria in a Two-Party Policy Competition: Existence and Algorithmic Approaches
Chuang-Chieh Lin, Chi-Jen Lu, Po-An Chen, Chih-Chieh Hung
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.22552 was rate-limited (HTTP 429).
[591] Generative Adversarial Networks for Resource State Generation
Shahbaz Shaik, Sourav Chatterjee, Sayantan Pramanik, Indranil Chakrabarty
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.13708 was rate-limited (HTTP 429).
[592] Learning the Intrinsic Dimensionality of Fermi-Pasta-Ulam-Tsingou Trajectories: A Nonlinear Approach using a Deep Autoencoder Model
Gionni Marchetti
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.19567 was rate-limited (HTTP 429).
[593] Lyapunov Constrained Soft Actor-Critic (LC-SAC) using Koopman Operator Theory for Quadrotor Trajectory Tracking
Dhruv S. Kushwaha, Zoleikha A. Biron
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.04132 was rate-limited (HTTP 429).
[594] Beware Untrusted Simulators – Reward-Free Backdoor Attacks in Reinforcement Learning
Ethan Rathbun, Wo Wei Lin, Alina Oprea, Christopher Amato
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.05089 was rate-limited (HTTP 429).
[595] Unlearnable phases of matter
Tarun Advaith Kumar, Yijian Zou, Amir-Reza Negari, Roger G. Melko, Timothy H. Hsieh
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.11262 was rate-limited (HTTP 429).
[596] Dual Space Preconditioning for Gradient Descent in the Overparameterized Regime
Reza Ghane, Danil Akhtiamov, Babak Hassibi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.10485 was rate-limited (HTTP 429).
[597] Solving physics-constrained inverse problems with conditional flow matching
Agnimitra Dasgupta, Ali Fardisi, Mehrnegar Aminy, Brianna Binder, Bryan Shaddy, Assad Oberai
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.14135 was rate-limited (HTTP 429).
[598] AR-Flow VAE: A Structured Autoregressive Flow Prior Variational Autoencoder for Unsupervised Blind Source Separation
Yuan-Hao Wei, Fu-Hao Deng, Lin-Yong Cui, Yan-Jie Sun
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.14441 was rate-limited (HTTP 429).
[599] Neural Pushforward Samplers for the Fokker-Planck Equation on Embedded Riemannian Manifolds
Andrew Qing He, Wei Cai
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.16239 was rate-limited (HTTP 429).
cs.MA
[600] TerraLingua: Emergence and Analysis of Open-endedness in LLM Ecologies
Giuseppe Paolo, Jamieson Warner, Hormoz Shahrzad, Babak Hodjat, Risto Miikkulainen, Elliot Meyerson
Main category: cs.MA
TL;DR: TerraLingua is a persistent multi-agent ecology with resource constraints and limited lifespans that enables study of open-ended dynamics, cultural accumulation, and social organization in AI populations.
Details
Motivation: To understand how autonomous agents coordinate, form institutions, and accumulate shared culture in real-world digital ecosystems, addressing both scientific and practical priorities for guiding agentic populations toward socially beneficial outcomes.
Method: Introduces TerraLingua, a persistent multi-agent ecology with resource constraints and limited agent lifespans, where agents create artifacts that persist beyond individuals. Uses an AI Anthropologist to systematically analyze agent behavior, group structure, and artifact evolution across experimental conditions.
Result: Reveals emergence of cooperative norms, division of labor, governance attempts, and branching artifact lineages consistent with cumulative cultural processes. Divergent outcomes across runs can be traced to specific innovations and organizational structures.
Conclusion: TerraLingua provides a platform for characterizing mechanisms of cumulative culture and social organization in artificial populations, serving as a foundation for guiding real-world agentic populations to socially beneficial outcomes.
Abstract: As autonomous agents increasingly operate in real-world digital ecosystems, understanding how they coordinate, form institutions, and accumulate shared culture becomes both a scientific and practical priority. This paper introduces TerraLingua, a persistent multi-agent ecology designed to study open-ended dynamics in such systems. Unlike prior large language model simulations with static or consequence-free environments, TerraLingua imposes resource constraints and limited lifespans for the agents. As a result, agents create artifacts that persist beyond individuals, shaping future interactions and selection pressures. To characterize the dynamics, an AI Anthropologist systematically analyzes agent behavior, group structure, and artifact evolution. Across experimental conditions, the results reveal the emergence of cooperative norms, division of labor, governance attempts, and branching artifact lineages consistent with cumulative cultural processes. Divergent outcomes across experimental runs can be traced back to specific innovations and organizational structures. TerraLingua thus provides a platform for characterizing the mechanisms of cumulative culture and social organization in artificial populations, and can serve as a foundation for guiding real-world agentic populations to socially beneficial outcomes.
[601] Impacts of Electric Vehicle Charging Regimes and Infrastructure Deployments on System Performance: An Agent-Based Study
Jiahua Hu, Hai L. Vu, Wynita Griggs, Hao Wang
Main category: cs.MA
TL;DR: Agent-based modeling of EV charging infrastructure planning in Melbourne, comparing optimization-based vs utilization-refined deployment strategies across different charging regimes to minimize total system costs.
Details
Motivation: The rapid growth of electric vehicles requires effective charging infrastructure planning that considers both deployment costs and user charging behavior. Different charging regimes (destination vs en-route) have distinct power requirements and can lead to substantially different infrastructure outcomes.
Method: Uses agent-based modeling framework to generate trajectory-level latent public charging demand in Melbourne metropolitan area. Evaluates two deployment strategies: optimization-based approach and utilization-refined approach across different infrastructure layouts under three charging regimes.
Result: Utilization-refined deployments reduce total system cost (infrastructure + user charging costs), with most significant improvement under combined charging regime. Effective allocation of AC slow chargers reshapes destination charging behavior, reducing unnecessary reliance on en-route charging and lowering detour costs.
Conclusion: The interaction between destination and en-route charging regimes highlights behavioral linkages and demonstrates the importance of accounting for user response and multiple charging regimes in charging infrastructure planning.
Abstract: The rapid growth of electric vehicles (EVs) requires more effective charging infrastructure planning. Infrastructure layout not only determines deployment cost, but also reshapes charging behavior and influences overall system performance. In addition, destination charging and en-route charging represent distinct charging regimes associated with different power requirements, which may lead to substantially different infrastructure deployment outcomes. This study applies an agent-based modeling framework to generate trajectory-level latent public charging demand under three charging regimes based on a synthetic representation of the Melbourne (Australia) metropolitan area. Two deployment strategies, an optimization-based approach and a utilization-refined approach, are evaluated across different infrastructure layouts. Results show that utilization-refined deployments reduce total system cost, accounting for both infrastructure deployment cost and user generalized charging cost, with the most significant improvement observed under the combined charging regime. In particular, a more effective allocation of AC slow chargers reshapes destination charging behavior, which in turn reduces unnecessary reliance on en-route charging and lowers detour costs associated with en-route charging. This interaction highlights the behavioral linkage between destination and en-route charging regimes and demonstrates the importance of accounting for user response and multiple charging regimes in charging infrastructure planning.
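The objective the two deployment strategies are compared on is a total system cost combining infrastructure deployment cost with users' generalized charging cost. A toy version under invented unit costs and a crude capacity model is sketched below; it is only meant to show the trade-off a utilization refinement exploits.

```python
# Toy total system cost: infrastructure cost plus a detour penalty for
# demand a site cannot serve locally. All numbers are illustrative.

def total_system_cost(layout, demand, unit_cost=2000.0, detour_cost=5.0,
                      sessions_per_charger=10):
    infra = unit_cost * sum(layout.values())
    unserved = sum(
        max(0, demand[s] - sessions_per_charger * layout.get(s, 0))
        for s in demand
    )
    return infra + detour_cost * unserved  # deployment + user detour cost

# A utilization refinement would shift chargers toward high-demand sites
# while the marginal detour saving exceeds the marginal deployment cost.
print(total_system_cost({"cbd": 4, "suburb": 1}, {"cbd": 30, "suburb": 25}))
```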
[602] Ablation Study of a Fairness Auditing Agentic System for Bias Mitigation in Early-Onset Colorectal Cancer Detection
Amalia Ionescu, Jose Guadalupe Hernandez, Jui-Hsuan Chang, Emily F. Wong, Paul Wang, Jason H. Moore, Tiffani J. Bright
Main category: cs.MA
TL;DR: AI agent system with two specialized agents (Domain Expert and Fairness Consultant) helps audit clinical ML models for fairness in colorectal cancer, showing RAG-enhanced agents perform best at identifying disparities.
Details
Motivation: Clinical AI systems can perpetuate algorithmic bias and safety risks due to limited oversight and domain expertise, particularly in conditions like early-onset colorectal cancer which has documented demographic disparities.
Method: Two-agent architecture: Domain Expert Agent synthesizes literature on EO-CRC disparities, and Fairness Consultant Agent recommends sensitive attributes and fairness metrics. Ablation study comparing three Ollama LLMs (8B, 20B, 120B parameters) across three configurations: pretrained LLM-only, Agent without RAG, and Agent with RAG.
Result: Agent with RAG achieved highest semantic similarity to expert-derived reference statements across all models, particularly for disparity identification, suggesting retrieval-augmented agentic systems can help scale fairness auditing in clinical AI.
Conclusion: Agentic AI systems with retrieval capabilities show promise for scaling fairness auditing in clinical machine learning applications, potentially addressing algorithmic bias in healthcare settings.
Abstract: Artificial intelligence (AI) is increasingly used in clinical settings, yet limited oversight and domain expertise can allow algorithmic bias and safety risks to persist. This study evaluates whether an agentic AI system can support auditing biomedical machine learning models for fairness in early-onset colorectal cancer (EO-CRC), a condition with documented demographic disparities. We implemented a two-agent architecture consisting of a Domain Expert Agent that synthesizes literature on EO-CRC disparities and a Fairness Consultant Agent that recommends sensitive attributes and fairness metrics for model evaluation. An ablation study compared three Ollama large language models (8B, 20B, and 120B parameters) across three configurations: pretrained LLM-only, Agent without Retrieval-Augmented Generation (RAG), and Agent with RAG. Across models, the Agent with RAG achieved the highest semantic similarity to expert-derived reference statements, particularly for disparity identification, suggesting agentic systems with retrieval may help scale fairness auditing in clinical AI.
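The two-agent flow is easy to sketch; `llm` and `retrieve` below are hypothetical stand-ins for an Ollama chat call and a RAG vector-store lookup, and the prompts only paraphrase the agents' roles as described in the abstract.

```python
def domain_expert(condition, llm, retrieve=None):
    """Synthesizes literature on disparities; RAG context is optional."""
    context = "\n".join(retrieve(f"{condition} demographic disparities")) if retrieve else ""
    return llm(f"Summarize documented disparities for {condition}.\n{context}")

def fairness_consultant(summary, llm):
    """Recommends sensitive attributes and fairness metrics."""
    return llm("Given these disparities, recommend sensitive attributes and "
               f"fairness metrics for model evaluation:\n{summary}")

def audit(condition, llm, retrieve=None):
    # Passing `retrieve` gives the Agent-with-RAG configuration; the ablations
    # drop retrieval, or drop the agent scaffolding and query the LLM alone.
    return fairness_consultant(domain_expert(condition, llm, retrieve), llm)

# Stubbed usage (no model required):
print(audit("early-onset colorectal cancer", llm=lambda p: f"[LLM] {p[:60]}..."))
```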
[603] Agentic Cognitive Profiling: Realigning Automated Alzheimer’s Disease Detection with Clinical Construct Validity
Jiawen Kang, Kun Li, Dongrui Han, Jinchao Li, Junan Li, Lingwei Meng, Xixin Wu, Helen Meng
Main category: cs.MA
TL;DR: Agentic Cognitive Profiling (ACP) framework uses specialized LLM agents to decompose clinical cognitive assessments into atomic tasks, extracting verifiable scoring primitives for Alzheimer’s Disease screening with improved interpretability and construct validity.
Details
Motivation: Current automated AD screening follows inductive pattern recognition that sacrifices clinical protocol construct validity for statistical shortcuts. The paper aims to realign automated screening with clinical protocol logic across multiple cognitive domains.
Method: Proposes Agentic Cognitive Profiling (ACP) framework that decomposes standardized assessments into atomic cognitive tasks, orchestrates specialized LLM agents to extract verifiable scoring primitives, and decouples semantic understanding from measurement by delegating quantification to deterministic function calling.
Result: Achieves 90.5% score match rate in task examination and 85.3% accuracy in AD prediction on a clinically-annotated corpus of 402 participants across eight structured cognitive tasks, surpassing popular baselines while generating interpretable cognitive profiles.
Conclusion: Demonstrates that construct validity and predictive performance need not be traded off, charting a path toward AD screening systems that explain rather than merely predict, using LLM agents for interpretable cognitive assessment.
Abstract: Automated Alzheimer’s Disease (AD) screening has predominantly followed the inductive paradigm of pattern recognition, which directly maps the input signal to the outcome label. This paradigm sacrifices construct validity of clinical protocol for statistical shortcuts. This paper proposes Agentic Cognitive Profiling (ACP), an agentic framework that realigns automated screening with clinical protocol logic across multiple cognitive domains. Rather than learning opaque mappings from transcripts to labels, the framework decomposes standardized assessments into atomic cognitive tasks and orchestrates specialized LLM agents to extract verifiable scoring primitives. Central to our design is decoupling semantic understanding from measurement by delegating all quantification to deterministic function calling, thereby mitigating hallucination and restoring construct validity. Unlike popular datasets that typically comprise around a hundred participants under a single task, we evaluate on a clinically-annotated corpus of 402 participants across eight structured cognitive tasks spanning multiple cognitive domains. The framework achieves 90.5% score match rate in task examination and 85.3% accuracy in AD prediction, surpassing popular baselines while generating interpretable cognitive profiles grounded in behavioral evidence. This work demonstrates that construct validity and predictive performance need not be traded off, charting a path toward AD screening systems that explain rather than merely predict.
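The key decoupling, semantic extraction by the LLM versus deterministic quantification by function calls, can be shown with a toy scoring primitive; the task and function are our invention, not one of the paper's eight tasks.

```python
def score_word_recall(target_words, recalled_words):
    """Deterministic scoring primitive: one point per target word recalled.
    The LLM agent only extracts the recalled words from the transcript and
    invokes this tool; it never produces the numeric score itself."""
    recalled = {w.strip().lower() for w in recalled_words}
    return sum(1 for w in target_words if w.lower() in recalled)

# Registered as a function-calling tool, the model emits a call such as
# {"name": "score_word_recall", "arguments": {...}} with extracted word lists.
assert score_word_recall(["apple", "penny", "table"], ["Apple", "chair"]) == 1
```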
[604] In Trust We Survive: Emergent Trust Learning
Qianpu Chen, Giulio Barbero, Mike Preuss, Derya Soydaner
Main category: cs.MA
TL;DR: ETL is a lightweight trust-based algorithm that enables AI agents to achieve cooperation in competitive environments with shared resources using minimal computational overhead.
Details
Motivation: The paper addresses the challenge of enabling AI agents to cooperate in competitive game environments with shared resources, where traditional approaches often fail to balance individual rewards with collective resource sustainability.
Method: Emergent Trust Learning (ETL) uses a compact internal trust state that modulates memory, exploration, and action selection. It requires only individual rewards and local observations, with minimal computational and communication overhead.
Result: ETL reduces conflicts and prevents long-term resource depletion in grid-based resource worlds, sustains high survival rates in hierarchical Tower environments with social dilemmas, and maintains cooperation while avoiding exploitation in Iterated Prisoner’s Dilemma.
Conclusion: ETL provides an effective, lightweight approach for enabling cooperation in competitive multi-agent environments with shared resources, demonstrating robustness across different game scenarios.
Abstract: We introduce Emergent Trust Learning (ETL), a lightweight, trust-based control algorithm that can be plugged into existing AI agents, enabling them to reach cooperation in competitive game environments under shared resources. Each agent maintains a compact internal trust state, which modulates memory, exploration, and action selection. ETL requires only individual rewards and local observations and incurs negligible computational and communication overhead. We evaluate ETL in three environments: In a grid-based resource world, trust-based agents reduce conflicts and prevent long-term resource depletion while achieving competitive individual returns. In a hierarchical Tower environment with strong social dilemmas and randomised floor assignments, ETL sustains high survival rates and recovers cooperation even after extended phases of enforced greed. In the Iterated Prisoner’s Dilemma, the algorithm generalises to a strategic meta-game, maintaining cooperation with reciprocal opponents while avoiding long-term exploitation by defectors. Code will be released upon publication.
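Since the code is not yet released, here is a minimal sketch of the trust-state idea with update rules of our own invention: trust rises after cooperative outcomes, decays after conflicts, and in turn modulates exploration.

```python
# Illustrative guess at a compact trust state in the spirit of ETL; the
# paper's exact update rules and modulation are not public. A trusting
# agent explores less aggressively.
import random

class TrustAgent:
    def __init__(self, actions, lr=0.1, trust=0.5):
        self.q = {a: 0.0 for a in actions}
        self.trust = trust               # compact internal trust state in [0, 1]
        self.lr = lr

    def act(self):
        eps = 0.3 * (1.0 - self.trust)   # more trust -> less exploration
        if random.random() < eps:
            return random.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, action, reward, conflict: bool):
        self.q[action] += self.lr * (reward - self.q[action])
        delta = -0.2 if conflict else 0.05
        self.trust = min(1.0, max(0.0, self.trust + delta))

agent = TrustAgent(actions=["cooperate", "defect"])
a = agent.act()
agent.update(a, reward=1.0, conflict=False)   # cooperative round raises trust
```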
[605] Scalable UAV Multi-Hop Networking via Multi-Agent Reinforcement Learning with Large Language Models
Yanggang Xu, Jirong Zha, Weijie Hong, Xiangmin Yi, Geng Chen, Jianfeng Zheng, Chen-Chun Hsia, Xinlei Chen
Main category: cs.MA
TL;DR: MRLMN integrates multi-agent reinforcement learning with large language models to optimize UAV emergency communication networks through grouping strategies, reward decomposition, behavioral constraints, and knowledge distillation.
Details
Motivation: In disaster scenarios, establishing robust emergency communication networks using UAVs is critical but challenging due to algorithmic scalability limitations and vast exploration spaces for coordinated decision-making in large-scale dynamic environments.
Method: Proposes MRLMN framework combining MARL and LLMs with grouping strategy and reward decomposition for scalability, behavioral constraints for robustness, and knowledge distillation from LLM agents to MARL agents using Hungarian algorithm-based matching for alignment.
Result: Extensive simulations show significant improvements in network performance over MAPPO baseline and other methods, with enhanced coverage and communication quality.
Conclusion: The integration of MARL and LLMs with grouping strategies and knowledge distillation effectively addresses UAV network optimization challenges in emergency scenarios.
Abstract: In disaster scenarios, establishing robust emergency communication networks is critical, and unmanned aerial vehicles (UAVs) offer a promising solution to rapidly restore connectivity. However, organizing UAVs to form multi-hop networks in large-scale dynamic environments presents significant challenges, including limitations in algorithmic scalability and the vast exploration space required for coordinated decision-making. To address these issues, we propose MRLMN, a novel framework that integrates multi-agent reinforcement learning (MARL) and large language models (LLMs) to jointly optimize UAV agents toward achieving optimal networking performance. The framework incorporates a grouping strategy with reward decomposition to enhance algorithmic scalability and balance decision-making across UAVs. In addition, behavioral constraints are applied to selected key UAVs to improve the robustness of the network. Furthermore, the framework integrates LLM agents, leveraging knowledge distillation to transfer their high-level decision-making capabilities to MARL agents. This enhances both the efficiency of exploration and the overall training process. In the distillation module, a Hungarian algorithm-based matching scheme is applied to align the decision outputs of the LLM and MARL agents and define the distillation loss. Extensive simulation results validate the effectiveness of our approach, demonstrating significant improvements in network performance over the MAPPO baseline and other comparison methods, including enhanced coverage and communication quality.
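The Hungarian-matching step in the distillation module has a standard implementation; the sketch below uses SciPy's linear_sum_assignment on toy decision vectors, with Euclidean distance as an assumed cost (the paper's actual cost definition may differ).

```python
# Sketch of the Hungarian-matching step for distillation: LLM-agent and
# MARL-agent decisions are aligned by a minimum-cost assignment, and the
# matched pairs define the distillation targets. Toy data throughout.
import numpy as np
from scipy.optimize import linear_sum_assignment

llm_decisions = np.random.rand(6, 4)   # 6 LLM-guided UAV decision vectors
marl_decisions = np.random.rand(6, 4)  # 6 MARL-agent decision vectors

cost = np.linalg.norm(llm_decisions[:, None] - marl_decisions[None, :], axis=-1)
rows, cols = linear_sum_assignment(cost)   # optimal one-to-one matching
distill_loss = cost[rows, cols].mean()     # loss over matched pairs
```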
[606] Communication to Completion: Modeling Collaborative Workflows with Intelligent Multi-Agent Communication
Yiming Lu, Xun Wang, Simin Ma, Shujian Liu, Sathish Reddy Indurthi, Song Wang, Haoyun Deng, Fei Liu, Kaiqiang Song
Main category: cs.MA
TL;DR: A framework called Communication to Completion (C2C) that models communication costs in multi-agent LLM systems, introducing Alignment Factor to quantify task understanding efficiency, showing 40%+ efficiency gains in software engineering workflows.
Details
Motivation: Current multi-agent LLM systems treat communication as instantaneous and free, overlooking real-world collaboration costs, which is a fundamental constraint in teamwork.
Method: Proposed C2C framework that explicitly models communication as a constrained resource with temporal costs, introducing Alignment Factor (AF) metric inspired by Shared Mental Models to quantify link between task understanding and work efficiency.
Result: Experiments on 15 software engineering workflows across three complexity tiers with 5-17 agents showed cost-aware strategies achieve over 40% higher efficiency than unconstrained interaction. Emergent coordination patterns included hub-and-spoke topologies, strategic escalation from async to sync channels, and prioritization of high-value help requests.
Conclusion: The study moves beyond simple agent construction to offer theoretical foundation for quantifying and optimizing collaboration dynamics in future digital workplaces, with patterns consistent across multiple frontier LLMs.
Abstract: Multi-agent LLM systems have demonstrated impressive capabilities in complex collaborative tasks, yet most frameworks treat communication as instantaneous and free, overlooking a fundamental constraint in real-world teamwork: collaboration cost. We propose a scalable framework implemented via Communication to Completion (C2C), which explicitly models communication as a constrained resource with realistic temporal costs. We introduce the Alignment Factor (AF), a dynamic metric inspired by Shared Mental Models, to quantify the link between task understanding and work efficiency. Through experiments on 15 software engineering workflows spanning three complexity tiers and team sizes from 5 to 17 agents, we demonstrate that cost-aware strategies achieve over 40% higher efficiency compared to unconstrained interaction. Our analysis reveals emergent coordination patterns: agents naturally adopt manager-centric hub-and-spoke topologies, strategically escalate from asynchronous to synchronous channels based on complexity, and prioritize high-value help requests. These patterns remain consistent across multiple frontier models (GPT-5.2, Claude Sonnet 4.5, Gemini 2.5 Pro). This study moves beyond simple agent construction, offering a theoretical foundation for quantifying and optimizing the dynamics of collaboration in future digital workplaces.
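A toy rendering of the core C2C trade-off, under assumed semantics for the Alignment Factor and channel costs (all numbers illustrative, not the paper's):

```python
# Communication takes time, so an agent should message a teammate only when
# the expected alignment gain outweighs the temporal cost of the channel.
# The AF semantics and cost constants below are invented for illustration.
CHANNEL_COST = {"async": 1.0, "sync": 3.0}      # time units per message

def should_communicate(af: float, complexity: float, channel: str) -> bool:
    expected_gain = (1.0 - af) * complexity     # low alignment + hard task -> talk
    return expected_gain > CHANNEL_COST[channel]

af = 0.4                                        # current alignment factor in [0, 1]
print(should_communicate(af, complexity=8.0, channel="async"))  # True
print(should_communicate(af, complexity=2.0, channel="sync"))   # False
```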
[607] Adaptive Accountability in Networked MAS: Tracing and Mitigating Emergent Norms at Scale
Saad Alqithami
Main category: cs.MA
TL;DR: AAF is a runtime framework for multi-agent systems that detects norm violations, attributes responsibility, and applies interventions to steer systems toward compliant behavior while maintaining bounded compromise guarantees.
Details
Motivation: Large-scale networked multi-agent systems in critical infrastructure can develop undesirable emergent behaviors like collusion, resource hoarding, and unfairness, requiring automated mechanisms to detect and correct these norm violations.
Method: AAF combines: (1) cryptographically verifiable interaction provenance recording, (2) distributional change point detection in streaming traces, (3) responsibility attribution via causal influence graphs, and (4) cost-bounded interventions (reward shaping and policy patching).
Result: AAF reduces compromise ratio by median 11.9% vs PPO baseline in 96% of regimes, maintains social welfare (0.4% median change), detects violations with 71-step median delay, and achieves 0.97 attribution accuracy at 10% Byzantine rate.
Conclusion: AAF provides a practical framework for maintaining normative compliance in multi-agent systems with provable bounded-compromise guarantees and effective intervention mechanisms.
Abstract: Large-scale networked multi-agent systems increasingly underpin critical infrastructure, yet their collective behavior can drift toward undesirable emergent norms such as collusion, resource hoarding, and implicit unfairness. We present the Adaptive Accountability Framework (AAF), an end-to-end runtime layer that (i) records cryptographically verifiable interaction provenance, (ii) detects distributional change points in streaming traces, (iii) attributes responsibility via a causal influence graph, and (iv) applies cost-bounded interventions (reward shaping and targeted policy patching) to steer the system back toward compliant behavior. We establish a bounded-compromise guarantee: if the expected cost of intervention exceeds an adversary’s expected payoff, the long-run fraction of compromised interactions converges to a value strictly below one. We evaluate AAF in a large-scale factorial simulation suite (87,480 runs across two tasks; up to 100 agents plus a 500-agent scaling sweep; full and partial observability; Byzantine rates up to 10%; 10 seeds per regime). Across 324 regimes, AAF lowers the executed compromise ratio relative to a Proximal Policy Optimization baseline in 96% of regimes (median relative reduction 11.9%) while preserving social welfare (median change 0.4%). Under adversarial injections, AAF detects norm violations with a median delay of 71 steps (interquartile range 39-177) and achieves a mean top-ranked attribution accuracy of 0.97 at 10% Byzantine rate.
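The abstract names the detector component only as "distributional change point detection in streaming traces"; a one-sided CUSUM over a scalar compliance statistic is a standard stand-in for that kind of component:

```python
# One-sided CUSUM as an illustrative stand-in for AAF's streaming change
# point detector (the paper's actual detector is not specified here).
import numpy as np

def cusum_alarm(stream, target_mean, slack=0.5, threshold=5.0):
    """Return the first index at which the stream drifts above target_mean."""
    s = 0.0
    for t, x in enumerate(stream):
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold:
            return t          # detection delay = t minus true change point
    return None

rng = np.random.default_rng(0)
trace = np.concatenate([rng.normal(0, 1, 200), rng.normal(2, 1, 200)])
print(cusum_alarm(trace, target_mean=0.0))   # alarms shortly after index 200
```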
[608] Forecast-Aware Cooperative Planning on Temporal Graphs under Stochastic Adversarial Risk
Manshi Limbu, Xuan Wang, Gregory J. Stein, Daigo Shishika, Xuesu Xiao
Main category: cs.MA
TL;DR: A forecast-aware cooperative planning framework for multi-robot teams that integrates stochastic risk forecasting with anticipatory support allocation on temporal graphs to handle evolving traversal risks.
Details
Motivation: Multi-robot missions in dynamic environments face evolving risks from adversary patrols or shifting hazards. Existing support coordination frameworks assume static risk landscapes and fail to account for predictable temporal trends in risk evolution, limiting their effectiveness.
Method: Models adversary dynamics as a first-order Markov stay-move process over graph edges, propagates edge-occupancy probabilities forward in time to generate time-indexed edge-risk forecasts, uses forecasts to guide proactive allocation of support positions to forecasted risky edges, and informs joint robot path planning.
Result: The approach consistently reduces total expected team cost compared to non-anticipatory baselines and approaches the performance of an oracle planner.
Conclusion: Forecast-aware cooperative planning with stochastic risk forecasting enables effective support coordination in dynamic environments with evolving risks, significantly improving multi-robot mission performance.
Abstract: Cooperative multi-robot missions often require teams of robots to traverse environments where traversal risk evolves due to adversary patrols or shifting hazards with stochastic dynamics. While support coordination, where robots assist teammates in traversing risky regions, can significantly reduce mission costs, its effectiveness depends on the team’s ability to anticipate future risk. Existing support-based frameworks assume static risk landscapes and therefore fail to account for predictable temporal trends in risk evolution. We propose a forecast-aware cooperative planning framework that integrates stochastic risk forecasting with anticipatory support allocation on temporal graphs. By modeling adversary dynamics as a first-order Markov stay-move process over graph edges, we propagate the resulting edge-occupancy probabilities forward in time to generate time-indexed edge-risk forecasts. These forecasts guide the proactive allocation of support positions to forecasted risky edges for effective support coordination, while also informing joint robot path planning. Experimental results demonstrate that our approach consistently reduces total expected team cost compared to non-anticipatory baselines, approaching the performance of an oracle planner.
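The forecasting step is concrete enough to sketch: under a first-order stay-move model, time-indexed edge risk is obtained by repeatedly applying the transition matrix to the current occupancy belief. The toy graph and stay probability below are assumptions; a real model would restrict moves to adjacent edges.

```python
# Edge occupancy evolves as a first-order Markov stay-move process; risk
# forecasts come from repeated application of the transition matrix.
import numpy as np

n_edges, p_stay = 4, 0.7
# Row-stochastic transitions: stay with p_stay, else move uniformly to one
# of the other edges (toy; real adjacency constraints omitted).
P = np.full((n_edges, n_edges), (1 - p_stay) / (n_edges - 1))
np.fill_diagonal(P, p_stay)

belief = np.array([1.0, 0.0, 0.0, 0.0])   # adversary currently on edge 0
forecasts = []
for t in range(5):
    belief = belief @ P                    # propagate occupancy one step
    forecasts.append(belief.copy())        # time-indexed edge-risk forecast
```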
cs.MM
[609] Amanous: Distribution-Switching for Superhuman Piano Density on Disklavier
Joonhyung Bae
Main category: cs.MM
TL;DR: Amanous is a hardware-aware composition system for Yamaha Disklavier that unifies three traditions of automated piano composition through distribution-switching, with contributions in architecture, hardware abstraction, density analysis, and tempo-canon control.
Details
Motivation: The paper aims to unify three isolated traditions in automated piano composition (Nancarrow's tempo canons, Xenakis's stochastic distributions, and L-system grammars) through a hardware-aware system that can handle superhuman textures while maintaining physical constraints of the Disklavier.
Method: The system uses a four-layer architecture (symbolic, parametric, numeric, physical) with distribution-switching where L-system symbols select distinct distributional regimes. It includes a hardware abstraction layer for velocity-dependent latency and key reset constraints, and a convergence point calculus for tempo-canon geometry control.
Result: The system produces statistically distinct sections with large effect sizes (d = 3.70-5.34), identifies a computational saturation transition at 24-30 notes/s, and successfully operates on physical Disklavier hardware with sub-millisecond precision. The pipeline demonstrates algorithmic self-consistency.
Conclusion: Amanous successfully unifies three composition traditions through distribution-switching and hardware-aware design, providing a computational framework for automated piano composition that bridges macro-temporal structure with micro-level texture while respecting physical hardware constraints.
Abstract: The automated piano enables note densities, polyphony, and register changes far beyond human physical limits, yet the three dominant traditions for composing such textures (Nancarrow’s tempo canons, Xenakis’s stochastic distributions, and L-system grammars) have developed in isolation. This paper presents Amanous, a hardware-aware composition system for Yamaha Disklavier that unifies these methodologies through distribution-switching: L-system symbols select distinct distributional regimes rather than merely modulating parameters within a fixed family. Four contributions are reported. (1) A four-layer architecture (symbolic, parametric, numeric, physical) produces statistically distinct sections with large effect sizes (d = 3.70-5.34), validated by per-layer degradation and ablation experiments. (2) A hardware abstraction layer formalizes velocity-dependent latency and key reset constraints, keeping superhuman textures within the Disklavier’s actuable envelope. (3) A density sweep reveals a computational saturation transition at 24-30 notes/s (bootstrap 95% CI: 23.3-50.0), beyond which single-domain melodic metrics lose discriminative power and cross-domain coupling becomes necessary. (4) A convergence point calculus operationalizes tempo-canon geometry as a control interface, enabling convergence events to trigger distribution switches linking macro-temporal structure to micro-level texture. All results are computational; a psychoacoustic validation protocol is proposed for future work. The pipeline has been deployed on a physical Disklavier, demonstrating algorithmic self-consistency and sub-millisecond software precision. Supplementary materials (Excerpts 1-4): https://www.amanous.xyz. Source code: https://github.com/joonhyungbae/Amanous.
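Distribution-switching itself is easy to illustrate: an L-system expands a symbol string, and each symbol selects a different distributional regime rather than re-parameterising a single family. The rules and regimes below are invented for illustration, not taken from the Amanous source.

```python
# Toy distribution-switching: L-system symbols select distinct samplers.
import numpy as np

rng = np.random.default_rng(7)
RULES = {"A": "AB", "B": "A"}                    # toy L-system rewrite rules
REGIMES = {                                       # symbol -> pitch sampler
    "A": lambda n: rng.poisson(60, n),            # clustered around middle C
    "B": lambda n: rng.integers(21, 109, n),      # uniform over the keyboard
}

def expand(axiom: str, depth: int) -> str:
    s = axiom
    for _ in range(depth):
        s = "".join(RULES.get(c, c) for c in s)
    return s

section = expand("A", 5)                          # "ABAAB..." symbol stream
notes = [REGIMES[sym](8) for sym in section]      # one 8-note burst per symbol
```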
[610] Beyond Forced Modality Balance: Intrinsic Information Budgets for Multimodal Learning
Zechang Xiong, Da Li, Kexin Tang, Pengyuan Li, Wenkang Kong, Yulan Hu
Main category: cs.MM
TL;DR: IIBalance: A multimodal learning framework that balances modality contributions using Intrinsic Information Budgets (IIB) to prevent dominant-modality overshadowing and improve complementary cue utilization.
Details
Motivation: Multimodal models often suffer from modality imbalance where stronger, faster-converging modalities overshadow weaker ones, leading to suboptimal performance. Existing methods focus on gradient/loss reweighting but overlook each modality's finite information capacity.
Method: Proposes IIBalance framework with: 1) Task-grounded estimator of each modality’s Intrinsic Information Budget (IIB), 2) Prototype-based relative alignment mechanism anchored by highest-budget modality, 3) Probabilistic gating module for inference that integrates global budgets with sample-level uncertainty.
Result: Experiments on three representative benchmarks show IIBalance consistently outperforms state-of-the-art balancing methods and achieves better utilization of complementary modality cues.
Conclusion: IIBalance effectively addresses modality imbalance by considering intrinsic information budgets, enabling better multimodal learning without forcing weaker modalities to imitate stronger ones.
Abstract: Multimodal models often converge to a dominant-modality solution, in which a stronger, faster-converging modality overshadows weaker ones. This modality imbalance causes suboptimal performance. Existing methods attempt to balance different modalities by reweighting gradients or losses. However, they overlook the fact that each modality has finite information capacity. In this work, we propose IIBalance, a multimodal learning framework that aligns the modality contributions with Intrinsic Information Budgets (IIB). We propose a task-grounded estimator of each modality’s IIB, transforming its capacity into a global prior over modality contributions. Anchored by the highest-budget modality, we design a prototype-based relative alignment mechanism that corrects semantic drift only when weaker modalities deviate from their budgeted potential, rather than forcing imitation. During inference, we propose a probabilistic gating module that integrates the global budgets with sample-level uncertainty to generate calibrated fusion weights. Experiments on three representative benchmarks demonstrate that IIBalance consistently outperforms state-of-the-art balancing methods and achieves better utilization of complementary modality cues. Our code is available at: https://github.com/XiongZechang/IIBalance.
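One plausible instantiation of the inference-time gating (the exact IIBalance combination rule is not reproduced here): combine the global budget prior with per-sample uncertainty into softmax fusion weights.

```python
# Hedged sketch: fuse a global per-modality budget prior with sample-level
# uncertainty into calibrated fusion weights. The combination rule is an
# assumption, not the published IIBalance formula.
import numpy as np

def fusion_weights(budgets, uncertainties, temp=1.0):
    """budgets: global IIB prior per modality; uncertainties: per-sample."""
    logits = np.log(np.asarray(budgets)) - np.asarray(uncertainties) / temp
    w = np.exp(logits - logits.max())
    return w / w.sum()

# Audio has the larger budget but is very uncertain on this sample,
# so the gate shifts weight toward vision.
print(fusion_weights(budgets=[0.7, 0.3], uncertainties=[2.0, 0.2]))
```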
eess.AS
[611] Synthetic Data Domain Adaptation for ASR via LLM-based Text and Phonetic Respelling Augmentation
Natsuo Yamashita, Koichi Nagatsuka, Hiroaki Kokubo, Kota Dohi, Tuan Vu Ho
Main category: eess.AS
TL;DR: LLM-based domain adaptation for ASR using text augmentation with filtering and phonetic respelling to improve robustness on domain-specific data
Details
Motivation: End-to-end ASR systems often degrade on domain-specific data due to scarce in-domain resources, requiring effective domain adaptation methods.
Method: Two-part framework: (1) LLM-based text augmentation pipeline with filtering strategy balancing lexical diversity, perplexity, and domain-term coverage; (2) Phonetic Respelling Augmentation (PRA) that introduces pronunciation variability through LLM-generated orthographic pseudo-spellings.
Result: Experimental results across four domain-specific datasets demonstrate consistent reductions in word error rate, showing improved ASR robustness
Conclusion: Combining domain-specific lexical coverage with realistic pronunciation variation through synthetic data generation significantly improves ASR domain adaptation
Abstract: End-to-end automatic speech recognition often degrades on domain-specific data due to scarce in-domain resources. We propose a synthetic-data-based domain adaptation framework with two contributions: (1) a large language model (LLM)-based text augmentation pipeline with a filtering strategy that balances lexical diversity, perplexity, and domain-term coverage, and (2) phonetic respelling augmentation (PRA), a novel method that introduces pronunciation variability through LLM-generated orthographic pseudo-spellings. Unlike conventional acoustic-level methods such as SpecAugment, PRA provides phonetic diversity before speech synthesis, enabling synthetic speech to better approximate real-world variability. Experimental results across four domain-specific datasets demonstrate consistent reductions in word error rate, confirming that combining domain-specific lexical coverage with realistic pronunciation variation significantly improves ASR robustness.
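The filtering strategy in contribution (1) suggests a simple gate per generated sentence; the sketch below assumes thresholds of our own choosing, with perplexity supplied by some held-out language model.

```python
# Sketch of a synthetic-text filter balancing fluency (perplexity), lexical
# diversity, and domain-term coverage. All thresholds are assumptions, and
# `ppl` would come from an external language model.
def keep_sentence(sent: str, ppl: float, domain_terms: set[str],
                  max_ppl=80.0, min_ttr=0.6, min_coverage=1) -> bool:
    toks = sent.lower().split()
    ttr = len(set(toks)) / max(len(toks), 1)           # type-token ratio
    coverage = sum(t in domain_terms for t in set(toks))
    return ppl <= max_ppl and ttr >= min_ttr and coverage >= min_coverage

terms = {"stent", "catheter", "angioplasty"}
print(keep_sentence("the surgeon placed a stent via catheter", 35.2, terms))
```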
[612] Learnable Pulse Accumulation for On-Device Speech Recognition: How Much Attention Do You Need?
Yakov Pyotr Shkolnikov
Main category: eess.AS
TL;DR: LPA replaces quadratic self-attention with linear learned gating functions in speech transformers, achieving significant speedup with modest accuracy trade-off
Details
Motivation: Self-attention's quadratic complexity limits transformer-based speech models on edge devices; need efficient alternatives that maintain reasonable performance.
Method: Learnable Pulse Accumulator (LPA) replaces key-query dot products with learned gating functions: content-dependent rectangular pulses, periodic windows, and position-dependent basis functions; uses MSE diagnostic sweep to determine per-layer replacement difficulty and ordering.
Result: Replacing 8 of 12 wav2vec2-base layers yields 10.61% WER on LibriSpeech test-clean (+7.24pp over 3.37% baseline) with 3.27x speedup; cross-domain validation on SepFormer shows all 16 intra-chunk attention layers can be replaced without collapse
Conclusion: LPA enables efficient speech transformers for edge devices; depth wall arises from linguistic computation rather than LPA limitation; near-binary gates enable efficient inference on mobile accelerators
Abstract: Self-attention scales quadratically with sequence length, limiting transformer-based speech models on edge devices. We introduce the Learnable Pulse Accumulator (LPA), an O(n) replacement that substitutes key-query dot products with learned gating functions: content-dependent rectangular pulses, periodic windows, and position-dependent basis functions. An MSE diagnostic sweep determines per-layer replacement difficulty and ordering. Replacing 8 of 12 wav2vec2-base layers yields 10.61% word error rate (WER) on LibriSpeech test-clean, +7.24 percentage points (pp) over the 3.37% baseline, with 3.27x speedup at 120s audio on Apple M4 Pro via an optimized MLX inference path. Cross-domain validation on SepFormer speech enhancement shows all 16 intra-chunk attention layers can be replaced without collapse, suggesting the depth wall arises from linguistic computation rather than an LPA limitation. LPA’s near-binary gates at inference enable dense GPU computation with no CPU-GPU synchronization, and all operations map to mobile neural accelerators.
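A rectangular pulse, the first of the three gating functions, can be rendered as two sigmoids forming a soft window, giving O(n) gated accumulation per query instead of an n x n attention matrix. The formulation below is a simplification, not the paper's exact parameterisation.

```python
# Soft rectangular pulse gate: rising and falling sigmoid edges form a
# window, and context is accumulated through the gate in O(n). LPA's
# learned, content-dependent parameters are replaced by fixed toy values.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rect_pulse_gate(n, center, width, sharpness=4.0):
    """Soft rectangular pulse over positions 0..n-1: ~1 inside the window."""
    pos = np.arange(n)
    rise = sigmoid(sharpness * (pos - (center - width / 2)))
    fall = sigmoid(sharpness * ((center + width / 2) - pos))
    return rise * fall                      # O(n), no n x n matrix

x = np.random.randn(16)                     # toy sequence of frame features
gate = rect_pulse_gate(len(x), center=8, width=5)
pooled = (gate * x).sum() / gate.sum()      # gated accumulation of context
```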
[613] Beyond Deep Learning: Speech Segmentation and Phone Classification with Neural Assemblies
Trevor Adelson, Vidhyasaharan Sethu, Ting Dang
Main category: eess.AS
TL;DR: Assembly Calculus-based biologically plausible framework for speech processing using spike patterns and Hebbian learning, achieving competitive results on boundary detection and classification tasks without backpropagation.
Details
Motivation: To develop a biologically grounded alternative to deep learning for speech processing that avoids massive datasets, global backpropagation, and entangled representations, using Assembly Calculus principles instead.
Method: Combines three components: (1) neural encoding converting speech to assembly-compatible spike patterns via probabilistic mel binarization and population-coded MFCCs, (2) multi-area architecture organizing assemblies across hierarchical timescales and classes, and (3) cross-area update schemes for downstream tasks.
Result: Achieves phone boundary detection (F1=0.69) and word boundary detection (F1=0.61) without weight training, and 47.5% accuracy on phone recognition and 45.1% accuracy on command recognition.
Conclusion: Assembly Calculus-based dynamical systems are a viable alternative to deep learning for speech processing, offering biologically plausible mechanisms without backpropagation.
Abstract: Deep learning dominates speech processing but relies on massive datasets, global backpropagation-guided weight updates, and produces entangled representations. Assembly Calculus (AC), which models sparse neuronal assemblies via Hebbian plasticity and winner-take-all competition, offers a biologically grounded alternative, yet prior work focused on discrete symbolic inputs. We introduce an AC-based speech processing framework that operates directly on continuous speech by combining three key contributions: (i) neural encoding that converts speech into assembly-compatible spike patterns using probabilistic mel binarisation and population-coded MFCCs; (ii) a multi-area architecture organising assemblies across hierarchical timescales and classes; and (iii) cross-area update schemes for downstream tasks. Applied to two core tasks of boundary detection and segment classification, our framework detects phone (F1=0.69) and word (F1=0.61) boundaries without any weight training, and achieves 47.5% and 45.1% accuracy on phone and command recognition. These results show that AC-based dynamical systems are a viable alternative to deep learning for speech processing.
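The probabilistic mel binarisation step in contribution (i) is straightforward to sketch: normalised mel energies become per-bin firing probabilities, sampled to Bernoulli spike rasters. The mel front-end itself is omitted here.

```python
# Sketch of probabilistic mel binarisation: normalised mel energies are
# treated as firing probabilities and sampled to 0/1 spike patterns
# compatible with assembly-style sparse activity. Toy input stands in for
# a real mel spectrogram (e.g., from librosa).
import numpy as np

def mel_to_spikes(mel: np.ndarray, rng) -> np.ndarray:
    """mel: (n_mels, n_frames) energies -> binary spike raster, same shape."""
    m = mel - mel.min()
    p = m / (m.max() + 1e-8)                 # per-bin firing probability
    return (rng.random(p.shape) < p).astype(np.uint8)

rng = np.random.default_rng(0)
mel = np.abs(rng.normal(size=(40, 100)))     # stand-in for a mel spectrogram
spikes = mel_to_spikes(mel, rng)             # sparse 0/1 input to assemblies
```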
[614] The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs
Shree Harsha Bokkahalli Satish, Christoph Minixhofer, Maria Teleki, James Caverlee, Ondřej Klejch, Peter Bell, Gustav Eje Henter, Éva Székely
Main category: eess.AS
TL;DR: SpeechLLMs show implicit bias: Eastern European-accented speech receives lower helpfulness scores, especially for female voices, despite polite responses.
Details
Motivation: SpeechLLMs process spoken input directly, retaining speaker identity cues like accent and perceived gender that were previously removed in cascaded pipelines. This introduces speaker identity dependent variation in responses, raising concerns about bias.
Method: Large-scale intersectional evaluation of three SpeechLLMs using 2,880 controlled interactions across six English accents and two gender presentations. Linguistic content kept constant through voice cloning. Used pointwise LLM-judge ratings, pairwise comparisons, and Best-Worst Scaling with human validation.
Result: Consistent disparities detected: Eastern European-accented speech receives lower helpfulness scores, particularly for female-presenting voices. Bias is implicit (responses remain polite but differ in helpfulness). LLM judges capture directional trend, but human evaluators show significantly higher sensitivity and uncover sharper intersectional disparities.
Conclusion: SpeechLLMs exhibit accent and gender bias that requires attention. Human evaluation is more sensitive than LLM judges for detecting intersectional disparities. Need for bias mitigation in speech-based language models.
Abstract: Speech Large Language Models (SpeechLLMs) process spoken input directly, retaining cues such as accent and perceived gender that were previously removed in cascaded pipelines. This introduces speaker identity dependent variation in responses. We present a large-scale intersectional evaluation of accent and gender bias in three SpeechLLMs using 2,880 controlled interactions across six English accents and two gender presentations, keeping linguistic content constant through voice cloning. Using pointwise LLM-judge ratings, pairwise comparisons, and Best-Worst Scaling with human validation, we detect consistent disparities. Eastern European-accented speech receives lower helpfulness scores, particularly for female-presenting voices. The bias is implicit: responses remain polite but differ in helpfulness. While LLM judges capture the directional trend of these biases, human evaluators exhibit significantly higher sensitivity, uncovering sharper intersectional disparities.
[615] SimulU: Training-free Policy for Long-form Simultaneous Speech-to-Speech Translation
Amirbek Djanibekov, Luisa Bentivogli, Matteo Negri, Sara Papi
Main category: eess.AS
TL;DR: SimulU: A training-free policy for long-form simultaneous speech-to-speech translation using history management and speech output selection strategies with pre-trained end-to-end models.
Details
Motivation: Simultaneous speech-to-speech translation is crucial for real-time multilingual communication but remains underexplored, with current solutions requiring resource-intensive training and failing to generalize to continuous speech.
Method: Proposes SimulU, a training-free policy that uses history management and speech output selection strategies exploiting cross-attention in pre-trained end-to-end models to regulate input history and output generation.
Result: Evaluations on MuST-C across 8 languages show SimulU achieves better or comparable quality-latency trade-off against strong cascaded models.
Conclusion: SimulU offers a promising path to end-to-end simultaneous speech-to-speech translation in realistic, long-form scenarios by eliminating the need for ad-hoc training.
Abstract: Simultaneous speech-to-speech translation (SimulS2S) is essential for real-time multilingual communication, with increasing integration into meeting and streaming platforms. Despite this, SimulS2S remains underexplored in research, where current solutions often rely on resource-intensive training procedures and operate on short-form, pre-segmented utterances, failing to generalize to continuous speech. To bridge this gap, we propose SimulU, the first training-free policy for long-form SimulS2S. SimulU adopts history management and speech output selection strategies that exploit cross-attention in pre-trained end-to-end models to regulate both input history and output generation. Evaluations on MuST-C across 8 languages show that SimulU achieves a better or comparable quality-latency trade-off against strong cascaded models. By eliminating the need for ad-hoc training, SimulU offers a promising path to end-to-end SimulS2S in realistic, long-form scenarios.
[616] Over-the-air White-box Attack on the Wav2Vec Speech Recognition Neural Network
Protopopov Alexey
Main category: eess.AS
TL;DR: Explores methods to make over-the-air adversarial attacks on speech recognition systems less detectable by humans while maintaining attack effectiveness
Details
Motivation: Current over-the-air adversarial attacks on neural network-based speech recognition systems are typically detectable by human hearing, which limits their practical applications. The paper aims to address this limitation by exploring approaches to reduce detectability.
Method: The paper explores different approaches to make over-the-air adversarial attacks less detectable, though specific methods are not detailed in the abstract. Likely involves techniques to modify adversarial perturbations to be less perceptible to human listeners while still fooling ASR systems.
Result: The abstract mentions exploring the impact of these approaches on attack effectiveness, suggesting trade-offs between detectability and attack success rate were analyzed, but specific results are not provided.
Conclusion: The work investigates the important problem of stealth in adversarial attacks on speech systems, highlighting the need for attacks that are both effective and imperceptible to humans in real-world scenarios.
Abstract: Automatic speech recognition systems based on neural networks are vulnerable to adversarial attacks that alter transcriptions in a malicious way. Recent works in this field have focused on making attacks work in over-the-air scenarios; however, such attacks are typically detectable by human hearing, limiting their potential applications. In the present work we explore different approaches to making over-the-air attacks less detectable, as well as the impact these approaches have on the attacks’ effectiveness.
[617] Shared Representation Learning for Reference-Guided Targeted Sound Detection
Shubham Gupta, Adarsh Arigala, B. R. Dilleswari, Sri Rama Murty Kodukula
Main category: eess.AS
TL;DR: Unified encoder architecture for targeted sound detection that processes reference and mixture audio in shared representation space, achieving state-of-the-art performance with improved generalization.
Details
Motivation: Human auditory attention inspires targeted sound detection (TSD), which requires detecting and localizing target sounds in mixtures when reference audio is provided. Prior approaches use separate encoders with conditional embeddings, but this work aims to create stronger alignment between reference and mixture representations while reducing architectural complexity.
Method: Proposes a unified encoder architecture that processes both reference and mixture audio within a shared representation space, promoting stronger alignment. Uses multi-task training paradigm to jointly optimize detection and localization objectives.
Result: Achieves substantial improvements over prior approaches, establishing new state-of-the-art benchmark for targeted sound detection with segment-level F1 score of 83.15% and overall accuracy of 95.17% on URBAN-SED dataset.
Conclusion: The unified encoder approach simplifies architecture while enhancing generalization to unseen classes, demonstrating that shared representation space between reference and mixture audio leads to superior performance in targeted sound detection tasks.
Abstract: Human listeners exhibit the remarkable ability to segregate a desired sound from complex acoustic scenes through selective auditory attention, motivating the study of Targeted Sound Detection (TSD). The task requires detecting and localizing a target sound in a mixture when a reference audio of that sound is provided. Prior approaches rely on generating a sound-discriminative conditional embedding vector for the reference and pairing it with a mixture encoder, jointly optimized with a multi-task learning approach. In this work, we propose a unified encoder architecture that processes both the reference and mixture audio within a shared representation space, promoting stronger alignment while reducing architectural complexity. This design choice not only simplifies the overall framework but also enhances generalization to unseen classes. Following the multi-task training paradigm, our method achieves substantial improvements over prior approaches, surpassing existing methods and establishing a new state-of-the-art benchmark for targeted sound detection, with a segment-level F1 score of 83.15% and an overall accuracy of 95.17% on the URBAN-SED dataset.
[618] Uncertainty Quantification and Risk Control for Multi-Speaker Sound Source Localization
Vadim Rozenfeld, Bracha Laufer Goldshtein
Main category: eess.AS
TL;DR: Conformal prediction methods for reliable sound source localization with uncertainty quantification in challenging acoustic environments
Details
Motivation: Existing SSL methods only provide point estimates without uncertainty quantification, which is crucial for reliable decision-making in downstream applications, especially in challenging acoustic conditions like reverberation and multi-source scenarios.
Method: Two complementary UQ approaches using Conformal Prediction framework: 1) prediction regions covering true source locations when source count is known, 2) first estimating number of active sources then forming prediction regions when source count is unknown.
Result: Methods demonstrate reliable finite-sample guarantees and consistent performance across varying reverberation levels and source configurations in both simulations and real-world recordings.
Conclusion: The proposed frameworks provide practical uncertainty-aware SSL with reliable performance guarantees for both known and unknown source-count scenarios.
Abstract: Reliable Sound Source Localization (SSL) plays an essential role in many downstream tasks, where informed decision making depends not only on accurate localization but also on the confidence in each estimate. This need for reliability becomes even more pronounced in challenging conditions, such as reverberant environments and multi-source scenarios. However, existing SSL methods typically provide only point estimates, offering limited or no Uncertainty Quantification (UQ). We leverage the Conformal Prediction (CP) framework and its extensions for controlling general risk functions to develop two complementary UQ approaches for SSL. The first assumes that the number of active sources is known and constructs prediction regions that cover the true source locations. The second addresses the more challenging setting where the source count is unknown, first reliably estimating the number of active sources and then forming corresponding prediction regions. We evaluate the proposed methods on extensive simulations and real-world recordings across varying reverberation levels and source configurations. Results demonstrate reliable finite-sample guarantees and consistent performance for both known and unknown source-count scenarios, highlighting the practical utility of the proposed frameworks for uncertainty-aware SSL.
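For the known-source-count setting, the generic split-conformal recipe behind the first approach looks as follows; the paper's risk-control extensions go beyond this sketch.

```python
# Split conformal prediction for a scalar DOA estimate: calibrate the
# (1 - alpha) quantile of angular errors on held-out data, then report a
# region of that radius around each new estimate. Toy calibration data.
import numpy as np

def conformal_radius(cal_errors, alpha=0.1):
    n = len(cal_errors)
    k = int(np.ceil((n + 1) * (1 - alpha)))     # finite-sample correction
    return np.sort(cal_errors)[min(k, n) - 1]

cal = np.abs(np.random.default_rng(1).normal(0, 5, 500))  # calib. errors (deg)
r = conformal_radius(cal, alpha=0.1)
estimate = 42.0                                  # new DOA estimate (degrees)
region = (estimate - r, estimate + r)            # covers truth w.p. >= 0.9
```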
[619] Robust Nasality Representation Learning for Cleft Palate-Related Velopharyngeal Dysfunction Screening in Real-World Settings
Weixin Liu, Bowen Qu, Amy Stone, Maria E. Powell, Shama Dufresne, Stephane Braun, Izabela Galdyn, Michael Golinko, Bradley Malin, Zhijun Yin, Matthew E. Pontell
Main category: eess.AS
TL;DR: Two-stage framework for VPD screening using nasality-focused speech representation learning via supervised contrastive pre-training, achieving robust performance across clinical and real-world recordings.
Details
Motivation: Existing speech-based VPD screening models perform well in clinical settings but degrade in real-world conditions due to domain shift from device, noise, and acoustic variations. Need for more robust, deployable solutions.
Method: 1) Supervised contrastive pre-training on auxiliary corpus with phoneme alignments using oral-context vs nasal-context supervision to learn nasality-focused speech representation. 2) Freeze encoder and use lightweight classifiers on 0.5-second speech chunks, aggregating probabilities for recording-level decisions.
Result: Perfect performance on in-domain clinical cohort (82 subjects: macro-F1=1.000, accuracy=1.000). Outperformed all baselines on out-of-domain Internet recordings (131 samples: macro-F1=0.679, accuracy=0.695 vs MFCC baseline 0.612/0.641). Large pretrained speech representations degraded substantially in real-world settings.
Conclusion: Learning nasality-focused representations before clinical classification reduces sensitivity to recording artifacts and improves robustness for deployable speech-based VPD screening.
Abstract: Velopharyngeal dysfunction (VPD) is characterized by inadequate velopharyngeal closure during speech and often causes hypernasality and reduced intelligibility. Although speech-based machine learning models can perform well under standardized clinical recording conditions, their performance often drops in real-world settings because of domain shift caused by differences in devices, channels, noise, and room acoustics. To improve robustness, we propose a two-stage framework for VPD screening. First, a nasality-focused speech representation is learned by supervised contrastive pre-training on an auxiliary corpus with phoneme alignments, using oral-context versus nasal-context supervision. Second, the encoder is frozen and used with lightweight classifiers on 0.5-second speech chunks, whose probabilities are aggregated to produce recording-level decisions with a fixed threshold. On an in-domain clinical cohort of 82 subjects, the proposed method achieved perfect recording-level screening performance (macro-F1 = 1.000, accuracy = 1.000). On a separate out-of-domain set of 131 heterogeneous public Internet recordings, large pretrained speech representations degraded substantially, while MFCC was the strongest baseline (macro-F1 = 0.612, accuracy = 0.641). The proposed method achieved the best out-of-domain performance (macro-F1 = 0.679, accuracy = 0.695), improving on the strongest baseline under the same evaluation protocol. These results suggest that learning a nasality-focused representation before clinical classification can reduce sensitivity to recording artifacts and improve robustness for deployable speech-based VPD screening.
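The second-stage decision rule is simple to sketch; mean aggregation over chunk probabilities is an assumption, since the abstract does not pin down the aggregator.

```python
# Chunk-to-recording aggregation: classify many 0.5 s chunks with a
# lightweight head on frozen embeddings, then aggregate chunk probabilities
# into one recording-level decision at a fixed threshold.
import numpy as np

def recording_decision(chunk_probs: np.ndarray, threshold=0.5) -> bool:
    """chunk_probs: per-chunk P(VPD); returns recording-level screen result."""
    return float(np.mean(chunk_probs)) >= threshold

probs = np.array([0.81, 0.64, 0.72, 0.35, 0.9])  # toy per-chunk outputs
print(recording_decision(probs))                 # -> True
```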
[620] Feature Selection via Graph Topology Inference for Soundscape Emotion Recognition
Samuel Rey, Luca Martino, Roberto San Millan, Eduardo Morgado
Main category: eess.AS
TL;DR: A graph learning framework with novel information criterion for soundscape emotion recognition feature selection, revealing strong arousal-valence connection challenging SER assumptions.
Details
Motivation: Soundscape research has shifted from noise levels to perception, requiring better feature selection methods for soundscape emotion recognition (SER) that traditionally uses arousal and valence as affect descriptors.
Method: Blend graph learning techniques with novel information criterion; estimate sparse graph representation of feature relations using linear structural equation models (SEM) on Emo-Soundscapes dataset; propose generalized elbow detector for sparsity level determination with point estimate and uncertainty interval.
Result: Extensive evaluation including visualizations of inferred relations; findings align with previous studies but graph representation reveals strong connection between arousal and valence, challenging common SER assumptions.
Conclusion: The proposed graph learning framework with novel information criterion provides effective feature selection for SER, uncovering important relationships between emotional dimensions that challenge traditional assumptions in the field.
Abstract: Research on soundscapes has shifted the focus of environmental acoustics from noise levels to the perception of sounds, incorporating contextual factors. Soundscape emotion recognition (SER) models perception using a set of features, with arousal and valence commonly regarded as sufficient descriptors of affect. In this work, we blend graph learning techniques with a novel information criterion to develop a feature selection framework for SER. Specifically, we estimate a sparse graph representation of feature relations using linear structural equation models (SEM) tailored to the widely used Emo-Soundscapes dataset. The resulting graph captures the relations between input features and the two emotional outputs. To determine the appropriate level of sparsity, we propose a novel generalized elbow detector, which provides both a point estimate and an uncertainty interval. We conduct an extensive evaluation of our methods, including visualizations of the inferred relations. While several of our findings align with previous studies, the graph representation also reveals a strong connection between arousal and valence, challenging common SER assumptions.
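The generalized elbow detector itself is the paper's contribution; the classic chord-distance elbow heuristic below conveys the underlying idea of picking the point on a fit-versus-sparsity curve farthest from the line joining its endpoints (the paper's uncertainty interval is omitted).

```python
# Chord-distance elbow heuristic as a stand-in for the paper's generalized
# elbow detector: pick the curve point farthest from the endpoint chord.
import numpy as np

def elbow_index(y: np.ndarray) -> int:
    x = np.arange(len(y), dtype=float)
    p0, p1 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    chord = (p1 - p0) / np.linalg.norm(p1 - p0)
    pts = np.stack([x, y], axis=1) - p0
    # distance of each point from the chord (2D cross-product magnitude)
    dist = np.abs(pts[:, 0] * chord[1] - pts[:, 1] * chord[0])
    return int(dist.argmax())

fit = np.array([1.0, 0.55, 0.34, 0.28, 0.26, 0.25])   # error vs. sparsity
print(elbow_index(fit))                                # -> 2
```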
[621] Multi-Source Evidence Fusion for Audio Question Answering
Aivo Olev, Tanel Alumäe
Main category: eess.AS
TL;DR: TalTech’s winning solution for Interspeech 2026 Audio Reasoning Challenge uses multi-source ensemble with LALMs and acoustic tools to produce verifiable reasoning chains about audio content.
Details
Motivation: Large audio language models (LALMs) can answer questions about audio content but their internal reasoning is opaque and difficult to validate. The challenge requires evaluating reasoning process quality - factual accuracy, logical soundness, and completeness of reasoning chains.
Method: Multi-source ensemble pipeline using two LALMs to generate independent observations, with a separate text-only reasoning model cross-checking these against outputs from 25 acoustic tools organized into reliability tiers. Every inference step is grounded in explicit, reliability-tagged evidence.
Result: The system ranked first in the Interspeech 2026 Audio Reasoning Challenge, outperforming all competing systems by a wide margin in the challenge’s reasoning quality metric.
Conclusion: By grounding inferences in explicit evidence from multiple sources with reliability tagging, the system produces dense, verifiable reasoning chains that address the opacity problem in LALMs.
Abstract: Large audio language models (LALMs) can answer questions about speech, music, and environmental sounds, yet their internal reasoning is largely opaque and difficult to validate. We describe TalTech’s solution to the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, in which systems are evaluated on reasoning process quality, specifically the factual accuracy, logical soundness, and completeness of their reasoning chains. Our multi-source ensemble pipeline uses two LALMs that generate independent observations, while a separate text-only reasoning model cross-checks these against outputs from 25 acoustic tools organized into reliability tiers. By grounding every inference step in explicit, reliability-tagged evidence, the system produces dense, verifiable reasoning chains. Our system ranked first in the challenge, outperforming all competing systems by a wide margin in the challenge’s reasoning quality metric.
[622] Integrated Spoofing-Robust Automatic Speaker Verification via a Three-Class Formulation and LLR
Kai Tan, Lin Zhang, Ruiteng Zhang, Johan Rohdin, Leibny Paola García-Perera, Zexin Cai, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews
Main category: eess.AS
TL;DR: A unified end-to-end framework for spoofing-robust speaker verification using a three-class formulation for interpretable log-likelihood ratio inference
Details
Motivation: Current spoofing-robust automatic speaker verification (SASV) methods either fuse independent ASV and CM scores or use bi-encoder networks, which offer limited interpretability and cannot adapt to new evaluation parameters without retraining.
Method: Proposes a unified end-to-end framework via a three-class formulation (target speaker, non-target speaker, spoof) that enables log-likelihood ratio (LLR) inference directly from class logits for more interpretable decision making.
Result: Achieves comparable performance to existing methods on ASVSpoof5 and better results on SpoofCeleb datasets; visualization and analysis demonstrate improved interpretability through the three-class reformulation
Conclusion: The proposed three-class end-to-end framework provides a more interpretable and unified approach to spoofing-robust speaker verification while maintaining competitive performance
Abstract: Spoofing-robust automatic speaker verification (SASV) aims to integrate automatic speaker verification (ASV) and countermeasure (CM). A popular solution is fusion of independent ASV and CM scores. To better model SASV, some frameworks integrate ASV and CM within a single network. However, these solutions are typically bi-encoder based, offer limited interpretability, and cannot be readily adapted to new evaluation parameters without retraining. Motivated by this, we propose a unified end-to-end framework via a three-class formulation that enables log-likelihood ratio (LLR) inference from class logits for a more interpretable decision pipeline. Experiments show comparable performance to existing methods on ASVSpoof5 and better results on SpoofCeleb. Visualization and analysis further show that the three-class reformulation provides more interpretability.
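The LLR inference the three-class formulation enables is mechanically simple: with logits for (target, non-target, spoof), the SASV score is the log ratio of the target posterior to the pooled alternative. Uniform class priors are assumed in this sketch.

```python
# LLR from three-class logits: log P(target) - log(P(non-target) + P(spoof)),
# computed stably via log-softmax and logsumexp. Uniform priors assumed.
import numpy as np
from scipy.special import logsumexp

def sasv_llr(logits: np.ndarray) -> float:
    """logits: [target, non-target, spoof] -> log-likelihood ratio score."""
    log_post = logits - logsumexp(logits)          # log softmax
    return float(log_post[0] - logsumexp(log_post[1:]))

print(sasv_llr(np.array([4.2, 0.3, -1.1])))        # large positive -> accept
```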
[623] The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
Donghang Wu, Tianyu Zhang, Yuxin Li, Hexin Liu, Chen Chen, Eng Siong Chng, Yoshua Bengio
Main category: eess.AS
TL;DR: FLAIR is a full-duplex latent reasoning method that enables AI systems to think while listening to speech, mimicking human cognitive processing during conversations without adding latency.
Details
Motivation: The paper is motivated by how humans engage in concurrent thinking while listening to speakers during conversations, which helps formulate high-quality responses. Current NLP "thinking" mechanisms require post-hoc generation, which doesn't align well with real-time spoken dialogue systems.
Method: Proposes FLAIR (Full-duplex LAtent and Internal Reasoning) that conducts latent thinking simultaneously with speech perception. It recursively feeds latent embeddings from previous steps into the next step during user speech, enabling continuous causal reasoning. Uses an Evidence Lower Bound-based objective for supervised finetuning via teacher forcing without needing explicit reasoning annotations.
Result: Achieves competitive results on speech benchmarks and robustly handles conversational dynamics with competitive performance on full-duplex interaction metrics.
Conclusion: The think-while-listening design effectively mimics human cognitive processing for spoken dialogue systems, enabling continuous reasoning without additional latency.
Abstract: During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional “thinking” mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user’s speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.
[624] Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems
Yi-Cheng Lin, Huang-Cheng Chou, Tzu-Chieh Wei, Kuan-Yu Chen, Hung-yi Lee
Main category: eess.AS
TL;DR: Instruction-guided TTS systems show gaps between user instructions and listener perception, with GPT-4o-mini-TTS performing best but fine-grained control remaining challenging.
Details
Motivation: To investigate the alignment between user style instructions and listener perception in instruction-guided text-to-speech (ITTS) systems, as this relationship remains largely unexplored despite the intuitive interface ITTS provides.
Method: Conducted perceptual analysis of ITTS controllability across expressive dimensions (adverbs of degree, graded emotion intensity), collected human ratings on speaker age and word-level emphasis, and created the Expressive VOice Control (E-VOC) corpus with large-scale human evaluations.
Result: 1) GPT-4o-mini-TTS is the most reliable ITTS model with good instruction-utterance alignment. 2) ITTS systems tend to generate adult voices regardless of child/elderly instructions. 3) Fine-grained control remains a major challenge for most ITTS systems.
Conclusion: There’s a significant instruction-perception gap in ITTS systems, with GPT-4o-mini-TTS performing best but substantial room for improvement in fine-grained control and accurate interpretation of nuanced attribute instructions.
Abstract: Instruction-guided text-to-speech (ITTS) enables users to control speech generation through natural language prompts, offering a more intuitive interface than traditional TTS. However, the alignment between user style instructions and listener perception remains largely unexplored. This work first presents a perceptual analysis of ITTS controllability across two expressive dimensions (adverbs of degree and graded emotion intensity) and collects human ratings on speaker age and word-level emphasis attributes. To comprehensively reveal the instruction-perception gap, we provide a data collection with large-scale human evaluations, named the Expressive VOice Control (E-VOC) corpus. Furthermore, we reveal that (1) gpt-4o-mini-tts is the most reliable ITTS model, with strong alignment between instructions and generated utterances across acoustic dimensions. (2) The five analyzed ITTS systems tend to generate adult voices even when the instructions ask for child or elderly voices. (3) Fine-grained control remains a major challenge, indicating that most ITTS systems have substantial room for improvement in interpreting slightly different attribute instructions.
eess.IV
[625] Simultaneous super-resolution and optical sectioning with four-beam interference structured illumination microscopy (4I-SIM)
Jiaming Qian, Jing Feng, Hongjun Wu, Maoxian Zhang, Dongqin Lu, Tianchi Kang, Xinyu Han, Qian Chen, Chao Zuo
Main category: eess.IV
TL;DR: 4I-SIM improves structured illumination microscopy by adding interference orders to overcome missing cone problem, achieving artifact-free super-resolution with optical sectioning for thick specimens.
Details
Motivation: Conventional 2D-SIM suffers from missing cone problem in optical transfer function when imaging thick or scattering specimens, causing out-of-focus background and reconstruction artifacts that compromise image fidelity.
Method: Four-beam interference structured illumination microscopy (4I-SIM) introduces additional interference orders to expand lateral frequency support and compensate the axial missing cone simultaneously, achieving intrinsic optical sectioning without additional acquisition overhead.
Result: 4I-SIM achieves nearly twofold lateral resolution enhancement (103 nm lateral, 336 nm axial) compared to 2D-SIM, revealing mitochondrial remodeling and apoptosis under high-glucose stress with millisecond temporal resolution in thick fixed and live specimens.
Conclusion: 4I-SIM establishes a practical platform for simultaneous super-resolution and optical sectioning imaging in complex biological environments with minimal hardware modification, low phototoxicity, and open-source reconstruction tools.
Abstract: Structured illumination microscopy (SIM) has emerged as a widely adopted super-resolution fluorescence imaging modality, offering high speed, low phototoxicity, large field-of-view, and compatibility with conventional probes. However, when applied to thick or scattering specimens, conventional two-dimensional SIM (2D-SIM) suffers from the missing cone problem in its optical transfer function, resulting in prominent out-of-focus background and severe reconstruction artifacts that compromise image fidelity. Here, we present four-beam interference structured illumination microscopy (4I-SIM), which introduces additional interference orders to expand lateral frequency support and compensate the axial missing cone simultaneously. This strategy achieves artifact-free super-resolution with intrinsic optical sectioning, effectively overcoming the fundamental limitation of 2D-SIM without additional acquisition overhead. Experimental validation across diverse thick fixed and live specimens demonstrates that 4I-SIM delivers nearly twofold lateral resolution enhancement and substantially improved sectioning compared with its 2D counterpart, achieving lateral and axial resolutions of 103 nm and 336 nm, respectively. In particular, 4I-SIM reveals mitochondrial remodeling and apoptosis under high-glucose stress with millisecond temporal resolution, features that remain obscured with conventional SIM. With minimal hardware modification, low phototoxicity, and open-source reconstruction tools, 4I-SIM establishes a practical and reproducible platform for simultaneous super-resolution and optical sectioning imaging in complex biological environments.
[626] On the Degrees of Freedom of Gridded Control Points in Learning-Based Medical Image Registration
Wen Yan, Qianye Yang, Yipei Wang, Shonit Punwani, Mark Emberton, Vasilis Stavrinides, Yipeng Hu, Dean Barratt
Main category: eess.IV
TL;DR: GridReg: A learning-based medical image registration framework using sparse control points instead of dense voxel-wise decoding, reducing parameters/memory while maintaining accuracy with adaptive multi-scale grid training.
Details
Motivation: Traditional dense voxel-wise registration is computationally expensive and ill-posed in homogeneous regions. Sparse control points offer a compact, smooth deformation representation with better stability and memory efficiency.
Method: Replace dense voxel-wise decoding with displacement predictions at sparse grid control points. Use 3D encoder feature maps flattened to 1D tokens with positional encoding, then predict the sparse gridded deformation via cross-attention. Introduce grid-adaptive training for multi-scale inference without retraining.
Result: Significant improvement in registration performance with similar or less computational cost compared to dense deformation field (DDF) or scattered key point methods. Demonstrated on prostate gland, pelvic organs, and neurological structures datasets.
Conclusion: Sparse grid control points provide effective deformation representation for learning-based registration, reducing parameters/memory while maintaining accuracy, with adaptive training enabling flexible multi-scale inference.
Abstract: Many registration problems are ill-posed in homogeneous or noisy regions, and dense voxel-wise decoders can be unnecessarily high-dimensional. A sparse control-point parameterisation provides a compact, smooth deformation representation while reducing memory and improving stability. This work investigates how many control points are required when developing learning-based registration networks. We present GridReg, a learning-based registration framework that replaces dense voxel-wise decoding with displacement predictions at a sparse grid of control points. This design substantially cuts the parameter count and memory while retaining registration accuracy. Multiscale 3D encoder feature maps are flattened into a 1D token sequence with positional encoding to retain spatial context. The model then predicts a sparse gridded deformation field using a cross-attention module. We further introduce grid-adaptive training, enabling an adaptive model to operate at multiple grid sizes at inference without retraining. This work quantitatively demonstrates the benefits of using sparse grids. Using three datasets for registering the prostate gland, pelvic organs, and neurological structures, the results suggest a significant improvement from using a grid-controlled displacement field. Moreover, the proposed approach obtained superior registration performance, at similar or lower computational cost, compared with existing algorithms that predict DDFs or displacements sampled at scattered key points.
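A minimal PyTorch sketch of the core parameterisation (the cross-attention decoder and positional encoding are omitted, and the random `sparse_disp` tensor stands in for the network output; all shapes and the channel order are assumptions): displacements predicted on a sparse control-point grid are interpolated to a dense field and used to warp the moving image.

    import torch
    import torch.nn.functional as F

    B, D, H, W = 1, 64, 64, 64   # dense image grid
    gd, gh, gw = 8, 8, 8         # sparse control-point grid

    # Stand-in for the decoder output: one 3-vector displacement per control
    # point, assumed ordered (x, y, z) in normalized [-1, 1] coordinates.
    sparse_disp = 0.01 * torch.randn(B, 3, gd, gh, gw)

    # Interpolate control-point displacements to a dense displacement field.
    dense_disp = F.interpolate(sparse_disp, size=(D, H, W),
                               mode="trilinear", align_corners=True)

    # Warp a moving image with the dense field via grid_sample.
    moving = torch.rand(B, 1, D, H, W)
    zz, yy, xx = torch.meshgrid(torch.linspace(-1, 1, D), torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
    identity = torch.stack((xx, yy, zz), dim=-1).unsqueeze(0)  # (B, D, H, W, 3)
    grid = identity + dense_disp.permute(0, 2, 3, 4, 1)
    warped = F.grid_sample(moving, grid, align_corners=True)
    print(warped.shape)  # torch.Size([1, 1, 64, 64, 64])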
[627] UNICORN: Ultrasound Nakagami Imaging via Score Matching and Adaptation for Assessing Hepatic Steatosis
Kwanyoung Kim, Jaa-Yeon Lee, Youngjun Ko, GunWoo Lee, Jong Chul Ye
Main category: eess.IV
TL;DR: UNICORN is a novel ultrasound Nakagami imaging method using score matching for accurate, pixel-by-pixel parameter estimation to detect hepatic steatosis with high resolution.
Details
Motivation: Current ultrasound Nakagami imaging methods for hepatic steatosis assessment have limitations: they struggle with optimal window size selection, suffer from estimator instability, degrade image resolution, and typically only visualize specific ROIs rather than providing comprehensive parameter mapping.
Method: Proposes UNICORN (Ultrasound Nakagami Imaging via Score Matching and Adaptation), a novel method using the score function of ultrasound envelope signals to build an accurate, closed-form estimator of the Nakagami parameter. Unlike fixed-window approaches, it provides pixel-by-pixel estimation for comprehensive parameter mapping and high-resolution imaging.
Result: The method effectively assesses hepatic steatosis and provides visual distinction in backscattered statistics associated with this condition. Extensive experiments using real patient envelope data validated that UNICORN enables clinical detection of hepatic steatosis and exhibits robustness and generalizability.
Conclusion: UNICORN addresses key limitations of existing Nakagami imaging methods by providing accurate, high-resolution parameter mapping through pixel-by-pixel estimation, making it a promising tool for clinical hepatic steatosis detection.
Abstract: Ultrasound imaging is an essential first-line tool for assessing hepatic steatosis. While conventional B-mode ultrasound imaging has limitations in providing detailed tissue characterization, ultrasound Nakagami imaging holds promise for visualizing and quantifying tissue scattering in backscattered signals, with potential applications in fat fraction analysis. However, existing methods for Nakagami imaging struggle with optimal window size selection and suffer from estimator instability, leading to degraded image resolution. To address these challenges, we propose a novel method called UNICORN (Ultrasound Nakagami Imaging via Score Matching and Adaptation), which offers an accurate, closed-form estimator for Nakagami parameter estimation based on the score function of the ultrasound envelope signal. Unlike methods that visualize only specific regions of interest (ROI) and estimate parameters within fixed window sizes, our approach provides comprehensive parameter mapping via a pixel-by-pixel estimator, resulting in high-resolution imaging. We demonstrate that our proposed estimator effectively assesses hepatic steatosis and provides visual distinction in the backscattered statistics associated with this condition. Through extensive experiments using real envelope data from patients, we validated that UNICORN enables clinical detection of hepatic steatosis and exhibits robustness and generalizability.
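For contrast, here is the classical window-based moment estimator of the Nakagami shape parameter m whose window-size/resolution trade-off motivates the paper; UNICORN's closed-form score-based estimator itself is not reproduced here. The window size and synthetic envelope are illustrative.

    import numpy as np

    def nakagami_m_moments(envelope):
        """Inverse-normalized-variance estimator: m = E[R^2]^2 / Var(R^2)."""
        r2 = envelope.astype(np.float64) ** 2
        return r2.mean() ** 2 / max(r2.var(), 1e-12)

    def windowed_m_map(env_img, win=7):
        """Sliding-window m map; spatial resolution degrades as `win` grows."""
        pad = win // 2
        padded = np.pad(env_img, pad, mode="reflect")
        m = np.empty(env_img.shape, dtype=np.float64)
        for i in range(env_img.shape[0]):
            for j in range(env_img.shape[1]):
                m[i, j] = nakagami_m_moments(padded[i:i + win, j:j + win])
        return m

    rng = np.random.default_rng(0)
    env = np.sqrt(rng.gamma(shape=1.5, scale=1.0, size=(64, 64)))  # toy envelope, true m = 1.5
    print(windowed_m_map(env).mean())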
[628] A Lensless Polarization Camera
Noa Kraicer, Shay Elmalem, Erez Yosef, Hani Barhum, Raja Giryes
Main category: eess.IV
TL;DR: A compact lensless polarization camera using a diffuser and striped polarization mask to recover four linear polarization images from a single snapshot.
Details
Motivation: Existing polarization cameras use spatial or temporal multiplexing, which increases camera volume, weight, and cost. There's a need for compact polarization imaging systems that maintain functionality while reducing size and complexity.
Method: Proposes a lensless polarization camera composed of a diffuser and a simple striped polarization mask. Combines this optical design with a reconstruction algorithm that explicitly models polarization-encoded lensless measurements to recover four linear polarization images from a single snapshot.
Result: Demonstrates successful recovery of four linear polarization images from single snapshot measurements. Reveals physical factors governing reconstruction quality, providing guidance for developing high-quality practical systems.
Conclusion: The work demonstrates the potential of lensless approaches for polarization imaging, offering a compact alternative to traditional polarization cameras while maintaining the ability to capture polarization information.
Abstract: Polarization imaging is a technique that creates a pixel map of the polarization state in a scene. Although invisible to the human eye, polarization can assist various sensing and computer vision tasks. Existing polarization cameras use spatial or temporal multiplexing, which increases the camera volume, weight, cost, or all of the above. Recent lensless imaging approaches, such as DiffuserCam, have demonstrated that compact imaging systems can be realized by replacing the lens with a coding element and performing computational reconstruction. In this work, we propose a compact lensless polarization camera composed of a diffuser and a simple striped polarization mask. By combining this optical design with a reconstruction algorithm that explicitly models the polarization-encoded lensless measurements, four linear polarization images are recovered from a single snapshot. Our results demonstrate the potential of lensless approaches for polarization imaging and reveal the physical factors that govern reconstruction quality, guiding the development of high-quality practical systems.
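An illustrative numpy sketch of the forward model as described (the PSF, mask geometry, and scaling below are toy stand-ins, not the paper's calibration): each linear-polarization channel is blurred by the diffuser PSF, gated by its stripe of the polarization mask, and the four contributions sum onto the sensor in a single snapshot. Reconstruction amounts to inverting this model for all four channels.

    import numpy as np

    H, W, P = 128, 128, 4                        # sensor size, 4 polarizer angles
    rng = np.random.default_rng(0)
    psf = rng.random((H, W)); psf /= psf.sum()   # toy diffuser PSF
    x = rng.random((P, H, W))                    # ground-truth polarization images

    # Striped mask: row i passes polarization orientation (i mod 4).
    masks = np.stack([(np.arange(H) % P == p)[:, None] * np.ones((1, W))
                      for p in range(P)])

    def conv2_fft(img, kernel):                  # circular convolution for brevity
        return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(kernel)))

    y = sum(masks[p] * conv2_fft(x[p], psf) for p in range(P))  # one snapshot
    print(y.shape)  # (128, 128)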
[629] Structured SIR: Efficient and Expressive Importance-Weighted Inference for High-Dimensional Image Registration
Ivor J. A. Simpson, Neill D. F. Campbell
Main category: eess.IV
TL;DR: Structured SIR: A memory-efficient probabilistic method for 3D image registration that captures multi-modal uncertainty distributions using sampled importance resampling with structured covariance parameterization.
Details
Motivation: Image registration is ill-posed with multiple valid solutions, requiring probabilistic inference. Existing variational methods make restrictive assumptions, leading to poor uncertainty characterization, overconfidence, and low-quality samples. Flexible posteriors are bottlenecked by high-dimensional covariance complexity in dense 3D registration.
Method: Proposes Structured SIR, using Sampled Importance Resampling with a novel memory-efficient covariance parameterization: the sum of a low-rank covariance and a sparse, spatially structured Cholesky precision factor. This captures complex spatial correlations while remaining computationally tractable for high-dimensional problems.
Result: Evaluated on 3D brain MRI registration, the method produces significantly better calibrated uncertainty estimates than variational methods with equivalent or better accuracy. Yields highly structured multi-modal posterior distributions enabling effective uncertainty quantification.
Conclusion: Structured SIR enables expressive, multi-modal uncertainty characterization for high-dimensional image registration with memory and computational efficiency, addressing limitations of variational approaches.
Abstract: Image registration is an ill-posed dense vision task, where multiple solutions achieve similar loss values, motivating probabilistic inference. Variational inference has previously been employed to capture these distributions; however, restrictive assumptions about the posterior form can lead to poor characterisation, overconfidence and low-quality samples. More flexible posteriors are typically bottlenecked by the complexity of the high-dimensional covariance matrices required for dense 3D image registration. In this work, we present a memory- and computationally-efficient inference method, Structured SIR, that enables expressive, multi-modal characterisation of uncertainty with high-quality samples. We propose the use of a Sampled Importance Resampling (SIR) algorithm with a novel memory-efficient high-dimensional covariance parameterisation as the sum of a low-rank covariance and a sparse, spatially structured Cholesky precision factor. This structure enables capturing complex spatial correlations while remaining computationally tractable. We evaluate the efficacy of this approach in 3D dense image registration of brain MRI data, which is a very high-dimensional problem. We demonstrate that our proposed method produces uncertainty estimates that are significantly better calibrated than those produced by variational methods, while achieving equivalent or better accuracy. Crucially, we show that the model yields highly structured multi-modal posterior distributions, enabling effective and efficient uncertainty quantification.
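A conceptual numpy sketch of the two ingredients, under assumed shapes and with toy surrogates for both the target and proposal densities (the paper's registration posterior and proposal are far richer): samples are drawn from a covariance that is a low-rank term plus the inverse of a Cholesky-factored precision, without ever forming the full matrix, then reweighted and resampled.

    import numpy as np

    rng = np.random.default_rng(0)
    d, r, n = 100, 5, 256                        # dimension, low rank, sample count

    mu = np.zeros(d)
    U = 0.1 * rng.standard_normal((d, r))        # low-rank covariance factor
    L = np.eye(d) + np.diag(0.2 * np.ones(d - 1), -1)  # sparse lower-triangular precision factor

    # Sigma = U U^T + (L L^T)^{-1}; sample without forming Sigma explicitly:
    eps1 = rng.standard_normal((n, r))
    eps2 = rng.standard_normal((n, d))
    samples = mu + eps1 @ U.T + np.linalg.solve(L.T, eps2.T).T

    def log_target(x):                           # stand-in for the true posterior
        return -0.5 * np.sum(x ** 2, axis=-1)

    def log_proposal(x):                         # toy surrogate proposal density
        return -0.5 * np.sum(x ** 2, axis=-1) / 1.5

    logw = log_target(samples) - log_proposal(samples)
    w = np.exp(logw - logw.max()); w /= w.sum()  # self-normalized importance weights
    resampled = samples[rng.choice(n, size=n, p=w)]
    print(resampled.shape)  # (256, 100)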
[630] Deep Learning-Based Airway Segmentation in Systemic Lupus Erythematosus Patients with Interstitial Lung Disease (SLE-ILD): A Comparative High-Resolution CT Analysis
Sirong Piao, Ying Ming, Ruijie Zhao, Jiaru Wang, Ran Xiao, Rui Zhao, Zicheng Liao, Qiqi Xu, Shaoze Luo, Bing Li, Lin Li, Zhuangfei Ma, Fuling Zheng, Wei Song
Main category: eess.IV
TL;DR: Deep learning-based airway segmentation on chest CT reveals significant upper lobe airway dilation in SLE patients with interstitial lung disease compared to those without ILD.
Details
Motivation: To characterize lobar and segmental airway volume differences between systemic lupus erythematosus (SLE) patients with interstitial lung disease (ILD) and those without ILD using automated deep learning analysis of chest CT scans.
Method: Retrospective analysis of 106 SLE patients (27 with ILD, 79 without) using a customized U-Net deep learning framework to automatically segment airway structures at lobar and segmental levels from high-resolution CT scans, followed by statistical comparison of volumetric measurements.
Result: Significant airway volume enlargement in SLE-ILD patients was found in right upper lobe (p=0.009) and left upper lobe (p=0.039), with specific segmental differences in R1 (p=0.016), R3 (p<0.001), and L3 (p=0.038), showing upper lung zone predominance.
Conclusion: Automated deep learning can quantify airway volumes and reveal region-specific airway dilation in SLE-ILD, highlighting a distinct topographic phenotype and potential imaging biomarker for early detection and monitoring of ILD in SLE patients.
Abstract: To characterize lobar and segmental airway volume differences between systemic lupus erythematosus (SLE) patients with interstitial lung disease (ILD) and those without ILD (non-ILD) using a deep learning-based approach on non-contrast chest high-resolution CT (HRCT). Methods: A retrospective analysis was conducted on 106 SLE patients (27 SLE-ILD, 79 SLE-non-ILD) who underwent HRCT. A customized deep learning framework based on the U-Net architecture was developed to automatically segment airway structures at the lobar and segmental levels via HRCT. Volumetric measurements of lung lobes and segments derived from the segmentations were statistically compared between the two groups using two-sample t-tests (significance threshold: p < 0.05). Results: At lobar level, significant airway volume enlargement in SLE-ILD patients was observed in the right upper lobe (p=0.009) and left upper lobe (p=0.039) compared to SLE-non-ILD. At the segmental level, significant differences were found in segments including R1 (p=0.016), R3 (p<0.001), and L3 (p=0.038), with the most marked changes in the upper lung zones, while lower zones showed non-significant trends. Conclusion: Our study demonstrates that an automated deep learning-based approach can effectively quantify airway volumes on HRCT scans and reveal significant, region-specific airway dilation in patients with SLE-ILD compared to those without ILD. The pattern of involvement, predominantly affecting the upper lobes and specific segments, highlights a distinct topographic phenotype of SLE-ILD and implicates airway structural alterations as a potential biomarker for disease presence. This AI-powered quantitative imaging biomarker holds promise for enhancing the early detection and monitoring of ILD in the SLE population, ultimately contributing to more personalized patient management.
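Once the U-Net has produced per-lobe airway volumes, the group comparison reduces to a standard two-sample t-test; a minimal scipy sketch with synthetic volumes (the real values come from the segmentations) follows.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    rul_ild = rng.normal(6.0, 1.2, size=27)   # right-upper-lobe airway volume, SLE-ILD (synthetic)
    rul_non = rng.normal(5.2, 1.1, size=79)   # SLE without ILD (synthetic)

    t, p = stats.ttest_ind(rul_ild, rul_non)  # two-sample t-test, threshold p < 0.05
    print(f"t = {t:.2f}, p = {p:.4f}, significant = {p < 0.05}")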
[631] Towards Clinical Practice in CT-Based Pulmonary Disease Screening: An Efficient and Reliable Framework
Qian Shao, Bang Du, Yixuan Wu, Zepeng Li, Qiyuan Chen, Qianqian Tang, Jian Wu, Jintai Chen, Hongxia Xu
Main category: eess.IV
TL;DR: ERF framework improves CT analysis efficiency by selecting optimal slice subsets and quantifying diagnostic uncertainty, achieving 90%+ accuracy with 60% faster processing.
Details
Motivation: Deep learning models for pulmonary disease screening from CT scans have high computational costs from processing entire 3D volumes, creating barriers to clinical adoption. Current sub-sampling techniques often compromise diagnostic integrity by introducing artifacts or discarding critical information.
Method: Proposes ERF framework with two innovations: (1) Cluster-based Sub-Sampling (CSS) that selects compact yet comprehensive CT slice subsets using efficient k-nearest neighbor search with iterative refinement, and (2) Ambiguity-aware Uncertainty Quantification (AUQ) that leverages predictive discrepancy between auxiliary classifiers to construct specialized ambiguity scores for unreliable samples.
Result: Validated on two public datasets with 2,654 CT volumes across 3 pulmonary diseases, ERF achieves diagnostic performance comparable to full-volume analysis (over 90% accuracy and recall) while reducing processing time by more than 60%.
Conclusion: The work represents a significant step towards deploying fast, accurate, and trustworthy AI-powered screening tools in time-sensitive clinical settings by addressing computational efficiency and reliability concerns in CT analysis.
Abstract: Deep learning models for pulmonary disease screening from Computed Tomography (CT) scans promise to alleviate the immense workload on radiologists. Still, their high computational cost, stemming from processing entire 3D volumes, remains a major barrier to widespread clinical adoption. Current sub-sampling techniques often compromise diagnostic integrity by introducing artifacts or discarding critical information. To overcome these limitations, we propose an Efficient and Reliable Framework (ERF) that fundamentally improves the practicality of automated CT analysis. Our framework introduces two core innovations: (1) A Cluster-based Sub-Sampling (CSS) method that efficiently selects a compact yet comprehensive subset of CT slices by optimizing for both representativeness and diversity. By integrating an efficient k-nearest neighbor search with an iterative refinement process, CSS bypasses the computational bottlenecks of previous methods while preserving vital diagnostic features. (2) An Ambiguity-aware Uncertainty Quantification (AUQ) mechanism, which enhances reliability by specifically targeting data ambiguity arising from subtle lesions and artifacts. Unlike standard uncertainty measures, AUQ leverages the predictive discrepancy between auxiliary classifiers to construct a specialized ambiguity score. By maximizing this discrepancy during training, the system effectively flags ambiguous samples where the model lacks confidence due to visual noise or intricate pathologies. Validated on two public datasets with 2,654 CT volumes across diagnostic tasks for 3 pulmonary diseases, ERF achieves diagnostic performance comparable to the full-volume analysis (over 90% accuracy and recall) while reducing processing time by more than 60%. This work represents a significant step towards deploying fast, accurate, and trustworthy AI-powered screening tools in time-sensitive clinical settings.
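Hedged sketches of the two components, using a greedy farthest-point selection and an L1 head-disagreement score as simplified stand-ins for the paper's CSS objective and AUQ criterion; the embeddings and logits below are synthetic.

    import numpy as np

    rng = np.random.default_rng(0)
    emb = rng.standard_normal((300, 64))      # one embedding per CT slice

    def select_slices(emb, k):
        """Greedy farthest-point selection: a compact yet diverse slice subset."""
        chosen = [int(np.argmin(np.linalg.norm(emb - emb.mean(0), axis=1)))]
        dists = np.linalg.norm(emb - emb[chosen[0]], axis=1)
        for _ in range(k - 1):
            nxt = int(np.argmax(dists))       # farthest from the current subset
            chosen.append(nxt)
            dists = np.minimum(dists, np.linalg.norm(emb - emb[nxt], axis=1))
        return chosen

    def ambiguity_score(logits_a, logits_b):
        """Disagreement between auxiliary classifier heads as an ambiguity proxy."""
        pa = np.exp(logits_a - logits_a.max(-1, keepdims=True))
        pa /= pa.sum(-1, keepdims=True)
        pb = np.exp(logits_b - logits_b.max(-1, keepdims=True))
        pb /= pb.sum(-1, keepdims=True)
        return np.abs(pa - pb).sum(-1)        # large score -> flag as unreliable

    print(select_slices(emb, 16)[:5])
    print(ambiguity_score(rng.standard_normal((2, 3)), rng.standard_normal((2, 3))))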
[632] CogGen: Cognitive-Load-Informed Fully Unsupervised Deep Generative Modeling for Compressively Sampled MRI Reconstruction
Qingyong Zhu, Yumin Tan, Xiang Gu, Dong Liang
Main category: eess.IV
TL;DR: CogGen is a cognitive-load-informed fully unsupervised deep generative model for compressive sensing MRI that uses staged inversion with progressive scheduling of task difficulty to improve reconstruction quality and convergence.
Details
Motivation: Classical fully unsupervised deep generative models (FU-DGMs) like DIP and INR for compressive sensing MRI rely on architectural priors but often require many iterations and easily overfit measurement noise due to the ill-conditioned inverse problem.
Method: CogGen casts CS-MRI as staged inversion and regulates “cognitive load” by progressively scheduling intrinsic difficulty and extraneous interference. It replaces uniform data fitting with an easy-to-hard k-space weighting/selection strategy: early iterations emphasize low-frequency, high-SNR, structure-dominant samples, while higher-frequency or noise-dominated measurements are introduced later. This is realized through self-paced curriculum learning with complementary student and teacher modes.
Result: Experiments show that CogGen-DIP and CogGen-INR improve reconstruction fidelity and convergence behavior compared with strong unsupervised baselines and competitive supervised pipelines.
Conclusion: CogGen’s cognitive-load-informed approach effectively addresses limitations of classical FU-DGMs for CS-MRI by managing task difficulty progression, leading to better reconstruction quality and convergence.
Abstract: Fully unsupervised deep generative modeling (FU-DGM) is promising for compressively sampled MRI (CS-MRI) when training data or compute are limited. Classical FU-DGMs such as DIP and INR rely on architectural priors, but the ill-conditioned inverse problem often demands many iterations and easily overfits measurement noise. We propose CogGen, a cognitive-load-informed FU-DGM that casts CS-MRI as staged inversion and regulates task-side “cognitive load” by progressively scheduling intrinsic difficulty and extraneous interference. CogGen replaces uniform data fitting with an easy-to-hard k-space weighting/selection strategy: early iterations emphasize low-frequency, high-SNR, structure-dominant samples, while higher-frequency or noise-dominated measurements are introduced later. We realize this schedule through self-paced curriculum learning (SPCL) with complementary criteria: a student mode that reflects what the model can currently learn and a teacher mode that indicates what it should follow, supporting both soft weighting and hard selection. Experiments and analyses show that CogGen-DIP and CogGen-INR improve reconstruction fidelity and convergence behavior compared with strong unsupervised baselines and competitive supervised pipelines.
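A toy numpy sketch of the easy-to-hard schedule (the SPCL student/teacher criteria are more elaborate; the growing radius threshold and sigmoid softness are illustrative assumptions): low-frequency k-space samples receive full weight early, and coverage expands with the iteration count. In a reconstruction loop these weights would multiply the per-sample data-fidelity term.

    import numpy as np

    def kspace_weights(shape, iteration, total_iters, sharpness=30.0):
        """Soft mask admitting low frequencies first, higher frequencies later."""
        ky, kx = np.meshgrid(np.fft.fftfreq(shape[0]), np.fft.fftfreq(shape[1]),
                             indexing="ij")
        radius = np.sqrt(ky ** 2 + kx ** 2)        # 0 .. ~0.707 cycles/pixel
        admit = 0.707 * iteration / total_iters    # growing radius threshold
        return 1.0 / (1.0 + np.exp(sharpness * (radius - admit)))

    w_early = kspace_weights((256, 256), iteration=100, total_iters=1000)
    w_late = kspace_weights((256, 256), iteration=900, total_iters=1000)
    print(w_early.mean(), w_late.mean())           # coverage grows over iterations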
[633] Halfway to 3D: Ensembling 2.5D and 3D Models for Robust COVID-19 CT Diagnosis
Tuan-Anh Yang, Bao V. Q. Bui, Chanh-Quang Vo-Van, Truong-Son Hy
Main category: eess.IV
TL;DR: A deep learning framework combining 2.5D and 3D representations for COVID-19 detection and disease classification from chest CT scans, using DINOv3 vision transformer for slice-level features and ResNet-18 with VREx pretraining for volumetric context.
Details
Motivation: To develop a robust framework for COVID-19 detection and disease classification from chest CT scans that captures both slice-level details and volumetric context, addressing the need for accurate multi-source medical imaging analysis.
Method: Combines 2.5D branch (DINOv3 vision transformer processing multi-view CT slices) with 3D branch (ResNet-18 pretrained with Variance Risk Extrapolation and supervised contrastive learning). Uses logit-level ensemble inference to combine predictions from both branches.
Result: Achieves 94.48% accuracy and 0.9426 Macro F1-score for binary COVID-19 detection, and 79.35% accuracy with 0.7497 Macro F1-score for multi-class disease classification on PHAROS-AIF-MIH benchmark, outperforming individual models.
Conclusion: Combining pretrained slice-based representations with volumetric modeling is effective for robust multi-source medical imaging analysis, demonstrating the benefit of integrating complementary 2.5D and 3D information.
Abstract: We propose a deep learning framework for COVID-19 detection and disease classification from chest CT scans that integrates both 2.5D and 3D representations to capture complementary slice-level and volumetric information. The 2.5D branch processes multi-view CT slices (axial, coronal, sagittal) using a DINOv3 vision transformer to extract robust visual features, while the 3D branch employs a ResNet-18 architecture to model volumetric context and is pretrained with Variance Risk Extrapolation (VREx) followed by supervised contrastive learning to improve cross-source robustness. Predictions from both branches are combined through logit-level ensemble inference. Experiments on the PHAROS-AIF-MIH benchmark demonstrate the effectiveness of the proposed approach: for binary COVID-19 detection, the ensemble achieves 94.48% accuracy and a 0.9426 Macro F1-score, outperforming both individual models, while for multi-class disease classification the 2.5D DINOv3 model achieves the best performance with 79.35% accuracy and a 0.7497 Macro F1-score. These results highlight the benefit of combining pretrained slice-based representations with volumetric modeling for robust multi-source medical imaging analysis. Code is available at https://github.com/HySonLab/PHAROS-AIF-MIH
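The fusion step itself is one line; a minimal PyTorch sketch with random stand-in logits (equal weighting is an assumption here; the paper specifies logit-level ensembling but not the weights):

    import torch

    logits_25d = torch.randn(8, 2)   # 2.5D DINOv3 branch outputs (batch, classes)
    logits_3d = torch.randn(8, 2)    # 3D ResNet-18 branch outputs

    ensemble_logits = (logits_25d + logits_3d) / 2  # average at the logit level
    pred = ensemble_logits.argmax(dim=1)            # final binary decision
    print(pred)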